1. Introduction: Why RL for Trading?
Reinforcement Learning (RL) offers a fundamentally different approach to algorithmic trading compared to traditional supervised learning. Instead of predicting price movements directly, RL agents learn optimal trading policies through trial-and-error interaction with the market environment.
In this deep dive, we analyze a complete forex trading implementation using Proximal Policy Optimization (PPO) — one of the most robust and widely-used RL algorithms. We'll explore:
- Custom trading environment design
- Action space with discrete SL/TP combinations
- Technical indicator feature engineering
- Training and testing pipelines
2. The PPO Algorithm
PPO, introduced by OpenAI in 2017, has become the go-to algorithm for many RL applications due to its stability and sample efficiency. It is an Actor-Critic method with the following key characteristics:
PPO Key Characteristics
- Clipped Objective: Prevents destructively large policy updates (see the formula below)
- On-Policy: Uses current policy to collect experiences
- Actor-Critic: Learns both policy (actor) and value function (critic)
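For reference, the clipped surrogate objective from the original 2017 paper, which the "clipped objective" bullet above refers to:

L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

Here \hat{A}_t is the advantage estimate and \epsilon (typically 0.2, also the Stable-Baselines3 default) bounds how far a single update can move the policy ratio.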
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
# Define RL model (PPO)
model = PPO(
policy="MlpPolicy", # Multi-layer perceptron
env=vec_env,
verbose=1,
tensorboard_log="./tensorboard_log/"
)
# Train the model
model.learn(total_timesteps=10100)
model.save("model_eurusd")

We use Stable-Baselines3's implementation with an MlpPolicy (a neural network with two hidden layers of 64 units each by default). TensorBoard logging enables visualization of training metrics such as policy loss and value loss.
3. Trading Environment Design
The heart of any RL trading system is the environment. Our ForexTradingEnv extends OpenAI Gym and defines how the agent interacts with market data.
Observation Space
A sliding window of the last 30 bars, each containing the following features (sketched in code after the list):
- Open, High, Low, Close
- Volume
- RSI (14)
- SMA (20, 50)
- ATR (14)
- MA Slope
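A minimal sketch of how this space could be declared, assuming the ten features listed above are stacked into a 30 × 10 window (the exact shape and bounds depend on the env implementation):

import numpy as np
from gym import spaces

# 30-bar window, 10 features per bar (OHLC, Volume, RSI, SMA-20, SMA-50, ATR, MA slope)
window_size, n_features = 30, 10
observation_space = spaces.Box(
    low=-np.inf, high=np.inf,
    shape=(window_size, n_features),
    dtype=np.float32,
)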
Action Space
19 discrete actions combining:
- Action 0: No Trade
- Direction: Long or Short
- Stop Loss: 30, 60, or 80 pips
- Take Profit: 30, 60, or 80 pips
# Action 0 => No Trade
# Then for direction in [0=short, 1=long] and each sl, tp
self.action_map = [(None, None, None)]  # no trade
for direction in [0, 1]:  # 0=short, 1=long
    for sl in self.sl_options:  # [30, 60, 80]
        for tp in self.tp_options:  # [30, 60, 80]
            self.action_map.append((direction, sl, tp))
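# Total actions: 1 (no trade) + 2 directions × 3 SLs × 3 TPs = 19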
self.action_space = spaces.Discrete(len(self.action_map))

4. Reward Function Design
The reward function is critical for shaping agent behavior. This implementation uses a single-bar P&L approach — calculating profit/loss based on the next bar's price action relative to the chosen SL/TP levels.
Reward Logic
1. Entry at the current bar's close price
2. Check the next bar's High/Low for SL/TP triggers
3. If both SL and TP are touched → assume a loss (conservative)
4. If neither is triggered → use close-to-close P&L
# Convert pips to price distance
pip_value = 0.0001
sl_price_distance = sl * pip_value  # e.g., 60 pips = 0.006
tp_price_distance = tp * pip_value

# For LONG trades
if direction == 1:
    stop_loss = entry_price - sl_price_distance
    take_profit = entry_price + tp_price_distance
    if next_low <= stop_loss:
        pnl = -sl_price_distance  # Hit stop loss
    elif next_high >= take_profit:
        pnl = tp_price_distance  # Hit take profit
    else:
        pnl = next_close - entry_price  # Partial move
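# For SHORT trades the logic mirrors the long case. Only the long branch
# appears in the original source, so this mirror is a sketch (an assumption):
elif direction == 0:
    stop_loss = entry_price + sl_price_distance
    take_profit = entry_price - tp_price_distance
    if next_high >= stop_loss:
        pnl = -sl_price_distance  # Hit stop loss
    elif next_low <= take_profit:
        pnl = tp_price_distance  # Hit take profit
    else:
        pnl = entry_price - next_close  # Partial move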
# Reward in pips (multiply by 10,000)
reward = pnl * 10000

5. Feature Engineering: Technical Indicators
Raw price data alone is often insufficient for RL models. We enrich the observation space with technical indicators using the pandas-ta library:
| Indicator | Parameters | Purpose |
|---|---|---|
| RSI | length=14 | Momentum oscillator (0-100) |
| SMA (20) | length=20 | Short-term trend |
| SMA (50) | length=50 | Medium-term trend |
| ATR | length=14 | Volatility measure |
| MA Slope | diff() | Trend direction/strength |
import pandas as pd
import pandas_ta as ta

def load_and_preprocess_data(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path, parse_dates=True, index_col='Gmt time')
    df.sort_index(inplace=True)

    # Technical indicators
    df['rsi_14'] = ta.rsi(df['Close'], length=14)
    df['ma_20'] = ta.sma(df['Close'], length=20)
    df['ma_50'] = ta.sma(df['Close'], length=50)
    df['atr'] = ta.atr(df['High'], df['Low'], df['Close'], length=14)
    df['ma_20_slope'] = df['ma_20'].diff()

    df.dropna(inplace=True)
    return df

6. Training Pipeline
The training script orchestrates the entire learning process:
Training Workflow
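Assembled from the snippets in this article, the script looks roughly like this (a sketch; the training CSV path and the ForexTradingEnv constructor arguments are assumptions):

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# 1. Load and enrich the training data (hypothetical path)
train_df = load_and_preprocess_data("data/train_EURUSD.csv")

# 2. Build the custom Gym environment and wrap it for Stable-Baselines3
env = ForexTradingEnv(train_df)  # constructor arguments are an assumption
vec_env = DummyVecEnv([lambda: env])

# 3. Train and save the agent
model = PPO(policy="MlpPolicy", env=vec_env, verbose=1,
            tensorboard_log="./tensorboard_log/")
model.learn(total_timesteps=10100)
model.save("model_eurusd")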
⚠️ Important Note
10,100 timesteps is relatively small for RL training. For production, consider 500K-1M+ timesteps and implement early stopping based on validation performance.
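Stable-Baselines3 ships callbacks that support exactly this pattern; below is a minimal sketch, assuming a separate validation environment vec_eval_env built the same way as the training one:

from stable_baselines3.common.callbacks import (
    EvalCallback,
    StopTrainingOnNoModelImprovement,
)

# Stop if the best mean evaluation reward has not improved for 10 evaluations
stop_callback = StopTrainingOnNoModelImprovement(
    max_no_improvement_evals=10, min_evals=20, verbose=1
)
eval_callback = EvalCallback(
    vec_eval_env,                      # validation environment (assumed)
    eval_freq=10_000,                  # evaluate every 10k training steps
    callback_after_eval=stop_callback,
    best_model_save_path="./best_model/",
    verbose=1,
)

model.learn(total_timesteps=1_000_000, callback=eval_callback)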
7. Testing & Evaluation
The test script evaluates the trained model on unseen data — a critical step for validating generalization:
# Load DIFFERENT test data (out-of-sample)
test_df = load_and_preprocess_data(
    "data/test_EURUSD_Candlestick_1_Hour_BID_20.02.2023-22.02.2025.csv"
)

# Load trained model
model = PPO.load("model_eurusd", env=vec_test_env)

# Run deterministic evaluation
obs = vec_test_env.reset()
done = False
trade_history, equity_curve = [], []
while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = vec_test_env.step(action)
    done = dones[0]  # single environment inside the DummyVecEnv
    # Log trades and equity
    trade_info = vec_test_env.get_attr("last_trade_info")[0]
    trade_history.append(trade_info)
    equity_curve.append(vec_test_env.get_attr("equity")[0])

# Save results
trades_df = pd.DataFrame(trade_history)
trades_df.to_csv("trade_history_output.csv")

Key outputs include:
- Equity Curve: Visual representation of cumulative P&L
- Trade History CSV: Entry/exit prices and per-trade P&L
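For the equity curve, a minimal plotting sketch (assuming matplotlib; equity_curve is the list collected in the evaluation loop above):

import matplotlib.pyplot as plt

# Cumulative equity over the out-of-sample period
plt.figure(figsize=(10, 4))
plt.plot(equity_curve)
plt.title("Equity Curve (EUR/USD, out-of-sample)")
plt.xlabel("Bar")
plt.ylabel("Equity")
plt.tight_layout()
plt.savefig("equity_curve.png")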
8. Key Insights & Takeaways
✓ Strengths
- Discrete SL/TP simplifies action space
- Includes "No Trade" option (capital preservation)
- Window-based observations capture temporal patterns
- PPO provides stable training
✗ Limitations
- Single-bar resolution may miss intra-bar events
- No transaction costs or slippage (max_slippage=0)
- Limited training timesteps
- No position sizing (fixed lot)
🚀 Future Improvements
- Multi-bar trade holding periods
- Dynamic position sizing based on volatility
- Include spread and commission in reward (see the sketch after this list)
- Ensemble with multiple RL algorithms (SAC, A2C)
- Add more features (orderflow, sentiment)
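As an example of the spread/commission item above, a minimal adjustment inside the reward calculation (the cost figures are illustrative assumptions, not values from the original code):

# Illustrative cost model: charge spread + commission on every opened trade
spread_pips = 1.0        # assumed typical EUR/USD spread, in pips
commission_pips = 0.2    # assumed round-trip commission, in pips
cost = (spread_pips + commission_pips) * pip_value
if direction is not None:  # action_map uses None for "No Trade"
    pnl -= cost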
9. Conclusion
This implementation demonstrates a complete RL trading system using PPO for forex markets. While there's room for improvement in terms of realism (slippage, costs, position sizing), the core architecture provides a solid foundation for experimentation.
The key takeaway: RL for trading works best when you carefully design the action space, reward function, and observation features to match real trading constraints. The discrete SL/TP approach used here is particularly elegant because it forces the agent to commit to a risk-reward ratio at trade entry.
Tech Stack Used
- Python
- Stable-Baselines3 (PPO)
- OpenAI Gym
- pandas / pandas-ta
- TensorBoard
📚 Credits & Resources
The implementation analyzed in this article is inspired by the excellent tutorials from:
Check out their channel for more in-depth tutorials on reinforcement learning for trading.