1. Introduction: Why RL for Trading?
Reinforcement Learning (RL) offers a fundamentally different approach to algorithmic trading compared to traditional supervised learning. Instead of predicting price movements directly, RL agents learn optimal trading policies through trial-and-error interaction with the market environment.
In this deep dive, we analyze a complete forex trading implementation using Proximal Policy Optimization (PPO) — one of the most robust and widely-used RL algorithms. We'll explore:
- Custom trading environment design
- Action space with discrete SL/TP combinations
- Technical indicator feature engineering
- Training and testing pipelines
2. The PPO Algorithm
PPO, introduced by OpenAI in 2017, has become the go-to algorithm for many RL applications due to its stability and sample efficiency. It is an Actor-Critic method with the following key characteristics:
PPO Key Characteristics
- Clipped Objective: Prevents destructively large policy updates (see the formula below)
- On-Policy: Uses current policy to collect experiences
- Actor-Critic: Learns both policy (actor) and value function (critic)
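For reference, the clipped surrogate objective from the original 2017 paper, which the "clipped objective" bullet above refers to:

L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

Here \hat{A}_t is the advantage estimate and \epsilon (typically 0.2, also the Stable-Baselines3 default) bounds how far a single update can move the policy ratio.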
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
# Define RL model (PPO)
model = PPO(
policy="MlpPolicy", # Multi-layer perceptron
env=vec_env,
verbose=1,
tensorboard_log="./tensorboard_log/"
)
# Train the model
model.learn(total_timesteps=10100)
model.save("model_eurusd")

We use Stable-Baselines3's implementation with an MlpPolicy (a neural network with two hidden layers of 64 units each by default). TensorBoard logging enables visualization of training metrics such as policy loss and value loss.
3. Trading Environment Design
The heart of any RL trading system is the environment. Our ForexTradingEnv extends OpenAI Gym and defines how the agent interacts with market data.
Observation Space
A sliding window of the last 30 bars, each containing the following features (sketched in code after the list):
- Open, High, Low, Close
- Volume
- RSI (14)
- SMA (20, 50)
- ATR (14)
- MA Slope
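A minimal sketch of how this space could be declared, assuming the ten features listed above are stacked into a 30 × 10 window (the exact shape and bounds depend on the env implementation):

import numpy as np
from gym import spaces

# 30-bar window, 10 features per bar (OHLC, Volume, RSI, SMA-20, SMA-50, ATR, MA slope)
window_size, n_features = 30, 10
observation_space = spaces.Box(
    low=-np.inf, high=np.inf,
    shape=(window_size, n_features),
    dtype=np.float32,
)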
Action Space
19 discrete actions combining:
- Action 0: No Trade
- Direction: Long or Short
- Stop Loss: 30, 60, or 80 pips
- Take Profit: 30, 60, or 80 pips
# Action 0 => No Trade
# Then for direction in [0=short, 1=long] and each sl, tp
self.action_map = [(None, None, None)]  # no trade
for direction in [0, 1]:  # 0=short, 1=long
    for sl in self.sl_options:  # [30, 60, 80]
        for tp in self.tp_options:  # [30, 60, 80]
            self.action_map.append((direction, sl, tp))
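# Total actions: 1 (no trade) + 2 directions × 3 SLs × 3 TPs = 19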
self.action_space = spaces.Discrete(len(self.action_map))

4. Reward Function Design
The reward function is critical for shaping agent behavior. This implementation uses a single-bar P&L approach — calculating profit/loss based on the next bar's price action relative to the chosen SL/TP levels.
Reward Logic
1. Entry at the current bar's close price
2. Check the next bar's High/Low for SL/TP triggers
3. If both SL and TP are touched → assume a loss (conservative)
4. If neither is triggered → use close-to-close P&L
# Convert pips to price distance
pip_value = 0.0001
sl_price_distance = sl * pip_value  # e.g., 60 pips = 0.006
tp_price_distance = tp * pip_value

# For LONG trades
if direction == 1:
    stop_loss = entry_price - sl_price_distance
    take_profit = entry_price + tp_price_distance
    if next_low <= stop_loss:
        pnl = -sl_price_distance  # Hit stop loss
    elif next_high >= take_profit:
        pnl = tp_price_distance  # Hit take profit
    else:
        pnl = next_close - entry_price  # Partial move
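# For SHORT trades the logic mirrors the long case. Only the long branch
# appears in the original source, so this mirror is a sketch (an assumption):
elif direction == 0:
    stop_loss = entry_price + sl_price_distance
    take_profit = entry_price - tp_price_distance
    if next_high >= stop_loss:
        pnl = -sl_price_distance  # Hit stop loss
    elif next_low <= take_profit:
        pnl = tp_price_distance  # Hit take profit
    else:
        pnl = entry_price - next_close  # Partial move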
# Reward in pips (multiply by 10,000)
reward = pnl * 10000

5. Feature Engineering: Technical Indicators
Raw price data alone is often insufficient for RL models. We enrich the observation space with technical indicators using the pandas-ta library:
| Indicator | Parameters | Purpose |
|---|---|---|
| RSI | length=14 | Momentum oscillator (0-100) |
| SMA (20) | length=20 | Short-term trend |
| SMA (50) | length=50 | Medium-term trend |
| ATR | length=14 | Volatility measure |
| MA Slope | diff() | Trend direction/strength |
import pandas as pd
import pandas_ta as ta

def load_and_preprocess_data(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path, parse_dates=True, index_col='Gmt time')
    df.sort_index(inplace=True)

    # Technical indicators
    df['rsi_14'] = ta.rsi(df['Close'], length=14)
    df['ma_20'] = ta.sma(df['Close'], length=20)
    df['ma_50'] = ta.sma(df['Close'], length=50)
    df['atr'] = ta.atr(df['High'], df['Low'], df['Close'], length=14)
    df['ma_20_slope'] = df['ma_20'].diff()

    df.dropna(inplace=True)
    return df

6. Training Pipeline
The training script orchestrates the entire learning process:
Training Workflow
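Assembled from the snippets in this article, the script looks roughly like this (a sketch; the training CSV path and the ForexTradingEnv constructor arguments are assumptions):

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# 1. Load and enrich the training data (hypothetical path)
train_df = load_and_preprocess_data("data/train_EURUSD.csv")

# 2. Build the custom Gym environment and wrap it for Stable-Baselines3
env = ForexTradingEnv(train_df)  # constructor arguments are an assumption
vec_env = DummyVecEnv([lambda: env])

# 3. Train and save the agent
model = PPO(policy="MlpPolicy", env=vec_env, verbose=1,
            tensorboard_log="./tensorboard_log/")
model.learn(total_timesteps=10100)
model.save("model_eurusd")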
⚠️ Important Note
10,100 timesteps is relatively small for RL training. For production, consider 500K-1M+ timesteps and implement early stopping based on validation performance.
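Stable-Baselines3 ships callbacks that support exactly this pattern; below is a minimal sketch, assuming a separate validation environment vec_eval_env built the same way as the training one:

from stable_baselines3.common.callbacks import (
    EvalCallback,
    StopTrainingOnNoModelImprovement,
)

# Stop if the best mean evaluation reward has not improved for 10 evaluations
stop_callback = StopTrainingOnNoModelImprovement(
    max_no_improvement_evals=10, min_evals=20, verbose=1
)
eval_callback = EvalCallback(
    vec_eval_env,                      # validation environment (assumed)
    eval_freq=10_000,                  # evaluate every 10k training steps
    callback_after_eval=stop_callback,
    best_model_save_path="./best_model/",
    verbose=1,
)

model.learn(total_timesteps=1_000_000, callback=eval_callback)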
7. Testing & Evaluation
The test script evaluates the trained model on unseen data — a critical step for validating generalization:
# Load DIFFERENT test data (out-of-sample)
test_df = load_and_preprocess_data(
    "data/test_EURUSD_Candlestick_1_Hour_BID_20.02.2023-22.02.2025.csv"
)

# Load trained model
model = PPO.load("model_eurusd", env=vec_test_env)

# Run deterministic evaluation
obs = vec_test_env.reset()
done = False
trade_history, equity_curve = [], []
while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = vec_test_env.step(action)
    done = dones[0]  # single environment inside the DummyVecEnv
    # Log trades and equity
    trade_info = vec_test_env.get_attr("last_trade_info")[0]
    trade_history.append(trade_info)
    equity_curve.append(vec_test_env.get_attr("equity")[0])

# Save results
trades_df = pd.DataFrame(trade_history)
trades_df.to_csv("trade_history_output.csv")

Key outputs include:
- Equity Curve: Visual representation of cumulative P&L
- Trade History CSV: Entry/exit prices and per-trade P&L
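For the equity curve, a minimal plotting sketch (assuming matplotlib; equity_curve is the list collected in the evaluation loop above):

import matplotlib.pyplot as plt

# Cumulative equity over the out-of-sample period
plt.figure(figsize=(10, 4))
plt.plot(equity_curve)
plt.title("Equity Curve (EUR/USD, out-of-sample)")
plt.xlabel("Bar")
plt.ylabel("Equity")
plt.tight_layout()
plt.savefig("equity_curve.png")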
8. Key Insights & Takeaways
✓ Strengths
- Discrete SL/TP simplifies action space
- Includes "No Trade" option (capital preservation)
- Window-based observations capture temporal patterns
- PPO provides stable training
✗ Limitations
- Single-bar resolution may miss intra-bar events
- No transaction costs or slippage (max_slippage=0)
- Limited training timesteps
- No position sizing (fixed lot)
🚀 Future Improvements
- Multi-bar trade holding periods
- Dynamic position sizing based on volatility
- Include spread and commission in reward (see the sketch after this list)
- Ensemble with multiple RL algorithms (SAC, A2C)
- Add more features (orderflow, sentiment)
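As an example of the spread/commission item above, a minimal adjustment inside the reward calculation (the cost figures are illustrative assumptions, not values from the original code):

# Illustrative cost model: charge spread + commission on every opened trade
spread_pips = 1.0        # assumed typical EUR/USD spread, in pips
commission_pips = 0.2    # assumed round-trip commission, in pips
cost = (spread_pips + commission_pips) * pip_value
if direction is not None:  # action_map uses None for "No Trade"
    pnl -= cost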
9. Conclusion
This implementation demonstrates a complete RL trading system using PPO for forex markets. While there's room for improvement in terms of realism (slippage, costs, position sizing), the core architecture provides a solid foundation for experimentation.
The key takeaway: RL for trading works best when you carefully design the action space, reward function, and observation features to match real trading constraints. The discrete SL/TP approach used here is particularly elegant because it forces the agent to commit to a risk-reward ratio at trade entry.
Tech Stack Used
- Python
- Stable-Baselines3 (PPO)
- OpenAI Gym
- pandas / pandas-ta
- TensorBoard
📚 Credits & Resources
The implementation analyzed in this article is inspired by the excellent tutorials from:
Check out their channel for more in-depth tutorials on reinforcement learning for trading.