Approach 1: Geometric (Log) Return with Transaction Costs
That's an excellent choice. Approach 1—Geometric (Log) Return with Transaction Costs—is the most direct way to translate the casino's compounding, low-variance profit strategy into a trading algorithm.
Here is the detailed construction of the investment strategy and the corresponding reward function, assuming a fixed investment amount and no leverage.
Investment Strategy: Fixed-Fractional Trading
The core strategy is to simulate trading a fixed fraction of the portfolio's cash value on every active (Buy or Sell) signal.
1. The Strategy Mechanics
- Cash Allocation: We assume the agent is trading a fixed amount of capital (e.g., 1 unit of currency, representing the entire portfolio).
- Trade Sizing (Fixed Fraction): When the agent takes an action (Buy or Sell), it commits a fixed fraction of the current total portfolio value. For simplicity, the reward function assumes a \(100\%\) allocation; the log return still accounts for compounding and penalizes large losses more heavily than equal-sized gains.
- Three Actions:
- Action 1 (Buy): The agent buys the stock. The portfolio is now \(100\%\) invested. The next day, the portfolio value changes by the stock's return.
- Action 2 (Sell): The agent sells the stock short. The portfolio is now \(100\%\) short the stock. The next day, the portfolio value changes by the negative of the stock's return.
- Action 0 (Hold): The agent stays \(100\%\) in cash for the step. The reward function treats Hold as flat rather than as carrying over the previous position, so the portfolio value is unaffected by the stock's movement.
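
The three actions above can be sketched as a mapping from action code to a signed position. This is a minimal illustration before any costs are applied; the names `ACTION_TO_POSITION` and `portfolio_return` are hypothetical and do not appear in the original agent.

```python
# Hypothetical helper names; the action codes follow the text (0: Hold, 1: Buy, 2: Sell).
ACTION_TO_POSITION = {0: 0.0, 1: 1.0, 2: -1.0}  # flat, 100% long, 100% short


def portfolio_return(action: int, stock_return: float) -> float:
    """Next-day portfolio return before costs for a flat, long, or short position."""
    return ACTION_TO_POSITION[action] * stock_return


# Same 2% up-move in the stock, seen from each of the three positions.
for action, name in [(1, "Buy"), (2, "Sell"), (0, "Hold")]:
    print(name, portfolio_return(action, 0.02))
```

A long position earns the stock's return, a short position earns its negative, and a flat position earns nothing.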
2. The Transaction Cost (The House's Rake)
A small percentage penalty is applied on every active trade (Buy or Sell) to simulate brokerage fees, market impact, or the bid-ask spread. This is the fixed negative expected value (EV) the agent must overcome.
- Cost Parameter: We set the cost at \(C = 0.0005\) (or 5 basis points, \(0.05\%\)).
- Application: Charged once on every step where the action is Buy or Sell (including consecutive steps with the same action), reducing the net return of that step.
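
To get a feel for how a small per-trade cost compounds, the sketch below computes the cumulative drag for an agent that trades on every step but has no edge at all (each trade earns a 0% return). The 252-step horizon is an assumption standing in for roughly one year of daily trading, and `growth_with_zero_edge` is a hypothetical helper.

```python
TRANSACTION_COST = 0.0005  # C = 5 basis points, as in the text
TRADING_DAYS = 252         # assumption: about one year of daily steps


def growth_with_zero_edge(cost: float, n_trades: int) -> float:
    """Cumulative growth factor after n_trades active trades that each earn a 0% return."""
    return (1.0 - cost) ** n_trades


drag = 1.0 - growth_with_zero_edge(TRANSACTION_COST, TRADING_DAYS)
print(f"Value lost to costs after {TRADING_DAYS} zero-edge trades: {drag:.1%}")
```

Under these assumptions the drag is roughly 12% of the portfolio, which is why an agent with no real edge should prefer to stay out of the market.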
Reward Function: Log Return Maximization
The goal of the reward function is to maximize the sum of logarithmic returns, which maximizes the portfolio's geometric mean return.
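
The link between the sum of log returns and the geometric mean can be written out in a few lines. The symbols \(G_t\) (per-step growth factor), \(V_0\) and \(V_N\) (initial and final portfolio value), and \(N\) (number of steps) are notation introduced here for illustration.

\[
V_N = V_0 \prod_{t=1}^{N} G_t
\quad\Longrightarrow\quad
\sum_{t=1}^{N} \ln G_t = \ln\frac{V_N}{V_0},
\qquad
\bar{G} = \Bigl(\prod_{t=1}^{N} G_t\Bigr)^{1/N} = \exp\Bigl(\frac{1}{N}\sum_{t=1}^{N} \ln G_t\Bigr).
\]

Because \(\exp\) is increasing and \(N\) is fixed, maximizing the sum of log returns is equivalent to maximizing both the geometric mean growth factor \(\bar{G}\) and the final value \(V_N\).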
3. The New get_reward Function (Conceptual Code)
```python
import numpy as np  # module-level import needed for np.log


def get_reward_log_return(self, weights):
    """Cumulative log return over the training window (a method of the agent class)."""
    self.model.set_weights(weights)
    total_log_return = 0.0
    TRANSACTION_COST = 0.0005  # cost C per active step: 0.05%

    # Stop one step before train_end because each step needs the price at t + 1.
    for t in range(self.train_start, self.train_end - 1):
        # 1. Get the agent's action for the current state.
        state = self.get_state(t)
        action = self.act(state)  # 0: Hold, 1: Buy, 2: Sell

        # 2. Simple return of the stock from t to t + 1: R = P_{t+1} / P_t - 1.
        current_price = self.trend[t]
        next_price = self.trend[t + 1]
        simple_return = (next_price - current_price) / current_price

        # 3. Net growth factor of the portfolio for this step.
        if action == 1:    # Buy (go long): 1 + R - C
            gross_return = (1.0 + simple_return) - TRANSACTION_COST
        elif action == 2:  # Sell (go short): 1 - R - C
            gross_return = (1.0 - simple_return) - TRANSACTION_COST
        else:              # Hold (flat): value unchanged
            gross_return = 1.0

        # Guard: a short position is wiped out if the price roughly doubles, and
        # log() is undefined for non-positive values, so floor the growth factor.
        gross_return = max(gross_return, 1e-8)

        # 4. Accumulate the log return, log(growth factor).
        total_log_return += np.log(gross_return)

    # The higher the cumulative log return, the better.
    return total_log_return
```
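
The method above depends on the agent's model and state, so it cannot be run in isolation. As a rough way to sanity-check the reward logic, here is a standalone sketch that applies the same per-step rule to a fixed list of actions. The function `total_log_return`, the toy price series, and the action sequences are all made up for illustration, and `math.log` stands in for `np.log` to avoid the NumPy dependency.

```python
import math


def total_log_return(prices, actions, cost=0.0005):
    """Sum of per-step log growth factors for a fixed action sequence.

    prices  : sequence of positive prices, length N
    actions : sequence of N - 1 action codes (0: Hold, 1: Buy, 2: Sell)
    """
    total = 0.0
    for t, action in enumerate(actions):
        simple_return = prices[t + 1] / prices[t] - 1.0
        if action == 1:      # long for one step
            growth = 1.0 + simple_return - cost
        elif action == 2:    # short for one step
            growth = 1.0 - simple_return - cost
        else:                # hold: stay flat
            growth = 1.0
        total += math.log(growth)
    return total


prices = [100.0, 101.0, 99.0, 99.5]  # made-up toy series: up, down, up
always_hold = [0, 0, 0]
well_timed = [1, 2, 1]               # long, short, long
badly_timed = [2, 1, 2]              # the opposite of every move

print("hold :", total_log_return(prices, always_hold))
print("good :", total_log_return(prices, well_timed))
print("bad  :", total_log_return(prices, badly_timed))
```

Always holding scores exactly zero, the well-timed sequence scores above zero, and the badly timed one scores below zero, which is the ordering the reward is meant to produce.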
4. Detailed Explanation of the Reward
| Component | Purpose | Casino Analogy |
|---|---|---|
| Logarithmic Return (\(\ln(1+R)\)) | The mathematical basis for maximizing geometric compounding (long-term wealth). The log is concave, so it penalizes a loss more than it rewards a gain of the same size, which pushes the agent toward keeping losses small. | The house takes small, repeated wins. A single large loss can wipe out a long streak of small wins, and the log function penalizes it heavily. |
| Transaction Cost | A fixed, unavoidable drag on every active trade. | The House Edge (Rake). The agent must find an actual edge larger than the \(0.05\%\) cost; otherwise the evolution strategy (ES) is rewarded for choosing Hold (Action 0). |
| Action 0 (Hold) | The reward is \(\ln(1.0) = 0\). | Do Nothing. This sets the baseline return for the agent. If the agent cannot find an EV greater than \(0.05\%\) (cost), the safest and highest EV choice is Hold, mirroring the casino's strategy of only taking bets when the odds favor the house. |
| Action 1 (Buy) / Action 2 (Sell) | The reward is the log of the position's return net of the transaction cost. | Taking the Bet. The position must earn more than the cost: \(1 + R - C > 1\) for a Buy (that is, \(R > C\)) and \(1 - R - C > 1\) for a Sell (that is, \(R < -C\)). If the position's return falls short of the cost, the log return is negative and the agent is penalized, teaching it to be selective. |
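
Two claims in the table can be checked numerically: the log penalizes a loss more than it rewards an equal-sized gain, and a Buy only pays off when the stock's return exceeds the cost. The snippet below is a small sketch of both; the 10% moves and the two sample returns are arbitrary values chosen for illustration.

```python
import math

COST = 0.0005  # C = 0.05%, as in the text

# 1. Asymmetry: a 10% gain followed by a 10% loss leaves a negative total log return.
gain = math.log(1.10)
loss = math.log(0.90)
print("gain:", gain, "loss:", loss, "sum:", gain + loss)

# 2. Break-even for a Buy: reward is log(1 + R - C), negative below C and positive above it.
below_cost = math.log(1.0 + 0.0002 - COST)  # R = 0.02% < C
above_cost = math.log(1.0 + 0.0010 - COST)  # R = 0.10% > C
print("R < C:", below_cost, "R > C:", above_cost)
```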
This reward function shifts the agent's goal from predictive accuracy to profit-maximizing trading, aligning its strategy with the principle of consistently exploiting a small, positive expected value.