Approach 1: Geometric (Log) Return with Transaction Costs

That's an excellent choice. Approach 1—Geometric (Log) Return with Transaction Costs—is the most direct way to translate the casino's compounding, low-variance profit strategy into a trading algorithm.

Here is the detailed construction of the investment strategy and the corresponding reward function, assuming a fixed investment amount and no leverage.

Investment Strategy: Fixed-Fractional Trading

The core strategy is to simulate trading a fixed fraction of the portfolio's cash value on every active (Buy or Sell) signal.

1. The Strategy Mechanics

Cash Allocation: We assume the agent is trading a fixed amount of capital (e.g., 1 unit of currency, representing the entire portfolio).
Trade Sizing (Fixed Fraction): When the agent takes an action (Buy or Sell), it commits a fixed fraction of the current total portfolio value. For simplicity, the reward function assumes a \(100\%\) allocation, so each step's gain or loss compounds on the full portfolio value. Summing log returns captures that compounding directly.
Three Actions:
- Action 1 (Buy): The agent buys the stock. The portfolio is now \(100\%\) invested. The next day, the portfolio value changes by the stock's return.
- Action 2 (Sell): The agent sells the stock (short-selling). The portfolio is now \(100\%\) short the stock. The next day, the portfolio value changes by the negative of the stock's return.
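- Action 0 (Hold): The agent stays \(100\%\) in cash for that step. The portfolio value does not change with the stock's movement, and no transaction cost is charged. Note that the reward function treats Hold as being flat, not as carrying over the previous period's position.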
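
The three actions above can be read as a signed position applied to the stock's return. The following is a minimal sketch under the \(100\%\) allocation assumption; `POSITION` and `step_return` are hypothetical names introduced here for illustration and are not part of the agent's code.

```python
# Hypothetical helper: maps each action code to a signed position.
# 0 = Hold (flat), 1 = Buy (fully long), 2 = Sell (fully short).
POSITION = {0: 0.0, 1: 1.0, 2: -1.0}


def step_return(action: int, stock_return: float) -> float:
    """Portfolio simple return for one step, before transaction costs."""
    return POSITION[action] * stock_return


# Example: the stock rises 2% over the next day.
for action, name in [(0, "Hold"), (1, "Buy"), (2, "Sell")]:
    print(name, step_return(action, 0.02))
```

A long position gains what the stock gains, a short position loses it, and Hold is unaffected.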

2. The Transaction Cost (The House's Rake)

A small percentage penalty is applied on every active trade (Buy or Sell) to simulate brokerage fees, market impact, or the bid-ask spread. This is the fixed negative EV the agent must overcome.

Cost Parameter: We set the cost at \(C = 0.0005\) (or 5 basis points, \(0.05\%\)).
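Application: Applied once per active trade, reducing the net return of that trade. In the reward function below, this means the cost is charged on every step where the action is Buy or Sell, even if the position is the same as on the previous step.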
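
To get a feel for how much this small cost adds up, here is a rough sketch. It assumes one active trade per day for 252 trading days and a stock that does not move at all; both are illustrative assumptions, not part of the strategy description.

```python
import math

COST = 0.0005     # 5 basis points per active trade, as in the text
N_TRADES = 252    # assumption: one active trade per trading day for a year

# Wealth multiplier if every trade earns exactly zero before costs.
wealth_multiplier = (1.0 - COST) ** N_TRADES

# The same drag expressed as a cumulative log return.
total_log_drag = N_TRADES * math.log(1.0 - COST)

print(f"wealth multiplier after costs: {wealth_multiplier:.4f}")
print(f"cumulative log return:         {total_log_drag:.4f}")
```

Under these assumptions the costs alone consume roughly 12% of the starting capital over the year, which is the hurdle an always-trading agent would have to clear.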

Reward Function: Log Return Maximization

The goal of the reward function is to maximize the sum of logarithmic returns, which maximizes the portfolio's geometric mean return.
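
The link between summed log returns and the geometric mean can be checked numerically. This is a small sketch using made-up per-step gross returns; the values in `gross_returns` are hypothetical.

```python
import math

# Hypothetical per-step gross returns (1 + net return) of some strategy.
gross_returns = [1.01, 0.98, 1.03, 0.995]

sum_of_logs = sum(math.log(g) for g in gross_returns)
final_wealth_ratio = math.prod(gross_returns)

# Summing log returns equals taking the log of the compounded wealth ratio.
assert math.isclose(sum_of_logs, math.log(final_wealth_ratio))

# The geometric mean gross return is exp(mean log return).
geometric_mean = math.exp(sum_of_logs / len(gross_returns))
print(final_wealth_ratio, geometric_mean)
```

Because the exponential is monotonic, a strategy with a higher sum of log returns over the same number of steps also has a higher final wealth and a higher geometric mean return.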

3. The New `get_reward` Function (Conceptual Code)

import numpy as np  # needed for np.log below

def get_reward_log_return(self, weights):
    self.model.set_weights(weights)

    # Initialize total log return
    total_log_return = 0.0

    # Define Transaction Cost (C): 0.05%
    TRANSACTION_COST = 0.0005 

    # Loop through the training data, up to the second-to-last day (t+1 needed)
    for t in range(self.train_start, self.train_end - 1):

        # 1. Get the Agent's Prediction
        state = self.get_state(t)
        action = self.act(state) # 0: Hold, 1: Buy, 2: Sell

        # 2. Calculate the Stock Return for t+1
        current_price = self.trend[t]
        next_price = self.trend[t + 1]

        # Simple Return: R_t+1 = (P_t+1 / P_t) - 1
        simple_return = (next_price - current_price) / current_price

        # 3. Calculate Portfolio Return (G_t+1) based on Action

        gross_return = 1.0 # 1.0 means no change (Hold)

        if action == 1: # Buy (Go Long)
            # Gross Return = (1 + Simple Return) - Transaction Cost
            gross_return = (1.0 + simple_return) - TRANSACTION_COST

        elif action == 2: # Sell (Go Short)
            # The return of a short position is the negative of the simple return
            # Gross Return = (1 - Simple Return) - Transaction Cost
            gross_return = (1.0 - simple_return) - TRANSACTION_COST

        elif action == 0: # Hold (Cash/No Trade)
            # Net Gross Return is 1.0 (no change in value)
            gross_return = 1.0 

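        # 4. Update the Total Log Return (The Core Metric)
        # Log Return = log(Gross Return)
        # Guard: a 100% short is wiped out if the price doubles or more, which
        # makes gross_return <= 0 and the log undefined. Clamping to a tiny
        # positive value (an assumed choice) gives a large finite penalty instead.
        log_return = np.log(max(gross_return, 1e-12))
        total_log_return += log_return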

    return total_log_return # The higher the cumulative log return, the better
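
To try the reward logic outside the agent class, the same calculation can be written as a standalone function. This is a sketch, not the agent's actual method: `log_return_reward`, the toy `prices`, and the hand-picked action lists are all hypothetical, and the actions are supplied directly instead of coming from `self.act(state)`.

```python
import math


def log_return_reward(prices, actions, cost=0.0005):
    """Cumulative log return of acting at prices[t] and settling at prices[t + 1].

    actions[t] is 0 (Hold), 1 (Buy) or 2 (Sell); len(actions) == len(prices) - 1.
    """
    total = 0.0
    for t, action in enumerate(actions):
        simple_return = prices[t + 1] / prices[t] - 1.0
        if action == 1:      # long: earn the stock return, pay the cost
            gross = 1.0 + simple_return - cost
        elif action == 2:    # short: earn the negative stock return, pay the cost
            gross = 1.0 - simple_return - cost
        else:                # hold: flat, no cost
            gross = 1.0
        if gross <= 0.0:
            raise ValueError("portfolio wiped out; log return is undefined")
        total += math.log(gross)
    return total


prices = [100.0, 101.0, 100.0, 100.0]          # hypothetical toy prices
good = log_return_reward(prices, [1, 2, 0])    # long the rise, short the fall, then flat
bad = log_return_reward(prices, [2, 1, 1])     # wrong twice, then pays the cost for nothing
print(good, bad)
```

The first action list ends with a positive cumulative log return and the second with a negative one, which is the ordering the optimizer uses to rank candidate weights.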

4. Detailed Explanation of the Reward

| Component | Purpose | Casino Analogy |
| --- | --- | --- |
| Logarithmic Return (\(\ln(1+R)\)) | This is the mathematical basis for maximizing geometric compounding (long-term wealth). Because the logarithm is concave, a loss is penalized more than an equally sized gain is rewarded: \(\ln(0.9) \approx -0.105\) while \(\ln(1.1) \approx 0.095\). This pushes the agent toward keeping losses small instead of chasing occasional large wins. | The house takes small, repeated wins. A single large loss can wipe out a long streak of small wins, and the log function penalizes it heavily. |
| Transaction Cost | A fixed, unavoidable drag on every active trade. | The House Edge (Rake). The agent must find an actual edge larger than the \(0.05\%\) cost; otherwise the ES is rewarded for choosing Hold (Action 0). |
| Action 0 (Hold) | The reward is \(\ln(1.0) = 0\). | Do Nothing. This sets the baseline return for the agent. If the agent cannot find an EV greater than the \(0.05\%\) cost, the safest and highest-EV choice is Hold, mirroring the casino's strategy of only taking bets when the odds favor the house. |
| Action 1 (Buy) / Action 2 (Sell) | The reward is the log of the position's gross return after subtracting the transaction cost. | Taking the Bet. The agent must predict an edge large enough that the gross return exceeds 1: \(1 + R - C > 1\) for a Buy, \(1 - R - C > 1\) for a Sell. If the move in the agent's favor is smaller than the cost (\(R < C\) for a Buy, \(-R < C\) for a Sell), the log return is negative and the agent is penalized, teaching it to be selective. |
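
Two of the claims in the table can be checked with a few lines of arithmetic. This is an illustrative sketch; the \(\pm 10\%\) moves and the sample returns in the loop are made-up values.

```python
import math

COST = 0.0005  # 5 basis points, as in the text

# Equal-sized gain and loss: the loss weighs more in log terms.
gain = math.log(1.10)   # +10%
loss = math.log(0.90)   # -10%
print(gain, loss, gain + loss)   # the sum is negative

# A Buy only beats Hold (log return 0) when the stock return exceeds the cost.
for r in (0.0003, 0.0010):
    print(r, math.log(1.0 + r - COST))
```

A 3 basis point rise does not cover the 5 basis point cost, so that Buy scores below Hold, while a 10 basis point rise scores above it.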

This reward function shifts the agent's goal from predictive accuracy to profit-maximizing trading, so that its strategy is rewarded for consistently exploiting a small, positive expected value net of costs.