5 ways to implement a casino-style strategy as a reward function
A casino's core logic is the consistent application of a small, positive Expected Value (EV) on every single trial. The house wins not by hitting a jackpot, but by minimizing the variance of its outcomes and maximizing the volume of trials.
For a quantitative trader, this translates to:

1. Risk Management: Severe penalties for large drawdowns (Variance Minimization).
2. Consistency: Rewarding steady, small positive returns (Consistent Positive EV).
3. Selectivity: The agent should prefer to "Hold" when the EV is ambiguous, minimizing exposure to "negative EV" bets.

The current reward function in the file is a simple accuracy counter (points for predicting a \(\pm 1\%\) move). To shift the strategy to an EV-maximizing, casino-style approach, the reward function must be changed to incorporate P&L, risk, and consistency metrics.
Here are 5 ways to implement a new get_reward function to achieve this casino-style strategic approach:
1. Geometric (Log) Return with Transaction Costs
This function directly measures the final compounding wealth, which is the most realistic measure of long-term EV.
| Focus | Consistency and Compounding |
|---|---|
| Logic | Casino profits compound over time. Log returns heavily penalize large single-period losses, forcing the agent to prioritize small, steady gains over volatile, high-stakes bets. Transaction costs (\(\approx 0.05\%\) per trade) represent the casino's "rake." |
| Reward Function Implementation | The reward is the sum of the log returns achieved over the training period, adjusted for costs only when a trade occurs. |
| Formula Concept | \(\text{Reward} = \sum_{t} \left[ \log(1 + \text{Action Return}_t) - \text{Transaction Cost} \times \mathbb{1}_{\text{Action}_t \ne 0} \right]\) |
| Why it works | Maximizing the sum of log returns is mathematically equivalent to maximizing the final geometric wealth, ensuring that the agent values consistent, small wins that compound rather than volatile returns that can be wiped out by one large loss. |
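
The log-return idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not the file's actual `get_reward`: the function name `log_return_reward`, the action encoding (0 = Hold, 1 = Buy, 2 = Sell, with Sell treated as a short position) and the input format (one simple return per step) are all assumptions made for the example.

```python
import math

# Assumed action encoding; the text only fixes Buy = 1 and "no trade" = 0.
HOLD, BUY, SELL = 0, 1, 2
COST = 0.0005  # roughly 0.05% per trade, as quoted in the table above


def log_return_reward(actions, price_returns, cost=COST):
    """Sum of per-step log returns, minus a cost on every active step.

    actions: sequence of HOLD/BUY/SELL.
    price_returns: simple asset returns for the same steps (0.01 means +1%).
    """
    total = 0.0
    for action, r in zip(actions, price_returns):
        if action == HOLD:
            continue  # no position: no return and no transaction cost
        # Assumption: SELL is a short, so it earns the negative of the move.
        position = 1.0 if action == BUY else -1.0
        total += math.log1p(position * r) - cost
    return total
```

Because the rewards are logs, a +50% step followed by a -50% step sums to a negative value even with zero cost, which is the asymmetry the table describes.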
2. Sharpe Ratio Maximization
The Sharpe Ratio is the classic metric for risk-adjusted return, exactly what the house cares about: maximum profit for minimum volatility.
| Focus | Risk-Adjusted Performance (Variance Minimization) |
|---|---|
| Logic | A high Sharpe Ratio means the strategy's return is high relative to the volatility of that return. A casino has high volume and low volatility of profit. Rewarding the Sharpe Ratio forces the ES optimizer to find an edge that is reliable and consistent. |
| Reward Function Implementation | The agent simulates the profit/loss (P&L) series implied by its actions, and the reward is the Sharpe Ratio of that series (using a risk-free rate of 0 for simplicity). |
| Formula Concept | \(\text{Reward} = \frac{\text{Mean}(\text{Returns Series})}{\text{StdDev}(\text{Returns Series})}\) |
| Why it works | The agent will be penalized severely for trades that result in highly volatile outcomes, even if the net profit is positive. It encourages the agent to only make trades where its positive EV is highly reliable. |
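
A minimal sketch of the Sharpe-style reward, assuming the per-step strategy returns have already been simulated and are passed in as a list. The name `sharpe_reward`, the use of the population standard deviation, and the choice to score a zero-volatility series as 0 are assumptions for illustration only.

```python
import statistics


def sharpe_reward(step_returns, eps=1e-9):
    """Mean over standard deviation of per-step returns (risk-free rate 0)."""
    if len(step_returns) < 2:
        return 0.0  # not enough data to estimate volatility
    mean = statistics.fmean(step_returns)
    std = statistics.pstdev(step_returns)
    if std < eps:
        # Assumption: a constant series has an undefined ratio, so score it 0
        # instead of dividing by (almost) zero.
        return 0.0
    return mean / std
```

Two series with the same mean return score very differently here: the one with smaller swings gets the larger reward.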
3. Asymmetric Stop-Loss and Take-Profit Penalty
This approach explicitly models the house's rigid rules (payouts and limits) by introducing disproportionate penalties for losing trades, forcing conservative behavior.
| Focus | Hard Risk Management and Trade Discipline |
|---|---|
| Logic | Define a mandatory synthetic "Stop-Loss" (SL) and "Take-Profit" (TP). If the agent signals a Buy (1) and the price drops by \(SL\) (e.g., \(0.5\%\)), the reward is a massive penalty (e.g., \(-10\) times the loss size), regardless of what the price does next. This trains the network to avoid entries that immediately lead to a large drawdown. |
| Reward Function Implementation | Use P&L as the baseline reward, but apply a heavy multiplier penalty whenever a predicted trade would hit a predefined SL threshold. |
| Formula Concept | \(\text{Reward} = \sum_{t} \begin{cases} -10 \times (\text{Loss Size}) & \text{if Trade Hits Stop-Loss} \\ \text{Actual Profit} & \text{Otherwise} \end{cases}\) |
| Why it works | The ES optimizer is trained to fear big, single-trial losses, forcing the model to only select trades with a very high probability of not hitting the stop-loss, thus improving the \(\text{Win Rate}\) and reducing the \(\text{Average Loss Size}\). |
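
The stop-loss rule can be sketched as follows. Everything about the data layout is an assumption: each trade is a hypothetical `(action, adverse_move, final_pnl)` tuple, where `adverse_move` is the largest move against the position while it was open. The loss size on a stopped trade is taken to be the stop-loss distance itself, on the assumption that the position is closed at the stop.

```python
# Assumed action encoding; the text only fixes Buy = 1.
HOLD, BUY, SELL = 0, 1, 2


def stop_loss_reward(trades, stop_loss=0.005, penalty_mult=10.0):
    """P&L reward with a multiplied penalty for trades that hit the stop.

    trades: iterable of (action, adverse_move, final_pnl) tuples.
      adverse_move: worst move against the position, as a positive fraction.
      final_pnl: signed return of the position at exit.
    """
    total = 0.0
    for action, adverse_move, final_pnl in trades:
        if action == HOLD:
            continue  # no trade, nothing to reward or punish
        if adverse_move >= stop_loss:
            # Stopped out: the penalty applies regardless of the later price.
            total -= penalty_mult * stop_loss
        else:
            total += final_pnl
    return total
```

With the default values, one stopped trade costs 0.05, so it takes five clean 1% winners to offset a single stop-out.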
4. Drawdown Penalty (Calmar or Sortino Ratio)
Focusing on drawdowns directly is essential for capital preservation, a primary concern of the "house" bankroll.
| Focus | Capital Preservation (Minimizing Max Drawdown) |
|---|---|
| Logic | The casino must maintain its cash reserve (bankroll). The Calmar Ratio measures the annualized return relative to the Maximum Drawdown (MDD), the largest peak-to-trough decline in the equity curve. Maximizing this ratio favors steady growth without catastrophic failures. |
| Reward Function Implementation | Calculate the entire P&L series, find the maximum drawdown (MDD), and use the Calmar Ratio as the reward. |
| Formula Concept | \(\text{Reward} = \frac{\text{Annualized Mean Return}}{\text{Maximum Drawdown (absolute value)}}\) |
| Why it works | Any strategy that delivers high, volatile returns with a single deep loss will have a poor Calmar Ratio, while a strategy with moderate, steady returns and almost no volatility (i.e., minimal drawdowns) will score very highly. |
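
A minimal sketch of the Calmar-style reward, assuming a list of per-step simple returns. The function names, the 252-periods-per-year annualization, and the small floor on the drawdown (to avoid division by zero when the equity curve never declines) are assumptions for illustration.

```python
def max_drawdown(step_returns):
    """Largest peak-to-trough decline of the compounded equity curve."""
    equity = peak = 1.0
    mdd = 0.0
    for r in step_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        mdd = max(mdd, (peak - equity) / peak)
    return mdd


def calmar_reward(step_returns, periods_per_year=252, eps=1e-9):
    """Annualized mean return divided by maximum drawdown."""
    if not step_returns:
        return 0.0
    annualized_mean = sum(step_returns) / len(step_returns) * periods_per_year
    # Assumption: floor the drawdown at eps so a curve with no decline
    # yields a very large (but finite) reward.
    return annualized_mean / max(max_drawdown(step_returns), eps)
```

In practice the floor `eps` may need to be much larger, since otherwise a short run with no losing step dominates every other candidate.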
5. Selective Trading Reward
This strategy encourages the agent to only act when its internal confidence is high, and rewards the "Hold" action proportionally to market inactivity rather than just meeting the arbitrary \(\pm 1\%\) threshold.
| Focus | EV Selectivity and Aversion to Ambiguous Bets |
|---|---|
| Logic | The reward is structured to make Holding the default, low-penalty action, while making Buying/Selling high-reward/high-penalty actions. The agent must develop a sharp, positive-EV signal to overcome the penalty of a wrong active trade. |
| Reward Function Implementation | 1. Active Trade (1 or 2): Reward is the P&L only if \(\text{P&L} > 0\). If \(\text{P&L} < 0\), the reward is $-2 \times |
| Why it works | This structure forces the agent to be highly selective. It encourages holding during quiet, neutral periods and only risking a trade when the expected gain significantly outweighs the disproportionate penalty of a loss, thus isolating the true positive EV signal. |
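
One possible reading of the selective reward is sketched below. The doubled penalty on losing trades comes from the table above. The Hold rule is an assumption, because the text only says Hold should be rewarded in proportion to market inactivity: here a Hold earns a small bonus that shrinks to zero as the absolute move approaches a hypothetical `quiet_band`. The names `selective_reward`, `quiet_band`, and `hold_scale`, and the encoding of Sell as a short, are illustrative.

```python
# Assumed action encoding; the text refers to active trades as "1 or 2".
HOLD, BUY, SELL = 0, 1, 2


def selective_reward(actions, price_returns, loss_mult=2.0,
                     quiet_band=0.01, hold_scale=0.1):
    """Asymmetric reward: gains count once, losses count loss_mult times."""
    total = 0.0
    for action, r in zip(actions, price_returns):
        if action == HOLD:
            # Assumption: bonus proportional to how quiet the market was,
            # and zero once the move exceeds the quiet band.
            total += hold_scale * max(0.0, quiet_band - abs(r))
            continue
        pnl = r if action == BUY else -r  # SELL treated as a short
        total += pnl if pnl > 0 else loss_mult * pnl
    return total
```

With these defaults, a trade needs to win more than two times out of three at equal gain and loss sizes before it beats doing nothing, which is the selectivity the section aims for.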