Backtesting

Having chosen some candidate strategies, you then need to check whether each of them is actually profitable.

We do this through backtesting, which is testing our strategy on data. The data itself can be either historical or simulated.

In this section we will look into a few considerations when backtesting:

Different metrics to assess, and why.
Methods of Backtesting, either historical or simulated, and which one to use based on your context.
Common fallacies in backtesting.

Example implemented backtests can be seen here:
https://github.com/Aldo-Aditiya/algo_trading/blob/master/lib/backtest.py

Basic

A backtest is basically this: Using some price data, we run our selected strategy, and then generate metrics to evaluate said strategy.

Backtests are also more conveniently run on periodic (e.g. daily) returns than on prices. For example:

In a simple bitcoin buy and hold strategy, we use the periodic returns of bitcoin directly as the basis for our metrics.
In a bitcoin EMA strategy, returns only count for the periods where you are in the market. For example, you might be invested on days 1-10 but not on days 11-20, so your return for days 11-20 is 0.
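
The two examples above can be sketched in plain Python. This is only an illustration under stated assumptions: the prices are made up, the rule "hold while the price is above its EMA" is one possible EMA strategy, and the function names are hypothetical. The signal seen at the close of one day decides the position for the next day, so no future information is used.

```python
def ema(prices, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out


def simple_returns(prices):
    """Periodic returns: r_t = p_t / p_(t-1) - 1."""
    return [prices[i] / prices[i - 1] - 1 for i in range(1, len(prices))]


def ema_strategy_returns(prices, span):
    """Returns of an illustrative 'long while price > EMA' rule."""
    avg = ema(prices, span)
    asset = simple_returns(prices)  # asset[i] is the return from day i to day i + 1
    out = []
    for i, r in enumerate(asset):
        invested = prices[i] > avg[i]  # signal known at the close of day i
        out.append(r if invested else 0.0)  # out of the market means a 0 return
    return out


prices = [100, 102, 101, 105, 107, 103, 99, 98, 101, 104]  # made-up data
buy_and_hold = simple_returns(prices)
ema_strategy = ema_strategy_returns(prices, span=3)
```

Both lists have one entry per period, so every metric in the next section can be computed from either of them in the same way.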

Once we run our backtest, we calculate the relevant metrics.

Metrics

Cumulative Return
Cumulative Return is the total return generated over some period of time.

This shows the strategy's performance on the data. It is usually useful to compare the result against a benchmark. For instance, an American equity strategy could be benchmarked against the S&P 500.
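
As a minimal sketch, cumulative return can be computed by compounding the periodic returns. The return series below are made up, and the function name is hypothetical.

```python
def cumulative_return(returns):
    """Compound periodic returns: (1 + r_1) * (1 + r_2) * ... - 1."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


strategy = [0.01, -0.02, 0.03, 0.00, 0.015]     # made-up strategy returns
benchmark = [0.005, -0.01, 0.01, 0.002, 0.004]  # made-up benchmark returns

# Positive means the strategy beat the benchmark over this period.
excess = cumulative_return(strategy) - cumulative_return(benchmark)
```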

Monthly Return
Like cumulative return, but computed separately for each month. Very useful to get a feel for the strategy's performance in different market conditions.
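
One way to get monthly returns from daily returns is to compound the daily values inside each calendar month. The sketch below assumes the dates and returns come as two parallel lists; the data and the function name are illustrative only.

```python
from datetime import date


def monthly_returns(dates, returns):
    """Compound daily returns within each (year, month) bucket."""
    growth = {}
    for d, r in zip(dates, returns):
        key = (d.year, d.month)
        growth[key] = growth.get(key, 1.0) * (1.0 + r)
    return {key: g - 1.0 for key, g in growth.items()}


# Made-up daily returns spanning two months.
dates = [date(2022, 1, 3), date(2022, 1, 4), date(2022, 2, 1), date(2022, 2, 2)]
daily = [0.10, 0.10, -0.05, 0.00]

by_month = monthly_returns(dates, daily)
for (year, month), r in by_month.items():
    print(f"{year}-{month:02d}: {r:+.2%}")
```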

Compound Annual Growth Rate (CAGR)
If cumulative return is the return over some period of time, CAGR is the constant yearly return that would compound to the same final value over that period.
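
A rough sketch of the calculation: CAGR = (final value / initial value)^(1 / years) - 1. The function name and the default of 252 periods per year are assumptions for illustration.

```python
def cagr(returns, periods_per_year=252):
    """Annualized growth rate implied by a series of periodic returns."""
    if not returns:
        raise ValueError("need at least one return")
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r  # final value per unit of initial capital
    years = len(returns) / periods_per_year
    return growth ** (1.0 / years) - 1.0


# Two made-up yearly returns of 10% each compound to 21% in total,
# which corresponds to a CAGR of roughly 10%.
yearly = [0.10, 0.10]
annual_growth = cagr(yearly, periods_per_year=1)
```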

Volatility
Standard deviation of returns over some period of time. Usually measured on daily returns; to get an annualized volatility, multiply the daily figure by the square root of 252, the number of trading days in a year.

As a side note, the term "Risk" in quantitative finance is often used synonymously with Volatility. So a historically volatile asset is considered risky.

Also, be careful: reading standard deviation as a measure of risk usually assumes that returns are normally distributed. Though a useful approximation, it sometimes does not reflect reality.
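
A minimal sketch of the annualization step. It assumes daily returns that are independent from one day to the next, which is what justifies scaling by the square root of the number of periods; the data and function name are illustrative.

```python
import math
import statistics

TRADING_DAYS = 252  # trading days per year, as used in the text


def annualized_volatility(daily_returns, periods_per_year=TRADING_DAYS):
    """Sample standard deviation of returns, scaled by sqrt(periods per year)."""
    return statistics.stdev(daily_returns) * math.sqrt(periods_per_year)


daily = [0.01, -0.02, 0.015, 0.003, -0.007]  # made-up daily returns
daily_vol = statistics.stdev(daily)
annual_vol = annualized_volatility(daily)
```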

Sharpe Ratio
Risk-adjusted returns: the return in excess of some risk-free rate, divided by the volatility.

The risk-free rate is the return of some instrument that provides a "risk-free return". This is usually taken to be the treasury interest rate, e.g. 3.6%.

Useful to assess whether the strategy has a good tradeoff between returns and volatility.
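
As a hedged sketch, an annualized Sharpe ratio from daily returns could look like this. The conversion of the annual risk-free rate to a daily rate by simple division is a simplifying assumption, and the function name is hypothetical.

```python
import math
import statistics


def sharpe_ratio(daily_returns, annual_risk_free=0.0, periods_per_year=252):
    """Mean excess return over its standard deviation, annualized."""
    daily_rf = annual_risk_free / periods_per_year  # simple, non-compounded split
    excess = [r - daily_rf for r in daily_returns]
    per_period = statistics.mean(excess) / statistics.stdev(excess)
    return per_period * math.sqrt(periods_per_year)


daily = [0.01, -0.02, 0.015, 0.003, -0.007, 0.012]  # made-up daily returns
sharpe_no_rf = sharpe_ratio(daily)
sharpe_with_rf = sharpe_ratio(daily, annual_risk_free=0.036)  # 3.6% as in the text
```

A higher risk-free rate lowers the ratio, because the same returns leave less excess return per unit of volatility.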

Maximum Drawdown
The worst peak-to-trough percentage decline that happened when running the strategy on the data. A drawdown starts at a local peak, where the cumulative return begins to fall, and ends when the cumulative return climbs back to that peak.

Very useful to assess the risk of the strategy.

Another useful visualization is the underwater plot, which visualizes how much was lost in every downturn period.

Longest Drawdown Period
The longest period during which the cumulative return stays underwater, i.e. below its previous peak.
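
Both drawdown metrics can be derived from the same running-peak calculation. The sketch below is illustrative: the function names are hypothetical, and the drawdown series it produces is the quantity an underwater plot would draw.

```python
def drawdown_series(returns):
    """Fractional distance of the equity curve from its running peak (<= 0)."""
    equity, peak, out = 1.0, 1.0, []
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        out.append(equity / peak - 1.0)
    return out


def max_drawdown(returns):
    """Most negative value of the drawdown series."""
    return min(drawdown_series(returns), default=0.0)


def longest_drawdown_period(returns):
    """Largest number of consecutive periods spent below the previous peak."""
    longest = current = 0
    for dd in drawdown_series(returns):
        current = current + 1 if dd < 0 else 0
        longest = max(longest, current)
    return longest


returns = [0.10, -0.50, 1.00, 0.10]  # made-up: up 10%, halve, double, up 10%
worst = max_drawdown(returns)
duration = longest_drawdown_period(returns)
```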

Other Metrics
Other useful metrics include:

Skewness of returns
Probabilistic Sharpe Ratio
Hit Ratio
Average Return from Hits and Misses
Average Holding Periods

You can pick and choose based on your needs.

Methods

Walk Forward Backtest
The simplest form of backtesting: we pick a historical period and run the strategy over it.

The problems with this are:

A single historical price series is just one of many alternative paths that could have happened. So it might not reflect other paths where, perhaps, the market is more bullish or more bearish.
Old price movements are not at all guaranteed to happen in the future.
A single price series is not enough to estimate the spread of possible strategy performance.

Historical Event Backtest
It might be useful to test the performance of our strategy at particular times in history.

For example, it might be useful to see how our strategy would have performed in the 2008 stock market crash to understand how much we would stand to lose if such an event happens again.

Some interesting periods, selected based on the LQ45 Index, are as follows:

2006 Pre-GFC Bull Run -> Used to test whether the system can detect the overarching positive trend, despite some setbacks
2008 Great Financial Crisis -> Used to test how the system will perform in a multi-year bear market
2009 Post-GFC Bull Run -> Used to test whether the system can detect the overarching positive trend, despite some setbacks
2013 May Turbulence -> Used to test how the system will perform in a sudden drop, followed by a dead cat bounce
2015 April Drop -> Used to test how the system will handle a sudden drop
2015-2016 Turbulence -> Used to test how the system will handle a turbulent upwards market
2018 February Drop -> Used to test how the system will handle a sudden drop
2018-2019 Turbulence -> Used to test how the system will handle a sideways turbulent market
2020 Covid Scare -> Used to test how the system will handle a sudden catastrophic drop

Bootstrapped Backtest
By running our backtest on our base data, we obtain an empirical probability distribution of the strategy's returns.

The idea of bootstrapping is to randomly sample, with replacement, from that returns distribution and generate many different possible return curves.

By doing this, we are essentially simulating possible return curves based on the strategy's characteristics, which are implicit in the returns probability distribution.

This is useful to estimate the probability distribution of our strategy's metrics.

For example, with 500 return curves we get 500 values of the same metric. We can then model the probability distribution of those 500 values and use it to draw conclusions about the strategy's characteristics.
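
A minimal sketch of this procedure, using total return as the metric. It resamples individual returns independently, which is an assumption: any dependence between consecutive returns is ignored. The data, function names, and percentile choice are illustrative.

```python
import random


def total_return(returns):
    """Metric to bootstrap: compounded return of one curve."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def bootstrap_metric(returns, metric, n_paths=500, seed=0):
    """Resample returns with replacement and evaluate the metric on each path."""
    rng = random.Random(seed)  # fixed seed keeps the experiment reproducible
    n = len(returns)
    return [metric(rng.choices(returns, k=n)) for _ in range(n_paths)]


base = [0.01, -0.02, 0.015, 0.003, -0.007, 0.012, -0.004, 0.008]  # made-up
samples = sorted(bootstrap_metric(base, total_return))

# Rough 5th and 95th percentiles of the metric across the 500 curves.
low = samples[int(0.05 * len(samples))]
high = samples[int(0.95 * len(samples))]
```

The spread between `low` and `high` gives a feel for how much the metric could vary under the same returns distribution.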

Model Backtest
Model backtest extends the bootstrap idea by asking: instead of just the returns distribution, can we directly model the underlying process that generates the price / returns?

Modelling a process is (slightly) more complicated and outside the scope of this discussion; the article linked at the end of this section walks through examples of stochastic process models.

Once we have built the model, we can generate simulated paths from it and use them the same way as we did in the bootstrap method.
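
To make the idea concrete, here is a sketch that uses geometric Brownian motion, where log returns are independent and normally distributed. This model is my choice for illustration, not necessarily the one the author had in mind, and the prices and function names are made up.

```python
import math
import random
import statistics


def fit_log_return_model(prices):
    """Estimate the per-step mean and standard deviation of log returns."""
    log_returns = [math.log(prices[i] / prices[i - 1]) for i in range(1, len(prices))]
    return statistics.mean(log_returns), statistics.stdev(log_returns)


def simulate_path(start_price, mu, sigma, n_steps, rng):
    """One simulated price path with i.i.d. normal log returns."""
    path = [start_price]
    for _ in range(n_steps):
        path.append(path[-1] * math.exp(rng.gauss(mu, sigma)))
    return path


prices = [100, 102, 101, 105, 107, 103, 99, 98, 101, 104]  # made-up data
mu, sigma = fit_log_return_model(prices)

rng = random.Random(42)  # fixed seed for reproducibility
paths = [simulate_path(prices[-1], mu, sigma, n_steps=20, rng=rng) for _ in range(100)]
```

Each simulated path can then be fed to the strategy and its metrics collected, just like the resampled curves in the bootstrap method.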

You can read more here: http://www.turingfinance.com/random-walks-down-wall-street-stochastic-processes-in-python/

Considerations in Backtesting

There are a number of considerations when doing backtesting and evaluating its results.

Biases

Look Ahead Bias - Assuming that you can trade at the same open or close price that produced your signal
- Ensure that information from timestamp X is only acted on at X+1.
- Model the intraday dynamics.
Survivorship Bias
- Use historic index membership and full historic stock data as your universe basis.
Multiple Comparisons Bias and Overfitting
- Use simple models.
- Use out-of-sample cross-validation and moving-window backtesting.
- Hypothesize the variations of parameters before doing backtesting.
- Do sensitivity analysis to judge whether small changes in parameters affect the strategy significantly.
Make sure strategy is deterministic
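
The moving-window idea from the list above can be sketched as a generator of train and test index ranges: parameters are chosen on each training window and evaluated only on the test window that follows it. The function name and window sizes are assumptions for illustration.

```python
def moving_windows(n_obs, train_size, test_size):
    """Consecutive (train, test) index ranges that roll forward through the data."""
    windows = []
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        windows.append((train, test))
        start += test_size  # roll forward by one test window
    return windows


# 10 observations, fit on 4, evaluate on the next 2, then roll forward.
for train, test in moving_windows(n_obs=10, train_size=4, test_size=2):
    print(list(train), "->", list(test))
```

Because every test window lies strictly after its training window, the evaluation never sees data that was used to pick the parameters.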

Market Model

Not considering the transaction details of a broker
Not considering stock splits
Not considering market impact
- Build a slippage (liquidity) or market impact model for your backtests.
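
As a very rough starting point, transaction costs can be modelled as a fixed fraction of the amount traded. This is a deliberately simple assumption, not a real market impact model: the cost rate, the position convention (0 = out, 1 = fully invested), and the function name are all hypothetical.

```python
def net_returns(gross_returns, positions, cost_rate=0.001):
    """Subtract a proportional cost every time the position changes."""
    out = []
    prev = 0.0  # start out of the market
    for r, pos in zip(gross_returns, positions):
        turnover = abs(pos - prev)  # fraction of capital traded this period
        out.append(r - cost_rate * turnover)
        prev = pos
    return out


gross = [0.010, 0.000, 0.020]  # made-up strategy returns before costs
positions = [1, 0, 1]          # enter, exit, re-enter
net = net_returns(gross, positions, cost_rate=0.001)
```

Even this crude model shows how a strategy that trades often can lose much of its gross return to costs.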

Statistical Significance

We might just be lucky in getting a good result from our backtest.
- Hypothesis testing. Some approaches:
  - (1) Assume a Gaussian distribution, and test the p-value of the Sharpe Ratio (for a given amount of data, a larger Sharpe means higher statistical significance)
  - (2) Monte Carlo method
  - (3) Simulated trades which are distributed randomly.
- Although rejection of the null might be flawed, failing to reject the null can still give us some insights.
Not testing on long enough data
- This makes the result less statistically significant. (Intuitively, the more data the strategy works on, the less likely it is that the result is just luck)
Can we make sure that our strategy does not actually give a negative SR?
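
Approach (1) can be sketched as follows. Under the null hypothesis that the true mean return is zero, the per-period Sharpe ratio times the square root of the number of observations is approximately standard normal. This is a simplified illustration that assumes independent, roughly Gaussian returns; the function name and data are made up.

```python
import math
import statistics


def sharpe_p_value(returns):
    """One-sided p-value for 'mean return > 0' under a Gaussian approximation."""
    n = len(returns)
    per_period_sharpe = statistics.mean(returns) / statistics.stdev(returns)
    test_stat = per_period_sharpe * math.sqrt(n)
    p_value = 1.0 - statistics.NormalDist().cdf(test_stat)
    return test_stat, p_value


pattern = [0.02, -0.01]           # made-up repeating returns
short_sample = pattern * 10       # 20 observations
long_sample = pattern * 100       # 200 observations

_, p_short = sharpe_p_value(short_sample)
_, p_long = sharpe_p_value(long_sample)
```

The same return pattern gives a smaller p-value on the longer sample, which matches the point above about testing on enough data.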

Regime Change

Not considering regime shifts, which can cause the strategy to stop working
- Choose recent data, where the properties of the price series data might not have changed.
- Some Potential Regime Shifts:
  - Change in policy and regulation
  - Change in economic prospect
  - Change in market sentiment
- Be aware of potential changes in the future, and plan accordingly by doing risk management.
- Make a regime change detector
- (Yet I need to use a lot of data to test - so how do I approach this?)
Not testing on different market conditions
- Take into account both bull and bear market conditions on your backtests.

Backtest Method

Using only one variation of historical price data (which is the only one that we have) is flawed, especially if we do not have enough data.
Only observing PnL metrics does not show us its distribution characteristics.
