Naïve Backtesting is Bogus

August 16, 2009

The most frequently cited conventional wisdom of quant trading is backtesting, often summarized as:

Wise traders do as much backtesting as possible before starting to trade a system with real money.

Unfortunately, this wisdom is bogus. More accurately, this wisdom is bogus when practiced according to the standard backtesting formula:

Indicator: choose indicator (whether fundamental, technical, or statistical)
Data: choose long panel of data for some instrument (usually as much data as possible)
Backtest: build strategy by optimizing entry and exit, given indicator, over data panel
Profit!

Yes, undoubtedly some traders find short-term success with this formula. This is actually inevitable, due to the infinite monkey theorem: enough traders doing enough data snooping on powerful computers will result in a small number of them discovering what appears to be a successful strategy purely by chance.
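The effect is easy to reproduce. The following is a minimal sketch, assuming NumPy and purely synthetic daily returns with zero true edge; the function name `best_sharpe_from_noise`, the 1% daily volatility, and the one-year sample length are illustrative assumptions, not drawn from any real strategy. It reports the annualized Sharpe ratio of the best performer among a growing pool of random "strategies".

```python
import numpy as np

def best_sharpe_from_noise(n_strategies, n_days=252, seed=0):
    """Annualized Sharpe ratio of the best of n_strategies pure-noise strategies."""
    rng = np.random.default_rng(seed)
    # Synthetic daily returns with no edge at all: mean 0, 1% daily volatility.
    returns = rng.normal(0.0, 0.01, size=(n_strategies, n_days))
    # Per-strategy Sharpe ratio (risk-free rate assumed zero), annualized.
    sharpe = returns.mean(axis=1) / returns.std(axis=1, ddof=1) * np.sqrt(252)
    return sharpe.max()

# The more strategies are snooped, the better the "winner" looks.
for n in (1, 100, 10_000):
    print(n, round(best_sharpe_from_noise(n), 2))
```

Because every simulated strategy is noise by construction, any high Sharpe ratio in the output is a selection effect rather than skill.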

Consider an automotive analogy, to help illustrate the fallacy of this formula: predict the future viscosity of oil, given many measurements of viscosity at random times over a preceding period of time. Such measurements exhibit the following statistical attributes:

Random dispersion: values appear randomly, roughly following Brownian motion
Two primary clusters: values appear roughly centered around two primary clusters
Random outliers: random outliers exist between the two primary clusters, mostly lying on the plane connecting the two clusters

These measurements very roughly follow what traders observe in low-frequency security prices (oil in above analogy) across bull/bear market regimes (cluster in above analogy). Given that, back to prediction: what will future values of viscosity be given the past? Many years of effort could go into analyzing this problem, with numerous results providing mild predictiveness gained from use of diverse applied mathematical methods. Readers are encouraged to ponder this challenge, assuming they are given no more data than above.

Much of quant trading is analogous to this viscosity problem: blindly backtesting over myriad observed measurements, rigorously optimizing parameters using varying mathematical methods.

Now, consider one additional fact being discovered: oil being measured is contained in an engine, which randomly varies between being “on” and “off” for periods of time. With this knowledge (knowing running engines are hot), quants will quickly recall the relationship between temperature and viscosity. Shortly thereafter, someone will combine Sutherland’s formula with cluster analysis and identify an effective prediction methodology.
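The value of knowing the regime can be sketched numerically. This is a toy illustration under stated assumptions, not the methodology alluded to above: the viscosity levels (10 and 100), the noise level, and the independent on/off draws are all hypothetical, Sutherland's formula is not implemented, and `two_means_1d` is a hand-rolled two-cluster k-means written for this example. It compares the error of a single pooled prediction against a per-cluster prediction.

```python
import numpy as np

def two_means_1d(x, n_iter=50):
    """Minimal 1-D k-means with k=2; returns (centers, labels)."""
    # Start the two centers at the extremes of the data.
    centers = np.array([x.min(), x.max()], dtype=float)
    for _ in range(n_iter):
        # Label 1 if the point is closer to the upper center, else 0.
        labels = (np.abs(x - centers[0]) > np.abs(x - centers[1])).astype(int)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean()
    return centers, labels

rng = np.random.default_rng(1)
# Hypothetical engine state: True = on (hot, thin oil), False = off (cold, thick oil).
engine_on = rng.random(500) < 0.5
viscosity = np.where(engine_on, 10.0, 100.0) + rng.normal(0.0, 2.0, 500)

centers, labels = two_means_1d(viscosity)
# Mean absolute error when ignoring regimes vs. predicting the cluster center.
pooled_error = np.abs(viscosity - viscosity.mean()).mean()
regime_error = np.abs(viscosity - centers[labels]).mean()
print(pooled_error, regime_error)
```

In this synthetic setup the clusters recover the hidden engine state, and conditioning on the cluster shrinks the prediction error by more than an order of magnitude.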

If traders have learned anything during 2007 – 2009 (or 1998 – 2002), it should be that the fundamentals of economics and finance are not stable: nearly every statistical measure in common use across nearly all asset classes exhibited inconsistent behavior over these periods (mean, variance, volatility, covariance, correlation, cointegration, principal components, skew, kurtosis, etc.). Introductory economics informs us the root causes for these instabilities are many, ranging from business cycles to monetary policy.
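To make the instability concrete, here is a small sketch on synthetic data (not market data): returns are drawn from an assumed calm regime followed by an assumed stressed regime, and a rolling estimate is compared with the single full-sample number that a long-panel backtest would implicitly rely on. The helper `rolling_std`, the volatility levels, and the 60-day window are illustrative choices.

```python
import numpy as np

def rolling_std(x, window):
    """Sample standard deviation over each consecutive window of length `window`."""
    windows = np.lib.stride_tricks.sliding_window_view(x, window)
    return windows.std(axis=1, ddof=1)

rng = np.random.default_rng(2)
# Two hypothetical regimes: 500 calm days, then 500 stressed days.
calm = rng.normal(0.0, 0.005, 500)
stressed = rng.normal(0.0, 0.03, 500)
returns = np.concatenate([calm, stressed])

full_sample_vol = returns.std(ddof=1)   # one number for the whole panel
vol = rolling_std(returns, window=60)   # one number per 60-day window
print(full_sample_vol, vol[0], vol[-1])
```

The full-sample figure describes neither regime well, which is the same problem a parameter optimized over the whole panel inherits.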

Yet, despite these first-hand observations, many traders merrily go along their way continuing to faithfully believe in the wisdom of backtesting. By the above analogy, traders would benefit from refining their quantitative methodologies to accommodate systemic biases—rather than blindly backtesting over long periods of time, with no concern for externalities such as market regime.

The real challenge is somehow refining the methodology of quantitative trading in response to this knowledge. Subsequent posts will strive to take up this challenge, with the hope that readers contribute their expertise.

41 Comments

Jeff Pietsch

August 17, 2009 7:28 am

Excellent points. I hope that one of the subsequent articles will address the adequacy of resampling methodologies as well. Thanks for your good work, Jeff

gappy

August 17, 2009 8:03 am

While I agree with the thrust of the post, I disagree with the statement that backtesting done this way is *always* meaningless. Let’s assume you pick a single trading rule *completely* at random, backtest it, and observe that it performs well; say, a Sharpe ratio of 3. You stop there, and go trading. This is quite different from testing 1e8 trading rules and choosing the best performer, even if it has a Sharpe ratio greater than 3.

nick gogerty

August 17, 2009 10:44 am

I agree, few back tests are done looking at the “context” (economic etc.) in which they are operating. Regime shifting models are the typical solutions, but most regime shifts disregard larger economic shifts and look for trends from mean reversion and trending regimes.

The meta regime or context is an important component. IMO the next meta regime that is ignored is USD related inflation. The term “real returns” may come into vogue as profits become more important. Most Sharpe ratio calculations assume a fixed risk free rate which is then duly ignored.

some thoughts: http://nickgogerty.typepad.com/designing_better_futures/2009/06/minsky-moments-wilmotts-magician-and-a-lack-of-imagination.html

http://nickgogerty.typepad.com/designing_better_futures/2009/06/which-beta-i-vote-null-beta-is-overworked-and-needs-a-rest.html

http://nickgogerty.typepad.com/designing_better_futures/2009/03/changing-metrics-in-the-hedge-fund-industry.html

Trading systems: Portfolios: Hedge Funds: Fund of Funds: Institutions could all re-discover the importance of Real Returns in the next few years, if not sooner.

James

August 17, 2009 11:43 am

Backtesting is useful in weeding out ideas which sound good, but have major flaws which aren’t obvious at first glance. Most traders seem to know it’s a tool, not a solution.

eber terandst

August 17, 2009 1:26 pm

You talk a great talk, indeed.
Now, how about doing some walking and posting some trades and see how well they work ? ? ? ? ? ?
eber

david varadi

August 17, 2009 3:36 pm

excellent post…..it is difficult to convey the extreme importance of this concept. having worked with a lot of historical data as well as machine learning algorithms, most of what i do emphasizes methodology and theory instead of inputs or parameters. this seems to work best. cluster analysis is of course central to creating robust models. the data and interrelationships are indeed remarkably unstable.

thanks for infusing the blogosphere with some rigor.
dv

Manolo

August 17, 2009 8:02 pm

Backtesting is as much art as math. You have to know how to blend. As important, you have to know when to bail on a strategy that has lost its edge. And even more importantly, you gotta realize that if you apply too much leverage, even the best strategies will blow you up. {Control that greed.}

I think that might be why it takes years to really get good at this game.

quantivity

August 17, 2009 8:49 pm

@Jeff: excellent suggestion. For robust estimation, Monte Carlo resampling methods (e.g. jackknifing and bootstrapping) are invaluable. I will post on model selection using resampling techniques, towards opening dialog on ideas on how to improve robustness of backtesting.

@Gappy: totally agreed; I certainly did not intend to imply all backtesting was worthless; just the converse, amplifying other readers’ comments: backtesting for identifying robust strategies (with known risk/reward profiles) benefits from a nuanced blend of intuition, art, and math.

@Nick: 100% agree; I mostly concur with long-term inflationary outlook (if only because of mean reversion, let alone recent monetary expansion policies), with caveat US needs to first survive immediate-term deflationary climate.

@James: absolutely.

@David: thanks for your positive comments.

@Manolo: you said it right; the more years spent analyzing and trading, the more humility one gains in appreciating how little one really knows.

JimJinNJ

August 18, 2009 5:16 pm

I take some exception to your definition of system development. You’ve described data mining and, in effect, optimization (aka “curve fitting”). So defined, it is the kiss of death.
I’d suggest that system building follows the rules of science in that one 1) posits a hypothesis based on observation and/or theory; 2) tests the theory (in this case for predictive utility); 3) re-tests (or as we psychologists call it, cross-validates) the findings on an independent data set. My experience is that most systems die at step 3.
Implied here is that math does not equal scientific inquiry.

What am I missing?–if anything.

Manolo

August 18, 2009 6:37 pm

@Jin

Broadly speaking you’re correct. But the devil is in the details. The details are what separate good from great.

For example, by the time you have enough data for a statistically significant test, most of the “low hanging fruit” is picked, e.g. FX trading using momentum. Most of the guys who made small fortunes back in the 1970’s drew conclusions from limited data sets. Sure you could wait for more data, but most of the easy money will already be made by the guys who visualized before they had all the data.
If you “aren’t missing anything” you’re late to the trade/concept.

Sam

August 19, 2009 4:58 am

The bottom line is this. You can backtest until the cows come home and it doesn’t really matter. You can pull a portion of data and make that your in sample, test and optimize on that, then run it forward and backward using realistic transaction costs. It looks great on in sample and out of sample data of changing market regimes and it still blows out.

I’ve been at this game for a while and I’ve seen this happen time and time again to very smart and humble people.

To me what it comes down to is that backtesting, Monte Carlo, etc. are all fine tools to work with, but in the end even if you’ve found an edge there is no guarantee that it will perform going forward.

Backtest away, but don’t put too much weight on it.

MDan

August 20, 2009 10:24 am

Don’t forget the other end of the spectrum: the ‘intelligent’ systems that try to pick up the most recent best performing strategies.

Basically, it is the same problem: traders are treating market regimes as externalities and are attempting to work around them instead of trying to understand them.

- quantivity
  
  August 20, 2009 10:49 pm
  
  @MDan: absolutely; I will cover this perspective in a subsequent post (specifically how modern portfolio theory combined with regime detection can be used as building blocks for such a system), as this is indeed a natural implication of this post.
  
Kevin

August 20, 2009 9:38 pm

I’m new to all of this… so I ask, isn’t there a way to generate random data using alternate techniques rather than statistics? If the stats broke down in 2008, why can’t a technique be developed that generates random data according to the stats, but then forces periods of breakdowns (random in frequency and amplitude). Why can’t historical data be taken and ‘randomized’ such that we create a 1000 random data sets based upon 2008 data where each random data set has a decline 5% worse than the source set. And then do another 1000 where each is 10% worse. Perhaps it is a way to ‘stress test’ a 100% mechanical strategy.

I’d love to be able to generate such data and run it through a mechanical trading system such as my own.

- quantivity
  
  August 20, 2009 10:57 pm
  
  @Kevin: yes, there are numerous ways to do so (many of which are accessible in Excel). Conceptually, doing so is a combo of iterative resampling (as replied to @Jeff’s comment above) with a randomized disturbance according to your desired distribution. If you are interested, a subsequent post could certainly cover the theory and practice for doing so.
  
  - JB
    
    September 21, 2009 11:08 pm
    
    I would be interested in any approaches that incorporate dependencies between adjacent observations. I’m not educated in the theoretical backdrop, but my point is that random sampling using whatever worst case is one thing, but often one wants a data set that has instances of excessive euphoria/fear, in which case observations are very dependent on each other.
Kevin

August 21, 2009 9:58 am

Nice. I am doing my own randomization research using the frequency domain rather than the stochastics. It turns out it works very well (for my purposes at least). I would be very interested in a post covering your theory and technique.

Kevin

- quantivity
  
  August 21, 2009 4:43 pm
  
  Interesting. Frequency domain ala Fourier, wavelets, or other technique? Wavelets offer wonderful opportunity to introduce non-temporal disturbances.
  
  - Kevin
    
    August 21, 2009 6:22 pm
    
    Fourier. When I backtest my own personal strategy against regular historical data I get parameters that work well with that data (curve fitted), but when I run it against slightly randomized historical data (via my FFT technique), I notice various scenarios where the strategy breaks down. By optimizing the strategy against the slightly or significantly random historical data, I can come up with parameters that are just as profitable, but result in significantly reduced risk and drawdown across all of the random data sets.
    
    This is by no means a silver bullet, but to me it is a lot better than the rusty bullet of ‘plain old one time backtesting.’ or curve fitting to just one set of historical data.
    
    I will certainly have to post about it eventually, once I do more research.
  - quantivity
    
    August 23, 2009 1:46 pm
    
    @Kevin: Interested to see your post; particularly interested in seeing how you are selectively (in time) disturbing the signal, given the stationarity assumption of classical Fourier analysis (including FFT) implies a local-in-frequency disturbance would not be applied local-in-time. Or, maybe that is exactly your point: you want to introduce spectral disturbances which pervade time?
Kevin

August 24, 2009 9:22 pm

To hopefully answer your question, here is a preview post of what I am researching:

http://blog.quantumfading.com/2009/08/24/historical-data-randomization-using-the-frequency-domain-preview/

- quantivity
  
  August 24, 2009 9:37 pm
  
  @Kevin: excellent post; thanks for link. Particularly apt emphasis on risk management rather than backtesting / optimization. Readers may be interested in the intuition and numbers behind the frequency adjustment (as well as periodogram or other spectral density diagram), in addition to the code.
  
Kevin

August 25, 2009 9:11 am

Thanks for the feedback. I will certainly cover the details behind the research as I get more information. Since I am new to trading/backtesting, etc. I wanted to do that introductory post in order to get feedback from experts like you and other bloggers, and to make sure I am not barking up the wrong tree. While the end result may or may not be useful, I will learn a lot during the research process.

- quantivity
  
  August 25, 2009 9:30 am
  
  @Kevin: one suggestion to consider in your research is conducting frequency analysis using both Fourier and wavelet techniques, towards exploring sensitivity / robustness of the Fourier stationarity assumption (along the lines implied by Three Horsemen).
  
Kevin

August 25, 2009 10:50 am

Ok, thanks for the ideas.

quantivity

September 14, 2009 12:15 am

WSJ similarly affirmed the futility of naive backtesting for investors, in Data Mining Isn’t a Good Bet For Stock-Market Predictions. Always interesting to see these sorts of topics covered by columns intended for fundamental investor audiences (who one hopes are unlikely to be doing backtesting).

nicolas

January 21, 2010 10:21 pm

Good point, but if you talk about this subject, it might be worth mentioning that this is the canonical problem that justifies whole fields in statistics: Bayesian methods, model selection, bootstrap, etc.

Otherwise we stay at the surface and know nothing.

Also, people who are not aware of that maybe should not be given money to play with.

- quantivity
  
  January 23, 2010 7:06 pm
  
  @Nicolas: thanks for your comments. Naturally, I agree with your remarks. Towards this end, I have recently been drafting a post introducing basic Bayesian forecasting via dynamic generalized linear models and Kalman filtering (broadly following West et al. [1985]).
  
Andrew

March 29, 2010 6:13 am

Great thought-provoking post. I suppose with curve-fitting it’s a question of degrees – it’s pointless trying to cash in on every twist and turn because they are unpredictable and there’s a good chance you’ll get burnt, but a simple buy in an established bull trend isn’t such a bad bet.

Most DM techniques have some form of protection against overfitting that can help draw out general rules rather than very specific rules which are unlikely to be of use on any other data set e.g. with genetic programming you can impose a complexity limit on the solutions.

Pat Burns

November 5, 2010 12:32 pm

This blog post: http://www.portfolioprobe.com/2010/11/05/backtesting-almost-wordless/ has some additional warnings about backtesting. Basically it shows that the naive idea of what a backtest says is not necessarily true, and it shows how to get accurate information on the strategy. This reduces but does not eliminate the data snooping problem.
