The Bootstrap Test - How Significant Are Your Back-Testing Results - Au.Tra.Sy Blog - Automated Trading System
As mentioned in the Evidence-based Technical Analysis review post, the main value of the book lies in its presentation of two methods for computing the statistical significance of trading strategy results despite having only a single sample of data:
Both methods solve the problem of estimating the degree of random variation in a test statistic when there is only a single sample of data
and, therefore, only a single value of the test statistic.
Today, let’s look at the bootstrap test, with a practical application of it.
In very brief terms, the concept uses hypothesis testing to verify whether the test statistic (such as mean return of the back-testing sample) is
statistically significant. This is done by establishing the p-value of the test statistic based on the sampling distribution. (Aronson covers the basics of
statistical analysis earlier in the book. I have also mentioned previously The Cartoon Guide to Statistics, which covers these concepts too)
The problem with back-testing is that the results generated represent a single sample, which does not provide any information on the sample
statistic’s variability and its sampling distribution. This is where bootstrapping comes in: by systematically and randomly resampling the single
available sample many times, it is possible to approximate the shape of the sampling distribution (and therefore calculate the p-value of the test
statistic).
The bootstrap uses the daily returns of a back-test (run on detrended data) and performs a resampling with replacement.
In practice:
1. A back-test is run on detrended data and the mean daily return, based on n observations, is calculated.
2. The mean daily return is subtracted from each day’s return (zero-centering). This gives a set of adjusted returns.
3. For each resample, select n instances of adjusted returns, at random (with replacement), and calculate their mean daily return (bootstrapped
mean).
4. Perform a large number of resamples to generate a large number of bootstrapped means.
5. Form the sampling distribution of the means generated in the step above.
6. Derive the p-value of the initial back-test mean return (non zero-centered) based on the sampling distribution.
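The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from Aronson's book; the function and parameter names are my own:

```python
import numpy as np

def bootstrap_pvalue(daily_returns, n_resamples=10_000, seed=0):
    """One-sided bootstrap test of H0: mean daily return is zero.

    A minimal sketch of the six-step procedure described in the post.
    """
    rng = np.random.default_rng(seed)
    returns = np.asarray(daily_returns, dtype=float)
    n = len(returns)
    observed_mean = returns.mean()          # step 1: back-test mean

    adjusted = returns - observed_mean      # step 2: zero-centering

    # steps 3-5: resample n observations with replacement, many times,
    # and collect the mean of each resample
    resamples = rng.choice(adjusted, size=(n_resamples, n), replace=True)
    bootstrapped_means = resamples.mean(axis=1)

    # step 6: p-value = fraction of bootstrapped means at or above the
    # observed (non-centered) mean
    return float((bootstrapped_means >= observed_mean).mean())
```

The practical application below follows exactly this shape, with 5,120 observations and 10,000 resamples.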
A Practical Application
To illustrate the concept, we can look at a back-test and apply the bootstrap method to its daily return series. I decided to look at a back-test I
presented in Better Trend Following via improved Roll Yield. Remember: a standard 50/20 Moving Average cross-over system applied to Crude Oil
was improved by adding a roll yield optimisation process.
In that instance, the benchmark is the standard strategy and we want to check that the strategy improvement was not the result of random chance. In Aronson’s book, benchmarking is achieved by detrending the data. However, this case is different, as the benchmark is the standard strategy. The improved strategy results can be thought of as two distinct parts: the returns attributable to the Trend Following component and those attributable to the Roll Yield improvement.
I therefore generated a composite, “Roll Yield-only” equity curve (by removing from the improved strategy equity curve the returns that could be
attributed to the Trend Following component). I then computed the daily returns based on that equity curve.
1. This set of daily returns is the original sample of 5120 observations, with an arithmetic mean of 0.216%.
2. Subtracting 0.216% from all 5120 returns adjusts those returns (zero-centering), ready to be picked for resampling.
3. The 10,000 resamples all pick at random, with replacement, 5120 observations from the zero-centered, adjusted returns. A mean is computed
for each resample.
4. Each of the resample means is used to form the sampling distribution of the mean return.
5. The last step is the comparison of the non-adjusted original sample mean (0.216%) to the sampling distribution to establish the p-value, which
is 0.006 in this example.
Once the p-value is obtained, it is simply a matter of deciding which threshold qualifies for statistical significance. Scientists usually determine the
statistical significance threshold at 0.05 (i.e. the null hypothesis is rejected for any p-value less than or equal to 0.05).
Proving that the arithmetic mean return is significantly positive, and deducing that the trading system is therefore profitable, is flawed reasoning. It is ironic that Aronson spends quite a lot of time on logical reasoning and the usual traps people fall into, only to present a flawed deduction himself. To use an example from the book:
A dog having four legs (a profitable rule having a positive mean arithmetic return) does not imply that any four-legged animal is a dog
(ie. any rule with a positive mean arithmetic return is not necessarily profitable).
On the other hand, any profitable rule has a positive mean geometric return, and any rule with a positive mean geometric return is profitable. On that basis, using the mean geometric return as the test statistic in the bootstrap should be more appropriate.
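Swapping the test statistic is straightforward. A minimal sketch of the geometric mean computation (a common formulation, not code from the book):

```python
import numpy as np

def geometric_mean_return(daily_returns):
    """Mean geometric return: the constant per-period return that
    compounds to the same terminal equity as the actual series."""
    growth = 1.0 + np.asarray(daily_returns, dtype=float)
    # Averaging log growth factors avoids overflow on long series
    return float(np.exp(np.log(growth).mean()) - 1.0)
```

In the bootstrap, this would simply replace the arithmetic mean when computing the observed statistic and the resample statistics; exactly how to zero-center under a geometric statistic is a modelling choice the post leaves open.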
31 Comments so far
For this specific post: What do you think about the de-trending? Always had mixed feelings about that.
Kim
Thanks Kim!
Re: detrending, I have mixed feelings about it too (see http://www.automated-trading-system.com/detrending-for-trend-following/ for an earlier post)
Aronson shows that detrending is benchmarking based on position bias (ie. avoid favoring rules with bullish biases during bull markets for
example) and therefore allows for comparing different rules on “equal footing”. There is a point there…
He also quotes a study done by Timothy Masters comparing the bootstrap and Monte Carlo methods and concluding that the two methods give
similar results, when both are run on detrended data.
I have not done enough testing to have a strong opinion on both and I would welcome additional comments from more experienced readers.
My initial thoughts are that at least you have to be aware of the position bias issue and decide how much it would affect the rules you’re testing
(for example a Long-only strategy is likely to outperform a Short-only strategy during a bull market – a Long/Short Trend Following strategy
goes equally likely long or short and should not suffer from position bias as much…)
My big question after reading the book is: but aren’t I trying to exploit a bias in the data? Sure, omitting short trades when trading only the Nasdaq in the 90’s is over-optimizing. However, if there is a systematic bias in the data (e.g. roll yield), then by detrending you are distorting reality.
I had the feeling Aronson used the arithmetic mean since permuting daily returns could produce the same arithmetic but a different geometric mean return – but that is not the case, as a quick simulation in Excel confirmed.
Paolo
Motomoto // Aug 12, 2010 at 3:22 am
While discussing the mathematics is well above my pay grade – I have to agree with KF.
When asking how significant the back-testing results are, aren’t you really trying to look at how good the back-testing results are at replicating what I could have profited from by following the system in the real world? It seems that if you then start chopping up the data, detrending and then running multiple tests, you are complicating things with tests that distort what actually happens. These are not randomly generated numbers based on a population; they are meant to be a ‘live time’ simulation of a series of trades – trades that may be related to previous trades… i.e. the market trends – it does not revert back to a mean or average?… (as I said, my statistical knowledge is sub-standard)
I agree, this can seem counter-intuitive to reshuffle the results like this and I do have mixed feelings about some aspects of the approach –
detrending for instance (as I mentioned before – especially for strategies like trend following).
You also allude to dependency in the results – which is obviously discarded in this sort of approach and might indeed be a weakness of
the methodology. Unfortunately I have yet to see a model of trading results dependency – in which case it could probably be integrated to that
method (ie some sort of conditional/random resampling to produce a more sophisticated testing method).
The problem is that the outcome of a trading strategy is a stochastic process with a large part of randomness, and therefore with only one set of
back-testing results (single sample), you cannot assert how much of the performance can be attributed to randomness.
And I think the bootstrap, amongst other methods, tries (with some weaknesses that you highlight) to address this issue to identify whether the
performance is more likely coming from random luck or strategy value. Not the holy grail of trading results significance checking but probably
a good tool in the system development toolbox.
There are long debates on the practice of back-testing and this issue is surely one of them!… ;-)
Lou, this can seem confusing at first indeed (but it makes sense, it’s all about checking the variance in the process)…
H0 is the hypothesis that the mean return is zero, so you need a distribution with zero mean.
What the bootstrap does is build such a distribution using the actual data from the test, in order to have variance/deviations in line with the process being tested.
After the zero-mean distribution is built you perform a standard statistical significance test by comparing the data tested (mean return) against
the zero-mean distribution. This is a non-parametric method, but an analogy would be to calculate how many standard deviations the mean
tested lies at.
You only reject H0 if the mean return is in the top x%. If the process has high variance, it is likely that H0 will not be rejected (at the x% level) because the mean return will not be “far enough” to the right.
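The standard-deviation analogy above can be sketched with a parametric shortcut. The observed mean is the 0.216% from the post, but the spread of the bootstrapped means here is an assumed, made-up number:

```python
import math

# Hypothetical numbers: observed mean from the post, plus an ASSUMED
# standard deviation for the zero-centered bootstrap distribution of means
observed_mean = 0.00216   # 0.216% daily mean return
bootstrap_std = 0.00086   # assumed spread of the bootstrapped means

# Parametric analogy: how many "standard deviations" out the mean lies
z = observed_mean / bootstrap_std

# One-sided normal tail probability, via the complementary error function
p_approx = 0.5 * math.erfc(z / math.sqrt(2.0))
```

With these made-up numbers the normal approximation lands near the 0.006 p-value quoted in the post, but the bootstrap itself makes no normality assumption – that is its point.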
Jez,
I think that the test you’ve created only shows that the mean does not come from the bootstrapped sample of (x – mean).
Hi Lou,
Not sure what you mean. I am only describing the bootstrap test as per Aronson’s book. Are you saying that the test is flawed or that there is
some mistake in the illustration?
I was not very clear/accurate in my description/comment above but in effect the test simulates a zero-mean sample process with similar
variance to the process (back-)tested. If we run that sample process a large amount of times, we know that the expected mean (ie mean of
means) will be zero, but we can also check how far and how frequently the data spreads to the right/left of the mean (as shown graphically in
the distribution).
If 95% of the means are below the back-test observed mean, there is statistical significance (at 95% confidence level) that the back-test’s
profitability was not the result of random variation from a zero-mean process with similar variance (which is H0, which can then be rejected).
I don’t know anything about Aronson. Maybe you can post or send a link to a relevant excerpt on applying bootstrapping to hypothesis tests.
So, now we have a distribution of tree tops (x – mean) and recorded values for negative trees.
If we want to test a full tree, one that hasn’t been cut (detrended) against the distribution of cut trees then our hypothesis test is Ho: tree(uncut)
= tree(cut).
If we reject Ho we’ve inferred that the uncut tree did not come from the cut tree sample.
That’s all that you’re doing in your example. You’ve inferred that the mean did not come from the detrended distribution.
The 1000 trees only represent a sample from the total population of trees (for which we do not know the true average height).
We want to know if the sample average height (15 feet) is statistically significant to infer that the true average height (for all trees) is different
from 0 (which is our H0 hypothesis: true average tree height = 0).
One problem is that we only have one sample forest (our 1000 trees). We need to have many more sample forests to establish the sampling
distribution of the average height.
Moreover, if our assumption is true (H0 = the true tree average height is 0), then the sample forest is skewed/biased upwards. So we need to
adjust it (by cutting the tree tops) to “zero-center” it (and meet H0’s condition).
We can now create many forests/resamples from the adjusted trees and calculate the average height from each resample to establish the
bootstrapped sampling distribution of the average height.
The mean average height will be zero, but some samples’ average heights will be 15 feet or over. The number/frequency of these samples with average heights of 15ft+ gives us an indication of how rare they are. The rarer they are in the sampling distribution, the less likely our initial sample’s average height was due to random variation from a zero-mean population: i.e. the total population is unlikely to have a zero mean given that our sample has a mean of 15 feet.
It is mostly related to variance within the sample: if all trees are 15 feet +/- 2 inches, it is very likely that the true average height is different from 0. In the bootstrap test, very few or no resampled means (zero-centered) will reach 15 feet or more: H0 is rejected.
If all trees are 15 feet +/- 100 feet (allowing negative trees), we have much less certainty about the true average height being different from 0 (as the 15 feet average could just be the result of random variation). In the bootstrap test, a much larger number of resampled means (zero-centered) will reach 15 feet or more: H0 cannot be rejected.
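This variance effect is easy to check numerically. A hypothetical sketch (the ~1% mean and the two spreads are made-up numbers standing in for the two tree forests):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_p(returns, n_resamples=5_000):
    # Zero-center, resample with replacement, count means >= observed
    returns = np.asarray(returns, dtype=float)
    observed = returns.mean()
    adjusted = returns - observed
    means = rng.choice(adjusted, size=(n_resamples, len(adjusted)),
                       replace=True).mean(axis=1)
    return float((means >= observed).mean())

# Two made-up series with the same ~1% mean but very different spreads
low_var = rng.normal(0.01, 0.005, size=250)   # "15 feet +/- 2 inches"
high_var = rng.normal(0.01, 0.20, size=250)   # "15 feet +/- 100 feet"

p_low = bootstrap_p(low_var)    # tiny: H0 rejected
p_high = bootstrap_p(high_var)  # larger: H0 much harder to reject
```

Same observed mean, but only the low-variance series lets us rule out random variation from a zero-mean process.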
Apologies as I do not have any good links on this topic to refer you to. I feel maybe we should “branch” out to a discussion offline on this
(email if you prefer) if you need…
I do think Aronson’s book does a good job of explaining the concept (obviously in a longer form) and I thought I managed to put a clear
synthesis of the ideas… Maybe not so clear.
Jez,
It seems that we’re talking past each other. My issue isn’t with bootstrapping. My issue with your original example is that you are contrasting
the mean return with a sample distribution made up of (x – mean) observations and then stating that “…the bootstrap tests for the null
hypothesis that the rule does not have any predictive power”.
This sounds really vague and in any case I’m pretty sure that you haven’t found out anything about “predictive power”. I think it would help to
know exactly what Ho: and Ha: are and what they have to do with predictive power.
Lou,
Yes – sorry that we don’t seem to understand each other…
“the bootstrap tests for the null hypothesis that the rule does not have any predictive power” is a quote from Aronson’s book. In it he equates
this to “the null hypothesis that the arithmetic mean return of the rule being back-tested is zero”
So
H0: back-tested rule has no predictive power = arithmetic mean return of the rule is zero
Ha: back-tested rule has predictive power = arithmetic mean return of the rule is positive
This would give the same results as the method described above.
-Jez
Has anyone tried Aronson’s detrending on a portfolio? It is reasonably straight forward when dealing with a single symbol. But, what about
when dealing with trades from multiple symbols? Must we detrend the history of each symbol individually for the calculations of those trades,
or is there some way to amalgamate the data streams? Individual streams would likely have too few trades to really be of any value.
P.S. The mean of your example 50% followed by -40% is 5%, not the 10% stated in your example.
Hi Mike – oops, thanks for letting me know about that mistake. Fixed now.
Re: the detrending question, I am not convinced by the concept of detrending (I discuss it here) and therefore have not researched it too
much. Intuitively I would think you need to keep track of trades and daily drift for each individual symbol and make individual adjustments.
Jez,
Did you try the way that you proposed (quoted below) and were 95% of them positive? I’m curious to see how that worked out.
“I suppose you could also see/run this test in a different way:
generate a sample distribution made up of (x) observations (instead of x – mean) and check how many observations are positive. If the number
is high enough (ie 95%), the positive result would be deemed statistically significant.”
Also, until you stated the null I didn’t realize that you were just testing for Ho = 0. That was really what my original question was about.
Jez,
If you have the time would you mind trying the other option. I’d really like to see if it works.
Also, in this instance I don’t think that this was a particularly useful test. My understanding is that you had 2 strategies already optimized in this
test and then you did a simple hypothesis test for 1 mean. You’d expect to reject the null in this case.
The Bootstrap Method for Hypothesis Testing: Can it tell us anything we do not know? | Price Action Lab Blog // Nov 29, 2010 at 7:58 am
The total return from +50% followed by -40% is -10%. Taking the square root (for the geometric average), the average per-period return is -5.1%.
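The arithmetic vs. geometric distinction in this exchange can be verified in a few lines (a quick sketch of the +50%/-40% example):

```python
# Two period returns: +50% then -40%
r1, r2 = 0.50, -0.40

arithmetic_mean = (r1 + r2) / 2            # 0.05: a misleading +5% per period
total_growth = (1 + r1) * (1 + r2)         # 1.5 * 0.6 ≈ 0.90, i.e. -10% overall
geometric_mean = total_growth ** 0.5 - 1   # sqrt(0.9) - 1 ≈ -0.0513 per period
```

A positive arithmetic mean coexists here with a series that actually loses money when compounded: the four-legged-animal point from the post.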
Thanks for mentioning the issue on the rss reader. I’ll try to look into fixing it.
Jez,
Lou is right. There is a logical fallacy in the RC test as suggested by the author.
This is because the resampled distribution is built from the “detrended” returns (after subtracting the mean), but we read the p-value as the fraction of observations from the resamplings that lie above the “initial” mean return. It’s not correct to compare these two, because the initial mean return is not detrended.
Jez
First of all, thanks for a great blog.
Two basic (but important to me, at least) questions on bootstrapping as applied to estimating the distribution of equity CAGR and MaxDD of a
series of back-tested trades P&Ls:
1. Re-sampling with replacement vs. without replacement: I have seen plausible arguments for either approach. Do you have a view?
2. How important is it to resample without destroying (totally or even partially) the original trade P&L series’ degree of randomness (or lack thereof)?
Hope these issues are relevant to the other readers as well.
ZigZag
Great blog, but I have a question about the random re-sampling. Would the random re-sampling totally destroy the correlation structure of the original sample?
Jez Liberty // Sep 17, 2012 at 10:38 am
Yes, it most likely would, but I do not think it is an issue with this approach/usage, as we are not recreating trading signals off that re-sampled data or even calculating time-sensitive stats such as MaxDD, but only using one composite return stat.
This is a great blog. I wonder why anyone has paid attention to the stuff that guy Aronson wrote. He repeats the same things over and over again in his book as if he is talking to elementary school kids or, more seriously, as if he is trying to understand it himself. As some people have noted above, his tests are seriously flawed. Why would anyone want to check: “H0: back-tested rule has no predictive power = arithmetic mean return of the rule is zero”?
In practice this will happen if there is no commission and slippage. But in reality these two can result in serious performance degradation. YOU
DO NOT WANT to test the hypothesis before applying cost to your trading. This is preposterous indeed. Then, why bootstrapping at all?
Bootstrapping returns WILL NOT subject your system to stress from new market conditions, which is the real issue here. Again, the whole
book Aronson wrote promoted some idiosyncratic methods of hypothesis testing that have virtually little to do with real world system trading.
Hello everyone,
1. Does it make sense to subtract the mean and center the distribution at zero? Why not just count the percentage of mean returns above the target return and calculate the p-value accordingly?
2. BTW, how is the sample size determined in re-sampling? Does this have any effect? Or is the trade return order just reshuffled?
Best.