Slides Deep Learning Statistical Arbitrage
Stanford University
Motivation
[Figure: two price panels, Jul 2011 to Jul 2012]
Challenges:
1. Large number of assets with unknown similarities
2. Complex time-series patterns in price deviations
3. Optimal trading rules are complicated and depend on trading objective
Key questions:
1. What is the “best solution” for the three key elements?
2. What matters for statistical arbitrage?
3. How much realistic arbitrage is in the market?
Contribution: Methodology
Contribution: Empirical
Machine learning for asset pricing (explains risk premia, not arbitrage)
• Pricing kernel: Chen, Pelger, and Zhu (2019); Bryzgalova, Pelger, and Zhu (2019)
• Return prediction: Gu, Kelly, and Xiu (2020)
• Factor models: Lettau and Pelger (2020); Kelly, Pruitt, and Su (2019)
Model
Arbitrage portfolios
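Residuals are traded portfolios for the factor-implied matrix $\Phi_{t-1} \in \mathbb{R}^{N_t \times N_t}$:
$$\epsilon_t = R_t - \beta_{t-1}^{\top} F_t = R_t - \beta_{t-1}^{\top} w_{t-1}^{F} R_t = \underbrace{\left(I_{N_t} - \beta_{t-1}^{\top} w_{t-1}^{F}\right)}_{\Phi_{t-1}} R_t .$$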
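The residual identity on this slide can be checked numerically. The sketch below is only an illustration: the array layout (loadings and factor weights both stored as K × N arrays), the sizes, and the random inputs are assumptions made for the example, not values from the slides.

```python
import numpy as np

def residual_matrix(beta, w_f):
    """Phi = I_N - beta^T w^F, with beta and w_f both of shape (K, N)."""
    n_assets = beta.shape[1]
    return np.eye(n_assets) - beta.T @ w_f

rng = np.random.default_rng(0)
K, N = 3, 8                                # hypothetical numbers of factors and assets
beta = rng.normal(size=(K, N))             # loadings beta_{t-1} (assumed layout: K x N)
w_f = rng.normal(size=(K, N))              # factor portfolio weights w^F_{t-1}
returns = rng.normal(scale=0.01, size=N)   # asset returns R_t

factors = w_f @ returns                    # F_t = w^F_{t-1} R_t
phi = residual_matrix(beta, w_f)
residuals = phi @ returns                  # epsilon_t = Phi_{t-1} R_t
```

Because the residuals are a linear map of the asset returns, each residual is itself the return of a tradable portfolio whose weights are a row of Φ.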
Arbitrage Signal and Allocation
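Arbitrage trading has 2 steps given a cumulative residual
$$x := \epsilon^{L}_{t} := \Big(\epsilon_{n,t-L},\ \sum_{l=1}^{2}\epsilon_{n,t-L-1+l},\ \cdots,\ \sum_{l=1}^{L}\epsilon_{n,t-L-1+l}\Big)$$
• The allocation is a threshold rule on the ratio $\dfrac{X_t-\mu}{\sigma/\sqrt{2\kappa}}$.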
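A minimal sketch of how the cumulative residual vector could be built from a residual series for one asset. The function name, the zero-based indexing, and the toy numbers are assumptions for illustration.

```python
import numpy as np

def cumulative_residual(eps, t, L):
    """Cumulative sums of the L residuals eps[t-L], ..., eps[t-1].

    Entry j (0-based) equals eps[t-L] + ... + eps[t-L+j], matching
    x = (eps_{t-L}, sum_{l=1}^{2} eps_{t-L-1+l}, ..., sum_{l=1}^{L} eps_{t-L-1+l}).
    """
    if t - L < 0:
        raise ValueError("not enough history for a lookback window of length L")
    return np.cumsum(eps[t - L:t])

eps = np.array([0.5, -1.0, 2.0, 1.0, -0.5, 3.0])   # toy residual series
x = cumulative_residual(eps, t=5, L=3)             # uses eps[2], eps[3], eps[4]
```

The last entry of `x` is the total residual move over the lookback window, which is the quantity the threshold rule compares against its estimated mean.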
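• In our framework, this corresponds to
$$\theta^{OU}(x) = (\hat\kappa, \hat\mu, \hat\sigma, x_L), \qquad
w^{\epsilon\mid OU}\big(\theta^{OU}\big) =
\begin{cases}
-1, & \text{if } \dfrac{x_L-\hat\mu}{\hat\sigma/\sqrt{2\hat\kappa}} > c_{\text{thres}} \\[2ex]
\phantom{-}1, & \text{if } \dfrac{x_L-\hat\mu}{\hat\sigma/\sqrt{2\hat\kappa}} < -c_{\text{thres}} \\[2ex]
\phantom{-}0, & \text{otherwise}
\end{cases}$$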
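The two steps of the OU benchmark (signal extraction, then threshold allocation) might be sketched as follows. The slides do not say how the OU parameters are estimated; this example assumes the common AR(1) regression with a unit time step, and the threshold value used in the usage lines is a hypothetical choice.

```python
import numpy as np

def ou_signal(x):
    """Fit X_{l+1} = a + b X_l + e by OLS and map to OU parameters.

    Returns (kappa, mu, sigma_eq, x_L), where sigma_eq = sigma / sqrt(2 kappa)
    is the stationary standard deviation, or None if no mean reversion is found.
    """
    x = np.asarray(x, dtype=float)
    b, a = np.polyfit(x[:-1], x[1:], 1)    # slope b, intercept a
    if not 0.0 < b < 1.0:
        return None                        # AR(1) coefficient outside (0, 1)
    resid = x[1:] - (a + b * x[:-1])
    kappa = -np.log(b)                     # mean-reversion speed
    mu = a / (1.0 - b)                     # long-run mean
    sigma_eq = np.sqrt(resid.var() / (1.0 - b ** 2))
    return kappa, mu, sigma_eq, x[-1]

def threshold_allocation(theta, c_thres):
    """Short if the standardized residual is above c_thres, long if below -c_thres."""
    if theta is None:
        return 0.0
    _, mu, sigma_eq, x_last = theta
    z = (x_last - mu) / sigma_eq
    if z > c_thres:
        return -1.0
    if z < -c_thres:
        return 1.0
    return 0.0

# Usage with hand-picked parameters (c_thres = 1.25 is a hypothetical value).
w_short = threshold_allocation((1.0, 0.0, 1.0, 2.0), c_thres=1.25)
w_long = threshold_allocation((1.0, 0.0, 1.0, -2.0), c_thres=1.25)
w_flat = threshold_allocation((1.0, 0.0, 1.0, 0.5), c_thres=1.25)
```

Returning `None` when the fitted coefficient falls outside (0, 1) keeps the rule from trading residuals that show no estimated mean reversion; that guard is a design choice of this sketch.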
Second class: Pre-specified filter with neural network
Our novel model: a data-driven time-series filter based on the most advanced deep learning tools for pattern detection
Convolutional Network Intuition
[Figure panels: (a) Upward trend; (b) Downward trend; (c) Up reversal; (d) Down reversal]
Transformer Network Intuition
Implementation:
• All results are out-of-sample (OOS).
• We use lookback windows of L = 30 days of returns as input for the signal.
• We retrain the functions every half year using rolling windows of 4 years.
• Factor models are estimated OOS daily on a rolling window of 60 days.
• The main analysis uses the Sharpe ratio objective.
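The retraining schedule in the bullets above could be laid out as index windows along the following lines. This is a sketch under assumptions: 252 trading days per year (so 4 years is 1008 days and half a year is 126 days) and the function name are illustrative choices, not details from the slides.

```python
def rolling_schedule(n_days, train_days=4 * 252, step_days=126):
    """Yield (train_start, train_end, test_end) index triples.

    The model is fit on days [train_start, train_end) and then applied
    out-of-sample on days [train_end, test_end) before being retrained.
    """
    start = 0
    while start + train_days < n_days:
        train_end = start + train_days
        test_end = min(train_end + step_days, n_days)
        yield start, train_end, test_end
        start += step_days

windows = list(rolling_schedule(1300))
```

Each out-of-sample block starts where the previous one ended, so every traded day is evaluated exactly once with a model that was fit only on earlier data.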
Arbitrage Portfolios
OOS Annualized Performance
Model     K  | Fama-French          | PCA                  | IPCA
             | SR     µ      σ      | SR     µ      σ      | SR     µ      σ
FFT+FFN   0  | 0.36   4.9%   13.6%  | 0.36   4.9%   13.6%  | 0.36   4.9%   13.6%
FFT+FFN   5  | 1.66   3.1%   1.8%   | 1.98   12.4%  6.3%   | 1.90   7.7%   4.1%
OU+Thres  0  | -0.18  -2.4%  13.3%  | -0.18  -2.4%  13.3%  | -0.18  -2.4%  13.3%
OU+Thres  5  | 0.38   0.9%   2.3%   | 0.73   4.4%   6.1%   | 0.97   3.8%   4.0%
[Figure panels: (g) OU+Thresh, Fama-French 5; (h) OU+Thresh, PCA 5; (i) OU+Thresh, IPCA 5]
Significance of Arbitrage Alphas
CNN+Trans model
   | Fama-French                         | PCA                                 | IPCA
K  | α      t_α     R²     µ      t_µ    | α      t_α     R²     µ      t_µ    | α      t_α     R²     µ      t_µ
0  | 11.6%  6.4***  30.3%  13.7%  6.3*** | 11.6%  6.4***  30.3%  13.7%  6.3*** | 11.6%  6.4***  30.3%  13.7%  6.3***
1  | 7.0%   14***   2.4%   7.2%   14***  | 14.9%  10***   0.6%   15.2%  11***  | 8.1%   12***   9.5%   8.7%   12***
3  | 5.5%   12***   1.2%   5.5%   12***  | 15.8%  14***   1.7%   16.0%  14***  | 8.2%   15***   6.0%   8.6%   15***
5  | 4.5%   12***   2.3%   4.6%   12***  | 14.1%  13***   1.3%   14.3%  13***  | 8.3%   16***   3.9%   8.7%   16***
8  | 3.3%   9.4***  2.1%   3.4%   9.6*** | 12.0%  12***   0.9%   12.2%  12***  | 7.8%   15***   5.0%   8.2%   15***
10 | -      -       -      -      -      | 10.5%  11***   0.7%   10.7%  11***  | 7.7%   15***   4.0%   8.0%   15***
15 | -      -       -      -      -      | 7.5%   8.8***  0.5%   7.6%   8.9*** | 8.1%   16***   4.2%   8.4%   16***
Mean-Variance Objective
• Increase mean return while maintaining the leverage constraint $\|w^{R}_{t-1}\|_1 = 1$
• Here we set risk aversion to γ = 1
• Annual returns of up to 20%, while volatility is only half that of the market
• Slightly lower Sharpe ratios than with the Sharpe ratio objective
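To make the two trading objectives concrete, here is a small sketch. It assumes the mean-variance objective takes the common form "mean minus γ times variance" and that the leverage constraint is a unit L1 norm on the weights; the slides state only γ = 1 and the constraint itself, so the exact functional form and the function names are assumptions.

```python
import numpy as np

def normalize_l1(w):
    """Rescale weights so that the sum of absolute positions equals 1."""
    norm = np.abs(w).sum()
    return w / norm if norm > 0 else w

def sharpe_ratio(r):
    """Mean over standard deviation of a return series (not annualized)."""
    return r.mean() / r.std()

def mean_variance(r, gamma=1.0):
    """Mean return penalized by gamma times the return variance."""
    return r.mean() - gamma * r.var()

w = normalize_l1(np.array([2.0, -1.0, 1.0]))   # -> [0.5, -0.25, 0.25]
r = np.array([0.01, 0.03])                     # toy strategy returns
sr = sharpe_ratio(r)
mv = mean_variance(r, gamma=1.0)
```

With γ = 1 and small daily returns the variance penalty is tiny relative to the mean, which is consistent with the mean-variance strategy taking on more risk for a higher mean return.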
Importance of Time-Series Signal
Additional Results
• Include trading costs for high turnover and large short-selling positions:
$$\text{cost}\big(w^{R}_{t-1}, w^{R}_{t-2}\big) = 0.0005\,\big\|w^{R}_{t-1} - w^{R}_{t-2}\big\|_{L^1} + 0.0001\,\big\|\min\big(w^{R}_{t-1}, 0\big)\big\|_{L^1}$$
  that is, 5 basis points per transaction and 1 basis point per short position
• No market impact, as we only trade in the largest, most liquid stocks
• Lower bound on profitability: less turnover with sparse factors, etc.
⇒ Arbitrage trading retains economic significance in the presence of trading costs
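The cost formula on this slide translates directly into code. The sketch below uses the two rates from the slide as defaults; the function and argument names and the example weights are illustrative assumptions.

```python
import numpy as np

def trading_cost(w_now, w_prev, turnover_rate=0.0005, short_rate=0.0001):
    """Cost of moving from weights w_prev to w_now.

    turnover_rate is charged on the L1 norm of the weight change,
    short_rate on the L1 norm of the short (negative) positions.
    """
    turnover = np.abs(w_now - w_prev).sum()
    short_exposure = np.abs(np.minimum(w_now, 0.0)).sum()
    return turnover_rate * turnover + short_rate * short_exposure

w_prev = np.array([0.5, -0.5])
w_now = np.array([0.25, -0.75])
cost = trading_cost(w_now, w_prev)   # 0.0005 * 0.5 + 0.0001 * 0.75
```

The net return for a period would then be the gross portfolio return minus this cost.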
Turnover and Short Selling
⇒ The effect of trading frictions is time-varying, and our model can exploit particularly
profitable arbitrage time periods by increasing trading and short positions.
Estimated Structure: Dissecting the CNN+Transformer Model with IPCA-5
Examples of Allocation and Returns of CNN+Transformer Strategy
[Figure panels (a)-(h): Basic patterns 1 to 8]
(a) Input residual and attention head weights for $x_l = \sin\!\left(2\pi \frac{l}{30}\right)$
(b) Input residual and attention head weights for $x_l = \sin\!\left(2\pi \frac{l+15}{30}\right)$
[Figure panels: (a) Cumulative residual; (b) Attention weights per head; (c) Average attention weights; (d)-(k) 1st to 8th CNN activations]
CNN+Transformer Model Structure for Representative Residual Over Time
[Figure panels: (a) Cumulative residuals; (b) Average attention weights; (c) Allocation weights; (d)-(g) Attention weights for heads 1 to 4]
• Attention weights of head 4 are highest for down-times in 2009, 2014, and mid-2016.
The head focuses uniformly on the last 10 days of the 30-day window.
• Attention weights of head 3 are highest for up-patterns in 2007, 2010, and 2012.
The head focuses uniformly on the first 20 days of the 30-day window.
• Asymmetric response of the Transformer:
act swiftly during downtrends, stay cautious during uptrends
Variable Importance for Allocation Weight
Conclusion
Methodology:
• Unifying conceptual framework to compare different approaches:
(1) portfolio generation, (2) signal extraction, (3) allocation decision
• Novel deep learning statistical arbitrage:
1. Conditional latent factors to generate arbitrage portfolios
2. CNN+Transformer signal: global dependency pattern with local filters
3. FFN allocation and global trading objective for estimation
Empirical results:
• Comprehensive out-of-sample study on U.S. equities
• CNN+Transformer substantially outperforms benchmark approaches
• Unspanned by conventional risk factors
• Survives realistic transaction and holding costs
• Insights into trading policies: asymmetric trend and reversion patterns
• Trading signal extraction is the most challenging and separating element
Appendix
Firm specific characteristics
Significance of Arbitrage Alphas
CNN+Trans model
   | Fama-French                         | PCA                                 | IPCA
K  | α      t_α     R²     µ      t_µ    | α      t_α     R²     µ      t_µ    | α      t_α     R²     µ      t_µ
0  | 11.6%  6.4***  30.3%  13.7%  6.3*** | 11.6%  6.4***  30.3%  13.7%  6.3*** | 11.6%  6.4***  30.3%  13.7%  6.3***
1  | 7.0%   14***   2.4%   7.2%   14***  | 14.9%  10***   0.6%   15.2%  11***  | 8.1%   12***   9.5%   8.7%   12***
3  | 5.5%   12***   1.2%   5.5%   12***  | 15.8%  14***   1.7%   16.0%  14***  | 8.2%   15***   6.0%   8.6%   15***
5  | 4.5%   12***   2.3%   4.6%   12***  | 14.1%  13***   1.3%   14.3%  13***  | 8.3%   16***   3.9%   8.7%   16***
8  | 3.3%   9.4***  2.1%   3.4%   9.6*** | 12.0%  12***   0.9%   12.2%  12***  | 7.8%   15***   5.0%   8.2%   15***
10 | -      -       -      -      -      | 10.5%  11***   0.7%   10.7%  11***  | 7.7%   15***   4.0%   8.0%   15***
15 | -      -       -      -      -      | 7.5%   8.8***  0.5%   7.6%   8.9*** | 8.1%   16***   4.2%   8.4%   16***
Fourier+FFN model
   | Fama-French                         | PCA                                 | IPCA
K  | α      t_α     R²     µ      t_µ    | α      t_α     R²     µ      t_µ    | α      t_α     R²     µ      t_µ
0  | 2.7%   0.8     8.6%   4.9%   1.4    | 2.7%   0.8     8.6%   4.9%   1.4    | 2.7%   0.8     8.6%   4.9%   1.4
1  | 3.0%   3.3**   3.3%   3.2%   3.5*** | 7.4%   2.7**   3.3%   8.4%   3.1**  | 4.8%   4.0***  16.4%  6.3%   4.8***
3  | 3.2%   4.7***  4.2%   3.5%   5.1*** | 10.9%  6.3***  2.2%   11.2%  6.4*** | 6.8%   6.4***  13.0%  7.8%   6.9***
5  | 2.9%   6.1***  3.5%   3.1%   6.4*** | 12.1%  7.5***  1.5%   12.4%  7.6*** | 6.7%   6.9***  13.3%  7.7%   7.4***
8  | 3.0%   7.2***  3.2%   3.1%   7.4*** | 10.0%  7.5***  0.9%   10.1%  7.6*** | 6.8%   7.0***  13.3%  7.8%   7.5***
10 | -      -       -      -      -      | 8.0%   6.5***  1.0%   8.2%   6.6*** | 6.8%   7.1***  12.7%  7.6%   7.5***
15 | -      -       -      -      -      | 4.7%   4.3***  0.4%   4.8%   4.4*** | 7.1%   7.6***  12.2%  7.9%   8.0***
OU+Thresh model
   | Fama-French                         | PCA                                 | IPCA
K  | α      t_α     R²     µ      t_µ    | α      t_α     R²     µ      t_µ    | α      t_α     R²     µ      t_µ
0  | -4.5%  -1.4    13.4%  -2.4%  -0.7   | -4.5%  -1.4    13.4%  -2.4%  -0.7   | -4.5%  -1.4    13.4%  -2.4%  -0.7
1  | -0.2%  -0.2    13.5%  0.6%   0.6    | 0.7%   0.3     6.3%   2.1%   0.8    | 1.7%   1.4     18.9%  3.0%   2.3*
3  | 0.9%   1.2     10.4%  1.6%   2.1*   | 4.3%   2.5*    4.3%   5.2%   3.0**  | 2.6%   2.6**   18.8%  3.8%   3.4***
5  | 0.5%   0.9     6.8%   0.9%   1.5    | 3.7%   2.4*    3.2%   4.4%   2.8**  | 2.8%   3.0**   17.7%  3.8%   3.8***
8  | 0.6%   1.2     5.5%   1.0%   1.9    | 3.9%   3.0**   1.9%   4.4%   3.4*** | 2.3%   2.6**   17.6%  3.5%   3.6***
10 | -      -       -      -      -      | 2.6%   2.2*    1.4%   2.9%   2.4*   | 2.1%   2.5*    17.6%  3.1%   3.3***
15 | -      -       -      -      -      | 2.1%   2.1*    0.7%   2.4%   2.4*   | 2.3%   2.8**   18.1%  3.2%   3.6***
Significance of Arbitrage Alphas with Mean-Variance Objective
CNN+Trans model
   | Fama-French                         | PCA                                 | IPCA
K  | α      t_α     R²     µ      t_µ    | α      t_α     R²     µ      t_µ    | α      t_α     R²     µ      t_µ
0  | 5.8%   2.2*    19.6%  9.5%   3.2**  | 5.8%   2.2*    19.6%  9.5%   3.2**  | 5.8%   2.2*    19.6%  9.5%   3.2**
1  | 9.9%   12***   7.1%   10.5%  12***  | 26.3%  8.3***  1.6%   27.3%  8.6*** | 14.0%  11***   23.5%  15.9%  11***
3  | 7.5%   11***   5.3%   7.8%   11***  | 22.1%  9.1***  2.2%   22.6%  9.2*** | 16.6%  12***   17.6%  17.9%  12***
5  | 5.7%   11***   5.3%   5.9%   12***  | 19.0%  10***   3.2%   19.6%  11***  | 16.7%  12***   16.0%  18.2%  12***
8  | 4.4%   9.8***  3.6%   4.6%   10***  | 16.3%  10***   1.6%   16.6%  10***  | 15.5%  12***   18.3%  17.0%  12***
10 | -      -       -      -      -      | 14.8%  10***   1.7%   15.3%  10***  | 15.2%  13***   20.6%  16.6%  12***
15 | -      -       -      -      -      | 8.5%   8.4***  0.9%   8.7%   8.5*** | 14.8%  13***   21.6%  16.3%  13***
Fourier+FFN model
   | Fama-French                         | PCA                                 | IPCA
K  | α      t_α     R²     µ      t_µ    | α      t_α     R²     µ      t_µ    | α      t_α     R²     µ      t_µ
0  | 3.2%   0.7     8.4%   5.5%   1.1    | 3.2%   0.7     8.4%   5.5%   1.1    | 3.2%   0.7     8.4%   5.5%   1.1
1  | 2.8%   1.6     1.8%   2.5%   1.5    | 15.4%  1.7     1.3%   16.6%  1.9    | 7.9%   1.8     2.6%   9.7%   2.2*
3  | 4.1%   4.4***  3.4%   4.3%   4.5*** | 30.3%  1.3     0.1%   32.1%  1.3    | 17.4%  4.1***  1.9%   17.6%  4.1***
5  | 2.9%   4.8***  3.1%   3.1%   5.0*** | 21.0%  1.3     0.1%   22.5%  1.4    | 15.9%  4.3***  2.6%   17.0%  4.5***
8  | 3.5%   6.8***  2.3%   3.6%   7.0*** | 17.4%  2.6**   0.3%   17.2%  2.6**  | 12.9%  4.3***  4.4%   14.4%  4.7***
10 | -      -       -      -      -      | 7.1%   1.7     0.3%   7.4%   1.8    | 11.7%  3.9***  3.5%   12.6%  4.1***
15 | -      -       -      -      -      | 5.5%   2.1*    0.1%   5.7%   2.2*   | 11.3%  4.3***  4.0%   12.1%  4.5***
Dependency between Arbitrage Strategies
Importance of Time-Series Signal
Robustness to Rolling Window Size
Mean-Variance Objective
Constant Model without Re-estimation
T_train = 4 years
   | Fama-French         | PCA                 | IPCA
K  | SR    µ      σ      | SR    µ      σ      | SR    µ      σ
0  | 1.10  8.5%   7.8%   | 1.10  8.5%   7.8%   | 1.10  8.5%   7.8%
1  | 1.90  4.5%   2.3%   | 0.44  3.0%   6.9%   | 0.94  3.1%   3.3%
3  | 1.60  3.6%   2.2%   | 1.65  8.7%   5.3%   | 1.82  5.3%   2.9%
5  | 1.81  3.0%   1.7%   | 1.93  9.8%   5.1%   | 2.09  5.4%   2.6%
8  | 1.70  2.5%   1.5%   | 2.04  9.6%   4.7%   | 1.89  5.0%   2.6%
10 | -     -      -      | 2.06  9.1%   4.4%   | 1.77  4.7%   2.7%
15 | -     -      -      | 1.82  7.0%   3.9%   | 2.09  5.5%   2.7%
T_train = 8 years
   | Fama-French         | PCA                 | IPCA
K  | SR    µ      σ      | SR    µ      σ      | SR    µ      σ
0  | 1.33  12.0%  9.0%   | 1.33  12.0%  9.0%   | 1.33  12.0%  9.0%
1  | 2.06  5.0%   2.4%   | 1.81  15.2%  8.4%   | 2.02  8.5%   4.2%
3  | 2.46  5.3%   2.2%   | 2.04  13.1%  6.4%   | 2.47  7.5%   3.0%
5  | 1.82  3.2%   1.8%   | 1.91  11.9%  6.2%   | 2.64  7.6%   2.9%
8  | 1.48  2.5%   1.7%   | 1.89  10.8%  5.7%   | 2.71  8.3%   3.1%
10 | -     -      -      | 1.82  10.0%  5.5%   | 2.68  8.2%   3.1%
15 | -     -      -      | 1.38  6.2%   4.5%   | 2.70  7.8%   2.9%
Empirical example: (1) OU+Threshold signals & allocation weights
Empirical example: (2) Fourier+FFN signals & allocation weights
Simulation example: (3) CNN+Transformer signals & allocation weights
Fourier+FFN architecture
FFN equations:
Convolutional network equations
Given
$$\mu_k^{(i)} = \frac{1}{L}\sum_{l=1}^{L} y_{l,k}^{(i)}, \qquad
\sigma_k^{(i)} = \sqrt{\frac{1}{L}\sum_{l=1}^{L}\left(y_{l,k}^{(i)} - \mu_k^{(i)}\right)^{2}} .$$
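The per-channel mean and standard deviation over the L time steps can be computed as in the sketch below. The array layout (rows are time steps l, columns are channels k) and the toy values are assumptions; how the network uses these statistics afterwards is not shown in the surviving slide text, so the code stops at the statistics themselves.

```python
import numpy as np

def channel_stats(y):
    """Per-channel mean and standard deviation over the time axis.

    y has shape (L, D): L time steps and D channels (assumed layout).
    The standard deviation uses the 1/L normalization from the formula.
    """
    mu = y.mean(axis=0)
    sigma = np.sqrt(((y - mu) ** 2).mean(axis=0))
    return mu, sigma

y = np.array([[1.0, 2.0],
              [3.0, 6.0]])     # toy layer output with L = 2, D = 2
mu, sigma = channel_stats(y)
```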
• Features $\tilde{x} \in \mathbb{R}^{L \times F}$ are projected onto $h$ subspaces (“heads”), indexed by $i = 1, \dots, h$, each of dimension $F/h$:
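As an illustration of projecting features onto h heads, the following sketch implements generic scaled dot-product self-attention in NumPy. It is the textbook formulation with random projection matrices, not necessarily the exact equations of the slides; the names `w_q`, `w_k`, `w_v` and the sizes L = 30, F = 8, h = 4 are assumptions for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v):
    """x: (L, F); w_q, w_k, w_v: (h, F, F // h).

    Returns per-head outputs of shape (h, L, F // h) and
    attention weights of shape (h, L, L), each row summing to 1.
    """
    q = np.einsum("lf,hfd->hld", x, w_q)   # project onto each head's subspace
    k = np.einsum("lf,hfd->hld", x, w_k)
    v = np.einsum("lf,hfd->hld", x, w_v)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
L, F, h = 30, 8, 4                          # hypothetical sizes
x_tilde = rng.normal(size=(L, F))
w_q, w_k, w_v = (rng.normal(size=(h, F, F // h)) for _ in range(3))
heads, attn = multi_head_attention(x_tilde, w_q, w_k, w_v)
```

The `attn` array holds one L × L weight matrix per head, which is the kind of object the attention-weight figures in the deck visualize.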
Hyperparameter information