Trading Pairs With Excel Python by Anjana Gupta
Anjana Gupta
Contents
Title Page
Disclaimer
Preface
Pair Trading: Introduction
Basics of Python
Fetching historical data
Basics of Statistics
MEAN
HISTOGRAM
PROBABILITY
Standard Deviation & Variance
Bollinger Band
Correlation
LINEAR REGRESSION
Stationary Time Series
Z-Score
Pair trade backtesting setup in Google Spreadsheet
Pair trade backtesting in Python
Machine Learning for Pair Trading
Python Codes @ GITHUB
About Authors / Acknowledgments
Books By This Author
Disclaimer
When you start trading independently, fear of loss is natural. You should not depend on immediate trading profits to sustain your daily living; if you do, then trading with your own money is not for you, and you would be better off as a salaried proprietary trader with a broker. Trading with your own money is for those who have surplus funds and some other source of income for daily living. In India hundreds of professional traders earn their bread and butter from trading, but most of them do arbitrage from co-location, which gives them a somewhat fixed income. Arbitrage is not risk free either, but these professional traders have provisions for losses. Until 2008, when there was no algo trading in India, hundreds of manual jobbers earned their bread and butter from manual arbitrage. When algo trading came, the manual jobbers who had not updated themselves became dealers, joined back-office or operations departments, or left the market altogether. Those who upgraded themselves with the latest knowledge of algos are still in the game.
You must have heard about diversification. Apply this rule to your income as well: diversify your sources of income. My suggestion for professional traders is to develop some secondary source of income apart from full-time trading, for example buying a commercial property and earning fixed rental income. Diversify your trading strategies too. Work on many strategies so that you can remain profitable even if some of them give you losses. Nothing is permanent in the capital markets, so you need to keep updating yourself with new knowledge and skills. When Excel came you learned it for the first time, and today it is part of daily life; in the same way you can make yourself familiar with Python. If you keep using it regularly you will be able to do many things that are not possible in Excel. In Python you can analyze thousands of data points in a few seconds. Today statistical trading is not easy without software like Python. If you want to go to the next level of trading, then knowledge of a language like Python is essential.
If you do not know Python or programming, then I suggest you read my first book, 'Option Greeks, Strategies & Backtesting in Python', available on Amazon. The first book covers derivatives, Option Greeks, option strategies, basics of Python, how to fetch past data in Python, and backtesting of option strategies on past data. It covers options trading in detail and also explains why trading options is better than trading a naked futures position.
This is the second book of the series, written for individual traders and investors. With its help, a trader or investor can understand the statistical tools of pair trading and machine learning for pair trading.
Pair Trading: Introduction
Pair trading is a market-neutral trading strategy (meaning market direction doesn't matter) that involves matching a long position with a short position in two highly correlated stocks. Pair trading is a statistical arbitrage strategy based on the mean reversion principle. While it isn't riskless, by understanding how pairs trading works, how you control risk and how you manage profits, it's a great tool to add to your trading arsenal!
A pair trading strategy is based on the historical correlation of two securities. Do not rely 100% on statistics and mathematics in trading. If you look for correlation across the top 500 stocks trading on an exchange, you may find correlated stocks that are not from the same sector or do not have similar market capitalization. It's not a good idea to trade stocks from two different sectors or with very different market capitalizations. Look for stocks from the same sector, with a similar business model and comparable market cap, as they have the highest chance of being co-integrated over a longer time horizon. The same sector and comparable size immunize the pair against unexpected news flow regarding the sector as a whole. Be it negative news or positive, both stocks will hopefully move in the same direction, and this is what is desired in a pair trade.
The securities in a pair trade must have a positive correlation, which is
the primary driver behind the strategy’s profits. A pair trade strategy is best
deployed when a trader identifies a correlation discrepancy. Relying on the
historical notion that the two securities will maintain a specified correlation,
the pairs trade can be deployed when this correlation falters.
To illustrate the potential profit of the pairs trade strategy, consider Stock
A and Stock B, which have a high correlation of 0.95. The two stocks
deviate from their historical trending correlation in the short-term, with a
correlation of 0.75. The arbitrage trader steps in to take a dollar/rupee-matched long position on underperforming Stock A and a short position on outperforming Stock B. The stocks converge and return to their 0.95 correlation over time. The trader profits from the long position and the closed short position.
So the basic idea in pair trading is to trade two stocks by studying their historical relationship and spotting an opportunity that has arisen due to a breakdown in the correlation; essentially we are betting that the gap will come back to its original state (called mean reversion). If you are implementing a mean reversion strategy, you are assuming that the mean will remain the same in the future as it has been in the past. But the mean can also change over a period of time, so trading pairs is not a risk-free strategy. The difficulty comes when prices of the two securities begin to drift apart, i.e. the spread begins to trend instead of reverting to the original mean. Dealing with such adverse situations requires strict risk management rules, which have the trader exit an unprofitable trade as soon as the original setup (a bet for reversion to the mean) has been invalidated. So pair trading is a market-neutral strategy only to an extent. The mean can change; hence please do not be under the impression that pair trading is a 100% market-neutral strategy. You can say this is a trading strategy that seeks to take advantage of price differentials between two related assets.
Therefore, the bulk of the work in pair trading revolves around identifying the relationship among stocks of the same sector, quantifying that relationship, tracking its behavior on a minute, hourly or daily basis, and looking for anomalies in the price behavior. When an anomaly occurs, an opportunity to trade arises. In pair trading you buy the undervalued security and sell the overvalued one; that is why it is also called statistical arbitrage.
All of the above can be done by a machine itself. We will also learn machine learning for pair trading in this book. Let's start with some basics. First you need past data of stocks so that you can quantify the relationship among various stocks of the same sector. You can fetch past data in Excel from a Google spreadsheet, and you can also fetch past data in Python using a Jupyter notebook. I will explain both. I suggest you learn Python if you do not know it, because machine learning tools are not available in Excel.
Basics of Python
First you need to open a Jupyter Notebook in Google Colab in any browser.
Open Google > Search Google Colab > Open Google Colab > click on
File > New notebook.
Click on New notebook and a new notebook will open, which will look like this –
I have given a print command in the 1st cell. You can write and execute commands in cells. Get familiar with the Jupyter Notebook interface; now Jupyter Notebook is your playground. A Jupyter Notebook is an interactive document: you write commands in a predefined format, and on execution you get the results of those commands. It's easy; you can learn the basics of Python through any video available on Google. The good thing with Python is that past data of exchanges and analysis tools are already available. Most professional traders use Python or C++ for development and backtesting of strategies. You do not need to learn the complete programming language; to analyze data the way you want, you only need to learn the basics of programming. In the coming chapters we will learn how to write programs and how Python helps in past data analysis for any strategy. With an understanding of a limited set of commands you can backtest your strategies, so I have used a limited set of commands and functions for better understanding. Let's start with some basics of Python.
First you need to understand that commands must be given in a predefined format; this is called coding. Programming languages understand commands only in their predefined format. Initially you will make lots of mistakes in coding, and gradually you will learn how to write program code.
Open a new Jupyter notebook and type the following in a cell (code written in blue is Python code; a statement written after '#' is an explanation in English, not a part of the code, given just to explain the code).
Then click on the 'Run' button and you will get the results of your command: the notebook will print the values of 'a', 'b', 'c' and 'd' as per your commands. Python uses a simple syntax that looks like written English.
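The cell from the book's screenshot is not reproduced here. A minimal sketch of what such a first cell could look like (the variable names a, b, c and d come from the text; the values themselves are my own illustrative assumptions):

```python
# Assign values to four variables (these values are illustrative)
a = 5            # an integer
b = 2.5          # a float (decimal number)
c = "NIFTY"      # a string
d = a + b        # arithmetic on variables

# Print each variable, as the text describes
print(a)    # 5
print(b)    # 2.5
print(c)    # NIFTY
print(d)    # 7.5
```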
For a trader, one of the basic modules required is NumPy, which is used for scientific computing. If you observe the next screenshot, I have imported the module numpy and given a command to find the minimum value out of 3, 6 and 9, using the np.min() function from the numpy module. The output of the command is 3, so the program itself found the minimum of the 3 values.
In the next cell I have given a command to find the maximum value. The result of this command is 9, so the program itself found the maximum of the 3 values.
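The two screenshot cells described above can be sketched as follows:

```python
import numpy as np

values = [3, 6, 9]
minimum = np.min(values)   # smallest of the three values
maximum = np.max(values)   # largest of the three values
print(minimum)             # 3
print(maximum)             # 9
```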
The math module is built into Python, and NumPy comes preinstalled in environments like Google Colab. But there are many other publicly available modules which you may want to use. To install a module, the following command is used in Python –
!pip install <module name>
The third and most important thing a trader needs is past data. Various options are available to get past data. You can download data from Yahoo Finance through the yfinance module, and the nsepy module can be used to get stock market data for contracts traded on NSE India. You can install these modules through the following commands –
!pip install yfinance
!pip install nsepy
Fourth, you need to understand data in tabular format in Python. A tabular format comprising rows and columns, like an Excel spreadsheet, is called a 'DataFrame' in Python. You can operate on any particular row or column through commands. A trader can fetch past data through yfinance in tabular format; we will learn this in the next chapter. You can also import and export CSV/Excel files in Python and perform various functions on a table just like you do in Excel. Some basic functions required for past data analysis and strategy backtesting are discussed below –
The following command imports data from a file saved on your laptop –
Table = pd.read_csv('filename.csv')
The following commands create a table and save it to a file on your laptop (export data) –
import pandas as pd
data = {'Stocks': ['Reliance', 'Infosys', 'TCS'],
'Price': [1200, 700, 2500]}
Table = pd.DataFrame(data)
Table.to_csv('filename.csv', index=False)
print(Table)
Select the last 2 values of the column 'Price' into a new variable 'last' with the following command –
last = Table['Price'][-2:]
Create a new column 'Amount' by multiplying the 'Price' and 'Quantity' columns –
Table['Amount']=Table['Price']*Table['Quantity']
The following command can be used for the cumulative sum of any column –
Table['Total'] = Table['Amount'].cumsum()
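The snippets above reference a 'Quantity' column that appears only in the book's screenshots. A self-contained sketch tying them together (the Quantity values here are my own illustrative assumptions):

```python
import pandas as pd

# Build the example table (Quantity values are assumed for illustration)
data = {'Stocks': ['Reliance', 'Infosys', 'TCS'],
        'Price': [1200, 700, 2500],
        'Quantity': [10, 20, 5]}
Table = pd.DataFrame(data)

last = Table['Price'][-2:]                            # last two Price values: 700, 2500
Table['Amount'] = Table['Price'] * Table['Quantity']  # new derived column
Table['Total'] = Table['Amount'].cumsum()             # running total of Amount
print(Table)
```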
You can define your own function with def; here the try/except block returns NaN if the computation fails (for example, when a column is missing) –
def my_function(df):
    try:
        return (df['Price']*df['Quantity'])
    except:
        return np.nan
A 'for' loop repeats a command; the following prints the numbers 1 to 4 –
for x in range (1, 5):
    print (x)
The range() function defaults to increment the sequence by 1, however it is
possible to specify the increment value by adding a third parameter:
Lambda - you can create your own small anonymous function through lambda. Here x and y are the parameters and x+y is the operation performed, so when you use xyz in your code you will get the addition of the 2 values as the result.
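A short sketch of both ideas, the range() step parameter and lambda (the function name xyz follows the text; the specific numbers are illustrative):

```python
# range() with a third (step) parameter: start at 1, stop before 10, step by 2
for x in range(1, 10, 2):
    print(x)            # prints 1, 3, 5, 7, 9 on separate lines

# lambda: a small anonymous function; xyz adds its two parameters
xyz = lambda x, y: x + y
print(xyz(2, 3))        # 5
```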
Now you have a basic understanding of Python. Codes to compute the profit and loss of a few strategies on past data are given in the coming chapters. The same codes can be used to compute the profit and loss of a strategy for any contract and any period. You just need to download the code files from the link given at the end of this book.
Free Resources
This book is written for pair trading, not for Python; still, I have explained some basics of Python required for a trader. If you do not understand Python then I suggest you learn some basics of Python first. You can also enroll for a free basic course on Python through the following link –
https://quantra.quantinsti.com/course/python-trading-basic
Fetching historical data
Past Data on Google Spreadsheet
The following codes can be used to fetch past data. First you need to install the required libraries. Past data can be fetched from yfinance, or nsepy can be used to fetch data of NSE India. In the following example I am fetching historical prices of 'HDFC', trading on NSE India, from Yahoo Finance.
!pip install yfinance
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import yfinance as yf
data = yf.download('HDFC.NS', start="2020-01-01", end="2020-12-31")
Per-minute historical data through yfinance
You can also download 1-minute historical data of roughly the last week from yfinance, using the interval parameter of yf.download.
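The book's code cell for this is not reproduced. A hedged sketch of such a download (the ticker is an example; the block falls back to synthetic one-minute bars when yfinance or a network connection is unavailable, so the example still runs offline):

```python
import numpy as np
import pandas as pd

try:
    import yfinance as yf
    # yfinance restricts 1-minute bars to roughly the last few days per request
    minute_data = yf.download('HDFC.NS', interval='1m', period='5d')
    if minute_data.empty:
        raise ValueError("empty download")
except Exception:
    # Offline fallback: synthetic one-minute bars (assumption, not real data)
    idx = pd.date_range('2020-10-01 09:15', periods=375, freq='min')
    rng = np.random.default_rng(0)
    minute_data = pd.DataFrame(
        {'Close': 2400 + np.cumsum(rng.standard_normal(375))}, index=idx)

print(minute_data['Close'].head())
```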
You need to install the package yahoo_fin to get live stock quotes using web-scraping tools; it can fetch the real-time stock price during market hours.
Quandl is a platform that provides economic, financial and alternative datasets. According to Quandl, it has over 400,000 users, ranging from the world's top hedge funds to investment banks and various asset managers. Quandl wants to inspire customers to make new discoveries and incorporate them into trading strategies. They believe there are new and better ways to understand the complex information that creates markets and market movement, and that data, alternative data in particular, is going to become the primary driver of active investment performance over the next decade.
Quandl offers both free and premium products. The Quandl API is free to
use and grants access to all free datasets. Quandl users only pay to access
Quandl’s premium data products. In this book we will use free data available
on quandl.
The best thing with Quandl is that data is free, transparent, easy to find and
cleaned.
First you need to create an account with Quandl, and you will get an API key. After that you need to install the quandl library with the following command –
!pip install quandl
The following codes can help you fetch historical data. In this example I am fetching WTI Crude oil prices from CME –
from datetime import datetime
import matplotlib.pyplot as plt
import quandl
start = datetime(2020, 1, 1)
end = datetime(2020, 12, 31)
df = quandl.get('CHRIS/CME_CL1', start_date=start, end_date=end, qopts={'columns': ['Settle']}, authtoken='insert the key you received on registration with Quandl')
plt.figure(figsize=(20,10))
plt.plot(df.Settle)
Further you can learn from Quandl website itself about how to fetch data in
python –
https://www.quandl.com/tools/python
Basics of Statistics
MEAN
Mean is the average of all numbers: the sum of all numbers divided by the count of numbers.
For example, a stock has given returns of 5%, 8%, 15%, 2% and 10% in the last 5 years. What is the average return?
Average return = (5+8+15+2+10) / 5 = 8
So the mean value is 8; you can say the stock has given an average return of 8% in the last 5 years.
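The same arithmetic can be done in one line with NumPy:

```python
import numpy as np

returns = [5, 8, 15, 2, 10]        # yearly returns in percent
average_return = np.mean(returns)  # (5 + 8 + 15 + 2 + 10) / 5
print(average_return)              # 8.0
```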
Let's compute the mean value in Google Sheets and Google Colab. In the following example we compute the moving average of the HDFC data we fetched in Google Sheets and Google Colab. We compute the 20-day moving average of the close price of S.No. 1 to 20 in cell D28, as shown in the following screenshot, then the 20-day moving average of the close price of S.No. 2 to 21 in cell D29, and so on. A moving average is a series of averages of different subsets of the full data set. The average is taken over a specific period of time, for example 30 minutes, 30 days, 50 days etc.
You can plot the chart of the Close price and the moving average we have computed with the following command –
data[['Close', 'Moving_average']].plot()
You will get the following output.
In the above chart the blue line is the close price and the orange line is the 20-day moving average. A moving average crossover is used to generate buy or sell signals: if the short-period moving average line is above the long-period moving average it is a buy signal, and if the short-period line is below the long-period line it is a sell signal. This works in a trending market, but in a range-bound market this strategy may give losses; there, a mean reversal strategy will give you profit.
Stock = "RELIANCE.NS"
data = yf.download(Stock, start="2020-01-01", end="2020-10-31")
T = pd.DataFrame({"Close": data["Close"]})
SMA=10
LMA=30
# The rolling means and position logic below are reconstructed from the
# surrounding text (long when the short MA is above the long MA)
T['SMA_line'] = T.Close.rolling(SMA).mean()
T['LMA_line'] = T.Close.rolling(LMA).mean()
T['positions_long'] = np.nan
T.loc[T.SMA_line > T.LMA_line, 'positions_long'] = 1
T.loc[T.SMA_line < T.LMA_line, 'positions_long'] = 0
T.positions_long = T.positions_long.fillna(method='ffill')
print(T)
If you observe the above output you will find that Reliance was trading at Rs 1509 on 1st January 2020 and at Rs 2054 on 30th October 2020. An investor who bought on 1st January 2020 and sold on 30th October 2020 earned Rs 545 per share. However, a trader trading the 10-day and 30-day moving average crossover earned Rs 747 per share.
Which moving averages to use for trading is a very subjective decision. However, with the help of Python we can backtest the return given by a stock for various moving averages over various periods. Backtesting this on a Google spreadsheet or Excel would be a very time-consuming activity, so let's do it with Python. I have written the following code, in which one can define the stock, the years, and the short and long moving averages to get the returns. With it I am computing the yearly return given by Reliance Industries during the years 2016 to 2020 for different combinations of moving average crossovers from 1 to 35. The trader buys when the short moving average crosses the long moving average from below and sells when the short moving average crosses the long moving average from above.
If you observe the following code you will notice that I have used 3 'for' loops: the 1st for the years, the 2nd for the short moving average and the 3rd for the long moving average. A pivot table is created with the values of 'cumpnl_long'. The field 'cumpnl_long' gives the per-share return earned by holding a long position whenever the short moving average is above the long moving average.
Python Code -
T3 = pd.DataFrame({"Close": data["Close"]})
T3['Year'] = T3.index.year
T2 = pd.DataFrame({"cumpnl_long":['0'], "SMA":['0'], "LMA":['0'], "Year":['0']})
# The loop headers and rolling means below are reconstructed from the
# description in the text (3 'for' loops; crossovers from 1 to 35)
for z in range(2016, 2021, 1):
    T = T3.where(T3.Year == z)
    T = T.dropna()
    for a in range(1, 36, 1):
        for b in range(1, 36, 1):
            SMA = a
            LMA = b
            T['SMA_line'] = T.Close.rolling(SMA).mean()
            T['LMA_line'] = T.Close.rolling(LMA).mean()
            T['positions_long'] = np.nan
            T.loc[T.SMA_line > T.LMA_line, 'positions_long'] = 1
            T.loc[T.SMA_line < T.LMA_line, 'positions_long'] = 0
            T.positions_long = T.positions_long.fillna(method='ffill')
            T['price_difference']= T.Close - T.Close.shift(1)
            T['pnllong'] = T.positions_long.shift(1) * T.price_difference
            T['cumpnl_long'] = T.pnllong.cumsum()
            T1 = T[['cumpnl_long']].tail(1)
            T1['SMA'] = SMA
            T1['LMA'] = LMA
            T1['Year'] = z
            T2 = T2.append(T1)
In the above Excel output you will observe that the moving average crossover of 1 day and 5 days gives a consistent return year after year (SMA denotes the short moving average and LMA the long moving average). So technically you can say Reliance is a buy when trading above its 5-day moving average. But this results in many trades, and a cost is associated with every trade. You will also observe that the moving average crossover of 1 day and 29 days gives a consistent return; it means that if Reliance is trading above its 29-day moving average, it's a buy.
One more thing you will notice is that in the year 2020 all the moving averages gave very good returns because prices were trending. We saw a rollercoaster ride in 2020, from Nifty 12000 to 8000 and back to 12000. In a trending market a moving average crossover gives good returns; in a range-bound market mean reversal strategies give good returns.
As we computed returns for Reliance, in the same way you can compute the moving average return given by any stock on past data for any combination of moving averages.
HISTOGRAM
A histogram is a graphical display of data using bars. It is similar to a bar chart, but in a histogram each bar shows how many values fall into each range. Let's take an example: we have 9 months of HDFC prices in the table data.
Let's compute how many times HDFC closed above its previous close and how many times it closed below. We can use the following commands to compute daily returns and plot the histogram –
data['return'] = data['Close'].pct_change()
plt.hist(data['return'], bins=[-1,0,1])
We will get the following output-
In the year 2020, out of 207 trading days, HDFC closed below the previous day's closing price on 106 days and above it on 101 days.
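The same up-day/down-day counting can be sketched without plotting, on a small synthetic close-price series (the prices below are my own illustrative assumptions, not actual HDFC data):

```python
import pandas as pd

# A small synthetic close-price series (illustrative only)
close = pd.Series([100, 102, 101, 103, 103.5, 102, 104])
daily_return = close.pct_change()       # first value is NaN (no previous close)

up_days = (daily_return > 0).sum()      # closes above the previous close
down_days = (daily_return < 0).sum()    # closes below the previous close
print(up_days, down_days)               # 4 2
```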
PROBABILITY
Probability is how likely something is to happen. Many events can't be predicted with total certainty; the best we can say is how likely they are to happen, using the idea of probability. So probability helps us make decisions by quantifying uncertainty.
# Computation of returns
data = yf.download('HDFC.NS', start="2016-01-01", end="2020-10-31")
Out of 208 trading days, on 145 days HDFC closed above the previous day's close when it opened above the previous day's close, or closed below the previous day's close when it opened below it.
Probability = 145/208 * 100 = 69.71%
It means the 2020 data says there is a 69.71% probability that HDFC will close above the previous day's close price if it opens above it, and close below the previous day's close price if it opens below it.
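The conditional count described above can be sketched in pandas. This is my own illustrative version on a tiny synthetic Open/Close table (the prices are assumptions), not the book's exact program:

```python
import pandas as pd

# Synthetic daily data (illustrative): Open and Close for a handful of days
df = pd.DataFrame({
    'Open':  [100, 103, 101, 105, 107],
    'Close': [102, 104, 100, 106, 103],
})
prev_close = df['Close'].shift(1)

# Day counts as "same side" when it opened above the previous close AND
# closed above it, or opened below the previous close AND closed below it
same_side = ((df['Open'] > prev_close) & (df['Close'] > prev_close)) | \
            ((df['Open'] < prev_close) & (df['Close'] < prev_close))
valid = prev_close.notna()              # the first day has no previous close

probability = same_side[valid].sum() / valid.sum() * 100
print(round(probability, 2))            # 75.0 (3 of the 4 valid days)
```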
You can download the data to Excel through the following command, to understand how this program computes the probability –
data.to_csv("HDFC2020.csv", index=True, encoding='utf8')
You need to write two additional commands if you are working in Google Colab and wish to download the data to your laptop.
from google.colab import files
files.download('HDFC2020.csv')
Sample data of the CSV file generated through the above program code is given below. If you observe the data of 6th Jan 2020, HDFC opened at 2428, lower than the previous day's close price of 2454, and the same day HDFC closed at 2384, also lower than the previous day's close. In the same way, on 7th Jan 2020 HDFC opened at 2401 and closed at 2415, both prices higher than the previous day's closing of 2384.
Standard Deviation & Variance
The mean value of both samples is 100, but the 1st sample has a standard deviation of 1 and the second sample has a standard deviation of 10. So the values in sample 2 are more widely dispersed from the mean value compared to sample 1.
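The book's sample tables are not reproduced here, so here are two hypothetical samples of my own with the same properties (mean 100; sample standard deviations of 1 and 10):

```python
import numpy as np

sample1 = [99, 100, 101]     # tightly clustered around the mean
sample2 = [90, 100, 110]     # widely dispersed around the mean

# ddof=1 gives the sample standard deviation
print(np.mean(sample1), np.std(sample1, ddof=1))   # 100.0 1.0
print(np.mean(sample2), np.std(sample2, ddof=1))   # 100.0 10.0
```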
For traders it’s important to understand the probabilities associated with
certain multiples of standard deviations:
- 1 standard deviation includes approximately 68.2% of outcomes in a distribution of occurrences
- 2 standard deviations include approximately 95.4% of outcomes
- 3 standard deviations include approximately 99.7% of outcomes
One standard deviation covers 68.2% of the values. It means there is a 68.2% probability that the next value will be within a range of +/- 1 standard deviation from the mean value.
The volatility of a stock is synonymous with one standard deviation of daily returns (annualized) in that stock. You can check the volatility of any futures contract on the exchange website.
For NSE > Open stock quotes > Derivatives > Nifty Future
As you can see in the above image, the daily volatility of Nifty Future is 1.85% and the annualized volatility is 35.3%. The Nifty closing price is 11122. Given this information, you can predict the likely range within which Nifty will trade 1 year from now –
Upper range = 11122 + 35.3% of 11122 = 11122 + 3926 = 15048
Lower range = 11122 – 35.3% of 11122 = 11122 – 3926 = 7196
Statistically speaking, there is 68% probability that Nifty will remain in
the range of 7200 to 15000 for next 1 year.
In the same way you can compute the monthly range. You have a daily volatility of 1.85%, and there are 30 days to expiry.
30-day standard deviation = 1.85 multiplied by the square root of 30 = 1.85 * 5.48 = 10.13%
Upper range = 11122 + 10.13% of 11122 = 11122 + 1127 = 12249
Lower range = 11122 – 10.13% of 11122 = 11122 – 1127 = 9995
The data suggests that there is a 68% probability that Nifty will trade anywhere in the range of 9995 to 12249 over the next 1 month.
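The square-root-of-time arithmetic above can be reproduced in a few lines (the spot and volatility figures are the ones quoted from the exchange screenshot):

```python
import math

spot = 11122
annual_vol = 35.3 / 100    # annualized volatility from the exchange
daily_vol = 1.85 / 100     # daily volatility
days = 30

# 1-year range at one standard deviation (~68% probability)
upper_1y = spot * (1 + annual_vol)
lower_1y = spot * (1 - annual_vol)

# Scale daily volatility to 30 days with the square root of time
vol_30d = daily_vol * math.sqrt(days)
upper_30d = spot * (1 + vol_30d)
lower_30d = spot * (1 - vol_30d)

print(round(upper_1y), round(lower_1y))    # 15048 7196
print(round(upper_30d), round(lower_30d))  # 12249 9995
```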
Bollinger Band
Standard deviation works best on normally distributed data. However, the closing prices of the stock you are trading may not be normally distributed. You can easily see that by running the code below –
data['Close'].hist(bins=100, figsize=(8, 6))
You will get the following output. Statistically, 68% of values should remain within the range of +/- one standard deviation from the mean, so if the price touches the upper or lower band it should come back to the mean value. But the mean value itself moves up or down with prices. That is why reversion to the mean holds in a range-bound market but not in a trending market: in a trending market, a price touching the upper or lower band may be an upside or downside breakout respectively. You can also observe this in the following chart: from January to April, when the price was trending, prices did not revert to the mean value; from April to October, when the stock was range bound, prices did revert to the mean value.
Trading with Bollinger Bands -
Some investors trade with the help of Bollinger Bands. Again, which moving average to use for computing the upper and lower bands is a very subjective decision, and it may depend on the contract you are trading. Let's backtest the returns given by Bollinger Bands with the help of historical data. Backtesting on a Google spreadsheet or Excel would be very time consuming, so again let's do it with Python code you can easily run on Google Colab. I have written the following code, in which one can define the stock, the years, the moving average period (used to compute both the moving average and the standard deviation) and the standard deviation multiplier, to get the returns. With it I am computing the yearly return given by Reliance Industries during the years 2016 to 2020 for moving average periods from 1 to 35 and different multipliers of the standard deviation. The basic idea is to buy the stock when the price goes below the lower band, in the hope that it will come back to the mean value, and to sell the stock when the price goes above the upper band, in the hope that it will come back to the mean value.
Following are the python codes –
!pip install yfinance
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import yfinance as yf
Stock = "RELIANCE.NS"
data = yf.download(Stock, start="2016-01-01", end="2020-10-31")
T3 = pd.DataFrame({"Close": data["Close"]})
T3['Year'] = T3.index.year
T2 = pd.DataFrame({"cumpnl":['0'], "MA":['0'], "STD":['0'], "Stock":['0'], "Year":['0']})
for z in range(2016, 2021, 1):
    T = T3.where(T3.Year == z)
    T = T.dropna()
    for a in range(1, 37, 2):
        for b in range(1, 3, 1):
            MA = a
            STD = b
            T['moving_average'] = T.Close.rolling(MA).mean()
            T['moving_std_dev'] = T.Close.rolling(MA).std()
            T['upper_band'] = T.moving_average + (T.moving_std_dev*STD)
            T['lower_band'] = T.moving_average - (T.moving_std_dev*STD)
            T['positions_long'] = np.nan
            T['positions_short'] = np.nan
However, in a trending market you can do just the opposite: buy when the price goes above the upper band and sell when it goes below the lower band. If anyone had traded Reliance this way, with both bands created from a 3-day moving average at a distance of 1 standard deviation, then the profit was 329, 516, 282 and 518 points from 2017 to 2020 respectively, as per the following screenshot.
Let's try this strategy on a range-bound stock. The following is the output when we tried it on historical prices of NTPC. It can be observed that a 4-day moving average with bands at a distance of 1 standard deviation is consistently profitable if the trader buys when the price goes below the lower band and sells when the price goes above the upper band.
Correlation
Correlation is a statistical technique that can show whether and how
strongly pairs of variables are related. When two sets of data are strongly
linked together we say they have a High Correlation. Correlation
is Positive when the values increase together, and Correlation
is Negative when one value decreases as the other increases.
Correlation is quantified by the correlation coefficient ρ, which ranges from -1 to +1 and indicates the degree of correlation between the two variables. A value of +1 means there is a perfect positive correlation between the two variables, -1 means there is a perfect negative correlation, and 0 means there is no correlation.
ρ = cov(X, Y) / (SD(X) * SD(Y))
where cov(X, Y) is the covariance between X and Y, while SD(X) and SD(Y) denote the standard deviations of the respective variables.
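NumPy computes this coefficient directly with np.corrcoef. A small sketch on two illustrative price series of my own (one roughly twice the other, so they are strongly positively correlated):

```python
import numpy as np

# Two illustrative price series: y moves together with x
x = np.array([100, 102, 101, 105, 107, 110])
y = np.array([200, 205, 202, 211, 214, 220])

# corrcoef returns the 2x2 correlation matrix; [0, 1] is the coefficient
rho = np.corrcoef(x, y)[0, 1]
print(round(rho, 4))   # close to +1: a strong positive correlation
```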
A trader can also plot a chart of both data series and observe their price behavior. One of the best ways to visually depict the relationship between two variables is a scatter plot. We can use the following code for a scatter diagram –
from matplotlib import pyplot
pyplot.scatter(data['Close'], data1['Close'])
pyplot.show()
In this plot, the dots represent the different points, with HDFC Bank prices on the Y axis and HDFC prices on the X axis. As most points lie close to a straight line, it is an indication that a linear relationship might indeed exist between HDFC and HDFC Bank.
We can use the following code to plot the price charts of both stocks on different axes –
fig,ax = plt.subplots(figsize=(18,6))
ax.plot(data['Close'], color="red")
ax2=ax.twinx()
ax2.plot(data1['Close'],color="blue")
plt.show()
The output of the above code is given in the next screenshot. Observe the chart: HDFC prices are the red line (left axis) and HDFC Bank prices are the blue line (right axis).
LINEAR REGRESSION
As pair traders we want to forecast the future prices of the stocks we are observing with a reasonable degree of confidence. This brings us to regression.
Y = a + bX + e
Now you will find the Regression tool under Data > Data Analysis > Regression, as shown in the following screenshot –
We fetched past data from Google Finance in a Google spreadsheet and downloaded this data to Excel. Now we can run the 'Regression' function on this data.
We will get the following output –
In this table, we can see the estimated values for the intercept and the slope.
Thus, the model is: (Price of HDFC) = 394.89 + 1.455 * (Price of HDFC Bank)
A data-driven trader can thus decide to use the price of HDFC Bank as an important variable while designing a strategy to trade HDFC. The value of 'b' is also called the beta value. The value of 'b', 1.455, can be used as the hedge ratio for pair trading: it means that if a trader is buying 10 shares of HDFC then he should sell about 15 shares of HDFC Bank. I am not in favor of using the beta value as the hedge ratio in pair trading; I will explain my view of the hedge ratio later in this book.
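The same slope and intercept can be estimated in Python with a least-squares fit. A sketch on synthetic data of my own, constructed to follow the fitted model from the text (these are not the book's actual HDFC / HDFC Bank prices):

```python
import numpy as np

# Illustrative prices: stock_a built exactly as intercept + slope * stock_b
stock_b = np.array([1000.0, 1010, 1025, 1040, 1060, 1080])
stock_a = 394.89 + 1.455 * stock_b

# Degree-1 polyfit is ordinary least-squares linear regression
slope, intercept = np.polyfit(stock_b, stock_a, 1)
print(round(slope, 3), round(intercept, 2))   # 1.455 394.89

# The slope is the beta/hedge ratio: shares of B to short per share of A bought
shares_a = 10
shares_b = round(slope * shares_a)            # 15
```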
Stationary Time Series
This is the most important concept for traders to understand. Before going deep into statistical tools you must have an idea of the price behavior of the instruments you are trading. First let us understand this concept.
'A stationary time series is one whose statistical properties such as mean, variance, autocorrelation, etc. are all constant over time.' So you can say that if prices of a stock are range bound then it is stationary data, and if prices are trending upwards or downwards then it is non-stationary data. A time series is stationary if it reverts to its mean value.
Trading using statistical tools includes –
1. Directional Trading
2. Pair Trading
If you observe the following price-difference chart of HDFC and HDFC Bank, you will notice that the difference in prices of the two stocks was approximately 550 at the beginning of 2016, and this difference rose to about 900 by the beginning of 2018. So you can say there was a slight upward trend during this period. From the beginning of 2018 to mid 2019 prices came back towards the mean value. There was a rollercoaster ride in the price difference from mid 2019 to mid 2020. Theoretically speaking, just to illustrate a thought: assume someone created a pair position at 900 somewhere at the beginning of 2020, hoping the difference would come down to the mean value of 800. That trader would have seen an MTM loss of 300 points when the difference went up to 1200 points within a month or so. There was also a risk of the mean value permanently shifting from 800 to 1200. That is why some stop loss is always required in pair trading. I don't have past data analysis in support of my statement, but I believe it is always better to book the loss and exit the position if things move in an adverse direction beyond a certain point. That is why we need a system that can address the above issue. You can address it by taking the rolling mean (moving average) period as low as possible. This will give you quick entries and quick exits. You will not see big mark-to-market losses, but it will increase the cost of trading because buy/sell signals will be more frequent. In the following chart we have taken a fixed mean value that is constant over five years.
Instead of going for the visual test, we can use statistical tests like the ADF (Augmented Dickey-Fuller) test, one of the most popular statistical tests. It can be used to determine the presence of a unit root in the series, and hence helps us understand whether the series is stationary or not. The following commands can be used –
from statsmodels.tsa.stattools import adfuller
adf = adfuller(data['Close'] - data1['Close'], maxlag=1)
print(adf[0])   # test statistic
print(adf[1])   # p-value
print(adf[4])   # critical values at 1%, 5% and 10%
Running the example prints a test statistic of -3.31. The more negative this statistic, the more likely the dataset is stationary. We can see that our statistic of -3.31 is less than the 5% critical value of -2.86. The p-value obtained, 0.0135, is below the significance level of 0.05, which means there is a 98.65% (100 − 1.35) chance that the time series is stationary. This suggests that we can reject the null hypothesis; rejecting the null hypothesis means the time series is stationary.
Z-Score
The Z-score is the number of standard deviations that the pair ratio has
diverged from its mean. For a pair that is co-integrated, you will typically
see the Z-score bounce around 0.
The Z-score is used in pair trading for entry and exit signals. Once the trade is entered, one just waits for the Z-score to mean revert. A trade is executed when:
1. The spread goes below -2 Z-score (z-score < -2): we go long on the pair (buy y, short x) and exit the position when it comes back to -1 Z-score (z-score > -1).
2. The spread goes above +2 Z-score (z-score > 2): we go short on the pair (sell y, buy x) and exit the position when it comes back to +1 Z-score (z-score < +1).
So basically we initiate a trade when the z-score exceeds +2/-2 and exit when it falls back inside +1/-1. A trade can also be initiated when the z-score exceeds +1/-1 (it depends on the instrument you are trading) and exited when it comes back to zero.
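The entry/exit rules above can be sketched as a small function. The thresholds (±2 to enter, ±1 to exit) follow the text; the z-score path is hypothetical:

```python
import numpy as np
import pandas as pd

def signal(z, entry=2.0, exit_=1.0):
    """Map a z-score to a position: +1 long the pair, -1 short, 0 flat."""
    if z < -entry:
        return 1            # spread unusually low -> buy y, short x
    if z > entry:
        return -1           # spread unusually high -> sell y, buy x
    if abs(z) < exit_:
        return 0            # back inside the exit band -> go/stay flat
    return np.nan           # between exit and entry: hold previous position

# hypothetical z-score path of a spread
zscores = pd.Series([0.2, 1.1, 2.4, 1.6, 0.7, -0.3, -2.2, -1.5, -0.6, 0.1])
positions = zscores.apply(signal).ffill().fillna(0)
print(positions.tolist())   # [0.0, 0.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
```

Forward-filling the NaN values implements "hold the previous position" while the z-score sits between the exit and entry bands.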
There would be a lot of drawdown if the pair does not start to mean revert immediately. If you are taking beta as the hedge ratio, it is better to run the linear regression on the two stocks every day and resize the position based on the new beta value for that day. Sell or buy more of the long stock so that the hedge is maintained on a daily basis. This helps reduce the volatility of the portfolio, and drawdown is reduced. Once the Z-score mean reverts, close both trades.
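The daily re-hedging idea can be sketched by re-estimating beta on a rolling window. The prices below are simulated with a true beta of 1.5, and all names and window lengths are illustrative assumptions:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(100, 120, 60))            # hypothetical prices of stock X
rng = np.random.default_rng(0)
y = 1.5 * x + pd.Series(rng.normal(0, 0.5, 60))     # Y tracks X with beta ~ 1.5

window = 20
betas = []
for end in range(window, len(x) + 1):
    # re-run the regression on the trailing window, keep the slope (beta)
    b, _a = np.polyfit(x[end - window:end], y[end - window:end], 1)
    betas.append(b)

latest_beta = betas[-1]
# shares of X to short for every 100 shares of Y held long
qty_x_per_100_y = round(100 * latest_beta)
print(round(latest_beta, 2), qty_x_per_100_y)
```

Each day the hedge leg is resized to the latest beta, which is the daily rebalancing described above.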
Pair trade backtesting setup in Google
Spreadsheet
If two companies are similar and operate in the same sector/country/conditions, their stock prices tend to move together. We check this relation with statistical tools like correlation, the ADF test, etc., as discussed earlier. Any change in the business landscape will affect the stock prices of both companies. If the stock price of one company deviates from the stock price of the other without any event/incident, then on such days the price difference between the two companies deviates. We look for such deviations to identify good trading opportunities. When such a deviation arises, we take a long position in one stock and a short position in the other, then wait for the pair to move back towards the mean value of the price difference. We need to stay long and short the same Rupee/Dollar value. This is also called 'Rupee/Dollar Neutrality': the value of the long position (price multiplied by quantity) should be equal to the value of the short position. Before implementing any strategy we should backtest our logic on past data to check profitability. For a trader the most important thing is profit; while getting into the technicalities of math and stats one should not forget that the ultimate objective is profit. So we will set up a Google spreadsheet to backtest the strategy with different parameters. (You can download this spreadsheet from GitHub; the link is given at the end of this book.)
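Rupee/Dollar neutrality as described above reduces to a small quantity calculation. A sketch with hypothetical prices and capital per leg:

```python
# capital per leg and the two prices are illustrative assumptions
capital = 100000.0
price_long, price_short = 2450.0, 1610.0

qty_long = int(capital // price_long)     # whole shares only
qty_short = int(capital // price_short)

print(qty_long, qty_short)
print("long value:", qty_long * price_long, "short value:", qty_short * price_short)
```

Exact neutrality is rarely possible with whole shares, so the two leg values match only approximately.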
Please refer to the following screenshot. We have taken the daily closing prices of HDFC and HDFC Bank from 1st January 2016 to 30th October 2020 in columns 'C' and 'F' respectively.
=AVERAGE(H11:H15)
=STDEV(H11:H15)
This formula will return 'T' in cell 'N12' if the value in cell 'H12' (Spread) is lower than the value in cell 'L12' (Lower Band); otherwise it will return 'F'. If you observe the following screenshot you will find the value 'T' in cell 'N12' because the spread price of 643.23 is lower than the Lower Band price of 648.56.
The above formula returns 'T' when the Spread value is higher than or equal to the mean value.
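The spreadsheet columns above (rolling AVERAGE and STDEV, the bands and the 'T'/'F' flag) can be reproduced in pandas. The spread values and the 2-standard-deviation band width used here are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Spread': [660.0, 655.0, 658.0, 652.0, 650.0, 643.23, 649.0, 662.0]})
df['moving_average'] = df['Spread'].rolling(5).mean()    # like =AVERAGE(H11:H15)
df['moving_std_dev'] = df['Spread'].rolling(5).std()     # like =STDEV(H11:H15)
df['upper_band'] = df['moving_average'] + 2 * df['moving_std_dev']
df['lower_band'] = df['moving_average'] - 2 * df['moving_std_dev']
# 'T' when the spread closes below the lower band, as in the sheet
df['below_lower'] = (df['Spread'] < df['lower_band']).map({True: 'T', False: 'F'})
print(df.round(2))
```

The first four rows have no band values because a 5-period window needs five observations before it produces a number.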
You will get the following output. You can observe that the price difference between the two stocks increases as prices rise (the difference is plotted on the right axis; for the stock prices refer to the left axis).
We have computed the upper band and lower band. The following Python code will compute the profit or loss of the strategy –
prices_df['positions_short'] = np.nan
prices_df['positions_long'] = np.nan
# entry/exit rules (reconstructed from the band logic described above):
# short the spread above the upper band, long it below the lower band,
# and flatten each leg when the spread crosses back through the mean
prices_df.loc[prices_df.Diff > prices_df.upper_band, 'positions_short'] = -1
prices_df.loc[prices_df.Diff < prices_df.moving_average, 'positions_short'] = 0
prices_df.loc[prices_df.Diff < prices_df.lower_band, 'positions_long'] = 1
prices_df.loc[prices_df.Diff > prices_df.moving_average, 'positions_long'] = 0
prices_df.positions_long = prices_df.positions_long.fillna(method='ffill')
prices_df.positions_short = prices_df.positions_short.fillna(method='ffill')
prices_df['positions'] = prices_df.positions_long + prices_df.positions_short
prices_df['price_difference'] = prices_df.Diff - prices_df.Diff.shift(1)
prices_df['pnl'] = prices_df.positions.shift(1) * prices_df.price_difference
prices_df['cumpnl'] = prices_df.pnl.cumsum()
prices_df[['cumpnl']].plot(figsize=(16,8))
# Calculate the max drawdown over the past 250 days for each day
prices_df['rolling_max'] = prices_df['cumpnl'].rolling(250, min_periods=1).max()
prices_df['daily_drawdown'] = prices_df['cumpnl'] - prices_df['rolling_max']
prices_df['max_daily_drawdown'] = prices_df['daily_drawdown'].rolling(250, min_periods=1).min()
You will get the following output. The maximum drawdown of the strategy is approximately 200 points.
prices_df['Trade'] = prices_df['positions'].diff()
T1 = prices_df.where(prices_df.Trade != 0)
T1 = T1.dropna()
T1.drop(["moving_average", "moving_std_dev", "upper_band", "lower_band", "positions_long", "positions_short", "price_difference", "pnl", "rolling_max", "daily_drawdown", "max_daily_drawdown"], axis=1, inplace=True)
T1['Trade_Return'] = (T1['cumpnl'].diff() / T1['Diff']) * 100
print("Number of trades:", round(len(T1) / 2))
T1[['Trade_Return']].plot(figsize=(16,8))
You will get the following output. The total number of trades done by the strategy was 193. You can observe the return generated by each trade in the following chart: the maximum loss was approximately 15% in a single trade and the maximum return approximately 20% in a single trade.
You can also collect the results for each combination with the following code and download them to Excel for further analysis –
# this block runs inside a loop over stock pairs (abc[x], abc[y]),
# moving-average windows (w) and years (z); results accumulate in T2
T.positions_short = T.positions_short.fillna(method='ffill')
T.positions_long = T.positions_long.fillna(method='ffill')
T['positions'] = T.positions_long + T.positions_short
T['price_difference'] = T.Diff - T.Diff.shift(1)
T['pnl'] = T.positions.shift(1) * T.price_difference
T['cumpnl'] = T.pnl.cumsum()
T1 = T[['cumpnl']].tail(1)
T1['Stock1'] = abc[x]
T1['Stock2'] = abc[y]
T1['Moving_average'] = w
T1['Year'] = z
T2 = T2.append(T1)
You will get the following output from the above program code. If you observe the output you will notice that pair trading between HDFC and HDFCBANK, computed with a 7-day moving average and moving standard deviation, has been consistent every year since 2016.
The pair-trading combination of HDFC and HDFC Bank was profitable with all moving averages most of the time; however, if you check the data in the above table, pair trading of HDFC Bank and ICICI Bank was not very profitable. Past data says that in trading this combination it will be difficult to cover costs.
Machine Learning for Pair Trading
We learned to find pair-trading opportunities between two stocks of the same sector. Many stocks in the same sector are available for trading, and if you try to find pair-trading opportunities among all of them by comparing stocks on a one-to-one basis, it is going to be a very time-consuming task. Now let us build a model based on pair trading where program code can find the most expensive stock to sell and the cheapest one to buy out of a basket of many stocks. Principal component analysis will make our task easy.
Principal Component Analysis (PCA) is an unsupervised statistical technique used to examine the interrelations among a set of variables in order to identify their underlying structure. Suppose we have 100 stocks in the basket; then we have much more complex data, and identifying and predicting a dependent factor against so many independent factors reduces the probability of a correct prediction. That is why it is important to identify strong independent variables. Dimensionality reduction is a technique that allows us to understand the independent variables and their variance, helping to identify the minimum number of independent variables that have the highest variance with respect to the dependent variable. So the dimensionality reduction technique helps us reduce the number of independent variables in a problem by identifying new and more effective ones. It is also known as factor analysis.
In regression, we usually determine the line of best fit to the dataset but
here in the PCA, we determine several orthogonal lines of best fit to the
dataset. Orthogonal means these lines are at a right angle to each other.
Actually, the lines are perpendicular to each other in the n-dimensional
space.
PCA is a method of compressing a lot of data into something that captures the essence of the original data. It takes a dataset with many dimensions and flattens it to 2 or 3 dimensions so we can look at it. PCA creates new variables known as principal components. The 1st principal component tries to explain the direction of the most variation; the 2nd principal component tries to explain the remaining variability. In effect these components capture systematic risk. In our case, we apply PCA to the prices/returns of various stocks. A PCA plot converts the correlations among all of the series into a 2-D graph. We use PCA to find the common risk factors of stock returns, which helps us group pairs accordingly using clustering algorithms.
Python program codes –
!pip install yfinance
from datetime import datetime
import yfinance as yf
from scipy import stats as stats
from scipy.stats import pearsonr
from numpy import mean
from numpy import std
import numpy as np
import pandas as pd
import statsmodels
from statsmodels.tsa.stattools import coint, adfuller
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style="whitegrid")
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from itertools import groupby, count
import statsmodels.api as sm
%matplotlib inline
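Before running the full program, the PCA step itself can be sketched on a small synthetic returns matrix (six stocks driven by one common factor); scikit-learn's PCA is used as in the imports above, and all numbers are made up:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
market = rng.normal(0.0, 0.01, 250)               # one common "market" factor
# six stocks = factor exposure * market + small stock-specific noise
returns = np.column_stack(
    [rng.uniform(0.8, 1.2) * market + rng.normal(0.0, 0.002, 250)
     for _ in range(6)]
)

pca = PCA(n_components=2)
factors = pca.fit_transform(returns)              # 250 x 2 daily factor scores
print("explained variance ratio:", pca.explained_variance_ratio_.round(2))
```

Because one common factor drives all six stocks, the first principal component explains most of the variance, which is what lets us treat it as a shared "market" risk factor when grouping pairs.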
You will get the following output (chart of cumulative profit and loss) –
You will get the following output. As you can observe in the Excel sheet (following screenshot), on 16th Jan 2017 the system generated a buy signal in IndusInd Bank and a sell signal in HDFC. On 18th Jan 2017 a sell signal was generated in Axis Bank, so the trader buys back HDFC and sells Axis Bank. After this the trader holds a long position in IndusInd Bank and a short position in Axis Bank. On the 20th the short position shifted from Axis Bank to ICICI Bank, and so on. At any given point of time the trader holds a long position in one stock and a short position in one stock out of the six stocks taken for pair trading.
The strategy gave a profit of Rs 10,000 on an investment of Rs 10,000 in each stock.
Python Codes @ GITHUB
You can download the Python codes given in this book as Jupyter notebooks from GitHub. The link to the GitHub account is given below -
https://github.com/OptionsnPython/Trading-Pairs
About Authors / Acknowledgments
If you found this book useful, please leave a review on Amazon and recommend it in your network for the benefit of all.
Happy learning.
Anjana Gupta
Books By This Author
1. The first part covers option Greeks - Delta, Gamma, Theta, Vega, Delta hedging & Gamma scalping, and implied volatility, with examples from past closing prices of Nifty/USDINR/stocks (basics of futures and options explained).
3. The third part covers Python for traders. After reading this book even a novice trader will be able to use Python, from installing Anaconda on his laptop and extracting past data to backtesting and developing his own strategies. Python is explained from the very basics so that anyone without an in-depth understanding of programming can understand and develop the code. Many program codes and their results are also explained for backtesting of strategies like ratios, butterflies, etc.
Books By This Author
The book explains multiple trading strategies with Python codes to get you well on the path to becoming a professional systematic trader.
Most of the trading strategies are explained with historical per-minute data of Nifty and Bank Nifty weekly options.