Krish Naik - Hands-On Python For Finance-Packt Publishing (2019)
BIRMINGHAM - MUMBAI
Hands-On Python for Finance
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or
reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the
information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or
its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this
book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this
book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78934-637-4
www.packtpub.com
mapt.io
Mapt is an online digital library that gives you full access to over 5,000
books and videos, as well as industry leading tools to help you plan your
personal development and advance your career. For more information,
please visit our website.
Why subscribe?
Spend less time learning and more time coding with practical eBooks
and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
At www.packt.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters, and receive exclusive discounts and
offers on Packt books and eBooks.
Contributors
About the author
Krish Naik works as a lead data scientist, doing pioneering work in machine learning,
deep learning, and computer vision. He is an artificial intelligence
practitioner, an educator, and a mentor, with over 7 years' experience in the
industry. He also runs a YouTube channel where he explains various topics
on machine learning, deep learning, and AI with many real-world problem
scenarios. He has implemented various complex projects involving complex
financial data with predictive modeling, machine learning, text mining, and
sentiment analysis in the healthcare, retail, and e-commerce domains. He
has delivered over 30 tech talks on data science, machine learning, and AI
at various meet-ups, technical institutions, and community-arranged
forums.
I would like to thank God for helping me and guiding me throughout my book. I would most
importantly like to thank my parents, siblings (Vish Naik and Reena Naik), friends, students, and
colleagues (Deepak Jha and Sudhanshu), who inspired me with their wonderful ideas. Lastly, I would
like to dedicate this book to my Dad. He is, and always has been, the backbone of my life.
About the reviewer
Arunkumar N T holds an MSc (physics) and an MBA (finance), and is
pursuing CMA and CS qualifications. He has over 20 years of corporate experience
and 2 years' experience teaching MBA students. He is an entrepreneur and
has previously worked for Airtel, Citi Finance, ICICI Bank, and many other
companies.
I would like to thank my parents for their support and trust, in spite of repeated failures; my brothers,
Anand and Prabhanjan, for their acknowledgment, love, and support; my wife, Bharathi, for her
critical input; and my kids, Vardhini and Charvangi, for their naughtiness. I would also like to thank
my gurus, the late Prof. Badwe and Prof. Sundararajan, for their constant encouragement and
guidance; and my friends, Dr. Sreepathi B and Anand K.
Packt is searching for authors like
you
If you're interested in becoming an author for Packt, please visit
authors.packtpub.com and apply today. We have worked with thousands of developers and
tech professionals, just like you, to help them share their insight with the
global tech community. You can make a general application, apply for a
specific hot topic that we are recruiting an author for, or submit your own
idea.
Table of Contents
Title Page
Copyright and Credits
Hands-On Python for Finance
About Packt
Why subscribe?
Packt.com
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book 
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
1. Section 1: Introduction to Python for Finance
This book will explain how to code in Python and how to apply these skills
in the world of finance. It is both a programming and a finance book. It will
provide hands-on experience of various data analysis techniques that are
relevant for financial analysis in Python, and of machine learning using
sklearn and various stats libraries.
This book will also deal with lots of projects related to financial data, and
we will see how different machine learning and deep learning algorithms
are applied, and how various information and insights are gained from the
data in order to make predictions.
At the time of writing (2019), this book contains up-to-date Python code,
along with the latest libraries and techniques used for
preprocessing financial data. This book also covers creating models using
the best machine learning and deep learning techniques with open source
software libraries such as TensorFlow and Keras.
Who this book is for
This book targets the following audience:
how to use the different data analysis libraries such as NumPy, pandas, and
matplotlib.
analysis, which comprises methods for analyzing time series data to extract
meaningful statistics and other characteristics of the data. Time series
forecasting is the use of a model to predict future values based on previously
observed values. We will focus on the data preprocessing techniques of time
series data and use that data to make forecasts.
various use cases related to finance using deep learning techniques with the
Keras library.
can apply all the different techniques to do with Python, machine learning,
and deep learning that we have learned throughout this book.
To get the most out of this book
Readers should know the basics of Python and must install the Anaconda
distribution; the installation is explained in the first chapter of the book.
Download the example code files
You can download the example code files for this book from your account
at www.packt.com. If you purchased this book elsewhere, you can visit
www.packt.com/support and register to have the files emailed directly to you.
Once the file is downloaded, please make sure that you unzip or extract the
folder using the latest version of your archive extraction tool.
The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Hands-On-Python-for-Finance. In case there's an update
to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos
available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the
screenshots/diagrams used in this book. You can download it here:
https://www.packtpub.com/sites/default/files/downloads/9781789346374_ColorImages.pdf.
Conventions used
There are a number of text conventions used throughout this book.
Bold: Indicates a new term, an important word, or words that you see
onscreen.
Warnings or important notes appear like this.
General feedback: If you have questions about any aspect of this book,
mention the book title in the subject of your message and email us at
[email protected].
Errata: Although we have taken every care to ensure the accuracy of our
content, mistakes do happen. If you have found a mistake in this book, we
would be grateful if you would report this to us. Please visit
www.packt.com/submit-errata, select your book, click on the Errata Submission Form link,
and enter the details.
Piracy: If you come across any illegal copies of our works in any form on
the Internet, we would be grateful if you would provide us with the location
address or website name. Please contact us at [email protected] with a link
to the material.
Public finance
Corporate finance
Personal finance
Public finance
Public finance involves looking at the role of the government or the
authorities within the financial system. It includes the way in which
governments secure and manage their revenues. It can also be thought of as
a branch of political economy that studies the general economy of
the public sector. Governments finance their expenditure through taxation
(such as direct tax, indirect tax, corporation tax, and so on), the borrowing
of funds by the general public sector, and the printing of cash in accordance
with rules and laws.
Corporate finance
Corporate finance is an essential organ of an enterprise. It involves making
decisions regarding the economics and funding of a company. The terms
corporate finance and corporate financier are associated with the planning, transactions,
and decision-making bodies that are responsible for raising the capital that is
used to create, expand, grow, or acquire an organization. These decisions are
primarily based on the knowledge of numerous stakeholders.
Personal finance
In the 21st century, banking has become a very important part of personal
finance. Banks are the entities that provide the financial
facilities that can help an individual, such as loans, insurance, and savings.
Understanding the stock market
The term stock market is often used interchangeably with the term share market, which is where
shares are traded. The key distinction is that a stock market enables us to
trade a wide variety of financial instruments, such as securities, company
shares, bonds, mutual funds, and derivatives, whereas a share
market only permits the trading of shares.
The stock exchange is the basic platform that provides the facilities to trade
the stocks and securities of different institutions. A stock can only be bought or
sold if it's listed on an exchange. It is where stock buyers and sellers
from all over the world come together.
The following diagram shows the various categories of the stock market:
The primary market is where the trading of shares happens directly between
the company and the investor. The capital amount that is received by the
company after issuing new shares is used for expanding its business plan or
setting up new ventures. In short, a company gets registered in the primary
market to raise funds.
The first sale of shares to the public by a company that hasn't issued shares before is known as
an Initial Public Offering (IPO). The process by which the company
issues new shares in the primary market is called underwriting, and is
carried out by security dealers or underwriters. From a retail investor's
perspective, making an investment within the primary market is the first
step toward buying and selling stocks and shares.
Let's now take a look at the key financial instruments that are exchanged:
Stocks
Bonds
Funds (such as ETFs and mutual funds)
Derivative contracts (such as forwards, futures, options, and swaps)
We will talk more about stocks when we look at some use cases to do with
the forecasting of stock prices in Chapter 5, Portfolio Allocation and
Markowitz Portfolio Optimization, and Chapter 11, Stock Market Analysis
and Forecasting Case Study.
Bonds
Bonds are debt instruments that are simple ways for the government and
companies to borrow money. Governments and organizations can sell bonds
to a large group of investors to raise the funds that are needed for operations
and growth.
For example, let's imagine that company ABC has issued a bond with a
principal amount of $1,000. The maturity period is set to 10 years and the
coupon rate, or the interest rate, is 7% per annum. This means that investors
who buy the bond will pay $1,000 for one bond and will get a return of 7%
per annum for 10 years. At the end of 10 years, the investor will get back
the full principal amount. If the investor wants to sell the bond within the
10-year maturity period, they can do this in the bond market. This is similar
to the stock market, but it is used to buy and sell bonds before their maturity
date. The prices of bonds change frequently, which is primarily due to the
following two factors:
Interest rate risk: This is also known as market rate risk and
refers to the propensity of bond prices to change when market interest rates change. Let's suppose that an
investor has bought a 10-year maturity government bond with a face
value of $1,000, and an interest rate (or coupon rate) of 8%. After two
years, the investor wants to sell the bond, but the interest rate for the
other bonds is around 10%. It will be difficult for the investor to
convince someone to purchase their bond with an interest rate of 8%
when someone can buy new bonds with an interest rate of up to 10%.
In this scenario, the investor will sell the bond at a discounted price,
perhaps for around $885.
An important point to note is that the bond price has an inverse relationship with the
interest rate: when the market interest rate increases, the bond price decreases. The short pricing sketch after this list illustrates this.
Credit risk: This type of risk usually happens when the organization
that has issued the bond is not performing well, so investors fear that
the company may not be able to return the required payments. In this
case, investors are likely to sell bonds at a discounted price to other
investors.
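The following short Python sketch (illustrative, not from the book) prices a bond as the present value of its remaining coupons plus its face value, discounted at the prevailing market rate. Repricing the 8% bond from the interest rate risk example at a 10% market rate gives a price well below the $1,000 face value, close to the ballpark figure quoted above:
# A minimal bond-pricing sketch: present value of the remaining coupons
# plus the face value, discounted at the current market rate.
def bond_price(face_value, coupon_rate, market_rate, years_remaining):
    coupon = face_value * coupon_rate
    price = sum(coupon / (1 + market_rate) ** t
                for t in range(1, years_remaining + 1))
    price += face_value / (1 + market_rate) ** years_remaining
    return price

# The 8% bond from the example, with 8 years left, repriced at a 10% market rate
print(round(bond_price(1000, 0.08, 0.10, 8), 2))   # roughly 893, a discount to par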
Types of funds
There are three major types of funds:
ETFs (exchange-traded funds)
Mutual funds
Hedge funds
These funds vary with regard to their fees, transparency, rules, and
regulations.
ETFs
ETFs are funds that are made up of a basket of assets, such as stocks, bonds, and
commodities. Their holdings are completely public and transparent, and
individuals can buy and trade these kinds of funds. Typically, people
investing in ETFs are interested in having a diverse portfolio, which refers
to the list of stocks and bonds purchased, and want to keep their investment
in an ETF for a longer period of time. The return on investment of an ETF
depends on the rise in the index or the commodity value. One of the most
common ETFs is the SPDR S&P 500 ETF (SPY), commonly known as the Spider, which tracks the S&P 500 index.
Mutual funds
Mutual funds are created by a group of investors coming together and
pooling a certain amount of money to buy stocks, bonds, or both. These
funds are managed by a professional fund manager, who aims to build up a
portfolio in accordance with a particular investment objective. Investments
are often spread across a wide range of different industries, such as IT,
telecommunications, or infrastructure. This ensures that the risk is
controlled, because the price of the different stocks will not move in the
same direction and in the same proportion at the same time.
The most basic structural unit of a mutual fund is the unit. Mutual funds allocate
units to investors based on the amount of money invested. Investors in
mutual funds are usually called unit holders.
The main types of derivative contracts are as follows:
Forward contracts
Futures contracts
Options contracts
Swap contracts
Forward contracts
A forward contract is a customized contract between two parties, where
settlement happens on a particular date in the future at a price agreed upon
today. Forward contracts are not traded on the standard
stock exchange and, as a result, they are not standardized, which makes them
particularly useful for hedging. The primary features of a forward contract
include the following:
A buyer
A seller
A price
An expiry date
Option contracts
An option contract gives the holder the right, but not the
obligation, to buy (a call) or sell (a put) a security or another financial asset at a
predetermined price, called the strike price. A call option gives the buyer the right to
purchase the asset at the strike price. While the holder of the call
option has the right to demand delivery from the seller, the seller (the option
writer) has the obligation to deliver the asset if the option is exercised; if the
buyer chooses to buy the underlying asset, the seller must sell it.
Similarly, a put option gives the buyer the right to sell the asset
to the seller at the strike price. Here, the buyer has the right to sell,
and the seller has the obligation to buy. In every option contract, the
right to exercise the option is vested in the buyer of the
agreement, while the seller bears the corresponding obligation. Because the
seller takes on this obligation, the buyer pays the seller a price for the contract,
called a premium.
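A short sketch (illustrative, not from the book) of the payoff to the option buyer at expiry makes the asymmetry between the buyer's right and the seller's obligation concrete:
# Payoff to the option buyer at expiry (ignoring the premium paid)
def call_payoff(spot, strike):
    return max(spot - strike, 0.0)

def put_payoff(spot, strike):
    return max(strike - spot, 0.0)

# A call struck at 100 pays off when the underlying rises; a put when it falls
print(call_payoff(110.0, 100.0), call_payoff(90.0, 100.0))   # 10.0 0.0
print(put_payoff(90.0, 100.0), put_payoff(110.0, 100.0))     # 10.0 0.0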
Swap contracts
A swap contract is used for the exchange of one cash flow for another set of
future cash flows. Swaps refer to the exchange of one security for another,
based on different factors.
Why use swap contracts? The main advantage of a swap contract is that it allows two parties
to exchange one set of cash flows or obligations for another that suits them better, for
example, exchanging payments based on a fixed interest rate for payments based on a floating rate.
Stock splits
Usually, companies split their stock using a split ratio. The most commonly
used split ratios are 3 to 1, 2 to 1, and 3 to 2, although any other
combination is also possible.
How do stock splits work?
Suppose a share has a face value of $50 and the company wants to split
each share into five shares; this is a 1 to 5 stock split. Each share will be
divided into five shares with a face value of $10. Now, imagine that an
organization chooses a 2 to 1 stock split. This means that the investors get one
extra share for every share they originally held, so they end up with two shares
for each one they owned before. When a stock is split, investors get extra shares;
however, the face value of each share decreases proportionally.
A stock split creates more liquidity in the share market.
After a stock split, the share capital and reserves stay the same, in
absolute dollar terms, as before.
If there are any bonus shares, investors get extra shares of the same
face value in a predecided proportion as an incentive.
The main purpose of issuing bonus shares is to reward investors with a
few additional shares.
When a stock is split, the additional shares are reflected in investors'
demat accounts shortly after the split date, and investors can then trade
these shares as usual.
A summary of stock splits
Splitting a stock doesn't transform the business or its valuation; it simply
increases the number of shares and makes each share worth proportionally less. Investors
receive the same aggregate dividend for their holding, but with less cash coming
from each individual share.
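The arithmetic of a split is simple enough to verify in a few lines of Python (illustrative, not from the book):
# A stock split changes the share count and the per-share price,
# but not the total value of the holding.
def apply_split(shares, price, new_shares_per_old):
    return shares * new_shares_per_old, price / new_shares_per_old

shares, price = 100, 50.0                                   # 100 shares at $50
split_shares, split_price = apply_split(shares, price, 2)   # a 2 to 1 split
print(split_shares, split_price)                            # 200 25.0
print(shares * price == split_shares * split_price)         # True: value unchanged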
Let's think about what actually happens when someone clicks the buy/sell
button. The first thing that happens is that the order gets placed. The order
includes the following information:
Buy or sell: This indicates whether the person wants to buy or sell the
stock.
Symbol: This term usually indicates the code of the stock company
that the user wants to buy or sell (such as AAPL or GOOGL).
Number of shares: This field basically indicates the number of shares
that we want to sell or buy.
Limit or market: These are the types of order. For a limit order, the maximum
(for a buy) or minimum (for a sell) acceptable price is set by the limit price; if
the limit price is not reached in the market, the order usually does not get
executed. Market orders are orders that are intended to execute as quickly as possible
at the present market price.
Price: This is only needed for limit orders. (A short sketch of these fields follows this list.)
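A minimal sketch of these fields as a Python data structure (illustrative, not from the book):
# A minimal representation of the fields an order carries
from dataclasses import dataclass
from typing import Optional

@dataclass
class Order:
    side: str                             # "buy" or "sell"
    symbol: str                           # ticker code, e.g. "AAPL" or "GOOGL"
    quantity: int                         # number of shares
    order_type: str                       # "market" or "limit"
    limit_price: Optional[float] = None   # only needed for limit orders

order = Order(side="buy", symbol="AAPL", quantity=10,
              order_type="limit", limit_price=150.0)
print(order)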
Lately, software engineering has joined forces with the financial industry in
the business of buying and selling financial assets to generate profits.
Financial exchanges have become dominated by computers, and
complex algorithms are in charge of making split-second trading
decisions far faster than any human could.
Python has become a very popular language due to its association with
rapidly growing fields such as data science, machine learning, and deep
learning. Researchers and scientists throughout the world are working on
new and innovative ideas using concepts from statistics that integrate well
with Python. The recent development of deep learning with Python has
caused this language to reach new heights and achieve things that were
previously impossible. A lot of AI products are currently being developed
using Python.
Anaconda has a collection of more than 700 open source packages and it is
available in both free and paid versions. The Anaconda distribution ships
with the conda command-line utility. You can learn more about Anaconda
and conda by reading the Anaconda documentation pages at the following
link: https://anaconda.com/.
Why Anaconda?
The main reasons why we use Anaconda are given here:
It ships with hundreds of the most commonly used data science packages preinstalled.
It includes the conda package and environment manager.
It bundles tools such as Jupyter Notebook and the Spyder IDE, which we will use throughout this book.
The following screenshot shows what Anaconda Navigator looks like after
you open it from the start menu:
In the preceding screenshot, you can click on Launch to open Jupyter
Notebook or the Spyder IDE.
As the financial domain deals with lots of data that may be in different
formats and structures, we often have to analyze, preprocess, and visualize
it in order to understand it efficiently. The NumPy, pandas, and matplotlib
libraries are very important libraries that are used in Python for data
analysis and data preprocessing.
NumPy
pandas
Matplotlib
Another prerequisite of this chapter is that you have Anaconda installed on your
computer.
A brief introduction to NumPy
In this section, we will briefly discuss NumPy, its uses, and how to install it
on your computer. NumPy is a linear algebra library for Python. All other
libraries in the PyData ecosystem rely on NumPy, as it is one of the
fundamental building blocks.
PyData refers to the community that is primarily involved in using Python and its
various libraries for data preprocessing and data analysis. It is more
business-focused than the SciPy community, which grew up around the company Enthought
and is more focused on academic and scientific applications. The two communities have a lot
in common, but you'll discover more finance-related themes within PyData.
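NumPy ships with the Anaconda distribution, so it is usually already available. If you need to install it separately, either of the following commands (analogous to the pandas installation shown later in this chapter) should work:
conda install numpy
pip install numpy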
Now that we have installed NumPy, let's discuss the following topics:
NumPy arrays
NumPy indexing
Various NumPy operations
NumPy arrays
NumPy is a library that is used in the Python programming language to
create single and multidimensional arrays and matrices. It also supports
various built-in functions that can perform high-level mathematical
functions and operations on these arrays.
NumPy arrays essentially come in two flavors:
Vectors
Matrices
Vectors are arrays that are strictly one-dimensional, whereas matrices are
two-dimensional. Note that a matrix can still have just one row or one column
while remaining two-dimensional.
In this section, we are going to learn about the various ways to create
NumPy arrays using Python and the NumPy library. We are going to use the
Jupyter Notebook to show the programming code.
Let's create a NumPy array using a Python object. First, we will create a list
and then convert it to a NumPy array:
In [12]:
import numpy as np
#Example of a list
list_1 = [1,2,3]
#show
list_1
Out[12]:
[1, 2, 3]
We can also assign this array to a variable and use it where necessary:
In [14]:
#Assign array to variable
list_array=np.array(list_1)
#Print list_array
print(list_array)
Out[14]:
[1 2 3]
We can cast a normal Python list into an array. Currently, this is an example
of a one-dimensional array. We can also create an array by directly
converting a list of lists or a nested list:
In [15]:
nested_list= [[1,2,3],[4,5,6],[7,8,9]]
#show
nested_list
Out[15]:
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
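The actual conversion is a single np.array call; a sketch of what the next cell would look like (the cell number is illustrative):
In [16]:
np.array(nested_list)
Out[16]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])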
Let's now take a look at some of the built-in methods to generate arrays
using NumPy.
NumPy's arange function
arange is a built-in function provided by the NumPy library to create an array
with evenly spaced elements over a predefined interval. The syntax
for NumPy's arange function is as follows:
arange([start,] stop[, step,], dtype=None)
The following examples demonstrate the basic usage of the arange function:
In [9]: np.arange(0,10,2)
Out[9]: array([0, 2, 4, 6, 8])
Note that the stop value is exclusive, so 10 is not included in the result.
NumPy's zeros and ones functions
The zeros and ones functions are built-in functions used to create arrays of
zeros or ones; the syntax is as follows:
np.zeros(shape)
np.ones(shape)
The following examples show us how to create an array using zeros and ones:
In [24]:
np.zeros(3)
Out[24]:
array([ 0., 0., 0.])
In [26]:
np.zeros((5,5))
Out[26]:
array([[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])
In [27]:
np.ones(3)
Out[27]:
array([ 1., 1., 1.])
In [28]:
np.ones((3,3))
Out[28]:
array([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])
NumPy's linspace function
The linspace function is a built-in function that returns evenly spaced
numbers over a specified interval. linspace takes three important arguments:
start, stop, and num. By default, the num value is 50, but this can be changed.
In [6]:
np.linspace(0,10,3)
Out[6]:
array([ 0., 5., 10.])
In [7]:
np.linspace(0,10,50)
The output of this cell is an array of 50 evenly spaced values between 0 and 10.
NumPy's rand function
NumPy has a built-in function called rand to create an array with random
numbers. The rand function creates an array of a specified shape and
populates it with random samples from a uniform distribution over [0, 1].
This essentially means that a random value between 0 and 1 will be
selected, as shown in the following code block:
In [47]:
np.random.rand(2)
Out[47]:
array([ 0.11570539, 0.35279769])
In [46]:
np.random.rand(5,5)
Out[46]:
NumPy's randn function
The randn function creates an array of a specified shape and populates it with
random samples from a standard normal distribution:
In [1]:
import numpy as np
In [2]:
np.random.randn(2)
Out[2]:
array([ 0.1426585 , -0.79882962])
In [3]:
np.random.randn(5,5)
Out[3]:
array([[-0.31525094, -0.76859012, 0.72035964, 0.7312833 , -0.57112783],
[ 0.47523585, 0.18562321, -1.42741078, -0.50190548, 0.39230943],
[-0.06597815, -0.92100907, 0.27146975, -0.84471005, -0.09242036],
[-1.70155241, -0.79810538, 0.04569422, 0.1908103 , 0.15467256],
[ 0.36371628, -0.39255851, 0.02732152, -1.62381529, 0.42104139]])
The randint function creates an array by returning random integers from a
low value (inclusive) to a high value (exclusive). Here, we use the version
provided by NumPy's random module; Python's standard library also has a random
module with its own randint() function, which, unlike NumPy's, includes both endpoints.
The following are the parameters required by the randint() function:
(low, high): Both of them must be integer values.
The following are the errors and exceptions usually given by the randint()
function based on the input:
ValueError : Returns a ValueError when floating
point values are passed as parameters.
In [6]:
np.random.randint(1,100)
Out[6]:
1
In [7]:
np.random.randint(1,100,10)
Out[7]:
array([95, 11, 47, 22, 63, 84, 16, 91, 67, 94])
Now let's discuss NumPy indexing. Here, indexing will help us to retrieve
the elements of the array in a much faster way.
NumPy indexing
In this section, we are going to see how we can select elements or a group
of elements from an array. We are also going to look at the indexing and
selection mechanisms. First, let's create a simple one-dimensional array and
look at how the indexing mechanism works:
In [2]:
import numpy as np
In [3]:
#Creating a sample array
arr_example = np.arange(0,11)
In [4]:
#Show the array
arr_example
Out[4]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
In the preceding code, we created an array named arr_example. Let's
now see how we can select elements using indexing:
In [5]:
#Get a value at an index
arr_example[8]
Out[5]:
8
In [6]:
#Get values in a range
arr_example[1:5]
Out[6]:
array([1, 2, 3, 4])
In [7]:
#Get values in a range
arr_example[0:5]
Out[7]:
array([0, 1, 2, 3, 4])
In [8]:
#Broadcasting: set the first five elements to 100
arr_example[0:5]=100
#Show
arr_example
Out[8]:
array([100, 100, 100, 100, 100, 5, 6, 7, 8, 9, 10])
Let's take a look at how the indexing mechanism works for a two-
dimensional array.
The general format for a two-dimensional array is either array_2d[row][col]
or array_2d[row,col]. It is recommended to use comma notation for clarity.
The first step is to create a two-dimensional array and use the same
indexing mechanism with slicing techniques to retrieve the elements from
the array:
In [4]:
arr_2d = np.array([[5,10,15],[20,25,30],[35,40,45]])
#Show
arr_2d
Out[4]:
array([[ 5, 10, 15],
[20, 25, 30],
[35, 40, 45]])
In [5]:
#Indexing row
arr_2d[1]
Out[5]:
array([20, 25, 30])
In [6]:
# Format is arr_2d[row][col] or arr_2d[row,col]
arr_2d[1][0]
Out[6]:
20
In [7]:
# Getting individual element value
arr_2d[1,0]
Out[7]:
20
The slicing technique helps us to retrieve elements from an array with
respect to their indexes. We need to provide the indexes to retrieve the
specific data from the array. Let's take a look at a few examples of slicing in
the following code:
In [8]:
# 2D array slicing technique
arr_2d[:2,1:]
Out[8]:
array([[10, 15],
[25, 30]])
In [9]:
arr_2d[2]
Out[9]:
array([35, 40, 45])
In [10]:
#Shape bottom row
arr_2d[2,:]
Out[10]:
array([35, 40, 45])
NumPy operations
We can easily perform arithmetic operations using a NumPy array. These
operations may be array operations, array arithmetic, or scalar operations:
In [3]:
import numpy as np
arr_example = np.arange(0,10)
In [4]:
arr_example + arr_example
Out[4]:
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
In [5]:
arr_example * arr_example
Out[5]:
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81])
In [6]:
arr_example - arr_example
Out[6]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
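Scalar operations, mentioned above, broadcast a single number across every element of the array; a minimal sketch (the cell number is illustrative):
In [7]:
arr_example + 100
Out[7]:
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109])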
NumPy has many universal array functions that are essential mathematical
techniques that anyone can use to perform arithmetic operations:
In [10]:
#Taking Square Roots
np.sqrt(arr_example)
Out[10]:
array([0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
In [11]:
#Calculating exponential (e^)
np.exp(arr_example)
Out[11]:
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
2.98095799e+03, 8.10308393e+03])
In [12]:
np.max(arr_example) #same as arr_example.max()
Out[12]:
9
A brief introduction to pandas
In the previous section, we discussed the NumPy library, its built-in
functions, and its applications. We are now going to move on to discussing
the pandas library. The pandas library is very powerful, and is one of the
most important tools for data analysis and data preprocessing. It is an open
source library that is built on top of NumPy. It provides many important
features, such as fast data analysis, data cleaning, and data preparation, all of
which we need before providing data as input to our machine learning or deep learning
models for training purposes. pandas offers high performance and high productivity; it
also has many built-in visualization features. One of the most important
attributes of the pandas library is that it can work with data from a variety
of data sources, such as comma-separated value (CSV) files, HTML,
JSON, and Excel.
In order to use the pandas library, we will need to install it by going to the
Anaconda prompt, the Command Prompt, or the Terminal, and typing in the
following command:
conda install pandas
We can also use the pip install command to install the pandas library:
pip install pandas
In this section, we will cover the following pandas topics:
Series
DataFrames
pandas operations
pandas data input and output
Series
Series is the first datatype that we will be discussing in pandas. A series is a
one-dimensional array with custom labels or indexes that have the ability to
hold different types of data, such as integer, string, float, and Python
objects.
The Series constructor accepts several parameters; the ones used most often are as follows:
dtype: This stands for datatype. The default value for the dtype parameter is None; if dtype is None, the datatype will be inferred.
copy: This allows you to copy the data; the default value is False.
A series can be created from the following types of input:
An array input
A dictionary input
A constant or scalar input
Let's consider how we can create series using the Python and pandas
libraries. First, we will import the NumPy and pandas libraries:
In [1]:
import numpy as np
import pandas as pd
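The cells that define list_example, labels_example, and arr_example are not shown above; definitions along the following lines (the exact values are assumptions) would make the subsequent cells runnable:
In [2]:
labels_example = ['a','b','c']
list_example = [10,20,30]
arr_example = np.array([10,20,30])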
A series can also be created from a list, as shown in the following code
block:
In [3]:
pd.Series(data=list_example)
In [4]:
pd.Series(data=list_example,index=labels_example)
In [6]:
pd.Series(arr_example,labels_example)
A pandas series can also hold various kinds of objects, including built-in Python
functions, as its data, as demonstrated in the following code:
In [8]:
pd.Series(data=labels_example)
In the following code, we are adding built-in functions, such as print and
len, and converting into series:
In [10]:
# Even functions (although unlikely that you will use this)
pd.Series([print,len])
Now that we have seen how to create series using different datatypes, we
should also learn how to retrieve values in the series using indexes. pandas
makes use of index values, which allow fast retrieval of data; this is shown
in the following example:
In [14]:
series1 = pd.Series([1,2,3,4],index = ['A','B','C','D'])
series1
The following code demonstrates how we can retrieve the series values by
using indexing:
In [19]:
# Retrieving through index
series1[1]
Out[19]:
2
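The cell that defines series2 is not shown above. Any series whose index only partially overlaps with that of series1 reproduces the NaN behavior shown next; one definition consistent with the output that follows (the exact values are assumptions) would be:
In [23]:
series2 = pd.Series([1,2,5,4],index = ['B','C','D','E'])
series2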
The following code helps us to add two series – series1 and series2:
In [24]:
series1 + series2
The output is as follows:
Out[24]:
A NaN
B 3.0
C 5.0
D 9.0
E NaN
dtype: float64
We can even multiply two or any number of series; the code is as follows:
In [25]:
series1 * series2
DataFrames
The DataFrame constructor accepts several parameters; the ones used most often are as follows:
index: The index values are provided for the row labels. These labels act as an index for the complete rows, which are made up of the various columns provided by the data parameter.
columns: This parameter provides the column names or headings.
copy: This parameter is used for copying data; the default value is False.
Usually, the data parameter values are given in the form of a two-dimensional array with the shape
(n, m). The index length should be equal to the n value, while the
columns length should be equal to the m value.
A DataFrame can be created using various different types of input, such as lists,
dictionaries, NumPy arrays, series, or other DataFrames. Let's consider some
examples; the first step is to import the NumPy and pandas libraries, as we will
need to create some arrays and DataFrames. The remaining tasks are quite simple;
we need to add the elements provided in the syntax.
We will also be importing the randn function from NumPy's random module, which will help us to create random
numbers:
In [6]:
from numpy.random import randn
np.random.seed(55)
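The cells that actually create the DataFrame used in the remainder of this section are not shown here. A definition consistent with the later examples (row labels that include Q and R, columns A, B, and C, plus an extra column F that is dropped shortly) might look like the following; the exact labels and the way column F is built are assumptions:
In [7]:
# Assumes numpy and pandas have been imported as np and pd
dataframe = pd.DataFrame(randn(5,4),
                         index=['Q','R','S','T','U'],
                         columns=['A','B','C','D'])
# An extra column, F, which we will drop in the next example
dataframe['F'] = dataframe['A'] + dataframe['B']
dataframe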
In [12]:
# Selecting a single column returns a pandas Series
type(dataframe['A'])
Out[12]:
pandas.core.series.Series
Note that drop is a potentially destructive operation, as it allows us to drop
or delete a column. By default, drop returns a new DataFrame and leaves the
original untouched; to make the deletion permanent, we pass the extra parameter,
inplace, which confirms that we really do want to perform the drop operation.
Let's take a look at an example:
In [17]:
dataframe.drop('F',axis=1,inplace=True)
In [19]:
dataframe
There are more built-in properties available for DataFrames that are to do
with selecting rows and columns. For example, selecting a row label with
dataframe.loc returns the complete row with respect to the column values.
In the following code, we will select the subset of Q and R rows along with
the B and C columns:
In [38]:
#Select subset Q and R rows along with B and C columns
dataframe.iloc[:2,1:3]
In [41]:
# Select subset of rows and column using dataframe.loc
dataframe.loc['Q','A']
Out[41]:
-0.3810863829200849
In [42]:
dataframe
dataframe.loc[['Q','R'],['B','C']]
The following code will return the values that are greater than 0, and NaN elsewhere:
In [46]:
#return the values where the value > 0, else return NaN
dataframe[dataframe>0]
The following code returns the complete rows and columns where the value
of the A column is greater than 0:
In [47]:
#Returns the complete rows and columns where the value of columns A > 0
dataframe[dataframe['A']>0]
There is also a built-in function that can reset the index in the DataFrame to
the default index, as in NumPy (such as 0, 1, 2, 3, ..., n):
In [52]:
dataframe
The reset_index() method resets the index to the default integer index, as shown here:
In [53]:
# reset_index() reset to default 0,1...n index
dataframe.reset_index()
unique(): The unique function returns unique values from a column or a row in the form of
an array.
nunique(): The nunique function provides the number of unique elements in a row or
column.
value_counts(): The value_counts function returns the count of occurrences of each unique value in a row
or column.
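The cell that creates the df DataFrame used in the following examples is not shown above; a definition consistent with the outputs below would be:
In [4]:
df = pd.DataFrame({'col1':[10,11,12,13],
                   'col2':[100,200,300,400]})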
In [5]:
df['col2'].unique()
Out[5]:
array([100, 200, 300, 400], dtype=int64)
In [6]:
df['col2'].nunique()
Out[6]:
4
In [7]:
df['col2'].value_counts()
The output is as follows:
Out[7]:
100 1
400 1
300 1
200 1
Name: col2, dtype: int64
A DataFrame also provides an option to apply an operation within a function to all the
elements in the DataFrame. Let's define a function and see how we can apply it to a
DataFrame:
In [20]:
def squareofanumber(value):
return value**2
In [21]:
df['col1']
Out[21]:
0 10
1 11
2 12
3 13
Name: col1, dtype: int64
In [22]:
#The apply() function applies the function definition to all the elements of a column
df['col1'].apply(squareofanumber)
Out[22]:
0 100
1 121
2 144
3 169
Name: col1, dtype: int64
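apply() also works with anonymous lambda functions; the following cell (illustrative, not from the book) is equivalent to the one above:
In [23]:
df['col1'].apply(lambda value: value**2)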
The apply() function applies the function to all of the elements in the DataFrame.
Make sure the function definition is compatible with the values of the DataFrame with respect to the datatypes
used, or it may produce a runtime error.
Data input and output operations
with pandas
The pandas library provides a lot of built-in functions to read data from
different data sources, such as CSV, Excel, JSON, and HTML. These
features have made pandas the favorite library of many data scientists and
machine learning developers.
Reading CSV files
pandas provides a built-in function, read_csv(), to read data from a CSV data
source. The output returned from this function is essentially a DataFrame.
First, we need to import the pandas library, as follows:
In [15]:
import pandas as pd
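The cell that actually reads the CSV file is not shown above. Assuming a file named customer.csv (the copy written back later is called customer_copy.csv), it would look like this:
In [16]:
dataframe = pd.read_csv('customer.csv')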
In [17]:
dataframe.head()
We can convert this DataFrame back into the CSV file using the to_csv()
function:
In [18]:
type(dataframe)
Out[18]:
pandas.core.frame.DataFrame
In [19]:
dataframe.to_csv('customer_copy.csv',index=False)
The CSV file is saved in the same location as that of the current file. It can be
downloaded from the GitHub repository here: https://goo.gl/DJxn9x.
Reading Excel files
pandas also has a built-in read_excel() function, which is used to read data
from Excel files. The output that is returned is also a DataFrame. The
parameter that is usually provided in the read_excel() function is the name of
the Excel file and the sheet name:
In [3]:
import pandas as pd
In [9]:
df=pd.read_excel('Sample.xlsx',sheet_name='Sample')
df.head()
The to_excel() function is used to convert the DataFrame back into the Excel
file. It is stored in the same location that we are currently working in:
In [10]:
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
The sample Excel file can be found at the GitHub link mentioned in the Technical
requirements section.
Reading from HTML files
pandas also provides a built-in function to read data from HTML files. This
function reads the table tags from the HTML page and returns a list of DataFrame
objects; an example of this is as follows:
In [12]:
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
In [14]:
df[0].head()
A brief introduction to Matplotlib
Matplotlib is a plotting and data visualization library for Python. Its main advantages are as follows:
It is generally very easy to get started with, for both simple and complex plots.
It provides features that support custom labels and texts.
It provides great control of each and every element in a plotted figure.
It provides high-quality output with many image formats.
It provides lots of customizable options.
Let's now move on and take a look at some examples of how to use
matplotlib to visualize data. The first step is to import the library, as
follows:
In [1]:
import matplotlib.pyplot as plt
You'll also need to use the following line to see plots in the Notebook:
In [2]:
%matplotlib inline
We are going to use some NumPy examples to create some arrays and learn
how to plot them using the matplotlib library. Let's create an array using the
numpy.linspace function, as follows:
In [6]:
import numpy as np
arr1= np.linspace(0, 10, 11)
arr2 = arr1 ** 2
In [7]:
arr1
Out[7]:
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
In [8]:
arr2
Out[8]:
array([ 0., 1., 4., 9., 16., 25., 36., 49., 64., 81., 100.])
Here, two arrays, arr1 and arr2, have been created using NumPy. Let's now
take a look at some matplotlib commands to create a visualization graph.
The plot function
The plot function is used for plotting y versus x in the form of points,
markers, or lines. The simplified syntax of the plot function is as follows:
plt.plot(x, y, format_string, **kwargs)
Usually, the first and second arguments are the arrays that we want to plot.
There is a list of additional parameters that can be passed to the plot function, such as the
line color and the line width. We can create a very simple line plot
using the following code:
In [9]:
# 'g*' plots green star markers
plt.plot(arr1, arr2, 'g*', label="Example 2")
plt.xlabel('X Axis Title Here')
plt.ylabel('Y Axis Title Here')
plt.title('String Title Here')
The output is as follows:
The third parameter inside the plot function is essentially the format
parameter, which is used to provide a color and marker style for the plotted points.
To find out more about the different parameters and features inside the various built-in
functions provided by matplotlib, I encourage you to explore the official matplotlib web
page here: http://matplotlib.org/.
The xlabel function
In the preceding example, we used the xlabel built-in function to provide a
label for the x axis, along with the corresponding ylabel and title functions for
the y axis label and the plot title. These functions let you provide your own
custom text for each element of the figure.
The subplot function allows us to place multiple plots on one canvas by specifying the
number of rows, the number of columns, and the plot number:
In [15]:
# plt.subplot(nrows, ncols, plot_number)
plt.subplot(1,2,1)
plt.plot(arr1, arr2, 'b--') # a blue dashed line
plt.subplot(1,2,2)
plt.plot(arr2, arr1, 'r*-');
plt.show()
All these libraries help us to carry out efficient data analysis, data
preprocessing, and data visualization tasks, which represent the core of any
problem statements that we solve using Python. In the next chapter, we will
learn how to use these libraries to solve some real-world problems.
Further reading
It is always a good idea to read the official documentation of the libraries
discussed to gain additional knowledge.
You can refer to the following links for the documentation of each library:
NumPy: https://docs.scipy.org/doc/
pandas: https://pandas.pydata.org/pandas-docs/stable/
Matplotlib: https://matplotlib.org
Section 2: Advanced Analysis in
Python for Finance
In this section, we will dive deep into various topics, such as time series
financial data, portfolio optimization, the capital asset pricing model, and
regression analysis. For each of these topics, we will see how to use Python and the
tools available in Python to implement these techniques.
Time series data is usually represented using line charts, which allow us to
understand the patterns of the data that's been plotted. Time series data is
used in various fields, including statistics, control engineering, astronomy,
communications engineering, econometrics, mathematical finance, weather
forecasting, earthquake prediction, electroencephalography, signal
processing, pattern recognition, and in any applied science or engineering
domain.
Time series analysis refers to the methods that are used for carefully
studying time series data in order to extract meaningful statistical
information and other important features.
To begin with, let's import all the necessary libraries—NumPy, pandas, and
matplotlib—as shown in the following code segment:
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Now, let's import the datetime library. This is a built-in Python
library that allows us to create a date time stamp or a specific date object.
The datetime library is imported as follows:
In [2]:
from datetime import datetime
Let's take a look at how we can create a date time object. Initially, we will
create variables for the year, the month, the day, and the time:
In [3]:
year = 2018
month = 9
day = 24
hour = 12
minute = 30
second = 15
In [5]:
datetime(year,month,day)
Out[5]:
datetime.datetime(2018, 9, 24, 0, 0)
The variables for the year, the month, the day, and the time have been
created to illustrate the order of arguments that are used inside the date time
function. Since we have not provided a parameter for the hours or the
minutes inside the datetime function, the default value is set to 0. If you want
to provide the hour, the minute, and the second as well, we can do this as
follows:
In [8]:
date_time = datetime(year,month,day,hour,minute,second)
date_time
Out[8]:
datetime.datetime(2018, 9, 24, 12, 30, 15)
We can now see the values for the hours, the minutes, and the seconds.
The date time object has various attributes that help us retrieve information
from the date time variables, as shown in the following code segment. The
following code displays the day of the date time object:
In [12]:
date_time.day
Out[12]:
24
The following code displays the hour of the date time variable:
In [14]:
date_time.hour
Out[14]:
12
The following code displays the minute value from the date time variable:
In [15]:
date_time.minute
Out[15]:
30
The following code displays the year from the date time variable:
In [16]:
date_time.year
Out[16]:
2018
The preceding examples use various attributes, such as the day, the hour, the
minute, or the year to provide information about the date time variable.
If you want more information about the various inbuilt properties, please go to the
Python official documentation provided at the following link: https://docs.python.org/2/library/datetime.html.
Up until now, we have been looking at the basic date time object that is
provided by Python. We will now go ahead and discuss how we can use the
pandas library to handle the date time object and convert it into a date time
index. Usually, in a real-world scenario, most financial data will have a date
time object in its dataset. This data can be retrieved through an API or a
CSV file, which can be read or retrieved with the help of pandas. The
syntax for the pandas date time index is as follows:
pandas.DatetimeIndex(data, copy, start, end)
Among its parameters, note the following:
start: This parameter is an optional date time object. If data is None, start is used as the starting
point to generate regular timestamp data.
We usually deal with time series as an index when working with a pandas
dataframe that's been obtained from some sort of financial API. Fortunately,
pandas has a lot of functions and methods for working with time series data.
Let's take a look at an example of how we can convert a Python date time
object into a pandas date time index.
First, we will create some date time variables using the Python datetime
library, as shown here:
In [28]:
# Create an example datetime list
list_date = [datetime(2018, 9, 24), datetime(2016, 9, 25)]
list_date
Out[28]:
[datetime.datetime(2018, 9, 24, 0, 0), datetime.datetime(2016, 9, 25, 0, 0)]
The next step is to use a pandas date time index to convert the preceding
data into a date time index, as shown here:
In [29]:
# Converted to a Date time index
dt_date = pd.DatetimeIndex(list_date)
dt_date
Out[29]:
DatetimeIndex(['2018-09-24', '2016-09-25'], dtype='datetime64[ns]', freq=None)
Here, we can see that the elements of list_date are of type datetime.datetime. In the
next line, the list is converted into a date time index. Now, let's create a
dataframe with the date time index:
In [30]:
# Considering some random data
data = np.random.randn(2,2)
print(data)
cols = ['X','Y']
Out[30]:
[[-0.0747333 0.21118015]
[-0.5697708 -0.36354259]]
Then, we will create a dataframe using the pandas library, as shown here:
In [31]:
df = pd.DataFrame(data,dt_date,cols)
df
X Y
2018-09-24 -0.074733 0.211180
2016-09-25 -0.569771 -0.363543
Another approach would be to use groupby, but the group operation is not smart enough to understand concepts such
as business quarters or the start of a business year. Fortunately, pandas has other inbuilt functions that help us
implement frequency sampling. To get started with time resampling, we will be considering the stock prices of
the Walmart dataset. You can find the CSV in the Chapter 3 folder of the GitHub link we mentioned previously. This
dataset is taken from Yahoo Finance.
To begin with, let's import all the necessary libraries, such as numpy, pandas, and matplotlib, as shown in the following
code segment:
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Now, let's read the Walmart CSV dataset using the pandas read_csv function, as shown here:
In [3]:
df = pd.read_csv('walmart_stock.csv')
In [4]:
df.head()
From the preceding output, we can see that the first column is the date column, which has to be set as an index. If
you take a look at the type of the Date column, we can see that it is of the object type.
We can see the datatypes of all the columns using the following code:
In [5]:
df.info()
Out[5]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1258 entries, 0 to 1257
Data columns (total 7 columns):
Date 1258 non-null object
Open 1258 non-null float64
High 1258 non-null float64
Low 1258 non-null float64
Close 1258 non-null float64
Volume 1258 non-null int64
Adj Close 1258 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 68.9+ KB
From the preceding output, we can see that the Date column is not a date time object. We need to convert the Date column into a date
time object in order to set it as an index. The simplest way to do this is by using the pandas to_datetime function, as
shown in the following code segment:
In [6]:
df['Date'] = df['Date'].apply(pd.to_datetime)
In [7]:
df.info()
Out[7]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1258 entries, 0 to 1257
Data columns (total 7 columns):
Date 1258 non-null datetime64[ns]
Open 1258 non-null float64
High 1258 non-null float64
Low 1258 non-null float64
Close 1258 non-null float64
Volume 1258 non-null int64
Adj Close 1258 non-null float64
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 68.9 KB
Once we convert it into a date time object, we can see the type of the Date column: it is a date time object. We can
now set the date time as an index by using the set_index pandas inbuilt function, as shown in the following code
segment:
In [9]:
df.set_index('Date',inplace=True)
df.head()
Resampling is usually applied to a date time index, which was the reason why we converted the Date column into
a date time index. Let's proceed and see how we can apply the date time index.
The resampling syntax is usually as follows:
DataFrame.resample(rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, base=0, on=None, level=None)
The following are the important parameters that are used in the resample() method:
rule: The offset string or object representing the target conversion, that is, the frequency to resample to.
axis: This can have a value of either 0 or 1, and the default value is 0.
closed: {'right', 'left'}. This indicates which side of the bin interval is closed. The default is left for all frequency offsets except for M, A, Q, BM, BA, BQ, and W, which all have a default of right. These letters refer to different frequencies, more details of which can be found in the rule list that follows.
label: {'right', 'left'}. This indicates which bin edge label should be used to label the bucket. The default is left for all frequency offsets except for M, A, Q, BM, BA, BQ, and W, which all have a default of right.
convention: {'start', 'end', 's', 'e'}. This is used for a PeriodIndex only. It controls whether to use the start or the end of the rule. A PeriodIndex helps us display the dataframe with respect to the starting month or the end month of a year.
kind: {'timestamp', 'period'}, optional. We can pass timestamp to convert the resulting index to a date time index, or period to convert it to a period index. By default, the input representation is retained.
loffset: timedelta, optional. This adjusts the resampled time labels.
base: int, default 0. For frequencies that evenly subdivide a day, this refers to the origin of the aggregated intervals. For example, for a frequency of five minutes, the base could range from zero to four.
on: string, optional. This indicates which column to use for resampling instead of the index for a dataframe. The column must be a date time object. This is new in version 0.19.0.
level: string or int, optional. For a MultiIndex, this indicates which level to use for resampling instead of the index. The level must be a date time object. This is new in version 0.19.0.
Returns: a Resampler object.
The first parameter in the resample function is rule. The rule parameter is used to indicate how we want to resample
the date time index. It basically applies a groupby method that is specific to time series data. The following are the
various options that can be provided for the rule:
Rule Description
B Business day frequency
D Calendar day frequency
W Weekly frequency
M Month end frequency
SM Semi-month end frequency (15th and end of month)
BM Business month end frequency
CBM Custom business month end frequency
MS Month start frequency
SMS Semi-month start frequency (1st and 15th)
BMS Business month start frequency
CBMS Custom business month start frequency
Q Quarter end frequency
BQ Business quarter end frequency
A Year end frequency
BA Business year end frequency
AS Year start frequency
BAS Business year start frequency
BH Business hour frequency
H Hourly frequency
T, min Minutely frequency
S Secondly frequency
L, ms Milliseconds
U, us Microseconds
N Nanoseconds
You can read the pandas documentation for more information on the rules: http://pandas.pydata.org/pandas-docs/stable/timeseries.html.
Let's apply a new resampling rule to the previously discussed dataset. In the following example, we will apply rule
A, which specifies year end frequency, to resample our data, as shown here:
In [10]:
df.resample(rule='A')
Out[10]:
DatetimeIndexResampler [freq=<YearEnd: month=12>, axis=0, closed=right, label=right, convention=start, base=0]
The return type of the resample method is DatetimeIndexResample. After applying the resample method, we should
apply an aggregate function to group the data. In the following example, we will apply a mean function to group
the resampled data:
In [11]:
# To find the yearly mean
df.resample(rule='A').mean()
From the preceding example, after applying rule A and the aggregate function mean, we get the result of the yearly
mean, or the average values of Open, High, Low, Close, and Volume of the Walmart stock prices.
Similarly, we can apply various other rules, such as weekly frequency (W), calendar day frequency (D), and
business month end frequency (BM), as shown in the following code segment:
In [13]:
# Weekly frequency Means
df.resample(rule='W').mean().head()
The following code provides us with the calendar day frequency means:
In [14]:
# Calendar day frequency Means
df.resample(rule='D').mean().head()
The following code provides us with the business month end frequency means:
In [15]:
# Business month end frequency Means
df.resample(rule='BM').mean().head()
In the code cell In [14], where we applied the calendar day frequency rule, we can see that one row has NaN
values. This is because stock prices are not available for Saturdays and Sundays.
We can also use various aggregate functions, such as max(), min(), or std(), as shown in the following code segment:
In [16]:
#yearly frequency max
df.resample(rule='A').max().head()
The following code shows how we can apply the min() aggregate function:
In [17]:
#yearly frequency min
df.resample(rule='A').min().head()
The following code helps us to get the standard deviation with yearly frequency:
In [18]:
#yearly frequency standard deviation
df.resample(rule='A').std().head()
Let's go ahead and see how we can use timeshifts with the help of pandas. In this section,
we will be reading the same Walmart stock prices dataset. We'll start by importing the
libraries, as shown in the following code segment:
In [5]:
import pandas as pd
Let's read the walmart_stock.csv file, as shown in the following code segment:
In [6]:
dataframe = pd.read_csv('walmart_stock.csv',index_col='Date')
dataframe.index = pd.to_datetime(dataframe.index)
We can use the head() function to see the first five records of the dataframe, as shown in
the following code segment:
In [7]:
dataframe.head()
In the preceding code, the head() function displays the top five records from the dataset,
whereas the tail() function displays the last five records. Let's take a look at the basic
syntax of the shift function:
DataFrame.shift(periods=1, freq=None, axis=0)
The shift function helps us shift the index by a desired number of periods with an
optional frequency. The following are the various parameters of the shift function:
Parameters Descriptions
periods This is usually an integer value. It specifies the number of periods or indexes to move. It can be positive or negative.
freq This is usually a date offset, time delta, or time rule. It is optional.
axis The axis value is zero for the index and one for the column.
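Sketching the call pattern from the pandas documentation (this listing is an assumption, not the book's own):
# Basic call pattern of the shift method (a sketch):
# DataFrame.shift(periods=1, freq=None, axis=0)
dataframe.shift(periods=2)            # shift the data down by two rows
dataframe.shift(periods=-2)           # shift the data up by two rows
dataframe.shift(periods=1, freq='D')  # shift the index itself by one calendar day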
Let's implement the time shift operation using the shift method, as shown in the following code segment:
In [11]:
dataframe.shift(periods=1).head()
Date        Open       High       Low        Close      Volume      Adj Close
2012-01-03  NaN        NaN        NaN        NaN        NaN         NaN
2012-01-04  59.970001  61.060001  59.869999  60.330002  12668800.0  52.619235
2012-01-05  60.209999  60.349998  59.470001  59.709999  9593300.0   52.078475
2012-01-06  59.349998  59.619999  58.369999  59.419998  12768200.0  51.825539
2012-01-09  59.419998  59.450001  58.869999  59.000000  8069400.0   51.459220
In the preceding example, we provided the period value as one, so the dataframe is
shifted downward by one time step. For this reason, the first row is made up of all NaN
values.
Similarly, we can also provide a negative value to the periods parameter to shift the dataset
upward. In this case, the last row will be appended as NaN:
In [13]:
dataframe.shift(periods=-1).tail()
In [19]:
# Shift everything forward one year
dataframe.tshift(periods=1,freq='Y').head()
The shift function will be handy when we implement the forecasting technique in later
subsections.
Timeseries rolling and expanding using
pandas
In this section, we will be discussing the built-in pandas rolling methods. We can use these
methods to create a rolling mean based on a given time period. Let's discuss what a rolling
method can be used for. Usually, the financial data that is generated daily is noisy. We can
use the rolling mean, also known as the moving average, to get more information about the
general trend of the data.
When using the pandas built-in rolling methods, we will provide a time period window and
use the data within that to calculate various statistical measures, such as the mean, the
standard deviation, and other mathematical concepts, including autoregression and the
moving average. Let's implement the pandas rolling methods in our Jupyter Notebook. We
will be using the same Walmart stock dataset:
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
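The cell that reads the dataset is not shown here; assuming the same walmart_stock.csv file used earlier in the chapter, it would look roughly like this:
# Read the Walmart stock dataset and parse the Date column as the index (a sketch)
dataframe = pd.read_csv('walmart_stock.csv', index_col='Date')
dataframe.index = pd.to_datetime(dataframe.index)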
The top five records of the dataframe can be seen by using the head() function, as shown in
the following code segment:
In [3]:
dataframe.head()
We will be using the Close column of the stock dataset and plotting it, as shown in the
following code segment:
In [6]:
dataframe['Close'].plot(figsize=(14,5))
We can see that the preceding diagram contains a lot of noise. We will try to find the weekly average
using the moving average, or rolling mean, which can be implemented using the pandas library. To begin
with, let's look at the basic syntax of the rolling function that's provided by pandas:
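Sketched from the pandas documentation (this listing is an assumption, not the book's own), the call pattern is roughly:
# Basic call pattern of the rolling method (a sketch):
# DataFrame.rolling(window, min_periods=None, center=False, axis=0)
dataframe.rolling(window=7).mean()  # 7-row moving average of every column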
The important parameters of the rolling function are highlighted in the following table:
Parameters Description
window: int or offset This parameter is basically the size of the moving window. It refers to the number of rows of a dataset used for calculating the statistics, such as the mean and the standard deviation.
min_periods: int This refers to the minimum number of rows in a window that are required to have a value. For an integer window, it defaults to the window size; for an offset-based window, it defaults to one.
center: Boolean value This is used to position the label at the center of the window.
For more detailed information regarding the rolling function, please refer to the following link: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html.
Now, let's implement the rolling function with the size of the window set to seven days and
apply an aggregate function such as mean, as shown in the following code segment:
In [8]:
# rolling with windows size of every 7 days
dataframe.rolling(7).mean().head(10)
From the preceding output, we can see that the first six rows have values of NaN. This is
because the size of the window is seven. Therefore, the record of the seventh row is the
average of the first seven rows. Similarly, the eighth row is the average of the previous seven
rows (rows two to eight). The other rows are averaged in this way.
By using the rolling method with a particular window size, the data becomes less noisy and
more reflective of the trend than the actual data. Let's plot the preceding data with respect to
the Open and Close columns, as shown in the following code segment:
In [12]:
dataframe['Open'].plot()
dataframe.rolling(window=7).mean()['Close'].plot(figsize=(15,5))
Let's change the window size value to 30, which is a month instead of a week, and plot the
stock dataset again:
In [11]:
dataframe['Open'].plot()
dataframe.rolling(window=30).mean()['Close'].plot(figsize=(15,5))
The important parameters of the pandas expanding function are as follows:
Parameters Description
min_periods The minimum number of records or observations required to have a value (otherwise the result is NaN).
center Places the labels at the center of the window.
axis The value of the axis can be either 0 or 1. The default value is 0.
The expanding function, used with the aggregate function mean, works like the rolling
function, except that at each time step it computes the mean of all the stock values seen so
far. With the expanding function, we will be able to understand whether the trend of
the stock prices is increasing, decreasing, or stationary. The following is an example of an
expanding function:
In [15]:
#specify a minimum number of periods
dataframe['Close'].expanding(min_periods=1).mean().plot(figsize=(16,5))
If you have the Command Prompt (cmd) open, then type the following command:
pip install statsmodels
Let's move on and take a look at how we can use the statsmodels library for
various important activities, such as forecasting.
For additional information on the statsmodels library, please go through the statsmodels
documentation at the following link: http://www.statsmodels.org/stable/index.html.
Error trend seasonality models
In this section, we will be discussing the error trend seasonality (ETS)
models. The ETS models usually consist of the following:
Exponential smoothing
Trend method models
ETS decomposition
Before moving on, let's discuss the key components of time series analysis,
which are given here:
Trend
Seasonality
Cyclical patterns
Irregular patterns
Most time series data follows a pattern made up of one or more of these components.
Trend
A trend pattern in time series data reflects a long-term increase or decrease in the observed values.
Seasonality
Seasonality refers to patterns in the data that repeat at regular intervals, such as monthly or yearly cycles.
Irregular patterns
Irregular patterns are the random, short-term fluctuations that remain once the trend and seasonal components are accounted for.
Let's now read the Walmart stock dataset and apply ETS decomposition to it:
In [2]:
dataframe = pd.read_csv('walmart_stock.csv',index_col='Date')
dataframe.index = pd.to_datetime(dataframe.index)
We can get the top five records of the dataframe, as shown in the following code
segment:
In [3]:
dataframe.head()
To apply the ETS decomposition, we need to import the seasonal_decompose function from the
statsmodels.tsa.seasonal module, as shown in the following code segment:
In [5]:
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
%matplotlib inline
decomposition = seasonal_decompose(dataframe['Open'], freq=12)
fig = plt.figure()
fig = decomposition.plot()
fig.set_size_inches(15, 8)
From the preceding output, we can see the important components of the time series
data, such as trends, seasonality, residuals (the error difference), and the observed data.
This ETS decomposition helps us get more insights about the stock data to find the
pattern and the trend of the stock prices. It also provides information such as whether or
not the data is seasonal.
AutoRegressive integrated moving
average model
In this section, we are going to discuss the AutoRegressive integrated
moving average (ARIMA) model, which is a very popular and widely
used statistical tool for implementing time series forecasting in Python. It
uses the statsmodels library to achieve forecasting. All of the topics we
covered in the previous section will also be used when we implement the
forecasting. ARIMA combines the following three components:
AutoRegression (AR)
Integration (I)
Moving Average (MA)
AR
Whenever we apply an ARIMA model on time series data for time series
forecasting, we usually need stationary time series data. If the data is not
stationary, we use a technique called differencing to make the data
stationary. Differencing basically means subtracting an observation from an
observation of the previous step. Stationary data is a set of data with a mean
and variance that does not change over time and that does not have trends.
We can check whether a set of data is stationary using the Dickey Fuller
test. If we find that it isn't stationary, we can stationarize it using high-order
differencing. Let's discuss the Dickey Fuller Test for stationarity.
If the null hypothesis (γ = 0) is not rejected, then y(t) is not stationary
Difference the variable and repeat the test to see whether the
differenced variable is stationary
If the null hypothesis is rejected, then y(t) is stationary
Testing for stationarity
We will be using the Dickey Fuller test to test the null hypothesis, which is
denoted as H0. This indicates that the time series data has a unit root, which
indicates that it is not stationary. If we reject the null hypothesis, we should
use the alternate hypothesis, which is denoted as H1. This indicates that the
time series data has no unit root and is basically stationary. We decide
between H0 and H1 based on the p-value returned by the Dickey Fuller
Test:
To select the values of p and q for the autoregressive and moving average
models, we will have to use the concepts of AutoCorrelation Function
(ACF) and Partial Auto Correlation Function (PACF). ACF and PACF
will be discussed later on as we move ahead with the implementation of the
forecasting technique using ARIMA.
Let's go ahead and implement the ARIMA model with Python and the
statsmodels library. In this example, we are going to use the same Walmart
stock price dataset. We will be using the ARIMA model to forecast
the Open column.
ARIMA code
The general process for the ARIMA model when used for forecasting is as follows:
1. The first step is to visualize the time series data to discover the trends and find out whether the time series
data is seasonal.
2. As we know, to apply the ARIMA model, we need stationary data. The second step, therefore, is to
check for stationarity with the Dickey Fuller Test and, if the data is not stationary, convert it into stationary data using differencing.
3. We then select the p and q values for ARIMA (p,d,q) using ACF and PACF.
Let's begin by importing the necessary libraries and reading the dataset, as shown in the following code segment:
In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
df = pd.read_csv('walmart_stock.csv')
We can check the top five records of the dataframe, as shown in the following code segment:
In [4]:
df.head()
Then, we are going to follow the same steps we used in the previous section on time series data. First, we need to
convert the Date column into a date time object using pandas. After that, we set this date column as the index of the
dataframe, as shown in the following code segment:
In [5]:
df['Date'] = pd.to_datetime(df['Date'])
In [6]:
df.set_index('Date',inplace=True)
df.head()
As we can see from the output, the Date column is set as an index in the dataframe.
Let's see the rolling mean and the rolling standard deviation of the Open column of the time series dataset that we
discussed in the previous section:
In [8]:
timeseries = df['Open']
In [9]:
timeseries.rolling(12).mean().plot(label='12 Month Rolling Mean')
timeseries.rolling(12).std().plot(label='12 Month Rolling Std')
timeseries.plot()
plt.legend()
The next step is to use the ETS decomposition method to visualize the general trend of the data, as we discussed in
the earlier section:
In [11]:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df['Open'], freq=12)
fig = plt.figure()
fig = decomposition.plot()
fig.set_size_inches(15, 8)
In [15]:
# Store in a function for later use!
from statsmodels.tsa.stattools import adfuller
def adf_check(time_series):
    """
    Pass in a time series, prints an ADF report
    """
    result = adfuller(time_series)
    print('Augmented Dickey-Fuller Test:')
    labels = ['ADF Test Statistic','p-value','#Lags Used','Number of Observations Used']
    # Print each test statistic next to its label
    for value, label in zip(result, labels):
        print(label + ' : ' + str(value))
    # Interpret the p-value at the 5% significance level
    if result[1] <= 0.05:
        print("strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary")
    else:
        print("weak evidence against null hypothesis, time series has a unit root, indicating it is non-stationary")
adf_check(df['Open'])
Out[15]:
Augmented Dickey-Fuller Test:
ADF Test Statistic : -2.315173149148315
p-value : 0.1671210162134677
#Lags Used : 11
Number of Observations Used : 1246
weak evidence against null hypothesis, time series has a unit root, indicating it is non-stationary
From the preceding output, we can see that the p-value is larger than 0.05, so we conclude that the data is not
stationary. To make the data stationary, we will follow the differencing technique, which we have already
discussed. In differencing, the first difference of a time series is the series of changes from one period to the next,
which we can compute easily with a shift operation in pandas. We can continue to take the second difference, the
third difference, and so on, until the data is stationary:
In [16]:
df['Open First Difference'] = df['Open'] - df['Open'].shift(1)
In [17]:
df['Open First Difference'].head()
Out[17]:
Date
2012-01-03 NaN
2012-01-04 0.239998
2012-01-05 -0.860001
2012-01-06 0.070000
2012-01-09 -0.389999
Name: Open First Difference, dtype: float64
After the first differencing, we will pass the new dataset column, Open First Difference, to the same method of the
Dickey Fuller Test to see whether the data is stationary:
In [18]:
adf_check(df['Open First Difference'].dropna())
Out[18]:
Augmented Dickey-Fuller Test:
ADF Test Statistic : -10.395143169790536
p-value : 1.9741449125945693e-18
#Lags Used : 10
Number of Observations Used : 1246
strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary
As you can see in the preceding output, we are now getting a p-value that is much less than 0.05, so we can now
consider the dataset as stationary. We then set the differencing (d) value in the ARIMA (p,d,q) model to one, as
only one difference was needed to make the data stationary.
If we try to plot the Open First Difference column, we will see the stationary pattern, as shown here:
In [19]:
df['Open First Difference'].plot()
Imagine taking a time series of length T, copying it, and deleting the first
observation of copy #1 and the last observation of copy #2. Now, you have
two series of length T−1, for which you calculate a correlation coefficient.
This is the value of the vertical axis at x=1 in your plots. It represents the
correlation of the series lagged by one time unit. You go on and do this for
all possible time lags x, and this defines the plot.
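As a quick illustration of this idea (a sketch, not a cell from the book), the lag-1 value of such a plot can be
reproduced directly from the differenced series:
# Lag-1 autocorrelation computed by hand: correlate the series with a copy of
# itself shifted by one time step (a sketch, not the book's cell)
series = df['Open First Difference'].dropna()
print(series.corr(series.shift(1)))
# pandas also provides the same value directly
print(series.autocorr(lag=1))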
Autocorrelation interpretation
Now, let's plot the final ACF and PACF plots to select the p and q values, as shown in the
following code segment:
In [26]:
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(df['Open First Difference'].iloc[13:], lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(df['Open First Difference'].iloc[13:], lags=40, ax=ax2)
p: The number of lag observations included in the model (the order of the autoregressive
part). This can be read from the lag at which the PACF plot cuts off (we begin counting from zero).
d: The number of times that the raw observations are differenced, also called the
degree of differencing.
q: The size of the moving average window, also called the order of the moving average.
This can be read from the lag at which the ACF plot cuts off (again, we begin
counting from zero).
From the preceding diagram, we can see that the value of both p and q is 0. We will
provide these parameters when we call the ARIMA model.
Let's see how we can import the ARIMA model. We will be using the seasonal ARIMAX
model to predict the future of the time series data:
# For non-seasonal data
from statsmodels.tsa.arima_model import ARIMA
One thing to note is that when we visualized the stock dataset using the ETS
decomposition, we were able to find the seasonal pattern in the time series. Usually, when
we find a seasonal pattern, we should use another model, which is called the seasonal
ARIMAX model. The basic difference between the seasonal ARIMAX model and the
ARIMA model is that we need to provide an extra parameter for the seasonal ARIMAX
model, which is seasonal_order. This parameter will be in tuple form, (p,d,q,S), where p and
q are the lags for the autoregressive and moving average models, d is the differencing
value, and S is the number of months in a year, which is 12.
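A minimal sketch of fitting such a model is shown here, assuming the df['Open'] series and the (p, d, q) order
found above; the seasonal (p, d, q) values are placeholders rather than values taken from the book:
# Fit a seasonal ARIMA (SARIMAX) model - a sketch, not the book's exact cell
import statsmodels.api as sm
model = sm.tsa.statespace.SARIMAX(df['Open'],
                                  order=(0, 1, 0),               # (p, d, q) found above
                                  seasonal_order=(0, 1, 0, 12))  # seasonal (p, d, q, S); placeholder values
results = model.fit()
print(results.summary())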
Before moving on, let's take a look at the different sources from which we
can extract financial data.
Sources of financial data
Let's look at how to use financial data from a technical perspective. The data
you'll be using for financial analysis, or for any other type of analysis, is
likely to come from one of two sources: a web server or your computer. In
practice, in order to access data stored on a web server, you'll need to
connect to its application programming interface (API). In our case, we
will need a financial data API. We can also call these online financial data
sources; examples of these include the IEX, Morningstar, Alpha Vantage,
and the Quandl API.
It is usually much harder to clean and organize your data than to conduct the
subsequent business or financial analysis. At first glance, it might seem like
a better idea to use APIs, as these provide better data, and all you need is
an internet connection. So, why bother using CSV files at all? Firstly, web
services are prone to breaking down for unknown periods of time. Secondly,
it is possible that a certain API may contain only part of the data that you
need for proper financial analysis. Currently, there's no free API out there
that provides financial data that is as rich as you might want. Some
APIs offer single or multiple stock data, while others provide foreign stock
data or data to do with market indices. Note that an analyst may often need
to work with all of these. Furthermore, you can sometimes only connect to
an API if you use Python 3, and not Python 2.
In order to read the financial data from the API, we will be using the pandas-datareader
library. To use this, we need to install the library by typing pip install pandas-datareader
at the Command Prompt.
Once you install the package, we will be reading from various financial
APIs, such as IEX and Morningstar. First, we will import the pandas and
pandas_datareader libraries. While using this API, we will retrieve details
related to Apple, which is denoted as AAPL in the stock index:
In [5]:
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas_datareader import data as wb
Make sure you execute the second line, otherwise you may get an error. This
line works around a compatibility issue between newer versions of pandas and
older versions of pandas-datareader. We will be using the DataReader function to read from the financial
API. The basic syntax of this function is given as follows:
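Roughly (a sketch based on the pandas-datareader documentation, not the book's own listing):
# General call pattern of the DataReader function (a sketch):
# wb.DataReader(name, data_source=None, start=None, end=None)
prices = wb.DataReader(name='AAPL', data_source='iex',
                       start='2015-1-1', end='2018-1-1')  # 'prices' is a placeholder name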
The following are the details of some of the specified parameters:
Parameters Description
name This is the name of the dataset. Some data sources, such as Google or FRED, will accept a list of names.
data_source This is the name of the financial API, such as Morningstar or IEX.
start This parameter specifies where we want to start reading the financial data from.
end This parameter specifies where we should stop reading the financial data.
Let's now go ahead and see whether we can retrieve the data from the
financial API. The code to do this is as follows:
In [9]:
AAPL_IEX = wb.DataReader('AAPL', data_source='iex', start='2015-1-1')
Out[9]:
5y
The following code helps us to see the top five records of the AAPL_IEX
dataframe:
In [10]:
AAPL_IEX.head()
In the preceding code, our goal was to extract data from the IEX data source
about AAPL from January 1, 2015. The DataReader function allows us to do
this in one row. The wb alias uses this DataReader function and we specify three
important parameters. The first one is the ticker of the Apple stock, AAPL.
Second, we select the data source, IEX. Finally, we specify the start
date, which is January 1, 2015. With respect to these parameters, we get the
preceding output.
We can use the same function to retrieve details about the stock price of
Google using the GOOGL ticker, as shown in the following code block:
In [12]:
GOOGL_IEX = wb.DataReader('GOOGL', data_source='iex', start='2015-1-1')
Out[12]:
5y
This is as follows:
In our case, this will be $116 minus $105, divided by $105. This gives us a
10.5% rate of return. This computation of the rate of return is called
a simple rate of return. If we assume that Apple paid a $2 dividend at the
end of the year, the rate of return calculation becomes the following:
We can calculate the logarithmic return of the investment as follows:
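The formulas referred to above, written out explicitly (reconstructed from the surrounding description; the $105
purchase price and $116 selling price are the values used in the text), are:
\text{simple return} = \frac{P_1 - P_0}{P_0} = \frac{116 - 105}{105} \approx 10.5\%
\text{simple return with dividend} = \frac{P_1 + D - P_0}{P_0} = \frac{116 + 2 - 105}{105} \approx 12.4\%
\text{log return} = \ln\left(\frac{P_1}{P_0}\right) = \ln\left(\frac{116}{105}\right) \approx 10\%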
The value of 10% was attained after multiplying the result by 100. Note
that when mathematicians write log x, they usually mean log_e x, also written as ln x. The
reason we use the logarithm is explained next.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Then, we are going to read the Microsoft stock prices dataset. The dataset is available
in the GitHub repository of this chapter: https://github.com/PacktPublishing/Hands-on-Python-for-Finance/tree/master/Chapter%204.
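The cell that actually reads the file is not shown; assuming the MSFT_stock.csv file used later in the chapter, it
would look roughly like this:
# Read the Microsoft stock prices dataset (a sketch of the missing cell)
MSFT = pd.read_csv('MSFT_stock.csv', index_col='Date')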
The following code helps us to see the top five records of the MSFT dataframe:
In [4]:
MSFT.head()
The following code helps us to see the tail, or the last five records, of the MSFT
dataframe:
In [5]:
MSFT.tail()
As discussed in the previous section, the simple rate of returns of the security is as
follows:
The Python code for calculating the simple rate of returns is shown here:
In [6]:
MSFT['simple_return'] = (MSFT['Close'] / MSFT['Close'].shift(1)) - 1
print (MSFT['simple_return'].head(10))
Out[6]:
Date
1999-12-31 NaN
2000-01-03 -0.001606
2000-01-04 -0.033780
2000-01-05 0.010544
2000-01-06 -0.033498
2000-01-07 0.013068
2000-01-10 0.007291
2000-01-11 -0.025612
2000-01-12 -0.032571
2000-01-13 0.018901
Name: simple_return, dtype: float64 of
Here, we are calculating the simple returns on a day-to-day basis using the shift
function. The value we provided is 1. The data series shown in the output is as
expected: it exhibits the percentage daily change of the closing price. On most days, the
number is lower than 1%. Significant movements of a company's stock price are not an
everyday occurrence. Note that the first value of the series is not a number (NaN).
This makes sense, as there is no lag for our first observation.
Let's plot these transformations using the matplotlib library. Before plotting it, however,
we have to convert the index, which is in string format, to datetime, using pandas, as
follows:
In [13]:
MSFT.index=pd.to_datetime(MSFT.index)
Then, we will plot the simple returns field calculated for the security, as shown here:
In [14]:
MSFT['simple_return'].plot(figsize=(9,5))
plt.show()
However, an investor who is interested in buying a stock and holding it in the long run
is mainly interested in the average rate of return that the stock will have. For this
reason, we calculate the mean return of the MSFT throughout the period under analysis.
To do this, we will apply a mean function that calculates the average daily rate of returns.
This function is available in pandas, as in the following code:
In [16]:
MSFT_average_return=MSFT['simple_return'].mean()
MSFT_average_return
Out[16]:
0.00027
The output is a really small number, much smaller than 1%, which makes it very
difficult to interpret. We might prefer to find the average annual rate of return. The
data that we have extracted, however, is not composed of 365 days observations per
year. It excludes non-trading days, such as Saturdays, Sundays, and bank holidays. The
number of trading days actually comes to between 250 and 252 days. For now, let's use
the number 250. The next step is to multiply the average daily return by 250, which will
give us a close approximation of the actual average return per year. This value will be
easier to understand than the previous one. The code to do this is shown as follows:
In [17]:
MSFT_average_return=MSFT['simple_return'].mean()*250
MSFT_average_return
Out[17]:
0.06820496450977334
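The percentage in the next output was presumably produced by converting this value along the following lines
(the exact cell is not shown in the text):
# Express the annualized average return as a percentage (a sketch of the elided cell)
print(str(MSFT_average_return * 100) + '%')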
Out[19]:
6.820496450977334%
In this section, we looked at how to estimate the simple market returns of a given stock,
how to plot it, and how to turn this into a meaningful value that is easy to understand
and interpret.
Calculating a security's rate of
return using logarithmic return
Just as a reminder, the logarithm rate of return is given by the following
formula:
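Written out (reconstructed from the earlier discussion rather than copied from the book's figure):
\text{log return} = \ln\left(\frac{P_t}{P_{t-1}}\right) = \ln(P_t) - \ln(P_{t-1})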
Let's see how we can implement this with Python. We will be using the same
libraries as we did for simple returns: pandas, matplotlib, and numpy. We will
read the same Microsoft stock prices dataset and apply the preceding log
formula using Python, as shown here:
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
MSFT = pd.read_csv('MSFT_stock.csv', index_col = 'Date')
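The cell that applies the formula and plots it is not shown; a sketch, following the same pattern used later in
the chapter for the security_returns variable, would be:
# Apply the log return formula to the closing prices and plot it (a sketch)
MSFT['log_return'] = np.log(MSFT['Close'] / MSFT['Close'].shift(1))
MSFT.index = pd.to_datetime(MSFT.index)
MSFT['log_return'].plot(figsize=(9,5))
plt.show()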
When we plot the newly obtained data on a graph, we see that the log rate of
return ostensibly resembles the simple rate of return. However, when we
calculate the average daily and annual log returns, we obtain a percentage
that is significantly smaller than the one we obtained for simple returns.
Let's calculate the average mean and the annual log return:
In [6]:
log_return_d = MSFT['log_return'].mean()
log_return_d
Out[6]:
8.45835836828664e-05
The following code is used to calculate the annual log returns, considering
250 as the number of working days in a year:
In [8]:
log_return_a = MSFT['log_return'].mean() * 250
log_return_a
Out[8]:
0.0211458959207166
Out[9]:
2.11458959207166%
Consider the following two sets of annual returns:
In one year, the first stock earned 14%. In the following year, it earned
16%. In the final two years observed, it earned 13% and 17%,
respectively.
In one year, the second stock earned 50%. In the following two years, it
earned -20% each year. In the final year observed, it earned 50%.
There is a big difference between these two sets of data. In the first case,
you can be certain that your money will earn an amount that is more or less
in line with what you expect. Things might be slightly better or slightly
worse, but the rate of return will always be between 13% and 17%. In the
second set of data, however, although the average return is the same, there
is a huge variability from one year to the next. An investor will be unsure
about what might happen next. If they invested their money over the second
and third observed years, they would have lost 40% of their initial
investment.
This shows that variability plays an important role in the world of finance.
It is the best measure of risk. The stock market is volatile and is likely to
surprise investors, both positively and negatively. Investors, however, don't
like surprises and are much more sensitive to the possibility of losing their
initial investment. Most people prefer to have a good idea about the rate of
return they can expect from a security, or a portfolio of securities, and do
their best to reduce the risk that they are exposed to. Our goal is to measure
the risk faced by investors and try to reduce it as much as possible.
The variance is equal to the sum of the squares of the differences
between each data point x and the mean, divided by the total number of data
points minus one. If we take the square root of the variance, we get the
standard deviation of this sample of observations:
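In symbols (reconstructed from the description above):
s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{s^2}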
Let's go ahead and calculate the stock's variance and standard deviation
using Python in the next section.
Calculating the risk of a security in
Python
In this section, we will look at how to calculate a security's risk. We will
be using the adjusted close prices of MSFT and AAPL, and we will use the
variance and the standard deviation. First, we will import the same libraries:
numpy, pandas, and matplotlib:
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
After that, we will examine the behavior of the two stocks over the past 17
years by retrieving data from December 31, 1999:
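The cell that loads the data is not shown; a sketch, assuming a CSV file (the filename here is hypothetical)
containing the adjusted close prices of both tickers, would be:
# Load the adjusted close prices of MSFT and AAPL (the filename is hypothetical)
security_data = pd.read_csv('MSFT_AAPL_2000_2017.csv', index_col='Date')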
In [3]:
security_data.index=pd.to_datetime(security_data.index)
In [4]:
security_data.head()
            MSFT       AAPL
Date
1999-12-31  38.965767  3.303425
2000-01-03  38.903194  3.596616
2000-01-04  37.589046  3.293384
2000-01-05  37.985374  3.341579
2000-01-06  36.712940  3.052405
The standard deviation of a company's returns can also be called the risk or
volatility. A stock whose returns show a large deviation from its mean is
said to be more volatile.
Let's take a look at which company's stocks are riskier or more volatile.
First, we take the logarithmic returns, because we will examine each
company separately in the given timeframe. This approach will tell us more
about the behavior of the stock:
In [7]:
security_returns = np.log(security_data / security_data.shift(1))
We can use the head function to see the top 20 records, as shown in the
following code:
In [9]:
security_returns.head(20)
            MSFT       AAPL
Date
1999-12-31  NaN        NaN
2000-01-03  -0.001607  0.085034
2000-01-04  -0.034364  -0.088078
2000-01-05  0.010489   0.014528
2000-01-06  -0.034072  -0.090514
2000-01-07  0.012983   0.046281
2000-01-10  0.007264   -0.017745
2000-01-11  -0.025946  -0.052505
2000-01-12  -0.033114  -0.061847
2000-01-13  0.018725   0.104069
2000-01-14  0.040335   0.037405
2000-01-18  0.026918   0.034254
2000-01-19  -0.074817  0.024942
2000-01-20  -0.009390  0.063071
2000-01-21  -0.021455  -0.019461
2000-01-24  -0.024391  -0.046547
2000-01-25  0.015314   0.054934
2000-01-26  -0.034007  -0.018545
2000-01-27  -0.006309  -0.001703
2000-01-28  -0.005076  -0.079191
We store this data in a variable called security_returns. This newly
created variable has two columns, containing the log returns of
MSFT and AAPL, respectively. This allows us to obtain the mean and the
standard deviation of the two stocks from the dataframe.
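The cell producing the next output is not shown; it is presumably the daily mean of the MSFT column:
# Daily average log return of MSFT (a sketch of the elided cell)
security_returns['MSFT'].mean()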
Out[10]:
0.00011508712580239439
From this function, we obtain a small number that equals the daily average
return. Let's analyze this value by multiplying it by the number of trading
days in a year, 250, as follows:
In [11]:
security_returns['MSFT'].mean()*250
Out[11]:
0.028771781450598596
The output obtained is a value just under 3%, which is the annual rate of
return. The same Pythonic logic must be applied to calculate the volatility
of the company's stock. The method that we will be using is
called standard deviation:
In [12]:
security_returns['MSFT'].std()
Out[12]:
0.019656948849403607
After calculating the standard deviation, we will multiply the output by the
square root of 250, the number of trading days. We use the square root of
250 because the variance scales with the number of trading days, and the
standard deviation is the square root of the variance:
In [18]:
security_returns['MSFT'].std()*250**0.5
Out[18]:
0.31080365106770774
We then repeat the same procedure for AAPL; we get a higher mean and a
higher volatility:
In [14]:
security_returns['AAPL'].mean()
Out[14]:
0.000864342049190355
The following code helps us to calculate the annual rate of return:
In [15]:
security_returns['AAPL'].mean()*250
Out[15]:
0.21608551229758874
In [16]:
security_returns['AAPL'].std()
Out[16]:
0.027761934312069386
In [19]:
security_returns['AAPL'].std()*250**0.5
Out[19]:
0.4389547233905951
It will be easier for us to interpret the results if we quote the two annual
means and the two standard deviations next to each other. To do this, we can
print the annual means and standard deviations of the two stocks, as shown in the
following code:
In [20]:
print(security_returns['MSFT'].mean()*250)
print(security_returns['AAPL'].mean()*250)
In [23]:
security_returns[['MSFT','AAPL']].mean()*250
Out[23]:
MSFT 0.028772
AAPL 0.216086
dtype: float64
In [24]:
security_returns[['MSFT','AAPL']].std()*250**0.5
Out[24]:
MSFT 0.310804
AAPL 0.438955
dtype: float64
Stocks with a higher expected return often attract more buyers. The
Apple rate of return is considerably higher, but this comes at the expense of
higher volatility.
Portfolio diversification
In this section, we will talk about one of the most important concepts in
finance: the relationship between financial securities. It is reasonable to
expect the prices of shares in a stock exchange to be influenced by the same
factors. The most obvious example of this is the development of the
economy. In general, favorable macro-economic conditions facilitate the
business of all companies. When people have jobs and money in their
pockets, they will spend more. Companies benefit from this as their
revenues increase.
Technology changes rapidly and any sector can face problems. If this is the
case, we would still have the Walmart share, which would not suffer in the
same way. The same concept is valid for the retail sector: if Walmart isn't
performing as well as expected, we'd still have our Facebook share, which
operates in a different industry.
By buying the shares of two companies operating in the same industry, our
portfolio will be exposed to an excessive risk for the same level of expected
return. There is clearly a relationship between the prices of different
companies. It is very important to understand what causes this relationship
and how to use this measurement to build optimal investment portfolios.
Covariance and correlation
Now that we know that it is reasonable to expect a relationship between the
returns of different stocks, we have to learn how to quantify this
relationship. Let's take a look at an example about the factors that determine
the price of a house. One of the main factors is the size of the house: the
larger it is, the more expensive it is. There is clearly a relationship between
these two factors. Statisticians use the term correlation to measure this
relationship. The output of this correlation calculation lies in the
interval from -1 to +1.
To understand this concept better, let's take a look at the formula that allows
us to calculate the covariance between two variables:
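Reconstructed from the description that follows:
\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}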
Here, x̅ is the mean of factor x (size) and ȳ is the mean of factor y (price). n
is the number of records, which contains the combination of x and y. The
correlation coefficient measures the relationship between the two variables.
If the covariance is more than zero, the two variables move in the same
direction.
If the covariance is less than zero, the two variables move in the
opposite direction.
If the covariance is equal to zero, the two variables are uncorrelated (there is no linear relationship between them).
The covariance between a variable and itself is the variance of that variable.
Along the main diagonal, we have the variances of different variables. The
rest of the table should be filled with the covariances between them. In our
case, the variables are the prices of two stocks. We will therefore expect a
two-by-two covariance matrix with the variances of each stock along the
main diagonal, and the covariance between the two stocks displayed in the
other two cells. In Python, there is no need to do mathematical calculations
manually. The var() method calculates the variance of the object for us. We
will use this function to calculate the variances of the stock prices of
Microsoft and Apple.
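The cell producing the next output is not shown; following the pattern of the Apple cell below, it would be:
# Daily variance of the MSFT log returns (a sketch of the elided cell)
MSFT = security_returns['MSFT'].var()
MSFT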
Out[8]:
0.00038639563806806983
The following code helps us to calculate the variance of the Apple stock:
In [9]:
AAPL = security_returns['AAPL'].var()
AAPL
Out[9]:
0.0007707249967476555
After that, we make these values annual, as shown in the following code
block:
In [10]:
MSFT = security_returns['MSFT'].var() * 250
MSFT
Out[10]:
0.09659890951701745
In [11]:
AAPL = security_returns['AAPL'].var()*250
AAPL
Out[11]:
0.19268124918691387
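The two covariance matrices shown next were presumably produced with the cov() method, along the
following lines (the variable names are assumptions):
# Daily covariance matrix of the two return series (a sketch)
cov_matrix = security_returns.cov()
cov_matrix
# Annualized covariance matrix (a sketch)
cov_matrix_annual = security_returns.cov() * 250
cov_matrix_annual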
The daily covariance matrix is as follows:
      MSFT      AAPL
MSFT  0.000386  0.000218
AAPL  0.000218  0.000771
The annualized covariance matrix is as follows:
      MSFT      AAPL
MSFT  0.096599  0.054592
AAPL  0.054592  0.192681
Let's examine this matrix cell by cell, starting from the top-left corner. The
top-left corner has the same values as that of the MSFT annualized variance
value, which was computed by the var() function mentioned previously. The
cov() method is useful because it allows us to obtain the other numbers
easily.
Let's now calculate the correlation with the help of the corr() method, which
is available in pandas, as shown here:
In[14]:
corr_matrix = security_returns.corr()
corr_matrix
The following are the topics that we will cover in this chapter:
Sharpe ratio
Portfolio allocation
Portfolio optimization
Markowitz portfolio optimization theory
Obtaining the efficient frontier in Python – part 1
Obtaining the efficient frontier in Python – part 2
Obtaining the efficient frontier in Python – part 3
Technical requirements
In this chapter, we will be using Jupyter Notebook for coding purposes. We
will be using both the pandas library and the quandl library. In order to install
the quandl library, open the Anaconda Command Prompt and type the
following command:
conda install -c anaconda quandl
For the portfolio, we will look at the following statistics:
Daily returns: The percentage returned from one day to the next for each stock
Cumulative return: The amount returned over an entire time period
Average daily returns: The mean of the daily returns
Standard deviation of daily returns: The standard deviation of the daily returns
Another critical statistical measure is the Sharpe ratio, which is named after
William Sharpe. Let's take a look at what the Sharpe ratio is and why it is
necessary.
In Case 2, both portfolios return the same value, which is 5%. If you were to
choose which portfolio was better in Case 2, you might opt for the flatter
line, because this one is a lot more stable and has less volatility. The other
line from Case 2 is much more volatile, which means it has a higher
risk. Similarly, in Case 3, we might say that the stock with a 6% return is
more stable, while the stock that has a 12% return is more volatile. We have
to decide whether to go for the riskier stock, which will potentially bring
higher gains, or to go for the stock with the lower returns but with lower
volatility.
The Sharpe ratio allows us to use math to quantify relationships between the
mean daily returns and the volatility (or the standard deviation) of the daily
returns. It is a measure for calculating a risk-adjusted return and has become
the industry standard for this calculation. It was developed by Nobel
Laureate William F. Sharpe. The mathematical formula of the Sharpe ratio is
as follows:
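Written out (a standard statement of the formula, consistent with the description that follows):
\text{Sharpe ratio} = \frac{R_p - R_f}{\sigma_p}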
Here, Rp is the expected portfolio return, Rf is the risk-free return, and σp is
the standard deviation of the portfolio.
Portfolio allocation and the Sharpe ratio with code
Let's now take a look at a few examples using Python. In this section, we are going to use the stock datasets of a
range of tech companies, including Apple, Cisco, IBM, and Amazon. First, we will import the necessary libraries,
pandas and quandl, which will help us to retrieve the stock data of a company from an API.
Quandl is a platform for financial, economic, and alternative data that serves investment professionals. Quandl
sources data from over 500 publishers. All Quandl's data is accessible through an API. This is possible through
packages for multiple programming languages, including R, Python, Matlab, Maple, and Stata.
In[4]:
import pandas as pd
import quandl
We will retrieve the past five years' worth of stock data for the four companies using the quandl API, as follows:
In[7]:
start_date = pd.to_datetime('2013-01-01')
end_date = pd.to_datetime('2018-01-01')
In[8]:
aapl_stock = quandl.get('WIKI/AAPL.11',start_date=start_date,end_date=end_date)
cisco_stock = quandl.get('WIKI/CSCO.11',start_date=start_date,end_date=end_date)
ibm_stock = quandl.get('WIKI/IBM.11',start_date=start_date,end_date=end_date)
amzn_stock = quandl.get('WIKI/AMZN.11',start_date=start_date,end_date=end_date)
We have to use the quandl.get() function to extract the stock information. After extracting the information, we can
save it as a CSV file, as follows:
In[11]:
aapl_stock.to_csv('AAPL_CLOSE')
cisco_stock.to_csv('CISCO_CLOSE')
ibm_stock.to_csv('IBM_CLOSE')
amzn_stock.to_csv('AMZN_CLOSE')
Please refer to the documentation of the quandl library for more information: https://www.quandl.com/tools/python.
We also have the CSV files that are provided in the GitHub repository. We can use this data and read the CSV file
if Quandl doesn't work due to firewall issues:
aapl_stock = pd.read_csv('AAPL_CLOSE',index_col='Date',parse_dates=True)
cisco_stock = pd.read_csv('CISCO_CLOSE',index_col='Date',parse_dates=True)
ibm_stock = pd.read_csv('IBM_CLOSE',index_col='Date',parse_dates=True)
amzn_stock = pd.read_csv('AMZN_CLOSE',index_col='Date',parse_dates=True)
First, we will review some important metrics. Let's see whether we can get a value for the cumulative daily returns.
We will do this with respect to all the stocks that we have extracted, as follows:
In[14]:
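# The body of this cell is not reproduced in the text. A sketch that creates the
# Normalize Return column referred to below (the 'Adj. Close' column name is an
# assumption based on the Quandl WIKI data) is:
for stock_df in (aapl_stock, cisco_stock, ibm_stock, amzn_stock):
    stock_df['Normalize Return'] = stock_df['Adj. Close'] / stock_df['Adj. Close'].iloc[0]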
In[15]:
aapl_stock.head()
We will now choose an allocation for each stock (for example, 30% of the portfolio in Apple) and multiply the Normalize Return column by these allocations, as follows:
In[21]:
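# The body of this cell is not reproduced in the text. The allocation percentages are
# not listed in this excerpt; 30% for Apple is implied by the $30,000 position mentioned
# below, and the other values are placeholders that sum to one with it.
for stock_df, allocation in zip((aapl_stock, cisco_stock, ibm_stock, amzn_stock),
                                (0.3, 0.3, 0.1, 0.3)):
    stock_df['Allocation'] = stock_df['Normalize Return'] * allocation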
A new Allocation column has been created with respect to the allocation of the portfolio.
The following code is used to display the top five records of the aapl_stock dataframe:
In[22]:
aapl_stock.head()
Now, suppose we have invested $100,000 in this portfolio. We can take this into account as follows:
In[23]:
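# The body of this cell is not reproduced in the text. A sketch that multiplies each
# stock's Allocation column by the total investment of $100,000 to create the
# Position Values column referred to below:
for stock_df in (aapl_stock, cisco_stock, ibm_stock, amzn_stock):
    stock_df['Position Values'] = stock_df['Allocation'] * 100000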
The following code helps us to see the top five records of the aapl_stock dataframe:
In[24]:
aapl_stock.head()
As you can see in the preceding output, Position Values specify how much money we allocate to the stocks. You can
also see that on January 2, 2013, we initially invested $30,000. The very next day, this value went down to
$29,621.10 and we continued to drop until the 8th. We can actually see how much money is in our portfolio as the
days go by.
Let's now create a larger portfolio dataframe that essentially has all of these positioned values for all of our stocks.
We are going to concatenate all the Position Values of all the stocks, as follows:
In[25]:
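# The body of this cell is not reproduced in the text. A sketch that concatenates the
# Position Values columns of all four stocks into one dataframe:
all_pos_vals = [aapl_stock['Position Values'], cisco_stock['Position Values'],
                ibm_stock['Position Values'], amzn_stock['Position Values']]
portfolio_val = pd.concat(all_pos_vals, axis=1)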
portfolio_val
From the preceding output, we can see how our $100,000 is distributed in this portfolio and how this changes over
time. We can see how much money we have in our entire portfolio. Currently, all the columns are named Position
Values. Let's rename the columns to give us a clear understanding of which column relates to which stock:
In[27]:
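# The body of this cell is not reproduced in the text. The exact column names used in
# the book are not shown; the ones below are placeholders.
portfolio_val.columns = ['AAPL Pos', 'CISCO Pos', 'IBM Pos', 'AMZN Pos']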
portfolio_val.head()
Let's calculate the sum of all the position values on a day-to-day basis and plot this on a graph. This will give us an
idea of how the total value of our portfolio changes over time. To do this, we will create a new column, Total Pos,
which will hold the total value of the portfolio on a day-to-day basis:
In[30]:
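# The body of this cell is not reproduced in the text. A sketch that sums the four
# position columns into the Total Pos column referred to above:
portfolio_val['Total Pos'] = portfolio_val.sum(axis=1)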
The following code helps us to see the top five records of the portfolio dataframe:
In[31]:
portfolio_val.head()
Let's plot a figure using Matplotlib for the Total Pos column and see the changes day by day:
In[32]:
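# The body of this cell is not reproduced in the text. A sketch of the plot
# (matplotlib is assumed to have been imported):
portfolio_val['Total Pos'].plot(kind='line', figsize=(10,8))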
The preceding graph gives us the total sum of all the money invested in the portfolio and the changes that happen
on a daily basis. If we need a better plotted diagram to further understand the changes that each company
undergoes on a daily basis, we can create one, as follows:
In[33]:
portfolio_val.drop('Total Pos',axis=1).plot(kind='line')
Consider the following two simple equations:
y = x^2
y = (2-x)^2
In the first equation, x = 0 is the value that minimizes y. In
the second equation, x = 2 is the value that minimizes it. This idea
of using a minimizer will allow us to build an optimizer. There are
mathematical ways of finding this minimum value. Usually, for complex
equations such as an optimizer, we can use the scipy library to do this math
for us.
In the context of our portfolio optimization, we actually want to maximize
our Sharpe ratio. Remember that we're trying to figure out which stock
allocation will give us the best Sharpe ratio. We want to implement a
negative (inverse) Sharpe ratio and minimize it; the minimum of the negative Sharpe ratio
corresponds to the best Sharpe ratio, just in reverse. In other words, we run a
minimizer in order to calculate the optimal allocation. We will use SciPy's
built-in optimization algorithms to calculate the optimal weight allocation
for our portfolio, optimizing it based on the Sharpe
ratio.
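A minimal sketch of this approach, using SciPy's minimize function with placeholder return and covariance
values (in the book's workflow these would come from the log returns of the four stocks above, and the
risk-free rate is assumed to be zero):
import numpy as np
from scipy.optimize import minimize

# Placeholder inputs: annualized mean returns and covariance matrix for four stocks
mean_returns = np.array([0.20, 0.10, 0.05, 0.30])
cov_matrix = np.diag([0.09, 0.04, 0.02, 0.16])

def neg_sharpe(weights):
    # Negative Sharpe ratio: minimizing this maximizes the Sharpe ratio
    ret = np.dot(weights, mean_returns)
    vol = np.sqrt(np.dot(weights.T, np.dot(cov_matrix, weights)))
    return -ret / vol

# The weights must sum to one, and each weight must lie between 0 and 1
constraints = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1},)
bounds = tuple((0, 1) for _ in range(4))
initial_guess = np.array([0.25, 0.25, 0.25, 0.25])

result = minimize(neg_sharpe, initial_guess, method='SLSQP',
                  bounds=bounds, constraints=constraints)
print(result.x)  # the weight allocation with the best Sharpe ratio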
In the next section, we are going to walk through these steps using Python.
We are also going to look at Markowitz portfolio optimization.
Markowitz portfolio optimization
According to Markowitz, investors shouldn't put all their eggs in one basket.
Markowitz proved the existence of an efficient set of portfolios that optimize
an investor's return according to the amount of risk they are willing to
accept. One of the most important concepts he developed was that
investments in multiple securities shouldn't be analyzed separately, but
should be considered as a portfolio. The financier must understand how
different securities in a portfolio interact with each other. Markowitz' model
is based on analyzing various risks and returns and finding the relationship
between them using various statistical tools. This analysis is then used to
select stocks in a portfolio in an efficient manner, which leads to much more
efficient portfolios. Individuals vary widely in their risk tolerance and asset
preferences. Their means, expenditures, and investment requirements vary
greatly. Because of this, portfolio selection is not a simple choice of any one
security or securities, but a subtle selection of the right combination of
securities.
Let's now think about a simple example. We're going to look at six different
portfolios with two different stocks. The allocations of the two stocks in
each portfolio are presented in the following screenshot:
The following quantitative values show the basic weightings along with
the expected returns of the portfolios:
From these figures, we can see that the first portfolio is made up of 100% of
stock A and 0% of stock B. The second portfolio is made up of 80% of stock
A and 20% of stock B. The third portfolio is made of 60% of stock A and
40% of stock B. The fourth portfolio is made up of 40% of stock A and 60%
of stock B. The fifth portfolio has 20% of stock A and 80% of stock B,
while the sixth portfolio is made up of 0% of stock A and 100% of stock B.
With respect to these allocations, we can calculate the expected return and
the standard deviation.
In general, the higher the expected return, the higher the standard deviation
or variance. The lower the correlation between the two stocks, the lower the risk for the investor.
Regardless of the risk of the individual securities in isolation, the total risk
of the portfolio of all securities may be lower if the covariance of their
returns is negative or negligible. In the next section, we will examine more
stocks and find the frontier using Python.
Obtaining the efficient frontier in
Python – part 1
In this section, we will learn how to calculate the efficient frontier of a group
of portfolios composed of two assets: Walmart and Facebook. These two
datasets can be found in the GitHub repository.
To start, import the necessary libraries and read the dataset, as follows:
In[3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
companies = ['WMT', 'FB']
df = pd.read_csv('Walmart_FB_2014_2017.csv', index_col='Date')
The following code helps us to see the top five records of the dataframe:
In[2]:
df.head()
df.index=pd.to_datetime(df.index)
In[17]:
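# The body of this cell is not reproduced in the text. A sketch that computes the
# daily logarithmic returns of both stocks:
log_returns = np.log(df / df.shift(1))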
The following code helps us to read the top five records of the log_returns
dataset:
In[20]
log_returns.head()
#Calculating Mean
log_returns.mean() * 250
# Calculating Covariance
log_returns.cov() * 250
# Calculating Correlation
log_returns.corr()
Note that the mean and the covariance have been multiplied by 250, because
there are roughly 250 trading days in a year. From the correlation matrix, we can see
that the two stocks are moderately correlated. The next step is to address
portfolio optimization from a coding perspective. First, we will create a
variable that will carry the number of assets in our portfolio. This variable is
important, as we will be using it in our formulas so they can respond to a
change in the number of assets that make up the portfolio. It is equal to the
number of elements in the asset list. We can obtain this number with the help
of the len function, as follows:
In[27]:
num_assets = len(companies)
In[28]
num_assets
Here, we are considering two assets: Facebook and Walmart. Remember that
the portfolio does not need to be equally weighted. Create a variable called
weights. Let this contain as many randomly-generated values as there are
assets in your portfolio. Don't forget that these values should be neither
smaller than 0 nor equal to or greater than 1.
The next step is to create two random variables using the random function
available in the numpy library:
In[30]:
arr = np.random.random(2)
arr
array([0.85216426, 0.73407425])
The following snippet helps us to add the two initialized values in the array:
In[31]:
arr[0]+arr[1]
1.5862385086575679
By randomly initializing two values, we sometimes find that the sum of
the initialized values is more than one. To fix this, we divide each weight by the
sum of the weights so that they always add up to one:
weights = np.random.random(num_assets)
weights /= np.sum(weights)
weights
array([0.50354655, 0.49645345])
This gives us the weights of the assets. Using this technique, the sum of the
values will always be equal to one. These steps are really important for the
next section.
Obtaining the efficient frontier in
Python – part 2
Let's push the analysis a few steps further. We will now write the formula for the
expected portfolio return. This is given by the sum of the weighted average log
returns, which is specified as follows:
In[33]:
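# The body of this cell is not reproduced in the text. A sketch of the expected annual
# portfolio return: the weighted sum of the mean log returns, annualized over 250 days.
np.sum(weights * log_returns.mean()) * 250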
0.1444881608787945
The following code will provide us with the expected portfolio variance and the
volatility:
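The cell producing the next value is not shown; a sketch consistent with the simulation loop further below
(the variable names are placeholders) is:
# Expected annual portfolio variance and volatility (a sketch)
pfolio_var = np.dot(weights.T, np.dot(log_returns.cov() * 250, weights))
pfolio_vol = pfolio_var ** 0.5
pfolio_vol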
0.18108103586950486
We will need the formulas for the return and the volatility in the simulation of the
portfolio's mean-variance combinations. We will now create a graph, where 1,000
mean-variance simulations will be plotted. Pay attention here: we are not
considering 1,000 different investments composed of different stocks, we are
considering 1,000 combinations of the same two assets, Walmart and Facebook. In
other words, we are simulating 1,000 combinations of their weight values. Among
these 1,000 combinations, we will probably have one portfolio composed of 1% of
Walmart stocks and 99% of Facebook stocks and vice versa. The idea is to compare
the two and see which one is more efficient. As mentioned, our goal is to create a
graph that visualizes the hypothetical portfolio returns versus the volatilities.
We will, therefore, need two objects that we can use to store this data. The portfolio
returns start as an empty list, which we want to fill with randomly-generated expected
returns. We do the same for the portfolio volatilities, filling both lists inside a for loop
that runs 1,000 times, as shown in the following code:
In[36]:
portfolio_returns = []
portfolio_volatilities = []
for x in range(1000):
    # Generate random weights that sum to one
    weights = np.random.random(num_assets)
    weights /= np.sum(weights)
    # Annualized expected return and volatility for this weight combination
    portfolio_returns.append(np.sum(weights * log_returns.mean()) * 250)
    portfolio_volatilities.append(np.sqrt(np.dot(weights.T, np.dot(log_returns.cov() * 250, weights))))
portfolio_returns, portfolio_volatilities

portfolio_returns = np.array(portfolio_returns)
portfolio_volatilities = np.array(portfolio_volatilities)
portfolio_returns, portfolio_volatilities

portfolios = pd.DataFrame({'Return': portfolio_returns, 'Volatility': portfolio_volatilities})
Let's check the head and the tail of the portfolios dataframe:
In[40]:
portfolios.head()
We now need to plot data on a graph with an x axis that corresponds to the
Volatility column and a y axis that corresponds to the Return column. An important
specification is what kind of graph we want to insert. In our case, we will need a
scatter plot. We will also adjust the figure size and provide some labels for the x
axis and the y axis:
In[42]:
portfolios.plot(x='Volatility',y='Return',kind='scatter',figsize=(10,6))
plt.xlabel('Expected Volatility')
plt.ylabel('Expected Return')
In the next chapter, we will discuss the capital-asset pricing model, which is
a model used to determine the theoretically-appropriate required rate of
return of an asset, to make decisions about adding assets to a well-
diversified portfolio.
The Capital Asset Pricing Model
In finance, the Capital Asset Pricing Model (CAPM) is a model that is
used to determine the rate of return of any assets; it helps us determine
whether we can add the assets to create a diversified portfolio. This model
usually divides the assets based on sensitivity and risk, which is represented
by β in the financial industry. It also considers the expected returns of the
market and theoretical risk-free assets.
The premise of CAPM is not that different from the one created by Markowitz.
Investors are risk-averse. They prefer earning a higher return but are also
cautious about the risks they might face and want to optimize their
portfolios in terms of both risk and return. Investors are unwilling to buy
anything other than an optimal portfolio that optimizes their expected
returns and standard deviation.
Rational investors will form their portfolios by considering both the risk-
free stocks and how much they are going to invest in the market
portfolio. This question really depends on how much they want to earn.
In the next section, we will discuss the relationship between the securities
of individuals and the market portfolio. This will move us one step closer to
learning how to build expectations about the prices of real-world assets.
The beta of securities
In this section, we will introduce the concept of beta, which is one of the
main pillars of the CAPM. Beta helps us to quantify the relationship
between the security and the overall market portfolio. Remember the
market portfolio is made up of all the securities in the market: securities
with low expected returns and securities with high expected returns. If there
is an economic crisis, it is reasonable to expect the prices of most assets that
make up the market portfolio to decrease and the market portfolio to
experience a negative rate of return of, for example, -5%. In this case,
investors can't protect themselves through diversification, because this is a
systemic risk.
However, some securities in the market portfolio are less risky. They have a
lower standard deviation and will decrease in value less than the market
standard. Let's imagine that we have stock A and stock B. Stock A is less
volatile than the market portfolio and so we only lose 3%. Stock A is an
example of a security in the market portfolio that is less risky in times of
volatility. The value of stock B, however, decreases by 7%, which is more
than stock A. We can therefore consider stock B to be more volatile than
stock A. Since stock B is volatile, when the economy recovers from the
crisis, the market portfolio will do well and stock B may earn us a higher
rate of return than stock A.
This example allows us to see that stocks might have a different behavior
regarding their overall market performance. Some stocks are safer and will
lose less and earn less than the market portfolio, while other stocks are
riskier and will do well when the economy flourishes, but will perform
poorly in times of crisis. This is precisely where beta (β) comes in
handy. Beta allows us to measure the relationship between a stock and the
market portfolio. It can be calculated as the covariance between the stock
and the market divided by the variance of the market:
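In symbols (a standard statement of the formula, consistent with the description here):
\beta_{stock} = \frac{\mathrm{cov}(r_{stock}, r_{market})}{\mathrm{var}(r_{market})}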
The preceding formula measures the market risk, which cannot be avoided
through diversification. The riskier a stock, the higher its beta.
Let's look at a few important points that will help us understand beta:
A beta of 0 means the stock has no relationship with the overall market
A beta between 0 and 1 means the stock is defensive; it tends to lose less than the market when the market falls
A beta of 1 means the stock moves in line with the market
A beta greater than 1 means the stock is aggressive; it is riskier than the market, gaining more when the market rises and losing more when it falls
Initially, we will import all the libraries that are required, such as
numpy and pandas. We import the dataset of the Microsoft and S&P 500 as
follows. The S&P 500 index is given by the ticker ^GSPC. The S&P 500 is a
capitalization-weighted index, and is associated with many ticker symbols,
such as ^GSPC, INX, and $SPX, depending on market or website. The S&P
500 differs from the Dow Jones Industrial Average and the NASDAQ
Composite index because of its diverse constituency and weighting
methodology.
In [1]:
import numpy as np
import pandas as pd
data = pd.read_csv('MSFT_S&P.csv', index_col = 'Date')
data
We need two values: the covariance between Microsoft and the S&P 500,
and the variance of the S&P 500. We start by calculating the logarithmic
returns and storing them in a sec_returns variable, as follows:
In [2]:
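# The cells computing the returns and the covariance are not reproduced in the text.
# A sketch (the *250 annualization is inferred from the magnitude of the covariance
# value shown below):
sec_returns = np.log(data / data.shift(1))
cov = sec_returns.cov() * 250
cov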
The iloc method allows us to obtain the covariance between the MSFT and
S&P market as a float, as follows:
In [4]:
cov_with_market = cov.iloc[0,1]
cov_with_market
Out[4]:
0.01820843191039893
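The next value is the annualized variance of the market returns; the cell producing it is not shown, but it
presumably looked like this:
# Annualized variance of the S&P 500 returns (a sketch of the elided cell)
market_var = sec_returns['^GSPC'].var() * 250
market_var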
0.016360592699269063
Finally, we calculate the beta of MSFT by dividing the covariance with the market by the variance of the market, as in the formula discussed in the previous section:
In [6]:
MSFT_beta = cov_with_market / market_var
MSFT_beta
Out[6]:
1.112944515219942
The beta value that we get is greater than one. This means that the MSFT
stock is riskier than the market. It might do better than the market when the
economy flourishes and lose more when it goes down.
The next step is to verify whether our calculations are correct. We are
currently using data from Yahoo Finance. Let's now go to Yahoo Finance to
check whether the beta value we obtained is credible. We can allow
ourselves a 2 - 3% difference, because the data and the estimation methods
may differ slightly, but the difference shouldn't be any bigger than that.
Once we go to the Yahoo Finance website, we can search for the MSFT stock and check the quoted beta coefficient, as shown in the following screenshot:
We can see that the beta value calculated for three years by Yahoo Finance is
1.09, which is similar to the value we obtained. This validates the beta
coefficient that we calculated.
The CAPM formula
In the previous sections, we imagined that we were living in a world where
all investors are rational, risk-averse, and willing to optimize their
investment portfolios. We introduced the reader to the concept of a market
portfolio and a risk-free asset. We also stated that investors make their
decisions based on their risk appetite: those seeking higher expected returns
will allocate a greater portion of their money to the market portfolio and
less to the risk-free assets. Finally, we introduced the concept of beta, which
measures how securities are expected to perform with respect to the entire
market. We are now going to introduce the capital asset pricing model. The formula of the capital asset pricing model is as follows:
ri = rf + βim (rm − rf)
Let's take a look at how the CAPM formula works. We start with a risk-free
rate of return (rf), which is the bare minimum an investor would accept in
order to buy a security. Since the investor must be compensated for the risk
they're taking, we need the equity premium component given by the
expected return of the market portfolio minus the risk-free rate. It is
reasonable to expect this incentive since the security has a risk attached to
it. This is the most widely used model by practitioners.
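As a minimal illustration of how the formula is applied in Python (the numbers below are illustrative assumptions, not values from our dataset):
rf = 0.025    # assumed risk-free rate (illustrative)
beta = 1.1    # assumed beta of the stock (illustrative)
rm = 0.075    # assumed expected return of the market portfolio (illustrative)
expected_return = rf + beta * (rm - rf)   # CAPM: ri = rf + beta * (rm - rf)
print(expected_return)                    # approximately 0.08, that is, 8%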
Calculating the expected return of
a stock (using the CAPM)
We have covered a lot of concepts related to the CAPM. We have also
discussed the beta coefficient and the CAPM formula. Let's now look at
how we can compute the expected returns of a stock using the CAPM. First
of all, we need to calculate the beta coefficient before we can apply it to the
CAPM formula. Here, we will use Python and implement all the formulas
of CAPM using Python. We will be using the same dataset that we used to
calculate the beta coefficient and follow the same process as before:
In [2]:
import numpy as np
import pandas as pd
Following the same steps as in the previous section, we recompute the beta value, which is stored in the MSFT_beta variable. We then apply the CAPM formula with a risk-free rate of 2.5% and an equity risk premium of 5%:
In [3]:
MSFT_er = 0.025 + MSFT_beta * 0.05
MSFT_er
Out[3]:
0.0806472257609971
The value we get is 8.06%. This is the return on the investment we might
expect when buying the MSFT stock. This technique can be applied to any listed company that you are interested in.
Applying the Sharpe ratio in
practice
Now that we know how to calculate a stock's return and assess its risk
through its variance and standard deviation, we're ready to take another
look at the Sharpe ratio. We discussed this topic in the previous chapter, but
we will go into more detail in this section. As mentioned earlier, rational
investors want to maximize their returns and are risk averse, which means
they want to minimize the risk they face and invest in less volatile
securities. They want less uncertainty and more clarity about an
investment's rate of return. Rational investors are afraid that a risky
investment would cause significant losses and want to invest their money in
securities that are less volatile. It becomes obvious that the two dimensions
must be combined.
We have to subtract the risk-free rate from the expected rate of return of the
stock. We calculated the expected returns in a previous section and got a
MSFT_er value of 0.0806. We can use the MSFT_er variable directly as the expected rate of return of Microsoft (rmsft). We also assume the risk-free rate (rf) to be
2.5%, which is equal to 0.025. To calculate the standard deviation of MSFT
stock, we will use the std() method to obtain the volatility and annualize it
by multiplying it by the square root of 250. Reusing the sec_returns log returns computed earlier, the complete code to calculate the Sharpe ratio can be written along the following lines:
In [2]:
Sharpe = (MSFT_er - 0.025) / (sec_returns['MSFT'].std() * 250 ** 0.5)
Sharpe
Out[2]:
0.23995502423552612
The output is approximately 24%. We can use this ratio when we want to
compare different stocks in stock portfolios.
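As a reusable sketch (this helper function is our own illustration, not code from the chapter's notebook), the same calculation can be wrapped as follows:
def sharpe_ratio(expected_return, risk_free_rate, annual_stdev):
    # Sharpe ratio = (expected return - risk-free rate) / annualized standard deviation
    return (expected_return - risk_free_rate) / annual_stdev

# Using the approximate values obtained above for MSFT:
print(sharpe_ratio(0.0806, 0.025, 0.232))   # roughly 0.24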
Measuring alpha and verifying the
performance of a portfolio
manager
So far, we have learned about regression, the Markowitz efficient frontier of
investment portfolios, and William Sharpe's capital asset pricing model. We
have also learned how to calculate a portfolio's risk, return variance, and
covariance. We know what beta is and how it can be interpreted in the
context of Sharpe's capital market line.
In this section, we'll add another tool to our arsenal of financial knowledge.
We'll learn how to interpret the intercept of the CAPM model, alpha (α). In
the world of finance and investments, alpha is often thought of as a measure
of the performance of a fund manager.
Given that the beta (βim) multiplied by the equity risk premium (rm − rf) gives us the compensation for the risk that's been taken with the investment,
alpha shows us how much we get without bearing the extra risk. A great
portfolio manager is someone who outperforms the market and can achieve
a high alpha. Conversely, a poor investment professional may obtain a
negative alpha, meaning they underperformed with respect to the market.
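To make this concrete, alpha and beta can be estimated jointly by regressing a stock's excess returns on the market's excess returns. The following sketch uses randomly generated data purely for illustration; the numbers and variable names are assumptions, not the book's dataset:
import numpy as np
from scipy import stats

# Illustrative excess returns: (ri - rf) for the stock and (rm - rf) for the market
market_excess = np.random.normal(0.0003, 0.01, 250)
stock_excess = 0.0001 + 1.2 * market_excess + np.random.normal(0, 0.005, 250)

# The slope of the regression is beta; the intercept is alpha
beta, alpha, r_value, p_value, std_err = stats.linregress(market_excess, stock_excess)
print(alpha, beta)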
Type of investing: Description

Passive investing: This consists of buying a portfolio of assets and holding it for a long time, regardless of short-term macroeconomic developments.

Active investing: This refers to frequent trading based on the expectations of macroeconomic and company-specific developments.

Arbitrage trading: This is where we find pricing discrepancies in the market and exploit them in order to make a profit without assuming additional risk.

Value investing: This is where we invest in specific companies, in the hope that they will outperform their peers.
Different use cases of the CAPM using the scipy
library
In this section, we will take a look at how we can implement the CAPM using the scipy library.
First, we will import pandas and pandas_datareader, which will be used to pull the dataset from the API, as follows:
In [17]:
import pandas as pd
In [13]:
import pandas_datareader.data as web
After that, we import the data with the help of pandas_datareader. The ticker we have provided is SPY, which is an ETF; in this context, it stands in for the overall market index:
In [14]:
spy_etf = web.DataReader('SPY','google')
We use the info function to see the details of the columns extracted, as follows:
In [21]:
spy_etf.info()
Out[21]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1897 entries, 2010-01-04 to 2017-07-18
Data columns (total 5 columns):
Open 1879 non-null float64
High 1879 non-null float64
Low 1879 non-null float64
Close 1897 non-null float64
Volume 1897 non-null int64
dtypes: float64(4), int64(1)
memory usage: 88.9 KB
We can also see the first five rows of the dataset retrieved from the API using the head() function:
In [22]:
spy_etf.head()
If you take a look at the SPY ETF data, we have about 1,900 entries, each of which has Open, High, Low, Close, and Volume columns. These prices track the index rather than being the S&P 500 itself; SPY is an exchange-traded fund that represents the S&P 500.
Next, we define the start and end dates for the period we want to analyze:
start = pd.to_datetime('2010-01-04')
end = pd.to_datetime('2017-07-18')
Now, let's say that our portfolio strategy is to invest entirely in Apple stock. We will use pandas_datareader and retrieve the Apple stock with the preceding start and end dates from Google Finance, as follows:
In [24]:
aapl = web.DataReader('AAPL','google',start,end)
Once we have executed the preceding statement, we can check out the head of the dataframe, as follows:
In [27]:
aapl.head()
The CAPM basically states that there should be a relationship between the performance of Apple's stocks
and the overall market performance. Here, the market data is the data that we extracted for SPY, which is
an Exchange-traded Fund (ETF). Let's visualize this data. Import the matplotlib library, as follows:
In [28]:
import matplotlib.pyplot as plt
%matplotlib inline
We will be plotting the Close column of the aapl dataframe along with the Close column of the spy_etf dataframe, using a figure size of (10, 8), as follows:
In [29]:
aapl['Close'].plot(label='AAPL',figsize=(10,8))
spy_etf['Close'].plot(label='SPY Index')
plt.legend()
The cumulative returns for Apple and the SPY ETF can be calculated as follows:
In [31]:
aapl['Cumulative'] = aapl['Close']/aapl['Close'].iloc[0]
spy_etf['Cumulative'] = spy_etf['Close']/spy_etf['Close'].iloc[0]
We basically divide all the Close values with the close price on the very first day. We create a new column
called Cumulative in both the aapl and spy_etf dataframes. We now plot these cumulative values, as follows:
In [33]:
aapl['Cumulative'].plot(label='AAPL',figsize=(10,8))
spy_etf['Cumulative'].plot(label='SPY Index')
plt.legend()
plt.title('Cumulative Return')
Here, we use the pct_change() function, where we provide one day as the parameter and create a new column called Daily Return for both aapl and spy_etf. Next, we produce a scatter plot of the daily return values of aapl and spy_etf to check whether there is any correlation. Here, we also provide a transparency value, alpha, which is equal to 0.3:
In [45]:
aapl['Daily Return'] = aapl['Close'].pct_change(1)
spy_etf['Daily Return'] = spy_etf['Close'].pct_change(1)
plt.scatter(aapl['Daily Return'],spy_etf['Daily Return'],alpha=0.3)
From the preceding output, we can see that there is some correlation. We can also plot Daily Returns with
the help of the histogram to see the highest frequency returns, as follows:
In [46]:
aapl['Daily Return'].hist(bins=100)
spy_etf['Daily Return'].hist(bins=100)
Let's calculate the beta and alpha values. The following code uses tuple unpacking, since stats.linregress() returns five different values: the slope, the intercept, the r value, the p value, and the standard error. We first import the stats module from scipy and then call the stats.linregress() method, as follows:
from scipy import stats
beta,alpha,r_value,p_value,std_err = stats.linregress(aapl['Daily Return'].iloc[1:],spy_etf['Daily Return'].iloc[1:])
beta
Out[38]:
0.19423150396392763
In [39]:
alpha
Out[39]:
0.00026461336993233316
In [40]:
r_value
Out [40]:
0.33143080741409325
Summary
In this chapter, we looked at the capital asset pricing model, which was used
to determine the rate of return of an asset. We also discussed how to
calculate the beta of the security, which helps us to determine which stock
is riskier compared to the market. Then, we saw how to measure an alpha
value and how to verify the performance of the portfolio. We also learned
about the various use cases of the capital asset pricing model using the scipy
library.
Regression models (both linear and non-linear) are used to predict real
values, such as a salary. If your independent variable is time, then you are
forecasting future values, otherwise your model is predicting present
but unknown values. In this chapter, we will look at the basic concepts of
simple linear regression, where we will examine the relationship between
two variables, and then we will move on to a more complex regression
technique called multiple or multivariate linear regression in which we will
be examining the relationship between more than two variables.
We will also learn how we can design a predictive model that will be able to
predict an output based on the relationship between these independent and
dependent features.
Let's consider an example to do with house prices. Usually, the larger the
house, the more expensive it is.
Here, the explanatory or independent variable is the size of the house, as this
helps us explain why certain houses cost more, and the dependent variable is
the price of the house, which is completely dependent on the size of the
house. Based on the size of the house, this value varies, so we know for a
fact that there is a relationship between the two variables. If we know the
value of the explanatory variable, which is the size of the house, we can
determine the expected value of the dependent variable, which is the house
price.
There are many other factors that determine house prices. In this case, we
are only using size, but this is not the only variable in real life. If we use
only one independent variable in a regression, this is known as simple
regression, while regressions with more than one variable are
called multivariate regression or multi-linear regression.
Let's start by considering a simple regression with two variables, x and y.
Here, y indicates the house price and x indicates the house size. The data
with the size and the price of the house can be found in the Excel sheet
called Housing-Data.xlsx, which has been uploaded to the GitHub link specified
in the Technical requirements section. We can plot the x and y variables in a
graph, as shown here:
The preceding diagram contains information about the actual size and prices
of houses. We can easily see that the two variables are connected: larger
houses have higher prices. Regression analysis assumes the existence of a
linear relationship between the two variables. We can draw a line of best fit
that can help us describe the relationship between all the data points, as
shown in the following plot:
To determine the line of best fit that will help us describe the relationship
between house prices and house size, we need to find a line that minimizes
the error between the lines and the actual observation, as shown here:
From the preceding diagram, observe how the different observations tend to
deviate from the line. A linear regression will calculate the error that's
observed when using different lines, allowing us to determine which
contains the least error. Each deviation from the line is an error because it
deviates from the prediction that the line would have provided.
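As a quick numerical sketch of this idea (the house data below is made up for illustration and is not the book's dataset), numpy's polyfit finds the slope and intercept that minimize the sum of squared errors:
import numpy as np

# Illustrative house sizes (sq. ft.) and prices (dollars)
size = np.array([800, 1000, 1200, 1500, 1800, 2100])
price = np.array([550000, 620000, 690000, 800000, 910000, 1010000])

# Fit the line price = m * size + b that minimizes the sum of squared errors
m, b = np.polyfit(size, price, 1)
errors = price - (m * size + b)   # deviations of the observations from the fitted line
print(m, b, (errors ** 2).sum())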
Given that this is a linear equation and its output is a line, we would expect
to obtain an equation with a very similar shape to the following. In this
equation, we will be considering β0=b and β1=m:
y=β0+β1 x1
As we move ahead, we will learn about the financial meaning of the alpha
(β0) and the beta coefficients (β1). However, first, let's take a look at the
different regression techniques and the basic mathematical equation that's
used to represent these techniques:
Type of regression: Description

Simple regression (y = β0 + β1x1 + εi): Regression analysis assumes the existence of a linear relationship between the two variables. One straight line is the best fit and can help us describe the relationship between all the data points we see in the plot.

Multivariate regression (y = β0 + β1x1 + β2x2 + β3x3 + εi): By considering more variables in the regression equation, we'll improve its explanatory power and get a better idea of the circumstances that determine the development of the variable we are trying to predict.
Running a regression model in
Python
So far, we have learned about some of the fundamental concepts behind
implementing linear regression problems in finance. Now, let's look at how
we can implement regression analysis using Python.
In Python, there are modules that help run a huge variety of regression
problems. In this section, we will learn about using Python for both simple
linear regression and multiple linear regression. We are going to consider a
housing dataset and predict houses prices using regression analysis by
considering various features that affect the price of houses.
Simple linear regression using
Python and scikit learn
To begin, we are going to import a couple of modules and libraries, as shown
here:
In[1]:
import numpy as np
import pandas as pd
import scipy
import statsmodels.api as sm
import matplotlib.pyplot as plt
In the preceding code, we imported numpy, pandas, scipy, statsmodels, and the
matplotlib library. The next step is to import the Housing.xlsx dataset, on which
we are going to apply the regression analysis, as shown here:
In[2]:
data = pd.read_excel('Housing.xlsx')
The house size is the independent feature and is stored in the X variable,
whereas House Price is the dependent feature and is stored in the Y variable.
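The cell that creates these two variables is not reproduced here; assuming the spreadsheet has columns named House Size and House Price (hypothetical column names), it would look something like this:
X = data['House Size']    # independent feature (column name is an assumption)
Y = data['House Price']   # dependent feature (column name is an assumption)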
A scatter plot of these points shows the house price with respect to the various house sizes. Next, we need to set the limits of the x axis and the y axis: the house sizes on the x axis start from around 800, whereas the house prices on the y axis start from around 500,000, so extending both axes down to 0 gives the plot more context. This can be done using plt.axis():
In[10]:
plt.scatter(X,Y)
plt.axis([0, 2500, 0, 1500000])
plt.xlabel("House size")
plt.ylabel("House Price")
plt.show()
In the preceding code, we are limiting the x-axis value from a range of 0 to
2,500, whereas we are limiting the y axis to a range of 0 to 1,500,000. This
will give us enough space to plot our observation, as shown here:
We can now see that the output looks much better. We can see that even the
smallest of houses in our sample cost a lot of money and we can get a better
idea of the size-to-price ratio of our data.
Now that we have our independent features (X) and our dependent features
(Y), to create a regression model, we will first split the independent and
dependent features into a training dataset and a test dataset. We will use the
former to train our linear regression model and the latter to check the
accuracy of the linear regression model.
To split the independent and dependent features, we will use the Scikit learn
library. Scikit-learn (formerly scikits.learn) is a free machine learning library
for the Python programming language. It features various classification,
regression, and clustering algorithms, including support vector machines,
random forests, gradient boosting, k-means, and DBSCAN, and is designed
to interoperate with the Python numerical and scientific libraries, NumPy
and SciPy.
To use the scikit learn library to split our data, we need to import the sklearn
library:
In[11]:
from sklearn.model_selection import train_test_split
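The split itself can then be performed along the following lines (here we assume a 20% test size and a fixed random_state, the same settings used later in the chapter):
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)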
Here, X_train and y_train are the training datasets that we will be using to train
our regression model. The X_test data will be used to test the accuracy of our
regression model, which will later be compared with y_test.
In[13]:
X_train=np.array(X_train).reshape(-1,1)
y_train=np.array(y_train).reshape(-1,1)
X_test=np.array(X_test).reshape(-1,1)
y_test=np.array(y_test).reshape(-1,1)
Finally, we apply the linear regression model using the scikit.learn library:
In[14]:
from sklearn.linear_model import LinearRegression
linregression=LinearRegression()
linregression.fit(X_train,y_train)
In the first line of the preceding code, we import the linear regression model
from the scikit learn library. In the next line, we initialize an object of the
LinearRegression model, which will be trained using the X_train and y_train
datasets with the fit() method. Once this code is executed, we get the
following output:
Out[14]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
If we see the preceding output, we can consider our model ready and can test
our X_test data to see the prediction of our model, as shown here:
In[15]:
y_pred=linregression.predict(X_test)
y_pred
Recall the simple linear regression equation, y = β0 + β1x1 + εi, where β0 is the intercept and β1 is the slope.
We can see the intercept from the preceding simple linear regression model,
as shown here:
In[16]:
linregression.intercept_
From the preceding code, we can see that intercept_ is the keyword
that's used to see the intercept values.
Similarly, we can see the slope from the simple linear regression model:
In[17]:
linregression.coef_
We would also like to visualize the line of best fit, which is represented by the y = β0 + β1x1 equation. Here, we will use the matplotlib library to draw the line:
In[18]:
import matplotlib.pyplot as plt
plt.scatter(X_train,y_train)
plt.plot(X_train,linregression.predict(X_train),'r')
plt.show()
The straight line that we can see in the graph is plotted using the intercept
and the slope value that we calculated using the simple regression model.
In this section, we discussed the simple linear regression technique, in which
we only had one independent feature. In the next section, we will discuss the
multiple or multivariate regression technique.
Multiple linear regression in Python
using the scikit-learn library
If we have more than one independent feature, we can create a regression model to
carry out future predictions. This regression model is called a multiple linear
regression model. If we have three independent features, for example, this is
represented by the y=β0+β1 x1+β2 x2+β3 x3+εi equation, where the independent
features are x1, x2, and x3. Here, β1, β2, and β3 are the slopes or the
coefficients and β0 is the intercept.
We will perform the following steps to create a multiple linear regression model
using Python and the sklearn library:
1. Import the required libraries:
In[1]:
import numpy as np
import pandas as pd
2. Read and split our dataset into dependent and independent features:
In[2]:
dataset = pd.read_csv('Startups_Invest.csv')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, 4]
Here, X represents the independent features and y is the dependent feature. The
following code helps us see the top five records of the X variables:
In[3]:
X.head()
Similarly, the following code helps us see the top five records of the y variable:
In[4]:
y.head()
In the next step, we can observe that the State column is basically a categorical
feature, so we need to convert the categorical feature into dummy variables using
pandas. Our final regression model will not be able to understand categorical
features, since they are in the form of strings. The regression model usually
understands mathematical calculations using various algorithms, so we need to
convert the categorical variables into dummy variables. To do this, we will be using
pandas.get_dummies().
The following is the code that helps us to convert the State column into dummy
variables:
In[5]:
State=pd.get_dummies(X.iloc[:,3],drop_first=True)
Inside the get_dummies function, we have provided the State column using iloc
indexing and we have set the drop_first parameter as True to get only K-1 dummies out of the K categories in the State column. Here, K is the total number of unique categories inside the State column. Currently, we have
three unique States, which are New York, California, and Florida. We choose only
K-1 categories because this is the minimum number required to represent all K
categories. We know that if the state is neither Florida nor New York, for example,
as is the case in the second record in the following output, it must be California.
The following code helps us to see the top 10 records of the dummy variables:
In[6]:
State.head(10)
The output looks as follows:
Since we have converted the State column into dummy variables, we will be deleting the State column, as it is no longer required. This can be done as follows:
In[7]:
X.drop('State',axis=1,inplace=True)
Let's check the X independent variable to see whether the State column has
successfully been dropped:
In[8]:
X.head()
We now need to concatenate the dummy variables that were created from the State
feature, which is stored in the State variable. To concatenate them, we use the concat()
function that's present in the pandas library. The code is as follows:
In[9]:
X=pd.concat([X,State],axis=1)
We can see that the dummy variable columns have been added to the X variable. With
this, we have completed the data-preprocessing step. The next step is to divide the
independent features, X, and the dependent feature, y, into training and test data using
the scikit learn library.
The following code splits the independent and dependent features into training and
test data:
In[12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Finally, we apply the linear regression model by using the scikit learn library:
In[14]:
from sklearn.linear_model import LinearRegression
linregression=LinearRegression()
linregression.fit(X_train,y_train)
In the first line of the preceding code, we import the linear regression model from
the scikit learn library. In the next line, we initialize an object of the LinearRegression
model, which will be trained using the X_train and the y_train dataset with the fit()
method. Once this code is executed, we get the following output:
Out[14]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
If we see the preceding output, we can consider our model ready. We can test our
X_test data and see the prediction of our model, as shown here:
In[15]:
y_pred=linregression.predict(X_test)
y_pred
We can also inspect the slope coefficients of the fitted model using the coef_ attribute:
In[17]:
linregression.coef_
Out[17]:
array([ 7.73467193e-01, 3.28845975e-02, 3.66100259e-02, -9.59284160e+02,
6.99369053e+02])
From the preceding output, we can see that we have five different slope
values for five different independent features. Note that we have also
converted our state feature into two columns with dummy variables. In
total, we have five different features. We have determined the slope or
coefficient for each of these.
Evaluating the regression model
using the R square value
R square is a very important technique to evaluate how good your model is.
Usually, the value of R square ranges between zero and one. The closer the
value of R square is to one, the better your model.
Let's explore the sums of squares involved. SSres is the sum of the squared differences between the real and the predicted output, as shown here:
SSres = Σ (yi − ŷi)²
Similarly, SStot is the sum of the squared differences between the real output and the mean of the real output, as shown here:
SStot = Σ (yi − ȳ)²
Finally, the R square value for a regression model is given by the following formula:
R² = 1 − SSres / SStot
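As a quick illustrative check of the formula (with made-up numbers, not the startup dataset), R square can be computed by hand as follows:
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # illustrative actual values
y_hat = np.array([2.8, 5.1, 6.9, 9.3])    # illustrative predicted values

ss_res = ((y_true - y_hat) ** 2).sum()          # sum of squared residuals
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r_square = 1 - ss_res / ss_tot
print(r_square)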
Let's find the R square value with Python code for the Startups_Invest.csv file
that we created earlier. Import the r2_score library from sklearn, as shown
here:
In[18]:
from sklearn.metrics import r2_score
Initialize r2_score and provide the y_test and y_pred parameters to get the r
square value:
In[19]:
score=r2_score(y_test,y_pred)
score
Out[19]
0.93470684732824227
Here, we get an output of 0.93, which is very close to 1. This indicates that
the regression model that we created is a very good model.
The decision tree uses a divide-and-conquer strategy to select the leaf nodes and create a
decision tree. Here is the pseudo algorithm to implement the decision tree:
1. Select the best independent feature or the attribute using a concept called entropy and
information gain, then select it as the root node.
2. Split or divide the dataset into subsets. Subsets should be made in such a way that each subset contains the data corresponding to one value of the selected attribute.
3. Repeat steps 1 and 2 on every subset until only leaf nodes remain.
The algorithm that is used to construct a decision tree is called the ID3 algorithm. Its splitting step can be summarized as follows:
Split(node, {examples})
1. A ← the best attribute for splitting the examples at this node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new child node
4. Distribute the training {examples} among the child nodes
Entropy helps us to measure the purity of a split based on the attribute selected. Entropy is given by the following formula:
Entropy(S) = − Σ pi log2(pi)
Here, pi is the proportion of examples in S that belong to class i. If the entropy value is close to 0, the split based on that attribute produces a pure subset.
After calculating the entropy for every attribute (independent feature), we can calculate the information gain. Information gain is used to select the attribute on which to split when the decision tree is created.
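A minimal sketch of these two quantities in Python (purely illustrative; scikit-learn computes its own impurity measures internally) is shown here:
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(pi * log2(pi)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent_labels, subsets):
    # Gain = entropy(parent) - weighted average entropy of the subsets after the split
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Illustrative example: splitting 10 labels into two subsets
parent = np.array(['yes'] * 6 + ['no'] * 4)
left, right = parent[:5], parent[5:]
print(information_gain(parent, [left, right]))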
Let's implement the same dataset example that we discussed in the Multiple linear
regression in Python using the scikit-learn library section.
We will use the same Startups_Invest.csv dataset in this section. This dataset has five features, as shown here:
The dataset consists of data related to 50 start-up companies who have invested some
amount of money in various internal departments, such as R&D, marketing,
and administration, from various states. Based on this expenditure, the company has
achieved some profit, which is specified in the Profit column. Our goal will be to create a
multiple linear regression model that will be able to predict the Profit based on the
expenditure in the R&D, marketing, and administration departments from various states. R&D
Spend, Marketing Spend, Administration, and State are the independent features in this example,
while Profit is the dependent feature.
We will perform the following steps to create a multiple linear regression model using
Python and the sklearn library:
1. Import the required libraries:
In[1]:
import numpy as np
import pandas as pd
2. Read and split our dataset into dependent and independent features:
In[2]:
dataset = pd.read_csv('Startups_Invest.csv')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, 4]
Here, X represents the independent features and y is the dependent feature. The following
code helps us see the top five records of the X variable:
In[3]:
X.head()
The State column is a categorical feature, so we need to convert the categorical feature into
dummy variables using pandas. Our final regression model will not be able to understand
categorical features, since they are in the form of strings. The regression model usually
understands mathematical calculations using various algorithms, so we need to convert the
categorical variables into dummy variables. To do this, we will be using a pandas function,
get_dummies().
Here is the code that helps us to convert the State column into dummy variables:
In[5]:
State=pd.get_dummies(X.iloc[:,3],drop_first=True)
Inside the get_dummies function, we have provided the State column using iloc indexing and
have set the drop_first parameter as True to get only K-1 dummies out of the K categories in the State column. Here, K is the total number of unique categories inside
the State column. Currently, we have three unique States, which are New York, California,
and Florida. We choose only K-1 categories because this is the minimum number required
to represent all K categories. We know that if the state is neither Florida nor New York, for
example, as is the case in the second record in the following output, it must be California.
The following code helps us to see the top 10 records of the dummy variables:
In[6]:
State.head(10)
Since we have converted the State column into dummy variables, we will be deleting the State column, as it is no longer required. This can be done as follows:
X.drop('State',axis=1,inplace=True)
Let's check the X independent variable to see whether the State column has successfully been
dropped:
In[8]:
X.head()
We can see that the State column has been dropped successfully.
We need to concatenate the dummy variables that were created from the State feature, which
is stored in the State variable. To concatenate them, we use the concat() function that's present
in the pandas library. The code is as follows:
In[9]:
X=pd.concat([X,State],axis=1)
We can see that the dummy variable columns have been added to the X variable. With this,
we have completed the data-preprocessing step. The next step is to divide the independent
features, X, and the dependent feature, y, into training and test data using the scikit learn
library.
First, we import the sklearn library, as shown here:
In[11]:
from sklearn.model_selection import train_test_split
The following code splits the independent and dependent features into training and test data:
In[12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Finally, we apply the decision tree regressor model by using the scikit learn library:
In[14]:
from sklearn.tree import DecisionTreeRegressor
decisionregressor=DecisionTreeRegressor()
decisionregressor.fit(X_train,y_train)
Here, we import the decision tree regressor and then we initialize an object and fit the object
with the X_train and y_train values. The output is as follows:
Out[14]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
The next step is to make predictions using the DecisionTreeRegressor object. We will be using the predict method:
In[15]:
y_pred=decisionregressor.predict(X_test)
y_pred
Check the r square value, which indicates how good our model is. Import the r2_score
library:
In[16]:
from sklearn.metrics import r2_score
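The score itself can then be computed just as in the earlier section:
score = r2_score(y_test, y_pred)
score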
Now, we will check out the visualization of the decision tree using some of the libraries that
are present in Python.
Install the pydot library to visualize the decision tree. Open the Anaconda prompt and type the following command:
pip install pydot
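The import cell is not reproduced here; the imports needed by the visualization code below would typically be the following:
from io import StringIO
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydot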
After importing, we will consider all the independent features on which the decision tree
was applied. Use the following code:
In[20]:
features = list(X_train)
features
This code helps us to create the decision tree graph in Jupyter Notebook:
In[21]:
dot_data = StringIO()
export_graphviz(decisionregressor, out_file=dot_data,feature_names=features,filled=True,rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())
Here is the output:
The preceding graph shows the decision tree that was created by the decision tree regressor.
The diagram will be more visible in the Jupyter Notebook.
We saw how we can implement the decision tree regressor and perform the
prediction, and we looked at how we can visualize the decision tree that
was created by DecisionTreeRegressor.
In the next chapter, we will discuss the Monte Carlo simulation and
decision making.
Section 3: Deep Learning and
Monte Carlo Simulation
In this section, we will discuss Monte Carlo simulation and explore deep
learning techniques implemented using neural networks. These topics will
be implemented using Python and some of the libraries that are available in
Python. The libraries that we will be using include the statsmodels library,
TensorFlow, and Keras.
Monte Carlo simulation was first used by scientists working on the atom bomb in the 1940s (https://www.solver.com/press/monte-carlo-methods-led-atomic-bomb-may-be-your-best-bet-business-decision-making). It is now used in situations in which the outcome depends on uncertain inputs, such as business and financial forecasting.
The GitHub repository for this chapter can be found at the following link: https://github.com/PacktPublishing/Hands-on-Python-for-Finance/tree/master/Chapter%208.
An introduction to Monte Carlo
simulation
Monte Carlo simulation is an important tool that has a variety of
applications in the world of business and finance. When we run a Monte
Carlo simulation, we are interested in observing the different possible
realizations of a future event. What happens in real life is just one of the
possible outcomes of any event. Let's consider an example. If a basketball player shoots a free throw at the end of a game and the game is tied, there are two possibilities: either the player scores the free throw, or they miss it.
However, this doesn't tell us much about the player's chances of scoring the
free throw. This is where Monte Carlo simulation comes in handy. We can
use past data to create a simulation that is in fact nothing but a new set of
fictional, but sensible, data.
Usually, the value of next year's revenue is not known in advance. The random variable, which can take any value in this case, is the revenue growth rate.
The computer software would allow us to simulate or predict the
development of the revenue, say, 1,000 times, and help us to obtain an idea
of the average, the maximum, and the minimum values of the expected
revenue figure.
The following diagram shows how we can calculate the current revenue:
This can be really useful for a finance manager, as it would allow them to
understand the overall direction of where the company is heading and the
maximum and the minimum amount of revenue that they can expect. The
two parameters for revenue growth and standard deviation could be
obtained by looking at historical data or arbitrarily chosen data.
The logic behind determining the cost of goods sold (COGS) and the
operating expenses (Opex) is almost the same; both COGS and Opex are
expenditures that companies incur when running their businesses. However,
there is one main difference, which is that the expenses are segregated on
the income statement. Opex and COGS measure different ways in which
resources are spent in the process of running a company.
Our goal in this exercise is to predict the firm's future gross profit. We will
need the values of the expected revenue and the expected COGS. We will
start by performing 1,000 simulations of the company's expected revenue.
Let's assume that we have a value for last year's revenue and that we have an idea about the revenue growth rate we can expect. The expected revenue for this year, therefore, is $170 million, with a standard deviation of $20 million. To simplify things, let's work in millions of dollars. We are
going to create two variables. The first is rev_m, which refers to the mean of
the revenue, and is assigned the value 170. The other variable will
be rev_stdev, which refers to the standard deviation of the revenue, and is
assigned the value 20. The next important value we need to specify is the
number of iterations we intend to produce, which is 1,000. This is defined in
the iterations variable. Let's create these three variables, as shown here:
In[2]:
rev_m = 170
rev_stdev = 20
iterations = 1000
From the preceding code, we can see that rev_m is assigned to 170, rev_stdev is assigned to 20, and iterations is assigned to 1000.
The next line of code will produce a simulation of the future revenues. We
will apply NumPy's random normal distribution generator. The arguments
we provide in the NumPy random function are the expected mean of the revenue (rev_m), the standard deviation (rev_stdev), and the number of iterations we would like to perform (iterations). The code is as follows:
In[3]:
rev = np.random.normal(rev_m, rev_stdev, iterations)
rev
From the preceding output, we can see that we have created 1,000 random
values. Most of these values are close to the mean we selected. In the next
step, we will try to plot these observations and see their distributions using
the following code:
In[4]:
plt.figure(figsize=(15, 6))
plt.plot(rev)
plt.show()
We can see that most data points fall in the region of 150 to 190, as shown here:
The 150 to 190 range corresponds to one standard deviation on either side of the mean, which is 170.
This is how we can interpret the distribution of the expected revenue and
find out the impact of COGS. Let's see how we can simulate this
development. Let's assume that we are experienced in this business and able
to tell that the typical COGS amount is approximately 60% of the company's
annual revenue.
To work this out, we need to know the previous values of COGS. Let's say that COGS was once equal to 55% of the revenue and then it went up to 62%, before going up again to 63%, and then finally coming back down to 55%. It is reasonable to model this as a normal distribution with a mean of 60%, and we will give this estimation a standard deviation of 10%. Even though the historical values deviated by only around 6% of the revenue, for the purpose of setting up the distribution we will let COGS deviate by 10% from its mean.
COGS is the money spent, so we need to put a minus sign first, and then the
expression must reflect the multiplication of the revenue by 60%. We need to
pay attention here, because what comes next is crucial for this analysis. We
will not simulate COGS 1,000 times. This is because this simulation has
already been done for the revenue in line In[3] of the code, and we have
already generated 1,000 revenue data points. We must assign a random
COGS value to each one of these points. COGS is a percentage of the
revenue, which is why the revenue value we obtained must be multiplied by
a value extracted from a random normal distribution with a mean of 0.6 and
a standard deviation of 0.1. Had we put 0.6 directly, this would have meant
that COGS always equals 60% of the revenue, and this is not always the
case. The percentage will probably vary, and we have decided that the
standard deviation is equal to 10%. NumPy allows us to incorporate the
expected deviation, so we will take advantage of this in the following code:
In[5]:
COGS = - (rev * np.random.normal(0.6,0.1))
plt.figure(figsize=(15, 6))
plt.plot(COGS)
plt.show()
When we plot the results on a graph, we see the typical behavior of the
normal distributions. The normal distribution basically follows a bell-curved
shape:
It is interesting that if you re-run the COGS approximation, you will not always get the same mean value for the observations. We can check the mean and the standard deviation of COGS with the mean() and std() methods:
In[6]:
COGS.mean()
In[7]:
COGS.std()
Out[7]:
11.589576145999802
What is important is that the deviation of COGS stays at around 10% of its mean.
In this part, we have simulated all the variables. In the next part, we will see
how we can calculate the future gross profit.
Using Monte Carlo simulation to
predict gross profit – part 2
Computing the gross profit is the objective of the second part of this lesson.
We have generated 1,000 potential values for both the revenue and COGS.
Calculating the gross profit requires us to combine these values.
The code is as follows. Here, we are combining the revenue and COGS and
plotting the graph:
In[8]:
Gross_Profit = rev + COGS
Gross_Profit
plt.figure(figsize=(15, 6))
plt.plot(Gross_Profit)
plt.show()
From the preceding output, we can see that the gross profit is normally distributed, with a mean value equal to the mean of the revenues minus the absolute mean of COGS.
With the help of the max and min functions, it is easy to obtain the biggest and
the smallest potential values of gross profit:
Out[10]:
24.388073332020554
The mean() and std() methods can provide an output that is equal to the mean
and the standard deviation of the gross profit, respectively. The mean can be
computed as follows:
In[11]:
Gross_Profit.mean()
While we have completed our original task here, it is always a good idea to
plot these results on a histogram. A histogram is a graph that helps you to
identify the distribution of your output. The histogram syntax to be
implemented is similar to the one we used for a regular plot. The histogram
is created using the hist() function, which is present inside the matplotlib
library. The syntax is as follows:
In[13]:
plt.figure(figsize=(10, 6));
plt.hist(Gross_Profit, bins = [40, 50, 60, 70, 80, 90, 100, 110, 120]);
plt.show()
Here, hist() takes two parameters. One is the data, which is basically the
gross profit, and the other is the bins. These are the chunks in which the data
in the plot will be divided. There are basically two ways to specify bins. One way is to create a list whose elements separate the bins along the x axis: if our list contains the numbers 40, 50, 60, and so on, up to 120, then we will see the observations between 40 and 50 grouped in one bin, the observations between 50 and 60 grouped in another bin, and so on. The other way is simply to pass the number of bins you want as an integer.
Take a look at the following formula, which is used to predict stock prices:
Price today = Price yesterday × e^r
The preceding formula says that the price of a share today is equal to the price of the same share yesterday multiplied by e^r, where r is the log return of the share.
We usually know yesterday's stock price, but we do not know the value of r
as it is a random variable. To determine the value of r, we will use a new
concept called Brownian motion, which allows us to model this kind of
randomness. The formula of Brownian motion is made up of two
components:
Drift
Volatility (random variable)
Drift is the direction in which the rates of returns have headed in the past. It
is the expected daily return of the stock and it is the best approximation
about the future we have.
We will start by calculating the periodic daily returns of the stock over the historical period. We only have to take the natural logarithm of the ratio between the current and the previous price. Once we have calculated the daily returns in the historical period, we can easily calculate their average, standard deviation, and variance. This allows us to calculate the drift component, which is given by the following formula:
drift = u − 0.5 · var
Here, u is the average daily log return and var is its variance.
This means that the price of the stock today is yesterday's price, multiplied by e to the power of the drift plus a random component, as shown here:
Price today = Price yesterday × e^(drift + stdev · z)
If we repeat this calculation 1,000 times, we will be able to simulate the development of the stock price in the future.
these will follow a certain pattern. This is a great way to assess the upside
and the downside of the investment, as we have obtained an upper and a
lower bound when performing the Monte Carlo simulation.
These are the mechanics you need to understand when using Monte Carlo
for asset pricing. In the next section we will apply this technique in Python.
Using Monte Carlo simulation to
forecast stock prices – part 1
In this section, we will be gaining further knowledge on how to implement
Monte Carlo simulations in the real world. Here, we will be looking at how
we can run or implement simulations, which, in turn, will help us to predict
the stock price of any company.
First, we will import all the important libraries that we have already used, as
shown here:
In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
%matplotlib inline
The company dataset that we will be using for our analysis will be the
Microsoft (MSFT) dataset. The timeframe under consideration includes the
past 17 years of data, starting from December 31, 1999 to October 18, 2017.
We will be forecasting the future stock prices of MSFT in this exercise.
Let's read the dataset using the read_csv() function, as shown here:
In[2]:
data = pd.read_csv('MSFT_2000.csv', index_col = 'Date')
We can calculate the log returns by applying NumPy's log function to one plus the simple returns, as shown here:
In[3]:
log_returns = np.log(1 + data.pct_change())
We'll take a look at the last five records of the log_returns, as shown here:
In[4]:
log_returns.tail()
In the next plot, we will plot the MSFT data graph using matplotlib, as
shown here:
In[5]:
data.plot(figsize=(10, 6));
The output looks like there is a gradual increase in the stock price over time:
In the next step, we will plot the log returns that were assigned to the
log_returns variables, as follows:
In[6]:
log_returns.plot(figsize = (10, 6))
Let's compute the drift. First, we need the mean and the variance of the log returns, which we then plug into the drift formula:
In[7]:
u = log_returns.mean()
In[8]:
var = log_returns.var()
In[9]:
drift = u - (0.5 * var)
drift
Out[9]:
MSFT -0.000034
dtype: float64
We have obtained a small drift value, which we don't need to worry about.
We will do this exercise without annualizing our indicators because we will
try to predict the MSFT daily stock price. Next, we will find the standard
deviation of the log returns, as shown here:
In[10]:
stdev = log_returns.std()
stdev
Out[10]:
MSFT 0.019397
dtype: float64
We have already discussed that, under Brownian motion, the daily return is e raised to the power of r, where r is the sum of the drift and the standard deviation multiplied by a random number z, as shown here:
daily_return = e^(drift + stdev · z)
We will be using a type() function, which allows us to check the drift variable datatype and
see it as a pandas series.
The code to check the type of drift variable that was computed in the previous section is as
follows:
In[10]:
type(drift)
Similarly, to use the type of the stdev variable, we use the following code:
In[11]:
type(stdev)
The reason we check the type is that, to proceed further with our task, we need to convert these values into NumPy arrays. The numpy.array() function can do this for us. The code is as follows:
In[12]:
np.array(drift)
Out[12]:
array([-3.42521946e-05])
Alternatively, we can also convert these values into arrays using the values property, as shown
here:
In[13]:
drift.values
The second component of the Brownian motion is a random variable, indicated by z, which measures the distance between an event and the mean, expressed as a number of standard deviations. We will be using the scipy norm.ppf() function to obtain this result. The code is as follows:
In[16]:
norm.ppf(0.95)
The preceding code specifies that if an event has a 95% chance of occurring, the distance
between this event and the mean will be approximately 1.65 standard deviations.
To complete the second component, we will need to initialize some random variables. We will
be using the well-known NumPy rand() function to create a random array, as shown here:
In[17]:
x = np.random.rand(10, 2)
x
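These probabilities can then be passed to norm.ppf() to turn them into standard normal values; a sketch of this step is:
Z = norm.ppf(x)   # each probability becomes a distance from the mean, measured in standard deviations
Z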
The first number from the first row of the output corresponds to the first probability from the
first row of the x matrix or array. Similarly, the second element corresponds to the second
probability, as shown in the x matrix, and so on.
In other words, the random function creates an array of probabilities, and norm.ppf() converts each probability into the corresponding value of the standard normal distribution, expressed as a distance from the mean of 0 in standard deviations.
Once these tools are built and have calculated all the necessary variables, we will be ready to
calculate the daily returns.
Initially, we will create a time interval variable and assign it to 1,000, so that we will be
forecasting the stock price for the upcoming 1,000 days. Another variable we will create is
iterations, and we will set the value to 10, which means that we will produce a series of 10
future stock predictions. The code to do this is as follows:
In[20]:
t_intervals = 1000
iterations = 10
Now, let's get back to the equation that we discussed at the start of this section. The daily returns and the value of r are given as follows:
daily_returns = e^r
r = drift + stdev · z
We use the preceding equation to get an array of a size of (1,000, 10), where 1,000 is the
number of rows and 10 is the number of columns. This means that we have 10 sets of 1,000
random future stock prices.
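Assuming the standard formulation of the two equations above, the cell that builds this array would look something like the following sketch:
daily_returns = np.exp(drift.values + stdev.values * norm.ppf(np.random.rand(t_intervals, iterations)))
daily_returns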
Using Monte Carlo simulation to
forecast stock prices – part 3
In this section, we need to create a price list. Each price must equal the price
of the product that was observed the previous day, as shown here:
S_t = S_0 · daily_return_t
S_(t+1) = S_t · daily_return_(t+1)
...
S_(t+999) = S_(t+998) · daily_return_(t+999)
Once we calculate the price of the stock at the date specified by t, we can
predict the price of the stock at the date specified by t+1. This process will
then continue 1,000 times, giving us predictions of the stocks for 1,000 days
in the future.
To make useful predictions, the first stock price in our list must be
considered or initialized as the last one in the dataset. This is called
the current market price. We will call this variable S0 and it will contain the
stock price of the starting day, specified as t0. The last piece of data can be
retrieved using the iloc operator that's present in the pandas library, as shown
here:
In[21]:
S0 = data.iloc[-1]
S0
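Before running the loop that fills in the prices day by day, the price_list array has to be created with the same shape as daily_returns, and its first row has to be set to S0. A sketch of this initialization step is:
price_list = np.zeros_like(daily_returns)   # one row per day, one column per simulated path
price_list[0] = S0                          # every simulated path starts from the last observed price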
We can now create our stock price list, wherein we will run a loop from day
1 to the 1000th day. The expected price of the stock on any day t will be
calculated as stock price at t-1 times the daily returns observed on day t. The
code is as follows:
In[24]:
for t in range(1, t_intervals):
price_list[t] = price_list[t - 1] * daily_returns[t]
price_list
Now, let's plot this price list using matplotlib, as shown here:
In[26]:
plt.figure(figsize=(10,6))
plt.plot(price_list);
In this chapter, we got into a lot of technical language and a number of advanced concepts. This is the kind of topic, however, that you need to master to get into the field of finance or data science. I would personally suggest that you go through the Jupyter Notebook carefully so that you can get the upper hand when carrying out this kind of work.
Summary
In this chapter, we have seen and understood the importance of Monte Carlo
simulation and how it is helpful in predicting the gross profit of a company
based on COGS, both intuitively and practically, with Python. We also
understood how to forecast stock prices using Monte Carlo simulation. We
used the Microsoft stock dataset and implemented the forecasting technique
using Python and Monte Carlo simulation.
In the next chapter, we will look at the option pricing Black Scholes model
to price various derivatives.
Option Pricing - the Black Scholes
Model
The Black Scholes formula is one of the most widely used tools in finance. It was derived by Fischer Black, Myron Scholes, and Robert Merton in 1973, and since then it has become the primary tool for derivative pricing.
The GitHub repository for this chapter can be found at the following link: https://github.com/PacktPublishing/Hands-on-Python-for-Finance/tree/master/Chapter%209.
An introduction to the derivative
contracts
A derivative is a financial instrument whose price is determined, or derived, based on the development of one or more underlying assets, such as stocks, bonds, interest rates, commodities, and exchange rates. It is a contract
involving at least two parties and describes how and when the two parties
will exchange payments. Some derivative contracts are traded in regulated
markets, while others that are traded over the counter are not regulated.
However, with time, a great deal of innovation was introduced to the scene.
So-called financial engineering was applied and new types of derivatives
appeared. Nowadays, there are many types of derivatives. Some are rather
complicated and difficult to understand, even for those with a lot of
experience in finance.
There are three groups of people who tend to be interested in dealing with derivatives: hedgers, speculators, and arbitrageurs.
In this chapter, we will learn about the four main types of financial
derivatives and how they function:
Forward
Future
Option
Swap
Forward contracts
A forward contract is a customized contract between two parties, where settlement happens on a particular date in the future at a price agreed upon today. Forward contracts are not traded on a standard stock exchange and, as a result, they are not standardized, which makes them particularly useful for hedging. The primary characteristics of forward contracts are as follows:
A buyer
A seller
A price
An expiry date
Option contracts
An option contract is a contract that gives someone the right, but not the obligation, to buy (call) or sell (put) a security or another financial asset. A call option gives the buyer the right to purchase the asset at a given price, called the strike price. The holder of the call option has the right, but not the obligation, to buy the asset; the seller (writer) of the call, on the other hand, is obliged to deliver the asset if the buyer chooses to exercise the option.
Similarly, a put option gives the buyer the right to sell the asset at the strike price. Here, the buyer of the put has the right to sell, and the writer has the obligation to buy. In every option contract, the right to exercise the option is vested in the buyer of the agreement, while the seller bears the corresponding obligation. Because the seller of the agreement bears this obligation, they are paid a price, called the premium.
Swap contracts
A swap contract is used for the exchange of one cash flow for another set of
future cash flows. A swap refers to the exchange of one security for another,
based on different factors.
Let's now look at the payoff diagram of a call option. The y axis shows the profitability of an investor who buys the call option, while the x axis shows the development of the underlying share price for
which the investor has an option. When the option expires, the owner will
compare the strike price and the actual market price of the underlying share.
The strike price is the price at which the derivative can be bought or
exercised. This term is most commonly used to describe the index and the
stock options. For call options, the strike price is the price at which the
stocks or the derivatives can be bought by the buyer until the expiration date.
For put options, the strike price is the price at which shares can be sold by
the option buyer.
If the strike price is lower than the market price, the owner of the option will
exercise it. Conversely, if the strike price is higher than the market price,
they won't exercise the option, because they must buy the share at a higher
price than its market price.
At first, the investor buys the option. They pay the money and so their
profitability is negative. Then, when the expiration date comes around, they
are able to use this option if the price of the underlying share is higher than
the strike price. Even if the market price is higher than the strike price, this
doesn't mean the investor will profit from the deal. They need a price that is
significantly higher to reach a break-even point and profit from the deal.
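Although the payoff diagram itself is not reproduced here, it is straightforward to sketch one. The following snippet (not taken from the book's notebooks) plots the profit of a long call, assuming, purely for illustration, a strike price of 110 and a premium of 5:

import numpy as np
import matplotlib.pyplot as plt

K = 110.0      # assumed strike price
premium = 5.0  # assumed premium paid for the option

share_price = np.linspace(80, 150, 200)            # possible prices of the underlying share at expiry
profit = np.maximum(share_price - K, 0) - premium  # the option is exercised only above the strike price

plt.plot(share_price, profit)
plt.axhline(0, color='grey', linewidth=0.5)  # the break-even level
plt.xlabel('Underlying share price at expiry')
plt.ylabel('Profit of the call buyer')
plt.show()

The flat part of the resulting line, at -5, is the premium lost when the option expires worthless; the profit only becomes positive once the share price exceeds the strike price by more than the premium.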
The Black Scholes formula provides an intuitive way to calculate option prices. For a call option, the formula is as follows:
C = S * N(d1) - K * e^(-r * T) * N(d2)
Here, N(.) is the cumulative distribution function of the standard normal distribution, S is the current stock price, K is the strike price, r is the risk-free interest rate, and T is the time to maturity.
The first component, S * N(d1), shows us how much we are going to get if the option is exercised. The second component, K * e^(-r * T) * N(d2), is the amount we must pay when exercising the option. If we have all the inputs necessary to calculate d1 and d2, we shouldn't have a problem obtaining these values and applying them in the preceding formula.
This is the key to understanding the formula. In the first component, we multiply the amount received when the option is exercised by N(d1). In the second component, we subtract the discounted amount we must pay to exercise the option, weighted by N(d2), which can be interpreted as the probability of the option being exercised. All of this rests on the assumption that the stock's log returns follow a normal distribution.
In short, the Black Scholes formula calculates the value of a call by taking
the difference between the amount you get if you exercise the option, minus
the amount you have to pay if you exercise the option.
Calculating the price of an option using Black
Scholes
Our goal in this section will be to calculate the price of a call option using Python. We will apply
the Black Scholes formula that we introduced in the previous section.
Let's begin by importing the necessary libraries, as shown in the following code:
In[1]:
import numpy as np
import pandas as pd
from pandas_datareader import data as wb
from scipy.stats import norm
Now, we will create two functions that will allow us to calculate d1 and d2, which we need to apply the Black Scholes formula and price a call option. The parameters to include are the current stock price (S), the strike price (K), the risk-free interest rate (r), the standard deviation of returns (stdev), and the time to maturity (T), measured in years.
The two functions implement the following equations:
d1 = (ln(S / K) + (r + stdev ** 2 / 2) * T) / (stdev * T ** 0.5)
d2 = (ln(S / K) + (r - stdev ** 2 / 2) * T) / (stdev * T ** 0.5)
The difference between d1 and d2 is that in d1, we add half the variance to the risk-free rate, denoted as (r + stdev ** 2 / 2), while in d2, we subtract it, denoted as (r - stdev ** 2 / 2).
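The two function definitions themselves are not reproduced above; a minimal sketch that matches these equations and parameters would be:

def d1(S, K, r, stdev, T):
    # numerator adds half the variance to the risk-free rate
    return (np.log(S / K) + (r + stdev ** 2 / 2) * T) / (stdev * np.sqrt(T))

def d2(S, K, r, stdev, T):
    # numerator subtracts half the variance from the risk-free rate
    return (np.log(S / K) + (r - stdev ** 2 / 2) * T) / (stdev * np.sqrt(T))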
To apply the Black Scholes formula, we won't need the PPF (percent point function) we used previously when we forecasted future stock prices. Instead, we will need the cumulative normal distribution. The reason it is needed is that it tells us what proportion of the data lies below a given value. Its output can never be below zero or above one.
We will use the SciPy library and the cdf() function, which will take a value from the data as an
argument and show us what portion of the data lies below that value.
For instance, an argument of 0 will lead to an output of 0.5, as shown in the following code:
In[3]:
norm.cdf(0)
Since zero is the mean of the standard normal distribution, half the data lies below this value:
We can now introduce the Black Scholes function. It will have the same parameters as d1 and d2:
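A minimal sketch of this function, combining the d1 and d2 helpers with the norm.cdf() function discussed above, would be:

def BSM(S, K, r, stdev, T):
    # call price = expected amount received minus the discounted expected amount paid
    return S * norm.cdf(d1(S, K, r, stdev, T)) - K * np.exp(-r * T) * norm.cdf(d2(S, K, r, stdev, T))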
We have now defined all the necessary functions. Let's apply them to the Procter and Gamble
dataset, which consists of the stock prices of the Procter and Gamble company. We will first read
the dataset using the pandas read_csv function, as shown here:
In[8]:
data = pd.read_csv('PG_2007_2017.csv', index_col = 'Date')
We will then use the iloc operator with the argument as -1 to get the last record. We will store the
last record in the variable S, as shown here:
In[9]:
S = data.iloc[-1]
S
Out[9]:
PG 88.118629
Name: 2017-04-10, dtype: float64
Another argument we can extract from the data is the standard deviation. In our case, we will use
an approximation of the standard deviation of the logarithmic returns of this stock, as shown here:
In[10]:
log_returns = np.log(1 + data.pct_change())
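The annualized standard deviation itself is obtained from these log returns; the cell is not shown here, but (exactly as in the next section) it would be:

stdev = log_returns.std() * 250 ** 0.5   # annualize assuming 250 trading days per year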
We can now calculate the price of the call option. We will stick to a risk-free interest rate (r) of
2.5%, corresponding to the yield of a 10 year government bond. Let's assume that the strike price
(K) equals $110 and that the time (T) is 1 year. Let's define these variables, as follows:
In[12]:
r = 0.025
K = 110.0
T = 1
At this stage, we have the values for all five parameters we are interested in. If we use them in the
three functions we created, we will be able to determine the value for d1, d2, and the BSM function,
which are the components of the Black Scholes formula, as shown here:
In[13]:
d1(S, K, r, stdev, T)
In the same way, we can call d2(S, K, r, stdev, T) and, finally, BSM(S, K, r, stdev, T). The output of the BSM() call is as follows:
Out[15]:
PG 1.132067
Name: 2017-04-10, dtype: float64
The call option price from the preceding output is close to $1 and 13 cents. We were able to
successfully price the call option. Is it possible to have a call option price that is much lower than
the actual stock price? The value of the option depends on multiple parameters, such as the strike
price, the time when we are exercising the option, the maturity of the option, and the
market volatility. It is not directly proportional to the price of the security.
If you rerun the same code with different values for the time to maturity period, the standard
deviation, or the strike price, you will obtain a different option price. In the next section, we will
see how we can calculate the price of a stock option in a more sophisticated way.
Using Monte Carlo in conjunction with
Euler Discretization in Python
In this section, we will be looking at some more advanced features of finance and
mathematics. We will use the Euler Discretization technique to calculate the call
option. This is a more sophisticated way of calculating the call option than the
techniques we discussed in the previous section. To compare the two approaches,
we will use the same dataset we used in the previous section, which is the Procter
and Gamble dataset for the time period from January 1, 2007 until March 21, 2017.
To begin with, let's import all the necessary libraries, as shown here:
In[1]:
import numpy as np
import pandas as pd
from pandas_datareader import data as web
from scipy.stats import norm
import matplotlib.pyplot as plt
%matplotlib inline
The next step is to read the dataset using the pandas read_csv() method, as shown
here:
In[2]:
data = pd.read_csv('PG_2007_2017.csv', index_col = 'Date')
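As in the previous section, we also need the logarithmic returns of the data; the corresponding cell is not reproduced here, but it would be:

log_returns = np.log(1 + data.pct_change())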
In Euler Discretization, the methods and formula we will apply to compute the call
option price are different. We want to run a huge number of experiments to make
sure that the price we pick is the most accurate one.
Monte Carlo simulation can provide us with thousands of possible call option prices. We can then average the payoffs. The trick lies in the formula we will use to calculate the future prices, which is as follows:
S(t) = S(t-1) * exp((r - stdev ** 2 / 2) * delta_t + stdev * delta_t ** 0.5 * Z(t))
This is another version of a Brownian motion. The formula and the approach employed are what is known as Euler Discretization. The parameters of the preceding equation are as follows: S(t) is the stock price at time t, r is the risk-free interest rate, stdev is the annualized standard deviation of the log returns, delta_t is the length of a single time interval, and Z(t) is a random value drawn from the standard normal distribution.
Let's go through the steps that will allow us to assign values in this long formula.
The first thing that we will do is assign a risk-free interest rate (r). The initialization
is shown here:
r = 0.025
Here, we assign the value of the risk-free interest rate as 2.5%. Next, we can obtain
the standard deviation of the log returns and store it in a stdev variable, as shown
here:
In[5]:
stdev = log_returns.std() * 250 ** 0.5
stdev
We can verify that the stdev variable is a series using the following code:
In[6]:
type(stdev)
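For the simulation that follows, stdev also needs to behave as a plain NumPy array rather than a pandas Series; assuming the notebook converts it with the Series' values attribute, the cell would simply be:

stdev = stdev.values   # convert the pandas Series into a NumPy array so it broadcasts against the simulation matrix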
Checking type(stdev) again after this conversion confirms that stdev is now a NumPy array. Let's also plot and check the data of the Procter and Gamble dataset, as shown here:
In[6]:
data.plot(figsize=(10, 6));
We will consider the time (T) to be 1 year, since we are forecasting the prices for 1
year ahead. The number of time intervals (t_intervals) must correspond to the
number of trading days in a year, which is 250. The initialization is shown in the
following code:
In[8]:
T = 1.0
t_intervals = 250
delta_t = T / t_intervals
iterations = 10000
In the preceding code, we have created two more variables: delta_t, which is the fixed time interval, and iterations, the number of simulations (10,000) we will run when generating Z, the random component.
The random component Z will be a matrix with random components drawn from a
standard normal distribution, which is a normal distribution with a mean of 0 and a
standard deviation of 1.
The dimension of the matrix will be defined by the number of time intervals
augmented by 1 (t_intervals + 1) and the number of iterations (iterations).
In this step, we can create an empty array, S, of the same dimensions as the random component Z. For this, we use the np.zeros_like() method, as shown in the following sketch.
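The cells that create Z and S are not reproduced above; a minimal sketch of these two steps, using NumPy's standard_normal() function, would be:

Z = np.random.standard_normal((t_intervals + 1, iterations))  # random draws from the standard normal distribution
S = np.zeros_like(Z)                                          # empty price matrix with the same dimensions as Z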
Then, we will use the iloc operator with the argument -1 to get the last record. We will store the last record in the variable S0 and use it as the first row of the price matrix, as shown in the following code:
In[11]:
S0 = data.iloc[-1]
S[0] = S0
Apart from the formula, which differs from the one we used for calculating the future stock prices, the remaining steps are almost identical to the Monte Carlo technique we looked at in the previous chapter; the simulation loop is sketched below.
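The loop that fills the price matrix is not shown here; a minimal sketch of the Euler Discretization step, using the variables defined above, would be:

for t in range(1, t_intervals + 1):
    # each new price equals the previous price scaled by the discretized drift and diffusion terms
    S[t] = S[t - 1] * np.exp((r - 0.5 * stdev ** 2) * delta_t + stdev * delta_t ** 0.5 * Z[t])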
We can check the dimension by using the shape property, as shown here:
In[14]:
S.shape
To plot only 10 simulations, we can use matplotlib. The code is shown here:
In[15]:
plt.figure(figsize=(10, 6))
plt.plot(S[:, :10]);
Let's see what the payoff is like for a call option. At a certain point in time, we will
exercise our right to buy the option if the stock price minus the strike price is
greater than zero. We will not exercise our right to buy if the difference is a negative
number.
So, the value of the option depends on the chance of the difference between the
stock price (S) and the strike price (K) being positive and, in particular, how positive
it is expected to be.
To work this out, we can use a NumPy method called maximum(), which will create an array that contains, for each simulation, either the difference or zero, whichever is larger. We will call it p, which represents the payoff. The code is as follows:
In[16]:
p = np.maximum(S[-1] - 110, 0)
p
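The discounting cell itself is not reproduced above; a minimal sketch, using the variables defined earlier, would be:

C = np.exp(-r * T) * np.sum(p) / iterations   # discount the average payoff back to today
C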
This corresponds to the discount factor, the exponential of -r multiplied by T, applied to the sum of the computed payoffs divided by the number of iterations we chose, which is 10,000. The result is the Monte Carlo estimate of the call option price.
It is important to check whether this number differs substantially from the one we obtained with the Black Scholes formula in the previous section. From the Black Scholes formula, we got a call option value of 1.13. This is very close to the value of 1.15, which we got by using Euler Discretization. This shows that the choice of computation method can lead to small, but not negligible, differences in the outcome.
We are now armed with enough Python skills to conduct these kinds of studies on our own.
Summary
In this chapter, we have looked at various concepts, such as derivative
contracts, the Black Scholes formula, and Euler Discretization. We used
these techniques to calculate the call option price, which in turn can be used
in derivative contracts. We also implemented this practically by using
Python. We then looked at the difference between the call option price that's
calculated using the different techniques.
3. Type the following command in the Anaconda prompt and press Enter:
conda install keras
The purpose of deep learning is to mimic the human brain. It uses a neural
network architecture that is based on the idea of neurons being the basic
units of the human brain. Let's consider an example: as humans, it is very
easy for us to determine whether an animal is a dog or a cat. When we see
an animal, our sensory organs (our eyes, in this case) receive the signal.
This signal is then passed through a large number of neurons, a structure
known as the neural network. As this happens, the signal is processed and
the resulting information is sent to the brain. In this section, our aim is to
create a machine or model that is able to replicate the working of a neural
network with respect to deep learning using the same methodology that our
brain uses to solve problems concerned with financial data.
An introduction to neural networks
In this section, we will discuss neural networks. A neural network in
computing is a complex model that is inspired by the way in which neural
networks that are present in the human brain digest and process
information. Neural networks have had a significant impact on research and
development in areas such as speech recognition, natural language
processing, predictive modelling, and computer vision.
Neurons
In this section, we are going to discuss neurons, which are the basic building
blocks of the neural network. They are also called units or nodes.
We can also represent the input signal as a node, as shown in the following
diagram:
In the preceding diagram, all the nodes that represent the input are called
the input layer. These inputs represent some information that the neurons
receive. The layer in which the neuron is present, which is receiving the
input, is called the hidden layer. The input layer can be thought of as your
senses in the human brain analogy. Any information that comes from your
sensory organs is an input, and it is processed by the neurons. We can't really
see how the information is processed, so we consider these neurons to exist
in hidden layers.
In machine learning and deep learning, we consider these input values as the
independent features that are passed to the neurons. The neurons carry out
some processing and provide an output. This output is sent to the other
neuron, and this step continues until we get a final output. Let's take a look
at the different layers in more detail:
Input layer: The inputs in the input layer are the independent features
or variables. One important thing to note is that these variables are all
from a single observation. We can think of this layer as one row of our
dataset. Let's look at an example: imagine we need to predict the price
of a house based on the size and number of bedrooms in the house. The
size and the number of bedrooms are the independent features or
variables.
Hidden layer: The hidden layer consists of the neurons that will
receive the signal and provide an output.
Output layer: The output layer provides the output value of the neuron
from the hidden layer. The output values can be one of the following
types:
Continuous value (regression): In the case of a continuous
output, we will have one output
Binary values (binary classification): In the case of binary
classification, we will have two outputs
Categorical values (multi-class classification): In this case, we
will have multiple outputs
Each input that the neuron or the node receives is assigned or associated with a weight (w). Weights are crucial to the functioning of a neural network because this is how neural networks learn. By adjusting the weights, the neural network decides in every single case which signals are important and which signals are not important. Each connection between an input and the neuron carries its own weight.
The next important thing to understand is what happens inside a neuron. There are two main activities that take place:
The neuron finds the weighted sum of all the inputs and weights, which is represented by the following equation:
z = w1*x1 + w2*x2 + ... + wm*xm
The weighted sum is then passed through an activation function, which determines the output of the neuron.
A network with many hidden layers and many neurons in each hidden layer
is called a multilayer neural network. Diagrammatically, these can
be represented as follows:
Types of neural networks
There are three main types of neural network: the Artificial Neural Network (ANN), the Convolutional Neural Network (CNN), and the Recurrent Neural Network (RNN). Whatever the type, the neurons rely on activation functions, the most common of which are described in the following sections.
The sigmoid activation function
The sigmoid activation function is represented by the following equation:
f(x) = 1 / (1 + exp(-x))
This activation usually transforms the value of the weighted sum of the input and the weights to a value between 0 and 1. Graphically, it is an S-shaped curve that saturates at 0 and 1.
This activation function is usually used in the output layer of the neural network when we are trying to solve a binary classification problem.
The tanh activation function
This activation function is also called the hyperbolic tangent function. It is represented by the following equation:
f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
This activation transforms the value of the weighted sum of the input and the weights to a value between -1 and 1. It is often preferred over the sigmoid activation function in hidden layers because its output is centred around zero, which makes the optimization process easier. Graphically, it is an S-shaped curve that saturates at -1 and 1.
The ReLu activation function
This is currently the most popular activation function, and it has been shown to converge significantly faster than tanh in practice. It is mathematically represented as follows:
f(x) = max(0, x)
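To make the three activation functions concrete, the following small NumPy sketch (not taken from the book's notebooks) evaluates each of them on a few sample values:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))  # squashed between 0 and 1
print(tanh(z))     # squashed between -1 and 1
print(relu(z))     # negative values clipped to 0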
ReLu, however, has its own limitations and is typically used only in the hidden layers of a neural network. Therefore, for the output layer of a binary classification network, the common choice is the sigmoid activation function, while ReLu is used for the hidden layers. Let's move on and understand how neural networks work.
What is bias?
When we talk about bias, we are referring to bias on a per-neuron basis. We
can think of each neuron as having its own bias term, so that the entire
network will be made up of several biases. The values that get assigned to
the biases also get updated, just like the weights. The weights are usually
updated by the gradient descent through back-propagation during training.
Gradient descent is also learning and updating the biases. We will be
learning about gradient descent in the Gradient descent section. First, let's
take a look at the activation function.
We can think of the bias of each neuron as having a role similar to that of a
threshold. The bias value determines whether the activation output from a
neuron is propagated through the network. In other words, the bias
determines whether or not or by how much a neuron will be activated or
fired. The addition of these biases ends up increasing the flexibility of the
model to fit the given input data.
Each neuron receives a weighted sum of input from the previous layer, and then that weighted sum gets passed through an activation function. This is where the bias is added. The equation is as follows:
output = activation(w1*x1 + w2*x2 + ... + wm*xm + b)
Rather than passing the weighted sum directly to the activation function, we pass the weighted sum plus the bias term, b, to the activation function instead.
Suppose a neuron's weighted sum comes out to -0.35, and that with the ReLu activation function the neuron only fires when its input is greater than 0, so this neuron does not fire. Let's say we want to shift this threshold: instead of 0, we want the neuron to fire if its input is greater than or equal to -1. This is where bias comes into play. The bias is added to the weighted sum before it is passed to the activation function, and the value we assign to the bias is the opposite of the threshold value. If we want our threshold to move from 0 to -1, the bias will be the opposite of -1, which is just 1. Now, our weighted sum is -0.35 + 1, which is 0.65. Passing this to the ReLu activation function gives us the maximum of 0.65 and 0, which is 0.65. The neuron that didn't fire before is now considered to be firing. The model has a bit more flexibility in fitting the data, since it now has a broader range of values that it considers activated.
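The following tiny sketch (not from the book) reproduces this example with NumPy, using the ReLu activation and the numbers mentioned above:

import numpy as np

def relu(x):
    return np.maximum(0, x)

weighted_sum = -0.35   # the neuron's weighted sum of inputs and weights
bias = 1.0             # shifts the effective firing threshold from 0 to -1

print(relu(weighted_sum))         # 0.0  -> without the bias, the neuron does not fire
print(relu(weighted_sum + bias))  # 0.65 -> with the bias added, the neuron fires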
How do neural networks learn?
Now that we have explored the various components of a neural network, we
will discuss how it learns. There are two fundamental approaches to get a
program to do what you want it to do. The first method is hardcoding, where
you actually tell the program specific rules and what outcomes you want,
guiding it and accounting for all the different options that the program has to
deal with. The other method is to use neural networks, where we provide the
program with the ability to understand what it needs to do on its own. We
provide the input and we request the output, and then we let the neural
network figure the rest out on its own.
Our goal is to create a network that learns on its own. Let's consider the
following diagram of a neural network:
In the preceding diagram, we have a basic neural network, which is called a
perceptron. The perceptron is a single-layer feed-forward neural network.
Before we proceed, we need to adjust the output value. Currently, the output
is represented as y, but we will write it as ŷ here to indicate that it is a
predicted value rather than the actual value:
The perceptron was first invented in 1957 by Frank Rosenblatt. Let's think
about how our perceptron learns. Let's say that we have some input values
that have been supplied to the neuron or the perceptron:
Then, the activation function is applied. We have an output that we are going
to plot on a chart, as shown here:
Now, if we need to make the neural network learn, we need to compare the
output value (ŷ) to the actual value (y) that we want the neural network to
achieve. If we plot this on a graph, we will see that there is a difference:
We will now try to calculate the function called the cost function, C, which
is represented as follows:
C = 1/2(ŷ-y)^2
The cost function can be thought of as one half of the squared difference
between the actual value and the predicted value. There are many other cost
functions that we can use, but this is the most common one. It tells us about
the error that can be found in our prediction. Our goal is to minimize the cost
function.
Once we compare the cost function, which is the error, we will feed this
information back to the neural network. As the information goes into the
neuron, all the weights are updated:
The only thing that we control in this neural network is the weights (w1,
w2, w3, ..., wm). After the weights have been updated, we will resend the input
and the whole process will be repeated. We will compute the weighted sum
of all the input and weights, and then pass this through the activation
function. We will then compute the cost function again. After this, we send
the information again, the weights are updated, and so on. This training is
continued until our cost function is minimized.
Once we have the full cost function, the neural network goes back and
updates the weights (w1, w2, w3, ..., wm). This step is called back-
propagation.
Next, we are going to run all the input records again. We will feed every
single row into the neural network, find out the cost, and then update the
weight and do this whole process again. The final goal is to minimize the
cost function.
Gradient descent
In the previous section, we learned that for a neural network to learn, we
need back-propagation. In this section, we will learn how the weights are
adjusted.
Again, here, we have the whole action in process. The neuron receives an
input and then we get the product of the weights, w, and the input, x. Later,
an activation function is applied, so we get ŷ. We compare the predicted
value with the actual value and calculate the cost function, C.
Now, we are going to look at the slope of our cost function at the current point. This is called the gradient, and we find it using differentiation. The reason we differentiate is to find out the slope at that specific point and see whether it is negative or positive. If the slope is negative, we adjust the weight in the direction that moves us further downhill, toward the minimum of the cost function.
We now need to consider a concept called the learning rate. Usually, the
learning rate is initialized with a small value between 0.01 and 0.0001. The
learning rate is the rate at which we want the neural network to converge to
reach the point of best fit. If we initialize a higher value for the learning rate,
it may never converge to the point of best fit. This value needs to be
carefully selected.
After continuing this process and moving downward step by step, we will eventually reach the global minimum point, where the cost function takes its lowest value.
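As a tiny illustration of this loop, the following sketch (not from the book) runs gradient descent on the cost C = 1/2 * (y_hat - y)^2 for a single weight, using a learning rate in the range mentioned above:

x, y = 2.0, 1.0       # one training example
w = 0.0               # initial weight; the prediction is y_hat = w * x
learning_rate = 0.01

for step in range(1000):
    y_hat = w * x                     # forward pass
    gradient = (y_hat - y) * x        # dC/dw, found by differentiation
    w = w - learning_rate * gradient  # move against the slope

print(w, w * x)  # w approaches 0.5, so the prediction approaches the actual value of 1.0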
An introduction to TensorFlow
TensorFlow is an open source deep learning framework that was released by
Google Brain Team in November 2015. TensorFlow was created based on
Google's own proprietary machine learning framework, DistBelief.
TensorFlow usually supports multiple core CPUs and faster GPUs with good
processing power by using the Nvidia CUDA library. TensorFlow is
compatible with all 64-bit operating systems, such as Linux or macOS. It also
supports mobile operating systems, such as Android and iOS.
In a TensorFlow computational graph, the nodes represent operations and the edges represent the multi-dimensional data arrays (called tensors) that flow between them. For example, two input tensors, T1 and T2, can be fed into an addition node that produces a resultant tensor, T3. A TensorFlow Core program therefore consists of two main parts: building the computational graph and running the computational graph in a session.
2. Let's create the nodes that form the input. Currently, we have two inputs
as we want to do an addition operation:
In[2]:
# Creating nodes in computation graph
# Constant node takes no inputs, and it outputs a value it stores internally.
# We can also specify the data type of output tensor using dtype argument.
node1 = tf.constant(5, dtype=tf.int32)
node2 = tf.constant(6, dtype=tf.int32)
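The remaining cells of this small example are not reproduced above; a minimal sketch, assuming TensorFlow 1.x has already been imported as tf, would add the two constant nodes and run the graph in a session:

node3 = tf.add(node1, node2)  # the operation node that adds the two constants

with tf.Session() as sess:    # the graph only produces values when it is run inside a session
    print(sess.run(node3))    # 11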
Now, let's discuss the two main types of tensor objects in a graph: variables and placeholders. Variables hold values, such as weights and biases, that can be updated during training, while placeholders are empty nodes that are fed with data at runtime. We will use both of them in the following linear regression example.
1. Import the numpy, pandas, and matplotlib libraries. The numpy library will help us to create a dataset, and pandas will be used later to combine the features into a dataframe. The code is as follows:
In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
2. Create a million equally spaced points between 0 to 10 using the NumPy linspace() function. This data
will be our input data. The code is as follows:
In[2]:
# 1 Million Points
x_data = np.linspace(0.0,10.0,1000000)
3. Add some noise in the dataset so that it looks like a real-world problem. Here, we are using the randn()
function that is present in NumPy. The code is as follows:
In[3]:
noise = np.random.randn(len(x_data))
4. Use the y = mx + b + noise_levels equation to create the dependent feature, which can be represented as
y_true:
In[4]:
# y = mx + b + noise_levels
b = 5
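The rest of this cell is not shown above; a minimal completion, assuming (purely for illustration) a true slope of 0.5, would be:

m_true = 0.5                           # assumed slope; the actual value is not given in the truncated cell
y_true = (m_true * x_data) + b + noise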
5. Take x_data, which is the independent feature, and y_true, which is the dependent feature, and combine
them into a dataframe. The code is as follows:
In[5]:
my_data = pd.concat([pd.DataFrame(data=x_data,columns=['X Data']),pd.DataFrame(data=y_true,columns=['Y'])],axis=1)
6. The following code helps us see the top five records of the dataset:
In[6]:
my_data.head()
Once we have created a dataset, we need to create a linear regression neural network using TensorFlow.
8. Import TensorFlow:
In[8]:
import tensorflow as tf
10. We will be using the linear regression equation, y= mx + c. Create variables using TensorFlow. The code
is as follows:
In[10]:
## Variables
m = tf.Variable(0.5)
b = tf.Variable(1.0)
11. Create placeholders for the input and output variables. The code is as follows:
In[11]:
xph = tf.placeholder(tf.float32,[batch_size])
yph = tf.placeholder(tf.float32,[batch_size])
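The cells that define batch_size, the linear model, and the error are not reproduced here; a minimal sketch, assuming a (hypothetical) batch size of 8, might look like this:

batch_size = 8                                   # hypothetical value; it must be defined before the placeholders above
y_model = m * xph + b                            # the linear regression model, y = mx + b
error = tf.reduce_sum(tf.square(yph - y_model))  # squared difference between the actual and predicted values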
Here, tf.square() is a built-in function of TensorFlow that will square the difference between yph, which
is the actual value, and y_model, which is the predicted value.
14. Create an optimizer. We will be using GradientDescentOptimizer, which will minimize the error and thereby find the values of m and b that best fit the data. The code is as follows:
In[14]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
train = optimizer.minimize(error)
15. If we want to run the TensorFlow code, we need to initialize all the variables using
the tf.global_variables_initializer() function. The code is as follows:
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    batches = 1000
    for i in range(batches):
        rand_ind = np.random.randint(len(x_data), size=batch_size)
        feed = {xph: x_data[rand_ind], yph: y_true[rand_ind]}
        sess.run(train, feed_dict=feed)
    model_m, model_b = sess.run([m, b])
Within this session, we pass our input data, x_data, into the xph placeholder, and the true output, y_true, into the yph placeholder. We run this for 1,000 batches, where each training step is a combination of one forward and one backward propagation, which results in updating the variables (m and b) of the model.
Once the preceding code is executed, we can get the coefficients that are indicated by m, and the bias, b, which
is the intercept, by using the following code:
In[17]:
model_m
Once we get the coefficient, which is stored in the model_m variable, and the intercept, which is stored in model_b,
we can plot the graph of the line of best fit. The code for this is as follows:
In[19]:
y_hat = x_data * model_m + model_b
Here, y_hat is the value that's predicted by the neural network when we use the coefficient and the bias in the
equation. The line of best fit can be plotted using the following code, in which we are using the matplotlib
plot() function:
In[20]:
my_data.sample(n=250).plot(kind='scatter',x='X Data',y='Y')
plt.plot(x_data,y_hat,'r')
The linear line that we can see in the preceding graph is the line of best fit.
An introduction to Keras
Keras is a high-level neural network API that is capable of running on top
of TensorFlow, Theano, and CNTK. It enables fast experimentation through
a high-level, user-friendly, modular, and extensible API. Keras can also be
run on both CPU and GPU. We can say that Keras is a wrapper on top of
other powerful libraries, such as TensorFlow, Theano, and CNTK. This
structure is shown in the following diagram:
Let's consider a neural network in which we want to create an input layer with 10 inputs, a hidden layer with 10 neurons, and an output layer with two outputs, which makes this a binary classification problem.
To begin, we need to import the Keras library, because all the APIs are present within this library. We will also be
importing the layers library present in Keras, which will help us to create the input layer, the hidden layer, and the
output layer.
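The import cell itself is not reproduced here; a minimal version would be:

import keras
from keras import layers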
Out[1]:
Using TensorFlow backend.
In this step, we will initialize the sequential model since we are creating a feed-forward neural network. The code
is shown here:
In[2]:
model = keras.Sequential()
In the next step, we will start adding the input layer and the first hidden layer. We will be using the add() function to
add layers inside the sequential model. To add layers, we will be using the layer.Dense() function. The following is
the syntax of the layer.Dense() function:
Init signature:
layers.Dense(
    units,
    activation=None,
    use_bias=True,
    kernel_initializer='glorot_uniform',
    bias_initializer='zeros',
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
)
# Input shape
nD tensor with shape: `(batch_size, ..., input_dim)`.
The most common situation would be
a 2D input with shape `(batch_size, input_dim)`.
# Output shape
nD tensor with shape: `(batch_size, ..., units)`.
For instance, for a 2D input with shape `(batch_size, input_dim)`,
the output would have shape `(batch_size, units)`.
In the preceding signature, the first parameter, units, is the number of neurons in the layer. The input_shape parameter defines the shape of the input and is passed as a tuple, for example, (10,) for 10 input features. The activation parameter is used to provide the activation function that's used.
We are considering 10 inputs here, and we add 10 neurons in the first hidden layer. We will apply the ReLu activation function, as shown in the sketch below.
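The cell that adds the layers is not reproduced above; a sketch that is consistent with the description and with the 132-parameter summary shown later would be:

model.add(layers.Dense(10, activation='relu', input_shape=(10,)))  # first hidden layer with 10 neurons and 10 inputs
model.add(layers.Dense(2, activation='sigmoid'))                   # output layer with two outputs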
In the preceding code, we also added an output layer with two outputs. The activation function that's used there is the sigmoid activation function, since this is a binary classification problem.
Finally, we can see a summary of the model using the following code:
In[4]:
model.summary()
The total parameters, 132, is the number of weights and biases that we are using in this neural network.
The functional API
This is suitable for multi-input, multi-output, and arbitrary static graph
topologies:
In[1]:
import keras
from keras import layers
inputs = keras.Input(shape=(10,))
x = layers.Dense(20, activation='relu')(inputs)
x = layers.Dense(20, activation='relu')(x)
outputs = layers.Dense(10, activation='softmax')(x)
model = keras.Model(inputs=inputs, outputs=outputs)
Model subclassing
For fully custom architectures, we can also subclass keras.Model and define the layers and the forward pass ourselves:
class CustomModel(keras.Model):
    def __init__(self):
        super(CustomModel, self).__init__()
        self.dense1 = layers.Dense(20, activation='relu')
        self.dense2 = layers.Dense(20, activation='relu')
        self.dense3 = layers.Dense(10, activation='softmax')
    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        return self.dense3(x)
model = CustomModel()
model.compile(optimizer='adam', loss='categorical_crossentropy')
# x and y are assumed to be NumPy arrays of input features and targets
model.fit(x, y, epochs=10, batch_size=32)
Summary
In this chapter, we discussed deep learning and its uses in financial case
studies. We also discussed neural networks and looked closely at how they
work. We then learned about activation functions, such as sigmoid, tanh,
and ReLu. After that, we looked at bias, which is another important
parameter, before exploring different types of neural networks, including
ANNs, CNNs, and RNNs. We discussed TensorFlow, an open source library
that was developed by Google. We used TensorFlow to solve a practical
problem in which we implemented a linear regression neural network
model. We finished by looking at another library, called Keras, which is a
wrapper on top of TensorFlow.
In the next chapter, we will discuss how we can use LSTM RNNs for stock
predictions.
Stock Market Analysis and
Forecasting Case Study
In this chapter, we will discuss various use cases related to finance using
deep learning techniques and the Keras library. In the previous chapters, we
focused on implementing use cases with Python, along with statistics using
the statsmodels library. We will initially go ahead with an in-depth
discussion of the Long Short Term Memory (LSTM) version of the
Recurrent Neural Network (RNN). Part of the RNN has already been
covered in previous chapters. In this chapter, we will deep dive into the
architecture of the LSTM RNN, which is a variant of the RNN. We need to understand this architecture since we will be solving the use cases with these architectures, using Keras as a wrapper and TensorFlow as the backend library. In this chapter, we will cover the following topics:
LSTM RNN
Predicting and forecasting the stock market price using LSTM – case
study 1
Predicting and forecasting wine sales using the ARIMA model – case
study 2
Technical requirements
In this chapter, we will use Jupyter Notebook for coding purposes. We will
also use pandas, NumPy, and matplotlib. Along with these libraries, we will
use libraries such as TensorFlow and Keras.
You can refer to the technical requirements of the previous chapter to install
TensorFlow and Keras.
This means that an RNN will be able to remember the output of the previous time steps while it is being trained, and this output is simultaneously fed along with the new input to the RNN. The RNN is able to remember the previous output because of the memory that is part of its architecture. The RNN therefore considers both its context, based on what it has seen previously, and the current input.
In a traditional feedforward neural network, all the input data and output data is independent of each other. In the stock prediction use case, however, if you want to predict the next day's stock price, it is always best to know what the previous days' stock prices were. The RNN is defined as recurrent because it performs a similar task for every element of a sequence, with the output being dependent on the previous computations. Another way to define RNNs is to say that they have a memory that saves information from the previous output calculations.
The following diagram shows the architecture of RNN:
The preceding diagram shows how an RNN is being unfolded into a full
network. The left side of the diagram indicates x as our input and o as our
output. Furthermore, the s symbol indicates the neurons in the hidden layer.
There is a self loop on the hidden neurons that indicates that the output of
the previous time step t-1 is also provided as an input in the current time step
t. For example, if the input sequence consists of six days of stock opening price data, then the network would unfold itself into a six-layer neural network, with one layer for each day's opening stock price. However, for the stock prediction use case, we usually require many days of previous data to predict the next day's price more accurately.
A conventional RNN will not work properly, because it has memory
limitations to remember the previous output of the data. Consequently, to
overcome this problem, we will use another variant of RNN, which is called
LSTM RNN. This tends to outperform conventional RNNs.
How does the RNN work?
Before looking at the way in which the RNN works, we will first discuss
the normal feedforward neural networks that we must be familiar with in
order to understand the RNN. However, it is also important to understand
what sequential data is.
RNNs
The RNN usually adds the immediate computed past result to the present
input. Consequently, the RNN has two inputs: the present and the recent
computed past. This is really important, since many sequences of data such
as time series data contain very crucial or important information. This is the
reason that the RNN (which is a state of the art algorithm) can do things
that the other algorithms can't.
When we fold the RNN, we can view the RNN as a sequence of neural
networks that are usually trained in a sequential manner.
On the left-hand side before the equals sign, the representation is an RNN.
On the right-hand side, the RNN is unfolded, and we can see that there is no
loop or cycle since the information is getting passed from one specific time
step to the next one. So, we define an RNN as a sequence of many neural
networks.
In the backpropagation through time, the error is back propagated from the
last time step to the first time step in the unfolded RNN. This allows us to
compute the error in each and every time step, which, in turn, allows us to
update the weights. Usually, in an RNN, backpropagation through time is an expensive process when the number of time steps is large.
Problems with standard RNNs
There are two major problems with a standard RNN. These are covered in
the following sections.
Vanishing gradient problem in
RNN
This is a problem that causes a major difficulty when training an RNN.
More specifically, this involves weights in the initial layers of the networks.
As we do backpropagation with time, which is moving backward in the
unfolded RNN based on the number of time steps, when we calculate the
gradient loss or error with respect to weights, the gradients usually become
smaller, and they keep getting smaller as we continue moving backward in
the RNN. This means that the neurons that are present in the initial layer
learn very slowly compared to all the neurons that are present in the final
layers. As a result of this, the initial layers in the networks are also trained
very slowly.
Why earlier layers of the RNN are
important for us
The initial layers of the RNN are the basic building blocks, or layers, for the
complete RNN. Consequently, for most of the sequential data, it is always
good that the initial layers should be able to distinguish the patterns from
the dataset. This is also very important for the financial data that will
initially help us to understand the pattern.
What harm does it do to our
model?
Usually, due to the vanishing gradient problem, the training time of the
model increases while the accuracy of the model decreases.
Exploding gradients in RNN
Exploding gradients are another type of problem that can occur in an RNN when error gradients accumulate during training. This results in very large updates to the weights of the neural network.
What are exploding gradients?
Usually, while training an RNN, we will get an error gradient as we perform
back propagation over time. In training, these error gradients are usually
accumulated with different values that are actually used for updating the
weights. Due to this accumulation, a very large update can be made to the weights, which, in turn, creates a very unstable network. This prevents the RNN from learning properly, so its performance is degraded.
There are many ways to address or fix the issues of the exploding
and vanishing gradients. These include the following:
Redesigning the RNN so that it has fewer layers or time steps
Using LSTM along with the RNN
LSTM
LSTM is an added extension in the RNN that makes it possible to increase
the RNN's memory. Consequently, LSTM can be very useful when we are
working with financial data, such as time series data, since the time series
data is data with sequences, along with very long time lags in-between.
LSTMs are usually integrated with the RNN in each and every layer, which
is also called the LSTM network. These LSTMs enable the RNNs to
remember the input and the output for a long time due to the memory. This
memory is similar to the memory of any computer, and these LSTMs can
write, delete, and read information to and from its memory.
The LSTM usually consists of three gates and a cell: the input gate (it), the output gate (ot), and the forget gate (ft). The input gate is used to determine whether or not we need to let the input data in, the forget gate is useful for deleting information if it is not required, and the output gate is responsible for handling the output data.
All of these gates in the LSTM RNN perform the operation of the sigmoid
that is transforming the values in a range from zero to one. The issue of the
vanishing gradient is solved using the LSTM because it is responsible for
keeping the gradient values steep enough, which means that the training
time is relatively shorter and the accuracy is quite high.
Use case to predict stock prices
using LSTM
In this section, we will predict the future stock prices for a specific
company stock data, and then compare them with the test data (which is our
future data) to see if we are getting good accuracy.
For this, we will be considering Google stock prices – we will try to predict
the future stock price and then compare the prediction. In this use case, we
will be using LSTM RNN using the Keras open source wrapper with the
TensorFlow library in the backend.
We will be using two CSV files, where one CSV file is the training dataset named Google_Stock_Price_Train.csv and the other one is the test dataset named Google_Stock_Price_Test.csv.
Data preprocessing of the Google
stock data
Let's go ahead and import the training dataset, which is the Google_Stock_Price_Train.csv file, using the pandas library, as shown in the following code snippet:
In[2]:
# Importing the training set
dataset_train = pd.read_csv('Google_Stock_Price_Train.csv')
We can see the top five records by using the head() function:
In[3]:
# Check the first 5 records
dataset_train.head()
We can use the following code to see the last five records:
In[4]:
# Check the last 5 records
dataset_train.tail()
The output is as follows:
Consequently, the training dataset consists of Google stock prices from January 3, 2012 to December 30, 2016. Of all the columns present, we will consider the Open column and base the future prediction on that. We could also choose any other column and make the prediction for it.
Let's pick the column for which we need to make the prediction. We can use the iloc operation that is present in pandas, and we will also convert the data into arrays by using the values operation, as shown here:
In[5]:
# Pick up the Open Column
training_set = dataset_train.iloc[:, 1:2].values
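The import of the scaler is not shown here; it comes from scikit-learn:

from sklearn.preprocessing import MinMaxScaler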
After importing the library, we will transform our complete training dataset
using the MinMax scaler, which will transform between 0 and 1. The code is
as follows:
In[7]:
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)
After creating the object variable of the MinMax scaler, which has the name sc, we need to apply the fit_transform() method to apply the MinMax scaler to the whole training set, and assign a variable named training_set_scaled, which contains our scaled data.
After applying the MinMax scaler functionality, we can see the dataset by
using the following code:
In[9]:
training_set_scaled
The output is as follows:
Consequently, we have scaled down the open column values using the
MinMax scaler. In the next step, we will divide our dataset into independent
and dependent features.
This step is the most important part of the data preprocessing, because this is where we have to create a dataset with some time step data, which is our historical data, and one piece of output data. We will consider 100 time steps, where 100 time steps corresponds to the previous 100 days' stock prices.
Let me explain what this means. Using 100 time steps means that, for each time t, the RNN is going to look at the 100 stock prices before time t, that is, the stock prices between t-100 and t. Based on the trends it captures in these previous 100 time steps of historical data, it will try to predict the next output. Consequently, the 100 time steps represent the historical data from the previous 100 dates in the dataset, from which our RNN is going to learn and understand correlations and trends. Based on the 100 time steps, the model is going to predict the next output, which is the stock price at t+1. The value of 100 is the time step count we are experimenting with here, but we can try different values too. However, we should not select too low a value for the time steps. If we choose just one time step, the RNN will tend to overfit; I have even tried 20 or 30 time steps, but the model did not give an accurate result.
As we know, there are 20 financial days in a month, so 100 time steps
corresponds to five months. This means that each day, our RNN is going to
look at the previous five months and try to predict the stock price of the
next day.
We will restructure our training data so that we have 100 time steps and one
output, which will be the stock price at time t+1.
Let's look at the coding part to see how we can restructure the input to
the data structure. Initially, we will create two variables that indicate the
independent and dependent variables, as follows:
In[10]:
X_train = []
y_train = []
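The loop that fills these lists is not reproduced above; a minimal sketch, using the 100 time steps described earlier, would be:

for i in range(100, len(training_set_scaled)):
    X_train.append(training_set_scaled[i-100:i, 0])  # the previous 100 scaled prices
    y_train.append(training_set_scaled[i, 0])        # the next day's scaled price
X_train, y_train = np.array(X_train), np.array(y_train)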
The following code helps us see the shape of the X_train dataset:
In[12]:
X_train.shape
The preceding shape indicates that the independent feature set has 1,158 rows and 100 columns. The 100 columns correspond to the 100 time steps, or the 100 previous days' stock prices.
The following code helps us to see the shape of the y_train data:
In[13]:
y_train.shape
The y_train data has the output of the next day with respect to the
independent features.
The last step of data preprocessing includes reshaping the input data into a
three-dimensional array. The reason we do this is because the structure of
the RNN accepts only three-dimensional data. The following code helps us
to reshape the data from a two-dimensional shape into a three-dimensional
shape:
In[14]:
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
The preceding code helps us to reshape the X_train data from two-
dimensions into three-dimensions with shape (number of records, number of
time steps, 1).
The following code now helps us to see the shape of the X_train data:
In[15]:
X_train.shape
To begin with, we need to import some of the Keras libraries. The code is as follows:
In[16]:
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
In the preceding code, we imported the Sequential library. The Sequential library helps
us to create a feedforward neural network. The Dense library helps us to create
neurons in the hidden layers. The LSTM library helps us to add the LSTM extensions in
the RNN. Dropout is a library that will help us to deactivate some of the neurons in the
RNN.
First, we need to initialize the RNN. The code for this is as follows:
In[17]:
# Initializing the RNN
regressor = Sequential()
In the following steps, we need to add LSTM layers and dropout regularization.
The following code helps us to add a LSTM layer to the neural network:
In[18]:
# Adding the first LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50, return_sequences = True, input_shape = (X_train.shape[1], 1)))
regressor.add(Dropout(0.2))
In the preceding regressor model, we are adding the first LSTM layer. In this LSTM layer, we have added 50 neurons, set return_sequences to True, and provided the shape of the input data. We have also added a dropout value of 20%.
Similarly, we will add one more layer of LSTM RNN and add a dropout value of
20% again. The code for this is as follows:
In[19]:
# Adding a second LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))
In the second layer, we do not have to provide the input_shape, since the input shape only has to be provided to the first layer. return_sequences is also set to True; when set to True, it makes the layer return the hidden state output for every time step, which is what the next LSTM layer expects as its input.
Similarly, we will be adding two more layers of LSTM in the RNN. The code for
this is as follows:
In[20]:
# Adding a third LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))
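The fourth LSTM layer is not reproduced above; since it is the last LSTM layer before the output layer, return_sequences is typically left at its default value of False:

# Adding a fourth LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.2))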
Finally, we add the output layer. The code for this is as follows:
In[22]:
# Adding the output layer
regressor.add(Dense(units = 1))
Now, we can see the summary of the model by using the following code:
In[23]:
regressor.summary()
The total parameter implies all the weights and biases that are initialized for the
RNN.
Compiling the LSTM RNN
In this section, we will select the optimizer process, along with the training
process and the number of epochs to be used with the batch size. The batch
size indicates how much input data we will be passing to the RNN at a time
during training.
The following code helps in selecting the optimizer and the loss or cost
function to be used while training the LSTM RNN:
In[26]:
# Compiling the RNN
regressor.compile(optimizer = 'adam', loss = 'mean_squared_error')
Mean squared error: The mean squared error is the loss or the cost function. As we train the LSTM RNN, we need to reduce this loss until we reach the global minimum point, which is found with the help of the optimizer.
The final step is to compile the RNN and start training the LSTM RNN by
providing the inputs that fit the RNN to the training set:
In[27]:
# Fitting the RNN to the Training set
regressor.fit(X_train, y_train, epochs = 100, batch_size = 32)
Here, we are running for 100 epochs, and it will take some time to train the
network. The RNN will be trained on the X_train data, which is our input, and
the y_train data, which is our output.
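Before we can look at the test records, the test dataset has to be read in the same way as the training dataset; assuming the file is named Google_Stock_Price_Test.csv, the cell would be:

dataset_test = pd.read_csv('Google_Stock_Price_Test.csv')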
The top five records of the dataset_test can be seen by using the following code:
In[29]:
dataset_test.head()
The next step is to consider the Open column, which has the real test data, so that
we can compare this data with the predicted data and convert it into arrays. We
will follow all the data preprocessing steps that we have already discussed:
In[30]:
real_stock_price = dataset_test.iloc[:, 1:2].values
As discussed in the data preprocessing steps, we need to create the same data
structures that we created for the training dataset. Here, we need to consider
100 time steps, which act as the input to the LSTM RNN:
In[36]:
# Getting the predicted stock price of 2017
dataset_total = pd.concat((dataset_train['Open'], dataset_test['Open']), axis = 0)
inputs = dataset_total[len(dataset_total) - len(dataset_test) - 100:].values
inputs = inputs.reshape(-1,1)
inputs = sc.transform(inputs)
X_test = []
The transform() function is also applied to the test data using the MinMax scaler. The following code helps us to see the shape of the input data:
In[37]:
inputs.shape
In the next step, we will create the data structure with 100 inputs, or features, performing the same data preprocessing that we used to create the training data:
In[38]:
for i in range(100, 142):
    X_test.append(inputs[i-100:i, 0])
Then, we will convert the X_test data into arrays by using the following code:
In[39]:
X_test = np.array(X_test)
After converting it into an array, we need to convert the test data into three-
dimensions, since the input that's required by the LSTM is in that format:
In[40]:
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
In the next step, we will predict the stock price for the test data. The code is as
follows:
In[41]:
predicted_stock_price = regressor.predict(X_test)
Finally, we will visualize the predicted values and the actual stock price by
using the matplotlib library. The code is as follows:
In[43]:
# Visualising the results
plt.plot(real_stock_price, color = 'red', label = 'Real Google Stock Price')
plt.plot(predicted_stock_price, color = 'blue', label = 'Predicted Google Stock Price')
plt.title('Google Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel('Google Stock Price')
plt.legend()
plt.show()
In the preceding code, we are plotting the predicted value and the actual values,
and then we can compare the predicted results. The output is as follows:
The preceding output shows the real Google stock price and the predicted Google stock price. From the prediction, we can see that the model is quite accurate, with only minimal errors.
Predicting wine sales using the ARIMA model
Since we have already discussed the ARIMA model in Chapter 3, Time Series Analysis and Forecasting, in this section, we will be solving another use case, which is predicting the sale of wine. The dataset we have is a monthly wine (champagne) sales dataset.
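The import cell is not reproduced here; based on the libraries used in the rest of this section, a minimal version would be:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
%matplotlib inline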
After importing the libraries, we will read the dataset using pandas. The dataset name is perrin-freres-monthly-champagne-.csv. The following is the code for reading the dataset:
In[2]:
df=pd.read_csv('perrin-freres-monthly-champagne-.csv')
Consequently, we can see that we have the dataset from January 1964 to September 1972. The first thing we will
do is carry out some data preprocessing. We will remove the last two rows, as they do not contain the correct data.
For this, we will apply the drop function that we discussed in Chapter 3, Time Series Analysis and Forecasting. The
code for this is as follows:
In[5]:
df.drop(106,axis=0,inplace=True)
df.drop(105,axis=0,inplace=True)
We have set the inplace value as True, as we want these changes to happen permanently. We will check the last five
records again to see if the row has been deleted or not. The code is as follows:
In[6]:
df.tail()
We will also change the column name to interpret it in a simpler way. The code is as follows:
In[9]:
df.columns=['Month','Sales per month' ]
Let's look at the top five records to see whether or not the column names have been changed:
In[10]:
df.head()
The general process for the ARIMA model that's used for forecasting is as follows:
1. The first step is to visualize the time series data to discover the trends and find out whether or not the data is
seasonal.
2. As we already know, to apply the ARIMA model, we need to use stationary data. The second step, therefore, is to check whether the data is stationary using the Dickey-Fuller test and, if it is not, to convert it into stationary data by differencing.
3. We then select the p and q values for ARIMA(p,i,q) using Auto Correlation Function (ACF) and the Partial
Auto Correlation Function (PACF).
4. The next step is to construct the ARIMA model.
5. Finally, we use the model for the prediction.
First, we need to convert the Month column into a datetime object using pandas. After that, we will set this Month
column as the index of the dataframe, as follows:
In[11]:
df['Month']=pd.to_datetime(df['Month'])
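The cell that sets the index is not shown above; it would simply be:

df.set_index('Month', inplace=True)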
The next step is to find out whether or not the dataset is stationary. To do this, we will define a method to check
whether the time series data is stationary or not using the Dickey-Fuller library, which is called adfuller. This is
present in the statsmodels library. Based on the p value that's returned, we will decide whether the data is stationary
or not. First, we will import the adfuller library or class and create a method, as demonstrated in the following code
snippet:
In[15]:
from statsmodels.tsa.stattools import adfuller
# Store in a function for later use!
def adf_check(time_series):
    result = adfuller(time_series)
    print('Augmented Dickey-Fuller Test:')
    labels = ['ADF Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used']
    for value, label in zip(result, labels):
        print(label + ': ' + str(value))
    if result[1] <= 0.05:
        print('Strong evidence against the null hypothesis; the data is stationary.')
    else:
        print('Weak evidence against the null hypothesis; the data is non-stationary.')
Once we have created the method, we can pass the dataframe with sales per month to the method, and see whether
or not the data is stationary, as follows:
In[16]:
adf_check(df['Sales per month'])
The p value is not less than 0.05, so this dataset is not a stationary dataset.
The next step is to convert the data into stationary data, which is done by differencing, which we discussed in Chapt
er 3, Time Series Analysis and Forecasting. The code is as follows:
In[17]:
df['Sales per Month First Difference'] = df['Sales per month'] - df['Sales per month'].shift(1)
We will pass the Sales per Month First Difference dataset to the adfuller check again to see whether the dataset is
stationary or not. The code is as follows:
In[18]:
adf_check(df['Sales per Month First Difference'].dropna())
Now, we can see that the value of p is less than 0.05. At this point, we can consider the dataset to be stationary.
We can also plot and see whether the dataset is stationary visually by viewing the stationary graph. The code for
this is as follows:
In[19]:
df['Sales per Month First Difference'].plot()
We will now go ahead and apply the SARIMAX model, which is usually applied to seasonal data. The code for this is
as follows:
In[20]:
model=sm.tsa.statespace.SARIMAX(df['Sales per month'],order=(1, 1, 1),seasonal_order=(1,1,1,12))
results=model.fit()
Note that we need to set the values of p, q, and d by using the autocorrelation and partial autocorrelation plots, which we
discussed in Chapter 3, Time Series Analysis and Forecasting.
Here, the value of p that we get is 1, the value of q is 1, and the differencing order d is also 1 (we applied one round of differencing). A sketch of how to generate the ACF and PACF plots follows.
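As a reminder of how those values can be chosen, the following minimal sketch generates the ACF and PACF plots for the differenced series using statsmodels' plotting helpers; the number of lags shown (40) is an arbitrary illustrative choice:
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# The PACF helps suggest the AR order (p); the ACF helps suggest the MA order (q)
plot_acf(df['Sales per Month First Difference'].dropna(), lags=40)
plot_pacf(df['Sales per Month First Difference'].dropna(), lags=40)
plt.show()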
We can also create some future dates by using the DateOffset feature from pandas. The following code creates
dates for an additional two years, which will be our future dates:
In[22]:
from pandas.tseries.offsets import DateOffset
future_dates = [df.index[-1] + DateOffset(months=x) for x in range(0, 24)]
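An equivalent way to build the same future index is pandas' date_range helper. This is just a sketch of that alternative, assuming the monthly data is indexed on the first day of each month; the variable name future_dates_alt is a hypothetical choice:
# 24 month-start dates beginning at the last observed month (the first entry overlaps the data)
future_dates_alt = pd.date_range(start=df.index[-1], periods=24, freq='MS')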
In the next step, we will convert these dates into a dataframe. The code for this is shown here:
In[23]:
future_datest_df=pd.DataFrame(index=future_dates[1:],columns=df.columns)
We can then check the first five records using the following code:
In[24]:
future_datest_df.head()
For the newly added dates, we do not have any sales data; we will be predicting the future sales
using our model.
In the next step, we will be concatenating the future dataset with the original dataset. The code for this is as
follows:
In[25]:
future_df=pd.concat([df,future_datest_df])
Finally, we carry out the prediction for the future dates we created by using the following code:
In[26]:
future_df['forecast'] = results.predict(start = 104, end = 120, dynamic= True)
future_df[['Sales per month', 'forecast']].plot(figsize=(12, 8))
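If we also want confidence intervals around the forecast, the fitted results object exposes a get_forecast helper; this is a minimal sketch of that approach, forecasting 24 steps to match the two years of future dates we created:
# Out-of-sample forecast with confidence intervals
forecast = results.get_forecast(steps=24)
forecast_mean = forecast.predicted_mean
conf_int = forecast.conf_int()
print(forecast_mean.head())
print(conf_int.head())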
This chapter covered two use cases: stock prediction and forecasting using an LSTM RNN, and wine sales
prediction. We also forecasted sales for the upcoming two years using the ARIMA (SARIMAX) model.
Summary
In this chapter, we explored various deep learning concepts, discussing
the RNN and how the LSTM RNN works. We also implemented the stock
prediction use case, walking through all the steps to follow when using an
LSTM RNN, and discussed how the accuracy of the LSTM RNN can be
increased by training it on larger amounts of data.
In the next chapter, we will discuss many other use cases for the concepts
that we have learned about in this book.
What Is Next?
We produce huge amounts of data, around 2.5 quintillion bytes, every day.
A huge chunk of this data is financial data, coming from transactions, banking,
investments, and trading. Financial data can be quite complex, ranging from
time-series transaction data to high-frequency and algorithmic
trading data that's produced at short intervals within a single day.
In this chapter, we will take a look at different use cases in which we can
apply all of the different techniques to do with Python, machine learning,
and deep learning that we learned about in this book.
Many online banking and insurance providers offer the best investment
and insurance policies based on their customers' current financial status and
future financial goals. Many portals use machine learning and deep learning
techniques to find the policy best suited to a particular person.
Financial data is the key to creating the financial models that support
various financial use cases. Many financial companies make
huge profits from the business of selling real-time transaction data, which
helps to build well-trained models for predicting trends in the financial
market.
Financial mergers and acquisitions
The logic behind a merger is that two separate
companies are worth more together than as separate entities. This
consolidation of two companies is a critical corporate strategy that helps
companies preserve their competitive advantages.
Deep learning can also help in predicting the success of a start-up venture
and the investment roadmap for early-stage ventures. This is commonly
described as a two-way strategy that generates a huge amount of revenue for a
company's founders, investors, and first employees. A company can either
have an Initial Public Offering (IPO) by going to a public stock market, or
it might be merged with or acquired by another company (M&A). In either case,
those who have previously invested money receive immediate cash in
return for their shares. This process also helps to prepare exit
strategies. Deep learning allows investors to explore these possibilities by
creating a predictive model that can predict whether a start-up will be
successful.
Additional research paper links for
further reference
In this section, I've listed some additional research paper links that you can
use to find out more about machine learning and deep learning in the
financial domain: