Afin8015 Topic 3 2022 v1
Financial Data Science Topic-3
Readings
Chapter-1: Nina Zumel & John Mount (2019). Practical Data Science with R, Second Edition. Manning Publications.
https://multisearch.mq.edu.au/permalink/f/1od1ft6/TN_safari_s9781617295874
Chapter-1 and Chapter-2: Ozdemir, S. (2016). Principles of data science: Learn the techniques and math you need to start making sense of your data.
https://multisearch.mq.edu.au/permalink/f/i7uiug/MQ_ALMA51204622540002171
Chapter-1 and Chapter-2: Boehmke, Brad and Greenwell, Brandon M. (2019). Hands-on machine learning with R. CRC Press.
https://bradleyboehmke.github.io/HOML/
Chapter-1: Sunila Gollapudi (2016). Practical Machine Learning. Packt Publishing.
https://multisearch.mq.edu.au/permalink/f/1lmkbbh/TN_pq_ebook_centralEBC4520739
Chapter-9 and Chapter-10: Ruppert, David (2015). Statistics and Data Analysis for Financial Engineering with R examples, Second Edition.
https://multisearch.mq.edu.au/permalink/f/i7uiug/MQ_ALMA51175555040002171
Contents
1 Background
1.1 Active Data Science Domains
1.1.1 Finance
1.2 Life Cycle of a Data Science Project
1.2.1 Defining the Goal
1.2.2 Collect & Manage Data
1.2.3 Build the model
1.2.4 Evaluate & Critique the Model
1.2.5 Present Results & Document
1.2.6 Model Deployment
4 Linear Regression
4.1 Introduction
4.2 OLS
References
Part 1
Background
• We will first cover some background theory as a foundation before moving on to the details of the methods.
1.1.1 Finance
• Trading in finance has been using data science for decades.
• Investment banks, hedge funds, etc., have been using complex models to analyse data and make decisions for some time.
• Figure-2 depicts a typical data science process (Mount & Zumel, 2019, Chapter-1).
• Chapter-1 from Mount & Zumel (2019) is the main reference for Section 1.2.
• As per Mount & Zumel (2019) (Chapter-1):
– Why do the sponsors want the project in the first place?
– What do they lack, and what do they need? What are they doing to solve the problem now, and why isn’t that
good enough?
– What resources will you need: what kind of data and how much staff?
– Will you have domain experts to collaborate with, and what are the computational resources?
– How do the project sponsors plan to deploy your results?
– What are the constraints that have to be met for successful deployment?
• This is the stage to initially explore the data, describe it (descriptive statistics) and visualise it (plots for understanding).
• There may be overlap and back-and-forth between the modelling stage and the data-cleaning stage as you try
to find the best way to represent the data and the best form in which to model it.
• There are several possible methods and approaches for these tasks.
• For example, for classification tasks, some common approaches are logistic regression and tree-based methods. Neural-network-based forecasting is an example for predictive tasks. We will cover some of these in this unit.
• This lecture will cover the basics of some broad categories of these methods.
• You must also document the model for those in the organization who are responsible for using, running, and
maintaining the model once it has been deployed.
• A presentation for the model’s end users would instead emphasize how the model will help them do their job
better:
Organised data: This refers to data that is sorted into a row/column structure, where every row represents a single observation and the columns represent the characteristics of that observation.
Unorganised data: This is data in free form, usually text or raw audio/signals, that must be parsed further to become organised.
• Structured Data: Usually organised in a table format with rows and columns, holding observations and their characteristics. For example, stock price data in finance.
• Unstructured Data: Does not follow any standard organisation or structure. For example, unorganised text data such as Twitter posts, Facebook posts, etc.
– Is really common.
– Exists in many forms: tweets, emails, literature, news articles, server logs, etc.
• Quantitative Data: Data described using numbers and mathematics. For example, annual revenue data for a company.
– Discrete data: Data which is counted, based on a set of distinct outcomes. For example, the roll of a die.
– Continuous data: Data which is measured and can take any value within a range. For example, the price of a stock.
• Qualitative Data: Data which cannot be described using numbers and basic mathematics. For example, personal particulars of the board members of a company.
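The data types above can be illustrated with a small R sketch. All values, tickers and text below are made up for illustration and are not part of the unit's data.

```r
# Structured data: rows are observations, columns are characteristics
stocks <- data.frame(
  ticker = c("AAA", "BBB", "CCC"),              # hypothetical tickers
  price  = c(10.5, 22.1, 7.8),                  # continuous quantitative data
  trades = c(120L, 85L, 240L),                  # discrete quantitative data (counts)
  sector = factor(c("Bank", "Mining", "Bank"))  # qualitative data
)
str(stocks)

# Unstructured data: free text that must be parsed before it can be analysed
posts <- c("Markets rally as banks gain", "Mining stocks fall on weak demand")
words <- strsplit(tolower(posts), "\\s+")
word_counts <- table(unlist(words))
word_counts
```

Parsing the text into word counts is one simple way of turning unstructured data into an organised (tabular) form.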
Data science is a superset of machine learning, data mining, and related subjects. It covers the complete process, from data loading through to production.
"A computer program is said to learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Mitchell, 2017.
Machine Learning, McGraw Hill)
• As per Wikipedia
"Machine learning is a scientific discipline that is concerned with the design and development of algorithms that
allow computers to evolve behaviours based on empirical data, such as from sensor data or databases."
• The primary goal of an ML implementation is to develop a general-purpose algorithm that solves a practical and focused problem.
• Important aspects in the process include data, time, and space requirements.
• The goal of a learning algorithm is to produce a result that is a rule and is as accurate as possible.
3.2 ML Process
• Types of datasets required: Training Set, Validation Set (may come from the initial data) and Testing Set
• Training set: data examples that are used to learn or build a classifier.
• Validation set: data examples that are verified against the built classifier and can help tune the accuracy of the
output.
• Testing set: data examples that help assess the performance of the classifier.
Phase 1-Training Phase: Training data is used to train the model by pairing each input with its expected output. The output of this phase is the learned model.
Phase 2-Validation/Test Phase: Measures the validity and fit of the model. How good is the model? This phase uses the validation dataset, which can be a subset of the initial dataset.
Phase 3-Application Phase: Run the model with real world data to generate results.
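The three phases can be sketched in R as follows. This is a minimal illustration on simulated data. The 60/20/20 split, the use of lm as the learner and root mean squared error as the performance measure are assumptions made for the example, not choices prescribed by the unit.

```r
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)            # simulated data, not from the unit
dat <- data.frame(x = x, y = y)

# Randomly assign 60% / 20% / 20% of rows to training, validation and testing
idx <- sample(rep(c("train", "valid", "test"), times = c(600, 200, 200)))
train <- dat[idx == "train", ]
valid <- dat[idx == "valid", ]
test  <- dat[idx == "test", ]

# Phase 1: training (inputs paired with expected outputs)
fit <- lm(y ~ x, data = train)

# Phase 2: validation/test, measured here by root mean squared error
rmse <- function(model, newdata) {
  sqrt(mean((newdata$y - predict(model, newdata = newdata))^2))
}
rmse_valid <- rmse(fit, valid)
rmse_test  <- rmse(fit, test)

# Phase 3: application of the model to new inputs
new_obs <- data.frame(x = c(-1, 0, 1))
predict(fit, newdata = new_obs)
```

Keeping the testing set untouched until the end gives a performance estimate that is not influenced by any tuning done on the validation set.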
3.3 Models
• At a high level
• Clustering methods
• Dimensionality reduction
• Prediction of a given output (or target) using other variables (or features) in the data set. https://bradleyboehmke.github.io/HOML/
• Supervision refers to the fact that the target values play a supervisory role: they indicate to the learner the task it needs to learn.
• Unlabelled dataset
• This part will discuss two regression methods: Linear Regression and Logistic Regression.
• We will start with Linear regression in this lecture and continue with Logistic regression in week-4.
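Clustering and dimensionality reduction, listed above, work on unlabelled data. The following is a minimal sketch of both using base R functions (kmeans and prcomp) on simulated data. The two-group structure and all variable names are assumptions made for illustration.

```r
set.seed(7)
# Two simulated groups of 50 observations with two features each (hypothetical data)
g1 <- matrix(rnorm(100, mean = 0), ncol = 2)
g2 <- matrix(rnorm(100, mean = 10), ncol = 2)
X <- rbind(g1, g2)

# Clustering: k-means with two clusters; no target labels are supplied
km <- kmeans(X, centers = 2, nstart = 10)
table(km$cluster)

# Dimensionality reduction: principal components of the same data
pc <- prcomp(X, scale. = TRUE)
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
var_explained
```

In contrast to the regression methods covered next, neither call is given a target variable to predict.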
Part 4
Linear Regression
4.1 Introduction
• Regression analysis is one of the most widely used tools in quantitative research for analysing the relationship between variables.
• One or more variables are considered to be explanatory variables, and the other is considered to be the dependent variable.
• In general, linear regression is used to predict a continuous dependent variable (regressand) from a number of independent variables (regressors), assuming that the relationship between the dependent and independent variables is linear.
Reading: Statistics and Data Analysis for Financial Engineering with R examples, Second Edition (Chapter-9 and Chapter-10) (Ruppert, 2015)
4.2 OLS
• The regression model with only one independent variable is called simple linear regression, and the model with more than one independent variable is known as multiple linear regression.
• Suppose we have a dependent (or response) variable $Y_i$ which is related to a predictor variable $X_i$. The simple regression model is given by
$$Y_i = \alpha + \beta X_i + \epsilon_i \tag{4.1}$$
• Here, the error terms $\epsilon_i$ are assumed to be i.i.d. and independent of $X_i$. This model describes $Y$ as lying on a straight line with slope $\beta$, also called the regression coefficient, and intercept $\alpha$. Here $Y$ and $X$ are assumed to have a bivariate normal distribution.
• The parameters $\alpha$ and $\beta$ can be estimated using the method of Ordinary Least Squares (OLS). The basic optimisation problem minimises the sum of squared residuals
$$\mathrm{SumRes} = \sum_i \left(Y_i - (\alpha + \beta X_i)\right)^2 \tag{4.2}$$
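For the simple regression model, minimising the sum of squared residuals has the well-known closed-form solution $\hat{\beta} = \mathrm{Cov}(X, Y) / \mathrm{Var}(X)$ and $\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$. The sketch below checks this on simulated data against R's lm function. The data and the helper sum_res are illustrative assumptions, not part of the original notes.

```r
set.seed(123)
n <- 200
x <- rnorm(n)
y <- 0.5 + 1.5 * x + rnorm(n, sd = 0.8)   # simulated with alpha = 0.5, beta = 1.5

# Closed-form OLS estimates that minimise the sum of squared residuals
beta_hat  <- cov(x, y) / var(x)
alpha_hat <- mean(y) - beta_hat * mean(x)

# Sum of squared residuals as a function of (alpha, beta), as in equation (4.2)
sum_res <- function(a, b) sum((y - (a + b * x))^2)

# The same estimates obtained from lm
fit <- lm(y ~ x)
coef(fit)
c(alpha_hat, beta_hat)

# Moving away from the estimates increases the sum of squared residuals
sum_res(alpha_hat, beta_hat)
sum_res(alpha_hat + 0.1, beta_hat)
```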
• The main arguments to the function lm are a formula and the data. lm takes the model definition as a formula (see footnote 1), which is an object of class formula.
Footnote 1: A formula object is also used in other statistical functions such as glm, nls, rq, etc.
library(readxl)
# change the working directory to the folder containing the file
# [1] "Date"
# [2] "Composite"
# [3] "ASX All Ordinaries (180334)"
# [4] "Scentre Group (SCG-AU)"
# [5] "S&P ASX 50 (180520)"
# [6] "Australia and New Zealand Banking Group Limited (ANZ-AU)"
# [7] "Westpac Banking Corporation (WBC-AU)"
# [8] "Telstra Corporation Limited (TLS-AU)"
# [9] "BHP Group Ltd (BHP-AU)"
# [10] "CSL Limited (CSL-AU)"
# [11] "Transurban Group Ltd. (TCL-AU)"
# [12] "Commonwealth Bank of Australia (CBA-AU)"
# [13] "Rio Tinto Limited (RIO-AU)"
# [14] "Aristocrat Leisure Limited (ALL-AU)"
head(data2)
# 1827 31.71581
# Newcrest Mining Limited (NCM-AU) Wesfarmers Limited (WES-AU)
# 1260 11.44 28.59431
# 1267 11.26 30.48146
# 1828 11.26 30.48146
# 1259 10.86 28.45798
# 1266 10.96 30.08681
# 1827 10.96 30.08681
# Woodside Petroleum Ltd (WPL-AU) Woolworths Group Ltd (WOW-AU)
# 1260 32.16218 26.90
# 1267 33.98789 28.12
# 1828 33.98789 28.12
# 1259 31.62927 26.76
# 1266 33.52406 27.73
# 1827 33.52406 27.73
# Goodman Group (GMG-AU) Brambles Limited (BXB-AU)
# 1260 6.35 10.12
# 1267 6.39 10.54
# 1828 6.39 10.54
# 1259 6.30 10.18
# 1266 6.26 10.49
# 1827 6.26 10.49
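The above data file contains prices, so they must be converted to returns before estimating the model. The 'market model' can be represented as the following regression:
$$R_i = \alpha + \beta_i R_M + \epsilon \tag{4.3}$$
where $R_i$ is the return on stock $i$ and $R_M$ is the return on the market.
The following example estimates the OLS regression coefficients for BHP against the ASX All Ordinaries index, using percentage log returns.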
ret_bhp = 100 * diff(log(data2$`BHP Group Ltd (BHP-AU)`))
ret_asx = 100 * diff(log(data2$`ASX All Ordinaries (180334)`))
lreg1 = lm(formula = ret_bhp ~ ret_asx)
lreg1
#
# Call:
# lm(formula = ret_bhp ~ ret_asx)
#
# Coefficients:
# (Intercept) ret_asx
# 0.01241 1.63913
• The result in the above example is an lm object which can be used with extractor functions like summary to
provide more information.
summary(lreg1)
#
# Call:
# lm(formula = ret_bhp ~ ret_asx)
#
# Residuals:
# Min 1Q Median 3Q Max
# -17.8179 -1.0931 -0.0124 1.0826 17.5035
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.01241 0.07196 0.172 0.863
# ret_asx 1.63913 0.04451 36.822 <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 3.076 on 1825 degrees of freedom
# Multiple R-squared: 0.4263,Adjusted R-squared: 0.4259
# F-statistic: 1356 on 1 and 1825 DF, p-value: < 2.2e-16
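As a rough check on how the quantities in this output relate to each other, the following sketch recomputes the t value, its square and R-squared from the rounded figures printed above. Because the inputs are rounded, the results agree with the printed values only approximately. The variable names are illustrative.

```r
# Values copied from the summary output above
estimate  <- 1.63913   # slope estimate for ret_asx
std_error <- 0.04451   # its standard error
df_resid  <- 1825      # residual degrees of freedom
f_stat    <- 1356      # F-statistic

# The t value is the estimate divided by its standard error
t_value <- estimate / std_error

# With a single regressor, the F-statistic equals the square of the slope's t value
t_squared <- t_value^2

# With a single regressor, R-squared can be recovered from F and the residual df
r_squared <- f_stat / (f_stat + df_resid)

round(c(t_value = t_value, t_squared = t_squared, r_squared = r_squared), 4)
```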
• There are other generic functions which can be used to get more information from lreg1 and similar regression
objects. Table-2.1 gives a list of some such functions.
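A short sketch of some commonly used generic functions for lm objects in base R is given below. Whether these are the same functions listed in the table is an assumption. The model is fitted on simulated data, and fit stands in for lreg1.

```r
set.seed(2022)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)      # simulated stand-in for the return series
fit <- lm(y ~ x)

coef(fit)                        # estimated coefficients
confint(fit, level = 0.95)       # confidence intervals for the coefficients
head(residuals(fit))             # residuals
head(fitted(fit))                # fitted values
anova(fit)                       # analysis-of-variance table
predict(fit, newdata = data.frame(x = c(-1, 1)))  # predictions at new x values
```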
The following example shows how to create plots for the lreg1 object.
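A minimal sketch of how such diagnostic plots can be produced with base R is shown here. It uses simulated data in place of the return series, so fit is a stand-in for lreg1 and the resulting plots will differ from those in the figure.

```r
set.seed(10)
x <- rnorm(300)
y <- 0.1 + 1.6 * x + rnorm(300)  # simulated stand-in for ret_bhp and ret_asx
fit <- lm(y ~ x)

# Arrange the four default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)                     # restore the previous plotting layout

# A single diagnostic plot can be selected with the `which` argument
plot(fit, which = 2)             # normal Q-Q plot only
```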
[Figure 4.1: Linear Regression Diagnostic Plots (four panels with axes showing residuals and standardized residuals, and Cook's distance contours)]
• The upper left plot in Figure 4.1 shows the residual errors plotted against their fitted values.
• The plot in the upper right is a standard Q-Q plot; points lying close to a straight line suggest that the residual errors are normally distributed.
• The scale-location plot in the lower left shows the square root of the absolute standardized residuals as a function of the fitted values.
• The fourth plot in the lower right shows each point's leverage, a measure of the point's importance in determining the regression result.
• The contour lines on the plot are for Cook's distance, which is another measure of the importance of each observation to the regression. Smaller distances mean that removing the observation has little effect on the regression results.
Note: any one of the four plots can also be generated on its own using the argument which in the function plot.
Sometimes it is only required to plot the regression line over the data points. The following example demonstrates how to add the regression line using the function abline.
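A minimal sketch of this step, assuming simulated data in place of ret_asx and ret_bhp (the axis labels and the object fit are illustrative):

```r
set.seed(5)
x <- rnorm(300)
y <- 0.1 + 1.6 * x + rnorm(300)  # simulated stand-in for the two return series
fit <- lm(y ~ x)

# Scatter plot of the data points
plot(x, y, xlab = "Market return", ylab = "Stock return")

# abline takes the intercept and slope directly from the fitted lm object
abline(fit, col = "red", lwd = 2)
```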
[Figure: plot of ret_bhp (y-axis) against ret_asx (x-axis)]
The function lm can handle multiple linear regression as well as simple linear regression. We will discuss multiple linear regression when we cover factor models.
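As a brief preview, additional regressors are added to the right-hand side of the formula with +. The sketch below uses simulated data with two hypothetical regressors x1 and x2. It is not a factor model from the unit.

```r
set.seed(99)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.2 + 1.0 * x1 - 0.5 * x2 + rnorm(n, sd = 0.5)  # simulated data

# Multiple linear regression: one coefficient per regressor plus the intercept
fit_multi <- lm(y ~ x1 + x2)
coef(fit_multi)
```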
[Figure: plot of BHP (y-axis) against ASX (x-axis)]
library(stargazer)
# LaTeX table of the regression results
stargazer(lreg1, summary = TRUE, title = "OLS Results", type = "latex",
    no.space = TRUE)
# HTML table of the same results, written to the file bhp_capm.doc
stargazer(lreg1, summary = TRUE, title = "OLS Results", type = "html",
    out = "bhp_capm.doc", no.space = TRUE)
• Logistic Regressions
References
Boehmke, Brad, & Greenwell, Brandon M. 2019. Hands-on machine learning with R. CRC Press.
Dasgupta, Nataraj, Farias, Ricardo Anjoleto, & Lanzetta, Vitor Bianchi. 2018. Hands-On Data Science with R. Packt Publishing.
Mount, John, & Zumel, Nina. 2019. Practical Data Science with R, Second Edition. Manning Publications.
Ozdemir, Sinan. 2016. Principles of data science : learn the techniques and math you need to start making sense of your data. Packt Publishing.
Ruppert, David. 2015. Statistics and data analysis for financial engineering. 2 edn. Vol. 13. Springer.