Afin8015 Topic 3 2022 v1

1. Data science is applied across many domains including finance. 2. A typical data science project involves defining the goal, collecting and managing data, building and evaluating models, presenting results, and deploying models. 3. Important steps include defining quantifiable goals, exploring available data, describing data through statistics and visualizations, and ensuring data quality and quantity are sufficient to address the defined goal.

Uploaded by

Kritika Rajput
Financial Data Science Topic-3

AFIN-8015: Financial Data Science

Topic-3: Data Science and Machine Learning Methods (I)

Readings

Chapter-1: Nina Zumel & John Mount (2019). Practical Data Science with R, Second Edition. Manning Publications.
https://multisearch.mq.edu.au/permalink/f/1od1ft6/TN_safari_s9781617295874

Chapter-1 and 2: Ozdemir, S. (2016). Principles of data science: Learn the techniques and math you need to start making sense of your data. Packt Publishing.
https://multisearch.mq.edu.au/permalink/f/i7uiug/MQ_ALMA51204622540002171

Chapter-1 and Chapter-2: Boehmke, Brad and Greenwell, Brandon M. (2019). Hands-on machine learning with R. CRC Press.
https://bradleyboehmke.github.io/HOML/

Chapter-1: Sunila Gollapudi (2016). Practical Machine Learning. Packt Publishing.
https://multisearch.mq.edu.au/permalink/f/1lmkbbh/TN_pq_ebook_centralEBC4520739

Chapter-9 and Chapter-10: Ruppert, D. (2015). Statistics and Data Analysis for Financial Engineering with R examples, Second Edition.
https://multisearch.mq.edu.au/permalink/f/i7uiug/MQ_ALMA51175555040002171

Contents

1 Background
1.1 Active Data Science Domains
1.1.1 Finance
1.2 Life Cycle of a Data Science Project
1.2.1 Defining the Goal
1.2.2 Collect & Manage Data
1.2.3 Build the model
1.2.4 Evaluate & Critique the Model
1.2.5 Present Results & Document
1.2.6 Model Deployment

2 Let’s talk Data
2.1 What is Data?
2.2 Types of Data (Chapter-2 Ozdemir (2016))

3 Introduction to Machine Learning
3.1 What is Machine Learning?
3.2 ML Process
3.3 Models
3.4 Types of Learning Problems
3.5 Machine Learning Algorithms
3.6 Subfields of Machine Learning
3.6.1 Supervised Learning
3.6.2 Unsupervised Learning

4 Linear Regression
4.1 Introduction
4.2 OLS

References

Part 1

Background

• We first cover some background theory as a foundation before moving on to the details of the methods.

1.1 Active Data Science Domains

• Figure 1.1 shows the active data science domains.

Figure 1.1: Data Science Domains (Dasgupta et al., 2018)


1.1.1 Finance
• Trading in finance has been using data science for decades.

• Investment banks, hedge funds, etc. have been using complex models to analyse data and make decisions for some time.

• Some examples of data science use cases are:

– Credit risk modelling and management
– Loan fraud and default detection
– Market basket analysis
– High-frequency trading (HFT)
– Forecasting risk and return using machine learning methods
– Using alternative data, such as text data, for financial modelling


1.2 Life Cycle of a Data Science Project

• Figure 1.2 depicts a typical data science process (Chapter-1 of Mount & Zumel, 2019).

Figure 1.2: Data Science Project- Life cycle


1.2.1 Defining the Goal


• The first task in a data science project is to define a measurable and quantifiable goal. For example, "forecast the n-day-ahead movement in oil prices" is a general goal; it still has to be made quantifiable and measurable (for instance, by fixing n and deciding how forecast accuracy will be judged).

• As per Mount & Zumel (2019, Chapter-1), the following questions help to define the goal. (Chapter-1 of Mount & Zumel (2019) is the main reference for Section 1.2.)

– Why do the sponsors want the project in the first place?
– What do they lack, and what do they need? What are they doing to solve the problem now, and why isn’t that good enough?
– What resources will you need: what kind of data and how much staff?
– Will you have domain experts to collaborate with, and what are the computational resources?
– How do the project sponsors plan to deploy your results?
– What are the constraints that have to be met for successful deployment?


1.2.2 Collect & Manage Data


• This step is about identifying the data required to achieve the goal.

• This is the stage to initially explore the data, describe it (descriptive statistics) and visualise it (plots for understanding).

• It is one of the most important steps.

• Typical questions to ask in this step:

– What data is available to me?
– Will it help me solve the problem?
– Is it enough?
– Is the data quality good enough?


1.2.3 Build the model


• Step involving statistics and machine learning: The analysis stage.

• There may be overlap and back-and-forth between the modelling stage and the data-cleaning stage as you try
to find the best way to represent the data and the best form in which to model it.

• The most common data science modelling tasks are:

Classifying— Deciding if something belongs to one category or another


Scoring— Predicting or estimating a numeric value, such as a price or probability (also referred to as predictive
analysis task)
Ranking— Learning to order items by preferences
Clustering— Grouping items into most-similar groups
Finding relations— Finding correlations or potential causes of effects seen in the data
Characterizing— Very general plotting and report generation from data

• There are several possible methods and approaches for these tasks.

• For example, for classification tasks, some common approaches are logistic regression and tree-based methods. Neural-network-based forecasting is an example for predictive tasks. We will cover some of these in this unit.

• This lecture will cover the basics of some broad categories of these methods.


1.2.4 Evaluate & Critique the Model


• Evaluate the model to check if it satisfies the goal requirements.

– Is it accurate enough for your needs? Does it generalize well?


– Does it perform better than “the obvious guess”? Better than whatever estimate you currently use?
– Do the results of the model (coefficients, clusters, rules, confidence intervals, significances, and diagnostics)
make sense in the context of the problem domain?

• Various measures of accuracy, model fit, predictive power etc.
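
The "obvious guess" check above can be sketched in R as follows. This is only a minimal illustration on simulated data: the variables `x` and `y`, the 160/40 split and the choice of RMSE as the accuracy measure are assumptions made for the example, not part of the lecture. A fitted linear model is compared against a baseline that always predicts the training-sample mean.

```r
# Simulated data (assumption): y depends linearly on x plus noise
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.5 + 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Hold out the last 40 observations for evaluation
train <- dat[1:160, ]
test  <- dat[161:200, ]

# Root mean squared error as one possible accuracy measure
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Baseline: always predict the training mean ("the obvious guess")
rmse_baseline <- rmse(test$y, mean(train$y))

# Candidate model: simple linear regression fitted on the training data
fit <- lm(y ~ x, data = train)
rmse_model <- rmse(test$y, predict(fit, newdata = test))

# Compare the two out-of-sample errors
c(baseline = rmse_baseline, model = rmse_model)
```

If the model's error is not clearly below the baseline's, the model adds little over the obvious guess.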


1.2.5 Present Results & Document


• Present your results to your project sponsor and other stakeholders.

• You must also document the model for those in the organization who are responsible for using, running, and
maintaining the model once it has been deployed.

• Reproducibility is a key component here.

• A presentation for the model’s end users would instead emphasize how the model will help them do their job
better:

– How should they interpret the model?


– What does the model output look like?
– If the model provides a trace of which rules in the decision tree executed, how do they read that?
– If the model provides a confidence score in addition to a classification, how should they use the confidence
score?
– When might they potentially overrule the model?


1.2.6 Model Deployment


• Putting the model into action (operation).

• The model should run smoothly and should not result in disastrous decisions.

• Initial deployment usually happens at a smaller scale.

Part 2

Let’s talk Data

2.1 What is Data?

• A collection of information in either an organised or unorganised format (Chapter-1, Ozdemir (2016)).

Organised data: Data that is sorted into a row/column structure, where every row represents a single observation and the columns represent the characteristics of that observation.
Unorganised data: Data in free form, usually text or raw audio/signals, that must be parsed further to become organised.


2.2 Types of Data (Chapter-2 Ozdemir (2016))

• Structured Data: Usually organised in a table format with rows and columns, holding observations and their characteristics. For example, stock price data.

– Generally thought to be much easier to work with.

• Unstructured Data: Does not follow any standard organisation or structure. For example, unorganised text data such as Twitter posts, Facebook posts etc.

– Is really common.
– Exists in many forms: tweets, emails, literature, news articles, server logs etc.

• Quantitative Data: Data that can be described using numbers and mathematics. For example, annual revenue data for a company.

– Discrete data: Usually data which is counted based on outcomes. For example, the roll of a die.
– Continuous data: Data which is measured rather than counted, and which can take any value within an interval.

• Qualitative Data: Data that cannot be described using numbers and basic mathematics. For example, personal particulars of the board members of a company.
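
As a rough illustration of how these data types appear in R, the sketch below builds a small structured data set. The data frame `firms` and all of its values are hypothetical, invented only for this example.

```r
# Hypothetical structured data: one row per observation (firm),
# one column per characteristic
firms <- data.frame(
  ticker  = c("AAA", "BBB", "CCC"),               # qualitative (text label)
  sector  = factor(c("Bank", "Mining", "Bank")),  # qualitative (category)
  n_board = c(9L, 11L, 7L),                       # quantitative, discrete (a count)
  revenue = c(12.5, 48.1, 7.9)                    # quantitative, continuous (measured)
)

str(firms)                  # shows the type of every column
sapply(firms, is.numeric)   # which columns are quantitative?
table(firms$sector)         # counts are a natural summary for qualitative data
mean(firms$revenue)         # averages only make sense for quantitative data
```

Unstructured data, such as raw text, would first have to be parsed into a table like this before most of the methods in this unit can be applied.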


Part 3

Introduction to Machine Learning

Data science is a superset of machine learning, data mining, and related subjects. It covers the complete process, from data loading through to production.

3.1 What is Machine Learning?

Main reference: Chapter-1 of Gollapudi (2016). The whole of Chapter-1 is relevant.

• Fig-3.1 presents an example concept map representing the key aspects of machine learning (ML).


Figure 3.1: Concept Map Gollapudi (2016)

• There are various definitions for machine learning.


"A computer program is said to learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Mitchell, 2017. Machine Learning, McGraw Hill)

• As per Wikipedia:

"Machine learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to evolve behaviours based on empirical data, such as from sensor data or databases."

• The primary goal of an ML implementation is to develop a general-purpose algorithm that solves a practical and focused problem.

• Important aspects of the process include data, time, and space requirements.

• The goal of a learning algorithm is to produce a result that is a rule and is as accurate as possible.


3.2 ML Process

• Types of datasets required: Training Set, Validation Set (may come from the initial data) and Testing Set

• Training set: data examples that are used to learn or build a classifier.

• Validation set: data examples that are verified against the built classifier and can help tune the accuracy of the
output.

• Testing set: data examples that help assess the performance of the classifier.

Phase 1-Training Phase: The training data are used to train the model by pairing each input with its expected output. The output of this phase is the learned model.

Phase 2-Validation/Test Phase: Measures the validity and fit of the model. How good is the model? Uses the validation dataset, which can be a subset of the initial dataset.

Phase 3-Application Phase: Run the model with real world data to generate results.
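
The three datasets and phases above can be sketched in R as follows. This is a minimal illustration under stated assumptions: the data are simulated, the 60/20/20 split is an arbitrary choice, and a linear model stands in for whatever learner is being trained.

```r
set.seed(42)
n <- 1000
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 0.5 * dat$x + rnorm(n)   # simulated input/output pairs

# Randomly assign each row to one set (60/20/20 is an assumption)
idx <- sample(rep(c("train", "validation", "test"), times = c(600, 200, 200)))
train      <- dat[idx == "train", ]
validation <- dat[idx == "validation", ]
test       <- dat[idx == "test", ]

# Phase 1: train the model on inputs paired with expected outputs
fit <- lm(y ~ x, data = train)

# Phase 2: measure fit on data not used for training
val_mse  <- mean((validation$y - predict(fit, newdata = validation))^2)
test_mse <- mean((test$y - predict(fit, newdata = test))^2)
c(validation = val_mse, test = test_mse)

# Phase 3: apply the model to new inputs
predict(fit, newdata = data.frame(x = c(-1, 0, 1)))
```

For time-ordered financial data, a chronological split (earlier observations for training, later ones for testing) may be more appropriate than the random split used here.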

• Fig-3.2 shows an example flowchart of how learning can be applied to prediction.


Figure 3.2: Example Flowchart for predictive ML workflow


3.3 Models

• Models are central to any ML implementation.

• At a high level, models can be:

– Logical: Rule based (if-else ...), for example, decision trees.
– Geometric: Use geometric concepts such as lines and planes. Linear transformations are often used.
– Probabilistic: Statistical models, which define the relationship between two variables.


3.4 Types of Learning Problems

Figure 3.3: Learning Problems Categories


3.5 Machine Learning Algorithms

• Decision tree based algorithms

• Bayesian method based algorithms

• Kernel method based algorithms

• Clustering methods

• Artificial neural networks

• Dimensionality reduction

• Ensemble methods (combining multiple methods)

• Instance based learning algorithms

• Regression analysis based algorithms

• Association rule based learning algorithms

Figure 3.4: Machine learning algorithms/methods (Gollapudi, 2016)


3.6 Subfields of Machine Learning

Figure 3.5: Subfields of ML


3.6.1 Supervised Learning


Also review Chapter-1 and 2 from Boehmke & Greenwell (2019): https://bradleyboehmke.github.io/HOML/

• Construct predictive models.

• Prediction of a given output (or target) using other variables (or features) in the data set.

• Supervision refers to the fact that the target values play a supervisory role: they indicate to the learner the task it needs to learn.

• Uses labelled data.

• Most supervised learning problems are either regression or classification.
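
The difference between the two kinds of supervised problem can be sketched in R on simulated, labelled data. Everything here is an assumption made for illustration: the variable names, the data-generating coefficients, and the use of `lm` and `glm` as the learners.

```r
set.seed(7)
n <- 300
x <- rnorm(n)   # a single feature

# Regression: the label is a continuous value
y_num <- 1 + 2 * x + rnorm(n)
reg_fit <- lm(y_num ~ x)

# Classification: the label is a category (coded 0/1 here)
y_cls <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))
cls_fit <- glm(y_cls ~ x, family = binomial)   # logistic regression

# Predictions for a new observation with x = 0.3
new_obs   <- data.frame(x = 0.3)
pred_num  <- predict(reg_fit, newdata = new_obs)                     # a number
pred_prob <- predict(cls_fit, newdata = new_obs, type = "response")  # P(class = 1)
pred_cls  <- as.integer(pred_prob > 0.5)                             # a class label
```

The 0.5 cut-off used to turn the predicted probability into a class is a common default, not a requirement.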


3.6.2 Unsupervised Learning


• Statistical tools to conduct descriptive analysis; for better understanding of the data.

• No specific target to solve, for example, clustering to identify groups.

• Unsupervised learning is often performed as part of an exploratory data analysis (EDA).

• Unlabelled dataset
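
As an illustration of clustering on an unlabelled dataset, the sketch below applies k-means to simulated points. The data and the choice of two clusters are assumptions; in practice the number of groups is usually not known in advance.

```r
set.seed(123)
# Simulated, unlabelled data (assumption): two well-separated groups in 2-D
g1 <- matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2)
g2 <- matrix(rnorm(100, mean = 3, sd = 0.5), ncol = 2)
X  <- rbind(g1, g2)

# k-means with 2 clusters; nstart repeats the algorithm from several
# random starting points and keeps the best solution
km <- kmeans(X, centers = 2, nstart = 10)

table(km$cluster)   # number of points assigned to each cluster
km$centers          # estimated cluster centres

# Visual check as part of exploratory data analysis
plot(X, col = km$cluster, xlab = "Feature 1", ylab = "Feature 2")
```

No target variable is supplied anywhere: the algorithm only uses the features to find groups.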


Regression Methods

• This part will discuss two regression methods: linear regression and logistic regression.

• We will start with linear regression in this lecture and continue with logistic regression in week-4.

Part 4

Linear Regression

4.1 Introduction

• Regression analysis is one of the most widely used tools in quantitative research; it is used to analyse the relationship between variables.

• One or more variables are considered to be explanatory variables, and the other is considered to be the dependent variable.

• In general, linear regression is used to predict a continuous dependent variable (regressand) from a number of independent variables (regressors), assuming that the relationship between the dependent and independent variables is linear.

Reading: Statistics and Data Analysis for Financial Engineering with R examples, Second Edition (Chapter-9 and Chapter-10) (Ruppert, 2015)

4.2 OLS

• The regression model with only one independent variable is called simple linear regression, and the model with more than one independent variable is known as multiple linear regression.


• Suppose we have a dependent (or response) variable $Y_i$ which is related to a predictor variable $X_i$. The simple regression model is given by

$$Y_i = \alpha + \beta X_i + \epsilon_i \qquad (4.1)$$

• Here, the error terms $\epsilon_i$ are assumed to be i.i.d. and independent of $X_i$. This model describes Y as lying on a straight line with slope β, also called the regression coefficient, and intercept α. Here Y and X are assumed to have a bivariate normal distribution.

• The three parameters (α, β, and the variance of $\epsilon_i$) can be estimated using the method of Ordinary Least Squares (OLS): α and β are chosen to minimise the sum of squared residuals, and the error variance is then estimated from the resulting residuals.

$$\mathrm{SumRes} = \sum_i \left(Y_i - (\alpha + \beta X_i)\right)^2 \qquad (4.2)$$
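
For reference, the minimisation in (4.2) has a standard closed-form solution. The following worked derivation is not part of the original slides; it uses the same symbols as (4.1) and (4.2), with $\bar{X}$ and $\bar{Y}$ denoting sample means and hats denoting estimates. Setting the partial derivatives of SumRes with respect to α and β to zero and solving gives:

```latex
\begin{align*}
\frac{\partial\,\mathrm{SumRes}}{\partial \alpha}
  &= -2 \sum_i \bigl(Y_i - (\alpha + \beta X_i)\bigr) = 0, \\
\frac{\partial\,\mathrm{SumRes}}{\partial \beta}
  &= -2 \sum_i X_i \bigl(Y_i - (\alpha + \beta X_i)\bigr) = 0, \\[4pt]
\hat{\beta}
  &= \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}, \qquad
\hat{\alpha} = \bar{Y} - \hat{\beta}\,\bar{X}.
\end{align*}
```

In words, the estimated slope is the sample covariance of X and Y divided by the sample variance of X, and the fitted line passes through the point of sample means.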

• R has the function lm (linear model) for linear regression.

• The main arguments to the function lm are a formula and the data. lm takes the model definition as a formula, which is an object of class formula. (A formula object is also used in other statistical functions such as glm, nls, rq etc.)

library(readxl)
# change the working directory to the folder containing the file
data1 = read_excel("PriceHistory_lec-3_afin8015.xlsx", skip = 1) # import data

# convert the imported tibble to a data frame
data1 = as.data.frame(data1)

• Sort the data from old to new

data2 = data1[order(as.Date(data1$Date, format = "%Y-%m-%d")), ]


colnames(data2)

# [1] "Date"
# [2] "Composite"
# [3] "ASX All Ordinaries (180334)"
# [4] "Scentre Group (SCG-AU)"
# [5] "S&P ASX 50 (180520)"
# [6] "Australia and New Zealand Banking Group Limited (ANZ-AU)"
# [7] "Westpac Banking Corporation (WBC-AU)"
# [8] "Telstra Corporation Limited (TLS-AU)"
# [9] "BHP Group Ltd (BHP-AU)"
# [10] "CSL Limited (CSL-AU)"
# [11] "Transurban Group Ltd. (TCL-AU)"
# [12] "Commonwealth Bank of Australia (CBA-AU)"
# [13] "Rio Tinto Limited (RIO-AU)"
# [14] "Aristocrat Leisure Limited (ALL-AU)"
# [15] "Insurance Australia Group Limited (IAG-AU)"


# [16] "Suncorp Group Limited (SUN-AU)"
# [17] "National Australia Bank Limited (NAB-AU)"
# [18] "Newcrest Mining Limited (NCM-AU)"
# [19] "Wesfarmers Limited (WES-AU)"
# [20] "Woodside Petroleum Ltd (WPL-AU)"
# [21] "Woolworths Group Ltd (WOW-AU)"
# [22] "Goodman Group (GMG-AU)"
# [23] "Brambles Limited (BXB-AU)"
# [24] "Macquarie Group Limited (MQG-AU)"

head(data2)

# Date Composite ASX All Ordinaries (180334)


# 1260 2015-08-06 100.0000 5368.639
# 1267 2015-08-06 100.0000 5600.117
# 1828 2015-08-06 100.0000 5600.117
# 1259 2015-08-07 97.1721 5309.430
# 1266 2015-08-07 97.1721 5472.331
# 1827 2015-08-07 97.1721 5472.331
# Scentre Group (SCG-AU) S&P ASX 50 (180520)
# 1260 3.84 5492.369
# 1267 4.02 5757.634
# 1828 4.02 5757.634


# 1259 3.80 5419.528
# 1266 3.96 5606.614
# 1827 3.96 5606.614
# Australia and New Zealand Banking Group Limited (ANZ-AU)
# 1260 29.52
# 1267 32.58
# 1828 32.58
# 1259 28.97
# 1266 30.14
# 1827 30.14
# Westpac Banking Corporation (WBC-AU)
# 1260 31.76512
# 1267 33.25691
# 1828 33.25691
# 1259 31.48665
# 1266 32.17287
# 1827 32.17287
# Telstra Corporation Limited (TLS-AU) BHP Group Ltd (BHP-AU)
# 1260 6.09 25.20
# 1267 6.40 26.69
# 1828 6.40 26.69

# 1259 6.12 24.98


# 1266 6.29 25.93
# 1827 6.29 25.93
# CSL Limited (CSL-AU) Transurban Group Ltd. (TCL-AU)
# 1260 93.87 9.560097
# 1267 98.45 9.813731
# 1828 98.45 9.813731
# 1259 93.49 9.472299
# 1266 96.84 9.667402
# 1827 96.84 9.667402
# Commonwealth Bank of Australia (CBA-AU)
# 1260 81.27000
# 1267 84.18964
# 1828 84.18964
# 1259 76.91000
# 1266 80.95350
# 1827 80.95350
# Rio Tinto Limited (RIO-AU) Aristocrat Leisure Limited (ALL-AU)
# 1260 50.89 8.53
# 1267 53.55 8.71
# 1828 53.55 8.71
# 1259 50.59 8.51

# 1266 53.27 8.70


# 1827 53.27 8.70
# Insurance Australia Group Limited (IAG-AU)
# 1260 5.942622
# 1267 6.055327
# 1828 6.055327
# 1259 5.891393
# 1266 6.014343
# 1827 6.014343
# Suncorp Group Limited (SUN-AU)
# 1260 13.77961
# 1267 14.84037
# 1828 14.84037
# 1259 13.66632
# 1266 14.70649
# 1827 14.70649
# National Australia Bank Limited (NAB-AU)
# 1260 31.17465
# 1267 32.45990
# 1828 32.45990
# 1259 30.69147
# 1266 31.71581

# 1827 31.71581
# Newcrest Mining Limited (NCM-AU) Wesfarmers Limited (WES-AU)
# 1260 11.44 28.59431
# 1267 11.26 30.48146
# 1828 11.26 30.48146
# 1259 10.86 28.45798
# 1266 10.96 30.08681
# 1827 10.96 30.08681
# Woodside Petroleum Ltd (WPL-AU) Woolworths Group Ltd (WOW-AU)
# 1260 32.16218 26.90
# 1267 33.98789 28.12
# 1828 33.98789 28.12
# 1259 31.62927 26.76
# 1266 33.52406 27.73
# 1827 33.52406 27.73
# Goodman Group (GMG-AU) Brambles Limited (BXB-AU)
# 1260 6.35 10.12
# 1267 6.39 10.54
# 1828 6.39 10.54
# 1259 6.30 10.18
# 1266 6.26 10.49
# 1827 6.26 10.49

# Macquarie Group Limited (MQG-AU)


# 1260 78.98
# 1267 81.00
# 1828 81.00
# 1259 78.85
# 1266 80.08
# 1827 80.08

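The above data file contains prices, which are converted to returns before estimation. The 'market model' can be represented as the following regression:

$$R_i = \alpha + \beta_i R_M + \epsilon \qquad (4.3)$$

where $R_i$ is the return on stock i and $R_M$ is the return on the market index.

The following example estimates the OLS regression coefficients for BHP (the stock) against the ASX All Ordinaries (the market), using percentage log returns.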
ret_bhp = 100 * diff(log(data2$`BHP Group Ltd (BHP-AU)`))
ret_asx = 100 * diff(log(data2$`ASX All Ordinaries (180334)`))
lreg1 = lm(formula = ret_bhp ~ ret_asx)
lreg1

#
# Call:
# lm(formula = ret_bhp ~ ret_asx)
#
# Coefficients:
# (Intercept) ret_asx
# 0.01241 1.63913
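
The two steps used above (prices to percentage log returns, then OLS) can be checked on simulated data. The sketch below does not use the lecture's data file: the price paths `p_mkt` and `p_stk` and the "true" coefficients are assumptions. It also confirms that the slope reported by `lm` is the sample covariance divided by the sample variance of the regressor.

```r
set.seed(2022)
n <- 500

# Hypothetical return series used to build simulated price paths (assumption)
mkt_true <- rnorm(n, mean = 0, sd = 1)           # market log returns in %
stk_true <- 0.02 + 1.5 * mkt_true + rnorm(n)     # stock log returns in %
p_mkt <- 5000 * exp(cumsum(c(0, mkt_true)) / 100)
p_stk <- 25 * exp(cumsum(c(0, stk_true)) / 100)

# Prices to percentage log returns: 100 * (log P_t - log P_{t-1})
ret_mkt <- 100 * diff(log(p_mkt))
ret_stk <- 100 * diff(log(p_stk))

# Market model by OLS
fit <- lm(ret_stk ~ ret_mkt)

# Same estimates from the closed-form expressions
beta_hat  <- cov(ret_stk, ret_mkt) / var(ret_mkt)
alpha_hat <- mean(ret_stk) - beta_hat * mean(ret_mkt)

# The two sets of estimates agree up to numerical tolerance
all.equal(unname(coef(fit)), c(alpha_hat, beta_hat))
```

Note that `diff` shortens the series by one observation, so n + 1 prices give n returns.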

• The result in the above example is an lm object which can be used with extractor functions like summary to
provide more information.

summary(lreg1)

#
# Call:
# lm(formula = ret_bhp ~ ret_asx)
#
# Residuals:
# Min 1Q Median 3Q Max
# -17.8179 -1.0931 -0.0124 1.0826 17.5035
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.01241 0.07196 0.172 0.863
# ret_asx 1.63913 0.04451 36.822 <2e-16 ***

# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 3.076 on 1825 degrees of freedom
# Multiple R-squared: 0.4263,Adjusted R-squared: 0.4259
# F-statistic: 1356 on 1 and 1825 DF, p-value: < 2.2e-16

• We get more information about the regression model using summary.

• There are other generic functions which can be used to extract more information from lreg1 and similar regression objects. Table 4.1 lists some of these functions.

Table 4.1: List of generic functions to extract more information

Generic Function    Use
summary()           Returns a summary of the fitted model
coef()              Estimated model parameters
resid()             The model residuals
fitted()            The fitted values of the model
deviance()          The residual sum of squares
anova()             An ANOVA table
predict()           Returns predictions
plot()              Used for creating plots
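
A short sketch of how these extractor functions behave is given below. It uses a small simulated regression (the object `fit` and its data are assumptions) rather than `lreg1`, so that it can be run without the lecture's data file.

```r
set.seed(10)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

coef(fit)            # estimated intercept and slope
head(resid(fit))     # first few residuals
head(fitted(fit))    # first few fitted values
deviance(fit)        # residual sum of squares
anova(fit)           # ANOVA table
predict(fit, newdata = data.frame(x = c(-1, 0, 1)))   # predictions at new x values

# deviance() is exactly the sum of squared residuals
all.equal(deviance(fit), sum(resid(fit)^2))

# fitted values plus residuals reproduce the observed y
all.equal(unname(fitted(fit) + resid(fit)), y)

# The coefficient table from summary(): t value = Estimate / Std. Error
ctab <- coef(summary(fit))
all.equal(ctab[, "t value"], ctab[, "Estimate"] / ctab[, "Std. Error"])
```

The last check shows how the t values in the summary output are formed from the first two columns of the coefficient table.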

The following example shows how to create plots for the lreg1 object.

# plot() on an lm object creates 4 plots, so we first set the graphical
# parameters to a 2 x 2 layout (saving the old settings to restore them later)
par1 = par(no.readonly = TRUE)
par(mfrow = c(2, 2))
plot(lreg1)
par(par1) # restore the original graphical parameters

Figure 4.1: Linear Regression Diagnostic Plots (panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage with Cook's distance contours)

• The upper left plot in Figure 4.1 shows the residuals plotted against their fitted values.

• The plot in the upper right is a standard Q-Q plot; if the residuals are normally distributed, the points should lie close to the reference line.

• The scale-location plot in the lower left shows the square root of the absolute standardized residuals as a function of the fitted values.

• The fourth plot in the lower right shows each point's leverage, a measure of the point's importance in determining the regression result.

• The contour lines on the plot are for Cook’s distance, which is another measure of the importance of each observation to the regression. A smaller distance means that removing the observation has little effect on the regression results.

Any one of the four plots can also be generated on its own using the argument which in the function plot.

Sometimes it is only required to plot the regression line over the data points. The following example demonstrates how to add the regression line using the function abline.

# first plot BHP and ASX returns


plot(ret_asx, ret_bhp)
# add the regression line
abline(lreg1, col = "blue")

Figure 4.2: Regression Fit (scatter plot of ret_bhp against ret_asx with the fitted regression line)
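
The quantities behind the regression diagnostic plots (leverage, Cook's distance, standardized residuals) can also be inspected numerically, and a single diagnostic plot can be requested with the `which` argument of `plot`. The sketch below uses simulated data with one deliberately unusual observation; the data and the size of the shock are assumptions made for illustration.

```r
set.seed(5)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
y[100] <- y[100] + 15        # plant one unusual observation (assumption)
fit <- lm(y ~ x)

lev <- hatvalues(fit)        # leverage of each observation
cd  <- cooks.distance(fit)   # Cook's distance of each observation
rs  <- rstandard(fit)        # standardized residuals

# Observations with the largest Cook's distance deserve a closer look
head(sort(cd, decreasing = TRUE), 3)

# Observation with the largest absolute standardized residual
which.max(abs(rs))

# Single diagnostic plots: 1 = Residuals vs Fitted, 5 = Residuals vs Leverage
plot(fit, which = 1)
plot(fit, which = 5)
```

In `plot` for lm objects, `which` takes values 1 to 6; the default four-panel display corresponds to `which = c(1, 2, 3, 5)`.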



The function lm can handle multiple linear regression as well as simple linear regression. We will discuss multiple linear regression when we cover factor models.
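
As a brief preview, a multiple linear regression only requires more terms on the right-hand side of the formula. The sketch below uses simulated data; the column names `mkt`, `size` and `stock` and the coefficients are hypothetical and are not the factor models covered later.

```r
set.seed(3)
n <- 250

# Hypothetical regressors and response (assumption)
dat <- data.frame(mkt = rnorm(n), size = rnorm(n))
dat$stock <- 0.1 + 1.2 * dat$mkt + 0.4 * dat$size + rnorm(n)

# Regressors are separated by "+" in the formula; data supplies the columns
mfit <- lm(stock ~ mkt + size, data = dat)

coef(mfit)      # one intercept and one slope per regressor
summary(mfit)   # same extractor functions as for simple regression
```

Passing the variables through the `data` argument keeps the formula short and makes `predict` with new data straightforward.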

• Plot using ggplot2

data_ggplot = data.frame(ASX = ret_asx, BHP = ret_bhp)


library(ggplot2)
p1 = ggplot(data_ggplot, aes(ASX, BHP))
p1 + geom_point(color = "blue") + stat_smooth(method = "lm", color = "red") +
theme_bw() + labs(title = "Market Model BHP")

Figure: Market Model BHP (ggplot2 scatter plot of BHP against ASX returns with the fitted regression line)


• Table of output in LaTeX

library(stargazer)
stargazer(lreg1, summary = TRUE, title = "OLS Results", type = "latex",
no.space = TRUE)

Table 4.2: OLS Results

                        Dependent variable:
                        ret_bhp
ret_asx                 1.639***
                        (0.045)
Constant                0.012
                        (0.072)
Observations            1,827
R²                      0.426
Adjusted R²             0.426
Residual Std. Error     3.076 (df = 1825)
F Statistic             1,355.879*** (df = 1; 1825)

Note: *p<0.1; **p<0.05; ***p<0.01

• Table output in Word


library(stargazer)
stargazer(lreg1, summary = TRUE, title = "OLS Results", type = "html",
out = "bhp_capm.doc", no.space = TRUE)

Next Time

• Logistic Regressions

• Machine Learning Methods (Continued...)

References

Boehmke, Brad, & Greenwell, Brandon M. 2019. Hands-on machine learning with R. CRC Press.

Dasgupta, Nataraj, Farias, Ricardo Anjoleto, & Lanzetta, Vitor Bianchi. 2018. Hands-On Data Science with R. Packt Publishing.

Gollapudi, Sunila. 2016. Practical Machine Learning. Packt Publishing.

Mount, John, & Zumel, Nina. 2019. Practical Data Science with R, Second Edition. Manning Publications.

Ozdemir, Sinan. 2016. Principles of data science : learn the techniques and math you need to start making sense of your data. Packt Publishing.

Ruppert, David. 2015. Statistics and data analysis for financial engineering. 2 edn. Vol. 13. Springer.
