
Last Revised: 07/25/2023

2024 Level 2 - Quantitative Methods


Learning Modules Page

Basics of Multiple Regression and Underlying Assumptions 2

Evaluating Regression Model Fit and Interpreting Model Results 6

Model Misspecification 11

Extensions of Multiple Regression 16

Time-Series Analysis 23

Machine Learning 32

Big Data Projects 43

Review 50


This document should be used in conjunction with the corresponding learning modules in the 2024 Level 2 CFA® Program
curriculum. Some of the graphs, charts, tables, examples, and figures are copyright 2023, CFA Institute. Reproduced and
republished with permission from CFA Institute. All rights reserved.

Required disclaimer: CFA Institute does not endorse, promote, or warrant accuracy or quality of the products or services
offered by MarkMeldrum.com. CFA Institute, CFA®, and Chartered Financial Analyst® are trademarks owned by CFA
Institute.

© 2533695 Ontario Limited d/b/a MarkMeldrum.com. All rights reserved.

1
Last Revised: 07/25/2023

Basics of Multiple Regression and Underlying Assumptions

a. describe the types of investment problems addressed by multiple linear regression


and the regression process

b. formulate a multiple linear regression model, describe the relation between the
dependent variable and several independent variables, and interpret estimated
regression coefficients

c. explain the assumptions underlying a multiple linear regression model and


interpret residual plots indicating potential violations of these assumptions


2
Last Revised: 07/25/2023

Basics of Multiple Regression


Page 1/
specify the model
Main tasks
interpret the output

Multiple regression used to:


identify relationships between variables
test existing theories
forecast value of a DV

- model:

Yi = b0 + b1·X1i + b2·X2i + ⋯ + bK·XKi + εi      i = 1 … n,  n > k
- b0 = intercept; b1 … bK = k partial slope coefficients on the IVs
- deterministic part: b0 + b1·X1i + ⋯ + bK·XKi;  stochastic part: εi
- b̂ ➞ estimated coefficient

* Describe the types of investment problems addressed by multiple linear regression and the regression process

Page 2/
partial slope coefficient: measures ∆DV for a 1 unit ∆IV holding
all other IVs constant
e.g./ RET = .0023 - 5.0585 BY - 2.1901 CS

return bond yield credit spread


when both IVs = 0, RET = .0023
BY ↑ 1, RET ↓ 5.0585
CS ↓ 1, RET ↑ 2.1901
Assumptions/

1/ Linearity - the relationship between the DV and the IVs is linear

2/ Homoskedasticity - the variance of the regression residuals is
   the same for all observations, i.e. Var(εi) = Var(εj)

* Formulate a multiple linear regression model, describe the relation between the dependent variable and several independent variables, and interpret estimated regression coefficients

3
Last Revised: 07/25/2023

Page 3/
Assumptions/
3/ Independence of errors - the observations are independent of
one another
∴ regression residuals are uncorrelated across observations
4/ Normality - regression residuals are normally distributed
5/ Independence of IVs
1/ IVs are not random (i.e. - they have a specific value)
2/ no exact linear relationship between 2 or more IVs
Scatterplot matrix (pairs plot)
- uses simple linear regression: DV vs. each IV
+ each IV vs. the other IVs
what to see: linear relationships (DV vs. each IV)
don't want to see: linear relationships (IV vs. IV)
* Explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions

Page 4/
Scatterplot Matrix
- since we can, and will, interpret the output, this is not a very useful step
  - any violations will be identified statistically, not visually
- e.g. a slight positive relationship of the DV with one IV, positive and negative
  linear relationships among IVs, and almost no 'apparent' linear relationship for
  another IV; b_SMB is nonetheless significant in the output

* Explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions

4
Last Revised: 07/25/2023

Page 5/
Residual plots:
- help identify outliers
- help visually assess homoskedasticity (dispersion appears fairly constant and random)
- also reveal non-constant variance and correlated errors (autocorrelation)

* Explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions

Page 6/
standardized residuals vs. normal distribution (Q-Q plot)
- outliers affect parameter values
- standardized residual = (εi − ε̄)/σε, compared with the normal quantiles (Z scores; ±1.65 cuts off the 5% tails)
- a directional relationship between an IV and ε, i.e. corr(IV, ε) ≠ 0, indicates a misspecified model
- normally distributed ε should fall on the reference line of the Q-Q plot; outliers show up in the tails

* Explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions

5
Last Revised: 07/25/2023

Evaluating Regression Model Fit and Interpreting Model Results

a. evaluate how well a multiple regression model explains the dependent variable by
analyzing ANOVA table results and measures of goodness of fit

b. formulate hypotheses on the significance of two or more coefficients in a multiple


regression model and interpret the results of the joint hypothesis tests

c. calculate and interpret a predicted value for the dependent variable, given the
estimated regression model and assumed values for the independent variable


6
Last Revised: 07/25/2023

Evaluating Model Fit/Interpreting Results


Page 1/
Coefficient of Determination ➞ R²
  R² = sum of squares regression / sum of squares total = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)²
       (explained/predicted variation over total variation; Ȳ = average, Yi = observed)
  - as IVs are added, R² will increase or stay the same - never decreases
  - no information on coefficient significance
  - no information on regression violations that may cause coefficients to be biased
  - poor gauge of model fit - overfitting creates a bad model
    (low in-sample error, high out-of-sample error)

Adjusted R² ➞ R̄² = 1 − (SSE/(n − k − 1)) / (SST/(n − 1)) = 1 − [(n − 1)/(n − k − 1)](1 − R²)

* Evaluate how well a multiple regression model explains the dependent variable by analyzing ANOVA table results and measures of goodness of fit

Page 2/
Adjusted 𝐑 ➞ 𝐑 𝟐 $𝟐 = 1 - 𝐒𝐒𝐄/𝐧 − 𝐤 − 𝟏 𝐧−𝟏
= 𝟏 − IJ K (𝟏 − 𝐑𝟐 )L
𝐒𝐒𝐓/𝐧 − 𝟏 𝐧−𝐤−𝟏
FYI 𝐒𝐒𝐄 𝐧−𝟏
× 𝐒𝐒𝐄(𝐧 − 𝟏) 𝐧−𝟏 𝐒𝐒𝐄
𝐧−𝐤−𝟏 𝐒𝐒𝐓
= =- .- .
𝐒𝐒𝐓 𝐧−𝟏 𝐒𝐒𝐓(𝐧 − 𝐤 − 𝟏) 𝐧 − 𝐤 − 𝟏 𝐒𝐒𝐓
×
𝐧−𝟏 𝐒𝐒𝐓

➞ SSR + SSE = SST ➞ 𝐒𝐒𝐑D𝐒𝐒𝐓 + 𝐒𝐒𝐄D𝐒𝐒𝐓 = 1

𝐑𝟐 + 𝐒𝐒𝐄D𝐒𝐒𝐓 = 1
M.M138463888.

∴ 𝐒𝐒𝐄D𝐒𝐒𝐓 = 1 - 𝐑𝟐
" 𝟐 : if 𝐤 ≥ 𝟏 , 𝐑
- for 𝐑 " 𝟐 < 𝐑𝟐
" 𝟐 ↑ , else 𝐑
& if coefficient’s |𝐭 − 𝐬𝐭𝐚𝐭| > 𝟏 , 𝐑 "𝟐 ↓

* Evaluate how well a multiple regression model explains the dependent variable by analyzing ANOVA table results and measures of goodness of fit

7
Last Revised: 07/25/2023

Page 3/
application/
  R² = SSR/SST = 90.6234/147.2416 = .6155
  R̄² = 1 − [(50 − 1)/(50 − 5 − 1)](1 − .6155) = .5718

  R̄² ↑ with Factors 1, 3, 4
  R̄² ↓ with Factors 2 & 5 (both also insignificant)
  F1 + F2: R̄² ↓ ; add F3: R̄² ↑ ; add F4: R̄² ↑ ; add F5: R̄² ↓

* Evaluate how well a multiple regression model explains the dependent variable by analyzing ANOVA table results and measures of goodness of fit
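A quick numeric check of the two formulas above (Python sketch; only the SSR/SST figures given in the example are used):

    n, k = 50, 5
    ssr, sst = 90.6234, 147.2416

    r2 = ssr / sst                                  # ≈ 0.6155
    adj_r2 = 1 - (n - 1) / (n - k - 1) * (1 - r2)   # ≈ 0.5718
    print(round(r2, 4), round(adj_r2, 4))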

Page 4/
R̄² ➞ no intuitive explanation, re: %'age of variance explained
   ➞ no information on coefficient significance or potential coefficient bias
   ➞ not a 'goodness of fit' measure

AIC - Akaike's Information Criterion/ evaluates a collection of models that explain the same DV
  AIC = n·ln(SSE/n) + 2(k + 1)       lower = better; 2(k + 1) is the penalty term
  (adding IVs may lower SSE, but never raise it)

BIC - Schwarz's Bayesian Information Criterion/
  BIC = n·ln(SSE/n) + ln(n)·(k + 1)
  since ln(n) > 2 once n > e² ≈ 7.4, BIC assesses a greater penalty

* Evaluate how well a multiple regression model explains the dependent variable by analyzing ANOVA table results and measures of goodness of fit
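Python sketch of the two criteria above (the SSE, n and k values are hypothetical, only for illustration):

    import numpy as np

    def aic(sse, n, k):
        return n * np.log(sse / n) + 2 * (k + 1)

    def bic(sse, n, k):
        return n * np.log(sse / n) + np.log(n) * (k + 1)

    # compare a 4-factor and a 5-factor model of the same DV (lower = better)
    print(aic(sse=58.2, n=50, k=4), aic(sse=56.7, n=50, k=5))
    print(bic(sse=58.2, n=50, k=4), bic(sse=56.7, n=50, k=5))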

8
Last Revised: 07/25/2023

Page 5/
AIC better for prediction purposes
BIC better when ‘goodness of fit’ is preferred

Hypothesis Testing/ same as SLR


single coefficient ➞ t-stat vs. critical value
H0: 𝐛𝐣 = 𝐁𝐣 H0: 𝐛𝐣 ≤ 𝐁𝐣 H0: 𝐛𝐣 ≥ 𝐁𝐣
Ha : 𝐛𝐣 ≠ 𝐁𝐣 Ha : 𝐛𝐣 > 𝐁𝐣 Ha : 𝐛𝐣 < 𝐁𝐣
right left
- default output t-stat
- testing any other value
∝ = .05 vs. 𝐁𝐣 = 0 confidence interval

* Formulate hypotheses on the significance of two or more coefficients in a multiple regression model and interpret the results of the joint hypothesis tests

Page 6/
testing joint coefficients/
e.g./ 𝐘 = 𝐛𝟎 + 𝐛𝟏 𝐗 𝟏 + 𝐛𝟐 𝐗 𝟐 + 𝐛𝟑 𝐗 𝟑 + 𝐛𝟒 𝐗 𝟒 + 𝐛𝟓 𝐗 𝟓 - unrestricted
model
H0 : 𝐛𝟒 = 𝐛𝟓 = 0 Ha : at least one coefficient ≠ 0
𝐘 = 𝐛𝟎 + 𝐛𝟏 𝐗 𝟏 + 𝐛𝟐 𝐗 𝟐 + 𝐛𝟑 𝐗 𝟑 - restricted model (nested model)

F-stat = [(SSE_restricted − SSE_unrestricted)/q] / [SSE_unrestricted/(n − k − 1)]
         q = # of restrictions; the F-stat is a ratio of 2 variances

testing the unrestricted model (overall significance)/
F-stat = MSR/MSE = [SSR/k] / [SSE/(n − k − 1)]

* Formulate hypotheses on the significance of two or more coefficients in a multiple regression model and interpret the results of the joint hypothesis tests
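Python sketch of the nested-model F-stat above (the SSE values are hypothetical; q is the number of coefficients set to zero in H0):

    def joint_f(sse_restricted, sse_unrestricted, q, n, k):
        """F-stat for H0: the q excluded coefficients are jointly zero."""
        return ((sse_restricted - sse_unrestricted) / q) / (sse_unrestricted / (n - k - 1))

    # e.g. drop X4 and X5 (q = 2) from a 5-IV model estimated on n = 50 observations
    f_stat = joint_f(sse_restricted=61.4, sse_unrestricted=56.6, q=2, n=50, k=5)
    print(f_stat)   # compare with the critical F(q, n - k - 1) value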

9
Last Revised: 07/25/2023

Page 7/
prediction ➞ substitute values for the IVs, solve for Ŷ
- once a model is estimated, even if a coefficient is insignificant, it must be used in prediction
- forecasts have a standard error (a measure of precision), not a standard deviation (a measure of dispersion)
- every model has model error (ε)
- selected values of the IVs have sampling error
  Ŷ = b̂0 + b̂1·X1 + b̂2·X2   where X1 & X2 themselves require a forecast
∴ the SE of the forecast for Ŷ will be larger than the SE of the regression

* Calculate and interpret a predicted value for the dependent variable, given the estimated regression model and assumed values for the independent variable


10
Last Revised: 07/25/2023

Model Misspecification

a. describe how model misspecification affects the results of a regression analysis


and how to avoid common forms of misspecification

b. explain the types of heteroskedasticity and how it affects statistical inference

c. explain serial correlation and how it affects statistical inference

d. explain multicollinearity and how it affects regression analysis


11
Last Revised: 07/25/2023

Model Misspecification
Page 1/
- principles for good regression model specification:
should be grounded in economic reasoning
model should be parsimonious
model should perform well out-of-sample
model functional form should be appropriate
model should satisfy regression assumptions

Failures in regression functional form


1/ omitted variables: true model Y = b0 + b1X1 + b2X2; omitting X2 folds it into the error, ε* = b2X2 + ε
   ➞ if corr(X1, X2) = 0: E(ε*) ≠ 0, the errors will not be iid (possible CH/SC), and b0 will be biased
   ➞ if corr(X1, X2) > 0: then corr(X1, ε*) > 0, and b0 & b1 will be biased,
      as will the SEs ➞ t-tests will be invalid

* Describe how model misspecification affects the results of a regression analysis and how to avoid common forms of misspecification

Page 2/
2/ Inappropriate form of variables
- failing to account for non-linearity ➞ can transform a
variable to make it linear (possible CH)
3/ Inappropriate scaling
- possible CH/MC
4/ Inappropriate pooling of data - data from different populations
- regime changes in time series
knowledge - possible CH, SC
check

Homoskedastic - Var(ε) constant across observations

Heteroskedastic - non-constant variance
  - may result from: model misspecification, omitted vars., incorrect functional form,
    incorrect data transformations, extreme values of IVs

* Explain the types of heteroskedasticity and how it affects statistical inference

12
Last Revised: 07/25/2023

Page 3/
Consequences/
unconditional heteroskedasticity: 𝐯𝐚𝐫(𝛆) not correlated with IVs
- no issues for inference
conditional heteroskedasticity (CH): 𝐕𝐚𝐫(𝛆) correlated with IVs
F-test unreliable since MSE is a biased estimator of the
true population variance
t-tests of coefficients unreliable since SE estimators will be
biased (underestimated, ∴ t-stats inflated)
- significance where none may exist (Type I error)
Test: BP (Breusch-Pagan) - Step 2 regresses the squared residuals on the IVs
  H0: no CH ➞ a1 = a2 = 0      Ha: CH ➞ at least one ≠ 0
  test-stat: χ²(k) = nR², with R² taken from the Step 2 regression
  if CH is present, a1 or a2 or both will be significant

* Explain the types of heteroskedasticity and how it affects statistical inference
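Python sketch of the BP procedure above (numpy/statsmodels/scipy; the auxiliary regression of ε̂² on the IVs is the "Step 2" regression):

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    def breusch_pagan(resid, X):                              # X = IVs without a constant
        aux = sm.OLS(resid ** 2, sm.add_constant(X)).fit()    # Step 2 auxiliary regression
        n, k = X.shape
        lm = n * aux.rsquared                                 # test statistic = nR²
        p_value = 1 - stats.chi2.cdf(lm, df=k)
        return lm, p_value                                    # reject H0 (no CH) if p is small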

Page 4/
Correcting/ compute robust SEs (included in most software packages)
(a.k.a. heteroskedasticity-consistent SEs or White-corrected
SEs)
Serial Correlation/ errors correlated across observations
(autocorrelation)
result: incorrect SEs
if IV is a lagged variable of the DV, then also invalid
coefficients
Positive SC - positive residual most likely followed by a pos. residual
(neg.) (neg.)
Negative SC - neg. residual most likely followed by a pos. residual
(pos.) (neg.)

trending mean reversion

positive SC negative SC

* Explain serial correlation and how it affects statistical inference

13
Last Revised: 07/25/2023

Page 5/
Serial Correlation/ first-order SC ➞ corr(εt, εt−1) > 0
  pos. SC affects the tests:
    t-stat = coefficient/SE ➞ SE underestimated ➞ t-stat inflated
    F-stat = MSR/MSE ➞ MSE underestimated ➞ F-stat inflated

Testing/ DW - Durbin-Watson: tests first-order SC only
         BG - Breusch-Godfrey: tests for SC at any lag
           one lag ➞ χ² test (testing one coefficient)
           p > 1 lags ➞ F-test (testing joint coefficients), df = n − p − k − 1

* Explain serial correlation and how it affects statistical inference
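Python sketch using statsmodels' built-in versions of the two tests above (the data are simulated, only to make the calls runnable):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.diagnostic import acorr_breusch_godfrey

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(size=100)
    results = sm.OLS(y, sm.add_constant(X)).fit()

    dw = durbin_watson(results.resid)                         # ≈ 2 when no first-order SC
    lm, lm_pval, f_val, f_pval = acorr_breusch_godfrey(results, nlags=4)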

Page 6/
Correcting SC/ adjust the SEs
- SC-consistent SEs will also correct
- Newey-West SEs for CH

Multicollinearity/ 2 or more IVs are highly correlated or there is


an approximate linear relationship among IVs
- coefficients will be consistent but imprecise & unreliable
- inflated SEs and insignificant t-stats
but possible significant F-stat
- detecting: VIF - variance inflation factor
  - regress one IV on all the others, get R²
  VIF_X1 = 1/(1 − R²_X1)      min. = 1 when R²_X1 = 0
  > 5 becoming concerning; > 10 indicates MC

* Explain multicollinearity and how it affects regression analysis
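Python sketch of the VIF calculation above (each IV regressed on the others; statsmodels also ships a variance_inflation_factor helper):

    import numpy as np
    import statsmodels.api as sm

    def vif(X):                         # X: (n, k) array of IVs, no constant column
        out = []
        for j in range(X.shape[1]):
            others = np.delete(X, j, axis=1)
            r2 = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
            out.append(1.0 / (1.0 - r2))      # > 5 concerning, > 10 signals MC
        return out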

14
Last Revised: 07/25/2023

Page 7/
Solutions: exclude one or more IVs (preferred)
use a different proxy for one of the variables
increase sample size
- if the goal is prediction, then a significant model is more
important than a significant coefficient
∴ MC less of an issue

* Explain multicollinearity and how it affects regression analysis


15
Last Revised: 07/25/2023

Extensions of Multiple Regression

a. describe influence analysis and methods of detecting influential data points

b. formulate and interpret a multiple regression model that includes qualitative


independent variables

c. formulate and interpret a logistic regression model


16
Last Revised: 07/25/2023

Extensions of Multiple Regression


Page 1/
Influential Data Points/
1/ high leverage point - extreme value of an IV
2/ Outlier - extreme value of a DV

(scatterplot: a high leverage point lies far outside the range of the IV; an outlier lies far outside the range of the DV)
- may not be an issue if it lies on the regression line
Detection/ leverage  hii = 1/n + (Xi − X̄)²/Σ(Xi − X̄)²   ➞ ranges from 1/n to 1
- if hii > 3·(k + 1)/n, obs. i is a potential influencer

* Describe influence analysis and methods of detecting influential data points

Page 2/
  3(k + 1)/n - typically for small samples
  - large samples use 2(k + 1)/n (Belsley, Kuh, Welsch, 1980), or flag the most extreme 5%
Detecting outliers: externally studentized residuals
  - delete each case i
  - calculate a new regression on the n − 1 remaining cases
  - calculate e*i = Yi − Ŷi for all n (using all cases of the IVs), where Ŷi is forecast
    from the regression with case i deleted
  - calculate the studentized residual:
      t*i = e*i / S_e*,  where e*i = ei/(1 − hii)   (ei = residual from the initial regression)
      and S²_e* = MSE/(1 − hii)  with MSE = SSE/(n − k − 1)

* Describe influence analysis and methods of detecting influential data points

17
Last Revised: 07/25/2023

Page 3/
FYI derivation:
  S²_e* = MSE/(1 − hii) = [SSE/(n − k − 1)]·[1/(1 − hii)] = SSE/[(n − k − 1)(1 − hii)],  and e*i = ei/(1 − hii)

  t*i = [ei/(1 − hii)] / √( SSE/[(n − k − 1)(1 − hii)] )
  square both sides:  t*²i = [e²i/(1 − hii)²]·[(n − k − 1)(1 − hii)/SSE] = e²i(n − k − 1)/[SSE(1 − hii)]
  square root both sides:
  t*i = ei·√[ (n − k − 1)/(SSE(1 − hii)) ]   ➞ follows a t-distribution with df = n − k − 1

  |t*i| > critical t for small samples (preferred)
  |t*i| > 3 for large samples

* Describe influence analysis and methods of detecting influential data points

Page 4/
outliers/high leverage points not always influential
- observations are influential if their exclusion causes
substantial changes in the estimated regression function
- measure: Cook’s Distance (Cook’s D) - metric for identifying
influential data points
Di = [e²i/(k·MSE)] · [hii/(1 − hii)²]     - depends on both the residual for obs. i and its leverage
                                            (Discrepancy × Leverage)
- large Di ➞ influencer
  Di > 0.5  may be influential
  Di > 1    likely to be influential (common)
  Di > 2√(k/n) - large samples (actually the cutoff for DFFITS)

* Describe influence analysis and methods of detecting influential data points
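Python sketch using statsmodels' influence tools for the three measures above (simulated data, and the flag thresholds are the ones listed in the notes):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 3))
    y = 0.5 + X @ np.array([1.0, -2.0, 0.7]) + rng.normal(size=60)
    res = sm.OLS(y, sm.add_constant(X)).fit()

    infl = res.get_influence()
    n, k = X.shape
    leverage = infl.hat_matrix_diag                  # flag if > 3(k+1)/n (small samples)
    t_star = infl.resid_studentized_external         # flag if |t*| > critical t (or 3, large samples)
    cooks_d = infl.cooks_distance[0]                 # flag if > 0.5 / > 1.0

    flagged = np.where((leverage > 3 * (k + 1) / n) | (cooks_d > 1.0))[0]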

18
Last Revised: 07/25/2023

Page 5/

(influence plot: externally studentized residuals t*i plotted against leverage hii, with reference lines at 1/n and 2√(k/n))

Dummy variables (indicator variables)


- for a qualitative IV ➞ takes a value of 0 or 1
(F) (T)
- to distinguish between 𝐧 categories, need 𝐧 − 𝟏
dummy vars. ➞ avoids exact linear relationships

* Formulate and interpret a multiple regression model that includes qualitative independent variables

Page 6/
Intercept dummy/
  Yi = b0 + d0·D1 + b1·X1
  D1 = 0 (base category): Yi = b0 + b1·X1
  D1 = 1: Yi = (b0 + d0) + b1·X1         - in effect, 2 regressions are estimated
Slope dummy/
  Yi = b0 + b1·X1 + d1·D·X1 + ε          (D·X1 = interaction term)
  D = 0 (base category): Yi = b0 + b1·X1          slope = b1
  D = 1: Yi = b0 + (b1 + d1)·X1                   slope = b1 + d1
Both:
  Yi = b0 + d0·D1 + b1·X1 + d1·D·X1

* Formulate and interpret a multiple regression model that includes qualitative independent variables

19
Last Revised: 07/25/2023

Page 7/
Testing - individual t-tests on the coefficients for the dummy vars.

Example/ Ret. = 𝐛𝟎 + 𝐛𝟏 𝐄𝐗𝐏 + 𝐛𝟐 𝐂𝐀𝐒𝐇 + 𝐛𝟑 𝐀𝐆𝐄 + 𝐛𝟒 𝐒𝐈𝐙𝐄


expense ratio cash ratio
add 𝐝𝟏 - BLEND if both = 0, original regression applies
𝐝𝟐 - GROWTH RET is for a VALUE fund
intercept dummies

add 𝐝𝟑 - AGE_BLEND if both = 0, original regression applies


𝐝𝟒 - AGE_GROWTH if BLEND = 1, 𝐛𝟑 𝐀𝐆𝐄 + 𝐝𝟑 𝐀𝐆𝐄
slope dummies = (𝐛𝟑 + 𝐝𝟑 )𝐀𝐆𝐄
if GROWTH = 1, 𝐛𝟑 𝐀𝐆𝐄 + 𝐝𝟒 𝐀𝐆𝐄
= (𝐛𝟑 + 𝐝𝟒 )𝐀𝐆𝐄

* Formulate and interpret a multiple regression model that includes qualitative independent variables

Page 8/
Qualitative DV ➞ categorical (fraud, no fraud;
(discrete) (binary) bankruptcy, no bankruptcy)
- most common
linear regression unsuitable (relation between the DVs and IVs
not linear)
logit model
  ln(p/(1 − p)) = b0 + b1X1 + ⋯ + bnXn + ε
  - the logistic transformation (most common) transforms Y so the relationship is linear
  p ➞ prob. of the event occurring (i.e. T) ~ Bernoulli
  p/(1 − p) ➞ ratio of P(T) to P(F)   (the odds of the event occurring)

20
Last Revised: 07/25/2023

Page 9/
e.g./ p = .75:  p/(1 − p) = .75/.25 = 3 to 1   (3 events occurring to 1 non-occurrence)
ln(p/(1 − p)) ➞ log odds or logit   ∴ e^[ln(p/(1 − p))] ➞ odds

➞ Logistic Regression ➞ widely used in ML where the objective is classification
  ln(p/(1 − p)) = b0 + b1X1 + ⋯ + bnXn + ε   ➞ logistic regression model
  to recover p:  p = 1/(1 + e^−(b0 + b1X1 + ⋯ + bnXn))   or   p = e^(b0 + b1X1 + ⋯ + bnXn)/(1 + e^(b0 + b1X1 + ⋯ + bnXn))

Page 10/
ln(p/(1 − p)) = b0 + b1X1 + ⋯ + bnXn = A
∴ p/(1 − p) = e^A
   p = e^A·(1 − p)
   p = e^A − p·e^A
   p + p·e^A = e^A
   p(1 + e^A) = e^A
   p = e^A/(1 + e^A) = 1/(1 + e^−A)

check with p = .75:  ln(.75/.25) = 1.098612289   (log odds)
   p = e^1.098612289/(1 + e^1.098612289) = 3/4 = .75,  or  1/(1 + e^−1.098612289) = 1/1.333… = .75
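Python check of the worked example above (only numpy; the numbers are the ones in the notes):

    import numpy as np

    p = 0.75
    log_odds = np.log(p / (1 - p))             # 1.098612...
    p_back = 1 / (1 + np.exp(-log_odds))       # 0.75 recovered by the logistic transform
    odds = np.exp(log_odds)                    # 3.0  ("3 to 1")
    print(log_odds, p_back, odds)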

21
Last Revised: 07/25/2023

Page 11/
ln(p/(1 − p)) = b0 + b1X1 + ⋯ + bnXn + ε      ε has a logistic distribution
                                               (like a normal dist. but with kurtosis K = 4.2)
p ➞ 1 as b0 + b1X1 + ⋯ + bnXn ➞ +∞
p ➞ 0 as b0 + b1X1 + ⋯ + bnXn ➞ −∞
ln(p/(1 − p)) is undefined for p = 1 or p = 0   (ln(1/0) and ln(0/1) ➞ undefined)

∴ coefficients are estimated using maximum likelihood rather than least squares
  (estimates the coefficients that make it most likely the sample data would have occurred)
slope coefficient ➞ change in the log odds that the event happens per unit change in the IV
odds ratio ➞ e^bi = ratio of the odds that the event happens with a unit increase in Xi
             to the odds that the event happens without an increase in Xi

Page 12/
Log-odds is linear with changes in IVs
- odds ratios are exponential with ∆IVs
- probabilities are non-linear - pos. 𝐛𝐢 increase 𝐩
- neg. 𝐛𝐢 decrease 𝐩
Goodness of Fit/ Likelihood ratio (LR) test   (sort of like the F-test for model significance)

LR = −2·(log likelihood of the restricted model − log likelihood of the unrestricted model)
     (restricted = intercept-only "null" model; LR ~ χ² with df = q (= k))

H0: b1 = b2 = ... = bk = 0
Ha: at least one coefficient ≠ 0
- also, no R² ➞ a pseudo-R² is reported
  - can only be used to compare different specifications of the same model

* Formulate and interpret a logistic regression model

22
Last Revised: 07/25/2023

Time-Series Analysis

a. calculate and evaluate the predicted trend value for a time series, modeled as either a
linear trend or a log-linear trend, given the estimated trend coefficients

b. describe factors that determine whether a linear or a log-linear trend should be used with
a particular time series and evaluate limitations of trend models

c. explain the requirement for a time series to be covariance stationary and describe the
significance of a series that is not stationary

d. describe the structure of an autoregressive (AR) model of order p and calculate one- and
two-period-ahead forecasts given the estimated coefficients

e. explain how autocorrelations of the residuals can be used to test whether the
autoregressive model fits the time series

f. explain mean reversion and calculate a mean-reverting level

g. contrast in-sample and out-of-sample forecasts and compare the forecasting accuracy of
different time-series models based on the root mean squared error criterion

h. explain the instability of coefficients of time-series models

i. describe characteristics of random walk processes and contrast them to covariance


stationary processes

j. describe implications of unit roots for time-series analysis, explain when unit roots are
likely to occur and how to test for them, and demonstrate how a time series with a unit
root can be transformed so it can be analyzed with an AR model

k. describe the steps of the unit root test for nonstationarity and explain the relation of the
test to autoregressive time-series models

l. explain how to test and correct for seasonality in a time-series model and calculate and
interpret a forecasted value using an AR model with a seasonal lag
m. explain autoregressive conditional heteroskedasticity (ARCH) and describe how ARCH
models can be applied to predict the variance of a time series

n. explain how time-series variables should be analyzed for nonstationarity and/or


cointegration before use in a linear regression

o. determine an appropriate time-series model to analyze a given investment problem and


justify that choice

23
Last Revised: 07/25/2023

Section 2: Challenges of Working with TS


Page 1
(figure: a causal relationship, the DV plotted against the IV X, alongside a time series plotted
 against t = 0, 1, 2, … T, with x̄ → μ and σ² = Σ(xi − x̄)²/(n − 1); panel labels T C S R)
causal relationship: the DV & IV are distinct; SC will not affect the consistency of b0 & b1
time series: a set of observations on a variable's outcomes in different time periods

Section 3: Trend Models


Page 2
➀ Linear/ the dependent variable changes at
a constant rate with time
often trend coefficient
inappropriate 𝐲𝐭 = 𝐛𝟎 + 𝐛𝟏 𝐭 + 𝛆𝐭 𝐭 = 𝟏, 𝟐, 𝟑 … 𝐓
for economic
Note: time is the indep. variable
data
trends typically have changing slopes/intercepts over time
➁ Log-Linear/ the dependent variable changes at a constant growth rate
  yt = e^(b0 + b1·t + εt)        - if the linear model has SC, use the log-linear
  - take the natural log of both sides:  ln yt = b0 + b1·t + εt

24
Last Revised: 07/25/2023

Page 3
⇒ εt must be uncorrelated across time periods
  - Durbin-Watson test on the residuals:   H0: DW = 2    Ha: DW ≠ 2
  - DW table gives dL & dU;  k = # of slope parameters, df = # of obs.
    e.g.: n = 77, α = .05, k = 1 (b̂0 & b̂1)  ➞  dL = 1.60, dU = 1.65
= Trend models will typically have correlated errors ∴ need a better model

Section 4: Autoregressive (AR) TS Models


Page 4
- a time series regressed on its own past values (lagged variables)
  xt = b0 + b1·xt−1 + εt                                    AR(1)  first-order AR model
  xt = b0 + b1·xt−1 + b2·xt−2 + ... + bp·xt−p + εt          AR(p)  pth-order AR model
⇒ Covariance Stationary Series/
  a) mean E(yt) = μ
  b) variance                          - constant & finite in all periods
  c) Cov(yt, yt−s)
  - if not, results will be invalid (the estimate of b1 will be biased)
⇒ Serial Correlation/ in (AR) models
  - cannot use the DW statistic
  - test H0: ρε,k = 0,  where ρε,k = E(εt·εt−k)/σ²ε

25
Last Revised: 07/25/2023

Page 5
⇒ Serial Correlation/ in (AR) Models
  Recall ρ_AB = Cov(A,B)/(σA·σB), so ρε,k = Cov(εt, εt−k)/(σεt·σεt−k) = Cov(εt, εt−k)/σ²ε
  test/ H0: E(ρε,k) = 0   (autocorrelation of the error term)
  test-statistic: t = ρ̂ε,k / (1/√N)

Steps/ 1. Estimate an AR model (e.g. AR(1))
       2. Compute the autocorrelations ρ̂ε,k of the residuals
       3. Test H0: ρε,k = 0 vs. Ha: ρε,k ≠ 0 ➞ rejecting H0 implies a misspecified
          model (e.g. re-estimate as an AR(2))

Page 6
⇒ Mean Reversion/ the series falls when it is above its mean and rises when it is below its mean
  - at the mean-reverting level:  xt = xt+1 = b0 + b1·xt   ➞   xt = b0/(1 − b1)
    xt = b0/(1 − b1): the time series will stay the same
    xt < b0/(1 − b1): it will increase
    xt > b0/(1 − b1): it will decrease
  * all cov-st. TS have a finite mean-reverting level

⇒ Chain Rule of Forecasting/
  one-period-ahead forecast:  x̂t+1 = b̂0 + b̂1·xt
  two-period-ahead forecast:  x̂t+2 = b̂0 + b̂1·xt+1,  but xt+1 is unknown at t
  - substitute x̂t+1 for xt+1:  x̂t+2 = b̂0 + b̂1·x̂t+1
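Python sketch of the mean-reverting level and the chain rule above (the AR(1) estimates b0 = 1.2, b1 = 0.6 and xt = 2.0 are hypothetical):

    b0, b1 = 1.2, 0.6
    x_t = 2.0

    mrl = b0 / (1 - b1)          # 3.0: xt is below it, so the series is expected to rise
    x_t1 = b0 + b1 * x_t         # one-period-ahead forecast
    x_t2 = b0 + b1 * x_t1        # two-period-ahead: substitute the forecast for the unknown x_{t+1}
    print(mrl, x_t1, x_t2)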

26
Last Revised: 07/25/2023

Page 7
⇒ Comparing Model Forecast Performance/
𝛔𝟐𝛆 of model 1 vs. 𝛔𝟐𝛆 of model 2
- the model with the smaller 𝛔𝟐𝛆 is more accurate
➞ in-sample forecast errors - predicted vs. observed values
used to generate the model
➞ out-of-sample forecast errors – forecasts vs. ‘outside the model
values’
Root mean squared error (RMSE) – used to compare out-of-sample forecasting performance
  RMSE = √[ Σ(actual − forecast)²/n ]   (the square root of the average squared error)
  the smaller the RMSE, the more accurate the model

Page 8
⇒ Instability of Regression Coefficients/
Choice of - regression coefficients can vary based on
AR(p) model a) different sample periods
may also (e.g. 1965-1975 vs. 1985-1995)
change b) different time periods
depending on (e.g. 5 yrs. vs. 8 yrs.)
time period choice

experience and judgment help decide on


the correct time period, and thus the
correct model


27
Last Revised: 07/25/2023

Random Walks & Unit Roots


Page 9
⇒ RW = AR(1) with b0 = 0 & b1 = 1   (unit root)
  xt = xt−1 + εt   - the value of the series in one time period is the value of the series
                     in the previous time period plus a random (unpredictable) error
  E(εt) = 0,  E(ε²t) = σ² (constant variance),  Cov(εt, εs) = 0 (uncorrelated across time)
  Random walks have no finite mean-reverting level (b0/(1 − b1) = 0/0) and no finite variance:
    xt = xt−1 + εt = xt−2 + εt−1 + εt = xt−3 + εt−2 + εt−1 + εt = ...
       = x0 + Σ εt−i   ➞  Var(xt) = (t − 1)·σ²  grows with t
  ∴ not covariance stationary ➞ cannot estimate an AR(1) on a TS that is a RW

Page 10
- first differencing/  Δxt = xt − xt−1  transforms a non-cov-st. TS (even a RW) into a
  cov-st. TS (a new series)
  note: Δxt = xt − xt−1 = εt, and since E(εt) = 0, E(Δxt) = 0
  ⇒ the differenced series is an AR(1) with b0 = 0, b1 = 0
     MRL = b0/(1 − b1) = 0/(1 − 0) = 0   (constant & finite)
  Var(Δxt) = Var(εt) ∀t  ➞  can now use regression
  - no forecasting power however, since b0 & b1 = 0
  - we can only conclude that we have a random walk

28
Last Revised: 07/25/2023

Page 11
⇒ RW w/ drift/  an AR(1) with b0 ≠ 0, b1 = 1
  - after first differencing:  Δxt = b0 + εt   (drift + random error) ⇒ an AR(1) with b0 ≠ 0, b1 = 0
⇒ Unit Root Test of Non-Stationarity/  Dickey-Fuller test for a unit root
  - if a series is cov-st., then |b1| < 1 in the AR(1)
  - if b1 = 1, the TS has a unit root - it is a RW, is not cov-st., and b1 cannot be tested
    directly from the AR(1)
  DF/ xt = b0 + b1·xt−1 + εt   - subtract xt−1 from both sides:
      xt − xt−1 = b0 + (b1 − 1)·xt−1 + εt        let g = (b1 − 1)
      Δxt = b0 + g·xt−1 + εt    (1st difference regressed on the 1st lag)
      H0: g = 0    Ha: g < 0    - critical values are larger than the conventional t
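Python sketch: statsmodels' (augmented) Dickey-Fuller test implements H0: g = 0 (unit root) vs. Ha: g < 0; the simulated random walk is illustrative only:

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(2)
    random_walk = np.cumsum(rng.normal(size=500))     # series with a unit root
    adf_stat, p_value, *_ = adfuller(random_walk)
    print(adf_stat, p_value)    # fail to reject H0 ➞ first-difference the series before modeling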

Section 7: Seasonality
Page 12
⇒ use a seasonal lag
  e.g. xt = b0 + b1·xt−1 + εt  (AR(1)) with the seasonal autocorrelation of the error term ≠ 0
  [ xt = b0 + b1·xt−1 + b2·xt−4 + εt ]    4th autocorrelation for quarterly data
  [ xt = b0 + b1·xt−1 + b2·xt−12 + εt ]   12th autocorrelation for monthly data

Section 8: ARCH Models

- autoregressive conditional heteroskedasticity
- estimate an AR(1)
- residuals (usually given/saved in SPSS)
- create a new variable for (residuals)²
- run:  ε̂²t = a0 + a1·ε̂²t−1 + μ
  test H0: a1 = 0 vs. Ha: a1 ≠ 0; rejecting H0 implies ARCH(1)
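Python sketch of the ARCH(1) test above: square the AR(1) residuals and regress them on their own first lag (the simulated residuals are only to make it runnable):

    import numpy as np
    import statsmodels.api as sm

    def arch1_test(resid):
        e2 = resid ** 2
        aux = sm.OLS(e2[1:], sm.add_constant(e2[:-1])).fit()
        return aux.params[1], aux.pvalues[1]      # a1-hat and its p-value (H0: a1 = 0)

    rng = np.random.default_rng(6)
    resid = rng.normal(size=200)
    print(arch1_test(resid))      # fail to reject H0 ➞ no ARCH effects in this simulated series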

29
Last Revised: 07/25/2023

Regression with more than 1 TS


Page 13
- if any TS in a linear regression has a unit root,
OLS t-stats may be invalid
- test each TS using DF test
A. both reject H0: 𝐛𝟏 = 1 - no unit roots
B. reject H0: 𝐛𝟏 = 1 for IV but not for DV non-const. mean
- error term would not be Cov-st. non-const. var.
- reg. coeff. + SE inconsistent correlated errors
C. reject H0: 𝐛𝟏 = 1 for DV but not for IV
- same results as B
D. both series have a unit root
- are they cointegrated? (i.e. share a common trend)
- a long-term economic/financial relationship exists
between them such that they do not diverge from each other
without bound in the long-term

Page 14
E. both have a unit root ⇒ no cointegration
- same results as B & C earlier
- only 2 options offer valid regression results/
➀ no unit roots in either series - use DF on each
➁ both have unit roots but are cointegrated
- will estimate long-term relationship
- may not be the best model of the short term
relation between the two
test for cointegration
  1) estimate yt = b0 + b1·xt + εt
  2) use the (Engle-Granger) DF test on the residuals:  (ε̂t − ε̂t−1) = b0 + g·ε̂t−1 + μ,  H0: g = 0
     - note: must use different (Engle-Granger) critical t-values than the standard DF test
       (Δxt = b0 + g·xt−1 + ε)
  3) failing to reject H0: no cointegration
  4) reject H0 ⇒ regression output ok!
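Python sketch: statsmodels' coint() runs this Engle-Granger two-step test (levels regression, then a DF-type test on the residuals with adjusted critical values); the two simulated series are illustrative only:

    import numpy as np
    from statsmodels.tsa.stattools import coint

    rng = np.random.default_rng(3)
    x_series = np.cumsum(rng.normal(size=500))                # unit-root series
    y_series = 2.0 + 0.8 * x_series + rng.normal(size=500)    # shares its stochastic trend
    t_stat, p_value, crit = coint(y_series, x_series)
    print(p_value)      # small p-value ➞ reject H0 of no cointegration ➞ levels regression usable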

30
Last Revised: 07/25/2023

Page 15
- more than one IV
A. use DF on all ➞ no unit roots
B. at least one has a unit root, one does not
- cannot use OLS
C. all have unit roots
- test for cointegration 𝐲𝐭 = 𝐛𝟎 + 𝐛𝟏 𝐱 𝐭 + 𝐛𝟐 𝐙𝐭 + 𝛆𝐭
- (𝛆𝐭 – 𝛆𝐭6𝟏 ) = 𝐛𝟎 + g 𝛆𝐭6𝟏 + μ
[H0: g = 0] - fail to reject H0
reject = no cointegration
- OLS ok but
requires much - cannot use OLS
more work!

Road Map
Page 16
(flowchart: causal relationship vs. time series ➞ fit a linear or log-linear trend ➞ test DW = 2;
 no SC ➞ ok (or GLS); SC ➞ fit an AR(1) (must be cov-st.) ➞ test residual autocorrelations;
 SC at lags ➞ add a lag (AR(2)); SC only at a seasonal lag ➞ add a seasonality lag;
 no SC ➞ ok ➞ test ARCH(1):  ε̂²t = a0 + a1·ε̂²t−1 + μ,  H0: a1 = 0; fail to reject ➞ ok)

31
Last Revised: 07/25/2023

Machine Learning

a. describe supervised machine learning, unsupervised machine learning, and deep


learning

b. describe overfitting and identify methods of addressing it

c. describe supervised machine learning algorithms–including penalized regression,


support vector machine, k-nearest neighbor, classification and regression tree,
ensemble learning, and random forest–and determine the problems for which they
are best suited

d. describe unsupervised machine learning algorithms–including principal


components analysis, k-means clustering, and hierarchical clustering–and
determine the problems for which they are best suited

e. describe neural networks, deep learning nets, and reinforcement learning


32
Last Revised: 07/25/2023

Machine Learning
Page 1
➞ Statistical approaches (LR, TS) – some underlying structure LOS a
to the data is assumed (some probability dist.) - distinguish
- models are built on that structure
➞ Machine Learning – no underlying structure is assumed
- the algorithm attempts to discover the underlying structure
- 3 distinct classes of techniques
1/ Supervised Learning: involves training an algorithm to take a
set of inputs (X) and find a model that best relates
them to output (Y) features target
- training data set is ‘labeled’ ⇒ Xs are matched with Y
(regression is a good example)
- 2 categories of problems: ➞ regression problems (continuous
➞ classification problems target)
(categorical or ordinal target)

Page 2
⇒ Process Overview (LOS a - distinguish)
  ➀ Labeled Data (Y, Xk, N) ➞ ➁ Supervised ML Algorithm ➞ ➂ Prediction Rule ➞
  ➃ inputs (Xk, N) from Test Data ➞ Prediction vs. actual Y ➞ ➄ evaluation of fit
2/ Unsupervised Learning/
- does not make use of labeled training data
- ML program has to discover structure within the data
2 types of problems:

➀ Dimension reduction – reduce the number of features


while retaining variation across observations
➁ Clustering – sorting observations into groups

33
Last Revised: 07/25/2023

Page 3
3/ Deep Learning/Reinforcement Learning LOS a
- based on neural networks (more later) - distinguish


Page 4
LOS a - distinguish
unsupervised ➞ dimension reduction, clustering
supervised ➞ regression, classification


34
Last Revised: 07/25/2023

Page 5
➞ ML models can produce overly complex models that may fit the training data too well   (LOS b - describe)
  - known as overfitting ➞ will not generalize well to new data
  - evaluation of ML algorithms thus focuses on prediction error
  - data sets are partitioned into 3 non-overlapping samples (supervised learning):
    Training Sample (in-sample), Validation Sample, Test Sample (out-of-sample)

Prediction model – must be able to generalize


too simple too complex
- underfit – overfit
⇒ Errors and overfitting/
poor
- low/no in-sample errors + large out-of-sample errors =
prediction
- bias error
- variance error
- base error

Page 6
LOS b
⇒ Errors and overfitting/
- describe
Bias – the degree to which the
model fits the training data
Variance error – how much the
model’s results change in
response to new data from
validation & text samples

- goal is to minimize both bias and variance errors



- as complexity in the training set ↑, bias error ↓, variance error ↑


(𝐄𝐢𝐧 ) (𝐄𝐨𝐮𝐭 )
- linear models are more susceptible to bias error (underfitting)
- non-linear models more prone to variance error (overfitting)

35
Last Revised: 07/25/2023

Page 7
⇒ Errors and overfitting/ LOS b
- describe
Preventing overfitting/
1/ prevent the algorithm from getting
too complex
- limit the number of features
- penalize algorithms that are
too complex or too flexible
2/ proper data-sampling by using
cross-validation hold-out samples
- validation and test samples
should be from the same
domain as the training data
k-fold cross validation
- data (less test data) shuffled
randomly, divided into k sub-samples
k-1 training, 1 validation ⇒ repeat k times (k ∼ 5 to 10)

Page 8
⇒ Supervised ML Algorithms/ LOS c
1/ Penalized Regression – dimension reduction - describe
- eliminates/minimizes overfitting - determine

- regression coefficients are chosen to minimize the sum


of the 𝛆𝟐 plus a ‘penalty term’ that increases in size with
the number of included features
Penalty term = λ·Σ|b̂k|  (summed over the k included coefficients) – least absolute shrinkage and
  selection operator (LASSO)
  - the greater the number of bk ≠ 0, the greater the penalty
∴ bk is included only if Σε² falls by more than the penalty term rises

  min  Σε²i + λ·Σ|b̂k|        where λ is called a hyperparameter (set by
                             the researcher)
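Python sketch with scikit-learn's Lasso (its alpha plays the role of λ above; the simulated data, in which only 2 of 10 features matter, are illustrative only):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 10))                          # 10 candidate features
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)    # only 2 actually matter

    model = Lasso(alpha=0.1).fit(X, y)
    print(model.coef_)     # most coefficients shrunk to exactly 0 ➞ dimension reduction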

36
Last Revised: 07/25/2023

Page 9
⇒ Supervised ML Algorithms/ LOS c
2/ Support Vector Machine – classification, regression - describe
& outlier detection - determine

- goal is to select the linear


classifier that optimally separates
the observations into 2 sets of
data points ➞ max. the probability
of making a correct prediction
by determining the boundary that
linear is furthest from all observations
discriminant boundary
classifiers
- circled observations
are called support
vectors

resilient to outliers
& correlated features
- trade-off between a wider margin and
lower total error penalty

Page 10
⇒ Supervised ML Algorithms/ LOS c
3/ k-nearest neighbour – classification (sometimes regression) - describe
- classify new observations by finding similarities - determine

in the existing data


Challenges/
1/ selection of features
2/ distance metric e.g./ profitability,
efficiency, Sales, Mkt. cap, # of analysts
- KNN sensitive to irrelevant &
correlated features
(works best with a small # of features)
- choose the classification with 3/ selection of k
the largest # of nearest - too small ➞ high error rate
➞ sensitive to local outliers
neighbours
- too large ➞ averages over too many
outcomes
➞ negates the idea of ‘nearest’

37
Last Revised: 07/25/2023

Page 11
⇒ Supervised ML Learning/ LOS c
4/ Classification & Regression Trees (CART) - typically - describe
predict a categorial predict a applied when the - determine
target continuous target target is binary
(tree diagram: Root Node ➞ Decision Nodes ➞ Terminal Nodes)
- at the root node and at each decision node, one of the features and a cutoff value are
  chosen by some minimization function (MSE, Gini) - minimize misclassification error
  within the group at that node; true/false (< / >) splits send observations to the next level
- the data partition becomes smaller and smaller at each lower level
- output (predicted label) at a terminal node = the category that is the majority at
  this node (i.e. the target)

- process ends when classification error does not diminish much more
- to avoid overfitting: can set max. depth of tree, min. population at each
node, etc.
or/ Pruning ➞ sections of the tree with low classifying power cut

Page 12
⇒ Supervised ML Learning/ LOS c
5/ Ensemble Learning & Random Forests - a collection of - describe
classification trees - determine
prediction of a group of models
2 main categories:
a) different types of algorithms combined with a voting classifier
b) a combination of the same algorithm, using different training
data on a bootstrap aggregating technique
Voting Classifiers/ majority vote of all models = classification of new
data point
Bootstrap Aggregating/ original training data set is used
to generate n new training data sets (can have small
occurrences of the target data multiple times ➞ increases
proportional representation)
- n new models created
➞ majority vote classifier for each new data point

38
Last Revised: 07/25/2023

Page 13
⇒ Supervised ML Learning/ LOS c
5/ Ensemble Learning & Random Forests - describe
Random Forest - determine

e.g./ - Create a bootstrap data set each tree


4 features ➞ select 2 at random ➞ build a tree will be
slightly
select another 2 at random ➞ build a tree
different
etc…
- for each new obs., let all trees undertake classification by
majority vote
- protects against overfitting
- reduces ratio of noise to signal because errors cancel out
across the collection of trees

Page 14
⇒ Unsupervised ML Algorithms/ LOS d
1/ Principal Components Analysis – dimension reduction - describe
- used to reduce highly correlated features of data - determine

into a few main uncorrelated composite variables


- eigenvectors ➞ uncorrelated composite variables that are linear
combinations of the original features
- eigenvalues ➞ proportion of total variance in the initial data
that is explained by each eigenvector
- select the eigenvectors that account for 85% - 95% of the variance
- PCA typically performed as part of exploratory data analysis, before
training another learning model
- goal is to find each PC such that projection errors are
minimized and spread is maximized

- Scree plots show the proportion of total variation in the data


explained by each PC
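Python sketch with scikit-learn's PCA (random data, purely illustrative): explained_variance_ratio_ is the scree-plot data, and enough components are kept to cover the 85%-95% rule above.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(5)
    X = rng.normal(size=(300, 8))
    pca = PCA().fit(X)
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    n_keep = int(np.searchsorted(cum_var, 0.90)) + 1        # components needed for ~90% of variance
    X_reduced = PCA(n_components=n_keep).fit_transform(X)   # uncorrelated composite variables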

39
Last Revised: 07/25/2023

Page 15
⇒ Unsupervised ML Algorithms/ LOS d
2/ Clustering – create subsets of the data such that - describe
- determine
observations within a cluster are deemed similar - maximum
cohesion
- observations within different clusters are dissimilar - maximum
separation
- must define ‘similar’ in order to measure distance
a) k-means clustering:
among
chosen by k the chosen
researcher centroids features
chosen at
random centroids cluster new centroids
- iteration continues
until no observation cluster new centroids cluster etc…
is reassigned to a
new cluster

Page 16
⇒ Unsupervised ML Algorithms/ LOS d
- describe
a) k-means clustering: works well on very large data sets
- determine
hundreds of millions of data
points
- useful for discovering patterns in high dimensional data
b) Hierarchical clustering:
- creates intermediate rounds
of clusters of increasing
(agglomerative clustering) or
decreasing (divisive clustering)
size
- agglomerative ➞ clusters based
on local patterns

- divisive ➞ begins with the


global structure of the
data

40
Last Revised: 07/25/2023

Page 17
⇒ Unsupervised ML Algorithms/ LOS d
➞ Dendrograms/ - describe
- determine

(dendrogram: vertical axis = measure of distance; e.g. cluster 1 = A + B, cluster 7 = clusters 1 + 2 + 3,
 cluster 9 = clusters 7 + 8 (where 8 = 4 + 5 + 6), down to single pairs such as H + I)

Page 18
1/ Neural Networks (classification, regression supervised LOS e
- nodes & links
learning) - describe
- input data is scaled to have values
between 0 & 1
- 4 input nodes, 5 hidden nodes, 1 output node

hyperparameters
forward propagation
neurons ➞ each node receives input from each preceding node
        ➞ weights each value received and adds them up (summation operator)
        ➞ an activation function then transforms the sum into an output between 0 and 1
- better able to cope with non-linear relationships
- requires large training data sets
- weights chosen to minimize a loss function

41
Last Revised: 07/25/2023

Page 19
2/ Deep Learning Nets ➞ neural networks with many LOS e
hidden layers (at least 3, typically > 20) - describe

(image, pattern, speech recognition)


- requires substantial time to train
3/ Reinforcement Learning (unsupervised)
- learns by testing new actions and reuse its
previous experience
- learning occurs through millions of trials & errors


42
Last Revised: 07/25/2023

Big Data Projects

a. identify and explain steps in a data analysis project

b. describe objectives, steps, and examples of preparing and wrangling data

c. describe objectives, methods, and examples of data exploration

d. describe objectives, steps, and techniques in model training

e. describe preparing, wrangling, and exploring text-based data for financial


forecasting

f. describe methods for extracting, selecting and engineering features from textual
data

g. evaluate the fit of a machine learning algorithm

LOSs will match between the video and the MM PDFs, but may be
in a different order than the CFAI readings


43
Last Revised: 07/25/2023

Big Data Projects


Page 1
➞ first/ - big data differs from traditional data sources LOS a
due to several characteristics - state
- explain
1/ Volume – refers to the quantity of data
2/ Variety – pertains to the array of available data sources
structured, unstructured, semi-structured ➞ text
within & outside the organization ➞ images
➞ video
3/ Velocity – speed at which data are created
➞ social media
(data-in-motion vs. data-at-rest)
➞ sensor-based
4/ Veracity – the credibility/reliability of different data sources
⇒ Steps in Data Analysis Projects/
a) traditional (w/ structured data)
Conceptualize Collect Data Data Model
the task Data Preparation/ Exploration Training
(what outputs?) Preprocessing

Page 2
⇒ Steps in Data Analysis Projects/ LOS a
- state
b) Textual big Data ➞ Text ML Model Building
- explain
Text
Data Text Text Classifier
Problem
Curation Preparation/ exploration output
Formulation
- from Preprocessing - what to
- inputs/outputs?
where? - cleansing & look for
preparing

e.g./ Ruling political party - sentiment

favourable/ news story spelling? feature


unfavourable comments?
aggregation selection/
score? social media? engineering

44
Last Revised: 07/25/2023

Page 3
⇒ Data Preparation/Preprocessing/ LOS b, e
- cleansing/organizing raw data into a useable format - describe

domain knowledge required


Data Preparation and Wrangling (the highlighted steps of the pipeline):
Data Collection/Curation ➞ Data Cleansing ➞ Data Preprocessing ➞ Data Exploration ➞ Model Training ➞ Results

➞ Structured Data ➞ Cleansing/ errors ➞ incomplete, invalid, inaccurate,


inconsistent, non-uniform, duplicates
incomplete ➞ missing entries
(delete or impute a value) identify & mitigate
invalid ➞ outside a meaningful range
inaccurate ➞ not a true value
inconsistent ➞ some data conflicts with other data (e.g. live in New York, Canada)
non-uniform ➞ data not in identical format (Jan. 15, 2019 vs. 15/01/19)
duplication ➞ multiple identical observations

Page 4
⇒ Data Preparation/Preprocessing/ LOS b, e
Structured Data ➞ Preprocessing/ transformations/scaling of data - describe

extraction ➞ new variable created from a current one for ease


of analyzing (e.g. DOB converted to age)
aggregation ➞ 2 or more variables aggregated into one
(e.g. Street address + City = GPS co-ordinates)
filtration ➞ eliminate data rows (records) not needed
(e.g. only L2 CFA candidates)
selection ➞ columns (record fields) that can be eliminated
(e.g. Mr./Mrs./Ms.)
conversion ➞ nominal, ordinal, integer,
ratio, categorical

➞ Outliers ➞ delete (trimming) ➞ 1% trimmed = top + bottom 1% deleted


➞ winsorization ➞ replace outliers with min. or max. values

45
Last Revised: 07/25/2023

Page 5
⇒ Data Preparation/Preprocessing/ LOS b, e
Structured Data ➞ Preprocessing/ - describe
➞ scaling ➞ adjusting the range of a feature
  1/ normalization:    (xi − x_min)/(x_max − x_min)   - rescales to the range 0 – 1 (sensitive to outliers)
  2/ standardization:  (xi − μ)/σ                     - centers and rescales (requires a normal distribution)
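Python sketch of the two scaling methods above (numpy only; the feature values are made up):

    import numpy as np

    x = np.array([2.0, 5.0, 9.0, 14.0, 30.0])
    normalized = (x - x.min()) / (x.max() - x.min())    # rescaled to [0, 1], outlier-sensitive
    standardized = (x - x.mean()) / x.std()             # centered with unit variance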
➞ Unstructured (Text) Data ➞ Cleansing
- must be transformed into structured data
remove html tags – if data is downloaded from the web
remove punctuation – most not required, some is (0.76 needs the period
e-mail needs the hyphen)
remove numbers – or substituted with annotation
identifying them as numbers
remove white spaces – tabs, leading spaces, spaces created by other
cleaning tasks

Page 6
⇒ Data Preparation/Preprocessing/ LOS b, e
➞ Unstructured (Text) Data ➞ Cleansing - describe
token = word; tokenization = splitting a text into tokens
unigram ➞ single-word token, bigram ➞ 2-word token
n-gram ➞ n-word token
- normalizing text data:
Lowercasing
Stop words (the, is, a) – do not carry meaning
– typically removed
stemming e.g. analyzed, analyzing ➞ analyz
stem
lemmatization ➞ analyz becomes analyze (more work than just

stemming)
reduce token count, decrease sparseness
(minimize ML more common
- end result is a BOW (bag of words)
algo. complexity)
- final step is a ‘document term’ matrix (DTM) – structured data

46
Last Revised: 07/25/2023

Page 7
⇒ Data exploration/ – requires domain knowledge LOS c, f
- describe

Structured Data ➞ exploratory data analysis


- histograms, bar charts, box plots, density plots
- correlation matrix, stacked bar charts, etc…
- basically examining relationships visually – get a feel for the data
Feature selection
- remove unneeded or redundant features
- statistical tests to determine relevancy
- dimension reduction (PCA)
Feature Engineering
- creating a new feature (transformation, aggregation)
- decomposing one into multiple features

Page 8
⇒ Data exploration/ LOS c, f
Unstructured Data: Text exploration - describe
➞ exploratory data analysis:
word counts, word clouds, co-occurrences
anything that helps gain early insight into existing patterns
➞ feature selection: selecting a sub-set of the tokens
- eliminate noisy features (features that do not contribute
to ML model training)
  - noisy features are typically the most frequent tokens (would lead to underfitting)
    and the most sparse tokens (would lead to overfitting)

methods/ 1/ frequency measures ➞ document frequency = (# of documents with the token) / (# of documents)
         2/ Chi-square test ➞ identify tokens that appear to be associated with a specific class(ification)

47
Last Revised: 07/25/2023

Page 9
⇒ Data exploration/ LOS c, f
methods 3/ mutual information (MI) – how much info. - describe
a token contributes to a class (0 - 1)
0 ➞ token’s distribution is the same in all classes
1 ➞ token only occurs in one class
feature engineering (maintain semantics while structuring)
1) Numbers ➞ differentiate among types of numbers (i.e. year, tax ID, etc.)
2) N-grams ➞ multi-word patterns kept intact
3) Name entity recognition (NER) – money, time, organization

- tag a class to a token class


e.g. CFAI = organization
libraries or
4) Parts of speech ➞ noun, verb, adjective, proper noun packages in.
class prog. languages
e.g. apple = noun market = noun
Apple = proper noun vs. market = verb

Page 10
⇒ Model Training/ LOS d
- describe

classification task fit


improve model
type of data predicted
performance
size of data vs.
actual
- most ML models use structured data classification – Y/N
➞ Method Selection/ covered in previous reading labeled/unlabeled
linear vs. non-linear

observations (length)
- size of data
features (width) - too wide – overfit
- too narrow – underfit

48
Last Revised: 07/25/2023

Page 11
⇒ Model Training/ LOS d
➞ Performance evaluation/ - describe
P = precision = TP/(TP + FP)      - useful when the cost of a Type I error is high
R = recall = TP/(TP + FN)         - a.k.a. sensitivity; useful when the cost of a Type II error is high
Accuracy = (TP + TN)/(TP + FP + TN + FN)
F1 = 2·P·R/(P + R)                - for unequal class distribution in the dataset
(error analysis is based on these confusion-matrix counts)
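Python sketch of the four metrics above from a hypothetical confusion matrix (the TP/FP/TN/FN counts are illustrative only):

    tp, fp, tn, fn = 40, 10, 35, 15

    precision = tp / (tp + fp)                        # Type I errors costly
    recall    = tp / (tp + fn)                        # sensitivity; Type II errors costly
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    f1        = 2 * precision * recall / (precision + recall)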

Page 12
⇒ Model Training/ LOS d
➞ Performance evaluation/ - describe
Receiver Operating Characteristic (ROC)
  False Positive Rate (FPR) = FP/(TN + FP)
  True Positive Rate (TPR) = TP/(TP + FN)
  - as we allow FPR to increase, TPR ↑ (trade-off)
Root Mean Squared Error (RMSE)
  RMSE = √[ Σ(Predicted − Actual)²/n ]

➞ Tuning ➞ goal is to minimize


(bias error + variance error)
- modify hyperparameters
- increase size of training data-set

49
Last Revised: 07/25/2023

REVIEW


50
Last Revised: 07/25/2023

Time Series
Review - 1
(road-map flowchart: causal relationship vs. time series ➞ linear or log-linear trend ➞ test DW = 2;
 no SC ➞ ok (or GLS); SC ➞ AR(1) (must be cov-st.) ➞ test residual autocorrelations;
 SC at lags ➞ AR(2); SC only at a seasonal lag ➞ add a seasonality lag;
 no SC ➞ ok ➞ test ARCH(1):  ε̂²t = a0 + a1·ε̂²t−1 + μ,  H0: a1 = 0; fail to reject ➞ ok)

Review - 2
⇒ Linear/ DV changes at a constant rate w/ time
𝐲𝐭 = 𝐛𝟎 + 𝐛𝟏 𝐭 + 𝛆 IV = 𝐭
if SC (DW ≠ 2)
then use - often inappropriate for economic data
⇒ Log-Linear/ DV changes at a constant growth rate w/ time

test SC 𝐲𝐭 = 𝐞𝐛𝟎 +𝐛𝟏 𝐭+𝛆𝐭 ⇒ 𝐥𝐧 𝐲𝐭 = 𝐛𝟎 + 𝐛𝟏 𝐭 + 𝛆𝐭


H0: DW = 2
if SC, use
⇒ AR(1)/ - a time series regressed on its own past values
(cannot use 𝐱 𝐭 = 𝐛𝟎 + 𝐛𝟏 𝐱 𝐭6𝟏 + 𝛆𝐭 AR(1)
DW anymore)
𝐱 𝐭 = 𝐛𝟎 + 𝐛𝟏 𝐱 𝐭6𝟏 + ... + 𝐛𝐩 𝐱 𝐭6𝐩 + 𝛆𝐭 AR(p)

Covariance Stationary/

a) 𝐄(𝐲𝐭 ) = 𝛍
b) 𝛔𝟐 constant & finite
c) Covariance in all periods

51
Last Revised: 07/25/2023

Review - 3
⇒ test for SC/ H0: ρ(εt, εt−k) = 0   (autocorrelation of the error term)
  test stat = ρ̂εk / (1/√N)
  (output lists each lag's autocorrelation ρ̂, its SE = 1/√n, and its t-stat)
  - reject H0 ⇒ misspecified model

⇒ Mean Reversion/ (in AR(p)) [requires |b1| < 1]
  - the series falls when it is above its mean, rises when it is below its mean
  xt = b0/(1 − b1): TS will stay the same
  xt < b0/(1 − b1): TS will increase
  xt > b0/(1 − b1): TS will decrease

⇒ Chain Rule of Forecasting/   (Note: all cov-st. TS have a finite mean-reverting level)
  x̂t+1 = b̂0 + b̂1·xt
  x̂t+2 = b̂0 + b̂1·x̂t+1
  etc...

Review - 4
⇒ Random Walks w/ Unit Roots/
  - have no finite mean-reverting level (b0/(1 − b1) = 0/0) and no finite variance
  - not covariance stationary – cannot estimate an AR(1) on a random walk
  first differencing:  Δxt = xt − xt−1
  since Δxt = xt − xt−1 = εt, the differenced series is an AR(1) with b0 = 0, b1 = 0, MRL = 0
  - no forecasting power (we can only conclude we have a RW)

  - RW w/ drift ⇒ AR(1) with b0 ≠ 0, b1 = 1; after differencing, Δxt = b0 (drift) + εt

⇒ Unit Root Test of Non-Stationarity/
  - Dickey-Fuller test: if b1 = 1, there is a unit root (the series is a RW)
    Δxt = b0 + g·xt−1 + εt,   g = b1 − 1,   H0: g = 0

52
Last Revised: 07/25/2023

Review - 5
⇒ Seasonality/ - use a seasonal lag
- autocorrelation 𝐱 𝐭 = 𝐛𝟎 + 𝐛𝟏 𝐱 𝐭6𝟏 + 𝐛𝟐 𝐱 𝐭6𝟒 (quarterly)
of the error term or 𝐛𝟐 𝐱 𝐭6𝟏𝟐 (monthly)

⇒ ARCH models/ autoregressive conditional heteroskedasticity
   ε̂²t = a0 + a1·ε̂²t−1 + μ     test: H0: a1 = 0, Ha: a1 ≠ 0
   rejecting H0 implies ARCH(1) errors
⇒ Regression with more than one TS/
- test each TS w/ DF test (unit root)
a) both reject H0: 𝐛𝟏 = 1 - ok
reg. coef. + SEs inconsistent
b) reject H0 for IV but not for DV
(non-constant mean & variance)
c) reject H0 for DV but not for IV + correlated errors
d) Do not reject H0 for both – are they cointegrated?

Review - 6
⇒ Regression with more than one TS/
  d) do not reject H0 for both (both series have a unit root) – are they cointegrated?
     cointegrated = they do not diverge from each other without bound in the long run
     - not cointegrated – coefficients & SEs inconsistent
     - test for cointegration ⇒ Engle-Granger DF test on the regression residuals


53
Last Revised: 07/25/2023

Machine Learning
Review - 1
LOS a/ 1/ Supervised Learning: training data is labelled
- features + target
- regression for continuous targets
- classification for categorical targets
2/ Unsupervised Learning: training data is not labelled
- ML program attempts to discover
structure within the data
- dimension reduction – reduce the number of features
- clustering - sort observations into groups
LOS b/ overfitting ➞ models fit the training data too well (produce low
➞ will not generalize well (high variance error) bias error)

underfitting ➞ high bias + high variance error

Review - 2
LOS b/ Bias error ➞ the degree to which the model fits
the training data
Variance error ➞ how much the model’s results change in
response to new data from validation & test samples
- as bias error ↓, variance error ↑
linear models more prone to bias error
non-linear models more prone to variance error
Preventing overfitting:
1/ prevent the algorithm from getting too complex
2/ proper data sampling by using cross validation (i.e. k-fold
cross-validation)
LOS c/ 1/ Penalized Regression: minimizes overfitting
  LASSO – least absolute shrinkage & selection operator
  min Σε²i + λ·Σ|bi|       λ = hyperparameter; λ·Σ|bi| = penalty term

54
Last Revised: 07/25/2023

Review - 3
LOS c/ 2/ SVM – Support Vector Machine
- select linear classifier that optimally
separates obs. (furthest from all obs.)
- resilient to outliers & correlated features

discriminant boundary
support vectors

3/ k-nearest neighbours (classification) - classify new obs. by


max. vote of k-nearest
obs.
- challenge: definition of
nearness
- sensitive to irrelevant or
correlated features

Review - 4
LOS c/ 4/ Classification and Regression Trees
Root Node ➞ one of the features + a cutoff value
is used (chosen by some
Decision Decision ➞ same
minimization function)
Node Node process

Terminal Decision Terminal Decision


Node Node Node Node

Terminal ➞ output: decision is category


- process ends when classification Node that is the majority at
error does not diminish
this node

5/ Ensemble Learning
1/ different types of algorithms used, majority vote of all
models
2/ same algorithm using different training data
- majority vote

55
Last Revised: 07/25/2023

Review - 5
LOS c/ 5/ Ensemble Learning
Random Forest ➞ a collection of CARTs
LOS d/ 1/ PCA - Principal Components Analysis – dimension reduction
- used to reduce highly correlated features into a few
main composite variables ➞ uncorrelated

eigenvectors ➞ select # of these that account for


85% - 95% of the variance
2/ K–means clustering
- create subsets of the
data such that obs.
within a cluster are
similar
- classify a new obs. by its
distance from the center

Review - 6
LOS d/ 3/ Hierarchical clustering
- create intermediate rounds
of clusters of increasing
or decreasing size

agglomerative – bottom-up
divisive – top-down


56
Last Revised: 07/25/2023

Review - 7
LOS e/ 1/ Neural networks
- input data scaled from 0 - 1
➞ 4 input nodes
5 hidden nodes hyperparameters
1 output node
- hidden node ➞ summation operator
➞ activation function
- best able to handle non-linear
relationships
2/ Deep Learning Nets – neural networks with many
hidden layers (> 20 typically) – image, text,
speech
3/ Reinforcement Learning – learning occurs through trial and error
(millions of iterations)


57

You might also like