2024 L2 QuantMethods
Model Misspecification 11
Time-Series Analysis 23
Machine Learning 32
Review 50
This document should be used in conjunction with the corresponding learning modules in the 2024 Level 2 CFA® Program
curriculum. Some of the graphs, charts, tables, examples, and figures are copyright 2023, CFA Institute. Reproduced and
republished with permission from CFA Institute. All rights reserved.
Required disclaimer: CFA Institute does not endorse, promote, or warrant accuracy or quality of the products or services
offered by MarkMeldrum.com. CFA Institute, CFA®, and Chartered Financial Analyst® are trademarks owned by CFA
Institute.
b. formulate a multiple linear regression model, describe the relation between the
dependent variable and several independent variables, and interpret estimated
regression coefficients
- model:
  Yi = b0 + b1 X1i + b2 X2i + ⋯ + bk Xki + εi ,  i = 1 ➞ n ,  n > k
  deterministic part: b0 (intercept) + the k IVs with their slope coefficients
  stochastic part: εi
- b1 … bk ➞ partial slope coefficients ;  b̂ ➞ estimated coefficient
* Describe the types of investment problems addressed by multiple linear regression and the regression process
Page 2/
partial slope coefficient: measures ∆DV for a 1 unit ∆IV holding
all other IVs constant
e.g./ RET = .0023 - 5.0585 BY - 2.1901 CS
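A minimal sketch of estimating such a model with ordinary least squares, using simulated (hypothetical) data loosely patterned on the RET/BY/CS example above:
```python
import numpy as np

# hypothetical monthly data for RET = b0 + b1*BY + b2*CS
rng = np.random.default_rng(0)
BY = rng.normal(0.0, 0.01, 60)                      # bond-yield factor
CS = rng.normal(0.0, 0.005, 60)                     # credit-spread factor
RET = 0.0023 - 5.0585 * BY - 2.1901 * CS + rng.normal(0.0, 0.01, 60)

X = np.column_stack([np.ones_like(BY), BY, CS])     # intercept column + k IVs
b_hat, *_ = np.linalg.lstsq(X, RET, rcond=None)     # OLS estimates of b0, b1, b2
print(b_hat)   # each slope is a partial effect, holding the other IV constant
```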
* Formulate a multiple linear regression model, describe the relation between the dependent variable and several independent variables, and interpret estimated regression coefficients
Page 3/
Assumptions/
1/ Linearity - the relationship between the DV and the IVs is linear
2/ Homoskedasticity - the variance of the errors is constant for all observations
3/ Independence of errors - the observations are independent of one another
   ∴ regression residuals are uncorrelated across observations
4/ Normality - regression residuals are normally distributed
5/ Independence of IVs
   1/ IVs are not random (i.e. - they have a specific value)
   2/ no exact linear relationship between 2 or more IVs
Scatterplot matrix (pairs plot)
- uses simple linear regression: DV vs. each IV + each IV vs. the other IVs
- want to see: linear relationships between the DV and each IV
- don't want to see: linear relationships between the IVs themselves
* Explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions
Page 4/
Scatterplot Matrix
[Figure: pairs plot of the DV against the IVs - panels show a pos. linear, a neg. linear, a slight pos., and almost no 'apparent' linear relationship]
- since we can, and will, interpret the regression output, this is not a very useful step
- any violations will be identified statistically, not visually
- e.g. bSMB shows almost no 'apparent' linear relationship, yet is significant in the output
* Explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions
Page 5/
- residual plots help identify outliers
* Explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions
Page 6/
standardized residuals vs. normal distribution
- standardized residual = (εi − ε̄)/σε , plotted against the Z score
- outliers ➞ values beyond ±1.65 (the 5% tails) ; outliers affect parameter values
- a directional relationship, corr(IV, ε) ≠ 0 , = misspecified model
Q-Q plot
- normally distributed ε should fall on the diagonal line ; outliers show up at the tails
* Explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions
a. evaluate how well a multiple regression model explains the dependent variable by
analyzing ANOVA table results and measures of goodness of fit
c. calculate and interpret a predicted value for the dependent variable, given the
estimated regression model and assumed values for the independent variable
Adjusted R² ➞ R̄² = 1 − (SSE/(n − k − 1)) / (SST/(n − 1)) = 1 − [(n − 1)/(n − k − 1)] (1 − R²)
* Evaluate how well a multiple regression model explains the dependent variable by analyzing ANOVA table results and measures of goodness of fit
Page 2/
Adjusted R² ➞ R̄² = 1 − (SSE/(n − k − 1)) / (SST/(n − 1)) = 1 − [(n − 1)/(n − k − 1)] (1 − R²)
FYI:  (SSE/(n − k − 1)) / (SST/(n − 1)) = SSE(n − 1) / [SST(n − k − 1)] = [(n − 1)/(n − k − 1)] (SSE/SST)
      and since R² + SSE/SST = 1 , ∴ SSE/SST = 1 − R²
- for R̄² : if k ≥ 1 , R̄² < R²
- if a coefficient's |t-stat| > 1 , adding it ➞ R̄² ↑ , else R̄² ↓
* Evaluate how well a multiple regression model explains the dependent variable by analyzing ANOVA table results and measures of goodness of fit
Page 3/
application/
R² = SSR/SST = 90.6234/147.2416 = .6155
R̄² = 1 − [(50 − 1)/(50 − 5 − 1)] (1 − .6155) = .5718
R̄² ↑ with Factors 1, 3, 4
R̄² ↓ with Factors 2 & 5 (also insignificant)
F1 + F2: R̄² ↓ , add F3: R̄² ↑ , add F4: R̄² ↑ , add F5: R̄² ↓
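The same calculation as a quick sketch (values taken from the example above):
```python
SSR, SST = 90.6234, 147.2416
n, k = 50, 5

R2 = SSR / SST                                   # 0.6155
adj_R2 = 1 - (n - 1) / (n - k - 1) * (1 - R2)    # 0.5718
print(round(R2, 4), round(adj_R2, 4))
```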
* Evaluate how well a multiple regression model explains the dependent variable by analyzing ANOVA table results and measures of goodness of fit
Page 4/
R̄² ➞ no intuitive explanation, re: %'age of variance explained
   ➞ no information on coefficient significance or potential coefficient bias
   ➞ not a 'goodness of fit' measure
* Evaluate how well a multiple regression model explains the dependent variable by analyzing ANOVA table results and measures of goodness of fit
Page 5/
AIC better for prediction purposes
BIC better when ‘goodness of fit’ is preferred
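A small sketch of the two criteria, using the SSE-based forms AIC = n·ln(SSE/n) + 2(k+1) and BIC = n·ln(SSE/n) + ln(n)·(k+1); the inputs are hypothetical (SSE here is just SST − SSR from the example above):
```python
import math

SSE, n, k = 56.6182, 50, 5                              # hypothetical values
AIC = n * math.log(SSE / n) + 2 * (k + 1)               # lower is better; favored for prediction
BIC = n * math.log(SSE / n) + math.log(n) * (k + 1)     # heavier per-parameter penalty; favors goodness of fit
print(round(AIC, 2), round(BIC, 2))
```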
* Formulate hypotheses on the significance of two or more coefficients in a multiple regression model and interpret the results of the joint hypothesis tests
Page 6/
testing joint coefficients/
e.g./ 𝐘 = 𝐛𝟎 + 𝐛𝟏 𝐗 𝟏 + 𝐛𝟐 𝐗 𝟐 + 𝐛𝟑 𝐗 𝟑 + 𝐛𝟒 𝐗 𝟒 + 𝐛𝟓 𝐗 𝟓 - unrestricted
model
H0 : 𝐛𝟒 = 𝐛𝟓 = 0 Ha : at least one coefficient ≠ 0
𝐘 = 𝐛𝟎 + 𝐛𝟏 𝐗 𝟏 + 𝐛𝟐 𝐗 𝟐 + 𝐛𝟑 𝐗 𝟑 - restricted model (nested model)
F = [(SSE_restricted − SSE_unrestricted)/q] / [SSE_unrestricted/(n − k − 1)] ,  q = # of restrictions (here 2)
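A sketch of the joint test with hypothetical SSE values (q is the number of restrictions being tested):
```python
from scipy.stats import f

# hypothetical SSEs from the fitted restricted and unrestricted models
SSE_restricted, SSE_unrestricted = 3.85, 3.25
n, k, q = 65, 5, 2          # k = slopes in the unrestricted model, q = restrictions (b4 = b5 = 0)

F = ((SSE_restricted - SSE_unrestricted) / q) / (SSE_unrestricted / (n - k - 1))
p_value = f.sf(F, q, n - k - 1)     # reject H0 if p_value < significance level
print(round(F, 3), round(p_value, 4))
```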
* Formulate hypotheses on the significance of two or more coefficients in a multiple regression model and interpret the results of the joint hypothesis tests
Page 7/
'
prediction ➞ substitute values for the IVs, solve 𝐘
once a model is estimated, even if a coefficient is
insignificant, it must be used in prediction
- forecasts have a standard error, not a standard deviation
measure of measure of dispersion
precision
* Calculate and interpret a predicted value for the dependent variable, given the estimated regression model and assumed values for the independent variable
Model Misspecification
Page 1/
- principles for good regression model specification:
should be grounded in economic reasoning
model should be parsimonious
model should perform well out-of-sample
model functional form should be appropriate
model should satisfy regression assumptions
* Describe how model misspecification affects the results of a regression analysis and how to avoid common forms of misspecification
Page 2/
1/ Omission of important variable(s) ➞ biased and inconsistent coefficient estimates
2/ Inappropriate form of variables
   - failing to account for non-linearity ➞ can transform a variable to make it linear (possible CH)
3/ Inappropriate scaling - possible CH/MC
4/ Inappropriate pooling of data - data from different populations
   - regime changes in time series - possible CH, SC
Page 3/
Consequences/
unconditional heteroskedasticity: 𝐯𝐚𝐫(𝛆) not correlated with IVs
- no issues for inference
conditional heteroskedasticity (CH): 𝐕𝐚𝐫(𝛆) correlated with IVs
F-test unreliable since MSE is a biased estimator of the
true population variance
t-tests of coefficients unreliable since SE estimators will be
biased (underestimated, ∴ t-stats inflated)
- significance where none may exist (Type I error)
Test: BP (Breusch-Pagan) - regress the squared residuals on the IVs (Step 2)
H0: no CH ➞ a1 = a2 = 0
Ha: CH ➞ at least one ≠ 0
test-stat: χ²(k) = nR² , using the R² from Step 2
- if CH is present, a1 or a2 or both will be significant
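A sketch of the BP test along those lines (X is the IV matrix with an intercept column; resid are the residuals from the original regression):
```python
import numpy as np
from scipy.stats import chi2

def breusch_pagan(resid, X):
    """Step 2: regress squared residuals on the IVs; under H0 (no CH), n*R^2 ~ chi-square(k)."""
    u2 = resid ** 2
    beta, *_ = np.linalg.lstsq(X, u2, rcond=None)
    fitted = X @ beta
    r2 = 1 - ((u2 - fitted) ** 2).sum() / ((u2 - u2.mean()) ** 2).sum()
    n, k = X.shape[0], X.shape[1] - 1     # k = number of IVs (excluding the intercept)
    stat = n * r2                         # BP test statistic
    return stat, chi2.sf(stat, k)         # statistic and p-value
```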
Page 4/
Correcting/ compute robust SEs (included in most software packages)
(a.k.a. heteroskedasticity-consistent SEs or White-corrected
SEs)
Serial Correlation/ errors correlated across observations
(autocorrelation)
result: incorrect SEs
if IV is a lagged variable of the DV, then also invalid
coefficients
Positive SC - a positive (negative) residual is most likely followed by a positive (negative) residual
Negative SC - a negative (positive) residual is most likely followed by a positive (negative) residual
[Figure: residual plots illustrating positive SC and negative SC]
Page 5/
Serial Correlation/ first-order SC ➞ corr(εt , εt−1) > 0
pos. SC - affects tests:
  t-stat = (estimate − hypothesized value)/SE ➞ SE underestimated ∴ t-stat inflated
  F-stat = MSR/MSE ➞ MSE underestimated ∴ F-stat inflated
Page 6/
Correcting SC/ adjust the SEs ➞ Newey-West SEs
- these serial-correlation-consistent SEs will also correct for CH
VIF(X1) = 1 / (1 − R²_X1) ;  min. = 1 when R²_X1 = 0  (R²_X1 from regressing X1 on the other IVs)
> 5 becoming concerning
> 10 indicates MC
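A sketch of the VIF calculation (X is the IV matrix without an intercept column; column j is regressed on the other IVs):
```python
import numpy as np

def vif(X, j):
    """Variance inflation factor for column j of the IV matrix X."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1 / (1 - r2)        # > 5 becoming concerning, > 10 signals multicollinearity
```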
Page 7/
Solutions: exclude one or more IVs (preferred)
use a different proxy for one of the variables
increase sample size
- if the goal is prediction, then a significant model is more
important than a significant coefficient
∴ MC less of an issue
[Figure: scatterplot contrasting an outlier - extreme in the range of the DV - with a high-leverage point - extreme in the range of the IV]
Page 2/
leverage (hii): flag if hii > 3(k + 1)/n - typically for small samples
- large samples use 2(k + 1)/n (Belsley, Kuh, Welsch, 1980) ➞ flags roughly the most extreme 5%
Detecting outliers: externally studentized residuals
- delete each case i
- calculate a new regression on the remaining n − 1 cases
- calculate e*i = Yi − Ŷi for all n (using all cases of the IVs), where Ŷi is forecasted from the regression with case i deleted
- calculate the studentized residual:
  t*i = e*i / s_e* ,  with e*i = ei / (1 − hii)  (ei = residual from the initial regression)
  and s²_e* = MSE / (1 − hii) ,  MSE = SSE / (n − k − 1)
Page 3/
FYI:  s²_e* = MSE/(1 − hii) = [SSE/(n − k − 1)] × [1/(1 − hii)] = SSE/[(n − k − 1)(1 − hii)]  and  e*i = ei/(1 − hii)
t*i = [ei/(1 − hii)] / √( SSE/[(n − k − 1)(1 − hii)] )  ➞ square both sides ➞
t*i² = [ei²/(1 − hii)²] × [(n − k − 1)(1 − hii)/SSE] = ei²(n − k − 1) / [SSE(1 − hii)]
Page 4/
outliers/high leverage points not always influential
- observations are influential if their exclusion causes
substantial changes in the estimated regression function
- measure: Cook's Distance (Cook's D) - metric for identifying influential data points
  Di = [ei² / ((k + 1) MSE)] × [hii / (1 − hii)²] ,  ei = residual for obs. i
- large Di ➞ influencer ;  Di > .5 may be influential
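A sketch pulling the three influence diagnostics together, using one common formulation of the externally studentized residual (equivalent in spirit to the derivation above, not necessarily identical line for line):
```python
import numpy as np

def influence_measures(X, y):
    """Leverage, externally studentized residuals, and Cook's D. X includes the intercept column."""
    n, p = X.shape                                    # p = k + 1
    H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix
    h = np.diag(H)                                    # leverage h_ii
    e = y - H @ y                                     # residuals
    mse = (e ** 2).sum() / (n - p)
    s2_del = ((e ** 2).sum() - e ** 2 / (1 - h)) / (n - p - 1)   # error variance with case i deleted
    t_star = e / np.sqrt(s2_del * (1 - h))            # externally studentized residuals
    cooks_d = (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
    return h, t_star, cooks_d                         # flag large h, large |t_star|, or D_i > 0.5
```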
Page 5/
[Figure: plot of externally studentized residuals t*i against leverage, with threshold reference lines]
* Formulate and interpret a multiple regression model that includes qualitative independent variables
Page 6/
Intercept dummy/
  Yi = b0 + d0 D + b1 X1
  D = 0: Yi = b0 + b1 X1  (base category)
  D = 1: Yi = (b0 + d0) + b1 X1
  - equivalent to estimating 2 regressions
Slope dummy/
  Yi = b0 + b1 X1 + d1 D X1 + ε   (D X1 = interaction term)
  D = 0: Yi = b0 + b1 X1 , slope = b1  (base category)
  D = 1: Yi = b0 + (b1 + d1) X1 , slope = b1 + d1
Both:
  Yi = b0 + d0 D + b1 X1 + d1 D X1
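A sketch of estimating a regression with both an intercept dummy and a slope (interaction) dummy, on simulated hypothetical data:
```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(size=100)
D = rng.integers(0, 2, size=100)                    # 1 = category of interest, 0 = base category
y = 1.0 + 0.5 * D + 2.0 * X1 + 0.8 * D * X1 + rng.normal(scale=0.3, size=100)

# design matrix: intercept, intercept dummy (d0), X1, slope dummy / interaction term (d1)
A = np.column_stack([np.ones(100), D, X1, D * X1])
b0, d0, b1, d1 = np.linalg.lstsq(A, y, rcond=None)[0]
# base category line: b0 + b1*X1 ; other category: (b0 + d0) + (b1 + d1)*X1
```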
* Formulate and interpret a multiple regression model that includes qualitative independent variables
Page 7/
Testing - individual t-tests on the coefficients for the dummy vars.
* Formulate and interpret a multiple regression model that includes qualitative independent variables
Page 8/
Qualitative DV ➞ categorical (discrete, binary) - e.g. fraud/no fraud ; bankruptcy/no bankruptcy
- linear regression unsuitable (relation between the DV and the IVs is not linear)
- logit model (most common):
  ln( p/(1 − p) ) = b0 + b1 X1 + ⋯ + bn Xn + ε
  the logistic transformation of Y gives a linear relationship with the IVs
  p ➞ prob. of the event occurring (i.e. T) ~ Bernoulli
  p/(1 − p) ➞ ratio of P(T) to P(F)  (the odds of the event occurring)
Page 9/
e.g./ p = .75 ,  p/(1 − p) = .75/.25 = 3 to 1
(3 events occurring to 1 non-occurrence)
ln( p/(1 − p) ) ➞ log odds, or logit ;  ∴ e^[ln(p/(1−p))] ➞ odds
to recover p:  p = 1 / (1 + e^−(b0 + b1X1 + ⋯ + bnXn))
Page 10/
ln( p/(1 − p) ) = b0 + b1 X1 + ⋯ + bn Xn = A
p/(1 − p) = e^A
p = e^A − p e^A
p + p e^A = e^A
p(1 + e^A) = e^A
p = e^A / (1 + e^A) = e^(b0 + b1X1 + ⋯ + bnXn) / (1 + e^(b0 + b1X1 + ⋯ + bnXn)) = 1 / (1 + e^−(b0 + b1X1 + ⋯ + bnXn))
e.g./ p = .75 ➞ A = ln(.75/.25) = 1.098612289
p = e^1.098612289 / (1 + e^1.098612289) = 3/4 = .75 ,  or  p = 1 / (1 + e^−1.098612289) = 1/1.33̇ = .75
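The same round trip in a few lines:
```python
import math

p = 0.75
odds = p / (1 - p)                        # 3.0  (3 to 1)
log_odds = math.log(odds)                 # 1.098612...
p_back = 1 / (1 + math.exp(-log_odds))    # recovers 0.75
print(odds, log_odds, p_back)
```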
Page 11/
ln( p/(1 − p) ) = b0 + b1 X1 + ⋯ + bn Xn + ε
odds ratio ➞ e^bi : the ratio of the odds that the event happens with a unit increase in Xi to the odds that the event happens without an increase in Xi
Page 12/
Log-odds is linear with changes in IVs
- odds ratios are exponential with ∆IVs
- probabilities are non-linear - pos. 𝐛𝐢 increase 𝐩
- neg. 𝐛𝐢 decrease 𝐩
Goodness of Fit/ Likelihood ratio (LR) test
(sort of like the F-test for model significance)
H0: 𝐛𝟏 = 𝐛𝟐 = ... = 𝐛𝐤 = 0
Ha : at least one coefficient ≠ 0
- also, no 𝐑𝟐 ➞ pseudo - 𝐑𝟐
- can only be used to compare different specifications
of the same model
Time-Series Analysis
a. calculate and evaluate the predicted trend value for a time series, modeled as either a
linear trend or a log-linear trend, given the estimated trend coefficients
b. describe factors that determine whether a linear or a log-linear trend should be used with
a particular time series and evaluate limitations of trend models
c. explain the requirement for a time series to be covariance stationary and describe the
significance of a series that is not stationary
d. describe the structure of an autoregressive (AR) model of order p and calculate one- and
two-period-ahead forecasts given the estimated coefficients
e. explain how autocorrelations of the residuals can be used to test whether the
autoregressive model fits the time series
g. contrast in-sample and out-of-sample forecasts and compare the forecasting accuracy of
different time-series models based on the root mean squared error criterion
j. describe implications of unit roots for time-series analysis, explain when unit roots are
likely to occur and how to test for them, and demonstrate how a time series with a unit
root can be transformed so it can be analyzed with an AR model
k. describe the steps of the unit root test for nonstationarity and explain the relation of the
test to autoregressive time-series models
l. explain how to test and correct for seasonality in a time-series model and calculate and
interpret a forecasted value using an AR model with a seasonal lag
m. explain autoregressive conditional heteroskedasticity (ARCH) and describe how ARCH
models can be applied to predict the variance of a time series
[Figure: a causal relationship (Y vs. X) vs. a time series (Y plotted against t = 0, 1, 2, 3, … T) ;  x̄ ➞ μ ,  σ² = Σ(xi − x̄)²/(n − 1)]
causal relationship: the DV & IV are distinct ; SC will not affect the consistency of b0 & b1
time series: a set of observations on a variable's outcomes in different time periods (the IV is time, t)
Page 3
⇒ εt must be uncorrelated across time periods
- Durbin-Watson test on the residuals:  H0: DW = 2 ,  Ha: DW ≠ 2
  k = # of slope parameters ,  df = # of obs. ;  dL & dU from the DW table
  e.g.: n = 77 , α = .05 , k = 1 ➞ dL = 1.60 , dU = 1.65
= Trend models will typically have serially correlated errors ∴ need a better model
Page 5
⇒ Serial Correlation in (AR) Models/
Recall ρ(A,B) = Cov(A,B)/(σA σB) ,  so ρ(ε,k) = Cov(εt , εt−k)/(σεt σεt−k) = Cov(εt , εt−k)/σ²ε
test/ autocorrelation of the error term:  H0: E(ρ(ε,k)) = 0
- test-statistic  t = ρ̂(ε,k) / (1/√N)
Page 6
⇒ Mean Reversion/ a series falls when it is above its mean value, rises when it is below its mean value
[Figure: series fluctuating around its mean value from t = 0 to t = T]
- at the mean-reverting level:  xt = xt+1 = b0 + b1 xt  ➞  xt = b0/(1 − b1)
  if xt = b0/(1 − b1) the time series will stay the same
  if xt < b0/(1 − b1) it will increase ;  if xt > b0/(1 − b1) it will decrease
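A quick numerical illustration with hypothetical AR(1) estimates:
```python
b0, b1 = 1.2, 0.6              # hypothetical AR(1) estimates
mrl = b0 / (1 - b1)            # mean-reverting level = 3.0
x_t = 2.0                      # current value, below the MRL
x_next = b0 + b1 * x_t         # 2.4 > x_t : the series is expected to rise toward the MRL
print(mrl, x_next)
```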
Page 7
⇒ Comparing Model Forecast Performance/
𝛔𝟐𝛆 of model 1 vs. 𝛔𝟐𝛆 of model 2
- the model with the smaller 𝛔𝟐𝛆 is more accurate
➞ in-sample forecast errors - predicted vs. observed values
used to generate the model
➞ out-of-sample forecast errors – forecasts vs. ‘outside the model
values’
Root mean squared error (RMSE) – used to compare out-of-sample forecasting performance
RMSE = √( Σ(act. − forc.)² / n )  ➞ the square root of the average squared error
- the smaller the RMSE, the more accurate
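A minimal RMSE comparison with hypothetical out-of-sample values:
```python
import numpy as np

actual = np.array([1.2, 0.8, 1.5, 1.1])       # hypothetical out-of-sample observations
forecast = np.array([1.0, 0.9, 1.6, 1.0])     # a model's forecasts for the same periods
rmse = np.sqrt(np.mean((actual - forecast) ** 2))
print(rmse)    # compare across models: the smaller the RMSE, the more accurate the forecasts
```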
Page 8
⇒ Instability of Regression Coefficients/
- regression coefficients can vary based on:
  a) different sample periods (e.g. 1965-1975 vs. 1985-1995)
  b) different time-period lengths (e.g. 5 yrs. vs. 8 yrs.)
- the choice of AR(p) model may also change depending on the time period chosen
Page 10
- first differencing/  Δxt = xt − xt−1  transforms a non-cov-st. TS (even a RW) into a cov-st. TS (a new series)
  note: for a RW,  Δxt = xt − xt−1 = εt , and since E(εt) = 0 , E(Δxt) = 0
⇒ the differenced series is an AR(1) w/ b0 = 0 , b1 = 0  ➞  MRL = b0/(1 − b1) = 0/(1 − 0) = 0  (constant & finite)
  var(Δxt) = Var(εt) ∀ t  ➞ can now use regression
  no forecasting power however, since b0 & b1 = 0
Page 11
⇒ RW w/ drift/  ⇒ AR(1) w/ b0 ≠ 0 , b1 = 1
- after first differencing:  Δxt = b0 + εt  ⇒ AR(1) w/ b0 ≠ 0 , b1 = 0  (drift + random error)
⇒ Unit Root Test of Non-Stationarity/  Dickey-Fuller test for a unit root
- if a series is cov-st., then |b1| < 1 from the AR(1)
- if b1 = 1, the TS has a unit root - is a RW - is not cov-st. - cannot test b1 directly from the AR(1)
DF/  xt = b0 + b1 xt−1 + εt  - subtract xt−1 from both sides
     xt − xt−1 = b0 + (b1 − 1) xt−1 + εt ,  let g = b1 − 1
1st diff:  Δxt = b0 + g xt−1 + εt  (1st lag) ;  H0: g = 0 ,  Ha: g < 0
- the test uses a modified critical value (t) - larger than the conventional t
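A sketch of the DF regression (Δx on the first lag); it returns ĝ and its t-stat, to be compared against Dickey-Fuller (not conventional) critical values:
```python
import numpy as np

def dickey_fuller_g(x):
    """Regress delta_x_t = b0 + g*x_{t-1} + e ; H0: g = 0 (unit root), Ha: g < 0."""
    dx, lag = np.diff(x), x[:-1]
    A = np.column_stack([np.ones(len(lag)), lag])
    beta, *_ = np.linalg.lstsq(A, dx, rcond=None)
    resid = dx - A @ beta
    s2 = (resid ** 2).sum() / (len(dx) - 2)                  # residual variance
    se_g = np.sqrt(s2 * np.linalg.inv(A.T @ A)[1, 1])        # standard error of g-hat
    return beta[1], beta[1] / se_g                           # g-hat and its t-stat
```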
Section 7: Seasonality
Page 12
⇒ use a seasonal lag
e.g. xt = b0 + b1 xt−1 + εt   AR(1)
- if the seasonal autocorrelation of the error term ≠ 0, add the seasonal lag:
[ xt = b0 + b1 xt−1 + b2 xt−4 + εt ]   4th autocorrelation for quarterly data
[ xt = b0 + b1 xt−1 + b2 xt−12 + εt ]   12th autocorrelation for monthly data
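A one-quarter-ahead forecast with the seasonal lag, using hypothetical coefficients:
```python
# hypothetical quarterly estimates for x_{t+1} = b0 + b1*x_t + b2*x_{t-3}
b0, b1, b2 = 0.5, 0.4, 0.3
x_t, x_t_minus_3 = 2.0, 1.8                       # x_{t-3} is 4 quarters before t+1
x_forecast = b0 + b1 * x_t + b2 * x_t_minus_3     # one-quarter-ahead forecast = 1.84
print(x_forecast)
```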
Page 14
E. both have a unit root ⇒ no cointegration
- same results as B & C earlier
- only 2 options offer valid regression results/
➀ no unit roots in either series - use DF on each
➁ both have unit roots but are cointegrated
- will estimate long-term relationship
- may not be the best model of the short term
relation between the two
test for cointegration
1) estimate  yt = b0 + b1 xt + εt
2) use the (Engle-Granger) DF test on the residuals to test H0: g = 0
   - note: the test is [ (ε̂t − ε̂t−1) = b0 + g ε̂t−1 + μ ] - must use different (Engle-Granger) critical values
Page 15
- more than one IV
A. use DF on all ➞ no unit roots
B. at least one has a unit root, one does not
- cannot use OLS
C. all have unit roots - test for cointegration:  yt = b0 + b1 xt + b2 zt + εt
   (ε̂t − ε̂t−1) = b0 + g ε̂t−1 + μ   [H0: g = 0]
   - fail to reject H0 = no cointegration ➞ cannot use OLS
   - reject H0 = cointegrated ➞ OLS ok, but requires much more work!
Road Map
Page 16
[Flow-chart summary]
- causal relationship vs. time series
- time series: linear or log-linear trend ➞ DW = 2 test
  - no SC ➞ ok! ;  yes (SC) ➞ use an AR(1) - must be Cov. St.
- AR(1): test residual autocorrelations at the lags ➞ no SC ➞ ok ;  SC ➞ AR(2) ;  SC only at a seasonal lag ➞ add a seasonality lag
- then test ARCH(1):  ε̂²t = a0 + a1 ε̂²t−1 + μ ,  H0: a1 = 0
  - fail to reject ➞ ok ;  reject ➞ use GLS
Machine Learning
Page 1
➞ Statistical approaches (LR, TS) – some underlying structure LOS a
to the data is assumed (some probability dist.) - distinguish
- models are built on that structure
➞ Machine Learning – no underlying structure is assumed
- the algorithm attempts to discover the underlying structure
- 3 distinct classes of techniques
1/ Supervised Learning: involves training an algorithm to take a set of inputs (X, the features) and find a model that best relates them to the output (Y, the target)
   - training data set is 'labeled' ⇒ Xs are matched with Y (regression is a good example)
   - 2 categories of problems: ➞ regression problems (continuous target)
                               ➞ classification problems (categorical or ordinal target)
Page 2
⇒ Process Overview  LOS a - distinguish
➀ Labeled Data (Y, Xk, N) ➞ ➁ Supervised ML Algorithm ➞ ➂ Prediction Rule ➞ ➃ inputs (Xk, N) from Test Data ➞ Prediction vs. actual Y ➞ ➄ evaluation of fit
2/ Unsupervised Learning/
- does not make use of labeled training data
- ML program has to discover structure within the data
2 types of problems: dimension reduction & clustering
Page 3
3/ Deep Learning/Reinforcement Learning LOS a
- based on neural networks (more later) - distinguish
Page 4
LOS a - distinguish
- unsupervised ➞ dimension reduction ; clustering
- supervised ➞ regression ; classification
Page 5
➞ ML models can produce overly complex models that may fit the training data too well  LOS b - describe
- known as overfitting ➞ will not generalize well to new data
- evaluation of ML algorithms thus focuses on prediction error
- data sets partitioned into 3 non-overlapping samples (supervised learning):
  Training Sample (in-sample) ;  Validation Sample & Test Sample (out-of-sample)
Page 6
LOS b
⇒ Errors and overfitting/
- describe
Bias – the degree to which the
model fits the training data
Variance error – how much the
model’s results change in
response to new data from
validation & test samples
Page 7
⇒ Errors and overfitting/ LOS b
- describe
Preventing overfitting/
1/ prevent the algorithm from getting
too complex
- limit the number of features
- penalize algorithms that are
too complex or too flexible
2/ proper data-sampling by using
cross-validation hold-out samples
- validation and test samples
should be from the same
domain as the training data
k-fold cross validation
- data (less test data) shuffled
randomly, divided into k sub-samples
k-1 training, 1 validation ⇒ repeat k times (k ∼ 5 to 10)
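A sketch of the k-fold split (indices only; the test sample is assumed to have been set aside already):
```python
import numpy as np

def k_fold_indices(n, k=5, seed=0):
    """Shuffle n observation indices and split into k folds; each fold serves once as validation."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

folds = k_fold_indices(100, k=5)
train_0 = np.setdiff1d(np.arange(100), folds[0])   # first of the k training/validation splits
```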
Page 8
⇒ Supervised ML Algorithms/ LOS c
1/ Penalized Regression – dimension reduction - describe
- eliminates/minimizes overfitting - determine
min [ Σ(i=1→n) ε̂i² + λ Σ(k=1→K) |b̂k| ]   where λ is called a hyperparameter (set by the researcher)
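A sketch of penalized (LASSO) regression using scikit-learn's Lasso on simulated data; `alpha` plays the role of the hyperparameter λ:
```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                            # 10 candidate features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)      # only 2 features actually matter

model = Lasso(alpha=0.1).fit(X, y)                        # alpha = lambda, the penalty hyperparameter
print(model.coef_)                                        # the penalty drives most coefficients to zero
```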
Page 9
⇒ Supervised ML Algorithms/ LOS c
2/ Support Vector Machine – classification, regression - describe
& outlier detection - determine
resilient to outliers
& correlated features
- trade-off between a wider margin and
lower total error penalty
Page 10
⇒ Supervised ML Algorithms/ LOS c
3/ k-nearest neighbour – classification (sometimes regression) - describe
- classify new observations by finding similarities - determine
Page 11
⇒ Supervised ML Learning/ LOS c
4/ Classification & Regression Trees (CART) - typically applied when the target is binary
   - classification trees predict a categorical target ; regression trees predict a continuous target
Root Node ➞ one of the features and a cutoff value are chosen by some minimization function (MSE, Gini)
   - observations go true/false (</>) down to the next level
Decision Nodes ➞ same process at each lower level with the next IV - minimize misclassification (within-group) error at each node
   ➞ the data partition becomes smaller and smaller at each lower level
Terminal Nodes ➞ output (predicted label) = the category that is the majority at this node (i.e. the target)
- the process ends when classification error does not diminish much more
- to avoid overfitting: can set max. depth of tree, min. population at each node, etc.
  or/ Pruning ➞ sections of the tree with low classifying power are cut
Page 12
⇒ Supervised ML Learning/ LOS c
5/ Ensemble Learning & Random Forests - a collection of - describe
classification trees - determine
prediction of a group of models
2 main categories:
a) different types of algorithms combined with a voting classifier
b) a combination of the same algorithm, using different training
data on a bootstrap aggregating technique
Voting Classifiers/ majority vote of all models = classification of a new data point
Bootstrap Aggregating (bagging)/ the original training data set is used to generate n new training data sets (sampling with replacement, so rare occurrences of the target can appear multiple times ➞ increases their proportional representation)
- n new models are created ➞ majority-vote classifier for each new data point
Page 13
⇒ Supervised ML Learning/ LOS c
5/ Ensemble Learning & Random Forests - describe
Random Forest - determine
Page 14
⇒ Unsupervised ML Algorithms/ LOS d
1/ Principal Components Analysis – dimension reduction - describe
- used to reduce highly correlated features of data - determine
Page 15
⇒ Unsupervised ML Algorithms/ LOS d
2/ Clustering – create subsets of the data such that:
   - observations within a cluster are deemed similar - maximum cohesion
   - observations in different clusters are dissimilar - maximum separation
   - must define 'similar' in order to measure distance
a) k-means clustering:
   - k is chosen by the researcher ; k centroids are chosen at random
   - each observation is assigned to its closest centroid (distance measured among the chosen features) to form clusters ➞ new centroids are calculated ➞ re-cluster ➞ new centroids ➞ etc…
   - iteration continues until no observation is reassigned to a new cluster
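A sketch of k-means on simulated two-feature data using scikit-learn (k, the number of clusters, is chosen by the researcher):
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),        # two hypothetical groups of observations
               rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels, centroids = km.labels_, km.cluster_centers_   # cluster assignments and final centroids
```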
Page 16
⇒ Unsupervised ML Algorithms/ LOS d
- describe
a) k-means clustering: works well on very large data sets
- determine
hundreds of millions of data
points
- useful for discovering patterns in high dimensional data
b) Hierarchical clustering:
- creates intermediate rounds
of clusters of increasing
(agglomerative clustering) or
decreasing (divisive clustering)
size
- agglomerative ➞ clusters based
on local patterns
Page 17
⇒ Unsupervised ML Algorithms/ LOS d
➞ Dendrograms/ visualize agglomerative clustering - the vertical axis is a measure of distance
[Figure example: cluster 1 = A + B ; cluster 7 = 1 + 2 + 3 ; cluster 9 = 7 + 8]
Page 18
1/ Neural Networks (classification & regression supervised learning) - nodes & links  LOS e - describe
- input data is scaled to have values between 0 & 1
- e.g. 4 input nodes, 5 hidden nodes, 1 output node ➞ the numbers of nodes are hyperparameters
- forward propagation: each hidden node (neuron) receives input from each preceding node ➞ weights each value received and adds them up (summation operator) ➞ an activation function transforms the sum into an output between 0 and 1
- weights are chosen to minimize a loss function
- better able to cope with non-linear relationships
- requires large training data sets
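A sketch of a single hidden-layer node: a weighted sum of incoming values followed by a sigmoid activation (inputs and weights are hypothetical):
```python
import numpy as np

def node_output(inputs, weights, bias):
    """One neuron: summation operator, then a sigmoid activation that squashes the sum into (0, 1)."""
    z = np.dot(weights, inputs) + bias       # weighted sum of values from preceding nodes
    return 1 / (1 + np.exp(-z))              # activation function

x = np.array([0.2, 0.8, 0.5, 0.1])           # 4 scaled input-node values (hypothetical)
print(node_output(x, np.array([0.4, -0.3, 0.7, 0.1]), bias=0.05))
```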
Page 19
2/ Deep Learning Nets ➞ neural networks with many LOS e
hidden layers (at least 3, typically > 20) - describe
f. describe methods for extracting, selecting and engineering features from textual
data
LOSs will match between the video and the MM PDFs, but may be
in a different order than the CFAI readings
Page 2
⇒ Steps in Data Analysis Projects/  LOS a - state , LOS b - explain
b) Textual big Data ➞ Text ML Model Building:
   Text Problem Formulation (inputs/outputs?) ➞ Data Curation (from where?) ➞ Text Preparation/Preprocessing (cleansing & preparing) ➞ Text exploration (what to look for) ➞ Classifier output
Page 3
⇒ Data Preparation/Preprocessing/ LOS b, e
- cleansing/organizing raw data into a useable format - describe
Page 4
⇒ Data Preparation/Preprocessing/ LOS b, e
Structured Data ➞ Preprocessing/ transformations/scaling of data - describe
Page 5
⇒ Data Preparation/Preprocessing/ LOS b, e
Structured Data ➞ Preprocessing/ - describe
➞ scaling ➞ adjusting the range of a feature
1/ normalization:  (xi − x_min)/(x_max − x_min)  - rescales into the range 0 - 1 (sensitive to outliers)
2/ standardization:  (xi − μ)/σ  - centers and rescales (requires a normal distribution)
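Both rescalings in a couple of lines:
```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0])
normalized = (x - x.min()) / (x.max() - x.min())   # rescaled into the 0-1 range
standardized = (x - x.mean()) / x.std()            # centered at 0 with unit standard deviation
print(normalized, standardized)
```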
➞ Unstructured (Text) Data ➞ Cleansing
- must be transformed into structured data
remove html tags – if data is downloaded from the web
remove punctuation – most not required, some is (0.76 needs the period
e-mail needs the hyphen)
remove numbers – or substituted with annotation
identifying them as numbers
remove white spaces – tabs, leading spaces, spaces created by other
cleaning tasks
Page 6
⇒ Data Preparation/Preprocessing/ LOS b, e
➞ Unstructured (Text) Data ➞ Cleansing - describe
token = word; tokenization = splitting a text into tokens
unigram ➞ single-word token, bigram ➞ 2-word tokens
n-grams ➞ n-word tokens
- normalizing text data:
  Lowercasing
  Stop words (the, is, a) – do not carry meaning – typically removed
  stemming e.g. analyzed, analyzing ➞ analyz (the stem) - more common
  lemmatization ➞ converts each word to its lemma, e.g. analyze (more work than just stemming)
  - these steps reduce token count and decrease sparseness (minimize ML algo. complexity)
- end result is a BOW (bag of words)
- final step is a 'document term' matrix (DTM) – structured data
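A sketch of turning a few toy (hypothetical) cleansed texts into a document-term matrix with scikit-learn's CountVectorizer:
```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["rates rose sharply", "rates fell", "earnings rose"]   # toy cleansed/normalized texts
vec = CountVectorizer()
dtm = vec.fit_transform(docs)        # bag-of-words counts: rows = documents, columns = tokens
print(dtm.toarray())                 # this document-term matrix is the structured input for ML
```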
Page 7
⇒ Data exploration/ – requires domain knowledge LOS c, f
- describe
Page 8
⇒ Data exploration/ LOS c, f
Unstructured Data: Text exploration - describe
➞ exploratory data analysis:
word counts, word clouds, co-occurrences
anything that helps gain early insight into existing patterns
➞ feature selection: selecting a sub-set of the tokens
- eliminate noisy features (features that do not contribute to ML model training)
- noisy features are typically the most frequent and the most sparse tokens:
  most frequent ➞ would lead to underfitting ;  most sparse ➞ would lead to overfitting
Page 9
⇒ Data exploration/ LOS c, f
methods 3/ mutual information (MI) – how much info. - describe
a token contributes to a class (0 - 1)
0 ➞ token’s distribution is the same in all classes
1 ➞ token only occurs in one class
feature engineering (maintain semantics while structuring)
1) Numbers ➞ differentiate among types of numbers (i.e. year, tax ID, etc.)
2) N-grams ➞ multi-word patterns kept intact
3) Name entity recognition (NER) – money, time, organization
Page 10
⇒ Model Training/ LOS d
- describe
observations (length)
- size of data
features (width) - too wide – overfit
- too narrow – underfit
Page 11
⇒ Model Training/ LOS d
➞ Performance evaluation/ - describe
P = precision = TP/(TP + FP)
R = recall = TP/(TP + FN)   (a.k.a. sensitivity - useful when the cost of a Type II error is high)
Accuracy = (TP + TN)/(TP + FP + TN + FN)
F1 = 2PR/(P + R)   (for unequal class distribution in the dataset)
- error analysis
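The same metrics computed from a hypothetical confusion matrix:
```python
TP, FP, TN, FN = 40, 10, 35, 15               # hypothetical confusion-matrix counts
precision = TP / (TP + FP)
recall = TP / (TP + FN)                       # a.k.a. sensitivity
accuracy = (TP + TN) / (TP + FP + TN + FN)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, accuracy, f1)
```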
Page 12
⇒ Model Training/ LOS d
➞ Performance evaluation/ - describe
Receiver Operating Characteristic (ROC)
False Positive Rate (FPR) = FP/(TN + FP)
True Positive Rate (TPR) = TP/(TP + FN)
- as we allow FPR to increase, TPR ↑  (trade-off)
Root Mean Squared Error (RMSE)
RMSE = √( Σ(i=1→n) (Predicted_i − Actual_i)² / n )
REVIEW
Time Series
Review - 1
causal relationship  or  time series
Review - 2
⇒ Linear/ DV changes at a constant rate w/ time:  yt = b0 + b1 t + ε  (IV = t)
   - often inappropriate for economic data
   - if SC (DW ≠ 2), then use an AR model instead
⇒ Log-Linear/ DV changes at a constant growth rate w/ time
Covariance Stationary/
a) E(yt) = μ , constant & finite
b) σ² constant & finite
c) covariance with past values constant & finite in all periods
Review - 3
⇒ test for SC/  H0: ρ(εt , εt−k) = 0  (the autocorrelation of the error term)
   test stat = ρ̂(ε,k) / (1/√N) ;  SE of each autocorrelation = 1/√n
   [Output: a table of each lag's autocorrelation, SE, and t-stat]
- reject H0 ⇒ misspecified model
Review - 4
⇒ Random Walks w/ Unit Roots/
- have no finite mean-reverting level ( b0/(1 − b1) = 0/0 ) and no finite variance
- RW w/ drift ⇒ AR(1) w/ b0 ≠ 0 , b1 = 1 ; after first differencing, Δxt = b0 + εt  (drift)
⇒ Unit Root Test of Non-Stationarity/
Review - 5
⇒ Seasonality/ - use a seasonal lag (when the seasonal autocorrelation of the error term ≠ 0):
   xt = b0 + b1 xt−1 + b2 xt−4  (quarterly)  or  + b2 xt−12  (monthly)
Review - 6
⇒ Regression with more than one TS/
d) reject H0 for both – are they cointegrated?
Machine Learning
Review - 1
LOS a/ 1/ Supervised Learning: training data is labelled
- features + target
- regression for continuous targets
- classification for categorical targets
2/ Unsupervised Learning: training data is not labelled
- ML program attempts to discover
structure within the data
- dimension reduction – reduce the number of features
- clustering - sort observations into groups
LOS b/ overfitting ➞ models fit the training data too well (produce low bias error)
   ➞ will not generalize well (high variance error)
Review - 2
LOS b/ Bias error ➞ the degree to which the model fits
the training data
Variance error ➞ how much the model’s results change in
response to new data from validation & test samples
- as bias error ↓, variance error ↑
linear models more prone to bias error
non-linear models more prone to variance error
Preventing overfitting:
1/ prevent the algorithm from getting too complex
2/ proper data sampling by using cross validation (i.e. k-fold
cross-validation)
LOS c/ 1/ Penalized Regression: minimizes overfitting
   LASSO – least absolute shrinkage & selection operator:
   min [ Σ(i=1→n) ε̂i² + λ Σ(k=1→K) |b̂k| ]  ,  λ = hyperparameter ; the second term is the penalty term
Review - 3
LOS c/ 2/ SVM – Support Vector Machine
- select linear classifier that optimally
separates obs. (furthest from all obs.)
- resilient to outliers & correlated features
discriminant boundary
support vectors
Review - 4
LOS c/ 4/ Classification and Regression Trees
   Root Node ➞ one of the features + a cutoff value is used (chosen by some minimization function)
   Decision Nodes ➞ same process
5/ Ensemble Learning
1/ different types of algorithms used, majority vote of all
models
2/ same algorithm using different training data
- majority vote
Review - 5
LOS c/ 5/ Ensemble Learning
Random Forest ➞ a collection of CARTs
LOS d/ 1/ PCA - Principal Components Analysis – dimension reduction
- used to reduce highly correlated features into a few
main composite variables ➞ uncorrelated
Review - 6
LOS d/ 3/ Hierarchical clustering
- create intermediate rounds
of clusters of increasing
or decreasing size
agglomerative – bottom-up
divisive – top-down
Review - 7
LOS e/ 1/ Neural networks
- input data scaled from 0 - 1
➞ 4 input nodes
5 hidden nodes hyperparameters
1 output node
- hidden node ➞ summation operator
➞ activation function
- best able to handle non-linear
relationships
2/ Deep Learning Nets – neural networks with many
hidden layers (> 20 typically) – image, text,
speech
3/ Reinforcement Learning – learning occurs through trial and error
(millions of iterations)