Quantitative Methods
Linear Regression, a.k.a. linear least squares, assumes a linear relationship between the dependent and
the independent variables. Linear regression computes the straight line that best fits the observations:
it chooses values for the intercept b_0 and slope b_1 that minimize the sum of the squared vertical
distances between the observations and the regression line.
ACTUAL:  Y_i = b_0 + b_1 X_i + ε_i ,    i = 1, 2, 3 ... n

where,

Y_i : i-th observation of the dependent variable Y. The dependent variable is also referred to as the
'Explained Variable', 'Endogenous Variable' or 'Predicted Variable'.

X_i : i-th observation of the independent variable X. The independent variable is also referred to as the
'Explanatory Variable', 'Exogenous Variable' or 'Predicting Variable'.

b_0 : Regression Intercept. It is the value of the dependent variable when the value of the independent
variable is 0. In a regression of a stock's excess returns on the market's excess returns, the intercept
is called the stock's ex-post alpha, a measure of excess risk-adjusted return, a.k.a. Jensen's alpha:
α_p = R_p - [R_f + β_p (R_m - R_f)].

b_1 : Regression Slope Coefficient. It is the change in the dependent variable for a 1-unit change in the
independent variable. In such a regression the slope coefficient is called the stock's beta, and it
measures the relative amount of systematic risk in the stock's returns.

(b_0 and b_1 are together referred to as the regression coefficients.)

ε_i : Residual for the i-th observation, also referred to as the 'Disturbance Term', 'Error Term' or
'Unexplained Deviation'. It represents the portion of the dependent variable that cannot be explained by
the independent variable.
PREDICTED:  Ŷ_i = b̂_0 + b̂_1 X_i ,    i = 1, 2, 3 ... n

where,

Ŷ_i : Estimated (fitted) value of Y_i given X_i.

b̂_0 : Estimated Intercept. b̂_0 = Ȳ - b̂_1 X̄. The intercept equation highlights the fact that the regression
line passes through the point with coordinates equal to the means of the independent and dependent
variables.

b̂_1 : Estimated Slope Coefficient. b̂_1 = Cov_XY / σ_X².
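To make the estimator formulas concrete, here is a minimal sketch in Python (using NumPy, which these
notes do not reference) that computes b̂_1 = Cov_XY / σ_X² and b̂_0 = Ȳ - b̂_1 X̄ on hypothetical data:

import numpy as np

# Hypothetical observations of X (independent) and Y (dependent)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Sample covariance and variance (both use the n - 1 denominator, so the ratio is unaffected)
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)

b1_hat = cov_xy / var_x                  # estimated slope
b0_hat = y.mean() - b1_hat * x.mean()    # estimated intercept: line passes through (X-bar, Y-bar)

print(b1_hat, b0_hat)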
The sum of the squared vertical distances between the estimated and actual Y-values is referred to as the
'Sum of Squared Errors' (SSE). Thus, the regression line is the line that minimizes the SSE. This explains
why simple linear regression is frequently referred to as 'Ordinary Least Squares' (OLS) regression, and
the values estimated by the regression equation, Ŷ_i, are called least squares estimates. The model rests
on the assumptions below:
Assumption 1: A linear relationship exists between the dependent and the independent variable. This
requirement means that b_0 and b_1 are raised to the first power only and that neither b_0 nor b_1 is
multiplied or divided by another regression parameter (as in b_0 / b_1). The requirement doesn't exclude X
from being raised to a power other than 1. If the relationship between the independent and dependent
variables is non-linear in the parameters, then estimating that relation with a linear regression model
will produce invalid results.

Y_i = b_0 e^(b_1 X_i) + ε_i      Non-linear in the parameters (not allowed)
Y_i = b_0 + b_1 X_i² + ε_i       Linear in the parameters (allowed)
Assumption 2: The independent variable X is not random. If the independent variable is random, we can still
rely on the results of regression models given the crucial assumption that the error term is uncorrelated with
the independent variable.
Assumption 3: The expected value of the error term is 0; E(ε_i) = 0.

Assumption 4: The variance of the error term is the same for all observations: E(ε_i²) = σ_ε². This is the
'Homoscedasticity' assumption; its violation is heteroskedasticity.

Assumption 5: The error term is independently distributed, so that the error for one observation is not
correlated with that of another observation. Its violation is 'Serial Correlation'.

Assumption 6: The error term is normally distributed. If the regression errors are not normally
distributed, we can still use regression analysis. Econometricians who dispense with the normality
assumption use Chi-Square tests of hypothesis rather than F-tests.
An unbiased forecast can be expressed as E(Actual Change - Predicted Change) = 0. If the forecasts are
unbiased, the intercept b_0 should be 0 and the slope b_1 should be 1; then the error term
[Actual Change - b_0 - b_1 (Predicted Change)] will have an expected value of 0, as required by
Assumption 3 of the linear regression model.
lllll ANOVA
Analysis of Variance (ANOVA) is a statistical procedure for analyzing the total variability of the
dependent variable. In regression analysis, we use ANOVA to determine the usefulness of the
independent variable or variables in explaining variation in the dependent variable.
1. Total Sum of Squares (TSS): measures the total variation in the dependent variable.
   TSS = Σ_{i=1}^{n} (Y_i - Ȳ)²

2. Regression Sum of Squares (RSS): measures the explained variation in the dependent variable.
   RSS = Σ_{i=1}^{n} (Ŷ_i - Ȳ)²

3. Sum of Squared Errors (SSE): measures the unexplained variation in the dependent variable. It is also
   known as the Sum of Squared Residuals or the Residual Sum of Squares.
   SSE = Σ_{i=1}^{n} (Y_i - Ŷ_i)²
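A short sketch (Python/NumPy, hypothetical data) showing that the three sums of squares defined above
satisfy TSS = RSS + SSE for a fitted OLS line:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2, 6.8])

# Fit the OLS line
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)       # total variation
rss = np.sum((y_hat - y.mean()) ** 2)   # explained variation
sse = np.sum((y - y_hat) ** 2)          # unexplained variation

print(tss, rss + sse)                   # the two values agree: TSS = RSS + SSE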
Fig 1: Components of Total Variation. (Figure: actual observations and the predicted line
Ŷ_i = b̂_0 + b̂_1 X_i; for each observation, (Y_i - Ȳ) is the total variation (TSS component),
(Ŷ_i - Ȳ) is the explained variation (RSS component) and (Y_i - Ŷ_i) is the unexplained variation
(SSE component).)
Fig 2: ANOVA Table

Source of Variation     Degrees of Freedom (df)     Sum of Squares (SS)     Mean Sum of Squares (SS / df)
Regression              k                           RSS                     MSR = RSS / k
Error (Residual)        n - k - 1                   SSE                     MSE = SSE / (n - k - 1)
Total                   n - 1                       TSS

n: no. of observations
k: no. of independent variables
lllll T-Statistic
The t-value measures the size of an estimated coefficient relative to the variation in the sample data. A
very small t-stat indicates that the estimate is not significantly different from 0 (for example, that
none of the residual autocorrelations differ significantly from 0).

t = (b̂_1 - b_1) / s_b̂_1         df = n - k - 1

a. H_0 : μ = 0, H_a : μ ≠ 0     b. H_0 : μ ≤ 0, H_a : μ > 0     c. H_0 : μ ≥ 0, H_a : μ < 0

p-value: the smallest level of significance for which the null hypothesis can be rejected.
p < α : Reject H_0
p > α : Fail to reject H_0
Example 1:
The estimated slope coefficient for ABC plc. is 0.64 with a standard error of 0.26. Assuming that the
sample has 36 observations, determine whether the estimated slope coefficient is significantly different
from 0 at a 5% level of significance.

H_0 : b_1 = 0     H_a : b_1 ≠ 0

t = (b̂_1 - b_1) / s_b̂_1 = (0.64 - 0) / 0.26 = 2.46

The critical two-tailed t-values are ±2.03 (df = n - k - 1 = 36 - 1 - 1 = 34). Because t > t_c, i.e.
2.46 > 2.03, we reject the null hypothesis and conclude the slope is different from 0.
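The critical value of ±2.03 and the decision can be reproduced with SciPy (an assumed library choice, not
referenced in the notes):

from scipy import stats

b1_hat, se_b1, n, k = 0.64, 0.26, 36, 1
t_stat = (b1_hat - 0) / se_b1                       # 2.46
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - k - 1)    # two-tailed 5% critical value, df = 34, ~2.03

reject = abs(t_stat) > t_crit
print(round(t_stat, 2), round(t_crit, 2), reject)   # 2.46  2.03  True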
If none of the independent variables in a regression model helps explain the dependent variable,
the slope coefficients should all equal '0'. In a multiple regression however, we cannot test the null
hypothesis that all slope coefficients equal '0' based on T-tests that each individual slope coefficient
equals '0', because the individual tests don't account for the effects of interactions among the
independent variables. To test the null hypothesis that all of the slope coefficients in the multiple
regression model are jointly equal to '0' (H 0 : b1 = b2 = ... b k = 0) against the alternative hypothesis
that at least one slope coefficient is not equal to '0', we must use an F-test. The F-test is viewed as a
test of the regression's overall significance.
lllll F-Statistic
F-test assesses how well the set of independent variables as a group explains the variation in the
dependent variable. That is, the F-Stat is used to test whether at least one of the independent
variables explains a significant portion of the variation of the dependent variable. This is always a
one-tailed test, despite the fact that it looks like it should be a two-tailed test because there is an
equal sign in the null hypothesis. If the regression model does a good job of explaining variation in
the dependent variable, then this ratio should be high.
F = MSR / MSE = (RSS / k) / (SSE / (n - k - 1))        df_numerator = k ;  df_denominator = n - k - 1
Example 2:
An analyst runs a regression of monthly value stock returns on five independent variables over 60 months.
The TSS is 460 and the SSE is 170. Test the null hypothesis at the 5% significance level that the slope
coefficients on all five independent variables are equal to 0.

RSS = TSS - SSE = 460 - 170 = 290
MSR = RSS / k = 290 / 5 = 58
MSE = SSE / (n - k - 1) = 170 / 54 = 3.15

F = MSR / MSE = 58 / 3.15 = 18.41

The critical F-value for 5 and 54 degrees of freedom at a 5% significance level is approximately 2.4.
Because 18.41 > 2.4, we reject the null hypothesis and conclude that at least one of the five independent
variables is significantly different from 0.
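Example 2 can be checked in Python; the critical value comes from the F-distribution (SciPy assumed, not
named in the notes):

from scipy import stats

tss, sse, n, k = 460.0, 170.0, 60, 5
rss = tss - sse                       # 290
msr = rss / k                         # 58
mse = sse / (n - k - 1)               # ~3.15
f_stat = msr / mse                    # ~18.4

f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)    # ~2.4 for (5, 54) df
print(round(f_stat, 2), round(f_crit, 2), f_stat > f_crit)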
1. R²: It is defined as the percentage of the total variation in the dependent variable explained by the
independent variable. R² reflects the correlation between predicted and actual values of the dependent
variable. For example, an R² of 0.63 indicates that variation in the independent variable explains 63% of
the variation in the dependent variable.

R² = (TSS - SSE) / TSS = RSS / TSS = 1 - SSE / TSS        Higher is better; as k increases, R² increases.

Regression output often includes multiple R (correlation coefficient), which is the correlation between
actual values of Y and forecasted values of Y. (Multiple R is the square root of R².)

For simple linear regression (i.e. one independent variable), the coefficient of determination R² may be
computed by simply squaring the correlation coefficient r, i.e. R² = r². This approach is not appropriate
when more than one independent variable is used in the regression.

For multiple regression, R² by itself may not be reliable. This is because R² almost always increases as
variables are added to the model, even if the marginal contribution of the new variables is not
statistically significant (if we add variables to the model, the amount of unexplained variation will
decrease and RSS will increase if the new independent variable explains any of the previously unexplained
variation; such a reduction occurs when the new independent variable is even slightly correlated with the
dependent variable and is not a linear combination of the other independent variables in the regression).
Consequently, a relatively high R² may reflect the impact of a large set of independent variables rather
than how well the set explains the dependent variable. This problem is often referred to as overestimating
the regression.
2. Adjusted R² (R²a): Some financial analysts use an alternative measure of goodness of fit called the
adjusted R². To overcome the problem of overestimating the impact of additional variables on the
explanatory power of a regression model, many researchers recommend adjusting R² for the number of
independent variables k.

R²a = 1 - [ (n - 1) / (n - k - 1) ] x (1 - R²)

As k rises beyond some optimal number of variables k*, R²a declines; independent variables should not be
added beyond k*. When a new independent variable is added, R²a can decrease if adding that variable
results in only a small increase in R². When k ≥ 1, then R² ≥ R²a. In fact, R²a can be negative
(effectively treat its value as 0) if R² is low enough, although R² itself is always non-negative.
Furthermore, we must be aware that a high R²a doesn't necessarily indicate that the regression is well
specified in the sense of including the correct set of variables. One reason for caution is that a high
R²a may reflect peculiarities, i.e. quirks, of the dataset used to estimate the regression.
The standard error of estimate (SEE) and the coefficient of determination (R²) are two different measures
of the same concept: goodness of fit. The coefficient of variation is not directly part of the regression
model.
Example 3:
Part a: An analyst runs a regression of monthly value stock returns on five independent variables over 60
months. The TSS is 460 and the SSE is 170. Calculate the R² and the adjusted R².

Part b: Suppose the analyst now adds four more independent variables to the regression and the R²
increases to 65%. Identify which model the analyst should most likely prefer.

a. R² = (460 - 170) / 460 = 0.63 or 63%        R²a = 1 - [ (60 - 1) / (60 - 5 - 1) ] x (1 - 0.63) = 59.6%

b. R² = 65% (given)                            R²a = 1 - [ (60 - 1) / (60 - 9 - 1) ] x (1 - 0.65) = 58.7%

With nine independent variables, even though the R² has increased from 63% to 65%, the adjusted R² has
decreased from 59.6% to 58.7%. The analyst would prefer the first model because its adjusted R² is higher
and the model has five independent variables as opposed to nine.
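A small sketch of the R² and adjusted R² arithmetic in Example 3 (plain Python):

def r_squared(tss, sse):
    return 1 - sse / tss

def adjusted_r_squared(r2, n, k):
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

n = 60
r2_five = r_squared(460, 170)                      # 0.63
print(adjusted_r_squared(r2_five, n, k=5))         # ~0.596 with five variables
print(adjusted_r_squared(0.65, n, k=9))            # ~0.587 with nine variables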
Confidence interval for the slope coefficient:   b̂_1 ± (t_c × s_b̂_1)

Confidence interval for a predicted value:       Ŷ ± (t_c × s_f),  where s_f is the standard error of
forecast.

where,

s_f² = SEE² × [ 1 + 1/n + (X - X̄)² / ((n - 1) s_X²) ] ;   s_X² is the variance of the independent variable.

s_Y² = TSS / (n - 1) is the variance of the dependent variable.
Rule: H0 : b1 = 0 and Ha : b1 ≠ 0. If the confidence interval (CI) at the desired level of significance doesn't
include 0, the H0 is rejected and the coefficient is said to be statistically different from 0.
Example 4:
Coldplay forecasts the excess return on the S&P 500 for June 2017 to be 5% and the 95% CI for the
predicted value of the excess return on VIGRX for June 2017 to be 3.9% to 7.7% (b_0 = 0.0023 and
b_1 = 1.1163). The standard error of the forecast is closest to?

Ŷ = b̂_0 + b̂_1 X = 0.0023 + 1.1163 (0.05) = 0.0581

The half-width of the 95% CI is (7.7% - 3.9%) / 2 = 1.9%, so the standard error of the forecast is
s_f = 1.9% / t_c (roughly 0.95% to 0.97%, depending on the critical t-value for the unstated degrees of
freedom).
NOTES
1. The distinction between SSE and SEE: SSE is the sum of the squared residuals, while SEE is the standard
deviation of the residuals.
3. Predictions must be based on the parameters' estimated values (b̂_0 and b̂_1) in relation to the
hypothesized population values.

4. Cov_XY = Σ (X_i - X̄)(Y_i - Ȳ) / (n - 1). The covariance between two random variables is a statistical
measure of the degree to which the two variables move together; it captures the linear relationship
between the two variables.

5. r_XY = Cov_XY / (σ_X σ_Y). The correlation coefficient is a measure of the strength of the linear
relationship between two variables.
Multiple Regression is regression analysis with more than one independent variable. It is used to quantify the
influence of two or more independent variables on a dependent variable.
PREDICTED:  Ŷ_i = b̂_0 + b̂_1 X_1i + b̂_2 X_2i + ... + b̂_k X_ki
Assumptions for multiple regression are almost exactly the same as those for the single variable linear
regression model, except assumption 2 and 3. Here are some changes under multiple regression model:
Assumption 2: The independent variables (X_1, X_2, ..., X_k) are not random. Also, no exact linear
relation exists between two or more of the independent variables. If this part of assumption 2 is
violated, the regression cannot be computed. We may also encounter problems if two or more of the
independent variables (or combinations thereof) are highly correlated; such a high correlation is known as
'Multicollinearity' (a violation).

Assumption 3: The expected value of the error term, conditional on the independent variables, is 0:
E(ε | X_1, X_2, ..., X_k) = 0.
Consider the following regression equation for explaining quarterly EPS in terms of the quarter of
their occurrence:
EPS_t = b_0 + b_1 Q_1t + b_2 Q_2t + b_3 Q_3t + ε_t        df = n - k - 1 = n - 4

where,
EPS_t : Quarterly observation of Earnings Per Share.
Q_1t : 1 if period t is the first quarter, 0 otherwise.
Q_2t : 1 if period t is the second quarter, 0 otherwise.
Q_3t : 1 if period t is the third quarter, 0 otherwise.
b_0 : Average value of EPS for the fourth quarter.
b_1, b_2, b_3 : Estimate the difference in EPS, on average, between the respective quarter (i.e. quarter
1, 2 or 3) and the omitted quarter (the fourth quarter in this case). Think of the omitted class as the
reference point.
Example 1:
Some developing nations are hesitant to open their equity markets to foreign investments because they fear
that rapid inflows and outflows of foreign funds will increase volatility. You want to test whether the volatility
of returns of stocks traded on the Bombay Stock Exchange (BSE) increased after July 1993, when foreign
institutional investors were first allowed to invest in India. Your dependent variable is a measure of return
volatility of stocks traded on the BSE; your independent variable is a dummy variable that is coded 1 if foreign
investment was allowed during the month and 0 otherwise.
a. State the null and alternative hypotheses for the slope coefficient of the dummy variable that are
consistent with testing your stated belief about the effect of opening the equity markets on stock return
volatility.
   H_0 : b_1 ≥ 0     H_a : b_1 < 0

b. Determine whether you can reject the null hypothesis at the 5% significance level in a one-sided test
of significance.
   df = n - k - 1 = 95 - 1 - 1 = 93; the t-statistic of -2.7604 is below the one-tailed critical value of
   -1.661, so we reject H_0. Because the dummy variable takes on a value of 1 when foreign investment is
   allowed, we can conclude that volatility was lower with foreign investment.

c. According to the estimated regression equation, what is the level of return volatility before and after
the market-opening event?
   Before the market opening the predicted level of return volatility is b̂_0; after the opening it is
   b̂_0 + b̂_1.
Probit link: Y = Φ⁻¹(P)                  Logit link: Y = ln[ P / (1 - P) ]

The Probit Model (probit regression), which is based on the normal distribution, estimates the probability
that Y = 1 (a condition is fulfilled) given the values of the independent variables. The Logit Model
(logistic regression) is based on the logistic distribution; the transformed variable is the log odds
ratio. Logistic regression is widely used in machine learning, where the objective is classification.
Logistic regression assumes a logistic distribution for the error term; this distribution is similar in
shape to the normal distribution but has heavier tails.
In most cases it makes no difference which one is used (probit or logit). Both functions increase
relatively quickly at x = 0 and relatively slowly at extreme values of x. Both functions lie between 0 and 1.
In econometrics, probit and logit models are traditionally viewed as models suitable when the
dependent variable is not fully observed.
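A small sketch (Python, SciPy assumed) comparing the two link functions above, the probit transform
Φ⁻¹(P) and the logit transform ln[P / (1 - P)], together with their inverses, which both map any real
number into the interval (0, 1):

import numpy as np
from scipy.stats import norm

p = np.array([0.1, 0.5, 0.9])

probit_link = norm.ppf(p)             # Phi^-1(P)
logit_link = np.log(p / (1 - p))      # ln[P / (1 - P)]

z = np.linspace(-3, 3, 7)
probit_prob = norm.cdf(z)             # inverse probit: normal CDF
logit_prob = 1 / (1 + np.exp(-z))     # inverse logit: logistic function

print(probit_link, logit_link)
print(probit_prob.round(3))
print(logit_prob.round(3))            # similar shape to the normal CDF, but with heavier tails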
Examine the individual coefficients using t-tests, determine the validity of the model with the F-test and
the R², and look out for heteroskedasticity, serial correlation and multicollinearity.

Heteroskedasticity occurs when the variance of the residuals is not the same across all observations in
the sample. This happens when there are sub-samples that are more spread out than the rest of the sample.
Under homoskedasticity, Var(ε_i) = σ² and there is no relationship between the value of the independent
variable and the regression residuals; under heteroskedasticity, the regression residuals on average grow
larger as the size of the independent variable increases.

Serial Correlation, a.k.a. Autocorrelation, refers to the situation in which the residual terms or
regression errors are correlated with one another. It is a relatively common problem with time series
data. Any effect of serial correlation appears only in the regression coefficient standard errors.
However, if one of the independent variables is a lagged value of the dependent variable, then serial
correlation in the error term will cause all the parameter estimates from linear regression to be
inconsistent, and they will not be valid estimates of the true parameters.

When one of the independent variables is an exact linear combination of other independent variables, it
becomes mechanically impossible to estimate the regression. That case, known as 'Perfect Collinearity', is
much less of a practical concern than multicollinearity. Multicollinearity occurs when two or more
independent variables (or combinations of independent variables) are highly (but not perfectly) correlated
with each other. Multicollinearity is a serious practical concern because approximate linear relationships
among financial variables are common.
Consequences for inference: heteroskedasticity and serial correlation make the coefficient standard errors
unreliable, so the t-test, t = b̂_i / s_b̂_i, is unreliable.
(a) If the standard error is overestimated, the t-statistic is too low and we may fail to reject a false
null hypothesis (Type II error).
(b) If the standard error is underestimated, the t-statistic is too high and we may reject a true null
hypothesis (Type I error). Positive serial correlation typically leads to underestimated standard errors
and inflated t-statistics.

Detecting conditional heteroskedasticity (Breusch-Pagan test):
BP chi-square statistic = n × R²_resid,  df = k  (one-tailed test),
where R²_resid is from a second regression of the squared residuals from the first regression on the
independent variables.
H_0 : No conditional heteroskedasticity        H_1 : Conditional heteroskedasticity
Conditional heteroskedasticity is only a problem if R²_resid and the BP test statistic are too large.

Detecting serial correlation (Durbin-Watson test):
The squared residuals provide the estimate of the constant error variance. If, in addition, the residuals
are not serially correlated, then we expect Cov(ε̂_t, ε̂_t-1) = 0, in which case DW is approximately equal
to (σ̂_ε² - 0 + σ̂_ε²) / σ̂_ε² = 2. Therefore, we can test the null hypothesis that the errors are not
serially correlated by testing whether the DW statistic differs significantly from 2. For a large sample,
DW ≈ 2 (1 - r), where r is the correlation coefficient between residuals from one period and those from
the previous period.
H_0 : No positive serial correlation        H_1 : Positive serial correlation
Decision rule: reject H_0 (conclude positive serial correlation) if DW < d_L (lower critical value); the
test is inconclusive if d_L ≤ DW ≤ d_U; fail to reject H_0 if DW > d_U (upper critical value).
No autocorrelation (r = 0): DW ≈ 2 (1 - 0) = 2. Positive serial correlation (r > 0): DW < 2; in the
extreme case r = 1, DW ≈ 2 (1 - 1) = 0.

Detecting multicollinearity:
The classic symptom is a significant F-statistic (and a high R²) even though the individual t-tests
indicate that none of the coefficients is significantly different from 0. The only way this can happen is
when the independent variables are highly correlated with each other. If the absolute value of the sample
correlation between any two independent variables in the regression is greater than 0.7,
multicollinearity is a potential problem. However, this rule only works if there are exactly two
independent variables. If there are more than two independent variables, then even though individual
variables may not be highly correlated, linear combinations might lead to multicollinearity (producing
conflicting t-test and F-test statistics). High pairwise correlations among the independent variables are
not a necessary condition for multicollinearity, and low pairwise correlations don't mean that
multicollinearity is not a problem.
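A sketch (NumPy, hypothetical residuals) of the two diagnostics above: the Breusch-Pagan statistic
n × R² from a regression of squared residuals on the regressors, and the Durbin-Watson statistic with its
2(1 - r) approximation:

import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 2
X = rng.normal(size=(n, k))
resid = rng.normal(size=n)              # residuals from the original regression (hypothetical)

# Breusch-Pagan: regress squared residuals on the independent variables, take that regression's R-squared
Xc = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(Xc, resid ** 2, rcond=None)[0]
fitted = Xc @ beta
r2_resid = 1 - np.sum((resid ** 2 - fitted) ** 2) / np.sum((resid ** 2 - np.mean(resid ** 2)) ** 2)
bp_stat = n * r2_resid                  # compare with the chi-square critical value, df = k

# Durbin-Watson: exact statistic and the large-sample approximation 2(1 - r)
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
r = np.corrcoef(resid[1:], resid[:-1])[0, 1]
print(round(bp_stat, 2), round(dw, 2), round(2 * (1 - r), 2))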
Example 2:
A variable is regressed against three other variables x, y and z. Which of the following would not be an
indication of multicollinearity? X is closely related to:
A. 3y + 2z
B. 9y - 4z + 3
✓ C. y²  (multicollinearity arises from approximate linear relationships among the independent variables;
a relationship with y² is non-linear)
There are three broad categories of 'Model Misspecification', or ways in which the regression model can be
specified incorrectly, each with several subcategories. For example, if a relevant variable is omitted and
the true model is Y = b_0 + b_1 X_1 + b_2 X_2 + ε, but we estimate Y = a_0 + a_1 X_1 + ε, then in general
b_0 ≠ a_0.
- Variables should be transformed: Sometimes analysts fail to account for curvature or non-linearity in
the relationship between the dependent variable and one or more of the independent variables, instead
specifying a linear relation among variables. We should also consider whether economic theory suggests a
non-linear relation. We may be able to correct the misspecification by taking the natural logarithm (ln)
of the variable we want to represent as a proportional change.
- Data is improperly pooled: Suppose the relationship between returns and the independent variables during
the first three-year period is actually different from the relationship in the second three-year period,
i.e. the regression coefficients are different from one period to the next. By pooling the data and
estimating one regression over the entire period, rather than estimating two separate regressions over
each of the subperiods, we have misspecified the model, and the predictions of portfolio returns will be
misleading.
NOTES
1. If expected value of the sample mean is equal to the population mean, the sample mean is therefore an
unbiased estimator of the population mean. A consistent estimator is one for which the accuracy of the
parameter estimate increases as the sample size increases.
2. Predictions in multiple regression model are subject to both parameter estimate uncertainty and
regression model uncertainty.
3. Y = 5 + 4.2 (Beta) - 0.05 (Alpha) + ε. One unit increase in Beta risk is associated with a 4.2% increase in
return, while a $1 bn. increase in Alpha implies a 0.05% decrease in return.
Time Series is a set of observations on a variable's outcomes in different time periods e.g. quarterly sales for
a particular company during the past 5 years.
(Figure: raw data Y_t fitted with a linear trend model, and transformed data ln(Y_t) fitted with a
log-linear trend model, each plotted against time.)
A Linear Trend is a time series pattern that can be graphed using a straight line. A downward sloping line
indicates a negative trend, while an upward sloping line indicates a positive trend. When the variable
increases over time by a constant amount, a linear trend model is most appropriate.

Y_t = b_0 + b_1 (t) + ε_t ,   t = 1, 2, ... T        (b_1 is the trend coefficient)
Ŷ_t = b̂_0 + b̂_1 (t)

A Log-Linear Trend works well in fitting time series that have exponential growth. Positive exponential
growth means that the time series tends to increase at some constant rate of growth, i.e. the observations
will form a convex curve. Negative exponential growth means that the data tends to decrease at some
constant rate of decay, i.e. the plotted time series will form a concave curve.

Y_t = e^(b_0 + b_1 (t) + ε_t) ,   t = 1, 2, ... T
ln(Y_t) = b_0 + b_1 (t) + ε_t
Ŷ_t = e^(b̂_0 + b̂_1 (t))
If on the other hand, the data plots with a non-linear curved shape, then the residuals from a Linear Trend
Model will be persistently positive or negative for a period of time. In this case, the Log-Linear Trend Model
may be more suitable. In other words, when the residuals from a Linear Trend Model are serially correlated, a
Log-Linear Trend Model may be more appropriate. However, it may be the case that even a Log-Linear Trend
Model is not appropriate in the presence of serial correlation. In this case, we will want to turn to an
Autoregressive Model.
Stationary: A time series is covariance stationary if its mean, variance and covariances with lagged and
leading values don't change over time. E.g. strictly stationary, second-order stationary (weakly
stationary), trend-stationary and difference-stationary models.

Non-Stationary: Data points are often non-stationary, i.e. have means, variances and covariances that
change over time. E.g. trends, cycles, random walks, or combinations of the three.

In order to obtain consistent, reliable results, non-stationary data needs to be transformed into
stationary data.
1. Constant and Finite Expected Value: The expected value of the time series is constant over time:
E(y_t) = μ and |μ| < ∞, t = 1, 2, 3, ... T. We refer to this value as the 'Mean Reverting Level'. All
covariance stationary AR(1) time series, x_t = b_0 + b_1 x_t-1, have a finite mean reverting level,
b_0 / (1 - b_1), when the absolute value of the lag coefficient is less than 1, i.e. |b_1| < 1.
(a) If the current value is above the mean reverting level, x_t > b_0 / (1 - b_1), the time series is
    expected to decline: x_t > x_t+1.
(b) If the current value is below the mean reverting level, x_t < b_0 / (1 - b_1), the time series is
    expected to rise: x_t < x_t+1.
(c) If the current value equals the mean reverting level, x_t = b_0 / (1 - b_1), the next value of the
    time series is expected to equal its current value: x_t = x_t+1.
2. Constant and Finite Variance: The time series' volatility around its mean doesn't change over time.
3. Constant and Finite Covariance between values at any given lag: The covariance of the time series with
itself for a fixed no. of periods in the past or future must be constant and finite in all periods.
Both 2 and 3: The covariance of a random variable with itself is its variance
Covariance (yt , y t ) = Var (y t )
Stationary in the past doesn't guarantee stationary in the future. There is always the possibility that a well
specified model will fail with the state of change in time. Models estimated with shorter time series are
usually more stable than those with longer time series because a longer sample period increases the chance
that the underlying economic process has changed. Thus, there is a trade off between the increased
statistical reliability when using longer time periods and the increased stability of the estimates when using
shorter periods.
This implies that multi-period forecasts are more uncertain than single period forecasts.
Step 2: Calculate the autocorrelations of the model's residuals i.e. the level of correlation between
the forecast errors from one period to the next.
The k-th autocorrelation is ρ_k = Cov(x_t, x_t-k) / σ_x² = E[(x_t - μ)(x_t-k - μ)] / σ_x², where E stands
for the expected value. Note that Cov(x_t, x_t-k) ≤ σ_x², with equality holding when k = 0; this means
that |ρ_k| ≤ 1. The k-th autocorrelation is estimated from the sample as

ρ̂_k = Σ_{t=k+1}^{T} (x_t - x̄)(x_t-k - x̄)  /  Σ_{t=1}^{T} (x_t - x̄)²

Step 3: Test whether the autocorrelations are significantly different from 0. If the model is correctly
specified, none of the autocorrelations will be statistically significant. To test for significance, a
t-test is used to test the hypothesis that the correlations of the residuals are 0:

t = ρ̂_(ε_t, ε_t-k) / (1 / √n)        df = n - k - 1
    (the estimated autocorrelation divided by its standard error, 1 / √n)

H_0 : autocorrelation = 0        H_a : autocorrelation ≠ 0
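A sketch (NumPy, hypothetical residuals) of Steps 2 and 3: estimate the k-th residual autocorrelation and
compare ρ̂_k / (1/√T) against a critical t-value:

import numpy as np

def residual_autocorrelation_t(resid, k):
    """k-th autocorrelation of the residuals and its t-statistic (standard error = 1/sqrt(T))."""
    T = len(resid)
    r = resid - resid.mean()
    rho_k = np.sum(r[k:] * r[:-k]) / np.sum(r ** 2)
    return rho_k, rho_k / (1 / np.sqrt(T))

rng = np.random.default_rng(1)
resid = rng.normal(size=120)             # residuals from a fitted AR model (hypothetical)
for lag in (1, 2, 4):
    rho, t_stat = residual_autocorrelation_t(resid, lag)
    print(lag, round(rho, 3), round(t_stat, 2))   # |t| above ~1.98 would flag misspecification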
Dickey-Fuller test for a unit root: subtract x_t-1 from both sides of the AR(1) model,
x_t - x_t-1 = b_0 + b_1 x_t-1 - x_t-1 + ε_t
x_t - x_t-1 = b_0 + (b_1 - 1) x_t-1 + ε_t
Rather than directly testing whether the original coefficient b_1 is different from 1, they test whether
the new transformed coefficient (b_1 - 1) is different from 0 using a modified t-test. In their actual
test, Dickey and Fuller use the variable g = b_1 - 1.
ARCH(1): regress the squared residuals on their own lagged values,
ε̂_t² = a_0 + a_1 ε̂_t-1² + μ_t        (μ_t is the error term)
The ARCH(1) forecast of the next period's variance is  σ̂_t+1² = â_0 + â_1 ε̂_t².
H_0 : a_1 = 0 (no ARCH errors)
H_1 : a_1 ≠ 0 ; the variance of the errors increases (or decreases) with the size of the previous squared
error, i.e. the error terms exhibit conditional heteroskedasticity.
If a time series model has been determined to contain ARCH errors, regression procedures that correct
for heteroskedasticity such as Generalized Least Squares (GLS) must be used in order to develop a
predictive model. Otherwise, the standard error of the model's coefficients will be incorrect, leading to
invalid conclusions. Engle and other researchers have suggested many generalizations of the ARCH (1)
model which include ARCH (p) and generalized autoregressive conditional heteroskedasticity (GARCH)
models. GARCH models are similar to ARMA models of the error variance in a time series. Just like
ARMA models, GARCH models can be finicky and unstable.
In-Sample Forecasts are made within the range of data (time period) used to estimate the model; in-sample
forecast errors are (Y_t - Ŷ_t).
Example 1:
To qualify as a covariance stationary process, which of the following doesn't have to be true?
A. Covariance (x_t, x_t-2) = Covariance (x_t, x_t+2)
B. E(x_t) = E(x_t+1)
✓ C. Covariance (x_t, x_t-1) = Covariance (x_t, x_t-2)
The covariance between any two observations an equal distance apart must be equal, e.g. the covariance of
t with t-2 equals the covariance of t with t+2; covariances at different lags (lag 1 vs. lag 2) need not
be equal.
Example 2:
Suppose the following model describes changes in the unemployment rate:
ΔUER_t = -0.0405 - 0.4674 ΔUER_t-1
The current change (first difference) in the unemployment rate is 0.03. Assume that the mean reverting
level for changes in the unemployment rate is -0.0276.
a. What is the best prediction for the next change?
   ΔUER_t+1 = -0.0405 - 0.4674 (0.03) = -0.0545
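A minimal check of Example 2 in Python: the one-step forecast and the mean reverting level b_0 / (1 - b_1):

b0, b1 = -0.0405, -0.4674
current_change = 0.03

forecast_next = b0 + b1 * current_change      # -0.0545
mean_reverting_level = b0 / (1 - b1)          # -0.0405 / 1.4674 = -0.0276

print(round(forecast_next, 4), round(mean_reverting_level, 4))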
Example 3:
The table below gives actual sales, the log of sales and changes in the log of sales of Cisco Systems for
the period 1Q: 2001 to 4Q: 2001. (Table columns: Quarters / Yr., Actual Sales, Log of Sales, Actual Values
of Δln(Sales_t), Forecast Values of Δln(Sales_t); the data are omitted here.)

Forecast the first and second quarter sales of Cisco Systems for 2002 using the regression
Δln(Sales_t) = 0.0661 + 0.4698 Δln(Sales_t-1)

Step 1: Calculate forecast values of Δln(Sales_t) with the help of the above regression.
1Q 2002: Δln(Sales_1Q 2002) = 0.0661 + 0.4698 (-0.0954) = 0.0213
2Q 2002: Δln(Sales_2Q 2002) = 0.0661 + 0.4698 (0.0213) = 0.0761
Example 4:
The table below gives the actual change in the log of sales of Cisco Systems from 1Q: 2001 to 4Q: 2001,
along with the forecasts from the regression model Δln(Sales_t) = 0.0661 + 0.4698 Δln(Sales_t-1) estimated
using data from 3Q: 1991 to 4Q: 2000. Note that the observations after the fourth quarter of 2000 are
out-of-sample.
b. Compare the forecasting performance of the model given with that of another model having an
out-of-sample RMSE of 20%.
   The model with the RMSE of 20% has greater forecast accuracy than the model in Part a, which has an
   RMSE of 27%.
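A small helper (Python) for the out-of-sample RMSE comparison in Example 4, using hypothetical actual and
forecast values (the table values themselves are omitted above):

import numpy as np

def rmse(actual, forecast):
    """Root mean squared error of out-of-sample forecasts."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.sqrt(np.mean((actual - forecast) ** 2))

actual = [0.05, -0.10, 0.30, 0.25]      # hypothetical actual changes in log sales
forecast = [0.02, 0.08, 0.02, 0.21]     # hypothetical model forecasts
print(rmse(actual, forecast))           # the model with the lower RMSE has better forecast accuracy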
Example 5:
Based on the regression output below, the forecasted value of quarterly sales for March 2016 for PoweredUP
is closest to?
Example 6:
David Brice, CFA has used an AR(1) model to forecast the next period's interest rate to be 0.08. The AR(1)
model has a positive slope coefficient. If the interest rate is a mean reverting process with an
unconditional mean, a.k.a. mean reverting level, equal to 0.09, then which of the following could be his
forecast for two periods ahead?
✓ A. 0.081
B. 0.072
C. 0.113
As Brice makes more distant forecasts, each forecast will be closer to the unconditional mean, so the
two-period forecast must lie between 0.08 and 0.09; therefore 0.081 is the only possible outcome.
(Unit Root)
A random walk (without drift) has no finite mean reverting level because b_0 / (1 - b_1) = 0 / (1 - 1) =
0 / 0. A random walk with drift likewise has an undefined mean reverting level: b_0 / (1 - b_1) = b_0 / 0.
For a time series that is not covariance stationary, the least squares regression procedure that we have
been using to estimate an AR(1) model, y_t = b_0 + b_1 y_t-1 + ε_t, will not work without transforming the
data:
b_1 < 1 : Stationary
b_1 = 1 : Non-stationary, 'Unit Root'
b_1 > 1 : 'Explosive Root'
After first differencing the random walk, y_t = x_t - x_t-1 = ε_t, the transformed time series has a
finite mean reverting level of b_0 / (1 - b_1) = 0 / (1 - 0) = 0 and is therefore covariance stationary.
lllll Seasonality
Seasonality in a time series is a pattern that tends to repeat from year to year. One example is monthly
sales data for a retailer. Given that sales data normally vary accordingly to the time of year, we might
expect this month's sales (x t ) to be related to sales for the same month last year (x t-12 ). To adjust for
seasonality in an AR model, an additional lag of the dependent variable (corresponding to the same
period in the previous year) is added to the original model as another independent variable. For
example, if quarterly data are used, the seasonal lag is 4; if monthly data is used, the seasonal lag is 12.
Suppose for example, we model a particular quarterly time series using an AR(1) model,
x_t = b_0 + b_1 x_t-1 + ε_t. If the time series had significant seasonality, this model would not be
correctly specified. The seasonality would be easy to detect because the seasonal autocorrelation (in the
case of quarterly data, the 4th autocorrelation) of the error term would differ significantly from 0.
Suppose this quarterly model has significant seasonality. In this case, we would include a seasonal lag in
the autoregressive model and estimate x_t = b_0 + b_1 x_t-1 + b_2 x_t-4 + ε_t, to test whether including
the seasonal lag eliminates statistically significant autocorrelation in the error term.
Example 7:
Suppose we want to compute the four-quarter moving average of AstraZeneca's sales as of the beginning of
the first quarter of 2012. AstraZeneca's sales in the previous four quarters were 1Q 2011: $ 8490 m, 2Q 2011:
$8601 m, 3Q 2011: $ 8405 m and 4Q 2011: $ 8872 m.
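The arithmetic of Example 7 as a short Python sketch: a four-quarter moving average is simply the mean of
the last four quarterly observations:

sales = {"1Q 2011": 8490, "2Q 2011": 8601, "3Q 2011": 8405, "4Q 2011": 8872}   # $ millions

four_q_moving_average = sum(sales.values()) / len(sales)
print(four_q_moving_average)   # (8490 + 8601 + 8405 + 8872) / 4 = 8592.0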
MA(1): x_t = ε_t + θ ε_t-1, with E(ε_t) = 0, E(ε_t²) = σ² and Cov(ε_t, ε_s) = E(ε_t ε_s) = 0 for t ≠ s.
The moving average model places different weights on the two terms (1 on ε_t and θ on ε_t-1). Because the
expected value of x_t is 0 in all periods and ε_t is uncorrelated with its own past values, the first
autocorrelation is not equal to 0, but the second and higher autocorrelations are equal to 0. Further
analysis shows that all autocorrelations except for the first will be equal to 0 in an MA(1) model. Thus,
for an MA(1) process, any value x_t is correlated with x_t-1 and x_t+1 but with no other time series
values; we could say that an MA(1) model has a memory of one period.

For an MA(q) model, the first q autocorrelations will be significantly different from 0, and all
autocorrelations beyond that will be equal to 0; an MA(q) model has a memory of q periods.

MA(0): x_t = μ + ε_t, with E(ε_t) = 0, E(ε_t²) = σ² and Cov(ε_t, ε_s) = 0 for t ≠ s. An MA(0) time series
is one in which we allow the mean to be non-zero; it is not predictable.
lllll Cointegration
Occasionally an analyst will run a regression using two time series i.e. time series utilizing two different
variables. For example, using the market model to estimate the equity beta for a stock, an analyst
regresses a time series of the stock's returns (y t ) on a time series of returns for the market (x t ).
Cointegration means that two time series are economically linked (related to the same macro variable)
or follow the same trend and that relationship is not expected to change.
y_t = b_0 + b_1 x_t + ε_t
where,
y_t : Value of time series y at time t
x_t : Value of time series x at time t
Machine Learning (ML) refers to computer programs that learn from their errors and refine predictive models
to improve their predictive accuracy over time. Machine Learning is one method used to extract useful
information from Big Data. An elementary way to think of ML algorithms is to 'find the pattern, apply the
pattern'. ML techniques are better able than statistical approaches to handle problems with many variables
i.e. high dimensionality or with a high degree of non-linearity. ML algorithms are particularly good at
detecting change, even in high non-linear systems because they can detect the preconditions of a model's
break or anticipate the probability of a regime switch. ML is broadly divided into 3 distinct classes of
techniques: (a) Supervised Learning, (b) Unsupervised Learning and (c) Deep Learning. Machine Learning is
the science of making computers learn and act like humans by feeding data and information without being
explicitly programmed.
The data set is typically divided into three non-overlapping samples:
(i) Training Sample: used to train the model. (In-sample data)
(ii) Validation Sample: used for validating and tuning the model. (Out-of-sample data)
(iii) Test Sample: used for testing the model's ability to predict well on new data. (Out-of-sample data)
To be valid and useful, any supervised machine learning model must generalize well beyond the training data.
The model should retain its explanatory power when tested out-of-sample.
(Figure: three scatter plots of circles and triangles with a fitted classification boundary, illustrating
an underfit model, an overfit model and a good/robust fit.)
Underfitting is similar to making a baggy suit that fits no one. It means that the model doesn't capture
the relationships in the data. The graph shows four errors in the underfit model (3 misclassified circles
and 1 misclassified triangle).

Think of Overfitting as tailoring a custom suit that fits only one person. An ML algorithm that fits the
training data too well will typically not predict well using new data. The model begins to incorporate
noise coming from quirks or spurious correlations; it mistakes randomness for patterns and relationships.
The algorithm may have memorized the data rather than learned from it, so it has perfect hindsight but no
foresight. The graph shows no errors in the overfit model. As models become more complex, overfitting risk
increases.

Robust fitting, the desired result, is similar to fashioning a universal suit that fits all similar
people. A good fit or robust model fits the training (in-sample) data well and generalizes well to
out-of-sample data, both within acceptable degrees of error. The graph shows that the good-fitting model
has only 1 error, the misclassified circle.
∴ The evaluation of any ML algorithm thus focuses on its prediction error on new data rather than on its
goodness of fit on the data in which the algorithm was fitted i.e. trained.
A Learning Curve plots the accuracy rate = 1 - Error Rate, in the validation or test sample i.e. out-of-sample
against the amount of data in the training sample i.e. in sample, so it is useful for describing under and
overfitting as a function of bias and variance errors. Low or no in-sample error but large out-of-sample error
are indicative of poor generalization. Data scientists decompose the total out-of-sample error into 3 sources:
(i) Bias Error, (ii) Variance Error and (iii) Base Error.
Bias Error: the degree to which a model fits the training data. Algorithms with erroneous assumptions
produce high bias with poor approximation, causing underfitting and high in-sample error. (Adding more
training samples will not improve the model.)

Variance Error: how much the model's results change in response to new data from validation and test
samples. Unstable models pick up noise and produce high variance, causing overfitting and high
out-of-sample error.

Base Error: due to randomness in the data. (Out-of-sample accuracy increases as the training sample size
increases.)
Fig 3: Fitting Curve, showing the trade-off between bias and variance errors and model complexity. As
complexity increases in the training set, the in-sample error (E-in) and the bias error shrink. As
complexity increases in the test set, the out-of-sample error (E-out) and the variance error rise.
Out-of-sample error rates are also a function of model complexity.
1. Occam's Razor: The problem-solving principle that the simplest solution tends to be the correct one. In
supervised ML, it means preventing the algorithm from getting too complex during selection and training by
limiting the no. of features, and penalizing algorithms that are too complex or too flexible by
constraining them to include only parameters that reduce out-of-sample error.

2. K-Fold Cross Validation: This strategy comes from the principle of avoiding sampling bias. The
challenge is having a large enough data set to make both training and testing possible on representative
samples. For example, imagine a data set collected about people with or without heart disease; in ML
lingo, we use the data to (a) train the ML method and (b) test the ML method. Reusing the same data for
both training and testing is a bad idea. In k-fold cross validation the data are instead divided into k
sub-samples; the model is trained on k - 1 of them and validated on the remaining one, and the process is
repeated k times so that each sub-sample is used once for validation.
Penalized Regression: a penalized regression model seeks to minimize the SSE plus a penalty term that
increases in size with the no. of included features. The greater the no. of included features (i.e.
variables with non-zero coefficients), the larger the penalty term. Therefore, penalized regression
ensures that a feature is included only if the SSE declines by more than the penalty term increases. All
types of penalized regression involve a trade-off of this type. Imposing such a penalty can exclude
features that are not meaningfully contributing to out-of-sample prediction accuracy, i.e. it makes the
model more parsimonious. Therefore, only the more important features for explaining Y will remain in the
penalized regression model.

LASSO penalized regression minimizes:  Σ_{i=1}^{n} (Y_i - Ŷ_i)²  +  λ Σ_{k=1}^{K} |b̂_k|
                                          [ SSE (OLS) ]               [ Penalty Term ]

λ (lambda), a.k.a. a hyperparameter, determines how severe the penalty is (λ > 0). It also determines the
balance between fitting the model vs. keeping the model parsimonious.
In one popular type of penalized regression, LASSO (Least Absolute Shrinkage and Selection
Operator) aims to minimize the SSE and the sum of the absolute value of the regression
coefficients. When using LASSO or other penalized regression techniques, the penalty term is added
only during the model building process and not once the model has been built.
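A minimal sketch of LASSO using scikit-learn (a library choice assumed here, not named in the notes): the
alpha argument plays the role of λ, and features whose coefficients shrink exactly to zero are effectively
dropped from the model:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                                     # 10 candidate features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)    # only 2 features truly matter

lasso = Lasso(alpha=0.1)                         # alpha ~ lambda: severity of the penalty
lasso.fit(X, y)

kept = np.flatnonzero(lasso.coef_)               # features with non-zero coefficients
print(kept, np.round(lasso.coef_[kept], 2))      # typically only the first two features survive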
The shortest distance between the observations and the threshold is called the 'margin'. When we use the
threshold that gives us the largest margin to make the classification, we call that a 'Maximal Margin
Classifier' (MMC). MMCs are super sensitive to outliers in the training data, which makes them pretty
lame! The observations that determine the margin are called support vectors. Some observations may fall on
the wrong side of the boundary and be misclassified by the SVM algorithm. Choosing a threshold that allows
misclassification is an example of the 'bias/variance trade-off' that plagues all of ML. When we allow
misclassification, we use 'Soft Margin Classification', which adds a penalty to the objective function for
observations in the training data that are misclassified. In essence, the SVM algorithm will choose a
discriminant boundary that optimizes the trade-off between a wider margin and a lower total error penalty.
As an alternative to soft margin classification, a non-linear SVM algorithm can be run by introducing more
advanced, non-linear separation boundaries. These algorithms will reduce the no. of misclassified
instances in the training datasets but will have more features, thus adding to the model's complexity.
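A sketch of soft margin classification with scikit-learn (assumed library): the C parameter controls the
penalty for misclassified training observations, i.e. the trade-off between a wider margin and a lower
total error penalty; a non-linear boundary could be obtained by swapping the kernel:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)          # two overlapping classes (hypothetical data)

svm = SVC(kernel="linear", C=1.0)          # smaller C -> softer margin, more tolerated misclassification
svm.fit(X, y)

print(len(svm.support_))                   # number of support vectors defining the margin
print(svm.predict([[0.5, 0.5]]))           # classify a new observation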
(Figure: a diamond observation plotted among crosses and triangles; the left panel highlights its single
nearest neighbor (K = 1), the right panel its five nearest neighbors (K = 5).)
The diamond (observation) needs to be classified as belonging to either the cross or the triangle
category. If K = 1, the diamond will be classified into the same category as its nearest neighbor (i.e.
triangle in the left panel), whereas if K = 5, the algorithm will look at the diamond's 5 nearest
neighbors, which are 3 triangles and 2 crosses. The decision rule is to choose the classification with
the largest no. of nearest neighbors, out of 5 being considered. So, the diamond is again classified
as belonging to the triangle category.
KNN is a straightforward intuitive model that is still very powerful because it is non-parametric; the
model makes no assumption about the distribution of the data. Moreover, it can be used directly
for multi-class classification. A critical challenge of KNN however, is defining what it means to be
'similar' (or near). The choice of a correct distance measure may be even more subjective for ordinal
or categorical data. KNN results can be sensitive to inclusion of irrelevant or correlated features, so
it may be necessary to select features manually. If done correctly, this process should generate a
more representative distance measure. KNN algorithms tend to work better with a small no. of
features.
Finally, the number K, the hyperparameter of the model, must be chosen with the understanding that
different values of K can lead to different conclusions. If K is an even number, there may be ties and no
clear classification. Choosing a value for K that is too small would result in a high error rate
and sensitivity to local outliers, but choosing a value for K that is too large would dilute the concept
of nearest neighbors by averaging too many outcomes. For K, one must take into account the no. of
categories and their partitioning of the feature space.
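A sketch of the KNN classification described above using scikit-learn (assumed): with K = 5, the new
observation is assigned the class held by the majority of its five nearest neighbors:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labeled observations: 0 = cross, 1 = triangle
X = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 6], [6, 7], [7, 7], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=5)   # K is the hyperparameter; an odd K avoids ties
knn.fit(X, y)

new_point = np.array([[5, 5]])              # the 'diamond' to classify
print(knn.predict(new_point))               # majority vote of the 5 nearest neighbors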
Classification and Regression Trees (CART): each decision node represents a single feature (f) and a
cut-off value (c) for that feature (e.g. X1 > 10%). The CART algorithm chooses the feature and the cut-off
value at each node that generate the widest separation of the labeled data, so as to minimize regression
error (e.g. by a criterion such as MSE) or classification error (e.g. by a criterion such as the mode, or
most common class). Every successive split should result in a lower estimation error than the nodes that
preceded it. The tree stops growing when the error can no longer be reduced further, resulting in a
terminal node.
E.g: Classifying companies by whether or not they increase their dividends to shareholders:
(Figure: a decision tree and the corresponding partitioning of the feature space (X1, X2). The initial
root node splits on X1 (≤ 5% vs. > 5%); subsequent decision nodes split on X2 (> 10%, ≤ 20%, > 20%) and on
X1 (≤ 10% vs. > 10%), ending in terminal nodes. Companies that increase dividends (+) and those that do
not (-) occupy different regions of the feature space.)
If the goal is regression, then the prediction at each terminal node is the mean of the labeled values.
If the goal is classification, then the prediction at each terminal node is the mode (most common class)
of the labeled values. For example, the region of the feature space representing IG (X1 > 10%) and FCFG
(X2 > 20%) contains 5 crosses, so a new company with similar features will also belong to the cross
(dividend increase) category.
CART makes no assumptions about the characteristics of the training data, so if left unconstrained, it can
potentially learn the training data perfectly. To avoid such overfitting, regularization parameters can be
added, such as the maximum depth of the tree, the minimum population at a node or the maximum no. of
decision nodes. Alternatively, regularization can occur via a 'pruning' technique that can be used later
on to reduce the size of the tree, i.e. sections of the tree that provide little classifying power are
pruned or removed. By its iterative structure, CART is a powerful tool for building expert systems for
decision-making processes. It can induce robust rules despite noisy data and complex relationships among
a high no. of features.
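A sketch of a CART classifier with scikit-learn (assumed), including the regularization parameters
mentioned above (maximum depth, minimum population at a node):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 0.3, size=(300, 2))                    # features, e.g. IG (X1) and FCFG (X2)
y = ((X[:, 0] > 0.10) & (X[:, 1] > 0.20)).astype(int)     # label: 1 = dividend increase (toy rule)

tree = DecisionTreeClassifier(max_depth=3,                # regularization: cap the depth of the tree
                              min_samples_leaf=10)        # and the minimum population at a terminal node
tree.fit(X, y)

print(tree.predict([[0.12, 0.25]]))                       # mode of the labels in that terminal node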
Ensemble learning typically produces more accurate and more stable predictions than one best
model. Ensemble learning can be divided into two main categories:
Voting Classifiers
An ensemble method can be an aggregation of heterogeneous learners, i.e. different types of algorithms
combined together with a voting classifier. A majority-vote classifier will assign to a new data point the
predicted label with the most votes. The more individual models you have trained, the higher the accuracy
of the aggregated prediction, but only up to a point: there is an optimal no. of models beyond which
performance would be expected to deteriorate from overfitting.
Bootstrap Aggregating
An ensemble method can be an aggregation of homogeneous learners i.e. a combination of the
same algorithm, using different training data that are based on a bootstrap aggregating i.e. bagging
technique.
(Diagram: random sampling with replacement from the original training set; n = no. of training instances,
n' = no. of instances in each 'bag', m = no. of bags, numbered 1, 2, ..., m.)

Alternatively, one can use the same machine learning algorithm but with different training data. Bootstrap
aggregating, or bagging, is a technique whereby the original training dataset is used to generate m new
training datasets or bags of data. Each new bag of data is generated by random sampling with replacement
from the initial training set. The algorithm can now be trained on m independent datasets that will
generate m new models. Then, for each new observation, we can aggregate the m predictions using a
majority-vote classifier for classification or an average for regression. Bagging is a very useful
technique because it helps to improve the stability of predictions and protects against overfitting the
model.
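A sketch of bootstrap aggregation with scikit-learn (assumed): m bags drawn with replacement, one model
trained per bag (the default base learner is a decision tree), and a majority vote across the m models:

import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # hypothetical labels

bagging = BaggingClassifier(
    n_estimators=50,                              # m = number of bags / models
    max_samples=1.0,                              # n' = size of each bag (fraction of n)
    bootstrap=True,                               # sample with replacement
)
bagging.fit(X, y)
print(bagging.predict(X[:5]))                     # majority vote across the 50 models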
Random Forest: a random forest classifier is a collection of many decision trees trained via a bagging
method, where a random subset of features is used in creating each tree, so each tree is slightly
different from the others. The process of using multiple classification trees uses crowdsourcing (majority
wins) in determining the final classification. Because each tree only uses a subset of features, random
forests can mitigate the problem of overfitting. Using random forests can increase the signal-to-noise
ratio because errors across different trees tend to cancel each other out.
Step 1: Create a bootstrapped dataset by randomly sampling rows, with replacement, from the original
dataset (some rows may appear more than once; in the illustration one sampled row is the same as the 3rd
row of the original data).

Step 2: Create a decision tree using the bootstrapped dataset, but only use a random subset of variables
(columns) at each step. For example, we might consider only X2 or X3 as candidates for the initial root
node (a random choice) and then focus on other variables, like X1 or X4, as candidates for subsequent
decision nodes.
Step 3: Now go back to Step 1 and repeat: make a new bootstrapped dataset and build a tree considering
only a subset of variables at each step. Ideally, you'd do this 100s of times. Using a bootstrapped sample
and considering only a subset of the variables at each step results in a wide variety of trees (say, 100
trees); this variety is what makes random forests more effective than individual decision trees.

Step 4: Now that we've created a random forest, how do we use it? First we get a new patient with
measurements (X1, X2, X3 and X4), and we want to know whether they have heart disease or not (Y). So we
take the data and run it down the first tree we made; the 1st tree says 'Yes'. Then we run the data down
the second tree; the 2nd tree also says 'Yes'. We repeat this for all 100 trees we made. After running the
data down all the trees in the random forest, we see which option received more votes. In this case, 'Yes'
received the most votes, so we conclude that this patient has heart disease.
However, an important drawback of random forest is that it lacks the ease of interpretability of
individual trees; as a result, it is considered a relatively 'black-box' type of algorithm, i.e. it is
difficult to understand the reasoning behind its outcomes and thus to foster trust in it.
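A sketch of the random forest procedure in scikit-learn (assumed): many trees, each trained on a
bootstrapped sample and restricted to a random subset of features at each split, with the final class
decided by majority vote:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                       # hypothetical patient measurements X1..X4
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)       # hypothetical label: heart disease yes/no

forest = RandomForestClassifier(
    n_estimators=100,          # number of bootstrapped trees ("do this 100s of times")
    max_features="sqrt",       # random subset of features considered at each split
    random_state=0,
)
forest.fit(X, y)

new_patient = [[0.4, -1.2, 0.8, 0.1]]
print(forest.predict(new_patient))                  # majority vote across the 100 trees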
Step 1: For example, the samples could be blood samples in a lab and the variables could be DNA
measurements (DNA 1 and DNA 2). Plotting the data shows that samples 1, 2 and 3 are similar to one another
and samples 4, 5 and 6 are similar to one another. We (i) calculate the average measurement for DNA 1
(x̄1), (ii) calculate the average measurement for DNA 2 (x̄2) and (iii) use these averages to locate the
center of the data (x*). We'll also see how PCA can tell us which DNA variable is the most valuable for
clustering the data.

Step 2: Now we shift the data so that the center (x*) sits on top of the origin (0,0) in the graph. Note
that shifting the data does not change how the data points are positioned relative to each other. We then
fit a candidate line, PC 1, through the origin; but how do we determine whether this is a good fit or not?
Step 3: How does PCA decide whether a fit is good or not? For the best-fit line through the origin, PCA
can either minimize 'b', the distance from each data point to the line, or maximize 'c', the distance from
each projected point to the origin. Note that, by the Pythagorean theorem, a² = b² + c², where 'a' is the
(fixed) distance from the data point to the origin, so minimizing 'b' is equivalent to maximizing 'c'.
Intuitively it makes sense to minimize 'b', the perpendicular distance between the data point and PC 1
(the 'projection error'), but it is actually easier to calculate 'c'. So PCA finds the best-fitting line
by maximizing the sum of the squared distances [SS(Distances)] from the projected points to the origin.
(Conclusions derived from either 'b' or 'c' will be the same.) With six data points, PCA measures six 'c'
distances:
d1² + d2² + d3² + d4² + d5² + d6² = Sum of Squared Distances, or SS(Distances).
The distance between each projected point and the origin along PC 1 represents the spread or variation of
the data along PC 1. We keep rotating the candidate line until we end up with the line having the largest
SS(Distances) between the projected points and the origin; PCA calls this line the best-fit line, PC 1,
and its SS(Distances) is the eigenvalue for PC 1.

Step 4: To make PC 1 of DNA 1 & 2, take 4 parts of DNA 1 and 1 part of DNA 2 (so DNA 1 is more important
than DNA 2 for this component). Using a² = b² + c², the length of this combination is √(4² + 1²) = 4.12.
The proportions of each DNA variable are called 'loading scores'. We can scale the length of PC 1 to 1 by
dividing by 4.12.
Step 5: The next largest portion of the remaining variance is best explained by PC 2, which is at right
angles to PC 1 and thus is uncorrelated with PC 1. Since this is a 2-D example, PC 2 is simply the line
through the origin that is perpendicular to PC 1, without any further optimization; we then calculate the
eigenvector and eigenvalue for PC 2.

Step 6: We can convert the eigenvalues, or SS(Distances), into variation around the origin (0,0) by
dividing by the sample size minus 1, i.e. (n - 1).

Step 7: Scree Plot: it shows the proportion of total variance in the data explained by each principal
component (PC).

The main drawback of PCA is that, since the PCs are combinations of the dataset's initial features, they
typically cannot be easily labeled or directly interpreted by the analyst. Compared to modelling data with
variables that represent well-defined concepts, the end user of PCA may perceive PCA as something of a
'black box'. Nevertheless, reducing the no. of features to the most relevant predictors is very useful.
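A sketch of PCA with scikit-learn (assumed), echoing the steps above: the data are centered, the
components (eigenvectors) and their explained variance (eigenvalues divided by n - 1) are extracted, and
scree-plot style proportions are printed:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
dna1 = rng.normal(size=60)
dna2 = 0.25 * dna1 + rng.normal(scale=0.2, size=60)   # hypothetical correlated measurements
X = np.column_stack([dna1, dna2])

pca = PCA(n_components=2)              # PCA centers the data internally (shifts the center to the origin)
pca.fit(X)

print(pca.components_[0])              # loading scores (eigenvector) of PC 1, scaled to length 1
print(pca.explained_variance_)         # eigenvalues converted to variances, i.e. SS(Distances) / (n - 1)
print(pca.explained_variance_ratio_)   # scree plot values: proportion of total variance per PC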
lllll Clustering
Given a dataset, Clustering is the process of grouping observations into categories based on
similarities in their attributes i.e. meaning that the observations inside each cluster are similar or
close to each other, a property known as 'cohesion' and the observations in two different clusters
are as far away from one another (as dissimilar) as possible, a property known as 'separation'.
K-Means Clustering
K-Means is a relatively old algorithm that repeatedly partitions observations into a fixed number 'K',
of non-overlapping clusters. The number of clusters 'K' is a model hyperparameter - a parameter
whose value must be set by the researcher before learning begins. Each cluster is characterized by
its 'Centroid' i.e. center and each observation is assigned by the algorithm to the cluster with the
centroid to which that observation is closest. The K-means algorithm follows an iterative process.
(Figure: six snapshots, S1 to S6, of the K-means iterations for K = 3, showing the three centroids C1, C2
and C3 and their assigned observations being updated until the clusters stabilize.)
The K-means algorithm will continue to iterate until no observation is reassigned to a new cluster (i.e. there is no need to recalculate new centroids). The algorithm has then converged and reveals the final K clusters with their member observations. The K-means algorithm has minimized intra-cluster distance (thereby maximizing cohesion) and has maximized inter-cluster distance (thereby maximizing separation) under the constraint that K = 3. The K-means algorithm is fast and works well on very large datasets with hundreds of millions of observations. However, the final assignment of observations to clusters can depend on the initial location of the centroids. One limitation of this technique is that the hyperparameter 'K', the no. of clusters in which to partition the data, must be decided before K-means can be run.
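A minimal sketch of K-means in Python with scikit-learn follows; the random two-feature data and the choice K = 3 are illustrative assumptions.

# A minimal sketch of K-means clustering with scikit-learn; the data and
# K = 3 are illustrative assumptions, not from the text's example.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))         # hypothetical observations: two standardized features

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)   # K is a hyperparameter
labels = kmeans.fit_predict(X)        # cluster assignment for each observation

print(kmeans.cluster_centers_)        # the three centroids
print(labels[:10])                    # cluster membership of the first 10 observations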
Hierarchical Clustering
Hierarchical Clustering is an iterative procedure used to build a hierarchy of clusters. In Hierarchical
Clustering, the algorithms create intermediate rounds of clusters of increasing (agglomerative) or
decreasing (divisive) size until the final clustering is reached. This process creates relationships
among the rounds of clusters. It has the advantage of allowing the investment analyst to examine
alternative segmentations of data of different granularity before deciding which one to use.
Example 1:
A category of general linear regression (GLS) models that focuses on reduction in the total no. of features
used is best described as?
A. Clustering Model
B. Dimension Reduction Model
✓ C. Penalized Regression Model
Example 2:
Step 1: We apply ML techniques to a model including fundamental and technical variables (features) to predict next quarter's return for each of the 100 stocks currently in our portfolio. Step 2: The 20 stocks with the lowest estimated return are then identified for replacement. The ML techniques appropriate for executing Step 1 are most likely to be based on:
✓ A. Regression Because target variable (quarterly return) is continuous.
B. Classification
C. Clustering
Dendrogram
A type of diagram for visualizing a hierarchical cluster analysis known as a Dendrogram, highlights
the hierarchical relationships among the clusters.
Refer Agglomerative Clustering: Clusters are represented by a horizontal line, the 'Arch', which connects two vertical lines called 'Dendrites'. The height of each arch represents the distance between the two clusters being considered; shorter dendrites represent a shorter distance (and greater similarity) between clusters. The horizontal dashed lines cutting across the dendrites show the no. of clusters into which the data are split at each stage.
[Figure: dendrogram for clusters A-K, with arch heights (distances) of roughly 0.01 to 0.07 on the vertical axis.]
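The following is a minimal Python sketch, assuming SciPy and matplotlib are available, of building an agglomerative hierarchy and drawing its dendrogram; the six observations and the labels A-F are hypothetical.

# A minimal sketch of agglomerative (hierarchical) clustering and its dendrogram
# using SciPy; the sample data and labels A-F are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                  # six hypothetical observations

Z = linkage(X, method='ward')                # build the hierarchy of clusters round by round
dendrogram(Z, labels=list('ABCDEF'))         # arch heights = distances between merged clusters
plt.ylabel('Distance')
plt.show()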
[Figure: a neural network with four input nodes (Input 1-4) connected by links to nodes (neurons) in a hidden layer, which feed the output node.]
Linking the information in the input layer to multiple nodes in the hidden layer, each with its own activation function, allows the neural network to model complex non-linear functions and to use the information in the input variables well. The nodes in the hidden layer transform the inputs in a non-linear fashion into new values that are then combined into the target value. This structure ('4, 5 and 1', i.e. the no. of input nodes, hidden-layer nodes and output nodes) is set by the researcher and is referred to as the hyperparameters of the neural network.
If the process of adjustment works forward through the layers of the network, this process is called 'Forward Propagation'; whereas if the process of adjustment works backward through the layers of the network, this process is called 'Backward Propagation'.
Learning Rate hyperparameter controls the rate or speed at which the model learns. Learning takes
place through this process of adjustment to network weights with the aim of reducing the total
error. When learning is complete, all the network weights have assigned values.
New Weight = Old Weight − (Learning Rate) × (Gradient), where the Gradient is the partial derivative (rate of change) of the total error with respect to the change in the old weight.
When the learning rate is too large, gradient descent can inadvertently increase rather than
decrease the training error. When the learning rate is too small, training is not only slower but may
become permanently stuck with a high training error.
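A minimal sketch of this weight-update rule in Python is shown below, fitting a single weight by gradient descent; the data, the starting weight and the learning rate are illustrative assumptions.

# A minimal sketch of the gradient-descent weight update described above,
# fitting a single weight w in y = w * x by minimizing squared error.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])           # roughly y = 2x

w = 0.0                                      # old weight (initial guess)
learning_rate = 0.01

for _ in range(200):
    error = w * x - y                        # prediction error
    gradient = 2 * np.mean(error * x)        # d(total error)/d(w)
    w = w - learning_rate * gradient         # New Weight = Old Weight - (Learning Rate)(Gradient)

print(round(w, 3))                           # converges to roughly 2.0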
E.g. ReLU function, f(x) = max(0, x): here y = β1 z1 + β2 z2 + β3 z3 + ε, where z1 = max(x1 + x2 + x3, 0), z2 = max(x2 + x4, 0), z3 = max(x1 + x3 + x4, 0) and ε is an error term.
[Figure: inputs x1-x4 connected by weighted links to hidden nodes z1-z3, which are combined into the output Y.]
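Below is a minimal numerical sketch of the ReLU example in Python; the input values and betas are illustrative assumptions.

# A minimal numerical sketch of the ReLU example above; the inputs,
# betas and error term are illustrative assumptions.
import numpy as np

def relu(v):
    return np.maximum(0.0, v)                # f(x) = max(0, x)

x1, x2, x3, x4 = 0.5, -1.2, 0.8, 0.3         # hypothetical inputs
b1, b2, b3 = 0.4, 0.7, -0.2                  # hypothetical betas
eps = 0.0                                    # error term ignored in this sketch

z1 = relu(x1 + x2 + x3)                      # max(x1 + x2 + x3, 0)
z2 = relu(x2 + x4)                           # max(x2 + x4, 0)
z3 = relu(x1 + x3 + x4)                      # max(x1 + x3 + x4, 0)

y = b1 * z1 + b2 * z2 + b3 * z3 + eps
print(z1, z2, z3, y)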
When more nodes and more hidden layers are specified, a neural network's ability to handle
complexity tends to increase, but so does the risk of overfitting. However, the tradeoffs in using
them are the lack of interpretability and the amount of data needed to train such models.
[Figure: inputs x2-x5 connected through weights/biases B1-B4 to outputs Y1 and Y2; the activation function output is bounded between 0 and 1, and the function gets activated when it hits 1.]
The information is fed to the input layer and is transferred from one layer to another over connecting channels; each channel has a value attached to it and hence is called a weighted channel 'w ij' (for neuron i and input j). Each node usually produces a scaled number in the range (0, 1) or (−1, 1). Every neuron also has a unique number associated with it called its Bias. This bias is added to the weighted sum of inputs reaching the neuron, and the result is then applied to a function known as the Activation Function. The result of the activation function determines whether the neuron gets activated. Every activated neuron passes information on to the following layer; this continues up to the second-to-last layer (the layer before the output layer). The one neuron activated in the output layer corresponds to the input digit. [Note: The weights, biases and numbers are passed to another layer of functions, and into another, and so on until the final layer produces a set of probabilities of the observation being in any of the target categories (each represented by a node in the output layer). The DLN assigns the category with the highest probability]. The weights and biases are continuously adjusted to produce a well-trained network. The DLN is trained on large datasets; during training, the weights w i are determined so as to minimize a specified loss function. Unfortunately, DLNs require substantial time to train, and systematically varying the hyperparameters may not be feasible.
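A minimal Python sketch of a single neuron's computation described above (weighted sum of inputs plus a bias, passed through an activation function) follows; the weights, bias, inputs and the choice of a sigmoid activation are illustrative assumptions.

# A minimal sketch of one neuron's computation: weighted sum of inputs plus a
# bias, passed through an activation function. All values are illustrative.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))          # squashes the output into (0, 1)

x = np.array([0.2, -0.5, 0.9])               # inputs reaching the neuron
w = np.array([0.4, 0.1, -0.3])               # weights w_ij on each channel
bias = 0.05                                  # the neuron's bias

activation = sigmoid(np.dot(w, x) + bias)    # activation of (weighted sum + bias)
print(activation)                            # the value passed on to the next layer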
In the case of AlphaGo, a virtual gamer (the agent) uses his/her console to issue commands (the actions) with the help of the information on the screen (the environment) to maximize his/her score (the reward). Unlike supervised learning, RL does not receive direct labeled data or instantaneous feedback; the learning instead occurs through millions of trials and errors. For example, an agent could be a virtual trader who follows certain trading rules (the actions) in a specific market (the environment) to maximize its profits (its rewards). The success of RL in financial markets is still an open question.
Big Data differs from traditional data sources based on the presence of a set of characteristics commonly
referred to as the 4 V's: Volume, Variety, Velocity and Veracity.
Big Data also affords opportunities for enhanced fraud detection and risk management.
E.g: One study conducted in the U.S. found that positive sentiment on Twitter could predict the trend for the
Dow Jones Industrial Average up to 3 days later with nearly 87% accuracy.
1. Exploratory Data Analysis (EDA): the preliminary step in data exploration. Exploratory graphs, charts and other visualizations, such as heat maps and word clouds, are designed to summarize and observe data.
2. Feature Selection: a process whereby only pertinent features from the dataset are selected for ML model training.
3. Feature Engineering: a process of creating new features by changing or transforming existing features. Feature Engineering techniques systematically alter, decompose or combine existing features to produce more meaningful features.
(Feature Selection is a key factor in minimizing model overfitting, while Feature Engineering tends to prevent model underfitting.)
* Data Preparation (Cleansing): Data Cleansing is the process of examining, identifying and mitigating errors in raw data.
1. Incompleteness Error: the data is not present, resulting in missing data.
2. Invalidity Error: the data is outside of a meaningful range.
3. Inaccuracy Error: the data is not a measure of true value.
4. Inconsistency Error: the data conflicts with the corresponding data points or reality.
5. Non-Uniformity Error: the data is not presented in an identical format.
6. Duplication Error: duplicate observations are present.
In addition to a manual inspection and verification of the data, analysis software such as SPSS can be used to understand 'MetaData' (data that describes and gives information about other data) about the data properties, as a starting point to investigate any errors in the data.

* Text Preparation (Cleansing): Raw text data are a sequence of characters and contain other non-useful elements including html tags, punctuation and white spaces. The initial step in text processing is cleansing, which involves cleaning the text by removing unnecessary elements from the raw text. A 'Regular Expression' (Regex) is a series of characters in a particular order; Regex is used to search for patterns of interest in a given text. For example, it can be used to find all the html tags that are present in the form of < ... > in the text. Once a pattern is found, it can be removed or replaced.
1. Remove Html Tags: Most text data are acquired from web pages, and the text inherits html mark-up tags along with the actual content. The initial task is to remove (or strip) the html tags with the help of regex.
2. Remove Punctuations: Most punctuations are not necessary for text analysis and should be removed. However, some punctuations such as periods (dots), percentage signs, currency symbols and question marks may be useful for ML model training. These punctuations should be substituted with annotations such as /percentageSign/, /dollarSign/ and /questionMark/ to preserve their grammatical meaning. Periods must be appropriately replaced or removed, e.g. periods after abbreviations are removed, but periods separating sentences are replaced by an annotation. Regex is often used to remove or replace punctuations.
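A minimal Python sketch of these cleansing steps using regular expressions is shown below; the sample sentence and the exact annotation tokens are illustrative assumptions.

# A minimal sketch of text cleansing with regular expressions, following the
# steps above; the sample text and annotation tokens are illustrative.
import re

raw = "<p>Revenue grew 12.5% to $3.2 billion. Is that enough?</p>"

text = re.sub(r"<[^>]+>", "", raw)                 # 1. remove html tags of the form <...>
text = re.sub(r"%", " /percentageSign/ ", text)    # 2. substitute useful punctuation with annotations
text = re.sub(r"\$", " /dollarSign/ ", text)
text = re.sub(r"\?", " /questionMark/ ", text)
text = re.sub(r"\s+", " ", text).strip()           # collapse extra white space

print(text)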
Normalization rescales a variable onto the range [0, 1]:
X i (normalized) = (X i − X min) / (X max − X min)
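A minimal Python sketch of this normalization formula, with illustrative values:

# Min-max normalization of a feature column; the sample values are illustrative.
import numpy as np

x = np.array([20.0, 35.0, 50.0, 80.0])
x_normalized = (x - x.min()) / (x.max() - x.min())   # rescales values into [0, 1]
print(x_normalized)                                  # [0.   0.25 0.5  1.  ]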
Structured Data
1D: Summary statistics such as mean, median, quartiles, ranges, standard deviations, skewness and kurtosis.
[Figures: a histogram and a density plot (density plots are smoothed histograms) show the high-level distribution of the data, with frequency on the vertical axis; a box plot marks the minimum, 1st quartile, median and 3rd quartile; a scatter plot of two features (e.g. salary vs. age) can reveal whether a statistically significant relationship exists; a line chart shows the data over time.]
For Multivariate Data, commonly utilized exploratory visualization designs include stacked bar and
line charts, multiple box plots and scatter plots showing multivariate data that use different colors
or shapes for each feature.
Summary statistics such as measures of central tendency and minimum and maximum values are commonly employed for continuous data, and counts and frequencies for categorical data, to gain insight regarding the distribution of possible values.
Unstructured Data
The most common applications are,
(a) Text Classification: uses supervised ML approaches to classify texts into different classes. Text
Classification involves dividing text documents into assigned classes (a class is a category,
examples include 'relevant' and 'irrelevant' text documents or 'Bearish' and 'Bullish' sentences).
(b) Topic Modeling: uses unsupervised ML approaches to group the texts in the dataset into topic
clusters. Topic modeling is a text data application in which the words that are most informative
are identified by calculating the term frequency of each word. For example, the word 'soccer' can
be informative for the topic 'sports'. The words with high term frequency values are eliminated
as they are likely to be stop words or other common vocabulary words, making the resulting
BOW compact and more likely to be relevant to topics within the texts.
(c) Fraud Detection
(d) Sentiment Analysis: predicts the sentiment i.e. negative, neutral or positive of the texts in a
dataset using both supervised and unsupervised approaches.
(In Sentiment Analysis and Text Classification applications, the Chi-square measure of word
association can be useful for understanding the significant word appearances in negative and
positive sentences in the text or in different documents).
Text data includes a collection of texts, also known as a Corpus, that are sequences of tokens.
Word Clouds are common visualizations when working with text data, as they can be made to visualize the most informative words and their term frequency values.
Structured Data
Typically, structured data even after data preparation can contain features that don't contribute to the accuracy of an ML model or that negatively affect the quality of ML training. Feature Selection on structured data is a methodical and iterative process. Statistical measures can be used to assign a score gauging the importance of each feature. The features can then be ranked using the score and either retained or eliminated from the dataset. Methods include the Chi-Square Test, Correlation Coefficient and information-gain measures i.e. R².
Feature Selection is different from Dimensionality Reduction, but both methods seek to reduce the
no. of features in the dataset. The Dimensionality Reduction method creates new combinations of
features that are uncorrelated, whereas Feature Selection includes and excludes features present in
the data without altering them.
Unstructured Data
For text data, Feature Selection involves selecting a subset of the terms or tokens occurring in the
dataset. The tokens serve as features for ML model training. Feature Selection in text data effectively
decreases the size of the vocabulary or BOW. This helps the ML model be more efficient and less
complex. Another benefit is to eliminate noisy features from the dataset. Noisy features are tokens
that don't contribute to ML model training and actually might detract from the ML model accuracy.
The frequent tokens strain the ML model to choose a decision boundary among the texts as the
terms are present across all the texts, an example of model underfitting. The rare tokens mislead
the ML model into classifying texts containing the rare terms into a specific class, an example of
model overfitting.
Structured Data
For Continuous Data, a new feature may be created, for example by taking the logarithm of the product of two or more features. As another example, when considering a salary or income feature, it may be important to recognize that different salary brackets impose a different taxation rate. Domain knowledge can be used to decompose an income feature into different tax brackets, resulting in a new feature, 'income_above_100K', with possible values 0 and 1.
For Categorical Data, a new feature can be a combination of two features (for example, their sum or product) or a decomposition of one feature into many. If a single categorical feature represents education level with 5 possible values, i.e. high school, associate's, bachelor's, master's and doctorate, then these values can be decomposed into 5 new features, one for each possible value (e.g. is_high_school, ..., is_doctorate), filled with 0s (for false) and 1s (for true). The process in which categorical variables are converted into binary form (0 or 1) for ML is called One Hot Encoding.
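Below is a minimal sketch of One Hot Encoding the education-level feature in Python with pandas; the column and category names are illustrative assumptions.

# A minimal sketch of one hot encoding the education-level feature described
# above, using pandas; column and category names are illustrative.
import pandas as pd

df = pd.DataFrame({"education": ["high_school", "bachelors", "masters",
                                 "doctorate", "associates", "bachelors"]})

# Each category becomes its own 0/1 column, e.g. education_doctorate
one_hot = pd.get_dummies(df, columns=["education"], dtype=int)
print(one_hot)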
Unstructured Data
The following are some feature engineering techniques which may overlap with text processing
techniques:
(a) Numbers: In text processing, numbers are converted into a token such as '/number/'.
(b) N-grams: representations of word sequences of length n, which can be kept intact as single tokens so that discriminative multi-word patterns are preserved.
(c) Name Entity Recognition (NER): The NER algorithm analyzes the individual tokens and their surrounding semantics while referring to its dictionary to tag an object class to the token. For example, for the text 'CFA Institute was formed in 1947 and is headquartered in Virginia', NER would tag 'CFA Institute' as an organization, '1947' as a date and 'Virginia' as a place. NER tags can also help identify critical tokens for which operations such as lowercasing and stemming can then be avoided (e.g. 'Institute' here refers to an organization rather than a verb). Additional object classes are, for example, MONEY, TIME and PERCENT, which are not present in the example text.
(d) Parts of Speech (POS): uses language structure and dictionaries to tag every token in the text
with a corresponding part of speech. Some common POS tags are noun, verb, adjective and
proper noun.
Class Imbalance: where the no. of instances for a particular class is significantly larger than for
other classes. For example, say for corporate issuers in the BB+/Ba1 to B+/B1
credit quality range, issuers who defaulted (positive or '1' class) would be very few
compared to issuers who did not default (negative or '0' class). Hence, on such
training data, a naïve model that simply assumes no corporate issuer will default
may achieve good accuracy, albeit with all default cases misclassified. Balancing
the training data can help alleviate such problems. In cases of unbalanced data,
the '0' class (majority class) can be randomly undersampled or the '1' class
(minority class) randomly oversampled.
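A minimal Python sketch of randomly oversampling the minority ('1') class is shown below; the imbalanced sample data are an illustrative assumption.

# A minimal sketch of random oversampling of the minority ('1' / default) class;
# the dataset here is a hypothetical, heavily imbalanced sample.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.normal(size=1000),
                   "default": [1] * 20 + [0] * 980})        # 2% defaults ('1' class)

minority = df[df["default"] == 1]
majority = df[df["default"] == 0]

# Randomly oversample the minority class (with replacement) up to the majority size
oversampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, oversampled]).sample(frac=1, random_state=0)  # shuffle

print(balanced["default"].value_counts())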
(a) Error Analysis: For classification problems, error analysis involves computing four basic evaluation metrics: TP, FP, TN and FN, which are summarized in a Confusion Matrix for error analysis.
Precision: is the ratio of correctly predicted positive classes to all predicted positive classes. Precision is useful in situations where the cost of FP, or Type I Error, is high. For example, when an expensive product fails quality inspection (predicted class 1) and is scrapped, but it is actually perfectly good (actual class 0).
P = TP / (TP + FP)
Recall: also known as sensitivity i.e. is the ratio of correctly predicted positive classes to all actual
positive classes. Recall is useful in situations where the cost of FN or Type II Error is high.
For example, when an expensive product passes quality inspection (predicted class 0) and
is sent to the valued customer, but it is actually quite defective (actual class 1).
R = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
F1 Score = (2 × P × R) / (P + R)
F1 Score is more appropriate than Accuracy when there is an unequal class distribution in the dataset and it is necessary to measure the equilibrium of Precision and Recall. High scores on both of these metrics suggest good model performance.
Example 1:
Calculate Precision, Recall, Accuracy and F1 Score with the help of table below:
Observation   Actual Class   Predicted Class   Result
1             1              1                 TP
2             0              0                 TN
3             1              1                 TP
4             1              0                 FN
5             1              1                 TP
6             1              0                 FN
7             0              0                 TN
8             0              0                 TN
9             0              0                 TN
10            0              1                 FP
Precision = 3 / (3 + 1) = 0.75; Recall = 3 / (3 + 2) = 0.60; Accuracy = (3 + 4) / (3 + 1 + 4 + 2) = 0.70; F1 Score = 2 (0.75) (0.6) / (0.75 + 0.6) = 0.67
If the no. of observations in each class of the dataset is unequal, however, then the F1 Score (rather than Accuracy) should be used as the overall performance measure for the model.
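The worked example can be reproduced with a short Python sketch using the counts from the table (TP = 3, FP = 1, TN = 4, FN = 2):

# Reproducing the worked example above from the confusion-matrix counts.
tp, fp, tn, fn = 3, 1, 4, 2

precision = tp / (tp + fp)                           # 0.75
recall = tp / (tp + fn)                              # 0.60
accuracy = (tp + tn) / (tp + fp + tn + fn)           # 0.70
f1 = 2 * precision * recall / (precision + recall)   # ~0.67

print(precision, recall, accuracy, round(f1, 2))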
(b) Receiver Operating Characteristic (ROC): the ROC curve is traced out by varying the threshold on the predicted probability (P) above which an observation is assigned to class '1'. The shape of the ROC curve provides insight into the model's performance.
(c) Root Mean Squared Error (RMSE): is appropriate for continuous data prediction and is mostly used for regression models. It is a simple metric that captures all the prediction errors in the data (n). A small RMSE indicates potentially better model performance.
RMSE = √[ Σ (Predicted i − Actual i)² / n ] = √[ Σ (Ŷ i − Y i)² / n ]
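A minimal Python sketch of the RMSE calculation, with illustrative predicted and actual values:

# RMSE over a small set of illustrative predictions.
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.8, 5.4, 2.9, 6.5])

rmse = np.sqrt(np.mean((predicted - actual) ** 2))   # sqrt of the mean squared prediction error
print(round(rmse, 3))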
If prediction error on the training set is high → Underfitting, i.e. Bias Error.
If prediction error on the cross-validation set is high (relative to the training set) → Overfitting, i.e. Variance Error.
(Variance Error is high when a model is overcomplicated and memorizes the training data so much that it will likely perform poorly on new data.)
It is not possible to completely eliminate both types of error. The Bias-Variance trade-off is critical in finding the optimal balance where a model neither underfits nor overfits.
(a) Parameters: are critical for a model and are dependent on the training data. Parameters are
learned from the training data as part of the training process by an optimization technique.
Examples of parameters include: Coefficients in regression, weights in NN and Support Vectors in
SVM.
(b) Hyperparameters: are used for estimating model parameters and are not dependent on the training data. Examples of hyperparameters include the Regularization Term (λ) in supervised models, the Activation Function and No. of Hidden Layers in NN, the No. of Trees and Tree Depth in Ensemble Methods, K in KNN and K-Means Clustering, and the P-Threshold in Logistic Regression.
Hyperparameters are manually set and tuned. Thus, tuning heuristics and techniques such as Grid Search are used to obtain the optimum values of hyperparameters. Grid Search is a
method of systematically training an ML model by using various combinations of
hyperparameter values, cross validating each model and determining which combination of
hyperparameter values ensures the best model performance. The plot of training error for each
value of a hyperparameter (i.e. changing model complexity) is called a Fitting Curve.
If high bias or variance error exists after tuning of hyperparameters, either a larger no. of training
instances may be needed or the no. of features included in the model may need to be decreased
(in the case of high variance) or increased (in the case of high bias). The model then needs to be
retrained and retuned using the new training dataset. In the case of a complex model, where a large
model is comprised of sub-model(s), Ceiling Analysis can be performed. Ceiling Analysis can help
determine which sub-model needs to be tuned to improve the overall accuracy of the larger model
i.e. is a systematic process of evaluating components in the pipeline of model building.
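Below is a minimal sketch of Grid Search with cross-validation in Python using scikit-learn; the KNN model and the candidate values of K are illustrative assumptions.

# A minimal sketch of grid search with cross-validation using scikit-learn;
# the model (KNN) and the hyperparameter grid are illustrative assumptions.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                    # hypothetical features
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # hypothetical labels

grid = {"n_neighbors": [3, 5, 7, 9]}             # candidate values of the hyperparameter K
search = GridSearchCV(KNeighborsClassifier(), grid, cv=5)   # cross-validate each combination
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))    # best K and its CV score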
Step 2: Define Probability Distributions for these Variables: Generically, there are 3 ways in which we can go
about defining probability distributions:
- Historical Data: This method assumes that the future values of the variable will be similar to its past
e.g. long term treasury bond rate.
- Cross-Sectional Data: When past data is unavailable or unreliable, we may estimate the distribution of the variable based on the values of the variable for peers.
- Pick a Distribution and Estimate the Parameters: When neither historical nor cross-sectional data
provide adequate insight, subjective specification of a distribution along with related parameters is
the appropriate approach.
Step 3: Check for Correlation across Variables: When there is a strong correlation between variables, we can either (a) allow only one of the variables to vary, i.e. it makes sense to focus on the variable that has the bigger impact on value, or (b) build the rules of correlation into the simulation (this necessitates more sophisticated simulation packages). As with the distributions, the correlations can be estimated by looking at the past.
Step 4: Run the Simulation: This means randomly drawing variables from their underlying distributions and then using them as inputs to generate estimated values. The process is repeated to yield thousands of estimates of value, giving a distribution of the investment's value, though the marginal contribution of each simulation drops off as the no. of simulations increases. The no. of simulations needed for a good output is driven by the following (see the sketch after this list):
- No. of Probabilistic Inputs: The larger the no. of inputs that have probability distributions attached to them, the greater will be the required no. of simulations.
- Characteristics of Probability Distributions: The greater the variability in types of distributions, the
greater the no. of simulations needed. Conversely, if all variables are specified by one distribution
(e.g. normal), then the no. of simulations needed would be lower.
- Range of Outcomes: The greater the potential range of outcomes on each input, the greater will be
the no. of simulations.
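A minimal Python sketch of Step 4 is shown below; the input distributions, their parameters and the one-period valuation formula are illustrative assumptions, not part of the reading.

# A minimal sketch of running a Monte Carlo simulation (Step 4): repeatedly draw
# the probabilistic inputs from their assumed distributions and compute a value
# for each draw. All distributions and the valuation formula are illustrative.
import numpy as np

rng = np.random.default_rng(42)
n_simulations = 10_000

# Step 2-style input distributions (hypothetical)
revenue_growth = rng.normal(loc=0.05, scale=0.02, size=n_simulations)
margin = rng.triangular(left=0.10, mode=0.15, right=0.20, size=n_simulations)
discount_rate = 0.08

# Simple illustrative one-period valuation for each simulated draw
base_revenue = 100.0
values = base_revenue * (1 + revenue_growth) * margin / discount_rate

# Output: a distribution of value rather than a single point estimate
print(round(values.mean(), 2),
      round(np.percentile(values, 5), 2),
      round(np.percentile(values, 95), 2))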
Advantages of Simulations
1. Better Input Quality: Superior inputs are likely to result when an analyst goes through the process of
selecting a proper distribution for critical inputs rather than relying on single best estimate.
2. Provides a Distribution of Expected Value rather than a Point Estimate: The distribution of an investment's expected value provides an indication of the risk in the investment.
Disadvantages of Simulations
1. Garbage In, Garbage Out: Regardless of the complexities employed in running simulations, if the
underlying inputs are poorly specified, the output will be low quality. It is also worth noting that simulations
require more than a passing knowledge of statistical distributions and their characteristics; analysts who
cannot assess the difference between normal and lognormal distributions shouldn't be doing simulations.
2. Real data may not fit the requirements of statistical distributions which may yield misleading results.
3. Non-Stationary Distributions: Input variable distributions may change over time, so the distribution and
parameters specified for a particular simulation may not be valid anymore. For example, the mean and
variance estimated from historical data for an input that is normally distributed may change for the next
period.
4. Changing Correlation across Inputs: In the third simulation step, we noted that correlation across input
variables can be modeled into simulations. However, this works only if the correlations remain stable and
predictable. To the extent that correlations between input variables change over time, it becomes far more
difficult to model them.
1. Book Value Constraints: There are two types of restrictions on book value of equity that may call for risk
hedging:
- Regulatory Capital Requirements: Banks and insurance companies are required to maintain adequate levels of capital. Violations of minimum capital requirements are considered serious and could threaten the very existence of the firm.
- Negative Book Value for Equity: In some countries, such as several European countries, negative book value of equity may have serious consequences.
2. Earnings and Cashflow Constraints: Earnings or cashflow constraints can be imposed internally to meet
analyst expectations or to achieve bonus targets. Earnings constraints can also be imposed externally, such
as a loan covenant. Violating such a constraint could be very expensive for the firm.
3. Market Value Constraints: Market value constraints seek to minimize the likelihood of financial distress or
bankruptcy for the firm, by incorporating the costs of financial distress in a valuation model for the firm
e.g. stress testing and VaR.
Decision Trees and Simulation can be used as Complements or as Substitutes for risk-adjusted valuation.
Scenario Analysis doesn't include the full spectrum of outcomes and therefore can only be used as a
Complement to risk-adjusted valuation. If used as a substitute, the cashflows of an investment are discounted at the risk-free rate (R f) and the expected value obtained is then evaluated in conjunction with the variability obtained from the analysis.