Quantitative Methods

CHAPTER 4 Introduction to Linear Regression

Linear Regression, a.k.a. Linear Least Squares, assumes a linear relationship between the dependent and the independent variables. Linear Regression computes the straight line that best fits the observations; it chooses values for the intercept 'b0' and slope 'b1' that minimize the sum of the squared vertical distances between the observations and the regression line.

ACTUAL:  Yi = b0 + b1 Xi + εi,   i = 1, 2, 3 ... n

Whereas,

Yi: ith observation of the dependent variable Y. The dependent variable is also referred to as the 'Explained Variable', 'Endogenous Variable' or 'Predicted Variable'.

Xi: ith observation of the independent variable X. The independent variable is also referred to as the 'Explanatory Variable', 'Exogenous Variable' or 'Predicting Variable'.

Regression Coefficients:
b0: Regression Intercept. It is the value of the dependent variable if the value of the independent variable is 0. In a regression of a stock's excess returns on the market's excess returns, the intercept is the stock's ex-post alpha, a measure of excess risk-adjusted returns a.k.a. Jensen's alpha: αp = Rp - [Rf + βp (Rm - Rf)].

b1: Regression Slope Coefficient. It is the change in the dependent variable for a 1-unit change in the independent variable. The slope coefficient in a regression like this is called the stock's Beta, and it measures the relative amount of systematic risk in returns.

εi: Residual for the ith observation, also referred to as the 'Disturbance Term', 'Error Term' or 'Unexplained Deviation'. It represents the portion of the dependent variable that cannot be explained by the independent variable.

PREDICTED:  Ŷi = b̂0 + b̂1 Xi,   i = 1, 2, 3 ... n

Whereas,

Ŷi: Estimated (fitted) value of Yi given Xi.

b̂0: Estimated Intercept. b̂0 = Ȳ - b̂1 X̄; the intercept equation highlights the fact that the regression line passes through a point with coordinates equal to the means of the independent and dependent variables.

b̂1: Estimated Slope Coefficient. b̂1 = CovXY / σ²X.
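A minimal numerical sketch of the b̂0 and b̂1 formulas above (not from the source), assuming NumPy is available; the x and y arrays are made-up illustrative data.

```python
import numpy as np

# Hypothetical observations (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# b1_hat = Cov(X, Y) / Var(X);  b0_hat = Ybar - b1_hat * Xbar  (the formulas above)
b1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0_hat = y.mean() - b1_hat * x.mean()

y_hat = b0_hat + b1_hat * x            # fitted values
sse = np.sum((y - y_hat) ** 2)         # the quantity OLS minimizes
print(round(b0_hat, 4), round(b1_hat, 4), round(sse, 4))
```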

The sum of the squared vertical distances between the estimated and actual Y-values is referred to as the 'Sum of Squared Errors' (SSE). Thus, the regression line is the line that minimizes the SSE. This explains why simple linear regression is frequently referred to as 'Ordinary Least Squares' (OLS) regression, and the values estimated by the regression equation, Ŷi, are called least squares estimates. The model rests on the assumptions below:

Assumption 1: A linear relationship exists between the dependent and the independent variable. This requirement means that b0 and b1 are raised to the first power only and that neither b0 nor b1 is multiplied or divided by another regression parameter (as in b0/b1). The requirement doesn't exclude X from being raised to a power other than 1. If the relationship between the independent and dependent variables is non-linear in the parameters, then estimating that relation with a linear regression model will produce invalid results.

Yi = b0 e^(b1 Xi) + εi      Non-Linear (Not Allowed)
Yi = b0 + b1 Xi² + εi       Linear (Allowed)

Assumption 2: The independent variable X is not random. If the independent variable is random, we can still
rely on the results of regression models given the crucial assumption that the error term is uncorrelated with
the independent variable.

Assumption 3: The expected value of the error term is 0; E(ε) = 0.

Assumption 4: The variance of the error term is the same for all observations; E(εi²) = σ²ε. This is the 'Homoskedasticity' assumption; its violation is Heteroskedasticity.

Assumption 5: The error term is independently distributed, i.e. the error for one observation is not correlated with that of another observation. Its violation is 'Serial Correlation'.

Assumption 6: The error term is normally distributed. If the regression errors are not normally distributed,
we can still use regression analysis. Econometricians who dispense with the normality assumption use Chi-
Square tests of hypothesis rather than F-tests.

An unbiased forecast can be expressed as E(Actual Change - Predicted Change) = 0. If the forecasts are unbiased, the intercept 'b0' should be 0 and the slope 'b1' should be 1; then the error term [Actual Change - b0 - b1 (Predicted Change)] will have an expected value of 0, as required by Assumption 3 of the linear regression model.

lllll ANOVA
Analysis of Variance (ANOVA) is a statistical procedure for analyzing the total variability of the
dependent variable. In regression analysis, we use ANOVA to determine the usefulness of the
independent variable or variables in explaining variation in the dependent variable.

1. Total Sum of Squares (TSS): measures the total variation in the dependent variable.

TSS = Σ (Yi - Ȳ)²,  summed over i = 1 ... n

2. Regression Sum of Squares (RSS): measures the explained variation in the dependent variable.

RSS = Σ (Ŷi - Ȳ)²,  summed over i = 1 ... n

3. Sum of Squared Errors (SSE): measures the unexplained variation in the dependent variable. It is also known as the Sum of Squared Residuals or the Residual Sum of Squares.

SSE = Σ (Yi - Ŷi)²,  summed over i = 1 ... n

∴ Total Variation (TSS) = Explained Variation (RSS) + Unexplained Variation (SSE)

Fig 1: Components of Total Variation — around the fitted line Ŷi = b̂0 + b̂1 Xi, the deviation (Yi - Ȳ) is the total variation (TSS component), (Ŷi - Ȳ) is the explained variation (RSS component) and (Yi - Ŷi) is the unexplained variation (SSE component).
Fig 2: ANOVA Table

Source of Variation        df (Degrees of Freedom)    SS (Sum of Squares)    SS / df (Mean Sum of Squares)
Regression (Explained)     k                          RSS                    MSR = RSS / k
Error (Unexplained)        n - k - 1                  SSE                    MSE = SSE / (n - k - 1)
TOTAL                      n - 1                      TSS

n: no. of observations
k: no. of independent variables
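The ANOVA quantities above (and the SEE introduced later) can be computed directly from the actual and fitted values. A minimal sketch, assuming NumPy arrays; the function name and arguments are illustrative, not from the source.

```python
import numpy as np

def anova_table(y, y_hat, k):
    """Return TSS, RSS, SSE, MSR, MSE and SEE for a regression with k independent variables."""
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)        # total variation
    rss = np.sum((y_hat - y.mean()) ** 2)    # explained variation
    sse = np.sum((y - y_hat) ** 2)           # unexplained variation
    msr = rss / k                            # mean regression sum of squares
    mse = sse / (n - k - 1)                  # mean squared error
    see = np.sqrt(mse)                       # standard error of estimate
    return tss, rss, sse, msr, mse, see
```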

lllll Variations Explained

lllll T-Statistic
The T-value measures the size of the difference relative to the variation in your sample data. When the T-stat is very small, it indicates that none of the autocorrelations are significantly different from 0.

t = (b̂1 - b1) / s_b̂1,    df = n - k - 1

Fig 3: Null and Alternative Hypotheses
H0: what you don't believe in (want to reject)
Ha: something you believe in (want to accept)

a. H0: μ = 0, Ha: μ ≠ 0  (two-tailed; reject in both tails)
b. H0: μ ≤ 0, Ha: μ > 0  (one-tailed; reject in the right tail)
c. H0: μ ≥ 0, Ha: μ < 0  (one-tailed; reject in the left tail)

p-value: the smallest level of significance for which the null hypothesis can be rejected.

p < α : Reject H0
p > α : Fail to reject (accept) H0

Example 1:
The estimated slope coefficient of the ABC plc. is 0.64 with standard error equal to 0.26. Assuming that the
sample has 36 observations, determine if the estimated slope coefficient is significantly different than 0 at a
5% level of significance.
H0: b1 = 0    Ha: b1 ≠ 0

t = (b̂1 - b1) / s_b̂1 = (0.64 - 0) / 0.26 = 2.46

The critical two-tailed t-values are ±2.03 (df = n - k - 1 = 36 - 1 - 1 = 34). Because t > tc, i.e. 2.46 > 2.03, we reject the null hypothesis and conclude the slope is different from 0.
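The same t-test as a short sketch, assuming SciPy is available for the critical value; the numbers are those of Example 1.

```python
from scipy import stats

b1_hat, b1_null, se_b1 = 0.64, 0.0, 0.26
n, k = 36, 1
t_stat = (b1_hat - b1_null) / se_b1                  # 2.46
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - k - 1)     # two-tailed 5% critical value ≈ 2.03
print(round(t_stat, 2), round(t_crit, 2), abs(t_stat) > t_crit)  # reject H0 if True
```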

If none of the independent variables in a regression model helps explain the dependent variable,
the slope coefficients should all equal '0'. In a multiple regression however, we cannot test the null
hypothesis that all slope coefficients equal '0' based on T-tests that each individual slope coefficient
equals '0', because the individual tests don't account for the effects of interactions among the
independent variables. To test the null hypothesis that all of the slope coefficients in the multiple
regression model are jointly equal to '0' (H 0 : b1 = b2 = ... b k = 0) against the alternative hypothesis
that at least one slope coefficient is not equal to '0', we must use an F-test. The F-test is viewed as a test of the regression's overall significance.

lllll F-Statistic
F-test assesses how well the set of independent variables as a group explains the variation in the
dependent variable. That is, the F-Stat is used to test whether at least one of the independent
variables explains a significant portion of the variation of the dependent variable. This is always a
one-tailed test, despite the fact that it looks like it should be a two-tailed test because there is an
equal sign in the null hypothesis. If the regression model does a good job of explaining variation in
the dependent variable, then this ratio should be high.

F = MSR / MSE = (RSS / k) / [SSE / (n - k - 1)]     df numerator = k,  df denominator = n - k - 1

For simple linear regression, F = t²b1 holds:
H0: b1 = 0    Ha: b1 ≠ 0

For multiple regression, F = t²b1 does not hold:
H0: b1 = b2 = b3 = b4 = 0    Ha: At least one bi ≠ 0

Example 2:
An analyst runs a regression of monthly value stock returns on five independent variables over 60 months.
The TSS is 460 and the SSE is 170. Test the null hypothesis at the 5% significance level that all five
independent variables are equal to 0.
H0: b1 = b2 = b3 = b4 = b5 = 0    Ha: At least one bi ≠ 0

MSR = RSS / k = (460 - 170) / 5 = 58
MSE = SSE / (n - k - 1) = 170 / 54 = 3.15
F = MSR / MSE = 58 / 3.15 = 18.41

The critical F-value for 5 and 54 degrees of freedom at a 5% significance level is approximately 2.4. Because 18.41 > 2.4, we reject the null hypothesis and conclude that at least one of the five slope coefficients is significantly different from 0.
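The same F-test as a short sketch, assuming SciPy is available for the critical value; the inputs are those of Example 2.

```python
from scipy import stats

tss, sse, n, k = 460.0, 170.0, 60, 5
msr = (tss - sse) / k                                # 58
mse = sse / (n - k - 1)                              # ≈ 3.15
f_stat = msr / mse                                   # ≈ 18.41
f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)     # ≈ 2.4
print(round(f_stat, 2), round(f_crit, 2), f_stat > f_crit)  # reject H0 if True
```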

lllll Standard Error of Estimate


The Standard Error of Estimate (SEE) measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation. The SEE gauges the 'fit' of the regression line: the smaller the standard error, the better the fit. The SEE is the standard deviation of the error terms in the regression. As such, the SEE is also referred to as the Standard Error of the Residual or Standard Error of the Regression. In regressions, the relationship between the independent and dependent variables can be strong or weak; relative to total variability, the SEE will be low if the relationship is very strong and high if the relationship is weak. Any violation of an assumption that affects the error term will ultimately affect the coefficient standard error, because the coefficient standard error is calculated using the SEE.
SEE = √MSE = √[SSE / (n - k - 1)] = [Σ ε̂i² / (n - k - 1)]^(1/2)
lllll Coefficient of Determination (R²) and Adjusted R² (R²a)

1. R²: It is defined as the percentage of the total variation in the dependent variable explained by the independent variable. R² reflects the correlation between predicted and actual values of the dependent variable. For example, an R² of 0.63 indicates that the variation of the independent variable explains 63% of the variation in the dependent variable.

R² = (TSS - SSE) / TSS = RSS / TSS = 1 - SSE / TSS     (higher is better; as k ↑, R² ↑)

Regression output often includes multiple R (correlation coefficient), which is the correlation between actual values of Y and forecasted values of Y. (Multiple R is the square root of R².)

For simple linear regression (i.e. one independent variable), the coefficient of determination R² may be computed by simply squaring the correlation coefficient 'r', i.e. R² = r². This approach is not appropriate when more than one independent variable is used in the regression.

For multiple regression, R² by itself may not be reliable; this is because R² almost always increases as variables are added to the model, even if the marginal contribution of the new variables is not statistically significant (if we add regression variables to the model, the amount of unexplained variation will decrease and RSS will increase if the new independent variable explains any of the unexplained variation in the model. Such a reduction occurs when the new independent variable is even slightly correlated with the dependent variable and is not a linear combination of other independent variables in the regression). Consequently, a relatively high R² may reflect the impact of a large set of independent variables rather than how well the set explains the dependent variable. This problem is often referred to as overestimating the regression.

2. R²a: Some financial analysts use an alternative measure of goodness of fit called R²a. To overcome the problem of overestimating the impact of additional variables on the explanatory power of a regression model, many researchers recommend adjusting R² for the number of independent variables 'k'.

R²a = 1 - [(n - 1) / (n - k - 1)] × (1 - R²)

As k increases beyond some optimal number k*, R²a declines, so independent variables should not be added beyond k*. When a new independent variable is added, R²a can decrease if adding that variable results in only a small increase in R². When k ≥ 1, R² ≥ R²a. In fact, R²a can be negative if R² is low enough (effectively consider its value 0), although R² is always non-negative. Furthermore, we must be aware that a high R²a doesn't necessarily indicate that the regression is well specified in the sense of including the correct set of variables. One reason for caution is that a high R²a may reflect peculiarities, i.e. weirdness, of the dataset used to estimate the regression. (Figure: R² keeps rising as k increases, while R²a peaks at k* and then declines.)

Goodness of 'fit' (SEE) and the Coefficient of Determination are different measures of the same concept. The Coefficient of Variation is not directly part of the regression model.

Example 3:
Part a: An analyst runs a regression of monthly value stock returns on five independent variables over 60
months. The TSS is 460 and the SSE is 170. Calculate the R² and R²a.
Part b: Suppose the analyst now adds four more independent variables to the regression and the R² increases to 65%. Identify which model the analyst should most likely prefer.

a. R² = (460 - 170) / 460 = 0.63 or 63%       R²a = 1 - [(60 - 1) / (60 - 5 - 1)] × (1 - 0.63) = 59.6%
b. R² = 65% (given)                           R²a = 1 - [(60 - 1) / (60 - 9 - 1)] × (1 - 0.65) = 58.7%

With nine independent variables, even though the R² has increased from 63% to 65%, the R²a has decreased from 59.6% to 58.7%. The analyst would prefer the first model because the R²a is higher and the model has five independent variables as opposed to nine.
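The R² and adjusted R² arithmetic of Example 3 as a plain-Python sketch; the helper names are illustrative.

```python
def r_squared(tss, sse):
    return 1 - sse / tss

def adjusted_r_squared(r2, n, k):
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

r2_five = r_squared(460, 170)                            # 0.63
print(round(adjusted_r_squared(r2_five, n=60, k=5), 3))  # ≈ 0.596
print(round(adjusted_r_squared(0.65, n=60, k=9), 3))     # ≈ 0.587
```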

lllll Confidence Intervals

Estimated Regression Coefficient ± (tc × Coefficient Standard Error)

SEE ↑ → Standard Error ↑, tc ↑ → CI widens
SEE ↓ → Standard Error ↓, tc ↓ → CI tightens

Confidence interval for the slope coefficient:  b̂1 ± (tc × s_b̂1)
Confidence interval for a forecast (standard error of forecast sf):  Ŷ ± (tc × sf)
Whereas,
s²f = SEE² × [1 + 1/n + (X - X̄)² / ((n - 1) s²X)];   s²X is the variance of the independent variable.
s²Y = TSS / (n - 1) is the variance of the dependent variable.

Rule: H0 : b1 = 0 and Ha : b1 ≠ 0. If the confidence interval (CI) at the desired level of significance doesn't
include 0, the H0 is rejected and the coefficient is said to be statistically different from 0.

Example 4:
Coldplay forecasts the excess return on the S&P 500 for June 2017 to be 5% and the 95% CI for the predicted
value of the excess return on VIGRX for June 2017 to be 3.9% to 7.7% (b 0 : 0.0023 and b 1 : 1.1163). The
standard error of the forecast is closest to?

Ŷ = b̂0 + b̂1 X
Ŷ = 0.0023 + 1.1163 (0.05) = 0.058115

95% CI: 0.058115 ± 2.03 (sf) runs from 0.039 to 0.077   ∴ sf = 0.0093
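The arithmetic of Example 4 as a two-line sketch; the 2.03 critical t-value is the one assumed in the example.

```python
y_hat = 0.0023 + 1.1163 * 0.05        # predicted excess return = 0.058115
s_f = (0.077 - y_hat) / 2.03          # back the forecast standard error out of the CI ≈ 0.0093
print(round(y_hat, 6), round(s_f, 4))
```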

NOTES

1. The distinction between SSE and SEE: SSE is the sum of the squared residuals, while SEE is the standard
deviation of the residuals.

2. Limitations of regression analysis:

(a) Linear relationships can change over time. This is referred to as 'Parameter Instability'.
(b) Public knowledge of regression relationships may negate their future usefulness.
(c) If the assumptions underlying regression analysis don't hold, the interpretation and tests of hypothesis may not be valid.

3. Prediction must be based on the parameters' estimated values (b̂0 and b̂1) in relation to the hypothesized population values.

4. CovXY = Σ (Xi - X̄)(Yi - Ȳ) / (n - 1)        (past sample)
   CovXY = Σ Pi (Xi - X̄)(Yi - Ȳ)               (probability-weighted, future sample)

The covariance between two random variables is a statistical measure of the degree to which the two variables move together. The covariance captures the linear relationship between two variables.

5. rXY = CovXY / (σX σY)

The correlation coefficient is a measure of the strength of the linear relationship between two variables.

6. Regression analysis uses two principal types of data:


(a) Cross-Sectional: data involve many observations on X and Y for the same time period.
(b) Time Series: data use many observations from different time periods for the same company or country.
A mix of time series and cross sectional data is also known as 'Panel Data'.

CHAPTER 5 Multiple Regression

Multiple Regression is regression analysis with more than one independent variable. It is used to quantify the
influence of two or more independent variables on a dependent variable.

ACTUAL:     Yi = b0 + b1 X1i + b2 X2i + ... + bk Xki + εi,   i = 1, 2, 3 ... n

PREDICTED:  Ŷi = b̂0 + b̂1 X1i + b̂2 X2i + ... + b̂k Xki

Assumptions for multiple regression are almost exactly the same as those for the single-variable linear regression model, except assumptions 2 and 3. Here are the changes under the multiple regression model:

Assumption 2: The independent variables (X1, X2, ..., Xk) are not random. Also, no exact linear relation exists between two or more of the independent variables. If this part of assumption 2 is violated, then we cannot compute the linear regression, and we may encounter problems if two or more of the independent variables or combinations thereof are highly correlated. Such a high correlation is known as 'Multicollinearity' (a violation).

Assumption 3: The expected value of the error term, conditional on the independent variables, is 0: E(ε | X1, X2, ..., Xk) = 0.

lllll Qualitative Factors


Qualitative Independent Variable i.e. Dummy Variables capture the effect of binary independent
variables. Whereas, Qualitative Dependent Variables (Categorical Dependent Variables) require
methods other than OLS i.e. Probit, Logit or Discriminant Analysis.

lllll Dummy Variables


There are occasions when the independent variable is binary in nature; it is either 'on' or 'off'. One type of qualitative variable is called a Dummy Variable, which takes on a value of 1 if a particular condition is true and 0 if that condition is false. Not all qualitative variables are simply dummy variables. For
example, in a trinomial choice model i.e. a model with three choices; a qualitative variable might
have value of 0, 1 or 2. Whenever we want to distinguish between n classes, we must use n -1
dummy variables to avoid 'Multicollinearity'. Otherwise, the regression assumption of no exact
linear relationship between independent variables would be violated.

Consider the following regression equation for explaining quarterly EPS in terms of the quarter of
their occurrence:

EPSt = b0 + b1 Q1t + b2 Q2t + b3 Q3t + εt     df = n - 1

Whereas,
EPS t : Quarterly observation of Earning Per Share.
Q1t : 1 if period t is the first quarter, Q1t : 0 otherwise.
Q2t : 1 if period t is the second quarter, Q2t : 0 otherwise.
Q3t : 1 if period t is the third quarter, Q 3t : 0 otherwise.
b0 : Average value of EPS for the fourth quarter.
b1 , b2 , b3 : Estimate the difference in EPS on average between the respective quarter
(i.e. quarter 1, 2 or 3) and the omitted quarter (the fourth quarter in this case).
Think of the omitted class as the reference point.
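A hedged sketch of the quarterly-EPS dummy regression above, assuming pandas and statsmodels are available; the EPS figures and column names are made up for illustration.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical quarterly EPS observations; 'quarter' runs 1..4
df = pd.DataFrame({
    "eps":     [1.2, 1.4, 1.1, 1.8, 1.3, 1.5, 1.2, 1.9],
    "quarter": [1,   2,   3,   4,   1,   2,   3,   4],
})

# n - 1 = 3 dummies; quarter 4 is the omitted reference class
dummies = pd.get_dummies(df["quarter"], prefix="Q")
X = sm.add_constant(dummies[["Q_1", "Q_2", "Q_3"]].astype(float))
fit = sm.OLS(df["eps"], X).fit()
print(fit.params)   # const = average Q4 EPS; Q_i = average difference of quarter i vs. Q4
```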

Example 1:
Some developing nations are hesitant to open their equity markets to foreign investments because they fear
that rapid inflows and outflows of foreign funds will increase volatility. You want to test whether the volatility
of returns of stocks traded on the Bombay Stock Exchange (BSE) increased after July 1993, when foreign
institutional investors were first allowed to invest in India. Your dependent variable is a measure of return
volatility of stocks traded on the BSE; your independent variable is a dummy variable that is coded 1 if foreign
investment was allowed during the month and 0 otherwise.

               Coefficient    Standard Error    T-stat
Intercept      0.0133         0.0020            6.5351
Dummy          -0.0075        0.0027            -2.7604
n = 95

a. State null and alternative hypothesis for the slope coefficient of the dummy variables that are consistent
with testing your stated belief about the effect of opening the equity markets on stock return volatility.

H0: b1 ≥ 0    Ha: b1 < 0

b. Determine whether you can reject the null hypothesis at the 5% significance level in a one-sided test of significance.

df = n - k - 1 = 95 - 1 - 1 = 93; the one-tailed 5% critical value is -1.661. Because -2.7604 < -1.661, we reject H0. Because the dummy variable takes on a value of 1 when foreign investment is allowed, we can conclude that the volatility was lower with foreign investment.

c. According to the estimated regression equation, what is the level of return volatility before and after the
market-opening event?

Before: Y = 0.0133 - 0.0075 (0) = 0.0133


After: Y = 0.0133 - 0.0075 (1) = 0.0058

lllll Probit Model

Y = f(b0 + b1 X1 + b2 X2) + ε, where f(x) = Φ(x), the standard normal CDF; the event happens with probability P and the link is Y = Φ⁻¹(P).

The Probit Model, i.e. probit regression, is based on the normal distribution. It estimates the probability that Y = 1 (a condition is fulfilled) given the values of the independent variables.

lllll Logit Model

Y = f(b0 + b1 X1 + b2 X2) + ε, where f(x) = 1 / (1 + e^-x); the link is Y = ln[P / (1 - P)], the log-odds ratio of the event happening versus not happening.

The Logit Model, i.e. logistic regression, is based on the logistic distribution and models the log-odds ratio. Logistic regression is widely used in machine learning, where the objective is classification. Logistic regression assumes a logistic distribution for the error term; this distribution is similar in shape to the normal distribution but has heavier tails.

In most cases it makes no difference which one is used (probit or logit). Both functions increase
relatively quickly at x = 0 and relatively slowly at extreme values of x. Both functions lie between 0 and 1.
In econometrics, probit and logit models are traditionally viewed as models suitable when the
dependent variable is not fully observed.
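A hedged sketch fitting both models on simulated binary data, assuming statsmodels is available (sm.Logit and sm.Probit); the data-generating coefficients are arbitrary.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))                  # constant + two regressors
p = 1 / (1 + np.exp(-(0.5 + 1.0 * X[:, 1] - 0.8 * X[:, 2])))    # logistic probabilities
y = rng.binomial(1, p)                                          # binary outcome (event / no event)

logit_fit = sm.Logit(y, X).fit(disp=0)     # logistic link: ln[P / (1 - P)]
probit_fit = sm.Probit(y, X).fit(disp=0)   # normal-CDF link: inverse Phi
print(logit_fit.params, probit_fit.params)
```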

lllll Discriminant Model


Discriminant Analysis yields a linear function similar to a regression equation, which can then be
used to create an overall score. Based on the score, an observation can be classified into bankrupt
or not bankrupt category. Discriminant model makes use of financial ratios as the independent
variables to predict the qualitative dependent variable bankruptcy.

Examine the individual coefficients using T-tests, determine the validity of the model with the F-test and the R², and look out for Heteroskedasticity, Serial Correlation and Multicollinearity.

Fig 3: Violations of Assumptions

A. Heteroskedasticity

Definition: Heteroskedasticity occurs when the variance of the residuals is not the same across all observations in the sample. This happens when there are sub-samples that are more spread out than the rest of the sample. (Figure: (a) Homoskedastic [Var(εi) = σ²]: no relationship between the value of the independent variables and the regression residuals. (b) Heteroskedastic [with errors]: on average, the regression residuals grow larger as the size of the independent variable increases.)

Types: (i) Unconditional [Var(εi) ≠ σ²]: Unconditional Heteroskedasticity occurs when the heteroskedasticity is not related to the level of the independent variables, which means that it doesn't systematically increase or decrease with changes in the value of the independent variables. While this is a violation of the equal variance assumption, it usually causes no major problems with the regression. It refers to general structural changes in volatility that are not related to the prior period's volatility. Unconditional heteroskedasticity is predictable and can relate to variables that are cyclical by nature. (ii) Conditional [Var(εi | x) ≠ constant]: The type of heteroskedasticity that causes the most problems for statistical inference is Conditional Heteroskedasticity. Here the error variance is correlated with (conditional on) the values of the independent variables in the regression. It identifies non-constant volatility related to the prior period's volatility. Conditional heteroskedasticity is not predictable by nature.

Effect: The T-test (b̂i / s_b̂i) is unreliable because the coefficient standard errors are unreliable: (a) if s_b̂i is overestimated, the T-stat is too low (Type II Error); (b) if s_b̂i is underestimated, the T-stat is too high (Type I Error). The F-test (MSR / MSE) is also unreliable because MSE is a biased estimate of the true population variance.

Detection: 1. Examining scatter plots of the residuals. 2. Breusch-Pagan Chi-Square Test, which calls for the regression of the squared residuals on the independent variables. If conditional heteroskedasticity is present, the independent variables will significantly contribute to the explanation of the squared residuals.

BP χ² Test = n × R²,   df = k   (one-tailed test)

where R² is from a second regression of the squared residuals from the first regression on the independent variables.
H0: No Conditional Heteroskedasticity    H1: Conditional Heteroskedasticity
Conditional Heteroskedasticity is only a problem if the R² and the BP test statistic are too large.

Corrections: 1. Calculate 'Robust Standard Errors' i.e. 'White-Corrected Standard Errors' or 'Heteroskedasticity-Consistent Standard Errors'. These robust standard errors are then used to recalculate the T-statistics using the original regression coefficients, if there is evidence of heteroskedasticity. 2. Use 'Generalized Least Squares', which attempts to eliminate the heteroskedasticity by modifying the original equation.

B. Serial Correlation

Definition: Serial Correlation a.k.a. Autocorrelation refers to the situation in which the residual terms or regression errors are correlated with one another. It is a relatively common problem with time series data. Any effect of serial correlation appears only in the regression coefficient standard errors. However, if one of the independent variables is a lagged value of the dependent variable, then serial correlation in the error term will cause all the parameter estimates from linear regression to be inconsistent; they will not be valid estimates of the true parameters.

Types: (a) Positive SC is serial correlation in which a positive error for one observation increases the chance of a positive error for another observation. It also means that a negative error for one observation increases the chance of a negative error for another observation. First-order SC means the sign of the error term tends to persist from one period to the next. (b) Negative SC occurs when a positive error in one period increases the probability of observing a negative error in the next period, and vice versa.

Effect: With positive SC, the T-test (b̂i / s_b̂i) uses a reliable coefficient but an unreliable (underestimated) standard error, so the T-stat is inflated (Type I Error). The F-test is unreliable because MSE will be underestimated, inflating the F-stat (Type I Error). OLS standard errors need not be underestimates of the actual standard errors if negative SC is present in the regression.
(Type I Error: reject H0 when it is true. Type II Error: fail to reject H0 when it is false.)

Detection: The Durbin-Watson (DW) statistic is used to detect the presence of serial correlation.

(a) Small sample:  DW = Σ (ε̂t - ε̂t-1)² / Σ ε̂t²,   df = k
    = [Var(ε̂t) - 2 Cov(ε̂t, ε̂t-1) + Var(ε̂t-1)] / Var(ε̂t)

If the variance of the error is constant through time, then we expect Var(ε̂t) = σ̂²ε for all 't', where σ̂²ε represents the estimate of the constant error variance. If, in addition, the errors are also not serially correlated, then we expect Cov(ε̂t, ε̂t-1) = 0, in which case DW is approximately equal to (σ̂²ε - 0 + σ̂²ε) / σ̂²ε = 2. Therefore, we can test the null hypothesis that the errors are not serially correlated by testing whether the DW statistic differs significantly from 2.

(b) Large sample:  DW ≈ 2 (1 - r),   df = k, where 'r' is the correlation coefficient between residuals from one period and those from the previous period.

H0: No Positive Serial Correlation    H1: Positive Serial Correlation
Decision rule: DW below dL (lower critical value) → reject H0 (positive SC); DW between dL and dU (upper critical value) → inconclusive; DW above dU → fail to reject H0.

No autocorrelation (r = 0):          DW ≈ 2 (1 - 0) = 2      (DW = 2)
Positive serial correlation (r > 0): DW ≈ 2 (1 - 1) = 0      (DW < 2)
Negative serial correlation (r < 0): DW ≈ 2 (1 - (-1)) = 4   (DW > 2)

Corrections: 1. Adjust the coefficient standard errors using the 'Hansen Method' i.e. 'Newey & West Method'. These adjust standard errors upwards; the adjusted standard errors are then used in hypothesis testing of the regression coefficients. Only use the Hansen method if serial correlation is a problem; the White-corrected standard errors are preferred if only heteroskedasticity is a problem. If both conditions are present, use the Hansen method. 2. Explicitly incorporate the time-series nature of the data, e.g. include a seasonal term (this can be tricky).

C. Multicollinearity

Definition: When one of the independent variables is an exact linear combination of other independent variables, it becomes mechanically impossible to estimate the regression. That case, known as 'Perfect Collinearity', is much less of a practical concern than multicollinearity. Multicollinearity occurs when two or more independent variables (or combinations of independent variables) are highly (but not perfectly) correlated with each other. Multicollinearity is a serious practical concern because approximate linear relationships among financial variables are common.

Effect: The T-test (b̂i / s_b̂i) is unreliable because the standard errors are overestimated, which lowers the T-stats (Type II Error), and the coefficient estimates themselves are unreliable.

Detection: The most common way to detect multicollinearity is the situation where T-tests indicate that none of the individual coefficients is significantly different from 0, while the F-test is statistically significant and the R² is high. This suggests that the variables together explain much of the variation in the dependent variable, but the individual independent variables don't. The only way this can happen is when the independent variables are highly correlated with each other. If the absolute value of the sample correlation between any two independent variables in the regression is greater than 0.7, multicollinearity is a potential problem. However, this only works if there are exactly two independent variables. If there are more than two independent variables, while individual variables may not be highly correlated, linear combinations might lead to multicollinearity (conflicting T-test and F-test statistics). High pairwise correlations among the independent variables are not a necessary condition for multicollinearity, and low pairwise correlations don't mean that multicollinearity is not a problem.

Corrections: The most common method to correct for multicollinearity is to omit one or more of the correlated independent variables.

Assumption violated in each case: (a) Heteroskedasticity: the variance of the error term is constant. (b) Serial Correlation: the errors are not serially correlated. (c) Multicollinearity: there is no exact linear relationship among the X's.
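A hedged sketch of two of the detection tools above, the Breusch-Pagan test (BP = n × R² of the auxiliary regression) and the Durbin-Watson statistic, assuming statsmodels is available; the simulated data are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=100)   # well-behaved errors
res = sm.OLS(y, X).fit()

bp_stat, bp_pvalue, _, _ = het_breuschpagan(res.resid, res.model.exog)  # H0: no conditional heteroskedasticity
dw = durbin_watson(res.resid)                                           # ≈ 2 when there is no serial correlation
print(round(bp_stat, 3), round(bp_pvalue, 3), round(dw, 3))
```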

Example 2:
A variable is regressed against three other variables X, Y, Z. Which of the following would not be an indication
of multicollinearity? X is closely related to:
A. 3y + 2z
B. 9y - 4z + 3
✓ C. y²

lllll Model Specification

Principles of Model Specification:


(a) The model should be grounded in cogent economic reasoning.
(b) The functional form chosen for the variables in the regression should be appropriate given the
nature of the variables.
(c) The model should be parsimonious i.e. accomplishing a lot with little.
(d) The model should be examined for violations of regression assumptions before being accepted.
(e) The model should be tested and be found useful out of sample before being accepted.

There are three broad categories of 'Model Misspecification' or ways in which the regression model can be
specified incorrectly, each with several subcategories:

1. Misspecification of Functional Form


- Important variables are omitted: If the omitted independent variable (X2) is correlated with the remaining independent variable (X1), then the error term in the model will be correlated with (X1), and the estimated values of the regression coefficients a0 and a1 would be biased and inconsistent, as will the standard errors of those coefficients.

True model:          Y = b0 + b1 X1 + b2 X2 + ε
Misspecified model:  Y = a0 + a1 X1 + ε      (b0 ≠ a0)

- Variables should be transformed: Sometimes analysts fail to account for curvature or non-linearity in the relationship between the dependent variable and one or more of the independent variables, instead specifying a linear relation among variables. We should also consider whether economic theory suggests a non-linear relation. We may be able to correct the misspecification by taking the natural logarithm (ln) of the variable we want to represent as a proportional change.

- Data is improperly pooled: Suppose the relationship between returns and the independent variables during the first three-year period is actually different from the relationship in the second three-year period, i.e. the regression coefficients are different from one period to the next. If we pool the data and estimate one regression over the entire period, rather than estimating two separate regressions over each of the subperiods, we have misspecified the model and the predictions of portfolio returns will be misleading.

2. Time Series Misspecification


- Lagged dependent variable is used as an independent variable: If the error term in the regression model
is serially correlated as a result of the lagged dependent variable (which is common in time series
regression), then this model misspecification will result in biased and inconsistent regression estimates
and unreliable hypothesis tests. If lagged dependent variables help explain and not cause serial
correlation in residuals, they are okay!
- A function of the dependent variable is used as an independent variable - 'Forecasting the Past':
Sometimes as a result of the incorrect dating of variables.
- Independent variables are measured with error: Common example being when an independent variable
is measured with error is when we want to use 'Expected Inflation' in our regression but use 'Actual
Inflation' as a proxy.

3. Other Time Series Misspecifications that result in Non-Stationarity


Means that a variable's properties such as mean and variance are not constant through time e.g.
consumption and GDP, random walks or exchange rates.

NOTES

1. If expected value of the sample mean is equal to the population mean, the sample mean is therefore an
unbiased estimator of the population mean. A consistent estimator is one for which the accuracy of the
parameter estimate increases as the sample size increases.

2. Predictions in multiple regression model are subject to both parameter estimate uncertainty and
regression model uncertainty.

3. Y = 5 + 4.2 (Beta) - 0.05 (Alpha) + ε. One unit increase in Beta risk is associated with a 4.2% increase in
return, while a $1 bn. increase in Alpha implies a 0.05% decrease in return.

CHAPTER 6 Time Series Analysis

Time Series is a set of observations on a variable's outcomes in different time periods e.g. quarterly sales for
a particular company during the past 5 years.

Fig 1: Linear Vs. Log-Linear Trend Models

Linear Trend Model (fits the raw data with a straight line):
A Linear Trend is a time series pattern that can be graphed using a straight line. A downward sloping line indicates a negative trend, while an upward sloping line indicates a positive trend.
Yt = b0 + b1(t) + εt,   t = 1, 2, ... T     (b1 is the trend coefficient)
Ŷt = b̂0 + b̂1(t)
When the variable increases over time by a constant amount, a Linear Trend Model is most appropriate.

Log-Linear Trend Model (fits the transformed data ln(Yt) with a straight line):
A Log-Linear Trend works well in fitting time series that have exponential growth. Positive exponential growth means that the time series tends to increase at some constant rate of growth, i.e. the observations will form a convex curve. Negative exponential growth means that the data tends to decrease at some constant rate of decay, i.e. the plotted time series will form a concave curve.
Yt = e^(b0 + b1(t) + εt),   t = 1, 2, ... T
ln(Yt) = b0 + b1(t) + εt
Ŷt = e^(ln Ŷt)
When a variable grows at a constant rate, a Log-Linear Trend Model is most appropriate.

If, on the other hand, the data plots with a non-linear curved shape, then the residuals from a Linear Trend Model will be persistently positive or negative for a period of time. In this case, the Log-Linear Trend Model may be more suitable. In other words, when the residuals from a Linear Trend Model are serially correlated, a Log-Linear Trend Model may be more appropriate. However, it may be the case that even a Log-Linear Trend Model is not appropriate in the presence of serial correlation. In this case, we will want to turn to an Autoregressive Model.

If Linear Model exhibits Autocorrelation, try Log-Linear


If Log-Linear Model exhibits Autocorrelation, try Autoregression
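A hedged sketch contrasting the two trend fits on a made-up exponentially growing series; np.polyfit is applied once to the raw data and once to its natural log.

```python
import numpy as np

t = np.arange(1, 41)
rng = np.random.default_rng(2)
y = 100 * np.exp(0.05 * t + rng.normal(0, 0.02, 40))    # hypothetical series with constant growth rate

b1_lin, b0_lin = np.polyfit(t, y, 1)             # linear trend:      y_t     = b0 + b1*t
b1_log, b0_log = np.polyfit(t, np.log(y), 1)     # log-linear trend:  ln(y_t) = b0 + b1*t

resid_lin = y - (b0_lin + b1_lin * t)
resid_log = np.log(y) - (b0_log + b1_log * t)
# Persistently signed residuals from the linear fit suggest the log-linear model is the better choice
print(np.sign(resid_lin[:6]), np.sign(resid_log[:6]))
```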

Fig 2: Covariance Stationary Vs. Non-Stationary

Stationary: A time series is Covariance Stationary if its mean, variance and covariances with lagged and leading values don't change over time. E.g. Strict Stationary, Second-Order Stationary (Weak-Stationary), Trend Stationary and Difference Stationary models.

Non-Stationary: Data points are often Non-Stationary, i.e. have means, variances and covariances that change over time. E.g. Trends, Cycles, Random Walks or combinations of the three.

In order to obtain consistent, reliable results, non-stationary data need to be transformed into stationary data.

Time series is Covariance Stationary if it satisfies the following three conditions:

1. Constant and Finite Expected Value: The expected value of the time series is constant over time. E(y t ) = μ
and |μ| < ∞, t = 1, 2, 3, ... T. We refer to this value as the 'Mean Reverting Level'. All covariance stationary
AR(1) time series have a finite mean reverting level, when the absolute value of the lag coefficient is less
than 1 i.e. |b 1|< 1.

For an AR(1) model, xt = b0 + b1 xt-1, the mean reverting level is b0 / (1 - b1):

(a) If the time series is above its mean reverting level (xt > b0 / (1 - b1)), it is expected to decline: xt > x̂t+1.
(b) If the time series is below its mean reverting level (xt < b0 / (1 - b1)), it is expected to rise: xt < x̂t+1.
(c) If the time series is at its mean reverting level (xt = b0 / (1 - b1)), the next value of the time series is expected to equal its current value: x̂t+1 = xt.

2. Constant and Finite Variance: The time series' volatility around its mean doesn't change over time.

3. Constant and Finite Covariance between values at any given lag: The covariance of the time series with
itself for a fixed no. of periods in the past or future must be constant and finite in all periods.

Both 2 and 3: The covariance of a random variable with itself is its variance
Covariance (yt , y t ) = Var (y t )

Stationary in the past doesn't guarantee stationary in the future. There is always the possibility that a well
specified model will fail with the state of change in time. Models estimated with shorter time series are
usually more stable than those with longer time series because a longer sample period increases the chance
that the underlying economic process has changed. Thus, there is a trade off between the increased
statistical reliability when using longer time periods and the increased stability of the estimates when using
shorter periods.

lllll Autoregressive Model


When the dependent variable is regressed against one or more lagged values of itself, the resultant model is called an Autoregressive Model (AR). The order p is the number of lagged values included as independent variables.

BACKWARD   AR(1): xt = b0 + b1 xt-1 + εt,   t = 1, 2, ... T
           AR(2): xt = b0 + b1 xt-1 + b2 xt-2 + εt
           AR(p): xt = b0 + b1 xt-1 + b2 xt-2 + ... + bp xt-p + εt

FORWARD    AR(1): x̂t+1 = b̂0 + b̂1 xt
                  x̂t+2 = b̂0 + b̂1 x̂t+1
Calculating the successive forecasts in this way is referred to as the 'Chain Rule of Forecasting'.

This implies that multi-period forecasts are more uncertain than single period forecasts.
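A minimal sketch of the chain rule of forecasting for an AR(1); the coefficients are made up, and the successive forecasts converge toward the mean reverting level b0 / (1 - b1).

```python
def ar1_chain_forecast(b0, b1, x_t, horizon):
    """Forecast an AR(1) several periods ahead: each forecast reuses the previous one."""
    forecasts, x = [], x_t
    for _ in range(horizon):
        x = b0 + b1 * x
        forecasts.append(x)
    return forecasts

# Mean reverting level = 0.2 / (1 - 0.8) = 1.0; forecasts drift from 2.0 toward 1.0
print(ar1_chain_forecast(b0=0.2, b1=0.8, x_t=2.0, horizon=5))
```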

Testing Autoregressive Model (AR)

lllll Autocorrelation [Stationary]


If the residuals have significant autocorrelation, the AR model that produced the residuals is not the
best model for the time series being analyzed. We can estimate an AR model using ordinary least
squares (OLS) if the time series is covariance stationary and the errors are uncorrelated.
Unfortunately, our previous DW test statistic is invalid when the independent variables include past
values of the dependent variable. Residual autocorrelations drag to '0' as the no. of lags increases.

Step 1: Start with AR(1): xt = b0 + b1 xt-1 + εt

Step 2: Calculate the autocorrelations of the model's residuals, i.e. the level of correlation between the forecast errors from one period to the next.

ρk = Cov(xt, xt-k) / σ²x = E[(xt - μ)(xt-k - μ)] / σ²x     (when k = 1 this is Cov(xt, xt-1) / σ²x)

where 'E' stands for the expected value. Note that we have the relationship Cov(xt, xt-k) ≤ σ²x, with equality holding when k = 0. This means that the absolute value of ρk ≤ 1.

Sample estimate:  ρ̂k = Σ from t = k+1 to T of [(xt - x̄)(xt-k - x̄)]  /  Σ from t = 1 to T of (xt - x̄)²

Whenever we refer to autocorrelation without qualification, we mean autocorrelation of the time series itself rather than autocorrelation of the error term:

ρε,k = Cov(εt, εt-k) / σ²ε = E[(εt - 0)(εt-k - 0)] / σ²ε = E(εt εt-k) / σ²ε

Step 3: Test whether the autocorrelations are significantly different from 0. If the model is correctly specified, none of the autocorrelations will be statistically significant. To test for significance, a T-test is used to test the hypothesis that the correlations of the residuals are 0.

t = ρ̂(εt, εt-k) / (1/√n)     (autocorrelation divided by its standard error),   df = n - k - 1

H0: |autocorrelation| = 0     H1: |autocorrelation| > 0

lllll Unit Root: Dickey and Fuller


Remember, if an AR (1) model has a coefficient of 1, it has a 'unit root' and no finite mean reverting
level i.e. it is not covariance stationary. By definition, all Random Walks with or without a drift term
have unit roots. Dickey and Fuller (DF) transform the AR (1) model to run a simple regression.

AR(1): xt = b0 + b1 xt-1 + ε     (subtract xt-1 from both sides)

xt - xt-1 = b0 + b1 xt-1 - xt-1 + ε
xt - xt-1 = b0 + (b1 - 1) xt-1 + ε

Rather than directly testing whether the original coefficient is different from 1, they test whether
the new transformed coefficient (b1 - 1) is different from 0 using a modified T-test. In their actual
test, Dickey and Fuller use the variable 'g' = b 1 - 1.

H0: g = 0;  the time series has a unit root (not stationary)
H1: g < 0;  the time series doesn't have a unit root (stationary)
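A hedged sketch of the (augmented) Dickey-Fuller test on simulated data, assuming statsmodels' adfuller is available; a random walk should fail to reject H0, while a stationary AR(1) should reject it.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(3)
random_walk = np.cumsum(rng.normal(size=500))        # unit root: x_t = x_{t-1} + e_t

ar1 = np.zeros(500)
for i in range(1, 500):
    ar1[i] = 0.5 + 0.6 * ar1[i - 1] + rng.normal()   # covariance stationary, |b1| < 1

for name, series in [("random walk", random_walk), ("AR(1), b1 = 0.6", ar1)]:
    adf_stat, p_value = adfuller(series)[:2]
    print(name, round(adf_stat, 2), round(p_value, 3))  # small p-value -> reject H0 (no unit root)
```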

lllll Autoregressive Conditional Heteroskedasticity


Autoregressive Conditional Heteroskedasticity (ARCH) exists if the variance of the residuals in one
period is dependent on the variance of the residuals in a previous period. At times, however, this
assumption is violated and the variance of the error term is not constant. In such a situation, ARMA
models will be incorrect and our hypothesis tests would be invalid.

ARCH(1):  ε̂t² = a0 + a1 ε̂t-1² + μt          (μt is the error term of this auxiliary regression)
Forecast of next-period variance:  σ̂²t+1 = â0 + â1 ε̂t²

H0: a1 = 0;  the variance is constant from period to period
H1: a1 ≠ 0;  the variance increases (or decreases) over time, i.e. the error terms exhibit (conditional) heteroskedasticity

If a time series model has been determined to contain ARCH errors, regression procedures that correct
for heteroskedasticity such as Generalized Least Squares (GLS) must be used in order to develop a
predictive model. Otherwise, the standard error of the model's coefficients will be incorrect, leading to
invalid conclusions. Engle and other researchers have suggested many generalizations of the ARCH (1)
model which include ARCH (p) and generalized autoregressive conditional heteroskedasticity (GARCH)
models. GARCH models are similar to ARMA models of the error variance in a time series. Just like
ARMA models, GARCH models can be finicky and unstable.

In-Sample Forecasts: are made within the range of data, i.e. the sample time period.
Errors: (Yt - Ŷt)

Out-of-Sample Forecasts: are made outside of the sample period. Out-of-sample forecast accuracy is important because the future is always out-of-sample.
Errors: RMSE, i.e. Root Mean Squared Error (the square root of the average of the squared errors). The model with the smallest RMSE is judged most accurate.

Example 1:
To qualify as a covariance stationary process, which of the following doesn't have to be true?

A. Covariance(xt, xt-2) = Covariance(xt, xt+2)
B. E(xt) = E(xt+1)
✓ C. Covariance(xt, xt-1) = Covariance(xt, xt-2)

The covariance between any two observations an equal distance apart must be equal, e.g. the (t, t-2) pair and the (t, t+2) pair, so A and B must hold; the covariances at different lags (lag 1 vs. lag 2) need not be equal, so C doesn't have to be true.

Example 2:
Suppose the following model describes changes in the unemployment rate: ΔUERt = -0.0405 - 0.4674 ΔUERt-1.
The current change (first difference) in the unemployment rate is 0.03. Assume that the mean reverting level for changes in the unemployment rate is -0.0276.

a. What is the best prediction for the next change?   -0.0405 - 0.4674 (0.03) = -0.0545

b. What is the prediction of the change following the next change?
Using the chain rule of forecasting: -0.0405 - 0.4674 (-0.0545) = -0.0150

c. Explain your answer to Part 'b' in terms of equilibrium.
The answer to Part 'b' is quite close to the mean reverting level of -0.0276 (= -0.0405 / (1 - (-0.4674)) = -0.0405 / 1.4674). A stationary time series may need many periods to return to its equilibrium, mean reverting level.

Example 3:
Table below gives actual sales, log of sales and changes in the log of sales of Cisco Systems for the period
1Q: 2001 to 4Q: 2001.

Quarter / Yr.   Actual Sales   Log of Sales   Δ ln(Sales t)
1Q: 2001        6519           8.7825         0.1308
2Q: 2001        6748           8.8170         0.0345
3Q: 2001        4728           8.4613         -0.3557
4Q: 2001        4298           8.3659         -0.0954
1Q: 2002        x (4391)       x (8.3872)     x (0.0213)
2Q: 2002        x (4738)       x (8.4633)     x (0.0761)
(x = value to be forecast; the figures in parentheses are the solutions below.)

Forecast the first and second quarter sales of Cisco Systems for 2002 using the regression
Δ ln(Sales t) = 0.0661 + 0.4698 Δ ln(Sales t-1)

Step 1: Calculate forecast values of Δ ln(Sales t) with the help of the above regression.
1Q 2002: Δ ln Sales 1Q 2002 = 0.0661 + 0.4698 (-0.0954) = 0.0213
2Q 2002: Δ ln Sales 2Q 2002 = 0.0661 + 0.4698 (0.0213) = 0.0761

Step 2: Calculate the log of sales with the help of Step 1.
1Q 2002: ln Sales 1Q 2002 = 8.3659 + 0.0213 = 8.3872
2Q 2002: ln Sales 2Q 2002 = 8.3872 + 0.0761 = 8.4633

Step 3: Calculate actual sales.
1Q 2002: e^8.3872 = 4391
2Q 2002: e^8.4633 = 4738

Example 4:
Table below gives the actual change in the log of sales of Cisco Systems from 1Q: 2001 to 4Q: 2001, along with the forecasts from the regression model Δ ln(Sales t) = 0.0661 + 0.4698 Δ ln(Sales t-1), estimated using data from 3Q: 1991 to 4Q: 2000. Note the observations after the fourth quarter of 2000 are out-of-sample.

Date        Actual Values: Δ ln Sales t    Forecast Values: Δ ln Sales t
1Q: 2001    0.1308                         0.1357
2Q: 2001    0.0345                         0.1299
3Q: 2001    -0.3557                        0.1271
4Q: 2001    -0.0954                        0.1259

a. Calculate the RMSE for the out-of-sample forecast errors.

Date        Error       Squared Error
1Q: 2001    -0.0049     0.0000
2Q: 2001    -0.0954     0.0091
3Q: 2001    -0.4828     0.2331
4Q: 2001    -0.2213     0.0490
Sum                     0.2912
Average                 0.0728         RMSE = √0.0728 = 0.2698

b. Compare the forecasting performance of the model given with that of another model having an out-of-sample RMSE of 20%.

The model with the RMSE of 20% has greater accuracy in forecasting than the model in Part a, which has an RMSE of 27%.
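The RMSE arithmetic of Part a as a short NumPy sketch.

```python
import numpy as np

actual = np.array([0.1308, 0.0345, -0.3557, -0.0954])
forecast = np.array([0.1357, 0.1299, 0.1271, 0.1259])

errors = actual - forecast
rmse = np.sqrt(np.mean(errors ** 2))   # ≈ 0.2698, as above
print(round(rmse, 4))
```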

Example 5:
Based on the regression output below, the forecasted value of quarterly sales for March 2016 for PoweredUP
is closest to?

AR (1) Regression Output                      Quarterly Sales Data ($ bn)
                       Coefficient
Intercept              0.0092                 Dec 2015 (S t-1)    3.868
ln S t-1 - ln S t-2    -0.1279                Sept 2015 (S t-2)   3.780
ln S t-4 - ln S t-5    0.7239                 June 2015 (S t-3)   3.692
                                              Mar 2015 (S t-4)    3.836
                                              Dec 2014 (S t-5)    3.418

The quarterly sales for March 2016 are calculated as follows:

ln S t - ln S t-1 = b0 + b1 (ln S t-1 - ln S t-2) + b2 (ln S t-4 - ln S t-5)
ln S t - ln 3.868 = 0.0092 - 0.1279 (ln 3.868 - ln 3.780) + 0.7239 (ln 3.836 - ln 3.418)
ln S t = 1.35274 + 0.0092 - 0.1279 (0.02301) + 0.7239 (0.11538)
ln S t = 1.44251
S t = e^1.44251 = 4.231

Example 6:
David Brice, CFA has used an AR(1) model to forecast the next period's interest rate to be 0.08. The AR(1) has a positive slope coefficient. If the interest rate follows a mean reverting process with an unconditional mean, a.k.a. mean reverting level, equal to 0.09, then which of the following could be his forecast for two periods ahead?

✓ A. 0.081
B. 0.072
C. 0.113

As Brice makes more distant forecasts, each forecast will be closer to the unconditional mean, so the two-period forecast must lie between 0.08 and 0.09; therefore 0.081 is the only possible outcome.

lllll Random Walks [Non-Stationary]

Random Walk:
AR(1): xt = b0 + b1 xt-1 + εt with b0 = 0 and b1 = 1 (unit root), so xt = xt-1 + εt.
There is no mean reverting level for a random walk because b0 / (1 - b1) = 0 / (1 - 1) = 0 / 0 is undefined.

Random Walk with a Drift:
AR(1): xt = b0 + b1 xt-1 + εt with b0 ≠ 0 and b1 = 1 (unit root), so xt = b0 + xt-1 + εt.
The mean reverting level is undefined for a random walk with drift because b0 / (1 - b1) = b0 / 0.

(i) E(εt) = 0: The expected value of each error term is 0.
(ii) E(εt²) = σ²: The variance of the error term is constant.
(iii) E(εi εj) = 0 if i ≠ j: There is no serial correlation in the error terms.

For a time series that is not covariance stationary, the least squares regression procedure that we have been
using to estimate an AR (1) model will not work without transforming the data.

b1 < 1: Stationary
b1 = 1: Non-Stationary 'Unit Root'
b1 > 1: 'Explosive Root'

Testing Random Walk Model

lllll First Differencing


If we believe a time series is a random walk, i.e. has a unit root, we can transform the data to a covariance stationary time series using a procedure called 'First Differencing', because it subtracts the value of the time series in the prior period from the current value of the time series. Note that by taking first differences, you model the change in the value of the dependent variable.

Model the first difference yt = xt - xt-1 (= Δx) with an AR(1): yt = b0 + b1 yt-1 + εt.
If b0 = b1 = 0, then yt = εt, i.e. xt - xt-1 = εt.

This transformed time series has a finite mean reverting level of b0 / (1 - b1) = 0 / (1 - 0) = 0 and is therefore covariance stationary.
(Figure: a series with a linear upward trend becomes flat after first differencing, yt = xt - xt-1; this is how first differencing removes the upward trend.)

lllll Seasonality
Seasonality in a time series is a pattern that tends to repeat from year to year. One example is monthly
sales data for a retailer. Given that sales data normally vary accordingly to the time of year, we might
expect this month's sales (x t ) to be related to sales for the same month last year (x t-12 ). To adjust for
seasonality in an AR model, an additional lag of the dependent variable (corresponding to the same
period in the previous year) is added to the original model as another independent variable. For
example, if quarterly data are used, the seasonal lag is 4; if monthly data is used, the seasonal lag is 12.

Suppose, for example, we model a particular quarterly time series using an AR(1) model, xt = b0 + b1 xt-1 + εt. If the time series had significant seasonality, this model would not be correctly specified. The seasonality would be easy to detect because the seasonal autocorrelation (in the case of quarterly data, the 4th autocorrelation) of the error term would differ significantly from 0. Suppose this quarterly model has significant seasonality. In this case, we might include a seasonal lag in the autoregressive model and estimate xt = b0 + b1 xt-1 + b2 xt-4 + εt, to test whether including the seasonal lag in the autoregressive model eliminates statistically significant autocorrelation in the error term.

lllll Moving Average Time Series Model


Suppose you are analyzing the long-term trend in the past sales of a company. In order to focus on the
trend, you may find it useful to remove short-term fluctuations or noise by smoothing out the time
series of sales. One technique to smooth out period-to-period fluctuations is by:

n-period moving average = (xt + xt-1 + ... + xt-n+1) / n

A Moving Average always lags large movements in the actual data, i.e. it rises slowly and falls slowly. Though often useful in smoothing out a time series, it may not be the best predictor of the future. A main reason for this is that a simple moving average gives equal weight to all the periods in the moving average.

Example 7:
Suppose we want to compute the four-quarter moving average of AstraZeneca's sales as of the beginning of
the first quarter of 2012. AstraZeneca's sales in the previous four quarters were 1Q 2011: $ 8490 m, 2Q 2011:
$8601 m, 3Q 2011: $ 8405 m and 4Q 2011: $ 8872 m.

The four-quarter moving average as of the beginning of 1Q 2012 = (8490 + 8601 + 8405 + 8872) / 4 = $8592 m
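A small NumPy sketch of the four-quarter average above, plus a generic trailing moving-average helper (the helper is an illustrative assumption, not from the source).

```python
import numpy as np

sales = np.array([8490, 8601, 8405, 8872])   # last four quarters of 2011
print(sales.mean())                          # 8592.0

def moving_average(x, n):
    """Trailing n-period moving average; note that it lags large moves in the data."""
    weights = np.ones(n) / n
    return np.convolve(np.asarray(x, dtype=float), weights, mode="valid")
```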

MA(1):  xt = εt + θ εt-1,   with E(εt) = 0, E(εt²) = σ², Cov(εt, εs) = E(εt εs) = 0 for t ≠ s

The MA(1) moving average model places different weights on the two terms in the moving average (1 on εt and θ on εt-1). Because the expected value of xt is 0 in all periods and εt is uncorrelated with its own past values, the first autocorrelation is not equal to 0, but the second and higher autocorrelations are equal to 0. Further analysis shows all autocorrelations except for the first will be equal to 0 in an MA(1) model. Thus, for an MA(1) process, any value xt is correlated with xt-1 and xt+1 but with no other time series values; we could say that an MA(1) model has a memory of one period.

MA(q):  xt = εt + θ1 εt-1 + ... + θq εt-q,   with the same conditions on εt

For an MA(q) model, the first 'q' autocorrelations will be significantly different from 0 and all autocorrelations beyond that will be equal to 0; an MA(q) model has a memory of 'q' periods.

MA(0):  xt = μ + εt,   with the same conditions on εt

MA(0) is a time series in which we allow the mean to be non-zero, which also means that the time series is not predictable.

ARMA(p, q):  xt = b0 + b1 xt-1 + ... + bp xt-p + εt + θ1 εt-1 + ... + θq εt-q     (AR part + MA part)

'b1, b2 ... bp' are the autoregressive parameters and 'θ1, θ2 ... θq' are the moving average parameters. Estimating and using ARMA models has several limitations. First, the parameters in ARMA models can be very unstable. Second, choosing the right ARMA model is more of an art than a science; the criteria for deciding on 'p' and 'q' for a particular time series are far from perfect. Moreover, even after a model is selected, that model may not forecast well. Third, even some of the strongest advocates of ARMA models admit that these models should not be used with fewer than 80 observations, and they don't recommend using ARMA models for predicting quarterly sales or gross margins for a company using even 15 years of quarterly data.

lllll Cointegration
Occasionally an analyst will run a regression using two time series i.e. time series utilizing two different
variables. For example, using the market model to estimate the equity beta for a stock, an analyst
regresses a time series of the stock's returns (y t ) on a time series of returns for the market (x t ).
Cointegration means that two time series are economically linked (related to the same macro variable)
or follow the same trend and that relationship is not expected to change.

y t = b0 + b 1 x t + ε

whereas,
y t : Value of time series y at time t
xt : Value of time series x at time t
24

If neither data series has a unit root : VALID


If only one data series has a unit root: INVALID
If both data series have unit root & they are cointegrated : VALID
If both data series have unit root & they are not cointegrated : INVALID

Fig 3: Time Series Analysis Summary


25

CHAPTER 7 Machine Learning

Machine Learning (ML) refers to computer programs that learn from their errors and refine predictive models
to improve their predictive accuracy over time. Machine Learning is one method used to extract useful
information from Big Data. An elementary way to think of ML algorithms is to 'find the pattern, apply the
pattern'. ML techniques are better able than statistical approaches to handle problems with many variables
i.e. high dimensionality or with a high degree of non-linearity. ML algorithms are particularly good at
detecting change, even in high non-linear systems because they can detect the preconditions of a model's
break or anticipate the probability of a regime switch. ML is broadly divided into 3 distinct classes of
techniques: (a) Supervised Learnings, (b) Unsupervised Learning and (c) Deep Learning. Machine Learning is
the science of making computers learn and act like humans by feeding data and information without being
explicitly programmed.

The data set is typically divided into three non overlapping samples:
(i) Training Sample: used to train the model. } In-Sample Data
(ii) Validation Sample: for validating and tuning the model.
} Out-of-Sample Data
(iii) Test Sample: for testing the model's ability to predict well on new data.

To be valid and useful, any supervised machine learning model must generalize well beyond the training data.
The model should retain its explanatory power when tested out-of-sample.

Fig 1: Types of Data Fits

Underfitting Overfitting Good Fit

Y o ▲ Y

Y
o ▲
▲ ▲ o o ▲
o
o ▲ o ▲ ▲ o o ▲
o ▲ o
▲ o ▲ ▲
o ▲ ▲
o

o
o o ▲ o o o ▲
o
o ▲
▲ ▲ ▲
o o o
o ▲
▲ ▲ o

x x x

Underfitting is similar to making a Think of Overfitting as tailoring a Robust filling, the desired result
baggy suit that fits no one. It also custom suit that fits only one is similar to fashioning a
means that the model doesn't person. An ML algorithm that fits universal suit that fits all similar
capture the relationships in the the training data too well, will people. A Good Fit or Robust
data. The graph shows four typically not predict well using Model fits the training in-sample
errors in this underfit model (3 new data. The model begins to data well and generalize well to
misclassified circles and 1 incorporate noise coming from out-of-sample data, both within
misclassified triangle). quirks or spurious correlations; it acceptable degrees of error. The
mistakes randomness for graph shows that the good fitting
patterns and relationships. The model has only 1 error, the
algorithm may have memorized misclassified circle.
the data rather than learn from it,
so it has perfect hindsight but no
foresight. The graph shows no
errors in this overfit model. As
models become more complex,
overfitting risk increases.
26

∴ The evaluation of any ML algorithm thus focuses on its prediction error on new data rather than on its
goodness of fit on the data in which the algorithm was fitted i.e. trained.

A Learning Curve plots the accuracy rate = 1 - Error Rate, in the validation or test sample i.e. out-of-sample
against the amount of data in the training sample i.e. in sample, so it is useful for describing under and
overfitting as a function of bias and variance errors. Low or no in-sample error but large out-of-sample error
are indicative of poor generalization. Data scientists decompose the total out-of-sample error into 3 sources:
(i) Bias Error, (ii) Variance Error and (iii) Base Error.

Fig 2: Types of Errors

Bias Error Variance Error Base Error

Bias Error or the degree to which Variance Error or how much the Base Error due to randomness in
a model fits the training data. model's results change in the data. (Out-of-sample
Algorithms with erroneous response to new data from accuracy increases as the
assumptions produce high bias validation and test samples. training sample size increases)
with poor approximation, causing Unstable models pick up noise
underfitting and ↑
in-sample and produce high variance
error. (Adding more training causing overfitting and ↑
out-
samples will not improve the of-sample error.
model)

Accuracy Accuracy Accuracy


Rate Rate Rate

0 0 0
No. of Training Samples No. of Training Samples No. of Training Samples

Linear functions are more Non-Linear functions are more


susceptible to bias error and prone to variance error and
underfitting. overfitting.

Desired Accuracy Rate


Training Accuracy Rate
Validation Accuracy Rate

Fig 3: Fitting Curve shows Trade-off between Bias and Variance Errors and Model Complexity

Model Error Optimal 'Finding the optimal point (managing overfitting


(E-in, E-out) Complexity Total Error
risk) - the smart spot just before the total error
E-in: In
Sample
rate starts to rise due to increasing variance error
Error - is a core part of the machine learning process
E-out: Out and the key to successful generalization'
of Sample

↑ ↓ and
Error
As Complexity in Training sets, E-in
Bias Error shrinks.
Variance
Bias Error
↑ ↑ and
Error
As Complexity in Test sets, E-out
0 Variance Error rises.
Model Complexity
(Out of Sample Error
rates are also a function
of model complexity)
27

Preventing Overfitting in Supervised Machine Learning:

1. Ocean's Razor: The problem solving principle that the simplest solution tends to be the correct one. In
supervised ML, it means preventing the algorithm from getting too complex during selection and training
by limiting the no. of features and penalizing algorithms that are too complex or too flexible by
constraining them to include only parameters that reduce out-of-sample error.

2. K-Fold Cross Validation: This strategy comes from the principle of avoiding sampling bias. The challenge is
having a large enough data set to make both training and testing possible on representative samples.

① For example, imagine that this ② Using ML lingo, we used the data to:
column represents all of the data (a) Train the ML methods
A
we have collected about people (b) Test the ML methods
with or without heart disease.

B 75% ③ Reusing same data for both training & testing is a bad
. x x . ..
x

idea because we need to know how the method will


work on data it wasn't trained on. A slightly better idea
would be to use the first 75% of data for training & test
C 25% of the data for testing. But how do we know that
using the first 75% of the data for training and the last
25% of the data for testing is the best way to divide up x ... .. x
the data? Rather than worry too much about which
block would be best for testing, cross validation uses
0
. Training (Tr)
D 25% x Testing (T)
them all, one at a time and summarizes the results at
the end.

(ABC Tr, D T) (BCD Tr, A T) (CDA Tr, B T) (DAB Tr, CT)



4 Times = K
Note that K is typically set at 5 or 10; In this case it is 4. This process is then repeated 'K' times, which helps
minimize both bias and variance by insuring that each data point is used in the training set 'K-1' times and
in the validation (or testing) set once. The average of the K validation errors (E-val) is then taken as a
reasonable estimate of the model's out-of-sample error (E-out).

Tag variable can


Target Variable or Tag Variable i.e. Dependent Variable (Y-variable) be continuous,
Feature i.e. Independent Variable (X-variable) categorical or
ordinal.

lllll Supervised Learning


Supervised Learning uses labelled training data to guide the ML program in achieving superior
forecasting accuracy. To forecast earnings manipulators, for example, a large collection of attributes
could be provided for known manipulators and for known non-manipulators. A computer program
could then be used to identify patterns that identify manipulators in another dataset. Typical data
analytics tasks for supervised learning that include prediction (i.e. regression) and classification. When
the Y-variable is continuous, the appropriate approach is that of regression. When the Y-variable is
categorical (i.e. belonging to a category or classification) or ordinal (i.e. ordered or ranked), a
classification is used.

lllll Penalized Regression


A special case of generalized linear model (GLM) is Penalized Regression. Penalized Regression
models seek to minimize forecasting errors by reducing the problem of overfitting. To reduce the
problem of overfitting, researchers may impose a penalty based on the no. of features used by the
model. Penalized Regression includes a constraint such that the regression coefficients are chosen
28

to minimize the SSE plus a penalty term that increases in size with the no. of included features. The
greater the no. of included features (i.e. variables with non-zero coefficients), the larger the penalty
term. Therefore, penalized regression ensures that a feature is included only if the SSE declines by
more than the penalty term increases. All types of penalty regression involve a trade-off of this
type. Imposing such a penalty can exclude features that are not meaningfully contributing to
out-of-sample prediction accuracy i.e. it makes the model more parsimonious. Therefore, only the
more important features for explaining Y will remain in the penalized regression model.
λ (Lamda) a.k.a.
hyperparameter, determines
n ^ 2 k ^ how severe the penalty is
LASSO Penalized Regression = Σ (Yi - Y i ) + λ Σ | b k |
i=1 K=1 λ > 0. It also determines the
balance between fitting the
SSE (OLS) Penalty Term model vs. keeping the model
parsimonious.

In one popular type of penalized regression, LASSO (Least Absolute Shrinkage and Selection
Operator) aims to minimize the SSE and the sum of the absolute value of the regression
coefficients. When using LASSO or other penalized regression techniques, the penalty term is added
only during the model building process and not once the model has been built.

lllll Support Vector Machine


Support Vector Machine (SVM) is a powerful supervised algorithm used for classifications,
regressions and outlier detection. SVM is a linear classifier that determines the hyperplane that
optimally separates the observations into two sets of data points. The intuitive idea behind the SVM
algorithm is maximizing the probability of making a correct prediction (here, that an observation is
a triangle or a cross) by determining the boundary that is furthest away from all the observations.
The general term for a n-dimensional hyperplane, with n = 1D is called Line, n = 2D is called Plane
and n = 3D is called space.
Threshold

Y
x x x
x

}
}

Margin (a.k.a Maximal Margin Classifier)


Support
▲ x x
Vectors
▲ x x
▲ ▲ x
x
▲ ▲
0

x

The shortest distance between the observations and the threshold is called 'Margin'. When we use
the threshold that gives us the largest margin to make classification, we call that a 'Maximal Margin
Classifier' (MMC). MMC are super sensitive to outliers in the training data, which makes them pretty
lame! The margin is determined by the observations are called Support Vectors. Some observations
may fall on the wrong side of the boundary and be misclassified by the SVM algorithm. Choosing a
threshold that allows misclassification is an example of the 'Bias/Variance Trade-off' that plagues all
of ML. When we allow misclassification, the distance between the observation and the threshold is
called 'Soft Margin Classification', which adds a penalty to the objective function for observations in
the training that are misclassified. In essence, the SVM algorithm will choose a discriminant
boundary that optimizes the trade-off between a wider margin and a lower total error penalty. As
an alternative to soft margin classification, a non-linear SVM algorithm can be run by introducing
more advanced non-linear separation boundaries. These algorithms will reduce the no. of
misclassified instances in the training datasets but will have more features, thus adding to the
model's complexity.
29

lllll K-Nearest Neighbor


K-Nearest Neighbor (KNN) is a supervised learning technique most often used for classification and
sometimes for regression. The idea is to classify a new observation by finding similarities 'nearness'
between this new observation and the existing data.

KNN with new observation, K = 1 KNN with new observation, K = 5

Y x Y x
x x
△ △ x
◆ x
x ◆ x
△ △ x
△ △ x △ △ x

△ x △ x
△ △
△ △
0 x 0 x

The diamond (observation) needs to be classified as belonging to either the cross or the triangle
category. If K = 1, the diamond will be classified into the same category as its nearest neighbor (i.e.
triangle in the left panel), whereas if K = 5, the algorithm will look at the diamond's 5 nearest
neighbors, which are 3 triangles and 2 crosses. The decision rule is to choose the classification with
the largest no. of nearest neighbors, out of 5 being considered. So, the diamond is again classified
as belonging to the triangle category.

KNN is a straightforward intuitive model that is still very powerful because it is non-parametric; the
model makes no assumption about the distribution of the data. Moreover, it can be used directly
for multi-class classification. A critical challenge of KNN however, is defining what it means to be
'similar' (or near). The choice of a correct distance measure may be even more subjective for ordinal
or categorical data. KNN results can be sensitive to inclusion of irrelevant or correlated features, so
it may be necessary to select features manually. If done correctly, this process should generate a
more representative distance measure. KNN algorithms tend to work better with a small no. of
features.

Finally, the number K, the hyperparameters of the model, must be chosen with the understanding
that different values of K can lead to different conclusions. If K is an even number, there may be ties
and no class classification. Choosing a value for K that is too small would result in a high error rate
and sensitivity to local outliers, but choosing a value for K that is too large would dilute the concept
of nearest neighbors by averaging too many outcomes. For K, one must take into account the no. of
categories and their partitioning of the feature space.

lllll Classification and Regression Tree


Regression trees are appropriate
when the target is continuous and
Classification trees are appropriate
Yes No
when the target variable is
categorical. Most commonly,
Yes No Yes No Classification and Regression
Trees (CART) are applied to binary
classification or regression. Such a
Yes No Yes No
classification requires a binary tree
a combination of an initial root
node, decision nodes and terminal
nodes. The root node and each
30

decision node represent a single feature (f) and a cut-off value (c) (e.g. X1 > 10%) for that feature.
The CART algorithm chooses the feature and the cut-off value at each node that generates the
widest separation of the labeled data to minimize Classification error (e.g. by a criterion such as
MSE, Mean or Average) or Regression error (e.g. by a criterion such as Mode or Class). Every
successive classification should result in a lower estimation error than the nodes that predicted it.
The tree stops when the error can no longer be reduced further resulting in a terminal node.

E.g: Classifying companies by whether or not they increase their dividends to shareholders:

↷ Partitioning of the
Decision Tree Feature Space feature (X1, X2)

X2
Initial Root X1≤5% X1>5%
↷ Node
+
— + +
+ + X2>20%
No Yes — +
+
Decision
X2>10% — +
Nodes + +


No Yes No Yes

X2≤20%

No Yes


Terminal X2≤10% —
Nodes
X1
0

X1≤10% X1>10%

X1: Investment Opportunities Growth (IG)


X2: Free Cashflow Growth (FCFG)

If the goal is regression, then prediction at each terminal node is the mean of the labeled values.
If the goal is classification, then prediction of the algorithm at each terminal node will be mode.
For example, in the feature space, representing IG (X1 > 10%) and FCFG (X2 > 20%) contains 5
crosses. So a new company with similar features will also belong to the cross (dividend increase)
category.

CART makes no assumptions about the characteristics of the training data, so if left unconstrained,
potentially it can perfectly learn the training data. To avoid such overfitting, regularization
parameters can be added such as the maximum depth of the tree, the minimum population at a
node or the maximum no. of decision nodes. Alternatively, regularization can occur via a 'pruning'
technique that can be used later on, to reduce the size of the tree i.e. sections of the tree that
provide little classifying power are pruned or removed. By its iterative structure, CART is a powerful
tool to build expert systems for decision-making processes. It can induce robust rules despite noisy
data and complex relationships between high no. of features.

lllll Ensemble Learning


Instead of basing predictions on the results of a single model, why not use the predictions of a
group or an ensemble of models? Every single model will have a certain error rate and will make
noisy predictions. But by taking the average result of many predictions from many model, we can
expect to achieve a reduction in noise as the average result converges towards a more accurate
prediction. This technique of combining the predictions from a collection of models is called
Ensemble Learning and the combination of multiple learning algorithm is known as the Ensemble
Method.
31

Ensemble learning typically produces more accurate and more stable predictions than one best
model. Ensemble learning can be divided into two main categories:

Voting Classifiers
An ensemble method can be an aggregation of heterogeneous learners i.e. different types of
algorithms combined together with a voting classifier.
There is an optimal no. of models
beyond which performance would be
expected to deteriorate from overfitting.

A majority-vote classifier will assign to a new data point the predicted label with most votes. The
more individual models you have trained, the higher the accuracy of predictions.

Bootstrap Aggregating
An ensemble method can be an aggregation of homogeneous learners i.e. a combination of the
same algorithm, using different training data that are based on a bootstrap aggregating i.e. bagging
technique.
Random with replacement
↷ n: no. of training instances
n': no. of instances in a 'bag'
m: no. of bags

Usually, n' < n

1 2 m

(Train) (Train) (Train)

Alternatively, one can use the same machine learning algorithm but with different training data.
Bootstrap aggregating or Bagging is a technique whereby the original training dataset is used to
generate 'n' new training datasets or bags of data. Each new bag of data is generated by random
sampling with replacement from the initial training set. The algorithm can now be trained on 'n'
independent datasets that will generate 'n' new models. Then for each new observation, we can
aggregate the 'n' predictions using a majority vote classifier for a classification or an average for a
regression. Bagging is a very useful technique because it helps to improve the stability of
predictions and protects against overfitting the model.

lllll Random Forest


A Random Forest classifier is a collection of a large no. of decision trees trained via a bagging
method or by randomly reducing the no. of features available during training. A random forest is a
collection of randomly generated classification trees from the same dataset. A randomly selected
32

subset of features is used in creating each tree and each tree is slightly different from the others.
The process of using multiple classification trees uses crowdsourcing (majority wins) in determining
the final classification. Because each tree only uses a subset of features, random forests can
mitigate the problem of overfitting. Using random forests can increase the signal-to-noise ratio
because errors across different trees tend to cancel each other out.

Step 1: Create a 'Bootstrapped' dataset.

Original Dataset Bootstrapped Dataset


It's the same
as the 3rd
Step 2: Create a decision tree using the bootstrapped dataset, but only use a random subset of
variables or columns at each step.
We will consider X2 We'll focus on
or X3 as initial other variables as
node (random decision nodes, like Bootstrapped Dataset
choice) X1 or X4

Step 3: Now go back to Step 1 and repeat: Make a new bootstrapped dataset and build a tree
considering a subset of variables at each step. Ideally, you'd do this 100s of times, using a
bootstrapped sample and considering only a subset of the variables at each step results in
a wide variety of trees (100 trees). The variety is what makes random forests more effective
than individual decision trees.

Step 4: Now that we've created a random forest, how do we use it? Well, first we get a new patient
with measurements (X1, X2, X3 and X4) and now we want to know if they have a 'Heart
Disease or No' (Y). So we take the data and run it down the first tree we made. The 1st tree
says 'Yes'. Now we run the data down the second tree; the 2nd tree also says 'Yes'. Then we
repeat for all 100 trees we made. After running the data down of all the trees in the random
forest, we see which option received more votes. In this case, 'Yes' received the most votes,
so we will conclude that this patient has Heart Disease.

However, an important drawback of random forest is that it lacks the case of interpretability of
individual trees, as a result it is considered a relatively 'Black-Box' type algorithm i.e. difficult to
understand the reasoning behind their outcomes and thus to foster in them.
33

lllll Unsupervised Learning


Unsupervised Learning is machine learning that doesn't make use of labelled data. In Unsupervised
learning, we have inputs (X's) that are used for analysis without any target being (Y) supplied. In the
absence of any tagged data, the program seeks out structure of interrelationships in the data.

lllll Principal Component Analysis


Principal Component Analysis (PCA) or Dimension Reduction (seeks to remove the noise i.e.
attributes that don't contain much information) focuses on reducing the no. of features while
retaining variation across observations to preserve the information contained in that variation. PCA
is used to summarize or reduce highly correlated features of data into few main uncorrelated
composite variables (composite variable is a variable that combines two or more variables that are
statistically strongly related to each other).

Step 1: For example, the samples could be 'Blood Samples' in a lab and the variables could be 'DNA'

With 1 DNA With DNA 1 and DNA 2


(ii) The average
DNA 2 measurement (iii) With average
(x2 ) for DNA 2 volumes, we can

calculate the
Sample 4, 5, 6
are similar
Sample 1, 2, 3
are similar
③ center of the (x*)


DNA 1 ⑥⑤④ ③ ①② x2

x*

Low Values High Values




DNA 1

0
(i) We'll calculate x1
the average (x1 )
measurement for
DNA 1

We'll also talk about how PCA can tell us which DNA or variable is the most valuable for
clustering the data.

Step 2: Now we'll shift the data so that the center (x*) is on top of origin (0,0) in the graph. Note,
shifting the data did not change how the data points are positioned relative to each other.
DNA 2

PC 1
How do we
determine if this is a
good fit or not?

x* DNA 1
(0,0)
34

Step 3: We need to talk about how PCA decides if a fit is good or no? If best fit, then the PCA can
either minimize ('b') the distance to the line or maximize ('c') the distance from the projected
point to the origin.
PC 1 (not a good fit) Pythagoras Theorem
DNA 2
△ Note a2 = b2 + c2


b Intuitively, it makes But, it's actually
PC 1 (good fit) sense to minimize 'b', easier to calculate 'c',

x
x c
the distance from the the distance from
↶ refer. Step 2 point to line. 'b' is the the projected point

a
vertical distance to the origin. So PCA
between the data finds the best fitting
DNA 1 x* point and PC1 by maximizing the
representing a sum of the squared
The distance between 'Projection Error'. distance

x
[SS (Distances)] from
each data point in the
direction that is parallel
↶ the projected points
to the origin.

x
to PC 1 represents the
spread or variation of
the data along PC 1

x
We can pick 'b' or 'c'. For now as per 'c's' principles; PCA measures 6 'c' distances.

(Conclusions
derived from d 21 + d 22 + d23 + d24 + d 25 + d 26 = Sum of Squared Distances
both 'b' or 'c'
or SS (Distances)
will be same)

△ Note We keep rotating the line until we end up with the line with the largest SS (Distances)
between the projected points and the origin. PCA calls such a line with largest SS (Distance)
as the 'best fit' line a.k.a. Eigenvalue for PC 1

SS (Distances for PC1 ) = Eigenvalue for PC 1


= Eigenvalue for PC 1 = Singular Value for PC 1

Step 4: Linear Combinations Mathematicians call


this cocktail recipe a
DNA 2 'Linear Combination'

of DNA 1 & 2
To make PC 1 :
PC 1
Take 4 parts of DNA 1 and
1 part of DNA 2
4.1x
2 ∴
( DNA 1 is more important than DNA 2)
1
DNA 1 x*
4
2 2 2
a = b + c
x 2 = 42 + 1 2 = 4.12
Propositions of each
DNA is called We can scale the length of PC 1 to 1 by
'Loading Score' dividing by 4.12

This 1 unit long vector is called 4.12 = 1 1 =


Eigenvector for PC1 or Singular for PC 1 4.12
x 4.12
0.242

Informally, PCA involves transforming the After standardizing: 4 = 0.97


4.12
covariance matrix of the features and involves Take 0.97 parts of DNA 1 and
0.242 parts of DNA 2
two key concepts: Eigenvectors and Eigenvalues. ∴
( DNA 1 is more important than DNA 2)
The Eigenvectors define new mutually uncorrelated
composite variables that are linear combinations of the original features. As a vector, an
Eigenvector also represents a direction. Associated with each Eigenvector is an Eigenvalue.
PCA selects as the first principal component the Eigenvector that explains the largest
proportion of variation in the dataset (the Eigenvector with the largest Eigenvalue). An
Eigenvalue gives the proportion of total variance in the initial data that is explained by each
Eigenvector.
35

Step 5: The next largest portion of the remaining variance is best explained by PC 2 , which is at right
angles to PC 1 and thus is uncorrelated with PC1 . Since it is a 2D graph, PC 2 is simply a line
through the origin that is perpendicular to PC1 , without any further optimization that has to
be done.
DNA 2
PC 2
PC 1
Now calculate the
Eigenvector and Eigenvalue for PC 2

DNA 1 - 0.242 for DNA 1


0.97 for DNA 2

( ∴ DNA 2 is 4 times more important than DNA 1)

Step 6: We can convert the Eigenvalues or SS (Distances) into variation around the origin (0,0) by
dividing by the sample size minus 1 i.e. (n - 1).

SS (Distances for PC1 ) = Variation for PC 1


n-1

SS (Distances for PC2 ) = Variation for PC 2


n-1

For the sake of example, imagine that Variation for PC 1 15 15 / 18 = 83%


Variation for PC 2 3 3 / 18 = 17%
Total Variation for PC 18 = 100%

Step 7: Scree Plot i.e. It shows the proportion of total variance in the data explained by each
principal component (PC).

The first factor in PCA would be the


80 83%
most important factor in explaining
PC 1 and PC 2 account for more than
the variation across observations.
The second factor would be the 90% of variation, so we can just use
second most important and so, up 40 these to draw a 2D graph.
to the no. of uncorrelated factors
specified by the researcher.
17%
0
PC 1 PC2

If the scree plot looked like this, 40


where PC 3 and PC4 account for a
substantial amount of variation then,
20
using just the first 2 PCs would not
create a very accurate representation
of the data.
0
PC1 PC 2 PC3 PC 4

The main drawback of PCA is that since the PCs are combinations of the dataset's initial features,
they typically cannot be easily labeled or directly interpreted by the analyst. Compared to modelling
36

data with variables that represent well-defined concepts, the end user of PCA may perceive PCA as
something of a 'Black-Box'. Reducing the no. of features to the most relevant predictors is very
useful.

lllll Clustering
Given a dataset, Clustering is the process of grouping observations into categories based on
similarities in their attributes i.e. meaning that the observations inside each cluster are similar or
close to each other, a property known as 'cohesion' and the observations in two different clusters
are as far way from one another or are as dissimilar as possible, a property known as 'separation'.

▲▲ ▲ ▲ ▲ ▲
▲ ▲ ▲
▲ ▲ ▲ ▲ ▲ ▲
▲ ▲ ▲
▲ ▲ ▲
▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲
▲ ▲ ▲ ▲ ▲ ▲
▲ ▲ ▲
▲ ▲ ▲ ▲ ▲ ▲

K-Means Clustering
K-Means is a relatively old algorithm that repeatedly partitions observations into a fixed number 'K',
of non-overlapping clusters. The number of clusters 'K' is a model hyperparameter - a parameter
whose value must be set by the researcher before learning begins. Each cluster is characterized by
its 'Centroid' i.e. center and each observation is assigned by the algorithm to the cluster with the
centroid to which that observation is closest. The K-means algorithm follows an iterative process.

S1 S2 S3

1 2 3

C C C
1 1 1
C
3
C
3
C
3

C2 C C
2 2

S4 S5 S6

C
1
C C
C 1 C 1 C
3 3 3

C C C
2 2 2
37

The K-means algorithm will continue to iterate until no observation is reassigned to a new cluster
i.e. no need to recalculate new centroids. The algorithm will then be converged and reveal the final
K clusters with their member observations. The K-means algorithm has minimized inter-cluster
distance (thereby maximizing cohesion) and has maximized inter-cluster distance (thereby
maximizing separation) under the constraint that K = 3. The K-means algorithm is fast and works
well on very large datasets with hundreds of millions of observations. However, the final
assignment of observations to clusters can depend on the initial location of the centroids. One
limitation of this technique is that the hyperparameters 'K', the no. of clusters in which to partition
the data, must be decided before K-means can be run.

Hierarchical Clustering
Hierarchical Clustering is an iterative procedure used to build a hierarchy of clusters. In Hierarchical
Clustering, the algorithms create intermediate rounds of clusters of increasing (agglomerative) or
decreasing (divisive) size until the final clustering is reached. This process creates relationships
among the rounds of clusters. It has the advantage of allowing the investment analyst to examine
alternative segmentations of data of different granularity before deciding which one to use.

(a) Agglomerative Clustering


I
We start with one observation G
H 5
as it's own cluster and add other K
4 6
8
similar observations to that J
9
cluster. C D 2
7
F
3
B E
1
A

(b) Divisive Clustering


It starts with one giant cluster 5
and then partitions that cluster 4
6
into smaller and smaller 9
8

clusters. 2

7
3
1

Agglomerative Clustering Divisive Clustering


[Bottom-Up] [Top-Down]

- Agglomerative method is well - Divisive method is better suited


suited for identifying small for identifying large clusters.
clusters.

- Agglomerative method makes - Divisive method starts with a


decision based on local patterns holistic representation of the data.
without initially accounting for the
global structure of the data.

Example 1:
A category of general linear regression (GLS) models that focuses on reduction in the total no. of features
used is best described as?
A. Clustering Model
B. Dimension Reduction Model
✓ C. Penalized Regression Model
38

Example 2:
We apply ML techniques to a model including fundamental and technical variables (features) to predict next
quarter's return for each of the 100 stocks currently in our portfolio. Then, the 20 stocks with the lowest
estimated return are identified for replacement. The ML techniques appropriate for executing Step 1 are
most likely to be based on:
✓ A. Regression Because target variable (quarterly return) is continuous.
B. Classification
C. Clustering

Dendrogram
A type of diagram for visualizing a hierarchical cluster analysis known as a Dendrogram, highlights
the hierarchical relationships among the clusters.

Distance Arch
Refer Agglomerative Clustering:
Clusters are represented by a
9 Dendrite
horizontal line - the 'Arch',
which connects two vertical
0.07 lines called 'Dendrites', where
7 8 the height of each arch
represents the distance
0.05 between the two clusters being
considered. Shorter dendrites
represent a shorter distance
0.03 (and greater similarity) between
1 3 5 6 clusters. The horizontal dashed
lines cutting across the
0.01 dendrites show the no. of
2 4
clusters into which the data are
Clusters split at each stage.
0 A B C D E F G H I J K

2 Clusters 6 Clusters 11 Clusters

lllll Neural Networks, Deep Learning Nets and Reinforcement Learning


These sophisticated algorithms can address highly complex machine learning tasks such as Image
Classification, Face Recognition, Speech Recognition and Natural Language Processing. These
complicated tasks are characterized by non-linearities and interactions between large no. of feature
inputs.

lllll Neural Networks


Neural Networks also called Artificial Neural Networks (ANNs) are highly flexible type of ML
algorithm that have been successfully applied to a variety of tasks characterized by non-linearities
and complex interactions among features.

Input Layer Hidden Layer Output Layer


(4 Features) (1 Target)

Input 1
Linking the information in the input layer to multiple
nodes in the hidden layers, each with its own activation
Input 2 function, allows the neural network to model complex
non-linear functions to use the information in the input
Links > variables well. The nodes in the hidden layer transform
Input 3 the inputs in a non-linear fashion into new values that
are then combined into the target value. This structure
4, 5 and 1 is set by researcher and referred to as the
hyperparameters of the neural network.
Input 4
Nodes
Neurons
39

Each node has conceptually two functional parts:


(i) Summation Operator: Once the node receives the four input values, the summation operator
multiplies each value by a weight and sums the weighted values to form the total net input. The
total net input is then passed to the activation function, which transforms this input into the final
output of the node.
1 ① (ii) Activation Function: Informally, the activation function operates like a light dimmer switch that
decreases or increases the strength of the input. The activation function is characteristically
non-linear.
0 0

Softplus ReLU Sigmoid


(Rectified Linear Unit)

x x x

y y y

x will be given; y = f (x) = log (1+e x ) y = f (x) = e


x
y = f (x) = max (0 , x)
ex + 1
(Here, x and y are coordinates)

If the process of adjustment works forward through the layers of network this process is called
'Forward Propagation', where as, if the process of adjustment works backward through the layers of
network, this process is called 'Backward Propagation'.

Learning Rate hyperparameter controls the rate or speed at which the model learns. Learning takes
place through this process of adjustment to network weights with the aim of reducing the total
error. When learning is complete, all the network weights have assigned values.
Partial Derivative or Rate of Change
↱ of the total error with respect to
New Weight = Old Weight - (Learning Rate) (Gradient)
the change in the old weight.

When the learning rate is too large, gradient descent can inadvertently increase rather than
decrease the training error. When the learning rate is too small, training is not only slower but may
become permanently stuck with a high training error.

E.g: ReLU function. F (x) = max (0 , x), y will be equal to β1 times z1 , where z 1is the maximum of
(x1 + x2 + x3 ) or 0, plus β2 times z2 , the maximum of (x2 + x4) or 0, plus β3 times z 3, the
maximum of (x1 + x3 + x4) or 0, plus an error term.

Input Output Input Hidden Output

x1 x1

z1

x2 x2

Y z2 > Y

x3 x3

z3

x4 x4

Weights

Y = 1 (x1) + 2 (x2) + 3 (x3) + 4 (x4) Y = 1 . Max (0, x1 + x2 + x3) +


2 . Max (0, x2 + x4) +
3 . Max (0, x2 + x3 + x4)
= 1 (z1) + 2 (z2) + 3 (z3)
40

When more nodes and more hidden layers are specified, a neural network's ability to handle
complexity tends to increase, but so does the risk of overfitting. However, the tradeoffs in using
them are the lack of interpretability and the amount of data needed to train such models.

lllll Deep Learning Nets


Neural networks with many hidden layers - atleast 3 but often more than 20 hidden layers are
known as Deep Learning Nets (DLNs) is the backbone of the artificial intelligence revolution.
Advances in DLNs have driven developments in many complex activities such as image, pattern and
speech recognition.

Input Hidden Output


Bias
x1
B1
E.g: 0.2 (x1) + 0.8 (x2) + 2.14 = x-axis

x2 B1
1
B2 Y1 ↷
The function
x3 B2 gets activated
when it hits 1.
B3 Y2
x4 B3

B4
0
x5
Bias

The information is fed to the input layer and the information is transferred from one layer to
another over connecting channels, each of these has a value attached to it and hence is called a
weighted channel 'w ij ' (for neuron i and input j), each of which usually produces a scaled number in
the range (0,1) or (-1,1). All neurons have unique numbers associated with it called Bias. This bias is
added to the weighted sum of inputs reaching the neurons, which is then applied to the function
known as the Activation Function. The result of the activation function determines if the neurons
get activated. Every activated neuron passes on information to the following layer; this continues
uptil the second last year (a layer before the output layer). The one neuron activated in the output
layer corresponds to the input digit. [Note: The weights, biases or numbers are passed to another
layer of functions and into another and so on until the final year produces a set of probabilities of
the observation being in any of the target categories (each represented by a node in the output
layer). The DLN assigns the category based on the category with the highest probability]. The weight
and bias are continuously adjusted to produce a well trained network. The DLN is trained on large
datasets; during training the weights w i are determined to minimize a specified loss function.
Unfortunately, DLNs require substantial time to train and systematically varying the
hyperparameters, may not be feasible.

lllll Reinforcement Learning


Reinforcement Learning (RL) is an algorithm that involves an agent that should perform actions that
will maximize its rewards over time, taking into consideration the constraints of its environment.

Gamer Video Game


② Action
③ Reward
State of
① Environment
or Situation
41

In the case of AlphaGo. a virtual gamer (the agent) uses his/her console to command (the actions)
with the help of the information on the screen (the environment) to maximize his/her score (the
reward). Unlike supervised learning, RL gets instantaneous feedback i.e. the learning subsequently
occurs through millions of trail and errors. For example, an agent could be a virtual trader who
follows certain trading rules (the action) in a specific market (the environment) to maximize its
profits (its rewards). The success of RL is still an open question in financial markets.

Fig 4: Guide to ML Algorithms

Variables Supervised ML Unsupervised ML

CONTINUOUS REGRESSION DIMENSIONALITY REDUCTION


- Penalized Regression (LASSO) - Principal Component Analysis (PCA)
- Classification and Regression Tree (CART)
- Random Forest CLUSTERING
- K-Means
- Hierarchical

CATEGORICAL CLASSIFICATION DIMENTIONALITY REDUCTION


- Support Vector Machine (SVM) - Principal Component Analysis (PCA)
- K-Nearest Neighbor (KNN)
- Classification and Regression Tree (CART) CLUSTERING
- K-Means
- Hierarchical

CONTINUOUS or NEURAL NETWORKS NEURAL NETWORKS


CATEGORICAL DEEP LEARNING DEEP LEARNING
REINFORCEMENT LEARNING REINFORCEMENT LEARNING

Fig 5: Decision Flowchart for Choosing ML Algorithms

NO YES

YES

YES NO
YES NO

NO YES

NO YES NO YES

YES NO
42

CHAPTER 8 Big Data Projects

Big Data differs from traditional data sources based on the presence of a set of characteristics commonly
referred to as the 4 V's: Volume, Variety, Velocity and Veracity.

Volume: refers to the quantity of data.


Variety: pertains to the array of available data sources.
Velocity: is the speed at which data is created (data in motion is hard to analyze compared to data at rest).
Veracity: related to the credibility and reliability of different data sources.

Big Data also affords opportunities for enhanced fraud detection and risk management.

E.g: One study conducted in the U.S. found that positive sentiment on Twitter could predict the trend for the
Dow Jones Industrial Average up to 3 days later with nearly 87% accuracy.

Fig 1: Model Building for Financial Forecasting using Big Data


① ②

⑤ ④

① ②

⑤ ④

1
Exploratory Data Analysis (EDA): is the preliminary step in data exploration. Exploratory graphs, charts and other visualizations
such as heat maps and word clouds are designed to summarize and observe data.

2
Feature Selection: is a process whereby only pertinent features from the dataset are selected for ML model training.

3 Feature Engineering: is a process of creating new features by changing or transforming existing features. Feature Engineering
techniques systematically alter, decompose or combine existing features to produce more meaningful
features.

(Feature Selection is a key factor in minimizing model overfitting & Feature Engineering tends to prevent model underfitting)
43

lllll Data Preparation and Wrangling

Structured Data Unstructured Data

* Data Preparation (Cleansing): Data Cleansing is * Text Preparation (Cleansing): Raw text data are a
the process of examining, identifying and sequence of characters and contain other
mitigating errors in raw data. non-useful elements including html tags,
punctuation and white spaces. The initial step in
1. Incompleteness Error: where the data is not text processing is cleansing, which involves to clean
present, resulting in missing data. the text by removing unnecessary elements from
the raw text. A 'Regular Expression' (Regex) is a
2. Invalidity Error: where the data is outside of series that contains characters in a particular order.
Regex is used to search for patterns of interest in a
a meaningful range.
given text. It can be used to find all the html tags
that are present in the form of < ... > in text. Once a
3. Inaccuracy Error: the data is not a measure
pattern is found, it can be removed or replaced.
of true value.
1. Remove Html Tags: Most of the text data are
4. Inconsistency Error: data conflicts with the acquired from web pages and the text inherits
corresponding data points or reality. html mark-up tags with the actual content. The
initial task is to remove (or strip) the html tags
5. Non-Uniformity Error: data is not present in with the help of regex.
an identical framework.
2. Remove Punctuations: Most punctuations are
6. Duplication Error: where duplicate not necessary for text analysis and should be
observations are present. removed. However, some punctuations such as
period (dots), percentage signs, currency symbols
In addition to a manual inspection and verification and question marks may be useful for ML model
of the data, analysis software such as SPSS can be training. These punctuations should be
used to understand 'MetaData' (data that substituted with such annotations as
describes and gives information about other data) /percentagesign/, /dollarsign/ and
/questionmark/ to presume their grammatical
about the data properties to use as a starting
meaning. The periods (dots) must be
point to investigate any errors in the data.
appropriately replaced or removed i.e. period
(dots) after abbreviations, but the periods
separating the sentences should be replaced by
the annotation. Regex is often used to remove or
replace punctuations.

3. Remove Numbers: The numbers or digits should


be removed or substituted with an annotation
'number/. However, the number and any
decimals must be retained where the outputs of
interest are the actual values of the number. One
such text application is Information Extraction
(IE), where the goal is to extract relevant
information from a given text.

4. Remove White Spaces: Extra white spaces, tab


spaces, leading and ending spaces in the text
should be identified and removed to keep the
text intact and clean. For example, text mining
package in R offers a 'StripWhiteSpace' function.

Web Scrapping is a technique to extract raw


content from a source typically webpages.
44

* Data Wrangling (Preprocessing): Data * Text Wrangling (Preprocessing): A token is


Preprocessing primarily includes transformations equivalent to a word and tokenization is the
and scaling of the data. These processes are process of splitting a given text into separate
exercised on the cleansed dataset. tokens. In other words, a text is considered to be a
collection of tokens. Similar to structured data, text
1. Extraction: A new variable can be extracted data also requires normalization. The normalization
from the current variable for the ease of process in text processing involves the following:
analyzing and to use for training the ML model.
1. Lower-casing: the alphabet removes distinctions
among the same words due to upper and lower
2. Aggregation: Two or more variables can be
cases. This actions help the computers to process
aggregated into one variable to consolidate
the same words appropriately e.g. 'The' and 'the'.
similar variables.
2. Stop Words: are such commonly used words as
3. Filtration: The data rows that are not needed 'the', 'is' and 'a'. For ML training purposes, stop
for the project must be identified and filtered. words typically are removed to reduce the no. of
tokens involved in the training set. In some cases,
4. Selection: The data columns that are intuitively additional stop words can be added to the list
not needed for the project can be removed. based on the content. For example, the word
'exhibit' may occur often in financial filings, which
5. Conversion: The variables can be of different is general, is not a stop word but in the context of
types i.e. nominal, ordinal, continuous and the filings, it can be treated as a stop words.
categorical. The variables in the dataset must be
converted into appropriate types to further 3. Stemming: is the process of converting inflected
process and analyze them correctly. Before forms of a word into its base word, known as
converting, values must be stripped out with stems. For example, the stem of the words
prefixes and suffixes such as currency symbols. 'analyzed' and 'analyzing' is 'analyz'.

4. Lemmatization: is the process of converting


Outliers may be present in the data and several
inflected forms of a word into its morphological
techniques can be used to detect outliers in the
root, known as Lemma. Lemmatization is an
data. In normally distributed data, data values
algorithmic approach and depends on the
outside of 1.5 IQR (Interquartile range) are
knowledge of the word and language structure.
considered as outliers and values outside of 3 IQR For example, the lemma of the words 'analyzed'
as extreme values i.e. a data value that is outside and 'analyzing' is 'analyze'. Lemmatization is
of 3 standard deviations from the mean may be computationally more expensive and advanced.
considered as an outlier. When extreme values
and outliers are simply removed from the dataset, Stemming or Lemmatization will reduce the
it is known as Trimming also called 'Truncation'. repetition of words occurring in various forms and
When extreme values and outliers are replaced maintain the semantic structure of the text data.
with the maximum (for large outliers) and Stemming is more simpler to perform compared to
minimum (for small outliers) values of data points lemmatization. In text data, Data Sparseness refers
that are not outliers, this process is known as to words that appear very infrequently, resulting in
Winsorization. Scaling is a process of adjusting the data consisting of many unique low frequency
range of a feature by shifting and changing the tokens. Both techniques decrease data sparseness
scale of data. Scaling helps SVMs and ANNs by aggregating many sparsely occurring words in
relatively less sparse stems or lemmes, thereby
models. It is most important to remove outliers
aiding in training less complex ML models. After the
before scaling is performed. There are two of the
cleansed text is normalized, a Bag-of-Words is
most common ways of scaling:
created. BOW is simply a set of words and doesn't
capture the position or sequence of words present
(a) Normalization: is the process of rescaling
in the text. Note that the no. of words decreases as
numeric variables in the range of (0, 1). the normalizing steps are applied, making the
Normalization is sensitive to outliers. resulting BOW smaller and simpler.
45

Normalization can be used when the


distribution of the data is not known.

Xi = X i - X min
(normalized)
X max - X min

(b) Standardization: is the process of both


centering and scaling the variables. The
resultant standardized variable will have an
arithmetic mean of 0 and standard deviation of
1. Standardization is relatively less sensitive to
outliers as it depends on the mean and
standard deviation of the data. However, the
data must be normally distributed to use The last step of text preprocessing is using the final
standardization. BOW after normalizing to build a Document Term
Matrix (DTM). DTM of four texts and using
X i (standardized) = X i - μ normalized BOW filled with counts of occurrence.
ox

BOW doesn't represent the word sequences or


positions, which limits its use for some advanced
ML training applications. In the example, the word
'no' is treated as a simple token and has been
removed during the normalization because it is a
stop word. Consequently, this fails to signify the
negative meaning ('no market') of the text (i.e. Text
4). To overcome such problems, a technique called
N-grams can be employed. N-grams is a
representation of word sequences. The length of a
sequence can vary from 1 to 0. Stemming can be
applied on the cleansed text before building
n-grams and BOW.

Even after removing isolated stop words, stop


words tend to persist when they are attached to
their adjacent words.
46

lllll Data Exploration


Domain Knowledge plays a vital role in exploratory analysis as this stage should involve cooperation
between analysts, model designers and experts in this particular data domain. In Data exploration,
without Domain Knowledge can result in spurious relationships i.e. mislead analysis.

lllll Data Exploration

Structured Data
1D: Summary statistics such as mean, median, quartiles, ranges , standard deviations, skewness
and kurtosis.

Histograms Bar Charts Box Plots Density Plots


Histograms represent Bar Charts summarize Box Plots shows the Density Plots are
equal bins of data and the frequencies of distribution of another effective way
their respective categorical variables. continuous data. to understand the
frequencies. They can distribution of
be used to understand ↷ Maximum continuous data.

Units
Frequency

→ 3rd Quartile
the high-level Density Plots are
distribution of the data. → Median
smoothed histograms.
Frequency

Frequency
→ 1st Quartile
Minimum

Data Data

Data Salary

2D: Summary statistics of relationships such as a correlation matrix.

Scatter Plots Line Graphs


A Scatter Plot provides Line Graph is a type of
a starting point where chart used to visualize
relationships can be the value of something
examined visually, but it over time.
may not be a
.
Temparature

statistically significant
relationship.
...
..... . .
...
Salary

. ........ . .
. ..................
.
.
...... .
Time

Age

For Multivariate Data, commonly utilized exploratory visualization designs include stacked bar and
line charts, multiple box plots and scatter plots showing multivariate data that use different colors
or shapes for each feature.

Common Parametric Statistical Tests: ANOVA, T-test and Pearson Correlation


Common Non-Parametric Statistical Tests: Chi-Square and Spearman Rank-Order Correlation

Central Tendency helps measure minimum and maximum values for continuous data. Counts and
Frequencies for categorical data are commonly employed to gain insight regarding the distribution
of possible values.
47

Unstructured Data
The most common applications are,
(a) Text Classification: uses supervised ML approaches to classify texts into different classes. Text
Classification involves dividing text documents into assigned classes (a class is a category,
examples include 'relevant' and 'irrelevant' text documents or 'Bearish' and 'Bullish' sentences).
(b) Topic Modeling: uses unsupervised ML approaches to group the texts in the dataset into topic
clusters. Topic modeling is a text data application in which the words that are most informative
are identified by calculating the term frequency of each word. For example, the word 'soccer' can
be informative for the topic 'sports'. The words with high term frequency values are eliminated
as they are likely to be stop words or other common vocabulary words, making the resulting
BOW compact and more likely to be relevant to topics within the texts.
(c) Fraud Detection
(d) Sentiment Analysis: predicts the sentiment i.e. negative, neutral or positive of the texts in a
dataset using both supervised and unsupervised approaches.
(In Sentiment Analysis and Text Classification applications, the Chi-square measure of word
association can be useful for understanding the significant word appearances in negative and
positive sentences in the text or in different documents).

Text data includes a collection of texts, also known as a Corpus, that are sequences of tokens.
↶ Word Clouds are common visualizations when working with text data as they can be made to
visualize the most informative words and their term frequency values.

Document Frequency (DF): defined as the no. of documents (i.e. sentences) that contain a given word divided by the total no. of sentences.

        DF = Sentence Count with Word / Total no. of Sentences

Inverse Document Frequency (IDF): a relative measure of how unique a term is across the entire corpus.

        IDF = Log ( 1 / DF )

Term Frequency (TF): is the ratio of the no. of times a given token occurs in all the texts in the dataset to the total no. of tokens in the dataset (related basic text statistics include word associations, average word and sentence length & word and syllable counts). TF at the corpus level is known as 'Collection Frequency' (CF).

        TF = Total Word Count / Total no. of Words in Collection

Term Frequency - Inverse Document Frequency (TF-IDF): a higher TF-IDF value indicates words that appear more frequently within a small no. of documents. This signifies relatively more unique terms that are important. Conversely, a low TF-IDF value indicates terms that appear in many documents.

        TF-IDF = TF x IDF
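
As a minimal illustration of the four measures above, the following Python sketch applies the sentence-level definitions (DF per sentence, TF at the collection level) to a tiny made-up corpus; the sentences and the tokens printed are purely hypothetical.

import math
from collections import Counter

sentences = [
    "stocks rally as rates fall",
    "bond rates fall again",
    "stocks and bonds rally",
]

tokens = [t for s in sentences for t in s.split()]   # every token in the collection
counts = Counter(tokens)                             # collection frequency of each token

def tf(word):
    # Term Frequency: word count / total no. of words in the collection
    return counts[word] / len(tokens)

def df(word):
    # Document Frequency: sentences containing the word / total sentences
    return sum(1 for s in sentences if word in s.split()) / len(sentences)

def idf(word):
    # Inverse Document Frequency: log(1 / DF)
    return math.log(1 / df(word))

for w in ["rally", "rates", "bonds"]:
    print(w, round(tf(w), 3), round(df(w), 3), round(idf(w), 3), round(tf(w) * idf(w), 4))

A token such as 'bonds', which appears in only one sentence, receives a higher IDF (and hence TF-IDF) than a token spread across most sentences, which is exactly the uniqueness property TF-IDF is meant to capture.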

lllll Feature Selection

Structured Data
Typically, structured data even after data preparation can contain features that don't contribute to the accuracy of an ML model or that negatively affect the quality of ML training. Feature Selection on structured data is a methodical and iterative process. Statistical measures can be used to assign a score gauging the importance of each feature. The features can then be ranked using the score and either retained or eliminated from the dataset. Methods include the Chi-Square Test, Correlation Coefficient and information-gain measures i.e. R².

Feature Selection is different from Dimensionality Reduction, but both methods seek to reduce the
no. of features in the dataset. The Dimensionality Reduction method creates new combinations of
features that are uncorrelated, whereas Feature Selection includes and excludes features present in
the data without altering them.

Unstructured Data
For text data, Feature Selection involves selecting a subset of the terms or tokens occurring in the dataset. The tokens serve as features for ML model training. Feature Selection in text data effectively decreases the size of the vocabulary or BOW. This helps the ML model be more efficient and less complex. Another benefit is the elimination of noisy features from the dataset. Noisy features are tokens that don't contribute to ML model training and actually might detract from the ML model's accuracy. Frequent tokens strain the ML model's ability to choose a decision boundary among the texts because the terms are present across all the texts; this is an example of model underfitting. Rare tokens mislead the ML model into classifying texts containing the rare terms into a specific class; this is an example of model overfitting.

The general Feature Selection methods for text data are as follows (see the sketch after this list):

(a) Frequency: frequency measures can be used for vocabulary pruning to remove noisy features by filtering the tokens with very high and very low TF values across all texts. DF is another frequency measure that helps to discard noise features and often performs well when many thousands of tokens are present.
(b) Chi-Square Test: is applied to test the independence of two events: occurrence of the token and occurrence of the class. The test ranks the tokens by their usefulness to each class in text classification problems. Tokens with the highest Chi-Square Test statistic values occur more frequently in texts associated with a particular class and therefore can be selected for use as features for ML model training.
(c) Mutual Information (MI): measures how much information is contributed by a token to a class of texts. If MI = 0, then the token's distribution in all text classes is the same. MI approaches 1 as the token in any one class tends to occur more often in only that particular class of text.
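
The sketch below illustrates frequency-based pruning and the Chi-Square ranking of tokens, assuming scikit-learn is available; the four labeled sentences and the 'bullish'/'bearish' classes are invented for illustration and are not from the text above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

texts = [
    "profits rise and guidance raised",     # bullish
    "strong earnings beat expectations",    # bullish
    "losses widen and guidance cut",        # bearish
    "weak earnings miss expectations",      # bearish
]
labels = [1, 1, 0, 0]                       # 1 = bullish, 0 = bearish

# min_df / max_df drop very rare and very frequent tokens (frequency-based pruning)
vectorizer = CountVectorizer(min_df=1, max_df=0.9)
X = vectorizer.fit_transform(texts)

scores, p_values = chi2(X, labels)          # Chi-Square statistic per token
ranked = sorted(zip(vectorizer.get_feature_names_out(), scores),
                key=lambda kv: kv[1], reverse=True)
print(ranked[:5])                           # tokens most associated with a class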

lllll Feature Engineering

Structured Data
For Continuous Data, a new feature may be created, for example, by taking the logarithm of the product of two or more features. As another example, when considering a salary or income feature,
it may be important to recognize that different salary brackets impose a different taxation rate.
Domain Knowledge can be used to decompose an income feature into different tax brackets,
resulting in a new feature: 'income_above_100K' with possible values 0 and 1.

For Categorical Data, a new feature can be a combination, for example, the sum or product of two features, or a decomposition of one feature into many. If a single categorical feature represents education level with 5 possible values i.e. high school, associates, bachelor's, master's and doctorate, then these values can be decomposed into 5 new features, one for each possible value (e.g. is_high_school, ..., is_doctorate) filled with 0s (for false) and 1s (for true). The process in which categorical variables are converted into binary form (0 or 1) for ML is called One Hot Encoding, as shown in the sketch below.
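
A minimal sketch of One Hot Encoding for the education-level example above, assuming pandas is available; the column name and dummy-column prefix are illustrative choices.

import pandas as pd

df = pd.DataFrame({"education": ["high school", "associates", "bachelor's",
                                 "master's", "doctorate"]})

# Each category becomes its own 0/1 column, e.g. is_doctorate
one_hot = pd.get_dummies(df["education"], prefix="is").astype(int)
print(one_hot)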

Unstructured Data
The following are some feature engineering techniques which may overlap with text processing
techniques:
(a) Numbers: In text processing, numbers are converted into a token such as '/number/'.
(b) N-grams
49

(c) Named Entity Recognition (NER): The NER algorithm analyzes the individual tokens and their
surrounding semantics while referring to its dictionary to tag an object class to the token. For
example, for the text 'CFA Institute was formed in 1947 and is headquartered in Virginia', NER tags
'CFA Institute' as an ORGANIZATION, '1947' as a DATE and 'Virginia' as a LOCATION. NER tags can
also help identify critical tokens on which operations such as lowercasing and stemming can then
be avoided (e.g. 'Institute' here refers to an organization rather than a verb). Additional object
classes include, for example, MONEY, TIME and PERCENT, which are not present in the example
text (see the sketch after this list).
(d) Parts of Speech (POS): uses language structure and dictionaries to tag every token in the text
with a corresponding part of speech. Some common POS tags are noun, verb, adjective and
proper noun.
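
The following hedged sketch shows NER and POS tagging on the example sentence using spaCy, assuming the library and its small English model (en_core_web_sm) are installed; the exact entity labels returned (e.g. ORG, DATE, GPE) depend on the model rather than on the text above.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("CFA Institute was formed in 1947 and is headquartered in Virginia.")

print([(ent.text, ent.label_) for ent in doc.ents])    # named entities and their classes
print([(tok.text, tok.pos_) for tok in doc])           # part-of-speech tag for every token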

lllll Model Training

lllll Method Selection

(a) Supervised and Unsupervised learning models


(b) Type of Data:
For Numerical Data (e.g. predicting stock prices using historical stock market
values), CART methods may be suitable.
For Text Data (e.g. predicting the topic of a financial news article by reading the headline of the
article), methods such as GLMs and SVMs are used.
For Image Data (e.g. identifying objects in a satellite image such as tanker ships moving in and
out of port), NNs and deep learning methods tend to perform better than others.
For Speech Data (e.g. predicting financial sentiment from quarterly earnings' conference call
recordings), deep learning methods can offer promising results.
(c) Size of Data: A typical dataset has two basic characteristics, no. of instances (observations) and
no. of features. For instance, SVMs have been found to work well on 'wider' datasets with 10,000
to 100,000 features and with fewer instances. Conversely, NNs often work better on 'longer'
datasets where the no. of instances is much larger than the no. of features.

Before model training begins,


Supervised Learning: The data are split using a random sampling technique such as K-Fold.
Unsupervised Learning: No splitting is needed due to the absence of labeled training data.

Class Imbalance: where the no. of instances for a particular class is significantly larger than for
other classes. For example, say for corporate issuers in the BB+/Ba1 to B+/B1
credit quality range, issuers who defaulted (positive or '1' class) would be very few
compared to issuers who did not default (negative or '0' class). Hence, on such
training data, a naïve model that simply assumes no corporate issuer will default
may achieve good accuracy, albeit with all default cases misclassified. Balancing
the training data can help alleviate such problems. In cases of unbalanced data,
the '0' class (majority class) can be randomly undersampled or the '1' class
(minority class) randomly oversampled.
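
A minimal sketch of random undersampling and oversampling with pandas; the ten-issuer DataFrame and the 'defaulted' column are hypothetical.

import pandas as pd

data = pd.DataFrame({
    "issuer_id": range(10),
    "defaulted": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],   # heavily imbalanced '0'/'1' labels
})

majority = data[data["defaulted"] == 0]
minority = data[data["defaulted"] == 1]

# Option 1: randomly undersample the majority ('0') class
undersampled = pd.concat([majority.sample(len(minority), random_state=42), minority])

# Option 2: randomly oversample the minority ('1') class (sampling with replacement)
oversampled = pd.concat([majority,
                         minority.sample(len(majority), replace=True, random_state=42)])

print(undersampled["defaulted"].value_counts())
print(oversampled["defaulted"].value_counts())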

lllll Performance Evaluation

(a) Error Analysis: For classification problems, error analysis involves computing four basic
evaluation metrics: TP (true positive), FP (false positive), TN (true negative) and FN (false negative). These counts are typically arranged in a Confusion Matrix for error analysis.

Precision: is the ratio of correctly predicted positive classes to all predicted positive classes.
Precision is useful in situations where the cost of FP or Type I Error is high. For
example, when an expensive product fails quality inspection (predicted class 1) and is
scrapped, but it is actually perfectly good (actual class 0).

        P = TP / (TP + FP)

Recall: also known as sensitivity i.e. is the ratio of correctly predicted positive classes to all actual
positive classes. Recall is useful in situations where the cost of FN or Type II Error is high.
For example, when an expensive product passes quality inspection (predicted class 0) and
is sent to the valued customer, but it is actually quite defective (actual class 1).

        R = TP / (TP + FN)

Accuracy: is the percentage of correctly predicted classes out of total predictions.

        Accuracy = (TP + TN) / (TP + FP + TN + FN)

F1 Score: is the harmonic mean of precision and recall.

        F1 Score = (2 x P x R) / (P + R)

The F1 Score is more appropriate than Accuracy when there is an unequal class distribution in the dataset and it is necessary to measure the equilibrium of Precision and Recall. High scores on both of these metrics suggest good model performance.

Example 1:
Calculate Precision, Recall, Accuracy and F1 Score with the help of the table below:

Observation    Actual Training Labels    Predicted Results    Classification

1 1 1 TP
2 0 0 TN
3 1 1 TP
4 1 0 FN
5 1 1 TP
6 1 0 FN
7 0 0 TN
8 0 0 TN
9 0 0 TN
10 0 1 FP

Precision = 3 / (3 + 1) = 0.75
Recall = 3 / (3 + 2) = 0.60
Accuracy = (3 + 4) / (3 + 1 + 4 + 2) = 0.70
F1 Score = [2 x (0.75) x (0.6)] / (0.75 + 0.6) = 0.67

If the no. of classes in a dataset is unequal, however, then the F1 Score (rather than Accuracy) should be used as the overall performance measure for the model.
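
The label vectors from the table can be cross-checked with scikit-learn's metric functions (assuming the library is available); the outputs match the hand calculation above.

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

actual    = [1, 0, 1, 1, 1, 1, 0, 0, 0, 0]   # Actual Training Labels column
predicted = [1, 0, 1, 0, 1, 0, 0, 0, 0, 1]   # Predicted Results column

print(precision_score(actual, predicted))    # 0.75
print(recall_score(actual, predicted))       # 0.60
print(accuracy_score(actual, predicted))     # 0.70
print(f1_score(actual, predicted))           # ~0.67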

(b) Receiver Operating Characteristic (ROC): This technique for assessing model performance involves plotting a curve that shows the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) for various threshold points (cut-offs). The ROC curve summarizes all of the Confusion Matrices that each threshold produced.

        TPR = TP / (TP + FN)                FPR = FP / (FP + TN)

If the predicted probability (P) from a logistic regression model for a given observation is greater than the threshold, then the observation is classified as '1'; otherwise it is classified as '0' (P > Threshold → 1, P < Threshold → 0).

The shape of the ROC curve provides insight into the model's performance: a more convex curve indicates better model performance. [Figure: ROC curves for three models, Model A (AUC = 0.9), Model B (AUC = 0.75) and Model C (AUC = 0.5).] It is clear that Model A, with the most convex ROC curve and an AUC (Area Under the Curve) of more than 0.9 (90%), is the best performing among the three models.
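
A hedged sketch of computing the (FPR, TPR) pairs and the AUC with scikit-learn; the actual labels and predicted probabilities are invented purely for illustration.

from sklearn.metrics import roc_curve, roc_auc_score

actual = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
prob_1 = [0.9, 0.2, 0.8, 0.6, 0.65, 0.1, 0.7, 0.35, 0.55, 0.3]   # model's P(class = 1)

fpr, tpr, thresholds = roc_curve(actual, prob_1)   # one (FPR, TPR) pair per threshold
print(list(zip(thresholds.round(2), fpr.round(2), tpr.round(2))))
print("AUC =", roc_auc_score(actual, prob_1))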

(c) Root Mean Squared Error (RMSE): is appropriate for continuous data prediction and is mostly
used for regression models. It is a simple metric that captures all the prediction errors in the
data (n). A small RMSE indicates potentially better model performance.

        RMSE = √[ Σ (Predicted i - Actual i )² / n ] = √[ Σ (Y i - Ŷ i )² / n ]
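
A one-line NumPy computation matching the RMSE formula above; the predicted and actual values are illustrative only.

import numpy as np

actual    = np.array([10.0, 12.0, 9.5, 11.0])
predicted = np.array([ 9.8, 12.4, 9.0, 11.5])

rmse = np.sqrt(np.mean((predicted - actual) ** 2))   # √[ Σ(Predicted - Actual)² / n ]
print(rmse)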

lllll Tuning

If the prediction error on the Training Set is high → Underfitting i.e. Bias Error (Bias Error is high when a model is oversimplified and doesn't sufficiently learn from the patterns in the training data).

If the prediction error on the Cross Validation Set is much higher than on the Training Set → Overfitting i.e. Variance Error (Variance Error is high when a model is overcomplicated and memorizes the training data so much so that it will likely perform poorly on new data).

It is not possible to completely eliminate both types of error. The Bias-Variance trade-off is critical in finding the optimal balance where a model neither underfits nor overfits.

(a) Parameters: are critical for a model and are dependent on the training data. Parameters are
learned from the training data as part of the training process by an optimization technique.
Examples of parameters include: Coefficients in regression, Weights in NNs and Support Vectors in
SVMs.

(b) Hyperparameters: are used for estimating model parameters and are not dependent on the
training data. Examples of hyperparameters include the Regularization Term (λ) in supervised
models, the Activation Function and No. of Hidden Layers in NNs, the No. of Trees and Tree Depth in
Ensemble Methods, K in KNN and K-Means Clustering & the P-Threshold in Logistic Regression.
Hyperparameters are manually set and tuned. Thus, tuning heuristics and techniques such
as Grid Search are used to obtain the optimum values of hyperparameters. Grid Search is a
method of systematically training an ML model by using various combinations of
hyperparameter values, cross validating each model and determining which combination of
hyperparameter values ensures the best model performance (a sketch follows below). The plot of training error for each
value of a hyperparameter (i.e. changing model complexity) is called a Fitting Curve.
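
A hedged sketch of Grid Search with cross validation using scikit-learn's GridSearchCV; the KNN model, the candidate values of K and the synthetic data are illustrative assumptions, not taken from the text.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # 200 instances, 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # a simple synthetic label

param_grid = {"n_neighbors": [3, 5, 7, 9]}  # K in KNN is a hyperparameter
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="f1")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))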

Slight Regularization: lightly penalizes model complexity, thereby allowing most or all of the features to be included in the model and thus potentially enabling the model to memorize the data. When high variance error and low bias error exist, the model performs well on the training dataset but generates many FP and FN errors on the CV dataset [Error CV > Error Train]; in other words, the model is overfitted and doesn't generalize well to new data.

Optimum Regularization: minimizes both variance and bias error in a balanced fashion. The range of optimum regularization values can be found heuristically using such techniques as Grid Search.

Large Regularization: excessively penalizes model complexity, thereby allowing too few of the features to be included in the model, so the model learns less from the data. Typically, with large regularization, the prediction errors on the training and CV datasets are both large. When high bias error exists, the model doesn't perform well on either the training or CV dataset because it is typically lacking important predictor variables.

Regularization: describes methods that reduce statistical variability in high dimensional data estimation problems, reducing regression coefficient estimates toward '0' and thereby avoiding complex models and the risk of overfitting. Regularization methods can also be applied to non-linear models. For example, asset returns typically exhibit strong multicollinearity, making the estimation of the covariance matrix highly sensitive to noise and outliers, so the resulting optimized asset weights are highly unstable. Regularization methods have been used to address this problem (see the sketch below).

[Figure: Fitting Curve, plotting Error Train and Error CV against the regularization term λ; Error CV is large in the high variance region (slight regularization, Error CV > Error Train) and in the high bias region (large regularization), and is smallest around optimum regularization.]
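
The sketch below illustrates how a larger regularization term shrinks regression coefficients toward 0, using scikit-learn's Ridge regression on synthetic, strongly collinear data; here alpha plays the role of λ, and the data-generating choices are assumptions made only for this example.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)     # strongly correlated with x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.5, size=300)

for alpha in [0.01, 1.0, 100.0]:              # slight to large regularization
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))

With slight regularization the two collinear coefficients are large and unstable; as alpha grows they are pulled toward 0, trading variance error for bias error.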

If high bias or variance error exists after tuning of hyperparameters, either a larger no. of training
instances may be needed or the no. of features included in the model may need to be decreased
(in the case of high variance) or increased (in the case of high bias). The model then needs to be
retrained and retuned using the new training dataset. In the case of a complex model, where a large
model comprises several sub-models, Ceiling Analysis can be performed. Ceiling Analysis is a
systematic process of evaluating components in the pipeline of model building; it can help
determine which sub-model needs to be tuned to improve the overall accuracy of the larger model.

CHAPTER 9 Probabilistic Approaches: Scenario Analysis, Decision Trees and Simulations

The steps associated with running a simulation are as follows:

Step 1: Determine the Probabilistic Variables: It makes sense to focus attention on the few variables that have a significant impact on value.

Step 2: Define Probability Distributions for these Variables: Generically, there are 3 ways in which we can go
about defining probability distributions:
- Historical Data: This method assumes that the future values of the variable will be similar to its past
e.g. long term treasury bond rate.
- Cross-Sectional Data: When past data is unavailable or unreliable, we may estimate the distribution of
the variable based on the values of the variable for peers.
- Pick a Distribution and Estimate the Parameters: When neither historical nor cross-sectional data
provide adequate insight, subjective specification of a distribution along with related parameters is
the appropriate approach.

Step 3: Check for Correlation across Variables: When there is a strong correlation between variables, we can
either (a) allow only one of the variables to vary (it makes sense to focus on the variable that has
the bigger impact on value) or (b) build the rules of correlation into the simulation (this necessitates
more sophisticated simulation packages). As with the distributions, the correlations can be estimated
by looking at the past.

Step 4: Run the Simulation: Means randomly drawing variables from their underlying distributions and then
using them as inputs to generate estimated values. This process may be repeated to yield thousands
of estimates of value, giving a distribution of the investment's value, though the marginal
contribution of each simulation drops off as the no. of simulations increases. The no. of simulations
needed for a good output is driven by:
- No. of Probabilistic Inputs: The larger the no. of inputs that have probability distributions attached
to them, the greater will be the required no. of simulations.
- Characteristics of Probability Distributions: The greater the variability in types of distributions, the
greater the no. of simulations needed. Conversely, if all variables are specified by one distribution
(e.g. normal), then the no. of simulations needed would be lower.
- Range of Outcomes: The greater the potential range of outcomes on each input, the greater will be
the no. of simulations.
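
A hedged NumPy sketch of Step 4 (running the simulation): the two probabilistic inputs, their correlation, the base revenue and the valuation formula are all hypothetical assumptions chosen purely to show the mechanics of drawing inputs and building a distribution of value.

import numpy as np

rng = np.random.default_rng(7)
n_sims = 10_000

# Two correlated probabilistic inputs (see Step 3): revenue growth and operating margin
mean = [0.03, 0.12]
cov = [[0.01**2, 0.6 * 0.01 * 0.03],
       [0.6 * 0.01 * 0.03, 0.03**2]]
growth, margin = rng.multivariate_normal(mean, cov, size=n_sims).T

base_revenue, discount_rate = 1_000.0, 0.10
value = base_revenue * (1 + growth) * margin / (discount_rate - growth)

print("mean value:", round(value.mean(), 1))
print("5th-95th percentile:", np.percentile(value, [5, 95]).round(1))

Instead of a single point estimate, the output is a full distribution of value whose spread conveys the riskiness of the investment.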

Advantages of Simulations

1. Better Input Quality: Superior inputs are likely to result when an analyst goes through the process of
selecting a proper distribution for critical inputs rather than relying on a single best estimate.
2. Provides a Distribution of Expected Value rather than a Point Estimate: The distribution of an investment's
expected value provides an indication of risk in the investment.

Disadvantages of Simulations

1. Garbage In, Garbage Out: Regardless of the complexities employed in running simulations, if the
underlying inputs are poorly specified, the output will be of low quality. It is also worth noting that simulations
require more than a passing knowledge of statistical distributions and their characteristics; analysts who
cannot assess the difference between normal and lognormal distributions shouldn't be doing simulations.

2. Real data may not fit the requirements of statistical distributions which may yield misleading results.
3. Non-Stationary Distributions: Input variable distributions may change over time, so the distribution and
parameters specified for a particular simulation may not be valid anymore. For example, the mean and
variance estimated from historical data for an input that is normally distributed may change for the next
period.
4. Changing Correlation across Inputs: In the third simulation step, we noted that correlation across input
variables can be modeled into simulations. However, this works only if the correlations remain stable and
predictable. To the extent that correlations between input variables change over time, it becomes far more
difficult to model them.

Simulations with Constraints

1. Book Value Constraints: There are two types of restrictions on the book value of equity that may call for risk
hedging:
- Regulatory Capital Requirements: Banks and insurance companies are required to maintain adequate levels
of capital. Violations of minimum capital requirements are considered serious and could threaten the very
existence of the firm.
- Negative Book Value for Equity: In some countries, such as some European countries, a negative book value of
equity may have serious consequences.
2. Earnings and Cashflow Constraints: Earnings or cashflow constraints can be imposed internally to meet
analyst expectations or to achieve bonus targets. Earnings constraints can also be imposed externally, such
as through a loan covenant. Violating such a constraint could be very expensive for the firm.
3. Market Value Constraints: Market value constraints seek to minimize the likelihood of financial distress or
bankruptcy for the firm, by incorporating the costs of financial distress in a valuation model for the firm
e.g. stress testing and VaR.

Fig 1: Risk Types [figure not reproduced]

- Simulations and Decision Trees consider all possible states of the outcome and hence the sum of probabilities is 1.
- Scenario Analysis doesn't consider the full spectrum of outcomes and hence the combined probability of the outcomes is less than 1.

Decision Trees and Simulation can be used as Complements or as Substitutes for risk-adjusted valuation.
Scenario Analysis doesn't include the full spectrum of outcomes and therefore can only be used as a
Complement to risk-adjusted valuation. If used as a substitute, the cashflows in an investment are discounted
at the risk-free rate (R f ) and the expected value obtained is then evaluated in conjunction with the variability obtained from
the analysis.
