Quantitative Analysis
By AnalystPrep
12 - Fundamentals of Probability
13 - Random Variables
14 - Common Univariate Random Variables
15 - Multivariate Random Variables
16 - Sample Moments
17 - Hypothesis Testing
18 - Linear Regression
19 - Regression with Multiple Explanatory Variables
20 - Regression Diagnostics
21 - Stationary Time Series
22 - Nonstationary Time Series
23 - Measuring Return, Volatility, and Correlation
24 - Simulation and Bootstrapping
25 - Machine-Learning Methods
26 - Machine Learning and Prediction
Reading 12: Fundamentals of Probability
After completing this reading, you should be able to:
Explain the difference between independent events and conditionally independent events.
Probability is the foundation of statistics, risk management, and econometrics. Probability quantifies the likelihood that some event will occur; for instance, we could be interested in the probability that a loan defaults over the next year.

Sample Space (Ω)

The sample space, denoted Ω, is the set of all possible outcomes; the relevant outcomes depend on the problem being studied. For example, when modeling returns from a portfolio, the sample space is a set of real numbers. As another example, assume we want to model defaults in loan payments; we know that there can only be two outcomes: either the firm defaults or it doesn't. As such, the sample space is Ω = {Default, No Default}. To give yet another example, the sample space when a fair six-sided die is tossed consists of six different outcomes:

Ω = {1, 2, 3, 4, 5, 6}
Events (ω)
An event is a set of outcomes (and may contain more than one element). For example, suppose we toss a die: a "6" would constitute an event. If we toss two dice simultaneously, a {6, 2} would constitute an event. An event that contains only one outcome is termed an elementary event.
Event Space (F)

The event space refers to the set of all possible outcomes and combinations of outcomes. For example, consider a scenario where we toss two fair coins simultaneously. The possible outcomes are:

{HH, HT, TH, TT}

Note: If the coins are fair, the probability of a head, P(H), equals the probability of a tail, P(T).
Probability
The probability of an event refers to the likelihood of that particular event occurring. For example, the probability of a head when we toss a fair coin is 0.5, and so is the probability of a tail.

Under the frequentist interpretation, the probability of an event is the proportion of times the event occurs in a large set of independent, repeated experiments; that is, it is the limit of the event's relative frequency in many trials. This is a conceptual explanation: in finance, we mostly deal with actual, non-experimental events.

Mutually Exclusive Events

Two events, A and B, are said to be mutually exclusive if the occurrence of A rules out the occurrence of B, and vice versa. For example, a car cannot turn left and turn right at the same time.
Mutually exclusive events are such that one event precludes the occurrence of all the other events. Thus, if you roll a die and a 4 comes up, that particular event precludes all the other events, i.e., 1, 2, 3, 5, and 6. In other words, rolling a 1 and rolling a 5 are mutually exclusive. Furthermore, there is no way a single investment can have more than one arithmetic mean return. Thus, arithmetic returns of, say, 20% and 17% constitute mutually exclusive events.
Independent Events
Two events, A and B, are independent if the fact that A occurs does not affect the probability of B occurring. In other words, the probability of one event happening does not depend on whether the other event occurs or not. For example, we can define A as the event that it rains in New York on March 15 and B as the event that it rains in Frankfurt on March 15. In this instance, the two events are independent.

Another example would be defining event A as getting tails on the first coin toss and B as getting tails on the second coin toss. Landing on tails on the first toss does not affect the probability of getting tails on the second toss.
Intersection

The intersection of two events, say A and B, is the set of outcomes occurring in both A and B, denoted A ∩ B. If A and B are mutually exclusive, then:

P(A ∩ B) = P(A and B) = 0

This is because A's occurrence rules out B's occurrence. Remember that a car cannot turn left and turn right at the same time.
Union

The union of two events, say A and B, is the set of outcomes occurring in at least one of the two sets – A or B – and is denoted A ∪ B.

To determine the likelihood of either of two mutually exclusive events occurring, we sum up their individual probabilities:

P(A ∪ B) = P(A or B) = P(A) + P(B)

Given two events A and B that are not mutually exclusive, the probability of their union is:

P(A ∪ B) = P(A or B) = P(A) + P(B) − P(A ∩ B)

Another important concept in probability is the complement of a set, denoted by Aᶜ (where A can be any event), which is the set of outcomes that are not in A. Note that P(Aᶜ) = 1 − P(A).
Conditional Probability

Until now, we've only looked at unconditional probabilities. An unconditional probability (also known as a marginal probability) is simply the probability that an event occurs without considering any other preceding events. In other words, unconditional probabilities are not conditioned on the occurrence of any other events; they are 'stand-alone' probabilities.

Conditional probability is the probability of one event occurring given some relationship to one or more other events. Our interest lies in the probability of an event A given that another event B has occurred: "What is the probability of one event occurring if another event has already taken place?" We write this as:

P(A|B) = P(A ∩ B) / P(B)
Bayes' Theorem
Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. Assuming that we have two events, A and B, then according to Bayes' theorem:

P(A|B) = [P(B|A) × P(A)] / P(B)
Suppose that we hold two bonds, A and B. Each bond has a default probability of 10% over the following year. We are also told that there is a 6% chance that both bonds will default, an 86% chance that neither will default, and a 14% chance that either of the bonds will default.

Often, there is a high correlation between bond defaults. This can be attributed to the sensitivity of bond issuers to broad economic conditions. The 6% probability that both bonds default is higher than the 1% (= 10% × 10%) probability that would apply had the default events been independent.

The features of the probability matrix can also be expressed in terms of conditional probabilities.
For example, the likelihood that bond A will default given that B has defaulted is computed as:

P(A|B) = P(A ∩ B) / P(B) = 6% / 10% = 60%

This means that in 60% of the scenarios in which bond B defaults, bond A will also default. Rearranging the definition of conditional probability gives:

P(A ∩ B) = P(A|B) × P(B) .......... (I)

Also:

P(A ∩ B) = P(B|A) × P(A) .......... (II)

Equating the right-hand sides of equations I and II and rearranging gives Bayes' theorem:

P(A|B) = [P(B|A) × P(A)] / P(B)
When presented with new data, Bayes' theorem can be applied to update beliefs. To understand how the theorem provides a framework for exactly how beliefs should be updated, consider the following scenario:

Based on an examination of historical data, it has been determined that all fund managers at a certain fund fall into one of two groups: Stars and Non-Stars. Stars are the best managers. The probability that a Star will beat the market in any given year is 75%. Other managers are just as likely to beat the market as they are to underperform it (i.e., Non-Stars have 50/50 odds of beating the market). For both types of managers, the probability of beating the market is independent from one year to the next. Stars are rare: of a given pool of managers, only 16% turn out to be Stars.
A new manager was added to the portfolio of funds three years ago. Since then, the new manager has beaten the market every year. What was the probability that the manager was a Star when the manager was first added to the portfolio? What is the probability that this manager is a Star now? What is the probability that the manager will beat the market next year, given that he has beaten it in each of the past three years?
Solution
We first summarize the data by introducing some notation. Let S denote the event that a manager is a Star and B the event that the manager beats the market in a given year. The probability that a Star beats the market is:

P(B|S) = 0.75 = 3/4

The probability that a Non-Star beats the market is:

P(B|S̄) = 0.5 = 1/2

The probability that the new manager was a Star at the time he was added to the portfolio is simply the unconditional probability that any manager is a Star:

P(S) = 0.16 = 4/25
To evaluate the likelihood that the manager is a Star at present, we compute the probability that he is a Star given that he has beaten the market for three consecutive years, P(S|3B), using Bayes' theorem:

P(S|3B) = [P(3B|S) × P(S)] / P(3B)

P(3B|S) = (3/4)³ = 27/64
The denominator is the unconditional probability that the manager beats the market for three consecutive years:

P(3B) = (3/4)³ × (4/25) + (1/2)³ × (21/25) = 69/400

Therefore:

P(S|3B) = [(27/64) × (4/25)] / (69/400) = 9/23 ≈ 39%

Therefore, there is a 39% chance that the manager is a Star after beating the market for three consecutive years. This is our new (posterior) belief and is a significant improvement from our old (prior) belief of 16%.

Finally, we compute the manager's chances of beating the market next year. This is the sum of the probability that a Star beats the market and the probability that a Non-Star beats the market, each weighted by the updated probability of the manager's type:

P(B) = (3/4) × (9/23) + (1/2) × (14/23) ≈ 60%
In general, writing Bayes' theorem as

P(S|3B) = [P(3B|S) × P(S)] / P(3B)

the left-hand side is the posterior, the first item in the numerator is the likelihood, the second item is the prior, and the denominator is the unconditional probability of the observed data.
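The update above is easy to verify numerically. The short Python sketch below is a minimal illustration (the variable names are my own, not from the text); it reproduces the posterior and the predictive probability of beating the market next year.

# Bayes update for the Star/Non-Star manager example
p_star = 0.16            # prior P(S)
p_beat_star = 0.75       # P(B|S)
p_beat_nonstar = 0.50    # P(B|non-S)

# Likelihoods of beating the market three years in a row
like_star = p_beat_star ** 3
like_nonstar = p_beat_nonstar ** 3

# Unconditional probability of three consecutive wins
p_3b = like_star * p_star + like_nonstar * (1 - p_star)

# Posterior probability that the manager is a Star
posterior_star = like_star * p_star / p_3b
print(round(posterior_star, 4))    # ~0.3913, i.e., about 39%

# Probability of beating the market next year
p_next = p_beat_star * posterior_star + p_beat_nonstar * (1 - posterior_star)
print(round(p_next, 4))            # ~0.5978, i.e., about 60%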
Question 1
The probability that the Eurozone economy will grow this year is 18%, and the
probability that the European Central Bank (ECB) will loosen its monetary policy is 52%.
Assume that the joint probability that the Eurozone economy will grow and the ECB will
loosen its monetary policy is 45%. What is the probability that either the Eurozone
economy will grow or the ECB will loosen its monetary policy?
A. 42.12%
B. 25%
C. 11%
D. 17%
The correct answer is B.

P(E) = 0.18 (the probability that the Eurozone economy will grow is 18%)
P(M) = 0.52 (the probability that the ECB will loosen its monetary policy is 52%)
P(EM) = 0.45 (the joint probability that the Eurozone economy will grow and the ECB will loosen its monetary policy is 45%)

The probability that either the Eurozone economy will grow or the central bank will loosen its monetary policy is:

P(E ∪ M) = P(E) + P(M) − P(EM) = 0.18 + 0.52 − 0.45 = 0.25 = 25%
Question 2
The following probabilities are given:

P(O|T) = 0.62: the conditional probability of reaching the office if the train arrives on time
P(O|Tᶜ) = 0.47: the conditional probability of reaching the office if the train does not arrive on time
P(T) = 0.65: the unconditional probability of the train arriving on time

Calculate P(O), the unconditional probability of reaching the office.
A. 0.4325
B. 0.5675
C. 0.3856
D. 0.5244
The correct answer is B.

If P(T) = 0.65 (the unconditional probability of the train arriving on time is 0.65), then the unconditional probability of the train not arriving on time is P(Tᶜ) = 1 − P(T) = 1 − 0.65 = 0.35.

Note: P(O) is the unconditional probability of reaching the office. By the total probability rule, it is simply the sum of:

1. the probability of reaching the office if the train arrives on time, multiplied by the probability of the train arriving on time, and
2. the probability of reaching the office if the train does not arrive on time, multiplied by the probability of the train not arriving on time (or, given the information, one minus the probability of the train arriving on time).

P(O) = P(O|T) × P(T) + P(O|Tᶜ) × P(Tᶜ) = 0.62 × 0.65 + 0.47 × 0.35 = 0.403 + 0.1645 = 0.5675
Question 3
Suppose you are an equity analyst for the XYZ investment bank. You use historical data to classify fund managers as either excellent or average. Excellent managers outperform the market 70% of the time, and average managers outperform the market only 40% of the time. Furthermore, 20% of all fund managers are excellent managers and 80% are simply average.

A new fund manager started three years ago and outperformed the market in all three years. What is the probability that the manager is an excellent manager?
A. 29.53%
B. 12.56%
C. 57.26%
D. 30.21%
The correct answer is C.

The best way to visualize this problem is to start off with a probability matrix.

Let E be the event of an excellent manager, A the event of an average manager, and O the event of outperforming the market in each of the three years. We know that:

P(O|E) = 0.7³, P(O|A) = 0.4³, P(E) = 0.2, P(A) = 0.8

We want P(E|O):
P(E|O) = [P(O|E) × P(E)] / [P(O|E) × P(E) + P(O|A) × P(A)]

= (0.7³ × 0.2) / (0.7³ × 0.2 + 0.4³ × 0.8)

= 57.26%
Reading 13: Random Variables
After completing this reading, you should be able to:
Explain the differences between a probability mass function and a probability density
function.
Explain the effect of a linear transformation of a random variable on the mean, variance, and standard deviation.
Random Variables
A random variable is a variable whose possible values are outcomes of a random phenomenon. It is a function that maps outcomes of a random process to real values.

Conventionally, random variables are given in upper case (such as X, Y, and Z), while the realized values are given in lower case (such as x, y, and z).

For example, let X be the random variable resulting from rolling a die. Then x is the outcome of one roll, and it could take any of the values 1, 2, 3, 4, 5, or 6. The probability that the resulting outcome is, say, 3 is written as:

P(X = x) where x = 3
Types of Random Variables

A discrete random variable is one that produces a set of distinct values. A discrete random variable arises when:

the range of all possible values is a finite set, e.g., {1, 2, 3, 4, 5, 6} as in the case of a six-sided die; or
the range of all possible values is a countably infinite set, e.g., {1, 2, 3, ...}, such as the number of candidates registered for the FRM Part I exam at any given time.

Since the possible values of a random variable are mostly numerical, they can be explained using mathematical functions. A function f_X(x) = P(X = x) for each x in the range of X is the probability function (PF) of X; it explains how the total probability (which is 1) is distributed amongst the possible values of X.

There are two functions used when explaining the features of the distribution of discrete random variables: the probability mass function (PMF) and the cumulative distribution function (CDF).
The Probability Mass Function (PMF)

This function gives the probability that a random variable takes a particular value. Since a PMF returns probabilities, it must satisfy two properties:

1. f_X(x) ≥ 0 (probabilities cannot be negative)
2. ∑_x f_X(x) = 1 (the sum across all values in the support of the random variable must equal 1)

For example, consider a Bernoulli random variable X, which takes the value 1 with probability p and 0 with probability 1 − p. Its PMF is:

f_X(x) = pˣ(1 − p)^(1−x), x = 0, 1

So that:

f_X(0) = p⁰(1 − p)^(1−0) = 1 − p

And

f_X(1) = p¹(1 − p)^(1−1) = p

Looking at the above results, the first property (f_X(x) ≥ 0) of probability distributions is met. For the second property:

∑_x f_X(x) = ∑_{x=0,1} f_X(x) = (1 − p) + p = 1
Moreover, the probability of observing 0 is 1 − p, and the probability of observing 1 is p, so the PMF can be written as:

f_X(x) = { 1 − p,  x = 0
         { p,      x = 1

The graph of the Bernoulli PMF is shown below, assuming p = 0.7. Note that the PMF is only defined for x = 0, 1.
The Cumulative Distribution Function (CDF)

The CDF measures the probability of realizing a value less than or equal to the input x, Pr(X ≤ x). It is defined as:

F_X(x) = Pr(X ≤ x)

The CDF is monotonic and non-decreasing in x since it accumulates total probability. Unlike the PMF, it is defined for every real value of x, and it takes values between 0 and 1 inclusive. For the Bernoulli random variable:
F_X(x) = { 0,      x < 0
         { 1 − p,  0 ≤ x < 1
         { 1,      x ≥ 1
F_X(x) is defined for all real values of x. The graph of F_X(x) against x begins at 0 and then rises by jumps at the values of x for which P(X = x) is positive, reaching its maximum value of 1. For the Bernoulli distribution with parameter p = 0.7, the CDF is:

F_X(x) = { 0,    x < 0
         { 0.3,  0 ≤ x < 1
         { 1,    x ≥ 1
Relationship Between the CDF and PMF with Discrete Random Variables

The CDF can be represented as the sum of the PMF over all values in the support that are less than or equal to x. Simply put:

F_X(x) = ∑_{t ∈ R(X), t ≤ x} f_X(t)

On the other hand, the PMF is the difference between the CDF evaluated at consecutive values of X. That is:

f_X(x_i) = F_X(x_i) − F_X(x_{i−1})
Example: Developing the PMF and CDF

There are 8 hens with different weights in a cage. Hens 1 to 3 weigh 1 kg, hens 4 and 5 weigh 2 kg, and the rest weigh 3 kg. We need to develop the PMF and the CDF.
Solution
The random variable X here is the weight of a randomly chosen hen, taking the values 1 kg, 2 kg, or 3 kg. Since each hen is equally likely to be chosen:

f_X(1) = Pr(X = 1) = 3/8
f_X(2) = Pr(X = 2) = 2/8 = 1/4
f_X(3) = Pr(X = 3) = 3/8

So the PMF is:

f_X(x) = { 3/8,  x = 1
         { 1/4,  x = 2
         { 3/8,  x = 3
For the CDF, we accumulate the PMF over all realized values of the random variable. So,

F_X(0) = Pr(X ≤ 0) = 0
F_X(1) = Pr(X ≤ 1) = 3/8
F_X(2) = Pr(X ≤ 2) = 3/8 + 2/8 = 5/8   [using F_X(x) = ∑_{t ∈ R(X), t ≤ x} f_X(t)]
F_X(3) = Pr(X ≤ 3) = 5/8 + 3/8 = 1

Therefore:

F_X(x) = { 0,    x < 1
         { 3/8,  1 ≤ x < 2
         { 5/8,  2 ≤ x < 3
         { 1,    3 ≤ x

Note that

f_X(3) = F_X(3) − F_X(2) = 1 − 5/8 = 3/8
Continuous Random Variables

A continuous random variable can assume any value along a given interval of the number line, for instance, x > 0, −∞ < x < ∞, or 0 < x < 1. Examples of continuous random variables include the price of a stock or bond, or the value at risk of a portfolio at a particular point in time.
This implies that p is the likelihood that the random variable X falls between r₁ and r₂.
The Probability Density Function (PDF) under Continuous Random Variables

Given a PDF f(x), we can determine the probability that X falls between a and b:

Pr(a < X ≤ b) = ∫_a^b f(x) dx

The probability that X lies between two values is the area under the density function graph between those values. Probability distribution function is another term sometimes used to refer to the probability density function.

The properties of the PDF mirror those of the PMF. That is:

1. f_X(x) ≥ 0, −∞ < x < ∞ (non-negativity)
2. ∫_{r_min}^{r_max} f(x) dx = 1 (the total probability must equal 1, just as in the discrete case, where r_min and r_max are the bounds of the support)
The Cumulative Distribution Function (CDF)

The CDF is sometimes also called the cumulative density function and is closely related to the PDF. The CDF gives the likelihood of a random variable falling at or below a specific value. To determine the CDF, the PDF is integrated from its lower bound.
Traditionally, the CDF is denoted by the capital letter corresponding to the density function's letter. The following expression gives the CDF, F, of a random variable X whose PDF is f(x):

F(a) = ∫_{−∞}^{a} f(x) dx = P[X ≤ a]
The area under the PDF up to a point is a depiction of the CDF. The CDF is non-decreasing and varies from zero to one: it must be zero at the minimum value of the support, since the variable cannot be less than its minimum, and the likelihood that the random variable is less than or equal to its maximum is 100%.

To obtain the PDF from the CDF, we compute the first derivative of the CDF. Therefore:

f(x) = dF(x)/dx

Next, we look at how to determine the probability that a random variable X will fall between two values a and b:

P[a < X ≤ b] = ∫_a^b f(x) dx = F(b) − F(a)
P [X > a] = 1 − F (a)
Example: Finding the CDF from a PDF

The continuous random variable X has a PDF of f(x) = 12x²(1 − x) for 0 < x < 1. We need to find the CDF, F(x).

Solution

We know that:

F(x) = ∫_{−∞}^{x} f(t) dt

F(x) = ∫_0^x 12t²(1 − t) dt = [4t³ − 3t⁴]_0^x = x³(4 − 3x)

So,

F(x) = x³(4 − 3x)
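The closed-form CDF can be checked by numerical integration; the snippet below is a minimal sketch using scipy's quad (not part of the original text).

from scipy.integrate import quad

pdf = lambda t: 12 * t**2 * (1 - t)

# Compare numerical integration of the PDF with F(x) = x^3 (4 - 3x)
for x in (0.25, 0.5, 0.9):
    numeric, _ = quad(pdf, 0, x)
    closed = x**3 * (4 - 3 * x)
    print(x, round(numeric, 6), round(closed, 6))   # the two columns agree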
Expected Values
The expected value is a numerical summary of the distribution of a random variable. Denoted by E[X] or μ, it gives the value of X that measures the average or center of the distribution. For a discrete random variable:

E[X] = ∑_x x f_X(x)

It is simply the sum of the products of each value of the random variable and the probability of that value.
Example: Calculating the Mean

There are 8 hens with different weights in a cage. Hens 1 to 3 weigh 1 kg, hens 4 and 5 weigh 2 kg, and the rest weigh 3 kg. We need to calculate the mean weight of the hens.
Solution
Recall the PMF:

f_X(x) = { 3/8,  x = 1
         { 1/4,  x = 2
         { 3/8,  x = 3

Now,

E[X] = ∑_x x f_X(x) = 1 × 3/8 + 2 × 1/4 + 3 × 3/8 = 2

For a continuous random variable, the expectation is defined as:

E[X] = ∫_{−∞}^{∞} x f(x) dx
Basically, it is all about integrating the product of the value of the random variable and its probability density over the support.

Example: Expected Value of a Continuous Random Variable

The continuous random variable X has a PDF of f(x) = 12x²(1 − x) for 0 < x < 1. Calculate E(X).

Solution

We know that:

E[X] = ∫_{−∞}^{∞} x f(x) dx
So,

E(X) = ∫_0^1 x · 12x²(1 − x) dx = [3x⁴ − (12/5)x⁵]_0^1 = 0.6
For functions of random variables, we apply the same method as for a single random variable: we sum or integrate the product of the value of the function and the corresponding probability (PMF or PDF). For a continuous random variable:

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

Example: Expectation of a Function of a Random Variable

A random variable X has the density

f_X(x) = (1/5)x², for 0 < x < 3

Calculate E(2X + 1).

Solution

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

= ∫_0^3 (2x + 1) · (1/5)x² dx = (1/5)[x⁴/2 + x³/3]_0^3 = 9.9
Properties of Expectation

The expected value of a constant is the constant itself. That is, E(c) = c. Moreover, the expected value of a random variable is a constant, not a random variable. In addition, expectation is linear: for constants a and b, E(a + bX) = a + bE(X).

The Variance

The variance of a random variable measures the spread (dispersion or variability) of the distribution about its mean:

Var(X) = E[(X − μ)²]

The standard deviation is the square root of the variance. Denoting E(X) = μ, the variance can also be written as:

Var(X) = E(X²) − μ²
Example: Calculating the Variance

The continuous random variable X has a PDF of f(x) = 12x²(1 − x) for 0 < x < 1. Calculate the variance of X.

Solution

We know that:

Var(X) = E(X²) − μ²

We therefore have to calculate:
E(X²) and E(X).

E(X) = ∫_0^1 x · 12x²(1 − x) dx = [3x⁴ − (12/5)x⁵]_0^1 = 0.6

E(X²) = ∫_0^1 x² · 12x²(1 − x) dx = ∫_0^1 (12x⁴ − 12x⁵) dx = [(12/5)x⁵ − 2x⁶]_0^1 = 0.4

So,

Var(X) = E(X²) − [E(X)]² = 0.4 − 0.6² = 0.04
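The mean and variance above can also be verified numerically; the following is an illustrative check with scipy's quad.

from scipy.integrate import quad

pdf = lambda x: 12 * x**2 * (1 - x)

mean, _ = quad(lambda x: x * pdf(x), 0, 1)        # E(X)
second, _ = quad(lambda x: x**2 * pdf(x), 0, 1)   # E(X^2)
print(round(mean, 4), round(second, 4), round(second - mean**2, 4))   # 0.6, 0.4, 0.04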
Moments

Moments are expected values that concisely describe the features of a distribution. The first moment is the mean:

μ₁ = E(X)

Therefore, the first moment provides information about the average value. The second and higher moments are broadly divided into central and non-central moments.

Central Moments

Central moments are moments about the mean:

μ_k = E[(X − E(X))^k]

where k denotes the order of the moment.

Non-Central Moments

Non-central moments are moments about 0. The general formula is given by:

μ_k = E(X^k)
Note that the central moments can be constructed from the non-central moments, and that the first central moment is always zero.
Population Moments
The four common population moments are the mean, variance, skewness, and kurtosis.

The Mean

μ = E(X)

The Variance

The variance measures the spread of the random variable about its mean:

σ² = E[(X − μ)²]

The standard deviation (σ) is the square root of the variance. The standard deviation is more commonly quoted in finance because it is directly comparable to the mean, since they share the same measurement units.
The Skewness

skew(X) = E[(X − E(X))³] / σ³ = E[((X − μ)/σ)³]

Note that (X − μ)/σ is a standardized version of X, with a mean of 0 and a variance of 1.
Positive skew

In most cases (but not always), the mean is greater than the median, or equivalently, the mean is greater than the mode; in this case, the skewness is greater than zero.

Negative skew

In most cases (but not always), the mean is lower than the median, or equivalently, the mean is lower than the mode; in this case, the skewness is lower than zero.
Kurtosis
Kurt(X) = E[(X − E(X))⁴] / σ⁴ = E[((X − μ)/σ)⁴]

The description of kurtosis is analogous to that of skewness, except that the fourth power of the standardized variable is used. The even (fourth) power means that kurtosis measures the magnitude of deviations regardless of sign. The reference value for a normally distributed random variable is 3. A random variable with kurtosis exceeding 3 is termed heavy-tailed (leptokurtic).
Linear Transformations of Random Variables

A linear transformation of a random variable involves shifting the variable by adding a constant and/or scaling it by multiplying it by a constant. Here, α is referred to as the shift constant and β is the scale constant. The transformation shifts X by α and scales it by β. The process results in a new random variable, usually denoted by Y:

Y = α + βX
Linear transformations matter because many variables used in finance are linear transformations of other variables.

Example: A Linear Transformation

Suppose your salary is α dollars per year, and you are entitled to a bonus of β dollars for every dollar of sales you successfully bring in. Let X be what you sell in a certain year. How much in total do you make?

Solution

We can linearly transform the sales variable X into a new variable Y that represents the total amount made:

Y = α + βX
E(Y) = E(α + βX) = α + βE(X)

The shift parameter α does not affect the variance. Why? Because variance is a measure of spread about the mean; adding α does not change the spread but merely shifts the distribution to the left or right. Thus:

Var(Y) = Var(α + βX) = β²Var(X) = β²σ²

and the standard deviation of Y is:

√(β²σ²) = |β|σ

It can also be shown that if β is positive (so that Y = α + βX is an increasing transformation), then the skewness and kurtosis of Y are identical to the skewness and kurtosis of X. This is because both moments are defined on standardized quantities, which removes the effect of the shift constant α and the scale constant β.

We know that:

skew(X) = E[((X − μ)/σ)³]

Now,

skew(Y) = E[(Y − E(Y))³] / σ_Y³ = E[((Y − E(Y))/σ_Y)³]

= E[((α + βX − (α + βμ)) / (βσ))³]

= E[((β(X − μ)) / (βσ))³] = E[((X − μ)/σ)³] = skew(X)
However, if β < 0, the magnitude of the skewness of Y is the same as that of X but with the opposite sign, because of the odd power (i.e., 3). The kurtosis, on the other hand, is unaffected, because it uses an even power (i.e., 4).
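These transformation rules are easy to confirm by simulation. The sketch below (illustrative; it assumes arbitrary values α = 2 and β = −3 and an exponential, i.e., skewed, X) checks the variance and skewness results with numpy.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)    # a positively skewed variable
alpha, beta = 2.0, -3.0
y = alpha + beta * x

skew = lambda z: np.mean(((z - z.mean()) / z.std()) ** 3)

print(np.var(y), beta**2 * np.var(x))   # Var(Y) ≈ β² Var(X)
print(skew(x), skew(y))                 # equal magnitude, opposite sign since β < 0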
Just like with any data set, quantities such as the quantiles and the mode are used to describe the distribution.

The Quantiles

For a continuous random variable X, the α-quantile of X is the smallest number m such that:

Pr(X < m) = α

where α ∈ [0, 1].

For instance, if X is a continuous random variable, the median is defined as the solution of:

P(X < m) = ∫_{−∞}^{m} f_X(x) dx = 0.5

Similarly, the lower and upper quartiles are such that P(X < Q1) = 0.25 and P(X < Q3) = 0.75, and the interquartile range is:

IQR = Q3 − Q1
Example: Calculating the Quantiles of a PDF

A random variable X has the density f_X(x) = 3e^(−2x), x ≥ 0. Find the median, m, of X.

Solution

The median m satisfies:

P(X < m) = ∫_0^m 3e^(−2x) dx = 0.5

So,

[−(3/2)e^(−2x)]_0^m = 0.5

−(3/2)e^(−2m) + 3/2 = 0.5

⇒ m = −(1/2) × ln(2/3) = 0.2027
Mode

The mode measures central tendency in the sense of the most likely value: it is the location of the most frequently observed value of a random variable. For a continuous random variable, the mode is the highest point of the PDF.

Random variables can be unimodal if there is just one mode, bimodal if there are two modes, or multimodal if there are more than two. The graph below shows the difference between unimodal and bimodal distributions.
Question 1
If a random variable X has a mean of 4 and a standard deviation of 2, calculate Var(3 − 4X).
A. 29
B. 30
C. 64
D. 35
Solution

The correct answer is C.

Recall that:

Var(a + bX) = b²Var(X)

So,

Var(3 − 4X) = (−4)² × Var(X) = 16 × Var(X)

But we are given that the standard deviation is 2, implying that the variance is 4.

Therefore,

Var(3 − 4X) = 16 × 4 = 64
Question 2
A continuous random variable has a PDF given by f_X(x) = ce^(−3x) for all x > 0. Calculate Pr(X < 6.5).
A. 0.4532
B. 0.4521
C. 0.3321
D. 0.9999
Solution
The correct answer is D.

We know that:

∫_{−∞}^{∞} f(x) dx = 1

So,

∫_0^∞ c·e^(−3x) dx = c[−(1/3)e^(−3x)]_0^∞ = c[0 − (−1/3)] = 1

⇒ c = 3

Therefore, the PDF is f_X(x) = 3e^(−3x), so that Pr(X < 6.5) is given by:

∫_0^6.5 3e^(−3x) dx = 3[−(1/3)e^(−3x)]_0^6.5 = 1 − e^(−3×6.5)

= 0.9999
Reading 14: Common Univariate Random Variables
After completing this reading, you should be able to:
Distinguish the key properties among the following distributions: uniform distribution, Bernoulli distribution, binomial distribution, Poisson distribution, normal distribution, lognormal distribution, chi-squared distribution, Student's t distribution, and F-distribution.
Describe a mixture distribution and explain the creation and characteristics of mixture
distributions.
Parametric Distributions
T here are two types of distributions, namely parametric and non-parametric distributions. Functions
mathematically describe parametric distributions. On the other hand, one cannot use a mathematical
Bernoulli Distribution

A Bernoulli random variable is a discrete random variable that takes on the values 0 and 1. This distribution is suitable for scenarios with binary outcomes, such as corporate defaults. Most of the time, 1 is used to denote a success and 0 a failure.

The Bernoulli distribution has a parameter p, which is the probability of success, i.e., the probability that X = 1:

P[X = 1] = p and P[X = 0] = 1 − p

The probability mass function of the Bernoulli distribution, stated as X ∼ Bernoulli(p), is given by:

f_X(x) = pˣ(1 − p)^(1−x), x = 0, 1

The CDF is:

F_X(x) = { 0,      x < 0
         { 1 − p,  0 ≤ x < 1
         { 1,      x ≥ 1

Therefore, the mean and variance of the distribution are computed as:

E(X) = p × 1 + (1 − p) × 0 = p

V(X) = E(X²) − [E(X)]² = p − p² = p(1 − p)
Example: Mean and Variance of a Bernoulli Random Variable

Suppose X ∼ Bernoulli(p) with p = 0.75. Calculate the ratio of the mean of X to its variance.

Solution

E(X) = p

and

V(X) = p(1 − p)

So,

E(X)/V(X) = p / [p(1 − p)] = 1/(1 − p) = 1/0.25 = 4
Binomial Distribution

A binomial random variable quantifies the total number of successes in n independent Bernoulli trials, with the probability of success being p and, of course, the probability of failure being 1 − p. Consider the following example: suppose we are given two independent bonds, each with a default likelihood of 10%. Then we have the following possibilities:
Neither defaults, only one of them defaults, or both default:

P[X = 0] = (1 − 10%)² = 81%
P[X = 1] = 2 × 10% × (1 − 10%) = 18%
P[X = 2] = 10%² = 1%

With three such bonds, for example:

P[X = 0] = (1 − 10%)³ = 72.9%
P[X = 3] = 10%³ = 0.1%
Suppose now that we have n bonds. The following combination represents the number of ways in which x of the n bonds can default:

(n choose x) = n! / [x!(n − x)!] .......... equation I

If p is the likelihood that one bond will default, then the probability that any particular x bonds default (and the remaining n − x do not) is:

pˣ(1 − p)^(n−x) .......... equation II

Combining equations I and II, we can determine the likelihood of exactly x bonds defaulting as follows:

P[X = x] = (n choose x) pˣ(1 − p)^(n−x), for x = 0, 1, 2, …, n
Therefore, the binomial distribution has two parameters, n and p, and is usually stated as X ∼ B(n, p). Its CDF is:

F_X(x) = ∑_{i=0}^{⌊x⌋} (n choose i) pⁱ(1 − p)^(n−i)

The mean and variance of the binomial distribution can be evaluated using moments. The mean and variance are:

E(X) = np

And

V(X) = np(1 − p)

The binomial distribution can be approximated using a normal distribution (as will be seen later) if np ≥ 10 and n(1 − p) ≥ 10.
Example: Binomial Probability

Suppose X ∼ B(n = 4, p = 0.6). Calculate P(X ≥ 3).

Solution

P[X = x] = (n choose x) pˣ(1 − p)^(n−x)

⇒ P(X ≥ 3) = P(X = 3) + P(X = 4) = (4 choose 3) p³(1 − p)^(4−3) + (4 choose 4) p⁴(1 − p)^(4−4)

= (4 choose 3) 0.6³(1 − 0.6)^(4−3) + (4 choose 4) 0.6⁴(1 − 0.6)^(4−4)

= 0.3456 + 0.1296 = 0.4752
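The same probability can be obtained from scipy.stats; the check below is illustrative only.

from scipy.stats import binom

# P(X >= 3) for X ~ B(4, 0.6)
print(1 - binom.cdf(2, n=4, p=0.6))                   # 0.4752
print(binom.pmf(3, 4, 0.6) + binom.pmf(4, 4, 0.6))    # same result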
Poisson Distribution

Events are said to follow a Poisson process if they happen at a constant rate over time and the likelihood that one event will take place is independent of all the other events, for instance, the number of bond defaults in a portfolio over a given month.

Suppose that X is a Poisson random variable, stated as X ∼ Poisson(λ). Then the PMF is given by:

P[X = x] = (λˣ e^(−λ)) / x!

and the CDF is:

F_X(x) = ∑_{i=0}^{⌊x⌋} (λⁱ e^(−λ)) / i!

The Poisson parameter λ (lambda), termed the hazard rate, represents the mean number of events in an interval. Therefore, the mean and variance of the Poisson distribution are given by:

E(X) = λ

And

V(X) = λ
Example: Poisson Probability

A fixed-income portfolio is made up of a huge number of independent bonds. The average number of bonds defaulting every month is 10. What is the probability that there are exactly 5 defaults in one month?

Solution

P(X = x) = (λˣ e^(−λ)) / x!

P(X = 5) = (10⁵ e^(−10)) / 5! = 0.03783
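Again, a one-line check with scipy.stats (illustrative):

from scipy.stats import poisson
print(poisson.pmf(5, mu=10))    # ≈ 0.03783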
If X₁ ∼ Poisson(λ₁) and X₂ ∼ Poisson(λ₂) are independent, then their sum is also Poisson:

Y = X₁ + X₂ ∼ Poisson(λ₁ + λ₂)

Therefore, the Poisson distribution is suitable for time-series data, since summing the number of events across non-overlapping time intervals preserves the distribution.
Uniform Distribution

A uniform distribution is a continuous distribution that takes any value within the range [a, b] with equal density. Its PDF is:

f_X(x) = 1/(b − a), a ≤ x ≤ b

Note that the PDF of a uniform random variable does not depend on x, since all values in the range are equally likely. The CDF is:

F_X(x) = { 0,                x < a
         { (x − a)/(b − a),  a ≤ x ≤ b
         { 1,                x ≥ b

When a = 0 and b = 1, the distribution is called the standard uniform distribution. From a standard uniform U₁, any other uniform random variable can be generated as:

U₂ = a + (b − a)U₁

The uniform distribution is denoted by X ∼ U(a, b), and the mean and variance are given by:

E(X) = (a + b)/2

V(X) = (b − a)²/12
For instance, the mean and variance of the standard uniform distribution U₁ ∼ U(0, 1) are given by:

E(X) = (0 + 1)/2 = 1/2

And

V(X) = (1 − 0)²/12 = 1/12
Assume that we want to calculate the probability that X falls in the interval l < X < u, where l is the lower limit and u is the upper limit. That is, we need P(l < X < u) given that X ∼ U(a, b). To compute this, we use:

P(l < X < u) = [min(u, b) − max(l, a)] / (b − a)

When the interval lies entirely inside [a, b], this simplifies to:

(u − l)/(b − a)
Example: Uniform Distribution

Given the uniform distribution X ∼ U(−5, 10), calculate the mean, the variance, and P(−3 < X < 6).

Solution

E(X) = (a + b)/2 = (−5 + 10)/2 = 2.5

And

V(X) = (10 − (−5))²/12 = 225/12 = 18.75

P(l < X < u) = [min(u, b) − max(l, a)] / (b − a) = [min(6, 10) − max(−3, −5)] / (10 − (−5)) = (6 − (−3))/15 = 0.60

Alternatively, you can think of the probability as the area under the PDF between −3 and 6. The height of the PDF is 1/(b − a), so:

1/(b − a) × (u − l) = 1/(10 − (−5)) × (6 − (−3)) = 9/15 = 0.60
Normal Distribution

Also called the Gaussian distribution, the normal distribution has a symmetrical PDF, and the mean and median coincide with the highest point of the PDF. Furthermore, the normal distribution always has a skewness of zero and a kurtosis of 3.
The following is the formula of the PDF of a normally distributed random variable X:

f(x) = (1/(σ√(2π))) e^(−(1/2)((x − μ)/σ)²), −∞ < x < ∞

X ∼ N(μ, σ²)
We read this as: X is normally distributed with mean μ and variance σ². Any linear combination of independent normal variables is also normal. To illustrate this, assume X and Y are two independent normally distributed variables and a and b are constants. Then Z = aX + bY will be normally distributed such that:

Z ∼ N(aμ_X + bμ_Y, a²σ²_X + b²σ²_Y)

For instance, for a = b = 1, Z = X + Y and thus Z ∼ N(μ_X + μ_Y, σ²_X + σ²_Y).
A standard normal distribution is a normal distribution whose mean is 0 and standard deviation is 1. Its PDF is:

φ(x) = (1/√(2π)) e^(−x²/2)

To obtain a normal variable with standard deviation σ and mean μ, we multiply the standard normal variable Z by σ and then add the mean:

X = μ + σZ ⇒ X ∼ N(μ, σ²)
Three independent standard normal variables X₁, X₂, and X₃ can be combined in the following way to construct two standard normal variables, X_A and X_B, with correlation ρ:

X_A = √ρ X₁ + √(1 − ρ) X₂
X_B = √ρ X₁ + √(1 − ρ) X₃
The z-value measures how many standard deviations the corresponding x value is above or below the mean:

z = (X − μ)/σ ∼ N(0, 1)

where

X ∼ N(μ, σ²)

The standard normal CDF, Φ(z), has no closed-form expression, so its values are tabulated.
For example, consider the normal distribution X ∼ N(1, 2). We wish to calculate P(X > 2).

Solution

First standardize:

z = (2 − 1)/√2 = 0.7071

P(X > 2) = 1 − P(X ≤ 2) = 1 − Φ(0.7071) ≈ 1 − 0.76 = 0.24
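This probability can be confirmed with scipy.stats (illustrative; note that norm is parameterized by the standard deviation, so scale = √2):

from scipy.stats import norm
print(1 - norm.cdf(2, loc=1, scale=2**0.5))    # ≈ 0.24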
x-value z-value
μ 0
μ + 1σ 1
μ + 2σ 2
μ + nσ n
Recall that for a binomial random variable, if np ≥ 10 and n(1 − p) ≥ 10, then the binomial distribution can be approximated by a normal distribution with mean np and variance np(1 − p).
Also, the Poisson distribution is approximately normal when λ is large (λ ≥ 1000), so that:
X ∼ N (λ, λ)
We then calculate the probabilities while maintaining the normal distribution principles. The normal distribution has several properties that make it popular:

Many discrete and continuous random variable distributions can be approximated using the normal distribution.

The normal distribution is central to the Central Limit Theorem (CLT), which is utilized in hypothesis testing.

The normal distribution is closely related to other important distributions, such as the chi-squared, Student's t, and F distributions.

A notable property of normal random variables is that they are infinitely divisible, which makes the normal distribution suitable for modeling asset prices.

Normal distributions are closed under linear operations. In other words, the weighted sum of independent normal random variables is also normal.
Lognormal Distribution
A variable X is said to be lognormally distributed if Y = ln X is normally distributed, or equivalently:

X = e^Y

where

Y ∼ N(μ, σ²)
The PDF of the lognormal distribution is:

f(x) = (1/(xσ√(2π))) e^(−(1/2)((ln(x) − μ)/σ)²), x ≥ 0
A variable is said to have a lognormal distribution if its natural logarithm has a normal distribution. The lognormal distribution is undefined for negative values, unlike the normal distribution, which is defined over the entire real line.

If the density function of the lognormal distribution is rearranged, we obtain an equation that has a form similar to the normal density. That is:
f(x) = (1/(σ√(2π))) e^((1/2)σ² − μ) × e^(−(1/2)((ln x − (μ − σ²))/σ)²)
From the above, we notice that the lognormal distribution is asymmetrical: it is not symmetrical around its mean, as is the case with the normal distribution. The mean of the lognormal distribution is:

E[X] = e^(μ + σ²/2)

This closely resembles the Taylor-expansion approximation of the natural logarithm:

r ≈ R − (1/2)R²

The variance of the lognormal distribution is:

V(X) = E[(X − E[X])²] = (e^(σ²) − 1) e^(2μ + σ²)
Example: Mean of a Lognormal Distribution

Consider a lognormal distribution given by X ∼ LogN(μ = 0.08, σ² = 0.2). Calculate the expected value.

Solution

E[X] = e^(μ + σ²/2) = e^(0.08 + 0.2/2) = e^0.18 = 1.19721
Chi-Squared Distribution, χ²

Assume we have k independent standard normal variables Z₁, …, Z_k. The sum of their squares follows a chi-squared distribution:

S = ∑_{i=1}^{k} Z_i²

S ∼ χ²_k

k is called the number of degrees of freedom. It is important to note that two independent chi-squared variables, with k₁ and k₂ degrees of freedom respectively, have a sum that is chi-squared distributed with k₁ + k₂ degrees of freedom. The mean and variance are:

E(S) = k

and

V(S) = 2k
The PDF of the chi-squared distribution is:

f(x) = (1 / (2^(k/2) Γ(k/2))) x^(k/2 − 1) e^(−x/2), x ≥ 0
The gamma function, Γ, is such that:

Γ(n) = ∫_0^∞ x^(n−1) e^(−x) dx

For a positive integer n, it simplifies to:

Γ(n) = (n − 1)!

For instance:

Γ(3) = (3 − 1)! = 2 × 1 = 2
This distribution is widely used in statistics and risk management when testing hypotheses. The chi-squared distribution can be approximated using a normal distribution when k is large. This implies that:

χ²_k ≈ N(k, 2k) for large k

This is true because, as the number of degrees of freedom increases, the skewness of the chi-squared distribution reduces. Degrees of freedom measure the amount of data available to test model parameters: if we have a sample of size n, the degrees of freedom are given by n − p, where p is the number of parameters estimated.
Student's t Distribution

This distribution is often called the t distribution. Let Z be a standard normal variable and U a chi-squared variable with k degrees of freedom, independent of Z. Then the random variable

X = Z / √(U/k)

follows a t distribution with k degrees of freedom. Its PDF is:

f(x) = [Γ((k + 1)/2) / (√(kπ) Γ(k/2))] × (1 + x²/k)^(−(k+1)/2)
The mean of the t distribution is zero, and the distribution is symmetrical around it. That is:

E(X) = 0

V(X) = k/(k − 2)

Kurt(X) = 3(k − 2)/(k − 4)

The mean is defined for k > 1, the variance is finite for k > 2, and the kurtosis is finite only for k > 4, in which case it always exceeds 3.

We can also separate the degrees of freedom from the variance to get what is called the standardized Student's t. Rescaling by √((k − 2)/k) gives a unit variance:

V[√((k − 2)/k) × X] = 1

where

X ∼ t_k
The standardized Student's t has a mean of 0 and a variance of 1. Note that we can still rescale it to have any desired variance, provided k > 2. A generalized Student's t is specified by the mean, the variance, and the number of degrees of freedom.

This distribution is widely applicable in hypothesis testing and in modeling the returns of financial assets.

Example: Degrees of Freedom of a Student's t

The kurtosis of some returns on a bond portfolio, with three parameters to be estimated, is 6. What are the degrees of freedom, if the returns were modeled using a Student's t_k?
Solution
Kurt(X) = 3(k − 2)/(k − 4)

∴ 6 = 3(k − 2)/(k − 4) ⇒ 2(k − 4) = k − 2

So that

k = 6
F-Distribution

The F-distribution is an asymmetric distribution that has a minimum value of 0 but no maximum value; the curve is skewed to the right. A random variable

X = (U₁/k₁) / (U₂/k₂) ∼ F(k₁, k₂)

provided that U₁ and U₂ are independent chi-squared random variables with k₁ and k₂ degrees of freedom, respectively.
The F-distribution has the following PDF:

f(x) = √( ((k₁x)^(k₁) k₂^(k₂)) / (k₁x + k₂)^(k₁+k₂) ) / (x B(k₁/2, k₂/2))

where B(x, y) is the beta function:

B(x, y) = ∫_0^1 z^(x−1)(1 − z)^(y−1) dz

The mean and variance of the F-distribution are:

E(X) = k₂/(k₂ − 2) for k₂ > 2

σ² = 2k₂²(k₁ + k₂ − 2) / [k₁(k₂ − 2)²(k₂ − 4)] for k₂ > 4
Suppose that X is a random variable with a t-distribution with k degrees of freedom; then X² is F-distributed:

X² ∼ F(1, k)
Beta Distribution

The beta distribution applies to continuous random variables with values in the range 0 to 1. This distribution is similar to the triangular distribution in the sense that both are applicable in the modeling of default rates and recovery rates. Assuming that a and b are two positive constants, the PDF of the beta distribution is:

f(x) = (1/B(a, b)) x^(a−1)(1 − x)^(b−1), 0 ≤ x ≤ 1

where B(a, b) = Γ(a)Γ(b)/Γ(a + b)
The following two equations give the mean and variance of the beta distribution:

μ = a/(a + b)

σ² = ab / [(a + b)²(a + b + 1)]
Exponential Distribution

The exponential distribution is a continuous distribution commonly used to model waiting times, such as the time to default. Its PDF is:

f_X(x) = (1/β) e^(−x/β), x ≥ 0

and its CDF is:

F_X(x) = 1 − e^(−x/β)

The parameter β of the exponential distribution determines the mean and variance of the distribution. That is:

E(X) = β

And

V(X) = β²
Notably, the exponential distribution is a close 'cousin' of the Poisson distribution: the time intervals between successive events in a Poisson process are exponentially distributed. Another feature of the exponential distribution is that it is memoryless; that is, the distribution of the remaining waiting time is independent of the time already elapsed.
Example: Exponential Distribution

Assume that the time to default for a specific segment of mortgage borrowers is exponentially distributed with β equal to ten years. What is the probability that a borrower will not default before year 11?

Solution

To find the probability that the borrower will not default before year eleven, we first calculate the cumulative probability of default up to year eleven and then subtract it from 100%:

F(11) = 1 − e^(−11/10) = 1 − 0.3329 = 0.6671 = 66.71%

P(no default before year 11) = 1 − 0.6671 = 0.3329 = 33.29%
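A quick check using scipy.stats (illustrative; expon is parameterized by scale = β):

from scipy.stats import expon
# Survival probability P(X > 11) for an exponential with beta = 10
print(expon.sf(11, scale=10))    # ≈ 0.3329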
Mixture Distributions

Mixture distributions are new, more complex distributions built using two or more component distributions. Generally, a mixture distribution comes from a weighted average of component density functions:

f(x) = ∑_{i=1}^{n} w_i f_i(x), such that ∑_{i=1}^{n} w_i = 1

The f_i(x)'s are the component distributions, with the w_i's as the weights or mixing proportions. The component weights must all sum up to one for the resulting mixture to be a legitimate distribution. Intuitively, a two-component mixture draws a Bernoulli random variable and, depending on its value (0 or 1), samples from the corresponding component distribution. Using this construction, it is possible to compute the CDF of the mixture when the component distributions are normal random variables. Mixture distributions are very flexible, falling between parametric and non-parametric distributions.
For example, consider X₁ ∼ F_X1, X₂ ∼ F_X2, and W ∼ Bernoulli(p). The mixture random variable is then:

Y = W·X₁ + (1 − W)·X₂

Both the PDF and the CDF of the mixture distribution are weighted averages of the constituent PDFs and CDFs:

f_Y(y) = p·f_X1(y) + (1 − p)·f_X2(y)

and

F_Y(y) = p·F_X1(y) + (1 − p)·F_X2(y)

Intuitively, moments are computed in a similar way; for instance, the mean is:

E(Y) = p·E(X₁) + (1 − p)·E(X₂)

Using the same logic, we can calculate higher central moments such as the skewness and the kurtosis. Note, however, that the mixture distribution may exhibit skewness and excess kurtosis even when its components do not (for example, when the components are normal random variables). Moreover, mixing components with different means and variances leads to a distribution that is both skewed and heavy-tailed.
Example: Mean of a Mixture Distribution

Consider two normal random variables X₁ ∼ N(0.15, 0.60) and X₂ ∼ N(−0.8, 3). What is the mean of a mixture of the two?

Solution

We know that the mean of the mixture is the weighted average of the component means:

E(Y) = w·E(X₁) + (1 − w)·E(X₂) = w(0.15) + (1 − w)(−0.8)

where w is the weight on X₁.
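To illustrate how a mixture is simulated and how its mean follows the weighted-average rule, here is a small numpy sketch. It assumes a 50/50 weighting purely for demonstration, which is not stated in the original example.

import numpy as np

rng = np.random.default_rng(42)
n, w = 1_000_000, 0.5                       # assumed mixing weight (illustrative)

pick = rng.random(n) < w                    # Bernoulli draw selects the component
x1 = rng.normal(0.15, np.sqrt(0.60), n)     # component 1: N(0.15, variance 0.60)
x2 = rng.normal(-0.80, np.sqrt(3.0), n)     # component 2: N(-0.8, variance 3)
y = np.where(pick, x1, x2)

print(y.mean(), w * 0.15 + (1 - w) * (-0.80))   # both ≈ -0.325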
Question
The number of events observed in a month is distributed as a Poisson random variable with mean 2. Calculate the probability that exactly 28 events are observed over one year.
A. 5.48%
B. 0.10%
C. 3.54%
D. 10.2%
The correct answer is A.

Since events arrive at a rate of 2 per month, the number of events over a year is Poisson with λ = 12 × 2 = 24. Therefore:

P[X = n] = (λⁿ e^(−λ)) / n!

P[X = 28] = (24²⁸ e^(−24)) / 28! = 5.48%
Reading 15: Multivariate Random Variables
After completing this reading, you should be able to:
Explain how a probability matrix can be used to express a probability mass function (PMF).
Compute the marginal and conditional distributions of a discrete bivariate random variable.
Explain how the expectation of a function is computed for a bivariate discrete random
variable.
Explain the relationship between the covariance and correlation of two random variables
and how these are related to the independence of the two variables.
Explain the effects of applying linear transformations on the covariance and correlation between two random variables.
Explain how the iid property is helpful in computing the mean and variance of a sum of iid
random variables.
Multivariate random variables accommodate the dependence between two or more random variables. The concepts for multivariate random variables (such as expectations and moments) are analogous to those for univariate random variables.

Multivariate random variables involve defining several random variables simultaneously on a sample space. In other words, multivariate random variables are vectors of random variables. For instance, a bivariate random variable X can be a vector with two components, X₁ and X₂, with corresponding realizations x₁ and x₂.

The PMF or PDF of a bivariate random variable gives the probability that the two random variables each take a certain value. If we wished to plot these functions, we would need three axes: x₁, x₂, and the probability.

The PMF of a bivariate random variable is a function that gives the probability that the components take on specific values:

f_{X1,X2}(x₁, x₂) = P(X₁ = x₁, X₂ = x₂)

The PMF explains the probability of each realization as a function of x₁ and x₂. The PMF has the following properties:

1. f_{X1,X2}(x₁, x₂) ≥ 0
2. ∑_{x₁} ∑_{x₂} f_{X1,X2}(x₁, x₂) = 1
The trinomial distribution is the distribution of n independent trials where each trial results in one of three outcomes (a generalization of the binomial distribution). The first, second, and third outcomes occur with probabilities p₁, p₂, and 1 − p₁ − p₂, respectively. If X₁ and X₂ count the number of first and second outcomes, then the number of third outcomes is n − X₁ − X₂, and the probability of the third outcome on each trial is:

1 − p₁ − p₂

The PMF of the trinomial distribution is:

f_{X1,X2}(x₁, x₂) = [n! / (x₁! x₂! (n − x₁ − x₂)!)] p₁^(x₁) p₂^(x₂) (1 − p₁ − p₂)^(n − x₁ − x₂)
The CDF of a bivariate discrete random variable returns the total probability that each component is less than or equal to a given value:

F_{X1,X2}(x₁, x₂) = ∑_{t₁ ≤ x₁} ∑_{t₂ ≤ x₂} f_{X1,X2}(t₁, t₂)

In this equation, t₁ runs over the values that X₁ may take as long as t₁ ≤ x₁. Similarly, t₂ runs over the values that X₂ may take as long as t₂ ≤ x₂.
Probability Matrices
In financial markets, market sentiments play a role in determining the return earned on a security.
Suppose the return earned on a bond is in part determined by the rating given to the bond by analysts.
Bond Return (X 1)
−10% 0% 10%
Analyst Positive +1 5% 5% 30%
(X 2 ) Neutral 0 10% 10% 15%
Negative −1 20% 5% 0%
Each cell represents the probability of a joint outcome. For example, there’s a 5% probability of a
negative return (-10%) if analysts have positive views about the bond and its issuer. In other words,
there’s a 5% probability that the bond will decline in price with a positive rating. Similarly, there’s a
10% chance that the bond’s price will not change (and hence a zero return) given a neutral rating.
T he marginal distribution gives the distribution of a single variable in a joint distribution. In the case
of bivariate distribution, the marginal PMF of X 1 is computed by summing up the probabilities for X 1
across all the values in the support of X 2. T he resulting PMF of X 1 is denoted by f X1 (x 1) , i.e., the
marginal distribution of X 1.
f_X1(x₁) = ∑_{x₂ ∈ R(X₂)} f_{X1,X2}(x₁, x₂)

Similarly,

f_X2(x₂) = ∑_{x₁ ∈ R(X₁)} f_{X1,X2}(x₁, x₂)

Using the probability matrix we created above, we can come up with the marginal distributions of both X₁ and X₂:
Bond Return (X 1)
−10% 0% 10% f X2 (x 2)
Analyst Positive +1 5% 5% 30% 40%
(X 2 ) Neutral 0 10% 10% 15% 35%
Negative −1 20% 5% 0% 25%
f X1 (x 1 ) 35% 20% 45%
As you may have noticed, the marginal distribution satisfies the properties of a probability distribution:

∑_{x₁} f_X1(x₁) = 1

And

f_X1(x₁) ≥ 0

We can, in addition, use the marginal PMF to compute the marginal CDF. The marginal CDF is such that:

F_X1(x₁) = ∑_{t₁ ∈ R(X₁), t₁ ≤ x₁} f_X1(t₁)
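The marginal distributions can be reproduced directly from the probability matrix; the numpy sketch below is an illustrative check.

import numpy as np

# Joint PMF: rows = rating (+1, 0, -1), columns = bond return (-10%, 0%, 10%)
joint = np.array([
    [0.05, 0.05, 0.30],
    [0.10, 0.10, 0.15],
    [0.20, 0.05, 0.00],
])

print(joint.sum(axis=0))   # marginal of the return: [0.35, 0.20, 0.45]
print(joint.sum(axis=1))   # marginal of the rating: [0.40, 0.35, 0.25]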
Independence of Random Variables

Recall that two events A and B are independent if:

P(A ∩ B) = P(A)P(B)

This principle applies to bivariate random variables as well. If the components of a bivariate random variable are independent, the joint PMF is the product of the marginal PMFs:

f_{X1,X2}(x₁, x₂) = f_X1(x₁) f_X2(x₂)

Now let's use our earlier example on the return earned on a bond. If we assume that the two variables – return and rating – are independent, we can calculate the joint distribution by multiplying their marginal distributions. But are they really independent? Let's find out! We have already established the joint and the marginal distributions, as reproduced in the following table.
Bond Return (X 1)
−10% 0% 10% f X2 (x 2)
Analyst Positive +1 5% 5% 30% 40%
(X 2 ) Neutral 0 10% 10% 15% 35%
Negative −1 20% 5% 0% 25%
f X1 (x 1 ) 35% 20% 45%
So assuming that our two variables are independent, our joint distribution would be as follows:
Bond Return (X 1 )
−10% 0% 10%
Analyst Positive +1 14% 8% 18%
(X 2 ) Neutral 0 12.25% 7% 15.75%
Negative −1 8.75% 5% 11.25%
We obtain the table above by multiplying the marginal PMF of the bond return by the marginal PMF of the ratings. For example, the marginal probability that the bond return is 10% is 45% – the sum of the third column. The marginal probability of a positive rating is 40% – the sum of the first row. These two values, when multiplied, give us the joint probability in the upper right cell of the table (18%).
It is clear that the two variables are not independent, because multiplying their marginal PMFs does not reproduce the joint probabilities in the original probability matrix.

Conditional Distributions

Recall the definition of conditional probability:

P(A|B) = P(A ∩ B) / P(B)

This result can be applied to bivariate distributions. That is, the conditional distribution of X₁ given X₂ is defined as:

f_{X1|X2}(x₁|X₂ = x₂) = f_{X1,X2}(x₁, x₂) / f_X2(x₂)
From the result above, the conditional distribution is the joint distribution divided by the marginal distribution of the conditioning variable.
Example: Calculating the Conditional Distribution
Bond Return (X 1)
−10% 0% 10% f X2 (x 2)
Analyst Positive +1 5% 5% 30% 40%
(X 2 ) Neutral 0 10% 10% 15% 35%
Negative −1 20% 5% 0% 25%
f X1 (x 1 ) 35% 20% 45%
Suppose we want to find the distribution of bond returns conditional on a positive analyst rating. The conditional distribution is:

f_{X1|X2}(x₁|X₂ = 1) = f_{X1,X2}(x₁, X₂ = 1) / f_X2(x₂ = 1) = f_{X1,X2}(x₁, X₂ = 1) / 40%
Returns (X₁)                        −10%               0%                10%
f_{X1|X2}(x₁|X₂ = 1)        5%/40% = 12.5%    5%/40% = 12.5%    30%/40% = 75%
= P(X₁ = x₁|X₂ = 1)
What we have done is take the joint probabilities where there is a positive analyst rating and then divide these values by the marginal probability of a positive rating (40%) to produce the conditional distribution.

Note that the conditional PMF obeys the laws of probability, i.e., the conditional probabilities are non-negative and sum to 1.
Conditional distributions can also be computed for one variable while conditioning on more than one outcome of the other variable.

For example, assume that we need to compute the conditional distribution of the bond returns given that the analyst rating is non-negative. Therefore, our conditioning set is S = {+1, 0}:

X₂ ∈ {+1, 0}

The conditional PMF must sum across all outcomes in the conditioning set S = {+1, 0}:

f_{X1|X2}(x₁|x₂ ∈ S) = ∑_{x₂ ∈ S} f_{X1,X2}(x₁, x₂) / ∑_{x₂ ∈ S} f_X2(x₂)

The marginal probability that X₂ ∈ {+1, 0} is the sum of the marginal probabilities of these two outcomes: 40% + 35% = 75%.
Bond Return (X 1)
−10% 0% 10% f X2 (x 2)
Analyst Positive +1 5% 5% 30% 40%
(X 2 ) Neutral 0 10% 10% 15% 35%
Negative −1 20% 5% 0% 25%
f X1 (x 1 ) 35% 20% 45%
f_{X1|X2}(x₁|x₂ ∈ {+1, 0}) = { (5% + 10%)/75% = 20%,    x₁ = −10%
                             { (5% + 10%)/75% = 20%,    x₁ = 0%
                             { (30% + 15%)/75% = 60%,   x₁ = 10%
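Both conditional distributions can be computed mechanically from the joint matrix; the following numpy sketch (illustrative) reproduces the single-rating and non-negative-rating cases.

import numpy as np

# Joint PMF: rows = rating (+1, 0, -1), columns = return (-10%, 0%, 10%)
joint = np.array([
    [0.05, 0.05, 0.30],
    [0.10, 0.10, 0.15],
    [0.20, 0.05, 0.00],
])

# Conditional on a positive rating (first row)
print(joint[0] / joint[0].sum())            # [0.125, 0.125, 0.75]

# Conditional on a non-negative rating (first two rows)
sub = joint[:2].sum(axis=0)
print(sub / sub.sum())                      # [0.2, 0.2, 0.6]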
Rearranging the definition of the conditional distribution gives:

f_{X1,X2}(x₁, x₂) = f_{X1|X2}(x₁|X₂ = x₂) × f_X2(x₂)

or, equivalently,

f_{X1,X2}(x₁, x₂) = f_{X2|X1}(x₂|X₁ = x₁) × f_X1(x₁)

Also, if the components of the bivariate distribution are independent, then:

f_{X1|X2}(x₁|X₂ = x₂) = f_X1(x₁)

Applying the same argument to the second component, we get:

f_X2(x₂) = f_{X2|X1}(x₂|X₁ = x₁)
Expectations

The expectation of a function of a bivariate random variable is defined in the same way as for a univariate random variable. Consider the function g(X₁, X₂). The expectation is defined as:

E[g(X₁, X₂)] = ∑_{x₁} ∑_{x₂} g(x₁, x₂) f_{X1,X2}(x₁, x₂)

Here g(x₁, x₂) generally depends on both x₁ and x₂, although it may be a function of one component only. Just like in the univariate case, the expectation is a probability-weighted average.

Example: Expectation of a Function of a Bivariate Random Variable

Consider the following joint PMF:

                X₁
                1      2
X₂     3       10%    15%
       4       70%     5%
Solution
Using the formula:
Moments

Just like with univariate random variables, we use expectations to define the moments. The first moments are the means, E[X₁] and E[X₂]. The second moment involves the covariance between the components of the bivariate distribution:

Cov(X₁, X₂) = E[(X₁ − E[X₁])(X₂ − E[X₂])] = E[X₁X₂] − E[X₁]E[X₂]

Note that Cov(X₁, X₁) = Var(X₁), and that if X₁ and X₂ are independent, then E[X₁X₂] − E[X₁]E[X₂] = 0 and thus Cov(X₁, X₂) = 0.

Most often, the correlation between X₁ and X₂ is reported rather than the covariance. Now let Var(X₁) = σ₁², Var(X₂) = σ₂², and Cov(X₁, X₂) = σ₁₂; then the correlation is defined as:

Corr(X₁, X₂) = ρ_{X1X2} = Cov(X₁, X₂) / (√σ₁² √σ₂²) = σ₁₂ / (σ₁σ₂)
Correlation measures the strength of the linear relationship between the two random variables and always lies between −1 and +1. If X₂ is an exact linear function of X₁, say X₂ = α + βX₁, then:

Corr(X₁, X₂) = ρ_{X1X2} = βVar(X₁) / (√Var(X₁) √(β²Var(X₁))) = β/|β|

It is now evident that if β > 0, then ρ_{X1X2} = 1, and if β < 0, then ρ_{X1X2} = −1.
Now consider linear transformations of each component. Then,

Cov(a + bX₁, c + dX₂) = bd·Cov(X₁, X₂)

This implies that the scale factor in each random variable affects the covariance multiplicatively. Using this result, the correlation of the scaled variables is:

Corr(aX₁, bX₂) = ab·Cov(X₁, X₂) / (√(a²Var(X₁)) √(b²Var(X₂)))

= [ab / (|a||b|)] ρ_{X1X2}

Thus, rescaling leaves the magnitude of the correlation unchanged; only its sign can flip, if a and b have opposite signs. Shifts have no effect on either the covariance or the correlation.
Application of Correlation: Portfolio Variance and Hedging

The variances of the underlying securities and their respective correlations are the necessary ingredients for computing the variance of a portfolio. Consider two securities whose random returns are X_A and X_B, with means μ_A and μ_B and standard deviations σ_A and σ_B. Then, the variance of X_A plus X_B can be computed as follows:

σ²_{A+B} = σ²_A + σ²_B + 2ρ_{AB}σ_Aσ_B

If both securities have equal variance, σ²_A = σ²_B = σ², this becomes:

σ²_{A+B} = 2σ²(1 + ρ_{AB})

If, in addition, the correlation between the two securities is zero, the equation simplifies further, and we have the following relation for the standard deviation:

σ_{A+B} = √2 × σ
More generally, for the sum of n random variables,

Y = ∑_{i=1}^{n} X_i

the variance is:

σ²_Y = ∑_{i=1}^{n} ∑_{j=1}^{n} ρ_{ij}σ_iσ_j

If all the X_i's are uncorrelated and all variances are equal to σ², then we have:

σ_Y = √n × σ

This is what is called the square-root rule for the addition of uncorrelated variables.
Suppose that Y, X_A, and X_B are such that:

Y = aX_A + bX_B

Then σ²_Y = a²σ²_A + b²σ²_B + 2abρ_{AB}σ_Aσ_B.

A major consideration in hedging is correlation. Suppose we hold $1 of a security A and we hedge it with $h of another security B. The hedged portfolio, P, is a new random variable, and h is the hedge ratio. The variance of the hedged portfolio can easily be computed:

P = X_A + hX_B

σ²_P = σ²_A + h²σ²_B + 2hρ_{AB}σ_Aσ_B

The minimum-variance hedge ratio can be determined by taking the derivative with respect to h and setting it equal to zero:

dσ²_P/dh = 2hσ²_B + 2ρ_{AB}σ_Aσ_B = 0

⇒ h* = −ρ_{AB} σ_A/σ_B
The covariance matrix is a 2×2 matrix that displays the variances and covariance of the components of X:

Cov(X) = [ σ₁²   σ₁₂ ]
         [ σ₁₂   σ₂² ]
The Variance of Sums of Random Variables

For any two random variables,

Var(X₁ + X₂) = Var(X₁) + Var(X₂) + 2Cov(X₁, X₂)

and, more generally, Var(aX₁ + bX₂) = a²Var(X₁) + b²Var(X₂) + 2abCov(X₁, X₂).
Conditional Expectation
A conditional expectation is simply the mean calculated after a set of prior conditions has happened. It is the value that a random variable takes "on average" over an arbitrarily large number of occurrences, given the occurrence of a certain set of conditions. A conditional expectation uses the same expression as any other expectation and is a weighted average where the probabilities are the conditional probabilities.

In the bond return/rating example, we may wish to calculate the expected return on the bond given a positive analyst rating. The conditional expectation of the return is determined as follows:

E(X₁|X₂ = 1) = (−10%) × 12.5% + 0% × 12.5% + 10% × 75% = 6.25%
Conditional Variance

We can calculate the conditional variance by substituting the conditional expectation into the variance formula. We know that:

Var(X₁|X₂ = x₂) = E(X₁²|X₂ = x₂) − [E(X₁|X₂ = x₂)]²

Returning to our example above, the conditional variance Var(X₁|X₂ = 1) is obtained as follows. We already have:

E(X₁|X₂ = 1) = 0.0625

We need to calculate E(X₁²|X₂ = 1):

E(X₁²|X₂ = 1) = (−0.10)² × 12.5% + 0² × 12.5% + 0.10² × 75% = 0.00875

So that

Var(X₁|X₂ = 1) = 0.00875 − [0.0625]² = 0.004844 = 0.484%

If we wish to find the conditional standard deviation of the returns, we simply take the square root of the variance: √0.004844 = 6.96%.
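These conditional moments can be checked with a few lines of numpy (illustrative):

import numpy as np

returns = np.array([-0.10, 0.0, 0.10])
cond_pmf = np.array([0.125, 0.125, 0.75])     # returns distribution given a positive rating

mean = (returns * cond_pmf).sum()
var = (returns**2 * cond_pmf).sum() - mean**2
print(mean, var, var**0.5)    # 0.0625, ≈0.004844, ≈0.0696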
Continuous Random Variables

Before we continue, it is essential to note that continuous random variables make use of the same concepts and methodologies as discrete random variables. The main distinguishing factor is that summations are replaced by integrals.

The joint (bivariate) distribution function gives the probability that the pair (X₁, X₂) takes values in a given region:

P(a < X₁ < b, c < X₂ < d) = ∫_a^b ∫_c^d f_{X1,X2}(x₁, x₂) dx₂ dx₁

The joint PDF is always non-negative, and the double integration over the whole support yields a value of 1. That is:
f_{X1,X2}(x₁, x₂) ≥ 0

And

∫∫ f_{X1,X2}(x₁, x₂) dx₁ dx₂ = 1, where the double integral is taken over the entire support.
Example: Bivariate Continuous Distribution

Assume that the random variables X₁ and X₂ are jointly distributed as:

f_{X1,X2}(x₁, x₂) = k(x₁ + 3x₂), 0 < x₁ < 2, 0 < x₂ < 2

Calculate P(X₁ < 1, X₂ > 1).

Solution
We first need to calculate the value of k, using the fact that the density integrates to 1:

∫_0^2 ∫_0^2 k(x₁ + 3x₂) dx₁ dx₂ = ∫_0^2 k[(1/2)x₁² + 3x₁x₂]_0^2 dx₂ = 1

= ∫_0^2 k(2 + 6x₂) dx₂ = k[2x₂ + 3x₂²]_0^2 = 1

16k = 1 ⇒ k = 1/16

So,

f_{X1,X2}(x₁, x₂) = (1/16)(x₁ + 3x₂)

Therefore,

P(X₁ < 1, X₂ > 1) = ∫_0^1 ∫_1^2 (1/16)(x₁ + 3x₂) dx₂ dx₁ = 0.3125
The joint CDF is:

F_{X1,X2}(x₁, x₂) = ∫_{−∞}^{x₁} ∫_{−∞}^{x₂} f_{X1,X2}(t₁, t₂) dt₂ dt₁

Note that the lower bound of each integral can be adjusted to the lower end of the support. Using the example above, we could calculate F(1, 1) = P(X₁ < 1, X₂ < 1) in a similar way.
For continuous random variables, the marginal distribution is given by:

f_X1(x₁) = ∫_{−∞}^{∞} f_{X1,X2}(x₁, x₂) dx₂

Similarly,

f_X2(x₂) = ∫_{−∞}^{∞} f_{X1,X2}(x₁, x₂) dx₁

Note that if we want to find the marginal distribution of X₁, we integrate out X₂, and vice versa.
Using the example above,
$$f_{X_1,X_2}(x_1,x_2) = \frac{1}{16}(x_1+3x_2),\qquad 0 < x_1 < 2,\ 0 < x_2 < 2$$
We wish to find the marginal distribution of X₁. This implies that we need to integrate out X₂. So,
$$f_{X_1}(x_1) = \int_0^2 \frac{1}{16}(x_1+3x_2)\,dx_2 = \frac{1}{16}\left[x_1x_2 + \frac{3}{2}x_2^2\right]_0^2 = \frac{1}{16}[2x_1+6] = \frac{1}{8}(x_1+3)$$
Conditional Distributions
T he conditional distribution is analogously defined as that of discrete random variables. T hat is:
$$f_{X_1|X_2}(x_1|X_2 = x_2) = \frac{f_{X_1,X_2}(x_1,x_2)}{f_{X_2}(x_2)}$$
The conditional distributions are applied in fields of finance such as risk management. For instance, we may wish to compute the conditional distribution of interest rates, X₁, given the value taken by another economic variable, X₂.
A collection of random variables is independent and identically distributed (iid) if each random variable has the same probability distribution as the others and all are mutually independent.
Example: Consider repeated tosses of a fair coin.
The probability of a head vs. a tail in every throw is 50:50, so the coin is equally likely to land either way and stays fair; the distribution from which every throw is drawn is the same Bernoulli(0.5) distribution and stays the same, and thus each outcome is "identically distributed." In addition, the outcome of one throw does not affect any other throw, so the throws are independent.
iid variables are mostly applied in time series analysis.
Consider the iid variables generated by a normal distribution. T hey are typically defined as:
$$X_i \overset{iid}{\sim} N(\mu,\sigma^2)$$
The expectation of the sum of iid random variables is:
$$E\left(\sum_{i=1}^{n}X_i\right) = \sum_{i=1}^{n}E(X_i) = \sum_{i=1}^{n}\mu = n\mu$$
where E(Xᵢ) = μ.
T he result above is valid since the variables are independent and have similar moments. Maintaining
this line of thought, the variance of iid random variables is given by:
$$\text{Var}\left(\sum_{i=1}^{n}X_i\right) = \sum_{i=1}^{n}\text{Var}(X_i) + 2\sum_{j=1}^{n}\sum_{k=j+1}^{n}\text{Cov}(X_j,X_k) = \sum_{i=1}^{n}\sigma^2 + 2\sum_{j=1}^{n}\sum_{k=j+1}^{n}0 = n\sigma^2$$
The independence property is important because there is a difference between the variance of the sum of multiple random variables and the variance of a multiple of a single random variable. If X₁ has variance σ², then:
$$\text{Var}(2X_1) = 4\text{Var}(X_1) = 4\sigma^2$$
which is larger than Var(X₁ + X₂) = 2σ² for two iid variables.
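A small simulation can make this contrast concrete. The sketch below uses assumed values of μ and σ and simulated normal draws; it is illustrative only.

```python
import numpy as np

# Contrast Var(X1 + X2) for two iid variables with Var(2 * X1) for a single variable.
rng = np.random.default_rng(42)
mu, sigma, n_sims = 0.0, 2.0, 1_000_000

x1 = rng.normal(mu, sigma, n_sims)
x2 = rng.normal(mu, sigma, n_sims)

print(np.var(x1 + x2))   # approximately 2 * sigma^2 = 8
print(np.var(2 * x1))    # approximately 4 * sigma^2 = 16
```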
Practice Question
An insurance policy covers a property against damage. Let X be the portion of a claim representing damage to inventory, and let Y be the portion of the same claim representing damage to the rest of the property. The joint density of X and Y is f(x, y) = 6[1 − (x + y)] for x > 0, y > 0, x + y < 1.
What is the probability that the portion of a claim representing damage to the rest of the property is less than 0.3?
A. 0.657
B.0.450
C. 0.415
D. 0.752
The marginal density of Y is:
$$f_Y(y) = \int_0^{1-y} 6[1-(x+y)]\,dx = \left[6\left(x - \frac{x^2}{2} - xy\right)\right]_0^{1-y} = 6\left[(1-y) - \frac{(1-y)^2}{2} - y(1-y)\right]$$
At this point, we can factor out (1 − y) and simplify what remains in the square bracket:
$$6(1-y)\left[1 - \frac{1-y}{2} - y\right] = 6(1-y)\left[\frac{1-y}{2}\right] = 3(1-y)^2 = 3 - 6y + 3y^2$$
So,
$$P(Y < 0.3) = \int_0^{0.3}(3 - 6y + 3y^2)\,dy = 0.9 - 0.27 + 0.027 = 0.657$$
Reading 16: Sample Moments
After compl eti ng thi s readi ng, you shoul d be abl e to:
Estimate the mean, variance, and standard deviation using sample data.
Describe the bias of an estimator and explain what the bias measures.
Explain what is meant by the statement that the mean estimator is BLUE.
Describe the consistency of an estimator and explain the usefulness of this concept.
Explain how the Law of Large Numbers (LLN) and Central Limit Theorem (CLT) apply to the sample mean.
Estimate the mean of two random variables and apply the CLT.
Explain how coskewness and cokurtosis are related to skewness and kurtosis.
Sample Moments
Recall that moments are defined as the expected values that briefly describe the features of a
distribution. Sample moments are those that are utilized to approximate the unknown population moments.
Such moments include mean, variance, skewness, and kurtosis. We shall discuss each moment in
detail.
The population mean, denoted by μ, is estimated from the sample mean (X̄). The estimated mean is denoted by μ̂ and defined by:
$$\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i$$
Where X i is a random variable assumed to be independent and identically distributed so E(X i) = μ and
n is the number of observations.
Note that the mean estimator is a function of random variables, and thus it is a random variable.
Consequently, we can examine its properties as a random variable (its mean and variance) as follows:
$$E(\hat{\mu}) = E(\bar{X}) = E\left[\frac{1}{n}\sum_{i=1}^{n}X_i\right] = \frac{1}{n}\sum_{i=1}^{n}E(X_i) = \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{1}{n}\times n\mu = \mu$$
The above result is true since we have assumed that the Xᵢ's are iid. The mean estimator is therefore an unbiased estimator of the population mean. In general, the bias of an estimator θ̂ is defined as:
$$\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$$
where θ̂ is the estimator of the true population value θ. So, in the case of the population mean:
$$\text{Bias}(\hat{\mu}) = E(\hat{\mu}) - \mu = \mu - \mu = 0$$
Since the bias of the mean estimator is 0, it is an unbiased estimator of the population mean.
Using conventional features of a random variable, the variance of the mean estimator is calculated
as:
$$\text{Var}(\hat{\mu}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^{n}X_i\right) = \frac{1}{n^2}\left[\sum_{i=1}^{n}\text{Var}(X_i) + \text{Covariances}\right]$$
But we are assuming that the Xᵢ's are iid, and thus they are uncorrelated, implying that their covariances are zero:
$$\text{Var}(\hat{\mu}) = \frac{1}{n^2}\left[\sum_{i=1}^{n}\text{Var}(X_i)\right] = \frac{1}{n^2}\left[\sum_{i=1}^{n}\sigma^2\right] = \frac{1}{n^2}\times n\sigma^2 = \frac{\sigma^2}{n}$$
Thus,
$$\text{Var}(\hat{\mu}) = \frac{\sigma^2}{n}$$
Looking at the last formula, the variance of the mean estimator depends on the variance of the data (σ²) and the sample size n. Consequently, the variance of the mean estimator decreases as the number of observations (the sample size) increases. This implies that the larger the sample size, the closer the sample mean is likely to be to the population mean.
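The shrinking variance of the mean estimator can be seen in a short simulation; the μ and σ values below are arbitrary choices used only for illustration.

```python
import numpy as np

# Simulate many samples of size n and compare the variance of the sample mean to sigma^2 / n.
rng = np.random.default_rng(0)
mu, sigma = 5.0, 3.0

for n in (10, 100, 1000):
    sample_means = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
    print(n, round(sample_means.var(), 4), round(sigma**2 / n, 4))
```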
An experiment was done to find out the number of hours that candidates spend preparing for the
FRM part 1 exam. It was discovered that for a sample of 10 students, the following times were
spent:
318, 304, 317, 305, 309, 307, 316, 309, 315, 327
Calculate the sample mean of the study times.
Solution
We know that:
$$\bar{X} = \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}X_i$$
$$\Rightarrow \bar{X} = \frac{318+304+317+305+309+307+316+309+315+327}{10} = 312.7$$
As the sample size (the number of observations) increases, the sample mean tends to the population mean.
Estimation of Variance and Standard Deviation
The population variance σ² is estimated by:
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i-\hat{\mu})^2$$
Note that we are still assuming that the Xᵢ's are iid. Unlike the mean estimator, this variance estimator is biased:
$$\text{Bias}(\hat{\sigma}^2) = E(\hat{\sigma}^2) - \sigma^2 = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}$$
This implies that the bias decreases as the number of observations increases. Intuitively, the source of the bias is the variance of the mean estimator (σ²/n). Since the bias is known, we can construct an unbiased estimator by rescaling:
$$s^2 = \frac{n}{n-1}\hat{\sigma}^2 = \frac{n}{n-1}\times\frac{1}{n}\sum_{i=1}^{n}(X_i-\hat{\mu})^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\hat{\mu})^2$$
It can be shown that E(s²) = σ², and thus s² is an unbiased variance estimator. Maintaining this line of thought, there is a trade-off, since the variance of σ̂² is less than that of s². However, financial analysis involves large data sets, and thus either of these estimators can be used. Conventionally, when the number of observations is large (n ≥ 30), σ̂² is preferred.
The sample standard deviation is the square root of the sample variance. That is:
$$\hat{\sigma} = \sqrt{\hat{\sigma}^2}\qquad\text{or}\qquad s = \sqrt{s^2}$$
Note that the square root is a nonlinear function, and thus the standard deviation estimators are biased even when the corresponding variance estimator is unbiased.
Example: Calculating the Sample Variance Estimator (Unbiased)
Using the example as in calculating the sample mean, what is the sample variance?
Solution
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\hat{\mu})^2$$

Xᵢ : (Xᵢ − μ̂)²
318 : (318 − 312.7)² = 28.09
304 : 75.69
317 : 18.49
305 : 59.29
309 : 13.69
307 : 32.49
316 : 10.89
309 : 13.69
315 : 5.29
327 : 204.49
Total : 462.10

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\hat{\mu})^2 = \frac{462.10}{10-1} = 51.34$$
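The same calculation can be reproduced with NumPy, which exposes both the biased and the unbiased estimators through the ddof argument; this is only a cross-check of the worked example.

```python
import numpy as np

# The study-hours sample from the example.
hours = np.array([318, 304, 317, 305, 309, 307, 316, 309, 315, 327])

print(hours.mean())        # 312.7
print(hours.var(ddof=0))   # biased estimator, divides by n
print(hours.var(ddof=1))   # unbiased estimator s^2, divides by n - 1 (about 51.34)
print(hours.std(ddof=1))   # sample standard deviation s (about 7.17)
```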
II. They give a clue on the range of values that can be observed.
III. The units of the mean and the standard deviation are the same as those of the data, and thus they are easy to interpret.
Skewness
As we saw in chapter two, the skewness is a cubed standardized central moment given by:
Note that (X − μ)/σ is a standardized version of X with a mean of 0 and variance of 1.
$$\text{skew}(X) = \frac{E\left[(X-E(X))^3\right]}{\left(E\left[(X-E(X))^2\right]\right)^{3/2}} = \frac{\mu_3}{\sigma^3}$$
The skewness measures the asymmetry of the distribution (since the third power preserves the sign of the deviations). When the skewness is negative, there is a higher probability of observing large negative values than large positive values (the tail is on the left side of the distribution). Conversely, if the skewness is positive, there is a higher probability of observing large positive values than large negative values (the tail is on the right side of the distribution).
The estimator of the skewness replaces the expectations with their sample counterparts and is given by:
$$\widehat{\text{skew}}(X) = \frac{\hat{\mu}_3}{\hat{\sigma}^3}$$
We can estimate μ̂₃ as:
$$\hat{\mu}_3 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^3$$
Example: The following are summary statistics from the analysis of a sales company's income over the last 100 months:
$$n = 100,\qquad \sum_{i=1}^{n}(x_i-\hat{\mu})^2 = 674{,}759.90,\qquad \sum_{i=1}^{n}(x_i-\hat{\mu})^3 = -12{,}456.784$$
Estimate the skewness of the income distribution.
Solution
$$\widehat{\text{skew}}(X) = \frac{\hat{\mu}_3}{\hat{\sigma}^3} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right]^{3/2}} = \frac{\frac{1}{100}(-12{,}456.784)}{\left[\frac{1}{100}\times 674{,}759.90\right]^{3/2}} = -0.000225$$
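A short script reproduces this estimate directly from the two summary sums given above.

```python
# Skewness estimate from the summary statistics in the example.
n = 100
sum_sq = 674_759.90      # sum of (x_i - mu_hat)^2
sum_cu = -12_456.784     # sum of (x_i - mu_hat)^3

mu3_hat = sum_cu / n
sigma_hat = (sum_sq / n) ** 0.5
skew_hat = mu3_hat / sigma_hat**3
print(round(skew_hat, 6))    # approximately -0.000225
```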
Kurtosis
$$\text{Kurt}(X) = \frac{E\left[(X-E(X))^4\right]}{\left(E\left[(X-E(X))^2\right]\right)^{2}} = \frac{\mu_4}{\sigma^4}$$
The description of kurtosis is analogous to that of the skewness, except that the fourth power removes the sign of the deviations, so kurtosis measures the magnitude of deviations. The reference value for a normally distributed random variable is 3. A random variable with kurtosis exceeding 3 is termed heavy-tailed or fat-tailed.
The estimator of the kurtosis likewise replaces the expectations with their sample counterparts:
$$\widehat{\text{Kurt}}(X) = \frac{\hat{\mu}_4}{\hat{\sigma}^4},\qquad \hat{\mu}_4 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^4$$
We say that the mean estimator is the Best Linear Unbiased Estimator (BLUE) of the population mean if:
I. It has the lowest variance of any linear unbiased estimator (LUE); and
II. It is an unbiased estimator of the population mean (as shown earlier).
A linear estimator of the mean is a weighted sum of the data and can be written as:
$$\hat{\mu} = \sum_{i=1}^{n}\omega_iX_i$$
where ωᵢ is independent of Xᵢ. In the case of the sample mean estimator, ωᵢ = 1/n.
BLUE puts an estimator as the best by having the smallest variance among all linear and unbiased
estimators. However, there are other superior estimators, such as Maximum Likelihood Estimators
(MLE).
The Behavior of Mean in Large Sample Sizes
Recall that the mean estimator is unbiased, and its variance takes a simple form. Moreover, if the data used are iid and normally distributed, then the estimator is also normally distributed. However, when the data are not normal, it is difficult to define the exact distribution of the mean for a finite number of observations.
To overcome this, we use the behavior of the mean in large sample sizes (that is, as n → ∞) to approximate the distribution of the mean in finite sample sizes. We shall explain the behavior of the
mean estimator using the Law of Large Numbers (LLN) and the Central Limit T heorem (CLT ).
The law of large numbers (Kolmogorov's Strong Law of Large Numbers) for iid data states that if the Xᵢ's are a sequence of iid random variables with finite mean μ, then:
$$\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^{n}X_i \xrightarrow{a.s.} \mu$$
An estimator is said to be consistent if the LLN applies to it; consistency requires that the estimator becomes arbitrarily close to the true parameter as the sample size grows. Moreover, under the LLN, the sample variance is also consistent. That is, the LLN implies that
$$\hat{\sigma}^2 \xrightarrow{a.s.} \sigma^2$$
The Central Limit Theorem (CLT) states that if X₁, X₂, …, Xₙ is a sequence of iid random variables with a finite mean μ and a finite non-zero variance σ², then the distribution of $\frac{\hat{\mu}-\mu}{\sigma/\sqrt{n}}$ tends to a standard normal distribution as n → ∞.
Put simply,
$$\frac{\hat{\mu}-\mu}{\sigma/\sqrt{n}} \to N(0,1)$$
Note that μ̂ = X̄ = the sample mean.
Note that CLT extends LLN and provides a way of approximating the distribution of the sample mean
estimator. CLT seems to be appropriate since it does not require the distribution of random variables
used.
Since the CLT is asymptotic, we can also use the unstandardized form, so that approximately:
$$\hat{\mu} \sim N\left(\mu, \frac{\sigma^2}{n}\right),\qquad Z = \frac{\hat{\mu}-\mu}{\sigma/\sqrt{n}}$$
The sample size n required for the normal approximation to be accurate depends on the shape of the population (the distribution of the Xᵢ's), e.g., its skewness.
Example: Applying CLT
A sales expert believes that the number of sales per day for a particular company has a mean of 40 and a standard deviation of 12. He surveyed 50 working days. What is the probability that the mean number of sales per day is less than 35?
Solution
By the CLT,
$$\hat{\mu} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
We need P(μ̂ < 35):
$$P(\hat{\mu} < 35) = P\left[Z < \frac{35-40}{12/\sqrt{50}}\right] = P(Z < -2.946) = 1 - P(Z < 2.946) = 0.00161$$
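The probability can be confirmed with SciPy's normal distribution; this is only a numerical cross-check of the example.

```python
from scipy.stats import norm

# CLT example: mean 40, standard deviation 12, n = 50 working days.
mu, sigma, n = 40, 12, 50
se = sigma / n**0.5

print(round(norm.cdf(35, loc=mu, scale=se), 5))   # P(mu_hat < 35) is about 0.00161
```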
Median
Median is a central tendency measure of distribution, also called the 50% quantile, which divides the
distribution in half ( 50% of observations lie on either side of the median value).
When the sample size is odd, the value in position (n + 1)/2 of the sorted list is used to estimate the median:
$$\text{Med}(x) = x_{\frac{n+1}{2}}$$
If the number of observations is even, the median is estimated as the average of the two central values:
$$\text{Med}(x) = \frac{1}{2}\left[x_{\frac{n}{2}} + x_{\frac{n}{2}+1}\right]$$
Example: Consider a sorted sample of six ages, the five smallest of which are 25, 34, 43, 50, and 51. Estimate the median age.
Solution
T he sample size is 6 (even), so the median is given by:
$$\text{Med}(\text{Age}) = \frac{1}{2}\left[x_{\frac{6}{2}} + x_{\frac{6}{2}+1}\right] = \frac{1}{2}(x_3 + x_4) = \frac{1}{2}(43+50) = 46.5$$
The median is used when the exact midpoint of the distribution is desired, or when there are extreme values (outliers) that would distort the mean.
Other Quantiles
For other quantiles, such as the 25% and 75% quantiles, the estimation is analogous to that of the median. For instance, a θ-quantile is estimated using position nθ in the sorted list. If nθ is not an integer, we take the average of the values immediately below and above position nθ.
So, in our example above, the 25% quantile (θ = 0.25) corresponds to position 6 × 0.25 = 1.5. This implies that we need to average the 1st and 2nd values:
$$\hat{q}_{25} = \frac{1}{2}(25+34) = 29.5$$
T he interquartile range (IQR) is defined as the difference between the 75% and 25% quartiles. T hat
is:
$$\widehat{IQR} = \hat{q}_{75} - \hat{q}_{25}$$
IQR is a measure of dispersion and thus can be used as an alternative to the standard deviation
If we use the example above, the 75% quantile is 6×0.75=4.5. So, we need to average the 4th and 5th
values:
$$\hat{q}_{75} = \frac{1}{2}(50+51) = 50.5$$
$$\widehat{IQR} = 50.5 - 29.5 = 21$$
I. T he units of the quantiles are the same as those of the data used hence they are easy to
interpret.
II. T hey are robust to outliers of the data. T he median and the IQR are unaffected by the
outliers.
We can extend the definition of moments from the univariate to multivariate random variables. T he
mean is unaffected by this because it is just the combination of the means of the two univariate
sample means.
However, if we extend the variance, we need to estimate the covariance between each pair of variables in addition to the variance of each data series used. Moreover, we can also define kurtosis and skewness measures for pairs of variables (cokurtosis and coskewness), as discussed below.
Covariance
In covariance, we focus on the relationship between the deviations of two variables rather than the deviations of a single variable.
The sample covariance estimator is analogous to the population definition. The sample covariance estimator is
given by:
$$\hat{\sigma}_{XY} = \frac{1}{n}\sum_{i=1}^{n}(X_i-\hat{\mu}_X)(Y_i-\hat{\mu}_Y)$$
where μ̂_X and μ̂_Y are the sample means of X and Y, respectively.
The sample covariance estimator is biased toward zero, but we can remove the bias by using n − 1 in place of n in the denominator.
Correlation
Correlation measures the strength of the linear relationship between two random variables and is estimated by dividing the sample covariance by the product of the sample standard deviation estimators of each random variable. It is defined as:
$$\hat{\rho}_{XY} = \frac{\hat{\sigma}_{XY}}{\sqrt{\hat{\sigma}_X^2}\sqrt{\hat{\sigma}_Y^2}} = \frac{\hat{\sigma}_{XY}}{\hat{\sigma}_X\hat{\sigma}_Y}$$
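A minimal sketch of these estimators is shown below; the return series x and y are made-up values used only to illustrate the formulas.

```python
import numpy as np

# Illustrative return series (hypothetical values).
x = np.array([0.02, -0.01, 0.03, 0.00, 0.015])
y = np.array([0.01, -0.02, 0.025, 0.005, 0.01])

n = len(x)
cov_biased = ((x - x.mean()) * (y - y.mean())).sum() / n     # divides by n
cov_unbiased = np.cov(x, y, ddof=1)[0, 1]                    # divides by n - 1
corr = cov_biased / (x.std(ddof=0) * y.std(ddof=0))          # the 1/n factors cancel

print(cov_biased, cov_unbiased, corr)
print(np.corrcoef(x, y)[0, 1])   # should match corr
```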
We estimate the mean of two random variables the same way we estimate that of a single variable.
T hat is:
$$\hat{\mu}_x = \frac{1}{n}\sum_{i=1}^{n}x_i$$
and
$$\hat{\mu}_y = \frac{1}{n}\sum_{i=1}^{n}y_i$$
Assuming both of the random variables are iid, we can apply CLT in each estimator. However, if we
consider the joint behavior (as a bivariate statistic), the CLT stacks the two mean estimators into a 2x1 vector:
$$\hat{\mu} = \begin{bmatrix}\hat{\mu}_x\\ \hat{\mu}_y\end{bmatrix}$$
This vector is asymptotically normally distributed as long as the random variable Z = [X, Y] is iid. The CLT on this vector involves the 2x2 covariance matrix:
$$\begin{bmatrix}\sigma_X^2 & \sigma_{XY}\\ \sigma_{XY} & \sigma_Y^2\end{bmatrix}$$
Note that in a covariance matrix, the diagonal displays the variances of the individual random variables, and the off-diagonal entries are the covariances between the pair of random variables. So, the CLT for bivariate iid data
is given by:
$$\sqrt{n}\begin{bmatrix}\hat{\mu}_x-\mu_x\\ \hat{\mu}_y-\mu_y\end{bmatrix} \to N\left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}\sigma_X^2 & \sigma_{XY}\\ \sigma_{XY} & \sigma_Y^2\end{bmatrix}\right)$$
Equivalently, without scaling the difference, the vector of sample means is approximately normally distributed as:
$$\begin{bmatrix}\hat{\mu}_x\\ \hat{\mu}_y\end{bmatrix} \to N\left(\begin{bmatrix}\mu_x\\ \mu_y\end{bmatrix}, \begin{bmatrix}\frac{\sigma_X^2}{n} & \frac{\sigma_{XY}}{n}\\ \frac{\sigma_{XY}}{n} & \frac{\sigma_Y^2}{n}\end{bmatrix}\right)$$
The annualized estimates of the means, variances, covariance, and correlation of the monthly returns of a stock trade (T) and government bonds (G) over 350 months are as shown below:
μ̂_T = 11.9, σ²_T = 335.6, μ̂_G = 6.80, σ²_G = 26.7, σ_TG = 14.0, ρ_TG = 0.1434
We need to compare the volatility, interpret the correlation coefficient, and apply bivariate CLT.
Solution
Looking at the output, it is evident that the return from the stock trade is more volatile than the government bond return since it has a higher variance. The correlation between the two return series is weakly positive (0.1434). Applying the bivariate CLT:
$$\sqrt{n}\begin{bmatrix}\hat{\mu}_x-\mu_x\\ \hat{\mu}_y-\mu_y\end{bmatrix} \to N\left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}335.6 & 14.0\\ 14.0 & 26.7\end{bmatrix}\right)$$
But the mean estimators have a limiting distribution (which is assumed to be normally distributed).
So,
$$\begin{bmatrix}\hat{\mu}_x\\ \hat{\mu}_y\end{bmatrix} \to N\left(\begin{bmatrix}\mu_x\\ \mu_y\end{bmatrix}, \begin{bmatrix}0.9589 & 0.04\\ 0.04 & 0.0763\end{bmatrix}\right)$$
Note the new covariance matrix is equivalent to the previous covariance divided by the sample size
n=350.
In the bivariate CLT, the correlation in the data carries over to the correlation between the two sample means and should therefore be taken into account when making joint inferences about them.
Coskewness and Cokurtosis are an extension of the univariate skewness and kurtosis.
Coskewness
T hese measures both capture the likelihood of the data taking a large directional value whenever the
other variable is large in magnitude. When there is no sensitivity to the direction of one variable to
the magnitude of the other, the two coskewnesses are 0. For example, the coskewness in a bivariate
normal distribution is always 0, even when the correlation is different from 0. Note that the univariate skewness is the special case in which all arguments refer to the same variable.
The coskewness is estimated by using the estimation analogy, that is, replacing the expectations with their sample counterparts:
$$\widehat{\text{skew}}(X,X,Y) = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu}_X)^2(y_i-\hat{\mu}_Y)}{\hat{\sigma}_X^2\hat{\sigma}_Y}$$
$$\widehat{\text{skew}}(X,Y,Y) = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu}_X)(y_i-\hat{\mu}_Y)^2}{\hat{\sigma}_X\hat{\sigma}_Y^2}$$
Cokurtosis
The reference value of a normally distributed random variable is 3. A random variable with kurtosis exceeding 3 is termed heavy-tailed or fat-tailed. However, comparing the cokurtosis to that of the normal is not easy since the cokurtosis of the bivariate normal depends on the correlation.
For a bivariate normal distribution, the cokurtosis equals 1 when the random variables are uncorrelated and increases as the magnitude of the correlation increases.
Practice Question
An analyst records a company's monthly profits for 100 months and computes the following summary statistics:
$$\sum_{i=1}^{100}x_i = 3{,}353 \qquad\text{and}\qquad \sum_{i=1}^{100}x_i^2 = 844{,}536$$
What is the sample mean and standard deviation of the monthly profits?
Solution
T he correct answer is A.
$$\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i \Rightarrow \bar{X} = \frac{1}{100}\times 3{,}353 = 33.53$$
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\hat{\mu})^2$$
Note that:
$$(X_i-\hat{\mu})^2 = X_i^2 - 2X_i\hat{\mu} + \hat{\mu}^2$$
So that:
$$\sum_{i=1}^{n}(X_i-\hat{\mu})^2 = \sum_{i=1}^{n}X_i^2 - 2\hat{\mu}\sum_{i=1}^{n}X_i + n\hat{\mu}^2$$
Since $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}X_i \Rightarrow \sum_{i=1}^{n}X_i = n\hat{\mu}$, we have:
$$\sum_{i=1}^{n}(X_i-\hat{\mu})^2 = \sum_{i=1}^{n}X_i^2 - 2\hat{\mu}\cdot n\hat{\mu} + n\hat{\mu}^2 = \sum_{i=1}^{n}X_i^2 - n\hat{\mu}^2$$
Thus:
$$s^2 = \frac{1}{n-1}\left\{\sum_{i=1}^{n}X_i^2 - n\hat{\mu}^2\right\} = \frac{1}{99}\left(844{,}536 - 100\times 33.53^2\right) = 7{,}395.0496$$
$$s = \sqrt{7{,}395.0496} = 85.99$$
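The solution can be reproduced in a few lines from the two sums; this is only a cross-check of the algebra above.

```python
# Sample mean and standard deviation from the two summary sums.
n = 100
sum_x = 3_353
sum_x2 = 844_536

mean = sum_x / n
s2 = (sum_x2 - n * mean**2) / (n - 1)
print(mean, round(s2, 4), round(s2**0.5, 2))   # 33.53, 7395.0496, 85.99
```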
Reading 17: Hypothesis Testing
After compl eti ng thi s readi ng, you shoul d be abl e to:
Construct and apply confidence intervals for one-sided and two-sided hypothesis tests, and interpret the results of hypothesis tests with a specific level of confidence.
Differentiate between a one-sided and a two-sided test and identify when to use each test.
Explain the difference between Type I and Type II errors and how these relate to the size and power of a test.
Identify the steps to test a hypothesis about the difference between two population means.
Explain the problem of multiple testing and how it can bias results.
Hypothesis testing is defined as a process of determining whether a hypothesis is in line with the sample data. It tests whether a claim about a population parameter is supported by the observed data.
Hypothesis testing starts by stating the null hypothesis and the alternative hypothesis. T he null
hypothesis is an assumption of the population parameter. On the other hand, the alternative
hypothesis states the parameter values (critical values) at which the null hypothesis is rejected. T he
critical values are determined by the distribution of the test statistic (when the null hypothesis is
true) and the size of the test (which gives the size at which we reject the null hypothesis).
The main components of a hypothesis test are:
I. The null hypothesis.
II. The alternative hypothesis.
III. The test statistic.
IV. The size (significance level) of the test.
V. The critical value.
As stated earlier, the first stage of the hypothesis test is the statement of the null hypothesis. T he
null hypothesis is the statement concerning the population parameter values. It brings out the notion of the presumed or default position.
The null hypothesis, denoted as H₀, represents the current state of knowledge about the population parameter that is the subject of the test. In other words, it represents the "status quo."
For example, the U.S Food and Drug Administration may walk into a cooking oil manufacturing plant
intending to confirm that each 1 kg oil package has, say, 0.15% cholesterol and not more. The null hypothesis would then be that the cholesterol content per package is at most 0.15%.
A test would then be carried out to confirm or reject the null hypothesis.
H₀: μ = μ₀ (two-sided test), or
H₀: μ ≤ μ₀ (one-sided test)
where μ₀ is the hypothesized value of the population mean.
The alternative hypothesis, denoted H₁, is a contradiction of the null hypothesis. It specifies the values of the population parameter for which the null hypothesis is rejected. Thus, rejecting H₀ makes H₁ valid. We accept the alternative hypothesis when the null hypothesis is rejected.
Using our FDA example above, the alternative hypothesis would be:
H₁: μ ≠ μ₀, or
H₁: μ > μ₀
where μ₀ is the hypothesized value.
Note that we have stated the alternative hypothesis, which contradicted the above statement of the
null hypothesis.
A test statistic is a standardized value computed from sample information when testing hypotheses. It compares the given data with what we would expect under the null hypothesis. Thus, it is a major input when deciding whether to reject the null hypothesis.
We use the test statistic to gauge the degree of agreement between the sample data and the null hypothesis. Analysts use the following general formula when calculating a test statistic:
Test statistic = (Sample statistic − Hypothesized value) / (Standard error of the sample statistic)
The test statistic is a random variable that changes from one sample to another. Test statistics assume a variety of distributions. We shall focus on normally distributed test statistics because they are widely used in hypotheses concerning means, regression coefficients, and other econometric models.
We shall consider the hypothesis test on the mean. Consider a null hypothesis H₀: μ = μ₀. Assume that the data used are iid and asymptotically normally distributed so that:
$$\sqrt{n}(\hat{\mu}-\mu) \sim N(0,\sigma^2)$$
where σ² is the variance of the sequence of iid random variables used. The asymptotic distribution justifies the test statistic:
$$T = \frac{\hat{\mu}-\mu_0}{\sqrt{\hat{\sigma}^2/n}} \sim N(0,1)$$
Note this is consistent with our initial definition of the test statistic.
The following table gives a brief outline of the various test statistics used regularly, based on the distributional assumptions made about the data.
We can subdivide the set of values that can be taken by the test statistic into two regions: One is
called the non-rejection region, which is consistent with H 0 and the rejection region (critical region),
which is inconsistent with H 0. If the test statistic has a value found within the critical region, we
reject H 0.
Just like with any other statistic, the distribution of the test statistic must be specified entirely under
H 0 when H 0 is true.
The Size of the Hypothesis Test and the Type I and Type II
Errors
While using sample statistics to draw conclusions about the parameters of the population as a whole,
there is always the possibility that the sample collected does not accurately represent the
population. Consequently, statistical tests carried out using such sample data may yield incorrect
results that may lead to erroneous rejection (or lack thereof) of the null hypothesis. We have two
types of error:
Type I Error
Type I error occurs when we reject a true null hypothesis. For example, a type I error would occur if we concluded that a population mean differs from its hypothesized value when, in fact, it does not.
Type II Error
T ype II error occurs when we fail to reject a false null hypothesis. In such a scenario, the test
provides insufficient evidence to reject the null hypothesis when it’s false.
T he level of significance denoted by α represents the probability of making a type I error, i.e.,
rejecting the null hypothesis when, in fact, it’s true. α is the direct opposite of β, which is taken to
be the probability of making a type II error within the bounds of statistical testing. T he ideal but
practically impossible statistical test would be one that simultaneously minimizes α and β. We
use α to determine critical values that subdivide the distribution into the rejection and the non-
rejection regions.
T he decision to reject or not to reject the null hypothesis is based on the distribution assumed by the
test statistic. T his means if the variable involved follows a normal distribution, we use the level of
significance (α) of the test to come up with critical values that lie along with the standard normal
distribution.
T he decision rule is a result of combining the critical value (denoted by Cα ), the alternative
hypothesis, and the test statistic (T ). T he decision rule is to whether to reject the null hypothesis in
For the t-test, the decision rule is dependent on the alternative hypothesis. When testing the two-
side alternative, the decision is to reject the null hypothesis if |T | > Cα . T hat is, reject the null
hypothesis if the absolute value of the test statistic is greater than the critical value. When testing
on the one-sided, the decision rule, reject the null hypothesis if T < Cα when using a one-sided
lower alternative and if T > Cα when using a one-sided upper alternative. When a null hypothesis is
rejected at an α significance level, we say that the result is significant at α significance level.
Note that prior to decision making, one must decide whether the test should be one-tailed or two-
tailed. T he following is a brief summary of the decision rules under different scenarios:
Left One-tailed Test
H₁: parameter < X
Decision rule: Reject H 0 if the test statistic is less than the critical value. Otherwise, do not
rej ect H 0.
Right One-tailed Test
H 1: parameter > X
Decision rule: Reject H 0 if the test statistic is greater than the critical value. Otherwise, do not
rej ect H 0.
Two-tailed Test
Decision rule: Reject H₀ if the test statistic is greater than the upper critical value or less than the lower critical value.
Consider α = 5% and a one-sided test. The rejection regions are as follows:
The first case is the rejection region when the alternative is one-sided lower: the null hypothesis is rejected when the test statistic falls below the (negative) critical value. The second case is the rejection region when the alternative is one-sided upper: the null hypothesis is rejected when the test statistic exceeds the (positive) critical value.
Consider the returns from a portfolio X = (x₁, x₂, …, xₙ) from 1980 through 2020. The approximated mean of the returns is 7.50%, with a standard deviation of 17%. We wish to determine whether the mean return is significantly different from zero at the 5% significance level.
Solution
H₀: μ = 0 vs. H₁: μ ≠ 0
n = 40, μ̂ = 0.075, μ₀ = 0, σ̂ = 0.17
So,
$$T = \frac{0.075-0}{0.17/\sqrt{40}} \approx 2.79$$
At the significance level, α = 5%,the critical value is ±1.96. Since this is a two-sided test, the
rejection regions are ( −∞, −1.96 ) and ( 1.96,∞ ) as shown in the diagram below:
Since the test statistic (2.79) is higher than the critical value, we reject the null hypothesis in favor of the alternative: the mean return is significantly different from zero.
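The test statistic and its two-sided p-value can be reproduced as follows; the inputs are those of the example, and the use of the normal distribution mirrors the z-test described above.

```python
from scipy.stats import norm

# Two-sided z-test: mu_hat = 7.5%, sigma_hat = 17%, n = 40.
mu_hat, mu0, sigma_hat, n = 0.075, 0.0, 0.17, 40

t_stat = (mu_hat - mu0) / (sigma_hat / n**0.5)
p_value = 2 * (1 - norm.cdf(abs(t_stat)))
print(round(t_stat, 2), round(p_value, 4))   # 2.79, about 0.0053
```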
T he example above is an example of a Z-test (which is mostly emphasized in this chapter and
immediately follows from the central limit theorem (CLT )). However, we can use the Student’s t-
distribution if the random variables are iid and normally distributed and that the sample size is small
(n<30).
In that case, the test statistic is:
$$T = \frac{\hat{\mu}-\mu_0}{\sqrt{s^2/n}} \sim t_{n-1}$$
where s² is the unbiased sample variance.
The power of a test is the complement of the probability of a type II error. While the level of significance gives us the probability of rejecting the null hypothesis when it is, in fact, true, the power of a test gives the probability of correctly rejecting the null hypothesis when it is false. Denoting the probability of a type II error by β:
Power of a Test = 1 − β
The power of a test therefore measures the likelihood that a false null hypothesis is rejected. It is influenced by the sample size, the distance between the hypothesized parameter value and the true value, and the size of the test.
Confidence Intervals
A confidence interval can be defined as the range of parameter values within which the true parameter can be found at a given confidence level. For instance, a 95% confidence interval constitutes the set of parameter values for which the null hypothesis cannot be rejected when using a 5% test size. Therefore, a 1 − α
confidence interval contains the values that cannot be disregarded at a test size of α.
It is important to note that the confidence interval depends on the alternative hypothesis statement
H0 : μ = 0
H1 : μ ≠ 0
$$\left[\hat{\mu} - C_\alpha\times\frac{\hat{\sigma}}{\sqrt{n}},\ \hat{\mu} + C_\alpha\times\frac{\hat{\sigma}}{\sqrt{n}}\right]$$
Consider the returns from a portfolio X = (x₁, x₂, …, xₙ) from 1980 through 2020. The approximated mean of the returns is 7.50%, with a standard deviation of 17%. Calculate the 95% confidence interval for the mean return.
$$\left[\hat{\mu} - C_\alpha\times\frac{\hat{\sigma}}{\sqrt{n}},\ \hat{\mu} + C_\alpha\times\frac{\hat{\sigma}}{\sqrt{n}}\right] = \left[0.0750 - 1.96\times\frac{0.17}{\sqrt{40}},\ 0.0750 + 1.96\times\frac{0.17}{\sqrt{40}}\right] = [0.0223,\ 0.1277]$$
Thus, the confidence interval implies that any hypothesized mean value between 2.23% and 12.77% cannot be rejected at the 5% significance level.
One-Sided Alternative
For the one-sided alternative, the confidence interval is given by either:
$$\left(-\infty,\ \hat{\mu} + C_\alpha\times\frac{\hat{\sigma}}{\sqrt{n}}\right)\qquad\text{or}\qquad \left(\hat{\mu} - C_\alpha\times\frac{\hat{\sigma}}{\sqrt{n}},\ \infty\right)$$
For example, for H₀: μ ≤ 0 vs. H₁: μ > 0:
$$\left(-\infty,\ \hat{\mu} + C_\alpha\times\frac{\hat{\sigma}}{\sqrt{n}}\right) = \left(-\infty,\ 0.0750 + 1.645\times\frac{0.17}{\sqrt{40}}\right) = (-\infty,\ 0.1192)$$
Similarly, for H₀: μ > 0 vs. H₁: μ ≤ 0:
$$\left(-\infty,\ \hat{\mu} + C_\alpha\times\frac{\hat{\sigma}}{\sqrt{n}}\right) = \left(-\infty,\ 0.0750 + 1.645\times\frac{0.17}{\sqrt{40}}\right) = (-\infty,\ 0.1192)$$
Note that the critical value decreases from 1.96 to 1.645 because the test is now one-sided rather than two-sided.
The p-Value
When carrying out a statistical test with a fixed value of the significance level (α), we merely
compare the observed test statistic with some critical value. For example, we might “reject H 0 using
a 5% test” or “reject H 0 at 1% significance level”. T he problem with this ‘classical’ approach is that
it does not give us the details about the strength of the evi dence against the null hypothesis.
Determination of the p-value gives statisticians a more informative approach to hypothesis testing.
T he p-value is the lowest level at which we can reject H 0. T his means that the strength of the
evidence against H 0 increases as the p-value becomes smaller. T he test-statistic depends on the
alternative.
For one-tailed tests, the p-value is given by the probability that lies below the calculated test statistic
for left-tailed tests. Similarly, the likelihood that lies above the test statistic in right-tailed tests gives
the p-value.
Denoting the test statistic by T, the p-value for H 1 : μ > 0 is given by:
P (Z > |T |) = 1 − P (Z ≤ |T |) = 1 − Φ(|T |)
P (Z ≤ |T |) = Φ(|T |)
where Z is a standard normal random variable; the absolute value of T, |T|, ensures that the right-tail probability is obtained.
If the test is two-tailed, this value is given by the sum of the probabilities in the two tails. We start
by determining the probability lying below the negative value of the test statistic. T hen, we add this
to the probability lying above the positive value of the test statistic. That is, the p-value for the two-tailed test is:
$$2\left[1-\Phi(|T|)\right]$$
Let θ represent the probability of obtaining a head when a coin is tossed. Suppose we toss the coin
200 times, and heads come up in 85 of the trials. Test the following hypothesis at 5% level of
significance.
H 0: θ = 0.5
H 1: θ < 0.5
Solution
Our p-value is given by P(X ≤ 85), where X ∼ Binomial(200, 0.5) with mean np = 200 × 0.5 = 100 and variance np(1 − p) = 200 × 0.5 × 0.5 = 50, assuming H₀ is true. Applying the Central Limit Theorem (treating the binomial distribution as approximately normal) and using a continuity correction:
$$P\left[Z < \frac{85.5-100}{\sqrt{50}}\right] = P(Z < -2.05) = 1 - 0.97982 = 0.02018$$
Since the probability is less than 0.05, H 0 is extremely unlikely, and we actually have strong
evidence against H 0 that favors H 1. T hus, clearly expressing this result, we could say:
"There is very strong evidence against the hypothesis that the coin is fair. We, therefore, conclude that the coin is biased against heads."
Remember, failure to reject H₀ does not mean it is true. It means there is insufficient evidence to reject it.
A CFA candidate conducts a statistical test about the mean value of a random variable X.
H 0: μ = μ0 vs. H 1: μ ≠ μ0
She obtains a test statistic of 2.2. Given a 5% significance level, determine and interpret the p-value
Solution
The p-value for the two-tailed test is 2[1 − Φ(|T|)] = 2[1 − Φ(2.2)] = 2(1 − 0.9861) = 0.0278 = 2.78%.
Interpretation
T he p-value (2.78%) is less than the level of significance (5%). T herefore, we have sufficient
evidence to reject H 0. In fact, the evidence is so strong that we would also reject H 0 at significance
levels of 4% and 3%. However, at significance levels of 2% or 1%, we would not reject H₀ since the p-value exceeds those levels.
It’s common for analysts to be interested in establishing whether there exists a significant difference
between the means of two different populations. For instance, they might want to know whether the
average returns for two subsidiaries of a given company exhibit significant differences.
To do this, stack the two variables into a bivariate random variable:
$$W_i = [X_i, Y_i]$$
Assume that the components Xᵢ and Yᵢ are both iid and are correlated. That is:
Corr(X i , Y i ) ≠ 0
H 0 : μX = μY
H 1 : μX ≠ μY
In other words, we want to test whether the constituent random variables have equal means. Note that the hypotheses can be restated as:
H₀: μ_X − μ_Y = 0
H₁: μ_X − μ_Y ≠ 0
Define the difference Zᵢ = Xᵢ − Yᵢ.
T herefore, considering the above random variable, if the null hypothesis is correct then,
H 0: μZ =0 vs. H 1: μZ ≠ 0.
$$T = \frac{\hat{\mu}_z}{\sqrt{\hat{\sigma}_z^2/n}} \sim N(0,1)$$
Note that the test statistic formula accounts for the correlation between Xᵢ and Yᵢ. It is easy to see
that:
V (Zi ) = V (X i) + V (Y i) − 2COV (X i , Y i )
Which can be denoted as:
$$\hat{\sigma}_z^2 = \hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY},\qquad \hat{\mu}_z = \hat{\mu}_X - \hat{\mu}_Y$$
so that:
$$T = \frac{\hat{\mu}_X - \hat{\mu}_Y}{\sqrt{\dfrac{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}{n}}}$$
T his formula indicates that correlation plays a crucial role in determining the magnitude of the test
statistic.
Another special case of the test-statistic is when X i, and Y i are iid and independent. T he test statistic
is given by:
$$T = \frac{\hat{\mu}_X - \hat{\mu}_Y}{\sqrt{\dfrac{\hat{\sigma}_X^2}{n_X} + \dfrac{\hat{\sigma}_Y^2}{n_Y}}}$$
An investment analyst wants to test whether there is a significant difference between the means of
the two portfolios at a 95% level. T he first portfolio X consists of 30 government-issued bonds and
has a mean of 10% and a standard deviation of 2%. T he second portfolio Y consists of 30 private
bonds with a mean of 14% and a standard deviation of 3%. T he correlation between the two
portfolios is 0.7. State the null hypothesis and determine whether it is rejected or otherwise.
Solution
H 0: μX - μY=0 vs. H 1: μX - μY ≠ 0.
Note that this is a two-tailed test. At 95% level, the test size is α=5% and thus the critical value
Cα = ±1.96.
Recall that:
$$\text{Cov}(X,Y) = \sigma_{XY} = \rho_{XY}\sigma_X\sigma_Y$$
$$T = \frac{\hat{\mu}_X - \hat{\mu}_Y}{\sqrt{\dfrac{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\rho_{XY}\hat{\sigma}_X\hat{\sigma}_Y}{n}}} = \frac{0.10-0.14}{\sqrt{\dfrac{0.02^2 + 0.03^2 - 2\times 0.7\times 0.02\times 0.03}{30}}} = -10.215$$
The test statistic is far below −1.96. Therefore, the null hypothesis is rejected at the 95% confidence level: the two portfolios have significantly different mean returns.
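A few lines of Python reproduce the test statistic; the inputs are those given in the example.

```python
# Correlated difference-in-means test from the example.
mu_x, mu_y = 0.10, 0.14
sigma_x, sigma_y, rho = 0.02, 0.03, 0.7
n = 30

var_diff = sigma_x**2 + sigma_y**2 - 2 * rho * sigma_x * sigma_y
t_stat = (mu_x - mu_y) / (var_diff / n) ** 0.5
print(round(t_stat, 3))   # about -10.215
```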
Multiple testing occurs when several hypothesis tests are conducted on the same data set.
T he reuse of data results in spurious results and unreliable conclusions that do not hold up to
scrutiny. T he fundamental problem with multiple testing is that the test size (i.e., the probability that
a true null is rejected) is only applicable for a single test. However, repeated testing creates test
sizes that are much larger than the assumed size of alpha and therefore increases the probability of a
T ype I error.
Some control methods have been developed to combat multiple testing. T hese include Bonferroni
correction, the False Discovery Rate (FDR), and Familywise Error Rate (FWER).
Practice Question
An experiment was done to find out the number of hours that candidates spend preparing
for the FRM part 1 exam. It was discovered that for a sample of 10 students, the following times (in hours) were recorded:
318, 304, 317, 305, 309, 307, 316, 309, 315, 327
If the sample mean and standard deviation are 312.7 and 7.2, respectively, calculate a
symmetrical 95% confidence interval for the mean time a candidate spends preparing for
A. [307.5, 317.9]
B. [307.6, 317.8]
C. [307.9, 317.5]
D. [307.3, 318.2]
T he correct answer is A.
To find the value of t₁₋α/₂, we use the t-table with (10 − 1 =) 9 degrees of freedom and the cumulative probability (1 − 0.025 =) 0.975, which gives us 2.262.
$$\bar{X} \pm t_{1-\alpha/2}\times\frac{s}{\sqrt{n}} = 312.7 \pm 2.262\times\frac{7.2}{\sqrt{10}} = [307.5,\ 317.9]$$
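The interval can be reproduced with SciPy's t distribution; this is only a cross-check of the table-based solution.

```python
from scipy.stats import t

# 95% t-interval from the practice question.
n, x_bar, s = 10, 312.7, 7.2
t_crit = t.ppf(0.975, df=n - 1)          # about 2.262
half_width = t_crit * s / n**0.5

print(round(t_crit, 3), [round(x_bar - half_width, 1), round(x_bar + half_width, 1)])
# 2.262, [307.5, 317.9]
```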
Reading 18: Linear Regression
After compl eti ng thi s readi ng, you shoul d be abl e to:
Describe the models that can be estimated using linear regression and differentiate them from those that cannot.
Construct, apply, and interpret hypothesis tests and confidence intervals for a single regression coefficient.
Describe the relationship between a t-statistic, its p-value, and a confidence interval.
Linear regression is a statistical tool for modeling the relationship between two random variables.
T his chapter will concentrate on the linear regression model (regression model with one
explanatory variable).
As stated earlier, linear regression determines the relationship between the dependent variable Y and
the independent (explanatory) variable X. T he linear regression with a single explanatory variable is
given by:
Y = β0 + βX + ϵ
Where:
Y is the dependent (explained) variable;
X is the independent (explanatory) variable;
β₀ is the intercept and β is the slope coefficient; and
ϵ is the error term (sometimes referred to as a shock). It represents the portion of Y that cannot be explained by X.
T he assumption is that the expectation of the error is 0. T hat is, E(ϵ) = 0 and thus,
⇒ E[Y ] = β0 + βE[X]
Note that β₀ is the value of Y when X = 0. However, there are cases when the explanatory variable can never equal 0. In such cases, β₀ is interpreted as the value that ensures the fitted regression line passes through the point of means, i.e., Ȳ = β̂₀ + β̂X̄, where Ȳ and X̄ are the means of the yᵢ and xᵢ, respectively.
The independent variable can be continuous, discrete, or even a function of other variables. Whatever its form, linear regression imposes the following restrictions:
1. The relationship between Y and the explanatory variables (X₁, X₂, …, Xₙ) must be linear in the unknown coefficients.
2. The error term must be additive, although the variance of the error term may depend on the explanatory variables.
3. All the explanatory variables must be observable.
Consider, for example, the model:
$$Y = \beta_0 + \beta X^k + \epsilon$$
This model cannot be estimated using linear regression due to the presence of the unknown parameter k, which violates the first restriction (it is a nonlinear regression function). This kind of model requires nonlinear estimation methods.
Transformations
When a linear regression model does not satisfy the linearity conditions stated above, we can
reverse the violation of the restrictions by transforming the model. Consider the model:
Y = β0 X βϵ
Where ϵ is the positive error term (shock). Clearly, this model violates the condition of the
restriction since X is raised to an unknown parameter β, and the error term is not additive.
However, we can make this model linear by taking the natural logarithm of both sides of the equation, so that:
$$\ln(Y) = \ln(\beta_0) + \beta\ln(X) + \ln(\epsilon)$$
Defining Ỹ = ln(Y), β̃₀ = ln(β₀), X̃ = ln(X), and ϵ̃ = ln(ϵ), the last equation can be written as:
$$\tilde{Y} = \tilde{\beta}_0 + \beta\tilde{X} + \tilde{\epsilon}$$
Clearly, this equation satisfies the three linearity conditions. It is worth noting that when we are
interpreting the parameters of the transformed model, we measure the change of the transformed
For instance, ln(Y) = ln β₀ + β ln X + ln ϵ implies that β represents the change in ln Y corresponding to a one-unit change in ln X; that is, β is approximately the percentage change in Y associated with a 1% change in X.
There are cases where the explanatory variables are binary (0 or 1), representing the occurrence of an event. These binary variables are called dummies. For instance, consider the model Yᵢ = β₀ + βDᵢ + ϵᵢ, where β is the coefficient on Dᵢ.
T he equation will change to the one written below under the condition that Di = 0:
Y i = β0 + ϵ i
When Di = 1:
Y i = β0 + β + ϵ i
This implies that when Dᵢ = 1, E(Yᵢ|Dᵢ = 1) = β₀ + β. The population means of Y when Dᵢ = 1 and when Dᵢ = 0 therefore differ by β:
(β₀ + β) − β₀ = β
T he Ordinary Least Squares (OLS) is a method of estimating the linear regression parameters by
minimizing the sum of squared deviations. T he regression coefficients chosen by the OLS estimators
are such that the observed data and the regression line are as close as possible.
Consider a regression equation:
Y = β0 + βX + ϵ
Assume that xᵢ and yᵢ are linearly related; then the parameters can be estimated by minimizing the sum of squared residuals:
$$\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}x_i\right)^2 = \sum_{i=1}^{n}\hat{\epsilon}_i^2$$
where β̂₀ and β̂ are the parameter estimators (the intercept and the slope, respectively) that minimize the squared deviations between the line β̂₀ + β̂xᵢ and yᵢ, so that:
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}\bar{X}$$
and
$$\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})}{\sum_{i=1}^{n}(x_i-\bar{X})^2}$$
After the estimation of the parameters, the estimated regression line is given by:
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}x_i$$
The residuals are:
$$\hat{\epsilon}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}x_i$$
The error variance is estimated by:
$$s^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat{\epsilon}_i^2 = \frac{n}{n-2}\hat{\sigma}_Y^2\left(1-\hat{\rho}_{XY}^2\right)$$
Note that n-2 implies that two parameters are estimated and that s2 is an unbiased estimator of σ2.
Moreover, it is assumed that the mean of the residuals is zero and uncorrelated with the explanatory
variables X i.
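A minimal sketch of these estimators applied to made-up data is shown below; the x and y values are illustrative only, and np.polyfit is used purely as a cross-check.

```python
import numpy as np

# Hypothetical data for illustrating the closed-form OLS formulas.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

beta_hat = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
beta0_hat = y.mean() - beta_hat * x.mean()

residuals = y - (beta0_hat + beta_hat * x)
s2 = (residuals**2).sum() / (len(x) - 2)     # unbiased estimator of the error variance

print(beta0_hat, beta_hat, s2)
print(np.polyfit(x, y, 1))                   # cross-check: returns [slope, intercept]
```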
$$\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})}{\sum_{i=1}^{n}(x_i-\bar{X})^2}$$
Note that the numerator is proportional to the covariance between X and Y, and the denominator to the variance of X. Dividing both the numerator and the denominator by n:
$$\hat{\beta} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})}{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{X})^2} = \frac{\hat{\sigma}_{XY}}{\hat{\sigma}_X^2}$$
Since
$$\text{Corr}(X,Y) = \rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y} \Rightarrow \sigma_{XY} = \rho_{XY}\sigma_X\sigma_Y$$
we have
$$\hat{\beta} = \frac{\hat{\rho}_{XY}\hat{\sigma}_X\hat{\sigma}_Y}{\hat{\sigma}_X^2} = \frac{\hat{\rho}_{XY}\hat{\sigma}_Y}{\hat{\sigma}_X}$$
An investment analyst wants to explain the return from a portfolio (Y) using the prevailing interest rates (X) over the past 30 years. The mean interest rate is 7%, and the sample variances and covariance of the portfolio return and the interest rate are summarized in the matrix:
$$\begin{bmatrix}\hat{\sigma}_Y^2 & \hat{\sigma}_{XY}\\ \hat{\sigma}_{XY} & \hat{\sigma}_X^2\end{bmatrix} = \begin{bmatrix}1600 & 500\\ 500 & 338\end{bmatrix}$$
Assume that the analyst wants to estimate the linear regression equation:
Y^i = β^ 0 + β^X i
Estimate the parameters and, thus, the model equation.
Solution
Now,
$$\hat{\beta} = \frac{\hat{\sigma}_{XY}}{\hat{\sigma}_X^2} = \frac{500}{338} = 1.4793$$
Assumptions of OLS
1. The conditional expectation of the error term given the independent variables Xᵢ is 0. More precisely, E(ϵᵢ|Xᵢ) = 0. This also implies that the independent variables and the error term are uncorrelated.
2. Both the dependent and independent variables are i.i.d. T his assumption concerns the
drawing of the sample. According to this assumption, (X i ,Y i ),i = 1, … , n are i.i.d in case a
simple random sampling is applied when drawing observations from a single large population.
Despite the i.i.d. assumption being reasonable for many data collection schemes, not all sampling schemes produce i.i.d. observations (time-series data being a common exception).
3. Large outliers are unlikely. In this assumption, observations whose values of X i and/or Y i fall
far outside the usual range of the data, are unlikely. T hese observations are known as
significant outliers. Results of OLS regression can be misleading due to large outliers.
4. The variance of the independent variable is strictly positive. That is, σ²_X > 0. This is required so that the slope coefficient can be identified.
5. The variance of the error term is independent of the explanatory variables, V(ϵᵢ|X) = σ² < ∞, and the variance of all the error terms (shocks) is equal. This is the homoskedasticity assumption.
Under these assumptions, the OLS parameter estimators are unbiased. That is, E(β̂₀) = β₀ and E(β̂) = β. Moreover, the estimators become increasingly precise as the sample size increases.
Lastly, the assumptions ensure that the estimated parameters are asymptotically normally distributed. The slope estimator satisfies:
$$\sqrt{n}\left(\hat{\beta}-\beta\right) \sim N\left(0, \frac{\sigma^2}{\sigma_X^2}\right)$$
where σ² is the variance of the error term and σ²_X is the variance of X. Similarly, for the intercept:
$$\sqrt{n}\left(\hat{\beta}_0-\beta_0\right) \sim N\left(0, \frac{\sigma^2\left(\mu_X^2+\sigma_X^2\right)}{\sigma_X^2}\right)$$
According to the central limit theorem (CLT), β̂ can therefore be treated as a normal random variable with mean equal to the true value β and variance σ²/(nσ²_X). That is:
$$\hat{\beta} \sim N\left(\beta, \frac{\sigma^2}{n\sigma_X^2}\right)$$
However, we cannot use these population quantities directly in hypothesis testing. We need to use their estimators, replacing σ² with s² and σ²_X with:
$$\hat{\sigma}_X^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{X})^2 \quad\Rightarrow\quad n\hat{\sigma}_X^2 = \sum_{i=1}^{n}(x_i-\bar{X})^2$$
The variance of the slope estimator is then:
$$\hat{\sigma}_\beta^2 = \frac{s^2}{\sum_{i=1}^{n}(x_i-\bar{X})^2} = \frac{s^2}{n\hat{\sigma}_X^2}$$
The standard error estimate of β, denoted SEE_β, is the square root of this variance:
$$SEE_\beta = \sqrt{\frac{s^2}{n\hat{\sigma}_X^2}} = \frac{s}{\sqrt{n}\,\hat{\sigma}_X}$$
Similarly, for the intercept:
$$\hat{\sigma}_{\beta_0}^2 = \frac{s^2\left(\hat{\mu}_X^2 + \hat{\sigma}_X^2\right)}{n\hat{\sigma}_X^2}$$
When the OLS assumptions are met, the parameters are assumed to be normally distributed when
large samples are used. T herefore, we can run a hypothesis tests on the parameters just like the
random variable.
parameters. For instance, we may want to test the significance of a single regression coefficient.
Whenever a statistical test is being performed, the following procedure is generally considered ideal:
1. State the null and the alternative hypotheses;
2. Select the appropriate test statistic, i.e., what's being tested, e.g., the population mean or a regression coefficient;
3. Specify the level of significance;
4. Clearly state the decision rule to guide you in choosing whether to reject or not to reject the null hypothesis.
The test statistic for a single regression coefficient is:
$$T = \frac{\hat{\beta}-\beta_{H_0}}{SEE_\beta}$$
This statistic has an asymptotic normal distribution and is compared with a critical value C_t; the null hypothesis is rejected if
$$|T| > C_t$$
For instance, if we assume a 5% significance level in this case, then the critical value is 1.96.
We can also evaluate p-values. For one-tailed tests, the p-value is given by the probability that lies below the calculated test statistic for left-tailed tests. Similarly, the probability that lies above the test statistic gives the p-value for right-tailed tests.
Denoting the test statistic by T, the p-value for H 1 : β^ > 0 is given by:
P (Z > |T |) = 1 − P (Z ≤ |T |) = 1 − Φ(|T |)
P (Z ≤ |T |) = Φ(|T |)
where Z is a standard normal random variable; the absolute value of T, |T|, ensures that the right-tail probability is obtained.
If the test is two-tailed, this value is given by the sum of the probabilities in the two tails. We start by
determining the probability lying below the negative value of the test statistic. T hen, we add this to
the probability lying above the positive value of the test statistic. T hat is the p-value for the two-
We can also construct confidence intervals (discussed in detail in the previous chapter). Recall that a confidence interval can be defined as the range of parameter values within which the true parameter can be found at a given confidence level. For instance, a 95% confidence interval constitutes the set of parameter values for which the null hypothesis cannot be rejected when using a 5% test size.
For instance, if we are performing a two-tailed hypothesis test, then the confidence interval is given by:
$$\left[\hat{\beta} - C_\alpha\times SEE_\beta,\ \hat{\beta} + C_\alpha\times SEE_\beta\right]$$
An investment analyst wants to explain the return from a portfolio (Y) using the prevailing interest rates (X) over the past 30 years. The mean interest rate is 7%, and the sample variances and covariance of the portfolio return and the interest rate are:
$$\begin{bmatrix}\hat{\sigma}_Y^2 & \hat{\sigma}_{XY}\\ \hat{\sigma}_{XY} & \hat{\sigma}_X^2\end{bmatrix} = \begin{bmatrix}1600 & 500\\ 500 & 338\end{bmatrix}$$
Assume that the analyst wants to estimate the linear regression equation:
Y^i = β^ 0 + β^X i
Test whether the slope coefficient is equal to zero and construct a 95% confidence interval for the slope coefficient.
Solution
$$T = \frac{\hat{\beta}-\beta_{H_0}}{SEE_\beta}$$
$$\hat{\beta} = \frac{\hat{\sigma}_{XY}}{\hat{\sigma}_X^2} = \frac{500}{338} = 1.4793$$
$$SEE_{\hat{\beta}} = \frac{s}{\sqrt{n}\,\hat{\sigma}_X}$$
But
$$s^2 = \frac{n}{n-2}\hat{\sigma}_Y^2\left(1-\hat{\rho}_{XY}^2\right) = \frac{30}{30-2}\times 1600\times\left(1-0.680^2\right) \approx 921.8$$
(Note that we have used $\hat{\rho}_{XY} = \frac{\hat{\sigma}_{XY}}{\hat{\sigma}_X\hat{\sigma}_Y} = \frac{500}{\sqrt{338}\sqrt{1600}} \approx 0.680$.)
Therefore, s ≈ 30.36, so
$$SEE_{\hat{\beta}} = \frac{30.36}{\sqrt{30}\sqrt{338}} \approx 0.3015$$
$$T = \frac{\hat{\beta}-\beta_{H_0}}{SEE_\beta} = \frac{1.4793-0}{0.3015} \approx 4.91$$
For the two-tailed test, the critical value is 1.96, and since the t-statistic here is greater than the critical value, we reject the null hypothesis that the slope is zero. The 95% confidence interval for the slope is:
$$\left[\hat{\beta} - C_t\times SEE_\beta,\ \hat{\beta} + C_t\times SEE_\beta\right] = \left[1.4793 - 1.96\times 0.3015,\ 1.4793 + 1.96\times 0.3015\right] = [0.888,\ 2.070]$$
Practice Question 1
Assume that you have carried out a regression analysis (to determine whether the slope
is different from 0) and found out that the slope β^ = 1.156. Moreover, you have
constructed a 95% confidence interval of [0.550, 1.762]. What is the likely value of your
test statistic?
A. 4.356
B. 3.7387
C. 0.7845
D. 0.6545
Solution
T he Correct answer is B
This is a two-tailed test since we are asked to determine whether the slope is different from zero. Working backward from the 95% confidence interval:
$$1.156 - 1.96\times SEE_\beta = 0.550 \Rightarrow SEE_\beta = \frac{1.156-0.550}{1.96} = 0.3092$$
$$T = \frac{\hat{\beta}-\beta_{H_0}}{SEE_\beta} = \frac{1.156-0}{0.3092} = 3.7387$$
Practice Question 2
A trader develops a simple linear regression model to predict the price of a stock. T he
estimated slope coefficient for the regression is 0.60, the standard error is equal to 0.25,
and the sample has 30 observations. Determine whether the estimated slope coefficient is significantly different from zero at the 5% significance level by stating the hypotheses, computing the test statistic, and stating the decision rule.
Solution
T he correct answer is B.
H₀: β₁ = 0
H₁: β₁ ≠ 0
$$T = \frac{\hat{\beta}_1-\beta_{H_0}}{S_{\hat{\beta}_1}} = \frac{0.60-0}{0.25} = 2.4$$
Step 4: State the decision rule. With 30 − 2 = 28 degrees of freedom, the two-tailed 5% critical value is approximately 2.05. Since 2.4 > 2.05, we reject H₀ and conclude that the estimated slope coefficient is significantly different from zero.
Reading 19: Regression with Multiple Explanatory Variables
After compl eti ng thi s readi ng, you shoul d be abl e to:
Interpret goodness of fit measures for single and multiple regressions, including R 2 and
adjusted R 2.
Construct, apply, and interpret joint hypothesis tests and confidence intervals for multiple
coefficients in regression.
Unlike simple linear regression, multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y. In other words, it permits us to evaluate the effect of each explanatory variable while holding the others constant:
Y i = β0 + β1 X 1i + β2X 2 i + … + βkX k i + εi ∀i = 1, 2, … n
Intuitively, the multiple regression model has k slope coefficients and k+1 regression coefficients.
Normally, statistical software (such as Excel and R) are used to estimate the multiple regression
model.
The slope coefficient βⱼ measures the change in the dependent variable Y when the independent variable Xⱼ changes by one unit while holding the other independent variables constant. The interpretation of the multiple regression coefficients is therefore different from that of a linear regression with one independent variable: the effect of one variable is explored while keeping the other variables constant.
For instance, a linear regression model with one independent variable could be estimated as Ŷ = 0.6 + 0.85X₁. In this case, the slope coefficient is 0.85, which implies that a one-unit increase in X₁ is associated with a 0.85-unit increase in Y.
Now, assume that we had the second independent variable to the regression so that the regression
equation is Y^ = 0.6 + 0.85X 1 + 0.65X 2. A unit increase in X 1 will not result in a 0.85 unit increase in
Y unless X 1 and X 2 are uncorrelated. T herefore, we will interpret 0.85 as one unit of X 1 leads to
0.85 units increase in the dependent variable Y, while keeping X2 constant.
Although the multiple regression parameters can be estimated directly, doing so involves a large amount of algebra and the use of matrices. However, we can build intuition by considering a model with two explanatory variables:
Y i = β0 + β1X 1 i + β2 X 2i + εi
The first step is to regress X₁ on X₂ and obtain the residuals of X₁ᵢ:
$$\epsilon_{X_1 i} = X_{1i} - \hat{\alpha}_0 - \hat{\alpha}_1 X_{2i}$$
where α̂₀ and α̂₁ are the OLS estimators from the regression of X₁ on X₂.
The next step is to regress Y on X₂ to get the residuals of Yᵢ:
$$\epsilon_{Y i} = Y_i - \hat{\gamma}_0 - \hat{\gamma}_1 X_{2i}$$
where γ̂₀ and γ̂₁ are the OLS estimators from the regression of Y on X₂. The final step is to regress the residuals of Y on the residuals of X₁:
$$\epsilon_{Y i} = \hat{\beta}_1\epsilon_{X_1 i} + \epsilon_i$$
Note that this final regression does not include a constant because the expected values of ϵ_{Yi} and ϵ_{X₁i} are both 0. Moreover, the purpose of the first and second regressions is to exclude the effect of X₂ from both Y and X₁ by splitting each variable into a fitted value, which is correlated with X₂, and a residual, which is uncorrelated with X₂. The last regression therefore relates the components of Y and X₁ that are uncorrelated with X₂, and its slope β̂₁ is the multiple regression coefficient on X₁. By repeating this process, we can estimate a k-parameter model such as:
Y i = β0 + β1X 1i + β2 X 2i + … + βk X ki + εi∀i = 1, 2, … n
Most of the time, this is done using a statistical package such as Excel and R.
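The sketch below illustrates the partialling-out procedure just described on simulated data (the coefficients and sample size are assumed for illustration) and checks that it matches a direct multiple regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)            # x1 is correlated with x2
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

def ols_resid(dep, indep):
    """Residuals from regressing dep on a constant and indep."""
    X = np.column_stack([np.ones(len(indep)), indep])
    coef, *_ = np.linalg.lstsq(X, dep, rcond=None)
    return dep - X @ coef

# Steps 1 and 2: remove the effect of x2 from x1 and y; step 3: regress residual on residual.
e_x1 = ols_resid(x1, x2)
e_y = ols_resid(y, x2)
beta1_fwl = (e_x1 @ e_y) / (e_x1 @ e_x1)

# Direct multiple regression for comparison.
X_full = np.column_stack([np.ones(n), x1, x2])
beta_direct, *_ = np.linalg.lstsq(X_full, y, rcond=None)

print(round(beta1_fwl, 4), round(beta_direct[1], 4))   # the two estimates coincide
```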
Suppose that we have n observations of the dependent variable (Y) and the independent variables
Y i = β0 + β1X 1i + β2 X 2i + … + βk X ki + εi ∀i = 1, 2, … n
For us to make valid inferences from the above equation, we need to make the classical multiple linear regression assumptions:
1. The relationship between Y and the explanatory variables X₁, X₂, . . . , X_k is linear.
2. There is no exact linear relationship among the explanatory variables X₁, X₂, . . . , X_k.
3. The error term has a zero conditional mean: E(ϵ|X₁, X₂, . . . , X_k) = 0
4. T he variance of the error term is equal for all observations. T hat is, E(ϵ 2i ) = σ2ϵ , i = 1, 2,… , n
6. The error term ϵ is normally distributed. This allows us to test hypotheses about the regression coefficients.
7. There are no large outliers, so that E(X⁴ⱼᵢ) < ∞ for all j = 1, 2, . . . , k.
The assumptions are almost the same as those of linear regression with one independent variable, except that the second assumption is tailored to ensure that there are no exact linear relationships between the explanatory variables.
The goodness of fit of a regression is measured using the coefficient of determination (R²) and the adjusted R².
Recall that the standard error of estimate gauges the precision of forecasts made by a regression model. However, it does not tell us how well the independent variable explains the variation in the dependent variable.
The coefficient of determination measures the proportion of the total variation in the dependent variable that is explained by the independent variable. We can calculate the coefficient of determination in two ways:
The coefficient of determination can be computed by squaring the correlation coefficient (r) between the dependent and the independent variables:
R 2 = r2
Recall that:
$$r = \frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y}$$
where σ_X is the standard deviation of X and σ_Y is the standard deviation of Y.
However, this method only accommodates regression with one independent variable.
Example: The correlation coefficient between the money supply growth rate (dependent, Y) and inflation rates (independent, X) is 0.7565. The standard deviation of the dependent (explained) variable is 0.050, and that of the independent variable is 0.02. A regression analysis over ten years was conducted on this data. Calculate and interpret the coefficient of determination.
Solution
We know that:
$$r = \frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y} = \frac{0.0007565}{0.05\times 0.02} = 0.7565$$
$$R^2 = r^2 = 0.7565^2 = 0.5723$$
So, in this regression, the inflation rate explains roughly 57.23% of the variation in the money supply growth rate.
Decomposing the Variation of the Dependent Variables
If the regression relationship is not used, then our best estimate for any observation of the dependent variable would be its mean. Alternatively, instead of using the mean as an estimate of Yᵢ, we can predict an estimate using the regression equation. The resulting prediction is denoted Ŷᵢ, so that:
$$Y_i = \hat{Y}_i + \hat{\epsilon}_i$$
Now, if we subtract the mean of the dependent variable from both sides, square, and sum over all observations:
$$\sum_{i=1}^{n}(Y_i-\bar{Y})^2 = \sum_{i=1}^{n}\left(\hat{Y}_i-\bar{Y}+\hat{\epsilon}_i\right)^2 = \sum_{i=1}^{n}(\hat{Y}_i-\bar{Y})^2 + 2\sum_{i=1}^{n}\hat{\epsilon}_i(\hat{Y}_i-\bar{Y}) + \sum_{i=1}^{n}\hat{\epsilon}_i^2$$
Note that:
$$2\sum_{i=1}^{n}\hat{\epsilon}_i(\hat{Y}_i-\bar{Y}) = 0\qquad\text{and}\qquad \hat{\epsilon}_i^2 = (Y_i-\hat{Y}_i)^2$$
Therefore,
$$\sum_{i=1}^{n}(Y_i-\bar{Y})^2 = \sum_{i=1}^{n}(\hat{Y}_i-\bar{Y})^2 + \sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2$$
If the regression equation is useful for predicting Yᵢ, then the residual term will be small relative to the total variation. Now let:
Explained Sum of Squares: $ESS = \sum_{i=1}^{n}(\hat{Y}_i-\bar{Y})^2$
Residual Sum of Squares: $RSS = \sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2$
Total Sum of Squares: $TSS = \sum_{i=1}^{n}(Y_i-\bar{Y})^2$
Then:
$$TSS = ESS + RSS \quad\Rightarrow\quad 1 = \frac{ESS}{TSS} + \frac{RSS}{TSS} \quad\Rightarrow\quad \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
Now, recall that the coefficient of determination is the fraction of the overall variation in the dependent variable that is explained by the regression:
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
If a model does not explain any of the observed variation in the data, then it has an R² of 0. On the other hand, if the model perfectly describes the data, then it has an R² of 1. Other values lie between 0 and 1 and are always positive. For instance, in the example above, the R² is approximately 0.57, so the inflation rate explains a little over half of the variation in the money supply growth rate.
Limitations of R2
1. As the number of explanatory variables increases, the value of R 2 always increases even if
the new variable is almost completely irrelevant to the dependent variable. For instance, if a
regression model with one explanatory variable is modified to have two explanatory
variables, the new R 2 is greater or equal to that of a single explanatory model. In the case
where the added variable's coefficient is exactly 0, adding the variable will not increase R²; in that case, the RSS remains the same.
2. The coefficient of determination R² cannot be compared across regressions with different dependent variables.
3. There is no standard value of R² that is considered good, because its value depends on the nature of the data being analyzed.
The Adjusted R 2
Denoted by R̄², the adjusted R² measures the goodness of fit without automatically increasing when an independent variable is added to the model; that is, it is adjusted for the degrees of freedom.
Note that R̄² is produced by statistical software. The relationship between R² and R̄² is given by:
$$\bar{R}^2 = 1 - \frac{RSS/(n-k-1)}{TSS/(n-1)} = 1 - \left(\frac{n-1}{n-k-1}\right)\left(1-R^2\right)$$
Where:
n = number of observations
k = number of independent (explanatory) variables
The adjusted R-squared can increase when a variable is added, but only if the new variable improves the model by more than would be expected by chance; if the added variable improves the model by less than that, the adjusted R-squared falls. When k ≥ 1, R² > R̄², since adding a new independent variable lowers R̄² whenever it produces only a small increase in R². This also explains why R̄² can be negative even though R² is always non-negative.
A point to note is that when we use R̄² to compare regression models, the dependent variable must be defined in the same way in each model, and the models must be estimated on the same sample.
The following points should be kept in mind when interpreting R² or R̄²:

An added variable is not necessarily statistically significant just because R² or R̄² has increased.
It is not always true that the regressors are a true cause of the dependent variable just because there is a high R² or R̄².
It is not necessarily true that there is no omitted variable bias just because we have a high R² or R̄².
It is not necessarily true that we have the most appropriate set of regressors just because we have a high R² or R̄².
It is not necessarily true that we have an inappropriate set of regressors just because we have a low R² or R̄².
R̄² does not automatically indicate that the regression is well specified in the sense of including the right set of variables, since a high R̄² could reflect other features of the data. Moreover, R̄² can be negative if the regression model produces an extremely poor fit.
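To make the calculation concrete, the following is a minimal Python sketch (not part of the original reading) that computes R² and R̄² for an OLS regression; the data, function name, and parameter values are illustrative assumptions.

```python
# A minimal sketch of computing R-squared and adjusted R-squared with numpy.
import numpy as np

def r_squared_stats(y, X):
    """y: (n,) dependent variable; X: (n, k) matrix of independent variables."""
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)  # OLS estimates
    resid = y - Xc @ beta
    rss = np.sum(resid ** 2)                        # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)               # total sum of squares
    r2 = 1 - rss / tss
    adj_r2 = 1 - (n - 1) / (n - k - 1) * (1 - r2)
    return r2, adj_r2

# Illustrative data: 48 observations on 4 regressors
rng = np.random.default_rng(0)
X = rng.normal(size=(48, 4))
y = 0.5 + X @ np.array([0.2, 0.0, 0.1, 0.0]) + rng.normal(scale=0.5, size=48)
print(r_squared_stats(y, X))
```

Adding an irrelevant column to X in this sketch will raise (or leave unchanged) the first number but typically lower the second, which is exactly the behavior described above.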
Previously, we conducted hypothesis tests on individual regression coefficients using the t-test. We also need a joint hypothesis test on multiple regression coefficients, which is carried out using the F-test. In multiple regression, we cannot test the null hypothesis that all the slope coefficients are equal to 0 using t-tests alone, because an individual test on each coefficient does not account for the joint effect of the other variables. The F-test (the test of the regression's overall significance) determines whether the slope coefficients in a
multiple linear regression are all equal to 0. T hat is, the null hypothesis is stated as
H 0 : β1 = β2 =. . .= βK = 0 against the alternative hypothesis that at least one slope coefficient is not
equal to 0.
To compute the test statistic for the null hypothesis that all the slope coefficients are equal to 0, we need the following:

I. The residual (error) sum of squares:
∑_{i=1}^{n} (Y_i − Ŷ_i)²

II. The explained (regression) sum of squares:
∑_{i=1}^{n} (Ŷ_i − Ȳ)²

III. The number of parameters to be estimated. For example, in a regression analysis with one independent variable, there are two parameters: the slope and the intercept coefficients.

IV. The number of observations, n.
Using the above four requirements, we can determine the F-statistic. The F-statistic measures how effectively the regression equation explains the variation in the dependent variable. It is denoted F(number of slope parameters, n − number of parameters). For instance, the F-statistic for a multiple regression with two slope coefficients (and one intercept coefficient) is denoted F(2, n − 3). The F-statistic is the ratio of the average regression sum of squares to the average sum of squared errors. The average regression sum of squares is the regression sum of squares divided by the number of slope parameters (k) estimated. The average sum of squared errors is the sum of squared errors divided by the number of observations (n) less the total number of parameters estimated (k + 1).
In this case, for a multiple linear regression model with k independent variables:

F = (ESS/k) / [RSS/(n − (k + 1))]
In regression analysis output (the ANOVA part), MSR and MSE are displayed as the first and second quantities under the MSS (mean sum of squares) column, respectively. If the overall regression is significant, MSR will be large relative to MSE. If the independent variables do not explain any of the variation in the dependent variable, each predicted value Ŷ_i simply equals the mean of the dependent variable (Ȳ), and the F-statistic will be close to zero.
So, how do we make a decision with the F-test? We reject the null hypothesis at the α significance level if the computed F-statistic is greater than the upper α critical value of the F-distribution with the corresponding numerator and denominator degrees of freedom.
An analyst runs a regression of monthly value-stock returns on four independent variables over 48 months. The total sum of squares for the regression is 360, and the sum of squared errors is 120. Test the null hypothesis at a 5% significance level (95% confidence) that all four independent variables' coefficients are jointly equal to zero.

Solution
H0: β1 = β2 = β3 = β4 = 0

Versus

H1: at least one βj ≠ 0

ESS = TSS − RSS = 360 − 120 = 240, k = 4, and n − (k + 1) = 48 − 5 = 43. Therefore:

F = (ESS/k) / [RSS/(n − (k + 1))] = (240/4) / (120/43) = 21.5
Decision: the 5% critical value of F(4, 43) is approximately 2.59, and 21.5 far exceeds it, so we reject H0.
Conclusion: at least one of the four independent variables' coefficients is significantly different from zero.
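The following is a short Python sketch (not from the original reading) that reproduces this F-test using the numbers above; the use of scipy for the critical value is an assumption of convenience.

```python
# F-test of overall significance: TSS = 360, RSS = 120, k = 4, n = 48.
from scipy.stats import f

n, k = 48, 4
tss, rss = 360.0, 120.0
ess = tss - rss                          # explained sum of squares = 240
F = (ess / k) / (rss / (n - k - 1))      # (240/4) / (120/43) = 21.5
crit = f.ppf(0.95, k, n - k - 1)         # upper 5% critical value of F(4, 43)
print(F, crit, F > crit)                 # reject H0 when F exceeds the critical value
```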
An investment analyst wants to determine whether the natural log of the ratio of the bid-offer spread to the price of a stock can be explained by the natural log of the number of market participants and the amount of market capitalization. Part of the regression output is reproduced below.
Residual standard error: 0.9180
Multiple R-squared: 0.6418
Observations: 2,800
We are concerned with the ANOVA (analysis of variance) results. We need to conduct an F-test to determine whether the two slope coefficients are jointly equal to zero at the 5% significance level.
Solution

H0: β1 = β2 = 0
vs.
H1: at least one βj ≠ 0, j = 1, 2
There are two slope coefficients, so k = 2 (the coefficients on the natural log of the number of market participants and the amount of market capitalization); this is the numerator degrees of freedom of the F-statistic. For the denominator, the degrees of freedom are n − (k + 1) = 2,800 − 3 = 2,797.

The sum of squared errors is 2,351.9973, while the regression sum of squares is 3,730.1534. Therefore:

F = (3,730.1534/2) / (2,351.9973/2,797) = 1,865.0767 / 0.8409 ≈ 2,217.95
Since we are working at a 5% (0.05) significance level, we look at the F-distribution table in the column for 2 degrees of freedom in the numerator; the rows correspond to the degrees of freedom in the denominator.
Denominator df \ Numerator df: 1 2 3 4 5 6 7 8 9 10
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24
26 4.22 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16
35 4.12 3.27 2.87 2.64 2.49 2.37 2.29 2.22 2.16 2.11
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08
50 4.03 3.18 2.79 2.56 2.40 2.29 2.20 2.13 2.07 2.03
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99
70 3.98 3.13 2.74 2.50 2.35 2.23 2.14 2.07 2.02 1.97
80 3.96 3.11 2.72 2.49 2.33 2.21 2.13 2.06 2.00 1.95
90 3.95 3.10 2.71 2.47 2.32 2.20 2.11 2.04 1.99 1.94
100 3.94 3.09 2.70 2.46 2.31 2.19 2.10 2.03 1.97 1.93
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91
1000 3.85 3.00 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84
As seen from the table, the 5% critical value for 2 numerator and 2,797 denominator degrees of freedom lies between 3.00 (df = 1,000 row) and 3.07 (df = 120 row). The computed F-statistic is 2,217.95, which is far higher than the critical value, and thus we reject the null hypothesis that all the slope coefficients are equal to 0.
Confidence Intervals for Regression Coefficients

A confidence interval (CI) is a range within which the true parameter is believed to lie with some degree of confidence. Confidence intervals can be used to perform hypothesis tests. For instance, we may want to assess a stock's valuation using the capital asset pricing model (CAPM) and test whether the stock's beta equals the average (market) level of systematic risk. The same approach used in regression with one explanatory variable carries over to multiple regression: the confidence interval for a coefficient is the point estimate plus or minus the critical t-value times the coefficient's standard error.
An economist tests the hypothesis that interest rates and inflation can explain GDP growth in a country. Using 73 observations, the analyst formulates the following regression equation:

GDP growth = b̂0 + b̂1(Interest) + b̂2(Inflation)
What is the 95% confidence interval for the coefficient on the inflation rate?
A. 0.12024 to 0.27976
B. 0.13024 to 0.37976
C. 0.12324 to 0.23976
D. 0.11324 to 0.13976
Solution

The correct answer is A.

From the regression analysis, the estimated coefficient on inflation is β̂ = 0.20 and its estimated standard error is s_β̂ = 0.04. The number of
degrees of freedom is 73 − 3 = 70, so the critical t-value at the 0.05 significance level is t(0.025, 70) = 1.994. Therefore, the 95% confidence interval for the coefficient on inflation is:

0.20 ± 1.994 × 0.04 = (0.12024, 0.27976)
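The same interval can be reproduced with a few lines of Python (an illustrative sketch, not from the reading); the scipy call for the t critical value is an assumption of convenience.

```python
# 95% confidence interval for a coefficient: estimate 0.20, s.e. 0.04, 70 df.
from scipy.stats import t

beta_hat, se, df = 0.20, 0.04, 70
t_crit = t.ppf(0.975, df)                # ~1.994
ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)
print(ci)                                # approximately (0.12024, 0.27976)
```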
Practice Questions
Question 1

An analyst runs a regression of monthly value-stock returns on four independent variables over 48 months. The total sum of squares for the regression is 360, and the sum of squared errors is 120. What is the coefficient of determination (R²) of this regression?
A. 42.1%
B. 50%
C. 33.3%
D. 66.7%
The correct answer is D.

R² = ESS/TSS = (360 − 120)/360 = 240/360 = 66.7%
Question 2

Refer to the previous problem. What is the adjusted R² of the regression?
A. 27.1%
B. 63.6%
C. 72.9%
D. 36.4%
T he correct answer is B.
R̄² = 1 − [(n − 1)/(n − k − 1)] × (1 − R²)
= 1 − [(48 − 1)/(48 − 4 − 1)] × (1 − 0.667)
= 63.6%
Question 3
Refer to the previous problem. The analyst now adds four more independent variables to the regression, and the new R² increases to 69%. What is the new adjusted R², and which model would the analyst most likely prefer?
A. T he analyst would prefer the model with four variables because its adjusted R 2 is
higher.
B. T he analyst would prefer the model with four variables because its adjusted R 2 is
lower.
C. T he analyst would prefer the model with eight variables because its adjusted R 2 is
higher.
D. T he analyst would prefer the model with eight variables because its adjusted R 2 is
lower.
T he correct answer is A.
New R² = 69%

New adjusted R² = 1 − [(48 − 1)/(48 − 8 − 1)] × (1 − 0.69) = 62.6%
The analyst would prefer the first (four-variable) model because it has a higher adjusted R²; the additional variables do not improve the fit by enough to offset the loss of degrees of freedom.
Question 4
An economist tests the hypothesis that GDP growth in a certain country can be explained by interest rates and inflation. Using 30 observations, the analyst regresses GDP growth on the interest rate and the inflation rate; the estimated coefficient on the interest rate is 0.20 with a standard error of 0.05. Which of the following is the correct conclusion of a 5% significance test of the interest rate coefficient?

A. Since the test statistic < t-critical, we accept H0; the interest rate coefficient is not significantly different from zero.
B. Since the test statistic > t-critical, we reject H0; the interest rate coefficient is not significantly different from zero.
C. Since the test statistic > t-critical, we reject H0; the interest rate coefficient is significantly different from zero.
D. Since the test statistic < t-critical, we accept H1; the interest rate coefficient is not significantly different from zero.
T he correct answer is C.
Hypothesis:

H0: β1 = 0 vs. H1: β1 ≠ 0
t = (0.20 − 0)/0.05 = 4
The critical value is t(α/2, n − k − 1) = t(0.025, 27) = 2.052 (from the t-table). Since 4 > 2.052, we reject H0: the interest rate coefficient is significantly different from zero.
Reading 20: Regression Diagnostics
After compl eti ng thi s readi ng, you shoul d be abl e to:
Explain two model selection procedures and how these relate to the bias-variance tradeoff.
Describe the various methods of visualizing residuals and their relative strengths.
Determine the conditions under which OLS is the best linear unbiased estimator.
Model specification refers to choosing which explanatory variables to include in a regression. That is, an ideal regression model should contain all the variables that explain the dependent variable and exclude those that do not. Model specification includes residual diagnostics and statistical tests of the assumptions underlying the OLS estimators. Basically, the choice of variables to be included in a model reflects a bias-variance tradeoff. Large models that include all the relevant variables are likely to have unbiased coefficients but less precise estimates; on the other side, smaller models lead to more precise estimates of the coefficients but risk omitted variable bias.
T he conventional specification makes sure that the functional form of the model is adequate, the
parameters are constant, and the homoscedasticity assumption is met.
Omitted Variables

An omitted variable is one with a non-zero coefficient that is nevertheless excluded from the regression model. Omitting such a variable has two main consequences:

I. The remaining variables absorb the effect of the excluded variable through their common variation. Thus, their coefficients no longer consistently estimate the effect of the included variables on the dependent variable.

II. The magnitude of the estimated residuals is larger than it should be, since the estimated residuals contain both the true error and the effect of the omitted variable that the included variables cannot capture.

Consider the true model:

Y_i = α + β1 X_1i + β2 X_2i + ε_i

If we omit X_2 from the estimated model, then the estimated model is:

Y_i = α + β1 X_1i + ε_i
Now, in large sample sizes, the OLS estimator β̂1 converges to:

β1 + β2 δ

Where:

δ = Cov(X_1, X_2) / Var(X_1)
δ is the population slope coefficient in a regression of X 2 on X 1.
It is clear that the bias due to the omitted variable depends on β2 and on δ, the population slope from regressing X_2 on X_1. When the correlation between X_1 and X_2 is high, X_1 explains a significant proportion of the variation in X_2, and hence the bias is large. On the other hand, if the independent variables are uncorrelated, there is no omitted variable bias. In conclusion, an omitted variable biases the coefficients on the included variables that are correlated with it.
Extraneous Variables

An extraneous variable is one that is unnecessarily included in the model and whose true coefficient is zero. Including such variables affects the adjusted R²:

R̄² = 1 − ξ (RSS/TSS)

Where:

ξ = (n − 1)/(n − k − 1)
Looking at the formula above, adding more variables increases k, which in turn increases ξ and hence tends to reduce R̄². However, if the added variables are relevant, the RSS becomes smaller, which offsets the effect of ξ and produces a larger R̄².
This is not the case when the true coefficient of the added variable is equal to 0: the RSS then remains essentially unchanged while ξ increases, leading to a smaller R̄² and larger standard errors. Lastly, if the correlation between X_1 and X_2 increases, the standard errors of the coefficient estimates rise.
The Bias-Variance Tradeoff

The bias-variance tradeoff amounts to choosing between including irrelevant variables and excluding relevant ones. Bigger models tend to have low bias because they include more of the relevant variables; however, they estimate the regression parameters less precisely because of the additional coefficients. Conversely, regression models with fewer independent variables have lower estimation variance but run a greater risk of omitted variable bias.
In the general-to-specific method, we start with a large general model that incorporates all
the relevant variables. T hen, the reduction of the general model starts. We use hypothesis
tests to establish if there are any statistically insignificant coefficients in the estimated
model. When such coefficients are found, the variable with the coefficient with the smallest
t-statistic is removed. T he model is then re-estimated using the remaining set of independent
variables. Once more, hypothesis tests are carried out to establish if statistically
insignificant coefficients are present. T hese two steps (remove and re-estimate) are
repeated until all coefficients that are statistically insignificant have been removed.
The m-fold cross-validation model-selection method aims at choosing the model that is best at forecasting out-of-sample data.
How is this method executed? (A sketch follows the list.)

1. First, decide on the set of candidate models; this is determined in part by the number of explanatory variables. When this number is small, the researcher can consider all possible combinations. With 10 variables, for example, 1,024 (= 2^10) distinct models can be constructed.
2. Split the sample into m equal-sized, non-overlapping blocks.
3. Estimate the parameters using m − 1 of the blocks; these blocks make up the estimation (training) sample.
4. Use the estimated parameters and the data in the excluded block (the validation block) to compute residuals. These are out-of-sample errors, since they are computed using data not included in the sample used to estimate the parameters.
5. Repeat the parameter estimation and residual computation a total of m times, so that each block is excluded exactly once.
6. Compute the sum of squared errors using the residuals estimated from the out-of-sample data.
7. Select the model with the smallest out-of-sample sum of squared residuals.
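The following is a minimal Python sketch of this procedure for an OLS model; the function name and block construction are illustrative assumptions rather than a prescribed implementation.

```python
# m-fold cross-validation: accumulate the out-of-sample sum of squared residuals
# over the m validation blocks; models with the smallest value are preferred.
import numpy as np

def cv_sse(y, X, m=5):
    n = len(y)
    idx = np.arange(n)
    blocks = np.array_split(idx, m)                 # m roughly equal blocks
    sse = 0.0
    for block in blocks:
        train = np.setdiff1d(idx, block)            # estimation sample (m-1 blocks)
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        Xval = np.column_stack([np.ones(len(block)), X[block]])
        resid = y[block] - Xval @ beta              # out-of-sample residuals
        sse += np.sum(resid ** 2)
    return sse

# cv_sse would be evaluated for each candidate set of regressors,
# and the set with the smallest value selected.
```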
Heteroskedasticity

Recall that homoskedasticity is one of the critical assumptions behind the distribution of the OLS estimator: the variance of ε_i is constant and does not vary with the explanatory variables or across observations. Heteroskedasticity is a systematic pattern in the residuals in which the variances of the residuals differ across observations.
Test for Heteroskedasticity
Halbert White proposed a simple test with a two-step procedure. In the second step, the squared residuals from the first-step regression are regressed on:

1. A constant;
2. All the independent variables; and
3. The cross products of all the independent variables, including the product of each variable with itself (i.e., the squared variables).

Consider the model:

Y_i = α + β1 X_1i + β2 X_2i + ε_i
The first step is to compute the residuals using the OLS parameter estimates:

ε̂_i = Y_i − α̂ − β̂1 X_1i − β̂2 X_2i
The second step is to regress the squared residuals on a constant, the explanatory variables, their squares, and their cross product:

ε̂_i² = γ0 + γ1 X_1i + γ2 X_2i + γ3 X_1i² + γ4 X_2i² + γ5 X_1i X_2i + η_i

The null hypothesis of homoskedasticity is: H0: γ1 = ⋯ = γ5 = 0
The test statistic is nR², where R² is the coefficient of determination of the second-step regression. Under the null hypothesis, it has a χ² distribution with k(k + 3)/2 degrees of freedom, where k is the number of explanatory variables in the first-step model. For instance, if the number of explanatory variables is two (k = 2), then the test statistic has a χ²₅ distribution.
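The following is a hedged Python sketch of White's test for a model with two regressors; the auxiliary-regression construction follows the description above, and the data and function name are assumptions for illustration.

```python
# White's test: regress squared OLS residuals on a constant, X1, X2, X1^2, X2^2
# and X1*X2, then compare n*R^2 with a chi-squared(5) critical value.
import numpy as np
from scipy.stats import chi2

def white_test(y, X1, X2):
    n = len(y)
    Xc = np.column_stack([np.ones(n), X1, X2])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    e2 = (y - Xc @ beta) ** 2                       # squared residuals
    Z = np.column_stack([np.ones(n), X1, X2, X1**2, X2**2, X1 * X2])
    gamma, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    resid2 = e2 - Z @ gamma
    r2 = 1 - resid2.var() / e2.var()                # R^2 of the auxiliary regression
    stat = n * r2
    crit = chi2.ppf(0.95, 5)                        # k(k+3)/2 = 5 df when k = 2
    return stat, crit, stat > crit                  # True => reject homoskedasticity
```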
There are three broad approaches to dealing with heteroskedasticity:

1. Ignoring the heteroskedasticity when estimating the parameters and then using heteroskedasticity-robust standard errors for inference. However simple, this approach leads to less precise parameter estimates than methods that model the heteroskedasticity directly.

2. Transformation of the data. For instance, positive data can be log-transformed to reduce heteroskedasticity and give a better view of the data. Another transformation divides the dependent variable by a scale-related variable.

3. Weighted least squares (WLS). This is a more involved method that applies weights to the data before estimating the parameters. If the standard deviation of the error of observation i is proportional to a known weight w_i, we can transform the data by dividing by w_i to remove the heteroskedasticity from the errors.
In other words, WLS regresses Y_i/w_i on X_i/w_i as follows:
Y_i/w_i = α (1/w_i) + β (X_i/w_i) + ε_i/w_i
Ȳ_i = α C̄_i + β X̄_i + ε̄_i

Note that the parameters of the model above are estimated by applying OLS to the transformed data. That is, the weighted version of Y_i, namely Ȳ_i = Y_i/w_i, is regressed on two weighted explanatory variables, C̄_i = 1/w_i and X̄_i = X_i/w_i. Note that the WLS model does not explicitly include a stand-alone intercept α, but the interpretation of α is the same as in the original model.
Multicollinearity
Multicollinearity occurs when one or more of the independent variables can be substantially explained by the other independent variables. For instance, with two independent variables, there is evidence of multicollinearity if the correlation between them is high, though not perfect.
In contrast with multicollinearity, perfect collinearity occurs when one of the variables is perfectly explained by the others, so that the R² from regressing X_j on the remaining independent variables is exactly 1. Conventionally, an R² above 90% in such a regression causes problems in medium sample sizes, such as n = 100.
Multicollinearity does not pose a problem for parameter estimation as such; rather, it creates difficulties for inference by inflating the standard errors of the coefficients. When multicollinearity is present, the coefficients in a regression model may be jointly statistically significant (the F-statistic is large), yet the individual t-statistics are very small (less than 1.96), because the variables' explanatory power is collective rather than attributable to any one of them individually.
Addressing Multicollinearity
T here are two ways of dealing with multicollinearity:
II. Identification of the multicollinear variables and excluding them from the model.
Multicollinear variables can be identified using the variance inflation factor (VIF), which compares the variance of the regression coefficient on independent variable X_j in two models: one that includes all the variables and one in which X_j is the only explanatory variable:
VIF_j = 1/(1 − R_j²)
Where R_j² comes from regressing X_j on the other variables in the model. When the VIF is above 10, the collinearity is considered excessive and the variable should be considered for exclusion from the model.
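The following is a minimal Python sketch of the VIF calculation described above; the function name and data layout are illustrative assumptions.

```python
# Variance inflation factor: regress X_j on the other regressors and
# compute 1 / (1 - R_j^2). Values above roughly 10 flag excessive collinearity.
import numpy as np

def vif(X, j):
    """X: (n, k) matrix of regressors; j: column index whose VIF is computed."""
    n, k = X.shape
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(n), others])
    beta, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
    resid = X[:, j] - Z @ beta
    r2_j = 1 - resid.var() / X[:, j].var()   # R^2 of X_j on the other regressors
    return 1.0 / (1.0 - r2_j)
```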
Residual Plots
Residual plots are used to identify deficiencies in a model's specification. When the residuals are not systematically related to any of the included independent (explanatory) variables and are relatively small in magnitude (within ±4s, where s is the standard deviation of the model's shocks), the model is likely to be well specified.
Outliers
Outliers are values that, if removed from the sample, produce large changes in the estimated
coefficients. They can also be viewed as data points that deviate significantly from the general pattern of the sample. Cook's distance measures the impact of dropping a single observation j on the fitted values of a regression:
D_j = [ ∑_{i=1}^{n} (Ŷ_i^(−j) − Ŷ_i)² ] / (k s²)
Where:

Ŷ_i^(−j) = the fitted value of observation i when observation j is excluded and the model is estimated using the remaining n − 1 observations;
Ŷ_i = the fitted value of observation i using the full sample;
k = the number of coefficients in the regression model; and
s² = the squared standard error of the regression (the estimated residual variance).
When an observation is an inlier (i.e., excluding it does not materially affect the coefficient estimates), its Cook's distance D_j is small. On the other hand, a D_j greater than 1 indicates that the observation is an outlier.
Observation Y X
1 3.67 1.85
2 1.88 0.65
3 1.35 −0.63
4 0.34 1.24
5 −0.89 −2.45
6 1.95 0.76
7 2.98 0.85
8 1.65 0.28
9 1.47 0.75
10 1.58 −0.43
11 0.66 1.14
12 0.05 −1.79
13 1.67 1.49
14 −0.14 −0.64
15 9.05 1.87
If you look at the dataset above, it is easy to see that observation 15 is considerably larger than the rest of the observations, so it may be an outlier. However, we need to verify this.

We begin by fitting the model to the whole dataset and then to the 14 observations that remain after excluding observation 15. The full-sample fitted line is:

Ŷ_i = 1.4465 + 1.1281 X_i
The fitted line excluding observation 15 is:

Ŷ_i^(−j) = 1.1516 + 0.6828 X_i
Observation   Y       X       Ŷ_i       Ŷ_i^(−j)   (Ŷ_i − Ŷ_i^(−j))²
1             3.67    1.85    3.5330    2.4148     1.2504
2             1.88    0.65    2.1790    1.5954     0.3406
3             1.35    −0.63   0.7358    0.7214     0.0002
4             0.34    1.24    2.8453    1.9983     0.7174
5             −0.89   −2.45   −1.3174   −0.5213    0.6338
6             1.95    0.76    2.3039    1.6705     0.4012
7             2.98    0.85    2.4053    1.7320     0.4533
8             1.65    0.28    1.7624    1.3428     0.1761
9             1.47    0.75    2.2926    1.6637     0.3955
10            1.58    −0.43   0.9614    0.8580     0.0107
11            0.66    1.14    2.7325    1.9210     0.6585
12            0.05    −1.79   −0.5728   −0.0706    0.2522
13            1.67    1.49    3.1274    2.1690     0.9185
14            −0.14   −0.64   0.7245    0.7146     0.0001
15            9.05    1.87    3.5560    2.4284     1.2715
Sum                                                 7.4800
D_j = [ ∑_{i=1}^{n} (Ŷ_i^(−j) − Ŷ_i)² ] / (k s²) = 7.4800 / (2 × 3.554) = 1.0523

Since D_j is greater than 1, observation 15 is indeed an outlier.
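The calculation above can be reproduced with a short Python sketch (an illustration under the same definitions, not the reading's own code); the residual-variance estimate used for s² is an assumption and may differ slightly from the 3.554 quoted above.

```python
# Cook's distance: refit the model without observation j and compare fitted values.
import numpy as np

def cooks_distance(y, x, j):
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta                               # full-sample fitted values
    k = X.shape[1]                                  # number of coefficients
    s2 = np.sum((y - fitted) ** 2) / (n - k)        # residual variance estimate
    keep = np.arange(n) != j
    beta_j, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    fitted_j = X @ beta_j                           # fitted values without observation j
    return np.sum((fitted_j - fitted) ** 2) / (k * s2)
```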
OLS is the Best Linear Unbiased Estimator (BLUE) when some key assumptions are met; that is, it has the smallest possible variance among all estimators that are linear and unbiased. The key assumptions are:

Linearity: the model must be linear in the parameters being estimated.
Random sampling: the data must have been randomly sampled from the population.

Non-collinearity: the regressors must not be perfectly correlated with one another.

Exogeneity: the regressors are not correlated with the error term.

Homoskedasticity: the variance of the error term is constant across observations.
Two caveats are worth noting:

I. A large proportion of commonly used estimators are not linear, such as maximum likelihood estimators (which may be biased), so the BLUE property does not cover them.

II. The BLUE property depends heavily on the residuals being homoskedastic. If the variances of the residuals vary with the independent variables, it is possible to construct more efficient linear unbiased estimators (LUE) of the coefficients α and β using WLS, albeit under extra assumptions.
When the residuals are iid and normally distributed with a mean of 0 and variance σ², formally stated as ε_i ~ iid N(0, σ²), OLS is upgraded from BLUE to BUE (Best Unbiased Estimator), by virtue of having the smallest variance among all unbiased estimators, linear and non-linear. However, normally distributed errors are not a requirement for consistent estimates of the model coefficients or for valid large-sample inference.
Practice Question 1
I. Homoskedasticity means that the variance of the error terms is constant for all
independent variables.
II. Heteroskedasticity means that the variance of error terms varies over the sample.
A. Only I
B. II and III
Sol uti on
T he correct answer is C.
If the variance of the residuals is constant across all observations in the sample, the regression is said to be homoskedastic. When the opposite is true, the regression is said to exhibit heteroskedasticity, i.e., the variance of the residuals differs across observations. Heteroskedasticity is a significant problem because it introduces bias into the estimators of the standard errors of the regression coefficients.
Practice Question 2
A financial analyst fails to include a variable which inherently has a non-zero coefficient
in his regression analysis. Moreover, the ignored variable is highly correlated with the
remaining variables.
C. Presence of heteroskedasticity.
Solution

The correct answer is A: the regression suffers from omitted variable bias. Because the omitted variable has a non-zero coefficient and is correlated with the included regressors, the coefficients on the remaining variables are estimated with bias.

By contrast, an extraneous variable is one that is unnecessarily included in the model; its true coefficient, and its consistently estimated value in large samples, is 0. Heteroskedasticity is present when the variance of the errors varies systematically with the independent variables of the model. Neither of these is implied by the scenario described.
Reading 21: Stationary Time Series
After compl eti ng thi s readi ng, you shoul d be abl e to:
Define white noise; describe independent white noise and normal (Gaussian) white noise.
Define and describe the properties of autoregressive moving average (ARMA) processes.
A time series is a collection of observations on a variable's outcomes in distinct time periods, for example, the monthly sales of a company over the past ten years. Time-series models are used to forecast future values of the series. A time series can be decomposed into trend, seasonal, and cyclical components. A trending series changes its level over time, a seasonal series has predictable changes at fixed points within each year, and a cyclical series, as its name suggests, reflects repeated cycles in the data. We discuss these components in what follows.
A stochastic process is a set of random variables ordered in time. The process is usually denoted Y_t; the subscript orders the random variables in time, so that Y_s occurs before Y_t if s < t. Conceptually, the process starts in the infinite past and proceeds into the infinite future; in practice, however, only a finite subset of it, the observed sample, is available.
A series is said to be covariance stationary if both its mean and its covariance structure are stable over time. Specifically:

I. The mean does not change over time and is thus constant:

E(Y_t) = μ ∀t

II. The variance does not change over time and is finite:

V(Y_t) = γ0 < ∞ ∀t

III. The autocovariance of the time series is finite, does not change over time, and depends only on the displacement h, not on the time t:

Cov(Y_t, Y_{t−h}) = γ_h ∀t
T he covariance stationarity is crucial so that the time series has a constant relationship across time
and that the parameters are easily interpreted since the parameters will be asymptotically normally
distributed.
It can be quite challenging to quantify the stability of a covariance structure. We therefore use the autocovariance function. The autocovariance is the covariance between the stochastic process at two different points in time (analogous to the covariance between two random variables), and under stationarity it depends only on the displacement:

γ_h = γ_|h|

This asserts that the autocovariance depends on the displacement h and not on the time t. The autocorrelation function (ACF) is then:

ρ(h) = Cov(Y_t, Y_{t−h}) / (√V(Y_t) √V(Y_{t−h})) = γ_h / √(γ0 γ0) = γ_h / γ0
Similarly, for h = 0:

ρ(0) = γ0/γ0 = 1
The partial autocorrelation, denoted p(h), is the coefficient on Y_{t−h} in a linear population regression of Y_t on Y_{t−1}, …, Y_{t−h}. This regression is referred to as an autoregression, because the regression is on lagged values of the variable itself.
White Noise
Assume that:
yt = ϵt
ε_t ~ (0, σ²), σ² < ∞
where ε_t is the shock, which is uncorrelated over time. Therefore, ε_t and y_t are serially uncorrelated. A process with zero mean, constant variance, and no serial correlation is referred to as a zero-mean white noise process, written:

ε_t ~ WN(0, σ²)
And:
y t ∼ W N (0, σ 2)
ε_t and y_t are serially uncorrelated, but serially uncorrelated does not necessarily mean serially independent. If, in addition, y is serially independent and identically distributed, it is said to be independent white noise. We then write:

y_t ~ iid(0, σ²)
This is read as "y is independently and identically distributed with mean 0 and constant variance." If, further, y is serially independent and normally distributed, it is called normal white noise or Gaussian white noise.
Written as:

y_t ~ iid N(0, σ²)
E (y t) = 0
And:
var (y t ) = σ 2
These two are constant because the autocovariances depend only on displacement, not on time. All the autocovariances and autocorrelations are zero beyond displacement zero, since white noise is serially uncorrelated:
γ(h) = σ² for h = 0, and 0 for h ≥ 1

ρ(h) = 1 for h = 0, and 0 for h ≥ 1

Beyond displacement zero, all partial autocorrelations of a white noise process are also zero, since by construction white noise is serially uncorrelated. The partial autocorrelation function is:

p(h) = 1 for h = 0, and 0 for h ≥ 1
Simple transformations of white noise are considered in the construction of processes with much
richer dynamics. T hen the white noise should be the 1-step-ahead forecast errors from good models.
The mean and variance of a process conditional on its past provide another crucial characterization of its dynamics. To compare conditional and unconditional means and variances, consider independent white noise, y_t ~ iid(0, σ²). Here y has an unconditional mean of 0 and an unconditional variance of σ². In general, the conditional mean and variance need not be constant; for independent white noise, however, the conditional mean is:

E(y_t | Ω_{t−1}) = 0
The conditional variance is:

var(y_t | Ω_{t−1}) = σ²

Independent white noise series therefore have identical conditional and unconditional means and variances.
Wold’s Theorem
Y_t = ε_t + β1 ε_{t−1} + β2 ε_{t−2} + ⋯ = ∑_{i=0}^{∞} β_i ε_{t−i}
Where:
ϵ t ∼ W N (0, σ 2)
Wold's representation theorem states that any covariance-stationary series can be written in this general linear form, where ε_t corresponds to the 1-step-ahead forecast errors that would be incurred if a particularly good (optimal linear) forecast were used.
Time-Series Models
Autoregressive (AR) models are time-series models, widely used in finance and economics, that link the stochastic process Y_t to its previous value Y_{t−1}. The first-order AR model, denoted AR(1), is given by:
Y t = α + βY t−1 + ϵ t
Where:
α = intercept
β = AR parameter
Since Y_t is assumed to be covariance stationary, the mean, variance, and autocovariances are all constant over time. Taking expectations of both sides of the AR(1) equation:

μ = α + βμ + 0
∴ μ = α/(1 − β)

Similarly, for the variance:

γ0 = β²γ0 + σ² + 0
∴ γ0 = σ²/(1 − β²)
Note that Cov(Y t−1 , ϵ t)=0 since Y t−1 is uncorrelated with the shocks ϵ t−1, ϵ t−2, …
The autocovariances of the AR(1) process are calculated recursively. For displacement h ≥ 1:
Cov(Y_t, Y_{t−h}) = Cov(α + βY_{t−1} + ε_t, Y_{t−h})
= β Cov(Y_{t−1}, Y_{t−h}) + Cov(Y_{t−h}, ε_t)
= β γ_{h−1}
It should be easy to see that Cov(Y_{t−h}, ε_t) = 0. Applying this recursion:

γ_h = β^h γ0, or more generally γ_h = β^|h| γ0

so that the ACF is:

ρ(h) = β^h γ0 / γ0 = β^|h|
The ACF decays toward 0 as h increases and oscillates in sign if −1 < β < 0. The partial autocorrelation function of an AR(1) model is:

∂(h) = β^|h| for h ∈ {0, ±1}, and 0 for |h| ≥ 2
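The following is a minimal simulation sketch (an illustration, not part of the reading) showing that the sample ACF of a simulated AR(1) with β = 0.7 is close to β^h, as the formula above predicts; the seed, sample size, and parameter values are assumptions.

```python
# Simulate an AR(1) and compare sample autocorrelations with beta^h.
import numpy as np

rng = np.random.default_rng(42)
alpha, beta, sigma, T = 0.0, 0.7, 1.0, 50_000
y = np.zeros(T)
for t in range(1, T):
    y[t] = alpha + beta * y[t - 1] + rng.normal(scale=sigma)

def sample_acf(x, h):
    x = x - x.mean()
    return np.sum(x[h:] * x[:-h]) / np.sum(x ** 2)

for h in (1, 2, 3):
    print(h, round(sample_acf(y, h), 3), round(beta ** h, 3))
```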
T he lag operator denoted by L is important for manipulating complex time-series models. As its name
suggests, the lag operator moves the index of a particular observation one step back. T hat is:
LY t = Y t−1
(I). T he lag operator moves the index of a time series one step back. T hat is:
LY t = Y t−1
(II). For the mth-order lag operator L^m:

L^m Y_t = Y_{t−m}
(III). The lag operator applied to a constant leaves it unchanged; for example, Lα = α.

(IV). The lag operator has a multiplicative property. For two lag polynomials a(L) and b(L):

a(L)b(L) = b(L)a(L)

(V). Under some restrictive conditions, a lag operator polynomial can be inverted so that a(L)a(L)^{−1} = 1. When a(L) is a first-order lag polynomial 1 − a1 L, it is invertible if |a1| < 1, so that:
(1 − a1 L)^{−1} = ∑_{i=0}^{∞} a1^i L^i
Applying the lag operator to the AR(1) model:

Y_t = α + βY_{t−1} + ε_t
Y_t = α + βL Y_t + ε_t
⇒ (1 − βL) Y_t = α + ε_t
⇒ Y_t = (1 − βL)^{−1}(α + ε_t) = α ∑_{i=0}^{∞} β^i + ∑_{i=0}^{∞} β^i L^i ε_t = α/(1 − β) + ∑_{i=0}^{∞} β^i ε_{t−i}
The AR(p) model generalizes the AR(1) model to include p lags of Y_t:

Y_t = α + β1 Y_{t−1} + β2 Y_{t−2} + … + β_p Y_{t−p} + ε_t

Its mean and variance are:
E(Y_t) = α / (1 − β1 − β2 − … − β_p)

V(Y_t) = γ0 = σ² / (1 − β1ρ1 − β2ρ2 − … − β_pρ_p)
From the formulas for the mean and variance of the AR(p) model, covariance stationarity requires that:

β1 + β2 + ⋯ + β_p < 1
Otherwise, the covariance stationarity will be violated.
The autocorrelation function of the AR(p) model has the same broad structure as that of the AR(1) model: the ACF decays toward 0 as the displacement increases, and it may oscillate. However, unlike the AR(1), the partial autocorrelation function of an AR(p) is non-zero for the first p lags and zero thereafter.

Moving Average (MA) Models

The first-order moving average model, MA(1), is given by:
Y t = μ + θϵ t−1 + ϵ t
Evidently, the process Y_t depends on the current shock ε_t and the previous shock ε_{t−1}, where the coefficient θ measures the extent to which the previous shock affects the current value; μ is the mean of the process. For θ > 0, the MA(1) is persistent, because consecutive values are positively correlated. On the other hand, if θ < 0, the process mean-reverts, because the effect of the previous shock is reversed in the current period.

The MA(1) model is always covariance stationary. The mean is μ, as shown above, while the variance is:

γ0 = (1 + θ²)σ²

The variance follows from the fact that the shocks are white noise and hence uncorrelated. The ACF of the MA(1) is:
ρ(h) = 1 for h = 0; θ/(1 + θ²) for h = 1; and 0 for h ≥ 2
The partial autocorrelations (PACF) of the MA(1) model are more complex and are non-zero at all lags, decaying gradually.
From the MA(1), we can generalize the qth order MA process. Denoted by MA(q), it is given by:
Y t = μ + ϵ t + θ1ϵt−1 + … + θq ϵ t−q
T he mean of the MA(q) process is still μ since all the shocks are white noise process (their
expectations are 0). T he autocovariance function of the MA(q) process is given by:
γ(h) = σ² ∑_{i=0}^{q−h} θ_i θ_{i+h} for 0 ≤ h ≤ q, and γ(h) = 0 for h > q

where θ0 = 1.
The value of θ can be determined by substituting the observed value of the autocorrelation function and solving the resulting quadratic equation. The partial autocorrelation function of an MA(q) model is likewise complex and non-zero at all displacements.
Given an MA(2), Y t = 3.0 + 5ϵ t−1 + 5.75ϵ t−2 + ϵ t where ϵ t ∼ W N (0, σ2). What is the mean of the
process?
Solution
The MA(2) has the form Y_t = μ + θ1 ε_{t−1} + θ2 ε_{t−2} + ε_t, where μ is the mean and the shocks have zero expectation. So, the mean of the above process is 3.0.
ARMA Models

The ARMA model combines the AR and MA processes. Consider the first-order ARMA model, ARMA(1,1):

Y_t = α + βY_{t−1} + θε_{t−1} + ε_t

Its mean, variance, and autocovariance function are:
μ = α/(1 − β)

γ0 = σ²(1 + 2βθ + θ²)/(1 − β²)

γ(h) = σ²(1 + 2βθ + θ²)/(1 − β²) for h = 0;
γ(h) = σ²[β(1 + βθ) + θ(1 + βθ)]/(1 − β²) for h = 1; and
γ(h) = βγ_{h−1} for h ≥ 2
The ACF of the ARMA(1,1) decays as the displacement h increases and oscillates if β < 0, which is consistent with the AR component. The PACF also tends to 0 as h increases, which is consistent with the MA component. The gradual decay of both the ACF and the PACF distinguishes the ARMA model from pure AR and MA models. From the variance formula of the ARMA(1,1), it is easy to see that the process is covariance stationary if |β| < 1.
ARMA(p,q) Model
As the name suggests, the ARMA(p,q) model is a combination of the AR(p) and MA(q) processes. Its form is:

Y_t = α + β1 Y_{t−1} + … + β_p Y_{t−p} + θ1 ε_{t−1} + … + θ_q ε_{t−q} + ε_t
When expressed using lag polynomials, this reduces to:

β(L)Y_t = α + θ(L)ε_t

The ARMA(p,q) process is covariance stationary if its AR component is covariance stationary. The autocovariances and ACF of the ARMA process are complex and decay at a slow, gradual rate.
Sample Autocorrelation

The sample autocovariance at displacement h, estimated from T observations, is given by:

γ̂_h = (1/(T − h)) ∑_{i=h+1}^{T} (Y_i − Ȳ)(Y_{i−h} − Ȳ)
Testing for autocorrelation can be done graphically, by plotting the ACF and PACF of the residuals and checking for deficiencies such as the model's failure to capture the dynamics of the data, or formally using test statistics.
The Box-Pierce and Ljung-Box tests both test the null hypothesis that the first h autocorrelations are jointly zero:

H0: ρ1 = ρ2 = … = ρ_h = 0

Both test statistics are chi-squared distributed with h degrees of freedom (χ²_h). If the test statistic is larger than the critical value, the null hypothesis of no autocorrelation is rejected.
Box-Pierce Test

Q_BP = T ∑_{i=1}^{h} ρ̂_i²

That is, the test statistic is the sum of the squared sample autocorrelations, scaled by the sample size T.
Ljung-Box Test

The Ljung-Box test is a refined version of the Box-Pierce test that is more appropriate in small samples. The test statistic is:

Q_LB = T(T + 2) ∑_{i=1}^{h} ρ̂_i² / (T − i)
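A short Python sketch (illustrative, not from the reading) shows how the two statistics are computed from estimated autocorrelations; the sample values used are assumptions for illustration.

```python
# Box-Pierce and Ljung-Box Q statistics from sample autocorrelations.
import numpy as np
from scipy.stats import chi2

T = 300
rho = np.array([0.25, -0.10, -0.05])                      # illustrative estimates
h = len(rho)
q_bp = T * np.sum(rho ** 2)                               # Box-Pierce
q_lb = T * (T + 2) * np.sum(rho ** 2 / (T - np.arange(1, h + 1)))  # Ljung-Box
crit = chi2.ppf(0.95, h)                                  # 5% chi-squared critical value, h df
print(q_bp, q_lb, crit)
```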
Model Selection
T he first step in model selection is the inspection of the sample autocorrelations and the PACFs.
T his provides the initial signs of the correlation of the data and thus can be used to select the type of
models to be used.
T he next step is to measure the fit of the selected model. T he most commonly used method of
measuring the model’s fit is Mean Squared Error (MSE) which is defined as:
σ̂² = (1/T) ∑_{t=1}^{T} ε̂_t²
When the MSE is small, the selected model explains more of the variation in the time series. However, choosing the model with the smallest MSE amounts to maximizing the coefficient of determination, R², which can lead to overfitting. To address this problem, other methods have been developed to measure the fit
of the model. T hese methods involve adding an adjustment factor to MSE each time a parameter is
added. T hese measures are termed as the Informati on Cri teri a (IC). T here are two such ICs:
Akaike Information Criteria (AIC) and the Bayesian Information Criteria (BIC).
AIC = T ln σ̂² + 2k
Where T is the sample size and k is the number of parameters. The AIC adds a penalty of 2 for each parameter included. The BIC (Bayesian Information Criterion) is:

BIC = T ln σ̂² + k ln T
Where the variables are defined as in AIC; however, note that the adjustment factor in BIC increases
with an increase in the sample size T. Hence, it is a consistent model selection criterion. Moreover,
the BIC criterion does not select the model that is larger than that selected by AIC.
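The following is a minimal Python sketch of the two criteria as defined above; the example values of the MSE, sample size, and parameter counts are assumptions for illustration.

```python
# AIC and BIC penalties: a larger model must lower the MSE enough to offset the penalty.
import numpy as np

def aic(sigma2_hat, T, k):
    return T * np.log(sigma2_hat) + 2 * k

def bic(sigma2_hat, T, k):
    return T * np.log(sigma2_hat) + k * np.log(T)

print(aic(0.95, 200, 3), aic(0.90, 200, 5))   # compare a small and a larger model
print(bic(0.95, 200, 3), bic(0.90, 200, 5))   # BIC penalizes extra parameters more
```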
The Box-Jenkins Methodology
T he Box-Jenkin methodology provides a criterion of selecting between models that are equivalent
but with different parameter values. T he equivalency of the models implies that their mean, ACF and
T he Box-Jenkin methodology postulates two principles of selecting the models. One of the principles
is termed as Parsi mony. Under this principle, given two equivalent models, choose a model with
fewer parameters.
The last principle is invertibility, which states that when selecting an MA or ARMA model, one should select the representation whose MA lag polynomial is invertible.
Model Forecasting
Forecasting is the process of using current information to predict future values. In time-series analysis, the forecast is a conditional expectation. The one-step forecast is the conditional expectation E(Y_{T+1} | Ω_T), where Ω_T is the information set at time T, which includes the entire history of Y (Y_T, Y_{T−1}, …) and the history of the shocks (ε_T, ε_{T−1}, …):

E_T(Y_{T+1} | Ω_T) = E_T(Y_{T+1})
Principles of Forecasting.
I. The expectation of a current or past value is its realization: E_T(Y_T) = Y_T. This applies to past shocks as well.

II. The expectation of a future shock is zero:

E_T(ε_{T+h}) = 0 for h > 0
III. The forecasts are constructed recursively, beginning with E_T(Y_{T+1}); the forecast for a given horizon builds on the forecast for the previous horizon.

For an AR(1) model, the one-step forecast is E_T(Y_{T+1}) = E_T(α + βY_T + ε_{T+1}) = α + βY_T. Note that we are using the current value Y_T to predict Y_{T+1}, and the only shock involved is the future shock ε_{T+1}, whose conditional expectation is zero. For two steps ahead:
E_T(Y_{T+2}) = E_T(α + βY_{T+1} + ε_{T+2}) = α + βE_T(Y_{T+1}) + E_T(ε_{T+2})

So that:

E_T(Y_{T+2}) = α + β(α + βY_T) = α + αβ + β²Y_T

More generally, for an h-step-ahead forecast:

E_T(Y_{T+h}) = α + αβ + αβ² + … + αβ^{h−1} + β^h Y_T = ∑_{i=0}^{h−1} αβ^i + β^h Y_T
When h is large, β^h becomes very small by the covariance stationarity of Y_t (|β| < 1). Therefore, in the limit:
lim_{h→∞} E_T(Y_{T+h}) = ∑_{i=0}^{∞} αβ^i = α/(1 − β)
This limit is the mean (mean-reverting level) of the AR(1) model, which implies that the current value Y_T has no effect on very long-horizon forecasts. The forecast error is the difference between the realized future value and the forecast, that is, ε_{T+1} = Y_{T+1} − E_T(Y_{T+1}). For longer horizons, the forecasts are mostly functions of the model parameters.
An AR(1) model for the default rate on an insurance company's premiums is given by:

D_t = 0.055 + 0.934 D_{t−1} + ε_t

Given that D_T = 1.50, what is the one-step forecast of the default rate?
Solution
We need:
ET (Y T+1 ) = α + βY T
⇒ ET (DT+1 ) = 0.055 + 0.934 × 1.5 = 1.4560
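The recursion extends naturally to longer horizons, as in the following minimal Python sketch using the same numbers (an illustration, not part of the reading).

```python
# Recursive h-step AR(1) forecasts: alpha = 0.055, beta = 0.934, D_T = 1.50.
def ar1_forecasts(alpha, beta, y_T, horizon):
    forecasts = []
    y_hat = y_T
    for _ in range(horizon):
        y_hat = alpha + beta * y_hat       # E_T(Y_{T+h}) = alpha + beta * E_T(Y_{T+h-1})
        forecasts.append(y_hat)
    return forecasts

print(ar1_forecasts(0.055, 0.934, 1.50, 3))   # first value: 1.4560
```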
Seasonality in ARMA Models

Some time series are seasonal; for instance, sales in the summer may differ systematically from sales in the winter. Time series with deterministic seasonality are non-stationary, whereas those with stochastic seasonality can be stationary and hence modeled with AR or ARMA processes.
A pure seasonal model uses lags at the seasonal frequency. For instance, for quarterly data, the pure seasonal AR(1) model of the time series is:

(1 − βL⁴)Y_t = α + ε_t

So that:

Y_t = α + βY_{t−4} + ε_t
A richer specification includes both short-term and seasonal lag components: the short-term lags capture ordinary serial dependence, while the seasonal lags capture dependence at the seasonal frequency. Seasonality can be introduced into AR, MA, or ARMA models by multiplying the short-run lag polynomial by the seasonal lag polynomial. For instance, the seasonal ARMA is specified as:
ARMA(p, q) × (p_s, q_s)_f

Where p and q are the orders of the short-run lag polynomials, p_s and q_s are the orders of the seasonal lag polynomials, and f is the seasonal frequency. Practically, seasonal lag polynomials are restricted to one seasonal lag, because the
accuracy of the parameter approximations depends on the number of full seasonal cycles in the
sample data.
Question 1
The following sample autocorrelation estimates are obtained from 300 data points:

Lag: 1, 2, 3
Coefficient: 0.25, −0.10, −0.05

What is the value of the Box-Pierce test statistic based on these three lags?
A. 22.5
B. 22.74
C. 30
D. 30.1
T he correct answer is A.
Q_BP = T ∑_{h=1}^{m} ρ̂²(h) = 300 × (0.25² + (−0.1)² + (−0.05)²) = 22.5
Question 2
The following sample autocorrelation estimates are obtained from 300 data points:

Lag: 1, 2, 3
Coefficient: 0.25, −0.10, −0.05

What is the value of the Ljung-Box test statistic based on these three lags?
A. 30.1
B. 30
C. 22.5
D. 22.74
T he correct answer is D.
Q_LB = T(T + 2) ∑_{h=1}^{m} ρ̂²(h)/(T − h)
= 300 × 302 × (0.25²/299 + (−0.1)²/298 + (−0.05)²/297)
= 22.74
Note: Provided the sample size is large, the Box-Pierce and Ljung-Box tests typically give very similar values.
Question 3
Assume the shocks in a time series are approximated by Gaussian white noise. Yesterday's realization, y_{t−1}, was 0.015, and the lagged shock was −0.160. Today's shock is 0.170. If the weight parameter θ is equal to 0.70 and the mean of the process is 0.5, what is today's realization of the MA(1) process?
A. -4.205
B. 4.545
C. 0.558
D. 0.282
The correct answer is C.

In an MA(1) process, today's realization depends on today's shock and the lagged shock, not on yesterday's realization y_{t−1}:

y_t = μ + θε_{t−1} + ε_t = 0.5 + 0.170 + 0.7 × (−0.160) = 0.558
Reading 22: Nonstationary Time Series
After compl eti ng thi s readi ng, you shoul d be abl e to:
Explain how to construct an h-step-ahead point forecast for a time series with seasonality.
Calculate the estimated trend value and form an interval forecast for a time series.
Recall that stationary time series have means, variances, and autocovariances that are independent of time. Any time series that violates this property is termed non-stationary. Non-stationary time series include those with time trends, random walks (also called unit roots), and seasonalities. Time trends reflect the tendency of a series to grow (or shrink) over time.

Seasonalities arise from changes in the series across the seasons of the year, such as each quarter. Seasonality can take the form of shifts in the mean (for example, depending on the period of the year) or of a seasonal cycle in the series (which occurs when the current shock is related to the shock in the same period of a previous year). Seasonality can be modeled using dummy variables or by modeling period-over-period changes (such as year-over-year differences) in an attempt to remove the seasonal component.

In a random walk, the current value of the series equals its previous value plus an unpredictable shock. We discuss each of these forms of non-stationarity in turn.
Time Trends.
A time trend deterministically shifts the mean of the time series. The trend can be linear or nonlinear.
Linear trend models are those that the dependent variable changes at a constant rate with time. If the
time series y t has a linear trend, we can model the series by the following equation:
Y t = β0 + β1 t + ϵ t , t = 1, 2,… , T
Where β0 is the intercept, β1 is the trend (slope) coefficient, and ε_t is a white noise shock.

From the equation above, β0 + β1 t predicts Y_t at any time t. The slope β1 is described as the trend coefficient. We estimate both parameters, β0 and β1, using ordinary least squares (OLS), and the expected value of the series is:

E(Y_t) = β0 + β1 t
Estimation of the Trend Value Under Linear Trend Models
Using the estimated coefficients, we can predict the value of the dependent variable at any time t = 1, 2, …, T. For instance, the trend value at time 2 is Ŷ_2 = β̂0 + β̂1(2). We can also forecast the value of the time series beyond the sample period, that is, at T + 1. The predicted value of Y_t at time T + 1 is Ŷ_{T+1} = β̂0 + β̂1(T + 1).
A linear trend is defined to be Y t = 17.5 + 0.65t . What is the trend projection for time 10?
Solution
Ŷ_10 = 17.5 + 0.65 × 10 = 24
In a linear trend model, the series grows by a constant amount each period, which can pose problems for economic and financial time series:

1. When the trend is positive, the proportional growth rate implied by the model declines over time.
2. If the slope coefficient is negative, Y_t will eventually tend toward negative values, which is implausible for most financial time series, e.g., asset prices and quantities.

Considering these limitations, we discuss the log-linear trend model, which has a constant proportional growth rate. Linear trend models can also produce serially correlated errors; for instance, a time series with exponential growth is poorly described by a linear trend. The appropriate model for a time series with exponential growth is the log-linear trend.
Assume that the time series is defined as:
Y t = eβ0+β1 t, t = 1, 2, … , T
Which also can be written as (by taking the natural logarithms on both sides):
ln Y t = β0 + β1 t, t = 1, 2,… , T
By "exponential rate," we mean growth at a constant rate with continuous compounding. This can be seen as follows. Using the formula above, the values of the series at times 1 and 2 are y_1 = e^{β0+β1(1)} and y_2 = e^{β0+β1(2)}, so the ratio y_2/y_1 is:

Y_2/Y_1 = e^{β0+β1(2)} / e^{β0+β1(1)} = e^{β1}
Similarly, the value of the time series at time t is Y_t = e^{β0+β1 t}, and at t + 1 it is Y_{t+1} = e^{β0+β1(t+1)}, so that:

Y_{t+1}/Y_t = e^{β0+β1(t+1)} / e^{β0+β1 t} = e^{β1}

Taking the natural logarithm of both sides:

ln(Y_{t+1}/Y_t) = ln Y_{t+1} − ln Y_t = β1
From the above results, the proportional growth of the series over two consecutive periods is constant:

(y_{t+1} − y_t)/y_t = y_{t+1}/y_t − 1 = e^{β1} − 1
An investment analyst wants to fit the weekly sales (in millions) of his company using sales data from January 2016 to February 2018. The estimated log-linear regression equation is:

ln Ŷ_t = 5.1062 + 0.0443 t

What is the estimated trend value of sales in the 80th week?
Solution
From the regression equation, β̂0 = 5.1062 and β̂1 = 0.0443. Under the log-linear trend model, the fitted value is:
Ŷ_t = e^{β̂0 + β̂1 t}

so that Ŷ_80 = e^{5.1062 + 0.0443 × 80} = e^{8.6502} ≈ 5,711 (millions).
More generally, a polynomial time trend of order m is given by:

Y_t = β0 + β1 t + β2 t² + ⋯ + β_m t^m + ε_t,  t = 1, 2, …, T
Practically speaking, the polynomial-time trends are only limited to the linear (discussed above) and
the quadratic (second degree) time trend. In a quadratic time trend, the parameter can be estimated
using the OLS. T he approximated parameter are asymptotically normally distributed and hence
statistical inference using the t-statistics and the standard error happen only if the residuals ϵ t are
white noise.
As the name suggests, the log-quadratic time trend combines the log-linear and quadratic trends. It is given by:

ln Y_t = β0 + β1 t + β2 t²
It can be shown that the growth rate of the log-quadratic time trend is approximately β1 + 2β2 t. This can be seen as follows. The value of the series at time t is Y_t = e^{β0+β1 t+β2 t²}, and at t + 1 it is Y_{t+1} = e^{β0+β1(t+1)+β2(t+1)²}, which implies the ratio:

Y_{t+1}/Y_t = e^{β0+β1(t+1)+β2(t+1)²} / e^{β0+β1 t+β2 t²} = e^{β1+β2(2t+1)} ≈ e^{β1+2β2 t}
Example: Calculating the Growth Rate of Log-Quadratic Time Trend
The monthly real GDP of a country over 20 years is modeled by a log-quadratic time trend with estimated coefficients β̂1 = 0.015 and β̂2 = 0.0000564.

What is the growth rate of the country's real GDP at the end of the 20 years?

Solution

The growth rate of a log-quadratic trend is:

β1 + 2β2 t

Since the data are monthly, the end of 20 years corresponds to t = 240 months, so the growth rate is 0.015 + 2 × 0.0000564 × 240 ≈ 0.0421, or about 4.21%.
The coefficient of determination (R²) for a trend model is typically very high and tends toward 100% as the sample size increases. Therefore, R² is not an appropriate measure of fit for trend models.
Seasonality
Seasonality is a feature of a time series in which the data undergoes regular and predictable changes
that recur every calendar year. For instance, gas consumption in the US rises during the winter and
Seasonal effects are observed within a calendar year, e.g., spikes in sales over Christmas, while
cyclical effects span time periods shorter or longer than one calendar year, e.g., spikes in sales due
to low unemployment rates.
Regression on seasonal dummies is an important method of modeling seasonality. Assuming that there are s seasons in a year, the pure seasonal dummy model is:

Y_t = β0 + ∑_{j=1}^{s−1} γ_j D_jt + ε_t

Where the dummy variables are defined as:

D_jt = 1 if t mod s = j, and 0 otherwise

The seasonal means are then:
E[Y 1 ] = β0 + γ1
E[Y 2 ] = β0 + γ2
In period s (the omitted season), all the dummy variables are zero, so the mean of the series in that season is:
E[Y s ] = β0
The parameters of the seasonal model are estimated by OLS, regressing Y_t on a constant and the s − 1 seasonal dummies, as in the sketch below.
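The following is a hedged Python sketch of that estimation for quarterly data (s = 4); the function name and dummy construction are illustrative assumptions.

```python
# Estimate seasonal dummy coefficients by OLS: regress Y_t on a constant
# and s-1 = 3 quarterly dummies.
import numpy as np

def seasonal_dummy_fit(y, s=4):
    T = len(y)
    D = np.zeros((T, s - 1))
    for j in range(s - 1):
        D[:, j] = (np.arange(T) % s == j).astype(float)   # dummy for season j+1
    X = np.column_stack([np.ones(T), D])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta            # [beta0, gamma_1, ..., gamma_{s-1}]
```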
Time trends and seasonal dummies alone can be insufficient for explaining economic time series, since their residuals may not be white noise. If the detrended, deseasonalized series appears to be stationary but the residuals are not white noise, we can add stationary time-series components (such as AR terms). Consider the linear trend model:
Y t = β0 + β1 t + ϵ t
If the residuals are not white noise but the series otherwise appears stationary around the trend, we can include an autoregressive term:
Y t = β0 + β1 t + δ1 Y t−1 + ϵ t
Combining the trend, seasonal dummies, and the AR term:

Y_t = β0 + β1 t + ∑_{j=1}^{s−1} γ_j D_jt + δ1 Y_{t−1} + ε_t
Note that the AR component captures the cyclical behavior of the series, while γ_j measures the shift of the mean away from the trend growth β1 t in season j. However, such combinations do not always produce a model with the required dynamics; for instance, the Ljung-Box statistic may still indicate that the residuals are autocorrelated.
A random walk is a time series in which the value of the series in one period is equivalent to the
value of the series in the previous period plus the unforeseeable random error. A random walk can
be defined as follows:
Let
Y t = Y t−1 + ϵ t
Intuitively, substituting recursively:

Y_t = (Y_{t−2} + ε_{t−1}) + ε_t = ⋯ = Y_0 + ∑_{i=1}^{t} ε_i
The random walk is a particular case of the AR(1) model with β0 = 0 and β1 = 1. We cannot use standard AR regression techniques on it, because a random walk does not have a finite mean-reverting level or a finite variance. Recall that if Y_t has a mean-reverting level, then setting Y_t = β0 + β1 Y_t gives a level of β0/(1 − β1). In a random walk, β0 = 0 and β1 = 1, so the mean-reverting level is 0/(1 − 1) = 0/0, which is undefined. Moreover, the variance grows with time:
V(Y t ) = tσ 2
The implication of this growing, unbounded variance is that standard regression-based inference cannot be applied to a random walk.
Unit Roots
So far, we have discussed random walks without drift, for which the current value is the best forecast of all future values. A random walk with drift is a time series that increases or decreases by a constant expected amount each period:
Y t = β0 + β1Y t−1 + ϵ t
β0 ≠ 0, β1 = 1
Or
Y t = β0 + Y t−1 + ϵ t
Where ϵ t ∼ WN(0, σ 2)
Recall that β1 = 1 implies an undefined mean-reversion level and hence non-stationarity. Therefore, we cannot use the AR model to analyze such a series unless we first transform it by differencing:

ΔY_t = Y_t − Y_{t−1} = β0 + ε_t,  β0 ≠ 0
T he unit root test involves the application of the random walk concepts to determine whether a time
series is non-stationary by focusing on the slope coefficient in a random-walk-with-drift specification, a special case of the AR(1) model. This test is popularly known as the Dickey-Fuller test.
Consider an AR(1) model. If the time-series originates from an AR(1) model, then the time-series is
covariance stationary if the absolute value of the lag coefficient β1 is less than 1. T hat is, |β1| < 1.
T herefore, we could not depend on the statistical results if the lag coefficient is greater or equal to 1
(|β1 | ≥ 1).
When the lag coefficient is precisely equal to 1, then the time series is said to have a unit root. In
other words, the time-series is a random walk and hence not covariance stationary.
The unit root problem can also be expressed using lag polynomials. Let ψ(L) be the full lag polynomial, which can be factorized into the unit-root factor (1 − L) and a remainder polynomial ϕ(L), which is the lag polynomial of a stationary series. Moreover, let θ(L)ε_t be an MA component. Then the unit root process can be written as:
ψ(L)Y_t = θ(L)ε_t
(1 − L)ϕ(L)Y_t = θ(L)ε_t
An AR(2) model is given by Y t = 1.7Y t−1 − 0.7Y t−2 + ϵ t . Does the process contain a unit root?
Solution
Using the definition of a lag polynomial, we can write the above equation as:
(1 − 1.7L + 0.7L 2 )Y t = ϵ t
(1 − L)(1 − 0.7L)Y t = ϵ t
T herefore, the process has a unit root due to the presence of a unit root lag operator (1-L).
The presence of a unit root creates several problems:

1. A unit root process does not have a mean-reverting level. Recall that a stationary time series mean-reverts, so its long-run mean can be estimated.

2. In a time series with a unit root, spurious relationships are a problem. A spurious correlation arises when there is no meaningful link between two series, yet a regression of one on the other appears to show a statistically significant relationship.

3. The parameter estimators in ARMA models with a unit root follow the Dickey-Fuller (DF) distribution, which is asymmetric, depends on the sample size, and has a critical value that
depends on whether time trends have been incorporated. T his characteristic makes it
difficult to come up with sound statistical inference and model selection when fitting the
models.
If the time series appears to have a unit root, the best approach is to model its first difference as an autoregressive time series, which can then be analyzed effectively using regression techniques.
Recall that the time series with a drift is a form of AR(1) model given by:
y t = β0 + Y t−1 + ϵ t ,
Where ϵ t ∼ WN(0, σ 2)
Clearly β1 = 1 implies that the time series has an undefined mean-reversion level and hence non-
stationary. T herefore, we are unable to use the AR model to analyze time series unless we
ΔY_t = Y_t − Y_{t−1} = β0 + ε_t,  β0 ≠ 0
Using lag polynomials, let ΔY_t = Y_t − Y_{t−1}, where Y_t has a unit root (which implies that ΔY_t does not):
(1 − L)ϕ(L)Y t = ϵt
ϕ(L)[(1 − L)Y t] = ϵt
ϕ(L)[(Y t − LY t)] = ϵt
ϕ(L)ΔY t = ϵt
Since the lag polynomial ϕ(L) is stationary series lag polynomial, the time series defined by ΔY t must
be stationary.
The unit root test is carried out using the Augmented Dickey-Fuller (ADF) test. The test involves OLS estimation of a regression of the differenced series on the lagged level of the series, deterministic terms (a constant and possibly a time trend), and lagged differences. A common form of the test regression is:

ΔY_t = γY_{t−1} + δ0 + δ1 t + λ1 ΔY_{t−1} + ⋯ + λ_p ΔY_{t−p} + ε_t

Where γ is the coefficient on the lagged level and the λ terms capture short-run dynamics.
To get the gist of this, assume that we are conducting an ADF test on a time series using the lagged level only:

ΔY_t = γY_{t−1} + ε_t

If γ = 0, this is equivalent to:

Y_t = Y_{t−1} + ε_t

Therefore, the time series is a random walk if γ = 0. This leads to the hypotheses H0: γ = 0 (unit root) versus H1: γ < 0 (stationary).
You should note that this is a one-sided test; thus, the null hypothesis is not rejected if γ̂ ≥ 0. Negative values of γ correspond to a stationary AR time series. For example, recall that the AR(1) model is given by:
Y t = β0 + β1Y t−1 + ϵ t
Subtracting Y_{t−1} from both sides gives:

ΔY_t = β0 + γY_{t−1} + ε_t, where γ = β1 − 1

Clearly, if β1 = 1, then γ = 0, so γ = 0 is the test of β1 = 1. In other words, if there is a unit root in the AR(1) model (with the dependent variable being the first difference of the series and the independent variable its first lag), then γ = 0, implying that the series has a unit root and is non-stationary.
Implementing an ADF test on a time series requires making two choices: which deterministic terms
to include and the number of lags of the differenced data to use. T he number of lags to include is
simple to determine: it should be large enough to absorb any short-run dynamics in the differenced data, ΔY_t.
An appropriate method of selecting the number of lagged differences is the AIC (which tends to select a somewhat larger model than the BIC), with the maximum lag length set according to the sample size and data frequency. The deterministic terms included can be none, a constant only, or a constant and a time trend.
While keeping all other things equal, the addition of more deterministic terms reduces the chance of
rejecting the null hypothesis when the time series does not have a unit root, and hence the power of
the ADF test is reduced. T herefore, relevant deterministic terms should be included.
A common practice is to start with the trend specification and retain only deterministic terms that are significant at the 10% level. If the deterministic trend term is not significant at 10%, it is dropped and the constant-only specification is used instead. If the constant is also insignificant, it too can be dropped and the test rerun without deterministic terms. It is important to note that the majority of macroeconomic time series require the use of the constant.
In the case that the null of the ADF test cannot be rejected, the series should be differenced and the test rerun to make sure that the differenced series is stationary. If this is repeated (double differencing) and the time series is still non-stationary, then other transformations of the data, such as taking the natural logarithm, should be considered.
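As a rough illustration (not part of the original reading), the ADF test can be run in Python using the statsmodels package; the sketch below assumes statsmodels and numpy are installed and uses simulated data rather than real GDP.

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(42)
y = np.cumsum(rng.normal(size=500))        # simulated random walk: Y_t = Y_{t-1} + e_t

# 'ct' includes a constant and a trend; the lag length is chosen by AIC.
stat, pvalue, usedlag, nobs, crit, icbest = adfuller(y, regression="ct", autolag="AIC")
print("ADF statistic:", round(stat, 3), "p-value:", round(pvalue, 3), "critical values:", crit)

# Differencing the series should remove the unit root.
stat_d, pvalue_d, *_ = adfuller(np.diff(y), regression="c", autolag="AIC")
print("ADF on first difference:", round(stat_d, 3), "p-value:", round(pvalue_d, 3))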
A financial analyst wishes to conduct an ADF test on the log of 20-year real GDP from 1999 to 2019.
The output of the ADF test reports the results for the different deterministic-term specifications (first column), and the last three columns indicate the number of lags selected by the AIC and the 5% and 1% critical values appropriate to the underlying sample size and deterministic terms. The quantities in parentheses (below the parameters) are the test statistics.
Solution
H0: γ = 0 (the time series has a unit root)
H1: γ < 0 (the time series is covariance stationary)
We begin with choosing the appropriate model. At 10%, the trend model has an absolute value of the test statistic greater than the critical values at the 1% and 5% significance levels; thus, we choose the model with the trend deterministic term.
Therefore, for this model, the null hypothesis is rejected at the 99% confidence level since |−4.376| > |−3.984|. Note that the null hypothesis is also rejected at the 95% confidence level. Moreover, if the model were constant-only or had no deterministic terms, the null hypothesis would fail to be rejected.
Seasonal differencing is an alternative method of modeling the seasonal time series with a unit root.
Seasonal differencing is done by subtracting the value in the same period in the previous year to
remove the deterministic seasonalities, the unit root, and the time trends.
Consider the following quarterly time series with deterministic seasonalities and a non-zero growth rate:
Y_t = β0 + β1 t + Σ_{j=1}^{3} γ_j D_jt + ε_t
Applying seasonal (fourth) differencing:
Δ4 Y_t = Y_t − Y_{t−4} = 4β1 + Σ_{j=1}^{3} γ_j (D_jt − D_j,t−4) + ε_t − ε_{t−4}
But
γ_j (D_jt − D_j,t−4) = 0
because D_jt = D_j,t−4 by the definition of the seasonal dummies. Therefore,
Δ4 Y_t = 4β1 + ε_t − ε_{t−4}
Intuitively, this is a moving-average model (with a single coefficient at the seasonal lag), which is covariance stationary. The seasonally differenced time series is described as the year-to-year change in Y_t, or the year-to-year growth in the case of a logged time series.
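The following is a minimal Python sketch of seasonal differencing, assuming pandas and numpy are available; the trend and seasonal coefficients used here are illustrative values, not taken from the reading.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(80)                                   # 20 years of quarterly data
season = np.tile([1.5, -0.5, 2.0, -3.0], 20)        # assumed gamma_j values for Q1-Q4
y = pd.Series(2.0 + 0.3 * t + season + rng.normal(scale=0.5, size=80))

# Seasonal differencing: subtract the value from the same quarter of the previous year.
dy4 = y.diff(4).dropna()                            # Delta_4 Y_t = Y_t - Y_{t-4}
print(round(dy4.mean(), 3))                         # close to 4 * beta_1 = 4 * 0.3 = 1.2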
Spurious Regression
Spurious regression is a regression that gives misleading statistical evidence of a linear relationship between independent non-stationary variables. This is a problem in time series analysis, but it can be avoided by making sure that each of the time series in question is stationary, using methods such as first differencing and the log transformation (in the case of a positive time series).
Practically, many financial and economic time series are plausibly persistent but stationary. Therefore, differencing is only required when there is clear evidence of a unit root in the time series. Moreover, when it is difficult to distinguish whether a time series is stationary or not, it is good practice to estimate models both in levels and in differences and compare them.
For example, suppose we wish to model the interest rate on government bonds using an AR(3) model. The AR(3) is estimated on the levels, and the differences (if we assume the existence of a unit root) are modeled by an AR(2), since the AR order is reduced by one due to differencing. Considering models in both levels and differences allows us to choose the best model when the time series is highly persistent.
Forecasting
Forecasting in non-stationary time series is analogous to that in stationary time series. That is, the forecast made at time T is the expected value of Y_{T+h} given the information available at time T. For example, consider the linear trend model:
Y_t = β0 + β1 t + ε_t
Intuitively,
E_T(Y_{T+h}) = β0 + β1(T + h)
This is true because both β0 and β1(T + h) are constants, while ε_{T+h} ∼ WN(0, σ²) has an expected value of zero.
Recall that a seasonal time series can be modeled using dummy variables. Consequently, we need to track the period of the forecast we desire. The seasonal time series is given by:
Y_t = β0 + Σ_{j=1}^{s−1} γ_j D_jt + ε_t
so that the one-step-ahead forecast is:
E_T(Y_{T+1}) = β0 + γ_j
Where:
j = (T + 1) mod s is the period being forecast, and the coefficient on the omitted period is 0.
For instance, for a quarterly seasonal time series that excludes the dummy variable for the fourth quarter (Q4), the one-step-ahead forecast made at T = 116 is given by:
E_T(Y_{T+1}) = β0 + γ_{(116+1) mod 4} = β0 + γ_1
Therefore, the h-step-ahead forecast is obtained by tracking the period of T + h, so that:
E_T(Y_{T+h}) = β0 + γ_j
Where:
j = (T + h) mod s
When the time series is modeled in logs, say ln Y_t = β0 + β1 t + ε_t with ε_t iid ∼ N(0, σ²), the forecast must account for the log-normal distribution. Recall that if X ∼ N(μ, σ²) and we define W = e^X, then W is log-normally distributed with mean:
E(W) = e^{μ + σ²/2}
Since ln Y_{T+h} ∼ N(β0 + β1(T + h), σ²), it follows that:
E_T(Y_{T+h}) = e^{β0 + β1(T + h) + σ²/2}
Confidence intervals are constructed to reflect the uncertainty of the forecasted value. The confidence interval depends on the variance of the forecast error, which is defined as:
e_{T+h} = Y_{T+h} − E_T(Y_{T+h})
i.e., it is the difference between the actual value and the forecasted value. Clearly, for the linear trend model,
E_T(Y_{T+h}) = β0 + β1(T + h)
If we wish to construct a 95% confidence interval, given that the forecast error is Gaussian white noise, it is:
E_T(Y_{T+h}) ± 1.96σ
σ is not known and is therefore estimated by the standard deviation of the forecast error.
Intuitively, the confidence intervals for any model can be computed based on that model's forecast error distribution.
A linear time trend model is estimated on annual government bond interest rates from the year 2000 to 2020 as:
R_t = 0.25 + 0.000154t + ε̂_t
The standard deviation of the forecast error is estimated to be σ̂ = 0.0245. What is the 95% confidence interval for the second year ahead if the forecast errors (residuals) are Gaussian white noise?
(Note that for the first time period t=2000 and the last time period is t=2020)
Solution
E_T(Y_{T+h}) ± 1.96σ̂
= 0.28083 ± 1.96 × 0.0245
= [0.23281, 0.32885]
So the 95% confidence interval for the interest rate is between 0.2328 and 0.3289.
Question 1
Y_t = β0 + Σ_{j=1}^{s−1} γ_j D_jt + ε_t
T he estimated parameters are γ^1 = 6.25, γ^2 = 50.52, γ^3 = 10.25 and β^ 0 = −10.42 using
the data up to the end of 2019. What is the forecasted value of the growth rate for the second quarter (Q2) of 2020?
A. 40.10
B. 34.56
C. 43.56
D. 36.90
The correct answer is A.
D_jt = 1 for Q2 and 0 for Q1, Q3, and Q4.
So,
Ê(Y_Q2) = β0 + Σ_{j=1}^{3} γ_j D_jt = −10.42 + 0 × 6.25 + 1 × 50.52 + 0 × 10.25 = 40.1
Question 2
A mortgage analyst produced a model to predict housing starts (given in thousands) within
California in the US. The time series model contains both a trend component and a seasonal component. The trend component is reflected in the variable time (t), where t is the month, and the seasons are defined as follows:
T he model started in April 2019; for example, y (T+1) refers to May 2019.
The seasonal component is reflected by the intercept (15.5) plus the three seasonal dummy variables (D2, D3, and D4).
y T+11 = 0.20 × 11 + 15.5 + 4.0 × 1 = 21.7
Reading 23: Measuring Return, Volatility, and Correlation
After completing this reading, you should be able to:
Calculate, distinguish, and convert between simple and continuously compounded returns.
Define and distinguish between volatility, variance rate, and implied volatility.
Describe how the first two moments may be insufficient to describe non-normal
distributions.
Explain how the Jarque-Bera test is used to determine whether returns are normally
distributed.
Describe the power law and its use for non-normal distributions.
Define correlation and covariance and differentiate between correlation and dependence.
Describe the properties of correlations between normally distributed variables when using a one-factor model.
Measurement of Returns
A return is the profit from an investment. Two common methods are used to measure returns: simple returns and continuously compounded returns. The simple return is given by:
R_t = (P_t − P_{t−1}) / P_{t−1}
Where:
P_t = price of the asset at time t (the current time)
P_{t−1} = price of the asset at time t − 1
The time scale is arbitrary; it may be a year or a shorter period such as a month or a quarter. Under the simple returns method, one plus the return over multiple periods is the product of one plus the simple return in each period:
1 + R_T = ∏_{t=1}^{T} (1 + R_t)
⇒ R_T = (∏_{t=1}^{T} (1 + R_t)) − 1
T ime Price
0 100
1 98.65
2 98.50
3 97.50
4 95.67
5 96.54
Calculate the simple return based on the data for all periods.
Solution
We need to calculate the simple return over multiple periods, which is given by:
1 + R_T = ∏_{t=1}^{T} (1 + R_t)
T ime Price Rt 1 + Rt
0 100 − −
1 98.65 −0.0135 0.9865
2 98.50 −0.00152 0.998479
3 97.50 −0.01015 0.989848
4 95.67 −0.01877 0.981231
5 96.54 0.009094 1.009094
Product 0.9654
Note that
R_t = (P_t − P_{t−1}) / P_{t−1}
so that
R_1 = (P_1 − P_0)/P_0 = (98.65 − 100)/100 = −0.0135
And
R_2 = (P_2 − P_1)/P_1 = (98.50 − 98.65)/98.65 = −0.00152
And so on.
∏_{t=1}^{5} (1 + R_t) = 0.9865 × 0.998479 × … × 1.009094 = 0.9654
So,
R_T = 0.9654 − 1 = −0.0346 = −3.46%
Continuously compounded returns, denoted by r_t, are defined as the difference between the natural logarithms of the asset price at time t and at time t − 1. That is:
r_t = ln P_t − ln P_{t−1}
Computing the compounded return over multiple periods is easy because it is just the sum of the single-period compounded returns:
r_T = Σ_{t=1}^{T} r_t
T ime Price
0 100
1 98.65
2 98.50
3 97.50
4 95.67
5 96.54
What is the continuously compounded return based on the data over all periods?
Solution
r_T = Σ_{t=1}^{T} r_t
Where
rt = ln P t − ln P t−1
T ime Price rt = ln P t − ln P t−1
0 100 −
1 98.65 −0.01359
2 98.50 −0.00152
3 97.50 −0.0102
4 95.67 −0.01895
5 96.54 0.009053
Sum −0.03521
Note that r_1 = ln 98.65 − ln 100 = −0.01359, and so on.
Also,
r_T = Σ_{t=1}^{5} r_t = −0.01359 + (−0.00152) + ⋯ + 0.009053 = −0.03521 = −3.521%
The simple return is approximately equal to the compounded return when returns are small; this approximation, however, is prone to significant error over longer time horizons, and thus compounded returns are preferred when returns are aggregated over many periods. The relationship between the compounded return and the simple return is given by the formula:
1 + R_t = e^{r_t}
What is the equivalent simple return for a 30% continuously compounded return?
Solution.
1 + R_t = e^{r_t}
⇒ R_t = e^{r_t} − 1 = e^{0.3} − 1 = 0.3499 = 34.99%
It is worth noting that compound returns are always less than the simple return. Moreover, simple
returns are never less than -100%, unlike compound returns, which can be less than -100%. For
instance, the equivalent compound return for -65% simple return is:
rt = ln (1 − 0.65) = −104.98%
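A minimal Python sketch of these return calculations (not part of the original reading) is given below; it only assumes numpy and uses the price series from the earlier example.

import numpy as np

prices = np.array([100, 98.65, 98.50, 97.50, 95.67, 96.54])

simple = prices[1:] / prices[:-1] - 1            # R_t = (P_t - P_{t-1}) / P_{t-1}
log_ret = np.diff(np.log(prices))                # r_t = ln P_t - ln P_{t-1}

total_simple = np.prod(1 + simple) - 1           # multi-period simple return
total_log = log_ret.sum()                        # multi-period compounded return

print(round(total_simple, 4), round(total_log, 4))   # about -0.0346 and -0.0352
print(round(np.exp(0.30) - 1, 4))                    # 30% compounded return is a 34.99% simple return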
The standard deviation of returns measures the volatility of the return over the time period at which it is captured. Consider the linear scaling of the mean and variance over the period at which the returns are measured:
r_t = μ + σe_t
where E(r_t) = μ is the mean of the return and V(r_t) = σ² is the variance of the return. e_t is the shock, which is assumed to be iid with mean 0 and variance 1. Moreover, the return is assumed to be iid and normally distributed with mean μ and variance σ², i.e., r_t ∼ iid N(μ, σ²).
Assume that we wish to calculate the return under this model over 10 working days (two weeks):
Σ_{i=1}^{10} r_{t+i} = Σ_{i=1}^{10} (μ + σe_{t+i}) = 10μ + σ Σ_{i=1}^{10} e_{t+i}
So the mean of the return over the 10 days is 10μ and the variance is 10σ², since e_t is iid; the volatility over the 10 days is therefore √10 σ.
T herefore, the variance and the mean of return are scaled to the holding period while the volatility is
scaled to the square root of the holding period. T his feature allows us to convert volatility between
different periods.
For instance, given daily volatility, we obtain yearly (annualized) volatility by scaling by √252:
σ_annual = √252 × σ_daily
Note that 252 is the conventional number of trading days in a year in most markets.
The monthly volatility of the price of gold is 4% in a given year. What is the annualized volatility of the price of gold?
Solution
Using the scaling analogy, the corresponding annualized volatility is given by:
σ_annual = √12 × σ_monthly = √12 × 4% ≈ 13.86%
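A short Python sketch of the scaling rules (illustrative only, assuming numpy) is given below; the 252-day convention is the one stated in the reading.

import numpy as np

monthly_vol = 0.04
annual_vol = np.sqrt(12) * monthly_vol      # volatility scales with the square root of time
annual_var = 12 * monthly_vol ** 2          # the variance rate scales linearly with time
daily_to_annual_factor = np.sqrt(252)       # conventional factor for daily volatility
print(round(annual_vol, 4), round(annual_var, 6))   # 0.1386 and 0.0192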
Variance Rate
The variance rate, also termed the variance, is the square of the volatility. Similar to the mean, the variance rate is linear in the holding period and hence can be converted between periods. For instance, the annual variance is 12 times the monthly variance:
σ²_annual = 12 × σ²_monthly
The variance can be estimated from a sample of T returns as:
σ̂² = (1/T) Σ_{t=1}^{T} (r_t − μ̂)²
Where μ̂ is the sample mean of the returns and T is the sample size.
The investment returns of a certain entity for five consecutive days are 6%, 5%, 8%, 10%, and 11%. Calculate the mean return and the variance of the returns.
Solution
μ̂ = (1/5)(0.06 + 0.05 + 0.08 + 0.10 + 0.11) = 0.08
σ̂² = (1/T) Σ_{t=1}^{T} (r_t − μ̂)²
= (1/5)[(0.06 − 0.08)² + (0.05 − 0.08)² + (0.08 − 0.08)² + (0.10 − 0.08)² + (0.11 − 0.08)²] = 0.00052 (i.e., 0.052%)
Implied volatility is an alternative measure of volatility that is constructed using options valuation.
T he options (both put and call) have payouts that are nonlinear functions of the price of the
underlying asset. For instance, the payout from a put option is given by:
max(K − P_T, 0)
where P_T is the price of the underlying asset at maturity, K is the strike price, and T is the maturity. Therefore, the payout from an option is sensitive to the variance of the return on the asset.
T he Black-Scholes-Merton model is commonly used for option pricing valuation. T he model relates
the price of an option to the risk-free rate of interest, the current price of the underlying asset, the
For instance, the price of a call option can be denoted by:
C_t = f(r_f, T, P_t, σ²)
Where:
r_f = the risk-free rate of interest
T = time to maturity
P_t = the current price of the underlying asset
σ² = the variance of the return on the underlying asset
The implied volatility σ relates the price of the option to the other three parameters: it is the value of σ that, given those parameters, makes the model price equal to the observed market price of the option.
T he volatility index (VIX) measures the volatility in the S&P 500 over the coming 30 calendar days.
VIX is constructed from a variety of options with different strike prices. Similar volatility indices are produced for a variety of other assets, such as gold; however, the approach is only applicable to highly liquid derivatives markets and is thus not available for most assets.
Financial returns are often assumed to follow a normal distribution. A normal distribution is thin-tailed and has neither skewness nor excess kurtosis. The assumption of normality is often not valid in practice because many return series are both skewed and heavy-tailed.
To determine whether it is appropriate to assume that asset returns are normally distributed, we use the Jarque-Bera (JB) test. The Jarque-Bera test checks whether the skewness and kurtosis of the returns are compatible with those of a normal distribution.
Denoting the skewness by S and the kurtosis by k, the hypotheses of the Jarque-Bera test are stated as:
H0: S = 0 and k = 3 (the returns are normally distributed)
vs.
H1: S ≠ 0 or k ≠ 3 (the returns are not normally distributed)
The test statistic is:
JB = (T − 1)(Ŝ²/6 + (k̂ − 3)²/24)
The basis of the test is that, under the normal distribution, the estimated skewness is asymptotically normally distributed with a variance of 6, so that Ŝ²/6 is a chi-squared variable with one degree of freedom (χ²₁); the estimated kurtosis is asymptotically normally distributed with a mean of 3 and a variance of 24, so that (k̂ − 3)²/24 is also a χ²₁ variable. Combining these arguments, and given that the two components are independent:
JB ∼ χ²₂
When the test statistic is greater than the critical value, the null hypothesis is rejected; otherwise, we fail to reject the null hypothesis. We use the χ² table with the appropriate degrees of freedom:
d.f. .995 .99 .975 .95 .9 .1 .05 .025 .01
1 0.00 0.00 0.00 0.00 0.02 2.71 3.84 5.02 6.63
2 0.01 0.02 0.05 0.10 0.21 4.61 5.99 7.38 9.21
3 0.07 0.11 0.22 0.35 0.58 6.25 7.81 9.35 11.34
4 0.21 0.30 0.48 0.71 1.06 7.78 9.49 11.14 13.28
5 0.41 0.55 0.83 1.15 1.61 9.24 11.07 12.83 15.09
6 0.68 0.87 1.24 1.64 2.20 10.64 12.59 14.45 16.81
7 0.99 1.24 1.69 2.17 2.83 12.02 14.07 16.01 18.48
8 1.34 1.65 2.18 2.73 3.49 13.36 15.51 17.53 20.09
9 1.73 2.09 2.70 3.33 4.17 14.68 16.92 19.02 21.67
10 2.16 2.56 3.25 3.94 4.87 15.99 18.31 20.48 23.21
11 2.60 3.05 3.82 4.57 5.58 17.28 19.68 21.92 24.72
12 3.07 3.57 4.40 5.23 6.30 18.55 21.03 23.34 26.22
For example, the critical value of a χ²₂ variable at the 5% significance level is 5.991; thus, if the computed JB statistic exceeds 5.991, normality is rejected at the 95% confidence level.
Investment return is such that it has a skewness of 0.75 and a kurtosis of 3.15. If the sample size is
125, what is the JB test statistic? Does the data qualify to be normally distributed at a 95%
confidence level?
Solution
JB = (T − 1)(Ŝ²/6 + (k̂ − 3)²/24) = (125 − 1)(0.75²/6 + (3.15 − 3)²/24) = 11.74
Since the test statistic is greater than the 5% critical value (5.991), the null hypothesis that the returns are normally distributed is rejected; the data does not qualify as normally distributed at the 95% confidence level.
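The JB calculation above is easy to verify in Python. The sketch below (not from the reading) assumes scipy is available and uses the reading's (T − 1) scaling rather than scipy's built-in jarque_bera function, which works from raw data.

import numpy as np
from scipy import stats

def jarque_bera_stat(T, skew, kurt):
    # JB statistic with the (T - 1) scaling used in the reading.
    return (T - 1) * (skew**2 / 6 + (kurt - 3)**2 / 24)

jb = jarque_bera_stat(125, 0.75, 3.15)
crit = stats.chi2.ppf(0.95, df=2)                    # 5% critical value of chi-squared(2) = 5.991
print(round(jb, 2), round(crit, 3), jb > crit)       # 11.74, 5.991, True -> reject normality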
T he power law is an alternative method of determining whether the returns are normal or not by
studying the tails. For a normal distribution, the tails are thin, such that the probability of any return greater than kσ decreases sharply as k increases. Other distributions have tails that decay more slowly and therefore produce extreme outcomes more often.
T he power law tails are such that, the probability of observing a value greater than a given value x is
defined as:
P(X > x) = kx −α
The tail behavior of distributions is effectively compared by taking natural logs: ln P(X > x) = ln k − α ln x. To test whether the power law holds, a graph of ln P(X > x) is plotted against ln x; an approximately straight line (with slope −α) indicates power-law tails.
For a normal distribution, the plot is quadratic in x and hence decays quickly, meaning the tails are thin. For other distributions, such as the Student's t distribution, the plot is approximately linear in ln x, so the tails decay at a slow rate; such distributions have fatter tails (they produce values far from the mean more often).
Dependence and Correlation of Random Variables
Two random variables X and Y are said to be independent if their joint density function is equal to the product of their marginal densities, i.e., f_{X,Y}(x, y) = f_X(x)f_Y(y). Otherwise, the random variables are said to be dependent. The dependence of random variables can be linear or nonlinear.
T he linear relationship of the random variables is measured using the correlation estimator called
Pearson’s correlation.
Consider the regression:
Y_i = α + βX_i + ε_i
The slope β is related to the correlation coefficient ρ. That is, if β = 0, then the random variables X and Y are uncorrelated; otherwise, β ≠ 0. In fact, if the variances of the random variables are engineered such that they are both equal to unity (σ²_X = σ²_Y = 1), the slope of the regression equation is equal to the correlation coefficient (β = ρ). Thus, the regression slope reflects how the variables move together linearly.
Nonlinear dependence is complex and thus cannot be summarized using a single statistic.
Measures of Correlation
Correlation is also measured using the rank correlation (Spearman's rank correlation) and Kendall's τ correlation coefficient. The values of these correlation coefficients lie between −1 and 1. When the value of the coefficient is 0, the variables exhibit no rank association (which does not necessarily imply independence); values of +1 and −1 indicate perfect positive and negative association, respectively.
Rank Correlation
T he rank correlation uses the ranks of observations of random variables X and Y. T hat is, rank
correlation depends on the linear relationship between the ranks rather than the random variables
themselves.
The ranks are such that 1 is assigned to the smallest value, 2 to the next value, and so on until the largest value is assigned rank n.
When a rank repeats itself (a tie), the tied observations are assigned the average of the ranks they would otherwise have occupied. Consider the ranks 1, 2, 3, 3, 3, 4, 5, 6, 7, 7: the value labeled 3 is repeated three times and the value labeled 7 is repeated two times. The three tied observations would have occupied ranks 3, 4, and 5, so each is assigned the average (3 + 4 + 5)/3 = 4; the two tied observations at the top would have occupied ranks 9 and 10, so each is assigned (9 + 10)/2 = 9.5. The adjusted ranks are therefore: 1, 2, 4, 4, 4, 6, 7, 8, 9.5, 9.5.
Now, denote the rank of X by R_X and that of Y by R_Y; the rank correlation estimator is given by:
ρ̂_s = Cov(R_X, R_Y) / (√V(R_X) × √V(R_Y))
where the covariance and variances are estimated from the ranks. Alternatively, when all the ranks are distinct (no repeated ranks), the rank correlation can be estimated as:
ρ̂_s = 1 − 6 Σ_{i=1}^{n} (R_Xi − R_Yi)² / (n(n² − 1))
The intuition of the last formula is that when highly ranked values of X are paired with highly ranked values of Y, the differences R_Xi − R_Yi are small and the correlation tends to 1. On the other hand, if small rank values of X are matched with large rank values of Y, the differences R_Xi − R_Yi are large, and the correlation tends to −1.
When the variables X and Y have a linear relationship, the linear and rank correlations have equal values. However, rank correlation is inefficient compared to linear correlation and is mainly used as a confirmatory check. On the other hand, rank correlation is insensitive to outliers because it depends only on the ranks, not on the magnitudes, of the observations.
Example: Calculate the rank (Spearman's) correlation for the following data:
i X Y
1 0.35 2.50
2 1.73 6.65
3 −0.45 −2.43
4 −0.56 −5.04
5 4.03 3.20
6 3.21 2.31
Solution
Consider the following table where the ranks of each variable have been filled and the square of their
difference in ranks.
i X Y RX RY (R X − R Y )2
1 0.35 2.50 4 3 1
2 1.73 6.65 3 1 4
3 −0.45 −2.43 5 5 0
4 −0.56 −5.04 6 6 0
5 4.03 3.20 1 2 1
6 3.21 2.31 2 4 4
Sum 10
Since there are no repeated ranks, the rank correlation is given by:
ρ̂_s = 1 − 6 Σ_{i=1}^{n} (R_Xi − R_Yi)² / (n(n² − 1)) = 1 − (6 × 10) / (6(6² − 1)) = 1 − 0.2857 = 0.7143
Kendall's τ is a non-parametric measure of the relationship between two random variables, say X and Y, based on concordant and discordant pairs. Consider the pairs of observations (X_i, Y_i) and (X_j, Y_j). These pairs are said to be concordant for all i ≠ j if the ranks of the components agree, that is, X_i > X_j when Y_i > Y_j, or X_i < X_j when Y_i < Y_j. That is, they
are concordant if they agree on the same directional position (consistent). When the pairs disagree,
they are termed as discordant. Note that ties are neither concordant nor discordant.
Intuitively, random variables with a high number of concordant pairs have a strong positive
correlation, while those with a high number of discordant pairs are negatively correlated.
τ̂ = (n_c − n_d) / (n(n − 1)/2) = n_c/(n_c + n_d + n_t) − n_d/(n_c + n_d + n_t)
Where:
n_c = number of concordant pairs
n_d = number of discordant pairs
n_t = number of ties
It is easy to see that Kendall's τ is the difference between the proportions of concordant and discordant pairs. Moreover, when all the pairs are concordant, τ̂ = 1, and when all the pairs are discordant, τ̂ = −1.
Example: Calculate Kendall's τ for the following data:
i X Y
1 0.35 2.50
2 1.73 6.65
3 −0.45 −2.43
4 −0.56 −5.04
5 4.03 3.20
6 3.21 2.31
Solution
T he first step is to rank each data:
i X Y RX RY
1 0.35 2.50 3 4
2 1.73 6.65 4 6
3 −0.45 −2.43 2 2
4 −0.56 −5.04 1 1
5 4.03 3.20 6 5
6 3.21 2.31 5 3
Next, arrange the ranks in order of R_X. For each row, the concordant count (C) is the number of Y ranks below it that are greater than that row's Y rank, and the discordant count (D) is the number of Y ranks below it that are smaller.
RX RY C D
1 1 5 0
2 2 4 0
3 4 2 1
4 6 1 1
5 3 1 0
6 5 − −
Total 13 2
Note that, for the second row, C = 4 is the number of Y ranks below it that are greater than 2 (namely 4, 6, 3, and 5), and D = 0 is the number of Y ranks below it that are less than 2. This is continued up to the second-last row, since there are no rows below the last one.
So, n_c = 13 and n_d = 2.
⇒ τ̂ = (n_c − n_d) / (n(n − 1)/2) = (13 − 2) / (6 × 5 / 2) = 11/15 = 0.7333
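Both rank-based measures for this data set can be checked in Python; the sketch below (not from the reading) assumes scipy is installed and uses its spearmanr and kendalltau functions.

import numpy as np
from scipy import stats

x = np.array([0.35, 1.73, -0.45, -0.56, 4.03, 3.21])
y = np.array([2.50, 6.65, -2.43, -5.04, 3.20, 2.31])

rho_s, _ = stats.spearmanr(x, y)        # Spearman's rank correlation
tau, _ = stats.kendalltau(x, y)         # Kendall's tau

print(round(rho_s, 4), round(tau, 4))   # approximately 0.7143 and 0.7333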
Practice Question
Suppose that we know from experience that α = 3 for a particular financial variable, and
A. 125%
B. 0.5%
C. 4%
D. 0.1%
T he correct answer is B.
From the given probability, we can get the value of constant k as follows:
T hus,
Reading 24: Simulation and Bootstrapping
After completing this reading, you should be able to:
Explain the use of antithetic and control variates in reducing Monte Carlo sampling error.
Describe the bootstrapping method and its advantage over the Monte Carlo simulation.
Simulation is a way of modeling random events to match real-world outcomes. By observing simulated results, researchers gain insight into real problems. Examples of applications of simulation include the calculation of option payoffs and the determination of an estimator's accuracy. Two of the main simulation methods are Monte Carlo simulation and bootstrapping.
Monte Carlo Simulation approximates the expected value of a random variable using the numerical
methods. T he Monte Carlo generates the random variables from an assumed data generating process
(DGP), and then it applies a function(s) to create realizations from the unknown distribution of the
transformed random variables. This process is repeated (to improve the accuracy), and the statistic of interest is then computed from the simulated values.
Bootstrapping is a type of simulation that uses the observed data to simulate from the unknown distribution that generated those data. In other words, bootstrapping combines the observed data and resampled draws to create a new sample that is related
T he notable similarity between Monte Carlo and bootstrapping is that both aim at calculating the
expected value of the function by using simulated data (often by use of a computer).
Also, the contrasting feature of these methods is that in Monte Carlo simulation, an assumed data generating process (DGP) is used to simulate the data, whereas in bootstrapping, the observed data are used directly (by resampling) rather than an assumed model.
T he simulation requires the generation of random variables from an assumed distribution, mostly
using a computer. However, computer-generated numbers are not truly random and are thus termed pseudo-random numbers. Pseudo-random numbers are produced by complex deterministic functions (pseudo-random number generators, PRNGs) whose output appears random. The initial value supplied to a PRNG is termed the seed value; using the same seed value always reproduces the same sequence of random variables.
T he ability of the simulated variables from PRNGs to replicate makes it possible to use pseudo
numbers across multiple experiments because the same sequence of random variables can be
generated using the same seed value. T herefore, we can use this feature to choose the best model or
reproduce the same results in the future in case of regulatory requirements. Moreover, the use of a fixed seed makes simulation results fully reproducible.
Simulating random variables from a specific distribution is initiated by first generating a random
number from a uniform distribution (0,1). After that, the cumulative distribution of the distribution
we are trying to simulate is used to get the random values from that distribution. T hat is, we first
generate a random number U from U(0,1) distribution, then, we use the generated random number to
simulate a random variable X with the pdf f(x) by using the CDF, F(x).
Let U be the probability that X takes a value less than or equal to x, that is:
U = P(X ≤ x) = F(x)
Then X can be recovered by inverting the CDF:
x = F^{−1}(u)
To put this in a more straightforward perspective, the algorithm for simulating a random variable from a distribution with CDF F(x) is:
1. Generate u from the U(0,1) distribution.
2. Compute x = F^{−1}(u).
The random variable X generated in this way has CDF F(x).
Assume that we want to simulate three random variables from an exponential distribution with a
parameter λ = 0.2 using the value 0.112, 0.508, and 0.005 from U(0,1).
Solution
This question assumes that the uniform random numbers have already been generated. The inverse of the CDF of the exponential distribution is:
F^{−1}(u) = −(1/λ) ln(1 − u) = −(1/0.2) ln(1 − u) = −5 ln(1 − u)
So:
x_1 = −5 ln(1 − u_1) = −5 ln(1 − 0.112) = 0.5939
x_2 = −5 ln(1 − u_2) = −5 ln(1 − 0.508) = 3.5464
x_3 = −5 ln(1 − u_3) = −5 ln(1 − 0.005) = 0.0251
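The inverse-transform step above translates directly into Python; the short sketch below (not part of the reading) assumes numpy and reuses the three uniform values from the example.

import numpy as np

lam = 0.2
u = np.array([0.112, 0.508, 0.005])           # uniforms assumed already generated
x = -np.log(1 - u) / lam                      # inverse CDF of the exponential distribution
print(np.round(x, 4))                         # approximately [0.5939, 3.5464, 0.0251]

# The same idea with freshly generated uniforms:
rng = np.random.default_rng(7)
draws = -np.log(1 - rng.random(3)) / lam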
Monte Carlo simulation is used to estimate population moments or functions of random variables. The procedure is as follows: assume that X is a random variable that can be simulated, and let g(X) be a function that can be evaluated at the realizations of X. The simulation generates draws of X and evaluates g(X) at each draw. This process is then repeated b times so that a set of iid variables is generated from the unknown
distribution g(X), which can then be used to estimate the desired statistic.
For instance, if we wish to estimate the mean of the generated random variables, then the mean is
given by:
Ê(g(X)) = (1/b) Σ_{i=1}^{b} g(X_i)
T his is true because the generated variables are iid, and then the process is repeated b times.
lim_{b→∞} Ê(g(X)) = E(g(X))
Also, the Central Limit T heorem applies to the estimated mean so that:
Var[Ê(g(X))] = σ²_g / b
T he second moment, which is the variance (standard variance estimator) is estimated as:
σ̂²_g = (1/b) Σ_{i=1}^{b} (g(X_i) − Ê[g(X)])²
From CLT, the standard error of the simulated expectation is given by:
s.e.(Ê(g(X))) = √(σ²_g / b) = σ_g / √b
The standard error of the simulated expectation measures the accuracy of the estimation; the larger the number of replications b, the smaller the standard error.
Another quantity that can be calculated from the simulation is the α -quantile by arranging the b
draws in ascending order then selecting the value bα of the sorted set.
Moreover, using the simulation, we determine the finite sample properties of the estimated
parameters. Assume that the sample size n is large enough so that approximation by CLT is adequate.
Now, consider the finite-sample distribution of a parameter estimator θ̂. Using the assumed DGP, n random variables are simulated to form a sample:
X = [x_1, x_2, …, x_n]
We then simulate a new data set and estimate the parameter, repeating this b times to obtain (θ̂_1, θ̂_2, …, θ̂_b) from the finite-sample distribution of the estimator of θ. From these values, we can infer the properties of the estimator θ̂. For instance, the bias, defined as:
Bias(θ) = E(θ^) − θ
can be estimated by:
Bias-hat(θ̂) = (1/b) Σ_{i=1}^{b} (θ̂_i − θ)
Having covered the basics of the Monte Carlo simulation, its basic algorithm is as follows:
i. Simulate a draw x_i from the assumed DGP.
ii. Compute g_i = g(x_i).
iii. Repeat steps (i) and (ii) b times.
iv. From the replications {g_1, g_2, …, g_b}, calculate the statistic of interest.
v. Determine the accuracy of the estimated quantity by calculating the standard error. If the standard error is large, increase the number of replications b to obtain the smallest error possible.
Example: Using Monte Carlo Simulation to Estimate the Price of a Call Option
Recall that the payoff from a call option at maturity is:
max(0, S_T − K)
ST is the price of the underlying stock at the time of maturity T, and K is the strike price. T he price
of the call option is a non-linear function of the underlying stock price at the expiration date, and its expected value can therefore be approximated by simulation.
Assuming that the log of the stock price is normally distributed, the log price at maturity can be modeled as the sum of the initial log price, a drift term, and a normally distributed shock. Mathematically:
s_T = s_0 + (r_f − σ²/2)T + σ√T x_i
Where:
s_0 = the log of the initial stock price
r_f = the risk-free rate
T = the time to maturity
x_i = a standard normal random variable
σ² = variance of the stock return
From the formula above, to simulate the price of the underlying stock requires the estimation of the
stock volatility.
Using the simulated price of the stock, the price of the option can be calculated as:
c = e^{−r_f T} max(S_T − K, 0)
And thus the mean of the price of the call option can be estimated as:
Ê(c) = c̄ = (1/b) Σ_{i=1}^{b} c_i
Where c_i is the simulated payoff of the call option. Note that, using the equation s_T = s_0 + (r_f − σ²/2)T + σ√T x_i, the simulated stock prices can be expressed as:
S_Ti = e^{s_0 + (r_f − σ²/2)T + σ√T x_i}
And thus:
g(x_i) = c_i = e^{−r_f T} max(e^{s_0 + (r_f − σ²/2)T + σ√T x_i} − K, 0)
The standard error of the estimated mean is:
s.e.(Ê(c)) = √(σ̂²_g / b) = σ̂_g / √b
Where:
σ̂²_g = (1/b) Σ_i (c_i − c̄)²
Given the standard error, we can calculate confidence intervals for the estimated mean of the call option price. For instance, the 95% confidence interval is given by:
c̄ ± 1.96 × s.e.(Ê(c))
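A minimal Python sketch of this Monte Carlo pricing exercise is given below (not part of the reading); the input values S0, K, rf, sigma, T, and the number of replications are illustrative assumptions, and only numpy is required.

import numpy as np

S0, K, rf, sigma, T, b = 100.0, 105.0, 0.03, 0.25, 1.0, 100_000   # assumed inputs

rng = np.random.default_rng(0)
x = rng.standard_normal(b)

# Simulated terminal prices under the log-normal model.
ST = S0 * np.exp((rf - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * x)

# Discounted payoffs, their mean, standard error, and a 95% confidence interval.
c = np.exp(-rf * T) * np.maximum(ST - K, 0.0)
c_bar = c.mean()
se = c.std() / np.sqrt(b)
print(round(c_bar, 4), (round(c_bar - 1.96 * se, 4), round(c_bar + 1.96 * se, 4)))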
Two techniques are commonly used to reduce the Monte Carlo sampling error: (1) antithetic variables and (2) control variates. To set the mood, recall that the estimation of expected values in simulation relies on the Law of Large Numbers (LLN) and that the standard error of the estimated expected value is proportional to 1/√b. Therefore, the accuracy of the simulation depends on the variance of the simulated quantities.
Antithetic Variables
Recall that the variance of the average of two random variables is Var[(X_1 + X_2)/2] = [Var(X_1) + Var(X_2) + 2Cov(X_1, X_2)]/4. If the covariance between the variables is negative (i.e., they are negatively correlated), this variance is smaller than it would be for independent draws.
T he antithetic variables use the last result. T he antithetic variables reduce the sampling error by
incorporating the second set of variables that are generated in such a way that they are negatively
correlated with the initial iid simulated variables. T hat is, each simulated variable is paired with an
antithetic variable so that they occur in pairs and are negatively correlated.
F−1(U 1 ) ∼ Fx
U2 = 1 − U1
F−1(U 2 ) ∼ Fx
Then, by the definition of antithetic variables, the correlation between U_1 and U_2 is negative, as is the correlation between F^{−1}(U_1) and F^{−1}(U_2). Using the antithetic random variables is analogous to the typical Monte Carlo simulation, except that the simulated values occur in negatively correlated pairs. Note that the number of independent simulations is b/2, since the simulated values are in pairs. The antithetic variables reduce the sampling error only if the function g(X) is monotonic in x, so that g(F^{−1}(U_1)) and g(F^{−1}(U_2)) are also negatively correlated.
Notably, the antithetic random variables reduce the sampling error through the correlation coefficient. Using b iid simulated values, the sampling error is:
σ_g / √b
By introducing the antithetic random variables, the standard error becomes:
σ_g √(1 + ρ) / √b
Clearly, the standard error decreases when the correlation coefficient, ρ < 0.
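The following Python sketch (not from the reading) illustrates antithetic pairs for a simple monotonic function g(u) = e^u, whose true mean is e − 1; the function and sample sizes are illustrative assumptions, and only numpy is used.

import numpy as np

rng = np.random.default_rng(0)
b = 100_000
u1 = rng.random(b // 2)
u2 = 1.0 - u1                                   # antithetic pair: negatively correlated uniforms

g = np.exp(np.concatenate([u1, u2]))            # all simulated values of g
pair_means = 0.5 * (np.exp(u1) + np.exp(u2))    # average within each antithetic pair

print(round(g.mean(), 5))                       # close to e - 1 = 1.71828
print(round(pair_means.std() / np.sqrt(b // 2), 6))        # antithetic standard error
print(round(np.exp(rng.random(b)).std() / np.sqrt(b), 6))  # iid standard error (larger)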
Control Variates
Control variates reduce the sampling error by incorporating values that have a mean of zero and are correlated with the simulated quantity of interest. The control variate has a mean of zero so that it does not bias the approximation. Given that the control variate and the desired function are correlated, an effective combination (with optimal weights) of the control variate and the initial simulated values can be constructed to reduce the sampling error.
Recall that the standard Monte Carlo estimator is:
Ê[g(X)] = (1/b) Σ_{i=1}^{b} g(x_i)
and each simulated value can be written as:
g(x_i) = E[g(X)] + η_i
where η_i is the simulation error.
Denote the control variate by h(X_i), so that by definition E[h(X_i)] = 0 and it is correlated with η_i. An ideal control variate should be inexpensive to construct and highly correlated with g(X), so that the optimal combination parameters can be estimated from the regression:
g(x_i) = β_0 + β_1 h(X_i) + ε_i
where the estimate of β_0 is the control-variate estimate of E[g(X)].
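The short Python sketch below (not from the reading) illustrates a control-variate estimator under the assumption that g(u) = e^u is the quantity of interest and h(u) = u − 0.5 is a control variate with exactly known zero mean; only numpy is used.

import numpy as np

rng = np.random.default_rng(0)
b = 100_000
u = rng.random(b)

g = np.exp(u)                   # quantity of interest; true mean is e - 1
h = u - 0.5                     # control variate: known to have mean zero

# Optimal coefficient and control-variate estimate (regression of g on h).
beta1 = np.cov(g, h)[0, 1] / np.var(h)
estimate = np.mean(g - beta1 * h)

print(round(g.mean(), 5), round(estimate, 5))         # both near 1.71828
print(round(np.std(g - beta1 * h) / np.sqrt(b), 6))   # smaller standard error than std(g)/sqrt(b)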
Disadvantages of Simulation
Monte Carlo simulation can result in unreliable approximations of moments if the DGPs used
do not adequately describe the observed data. T his mostly occurs due to misspecifications
of the DGP.
Simulation can be costly, especially when you are running multiple simulation experiments
because it can be time-consuming.
Bootstrapping
As stated earlier, bootstrapping is a type of simulation that uses the observed data to simulate from the unknown distribution that generated those data. Note, however, that bootstrapping does not directly model the observed data or make any assumption about its distribution; rather, the unknown distribution from which the sample was drawn is treated as the origin of the observed data.
T here are two types of bootstraps:
i. The iid bootstrap
ii. The circular block bootstrap
iid Bootstrap
iid bootstraps select the samples that are constructed with replacement from the observed data.
Assume that a simulation sample of size m is created from the observed data with n observations. iid
bootstraps construct observation indices by randomly sampling with replacement from the values 1, 2, ...,
n. T hese random indices are then used to draw the observed data to be included in the simulated data
(bootstrap sample).
For instance, assume we want to draw 10 observations from a sample of 50 data points. The first simulated sample might use one set of randomly drawn observations, the second might use {x_50, x_21, x_23, x_19, x_32, x_49, x_41, x_22, x_12, x_39}, and so on, until the desired number of bootstrap samples has been generated.
In other words, iid bootstrap is analogous to Monte Carlo Simulation, where bootstrap samples are
used instead of simulated samples. Under iid bootstrap, the expected values are estimated as:
Ê[g(X)] = (1/b) Σ_{j=1}^{b} g(x^BS_{1,j}, x^BS_{2,j}, …, x^BS_{m,j})
Where:
x^BS_{i,j} = observation i in bootstrap sample j
The iid bootstrap is suitable when the observations are independent over time; using it on dependent (e.g., autocorrelated) data is therefore inappropriate.
In short, the algorithm for generating a sample using the iid bootstrap is as follows (a sketch is given below):
i. Draw m indices i_1, i_2, …, i_m uniformly, with replacement, from {1, 2, …, n}.
ii. Construct the bootstrap sample as x_{i1}, x_{i2}, …, x_{im}.
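The sketch below shows these two steps in Python (not from the reading); the observed data, the sample sizes m and n, and the choice of the mean as the statistic are illustrative assumptions, and only numpy is required.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.05, scale=0.2, size=50)   # observed sample (assumed values), n = 50

n, m, b = len(data), 10, 1_000
stats_bs = np.empty(b)
for j in range(b):
    idx = rng.integers(0, n, size=m)   # step (i): draw m indices with replacement
    sample = data[idx]                 # step (ii): construct the bootstrap sample
    stats_bs[j] = sample.mean()        # statistic of interest for replication j

print(round(stats_bs.mean(), 4), round(stats_bs.std(), 4))   # bootstrap mean and its variability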
The circular block bootstrap differs from the iid bootstrap in that, instead of sampling each data point with replacement, it samples blocks of q consecutive observations with replacement. For instance, assume that we have 50 observations grouped into ten blocks of size q = 5. The blocks are sampled with replacement until the desired sample size is produced. In the case that the number of observations in the sampled blocks is larger than the required sample size, some of the values at the end of the bootstrap sample are dropped.
The block size should be large enough to reflect the dependence between observations but not so large that only a few distinct blocks are available. Conventionally, the block size is set to the square root of the sample size. The algorithm is:
i. Decide on the block size q; preferably, the block size should be approximately equal to the square root of the number of observations.
ii. Select a block starting index i at random from (1, 2, …, n) and transfer {x_i, x_{i+1}, …, x_{i+q−1}} to the bootstrap sample, wrapping around to the start of the data if necessary (hence "circular").
iii. In case the bootstrap sample has fewer than m elements, repeat step (ii).
iv. In case the bootstrap sample has more than m elements, omit the values from the end of the bootstrap sample.
Application of Bootstrapping
One of the applications of bootstrapping is the estimation of the p-VaR (the Value-at-Risk at confidence level p) in financial markets. The p-VaR is the level of loss that is exceeded with probability 1 − p:
P(L > VaR_p) = 1 − p
Where:
L = the loss of the portfolio over a given period, and
p = the chosen confidence level (e.g., 95% or 99%).
If the loss is measured in percentages of a particular portfolio, then p-VaR can be seen as a quantile
of the return distribution. For instance, if we wish to calculate a one-year VaR of a portfolio, we would simulate one year of data (252 trading days) and then find the appropriate quantile of the simulated annual returns.
T he VaR is then calculated by sorting the bootstrapped annual returns from lowest to highest and
then determining (1-p)b, which is basically the empirical 1-p quantile of the annual returns.
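A rough Python sketch of this bootstrap-VaR procedure is shown below (not from the reading); the daily return series, the 99% confidence level, and the number of replications are illustrative assumptions, and only numpy is used.

import numpy as np

rng = np.random.default_rng(0)
daily_returns = rng.normal(0.0004, 0.01, size=1000)   # observed daily returns (assumed values)

b, horizon, p = 5_000, 252, 0.99
annual = np.empty(b)
for j in range(b):
    idx = rng.integers(0, len(daily_returns), size=horizon)
    annual[j] = np.prod(1 + daily_returns[idx]) - 1    # bootstrapped one-year return

annual.sort()
var_99 = -annual[int((1 - p) * b)]     # empirical (1 - p) quantile of returns, reported as a loss
print(round(var_99, 4))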
The following are two situations where bootstrapping will not be sufficiently effective:
Outliers in the data – there is a likelihood that the bootstrap samples will over- or under-represent these extreme observations, making the results unstable.
Non-independent data – when the iid bootstrap is applied, the assumption that the data are independent is violated, so the dependence structure of the data is not preserved.
Disadvantages of Bootstrapping
Bootstrapping uses the whole data to generate a simulated sample and thus may make the
simulated sample unreliable when the past and the present data are different. For example,
the present state of a financial market might be different from the past.
Bootstrapping of historical data can be unreliable due to changes in the market, so that the present is different from the past. For instance, if we are bootstrapping market interest rates, there might be huge discrepancies between past and present market conditions, which make the historical data a poor guide to the future.
Monte Carlo simulation uses an entire statistical model that incorporates assumptions about the distribution of the shocks; therefore, the results are inaccurate if the model used is poor, even when many replications are used. On the other hand, bootstrapping does not specify a model but instead assumes that the past resembles the present. In other words, bootstrapping incorporates the dependence and distributional features actually present in the observed data.
Both Monte Carlo Simulation and bootstrapping are affected by the “Black Swan” problem, where the
resulting simulations in both methods closely resemble historical data. In other words, simulations
tend to focus on historical data, and thus, the simulations are not so different from what it has been
observed.
Practice Question
Which of the following statements best describes antithetic variables?
A. They are variables that are generated to have a negative correlation with the initial simulated sample.
B. They are mean-zero values that are correlated with the desired statistic that is to be estimated.
C. They are mean-zero variables that are negatively correlated with the initial simulated sample.
Solution
T he correct answer is A.
Antithetic variables are used to reduce the sampling error in the Monte Carlo simulation. They are constructed to have a negative correlation with the initial simulated sample so that the variance of the resulting estimate is reduced.
Reading 25: Machine-Learning Methods
After completing this reading, you should be able to:
Understand the differences between and consequences of underfitting and overfitting and propose potential remedies for each.
Explain the differences among the training, validation, and test data sub-samples, and how
each is used.
Machine learning (ML) is the art of programming computers to learn from data. Its basic idea is that
systems can learn from data and recognize patterns without active human intervention. ML is best
suited for certain applications, such as pattern recognition and complex problems that require large
amounts of data and are not well solved with traditional approaches.
On the other hand, classical econometrics has traditionally been used in finance to identify patterns
in data. It has a solid foundation in mathematical statistics, probability, and economic theory. In this
case, the analyst researches the best model to use along with the variables to be used. T he
computer's algorithm tests the significance of variables, and based on the results, the analyst decides how to refine the model.
Machine learning and traditional linear econometric approaches are both employed in prediction. T he
former has several advantages: machine learning does not rely on much financial theory when
selecting the most relevant features to include in a model. It can also be used by a researcher who is
unsure or has not specified whether the relationship between variables is linear or non-linear. T he
ML algorithm automatically selects the most relevant features and determines the most appropriate model specification.
Secondly, ML algorithms are flexible and can handle complex relationships between variables.
For example, consider the linear model:
y = β0 + β1 X_1 + β2 X_2 + ε
Suppose that the effect of X 1 on y depends on the level of X 2. Analysts would miss this interaction
effect unless a multiplicative term was explicitly included in the model. In the case of many
explanatory variables, a linear model may be difficult to construct for all combinations of interaction
terms. T he use of machine learning algorithms can mitigate this problem by automatically capturing
interactions.
Additionally, the traditional statistical approaches for evaluating models, such as analyses of statistical
significance and goodness of fit tests, are not typically applied in the same way to supervised machine
learning models. T his is because the goal of supervised machine learning is often to make accurate
predictions rather than to understand the underlying relationships between variables or to test
hypotheses.
T here are different terminologies and notations used in ML. T his is because engineers, rather than
statisticians, developed most machine learning techniques. For example, what have so far been discussed as independent variables are referred to in ML as features or inputs. Targets/outputs are dependent variables, and the values of the outputs are referred to as labels.
T he following gives a summary of some of the differences between ML techniques and classical
econometrics.
Machine learning techniques vs. classical econometrics:
Goals: Machine learning builds models that can learn from data and continuously improve their performance with time, and does not need the relationships between variables to be specified in advance. Classical econometrics identifies and estimates the relationships between variables and tests hypotheses about these relationships.
Data requirements: ML models can deal with large amounts of complex and unstructured data. Classical econometrics requires well-structured and clearly defined dependent and independent variables.
Assumptions: ML models are not built on distributional assumptions and can handle non-linear relationships between variables. Classical econometrics is based on various assumptions, e.g., that errors are normally distributed and that relationships between variables are linear.
Interpretability: ML models may be complex to interpret, as they may involve complex patterns and relationships that are difficult to understand or explain. Classical statistical models can be interpreted in terms of the relationships between variables.
There are many types of machine learning systems; the main types include unsupervised learning, supervised learning, and reinforcement learning.
Unsupervised Learning
As the name suggests, the system attempts to learn without a teacher. It recognizes data patterns
without an explicit target. More specifically, it uses inputs (X’s) for analysis with no corresponding
target (Y). Data is clustered to detect groups or factors that explain the data. It is, therefore, not used to predict a labeled target.
For example, unsupervised learning can be used by an entrepreneur who sells books to detect
groups of similar customers. T he entrepreneur will at no point tell the algorithm which group a
customer belongs to. It instead finds the connections without the entrepreneur’s help. T he algorithm
may notice, for instance, that 30% of the store’s customers are males who love science fiction
books and frequent the store mostly during weekends, while 25% are females who enjoy drama
books. A hierarchical clustering algorithm can be used to further subdivide groups into smaller ones.
Supervised Learning
By using well-labeled training data, this system is trained to work as a supervisor to teach the
machine to predict the correct output. You can think of it as how a student learns under the
supervision of a teacher. In supervised learning, a mapping function is determined that can map
inputs (X’s) with output (Y). T he output is also known as the target, while X’s are also known as the
features.
T ypically, there are two types of tasks in supervised learning. One is classification. For example, a
loan borrower may be classified as “likely to repay” or “likely to default.” T he second one is the
prediction of a target numerical value. For example, predicting a vehicle’s price based on a set of
features such as mileage, year of manufacture, etc. For the latter, labels will indicate the selling
prices. As for the former, the features would be the borrower's credit score, income, etc., while the labels would be "likely to repay" or "likely to default."
Reinforcement Learning
Reinforcement learning differs from other forms of learning. A learning system called an agent
perceives and interprets its environment, performs actions, and is rewarded for desired behavior and
penalized for undesired behavior. T his is done through a trial-and-error approach. Over time, the
agent learns by itself what is the best strategy (policy) that will generate the best reward while
avoiding undesirable behaviors. Reinforcement learning can be used to optimize portfolio allocation
and create trading bots that can learn from stock market data through trial and error, among many
other uses.
Training ML models can be slowed by the millions of features that might be present in each training instance. The many features can also make it difficult to find a good solution. This problem is often referred to as the curse of dimensionality. Dimensions and features are often used interchangeably. Dimension reduction involves reducing the number of features in the data; it speeds up the training of models on large datasets, scales down the computational burden of dealing with large datasets, and improves the interpretability of models.
PCA is the most popular dimension reduction approach. It involves projecting the training dataset
onto a lower-dimensional hyperplane. T his is done by finding the directions in the dataset that
capture the most variance and projecting the dataset onto those directions. PCA thereby reduces the number of dimensions while retaining as much of the variance as possible.
In PCA, the variance measures the amount of information. Hence, principal components capture the
most variance and retain the most information. Accordingly, the first principal component will
account for the largest possible variance; the second component will intuitively account for the
second largest variance (provided that it is uncorrelated with the first principal component), and so
on. A scree plot shows how much variance is explained by the principal components of the data. T he
principal components that explain a significant proportion of the variance are retained (usually 85%
to 95%).
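A minimal Python sketch of PCA and the explained-variance ratios a scree plot displays is given below (not from the reading); it assumes scikit-learn is installed, and the simulated 10-feature dataset stands in for the much larger feature set described in the example that follows.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
common = rng.standard_normal((200, 2))                         # two underlying factors
X = common @ rng.standard_normal((2, 10)) + 0.1 * rng.standard_normal((200, 10))

pca = PCA(n_components=10)
pca.fit(X)

print(np.round(pca.explained_variance_ratio_, 3))              # share of variance per component
print(np.round(np.cumsum(pca.explained_variance_ratio_), 3))   # retain components covering ~85-95%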
Researchers are concerned about which principal components will adequately explain returns in a
hypothetical Very Small Cap (VSC) 30 and Diversified Small Cap (DSC) 500 equity index over a 15-
year period. DSC 500 is a diversified index that contains stocks across all sectors, whereas VSC 30 is
a concentrated index that contains technology stocks. In addition to index prices, the dataset contains
more than 1000 technical and fundamental features. T he fact that the dataset has so many features
causes them to overlap due to multicollinearity. T his is where PCA comes in handy, as it works by
creating new variables that can explain most of the variance while preserving information in the
data.
Below is a scree plot for each index. Based on the 20 principal components generated, the first three components explain 88% and 91% of the variance in the VSC 30 and DSC 500 index values, respectively. The scree plots for both indexes illustrate that the incremental contribution in explaining the variance structure is very small after PC5 or so. From PC5 onwards, it is therefore possible to ignore the remaining components.
Clustering is a type of unsupervised machine-learning technique that organizes data points into groups (clusters). Clusters contain observations that are similar in nature. K-means is an iterative algorithm
that is used to solve clustering problems. K is the number of fixed clusters determined by the analyst
at the outset. It is based on the idea of minimizing the sum of squared distances between data points
and the centroid of the cluster to which they belong. The following outlines the K-means process:
1. Randomly allocate initial K centroids within the data (the centers of the clusters).
2. Assign each data point to the closest centroid.
3. Calculate the new K centroids for each cluster by taking the average value of all data points assigned to that cluster.
4. Reassign each data point to the closest centroid based on the newly calculated centroids.
5. Repeat the process of recalculating the new K centroids and reassigning the data points until the centroids converge or a maximum number of iterations is reached.
Iterations continue until no data point is left to reassign to the closest centroid (there is no need to
recalculate new centroids). T he distance between each data point and the centroids can be measured
in two ways. T he first is the Euclidean distance, while the second is the Manhattan distance.
Consider two features x and y, observed for two data points A and B with coordinates (x_A, y_A) and (x_B, y_B). The Euclidean distance between A and B is the square root of the sum of the squares of the differences between the coordinates of the two points:
Euclidean distance (d_E) = √((x_B − x_A)² + (y_B − y_A)²)
Imagine the Pythagoras theorem, where the Euclidean distance is the unknown (hypotenuse) side of a right-angled triangle.
In the case that there are more than two dimensions, for example, n features for two data points A
and B, Euclidean distance will be constructed in a similar fashion. Euclidean distance is also known as
the "straight-line distance " because it is the shortest distance between two points, indicated by the
solid line in the figure below. Manhattan distance, also known as the L1 norm, is calculated as the sum of the absolute differences between the two sets of coordinates. For a two-dimensional space, this is represented as:
Manhattan distance (d_M) = |x_B − x_A| + |y_B − y_A|
Manhattan distance is named after the layout of streets in Manhattan, where streets are laid out in a
grid pattern, and the only way to travel between two points is by going along the grid lines.
Suppose you have the following financial data (three feature values per company) for three companies:
Company P: 0.5, 9, 0.6
Company Q: 2.5, 15, 8
Company R
Calculate the Euclidean and Manhattan distances between companies P and Q in feature space for the
raw data.
To calculate the Euclidean distance between companies P and Q in feature space for the raw data,
we first need to find the difference between each feature value for the two companies and then
square the differences. The Euclidean distance is then calculated by taking the square root of the sum of these squared differences:
Euclidean distance (d_E) = √((0.5 − 2.5)² + (9 − 15)² + (0.6 − 8)²) = √94.76 = 9.73
Manhattan Distance
To calculate the Manhattan distance between companies P and Q in feature space for the raw data, we simply find the absolute difference between each feature value for the two companies and sum these differences:
Manhattan distance (d_M) = |0.5 − 2.5| + |9 − 15| + |0.6 − 8| = 2 + 6 + 7.4 = 15.4
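The two distance calculations above are easy to reproduce in Python; the short sketch below (not from the reading) only assumes numpy.

import numpy as np

p = np.array([0.5, 9.0, 0.6])
q = np.array([2.5, 15.0, 8.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))   # sqrt(4 + 36 + 54.76) = sqrt(94.76)
manhattan = np.sum(np.abs(p - q))           # 2 + 6 + 7.4

print(round(euclidean, 2), round(manhattan, 1))   # 9.73 and 15.4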
The formulas described above give the distance between two points A and B. It should be noted that K-means aims to minimize the distance between each data point and its centroid rather than the distance between data points. The data points will be closer to the centroids when the within-cluster variation (inertia) is small.
Inertia, also known as the Within-Cluster Sum of Squared errors (WCSS), is a measure of the sum of the squared distances between the data points and the centroid of the cluster to which they belong. Denoting the distance between data point i and its cluster centroid by d_i:
WCSS = Σ_{i=1}^{n} d_i²
The K-means algorithm aims to minimize the inertia by iteratively reassigning data points to different clusters and updating the cluster centroids until convergence. The final inertia value can be used to assess how compact the resulting clusters are and to compare alternative clustering solutions.
Choosing an Appropriate Value for K
Choosing an appropriate value for K can affect the performance of the K-means model. For example,
if K is set too low, the clusters may be too general and may not be a true representation of the
underlying structure of the data. Similarly, if K is set too high, the clusters may be too specific and
may not represent the data’s overall structure. T hese clusters may not be useful for the intended
purpose of the analysis in either case. It is, therefore, important to choose K optimally in practice.
The optimal value of K can be determined using different methods, such as the elbow method and silhouette analysis. The elbow method fits the K-means model for different values of K and plots the inertia/WCSS for each value of K. Similar to PCA, this is called a scree plot. The plot is then examined for the point at which the inertia begins to decrease more slowly as K increases (the elbow), and this value is chosen as the optimal K. In other words, the optimal K is the value that corresponds to the "elbow" of the plot.
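A rough Python sketch of the elbow method is shown below (not from the reading); it assumes scikit-learn is available, and the three-group synthetic dataset is an illustrative assumption.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Assumed data: three well-separated groups of points in two dimensions.
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in [(0, 0), (5, 5), (0, 5)]])

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)           # WCSS for each candidate K

print(np.round(inertias, 1))               # the drop flattens (the "elbow") around K = 3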
T he second approach involves fitting the K-means model for a range of values of K and determining
the silhouette coefficient for each value of K. T he silhouette coefficient compares the distance of
each data point from other points in its own cluster with its distance from the data points in the
other closest cluster. In other words, it measures the similarity of a data point to its own cluster
compared to the other closest clusters. The optimal value of K is the one that corresponds to the highest average silhouette coefficient.
K-means clustering is simple and easy to implement, making it a popular choice for clustering tasks.
There are some disadvantages to K-means, such as the need to specify the number of clusters (K) in advance, which can be difficult if the dataset is not well separated. Additionally, it assumes that the clusters are spherical and
equal in size, which is not always the case in practice.
The K-means algorithm is very common in investment practice. It can be used for data exploration in high-dimensional datasets.
Understand the differences between and consequences of underfitting and overfitting, and propose potential remedies for each.
Overfitting
Imagine that you have traveled to a new country, and the shop assistant rips you off. It is a natural
instinct to assume that all shop assistants in that country are thieves. If we are not careful, machines
can also fall into the same trap of overgeneralizing. T his is known as overfitting in ML.
Overfitting occurs when the model has been trained too well on the training data and performs poorly on new, unseen data. An overfitted model can have too many model parameters, thus learning the detail and noise in the training data rather than the underlying patterns. This is a problem because it means that the model cannot make reliable predictions about new data, which can lead to poor decisions. A model should therefore be judged on its prediction error on new data rather than on its goodness of fit to the training data. If an algorithm is overfitted to the training data, it will have a low prediction error on the training data but a high prediction error on new data.
The dataset to which an ML model is applied is normally split into training and validation samples. The training data set is used to train the ML model by fitting the model parameters. On the other hand, the validation data set is used to evaluate the trained model and estimate how well the model will perform on new, unseen data.
classical econometric models that can only have a few parameters. Potential remedies for overfitting
include decreasing the complexity of the model, reducing features, or using techniques such as
regularization or early stopping.
Underfitting
Underfitting is the opposite of overfitting. It occurs when a model is too simple and thus not able to
capture the underlying patterns in the training data. T his results in poor performance on both the
training data and new data. For example, we would expect a simple linear model of life satisfaction to be prone to underfitting, as the real world is more complicated than the model; in this scenario, the model will predict poorly even on the training data. Underfitting is more likely in conventional models because they tend to be less flexible than ML models, which do not impose assumptions about the structure of the model. It should be noted, however, that ML models can still experience underfitting. This can happen when there is insufficient data to train the model, when the data is of poor quality, or when regularization is excessively stringent.
Regularization is an approach commonly used to prevent overfitting. It adds a penalty to the model as
the complexity of the model increases. If the regularization is set too high, it can cause the model to
underfit the data. Potential remedies for addressing underfitting include increasing the complexity of
the model, adding more features, or increasing the amount of training data.
Bias-Variance Tradeoff
T he complexity of the ML model, which determines whether the data is over, under, or well-fitted,
involves a phenomenon called the bias-variance tradeoff. Complexity refers to the number of features in a model and whether the model is linear or non-linear (non-linear models being more complex). Bias occurs when a complex relationship is approximated with a simpler model, i.e., by omitting relevant factors and interactions. A model with highly biased predictions is likely to be oversimplified and thus results in underfitting. Variance refers to how sensitive the model is to small fluctuations in the training data. A model with high variance in predictions is likely to be overly complex and thus results in overfitting.
T he figure below illustrates how bias and variance are affected by model complexity.
Sample Splitting and Preparation
Data Preparation
T here is a tendency for ML algorithms to perform poorly when the variables have very different
scales. For example, there is a vast difference in the range between income and age. A person’s
income ranges in the thousands while their age ranges in the tens. Since ML algorithms only see
numbers, they will assume that higher-ranging numbers (income in this case) are superior, which is
false. It is, therefore, crucial to have values in the same range. Standardization and normalization are the two most common rescaling techniques.
Standardization involves centering and scaling variables. Centering is where the variable’s mean value
is subtracted from all observations on that variable (so standardized values have a mean of 0). Scaling is where the centered values are divided by the standard deviation so that the distribution has a unit standard deviation:

x_i(standardized) = (x_i − μ) / σ
Normalization, also known as min-max scaling, entails rescaling values from 0 to 1. T his is done by
subtracting the minimum value (x min ) from each observation and dividing by the difference between
the maximum (x max ) and minimum values (x min ) of X . T his is expressed as follows:
x_i(normalized) = (x_i − x_min) / (x_max − x_min)
Standardization is used when the data includes outliers. T his is because normalization would
compress data points into a narrow range of 0 − 1, which would be uncharacteristic of the
original data.
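As a simple numerical illustration of the two formulas above, the following sketch standardizes and normalizes a small, hypothetical vector of observations (the values are invented).

```python
import numpy as np

x = np.array([12.0, 55.0, 74.0, 101.0, 18.0])     # hypothetical observations

standardized = (x - x.mean()) / x.std()            # mean 0, unit standard deviation
normalized = (x - x.min()) / (x.max() - x.min())   # rescaled to the [0, 1] range

print(standardized.round(3))
print(normalized.round(3))
```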
Data Cleaning
This is a crucial component of ML and may be the difference between an ML project's success and failure. Issues that need to be addressed include the following:
Missing data: Analysts encounter this issue very often. Missing data can be dealt with in the following ways. First, observations with only a small number of missing values can be removed. Secondly, missing values can be replaced with the mean or median of the non-missing observations on that variable.

Inconsistent recording: It is important to record data consistently so that it can be read and processed correctly.

Unwanted observations: Observations that are not relevant to the specific task should be removed.

Duplicate observations: Duplicate data points should be removed to avoid biases.

Problematic features: A feature value that lies many standard deviations from the mean (an outlier) should be investigated and, if necessary, corrected or removed.
We briefly discussed the training and validation data sets, which are in-sample datasets. Additionally,
there is an out-of-sample dataset, which is the test data. T he training dataset teaches an ML model to
make predictions, i.e., it learns the relationships between the input data and the desired output. A
validation dataset is used to evaluate the performance of an ML model during the training process. It
compares the performance of different models so as to determine which one generalizes (fits) best
to new data. A test dataset is used to evaluate an ML model’s final performance and identify any
remaining issues or biases in the model. T he performance of a good ML model on the test dataset
should be relatively similar to the performance on the training dataset. However, the model may perform differently on the training and test datasets, and perfect generalization may not always be possible.
It is up to the researchers to decide how to subdivide the available data into the three samples. A
common rule of thumb is to use two-thirds of the sample for training and the remaining third to be
equally split between validation and testing. T he subdivision of the data will be less crucial when the
overall data points are large. Using a small training dataset can introduce biases into the parameter
estimation because the model will not have enough data to learn the underlying patterns in the data
accurately. Using a small validation dataset can lead to inaccurate model evaluation because the model
may not have enough data to assess its performance accurately; thus, it will be hard to identify the
best specification. When subdividing the data into training, validation, and test datasets, it is crucial to take the structure of the data into account.
For cross-sectional data, it is best to divide the dataset randomly, as the data has no natural ordering
(i.e., the observations are not related to each other in any specific order). For time series data, it is best to divide the data in chronological order, starting with the training data, then the validation data, and finally the test data.
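A minimal sketch of the two splitting conventions described above, using the two-thirds/one-sixth/one-sixth rule of thumb, is shown below; the function names and the NumPy-based implementation are assumptions made for illustration (X and y are NumPy arrays).

```python
import numpy as np

def split_cross_sectional(X, y, seed=0):
    """Random 2/3 train, 1/6 validation, 1/6 test split for cross-sectional data."""
    n = len(X)
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(2 * n / 3), int(n / 6)
    tr, va, te = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

def split_time_series(X, y):
    """Chronological split: earliest observations train, latest observations test."""
    n = len(X)
    n_train, n_val = int(2 * n / 3), int(n / 6)
    return ((X[:n_train], y[:n_train]),
            (X[n_train:n_train + n_val], y[n_train:n_train + n_val]),
            (X[n_train + n_val:], y[n_train + n_val:]))
```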
Cross-validation Searches
Cross-validation can be used when the overall dataset is insufficient to be divided into training,
validation, and testing datasets. In cross-validation, training and validation datasets are combined into
one sample, and the testing dataset is excluded. The combined data is then equally split into sub-samples, with a different sub-sample left out each time and used to evaluate the model. This technique is known as k-fold cross-validation. It splits the combined training and validation data into k sub-samples, and the model is trained and evaluated k times, each time leaving out a different sub-sample. Values of k of 5 or 10 are commonly used.
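The following sketch shows k-fold cross-validation with k = 5 using scikit-learn on simulated data; the model (an ordinary linear regression) and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=120)

cv = KFold(n_splits=5, shuffle=True, random_state=0)    # k = 5 folds
scores = cross_val_score(LinearRegression(), X, y, cv=cv)
print(scores.mean())                                     # average out-of-fold performance
```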
Reinforcement Learning

Reinforcement learning (RL) involves an agent that interacts with an environment and learns to take actions that maximize a reward. The agent is given feedback as either a reward or punishment depending on its
actions. It then uses the feedback to learn the actions that are likely to generate the highest reward.
T he algorithm learns through trial and error by playing many times against itself.
How Reinforcement Learning Operates
T he environment consists of the state space, action space, and the reward function. T he state space
is the set of all possible states in which the agent can be. On the other hand, the action space
consists of a set of actions that the agent can take. Lastly, the reward function defines the feedback
that the agent receives for taking a particular action in a given state space.
The first step involves specifying the learning algorithm and any relevant parameters. The agent is then placed in the environment in an initial state.
Take an Action
T he agent chooses an action depending on its current state and the learning algorithm. T his action is
then taken in the environment, which may lead to a change of state and a reward. At any given state,
the algorithm can choose between taking the best course of action (exploitation) and trying a new
action (exploration). Exploitation is assigned the probability p and exploration given the probability
1 − p. p increases as more trials are concluded, and the algorithm has learned more about the best
strategy.
Based on the agent's reward and the environment's new state, the agent updates its internal state. This allows it to refine its strategy over time.

The agent continues to take actions and update its internal state until it reaches a predefined number of iterations or a terminal state.
T he Monte Carlo method estimates the value of a state or action based on the final reward received
at the end of an episode. On the other hand, the temporal difference method updates the value of a
state or action by looking at only one decision ahead when updating strategies.
An estimate of the expected value of taking action A in state S, after several trials, is denoted as Q(S, A). The estimated value of being in state S at any time is the value of the best action available in that state, max_A Q(S, A). After each trial, the Q estimate is updated as:

Q_new(S, A) = Q_old(S, A) + α [R − Q_old(S, A)]

Where α is a parameter, say 0.05, which is the learning rate that determines how much the agent updates its Q value based on the difference between the expected reward, Q_old(S, A), and the observed reward, R.
Suppose that we have three states (S1, S2, S3) and two actions (A1, A2), with the following Q(S, A)
values:
        S1     S2     S3
A1      0.3    0.4    0.5
A2      0.7    0.6    0.5
Monte-Carlo Method
Suppose that on the next trial, Action 2 is taken in State 3, and the total subsequent reward is 1.0. If α = 0.075, the Monte Carlo method would lead to Q(3, 2) being updated from 0.5 to:

0.5 + 0.075 × (1.0 − 0.5) = 0.5375

Temporal Difference Method

If the next decision that has to be made on the trial under consideration happens when we are in State 2 and Action 2 is taken, the relevant one-step-ahead value is Q(2, 2) = 0.6. The temporal difference method would lead to Q(3, 2) being updated from 0.5 to:

0.5 + 0.075 × (0.6 − 0.5) = 0.5075
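The two updates above can be verified with a few lines of arithmetic. The incremental update rule Q_new = Q_old + α(target − Q_old) used here is an assumption consistent with the learning-rate description above; the targets are the episode reward (Monte Carlo) and the value of the next state-action pair (temporal difference).

```python
alpha = 0.075
q_s3_a2 = 0.5

# Monte Carlo: update toward the total reward observed at the end of the episode (1.0)
q_mc = q_s3_a2 + alpha * (1.0 - q_s3_a2)        # 0.5375

# Temporal difference: update toward the value of the next state/action pair,
# here Q(S2, A2) = 0.6, looking only one decision ahead
q_td = q_s3_a2 + alpha * (0.6 - q_s3_a2)        # 0.5075

print(q_mc, q_td)
```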
Applications of reinforcement learning in finance include the following:

1. Trading: Reinforcement learning algorithms can learn from past data and market dynamics to make informed decisions on when to buy and sell, possibly optimizing the trading of financial instruments.

2. Detecting fraud: RL can be used to detect fraudulent activity in financial transactions.
T his algorithm learns from past data and hence adapts to new fraud patterns. T his means that
the algorithm becomes better at detecting and preventing fraud with time.
3. Credit scoring: RL can be used to predict the probability of a borrower defaulting on a loan. The algorithm can be trained on historical data about borrowers and their credit histories.

4. Risk management: RL can be trained using past data to identify and mitigate financial risks.

5. Portfolio optimization: RL can be trained to take actions that modify the allocation of assets in the portfolio over time, with the aim of maximizing portfolio returns and minimizing risks.
Natural Language Processing

Natural language processing (NLP) focuses on helping machines process and understand human language. A typical NLP workflow in finance involves the following steps:
1. Data collection: Involves acquiring data from various sources, including financial statements, news articles, and social media.

2. Data preprocessing: The raw textual data is cleaned, formatted, and transformed into a form suitable for computer usage. Tasks such as tokenization, stemming, and stop word removal are performed at this stage.

3. Feature extraction: This involves extracting relevant features from the preprocessed data. It may involve extracting financial metrics, sentiments, and other relevant information.
4. Model training: This involves training the machine learning model using the extracted features.

5. Model evaluation: This involves evaluating the performance of the trained model to ensure it generates accurate and reliable predictions. Techniques such as cross-validation can be employed here. Model evaluation is carried out on the test dataset.
6. Model deployment: The evaluated model is then deployed for use in real-world investment scenarios.
Data Preprocessing
Textual data (unstructured data) is more suitable for human consumption than for computer processing. Unstructured data thus needs to be converted to structured data through cleaning and preprocessing, a process called text processing. Text cleansing involves removing HTML tags, punctuation, numbers, and white spaces (e.g., tabs and indents). Text preprocessing then typically includes the following steps:
1. Tokenization: Involves separating a piece of text into smaller units called tokens. It allows the NLP model to analyze the textual data more easily by breaking it down into individual words or sub-words.

3. Removing stop words: These are words with no informational value, e.g., "as," "the," and "is," used as sentence connectors. They are eliminated to reduce the number of tokens in the training data.

4. Stemming: Reduces all the variations of a word to a common value (base form/stem). For example, "earned," "earnings," and "earning" are all assigned the common value "earn." It only chops off word endings using simple rules, so the resulting stem may not be a valid word.

5. Lemmatization: Involves reducing words to their base form/lemma to identify related words. Unlike stemming, lemmatization incorporates the full structure of the word and uses a dictionary or morphological analysis to identify the lemma. It generates more accurate base forms than stemming but is computationally more demanding.

6. Consider "n-grams": These are words that need to be placed together to give a specific meaning, e.g., the bigram "interest rate." A simple illustration of tokenization, stop-word removal, and stemming follows.
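A library-free sketch of tokenization, stop-word removal, and a crude form of stemming is given below; the sample sentence, the stop-word list, and the suffix rules are illustrative assumptions rather than a production pipeline.

```python
import re

text = "The company reported strong earnings, and earnings growth is expected."

# Tokenization: split the lowercased text into word tokens
tokens = re.findall(r"[a-z']+", text.lower())

# Stop-word removal: drop words with no informational value (illustrative list)
stop_words = {"the", "and", "is", "as", "a", "of"}
tokens = [t for t in tokens if t not in stop_words]

def crude_stem(word):
    # Chops common suffixes; real stemmers (e.g., Porter) use richer rule sets
    for suffix in ("ings", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

stems = [crude_stem(t) for t in tokens]
print(stems)   # "earnings" and "expected" map to "earn" and "expect"
```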
Finance professionals can leverage NLP to derive insights from large volumes of textual data and make better-informed decisions. Applications include the following:

Trading: NLP can be employed to analyze real-time financial data, e.g., stock prices, to derive trends and patterns that could be used to inform investment decisions.
Risk management: NLP can be used to identify possible risks in financial contracts and regulatory filings, for example, by identifying language that implies a high level of risk or uncertainty.
News analysis: NLP can be used to derive information from news articles and other sources of financial information, e.g., earnings reports. The resulting information can then be used to identify trading opportunities.
Sentiment analysis: NLP can be used to measure the public opinion of a company,
industry, or market trend by analyzing sentiments on social media posts and news articles.
Investors can use this information to make more informed investment decisions. Investors
can classify the text as positive, negative, or neutral based on the sentiment expressed in
the text.
Customer service: NLP can be employed in chatbots to aid companies in responding to customer queries quickly and consistently.

Detecting accounting fraud: For example, the Securities and Exchange Commission (SEC) has applied text-analysis techniques to corporate filings to help detect accounting fraud.

Text classification: This is the process of assigning text data to prespecified categories. For example, text classification could involve assigning newswire statements to the categories of news they represent, e.g., education, financial, environmental, etc.
Practice Question
Which of the following is least likely a task that can be performed using natural language
processing?
A. Sentiment analysis.
B. Text translation.
C. Image recognition.
D. Text classification.
Solution
T he correct answer is C.
Image recognition is not a task that can be performed using NLP. This is because NLP is concerned with processing and understanding human language, not images.

A is incorrect: NLP can be used for sentiment analysis. For example, NLP can be used to gauge public opinion by analyzing the sentiment expressed in social media posts.

B is incorrect: NLP can be used for text translation between languages.

D is incorrect: NLP can be used for text classification, i.e., assigning text data to prespecified categories. For example, text classification could involve assigning newswire statements to the categories of news they represent, e.g., education, financial, environmental, etc.
Reading 26: Machine Learning and Prediction
After compl eti ng thi s readi ng, you shoul d be abl e to:
Discuss why regularization is useful and distinguish between the ridge regression and
LASSO approaches.
Outline the intuition behind the K nearest neighbors and support vector machine methods
for classification.
Understand how neural networks are constructed and how their weights are determined.
Evaluate the predictive performance of logistic regression models and neural network models.

Linear Regression

Linear regression models the relationship between a dependent variable and one or more
independent variables by fitting a linear equation to the observed data. It works by finding the line of
the best fit through the data points. This line is called the regression line, and it is straight. The equation of the line of best fit can then be used to make predictions about the dependent variable based on the values of the independent variables.
The regression line can be expressed as follows:

y = α + β_1 x_1 + β_2 x_2 + … + β_n x_n + ε

Where:

y = Dependent variable.
α = Intercept.
β_1, β_2, …, β_n = Coefficients of the independent variables.
x_1, x_2, …, x_n = Independent variables.
ε = Error term.
The coefficients show the effect of each independent variable on the dependent variable and are estimated from the data.

Training any machine learning model aims to minimize the cost (loss) function. A cost function
measures the inaccuracy of the model predictions. It is the sum of squared residuals (RSS) for a
linear regression model. This is the sum of the squared differences between the actual and predicted values:

RSS = Σ_{i=1}^{n} (y_i − α − Σ_{j=1}^{k} β_j x_{ij})²
To measure how well the data fits the line, take the difference between each actual data point (y)
and the model's prediction (y^). T he differences are then squared to eliminate negative numbers and
penalize larger differences. T he squared differences are then added up, and an average is taken.
T he advantage of linear regression is that it is easy to understand and interpret. However, it has the
following limitations:
It assumes that the residuals (the differences between observed and predicted values) are independent and normally distributed with constant variance.
It is prone to overfitting.
Aditya Khun, an investment analyst, wants to predict the return on a stock based on its P/E ratio and
the market capitalization of the company using linear regression in machine learning. Khun has
access to the P/E ratio and market capitalization dataset for several stocks, along with their
corresponding returns. Khun can employ linear regression to model the relationship between the
return on a stock and its P/E ratio and market capitalization. The following equation represents the model:

Return = β_0 + β_1 × (P/E ratio) + β_2 × (Market cap) + ε

Where:

β_0 = Intercept.
β_1, β_2 = Coefficients of the P/E ratio and market capitalization, respectively.
T he first step of fitting a linear regression model is estimating the values of the coefficients β0, β1,
and β2 using the training data. Coefficients that minimize the sum of the squared residuals are
determined.
Suppose the estimated values are:

Intercept = 3.432.
Market cap coefficient = 0.0368.

Given a P/E ratio of 14 and a market capitalization of $150M, the return on the stock can then be determined by substituting these values (together with the estimated P/E coefficient) into the fitted equation.
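A sketch of Khun's workflow using scikit-learn is shown below. The stock data are invented, so the fitted intercept and coefficients will not match the values quoted above; the point is only to show how the coefficients are estimated and then used for prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: P/E ratio, market capitalization ($M), and return (%)
pe_ratio = np.array([10, 14, 18, 22, 9, 30, 12, 16], dtype=float)
market_cap = np.array([120, 150, 300, 80, 60, 450, 200, 90], dtype=float)
returns = np.array([6.1, 7.8, 9.5, 5.2, 4.0, 13.8, 8.9, 5.5])

X = np.column_stack([pe_ratio, market_cap])
model = LinearRegression().fit(X, returns)          # minimizes the RSS

print(model.intercept_, model.coef_)                # fitted intercept and coefficients
print(model.predict([[14.0, 150.0]]))               # predicted return for P/E = 14, cap = $150M
```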
Logistic Regression
When using a linear regression model for binary classification, where the dependent variable Y can
only be 0 or 1, the model can predict probabilities outside the range of 0 to 1. T his occurs because
the model attempts to fit a straight line to the data, and the predicted values may not be restricted to
the valid range of probabilities. As a result, the model may produce predictions that are less than zero
or greater than one. To avoid this issue, it may be necessary to use a different type of model, such as
logistic regression, which is specifically designed for binary classification tasks and ensures that the
predicted probabilities are within the valid range. T his is achieved by applying a sigmoid function.
Logistic regression is used to forecast a binary outcome. In other words, it predicts the likelihood of an observation belonging to one of two classes. It does so by applying the sigmoid (logistic) function to a linear combination of the explanatory variables:

F(y_j) = e^{y_j} / (1 + e^{y_j})

Where:

y_j = α + β_1 x_{1j} + β_2 x_{2j} + … + β_k x_{kj}
α = Intercept term.
β_1, …, β_k = Coefficients of the explanatory variables.

The probability that observation j belongs to class 1 is:

p_j = e^{y_j} / (1 + e^{y_j})

and the probability that it belongs to class 0 is (1 − p_j).
The cost function in this setting measures how often we predicted zero when the true answer was one, and vice versa. The
logistic regression coefficients are trained using techniques such as maximum likelihood estimation
(MLE) to predict values close to 0 and 1. MLE works by selecting the values of the model
parameters (α and the βs) that maximize the likelihood of the training data occurring. The likelihood
function is a mathematical function that describes the probability of the observed data given the
model parameters. By maximizing the likelihood function, we can find the values of the parameters
most likely to have produced the observed data. T his is expressed as:
L = Π_{j=1}^{n} F(y_j)^{y_j} (1 − F(y_j))^{1−y_j}
It is often easier to maximize the log-likelihood function, log(L), than the likelihood function itself.
T he log-likelihood function is obtained by taking the natural logarithm of the likelihood function:
Log(L) = Σ_{j=1}^{n} [y_j log(F(y_j)) + (1 − y_j) log(1 − F(y_j))]
Once the model parameters (α and the βs) that maximize the log-likelihood function have been estimated using MLE, predictions can be made using the logistic regression model. To make predictions, a threshold value Z is chosen. If the predicted probability p_j is greater than or equal to the threshold Z, the model predicts the positive outcome (y_j = 1); if p_j is less than the threshold Z, the model predicts the negative outcome (y_j = 0):

y_j = 1 if p_j ≥ Z; y_j = 0 if p_j < Z
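The sketch below fits a logistic regression by maximum likelihood with scikit-learn and applies a threshold of Z = 0.5 to the predicted probabilities; the simulated features and labels are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))                            # two illustrative features
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)                     # fitted by maximum likelihood
p = clf.predict_proba(X[:5])[:, 1]                       # p_j = e^y / (1 + e^y)

Z = 0.5                                                  # classification threshold
labels = (p >= Z).astype(int)                            # y_j = 1 if p_j >= Z else 0
print(p.round(3), labels)
```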
A credit analyst wants to predict whether a customer will default on a loan based on their credit
score and debt-to-income ratio. He gathers a dataset of 500 customers, with their corresponding
credit scores, debt-to-income ratio, and whether they defaulted on the loan. He then splits the data
into training and test sets and uses the training data to train a logistic regression model.
The model learns a relationship between the independent variables (input features) and the probability of default. The fitted expression calculates the probability that the customer will default on the loan, given their credit score and debt-to-income ratio. So, if the credit score is 650 and the debt-to-income ratio is 0.6, suppose that substituting these values into the fitted model gives a probability of default of approximately 0.12.
So there is a 12% probability that the customer will default on the loan. One can then use a threshold
(such as 50%) to convert this probability into a binary prediction (either “default” or “no default”).
Logistic regression is applied for prediction and classification tasks in machine learning. For example,
you could use logistic regression to classify stock returns as either “positive” or “negative” based on
a set of input features that you choose. It is simple to implement and interpret. However, it assumes a linear relationship (in the log-odds) between the dependent and independent variables and requires a large sample size to produce stable estimates.

Handling Categorical Variables

Categorical data refers to information presented in groups and can take on values that are names,
attributes, or labels. It is not in a numerical format. For example, a given set of stocks can be
categorized as either growth or value stocks depending on the investment style. Many ML algorithms cannot work with categorical data directly and require numerical inputs.

Transforming categorical variables is not straightforward, especially for non-ordinal categorical data, where the classes are not in any order. Mapping or encoding involves transforming non-numerical information
into numbers. One-hot encoding is the most common solution for dealing with non-ordinal categorical
data. It involves creating a new dummy variable for each group of the categorical feature and
encoding the categories as binary. Each observation is marked as either belonging (value = 1) or not belonging (value = 0) to each group.

For ordered categorical variables, for example, where a candidate's grades are specified as either poor, good, or excellent, a dummy variable that equals 0 for poor, 1 for good, and 2 for excellent can be used.
If an intercept term and a full set of dummy variables are included in a model, the dummy variable trap (perfect multicollinearity) may be encountered. This means that the model will have multiple possible solutions, and we cannot
find a unique best-fit solution. To address this issue, techniques such as regularization can be used.
T hese approaches penalize the magnitude of the coefficients of the model, which can help to reduce
the impact of correlated variables and prevent the dummy variable trap from occurring.
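A short pandas sketch of one-hot encoding a non-ordinal feature and integer-mapping an ordered one is given below; the column names and categories are hypothetical.

```python
import pandas as pd

stocks = pd.DataFrame({"ticker": ["AAA", "BBB", "CCC"],
                       "style": ["growth", "value", "growth"]})

# One-hot encode the non-ordinal 'style' feature; drop_first drops one dummy,
# which avoids the dummy variable trap when an intercept is included.
encoded = pd.get_dummies(stocks, columns=["style"], drop_first=True)

# Ordered categories can instead be mapped to integers.
grades = pd.Series(["poor", "good", "excellent"])
ordinal = grades.map({"poor": 0, "good": 1, "excellent": 2})

print(encoded, ordinal, sep="\n")
```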
Regularization
Regularization is a technique that prevents overfitting in machine learning models by penalizing large coefficients. It adds a penalty term to the model's objective function, encouraging the coefficients to take on smaller values. This reduces the impact of correlated variables, as it forces the model to rely more on the overall pattern of the data and less on the influence of any single variable. It improves the model's ability to generalize to new data.

Because the penalty depends on the size of the coefficients, the features should be rescaled before regularization is applied. Normalization involves scaling the data to have a minimum value of 0 and a maximum value of 1. On the other hand, standardization involves scaling the data so that it has a mean of zero and a standard deviation of one.

Ridge regression and the least absolute shrinkage and selection operator (LASSO) regression are the two most common regularization approaches.
Ridge Regression
Ridge regression, sometimes known as L2 regularization, is a type of linear regression that is used to
analyze data and make predictions. It is similar to ordinary least squares regression but includes a
penalty term that constrains the size of the model's coefficients. Consider a dataset with n
observations on each of k features in addition to a single output variable y and, for simplicity, assume
that we are estimating a standard linear regression model with hats above parameters denoting their
estimated values. T he relevant objective function (referred to as a loss function) in ridge regression
is:
L = (1/n) Σ_{j=1}^{n} (y_j − α̂ − β̂_1 x_{1j} − β̂_2 x_{2j} − … − β̂_k x_{kj})² + λ Σ_{i=1}^{k} β̂_i²

or

L = RSS + λ Σ_{i=1}^{k} β̂_i²
T he first term in the expression is the residual sum of squares, which measures how well the model
fits the data. T he second term is the shrinkage term, which introduces a penalty for large slope
parameter values. T his is known as regularization, and it helps to prevent overfitting, which is when
a model fits the training data too well and performs poorly on new, unseen data.
T he parameter λ is a hyperparameter, which means that it is not part of the model itself but is used
to determine the model. In this case, it controls the relative weight given to the shrinkage term
versus the model fit term. It is essential to tune the value of λ (i.e., to perform hyperparameter optimization) to achieve a good balance between model fit and parsimony.

LASSO Regression

LASSO regression, sometimes known as L1 regularization, also introduces a penalty term to the objective function to prevent overfitting. However, the penalty term in LASSO regression takes the form of the absolute value of the coefficients rather than the squared value:
L = (1/n) Σ_{j=1}^{n} (y_j − α̂ − β̂_1 x_{1j} − β̂_2 x_{2j} − … − β̂_k x_{kj})² + λ Σ_{i=1}^{k} |β̂_i|

or

L = RSS + λ Σ_{i=1}^{k} |β̂_i|
Ridge regression has a closed-form (analytical) solution, which means that the values of its coefficients can be calculated directly, without the need for iterative optimization. On the other hand, LASSO does not have closed-form solutions for the coefficients, so a numerical optimization procedure must be used to determine the values of the parameters.
Ridge regression and LASSO have a crucial difference. Ridge regression adds a penalty term that
reduces the magnitude of the β parameters and makes them more stable. T he effect of this is to
“shrink” the β parameters towards zero, but not all the way to zero. T his can be especially useful
when there is multicollinearity among the variables, as it can help to prevent any one variable from dominating the model.

However, LASSO sets some of the less important β parameters to exactly zero. The effect of this is
to perform feature selection, as the β parameters corresponding to the least important features will
be set to zero. In contrast, the β parameters corresponding to the more important features will be
retained. T his can be useful in cases where the number of variables is very large, and some variables
are irrelevant or redundant. T he choice between LASSO and ridge regression depends on the
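The contrast between the two penalties can be seen in the following scikit-learn sketch, in which only two of five simulated features are truly relevant; the data, the choice of λ (called alpha in scikit-learn), and the feature count are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)  # only two relevant features

ridge = Ridge(alpha=1.0).fit(X, y)     # alpha plays the role of lambda
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_.round(3))            # all coefficients shrunk toward (but not to) zero
print(lasso.coef_.round(3))            # irrelevant coefficients typically set exactly to zero
```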
Elastic Net
Elastic net regularization is a method that combines the L1 and L2 regularization techniques in a single loss function:

L = (1/n) Σ_{j=1}^{n} (y_j − α̂ − β̂_1 x_{1j} − β̂_2 x_{2j} − … − β̂_k x_{kj})² + λ_1 Σ_{i=1}^{k} β̂_i² + λ_2 Σ_{i=1}^{k} |β̂_i|

or

L = RSS + λ_1 Σ_{i=1}^{k} β̂_i² + λ_2 Σ_{i=1}^{k} |β̂_i|
By adjusting λ1 and λ2, which are hyperparameters, it is possible to obtain the advantages of both L1
and L2 regularization. T hese advantages include decreasing the magnitude of some parameters and
eliminating some unimportant ones. This can help to improve the model's performance and the interpretability of the results.
Example: Regularization
OLS regression determines the coefficients of the model by minimizing the sum of the squared
residuals (RSS). Note that it does not incorporate any regularization and can therefore lead to
significant coefficients and overfitting. On the other hand, ridge regularization adds a penalty term to
RSS. T he penalty term is determined as the sum of the squared coefficient values, multiplied by λ,
which is regarded as a hyperparameter. T he hyperparameter controls the strength of the penalty and
can be adjusted to find an optimal balance between the model's fitness and the model's simplicity.
Notice that as λ increases, the penalty term becomes more influential, and the coefficient values
become smaller.
As discussed earlier, LASSO uses the sum of the absolute values of the coefficients as the penalty term. This leads to some coefficients being reduced to zero, which eliminates unnecessary features from the model. Similar to ridge regression, the strength of the LASSO penalty is controlled by the hyperparameter λ.
Choosing the value of the hyperparameter in a regularized regression model is an important step in
the modeling process, as it can significantly impact the model's performance. One common approach
to selecting the value of the hyperparameter is to use cross-validation, which involves splitting the
data into a training set, a validation set, and a test set. T his was discussed in detail in Chapter 14. T he
training set is used to fit the model and determine the coefficients for different values of λ. T he
validation set determines how well the model generalizes to new data. T he test set is used to
evaluate the final performance of the model and provide an unbiased estimate of the model's
accuracy.
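A minimal sketch of tuning λ on a validation set is shown below; the candidate grid of λ values, the simulated data, and the use of ridge regression are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 8))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_lam, best_mse = None, np.inf
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=lam).fit(X_train, y_train)          # fit on the training set
    mse = mean_squared_error(y_val, model.predict(X_val))   # evaluate on the validation set
    if mse < best_mse:
        best_lam, best_mse = lam, mse

print(best_lam, best_mse)
```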
Decision Trees
A decision tree is a supervised machine-learning technique that can be used to predict either a categorical target variable (producing a classification tree) or a continuous target variable (producing a regression tree). It creates a tree-like decision model based on the input features. At each internal node of the tree, there is a
question, and the algorithm makes a decision based on the value of one of the features. It then
branches an observation to another node or a leaf. A leaf is a terminal node that leads to no further
nodes. In other words, the decision tree includes the initial root node, decision nodes, and terminal
nodes.
Classification and Regression T ree (CART ) is a decision tree algorithm commonly used for
supervised learning tasks, such as classification and regression. One of the main benefits of CART is
that it is highly interpretable, meaning it is easy to understand how the model makes predictions. T his
is because CART models are built using a series of simple decision rules that are easy to understand
and follow. For this reason, CART models are often referred to as “white-box models,” in contrast to
other techniques like neural networks, which are often referred to as "black-box models." Neural networks are more challenging to interpret because they are based on complex mathematical functions whose inner workings are hard to trace.

The following is a visual representation of a simple model for predicting whether a company will default.
When building a decision tree, the goal is to create a model that can accurately predict the value of a
target variable based on the importance of other features in the dataset. To do this, the decision tree
must decide which features to split on at each node of the tree. T he tree is constructed by starting
at the root node and recursively partitioning the data into smaller and smaller groups based on the
values of the chosen features. We use a measure called information gain to determine which feature to split on at each node.

Information gain measures how much uncertainty or randomness is reduced by obtaining additional information about the feature. In other words, it measures how much the feature helps us predict the target variable.
T here are two commonly used measures of information gain: entropy and the Gini coefficient. Both
of these measures are used to evaluate the purity of a node in the decision tree. T he goal is to
choose the feature that results in the most significant reduction in entropy or the Gini coefficient, as
this will be the most helpful feature in predicting the target variable.
Entropy ranges from 0 to 1, with 0 representing a completely ordered (predictable) system and 1 representing a completely random (unpredictable) system:

Entropy = − Σ_{i=1}^{K} p_i log₂(p_i)

Where K is the total number of possible outcomes and p_i is the probability of outcome i. The logarithm used in the formula is typically the base-2 logarithm, also known as the binary logarithm.

The Gini coefficient is calculated as:

Gini = 1 − Σ_{i=1}^{K} p_i²
A credit card company is building a decision-tree model to classify credit card holders as high-risk or
low-risk for defaulting on their payments. T hey have the following data on whether a credit card
holder has defaulted (“Defaulted”) and two features (for the label and the features, in each case,
"yes" = 1 and "no" = 0): whether the credit card holder has a high income and whether they have a history of late payments.
Defaulted    High_income    Late_payments
    1             1               1
    0             0               0
    0             0               0
    1             1               1
    1             0               1
    0             0               1
    0             1               0
    0             1               0
The base entropy measures the randomness (uncertainty) of the output series before any split is made:

Entropy = − Σ_{i=1}^{K} p_i log₂(p_i)

Where K is the total number of possible outcomes and p_i is the probability of outcome i. The logarithm used in the formula is typically the base-2 logarithm, also known as the binary logarithm.
In this case, three credit card holders defaulted, and five didn't.
Entropy = − (3/8 log₂(3/8) + 5/8 log₂(5/8)) = 0.954
Both features are binary, so there are no issues with determining a threshold as there would be for a
continuous series. T he first stage is to calculate the entropy if the split was made for each of the
two features. Examining the High_income feature first, among high-income credit card owners
(feature = 1), two defaulted while two did not, leading to entropy for this sub-set of:
Entropy = − (2/4 log₂(2/4) + 2/4 log₂(2/4)) = 1
Among non-high income credit card owners (feature = 0), one defaulted while three did not, leading
to an entropy of:
Entropy = − (1/4 log₂(1/4) + 3/4 log₂(3/4)) = 0.811

The weighted entropy after a split on the High_income feature is therefore:

Entropy = (4/8) × 1 + (4/8) × 0.811 = 0.906
We repeat this process by calculating the entropy that would occur if the split was made via the late
payment feature.
T hree of the four credit card owners who made late payments (feature = 1) defaulted, while one did
not.
Entropy = − (3/4 log₂(3/4) + 1/4 log₂(1/4)) = 0.811

Among the four credit card owners who did not make late payments (feature = 0), none defaulted, so the entropy of this sub-set is 0. The weighted entropy after a split on the Late_payments feature is therefore:

Entropy = (4/8) × 0.811 + (4/8) × 0 = 0.4055
Notice that the entropy is minimized (i.e., the information gain is maximized) when the sample is first split by the late payments feature. This becomes the root node of the decision tree. For credit card owners who do not make late payments (i.e., the feature = 0), there is already a pure split, as none of them defaulted. This is to say that, in this sample, credit card holders who make timely payments do not default, which means that no further splits are required along this branch. The (incomplete) tree structure therefore has Late_payments at the root, with the timely-payment branch ending in a "no default" leaf.
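The entropy figures in this example can be reproduced with a few lines of code; the helper function below is an assumption introduced for illustration.

```python
import numpy as np

def entropy(counts):
    """Entropy of a node given the class counts, using base-2 logarithms."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

base = entropy([3, 5])                                              # 3 defaults, 5 non-defaults -> 0.954

# Split on High_income: 2 defaults / 2 non-defaults among high income, 1 / 3 among the rest
high_income = (4/8) * entropy([2, 2]) + (4/8) * entropy([1, 3])     # 0.906

# Split on Late_payments: 3 / 1 among late payers, 0 / 4 among timely payers
late_pay = (4/8) * entropy([3, 1]) + (4/8) * entropy([0, 4])        # 0.4055

print(round(base, 3), round(high_income, 3), round(late_pay, 4))
```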
Ensemble Techniques
Ensemble learning is an approach in which multiple models are used to make predictions rather than relying on the output of a single model. The idea behind ensemble learning is that the individual models in the ensemble may have different error rates and
make noisy predictions. Still, by taking the average result of many predictions from various models,
the noise can be reduced, and the overall forecast can be more accurate.
T here are two objectives of using an ensemble approach in machine learning. First, ensembles can
often achieve better performance than individual models (think of the law of large numbers where,
as the number of models in the ensemble increases, the overall prediction accuracy tends to improve). Second, ensembles can be more robust and less prone to overfitting, as they are able to average out the errors made by individual models. Some ensemble techniques are discussed below.
Bootstrap Aggregation
Bootstrap aggregation (bagging) involves building many decision trees by sampling from the original training data. The decision trees are then combined to make a final prediction. A basic bagging algorithm for a decision tree would involve the following steps:

1. Sample the training data with replacement to obtain multiple subsets of the training data.

2. Construct a decision tree on each subset of the training data using the usual techniques.
3. Combine the predictions made by each of the decision tree models, e.g., average, to make a
forecast.
Sampling with replacement is a statistical method that involves randomly selecting a sample from a
dataset and returning the selected element back into the dataset before choosing the next element.
T his means that an element can be selected multiple times, or it can be left out entirely.
Sampling with replacement allows for the use of out-of-bag (OOB) data for model evaluation. OOB
data are observations that were not selected in a particular sample, and therefore were not used for
model training. These observations can be used to evaluate the model's performance, as they can provide an estimate of how well the model generalizes to data it has not seen.
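The steps above can be reproduced with scikit-learn's BaggingClassifier, whose default base learner is a decision tree; the simulated data and the number of trees are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(11)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 50 trees, each trained on a bootstrap sample drawn with replacement;
# oob_score evaluates each tree on the out-of-bag observations it did not see.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, oob_score=True,
                        random_state=0).fit(X, y)
print(bag.oob_score_)
```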
Random Forests
A random forest is an ensemble of decision trees. T he number of features chosen for each tree is
usually approximately equal to the square root of the total number of features. T he individual
decision trees in a random forest are trained on different subsets of the data and different subsets of
the features, which means that each tree may give a slightly different prediction. However, by
combining the predictions of all the trees, the random forest can produce a more accurate final
prediction. The performance improvements of ensembles are often greatest when the individual model outputs have low correlations with one another, because this increases the benefit of averaging across models.
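A minimal random forest sketch follows; the simulated data are an assumption, and max_features="sqrt" implements the square-root rule of thumb mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(13)
X = rng.normal(size=(400, 9))                        # 9 features -> roughly 3 considered per split
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)

# Each tree sees a bootstrap sample and a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                oob_score=True, random_state=0).fit(X, y)
print(forest.oob_score_)
```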
Boosting
Boosting is an ensemble learning technique that involves training a series of weak models, where
each successive model is trained on the errors or residuals of its predecessor. T he goal of boosting is
to improve the model's overall performance by combining the weaker models' predictions to reduce
bias and variance. Gradient boosting and AdaBoost (Adaptive Boosting) are the most popular methods.
AdaBoost
AdaBoost is a boosting algorithm that trains a series of weak models, where each successive model
focuses more on the examples that were difficult for its predecessor to predict correctly. T his
results in new predictors that concentrate more and more on the hard cases. Specifically, AdaBoost
adjusts the weights of the training examples at each iteration based on the previous model's
performance, focusing the training on the examples that are most difficult to predict. Here is a more detailed outline of the steps:
1. T he AdaBoost algorithm first trains a base classifier (such as a decision tree) on the training
data.
2. T he algorithm then uses the trained classifier to make predictions on the training set and
calculates the errors or residuals between the predicted labels and the true labels.
3. T he algorithm then adjusts the weights of the training examples based on the previous
classifier's performance, focusing the training on the examples that were most difficult to
predict correctly. Specifically, the weights of the misclassified examples are increased, while the weights of correctly classified examples are reduced.
4. A second classifier is then trained on the updated weights. T he whole process is repeated
until a predetermined number of classifiers have been trained, or until the model's performance meets a desired threshold.

The final prediction of the AdaBoost model is calculated by combining the predictions of all of the individual classifiers using a weighted sum, where the weight given to each classifier depends on its accuracy.
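A short AdaBoost sketch using scikit-learn is given below; the simulated data, the number of weak learners, and the learning rate are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(17)
X = rng.normal(size=(400, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)              # a nonlinear class boundary

# Each successive weak learner (a shallow tree by default) upweights the
# observations the previous one misclassified.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5,
                         random_state=0).fit(X, y)
print(ada.score(X, y))
```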
Gradient Boosting
In gradient boosting, a new model is trained on the residuals or errors of the previous model, which
are used as the target labels for the current model. T his process is repeated until a predetermined
number of models have been trained, or until the model's performance meets a desired threshold. In
contrast to AdaBoost, which adjusts the weights of the training examples at each iteration based on the performance of the previous classifier, gradient boosting tries to fit the new predictor directly to the residual errors made by the previous predictor.
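The residual-fitting idea can be written out directly, as in the following sketch; the shallow regression trees, the learning rate, and the simulated data are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(19)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

prediction = np.full_like(y, y.mean())               # start from a constant model
learning_rate = 0.1
for _ in range(100):
    residuals = y - prediction                       # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)    # new predictor fitted to the residuals

print(np.mean((y - prediction) ** 2))                # training MSE shrinks with each round
```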
K-Nearest Neighbors and Support Vector Machine Methods
K-Nearest Neighbors
K-nearest neighbors (KNN) is a supervised machine learning technique commonly used for
classification and regression tasks. T he idea is to find similarities or “nearness” between a new
observation and its k-nearest neighbors in the existing dataset. To do this, the model uses one of the
distance metrics described in the previous chapter (Euclidean distance or Manhattan distance) to
calculate the distance between the new observation and each observation in the training set. T he k
observations with the smallest distances are considered the k-nearest neighbors of the new
observation. T he class label or value of the new observation is determined based on these neighbors'
KNN is sometimes called a “lazy learner” as it does not learn the relationships between the features
and the target like other approaches do. Instead, it simply stores the training data and makes
predictions based on the similarity between the new observation and its K-nearest neighbors in the
training set.
The basic steps involved in implementing the KNN model are: select a value for K and a distance metric; compute the distance between the new observation and every observation in the training set; identify the K closest observations; and combine their labels (by majority vote) or values (by averaging) to form the prediction.
Choosing an appropriate value for K is important, as it can impact the model's ability to generalize to
new data and avoid overfitting or underfitting. If K is too large so that many neighbors are selected, it
will give a high bias but low variance, and vice versa for small K. If the value of K is set too small, it
may result in a model that is more sensitive to individual observations and more complex. T his may
allow the model to fit the training data better. However, it may also make the model more prone to overfitting.

A typical heuristic for selecting K is to set it approximately equal to the square root of the size of the training sample. For example, if the training sample contains 10,000 points, then K could be set to 100.
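A minimal KNN sketch using the square-root heuristic for K is shown below; the simulated training data and the Euclidean distance metric are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(23)
X_train = rng.normal(size=(10000, 3))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

k = int(round(np.sqrt(len(X_train))))                # heuristic: K ~ sqrt(n) = 100
knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean").fit(X_train, y_train)

x_new = np.array([[0.2, -0.1, 1.5]])
print(knn.predict(x_new), knn.predict_proba(x_new))  # majority vote among the 100 nearest neighbors
```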
Support Vector Machines

Support vector machines (SVMs) are supervised machine learning models commonly used for
classification tasks, particularly when there are many features. SVM works by finding the hyperplane that maximizes the distance, called the margin, between the two classes.

This hyperplane (the solid blue line in the figure below) can be thought of as the center line of the widest path that separates the observations into the two classes. The data points on the edge of this path, i.e., the points closest to the hyperplane, are called support vectors.
Emma White is a portfolio manager at Delta Investments, a firm that manages a diverse range of
investment portfolios for its clients. Delta has a portfolio of “investment-grade” stocks, which are
relatively low-risk and have a high likelihood of producing steady returns. T he portfolio also includes
a selection of “non-investment grade” stocks, which are higher-risk and have the potential for higher
returns but also come with a greater risk of loss.
White is considering adding a new stock, ABC Inc., to the portfolio. ABC is a medium-sized company
in the retail sector but has not yet been rated by any of the major credit rating agencies. To
determine whether ABC is suitable for the portfolio, White decides to use machine learning methods
to predict the stock's risk level. How can Emma use the SVM algorithm to explore the implied credit
rating of ABC?
Solution
White would first gather data on the features and target of bonds from companies rated as either
investment grade or non-investment grade. She would then use this data to train the SVM algorithm to
identify the optimal hyperplane that separates the two classes. Once the SVM model is trained,
White can use it to predict the rating of ABC Inc's bonds by inputting the features of the bonds into
the model and noting on which side of the margin the data point lies. If the data point lies on the side
of the margin associated with the investment grade class, then the SVM model would predict that
ABC Inc's bonds are likely to be investment grade. If the data point lies on the side of the margin associated with the non-investment grade class, then the SVM model would predict that ABC Inc's bonds are likely to be non-investment grade.
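A sketch of the workflow White might follow is given below; the two features (think of them as, say, leverage and interest coverage), their values, and the labels are entirely hypothetical.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: [leverage, interest coverage], 1 = investment grade
X = np.array([[0.2, 8.0], [0.3, 7.5], [0.8, 1.5], [0.9, 1.2],
              [0.25, 6.0], [0.7, 2.0], [0.4, 5.5], [0.85, 1.8]])
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])

scaler = StandardScaler().fit(X)
svm = SVC(kernel="linear").fit(scaler.transform(X), y)   # find the maximum-margin hyperplane

abc = scaler.transform([[0.5, 4.0]])                 # ABC Inc.'s (hypothetical) feature values
print(svm.predict(abc))                              # side of the hyperplane -> implied rating class
```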
Neural Networks
Neural networks (NNs), also known as artificial neural networks (ANNs), are machine learning
algorithms capable of learning and adapting to complex nonlinear relationships between input and
output data. T hey can be used for both classification and regression tasks in supervised learning, as
well as for reinforcement learning tasks that do not require human-labeled training data. A feed-forward neural network with backpropagation is a type of artificial neural network that updates its weights by propagating the prediction error backward from the output layer through the network.
In this neural network, there are three input variables, a single hidden layer comprising three nodes
and a single output variable. T he output variable is determined based on the values of the hidden
nodes, which are calculated from the input variables. The equations that determine the hidden node values and the output each apply a function ∅ to a weighted linear combination of their inputs plus a bias. ∅ is known as an activation function, which is a nonlinear function applied to the linear combination of the input values to introduce nonlinearity into the model.
y = ∅(W211 H1 + W221 H2 + W231 H3 + W4)
T he other W parameters (coefficients in the linear functions) are weights. As previously stated, if
the activation functions were not included, the model would only be able to output linear
combinations of the inputs and hidden layer values, limiting its ability to identify complex nonlinear
relationships. T his is not desirable, as the main purpose of using a neural network is to identify and
T he parameters of a neural network are chosen based on the training data, similar to how the
parameters are chosen in linear or logistic regression. To predict the value of a continuous variable,
we can select the parameters that minimize the mean squared errors. We can use a maximum likelihood approach when the output is a categorical variable.
T here are no exact formulas for finding the optimal values for the parameters in a neural network.
Instead, a gradient descent algorithm is used to find values that minimize the error for the training
set. T his involves starting with initial values for the parameters and iteratively adjusting them in the
direction that reduces the error of the objective function. This process is similar to stepping down into a valley until the lowest point is reached.

The learning rate is a hyperparameter that determines the size of the step taken during the gradient
descent algorithm. If the learning rate is too small, it will take longer to reach the optimal
parameters, but if it is too large, the algorithm may oscillate from one side of the valley to another
instead of accurately finding the optimal values. A hyperparameter is a value set before the model
training process begins and is used to control the model's behavior. It is not a parameter of the model
itself but rather a value used to determine how the model will be trained and function.
In the example given earlier, the neural network had 16 parameters (i.e., a total of the weights and
the biases). T he presence of many hidden layers and nodes in a neural network can lead to too many
parameters and the risk of overfitting. To prevent overfitting, calculations are performed on a
validation data set while training the model on the training data set. As the gradient descent algorithm
progresses through the multi-dimensional valley, the objective function will improve for both data
sets.
However, at a certain point, further steps down the valley will begin to degrade the model's
performance on the validation data set while continuing to improve it on the training data set. T his
indicates that the model is starting to overfit, so the algorithm should be stopped to prevent this from
happening.
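The sketch below trains a small feed-forward network with one hidden layer of three nodes and uses early stopping against a held-out validation fraction, in the spirit of the discussion above; the simulated data and the scikit-learn MLPClassifier settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(29)
X = rng.normal(size=(1000, 3))
y = (np.sin(X[:, 0]) + X[:, 1] ** 2 - X[:, 2] > 0.5).astype(int)

# One hidden layer with three nodes; early_stopping holds out a validation
# fraction and stops training when validation performance no longer improves.
nn = MLPClassifier(hidden_layer_sizes=(3,), activation="logistic",
                   learning_rate_init=0.01, early_stopping=True,
                   validation_fraction=0.2, max_iter=2000,
                   random_state=0).fit(X, y)
print(nn.n_iter_, nn.score(X, y))
```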
A confusion matrix is a tool used to evaluate the performance of a binary classification model, where
the output variable is a binary categorical variable with two possible values (such as “default” or “not
default"). It is a 2×2 table that shows, for each possible outcome, whether the prediction was correct:

i. True positive (TP) refers to the number of times the model correctly predicted a positive outcome (e.g., predicted default when the borrower did default).

ii. False negative (FN) refers to the number of times the model incorrectly predicted a negative outcome when the actual outcome was positive.

iii. False positive (FP) refers to the number of times the model incorrectly predicted a positive outcome when the actual outcome was negative.

iv. True negative (TN) refers to the number of times the model correctly predicted a negative outcome.
T he most common performance metrics based on a confusion matrix are:
i. Accuracy: This is the model's overall accuracy, calculated as the number of correct predictions divided by the total number of predictions. In the case of a binary classification:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

ii. Precision: This is the proportion of correct positive predictions, calculated as:

Precision = TP / (TP + FP)
iii. Recall: This is the proportion of actual positive cases that were correctly predicted, calculated as:

Recall = TP / (TP + FN)
iv. The error rate is the proportion of incorrect predictions made by the model, calculated as follows:

Error rate = (FP + FN) / (TP + TN + FP + FN)
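These four metrics can be computed directly from the cell counts, as in the sketch below; the counts used in the example call are hypothetical.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard metrics derived from a 2x2 confusion matrix."""
    total = tp + tn + fp + fn
    return {"accuracy": (tp + tn) / total,
            "precision": tp / (tp + fp),
            "recall": tp / (tp + fn),
            "error_rate": (fp + fn) / total}

# Hypothetical counts for illustration only.
print(confusion_metrics(tp=900, tn=100, fp=50, fn=50))
```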
Suppose we have a dataset of 1600 borrowers, 400 of whom defaulted on their loans and 1200 of
whom did not. We can use logistic regression or a neural network to create a prediction model that
predicts the likelihood that a borrower will default on their loan. We can set a threshold value to convert these predicted probabilities into class labels.
Assume that a neural network with one hidden layer and backpropagation is used to model the data.
T he hidden layer has 5 units, and the activation function used is the logistic function. T he loss
function used in the optimization process is based on an entropy measure. Note that a loss function
is used to evaluate how well a model performs on a given task. T he optimization process aims to find
the set of model parameters that minimize the loss function. Suppose that the optimization process
takes 150 iterations to converge, which means it takes 150 steps to find the set of model parameters that minimize the loss function.
In the context of machine learning, the effectiveness of a model specification is evaluated based on
its performance in classifying a validation sample. For simplicity, a threshold of 0.5 is used to
determine the predicted class label based on the model's output probability. If the probability of a
default predicted by the model is greater than or equal to 0.5, the predicted class label is “default.” If
the probability is less than 0.5, the predicted class label is “no default.”
Adjusting the threshold can affect the true positive and false positive rates in different ways. For
example, if the threshold is set too low, the model may have a high true positive rate and a high false
positive rate because the model is classifying more observations as positive. On the other hand, if
the threshold is set too high, the model may have a low true positive rate and a low false positive rate
because the model is classifying fewer observations as positive. T his trade-off between true positive
and false positive rates is similar to the trade-off between type I and type II errors in hypothesis
testing. In hypothesis testing, a type I error occurs when the null hypothesis is rejected when it is
actually true. In contrast, a type II error occurs when the null hypothesis is not rejected when it is
actually false.
Hypothetical confusion matrices for the logistic and neural network models were computed for both the training and validation samples. The values in those confusion matrices can be used to calculate the following evaluation metrics:
                    Training sample               Validation sample
Performance         Logistic       Neural         Logistic       Neural
metric              regression     network        regression     network
Accuracy            0.781          0.743          0.654          0.651
Precision           0.667          0.470          0.641          0.646
Recall              0.250          0.235          0.364          0.338
T he model appears to perform slightly better on the training data than on the validation data,
indicating that the model is overfitting. To improve the model's performance, it may be beneficial to
remove some of the features with limited empirical relevance or apply regularization to the model.
T hese steps may help reduce overfitting and improve the model's ability to generalize to new data.
T here is not much difference in the performance of the logistic regression and neural network
approaches. T he logistic regression model has a higher true positive rate but a lower true negative
rate for the training data compared to the neural network model. On the other hand, the neural
network model appears to have a higher true positive rate but a lower true negative rate for the validation data.

The receiver operating characteristic (ROC) curve plots the trade-off between the true positive rate and the false positive rate, which is illustrated in the figure below. It is calculated by varying the threshold value or decision boundary, classifying predictions as positive or negative, and plotting the true positive rate and the false positive rate at each threshold.
A higher area under the receiver operating curve (or area under curve/AUC) value indicates better
performance, with a perfect model having an AUC of 1. An AUC value of 0.5 corresponds to the
dashed line in the figure above and indicates that the model is no better than random guessing. In contrast, an AUC value less than 0.5 indicates that the model performs worse than random guessing.
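A minimal sketch of tracing out the ROC curve and computing the AUC with scikit-learn follows; the simulated data and the logistic model are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(31)
X = rng.normal(size=(1600, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=1600) > 0).astype(int)

p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, p)     # true and false positive rates at each threshold
print(roc_auc_score(y, p))                 # area under the ROC curve
```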
Practice Question

An analyst develops two models to predict whether borrowers will default and obtains the following confusion matrices:

Model A
                        Predicted: No Default    Predicted: Default
Actual: No Default      TN = 100                 FP = 50
Actual: Default         FN = 50                  TP = 900

Model B
                        Predicted: No Default    Predicted: Default
Actual: No Default      TN = 120                 FP = 80
Actual: Default         FN = 30                  TP = 870
T he model that is most likely to have a higher accuracy and higher precision,
respectively, is:
Solution
T he correct answer is C.
Model accuracy is calculated as (TP + TN) / (TP + TN + FP + FN):

Model A accuracy = (900 + 100) / (900 + 100 + 50 + 50) = 0.909

Model B accuracy = (870 + 120) / (870 + 120 + 80 + 30) = 0.900
Model A has a slightly higher accuracy than model B.
Model precision is calculated as TP / (TP + FP):

Model A precision = 900 / (900 + 50) = 0.9474

Model B precision = 870 / (870 + 80) = 0.9158

Model A, therefore, also has a higher precision than Model B.