
FRM Part I Exam

By AnalystPrep

Study Notes - Quantitative Analysis

Last Updated: Mar 6, 2023

©2023 AnalystPrep "This document is protected by International copyright laws. Reproduction and/or distribution of this document is prohibited. Infringers will be prosecuted in their local jurisdictions."


Table of Contents

12 - Fundamentals of Probability
13 - Random Variables
14 - Common Univariate Random Variables
15 - Multivariate Random Variables
16 - Sample Moments
17 - Hypothesis Testing
18 - Linear Regression
19 - Regression with Multiple Explanatory Variables
20 - Regression Diagnostics
21 - Stationary Time Series
22 - Nonstationary Time Series
23 - Measuring Return, Volatility, and Correlation
24 - Simulation and Bootstrapping
25 - Machine-Learning Methods
26 - Machine Learning and Prediction

© 2014-2023 AnalystPrep.
Reading 12: Fundamentals of Probability

After completing this reading, you should be able to:

Describe an event and an event space.

Describe independent events and mutually exclusive events.

Explain the difference between independent events and conditionally independent events.

Calculate the probability of an event for a discrete probability function.

Define and calculate a conditional probability.

Distinguish between conditional and unconditional probabilities.

Explain and apply Bayes' rule.

Probability is the foundation of statistics, risk management, and econometrics. It quantifies the likelihood that some event will occur. For instance, we could be interested in the probability that a borrower in a prime mortgage facility will default.

Sample Space, Event Space, and Events

Sample Space (Ω)

A sample space is defined as the collection of all possible outcomes of an experiment. The outcomes depend on the problem being studied. For example, when modeling returns from a portfolio, the sample space is the set of real numbers. As another example, assume we want to model default on a loan payment; there can only be two outcomes: either the firm defaults or it doesn't. As such, the sample space is Ω = {Default, No Default}. To give yet another example, the sample space when a fair six-sided die is tossed is made up of six different outcomes:

Ω = {1, 2, 3, 4, 5, 6}

Events (ω)

An event is a set of outcomes (which may contain more than one element). For example, suppose we toss a die. A "6" would constitute an event. If we toss two dice simultaneously, a {6, 2} would constitute an event. An event that contains only one outcome is termed an elementary event.

Event Space (F)

The event space is the set of all events that can be formed from the outcomes of an experiment, i.e., all combinations of outcomes. For example, consider a scenario where we toss two fair coins simultaneously. The possible outcomes are:

{HH, HT, TH, TT}

and the event space consists of all subsets of this set.

Note: If the coins are fair, the probability of a head, P(H), equals the probability of a tail, P(T).

Probability

The probability of an event refers to the likelihood of that particular event occurring. For example, the probability of a head when we toss a fair coin is 0.5, and so is the probability of a tail.

Under the frequentist interpretation, the probability of an event is the limit of its relative frequency in a large number of independent trials. This is a conceptual definition; in finance, we often deal with actual, non-experimental events, such as the return earned on a stock.

Independent and Mutually Exclusive Events

Mutually Exclusive Events

Two events, A and B, are said to be mutually exclusive if the occurrence of A rules out the occurrence of B, and vice versa. For example, a car cannot turn left and turn right at the same time.

Mutually exclusive events are such that one event precludes the occurrence of all the other events. Thus, if you roll a die and a 4 comes up, that particular event precludes all the other events, i.e., 1, 2, 3, 5, and 6. In other words, rolling a 1 and rolling a 5 are mutually exclusive events: they cannot occur simultaneously.

Furthermore, there is no way a single investment can have more than one arithmetic mean return. Thus, arithmetic returns of, say, 20% and 17% constitute mutually exclusive events.

Independent Events

Two events, A and B, are independent if the fact that A occurs does not affect the probability of B occurring. When two events are independent, both events can happen at the same time, but the probability of one event happening does not depend on whether the other event occurs or not. For example, we can define A as the event that it rains in New York on March 15 and B as the event that it rains in Frankfurt on March 15. In this instance, both events can happen simultaneously or not.

Another example would be defining event A as getting tails on the first coin toss and B as getting tails on the second coin toss. Landing on tails on the first toss will not affect the probability of getting tails on the second toss.

Intersection

The intersection of two events, A and B, is the set of outcomes occurring in both A and B, denoted A ∩ B; its probability is P(A ∩ B). Using a Venn diagram, this is represented as:

For independent events,

P (A ∩ B) = P (A and B) = P (A) × P (B)

Independence can be extended to n events: let A1, A2, …, An be independent events; then:

P (A1 ∩ A2 ∩ … ∩ An ) = P (A1) × P (A2) × … × P (An )

For mutually exclusive events,

P (A ∩ B) = P (A and B) = 0

This is because A's occurrence rules out B's occurrence. Remember that a car cannot turn left and turn right at the same time!

Union

The union of two events, A and B, is the set of outcomes occurring in at least one of the two sets, A or B, denoted A ∪ B; its probability is P(A ∪ B). Using a Venn diagram, this is represented as:

To determine the probability that at least one of two mutually exclusive events occurs, we sum their individual probabilities. The following is the statistical notation:

P (A ∪ B) = P (A or B) = P (A) + P (B)

Given two events, A and B, that are not necessarily mutually exclusive, the probability that at least one of the events will occur is given by:

P (A ∪ B) = P (A or B) = P (A) + P (B) − P (A ∩ B)
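The intersection and union rules above can be checked with a short numerical sketch (the event probabilities below are made up for illustration, not taken from the text):

```python
# Two independent events: P(A and B) is the product of the marginals.
p_a, p_b = 0.30, 0.50
p_a_and_b = p_a * p_b               # valid only under independence

# Union rule for events that are not mutually exclusive:
# P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = p_a + p_b - p_a_and_b

# For mutually exclusive events the intersection is empty,
# so the union rule collapses to a simple sum.
p_c, p_d = 0.20, 0.25               # e.g., "turn left" vs. "turn right"
p_c_or_d = p_c + p_d

print(p_a_and_b, p_a_or_b, p_c_or_d)
```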

The Complement of a Set

Another important concept is the complement of a set, denoted Ac (where A can be any event): the set of outcomes that are not in A. For example, consider the following Venn diagram:

Since A and Ac are mutually exclusive and together make up the entire sample space, it follows that:

P (A ∪ Ac) = P (A) + P (Ac ) = 1

Conditional Probability

Until now, we've only looked at unconditional probabilities. An unconditional probability (also known as a marginal probability) is simply the probability that an event occurs without considering any other preceding events. In other words, unconditional probabilities are not conditioned on the occurrence of any other events; they are 'stand-alone' probabilities.

Conditional probability is the probability of one event occurring given some relationship to one or more other events. Our interest lies in the probability of an event A given that another event B has already occurred. Here's what you should ask yourself: "What is the probability of one event occurring if another event has already taken place?" We pronounce P(A|B) as "the probability of A given B," and it is given by:

P(A│B) = P(A ∩ B) / P(B)

The bar sandwiched between A and B simply indicates "given."

Bayes' Theorem

Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. Given two events, A and B, Bayes' theorem states that:

P(A|B) = (P(B|A) × P(A)) / P(B)

Applying Bayes' Theorem

Suppose we hold two bonds, A and B. Each bond has a default probability of 10% over the following year. We are also told that there is a 6% chance that both bonds will default, an 86% chance that neither defaults, and a 14% chance that at least one of the bonds defaults. All of this information can be summarized in a probability matrix.

Often, there is a high correlation between bond defaults. This can be attributed to the sensitivity of bond issuers to broad economic conditions. The 6% chance of both bonds defaulting is higher than the 1% chance that would apply had the default events been independent (10% × 10% = 1%).

The features of the probability matrix can also be expressed in terms of conditional probabilities.

For example, the probability that bond A will default given that B has defaulted is computed as:

P(A|B) = P[A ∩ B] / P[B] = 6% / 10% = 60%

This means that in 60% of the scenarios in which bond B defaults, bond A will also default.

The above equation is often written as:

P[A ∩ B] = P(A|B) × P[B]    (I)

Also:

P[A ∩ B] = P(B|A) × P[A]    (II)

Equating the right-hand sides of equations (I) and (II) and rearranging gives Bayes' theorem:

P(B│A) × P[A] = P(A│B) × P[B]

⇒ P(A|B) = (P(B|A) × P[A]) / P[B]
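As a minimal sketch, the conditional-probability and Bayes' relationships can be verified with the bond figures from the example above (P(A) = P(B) = 10%, P(A ∩ B) = 6%):

```python
p_a = 0.10        # P(A): bond A defaults
p_b = 0.10        # P(B): bond B defaults
p_a_and_b = 0.06  # P(A ∩ B): both default

# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = p_a_and_b / p_b

# Bayes' theorem recovers P(B|A) from P(A|B):
p_b_given_a = p_a_given_b * p_b / p_a  # equals P(A|B) here since P(A) = P(B)

print(p_a_given_b, p_b_given_a)
```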

When presented with new data, Bayes' theorem can be applied to update beliefs. To understand how the theorem provides a framework for exactly what the new beliefs should be, consider the following scenario:

Example: Applying Bayes' Theorem

Based on an examination of historical data, it's been determined that all fund managers at a certain fund fall into one of two groups: Stars and Non-Stars. Stars are the best managers. The probability that a Star will beat the market in any given year is 75%. Other managers are just as likely to beat the market as they are to underperform it (i.e., Non-Stars have 50/50 odds of beating the market). For both types of managers, the probability of beating the market is independent from one year to the next. Stars are rare: of a given pool of managers, only 16% turn out to be Stars.

A new manager was added to the portfolio of funds three years ago. Since then, the new manager has

beaten the market every year. What was the probability that the manager was a star when the

manager was first added to the portfolio? What is the probability that this manager is a star now?

What's the probability that the manager will beat the market next year, given that he has beaten it in

the past three years?

Sol uti on

We first summarize the data by introducing some notation. The probability that a manager beats the market given that he is a Star is:

P(B|S) = 0.75 = 3/4

The probability that a Non-Star manager beats the market is:

P(B|S̄) = 0.5 = 1/2

The probability that the new manager was a Star at the time he was added to the portfolio is simply the unconditional probability that any manager is a Star:

P[S] = 0.16 = 4/25

To evaluate the probability that he is a Star at present, we compute the probability that he is a Star given that he has beaten the market for three consecutive years, P(S|3B), using Bayes' theorem:

P(S|3B) = (P(3B|S) × P[S]) / P[3B]

P(3B|S) = (3/4)³ = 27/64

The denominator is the unconditional probability that a manager beats the market for three consecutive years:

P[3B] = P(3B|S) × P[S] + P(3B|S̄) × P[S̄]

P[3B] = (3/4)³ × 4/25 + (1/2)³ × 21/25 = 27/400 + 42/400 = 69/400

Therefore:

P(S|3B) = ((27/64) × (4/25)) / (69/400) = 27/69 = 9/23 ≈ 39%

Therefore, there is a 39% chance that the manager is a Star after beating the market for three consecutive years. This is our new belief, and it is a significant improvement from our old belief of 16%.

Finally, we compute the probability that the manager beats the market next year. This is the sum of the probability of a Star beating the market and the probability of a Non-Star beating the market, weighted by the new beliefs:

P[B] = P(B|S) × P[S] + P(B|S̄) × P[S̄]

P[B] = 3/4 × 9/23 + 1/2 × 14/23 = 27/92 + 28/92 = 55/92 ≈ 60%

Recall that:

P(S|3B) = (P(3B|S) × P[S]) / P[3B]

The left-hand side of this formula is the posterior. In the numerator, the first factor is the likelihood, and the second is the prior.
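The whole update can be sketched in a few lines (a minimal check of the numbers above; fractions are used to avoid rounding):

```python
p_star = 4 / 25                      # prior: P[S] = 16%
p_beat_star = 3 / 4                  # P(B|S)
p_beat_nonstar = 1 / 2               # P(B|S̄)

# Likelihoods of beating the market three years running
lik_star = p_beat_star ** 3          # 27/64
lik_nonstar = p_beat_nonstar ** 3    # 1/8

# Total probability of three wins, then the Bayes posterior
p_3b = lik_star * p_star + lik_nonstar * (1 - p_star)   # 69/400
posterior = lik_star * p_star / p_3b                    # 9/23

# Probability of beating the market next year under the new belief
p_next = p_beat_star * posterior + p_beat_nonstar * (1 - posterior)  # 55/92

print(round(posterior, 4), round(p_next, 4))
```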

Question 1

The probability that the Eurozone economy will grow this year is 18%, and the

probability that the European Central Bank (ECB) will loosen its monetary policy is 52%.

Assume that the joint probability that the Eurozone economy will grow and the ECB will

loosen its monetary policy is 45%. What is the probability that either the Eurozone

economy will grow or the ECB will loosen its monetary policy?

A. 42.12%

B. 25%

C. 11%

D. 17%

The correct answer is B.

The addition rule of probability is used to solve this question:

P(E) = 0.18 (the probability that the Eurozone economy will grow is 18%)

P(M) = 0.52 (the probability that the ECB will loosen its monetary policy is 52%)

P(EM) = 0.45 (the joint probability that the Eurozone economy will grow and the ECB will loosen its monetary policy is 45%)

The probability that either the Eurozone economy will grow or the central bank will loosen its monetary policy is:

P(E or M) = P(E) + P(M) − P(EM) = 0.18 + 0.52 − 0.45 = 0.25

Question 2

A mathematician has given you the following conditional probabilities:

p(O|T) = 0.62: conditional probability of reaching the office if the train arrives on time
p(O|Tᶜ) = 0.47: conditional probability of reaching the office if the train does not arrive on time
p(T) = 0.65: unconditional probability of the train arriving on time
p(O) = ?: unconditional probability of reaching the office

What is the unconditional probability of reaching the office, p(O)?

A. 0.4325

B. 0.5675

C. 0.3856

D. 0.5244

The correct answer is B.

This question can be solved using the total probability rule.

If p(T) = 0.65 (the unconditional probability of the train arriving on time is 0.65), then the unconditional probability of the train not arriving on time is p(Tᶜ) = 1 − p(T) = 1 − 0.65 = 0.35.

Now, we can solve for:

p(O) = p(O|T) × p(T) + p(O|Tᶜ) × p(Tᶜ) = 0.62 × 0.65 + 0.47 × 0.35 = 0.5675

Note: p(O) is the unconditional probability of reaching the office. It is simply the sum of:

1. the probability of reaching the office if the train arrives on time, multiplied by the probability of the train arriving on time, and

2. the probability of reaching the office if the train does not arrive on time, multiplied by the probability of the train not arriving on time (or, given the information, one minus the probability of the train arriving on time).
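The total probability rule used in this solution can be sketched directly:

```python
p_t = 0.65                 # p(T): train arrives on time
p_o_given_t = 0.62         # p(O|T)
p_o_given_not_t = 0.47     # p(O|T^c)

# Total probability rule: weight each conditional probability
# by the probability of its conditioning scenario.
p_o = p_o_given_t * p_t + p_o_given_not_t * (1 - p_t)
print(p_o)
```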

Question 3

Suppose you are an equity analyst for the XYZ investment bank. You use historical data

to categorize the managers as excellent or average. Excellent managers outperform the

market 70% of the time and average managers outperform the market only 40% of the

time. Furthermore, 20% of all fund managers are excellent managers and 80% are simply

average. The probability of a manager outperforming the market in any given year is independent of their performance in any other year.

A new fund manager started three years ago and outperformed the market all three

years. What’s the probability that the manager is excellent?

A. 29.53%

B. 12.56%

C. 57.26%

D. 30.21%

The correct answer is C.

The best way to visualize this problem is to start off with a probability matrix:

Kind of manager   Probability   Probability of beating market
Excellent         0.2           0.7
Average           0.8           0.4

Let E be the event of an excellent manager, and A represent the event of an average

manager.

P(E) = 0.2 and P(A) = 0.8

Further, let O be the event of outperforming the market.

We know that:

P(O|E) = 0.7 and P(O|A) = 0.4

We want P(E|O):

P(E|O) = (P(O|E)³ × P(E)) / (P(O|E)³ × P(E) + P(O|A)³ × P(A))
       = (0.7³ × 0.2) / (0.7³ × 0.2 + 0.4³ × 0.8)
       = 0.0686 / (0.0686 + 0.0512)
       = 57.26%

Note: The power of three is used because the manager outperformed in three consecutive (independent) years.

Reading 13: Random Variables

After completing this reading, you should be able to:

Describe and distinguish a probability mass function from a cumulative distribution function and explain the relationship between these two.

Understand and apply the concept of a mathematical expectation of a random variable.

Describe the four common population moments.

Explain the differences between a probability mass function and a probability density function.

Characterize the quantile function and quantile-based estimators.

Explain the effect of a linear transformation of a random variable on the mean, variance, standard deviation, skewness, kurtosis, median, and interquartile range.

Random Variables

A random variable is a variable whose possible values are outcomes of a random phenomenon. It is a function that maps outcomes of a random process to real values. A realization is a particular observed value of a random variable.

Precisely, if ω is an element of a sample space Ω and x is the realization, then X(ω) = x. Conventionally, random variables are written in upper case (such as X, Y, and Z) while the realized values are represented in lower case (such as x, y, and z).

For example, let X be the random variable resulting from rolling a die. Then x is the outcome of one roll, and it could take any of the values 1, 2, 3, 4, 5, or 6. The probability that the resulting random variable is equal to 3 can be expressed as:

P(X = x) where x = 3

Types of Random Variables

Discrete Random Variables

A discrete random variable is one that produces a set of distinct values. A discrete random variable arises:

If the range of all possible values is a finite set, e.g., {1, 2, 3, 4, 5, 6} as in the case of a six-sided die, or

If the range of all possible values is a countably infinite set, e.g., {1, 2, 3, ...}

Examples of discrete random variables include:

Picking a random stock from the S&P 500.

The number of candidates registered for the FRM Part 1 exam at any given time.

The number of study topics in a program.

Probability Functions under Discrete Random Variables

Since the possible values of a random variable are mostly numerical, they can be described using mathematical functions. A function fX(x) = P(X = x) for each x in the range of X is the probability function (PF) of X and explains how the total probability (which is 1) is distributed amongst the possible values of X.

There are two functions used to describe the distribution of a discrete random variable: the probability mass function (PMF) and the cumulative distribution function (CDF).

Probability Mass Function (PMF)

This function gives the probability that a random variable takes a particular value. Since the PMF outputs probabilities, it must possess the following properties:

1. fX(x) ≥ 0 for all x in the range of X (the value returned must be nonnegative)

2. Σx fX(x) = 1 (the sum across all values in the support of the random variable must equal 1)

Example: Bernoulli Distribution

Assume that X is a Bernoulli random variable. The PMF of X is given by:

fX(x) = p^x (1 − p)^(1−x), x = 0, 1

A Bernoulli random variable takes only the values 0 and 1. Therefore,

fX(0) = p⁰(1 − p)¹⁻⁰ = 1 − p

And

fX(1) = p¹(1 − p)¹⁻¹ = p

Looking at the above results, the first property of probability distributions (fX(x) ≥ 0) is met. For the second property:

Σx fX(x) = fX(0) + fX(1) = (1 − p) + p = 1

Moreover, the probability of observing the value 0 is 1 − p, and the probability of observing the value 1 is p. More precisely,

fX(x) = { 1 − p,  x = 0
        { p,      x = 1

The graph of the Bernoulli PMF is shown below, assuming p = 0.7. Note that the PMF is only defined for x = 0, 1.
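A minimal sketch of this PMF and its two defining properties (the function name is illustrative, not from the text):

```python
def bernoulli_pmf(x, p=0.7):
    """PMF of a Bernoulli(p) random variable, defined only for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("Bernoulli PMF is defined only for x = 0 or 1")
    return p ** x * (1 - p) ** (1 - x)

# Both PMF properties hold: nonnegative values that sum to 1.
values = [bernoulli_pmf(x) for x in (0, 1)]
print(values)
```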

Cumulative Distribution Function (CDF)

The CDF measures the probability of realizing a value less than or equal to the input x, Pr(X ≤ x). It is denoted by FX(x), so:

FX(x) = Pr(X ≤ x)

The CDF is non-decreasing in x since it measures cumulative probability. Unlike the PMF, it is defined for all real values of x, and its output ranges from 0 to 1 inclusive.

For instance, the CDF of a Bernoulli random variable is:

FX(x) = { 0,      x < 0
        { 1 − p,  0 ≤ x < 1
        { 1,      x ≥ 1

FX(x) is defined for all real values of x. The graph of FX(x) against x begins at 0, then rises by jumps as values of x are realized for which P(X = x) is positive. The graph reaches its maximum value of 1. For the Bernoulli distribution with p = 0.7, the graph is shown below:

Since the CDF is defined for all values of x, the CDF for a Bernoulli distribution with parameter p = 0.7 is:

FX(x) = { 0,    x < 0
        { 0.3,  0 ≤ x < 1
        { 1,    x ≥ 1

The corresponding graph is as shown above.

Relationship Between the CDF and PMF with Discrete Random Variables

The CDF can be represented as the sum of the PMF over all values that are less than or equal to x. Simply put:

FX(x) = Σ_{t ∈ R(X), t ≤ x} fX(t)

where R(X) is the range of realized values of X.

On the other hand, for an integer-valued random variable, the PMF is the difference between consecutive values of the CDF. That is:

fX(x) = FX(x) − FX(x − 1)

Example: PMF and CDF under Discrete Random Variables

There are 8 hens with different weights in a cage. Hens 1 to 3 weigh 1 kg, hens 4 and 5 weigh 2 kg, and the rest weigh 3 kg. We need to develop the PMF and the CDF.

Solution

The random variable X here is the weight of a hen (X = 1, 2, or 3 kg):

fX(1) = Pr(X = 1) = 3/8
fX(2) = Pr(X = 2) = 2/8 = 1/4
fX(3) = Pr(X = 3) = 3/8

So, the PMF is:

fX(x) = { 3/8, x = 1
        { 1/4, x = 2
        { 3/8, x = 3

For the CDF, we accumulate over all the realized values of the random variable. So,

FX(0) = Pr(X ≤ 0) = 0
FX(1) = Pr(X ≤ 1) = 3/8
FX(2) = Pr(X ≤ 2) = 3/8 + 2/8 = 5/8   [Using FX(x) = Σ_{t ∈ R(X), t ≤ x} fX(t)]
FX(3) = Pr(X ≤ 3) = 5/8 + 3/8 = 1

So the CDF is:

FX(x) = { 0,    x < 1
        { 3/8,  1 ≤ x < 2
        { 5/8,  2 ≤ x < 3
        { 1,    3 ≤ x

Note that

fX(x) = FX(x) − FX(x − 1)

which implies that:

fX(3) = FX(3) − FX(2) = 1 − 5/8 = 3/8

This gives the same result as before.
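The PMF-to-CDF relationship in this example can be sketched as follows (a minimal illustration; representing the PMF as a dictionary is an assumption of this sketch):

```python
# PMF of the hen-weight example: weights in kg mapped to probabilities
pmf = {1: 3/8, 2: 2/8, 3: 3/8}

def cdf(x):
    """CDF built from the PMF: sum f(t) over all realized t <= x."""
    return sum(p for t, p in pmf.items() if t <= x)

print([cdf(x) for x in (0, 1, 2, 3)])

# Recover a PMF value as the difference of consecutive CDF values
assert cdf(3) - cdf(2) == pmf[3]   # 1 - 5/8 = 3/8
```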

Continuous Random Variables

A continuous random variable can assume any value along a given interval of the number line. For instance, x > 0, −∞ < x < ∞, or 0 < x < 1. Examples of continuous random variables include the price of a stock or bond, or the value at risk of a portfolio at a particular point in time.

The following relationship holds for a continuous random variable X:

P[r1 < X < r2] = p

This implies that p is the likelihood that the random variable X falls between r1 and r2.

The Probability Density Function (PDF) under Continuous Random Variables

A probability density function (PDF) allows us to calculate the probability of an event. Given a PDF f(x), we can determine the probability that X falls between a and b:

Pr(a < X ≤ b) = ∫ₐᵇ f(x) dx

The probability that X lies between two values is the area under the density function graph between the two values. Probability distribution function is another term sometimes used to refer to the probability density function.

The properties of the PDF are analogous to those of the PMF. That is:

1. fX(x) ≥ 0 for −∞ ≤ x ≤ ∞ (nonnegativity)

2. ∫_{r_min}^{r_max} f(x) dx = 1 (the total probability must equal 1, just as for discrete random variables)

The upper and lower bounds of the support of f(x) are denoted r_min and r_max.

Cumulative Distribution Functions (CDF) under Continuous Random Variables

It is also called the cumulative density function and is closely related to the concept of a PDF. A CDF defines the likelihood of a random variable falling below a specific value. To determine the CDF, the PDF is integrated from its lower bound.

Traditionally, the capital letter of the corresponding density function is used to denote the CDF. The following computation depicts the CDF, F(x), of a random variable X whose PDF is f(x):

F(a) = ∫_{−∞}^{a} f(x) dx = P[X ≤ a]

The CDF is the area under the PDF up to the evaluation point. The CDF is non-decreasing and varies from zero to one. The CDF must be zero at the lower bound of the support, since the variable cannot be less than its minimum. The likelihood that the random variable is less than or equal to its maximum is 100%.

To obtain the PDF from the CDF, we compute the first derivative of the CDF. Therefore:

f(x) = dF(x)/dx

Next, we look at how to determine the probability that a random variable X will fall between two values, a and b:

P[a < X ≤ b] = ∫ₐᵇ f(x) dx = F(b) − F(a)

where a is less than b.

The following relationship is also true:

P [X > a] = 1 − F (a)

Example: Formulating the CDF of a Continuous Random Variable

The continuous random variable X has a PDF of f(x) = 12x²(1 − x) for 0 < x < 1. We need to find an expression for F(x).

Solution

We know that:

F(x) = ∫_{−∞}^{x} f(t) dt

F(x) = ∫₀ˣ 12t²(1 − t) dt = [4t³ − 3t⁴]₀ˣ = x³(4 − 3x)

So,

F(x) = x³(4 − 3x)
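The closed-form F(x) can be cross-checked against a direct numerical integration of the PDF (a midpoint-rule sketch; the helper names are illustrative):

```python
def pdf(x):
    return 12 * x**2 * (1 - x)   # f(x) on 0 < x < 1

def cdf_closed_form(x):
    return x**3 * (4 - 3 * x)    # F(x) derived above

def cdf_numeric(x, n=100_000):
    """Midpoint-rule approximation of the integral of the PDF from 0 to x."""
    h = x / n
    return sum(pdf((i + 0.5) * h) for i in range(n)) * h

for x in (0.25, 0.5, 0.9):
    assert abs(cdf_numeric(x) - cdf_closed_form(x)) < 1e-6
print(cdf_closed_form(0.5))
```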

Expected Values

Expected values are numerical summaries of features of the distribution of a random variable. Denoted by E[X] or μ, the expected value gives the measure of the average, or center, of the distribution of X. The expected value is the mean of the distribution of X.

For a discrete random variable, the expected value is given by:

E[X] = Σx x fX(x)

It is simply the sum of the products of each value of the random variable and the probability assumed by that value.

Example: Calculating the Expected Value of a Discrete Random Variable

There are 8 hens with different weights in a cage. Hens 1 to 3 weigh 1 kg, hens 4 and 5 weigh 2 kg, and the rest weigh 3 kg. We need to calculate the mean weight of the hens.

Solution

We had calculated the PMF as:

f(x) = { 3/8, x = 1
       { 1/4, x = 2
       { 3/8, x = 3

Now,

E[X] = Σx x f(x) = 1 × 3/8 + 2 × 1/4 + 3 × 3/8 = 2

So, the mean weight of the hens in the cage is 2 kg.

For a continuous random variable, the mean is given by:

E[X] = ∫_{−∞}^{∞} x f(x) dx

That is, we integrate the product of the value of the random variable and the density at that value.

Example: Calculating the Expected Value of a Continuous Random Variable

The continuous random variable X has a PDF of f(x) = 12x²(1 − x) for 0 < x < 1. We need to calculate E[X].

Solution

We know that:

E[X] = ∫_{−∞}^{∞} x f(x) dx

So,

E[X] = ∫₀¹ x · 12x²(1 − x) dx = [3x⁴ − (12/5)x⁵]₀¹ = 0.6

For functions of random variables, we apply the same method as for a "single" random variable. That is, we sum or integrate the product of the value of the function of the random variable and the probability assumed by the corresponding value of the random variable.

Assume that the function of the random variable is g(x). Then:

E[g(X)] = Σx g(x) f(x)

for the discrete case, and

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

for the continuous case.

Example: Calculating the Expected Value of a Function of a Random Variable

A random variable X has a PDF of:

fX(x) = (1/5)x², for 0 < x < 3

Calculate E(2X + 1).

Solution

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

E(2X + 1) = ∫₀³ (2x + 1) · (1/5)x² dx = (1/5)[x⁴/2 + x³/3]₀³ = 9.9
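The integral in this example can be confirmed with a midpoint-rule sketch (illustrative helper names; the PDF is the one given in the example):

```python
def f(x):
    return (1/5) * x**2   # PDF as given in the example, on 0 < x < 3

def g(x):
    return 2 * x + 1      # the function whose expectation we want

def expectation(fn, a=0.0, b=3.0, n=200_000):
    """Midpoint-rule approximation of the integral of fn(x) * f(x) over (a, b)."""
    h = (b - a) / n
    mids = (a + (i + 0.5) * h for i in range(n))
    return sum(fn(m) * f(m) for m in mids) * h

print(round(expectation(g), 4))
```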

Properties of Expectation

The expectation operator is a linear operator. The expectation of a constant is the constant itself, that is, E(c) = c. Moreover, the expected value of a random variable is a constant, not a random variable.

For a non-linear function g(x), E(g(X)) ≠ g(E(X)) in general. For instance, E(1/X) ≠ 1/E(X).

The Variance of a Random Variable

The variance of a random variable measures the spread (dispersion or variability) of the distribution about its mean. Mathematically,

Var(X) = E[X − E(X)]² = E(X²) − E(X)²

The standard deviation is the square root of the variance. Now, denoting E(X) = μ, then:

Var(X) = E(X²) − μ²

Example: Calculating the Variance of a Random Variable

The continuous random variable X has a PDF of f(x) = 12x²(1 − x) for 0 < x < 1. We need to calculate Var[X].

Solution

We know that:

Var(X) = E(X²) − E(X)²

We had calculated E(X) = 0.6. We now need E(X²):

E(X²) = ∫₀¹ x² · 12x²(1 − x) dx = ∫₀¹ (12x⁴ − 12x⁵) dx = [(12/5)x⁵ − 2x⁶]₀¹ = 0.4

So,

Var(X) = 0.4 − 0.6² = 0.04
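The mean and variance just computed can be cross-checked numerically (a midpoint-rule sketch with illustrative helper names):

```python
def pdf(x):
    return 12 * x**2 * (1 - x)   # f(x) on 0 < x < 1

def moment(k, n=200_000):
    """Midpoint-rule approximation of the non-central moment E[X^k]."""
    h = 1.0 / n
    return sum(((i + 0.5) * h) ** k * pdf((i + 0.5) * h) for i in range(n)) * h

mean = moment(1)             # E[X]
var = moment(2) - mean ** 2  # Var(X) = E[X^2] - E[X]^2
print(round(mean, 4), round(var, 4))
```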

Moments

Moments are expected values that briefly describe the features of a distribution. The first moment is defined to be the expected value of X:

μ1 = E(X)

Therefore, the first moment provides information about the average value. The second and higher moments are broadly divided into central and non-central moments.

Central Moments

The general formula for the central moments is:

μk = E([X − E(X)]^k), k = 2, 3, …

where k denotes the order of the moment. Central moments are moments about the mean.

Non-Central Moments

Non-central moments are moments about 0. The general formula is given by:

μk = E(X^k)

Note that the central moments can be constructed from the non-central moments, and that the first central and non-central moments coincide with μ1 = E(X).

Population Moments

The four common population moments are the mean, variance, skewness, and kurtosis.

The Mean

The mean is the first moment and is given by:

μ = E(X)

It is the average value of X (also called the location of the distribution).

The Variance

This is the second central moment. It is presented as:

σ² = E([X − E(X)]²) = E[(X − μ)²]

The variance measures the spread of the random variable about its mean. The standard deviation (σ) is the square root of the variance. The standard deviation is more commonly quoted in the world of finance because it is directly comparable to the mean, since they share the same measurement units.

The Skewness

Skewness is a cubed standardized central moment given by:

skew(X) = E([X − E(X)]³)/σ³ = E[((X − μ)/σ)³]

Note that (X − μ)/σ is a standardized version of X, with a mean of 0 and a variance of 1.

Skewness can be positive or negative.

Positive skew

The right tail is longer.

The mass of the distribution is concentrated on the left.

There are a few relatively high values.

In most cases (but not always), the mean is greater than the median, or equivalently,

the mean is greater than the mode, in which case the skewness is greater than zero.

Negative skew

The left tail is longer.

The mass of the distribution is concentrated on the right.

The distribution has a few relatively low values.

In most cases (but not always), the mean is lower than the median, or equivalently,

the mean is lower than the mode, in which case the skewness is lower than zero.

Kurtosis

The kurtosis is defined as the fourth standardized moment, given by:

Kurt(X) = E([X − E(X)]⁴)/σ⁴ = E[((X − μ)/σ)⁴]

The description of kurtosis is analogous to that of skewness, except that the even (fourth) power

means kurtosis measures the magnitude of deviations regardless of their sign. The reference value for

a normally distributed random variable is 3. A random variable with kurtosis exceeding 3 is termed

heavy- or fat-tailed.
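The standardized-moment definitions above can be estimated directly from data. The following is a minimal sketch (our own helper, standard library only) that checks a large simulated normal sample against the reference values of skewness 0 and kurtosis 3:

```python
# Estimate the k-th standardized moment E[((X - mu)/sigma)^k] from a sample.
import random, math

def standardized_moment(xs, k):
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return sum(((x - mu) / sigma) ** k for x in xs) / len(xs)

random.seed(42)
normal_sample = [random.gauss(0, 1) for _ in range(200_000)]
print(standardized_moment(normal_sample, 3))  # skewness, close to 0 for a normal
print(standardized_moment(normal_sample, 4))  # kurtosis, close to 3 for a normal
```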

Effect of Linear Transformation on Moments

In very basic terms, a linear transformation is a change to a variable characterized by one or

more of the major math operations:

adding a constant to the variable,

subtracting a constant from the variable,

multiplying the variable by a constant,

and/or dividing the variable by a constant.

A linear transformation results in the formation of a new random variable.

If X is a random variable and α and β are constants, then α + βX is a linear transformation of X. α is

referred to as the shift constant, and β is the scale constant. The transformation shifts X by α

and scales it by β. The process results in the formation of a new random variable, usually denoted by

Y:

Y = α + βX

Linear transformation of random variables is informed by the fact that many variables used in finance

and risk management do not have a natural scale.

Example: Linear Transformation of Random Variables

Suppose your salary is α dollars per year, and you are entitled to a bonus of β dollars for every dollar

of sales you successfully bring in. Let X be what you sell in a certain year. How much in total do you

make?

Solution

We can linearly transform the sales variable X into a new variable Y that represents the total amount

made.

Y = α + βX

Where α serves as the shift constant and β as the scale constant.

Effect on Mean and Variance

If Y = α + βX, where α and β are constants, the mean of Y is given by:

E(Y) = E(α + βX) = α + βE(X)

The variance is given by:

Var(Y) = Var(α + βX) = β²Var(X) = β²σ²

The shift parameter α does not affect the variance. Why? Because variance is a measure of spread

from the mean; adding α does not change the spread but merely shifts the distribution to the left or

right.

The standard deviation of Y is given by:

√(β²σ²) = |β|σ

It also follows that α does not affect the standard deviation.
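A quick simulation makes the effect of the shift and scale constants concrete. This is a hedged sketch (the constants are illustrative, chosen to match Question 1 at the end of this reading):

```python
# Simulation check: for Y = alpha + beta*X, Var(Y) = beta^2 * Var(X), so the
# shift alpha drops out and only the scale beta matters.
import random

random.seed(7)
alpha, beta = 3.0, -4.0
xs = [random.gauss(4, 2) for _ in range(100_000)]   # X with mean 4, sd 2
ys = [alpha + beta * x for x in xs]

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

print(round(var(ys) / var(xs), 6))   # beta**2 = 16.0
```

The ratio is exactly β² because each y is an exact linear transform of the corresponding x, so the sample variance scales by β² identically.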

Effect on Skewness and Kurtosis

It can also be shown that if β is positive (so that Y = α + βX is an increasing transformation), then the

skewness and kurtosis of Y are identical to the skewness and kurtosis of X. This is because both

moments are defined on standardized quantities, which removes the effect of the shift constant α and

the scaling factor β. This can be seen as follows:

We know that:

skew(X) = E[((X − μ)/σ)³]

Now,

skew(Y) = E([Y − E(Y)]³)/σY³ = E[((Y − E(Y))/σY)³]

= E[((α + βX − (α + βμ))/(βσ))³]

= E[((β(X − μ))/(βσ))³] = E[((X − μ)/σ)³] = skew(X)

However, if β < 0, the magnitude of skewness of Y is the same as that of X but with the opposite sign

because of the odd power (i.e., 3). On the other hand, the kurtosis is unaffected because it uses an

even power (i.e., 4).

Quantiles and Modes

Just like any data, quantities such as the quantiles and the modes are used to describe the distribution.

The Quantiles

For a continuous random variable X, the α-quantile of X is the smallest number m such that:

Pr(X ≤ m) = α

where α ∈ [0, 1].

For instance, if X is a continuous random variable, the median is defined to be the solution of:

P(X ≤ m) = ∫₋∞^m fX(x) dx = 0.5

Similarly, the lower and upper quartiles are such that P(X ≤ Q1) = 0.25 and P(X ≤ Q3) = 0.75.

The interquartile range (IQR) is an alternative measure of spread. It is given by:

IQR = Q3 − Q1

Example: Calculating the Quantiles of a PDF

The random variable X has a pdf given by:

fX(x) = 2e^(−2x), x > 0

Calculate the median of the distribution.

Solution

Denote the median by m. Then m is such that:

P(X ≤ m) = ∫₀^m 2e^(−2x) dx = 0.5

So,

[−e^(−2x)]₀^m = 1 − e^(−2m) = 0.5

⇒ m = −(1/2) × ln(1/2) = 0.3466
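A density proportional to e^(−2x) on x > 0 must carry the normalizing constant 2 to integrate to 1; under that assumption, the median has the closed form checked below (a minimal standard-library sketch):

```python
# Check the median of the exponential-type pdf f(x) = 2e^(-2x), x > 0:
# the CDF is F(m) = 1 - e^(-2m), so F(m) = 0.5 gives m = ln(2)/2.
from math import log, exp

rate = 2.0
median = log(2) / rate                  # closed form: ln(2)/rate
cdf = lambda x: 1 - exp(-rate * x)      # CDF of the density above
print(round(median, 4), round(cdf(median), 4))   # 0.3466 0.5
```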

Mode

The mode measures the central tendency, that is, the location of the most observed value of a

random variable. For a continuous random variable, the mode is represented by the highest point in the

PDF.

Random variables can be unimodal if there’s just one mode, bimodal if there are two modes, or

multimodal if there are more than two modes.

The graph below shows the difference between unimodal and bimodal distributions.

Question 1

If a random variable X has a mean of 4 and a standard deviation of 2, calculate Var(3 − 4X).

A. 29

B. 30

C. 64

D. 35

Solution

The correct answer is C.

Recall that:

Var(α + βX) = β²Var(X)

So,

Var(3 − 4X) = (−4)²Var(X) = 16Var(X)

But we are given that the standard deviation is 2, implying that the variance is 4.

Therefore,

Var(3 − 4X) = 16 × 4 = 64

Question 2

A continuous random variable has a pdf given by fX(x) = ce^(−3x) for all x > 0. Calculate

Pr(X < 6.5).

A. 0.4532

B. 0.4521

C. 0.3321

D. 0.9999

Solution

The correct answer is D.

We need to find the constant c first. We know that:

∫₋∞^∞ f(x) dx = 1

So,

∫₀^∞ ce^(−3x) dx = c[−(1/3)e^(−3x)]₀^∞ = c[0 − (−1/3)] = c/3 = 1

⇒ c = 3

Therefore, the PDF is fX(x) = 3e^(−3x), so that Pr(X < 6.5) is given by:

∫₀^6.5 3e^(−3x) dx = [−e^(−3x)]₀^6.5 = 1 − e^(−19.5)

= 0.9999

Reading 14: Common Univariate Random Variables

After completing this reading, you should be able to:

Distinguish the key properties among the following distributions: uniform distribution,

Bernoulli distribution, Binomial distribution, Poisson distribution, normal distribution,

lognormal distribution, Chi-squared distribution, student’s t, and F-distributions, and identify

common occurrences of each distribution.

Describe a mixture distribution and explain the creation and characteristics of mixture

distributions.

Parametric Distributions

There are two types of distributions, namely parametric and non-parametric distributions. Parametric

distributions can be described by a mathematical function. On the other hand, one cannot use a

mathematical function to describe a non-parametric distribution. Examples of parametric distributions

are the uniform and normal distributions.

Discrete Random Variables

Bernoulli Distribution

The Bernoulli distribution is a discrete distribution that takes on values of 0 and 1. This distribution is

suitable for scenarios with binary outcomes, such as corporate defaults. Conventionally, 1 is

labeled a “success” and 0 a “failure.”

The Bernoulli distribution has a parameter p, which is the probability of success, i.e., the probability

that X = 1, so that:

P[X = 1] = p and P[X = 0] = 1 − p

The probability mass function of the Bernoulli distribution, stated as X ∼ Bernoulli(p), is given by:

fX(x) = p^x(1 − p)^(1−x)

The PMF confirms that:

P[X = 1] = p and P[X = 0] = 1 − p

The CDF of a Bernoulli distribution is a step function given by:

FX(x) = { 0, x < 0
          1 − p, 0 ≤ x < 1
          1, x ≥ 1

Therefore, the mean and variance of the distribution are computed as:

E(X) = p × 1 + (1 − p) × 0 = p

V(X) = E(X²) − [E(X)]² = [p × 1² + (1 − p) × 0²] − p² = p(1 − p)

Example: Bernoulli Distribution

What is the ratio of the mean to variance for X~Bernoulli(0.75)?

Solution

We know that for the Bernoulli distribution,

E(X) = p

and

V(X) = p(1 − p)

So,

E(X)/V(X) = p/(p(1 − p)) = 1/(1 − p) = 1/0.25 = 4

Thus, E(X):V(X) = 4:1.

Binomial Distribution

A binomial distribution is a collection of Bernoulli random variables. A binomial random variable

quantifies the total number of successes from n independent Bernoulli trials, with the

probability of success being p and, of course, of failure being 1 − p. Consider the following example:

Suppose we are given two independent bonds with a default likelihood of 10%. Then we have the

following possibilities:

Both do not default,

Both of them default, or

Only one of them defaults.

Let X represent the number of defaults:

P[X = 0] = (1 − 10%)² = 81%

P[X = 1] = 2 × 10% × (1 − 10%) = 18%

P[X = 2] = 10%² = 1%

If we possess three independent bonds having a 10% default probability then:

P[X = 0] = (1 − 10%)³ = 72.9%

P[X = 1] = 3 × 10% × (1 − 10%)² = 24.3%

P[X = 2] = 3 × 10%² × (1 − 10%) = 2.7%

P[X = 3] = 10%³ = 0.1%

Suppose now that we have n bonds. The following combination represents the number of ways in

which x of the n bonds can default:

C(n, x) = n!/(x!(n − x)!) ………… equation I

If p is the likelihood that one bond will default, then the chance that any particular x bonds will

default is given by:

p^x(1 − p)^(n−x) ………… equation II

Combining equations I and II, we can determine the likelihood of x bonds defaulting as follows:

P[X = x] = C(n, x) p^x(1 − p)^(n−x), for x = 0, 1, 2, …, n

This is the PMF of the binomial distribution.

Therefore, the binomial distribution has two parameters, n and p, and is usually stated as X ∼ B(n, p).

The CDF of a binomial distribution is given by:

F(x) = Σᵢ₌₀^⌊x⌋ C(n, i) pⁱ(1 − p)^(n−i)

where ⌊x⌋ denotes the largest integer less than or equal to x.

The mean and variance of the binomial distribution can be evaluated using moments. The mean and

variance are given by:

E(X) = np

And

V (X) = np(1 − p)

The binomial can be approximated using a normal distribution (as will be seen later) if np ≥ 10 and

n(1 − p) ≥ 10.

Example: Binomial Distribution

Consider a Binomial distribution X~B(4,0.6). Calculate P(X≥ 3).

Solution

We know that for the binomial distribution:

P[X = x] = C(n, x) p^x(1 − p)^(n−x)

In this case, n = 4 and p = 0.6:

⇒ P(X ≥ 3) = P(X = 3) + P(X = 4) = C(4, 3)p³(1 − p)¹ + C(4, 4)p⁴(1 − p)⁰

= C(4, 3) × 0.6³(1 − 0.6)¹ + C(4, 4) × 0.6⁴(1 − 0.6)⁰

= 0.3456 + 0.1296 = 0.4752
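The calculation above can be sketched with `math.comb` from the standard library (the helper name is our own):

```python
# Binomial PMF for the example above: X ~ B(4, 0.6), P(X >= 3).
from math import comb

def binom_pmf(x, n, p):
    """P[X = x] for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

p_at_least_3 = binom_pmf(3, 4, 0.6) + binom_pmf(4, 4, 0.6)
print(round(p_at_least_3, 4))   # 0.4752
```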

Poisson Distribution

Events are said to follow a Poisson process if they happen at a constant rate over time and the

likelihood that one event will take place is independent of all the other events; for instance, the

number of defaults that occur in each month.

Suppose that X is a Poisson random variable, stated as X ∼ Poisson(λ); then the PMF is given by:

P[X = x] = λ^x e^(−λ)/x!

The CDF of a Poisson distribution is given by:

F(x) = e^(−λ) Σᵢ₌₀^⌊x⌋ λⁱ/i!

The Poisson parameter λ (lambda), termed the hazard rate, represents the mean number of events

in an interval. Therefore, the mean and variance of the Poisson distribution are given by:

E(X) = λ

And

V (X) = λ

Example: Poisson Distribution

A fixed income portfolio is made up of a huge number of independent bonds. The average number of

bonds defaulting every month is 10. What is the probability that there are exactly 5 defaults in one

month?

Solution

For the Poisson distribution:

P(X = x) = λ^x e^(−λ)/x!

For this question, we have that λ = 10, and we need:

P(X = 5) = (10⁵ × e^(−10))/5! = 0.03783
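A quick standard-library check of the Poisson example above (the helper name is our own):

```python
# Poisson PMF: lambda = 10 defaults per month, probability of exactly 5 defaults.
from math import exp, factorial

def poisson_pmf(x, lam):
    """P[X = x] for X ~ Poisson(lam)."""
    return lam**x * exp(-lam) / factorial(x)

print(round(poisson_pmf(5, 10), 5))   # 0.03783
```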

The notable feature of a Poisson distribution is that it is infinitely divisible. That is, if

X1 ∼ Poisson(λ1) and X2 ∼ Poisson(λ2), and Y = X1 + X2, then:

Y ∼ Poisson(λ1 + λ2)

Therefore, the Poisson distribution is suitable for time series data, since summing the number of events

over sampling intervals does not distort the distribution.

Continuous Random Variables

Uniform Distribution

A uniform distribution is a continuous distribution that takes any value within the range [a, b], with

every value equally likely to occur.

The PDF of a uniform distribution is given by:

fX(x) = 1/(b − a), a ≤ x ≤ b

Note that the PDF of a uniform random variable does not depend on x since all values are equally

likely.

The CDF of the uniform distribution is:

FX(x) = { 0, x < a
          (x − a)/(b − a), a ≤ x ≤ b
          1, x > b

When a = 0 and b = 1, the distribution is called the standard uniform distribution. From a standard

uniform random variable U1, we can construct any uniform distribution U2 using the formula:

U2 = a + (b − a)U1

where a and b are the limits of U2.

The uniform distribution is denoted by X ∼ U(a, b), and the mean and variance are given by:

E(X) = (a + b)/2

V(X) = (b − a)²/12

For instance, the mean and variance of the standard uniform distribution U1 ∼ U(0, 1) are given by:

E(X) = (0 + 1)/2 = 1/2

And

V(X) = (1 − 0)²/12 = 1/12

Assume that we want to calculate the probability that X falls in the interval l < X < u, where l is the

lower limit and u is the upper limit. That is, we need P(l < X < u) given that X ∼ U(a, b). To compute

this, we use the formula:

P(l < X < u) = (min(u, b) − max(l, a))/(b − a)

Intuitively, if l ≥ a and u ≤ b, the formula above simplifies into:

(u − l)/(b − a)

Example: Uniform Distribution

Given the uniform distribution X ∼ U(−5, 10), calculate the mean, variance, and P(−3 < X < 6).

Solution

For the uniform distribution,

E(X) = (a + b)/2 = (−5 + 10)/2 = 2.5

And

V(X) = (10 − (−5))²/12 = 225/12 = 18.75

For P(−3 < X < 6), using the formula:

P(l < X < u) = (min(u, b) − max(l, a))/(b − a)

P(−3 < X < 6) = (min(6, 10) − max(−3, −5))/(10 − (−5)) = (6 − (−3))/15 = 9/15 = 0.60

Alternatively, you can think of the probability as the area under the curve. Note that the height of

the uniform distribution is 1/(b − a) and the length is u − l.

That is:

1/(b − a) × (u − l) = 1/(10 − (−5)) × (6 − (−3)) = 9/15 = 0.60
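The interval-probability formula above translates directly into a small helper (our own name, standard library only; the `max(…, 0)` guard handles intervals that miss [a, b] entirely):

```python
# P(l < X < u) for X ~ U(a, b), mirroring (min(u, b) - max(l, a)) / (b - a).
def uniform_interval_prob(l, u, a, b):
    """Probability that X ~ U(a, b) falls in (l, u)."""
    overlap = min(u, b) - max(l, a)
    return max(overlap, 0.0) / (b - a)

print(uniform_interval_prob(-3, 6, -5, 10))   # 0.6
```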

Normal Distribution

Also called the Gaussian distribution, the normal distribution has a symmetrical PDF, and the mean

and median coincide with the highest point of the PDF. Furthermore, the normal distribution always

has a skewness of 0 and a kurtosis of 3.

The following is the formula of the PDF of a normally distributed random variable X:

f(x) = 1/(σ√(2π)) × e^(−½((x − μ)/σ)²), −∞ < x < ∞

When a variable is normally distributed, it is often written as follows, for convenience:

X ∼ N (μ, σ 2)

Where E(X) = μ and V (X) = σ2

We read this as X is normally distributed with mean μ and variance σ². Any linear combination

of independent normal variables is also normal. To illustrate this, assume X and Y are two variables

that are normally distributed. We also have constants a and b. Then Z will be normally distributed

such that:

Z = aX + bY, such that Z ∼ N(aμX + bμY, a²σX² + b²σY²)

For instance, for a = b = 1, Z = X + Y, and thus Z ∼ N(μX + μY, σX² + σY²).

A standard normal distribution is a normal distribution whose mean is 0 and standard deviation is 1. It is

denoted by N(0, 1), and its PDF is as shown below:

ϕ(x) = 1/√(2π) × e^(−x²/2)

To determine a normal variable whose standard deviation is σ and mean is μ, we multiply a standard

normal variable Z by σ and then add the mean:

X = μ + σZ ⇒ X ∼ N(μ, σ²)

Three standard normal variables X1, X2, and X3 can be combined in the following way to construct two

correlated normal variables:

XA = √ρ × X1 + √(1 − ρ) × X2

XB = √ρ × X1 + √(1 − ρ) × X3

where XA and XB are standard normal variables with a correlation of ρ.

The z-value measures how many standard deviations the corresponding x value is above or below the

mean. It is given by:

z = (X − μ)/σ ∼ N(0, 1)

where

X ∼ N(μ, σ²)

Converting a normal random variable X to z is termed standardization. The values of Φ(z), the standard

normal CDF, are usually tabulated.

For example, consider the normal distribution X ∼ N(1, 2). We wish to calculate P(X > 2).

Solution

First, standardize:

z = (2 − 1)/√2 = 0.7071 ≈ 0.71

We look up Φ(0.71) from the z-table: Φ(0.71) ≈ 0.7611. Therefore:

P(X > 2) = 1 − P(X ≤ 2) = 1 − Φ(0.71) ≈ 1 − 0.7611 = 0.2389

x-value z-value
μ 0
μ + 1σ 1
μ + 2σ 2
μ + nσ n
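The z-table lookup can be cross-checked with the standard library's `statistics.NormalDist` (Python 3.8+). Note that N(1, 2) here means mean 1 and variance 2, so the standard deviation is √2:

```python
# P(X > 2) for X ~ N(mu = 1, variance = 2).
from statistics import NormalDist
from math import sqrt

x_dist = NormalDist(mu=1, sigma=sqrt(2))   # sigma is the standard deviation
p = 1 - x_dist.cdf(2)                      # P(X > 2)
print(round(p, 3))                         # ~ 0.24
```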

Recall that for a binomial random variable, if np ≥ 10 and n(1 − p) ≥ 10, then the binomial distribution

can be approximated by a normal distribution:

X ∼ N(np, np(1 − p))

Also, the Poisson distribution is approximately normal when λ is large (e.g., λ ≥ 1000), so that:

X ∼ N(λ, λ)

We then calculate the probabilities while maintaining the normal distribution principles. The normal

distribution is very popular compared to other distributions because:

The distributions of many discrete and continuous random variables can be approximated using

the normal distribution.

The normal distribution is central to the Central Limit Theorem (CLT), which is utilized in

hypothesis testing.

The normal distribution is closely related to other important distributions, such as the chi-

squared and the F distributions.

A notable property of normal random variables is that they are infinitely divisible,

which makes the normal distribution suitable for modeling asset prices.

Normal distributions are closed under linear operations. In other words, a weighted

sum of normal random variables is also normally distributed.

Lognormal Distribution

A variable X is said to be lognormally distributed if the variable Y is normally distributed such that:

Y = lnX

T his also can be treated as:

X = eY

Where

Y ∼ N (μ, σ 2)

Since Y ∼ N(μ, σ²), the PDF of a lognormal random variable is:

f(x) = 1/(xσ√(2π)) × e^(−½((ln(x) − μ)/σ)²), x ≥ 0

A variable is said to have a lognormal distribution if its natural logarithm has a normal distribution.

The lognormal distribution is undefined for negative values, unlike the normal distribution, which has a

range of values between negative infinity and positive infinity.

If the above equation of the density function of the lognormal distribution is rearranged, we obtain an

equation that has a form similar to the normal distribution. That is:

f(x) = 1/(σ√(2π)) × e^(σ²/2 − μ) × e^(−½((ln x − (μ − σ²))/σ)²)

From the above, we notice that the lognormal distribution happens to be asymmetrical. It is not

symmetrical around the mean, as is the case under the normal distribution. The lognormal distribution

peaks at exp(μ − σ²).

The following is the formula for the mean:

E[X] = e^(μ + σ²/2)

This expression closely resembles the Taylor expansion of the natural logarithm around 1. Recall that:

r ≈ R − ½R²

where R is a standard return and r is the corresponding log return.

The following is the formula for the variance of the lognormal distribution:

V(X) = E[(X − E[X])²] = (e^(σ²) − 1) × e^(2μ + σ²)

Example: Lognormal Distribution

Consider a lognormal distribution given by X ∼ LogN(μ = 0.08, σ² = 0.2). Calculate the expected value.

Solution

For the lognormal distribution, the expected value is given by:

E[X] = e^(μ + σ²/2) = e^(0.08 + ½ × 0.2) = e^(0.18) = 1.19722
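A one-line standard-library check of the lognormal mean formula, using the example's parameters (μ = 0.08 and variance σ² = 0.2):

```python
# E[X] = exp(mu + sigma^2 / 2) for a lognormal with mu = 0.08, sigma^2 = 0.2.
from math import exp

mu, sigma2 = 0.08, 0.2
mean = exp(mu + sigma2 / 2)
print(round(mean, 5))   # 1.19722
```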

Chi-Squared Distribution, χ²

Assume we have k independent standard normal variables Z1 to Zk. The sum of their

squares then has a chi-squared distribution, written as follows:

S = Σᵢ₌₁^k Zi²

So, we can denote the chi-squared distribution as:

S ∼ χ²k

k is called the degrees of freedom. It is important to note that two independent chi-squared

variables, with degrees of freedom k1 and k2, respectively, have a sum that is chi-squared

distributed with (k1 + k2) degrees of freedom.

The chi-squared variable is asymmetrical and takes on non-negative values only.

The distribution has a mean and variance given by:

E(S) = k

and

V(S) = 2k

The chi-squared distribution takes the following PDF, for positive values of x:

f(x) = 1/(2^(k/2) × Γ(k/2)) × x^(k/2 − 1) × e^(−x/2)

The gamma function, Γ, is such that:

Γ(n) = ∫₀^∞ x^(n−1) e^(−x) dx

Note also that, for positive integers n, the gamma function satisfies:

Γ(n) = (n − 1)!

For instance:

Γ(3) = (3 − 1)! = 2 × 1 = 2

This distribution is widely applicable in statistics and risk management when testing hypotheses. The

chi-squared distribution can be approximated using a normal distribution when k is large. This implies

that:

χ²k ≈ N(k, 2k)

This is true because, as the number of degrees of freedom increases, the skewness reduces. Degrees

of freedom measure the amount of data required to test model parameters. If we have a sample size

n, the degrees of freedom are given by n − p, where p is the number of parameters estimated.

Student’s t Distribution

This distribution is often called the t distribution. Let Z be a standard normal variable, and U a chi-

squared variable with k degrees of freedom. Also, assume that U is independent of Z. Then, a random

variable X that follows a t distribution is such that:

X = Z/√(U/k)

The following formula represents its PDF:

f(x) = Γ((k + 1)/2)/(√(kπ) × Γ(k/2)) × (1 + x²/k)^(−(k+1)/2)

The mean of the t distribution is usually zero, and the distribution is symmetrical around it.

That is:

E(X) = 0

The variance is given by:

V(X) = k/(k − 2)

The kurtosis is also given by:

Kurt(X) = 3(k − 2)/(k − 4)

It is easy to see that the mean is valid for k > 1 and the variance is finite for k > 2. The kurtosis is

only defined for k > 4 and is always higher than 3.

The distribution converges to a standard normal distribution as k tends towards infinity (k → ∞).

When k > 2, the variance of the distribution is k/(k − 2), and it converges to one as k increases.

We can also separate the degrees of freedom from the variance to get what is called the

standardized student’s t. Using the formula:

V(aX) = a²V(X)

it is easy to see that:

V[√((k − 2)/k) × X] = 1

where

X ∼ tk

The standardized student’s t has a mean of 0 and a variance of 1. Note that we can still rescale it to

have any variance for k > 2.

A generalized student’s t is stated by the mean, variance, and the number of degrees of freedom. It is

stated as Gen. tk(μ, σ²).

This distribution is widely applicable in hypothesis testing, and in modeling the returns of financial

assets due to the excess kurtosis it displays.

Example: Standardized Student’s t

The kurtosis of some returns on a bond portfolio with three parameters to be estimated is 6. What

are the degrees of freedom if the parameters were generated using a student’s tk?

Solution

We know that for the t-distribution:

Kurt(X) = 3(k − 2)/(k − 4)

∴ 6 = 3(k − 2)/(k − 4) ⇒ 6(k − 4) = 3(k − 2)

So that:

k = 6

F–Distribution

The F-distribution is often used in the analysis of variance (ANOVA). The F-distribution is an

asymmetric distribution that has a minimum value of 0, but no maximum value. Notably, the curve

approaches but never quite touches the horizontal axis.

X is said to follow an F-distribution with parameters k1 and k2 if:

X = (U1/k1)/(U2/k2) ∼ F(k1, k2)

provided that U1 and U2 are independent chi-squared distributions with k1 and k2 as their

degrees of freedom.

The F-distribution has the following PDF:

f(x) = √(((k1·x)^k1 × k2^k2)/(k1·x + k2)^(k1+k2)) / (x × B(k1/2, k2/2))

B(x, y) is a beta function such that:

B(x, y) = ∫₀¹ z^(x−1)(1 − z)^(y−1) dz

The distribution has the following mean and variance, respectively:

E(X) = k2/(k2 − 2), for k2 > 2

σ² = (2k2²(k1 + k2 − 2))/(k1(k2 − 2)²(k2 − 4)), for k2 > 4

Suppose that X is a random variable with a t-distribution with k degrees of freedom. Then X² is

said to have an F-distribution with 1 and k degrees of freedom, i.e.,

X² ∼ F(1, k)

The Beta Distribution

The beta distribution applies to continuous random variables in the range of 0 and 1. This distribution

is similar to the triangle distribution in the sense that both are applicable in the modelling of

default rates and recovery rates. Assuming that a and b are two positive constants, the PDF of

the beta distribution is written as:

f(x) = 1/B(a, b) × x^(a−1)(1 − x)^(b−1), 0 ≤ x ≤ 1

where B(a, b) = Γ(a)Γ(b)/Γ(a + b)

The following two equations represent the mean and variance of the beta distribution:

μ = a/(a + b)

σ² = ab/((a + b)²(a + b + 1))

Exponential Distribution

The exponential distribution is a continuous distribution with a parameter β, whose PDF is:

fX(x) = (1/β) × e^(−x/β), x ≥ 0

The CDF is also given by:

FX(x) = 1 − e^(−x/β)

The parameter β determines the mean and variance of the distribution. That is:

E(X) = β

And

V(X) = β²

Notably, the exponential distribution is a close ‘cousin’ of the Poisson: the time intervals between

successive Poisson events are exponentially distributed. Another feature of the exponential

distribution is that it is memoryless. That is, its distribution is independent of its history.

Example: Exponential Distribution

Assume that the time to default for a specific segment of mortgage consumers is exponentially

distributed with a β of ten years. What is the probability that a borrower will not default before year

11?

Solution

The probability that the borrower will not default before year eleven is the survival probability,

that is, one minus the cumulative distribution up to year eleven:

P(X > 11) = 1 − P(X ≤ 11) = 1 − FX(11)

= 1 − (1 − e^(−11/10)) = e^(−1.1) = 0.3329 = 33.3%
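The survival probability above can be checked directly with the standard library (β = 10 years is the mean time to default from the example):

```python
# P(no default before year t) = exp(-t / beta) for an exponential with mean beta.
from math import exp

beta = 10.0
survival = lambda t: exp(-t / beta)
print(round(survival(11), 4))   # 0.3329
```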

The Mixture Distribution

Mixture distributions are new distributions built from two or more component distributions. In

this summary, we shall concentrate on two-component mixtures.

Generally, a mixture distribution comes from a weighted average of component density functions,

and can be written as follows:

f(x) = Σᵢ₌₁ⁿ wᵢfᵢ(x), such that Σᵢ₌₁ⁿ wᵢ = 1

fᵢ(x)'s are the component distributions, with wᵢ's as the weights or mixing proportions. The

component weights must all sum up to one for the resulting mixture to be a legitimate distribution. In

other words, a two-component mixture draws a value from a Bernoulli random variable and,

depending on the outcome (0 or 1), picks the corresponding component distribution. By doing this, it

is possible to compute the CDF of the mixture when the component distributions are normal random

variables. These distributions are very flexible, as they fall between parametric and non-parametric

distributions.

For example, consider X1 ∼ FX1 and X2 ∼ FX2, and W ∼ Bernoulli(p). The mixture distribution

of X1 and X2 is then given by:

Y = WX1 + (1 − W)X2

Both the PDF and the CDF of the mixture distribution are weighted averages of the component

PDFs and CDFs. That is:

FY (y) = pFX1 (x 1) + (1 − p)FX2 (x 2)

And

f Y (y) = pf X1 (x 1 ) + (1 − p)f X2(x 2 )

Intuitively, the computation of the central moments is done in a similar way. That is:

E(Y ) = pE(X 1) + (1 − p)E(X 2 )

And

V (Y ) = E(Y 2) − (E(Y ))2

Where

E(Y 2) = pE(X 12) + (1 − p)E(X 22 )

Using the same logic, we can calculate the other higher central moments, such as the kurtosis and

skewness. However, note that the mixture distribution might exhibit skewness and excess

kurtosis even when the components do not (for example, normal random variables).

Moreover, mixing components with different means and variances leads to a distribution that is both

skewed and heavy-tailed.

Example: Mixture Distributions

Consider two normal random variables X1 ∼ N(0.15, 0.60) and X2 ∼ N(−0.8, 3). What is the mean of

the resulting mixture distribution (Y) if the weight of X1 is 0.6?

Solution

We know that:

E(Y ) = pE(X 1) + (1 − p)E(X 2 )


= 0.6 × 0.15 + (1 − 0.6)(−0.8)
= −0.23
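The mixture construction can also be simulated directly: draw a Bernoulli indicator, then sample from the chosen component. This sketch (our own, standard library only; N(μ, v) parameters are read as mean and variance) recovers the mean computed above:

```python
# Simulate Y: W ~ Bernoulli(0.6); if W = 1 sample X1 ~ N(0.15, var 0.60),
# otherwise sample X2 ~ N(-0.8, var 3).
import random, math

random.seed(1)

def draw_mixture():
    if random.random() < 0.6:                       # W = 1 with probability 0.6
        return random.gauss(0.15, math.sqrt(0.60))  # component X1
    return random.gauss(-0.8, math.sqrt(3))         # component X2

sample = [draw_mixture() for _ in range(200_000)]
mean_est = sum(sample) / len(sample)
print(round(mean_est, 2))   # close to -0.23
```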

Question

The number of new clients that a wealth management company receives in a month is

distributed as a Poisson random variable with mean 2. Calculate the probability that the

company receives exactly 28 clients in a year.

A. 5.48%

B. 0.10%

C. 3.54%

D. 10.2%

The correct answer is A.

The number of clients in a year (2 × 12) has a Poisson(24) distribution.

P[X = n] = (λⁿ/n!) × e^(−λ)

P[X = 28] = (24²⁸/28!) × e^(−24) = 0.0548 = 5.48%

Reading 15: Multivariate Random Variables

After completing this reading, you should be able to:

Explain how a probability matrix can be used to express a probability mass function (PMF).

Compute the marginal and conditional distributions of a discrete bivariate random variable.

Explain how the expectation of a function is computed for a bivariate discrete random

variable.

Define covariance and explain what it measures.

Explain the relationship between the covariance and correlation of two random variables

and how these are related to the independence of the two variables.

Explain the effects of applying linear transformations on the covariance and correlation

between two random variables.

Compute the variance of a weighted sum of two random variables.

Compute the conditional expectation of a component of a bivariate random variable.

Describe the features of an iid sequence of random variables.

Explain how the iid property is helpful in computing the mean and variance of a sum of iid

random variables.

Multivariate Random Variables

Multivariate random variables accommodate the dependence between two or more random variables.

The concepts under multivariate random variables (such as expectations and moments) are analogous

to those under univariate random variables.

Multivariate Discrete Random Variables

Multivariate random variables involve defining several random variables simultaneously on a sample

space. In other words, multivariate random variables are vectors of random variables. For instance, a

bivariate random variable X can be a vector with two components X 1 and X 2 with the corresponding

realizations being x 1 and x 2, respectively.

The PMF or PDF of a bivariate random variable gives the probability that the two random variables

each take a certain value. If we wish to plot these functions, we would need three axes: X1, X2,

and the PMF/PDF. This is also applicable to the CDF.

The Probability Mass Function (PMF)

The PMF of a bivariate random variable is a function that gives the probability that the components

of X = x take the values X1 = x1 and X2 = x2. That is:

fX1,X2(x1, x2) = P(X1 = x1, X2 = x2)

The PMF explains the probability of a realization as a function of x1 and x2. The PMF has the following

properties:

1. fX1,X2(x1, x2) ≥ 0

2. Σx1 Σx2 fX1,X2(x1, x2) = 1

Example: Trinomial Distribution

The trinomial distribution is the distribution of n independent trials where each trial results in one of

three outcomes (a generalization of the binomial distribution). The first, second, and third

components are X1, X2, and n − X1 − X2, respectively. However, the third component is redundant

provided that we know X1 and X2.

T he trinomial distribution has three parameters:

1. n, representing the total number of the trials

2. p1, representing the probability of realizing X 1

3. p2, representing the probability of realizing X 2

Intuitively, the probability of observing n − X 1 − X 2 is:

1 − p1 − p2

The PMF of the trinomial distribution, therefore, is given by:

f_X1,X2(x1, x2) = [n! / (x1! x2! (n − x1 − x2)!)] × p1^x1 × p2^x2 × (1 − p1 − p2)^(n − x1 − x2)
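As a quick illustration, the PMF above can be evaluated directly. The helper below is my own (not part of the reading); it also confirms that the probabilities sum to 1 over the support:

```python
from math import factorial

def trinomial_pmf(x1, x2, n, p1, p2):
    """P(X1 = x1, X2 = x2) for n independent trials with probabilities p1, p2."""
    x3 = n - x1 - x2  # the redundant third component
    coeff = factorial(n) // (factorial(x1) * factorial(x2) * factorial(x3))
    return coeff * p1**x1 * p2**x2 * (1 - p1 - p2)**x3

# The PMF sums to 1 over the support {x1 + x2 <= n}:
total = sum(trinomial_pmf(a, b, 4, 0.2, 0.5) for a in range(5) for b in range(5 - a))
print(round(total, 10))  # 1.0
```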

The Cumulative Distribution Function (CDF)

The CDF of a bivariate discrete random variable returns the total probability that each component is less than or equal to a given value. It is given by:

F_X1,X2(x1, x2) = P(X1 ≤ x1, X2 ≤ x2) = Σ_{t1 ∈ R(X1), t1 ≤ x1} Σ_{t2 ∈ R(X2), t2 ≤ x2} f_X1,X2(t1, t2)

In this equation, t1 ranges over the values that X1 may take as long as t1 ≤ x1. Similarly, t2 ranges over the values that X2 may take as long as t2 ≤ x2.

Probability Matrices

T he probability matrix is a tabular representation of the PMF.

Example: Probability Matrix

In financial markets, market sentiments play a role in determining the return earned on a security.

Suppose the return earned on a bond is in part determined by the rating given to the bond by analysts.

For simplicity, we are going to assume the following:

T here are only three possible returns :10%, 0%, or -10%

Analyst ratings (sentiments) can be positive, neutral, or negative

We can represent this in a probability matrix as follows:

Bond Return (X 1)
−10% 0% 10%
Analyst Positive +1 5% 5% 30%
(X 2 ) Neutral 0 10% 10% 15%
Negative −1 20% 5% 0%

Each cell represents the probability of a joint outcome. For example, there is a 5% (joint) probability that the bond returns -10% and the analysts hold positive views about the bond and its issuer. In other words, there is a 5% probability that the bond declines in price while carrying a positive rating. Similarly, there is a 10% chance that the bond's price does not change (a zero return) and the rating is neutral.

The Marginal Distribution

T he marginal distribution gives the distribution of a single variable in a joint distribution. In the case

of bivariate distribution, the marginal PMF of X 1 is computed by summing up the probabilities for X 1

across all the values in the support of X 2. T he resulting PMF of X 1 is denoted by f X1 (x 1) , i.e., the

marginal distribution of X 1.

f X1 (x 1 ) = ∑ f X1,X2 (x 1 ,x 2 )
x2 ϵR(X2 )

Intuitively, the PMF of X 2 is given by:

f X2 (x 2 ) = ∑ f X1,X2 (x 1 ,x 2 )
x1 ϵR(X1 )

Example: Computing the Marginal Distribution

Using the probability matrix, we created above, we can come up with marginal distributions for both

X 1 (return) and X 2 (analyst ratings) as follows:

For X 1,

P(X 1 = −10%) = 5% + 20% + 10% = 35%


P(X 1 = 0%) = 5% + 10% + 5% = 20%
P(X 1 = +10%) = 30% + 15% + 0% = 45%

For X 2,

P(X 2 = +1) = 5% + 5% + 30% = 40%


P(X 2 = 0) = 10% + 10% + 15% = 35%
P(X 2 = −1) = 20% + 5% + 0% = 25%

In summary, the marginal distribution of X1 is given below:

Return (X1)    −10%   0%    10%
P(X1 = x1)     35%    20%   45%

Bond Return (X 1)
−10% 0% 10% f X2 (x 2)
Analyst Positive +1 5% 5% 30% 40%
(X 2 ) Neutral 0 10% 10% 15% 35%
Negative −1 20% 5% 0% 25%
f X1 (x 1 ) 35% 20% 45%
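The marginal sums above are easy to verify programmatically. The sketch below (variable names are my own) stores the probability matrix with rows indexed by rating and columns by return, then sums across rows or columns:

```python
# Rows: rating +1, 0, -1; columns: return -10%, 0%, +10%
joint = [
    [0.05, 0.05, 0.30],
    [0.10, 0.10, 0.15],
    [0.20, 0.05, 0.00],
]

# Marginal of X1 (returns): sum each column over all ratings
f_x1 = [sum(row[j] for row in joint) for j in range(3)]
# Marginal of X2 (ratings): sum each row over all returns
f_x2 = [sum(row) for row in joint]

print([round(p, 2) for p in f_x1])  # [0.35, 0.2, 0.45]
print([round(p, 2) for p in f_x2])  # [0.4, 0.35, 0.25]
```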

As you may have noticed, the marginal distribution satisfies the properties of a valid probability distribution. That is:

Σ_{∀x1} f_X1(x1) = 1

and

f_X1(x1) ≥ 0

This is true because the marginal PMF is itself a univariate distribution.

We can, in addition, use the marginal PMF to compute the marginal CDF, F_X1(x1) = P(X1 ≤ x1). That is:

F_X1(x1) = Σ_{t1 ∈ R(X1), t1 ≤ x1} f_X1(t1)

Independence of Random Variables

Recall that if the two events A and B are independent then:

P(A ∩ B) = P(A)P(B)

T his principle applies to bivariate random variables as well. If the distributions of the components of

the bivariate distribution are independent, then:

f X1 ,X2 (x 1, x 2) = f X1 (x 1)f X2 (x 2 )

Example: Independence of Random Variables

Now let’s use our earlier example on the return earned on a bond. If we assume that the two

variables – return and ratings – are independent, we can calculate the joint distribution by the

multiplying their marginal distributions. But are they really independent? Let’s find out! We have

already established the joint and the marginal distributions, as reproduced in the following table.

Bond Return (X 1)
−10% 0% 10% f X2 (x 2)
Analyst Positive +1 5% 5% 30% 40%
(X 2 ) Neutral 0 10% 10% 15% 35%
Negative −1 20% 5% 0% 25%
f X1 (x 1 ) 35% 20% 45%

So assuming that our two variables are independent, our joint distribution would be as follows:

Bond Return (X 1 )
−10% 0% 10%
Analyst Positive +1 14% 8% 18%
(X 2 ) Neutral 0 12.25% 7% 15.75%
Negative −1 8.75% 5% 11.25%

We obtain the table above by multiplying the marginal PMF of the bond return by the marginal PMF of the ratings. For example, the marginal probability that the bond return is 10% is 45% -- the sum of the third column. The marginal probability of a positive rating is 40% -- the sum of the first row. These two values, when multiplied, give us the joint probability in the upper-right cell of the table (18%):

45% × 40% = 18%

It is clear that the two variables are not independent because multiplying their marginal PMFs does

not lead us back to the joint PMF.
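This check is mechanical: build the product of the marginals and compare it cell by cell with the joint PMF. A minimal sketch (names are my own):

```python
joint = [
    [0.05, 0.05, 0.30],   # rating +1
    [0.10, 0.10, 0.15],   # rating 0
    [0.20, 0.05, 0.00],   # rating -1
]
f_x1 = [sum(row[j] for row in joint) for j in range(3)]  # return marginals
f_x2 = [sum(row) for row in joint]                       # rating marginals

# Joint PMF that independence would imply: product of the marginals
implied = [[f_x2[i] * f_x1[j] for j in range(3)] for i in range(3)]

independent = all(
    abs(joint[i][j] - implied[i][j]) < 1e-12 for i in range(3) for j in range(3)
)
print(independent)  # False: ratings and returns are not independent
```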

The Conditional Distributions

T he conditional distributions describe the probability of an outcome of a random variable conditioned

on the other random variable taking a particular value.

Recall that, given any two events A and B:

P(A | B) = P(A ∩ B) / P(B)

This result can be applied to bivariate distributions. That is, the conditional distribution of X1 given X2 is defined as:

f_X1|X2(x1 | X2 = x2) = f_X1,X2(x1, x2) / f_X2(x2)

From the result above, the conditional distribution is the joint distribution divided by the marginal distribution of the conditioning variable.

Example: Calculating the Conditional Distribution

Bond Return (X 1)
−10% 0% 10% f X2 (x 2)
Analyst Positive +1 5% 5% 30% 40%
(X 2 ) Neutral 0 10% 10% 15% 35%
Negative −1 20% 5% 0% 25%
f X1 (x 1 ) 35% 20% 45%

Suppose we want to find the distribution of bond returns conditional on a positive analyst rating. The conditional distribution is:

f_X1|X2(x1 | X2 = 1) = f_X1,X2(x1, X2 = 1) / f_X2(1) = f_X1,X2(x1, X2 = 1) / 40%

With this, we can proceed to determine specific conditional probabilities:

Returns (X1)                  −10%      0%        10%
f_X1|X2(x1 | X2 = 1)          5%/40%    5%/40%    30%/40%
= P(X1 = x1 | X2 = 1)         = 12.5%   = 12.5%   = 75%

What we have done is take the joint probabilities where there is a positive analyst rating and then divide these values by the marginal probability of a positive rating (40%) to produce the conditional distribution.
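The same division can be sketched in code, reusing the probability matrix from earlier (variable names are my own):

```python
joint = [
    [0.05, 0.05, 0.30],   # rating +1
    [0.10, 0.10, 0.15],   # rating 0
    [0.20, 0.05, 0.00],   # rating -1
]
positive_row = joint[0]
f_x2_pos = sum(positive_row)  # marginal probability of a positive rating, 0.40

# Conditional PMF of the return given a positive rating
cond = [p / f_x2_pos for p in positive_row]
print([round(p, 3) for p in cond])  # [0.125, 0.125, 0.75]
```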

Note that the conditional PMF obeys the laws of probability, i.e.,

1. f_X1|X2(x1 | X2 = x2) ≥ 0 (nonnegativity)

2. Σ_{x1} f_X1|X2(x1 | X2 = x2) = 1

Conditional Distribution for a Set of Outcomes

Conditional distributions can be computed for one variable, while conditioning on more than one

variable.

For example, assume that we need to compute the conditional distribution of the bond returns given that analyst ratings are non-negative. Therefore, our conditioning set is C = {+1, 0}:

X2 ∈ {+1, 0}

The conditional PMF must sum across all outcomes in the conditioning set C:

f_X1|X2(x1 | x2 ∈ C) = Σ_{x2 ∈ C} f_X1,X2(x1, x2) / Σ_{x2 ∈ C} f_X2(x2)

The marginal probability that X2 ∈ {+1, 0} is the sum of the marginal probabilities of these two outcomes:

f_X2(+1) + f_X2(0) = 40% + 35% = 75%

Bond Return (X 1)
−10% 0% 10% f X2 (x 2)
Analyst Positive +1 5% 5% 30% 40%
(X 2 ) Neutral 0 10% 10% 15% 35%
Negative −1 20% 5% 0% 25%
f X1 (x 1 ) 35% 20% 45%

Thus, the conditional distribution is given by:

f_X1|X2(−10% | x2 ∈ {+1, 0}) = (5% + 10%) / 75% = 20%
f_X1|X2(0% | x2 ∈ {+1, 0})   = (5% + 10%) / 75% = 20%
f_X1|X2(10% | x2 ∈ {+1, 0})  = (30% + 15%) / 75% = 60%

Independence and Conditional Distribution of Random Variables

Recall that the conditional distribution is given by:

f X1,X2 (x 1, x 2)
f (X1│X2) (x 1│X 2 = x 2 ) =
f X2 (x 2)

T his can be rewritten into:

f (X1 ,X2)(x 1 , x 2) = f (X1│X2 )(x 1 │X 2 = x 2)f X2 (x 2 )

Or

f (X1 ,X2)(x 1 , x 2) = f X2│X1 (x 2 │X 1 = x 1)f X1 (x 1 )

Also, if the distributions of the components of the bivariate distributions are independent, then:

f (X1 ,X2) (x 1, x 2) = f X1 (x 1)f X2 (x 2 )

If we substitute this in the above results we get:

f X1(x 1 )f X2 (x 2) = f (X1│X2)(x 1 │X 2 = x 2)f X2 (x 2 )


⇒ f X1 (x 1) = f (X1│X2) (x 1│X 2 = x 2 )

Applying again to

f X1 ,X2(x 1 , x 2) = f (X2│X1 )(x 2 │X 1 = x 1)f X1 (x 1 )

we get:

f X2 (x 2) = f (X2│X1 )(x 2 │X 1 = x 1)

Expectations

T he expectation of a function of a bivariate random variable is defined in the same way as that of the

univariate random variable. Consider the function g(X 1 ,X 2 ). T he expectation is defined as:

E(g(X 1, X 2 )) = ∑ ∑ g(x 1 , x 2)f X1,X2 (x 1 ,x 2 )


x1 ϵR(X1 ) x2 ϵR(X2 )

g(x1, x2) depends on both x1 and x2, though it may be a function of one component only. Just as with univariate random variables,

E(g(X1, X2)) ≠ g(E(X1), E(X2))

for a nonlinear function g(x1, x2).

Example: Calculating the Expectation

Consider the following probability mass function:

              X1
              1       2
X2     3      10%     15%
       4      70%     5%

Given that g(x1, x2) = x1^x2, calculate E(g(X1, X2)).

Sol uti on

Using the formula:

E(g(X1, X2)) = Σ_{x1 ∈ R(X1)} Σ_{x2 ∈ R(X2)} g(x1, x2) f_X1,X2(x1, x2)

In this case we need:

E(g(X1, X2)) = Σ_{x1 ∈ {1,2}} Σ_{x2 ∈ {3,4}} g(x1, x2) f_X1,X2(x1, x2)
= 1^3(0.10) + 1^4(0.70) + 2^3(0.15) + 2^4(0.05)
= 0.10 + 0.70 + 1.20 + 0.80 = 2.80
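The probability-weighted sum above can be sketched directly in code (my own helper, not part of the reading):

```python
# Joint PMF as a dictionary: (x1, x2) -> probability
pmf = {(1, 3): 0.10, (1, 4): 0.70, (2, 3): 0.15, (2, 4): 0.05}

def expectation(g, pmf):
    """E[g(X1, X2)] as a probability-weighted sum over the support."""
    return sum(g(x1, x2) * p for (x1, x2), p in pmf.items())

e = expectation(lambda x1, x2: x1 ** x2, pmf)
print(round(e, 2))  # 2.8
```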

Moments

Just like the univariate random variables, we shall use the expectations to define the moments.

T he first moment is defined as:

E(X) = [E(X 1), E(X 2)] = [μ1, μ2 ]

The second moment involves the covariance between the components X1 and X2 of the bivariate distribution. It appears, for example, in the variance of their sum:

Var(X1 + X2) = Var(X1) + Var(X2) + 2Cov(X1, X2)

The covariance between X1 and X2 is defined as:

Cov(X1, X2) = E[(X1 − E[X1])(X2 − E[X2])]
            = E[X1X2] − E[X1]E[X2]

Note that Cov(X1, X1) = Var(X1), and that if X1 and X2 are independent, then E[X1X2] = E[X1]E[X2] and thus:

Cov(X1, X2) = E[X1X2] − E[X1]E[X2] = E[X1]E[X2] − E[X1]E[X2] = 0

More often, the correlation between X1 and X2 is reported. Now let Var(X1) = σ1², Var(X2) = σ2², and Cov(X1, X2) = σ12. Then the correlation is defined as:

Corr(X1, X2) = ρ_X1X2 = Cov(X1, X2)/(σ1 σ2) = σ12/(√σ1² √σ2²)

T herefore, we can write this in terms of covariance. T hat is:

σ12 = ρX1 X2 σ1σ2

Correlation measures the strength of the linear relationship between the two random variables, and it always lies between -1 and 1. That is, −1 ≤ Corr(X1, X2) ≤ 1.

For instance, if X2 = α + βX1, then:

Cov(X1, X2) = Cov(X1, α + βX1) = βVar(X1)

But we know that Var(α + βX1) = β²Var(X1). So,

Corr(X1, X2) = ρ_X1X2 = βVar(X1)/(√Var(X1) √(β²Var(X1))) = β/|β|

It is now evident that if β > 0, then ρ_X1X2 = 1, and if β < 0, then ρ_X1X2 = −1.

Similarly, if we consider two scaled random variables a + bX1 and c + dX2, then:

Cov(a + bX1, c + dX2) = bd Cov(X1, X2)

This implies that the scale factor in each random variable multiplies into the covariance. Using the above result, the corresponding correlation coefficient is:

Corr(a + bX1, c + dX2) = bd Cov(X1, X2)/(√(b²Var(X1)) √(d²Var(X2))) = (bd/(|b||d|)) ρ_X1X2
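A tiny numerical sketch makes the point: shifting the variables leaves the correlation unchanged, while a negative scale factor flips its sign. The data and helper below are my own illustrative choices:

```python
def corr(xs, ys):
    """Sample correlation (population-style normalization)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    sx = (sum((a - mx) ** 2 for a in xs) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in ys) / n) ** 0.5
    return cov / (sx * sy)

x = [1.0, 2.0, 3.0, 4.0]
y = [1.5, 2.0, 1.0, 3.5]

base = corr(x, y)
scaled = corr([2 + 3 * a for a in x], [5 - 2 * b for b in y])
print(round(base, 4), round(scaled, 4))  # the shift drops out; the negative scale flips the sign
```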

Application of Correlation: Portfolio Variance and Hedging

The variances of the underlying securities and their respective correlations are the necessary ingredients for determining the variance of a portfolio of securities. Assume that we have two securities whose random returns are XA and XB, with means μA and μB and standard deviations σA and σB. If XA and XB have a correlation of ρAB between them, the variance of XA plus XB is:

σ²_A+B = σA² + σB² + 2ρAB σA σB

If both securities have equal variance, σA² = σB² = σ², the equation reduces to:

σ²_A+B = 2σ²(1 + ρAB)

If, in addition, the correlation between the two securities is zero, the equation simplifies further, and we have the following relation for the standard deviation:

ρAB = 0 ⇒ σ_A+B = √2 σ

For any number of variables, we have:

Y = Σ_{i=1}^n Xi
σY² = Σ_{i=1}^n Σ_{j=1}^n ρij σi σj

In case all the Xi's are uncorrelated and all standard deviations are equal to σ, we have:

σY = √n σ   if ρij = 0 for all i ≠ j

This is what is called the square-root rule for the addition of uncorrelated variables.

Suppose that Y , X A, and X B are such that:

Y = aX A + bX B

Therefore, with our standard notation, we have:

σY² = a²σA² + b²σB² + 2abρAB σA σB   … (Eq. 1)

Correlation is central to hedging. Suppose we hold $1 of security A and hedge it with $h of security B; h is therefore the hedge ratio. Let P denote the return on the hedged portfolio. The variance of the hedged portfolio follows from Eq. 1 with a = 1 and b = h:

P = XA + hXB
σP² = σA² + h²σB² + 2hρAB σA σB

The minimum-variance hedge ratio is found by differentiating the portfolio variance with respect to h and setting the derivative equal to zero:

dσP²/dh = 2hσB² + 2ρAB σA σB = 0
⇒ h* = −ρAB σA/σB

To determine the minimum variance achievable, we substitute h* into the original equation:

min[σP²] = σA²(1 − ρ²AB)
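The algebra above is easy to check numerically. The sketch below (the volatilities and correlation are assumed values of my own) computes the optimal hedge ratio and confirms that nearby alternatives give a higher variance:

```python
def portfolio_var(h, sa, sb, rho):
    """Variance of P = X_A + h * X_B."""
    return sa**2 + h**2 * sb**2 + 2 * h * rho * sa * sb

sa, sb, rho = 0.20, 0.25, 0.6    # assumed vols and correlation
h_star = -rho * sa / sb          # minimum-variance hedge ratio
min_var = sa**2 * (1 - rho**2)   # closed-form minimum

assert abs(portfolio_var(h_star, sa, sb, rho) - min_var) < 1e-12
# Any nearby hedge ratio gives a higher variance:
assert portfolio_var(h_star + 0.01, sa, sb, rho) > min_var
assert portfolio_var(h_star - 0.01, sa, sb, rho) > min_var
print(round(h_star, 3), round(min_var, 5))  # -0.48 0.0256
```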

The Covariance Matrix

The covariance matrix is a 2x2 matrix that displays the variances of, and the covariance between, the components of X. For instance, the covariance matrix of X is given by:

Cov(X) = [ σ1²   σ12 ]
         [ σ12   σ2² ]

The Variance of Sums of Random Variables

The variance of the sum of two random variables is given by:

Var(X1 + X2) = Var(X1) + Var(X2) + 2Cov(X1, X2)

If the random variables are independent, then Cov(X1, X2) = 0 and thus:

Var(X1 + X2) = Var(X1) + Var(X2)

In the case of weighted random variables, the variance is given by:

Var(aX1 + bX2) = a²Var(X1) + b²Var(X2) + 2ab Cov(X1, X2)

Conditional Expectation

A conditional expectation is simply the mean calculated after a set of prior conditions has happened.

It is the value that a random variable takes “on average” over an arbitrarily large number of

occurrences – given the occurrence of a certain set of "conditions." A conditional expectation uses

the same expression as any other expectation and is a weighted average where the probabilities are

determined by a conditional PMF.

For a discrete random variable, the conditional expectation is given by:

E(X1 | X2 = x2) = Σ_i x1i × f(x1i | X2 = x2)

Example: Calculating the Conditional Expectation

In the bond return/rating example, we may wish to calculate the expected return on the bond given a

positive analyst rating, i.e., E(X 1 │X 2 = 1)

If you recall, the conditional distribution is as follows:

Returns (X1)                  −10%      0%        10%
f_X1|X2(x1 | X2 = 1)          5%/40%    5%/40%    30%/40%
= P(X1 = x1 | X2 = 1)         = 12.5%   = 12.5%   = 75%

T he conditional expectation of the return is determined as follows:

E(X 1│X 2 = 1) = −0.10 × 0.125 + 0 × 0.125 + 0.10 × 0.75 = 0.0625 = 6.25%

Conditional Variance

We can calculate the conditional variance by substituting the expectation in the variance formula

with the conditional expectation.

We know that:

Var(X1) = E[(X1 − E(X1))²] = E(X1²) − [E(X1)]²

Now the conditional variance of X1 conditional on X2 is given by:

Var(X1 | X2 = x2) = E(X1² | X2 = x2) − [E(X1 | X2 = x2)]²

Returning to our example above, the conditional variance Var(X 1 |X 2 = 1) is given by:

Var(X 1 │X 2 = 1) = E(X 21 |X 2 = 1) − [E(X 1 |X 2 = 1)]2

Now,

E(X 1 |X 2 = 1) = 0.0625

We need to calculate:

E(X 21 │X 2 = 1) = (−0.10)2 × 0.125 + 02 × 0.125 + 0.102 × 0.75 = 0.00875

So that

Var(X1 | X2 = 1) = σ²_(X1|X2=1) = 0.00875 − (0.0625)² = 0.004844 = 0.484%

If we wish to find the standard deviation of the returns, we just find the square root of the variance:

σ(X1 │X2 =1) = √0.004844 = 0.06960 = 6.96%
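The conditional mean, variance, and standard deviation above can all be reproduced from the conditional PMF in a few lines (a minimal sketch with my own variable names):

```python
returns = [-0.10, 0.0, 0.10]
cond_pmf = [0.125, 0.125, 0.75]   # P(X1 = x1 | X2 = +1)

e_x = sum(r * p for r, p in zip(returns, cond_pmf))        # conditional mean
e_x2 = sum(r**2 * p for r, p in zip(returns, cond_pmf))    # conditional E[X^2]
var = e_x2 - e_x**2                                        # conditional variance

print(round(e_x, 4))       # 0.0625
print(round(var, 6))       # 0.004844
print(round(var**0.5, 4))  # 0.0696
```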

Continuous Random Variables

Before we continue, it is essential to note that continuous random variables make use of the same

concepts and methodologies as discrete random variables. T he main distinguishing factor is that

instead of PMFs, continuous random variables use PDFs.

The Joint PDF

The joint (bivariate) density function gives the probability that the pair (X1, X2) takes values in a stated region A. For a rectangular region, it is given by:

P(a < X1 < b, c < X2 < d) = ∫_a^b ∫_c^d f_X1,X2(x1, x2) dx1 dx2

The joint PDF is always nonnegative, and integrating it over the entire support yields 1. That is:

f_X1,X2(x1, x2) ≥ 0

and

∫∫ f_X1,X2(x1, x2) dx1 dx2 = 1,

where the double integral runs over the whole support of (X1, X2).

Example: Calculating the Joint Probability

Assume that the random variables (X 1 ) and (X 2) are jointly distributed as:

f X1,X2 (x 1 , x 2 ) = k(x 1 + 3x 2 ) 0 < x 1 < 2, 0 < x 2 < 2

Calculate the probability P(X 1 < 1, X 2 > 1).

Sol uti on

We need to first calculate the value of k, using the fact that the joint PDF must integrate to 1 over its support:

∫_0^2 ∫_0^2 k(x1 + 3x2) dx1 dx2 = ∫_0^2 k[x1²/2 + 3x1x2]_0^2 dx2
= ∫_0^2 k(2 + 6x2) dx2 = k[2x2 + 3x2²]_0^2 = 16k = 1
⇒ k = 1/16

So,

f_X1,X2(x1, x2) = (1/16)(x1 + 3x2)

Therefore,

P(X1 < 1, X2 > 1) = ∫_0^1 ∫_1^2 (1/16)(x1 + 3x2) dx2 dx1 = 0.3125
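A crude numerical check of this probability is a midpoint Riemann sum over a fine grid (purely illustrative; the midpoint rule happens to be exact here because the integrand is linear):

```python
def f(x1, x2):
    """Joint density (1/16)(x1 + 3*x2) on 0 < x1 < 2, 0 < x2 < 2."""
    return (x1 + 3 * x2) / 16.0

n = 400          # grid points per axis
h1 = 1.0 / n     # x1 runs over (0, 1)
h2 = 1.0 / n     # x2 runs over (1, 2)
prob = sum(
    f((i + 0.5) * h1, 1.0 + (j + 0.5) * h2) * h1 * h2
    for i in range(n)
    for j in range(n)
)
print(round(prob, 4))  # 0.3125
```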

Joint Cumulative Distribution Function (CDF)

T he joint cumulative distribution is given by:

F_X1,X2(x1, x2) = P(X1 ≤ x1, X2 ≤ x2) = ∫_{−∞}^{x1} ∫_{−∞}^{x2} f_X1,X2(t1, t2) dt1 dt2

Note that the lower bounds of the integrals can be adjusted to the lower ends of the support. Using the example above, we can calculate F(1, 1) in a similar way.

The Marginal Distributions

For continuous random variables, the marginal distribution is given by:

f_X1(x1) = ∫_{−∞}^{∞} f_X1,X2(x1, x2) dx2

Similarly,

f_X2(x2) = ∫_{−∞}^{∞} f_X1,X2(x1, x2) dx1

Note that if we want to find the marginal distribution of X 1 we integrate X 2 out and vice versa.

Example: Computing the Marginal Distribution

Consider the example above. We have:

f_X1,X2(x1, x2) = (1/16)(x1 + 3x2),  0 < x1 < 2, 0 < x2 < 2

We wish to find the marginal distribution of X1. This implies that we need to integrate out X2. So,

f_X1(x1) = ∫_0^2 (1/16)(x1 + 3x2) dx2 = (1/16)[x1x2 + (3/2)x2²]_0^2
= (1/16)(2x1 + 6) = (1/8)(x1 + 3)

Note that we can calculate f_X2(x2) in a similar manner.

Conditional Distributions

The conditional distribution is defined analogously to that of discrete random variables. That is:

f_X1|X2(x1 | X2 = x2) = f_X1,X2(x1, x2) / f_X2(x2)

Conditional distributions are applied in finance, for example in risk management. For instance, we may wish to compute the conditional distribution of interest rates, X1, given that investors experience a huge loss (an event defined in terms of X2).

Independent, Identically Distributed (IID) Random Variables

A collection of random variables is independent and identically distributed (iid) if each random variable has the same probability distribution as the others and all are mutually independent.

Example:

Consider successive throws of a fair coin:

The coin has no memory, so all the throws are independent.

The probability of heads vs. tails in every throw is 50:50; the coin stays fair, so the distribution from which every throw is drawn stays the same, and each outcome is therefore identically distributed.

iid variables are mostly applied in time series analysis.

Mean and Variance of iid Variables

Consider iid variables generated by a normal distribution. They are typically written as:

Xi ~ iid N(μ, σ²)

The expected value of the sum of these iid variables is given by:

E(Σ_{i=1}^n Xi) = Σ_{i=1}^n E(Xi) = Σ_{i=1}^n μ = nμ

where E(Xi) = μ.

The result above is valid since the variables are independent and have identical moments. Along the same lines, the variance of a sum of iid random variables is given by:

Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) + 2 Σ_{j=1}^n Σ_{k=j+1}^n Cov(Xj, Xk)
= Σ_{i=1}^n σ² + 2 Σ_{j=1}^n Σ_{k=j+1}^n 0 = nσ²

The independence property is important because there is a difference between the variance of the sum of multiple random variables and the variance of a multiple of a single random variable. If X1 and X2 are iid with variance σ², then:

Var(X1 + X2) = Var(X1) + Var(X2) = σ² + σ² = 2σ²


Var(X 1 + X 2) ≠ Var(2X 1 )

In the case of a multiple of a single variable, X 1, with variance σ 2,

Var(2X 1) = 4Var(X 1) = 4 × σ 2 = 4σ 2
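A short simulation makes the distinction concrete (seeded for reproducibility; the estimates are noisy, so this is only an illustrative sketch):

```python
import random

random.seed(42)
n = 200_000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]

def sample_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

var_sum = sample_var([a + b for a, b in zip(x1, x2)])   # approaches 2*sigma^2 = 2
var_double = sample_var([2 * a for a in x1])            # approaches 4*sigma^2 = 4
print(round(var_sum, 2), round(var_double, 2))
```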

Practice Question

A company is reviewing fire damage claims under a comprehensive business insurance

policy. Let X be the portion of a claim representing damage to inventory and let Y be the

portion of the same application representing damage to the rest of the property. T he

joint density function of X and Y is:

f(x, y) = 6[1 − (x + y)]   for x > 0, y > 0, x + y < 1
f(x, y) = 0                elsewhere

What is the probability that the portion of a claim representing damage to the rest of the

property is less than 0.3?

A. 0.657

B. 0.450

C. 0.415

D. 0.752

The correct answer is A.

First, we should find the marginal PDF of Y:

f_Y(y) = ∫_0^{1−y} 6[1 − (x + y)] dx = [6(x − x²/2 − xy)]_0^{1−y}

Substituting the limits as usual gives:

6[(1 − y) − (1 − y)²/2 − y(1 − y)]

At this point we can factor out (1 − y) and simplify what remains in the square bracket:

6(1 − y)[1 − (1 − y)/2 − y] = 6(1 − y)[(1 − y)/2]

Cancelling the 2 with the 6:

6(1 − y)[(1 − y)/2] = 3(1 − y)(1 − y) = 3(1 − 2y + y²) = 3 − 6y + 3y²

So,

f_Y(y) = 3 − 6y + 3y², 0 < y < 1

We need P(Y < 0.3). So,

P(Y < 0.3) = ∫_0^{0.3} (3 − 6y + 3y²) dy = [3y − 3y² + y³]_0^{0.3} = 0.9 − 0.27 + 0.027 = 0.657
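The final integral can be checked with the antiderivative evaluated in code (a minimal sketch):

```python
# f_Y(y) = 3 - 6y + 3y^2 on (0, 1); integrate it from 0 to 0.3 in closed form.
def F(y):
    return 3 * y - 3 * y**2 + y**3  # antiderivative of 3 - 6y + 3y^2

p = F(0.3) - F(0.0)
print(round(p, 3))  # 0.657
```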

Reading 16: Sample Moments

After completing this reading, you should be able to:

Estimate the mean, variance, and standard deviation using sample data.

Explain the difference between a population moment and a sample moment.

Distinguish between an estimator and an estimate.

Describe the bias of an estimator and explain what the bias measures.

Explain what is meant by the statement that the mean estimator is BLUE.

Describe the consistency of an estimator and explain the usefulness of this concept.

Explain how the Law of Large Numbers (LLN) and Central Limit T heorem (CLT ) apply to

the sample mean.

Estimate and interpret the skewness and kurtosis of a random variable.

Use sample data to estimate quantiles, including the median.

Estimate the mean of two random variables and apply the CLT.

Estimate the covariance and correlation between two random variables.

Explain how coskewness and cokurtosis are related to skewness and kurtosis.

Sample Moments

Recall that moments are defined as the expected values that briefly describe the features of a

distribution. Sample moments are those that are utilized to approximate the unknown population

moments. Sample moments are calculated from the sample data.

Such moments include mean, variance, skewness, and kurtosis. We shall discuss each moment in

detail.

Estimation of the Mean

The population mean, denoted by μ, is estimated from the sample mean (X̄). The estimated mean is denoted by μ̂ and defined by:

μ̂ = X̄ = (1/n) Σ_{i=1}^n Xi

where the Xi are random variables assumed to be independent and identically distributed, so E(Xi) = μ, and n is the number of observations.

Note that the mean estimator is a function of random variables, and thus it is itself a random variable. Consequently, we can examine its properties as a random variable (its mean and variance).

For instance, the expectation of the mean estimator μ̂ is the population mean μ. This can be seen as follows:

E(μ̂) = E(X̄) = E[(1/n) Σ_{i=1}^n Xi] = (1/n) Σ_{i=1}^n E(Xi) = (1/n) Σ_{i=1}^n μ = (1/n) × nμ = μ

The above result holds because we have assumed that the Xi's are iid. The mean estimator is therefore an unbiased estimator of the population mean.

The bias of an estimator is defined as:

Bias(θ̂) = E(θ̂) − θ

where θ̂ is an estimator of the population value θ. So, in the case of the population mean:

Bias(μ̂) = E(μ̂) − μ = μ − μ = 0

Since the bias of the mean estimator is 0, it is an unbiased estimator of the population mean.

Using conventional features of a random variable, the variance of the mean estimator is calculated as:

Var(μ̂) = Var((1/n) Σ_{i=1}^n Xi) = (1/n²)[Σ_{i=1}^n Var(Xi) + covariance terms]

But we are assuming that the Xi's are iid, and thus they are uncorrelated, implying that their covariances are equal to 0. Consequently, taking Var(Xi) = σ², the above formula becomes:

Var(μ̂) = (1/n²) Σ_{i=1}^n Var(Xi) = (1/n²) Σ_{i=1}^n σ² = (1/n²) × nσ² = σ²/n

Thus:

Var(μ̂) = σ²/n

Looking at this formula, the variance of the mean estimator depends on the data variance (σ²) and the sample size n. Consequently, the variance of the mean estimator decreases as the number of observations (the sample size) increases. This implies that the larger the sample size, the closer the estimated mean is to the population mean.

Example: Calculating the Sample Mean

An experiment was done to find out the number of hours that candidates spend preparing for the

FRM part 1 exam. It was discovered that for a sample of 10 students, the following times were

spent:

318, 304, 317, 305, 309, 307, 316, 309, 315, 327

What is the sample mean?

Sol uti on

We know that:

X̄ = μ̂ = (1/n) Σ_{i=1}^n Xi

⇒ X̄ = (318 + 304 + 317 + 305 + 309 + 307 + 316 + 309 + 315 + 327)/10 = 312.7
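The same computation in code (a minimal sketch):

```python
hours = [318, 304, 317, 305, 309, 307, 316, 309, 315, 327]
mean = sum(hours) / len(hours)
print(mean)  # 312.7
```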

Desirable Properties of the Sample Mean Estimator

The mean estimator is, on average, equal to the population mean.

As the sample size (the number of observations) increases, the sample mean tends to the population mean.

In large samples, the sample mean can be assumed to be approximately normally distributed.
Estimation of Variance and Standard Deviation

The sample estimator of the variance is defined as:

σ̂² = (1/n) Σ_{i=1}^n (Xi − μ̂)²

Note that we are still assuming that the Xi's are iid. Unlike the mean estimator, the sample estimator of the variance is biased. It can be shown that:

Bias(σ̂²) = E(σ̂²) − σ² = ((n−1)/n)σ² − σ² = −σ²/n

This implies that the bias shrinks as the number of observations increases. Intuitively, the source of the bias is the variance of the mean estimator (σ²/n). Since the bias is known, we can construct an unbiased estimator of the variance as:

s² = (n/(n−1)) σ̂² = (n/(n−1)) × (1/n) Σ_{i=1}^n (Xi − μ̂)² = (1/(n−1)) Σ_{i=1}^n (Xi − μ̂)²

It can be shown that E(s²) = σ², and thus s² is an unbiased variance estimator. It might therefore seem that s² is a better estimator of the variance than σ̂², but this is not necessarily true, since the variance of σ̂² is less than that of s². Financial analysis typically involves large data sets, in which case either estimator can be used; when the number of observations is large (n ≥ 30), σ̂² is conventionally preferred.

The sample standard deviation is the square root of the sample variance. That is:

σ̂ = √σ̂²   or   s = √s²

Note that the square root is a nonlinear function; thus, the standard deviation estimators are biased, but the bias diminishes as the sample size increases.

Example: Calculating the Sample Variance Estimator (Unbiased)

Using the data from the sample mean example, what is the sample variance?

Solution

The sample estimator of the variance is given by:

s² = (1/(n−1)) Σ_{i=1}^n (Xi − μ̂)²

To make the calculation easier, we build the following table:

Xi      (Xi − μ̂)²
318     (318 − 312.7)² = 28.09
304     75.69
317     18.49
305     59.29
309     13.69
307     32.49
316     10.89
309     13.69
315     5.29
327     204.49
Total   462.10

So, the variance is given by:

s² = (1/(n−1)) Σ_{i=1}^n (Xi − μ̂)² = 462.10/(10 − 1) ≈ 51.34
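Recomputing from the raw data confirms the sum of squared deviations and shows both the unbiased and the biased variance estimators side by side (a minimal sketch):

```python
hours = [318, 304, 317, 305, 309, 307, 316, 309, 315, 327]
n = len(hours)
mean = sum(hours) / n
ss = sum((x - mean) ** 2 for x in hours)   # sum of squared deviations
s2 = ss / (n - 1)                          # unbiased sample variance
sigma2_hat = ss / n                        # biased estimator
print(round(ss, 2), round(s2, 2), round(sigma2_hat, 2))  # 462.1 51.34 46.21
```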

Reasons Why the Mean and Standard Deviation Are Used

I. The mean and the standard deviation are usually adequate to describe data.

II. They give a clue to the range of values that can be observed.

III. The units of the mean and the standard deviation are the same as those of the data, and thus they can easily be compared.

Skewness

As we saw in chapter two, the skewness is the cubed standardized central moment, given by:

skew(X) = E[(X − E(X))³]/σ³ = E[((X − μ)/σ)³]

Note that (X − μ)/σ is a standardized version of X with a mean of 0 and variance of 1.

This can also be written as:

skew(X) = E[(X − E(X))³] / (E[(X − E(X))²])^(3/2) = μ3/σ³

where μ3 is the third central moment, and σ is the standard deviation.

The skewness measures the asymmetry of the distribution (the third power preserves the sign of the deviations). When the skewness is negative, large-magnitude negative values are more likely than large-magnitude positive values (the long tail is on the left side of the distribution). Conversely, if the skewness is positive, large-magnitude positive values are more likely than large-magnitude negative values (the long tail is on the right side of the distribution).

The estimator of the skewness replaces the population moments with their sample counterparts:

μ̂3 / σ̂³

We can estimate μ̂3 as:

μ̂3 = (1/n) Σ_{i=1}^n (xi − μ̂)³

Example: Calculating the Skewness

T he following are the data on the financial analysis of a sales company’s income over the last 100

months:

n = 100, Σ_{i=1}^n (xi − μ̂)² = 674,759.90, and Σ_{i=1}^n (xi − μ̂)³ = −12,456.784.

Calculate the skewness.

Solution

The skewness is given by:

μ̂3/σ̂³ = [(1/n) Σ_{i=1}^n (xi − μ̂)³] / [(1/n) Σ_{i=1}^n (xi − μ̂)²]^(3/2)
= [(1/100)(−12,456.784)] / [(1/100) × 674,759.90]^(3/2) = −0.000225
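The same calculation in code, using the given moment sums (a minimal sketch):

```python
n = 100
sum_sq = 674_759.90      # sum of (x_i - mean)^2
sum_cu = -12_456.784     # sum of (x_i - mean)^3

mu3_hat = sum_cu / n                 # estimated third central moment
sigma_hat = (sum_sq / n) ** 0.5      # estimated standard deviation
skew = mu3_hat / sigma_hat**3
print(round(skew, 6))  # -0.000225
```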

Kurtosis

The kurtosis is defined as the fourth standardized moment, given by:

Kurt(X) = E[(X − E(X))⁴]/σ⁴ = E[((X − μ)/σ)⁴]

This can be written as:

Kurt(X) = E[(X − E(X))⁴] / (E[(X − E(X))²])² = μ4/σ⁴

The interpretation of kurtosis is analogous to that of skewness, except that the fourth power means kurtosis measures the magnitude of deviations regardless of sign. The reference value for a normally distributed random variable is 3. A random variable with kurtosis exceeding 3 is termed heavy-tailed or fat-tailed.

The estimator of the kurtosis replaces the population moments with their sample counterparts:

μ̂4 / σ̂⁴

We can estimate μ̂4 (the fourth central moment) as:

μ̂4 = (1/n) Σ_{i=1}^n (xi − μ̂)⁴

The BLUE Mean Estimator

We say that the mean estimator is the Best Linear Unbiased Estimator (BLUE) of the population mean when the data used are iid. That is:

I. The mean estimator has the lowest variance of any linear unbiased estimator (LUE).

II. It is an unbiased estimator of the population mean (as shown earlier).

III. It is a linear function of the data used.

A linear estimator of the mean can be defined as:

    μ̂ = ∑ᵢ₌₁ⁿ ωᵢ Xᵢ

Where ωᵢ is independent of Xᵢ. In the case of the sample mean estimator, ωᵢ = 1/n. Recall that we
had shown the unbiasedness of the sample mean estimator.

BLUE puts an estimator as the best by having the smallest variance among all linear and unbiased

estimators. However, there are other superior estimators, such as Maximum Likelihood Estimators

(MLE).

The Behavior of Mean in Large Sample Sizes

Recall that the mean estimator is unbiased, and its variance takes a simple form. Moreover, if the

data used are iid and normally distributed, then the estimator is also normally distributed. However, it

poses a great difficulty in defining the exact distribution of the mean in a finite number of

observations.

To overcome this, we use the behavior of the mean in large sample sizes (that is, as n → ∞) to
approximate the distribution of the mean in finite sample sizes. We shall explain the behavior of the

mean estimator using the Law of Large Numbers (LLN) and the Central Limit T heorem (CLT ).

The Law of Large Numbers (LLN)

The law of large numbers (Kolmogorov's Strong Law of Large Numbers) for iid data states that if Xᵢ
is a sequence of random variables with E(Xᵢ) ≡ μ, then:

    μ̂ₙ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ  →a.s.  μ

Put in words, the sample mean estimator μ̂ₙ converges almost surely (a.s.) to the population mean (μ).

An estimator is said to be consistent if the LLN applies to it. Consistency requires that an estimator is:

I. Unbiased, or that its bias decreases as n increases.

II. Such that its variance decreases as the number of observations n increases. That is, Var(μ̂ₙ) → 0.

Moreover, under the LLN, the sample variance is consistent. That is, the LLN implies that σ̂² →a.s. σ².
However, consistency is an asymptotic property: the variance of a consistent estimator tends to 0 as
n → ∞, so it says little about an estimator's behavior in finite samples.
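The LLN can be illustrated with a short simulation. This is an illustrative sketch of our own, not from the text: it draws iid Uniform(0, 1) variables (population mean 0.5) and shows the sample mean settling near 0.5 as n grows.

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

def sample_mean(n):
    # Mean of n iid Uniform(0, 1) draws; E(X_i) = 0.5.
    return sum(random.random() for _ in range(n)) / n

# Larger samples should land closer to the population mean of 0.5.
means = {n: sample_mean(n) for n in (10, 1_000, 100_000)}
for n, m in means.items():
    print(n, round(m, 4))
```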

The Central Limit Theorem (CLT)

The Central Limit Theorem (CLT) states that if X₁, X₂, …, Xₙ is a sequence of iid random variables
with a finite mean μ and a finite non-zero variance σ², then the distribution of (μ̂ − μ)/(σ/√n) tends
to a standard normal distribution as n → ∞.

Put simply,

    (μ̂ − μ)/(σ/√n) → N(0, 1)

Note that μ
^ = X̄ = Sample Mean

Note that CLT extends LLN and provides a way of approximating the distribution of the sample mean

estimator. CLT seems to be appropriate since it does not require the distribution of random variables

used.

Since the CLT is asymptotic, we can also use the unstandardized form, so that:

    μ̂ ∼ N(μ, σ²/n)

Note that we can go back to the standard normal variable Z as:

    Z = (μ̂ − μ)/(σ/√n)

Which is actually the result we had initially.

T he main question is, how large is n?

T he value of n solely depends on the shape of the population (distribution of X i’s), i.e., the skewness.

However, the CLT approximation is typically considered appropriate when n ≥ 30.

Example: Applying CLT

A sales expert believes that the number of sales per day for a particular company has a mean of 40

and a standard deviation of 12. He surveyed for over 50 working days. What is the probability that the

sample mean of sales for this company is less than 35?

Solution

Using the information given in the question,

μ = 40, σ = 12, and n = 50

By central limit theorem,

    μ̂ ∼ N(μ, σ²/n)

We need:

    P(μ̂ < 35) = P(Z < (35 − 40)/(12/√50)) = P(Z < −2.946)
              = 1 − P(Z < 2.946) = 0.00161
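The same probability can be reproduced in Python using a standard normal CDF built from `math.erf` (the helper name `phi` is ours, not from the text):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

mu, sigma, n = 40, 12, 50
z = (35 - mu) / (sigma / sqrt(n))  # standardized value, about -2.946
p = phi(z)                         # P(sample mean < 35)
print(round(z, 3), round(p, 5))
```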

Estimation of Median and Other Quantiles

Median

The median is a measure of the central tendency of a distribution, also called the 50% quantile,
which divides the distribution in half (50% of observations lie on either side of the median value).

When the sample size is odd, the value in position (n + 1)/2 of the sorted list is used to estimate the

median:

    Med(x) = x_((n+1)/2)

If the number of observations is even, the median is estimated as the average of the two central
points of the sorted list. That is:

    Med(x) = (1/2)[x_(n/2) + x_(n/2+1)]

Example: Calculating the Median

T he ages of experienced financial analysts in a country are:

56, 51, 43, 34, 25, 50

What is the median age of the analysts?

Solution

We need to arrange the data in ascending order:

25, 34, 43, 50, 51, 56

The sample size is 6 (even), so the median is given by:

    Med(Age) = (1/2)[x₃ + x₄] = (1/2)(43 + 50) = 46.5
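The same result follows from Python's standard library, which applies exactly the even-sample rule above (sort, then average the two central values):

```python
import statistics

ages = [56, 51, 43, 34, 25, 50]
# statistics.median sorts the data and, for an even count,
# averages the two central values.
med = statistics.median(ages)
print(med)  # 46.5
```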

Properties of the Median

It may not be an actual observation in the data set.

It is not affected by extreme values because the median is a positional measure.

It is used when the exact midpoint of the score distribution is desired, or when there are

many outliers (extreme observations).

Other Quantiles

For other quantiles, such as the 25% and 75% quantiles, we estimate analogously to the median. For
instance, a θ-quantile is located using the position nθ in the sorted list. If nθ is not an integer, we
take the average of the sorted values just below and above the position nθ.

So, in our example above, the 25% quantile (θ = 0.25) position is 6 × 0.25 = 1.5. This implies that we
need to find the average of the 1st and 2nd values:

    q̂25 = (1/2)(25 + 34) = 29.5

The Interquartile Range

T he interquartile range (IQR) is defined as the difference between the 75% and 25% quartiles. T hat

is:

    IQR = q̂75 − q̂25

IQR is a measure of dispersion and thus can be used as an alternative to the standard deviation.

If we use the example above, the 75% quantile position is 6 × 0.75 = 4.5. So, we need to average the
4th and 5th values:

    q̂75 = (1/2)(50 + 51) = 50.5
2

So that the IQR is:

    IQR = 50.5 − 29.5 = 21
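The nθ rule above can be sketched as a small function. The handling of the integer-position case is our assumption (the text only specifies the non-integer case), and the function name is ours:

```python
def quantile(data, theta):
    # n*theta rule from the text: when n*theta is not an integer,
    # average the sorted values just below and above that position.
    s = sorted(data)
    pos = len(s) * theta
    i = int(pos)
    if pos != i:
        return 0.5 * (s[i - 1] + s[i])
    # Integer case: one common convention (assumption, not from the text).
    return s[i - 1]

ages = [56, 51, 43, 34, 25, 50]
q25 = quantile(ages, 0.25)  # 29.5
q75 = quantile(ages, 0.75)  # 50.5
iqr = q75 - q25             # 21.0
print(q25, q75, iqr)
```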

Desirable Properties of Quantiles

I. T he units of the quantiles are the same as those of the data used hence they are easy to

interpret.

II. T hey are robust to outliers of the data. T he median and the IQR are unaffected by the

outliers.

The Multivariate Moments

We can extend the definition of moments from univariate to multivariate random variables. The
mean extends directly because the mean vector is simply the collection of the univariate sample
means.

However, if we extend the variance, we would need to estimate the covariance between each pair

plus the variance of each data set used. Moreover, we can also define Kurtosis and Skewness

analogously to univariate random variables.

Covariance

In covariance, we focus on the relationship between the deviations of some two variables rather

than the difference from the mean of one variable.

Recall that the covariance of two variables X and Y is given by:

    Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

The sample covariance estimator is analogous to this result and is given by:

    σ̂XY = (1/n) ∑ᵢ₌₁ⁿ (Xᵢ − μ̂X)(Yᵢ − μ̂Y)

Where:

μ̂X = the sample mean of X

μ̂Y = the sample mean of Y

The sample covariance estimator is biased towards zero, but we can remove the bias by using
n − 1 instead of n in the denominator.

Correlation

Correlation measures the strength of the linear relationship between two random variables, and it
is always between -1 and 1. That is, −1 ≤ Corr(X, Y) ≤ 1.

Correlation is a standardized form of the covariance. It is approximated by dividing the sample

covariance by the product of the sample standard deviation estimator of each random variable. It is

defined as:

    ρ̂XY = σ̂XY / (√σ̂²X · √σ̂²Y) = σ̂XY / (σ̂X σ̂Y)
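The covariance and correlation estimators above can be sketched in a few lines (divisor n, the biased form used in the text; function names are ours):

```python
from math import sqrt

def sample_cov(x, y):
    # Sample covariance with divisor n (the biased form in the text).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

def sample_corr(x, y):
    # Correlation = covariance standardized by the two standard deviations.
    return sample_cov(x, y) / (sqrt(sample_cov(x, x)) * sqrt(sample_cov(y, y)))

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]  # y = 2x, so the correlation should be 1
print(sample_cov(x, y), round(sample_corr(x, y), 10))
```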

Sample Mean of Two Variables

We estimate the mean of two random variables the same way we estimate that of a single variable.

T hat is:

    μ̂x = (1/n) ∑ᵢ₌₁ⁿ xᵢ

And

    μ̂y = (1/n) ∑ᵢ₌₁ⁿ yᵢ

Assuming both of the random variables are iid, we can apply the CLT to each estimator. However, if
we consider the joint behavior (as a bivariate statistic), the CLT stacks the two mean estimators into
a 2×1 vector:

    μ̂ = [μ̂x, μ̂y]ᵀ

which is normally distributed as long as the random vector Z = [X, Y] is iid. The CLT on this vector
depends on the covariance matrix:

    [ σ²X   σXY ]
    [ σXY   σ²Y ]

Note that in a covariance matrix, the diagonal displays the variances of the random variables, and
the off-diagonal entries are the covariances between the pairs of random variables. So, the CLT for
bivariate iid data is given by:

    √n [μ̂x − μx; μ̂y − μy] → N( [0; 0], [σ²X  σXY; σXY  σ²Y] )

If we scale the difference between the vector of means, then the vector of means is normally

distributed. T hat is:

    [μ̂x; μ̂y] → N( [μx; μy], [σ²X/n  σXY/n; σXY/n  σ²Y/n] )

Example: Applying Bivariate CLT

The annualized estimates of the means, variances, covariance, and correlation for the monthly
returns of a stock trade (T) and the government's bonds (G) for 350 months are as shown below:

    Moment:  μ̂T = 11.9,  σ²T = 335.6,  μ̂G = 6.80,  σ²G = 26.7,  σTG = 14.0,  ρTG = 0.1434

We need to compare the volatility, interpret the correlation coefficient, and apply bivariate CLT.

Solution

Looking at the output, it is evident that the return from the stock trade is more volatile than the

government bond return since it has a higher variance. T he correlation between the two forms of

return is positive but very small.

If we apply the bivariate CLT, then:

    √n [μ̂T − μT; μ̂G − μG] → N( [0; 0], [335.6  14.0; 14.0  26.7] )

But the mean estimators have a limiting distribution (which is assumed to be normally distributed).
So,

    [μ̂T; μ̂G] → N( [μT; μG], [0.9589  0.04; 0.04  0.07629] )

Note the new covariance matrix is equivalent to the previous covariance matrix divided by the
sample size n = 350.

In the bivariate CLT, the correlation between the sample means is the same as the correlation
between the data series.

Coskewness and Cokurtosis

Coskewness and Cokurtosis are an extension of the univariate skewness and kurtosis.

Coskewness

The two coskewness measures are defined as:

    Skew(X, X, Y) = E[(X − E[X])²(Y − E[Y])] / (σ²X σY)

    Skew(X, Y, Y) = E[(X − E[X])(Y − E[Y])²] / (σX σ²Y)

These measures both capture the likelihood of the data taking a large directional value whenever the
other variable is large in magnitude. When there is no sensitivity of the direction of one variable to
the magnitude of the other, the two coskewnesses are 0. For example, the coskewness in a bivariate

normal is always 0, even when the correlation is different from 0. Note that the univariate skewness

measures are Skew(X, X, X) and Skew(Y, Y, Y).

So how do we estimate coskewness?

The coskewness is estimated using the estimation analogy, that is, by replacing the expectation
operator with a sample average. For instance, the two coskewness estimators are given by:

    Skew-hat(X, X, Y) = [(1/n) ∑ᵢ₌₁ⁿ (xᵢ − μ̂X)²(yᵢ − μ̂Y)] / (σ̂²X σ̂Y)

    Skew-hat(X, Y, Y) = [(1/n) ∑ᵢ₌₁ⁿ (xᵢ − μ̂X)(yᵢ − μ̂Y)²] / (σ̂X σ̂²Y)
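A minimal sketch of the Skew(X, X, Y) estimator (the function name is ours; it uses the 1/n averaging form). The toy data are symmetric, so the coskewness comes out exactly zero, matching the "no sensitivity implies 0" remark above:

```python
from math import sqrt

def coskew_xxy(x, y):
    # Sample analogue of Skew(X, X, Y): the average of (x-dev)^2 * (y-dev),
    # standardized by the variance of X times the standard deviation of Y.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) ** 2 * (yi - my) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - mx) ** 2 for xi in x) / n
    var_y = sum((yi - my) ** 2 for yi in y) / n
    return num / (var_x * sqrt(var_y))

# A symmetric configuration: large |y| occurs equally often with
# positive and negative sign, so the coskewness is exactly zero.
x = [-1.0, 0.0, 1.0]
y = [-1.0, 0.0, 1.0]
print(coskew_xxy(x, y))  # 0.0
```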

Cokurtosis

There are, intuitively, three configurations of the cokurtosis:

    Kurt(X, X, Y, Y) = E[(X − E[X])²(Y − E[Y])²] / (σ²X σ²Y)

    Kurt(X, X, X, Y) = E[(X − E[X])³(Y − E[Y])] / (σ³X σY)

    Kurt(X, Y, Y, Y) = E[(X − E[X])(Y − E[Y])³] / (σX σ³Y)

The reference value for the kurtosis of a normally distributed random variable is 3, and a random
variable with kurtosis exceeding 3 is termed heavily or fat-tailed. However, comparing the
cokurtosis to that of the normal is not as easy, since the cokurtosis of the bivariate normal depends
on the correlation: the value of the cokurtosis Kurt(X, X, Y, Y) is 1 when the random variables are
uncorrelated, and it increases as the correlation deviates from 0.

Practice Question

A sample of 100 monthly profits gave out the following data:

    ∑ᵢ₌₁¹⁰⁰ xᵢ = 3,353 and ∑ᵢ₌₁¹⁰⁰ x²ᵢ = 844,536

What is the sample mean and standard deviation of the monthly profits?

A. Sample Mean=33.53, Standard deviation=85.99

B. Sample Mean=53.53, Standard deviation=85.55

C. Sample Mean=43.53, Standard deviation=89.99

D. Sample Mean=33.63, Standard deviation=65.99

Solution

T he correct answer is A.

Recall that the sample mean is given by:

    μ̂ = X̄ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ

    ⇒ X̄ = (1/100) × 3,353 = 33.53

T he variance is given by:

    s² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (Xᵢ − μ̂)²

Note that,

    (Xᵢ − μ̂)² = Xᵢ² − 2Xᵢμ̂ + μ̂²
^+ μ

So that

    ∑ᵢ₌₁ⁿ (Xᵢ − μ̂)² = ∑ᵢ₌₁ⁿ (Xᵢ² − 2Xᵢμ̂ + μ̂²) = ∑ᵢ₌₁ⁿ Xᵢ² − 2μ̂ ∑ᵢ₌₁ⁿ Xᵢ + nμ̂²

Note again that

    μ̂ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ ⇒ ∑ᵢ₌₁ⁿ Xᵢ = nμ̂

So,

    ∑ᵢ₌₁ⁿ Xᵢ² − 2μ̂ ∑ᵢ₌₁ⁿ Xᵢ + nμ̂² = ∑ᵢ₌₁ⁿ Xᵢ² − 2μ̂ · nμ̂ + nμ̂²
                                  = ∑ᵢ₌₁ⁿ Xᵢ² − nμ̂²

T hus:

    s² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (Xᵢ − μ̂)² = (1/(n − 1)) [∑ᵢ₌₁ⁿ Xᵢ² − nμ̂²]

So, in our case:

    s² = (1/(n − 1)) [∑ᵢ₌₁ⁿ Xᵢ² − nμ̂²] = (1/99)(844,536 − 100 × 33.53²) = 7,395.0496

So that the standard deviation is given to be:

s = √7395.0496 = 85.99
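The practice-question arithmetic can be verified with a few lines of Python (variable names are ours):

```python
from math import sqrt

n = 100
sum_x = 3_353
sum_x2 = 844_536

mean = sum_x / n                          # sample mean
var = (sum_x2 - n * mean ** 2) / (n - 1)  # unbiased sample variance
sd = sqrt(var)
print(mean, round(sd, 2))  # 33.53 85.99
```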

Reading 17: Hypothesis Testing

After compl eti ng thi s readi ng, you shoul d be abl e to:

Construct an appropriate null hypothesis and alternative hypothesis and distinguish

between the two.

Construct and apply confidence intervals for one-sided and two-sided hypothesis tests, and

interpret the results of hypothesis tests with a specific level of confidence.

Differentiate between a one-sided and a two-sided test and identify when to use each test.

Explain the difference between T ype I and T ype II errors and how these relate to the size

and power of a test.

Understand how a hypothesis test and a confidence interval are related.

Explain what the p-value of a hypothesis test measures.

Interpret the results of hypothesis tests with a specific level of confidence.

Identify the steps to test a hypothesis about the difference between two population means.

Explain the problem of multiple testing and how it can bias results.

Hypothesis testing is defined as a process of determining whether a hypothesis is in line with the
sample data; it assesses whether the observed data are consistent with the stated hypothesis.

Hypothesis testing starts by stating the null hypothesis and the alternative hypothesis. The null
hypothesis is an assumption about the population parameter. On the other hand, the alternative
hypothesis covers the parameter values for which the null hypothesis is rejected. The critical values
are determined by the distribution of the test statistic (when the null hypothesis is true) and the size
of the test (the probability with which a true null hypothesis is rejected).

Components of the Hypothesis Testing

T he elements of the test hypothesis include:

I. T he null hypothesis.

II. T he alternative hypothesis.

III. T he test statistic.

IV. T he size of the hypothesis test and errors

V. T he critical value.

VI. T he decision rule.

The Null hypothesis

As stated earlier, the first stage of the hypothesis test is the statement of the null hypothesis. T he

null hypothesis is the statement concerning the population parameter values. It brings out the notion

that “there is nothing about the data.”

The null hypothesis, denoted as H0, represents the current state of knowledge about the
population parameter that's the subject of the test. In other words, it represents the "status quo."

For example, the U.S Food and Drug Administration may walk into a cooking oil manufacturing plant

intending to confirm that each 1 kg oil package has, say, 0.15% cholesterol and not more. T he

inspectors will formulate a hypothesis like:

H 0: Each 1 kg package has 0.15% cholesterol.

A test would then be carried out to confirm or reject the null hypothesis.

Other typical statements of H 0 include:

H 0 : μ = μ0

H 0 : μ ≤ μ0

Where:

μ = true population mean and,

μ0= the hypothesized population mean.

The Alternative Hypothesis

The alternative hypothesis, denoted H1, is a contradiction of the null hypothesis. The alternative
hypothesis determines the values of the population parameter at which the null hypothesis is
rejected. Thus, rejecting H0 makes H1 valid. We accept the alternative hypothesis when the
"status quo" is discredited and found to be untrue.

Using our FDA example above, the alternative hypothesis would be:

H 1: Each 1 kg package does not have 0.15% cholesterol.

The typical statements of H1 include:

H1: μ ≠ μ0

H1: μ > μ0

Where:

μ = true population mean and,

μ0= the hypothesized population mean.

Note that we have stated the alternative hypothesis, which contradicted the above statement of the

null hypothesis.

The Test Statistic

A test statistic is a standardized value computed from sample information when testing hypotheses. It

compares the given data with what we would expect under the null hypothesis. T hus, it is a major

determinant when deciding whether to reject H 0, the null hypothesis.

We use the test statistic to gauge the degree of agreement between sample data and the null

hypothesis. Analysts use the following formula when calculating the test statistic.

    Test Statistic = (Sample Statistic − Hypothesized Value) / (Standard Error of the Sample Statistic)

The test statistic is a random variable that changes from one sample to another. Test statistics
assume a variety of distributions. We shall focus on normally distributed test statistics because they
are used in hypotheses concerning the means, regression coefficients, and other econometric models.

We shall consider the hypothesis test on the mean. Consider a null hypothesis H0: μ = μ0. Assume
that the data used are iid and asymptotically normally distributed, so that:

    √n(μ̂ − μ) ∼ N(0, σ²)

where σ² is the variance of the sequence of iid random variables used. The asymptotic distribution
leads to the test statistic:

    T = (μ̂ − μ0) / √(σ̂²/n) ∼ N(0, 1)

Note this is consistent with our initial definition of the test statistic.

T he following table gives a brief outline of the various test statistics used regularly, based on the

distribution that the data is assumed to follow:

    Hypothesis Test     Test Statistic
    Z-test              z-statistic
    Chi-Square Test     Chi-Square statistic
    t-test              t-statistic
    ANOVA               F-statistic

We can subdivide the set of values that can be taken by the test statistic into two regions: One is

called the non-rejection region, which is consistent with H 0 and the rejection region (critical region),

which is inconsistent with H 0. If the test statistic has a value found within the critical region, we

reject H 0.

Just like with any other statistic, the distribution of the test statistic must be specified entirely under

H 0 when H 0 is true.

The Size of the Hypothesis Test and the Type I and Type II
Errors

While using sample statistics to draw conclusions about the parameters of the population as a whole,

there is always the possibility that the sample collected does not accurately represent the

population. Consequently, statistical tests carried out using such sample data may yield incorrect

results that may lead to erroneous rejection (or lack thereof) of the null hypothesis. We have two

types of error:

Type I Error

T ype I error occurs when we reject a true null hypothesis. For example, a type I error would

manifest in the form of rejecting H 0 = 0 when it is actually zero.

Type II Error

T ype II error occurs when we fail to reject a false null hypothesis. In such a scenario, the test

provides insufficient evidence to reject the null hypothesis when it’s false.

T he level of significance denoted by α represents the probability of making a type I error, i.e.,

rejecting the null hypothesis when, in fact, it’s true. α is the direct opposite of β, which is taken to

be the probability of making a type II error within the bounds of statistical testing. The ideal but
practically impossible statistical test would be one that simultaneously minimizes α and β. We
use α to determine critical values that subdivide the distribution into the rejection and the non-
rejection regions.

The Critical Value and the Decision Rule

T he decision to reject or not to reject the null hypothesis is based on the distribution assumed by the

test statistic. T his means if the variable involved follows a normal distribution, we use the level of

significance (α) of the test to come up with critical values that lie along with the standard normal

distribution.

T he decision rule is a result of combining the critical value (denoted by Cα ), the alternative

hypothesis, and the test statistic (T ). T he decision rule is to whether to reject the null hypothesis in

favor of the alternative hypothesis or fail to reject the null hypothesis.

For the t-test, the decision rule depends on the alternative hypothesis. When testing against the
two-sided alternative, the decision is to reject the null hypothesis if |T| > Cα. That is, reject the null
hypothesis if the absolute value of the test statistic is greater than the critical value. When testing
against a one-sided alternative, the decision rule is to reject the null hypothesis if T < Cα when using
a one-sided lower alternative, and if T > Cα when using a one-sided upper alternative. When a null
hypothesis is rejected at an α significance level, we say that the result is significant at the α
significance level.

Note that prior to decision making, one must decide whether the test should be one-tailed or two-

tailed. T he following is a brief summary of the decision rules under different scenarios:

Left One-tailed Test

H 1: parameter < X

Decision rule: Reject H0 if the test statistic is less than the critical value. Otherwise, do not
reject H0.

Right One-tailed Test

H 1: parameter > X

Decision rule: Reject H0 if the test statistic is greater than the critical value. Otherwise, do not
reject H0.

Two-tailed Test

H 1: parameter ≠ X (not equal to X)

Decision rule: Reject H 0 if the test statistic is greater than the upper critical value or less than the

lower critical value.

Consider α = 5% and a one-sided test. The rejection region lies in a single tail of the distribution.

When the alternative is one-sided lower, the rejection region is in the left tail. For instance, the
hypotheses are stated as:

H0: μ ≥ μ0 vs. H1: μ < μ0

When the alternative is one-sided upper, the rejection region is in the right tail. The hypotheses, in
this case, are stated as:

H0: μ ≤ μ0 vs. H1: μ > μ0

Example: Hypothesis Test on the Mean

Consider the returns from a portfolio X = (x 1, x 2, … , x n ) from 1980 through 2020. T he approximated

mean of the returns is 7.50%, with a standard deviation of 17%. We wish to determine whether the

expected value of the return is different from 0 at a 5% significance level.

Solution

We start by stating the two-sided hypothesis test:

H 0 : μ =0 vs. H 1 : μ ≠ 0

The test statistic is:

    T = (μ̂ − μ0) / √(σ̂²/n) ∼ N(0, 1)

In this case, we have:

n = 40

μ̂ = 0.075

μ0 = 0

σ̂² = 0.17²

So,

    T = (0.075 − 0) / (0.17/√40) ≈ 2.79

At the significance level α = 5%, the critical value is ±1.96. Since this is a two-sided test, the
rejection regions are (−∞, −1.96) and (1.96, ∞).

Since the test statistic (2.79) is higher than the critical value, we reject the null hypothesis in
favor of the alternative hypothesis.
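The z-test in this example can be reproduced in a few lines of Python (variable names are ours):

```python
from math import sqrt

mu_hat, mu0, sigma, n = 0.075, 0.0, 0.17, 40
t_stat = (mu_hat - mu0) / (sigma / sqrt(n))  # standardized test statistic
critical = 1.96                              # two-sided 5% critical value
reject = abs(t_stat) > critical              # two-sided decision rule
print(round(t_stat, 2), reject)  # 2.79 True
```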

The example above is an example of a Z-test (which is emphasized in this chapter and follows
immediately from the central limit theorem (CLT)). However, we can use the Student's t-distribution
if the random variables are iid and normally distributed and the sample size is small (n < 30).

In the Student's t-distribution, we use the unbiased estimator of the variance. That is:

    s² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (xᵢ − μ̂)²

Therefore, the test statistic for H0: μ = μ0 is given by:

    T = (μ̂ − μ0) / √(s²/n) ∼ t(n−1)

The Type II Error and the Test Power

The power of a test is the direct opposite of the level of significance. While the level of significance

gives us the probability of rejecting the null hypothesis when it’s, in fact, true, the power of a test

gives the probability of correctly discrediting and rejecting the null hypothesis when it is false. In

other words, it gives the likelihood of rejecting H 0 when, indeed, it’s false. Denoting the probability

of type II error by β , the power test is given by:

Power of a Test = 1– β

The power of a test measures the likelihood that a false null hypothesis is rejected. It is influenced
by the sample size, the distance between the hypothesized parameter value and the true value, and
the size of the test.

Confidence Intervals

A confidence interval can be defined as the range of parameters at which the true parameter can be
found at a given confidence level. For instance, a 95% confidence interval constitutes the set of
parameter values where the null hypothesis cannot be rejected when using a 5% test size.
Therefore, a 1 − α confidence interval contains the values that cannot be rejected when using a test
size of α.

It is important to note that the confidence interval depends on the alternative hypothesis statement

in the test. Let us start with the two-sided test alternatives.

H0 : μ = 0

H1 : μ ≠ 0

Then the 1 − α confidence interval is given by:

    [μ̂ − Cα × σ̂/√n, μ̂ + Cα × σ̂/√n]

Cα is the critical value at the α test size.

Example: Calculating Two-Sided Alternative Confidence Intervals

Consider the returns from a portfolio X = (x 1, x 2, … , x n ) from 1980 through 2020. T he approximated

mean of the returns is 7.50%, with a standard deviation of 17%. Calculate the 95% confidence

interval for the portfolio return.

The 1 − α confidence interval is given by:

    [μ̂ − Cα × σ̂/√n, μ̂ + Cα × σ̂/√n]
    = [0.0750 − 1.96 × 0.17/√40, 0.0750 + 1.96 × 0.17/√40]
    = [0.02232, 0.1277]

Thus, the confidence interval implies that any value of the null between 2.23% and 12.77% cannot be
rejected against the alternative.
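The interval can be verified numerically (variable names are ours):

```python
from math import sqrt

mu_hat, sigma, n, c = 0.075, 0.17, 40, 1.96  # inputs from the example
half_width = c * sigma / sqrt(n)             # C_alpha * sigma-hat / sqrt(n)
ci = (mu_hat - half_width, mu_hat + half_width)
print(round(ci[0], 4), round(ci[1], 4))  # 0.0223 0.1277
```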

One-Sided Alternative

For the one-sided alternatives, the confidence interval is given by either:

    (−∞, μ̂ + Cα × σ̂/√n)

for the one-sided lower alternative, or

    (μ̂ − Cα × σ̂/√n, ∞)

for the one-sided upper alternative.

Example: Calculating the One-Sided Alternative Confidence Interval

Assume that we were conducting the following one-sided test:

H0: μ ≤ 0

H1: μ > 0

The 95% confidence interval for the portfolio return is:

    (μ̂ − Cα × σ̂/√n, ∞)
    = (0.0750 − 1.645 × 0.17/√40, ∞)
    = (0.0308, ∞)

On the other hand, if the hypothesis test was:

H0 : μ > 0

H1 : μ ≤ 0

The 95% confidence interval would be:

    (−∞, μ̂ + Cα × σ̂/√n)
    = (−∞, 0.0750 + 1.645 × 0.17/√40)
    = (−∞, 0.1192)

Note that the critical value decreases from 1.96 to 1.645 because the alternative is one-sided rather
than two-sided.

The p-Value

When carrying out a statistical test with a fixed value of the significance level (α), we merely

compare the observed test statistic with some critical value. For example, we might “reject H 0 using

a 5% test” or “reject H 0 at 1% significance level”. T he problem with this ‘classical’ approach is that

it does not give us the details about the strength of the evidence against the null hypothesis.

Determination of the p-value gives statisticians a more informative approach to hypothesis testing.

T he p-value is the lowest level at which we can reject H 0. T his means that the strength of the

evidence against H 0 increases as the p-value becomes smaller. T he test-statistic depends on the

alternative.

The p-Value for One-Tailed Test Alternative

For one-tailed tests, the p-value is given by the probability that lies below the calculated test statistic

for left-tailed tests. Similarly, the likelihood that lies above the test statistic in right-tailed tests gives

the p-value.

Denoting the test statistic by T, the p-value for H 1 : μ > 0 is given by:

P (Z > |T |) = 1 − P (Z ≤ |T |) = 1 − Φ(|T |)

Conversely, for H1: μ < 0, the p-value is given by:

    P(Z ≤ −|T|) = Φ(−|T|)

Where Z is a standard normal random variable; using the absolute value of T, |T|, ensures that the
appropriate tail is measured whether T is negative or positive.

The p-Value for Two-Tailed Test Alternative

If the test is two-tailed, this value is given by the sum of the probabilities in the two tails. We start

by determining the probability lying below the negative value of the test statistic. T hen, we add this

to the probability lying above the positive value of the test statistic. T hat is the p-value for the two-

tailed hypothesis test is given by:

    2[1 − Φ(|T|)]

Example 1: p-Value for One-Sided Alternative

Let θ represent the probability of obtaining a head when a coin is tossed. Suppose we toss the coin

200 times, and heads come up in 85 of the trials. Test the following hypothesis at 5% level of

significance.

H 0: θ = 0.5

H 1: θ < 0.5

Solution

First, note that repeatedly tossing a coin follows a binomial distribution.

Our p-value will be given by P(X ≤ 85), where X ∼ Binomial(200, 0.5) with mean np = 200 × 0.5 = 100,
assuming H0 is true.

    P(X ≤ 85) ≈ P(Z < (85.5 − 100)/√50) = P(Z < −2.05)
             = 1 − 0.97982 = 0.02018

Recall that for a binomial distribution, the variance is given by:

np(1 − p) = 200(0.5)(1 − 0.5) = 50

(We have applied the Central Limit T heorem by taking the binomial distribution as approx. normal)

Since the probability is less than 0.05, such a result is extremely unlikely under H0, and we actually
have strong evidence against H0 that favors H1. Thus, clearly expressing this result, we could say:

“T here is very strong evidence against the hypothesis that the coin is fair. We, therefore, conclude

that the coin is biased against heads.”

Remember, failure to reject H 0 does not mean it’s true. It means there’s insufficient evidence to

justify rejecting H 0, given a certain level of significance.
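The normal approximation with the continuity correction used above can be reproduced as follows (the helper name `phi` is ours):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

n, p0, observed = 200, 0.5, 85
mean = n * p0            # 100
var = n * p0 * (1 - p0)  # 50
# Continuity correction: P(X <= 85) is approximated at 85.5.
z = (observed + 0.5 - mean) / sqrt(var)
p_value = phi(z)
print(round(z, 2), round(p_value, 4))
```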

Example 2: p-Value for Two-Sided Alternative

A CFA candidate conducts a statistical test about the mean value of a random variable X.

H 0: μ = μ0 vs. H 1: μ ≠ μ0

She obtains a test statistic of 2.2. Given a 5% significance level, determine and interpret the p-value.

Solution

P-value = 2P(Z > 2.2) = 2[1 − P(Z ≤ 2.2)] = 2 × 1.39% = 2.78%

(We have multiplied by two since this is a two-tailed test)
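The two-sided p-value can be verified with `math.erf` (the helper name `phi` is ours):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

t_stat = 2.2
p_value = 2 * (1 - phi(abs(t_stat)))  # two-sided: double the upper tail
print(round(p_value, 4))  # 0.0278
```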

Interpretati on

T he p-value (2.78%) is less than the level of significance (5%). T herefore, we have sufficient

evidence to reject H 0. In fact, the evidence is so strong that we would also reject H 0 at significance

levels of 4% and 3%. However, at significance levels of 2% or 1%, we would not reject H 0 since

the p-value surpasses these values.

Hypothesis about the Difference between Two Population


Means.

It’s common for analysts to be interested in establishing whether there exists a significant difference

between the means of two different populations. For instance, they might want to know whether the

average returns for two subsidiaries of a given company exhibit significant differences.

Now, consider a bivariate random variable:

W i = [X i , Y i ]

Assume that the components Xi and Yi are both iid and are correlated. That is:

Corr(X i , Y i ) ≠ 0

Now, suppose that we want to test the hypothesis that:

H 0 : μX = μY

H 1 : μX ≠ μY

In other words, we want to test whether the constituent random variables have equal means. Note

that the hypothesis statement above can be written as:

H 0 : μX − μY = 0

H 1 : μX − μY ≠ 0

To execute this test, consider the variable:

Zi = X i − Y i

T herefore, considering the above random variable, if the null hypothesis is correct then,

E(Zi) = E(X i) − E(Y i ) = μX − μY = 0

Intuitively, this can be considered as a standard hypothesis test of

H 0: μZ =0 vs. H 1: μZ ≠ 0.

The test statistic is given by:

T = μ̂_Z / √(σ̂_Z² / n) ∼ N(0, 1)

Note that the test statistic formula accounts for the correlation between X_i and Y_i. It is easy to see that:

V(Z_i) = V(X_i) + V(Y_i) − 2Cov(X_i, Y_i)

Which can be denoted as:

σ̂_Z² = σ̂_X² + σ̂_Y² − 2σ̂_XY

μ̂_Z = μ̂_X − μ̂_Y

And thus the test statistic formula can be written as:

T = (μ̂_X − μ̂_Y) / √((σ̂_X² + σ̂_Y² − 2σ̂_XY) / n)

This formula indicates that correlation plays a crucial role in determining the magnitude of the test statistic.

Another special case of the test statistic arises when X_i and Y_i are each iid and the two series are independent of each other. The test statistic is then given by:

T = (μ̂_X − μ̂_Y) / √(σ̂_X²/n_X + σ̂_Y²/n_Y)

Where n_X and n_Y are the sample sizes of X_i and Y_i, respectively.

Example: Hypothesis Test on Two Means

An investment analyst wants to test whether there is a significant difference between the means of two portfolios at a 95% level. The first portfolio, X, consists of 30 government-issued bonds and has a mean return of 10% and a standard deviation of 2%. The second portfolio, Y, consists of 30 private bonds with a mean return of 14% and a standard deviation of 3%. The correlation between the two portfolios is 0.7. State the null hypothesis and determine whether it is rejected at this level.

Solution

The hypothesis statement is given by:

H_0: μ_X − μ_Y = 0 vs. H_1: μ_X − μ_Y ≠ 0

Note that this is a two-tailed test. At the 95% level, the test size is α = 5%, and thus the critical value is C_α = ±1.96.

Recall that:

Cov(X, Y) = σ_XY = ρ_XY σ_X σ_Y

Where ρ_XY is the correlation coefficient between X and Y.

Now the test statistic is given by:

T = (μ̂_X − μ̂_Y) / √((σ̂_X² + σ̂_Y² − 2σ̂_XY) / n) = (μ̂_X − μ̂_Y) / √((σ̂_X² + σ̂_Y² − 2ρ_XY σ̂_X σ̂_Y) / n)

= (0.10 − 0.14) / √((0.02² + 0.03² − 2 × 0.7 × 0.02 × 0.03) / 30) = −10.215

The test statistic is far less than −1.96. Therefore, the null hypothesis is rejected at the 95% level.
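The calculation in this example can be reproduced in a few lines of Python. This is an illustrative sketch using the figures given above; the key point is that the variance of Z_i = X_i − Y_i includes the correlation term:

```python
from math import sqrt

# Figures from the example above
mu_x, mu_y = 0.10, 0.14   # mean returns of portfolios X and Y
sd_x, sd_y = 0.02, 0.03   # standard deviations
rho, n = 0.7, 30          # correlation and (paired) sample size

# Variance of Z_i = X_i - Y_i, accounting for the correlation between the series
var_z = sd_x**2 + sd_y**2 - 2 * rho * sd_x * sd_y

# Test statistic for H0: mu_X - mu_Y = 0
t_stat = (mu_x - mu_y) / sqrt(var_z / n)
print(round(t_stat, 3))  # -10.215, so H0 is rejected at the 95% level
```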

The Problem of Multiple Testing

Multiple testing occurs when multiple hypothesis tests are conducted on the same data set. The reuse of data produces spurious results and unreliable conclusions that do not hold up to scrutiny. The fundamental problem with multiple testing is that the test size (i.e., the probability that a true null is rejected) is only applicable to a single test. Repeated testing creates test sizes that are much larger than the assumed size α and therefore increases the probability of a Type I error.

Some control methods have been developed to combat multiple testing. These include the Bonferroni correction, the False Discovery Rate (FDR), and the Familywise Error Rate (FWER).

Practice Question

An experiment was done to find out the number of hours that candidates spend preparing

for the FRM part 1 exam. It was discovered that for a sample of 10 students, the

following times were spent:

318, 304, 317, 305, 309, 307, 316, 309, 315, 327

If the sample mean and standard deviation are 312.7 and 7.2, respectively, calculate a

symmetrical 95% confidence interval for the mean time a candidate spends preparing for

the exam using the t-table.

q 0.95 0.975 0.99 0.995 0.999 0.9995


n=1 6.314 12.706 31.821 63.657 318.309 636.619
2 2.920 4.303 6.965 9.925 22.327 31.599
3 2.353 3.182 4.541 5.841 10.215 12.924
4 2.132 2.776 3.747 4.604 7.173 8.610
5 2.015 2.571 3.365 4.032 5.893 6.869
6 1.943 2.447 3.143 3.707 5.208 5.959
7 1.894 2.365 2.998 3.499 4.785 5.408
8 1.860 2.306 2.896 3.355 4.501 5.041
9 1.833 2.262 2.821 3.250 4.297 4.781
10 1.812 2.228 2.764 3.169 4.144 4.587
11 1.796 2.201 2.718 3.106 4.025 4.437
12 1.782 2.179 2.681 3.055 3.930 4.318

A. [307.5, 317.9]

B. [307.6, 317.8]

C. [307.9, 317.5]

D. [307.3, 318.2]

The correct answer is A.

Population variance is unknown; we must use the t-score.

To find the value of t_{1−α/2}, we use the t-table with (10 − 1 =) 9 degrees of freedom at the (1 − 0.025 =) 0.975 quantile, which gives us 2.262.

So the confidence interval is given by:

X̄ ± t_{1−α/2} × s/√n = 312.7 ± 2.262 × 7.2/√10 = [307.5, 317.9]
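The interval can be checked with a short Python sketch, using the sample mean, standard deviation, and t critical value given in the question:

```python
from math import sqrt

n = 10
mean, s = 312.7, 7.2  # sample mean and standard deviation, as given
t_crit = 2.262        # t quantile for 9 df at the 0.975 level (from the table)

half_width = t_crit * s / sqrt(n)
lower, upper = mean - half_width, mean + half_width
print(round(lower, 1), round(upper, 1))  # 307.5 317.9
```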

Reading 18: Linear Regression

After completing this reading, you should be able to:

Describe the models that can be estimated using linear regression and differentiate them

from those which cannot.

Interpret the results of an OLS regression with a single explanatory variable.

Describe the key assumptions of OLS parameter estimation.

Characterize the properties of OLS estimators and their sampling distributions.

Construct, apply, and interpret hypothesis tests and confidence intervals for a single

regression coefficient in a regression.

Explain the steps needed to perform a hypothesis test in linear regression.

Describe the relationship between a t-statistic, its p-value, and a confidence interval.

Linear regression is a statistical tool for modeling the relationship between two random variables. This chapter will concentrate on the linear regression model (a regression model with one explanatory variable).

The Linear Regression Model

As stated earlier, linear regression determines the relationship between the dependent variable Y and the independent (explanatory) variable X. The linear regression with a single explanatory variable is given by:

Y = β_0 + βX + ϵ

Where:

β_0 = the constant intercept (the value of Y when X = 0)

β = the slope, which measures the sensitivity of Y to variation in X

ϵ = the error term (sometimes referred to as a shock). It represents the portion of Y that cannot be explained by X.

The assumption is that the expectation of the error is 0. That is, E(ϵ) = 0, and thus:

E[Y] = E[β_0] + βE[X] + E[ϵ]

⇒ E[Y] = β_0 + βE[X]

Note that β_0 is the value of Y when X = 0. However, there are cases where the explanatory variable is never equal to 0. In that case, β_0 is interpreted as the value that makes the regression line pass through the point of means, so that Ȳ = β̂_0 + β̂X̄, where Ȳ and X̄ are the means of the y_i and x_i observations.

The Linearity of a Regression

The independent variables can be continuous, discrete, or even functions. Whatever their form, the explanatory variables must satisfy the following conditions:

1. The relationship between the dependent variable Y and the explanatory variables (X_1, X_2, …, X_n) must be linear.

2. The error term must be additive, although the variance of the error term may depend on the explanatory variables.

3. The independent (explanatory) variables must be observable. This ensures that a linear regression with missing data is not developed.

A good example of a violation of the linearity principle is:

Y = β_0 + βX^k + ϵ

This model cannot be estimated using linear regression due to the presence of the unknown parameter k, which violates the first restriction (it is a non-linear regression function). This kind of nonlinearity can be corrected through transformation.

Transformations

When a model does not satisfy the linearity conditions stated above, we can often remove the violation by transforming the model. Consider the model:

Y = β_0 X^β ϵ

Where ϵ is a positive error term (shock). Clearly, this model violates the restrictions, since X is raised to an unknown parameter β and the error term is not additive.

However, we can make this model linear by taking the natural logarithm of both sides of the equation, so that:

ln(Y) = ln(β_0 X^β ϵ)

ln(Y) = ln β_0 + β ln X + ln ϵ

The last equation can be written as:

Ỹ = β̃_0 + βX̃ + ϵ̃

Where Ỹ = ln(Y), β̃_0 = ln β_0, X̃ = ln(X), and ϵ̃ = ln ϵ.

Clearly, this equation satisfies the three linearity conditions. It is worth noting that when interpreting the parameters of the transformed model, we measure the effect of a change in the transformed independent variable on the transformed dependent variable.

For instance, ln(Y ) = lnβ0 + βlnX + lnϵ implies that β represents the change in lnY corresponding to

a unit change in lnX.
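The log-log transformation can be illustrated with a short Python sketch. The data below are hypothetical (a noiseless power law, not from the source); after taking logs, closed-form OLS recovers the exponent β and the scale β_0 exactly:

```python
from math import log, exp

# Hypothetical noiseless power-law data: Y = 2 * X^1.5,
# so ln(Y) = ln(2) + 1.5 * ln(X) is exactly linear.
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [2.0 * x ** 1.5 for x in xs]

lx = [log(x) for x in xs]
ly = [log(y) for y in ys]

# Closed-form OLS slope and intercept on the log-transformed data
mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
beta = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / \
       sum((a - mx) ** 2 for a in lx)
beta0 = my - beta * mx

print(round(beta, 6), round(exp(beta0), 6))  # 1.5 2.0
```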

The Use of the Dummy Variables

There are cases where the explanatory variables are binary (0 and 1), representing the occurrence of an event. These binary variables are called dummies. For instance, assume D_i is a variable such that:

D_i = 1 if the student-teacher ratio in the ith school < 20
D_i = 0 if the student-teacher ratio in the ith school ≥ 20

The following is the population regression model with regressor D_i:

Y_i = β_0 + βD_i + ϵ_i, ∀i = 1, …, n

β is the coefficient on D_i.

When D_i = 0, the equation becomes:

Y_i = β_0 + ϵ_i

When D_i = 1:

Y_i = β_0 + β + ϵ_i

This implies that when D_i = 1, E(Y_i | D_i = 1) = β_0 + β. The test scores will have a population mean value of β_0 + β when the student-teacher ratio is low. The difference between the conditional expectations of Y_i when D_i = 1 and when D_i = 0 is:

(β_0 + β) − β_0 = β

This makes β the difference between the two population means.
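This property can be verified directly: regressing Y on a 0/1 dummy yields a slope equal to the difference in group means and an intercept equal to the D = 0 group mean. The data below are illustrative, not from the source:

```python
# Hypothetical data: d is a 0/1 dummy, y the outcome
d = [0, 0, 1, 1]
y = [2.0, 4.0, 5.0, 9.0]

n = len(d)
dbar, ybar = sum(d) / n, sum(y) / n

# Closed-form OLS slope and intercept
beta = sum((di - dbar) * (yi - ybar) for di, yi in zip(d, y)) / \
       sum((di - dbar) ** 2 for di in d)
beta0 = ybar - beta * dbar

# Group means: the slope equals their difference, the intercept the D=0 mean
mean0 = sum(yi for di, yi in zip(d, y) if di == 0) / d.count(0)
mean1 = sum(yi for di, yi in zip(d, y) if di == 1) / d.count(1)
print(beta0, beta, mean1 - mean0)  # 3.0 4.0 4.0
```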

The Ordinary Least Squares

The Ordinary Least Squares (OLS) method estimates the linear regression parameters by minimizing the sum of squared deviations. The regression coefficients chosen by the OLS estimators are such that the regression line is as close as possible to the observed data.

Consider a regression equation:

Y = β_0 + βX + ϵ

Where each of X and Y consists of n observations: X = (x_1, x_2, …, x_n) and Y = (y_1, y_2, …, y_n). Assuming x_i and y_i are linearly related, the parameters can be estimated using OLS. The estimators minimize the residual sum of squares such that:

∑_{i=1}^n (y_i − β̂_0 − β̂x_i)² = ∑_{i=1}^n ϵ̂_i²

Where β̂_0 and β̂ are the parameter estimators (the intercept and the slope, respectively) that minimize the squared deviations between the line β̂_0 + β̂x_i and y_i, so that:

β̂_0 = Ȳ − β̂X̄

and

β̂ = ∑_{i=1}^n (x_i − X̄)(y_i − Ȳ) / ∑_{i=1}^n (x_i − X̄)²

Where X̄ and Ȳ are the means of X and Y, respectively.

After the estimation of the parameters, the estimated regression line is given by:

ŷ_i = β̂_0 + β̂x_i

And the linear regression residual error term is given by:

ϵ̂_i = y_i − ŷ_i = y_i − β̂_0 − β̂x_i

The variance of the error term is approximated as:

s² = (1/(n − 2)) ∑_{i=1}^n ϵ̂_i²

It can also be shown that:

s² = (n/(n − 2)) σ̂_Y² (1 − ρ̂_XY²)

Note that n − 2 reflects the fact that two parameters are estimated, and that s² is an unbiased estimator of σ². Moreover, the residuals have mean zero and are uncorrelated with the explanatory variable X_i.
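The closed-form estimators above can be implemented in a few lines. Below is a minimal pure-Python sketch on a small illustrative dataset (the numbers are hypothetical); it also confirms that the residuals sum to zero:

```python
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# beta-hat = sum (x_i - xbar)(y_i - ybar) / sum (x_i - xbar)^2
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
       sum((xi - xbar) ** 2 for xi in x)
beta0 = ybar - beta * xbar

# Residuals from the fitted line have mean zero by construction
residuals = [yi - beta0 - beta * xi for xi, yi in zip(x, y)]
print(round(beta0, 10), beta)  # 0.0 1.9
```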

Now, consider the formula:

β̂ = ∑_{i=1}^n (x_i − X̄)(y_i − Ȳ) / ∑_{i=1}^n (x_i − X̄)²

If we multiply both the numerator and the denominator by 1/n, we have:

β̂ = [(1/n) ∑_{i=1}^n (x_i − X̄)(y_i − Ȳ)] / [(1/n) ∑_{i=1}^n (x_i − X̄)²]

Note that the numerator is the covariance between X and Y, and the denominator is the variance of X, so we can write:

β̂ = σ̂_XY / σ̂_X²

Also recall that:

Corr(X, Y) = ρ_XY = Cov(X, Y) / (σ_X σ_Y)

⇒ σ_XY = ρ_XY σ_X σ_Y

So,

β̂ = ρ̂_XY σ̂_X σ̂_Y / σ̂_X²

∴ β̂ = ρ̂_XY σ̂_Y / σ̂_X

Example: Estimating the Linear Regression Parameters

An investment analyst wants to explain the return on a portfolio (Y) using the prevailing interest rate (X) over the past 30 years. The mean interest rate is 7%, and the mean portfolio return is 14%. The covariance matrix is given by:

[ σ̂_Y²   σ̂_XY ]   =   [ 1600   500 ]
[ σ̂_XY   σ̂_X² ]       [  500   338 ]

Assume that the analyst wants to estimate the linear regression equation:

Ŷ_i = β̂_0 + β̂X_i

Estimate the parameters and, thus, the model equation.

Solution

Now,

β̂ = σ̂_XY / σ̂_X² = 500/338 = 1.4793

and

β̂_0 = Ȳ − β̂X̄ = 0.14 − 1.4793 × 0.07 = 0.0364

So, the estimated equation is given by:

Ŷ_i = 0.0364 + 1.4793 X_i
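The two estimates follow directly from the covariance matrix entries given in the example; a minimal Python check:

```python
# Figures from the example above
cov_xy = 500.0           # covariance between X and Y (sigma_XY)
var_x = 338.0            # variance of X (sigma_X^2)
xbar, ybar = 0.07, 0.14  # mean interest rate and mean portfolio return

beta = cov_xy / var_x        # slope: covariance over variance of X
beta0 = ybar - beta * xbar   # intercept from the means
print(round(beta, 4), round(beta0, 4))  # 1.4793 0.0364
```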

Assumptions of OLS

The OLS estimators assume the following:

1. The conditional expectation of the error term, given the independent variable X_i, is 0. More precisely, E(ϵ_i | X_i) = 0. This also implies that the independent variables and the error term are uncorrelated and that E(ϵ_i) = 0.

2. Both the dependent and independent variables are i.i.d. This assumption concerns the drawing of the sample: (X_i, Y_i), i = 1, …, n are i.i.d. when simple random sampling is applied to a single large population. Although the i.i.d. assumption is reasonable for many data collection schemes, not all sampling schemes produce i.i.d. observations on (X_i, Y_i).

3. Large outliers are unlikely. That is, observations whose values of X_i and/or Y_i fall far outside the usual range of the data are unlikely. OLS regression results can be misleading in the presence of large outliers.

4. The variance of the independent variable is strictly positive. That is, σ_X² > 0. This is essential in estimating the regression parameters.

5. The variance of the error term is independent of the explanatory variables: V(ϵ_i | X) = σ² < ∞, and the variance of all the error terms (shocks) is equal. This is termed the homoskedasticity assumption.

Under these assumptions, the OLS parameter estimators are unbiased. That is, E(β̂_0) = β_0 and E(β̂) = β. This holds, in particular, as the sample size increases.

Lastly, the assumptions ensure that the estimated parameters are normally distributed. The asymptotic distribution of the slope is given by:

√n (β̂ − β) ∼ N(0, σ²/σ_X²)

Where σ² is the variance of the error term and σ_X² is the variance of X. It is easy to see that the variance of β̂ increases as σ² increases.

For the intercept, the asymptotic distribution is defined as:

√n (β̂_0 − β_0) ∼ N(0, σ²(μ_X² + σ_X²)/σ_X²)

According to the central limit theorem (CLT), β̂ can be treated as a normal random variable with mean equal to the true value β and variance σ²/(nσ_X²). That is:

β̂ ∼ N(β, σ²/(nσ_X²))

However, σ² is unknown in practice, so for hypothesis testing we need to replace it with its estimator:

σ̂² = s²

So, recall that for a large sample size:

σ̂_X² = (1/n) ∑_{i=1}^n (x_i − X̄)²

⇒ nσ̂_X² = ∑_{i=1}^n (x_i − X̄)²

Therefore, the variance of the parameter β̂ can be written as:

σ̂_β² = σ̂² / ∑_{i=1}^n (x_i − X̄)² = s² / (nσ̂_X²)

The standard error estimate of β̂, denoted SEE_β, is the square root of its variance, so:

SEE_β = √(s²/(nσ̂_X²)) = s/(√n σ̂_X)

Analogously, the variance of the intercept is:

σ̂_β0² = s²(μ̂_X² + σ̂_X²) / (nσ̂_X²)

Hypothesis Testing on the Linear Regression Parameters

When the OLS assumptions are met, the parameter estimators are approximately normally distributed in large samples. Therefore, we can run hypothesis tests on the parameters just as we would for any normally distributed random variable.

A hypothesis test is a statistical procedure in which an analyst tests an assumption about the population parameters. For instance, we may want to test the significance of a single regression coefficient in a simple linear regression. Most of these hypothesis tests are t-tests.

Whenever a statistical test is being performed, the following procedure is generally considered ideal:

1. Statement of both the null and the alternative hypothesis;

2. Select the appropriate test statistic, i.e., what’s being tested, e.g., the population means, the

difference between sample means, or variance;

3. Specify the level of significance;

4. Clearly state the decision rule to guide you in choosing whether to reject or not to reject the null hypothesis;

5. Calculate the sample statistic, and finally

6. Make a decision based on the sample results.

For instance, assume we are testing the null hypothesis that:

H_0: β = β_H0 vs. H_1: β ≠ β_H0

Where β_H0 is the hypothesized slope parameter.

Then the test statistic will be:

T = (β̂ − β_H0) / SEE_β

This statistic has an asymptotic normal distribution and is compared to a critical value C_t. The null hypothesis is rejected if:

|T| > C_t

For instance, if we assume a 5% significance level, then the critical value is 1.96.

We can also evaluate p-values. For a left-tailed test, the p-value is the probability lying below the calculated test statistic; similarly, for a right-tailed test, the p-value is the probability lying above the test statistic.

Denoting the test statistic by T, the p-value for H_1: β > β_H0 is given by:

P(Z > |T|) = 1 − P(Z ≤ |T|) = 1 − Φ(|T|)

Conversely, for H_1: β < β_H0, the p-value is given by:

P(Z ≤ −|T|) = Φ(−|T|)

Where Z is a standard normal random variable. Taking the absolute value of T ensures that the appropriate tail is measured whether T is negative or positive.

If the test is two-tailed, the p-value is given by the sum of the probabilities in the two tails. We start by determining the probability lying below the negative value of the test statistic. Then, we add this to the probability lying above the positive value of the test statistic. That is, the p-value for the two-tailed hypothesis test is given by:

2[1 − Φ(|T|)]

We can also construct confidence intervals (discussed in detail in the previous chapter). Recall that a confidence interval is the range of parameter values within which the true parameter is expected to lie at a given confidence level. For instance, a 95% confidence interval is the set of parameter values for which the null hypothesis cannot be rejected when using a 5% test size.

For instance, if we are performing a two-tailed hypothesis test, then the confidence interval is given by:

[β̂ − C_t × SEE_β, β̂ + C_t × SEE_β]

Example: Hypothesis Test on the Linear Regression Parameters

An investment analyst wants to explain the return on a portfolio (Y) using the prevailing interest rate (X) over the past 30 years. The mean interest rate is 7%, and the mean portfolio return is 14%. The covariance matrix is given by:

[ σ̂_Y²   σ̂_XY ]   =   [ 1600   500 ]
[ σ̂_XY   σ̂_X² ]       [  500   338 ]

Assume that the analyst wants to estimate the linear regression equation:

Ŷ_i = β̂_0 + β̂X_i

Test whether the slope coefficient is equal to zero, and construct a 95% confidence interval for the slope coefficient.

Solution

We start by stating the hypothesis:

H_0: β = 0 vs. H_1: β ≠ 0

The test statistic is:

T = (β̂ − β_H0) / SEE_β

We had calculated the slope from the matrix as:

β̂ = σ̂_XY / σ̂_X² = 500/338 = 1.4793

Now, recall that:

SEE_β̂ = s/(√n σ̂_X)

But

s² = (n/(n − 2)) σ̂_Y² (1 − ρ̂_XY)

So, in this case:

s² = (30/28) × 1600 × (1 − 500/(√338 × √1600)) = 548.7251

(Note that for ρ̂_XY we have used the relationship ρ̂_XY = σ̂_XY/(σ̂_X σ̂_Y).)

Therefore,

s = √s² = √548.7251 = 23.4249

So,

SEE_β̂ = s/(√n σ̂_X) = 23.4249/(√30 × √338) = 0.23263

Therefore, the t-statistic is given by:

T = (β̂ − β_H0)/SEE_β = 1.4793/0.23263 = 6.3590

For the two-tailed test, the critical value is 1.96. Since the t-statistic is greater than the critical value, we reject the null hypothesis.

For the 95% CI, we know it is given by:

[β̂ − C_t × SEE_β, β̂ + C_t × SEE_β]

= [1.4793 − 1.96 × 0.23263, 1.4793 + 1.96 × 0.23263]

= [1.0233, 1.9353]
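The full chain of this example can be reproduced in Python. Note that the residual-variance step below follows the worked example's arithmetic, which uses (1 − ρ̂_XY); small differences from the figures above are rounding:

```python
from math import sqrt

# Figures from the example above
n = 30
var_y, var_x, cov_xy = 1600.0, 338.0, 500.0

beta = cov_xy / var_x                        # slope estimate
rho = cov_xy / (sqrt(var_x) * sqrt(var_y))   # implied correlation

# Residual variance as computed in the worked example
s2 = n / (n - 2) * var_y * (1 - rho)
see = sqrt(s2) / (sqrt(n) * sqrt(var_x))     # standard error of the slope

t_stat = (beta - 0) / see                    # test of H0: beta = 0
ci = (beta - 1.96 * see, beta + 1.96 * see)  # 95% confidence interval
print(round(t_stat, 2), [round(b, 4) for b in ci])  # 6.36 [1.0233, 1.9352]
```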

Practice Question 1

Assume that you have carried out a regression analysis (to determine whether the slope is different from 0) and found that the slope is β̂ = 1.156. Moreover, you have constructed a 95% confidence interval of [0.550, 1.762]. What is the likely value of your test statistic?

A. 4.356

B. 3.7387

C. 0.7845

D. 0.6545

Solution

The correct answer is B.

This is a two-tailed test since we're asked to determine if the slope is different from zero. We know that the confidence interval is:

[β̂ − C_t × SEE_β, β̂ + C_t × SEE_β]

Which in this case is [0.550, 1.762].

We need to find the value of SEE_β. That is:

1.156 − 1.96 × SEE_β = 0.550 ⇒ SEE_β = (1.156 − 0.550)/1.96 = 0.3092

And we know that:

T = (β̂ − β_H0)/SEE_β = (1.156 − 0)/0.3092 = 3.7387
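The same back-out calculation in Python (the half-width of the symmetric CI equals the critical value times the standard error):

```python
beta_hat = 1.156
lower, upper = 0.550, 1.762
z = 1.96  # two-tailed 5% critical value

# Half-width of the CI equals z * SEE, so SEE can be backed out
see = (beta_hat - lower) / z
t_stat = (beta_hat - 0) / see
print(round(see, 4), round(t_stat, 4))  # 0.3092 3.7389 (≈ 3.7387 up to rounding)
```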

Practice Question 2

A trader develops a simple linear regression model to predict the price of a stock. The estimated slope coefficient for the regression is 0.60, the standard error is equal to 0.25, and the sample has 30 observations. Determine whether the estimated slope coefficient is significantly different from zero at a 5% level of significance by correctly stating the decision rule.

A. Accept H_1; the slope coefficient is statistically significant.

B. Reject H_0; the slope coefficient is statistically significant.

C. Reject H_0; the slope coefficient is not statistically significant.

D. Accept H_1; the slope coefficient is not statistically significant.

Solution

The correct answer is B.

Step 1: State the hypothesis

H_0: β_1 = 0
H_1: β_1 ≠ 0

Step 2: Compute the test statistic

(β̂_1 − β_H0)/S_β1 = (0.60 − 0)/0.25 = 2.4

Step 3: Find the critical value, t_c

From the t-table, t_{0.025,28} = 2.048.

Step 4: State the decision rule

Reject H_0; the slope coefficient is statistically significant since 2.4 > 2.048.
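The four steps above reduce to a one-line comparison in code; a minimal sketch using the figures from the question:

```python
slope, se = 0.60, 0.25
t_stat = (slope - 0) / se  # test of H0: beta_1 = 0
t_crit = 2.048             # t_{0.025, 28} from the t-table

# Two-tailed decision rule: reject H0 if |t| exceeds the critical value
reject = abs(t_stat) > t_crit
print(t_stat, reject)  # 2.4 True
```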

Reading 19: Regression with Multiple Explanatory Variables

After completing this reading, you should be able to:

Distinguish between the relative assumptions of single and multiple regression.

Interpret regression coefficients in multiple regression.

Interpret goodness of fit measures for single and multiple regressions, including R² and adjusted R².

Construct, apply, and interpret joint hypothesis tests and confidence intervals for multiple

coefficients in regression.

Unlike linear regression, multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y. In other words, it permits us to evaluate the effect of more than one independent variable on a given dependent variable.

The form of the multiple regression model (equation) is given by:

Y_i = β_0 + β_1 X_1i + β_2 X_2i + … + β_k X_ki + ε_i, ∀i = 1, 2, …, n

Intuitively, the multiple regression model has k slope coefficients and k + 1 regression coefficients. Normally, statistical software (such as Excel or R) is used to estimate the multiple regression model.

Interpreting the Multiple Regression Coefficients

The slope coefficient β_j measures the change in the dependent variable Y when the independent variable X_j changes by one unit, holding the other independent variables constant. The interpretation of multiple regression coefficients is therefore quite different from that of a linear regression with one independent variable: the effect of one variable is explored while keeping the other independent variables constant.

For instance, a linear regression model with one independent variable could be estimated as Ŷ = 0.6 + 0.85X_1. In this case, the slope coefficient is 0.85, which implies that a 1-unit increase in X_1 results in a 0.85-unit increase in the dependent variable Y.

Now, assume that we add a second independent variable to the regression so that the regression equation is Ŷ = 0.6 + 0.85X_1 + 0.65X_2. A unit increase in X_1 will not result in a 0.85-unit increase in Y unless X_1 and X_2 are uncorrelated. Therefore, we interpret 0.85 as follows: a one-unit increase in X_1 leads to a 0.85-unit increase in the dependent variable Y, while keeping X_2 constant.

OLS Estimators for the Multiple Regression Parameters

Estimating the multiple regression parameters by hand is challenging since it involves a large amount of algebra and the use of matrices. However, we can build a foundation of understanding using the multiple regression model with two explanatory variables.

Consider the following multiple regression equation:

Y_i = β_0 + β_1 X_1i + β_2 X_2i + ε_i

The OLS estimator of β_1 is obtained as follows:

The first step is to regress X_1 on X_2 and obtain the residuals of X_1i, given by:

ϵ_X1i = X_1i − α̂_0 − α̂_1 X_2i

Where α̂_0 and α̂_1 are the OLS estimators from the regression of X_1 on X_2.

The next step is to regress Y on X_2 to get the residuals of Y_i, given by:

ϵ_Yi = Y_i − γ̂_0 − γ̂_1 X_2i

Where γ̂_0 and γ̂_1 are the OLS estimators from the regression of Y on X_2. The final step is to regress the residuals of Y and X_1 (ϵ_Yi and ϵ_X1i) on each other to get:

ϵ_Yi = β̂_1 ϵ_X1i + ϵ_i

Note that there is no constant term, because the expected values of ϵ_Yi and ϵ_X1i are both 0. The purpose of the first and second regressions is to remove the effect of X_2 from both Y and X_1 by splitting each variable into a fitted value, which is correlated with X_2, and a residual, which is uncorrelated with X_2. The final step therefore regresses the component of Y that is uncorrelated with X_2 on the component of X_1 that is uncorrelated with X_2.

The OLS estimator for β_2 can be obtained analogously by exchanging the roles of X_1 and X_2 in the process above. By repeating this process, we can estimate a k-parameter model such as:

Y_i = β_0 + β_1 X_1i + β_2 X_2i + … + β_k X_ki + ε_i, ∀i = 1, 2, …, n

Most of the time, this is done using a statistical package such as Excel or R.
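The three-step residual procedure above can be sketched in pure Python. The data below are hypothetical, built exactly as y = 1 + 2x_1 + 3x_2, so the procedure should recover the coefficient 2 on x_1:

```python
def simple_ols(x, y):
    # Closed-form OLS intercept and slope for one regressor
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    return ybar - b * xbar, b

# Hypothetical exact data: y = 1 + 2*x1 + 3*x2
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

# Step 1: residuals of x1 after regressing it on x2
a0, a1 = simple_ols(x2, x1)
e_x1 = [xi - a0 - a1 * zi for xi, zi in zip(x1, x2)]

# Step 2: residuals of y after regressing it on x2
g0, g1 = simple_ols(x2, y)
e_y = [yi - g0 - g1 * zi for yi, zi in zip(y, x2)]

# Step 3: regress residuals on residuals (no constant) to get beta_1
beta1 = sum(a * b for a, b in zip(e_x1, e_y)) / sum(a * a for a in e_x1)
print(round(beta1, 6))  # 2.0
```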

Assumptions of the Multiple Regression Model

Suppose that we have n observations of the dependent variable (Y) and the independent variables (X_1, X_2, …, X_k), and we need to estimate the equation:

Y_i = β_0 + β_1 X_1i + β_2 X_2i + … + β_k X_ki + ε_i, ∀i = 1, 2, …, n

For us to make valid inferences from the above equation, we need the classical normal multiple linear regression model assumptions:

1. The relationship between the dependent variable, Y, and the independent variables, X_1, X_2, …, X_k, is linear.

2. The independent variables (X_1, X_2, …, X_k) are iid. Moreover, no exact linear relationship exists between two or more of the independent variables, X_1, X_2, …, X_k.

3. The expected value of the error term, conditioned on the independent variables, is 0: E(ϵ | X_1, X_2, …, X_k) = 0.

4. The variance of the error term is equal for all observations. That is, E(ϵ_i²) = σ_ϵ², i = 1, 2, …, n (the homoskedasticity assumption). This assumption enables us to estimate the distribution of the regression coefficients.

5. The error term ϵ is uncorrelated across observations. Mathematically, E(ϵ_i ϵ_j) = 0 ∀i ≠ j.

6. The error term ϵ is normally distributed. This allows us to test hypotheses about the regression.

7. There are no large outliers, so that E(X_ji⁴) < ∞ for all j = 1, 2, …, k.

The assumptions are almost the same as those of linear regression with one independent variable, except that the second assumption is tailored to ensure that there are no exact linear relationships between the independent variables (no perfect multicollinearity).

Measures of Goodness of Fit

The goodness of fit of a regression is measured using the coefficient of determination (R²) and the adjusted coefficient of determination.

The Coefficient of Determination (R²)

Recall that the standard error estimate indicates how confident we can be in a forecast made by a regression model. However, it does not tell us how well the independent variable explains the dependent variable. The coefficient of determination corrects this shortcoming.

The coefficient of determination measures the proportion of the total variation in the dependent variable that is explained by the independent variable. We can calculate it in two ways:

1. Squaring the Correlation Coefficient between the Dependent and Independent Variables

The coefficient of determination can be computed by squaring the correlation coefficient (r) between the dependent and independent variables. That is:

R² = r²

Recall that:

r = Cov(X, Y) / (σ_X σ_Y)

Where:

Cov(X, Y) = the covariance between the two variables, X and Y

σ_X = the standard deviation of X

σ_Y = the standard deviation of Y

However, this method only accommodates regression with one independent variable.

Example: Calculating the Coefficient of Determination Using the Correlation Coefficient

The correlation coefficient between the money supply growth rate (dependent, Y) and the inflation rate (independent, X) is 0.7565. The standard deviation of the dependent (explained) variable is 0.050, and that of the independent variable is 0.02. A regression analysis over ten years was conducted on these variables. We need to calculate the coefficient of determination.

Solution

We know that:

r = Cov(X, Y)/(σ_X σ_Y) = 0.0007565/(0.05 × 0.02) = 0.7565

So, the coefficient of determination is given by:

r² = 0.7565² = 0.5723 = 57.23%

So, in this regression, the inflation rate explains roughly 57.23% of the variation in the money supply growth rate over the ten years.
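The calculation above is a direct one-liner in Python, using the covariance and standard deviations given in the example:

```python
cov_xy = 0.0007565   # covariance between X and Y, from the solution above
sd_y, sd_x = 0.05, 0.02  # std devs of the dependent and independent variables

r = cov_xy / (sd_x * sd_y)  # correlation coefficient
r2 = r ** 2                 # coefficient of determination
print(round(r, 4), round(r2, 4))  # 0.7565 0.5723
```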

2. Method for Regression Models with One or More Independent Variables

In the absence of a regression model, our best estimate for any observation of the dependent variable would be its mean, Ȳ. Alternatively, with a regression model, we can predict an estimate of Y_i using the regression equation. The fitted relationship can be written as:

Y_i = β_0 + β_1 X_1i + β_2 X_2i + … + β_k X_ki + ε_i = Ŷ_i + ϵ̂_i

So that:

Y_i = Ŷ_i + ϵ̂_i

Now, if we subtract the mean of the dependent variable from both sides of the equation, then square and sum both sides, we get:

∑_{i=1}^n (Y_i − Ȳ)² = ∑_{i=1}^n (Ŷ_i − Ȳ + ϵ̂_i)²
= ∑_{i=1}^n (Ŷ_i − Ȳ)² + 2∑_{i=1}^n ϵ̂_i(Ŷ_i − Ȳ) + ∑_{i=1}^n ϵ̂_i²

Note that:

2∑_{i=1}^n ϵ̂_i(Ŷ_i − Ȳ) = 0

Since the sample correlation between Ŷ_i and ϵ̂_i is 0. The expression, therefore, reduces to:

∑_{i=1}^n (Y_i − Ȳ)² = ∑_{i=1}^n (Ŷ_i − Ȳ)² + ∑_{i=1}^n ϵ̂_i²

But

ϵ̂_i² = (Y_i − Ŷ_i)²

So that:

∑_{i=1}^n ϵ̂_i² = ∑_{i=1}^n (Y_i − Ŷ_i)²

Therefore,

∑_{i=1}^n (Y_i − Ȳ)² = ∑_{i=1}^n (Ŷ_i − Ȳ)² + ∑_{i=1}^n (Y_i − Ŷ_i)²

If the regression is useful for predicting Y_i, then the prediction error from the regression equation should be smaller than the error from predicting Y_i using the mean.

Now let:

Explained Sum of Squares (ESS) = ∑_{i=1}^n (Ŷ_i − Ȳ)²

Residual Sum of Squares (RSS) = ∑_{i=1}^n (Y_i − Ŷ_i)²

Total Sum of Squares (TSS) = ∑_{i=1}^n (Y_i − Ȳ)²

Then:

TSS = ESS + RSS

If we divide both sides by TSS, we get:

1 = ESS/TSS + RSS/TSS
⇒ ESS/TSS = 1 − RSS/TSS

Now, recall that the coefficient of determination is the fraction of the total variation that is explained by the regression. Denoted by R², the coefficient of determination is given by:

R² = Explained Variation/Total Variation = ESS/TSS = 1 − RSS/TSS

If a model does not explain any of the observed data, then it has an R² of 0. On the other hand, if the model perfectly describes the data, then it has an R² of 1. All other values lie between 0 and 1 and are always positive. For instance, in the example above, the R² of 0.5723 means that the inflation rate explains about 57% of the variation in the money supply growth rate, well short of a perfect fit.
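The decomposition TSS = ESS + RSS can be verified numerically on a small fitted regression. The dataset below is illustrative, not from the source:

```python
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
beta = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
       sum((a - xbar) ** 2 for a in x)
beta0 = ybar - beta * xbar
y_hat = [beta0 + beta * a for a in x]  # fitted values

tss = sum((b - ybar) ** 2 for b in y)                 # total variation
ess = sum((f - ybar) ** 2 for f in y_hat)             # explained variation
rss = sum((b - f) ** 2 for b, f in zip(y, y_hat))     # residual variation

r2 = 1 - rss / tss  # equals ESS/TSS since TSS = ESS + RSS
print(round(tss, 4), round(ess + rss, 4), round(r2, 4))  # 18.75 18.75 0.9627
```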

Limitations of R²

1. As the number of explanatory variables increases, the value of R² always increases, even if the new variable is almost completely irrelevant to the dependent variable. For instance, if a regression model with one explanatory variable is modified to have two explanatory variables, the new R² is greater than or equal to that of the single-explanatory-variable model. In the case where the new variable's coefficient is β = 0, adding the variable will not increase R²: the RSS remains the same, and so does R².

2. The coefficient of determination R² cannot be compared across models with different dependent variables. For instance, we cannot compare the R² for Y_i and ln Y_i.

3. T here is no standard value of R 2 that is considered good because its values depend on the

nature of the data involved.

Considering the first limitation, we now discuss the adjusted R 2.

The Adjusted R 2

Denoted by $\bar{R}^2$, the adjusted $R^2$ measures the goodness of fit without automatically increasing when an independent variable is added to the model; that is, it is adjusted for the degrees of freedom. Note that $\bar{R}^2$ is produced by statistical software. The relationship between $R^2$ and $\bar{R}^2$ is given by:

$$\bar{R}^2 = 1 - \frac{\left(\frac{\text{RSS}}{n-k-1}\right)}{\left(\frac{\text{TSS}}{n-1}\right)} = 1 - \left(\frac{n-1}{n-k-1}\right)\left(1 - R^2\right)$$

Where:

n = number of observations;

k = number of independent variables (slope coefficients).

T he adjusted R-squared can increase, but that happens only if the new variable improves the model

more than would be expected by chance. If the added variable improves the model by less than

expected by chance, then the adjusted R-squared decreases.

When $k \geq 1$, $R^2 > \bar{R}^2$, since adding a new independent variable causes a decrease in $\bar{R}^2$ whenever it produces only a small increase in $R^2$. This also explains why $\bar{R}^2$ can be negative even though $R^2$ is always nonnegative.

2
A point to note is that when we decide to use R̄ to compare the regression models, the dependent

variable is defined the same way and that the sample size is the same as that of R 2.
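The adjustment formula is easy to compute directly. The sketch below defines a small helper and evaluates it with hypothetical figures (n = 48 observations, comparing $R^2$ = 0.667 with k = 4 regressors against $R^2$ = 0.690 with k = 8):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: penalizes R^2 for the number of regressors k,
    given n observations."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# Adding regressors raises R^2 mechanically, but the adjusted version
# can fall if the gain in R^2 is small relative to the lost degrees of freedom.
print(round(adjusted_r2(0.667, 48, 4), 3))  # 0.636
print(round(adjusted_r2(0.690, 48, 8), 3))  # 0.626
```

Here the four-variable model wins on $\bar{R}^2$ despite the eight-variable model's higher raw $R^2$.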

The following are points to watch out for when applying the $R^2$ or the $\bar{R}^2$:

An added variable is not necessarily statistically significant just because the $R^2$ or the $\bar{R}^2$ has increased.

It is not always true that the regressors are a true cause of the dependent variable just because there is a high $R^2$ or $\bar{R}^2$.

It is not necessarily true that there is no omitted variable bias just because we have a high $R^2$ or $\bar{R}^2$.

It is not necessarily true that we have the most appropriate set of regressors just because we have a high $R^2$ or $\bar{R}^2$.

It is not necessarily true that we have an inappropriate set of regressors just because we have a low $R^2$ or $\bar{R}^2$.

A high $\bar{R}^2$ does not automatically indicate that the regression is well specified in the sense of including the right set of variables, since a high $\bar{R}^2$ could reflect other peculiarities of the data used in the analysis. Moreover, $\bar{R}^2$ can be negative if the regression model produces an extremely poor fit.

Joint Hypothesis Test on Multiple Regression Parameters

Previously, we conducted hypothesis tests on individual regression coefficients using the t-test. We now need to perform a joint hypothesis test on multiple regression coefficients using the F-test, based on the F-statistic.

In multiple regression, we cannot test the null hypothesis that all the slope coefficients are equal to 0 using a series of t-tests. This is because individual tests on the coefficients do not accommodate the effect of interactions among the independent variables (multicollinearity).

The F-test (a test of the regression's overall significance) determines whether the slope coefficients in a multiple linear regression are all equal to 0. That is, the null hypothesis is stated as:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$

against the alternative hypothesis that at least one slope coefficient is not equal to 0.

To accurately compute the test statistic for this null hypothesis, we need to identify the following:

I. The sum of squared residuals (SSR), also called the residual sum of squares:

$$\text{SSR} = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$

II. The explained sum of squares (ESS):

$$\text{ESS} = \sum_{i=1}^{n}\left(\hat{Y}_i - \bar{Y}\right)^2$$

III. The total number of observations (n).

IV. The number of parameters to be estimated. For example, in a regression analysis with one independent variable, there are two parameters: the slope and the intercept coefficients.

Using the above four requirements, we can determine the F-statistic. The F-statistic measures how effectively the regression equation explains the changes in the dependent variable. The F-statistic is denoted by F(number of slope parameters, n − number of parameters). For instance, the F-statistic for a multiple regression with two slope coefficients (and one intercept coefficient) is denoted $F_{2,\,n-3}$. The value n − 3 represents the denominator degrees of freedom for the F-statistic.

The F-statistic is the ratio of the average regression sum of squares to the average sum of squared errors. The average regression sum of squares is the regression sum of squares divided by the number of slope parameters (k) estimated. The average sum of squared errors is the sum of squared errors divided by the number of observations (n) less the total number of parameters estimated, n − (k + 1). Mathematically:

$$F = \frac{\text{Average regression sum of squares}}{\text{Average sum of squared errors}} = \frac{\left(\frac{\text{ESS}}{\text{number of slope parameters estimated}}\right)}{\left(\frac{\text{SSR}}{n - \text{number of parameters estimated}}\right)}$$

In this case, we are dealing with a multiple linear regression model with k independent variables, whose F-statistic is given by:

$$F = \frac{\left(\frac{\text{ESS}}{k}\right)}{\left(\frac{\text{SSR}}{n - (k+1)}\right)}$$

In regression analysis output (ANOVA part), MSR and MSE are displayed as the first and the second

quantities under the MSS (mean sum of the squares) column, respectively. If the overall regression’s

significance is high, then the ratio will be large.

If the independent variables do not explain any of the variation in the dependent variable, each predicted value ($\hat{Y}_i$) equals the mean of the dependent variable ($\bar{Y}$). Consequently, the regression sum of squares is 0, implying that the F-statistic is 0.

So, how do we decide the F-test? We reject the null hypothesis at the $\alpha$ significance level if the computed F-statistic is greater than the upper $\alpha$ critical value of the F-distribution with the given numerator and denominator degrees of freedom (the F-test is always a one-tailed test).

Example: Conducting F-test

An analyst runs a regression of monthly value-stock returns on four independent variables over 48

months.

The total sum of squares for the regression is 360, and the sum of squared errors is 120.

Test the null hypothesis at a 5% significance level (95% confidence) that all the four independent

variables are equal to zero.

Solution

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_4 = 0$$

versus

$$H_1: \beta_j \neq 0 \text{ (at least one } j \text{ is not equal to zero, } j = 1, 2, \ldots, 4\text{)}$$

$$\text{ESS} = \text{TSS} - \text{SSR} = 360 - 120 = 240$$

The calculated test statistic is:

$$F = \frac{\left(\frac{\text{ESS}}{k}\right)}{\left(\frac{\text{SSR}}{n-(k+1)}\right)} = \frac{\frac{240}{4}}{\frac{120}{43}} = 21.5$$

The critical value $F_{4,43}$ is approximately 2.59 at a 5% significance level.

Decision: Reject $H_0$.

Conclusion: At least one of the 4 independent variables is significantly different from zero.
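The calculation above can be sketched in a few lines (the 2.59 critical value is read from an F-table rather than computed here):

```python
# Sketch of the F-statistic calculation from the example above:
# TSS = 360, SSR = 120, k = 4 slope coefficients, n = 48 observations.
tss, ssr = 360.0, 120.0
k, n = 4, 48

ess = tss - ssr                        # explained sum of squares = 240
f_stat = (ess / k) / (ssr / (n - (k + 1)))
print(round(f_stat, 1))                # 21.5

# The tabulated 5% critical value for F(4, 43) is about 2.59,
# so the null that all slopes are zero is rejected.
f_crit = 2.59
print(f_stat > f_crit)                 # True
```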

Example: Calculating F-statistic and Conducting the F-test

An investment analyst wants to determine whether the natural log of the ratio of bid-offer spread to

the price of a stock can be explained by the natural log of the number of market participants and the

amount of market capitalization. He assumes a 5% significance level. The following is the result of

the regression analysis.

Coefficient Standard Error t-Statistic


Intercept 1.6959 0.2375 7.0206
Number of market participants −1.6168 0.0708 −22.8361
Amount of Capitalization −0.4709 0.0205 −22.9707

ANOVA df SS MSS F Significance F


Regression 2 3, 730.1534 1, 865.0767 2, 217.95 0.00
Residual 2, 797 2, 351.9973 0.8409
Total 2, 799 5, 801.2051

Residual standard error 0.9180
Multiple R-squared 0.6418
Observations 2, 800

We are concerned with the ANOVA (analysis of variance) results. We need to conduct an F-test to determine the overall significance of the regression.

Solution

So, the hypothesis is stated as:

$$H_0: \beta_1 = \beta_2 = 0$$

versus

$$H_1: \text{at least one } \beta_j \neq 0, \quad j = 1, 2$$

There are two slope coefficients, k = 2 (the coefficients on the natural log of the number of market participants and on the amount of market capitalization), which gives the degrees of freedom for the numerator of the F-statistic. For the denominator, the degrees of freedom are n − (k + 1) = 2,800 − 3 = 2,797.

The sum of the squared errors is 2,351.9973, while the regression sum of squares is 3,730.1534. Therefore, the F-statistic is:

$$F_{2,2797} = \frac{\left(\frac{\text{ESS}}{k}\right)}{\left(\frac{\text{SSR}}{n-(k+1)}\right)} = \frac{\frac{3730.1534}{2}}{\frac{2351.9973}{2797}} = 2217.9530$$

Since we are working at a 5% (0.05) significance level, we look at the second column of the F-distribution table, which displays critical values for 2 numerator degrees of freedom of the F-statistic, as seen below:

F Distribution: Critical Values of F (5% significance level)

1 2 3 4 5 6 7 8 9 10
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24
26 4.22 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16
35 4.12 3.27 2.87 2.64 2.49 2.37 2.29 2.22 2.16 2.11
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08
50 4.03 3.18 2.79 2.56 2.40 2.29 2.20 2.13 2.07 2.03
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99
70 3.98 3.13 2.74 2.50 2.35 2.23 2.14 2.07 2.02 1.97
80 3.96 3.11 2.72 2.49 2.33 2.21 2.13 2.06 2.00 1.95
90 3.95 3.10 2.71 2.47 2.32 2.20 2.11 2.04 1.99 1.94
100 3.94 3.09 2.70 2.46 2.31 2.19 2.10 2.03 1.97 1.93
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91
1000 3.85 3.00 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84

As seen from the table, the 5% critical value for the F-test lies between 3.00 and 3.07 (the tabulated values for 2 numerator degrees of freedom with 1,000 and 120 denominator degrees of freedom). The actual F-statistic is 2,217.95, which is far higher than the critical value, and thus we reject the null hypothesis that all the slope coefficients are equal to 0.

Calculating the Confidence Interval for the Regression Coefficient

A confidence interval (CI) is a closed interval in which the actual parameter is believed to lie with some degree of confidence. Confidence intervals can be used to perform hypothesis tests. For instance, we may want to assess a stock's valuation using the capital asset pricing model (CAPM); in this case, we might hypothesize that the stock's beta equals the market's average (systematic-risk) beta. The same approach used in regression analysis with one explanatory variable, based on the t-test, is also used in a multiple regression model.

Example: Calculating the Confidence Interval (CI)

An economist tests the hypothesis that interest rates and inflation can explain GDP growth in a

country. Using some 73 observations, the analyst formulates the following regression equation:

$$\text{GDP growth} = \hat{b}_0 + \hat{b}_1(\text{Interest}) + \hat{b}_2(\text{Inflation})$$

The regression estimates are as follows:

Coefficient Standard Error


Intercept 0.04 0.6%
Interest rates 0.25 6%
Inflation 0.20 4%

What is the 95% confidence interval for the coefficient on the inflation rate?

A. 0.12024 to 0.27976

B. 0.13024 to 0.37976

C. 0.12324 to 0.23976

D. 0.11324 to 0.13976

Solution

The correct answer is A.

From the regression output, $\hat{\beta}_2 = 0.20$ and its estimated standard error is $s_{\hat{\beta}_2} = 0.04$. The number of degrees of freedom is 73 − 2 − 1 = 70, so the t-critical value at the 0.05 significance level is $t_{0.025,\,70} = 1.994$. Therefore, the 95% confidence interval for the inflation coefficient is:

$$\hat{\beta}_2 \pm t_c s_{\hat{\beta}_2} = 0.20 \pm 1.994 \times 0.04 = [0.12024,\ 0.27976]$$
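The interval can be reproduced in a couple of lines; the critical value 1.994 is taken from a t-table with 70 degrees of freedom rather than computed here:

```python
# Sketch: 95% confidence interval for the inflation coefficient.
# beta_hat and its standard error come from the regression output above;
# t_crit = 1.994 is the two-tailed 5% critical value with 70 df.
beta_hat = 0.20
se_beta = 0.04
t_crit = 1.994

lower = beta_hat - t_crit * se_beta
upper = beta_hat + t_crit * se_beta
print(round(lower, 5), round(upper, 5))  # 0.12024 0.27976
```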

Practice Questions

Question 1

An analyst runs a regression of monthly value-stock returns on four independent

variables over 48 months. The total sum of squares for the regression is 360 and the sum

of squared errors is 120. Calculate the R 2.

A. 42.1%

B. 50%

C. 33.3%

D. 66.7%

The correct answer is D.

$$R^2 = \frac{\text{ESS}}{\text{TSS}} = \frac{360 - 120}{360} = 66.7\%$$

Question 2

Refer to the previous problem and calculate the adjusted R 2.

A. 27.1%

B. 63.6%

C. 72.9%

D. 36.4%

The correct answer is B.

$$\bar{R}^2 = 1 - \frac{n-1}{n-k-1} \times \left(1 - R^2\right) = 1 - \frac{48-1}{48-4-1} \times (1 - 0.667) = 63.6\%$$

Question 3

Refer to the previous problem. The analyst now adds four more independent variables to

the regression and the new R 2 increases to 69%. What is the new adjusted R 2 and which

model would the analyst prefer?

A. T he analyst would prefer the model with four variables because its adjusted R 2 is

higher.

B. T he analyst would prefer the model with four variables because its adjusted R 2 is

lower.

C. T he analyst would prefer the model with eight variables because its adjusted R 2 is

higher.

D. T he analyst would prefer the model with eight variables because its adjusted R 2 is

lower.

The correct answer is A.

New $R^2$ = 69%

$$\text{New adjusted } R^2 = 1 - \frac{48-1}{48-8-1} \times (1 - 0.69) = 62.6\%$$

The analyst would prefer the first model because it has a higher adjusted $R^2$, and the model has four independent variables as opposed to eight.

Question 4

An economist tests the hypothesis that GDP growth in a certain country can be

explained by interest rates and inflation.

Using some 30 observations, the analyst formulates the following regression equation:

$$\text{GDP growth} = \hat{\beta}_0 + \hat{\beta}_1(\text{Interest}) + \hat{\beta}_2(\text{Inflation})$$

Regression estimates are as follows:

Coefficient Standard Error


Intercept 0.10 0.5%
Interest Rates 0.20 0.05
Inflation 0.15 0.03

Is the coefficient for interest rates significant at 5%?

A. Since the test statistic < t-critical, we accept H 0; the interest rate coefficient

is not significant at the 5% level.

B. Since the test statistic > t-critical, we reject H 0; the interest rate coefficient

is not significant at the 5% level.

C. Since the test statistic > t-critical, we reject H 0; the interest rate coefficient is

significant at the 5% level.

D. Since the test statistic < t-critical, we accept H 1; the interest rate coefficient

is significant at the 5% level.

The correct answer is C.

We have: GDP growth = 0.10 + 0.20(Interest) + 0.15(Inflation)

Hypothesis:

$$H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0$$

The test statistic is:

$$t = \frac{0.20 - 0}{0.05} = 4$$

The critical value is $t_{\alpha/2,\,n-k-1} = t_{0.025,\,27} = 2.052$ (which can be found in the t-table).

df/p 0.40 0.25 0.10 0.05 0.025 0.01


25 0.256060 0.684430 1.316345 1.708141 2.05954 2.48511
26 0.255955 0.684043 1.314972 1.705618 2.05553 2.47863
27 0.255858 0.683685 1.313703 1.703288 2.05183 2.47266
28 0.255768 0.683353 1.312527 1.701131 2.04841 2.46714
29 0.255684 0.683044 1.311434 1.699127 2.04523 2.46202

Decision: Since the test statistic > t-critical, we reject $H_0$.

Conclusion: The interest rate coefficient is significant at the 5% level.

Reading 20: Regression Diagnostics

After completing this reading, you should be able to:

Explain how to test whether regression is affected by heteroskedasticity.

Describe approaches to using heteroskedastic data.

Characterize multicollinearity and its consequences; distinguish between multicollinearity

and perfect collinearity.

Describe the consequences of excluding a relevant explanatory variable from a model

and contrast those with the consequences of including an irrelevant regressor.

Explain two model selection procedures and how these relate to the bias-variance tradeoff.

Describe the various methods of visualizing residuals and their relative strengths.

Describe methods for identifying outliers and their impact.

Determine the conditions under which OLS is the best linear unbiased estimator.

Regression Model Specifications

Model specification is the process of determining which independent variables should be included in or excluded from a regression model.

That is, an ideal regression model should contain all the variables that explain the dependent variable and exclude those that do not.

Model specification includes residual diagnostics and statistical tests of the assumptions underlying the OLS estimators. Basically, the choice of variables to be included in a model depends on the bias-variance tradeoff. For instance, large models that include all the relevant variables are likely to have unbiased coefficients. On the other hand, smaller models produce estimates with lower estimation error, at the risk of excluding relevant variables.

The conventional specification analysis makes sure that the functional form of the model is adequate, the parameters are constant, and the homoskedasticity assumption is met.

The Omitted Variables

An omitted variable is one with a non-zero coefficient that is excluded from the regression model.

Effects of Omitting Variables

I. The remaining variables absorb the impact of the excluded variables through their common variation. Thus, they do not consistently estimate the effect of a change in an independent variable on the dependent variable, holding all other things constant.

II. The magnitude of the estimated residuals is larger than the true values. This is because the estimated residuals contain both the true shock and the effect of the omitted variable, which cannot be reflected in the included variables.

Illustration of the Omitted Variables

Suppose that the regression model is stated as:

$$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i$$

If we omit $X_2$ from the estimated model, then the model is given by:

$$Y_i = \alpha + \beta_1 X_{1i} + \epsilon_i$$

Now, in large samples, the OLS estimator $\hat{\beta}_1$ converges to:

$$\beta_1 + \beta_2\delta$$

Where:

$$\delta = \frac{\text{Cov}(X_1, X_2)}{\text{Var}(X_1)}$$

δ is the population slope coefficient in a regression of X 2 on X 1.

It is clear that the bias due to the omitted variable depends on the population coefficient of the excluded variable, $\beta_2$, and the strength of the relationship between $X_2$ and $X_1$, represented by $\delta$.

When the correlation between $X_1$ and $X_2$ is high, $X_1$ can explain a significant proportion of the variation in $X_2$, and hence the bias is large. On the other hand, if the independent variables are uncorrelated, that is, $\delta = 0$, then $\hat{\beta}_1$ is a consistent estimator of $\beta_1$.

In conclusion, omitting a variable biases the coefficients on the included variables that are correlated with the omitted variable.

Inclusion of Extraneous Variables

An extraneous variable is one that is unnecessarily included in the model: its actual coefficient is 0, and it is consistently estimated to be 0 in large samples. Including such variables is nonetheless costly.

Illustration of Effect of Inclusion of Extraneous Random Variables

Recall that the adjusted $R^2$ can be written as:

$$\bar{R}^2 = 1 - \xi\,\frac{\text{RSS}}{\text{TSS}}$$

Where:

$$\xi = \frac{n-1}{n-k-1}$$

Looking at the formula above, adding more variables increases k, which in turn increases $\xi$ and hence reduces $\bar{R}^2$. However, if an added variable is relevant, the RSS becomes smaller, which offsets the effect of $\xi$ and produces a larger $\bar{R}^2$.

In contrast, this is not the case when the true coefficient of the added variable is equal to 0, because in that case the RSS remains essentially constant as $\xi$ increases, leading to a smaller $\bar{R}^2$ and a larger standard error.

Lastly, as the correlation between $X_1$ and $X_2$ increases, the standard errors rise.

The Bias-Variance Tradeoff

The bias-variance tradeoff amounts to choosing between including irrelevant variables and excluding relevant ones. Bigger models tend to have low bias because they include more of the relevant variables. However, they give less precise estimates of the regression parameters due to the possibility of including extraneous variables.

Conversely, regression models with fewer independent variables have lower estimation error but are more prone to biased parameter estimates.

Methods of Choosing a Model from a Set of Independent Variables

1. General-to-Specific Model Selection

In the general-to-specific method, we start with a large general model that incorporates all the relevant variables. Then, the general model is reduced step by step. We use hypothesis tests to establish whether there are any statistically insignificant coefficients in the estimated model. When such coefficients are found, the variable whose coefficient has the smallest t-statistic is removed. The model is then re-estimated using the remaining set of independent variables. Once more, hypothesis tests are carried out to establish whether statistically insignificant coefficients are present. These two steps (remove and re-estimate) are repeated until all statistically insignificant coefficients have been removed.
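The removal loop just described can be sketched as follows. This is a minimal illustration on hypothetical simulated data, using numpy for the linear algebra; the 1.96 threshold corresponds to a 5% two-sided test in large samples:

```python
import numpy as np

def ols_tstats(X, y):
    """OLS coefficients and their t-statistics; X must include a constant column."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (n - p)                      # residual variance
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return b, b / se

def general_to_specific(X, y, names, crit=1.96):
    """Repeatedly drop the slope with the smallest insignificant |t|, then refit."""
    cols = list(range(X.shape[1]))
    while cols:
        Z = np.column_stack([np.ones(len(y)), X[:, cols]])
        _, t = ols_tstats(Z, y)
        slopes = np.abs(t[1:])            # skip the intercept
        worst = int(np.argmin(slopes))
        if slopes[worst] >= crit:         # everything left is significant
            break
        del cols[worst]
    return [names[c] for c in cols]

# Hypothetical data: x1 drives y, x2 is pure noise.
rng = np.random.default_rng(3)
x1, x2 = rng.normal(size=400), rng.normal(size=400)
y = 1.0 + 2.0 * x1 + rng.normal(size=400)
kept = general_to_specific(np.column_stack([x1, x2]), y, ["x1", "x2"])
print(kept)  # "x1" is always retained; "x2" is usually dropped
```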

2. m-fold Cross-Validation

T he m-fold cross-validation model-selection method aims at choosing the model that’s best at

fitting observations not used to estimate parameters.

How is this method executed?

As a first step, the number of candidate models has to be decided, and this is determined in part by the number of explanatory variables. When this number is small, the researcher can consider all the possible combinations. With 10 variables, for example, 1,024 (= 2^10) distinct models can be constructed.

T he cross-validation process proceeds as follows:

1. Shuffle the dataset randomly.

2. Split the dataset into m groups.

3. Estimate parameters using m − 1 of the groups; these groups make up what we call the training block. The excluded group is referred to as the validation block.

4. Use the estimated parameters and the data in the excluded group (the validation block) to compute residual values. These residuals are referred to as out-of-sample residuals since they are arrived at using data not included in the sample used to come up with the parameter estimates.

5. Repeat parameter estimation and residual computation a total of m times; each group must serve once as the validation block and be used to compute residuals.

6. Compute the sum of squared errors using the residuals estimated from the out-of-sample data.

7. Select the model with the smallest out-of-sample sum of squared residuals.
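The steps above can be sketched for a one-variable model as follows (pure Python, with hypothetical data for illustration):

```python
import random

def ols_fit(x, y):
    """(intercept, slope) of a one-variable least-squares fit."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / \
        sum((xi - xb) ** 2 for xi in x)
    return yb - b * xb, b

def mfold_cv_sse(x, y, m, seed=0):
    """Out-of-sample sum of squared residuals from m-fold cross-validation."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)         # step 1: shuffle the dataset
    folds = [idx[i::m] for i in range(m)]    # step 2: split into m groups
    sse = 0.0
    for fold in folds:                       # each fold is the validation block once
        train = [i for i in idx if i not in fold]
        a, b = ols_fit([x[i] for i in train], [y[i] for i in train])
        sse += sum((y[i] - (a + b * x[i])) ** 2 for i in fold)
    return sse

# Hypothetical data for illustration only.
x = [0.5, 1.1, 1.9, 2.4, 3.3, 4.1, 4.8, 5.5, 6.2, 7.0]
y = [1.2, 2.0, 3.1, 3.6, 4.9, 5.8, 6.6, 7.9, 8.4, 9.7]
sse = mfold_cv_sse(x, y, m=5)
print("out-of-sample SSE:", round(sse, 4))  # compare across candidate models
```

In practice, this number would be computed for each candidate model, and the model with the smallest out-of-sample SSE selected.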

Heteroskedasticity

Recall that homoskedasticity is one of the critical assumptions behind the distribution of the OLS estimator: the variance of $\epsilon_i$ is constant and does not vary with any of the independent variables, formally stated as $\text{Var}(\epsilon_i \mid X_{1i}, X_{2i}, \ldots, X_{ki}) = \sigma^2$.

Heteroskedasticity is a systematic pattern in the residuals in which the variances of the residuals are not constant.

Test for Heteroskedasticity

Halbert White proposed a simple test with the following two-step procedure:

I. Estimate the model and calculate the residuals, $\hat{\epsilon}_i$.

II. Regress the squared residuals on:

1. a constant;

2. all explanatory variables;

3. the cross products of all the independent variables, including the product of each variable with itself.

Consider an original model with two independent variables:

$$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i$$

The first step is to calculate the residuals using the OLS parameter estimates:

$$\hat{\epsilon}_i = Y_i - \hat{\alpha} - \hat{\beta}_1 X_{1i} - \hat{\beta}_2 X_{2i}$$

Now, we need to regress the squared residuals on a constant, $X_1$, $X_2$, $X_1^2$, $X_2^2$, and $X_1 X_2$:

$$\hat{\epsilon}_i^2 = \gamma_0 + \gamma_1 X_{1i} + \gamma_2 X_{2i} + \gamma_3 X_{1i}^2 + \gamma_4 X_{2i}^2 + \gamma_5 X_{1i}X_{2i} + \eta_i$$

If the data are homoskedastic, then $\hat{\epsilon}_i^2$ should not be explained by any of the variables, and the null hypothesis is:

$$H_0: \gamma_1 = \cdots = \gamma_5 = 0$$

The test statistic is calculated as $nR^2$, where $R^2$ is computed from the second (auxiliary) regression. The test statistic has a $\chi^2_{k(k+3)/2}$ (chi-squared) distribution, where k is the number of explanatory variables in the first-step model.

For instance, if the number of explanatory variables is two (k = 2), then the test statistic has a $\chi^2_5$ distribution.
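For the two-regressor case, the two-step procedure can be sketched with numpy as below. The data are simulated and hypothetical; a packaged version of this test is also available in statsmodels (`statsmodels.stats.diagnostic.het_white`):

```python
import numpy as np

def white_test_stat(y, x1, x2):
    """White's test statistic (n * R^2 of the auxiliary regression) for a
    two-regressor model; compare against a chi-squared with k(k+3)/2 = 5 df."""
    n = len(y)
    # Step 1: fit the original model and collect the squared residuals.
    X = np.column_stack([np.ones(n), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ beta) ** 2
    # Step 2: regress e^2 on a constant, the regressors, their squares,
    # and their cross product.
    Z = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])
    gamma, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    fitted = Z @ gamma
    r2 = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    return n * r2

# Hypothetical simulated data with homoskedastic errors.
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=200)
stat = white_test_stat(y, x1, x2)
print("White statistic:", round(stat, 2))  # compare to the chi-squared(5)
                                           # 5% critical value (about 11.07)
```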

Modeling Heteroskedastic Data

The three common methods of handling data with heteroskedastic shocks include:

1. Ignoring the heteroskedasticity when estimating the parameters, and then using White's covariance estimator in hypothesis tests.

However simple, this method leads to less accurate model parameter estimates compared to other methods that address the heteroskedasticity.

2. Transformation of the data.

For instance, positive data can be log-transformed to try to remove heteroskedasticity and give a better view of the data. Another transformation divides the dependent variable by another positive variable.

3. Use of weighted least squares (WLS).

This is a more involved method that applies weights to the data before estimating the parameters. That is, if we know that $\text{Var}(\epsilon_i) = w_i^2\sigma^2$, where $w_i$ is known, then we can transform the data by dividing by $w_i$ to remove the heteroskedasticity from the errors. In other words, WLS regresses $\frac{Y_i}{w_i}$ on $\frac{1}{w_i}$ and $\frac{X_i}{w_i}$, such that:

$$\frac{Y_i}{w_i} = \alpha\frac{1}{w_i} + \beta\frac{X_i}{w_i} + \frac{\epsilon_i}{w_i}$$

$$\bar{Y}_i = \alpha \bar{C}_i + \beta\bar{X}_i + \bar{\epsilon}_i$$

Note that the parameters of the model above are estimated using OLS on the transformed data; that is, the weighted version of $Y_i$, which is $\bar{Y}_i$, is regressed on the two weighted explanatory variables $\bar{C}_i = \frac{1}{w_i}$ and $\bar{X}_i = \frac{X_i}{w_i}$. Note that the WLS model does not explicitly include the intercept $\alpha$, but the coefficient on $\bar{C}_i$ retains the interpretation of the intercept.
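The transformation can be sketched as follows. The data here are hypothetical and noise-free so that the recovery of α and β is exact; in practice the shocks would satisfy Var(εᵢ) = wᵢ²σ²:

```python
import numpy as np

def wls_fit(y, x, w):
    """Weighted least squares assuming Var(eps_i) = w_i^2 * sigma^2:
    divide every term by w_i, then run OLS on the transformed data."""
    Z = np.column_stack([1.0 / w, x / w])   # transformed constant and regressor
    (alpha, beta), *_ = np.linalg.lstsq(Z, y / w, rcond=None)
    return alpha, beta

# Hypothetical, noise-free data so the recovery is exact; the weights w_i
# are assumed known, as the WLS setup requires.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
w = x                                    # assumed: shock scale grows with x
y = 2.0 + 1.5 * x
alpha, beta = wls_fit(y, x, w)
print(round(alpha, 6), round(beta, 6))   # 2.0 1.5
```

The returned coefficients keep the usual OLS interpretation: `alpha` is the intercept and `beta` the slope of the original (untransformed) model.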

Multicollinearity

Multicollinearity occurs when one or more independent variables can be substantially explained by the others. For instance, in the case of two independent variables, there is evidence of multicollinearity if the $R^2$ from regressing one variable on the other is very high.

In contrast with multicollinearity, perfect collinearity is where one of the variables is perfectly correlated with the others, such that the $R^2$ of the regression of $X_j$ on the remaining independent variables is exactly 1.

Conventionally, an $R^2$ above 90% in such a regression leads to problems in medium sample sizes such as n = 100. Multicollinearity does not pose an issue for parameter estimation as such; rather, it brings some difficulties in modeling the data.

When multicollinearity is present, some of the coefficients in a regression model are jointly statistically significant (the F-statistic is substantial), but the individual t-statistics are very small (less than 1.96), since the regression captures the collective effect of the variables rather than their individual effects.

Addressing Multicollinearity

There are two ways of dealing with multicollinearity:

I. Ignoring the multicollinearity altogether, since it is technically not a problem for estimation.

II. Identifying the multicollinear variables and excluding them from the model. Multicollinear variables can be identified using the variance inflation factor (VIF), which compares the variance of the coefficient on independent variable $X_j$ in the full model with its variance in a model that includes only $X_j$. It is computed from the auxiliary regression of $X_j$ on the other k − 1 independent variables:

$$X_{ji} = \gamma_0 + \gamma_1 X_{1i} + \cdots + \gamma_{j-1} X_{j-1,i} + \gamma_{j+1} X_{j+1,i} + \cdots + \gamma_k X_{ki} + \eta_i$$

The variance inflation factor for variable $X_j$ is then:

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ comes from regressing $X_j$ on the other variables in the model. When the VIF is above 10, it is considered too large, and the variable should be excluded from the model.
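The VIF computation can be sketched with numpy as below, on hypothetical data in which x2 is nearly a copy of x1:

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j: regress X_j on the other
    columns (plus a constant) and return 1 / (1 - R_j^2)."""
    n = X.shape[0]
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(n), others])
    gamma, *_ = np.linalg.lstsq(Z, target, rcond=None)
    resid = target - Z @ gamma
    r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Hypothetical data: x2 is almost collinear with x1, x3 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.05 * rng.normal(size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])
print(vif(X, 0) > 10, vif(X, 2) < 10)  # True True
```

Under the rule of thumb above, x1 (and likewise x2) would be flagged for exclusion, while x3 would be kept.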

Residual Plots

Residual plots are used to identify deficiencies in a model specification. When the residuals are not systematically related to any of the included independent (explanatory) variables and are relatively small in magnitude (within ±4s, where s is the standard deviation of the model's shocks), the model is considered a good fit.

A residual plot is a graph of $\hat{\epsilon}_i$ (vertical axis) against the independent variable $x_i$. Alternatively, we could plot the standardized residuals $\frac{\hat{\epsilon}_i}{s}$, which makes the size of any deviation apparent.

Outliers

Outliers are values that, if removed from the sample, produce large changes in the estimated coefficients. They can also be viewed as data points that deviate significantly from the normal observations, as if they were generated by a different mechanism.

Cook’s distance helps us measure the impact of dropping a single observation j on a regression (and

the line of best fit).

The Cook's distance is given by:

$$D_j = \frac{\sum_{i=1}^{n}\left(\hat{Y}_i^{(-j)} - \hat{Y}_i\right)^2}{k s^2}$$

Where:

$\hat{Y}_i^{(-j)}$ = the fitted value for observation i when observation j is excluded and the model is estimated using the remaining n − 1 observations;

k = the number of coefficients in the regression model;

s² = the estimated error variance from the model using all observations.

When an observation is an inlier (excluding it does not affect the coefficient estimates), its Cook's distance ($D_j$) is small. On the other hand, a $D_j$ greater than 1 indicates an outlier.

Example: Calculating Cook’s Distance

Consider the following data sets:

Observation Y X
1 3.67 1.85
2 1.88 0.65
3 1.35 −0.63
4 0.34 1.24
5 −0.89 −2.45
6 1.95 0.76
7 2.98 0.85
8 1.65 0.28
9 1.47 0.75
10 1.58 −0.43
11 0.66 1.14
12 0.05 −1.79
13 1.67 1.49
14 −0.14 −0.64
15 9.05 1.87

Looking at the dataset above, it is easy to see that the Y value of observation 15 is much larger than the rest of the observations, so it may be an outlier. However, we need to verify this.

We begin by fitting the regression to the whole dataset to obtain $\hat{Y}_i$, and then to the 14 observations that remain after excluding the suspected outlier to obtain $\hat{Y}_i^{(-j)}$.

Fitting the whole dataset gives the regression equation:

$$\hat{Y}_i = 1.4465 + 1.1281X_i$$

Excluding the suspected outlier gives:

$$\hat{Y}_i^{(-j)} = 1.1516 + 0.6828X_i$$

Now the fitted values are as shown below:

Observation   Y       X       Ŷi        Ŷi(−j)     (Ŷi − Ŷi(−j))²
1             3.67    1.85    3.533     2.4148     1.2504
2             1.88    0.65    2.179     1.5954     0.3406
3             1.35    −0.63   0.7358    0.7214     0.0002
4             0.34    1.24    2.8453    1.9983     0.7174
5             −0.89   −2.45   −1.3174   −0.5213    0.6338
6             1.95    0.76    2.3039    1.6705     0.4012
7             2.98    0.85    2.4053    1.732      0.4533
8             1.65    0.28    1.7624    1.3428     0.1761
9             1.47    0.75    2.2926    1.6637     0.3955
10            1.58    −0.43   0.9614    0.858      0.0107
11            0.66    1.14    2.7325    1.921      0.6585
12            0.05    −1.79   −0.5728   −0.07061   0.2522
13            1.67    1.49    3.1274    2.169      0.9185
14            −0.14   −0.64   0.7245    0.7146     0.0001
15            9.05    1.87    3.556     2.4284     1.2715
Sum                                                7.4800

With $s^2 = 3.554$, the Cook's distance is:

$$D_j = \frac{\sum_{i=1}^{n}\left(\hat{Y}_i^{(-j)} - \hat{Y}_i\right)^2}{k s^2} = \frac{7.4800}{2 \times 3.554} = 1.0523$$

Since $D_j > 1$, observation 15 can be considered an outlier.
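The calculation generalizes easily. The sketch below implements the formula for a one-variable regression on hypothetical data with an obvious outlier at the last observation (the error variance is estimated with an n − k denominator, one reasonable choice the formula above leaves open):

```python
import numpy as np

def ols_coefs(x, y):
    """Intercept and slope from a one-variable OLS fit."""
    X = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

def cooks_distance(x, y, j):
    """Cook's distance for observation j: the scaled sum of squared changes
    in the fitted values when observation j is dropped from the sample."""
    n, k = len(x), 2                           # k = number of coefficients
    a, b = ols_coefs(x, y)
    fitted = a + b * x
    s2 = np.sum((y - fitted) ** 2) / (n - k)   # error variance, full sample
    keep = np.arange(n) != j
    a_d, b_d = ols_coefs(x[keep], y[keep])
    fitted_d = a_d + b_d * x                   # refit, evaluated at all n points
    return np.sum((fitted_d - fitted) ** 2) / (k * s2)

# Hypothetical data: the last observation sits far off the y = 2x trend.
x = np.arange(1.0, 11.0)
y = 2.0 * x
y[-1] = 40.0                                   # outlier at x = 10
print(round(cooks_distance(x, y, 9), 2))       # 2.11 -> treated as an outlier
print(cooks_distance(x, y, 4) < 1)             # True for an inline observation
```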

Strengths of Ordinary Least Squares (OLS )

OLS is the Best Linear Unbiased Estimator (BLUE) when some key assumptions are met, which

implies that it can assume the smallest possible variance among any given estimator that is linear and

unbiased:

Li neari ty: the parameters being estimated using the OLS method must be themselves

linear.

190
© 2014-2023 AnalystPrep.
Random: the data must have been randomly sampled from the population.

Non-Col l i neari ty: the regressors being calculated should not be perfectly correlated

with each other.

Exogenei ty: the regressors aren’t correlated with the error term.

Homoscedasticity: the variance of the error term is constant.

However, being a BLUE estimator comes with the following limitations:

I. Many estimators are not linear, such as maximum likelihood estimators (which may be biased).

II. The BLUE property is heavily dependent on the residuals being homoskedastic. If the variances of the residuals vary with the independent variables, it is possible to construct linear unbiased estimators (LUE) of the coefficients α and β using WLS, but with extra assumptions.

When the residuals are iid and normally distributed with a mean of 0 and variance of σ², formally stated as ϵi ∼ iid N(0, σ²), OLS is upgraded from BLUE to BUE (Best Unbiased Estimator) by virtue of having the smallest variance among all linear and nonlinear estimators. However, normally distributed errors are neither a requirement for accurate estimates of the model coefficients nor a necessity for the desirable properties of the estimators.

Practice Question 1

Which of the following statements is/are correct?

I. Homoskedasticity means that the variance of the error terms is constant for all

independent variables.

II. Heteroskedasticity means that the variance of error terms varies over the sample.

III. T he presence of conditional heteroskedasticity reduces the standard error.

A. Only I

B. II and III

C. All statements are correct

D. None of the statements are correct

Solution

T he correct answer is C.

All statements are correct

If the variance of the residuals is constant across all observations in the sample, the

regression is said to be homoskedastic. When the opposite is true, the regression is said

to exhibit heteroskedasticity, i.e., the variance of the residuals is not the same across all

observations in the sample. T he presence of conditional heteroskedasticity poses a

significant problem: it introduces a bias into the estimators of the standard error of the

regression coefficients. As such, it understates the standard error.

Practice Question 2

A financial analyst fails to include a variable which inherently has a non-zero coefficient

in his regression analysis. Moreover, the ignored variable is highly correlated with the

remaining variables.

What is the most likely deficiency of the analyst’s model?

A. Omitted variable bias.

B. Bias due to inclusion of extraneous variables.

C. Presence of heteroskedasticity.

D. None of the above.

Solution

T he correct answer is A.

Omitted variable bias occurs under two conditions:

I. A variable with a non-zero coefficient is omitted

II. A variable that is omitted is correlated with remaining (included) variables.

T hese conditions are met in the description of the analyst’s model.

Option B is incorrect since an extraneous variable is one that is unnecessarily included in the model: its true coefficient is 0, and its estimated value converges to 0 in large samples.

Option C is incorrect because heteroskedasticity is a condition where the variance of

the errors varies systematically with the independent variables of the model.

Reading 21: Stationary Time Series

After completing this reading, you should be able to:

Describe the requirements for a series to be covariance stationary.

Define the autocovariance function and the autocorrelation function.

Define white noise; describe independent white noise and normal (Gaussian) white noise.

Define and describe the properties of autoregressive (AR) processes.

Define and describe the properties of moving average (MA) processes.

Explain how a lag operator works.

Explain mean reversion and calculate a mean-reverting level.

Define and describe the properties of autoregressive moving average (ARMA) processes.

Describe the application of AR, MA, and ARMA processes.

Describe sample autocorrelation and partial autocorrelation.

Describe the Box-Pierce Q-statistic and the Ljung-Box Q statistic.

Explain how forecasts are generated from ARMA models.

Describe the role of mean reversion in long-horizon forecasts.

Explain how seasonality is modeled in a covariance-stationary ARMA.

A time series is a collection of observations of a variable's outcomes over distinct periods, for example, a company's monthly sales over the past ten years. Time series models are used to forecast future values of the series. A time series can be decomposed into trend, seasonal, and cyclical components. A trending time series changes its level over time, while a seasonal time series has predictable changes at given times of the year. Lastly, a cyclical time series, as its name suggests, reflects cycles in the data. We will concentrate on cyclical data (especially linear stochastic processes).

A stochastic process is a set of random variables, usually denoted by Yt. The subscript orders the random variables in time, so that Ys occurs before Yt if s < t.

A linear process has a general form of:

Yt = αt + β0ϵt + β1ϵt−1 + β2ϵt−2 + … = αt + ∑_{i=0}^{∞} βiϵt−i

The linear process is linear in the shocks, ϵt. αt is deterministic, while the βi are constant coefficients.

Covariance Stationary Time Series

The ordered set {…, y−2, y−1, y0, y1, y2, …} is called the realization of a time series. Theoretically, it starts in the infinite past and proceeds to the infinite future. In practice, however, only a finite subset of the realization, called a sample path, can be used.

A series is said to be covariance stationary if both its mean and its covariance structure are stable over time.

More specifically, a time series is said to be covariance stationary if:

I. The mean does not change and is thus constant over time. That is:

E(Yt) = μ ∀t

II. The variance does not change over time and is finite. That is:

V(Yt) = γ0 < ∞ ∀t

III. The autocovariance of the time series is finite, does not change over time, and depends only on the distance h between two observations. That is:

Cov(Yt, Yt−h) = γh ∀t

Covariance stationarity is crucial: it ensures the time series has a stable relationship across time and that the parameters are easily interpreted, since the parameter estimates will be asymptotically normally distributed.

Autocovariance and Autocorrelation Functions

The Autocovariance Function

It can be quite challenging to quantify the stability of a covariance structure. We will, therefore, use the autocovariance function. The autocovariance is the covariance between the stochastic process at two different points in time (analogous to the covariance between two random variables). It is given by:

γt,h = E [(Y t − E (Y t )) (Y t−h − E (Y t−h ))]

And if the length h = 0, then:

γt,0 = E[(Yt − E(Yt))²]

which is the variance of Yt.

T he autocovariance is a function of h so that:

γh = γ|h|

T his is asserting the fact that the autocovariance depends on the length h and not the time t. So that:

Cov (Y t, Y t−h ) = Cov (Y t−h , Y t)

The autocorrelation is defined as:

ρ(h) = Cov(Yt, Yt−h) / (√V(Yt)√V(Yt−h)) = γh / √(γ0γ0) = γh / γ0

Similarly, for h = 0:

ρ(0) = γ0 / γ0 = 1

The autocorrelation ranges from −1 to 1 inclusive. The partial autocorrelation function is denoted p(h); in a linear population regression of Yt on Yt−1, …, Yt−h, it is the coefficient of Yt−h. This regression is referred to as an autoregression because the variable is regressed on its own lagged values.

White Noise

Assume that:

yt = ϵt

ϵt ∼ (0, σ²), σ² < ∞

where ϵ t is the shock and is uncorrelated over time. T herefore, ϵ t and y t are said to be serially

uncorrelated.

A process with zero mean, constant variance, and no serial correlation is referred to as zero-mean white noise (or just white noise) and is written as:

ϵ t ∼ W N (0, σ 2)

And:

y t ∼ W N (0, σ 2)

ϵt and yt are serially uncorrelated but not necessarily serially independent. If y is serially uncorrelated and also serially independent, then it is said to be independent white noise. Therefore, we write:

yt ∼ iid(0, σ²)

This is read as "y is independently and identically distributed with mean 0 and constant variance." If y is serially uncorrelated and has a normal distribution, it is also serially independent; in this case, y is called normal white noise or Gaussian white noise, written as:

yt ∼ iid N(0, σ²)

To characterize the dynamic stochastic structure of y t ∼ W N (0, σ 2) , it follows that the

unconditional mean and variance of y are:

E (y t) = 0

And:

var (y t ) = σ 2

T hese two are constant since only displacement affects the autocovariances rather than time. All

the autocovariances and autocorrelations are zero beyond displacement zero since white noise is

uncorrelated over time.

T he following is the autocovariance function for a white noise process:

γ(h) = { σ², h = 0
0, h ≥ 1

T he following is the autocorrelation function for a white noise process:

ρ (h) = { 1, h=0
0, h≥1

Beyond displacement zero, all partial autocorrelations for a white noise process are zero. T hus, by

construction white noise is serially uncorrelated. T he following is the function of the partial

autocorrelation for a white noise process:

p(h) = { 1, h=0
0, h≥1

Simple transformations of white noise are used to construct processes with much richer dynamics; white noise itself corresponds to the 1-step-ahead forecast errors from good models. The mean and variance of a process conditional on its past are another crucial characterization of its dynamics, with important implications for forecasting.

To compare the conditional and unconditional means and variances, consider the independent white noise yt ∼ iid(0, σ²). y has an unconditional mean of zero and an unconditional variance of σ². Now, consider the information set:

Ωt−1 = {yt−1, yt−2, …}

Or:

Ωt−1 = {ϵ t−1 , ϵ t−2 , …}

T he conditional mean and variance do not necessarily have to be constant. T he conditional mean for

the independent white noise process is:

E (y t|Ωt−1 ) = 0

T he conditional variance is:

var (y t|Ωt−1 ) = E ((y t − E (y t |Ωt−1))2 |Ωt−1) = σ 2

Independent white noise series have identical conditional and unconditional means and variances.
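As a quick numerical check of these properties (a sketch, not from the text: the seed, σ = 2, and sample size are arbitrary), Gaussian white noise should have a sample mean near 0, a sample variance near σ², and sample autocorrelations near 0 at every non-zero lag:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
y = rng.normal(0.0, sigma, size=100_000)   # y_t ~ iid N(0, sigma^2)

def sample_acf(x, h):
    """Sample autocorrelation at lag h (full-sample-mean version)."""
    xc = x - x.mean()
    return (xc[h:] * xc[:-h]).sum() / (xc ** 2).sum()

mean, var = y.mean(), y.var()
acf = [sample_acf(y, h) for h in (1, 2, 3)]
print(mean, var, acf)   # mean near 0, variance near 4, all lags near 0
```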

Wold’s Theorem

Assume that {yt} is any zero-mean covariance-stationary process. Then:

Yt = ϵt + β1ϵt−1 + β2ϵt−2 + ⋯ = ∑_{i=0}^{∞} βiϵt−i

Where:

ϵ t ∼ W N (0, σ 2)

Note that β0 = 1 and ∑_{i=0}^{∞} βi² < ∞.

Wold's representation is the correct model for any covariance-stationary series. Since ϵt corresponds to the 1-step-ahead forecast errors that would be incurred if a particularly good forecast were applied, the ϵt's are called the innovations.

Time-Series Models

The Autoregressive (AR) Models

AR models are time series models, widely used in finance and economics, that link the stochastic process Yt to its previous value Yt−1. The first-order AR model, denoted AR(1), is given by:

Yt = α + βYt−1 + ϵt

Where:

α = intercept

β = AR parameter

ϵt = the shock, which is white noise (ϵt ∼ WN(0, σ²))

Since Yt is assumed to be covariance stationary, the mean, variance, and autocovariances are all constant. By the principle of covariance stationarity,

E(Y t) = E(Y t−1 ) = μ

T herefore,

E(Yt) = E(α + βYt−1 + ϵt) = α + βE(Yt−1) + E(ϵt)

⇒ μ = α + βμ + 0

α
∴μ=
1−β

And for the variance,

V(Yt) = V(α + βYt−1 + ϵt) = β²V(Yt−1) + V(ϵt) + 2βCov(Yt−1, ϵt)

γ0 = β²γ0 + σ² + 0

⇒ γ0 = σ² / (1 − β²)

Note that Cov(Yt−1, ϵt) = 0, since Yt−1 depends only on the shocks ϵt−1, ϵt−2, …, which are uncorrelated with ϵt.

The autocovariances of the AR(1) process are calculated recursively. The first autocovariance of the AR(1) model is given by:

Cov(Yt, Yt−1) = Cov(α + βYt−1 + ϵt, Yt−1)
= βCov(Yt−1, Yt−1) + Cov(ϵt, Yt−1)
= βγ0

T he remaining autocovariance is recursively calculated as:

Cov(Yt, Yt−h) = Cov(α + βYt−1 + ϵt, Yt−h)
= βCov(Yt−1, Yt−h) + Cov(ϵt, Yt−h)
= βγh−1

It should be easy to see that Cov(Yt−h, ϵt) = 0. Applying this recursion:

γh = β h γ0

T herefore we can generalize the autocovariance as:

γh = β|h|γ0

Intuitively, the autocorrelation function is given by:

ρ(h) = γh / γ0 = β^|h|

The ACF tends to 0 as h increases and oscillates if −1 < β < 0. The partial autocorrelation of an AR(1) model is given by:

p(h) = { β^|h|, |h| ∈ {0, 1}
0, |h| ≥ 2
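The AR(1) moment formulas above are easy to verify by simulation. The following is a sketch; the parameter values α = 0.5, β = 0.8, and σ = 1 are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, n = 0.5, 0.8, 200_000
y = np.empty(n)
y[0] = alpha / (1 - beta)                 # start at the stationary mean
for t in range(1, n):                     # Y_t = alpha + beta*Y_{t-1} + eps_t
    y[t] = alpha + beta * y[t - 1] + rng.normal()

mean_theory = alpha / (1 - beta)          # 2.5
var_theory = 1.0 / (1 - beta**2)          # sigma^2/(1 - beta^2), sigma = 1

def sample_acf(x, h):
    """Sample autocorrelation at lag h."""
    xc = x - x.mean()
    return (xc[h:] * xc[:-h]).sum() / (xc ** 2).sum()

# Sample mean, variance, and ACF should match mu, gamma_0, and beta^h.
print(y.mean(), y.var(), sample_acf(y, 1), sample_acf(y, 2))
```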

The Lag Operator

T he lag operator denoted by L is important for manipulating complex time-series models. As its name

suggests, the lag operator moves the index of a particular observation one step back. T hat is:

LY t = Y t−1

Properties of the Lag Operator

(I). T he lag operator moves the index of a time series one step back. T hat is:

LY t = Y t−1

(II). The mth-order lag operator, L^m, moves the index m steps back. That is:

L^m Yt = Yt−m

For instance, L²Yt = L(LYt) = L(Yt−1) = Yt−2

(III). T he lag operator of a constant is just a constant.

For example Lα = α

(IV). T he pth order lag operator is given by:

a(L) = 1 + a1 L + a2L 2 + … + apL p

so that:

a(L)Y t = Y t + a1Y t−1 + a2 Y t−2 + … + ap Y t−p

(V). The lag operator has a multiplicative property. Consider two first-order lag polynomials a(L) = 1 + a1L and b(L) = 1 + b1L. Then:

a(L)b(L)Yt = (1 + a1L)(1 + b1L)Yt
= (1 + a1L)(Yt + b1Yt−1)
= Yt + b1Yt−1 + a1Yt−1 + a1b1Yt−2

Moreover, the lag operator has a commutative property, so that:

a(L)b(L) = b(L)a(L)

(VI). Under some restrictive conditions, the lag operator polynomial can be inverted, so that a(L)a(L)⁻¹ = 1. When a(L) is a first-order lag polynomial given by 1 − a1L, it is invertible if |a1| < 1, and its inverse is given by:

(1 − a1L)⁻¹ = 1 + a1L + a1²L² + … = ∑_{i=0}^{∞} a1^i L^i

For an AR(1) model,

Y t = α + βY t−1 + ϵ t

T his can be expressed with the lag operator so that:

Yt = α + βLYt + ϵt

⇒ (1 − βL)Y t = α + ϵ t

If |β|<1, then the lag polynomial above is invertible so that:

(1 − βL)−1 (1 − βL)Y t = (1 − βL)−1α + (1 − βL)−1 ϵ t

⇒ Yt = α ∑_{i=0}^{∞} β^i + ∑_{i=0}^{∞} β^i L^i ϵt = α/(1 − β) + ∑_{i=0}^{∞} β^i ϵt−i
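This MA(∞) representation can be checked numerically (a sketch with arbitrary α = 1 and β = 0.6): build Yt by the AR(1) recursion, then rebuild the last observation from the shocks alone. Truncating the infinite sum at 60 lags already gives machine-precision agreement, since 0.6⁶⁰ is negligible.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta, n = 1.0, 0.6, 5_000
eps = rng.normal(size=n)

y = np.empty(n)
y[0] = alpha / (1 - beta) + eps[0]        # start the recursion at the mean
for t in range(1, n):
    y[t] = alpha + beta * y[t - 1] + eps[t]

# Truncated MA(infinity): Y_t = alpha/(1-beta) + sum_i beta^i * eps_{t-i}
K = 60
ma = alpha / (1 - beta) + sum(beta**i * eps[n - 1 - i] for i in range(K))
print(y[-1] - ma)    # essentially zero
```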

The p th Order Autoregressive Model (AR(p))

T he AR(p) model is a generalization of the AR(1) model to include the p lags of Y t−1. T hus, the AR(p)

is given by:

Yt = α + β1Yt−1 + β2Yt−2 + … + βpYt−p + ϵt

If Y t is covariance stationary, then the long-run mean is given by:

α
E (Y t ) =
1 − β1 − β2 − … βp

And the long-run variance is given by:

σ2
V (Y t) = γ0 =
1 − β1 ρ1 − β2ρ2 − … βpρp

From the formulas of the mean and variance of the AR(p) model, the covariance stationarity

property is satisfied if:

β1 + β2 + ⋯ + βp < 1

Otherwise, the covariance stationarity will be violated.

The autocorrelation function of the AR(p) model has the same broad structure as that of the AR(1) model: the ACF decays toward 0 as the lag between the two observations increases and may oscillate. However, higher-order ARs may have more complex ACF structures.

The Moving Average Models (MA)

T he first-order moving average model denoted by MA(1) is given by:

Y t = μ + θϵ t−1 + ϵ t

Where ϵ t ∼ W N (0, σ2).

Evidently, the process Yt depends on the current shock ϵt and the previous shock ϵt−1, where the coefficient θ measures the extent to which the previous shock affects the process. Note that μ is the mean of the process, since:

E(Y t) = E(μ + θϵ t−1 + ϵ t ) = E(μ) + θE(ϵ t−1 ) + E(ϵ t )


= μ+0+0 = μ

For θ > 0, MA(1) is persistent because the consecutive values are positively correlated. On the

other hand, if θ < 0, the process mean reverts because the effect of the previous shock is reversed

in the current period.

T he MA(1) model is always a covariance stationary process. T he mean is as shown above, while the

variance of the MA(1) model is given by:

V (Y t) = V(μ + θϵ t−1 + ϵ t) = V (μ) + θ2 V(ϵ t−1 ) + V (ϵ t )


= 0 + θ2V (ϵ t−1 ) + V (ϵ t ) = θ2 σ2 + σ2
⇒ V(Y t ) = σ2(1 + θ2 )

T he variance uses the intuition that the shock is white noise processes that are uncorrelated.

T he MA(1) model has a non-zero autocorrelation function given by:


ρ(h) = { 1, h = 0
θ/(1 + θ²), h = 1
0, h ≥ 2

The partial autocorrelations (PACF) of the MA(1) model are complex and non-zero at all lags.

From the MA(1), we can generalize the qth order MA process. Denoted by MA(q), it is given by:

Y t = μ + ϵ t + θ1ϵt−1 + … + θq ϵ t−q

The mean of the MA(q) process is still μ, since all the shocks are white noise (their expectations are 0). The autocovariance function of the MA(q) process is given by:

γ(h) = { σ² ∑_{i=0}^{q−h} θiθi+h, 0 ≤ h ≤ q
0, h > q

where θ0 = 1.

T he value of θ can be determined by substituting the value taken by the autocorrelation function and

solving the resulting quadratic equation. T he partial autocorrelation of an MA(q) model is complex

and non-zero at all lags.
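The MA(1) formulas ρ(1) = θ/(1 + θ²) and V(Yt) = σ²(1 + θ²) can be checked by simulation. This is a sketch; μ = 0.5, θ = 0.7, and σ = 1 are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, theta, n = 0.5, 0.7, 200_000
eps = rng.normal(size=n + 1)
y = mu + theta * eps[:-1] + eps[1:]       # Y_t = mu + theta*eps_{t-1} + eps_t

rho1_theory = theta / (1 + theta**2)      # about 0.4698
var_theory = 1 + theta**2                 # sigma^2*(1 + theta^2), sigma = 1

def sample_acf(x, h):
    """Sample autocorrelation at lag h."""
    xc = x - x.mean()
    return (xc[h:] * xc[:-h]).sum() / (xc ** 2).sum()

# rho_hat(1) should match theta/(1+theta^2); rho_hat(2) should be near 0.
print(y.mean(), y.var(), sample_acf(y, 1), sample_acf(y, 2))
```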

Example: Moving Average Process.

Given an MA(2), Y t = 3.0 + 5ϵ t−1 + 5.75ϵ t−2 + ϵ t where ϵ t ∼ W N (0, σ2). What is the mean of the

process?

Solution

T he MA(2) is given by:

Y t = μ + θ1ϵ t−1 + θ2 ϵ t−2 + ϵ t

Where μ is the mean. So, the mean of the above process is 3.0

The Autoregressive Moving Average (ARMA) Models

T he ARMA model is a combination of AR and MA processes. Consider a first-order ARMA model

(ARMA(1,1)). It is given by:

Y t = α + βY t−1 + θϵ t−1 + ϵ t

T he mean of the ARMA(1,1) model is given by:

α
μ=
1 −β

And variance is given by

γ0 = σ²(1 + 2βθ + θ²) / (1 − β²)

T he autocovariance function is given by:

γ(h) = { σ²(1 + 2βθ + θ²)/(1 − β²), h = 0
σ²(1 + βθ)(β + θ)/(1 − β²), h = 1
βγh−1, h ≥ 2

The ACF of the ARMA(1,1) decays as the lag h increases and oscillates if β < 0, which is consistent with the AR model. The PACF tends to 0 as the lag h increases, which is consistent with the MA process. The slow decay of the ARMA's ACF and PACF distinguishes it from the pure AR and MA models.

From the variance formula of the ARMA(1,1), it is easy to see that the process is covariance stationary if |β| < 1.

ARMA(p,q) Model

As the name suggests, ARMA(p,q) is a combination of the AR(p) and MA(q) process. Its form is given

by:

Yt = α + β1Yt−1 + … + βpYt−p + θ1ϵt−1 + … + θqϵt−q + ϵt

When expressed using lag polynomial, this expression reduces to:

β(L)Y t = α + θ(L)ϵ t

Analogous to the ARMA(1,1), the ARMA(p,q) is covariance stationary if its AR portion is covariance stationary. The autocovariances and ACF of the ARMA process are complex; they decay slowly to 0 as the lag h increases and may oscillate.

Sample Autocorrelation

T he sample autocorrelation is utilized in validating the ARMA models. T he autocovariance estimator

is given by:

γ̂h = 1/(T − h) ∑_{i=h+1}^{T} (Yi − Ȳ)(Yi−h − Ȳ)

where Ȳ is the full-sample mean.

T he autocorrelation estimator is given by:

ρ̂h = ∑_{i=h+1}^{T} (Yi − Ȳ)(Yi−h − Ȳ) / ∑_{i=1}^{T} (Yi − Ȳ)² = γ̂h / γ̂0

The autocorrelation is such that −1 ≤ ρ̂h ≤ 1.
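These estimators translate directly into code (a sketch; note that γ̂h uses the 1/(T − h) scaling from the formula above, while ρ̂h is the ratio of the cross-product sum to the total sum of squares):

```python
import numpy as np

def autocov(y, h):
    """gamma_hat(h): sample autocovariance about the full-sample mean."""
    T = len(y)
    yc = y - y.mean()
    return (yc[h:] * yc[:T - h]).sum() / (T - h)

def autocorr(y, h):
    """rho_hat(h): cross-product sum over the total sum of squares."""
    yc = y - y.mean()
    return (yc[h:] * yc[:len(y) - h]).sum() / (yc ** 2).sum()

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(autocov(y, 1), autocorr(y, 1))   # 1.0 and 0.4 for this tiny series
```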

Test for Autocorrelation

Testing for autocorrelation can be done graphically by plotting the ACF and PACF of the residuals and checking for deficiencies, such as the model's inadequacy in capturing the dynamics of the data. However, graphical methods on their own are unreliable.

T he common tests used are Box-Pierce and Ljung-Box tests.

Box-Pierce and Ljung-Box Tests.

The Box-Pierce and Ljung-Box tests both test the null hypothesis that:

H0: ρ1 = ρ2 = … = ρh = 0

against the alternative that:

H1: ρj ≠ 0 for at least one j ∈ {1, …, h}

Under the null hypothesis, both test statistics are chi-squared distributed with h degrees of freedom (χ²h). If the test statistic is larger than the critical value, the null hypothesis is rejected.

Box-Pierce Test

T he test statistic under the Box-Pierce is given by:

QBP = T ∑_{i=1}^{h} ρ̂i²

That is, the test statistic is the sum of the squared autocorrelations scaled by the sample size T; it is a χ²h random variable if the null hypothesis is true.

Ljung-Box Test

The Ljung-Box test is a revised version of the Box-Pierce test that is more appropriate for small sample sizes. The test statistic is given by:

QLB = T(T + 2) ∑_{i=1}^{h} ρ̂i² / (T − i)

The Ljung-Box test statistic is also a χ²h random variable.
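Both statistics are straightforward to compute from the sample autocorrelations. This is a sketch; the ρ̂ values and sample size T below are hypothetical:

```python
import numpy as np

def box_pierce(rho, T):
    """Q_BP = T * sum of squared sample autocorrelations."""
    return T * float(np.sum(np.square(rho)))

def ljung_box(rho, T):
    """Q_LB = T(T+2) * sum of rho_i^2/(T-i); small-sample refinement of Q_BP."""
    rho = np.asarray(rho, dtype=float)
    i = np.arange(1, len(rho) + 1)
    return T * (T + 2) * float(np.sum(rho**2 / (T - i)))

rho_hat = [0.12, -0.08, 0.05]     # hypothetical autocorrelations at lags 1..3
T = 250
q_bp = box_pierce(rho_hat, T)     # compare with a chi^2 critical value, 3 df
q_lb = ljung_box(rho_hat, T)
print(q_bp, q_lb)                 # Q_LB is slightly larger than Q_BP
```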

Model Selection

T he first step in model selection is the inspection of the sample autocorrelations and the PACFs.

T his provides the initial signs of the correlation of the data and thus can be used to select the type of

models to be used.

T he next step is to measure the fit of the selected model. T he most commonly used method of

measuring the model’s fit is Mean Squared Error (MSE) which is defined as:

σ̂² = 1/T ∑_{t=1}^{T} ϵ̂t²

When the MSE is small, the selected model explains more of the time series. However, choosing a model purely to minimize the MSE amounts to maximizing the coefficient of determination R², which can lead to overfitting. To address this problem, other measures of fit have been developed that add an adjustment factor to the MSE each time a parameter is added. These measures are termed Information Criteria (IC). There are two such ICs: the Akaike Information Criteria (AIC) and the Bayesian Information Criteria (BIC).

Akaike Information Criteria (AIC)

Akaike Information Criteria (AIC) is defined as:

AIC = T ln(σ̂²) + 2k

where T is the sample size and k is the number of parameters. The AIC adds a penalty of 2 for each additional parameter.

Bayesian Information Criteria (BIC).

Bayesian Information Criteria (BIC) is defined as:

BIC = T ln(σ̂²) + k ln(T)

where the variables are defined as in the AIC; note, however, that the adjustment factor in the BIC increases with the sample size T. Hence, it is a consistent model selection criterion. Moreover, the BIC never selects a model larger than the one selected by the AIC.
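Both criteria follow directly from their definitions. In the sketch below, the MSE figures and sample size are hypothetical, chosen so that a tiny MSE improvement does not justify an extra parameter:

```python
import math

def aic(mse, T, k):
    """AIC = T*ln(sigma_hat^2) + 2k."""
    return T * math.log(mse) + 2 * k

def bic(mse, T, k):
    """BIC = T*ln(sigma_hat^2) + k*ln(T); the penalty grows with T."""
    return T * math.log(mse) + k * math.log(T)

T = 500
small = (1.000, 2)   # (MSE, number of parameters)
large = (0.999, 3)   # the extra parameter barely improves the fit
# Both criteria prefer the smaller (more parsimonious) model here:
print(aic(small[0], T, small[1]) < aic(large[0], T, large[1]))
print(bic(small[0], T, small[1]) < bic(large[0], T, large[1]))
```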

The Box-Jenkins Methodology

The Box-Jenkins methodology provides criteria for selecting between models that are equivalent but have different parameter values. The equivalence of the models implies that their means, ACFs, and PACFs are equal.

The Box-Jenkins methodology postulates two principles for selecting models. The first principle is parsimony: given two equivalent models, choose the one with fewer parameters.

The second principle is invertibility: when selecting an MA or ARMA model, choose one whose MA coefficients are invertible.

Model Forecasting

Forecasting is the process of using current information to predict the future. In time series forecasting, we can make a one-step forecast or a forecast over any time horizon h.

The one-step forecast is the conditional expectation E(YT+1|ΩT), where ΩT is the information set at time T, which includes the entire history of Y (YT, YT−1, …) and the shock history (ϵT, ϵT−1, …). In practice, this forecast is written in shorthand as ET(YT+1), so that:

E(YT+1|ΩT) = ET(YT+1)

Principles of Forecasting.

There are three rules of forecasting:

I. The time-T expectation of a current or past value is its realization. That is: ET(YT) = YT. This also applies to the residuals: ET(ϵT−1) = ϵT−1.

II. The expectation of any future shock is always 0. That is:

ET(ϵT+h) = 0

III. Forecasts are computed recursively, beginning with ET(YT+1); the forecast for a given horizon may depend on the forecast for the previous horizon.

Let us consider some examples.

For the AR(1) model, the one-step forecast is given by:

ET (Y T+1 ) = ET (α + βY T + ϵ T +1) = α + βET (Y T ) + 0


= α + βY T

Note that we use the current value YT to predict YT+1, and that the future shock ϵT+1 has expectation 0.

T he two-step forecast is given by:

ET (Y T +2) = ET (α + βY T +1 + ϵ T +2 )
= α + βET (Y T +1 ) + ET (ϵ T +2 )

But ET (ϵ T +2) = 0 and ET (Y T +1) = α + βY T

So that:

ET (Y T +2) = α + βET (α + βY T ) = α + β (α + βY T )
⇒ ET (Y T +2) = α + αβ + β2Y T

Analogously, for the forecast at horizon h we have:

ET(YT+h) = α + αβ + αβ² + … + αβ^(h−1) + β^h YT = ∑_{i=0}^{h−1} αβ^i + β^h YT

The Mean Reverting Level

When h is large, β^h must be very small, by the covariance stationarity of Yt (|β| < 1). Therefore, it can be shown that:

lim_{h→∞} [∑_{i=0}^{h−1} αβ^i + β^h YT] = α / (1 − β)

The limit is actually the mean of the AR(1) model. This mean-reverting level implies that YT does not affect the distant future of Y. That is,

lim_{h→∞} ET(YT+h) = E(Yt)

T he same procedure is applied to MA and ARMA models.

T he forecast error is the difference between the true future value and the forecasted value, that is,

ϵ T +1 = Y T +1 − ET (Y T+1 )

For longer time-horizon, the forecast is mostly functions of the model parameters.

Example: Model Forecasting

The AR(1) model for the default rate on premiums at an insurance company is given by

Dt = 0.055 + 0.934Dt−1 + ϵ t

Given that DT = 1.50, what is the one-step forecast of the default rate?

Solution

We need:

ET (Y T+1 ) = α + βY T
⇒ ET (DT+1 ) = 0.055 + 0.934 × 1.5 = 1.4560
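The recursion and the mean-reverting level can be traced in a few lines (a sketch using the example's parameters α = 0.055 and β = 0.934):

```python
alpha, beta = 0.055, 0.934
y_T = 1.50

def forecast(h):
    """h-step AR(1) forecast: E_T(Y_{T+j}) = alpha + beta * E_T(Y_{T+j-1})."""
    f = y_T
    for _ in range(h):
        f = alpha + beta * f
    return f

print(round(forecast(1), 4))         # 1.456, as in the example
mean_reverting = alpha / (1 - beta)  # about 0.8333
print(forecast(200))                 # converges to the mean-reverting level
```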

Seasonality of Time Series

Some time series are seasonal. For instance, sales in summer may differ from those in winter. Time series with deterministic seasonality are non-stationary, while those with stochastic seasonality are stationary and can hence be modeled with AR or ARMA processes.

A pure seasonal model utilizes lags at the seasonal frequency. For instance, with quarterly data, the pure seasonal AR(1) model of a quarterly seasonal time series is:

(1 − βL⁴)Yt = α + ϵt

so that:

Yt = α + βYt−4 + ϵt

A more flexible seasonal model includes both short-term and seasonal lag components. The short-term components utilize lags at the observation frequency.

Seasonality can also be introduced into AR, MA, or ARMA models by multiplying the short-run lag polynomial by the seasonal lag polynomial. For instance, the seasonal ARMA is specified as:

ARMA(p, q) × (ps , qs )f

where p and q are the orders of the short-run lag polynomials, and ps and qs are the orders of the seasonal lag polynomials. In practice, seasonal lag polynomials are restricted to one seasonal lag, because the accuracy of the parameter estimates depends on the number of full seasonal cycles in the sample data.
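A pure seasonal AR(1) at a quarterly frequency behaves like four interleaved AR(1) processes, one per quarter: correlation shows up at the seasonal lag 4 but not at lag 1. A simulation sketch (the values α = 1 and β = 0.6 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta, n = 1.0, 0.6, 100_000
y = np.empty(n)
y[:4] = alpha / (1 - beta)              # start each quarter at its mean
for t in range(4, n):                   # Y_t = alpha + beta*Y_{t-4} + eps_t
    y[t] = alpha + beta * y[t - 4] + rng.normal()

def sample_acf(x, h):
    """Sample autocorrelation at lag h."""
    xc = x - x.mean()
    return (xc[h:] * xc[:-h]).sum() / (xc ** 2).sum()

# Near beta at the seasonal lag, near zero at the non-seasonal lag:
print(sample_acf(y, 4), sample_acf(y, 1))
```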

Question 1

T he following sample autocorrelation estimates are obtained using 300 data points:

Lag 1 2 3
Coefficient 0.25 −0.1 −0.05

Compute the value of the Box-Pierce Q-statistic.

A. 22.5

B. 22.74

C. 30

D. 30.1

T he correct answer is A.

QBP = T ∑_{h=1}^{m} ρ̂²(h)
= 300(0.25² + (−0.1)² + (−0.05)²)
= 22.5

Question 2

T he following sample autocorrelation estimates are obtained using 300 data points:

Lag 1 2 3
Coefficient 0.25 −0.1 −0.05

Compute the value of the Ljung-Box Q-statistic.

A. 30.1

B. 30

C. 22.5

D. 22.74

T he correct answer is D.

QLB = T(T + 2) ∑_{h=1}^{m} ρ̂²(h)/(T − h)
= 300(302)(0.25²/299 + (−0.1)²/298 + (−0.05)²/297)
= 22.74

Note: Provided the sample size is large, the Box-Pierce and the Ljung-Box tests typically

arrive at the same result.

Question 3

Assume the shock in a time series is approximated by Gaussian white noise. Yesterday's

realization, y(t) was 0.015, and the lagged shock was -0.160. Today's shock is 0.170.

If the weight parameter theta, θ, is equal to 0.70 and the mean of the process is 0.5,

determine today's realization under a first-order moving average, MA(1), process.

A. -4.205

B. 4.545

C. 0.558

D. 0.282

T he correct answer is C.

Today’s shock = ϵ t ; yesterday’s shock = ϵ t−1; today’s realization = y t ; yesterday’s

realization = y t−1.

T he MA(1) is given by:

yt = μ + θϵt−1 + ϵt
= 0.5 + 0.170 + 0.7(−0.160)
= 0.558

Reading 22: Nonstationary Time Series

After completing this reading, you should be able to:

Describe linear and nonlinear time trends.

Explain how to use regression analysis to model seasonality.

Describe a random walk and a unit root.

Explain the challenges of modeling time series containing unit-roots.

Describe how to test if a time series contains a unit root.

Explain how to construct an h-step-ahead point forecast for a time series with seasonality.

Calculate the estimated trend value and form an interval forecast for a time series.

Recall that stationary time series have means, variances, and autocovariances that are independent of time. Any time series that violates this rule is termed non-stationary. Non-stationary time series include time trends, random walks (also called unit roots), and seasonalities. Time trends reflect the tendency of a series to grow over time.

Seasonalities occur when the time series changes across seasons, such as from quarter to quarter. Seasonality can appear as a shift in the mean (for example, depending on the period of the year) or as a cycle in the mean of the series (which occurs when the current shock is related to the shock in the same period of a previous cycle). Seasonality can be modeled using dummy variables, or by modeling period-over-period changes (such as year-over-year) in an attempt to remove the seasonal shift in the mean.

In a random walk, each value of the time series depends on the previous value plus a shock. We discuss each of these non-stationarities below.

Time Trends.

The time trend deterministically shifts the mean of the time series. The time trend can be linear or non-linear (which includes log-linear and quadratic time series).

Linear Time Trends

Linear trend models are those in which the dependent variable changes at a constant rate with time. If the time series yt has a linear trend, we can model the series with the following equation:

Y t = β0 + β1 t + ϵ t , t = 1, 2,… , T

Where

Yt = the value of the time series at time t (the trend value at time t)

β0 = the y-intercept term

β1 = the slope coefficient

t = time, the independent (explanatory) variable

ϵt = a random error term (shock), which is white noise (ϵt ∼ WN(0, σ²))

From the equation above, β0 + β1t predicts Yt at any time t. The slope β1 is described as the trend coefficient. We estimate both parameters β0 and β1 using ordinary least squares, and denote the estimates β̂0 and β̂1, respectively.

T he mean of the linear time series is:

E(Y t) = β0 + β1t

On a graph, a linear trend appears as a straight line angled diagonally up or down.

Estimation of the Trend Value Under Linear Trend Models

Using the estimated coefficients, we can predict the value of the dependent variable at any time t = 1, 2, …, T. For instance, the trend value at time 2 is Ŷ2 = β̂0 + β̂1(2). We can also forecast the value of the time series outside the sample period, that is, at T + 1. The predicted value of Yt at time T + 1 is ŶT+1 = β̂0 + β̂1(T + 1).

Example: Calculating the Trend Value

A linear trend is defined to be Y t = 17.5 + 0.65t . What is the trend projection for time 10?

Solution

We substitute t = 10:

Ŷ10 = 17.5 + 0.65 × 10 = 24.0
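The trend fit and projection can be reproduced in a few lines. This is a sketch: the series below is synthetic, generated around the example's coefficients β0 = 17.5 and β1 = 0.65, so the OLS estimates only approximate them:

```python
import numpy as np

rng = np.random.default_rng(11)
t = np.arange(1, 41, dtype=float)
y = 17.5 + 0.65 * t + rng.normal(0, 0.5, t.size)   # trend plus white noise

A = np.column_stack([np.ones_like(t), t])          # design matrix [1, t]
(b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS trend estimates

trend_10 = 17.5 + 0.65 * 10                        # the example's projection
print(trend_10)        # 24.0
print(b0, b1)          # close to 17.5 and 0.65
```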

Disadvantages of Linear Trend Models

In a linear trend model, growth is constant in absolute terms, which might pose problems for economic and financial time series:

1. When the trend is positive, the implied growth rate decreases over time, since the constant increment becomes a smaller fraction of the growing level.

2. If the slope coefficient is less than 0, Yt will tend toward negative values, a situation that would not be plausible in most financial time series, e.g., asset prices and quantities.

Considering these limitations, we discuss the log-linear trend model, which has a constant growth rate rather than a constant absolute change.

Log-Linear Trend Models

Linear trend models are sometimes inadequate and leave serially correlated errors; for instance, for time series with exponential growth. The appropriate model for a time series with exponential growth is the log-linear trend model.

Log-linear trends are those in which the variable changes by an increasing or decreasing absolute amount each period (a constant proportional rate), rather than by a constant absolute amount as in linear trends.

Assume that the time series is defined as:

Y t = eβ0+β1 t, t = 1, 2, … , T

Which also can be written as (by taking the natural logarithms on both sides):

ln Y t = β0 + β1 t, t = 1, 2,… , T

By exponential growth, we mean growth at a constant rate with continuous compounding. This can be seen as follows: Using the time series formula above, the values of the time series at times 1 and 2 are Y_1 = e^{β0+β1(1)} and Y_2 = e^{β0+β1(2)}. The ratio Y_2/Y_1 is given by:

Y_2/Y_1 = e^{β0+β1(2)} / e^{β0+β1(1)} = e^{β1}

Similarly, the value of the time series at time t is Y_t = e^{β0+β1 t}, and at t+1 we have Y_{t+1} = e^{β0+β1(t+1)}. This implies that the ratio:

Y_{t+1}/Y_t = e^{β0+β1(t+1)} / e^{β0+β1 t} = e^{β1}

If we take the natural logarithm on both sides of the above equation we have:

ln(Y_{t+1}/Y_t) = ln Y_{t+1} − ln Y_t = β1

The log-linear model implies that:

E(ln Y t+1 − lnY t ) = β1

From the above results, the proportional growth of the time series over any two consecutive periods is the same. That is:

(y_{t+1} − y_t)/y_t = y_{t+1}/y_t − 1 = e^{β1} − 1

Example: Calculating the Trend Value of a Log-Linear Trend Time Series

An investment analyst wants to fit the weekly sales (in millions) of his company by using the sales

data from Jan 2016 to Feb 2018. The regression equation is defined as:

lnY t = 5.1062 + 0.0443t, t = 1, 2,… , 100

What is the trend estimated value of the sales in the 80th week?

Solution

From the regression equation, β^0 = 5.1062 and β^1 = 0.0443. We know that, under log-linear trend

models, the predicted trend value is given by:

Ŷ_t = e^{β̂0 + β̂1 t}

⇒ Ŷ_80 = e^{5.1062+0.0443×80} = 5711.29 million
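The log-linear forecast above can be reproduced with a short sketch; the coefficients are those quoted in the example:

```python
import math

# Fitted log-linear trend from the example: ln(Y_t) = 5.1062 + 0.0443*t.
b0, b1 = 5.1062, 0.0443

def loglinear_trend_value(t):
    """Trend value Y_t = exp(b0 + b1*t) implied by the fitted model."""
    return math.exp(b0 + b1 * t)

print(loglinear_trend_value(80))   # about 5711.29 (millions)
```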

Quadratic Time Trend

A polynomial-time trend can be defined as:

Y_t = β0 + β1 t + β2 t² + ⋯ + βm t^m + ϵ_t, t = 1, 2, …, T

Practically speaking, polynomial time trends are usually limited to the linear (discussed above) and the quadratic (second-degree) time trend. In a quadratic time trend, the parameters can be estimated using OLS. The estimated parameters are asymptotically normally distributed, and hence statistical inference using t-statistics and standard errors is valid only if the residuals ϵ_t are white noise.

The Log-Quadratic Time Trend

As the name suggests, this time trend is a mixture of the log-linear and the quadratic time trends. It is given by:

ln Y t = β0 + β1 t + β2 t2

It can be shown that the growth rate of the log-quadratic time trend is β1 + 2β2 t. This can be seen as follows:

The value of the time series at time t is Y_t = e^{β0+β1 t+β2 t²}, and at t+1 we have Y_{t+1} = e^{β0+β1(t+1)+β2(t+1)²}. This implies that the ratio:

Y_{t+1}/Y_t = e^{β0+β1(t+1)+β2(t+1)²} / e^{β0+β1 t+β2 t²} = e^{β1+β2(2t+1)} ≈ e^{β1+2β2 t}

Taking the natural log of this ratio gives the desired growth rate.

Example: Calculating the Growth Rate of Log-Quadratic Time Trend

The monthly real GDP of a country over 20 years can be modeled by the log-quadratic time series equation:

ln RGDP_t = 6.75 + 0.015t + 0.0000564t²

What is the growth rate of the real GDP of this country at the end of 20 years?

Solution

This is a log-quadratic time trend, whose growth rate is given by:

β1 + 2β2 t

From the given regression equation, we have β̂1 = 0.015 and β̂2 = 0.0000564, so that the growth rate is:

β1 + 2β2 t = 0.015 + 2 × 0.0000564 × 240 = 0.0421

Note that since the data is monthly, the end of 20 years corresponds to the 240th month (t = 240).

The coefficient of determination (R²) for a trend series is always high and tends to 100% as the sample size increases. Therefore, R² is not an appropriate measure of fit for trend series. Other alternatives, such as residual diagnostics, can be useful.

Seasonality

Seasonality is a feature of a time series in which the data undergoes regular and predictable changes

that recur every calendar year. For instance, gas consumption in the US rises during the winter and

falls during the summer.

Seasonal effects are observed within a calendar year (e.g., spikes in sales over Christmas), while cyclical effects span time periods shorter or longer than one calendar year (e.g., spikes in sales due to low unemployment rates).

Modeling Seasonal Time Series

Regression on seasonal dummies is an essential method of modeling seasonality. Assume that there are s seasons in a year. Then the pure annual dummy model is:

Y_t = β0 + γ1 D1t + γ2 D2t + ⋯ + γ_{s−1} D_{s−1,t} + ϵ_t = β0 + ∑_{j=1}^{s−1} γj Djt + ϵ_t

Djt is defined as:

Djt = 1 if t mod s = j, and 0 otherwise.

γj measures the difference between the mean of period j and that of period s (the omitted period).

Note that X mod Y is the remainder of X/Y. For instance, 9 mod 4 = 1.

The mean of the first period of the seasonality is:

E[Y_1] = β0 + γ1

And the mean of period 2 is:

E[Y_2] = β0 + γ2

In period s, all dummy variables are zero, so the mean at time s is:

E[Y_s] = β0

The parameters of the seasonality model are estimated using OLS by regressing Y_t on a constant and the s−1 dummy variables.
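The dummy-variable regression above can be sketched in Python; the simulated quarterly series and the true parameters below are assumptions chosen for illustration:

```python
import numpy as np

# Quarterly series (s = 4) with deterministic seasonality:
# true beta0 = 2.0 (the omitted season s) and gamma = (1.5, -0.5, 0.75).
rng = np.random.default_rng(0)
s, T = 4, 400
t = np.arange(1, T + 1)
gamma = np.array([1.5, -0.5, 0.75])
y = 2.0 + sum(gamma[j - 1] * (t % s == j) for j in range(1, s))
y = y + rng.normal(0.0, 0.3, size=T)

# Regress y on a constant and s-1 dummies D_jt = 1{t mod s = j}.
X = np.column_stack([np.ones(T)] + [(t % s == j).astype(float) for j in range(1, s)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # approximately [2.0, 1.5, -0.5, 0.75]
```

The OLS coefficients recover β0 and the γj up to sampling noise, matching the interpretation of γj as the shift of the mean of season j relative to the omitted season.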

Combination of Stationary and Non-Stationary Time Series

Time trends and seasonalities can be insufficient to explain an economic time series, since their residuals might not be white noise. When the detrended (or deseasonalized) series appears stationary but the residuals are not white noise, we can add stationary time series components (such as AR and MA terms) to reflect the remaining dynamics of the series.

Consider the following linear time trend.

Y t = β0 + β1 t + ϵ t

If the residuals are not white noise but the time series appears to be stationary, we can include an

AR term to make the model’s residuals white noise:

Y t = β0 + β1 t + δ1 Y t−1 + ϵ t

We can also add the seasonal component (if it exists):

Y_t = β0 + β1 t + ∑_{j=1}^{s−1} γj Djt + δ1 Y_{t−1} + ϵ_t

Note that the AR component reflects the cyclicality of the time series, while γj measures the shift of the mean from the trend growth β1 t. However, combinations of time series components do not always lead to a model with the required dynamics. For instance, the Ljung-Box statistic may still suggest rejection of the null hypothesis that the residuals are white noise.

Unit Roots and Random Walks

A random walk is a time series in which the value of the series in one period is equivalent to the

value of the series in the previous period plus the unforeseeable random error. A random walk can

be defined as follows:

Let

Y t = Y t−1 + ϵ t

Intuitively,

Y t−1 = Y t−2 + ϵ t−1

If we substitute Y t−1 in the first equation, we get,

Y t = (Y t−2 + ϵ t−1) + ϵ t

Continuing this process, it follows that a random walk is given by:

Y_t = Y_0 + ∑_{i=1}^{t} ϵ_i

The random walk equation is a particular case of an AR(1) model with β0 = 0 and β1 = 1. Thus, we cannot use standard regression techniques to estimate such an AR(1), because a random walk has neither a finite mean-reverting level nor a finite variance. Recall that if Y_t has a mean-reverting level, then Y_t = β0 + β1 Y_t, so the level is β0/(1 − β1). However, in a random walk, β0 = 0 and β1 = 1, so β0/(1 − β1) = 0/0, which is undefined.

The variance of a random walk is given by:

V(Y_t) = tσ²

The implication of this variance, which grows without bound, is that we are unable to use standard regression analysis on a time series that appears to be a random walk.
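The V(Y_t) = tσ² result can be checked by simulation; the number of paths and the horizon below are arbitrary illustrative choices:

```python
import numpy as np

# Simulate many random walks Y_t = Y_{t-1} + eps_t (Y_0 = 0, sigma = 1) and
# check that the cross-sectional variance at time t is roughly t * sigma^2.
rng = np.random.default_rng(7)
n_paths, horizon = 20000, 100
eps = rng.normal(0.0, 1.0, size=(n_paths, horizon))
paths = eps.cumsum(axis=1)        # Y_t is the running sum of the shocks

var_50 = paths[:, 49].var()       # variance across paths at t = 50
var_100 = paths[:, 99].var()      # variance across paths at t = 100
print(var_50, var_100)            # close to 50 and 100
```

The variance keeps growing linearly with t, illustrating why the process has no finite long-run variance.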

Unit Roots

We have so far been discussing random walks without a drift, for which the current value is the best predictor of the time series in the next period. A random walk with a drift is a time series that increases or decreases by a constant expected amount in each period. It is mathematically described as:

Y t = β0 + β1Y t−1 + ϵ t

β0 ≠ 0, β1 = 1

Or

Y t = β0 + Y t−1 + ϵ t

Where ϵ t ∼ WN(0, σ 2)

Recall that β1 = 1 implies an undefined mean-reversion level and hence non-stationarity. Therefore, we cannot use the AR model to analyze the time series unless we transform it by taking the first difference:

ΔY_t = Y_t − Y_{t−1} = β0 + ϵ_t

Which is covariance stationary.

The unit root test applies random walk concepts to determine whether a time series is nonstationary, by focusing on the slope coefficient in the random-walk-with-a-drift case of the AR(1) model. This test is popularly known as the Dickey-Fuller test.

The Unit Root Problem

Consider an AR(1) model. If the time series originates from an AR(1) model, then the time series is covariance stationary if the absolute value of the lag coefficient β1 is less than 1, that is, |β1| < 1. Therefore, we cannot depend on the statistical results if the lag coefficient is greater than or equal to 1 in absolute value (|β1| ≥ 1).

When the lag coefficient is precisely equal to 1, then the time series is said to have a unit root. In

other words, the time-series is a random walk and hence not covariance stationary.

The unit root problem can also be expressed using the lag polynomial. Let ψ(L) be the full lag polynomial, which can be factorized into the unit root lag, denoted (1 − L), and the remainder lag polynomial ϕ(L), which is the characteristic lag polynomial of a stationary time series. Moreover, let θ(L)ϵ_t be an MA component. Thus, the unit root process can be described as:

ψ(L)Y t = θ(L)ϵ t

T his can be factorized into:

(1 − L)ϕ(L)Y_t = θ(L)ϵ_t

Example: Checking for Unit Roots using the Lag Polynomials

An AR(2) model is given by Y t = 1.7Y t−1 − 0.7Y t−2 + ϵ t . Does the process contain a unit root?

Solution

If we rearrange the equation:

Y t − 1.7Y t−1 + 0.7Y t−2 = ϵ t

Using the definition of a lag polynomial, we can write the above equation as:

(1 − 1.7L + 0.7L 2 )Y t = ϵ t

The lag polynomial on the left-hand side is quadratic in L and can be factorized. So,

(1 − L)(1 − 0.7L)Y t = ϵ t

Therefore, the process has a unit root due to the presence of the unit root lag operator (1 − L).
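The same check can be automated by finding the roots of the lag polynomial numerically; a short sketch using `numpy.roots` on the AR(2) example above:

```python
import numpy as np

# Unit-root check for Y_t = 1.7*Y_{t-1} - 0.7*Y_{t-2} + eps_t. The lag
# polynomial is psi(L) = 1 - 1.7L + 0.7L^2; a unit root exists iff L = 1
# is a root. np.roots takes coefficients from the highest power down.
roots = np.sort(np.real(np.roots([0.7, -1.7, 1.0])))
print(roots)   # roots at L = 1 and L = 1/0.7 ≈ 1.4286

has_unit_root = bool(np.any(np.isclose(roots, 1.0)))
print(has_unit_root)   # True
```

The second root, 1/0.7, lies outside the unit circle and corresponds to the stationary factor (1 − 0.7L); the root at exactly 1 is the unit root.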

Challenges of Modeling Time Series Containing Unit Roots

1. A unit root process does not have a mean-reverting level. Recall that a stationary time series does mean revert; that is, its long-run mean can be estimated.

2. In a time series with a unit root, spotting spurious relationships is a problem. A spurious correlation arises when there is no meaningful link between the time series, yet regression analysis produces significant parameter estimates.

3. The parameter estimators in ARMA time series with a unit root follow the Dickey-Fuller (DF) distribution, which is asymmetric, depends on the sample size, and has critical values that depend on whether time trends have been incorporated. These characteristics make it difficult to carry out sound statistical inference and model selection when fitting the models.

Transformation of Time Series with Unit Roots

If the time series seems to have a unit root, the best approach is to model the first-differenced series as an autoregressive time series, which can be effectively analyzed using regression analysis.

Recall that a random walk with a drift is a form of AR(1) model given by:

Y_t = β0 + Y_{t−1} + ϵ_t,

Where ϵ t ∼ WN(0, σ 2)

Clearly, β1 = 1 implies that the time series has an undefined mean-reversion level and is hence non-stationary. Therefore, we cannot use the AR model to analyze the time series unless we transform it by taking the first difference:

ΔY_t = Y_t − Y_{t−1} ⇒ ΔY_t = β0 + ϵ_t

Where ϵ_t ∼ WN(0, σ²), and thus ΔY_t is covariance stationary.

Using lag polynomials, let ΔY_t = Y_t − Y_{t−1}, where Y_t has a unit root (implying that Y_t − Y_{t−1} does not have a unit root). Then:

(1 − L)ϕ(L)Y t = ϵt
ϕ(L)[(1 − L)Y t] = ϵt
ϕ(L)[(Y t − LY t)] = ϵt
ϕ(L)ΔY t = ϵt

Since the lag polynomial ϕ(L) is that of a stationary series, the time series defined by ΔY_t must be stationary.

Unit Root Test

The unit root test is done using the Augmented Dickey-Fuller (ADF) test. The test involves OLS estimation of the parameters, where the difference of the time series is regressed on the lagged level, appropriate deterministic terms, and lagged differences.

The ADF regression is given by:

ΔY_t = γY_{t−1} + (δ0 + δ1 t) + (λ1 ΔY_{t−1} + λ2 ΔY_{t−2} + ⋯ + λp ΔY_{t−p}) + ϵ_t

Where:

γY_{t−1} = the lagged level

δ0 + δ1 t = the deterministic terms

λ1 ΔY_{t−1} + λ2 ΔY_{t−2} + ⋯ + λp ΔY_{t−p} = the lagged differences

The ADF test statistic is the t-statistic of γ̂ (the estimate of γ).

To get the gist of this, assume that we are conducting an ADF test on a time series with the lagged level only:

ΔY_t = γY_{t−1} + ϵ_t

Intuitively, if the time series is a random walk, then:

Y t = Y t−1 + ϵ t

If we subtract Y t−1 on both sides we get:

Y t − Y t−1 = Y t−1 − Y t−1 + ϵ t


⇒ ΔY t = 0 × Y t−1 + ϵ t

Therefore, the time series is a random walk if γ = 0. This leads us to the hypothesis statement of the ADF test:

H0: γ = 0 (the time series is a random walk)

H1: γ < 0 (the time series is covariance stationary)

You should note this is a one-sided test, and thus the null hypothesis is not rejected if γ̂ > 0. Negative values of γ correspond to a stationary AR time series. For example, recall that the AR(1) model is given by:

Y t = β0 + β1Y t−1 + ϵ t

If we subtract Y t−1 from both sides of the AR(1) above we have:

Y t − Y t−1 = β0 + (β1 − 1)Y t−1 + ϵ t

Now let γ = (β1 − 1). Therefore,

ΔY t = β0 + γY t−1 + ϵ t

Clearly, if β1 = 1, then γ = 0. Therefore, γ = 0 is a test of β1 = 1. In other words, if an AR(1) model (with the difference of the time series as the dependent variable and the first lag as the independent variable) has γ = 0, then the series has a unit root and is nonstationary.
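The logic above can be sketched with a simple simulation. The regression of ΔY_t on a constant and Y_{t−1} below is an illustrative Dickey-Fuller-style regression (with no lagged differences and no proper DF critical values), not a full ADF test:

```python
import numpy as np

# gamma_hat is the OLS coefficient on the lagged level in
# dY_t = c + gamma*Y_{t-1} + e_t. For a stationary AR(1), gamma = beta1 - 1 < 0;
# for a random walk, gamma = 0.
rng = np.random.default_rng(1)
T = 2000
eps = rng.normal(size=T)

def gamma_hat(y):
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    return coef[1]

# Stationary AR(1): Y_t = 0.5*Y_{t-1} + eps_t, so gamma = -0.5.
ar1 = np.zeros(T)
for i in range(1, T):
    ar1[i] = 0.5 * ar1[i - 1] + eps[i]

walk = eps.cumsum()     # random walk: beta1 = 1, gamma = 0

print(gamma_hat(ar1))   # clearly negative, near -0.5
print(gamma_hat(walk))  # near 0
```

Note that under the null of a unit root the t-statistic of γ̂ does not follow the usual t distribution, which is exactly why the Dickey-Fuller critical values in the reading are needed.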

Implementing an ADF test on a time series requires making two choices: which deterministic terms to include, and the number of lags of the differenced data to use. The number of lags to include is simple to determine: it should be large enough to absorb any short-run dynamics in the difference ΔY_t.

The appropriate method of selecting the number of lagged differences is the AIC (which selects a relatively larger model than the BIC). The lag length should be set depending on the length of the time series and the sampling frequency.

The Dickey-Fuller distributions depend on the choice of deterministic terms included. The test can be run with no deterministic terms, with a constant, or with a constant and a trend. All other things equal, adding more deterministic terms reduces the chance of rejecting the null hypothesis when the time series does not have a unit root, and hence reduces the power of the ADF test. Therefore, only relevant deterministic terms should be included.

The recommended method of choosing appropriate deterministic terms is to include those that are significant at the 10% level. If the deterministic trend term is not significant at 10%, it is dropped and a constant deterministic term is used instead. If the constant is also insignificant, it can be dropped and the test rerun without any deterministic term. It is important to note that the majority of macroeconomic time series require the use of the constant.

If the null of the ADF test cannot be rejected, the series should be differenced and the test rerun to make sure the differenced time series is stationary. If this is repeated (double differencing) and the time series is still non-stationary, then other transformations of the data, such as taking the natural log (if the time series is always positive), might be required.

Example: Conducting the ADF Test

A financial analyst wishes to conduct an ADF test on the log of 20-year real GDP from 1999 to 2019.

The results of the tests are shown below:

Deterministic      γ          δ0         δ1      Lags   5% CV    1% CV
None            −0.004                            8    −1.940   −2.570
               (−1.665)
Constant        −0.008      0.010                 4    −2.860   −3.445
               (−1.422)    (1.025)
Trend           −0.084      0.188                 3    −3.420   −3.984
               (−4.376)   (−4.110)

The ADF output reports the results for the different deterministic specifications (first column); the last three columns indicate the number of lags selected by the AIC and the 5% and 1% critical values appropriate to the underlying sample size and deterministic terms. The quantities in parentheses (below the parameter estimates) are the test statistics.

Determine whether the time series contains a unit root.

Solution

The hypothesis statement of the ADF test is:

H0: γ = 0 (the time series is a random walk)

H1: γ < 0 (the time series is covariance stationary)

We begin by choosing the appropriate model. The deterministic trend term is significant at the 10% level, and the trend model's test statistic for γ is greater in absolute value than the critical values at both the 5% and 1% significance levels; thus, we choose the model with the trend deterministic term.

Therefore, for this model, the null hypothesis is rejected at the 99% confidence level since |−4.376| > |−3.984|. Note that the null hypothesis is also rejected at the 95% confidence level. Moreover, if the constant-only or no-deterministic model had been chosen, the null hypothesis would fail to be rejected. This reiterates the importance of choosing an appropriate model.

The Seasonal Differencing

Seasonal differencing is an alternative method of modeling a seasonal time series with a unit root. It is done by subtracting the value in the same period of the previous year, which removes the deterministic seasonalities, the unit root, and the time trend.

Consider the following quarterly time series with deterministic seasonalities and non-zero growth

rate:

Y t = β0 + β1t + γ1 D1t + γ2D2t + γ3 D3t + ϵ t

Where ϵ t ∼ WN(0, σ 2).

Denote the seasonal difference Δ4 Y_t = Y_t − Y_{t−4}. Then:

⇒ Δ4Y_t = (β0 + β1 t + γ1 D1t + γ2 D2t + γ3 D3t + ϵ_t) − (β0 + β1(t − 4) + γ1 D1,t−4 + γ2 D2,t−4 + γ3 D3,t−4 + ϵ_{t−4})

= β1(t − (t − 4)) + [γ1(D1t − D1,t−4) + γ2(D2t − D2,t−4) + γ3(D3t − D3,t−4)] + ϵ_t − ϵ_{t−4}

But

γj(Djt − Dj,t−4) = 0

because Djt = Dj,t−4 for each j (the seasonal dummies repeat every four quarters). So:

⇒ Δ4Y_t = β1(t − (t − 4)) + ϵ_t − ϵ_{t−4}

Therefore,

Δ4 Y t = 4β1 + ϵ t − ϵ t−4

Intuitively, this is an MA(1) model, which is covariance stationary. The seasonally differenced time series is interpreted as the year-to-year change in Y_t, or the year-to-year growth in the case of a logged time series.
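A small simulation of the Δ4 differencing above; the trend, seasonal pattern, and noise level are assumptions chosen for illustration:

```python
import numpy as np

# Quarterly series with a linear trend (beta1 = 0.1) and deterministic
# quarterly seasonality; seasonal differencing D4 Y_t = Y_t - Y_{t-4}
# should remove both and leave roughly 4*beta1 plus an MA(1)-type error.
rng = np.random.default_rng(3)
T = 200
t = np.arange(1, T + 1)
season = np.array([1.0, -0.5, 0.25, 0.0])[(t - 1) % 4]   # repeats every 4 quarters
y = 5.0 + 0.1 * t + season + rng.normal(0.0, 0.2, size=T)

d4 = y[4:] - y[:-4]     # seasonal difference
print(d4.mean())        # close to 4 * beta1 = 0.4
```

The differenced series fluctuates around 4β1 = 0.4 with no remaining seasonal pattern, matching Δ4 Y_t = 4β1 + ϵ_t − ϵ_{t−4}.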

Spurious Regression

Spurious regression is a type of regression that gives misleading statistical evidence of a linear relationship between independent non-stationary variables. This is a problem in time series analysis, but it can be avoided by making sure each of the time series in question is stationary, using methods such as first differencing and log transformation (in case the time series is positive).

Condition for Differencing in Time Series

Practically, many financial and economic time series are plausibly persistent but stationary. Therefore, differencing is only required when there is clear evidence of a unit root in the time series. Moreover, when it is difficult to determine whether a time series is stationary or not, it is good statistical practice to build models in both levels and differences.

For example, suppose we wish to model the interest rate on government bonds using an AR(3) model. The AR(3) is estimated on the levels, and the differences (if we assume the existence of a unit root) are modeled by an AR(2), since the AR order is reduced by one due to differencing. Considering models in both levels and differences allows us to choose the best model when the time series is highly persistent.

Forecasting

Forecasting with non-stationary time series is analogous to forecasting with stationary time series; that is, the h-step-ahead forecast made at time T is the expected value of Y_{T+h}.

Consider a linear time trend:

Y_T = β0 + β1 T + ϵ_T

Intuitively,

Y_{T+h} = β0 + β1(T + h) + ϵ_{T+h}

Taking the expectation, we get:

E_T(Y_{T+h}) = E_T(β0) + E_T(β1(T + h)) + E_T(ϵ_{T+h})

⇒ E_T(Y_{T+h}) = β0 + β1(T + h)

This is true because both β0 and β1(T + h) are constants, while E_T(ϵ_{T+h}) = 0 since ϵ_{T+h} ∼ WN(0, σ²).

Forecasting in Seasonal Time Series

Recall that a seasonal time series can be modeled using dummy variables. Consequently, we need to track the period of the forecast we desire. The annual dummy model is given by:

Y_t = β0 + ∑_{j=1}^{s−1} γj Djt + ϵ_t

The one-step forecast is:

E_T(Y_{T+1}) = β0 + γj

Where j = (T + 1) mod s is the period being forecast; the coefficient γj on the omitted period is 0.

For instance, for a quarterly seasonal time series that excludes the dummy variable for the fourth quarter (Q4), the one-step forecast made at T = 116 is given by:

E_T(Y_{T+1}) = β0 + γ_{(116+1) mod 4} = β0 + γ1

Therefore, the h-step-ahead forecast is obtained by tracking the period of T + h, so that:

E_T(Y_{T+h}) = β0 + γj

Where j = (T + h) mod s.
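The period-tracking rule j = (T + h) mod s can be sketched as follows; the coefficient values and the helper function name are illustrative assumptions:

```python
# Season tracking for the h-step-ahead forecast: j = (T + h) mod s selects
# which gamma to add to beta0; the gamma on the omitted season is 0.
s = 4
beta0 = -10.42
gamma = {0: 0.0, 1: 6.25, 2: 50.52, 3: 10.25}   # gamma[0] is the omitted season

def seasonal_forecast(T, h):
    """E_T(Y_{T+h}) = beta0 + gamma_j with j = (T + h) mod s."""
    j = (T + h) % s
    return beta0 + gamma[j]

print(seasonal_forecast(116, 1))   # j = 117 mod 4 = 1, so beta0 + gamma1 = -4.17
```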

Forecasting in Log Models

Under the log model, you should note that:

E(Y_{T+h}) ≠ e^{E(ln Y_{T+h})}

If the residuals are Gaussian white noise, that is, ϵ_t ∼ iid N(0, σ²), then the properties of the lognormal distribution can be used for forecasting. If X ∼ N(μ, σ²) and we define W = e^X, then W is lognormally distributed with parameters (μ, σ²). Recall that the mean of a lognormal distribution is given by:

E(W) = e^{μ + σ²/2}

Using this analogy, for a log-linear time trend model:

ln Y_{T+h} = β0 + β1(T + h) + ϵ_{T+h}

The forecast of the log at time T+h is:

E_T(ln Y_{T+h}) = β0 + β1(T + h)

The variance of the shock is σ², so that:

ln Y_{T+h} ∼ N(β0 + β1(T + h), σ²)

Thus,

E_T(Y_{T+h}) = e^{β0 + β1(T + h) + σ²/2}
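The lognormal convexity adjustment above can be sketched as follows; the coefficients, shock volatility, and function names are illustrative assumptions:

```python
import math

# Level forecast from a log-linear model: if ln(Y_{T+h}) ~ N(b0 + b1*(T+h), s2),
# then E_T(Y_{T+h}) = exp(b0 + b1*(T+h) + s2/2), not exp of the log forecast.
b0, b1, sigma = 5.1062, 0.0443, 0.10

def naive_forecast(t):
    return math.exp(b0 + b1 * t)              # ignores the shock variance

def adjusted_forecast(t):
    return math.exp(b0 + b1 * t + sigma ** 2 / 2)

print(naive_forecast(80), adjusted_forecast(80))
```

The adjusted forecast exceeds the naive one by the factor e^{σ²/2}; for small σ the correction is tiny, but it grows with the shock variance.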

Forecasting Confidence Intervals

Confidence intervals are constructed to reflect the uncertainty of the forecasted value. The confidence interval depends on the variance of the forecast error, which is defined as:

ϵ T+h = Y T+h − ET (Y T+h )

i.e., it is the difference between the actual value and the forecasted value.

Consider the linear time trend model:

Y T+h = β0 + β1(T + h) + ϵ T+h

Clearly,

ET (Y T+h ) = β0 + β1(T + h)

And the forecast error is ϵ T+h

If we wish to construct a 95% confidence interval, given that the forecast error is Gaussian white

noise, then the confidence interval is given by:

ET (Y T+h ) ± 1.96σ

σ is not known and thus is estimated by the sample standard deviation of the forecast errors (residuals).

Intuitively, the confidence interval for any model can be computed from that model's individual forecast error ϵ_{T+h} = Y_{T+h} − E_T(Y_{T+h}).

Example: Forecasting and Forecasting Confidence Intervals

A linear time trend model is estimated on annual government bond interest rates from the year 2000

to 2020. T he model’s equation is given by:

R_t = 0.25 + 0.0000154t + ϵ̂_t

The standard deviation of the forecast error is estimated to be σ̂ = 0.0245. What is the 95% confidence interval for the second year if the forecast errors (residuals) are Gaussian white noise?

(Note that for the first time period t=2000 and the last time period is t=2020)

Solution

The second year starting from 2000 is 2002. So:

E_T(R_2002) = 0.25 + 0.0000154 × 2002 = 0.2808308

The 95% confidence interval is given by:

ET (Y T+h ) ± 1.96σ
= 0.28083 ± 1.96 × 0.0245
= [0.2328108, 0.3288508]

So the 95% confidence interval for the interest rate is between 23.28% and 32.89%.
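The interval calculation can be reproduced as follows, using the coefficients implied by the worked solution:

```python
# 95% forecast interval for the linear trend example, assuming Gaussian
# white-noise forecast errors.
beta0, beta1, sigma = 0.25, 0.0000154, 0.0245

point = beta0 + beta1 * 2002              # E_T(R_2002)
lo, hi = point - 1.96 * sigma, point + 1.96 * sigma
print(point, lo, hi)                      # 0.2808308, 0.2328108, 0.3288508
```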

Question 1

T he seasonal dummy model is generated on the quarterly growth rates of mortgages. T he

model is given by:

Y_t = β0 + ∑_{j=1}^{s−1} γj Djt + ϵ_t

The estimated parameters are γ̂1 = 6.25, γ̂2 = 50.52, γ̂3 = 10.25, and β̂0 = −10.42, using

the data up to the end of 2019. What is the forecasted value of the growth rate of the

mortgages in the second quarter of 2020?

A. 40.10

B. 34.56

C. 43.56

D. 36.90

The correct answer i s A.

We need to define the set of dummy variables:

Djt = 1 for Q2, and 0 for Q1, Q3, and Q4

So,

E(Ŷ_Q2) = β0 + ∑_{j=1}^{3} γj Djt = −10.42 + 0 × 6.25 + 1 × 50.52 + 0 × 10.25 = 40.1

Question 2

A mortgage analyst produced a model to predict housing starts (given in thousands) within California in the US. The time series model contains both a trend and a seasonal component and is given by the following:

component and is given by the following:

Y t = 0.2t + 15.5 + 4.0 × D2t + 6.4 × D3t + 0.5 × D4t

The trend component is reflected in the time variable (t), where t is measured in months, and the seasons are defined as follows:

Season Months Dummy


Winter December, January, and February
Spring March, April, and May D2t
Summer June, July, and August D3t
Fall September, October, and November D4t

The model started in April 2019; for example, Y_{T+1} refers to May 2019.

What does the model predict for March 2020?

A. 21,700 housing starts

B. 22,500 housing starts

C. 24,300 housing starts

D. 20,225 housing starts

The correct answer i s A.

The model is given as:

Y t = 0.2t + 15.5 + 4.0 × D2t + 6.4 × D3t + 0.5 × D4t

Important: Since we have three dummies and an intercept, quarterly seasonality is reflected by the intercept (15.5) plus the three seasonal dummy variables (D2t, D3t, and D4t).

If Y_{T+1} = May 2019, then March 2020 = Y_{T+11}.

Finally, note that March falls under D2t:

Y_{T+11} = 0.20 × 11 + 15.5 + 4.0 × 1 = 21.7

Thus, the model predicts 21,700 housing starts in March 2020.

Reading 23: Measuring Return, Volatility, and Correlation

After completing this reading, you should be able to:

Calculate, distinguish, and convert between simple and continuously compounded returns.

Define and distinguish between volatility, variance rate, and implied volatility.

Describe how the first two moments may be insufficient to describe non-normal

distributions.

Explain how the Jarque-Bera test is used to determine whether returns are normally

distributed.

Describe the power law and its use for non-normal distributions.

Define correlation and covariance and differentiate between correlation and dependence.

Describe properties of correlations between normally distributed variables when using a

one-factor model.

Measurement of Returns

A return is the profit from an investment. Two common methods used to measure returns are:

1. Simple Returns Method

2. Continuously Compounded Returns Method.

The Simple Returns Method

Denoted R_t, the simple return is given by:

R_t = (P_t − P_{t−1}) / P_{t−1}

Where

P_t = price of the asset at time t (current time)

P_{t−1} = price of the asset at time t−1 (past time)

The time scale is arbitrary and may be a shorter period, such as a month or a quarter. Under the simple returns method, the gross return over multiple periods is the product of the gross simple returns in each period. Mathematically:

1 + R_T = ∏_{t=1}^{T} (1 + R_t)

⇒ R_T = (∏_{t=1}^{T} (1 + R_t)) − 1

Example: Calculating the Simple Returns

Consider the following data.

T ime Price
0 100
1 98.65
2 98.50
3 97.50
4 95.67
5 96.54

Calculate the simple return based on the data for all periods.

Solution

We need to calculate the simple return over multiple periods, which is given by:

1 + R_T = ∏_{t=1}^{T} (1 + R_t)

Consider the following table:

T ime Price Rt 1 + Rt
0 100 − −
1 98.65 −0.0135 0.9865
2 98.50 −0.00152 0.998479
3 97.50 −0.01015 0.989848
4 95.67 −0.01877 0.981231
5 96.54 0.009094 1.009094
Product 0.9654

Note that R_t = (P_t − P_{t−1}) / P_{t−1}, so that:

R_1 = (P_1 − P_0)/P_0 = (98.65 − 100)/100 = −0.0135

And:

R_2 = (P_2 − P_1)/P_1 = (98.50 − 98.65)/98.65 = −0.00152

And so on.

Also note that:

∏_{t=1}^{5} (1 + R_t) = 0.9865 × 0.998479 × … × 1.009094 = 0.9654

So,

1 + R T = 0.9654 ⇒ R T = −0.0346 = −3.46%
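The multi-period simple return calculation can be reproduced directly from the price series in the example:

```python
import numpy as np

# Multi-period simple return: the product of per-period gross returns.
prices = np.array([100.0, 98.65, 98.50, 97.50, 95.67, 96.54])
simple = prices[1:] / prices[:-1] - 1     # R_t = (P_t - P_{t-1}) / P_{t-1}
total = np.prod(1 + simple) - 1
print(total)                              # about -0.0346, i.e. -3.46%
```

The product of the gross returns telescopes to P_5/P_0, so the total return is simply 96.54/100 − 1.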

Continuously Compounded Returns Method

Denoted r_t, the continuously compounded return is the difference between the natural logarithms of the asset price at times t and t−1. It is given by:

rt = ln P t − lnP t−1

Computing the compounded return over multiple periods is easy because it is just the sum of the returns of each period. That is:

r_T = ∑_{t=1}^{T} r_t

Example: Calculating Continuously Compounded Returns

Consider the following data.

T ime Price
0 100
1 98.65
2 98.50
3 97.50
4 95.67
5 96.54

What is the continuously compounded return based on the data over all periods?

Solution

The continuously compounded return over multiple periods is given by:

r_T = ∑_{t=1}^{T} r_t

Where

rt = ln P t − ln P t−1

Consider the following table:

T ime Price rt = ln P t − ln P t−1
0 100 −
1 98.65 −0.01359
2 98.50 −0.00152
3 97.50 −0.0102
4 95.67 −0.01895
5 96.54 0.009053
Sum −0.03521

Note that

r1 = ln P 1 − ln P 0 = ln 98.65 − ln 100 = −0.01359


r2 = ln P 2 − ln P 1 = ln 98.50 − ln 98.65 = −0.00152

And so on.

Also,

r_T = ∑_{t=1}^{5} r_t = −0.01359 + (−0.00152) + ⋯ + 0.009053 = −0.03521 = −3.521%

Relationship between the Compounded and Simple Returns

Intuitively, the compounded return is an approximation of the simple return. The approximation, however, is prone to significant error over longer time horizons, and thus it is suitable only for short time horizons. The exact relationship between compounded and simple returns is given by the formula:

1 + R_t = e^{r_t}

Example: Conversion Between the Simple and Compound Returns

What is the equivalent simple return for a 30% continuously compounded return?

Solution.

Using the formula:

1 + R t = ert
⇒ R t = ert − 1 = e0.3 − 1 = 0.3499 = 34.99%

It is worth noting that the compound return is always less than the corresponding simple return. Moreover, simple returns are never less than −100%, unlike compound returns, which can be below −100%. For instance, the equivalent compound return for a −65% simple return is:

r_t = ln(1 − 0.65) = −104.98%
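The conversion formulas above can be sketched as two small helper functions (the function names are hypothetical):

```python
import math

# Conversions between simple (R) and continuously compounded (r) returns,
# using the exact relationship 1 + R = exp(r).
def simple_to_compound(R):
    return math.log(1 + R)

def compound_to_simple(r):
    return math.exp(r) - 1

print(compound_to_simple(0.30))    # about 0.3499: a 30% compound return
print(simple_to_compound(-0.65))   # about -1.0498: below -100% is possible
```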

Measurement of Volatility and Risk

The volatility of a variable, denoted σ, is the standard deviation of returns. The standard deviation of returns measures the volatility of the return over the time period at which it is captured. Consider the linear scaling of the mean and variance over the period at which the returns are measured. The model is given by:

rt = μ + σet

Where E(r_t) = μ is the mean of the return and V(r_t) = σ² is its variance. e_t is the shock, assumed to be iid with mean 0 and variance 1. Moreover, the return is assumed to be iid and normally distributed with mean μ and variance σ², i.e., r_t ∼ iid N(μ, σ²). Note that the shock can also be expressed as ϵ_t = σe_t, where ϵ_t ∼ N(0, σ²).

Assume that we wish to calculate the return under this model over 10 working days (two weeks). Since the model deals with compounded returns, we have:

∑_{i=1}^{10} r_{t+i} = ∑_{i=1}^{10} (μ + σe_{t+i}) = 10μ + σ ∑_{i=1}^{10} e_{t+i}

So the mean of the return over the 10 days is 10μ, and the variance is 10σ², since the e_t are iid. The volatility of the return is, therefore:

√10 σ

T herefore, the variance and the mean of return are scaled to the holding period while the volatility is

scaled to the square root of the holding period. T his feature allows us to convert volatility between

different periods.

For instance, given daily volatility, we can obtain the yearly (annualized) volatility by scaling it by √252. That is:

σ_annual = √252 × σ_daily

Note that 252 is the conventional number of trading days in a year in most markets.

Example: Calculating the Annualized Volatility

T he monthly volatility of the price of gold is 4% in a given year. What is the annualized volatility of

the gold price?

Solution

Using the scaling analogy, the corresponding annualized volatility is given by:

σ_annual = √12 × 0.04 = 13.86%
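The square-root-of-time scaling rule can be sketched as follows (the helper name is an assumption for illustration):

```python
import math

def scale_volatility(vol, periods):
    """Rescale a per-period volatility over `periods` sub-periods:
    volatility scales with the square root of the holding period."""
    return math.sqrt(periods) * vol

monthly_vol = 0.04
annual_vol = scale_volatility(monthly_vol, 12)        # sqrt(12) * 4% ≈ 13.86%
daily_vol = 0.01
annual_from_daily = scale_volatility(daily_vol, 252)  # sqrt(252) * 1% ≈ 15.87%
print(round(annual_vol, 4), round(annual_from_daily, 4))
```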

Variance Rate

The variance rate, also termed the variance, is the square of volatility. Like the mean, the variance rate scales linearly with the holding period and hence can be converted between periods. For instance, the annual variance rate is obtained from the monthly variance rate as:

σ²_annual = 12 × σ²_monthly

The variance of returns can be approximated as:

σ̂² = (1/T) ∑_{t=1}^{T} (rt − μ̂)²

Where μ̂ is the sample mean of returns and T is the sample size.

Example: Calculating the Variance of Return

The investment returns of a certain entity for five consecutive days are 6%, 5%, 8%, 10%, and 11%.

What is the variance estimator of returns?

Solution

We start by calculating the sample mean:

μ̂ = (1/5)(0.06 + 0.05 + 0.08 + 0.10 + 0.11) = 0.08

So that the variance estimator is:

σ̂² = (1/T) ∑_{t=1}^{T} (rt − μ̂)²

= (1/5)[(0.06 − 0.08)² + (0.05 − 0.08)² + (0.08 − 0.08)² + (0.10 − 0.08)² + (0.11 − 0.08)²] = 0.00052 (i.e., 0.052%)

The Implied Volatility

Implied volatility is an alternative measure of volatility that is constructed using options valuation.

The options (both put and call) have payouts that are nonlinear functions of the price of the underlying asset. For instance, the payout from the put option is given by:

max(K − P_T, 0)

where P_T is the price of the underlying asset at maturity, K is the strike price, and T is the time to maturity.

Therefore, the payout from an option is sensitive to the variance of the return on the asset.

The Black-Scholes-Merton model is commonly used for option valuation. The model relates

the price of an option to the risk-free rate of interest, the current price of the underlying asset, the

strike price, time to maturity, and the variance of return.

For instance, the price of the call option can be denoted by:

Ct = f(rf , T , P t , σ 2)

Where:

rf = Risk-free rate of interest

T =T ime to maturity

P t =Current price of the underlying asset

σ 2=Variance of the return

The implied volatility σ is the value of volatility that, together with the other parameters, reproduces the observed market price of the option. Implied volatility is an annualized value and does not need to be converted further.
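As an illustration of how implied volatility can be extracted, the sketch below prices a call with the BSM formula and then inverts it for σ by bisection (the parameter values and function names are hypothetical; bisection is one of several root-finding methods that could be used, relying on the call price being increasing in σ):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bsm_call_price(s, k, r, t, sigma):
    """Black-Scholes-Merton price of a European call."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)

def implied_vol(price, s, k, r, t, lo=1e-4, hi=3.0, tol=1e-8):
    """Invert the BSM formula for sigma by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bsm_call_price(s, k, r, t, mid) < price:
            lo = mid        # model price too low -> need a larger volatility
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Price a call with sigma = 20%, then recover that volatility from the price.
c = bsm_call_price(100, 100, 0.05, 1.0, 0.20)   # ≈ 10.45
print(round(c, 2), round(implied_vol(c, 100, 100, 0.05, 1.0), 4))
```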

The volatility index (VIX) measures the implied volatility of the S&P 500 over the coming 30 calendar days. The VIX is constructed from a variety of options with different strike prices. Similar volatility indices exist for a variety of other assets such as gold, but they can only be constructed for highly liquid derivative markets and are thus not available for most financial assets.

The Financial Returns Distribution

Financial returns are often assumed to follow a normal distribution. A normal distribution is thin-tailed and has neither skewness nor excess kurtosis. This assumption is frequently violated in practice because many return series are both skewed and heavy-tailed.

To determine whether it is appropriate to assume that the asset returns are normally distributed, we

use the Jarque-Bera test.

The Jarque-Bera Test

The Jarque-Bera test assesses whether the skewness and kurtosis of returns are compatible with those of a normal distribution.

Denoting the skewness by S and kurtosis by k, the hypothesis statement of the Jarque-Bera test is

stated as:

H 0 : S = 0 and k=3 (the returns are normally distributed)

vs

H1: S ≠ 0 or k ≠ 3 (the returns are not normally distributed)

The test statistic (JB) is given by:

JB = (T − 1) (Ŝ²/6 + (k̂ − 3)²/24)

Where T is the sample size.

The basis of the test is that, under normality, the estimated skewness Ŝ is asymptotically normally distributed with mean 0 and variance 6/(T − 1), so that (T − 1)Ŝ²/6 is chi-squared distributed with one degree of freedom (χ²₁). Similarly, the estimated kurtosis k̂ is asymptotically normally distributed with mean 3 and variance 24/(T − 1), so that (T − 1)(k̂ − 3)²/24 is also a χ²₁ variable. Combining these results, and given that the two variables are independent:

JB ∼ χ²₂

The Decision Rule of the JB Test

When the test statistic is greater than the critical value, the null hypothesis is rejected; otherwise, we fail to reject the null hypothesis. We use the χ² table with 2 degrees of freedom:

Chi-square Distribution Table

d.f. .995 .99 .975 .95 .9 .1 .05 .025 .01
1 0.00 0.00 0.00 0.00 0.02 2.71 3.84 5.02 6.63
2 0.01 0.02 0.05 0.10 0.21 4.61 5.99 7.38 9.21
3 0.07 0.11 0.22 0.35 0.58 6.25 7.81 9.35 11.34
4 0.21 0.30 0.48 0.71 1.06 7.78 9.49 11.14 13.28
5 0.41 0.55 0.83 1.15 1.61 9.24 11.07 12.83 15.09
6 0.68 0.87 1.24 1.64 2.20 10.64 12.59 14.45 16.81
7 0.99 1.24 1.69 2.17 2.83 12.02 14.07 16.01 18.48
8 1.34 1.65 2.18 2.73 3.49 13.36 15.51 17.53 20.09
9 1.73 2.09 2.70 3.33 4.17 14.68 16.92 19.02 21.67
10 2.16 2.56 3.25 3.94 4.87 15.99 18.31 20.48 23.21
11 2.60 3.05 3.82 4.57 5.58 17.28 19.68 21.92 24.72
12 3.07 3.57 4.40 5.23 6.30 18.55 21.03 23.34 26.22

For example, the critical value of a χ²₂ at the 5% significance level is 5.991; thus, if the computed test statistic is greater than 5.991, the null hypothesis is rejected.

Example: Conducting a JB Test

Investment return is such that it has a skewness of 0.75 and a kurtosis of 3.15. If the sample size is

125, what is the JB test statistic? Does the data qualify to be normally distributed at a 95%

confidence level?

Solution

The test statistic is given by:

JB = (T − 1)(Ŝ²/6 + (k̂ − 3)²/24) = (125 − 1)(0.75²/6 + (3.15 − 3)²/24) = 11.74

Since the test statistic is greater than the 5% critical value (5.991), then the null hypothesis that the

data is normally distributed is rejected.
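The JB computation and decision for this example can be sketched as:

```python
def jarque_bera(skew, kurt, T):
    """JB statistic as defined in the notes: (T-1) * (S^2/6 + (k-3)^2/24)."""
    return (T - 1) * (skew ** 2 / 6 + (kurt - 3) ** 2 / 24)

jb = jarque_bera(0.75, 3.15, 125)    # ≈ 11.74
critical_5pct = 5.991                # chi-squared(2) critical value at 5%
print(round(jb, 2), jb > critical_5pct)  # statistic exceeds the critical value
```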

The Power Law

The power law is an alternative method of determining whether the returns are normal by studying the tails. For a normal distribution, the tail is thin, so the probability of a return greater than kσ decreases sharply as k increases. Other distributions have tails that decay relatively slowly for large deviations.

The power-law tails are such that the probability of observing a value greater than a given value x is defined as:

P(X > x) = k x^(−α)

Where k and α are constants.

The tail behavior of distributions is effectively compared by considering the natural log of the tail probability, ln(P(X > x)). From the above equation:

ln P(X > x) = ln k − α ln x

To test whether this equation holds, a graph of ln P(X > x) is plotted against ln x.

For a normal distribution, ln P(X > x) is quadratic in x, so the plot decays quickly, reflecting thin tails. For distributions such as the Student's t, the plot is approximately linear, so the tails decay slowly and are therefore fatter (producing values far from the mean).

Dependence and Correlation of Random Variables

The two random variables X and Y are said to be independent if their joint density function is equal to

the product of their marginal distributions. Formally stated:

f_{X,Y}(x, y) = f_X(x) ⋅ f_Y(y)

Otherwise, the random variables are said to be dependent. The dependence of random variables can

be linear or nonlinear.

The linear relationship between random variables is measured using a correlation estimator called Pearson's correlation.

Recall that given the linear equation:

Yi = α + βXi + ϵi

The slope β is related to the correlation coefficient ρ. That is, if β = 0, the random variables Xi and Yi are uncorrelated; otherwise, β ≠ 0. In fact, if the variances of the random variables are standardized to unity (σ²_X = σ²_Y = 1), the slope of the regression equation equals the correlation coefficient (β = ρ). Thus, the regression equation reflects how correlation measures linear dependence.

Nonlinear dependence is complex and thus cannot be summarized using a single statistic.

Measures of Correlation

Correlation is often measured using the rank correlation (Spearman's rank correlation) and Kendall's τ correlation coefficient. The values of these correlation coefficients lie between -1 and 1. A correlation of 0 indicates no association (though not necessarily independence); otherwise, a positive (negative) correlation indicates an increasing (decreasing) relationship between the random variables.

Rank Correlation

The rank correlation uses the ranks of observations of the random variables X and Y. That is, rank correlation depends on the linear relationship between the ranks rather than the random variables themselves.

The ranks are such that 1 is assigned to the smallest value, 2 to the next, and so on, until the largest value is assigned rank n.

When a rank repeats, the tied observations are each assigned the average of the rank positions they occupy. Consider the ranks 1, 2, 3, 3, 3, 4, 5, 6, 7, 7 assigned to ten observations. The three observations tied at rank 3 occupy positions 3, 4, and 5, so each receives the averaged rank (3 + 4 + 5)/3 = 4. The two observations tied at rank 7 occupy positions 9 and 10, so each receives (9 + 10)/2 = 9.5. Note that we average the rank positions the tied observations would have occupied had they not been tied. The new ranks are therefore: 1, 2, 4, 4, 4, 6, 7, 8, 9.5, 9.5.

Now, denote the rank of X by R_X and that of Y by R_Y; the rank correlation estimator is given by:

ρ̂s = Cov(R_X, R_Y) / (√V̂(R_X) √V̂(R_Y))

Alternatively, when all the ranks are distinct (no repeated ranks), the rank correlation estimator is

estimated as:

ρ̂s = 1 − [6 ∑_{i=1}^{n} (R_{Xi} − R_{Yi})²] / [n(n² − 1)]

The intuition behind the last formula is that when highly ranked values of X are paired with highly ranked values of Y, the differences R_{Xi} − R_{Yi} are small, and the correlation tends to 1. On the other hand, if small rank values of X are matched with large rank values of Y, the differences R_{Xi} − R_{Yi} are relatively large, and the correlation tends to -1.

When the variables X and Y have a linear relationship, the linear and rank correlations are equal. However, the rank correlation is inefficient compared to the linear correlation and is mainly used as a confirmatory check. On the other hand, the rank correlation is insensitive to outliers because it uses only the ranks, not the values, of X and Y.

Example: Calculating the Rank Correlation

Consider the following data.

i X Y
1 0.35 2.50
2 1.73 6.65
3 −0.45 −2.43
4 −0.56 −5.04
5 4.03 3.20
6 3.21 2.31

What is the value of rank correlation?

Solution

Consider the following table, where the ranks of each variable have been filled in, along with the squared rank differences.

i X Y R X R Y (R X − R Y )²
1 0.35 2.50 3 4 1
2 1.73 6.65 4 6 4
3 −0.45 −2.43 2 2 0
4 −0.56 −5.04 1 1 0
5 4.03 3.20 6 5 1
6 3.21 2.31 5 3 4
Sum 10

Since there are no repeated ranks, the rank correlation is given by:

ρ̂s = 1 − [6 ∑_{i=1}^{n} (R_{Xi} − R_{Yi})²] / [n(n² − 1)] = 1 − (6 × 10)/(6(6² − 1)) = 1 − 0.2857 = 0.7143

Kendall's Tau (τ)

Kendall's tau is a non-parametric measure of the relationship between two random variables, say X and Y. Kendall's τ compares the frequencies of concordant and discordant pairs.

Consider the pairs (Xi, Yi) and (Xj, Yj). For i ≠ j, the pairs are said to be concordant if the rankings of the components agree: Xi > Xj when Yi > Yj, or Xi < Xj when Yi < Yj. That is, they are concordant if they agree on the same directional position (consistent). When the pairs disagree, they are termed discordant. Note that ties are neither concordant nor discordant.

Intuitively, random variables with a high number of concordant pairs have a strong positive

correlation, while those with a high number of discordant pairs are negatively correlated.

Kendall's tau is defined as:

τ̂ = (nc − nd) / [n(n − 1)/2] = nc/(nc + nd + nt) − nd/(nc + nd + nt)

Where

nc =number of concordant pairs

nd=number of discordant pairs

nt =number of ties

It is easy to see that Kendall's tau is equivalent to the difference between the proportions of concordant and discordant pairs. Moreover, when all pairs are concordant, τ̂ = 1, and when all pairs are discordant, τ̂ = −1.

Example: Calculating the Kendall’s Tau

Consider the following data (same as the example above).

i X Y
1 0.35 2.50
2 1.73 6.65
3 −0.45 −2.43
4 −0.56 −5.04
5 4.03 3.20
6 3.21 2.31

What is Kendall’s τ correlation coefficient?

Solution

The first step is to rank each variable:

i X Y RX RY
1 0.35 2.50 3 4
2 1.73 6.65 4 6
3 −0.45 −2.43 2 2
4 −0.56 −5.04 1 1
5 4.03 3.20 6 5
6 3.21 2.31 5 3

Next, arrange the pairs in ascending order of R_X. For each row, the concordant count (C) is the number of Y-ranks below it that are greater than the given Y-rank, and the discordant count (D) is the number of Y-ranks below it that are smaller.

RX RY C D
1 1 5 0
2 2 4 0
3 4 2 1
4 6 0 2
5 3 1 0
6 5 − −
Total 12 3

Note that C = 4 in the second row because four of the Y-ranks below it (4, 6, 3, and 5) are greater than 2, while D = 0 because none of them is smaller than 2. In the fourth row, both Y-ranks below it (3 and 5) are smaller than 6, so C = 0 and D = 2. This is continued up to the second-last row, since there are no further ranks to compare.

So, nc = 12 and nd = 3.

⇒ τ̂ = (nc − nd)/[n(n − 1)/2] = (12 − 3)/[6(6 − 1)/2] = 9/15 = 0.6

Practice Question

Suppose that we know from experience that α = 3 for a particular financial variable, and

we observe that the probability that X > 10 is 0.04.

Determine the probability that X is greater than 20.

A. 125%

B. 0.5%

C. 4%

D. 0.1%

The correct answer is B.

From the given probability, we can get the value of constant k as follows:

P(X > x) = k x^(−α)

0.04 = k × 10^(−3)
⇒ k = 40

Thus,

P(X > 20) = 40 × 20^(−3) = 0.005 or 0.5%

Note: The power law provides an alternative to assuming a normal distribution.
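The calibration in this solution can be sketched as:

```python
# Power-law tail: P(X > x) = k * x**(-alpha)
alpha = 3
# Calibrate k from the observed tail probability P(X > 10) = 0.04:
k = 0.04 / 10 ** (-alpha)        # k = 40
# Extrapolate further into the tail:
p20 = k * 20 ** (-alpha)         # P(X > 20) = 0.005
print(k, p20)
```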

Reading 24: Simulation and Bootstrapping

After completing this reading, you should be able to:

Describe the basic steps to conduct a Monte Carlo simulation.

Describe ways to reduce the Monte Carlo sampling error.

Explain the use of antithetic and control variates in reducing Monte Carlo sampling error.

Describe the bootstrapping method and its advantage over the Monte Carlo simulation.

Describe pseudo-random number generation.

Describe situations where the bootstrapping method is ineffective.

Describe the disadvantages of the simulation approach to financial problem-solving.

Simulation is a way of modeling random events to match real-world outcomes. By observing simulated results, researchers gain insight into real problems. Examples of applications of simulation include the calculation of option payoffs and the assessment of an estimator's accuracy. Two common simulation methods are Monte Carlo simulation and bootstrapping.

Monte Carlo simulation approximates the expected value of a random variable using numerical methods. It generates random variables from an assumed data generating process (DGP) and then applies a function (or functions) to create realizations from the unknown distribution of the transformed random variables. This process is repeated (to improve accuracy), and the statistic of interest is then approximated using the simulated values.

Bootstrapping is a type of simulation that uses the observed data to simulate from the unknown distribution that generated them. In other words, bootstrapping combines the observed data and simulated values to create a new sample that is related to, but different from, the observed data.

The notable similarity between Monte Carlo simulation and bootstrapping is that both aim to calculate the expected value of a function using simulated data (often with a computer).

The contrasting feature is that Monte Carlo simulation relies entirely on an assumed data generating process (DGP) to simulate the data, whereas bootstrapping uses the observed data to generate the simulated data without specifying an underlying DGP.

Simulation of Random Variables

Simulation requires the generation of random variables from an assumed distribution, usually by computer. However, computer-generated numbers are not truly random and are thus termed pseudo-random numbers. Pseudo-random numbers are produced by complex deterministic functions (pseudo-random number generators, PRNGs) whose output appears random. The initial value supplied to a PRNG is termed the seed value; running the PRNG with the same seed reproduces exactly the same sequence of random variables.

The reproducibility of variables simulated from PRNGs makes it possible to reuse pseudo-random numbers across multiple experiments, because the same sequence of random variables can be generated from the same seed value. This feature can be used to choose the best model or to reproduce the same results in the future in case of regulatory requirements. Moreover, the corresponding random variables can be generated on different computers.

Simulating Random Variables from a Specific Distribution

Simulating random variables from a specific distribution is initiated by first generating a random

number from a uniform distribution (0,1). After that, the cumulative distribution of the distribution

we are trying to simulate is used to get the random values from that distribution. T hat is, we first

generate a random number U from U(0,1) distribution, then, we use the generated random number to

simulate a random variable X with the pdf f(x) by using the CDF, F(x).

Let U be the probability that X takes a value less than or equal to x, that is,

U = P(X ≤ x) = F(x)

Then we can derive the random variable x as:

x = F−1 (u)

To put this more simply, the algorithm for simulating a random variable from a specific distribution involves:

1. Generating a random variable u from the uniform distribution U(0,1)

2. Compute x = F−1(u)

Note that the random variable X has a CDF F(x) as shown below:

P(X ≤ x) = P(F−1 (U) ≤ x) = P(U ≤ F(x)) = F(x)

Example: Generating Random Variables from Exponential Distribution

Assume that we want to simulate three random variables from an exponential distribution with parameter λ = 0.2 using the values 0.112, 0.508, and 0.005 from U(0,1).

Solution

This question assumes that the uniform random variables have already been generated. The inverse of the CDF of the exponential distribution is given by:

F⁻¹(u) = −(1/λ) ln(1 − u)

So, in this case:

F⁻¹(u) = −(1/0.2) ln(1 − u)

x = −5 ln(1 − u)

So the random variables are:

x1 = −5 ln(1 − u1) = −5 ln(1 − 0.112) = 0.5939
x2 = −5 ln(1 − u2) = −5 ln(1 − 0.508) = 3.5464
x3 = −5 ln(1 − u3) = −5 ln(1 − 0.005) = 0.0251

The simulated random variables are 0.5939, 3.5464, and 0.0251.
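The inverse-transform steps can be sketched as follows (the seeded PRNG at the end is illustrative of how the uniforms would be produced in practice):

```python
import math
import random

def exponential_inverse_cdf(u, lam):
    """Inverse transform for an exponential(lam) distribution:
    F^{-1}(u) = -(1/lam) * ln(1 - u)."""
    return -math.log(1 - u) / lam

# Reproduce the example with the three given uniform draws and lambda = 0.2.
draws = [exponential_inverse_cdf(u, 0.2) for u in (0.112, 0.508, 0.005)]
print([round(x, 4) for x in draws])   # [0.5939, 3.5464, 0.0251]

# In practice, the uniforms come from a seeded PRNG, e.g.:
rng = random.Random(42)
sample = [exponential_inverse_cdf(rng.random(), 0.2) for _ in range(5)]
```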

Monte Carlo Simulation

Monte Carlo simulation is used to estimate population moments or functions of them. The procedure is as follows:

Assume that X is a random variable that can be simulated, and let g(X) be a function that can be evaluated at realizations of X. The simulation generates multiple copies of g(X) by simulating draws xi from X and calculating gi = g(xi).

This process is repeated b times so that a set of iid variables is generated from the unknown distribution of g(X), which can then be used to estimate the desired statistic.

For instance, if we wish to estimate the mean of the generated random variables, then the mean is

given by:

Ê(g(X)) = (1/b) ∑_{i=1}^{b} g(xi)

This is true because the generated variables are iid and the process is repeated b times.

Consequently, by the law of large numbers (LLN),

lim_{b→∞} Ê(g(X)) = E(g(X))

Also, the Central Limit Theorem applies to the estimated mean, so that:

Var[Ê(g(X))] = σ²g / b

Where σ²g = Var(g(X))

The second moment, the variance (standard variance estimator), is estimated as:

σ̂²g = (1/b) ∑_{i=1}^{b} (g(xi) − Ê[g(X)])²

From the CLT, the standard error of the simulated expectation is given by:

√(σ²g / b) = σg / √b

The standard error of the simulated expectation measures the accuracy of the estimate; thus, the choice of b determines the accuracy of the simulation.

Another quantity that can be calculated from the simulation is the α-quantile, obtained by arranging the b draws in ascending order and then selecting the value in position bα of the sorted set.

Moreover, simulation can be used to determine the finite-sample properties of estimated parameters. Assume that the sample size n is large enough for the CLT approximation to be adequate. Now, consider the finite-sample distribution of an estimator θ̂. Using the assumed DGP, n random samples are generated so that:

X = [x 1, x 2, … , x n ]

We need to estimate the parameter θ. We simulate a new data set and estimate the parameter b times, obtaining (θ̂1, θ̂2, …, θ̂b) from the finite-sample distribution of the estimator of θ. From these values, we can infer the properties of the estimator θ̂. For instance, the bias, defined as:

Bias(θ̂) = E(θ̂) − θ

can be approximated as:

B̂ias(θ̂) = (1/b) ∑_{i=1}^{b} (θ̂i − θ)

Having covered the basics of the Monte Carlo simulation, its basic algorithm is as follows:

i. Generate the data: x i = [x 1i , x 2i ,… , x ni] by using the assumed DGP.

ii. Compute the desired function or statistic gi = g(x i ).

iii. Iterate steps 1 and 2 b times.

iv. From the replications {g1 , g2 ,… , gb}, calculate the statistic of interest.

v. Determine the accuracy of the estimated quantity by calculating the standard error. If the standard error is too large, increase the number of replications b until the error is acceptably small.

Example: Using the Monte Carlo Simulation to Estimate the Price of a Call Option

Recall that the payoff of a call option at maturity is given by:

max(0, ST − K)

ST is the price of the underlying stock at maturity T, and K is the strike price. The price of the call option is a nonlinear function of the underlying stock price at expiration, and thus we can model the price of the call option by simulation.

Assuming that the log of the stock price is normally distributed, the terminal log price can be modeled as the sum of the initial log price, a drift term, and a normally distributed shock. Mathematically stated as:

sT = s0 + T (rf − σ²/2) + √T xi

Where

s0 = the natural log of the initial stock price

T = time to maturity in years

rf = the annualized risk-free rate

σ 2= variance of the stock return

x i= simulated values from N(0, σ 2)

From the formula above, simulating the price of the underlying stock requires an estimate of the stock's volatility.

Using the simulated price of the stock, the price of the option can be calculated as:

c = e(−rfT)max(ST − K, 0)

And thus the mean of the price of the call option can be estimated as:

Ê(c) = c̄ = (1/b) ∑_{i=1}^{b} ci

Where ci are the simulated payoffs of the call option. Note that, using the equation sT = s0 + T(rf − σ²/2) + √T xi, the simulated stock prices can be expressed as:

S_T^i = e^(s0 + T(rf − σ²/2) + √T xi)

And thus:

g(xi) = ci = e^(−rf T) max(e^(s0 + T(rf − σ²/2) + √T xi) − K, 0)

The standard error of the call option price is given by:

s.e.(Ê(c)) = √(σ̂²g / b) = σ̂g / √b

Where:

σ̂²g = (1/b) ∑i (ci − c̄)²

Given the standard error, we can calculate confidence intervals for the estimated mean call option price. For instance, the 95% confidence interval is given by:

Ê(c) ± 1.96 × s.e.(Ê(c))
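Under the stated assumptions (log-normal terminal price, known volatility), the whole procedure — simulate, discount, average, and compute the standard error — can be sketched as follows (the parameter values and function name are hypothetical):

```python
import math
import random

def mc_call_price(s0, k, rf, t, sigma, b, seed=42):
    """Monte Carlo estimate of a European call price and its standard error,
    assuming a log-normal terminal price (the DGP described above)."""
    rng = random.Random(seed)
    log_s0 = math.log(s0)
    payoffs = []
    for _ in range(b):
        x = rng.gauss(0.0, sigma)                  # shock x_i ~ N(0, sigma^2)
        log_st = log_s0 + t * (rf - 0.5 * sigma ** 2) + math.sqrt(t) * x
        payoffs.append(math.exp(-rf * t) * max(math.exp(log_st) - k, 0.0))
    mean = sum(payoffs) / b
    var = sum((c - mean) ** 2 for c in payoffs) / b
    return mean, math.sqrt(var / b)                # estimate, standard error

price, se = mc_call_price(100, 100, 0.05, 1.0, 0.20, b=100_000)
ci = (price - 1.96 * se, price + 1.96 * se)        # 95% confidence interval
print(round(price, 2), round(se, 3))
```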

Reducing Monte Carlo Sampling Error

Sampling error in Monte Carlo simulation is reduced by two complementary methods:

1. Antithetic Variables, and

2. Control Variates.

These methods can be used simultaneously.

To set the stage, recall that the estimation of expected values in simulation depends on the Law of Large Numbers (LLN) and that the standard error of the estimated expected value is proportional to 1/√b. Therefore, the accuracy of the simulation depends on the variance of the simulated quantities.

Antithetic Variables

Recall that the variance of the sum of two random variables X and Y is given by:

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)

If the variables are independent, this reduces to:

Var(X + Y) = Var(X) + Var(Y)

Moreover, if the covariance between the variables is negative (the variables are negatively correlated), then:

Var(X + Y) = Var(X) + Var(Y) − 2|Cov(X, Y)| < Var(X) + Var(Y)

The antithetic variates technique uses this last result. It reduces the sampling error by incorporating a second set of variables generated so that they are negatively correlated with the initial iid simulated variables. That is, each simulated variable is paired with an antithetic variable, so the values occur in negatively correlated pairs.

If U 1 is a uniform random variable, then:

F−1(U 1 ) ∼ Fx

Denote an antithetic variable U 2 which is generated using:

U2 = 1 − U1

Note that U 2 is also a uniform random variable so that:

F−1(U 2 ) ∼ Fx

Then, by the definition of antithetic variables, the correlation between U1 and U2 is negative, as is the correlation between their mappings through the inverse CDF onto FX.

Using antithetic random variables is analogous to a typical Monte Carlo simulation, except that the uniforms are constructed in pairs

{U1, 1 − U1}, {U2, 1 − U2}, …, {U_{b/2}, 1 − U_{b/2}}

which are then transformed to the desired distribution using the inverse CDF.

Note that the number of simulation draws required is b/2, since the values occur in pairs. Antithetic variables reduce the sampling error only if the function g(X) is monotonic in x, so that the negative correlation between a draw and its antithetic counterpart carries over to g(xi) and g(−xi).

Notably, antithetic random variables reduce the sampling error through the correlation coefficient. The usual sampling error using b iid simulated values is:

σg / √b

By introducing antithetic random variables, the standard error becomes:

σg √(1 + ρ) / √b

Clearly, the standard error decreases when the correlation coefficient, ρ < 0.
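A minimal sketch of the variance reduction, using the monotone function g(u) = exp(u) on U(0,1) as a stand-in for a payoff (this toy target is an assumption for illustration; its true mean is e − 1):

```python
import math
import random

def plain_mc(b, seed=1):
    """Estimate E[g(U)] for g(u) = exp(u), U ~ Uniform(0,1), with b iid draws."""
    rng = random.Random(seed)
    vals = [math.exp(rng.random()) for _ in range(b)]
    mean = sum(vals) / b
    var = sum((v - mean) ** 2 for v in vals) / b
    return mean, var / b                     # estimate and its variance

def antithetic_mc(b, seed=1):
    """Same estimate using b/2 antithetic pairs (u, 1-u); g is monotone,
    so the pair members are negatively correlated."""
    rng = random.Random(seed)
    pair_means = []
    for _ in range(b // 2):
        u = rng.random()
        pair_means.append(0.5 * (math.exp(u) + math.exp(1 - u)))
    m = len(pair_means)
    mean = sum(pair_means) / m
    var = sum((v - mean) ** 2 for v in pair_means) / m
    return mean, var / m

# Both target e - 1 ≈ 1.7183; the antithetic estimator has a much smaller
# variance for the same total number of function evaluations.
m1, v1 = plain_mc(10_000)
m2, v2 = antithetic_mc(10_000)
print(round(m1, 3), round(m2, 3), v2 < v1)
```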

Control Variates

Control variates reduce the sampling error by incorporating values that have a mean of zero and are correlated with the simulation. Because a control variate has mean zero, it does not bias the approximation. Given that the control variate and the desired function are correlated, an effective combination (with optimal weights) of the control variate and the initial simulated values can be used to reduce the variance of the approximation.

Recall that the expected value is approximated as:

Ê[g(X)] = (1/b) ∑_{i=1}^{b} g(xi)

Since this estimate is consistent, we can decompose it as:

Ê[g(X)] = E[g(X)] + ηi

Where ηi is a mean-zero error; that is, E(ηi) = 0.

Denote the control variate by h(Xi), so that, by definition, E[h(Xi)] = 0 and it is correlated with ηi.

An ideal control variate should be inexpensive to construct and highly correlated with g(X), so that the optimal combination parameter β0 that minimizes the estimation error can be approximated from the regression equation:

g(xi) = β0 + β1 h(xi)
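A sketch of the same idea on a toy problem, using h(U) = U − 0.5 — which has mean zero and is highly correlated with g(U) = exp(U) — as the control variate (the choice of target and control is illustrative):

```python
import math
import random

def control_variate_mc(b, seed=7):
    """Estimate E[exp(U)], U ~ Uniform(0,1), with h(U) = U - 0.5 as a
    mean-zero control variate correlated with g(U) = exp(U)."""
    rng = random.Random(seed)
    us = [rng.random() for _ in range(b)]
    g = [math.exp(u) for u in us]
    h = [u - 0.5 for u in us]
    g_bar = sum(g) / b
    h_bar = sum(h) / b
    # Optimal coefficient beta1 = Cov(g, h) / Var(h), as in the regression above.
    cov_gh = sum((gi - g_bar) * (hi - h_bar) for gi, hi in zip(g, h)) / b
    var_h = sum((hi - h_bar) ** 2 for hi in h) / b
    beta1 = cov_gh / var_h
    # Since E[h(U)] = 0 exactly, subtracting beta1 * h_bar removes the
    # component of the sampling noise that is correlated with h.
    return g_bar - beta1 * h_bar

est = control_variate_mc(50_000)
print(round(est, 4))   # close to e - 1 ≈ 1.7183
```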

Disadvantages of Simulation

Monte Carlo simulation can result in unreliable approximations of moments if the DGP used does not adequately describe the observed data. This mostly occurs due to misspecification of the DGP.

Simulation can be costly, especially when running multiple simulation experiments, because it can be time-consuming.

Bootstrapping

As stated earlier, bootstrapping is a type of simulation that uses the observed data to simulate from the unknown distribution that generated them. Note that bootstrapping does not directly model the observed data or impose any distributional assumption; rather, it treats the unknown distribution from which the sample was drawn as the origin of the observed data.

There are two types of bootstraps:

i. iid Bootstraps

ii. Circular Blocks Bootstraps (CBB)

iid Bootstrap

The iid bootstrap builds samples by drawing with replacement from the observed data. Assume that a simulated sample of size m is to be created from observed data with n observations. The iid bootstrap constructs observation indices by randomly sampling with replacement from the values 1, 2, …, n. These random indices are then used to select the observed data points to be included in the simulated data (the bootstrap sample).

For instance, assume we want to draw 10 observations from a sample of 50 data points: {x1, x2, x3, …, x50}. The first simulation could use the observations {x1, x12, x23, x11, x32, x43, x1, x22, x2, x22}, the second simulation could use {x50, x21, x23, x19, x32, x49, x41, x22, x12, x39}, and so on, until the desired number of simulations is reached.

In other words, iid bootstrap is analogous to Monte Carlo Simulation, where bootstrap samples are

used instead of simulated samples. Under iid bootstrap, the expected values are estimated as:

Ê[g(X)] = (1/b) ∑_{j=1}^{b} g(x^BS_{1,j}, x^BS_{2,j}, …, x^BS_{m,j})

Where

x^BS_{i,j} = observation i of bootstrap sample j

b = total number of bootstrap samples

The iid bootstrap is suitable when the observations are independent over time; using it in financial analysis is often inappropriate because most financial data are dependent.

In short, the algorithm for generating a sample using the iid bootstrap is:

i. Create a random set of m integers (i1 , i2 ,… , im ) from (1,2,…,n) with replacement.

ii. Construct the bootstrap sample as x i1 , x i2, … , x im
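The two steps above can be sketched as follows, here bootstrapping the sample mean of the earlier five-day return data (the function name is illustrative):

```python
import random
import statistics

def iid_bootstrap_means(data, m, b, seed=0):
    """Generate b bootstrap samples of size m (drawn with replacement
    from `data`) and return the mean of each sample."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(b):
        # Step i: random indices with replacement; step ii: build the sample.
        sample = [data[rng.randrange(n)] for _ in range(m)]
        means.append(statistics.mean(sample))
    return means

data = [0.06, 0.05, 0.08, 0.10, 0.11]
boot_means = iid_bootstrap_means(data, m=5, b=1_000)
# The spread of the bootstrap means approximates the sampling error of the mean.
print(round(statistics.mean(boot_means), 3), round(statistics.stdev(boot_means), 4))
```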

Circular Block Bootstrap (CBB)

The circular block bootstrap differs from the iid bootstrap in that, instead of sampling each data point with replacement, it samples blocks of size q with replacement. For instance, assume that we have 50 observations grouped into five blocks, each containing q = 10 observations.

The blocks are sampled with replacement until the desired sample size is produced. In case the number of observations in the sampled blocks exceeds the required sample size, some of the observations in the last block are omitted.

The block size should be large enough to reflect the dependence in the observations, but not so large that important blocks are excluded. Conventionally, the block size is set to the square root of the sample size (√n).

The general steps for generating a sample using the CBB are:

i. Decide on the block size q; preferably, the block size should be close to the square root of the sample size, i.e., √n.

ii. Select the first block index i from (1, 2, …, n) and transfer {xi, xi+1, …, xi+q−1} to the bootstrap sample, where indices larger than n wrap around.

iii. In case the bootstrap sample has fewer than m elements, repeat step (ii) above.

iv. In case the bootstrap sample has more than m elements, omit the values from the end of the

bootstrap sample until the sample size is m.
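The steps above can be sketched as follows (block selection, wrap-around, and trimming follow steps ii–iv; the function name is illustrative):

```python
import random

def circular_block_bootstrap(data, m, q, seed=0):
    """Build one bootstrap sample of size m from blocks of length q,
    with indices wrapping around the end of the data (circular)."""
    rng = random.Random(seed)
    n = len(data)
    sample = []
    while len(sample) < m:
        start = rng.randrange(n)                           # first index of block
        block = [data[(start + j) % n] for j in range(q)]  # modulo = wrap-around
        sample.extend(block)
    return sample[:m]                                      # trim any overhang

data = list(range(1, 51))                      # 50 observations
bs = circular_block_bootstrap(data, m=50, q=7)  # q ≈ sqrt(50)
print(len(bs))   # 50
```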

Application of Bootstrapping

One of the applications of bootstrapping is the estimation of the p-value at risk in financial markets.

Recall the p-value at risk (p-VaR) is defined as:

p-VaR = argmin_{VaR} {Pr(L > VaR) = 1 − p}
Where:

L = loss of the portfolio over a given period, and

1 − p = the probability that a loss greater than the VaR occurs.

If the loss is measured as a percentage of a particular portfolio, then p-VaR can be seen as a quantile of the return distribution. For instance, if we wish to calculate a one-year VaR of a portfolio, we would simulate one year of data (252 trading days) and then find the quantile of the simulated annual returns.

The VaR is then calculated by sorting the b bootstrapped annual returns from lowest to highest and selecting the value at position (1 − p) × b, which is the empirical 1 − p quantile of the annual returns.
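The procedure just described can be sketched as follows; the daily returns are simulated stand-ins for real market data, and b and p denote the number of bootstrap replications and the VaR confidence level:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical history of daily returns; in practice these come from market data.
daily_returns = rng.normal(0.0003, 0.01, size=1000)

b, p = 5000, 0.99                      # b bootstrap replications, 99% VaR
annual_returns = np.empty(b)
for j in range(b):
    # iid bootstrap of 252 daily returns, compounded into one annual return
    idx = rng.integers(0, len(daily_returns), size=252)
    annual_returns[j] = np.prod(1 + daily_returns[idx]) - 1

# Empirical (1 - p) quantile of the bootstrapped annual returns
var_99 = -np.quantile(annual_returns, 1 - p)   # loss, expressed as a positive number
```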

Situations Where Bootstrap Will be Ineffective

The following are two situations where the bootstrap will not be sufficiently effective:

Outliers in the data – If outliers are present, there is a likelihood that the bootstrap's conclusions will be affected.

Non-independent data – When a bootstrap is applied, the data are assumed to be independent of one another, so dependence in the data undermines its validity.

Disadvantages of Bootstrapping

Bootstrapping uses the whole dataset to generate a simulated sample, and the simulated sample may therefore be unreliable when the past and the present data differ. For example, the present state of a financial market might be different from the past.

In particular, bootstrapping historical data can be unreliable due to changes in the market that make the present different from the past. For instance, if we bootstrap market interest rates, there might be huge discrepancies between past and present market forces, which cause interest rates to fluctuate significantly.

Comparison between Monte Carlo Simulation and Bootstrapping

Monte Carlo simulation uses a complete statistical model that incorporates assumptions about the distribution of the shocks. Therefore, the results are inaccurate if the model used is poor, even when the number of replications is very large.

Bootstrapping, on the other hand, does not specify a model but instead assumes that the present resembles the past of the data. In other words, bootstrapping draws directly on the observed data, including any dependence in it, to reflect sampling variation.

Both Monte Carlo simulation and bootstrapping are affected by the "Black Swan" problem, where the resulting simulations in both methods closely resemble historical data. In other words, the simulations are anchored to historical data, and thus cannot produce outcomes very different from what has been observed.

Practice Question

Which of the following statements correctly describes an antithetic variable?

A. They are variables that are generated to have a negative correlation with the

initial simulated sample.

B. They are mean zero values that are correlated to the desired statistic that is to

be computed from through simulation.

C. They are the mean zero variables that are negatively correlated with the initial

simulated sample.

D. None of the above

Solution

The correct answer is A.

Antithetic variables are used to reduce the sampling error in Monte Carlo simulation. They are generated to have a negative correlation with the initial simulated sample so that the overall standard error of the approximation is reduced. (Choice B instead describes control variates, which are mean-zero variables correlated with the statistic being estimated.)

Reading 25: Machine-Learning Methods

After completing this reading, you should be able to:

Discuss the philosophical and practical differences between machine-learning techniques

and classical econometrics.

Differentiate among unsupervised, supervised, and reinforcement learning models.

Use principal components analysis to reduce the dimensionality of a set of features.

Describe how the K-means algorithm separates a sample into clusters.

Understand the differences between and consequences of underfitting and overfitting and

propose potential remedies for each.

Explain the differences among the training, validation, and test data sub-samples, and how

each is used.

Explain how reinforcement learning operates and how it is used in decision-making.

Be aware of natural language processing and how it is used.

Machine-Learning Techniques vs. Classical Econometrics

Machine learning (ML) is the art of programming computers to learn from data. Its basic idea is that

systems can learn from data and recognize patterns without active human intervention. ML is best

suited for certain applications, such as pattern recognition and complex problems that require large

amounts of data and are not well solved with traditional approaches.

On the other hand, classical econometrics has traditionally been used in finance to identify patterns

in data. It has a solid foundation in mathematical statistics, probability, and economic theory. In this

case, the analyst researches the best model to use along with the variables to be used. The

computer’s algorithm tests the significance of variables, and based on the results, the analyst decides

whether the data supports the theory.

Machine learning and traditional linear econometric approaches are both employed in prediction. The

former has several advantages: machine learning does not rely on much financial theory when

selecting the most relevant features to include in a model. It can also be used by a researcher who is

unsure or has not specified whether the relationship between variables is linear or non-linear. T he

ML algorithm automatically selects the most relevant features and determines the most appropriate

relationships between the variables.

Secondly, ML algorithms are flexible and can handle complex relationships between variables.

Consider the following linear regression model:

y = β0 + β1X1 + β2X2 + ε

Suppose that the effect of X1 on y depends on the level of X2. Analysts would miss this interaction effect unless a multiplicative term was explicitly included in the model. In the case of many explanatory variables, a linear model may be difficult to construct for all combinations of interaction terms. The use of machine learning algorithms can mitigate this problem by automatically capturing interactions.
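As a quick illustration on hypothetical data, a least-squares fit recovers the interaction effect only when the multiplicative term is explicitly included in the design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X1, X2 = rng.normal(size=n), rng.normal(size=n)
# True model contains an interaction: the effect of X1 on y depends on X2.
y = 1 + 2 * X1 + 3 * X2 + 4 * X1 * X2 + rng.normal(scale=0.1, size=n)

# Least-squares fit with the multiplicative term explicitly included
A = np.column_stack([np.ones(n), X1, X2, X1 * X2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # recovers roughly [1, 2, 3, 4]
```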

Additionally, the traditional statistical approaches for evaluating models, such as analyses of statistical

significance and goodness of fit tests, are not typically applied in the same way to supervised machine

learning models. This is because the goal of supervised machine

predictions rather than to understand the underlying relationships between variables or to test

hypotheses.

There are different terminologies and notations used in ML. This is because engineers, rather than

statisticians, developed most machine learning techniques. T here has been a lot of discussion of

features/inputs and targets/outputs. According to classical econometrics, features/inputs are simply

independent variables. Targets/outputs are dependent variables, and the values of the outputs are

referred to as labels.

The following gives a summary of some of the differences between ML techniques and classical

econometrics.

Goals
- Machine-learning techniques: Build models that can learn from data and continuously improve their performance over time; the relationships between variables need not be specified in advance.
- Classical econometrics: Identifies and estimates the relationships between variables, and tests hypotheses about these relationships.

Data requirements
- Machine-learning techniques: ML models can deal with large amounts of complex and unstructured data.
- Classical econometrics: Requires well-structured data with clearly defined dependent and independent variables.

Assumptions
- Machine-learning techniques: Not built on statistical assumptions and can handle non-linear relationships between variables.
- Classical econometrics: Based on various assumptions, e.g., errors are normally distributed, linear relationships between variables.

Interpretability
- Machine-learning techniques: May be complex to interpret, as models may involve complex patterns and relationships that are difficult to understand or explain.
- Classical econometrics: Statistical models can be interpreted in terms of the relationships between variables.

Types of Machine Learning

There are many types of machine learning systems. Some of the types include unsupervised

learning, supervised learning, and reinforcement learning.

Unsupervised Learning

As the name suggests, the system attempts to learn without a teacher. It recognizes data patterns

without an explicit target. More specifically, it uses inputs (X’s) for analysis with no corresponding

target (Y). Data is clustered to detect groups or factors that explain the data. It is, therefore, not

used for predictions.

For example, unsupervised learning can be used by an entrepreneur who sells books to detect

groups of similar customers. The entrepreneur will at no point tell the algorithm which group a

customer belongs to. It instead finds the connections without the entrepreneur's help. The algorithm

may notice, for instance, that 30% of the store’s customers are males who love science fiction

books and frequent the store mostly during weekends, while 25% are females who enjoy drama

books. A hierarchical clustering algorithm can be used to further subdivide groups into smaller ones.

Supervised Learning

In supervised learning, the system is trained on well-labeled data; the labels act as a supervisor teaching the machine to predict the correct output. You can think of it as how a student learns under the supervision of a teacher. A mapping function is determined that maps inputs (X's) to an output (Y). The output is also known as the target, while the X's are also known as the features.

Typically, there are two types of tasks in supervised learning. One is classification. For example, a

loan borrower may be classified as "likely to repay" or "likely to default." The second one is the

prediction of a target numerical value. For example, predicting a vehicle’s price based on a set of

features such as mileage, year of manufacture, etc. For the latter, labels will indicate the selling

prices. As for the former, the features would be the borrower’s credit score, income, etc., while the

labels would be whether they defaulted.

Reinforcement Learning

Reinforcement learning differs from other forms of learning. A learning system called an agent

perceives and interprets its environment, performs actions, and is rewarded for desired behavior and

penalized for undesired behavior. This is done through a trial-and-error approach. Over time, the

agent learns by itself what is the best strategy (policy) that will generate the best reward while

avoiding undesirable behaviors. Reinforcement learning can be used to optimize portfolio allocation

and create trading bots that can learn from stock market data through trial and error, among many

other uses.

Principal Components Analysis (PCA)

Training ML models can be slowed by the millions of features that might be present in each training

instance. The many features can also make it difficult to find a good solution. This problem is

referred to as the curse of dimensionality.

Dimensions and features are often used interchangeably. Dimension reduction involves reducing the

features of a dataset without losing important information. It is useful in ML as it simplifies complex

datasets, scales down the computational burden of dealing with large datasets, and improves the

interpretability of models.

PCA is the most popular dimension reduction approach. It involves projecting the training dataset

onto a lower-dimensional hyperplane. This is done by finding the directions in the dataset that

capture the most variance and projecting the dataset onto those directions. PCA reduces the

dimensionality of a dataset while preserving as much information as possible.

In PCA, the variance measures the amount of information. Hence, principal components capture the

most variance and retain the most information. Accordingly, the first principal component will

account for the largest possible variance; the second component will intuitively account for the

second largest variance (provided that it is uncorrelated with the first principal component), and so

on. A scree plot shows how much variance is explained by the principal components of the data. The

principal components that explain a significant proportion of the variance are retained (usually 85%

to 95%).
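As a sketch of the mechanics, on hypothetical data and using an eigendecomposition of the covariance matrix of standardized features (rather than any particular library routine):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                     # hypothetical data: 200 obs, 5 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)    # make two features nearly redundant
X[:, 4] = X[:, 1] + 0.1 * rng.normal(size=200)

Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize each feature
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]                 # sort components by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()               # proportion of variance per component
k = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1  # retain ~90% of variance
scores = Z @ eigvecs[:, :k]                       # project data onto the first k components
```

The proportions in `explained` are what a scree plot displays, and `k` is the number of components retained.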

Example: Principal Components Analysis (PCA)

Researchers are concerned about which principal components will adequately explain returns in a

hypothetical Very Small Cap (VSC) 30 and Diversified Small Cap (DSC) 500 equity index over a 15-

year period. DSC 500 is a diversified index that contains stocks across all sectors, whereas VSC 30 is

a concentrated index that contains technology stocks. In addition to index prices, the dataset contains

more than 1000 technical and fundamental features. The fact that the dataset has so many features

causes them to overlap due to multicollinearity. This is where PCA comes in handy, as it works by

creating new variables that can explain most of the variance while preserving information in the

data.

Below is a scree plot for each index. Based on the 20 principal components generated, the first three components explain 88% and 91% of the variance in the VSC 30 and DSC 500 index values, respectively. Scree plots for both indexes illustrate that the incremental contribution to explaining the variance structure is very small after PC5 or so. From PC5 onwards, it is possible to ignore the principal components without losing important information.

The K-Means Clustering Algorithm

Clustering is a type of unsupervised machine-learning technique that organizes data points into

similar groups. These groups are called clusters.

Clusters contain observations from data that are similar in nature. K-means is an iterative algorithm

that is used to solve clustering problems. K is the number of fixed clusters determined by the analyst

at the outset. It is based on the idea of minimizing the sum of squared distances between data points

and the centroid of the cluster to which they belong. The following outlines the process for

implementing K-means clustering:

1. Randomly allocate initial K centroids within the data (centers of the clusters).

2. Assign each data point to the closest centroid, creating K clusters.

3. Calculate the new K centroids for each cluster by taking the average value of all data points

assigned to that cluster.

4. Reassign each data point to the closest centroid based on the newly calculated centroids.

5. Repeat the process of recalculating the new K centroids until the centroids converge or a

predetermined number of iterations has been reached.

Iterations continue until no data point is left to reassign to the closest centroid (there is no need to

recalculate new centroids). The distance between each data point and the centroids can be measured

in two ways. The first is the Euclidean distance, while the second is the Manhattan distance.

Consider two features x and y, which both have two data points A and B, with coordinates (x_A, y_A) and (x_B, y_B), respectively. The Euclidean distance, also known as the L2-norm, is calculated as the square root of the sum of the squares of the differences between the coordinates of the two points. Imagine the Pythagorean Theorem, where the Euclidean distance is the unknown side of a right-angled triangle. For a two-dimensional space, this is represented as:

Euclidean Distance (d_E) = √((x_B − x_A)² + (y_B − y_A)²)

In the case that there are more than two dimensions, for example, n features for two data points A and B, the Euclidean distance is constructed in a similar fashion. Euclidean distance is also known as the "straight-line distance" because it is the shortest distance between two points, indicated by the solid line in the figure below. Manhattan distance, also known as the L1-norm, is calculated as the sum of the absolute differences between the coordinates. For a two-dimensional space, this is represented as:

Manhattan distance (d_M) = |x_B − x_A| + |y_B − y_A|

Manhattan distance is named after the layout of streets in Manhattan, where streets are laid out in a

grid pattern, and the only way to travel between two points is by going along the grid lines.

Example: Calculating Euclidean and Manhattan distances

Suppose you have the following financial data for three companies:

Company P:

Feature 1: Market Capitalization = $0.5 billion

Feature 2: P/E Ratio = 9

Feature 3: Debt-to-Equity Ratio = 0.6

Company Q:

Feature 1: Market Capitalization = $2.5 billion

Feature 2: P/E Ratio = 15

Feature 3: Debt-to-Equity Ratio = 8

Company R

Feature 1: Market Capitalization = $85 billion

Feature 2: P/E Ratio = 32

Feature 3: Debt-to-Equity Ratio = 45

Calculate the Euclidean and Manhattan distances between companies P and Q in feature space for the

raw data.

Euclidean Distance

To calculate the Euclidean distance between companies P and Q in feature space for the raw data,

we first need to find the difference between each feature value for the two companies and then

square the differences. The Euclidean distance is then calculated by taking the square root of the

sum of these squared differences.

Euclidean Distance (d_E) = √((0.5 − 2.5)² + (9 − 15)² + (0.6 − 8)²) = √94.76 = 9.73

Manhattan Distance

To calculate the Manhattan distance between companies P and Q in feature space for the raw data, we simply find the absolute difference between each feature value for the two companies and then sum these differences.

The Manhattan distance between companies P and Q in feature space is:

Manhattan Distance (d_M) = |0.5 − 2.5| + |9 − 15| + |0.6 − 8| = |−2| + |−6| + |−7.4| = 2 + 6 + 7.4 = 15.4
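The two calculations above can be verified with a short NumPy sketch:

```python
import numpy as np

# Feature vectors: [market cap ($bn), P/E ratio, debt-to-equity ratio]
P = np.array([0.5, 9, 0.6])
Q = np.array([2.5, 15, 8])

euclidean = np.sqrt(np.sum((P - Q) ** 2))  # L2 norm of the difference
manhattan = np.sum(np.abs(P - Q))          # L1 norm of the difference

print(round(float(euclidean), 2), round(float(manhattan), 2))  # 9.73 15.4
```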

Performance Measurement for K-means

Formulas described above indicate the distance between two points A and B. It should be noted that K

-means aims to minimize the distance between each data point and its centroid rather than to

minimize the distance between data points. The data points will be closer to the centroids when the

model fits better.

Inertia, also known as the Within-Cluster Sum of Squared errors (WCSS), is a measure of the sum of

the squared distances between the data points within a cluster and the cluster's centroid. Denoting

the distance measure as di, WCSS is expressed as:

WCSS = ∑_{i=1}^{n} d_i²

K-means algorithm aims to minimize the inertia by iteratively reassigning data points to different

clusters and updating the cluster centroids until convergence. The final inertia value can be used to

measure the quality of the clusters produced by the K-means algorithm.
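A minimal sketch of the K-means iteration and its inertia/WCSS, on hypothetical two-dimensional data, might look like this:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=None):
    """Minimal K-means sketch: assign points to nearest centroid, update, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # Euclidean distances
        labels = d.argmin(axis=1)                   # assign each point to closest centroid
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):             # converged: no centroid moved
            break
        centroids = new
    inertia = np.sum((X - centroids[labels]) ** 2)  # within-cluster sum of squares (WCSS)
    return labels, centroids, inertia

# Two well-separated hypothetical clusters
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids, inertia = k_means(X, k=2, seed=3)
```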

Choosing an Appropriate Value for K

Choosing an appropriate value for K can affect the performance of the K-means model. For example,

if K is set too low, the clusters may be too general and may not be a true representative of the

underlying structure of the data. Similarly, if K is set too high, the clusters may be too specific and

may not represent the data's overall structure. These clusters may not be useful for the intended

purpose of the analysis in either case. It is, therefore, important to choose K optimally in practice.

The optimal value of K can be determined using different methods, such as the elbow method and silhouette analysis. The elbow method fits the K-means model for different values of K and plots the inertia/WCSS for each value of K. As in PCA, this is called a scree plot. The plot is then examined for the point at which the inertia begins to decrease more slowly as K increases; this "elbow" point is chosen as the optimal value of K.

The second approach involves fitting the K-means model for a range of values of K and determining

the silhouette coefficient for each value of K. The silhouette coefficient compares the distance of

each data point from other points in its own cluster with its distance from the data points in the

other closest cluster. In other words, it measures the similarity of a data point to its own cluster

compared to the other closest clusters. The optimal value of K is the one that corresponds to the

highest silhouette coefficient across all data points.

Advantages and Disadvantages of the K-Means Algorithm

K-means clustering is simple and easy to implement, making it a popular choice for clustering tasks.

There are some disadvantages to K-means, such as the need to specify the number of clusters K in advance, which can be difficult if the dataset is not well separated. Additionally, it assumes that the clusters are spherical and

equal in size, which is not always the case in practice.

K-means algorithm is very common in investment practice. It can be used for data exploration in high-

dimensional data to discover patterns and group similar observations together.

Overfitting and Underfitting


Overfitting

Imagine that you have traveled to a new country, and the shop assistant rips you off. It is a natural

instinct to assume that all shop assistants in that country are thieves. If we are not careful, machines

can also fall into the same trap of overgeneralizing. This is known as overfitting in ML.

Overfitting occurs when the model has been trained too well on the training data and performs

poorly on new, unseen data. An overfitted model can have too many model parameters, thus learning

the detail and noise in the training data rather than the underlying patterns. This is a problem because

it means that the model cannot make reliable predictions about new data, which can lead to poor

performance in real-world applications. The evaluation of the ML algorithm thus focuses on its

prediction error on new data rather than on its goodness of fit on the trained data. If an algorithm is

overfitted to the training data, it will have a low prediction error on the training data but a high

prediction error on new data.

The dataset to which an ML model is applied is normally split into training and validation samples. The

training data set is used to train the ML model by fitting the model parameters. On the other hand, the

validation data set is used to evaluate the trained model and estimate how well the model will

generalize into new data.

Overfitting is a severe problem in ML models, which can easily have thousands of parameters, unlike classical econometric models that have only a few. Potential remedies for overfitting

include decreasing the complexity of the model, reducing features, or using techniques such as

regularization or early stopping.

Underfitting

Underfitting is the opposite of overfitting. It occurs when a model is too simple and thus not able to

capture the underlying patterns in the training data. This results in poor performance on both the

training data and new data. For example, we would expect a linear model of life satisfaction to be

prone to underfit as the real world is more complicated than the model. In this scenario, the ML

predictions are likely to be inaccurate, even on the training data.

Underfitting is more likely in conventional models because they tend to be less flexible than ML

models. The former follow a predetermined set of rules or assumptions, while ML approaches impose few assumptions about the structure of the model. It should be noted, however, that ML

models can still experience underfitting. This can happen when there is insufficient data to train the

model, when the data is of poor quality, and if there is excessively stringent regularization.

Regularization is an approach commonly used to prevent overfitting. It adds a penalty to the model as

the complexity of the model increases. If the regularization is set too high, it can cause the model to

underfit the data. Potential remedies for addressing underfitting include increasing the complexity of

the model, adding more features, or increasing the amount of training data.

Bias-Variance Tradeoff

The complexity of the ML model, which determines whether the data is over, under, or well-fitted,

involves a phenomenon called bias-variance tradeoff. Complexity refers to the number of features in

a model and whether a model is linear or non-linear (non-linear models being more complex). Bias occurs

when a complex model is approximated with a simpler model, i.e., by omitting relevant factors and

interactions. A model with highly biased predictions is likely to be oversimplified and thus results in underfitting. Variance refers to how sensitive the model is to small fluctuations in the training data. A model with high variance in predictions is likely to be overly complex and thus results in overfitting.

The figure below illustrates how bias and variance are affected by model complexity.

Sample Splitting and Preparation

Data Preparation

There is a tendency for ML algorithms to perform poorly when the variables have very different

scales. For example, there is a vast difference in the range between income and age. A person’s

income ranges in the thousands while their age ranges in the tens. Since ML algorithms only see

numbers, they will assume that higher-ranging numbers (income in this case) are superior, which is

false. It is, therefore, crucial to have values in the same range. Standardization and normalization are

two methods for rescaling variables.

Standardization involves centering and scaling variables. Centering is where the variable’s mean value

is subtracted from all observations on that variable (so standardized values have a mean of 0). Scaling

is where the centered values are divided by the standard deviation so that the distribution has a unit

variance. This is expressed as follows:

x_i(standardized) = (x_i − μ) / σ

Normalization, also known as min-max scaling, entails rescaling values from 0 to 1. This is done by

subtracting the minimum value (x min ) from each observation and dividing by the difference between

the maximum (x max ) and minimum values (x min ) of X . T his is expressed as follows:

x_i(normalized) = (x_i − x_min) / (x_max − x_min)
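Both rescaling formulas can be sketched in a few lines of NumPy (the income figures are hypothetical):

```python
import numpy as np

x = np.array([25_000.0, 40_000.0, 55_000.0, 120_000.0])  # hypothetical incomes

standardized = (x - x.mean()) / x.std()            # centered to mean 0, scaled to unit variance
normalized = (x - x.min()) / (x.max() - x.min())   # min-max scaled into [0, 1]
```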

The preferable rescaling method depends on the data characteristics:

Standardization is used when the data includes outliers. This is because normalization would

compress data points into a narrow range of 0 − 1, which would be uncharacteristic of the

original data.

Data must be normally distributed for standardization to be used, whereas normalization

can be used when the data distribution is unknown.

Data Cleaning

This is a crucial component of ML and may be the difference between an ML's success and failure.

Data cleaning is necessary for the following reasons:

Missing data: Analysts encounter this issue very often. Missing data can be dealt with in the following ways. First, observations with only a small number of missing values can be removed. Secondly, missing values can be replaced with the mean or median of the non-missing observations. Lastly, it may be possible to estimate the missing values based on observations of other features.

Inconsistent recording: It is important to record data consistently so that it can be read correctly and easily used.

Unwanted observations: Observations that are not relevant to the specific task should be removed. The result is a more efficient analysis and a reduction in distractions.

Duplicate observations: Duplicate data points should be removed to avoid biases.

Problematic features: Feature values that lie many standard deviations from the mean should be carefully monitored, as they can be problematic.

Training, Validation, and Test Datasets

We briefly discussed the training and validation data sets, which are in-sample datasets. Additionally,

there is an out-of-sample dataset, which is the test data. The training dataset teaches an ML model to

make predictions, i.e., it learns the relationships between the input data and the desired output. A

validation dataset is used to evaluate the performance of an ML model during the training process. It

compares the performance of different models so as to determine which one generalizes (fits) best

to new data. A test dataset is used to evaluate an ML model’s final performance and identify any

remaining issues or biases in the model. The performance of a good ML model on the test dataset should be relatively similar to its performance on the training dataset. In practice, however, the model may perform differently on the training and test datasets, and perfect generalization may not always be possible.

It is up to the researchers to decide how to subdivide the available data into the three samples. A

common rule of thumb is to use two-thirds of the sample for training and the remaining third to be

equally split between validation and testing. The subdivision of the data will be less crucial when the overall number of data points is large. Using a small training dataset can introduce biases into the parameter

estimation because the model will not have enough data to learn the underlying patterns in the data

accurately. Using a small validation dataset can lead to inaccurate model evaluation because the model

may not have enough data to assess its performance accurately; thus, it will be hard to identify the

best specification. When subdividing the data into training, validation, and test datasets, it is crucial to

consider the type of data you are working with.

For cross-sectional data, it is best to divide the dataset randomly, as the data has no natural ordering

(i.e., the observations are not related to each other in any specific order). For time series data, it is best to divide the data in chronological order, starting with the training data, then the validation data, and finally the testing data.

Cross-validation Searches

Cross-validation can be used when the overall dataset is insufficient to be divided into training,

validation, and testing datasets. In cross-validation, training and validation datasets are combined into

one sample, and the test dataset is kept out of this combined sample. The combined data is then split into k equal sub-samples (folds), with a different fold left out each time for validation. This technique is known as k-fold cross-validation: the model is trained and evaluated k times, each time training on k − 1 folds and validating on the remaining fold. The values k = 5 and k = 10 are commonly chosen for k-fold cross-validation.
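A sketch of how the k-fold splits might be generated (the function name and parameters are illustrative, not from any particular library):

```python
import numpy as np

def k_fold_indices(n, k, seed=None):
    """Yield (train, validation) index arrays for k-fold cross-validation.

    Each fold serves once as the validation set; the remaining folds train the model.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)  # shuffle, then split into k folds
    for j in range(k):
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        yield train, folds[j]

splits = list(k_fold_indices(100, k=5, seed=0))
```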

Reinforcement Learning (RL)

Reinforcement learning involves training an agent to make a series of decisions in an environment to

maximize a reward. The agent is given feedback as either a reward or punishment depending on its

actions. It then uses the feedback to learn the actions that are likely to generate the highest reward.

The algorithm learns through trial and error by playing many times against itself.

How Reinforcement Learning Operates

Define the Environment

The environment consists of the state space, action space, and the reward function. The state space

is the set of all possible states in which the agent can be. On the other hand, the action space

consists of a set of actions that the agent can take. Lastly, the reward function defines the feedback

that the agent receives for taking a particular action in a given state space.

Initialize the Agent

This step involves specifying the learning algorithm and any relevant parameters. The agent is then placed in the initial state of the environment.

Take an Action

The agent chooses an action depending on its current state and the learning algorithm. This action is then taken in the environment, which may lead to a change of state and a reward. At any given state, the algorithm can choose between taking the best known course of action (exploitation) and trying a new action (exploration). Exploitation is assigned probability p and exploration probability 1 − p. The probability p increases as more trials are completed and the algorithm learns more about the best strategy.

Update the Agent

Based on the reward received and the environment's new state, the agent updates its internal state. This update is carried out using some form of optimization algorithm.

Repeat the Process

The agent continues to take actions and update its internal state until it reaches a predefined number of iterations or until a terminal state is reached.

Monte Carlo vs. Temporal Difference Methods for Reinforcement Learning

The Monte Carlo method estimates the value of a state or action based on the final reward received at the end of an episode. The temporal difference method, by contrast, updates the value of a state or action by looking only one decision ahead when updating strategies.

An estimate of the expected value of taking action A in state S, after several trials, is denoted as Q(S, A). When a new reward observation R becomes available, the estimate is updated as:

Q_new(S, A) = Q_old(S, A) + α[R − Q_old(S, A)]

Where α is a parameter, say 0.05, which is the learning rate that determines how much the agent

updates its Q value based on the difference between the expected and actual reward.

Example: Reinforcement Learning

Suppose that we have three states (S1, S2, S3) and two actions (A1, A2), with the following Q(S, A)

values:

S1 S2 S3
A1 0.3 0.4 0.5
A2 0.7 0.6 0.5

Monte-Carlo Method

Suppose that on the next trial, Action 2 is taken in State 3, and the total subsequent reward is 1.0. If α = 0.075, the Monte Carlo method would lead to Q(S3, A2) being updated from 0.5 to:

Q(S3, A2) = Q(S3, A2) + 0.075 × (1.0 − Q(S3, A2)) = 0.5 + 0.075 × (1.0 − 0.5) = 0.5375

Temporal Difference Method

Suppose the next decision on the trial under consideration is made when we are in State 2, and a reward of 0.3 is earned between the two decisions. The value of being in State 2 under Action 2 is 0.6. The temporal difference method would lead to Q(S3, A2) being updated from 0.5 to:

0.5 + 0.075 × (0.3 + 0.6 − 0.5) = 0.53
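Both methods apply the same update rule, Q_new = Q_old + α × (target − Q_old); only the target differs. The two examples above can be reproduced with a short sketch (the function name is our own):

```python
# Q-value update shared by both methods: Q_new = Q_old + alpha * (target - Q_old).
# Monte Carlo uses the total subsequent reward as the target; temporal
# difference uses the immediate reward plus the value of the next state-action.
def q_update(q_old, target, alpha):
    return q_old + alpha * (target - q_old)

alpha = 0.075
q_s3_a2 = 0.5

# Monte Carlo: total subsequent reward of 1.0 observed at the end of the episode.
mc = q_update(q_s3_a2, 1.0, alpha)          # 0.5375

# Temporal difference: reward 0.3 plus the value of (State 2, Action 2) = 0.6.
td = q_update(q_s3_a2, 0.3 + 0.6, alpha)    # 0.53
```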

Potential Applications of Reinforcement Learning in Finance

1. Trading: Reinforcement learning algorithms can learn from past data and market dynamics to make informed decisions on when to buy and sell, possibly optimizing the trading of financial instruments, including stocks, bonds, and derivatives.

2. Detecting fraud: RL can be used to detect fraudulent activity in financial transactions. The algorithm learns from past data and hence adapts to new fraud patterns. This means that the algorithm becomes better at detecting and preventing fraud over time.

3. Credit scoring: RL can be used to predict the probability of a borrower defaulting on a loan. The algorithm can be trained on historical data about borrowers and their credit histories to achieve this.

4. Risk management: RL can be trained using past data to identify and mitigate financial risks.

5. Portfolio optimization: RL can be trained to take actions that modify the allocation of assets in the portfolio over time, with the aim of maximizing portfolio returns and minimizing risks.

Natural Language Processing

Natural language processing (NLP) focuses on helping machines process and understand human

language.

Steps Involved in NLP Process

The main steps in the NLP process are outlined below:

1. Data collection: Involves acquiring data from various sources, including financial statements, news articles, social media posts, etc.

2. Data preprocessing: The raw textual data is cleaned, formatted, and transformed into a form suitable for computer usage. Tasks such as tokenization, stemming, and stop word removal can be carried out at this stage.

3. Feature extraction: This involves extracting relevant features from the preprocessed data. It may involve extracting financial metrics, sentiments, and other relevant information.

4. Model training: This involves training the machine learning model using the extracted features.

5. Model evaluation: This involves evaluating the performance of the trained model to ensure it generates accurate and reliable predictions. Techniques such as cross-validation can be employed here. Model evaluation is carried out on the test dataset.

6. Model deployment: The evaluated model is then deployed for use in real-world investment scenarios.

Data Preprocessing

Textual data (unstructured data) is better suited to human consumption than to computer processing. Unstructured data thus needs to be converted to structured data through cleaning and preprocessing, a process called text processing. Text cleansing involves removing HTML tags, punctuation, numbers, and white spaces (e.g., tabs and indents).

The next step is text wrangling (preprocessing), which involves the following:

1. Tokenization: Involves separating a piece of text into smaller units called tokens. It allows the NLP model to analyze the textual data more easily by breaking it down into individual units that can be more easily processed.

2. Lowercasing: To avoid discriminating between “stock” and “Stock.”

3. Removing stop words: These are words with no informational value, e.g., “as,” “the,” and “is,” used as sentence connectors. They are eliminated to reduce the number of tokens in the training data.

4. Stemming: Reduces all the variations of a word to a common value (base form/stem). For example, “earned,” “earnings,” and “earning” are all assigned a common value of “earn.” It only removes the suffixes of words.

5. Lemmatization: Involves reducing words to their base form/lemma to identify related words. Unlike stemming, lemmatization incorporates the full structure of the word and uses a dictionary or morphological analysis to identify the lemma. It generates more accurate base forms of words. However, it is more computationally expensive than stemming.

6. Consider “n-grams”: These are words that need to be placed together to give a specific meaning. For example, “strong earnings,” “negative outlook,” or “market uncertainty.”
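The wrangling steps above can be sketched using only the Python standard library (the token pattern, stop-word list, and suffix-stripping rules here are simplified assumptions; a production pipeline would typically rely on a dedicated NLP library such as NLTK or spaCy):

```python
import re

# Illustrative stop-word list; real pipelines use much larger lists.
STOP_WORDS = {"as", "the", "is", "a", "an", "of", "and"}

def preprocess(text):
    """Tokenize, lowercase, remove stop words, and crudely stem a text."""
    tokens = re.findall(r"[A-Za-z]+", text)               # tokenization
    tokens = [t.lower() for t in tokens]                  # lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    # Naive suffix-stripping stemmer, standing in for a real stemming algorithm.
    stems = []
    for t in tokens:
        for suffix in ("ings", "ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

tokens = preprocess("The company earned strong Earnings")
```

Here “earned” and “Earnings” both reduce to the stem “earn,” while the stop word “The” is dropped.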

Finance professionals can leverage NLP to derive insights from large volumes of data and make more informed decisions. The following are some applications of NLP.

Trading: NLP can be employed to analyze real-time financial data, e.g., stock prices, to derive trends and patterns that could be used to inform investment decisions.

Risk management: NLP can be used to identify possible risks in financial contracts and regulatory filings, for example, language that implies a high level of risk, or wordings/clauses that could be interpreted differently by different parties.

News analysis: NLP can be used to derive information from news articles and other sources of financial information, e.g., earnings reports. The resulting information can then be used to monitor companies’ performance and identify potential investment opportunities.

Sentiment analysis: NLP can be used to measure the public opinion of a company, industry, or market trend by analyzing sentiments in social media posts and news articles. Text can be classified as positive, negative, or neutral based on the sentiment expressed, and investors can use this information to make more informed investment decisions.

Customer service: NLP can be employed in chatbots to help companies respond to customer queries faster and more efficiently.

Detecting accounting fraud: For example, the Securities and Exchange Commission (SEC) analyzed large amounts of publicly available corporate disclosure documents to identify patterns in language that indicated fraud.

Text classification: This is the process of assigning text data to prespecified categories. For example, text classification could involve assigning newswire statements to categories based on the news they represent, e.g., education, financial, environmental, etc.

Practice Question

Which of the following is least likely a task that can be performed using natural language

processing?

A. Sentiment analysis.

B. Text translation.

C. Image recognition.

D. Text classification.

Solution

The correct answer is C.

Image recognition is not a task that can be performed using NLP. This is because NLP is focused on understanding and processing text, not images.

A is incorrect: NLP can be used for sentiment analysis. For example, NLP can be used to measure the public opinion of a company, industry, or market trend by analyzing sentiments on social media posts.

B is incorrect: Financial documents may need to be translated into different languages to reach a global audience.

D is incorrect: Text classification is the process of assigning text data to prespecified categories. For example, text classification could involve assigning newswire statements to categories based on the news they represent, e.g., education, financial, environmental, etc.

Reading 26: Machine Learning and Prediction

After completing this reading, you should be able to:

Explain the role of linear regression and logistic regression in prediction.

Understand how to encode categorical variables.

Discuss why regularization is useful and distinguish between the ridge regression and

LASSO approaches.

Show how a decision tree is constructed and interpreted.

Describe how ensembles of learners are built.

Outline the intuition behind the K nearest neighbors and support vector machine methods

for classification.

Understand how neural networks are constructed and how their weights are determined.

Evaluate the predictive performance of logistic regression models and neural network

models using a confusion matrix.

Role of Linear and Logistic Regression in Prediction

Linear Regression (Ordinary Least Squares)

Linear regression models the relationship between a dependent variable and one or more

independent variables by fitting a linear equation to the observed data. It works by finding the straight line of best fit through the data points, called the regression line. The equation of the best-fit line can then be used to make predictions about the dependent variable based on new values of the independent variables.

T he regression line can be expressed as follows:

y = α + β_1 x_1 + β_2 x_2 + … + β_n x_n

Where:

y = Dependent variable.

α = Intercept.

x_1, x_2, …, x_n = Independent variables.

β_1, β_2, …, β_n = Multiple regression coefficients.

The coefficients show the effect of each independent variable on the dependent variable and are calculated based on the data.

The Cost Function for Linear Regression

Training any machine learning model aims to minimize a cost (loss) function, which measures the inaccuracy of the model's predictions. For a linear regression model, the cost function is the residual sum of squares (RSS): the sum of the squared differences between the actual and predicted values of the response (dependent) variable.

RSS = Σ_{i=1}^{n} (y_i − α − Σ_{j=1}^{k} β_j x_{ij})²

Where x_{ij} is the ith observation of the jth variable.

To measure how well the data fits the line, take the difference between each actual data point (y) and the model's prediction (ŷ). The differences are then squared to eliminate negative numbers and to penalize larger differences. Summing the squared differences gives the RSS; dividing that sum by n gives the mean squared error (MSE).
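As a concrete illustration, the RSS for a simple one-feature model can be computed directly (the data and coefficients below are toy values, not from the reading):

```python
# Residual sum of squares (RSS) for a fitted line y_hat = a + b * x.
def rss(x_values, y_values, a, b):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(x_values, y_values))

x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]

# With a = 0 and b = 2, residuals are 0.1, -0.1, 0.2, -0.2, so RSS = 0.10.
loss = rss(x, y, a=0.0, b=2.0)
```

Fitting the model means choosing a and b to make this quantity as small as possible.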

The advantage of linear regression is that it is easy to understand and interpret. However, it has the

following limitations:

It assumes a linear relationship between the dependent and independent variables.

It assumes that residuals (the difference between observed and predicted values) are

normally distributed and have a constant variance.

It is prone to overfitting.

It assumes that there is no multicollinearity.

Example: Prediction using Linear Regression

Aditya Khun, an investment analyst, wants to predict the return on a stock based on its P/E ratio and

the market capitalization of the company using linear regression in machine learning. Khun has

access to the P/E ratio and market capitalization dataset for several stocks, along with their

corresponding returns. Khun can employ linear regression to model the relationship between the

return on a stock and its P/E ratio and market capitalization. The following equation represents the model:

Return = β_0 + β_1 × P/E ratio + β_2 × Market capitalization

Where:

Return = Dependent variable.

P/E ratio and market capitalization = Independent variables.

β0 = Intercept.

β1 and β2 are the coefficients of the model.

The first step of fitting a linear regression model is estimating the values of the coefficients β_0, β_1, and β_2 using the training data. The coefficients that minimize the sum of the squared residuals are determined.

Suppose we have the following data for 6 stocks:

Stock   P/E Ratio   Market cap ($millions)   Return
1       9           200                      8%
2       11          300                      15%
3       14          400                      18%
4       16          500                      19%
5       18          600                      23%
6       20          700                      27%

Given the following parameters and coefficients:

Intercept = 3.432.

P/E Ratio coefficient = −0.114.

Market cap coefficient = 0.0368.

The prediction equation is expressed as follows:

Return = 3.432 − 0.114 × P/E ratio + 0.0368 × Market capitalization

Given a P/E ratio of 14 and a market capitalization of $150 million, the predicted return of the stock is:

Return = 3.432 − 0.114 × 14 + 0.0368 × 150 = 7.356%
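The prediction above can be reproduced in a few lines (a sketch using the coefficients given in the example; the function name and default arguments are our own):

```python
# Predicted return (%) from the fitted linear model in the example:
# Return = 3.432 - 0.114 * (P/E ratio) + 0.0368 * (market cap in $millions).
def predict_return(pe_ratio, market_cap, intercept=3.432,
                   b_pe=-0.114, b_cap=0.0368):
    return intercept + b_pe * pe_ratio + b_cap * market_cap

# P/E ratio of 14 and market capitalization of $150M:
r = predict_return(14, 150)   # 3.432 - 1.596 + 5.52 = 7.356
```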

Logistic Regression

When using a linear regression model for binary classification, where the dependent variable Y can

only be 0 or 1, the model can predict probabilities outside the range of 0 to 1. This occurs because the model attempts to fit a straight line to the data, and the predicted values may not be restricted to the valid range of probabilities. As a result, the model may produce predictions that are less than zero or greater than one. To avoid this issue, it may be necessary to use a different type of model, such as logistic regression, which is specifically designed for binary classification tasks and ensures that the predicted probabilities are within the valid range. This is achieved by applying a sigmoid function. The sigmoid function graph is shown in the figure below.

Logistic regression is used to forecast a binary outcome. In other words, it predicts the likelihood of

an event occurring based on independent variables, which can be categorical or continuous.

The logistic regression model is expressed as:

F(y_j) = e^{y_j} / (1 + e^{y_j})

Where:

y_j = α + β_1 x_{1j} + β_2 x_{2j} + … + β_m x_{mj}

α = Intercept term.

β_1, β_2, …, β_m = Coefficients that must be learned from the training data.

The probability that y_j = 1 is expressed as:

p_j = e^{y_j} / (1 + e^{y_j})

The probability that y_j = 0 is (1 − p_j).

The Cost Function for Logistic Regression

This measures how often we predicted zero when the true answer was one and vice versa. The logistic regression coefficients are trained using techniques such as maximum likelihood estimation (MLE) to predict values close to 0 and 1. MLE works by selecting the values of the model parameters (α and the βs) that maximize the likelihood of the training data occurring. The likelihood function is a mathematical function that describes the probability of the observed data given the model parameters. By maximizing the likelihood function, we can find the values of the parameters most likely to have produced the observed data. The likelihood is expressed as:

∏_{j=1}^{n} F(y_j)^{y_j} (1 − F(y_j))^{1−y_j}

It is often easier to maximize the log-likelihood function, log(L), than the likelihood function itself. The log-likelihood function is obtained by taking the natural logarithm of the likelihood function:

log(L) = Σ_{j=1}^{n} [y_j log(F(y_j)) + (1 − y_j) log(1 − F(y_j))]

Once the model parameters (α and the βs) that maximize the log-likelihood function have been estimated using MLE, predictions can be made using the logistic regression model. To make predictions, a threshold value Z is chosen. If the predicted probability p_j is greater than or equal to the threshold Z, the model predicts the positive outcome (y_j = 1); if p_j is less than the threshold Z, the model predicts the negative outcome (y_j = 0). This is expressed as:

y_j = 1 if p_j ≥ Z
y_j = 0 if p_j < Z

Example: Using Logistic Regression to Predict Loan Default

A credit analyst wants to predict whether a customer will default on a loan based on their credit

score and debt-to-income ratio. He gathers a dataset of 500 customers, with their corresponding

credit scores, debt-to-income ratio, and whether they defaulted on the loan. He then splits the data

into training and test sets and uses the training data to train a logistic regression model.

The model learns the following relationship between the independent variables (input features) and the dependent variable (loan default):

Probability of default = e^{−10 + 0.012 × Credit score + 0.4 × Debt-to-income} / (1 + e^{−10 + 0.012 × Credit score + 0.4 × Debt-to-income})

The above expression calculates the probability that the customer will default on the loan, given their credit score and debt-to-income ratio.

So, if the credit score is 650 and the debt-to-income ratio is 0.6, the probability of default will be

calculated as:

Probability of default = e^{−10 + 0.012 × 650 + 0.4 × 0.6} / (1 + e^{−10 + 0.012 × 650 + 0.4 × 0.6}) ≈ 12%

So there is a 12% probability that the customer will default on the loan. One can then use a threshold

(such as 50%) to convert this probability into a binary prediction (either “default” or “no default”).

Since 12% < 50%, we can classify this as “no default.”
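The calculation can be verified with a short sketch (using the coefficients given above; the function name is our own):

```python
import math

# Logistic (sigmoid) probability of default for the example model:
# z = -10 + 0.012 * credit_score + 0.4 * debt_to_income.
def default_probability(credit_score, dti, a=-10.0, b_score=0.012, b_dti=0.4):
    z = a + b_score * credit_score + b_dti * dti
    return math.exp(z) / (1 + math.exp(z))

p = default_probability(650, 0.6)      # z = -1.96, so p is roughly 0.1235
prediction = 1 if p >= 0.5 else 0      # below the 50% threshold -> "no default"
```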

Applications of Logistic Regression

Logistic regression is applied for prediction and classification tasks in machine learning. For example,

you could use logistic regression to classify stock returns as either “positive” or “negative” based on

a set of input features that you choose. It is simple to implement and interpret. However, it assumes

a linear relationship between the dependent and independent variables and requires a large sample

size to achieve stable estimates of the coefficients.

Encoding Categorical Variables

Categorical data refers to information presented in groups and can take on values that are names,

attributes, or labels. It is not in a numerical format. For example, a given set of stocks can be

categorized as either growth or value stocks depending on the investment style. Many ML algorithms

struggle to deal with such data.

It isn't easy to transform categorical variables, especially non-ordinal categorical data, where the

classes are not in any order. Mapping or encoding involves transforming non-numerical information

into numbers. One-hot encoding is the most common solution for dealing with non-ordinal categorical

data. It involves creating a new dummy variable for each group of the categorical feature and

encoding the categories as binary. Each observation is marked as either belonging (Value=1) or not

belonging (Value=0) to that group.

Example: One-hot Encoding for Sector, Industry.

                 Utilities  Technology  Transportation  Internet  Airlines  Electric
Meta                 0          1             0             1         0         0
Energy               1          0             0             0         0         1
Alibaba              0          1             0             1         0         0
Virgin Atlantic      0          0             1             0         1         0

For ordered categorical variables, for example, where a candidate's grades are specified as either

poor, good, or excellent, a dummy variable that equals 0 for poor, 1 for good, and 2 for excellent can

be used.
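A minimal one-hot encoder can be written as follows (an illustrative sketch; libraries such as pandas or scikit-learn provide production-ready versions):

```python
# One-hot encode a categorical feature: one binary column per category,
# with a 1 marking the category each observation belongs to.
def one_hot(values):
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

sectors = ["Technology", "Utilities", "Technology", "Transportation"]
encoded = one_hot(sectors)
# Columns in sorted order: Technology, Transportation, Utilities
```

To avoid the dummy variable trap when an intercept is included, one of the resulting columns is typically dropped.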

If an intercept term and correlated dummy variables are included in a model, the dummy variable trap may be encountered. This means that the model will have multiple possible solutions, and we cannot find a unique best-fit solution. To address this issue, techniques such as regularization can be used. These approaches penalize the magnitude of the coefficients of the model, which can help to reduce the impact of correlated variables and prevent the dummy variable trap from occurring.

Regularization

Regularization is a technique that prevents overfitting in machine learning models by penalizing large coefficients. It adds a penalty term to the model's objective function, encouraging the coefficients to take on smaller values. This reduces the impact of correlated variables, as it forces the model to rely more on the overall pattern of the data and less on the influence of any single variable. It improves the generalization of the model to new, unseen data.

Regularization requires the data to be normalized or standardized. Normalization is a method of

scaling the data to have a minimum value of 0 and a maximum value of 1. On the other hand,

standardization involves scaling the data so that it has a mean of zero and a standard deviation of one.

Ridge regression and the least absolute shrinkage and selection operator (LASSO) regression are the

two commonly used regularization techniques.

Ridge Regression

Ridge regression, sometimes known as L2 regularization, is a type of linear regression that is used to

analyze data and make predictions. It is similar to ordinary least squares regression but includes a

penalty term that constrains the size of the model's coefficients. Consider a dataset with n

observations on each of k features in addition to a single output variable y and, for simplicity, assume

that we are estimating a standard linear regression model with hats above parameters denoting their

estimated values. The relevant objective function (referred to as a loss function) in ridge regression is:

L = (1/n) Σ_{j=1}^{n} (y_j − α̂ − β̂_1 x_{1j} − β̂_2 x_{2j} − … − β̂_k x_{kj})² + λ Σ_{i=1}^{k} β̂_i²

or

L = RSS + λ Σ_{i=1}^{k} β̂_i²

The first term in the expression is the residual sum of squares, which measures how well the model fits the data. The second term is the shrinkage term, which introduces a penalty for large slope parameter values. This is known as regularization, and it helps to prevent overfitting, which is when a model fits the training data too well and performs poorly on new, unseen data.

The parameter λ is a hyperparameter, which means that it is not part of the model itself but is used to determine the model. In this case, it controls the relative weight given to the shrinkage term versus the model fit term. It is essential to tune the value of λ, or perform hyperparameter optimization, to find the best value for the given situation. α̂ and the β̂_i are the model parameters, while λ is a hyperparameter.

Least Absolute Shrinkage and Selection Operator (LASSO)

LASSO regression, sometimes known as L1 regularization, is similar to ridge regression in that it

introduces a penalty term to the objective function to prevent overfitting. However, the penalty

term in LASSO regression takes the form of the absolute value of the coefficients rather than the

square of the coefficients as in ridge regression.

L = (1/n) Σ_{j=1}^{n} (y_j − α̂ − β̂_1 x_{1j} − β̂_2 x_{2j} − … − β̂_k x_{kj})² + λ Σ_{i=1}^{k} |β̂_i|

Also expressed as:

L = RSS + λ Σ_{i=1}^{k} |β̂_i|

In ridge regression, the values of α̂ and the β̂_i can be determined analytically using closed-form solutions. This means that the values of the coefficients can be calculated directly, without the need for iterative optimization. On the other hand, LASSO does not have closed-form solutions for the coefficients, so a numerical optimization procedure must be used to determine the values of the parameters.

Ridge regression and LASSO have a crucial difference. Ridge regression adds a penalty term that reduces the magnitude of the β parameters and makes them more stable. The effect of this is to “shrink” the β parameters towards zero, but not all the way to zero. This can be especially useful when there is multicollinearity among the variables, as it can help to prevent one variable from dominating the others.

However, LASSO sets some of the less important β parameters to exactly zero. The effect of this is to perform feature selection, as the β parameters corresponding to the least important features will

be set to zero. In contrast, the β parameters corresponding to the more important features will be

retained. This can be useful in cases where the number of variables is very large, and some variables are irrelevant or redundant. The choice between LASSO and ridge regression depends on the

specific needs of the model and the data at hand.
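The two penalty terms can be compared directly (a sketch with toy coefficients and an arbitrarily chosen λ; fitting the full regularized models would require an optimization routine):

```python
# Ridge (L2) and LASSO (L1) penalty terms for a given coefficient vector.
def ridge_penalty(betas, lam):
    return lam * sum(b ** 2 for b in betas)

def lasso_penalty(betas, lam):
    return lam * sum(abs(b) for b in betas)

betas = [2.0, -0.5, 0.0, 1.5]
l2 = ridge_penalty(betas, lam=0.1)   # 0.1 * (4 + 0.25 + 0 + 2.25) = 0.65
l1 = lasso_penalty(betas, lam=0.1)   # 0.1 * (2 + 0.5 + 0 + 1.5) = 0.40
```

Note how the squared (L2) penalty punishes the large coefficient 2.0 much more heavily than the absolute-value (L1) penalty does, which is why ridge shrinks large coefficients strongly while LASSO tends to zero out small ones.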

Elastic Net

Elastic net regularization is a method that combines the L1 and L2 regularization techniques in a

single loss function:

L = (1/n) Σ_{j=1}^{n} (y_j − α̂ − β̂_1 x_{1j} − β̂_2 x_{2j} − … − β̂_k x_{kj})² + λ_1 Σ_{i=1}^{k} β̂_i² + λ_2 Σ_{i=1}^{k} |β̂_i|

or

L = RSS + λ_1 Σ_{i=1}^{k} β̂_i² + λ_2 Σ_{i=1}^{k} |β̂_i|

By adjusting λ_1 and λ_2, which are hyperparameters, it is possible to obtain the advantages of both L1 and L2 regularization. These advantages include decreasing the magnitude of some parameters and eliminating some unimportant ones. This can help to improve the model's performance and the accuracy of its predictions.

Example: Regularization

Table 1: OLS, Ridge and LASSO Regression Estimates


Feature     OLS       Ridge (λ = 0.1)   Ridge (λ = 0.5)   LASSO (λ = 0.01)   LASSO (λ = 0.1)
Intercept   6.27      2.45              2.33              2.40               2.29
1           −20.02    −6.23             −1.90             −1.20              0
2           51.53     9.99              2.32              1.19               0.50
3           −32.45    −2.41             −0.43             0                  0
4           10.01     0.89              0.51              0                  0
5           −5.92     −1.64             −1.22             −1.01              0

OLS regression determines the coefficients of the model by minimizing the sum of the squared

residuals (RSS). Note that it does not incorporate any regularization and can therefore lead to large coefficients and overfitting. On the other hand, ridge regularization adds a penalty term to

RSS. The penalty term is determined as the sum of the squared coefficient values, multiplied by λ, which is regarded as a hyperparameter. The hyperparameter controls the strength of the penalty and can be adjusted to find an optimal balance between the model's fit and its simplicity. Notice that as λ increases, the penalty term becomes more influential, and the coefficient values become smaller.

As discussed earlier, LASSO uses the sum of the absolute values of the coefficients as the penalty term. This leads to some coefficients being reduced to zero, which eliminates unnecessary features from the model. This can be seen in the table above. Similar to ridge regression, the strength of the penalty can be modified by adjusting the value of λ.

Choosing the value of the hyperparameter in a regularized regression model is an important step in

the modeling process, as it can significantly impact the model's performance. One common approach

to selecting the value of the hyperparameter is to use cross-validation, which involves splitting the

data into a training set, a validation set, and a test set. This was discussed in detail in Chapter 14. The training set is used to fit the model and determine the coefficients for different values of λ. The validation set determines how well the model generalizes to new data. The test set is used to evaluate the final performance of the model and provide an unbiased estimate of the model's accuracy.

Decision Trees

A decision tree is a supervised machine-learning technique that can be used to predict either a categorical target variable (producing a classification tree) or a continuous target variable (producing a regression tree). It creates a tree-like decision model based on the input features. At each internal node of the tree, the algorithm asks a question and makes a decision based on the value of one of the features. It then branches an observation to another node or a leaf. A leaf is a terminal node that leads to no further nodes. In other words, the decision tree consists of the initial root node, decision nodes, and terminal nodes.

Classification and Regression Tree (CART) is a decision tree algorithm commonly used for supervised learning tasks, such as classification and regression. One of the main benefits of CART is that it is highly interpretable, meaning it is easy to understand how the model makes predictions. This is because CART models are built using a series of simple decision rules that are easy to understand and follow. For this reason, CART models are often referred to as “white-box models,” in contrast to other techniques like neural networks, which are often referred to as “black-box models.” Neural networks are more challenging to interpret because they are based on complex mathematical equations that are not as easy to understand and follow.

The following is a visual representation of a simple model for predicting whether a company will issue dividends to shareholders based on the company's profits:

When building a decision tree, the goal is to create a model that can accurately predict the value of a target variable based on the importance of other features in the dataset. To do this, the decision tree must decide which features to split on at each node of the tree. The tree is constructed by starting at the root node and recursively partitioning the data into smaller and smaller groups based on the values of the chosen features. We use a measure called information gain to determine which feature to split on at each node.

Information gain measures how much uncertainty or randomness is reduced by obtaining additional

information about the feature. In other words, it measures how much the feature helps us predict

the target variable.

There are two commonly used measures of information gain: entropy and the Gini coefficient. Both of these measures are used to evaluate the purity of a node in the decision tree. The goal is to choose the feature that results in the most significant reduction in entropy or the Gini coefficient, as this will be the most helpful feature in predicting the target variable.

Entropy ranges from 0 to 1, with 0 representing a completely ordered or predictable system and 1

representing a completely random or unpredictable system. It is expressed as:

Entropy = −Σ_{i=1}^{K} p_i log_2(p_i)

Where K is the total number of possible outcomes and p_i is the probability of outcome i. The logarithm used in the formula is typically the base-2 logarithm, also known as the binary logarithm.

The Gini measure is expressed as:

Gini = 1 − Σ_{i=1}^{K} p_i²
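Both impurity measures can be computed as follows (a minimal sketch; the probabilities passed in are assumed to sum to one):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1 - sum(p ** 2 for p in probs)

# A pure node is perfectly predictable; a 50/50 node is maximally impure.
pure = entropy([1.0])          # 0.0
mixed = entropy([0.5, 0.5])    # 1.0
```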

Example: Building a Decision-Tree Model to Classify Credit Card Holders as High Risk or Low Risk

A credit card company is building a decision-tree model to classify credit card holders as high-risk or

low-risk for defaulting on their payments. They have the following data on whether a credit card

holder has defaulted (“Defaulted”) and two features (for the label and the features, in each case,

“yes” = 1 and “no” = 0): whether the credit card holder has a high income and whether they have a

history of late payments:

Defaulted High_income Late_payments
1 1 1
0 0 0
0 0 0
1 1 1
1 0 1
0 0 1
0 1 0
0 1 0

1. Calculate the “base entropy” of the defaulted series.

The base entropy measures the randomness (uncertainty) of the output series before any data is split into separate groups or categories.

Entropy = −Σ(i=1 to K) p_i log2(p_i)

Where:

K = Total number of possible outcomes.

p_i = Probability of outcome i.

The logarithm used in the formula is typically the base-2 logarithm, also known as the binary logarithm.

In this case, three credit card holders defaulted, and five didn't.

Entropy = −((3/8) log2(3/8) + (5/8) log2(5/8)) = 0.954

2. Build a decision tree for this problem

Both features are binary, so there are no issues with determining a threshold as there would be for a continuous series. The first stage is to calculate the entropy that would result if the split were made on each of the two features. Examining the High_income feature first: among high-income credit card owners (feature = 1), two defaulted while two did not, leading to an entropy for this subset of:

Entropy = −((2/4) log2(2/4) + (2/4) log2(2/4)) = 1

Among non-high income credit card owners (feature = 0), one defaulted while three did not, leading

to an entropy of:

Entropy = −((1/4) log2(1/4) + (3/4) log2(3/4)) = 0.811

The weighted entropy for splitting by income level is therefore given by:

Entropy = (4/8) × 1 + (4/8) × 0.811 = 0.906

Information gain = 0.954 − 0.906 = 0.048

We repeat this process by calculating the entropy that would occur if the split was made via the late

payment feature.

Three of the four credit card owners who made late payments (feature = 1) defaulted, while one did not.

Entropy = −((3/4) log2(3/4) + (1/4) log2(1/4)) = 0.811

Among the four credit card owners who did not make late payments (feature = 0), none defaulted, so the entropy of that subset is 0.

The weighted entropy for the late payments feature is therefore:

Entropy = (4/8) × 0.811 + (4/8) × 0 = 0.4055

Information gain = 0.954 − 0.4055 = 0.5485

Notice that the information gain is maximized (equivalently, the weighted entropy is minimized) when the sample is first split by the late payments feature. This feature therefore becomes the root node of the decision tree. For credit card owners who do not make late payments (i.e., feature = 0), there is already a pure split, as none of them defaulted. This is to say that credit card holders who make timely payments do not default, so no further splits are required along this branch. The (incomplete) tree structure is therefore as shown below:
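The split-selection calculation above can be reproduced with a short script. This is an illustrative sketch (the helper names are ours); tiny differences from the rounded figures in the text are due to intermediate rounding:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(-p * math.log2(p) for p in probs)

# (Defaulted, High_income, Late_payments) rows from the example table.
data = [(1, 1, 1), (0, 0, 0), (0, 0, 0), (1, 1, 1),
        (1, 0, 1), (0, 0, 1), (0, 1, 0), (0, 1, 0)]
y = [row[0] for row in data]
base = entropy(y)                     # base entropy ~ 0.954

for name, idx in [("High_income", 1), ("Late_payments", 2)]:
    # Group the labels by the value of the candidate split feature.
    groups = {}
    for row in data:
        groups.setdefault(row[idx], []).append(row[0])
    # Weighted entropy after the split, then the information gain.
    weighted = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    print(name, "info gain =", round(base - weighted, 4))
```

Running this confirms that splitting on Late_payments yields the larger information gain (about 0.549 vs. 0.049 for High_income), so it becomes the root node.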

Ensemble Techniques

Ensemble learning is a machine learning technique in which a group of models, or an ensemble, is used to make predictions rather than relying on the output of a single model. The idea behind ensemble learning is that the individual models in the ensemble may have different error rates and make noisy predictions. Still, by taking the average result of many predictions from various models, the noise can be reduced and the overall forecast made more accurate.

There are two objectives of using an ensemble approach in machine learning. First, ensembles can often achieve better performance than individual models (think of the law of large numbers: as the number of models in the ensemble increases, the overall prediction accuracy tends to improve). Second, ensembles can be more robust and less prone to overfitting, as they are able to average out the errors made by individual models. Some ensemble techniques are discussed below: bootstrap aggregation, random forests, and boosting.

Bootstrap Aggregation

Bootstrap aggregation, or bagging, is a machine-learning technique that involves creating multiple decision trees by sampling from the original training data. The decision trees are then combined to make a final prediction. A basic bagging algorithm for a decision tree would involve the following steps:

1. Sample the training data with replacement to obtain multiple subsets of the training data.

2. Construct a decision tree on each subset of the training data using the usual techniques.

3. Combine the predictions made by each of the decision tree models (e.g., by averaging) to make a final forecast.

Sampling with replacement is a statistical method that involves randomly selecting a sample from a

dataset and returning the selected element back into the dataset before choosing the next element.

This means that an element can be selected multiple times, or it can be left out entirely.

Sampling with replacement allows for the use of out-of-bag (OOB) data for model evaluation. OOB

data are observations that were not selected in a particular sample, and therefore were not used for

model training. These observations can be used to evaluate the model's performance, as they can

provide an estimate of how the model will perform on unseen data.
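A minimal sketch of sampling with replacement and the resulting OOB set, using only the standard library (the sample size and seed below are arbitrary):

```python
import random

random.seed(42)           # fixed seed so the sketch is reproducible
n = 10
indices = list(range(n))

# One bootstrap sample: draw n indices with replacement.
sample = [random.choice(indices) for _ in range(n)]

# Out-of-bag (OOB) observations: indices never drawn into this sample.
oob = sorted(set(indices) - set(sample))

print("bootstrap sample:", sorted(sample))
print("out-of-bag      :", oob)
```

In a full bagging run, this draw would be repeated once per tree, with each tree trained on its own bootstrap sample and evaluated on its own OOB observations. On average, roughly 37% of the observations end up out-of-bag for any given sample.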

Random Forests

A random forest is an ensemble of decision trees. The number of features chosen for each tree is usually approximately equal to the square root of the total number of features. The individual decision trees in a random forest are trained on different subsets of the data and different subsets of the features, which means that each tree may give a slightly different prediction. However, by combining the predictions of all the trees, the random forest can produce a more accurate final prediction. The performance improvements of ensembles are often greatest when the individual

model outputs have low correlations with one another because this helps to improve the

generalization of the model.

Boosting

Boosting is an ensemble learning technique that involves training a series of weak models, where each successive model is trained on the errors or residuals of its predecessor. The goal of boosting is to improve the model's overall performance by combining the weaker models' predictions to reduce bias and variance. Gradient boosting and AdaBoost (Adaptive Boosting) are the most popular methods.

AdaBoost

AdaBoost is a boosting algorithm that trains a series of weak models, where each successive model focuses more on the examples that were difficult for its predecessor to predict correctly. This results in new predictors that concentrate more and more on the hard cases. Specifically, AdaBoost adjusts the weights of the training examples at each iteration based on the previous model's performance, focusing the training on the examples that are most difficult to predict. Here is a more detailed description of the process:

1. The AdaBoost algorithm first trains a base classifier (such as a decision tree) on the training data.

2. The algorithm then uses the trained classifier to make predictions on the training set and calculates the errors or residuals between the predicted labels and the true labels.

3. The algorithm then adjusts the weights of the training examples based on the previous classifier's performance, focusing the training on the examples that were most difficult to predict correctly. Specifically, the weights of the misclassified examples are increased, while the weights of the correctly classified examples are decreased.

4. A second classifier is then trained on the updated weights. The whole process is repeated until a predetermined number of classifiers have been trained, or until the model's performance meets a desired threshold.

The final prediction of the AdaBoost model is calculated by combining the predictions of all of the individual classifiers using a weighted sum, where each classifier's accuracy determines its weight.
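One round of the weight update described in steps 2 and 3 can be sketched as follows. The labels and the weak classifier's predictions below are made up for illustration; the exponential update shown is the standard AdaBoost rule:

```python
import math

# Toy labels and one weak classifier's predictions (coded +1 / -1).
y    = [1, 1, -1, -1, 1]
pred = [1, -1, -1, -1, -1]          # misclassifies the 2nd and 5th examples

w = [1 / len(y)] * len(y)           # start with uniform example weights

# Weighted error of this classifier (sum of weights of misclassified rows).
err = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)
alpha = 0.5 * math.log((1 - err) / err)   # the classifier's vote weight

# Increase weights of misclassified examples, decrease correctly
# classified ones: w_i <- w_i * exp(-alpha * y_i * pred_i).
w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, pred)]
total = sum(w)
w = [wi / total for wi in w]        # renormalize so the weights sum to 1

print("error:", err, "alpha:", round(alpha, 3))
print("updated weights:", [round(wi, 3) for wi in w])
```

After the update, the two misclassified examples carry more weight than the three correct ones, so the next classifier concentrates on them.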

Gradient Boosting

In gradient boosting, a new model is trained on the residuals or errors of the previous model, which are used as the target labels for the current model. This process is repeated until a predetermined number of models have been trained, or until the model's performance meets a desired threshold. In contrast to AdaBoost, which adjusts the weights of the training examples at each iteration based on the performance of the previous classifier, gradient boosting tries to fit the new predictor to the residual errors made by the previous predictor.
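The residual-fitting loop can be illustrated with a deliberately trivial weak learner: each "model" simply predicts the mean of its targets, so the mechanics stay visible. This is a sketch of the idea, not a practical implementation:

```python
# Toy regression targets.
y = [3.0, 5.0, 8.0, 12.0]

prediction = [0.0] * len(y)   # the ensemble's running prediction
learning_rate = 0.5

for step in range(10):
    # Residuals of the current ensemble become the next model's targets.
    residuals = [yi - pi for yi, pi in zip(y, prediction)]
    # "Fit" a constant model to the residuals: just their mean.
    model_output = sum(residuals) / len(residuals)
    # Add a damped amount of the new model's output to the ensemble.
    prediction = [p + learning_rate * model_output for p in prediction]

print([round(p, 3) for p in prediction])   # -> [6.993, 6.993, 6.993, 6.993]
```

Because the weak learner here is a constant, every prediction converges toward the overall mean of y (7.0); with real decision-tree learners, each boosting round instead corrects the residual pattern left by its predecessor.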

K-Nearest Neighbors and Support Vector Machine Methods

K-Nearest Neighbors

K-nearest neighbors (KNN) is a supervised machine learning technique commonly used for classification and regression tasks. The idea is to find similarities or “nearness” between a new observation and its k-nearest neighbors in the existing dataset. To do this, the model uses one of the distance metrics described in the previous chapter (Euclidean distance or Manhattan distance) to calculate the distance between the new observation and each observation in the training set. The k observations with the smallest distances are considered the k-nearest neighbors of the new observation. The class label or value of the new observation is determined based on these neighbors' class labels or values.

KNN is sometimes called a “lazy learner” as it does not learn the relationships between the features

and the target like other approaches do. Instead, it simply stores the training data and makes

predictions based on the similarity between the new observation and its K-nearest neighbors in the

training set.

Here are the basic steps involved in implementing the KNN model:

1. Choose a value for K and a distance metric.

2. Calculate the distance between the new observation and every observation in the training set.

3. Select the K training observations with the smallest distances.

4. Assign the majority class among these neighbors (for classification) or their average value (for regression) to the new observation.

Choosing an appropriate value for K is important, as it can impact the model's ability to generalize to new data and avoid overfitting or underfitting. If K is too large, so that many neighbors are selected, the model will have high bias but low variance, and vice versa for small K. If the value of K is set too small, the resulting model is more complex and more sensitive to individual observations. This may allow the model to fit the training data better, but it may also make the model more prone to overfitting and less able to generalize to new data.

A typical heuristic for selecting K is to set it approximately equal to the square root of the size of

the training sample. For example, if the training sample contains 10,000 points, then K could be set to

100 (the square root of 10,000).
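A bare-bones KNN classifier under Euclidean distance might look like the sketch below (the data points and function name are illustrative):

```python
import math
from collections import Counter

def knn_predict(train, new_point, k):
    """train: list of (features, label) pairs. Classify new_point by
    majority vote of its k nearest neighbors under Euclidean distance."""
    dists = sorted((math.dist(x, new_point), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two well-separated clusters of "low"- and "high"-risk observations.
train = [((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
         ((8, 8), "high"), ((8, 9), "high"), ((9, 8), "high")]

print(knn_predict(train, (2, 2), k=3))   # -> low
print(knn_predict(train, (8, 7), k=3))   # -> high
```

Note the "lazy learner" property described above: there is no training step at all; the function just stores `train` and does all the work at prediction time.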

Support Vector Machines

Support vector machines (SVMs) are supervised machine learning models commonly used for

classification tasks, particularly when there are many features. An SVM works by finding the hyperplane at the center of the widest “path” that separates the two classes, i.e., the hyperplane that maximizes the distance between the two classes, called the margin.

This hyperplane (the solid blue line in the figure below) is constructed by finding the two parallel lines that are furthest apart and that best separate the observations into the two classes. The data points on the edge of this path, or the points closest to the hyperplane, are called support vectors.

Example: Support Vector Machine

Emma White is a portfolio manager at Delta Investments, a firm that manages a diverse range of investment portfolios for its clients. Delta has a portfolio of “investment-grade” stocks, which are relatively low-risk and have a high likelihood of producing steady returns. The portfolio also includes a selection of “non-investment grade” stocks, which are higher-risk and have the potential for higher returns but also come with a greater risk of loss.

White is considering adding a new stock, ABC Inc., to the portfolio. ABC is a medium-sized company in the retail sector but has not yet been rated by any of the major credit rating agencies. To determine whether ABC is suitable for the portfolio, White decides to use machine learning methods to predict the stock's risk level. How can White use the SVM algorithm to explore the implied credit rating of ABC?

Solution

White would first gather data on the features and target of bonds from companies rated as either

investment grade or non-investment grade. She would then use this data to train the SVM algorithm to

identify the optimal hyperplane that separates the two classes. Once the SVM model is trained,

White can use it to predict the rating of ABC Inc's bonds by inputting the features of the bonds into

the model and noting on which side of the margin the data point lies. If the data point lies on the side

of the margin associated with the investment grade class, then the SVM model would predict that

ABC Inc's bonds are likely to be investment grade. If the data point lies on the side of the margin

associated with the non-investment grade class, then the SVM model would predict that ABC Inc's

bonds are likely to be non-investment grade.

Neural Networks

Neural networks (NNs), also known as artificial neural networks (ANNs), are machine learning algorithms capable of learning and adapting to complex nonlinear relationships between input and output data. They can be used for both classification and regression tasks in supervised learning, as well as for reinforcement learning tasks that do not require human-labeled training data. A feed-forward neural network with backpropagation is a type of artificial neural network that updates its weights and biases through an iterative process called backpropagation.

In this neural network, there are three input variables, a single hidden layer comprising three nodes, and a single output variable. The output variable is determined based on the values of the hidden nodes, which are calculated from the input variables. The equations used to determine the values at the hidden nodes are:

H_1 = φ(W_111 X_1 + W_112 X_2 + W_113 X_3 + W_1)

H_2 = φ(W_121 X_1 + W_122 X_2 + W_123 X_3 + W_2)

H_3 = φ(W_131 X_1 + W_132 X_2 + W_133 X_3 + W_3)

φ is known as an activation function, which is a nonlinear function applied to the linear combination of the input feature values to introduce nonlinearity into the model.

The value of y is determined by applying an activation function to a linear combination of the values in the hidden layer.

y = φ(W_211 H_1 + W_221 H_2 + W_231 H_3 + W_4)

Where W_1, W_2, W_3, W_4 are biases.

The other W parameters (the coefficients in the linear functions) are weights. As previously stated, if the activation functions were not included, the model would only be able to output linear combinations of the inputs and hidden layer values, limiting its ability to identify complex nonlinear relationships. This is not desirable, as the main purpose of using a neural network is to identify and model these kinds of relationships.
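A single forward pass through a network of this shape can be sketched as follows, using the logistic function as the activation φ. All weights and biases below are arbitrary illustrative values; note that they total 9 + 3 + 3 + 1 = 16 parameters, matching the count discussed later:

```python
import math

def phi(z):
    """Logistic (sigmoid) activation function."""
    return 1 / (1 + math.exp(-z))

# Hypothetical parameters: 3 inputs -> 3 hidden nodes -> 1 output.
W_hidden = [[0.2, -0.1, 0.4],    # row k holds the weights into H_k
            [0.7, 0.3, -0.5],
            [-0.2, 0.6, 0.1]]
b_hidden = [0.1, -0.2, 0.05]     # one bias per hidden node
W_out = [0.5, -0.3, 0.8]         # weights from hidden layer to output
b_out = 0.2                      # output bias

x = [1.0, 0.5, -1.5]             # one input observation

# H_k = phi(sum_j W_kj * x_j + b_k)
H = [phi(sum(w * xi for w, xi in zip(row, x)) + b)
     for row, b in zip(W_hidden, b_hidden)]

# y = phi(sum_k W_k * H_k + b)
y = phi(sum(w * h for w, h in zip(W_out, H)) + b_out)
print(round(y, 4))
```

Because the logistic activation squashes its input into (0, 1), the output y can be read as a probability in a binary classification setting.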

The parameters of a neural network are chosen based on the training data, similar to how the parameters are chosen in linear or logistic regression. To predict the value of a continuous variable, we can select the parameters that minimize the mean squared error. For classification tasks, we can use a maximum likelihood criterion to choose the parameters.

There are no exact formulas for finding the optimal values of the parameters in a neural network. Instead, a gradient descent algorithm is used to find values that minimize the error for the training set. This involves starting with initial values for the parameters and iteratively adjusting them in the direction that reduces the error of the objective function. This process is similar to stepping down a valley, with each step following the steepest descent.

The learning rate is a hyperparameter that determines the size of the step taken during the gradient descent algorithm. If the learning rate is too small, it will take longer to reach the optimal parameters, but if it is too large, the algorithm may oscillate from one side of the valley to the other instead of accurately finding the optimal values. A hyperparameter is a value set before the model training process begins and is used to control the model's behavior. It is not a parameter of the model itself but rather a value used to determine how the model will be trained and how it will function.
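The effect of the learning rate can be seen on a one-dimensional toy objective, f(w) = (w − 3)^2, whose gradient is 2(w − 3). This is an illustrative sketch, not part of the curriculum:

```python
# One gradient descent step for f(w) = (w - 3)^2; gradient = 2 * (w - 3).
def step(w, learning_rate):
    return w - learning_rate * 2 * (w - 3)

# A moderate learning rate walks steadily down toward the minimum at w = 3.
w = 0.0
for _ in range(50):
    w = step(w, learning_rate=0.1)
print(round(w, 4))   # converges toward the minimum at w = 3

# Too large a learning rate overshoots the minimum and oscillates from
# one side of the "valley" to the other with growing amplitude.
w_bad = 0.0
for _ in range(5):
    w_bad = step(w_bad, learning_rate=1.1)
print(round(w_bad, 2))   # already far past the minimum
```

With the 0.1 rate the distance to the minimum shrinks by 20% per step; with the 1.1 rate it grows by 20% per step, which is exactly the oscillating divergence the paragraph above warns about.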

In the example given earlier, the neural network had 16 parameters (i.e., the total of the weights and the biases). The presence of many hidden layers and nodes in a neural network can lead to too many parameters and the risk of overfitting. To prevent overfitting, calculations are performed on a validation data set while training the model on the training data set. As the gradient descent algorithm progresses through the multi-dimensional valley, the objective function will improve for both data sets.

However, at a certain point, further steps down the valley will begin to degrade the model's performance on the validation data set while continuing to improve it on the training data set. This indicates that the model is starting to overfit, so the algorithm should be stopped to prevent this from happening.

Predictive Performance of Logistic Regression Models vs. Neural Network Models using a Confusion Matrix

A confusion matrix is a tool used to evaluate the performance of a binary classification model, where

the output variable is a binary categorical variable with two possible values (such as “default” or “not

default”). It is a 2×2 table that shows the possible outcomes, and whether the predicted outcome

was correct. A confusion matrix is organized as follows:

                  Predicted positive   Predicted negative
Actual positive   TP                   FN
Actual negative   FP                   TN

The four elements of the table are:

i. True positive (TP) refers to the number of times the model correctly predicted that a borrower would default on their loan.

ii. False negative (FN) refers to the number of times the model incorrectly predicted that a borrower would not default when, in fact, they did.

iii. False positive (FP) refers to the number of times the model incorrectly predicted that a borrower would default when, in fact, they did not.

iv. True negative (TN) refers to the number of times the model correctly predicted that a borrower would not default on their loan.

The most common performance metrics based on a confusion matrix are:

i. Accuracy: This is the model's overall accuracy, calculated as the number of correct predictions divided by the total number of predictions. In the case of a binary classification problem, the accuracy is calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

ii. Precision: This is the proportion of positive predictions that are correct, calculated as:

Precision = TP / (TP + FP)

iii. Recall: This is the proportion of actual positive cases that were correctly predicted, calculated as:

Recall = TP / (TP + FN)

iv. Error rate: This is the proportion of incorrect predictions made by the model, calculated as follows:

Error rate = 1 − Accuracy
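These metrics can be wrapped in a small helper function (the name `metrics` is ours). Plugging in the logistic-regression training-sample counts from the example that follows reproduces the figures quoted there:

```python
def metrics(tp, fn, fp, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Logistic regression training sample from the example below.
acc, prec, rec = metrics(tp=100, fn=300, fp=50, tn=1150)
print(round(acc, 3), round(prec, 3), round(rec, 3))   # -> 0.781 0.667 0.25
```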

Example: Confusion Matrix

Suppose we have a dataset of 1600 borrowers, 400 of whom defaulted on their loans and 1200 of

whom did not. We can use logistic regression or a neural network to create a prediction model that

predicts the likelihood that a borrower will default on their loan. We can set a threshold value to

convert the predicted probabilities into binary values of 0 or 1.

Assume that a neural network with one hidden layer and backpropagation is used to model the data. The hidden layer has 5 units, and the activation function used is the logistic function. The loss function used in the optimization process is based on an entropy measure. Note that a loss function is used to evaluate how well a model performs on a given task. The optimization process aims to find the set of model parameters that minimizes the loss function. Suppose that the optimization process takes 150 iterations to converge, which means it takes 150 steps to find the set of model parameters that minimizes the loss function.

In the context of machine learning, the effectiveness of a model specification is evaluated based on

its performance in classifying a validation sample. For simplicity, a threshold of 0.5 is used to

determine the predicted class label based on the model's output probability. If the probability of a

default predicted by the model is greater than or equal to 0.5, the predicted class label is “default.” If

the probability is less than 0.5, the predicted class label is “no default.”

Adjusting the threshold can affect the true positive and false positive rates in different ways. For

example, if the threshold is set too low, the model may have a high true positive rate and a high false

positive rate because the model is classifying more observations as positive. On the other hand, if

the threshold is set too high, the model may have a low true positive rate and a low false positive rate

because the model is classifying fewer observations as positive. This trade-off between true positive

and false positive rates is similar to the trade-off between type I and type II errors in hypothesis

testing. In hypothesis testing, a type I error occurs when the null hypothesis is rejected when it is

actually true. In contrast, a type II error occurs when the null hypothesis is not rejected when it is

actually false.

Hypothetical confusion matrices for the logistic and neural network models are presented for both

the training and validation samples.

Logistic Regression Training Sample

                     Predicted: Default   Predicted: No Default
Actual: Default      TP = 100             FN = 300
Actual: No default   FP = 50              TN = 1150

Logistic Regression Validation Sample

                     Predicted: Default   Predicted: No Default
Actual: Default      TP = 100             FN = 175
Actual: No default   FP = 56              TN = 337

Neural Network Training Sample

                     Predicted: Default   Predicted: No Default
Actual: Default      TP = 94              FN = 306
Actual: No default   FP = 106             TN = 1094

Neural Network Validation Sample

                     Predicted: Default   Predicted: No Default
Actual: Default      TP = 93              FN = 182
Actual: No default   FP = 51              TN = 342

The values in the confusion matrix can be used to calculate various evaluation metrics:

                      Training sample          Validation sample
Performance metric    Logistic    Neural       Logistic    Neural
                      regression  network      regression  network
Accuracy              0.781       0.743        0.654       0.651
Precision             0.667       0.470        0.641       0.646
Recall                0.250       0.235        0.364       0.338

Both models perform better on the training data than on the validation data, indicating that they are overfitting. To improve performance, it may be beneficial to remove some of the features with limited empirical relevance or to apply regularization. These steps may help reduce overfitting and improve the models' ability to generalize to new data.

There is not much difference in the performance of the logistic regression and neural network approaches. On the training data, the logistic regression model has both a slightly higher true positive rate (0.250 vs. 0.235) and a higher true negative rate (1150/1200 ≈ 0.958 vs. 1094/1200 ≈ 0.912) than the neural network. On the validation data, the logistic regression model has a higher true positive rate (0.364 vs. 0.338), while the neural network has a marginally higher true negative rate (342/393 ≈ 0.870 vs. 337/393 ≈ 0.857).

The Receiver Operating Characteristic Curve

The receiver operating characteristic (ROC) curve is a graphical representation of the trade-off between the true positive rate and the false positive rate, as illustrated in the figure below. It is calculated by varying the threshold value, or decision boundary, used to classify predictions as positive or negative, and plotting the true positive rate and the false positive rate at each threshold.

A higher area under the receiver operating characteristic curve (area under the curve, or AUC) indicates better performance, with a perfect model having an AUC of 1. An AUC value of 0.5 corresponds to the dashed line in the figure above and indicates that the model is no better than random guessing. In contrast, an AUC value of less than 0.5 indicates that the model performs worse than random guessing; its predictions are inversely related to the actual outcomes.
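The construction of ROC points can be sketched with a handful of hypothetical scored observations; sweeping the threshold moves the model along the TPR/FPR trade-off:

```python
# Each pair is (model's probability of default, actual label: 1 = default).
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0),
          (0.3, 1), (0.2, 0), (0.1, 0)]
positives = sum(label for _, label in scored)
negatives = len(scored) - positives

# Each threshold yields one (FPR, TPR) point on the ROC curve.
for threshold in (0.25, 0.5, 0.75):
    tp = sum(1 for p, label in scored if p >= threshold and label == 1)
    fp = sum(1 for p, label in scored if p >= threshold and label == 0)
    print(f"threshold {threshold}: TPR={tp / positives:.2f}, "
          f"FPR={fp / negatives:.2f}")
```

Lowering the threshold raises both rates together, and raising it lowers both, which is exactly the trade-off the ROC curve traces out.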

Practice Question

Consider the following confusion matrices.

Model A

                     Predicted: No Default   Predicted: Default
Actual: No Default   TN = 100                FP = 50
Actual: Default      FN = 50                 TP = 900

Model B

                     Predicted: No Default   Predicted: Default
Actual: No Default   TN = 120                FP = 80
Actual: Default      FN = 30                 TP = 870

The model that is most likely to have a higher accuracy and higher precision, respectively, is:

A. Higher accuracy: Model A, Higher precision: Model B.

B. Higher accuracy: Model B, Higher precision: Model A.

C. Higher accuracy: Model A, Higher precision: Model A.

D. Higher accuracy: Model B, Higher precision: Model B.

Solution

The correct answer is C.

Model accuracy is calculated as (TP + TN) / (TP + TN + FP + FN).

Model A accuracy = (900 + 100) / (900 + 100 + 50 + 50) = 0.909

Model B accuracy = (870 + 120) / (870 + 120 + 80 + 30) = 0.900

Model A has a slightly higher accuracy than model B.

Model precision is calculated as follows:

Precision = TP / (TP + FP)

Model A precision = 900 / (900 + 50) = 0.9474

Model B precision = 870 / (870 + 80) = 0.9158

Model A has a higher precision relative to B.
