Week 6 Notes

Uploaded by Rama Bhushan

Introduction

• Limited dependent variable modeling: background and motivation
• OLS approach: linear probability models (LPMs)
• Issues with LPM models
• Introduction to logit/probit models
• Understanding the logit function
Introduction

• Thresholding
• Confusion/classification Matrix
• Receiver operating characteristic (ROC) curve
• Parameter interpretation
• Summary and concluding remarks
Background and Motivation
Limited Dependent Variable/Qualitative
Response Regression
Discrete choice variables, limited dependent variables, or qualitative response
variables are not suitable for modeling through linear regression models
Consider the following questions
• Why do firms choose to list their stocks on NSE vs. BSE?
• Why do some stocks pay dividends and others do not?
• What factors affect large corporate borrowers to default?
• What factors affect choices of internal vs. external financing?
Limited Dependent Variable/Qualitative
Response Regression
Credit default scoring (classification problem)
Linear Probability Model (LPM)
Linear Probability Model (LPM)

• In such models, the dependent variable is a Yes/No or 1/0 kind of variable
• First, we will examine a simple linear regression approach to deal with such models: the linear probability model (LPM)
• This is the simplest approach to dealing with binary dependent variables
• It is based on the assumption that the probability of an event (Pi) is linearly related to a set of explanatory variables x2i, x3i, …, xki
• Pi = P(yi = 1) = β1 + β2·x2i + β3·x3i + ⋯ + βk·xki + ui,  i = 1, …, N
Linear Probability Model (LPM)

In such models, the actual probabilities cannot be observed; the observed values of the dependent variable are 0s and 1s
• Consider the relationship between the size of a company "i" and its ability to pay dividends
Yi = β1 + β2·Xi + ui
where Xi = market capitalization of the firm, and Yi = 1 if the dividend is paid and 0 if the dividend is not paid.
Linear Probability Model (LPM)

In such models, the actual probabilities cannot be observed; the observed values of the dependent variable are 0s and 1s
• This is called the linear probability model. The conditional expectation of Yi given Xi, i.e., E(Yi | Xi), can be interpreted as the probability that the event will occur given Xi: that is, P(Yi = 1 | Xi)
• E(Yi | Xi) = β1 + β2·Xi (assuming E(ui) = 0)
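As a sketch of how an LPM is fitted, the dividend example above can be estimated by OLS on the 0/1 outcome. The data below are hypothetical (market capitalization in $ million), not from the source; the point is that fitted "probabilities" can leave the [0, 1] interval.

```python
# Minimal sketch of a one-regressor linear probability model (LPM),
# estimated by ordinary least squares. Data are hypothetical.
def ols_slope_intercept(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # OLS slope = cov(x, y) / var(x); intercept recovered from the means
    b2 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    b1 = my - b2 * mx
    return b1, b2

# x: market capitalization ($ million, made up); y: 1 if a dividend is paid
x = [10, 20, 40, 60, 80, 100]
y = [0, 0, 0, 1, 1, 1]
b1, b2 = ols_slope_intercept(x, y)
p_hat = [b1 + b2 * xi for xi in x]  # fitted "probabilities" (may fall outside [0, 1])
```

Note that the fitted value for the smallest firm is already negative, which previews the boundedness problem discussed in the next section.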
Summary
Issues with LPM
Issues with LPM

Non-normality and heteroscedasticity of error terms
• Yi has the following distribution:
E(Yi | Xi) = 0×(1 − Pi) + 1×(Pi) = Pi
• This kind of model has a number of econometric issues
• What is the nature of the errors ui = Yi − β1 − β2·Xi?
Issues with LPM

Non-normality and heteroscedasticity of error terms
• ui is not normally distributed, although in large samples this is not a problem
• The ui are heteroscedastic, i.e., their variance Var(ui) = Pi(1 − Pi) varies across observations
Issues with LPM

Nonfulfillment of 0 ≤ E(Yi | X) ≤ 1
• Yi = −0.3 + 0.012·Xi, where Xi is in million dollars
• For every $1 million increase in size, the probability that the firm will pay a dividend increases by 1.2%
• However, for X < $25 million the fitted probability is less than 0, and for X > $108 million it is more than 1
Issues with LPM

Nonfulfillment of 0 ≤ E(Yi | X) ≤ 1
• What to do: set all negative values to 0 and all those greater than 1 to 1?
• It is implausible to suggest that small firms will never pay dividends and large firms will always pay dividends
Issues with LPM

Diminishing utility of R² as a goodness-of-fit measure
• All the Y values lie on one of the lines Y = 0 or Y = 1
• The conventional LPM is not expected to fit such observations well, except in cases where all the observations are scattered closely around points A and B
• Both logit and probit approaches overcome the limitation of the LPM that it produces values less than 0 and more than 1
Introduction to Logit Model
Introduction to Logit Model

The logit (and probit) approaches overcome the limitations of the regression model by transforming it with a function so that fitted values are bounded within the (0, 1) interval
• The fitted function looks like an S-shaped curve
• The logistic function for a random variable zi is:
F(zi) = e^(zi) / (1 + e^(zi)) = 1 / (1 + e^(−zi))
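A quick numeric check (sketch) that the two forms of the logistic function above coincide and stay strictly inside (0, 1):

```python
import math

def logistic_a(z):
    # F(z) = e^z / (1 + e^z)
    return math.exp(z) / (1 + math.exp(z))

def logistic_b(z):
    # F(z) = 1 / (1 + e^-z)
    return 1 / (1 + math.exp(-z))

# identical for any z, and bounded strictly within (0, 1)
checks = [(z, logistic_a(z), logistic_b(z)) for z in (-5.0, -1.0, 0.0, 1.0, 5.0)]
```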
Introduction to Logit Model

The logit (and probit) approaches overcome the limitations of the regression model by transforming it with a function so that fitted values are bounded within the (0, 1) interval
• Here F is the cumulative logistic distribution
• The final logit model:
Pi(yi = 1) = 1 / (1 + e^(−(β1 + β2·x2i + β3·x3i + ⋯ + βk·xki + ui)))
Introduction to Logit Model
Pi(yi = 1) = 1 / (1 + e^(−(β1 + β2·x2i + β3·x3i + ⋯ + βk·xki + ui)))

• The model asymptotically touches 0 (z → −∞) and 1 (z → ∞)
• This model is non-linear, and hence not amenable to OLS estimation
• The model predicts a probability, e.g., the probability of bank loan default (dependent variable = y)
Introduction to Logit Model
Pi(yi = 1) = 1 / (1 + e^(−(β1 + β2·x2i + β3·x3i + ⋯ + βk·xki + ui)))

• Given P(y = 1), then P(y = 0) = 1 − P(y = 1)
• Here the independent variables are x2i, x3i, x4i, x5i, and so on
• This is essentially a non-linear transformation of the model to produce consistent probability results
Understanding the Logit Function
Understanding the Logit Function
Pi(yi = 1) = 1 / (1 + e^(−(β1 + β2·x2i + β3·x3i + ⋯ + βk·xki + ui)))

• Extremely low and negative values of the linear function β1 + β2·x2i + β3·x3i + ⋯ + βk·xki predict no dividend (or non-default cases) with a high probability, i.e., Pi(yi = 0)
Understanding the Logit Function
Pi(yi = 1) = 1 / (1 + e^(−(β1 + β2·x2i + β3·x3i + ⋯ + βk·xki + ui)))

• Extremely high and positive values of the linear function β1 + β2·x2i + β3·x3i + ⋯ + βk·xki predict dividend payment (or default cases) with a high probability, i.e., Pi(yi = 1)
Understanding the Logit Function
Pi(yi = 1) = 1 / (1 + e^(−(β1 + β2·x2i + β3·x3i + ⋯ + βk·xki + ui)))

• This can also be expressed in the form of odds:
• Odds = P(y = 1) / P(y = 0)
• Odds > 1 if y = 1 is more likely
• Odds < 1 if y = 0 is more likely
Understanding the Logit Function
Pi(yi = 1) = 1 / (1 + e^(−(β1 + β2·x2i + β3·x3i + ⋯ + βk·xki + ui)))

• If we substitute the logit function into the odds equation, then
• Odds = exp(β1 + β2·x2i + β3·x3i + ⋯ + βk·xki + ui), or
• ln(Odds) = β1 + β2·x2i + β3·x3i + ⋯ + βk·xki + ui
• The higher this logit (or ln(Odds)) value, the higher the probability Pi(yi = 1)
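A small sketch confirming that ln(Odds) recovers the linear index, as stated above (the value of z is illustrative):

```python
import math

def logit_prob(z):
    # P(y = 1) for a linear index z
    return 1 / (1 + math.exp(-z))

z = 0.4                      # hypothetical value of b1 + b2*x2 + ...
p = logit_prob(z)
odds = p / (1 - p)           # Odds = P(y = 1) / P(y = 0)
log_odds = math.log(odds)    # recovers z exactly
```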
Thresholding
Thresholding

The outcome of the regression model is a probability
• In real life, you would want to make a binary prediction, e.g., default or no default
• For this, we may consider a threshold value "t"
• If P(Default = 1) ≥ t, then predict a default case
• If P(Default = 1) < t, then predict a non-default case
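The thresholding rule can be sketched as follows (the probabilities are illustrative):

```python
def classify(probs, t):
    # predict default (1) when P(default) >= t, otherwise non-default (0)
    return [1 if p >= t else 0 for p in probs]

probs = [0.10, 0.45, 0.55, 0.90]
preds_mid = classify(probs, 0.5)   # [0, 0, 1, 1]
preds_high = classify(probs, 0.8)  # higher t -> fewer predicted defaults: [0, 0, 0, 1]
```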
Thresholding

What value should we select for "t"? What kind of error do you prefer?
• Given a t value, one can make two types of errors: (1) predict default, but the actual outcome is non-default: a false positive; and (2) predict non-default, but the actual outcome is default: a false negative
• A large threshold (e.g., t = 0.8) gives a very small probability of predicting defaulters and, at the same time, a high probability of predicting cases as non-defaulters
Thresholding

What value should we select for "t"? What kind of error do you prefer?
• A small threshold (e.g., t = 0.1) gives a very large probability of predicting defaulters and, at the same time, a small probability of predicting cases as non-defaulters
• An aggressive bank would like to have high t values to increase the possibility of converting a loan
Thresholding

What value should we select for "t"? What kind of error do you prefer?
• A more conservative bank may choose a very low t value to select only those loan applications with a very low probability of default
• In the absence of any specific preference, t = 0.5 is the natural default value
Classification Matrix
Selecting a Threshold:
Confusion/Classification Matrix
Predicted = 0 (Non-Default)  Predicted = 1 (Default)
Actual = 0  True Negatives (TN)  False Positives (FP)
Actual = 1  False Negatives (FN)  True Positives (TP)
Let us compute two outcome measures to determine what kind of errors we are making
• Sensitivity = TP / (TP + FN) = TP rate
• Specificity = TN / (TN + FP) = TN rate
Selecting a Threshold:
Confusion/Classification Matrix
Let us compute two outcome measures to determine what kind of errors we are making
• Sensitivity = TP / (TP + FN) = TP rate
• Specificity = TN / (TN + FP) = TN rate

• A model with higher t will have lower sensitivity and higher specificity
• A model with lower t will have higher sensitivity and lower specificity
Selecting a Threshold:
Confusion/Classification Matrix
• Overall accuracy = (TN + TP) / N, where N = number of observations
• Overall error rate = (FP + FN) / N
• False negative error rate = FN / (TP + FN)
• False positive error rate = FP / (TN + FP)
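The four measures above can be computed directly from a confusion matrix; a sketch with hypothetical actual/predicted vectors:

```python
def confusion_metrics(actual, pred):
    # tabulate the four cells of the confusion/classification matrix
    tp = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 0)
    n = len(actual)
    return {
        "sensitivity": tp / (tp + fn),   # TP rate
        "specificity": tn / (tn + fp),   # TN rate
        "accuracy": (tp + tn) / n,
        "error_rate": (fp + fn) / n,
    }

actual = [1, 1, 0, 0, 1, 0]
pred   = [1, 0, 0, 1, 1, 0]
m = confusion_metrics(actual, pred)
```

Accuracy and the overall error rate always sum to one, since every observation is either classified correctly or not.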
Receiver Operating Characteristic
(ROC) Curve
Receiver Operating Characteristic (ROC) Curve

• True positive (TP) rate on the y-axis, i.e., the proportion of defaults correctly predicted
• False positive (FP) rate on the x-axis, i.e., the proportion of non-defaults incorrectly predicted as default cases
• The curve shows how these two measures vary with different threshold values
Receiver Operating Characteristic (ROC) Curve

• For t = 1, TP rate = 0 and FP rate = 0 → the model will not predict any default cases but will correctly predict all the non-default cases
• For t = 0, TP rate = 1 and FP rate = 1 → the model will correctly predict all the default cases but will incorrectly predict all the non-default cases as defaults
• As we move from t = 1 to t = 0, different combinations of TP and FP rates are obtained
• As we move from t = 1 to t = 0, different
combinations of TP and FP are obtained
Receiver Operating Characteristic (ROC) Curve
• The ROC curve captures the complete threshold behavior
• High threshold: high specificity and low sensitivity
• Low threshold: low specificity and high sensitivity
• Thus, it is a tradeoff between the cost of failing to detect default cases vs. incorrectly classifying non-default cases as defaulters
Receiver Operating Characteristic (ROC) Curve
• A 100% area under the curve (AUC) indicates complete accuracy, i.e., all the observations are correctly identified: TP rate = 1 and FP rate = 0
• A 50% AUC indicates random guessing, that is, TP rate = 0.5 and FP rate = 0.5
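A minimal sketch of how ROC points are traced out by sweeping the threshold. The data are hypothetical, chosen so the classifier separates the two classes perfectly, which makes the curve pass through the ideal point (FP rate 0, TP rate 1):

```python
def roc_points(actual, probs, thresholds):
    # one (FP rate, TP rate) point per threshold value
    pos = sum(actual)
    neg = len(actual) - pos
    pts = []
    for t in thresholds:
        pred = [1 if p >= t else 0 for p in probs]
        tp = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 1)
        fp = sum(1 for a, q in zip(actual, pred) if a == 0 and q == 1)
        pts.append((fp / neg, tp / pos))
    return pts

actual = [0, 0, 1, 1]
probs = [0.1, 0.2, 0.8, 0.9]          # perfectly separated classes
pts = roc_points(actual, probs, [i / 10 for i in range(11)])
```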
Parameter Interpretation
Parameter Interpretation
Parameter Interpretation
Unlike the LPM, it is incorrect to state that a 1-unit increase in x2i will cause a 100·β2 % increase in the probability of yi = 1
• For the logit model, we calculate dPi/dx2i; this works out to β2·F(zi)·(1 − F(zi))
• So, a 1-unit increase in x2i will increase the probability of yi = 1 by β2·F(zi)·(1 − F(zi))
• Usually, these marginal/incremental impacts are evaluated at mean values
Parameter Interpretation
Example: Pi(yi = 1) = 1 / (1 + e^(−(β1 + β2·x2i + β3·x3i + ⋯ + βk·xki + ui)))
• F(zi) = P̂i = 1 / (1 + e^(−(0.1 + 0.3·x2i − 0.6·x3i + 0.9·x4i)))
• β1 = 0.1; β2 = 0.3; β3 = −0.6; β4 = 0.9
• What is F(zi), given x̄2 = 1.6, x̄3 = 0.20, and x̄4 = 0.10?
• Marginal effect of x2i = β2·F(zi)·(1 − F(zi))
Parameter Interpretation
Example: F(zi) = P̂i = 1 / (1 + e^(−(0.1 + 0.3·x̄2 − 0.6·x̄3 + 0.9·x̄4))) = 1 / (1 + e^(−0.55)) = 0.63

• Thus, a 1-unit increase in x2i will increase the probability of yi = 1 by 0.3 × 0.63 × (1 − 0.63) ≈ 0.07
• Similarly, for x3i the effect is −0.6 × 0.63 × (1 − 0.63), and for x4i it is 0.9 × 0.63 × (1 − 0.63)
• These are also called marginal effects
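The arithmetic in this example can be verified directly (coefficients and mean values as given on the slide):

```python
import math

# the slide's example: evaluate z at the mean values of the regressors
b1, b2, b3, b4 = 0.1, 0.3, -0.6, 0.9
x2, x3, x4 = 1.6, 0.20, 0.10

z = b1 + b2 * x2 + b3 * x3 + b4 * x4   # = 0.55
F = 1 / (1 + math.exp(-z))             # ~ 0.63
me_x2 = b2 * F * (1 - F)               # ~ 0.07
me_x3 = b3 * F * (1 - F)               # negative: x3 lowers P(y = 1)
```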
Probit Model
Maximum Likelihood Estimation (MLE)
Goodness-of-Fit Measures
Probit Model

• The probit model uses the cumulative normal distribution:
F(zi) = (1/√(2π)) ∫ from −∞ to zi of e^(−z²/2) dz
• The model asymptotically touches 0 (z → −∞) and 1 (z → ∞)
• The marginal impact of a unit change in an explanatory variable x2i is given as β2·f(zi), where f is the standard normal density, β2 is the parameter attached to x2i, and zi = β1 + β2·x2i + β3·x3i + ⋯ + βk·xki + ui
• Both logit and probit models give similar results; differences may occur when the data are extremely imbalanced
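The probit CDF can be written with the error function; a quick sketch comparing it with the logit CDF shows both are S-shaped, bounded in (0, 1), and pass through 0.5 at z = 0:

```python
import math

def logit_cdf(z):
    return 1 / (1 + math.exp(-z))

def probit_cdf(z):
    # standard normal CDF, expressed via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

vals = [(probit_cdf(z), logit_cdf(z)) for z in (-3.0, 0.0, 3.0)]
```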
Maximum Likelihood Estimation (MLE) of
Logit/Probit Models
These are non-linear models and hence cannot be estimated with the simple OLS method
• They are estimated with MLE
• In MLE, parameters are chosen to maximize a log-likelihood function
• The log-likelihood function yields the parameter estimates that maximize the joint probability of the observed sample
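The MLE idea can be sketched with a toy logit log-likelihood and a crude grid search over a single slope (data and grid are hypothetical; real software maximizes over all parameters with numerical optimizers):

```python
import math

# Log-likelihood of a logit model: sum_i [ y_i*ln(p_i) + (1 - y_i)*ln(1 - p_i) ].
# MLE chooses the betas that maximize it; here we grid-search b2 with b1 = 0.
def log_likelihood(b1, b2, x, y):
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + math.exp(-(b1 + b2 * xi)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ll

x = [-2, -1, 0, 1, 2]
y = [0, 1, 0, 1, 1]                     # toy binary outcomes
grid = [b / 10 for b in range(1, 51)]   # candidate b2 values: 0.1 ... 5.0
best_b2 = max(grid, key=lambda b2: log_likelihood(0.0, b2, x, y))
```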
Goodness-of-Fit Measures

Conventional R² and adjusted-R² measures do not work well with these models
MLE aims to maximize the log-likelihood function (LLF); it does not minimize the RSS
Alternative measures include:
(1) % of yi values correctly predicted
(2) % of yi = 1 values correctly predicted + % of yi = 0 values correctly predicted
Goodness-of-Fit Measures

Conventional R² and adjusted-R² measures do not work well
(3) Pseudo-R² = 1 − LLF/LLF0, where LLF is the maximized value of the log-likelihood function for the logit or probit model, and LLF0 is the value of the log-likelihood function for a restricted model
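A one-line sketch of the pseudo-R² formula above, with hypothetical log-likelihood values:

```python
# Pseudo-R^2 = 1 - LLF / LLF0. Both log-likelihoods are negative, with
# LLF >= LLF0, so the measure lies between 0 and 1. Values are hypothetical.
def pseudo_r2(llf, llf0):
    return 1 - llf / llf0

r2 = pseudo_r2(-40.0, -65.0)   # fitted LLF = -40, restricted LLF0 = -65
```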
Summary and Concluding Remarks
Summary and Concluding Remarks

• Among supervised learning algorithms, classification is a very important tool in the finance domain, employed for applications such as credit scoring of loan applications
• Classification algorithms are very often implemented through the logit/probit class of models; these are simple yet powerful models
• These models address a number of shortcomings of linear probability models: (a) non-normality and heteroscedasticity of error terms; (b) fitted values of the dependent variable (probability) falling outside the 0–1 range; and (c) diminishing utility of conventional goodness-of-fit measures (e.g., R²)
Summary and Concluding Remarks

• Limited dependent variable models (e.g., the logit model) employ cumulative probability functions (e.g., the logistic function)
• These models, although non-linear, are very useful for modeling limited dependent variables that are probabilistic in nature
• In the case of the logit model, the logit function is essentially the log of the odds ratio
• Since the estimated variable is in the form of probabilities, a thresholding process is needed to convert these probabilities into limited outcomes (e.g., Yes/No)
Summary and Concluding Remarks

• The conventional measures of goodness-of-fit (e.g., R²) are not very useful for such models
• Instead, these models are evaluated on their ability to classify observations correctly
• For this purpose, a confusion/classification matrix is often employed
• The receiver operating characteristic (ROC) curve provides another useful tool to examine the efficiency of these models, and also facilitates the selection of threshold values
Summary and Concluding Remarks

• Unlike simple linear models, the parameter estimates are interpreted in a different manner
• Marginal effects are computed to interpret the coefficients and their relationship with the dependent variable
• Other models (e.g., the probit model) are identical in all other respects, except that a different cumulative probability function is used (the normal distribution in the case of probit)
• Since the model is non-linear in nature, OLS cannot be employed for estimation; the maximum likelihood method is typically employed to estimate these models
Thanks!
Introduction

• Application of classification algorithms to the prediction of security prices
• Revisiting the ABC case study
• Logit/Probit modeling
• Training the model and testing the model
• Model performance evaluation
• Summary and concluding remarks
Case Study: ABC Stock Price
Forecasting
Case Study: Stock Price Prediction

• Stock price prediction or stock return prediction is an attempt to determine the future value of a company based on an analysis of factors that impact its price movement
• There are a number of factors that help in predicting stock prices
• These can be macroeconomic factors like the state of the country's economy, growth rate, inflation, etc.
• There are also factors that are more specific to a stock, like profit margin, debt-to-equity ratio, sales of a company, etc.
Case Study: Stock Price Prediction

We are given stock market price data for ABC company, along with the Nifty and Sensex (market indices). We are also given data on dividend announcements and a sentiment index.

Date        Price    ABC       Sensex    Dividend Announced   Sentiment   Nifty
03-01-2007 718.15 0.079925 0.073772 0 0.048936 0.095816
04-01-2007 712.9 –0.00731 0.021562 0 –0.05504 0.009706
05-01-2007 730 0.023987 –0.02441 0 0.019135 –0.03221
06-01-2007 788.35 0.079932 0.012046 0 0.080355 0.011205
07-01-2007 851.4 0.079977 –0.0013 0 0.094038 –0.0004
10-01-2007 919.5 0.079986 0.019191 1 0.015229 0.030168
11-01-2007 880 –0.04296 –0.04025 0 –0.07217 –0.04966
12-01-2007 893.75 0.015625 0.036799 0 0.01396 0.020999
13-01-2007 875 –0.02098 –0.00845 0 0.057518 –0.01164
14-01-2007 891 0.018286 0.004858 1 0.008828 0.020714
17-01-2007 819.75 –0.07997 –0.01228 0 –0.12395 –0.00962
…… …… …… …… …… …… ……
…… …… …… …… …… …… ……
Case Study: Stock Price Prediction

• Consider a portfolio manager who has built a model for a particular stock
• The manager wants to predict whether the ABC stock price returns will go up or down in the next period
• The data start from 2007 and go till 2019, so we have approximately 13 years of data
• We have the daily returns of ABC, i.e., the change in the price of ABC, in column B. Next, we have the daily return on Sensex in column C and the daily return on Nifty in column D.
Case Study: Stock Price Prediction

• Sensex and Nifty are the two main stock indices used in India
• They are benchmark Indian stock market indices that represent the weighted average of the largest Indian companies
• Sensex represents a weighted average of the 30 largest and most actively traded Indian companies
• Similarly, Nifty represents a weighted average of the 50 largest Indian companies
Summary

The following tasks need to be performed
• Create a dummy variable that is 1 when stock prices go up and 0 when stock prices go down
• Segregate the data into test and train datasets
• Train and build the model using simple logit/probit classification algorithms, using the market index as the independent variable and the up/down dummy as the dependent variable
Summary

The following tasks need to be performed
• Evaluate the in-sample performance and out-of-sample performance of the model
• Compute the marginal effects of the independent variable
• Visualize the performance of these models using the ROC curve
• Examine the classification accuracy of the model and compare it with a similar linear probability model

Data Input and Exploration
Data Input and Exploration

• In this video, we will start with the implementation of the classification algorithms using the ABC case study data
• First, we will set the working directory; then we will read the data
• Lastly, we will create the binary response variable: '1' for positive returns and '0' for negative returns
Summary

• We started our analysis by setting the working directory
• Next, we loaded the relevant package libraries
• Then we read the data from the working directory
• Lastly, we created a new 'updown' binary response variable, which is '1' when returns are positive and '0' when returns are negative
Creation of Test and Train Datasets
Creation of Test and Train Datasets

• In this video, we will create the test and train sample datasets
• Then we will examine the distribution of our binary response
variable in 1’s and 0’s
Summary

• First, we filtered the observations after 2006 and cleaned our data
• Next, we randomly selected 80% of the observations as the training dataset and the remaining 20% as the test dataset
• Lastly, we examined the proportion of 1's and 0's in the parent dataset, test dataset, and train dataset
• The distribution of 1's and 0's is fairly similar across all three datasets
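The 80/20 split and the proportion check described above can be sketched as follows. The returns are synthetic stand-ins generated for illustration; the lecture's own dataset is not reproduced here.

```python
import random

# synthetic daily returns standing in for the ABC series
random.seed(42)
returns = [random.uniform(-0.05, 0.05) for _ in range(1000)]
updown = [1 if r > 0 else 0 for r in returns]   # the binary response

# random 80/20 split into train and test indices
idx = list(range(len(updown)))
random.shuffle(idx)
cut = int(0.8 * len(idx))
train_idx, test_idx = idx[:cut], idx[cut:]

def prop_ones(ids):
    # share of 1's (up days) among the given observation indices
    return sum(updown[i] for i in ids) / len(ids)

# proportions of 1's should be similar across parent, train, and test sets
p_all, p_train, p_test = prop_ones(idx), prop_ones(train_idx), prop_ones(test_idx)
```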
Training the Linear Probability Model
(LPM) Algorithm
Training the LPM Algorithm

• In this video, we will train an LPM algorithm with the training dataset
• Next, we will compute the classification/confusion matrix
• Finally, using the classification/confusion matrix, we will compute various performance measures, i.e., accuracy, specificity, and sensitivity
Summary

• We trained an LPM algorithm using the training dataset
• We converted the fitted values into 1's and 0's using threshold values of 0.4, 0.6, and 0.8
• Lastly, using the classification/confusion matrix, we computed three performance parameters, namely accuracy, specificity, and sensitivity
Training the Logit/Probit Algorithms
Training the Logit/Probit Algorithms

• In this video, we will train the logit/probit classification algorithms using the training dataset
• Next, we will compute the in-sample performance evaluation measures
• We will also compute the marginal effects of the independent variable on the dependent variable
• Lastly, we will evaluate and compare the performance of these algorithms on the parameters of accuracy, specificity, and sensitivity
Summary

• We trained our classification algorithms using the training dataset
• Next, we computed the pseudo-R² measure and also computed the marginal effects
• Lastly, we evaluated the performance of these algorithms on the three parameters of sensitivity, specificity, and accuracy, using the classification matrix at threshold values of 0.4, 0.6, and 0.8
• The performances of all the algorithms appear to be close to each other; this is ascribed to the fairly symmetric distribution of 1's and 0's in the training dataset
Visualizing the Performance
Visualizing the Performance

• In this video, we will compare the performance of the three trained classification algorithms (linear, logit, and probit objects) using a correlation measure and through visualization
Summary

• We computed the correlations across the fitted values of the three classification algorithms (linear, logit, and probit)
• The correlations appear to be very high
• Next, we visualized the performance of the algorithms on the parameters of accuracy, sensitivity, and specificity for the three threshold values of 0.4, 0.6, and 0.8
• While the performances of these algorithms appear to be close, the logit model appears to offer the best fit, followed by the probit and then the linear model
Receiver Operating Characteristic
(ROC) Curve
ROC Curve

• In this video, we will compare the performance of the three trained classification algorithms (linear, logit, and probit objects) with the help of the ROC curve
Summary

• We plotted the ROC curves and examined the performance of the three trained classification algorithms
• The area under the curve (AUC) appears to be nearly identical for all three algorithms; this is ascribed to the extremely high correlation across the fitted objects of these models
Defining the Objective Performance
Function
Defining the Objective Performance
Function
• In this video, we will develop a simple machine learning system that will help the computer learn how to select the best classification algorithm across a class of algorithms
• We will create a suitable user-defined performance function to analyze the performance of these algorithms
Summary

• We created an optimization function, which takes as arguments the fitted values, actual values, and simulated threshold values
• These values are employed to compute the accuracy, sensitivity, and specificity parameters through the classification matrix
• The final performance object is a simple average of these three parameters (i.e., accuracy, sensitivity, and specificity)
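The performance objective described above, a simple average of accuracy, sensitivity, and specificity at a given threshold t, can be sketched as (fitted and actual values are illustrative):

```python
def performance(fitted, actual, t):
    # classify at threshold t, then average accuracy, sensitivity, specificity
    pred = [1 if p >= t else 0 for p in fitted]
    tp = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 1)
    tn = sum(1 for a, q in zip(actual, pred) if a == 0 and q == 0)
    fn = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 0)
    fp = sum(1 for a, q in zip(actual, pred) if a == 0 and q == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / len(actual)
    return (acc + sens + spec) / 3

actual = [1, 0, 1, 0]
fitted = [0.9, 0.2, 0.6, 0.4]
score_mid = performance(fitted, actual, 0.5)    # perfect classification here -> 1.0
score_high = performance(fitted, actual, 0.95)  # everything predicted 0 -> lower score
```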


Creating Performance Objects
Creating Performance Objects

• In the previous video, we defined our performance objective function; in this video, we will simulate 1000 threshold values and calculate the performance object values for all three classification algorithms using these threshold values
Summary

• We created three performance objects for the three classification algorithms, namely logit, probit, and linear
• We simulated 1000 performance object values using our performance objective function for all three algorithms (linear, logit, and probit)
In-sample Performance Evaluation
In-sample Performance Evaluation

• In the previous video, we computed 1000 performance object


values for the three classification algorithms
• In this video we will compare the performance of these three
classification algorithms through visualization
Summary

• We plotted 1000 performance object values for our three classification algorithms, namely linear, logit, and probit
• We found that for most of the threshold values, the logit model works best, closely followed by the probit model, and lastly the linear model
• Lastly, we extracted the best-fit model and the corresponding threshold value
Out-of-Sample Prediction
Out-of-Sample Prediction

• In this video, we will start with out-of-sample prediction
• We will use the trained algorithms for our linear, logit, and probit models to predict on the test dataset
• Lastly, we will compute the correlations across the predicted values of the three algorithms
Summary

• We performed prediction on the test data using our trained algorithms for the linear, logit, and probit models
• We found that the correlations across the predicted values are very high; in fact, the correlation between the logit and probit predicted values is 99%, and their correlations with the linear model predicted values are more than 90%
• This is ascribed to the fact that the correlations across the fitted objects are very high, and the distribution of 1's and 0's is highly symmetric in our test and training datasets
Out-of-Sample Prediction: ROC Curve
Out-of-Sample Prediction: ROC Curve

• In the previous video, we performed prediction with the trained algorithms using the test dataset
• In this video, we will visualize and compare the performance of the three trained algorithms using the ROC curve, and also compute the area under the ROC curve
Summary

• We plotted ROC curves for all three classification algorithms: the linear, logit, and probit models
• The performances as per the ROC curves are quite similar, with nearly identical areas under the curve (AUC)
• This is ascribed to the high correlation across fitted objects and the symmetric nature of 1's and 0's in our test and training datasets
• In the next video, we will simulate 1000 threshold values and compute the performance object values
Out-of-Sample Prediction:
Performance object
Out-of-Sample Prediction: Performance
object
• We have already set up a performance object, which is the average of three parameters: accuracy, sensitivity, and specificity
• Using our predicted values for all three algorithms, we will compute the performance object values for the 1000 simulated threshold values
Summary

• In this video, we computed the values of our performance object using 1000 simulated threshold values for all three algorithms, i.e., linear, logit, and probit
• In the next video, using these values of the performance object, we will visualize and compare the out-of-sample performance of the three algorithms
Out-of-Sample Prediction:
Performance Evaluation and
Visualization
Out-of-Sample Prediction: Performance
Evaluation and Visualization
• In the previous video, we simulated 1000 performance object values using our trained algorithms with the test data
• In this video, using these performance object values, we will visualize and compare the performance of the three trained algorithms
Summary

• To summarize, we plotted our simulated performance object values
• For most of the threshold region, the logit model offers the best prediction, closely followed by the probit and linear models
• We also extracted the details corresponding to the best performance object value, including its threshold level
Summary and Concluding Remarks
Summary and Concluding Remarks

• ABC stock price up/down movements are modelled using logit/probit classification algorithms
• The model is trained using the training dataset and is examined on various measures of model performance evaluation
• The fitted model is examined visually as well
Summary and Concluding Remarks

• The model is tested using the test dataset, and various measures of out-of-sample fit are examined
• Marginal effects of the independent variables are computed
• The performance of this model is compared with a similar linear probability model
Thanks!
