Lecture 22. GLM

This document provides an overview of binary random variables and logistic regression, emphasizing its application in predicting binary outcomes. It explains the logistic function, model equations, and the relationship between probabilities and odds, along with methods for parameter estimation and model evaluation. Real-world applications include spam detection, medical diagnosis, and loan default prediction.

Generalized Linear Model

LECTURE 22
BINARY VARIABLES AND LOGISTIC REGRESSION: FROM BASICS TO APPLICATIONS
What is a binary random variable?

 A binary random variable is a type of random variable that can take on only two possible values, typically represented as 0 and 1.
 These values often correspond to two distinct outcomes in a scenario, such as success/failure, yes/no, or true/false.
Introduction to Logistic Regression

 Definition: Logistic regression is a statistical model used to predict the probability of a binary outcome (e.g., success/failure).
 Key Idea: Unlike linear regression, logistic regression is suitable for categorical dependent variables.
 Objective: To understand how logistic regression works, its equations, parameters, and applications.
How to model binary outcome variables?

 In the basic form of logistic regression, dichotomous variables (0 or 1) can be predicted, and the probability of the occurrence of the value 1 (success/characteristic present) is estimated.
Real-world Examples:

 Spam detection: Classify emails as spam or not spam.
 Medical diagnosis: Predict the presence of a disease (e.g., diabetes: yes/no).
 Loan default prediction: Will a customer default on a loan?
Simple Logistic Regression

 Simple Logistic Regression is a type of logistic regression model used to predict the probability of a binary outcome based on a single explanatory (independent) variable. It provides a framework for understanding the relationship between one predictor variable and the response variable.
 Model Equation:
 P(Y=1∣X) = exp(β0 + β1X) / [1 + exp(β0 + β1X)]
 Logit Function:
 logit(P) = ln{P/(1−P)} = β0 + β1X
 Parameters:
 β0: Intercept
 β1: Slope (effect of the explanatory variable)
 Interpretation: The odds increase by a factor of exp(β1) for a one-unit increase in X.
 Graphical Representation: S-shaped logistic curve showing the probability of Y=1 as X changes.
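The simple model above can be sketched in a few lines of Python; the coefficient values b0 = −3 and b1 = 1.5 are hypothetical, chosen only to show the S-shape and the odds-ratio interpretation:

```python
import math

def p_success(x, b0=-3.0, b1=1.5):
    """P(Y=1|X=x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)).
    b0 and b1 are illustrative values, not estimates."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# The S-shape: probabilities rise from near 0 to near 1 as x grows.
for x in (-2, 0, 2, 4):
    print(x, round(p_success(x), 3))

# A one-unit increase in x multiplies the odds by exp(b1).
odds = lambda p: p / (1.0 - p)
ratio = odds(p_success(1.0)) / odds(p_success(0.0))
print(round(ratio, 3), round(math.exp(1.5), 3))  # the two values agree
```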
Example
General Logistic Regression Model

 Model: Predicts a binary outcome based on multiple explanatory variables.
 Equation:
 P(Y=1∣X) = exp(β0 + β1X1 + β2X2 + ⋯ + βkXk) / [1 + exp(β0 + β1X1 + β2X2 + ⋯ + βkXk)]
 Logit Function:
 logit(P) = ln{P/(1−P)} = β0 + β1X1 + β2X2 + ⋯ + βkXk
 Parameters:
 β0: Intercept
 β1, …, βk: Coefficients for the predictors X1, …, Xk.
Example
Applications:

 Disease prediction using biomarkers and demographic variables.
 Fraud detection in financial transactions.
Logistic regression and probabilities

 In linear regression, the independent variables (e.g., age, height and gender) are used to estimate the specific value of the dependent variable (e.g., body mass index).
 In logistic regression, on the other hand, the dependent variable is dichotomous (0 or 1) and the probability that the value 1 occurs is estimated.
 Returning to an example: how likely is it that a disease is present if the person under consideration has a certain age, sex and smoking status?
Calculate logistic regression

To build a logistic regression model, the linear regression equation is used as the starting point.
However, if a linear regression were simply calculated to solve a logistic regression problem, the following result would appear graphically: as can be seen in the graph, values between plus and minus infinity can now occur. The goal of logistic regression, however, is to estimate the probability of occurrence, not the value of the variable itself.
Therefore, this equation must be transformed.
 To do this, it is necessary to restrict the value range for the prediction to the range between 0 and 1. To ensure that only values between 0 and 1 are possible, the logistic function f is used.
Logistic function

 The logistic model is based on the logistic function. The special thing about the logistic function is that for values between minus and plus infinity, it always assumes only values between 0 and 1.
Logistic Function

So the logistic function is perfect to describe the probability P(Y=1).
If the logistic function is now applied to the previous regression equation, the result is:
P(Y=1) = 1 / (1 + e^−(β0 + β1X1 + ⋯ + βkXk))
This now ensures that, no matter in which range the x values are located, only values between 0 and 1 will come out. The new graph is the S-shaped logistic curve.
The logistic function is a mathematical formula that maps any real number z to a value between 0 and 1:
P = 1/(1 + e^−z)
Where:
P: Probability of success (e.g., P(Y=1)).
z: Linear combination of predictors, expressed as:
z = β0 + β1X1 + β2X2 + ⋯ + βkXk
Key Properties:
P is always in the range [0,1], making it suitable to represent probabilities.
For large positive z, P→1; for large negative z, P→0.
The logistic function creates an S-shaped (sigmoidal) curve, which models the non-linear relationship between z and P.
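These properties are easy to verify numerically; a minimal sketch (the function name `logistic` is our own choice):

```python
import math

def logistic(z):
    """P = 1 / (1 + e^(-z)): maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large positive z -> P near 1; large negative z -> P near 0.
print(logistic(-10), logistic(0), logistic(10))

# The form used earlier, exp(z) / (1 + exp(z)), is algebraically identical.
z = 1.7
assert abs(logistic(z) - math.exp(z) / (1.0 + math.exp(z))) < 1e-12
```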
Logistic Regression

 Logistic regression applies the logistic function to model the relationship between a binary outcome variable Y (e.g., success/failure) and one or more predictors (X1, X2, …, Xk).
 Logistic Regression Model:
 P(Y=1∣X) = exp(β0 + β1X1 + β2X2 + ⋯ + βkXk) / [1 + exp(β0 + β1X1 + β2X2 + ⋯ + βkXk)]
 Where:
 P(Y=1∣X): Probability that the dependent variable Y = 1 given predictors X1, X2, …, Xk.
 z = β0 + β1X1 + β2X2 + ⋯ + βkXk: The linear combination of the predictors.


The connection between the logistic function and logistic regression lies in
the transformation from probabilities to a linear relationship:
(a) Probabilities and Odds:
 Logistic regression models the probability of success P(Y=1∣X).
 The odds of success are defined as: Odds=P/(1−P)
(b) Logit Function (Linearization):
 The logistic regression model transforms the non-linear probability into a
linear form using the logit function:
 logit(P)=ln {P/(1−P)}=β0+β1X1+β2X2+⋯+βkXk
 This ensures that the predictors are linearly related to the log-odds of
the outcome.
 The logistic function is used to "invert" this transformation and predict
probabilities from the linear combination.
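The logit/logistic round trip described above can be checked numerically; the coefficients b0 and b1 below are hypothetical, used only for illustration:

```python
import math

def logit(p):
    """Map a probability p in (0, 1) to log-odds in (-inf, inf)."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Logistic function: the inverse of the logit transformation."""
    return 1.0 / (1.0 + math.exp(-x))

# Round trip: the logistic function "inverts" the logit.
for p in (0.1, 0.5, 0.9):
    assert abs(inv_logit(logit(p)) - p) < 1e-12

# On the logit scale the model is linear in the predictors.
b0, b1, x = -1.0, 0.8, 2.0
p = inv_logit(b0 + b1 * x)
print(round(logit(p), 6), round(b0 + b1 * x, 6))  # both print 0.6
```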
Key Components and Notations

 Dependent Variable:
 Binary (e.g., 0 or 1).
 Independent Variables:
 Can be continuous, categorical, or a mix.
 Odds and Odds Ratio:
 Odds = P/(1−P); the odds ratio exp(βj) compares the odds after a one-unit increase in Xj to the odds before.
Estimation of Parameters

 Maximum Likelihood Estimation (MLE):
 Likelihood Function: L(β) = ∏i P(Yi=1∣Xi)^Yi [1 − P(Yi=1∣Xi)]^(1−Yi)
 Log-Likelihood Function: ℓ(β) = Σi { Yi ln P(Yi=1∣Xi) + (1−Yi) ln[1 − P(Yi=1∣Xi)] }
 Solved using iterative methods such as Newton-Raphson.
 Interpretation:
 Parameters represent changes in log-odds for a one-unit increase in predictors.
Likelihood Function
Log Likelihood Function
Optimization
Fitting the Model
Interpretation of Parameters
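The fitting procedure above can be sketched as a small Newton-Raphson loop for a single predictor. This is a minimal illustration of the idea, not a production estimator, and the toy data are invented:

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Fit P(Y=1|x) = 1/(1 + exp(-(b0 + b1*x))) by maximum likelihood,
    using Newton-Raphson iterations on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p                 # gradient w.r.t. b0
            g1 += (y - p) * x           # gradient w.r.t. b1
            w = p * (1.0 - p)           # observation weight (negative Hessian)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + (negative Hessian)^(-1) * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Invented toy data: Y = 1 becomes more likely as x grows.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
print(b0, b1)  # b1 > 0: the log-odds rise with x
```

At the maximum-likelihood solution the gradient is (numerically) zero, which is what the iteration drives toward.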
Goodness of Fit and Model Evaluation

 Deviance:
 Measures lack of fit.
 Hosmer-Lemeshow Test:
 Compares observed and predicted values in grouped data.
 Pseudo-R²:
 McFadden’s, Cox-Snell, and Nagelkerke’s R².
Deviance Measure

 The deviance for logistic regression with grouped data can also be expressed as:

 D = 2 Σi { yi ln(yi / ŷi) + (ni − yi) ln[(ni − yi) / (ni − ŷi)] }

 Where:
 yi: Observed number of successes in the i-th group of ni observations.
 ŷi = ni·π̂i: Fitted number of successes in the i-th group.
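Assuming grouped data with 0 < yi < ni (so every logarithm is defined), the deviance can be computed as a direct translation of the formula; the group counts and fitted probabilities below are hypothetical:

```python
import math

def deviance(y, n, p_hat):
    """Grouped-data deviance D = 2 * sum over groups of
    y*ln(y/yhat) + (n-y)*ln((n-y)/(n-yhat)), with yhat = n*p_hat."""
    d = 0.0
    for yi, ni, pi in zip(y, n, p_hat):
        fitted = ni * pi  # fitted number of successes in the group
        d += yi * math.log(yi / fitted)
        d += (ni - yi) * math.log((ni - yi) / (ni - fitted))
    return 2.0 * d

# Hypothetical groups: observed successes y out of n trials,
# with model-fitted success probabilities p_hat.
y = [3, 7, 12]
n = [10, 10, 15]
p_hat = [0.25, 0.65, 0.85]
print(round(deviance(y, n, p_hat), 4))  # near 0 when the fit is good
```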
Deviance measure and chi square test
Other Goodness-of-Fit Statistics

 AIC
 BIC
Pseudo-R squared
In a linear regression, the coefficient of determination R² indicates the proportion of the explained variance.
In logistic regression, the dependent variable is scaled nominally or ordinally and it is not possible to calculate a variance, so the coefficient of determination cannot be calculated in logistic regression.
However, in order to make a statement about the quality of the logistic regression model, so-called pseudo coefficients of determination have been established, also called pseudo-R squared.
Pseudo coefficients of determination are constructed in such a way that they lie between 0 and 1, just like the original coefficient of determination. The best-known coefficients of determination are the Cox and Snell R-square and the Nagelkerke R-square.
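Another common choice, McFadden's pseudo-R², compares the model's log-likelihood to that of a null model that predicts the overall success rate for everyone. A minimal sketch (the fitted probabilities below are hypothetical):

```python
import math

def log_likelihood(ys, ps):
    """Binary log-likelihood for outcomes ys and predicted probabilities ps."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for y, p in zip(ys, ps))

def mcfadden_r2(ys, ps):
    """McFadden's pseudo-R^2: 1 - LL(model) / LL(null model)."""
    p_null = sum(ys) / len(ys)  # null model: everyone gets the base rate
    ll_null = log_likelihood(ys, [p_null] * len(ys))
    return 1.0 - log_likelihood(ys, ps) / ll_null

# Hypothetical observed outcomes and model-fitted probabilities.
ys = [0, 0, 1, 0, 1, 1]
ps = [0.1, 0.2, 0.7, 0.3, 0.8, 0.9]
print(round(mcfadden_r2(ys, ps), 3))  # closer to 1 = better fit
```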
 Source: https://datatab.net/tutorial/logistic-regression
