Lesson 7 Logistic Regression
Data Science
LOGISTIC REGRESSION
Module Objectives
At the end of this module, students must be able to:
1. differentiate linear regression from logistic regression;
2. enumerate situations where logistic regression is applicable;
3. discuss in simple terms the theory and model behind logistic
regression; and
4. implement logistic regression in a churn model example using R.
Introduction
In linear regression modeling, the outcome (dependent) variable is a continuous variable. As
an example, linear regression can be used to model the relationship of age and
education to income as follows:
$$Income = \beta_0 + \beta_1 Age + \beta_2 Education + \varepsilon$$
Suppose a person’s actual income was not of interest, but rather whether someone was
wealthy or poor. In such a case, when the outcome variable is categorical in nature, logistic
regression can be used to predict the likelihood of an outcome based on the input variables.
Although logistic regression can be applied to an outcome variable that represents multiple
values, our discussion will only examine the case in which the outcome variable represents
two values such as true/false, pass/fail, or yes/no.
*Text taken from Data Science and Big Data Analytics by EMC Education Services
Introduction
The training set would also include the outcome variable on whether the
person purchased a new automobile over a 12-month period.
Finance: Using a loan applicant’s credit history and the details on the loan,
determine the probability that an applicant will default on the loan. Based on the
prediction, the loan can be approved or denied, or the terms can be modified.
Use cases
Marketing: Determine a wireless customer’s probability of switching carriers
(known as churning) based on age, number of family members on the plan,
months remaining on the existing contract, and social network contacts. With
such insight, target the high-probability customers with appropriate offers to
prevent churn.
Model Description
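Logistic regression is based on
the logistic function f(y), as given
in Equation 6-7:
$$f(y) = \frac{e^{y}}{1 + e^{y}} \quad \text{for } -\infty < y < \infty \qquad \text{(6-7)}$$
Note that as $y \to \infty$, $f(y) \to 1$,
and as $y \to -\infty$, $f(y) \to 0$. As the figure on the right
shows, the value of the logistic
function varies from 0 to 1.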
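As a quick numerical check of this behavior, the logistic function f(y) = e^y / (1 + e^y) can be sketched in a few lines of R. The helper name `logistic` and the sample values of y are illustrative assumptions, not taken from the source text; base R's `plogis()` computes the same quantity.

```r
# Logistic function f(y) = e^y / (1 + e^y).
# `logistic` is a hypothetical helper name chosen for this sketch.
logistic <- function(y) exp(y) / (1 + exp(y))

# Illustrative y values, from strongly negative to strongly positive.
y <- c(-10, -2, 0, 2, 10)
f <- logistic(y)
print(round(f, 4))

# f(y) stays strictly between 0 and 1, equals 0.5 at y = 0,
# and increases as y increases.
# Base R's plogis() is the logistic distribution function and should agree.
print(all.equal(f, plogis(y)))
```

For very large y, `exp(y)` overflows in this naive form, so `plogis()` is the safer choice in practice.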
Model Description
Because the range of f(y) is (0,1), the logistic function appears to be an
appropriate function to model the probability of a particular outcome occurring.
As the value of y increases, the probability of the outcome occurring
increases.
Model Description
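Then, based on the input variables $x_1, x_2, \ldots, x_{p-1}$, the probability of an event is
as shown:
$$p(x_1, x_2, \ldots, x_{p-1}) = f(y) = \frac{e^{y}}{1 + e^{y}}, \quad \text{where } y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}$$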
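To make the step from input variables to a probability concrete, the following R sketch assumes, as is standard in logistic regression, that y is a linear combination of the inputs, and then passes y through the logistic function. The coefficient and input values are made up purely for illustration and do not come from the source text.

```r
# Hypothetical coefficients: an intercept and one weight per input variable.
beta0 <- -1.0
beta  <- c(0.8, -0.5)

# Hypothetical values of the input variables x1 and x2 for one observation.
x <- c(2.0, 1.0)

# Linear combination of the inputs: y = beta0 + beta1 * x1 + beta2 * x2.
y <- beta0 + sum(beta * x)

# Probability of the event: p = f(y) = e^y / (1 + e^y).
p <- exp(y) / (1 + exp(y))
print(c(y = y, p = p))
```

A positive y gives a probability above 0.5 and a negative y gives one below 0.5, so the sign of each coefficient indicates whether its variable pushes the probability up or down.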
Model Description
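It can be shown that this equation can be rewritten in terms of the log odds, with p denoting f(y):
$$\ln\left(\frac{p}{1 - p}\right) = y$$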
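A short derivation of this kind of rewriting, writing p for f(y) and starting from the logistic function f(y) = e^y / (1 + e^y). The quantity on the left of the last line is commonly called the log odds, or logit, of p.

```latex
\begin{align*}
p &= \frac{e^{y}}{1 + e^{y}} \\
1 - p &= \frac{1}{1 + e^{y}} \\
\frac{p}{1 - p} &= e^{y} \\
\ln\left(\frac{p}{1 - p}\right) &= y
\end{align*}
```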
Model Description
Techniques such as Maximum Likelihood Estimation (MLE) are used to
estimate the model parameters. MLE determines the values of the model
parameters that maximize the chances of observing the given dataset.
However, the specifics of implementing MLE are beyond the scope of this
book.
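Although a full treatment is out of scope, the idea behind MLE can be sketched in R by numerically minimizing the negative log-likelihood on simulated data. Everything in this sketch is an assumption made for illustration: the simulated data, the "true" parameter values, and the function names `neg_log_lik` and `neg_grad`. R's `glm()` uses a different algorithm (iteratively reweighted least squares), but it targets the same maximum likelihood estimates, so it serves as a reference.

```r
set.seed(42)

# Simulated data: one input x and a binary outcome drawn from a logistic
# model with made-up "true" parameters beta0 = -0.5 and beta1 = 1.2.
n <- 1000
x <- rnorm(n)
outcome <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative log-likelihood of the data for parameters b = (beta0, beta1).
neg_log_lik <- function(b) {
  eta <- b[1] + b[2] * x
  -sum(outcome * eta - log1p(exp(eta)))
}

# Gradient of the negative log-likelihood, supplied to help the optimizer.
neg_grad <- function(b) {
  p <- plogis(b[1] + b[2] * x)
  -c(sum(outcome - p), sum((outcome - p) * x))
}

# Minimizing the negative log-likelihood maximizes the likelihood.
mle <- optim(c(0, 0), neg_log_lik, gr = neg_grad, method = "BFGS")

# Reference fit with base R's glm() for comparison.
ref <- glm(outcome ~ x, family = binomial(link = "logit"))

print(round(mle$par, 3))
print(round(unname(coef(ref)), 3))
```

The two printed vectors should agree closely, which illustrates that the model parameters are simply the values that make the observed outcomes most likely.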
The next example will help to clarify the logistic regression model. The
mechanics of using R to fit a logistic regression model are covered in the next
section on evaluating the fitted model. In this section, the discussion focuses
on interpreting the fitted model.
Example – Customer Churn
Data on 8,000 current and prior customers was obtained. The variables
collected for each customer follow:
Diagnostics
Example – Customer Churn
After analyzing the data and fitting a logistic regression model, Age and
Churned_contacts were selected as the best predictor variables. The
following equation provides the estimated model parameters.
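A minimal sketch of how a model with these two predictors might be fitted in R using `glm()`. The source dataset is not reproduced here, so the data frame `churn_data` is simulated, and the coefficients used to generate it are invented for illustration; they are not the estimated parameters referred to in the text. The column names `Age` and `Churned_contacts` follow the text, while the outcome column name `Churned` is an assumption.

```r
set.seed(1)

# Simulated stand-in for the customer data (8,000 rows, as in the example).
n <- 8000
churn_data <- data.frame(
  Age = sample(18:70, n, replace = TRUE),
  Churned_contacts = rpois(n, lambda = 1)
)

# Invented generating coefficients, used only to produce a binary outcome.
eta <- 1.0 - 0.08 * churn_data$Age + 0.5 * churn_data$Churned_contacts
churn_data$Churned <- rbinom(n, size = 1, prob = plogis(eta))

# Fit the logistic regression model with a logit link.
fit <- glm(Churned ~ Age + Churned_contacts,
           data = churn_data,
           family = binomial(link = "logit"))
print(summary(fit)$coefficients)

# Predicted churn probability for a hypothetical 30-year-old customer
# with two churned contacts.
new_customer <- data.frame(Age = 30, Churned_contacts = 2)
p_hat <- predict(fit, newdata = new_customer, type = "response")
print(p_hat)
```

With `type = "response"`, `predict()` returns the probability p directly rather than the linear predictor y, which is usually the more convenient scale for interpreting churn risk.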
Example – Customer Churn