
Case Study Summary

The document summarizes a case study on using logistic regression to score and rank leads for an online education company. The goal was to increase sales efficiency by identifying the most promising leads. Data on past leads was cleaned, explored, and split for model training and testing. A logistic regression model was built that achieved 80% accuracy on the test data. It generated conversion probability scores to classify leads as "hot" or not. This scoring system can help the sales team focus on prospects most likely to convert.

Uploaded by

Nitish Gupta
Copyright
© All Rights Reserved

Case Study Summary

This report summarises the approach taken for the Lead Scoring case study.

The summary is divided into three parts: What, How and Conclusion.

What: X Education offers online courses. Every day, many people looking for
such courses land on the company's website or learn about the courses through
other lead origins. Their details are stored in a dataset that the sales team
uses to approach leads. This process is inefficient. The target is to increase
efficiency by reducing the time spent on leads while keeping the same
conversion rate.

How: Efficiency can be improved by building a logistic regression model that
predicts the probability of conversion from the existing dataset. The steps
taken to build this model are below:
1. The data was loaded and its shape, info, etc. were inspected
2. The value “Select” was treated as a null value and therefore replaced with NaN
3. The data was cleaned in the following steps:
a. Columns with more than 50% missing values were dropped
b. On further examination, 4 columns had about 45 percent missing values
("Asymmetrique Activity Index", "Asymmetrique Profile Index", "Asymmetrique Activity
Score", "Asymmetrique Profile Score"). After further exploration, we concluded
it was better to drop these columns than to impute their missing values.
c. Highly skewed columns were dropped, using a cut-off value of 85 percent
d. Rows with 5 or more missing values were dropped
e. The columns with 20-40% missing values were imputed with the most frequent
value (mode), as they were all categorical columns
f. After cleaning, 99.54% of the data was retained
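
The cleaning steps above can be sketched in pandas roughly as follows. This is
a minimal illustration, not the code used in the case study: the function name
clean_leads and the toy data are hypothetical, "skewed" is assumed to mean that
a single value accounts for more than 85% of a column, and the imputation uses
the most frequent value because the columns are categorical.

```python
import numpy as np
import pandas as pd

# The four columns with about 45% missing values named in step 3b.
ASYMMETRIQUE_COLS = [
    "Asymmetrique Activity Index",
    "Asymmetrique Profile Index",
    "Asymmetrique Activity Score",
    "Asymmetrique Profile Score",
]


def clean_leads(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that applies steps 2 and 3a-3e in order."""
    # Step 2: "Select" means no option was chosen, so treat it as missing.
    df = df.replace("Select", np.nan)

    # 3a: drop columns with more than 50% missing values.
    df = df.loc[:, df.isna().mean() <= 0.50]

    # 3b: drop the four named columns (ignored if they are absent).
    df = df.drop(columns=ASYMMETRIQUE_COLS, errors="ignore")

    # 3c: drop columns where one value covers more than 85% of the
    # non-missing entries (assumed reading of "skewed").
    top_share = df.apply(lambda col: col.value_counts(normalize=True).max())
    df = df.loc[:, top_share <= 0.85]

    # 3d: drop rows with 5 or more missing values.
    df = df[df.isna().sum(axis=1) < 5].copy()

    # 3e: fill remaining gaps in non-numeric columns with the mode.
    for name in df.columns:
        if pd.api.types.is_numeric_dtype(df[name]):
            continue
        modes = df[name].mode()
        if not modes.empty:
            df[name] = df[name].fillna(modes.iloc[0])
    return df


# Toy data, invented purely to exercise each rule.
raw = pd.DataFrame({
    "Lead Source": ["Google", "Google", "Select", "Direct", "Google",
                    "Direct", "Google", "Organic", "Google", "Direct"],
    "Country": ["India"] * 9 + ["US"],
    "Mostly Missing": [np.nan] * 6 + ["a", "b", "c", "d"],
    "Asymmetrique Activity Index": [1.0] * 5 + [2.0] * 5,
    "Converted": [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
})
cleaned = clean_leads(raw)
```

On the toy data, only "Lead Source" and "Converted" survive, and the single
“Select” entry ends up filled with the most frequent source.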

4. EDA was performed:
a. Univariate analysis of the categorical variables showed the most and least
frequent categories. Some categorical columns had many categories, so in each
such column the categories with very low frequencies were merged into a single
category called “Other”
b. Univariate analysis of the numerical columns showed outliers in a few
columns, so the values in these columns were capped at the 99th percentile
c. Bivariate analysis was done with ‘Converted’ as the target variable to
identify the columns that help conversion
d. Multivariate analysis using a correlation matrix showed the most correlated
variables
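
The two transformations in steps 4a and 4b can be sketched as below. The
helper names and the min_share threshold are assumptions made for this
illustration; the report does not state what frequency counted as "very low",
and the cap is applied to the upper tail only.

```python
import pandas as pd


def merge_rare_categories(col: pd.Series, min_share: float = 0.01) -> pd.Series:
    """Replace categories rarer than min_share with the label "Other"."""
    shares = col.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; everything else becomes "Other".
    return col.where(~col.isin(rare), "Other")


def cap_at_percentile(col: pd.Series, q: float = 0.99) -> pd.Series:
    """Cap values above the q-th quantile at that quantile."""
    return col.clip(upper=col.quantile(q))


# Toy data, invented for the example.
source = pd.Series(["Google"] * 6 + ["Direct"] * 3 + ["bing"])
merged = merge_rare_categories(source, min_share=0.2)

visits = pd.Series(list(range(1, 100)) + [1000])
capped = cap_at_percentile(visits)
```

Here "bing" (10% of rows) falls under the 20% threshold and becomes “Other”,
while the single extreme visit count of 1000 is pulled down to the 99th
percentile of the column.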

5. Data preparation steps:
a. Columns with too many categories were binned to reduce the number of
dummy variables
b. The data was split 70-30 into train and test sets
c. A standard scaler was used for scaling to help the algorithm converge faster
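
A minimal sketch of the preparation steps with pandas and scikit-learn is
given below. The column names visits, minutes_on_site and source, the data and
the random_state are hypothetical. Fitting the scaler on the train set only is
a common precaution against leaking test information; the report does not say
how this was handled.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data with hypothetical column names.
leads = pd.DataFrame({
    "visits": [1.0, 3.0, 2.0, 8.0, 5.0, 0.0, 4.0, 7.0, 2.0, 6.0],
    "minutes_on_site": [5.0, 40.0, 12.0, 90.0, 33.0, 1.0, 25.0, 70.0, 9.0, 55.0],
    "source": ["Google", "Direct", "Google", "Organic", "Direct",
               "Google", "Organic", "Direct", "Google", "Direct"],
    "Converted": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1],
})

# Dummy variables for the categorical column; drop_first avoids a redundant column.
X = pd.get_dummies(leads.drop(columns="Converted"), drop_first=True, dtype=int)
y = leads["Converted"]

# 70-30 split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Scale the numeric columns: fit on train, then apply the same transform to test.
num_cols = ["visits", "minutes_on_site"]
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```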
6. Modelling:
a. Using RFE, 25 columns were initially selected
b. It took 14 model iterations to reach a stable model with p-values below 5%,
VIF below 5 and 80% accuracy
c. The ROC curve was plotted to check how sensitivity and specificity vary
d. The optimal cut-off probability was found by iterating over cut-off values
and plotting sensitivity, specificity and accuracy against them
e. The sensitivity, specificity and accuracy curves intersected at about 0.37,
but we had a prerequisite of 80 percent sensitivity, so the cut-off was chosen
as 0.2, which yielded a sensitivity of 83 percent on the train data and 81
percent on the test data

f. The precision-recall curve was plotted but not used for cut-off
optimization, as our target was to chase hot leads and not cold leads, so a
good balance of sensitivity and specificity was more important
g. The model was run on the test data and it gave a sensitivity value of 83%

Conclusion
A logistic regression model with the desired accuracy of 80% was created.
Based on the conversion probabilities calculated by the model, a new column
called Score was created to rate the leads. It will help the sales team find
the hot leads.
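
The cut-off sweep from the Modelling step and the Score column can be sketched
as follows. This is an illustration under stated assumptions: the function
names, the toy labels and probabilities, the rule of taking the highest cut-off
that still meets the sensitivity target, and the 0-100 scale of Score are all
choices made for this example and are not taken from the report.

```python
import pandas as pd


def cutoff_table(y_true, y_prob, cutoffs):
    """Sensitivity, specificity and accuracy at each probability cut-off."""
    rows = []
    for c in cutoffs:
        pred = [1 if p >= c else 0 for p in y_prob]
        tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
        tn = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 0)
        fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
        rows.append({
            "cutoff": c,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / len(y_true),
        })
    return pd.DataFrame(rows)


def highest_cutoff_meeting(table, min_sensitivity=0.80):
    """Assumed selection rule: largest cut-off whose sensitivity meets the target."""
    ok = table[table["sensitivity"] >= min_sensitivity]
    return float(ok["cutoff"].max())


# Toy labels and predicted probabilities, invented for the example.
leads = pd.DataFrame({
    "Converted": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "Conversion_Prob": [0.9, 0.8, 0.6, 0.35, 0.15, 0.7, 0.4, 0.3, 0.1, 0.05],
})
cutoffs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
table = cutoff_table(leads["Converted"], leads["Conversion_Prob"], cutoffs)
chosen = highest_cutoff_meeting(table, min_sensitivity=0.80)

# Score on an assumed 0-100 scale, plus a hot-lead flag at the chosen cut-off.
leads["Score"] = (leads["Conversion_Prob"] * 100).round().astype(int)
leads["Hot"] = leads["Conversion_Prob"] >= chosen
```

Lowering the cut-off raises sensitivity at the cost of specificity, which is
why a sensitivity requirement can push the chosen value below the point where
the three curves intersect.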
