
Lead Scoring Assignment Summary

An education company wanted to increase its low lead conversion rate. It provided a dataset to build a logistic regression model to assign lead scores between 0-100. The approach included data cleaning, EDA, feature selection, and model evaluation. The optimal model used 10 features to predict lead conversion with 87% AUC.


Lead Scoring Case Study Summary:

Problem Description:

An education company named X Education sells online courses to industry professionals. Although X Education gets
a lot of leads, its lead conversion rate is very poor, at around 30%.

X Education needs help building a logistic regression model that assigns a lead score between 0 and 100 to
each lead, which the company can use to target potential leads. A higher score means the lead is hot, i.e. most
likely to convert, whereas a lower score means the lead is cold and will most likely not convert. The CEO, in
particular, has given a ballpark target lead conversion rate of around 80%.

Approach:

o Reading & understanding the data:

✓ In this step we took a first look at the dataset and inspected the following:
✓ First few and last few rows
✓ Checked the shape of the data
✓ Data types for each column
✓ Got the descriptive statistics for the numerical columns
✓ Did basic research to get better understanding of the domain
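
The inspection steps above can be sketched with pandas; a toy frame stands in for the actual leads file, and the column names here are only illustrative, not the dataset's real schema:

```python
import pandas as pd

# Toy stand-in for the leads dataset (hypothetical columns).
leads = pd.DataFrame({
    "Lead Source": ["Google", "Reference", "Olark Chat"],
    "TotalVisits": [3, 7, 1],
    "Converted": [0, 1, 0],
})

head = leads.head()       # first few rows
tail = leads.tail()       # last few rows
shape = leads.shape       # (rows, columns)
dtypes = leads.dtypes     # data type of each column
stats = leads.describe()  # descriptive statistics for numerical columns
```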

o Data Cleaning:

✓ Converted ‘Select’ values to null values.
✓ Performed missing value treatment.
✓ Dropped columns with only one unique value.
✓ Dropped columns with exactly two unique values, after confirming a data imbalance of > 85%.
✓ Checked for duplicates; none were found.
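
A minimal pandas sketch of these cleaning rules, run on a made-up frame (column names such as Magazine and Do Not Call are only illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the leads data (hypothetical columns).
leads = pd.DataFrame({
    "Specialization": ["Select", "Finance", "HR", "Select",
                       "Media", "Finance", "Select", "HR"],
    "TotalVisits": [2, 5, 1, 3, 8, 4, 6, 7],
    "Magazine": ["No"] * 8,               # single unique value
    "Do Not Call": ["No"] * 7 + ["Yes"],  # binary, 87.5% imbalance
    "Converted": [0, 1, 0, 1, 1, 0, 0, 1],
})

# 'Select' is a dropdown placeholder, i.e. no real answer -> null.
leads = leads.replace("Select", np.nan)

# Drop columns with only one unique value.
single_valued = [c for c in leads.columns
                 if leads[c].nunique(dropna=False) == 1]
leads = leads.drop(columns=single_valued)

# Drop binary columns where one level covers more than 85% of rows
# (keeping the Converted target out of consideration).
imbalanced = [
    c for c in leads.columns
    if c != "Converted"
    and leads[c].nunique() == 2
    and leads[c].value_counts(normalize=True).iloc[0] > 0.85
]
leads = leads.drop(columns=imbalanced)

# Check for duplicate rows (none in this toy frame).
leads = leads.drop_duplicates()
```
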
o Exploratory Data Analysis:

✓ Performed basic EDA and identified notable patterns in the data.
✓ Performed bivariate analysis on the categorical columns to see how they vary w.r.t. the Converted column.
✓ Dropped the column ‘Last Notable Activity’, as the feature is generated by the sales team.
✓ Performed bivariate analysis on the numerical columns by plotting box plots.
✓ Used a correlation heatmap to identify highly correlated numerical columns.
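
The bivariate and correlation checks might look like this in pandas; the heatmap itself would typically be rendered with seaborn.heatmap, and the column names here are hypothetical:

```python
import pandas as pd

# Illustrative slice of the leads data (made-up values).
df = pd.DataFrame({
    "Lead Source": ["Google", "Google", "Reference",
                    "Reference", "Olark Chat", "Google"],
    "Total Time Spent on Website": [120, 640, 980, 870, 15, 300],
    "TotalVisits": [2, 5, 8, 7, 1, 3],
    "Converted": [0, 1, 1, 1, 0, 0],
})

# Bivariate view of a categorical column vs. Converted:
# conversion rate per category.
rate = (df.groupby("Lead Source")["Converted"]
          .mean()
          .sort_values(ascending=False))

# Correlation matrix for the numerical columns; this is what the
# heatmap visualises.
corr = df[["Total Time Spent on Website", "TotalVisits", "Converted"]].corr()
```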

o Data Preparation:

✓ Created dummy variables for the categorical columns with more than 2 categories using the
pd.get_dummies function.
✓ Performed a 70-30 split of the leads dataset into train and test sets.
✓ Performed feature scaling using StandardScaler.
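
A sketch of these preparation steps with pandas and scikit-learn, again on illustrative column names:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Lead Source": ["Google", "Reference", "Olark Chat"] * 10,
    "Total Time Spent on Website": range(30),
    "Converted": [0, 1, 1] * 10,
})

# Dummy-encode categorical columns with more than 2 levels,
# dropping the first level to avoid the dummy-variable trap.
df = pd.get_dummies(df, columns=["Lead Source"], drop_first=True)

X = df.drop(columns="Converted")
y = df["Converted"]

# 70-30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

# Fit the scaler on train only, then apply the same transform to test.
scaler = StandardScaler()
num_cols = ["Total Time Spent on Website"]
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```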

o Model Building:

✓ We shortlisted the top 15 features using the Recursive Feature Elimination (RFE) technique to build
our first model.
✓ In the next few iterations, we fine-tuned the model by eliminating features with p-values > 0.05
and Variance Inflation Factor (VIF) values > 5. Using VIF helps reduce the impact of
multicollinearity in the data.
✓ Once the model was down to ~10 features, we predicted probabilities on the train set and created a
new column, Predicted, set to 1 if the probability is greater than 0.5 and to 0 otherwise.
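
The RFE-then-VIF loop can be sketched with scikit-learn on synthetic data; the p-value check (done via a statsmodels summary in this kind of workflow) is omitted here, and VIF is computed by hand as 1/(1 − R²):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data: 6 features, with f5 deliberately collinear with f0.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)),
                 columns=[f"f{i}" for i in range(6)])
X["f5"] = X["f0"] * 0.9 + rng.normal(scale=0.1, size=200)
y = (X["f0"] + X["f1"] > 0).astype(int)

# Step 1: shortlist features with RFE (15 in the assignment; 3 here).
rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=3).fit(X, y)
shortlist = X.columns[rfe.support_].tolist()

# Step 2: VIF per shortlisted feature = 1 / (1 - R^2), where R^2 comes
# from regressing that feature on the other shortlisted features;
# drop any feature with VIF > 5.
def vif(frame: pd.DataFrame, col: str) -> float:
    others = frame.drop(columns=col)
    r2 = LinearRegression().fit(others, frame[col]).score(others, frame[col])
    return 1.0 / (1.0 - r2)

vifs = {c: vif(X[shortlist], c) for c in shortlist}
keep = [c for c, v in vifs.items() if v <= 5]
```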

o Model Evaluation:

✓ We calculated the metrics sensitivity, specificity, precision, and accuracy.
✓ For predictions on the train dataset, an optimum cut-off of 0.34 was found from the intersection
of the sensitivity, specificity, and accuracy curves, as shown in the figure below.
✓ We plotted the ROC curve to find the area under the curve (0.87 for the train dataset).
✓ We also tried finding the optimal cut-off using the precision vs. recall trade-off curve. However,
the model’s sensitivity and precision dropped below the 75% mark, so that cut-off was not chosen
as the final one.
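
A sketch of the cut-off search and AUC computation on synthetic probabilities; the 0.34 cut-off and 0.87 AUC come from the assignment's train data, not from this toy data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Fake labels and informative fake probabilities.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(0.3 * y_true + 0.7 * rng.uniform(size=500), 0, 1)

def metrics_at(cutoff):
    """Sensitivity, specificity and accuracy at a probability cutoff."""
    pred = (y_prob >= cutoff).astype(int)
    tp = ((pred == 1) & (y_true == 1)).sum()
    tn = ((pred == 0) & (y_true == 0)).sum()
    fp = ((pred == 1) & (y_true == 0)).sum()
    fn = ((pred == 0) & (y_true == 1)).sum()
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / len(y_true)
    return sens, spec, acc

# Scan cutoffs; the optimum sits where sensitivity and specificity meet.
cutoffs = np.round(np.arange(0.05, 1.0, 0.01), 2)
gaps = {c: abs(metrics_at(c)[0] - metrics_at(c)[1]) for c in cutoffs}
best = min(gaps, key=gaps.get)

auc = roc_auc_score(y_true, y_prob)  # area under the ROC curve
```
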
o Predictions on the Test Set:
✓ After finalizing the optimum cut-off of 0.34 and calculating the metrics on the train set, we made
predictions on the test dataset. Below are the observations:
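
Mapping the predicted conversion probabilities to 0-100 lead scores and to hard predictions at the 0.34 cut-off is then a one-liner each (the probabilities below are made up):

```python
# Hypothetical predicted probabilities for four test-set leads.
probs = [0.91, 0.40, 0.12, 0.34]

# Lead score = probability scaled to 0-100.
lead_scores = [round(p * 100) for p in probs]

# Hard prediction at the chosen 0.34 cut-off.
predictions = [1 if p >= 0.34 else 0 for p in probs]
```
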
o Final Observations:

Below are the predictor variables that we used in our final model and their relative importance:
