Lead Scoring Assignment Summary
Problem Description:
An education company named X Education sells online courses to industry professionals. Although X Education gets
a lot of leads, its lead conversion rate is poor, at around 30%.
X Education needs help building a logistic regression model that assigns a lead score between 0 and 100 to
each lead, which the company can use to target potential leads. A higher score means that the
lead is hot, i.e. most likely to convert, whereas a lower score means that the lead is cold and will most likely not
convert. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.
Approach:
✓ As a first step, we took a first look at the dataset and inspected the following:
✓ First few and last few rows
✓ Shape of the data
✓ Data types for each column
✓ Descriptive statistics for the numerical columns
✓ Also did basic research to get a better understanding of the domain
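The inspection steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook code from the assignment: the small `leads` DataFrame and its column names are hypothetical stand-ins for the real leads dataset.

```python
import pandas as pd

# Hypothetical stand-in for the leads dataset; values and column names are illustrative.
leads = pd.DataFrame({
    "Lead Source": ["Google", "Direct Traffic", "Google", "Olark Chat", "Reference"],
    "TotalVisits": [5, 2, 0, 8, 3],
    "Total Time Spent on Website": [674, 1532, 0, 305, 1428],
    "Converted": [0, 1, 0, 0, 1],
})

first_rows = leads.head()           # first few rows
last_rows = leads.tail()            # last few rows
n_rows, n_cols = leads.shape        # shape of the data
column_types = leads.dtypes         # data type of each column
numeric_summary = leads.describe()  # descriptive statistics (numeric columns only by default)

print(first_rows, last_rows, (n_rows, n_cols), column_types, numeric_summary, sep="\n\n")
```

Note that `describe()` skips non-numeric columns unless `include` is passed, which matches the "numerical columns" wording above.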
o Data Cleaning:
✓ Performed basic EDA and identified interesting patterns in the data.
✓ Performed bivariate analysis on categorical columns to see how they vary with respect to the Converted column.
✓ Dropped the column ‘Last Notable Activity’ because the feature is generated by the sales team.
✓ Performed bivariate analysis on numerical columns by plotting box plots.
✓ Also used a heatmap to identify highly correlated numerical columns.
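A rough sketch of the numbers behind these plots is shown below. It computes the quantities that a bar chart, box plot, and heatmap would visualize, without drawing anything. The toy data, the column names other than Converted, and the 0.8 correlation threshold are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; only the "Converted" column name comes from the summary.
leads = pd.DataFrame({
    "Lead Source": ["Google", "Google", "Reference", "Reference", "Olark Chat", "Olark Chat"],
    "TotalVisits": [2, 4, 6, 8, 1, 3],
    "Page Views Per Visit": [1.0, 2.0, 3.0, 4.0, 0.5, 1.5],
    "Converted": [0, 1, 1, 1, 0, 0],
})

# Categorical vs. target: conversion rate within each category.
conversion_by_source = leads.groupby("Lead Source")["Converted"].mean()

# Numerical vs. target: the quartiles a box plot would draw, per outcome.
visits_by_outcome = leads.groupby("Converted")["TotalVisits"].describe()

# Correlation matrix of the numerical columns (what a heatmap displays).
corr = leads[["TotalVisits", "Page Views Per Visit"]].corr()

# List column pairs whose absolute correlation exceeds an assumed threshold.
threshold = 0.8
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(conversion_by_source, visits_by_outcome, high_pairs, sep="\n\n")
```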
o Data Preparation:
✓ Created dummy variables for the categorical columns with more than 2 categories using the
pd.get_dummies function
✓ Performed a 70-30 split of the leads dataset into train and test sets respectively
✓ Performed feature scaling using the standard scaler.
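The preparation steps above might look like the following sketch. The toy data and column names are hypothetical. The use of `drop_first=True`, the `random_state` value, and fitting the scaler on the train set only are assumptions (common practice, but not stated in this summary).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the leads dataset.
leads = pd.DataFrame({
    "Lead Source": ["Google", "Direct Traffic", "Olark Chat", "Google", "Reference",
                    "Direct Traffic", "Google", "Olark Chat", "Reference", "Google"],
    "TotalVisits": [5.0, 2.0, 0.0, 8.0, 3.0, 4.0, 6.0, 1.0, 7.0, 2.0],
    "Converted": [0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})

# Dummy variables for a categorical column with more than 2 categories.
# drop_first=True (an assumption) drops one level to avoid redundant columns.
dummies = pd.get_dummies(leads["Lead Source"], prefix="Lead Source",
                         drop_first=True, dtype=int)

X = pd.concat([leads.drop(columns=["Lead Source", "Converted"]), dummies], axis=1)
y = leads["Converted"]

# 70-30 split into train and test sets; random_state is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=100
)
X_train, X_test = X_train.copy(), X_test.copy()

# Standardize numerical columns; fit on train only, then reuse on test.
num_cols = ["TotalVisits"]
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

Fitting the scaler on the train set alone keeps test-set statistics out of the model inputs.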
o Model Building:
✓ We shortlisted the top 15 features using the Recursive Feature Elimination (RFE) technique to build
our first model.
✓ In the next few iterations, we further fine-tuned our model by eliminating features with p-values > 0.05
and Variance Inflation Factor (VIF) values > 5. Using VIF helps reduce the impact of multicollinearity in
the data.
✓ Once the model was less complex, with ~10 features, we predicted probabilities on the train set and
created a new column, predicted, set to 1 if the probability is greater than 0.5 and 0 otherwise.
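The model-building loop above can be approximated as in the sketch below, under several assumptions. The data is synthetic (features `f0` to `f5` are hypothetical, with `f5` built to be nearly collinear with `f0`), RFE keeps 4 features rather than 15, and scikit-learn's `LogisticRegression` is used throughout, so the p-value step is omitted because scikit-learn does not report p-values. VIF is computed here from the inverse correlation matrix, the textbook definition 1 / (1 - R²); the original work may have used a library helper instead.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic training data (assumption): f5 nearly duplicates f0.
rng = np.random.default_rng(0)
n = 400
X_train = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"f{i}" for i in range(6)])
X_train["f5"] = X_train["f0"] + rng.normal(scale=0.05, size=n)
logit = 1.5 * X_train["f0"] - 1.0 * X_train["f1"]
y_train = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Step 1: shortlist features with Recursive Feature Elimination.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X_train, y_train)
selected = list(X_train.columns[rfe.support_])


def vif_table(frame):
    """VIF per column: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(frame.to_numpy(), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns)


# Step 2: drop the highest-VIF feature until every VIF is at most 5.
kept = list(selected)
while len(kept) > 1:
    vif = vif_table(X_train[kept])
    if vif.max() <= 5:
        break
    kept.remove(vif.idxmax())

# Step 3: predict probabilities on the train set and apply the 0.5 cutoff.
model = LogisticRegression(max_iter=1000).fit(X_train[kept], y_train)
result = pd.DataFrame({
    "Converted": y_train,
    "Conversion_Prob": model.predict_proba(X_train[kept])[:, 1],
})
result["predicted"] = (result["Conversion_Prob"] > 0.5).astype(int)

print(selected, kept, result.head(), sep="\n\n")
```

Dropping one feature at a time and recomputing VIF matters because removing a single collinear column can bring the VIF of the others back below the threshold.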
o Model Evaluation:
Below are the predictor variables that we used in our final model and their relative importance: