ET - Project Presentation Solution
Prediction
● EDA Results
● Data Preprocessing
● Model Performance Summary
● Appendix
● We have customers from tier 1 and tier 3 cities but very few from tier 2 cities. The company
should expand its marketing strategies to increase the number of customers from tier 2 cities.
● We saw in our analysis that people with higher income or in senior positions such as AVP or VP are
less likely to buy the product. The company can offer short-term travel packages and customize
the package for higher-income customers with added luxuries to target such customers.
● When implementing a marketing strategy, factors such as the number of follow-ups and the
time of call should also be carefully considered, as our analysis shows that the customers who
received more follow-ups are the ones buying the package.
● We saw in our analysis that young and single people are more likely to buy the offered packages.
The company can offer discounts or customize the package to attract more couples, families, and
customers above 30 years of age.
● One of the ways to expand the customer base is to introduce a new offering of packages.
Currently, the company offers 5 types of packages: Basic, Standard, Deluxe, Super Deluxe, and King.
However, it was difficult to identify the potential customers because customers
were contacted at random without looking at the available information.
● The company is now planning to launch a new product, the Wellness Tourism Package. Wellness
Tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy
lifestyle, and to support or increase one's sense of well-being. This time, the company wants to
harness the available data on existing and potential customers to target the right customers.
● The task is to analyze the data and build a model to predict which customer is potentially going
to purchase the newly introduced travel package.
This file is meant for personal use by [email protected] only.
Sharing or publishing the contents in part or full is liable for legal action.
Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
EDA Results
● The distribution of monthly income shows that most of the values lie between 20,000 and 40,000.
● Income is one of the important factors to consider while approaching a customer with a certain package.
We can explore this further in bivariate analysis.
● There are some observations on the left and some observations on the right of the boxplot which can be
considered as outliers.
● Approximately 70% of customers reached out to the company
first, i.e. through self-inquiry.
● This might be because the company makes more profit from Deluxe or Basic
packages or these packages are less expensive, so preferred by the majority
of the customers.
EDA Results
● We have seen that married people are the most common customer
for the company but this graph shows that the conversion rate is
higher for single and unmarried customers as compared to the
married customers.
● The company can target single and unmarried customers more and
can modify packages as per these customers.
● The conversion rate for large business owners is higher than salaried or
small business owners.
● The conversion rate of customers is higher if the product pitched is Basic. This might be because
the basic package is less expensive.
● We saw earlier that the company pitches the Deluxe package more than the Standard package, but the
Standard package shows a higher conversion rate than the Deluxe package. The company can pitch
Standard packages more often.
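The conversion rates compared above are the mean of the binary purchase flag within each group. A minimal sketch in pandas, assuming the columns are named `ProductPitched` and `ProdTaken`; the six rows are made up purely to show the calculation and are not the project's data:

```python
import pandas as pd

# Toy rows for illustration only (not the project's data).
df = pd.DataFrame({
    "ProductPitched": ["Basic", "Basic", "Deluxe", "Deluxe", "Standard", "Standard"],
    "ProdTaken": [1, 0, 0, 0, 1, 0],  # 1 = package purchased
})

# Conversion rate = share of pitches that ended in a purchase, per product.
conversion = (
    df.groupby("ProductPitched")["ProdTaken"]
    .agg(pitches="count", conversion_rate="mean")
    .sort_values("conversion_rate", ascending=False)
)
print(conversion)
```

Reporting the pitch count next to the rate makes it easier to see when a product is pitched often but converts poorly.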
EDA Results
● Number of trips and age have a weak positive correlation, which makes sense: as age increases,
the number of trips is expected to increase.
● ProdTaken has a weak negative correlation with age, which agrees with our earlier observation that
as age increases the probability of purchasing a package decreases.
● There are only four observations where the monthly income is greater than 40,000 or less than
12,000. We checked these observations and they appear to be outliers.
● The share of observations with 19 or more trips is very small, so we can consider
these values as outliers. There are just four observations with a number of trips
of 19 or greater, so we will drop these rows.
● There are missing values in a few of the numeric variables (Age, Monthly income, and Number of
trips), so we will impute these values with the median.
● There are missing values in a few of the categorical variables Type of contact, Preferred property
star, and Number of children visiting, so we will impute these values with mode / most frequent.
● There are 6 categorical variables having string values, so we will be encoding these variables
with dummies.
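The preprocessing steps listed above could be sketched in pandas roughly as follows. The column names (`Age`, `MonthlyIncome`, `NumberOfTrips`, `TypeofContact`, `Occupation`) and the five toy rows are assumptions for illustration; this is not the project's actual code or data:

```python
import numpy as np
import pandas as pd

# Toy data with assumed column names; values are made up for illustration.
df = pd.DataFrame({
    "Age": [25.0, np.nan, 41.0, 33.0, 52.0],
    "MonthlyIncome": [21000.0, 23500.0, 28000.0, 30500.0, np.nan],
    "NumberOfTrips": [2.0, 3.0, 21.0, np.nan, 1.0],
    "TypeofContact": ["Company Invited", None, "Self Enquiry", "Self Enquiry", "Self Enquiry"],
    "Occupation": ["Salaried", "Small Business", "Salaried", "Large Business", "Salaried"],
})

# 1. Drop rows with 19 or more trips (treated as outliers).
#    Rows with a missing trip count are kept so they can be imputed below.
df = df[~(df["NumberOfTrips"] >= 19)].copy()

# 2. Impute missing numeric values with the column median.
for col in ["Age", "MonthlyIncome", "NumberOfTrips"]:
    df[col] = df[col].fillna(df[col].median())

# 3. Impute missing categorical values with the mode (most frequent value).
for col in ["TypeofContact"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# 4. Encode string-valued categorical columns as dummy variables.
encoded = pd.get_dummies(df, columns=["TypeofContact", "Occupation"], drop_first=True)
print(encoded.head())
```

In a modelling workflow the medians and modes would normally be computed on the training split only and then applied to the test split, to avoid leaking test information.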
Model Performance Summary
● We want to predict whether a customer will buy the newly introduced travel package or not,
using the information provided to us.
● We will use Recall as the performance metric for our model because the model can make two kinds
of wrong predictions:
● Predicting a customer will buy the product when the customer doesn't buy: loss of
resources
● Predicting a customer will not buy the product when the customer does buy: loss of opportunity
● We want Recall to be maximized. The greater the Recall, the fewer the false negatives,
i.e. the fewer potential buyers we miss.
● The tuned XGBoost model indicates that the most significant predictors of buying a travel package are:
○ Passport
○ Designation
○ Marital Status
○ City tier
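To make the chosen metric concrete: Recall is the share of actual buyers that the model flags, TP / (TP + FN), so every missed buyer (false negative) lowers it. A minimal sketch in plain Python; the labels below are made up for illustration:

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN): the share of actual buyers the model catches."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# 1 = customer bought the package, 0 = did not buy (illustrative labels).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# 3 of the 4 actual buyers are predicted as buyers; the false positive
# (a non-buyer predicted as a buyer) does not affect recall.
print(recall(y_true, y_pred))
```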
Model Performance Summary
APPENDIX
● The attributes include Age, Occupation, Income, Gender, ProdTaken, Passport, and
more.
● The average age of customers is 37 years; ages range widely, from 18 to 61 years.
● 0 errors on the training set, each sample has been classified correctly.
● Model has performed very well on the training set.
● As we know, a decision tree will continue to grow and classify each data point
correctly if no restrictions are applied as the trees will learn all the patterns in the
training set.
● The decision tree model is overfitting the data as expected and not able to
generalize well on the test set.
● We will have to use hyperparameter tuning with the decision tree.
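The overfitting behaviour described above can be reproduced on synthetic data. The sketch below uses scikit-learn with a generated, imbalanced dataset as a stand-in (it is not the travel dataset), and the limits on the second tree are arbitrary values chosen only for comparison:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption: roughly 20% positives).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], flip_y=0.05, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# No depth or leaf-size limits: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# A restricted tree with assumed limits, shown only for comparison.
small_tree = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=10, random_state=1
).fit(X_train, y_train)

for name, model in [("unrestricted", full_tree), ("restricted", small_tree)]:
    train_recall = recall_score(y_train, model.predict(X_train))
    test_recall = recall_score(y_test, model.predict(X_test))
    print(f"{name}: train recall={train_recall:.3f}, test recall={test_recall:.3f}")
```

Comparing the train and test recall of the two trees shows how much of the unrestricted tree's training performance comes from memorizing the training set.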
Model Improvement: Decision Tree
● The performance of the model after hyperparameter tuning has become generalized.
● We are getting a Recall of 0.663 and 0.652 for the training and test sets, respectively.
● Let’s try building some ensemble models and see if the metrics improve.
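Hyperparameter tuning with Recall as the selection criterion could look roughly like the following. This is a sketch using scikit-learn's `GridSearchCV` on synthetic stand-in data; the parameter grid is an assumption, not the grid used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the travel dataset).
X, y = make_classification(
    n_samples=800, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Assumed grid: limits on tree growth plus optional class re-weighting.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [5, 10, 20],
    "class_weight": [None, "balanced"],
}

# scoring="recall" makes cross-validation pick the combination with the best recall.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_
print("best parameters:", search.best_params_)
print("train recall:", recall_score(y_train, best_tree.predict(X_train)))
print("test recall:", recall_score(y_test, best_tree.predict(X_test)))
```

A small gap between the two printed recall values is what "generalized" means here.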
● We are getting a Recall of 0.881 and 0.662 for the training and test sets, respectively.
● After tuning the hyperparameters, the random forest is still overfitting.
● We are getting a Recall of 0.951 and 0.510 for the training and test sets, respectively, which is
a very large gap.
● We'll try to reduce overfitting and improve the performance by hyperparameter tuning.
● Recall has improved on both the training and test sets, but there is still a big gap between
the two.
● Recall has improved on both the training and test sets, but there is still a gap between
the two.
● The XGBoost model on the training set has performed very well but it is not able to
generalize on the test set.
● Let's try and tune the hyperparameters and see if the performance can be generalized.
● Overfitting has reduced after hyperparameter tuning, but the model is still overfitting.
● For the Stacking Classifier, the tuned random forest, the tuned gradient boosting classifier,
and the decision tree models were used as the initial estimators, while the tuned XGBoost
classifier was used as the final estimator.
● We have received recall scores of 0.878 and 0.735 on the training and test set,
respectively.
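The stacking setup described above can be sketched with scikit-learn's `StackingClassifier`. This is an illustration under stated assumptions: the data is synthetic, the base models use arbitrary untuned settings, and a `GradientBoostingClassifier` stands in for the tuned XGBoost final estimator so that the example depends on scikit-learn only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the travel dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Initial (base) estimators; hyperparameters here are placeholders, not tuned values.
base_estimators = [
    ("random_forest", RandomForestClassifier(n_estimators=50, random_state=1)),
    ("gradient_boosting", GradientBoostingClassifier(n_estimators=50, random_state=1)),
    ("decision_tree", DecisionTreeClassifier(max_depth=4, random_state=1)),
]

# Stand-in final estimator: the slides use a tuned XGBoost classifier here.
final_estimator = GradientBoostingClassifier(n_estimators=50, random_state=1)

# The final estimator is trained on cross-validated predictions of the base models.
stack = StackingClassifier(
    estimators=base_estimators, final_estimator=final_estimator, cv=5
)
stack.fit(X_train, y_train)

train_recall = recall_score(y_train, stack.predict(X_train))
test_recall = recall_score(y_test, stack.predict(X_test))
print(f"train recall={train_recall:.3f}, test recall={test_recall:.3f}")
```

If the `xgboost` package is available, its scikit-learn compatible classifier could be passed as `final_estimator` instead of the stand-in.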