Project Report

The document outlines a machine learning internship assessment focused on predicting customer churn, detailing the problem statement, client benefits, and dataset description. It covers the entire process from exploratory data analysis to model building and evaluation, ultimately selecting the XGBoost Classifier as the best-performing model. The conclusion emphasizes the importance of insights gained for customer retention strategies and suggests further analysis for improved model performance.

Uploaded by

nayan10072002


Machine Learning Internship Assessment

Customer Churn Prediction

Sameer Ansari

Table of Contents
1) Problem Statement
2) Client
3) Data Description
4) Exploratory Data Analysis (EDA)
5) Outliers Treatment
6) Feature Encoding
7) Checking Distribution of Data
8) Check Collinearity Between Variables
9) Data Splitting
10) Feature Scaling
11) Check for Class Imbalance
12) Feature Selection Using Random Forest Feature Importance
13) Model Building: Machine Learning Algorithms
14) Model Building: Neural Network
15) Model Building: Ensembles of Random Forest
16) Model Building: PCA
17) Model Building: Final Model Selection - XGBoost Classifier
18) Hyperparameter Tuning
19) Cross-Validation
    (I) Cross-Validation Scores (Accuracy)
    (II) Cross-Validation Scores (Recall)
20) Finding Optimal Threshold
21) Model Evaluation
    (I) Train & Test Data Metrics
    (II) Confusion Matrix
    (III) ROC-AUC Curve
22) Saving Model
23) Conclusion

Problem Statement
In today's competitive business world, it's important to keep customers happy so they don't
stop using our products or services. We want to develop a model that can predict which
customers are likely to stop using our service, so we can take steps to keep them.

Customer churn can lead to a loss of revenue and a decrease in customers. We want to use
machine learning to build a model that can accurately predict which customers are likely to
churn based on their past behaviour, demographics, and subscription details. This will help
us target high-risk customers with personalized retention strategies.

We want to create a solution that will help us keep customers happy and using our products
or services for the long term.

Client
* Proactive retention: The model can help the client identify customers who are likely
to churn before they actually do. This allows the client to take steps to retain those
customers, such as offering them discounts or special deals.

* Cost savings: By focusing on high-risk customers, the client can allocate their
resources more effectively and save money on marketing and customer acquisition
costs.

* Enhanced customer experience: Personalized retention efforts can improve the
overall customer experience, leading to increased satisfaction and loyalty. This can
make customers less likely to churn in the future.

* Optimized marketing: Targeted marketing efforts can be tailored to specific
customer segments, improving the effectiveness of marketing campaigns. This can
help the client attract new customers and retain existing ones.

* Business insights: The project can provide insights into factors that influence churn.
This information can be used to improve the client's products and services, making
them more appealing to customers.

* Competitive edge: Effective churn prediction can help the client differentiate
themselves from their competitors. This can give the client an advantage in
attracting and retaining customers.

* Revenue growth: Reduced churn rates mean a higher retention of paying customers.
This can lead to increased revenue growth and profitability.

* Data-driven decisions: The model's insights can help the client make informed
decisions based on historical customer data. This can help the client improve their
products, services, and marketing campaigns.

* Resource allocation: The model can help the client allocate customer service
resources more efficiently. This can help the client resolve customer issues more
quickly and effectively.

* Long-term value: Improved customer retention can help the client build a
foundation for sustainable business growth and long-term success.

Data Description
The dataset consists of customer information for a customer churn prediction problem. It
includes the following columns:

CustomerID: Unique identifier for each customer.

Name: Name of the customer.

Age: Age of the customer.

Gender: Gender of the customer (Male or Female).

Location: Location where the customer is based, with options including Houston, Los
Angeles, Miami, Chicago, and New York.

Subscription_Length_Months: The number of months the customer has been subscribed.

Monthly_Bill: Monthly bill amount for the customer.

Total_Usage_GB: Total usage in gigabytes.

Churn: A binary indicator (1 or 0) representing whether the customer has churned (1) or not
(0).

Exploratory Data Analysis (EDA)

The initial step involved exploring the dataset to understand its structure and
characteristics.

* The dataset contains information about 100,000 customers with 9 variables.

* All variables have the correct data type, and there are no missing values or duplicate
records.
* Descriptive statistics were generated for each variable, revealing insights into customer
demographics, subscription details, billing, usage, and churn behavior.

* The Gender and Location columns were examined to see how customers are spread across
the categories of each variable.
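
The checks listed above can be sketched with pandas. This is a minimal illustration, not the report's own code: the four rows below are made-up values that only mimic the column layout from the Data Description, and the variable names are assumptions.

```python
import pandas as pd

# Tiny hypothetical sample with the same nine columns as the dataset described above.
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Name": ["Customer_1", "Customer_2", "Customer_3", "Customer_4"],
    "Age": [63, 62, 24, 36],
    "Gender": ["Male", "Female", "Female", "Female"],
    "Location": ["Los Angeles", "New York", "Los Angeles", "Miami"],
    "Subscription_Length_Months": [17, 1, 5, 3],
    "Monthly_Bill": [73.36, 48.76, 85.47, 97.94],
    "Total_Usage_GB": [236, 172, 460, 297],
    "Churn": [0, 0, 0, 1],
})

n_rows, n_cols = df.shape                 # rows = customers, columns = variables
missing = int(df.isna().sum().sum())      # total number of missing cells
duplicates = int(df.duplicated().sum())   # number of fully duplicated rows

print(df.dtypes)                          # data type of each column
print(df.describe())                      # descriptive statistics for numeric columns
print(df["Gender"].value_counts())        # category counts for Gender
print(df["Location"].value_counts())      # category counts for Location
```

On the real data, `df.shape` would be expected to report 100,000 rows and 9 columns.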

Outliers Treatment
Outliers can affect model performance, so identifying and treating them is crucial.

* Box plots were used to visualize the presence of outliers.

* No significant outliers were detected in the dataset.
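
A box plot flags points that fall beyond its whiskers, which conventionally sit 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies the same rule numerically; the helper name `iqr_outlier_count` and the bill values are hypothetical.

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((s < lower) | (s > upper)).sum())

# Hypothetical monthly bills with no extreme values.
bills = pd.Series([30.0, 45.5, 52.0, 61.3, 70.0, 88.9, 99.5])
# The same bills with one obviously extreme value appended.
with_spike = pd.concat([bills, pd.Series([1000.0])], ignore_index=True)

print(iqr_outlier_count(bills))        # no value lies beyond the whiskers
print(iqr_outlier_count(with_spike))   # the appended 1000.0 is flagged
```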

Feature Encoding
Categorical variables were encoded to numerical values to enable machine learning
algorithms to process them effectively.

* One-Hot Encoding was applied to the 'Gender' and 'Location' variables.
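
One way to perform this encoding is `pandas.get_dummies`. The feature names that appear later in the feature importance list (Gender_Male and four Location columns, with no Gender_Female or Location_Chicago) are consistent with dropping the first category of each variable, so the sketch assumes `drop_first=True`; the report does not state this explicitly.

```python
import pandas as pd

# Hypothetical rows covering every Gender and Location category.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Male"],
    "Location": ["Houston", "Los Angeles", "Miami", "Chicago", "New York"],
})

# drop_first=True (an assumption) removes one reference category per variable,
# which avoids perfectly collinear dummy columns.
encoded = pd.get_dummies(df, columns=["Gender", "Location"], drop_first=True, dtype=int)

print(sorted(encoded.columns))
```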

Checking Distribution of Data

Analysing the distribution of data helps ensure that the data is suitable for modelling.

* Histograms and density plots were used to assess the distribution of numerical variables.

* All variables were found to be approximately normally distributed.

Check Collinearity Between Variables

Checking for collinearity between variables helps identify any redundant or highly correlated
features.

* Variance Inflation Factor (VIF) was calculated for each variable.

* No variables exhibited high multicollinearity.
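
For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below implements that textbook definition with NumPy on synthetic data; it is not necessarily the routine used in the project, and the function name `vif` is an assumption.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance Inflation Factor of each column: 1 / (1 - R^2) from an OLS fit
    of that column on all remaining columns plus an intercept."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)   # nearly a copy of a

independent = vif(np.column_stack([a, b]))      # values close to 1
collinear = vif(np.column_stack([a, b, c]))     # a and c get very large values

print(independent)
print(collinear)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.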

Data Splitting
The dataset was divided into training and testing sets to enable model training and
evaluation.

* The dataset was split into training and test sets in a 70:30 ratio.
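
A 70:30 split can be sketched with scikit-learn's `train_test_split`. The toy arrays, the `stratify` argument, and the `random_state` value are assumptions for illustration; the report only states the ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and balanced binary target (hypothetical data).
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# test_size=0.30 gives the 70:30 ratio; stratify keeps the class mix similar
# in both parts (an assumption, not stated in the report).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```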

Feature Scaling
Feature scaling was applied to ensure all variables were on the same scale, aiding model
convergence.

* Min-Max Scaling was applied to variables such as 'Age', 'Subscription_Length_Months',
'Monthly_Bill', and 'Total_Usage_GB'.
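
Min-Max scaling maps each value x to (x - min) / (max - min). The sketch below uses scikit-learn's `MinMaxScaler` and fits it on the training rows only, then reuses the learned minimum and maximum for the test rows. Whether the project fitted the scaler this way is not stated, so treat that as an assumption; the numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical [Age, Monthly_Bill] rows.
X_train = np.array([[18.0, 30.0], [44.0, 65.0], [70.0, 100.0]])
X_test = np.array([[31.0, 120.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learns min and max from training data
X_test_scaled = scaler.transform(X_test)         # applies the training min and max

print(X_train_scaled)
print(X_test_scaled)   # a test value above the training max scales to more than 1
```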

Check for Class Imbalance

Checking for class imbalance is important to address issues related to the distribution of the
target variable.

* The churn variable was found to be evenly distributed.
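
A quick way to check the balance of the target is to look at class shares. The series below is a hypothetical stand-in for the Churn column.

```python
import pandas as pd

# Hypothetical churn labels.
churn = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="Churn")

counts = churn.value_counts()                 # absolute count per class
shares = churn.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min() # 1.0 means perfectly balanced

print(shares)
print(imbalance_ratio)
```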

Feature Selection Using Random Forest Feature Importance
Identifying important features helps streamline the model and improve its interpretability.

* Random Forest Feature Importance was used to rank features based on their contribution
to the target variable.

* The top features were 'Monthly_Bill', 'Total_Usage_GB', 'Age', and
'Subscription_Length_Months'.

Feature                       Importance
Monthly_Bill                  0.316383
Total_Usage_GB                0.290353
Age                           0.194396
Subscription_Length_Months    0.142624
Gender_Male                   0.016683
Location_Los Angeles          0.010595
Location_Houston              0.010007
Location_Miami                0.009792
Location_New York             0.009166
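
A ranking of this kind can be produced from the `feature_importances_` attribute of a fitted scikit-learn `RandomForestClassifier`. The sketch below uses synthetic data in which only one column carries signal; the column names and hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
# Synthetic features: only "signal" is related to the target.
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
})
y = (X["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances are normalized to sum to 1.
importance = pd.Series(forest.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Impurity-based importances tend to favour continuous features with many distinct values over binary dummy columns, which may partly explain why the four numeric columns rank above the encoded Gender and Location columns.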

Model Building: Machine Learning Algorithms

Several machine learning algorithms were trained and evaluated using the dataset.

* Algorithms included Logistic Regression, Decision Tree, K-Nearest Neighbours, Gaussian
Naive Bayes, AdaBoost, Gradient Boosting, Random Forest, XGBoost, and Support Vector
Classifier (SVC).

* Training and test data performance metrics were calculated, revealing the strengths and
weaknesses of each algorithm.
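
A comparison loop of this kind might look as follows. This is a sketch on synthetic data with three of the listed algorithms from scikit-learn; the dataset, the hyperparameters, and the `results` dictionary are assumptions, and XGBoost is left out to keep the example dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Scoring on both sets exposes overfitting as a train/test gap.
    results[name] = {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }
    print(name, results[name])
```

An unrestricted decision tree typically reaches a training accuracy of 1.0, so the test column is the one to compare across models.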

Model Building: Neural Network

An attempt was made to build a neural network model, but it did not yield satisfactory
results.

Model Building: Ensembles of Random Forest

Ensemble models using Random Forest as base classifiers were evaluated, but no significant
improvement was observed.

Model Building: PCA

Principal Component Analysis (PCA) was applied to reduce dimensionality, but the results
did not show a significant improvement.

Model Building: Final Model Selection - XGBoost Classifier
XGBoost Classifier was identified as the best-performing algorithm across various metrics
and feature variations.

Hyperparameter Tuning
Hyperparameter tuning was explored to improve the model's performance, but no
substantial gains were achieved.

Cross-Validation
Five-fold cross-validation was performed to check how well the model's performance
generalizes to new data.

(I) Cross-Validation Scores (Accuracy): [0.49692857, 0.50057143, 0.49892857, 0.50478571,
0.505].
Mean Accuracy Score: 0.5012428571428571

(II) Cross-Validation Scores (Recall): [0.48990983, 0.49398798, 0.48869167, 0.50171772,
0.49427426].
Mean Recall Score: 0.4937162923036775

A mean accuracy of about 0.50 on an evenly distributed binary target is at the level of
random guessing.
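
Scores of this form can be obtained with scikit-learn's `cross_val_score`. The sketch below uses synthetic data and `GradientBoostingClassifier` as a stand-in for the XGBoost model, so the numbers it prints are unrelated to the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Stand-in estimator; the report's final model is an XGBoost classifier.
model = GradientBoostingClassifier(random_state=0)

# cv=5 yields one score per fold, matching the five values listed per metric.
acc_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
recall_scores = cross_val_score(model, X, y, cv=5, scoring="recall")

print(acc_scores, acc_scores.mean())
print(recall_scores, recall_scores.mean())
```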

Finding Optimal Threshold

The threshold for classification was fine-tuned to strike a balance between accuracy,
sensitivity, specificity, and F1-score.
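
The report does not say which criterion was used to pick the threshold. One common option, shown here purely as an assumption, is to sweep candidate thresholds over the predicted probabilities and keep the one that maximizes Youden's J (sensitivity + specificity - 1). The labels, probabilities, and the helper `sweep_thresholds` are hypothetical.

```python
import numpy as np

def sweep_thresholds(y_true, proba, thresholds):
    """Return (threshold, accuracy, sensitivity, specificity, f1) per threshold."""
    rows = []
    for t in thresholds:
        pred = (proba >= t).astype(int)
        tp = int(((pred == 1) & (y_true == 1)).sum())
        tn = int(((pred == 0) & (y_true == 0)).sum())
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
        acc = (tp + tn) / len(y_true)
        rows.append((t, acc, sens, spec, f1))
    return rows

# Hypothetical labels and predicted churn probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.10, 0.20, 0.35, 0.60, 0.40, 0.65, 0.80, 0.90])
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

rows = sweep_thresholds(y_true, proba, thresholds)
# Youden's J = sensitivity + specificity - 1.
best = max(rows, key=lambda r: r[2] + r[3] - 1)
print(best)
```

For these toy values the sweep selects a threshold of 0.4, where sensitivity is 1.0 and specificity is 0.75.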

Model Evaluation
(I) Train & Test Data Metrics

The final XGBoost model's performance was evaluated using various metrics on both the
training and test datasets.

Metric       Train       Test
Accuracy     0.664929    0.5005
Precision    0.668665    0.495329
Recall       0.651227    0.489224
F1-Score     0.659831    0.492258

(II) Confusion Matrix

Metric                Training Set    Test Set
True Negative (%)     33.995714       25.836667
False Positive (%)    16.102857       24.67
False Negative (%)    17.404286       25.28
True Positive (%)     32.497143       24.213333
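
The four cells determine accuracy, precision, recall, and F1. As a check, taking the test-set percentages as true negative 25.836667, false positive 24.67, false negative 25.28, and true positive 24.213333 reproduces the test metrics reported in the Train & Test Data Metrics table. The variable names below are illustrative.

```python
# Test-set cells as percentages of all test samples.
tn, fp, fn, tp = 25.836667, 24.67, 25.28, 24.213333

accuracy = (tp + tn) / (tp + tn + fp + fn)           # share of correct predictions
precision = tp / (tp + fp)                           # predicted churners who churned
recall = tp / (tp + fn)                              # actual churners who were caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(round(accuracy, 4), round(precision, 6), round(recall, 6), round(f1, 6))
```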

(III) ROC-AUC Curve

* Train ROC-AUC (area=0.66)

* Test ROC-AUC (area=0.50)
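
To put these areas in context, an AUC of 0.50 is what scores unrelated to the target produce, while informative scores push the area towards 1. The sketch below illustrates this with scikit-learn's `roc_auc_score` on synthetic labels and scores; none of the data comes from the project.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=5000)   # synthetic binary labels

# Scores with no relation to the labels.
random_scores = rng.random(5000)
# Scores that are shifted upwards for the positive class.
informative_scores = y_true + rng.normal(scale=1.0, size=5000)

auc_random = roc_auc_score(y_true, random_scores)
auc_informative = roc_auc_score(y_true, informative_scores)

print(auc_random)        # close to 0.5
print(auc_informative)   # clearly above 0.5
```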

Saving Model
The final XGBoost model was saved as a pickle file for future use.
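
Saving and reloading a fitted model with `pickle` can be sketched as below. A small `LogisticRegression` stands in for the XGBoost model, and the file name `churn_model.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Small stand-in model fitted on toy data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "churn_model.pkl")

with open(path, "wb") as f:
    pickle.dump(model, f)        # serialize the fitted model to disk

with open(path, "rb") as f:
    restored = pickle.load(f)    # load it back for later predictions

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same library versions that were used to create them.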

Conclusion
The customer churn prediction project involved thorough exploratory data analysis,
pre-processing, and the evaluation of various machine learning algorithms. The XGBoost
Classifier was selected as the final model because it performed best among the candidates
across the metrics considered. Even so, test accuracy (0.50), recall (0.49), and ROC-AUC
(0.50) remain at the level of random guessing on an evenly distributed target, while the
training accuracy is higher (0.66), so the model does not yet generalize to unseen data.
The insights gained from this project can still guide the company's strategies for customer
retention and business growth. Further analysis may involve gathering more data and
exploring advanced techniques to improve model performance.
