
Major Project on

CHURN PREDICTION USING MACHINE LEARNING

INTERNSHIP ON
ARTIFICIAL INTELLIGENCE WITH PYTHON
CONDUCTED BY: DEXTERITY GLOBAL

Submitted by:
Kumari Nikita
Introduction to Churn Prediction Using ML
Churn prediction using machine learning (ML) models is an essential technique for
businesses to identify and retain customers who are likely to stop using their products
or services. This process involves several key steps, starting with data collection and
preparation, where businesses gather comprehensive data on customer behaviour,
demographics, transaction history, and other relevant factors. This data forms the
foundation for building an effective churn prediction model. Next, exploratory data
analysis (EDA) is conducted to uncover patterns and trends within the data that may
indicate potential churn. Understanding these patterns helps in identifying the key
factors contributing to churn and informs the subsequent steps in the process.
Model selection and training involve choosing appropriate ML models such as logistic
regression, decision trees, random forests, or neural networks, and training them on the
prepared data. This training process enables the models to learn and recognize the
patterns associated with customer churn. Fine-tuning the models through optimization
and hyperparameter tuning is crucial to improving their accuracy and performance. By
adjusting various model parameters and testing different configurations, businesses can
achieve the best results in predicting churn.
Once the models are trained and optimized, they are deployed and integrated into the
business workflow. This implementation allows companies to start making predictions
about which customers are at risk of churning and take proactive measures to retain
them. Monitoring and maintenance are vital to ensure the model's continued
effectiveness. Regular updates and adjustments are necessary to adapt to changing
customer behaviour and trends, ensuring that the churn prediction model remains
accurate over time.
The benefits of churn prediction are significant. Enhanced customer retention is a
primary advantage, as identifying at-risk customers early allows businesses to take
proactive steps to retain them. This can involve offering personalized incentives,
improving customer support, or addressing specific pain points that may lead to churn.
Improved customer satisfaction is another benefit, as targeted interventions can address
customers' needs and concerns, enhancing their overall experience with the company.
Reduced revenue loss is a direct result of retaining customers who might otherwise have
churned, leading to increased profitability. Additionally, long-term profitability is
achieved by building and maintaining long-term relationships with customers, resulting
in sustained business growth.
Churn prediction using ML models is a strategic approach that leverages data-driven
insights to understand customer behaviour and take informed actions to enhance
retention and overall business performance. By investing in churn prediction, businesses
can create a more personalized and responsive customer experience, ultimately leading
to higher customer satisfaction and loyalty.
Methodology
1. Data Collection and Preparation: Gather data on customer behaviour,
demographics, transaction history, and other relevant factors. This data serves as the
foundation for building the churn prediction model.
2. Exploratory Data Analysis (EDA): Conduct an initial analysis of the collected data
to identify patterns and trends that may indicate potential churn. This helps in
understanding the key factors contributing to churn.
3. Model Selection and Training: Choose appropriate ML models, such as logistic
regression, decision trees, random forests, or neural networks. Train these models on the
prepared data to learn the patterns associated with churn.
4. Model Optimization and Hyperparameter Tuning: Fine-tune the models to
improve their accuracy and performance. This involves adjusting the model's parameters
and testing different configurations to achieve the best results.
5. Model Deployment and Integration: Implement the trained churn prediction model
into the business workflow. This enables the company to start making predictions about
which customers are at risk of churning.
6. Monitoring and Maintenance: Continuously monitor the model's performance and
update it as needed to ensure it remains effective. Regular maintenance is crucial to
adapt to changing customer behaviour and trends.

WHAT IS CHURN?
Churn, in a business context, refers to the phenomenon of customers discontinuing
their relationship with a company or ceasing to use its products or services. It is a
critical metric for businesses as it directly impacts their revenue and growth. Customer
churn can be measured as the percentage of customers who leave over a specific
period.
Understanding and predicting churn is essential for businesses to retain customers and
maintain a competitive edge. By identifying factors that contribute to churn, companies
can implement strategies to improve customer satisfaction, enhance loyalty, and reduce
attrition rates.
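As a quick illustration of the metric (the figures below are hypothetical), the churn rate for a period is simply the number of customers lost divided by the number of customers at the start of the period:

# hypothetical example: 2,000 customers at the start of a quarter, 150 lost
customers_at_start = 2000
customers_lost = 150

churn_rate = customers_lost / customers_at_start * 100
print(f"Quarterly churn rate: {churn_rate:.1f}%")  # prints: Quarterly churn rate: 7.5%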
LIBRARIES USED
 TensorFlow
TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used for
machine learning applications such as neural networks. It is used for both research and
production at Google. TensorFlow was developed by the Google Brain team for internal
Google use. It was released under the Apache 2.0 open-source license on November 9, 2015.

 Numpy
Numpy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the fundamental
package for scientific computing with Python. It contains various features including these
important ones:

• A powerful N-dimensional array object

• Sophisticated (broadcasting) functions

• Tools for integrating C/C++ and Fortran code

• Useful linear algebra, Fourier transform, and random number capabilities


Besides its obvious scientific uses, Numpy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data-types can be defined using Numpy which allows
Numpy to seamlessly and speedily integrate with a wide variety of databases.
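A minimal sketch of the features listed above (N-dimensional arrays, broadcasting, and the linear algebra and random number utilities); the values are illustrative:

import numpy as np

# a 2 x 3 N-dimensional array
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# broadcasting: the per-column means are subtracted from every row
centered = a - a.mean(axis=0)

# built-in linear algebra and random number support
cov = np.cov(a)                                        # 2 x 2 covariance matrix
noise = np.random.default_rng(0).normal(size=a.shape)  # reproducible random draws

print(centered)
print(cov.shape, noise.shape)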

 Pandas
Pandas is an open-source Python library providing high-performance data manipulation and
analysis tools built on its powerful data structures. Before Pandas, Python was used mainly for
data munging and preparation and contributed little to data analysis itself; Pandas closed that
gap. Using Pandas, we can accomplish the five typical steps in the processing and analysis of
data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze.
Python with Pandas is used in a wide range of academic and commercial domains, including
finance, economics, statistics, and analytics.
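A brief sketch of that load-prepare-manipulate-analyze workflow; the file path is illustrative, and the column names assume a customer dataset like the one used later in this project:

import pandas as pd

# load: read customer records from a CSV file (path is illustrative)
df = pd.read_csv("customers.csv")

# prepare: drop the identifier and coerce a numeric column stored as text
df = df.drop(columns=["customerID"])
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# manipulate / analyze: average monthly charges per contract type
print(df.groupby("Contract")["MonthlyCharges"].mean())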

 Matplotlib
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a
variety of hardcopy formats and interactive environments across platforms. Matplotlib can be
used in Python scripts, the Python and IPython shells, the Jupyter Notebook, web application
servers, and several graphical user interface toolkits. Matplotlib tries to make easy things easy
and hard things possible: you can generate plots, histograms, power spectra, bar charts, error
charts, scatter plots, and more with just a few lines of code.
For simple plotting, the pyplot module provides a MATLAB-like interface, particularly when
combined with IPython. Power users have full control of line styles, font properties, axes
properties, and so on, via an object-oriented interface or via a set of functions familiar to
MATLAB users.
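For example, a histogram really does take only a few lines with the pyplot interface (the data here is simulated):

import numpy as np
import matplotlib.pyplot as plt

# 1,000 simulated values, then a histogram in a few lines
values = np.random.default_rng(42).normal(loc=65, scale=30, size=1000)

plt.figure(figsize=(5, 3))
plt.hist(values, bins=30, edgecolor="black")
plt.title("Histogram of simulated monthly charges")
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()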

 Scikit – learn
Scikit-learn provides a range of supervised and unsupervised learning algorithms via a
consistent interface in Python. It is licensed under a permissive simplified BSD license and is
packaged in many Linux distributions, which encourages both academic and commercial use.
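A small sketch of that consistent interface: two different estimators trained and scored through the same fit/score calls, on a synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# every estimator exposes the same fit / predict / score interface
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=42)):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))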
Machine Learning Models: Decision Tree, Random Forest,
and XGBoost
Decision Tree
A Decision Tree is a versatile and interpretable model used for both classification and
regression tasks. It splits the dataset into subsets based on the values of input features.
Each node represents a feature, each branch represents a decision rule, and each leaf
node represents an outcome.
Features:
 Nodes and Branches: The structure of the tree, with nodes representing features
and branches representing decision rules.
 Root Node: The topmost node, representing the most significant feature.
 Internal Nodes: Nodes that represent features used for further splitting the data.
 Leaf Nodes: Terminal nodes representing the final prediction (classification or
regression).
 Splitting: The process of dividing nodes into sub-nodes based on decision rules.
 Pruning: The technique used to remove parts of the tree that do not contribute
significantly, reducing complexity and preventing overfitting.
 Impurity Measures: Metrics like Gini Index, Information Gain, and Chi-square
to determine the best feature for splitting.
Advantages:
 Easy to understand and interpret.
 Can handle both numerical and categorical data.
 Requires little data preprocessing.
Disadvantages:
 Prone to overfitting, especially with complex datasets.
 Can be unstable: small changes in the data can produce very different trees.
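A minimal scikit-learn sketch of these ideas, using Gini impurity for splitting and a depth limit as a simple pre-pruning guard against the overfitting noted above (the dataset is synthetic):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# synthetic stand-in for customer data
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Gini impurity for splitting; max_depth acts as simple pre-pruning
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(export_text(tree))                      # the learned decision rules, node by node
print("Test accuracy:", tree.score(X_test, y_test))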
Random Forest
Random Forest is an ensemble learning method that creates multiple decision trees and
merges their predictions. It uses the concept of bagging to improve accuracy and
robustness.
Features:
 Ensemble Learning: Combines multiple decision trees to enhance model
performance.
 Bagging (Bootstrap Aggregating): Generates multiple samples from the
training data by sampling with replacement.
 Feature Randomness: Each tree is built using a random subset of features,
increasing diversity and reducing correlation.
 Majority Voting: For classification, the final prediction is based on the majority
vote of the individual trees.
 Averaging: For regression, the final prediction is the average of the individual
trees' predictions.
 Feature Importance: Measures the contribution of each feature to the model's
predictive power.
Advantages:
 Reduces overfitting by averaging multiple trees.
 Handles large datasets with high dimensionality well.
 Provides feature importance.
Disadvantages:
 Can be computationally intensive and slower to predict than individual trees.
 Requires careful tuning of hyperparameters for optimal performance.
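A short sketch of bagging with feature randomness and the feature importance measure described above, again on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=42)

# 100 trees, each trained on a bootstrap sample and restricted to a random
# subset of features at every split (bagging + feature randomness)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X, y)

# feature importance: each feature's contribution to the ensemble's splits
for i, importance in enumerate(forest.feature_importances_):
    print(f"feature_{i}: {importance:.3f}")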

XGBoost (Extreme Gradient Boosting)


XGBoost is an optimized distributed gradient boosting library designed for efficiency,
flexibility, and high performance. It builds decision trees sequentially, with each new
tree aiming to correct errors made by previous ones.
Features:
 Gradient Boosting: Sequentially builds trees to correct errors from previous
ones.
 Regularization: Incorporates L1 (Lasso) and L2 (Ridge) regularization to
prevent overfitting.
 Parallel Processing: Utilizes parallel computing for faster training and
prediction.
 Tree Pruning: Grows trees to a set maximum depth and then prunes back splits
whose loss reduction falls below a threshold, for efficient tree construction.
 Handling Missing Data: Automatically learns the best way to handle missing
data.
 Cross-validation: Supports built-in cross-validation for model evaluation and
selection.
 Scalability: Highly scalable and can handle large datasets efficiently.
Advantages:
 Highly efficient and fast due to parallel processing.
 Offers regularization to reduce overfitting.
 Provides feature importance and handles missing data well.
Disadvantages:
 Can be more complex to tune and requires careful selection of hyperparameters.
 May be less interpretable than simpler models like decision trees.
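A minimal sketch showing the L1/L2 regularization and parallel training options discussed above; the parameter values are illustrative, not tuned:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# trees are added sequentially; reg_alpha / reg_lambda are the L1 / L2
# penalties described above, and n_jobs=-1 parallelizes tree construction
model = XGBClassifier(n_estimators=200, learning_rate=0.1,
                      reg_alpha=0.1, reg_lambda=1.0,
                      n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))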

Summary
Churn prediction using machine learning models such as Decision Trees, Random
Forests, and XGBoost involves identifying customers who are likely to stop using a
company's products or services. By analyzing customer data, these models help
businesses take proactive measures to retain at-risk customers.
By leveraging these models, businesses can enhance customer retention, improve
satisfaction, reduce revenue loss, and achieve long-term profitability. Each model has
unique strengths, and the choice depends on the specific requirements of the task and
dataset.
Implementation notebook: Customer_Churn_Prediction_using_ML.ipynb

1. Importing the dependencies

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import pickle

2. Data Loading and Understanding

# load the csv data to a pandas dataframe


df = pd.read_csv("/content/WA_Fn-UseC_-Telco-Customer-Churn.csv")

df.shape

(7043, 21)

df.head()

   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines
0  7590-VHVEG  Female              0     Yes         No       1           No  No phone service
1  5575-GNVDE    Male              0      No         No      34          Yes                No
2  3668-QPYBK    Male              0      No         No       2          Yes                No
3  7795-CFOCW    Male              0      No         No      45           No  No phone service
4  9237-HQITU  Female              0      No         No       2          Yes                No

5 rows × 21 columns (remaining columns truncated in this view)

pd.set_option("display.max_columns", None)


df.head(2)

   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines
0  7590-VHVEG  Female              0     Yes         No       1           No  No phone service
1  5575-GNVDE    Male              0      No         No      34          Yes                No

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

# dropping customerID column as this is not required for modelling


df = df.drop(columns=["customerID"])

df.head(2)


   gender  SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines
0  Female              0     Yes         No       1           No  No phone service
1    Male              0      No         No      34          Yes                No

df.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',


'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')

print(df["gender"].unique())

['Female' 'Male']

print(df["SeniorCitizen"].unique())

[0 1]

# printing the unique values in all the columns

numerical_features_list = ["tenure", "MonthlyCharges", "TotalCharges"]

for col in df.columns:
    if col not in numerical_features_list:
        print(col, df[col].unique())
        print("-"*50)

gender ['Female' 'Male']


--------------------------------------------------
SeniorCitizen [0 1]
--------------------------------------------------
Partner ['Yes' 'No']
--------------------------------------------------
Dependents ['No' 'Yes']
--------------------------------------------------
PhoneService ['No' 'Yes']
--------------------------------------------------
MultipleLines ['No phone service' 'No' 'Yes']
--------------------------------------------------
InternetService ['DSL' 'Fiber optic' 'No']
--------------------------------------------------
OnlineSecurity ['No' 'Yes' 'No internet service']
--------------------------------------------------

OnlineBackup ['Yes' 'No' 'No internet service']
--------------------------------------------------
DeviceProtection ['No' 'Yes' 'No internet service']
--------------------------------------------------
TechSupport ['No' 'Yes' 'No internet service']
--------------------------------------------------
StreamingTV ['No' 'Yes' 'No internet service']
--------------------------------------------------
StreamingMovies ['No' 'Yes' 'No internet service']
--------------------------------------------------
Contract ['Month-to-month' 'One year' 'Two year']
--------------------------------------------------
PaperlessBilling ['Yes' 'No']
--------------------------------------------------
PaymentMethod ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
'Credit card (automatic)']
--------------------------------------------------
Churn ['No' 'Yes']
--------------------------------------------------

print(df.isnull().sum())

gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
dtype: int64

#df["TotalCharges"] = df["TotalCharges"].astype(float)

df[df["TotalCharges"]==" "]


      gender  SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines
488   Female              0     Yes        Yes       0           No  No phone service
753     Male              0      No        Yes       0          Yes                No
936   Female              0     Yes        Yes       0          Yes                No
1082    Male              0     Yes        Yes       0          Yes               Yes
1340  Female              0     Yes        Yes       0           No  No phone service
3331    Male              0     Yes        Yes       0          Yes                No
3826    Male              0     Yes        Yes       0          Yes               Yes
4380  Female              0     Yes        Yes       0          Yes                No
5218    Male              0     Yes        Yes       0          Yes                No
6670  Female              0     Yes        Yes       0          Yes               Yes
6754    Male              0      No        Yes       0          Yes               Yes

(remaining columns truncated in this view)

len(df[df["TotalCharges"]==" "])

11

df["TotalCharges"] = df["TotalCharges"].replace({" ": "0.0"})

df["TotalCharges"] = df["TotalCharges"].astype(float)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 7043 non-null object
1 SeniorCitizen 7043 non-null int64
2 Partner 7043 non-null object
3 Dependents 7043 non-null object
4 tenure 7043 non-null int64
5 PhoneService 7043 non-null object
6 MultipleLines 7043 non-null object
7 InternetService 7043 non-null object
8 OnlineSecurity 7043 non-null object
9 OnlineBackup 7043 non-null object
10 DeviceProtection 7043 non-null object
11 TechSupport 7043 non-null object
12 StreamingTV 7043 non-null object
13 StreamingMovies 7043 non-null object
14 Contract 7043 non-null object
15 PaperlessBilling 7043 non-null object
16 PaymentMethod 7043 non-null object
17 MonthlyCharges 7043 non-null float64
18 TotalCharges 7043 non-null float64
19 Churn 7043 non-null object
dtypes: float64(2), int64(2), object(16)
memory usage: 1.1+ MB

# checking the class distribution of target column


print(df["Churn"].value_counts())

Churn
No 5174
Yes 1869
Name: count, dtype: int64

Insights:

1. customerID column removed as it is not required for modelling
2. isnull() reports no nulls, but 11 rows hold a blank string (" ") in TotalCharges
3. The blank TotalCharges values were replaced with "0.0" and the column cast to float
4. Class imbalance identified in the target column (5174 "No" vs 1869 "Yes")

3. Exploratory Data Analysis (EDA)

df.shape

(7043, 20)

df.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',


'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')


df.head(2)

   gender  SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines
0  Female              0     Yes         No       1           No  No phone service
1    Male              0      No         No      34          Yes                No

df.describe()

       SeniorCitizen       tenure  MonthlyCharges  TotalCharges
count    7043.000000  7043.000000     7043.000000   7043.000000
mean        0.162147    32.371149       64.761692   2279.734304
std         0.368612    24.559481       30.090047   2266.794470
min         0.000000     0.000000       18.250000      0.000000
25%         0.000000     9.000000       35.500000    398.550000
50%         0.000000    29.000000       70.350000   1394.550000
75%         0.000000    55.000000       89.850000   3786.600000
max         1.000000    72.000000      118.750000   8684.800000

Numerical Features - Analysis

Understand the distribution of the numerical features

def plot_histogram(df, column_name):
    plt.figure(figsize=(5, 3))
    sns.histplot(df[column_name], kde=True)
    plt.title(f"Distribution of {column_name}")

    # calculate the mean and median values for the column
    col_mean = df[column_name].mean()
    col_median = df[column_name].median()

    # add vertical lines for mean and median
    plt.axvline(col_mean, color="red", linestyle="--", label="Mean")
    plt.axvline(col_median, color="green", linestyle="-", label="Median")

    plt.legend()
    plt.show()

plot_histogram(df, "tenure")

plot_histogram(df, "MonthlyCharges")

plot_histogram(df, "TotalCharges")

https://colab.research.google.com/drive/1bSgQoiWdHU8gWPpriBUZ1cKTzFFCts28#scrollTo=zssUNtcsgMLh&printMode=true 8/17
2/18/25, 9:31 AM Customer_Churn_Prediction_using_ML.ipynb - Colab

Box plot for numerical features

def plot_boxplot(df, column_name):
    plt.figure(figsize=(5, 3))
    sns.boxplot(y=df[column_name])
    plt.title(f"Box Plot of {column_name}")
    plt.ylabel(column_name)
    plt.show()

plot_boxplot(df, "tenure")

plot_boxplot(df, "MonthlyCharges")


plot_boxplot(df, "TotalCharges")

Correlation Heatmap for numerical columns

# correlation matrix - heatmap
plt.figure(figsize=(8, 4))
sns.heatmap(df[["tenure", "MonthlyCharges", "TotalCharges"]].corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()


Categorical features - Analysis

df.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',


'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 7043 non-null object
1 SeniorCitizen 7043 non-null int64
2 Partner 7043 non-null object
3 Dependents 7043 non-null object
4 tenure 7043 non-null int64
5 PhoneService 7043 non-null object
6 MultipleLines 7043 non-null object
7 InternetService 7043 non-null object
8 OnlineSecurity 7043 non-null object
9 OnlineBackup 7043 non-null object
10 DeviceProtection 7043 non-null object
11 TechSupport 7043 non-null object
12 StreamingTV 7043 non-null object
13 StreamingMovies 7043 non-null object
14 Contract 7043 non-null object
15 PaperlessBilling 7043 non-null object
16 PaymentMethod 7043 non-null object
17 MonthlyCharges 7043 non-null float64
18 TotalCharges 7043 non-null float64
19 Churn 7043 non-null object
dtypes: float64(2), int64(2), object(16)
memory usage: 1.1+ MB

Countplot for categorical columns

object_cols = df.select_dtypes(include="object").columns.to_list()

object_cols = ["SeniorCitizen"] + object_cols

for col in object_cols:
    plt.figure(figsize=(5, 3))
    sns.countplot(x=df[col])
    plt.title(f"Count Plot of {col}")
    plt.show()


4. Data Preprocessing

df.head(3)

   gender  SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines
0  Female              0     Yes         No       1           No  No phone service
1    Male              0      No         No      34          Yes                No
2    Male              0      No         No       2          Yes                No

Label encoding of target column

df["Churn"] = df["Churn"].replace({"Yes": 1, "No": 0})

<ipython-input-39-b6eb27bc3ee0>:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version.
  df["Churn"] = df["Churn"].replace({"Yes": 1, "No": 0})


df.head(3)

   gender  SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines
0  Female              0     Yes         No       1           No  No phone service
1    Male              0      No         No      34          Yes                No
2    Male              0      No         No       2          Yes                No

print(df["Churn"].value_counts())

Churn
0 5174
1 1869
Name: count, dtype: int64

Label encoding of categorical features

# identifying columns with object data type


object_columns = df.select_dtypes(include="object").columns

print(object_columns)

Index(['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',


'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
'PaperlessBilling', 'PaymentMethod'],
dtype='object')

# initialize a dictionary to save the encoders


encoders = {}

# apply label encoding and store the encoders


for column in object_columns:
    label_encoder = LabelEncoder()
    df[column] = label_encoder.fit_transform(df[column])
    encoders[column] = label_encoder

# save the encoders to a pickle file


with open("encoders.pkl", "wb") as f:
pickle.dump(encoders, f)


encoders

{'gender': LabelEncoder(),
'Partner': LabelEncoder(),
'Dependents': LabelEncoder(),
'PhoneService': LabelEncoder(),
'MultipleLines': LabelEncoder(),
'InternetService': LabelEncoder(),
'OnlineSecurity': LabelEncoder(),
'OnlineBackup': LabelEncoder(),
'DeviceProtection': LabelEncoder(),
'TechSupport': LabelEncoder(),
'StreamingTV': LabelEncoder(),
'StreamingMovies': LabelEncoder(),
'Contract': LabelEncoder(),
'PaperlessBilling': LabelEncoder(),
'PaymentMethod': LabelEncoder()}

df.head()

   gender  SeniorCitizen  Partner  Dependents  tenure  PhoneService  MultipleLines
0       0              0        1           0       1             0              1
1       1              0        0           0      34             1              0
2       1              0        0           0       2             1              0
3       1              0        0           0      45             0              1
4       0              0        0           0       2             1              0

(remaining columns truncated in this view)

Training and test data split

# splitting the features and target


X = df.drop(columns=["Churn"])
y = df["Churn"]

# split training and test data


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(y_train.shape)

(5634,)

print(y_train.value_counts())

Churn
0 4138
1 1496
Name: count, dtype: int64

Synthetic Minority Oversampling Technique (SMOTE)

smote = SMOTE(random_state=42)

X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(y_train_smote.shape)

(8276,)

print(y_train_smote.value_counts())

Churn
0 4138
1 4138
Name: count, dtype: int64

5. Model Training

Training with default hyperparameters

# dictionary of models
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42)
}

# dictionary to store the cross validation results


cv_scores = {}

# perform 5-fold cross validation for each model


for model_name, model in models.items():
    print(f"Training {model_name} with default parameters")
    scores = cross_val_score(model, X_train_smote, y_train_smote, cv=5, scoring="accuracy")
    cv_scores[model_name] = scores
    print(f"{model_name} cross-validation accuracy: {np.mean(scores):.2f}")
    print("-"*70)

Training Decision Tree with default parameters


Decision Tree cross-validation accuracy: 0.78
----------------------------------------------------------------------
Training Random Forest with default parameters
Random Forest cross-validation accuracy: 0.84
----------------------------------------------------------------------
Training XGBoost with default parameters
XGBoost cross-validation accuracy: 0.83
----------------------------------------------------------------------

cv_scores

{'Decision Tree': array([0.68297101, 0.71299094, 0.82175227, 0.83564955, 0.83564955]),


'Random Forest': array([0.72524155, 0.77824773, 0.90513595, 0.89425982, 0.90090634]),
'XGBoost': array([0.70048309, 0.75649547, 0.90271903, 0.89486405, 0.90030211])}

Random Forest gives the highest accuracy compared to other models with default parameters
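The notebook keeps the default hyperparameters from here on. Step 4 of the methodology (hyperparameter tuning) could be sketched with a grid search over the SMOTE-resampled training data; the grid below is illustrative, not exhaustive:

from sklearn.model_selection import GridSearchCV

# illustrative parameter grid; wider grids cost more compute
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train_smote, y_train_smote)
print(grid.best_params_, grid.best_score_)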

rfc = RandomForestClassifier(random_state=42)

rfc.fit(X_train_smote, y_train_smote)

RandomForestClassifier(random_state=42)

print(y_test.value_counts())

Churn
0 1036
1 373
Name: count, dtype: int64

6. Model Evaluation

# evaluate on test data


y_test_pred = rfc.predict(X_test)

print("Accuracy Score:\n", accuracy_score(y_test, y_test_pred))


print("Confsuion Matrix:\n", confusion_matrix(y_test, y_test_pred))
print("Classification Report:\n", classification_report(y_test, y_test_pred))

Accuracy Score:
0.7785663591199432
Confusion Matrix:
[[878 158]
[154 219]]
Classification Report:
precision recall f1-score support

0 0.85 0.85 0.85 1036


1 0.58 0.59 0.58 373

accuracy 0.78 1409


macro avg 0.72 0.72 0.72 1409
weighted avg 0.78 0.78 0.78 1409
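The notebook imports pickle and already persists the encoders; for the deployment step described in the methodology, the trained model can be persisted the same way. A minimal sketch (the file name is illustrative):

# save the trained model alongside the feature names (file name is illustrative)
model_data = {"model": rfc, "feature_names": X.columns.tolist()}
with open("customer_churn_model.pkl", "wb") as f:
    pickle.dump(model_data, f)

# later, at prediction time: reload the model and score a customer
with open("customer_churn_model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded["model"].predict(X_test.iloc[[0]]))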

Conclusion and Future Enhancement
Conclusion
Churn prediction using machine learning models such as Decision Trees, Random
Forests, and XGBoost is a critical approach for businesses looking to understand and
mitigate customer attrition. This process involves the systematic analysis of customer
data to identify patterns and indicators that signal the likelihood of a customer leaving
a product or service. By leveraging these models, businesses can take proactive steps
to retain their customers, thereby enhancing customer satisfaction and reducing
revenue loss.
Decision Trees are one of the simplest and most interpretable machine learning
models. They work by splitting the dataset into subsets based on feature values,
creating a tree-like structure where each node represents a feature, each branch
represents a decision rule, and each leaf node represents an outcome. The ease of
interpretation makes decision trees particularly useful for gaining insights into the
factors contributing to customer churn. However, they are prone to overfitting,
especially with complex datasets, which can limit their predictive power.
Random Forests address the overfitting issue inherent in decision trees by creating an
ensemble of multiple decision trees. This ensemble approach, known as bagging
(Bootstrap Aggregating), involves generating multiple samples from the training data
and building separate trees for each sample. The final prediction is made by
aggregating the predictions of the individual trees, which enhances accuracy and
robustness. Random Forests also introduce randomness by selecting a subset of
features for each tree, which reduces correlation between trees and improves
generalization. This model is well-suited for large datasets and high-dimensional
spaces, making it a powerful tool for churn prediction. Additionally, Random Forests
provide feature importance measures, which help identify the most influential factors
driving customer churn.
XGBoost (Extreme Gradient Boosting) is an advanced and highly efficient
implementation of gradient boosting. Unlike Random Forests, which build trees
independently, XGBoost builds trees sequentially, with each new tree correcting the
errors made by the previous ones. This iterative process allows XGBoost to achieve
high levels of accuracy and performance. XGBoost incorporates regularization
techniques (L1 and L2) to prevent overfitting, ensuring that the model generalizes
well to new data. It also supports parallel processing, making it fast and scalable for
large datasets. XGBoost is known for its flexibility, as it can handle missing data and
offers built-in cross-validation for model evaluation and selection. The ability to
handle large and complex datasets efficiently makes XGBoost a popular choice for
churn prediction tasks.
In conclusion, churn prediction using machine learning models like Decision Trees,
Random Forests, and XGBoost is a strategic approach that enables businesses to
identify at-risk customers and take proactive measures to retain them. Each model
offers unique strengths: Decision Trees provide simplicity and interpretability,
Random Forests offer robustness and reduced overfitting, and XGBoost delivers high
performance and efficiency.

Future Enhancement
Future enhancements in churn prediction can focus on integrating advanced
algorithms like deep learning models, developing real-time prediction capabilities, and
improving explainability and transparency. By investing in churn prediction and
continuously evolving these models, businesses can enhance customer retention,
improve satisfaction, and achieve long-term profitability.
