Data Science Case Report

This case study report focuses on predicting customer churn in the telecom industry to enhance retention strategies through data-driven approaches. It details the processes of data handling, cleaning, wrangling, and visualization, utilizing machine learning models to analyze customer behavior and identify key factors contributing to churn. The findings aim to provide actionable insights for telecom companies to improve customer retention and profitability.


Predict customer churn for a telecom company to improve retention strategies.
A CASE STUDY REPORT
Submitted by

VISHWAJITH V (RA2211003010081)
SREYA SUSAN ROY (RA2211003010089)
PARVATHY ULLAS (RA2211003010098)
AARTHI N (RA2211003010126)

For the course


Data Science - 21CSS303T
In partial fulfillment of the requirements for the degree of
BACHELOR OF TECHNOLOGY

DEPARTMENT OF COMPUTING TECHNOLOGIES

SCHOOL OF COMPUTING

FACULTY OF ENGINEERING AND TECHNOLOGY

SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

KATTANKULATHUR - 603 203.


SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203

BONAFIDE CERTIFICATE

Certified that the Data Science case study report titled “Predict customer churn for a
telecom company to improve retention strategies” is the bonafide work of
VISHWAJITH V (RA2211003010081), SREYA SUSAN ROY (RA2211003010089),
PARVATHY ULLAS (RA2211003010098), and AARTHI N (RA2211003010126), who
carried out the case study under my supervision. Certified further that, to the best of my
knowledge, the work reported herein does not form part of any other work.

Faculty Signature

Dr. S. Priya
Assistant Professor
Department of Computing Technologies

Date:
TABLE OF CONTENTS

1. INTRODUCTION

2. DATA HANDLING

3. DATA WRANGLING

4. DATA CLEANING AND PROCESSING

5. DATA VISUALIZATION

6. IMPLEMENTATION

7. CONCLUSIONS AND RESULTS

APPENDICES

SAMPLE CODING
1. INTRODUCTION

Customer churn poses a significant challenge in the telecom industry, where acquiring new
customers often costs more than retaining existing ones. This project focuses on predicting
customer churn using data-driven approaches to support the development of effective retention
strategies. By leveraging historical data—including customer demographics, service usage
patterns, account information, and interaction history—a machine learning model is built to
classify customers as likely to churn or stay. Various classification algorithms are evaluated for
performance, with emphasis on accuracy, precision, recall, and F1-score. The results of the
predictive model are then analyzed to identify key factors contributing to churn, enabling the
company to design targeted interventions. This project not only demonstrates the practical
application of machine learning in business scenarios but also provides actionable insights that
can significantly improve customer retention and profitability for telecom companies.
This project focuses on predicting customer churn in the telecom sector through effective data
preprocessing, integration, and visualization. Additional datasets like service logs and complaints
are merged using Customer_ID to enrich insights. Data is reshaped to track monthly churn trends,
and cleaned to fix missing or inconsistent values. Derived features such as average monthly
charges and customer segmentation based on tenure and billing are created. Visualizations—
including bar plots, scatter plots, and heatmaps—reveal key churn patterns, supporting the
development of targeted retention strategies.

2. DATA HANDLING

Effective data analysis begins with well-structured and thoroughly preprocessed data.
The customer churn dataset, although rich in customer behavior and service usage patterns,
required substantial cleaning, validation, and feature transformation to support meaningful
insights and robust churn modeling.

2.1 Dataset Overview


The dataset consists of over 100,000 customer records, with each entry representing a customer's
service interaction and status.

The key attributes considered for analysis include:

• Customer ID

• Gender, Senior Citizen Status, and Partner/Dependents

• Tenure (in months)

• Service Types (Phone, Internet, Online Security, etc.)

• Charges (MonthlyCharges, TotalCharges)

• Contract Type and Payment Method

• Churn Indicator (target variable)


Initial inspection revealed mixed data types, missing values, and inconsistent formatting—
necessitating thorough preprocessing steps to ensure model readiness.

2.2 Managing Data at Scale


To efficiently handle the dataset, the following strategies were applied:

• Memory Optimization:
Data types were reviewed and converted to reduce memory usage. For instance, object
types were converted to categories where applicable.

• Vectorized Operations:
Operations using NumPy and Pandas were leveraged to improve processing speed and
scalability.

• Batch Processing:
For large intermediate operations, data was processed in chunks to avoid memory overflow.

• Initial Data Exploration:
Using .info(), .describe(), .head(), and .shape, the dataset's structure, types, and summary statistics were examined. Missing values were identified using .isnull().sum(), and the number of unique values per column was determined using .nunique().
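The memory-optimization and inspection steps above can be sketched as follows. The miniature frame and its values are illustrative stand-ins for the churn dataset, not the actual data; the column names follow the Telco-style schema described in Section 2.1.

```python
import pandas as pd

# Toy frame mirroring the churn dataset's schema (values are invented)
contracts = ["Month-to-month", "One year", "Month-to-month", "Two year"] * 50
df = pd.DataFrame({"Contract": contracts,
                   "MonthlyCharges": [70.5, 55.2, 89.9, 20.0] * 50})

before = df.memory_usage(deep=True).sum()
# Object columns with few distinct values compress well as categories
df["Contract"] = df["Contract"].astype("category")
after = df.memory_usage(deep=True).sum()

# Quick structural checks used during initial exploration
print(df.shape)                  # row and column counts
print(df.isnull().sum().sum())   # total missing values
print(after < before)            # category dtype saves memory here
```

The same pattern scales to the full dataset: converting repetitive object columns to `category` typically cuts their memory footprint by an order of magnitude.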

2.3 Data Cleaning


Critical cleaning steps included:

• Handling 'TotalCharges' Column:
Initially stored as a string due to mixed types, TotalCharges was converted to numeric. Rows with null values (often new customers without billed charges) were removed.

• Categorical and Numerical Segregation:
All categorical and numerical columns were listed and reviewed. Binary columns with 'Yes'/'No' were mapped to 1/0 for consistency.

• Encoding Categorical Variables:

• Binary Encoding was applied to binary features.

• One-Hot Encoding was used for multi-class categorical columns to prepare the
data for machine learning models.
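A minimal sketch of these cleaning and encoding steps, assuming Telco-style column names; the three-row frame and its values (including the blank TotalCharges entry for a new customer) are illustrative:

```python
import pandas as pd

# Illustrative frame; a blank string stands in for an unbilled new customer
df = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "1889.5"],
    "Partner": ["Yes", "No", "Yes"],
    "InternetService": ["DSL", "Fiber optic", "No"],
})

# Coerce the string column to numeric; unparseable entries become NaN, then drop them
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"])

# Map a binary Yes/No column to 1/0
df["Partner"] = df["Partner"].map({"Yes": 1, "No": 0})

# One-hot encode the multi-class column for model readiness
df = pd.get_dummies(df, columns=["InternetService"])
print(df.columns.tolist())
```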

2.4 Feature Visualization and Preprocessing


Before modeling, the dataset was visually and statistically explored:

• Churn Distribution:
A count plot was used to visualize the imbalance in churned vs. non-churned customers.

• Correlation Heatmap:
Feature correlations were visualized using a heatmap to assess relationships between
numerical fields and the target variable.

• Standardization:
Using StandardScaler, all feature columns were scaled to have zero mean and unit
variance to support algorithms sensitive to scale.

2.5 Train-Test Preparation
To enable robust modeling:

• Features (X) and target (y) were separated.

• Data was split using an 80-20 ratio for training and testing using train_test_split, with
stratification to preserve class distribution.

• The final shapes of training and test sets were printed to verify the split integrity.
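The stratified 80-20 split can be sketched as below; the synthetic feature matrix and the 80/20 class balance are stand-ins for the preprocessed churn data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and imbalanced target standing in for the real frame
X = pd.DataFrame({"tenure": range(100),
                  "MonthlyCharges": [50.0 + i for i in range(100)]})
y = pd.Series([0] * 80 + [1] * 20, name="Churn")

# 80-20 split; stratify=y preserves the class ratio in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # verify split integrity
print(y_test.mean())                # matches the full-data churn rate
```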

At the end of this process:

• The dataset was fully cleaned and preprocessed.

• All missing values were addressed and all features were appropriately scaled or encoded.

• A consistent and memory-efficient dataset was ready for training classification models
and generating churn predictions.

3. DATA WRANGLING

Data wrangling, also known as data munging, is the process of transforming and organizing raw
data into a structured and usable format. This step bridges the gap between data collection and
meaningful churn analysis.

For the churn dataset, data wrangling involved integrating multiple data sources, reshaping data
to reflect customer behavior over time, and engineering features critical for detecting churn trends
and risk periods.

3.1 Structural Refinement


The original churn dataset included customer demographic details, usage patterns, and churn
indicators, spread across several columns.

Additionally, external datasets such as customer complaints and service usage logs were
merged using Customer_ID as the primary key. This integration enriched the dataset for more
nuanced analysis.

Key structural refinements included:

• All column names were displayed to gain an overview of available fields.

• Missing values in the tenure column were dropped to maintain integrity in tenure-based analysis.

• The tenure column was converted to integer type to allow for numeric operations.

• A new column Churn_Flag was introduced with a default value of 0, later updated to
1 for customers who churned in their final recorded month.

3.2 Creating Analytical Dimensions


To facilitate more targeted analysis and trend detection, new analytical dimensions were
derived:

• Time-Based Grouping:
Customer data was grouped by Month_Number to assess monthly behavior patterns.

• Churn Flagging:
The Churn_Flag allowed for binary classification of churn status per customer-month record.

• Monthly Churn Rate Calculation:
For each month, the total number of customers and those who churned were counted, and the monthly churn rate was computed accordingly.

• Column Renaming:
Relevant columns were renamed to enhance clarity and consistency for downstream
analysis and visualization.
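The monthly churn-rate calculation can be sketched with a toy set of customer-month records; the Customer_ID, Month_Number, and Churn_Flag columns follow the conventions above, while the values are invented:

```python
import pandas as pd

# Toy customer-month records using the Churn_Flag convention
df = pd.DataFrame({
    "Customer_ID":  [1, 1, 2, 2, 3, 3],
    "Month_Number": [1, 2, 1, 2, 1, 2],
    "Churn_Flag":   [0, 1, 0, 0, 0, 1],
})

# Per-month customer totals and churn counts, then the monthly churn rate
monthly = (
    df.groupby("Month_Number")["Churn_Flag"]
      .agg(Total_Customers="count", Churned="sum")
      .reset_index()
)
monthly["Churn_Rate"] = monthly["Churned"] / monthly["Total_Customers"]
print(monthly)
```

On the real dataset, sorting this table by Churn_Rate is what surfaces the high-risk tenure months discussed in Section 3.3.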
3.3 High-Risk Period Identification
To uncover critical risk windows in customer tenure, the dataset was further analyzed as
follows:

• Using pivot tables and time-based aggregation, churn patterns were monitored across
different months of tenure.

• Particular attention was given to months where churn rates peaked, notably around
the 3rd and 6th months.

• A filtered DataFrame was created to isolate these high-risk periods, supporting focused retention strategy planning.
3.4 Visualization and Dataset Finalization

• Visualization libraries such as matplotlib.pyplot and seaborn were imported.

• A line plot was generated showing churn rate across months of customer tenure.

• Titles, axis labels, and a grid were added to improve interpretability of the visual
output.

• The resulting plot highlighted monthly churn fluctuations, aiding in pattern recognition.
After all transformations, the cleaned dataset:

• Contained consistent, structured, and enriched values.

• Enabled month-wise churn trend breakdowns and early risk detection.

• Was saved as a new file to preserve reproducibility and avoid unintentional modifications to the original dataset.

4. DATA CLEANING AND PROCESSING

In any data-driven project, cleaning and processing the data is a critical step that directly
influences the quality, reliability, and accuracy of the analysis.
For the Customer Churn dataset, data cleaning and processing involved identifying and
resolving missing or inconsistent values, standardizing formats, engineering useful features,
and ensuring the dataset was fully prepared for modeling and insightful analysis.

4.1 Handling Missing Data


Upon detailed examination of the dataset, a few key columns contained missing or invalid
values, particularly:

• TotalCharges

• PaymentMethod (contained inconsistencies such as extra spaces or inconsistent casing)
To preserve the integrity of the dataset:

• Missing or non-numeric values in TotalCharges were converted to NaN and then filled with 0, representing no charges incurred.

• The PaymentMethod column was cleaned by trimming white spaces and standardizing capitalization to ensure uniform entries.
Additionally, general null value checks were performed using df.isnull().sum() to confirm
the completeness of the data post-cleaning.
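A minimal sketch of these two cleaning steps; the sample values (including the blank TotalCharges entry and the inconsistently cased payment methods) are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "TotalCharges": ["100.5", " ", "250.0"],
    "PaymentMethod": [" electronic check", "Mailed Check ", "ELECTRONIC CHECK"],
})

# Coerce to numeric (blanks become NaN), then fill with 0 for unbilled customers
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0)

# Trim whitespace and standardize casing so identical methods compare equal
df["PaymentMethod"] = df["PaymentMethod"].str.strip().str.title()

print(df["PaymentMethod"].unique())
print(df.isnull().sum().sum())  # confirm completeness post-cleaning
```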

4.2 Standardizing Data Formats


To ensure consistency and accurate downstream processing, several standardization steps
were taken:

• Categorical Variables:
Categorical features such as PaymentMethod, Contract, and InternetService were
encoded using label encoding for binary variables and one-hot encoding for multi-class fields.

• Numerical Conversion:
The TotalCharges column was explicitly converted to a numeric type using
pd.to_numeric() to support mathematical operations.
This standardization enabled reliable filtering, grouping, and visualization across various
customer attributes.
4.3 Feature Engineering
To derive greater insight from the dataset, new features were constructed:

• Avg_Charges_per_Month:
A new column was created by dividing TotalCharges by Tenure. A condition was
added to handle tenure values of zero, assigning 0 in such cases.

• Tenure Grouping:
Customers were classified into categories based on their tenure:

• New (0–12 months)

• Regular (13–36 months)

• Loyal (37+ months)


These features supported segmentation and allowed for more targeted analysis and modeling.
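The two derived features can be sketched as follows. The zero-tenure guard and the New/Regular/Loyal bucket edges follow the rules stated above; the three sample rows are invented:

```python
import numpy as np
import pandas as pd

# Illustrative rows: a brand-new customer, a one-year customer, a long-term one
df = pd.DataFrame({
    "TotalCharges": [0.0, 600.0, 3700.0],
    "tenure":       [0,   12,    37],
})

# Average monthly spend, guarding against division by zero for new customers
df["Avg_Charges_per_Month"] = np.where(
    df["tenure"] > 0, df["TotalCharges"] / df["tenure"], 0
)

# Tenure buckets matching the New (0-12) / Regular (13-36) / Loyal (37+) rule
df["Tenure_Group"] = pd.cut(
    df["tenure"],
    bins=[-1, 12, 36, float("inf")],
    labels=["New", "Regular", "Loyal"],
)
print(df)
```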

4.4 Outlier Detection


Outliers in the MonthlyCharges field were assessed using:

• Z-score method:
Z-scores were calculated using scipy.stats.zscore() and extreme values (Z > 3 or Z <
-3) were flagged.
Based on their influence on model performance, outliers were either capped or
excluded.

• Boxplot Visualization:
Seaborn boxplots were generated to visually inspect the distribution of
MonthlyCharges and confirm the presence and handling of outliers.
This structured and comprehensive cleaning process ensured that the customer churn dataset
was ready for robust predictive modeling and insightful visual exploration.
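The Z-score flagging and capping steps can be sketched as below. The charge values are synthetic, and capping at the 99th percentile is one illustrative choice among the handling options mentioned above:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Mostly typical monthly charges with one extreme value injected
charges = pd.Series([50.0] * 30 + [55.0] * 30 + [500.0], name="MonthlyCharges")

z = zscore(charges)                 # standard scores over the column
outliers = charges[np.abs(z) > 3]   # flag |Z| > 3 as extreme

# Cap flagged values at the 99th percentile instead of dropping rows
capped = charges.clip(upper=charges.quantile(0.99))
print(len(outliers), capped.max() < charges.max())
```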

5. DATA VISUALIZATION

Visualizing data is a crucial step in exploratory data analysis (EDA).


It allows us to uncover patterns, trends, and relationships that may not be immediately visible
in raw tabular data.
For the Customer Churn dataset, a variety of visualization techniques were used to explore
customer behavior, feature correlations, and churn trends. These visualizations aided in
generating hypotheses and informed the selection of features for modeling.

5.1 Churn Rate Across Contract Types


A bar plot was created to compare churn rates across different Contract_Type categories.
Key Observations:

• Customers on month-to-month contracts exhibited the highest churn rates.

• Those on one-year and two-year contracts had comparatively lower churn, suggesting
contract length plays a key role in customer retention.

• The visualization revealed the influence of contractual commitment on customer behavior, making it a valuable feature in churn prediction.
5.2 Tenure vs Monthly Charges Scatter Plot
A scatter plot was generated to visualize the relationship between Tenure and
MonthlyCharges, with points colored based on Churn status.
Key Observations:

• Churned customers were mostly clustered in the region of short tenure and high
monthly charges.

• Long-term customers were less likely to churn, especially those paying moderate
charges.

• This visualization helped in identifying potential high-risk customer profiles and supported segmentation strategies.
5.3 Pair Plot of Key Features
A pair plot was created to examine pairwise relationships among the following features:
Tenure, MonthlyCharges, TotalCharges, and Churn.
Key Observations:

• Positive correlation between Tenure and TotalCharges.

• Distinct clusters appeared when segmented by Churn, further validating patterns seen
in the scatter plot.

• The pair plot provided a multidimensional view of customer behavior and helped
visualize the separability of churned and retained users.

5.4 Heatmap of Correlation Among Numeric Features


A correlation heatmap was plotted to assess the strength of linear relationships among
numerical features.
Key Observations:

• Tenure and TotalCharges had a strong positive correlation.

• MonthlyCharges showed weaker correlation with other variables but was still a
significant influencer on churn.

• This visualization assisted in identifying which features contributed most to churn and were candidates for further modeling.

These visualizations provided a comprehensive understanding of churn patterns, feature interdependencies, and customer profiles, enabling better-informed modeling and business decisions.

6. IMPLEMENTATION

Data-driven analysis is a powerful tool for revealing underlying behavioral patterns and
enabling strategic business decisions in domains like customer retention, revenue
optimization, and personalized service delivery.
For the Customer Churn dataset, a structured approach was followed to implement churn
prediction, feature-driven customer segmentation, and outlier handling in place of building
a direct recommendation system.
This phase focused on building analytical models and visual tools to understand the factors
that drive customer attrition.

6.1 Objective
The primary goal of this implementation was to:

• Analyze customer churn trends across key demographics and service attributes.

• Detect unusual billing patterns (outliers) that could skew churn behavior analysis.

• Segment customers based on tenure and spending behavior for targeted marketing or
retention efforts.

This approach supports telecom operators and analysts in identifying vulnerable customer
segments and designing data-informed retention strategies.

6.2 Strategy and Methodology


The overall analysis strategy was divided into logical phases:
a. Data Preparation

• Handled missing values, particularly in the TotalCharges column, converting to numeric and imputing with 0 where necessary.

• Cleaned and standardized categorical variables such as PaymentMethod.

• Engineered new features:

• Avg_Charges_per_Month = TotalCharges / Tenure

• Tenure_Group categorizing customers as 'New', 'Regular', or 'Loyal'

• Spending categories: 'Low', 'Medium', 'High'
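One way to derive the Low/Medium/High spending bands is quantile binning. The use of pd.qcut with equal-sized tertiles here is an assumption, since the report does not state the exact thresholds, and the charge values are invented:

```python
import pandas as pd

# Hypothetical monthly charges; qcut splits them into equal-sized bands
df = pd.DataFrame({"MonthlyCharges": [20, 35, 50, 65, 80, 95]})
df["Spending_Group"] = pd.qcut(
    df["MonthlyCharges"], q=3, labels=["Low", "Medium", "High"]
)
print(df["Spending_Group"].tolist())
```

Fixed currency thresholds would work equally well if the business prefers interpretable cut-offs over balanced group sizes.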

b. Churn Analysis

• Bar plots were used to compare churn rates across different Contract Types and
Payment Methods.

• Temporal patterns were explored using tenure-based groupings to observe how long-term customers behaved compared to new ones.

• Correlations were analyzed to determine how variables like MonthlyCharges, TotalCharges, and Tenure influence churn likelihood.

c. Outlier Detection

• Z-score and boxplot techniques were applied to identify outliers in the MonthlyCharges variable.

• Outliers were evaluated for their potential influence on churn predictions and were
either retained, capped, or removed accordingly.

d. Customer Segmentation

• Based on engineered features, customers were segmented into meaningful groups:

• Tenure Groups: to analyze loyalty patterns

• Spending Groups: to assess service value perception

• These segments were used for subgroup analysis, enabling the identification of high-risk and high-value customer clusters.

This structured implementation enabled the development of a robust understanding of customer churn patterns, providing a foundation for building predictive models and actionable business strategies.

7. CONCLUSIONS AND RESULTS

This project focused on analyzing and predicting customer churn for a telecom company using a comprehensive dataset of over 100,000 customer records. Through systematic data handling, wrangling, cleaning, and visualization, we were able to uncover critical patterns that contribute to customer attrition. Key insights revealed that churn rates were significantly influenced by contract types, tenure duration, monthly charges, and payment methods. Customers with month-to-month contracts and higher monthly charges were more likely to churn, especially around the 3rd and 6th months of service. By deriving new features such as average charges per month and categorizing customers based on tenure and billing behavior, we could segment customers into meaningful groups such as "New," "Regular," and "Loyal." Visual tools such as bar plots, scatter plots, pair plots, and heatmaps provided a deeper understanding of correlations and churn-driving factors.

Overall, this analysis equips telecom companies with actionable insights to improve customer retention strategies. Interventions such as targeted loyalty programs, revisiting monthly plans, or improving customer support during high-risk periods can significantly reduce churn and enhance customer satisfaction. Future work may involve applying machine learning models for predictive analytics and building a real-time churn alert system to proactively engage at-risk customers.

Fig 1. Churn Rate by Contract Type

Fig 2. Tenure vs Monthly Charges Colored by Churn

Fig 3. Pair Plot of Selected Features and Churn

Fig 4. Correlation Heatmap
APPENDICES
SAMPLE CODING
# 1. Importing required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 2. Loading the dataset
df = pd.read_csv('casestud.csv')

# 3. Bar plot: churn rate across contract types
plt.figure(figsize=(8, 6))
sns.barplot(data=df, x='Contract',
            y=df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0),
            errorbar=None)
plt.title('Churn Rate by Contract Type')
plt.ylabel('Churn Rate')
plt.xlabel('Contract Type')
plt.show()

# 4. Scatter plot: tenure vs monthly charges colored by churn
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='tenure', y='MonthlyCharges', hue='Churn')
plt.title('Tenure vs Monthly Charges Colored by Churn')
plt.xlabel('Tenure (months)')
plt.ylabel('Monthly Charges')
plt.show()

# 5. Pair plot of tenure, MonthlyCharges, TotalCharges, and Churn
sns.pairplot(df[['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']],
             hue='Churn', diag_kind='kde')
plt.suptitle('Pair Plot of Selected Features and Churn', y=1.02)
plt.show()

# 6. Heatmap of correlation among numeric features
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df_clean = df[['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']].dropna()
df_clean['Churn'] = df_clean['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)
plt.figure(figsize=(10, 6))
correlation = df_clean.corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()
