Data Science Case Report
Predict customer churn for a telecom company to improve retention strategies
A CASE STUDY REPORT
Submitted by
VISHWAJITH V(RA2211003010081)
SREYA SUSAN ROY(RA2211003010089)
PARVATHY ULLAS(RA2211003010098)
AARTHI.N(RA2211003010126)
SCHOOL OF COMPUTING
BONAFIDE CERTIFICATE
Certified that Data Science, A Case Study Report titled “Predict customer churn for a
telecom company to improve retention strategies” is the bonafide work of
VISHWAJITH V(RA2211003010081), SREYA SUSAN ROY(RA2211003010089),
PARVATHY ULLAS(RA2211003010098), AARTHI.N(RA2211003010126) who
carried out the case study under my supervision. Certified further, that to the best of my
knowledge, the work reported herein does not form part of any other work.
Faculty Signature
Dr.S.Priya
Assistant Professor
Department of Computing Technologies
Date:
TABLE OF CONTENTS
1. INTRODUCTION
2. DATA HANDLING
3. DATA WRANGLING
4. DATA CLEANING AND PROCESSING
5. DATA VISUALIZATION
6. IMPLEMENTATION
7. CONCLUSIONS AND RESULTS
APPENDICES
SAMPLE CODING
1. INTRODUCTION
Customer churn poses a significant challenge in the telecom industry, where acquiring new
customers often costs more than retaining existing ones. This project focuses on predicting
customer churn using data-driven approaches to support the development of effective retention
strategies. By leveraging historical data—including customer demographics, service usage
patterns, account information, and interaction history—a machine learning model is built to
classify customers as likely to churn or stay. Various classification algorithms are evaluated for
performance, with emphasis on accuracy, precision, recall, and F1-score. The results of the
predictive model are then analyzed to identify key factors contributing to churn, enabling the
company to design targeted interventions. This project not only demonstrates the practical
application of machine learning in business scenarios but also provides actionable insights that
can significantly improve customer retention and profitability for telecom companies.
The approach combines data preprocessing, integration, and visualization. Additional datasets such as service logs and complaints
are merged using Customer_ID to enrich insights. Data is reshaped to track monthly churn trends,
and cleaned to fix missing or inconsistent values. Derived features such as average monthly
charges and customer segmentation based on tenure and billing are created. Visualizations—
including bar plots, scatter plots, and heatmaps—reveal key churn patterns, supporting the
development of targeted retention strategies.
2. DATA HANDLING
Effective data analysis begins with well-structured and thoroughly preprocessed data.
The customer churn dataset, although rich in customer behavior and service usage patterns,
required substantial cleaning, validation, and feature transformation to support meaningful
insights and robust churn modeling.
• Customer ID
• Memory Optimization:
Data types were reviewed and converted to reduce memory usage. For instance, object
types were converted to categories where applicable.
• Vectorized Operations:
Operations using NumPy and Pandas were leveraged to improve processing speed and
scalability.
• Batch Processing:
For large intermediate operations, data was processed in chunks to avoid memory
overflow.
• One-Hot Encoding was used for multi-class categorical columns to prepare the
data for machine learning models.
• Churn Distribution:
A count plot was used to visualize the imbalance in churned vs. non-churned customers.
• Correlation Heatmap:
Feature correlations were visualized using a heatmap to assess relationships between
numerical fields and the target variable.
• Standardization:
Using StandardScaler, all feature columns were scaled to have zero mean and unit
variance to support algorithms sensitive to scale.
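The memory-optimization and batch-processing bullets above can be illustrated with a minimal sketch. The CSV text, the chunk size of 250 and the column values are invented for illustration; the report does not state which columns were converted or how large the chunks were.

```python
import io
import pandas as pd

# Invented CSV text standing in for a large file on disk (assumption).
csv_text = "Customer_ID,Contract,MonthlyCharges\n" + "".join(
    f"{i},{'Month-to-month' if i % 2 else 'Two year'},{20 + i % 80}\n"
    for i in range(1000)
)

# Batch processing: read 250 rows at a time and keep only a running aggregate.
total_charges = 0.0
rows_seen = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    total_charges += chunk["MonthlyCharges"].sum()  # vectorized column sum
    rows_seen += len(chunk)

# Memory optimization: store a low-cardinality text column as 'category'.
df = pd.read_csv(io.StringIO(csv_text))
before = df["Contract"].memory_usage(deep=True)
df["Contract"] = df["Contract"].astype("category")
after = df["Contract"].memory_usage(deep=True)
```

Comparing `before` and `after` shows the saving for a single column; the gain depends on how many distinct values the column holds.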
2.5 Train-Test Preparation
To enable robust modeling:
• Data was split using an 80-20 ratio for training and testing using train_test_split, with
stratification to preserve class distribution.
• The final shapes of training and test sets were printed to verify the split integrity.
• All missing values were addressed and all features were appropriately scaled or encoded.
• A consistent and memory-efficient dataset was ready for training classification models
and generating churn predictions.
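A minimal sketch of the stratified 80-20 split and the scaling step follows. The feature matrix and the 20% churn share are invented. The report does not say whether the scaler was fit before or after the split; fitting it on the training portion only is an assumption made here, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # invented numeric features
y = np.array([1] * 200 + [0] * 800)                   # invented labels, 20% churners

# stratify=y keeps the churn share equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Assumption: fit the scaler on the training split, then reuse it on the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Printing the shapes verifies the split, as described in the report.
print(X_train.shape, X_test.shape)
```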
3. DATA WRANGLING
Data wrangling, also known as data munging, is the process of transforming and organizing raw
data into a structured and usable format. This step bridges the gap between data collection and
meaningful churn analysis.
For the churn dataset, data wrangling involved integrating multiple data sources, reshaping data
to reflect customer behavior over time, and engineering features critical for detecting churn trends
and risk periods.
Additionally, external datasets such as customer complaints and service usage logs were
merged using Customer_ID as the primary key. This integration enriched the dataset for more
nuanced analysis.
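The merge on Customer_ID might look like the sketch below. The three small tables and the columns `Complaint_Type`, `Complaint_Count` and `Data_GB` are hypothetical. Aggregating complaints to one row per customer before merging is a design choice made here, not something the report describes.

```python
import pandas as pd

# Invented tables; only Customer_ID and tenure are named in the report.
customers = pd.DataFrame({"Customer_ID": [1, 2, 3], "tenure": [5, 24, 1]})
complaints = pd.DataFrame({
    "Customer_ID": [1, 1, 3],
    "Complaint_Type": ["billing", "network", "billing"],
})
usage = pd.DataFrame({"Customer_ID": [1, 2], "Data_GB": [12.5, 40.0]})

# Aggregate to one row per customer so the merge cannot duplicate customers.
complaint_counts = (
    complaints.groupby("Customer_ID").size().rename("Complaint_Count").reset_index()
)

# Left joins keep every customer; validate raises if a key is unexpectedly repeated.
merged = (
    customers
    .merge(complaint_counts, on="Customer_ID", how="left", validate="one_to_one")
    .merge(usage, on="Customer_ID", how="left", validate="one_to_one")
)
merged["Complaint_Count"] = merged["Complaint_Count"].fillna(0).astype(int)
```

Customers with no usage log keep a missing `Data_GB`, which can then be handled in the cleaning stage.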
• Missing values in the tenure column were dropped to maintain integrity in tenure-
based analysis.
• The tenure column was converted to integer type to allow for numeric operations.
• A new column Churn_Flag was introduced with a default value of 0, later updated to
1 for customers who churned in their final recorded month.
• Time-Based Grouping:
Customer data was grouped by Month_Number to assess monthly behavior patterns.
• Churn Flagging:
The Churn_Flag allowed for binary classification of churn status per customer-month
record.
• Column Renaming:
Relevant columns were renamed to enhance clarity and consistency for downstream
analysis and visualization.
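The tenure cleaning, churn flagging and monthly grouping steps above can be sketched as follows. The records are invented, and the `Churn` column holding "Yes"/"No" is an assumed name for whatever field marked churned customers in the source data.

```python
import pandas as pd

# Invented customer-month records; 'Churn' is a hypothetical source column.
records = pd.DataFrame({
    "Customer_ID":  [1, 1, 1, 2, 2, 3, 4],
    "Month_Number": [1, 2, 3, 1, 2, 1, 1],
    "tenure":       [1.0, 2.0, 3.0, 1.0, 2.0, 1.0, None],
    "Churn":        ["Yes", "Yes", "Yes", "No", "No", "Yes", "No"],
})

# Drop rows with missing tenure, then convert the column to integer.
records = records.dropna(subset=["tenure"]).copy()
records["tenure"] = records["tenure"].astype(int)

# Default 0, then 1 only on the final recorded month of customers who churned.
records["Churn_Flag"] = 0
last_month = records.groupby("Customer_ID")["Month_Number"].transform("max")
is_final_churn = (records["Churn"] == "Yes") & (records["Month_Number"] == last_month)
records.loc[is_final_churn, "Churn_Flag"] = 1

# Time-based grouping: churn events and active records per month.
monthly = records.groupby("Month_Number")["Churn_Flag"].agg(["sum", "count"])
```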
3.3 High-Risk Period Identification
To uncover critical risk windows in customer tenure, the dataset was further analyzed as
follows:
• Using pivot tables and time-based aggregation, churn patterns were monitored across
different months of tenure.
• Particular attention was given to months where churn rates peaked, notably around
the 3rd and 6th months.
• A line plot was generated showing churn rate across months of customer tenure.
• Titles, axis labels, and a grid were added to improve interpretability of the visual
output.
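A pivot table of churn rate by tenure month, as described above, could be built along these lines. The data are invented and do not reproduce the report's findings; `Churn_Rate` is a name chosen for illustration.

```python
import pandas as pd

# Invented data: one row per customer with tenure in months.
df = pd.DataFrame({
    "tenure":     [1, 1, 3, 3, 3, 3, 6, 6, 12, 12],
    "Churn_Flag": [0, 1, 1, 1, 0, 1, 1, 0, 0, 0],
})

# Mean of a 0/1 flag per tenure month equals the churn rate for that month.
churn_by_month = pd.pivot_table(
    df, index="tenure", values="Churn_Flag", aggfunc="mean"
).rename(columns={"Churn_Flag": "Churn_Rate"})

# Tenure month with the highest churn rate in this toy sample.
peak_month = churn_by_month["Churn_Rate"].idxmax()
```

With matplotlib installed, `churn_by_month["Churn_Rate"].plot(marker="o", grid=True)` would draw the corresponding line plot.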
4. DATA CLEANING AND PROCESSING
In any data-driven project, cleaning and processing the data is a critical step that directly
influences the quality, reliability, and accuracy of the analysis.
For the Customer Churn dataset, data cleaning and processing involved identifying and
resolving missing or inconsistent values, standardizing formats, engineering useful features,
and ensuring the dataset was fully prepared for modeling and insightful analysis.
• TotalCharges
• Categorical Variables:
Categorical features such as PaymentMethod, Contract, and InternetService were
encoded using label encoding for binary variables and one-hot encoding for multi-
class fields.
• Numerical Conversion:
The TotalCharges column was explicitly converted to a numeric type using
pd.to_numeric() to support mathematical operations.
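The conversion and encoding steps can be sketched as below. The report only names `pd.to_numeric()`; passing `errors="coerce"` is an assumption about how blank entries were handled, and the `Partner` column is a hypothetical binary field used for illustration.

```python
import pandas as pd

# Invented rows; 'Partner' is a hypothetical binary column.
df = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "1889.5", "108.15"],
    "Partner": ["Yes", "No", "No", "Yes"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Credit card", "Mailed check"],
})

# errors="coerce" turns unparseable strings such as blanks into NaN instead of raising.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
n_missing = int(df["TotalCharges"].isna().sum())

# Binary column: map to 0/1, a simple form of label encoding.
df["Partner"] = df["Partner"].map({"No": 0, "Yes": 1})

# Multi-class column: one-hot encode into one indicator column per category.
df = pd.get_dummies(df, columns=["PaymentMethod"], dtype=int)
```

Counting `n_missing` after the conversion shows how many rows still need imputation or removal.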
This standardization enabled reliable filtering, grouping, and visualization across various
customer attributes.
4.3 Feature Engineering
To derive greater insight from the dataset, new features were constructed:
• Avg_Charges_per_Month:
A new column was created by dividing TotalCharges by Tenure. A condition was
added to handle tenure values of zero, assigning 0 in such cases.
• Tenure Grouping:
Customers were classified into categories based on their tenure:
• Z-score method:
Z-scores were calculated using scipy.stats.zscore() and extreme values (Z > 3 or Z <
-3) were flagged.
Based on their influence on model performance, outliers were either capped or
excluded.
• Boxplot Visualization:
Seaborn boxplots were generated to visually inspect the distribution of
MonthlyCharges and confirm the presence and handling of outliers.
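The average-charge feature and the Z-score check can be sketched together. The values are invented. The Z-score is computed by hand with the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore`, and capping at the largest non-outlier value is only one possible treatment, assumed here for illustration.

```python
import pandas as pd

# Invented data: 21 customers, one with an extreme monthly charge.
df = pd.DataFrame({
    "Tenure": [0, 2, 10] + [12] * 18,
    "TotalCharges": [0.0, 100.0, 550.0] + [720.0] * 18,
    "MonthlyCharges": [float(v) for v in range(50, 70)] + [500.0],
})

# Avg_Charges_per_Month: divide only where tenure is positive, otherwise assign 0.
safe_tenure = df["Tenure"].where(df["Tenure"] > 0)  # zero tenure becomes NaN
df["Avg_Charges_per_Month"] = (df["TotalCharges"] / safe_tenure).fillna(0.0)

# Z-score with population std (ddof=0); flag |Z| > 3 as an outlier.
charges = df["MonthlyCharges"]
z = (charges - charges.mean()) / charges.std(ddof=0)
df["Is_Outlier"] = z.abs() > 3

# Assumed treatment: cap flagged values at the largest non-outlier charge.
cap = charges[~df["Is_Outlier"]].max()
df["MonthlyCharges_Capped"] = charges.clip(upper=cap)
```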
This structured and comprehensive cleaning process ensured that the customer churn dataset
was ready for robust predictive modeling and insightful visual exploration.
5. DATA VISUALIZATION
• Those on one-year and two-year contracts had comparatively lower churn, suggesting
contract length plays a key role in customer retention.
• Churned customers were mostly clustered in the region of short tenure and high
monthly charges.
• Long-term customers were less likely to churn, especially those paying moderate
charges.
• Positive correlation between Tenure and TotalCharges.
• Distinct clusters appeared when segmented by Churn, further validating patterns seen
in the scatter plot.
• The pair plot provided a multidimensional view of customer behavior and helped
visualize the separability of churned and retained users.
• MonthlyCharges showed weaker correlation with other variables but was still a
significant influencer on churn.
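The correlation matrix behind such a heatmap can be computed as in the sketch below. The numbers are invented and only show the mechanics; `Churn_Flag` is derived here from an assumed "Yes"/"No" `Churn` column.

```python
import pandas as pd

# Invented sample; TotalCharges is built from the other two columns for illustration.
df = pd.DataFrame({
    "Tenure": [1, 5, 12, 24, 48, 60],
    "MonthlyCharges": [70.0, 85.0, 55.0, 90.0, 60.0, 75.0],
    "Churn": ["Yes", "Yes", "No", "No", "No", "No"],
})
df["TotalCharges"] = df["Tenure"] * df["MonthlyCharges"]
df["Churn_Flag"] = (df["Churn"] == "Yes").astype(int)

# Pearson correlation between the numeric fields and the churn flag.
corr = df[["Tenure", "MonthlyCharges", "TotalCharges", "Churn_Flag"]].corr()
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` would render it as an annotated heatmap.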
6. IMPLEMENTATION
Data-driven analysis is a powerful tool for revealing underlying behavioral patterns and
enabling strategic business decisions in domains like customer retention, revenue
optimization, and personalized service delivery.
For the Customer Churn dataset, a structured approach was followed to implement churn
prediction, feature-driven customer segmentation, and outlier handling in place of building
a direct recommendation system.
This phase focused on building analytical models and visual tools to understand the factors
that drive customer attrition.
6.1 Objective
The primary goal of this implementation was to:
• Analyze customer churn trends across key demographics and service attributes.
• Detect unusual billing patterns (outliers) that could skew churn behavior analysis.
• Segment customers based on tenure and spending behavior for targeted marketing or
retention efforts.
This approach supports telecom operators and analysts in identifying vulnerable customer
segments and designing data-informed retention strategies.
b. Churn Analysis
• Bar plots were used to compare churn rates across different Contract Types and
Payment Methods.
• Temporal patterns were explored using tenure-based groupings to observe how long-
term customers behaved compared to new ones.
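The churn-rate comparison that feeds these bar plots can be sketched with a group-by. The rows are invented and the resulting rates are not the report's results.

```python
import pandas as pd

# Invented sample of ten customers.
df = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 3 + ["Two year"] * 3,
    "PaymentMethod": ["Electronic check", "Mailed check"] * 5,
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No", "No"],
})
df["Churn_Flag"] = (df["Churn"] == "Yes").astype(int)

# Mean of the 0/1 flag per group is the churn rate of that group.
churn_by_contract = (
    df.groupby("Contract")["Churn_Flag"].mean().sort_values(ascending=False)
)
churn_by_payment = df.groupby("PaymentMethod")["Churn_Flag"].mean()
```

With matplotlib installed, `churn_by_contract.plot(kind="bar")` would draw the bar plot.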
c. Outlier Detection
• Outliers were evaluated for their potential influence on churn predictions and were
either retained, capped, or removed accordingly.
d. Customer Segmentation
• These segments were used for subgroup analysis, enabling the identification of high-
risk and high-value customer clusters.
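Tenure- and billing-based segmentation could be sketched as below. The labels "New", "Regular" and "Loyal" appear in the report's conclusions, but the cut-offs of 12 and 48 months and the median split on monthly charges are hypothetical values chosen for illustration.

```python
import pandas as pd

# Invented sample of eight customers.
df = pd.DataFrame({
    "Tenure": [0, 3, 12, 13, 30, 48, 49, 72],
    "MonthlyCharges": [25.0, 80.0, 60.0, 95.0, 45.0, 70.0, 110.0, 20.0],
})

# Hypothetical cut-offs: up to 12 months = New, 13 to 48 = Regular, above 48 = Loyal.
df["Tenure_Group"] = pd.cut(
    df["Tenure"],
    bins=[-1, 12, 48, float("inf")],
    labels=["New", "Regular", "Loyal"],
)

# Hypothetical billing split at the median monthly charge.
median_charge = df["MonthlyCharges"].median()
df["Billing_Group"] = (df["MonthlyCharges"] > median_charge).map(
    {True: "High", False: "Low"}
)

# Size of each tenure-by-billing segment.
segment_sizes = df.groupby(["Tenure_Group", "Billing_Group"], observed=True).size()
```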
7. CONCLUSIONS AND RESULTS
This project focused on analyzing and predicting customer churn for a telecom company using a
comprehensive dataset of over 100,000 customer records. Through systematic data handling,
wrangling, cleaning, and visualization, we were able to uncover critical patterns that contribute
to customer attrition. Key insights revealed that churn rates were significantly influenced by
contract types, tenure duration, monthly charges, and payment methods. Customers with month-
to-month contracts and higher monthly charges were more likely to churn, especially around the
3rd and 6th month of service. By deriving new features such as average charges per month and
categorizing customers based on tenure and billing behavior, we could segment customers into
meaningful groups like “New,” “Regular,” and “Loyal.” Visual tools such as bar plots, scatter
plots, pair plots, and heatmaps provided a deeper understanding of correlations and churn-driving
factors.
Overall, this analysis equips telecom companies with actionable insights to improve customer
retention strategies. Interventions such as targeted loyalty programs, revisiting monthly plans, or
improving customer support during high-risk periods can significantly reduce churn and enhance
customer satisfaction. Future work may involve applying machine learning models for predictive
analytics and building a real-time churn alert system to proactively engage at-risk customers.
Fig 2. Tenure vs Monthly Charges Colored by Churn
Fig 4. Correlation Heatmap
APPENDICES
SAMPLE CODING
# 1. Importing Required Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt