DataScience Project-New
PROJECT REPORT
Submitted by:
B S Sanath RA2211042020020
Swathika M RA2211042020045
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND BUSINESS SYSTEMS ENGINEERING
BONAFIDE CERTIFICATE
Certified that this project report “POLICE CRIME ANALYSIS” is the bonafide
work of “SANJAY A N (RA2211042020022), AKHIL AHMED
(RA2211042020008), SIDDHARTH S (RA2211042020054)”, who carried out the
21CSC355T – Data Mining and Analytics project work under my supervision.
SIGNATURE
Ms. S. Vaishnavii,M.E
APPENDIX
1. SAMPLE CODING
2. SAMPLE OUTPUT
Customer retention is a critical factor in the success of any business, particularly in industries with
high competition, such as telecommunications, banking, and subscription-based services. Customer
churn refers to the rate at which customers stop using a company’s product or service over a given
period. Understanding and predicting churn can help businesses take proactive measures to enhance
customer satisfaction, improve services, and optimize marketing strategies.
This report presents a Customer Churn Prediction Model developed using machine learning
techniques to identify key factors contributing to customer attrition. The model leverages Random
Forest and Decision Tree Classifiers to predict whether a customer is likely to churn based on
demographic, service usage, and contract-related features. To address class imbalance, the Synthetic
Minority Over-sampling Technique (SMOTE) is applied so that the minority (churn) class is better
represented during training. The model's hyperparameters are tuned using grid search with
cross-validation, and its performance is evaluated through key metrics such as accuracy, precision,
recall, F1-score, and ROC-AUC.
The primary objective of this model is to provide businesses with predictive insights into customer
behavior, enabling them to take preemptive actions before a customer decides to leave. By analyzing
patterns in customer demographics, service usage, contract types, and support interactions, the model
helps organizations pinpoint the major causes of churn. This allows for the development of targeted
customer engagement strategies, such as personalized offers, loyalty programs, and improved
customer service interventions, ultimately leading to higher retention rates and reduced revenue
losses.
Beyond prediction, this model also serves as a strategic tool for business optimization. By identifying
key factors driving customer churn, businesses can refine their service offerings, pricing strategies,
and customer support systems to better meet customer needs. The insights derived from this analysis
can also inform marketing strategies, helping companies allocate resources efficiently to maximize
customer lifetime value (CLV). With this approach, businesses can shift from a reactive churn
management model to a proactive, data-driven decision-making framework, ultimately fostering
long-term customer loyalty and business growth.
2. Dataset Description
The Customer Churn Prediction Model is built on an extensive dataset containing customer
demographic details, service subscription attributes, and account-related information. The dataset
consists of thousands of customer records, each characterized by multiple numerical and categorical
features that influence customer retention and churn behavior. The primary goal of this dataset is to
provide a comprehensive view of customer engagement and service usage patterns, enabling
machine learning models to identify key factors contributing to churn.
The dataset contains a mix of numerical and categorical features, ensuring that the model can capture
the complex interactions between customer demographics, service preferences, and churn tendencies.
By analyzing this data, the model aims to predict whether a customer is likely to leave the service
provider, allowing businesses to take proactive retention measures.
1. Tenure: A numerical value representing the number of months a customer has been with the
company. Longer tenure generally indicates higher loyalty.
2. Contract Type: Categorical data indicating whether the customer has a month-to-month, one-
year, or two-year contract. Customers with longer contracts tend to have lower churn rates.
3. Monthly Charges: The amount a customer is billed each month. Higher charges may
influence churn, especially if customers perceive a lack of value in the services provided.
4. Total Charges: The cumulative amount paid by a customer over their entire tenure. It can
indicate long-term customer value and spending behavior.
5. Payment Method: The mode of payment used by customers, such as electronic check, mailed
check, bank transfer, or credit card. Certain payment methods may correlate with higher
churn rates.
6. Internet Service Type: Indicates whether a customer has DSL, Fiber Optic, or No Internet
Service. Fiber optic users may have different churn tendencies compared to DSL users.
7. Phone Service: A binary feature specifying whether a customer has a phone service.
8. Multiple Lines: Whether the customer has multiple phone lines, which may suggest higher
engagement with the provider.
9. Online Security: Indicates whether the customer has subscribed to an additional online
security service. Customers who use such services may have a higher perceived value of the
provider.
10. Online Backup: A feature indicating whether customers have cloud-based backup services.
11. Tech Support: Whether a customer has access to technical support services, which could
impact customer satisfaction and churn likelihood.
12. Streaming TV and Streaming Movies: Specifies whether the customer has access to
streaming services as part of their subscription package. Entertainment-based services can
affect customer retention.
13. Device Protection: Indicates whether a customer has opted for device protection plans, which
may add value to their subscription.
14. Dependents: A categorical feature specifying whether the customer has dependents.
Customers with families may have different usage behaviors and churn patterns.
15. Partner Status: Indicates whether a customer has a spouse or partner. This demographic
attribute can impact the likelihood of churn.
16. Senior Citizen: A binary feature identifying whether a customer is a senior citizen (1 = Yes, 0
= No). Age demographics can influence churn rates.
17. Churn: The target variable, indicating whether the customer has churned (1 = Yes, 0 = No).
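Since the features above mix numerical and categorical values, the categorical ones have to be turned into numbers before a classifier can use them. The following is a minimal sketch of one common approach (one-hot encoding), assuming hypothetical field names and category labels; the actual column names and labels in the project's dataset may differ.

```python
# Hypothetical category lists; the real dataset's labels may differ.
CONTRACT_TYPES = ["Month-to-month", "One year", "Two year"]
INTERNET_TYPES = ["DSL", "Fiber optic", "No"]


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def encode_customer(record):
    """Turn one customer record (a dict) into a flat numeric feature list."""
    features = [
        float(record["tenure"]),
        float(record["monthly_charges"]),
        1 if record["senior_citizen"] else 0,
        1 if record["partner"] == "Yes" else 0,
    ]
    features += one_hot(record["contract"], CONTRACT_TYPES)
    features += one_hot(record["internet_service"], INTERNET_TYPES)
    return features


# Illustrative record, not taken from the project's dataset.
example = {
    "tenure": 5,
    "monthly_charges": 70.5,
    "senior_citizen": 0,
    "partner": "No",
    "contract": "Month-to-month",
    "internet_service": "Fiber optic",
}
print(encode_customer(example))
```

Tree-based models can also work with simple integer codes, but one-hot encoding avoids implying an order between categories such as payment methods.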
3. Sample dataset
4. Trend Analysis
A significant portion of customers remain loyal, with the majority classified as non-churners.
Month-to-month contract holders exhibit a higher churn rate, whereas customers with long-
term contracts (one or two years) tend to stay longer.
These trends indicate that contract type and early engagement strategies play a crucial role in
customer retention.
Fiber optic users have a higher churn rate compared to DSL users, likely due to pricing
concerns or competition.
Customers without internet service have the lowest churn rate, indicating they are less likely
to switch providers.
Customers subscribed to online security, tech support, and backup services show lower churn
rates.
Customers who do not opt for these add-ons tend to churn more frequently, suggesting that
bundled services enhance retention.
Higher monthly charges correlate with increased churn, indicating that cost-sensitive
customers are more likely to leave.
Customers with lower monthly charges tend to stay longer, possibly due to perceived
affordability.
Subscription-based add-ons like streaming services and device protection plans can influence
churn, as customers may reconsider expenses over time.
This trend suggests that pricing strategies and discounts for high-risk customers could be
effective in reducing churn.
Senior citizens have a higher churn rate, possibly due to lower engagement with digital
services or financial constraints.
Customers with dependents and partners tend to have a lower churn rate, suggesting that
family plans contribute to retention.
Customers paying via electronic check show a significantly higher churn rate compared to
those using bank transfers or credit cards, indicating that payment method could be an early
predictor of churn.
These trends emphasize the role of personalized retention strategies, such as tailored offers
for senior customers or incentives for stable payment methods.
2. Fiber optic internet users have a higher churn rate than DSL users, indicating potential
dissatisfaction with pricing or service reliability.
3. Value-added services like tech support and online security reduce churn, suggesting
businesses should promote these features as retention tools.
4. High monthly charges are linked to higher churn, reinforcing the need for cost-effective
pricing models and discount offers for at-risk customers.
5. Demographics influence churn, with senior citizens and electronic check users showing
higher churn rates, highlighting the importance of targeted engagement strategies.
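The comparisons above all come down to computing a churn rate per group. The sketch below shows one way this could be done, using a handful of made-up records for illustration only; the values are not results from the project's dataset, and the field names are assumptions.

```python
from collections import defaultdict


def churn_rate_by(records, key):
    """Return {group value: fraction of customers in that group who churned}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        churned[r[key]] += r["churn"]  # churn is 1 (left) or 0 (stayed)
    return {group: churned[group] / totals[group] for group in totals}


# Made-up toy records, for illustration only.
toy = [
    {"contract": "Month-to-month", "churn": 1},
    {"contract": "Month-to-month", "churn": 1},
    {"contract": "Month-to-month", "churn": 0},
    {"contract": "Month-to-month", "churn": 0},
    {"contract": "One year", "churn": 0},
    {"contract": "One year", "churn": 1},
    {"contract": "One year", "churn": 0},
    {"contract": "One year", "churn": 0},
    {"contract": "Two year", "churn": 0},
    {"contract": "Two year", "churn": 0},
]

rates = churn_rate_by(toy, "contract")
for group, rate in rates.items():
    print(f"{group}: {rate:.0%}")
```

The same function can be called with any other categorical field (for example a payment method or internet service field) to produce the corresponding comparison.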
5. Model building
Customer churn prediction involves supervised machine learning techniques, where historical
customer data is used to train a model to classify customers as churners or non-churners. The
objective is to develop a robust model that accurately identifies at-risk customers, enabling
businesses to take proactive retention measures. Below is an explanation of different machine
learning approaches and their relevance to this dataset.
In a Decision Tree, each node represents a decision based on a customer attribute (e.g., contract
type, monthly charges).
The model continues to split the data until reaching a leaf node (churn or non-churn).
Decision Trees are useful for identifying thresholds where customers become high-risk, such as
monthly charges exceeding a certain amount.
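To make the idea of a learned threshold concrete, the sketch below searches for the single split on monthly charges that minimizes the weighted Gini impurity, which is the criterion a decision tree commonly uses at each node. It is a simplified, one-feature illustration on made-up values, not the project's implementation.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Find the split `value <= t` that minimizes weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    best_t, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no split possible between identical values
        t = (lo + hi) / 2  # candidate threshold halfway between neighbours
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Made-up toy values: monthly charges and whether the customer churned.
monthly_charges = [20, 30, 45, 60, 80, 95]
churned = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(monthly_charges, churned)
print(threshold, impurity)
```

A full tree repeats this search over every feature and then recurses on each side of the chosen split.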
3. Logistic Regression
A linear model for binary classification, predicting the probability of churn.
Useful for understanding the relationship between customer attributes and churn probability.
Can quantify how much a unit increase in monthly charges impacts churn likelihood.
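The effect of a unit increase can be read directly from a logistic regression coefficient: raising a feature by one unit multiplies the odds of churn by the exponential of its coefficient. The sketch below checks this numerically. The intercept and coefficient are hypothetical values chosen for illustration, not fitted on the project's data.

```python
import math


def churn_probability(monthly_charges, intercept, coef):
    """Logistic model with a single feature: p = sigmoid(intercept + coef * x)."""
    z = intercept + coef * monthly_charges
    return 1 / (1 + math.exp(-z))


def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1 - p)


# Hypothetical parameters, for illustration only.
INTERCEPT = -3.0
COEF = 0.03

p_70 = churn_probability(70, INTERCEPT, COEF)
p_71 = churn_probability(71, INTERCEPT, COEF)

# The ratio of the odds equals exp(COEF), whatever the starting charge is.
odds_ratio = odds(p_71) / odds(p_70)
print(round(odds_ratio, 4), round(math.exp(COEF), 4))
```

Note that the change in probability depends on the starting point, while the change in odds does not; this is why coefficients are usually reported as odds ratios.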
Support Vector Machines (SVM) can be effective when combined with the kernel trick to capture
complex, nonlinear relationships.
SVMs are useful if customer churn data exhibits a clear decision boundary based on key features.
Gradient-boosted trees such as XGBoost work effectively with missing data and can capture complex
interactions between variables.
Neural networks can capture nonlinear relationships that traditional machine learning models might
miss.
They are suitable for large-scale churn datasets with intricate dependencies between features.
They can be regularized and stabilized with techniques like dropout and batch normalization to
improve performance.
Optimal Case:
The optimal case for a customer churn prediction model is when it achieves high predictive accuracy,
strong generalization to unseen data, and actionable insights that enable businesses to retain at-risk
customers effectively.
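Generalization to unseen data is usually estimated with k-fold cross-validation, and hyperparameters are chosen by repeating that estimate for each candidate value (grid search). The sketch below illustrates the procedure with F1-score as the selection metric. To stay self-contained it uses a simple one-feature nearest-neighbour vote as a stand-in model rather than the Random Forest used in the project, and the data is made up.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive (churn = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def k_fold_indices(n, k):
    """Yield (train, test) index lists; every index is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def knn_predict(train_x, train_y, x, n_neighbors):
    """Stand-in model: majority vote among the closest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:n_neighbors]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def grid_search_cv(xs, ys, grid, k_folds):
    """Return the grid value with the best mean F1, plus all mean scores."""
    mean_f1 = {}
    for n_neighbors in grid:
        scores = []
        for train, test in k_fold_indices(len(xs), k_folds):
            train_x = [xs[i] for i in train]
            train_y = [ys[i] for i in train]
            y_true = [ys[i] for i in test]
            y_pred = [knn_predict(train_x, train_y, xs[i], n_neighbors)
                      for i in test]
            scores.append(precision_recall_f1(y_true, y_pred)[2])
        mean_f1[n_neighbors] = sum(scores) / len(scores)
    best = max(mean_f1, key=mean_f1.get)
    return best, mean_f1


# Made-up toy data: monthly charges and churn labels with some overlap.
charges = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
labels = [0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

best_k, scores_by_k = grid_search_cv(charges, labels, grid=[1, 3, 5], k_folds=3)
print(best_k, scores_by_k)
```

Each candidate is scored only on data it was not trained on, so the selected value reflects expected performance on new customers rather than fit to the training set.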
6. Challenges Faced
1. Handling Class Imbalance: The dataset had a significantly lower number of churned
customers compared to retained customers. To address this, techniques like SMOTE
(Synthetic Minority Over-sampling Technique) and class weighting were implemented.
2. Feature Selection and Engineering: Identifying the most relevant features that contribute to
churn prediction required multiple rounds of feature importance analysis and correlation
studies.
3. Optimizing Model Performance: Balancing between precision and recall was challenging,
as a model with high precision but low recall would miss many actual churners, while high
recall but low precision would generate too many false positives.
4. Hyperparameter Tuning: Finding the right combination of hyperparameters (e.g., number
of estimators and max depth for both models, learning rate for XGBoost) required
extensive fine-tuning of Random Forest and XGBoost with Grid Search and Cross-Validation.
5. Interpreting Model Decisions: Business stakeholders needed understandable insights, so
SHAP values and feature importance visualizations were used to explain the model's
decisions.
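The core idea behind SMOTE, mentioned in the first challenge, is to create synthetic minority-class samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea on made-up points; it is not the implementation used in the project, and a library such as imbalanced-learn would normally be used in practice.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Made-up minority-class points: (tenure in months, monthly charges).
churners = [(2.0, 80.0), (3.0, 95.0), (5.0, 70.0), (1.0, 100.0)]

new_points = smote(churners, n_synthetic=6)
for point in new_points:
    print(point)
```

Over-sampling should be applied to the training split only, so that synthetic points never leak into the data used for evaluation.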
7. Future Enhancements
8. Recommendations
Improve Customer Support and Engagement: Customers with frequent service issues or low
engagement levels should receive personalized assistance and proactive customer service.
Offer Tailored Retention Programs: High-risk churners identified by the model can be offered
loyalty rewards, discounts, or better subscription plans to encourage retention.
Monitor Billing and Payment Trends: Customers who frequently miss payments or use high-
cost payment methods (like one-time electronic checks) are at higher risk of churn. Offering
flexible billing options can help retain them.
9. Conclusion
The Customer Churn Prediction Model provides valuable insights into the key factors driving
customer attrition and offers actionable strategies to enhance customer retention. By leveraging
machine learning techniques such as Random Forest, Decision Trees, and XGBoost, businesses can
identify at-risk customers before they churn and implement targeted interventions to improve
retention rates.
One of the most significant findings from the model is that billing and contract-related factors play a
crucial role in churn. Customers on month-to-month contracts and those using electronic check
payments are more likely to leave. This indicates that businesses can reduce churn by offering long-
term contract incentives and introducing more flexible payment options.
Additionally, service quality and engagement levels strongly influence churn risk. Customers with
frequent technical issues or limited engagement with services are at higher risk. This highlights the
importance of proactive customer support, personalized recommendations, and service quality
improvements to retain customers effectively.
From a business strategy perspective, integrating this model into customer relationship management
(CRM) systems can provide real-time churn alerts and enable data-driven decision-making. AI-
driven churn prediction can help businesses prioritize high-risk customers, optimize marketing
efforts, and design better retention programs, leading to increased customer satisfaction and revenue
growth.
Looking ahead, future improvements such as real-time prediction systems, advanced feature
engineering, and reinforcement learning-based retention strategies can further refine the model’s
effectiveness. By continuously adapting to evolving customer behaviors, businesses can stay ahead
of churn risks and build stronger, long-term customer relationships.