Interim Report
SERVICES
Defining the Problem Statement: This project analyzes customer churn in subscription-based
services, which is vital to the company's future revenue and to customer loyalty. The task is to
identify the major variables involved in churn, build machine learning models that forecast
customer behavior, and derive business strategies to reduce churn.
Need of the Study/Project: Customer churn is one of the most important measures of business
profitability. Because churn reflects the likelihood of customers leaving, predicting it lets
businesses deploy proactive retention measures, which protects recurring income and avoids the
cost of acquiring replacement customers.
2. Data Report
Data Collection: The data used is secondary customer subscription data covering demographic
information, service usage, preferred mode of payment, and interactions with the customer
service department. The records describe customers and their characteristics, their use of
services, and their experience with the customer support of the company under analysis. The
underlying records were originally collected through surveys of a sample drawn from that
company's customer database.
Visual Inspection of Data: The dataset consists of 11,260 rows and 19 attributes, including
customer tenure, monthly revenue, and interaction details. Each attribute was explored
graphically using histograms, box plots, and count plots to understand its distribution and to
check for outliers.
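The outlier check that a box plot performs visually can also be approximated numerically. The sketch below applies the conventional 1.5 x IQR whisker rule to a small made-up list of values; the sample values and the function name `iqr_outliers` are illustrative assumptions and are not taken from the dataset.

```python
import numpy as np

def iqr_outliers(values, whisker=1.5):
    """Return the values that fall outside the box-plot whiskers (1.5 * IQR rule)."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return arr[(arr < lower) | (arr > upper)]

# Hypothetical monthly revenue values; 140 is deliberately extreme.
sample_revenue = [4, 5, 5, 6, 6, 7, 7, 8, 9, 140]
print(iqr_outliers(sample_revenue))
```

Whether flagged values should be capped, removed, or kept is a separate modelling decision that this sketch does not make.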
Understanding of Attributes: The attributes fall into four groups: demographic data (Gender,
Marital_Status); service usage (Tenure, Account_user_count); payments (Payment,
coupon_used_for_payment); and customer support (CC_Contacted_LY, Day_Since_CC_connect). Some
features, for example the multiple identification numbers, were removed before processing.
Missing data was found in attributes such as Tenure, Payment, and Service_Score and was handled
via imputation.
Shape of the Dataset: The dataset contains 11,260 rows and 19 columns.
Column Headings: ['AccountID', 'Churn', 'Tenure', 'City_Tier', 'CC_Contacted_LY',
'Payment', 'Gender', 'Service_Score', 'Account_user_count', 'account_segment',
'CC_Agent_Score', 'Marital_Status', 'rev_per_month', 'Complain_ly', 'rev_growth_yoy',
'coupon_used_for_payment', 'Day_Since_CC_connect', 'cashback', 'Login_device']
First Few Rows: Displayed sample rows include information such as AccountID, Churn
status, Tenure, Payment method, and more.
Last Few Rows: Displayed sample rows showing similar attributes to give an overview
of the data.
Missing Data: Several columns, including Tenure, Payment, and Service_Score, contain
missing values.
Data Types: Attributes include a mix of numerical and categorical variables, requiring
different preprocessing techniques.
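A minimal sketch of how missing values and data types might be inspected and imputed with pandas. The report does not state which imputation strategy was used, so the median (numeric columns) and most frequent value (categorical columns) below are assumptions, and the toy values, including the payment categories, are made up.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame reusing three column names from the report.
df = pd.DataFrame({
    "Tenure": [4.0, np.nan, 12.0, 20.0],
    "Payment": ["Debit Card", "UPI", None, "UPI"],
    "Service_Score": [3.0, 2.0, np.nan, 3.0],
})

print(df.isna().sum())  # count of missing values per column
print(df.dtypes)        # numeric vs. non-numeric (categorical) columns

imputed = df.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # Assumed strategy: median for numeric columns.
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        # Assumed strategy: most frequent category for categorical columns.
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```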
Univariate Analysis:
Distribution of Customer Tenure: The histogram (Figure 1) shows the spread of customers by
tenure: most have a tenure below 20 months, and fewer customers lie above this point. This
suggests that a significant number of customers churn comparatively early. If early churn
is driven by dissatisfaction, a targeted retention strategy could focus on improving the
customer experience within the first year.
Bivariate Analysis:
The box plot (Figure 3) compares non-churned and churned customers and shows that
churned customers have a higher customer service agent score (CC_Agent_Score). This
suggests that much of the churn is linked to unsatisfactory customer support experiences,
and it calls for more commitment by companies to improving the quality of the customer
relations department.
Figure 3: Customer Service Agent Score by Churn Status
Monthly Revenue by Churn Status: The box plot (Figure 4) of customers' monthly
revenue shows that the two groups do not differ significantly, so churn cannot be
explained by revenue alone. It can still be valuable to assess revenue alongside other
value drivers, such as service quality and customer satisfaction, when constructing a
more effective retention agenda.
Figure 4: Monthly Revenue by Churn Status
Data Imbalance: Finally, the dataset is highly imbalanced: a majority of the customers did not
churn. Although churn is comparatively uncommon, it is essential to address, since each churned
customer represents lost value to the business. Resampling methods such as SMOTE reduce the
imbalance and help prevent the model from being biased toward the majority class, so that it
learns to predict both the churn and non-churn classes.
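The idea behind SMOTE is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a minimal NumPy illustration of that idea only. It is not the imbalanced-learn implementation, the function name `smote_like_oversample` and its parameters are hypothetical, and the sample points are made up.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (SMOTE-style)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                    # random minority sample
        nb = neighbours[base, rng.integers(k)]    # one of its k neighbours
        gap = rng.random()                        # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Made-up minority-class (churned) points with two numeric features.
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
synthetic = smote_like_oversample(X_min, n_new=6)
print(synthetic)
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class proportions.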
High Churn Rate in Specific Account Segments: The analysis shows that the churn rate is higher
in some account segments than in others, particularly segment 2 and segment 3. Customers in
these segments may have unmet needs or other issues that need to be addressed. These segments
could be targeted through business interventions such as specialized services or loyalty
programs that help reduce churn rates.
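A per-segment churn rate of the kind discussed above can be computed with a simple group-by. The sketch below uses the report's column names `account_segment` and `Churn`, but the rows and the segment labels "A", "B", and "C" are made up for illustration, and it assumes `Churn` is coded as 1 for churned and 0 otherwise.

```python
import pandas as pd

# Made-up rows; segment labels are placeholders, not the real segment names.
toy = pd.DataFrame({
    "account_segment": ["A", "A", "B", "B", "B", "C", "C", "C"],
    "Churn":           [0,   0,   1,   0,   1,   1,   0,   0],
})

# The mean of a 0/1 churn flag within each segment is that segment's churn rate.
churn_rate = (
    toy.groupby("account_segment")["Churn"]
       .mean()
       .sort_values(ascending=False)
)
print(churn_rate)
```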
Customer Interaction Frequency as a Churn Indicator: High levels of interaction with customer
care, for instance through repeated complaints, are also indicators of churn. This shows that it
is not enough to attend to clients' complaints; the resolutions must also leave them satisfied.
Customer retention can be improved if businesses actively identify and resolve these pain
points.
Algorithms Selected: Based on the nature of the churn prediction problem, three supervised
learning algorithms were selected: Logistic Regression, Random Forest, and Support Vector
Classifier (SVC).
Logistic Regression was chosen for its simplicity and interpretability.
Random Forest was selected because it combines multiple decision trees to reach its final
decision and can therefore handle feature interactions.
SVC was used for its ability to capture non-linear relationships, which can help
separate churned customers from other customers.
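A minimal sketch of fitting the three selected algorithms with scikit-learn and comparing accuracy with recall on the churn class. It runs on synthetic data from `make_classification`, not on the report's dataset; the 85/15 class split, the default hyperparameters, and the train/test split are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the churn data (class 1 = churn).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVC": SVC(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        # Recall on the positive (churn) class; 0 if no churner is predicted.
        "churn_recall": recall_score(y_test, pred, zero_division=0),
    }
    print(name, results[name])
```

Reporting churn recall next to accuracy matters on imbalanced data, because a model can score well on accuracy while missing most churners.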
Feature selection was performed to improve the performance of the machine learning algorithms
used while also improving the interpretability of the results. Reducing the number of input
features makes the models less prone to overfitting and lowers the computational cost, which
matters especially when memory is limited.
Recursive Feature Elimination (RFE): RFE was used as the feature selection method. It works
recursively: the model is fit, the least important feature is removed, and the model is refit
on the remaining features until the desired number is left. The aim is to find the features that
matter most, since keeping them improves model generality and interpretability.
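The recursive procedure described above corresponds to scikit-learn's `RFE` class. The sketch below is an illustration on synthetic data; the choice of Logistic Regression as the base estimator and of three features to keep are assumptions, since the report does not state the settings it used.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 3 carry signal.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Drop one feature per iteration until 3 remain (both values are assumptions).
selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=3, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```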
Selected Features for Each Model:
Hyperparameter Tuning:
To further improve performance, GridSearchCV was used to optimize the hyperparameters of every
model. Significant parameters were fine-tuned to improve accuracy and, in particular, recall
for churned customers.
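A minimal sketch of a grid search with scikit-learn's `GridSearchCV`, shown for a Random Forest on synthetic data. The parameter grid, the 3-fold cross-validation, and the use of `scoring="recall"` are assumptions chosen for illustration; the report does not list the grid or the scoring metric it searched over.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the churn data (class 1 = churn).
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.85, 0.15], random_state=0)

# Hypothetical grid; every combination is evaluated with cross-validation.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",  # assumed: optimise recall on the churn class
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated recall of the best combination
```

Scoring on recall rather than accuracy steers the search toward settings that catch more churners, which is the weakness the tuning aimed to address.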
Hyperparameter Tuning Results:
Logistic Regression: The parameters that yielded the best results were C=1 and penalty='l2'.
After tuning, the model achieved 86.53% accuracy. However, recall for churn stayed low,
which indicates a need for further improvement.
Random Forest Classifier: The best parameters were n_estimators=50, max_depth=None, and
min_samples_split=2. The final tuned Random Forest scored 88.40% across both churned and
non-churned customer records.
Support Vector Classifier (SVC): The highest accuracy for this model was achieved with C=0.01
and kernel='linear'. For churn, the tuned SVC performed as poorly as the untuned model, with a
recall of 0%.
Logistic Regression: Logistic Regression performed fairly well overall but missed many churned
customers, as reflected in its low recall.
Support Vector Classifier (SVC): SVC failed to capture any churned customers, as indicated by
its recall and F1-score of 0% for "Churn: Yes." This suggests that the model is not well suited
to this problem without further changes.
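The sketch below shows how a model can report high accuracy while its recall and F1-score for churn are 0%, which is the pattern described above for SVC. The confusion-matrix counts are hypothetical and are not the report's results; they assume a test set of 1,000 customers with 150 churners and a model that predicts "no churn" for everyone.

```python
def churn_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the churn (positive) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: every customer is predicted as "no churn".
all_negative = churn_metrics(tp=0, fp=0, fn=150, tn=850)
print(all_negative)
```

With these counts the accuracy is 0.85 even though no churner is identified, which is why recall for the churn class is the more informative figure on imbalanced data.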
Business Implications:
Improved Customer Retention Strategies: The tuned Random Forest Classifier gave the most
accurate performance, so businesses can use the information from this model to formulate timely
retention strategies. Knowing the top drivers of churn makes it easier to target specific
customers who are likely to churn and to address their issues.
Resource Allocation: Churn can be managed well when providers identify the types of clients who
are likely to leave and direct marketing, customer care, and similar efforts where they are
needed. This supports customer loyalty programs and ensures that the organisation pays attention
to customers who are even considering leaving.
Personalized Interventions: The knowledge gained from the tuned models is useful for designing
individual interventions. For example, customers with high risk scores can be offered an extra
loyalty bonus, promotions on the goods they have chosen, or improved customer service.
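One way to turn risk scores into interventions is to map a predicted churn probability to an action tier. The sketch below is purely illustrative: the function name `retention_action`, the thresholds 0.7 and 0.4, and the tier boundaries are hypothetical and would need to be set from business constraints.

```python
def retention_action(churn_probability, high=0.7, medium=0.4):
    """Map a predicted churn probability to a retention action.

    The thresholds are hypothetical placeholders, not values from the report.
    """
    if not 0.0 <= churn_probability <= 1.0:
        raise ValueError("churn_probability must be between 0 and 1")
    if churn_probability >= high:
        return "loyalty bonus and priority customer service"
    if churn_probability >= medium:
        return "targeted promotion"
    return "no action"

# Made-up risk scores, such as a classifier's predicted churn probabilities.
for score in (0.85, 0.50, 0.10):
    print(score, "->", retention_action(score))
```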
Onboarding Improvements: Onboarding is an important area for reducing churn because many
customers churn in the first few months of interacting with the company. Owners and managers
should pay close attention to customers' first interactions with their goods and services.
Customer Support Quality: The tuned models pointed to customer service quality as an important
factor. Lowering churn from this source calls for enhancements in customer support: the staff
serving clients should be trained in proper communication, and mechanisms for receiving
customer feedback should be developed.
Data-Driven Decision Making: Through predictive modeling, firms and organizations can better
understand which customers they should focus on keeping. This involves identifying which
segments are most likely to churn and then developing specific programs or improvements to suit
those customers.
Conclusion
This interim report focused on customer churn prediction in subscription-based service
industries using standard ML algorithms. It assessed the performance of Logistic Regression,
Random Forest Classifier, and Support Vector Classifier (SVC) in distinguishing churned from
non-churned clients. Based on the analysis:
Random Forest Classifier turned out to be the best model, since it combined competitive
accuracy with the best recall and precision for the churn and non-churn classes.
Logistic Regression performed reasonably well overall but had difficulty identifying
churned customers, as shown by its very low recall.
Support Vector Classifier (SVC) still failed to identify churned customers, even after
its hyperparameters were optimized.
Business Implications:
The findings suggest that firms need to understand and anticipate patterns of customer
attrition in order to develop effective strategies for this client base. The information
provided by the models enables firms to identify high-risk customer segments, deploy resources
effectively, and decide on measures that improve customer satisfaction and therefore loyalty.
Attention to customer acquisition, targeted sales offers, and effective handling of customer
complaints are among the measures needed to reduce churn and improve overall business
profitability.
Next Steps:
Model Optimization: Ensemble methods or more sophisticated techniques should be
considered to improve the models further.
Data Enrichment: The models may benefit from additional customer behavioral
attributes, such as how often a customer posts about the product on social media or
past purchase behaviour.
Implementation: Use the tuned model to predict churn in real time and feed the
insights into the customer relationship management (CRM) system so that retention
action is taken early and adequately.
Algorithms Selected: Churn analysis for customers was performed using three Supervised
Machine learning algorithms