Interim Report

Uploaded by

Anisha Gheever
INTERIM REPORT: CUSTOMER CHURN PREDICTION IN SUBSCRIPTION-BASED SERVICES

1. Introduction of the Business Problem

Defining the Problem Statement: This project analyzes customer churn in subscription-based
services, which is vital to the company's future revenue and customer loyalty. The task is to
identify the major variables involved in churn, build machine learning models that forecast
customer behavior, and derive business strategies to reduce churn.

Need of the Study/Project: Customer churn is one of the most important measures of business
profitability. Because churn prediction estimates the likelihood of customers leaving, it enables
businesses to deploy proactive retention measures, which helps stabilize income and reduce
customer acquisition costs.

Understanding Business/Social Opportunity: Predictive churn modeling not only identifies
customers' reasons for churning but also yields actionable knowledge for intervening and
improving customer satisfaction. This helps keep customers loyal to the firm and strengthens
the firm's value proposition.

2. Data Report

Data Collection: The data used is secondary customer subscription data covering demographic
information, service usage, preferred payment mode, and interactions with the customer service
department. It includes both monetary and non-monetary attributes describing customers and their
characteristics, their use of services, and their experience with the company's customer
support. The underlying primary data were collected through surveys of a sample drawn from the
customer database of the company under analysis.

Visual Inspection of Data: The dataset consists of 11,260 rows and 19 attributes, including
customer tenure, monthly revenue, and interaction details. Each attribute was explored
graphically using histograms, box plots, and count plots to understand its distribution and to
check for outliers.

Understanding of Attributes: The dataset attributes fall into the following groups: demographic
data (Gender, Marital_Status); service usage details (Tenure, Account_user_count); payments
(Payment, coupon_used_for_payment); and customer support details (CC_Contacted_LY,
Day_Since_CC_connect). Some features, such as identification numbers, were removed before
processing. Missing data were found in attributes such as Tenure, Payment, and Service_Score and
were handled via imputation.

3. Exploratory Data Analysis (EDA)


Output of EDA:

• Shape of the Dataset: The dataset contains 11,260 rows and 19 columns.
• Column Headings: ['AccountID', 'Churn', 'Tenure', 'City_Tier', 'CC_Contacted_LY',
'Payment', 'Gender', 'Service_Score', 'Account_user_count', 'account_segment',
'CC_Agent_Score', 'Marital_Status', 'rev_per_month', 'Complain_ly', 'rev_growth_yoy',
'coupon_used_for_payment', 'Day_Since_CC_connect', 'cashback', 'Login_device']
• First Few Rows: Displayed sample rows include information such as AccountID, Churn
status, Tenure, Payment method, and more.
• Last Few Rows: Displayed sample rows showing similar attributes to give an overview
of the data.
• Missing Data: Several columns, including Tenure, Payment, and Service_Score, contain
missing values.
• Data Types: Attributes include a mix of numerical and categorical variables, requiring
different preprocessing techniques.

Univariate Analysis:

• Distribution of Customer Tenure: The histogram (Figure 1) shows the spread of customers
by tenure: most have a tenure below 20 months, and fewer customers exceed this point.
This suggests that a significant number of customers churn comparatively early. Since
early churn is often linked to dissatisfaction, a targeted retention strategy might
focus on improving the customer experience within the first year.

Figure 1: Distribution of Customer Tenure


• Churn Rate by Account Segment: The bar plot (Figure 2) shows that particular account
segments, such as segments 2 and 3, experienced higher churn rates. This suggests that
customers in these segments may have unmet needs or problems that should be addressed
through enhanced service delivery or reward programs.

Figure 2: Churn Rate by Account Segment

Bivariate Analysis:

• Customer Service Agent Score by Churn Status: The box plot (Figure 3) comparing
non-churned and churned customers shows that churned customers have a higher customer
service agent score. This is read as a sign that much of the churn originates from
unsatisfactory customer support experiences, and it calls for greater commitment from
companies to improving the quality of the customer relations department.
Figure 3: Customer Service Agent Score by Churn Status

• Monthly Revenue by Churn Status: The box plot (Figure 4) of monthly revenue shows that
the churned and non-churned groups do not differ significantly, so churn cannot be
explained by revenue alone. It can still be valuable to weigh revenue against other
value drivers, such as service quality and customer satisfaction, when constructing a
more effective retention agenda.
Figure 4: Monthly Revenue by Churn Status

• Number of Complaints Last Year by Churn Status: The count plot (Figure 5) shows that
customers with more complaints tend to churn. This underlines the need for timely
handling of complaints in order to minimize dissatisfaction and, consequently, churn.
Businesses should adopt efficient complaint-handling processes and analyze complaints
continuously.

Figure 5: Number of Complaints Last Year by Churn Status


Data Cleaning and Transformation:

• Handling Missing Values for 'Account_user_count': The 'Account_user_count' column was
converted to a numeric type, coercing any errors to NaN. Missing values were then
filled with the mean value of the column to maintain consistency.
• Removal of Unwanted Variables: Columns that were deemed irrelevant, redundant, or
derived without adding significant value were eliminated. These include:
o AccountID: A unique identifier with no predictive value.
o Login_device: The type of device used for login (e.g., mobile or computer),
which was unlikely to be a strong predictor of churn.
o Marital_Status: Unlikely to have direct influence on customer churn in this
context.
o coupon_used_for_payment: Limited influence on churn and not a strong
predictor.
o rev_growth_yoy: Derived feature that could be redundant with other revenue-
related attributes.
o Day_Since_CC_connect: Number of days since last customer care interaction,
not providing significant additional predictive power.
o cashback: Promotional feature that may add noise rather than predictive value.
• Encoding Categorical Columns: The categorical columns ('Payment', 'Gender',
'account_segment') were label encoded to prepare them for modeling.
• Class Imbalance Check: The distribution of the target variable ('Churn') was checked
for imbalance. The class distribution was visualized using a bar plot (Figure 6), which
shows that there are significantly more non-churn (class 0) customers than churn
(class 1) customers. This calls for techniques such as SMOTE to balance the dataset
for modeling.
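
The cleaning steps listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used for this report: the sample values and the helper names (coerce_numeric, fill_with_mean, label_encode) are hypothetical. In a pandas-based workflow the coercion step is typically done with pandas.to_numeric(..., errors="coerce").

```python
from collections import Counter

def coerce_numeric(values):
    """Convert each entry to float; entries that cannot be parsed become None."""
    out = []
    for v in values:
        try:
            out.append(float(v))
        except (TypeError, ValueError):
            out.append(None)
    return out

def fill_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def label_encode(values):
    """Map each distinct category to an integer code (sorted order)."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

# Hypothetical sample rows; '@' stands in for an unparseable entry.
account_user_count = ["3", "4", "@", "5", None]
payment = ["Debit Card", "UPI", "Debit Card", "Credit Card", "UPI"]
churn = [1, 0, 0, 0, 0]

user_count_clean = fill_with_mean(coerce_numeric(account_user_count))
payment_codes, payment_mapping = label_encode(payment)
class_counts = Counter(churn)  # quick check of the class balance
```

Comparing the two entries of class_counts gives the same information as the bar plot in Figure 6: how heavily the non-churn class outweighs the churn class.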

Figure 6: Class Distribution of Target Variable (Churn)


4. Business Insights from EDA

Data Imbalance: The dataset is highly imbalanced, as the majority of customers did not churn.
Although churn is comparatively uncommon, it is essential to address because each churned
customer represents lost value to the business. Balancing techniques such as SMOTE help prevent
the model from being biased toward the majority class, so that it learns to predict both the
churn and non-churn classes.
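
The core idea of SMOTE can be illustrated with a short sketch: a synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The code below is a simplified illustration in plain Python, not the implementation used for this report; the function name smote_sketch and the sample values are hypothetical. A full implementation is available as the SMOTE class in imbalanced-learn (imblearn.over_sampling).

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating between
    a minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical (Tenure, CC_Contacted_LY) values for churned customers.
churned = [(1.0, 20.0), (2.0, 25.0), (3.0, 18.0), (1.5, 30.0)]
new_points = smote_sketch(churned, n_new=6)
```

Because every synthetic point lies between two existing churn records, the minority class is enlarged without simply duplicating rows. Oversampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class balance.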
High Churn Rate in Specific Account Segments: The analysis shows that the churn rate is higher
in some segments than in others, particularly in segment 2 and segment 3. Customers in these
segments may have unmet needs or other issues that need to be addressed. These segments could be
targeted through business interventions such as specialized services or loyalty programs that
help reduce churn rates.
Customer Interaction Frequency as a Churn Indicator: Frequent interaction with customer care,
for instance through repeated complaints, is also an indicator of churn. It is therefore not
enough to attend to customers' complaints; the resolutions must also leave them satisfied.
Businesses can improve retention by actively identifying and resolving these pain points.

Importance of Early Tenure Experience: The tenure distribution suggests that customers are
inclined to churn in the early stages of their relationship with the company. First impressions
are therefore crucial to customer loyalty. Customers on monthly subscriptions are more likely to
churn within the first weeks or months, so firms have an opportunity to focus on the onboarding
process and ensure high levels of satisfaction during the first months of interaction.
Customer Service Quality Impact on Churn: The analysis of customer service agent scores
indicates that customers who rated their service experience poorly are the most likely to
churn. Firms should therefore direct resources towards practical training for customer service
staff and towards improving the quality of service delivery.
Revenue Insights: Monthly revenue has a similar distribution for customers who churned and those
who did not. Churn is therefore not solely related to revenue; it is necessary to look more
closely at high-revenue customers and their overall experience and satisfaction in order to
retain them. Offering exclusive services or privileges may help retain high-spending customers.
Complaint Management: The previous year's complaint count is a strong predictor of churn.
Customers who have lodged more complaints than others are at higher risk of churning.
Automating complaint handling, adopting a positive attitude towards customers, and monitoring
customer dissatisfaction are effective ways to manage churn. A closed feedback loop should be
in place to act on complaints and to assess customer support and feedback across the
organization.
Clustering Insights for Customer Segmentation: The initial cluster analysis makes it possible to
identify separate customer segments, such as high-risk churners. This information helps
businesses address the high-risk segments of their customer base with tactics such as loyalty
programs, offers, or targeted messaging that help increase customer retention.

5. Model Building and Interpretation

Algorithms Selected: Based on the nature of the churn prediction problem, three supervised
learning algorithms were selected: Logistic Regression, Random Forest, and SVC.
• Logistic Regression was chosen because of its simplicity and interpretability.
• Random Forest was selected because it combines multiple decision trees to reach its final
decision and can therefore handle feature interactions.
• SVC was used for its ability to capture non-linear relationships, which helps separate
churned customers from the other customers.

Base Model Performance and Interpretation:


Figure 7: Base Model Performance
Logistic Regression: The model's overall accuracy was good, but it was poor at classifying
customers with a propensity to churn, as evidenced by its lower recall. It was therefore
unlikely to capture true churn cases and is not efficient for retention efforts (Figure 8).
Figure 8: Logistic Regression - Confusion Matrix
Random Forest Classifier: The model achieved very good accuracy and maintained a good balance
between precision and recall. It was generally able to separate churned from non-churned
customers, which makes it well suited to churn prediction (Figure 9).

Figure 9: Random Forest Classifier - Confusion Matrix


Support Vector Classifier (SVC): Although the model had good overall accuracy, it faced
significant challenges in predicting churned customers due to a low recall score. This indicates
that the model often failed to identify customers at risk of churning, limiting its usefulness for
proactive retention.
Figure 10: Support Vector Classifier - Confusion Matrix

Feature Selection and Re-Training:

Feature selection was performed to improve the performance of the machine learning algorithms
while also improving the interpretability of the results. Reducing the number of input features
makes the models less prone to overfitting and lowers the computational cost, which matters
especially when memory is limited.
Recursive Feature Elimination (RFE): RFE was used as the feature selection method. It works
recursively: the model is fitted, the least important feature is removed, and the model is
rebuilt on the remaining features until the desired number is left. The aim is to find the
features that matter most, since keeping only those improves model generalization and
interpretability.
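
The elimination loop described above can be sketched as follows. This is a simplified, self-contained illustration rather than the report's pipeline: it assumes a logistic regression fitted by gradient descent, uses the absolute weight as the importance score (which presumes features on comparable scales), and runs on synthetic data with illustrative column labels. scikit-learn offers the same procedure as sklearn.feature_selection.RFE.

```python
import math
import random

def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit logistic-regression weights (plus bias) by batch gradient descent."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, target in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            error = 1.0 / (1.0 + math.exp(-z)) - target
            grad_b += error
            for j, xj in enumerate(row):
                grad_w[j] += error * xj
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w

def rfe(X, y, names, n_keep):
    """Refit, drop the feature with the smallest |weight|, repeat."""
    active = list(range(len(names)))
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        weights = fit_logistic(X_active, y)
        weakest = min(range(len(active)), key=lambda i: abs(weights[i]))
        del active[weakest]
    return [names[j] for j in active]

# Synthetic, roughly standardised data: the label depends on the first two
# columns only; the third column is pure noise.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [1 if (-1.5 * r[0] + 1.0 * r[1]) > 0 else 0 for r in X]
names = ["Tenure", "Complain_ly", "noise"]  # illustrative labels only
selected = rfe(X, y, names, n_keep=2)
```

Because the model is refitted after every removal, the importance of the remaining features is re-estimated each round, which is what distinguishes RFE from a one-off ranking.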
Selected Features for Each Model:

• Logistic Regression: ['Tenure', 'City_Tier', 'Account_user_count', 'CC_Agent_Score',
'Complain_ly']
• Random Forest Classifier: ['Tenure', 'CC_Contacted_LY', 'Payment',
'Account_user_count', 'rev_per_month']
• Support Vector Classifier (SVC): ['Tenure', 'CC_Contacted_LY', 'Account_user_count',
'account_segment', 'Complain_ly']

Model Performance with Selected Features


Figure 11: Model Performance with Selected features
The table (Figure 11) summarizes the machine learning models used for churn prediction and the
results obtained after feature selection. The values shown are accuracy together with
precision, recall, and F1-score, each reported separately for the two churn classes (yes/no).
Here’s an overview of what each column means:
Model: This column lists the machine learning models: Logistic Regression, Random Forest, and
SVC.
Accuracy (%): The ratio of correct predictions to the total number of predictions made by the
model. For instance, Random Forest has an accuracy of 88.51%, meaning it predicted the churn
status correctly for 88.51% of customers.
Precision (Churn: No) (%): The percentage of “no churn” predictions that were correct. A high
value means that the majority of the “no churn” predictions were accurate.
Precision (Churn: Yes) (%): The percentage of “churn” (class 1) predictions that were correct.
A low precision here, as with SVC at 0%, indicates that the model made no correct churn
predictions.
Recall (Churn: No) (%): The share of actual non-churn cases that the model classified
correctly. A high recall for “Churn: No”, such as SVC at 100%, indicates that the model captures
practically all of the non-churn cases.
Recall (Churn: Yes) (%): The share of customers who actually churned that the model identified.
A low recall, such as SVC's 0%, indicates that the model failed to identify the true churn
cases.
F1-Score (Churn: No) (%): The harmonic mean of precision and recall for non-churned customers.
It gives a single balanced measure of how effectively the model predicts the absence of churn.
F1-Score (Churn: Yes) (%): The same measure as the previous column, but for churned customers.
A low F1-score here means that the model was poor at identifying churned customers, because
low precision or low recall pulls the score down.
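
The column definitions above can be made concrete with a small worked example. The sketch below uses hypothetical labels (8 non-churners and 2 churners) and a model that always predicts “no churn”; the helper name class_metrics is illustrative. It shows how accuracy can look good while recall for the churn class is zero, which is the pattern reported for SVC.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # By convention, report 0.0 when a ratio has an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels: 8 non-churners (0) and 2 churners (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that always predicts "no churn".
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
no_churn = class_metrics(y_true, y_pred, positive=0)
churn = class_metrics(y_true, y_pred, positive=1)
```

Here accuracy is 0.8 and recall for “Churn: No” is 1.0, yet precision, recall, and F1 for “Churn: Yes” are all 0.0. On an imbalanced dataset, accuracy alone therefore says little about how well churners are detected.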
Summary of Insights:
Logistic Regression: Achieved fairly high accuracy and high recall for the “no churn” class but
failed to identify churn cases properly, with poor precision and recall for class 1.
Random Forest Classifier: Obtained the best values on all measures while performing well on both
the churned and non-churned classes. It achieved the highest accuracy, 88.51%, which supports
its ability to identify both types of customers.
Support Vector Classifier (SVC): Had good accuracy but performed very poorly in identifying
churn cases, as can be seen in the confusion matrix above. The recall for churn was 0%, meaning
that it did not correctly identify a single actual churned customer. This is a major
disadvantage when using this method to predict churn.
Hence, the Random Forest algorithm turned out to be a reliable model for customer churn
prediction, as it maintained a good balance of accuracy and recall across the two classes.
Model Interpretation
In the overall comparison of results, the Random Forest Classifier showed the most consistently
high accuracy, with the highest figures whether or not the features had been filtered. Its
handling of feature interactions and its balance of precision and recall for both churned and
non-churned customers are the reasons this model is preferred for customer churn prediction.

6. Model Tuning and Business Implications

Hyperparameter Tuning:
To further improve performance, GridSearchCV was used to optimize the hyperparameters of every
model. The key parameters were tuned with the aim of improving accuracy and, in particular,
recall for churned customers.
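
What a grid search does can be sketched in a few lines: every combination in a parameter grid is scored by cross-validation, and the best-scoring combination is kept. The example below is a hedged, self-contained illustration. It substitutes a tiny k-nearest-neighbour classifier and synthetic data for the report's models and dataset, and all names in it (knn_predict, cv_accuracy, grid_search) are hypothetical. In scikit-learn, GridSearchCV (sklearn.model_selection) takes a param_grid, a cv setting, and a scoring option such as "recall".

```python
import itertools
import math
import random

def knn_predict(train, point, k, weighted):
    """Classify `point` by a vote of its k nearest neighbours."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], point))[:k]
    votes = {0: 0.0, 1: 0.0}
    for features, label in nearest:
        # Optionally weight each vote by inverse distance.
        weight = 1.0 / (math.dist(features, point) + 1e-9) if weighted else 1.0
        votes[label] += weight
    return max(votes, key=votes.get)

def cv_accuracy(data, params, n_folds=5):
    """Mean accuracy over n_folds train/validation splits."""
    scores = []
    for fold in range(n_folds):
        val = data[fold::n_folds]
        train = [s for i, s in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, **params) == y for x, y in val)
        scores.append(hits / len(val))
    return sum(scores) / n_folds

def grid_search(data, grid):
    """Score every combination in the grid and return the best one."""
    keys = list(grid)
    results = []
    for combo in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, combo))
        results.append((cv_accuracy(data, params), params))
    return max(results, key=lambda r: r[0]), results

# Synthetic two-class data: class c is centred at (2c, 2c).
rng = random.Random(0)
data = [((rng.gauss(2 * c, 1.0), rng.gauss(2 * c, 1.0)), c)
        for c in (0, 1) for _ in range(30)]
rng.shuffle(data)

grid = {"k": [1, 3, 5], "weighted": [False, True]}
(best_score, best_params), results = grid_search(data, grid)
```

The sketch scores by accuracy for brevity. When the goal is to catch more churners, the same loop can rank combinations by recall on the churn class instead.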
Hyperparameter Tuning Results:
Logistic Regression: The two parameters that yielded the best results were C=1 and
penalty='l2'. After tuning, the model achieved 86.53% accuracy. However, the recall for churn
stayed low, indicating a need for further improvement.
Random Forest Classifier: The best parameters were n_estimators=50, max_depth=None, and
min_samples_split=2. The final tuned Random Forest model achieved an accuracy of 88.40% across
churn and non-churn customer records.
Support Vector Classifier (SVC): The highest accuracy for this model was obtained with C=0.01
and kernel='linear'. For churn, the tuned SVC model still performs poorly, much like the
untuned model, with a recall of 0%.

Figure 12: Hyperparameter Tuning – results


As per Figure 12:
Accuracy: The Random Forest model achieved 88.40% accuracy, meaning that it predicted customer
behavior correctly in 88.40% of instances.
Precision (Churn: No/Yes): The Random Forest model had a precision of 70% for “Churn: Yes”,
meaning that 70 percent of the customers predicted to churn actually did.
Recall: The Random Forest model had a recall of 60% for “Churn: Yes”, meaning that it correctly
detected 60 percent of the actual churn cases.
F1-Score (Churn: No/Yes): The Random Forest model had an F1-score of 64% for “Churn: Yes”,
indicating a reasonable balance between catching churners and avoiding false alarms.
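
As a quick consistency check, the F1-score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The snippet below uses only the rounded figures quoted above.

```python
precision = 0.70  # reported precision for "Churn: Yes"
recall = 0.60     # reported recall for "Churn: Yes"

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
```

This gives roughly 0.646, which agrees with the reported 64% once rounding of the inputs is taken into account.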
Key Observations (Figure 12):
Random Forest Classifier: Produced a well-fitted model with the best precision, recall, and
F1-score balance; it is therefore the best model for detecting churned and non-churned
customers.

Logistic Regression: Performed fairly well overall but missed many churned customers, as
reflected in its low recall.
Support Vector Classifier (SVC): Failed to capture any churned customers, as indicated by its
recall and F1-score of 0% for “Churn: Yes”. This suggests that the model is not well suited to
this problem without further changes.
Business Implications:
Improved Customer Retention Strategies: The tuned Random Forest Classifier demonstrated the most
accurate performance, so businesses can use the information from this model to formulate timely
retention strategies. Knowing the top drivers of churn makes it easier to target the specific
customers who are likely to churn and to address their issues.
Resource Allocation: Churn can be managed well when providers identify the types of customers
likely to leave and direct their efforts in marketing, customer care, and other areas to where
they are needed. This supports customer loyalty programs by ensuring that the organization pays
attention to customers who are even considering leaving.
Personalized Interventions: The knowledge gained from the tuned models is useful for designing
individual intervention approaches. For example, customers with high risk scores can be offered
extra loyalty bonuses, promotions on the products they chose, or better customer service.
Onboarding Improvements: Onboarding is an important area for reducing churn because many
customers churn in the first few months of interacting with the company. Owners and managers
should pay close attention to customers' first interactions with their goods and services.
Customer Support Quality: The tuned models pointed to customer service quality as an important
factor. Reducing churn therefore calls for enhancements in customer support: staff serving
customers should be trained in proper communication, and mechanisms for receiving customer
feedback should be developed.
Data-Driven Decision Making: Through predictive modeling, firms and organizations can better
understand which customers to prioritize for retention. This involves determining which
segments are most likely to churn and then developing programs or improvements suited to those
customers.
Conclusion
This interim report focused on customer churn prediction in subscription-based service
industries using standard ML algorithms. We assessed the performance of Logistic Regression,
Random Forest Classifier, and Support Vector Classifier (SVC) in distinguishing churned from
non-churned customers. Based on the analysis:

• Random Forest Classifier turned out to be the best model, since it achieved the highest
accuracy together with the best balance of recall and precision for the churn and
non-churn classes.
• Logistic Regression performed reasonably well overall but had difficulty identifying
churned customers, as shown by its very low recall.
• Support Vector Classifier (SVC), even after hyperparameter optimization, still performed
poorly in identifying churned customers.
Business Implications:
The findings suggest that firms need to understand and anticipate patterns of customer
attrition in order to develop effective strategies for their customer base. The information
produced by the models enables firms to identify high-risk customer segments, deploy resources
effectively, and decide on measures that improve customer satisfaction and hence loyalty.
Attention to customer acquisition, targeted sales offers, and effective handling of customer
complaints are among the measures needed to reduce churn and thereby improve overall business
profitability.
Next Steps:
• Model Optimization: Ensemble methods or more sophisticated techniques could be explored
to improve the models further.
• Data Enrichment: The model may benefit from additional customer behavioral attributes,
such as frequency of posting about the product on social media or past purchase
behaviour.
• Implementation: Use the tuned model to predict churn in real time and feed the insights
into the customer relationship management (CRM) system so that early and adequate
retention action can be taken.
