Interim Report - Final
REPORT
USN 23VMBR01961
OBJECTIVES OF THE STUDY
This project targets the crucial problem of high-value customer churn in subscription-based
businesses. Leveraging predictive analytics and machine learning methods, the research intends
to detect early indicators of customer disengagement and create data-informed interventions for
improved retention. The interim report summarizes the objectives, methodology, and preliminary
findings, and includes insights into patterns of customer behavior and predictive model
effectiveness in churn forecasting.
2. Leverage of Structured and Unstructured Data: The study leverages both structured data
(e.g., transaction records, CRM data) and unstructured data (e.g., customer feedback,
support tickets) to gain a comprehensive understanding of customer behavior and churn
indicators. This holistic approach supports a nuanced analysis of the factors influencing
customer retention.
METHODOLOGY
This research will utilize both data-driven methods and machine learning algorithms to forecast
and prevent high-value customer churn. The approach consists of the following elements:
Type of Research: The study is empirical and descriptive, with historical customer information
used to recognize patterns of churn and test predictive models.
Data Collection Methods:
1. Primary Data: Surveys, customer feedback, and interaction logs.
2. Secondary Data: Historical transaction records, CRM databases, and industry reports.
Evaluation:
1. Comparison of pre- and post-deployment retention rates.
2. Quantifying intervention effects using A/B testing.
3. Regular review to optimize retention strategies.
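The A/B testing step above can be sketched as a two-proportion z-test comparing retention rates in a control and a treatment group. The counts below are hypothetical placeholders, not figures from this study.

```python
# Hypothetical two-proportion z-test for retention rates (pure stdlib).
from math import sqrt, erf

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) comparing two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)          # pooled rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # normal tail
    return z, p_value

# Treatment retained 870 of 1000 customers; control retained 800 of 1000.
z, p = two_proportion_ztest(870, 1000, 800, 1000)
```

A small p-value would indicate the intervention's retention lift is unlikely to be chance.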
Expected Outcomes:
1. Discovery of critical churn indicators and risk factors for high-value customers.
2. Creation of a predictive model to segment at-risk customers.
3. Enhanced customer retention strategies resulting in revenue increase.
4. A model for companies to maximize customer engagement and loyalty.
5. A dynamic, real-time customer retention solution that adjusts to changing customer
behaviors and market trends.
RESEARCH DESIGN
The research design used in this study is a quantitative, exploratory, and predictive design with a
combination of statistical analysis and machine learning for the purpose of understanding customer
behavior and churn trends.
1. Research Type: Quantitative, exploratory, and predictive, combining statistical
analysis with machine learning.
2. Data Source: Secondary data supplied by LoyaltyVision Analytics.
3. Sampling Design:
Census-based: the whole dataset was employed without sampling in order to maintain
representativeness and provide complete insights.
All customer segments were included in the analysis, including high-value,
regular, and churned customers.
6. Ethical Considerations:
All data used is anonymized and applied for academic and analytic purposes
only, in line with data privacy principles.
The data used in this study was collected through secondary sources. It was supplied by
LoyaltyVision Analytics and comprises a rich set of customer-level data relevant to
understanding behavior, engagement, and churn.
Data Source:
Internal Organizational Dataset provided by LoyaltyVision Analytics.
Data is representative of customer activity and behavior within a specified time period
and is presumed to be anonymized for research purposes.
Data Integrity:
The data were checked for missing values, invalid data types, and outliers.
Exploratory tests verified that the data was detailed and diverse enough to support the
research goals.
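The integrity checks described above might look as follows in pandas; the frame is a tiny synthetic stand-in for the LoyaltyVision Analytics dataset, with column names borrowed from later sections and values invented.

```python
# Synthetic stand-in frame; real column names are assumed, values invented.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Tenure": [1, 5, 12, np.nan, 30],            # one missing value
    "rev_per_month": [100, 120, 95, 110, 102],
    "Service_Score": ["3", "4", "2", "5", "3"],  # numbers stored as text
})

missing = df.isna().sum()                        # missing values per column
bad_types = df.select_dtypes(include="object")   # columns with invalid types
# Coerce the text column to numeric; unparseable entries become NaN.
df["Service_Score"] = pd.to_numeric(df["Service_Score"], errors="coerce")
```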
Relevance to Study:
The data is directly relevant to the study's objective of comprehending customer churn,
thus ideal for exploratory and predictive analysis.
SAMPLING METHOD
The research adopts a census-oriented strategy over the traditional sampling framework since
the entire dataset supplied by LoyaltyVision Analytics was made available for analysis.
1. Sampling Design:
Population:
a) All customers that are represented within the LoyaltyVision Analytics dataset —
comprising a total of 11,260 entries — from different customer segments, revenue
categories, and service levels.
2. Target Group:
Segments such as "Super," "Super Plus," and "Regular Plus" were examined in more detail to
understand churn behavior and retention opportunities.
To extract useful insights from the gathered data and to address the research goals effectively, a
mix of analytical libraries and statistical techniques has been utilized. These libraries assist in
data cleaning, data exploration, visualization, and predictive modelling.
1. Programming Language:
Python: The central programming language adopted for data cleansing, analysis,
visualization, and modeling.
2. Python Libraries:
Pandas – used for data preprocessing and manipulation.
NumPy – used for numeric computations and manipulating arrays.
Matplotlib & Seaborn – used for plotting data through histograms, boxplots, countplots,
and heatmaps.
Scikit-learn (sklearn) – for machine learning operations such as model training,
evaluation, and data splitting.
3. Environment:
Jupyter Notebook – an interactive coding environment to write, run, and document the
analysis process.
4. Analytical Techniques:
Descriptive statistics and summary tables.
Correlation analysis and visual heatmaps.
Outlier detection via boxplots.
Feature engineering through derived variables.
Logistic Regression model as a baseline predictive classifier.
Performance evaluation using classification report (precision, recall, F1-score).
Through the use of such data analysis tools, the research guarantees a detailed and meaningful
exploration of high-value customer behavior to allow for the creation of effective strategies for
improving retention.
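A minimal sketch of the baseline workflow these tools support, using simulated features and a synthetic churn label in place of the real dataset:

```python
# Baseline classifier sketch on simulated data (features and churn label
# are synthetic, not the study's dataset).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))                    # four numeric features
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)  # churn flag

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
report = classification_report(y_test, model.predict(X_test))
print(report)  # precision, recall, and F1-score per class
```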
This section presents a comprehensive exploratory data analysis (EDA) of the dataset provided
for the research on improving high-value customer retention using predictive analytics. The
purpose of EDA is to understand the shape of the dataset, clean and prepare it for subsequent
steps, detect patterns, and derive insights that support modeling and decision-making.
We begin by examining the shape and structure of the dataset. It has 11,260 rows and 19
columns, corresponding to different customer attributes such as demographics, engagement
metrics, and revenue information.
Summary statistics for all columns were generated with df.describe(include='all').
Univariate analysis is concerned with the investigation of each single numerical variable alone to
see its distribution, central tendency, and dispersion. Histograms were graphed with df.hist() to
plot the frequency distribution of major numeric features like rev_per_month, Tenure, cashback,
and Service_Score.
This is the important step to detect skewness, identify potential outliers, and determine whether
or not data transformation like normalization or log scaling is required prior to proceeding to
predictive modeling.
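A sketch of this univariate step on synthetic data shaped like the columns named above (a non-interactive backend is set so the plots render off-screen):

```python
# Univariate histograms over synthetic columns shaped like the real ones.
import matplotlib
matplotlib.use("Agg")  # render plots off-screen
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rev_per_month": rng.lognormal(mean=3.0, sigma=1.0, size=1000),  # right-skewed
    "Tenure": rng.integers(0, 61, size=1000).astype(float),
    "cashback": rng.lognormal(mean=2.0, sigma=0.8, size=1000),
})

axes = df.hist(bins=30, figsize=(10, 6))   # one histogram per numeric column
skew = df["rev_per_month"].skew()          # > 0 confirms right-skewness
```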
Key Takeaways:
rev_per_month is right-skewed, which means a few customers account for most revenue.
cashback and Service_Score are similarly right-skewed.
Tenure has a fairly even distribution with a couple of peaks, which indicates different
stages of customer lifecycle.
Insights derived here will inform feature engineering and preprocessing decisions such as
normalization and binning.
Key Takeaways:
"Super" and "Regular Plus" are the most populous segments, meaning they play a
significant part in the customer base.
"Super +" is a niche segment with a small population, perhaps representing premium or
elite-class customers.
Comparison of distributions like these assists in prioritizing segment-wise retention
activity and bespoke intervention strategies.
Key Takeaways:
There is a high positive correlation between Tenure and rev_per_month, indicating that long-
term customers spend more.
Service_Score is negatively correlated with Churn, which means that better serviced customers
are less likely to churn.
cashback and rev_growth_yoy are also moderately correlated, suggesting that reward policies
could impact revenue growth.
This step assists in determining which features can be given priority or observed more
intensively in predictive modeling.
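The correlation step can be sketched as follows; the synthetic columns only mimic the relationships described above (e.g., Tenure driving rev_per_month):

```python
# Correlation matrix and heatmap over synthetic columns mimicking the
# relationships described in the takeaways.
import matplotlib
matplotlib.use("Agg")  # render plots off-screen
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
tenure = rng.integers(1, 61, size=500).astype(float)
df = pd.DataFrame({
    "Tenure": tenure,
    "rev_per_month": 2.0 * tenure + rng.normal(scale=5.0, size=500),
    "Service_Score": rng.integers(1, 6, size=500).astype(float),
})

corr = df.corr()                                  # pairwise correlations
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```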
Step 5: Outlier Detection
Detection of outliers is critical to determine extreme values that may skew
statistical summaries and affect model performance. For customer data, outliers
tend to be either data quality problems or truly high-value customers.
Boxplots were created for quantitative columns like rev_per_month and cashback
to see the spread and pinpoint values lying well beyond the interquartile range
(IQR). Such plots aid in determining whether to keep, truncate, or convert outlier
values.
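A sketch of the IQR rule a boxplot visualizes, applied to an illustrative rev_per_month series:

```python
# Interquartile-range (IQR) rule on an illustrative revenue series.
import pandas as pd

rev = pd.Series([90, 100, 105, 110, 115, 120, 125, 130, 900])  # 900 is extreme

q1, q3 = rev.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # boxplot whisker bounds
outliers = rev[(rev < lower) | (rev > upper)]      # values a boxplot flags
```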
Key Takeaways:
A large number of outliers in rev_per_month were observed; these correspond to
high-spending customers.
Likewise, cashback values had a broad range, with some very extreme cases.
These were kept because they provide useful information on the behavior of the high-
value customer segments that are central to this retention-focused study.
Key Takeaways:
Most customers fall into the Medium spender category according to revenue ranges.
The feature CC_Contact_Recency indicates that recently engaged customers (those who
contacted customer care within the last 1–3 months) have lower churn, which underscores
the importance of prompt support and communication.
All these engineered features provide greater granularity in the customer profile for more
effective targeting in retention practices.
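The engineered features discussed above can be sketched as follows; the bin edges and the 3-month recency threshold are illustrative assumptions:

```python
# Derived features with illustrative thresholds (bin edges and the
# 3-month recency cut-off are assumptions, not from the study).
import pandas as pd

df = pd.DataFrame({
    "rev_per_month": [40, 120, 600, 95, 310],
    "CC_Contact_Recency": [1, 8, 2, 12, 3],  # months since last contact
})

df["spender_category"] = pd.cut(
    df["rev_per_month"],
    bins=[0, 100, 400, float("inf")],
    labels=["Low", "Medium", "High"],
)
df["recent_contact"] = (df["CC_Contact_Recency"] <= 3).astype(int)
```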
Key Takeaways:
The total churn rate in the data is roughly 16.83%, implying that although there is a
general retention of most customers, there is a considerable percentage at risk.
This class imbalance implies a requirement for such methods as class weighting or
SMOTE (Synthetic Minority Over-sampling Technique) to preserve balanced model
training.
The churn rate can inform business priorities: focusing on this roughly 17% and engaging
them with personalized strategies may greatly improve retention.
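One of the remedies named above, class weighting, can be sketched directly in the baseline classifier (SMOTE, available via the imbalanced-learn package, is the resampling alternative); the data here is synthetic with roughly the observed 17% churn rate:

```python
# Class weighting in the baseline classifier; data is synthetic with a
# churn rate near the observed ~17%.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.17).astype(int)   # roughly 17% churners

churn_rate = y.mean()
# class_weight="balanced" reweights classes inversely to their frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```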
CONCLUSION
The exploratory data analysis process provided a thorough understanding of the customer
dataset and yielded valuable insights into customer behavior, segment distribution, churn
patterns, and possible predictive indicators. Important findings such as high revenue
variability, segment dominance, the presence of outliers, and the relationship between
service scores and churn are key inputs for future modeling.
The structured analysis not only revealed patterns but also informed the creation of new
features that will improve prediction accuracy in subsequent phases. This foundation allows
the following steps, predictive modeling and strategy optimization, to be built on data-
driven knowledge.