Report
Abstract
In this project I explored the QWE dataset and built a model to predict potentially churning customers. First I performed data pre-processing and exploratory data analysis (EDA) on the QWE dataset, with visualizations to support a better understanding of the data. I then found that the biggest difficulty in prediction is the class imbalance. As a result, I tried different models and implemented several modifications to achieve better performance, eventually reaching a best AUC of 0.77. Data insights and business analysis were also conducted based on these results.
1 Data Description
The QWE dataset contains 6,347 rows (each representing a unique customer) and 13 columns: 12 features and 1 target (Churn). The data consists of both integer and float features. A brief description of each column in the QWE dataset is given below.
Target:
Churn - Whether the customer churned or not (1,0)
Features:
Customer Age inmonths - customer longevity in months
CHI M0 - Customer Happiness Index at the current moment
CHI M01 - difference in CHI between this month and last month
Support M0 - number of support cases at the current moment
Support M01 - difference in the number of support cases between this month and last month
SP M0 - average support priority at the current moment
SP M01 - difference in average support priority between this month and last month
Logins M01 - difference in clients' logins between this month and last month
Blog M01 - difference in clients' blog activity between this month and last month
Views M01 - difference in clients' views between this month and last month
DaysLastLogin M01 - difference in days since last login between this month and last month
2 Data pre-processing
First I checked whether the dataset contains any abnormal values. There is no missing or duplicated data, and the data types are in order. Two columns cannot be used as features, 'ID' and 'Churn': the former is a unique identifier of the customer and the latter is the target, so both are removed from the feature DataFrame.
I also produced a statistical description to explore the distributions of the target and the features. As the figure shows, churned customers account for only 5% of the 6,347 samples, which indicates a significant class imbalance. Meanwhile, the data is now entirely numeric, but the features are on different scales: comparing the number of support cases with a continuous value such as days since last login gives no relevant information because the units differ, and the variables would not contribute equally to the model. To fix this, we standardize the data before training the model.
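The pre-processing steps can be sketched roughly as follows; the file name "QWE.csv", the split parameters and the random seed are illustrative assumptions rather than the original code.

```python
# Minimal sketch of the pre-processing described above. The file name
# "QWE.csv" and the exact column spellings are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("QWE.csv")
y = df["Churn"]
X = df.drop(columns=["ID", "Churn"])   # drop the identifier and the target

# Stratified split keeps the ~5% churn rate in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize so that features with different units contribute comparably
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```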
Figure 2: Data description
3 EDA
In this part I conducted some exploratory data analysis(EDA) and visualization within
the QWE dataset. The sample imbalance is demonstrated by the pie chart below. You can
check the EDA.ipynb file for all the detailed code.
As we can see in the right pie chart, customers with a longevity of 6 to 14 months account for the largest share of churned customers, 45.5%; longevity above 14 months accounts for 40.9%, and longevity below 6 months for 13.6%. Let us also have a look at the detailed churn rate by customer longevity.
I also plotted the density of customer longevity for the two churn outcomes. It is clear that churned and unchurned customers have different longevity distributions: the churned distribution peaks around 13 months and has a relatively small variance, while the unchurned distribution is wider and roughly Poisson-like in shape.
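A sketch of how such a density plot can be produced is shown below; it assumes the DataFrame df from the pre-processing sketch and the exact column spelling "Customer_Age_inmonths", both of which are assumptions.

```python
# Sketch of the longevity-density plot split by churn outcome.
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 4))
sns.kdeplot(data=df[df["Churn"] == 1], x="Customer_Age_inmonths",
            label="Churned", fill=True, ax=ax)
sns.kdeplot(data=df[df["Churn"] == 0], x="Customer_Age_inmonths",
            label="Not churned", fill=True, ax=ax)
ax.set_xlabel("Customer longevity (months)")
ax.legend()
plt.show()
```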
Figure 8: Density of CHI M0, CHI M01 by Churn
Figure 9: Density of CHI M0 by Churn
For the usage information, I compared the month-over-month differences between churned and unchurned customers. From the figures I found that it is hard to tell whether a customer will churn from the usage information alone. A customer may use the system more frequently because he has run into a problem that cannot be fixed, which would certainly increase the probability of churn; in contrast, he may use the system more simply because he finds it helpful and convenient. It is hard to tell which situation applies. As the figures show, the feature distributions of churned and unchurned customers overlap heavily, which indicates that it is difficult for a model to separate the two groups precisely.
For each feature in the dataset, I conducted a correlation analysis against the target "Churn" using the Pearson correlation coefficient test. The results are listed in the following table.
Taking the significance level as 0.05, I found that "Churn" has a relatively high correlation with CHI, SP M0, Logins and DaysLastLogin, all of which are statistically significant. A heatmap is also drawn for a more straightforward view.
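A rough sketch of this correlation analysis is given below; it reuses the df DataFrame from the earlier pre-processing sketch, and the output formatting is illustrative.

```python
# Pearson correlation test of each feature against "Churn".
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

results = []
for col in df.columns.drop(["ID", "Churn"]):
    r, p = pearsonr(df[col], df["Churn"])
    results.append({"feature": col, "pearson_r": r, "p_value": p})

corr_table = pd.DataFrame(results).sort_values("p_value")
print(corr_table)

# Heatmap of pairwise correlations for a more visual overview
sns.heatmap(df.drop(columns=["ID"]).corr(), cmap="coolwarm", center=0)
plt.show()
```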
4 Models
In this section I implemented several models, including logistic regression, random forest, SVM and a neural network. From the data description we know that class imbalance is the main difficulty for prediction, so I tried several techniques to reach better precision and recall.
Two quantities are particularly relevant here: the precision and recall of the positive class. Precision is the proportion of correctly predicted positive samples among all predicted positive samples, while recall is the proportion of correctly predicted positive samples among all actual positive samples. The PR curve plots precision against recall. For imbalanced datasets, where the positive class is rare, the PR curve is often more informative than the ROC curve because it captures the model's performance in correctly identifying positive samples.
AUC Formula:
The AUC can be calculated as the integral of the ROC curve:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$
$$\mathrm{AUC} = \int \mathrm{TPR}(f)\, d\,\mathrm{FPR}(f)$$
Fβ score:
The Fβ score is a metric commonly used in binary classification tasks to measure the balance between precision and recall. It is defined as the weighted harmonic mean of precision and recall, where the weight is determined by the parameter β.
When β is greater than 1, the Fβ score emphasizes recall more than precision. This
means that false negatives (missed positive samples) are considered more important than false
positives (misclassified negative samples), resulting in a higher weight on recall. Conversely,
when β is less than 1, the Fβ score places more emphasis on precision, indicating that false
positives are considered more important, leading to a higher weight on precision. When β is
equal to 1, the Fβ score reduces to the F1 score, which equally balances precision and recall.
PR Formula:
Precision, Recall and the Fβ-score can be calculated from the confusion matrix:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$$
Correctly identifying the customers who will not churn is less important to us, because we don't need to predict exactly who will not churn. In terms of the confusion matrix, we only care about recall and precision, because they show the model's performance in predicting customer churn. Meanwhile, it is the company's decision whether to place more weight on recall or on precision. For the Fβ score I set β = 2, which gives more importance to recall, because I don't want to miss any potential churn customer. ROC and PR curves are also presented for every model to demonstrate classification capability more comprehensively.
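The metric computations might look roughly like the following; the variable names y_test, y_prob and y_pred are placeholders for any model's test labels, predicted probabilities and hard predictions.

```python
# Evaluation metrics used for every model: ROC AUC, the PR curve, and the
# F-beta score with beta = 2 (variable names are placeholders).
from sklearn.metrics import (roc_auc_score, precision_recall_curve,
                             fbeta_score, precision_score, recall_score)

auc = roc_auc_score(y_test, y_prob)                      # area under ROC
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f2 = fbeta_score(y_test, y_pred, beta=2)                 # beta=2 favours recall

print(f"AUC={auc:.3f}  precision={prec:.3f}  recall={rec:.3f}  F2={f2:.3f}")
```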
As a result, I believe that additional information must be introduced into the model to achieve better classification. Such information should help the model distinguish the features of the two groups and push the samples farther from the decision boundary. EDA is a handy assistant for finding this additional information.
Table 5: Coefficients after age grouping

Feature              Coefficient
CHI M0                    2.811
CHI M01                   2.432
Support M0               -0.840
Support M01              -1.782
SP M0                    -0.071
SP M01                   -0.576
Logins M01                1.808
Blog M01                  0.680
Views M01                 1.054
DaysLastLogin M01        -3.980
age0-6                    1.592
age6-14                   0.081
age>14                    0.263
Here I display the AUC and PR curves of the logistic regression. After dividing the Customer Age inmonths variable into three groups, the AUC increased from 0.61 to 0.69.
Figure 14: AUC Curve after grouping of LR
Figure 15: Precision-Recall Curve after grouping of LR
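A sketch of the age grouping and the logistic regression fit is shown below. The bin edges (6 and 14 months) follow the groups listed in Table 5; the column spelling "Customer_Age_inmonths", the use of class_weight="balanced" and the omission of scaling are assumptions made for illustration.

```python
# Bin customer longevity into the three age groups and fit a logistic
# regression on the grouped features.
import pandas as pd
from sklearn.linear_model import LogisticRegression

age_groups = pd.cut(df["Customer_Age_inmonths"],
                    bins=[-1, 6, 14, float("inf")],
                    labels=["age0-6", "age6-14", "age>14"])
X_grouped = pd.concat(
    [df.drop(columns=["ID", "Churn", "Customer_Age_inmonths"]),
     pd.get_dummies(age_groups)],
    axis=1,
)

# class_weight="balanced" is one way to counter the ~5% churn rate
log_reg = LogisticRegression(max_iter=1000, class_weight="balanced")
log_reg.fit(X_grouped, df["Churn"])
print(dict(zip(X_grouped.columns, log_reg.coef_[0])))
```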
4.5 Random Forest
Random forest is a convenient way to perform classification. I ran the random forest prediction and also tried ensemble learning with logistic regression, random forest and gradient boosting decision trees, hoping to improve the model's performance. However, ensemble learning did not live up to my expectations and made little difference. Detailed code can be found in the random forest train.py and Ensemble learning.py files.
Figure 16: AUC Curve of Random Forest
Figure 17: Precision-Recall Curve of Random Forest
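The ensemble can be sketched with scikit-learn's soft-voting classifier as below; the hyperparameters are illustrative and not taken from random forest train.py or Ensemble learning.py, and the training variables are reused from the pre-processing sketch.

```python
# Soft-voting ensemble of logistic regression, random forest and GBDT.
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ("rf", RandomForestClassifier(n_estimators=200,
                                      class_weight="balanced",
                                      random_state=42)),
        ("gbdt", GradientBoostingClassifier(random_state=42)),
    ],
    voting="soft",   # average the predicted probabilities
)
ensemble.fit(X_train_scaled, y_train)
churn_prob = ensemble.predict_proba(X_test_scaled)[:, 1]
```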
I also randomly chose one tree estimator from the random forest and visualized its classification. Since the feature dimension is high, it is hard to get a straightforward reading of the result; still, we can see that customers with few support cases and low average support priority are more likely to churn.
4.6 SVM
SVM is effective in high-dimensional spaces and robust against overfitting. On the other hand, it can be computationally expensive: it takes much longer to train an SVM than the other models. Ultimately the SVM reached an AUC of 0.72, demonstrating good classification ability.
Figure 19: AUC Curve of SVM
Figure 20: Precision-Recall Curve of SVM
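A sketch of the SVM setup is shown below; the kernel, C value and class weighting are assumptions, and the training variables are reused from the pre-processing sketch. Note that probability=True (needed for the ROC and PR curves) adds extra computation, which contributes to the long training time mentioned above.

```python
# SVM classifier with probability estimates for the ROC/PR curves.
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

svm = SVC(kernel="rbf", C=1.0, probability=True, class_weight="balanced")
svm.fit(X_train_scaled, y_train)
svm_prob = svm.predict_proba(X_test_scaled)[:, 1]
print("SVM AUC:", roc_auc_score(y_test, svm_prob))
```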
Figure 21: Training loss of Neural Network
Figure 22: AUC Curve of Neural Network
Figure 23: Precision-Recall Curve of Neural Network
A GAN can generate synthetic samples that closely resemble the real data. The generator learns the complex data distribution, allowing it to produce samples that are more realistic and representative of the minority class. In contrast, SMOTE generates synthetic samples by interpolating between existing samples, which may not capture the true data distribution as effectively.
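For reference, the SMOTE baseline can be sketched with imbalanced-learn as follows; the default sampling strategy and the reuse of the training variables from the pre-processing sketch are assumptions.

```python
# SMOTE oversampling of the minority (churned) class.
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train_scaled, y_train)
# X_res / y_res now contain interpolated minority-class samples and can be
# fed to the neural network in place of the original training data.
```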
In our model, I first trained a GAN on the churned data and then generated 5,000 new churned samples from the trained generator. With the newly generated churned data added to the original data, the training set now contains roughly equal numbers of churned and unchurned samples. Second, I continued to use the neural network model and trained it on this augmented data. Unlike SMOTE, which generates new data from the neighbours of existing samples, the GAN is a trained model with better generalization ability. You can check the detailed code in the gan.py file. Eventually the neural network with GAN reached a recall of 0.6154 and a precision of 0.1538. Compared with the other imbalance-handling approaches (precision = 0.09), the GAN-based model increased precision significantly while maintaining a high recall. With the help of the GAN, we obtained the highest AUC of 0.77.
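A minimal sketch of such a GAN-based oversampler is shown below (in PyTorch); the network sizes, learning rates and training schedule are illustrative assumptions, and the actual gan.py may differ.

```python
# Minimal GAN sketch for oversampling the churned class. Assumes X_churn is
# a standardized NumPy array containing only churned-customer feature rows.
import torch
import torch.nn as nn

def build_mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

def train_gan(X_churn, noise_dim=16, epochs=500, batch_size=64, lr=1e-3):
    X = torch.tensor(X_churn, dtype=torch.float32)
    n_features = X.shape[1]
    G = build_mlp(noise_dim, n_features)                  # generator
    D = build_mlp(n_features, 1, out_act=nn.Sigmoid())    # discriminator
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCELoss()
    for _ in range(epochs):
        idx = torch.randint(0, X.shape[0], (batch_size,))
        real = X[idx]
        fake = G(torch.randn(batch_size, noise_dim))
        # Discriminator step: push real samples toward 1, fakes toward 0
        d_loss = bce(D(real), torch.ones(batch_size, 1)) + \
                 bce(D(fake.detach()), torch.zeros(batch_size, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # Generator step: try to make the discriminator output 1 on fakes
        g_loss = bce(D(fake), torch.ones(batch_size, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return G

def generate_samples(G, n_samples=5000, noise_dim=16):
    # Draw noise and map it through the trained generator
    with torch.no_grad():
        return G(torch.randn(n_samples, noise_dim)).numpy()
```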
Figure 25: Churn Rate by Age Group (Total Data)
From the Pearson correlation coefficient test (Table 1) and the coefficients of the logistic regression (Tables 3 and 4), we can see that Customer Age inmonths, CHI, Logins M01 and Support M01 have a large impact on the churn rate. As the density figures in the EDA section show, churned customers tend to have a relatively low CHI, decreased login time and unchanged support cases. It is easy to understand that customers with a low happiness index are more likely to churn; our data confirms this, so we should pay close attention to the Customer Happiness Index, especially customers with a low index.
Figure 26: Mean Difference of Features between Top 50 Churn and Total Data
In the figure above I selected the top 50 potential churn customers and compared their standardized features with the total data. As the bar chart shows, churn customers tend to have a lower CHI and fewer logins and views than the total data, which illustrates the significance of customer usage indicators such as views and logins. What's more, the month-over-month change in support and SP (the M01 columns) is higher than average for churn customers, while their current support and SP (the M0 columns) are lower than average. This indicates that churn customers normally have few support cases at low priority, but they demand more support and higher priority than usual in the month before they churn. One possible explanation is that they are experiencing difficulties during this time, which increases their probability of churning next month.
3. Focus on customers who don't usually seek support but suddenly show a large increase in demand for support.
4. Customer usage data such as logins and views should not be ignored: customers who do not log in and browse frequently are more likely to churn.