Analysis and Comparison of Forecasting Algorithms For Telecom Customer Churn
Abstract. Ensemble algorithms are highly flexible tools for data analysis and prediction. In many big data competitions at home and abroad, the winning teams largely rely on ensemble methods such as random forest, GBDT and XGBoost, which shows that ensemble algorithms still hold a clear accuracy advantage in predictive classification. The main task of this article is to predict the churn of telecom customers. With the telecom market now saturated, retaining existing customers is the main task of every telecom operator. This article compares the predictive performance of four prediction models on a telecom data set; the final performance evaluation indices also show that the random forest and XGBoost models, both built on the ensemble idea, achieve better predictive performance.
1. Introduction
For the predictive analysis of the classification status of customers, commonly used approaches include clustering, association rules [1] and machine learning models [2-7]; this work relies mainly on machine learning models. The data for predicting customer churn was drawn from publicly available data sets on the Internet, with private information such as user names removed.
In the data preprocessing stage, a simple statistical analysis of the data is performed: the feature attributes of the data set and their data types are inspected, and the distribution of each attribute is plotted separately for churned and retained customers. The .isna().sum() method is used to count the missing values of each attribute, the attributes with missing values are filtered out, and the gaps are then handled according to the specific data situation. The results show that some attributes in the data have missing values. Missing values are generally handled by deletion or filling. Here the attribute with missing values is the total annual consumption. This attribute does not involve time series
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
The 2nd International Conference on Computing and Data Science (CONF-CDS 2021) IOP Publishing
Journal of Physics: Conference Series 1881 (2021) 032061 doi:10.1088/1742-6596/1881/3/032061
and continuity, so forward or backward filling is unnecessary. To minimize the impact on the prediction result, the missing values are filled with the mean.
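The missing-value check and mean imputation described above can be sketched with pandas; the column name total_charges is a hypothetical stand-in for the total annual consumption attribute, not the dataset's actual column name:

```python
import pandas as pd

# Hypothetical churn dataframe; column names are illustrative only.
df = pd.DataFrame({
    "tenure_months": [1, 34, 2, 45],
    "total_charges": [29.85, None, 108.15, None],
    "churn": [0, 0, 1, 0],
})

# The .isna().sum() check described above: count gaps per attribute.
missing = df.isna().sum()
print(missing[missing > 0])  # only the columns that actually have gaps

# The attribute is not a time series, so fill with the column mean
# rather than forward/backward filling.
df["total_charges"] = df["total_charges"].fillna(df["total_charges"].mean())
```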
2. Experiments
(1) Logistic regression is a commonly used machine learning model, typically applied to data mining, analysis and prediction [8]. For predicting telecom customer churn, the traditional Sigmoid hypothesis function is adopted, and the regression prediction model is built by applying the hypothesis function to a linear combination of the independent variables. The preprocessed data are passed to the logistic regression model to obtain the best weight coefficients. The loss function of this model draws on, and improves, the one used in linear regression:
$J(\beta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log h(x^{(i)}) - (1-y^{(i)})\log\!\left(1-h(x^{(i)})\right)\right]$.
The gradient descent method is used to find the optimal solution of the model. The model is trained to find the best classification regression coefficients, and the user life cycle is predicted from the feature attributes of the user information in the data set. The different feature attributes strongly affect the trained weight coefficients, and hence the quality of the final prediction results.
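A minimal sketch of this step with scikit-learn on synthetic stand-in data (the real feature columns are not reproduced here; sklearn's solver replaces plain gradient descent but minimizes the same cross-entropy loss):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed churn features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The fitted coef_ plays the role of the "best weight coefficients"
# obtained by minimizing the cross-entropy loss J(beta) above.
model = LogisticRegression().fit(X_train, y_train)
print(model.coef_)                  # learned weight coefficients
print(model.score(X_test, y_test))  # holdout accuracy
```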
Random forest, XGBoost and AdaBoost are all ensemble learning algorithms, whose main idea is to combine weak learners into a final strong learner. Ensemble learning divides into two schools. In the first, there is no correlation between the weak learners, which are built independently; random forest is the typical algorithm of this bagging school. In the other school the weak learners influence each other, each being fitted to the errors of its predecessors; AdaBoost and XGBoost are typical algorithms of this boosting school.
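The two schools can be contrasted in a minimal scikit-learn sketch on synthetic stand-in data (hyperparameters are illustrative, not the paper's settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

X, y = make_classification(n_samples=400, random_state=0)

# Bagging school: trees are trained independently on bootstrap samples
# and their votes are aggregated.
bagging = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Boosting school: each weak learner is fitted on the reweighted errors
# of the previous one, so the learners depend on each other.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

print(bagging.score(X, y), boosting.score(X, y))
```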
(2) Random forest model: training a random forest builds multiple decision trees and merges them to obtain a more accurate and stable model. The final output is determined jointly by the many decision trees; in a classification model, the final result is produced by a vote over these trees. Every decision tree that makes up the random forest matters, as each tree selects and splits on the feature attributes of the data. In random forest, the ranking of feature attributes is determined by information entropy, an index that measures the impurity (uncertainty) of an attribute. In this telecom customer churn prediction, the mutual influence of multiple feature attributes must be considered to determine the final prediction.
When predicting telecom customer churn, the random forest combines 1000 decision trees, with a minimum of 3 samples required to split a node, a minimum of 3 samples per leaf node, and the maximum number of features per split determined internally by the algorithm. One advantage of the random forest model is that it can output an importance ranking of the feature attributes; the ranking for this data set is shown in Figure 3.
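A sketch of this configuration with scikit-learn, on synthetic stand-in data (the real telecom features are not reproduced here; the hyperparameters mirror those stated above, with max_features left to the library default, i.e. chosen internally):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data with 8 illustrative features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 1000 trees, minimum 3 samples to split, minimum 3 samples per leaf.
rf = RandomForestClassifier(
    n_estimators=1000,
    min_samples_split=3,
    min_samples_leaf=3,
    random_state=0,
).fit(X, y)

# The advantage noted above: the fitted model directly exposes a
# feature-importance ranking, as plotted in Figure 3.
for idx in rf.feature_importances_.argsort()[::-1]:
    print(idx, rf.feature_importances_[idx])
```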
(3) XGBoost model: the objective function of the XGBoost model is
$\mathrm{Obj} = \sum_{i=1}^{m} L\big(y_i, f(x_i)\big) + \gamma J + \frac{\lambda}{2}\sum_{j=1}^{J}\omega_{tj}^{2}$,
composed of a loss function and a regularization term. In the function, J is the number of leaf nodes and $\omega_{tj}$ is the optimal value of the j-th leaf node. I is the input training sample set, T is the maximum number of iterations, and L is the loss function, which indicates how well the model fits the data; the regularization coefficients $\lambda$ and $\gamma$ control the complexity of the model. The output of the objective function is the strong learner f(x). The XGBoost model uses Taylor's formula to expand the loss function. When the split loss at the current node is evaluated, the relevant calculation is
$\text{score} = \max\!\left(\text{score},\ \frac{1}{2}\frac{G_L^2}{H_L+\lambda} + \frac{1}{2}\frac{G_R^2}{H_R+\lambda} - \frac{1}{2}\frac{(G_L+G_R)^2}{H_L+H_R+\lambda} - \gamma\right)$, where $\gamma$ is the
threshold for the split gain: the node is split only when the gain is greater than $\gamma$, so the threshold plays a pruning role.
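The split-gain formula can be written directly as a small function; $G$ and $H$ are the sums of first- and second-order gradient statistics on the left and right child, and the numeric values below are illustrative:

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of a candidate split, following the formula above:
    0.5*[G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma.
    """
    left = G_L ** 2 / (H_L + lam)
    right = G_R ** 2 / (H_R + lam)
    joint = (G_L + G_R) ** 2 / (H_L + H_R + lam)
    return 0.5 * (left + right - joint) - gamma

# A split is kept only when its gain is positive, i.e. the improvement
# exceeds the gamma threshold -- the pruning effect described above.
print(split_gain(G_L=4.0, H_L=2.0, G_R=-4.0, H_R=2.0, lam=1.0, gamma=0.5))
```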
The XGBoost algorithm is very flexible and has achieved good results in many data mining activities. Like the random forest, it sorts and outputs the importance of the data set's attributes; the following figure shows the importance ranking of each attribute under the XGBoost model.
samples whose prediction is negative and whose actual value is also negative. The second-level indicator F1 of the confusion matrix used above is the harmonic mean of precision and recall.
In the analysis of the results of this experiment, the ROC curve and the AUC area are mainly used. In the ROC coordinate system both axes derive from the confusion matrix: the abscissa is $FPR = \frac{FP}{FP+TN}$ and the ordinate is $TPR = \frac{TP}{TP+FN}$, and AUC is the area under the ROC curve, used mainly to measure the generalization ability of the model. Because AUC is a single value, it is better suited to comparative analysis than the curve itself. The criterion of the AUC value for model performance is: when AUC = 1, the classifier is perfect and the model achieves ideal prediction, a situation that generally does not occur in practice; if AUC < 0.5, the prediction effect of the model is very poor, and such a model should have its predictions reversed; for ordinary models the AUC value lies between 0.5 and 1, and the larger the value, the better the performance [10].
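The ROC/AUC evaluation above can be reproduced with scikit-learn on synthetic stand-in data (the real test split is not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# roc_curve returns FPR (abscissa) and TPR (ordinate); AUC is the area
# under that curve, the single-number generalization measure used above.
fpr, tpr, _ = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)
print(auc)
```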
The results of the four algorithms are compared as follows:
3. Conclusion
A comprehensive comparison of the four machine learning prediction algorithms, logistic regression and three ensemble learning algorithms, finds that the random forest, XGBoost and AdaBoost algorithms are more effective. For many modelling researchers, however, they resemble a black box: the internal operation of random forest is opaque. The models used here are relatively basic ones to which only simple parameter tuning was applied; future research could improve the loss function and objective function. In addition, computing is moving towards artificial intelligence, and data mining is gradually moving into the field of deep learning. Neural networks are a very successful deep learning technology, so a next step is to consider processing the data set with deep learning and neural networks [11-12].
References
[1] X. Wang and G. Jiao, "Research on Association Rules of Course Grades Based on Parallel FP-Growth Algorithm," 2020, pp. 759-769.
[2] T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2016, pp. 785-794.
[3] S. A. Qureshi, A. S. Rehman, A. M. Qamar, A. Kamal and A. Rehman, "Telecommunication
subscribers' churn prediction model using machine learning," Eighth International
Conference on Digital Information Management (ICDIM 2013), Islamabad, 2013, pp.
131-136, doi: 10.1109/ICDIM.2013.6693977.