Telecom Churn Report
PROJECT NOTE 1
MAY 2021
Submitted By,
Ajuna P John
Introduction
The telecommunications sector has become one of the main industries in developed
countries. Technical progress and the growing number of operators have raised the level of
competition. Companies are working hard to survive in this competitive market and rely on
multiple strategies. Three main strategies have been proposed to generate more revenue:
(1) acquire new customers, (2) upsell to existing customers, and (3) increase the retention
period of customers. Comparing these strategies by their return on investment (RoI) has
shown that the third strategy is the most profitable: retaining an existing customer costs
much less than acquiring a new one, and is also considered much easier than upselling.
Customer churn is a major problem and one of the most important concerns for large
companies with highly competitive services. Because churn directly affects revenue,
especially in the telecom field, companies are seeking ways to predict which customers
are likely to churn. Identifying the factors that increase customer churn is therefore
important for taking the actions needed to reduce it. Moreover, predicting which
customers are likely to leave the company can represent a potentially large additional
revenue source if it is done at an early stage.
The reasons that lead customers to cancel can be numerous: poor service quality, delays
in customer support, prices, new competitors entering the market, and so on. Usually
there is no single reason, but a combination of events that culminates in customer
dissatisfaction.
Understanding business/social opportunity
If a company is not able to identify these signals and act before the customer clicks the
cancel button, there is no turning back: the customer is already gone. But the company
still has something valuable: the data. The departed customer has left good clues about
where the service fell short. This data can be a valuable source of meaningful insights and
can be used to train customer churn models. Learning from the past provides strategic
information for improving future experiences.
In the telecommunications segment there is great room for opportunity. The wealth and
volume of customer data that carriers collect can contribute a lot to the shift from a
reactive to a proactive position. The emergence of sophisticated artificial intelligence
and data analytics techniques further helps to leverage this rich data and address churn
much more effectively.
Problem Statement:
The senior management in a telecom provider organization is worried about the rising
customer attrition levels. Additionally, a recent independent survey has suggested that the
industry will face increasing churn rates and decreasing ARPU (average revenue per unit).
Need of the study/project:
Efforts to retain customers have so far been very reactive: the company acts only when a
customer calls to close their account. That has not proved to be a great strategy. The
management team is keen to take more proactive measures on this front. The aim of this
study is to derive insights, predict the potential behaviour of customers, and then
recommend steps to reduce churn.
Understanding how data was collected in terms of time, frequency and methodology
Generating hypotheses is key to unlocking any analytics project. We first list our
understanding of the data, derive insights through various approaches, and then proceed
from there.
Importing libraries
The following packages were installed and loaded before running the analysis:
• install.packages("rpivotTable")
• install.packages("VIM")
• install.packages("corrplot")
• library(rpivotTable)
• library(tidyverse)
• library("VIM")
• library(readr)
• library(corrplot)
The working directory is set and the dataset is read:
#Set working directory
• setwd("D:/GreatLakes/Capstone")
• getwd()
#Read the dataset
• mydata= read.csv("Telecom_Sampled.csv")
• attach(mydata)
• View(mydata)
Analysis of Dataset
Results of Exploratory Data Analysis
Summary of the dataset
mou_Mean totmrc_Mean rev_Range mou_Range
Min. : 0.0 Min. :-26.91 Min. : 0.00 Min. : 0.0
1st Qu.: 160.5 1st Qu.: 30.00 1st Qu.: 1.98 1st Qu.: 114.0
Median : 368.5 Median : 44.99 Median : 15.99 Median : 245.0
Mean : 533.6 Mean : 47.19 Mean : 44.13 Mean : 378.7
3rd Qu.: 733.2 3rd Qu.: 59.99 3rd Qu.: 57.44 3rd Qu.: 485.0
Max. :7667.8 Max. :399.99 Max. :1524.39 Max. :6233.0
NA's :58 NA's :58 NA's :58 NA's :58
change_mou drop_blk_Mean drop_vce_Range owylis_vce_Range
Min. :-2785.00 Min. : 0.000 Min. : 0.00 Min. : 0.0
1st Qu.: -83.00 1st Qu.: 2.000 1st Qu.: 1.00 1st Qu.: 3.0
Median : -4.75 Median : 5.333 Median : 3.00 Median : 9.0
Mean : -10.21 Mean : 10.220 Mean : 5.52 Mean : 15.9
3rd Qu.: 65.25 3rd Qu.: 12.667 3rd Qu.: 7.00 3rd Qu.: 20.0
Max. : 3046.75 Max. :411.667 Max. :313.00 Max. :542.0
NA's :161
mou_opkv_Range months totcalls income eqpdays
Min. : 0.00 Min. : 6.00 Min. : 0 Min. :1.000 Min. : -5.0
1st Qu.: 16.20 1st Qu.:11.00 1st Qu.: 868 1st Qu.:4.000 1st Qu.: 202.0
Median : 55.73 Median :16.00 Median : 1808 Median :6.000 Median : 326.0
Mean : 117.42 Mean :18.69 Mean : 2936 Mean :5.772 Mean : 376.5
3rd Qu.: 142.03 3rd Qu.:24.00 3rd Qu.: 3534 3rd Qu.:7.000 3rd Qu.: 510.0
Max. :4783.67 Max. :60.00 Max. :92076 Max. :9.000 Max. :1812.0
NA's :6697
custcare_Mean callwait_Mean iwylis_vce_Mean callwait_Range
Min. : 0.000 Min. : 0.0000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.000 Median : 0.3333 Median : 2.000 Median : 0.000
Mean : 1.903 Mean : 1.8927 Mean : 8.307 Mean : 1.924
3rd Qu.: 1.667 3rd Qu.: 1.6667 3rd Qu.: 9.333 3rd Qu.: 2.000
Max. :365.667 Max. :212.6667 Max. :519.333 Max. :143.000
Mean : 7.453 Mean : 2896 Mean : 13.318 Mean : 59.36
3rd Qu.: 7.000 3rd Qu.: 3486 3rd Qu.: 14.175 3rd Qu.: 71.95
Max. :600.000 Max. :92076 Max. :896.087 Max. :926.08
NA's :58 NA's :58
ovrmou_Mean comp_vce_Mean plcd_vce_Mean avg3mou
Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 30.67 1st Qu.: 41.0 1st Qu.: 159.0
Median : 2.75 Median : 78.00 Median : 103.8 Median : 369.0
Mean : 40.48 Mean : 112.59 Mean : 149.8 Mean : 538.5
3rd Qu.: 41.25 3rd Qu.: 155.00 3rd Qu.: 204.3 3rd Qu.: 737.0
Max. :3472.25 Max. :1812.67 Max. :2180.3 Max. :7270.0
NA's :58
avgmou avg3qty avgqty avg6mou avg6qty
Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0 Min. : 0.0
1st Qu.: 177.9 1st Qu.: 57.0 1st Qu.: 63.69 1st Qu.: 169 1st Qu.: 60.0
Median : 365.9 Median : 129.0 Median : 128.84 Median : 375 Median : 130.0
Mean : 493.9 Mean : 185.9 Mean : 177.23 Mean : 527 Mean : 183.8
3rd Qu.: 670.3 3rd Qu.: 247.0 3rd Qu.: 233.62 3rd Qu.: 721 3rd Qu.: 244.0
Max. :6329.4 Max. :3261.0 Max. :2475.75 Max. :5589 Max. :2759.0
NA's :814 NA's :814
crclscod asl_flag prizm_social_one area
AA :9602 N:22535 C :4594 NEW YORK CITY AREA : 3055
A :4338 Y: 3983 R :1279 DC/MARYLAND/VIRGINIA AREA: 1805
BA :3283 S :8475 MIDWEST AREA : 1782
CA :2298 T :3921 LOS ANGELES AREA : 1750
EA :1889 U :6368 SOUTHWEST AREA : 1613
B :1046 NA's:1881 (Other) :16508
(Other):4062 NA's : 5
refurb_new hnd_webcap marital ethnic age1 age2
N:22863 UNKW: 68 A :1361 N :8912 Min. : 0.00 Min. : 0.00
R: 3655 WC : 3408 B :1880 H :3512 1st Qu.: 0.00 1st Qu.: 0.00
WCMB:20660 M :8159 S :3417 Median :36.00 Median : 0.00
NA's: 2382 S :4829 U :2940 Mean :31.16 Mean :21.06
U :9821 G :1584 3rd Qu.:48.00 3rd Qu.:42.00
NA's: 468 (Other):5685 Max. :94.00 Max. :99.00
NA's : 468 NA's :468 NA's :468
models hnd_price actvsubs uniqsubs forgntvl
Min. : 1.000 Min. : 9.99 Min. : 0.000 Min. : 1.00 Min. :0.000
1st Qu.: 1.000 1st Qu.: 59.99 1st Qu.: 1.000 1st Qu.: 1.00 1st Qu.:0.000
Median : 1.000 Median : 99.99 Median : 1.000 Median : 1.00 Median :0.000
Mean : 1.569 Mean :104.95 Mean : 1.349 Mean : 1.52 Mean :0.058
3rd Qu.: 2.000 3rd Qu.:149.99 3rd Qu.: 2.000 3rd Qu.: 2.00 3rd Qu.:0.000
Max. :15.000 Max. :499.99 Max. :11.000 Max. :12.00 Max. :1.000
NA's :254 NA's :468
dwlltype dwllsize mailordr occu1 opk_dat_Mean
M : 5169 A :12521 B : 9508 1 : 2723 Min. : 0.0000
S :12968 B : 1348 NA's:17010 2 : 1347 1st Qu.: 0.0000
NA's: 8381 C : 420 5 : 799 Median : 0.0000
J : 384 4 : 493 Mean : 0.4317
O : 315 3 : 428 3rd Qu.: 0.0000
(Other): 1440 (Other): 1240 Max. :247.3333
NA's :10090 NA's :19488
mtrcycle numbcars retdays truck wrkwoman
Min. :0.0000 Min. :1.000 Min. : 0.00 Min. :0.0000 Y : 3272
1st Qu.:0.0000 1st Qu.:1.000 1st Qu.: 37.25 1st Qu.:0.0000 NA's:23246
Median :0.0000 Median :1.000 Median :163.00 Median :0.0000
Mean :0.0137 Mean :1.562 Mean :233.84 Mean :0.1874
3rd Qu.:0.0000 3rd Qu.:2.000 3rd Qu.:391.00 3rd Qu.:0.0000
Max. :1.0000 Max. :3.000 Max. :966.00 Max. :1.0000
NA's :468 NA's :12959 NA's :25652 NA's :468
roam_Mean recv_sms_Mean blck_dat_Mean mou_pead_Mean
Min. : 0.0000 Min. : 0.00000 Min. : 0.00000 Min. : 0.0000
1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0000
Median : 0.0000 Median : 0.00000 Median : 0.00000 Median : 0.0000
Mean : 1.1811 Mean : 0.03733 Mean : 0.02104 Mean : 0.7392
3rd Qu.: 0.2575 3rd Qu.: 0.00000 3rd Qu.: 0.00000 3rd Qu.: 0.0000
Max. :488.7800 Max. :98.33333 Max. :122.33333 Max. :310.0933
NA's :58
churn solflag proptype mailresp cartype car_buy
Min. :0.00 N : 509 A : 6753 R : 9896 B : 2012 New :11051
1st Qu.:0.00 Y : 6 B : 406 NA's:16622 E : 1582 UNKNOWN:14999
Median :0.00 NA's:26003 D : 204 A : 1524 NA's : 468
Mean :0.24 E : 95 F : 1426
3rd Qu.:0.00 G : 18 C : 1021
Max. :1.00 M : 55 (Other): 998
NA's:18987 NA's :17955
children csa da_Mean da_Range datovr_Mean
N : 2705 NYCBRO917: 918 Min. : 0.0000 Min. : 0.000 Min. : 0.0000
Y : 6251 HOUHOU281: 784 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000
NA's:17562 DALDAL214: 762 Median : 0.2475 Median : 0.990 Median : 0.0000
NYCMAN917: 649 Mean : 0.9034 Mean : 1.638 Mean : 0.2544
APCFCH703: 414 3rd Qu.: 0.9900 3rd Qu.: 1.980 3rd Qu.: 0.0000
(Other) :22986 Max. :72.7650 Max. :57.420 Max. :242.8725
NA's : 5 NA's :58 NA's :58 NA's :58
datovr_Range div_type drop_dat_Mean drop_vce_Mean
Min. : 0.0000 BTH : 512 Min. : 0.00000 Min. : 0.0000
1st Qu.: 0.0000 LDD : 4260 1st Qu.: 0.00000 1st Qu.: 0.6667
Median : 0.0000 LTD : 245 Median : 0.00000 Median : 3.0000
Mean : 0.7073 NA's:21501 Mean : 0.03939 Mean : 6.0837
3rd Qu.: 0.0000 3rd Qu.: 0.00000 3rd Qu.: 7.6667
Max. :475.0200 Max. :48.33333 Max. :195.3333
NA's :58
adjmou totrev adjrev avgrev
Min. : 0 Min. : 9.12 Min. : 8.77 Min. : 1.13
1st Qu.: 2454 1st Qu.: 509.31 1st Qu.: 441.51 1st Qu.: 35.45
Median : 5105 Median : 800.98 Median : 733.22 Median : 50.21
Mean : 7707 Mean : 1037.70 Mean : 965.30 Mean : 58.37
3rd Qu.: 9739 3rd Qu.: 1268.38 3rd Qu.: 1191.08 3rd Qu.: 70.27
Max. :174383 Max. :13358.37 Max. :12982.62 Max. :588.27
Structure of the dataset
$ avg6qty : int 99 166 118 95 16 217 145 374 44 572 ...
$ crclscod : Factor w/ 49 levels "A","A2","A3",..: 4 22 8 22 26 4 4 11 4 21 ...
$ asl_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 2 ...
$ prizm_social_one : Factor w/ 5 levels "C","R","S","T",..: 4 5 1 5 5 5 3 5 1 1 ...
$ area : Factor w/ 19 levels "ATLANTIC SOUTH AREA",..: 1 9 10 9 6 1 6 12 5 14 ...
$ refurb_new : Factor w/ 2 levels "N","R": 1 1 1 1 1 2 1 1 1 1 ...
$ hnd_webcap : Factor w/ 3 levels "UNKW","WC","WCMB": 3 3 3 2 NA 3 3 3 3 3 ...
$ marital : Factor w/ 5 levels "A","B","M","S",..: 3 5 3 5 2 5 3 5 3 5 ...
$ ethnic : Factor w/ 17 levels "B","C","D","F",..: 10 14 10 15 10 10 14 6 10 10 ...
$ age1 : int 36 0 32 34 78 0 50 0 40 0 ...
$ age2 : int 0 0 30 38 0 0 48 0 42 0 ...
$ models : int 1 1 2 1 1 2 2 1 1 1 ...
$ hnd_price : num 200 150 100 30 30 ...
$ actvsubs : int 2 1 1 1 1 1 1 1 3 1 ...
$ uniqsubs : int 2 1 1 1 1 2 1 1 3 1 ...
$ forgntvl : int 0 0 0 0 0 0 0 0 0 0 ...
$ dwlltype : Factor w/ 2 levels "M","S": 2 NA 2 NA NA NA 2 NA 2 2 ...
$ dwllsize : Factor w/ 15 levels "A","B","C","D",..: 1 NA 1 NA NA NA 1 NA 1 1 ...
$ mailordr : Factor w/ 1 level "B": 1 NA 1 NA NA NA 1 NA 1 NA ...
$ occu1 : Factor w/ 21 levels "1","2","3","4",..: 10 NA NA NA NA NA 1 NA 4 NA ...
$ opk_dat_Mean : num 0 0 0 0 0 ...
$ mtrcycle : int 0 0 0 0 0 0 0 0 0 0 ...
$ numbcars : int NA 1 1 1 NA NA 2 NA NA 1 ...
$ retdays : int NA NA NA NA NA NA NA NA NA NA ...
$ truck : int 0 0 0 0 0 0 0 0 0 0 ...
$ wrkwoman : Factor w/ 1 level "Y": 1 NA NA NA NA NA NA NA NA NA ...
$ roam_Mean : num 0 0 0 0 0 0 0 0 0 0 ...
$ recv_sms_Mean : num 0 0 0 0 0 0 0 0 0 0 ...
$ blck_dat_Mean : num 0 0 0 0 0 0 0 0 0 0 ...
$ mou_pead_Mean : num 0 0.883 0 0 0 ...
$ churn : int 0 0 0 0 0 0 0 0 0 0 ...
$ solflag : Factor w/ 2 levels "N","Y": NA NA NA NA NA NA NA NA NA NA ...
$ proptype : Factor w/ 6 levels "A","B","D","E",..: 1 NA NA NA NA NA NA NA 1 NA
$ mailresp : Factor w/ 1 level "R": 1 NA 1 NA NA NA 1 NA 1 NA ...
$ cartype : Factor w/ 7 levels "A","B","C","D",..: NA NA 5 NA NA NA 5 NA NA 5 ...
$ car_buy : Factor w/ 2 levels "New","UNKNOWN": 2 1 1 1 2 2 1 2 2 2 ...
$ children : Factor w/ 2 levels "N","Y": 2 NA 2 NA NA NA 2 NA 2 NA ...
$ csa : Factor w/ 694 levels "AIRAIK803","AIRAND864",..: 7 271 688 263 344 58 38 435 521 665 ...
$ da_Mean : num 0 0.247 0.495 0 0 ...
$ da_Range : num 0 0.99 1.98 0 0 0 0 0.99 0 0 ...
$ datovr_Mean : num 0 0.877 0 0 0 ...
$ datovr_Range : num 0 3.51 0 0 0 0 0 0.78 0 0 ...
$ div_type : Factor w/ 3 levels "BTH","LDD","LTD": NA NA NA 2 NA NA NA NA NA
$ drop_dat_Mean : num 0 0 0 0 0 0 0 0 0 0 ...
$ drop_vce_Mean : num 3.667 3 0.333 0.667 1.333 ...
$ adjmou : num 2376 5565 7866 5844 1573 ...
$ totrev : num 519 599 1617 1682 968 ...
$ adjrev : num 489 539 1586 1648 917 ...
$ avgrev : num 37.6 44.9 58.8 68.7 29.6 ...
$ Customer_ID : int 1064525 1048538 1010139 1014496 1012053 1028421 1061694 1038047 1084670 1096350 ...
$ comp_dat_Mean : num 0 0.333 0 0 0 ...
$ plcd_dat_Mean : num 0 0.333 0.333 0 0 ...
Conversion to factors
The variables forgntvl, mtrcycle, truck and churn were converted to factors.
The income levels were converted into an ordered factor.
eqpdays (“Number of days (age) of current equipment”) has negative values. These were
replaced with their absolute values.
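The conversions described here can be sketched in base R as follows. This is a minimal illustration on a small invented data frame named `toy` (hypothetical; only the column names are taken from the report), not the exact code used for the analysis.

```r
# Hypothetical toy data: column names follow the report, values are invented.
toy <- data.frame(
  forgntvl = c(0, 1, 0, 0),
  mtrcycle = c(0, 0, 1, 0),
  truck    = c(1, 0, 0, 0),
  churn    = c(0, 1, 0, 0),
  income   = c(3, 9, 1, 6),
  eqpdays  = c(403, -5, 213, -2)
)

# Binary indicator columns become unordered factors.
flag_cols <- c("forgntvl", "mtrcycle", "truck", "churn")
toy[flag_cols] <- lapply(toy[flag_cols], factor)

# Income bands 1..9 become an ordered factor, so that 1 < 2 < ... < 9.
toy$income <- factor(toy$income, levels = 1:9, ordered = TRUE)

# Negative equipment ages are replaced by their absolute values.
toy$eqpdays <- abs(toy$eqpdays)

str(toy)
```

Fixing the levels to `1:9` keeps all nine income bands in the factor even when a band does not occur in a given sample.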
Missing value treatment
Variables with more than 40% of their values missing are dropped: mailordr, occu1, numbcars,
retdays, wrkwoman, solflag, proptype, mailresp, cartype, children and div_type.
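The 40% rule can be checked by computing the share of NA values per column. In the dataset summary, numbcars for example has 12,959 NA values out of 26,518 rows, about 48.9%. The sketch below applies the same rule to a small invented data frame (`toy` and `toy_kept` are hypothetical names, and the values are made up).

```r
# Hypothetical toy data: only the column names come from the report.
toy <- data.frame(
  mou_Mean = c(190.2, NA, 400.5, 53.5, 37.0),
  retdays  = c(NA, NA, NA, 12, NA),
  wrkwoman = c("Y", NA, NA, NA, NA),
  churn    = c(0, 0, 1, 0, 1)
)

# Share of missing values per column, in percent.
missing_pct <- colMeans(is.na(toy)) * 100
print(round(missing_pct, 1))

# Keep only the columns at or below the 40% threshold.
keep <- names(missing_pct)[missing_pct <= 40]
toy_kept <- toy[, keep, drop = FALSE]
names(toy_kept)
```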
Because the 31 observation rows containing “NA” also carry other vital information, NA is
replaced with either the median of the column or a KNN-imputed value instead of
discarding these rows.
Missing values in the continuous variables `avg6mou`, `avg6qty`, `mou_Mean`, `totmrc_Mean`,
`rev_Range`, `mou_Range`, `change_mou`, `ovrrev_Mean`, `rev_Mean` and `ovrmou_Mean`,
each of which has less than 10% of its values missing, are replaced with the median of the
respective column.
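Median imputation of this kind can be sketched in base R as shown below. The helper `impute_median` and the data frame `toy` are hypothetical names introduced for illustration, and the values are invented.

```r
# Hypothetical toy data: values are invented for illustration.
toy <- data.frame(
  avg6mou    = c(211, NA, 333, 244, 43),
  change_mou = c(-11.2, -78.0, NA, 12.5, -33.0)
)

# Replace NA in a numeric vector with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

num_cols <- c("avg6mou", "change_mou")
toy[num_cols] <- lapply(toy[num_cols], impute_median)
toy
```

The median is less sensitive than the mean to the long right tails visible in the summary of the dataset, which is a common reason for preferring it here.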
Missing values in the variables "prizm_social_one", "income", "dwlltype", "dwllsize",
"hnd_webcap", "marital", "ethnic", "age1", "age2", "hnd_price", "forgntvl", "mtrcycle", "truck",
"roam_Mean", "car_buy", "csa", "da_Mean", "da_Range", "datovr_Mean", "datovr_Range" and
"area", a mix of categorical and numeric variables each with less than 40% of values
missing, are imputed using KNN (k-nearest neighbours).
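Since the VIM package is loaded earlier in this report, KNN imputation may have been done with its `kNN()` function. The sketch below rests on that assumption. The value `k = 3`, the data frame `toy` and its values are invented for illustration, and the report does not state which settings were actually used.

```r
# Hypothetical toy data: values are invented for illustration.
toy <- data.frame(
  age1      = c(36, 40, 32, 34, 78, 52, 50, 44),
  hnd_price = c(200, 150, 100, 30, 30, NA, 150, 100),
  marital   = factor(c("M", "U", "M", "U", "B", "U", NA, "M"))
)

if (requireNamespace("VIM", quietly = TRUE)) {
  # kNN() fills each NA from the k most similar rows. By default numeric
  # columns take the neighbours' median and factors the most frequent level.
  imputed <- VIM::kNN(toy, variable = c("hnd_price", "marital"), k = 3,
                      imp_var = FALSE)
} else {
  # VIM is not installed: leave the data unchanged so the script still runs.
  imputed <- toy
}
imputed
```

Setting `imp_var = FALSE` suppresses the extra indicator columns that `kNN()` would otherwise append to mark imputed cells.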
Structure of working dataset:
'data.frame': 26518 obs. of 70 variables:
$ mou_Mean : num 190.2 443 400.5 53.5 37 ...
$ totmrc_Mean : num 63.9 40 45 34.7 21 ...
$ rev_Range : num 26 5.1 13.9 18.6 35.8 ...
$ mou_Range : num 43 199 172 78 74 ...
$ change_mou : num -11.2 -78 -67.5 12.5 -33 ...
$ drop_blk_Mean : num 4.67 4.33 2 1.33 5.67 ...
$ drop_vce_Range : num 8 5 1 2 1 2 3 16 1 5 ...
$ owylis_vce_Range: num 20 33 7 0 4 40 8 17 2 11 ...
$ mou_opkv_Range : num 28.4 72.5 93.6 4.7 36.6 ...
$ months : num 14 13 29 25 33 25 16 19 9 8 ...
$ totcalls : num 1104 2237 3276 1932 652 ...
$ income : Ord.factor w/ 9 levels "1"<"2"<"3"<"4"<..: 9 9 6 8 5 5 8 6 3 5 ...
$ eqpdays : num 403 404 213 757 970 273 391 283 258 215 ...
$ custcare_Mean : num 0 1.333 1 4 0.333 ...
$ callwait_Mean : num 0.667 0 0 0 0 ...
$ iwylis_vce_Mean : num 10.3 16.3 0 0 0 ...
$ callwait_Range : num 2 0 0 0 0 0 2 3 0 2 ...
$ ccrndmou_Range : num 0 14 17 17 4 0 0 17 16 0 ...
$ adjqty : num 1104 2230 3269 1924 636 ...
$ ovrrev_Mean : num 2.05 8.84 4.72 9.3 0 ...
$ rev_Mean : num 53.5 34.4 40.5 38.7 21 ...
$ ovrmou_Mean : num 5.5 25 13.5 23.2 0 ...
$ comp_vce_Mean : num 59.3 87 124.7 32.3 10.7 ...
$ plcd_vce_Mean : num 73.3 104.3 153 43 18 ...
$ avg3mou : num 194 469 423 49 48 ...
$ avgmou : num 182.8 463.8 291.3 243.5 50.7 ...
$ avg3qty : num 91 188 134 25 12 226 129 462 38 520 ...
$ avgqty : num 84.9 185.8 121.1 80.2 20.5 ...
$ avg6mou : num 211 467 333 244 43 ...
$ avg6qty : num 99 166 118 95 16 217 145 374 44 502 ...
$ crclscod : Factor w/ 49 levels "A","A2","A3",..: 4 22 8 22 26 4 4 11 4 21 ...
$ asl_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 2 ...
$ prizm_social_one: Factor w/ 5 levels "C","R","S","T",..: 4 5 1 5 5 5 3 5 1 1 ...
$ area : Factor w/ 19 levels "ATLANTIC SOUTH AREA",..: 1 9 10 9 6 1 6 12 5 14 ...
$ refurb_new : Factor w/ 2 levels "N","R": 1 1 1 1 1 2 1 1 1 1 ...
$ hnd_webcap : Factor w/ 3 levels "UNKW","WC","WCMB": 3 3 3 2 2 3 3 3 3 3 ...
$ marital : Factor w/ 5 levels "A","B","M","S",..: 3 5 3 5 2 5 3 5 3 5 ...
$ ethnic : Factor w/ 17 levels "B","C","D","F",..: 10 14 10 15 10 10 14 6 10 10 ...
$ age1 : int 36 0 32 34 78 0 50 0 40 0 ...
$ age2 : int 0 0 30 38 0 0 48 0 42 0 ...
$ models : num 1 1 2 1 1 2 2 1 1 1 ...
$ hnd_price : num 200 150 100 30 30 ...
$ actvsubs : num 2 1 1 1 1 1 1 1 3 1 ...
$ uniqsubs : num 2 1 1 1 1 2 1 1 3 1 ...
$ forgntvl : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ dwlltype : Factor w/ 2 levels "M","S": 2 1 2 1 1 1 2 1 2 2 ...
$ dwllsize : Factor w/ 15 levels "A","B","C","D",..: 1 1 1 1 8 2 1 3 1 1 ...
$ opk_dat_Mean : num 0 0 0 0 0 ...
$ mtrcycle : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ truck : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ roam_Mean : num 0 0 0 0 0 0 0 0 0 0 ...
$ recv_sms_Mean : num 0 0 0 0 0 0 0 0 0 0 ...
$ blck_dat_Mean : num 0 0 0 0 0 0 0 0 0 0 ...
$ mou_pead_Mean : num 0 0.883 0 0 0 ...
$ churn : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ car_buy : Factor w/ 2 levels "New","UNKNOWN": 2 1 1 1 2 2 1 2 2 2 ...
$ csa : Factor w/ 694 levels "AIRAIK803","AIRAND864",..: 7 271 688 263 344 58 38 435 521 665 ...
$ da_Mean : num 0 0.247 0.495 0 0 ...
$ da_Range : num 0 0.99 1.98 0 0 0 0 0.99 0 0 ...
$ datovr_Mean : num 0 0.877 0 0 0 ...
$ datovr_Range : num 0 3.51 0 0 0 0 0 0.78 0 0 ...
$ drop_dat_Mean : num 0 0 0 0 0 0 0 0 0 0 ...
$ drop_vce_Mean : num 3.667 3 0.333 0.667 1.333 ...
$ adjmou : num 2376 5565 7866 5844 1573 ...
$ totrev : num 519 599 1617 1682 968 ...
$ adjrev : num 489 539 1586 1648 917 ...
$ avgrev : num 37.6 44.9 58.8 68.7 29.6 ...
$ Customer_ID : int 1064525 1048538 1010139 1014496 1012053 1028421 1061694 1038047 1084670 1096350 ...
$ comp_dat_Mean : num 0 0.333 0 0 0 ...
$ plcd_dat_Mean : num 0 0.333 0.333 0 0 ...
Summary of the working dataset:
Max. :4.000 Max. :4.0000 Max. :23.000 Max. :5.000 Max. :17.000
SOUTHWEST AREA : 1613 Z :1348
(Other) :14898 (Other):4386
age1 age2 models hnd_price actvsubs
Min. : 0.00 Min. : 0.00 Min. :1.000 Min. : 9.99 Min. :0.00
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.: 59.99 1st Qu.:1.00
Median :36.00 Median : 0.00 Median :1.000 Median : 99.99 Median :1.00
Mean :31.42 Mean :21.37 Mean :1.526 Mean :104.42 Mean :1.34
3rd Qu.:48.00 3rd Qu.:44.00 3rd Qu.:2.000 3rd Qu.:149.99 3rd Qu.:2.00
Max. :94.00 Max. :99.00 Max. :3.500 Max. :284.00 Max. :3.50
SANSAN210: 380 Max. :2.4000 Max. :4.900
(Other) :22611
datovr_Mean datovr_Range drop_dat_Mean drop_vce_Mean
Min. : 0.0000 Min. : 0.0000 Min. : 0.00000 Min. : 0.0000
1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.6667
Median : 0.0000 Median : 0.0000 Median : 0.00000 Median : 3.0000
Mean : 0.2538 Mean : 0.7058 Mean : 0.03939 Mean : 5.1429
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.00000 3rd Qu.: 7.6667
Max. :242.8725 Max. :475.0200 Max. :48.33333 Max. :18.0000
Univariate and Bivariate Analysis along with Outlier Treatment
Continuous Variables
Boxplots of the continuous variables show that outliers are present in most of them.
Outliers are treated by capping at the upper value. A boxplot, histogram and density
plot of the treated values are then used for the analysis.
The chart for each variable consists of 5 plots: a boxplot of the data with outliers, a
boxplot after outlier treatment, a histogram, a density plot, and a plot for the bivariate
analysis.
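The report does not state which upper value is used for capping. One common choice, assumed here, is the upper Tukey fence Q3 + 1.5 × IQR. That choice is consistent with the capped maximum of 3.5 for models and actvsubs in the summary of the working dataset (Q1 = 1, Q3 = 2). The helper `cap_upper` and the sample vector below are hypothetical.

```r
# Hypothetical toy vector: values are invented for illustration.
models <- c(1, 1, 1, 1, 2, 2, 2, 3, 15)

# Cap values above the upper Tukey fence (Q3 + 1.5 * IQR) at the fence.
cap_upper <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  upper <- q[2] + 1.5 * (q[2] - q[1])
  pmin(x, upper)
}

models_capped <- cap_upper(models)
range(models_capped)
```

Capping keeps every row in the dataset, unlike deleting outliers, but it does compress the right tail of each distribution.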
AN1 - Analysis of Mean number of monthly minutes of use
AN2 - Analysis of Mean total monthly recurring charge
AN3 - Range of revenue (charge amount)
AN4 - Range of number of minutes of use
AN5 - Percentage change in monthly minutes of use vs previous 3 months
AN6 - Analysis of Mean number of dropped or blocked calls
AN7 - Range of number of dropped (failed) voice calls
AN8 - Analysis of Range of number of outbound wireless to wireless calls
AN9 - Range of unrounded minutes of use off-peak voice calls
AN10 - Analysis of Total number of months in service
AN11 - Range of unrounded minutes of use off-peak voice calls
AN12 - Analysis of Number of days of current equipment
AN13 - Range of Mean number of customer care calls
AN14 - Analysis of Mean number of call waiting calls
AN15 - Range of inbound wireless to wireless voice calls
AN16 - Analysis of Range of number of call waiting calls
AN17 - Range of unrounded minutes of use off-peak voice calls
AN18 - Analysis of Total number of calls over the life of the customer
AN19 - Range of mean overage revenue
AN20 - Analysis of Mean monthly revenue (charge amount)
AN21 - Range of mean overage minutes of use
AN22 - Analysis of mean number of completed voice calls
AN23 - Mean number of attempted voice calls placed
AN24 - Average monthly minutes of use over previous three months
AN25 - Average monthly minutes of use over the life of the customer
AN26 - Analysis of monthly number of calls over the previous three months
AN27 - Analysis of Average monthly number of calls over the life
AN28 - Analysis of Average monthly minutes of use over six months
AN29 - Average monthly number of calls over six months
AN30 - Age of first household member
AN31 - Age of second household member
AN32 - Analysis of Number of models issued
AN33 - Analysis of current handset price
AN34 - Analysis of active subscribers in household
AN35 - Analysis of unique users in the household
AN36 - Analysis of mean number of off-peak data calls
AN37 - Analysis of mean number of roaming calls
AN38 - Analysis of mean number of received SMS calls
AN39 - Analysis of mean number of blocked (failed) data calls
AN40 - Analysis of mean minutes of use of peak data calls
AN41 - Analysis of mean number of directory assisted calls
AN42 - Analysis of number of directory assisted calls
AN43 - Analysis of mean revenue of data overage
AN44 - Analysis of range of revenue of data overage
AN45 - Analysis of dropped (failed) data calls
AN46 - Analysis of mean number of dropped (failed) voice calls
AN47 - Analysis of billing adjusted total minutes of use over life
AN48 - Analysis of Total Revenue
AN49 - Analysis of adjusted total revenue over life
AN50 - Analysis of Average monthly revenue over the life
AN51 - Analysis of mean number of completed data calls
AN52 - Analysis of mean number of attempted data calls placed
Correlation between the numeric variables
Categorical Variables
1. Income
prop.table(table(mydata2$income,mydata2$churn),1)*100
0 1
1 77.56654 22.43346
2 78.40735 21.59265
3 77.72595 22.27405
4 78.85802 21.14198
5 76.72090 23.27910
6 76.76142 23.23858
7 72.39300 27.60700
8 72.12087 27.87913
9 77.28032 22.71968
table(mydata2$churn, mydata2$income)
1 2 3 4 5 6 7 8 9
0 816 512 1333 2044 2452 5698 3721 1265 2313
1 236 141 382 548 744 1725 1419 489 680
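The row percentages can be reproduced from the count table alone, which also shows what the `1` argument of `prop.table()` does. The sketch below re-enters the counts printed above by hand. The object names `counts` and `pct` are introduced here for illustration.

```r
# Counts copied from the churn-by-income table above
# (rows: churn 0/1, columns: income bands 1..9).
counts <- matrix(
  c(816, 512, 1333, 2044, 2452, 5698, 3721, 1265, 2313,
    236, 141,  382,  548,  744, 1725, 1419,  489,  680),
  nrow = 2, byrow = TRUE,
  dimnames = list(churn = c("0", "1"), income = as.character(1:9))
)

# Transpose so that income bands are rows, then take row percentages.
# This mirrors prop.table(table(income, churn), 1) * 100.
pct <- prop.table(t(counts), margin = 1) * 100
round(pct, 2)

# Churn rate for income band 1, i.e. 236 / (816 + 236) * 100.
pct["1", "1"]
```

With `margin = 1` each row sums to 100, so the second column can be read directly as the churn rate within each income band.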
2. Credit class code
prop.table(table(mydata2$crclscod,mydata2$churn),1)*100
0 1
A 75.034578 24.965422
A2 65.934066 34.065934
A3 0.000000 100.000000
AA 75.234326 24.765674
B 71.032505 28.967495
B2 78.260870 21.739130
BA 73.865367 26.134633
C 76.334107 23.665893
C2 78.260870 21.739130
C5 85.185185 14.814815
CA 78.633594 21.366406
CC 70.000000 30.000000
CY 82.000000 18.000000
D 74.358974 25.641026
D2 100.000000 0.000000
D4 86.597938 13.402062
D5 93.023256 6.976744
DA 79.162512 20.837488
E 83.870968 16.129032
E2 90.000000 10.000000
E4 87.398374 12.601626
EA 80.836421 19.163579
EC 94.117647 5.882353
EF 100.000000 0.000000
EM 64.705882 35.294118
G 71.186441 28.813559
GA 77.777778 22.222222
GY 70.000000 30.000000
H 100.000000 0.000000
I 79.245283 20.754717
IF 66.666667 33.333333
J 80.392157 19.607843
JF 74.137931 25.862069
K 72.727273 27.272727
M 77.500000 22.500000
O 71.428571 28.571429
TP 0.000000 100.000000
U 76.984127 23.015873
U1 82.352941 17.647059
V1 90.000000 10.000000
W 77.777778 22.222222
Y 100.000000 0.000000
Z 72.549020 27.450980
Z1 100.000000 0.000000
Z2 100.000000 0.000000
Z4 84.722222 15.277778
Z5 81.818182 18.181818
ZA 75.783784 24.216216
ZY 78.787879 21.212121
3. Account spending limit
prop.table(table(mydata2$asl_flag,mydata2$churn),1)*100
0 1
N 74.97670 25.02330
Y 81.79764 18.20236
table(mydata2$churn, mydata2$asl_flag)
N Y
0 16896 3258
1 5639 725
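The churn rates for the two account spending limit groups differ by several percentage points. One optional way to check whether such a gap is larger than chance variation is a chi-squared test of independence on the count table. The report does not state that such a test was run, so the sketch below is only a suggestion, and the object names `counts` and `test` are introduced here for illustration.

```r
# Counts copied from the churn-by-asl_flag table above.
counts <- matrix(c(16896, 3258,
                    5639,  725),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(churn = c("0", "1"), asl_flag = c("N", "Y")))

# Pearson chi-squared test of independence. For 2 x 2 tables chisq.test()
# applies the Yates continuity correction by default.
test <- chisq.test(counts)
test$statistic
test$p.value
```

A small p-value would indicate that churn and the spending limit flag are unlikely to be independent. It says nothing about the direction or cause of the relationship.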
4. Social group letter
prop.table(table(mydata2$prizm_social_one,mydata2$churn),1)*100
0 1
C 76.45985 23.54015
R 74.64473 25.35527
S 76.32956 23.67044
T 74.00190 25.99810
U 76.72251 23.27749
table(mydata2$churn, mydata2$prizm_social_one)
C R S T U
0 3771 998 7004 3114 5267
1 1161 339 2172 1094 1598
5. Area
prop.table(table(mydata2$area,mydata2$churn),1)*100
0 1
ATLANTIC SOUTH AREA 76.64188 23.35812
CALIFORNIA NORTH AREA 74.25743 25.74257
CENTRAL/SOUTH TEXAS AREA 78.49462 21.50538
CHICAGO AREA 75.32567 24.67433
DALLAS AREA 75.37367 24.62633
DC/MARYLAND/VIRGINIA AREA 77.17452 22.82548
GREAT LAKES AREA 77.53968 22.46032
HOUSTON AREA 77.34513 22.65487
LOS ANGELES AREA 77.08571 22.91429
MIDWEST AREA 78.63152 21.36848
NEW ENGLAND AREA 75.89928 24.10072
NEW YORK CITY AREA 74.79542 25.20458
NORTH FLORIDA AREA 72.64957 27.35043
NORTHWEST/ROCKY MOUNTAIN AREA 71.70732 28.29268
OHIO AREA 78.52713 21.47287
PHILADELPHIA AREA 72.80000 27.20000
SOUTH FLORIDA AREA 73.13609 26.86391
SOUTHWEST AREA 75.44947 24.55053
TENNESSEE AREA 79.28669 20.71331
table(mydata2$churn, mydata2$area)
0 1059 1393 977 874
1 346 412 283 256
LOS ANGELES AREA MIDWEST AREA NEW ENGLAND AREA NEW YORK CITY AREA
0 1349 1402 1055 2285
1 401 381 335 770
6. Handset: refurbished or new
prop.table(table(mydata2$refurb_new,mydata2$churn),1)*100
0 1
N 76.52976 23.47024
R 72.69494 27.30506
table(mydata2$churn, mydata2$refurb_new)
N R
0 17497 2657
1 5366 998
7. Handset web capability
prop.table(table(mydata2$hnd_webcap,mydata2$churn),1)*100
0 1
UNKW 91.176471 8.823529
WC 68.529826 31.470174
WCMB 77.388248 22.611752
table(mydata2$churn, mydata2$hnd_webcap)
UNKW WC WCMB
0 62 2918 17174
1 6 1340 5018
8. Marital status
prop.table(table(mydata2$marital,mydata2$churn),1)*100
0 1
A 75.20000 24.80000
B 76.63303 23.36697
M 77.09044 22.90956
S 77.17325 22.82675
U 74.47304 25.52696
table(mydata2$churn, mydata2$marital)
A B M S U
0 1034 1443 6555 3773 7349
1 341 440 1948 1116 2519
9. Ethnicity roll-up code
prop.table(table(mydata2$ethnic,mydata2$churn),1)*100
0 1
B 70.317003 29.682997
C 91.780822 8.219178
D 70.000000 30.000000
F 75.130435 24.869565
G 77.451593 22.548407
H 75.322490 24.677510
I 73.278520 26.721480
J 74.827109 25.172891
M 80.555556 19.444444
N 76.555337 23.444663
O 70.216963 29.783037
P 81.818182 18.181818
R 72.908367 27.091633
S 75.452716 24.547284
U 75.403226 24.596774
X 83.870968 16.129032
Z 83.605341 16.394659
table(mydata2$churn, mydata2$ethnic)
B C D F G H I J M N O P R S U X
0 244 67 154 432 1240 2686 713 541 29 7014 712 117 183 2625 2244 26
1 103 6 66 143 361 880 260 182 7 2148 302 26 68 854 732 5
Z
0 1127
1 221
10. Foreign travel dummy variable
prop.table(table(mydata2$forgntvl,mydata2$churn),1)*100
0 1
0 75.97200 24.02800
1 76.48221 23.51779
table(mydata2$churn, mydata2$forgntvl)
0 1
0 18993 1161
1 6007 357
11. Dwelling unit type
prop.table(table(mydata2$dwlltype,mydata2$churn),1)*100
0 1
M 74.79606 25.20394
S 76.96271 23.03729
table(mydata2$churn, mydata2$dwlltype)
M S
0 8802 11352
1 2966 3398
12. Dwelling size
prop.table(table(mydata2$dwllsize,mydata2$churn),1)*100
0 1
A 76.66584 23.33416
B 75.05963 24.94037
C 79.92599 20.07401
D 71.33479 28.66521
E 85.22427 14.77573
F 72.54902 27.45098
G 76.26582 23.73418
H 78.38828 21.61172
I 78.14208 21.85792
J 71.07642 28.92358
K 76.39257 23.60743
L 76.76768 23.23232
M 76.84729 23.15271
N 71.05263 28.94737
O 70.46070 29.53930
table(mydata2$churn, mydata2$dwllsize)
A B C D E F G H I J K L M
0 12380 2832 864 326 323 185 241 214 143 865 288 304 156
1 3768 941 217 131 56 70 75 59 40 352 89 92 47
N O
0 513 520
1 209 218
13. Motorcycle indicator
prop.table(table(mydata2$mtrcycle,mydata2$churn),1)*100
0 1
0 76.0263 23.9737
1 74.1573 25.8427
table(mydata2$churn, mydata2$mtrcycle)
0 1
0 19890 264
1 6272 92
14. Truck indicator
prop.table(table(mydata2$truck,mydata2$churn),1)*100
0 1
0 75.92395 24.07605
1 76.33313 23.66687
table(mydata2$churn, mydata2$truck)
0 1
0 16332 3822
1 5179 1185
15. New or used car buyer
prop.table(table(mydata2$car_buy,mydata2$churn),1)*100
0 1
New 76.66960 23.33040
UNKNOWN 75.50105 24.49895
table(mydata2$churn, mydata2$car_buy)
New UNKNOWN
0 8702 11452
1 2648 3716
Observation