XYZ Data Analysis Report

This report analyzes login data to characterize patterns of user demand over time. Key findings from visualizing and describing the login count time series include: login counts follow clear daily and weekly cycles, with demand highest on weekends and during midday and evenings on weekdays. An experiment is proposed to reimburse toll costs for drivers operating between two cities with complementary circadian rhythms, to encourage wider driver availability. The success of this experiment is measured by the average per-driver profit on trips between the cities, and a two-week experiment is designed to statistically test whether reimbursing tolls significantly increases this key metric. Interpretation of the results and caveats are also discussed.

Part 1 Exploratory data analysis

The attached logins.json file contains (simulated) timestamps of user logins in a particular
geographic location. Aggregate these login counts based on 15-minute time intervals, and
visualize and describe the resulting time series of login counts in ways that best
characterize the underlying patterns of the demand. Please report/illustrate important
features of the demand such as daily cycles. If there are data quality issues, please report
them.

The file logins.json contains 93142 observations, with the first login observed at "1970-01-01 20:12:16" and
the last at "1970-04-13 18:57:38". Figure 1.1 shows the time series plot of login counts, aggregated over
15-minute time intervals. The maximum frequency of login counts occurs on March 1st, 1970 (a Sunday).
The plot shows a seasonal pattern in the login counts, peaking towards the weekend, throughout the
roughly 15-week period.

Figure 1.1: Time series plot of login counts
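
Below is a minimal sketch of this aggregation in R, assuming logins.json holds a flat JSON array of timestamp strings (the file name comes from the prompt; everything else is illustrative):

library(jsonlite)

# Read the (simulated) login timestamps; assumes a flat JSON array of strings.
logins <- fromJSON("logins.json")
ts <- as.POSIXct(logins, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Aggregate into 15-minute bins: cut() floors each timestamp to its bin start,
# and table() counts the logins per bin (empty bins are kept as zero counts).
bins   <- cut(ts, breaks = "15 mins")
counts <- as.data.frame(table(bins), responseName = "logins")
counts$bins <- as.POSIXct(counts$bins, tz = "UTC")

plot(counts$bins, counts$logins, type = "l",
     xlab = "Time", ylab = "Login count (15-minute bins)")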

Figure 1.2 shows the login counts in each 15-minute time interval, averaged over all weeks.

Weekday patterns:

Starting from Sunday midnight, the login frequency decreases from midnight until the early hours (about
5:00 am), rises until about noon, falls until about 4:00 pm, and then increases from about 4:00 pm until
midnight. The daily seasonal pattern thus has two modes: a maximum around noon and a second mode
around midnight. There is also a positive trend in login counts across the weekdays: days later in the week
(Wednesday, Thursday, and Friday) see more logins around noon than days earlier in the week (Monday and
Tuesday). From Thursday evening through Saturday night, login counts from the evening until midnight are
higher than on the remaining days of the week. This shows that on weekdays, demand for XYZ drivers is
highest around noon and from late evening until midnight.

Weekend patterns:

On Friday, login counts (demand) increase sharply from the evening until late night (about 5:00 am), then
fall to a low at about 6:00 am on Saturday morning. The same pattern repeats on Saturday, with the highest
frequency of the entire week at about 4:00-5:00 am. On Sunday night we observe fewer logins (less demand)
than on the other nights of the week.
The login count pattern resembles a busy city (or multiple cities) where people go to work during
weekdays and go out during weekends.

Figure 1.2: Time series plot of login counts (aggregated over weeks)

The difference between Figure 1.2 and Figure 1.3 is that in Figure 1.2 the login counts are
aggregated over 15-minute time intervals, whereas in Figure 1.3 they are aggregated by hour. Figure 1.3
therefore reflects the hourly patterns and makes comparison across the days of the week easier.

Figure 1.3: Time series plot of hourly login counts (aggregated over weeks)

The matrix plot in Figure 1.4 shows the hourly login counts over the 7 days of the week. Cyan reflects
high login counts, whereas cornsilk reflects low login counts. From this plot it is easy to read off the
demand for XYZ drivers during each hour of the week.

Figure 1.4: Matrix Plot of Hourly Login Counts
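
A hypothetical sketch of this matrix plot, building on the 15-minute counts data frame from the aggregation sketch above (the cornsilk-to-cyan palette matches the description; all object names are illustrative):

# Label each 15-minute bin with its hour of day and day of week.
counts$hour <- as.integer(format(counts$bins, "%H"))
counts$day  <- factor(weekdays(counts$bins),
                      levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                 "Friday", "Saturday", "Sunday"))

# Total logins in each (day, hour) cell, drawn as a heat map.
m <- tapply(counts$logins, list(counts$day, counts$hour), sum)
image(x = 0:23, y = 1:7, z = t(m),
      col = colorRampPalette(c("cornsilk", "cyan"))(25),
      xlab = "Hour of day", ylab = "Day of week (1 = Monday)")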

Part 2 Experiment and metrics design


The neighboring cities of Gotham and Metropolis have complementary circadian rhythms:
on weekdays, XYZ Gotham is most active at night, and XYZ Metropolis is most active
during the day. On weekends, there is reasonable activity in both cities. However, a toll
bridge, with a two-way toll, between the two cities causes driver partners to tend to be
exclusive to each city. The XYZ managers of city operations for the two cities have
proposed an experiment to encourage driver partners to be available in both cities, by
reimbursing all toll costs.

Q1. What is the key measure of success of this experiment in encouraging the driver partners to serve both
cities, and why would you choose this measure?

Since the XYZ managers pay for the tolls, this investment must be recovered through the profit generated by
the driver partners who drive to the other city (and hence incur the toll). Therefore, a key metric for the
success of this experiment is the average per-driver profit made when a driver partner takes a trip to the
other city.

I would choose this metric because it tells me:

i) whether there is any profit left after paying the toll fee; and

ii) how much a driver partner makes: a driver who makes good money will be happier and more likely
to drive to the other city.

Q2. Describe a practical experiment you would design to compare the effectiveness of the proposed
change in relation to the key measure of success. Please provide details on:

a) How you will implement the experiment?


To compare the effectiveness of the proposed toll-reimbursement scheme, we want to know whether
there is any significant change in XYZ's profit after tolls are reimbursed. I would design the
experiment by collecting data for a minimum of two weeks, as follows:

i) During the first week, do not pay the toll fee, and calculate the earnings from each
driver partner for the entire week. Supposing there are 50 drivers in each city, we then
have observations on 100 drivers, recording how much each contributed to XYZ through
rides.
ii) During the second week, pay the toll fee and calculate the earnings from each driver
partner for the entire week, where the earnings from a driver partner are computed
after subtracting the toll fees. Again, from 100 driver partners we have 100
observations.

b) What statistical test(s) you will conduct to verify the significance of the observation, and

If the experiment is effective, the per-driver earnings during the second week must be higher
than during the first week. To apply a statistical test of significance, we make the following
assumptions:

Assumptions:
i) XYZ's earnings from the driver partners are independent, for both Gotham and
Metropolis.
ii) During the second week, we expect more driver partners to be available in Gotham
at night and more driver partners in Metropolis during the day. The increased
availability of driver partners (if any) does not affect XYZ's earnings from the
existing driver partners (those already operating in their respective city).
(If this assumption fails, the earnings are no longer independent.)
iii) The earnings of any driver partner are normally distributed with finite (unknown)
mean and finite variance.

Under these assumptions, we can test the significance of the difference in average per-driver earnings
between the first and second weeks. Let mu1 and mu2 be the true (unknown) average earnings per driver
during week one and week two respectively, and define H0 and H1 by:

H0: mu1 = mu2, i.e. reimbursing tolls does not improve the earnings per driver partner in week 2
relative to week 1.
H1: mu2 > mu1, i.e. reimbursing tolls significantly increases XYZ's average earnings per driver
partner in week 2 relative to week 1.

Let x1b and x2b be the average earnings during week one and week two respectively, and let s1
and s2 be the standard deviations of the earnings during week one and week two.

Next compute the test statistic:

Z = (x2b - x1b) / sqrt(s1^2/n1 + s2^2/n2)

where n1 and n2 are the numbers of driver partners during the first and second weeks. Since I assume
50 drivers in each city, in this particular case n1 = n2 = 100. (Note the numerator is x2b - x1b, so
large positive values of Z favor H1.)

Under assumptions (i)-(iii), the test statistic Z is approximately standard normal under H0, so we can
assess the significance of the mean difference at a given significance level.
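
A minimal sketch of this test in R, with hypothetical vectors week1 and week2 of per-driver weekly earnings (Welch's two-sample t-test is the standard implementation; with n1 = n2 = 100 it is essentially the Z-test above):

# Hypothetical per-driver weekly earnings; replace with the real observations.
set.seed(1)
week1 <- rnorm(100, mean = 500, sd = 80)   # week 1: no toll reimbursement
week2 <- rnorm(100, mean = 530, sd = 80)   # week 2: tolls reimbursed (net of fees)

# One-sided Welch two-sample t-test of H1: mu2 > mu1.
t.test(week2, week1, alternative = "greater", var.equal = FALSE)

# The Z statistic defined above, computed directly:
z <- (mean(week2) - mean(week1)) /
  sqrt(var(week1) / length(week1) + var(week2) / length(week2))
1 - pnorm(z)   # one-sided p-value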
c) how you would interpret the results and provide recommendations to the city operations team
along with any caveats.

If the mean difference turns out to be statistically significant at a given level (say 5%), we would
conclude that reimbursing tolls is effective and yields good business for XYZ. However, if the
mean difference is not significant, we cannot say whether the toll reimbursement is effective for the
metric considered here.

Caveats:
a. Assumption (ii) may not hold in practice: if there are more drivers, the business of the existing
drivers in a given area is affected.
b. For the described statistical test to be valid, one typically needs more than 30 driver
partners; this may not be realistic if the cities of Gotham and Metropolis are very small.
c. Not all driver partners will want to go to the other city, so XYZ's earnings per driver
partner may not be independent. This would violate assumption (i).
d. The normality assumption (iii) may also fail. However, if the number of driver partners
is large, this is not a major concern.

Part 3 Predictive modeling


XYZ is interested in predicting rider retention. To help explore this question, we have
provided a sample dataset of a cohort of users who signed up for an XYZ account in January
2014. The data was pulled several months later; we consider a user retained if they were
active (i.e. took a trip) in the preceding 30 days. We would like you to use this data set to
help understand what factors are the best predictors for retention, and offer suggestions
to operationalize those insights to help XYZ.

Objective: The main objective is to predict rider retention. The analysis should also provide valuable
insights about the important factors for retention, and suggestions to operationalize those insights to help
XYZ.

Dataset: Rider information on various features, including city, signup date, and last trip date. The dataset
contains 50K observations on 12 variables.
Data Source: Provided as part of the data analysis challenge.
Here is the summary of the entire dataset:

 city                     trips_in_first_30_days   signup_date
 Astapor       :16534     Min.   :  0.000          Min.   :2014-01-01
 King's Landing:10130     1st Qu.:  0.000          1st Qu.:2014-01-09
 Winterfell    :23336     Median :  1.000          Median :2014-01-17
                          Mean   :  2.278          Mean   :2014-01-16
                          3rd Qu.:  3.000          3rd Qu.:2014-01-24
                          Max.   :125.000          Max.   :2014-01-31

 avg_rating_of_driver     avg_surge                last_trip_date           phone
 Min.   :1.000            Min.   :1.000            Min.   :2014-01-01              :  396
 1st Qu.:4.300            1st Qu.:1.000            1st Qu.:2014-02-14       Android:15022
 Median :4.900            Median :1.000            Median :2014-05-08       iPhone :34582
 Mean   :4.602            Mean   :1.075            Mean   :2014-04-19
 3rd Qu.:5.000            3rd Qu.:1.050            3rd Qu.:2014-06-18
 Max.   :5.000            Max.   :8.000            Max.   :2014-07-01
 NA's   :8122

 surge_pct                XYZ_black_user           weekday_pct              avg_dist
 Min.   :  0.00           Mode :logical            Min.   :  0.00           Min.   :  0.000
 1st Qu.:  0.00           FALSE:31146              1st Qu.: 33.30           1st Qu.:  2.420
 Median :  0.00           TRUE :18854              Median : 66.70           Median :  3.880
 Mean   :  8.85           NA's :0                  Mean   : 60.93           Mean   :  5.797
 3rd Qu.:  8.60                                    3rd Qu.:100.00           3rd Qu.:  6.940
 Max.   :100.00                                    Max.   :100.00           Max.   :160.960

 avg_rating_by_driver
 Min.   :1.000
 1st Qu.:4.700
 Median :5.000
 Mean   :4.778
 3rd Qu.:5.000
 Max.   :5.000
 NA's   :201

The data set contains missing values in a number of variables, including avg_rating_of_driver,
phone, and avg_rating_by_driver. Since the number of complete cases is still large (41,445) and the number
of variables is modest, we use the complete-case data of 41,445 observations on 12 variables to build the
predictive model for retention. This retains about 83% of the users for model building.
To make the response categorical, we set it to 1 if the XYZ user has taken a ride in the past month
(i.e. is retained), and 0 otherwise. Next we do some exploratory data analysis to investigate the variables
in the data set.
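
A minimal sketch of this preprocessing, assuming the raw data frame is named udc (to match the udc.c naming used below) and that the data were pulled on 2014-07-01, the maximum last_trip_date in the summary above:

# Binary response: retained = took a trip in the 30 days before the pull date.
udc$last_trip_date <- as.Date(udc$last_trip_date)
pull_date <- as.Date("2014-07-01")
udc$retained <- as.integer(udc$last_trip_date >= pull_date - 30)

# Complete cases only: about 41,445 of the 50,000 rows.
udc.c <- udc[complete.cases(udc), ]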

Figure 3.1: Box Plot (before transforming the variables)

From the summary and the box plot we see that several variables are highly over-dispersed (strongly skewed).
In these situations, to get a meaningful relationship, we transform the variables to reduce the over-dispersion.

Transformations:
# udc.c is the complete-case data; udc.c.new holds the transformed variables.
udc.c.new$trips_in_first_30_days <- log(1 + udc.c$trips_in_first_30_days)  # log(1+x) for counts
udc.c.new$avg_surge   <- log(udc.c$avg_surge)      # avg_surge >= 1, so log is defined
udc.c.new$surge_pct   <- udc.c$surge_pct / 100     # percentage -> proportion in [0, 1]
udc.c.new$avg_dist    <- log(1 + udc.c$avg_dist)   # log(1+x): distances include zeros
udc.c.new$weekday_pct <- udc.c$weekday_pct / 100   # percentage -> proportion in [0, 1]

Also, as most surge_pct values are zero and its distribution is skewed, it is good practice to convert it to
a categorical variable. The transformation (written here with vectorized comparisons, equivalent to the
original intersect/which construction) is:

# Bin surge_pct into 6 levels: 0, (0,.10], (.10,.25], (.25,.50], (.50,.80], (.80,1]
udc.c.new$surge_pct_f <- rep(0, length(udc.c.new$surge_pct))
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0    & udc.c.new$surge_pct <= 0.10] <- 1
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0.10 & udc.c.new$surge_pct <= 0.25] <- 2
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0.25 & udc.c.new$surge_pct <= 0.50] <- 3
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0.50 & udc.c.new$surge_pct <= 0.80] <- 4
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0.80] <- 5

Boxplot after the transformation: The over-dispersion has significantly decreased.

Figure 3.2: Box Plot (after transforming the variables)

Next we check the relationship between the response and the predictor variables. As the response is binary,
a raw scatter plot would not be very informative, so we bin each predictor, compute the mean response within
each bin, and plot these grouped means. Below are some of the plots:

Figure 3.3: Last_trip_date (yes/no) by rating_by_driver


Figure 3.4: Last_trip_date (yes/no) by avg_rating_of_driver

Figure 3.5: Last_trip_date (yes/no) by surge_pct


Both plots, i) grouped response vs. average rating by driver and ii) grouped response vs. average rating
of driver, are increasing. Under a logistic model we would expect the grouped response to trace an S-shaped
curve over the predictor values, but since there are few grouped points per plot, this is hard to discern.
The plot of grouped response vs. surge_pct is scattered and does not show a clear S-shaped pattern.
Therefore, transforming the surge_pct variable to a categorical variable is the better idea.

Check for Multicollinearity:

Multicollinearity is an important check before model building, as it helps to exclude problematic
variables from the model. The table below shows the correlations between the continuous variables in
the data set.
Table 3.1: Correlation Matrix

                         trips_30d  rating_of  avg_surge  weekday_pct  avg_dist  rating_by
trips_in_first_30_days      1.000     -0.029      0.033       0.035      -0.129    -0.047
avg_rating_of_driver       -0.029      1.000     -0.020       0.013       0.033     0.122
avg_surge                   0.033     -0.020      1.000      -0.134      -0.098     0.011
weekday_pct                 0.035      0.013     -0.134       1.000       0.093     0.018
avg_dist                   -0.129      0.033     -0.098       0.093       1.000     0.095
avg_rating_by_driver       -0.047      0.122      0.011       0.018       0.095     1.000

(Column headers abbreviate the row names, in the same order.)

There is no notable multicollinearity among the variables. Another popular check uses eigenvalues: the
eigenvalues of X'X (where X is the standardized data matrix containing the above variables, so X'X is their
correlation matrix) are:

1.2893826, 1.1207735, 1.0284520, 0.9091422, 0.8375238, 0.8147259

The smallest eigenvalue is well away from zero, which suggests that the data do not have a
multicollinearity issue.
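
A minimal sketch of both checks in R, assuming the transformed complete-case data frame udc.c.new from above:

# Correlation matrix of the continuous predictors (Table 3.1).
num_vars <- c("trips_in_first_30_days", "avg_rating_of_driver", "avg_surge",
              "weekday_pct", "avg_dist", "avg_rating_by_driver")
R <- cor(udc.c.new[, num_vars])
round(R, 3)

# Eigenvalues of the correlation matrix; values near zero would signal
# multicollinearity (the six eigenvalues sum to 6).
eigen(R)$values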

Modeling:
Next we fit univariate models to assess each predictor's significance. We model the response by means of
logistic regression.

Training and Validation Data:


We divide the data into training (2/3) and validation (1/3) subsets. We use the training data to fit the
model and the validation data to report the accuracy of the fitted model.
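
A sketch of the split and of one univariate logistic fit, assuming the retained response constructed earlier (a 2/3 split of 41,445 rows gives the 27,630 training observations seen in the degrees of freedom below):

# Random 2/3 training / 1/3 validation split.
set.seed(42)
n     <- nrow(udc.c.new)
train <- sample(n, size = round(2 * n / 3))

# Intercept-only model and one univariate logistic regression.
fit0 <- glm(retained ~ 1, data = udc.c.new[train, ], family = binomial)
fit1 <- glm(retained ~ trips_in_first_30_days,
            data = udc.c.new[train, ], family = binomial)
summary(fit1)   # z-values, deviances, and AIC as reported below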

###################################################
# 0. intercept model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.40411 0.01228 -32.91 <2e-16 ***

Null deviance: 37198 on 27629 degrees of freedom


Residual deviance: 37198 on 27629 degrees of freedom
AIC: 37200

# 1. XYZ_black_user
Coefficients:
(Intercept) -0.78487 0.01654 -47.46 <2e-16 ***
XYZ_black 0.94321 0.02554 36.94 <2e-16 ***

Null deviance: 37198 on 27629 degrees of freedom


Residual deviance: 35808 on 27628 degrees of freedom
AIC: 35812
# 2. Avg_rating_of_driver
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.29198 0.09213 -3.169 0.00153 **
avg_rating_of_driver -0.02437 0.01985 -1.228 0.21953

Null deviance: 37198 on 27629 degrees of freedom


Residual deviance: 37196 on 27628 degrees of freedom
AIC: 37200

# 3. trips_in_first_30_days
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.91875 0.02034 -45.17 <2e-16 ***
trips_in_first_30_days 0.53572 0.01634 32.78 <2e-16 ***

Null deviance: 37198 on 27629 degrees of freedom


Residual deviance: 36073 on 27628 degrees of freedom
AIC: 36077

# 4. avg_surge

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.42132 0.01348 -31.256 < 2e-16 ***
avg_surge 0.28358 0.09097 3.117 0.00182 **

Null deviance: 37198 on 27629 degrees of freedom


Residual deviance: 37188 on 27628 degrees of freedom
AIC: 37192

# 5. Avg_dist
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.14566 0.03583 -4.065 4.81e-05 ***
avg_dist -0.15634 0.02044 -7.650 2.01e-14 ***

Null deviance: 37198 on 27629 degrees of freedom


Residual deviance: 37139 on 27628 degrees of freedom
AIC: 37143

# 6. avg_rating_by_driver

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.57446 0.14474 3.969 7.22e-05 ***
avg_rating_by_driver -0.20499 0.03022 -6.783 1.18e-11 ***

Null deviance: 37198 on 27629 degrees of freedom


Residual deviance: 37152 on 27628 degrees of freedom
AIC: 37156

# 7. Weekday_pct
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.48741 0.02505 -19.456 < 2e-16 ***
weekday_pct 0.13556 0.03544 3.826 0.00013 ***

Null deviance: 37198 on 27629 degrees of freedom


Residual deviance: 37183 on 27628 degrees of freedom
AIC: 37187

The univariate analysis shows that all the predictors are significant at the 0.25 level. We
chose this liberal p-value threshold for deciding which variables to include in the full model. The full model is:

# full model (without interaction terms)


Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.29742 0.20016 1.486 0.1373
City_AK -1.80487 0.04155 -43.440 < 2e-16 ***
City_WK -1.17700 0.03793 -31.034 < 2e-16 ***
trips_in_first_30_days 0.16121 0.02033 7.930 2.19e-15 ***
avg_rating_of_driver -0.02868 0.02366 -1.212 0.2254
avg_rating_by_driver -0.26239 0.03511 -7.473 7.84e-14 ***
avg_surge -1.16862 0.22680 -5.153 2.57e-07 ***
Phone_i 1.17328 0.03483 33.687 < 2e-16 ***
surge_pct_10 2.16643 0.05912 36.644 < 2e-16 ***
surge_pct_20 1.49072 0.05173 28.817 < 2e-16 ***
surge_pct_30 0.70856 0.07837 9.041 < 2e-16 ***
surge_pct_40 0.86360 0.19431 4.445 8.81e-06 ***
surge_pct_50 -0.18802 0.18750 -1.003 0.3160
XYZ_black 0.91475 0.02984 30.659 < 2e-16 ***
weekday_pct 0.08357 0.04297 1.945 0.0518 .
avg_dist -0.05077 0.02452 -2.071 0.0384 *

Null deviance: 37198 on 27629 degrees of freedom


Residual deviance: 28835 on 27614 degrees of freedom
AIC: 28867

The avg_rating_of_driver and weekday_pct terms appear insignificant at the 5% level, so we first drop
avg_rating_of_driver from the model. The reduced model is:
# overall sig of model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.18561 0.17763 1.045 0.2960
City_AK -1.80297 0.04151 -43.430 < 2e-16 ***
City_WK -1.17263 0.03775 -31.065 < 2e-16 ***
trips_in_first_30_days 0.16163 0.02033 7.952 1.84e-15 ***
avg_rating_by_driver -0.26711 0.03489 -7.656 1.91e-14 ***
avg_surge -1.16108 0.22670 -5.122 3.03e-07 ***
Phone_i 1.17436 0.03482 33.726 < 2e-16 ***
surge_pct_10 2.16588 0.05912 36.638 < 2e-16 ***
surge_pct_20 1.48979 0.05172 28.804 < 2e-16 ***
surge_pct_30 0.70664 0.07835 9.019 < 2e-16 ***
surge_pct_40 0.86224 0.19423 4.439 9.03e-06 ***
surge_pct_50 -0.19101 0.18743 -1.019 0.3082
XYZ_black 0.91520 0.02983 30.677 < 2e-16 ***
weekday_pct 0.08317 0.04296 1.936 0.0529 .
avg_dist -0.05158 0.02451 -2.105 0.0353 *

Null deviance: 37198 on 27629 degrees of freedom


Residual deviance: 28837 on 27615 degrees of freedom
AIC: 28867
The weekday_pct term remains insignificant, so we drop it from the model. In the resulting model avg_dist
is insignificant, and after removing avg_dist the accuracy on the validation data goes slightly up (by about
0.1%); this indicates that avg_dist does not have a significant relationship with the response and should be
dropped. Dropping predictors reduces model complexity, making the model less dependent on individual
variables and therefore more robust. The reduced model is:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.17539 0.17403 1.008 0.314
City_AK -1.80116 0.04145 -43.451 < 2e-16 ***
City_WK -1.17543 0.03772 -31.164 < 2e-16 ***
trips_in_first_30_days 0.16680 0.02023 8.247 < 2e-16 ***
avg_rating_by_driver -0.27293 0.03471 -7.863 3.75e-15 ***
avg_surge -1.16860 0.22662 -5.157 2.51e-07 ***
Phone_i 1.17447 0.03481 33.735 < 2e-16 ***
surge_pct_10 2.17219 0.05905 36.786 < 2e-16 ***
surge_pct_20 1.49229 0.05154 28.954 < 2e-16 ***
surge_pct_30 0.70660 0.07805 9.054 < 2e-16 ***
surge_pct_40 0.86250 0.19392 4.448 8.68e-06 ***
surge_pct_50 -0.19472 0.18689 -1.042 0.297
XYZ_black 0.91261 0.02977 30.656 < 2e-16 ***

Null deviance: 37198 on 27629 degrees of freedom


Residual deviance: 28844 on 27617 degrees of freedom
AIC: 28870

Including interaction terms:

Next we include interaction terms one by one. Based on intuition, we consider the following pairs of
interactions (see the sketch after this list):
#1. avg_surge and XYZ_black
#2. surge_pct and weekend_pct, where weekend_pct = 1 - weekday_pct
#3. avg_dist and city
#4. trips_in_first_30_days and weekend_pct
#5. trips_in_first_30_days and avg_dist
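
A hedged sketch of fitting the model with the one interaction that survives, assuming factor variables city, phone, and surge_pct_f coded as in the output below (all object names are illustrative):

# Final model: main effects plus the avg_surge x XYZ_black interaction.
fit_final <- glm(retained ~ city + trips_in_first_30_days + avg_rating_by_driver +
                   avg_surge + phone + factor(surge_pct_f) + XYZ_black_user +
                   avg_surge:XYZ_black_user,
                 data = udc.c.new[train, ], family = binomial)
summary(fit_final)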

Among all these interaction terms, only #1 (avg_surge and XYZ_black) is statistically significant. The final
model is:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.18432 0.17407 1.059 0.28964
City_AK -1.80107 0.04146 -43.442 < 2e-16 ***
City_WK -1.17518 0.03771 -31.166 < 2e-16 ***
trips_in_first_30_days 0.16649 0.02023 8.228 < 2e-16 ***
avg_rating_by_driver -0.27150 0.03472 -7.821 5.25e-15 ***
avg_surge -1.40830 0.24813 -5.676 1.38e-08 ***
Phone_i 1.17439 0.03482 33.725 < 2e-16 ***
surge_pct_10 2.16863 0.05907 36.714 < 2e-16 ***
surge_pct_20 1.48682 0.05170 28.756 < 2e-16 ***
surge_pct_30 0.70847 0.07839 9.038 < 2e-16 ***
surge_pct_40 0.88976 0.19579 4.545 5.51e-06 ***
surge_pct_50 -0.14188 0.18859 -0.752 0.45188
XYZ_black 0.87760 0.03270 26.837 < 2e-16 ***
avg_surge_int_XYZ_black 0.68655 0.26557 2.585 0.00973 **

Null deviance: 37198 on 27629 degrees of freedom


Residual deviance: 28838 on 27616 degrees of freedom
AIC: 28866

Summary:
i) This model has the lowest AIC value among all the models considered; a model with lower AIC is preferred.
ii) All the retained variables are highly statistically significant, with very low p-values.
iii) When surge_pct is included as a continuous variable, the prediction performance of the model
on the validation set drops by about 4%; this indicates that surge_pct should be modeled as
categorical.
iv) The model is robust: across different training samples it retains the same predictors, the
changes in the coefficient estimates are very small (we expect little change as the training
sample changes), and the prediction performance on the validation data is similar.

Results on Validation dataset (model performance):

We use the fitted model to predict the response for the validation data given the predictors. The
misclassification error based on 20 runs is 26.02% (about 74% accuracy), with a standard deviation of 0.4978%.

Next we answer questions asked in the data analysis challenge.

Q. We would like you to use this data set to help understand what factors are the best predictors
for retention.

Among the predictors, City has the largest z-value, followed by surge_pct, phone, XYZ_black,
avg_rating_by_driver, trips_in_first_30_days, and avg_surge. All of these variables are statistically
significant with low p-values and are therefore the best predictors of retention. The interaction term
between avg_surge and XYZ_black is also statistically significant and is an important variable for
predicting retention.

Q. Briefly discuss how XYZ might leverage the insights gained from the model to improve
its long term rider retention (again, a few sentences will suffice).

To give more insight, we first compute the odds ratios.

Coefficient                 Estimate   Odds ratio

City_AK                      -1.801      0.165
City_WK                      -1.175      0.309
trips_in_first_30_days        0.166      1.181
avg_rating_by_driver         -0.272      0.762
avg_surge                    -1.408      0.245
Phone_i                       1.174      3.236
surge_pct_10                  2.169      8.746
surge_pct_20                  1.487      4.423
surge_pct_30                  0.708      2.031
surge_pct_40                  0.890      2.435
surge_pct_50                 -0.142      0.868
XYZ_black                     0.878      2.405
avg_surge_int_XYZ_black       0.687      1.987
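
These odds ratios are simply the exponentiated coefficients; a one-line computation, assuming the final fitted model object fit_final from the sketch above:

# Odds ratios with Wald 95% confidence intervals.
round(exp(cbind(OR = coef(fit_final), confint.default(fit_final))), 3)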
Interpretation: The odds ratio of 0.165 for City_AK means that, with all other variables held fixed, the
odds that a user from King's Landing travelled in the last month are about 6 times (1/0.165 ≈ 6.1) the odds
for a user from Astapor.

Summary and Recommendations:

i) From the above table, the highest odds ratio is for surge_pct_10, which is coded 1 if
surge_pct falls in (0, 0.1] = (0%, 10%] and 0 otherwise. The odds ratio of surge_pct_10 is
relative to surge_pct = 0%. Equivalently, a user who takes (0-10%] of rides under surge
pricing is about 8.746 times more likely to be retained than a user who never rides under
surge pricing. The decreasing odds ratios across the surge_pct levels show that the more
often a user rides under surge pricing, the less likely he/she is to have travelled in the
last month.
ii) The odds ratio of 2.405 for XYZ_black shows that an XYZ_black user is 2.405 times more
likely to have travelled in the last month than a non-XYZ_black user, with all other
variables held fixed.

iii) From the analysis we can say that a user is more likely to have used XYZ in the last month if
he/she:
a. uses an iPhone,
b. is an XYZ_black user,
c. is from King's Landing,
d. has taken (0-10%] of rides under surge pricing, and
e. took more trips during the first 30 days.

Astapor seems to be a less profitable market compared to King's Landing. Comparing avg_surge across
the three cities, we see that the average surge in Astapor is 6.8%, the highest, compared to 6.1% in
King's Landing and 5.4% in Winterfell. As the coefficient of avg_surge is negative in the model, it
might be a good idea to lower the surge in Astapor, or to set it at the same level as King's Landing,
to increase user retention. The analysis also shows that an iPhone user is 3.2 times more likely to have
used XYZ in the last month than an Android user; it might be a good idea to offer better rates to Android
users to improve their retention.
