XYZ Data Analysis Report
The attached logins.json file contains (simulated) timestamps of user logins in a particular
geographic location. Aggregate these login counts based on 15-minute time intervals, and
visualize and describe the resulting time series of login counts in ways that best
characterize the underlying patterns of the demand. Please report/illustrate important
features of the demand such as daily cycles. If there are data quality issues, please report
them.
The file logins.json contains 93,142 observations, with the first login observed at "1970-01-01 20:12:16" and
the last at "1970-04-13 18:57:38". Figure 1.1 shows the time series plot of login counts, aggregated over
15-minute time intervals. The maximum frequency of login counts occurs on March 1st, 1970 (a Sunday).
The plot shows a seasonal pattern in the login count frequency over the 15-week period, highest towards
the weekend.
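The aggregation itself can be done with a short R sketch along these lines (assuming logins.json holds a flat array of timestamp strings; the jsonlite package is used for parsing):

library(jsonlite)

# Parse the login timestamps and bin them into 15-minute intervals
logins = as.POSIXct(fromJSON("logins.json"), tz = "UTC")
bins = cut(logins, breaks = "15 mins")
counts = as.data.frame(table(bins))   # login counts per 15-minute interval

# Time series plot of the binned counts (as in Figure 1.1)
plot(as.POSIXct(as.character(counts$bins), tz = "UTC"), counts$Freq,
     type = "l", xlab = "Time", ylab = "Logins per 15 minutes")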
Figure 1.2 shows the login counts aggregated over 15-minute time intervals, averaged over all weeks.
Weekday patterns:
Starting from Sunday midnight, the frequency goes down from midnight until the early hours (about
5:00am), rises until about noon, decreases until about 4:00pm, and then increases from about 4:00pm to
midnight. There is a daily seasonal pattern with two modes: a maximum around noon and a second mode
around midnight. There is a positive trend in the login counts over the weekdays: the days later in the
week (Wednesday, Thursday, and Friday) see more logins around noon than the days earlier in the week
(Monday, Tuesday). From Thursday evening through Saturday night, the login counts from evening until
midnight are higher than on the remaining days of the week. This shows that during weekdays there is
high demand for XYZ drivers around noon and from the late evening until midnight.
Weekend patterns:
On Friday, the login counts (demand) increase sharply from the evening until late Friday night (until
about 5:00am), then decline until about 6:00am on Saturday morning. The same pattern continues on
Saturday, with the highest frequency of the entire week at about 4:00am-5:00am. On Sunday night we
observe fewer logins (less demand) than on the other days of the week.
The login count pattern resembles the scenario of a busy city (or multiple cities), where people go to
work during weekdays and go out during weekends.
Figure 1.2: Time series plot of login counts (aggregated over weeks)
The difference between Figure 1.2 and Figure 1.3 is that in Figure 1.2 the login counts are
aggregated over 15-minute time intervals, whereas in Figure 1.3 the login counts are aggregated for each
hour. Figure 1.3 therefore reflects the hourly patterns and makes comparison across the various days of
the week easier.
Figure 1.3: Time series plot of hourly login counts (aggregated over weeks)
The matrix plot in Figure 1.4 shows the hourly login counts over the 7 days of the week. The cyan color
reflects high login counts whereas the cornsilk color reflects low login counts. From this plot it is
easy to read the demand for XYZ drivers during the various hours throughout the week.
Figure 1.4: Matrix Plot of Hourly Login Counts
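A sketch of how such a matrix plot can be built from the parsed logins vector above (the cornsilk-to-cyan ramp mirrors the colors described; raw counts are used here rather than weekly averages):

# Day-of-week x hour-of-day matrix of login counts
hours = as.integer(format(logins, "%H"))
days = factor(format(logins, "%a"),
              levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
m = table(days, hours)   # 7 x 24 matrix of counts

# Heatmap: low counts in cornsilk, high counts in cyan (as in Figure 1.4)
image(0:23, 1:7, t(unclass(m)), col = colorRampPalette(c("cornsilk", "cyan"))(25),
      xlab = "Hour of day", ylab = "", yaxt = "n")
axis(2, at = 1:7, labels = levels(days), las = 1)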
Q1. What is the key measure of success of this experiment in encouraging the driver partners to serve
both cities, and why would you choose this measure?
As the XYZ managers pay for the tolls, this investment must be recovered through the profit generated by
those driver partners who drive to the other city (and hence need to pay the toll). Therefore, a key metric
to measure the success of this experiment would be the average per-driver profit, net of reimbursed tolls,
generated when a driver partner makes trips to the other city.
Q2. Describe a practical experiment you would design to compare the effectiveness of the proposed
change in relation to the key measure of success. Please provide details on:
a) how you will implement the experiment,
i) During the first week, do not pay the toll fee and calculate the earnings from each
driver partner for the entire week. Supposing there are 50 drivers in each city, we
then have observations on 100 drivers, measuring how much each contributed to XYZ
through rides.
ii) During the second week, pay the toll fee and calculate the earnings from each driver
partner for the entire week. Here the earnings from a driver partner are obtained
after subtracting the toll fee. Again, from the 100 driver partners, we have 100
observations.
b) What statistical test(s) you will conduct to verify the significance of the observation:
If the experiment is effective, the per-driver earnings during the second week must be higher
than during the first week. To apply a statistical test of significance, we proceed under the
following assumptions:
i) XYZ's earnings from the driver partners are independent for both the Gotham and
Metropolis cities.
ii) During the second week, we expect there will be more driver partners available
in Gotham during the night and more driver partners in Metropolis during the day. The
increased availability of driver partners (if any) does not affect XYZ's earnings
from the existing driver partners (who are already in their respective city).
(If this assumption fails, the earnings are no longer independent.)
iii) The earnings of any driver partner have a normal distribution with finite (unknown)
mean and finite variance.
Under these assumptions, we can test the significance of the difference in average earnings per
driver partner between the first and second week. Let mu1 and mu2 be the true (unknown)
average earnings per driver during week one and week two respectively. Define H0 and H1 by:
H0: mu1 = mu2, i.e. implementing the toll-paying experiment does not improve the earnings per
driver partner in week 2 relative to week 1.
H1: mu2 > mu1, i.e. implementing the toll-paying experiment has significantly increased XYZ's
average earnings per driver partner in week 2 relative to week 1.
Let x1b and x2b be the average earnings during week one and week two respectively, and let s1
and s2 be the standard deviations of the earnings during week one and week two. Then
Z = (x2b - x1b) / sqrt(s1^2/n1 + s2^2/n2),
where n1 and n2 are the numbers of driver partners during the first and second week. Since I assume
50 drivers in each city, in this particular case n1 = n2 = 100.
The test statistic Z has a normal distribution under assumptions (i)-(iii); we reject H0 in favor
of H1 when Z exceeds the critical value for a given level of significance.
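A minimal sketch of this test in R, using hypothetical placeholder earnings (with n = 100 per week, t.test is effectively the Z test above):

# Per-driver weekly earnings (hypothetical placeholder data)
set.seed(42)
week1 = rnorm(100, mean = 500, sd = 80)   # week 1: no toll reimbursement
week2 = rnorm(100, mean = 530, sd = 80)   # week 2: tolls reimbursed

# Test statistic Z = (x2b - x1b) / sqrt(s1^2/n1 + s2^2/n2)
z = (mean(week2) - mean(week1)) / sqrt(var(week1)/100 + var(week2)/100)
p_value = pnorm(z, lower.tail = FALSE)    # one-sided p-value; reject H0 if < 0.05

# Equivalent large-sample check
t.test(week2, week1, alternative = "greater")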
c) how you would interpret the results and provide recommendations to the city operations team,
along with any caveats.
If the mean difference turns out to be statistically significant at a given level (say 5%), we would
conclude that the toll-paying experiment is effective and yields good business for XYZ. However, if the
mean difference is not significant, we cannot say whether the toll-paying experiment is effective for
the metric considered here.
Caveats:
a. Assumption (ii) may not hold in practice: if there are more drivers, it affects the business
of the existing drivers in a given area.
b. To ensure that the described statistical test is valid, one typically needs more than 30
driver partners; this may not be realistic if the cities Gotham and Metropolis are
very small.
c. Not all driver partners would like to go to the other city, and therefore XYZ's earnings per
driver partner may not be independent. This would violate assumption (i).
d. Assumption (iii) of normality may not hold either. However, if the number of driver partners
is large, this is not a concern.
Objective: The main objective is to predict rider retention. The analysis should also provide valuable
insights about the important factors for retention, and suggestions to operationalize those insights to help
XYZ.
Dataset: Rider information on various features, including city, signup date, and last trip date. The dataset
contains 50K observations on 12 variables.
Data Source: Provided as part of the data analysis challenge.
Summary of avg_rating_by_driver:
Min. :1.000   1st Qu.:4.700   Median :5.000   Mean :4.778   3rd Qu.:5.000   Max. :5.000   NA's :201
The data set contains missing values in a number of variables, including avg_rating_of_driver, phone,
and avg_rating_by_driver. Since the number of complete cases is still large (41,445) and the number
of variables is not large, we use the complete-cases data of 41,445 observations and 12 variables to
build the predictive model for retention. This retains about 83% of the observations for model
building.
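A sketch of the complete-cases step (udc is an assumed name for the raw data frame; udc.c matches the name used in the transformations below):

udc.c = udc[complete.cases(udc), ]   # keep rows with no missing values
nrow(udc.c)                          # 41,445 complete cases
nrow(udc.c) / nrow(udc)              # about 83% of the 50K observations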
To make the response categorical, we convert the response to 1 if the XYZ user has taken a ride in the
past month, and set it to zero otherwise. Next we do some exploratory data analysis to investigate the
variables in the data set.
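A sketch of this response definition (assuming last_trip_date is a date column and the data were pulled on the day of the latest trip in the file; the column name active is illustrative):

# Retained (active) = took a ride within 30 days of the data pull
pull_date = max(as.Date(udc.c$last_trip_date))
udc.c$active = as.integer(as.Date(udc.c$last_trip_date) >= pull_date - 30)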
From the summary and box plots we see that several variables are highly over-dispersed. In these
situations, to get a meaningful relationship, we transform the variables to reduce the over-dispersion.
Transformations:
udc.c.new = udc.c   # start from the complete-cases data
udc.c.new$trips_in_first_30_days = log(1 + udc.c$trips_in_first_30_days)
udc.c.new$avg_surge = log(udc.c$avg_surge)
udc.c.new$surge_pct = udc.c$surge_pct / 100
udc.c.new$avg_dist = log(1 + udc.c$avg_dist)
udc.c.new$weekday_pct = udc.c$weekday_pct / 100
Also, as most of the surge_pct values are zero and its distribution is skewed, it is good practice to
convert it to a categorical variable. The transformation is:
udc.c.new$surge_pct_f = rep(0, length(udc.c.new$surge_pct))
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0 & udc.c.new$surge_pct <= 0.10] = 1
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0.10 & udc.c.new$surge_pct <= 0.25] = 2
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0.25 & udc.c.new$surge_pct <= 0.50] = 3
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0.50 & udc.c.new$surge_pct <= 0.80] = 4
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0.80] = 5
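The same coding can be produced in one step with cut() (a sketch; the breaks mirror the thresholds above):

# Codes 0-5 for (-Inf,0], (0,0.1], (0.1,0.25], (0.25,0.5], (0.5,0.8], (0.8,Inf)
udc.c.new$surge_pct_f = as.integer(cut(udc.c.new$surge_pct,
  breaks = c(-Inf, 0, 0.10, 0.25, 0.50, 0.80, Inf))) - 1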
Next we check the relationship between the response and the predictor variables. As the response is
binary and a raw scatter plot would not be very informative, we group the mean value of the response over
the range of each predictor variable and plot it. Below are some plots:
Both the plots of i) grouped response vs. average rating by driver and ii) grouped response vs. average
rating of driver are increasing. We expect the plots to look more S-shaped over the range of the predictor
variables; however, since there are few grouped points in each plot, this is hard to discern. The plot of
grouped response vs. surge_pct is scattered and does not show a clear S-shaped pattern. Therefore,
transforming the surge_pct variable to a categorical variable is the better idea.
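The grouped-response plots can be produced with a small helper of this form (the function name is illustrative; active is the binary response defined earlier and carried into udc.c.new):

# Mean response within quantile bins of a continuous predictor
binned_response_plot = function(x, y, n_bins = 20) {
  breaks = unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1), na.rm = TRUE))
  bins = cut(x, breaks = breaks, include.lowest = TRUE)
  plot(tapply(x, bins, mean), tapply(y, bins, mean),
       xlab = "Predictor (bin mean)", ylab = "Mean response")
}

binned_response_plot(udc.c.new$avg_rating_by_driver, udc.c.new$active)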
Multicollinearity is an important check before model building, as it helps to exclude problematic
variables from the model. The table below shows the correlation analysis between the continuous variables
in the data set.
Table 3.1: Correlation Matrix
There is no multicollinearity among the variables. Another popular check is to examine the eigenvalues
of X'X, where X is the data matrix containing the above variables. The eigenvalues are bounded below by
a positive constant, which suggests that the data do not have a multicollinearity issue.
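Both checks can be sketched in R as follows (the set of continuous columns is assumed from the analysis above):

# Correlation matrix of the continuous predictors (Table 3.1)
num_vars = udc.c.new[, c("trips_in_first_30_days", "avg_surge", "avg_dist",
                         "weekday_pct", "avg_rating_by_driver")]
round(cor(num_vars), 2)

# Eigenvalues of X'X for the standardized data matrix X;
# eigenvalues near zero would indicate multicollinearity
X = scale(as.matrix(num_vars))
eigen(crossprod(X))$values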
Modeling:
Next we start with univariate modeling to assess each predictor's significance. We model the response
by means of logistic regression.
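Each univariate fit below comes from a call of the following form (a sketch; factor columns such as city are assumed to be coded to match the output):

# Univariate logistic regressions, one predictor at a time
summary(glm(active ~ 1, data = udc.c.new, family = binomial))      # 0. intercept model
summary(glm(active ~ city, data = udc.c.new, family = binomial))   # 1. City
summary(glm(active ~ trips_in_first_30_days, data = udc.c.new,
            family = binomial))                                    # 3. trips_in_first_30_days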
###################################################
# 0. intercept model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.40411 0.01228 -32.91 <2e-16 ***
# 1. City
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.78487 0.01654 -47.46 <2e-16 ***
XYZ_black 0.94321 0.02554 36.94 <2e-16 ***
# 3. trips_in_first_30_days
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.91875 0.02034 -45.17 <2e-16 ***
trips_in_first_30_days 0.53572 0.01634 32.78 <2e-16 ***
# 4. avg_surge
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.42132 0.01348 -31.256 < 2e-16 ***
avg_surge 0.28358 0.09097 3.117 0.00182 **
# 5. Avg_dist
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.14566 0.03583 -4.065 4.81e-05 ***
avg_dist -0.15634 0.02044 -7.650 2.01e-14 ***
# 6. avg_rating_by_driver
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.57446 0.14474 3.969 7.22e-05 ***
avg_rating_by_driver -0.20499 0.03022 -6.783 1.18e-11 ***
# 7. Weekday_pct
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.48741 0.02505 -19.456 < 2e-16 ***
weekday_pct 0.13556 0.03544 3.826 0.00013 ***
The univariate analysis shows that the predictors are significant at the liberal threshold of p = 0.25,
which we chose to decide which variables to include in the full model. The full model is then fitted with
all of these predictors.
The avg_rating_of_driver and weekday_pct terms appear insignificant at the 5% level, so we first drop
avg_rating_of_driver from the model. The reduced model is:
# reduced model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.18561 0.17763 1.045 0.2960
City_AK -1.80297 0.04151 -43.430 < 2e-16 ***
City_WK -1.17263 0.03775 -31.065 < 2e-16 ***
trips_in_first_30_days 0.16163 0.02033 7.952 1.84e-15 ***
avg_rating_by_driver -0.26711 0.03489 -7.656 1.91e-14 ***
avg_surge -1.16108 0.22670 -5.122 3.03e-07 ***
Phone_i 1.17436 0.03482 33.726 < 2e-16 ***
surge_pct_10 2.16588 0.05912 36.638 < 2e-16 ***
surge_pct_20 1.48979 0.05172 28.804 < 2e-16 ***
surge_pct_30 0.70664 0.07835 9.019 < 2e-16 ***
surge_pct_40 0.86224 0.19423 4.439 9.03e-06 ***
surge_pct_50 -0.19101 0.18743 -1.019 0.3082
XYZ_black 0.91520 0.02983 30.677 < 2e-16 ***
weekday_pct 0.08317 0.04296 1.936 0.0529 .
avg_dist -0.05158 0.02451 -2.105 0.0353 *
Next we include interaction terms one by one. Based on intuition, we consider the following pairs of
interactions:
#1. avg_surge and XYZ_black
#2. surge_pct and weekend_pct, where weekend_pct = 1 - weekday_pct
#3. avg_dist and city
#4. trips_in_first_30_days and weekend_pct
#5. trips_in_first_30_days and avg_dist
Among all these interaction terms, only #1 (avg_surge and XYZ_black) is statistically significant. The
final model is:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.18432 0.17407 1.059 0.28964
City_AK -1.80107 0.04146 -43.442 < 2e-16 ***
City_WK -1.17518 0.03771 -31.166 < 2e-16 ***
trips_in_first_30_days 0.16649 0.02023 8.228 < 2e-16 ***
avg_rating_by_driver -0.27150 0.03472 -7.821 5.25e-15 ***
avg_surge -1.40830 0.24813 -5.676 1.38e-08 ***
Phone_i 1.17439 0.03482 33.725 < 2e-16 ***
surge_pct_10 2.16863 0.05907 36.714 < 2e-16 ***
surge_pct_20 1.48682 0.05170 28.756 < 2e-16 ***
surge_pct_30 0.70847 0.07839 9.038 < 2e-16 ***
surge_pct_40 0.88976 0.19579 4.545 5.51e-06 ***
surge_pct_50 -0.14188 0.18859 -0.752 0.45188
XYZ_black 0.87760 0.03270 26.837 < 2e-16 ***
avg_surge_int_XYZ_black 0.68655 0.26557 2.585 0.00973 **
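In R, this final model corresponds to a fit of roughly the following form (a sketch; the dummy codings of city, phone, and surge_pct_f are assumed to match the output above):

# Final logistic regression with the avg_surge x XYZ_black interaction
fit_final = glm(active ~ city + trips_in_first_30_days + avg_rating_by_driver +
                  avg_surge * XYZ_black + phone + factor(surge_pct_f),
                data = udc.c.new, family = binomial)
summary(fit_final)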
Summary:
i) The model has the lowest AIC value among all the models considered; a model with lower AIC is
preferred.
ii) All the retained variables are highly statistically significant, with very low p-values.
iii) When surge_pct is included as a continuous variable, the prediction performance of the model
on the validation set drops by 4%; this indicates that surge_pct should be modeled as
categorical.
iv) The model is robust: after trying different training samples, the model retains the same
predictors and the change in the coefficient estimates is very small (we expect little change as
the training sample changes). The prediction performance on the validation data is also similar.
We use the fitted model to predict the response for the validation data given the predictors. The
misclassification error, based on 20 runs, is 26.02% with a standard deviation of 0.4978%.
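A sketch of one validation run (the 80/20 split is an assumption; the error is the fraction of misclassified users at a 0.5 probability cutoff):

# Train/validation split and misclassification error for one run
set.seed(1)
idx = sample(nrow(udc.c.new), floor(0.8 * nrow(udc.c.new)))
train = udc.c.new[idx, ]
valid = udc.c.new[-idx, ]

fit = update(fit_final, data = train)   # refit the final model on the training split
pred = as.integer(predict(fit, newdata = valid, type = "response") > 0.5)
mean(pred != valid$active)              # misclassification error for this run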
Q. We would like you to use this data set to help understand what factors are the best predictors for
retention?
Among the predictors, City has the largest z-value, followed by surge_pct, Phone, XYZ_black,
avg_rating_by_driver, trips_in_first_30_days, and avg_surge. All of these variables are statistically
significant with low p-values and are therefore the best predictors of retention. The interaction term of
avg_surge and XYZ_black is also statistically significant and is an important variable for predicting
retention.
Q. Briefly discuss how XYZ might leverage the insights gained from the model to improve
its long-term rider retention (again, a few sentences will suffice).
i) From the above table, the highest odds ratio is for surge_pct_10, which is coded as 1 if
surge_pct falls in (0, 0.1] = (0%, 10%], and 0 otherwise. The odds ratio of surge_pct_10 is
relative to surge_pct = 0%. This says that if a user takes (0-10%] of rides when there is surge
pricing, he/she is about 8.746 times more likely to be retained than a user who never rides at a
surge price (see the odds-ratio sketch after this list). The decreasing odds ratios across the
surge_pct categories show that if a user rides more often under surge pricing, he/she is less
likely to have traveled in the last month.
ii) The odds ratio of XYZ_black is 2.405: if the user is an XYZ_black user, he/she is 2.405 times
more likely to have traveled in the last month than a non-XYZ_black user, all other variables
held fixed.
iii) From the analysis we can say that a user is more likely to have used XYZ in the last month if
he/she:
a. uses an iPhone,
b. is an XYZ_black user,
c. is from King's Landing,
d. has taken between 0% and 10% of rides under surge pricing,
e. took more trips during the first 30 days.
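The odds ratios quoted above are the exponentiated coefficients of the final model:

exp(coef(fit_final))   # odds ratios for all terms
exp(2.16863)           # surge_pct_10: about 8.746
exp(0.87760)           # XYZ_black:    about 2.405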
Astapor seems to be a less profitable market than King's Landing. Comparing the
avg_surge across the three cities, we see that the avg_surge in Astapor is 6.8%, the highest, compared
to 6.1% in King's Landing and 5.4% in Winterfell. As the coefficient of avg_surge is negative in the model,
it might be a good idea to lower the surge in Astapor, or keep it at the same level as King's Landing, to
increase user retention. The analysis also shows that an iPhone user is 3.2 times more likely to have used
XYZ in the last month than an Android user. It might be a good idea to offer better rates to improve the
retention of Android users.