XYZ Data Analysis Report
The attached logins.json file contains (simulated) timestamps of user logins in a particular
geographic location. Aggregate these login counts based on 15-minute time intervals, and
visualize and describe the resulting time series of login counts in ways that best
characterize the underlying patterns of the demand. Please report/illustrate important
features of the demand such as daily cycles. If there are data quality issues, please report
them.
The file logins.json contains 93,142 observations, with the first login observed at "1970-01-01 20:12:16" and
the last at "1970-04-13 18:57:38". Figure 1.1 shows the time series plot of login counts, aggregated over
15-minute time intervals. The maximum frequency of login counts occurs on March 1st, 1970 (a Sunday).
The plot shows a seasonal pattern in the login count frequency over the 15-week period, highest towards
the weekend.
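The aggregation itself can be done with a short R sketch along these lines (assuming logins.json holds a flat array of timestamp strings; the jsonlite package is used for parsing):

library(jsonlite)

# Parse the login timestamps and bin them into 15-minute intervals
logins = as.POSIXct(fromJSON("logins.json"), tz = "UTC")
bins = cut(logins, breaks = "15 mins")
counts = as.data.frame(table(bins))   # login counts per 15-minute interval

# Time series plot of the binned counts (as in Figure 1.1)
plot(as.POSIXct(as.character(counts$bins), tz = "UTC"), counts$Freq,
     type = "l", xlab = "Time", ylab = "Logins per 15 minutes")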
Figure 1.2 shows the login counts aggregated over 15-minute time intervals, averaged over all weeks.
Weekday patterns:
Starting from Sunday midnight, the frequency goes down from midnight until the early hours (about
5:00am), rises until about noon, decreases until about 4:00pm, and then increases from about 4:00pm to
midnight. There is a daily seasonal pattern with two modes: a maximum around noon and a second mode
around midnight. There is a positive trend in the login counts over the weekdays: the days later in the
week (Wednesday, Thursday, and Friday) see more logins around noon than the days earlier in the week
(Monday, Tuesday). From Thursday evening through Saturday night, the login counts from evening until
midnight are higher than on the remaining days of the week. This shows that during weekdays there is
high demand for XYZ drivers around noon and from the late evening until midnight.
Weekend patterns:
On Friday, the login counts (demand) increase sharply from the evening until late Friday night (until
about 5:00am), then decline until about 6:00am on Saturday morning. The same pattern continues on
Saturday, with the highest frequency of the entire week at about 4:00am-5:00am. On Sunday night we
observe fewer logins (less demand) than on the other days of the week.
The login count pattern resembles the scenario of a busy city (or multiple cities), where people go to
work during weekdays and go out during weekends.
Figure 1.2: Time series plot of login counts (aggregated over weeks)
The difference between Figure 1.2 and Figure 1.3 is that in Figure 1.2 the login counts are
aggregated over 15-minute time intervals, whereas in Figure 1.3 the login counts are aggregated for each
hour. Figure 1.3 therefore reflects the hourly patterns and makes comparison across the various days of
the week easier.
Figure 1.3: Time series plot of hourly login counts (aggregated over weeks)
The matrix plot in Figure 1.4 shows the hourly login counts over the 7 days of the week. The cyan color
reflects high login counts whereas the cornsilk color reflects low login counts. From this plot it is
easy to read the demand for XYZ drivers during the various hours throughout the week.
Figure 1.4: Matrix Plot of Hourly Login Counts
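A sketch of how such a matrix plot can be built from the parsed logins vector above (the cornsilk-to-cyan ramp mirrors the colors described; raw counts are used here rather than weekly averages):

# Day-of-week x hour-of-day matrix of login counts
hours = as.integer(format(logins, "%H"))
days = factor(format(logins, "%a"),
              levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
m = table(days, hours)   # 7 x 24 matrix of counts

# Heatmap: low counts in cornsilk, high counts in cyan (as in Figure 1.4)
image(0:23, 1:7, t(unclass(m)), col = colorRampPalette(c("cornsilk", "cyan"))(25),
      xlab = "Hour of day", ylab = "", yaxt = "n")
axis(2, at = 1:7, labels = levels(days), las = 1)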
Q1. What is the key measure of success of this experiment in encouraging the driver partners to serve
both cities, and why would you choose this measure?
As the XYZ managers pay for the tolls, this investment must be recovered through the profit generated by
those driver partners who drive to the other city (and hence need to pay the toll). Therefore, a key metric
to measure the success of this experiment would be the average per-driver profit, net of reimbursed tolls,
generated when a driver partner makes trips to the other city.
Q2. Describe a practical experiment you would design to compare the effectiveness of the proposed
change in relation to the key measure of success. Please provide details on:
a) how you will implement the experiment,
i) During the first week, do not pay the toll fee and calculate the earnings from each
driver partner for the entire week. Supposing there are 50 drivers in each city, we
then have observations on 100 drivers, measuring how much each contributed to XYZ
through rides.
ii) During the second week, pay the toll fee and calculate the earnings from each driver
partner for the entire week. Here the earnings from a driver partner are obtained
after subtracting the toll fee. Again, from the 100 driver partners, we have 100
observations.
b) What statistical test(s) you will conduct to verify the significance of the observation:
If the experiment is effective, the per-driver earnings during the second week must be higher
than during the first week. To apply a statistical test of significance, we proceed under the
following assumptions:
i) XYZ's earnings from the driver partners are independent for both the Gotham and
Metropolis cities.
ii) During the second week, we expect there will be more driver partners available
in Gotham during the night and more driver partners in Metropolis during the day. The
increased availability of driver partners (if any) does not affect XYZ's earnings
from the existing driver partners (who are already in their respective city).
(If this assumption fails, the earnings are no longer independent.)
iii) The earnings of any driver partner have a normal distribution with finite (unknown)
mean and finite variance.
Under these assumptions, we can test the significance of the difference in average earnings per
driver partner between the first and second week. Let mu1 and mu2 be the true (unknown)
average earnings per driver during week one and week two respectively. Define H0 and H1 by:
H0: mu1 = mu2, i.e. implementing the toll-paying experiment does not improve the earnings per
driver partner in week 2 relative to week 1.
H1: mu2 > mu1, i.e. implementing the toll-paying experiment has significantly increased XYZ's
average earnings per driver partner in week 2 relative to week 1.
Let x1b and x2b be the average earnings during week one and week two respectively, and let s1
and s2 be the standard deviations of the earnings during week one and week two. Then
Z = (x2b - x1b) / sqrt(s1^2/n1 + s2^2/n2),
where n1 and n2 are the numbers of driver partners during the first and second week. Since I assume
50 drivers in each city, in this particular case n1 = n2 = 100.
The test statistic Z has a normal distribution under assumptions (i)-(iii); we reject H0 in favor
of H1 when Z exceeds the critical value for a given level of significance.
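A minimal sketch of this test in R, using hypothetical placeholder earnings (with n = 100 per week, t.test is effectively the Z test above):

# Per-driver weekly earnings (hypothetical placeholder data)
set.seed(42)
week1 = rnorm(100, mean = 500, sd = 80)   # week 1: no toll reimbursement
week2 = rnorm(100, mean = 530, sd = 80)   # week 2: tolls reimbursed

# Test statistic Z = (x2b - x1b) / sqrt(s1^2/n1 + s2^2/n2)
z = (mean(week2) - mean(week1)) / sqrt(var(week1)/100 + var(week2)/100)
p_value = pnorm(z, lower.tail = FALSE)    # one-sided p-value; reject H0 if < 0.05

# Equivalent large-sample check
t.test(week2, week1, alternative = "greater")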
c) how you would interpret the results and provide recommendations to the city operations team,
along with any caveats.
If the mean difference turns out to be statistically significant at a given level (say 5%), we would
conclude that the toll-paying experiment is effective and yields good business for XYZ. However, if the
mean difference is not significant, we cannot say whether the toll-paying experiment is effective for
the metric considered here.
Caveats:
a. Assumption (ii) may not hold in practice: if there are more drivers, it affects the business
of the existing drivers in a given area.
b. To ensure that the described statistical test is valid, one typically needs more than 30
driver partners; this may not be realistic if the cities Gotham and Metropolis are
very small.
c. Not all driver partners would like to go to the other city, and therefore XYZ's earnings per
driver partner may not be independent. This would violate assumption (i).
d. Assumption (iii) of normality may not hold either. However, if the number of driver partners
is large, this is not a concern.
Objective: The main objective is to predict rider retention. The analysis should also provide valuable
insights about the important factors for retention, and suggestions to operationalize those insights to help
XYZ.
Dataset: Rider information on various features, including city, signup date, and last trip date. The dataset
contains 50K observations on 12 variables.
Data Source: Provided as part of the data analysis challenge.
Summary of avg_rating_by_driver:
Min. :1.000   1st Qu.:4.700   Median :5.000   Mean :4.778   3rd Qu.:5.000   Max. :5.000   NA's :201
The data set contains missing values in a number of variables, including avg_rating_of_driver, phone,
and avg_rating_by_driver. Since the number of complete cases is still large (41,445) and the number
of variables is not large, we use the complete-cases data of 41,445 observations and 12 variables to
build the predictive model for retention. This retains about 83% of the observations for model
building.
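A sketch of the complete-cases step (udc is an assumed name for the raw data frame; udc.c matches the name used in the transformations below):

udc.c = udc[complete.cases(udc), ]   # keep rows with no missing values
nrow(udc.c)                          # 41,445 complete cases
nrow(udc.c) / nrow(udc)              # about 83% of the 50K observations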
To make the response categorical, we convert the response to 1 if the XYZ user has taken a ride in the
past month, and set it to zero otherwise. Next we do some exploratory data analysis to investigate the
variables in the data set.
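A sketch of this response definition (assuming last_trip_date is a date column and the data were pulled on the day of the latest trip in the file; the column name active is illustrative):

# Retained (active) = took a ride within 30 days of the data pull
pull_date = max(as.Date(udc.c$last_trip_date))
udc.c$active = as.integer(as.Date(udc.c$last_trip_date) >= pull_date - 30)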
From the summary and box plots we see that several variables are highly over-dispersed. In these
situations, to get a meaningful relationship, we transform the variables to reduce the over-dispersion.
Transformations:
udc.c.new = udc.c   # start from the complete-cases data
udc.c.new$trips_in_first_30_days = log(1 + udc.c$trips_in_first_30_days)
udc.c.new$avg_surge = log(udc.c$avg_surge)
udc.c.new$surge_pct = udc.c$surge_pct / 100
udc.c.new$avg_dist = log(1 + udc.c$avg_dist)
udc.c.new$weekday_pct = udc.c$weekday_pct / 100
Also, as most of the surge_pct values are zero and its distribution is skewed, it is good practice to
convert it to a categorical variable. The transformation is:
udc.c.new$surge_pct_f = rep(0, length(udc.c.new$surge_pct))
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0 & udc.c.new$surge_pct <= 0.10] = 1
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0.10 & udc.c.new$surge_pct <= 0.25] = 2
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0.25 & udc.c.new$surge_pct <= 0.50] = 3
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0.50 & udc.c.new$surge_pct <= 0.80] = 4
udc.c.new$surge_pct_f[udc.c.new$surge_pct > 0.80] = 5
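The same coding can be produced in one step with cut() (a sketch; the breaks mirror the thresholds above):

# Codes 0-5 for (-Inf,0], (0,0.1], (0.1,0.25], (0.25,0.5], (0.5,0.8], (0.8,Inf)
udc.c.new$surge_pct_f = as.integer(cut(udc.c.new$surge_pct,
  breaks = c(-Inf, 0, 0.10, 0.25, 0.50, 0.80, Inf))) - 1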
Next we check the relationship between the response and the predictor variables. As the response is
binary and a raw scatter plot would not be very informative, we group the mean value of the response over
the range of each predictor variable and plot it. Below are some plots:
Both the plots of i) grouped response vs. average rating by driver and ii) grouped response vs. average
rating of driver are increasing. We expect the plots to look more S-shaped over the range of the predictor
variables; however, since there are few grouped points in each plot, this is hard to discern. The plot of
grouped response vs. surge_pct is scattered and does not show a clear S-shaped pattern. Therefore,
transforming the surge_pct variable to a categorical variable is the better idea.
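The grouped-response plots can be produced with a small helper of this form (the function name is illustrative; active is the binary response defined earlier and carried into udc.c.new):

# Mean response within quantile bins of a continuous predictor
binned_response_plot = function(x, y, n_bins = 20) {
  breaks = unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1), na.rm = TRUE))
  bins = cut(x, breaks = breaks, include.lowest = TRUE)
  plot(tapply(x, bins, mean), tapply(y, bins, mean),
       xlab = "Predictor (bin mean)", ylab = "Mean response")
}

binned_response_plot(udc.c.new$avg_rating_by_driver, udc.c.new$active)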
Multicollinearity is an important check before model building, as it helps to exclude problematic
variables from the model. The table below shows the correlation analysis between the continuous variables
in the data set.
Table 3.1: Correlation Matrix
There is no multicollinearity among the variables. Another popular check is to examine the eigenvalues
of X'X, where X is the data matrix containing the above variables. The eigenvalues are bounded below by
a positive constant, which suggests that the data do not have a multicollinearity issue.
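Both checks can be sketched in R as follows (the set of continuous columns is assumed from the analysis above):

# Correlation matrix of the continuous predictors (Table 3.1)
num_vars = udc.c.new[, c("trips_in_first_30_days", "avg_surge", "avg_dist",
                         "weekday_pct", "avg_rating_by_driver")]
round(cor(num_vars), 2)

# Eigenvalues of X'X for the standardized data matrix X;
# eigenvalues near zero would indicate multicollinearity
X = scale(as.matrix(num_vars))
eigen(crossprod(X))$values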
Modeling:
Next we start with univariate modeling to assess each predictor's significance. We model the response
by means of logistic regression.
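Each univariate fit below comes from a call of the following form (a sketch; factor columns such as city are assumed to be coded to match the output):

# Univariate logistic regressions, one predictor at a time
summary(glm(active ~ 1, data = udc.c.new, family = binomial))      # 0. intercept model
summary(glm(active ~ city, data = udc.c.new, family = binomial))   # 1. City
summary(glm(active ~ trips_in_first_30_days, data = udc.c.new,
            family = binomial))                                    # 3. trips_in_first_30_days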
###################################################
# 0. intercept model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.40411 0.01228 -32.91 <2e-16 ***
# 1. City
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.78487 0.01654 -47.46 <2e-16 ***
XYZ_black 0.94321 0.02554 36.94 <2e-16 ***
# 3. trips_in_first_30_days
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.91875 0.02034 -45.17 <2e-16 ***
trips_in_first_30_days 0.53572 0.01634 32.78 <2e-16 ***
# 4. avg_surge
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.42132 0.01348 -31.256 < 2e-16 ***
avg_surge 0.28358 0.09097 3.117 0.00182 **
# 5. Avg_dist
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.14566 0.03583 -4.065 4.81e-05 ***
avg_dist -0.15634 0.02044 -7.650 2.01e-14 ***
# 6. avg_rating_by_driver
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.57446 0.14474 3.969 7.22e-05 ***
avg_rating_by_driver -0.20499 0.03022 -6.783 1.18e-11 ***
# 7. Weekday_pct
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.48741 0.02505 -19.456 < 2e-16 ***
weekday_pct 0.13556 0.03544 3.826 0.00013 ***
The univariate analysis shows that the predictors are significant at the liberal threshold of p = 0.25,
which we chose to decide which variables to include in the full model. The full model is then fitted with
all of these predictors.
The avg_rating_of_driver and weekday_pct terms appear insignificant at the 5% level, so we first drop
avg_rating_of_driver from the model. The reduced model is:
# reduced model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.18561 0.17763 1.045 0.2960
City_AK -1.80297 0.04151 -43.430 < 2e-16 ***
City_WK -1.17263 0.03775 -31.065 < 2e-16 ***
trips_in_first_30_days 0.16163 0.02033 7.952 1.84e-15 ***
avg_rating_by_driver -0.26711 0.03489 -7.656 1.91e-14 ***
avg_surge -1.16108 0.22670 -5.122 3.03e-07 ***
Phone_i 1.17436 0.03482 33.726 < 2e-16 ***
surge_pct_10 2.16588 0.05912 36.638 < 2e-16 ***
surge_pct_20 1.48979 0.05172 28.804 < 2e-16 ***
surge_pct_30 0.70664 0.07835 9.019 < 2e-16 ***
surge_pct_40 0.86224 0.19423 4.439 9.03e-06 ***
surge_pct_50 -0.19101 0.18743 -1.019 0.3082
XYZ_black 0.91520 0.02983 30.677 < 2e-16 ***
weekday_pct 0.08317 0.04296 1.936 0.0529 .
avg_dist -0.05158 0.02451 -2.105 0.0353 *
Next we include interaction terms one by one. Based on intuition, we consider the following pairs of
interactions:
#1. avg_surge and XYZ_black
#2. surge_pct and weekend_pct, where weekend_pct = 1 - weekday_pct
#3. avg_dist and city
#4. trips_in_first_30_days and weekend_pct
#5. trips_in_first_30_days and avg_dist
Among all these interaction terms, only #1 (avg_surge and XYZ_black) is statistically significant. The
final model is:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.18432 0.17407 1.059 0.28964
City_AK -1.80107 0.04146 -43.442 < 2e-16 ***
City_WK -1.17518 0.03771 -31.166 < 2e-16 ***
trips_in_first_30_days 0.16649 0.02023 8.228 < 2e-16 ***
avg_rating_by_driver -0.27150 0.03472 -7.821 5.25e-15 ***
avg_surge -1.40830 0.24813 -5.676 1.38e-08 ***
Phone_i 1.17439 0.03482 33.725 < 2e-16 ***
surge_pct_10 2.16863 0.05907 36.714 < 2e-16 ***
surge_pct_20 1.48682 0.05170 28.756 < 2e-16 ***
surge_pct_30 0.70847 0.07839 9.038 < 2e-16 ***
surge_pct_40 0.88976 0.19579 4.545 5.51e-06 ***
surge_pct_50 -0.14188 0.18859 -0.752 0.45188
XYZ_black 0.87760 0.03270 26.837 < 2e-16 ***
avg_surge_int_XYZ_black 0.68655 0.26557 2.585 0.00973 **
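In R, this final model corresponds to a fit of roughly the following form (a sketch; the dummy codings of city, phone, and surge_pct_f are assumed to match the output above):

# Final logistic regression with the avg_surge x XYZ_black interaction
fit_final = glm(active ~ city + trips_in_first_30_days + avg_rating_by_driver +
                  avg_surge * XYZ_black + phone + factor(surge_pct_f),
                data = udc.c.new, family = binomial)
summary(fit_final)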
Summary:
i) The model has the lowest AIC value among all the models considered; a model with lower AIC is
preferred.
ii) All the retained variables are highly statistically significant, with very low p-values.
iii) When surge_pct is included as a continuous variable, the prediction performance of the model
on the validation set drops by 4%; this indicates that surge_pct should be modeled as
categorical.
iv) The model is robust: after trying different training samples, the model retains the same
predictors and the change in the coefficient estimates is very small (we expect little change as
the training sample changes). The prediction performance on the validation data is also similar.
We use the fitted model to predict the response for the validation data given the predictors. The
misclassification error, based on 20 runs, is 26.02% with a standard deviation of 0.4978%.
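A sketch of one validation run (the 80/20 split is an assumption; the error is the fraction of misclassified users at a 0.5 probability cutoff):

# Train/validation split and misclassification error for one run
set.seed(1)
idx = sample(nrow(udc.c.new), floor(0.8 * nrow(udc.c.new)))
train = udc.c.new[idx, ]
valid = udc.c.new[-idx, ]

fit = update(fit_final, data = train)   # refit the final model on the training split
pred = as.integer(predict(fit, newdata = valid, type = "response") > 0.5)
mean(pred != valid$active)              # misclassification error for this run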
Q. We would like you to use this data set to help understand what factors are the best predictors for
retention?
Among the predictors, City has the largest z-value, followed by surge_pct, Phone, XYZ_black,
avg_rating_by_driver, trips_in_first_30_days, and avg_surge. All of these variables are statistically
significant with low p-values and are therefore the best predictors of retention. The interaction term of
avg_surge and XYZ_black is also statistically significant and is an important variable for predicting
retention.
Q. Briefly discuss how XYZ might leverage the insights gained from the model to improve
its long-term rider retention (again, a few sentences will suffice).
i) From the above table, the highest odds ratio is for surge_pct_10, which is coded as 1 if
surge_pct falls in (0, 0.1] = (0%, 10%], and 0 otherwise. The odds ratio of surge_pct_10 is
relative to surge_pct = 0%. This says that if a user takes (0-10%] of rides when there is surge
pricing, he/she is about 8.746 times more likely to be retained than a user who never rides at a
surge price (see the odds-ratio sketch after this list). The decreasing odds ratios across the
surge_pct categories show that if a user rides more often under surge pricing, he/she is less
likely to have traveled in the last month.
ii) The odds ratio of XYZ_black is 2.405: if the user is an XYZ_black user, he/she is 2.405 times
more likely to have traveled in the last month than a non-XYZ_black user, all other variables
held fixed.
iii) From the analysis we can say that a user is more likely to have used XYZ in the last month if
he/she:
a. uses an iPhone,
b. is an XYZ_black user,
c. is from King's Landing,
d. has taken between 0% and 10% of rides under surge pricing,
e. took more trips during the first 30 days.
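The odds ratios quoted above are the exponentiated coefficients of the final model:

exp(coef(fit_final))   # odds ratios for all terms
exp(2.16863)           # surge_pct_10: about 8.746
exp(0.87760)           # XYZ_black:    about 2.405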
Astapor seems to be a less profitable market than King's Landing. Comparing the
avg_surge across the three cities, we see that the avg_surge in Astapor is 6.8%, the highest, compared
to 6.1% in King's Landing and 5.4% in Winterfell. As the coefficient of avg_surge is negative in the model,
it might be a good idea to lower the surge in Astapor, or keep it at the same level as King's Landing, to
increase user retention. The analysis also shows that an iPhone user is 3.2 times more likely to have used
XYZ in the last month than an Android user. It might be a good idea to offer better rates to improve the
retention of Android users.