Data - Analysis 2
1 Exercise 1
2 Exercise 2
3 Exercise 3
4 Exercise 4
4.1 Model selection
4.2 Results and Interpretations
4.3 Comparison
5 Exercise 5
6 Exercise 6
6.1 Motivation
6.2 Results
6.3 Interpretations and comments
6.3.1 Intercept
6.3.2 Non-categorical variables
6.3.3 Season categories
6.3.4 Weekday categories
6.3.5 Holiday categories
6.3.6 Weathersit categories
6.3.7 Performance
7 Exercise 7
7.1 Approach
7.2 Results
7.3 Interpretations and comments
7.3.1 Predicted parameter values
7.3.2 Predicted numbers of daily bike rentals
8 Exercise 8
8.1 Approach
8.2 Results
8.3 Interpretations and comments
8.3.1 OLS estimates
8.3.2 Performance
1 Exercise 1
Univariate summaries such as histograms are useful to describe the distribution of the two variables,
but not their relationship, which is why we omit these tools. To illustrate the relationship between the
total number of bikes rented (cnt) and temperature (temp), we will use a scatterplot, which is given at
the end of this section, as a graphical tool and the sample covariance as well as the sample correlation
as descriptive statistics. The sample covariance of the two variables turns out to be approximately
9475 > 0, which indicates a positive relationship between the temperature and the total number of
bikes rented, but it is rather hard to interpret this quantity since it is not standardized. This is why
we turn to the sample correlation of the two variables, which equals approximately 0.77. This is again
a positive number, which implies a positive relationship between the two variables under examination,
but since the correlation is a unit-free measure between -1 and 1, it also indicates the strength of
the linear relationship between the two variables. In this case, the obtained sample correlation tells
us that there is a strong positive relationship between temperature and total daily bike rentals. This
is visible when examining the scatterplot: the cloud of points has a clear positive slope, which explains
the positive correlation (and positive covariance), but the points are quite dispersed, which explains
why the correlation is not closer to 1. Based on these observations and quantities, we conclude that,
in our data set, higher temperatures are associated with more bikes rented.
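As an illustration of why the covariance is hard to interpret while the correlation is not, the following Python sketch computes both statistics by hand. The report's own numbers were obtained in R; the values of temp and cnt below are hypothetical toy numbers, not the bike data, and the function names are our own.

```python
from math import sqrt

def sample_covariance(x, y):
    """Sample covariance with the usual (n - 1) denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Sample correlation: covariance divided by both standard deviations."""
    return sample_covariance(x, y) / sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )

# Hypothetical toy values, NOT taken from the bike rental data set.
temp = [2.0, 8.0, 14.0, 20.0, 26.0]
cnt = [1500.0, 2600.0, 3100.0, 4500.0, 4800.0]

cov = sample_covariance(temp, cnt)
corr = sample_correlation(temp, cnt)

# Measuring temperature in tenths of a degree multiplies the covariance by 10
# but leaves the correlation unchanged.
temp_rescaled = [10 * t for t in temp]
cov_rescaled = sample_covariance(temp_rescaled, cnt)
corr_rescaled = sample_correlation(temp_rescaled, cnt)
```

Rescaling one variable changes the covariance by the same factor, whereas the correlation stays the same, which is the sense in which only the latter is standardized.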
Figure 1: Scatterplot of variables "cnt" and "temp" (horizontal axis: temp)
2 Exercise 2
The OLS estimates of the regression coefficients turn out to be $\hat{\beta}_0 = 1630.99$ and
$\hat{\beta}_1 = 119.33$. This means that the total number of bikes rented, given that the temperature
is 0 degrees, is predicted to be approximately 1631. Furthermore, a one unit (degree) increase in
temperature is associated with an increase in the total number of (predicted) rentals of approximately
119 units (bikes). The fact that the estimated coefficient of the regressor has a positive value also
indicates that there is indeed a positive relationship between temperature and the total number of
bikes rented, as we concluded in the previous section. In addition, the value of R-squared associated
with our estimated model, which equals 0.5948, indicates that temperature explains about 59.5% of the
variability in the total number of daily rentals, which is decent but leaves a substantial part unexplained.
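The estimates above were computed in R. As a minimal sketch of what such a fit does, the closed-form OLS formulas for one regressor can be written out in Python as follows. The data are hypothetical toy values, not the bike data, and the function name ols_simple is our own.

```python
def ols_simple(x, y):
    """Closed-form OLS for y = b0 + b1 * x; returns (b0, b1, R-squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * a for a in x]
    ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))  # unexplained variation
    ss_tot = sum((b - mean_y) ** 2 for b in y)             # total variation
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical toy values, NOT taken from the bike rental data set.
temp = [2.0, 8.0, 14.0, 20.0, 26.0]
cnt = [1500.0, 2600.0, 3100.0, 4500.0, 4800.0]

b0, b1, r_squared = ols_simple(temp, cnt)
residuals = [c - (b0 + b1 * t) for t, c in zip(temp, cnt)]
```

The intercept is the prediction at temp = 0, the slope is the change in the prediction per degree, and R-squared is the share of the variation in cnt that the line accounts for.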
3 Exercise 3
(The discussed graphs are provided at the end of this section.)
When examining the graph, which shows the data points and the estimated regression line, we see
that the intercept of the regression line ($\hat{\beta}_0$) is indeed around 1631 (bikes) and the slope
of the regression line ($\hat{\beta}_1$) is indeed positive. The estimated regression line seems to fit
the data reasonably well, since the areas with relatively many data points are located close to the
regression line. However, a considerable number of data points lie (much) further from the regression
line. This is in accordance with the value of R-squared associated with the estimated regression model,
which was discussed in the previous section. Moreover, the graphical representation reflects the fact
that we have used the OLS estimates to obtain the estimated regression line, meaning that the sum of
the squared vertical distances between the data points and the estimated regression line is minimized.
Furthermore, both the plot and the histogram of the residuals indicate that most residuals have
relatively low (absolute) values and that the positive and negative residuals cancel each other out
relatively well in both magnitude and frequency. This makes sense, since the OLS method minimizes the
sum of squared residuals. In addition, because the model includes an intercept, the OLS residuals sum
to zero by construction, so their mean must equal 0 (this is a property of the fitted residuals and is
separate from the unbiasedness of the OLS estimator). The histogram already suggests that the mean of
the residuals might indeed be 0, and this is confirmed by a quick calculation of the mean in R, which
results in the value $-3.775891 \times 10^{-14}$ (0 up to floating-point rounding).
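The near-zero mean of the residuals can be traced back to the first-order condition of the least squares problem with respect to the intercept. A short derivation, in which the residual symbol $\hat{u}_i$ is our own notation and not taken from the report:

```latex
\[
\frac{\partial}{\partial \hat{\beta}_0}
\sum_{i=1}^{n} \bigl(\mathrm{cnt}_i - \hat{\beta}_0 - \hat{\beta}_1 \,\mathrm{temp}_i\bigr)^2
= -2 \sum_{i=1}^{n} \hat{u}_i = 0
\quad\Longrightarrow\quad
\frac{1}{n} \sum_{i=1}^{n} \hat{u}_i = 0,
\qquad
\hat{u}_i = \mathrm{cnt}_i - \hat{\beta}_0 - \hat{\beta}_1 \,\mathrm{temp}_i .
\]
```

The value of order $10^{-14}$ reported by R is therefore zero up to rounding in floating-point arithmetic.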
Figure 2: Scatterplot of variables "cnt" and "temp" and the estimated regression line
Figure 3: Histogram of residuals of the regression model.
Figure 4: Plot of residuals of the regression model.
4 Exercise 4
(The discussed graphs and tables can be found at the end of the section)
4.3 Comparison
When comparing the estimated regression curve corresponding to the fourth order polynomial with the
estimated regression line we discussed before, it becomes apparent that the straight line does not take
into account the decrease in the number of rentals as the temperature rises to extraordinary degrees;
the first order polynomial has a fixed slope, corresponding to the general trend in the data, which
means that the development of the line at extremely high values of ”temp” is the same as in the area
where moderate temperatures are observed. This results in the slope being too large at temperatures
around 30 degrees and greater, as well as in the area where the temperature is negative. The regression
curve, however, is created by using a fourth order polynomial which enables it to decrease strongly
when extremely high and low temperatures are measured. Furthermore, figure 10 showcases how
the curve is closer to the observed numbers of rentals for moderate temperatures (between 0 and 25
degrees) than the regression line, again due to the much greater flexibility of the curve thanks to a
greater number of parameters. The two variables clearly have a nonlinear relationship that is too
complicated to be described properly by a first order polynomial and the corresponding straight line.
Bike rentals do not simply increase or decrease when the temperature rises or falls. Hence, a higher
order polynomial, like the fourth order one we selected, is necessary to take into account that the
change in bike rentals resulting from a change in temperature depends on the original temperature.
This statement is supported by the increase in the adjusted R-squared of the model with the fourth
order polynomial compared to the one with a first order polynomial (from 0.5937 to 0.65).
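To make the role of the adjusted R-squared concrete, the following Python sketch applies the standard formula to the first order model. It assumes n = 365 observations (one per day of 2011), which the report does not state explicitly; under that assumption the R-squared of 0.5948 from Exercise 2 gives the adjusted value 0.5937 quoted above.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R-squared: penalizes R-squared for the number of regressors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)

# ASSUMPTION: 365 daily observations for 2011 (not stated in the report).
N_OBS = 365

# First order polynomial: one regressor (temp), R-squared 0.5948 from Exercise 2.
adj_linear = adjusted_r_squared(0.5948, N_OBS, 1)

# The same R-squared with four regressors would be penalized more heavily.
adj_same_fit_four_regressors = adjusted_r_squared(0.5948, N_OBS, 4)
```

A higher order polynomial therefore only improves the adjusted R-squared if the gain in fit outweighs the penalty for the extra parameters.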
Figure 5: Regression curve when k=2
Figure 6: Regression curve when k=3
(Figure 7: caption not recoverable from the source; a plot of "cnt" against "temp" appears here.)
Figure 8: Regression curve when k=5
Figure 9: Regression curve when k=6
Polynomial order k    Adjusted R-squared
1                     0.5937
2                     0.6288
3                     0.6497
4                     0.65
5                     0.6492
6                     0.6483
Figure 10: Scatterplot of variables "cnt" and "temp", estimated regression line (red)
and estimated regression curve corresponding to the polynomial of order 4 (green)
5 Exercise 5
Given a temperature of 34 degrees Celsius, the model which uses a first order polynomial predicts
the number of bike rentals to be approximately 5688, whereas the model which uses a fourth order
polynomial predicts only 2702 rentals. This large difference is due to the nature of both models, as
outlined in the previous section: the first order polynomial has a fixed positive slope, because of the
general upward trend in the data, meaning that the predicted number of rentals keeps increasing as
the temperature increases, which we believe to be highly unrealistic. In this case, we are dealing with
a relatively high temperature, which causes the prediction of the model with two parameters to be
much higher than the one from the model which uses a fourth order polynomial and can therefore, to
a certain extent, describe the nonlinear relationship between temperature and bike rentals. Under this
nonlinear relationship, higher temperatures are associated with lower numbers of rentals from a certain
temperature onward, which we think is highly reasonable. Since the given temperature is clearly above
this threshold, the fourth order polynomial predicts a much lower number of rentals than the regression
with only a first order polynomial. We believe that this lower number is a much better prediction of
the actual number of rentals. This belief is supported by figure 10, which shows that the regression
line corresponding to the model with two parameters deviates heavily from the observed data points
around the given temperature, whereas the regression curve corresponding to the model with five
parameters stays close to them. The difference in the predicted number of daily rentals between the
two models at this high temperature clearly shows how a regression model which outputs a straight
line can produce very inaccurate predictions when the two variables have a nonlinear relationship,
especially at extreme values of the independent variable. Introducing powers of the independent
variable in the linear regression, so that a polynomial approximates the nonlinear relationship (in the
spirit of a Taylor expansion), is a possible solution, which we consider appropriate in this case.
However, alternative methods could certainly be used in an attempt to obtain better results in future
research.
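The first order prediction can be reproduced directly from the estimates reported in Exercise 2. The Python sketch below does only that; the coefficients of the fourth order polynomial are not listed in this report, so its prediction of 2702 is not recomputed here. The function name predict_linear is our own.

```python
# OLS estimates of the first order model, as reported in Exercise 2.
BETA0_HAT = 1630.99
BETA1_HAT = 119.33

def predict_linear(temp):
    """Predicted number of daily rentals under the straight-line model."""
    return BETA0_HAT + BETA1_HAT * temp

pred_34 = predict_linear(34)   # 1630.99 + 119.33 * 34
pred_40 = predict_linear(40)   # the straight line keeps rising with temperature
```

Because the slope is fixed, every additional degree adds about 119 predicted rentals regardless of how hot it already is, which is exactly the behaviour criticized above.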
6 Exercise 6
6.1 Motivation
In order to properly build a multiple linear regression model, we need to make a distinction between
regular (numerical) variables and categorical variables. A quick examination of (part of) the given data set reveals
that the variables ”hum” and ”windspeed” do not take on a fixed number of possible values and they
furthermore do not assign to each statistical unit a particular group or nominal category on the basis
of qualitative properties. The said variables will thus not be classified as categorical. However, the
variables ”weekday”, ”weathersit” and ”holiday” do in fact take on a fixed number of possible values,
namely 7, 4 and 2 respectively. Moreover, these variables clearly assign to each unit a particular
group on the basis of a qualitative property; the weekdays, holidays and non-holidays and the four
weathersits, given in the description of the variables, can be considered as groups. We hence classify
said variables as categorical variables and we must therefore treat them as such.
6.2 Results
We obtain the following estimated multiple regression model:
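\[
\begin{aligned}
\mathrm{cnt} = {} & \hat{\beta}_0 + \hat{\beta}_1 \cdot \mathrm{temp} + \hat{\beta}_2 \cdot \mathrm{windspeed} + \hat{\beta}_3 \cdot \mathrm{hum} \\
& + \hat{\beta}_4 \cdot \mathrm{summer} + \hat{\beta}_5 \cdot \mathrm{autumn} + \hat{\beta}_6 \cdot \mathrm{winter} \\
& + \hat{\beta}_7 \cdot \mathrm{Monday} + \hat{\beta}_8 \cdot \mathrm{Tuesday} + \hat{\beta}_9 \cdot \mathrm{Wednesday} + \hat{\beta}_{10} \cdot \mathrm{Thursday} + \hat{\beta}_{11} \cdot \mathrm{Friday} + \hat{\beta}_{12} \cdot \mathrm{Saturday} \\
& + \hat{\beta}_{13} \cdot \mathrm{holiday} + \hat{\beta}_{14} \cdot \mathrm{weathersit2} + \hat{\beta}_{15} \cdot \mathrm{weathersit3}
\end{aligned}
\]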
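The categorical regressors in the model above enter as 0/1 dummy variables, with one category per variable left out as the baseline (for the seasons, the omitted category is spring). The Python sketch below illustrates this coding for the season variable; the function name dummy_code and the string labels are hypothetical, and the report itself presumably relied on R for this step.

```python
def dummy_code(value, categories, baseline):
    """Return 0/1 indicators for every category except the baseline."""
    if baseline not in categories or value not in categories:
        raise ValueError("value and baseline must both be listed in categories")
    return {c: int(c == value) for c in categories if c != baseline}

# Hypothetical labels for the four seasons; spring is the omitted baseline.
SEASONS = ["spring", "summer", "autumn", "winter"]

autumn_row = dummy_code("autumn", SEASONS, baseline="spring")
spring_row = dummy_code("spring", SEASONS, baseline="spring")
```

For a spring day all three season dummies are 0, so its prediction is determined by the intercept and the remaining regressors; each season coefficient is therefore read as a difference relative to spring.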
The OLS estimates are given in the following table:
Table 2: OLS estimates of the multiple regression model (based on data from 2011)
of rentals. Hence, a lower prediction of the variable cnt as the variable windspeed takes on
greater values seems reasonable, intuitively speaking. However, we notice that the magnitude of the
coefficient corresponding to windspeed is lower than the magnitude of the coefficient corresponding
to temperature, which indicates that the effect of a one unit change in windspeed on the predicted
number of rentals is weaker than the effect of a one unit change in temperature. The same argument
holds for the variable humidity: as the humidity level increases, riding a bike becomes less enjoyable
and hence a decrease in bike rentals is to be expected, which explains why the coefficient corresponding
to the variable hum is negative. The model considers the effect of a one unit change in humidity on
the value of cnt to be even smaller than the effect of a one unit change in windspeed, since the absolute
value of the coefficient corresponding to the variable hum is lower than that of the coefficient of
windspeed. Note that these comparisons depend on the units in which the variables are measured.
Whether consumers actually prioritize between temperature, windspeed and humidity in the way the
model suggests when deciding whether to rent a bike is hard to determine from a practical point of view.
many potential bike renters, which is why the seasons spring and autumn, which may have more days
with moderate temperatures, could have higher numbers of rentals based on the weather conditions
associated with these seasons only. This could intuitively explain the higher values of the coefficients
corresponding to the dummy variables of the seasons autumn and spring compared to the coefficient
which indicates which value is added to the predicted value of cnt when the day in question turns out
to fall in the summer.
clear practical interpretation: the description of the weathersit when this variable takes on the value
3 seems less favourable for biking than the description of the second category of weathersit, which in
turn has weather properties that are less suitable for bicycling than the first category. When a day
has weathersit value 1, one would expect that, ceteris paribus, the model outputs a greater prediction,
since the weather conditions in this category are clearly the most suitable for bicycling. This intuitive
expectation is met by our model: both dummy variables in our regression model which correspond to
categories of the categorical variable weathersit take on the value 0 when the weathersit value is 1,
meaning that none of the corresponding negative coefficients is added to the predicted value of cnt,
which of course has a positive effect on the latter quantity (ceteris paribus).
6.3.7 Performance
To measure the performance of the estimated multiple regression model, we use the adjusted R-squared,
since we have added several regressors to the model, meaning that we need to take into account the
possibility of overfitting (which the adjusted R-squared does to some extent, unlike the regular
R-squared measure). The value of the adjusted R-squared is equal to 0.7979, which is clearly higher
than the adjusted R-squared we obtained from the models which (only) take powers of the variable
"temp" as regressors. From this we deduce that using more variables than only temperature in our
model to predict the values of cnt has a substantial (positive) impact on the fit of the model. This
finding has a practical interpretation: in practice, consumers do not base their decision to rent a bike
solely on the temperature. Many more factors play a role and we may not even have access to data on
some of the variables that influence the number of daily bike rentals.
7 Exercise 7
7.1 Approach
Since we are estimating the daily numbers of rentals of December 2012, we believe that using the data
from the other months in the same year to obtain the regression model is appropriate. Using data
from 2011 might lead to misleading results, because there might be many other factors and underlying
motives that are not covered by our data set, which influence consumer behaviour and hence the bike
rentals in each year. For instance, the prices of rental bikes may have increased from 2011 to 2012,
which means that using our data from 2011 might result in predicted numbers of rentals that are too
high, because we cannot take into account the (negative) effect of a price increase on bike rentals. We
hence use the data from the first 11 months of 2012 to obtain the estimated multiple regression model,
which we then use, by providing the data we possess from December 2012, to obtain the predicted
number of rentals on each day of this month.
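The split described in this approach, estimating on January to November 2012 and predicting for December 2012, can be sketched in Python as follows. The row layout (a dictionary with a "date" key) and the function name are hypothetical; the report's analysis was carried out in R.

```python
from datetime import date

def split_for_december_forecast(rows, year=2012):
    """Split rows into an estimation sample (Jan-Nov) and a target month (Dec)."""
    train = [r for r in rows if r["date"].year == year and r["date"].month <= 11]
    target = [r for r in rows if r["date"].year == year and r["date"].month == 12]
    return train, target

# Hypothetical toy rows, NOT taken from the bike rental data set.
rows = [
    {"date": date(2011, 12, 31), "temp": 5.0},
    {"date": date(2012, 1, 1), "temp": 4.0},
    {"date": date(2012, 11, 30), "temp": 9.0},
    {"date": date(2012, 12, 1), "temp": 7.0},
    {"date": date(2012, 12, 31), "temp": 3.0},
]

train_rows, december_rows = split_for_december_forecast(rows)
```

Rows from 2011 end up in neither sample, in line with the choice to estimate on 2012 data only.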
7.2 Results
We obtain the following numerical and graphical results with respect to the OLS estimates and the
predicted values that the variable ”cnt” will take on:
Table 3: OLS estimates of the multiple regression model (based on data from 2012)
Table 4: Predicted number of rentals (cnt) per day in December 2012 (based on data from 2012)
Figure 11: Plot of the predicted numbers of rentals (cnt values) for each day in
December 2012 (horizontal axis: day; vertical axis: rentals_dec_2012).
daily rentals that figure 11 shows at December 21st, is almost certainly due to the fact that units
from this date onward are classified as winter days, rather than autumn days. For this conclusion,
it is crucial that the magnitude of the coefficient corresponding to the dummy variable of autumn is
relatively large, for otherwise the strong change in predicted cnt values at the first winter day could
have been caused by a change in the value of other variables, which is still possible, but rather unlikely.
The data points are thus classified as members of the aforementioned groups based on the variable
season. From a practical point of view, the outlined property of our model is not reasonable. It might
be true that, ceteris paribus, on average, the number of daily bike rentals is lower in the winter than
in the autumn, but this would probably be due to the fact that the weather conditions are on average
less suitable for bikers in the winter than in the autumn. However, weather conditions do not usually
change drastically the moment a new season starts; as said, they differ on average. It is hence not very
plausible that bike rental numbers would immediately drop drastically once the winter has started.
Another aspect of figure 11 which we believe to be explainable, is the outlier corresponding to the 26th
day of December. This day is of course a holiday, which, ceteris paribus, results in a lower predicted
number of total bike rentals, due to the fact that the estimated coefficient of the dummy variable
corresponding to the category ”holiday” is negative with relatively great magnitude as well. This
explains why the predicted total number of rentals is that much lower on December 26th compared
to the other winter days in the same month.
8 Exercise 8
8.1 Approach
In order to include in the previous model an increasing linear trend, we need to use the variable from
the data set that denotes the date of each unit (day), since this is an indication of time. We will
create a new variable, which denotes how much time has passed since January 1st, 2011 (which will
be our starting point), since we can then use R in order to find an OLS estimate for the corresponding
coefficient, by using our data from both 2011 and 2012 (except for the last month of the latter year).
This will probably give us a better estimate of the magnitude of the assumed increasing linear trend,
compared to an approach where we only use the data from one year, since our intuition tells us that
the presumed increasing linear trend will especially become apparent when considering relatively large
time intervals. We want to include a (preferably smooth) linear trend in the number of (predicted)
daily rentals. We therefore cannot simply add a variable that denotes the year of a unit, with a
positive parameter that covers the entire "time effect" on bike rentals, to the estimated regression
model, since this would probably result in sudden (stepwise) increases in the predicted cnt values
on January 1st compared to the previous day, comparable to what we saw happen on December 21st
in the previous section. We hence introduce a new variable in the estimated regression model that
denotes how many days have passed since January 1st, 2011, since this enables us to implement a
smooth increasing trend, rather than a stepwise increase in the predicted number of rentals. We use
the following formula to calculate the value of the new variable "Days after start" for each given unit:
\[
\text{Days after start} = (d - 1) + 30 \cdot (m - 1) + 365 \cdot (y - 2011),
\]
where $d$, $m$ and $y$ denote the first, second and third number of the date, respectively.
We need to add the change in days with respect to our starting point, January 1st, 2011. This is why
we subtract 1 from the first number of any given date, since we start at the first day of the month
January, and we also subtract 1 from the second number of any given date, since we start at the first
month of the year. Furthermore, we subtract 2011 from the third number of any given date, since we
start at a day from 2011 and, as said, we only consider the change with respect to the starting point.
For simplicity, we add 30 days to the variable when an additional month (compared to the starting
month, January) has passed and 365 days when an additional year (compared to the starting year,
2011) has passed, even though this causes slight deviations, because not every month has 30 days and
not every year has 365 days. All that remains is to find an appropriate value for the parameter
$\hat{\beta}_{16}$, which indicates the increase in the predicted number of daily rentals, ceteris
paribus, per day, due to the time effect. For this, we add our new variable "Days after start" (after we
have calculated its value for each of the days from 2011 and each of the days in the first 11 months
of 2012) as a regressor to the linear model we compute in R. This gives us the desired OLS estimate
of the corresponding parameter. However, in order to have observations for each of the independent
variables on each day in the said two years, we also use the data from 2011 for the other regressors,
since our goal is no longer necessarily to forecast the numbers of daily rentals in December 2012, as it
was in the previous section. It is hence more appropriate to use all the data we have for all variables
in our regression to estimate our model. When more data becomes available, one could of course
"update" the model by recalculating the OLS estimates based on the new observations for the variables
in the estimated multiple regression model.
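The size of the "slight deviations" mentioned above can be checked with a short Python sketch that compares the simplified formula with an exact day count. It assumes that the first, second and third number of a date are the day, the month and the year; both function names are our own.

```python
from datetime import date

def days_after_start_approx(day, month, year):
    """Simplified formula from the text: 30-day months and 365-day years."""
    return (day - 1) + 30 * (month - 1) + 365 * (year - 2011)

def days_after_start_exact(day, month, year):
    """Exact number of calendar days since January 1st, 2011."""
    return (date(year, month, day) - date(2011, 1, 1)).days

# Last day of the estimation sample: November 30th, 2012.
approx = days_after_start_approx(30, 11, 2012)   # 29 + 300 + 365
exact = days_after_start_exact(30, 11, 2012)
deviation = exact - approx
```

At the end of the estimation sample the simplified variable lags the exact count by a few days. The simplified formula also assigns the same value to January 31st and February 1st, so the exact day count would be the smoother choice if the date is available in a parseable form.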
8.2 Results
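We obtain the following estimated multiple regression model:
\[
\begin{aligned}
\mathrm{cnt} = {} & \hat{\beta}_0 + \hat{\beta}_1 \cdot \mathrm{temp} + \hat{\beta}_2 \cdot \mathrm{windspeed} + \hat{\beta}_3 \cdot \mathrm{hum} \\
& + \hat{\beta}_4 \cdot \mathrm{summer} + \hat{\beta}_5 \cdot \mathrm{autumn} + \hat{\beta}_6 \cdot \mathrm{winter} \\
& + \hat{\beta}_7 \cdot \mathrm{Monday} + \hat{\beta}_8 \cdot \mathrm{Tuesday} + \hat{\beta}_9 \cdot \mathrm{Wednesday} + \hat{\beta}_{10} \cdot \mathrm{Thursday} + \hat{\beta}_{11} \cdot \mathrm{Friday} + \hat{\beta}_{12} \cdot \mathrm{Saturday} \\
& + \hat{\beta}_{13} \cdot \mathrm{holiday} + \hat{\beta}_{14} \cdot \mathrm{weathersit2} + \hat{\beta}_{15} \cdot \mathrm{weathersit3} + \hat{\beta}_{16} \cdot \text{days after start}
\end{aligned}
\]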
The corresponding OLS estimates are given in the following table:
Table 5: OLS estimates of the multiple regression model including the "time effect" (based on data
from 2011 and 2012)
is much lower in the model where we include the increasing linear trend. This makes sense, since
we used the OLS method, which minimizes the sum of squared distances between the predicted and
the observed numbers of rentals (from the data set). Since an increasing linear trend has been added
to the model and the estimates still minimize these squared distances, the estimated intercept has
decreased in order to preserve the "least squares property" of our model. The magnitude of the
(positive) estimated parameter of the variable "days after start" is furthermore relatively large, given
that the variable takes on large values as time passes. This might explain why the parameter estimates
corresponding to the other regular variables hum and windspeed have decreased in the model which
includes the increasing linear trend compared to the one which does not: the increase in prediction
caused by the "time effect" is compensated by a decrease in the coefficients corresponding to variables
which only take on non-negative values, in order to prevent large residuals as a result of predicted cnt
values that would be far too high. The only variable that can take on negative values is temperature
(since we assume that the model will not be used for days before our starting point). It has a positive
parameter in both models under consideration, but in the model which includes the time effect the
magnitude of the estimated parameter has decreased, which could again be a way of making room
for the effect the variable "days after start" has on the predicted values. The decrease in the
parameter estimate corresponding to the dummy variable of the category "holiday" might have the
same reason: it prevents the model from predicting values that are too high as more and more time
passes. This reasoning does not hold in general for the coefficients corresponding to the other dummy
variables; we observe both increases and decreases in the parameter estimates of the weekday and
weathersit dummy variables. However, we do notice a strong decrease in the coefficients corresponding
to the dummy variables which indicate whether a unit falls in the summer and whether a unit falls in
the autumn, meaning that the values that are added to the predicted number of rentals when a day
falls in the summer or the autumn have decreased strongly. This could also be a way in which the
model offsets the positive effect of our new variable on the predicted bike rentals as the year progresses
(and as the variable "days after start" takes on greater values). We cannot, however, confirm whether
our intuitive interpretations of the (changes in the) estimated parameters in our model, which includes
the time effect, are correct.
8.3.2 Performance
We are fairly certain that the increasing linear trend has actually been implemented in the model,
since the relatively large magnitude of our new parameter and the constantly increasing value of the
corresponding time variable ensure that the predicted number of daily rentals goes up over time,
given that the values of the other regular variables do not generally increase or decrease systematically
over time [2]. The number of units belonging to a certain season or weekday group will of course also
not change significantly. The number of holidays in a year is usually stable and we do not believe that
the number of days with a specific weathersit generally increases or decreases strongly over time. We
hence suspect that the predicted number of rentals will in general keep increasing as more time passes,
which is what we wanted to implement in our model. We continue by examining to what extent the
model is able to predict accurate values of the daily number of rentals. Since relatively many
parameters are included in the model under consideration, it is necessary to take into account the
possibility of overfitting, which is why we again use the adjusted R-squared to get a notion of how
well the regressors of our model can describe the variability of the regressand ("cnt"). This quantity
equals 0.8209, which is greater than all other values of the adjusted R-squared we have obtained so
far for our models. In order to make an appropriate comparison of performance possible, we use the
adjusted R-squared of a new model which uses the data from both 2011 and 2012, but which does not
include the variable "days after start" which measures the time. Thus, we essentially consider the
same model as in exercise 7, while using observations from both 2011 and 2012. The adjusted
R-squared of this model is approximately 0.5472, which is much lower than the value we obtained
from the model which does include the increasing linear trend. We hence believe that including an
increasing linear trend by adding a new, time measuring variable to the model has increased the
degree to which the model can describe the variability in the variable cnt. Despite the penalty that
the R-squared value of the model including the "time effect" has received due to the extra parameter,
the adjusted R-squared value is greater than the ones from all other models we have considered
throughout our analysis. From this we deduce that the suggestion from our colleague, which implied
that the number of daily bike rentals is increasing over time, is supported by the data we have access
to. Even though the model which includes the "time effect" seems to perform relatively well based on
the adjusted R-squared, the predictions may become less trustworthy in the long term, since the
actual number of total daily bike rentals could potentially reach a bound, whereas the variable "days
after start" will keep increasing as more days pass, resulting in greater and greater predicted numbers
of daily rentals which could thus start deviating (strongly) from the true values. In this case, the
model could be updated by using more relevant data, as was stated before.
[2] The validity of this claim with respect to the variable "temperature" may be subject to discussion.
Table 6: OLS estimates of the multiple regression model not including the "time effect" (based on
data from 2011 and 2012)