Data - Analysis 2

The document outlines a series of exercises focused on analyzing the relationship between temperature and bike rentals using statistical methods such as OLS regression and correlation analysis. It discusses the selection of a fourth-degree polynomial model for better prediction accuracy and interprets the results, indicating a strong positive correlation between temperature and bike rentals. The analysis includes graphical representations and detailed interpretations of the regression results, emphasizing the importance of model selection in statistical analysis.

Uploaded by

zymidd13
Copyright
© © All Rights Reserved

Contents

1 Exercise 1

2 Exercise 2

3 Exercise 3

4 Exercise 4
4.1 Model selection
4.2 Results and Interpretations
4.3 Comparison

5 Exercise 5

6 Exercise 6
6.1 Motivation
6.2 Results
6.3 Interpretations and comments
6.3.1 Intercept
6.3.2 Non-categorical variables
6.3.3 Season categories
6.3.4 Weekday categories
6.3.5 Holiday categories
6.3.6 Weathersit categories
6.3.7 Performance

7 Exercise 7
7.1 Approach
7.2 Results
7.3 Interpretations and comments
7.3.1 Predicted parameter values
7.3.2 Predicted numbers of daily bike rentals

8 Exercise 8
8.1 Approach
8.2 Results
8.3 Interpretations and comments
8.3.1 OLS estimates
8.3.2 Performance

1 Exercise 1
Univariate summaries such as histograms describe the distribution of each variable separately, but
not the relationship between the two, which is why we omit these tools. To illustrate the relationship
between the total number of bikes rented (cnt) and temperature (temp), we use a scatterplot, given at
the end of this section, as a graphical tool, and the sample covariance and sample correlation as
descriptive statistics. The sample covariance of the two variables is approximately 9475 > 0, which
indicates a positive relationship between the temperature and the total number of bikes rented. This
quantity is hard to interpret, however, because it is not standardized. We therefore turn to the
sample correlation of the two variables, which equals approximately 0.77. It is again positive, and
because the correlation is a unit-free measure between -1 and 1, it also indicates the strength of
the (linear) relationship between the two variables. In this case, the sample correlation tells us
that there is a strong positive relationship between temperature and total daily bike rentals. The
scatterplot shows this as well: the cloud of points has a clear positive slope, which explains the
positive correlation (and positive covariance), but the points are quite dispersed, which explains
why the correlation is not closer to 1. Based on these observations and quantities, we conclude that,
in our data set, higher temperatures are associated with more bikes rented.
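
The difference between the two statistics can be illustrated with a minimal sketch. The numbers below are hypothetical toy values, not the bike rental data, and the function names are chosen for illustration only. The sketch shows that rescaling one variable changes the covariance but leaves the correlation untouched, which is why the correlation is easier to interpret.

```python
from math import sqrt

def sample_cov(x, y):
    """Sample covariance with the usual n - 1 denominator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Sample correlation: covariance divided by both standard deviations."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

# Hypothetical toy values, NOT the bike rental data
temp = [2.0, 8.0, 14.0, 20.0, 26.0]
cnt = [1500.0, 2600.0, 3100.0, 4700.0, 4900.0]

cov_original = sample_cov(temp, cnt)
corr_original = sample_corr(temp, cnt)

# Re-express temperature in different units (here: multiply by 10)
temp_scaled = [10 * t for t in temp]
cov_scaled = sample_cov(temp_scaled, cnt)    # scales with the units
corr_scaled = sample_corr(temp_scaled, cnt)  # unit free, so unchanged

print(cov_original, cov_scaled)
print(corr_original, corr_scaled)
```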

Figure 1: Scatterplot of the variables "cnt" and "temp" (x-axis: temp; y-axis: cnt)

2 Exercise 2
The OLS estimates of the regression coefficients are $\hat{\beta}_0 = 1630.99$ and
$\hat{\beta}_1 = 119.33$. This means that the predicted total number of bikes rented at a temperature
of 0 degrees is approximately 1631. Furthermore, a one-unit (one-degree) increase in temperature is
associated with an increase in the predicted total number of rentals of approximately 119 units
(bikes). The positive estimated coefficient of the regressor also indicates that there is indeed a
positive relationship between temperature and the total number of bikes rented, as we concluded in
the previous section. In addition, the R-squared of our estimated model, which equals 0.5948,
indicates that temperature explains about 59.5% of the variability in the total number of daily
rentals, which is decent, yet suboptimal.
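
As an illustration of how such estimates arise, the following minimal sketch computes the intercept, slope and R-squared of a simple regression with the closed-form OLS formulas. It is not the R code used for this report; the function names and the data are hypothetical.

```python
def fit_simple_ols(x, y):
    """Closed-form OLS for y = b0 + b1 * x: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """Share of the variation in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Hypothetical toy values, NOT the bike rental data
temp = [2.0, 8.0, 14.0, 20.0, 26.0]
cnt = [1500.0, 2600.0, 3100.0, 4700.0, 4900.0]

b0, b1 = fit_simple_ols(temp, cnt)
print(b0, b1, r_squared(temp, cnt, b0, b1))
```

The intercept is the prediction at temp = 0 and the slope is the change in the prediction per degree, which is exactly how the two estimates above are read.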

3 Exercise 3
(The discussed graphs are provided at the end of this section.)
When examining the graph that shows the data points and the estimated regression line, we see that
the intercept of the regression line ($\hat{\beta}_0$) is indeed around 1631 (bikes) and the slope of
the regression line ($\hat{\beta}_1$) is indeed positive. The estimated regression line seems to fit
the data reasonably well, since it is quite close to many data points (the areas with relatively many
data points are located close to the regression line). However, a fair number of data points lie
(much) further from the regression line. This is in accordance with the R-squared of the estimated
regression model, which was discussed in the previous section. Moreover, the graphical representation
reflects the fact that we used the OLS estimates to obtain the regression line, meaning that the sum
of squared vertical distances between the data points and the line is minimized. Furthermore, both
the plot and the histogram of the residuals indicate that most residuals have relatively low
(absolute) values and that the positive and negative residuals cancel each other out relatively well
in both magnitude and frequency. This makes sense to us, since the OLS method minimizes the sum of
squared residuals. In addition, because the model includes an intercept, the OLS first-order
conditions force the residuals to sum to zero, so their mean must equal 0. The histogram already
suggests that the mean of the residuals is 0, and this is confirmed by a quick calculation of the
mean in R, which results in the value $-3.775891 \times 10^{-14}$ (0 up to floating-point error).
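
Why the residuals of a regression with an intercept average to zero can be seen from the minimization problem itself. The short derivation below is a sketch for the simple regression of this section; the symbols $y_i$ (cnt), $x_i$ (temp) and $\hat{e}_i$ (residual) are introduced here for illustration.

```latex
\begin{aligned}
S(\beta_0, \beta_1) &= \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 , \\
\left. \frac{\partial S}{\partial \beta_0} \right|_{(\hat{\beta}_0, \hat{\beta}_1)}
  &= -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0
  \;\Longrightarrow\; \sum_{i=1}^{n} \hat{e}_i = 0
  \;\Longrightarrow\; \frac{1}{n} \sum_{i=1}^{n} \hat{e}_i = 0 .
\end{aligned}
```

A tiny nonzero value of the computed mean is therefore consistent with rounding in floating-point arithmetic rather than with a genuine deviation from zero.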

Figure 2: Scatterplot of variables "cnt" and "temp" and the estimated regression line (x-axis: temp; y-axis: cnt)

Figure 3: Histogram of residuals of the regression model (x-axis: res; y-axis: Frequency)

Figure 4: Plot of residuals of the regression model (x-axis: Index; y-axis: res)

4 Exercise 4
(The discussed graphs and tables can be found at the end of the section)

4.1 Model selection


To find an appropriate model specification, we choose k̄ = 6 as the maximum order of the polynomial
in our model. Table 1, which is provided at the end of this section, shows that the adjusted
R-squared is highest for the polynomial of order 4. However, the increase in adjusted R-squared from
the third-degree to the fourth-degree polynomial is very small. We therefore also consider the graphs
of the corresponding estimated regression curves, namely figures 6 and 7. These figures indicate that
the increase in adjusted R-squared is (partially) due to the fact that the regression curve for k = 4
follows the data points on the far left and the far right of the graph more closely. This lowers the
residuals of the fourth-degree curve enough to give the model with k = 4 a higher adjusted R-squared,
despite the greater penalty for the extra coefficient compared to the model with only 3 regressors.

We believe that this stronger bending of the estimated regression curve with k = 4 at extreme (high
and low) values is beneficial. Intuitively, it makes sense for the number of bike rentals to decrease
strongly as the temperature rises above 30 degrees or falls below 0 degrees, since such temperatures
are generally considered highly unsuitable for bicycling. The regression curve of the third-degree
polynomial, however, shows a sudden increase in the number of rentals as the temperature drops below
-5 degrees, which does not make sense intuitively. We hence find the regression model with the
fourth-degree polynomial more appropriate: it not only has a slightly higher adjusted R-squared, it
also shows stronger decreases in bike rentals at extreme temperatures, which is in line with our
intuition.

These advantages of the regression curve with k = 4 over the one with k = 3 also hold relative to the
second-degree polynomial, whose curve barely decreases as extraordinarily high temperatures above 30
degrees are reached, which does not reflect the common behaviour of consumers. That curve also
attains the value 0 when the temperature is around -3 degrees, which we do not consider a reasonable
prediction either: some consumers need to rent a bike regardless of the weather, and conditions
around -3 degrees still permit bicycling. Finally, the regression curves of the fifth- and
sixth-order polynomials also show the counterintuitive and unlikely increase or stability in the
number of daily bike rentals at temperatures below 0 degrees, and they tend to follow individual data
points more closely than the curve with k = 4 does, which suggests overfitting. We therefore select
k = 4 as the most appropriate model specification.
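
The selection criterion can be made concrete with a minimal sketch. The formula is the standard definition of the adjusted R-squared; the sample size n = 365 is an assumption (one observation per day of 2011) and is not stated in this section. The dictionary holds the values of Table 1.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Assumption: n = 365 daily observations; R-squared of the linear model is 0.5948
print(round(adjusted_r2(0.5948, 365, 1), 4))

# Adjusted R-squared per polynomial order k, copied from Table 1
table1 = {1: 0.5937, 2: 0.6288, 3: 0.6497, 4: 0.65, 5: 0.6492, 6: 0.6483}
best_k = max(table1, key=table1.get)
print(best_k)
```

Under the assumed sample size, the first value is close to the adjusted R-squared reported for k = 1. Because the gap between k = 3 and k = 4 is only 0.0003, the numerical criterion alone is not decisive, which is why the shape of the curves is considered as well.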

4.2 Results and Interpretations


The chosen model, which uses a fourth-degree polynomial to estimate the variable "cnt", gives an
adjusted R-squared of 0.65 and the following OLS estimates: $\hat{\beta}_0 = 1388.129$,
$\hat{\beta}_1 = 121.000$, $\hat{\beta}_2 = 2.342$, $\hat{\beta}_3 = 0.094$ and
$\hat{\beta}_4 = -0.007$. The first parameter is the intercept, that is, the predicted number of
rentals when the temperature is 0 degrees. In this new model with a higher-order polynomial, the
relationship between cnt and temperature is harder to interpret: it is no longer true that a unit
increase in temperature changes the predicted cnt value by $\hat{\beta}_1$. However, we do notice
that the second, third and fourth parameters are positive, which causes the strong increase in
rentals as the temperature increases up to around 25 degrees, beyond which the fifth parameter, which
is slightly negative, "outweighs" the others, resulting in a decreasing curve.
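
How the parameters interact can be checked with a minimal sketch that evaluates the fitted polynomial and its derivative. It uses the rounded estimates quoted above, so the results are approximate; the function names are hypothetical.

```python
# Rounded OLS estimates quoted in the text (b0, b1, b2, b3, b4)
coefs = [1388.129, 121.000, 2.342, 0.094, -0.007]

def predict_cnt(temp, coefs):
    """Evaluate b0 + b1*t + b2*t^2 + ... with Horner's rule."""
    result = 0.0
    for b in reversed(coefs):
        result = result * temp + b
    return result

def marginal_effect(temp, coefs):
    """Derivative of the fitted polynomial with respect to temperature."""
    return sum(j * b * temp ** (j - 1) for j, b in enumerate(coefs) if j > 0)

# The effect of one extra degree now depends on the starting temperature
print(marginal_effect(10, coefs))  # positive: rentals still rising
print(marginal_effect(30, coefs))  # negative: rentals falling

# Whole-degree temperature with the highest predicted number of rentals
peak_temp = max(range(0, 36), key=lambda t: predict_cnt(t, coefs))
print(peak_temp)
```

With these rounded coefficients the peak lies close to the turning point of around 25 degrees described above.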

4.3 Comparison
When comparing the estimated regression curve of the fourth-order polynomial with the estimated
regression line discussed before, it becomes apparent that the straight line does not take into
account the decrease in the number of rentals as the temperature rises to extraordinary levels. The
first-order polynomial has a fixed slope, corresponding to the general trend in the data, which means
that the line behaves the same at extremely high values of "temp" as in the area where moderate
temperatures are observed. As a result, the slope is too large at temperatures of around 30 degrees
and higher, as well as in the area where the temperature is negative. The regression curve, however,
is based on a fourth-order polynomial, which enables it to decrease strongly at extremely high and
low temperatures. Furthermore, figure 10 shows that the curve is closer than the regression line to
the observed numbers of rentals for moderate temperatures (between 0 and 25 degrees), again due to
the much greater flexibility that the additional parameters give the curve. It is clear that we are
dealing with two variables whose nonlinear relationship is too complicated to be described properly
by a first-order polynomial and the corresponding straight line. Bike rentals do not simply increase
or decrease when the temperature rises or falls. Hence, a higher-order polynomial, like the
fourth-order one we selected, is necessary to take into account that the change in bike rentals
resulting from a change in temperature depends on the starting temperature. This statement is
reinforced by a clear increase in the adjusted R-squared of the model with the fourth-order
polynomial compared to the one with a first-order polynomial (from 0.5937 to 0.65).

Figure 5: Regression curve when k=2 (x-axis: temp; y-axis: cnt)

Figure 6: Regression curve when k=3 (x-axis: temp; y-axis: cnt)

Figure 7: Regression curve when k=4 (x-axis: temp; y-axis: cnt)

Figure 8: Regression curve when k=5 (x-axis: temp; y-axis: cnt)

Figure 9: Regression curve when k=6 (x-axis: temp; y-axis: cnt)

Table 1: k-values and corresponding adjusted R-squared

k-value    Adjusted R-squared
1          0.5937
2          0.6288
3          0.6497
4          0.65
5          0.6492
6          0.6483

Figure 10: Scatterplot of variables "cnt" and "temp", estimated regression line (red)
and estimated regression curve corresponding to the polynomial of order 4 (green)
(x-axis: temp; y-axis: cnt)

5 Exercise 5
Given a temperature of 34 degrees Celsius, the model which uses a first-order polynomial predicts
approximately 5688 bike rentals, whereas the model which uses a fourth-order polynomial predicts only
2702 rentals. This large difference is due to the nature of both models, as was outlined in the
previous section. The first-order polynomial has a fixed positive slope, because in general there is
an upward trend in the data, so its predicted number of rentals keeps increasing as the temperature
increases, which we believe to be highly unrealistic. Here we are dealing with a relatively high
temperature, which causes the prediction of the model with two parameters to be much higher than that
of the fourth-order model, which can, to a certain extent, describe the nonlinear relationship
between temperature and bike rentals. In that model, higher temperatures lead to lower predicted
numbers of rentals from a certain temperature onward, which we think is highly reasonable. Since the
given temperature is clearly above this threshold, the fourth-order polynomial predicts a much lower
number of rentals than the regression with only a first-order polynomial. We believe that this lower
number is a much better prediction of the actual number of rentals. This belief is reinforced by
figure 10, which shows that the regression line of the model with two parameters deviates heavily
from the observed data points at the given temperature, whereas the regression curve of the model
with five parameters does not. The difference between the two predictions at this high temperature
clearly shows how a regression model that outputs a straight line can produce highly inaccurate
predictions when the two variables have a nonlinear relationship, especially at extreme values of the
independent variable. Introducing powers of the independent variable in the linear regression, in
order to create a Taylor expansion that describes the nonlinear relationship, is a possible solution,
which we see as a rather appropriate one in this case. However, alternative methods could certainly
be used in an attempt to obtain better results in future research.
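
The two predictions can be retraced with a minimal sketch that plugs 34 degrees into both fitted polynomials, using the rounded estimates reported in Exercises 2 and 4. The function name is hypothetical. Note that the rounded coefficients reproduce the linear prediction closely but not the fourth-order one: since 34^4 is about 1.34 million, a rounding error of at most 0.0005 in the last coefficient can shift the prediction by several hundred rentals.

```python
def poly_predict(temp, coefs):
    """Evaluate b0 + b1*temp + b2*temp^2 + ... for the given coefficient list."""
    return sum(b * temp ** j for j, b in enumerate(coefs))

# Rounded OLS estimates as reported in the text
linear = [1630.99, 119.33]
quartic = [1388.129, 121.000, 2.342, 0.094, -0.007]

pred_linear = poly_predict(34, linear)
pred_quartic = poly_predict(34, quartic)

print(round(pred_linear))   # matches the reported value of about 5688
print(round(pred_quartic))  # differs from the reported 2702 because of rounding

# Largest possible shift caused by rounding the fourth-order coefficient
print(0.0005 * 34 ** 4)
```

Even with this rounding uncertainty, the fourth-order prediction remains far below the linear one, which is the point of the comparison.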

6 Exercise 6
6.1 Motivation
In order to properly build a multiple linear regression model, we need to make a distinction between
regular variables and categorical variables. A quick examination of (part of) the given data set
reveals that the variables "hum" and "windspeed" do not take on a fixed number of possible values,
and they do not assign each statistical unit to a particular group or nominal category on the basis
of qualitative properties. These variables are therefore not classified as categorical. However, the
variables "weekday", "weathersit" and "holiday" do take on a fixed number of possible values, namely
7, 4 and 2 respectively. Moreover, these variables clearly assign each unit to a particular group on
the basis of a qualitative property: the weekdays, the holidays and non-holidays, and the four
weathersits given in the description of the variables can all be considered groups. We hence classify
these variables as categorical and treat them as such.

6.2 Results
We obtain the following estimated multiple regression model:
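$$
\begin{aligned}
\widehat{\text{cnt}} ={}& \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{temp} + \hat{\beta}_2 \cdot \text{windspeed} + \hat{\beta}_3 \cdot \text{hum} \\
&+ \hat{\beta}_4 \cdot \text{spring} + \hat{\beta}_5 \cdot \text{summer} + \hat{\beta}_6 \cdot \text{autumn} \\
&+ \hat{\beta}_7 \cdot \text{Monday} + \hat{\beta}_8 \cdot \text{Tuesday} + \hat{\beta}_9 \cdot \text{Wednesday} + \hat{\beta}_{10} \cdot \text{Thursday} + \hat{\beta}_{11} \cdot \text{Friday} + \hat{\beta}_{12} \cdot \text{Saturday} \\
&+ \hat{\beta}_{13} \cdot \text{holiday} + \hat{\beta}_{14} \cdot \text{weathersit2} + \hat{\beta}_{15} \cdot \text{weathersit3}
\end{aligned}
$$

The OLS estimates are given in the following table: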

Table 2: OLS estimates of the multiple regression model (based on data from 2011)

Variable                    Parameter              Value
- (intercept)               $\hat{\beta}_0$        2303.654
temp                        $\hat{\beta}_1$        90.687
windspeed                   $\hat{\beta}_2$        -34.476
hum                         $\hat{\beta}_3$        -9.021
spring (season)             $\hat{\beta}_4$        1063.903
summer (season)             $\hat{\beta}_5$        933.751
autumn (season)             $\hat{\beta}_6$        1448.740
Monday (weekday)            $\hat{\beta}_7$        74.149
Tuesday (weekday)           $\hat{\beta}_8$        124.895
Wednesday (weekday)         $\hat{\beta}_9$        78.013
Thursday (weekday)          $\hat{\beta}_{10}$     98.755
Friday (weekday)            $\hat{\beta}_{11}$     166.091
Saturday (weekday)          $\hat{\beta}_{12}$     160.740
holiday (holiday)           $\hat{\beta}_{13}$     -361.335
weathersit2 (weathersit)    $\hat{\beta}_{14}$     -316.315
weathersit3 (weathersit)    $\hat{\beta}_{15}$     -1678.906

6.3 Interpretations and comments


6.3.1 Intercept
The first parameter¹, $\hat{\beta}_0$, is the intercept, meaning that the cnt value is predicted to
be 2303.654 if all variables equal 0. To interpret this in a more practical way, the model predicts
that around 2304 bikes will be rented on a day if all of the following conditions are met: the
temperature on this day is 0 degrees Celsius, the windspeed is 0, the humidity is 0, the day is a
winter day, the day is a Sunday, the day is not a holiday and the weathersit on the day falls in the
category with numerical value 1 (meaning that the weather can be described as clear, with few clouds
or partly cloudy).

6.3.2 Non-categorical variables


We obtained 3 coefficients, $\hat{\beta}_1$, $\hat{\beta}_2$ and $\hat{\beta}_3$, which correspond to
our three regular variables, namely "temp", "windspeed" and "hum". These coefficients indicate the
change in the predicted value of cnt resulting from a 1 unit increase in the corresponding variable
(with the other variables held fixed). The coefficient of temperature is positive, which makes sense
because we know from previous sections that a model that is linear in temperature, like the one in
this section, predicts that the total number of rentals increases as the temperature increases (even
though we found that the relationship between the variables is nonlinear). The coefficient of
windspeed is negative, which has a clear practical interpretation: a higher windspeed makes bicycling
more strenuous and hence makes a bike less attractive as a means of transport, which leads to a lower
number of rentals. Hence, a lower prediction of cnt at greater values of windspeed seems reasonable,
intuitively speaking. However, the magnitude of the coefficient of windspeed is lower than that of
temperature, which indicates that the effect of a one unit change in windspeed on the predicted
number of rentals is weaker than the effect of a one unit change in temperature. A similar argument
holds for humidity: as the humidity level increases, riding a bike becomes less enjoyable, so a
decrease in bike rentals is to be expected, which explains why the coefficient of hum is negative.
The model considers the effect of a one unit change in humidity on cnt to be even smaller than that
of a one unit change in windspeed, since the absolute value of the coefficient of humidity is lower
than that of windspeed. Note that these magnitudes depend on the units in which each variable is
measured. Whether consumers actually weigh temperature, windspeed and humidity in the way the model
predicts when deciding whether to rent a bike is hard to determine from a practical point of view.

¹ When we say "parameter" or "coefficient", we mean the OLS estimate that R has calculated.

6.3.3 Season categories


All coefficients that have not been discussed up to this point correspond to so-called dummy
variables, which R created automatically because we indicated in our code that some of the variables
in the regression are categorical. Table 2 shows which variable corresponds to which estimate and
from which original variable each dummy variable arose. The categorical variable season from the
given data set, which can take on 4 values, is represented in our multiple regression by 3 dummy
variables, namely spring, summer and autumn. R automatically excludes one of the possible values of
a categorical variable, since we know for certain that this value is taken on if none of the other
values (which do have corresponding dummy variables) is. In the case of season, R has excluded the
value 1, corresponding to winter.

The three dummy variables all have positive coefficients of great magnitude compared to the
previously discussed coefficients of temperature, windspeed and humidity. The predicted cnt value
hence depends to a greater extent on the season in which a given day falls than on a one unit change
in the temperature, the windspeed or the humidity on that day. The dummy variables simply indicate
whether the day in question falls in a specific season. If it does not, the dummy variable takes on
the value 0, so the entire term, including the coefficient, vanishes (becomes 0) and neither
increases nor decreases the predicted value of cnt. If the day does fall within one of the three
seasons that correspond to a dummy variable, that dummy variable takes on the value 1, and the
corresponding coefficient is added to the predicted number of rentals. Hence, given that a day falls
in the spring, the summer or the autumn, the predicted number of rentals is higher than when the day
falls in none of these seasons (ceteris paribus), that is, when it falls in the winter. Being a
member of the category "winter" thus has a negative effect (relative to the other season categories)
on the predicted number of rentals, since all dummy variables of the original variable season then
take on the value 0, so none of the positive coefficients is added to the predicted cnt value. This
also has a practical interpretation: weather conditions are usually less suitable for biking in the
winter than in the other seasons, so the number of rentals will probably be lower in the winter,
ceteris paribus.

When comparing the coefficients of the three dummy variables for spring, summer and autumn, it
becomes apparent that autumn has the highest value, followed by spring. Ceteris paribus, autumn days
are therefore predicted to have more rentals than spring and summer days, and spring days more than
summer days, when only the variable season is considered. It is hard to determine whether a practical
interpretation is appropriate in this case. It depends on the weather conditions in these seasons in
the area we are considering, Washington D.C. We suspect that the summer contains relatively many days
with temperatures that many potential bike renters consider too high for bicycling, which is why
spring and autumn, which may have more days with moderate temperatures, could have higher numbers of
rentals based on the weather conditions associated with these seasons alone. This could intuitively
explain why the coefficients of the autumn and spring dummy variables are higher than the coefficient
that is added to the predicted value of cnt when the day falls in the summer.

6.3.4 Weekday categories


The coefficients of the dummy variables of the categorical variable weekday are all positive. Hence,
our model predicts higher values of cnt on every other day of the week than on Sundays (ceteris
paribus): all six dummy variables of weekday equal 0 if a given day is a Sunday, so none of the
positive coefficients is added to the predicted value of cnt. We also see that the coefficients of
some days are much higher than others. We cannot justify the values of these parameters from a
practical point of view, since there is no obvious reason why there would, for instance, be more
rentals on Tuesdays than on Thursdays, even though the higher coefficient of the Tuesday dummy
variable indicates that the model associates Tuesdays with more bike rentals. Higher rental numbers
on Fridays and Saturdays seem intuitive, since consumers often enjoy more free time on these days and
rental bikes might then be used more for leisure. Potential bike renters may not travel much on
Sundays in general, which could explain why our model predicts fewer rentals on Sundays, ceteris
paribus, than on the other days. However, consumers might need bikes to commute to and from work on
weekdays, which is a reason why bike rentals could in fact be higher on weekdays than on Saturdays.
We can thus not determine whether the coefficients of the weekday dummy variables are reasonable.
More information about the general behaviour of (potential) bike renters in Washington D.C. would be
needed.

6.3.5 Holiday categories


The negative coefficient of the dummy variable that indicates whether a day is a holiday makes a lot
of sense to us. The dummy variable takes on the value 1 if the day is a holiday, and the negative
coefficient then lowers the predicted value of cnt. On holidays, people often do not work and hence
do not need to commute, so a means of transport like a rental bike will be used less. On the other
hand, bikes might be used more for leisure, given that many people have more free time on holidays
than on regular days, but holidays are often spent at home or at a location relatively far from home,
which makes bikes less useful as a means of transport. The estimated parameter of the category
"holiday" thus seems reasonable to us.

6.3.6 Weathersit categories


Finally, we observe that the model includes only two dummy variables for the categorical variable
weathersit, although this variable can take on 4 values. The value 1 is excluded automatically for
the reason described before. The value 4 is excluded as well, since it is not observed in our data
set: R has no information about days on which weathersit equals 4, which is why we have not obtained
an OLS estimate for this category. Both coefficients of the weathersit dummy variables are negative,
but the coefficient of the dummy variable corresponding to weathersit 3 has the greater magnitude,
meaning that the negative effect on the predicted cnt value is greater when a day is classified as
weathersit 3 rather than 2. This has a clear practical interpretation: the description of weathersit
value 3 seems less favourable for biking than that of the second category, which in turn describes
weather that is less suitable for bicycling than the first category. For a day with weathersit value
1 one would expect, ceteris paribus, a higher prediction from the model, since the weather conditions
in this category are clearly the most suitable for bicycling. Our model meets this expectation: both
weathersit dummy variables take on the value 0 when weathersit equals 1, so neither of the negative
coefficients is added to the predicted value of cnt, which, ceteris paribus, leaves this prediction
higher than for the other categories.
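The dummy coding discussed above can be illustrated with a small Python sketch. It is not the R code used for the report; the helper `dummy_code` and the sample values are hypothetical and only mimic the behaviour described: the first observed level becomes the reference category, and a level that never occurs in the data (here 4) gets no dummy variable at all.

```python
def dummy_code(values, prefix):
    """Hypothetical helper: one 0/1 column per observed level, except the first."""
    levels = sorted(set(values))              # only levels that occur in the data
    reference, others = levels[0], levels[1:]
    rows = [{f"{prefix}{lvl}": int(v == lvl) for lvl in others} for v in values]
    return reference, rows


# Made-up sample: weathersit 4 is a valid code but is never observed.
weathersit = [1, 2, 1, 3, 2, 1]
reference, rows = dummy_code(weathersit, "weathersit")

print(reference)   # 1
print(rows[0])     # {'weathersit2': 0, 'weathersit3': 0}
print(rows[3])     # {'weathersit2': 0, 'weathersit3': 1}
```

A day in the reference category has all dummies equal to 0, so none of the (negative) weathersit coefficients enters its prediction.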

6.3.7 Performance
To measure the performance of the estimated multiple regression model, we use the adjusted
R-squared. Since we have added several parameters to the model, we need to take into account the
possibility of overfitting, which the adjusted R-squared does to some extent, unlike the regular
R-squared. The adjusted R-squared equals 0.7979, which is higher than the adjusted R-squared we
obtained from the models which (only) take powers of the variable "temp" as regressors. This
suggests that using more variables than only temperature to predict cnt has a clear positive impact
on the accuracy of the predictions. This finding has a practical interpretation: consumers do not
base their decision to rent a bike solely on the temperature. Many more factors play a role, and we
may not even have access to data on some of the variables that influence the number of daily bike
rentals.
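As a reminder of how the adjusted R-squared penalises extra regressors, the following sketch implements the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The function name and the numbers passed to it are illustrative assumptions and are not taken from the report's output.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), intercept not counted in p."""
    dof = n_obs - n_regressors - 1
    if dof <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


# Illustrative values only: the same raw R^2 with 1 versus 15 regressors.
few = adjusted_r_squared(0.80, n_obs=100, n_regressors=1)
many = adjusted_r_squared(0.80, n_obs=100, n_regressors=15)
print(round(few, 4), round(many, 4))   # 0.798 0.7643
```

With the same raw fit, the model with more regressors receives the lower adjusted value, which is why a higher adjusted R-squared for a larger model is informative.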

7 Exercise 7
7.1 Approach
Since we are estimating the rental rates of December 2012, we believe that using the data from the
other months of the same year to obtain the regression model is appropriate. Using data from 2011
might lead to misleading results, because many factors and underlying motives that are not covered
by our data set may influence consumer behaviour, and hence the bike rentals, differently in each
year. For instance, the prices of rental bikes may have increased from 2011 to 2012, in which case
using the data from 2011 might result in predicted numbers of rentals that are too high, because we
cannot take into account the (negative) effect of a price increase on bike rentals. We hence use the
data from the first 11 months of 2012 to obtain the estimated multiple regression model, which we
then use, by providing the data we possess for December 2012, to obtain the predicted number of
rentals on each day of this month.
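The split between estimation data and prediction data can be sketched as follows, in line with the caption of Table 3 (estimates based on data from 2012). This is a minimal Python illustration, not the R code behind the report; the key name `dteday` and the placeholder rows are assumptions.

```python
from datetime import date, timedelta

TRAIN_START, TRAIN_END = date(2012, 1, 1), date(2012, 11, 30)
TEST_START, TEST_END = date(2012, 12, 1), date(2012, 12, 31)


def split_by_date(rows):
    """rows: dicts with a 'dteday' key (assumed name) holding a datetime.date."""
    train = [r for r in rows if TRAIN_START <= r["dteday"] <= TRAIN_END]
    test = [r for r in rows if TEST_START <= r["dteday"] <= TEST_END]
    return train, test


# Placeholder rows: one dict per day of 2011 and 2012, without real measurements.
first_day = date(2011, 1, 1)
rows = [{"dteday": first_day + timedelta(days=i)} for i in range(731)]

train, test = split_by_date(rows)
print(len(train), len(test))   # 335 31
```

All days of 2011 fall outside both sets, so they influence neither the estimates nor the December predictions.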

7.2 Results
We obtain the following numerical and graphical results with respect to the OLS estimates and the
predicted values that the variable "cnt" will take on:

Table 3: OLS estimates of the multiple regression model (based on data from 2012)

Variable                    Parameter   Value
(intercept)                 β̂0          4099.787
temp                        β̂1          123.914
windspeed                   β̂2          -48.120
hum                         β̂3          -12.580
spring (season)             β̂4          1095.413
summer (season)             β̂5          665.499
autumn (season)             β̂6          1614.378
Monday (weekday)            β̂7          311.008
Tuesday (weekday)           β̂8          342.980
Wednesday (weekday)         β̂9          473.545
Thursday (weekday)          β̂10         535.992
Friday (weekday)            β̂11         531.392
Saturday (weekday)          β̂12         704.856
holiday (holiday)           β̂13         -913.631
weathersit2 (weathersit)    β̂14         -585.093
weathersit3 (weathersit)    β̂15         -2776.246

Table 4: Predicted number of rentals (cnt) per day in December 2012 (based on data from 2012)

Date Predicted cnt value


01-12-2012 5372.7857
02-12-2012 4724.7948
03-12-2012 6436.9567
04-12-2012 6352.5741
05-12-2012 6094.4221
06-12-2012 5545.3636
07-12-2012 5155.2559
08-12-2012 5592.1872
09-12-2012 4726.7615
10-12-2012 5209.7905
11-12-2012 4833.4696
12-12-2012 5141.3717
13-12-2012 5809.1665
14-12-2012 5662.7578
15-12-2012 6154.5152
16-12-2012 4868.9651
17-12-2012 5281.5847
18-12-2012 5906.5198
19-12-2012 5752.5428
20-12-2012 5328.3238
21-12-2012 3049.9334
22-12-2012 3493.1229
23-12-2012 3462.7155
24-12-2012 2937.0088
25-12-2012 2180.9065
26-12-2012 166.5929
27-12-2012 2589.3943
28-12-2012 3286.6858
29-12-2012 3355.4233
30-12-2012 2859.5396
31-12-2012 2865.6454

Figure 11: Plot of the predicted numbers of rentals (cnt values) for each day in December 2012
(x-axis: day, y-axis: rentals_dec_2012).

7.3 Interpretations and comments


7.3.1 Predicted parameter values
The OLS estimates that are provided in table 3 are of course not exactly the same as the ones
from table 2, since they are based on data of different years, but the ratios between the values of
the parameters do not differ drastically, meaning that all our conclusions with respect to the OLS
estimates of table 2 also hold for the ones from table 3.

7.3.2 Predicted numbers of daily bike rentals


Figure 11, which plots the predicted numbers of daily rentals in December, clearly shows that the
data points split into two groups. Within the first group, the first 20 days of December, the
predicted number of daily rentals does not change much. From December 21st onward, however, the
predicted values are much lower. This is no coincidence: that date is the first day of winter.
Table 3 tells us that the coefficient of the autumn dummy variable is greater than the coefficients
of the other season dummy variables, even though these are also strictly positive. Winter is the
season category that our model has excluded, meaning that, ceteris paribus, a winter day has a lower
predicted number of rentals than a day in any other season. The magnitude of the autumn coefficient
is also greater than that of the coefficients of all other variables, except for the dummy variable
which indicates whether weathersit is classified as "3". We therefore conclude that the strong
decrease in the predicted number of total daily rentals that figure 11 shows at December 21st is
almost certainly due to the fact that days from this date onward are classified as winter days
rather than autumn days. For this conclusion it is crucial that the autumn coefficient is relatively
large; otherwise the strong change in predicted cnt values at the first winter day would more
plausibly have been caused by a change in the value of other variables, which is still possible, but
rather unlikely.
The data points are thus split into the aforementioned groups based on the variable season. From a
practical point of view, this property of our model is not reasonable. It might be true that, ceteris
paribus, the number of daily bike rentals is on average lower in the winter than in the autumn, but
this would probably be because the weather conditions are on average less suitable for bikers in the
winter than in the autumn. Weather conditions do not usually change drastically on the day a new
season starts; as said, they differ on average. It is hence not very plausible that bike rental
numbers would drop drastically as soon as the winter has started.

Another aspect of figure 11 which we believe to be explainable is the outlier corresponding to the
26th day of December. This day is of course a holiday, which, ceteris paribus, results in a lower
predicted number of total bike rentals, because the estimated coefficient of the dummy variable
corresponding to the category "holiday" is negative with a relatively great magnitude as well. This
helps explain why the predicted total number of rentals is that much lower on December 26th than on
the other winter days in the same month. Note, however, that the holiday coefficient alone (-913.631)
is smaller than the gap to the neighbouring days in table 4 (roughly 2000 to 2400 rentals), so other
regressors, for instance weathersit, presumably contribute to this drop as well.
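The size of the seasonal step can be checked with simple arithmetic on the values from tables 3 and 4. The sketch below is only a back-of-the-envelope illustration: it compares the drop in predictions between December 20th and 21st with the autumn coefficient, which is the amount that disappears from the prediction once a day is classified as winter (the reference category). The variable names are my own.

```python
autumn_coef = 1614.378     # table 3, autumn dummy (winter is the excluded category)
pred_dec_20 = 5328.3238    # table 4, last autumn day
pred_dec_21 = 3049.9334    # table 4, first winter day

observed_drop = pred_dec_20 - pred_dec_21
share_from_season = autumn_coef / observed_drop
remaining = observed_drop - autumn_coef    # change attributable to other regressors

print(round(observed_drop, 4))      # 2278.3904
print(round(share_from_season, 2))  # 0.71
print(round(remaining, 4))          # 664.0124
```

Roughly 70% of the drop equals the autumn coefficient; the remainder must come from changes in the other regressors between the two days.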

8 Exercise 8
8.1 Approach
In order to include an increasing linear trend in the previous model, we need to use the variable
from the data set that denotes the date of each unit (day), since this is an indication of time. We
create a new variable which denotes how much time has passed since January 1st, 2011 (our starting
point), and use R to find an OLS estimate for the corresponding coefficient, based on our data from
both 2011 and 2012 (except for the last month of the latter year). This will probably give us a
better estimate of the magnitude of the assumed trend than an approach which only uses the data from
one year, since our intuition tells us that such a trend becomes apparent mainly over relatively
long time intervals.

We want to include a (preferably smooth) linear trend in the number of (predicted) daily rentals.
We therefore cannot simply add a variable that denotes the year of a unit, with a positive parameter
that covers the entire "time effect" on bike rentals: this would probably result in sudden
(stepwise) increases in the predicted cnt values on January 1st compared to the previous day,
comparable to what we saw happen on December 21st in the previous section. Instead, we introduce a
new variable that denotes how many days have passed since January 1st, 2011, since this enables us
to implement a smooth increasing trend rather than a stepwise increase in the predicted number of
rentals. We use the following formula to calculate the value of the new variable "Days after start"
for each given unit:
Days after start = ([first number of the date] - 1) + 30 * ([second number of the date] - 1) + 365 *
([third number of the date] - 2011)
We need to add the change in days with respect to our starting point, January 1st, 2011. This is why
we subtract 1 from the first number of any given date, since we start at the first day of the month
January, and why we subtract 1 from the second number, since we start at the first month of the
year. Furthermore, we subtract 2011 from the third number, since we start at a day in 2011 and, as
said, only consider the change with respect to the starting point. For simplicity, we add 30 days to
the variable for every month that has passed since the starting month (January) and 365 days for
every year that has passed since the starting year (2011), even though this causes slight
deviations, because not every month has 30 days and not every year has 365 days.

All that remains is to find an appropriate value for the parameter β̂16, which indicates the increase
in the predicted number of daily rentals, ceteris paribus, per day, due to the time effect. For
this, we add our new variable "Days after start" (after calculating its value for each of the days
from 2011 and each of the days in the first 11 months of 2012) as a regressor to the linear model we
compute in R. This gives us the desired OLS estimate of the corresponding parameter. Since the
regression requires observations of every independent variable on each day in these two years, we
also use the data from 2011 for the other regressors; our goal is no longer specifically to forecast
the numbers of daily rentals in December 2012, as it was in the previous section. It is hence more
appropriate to use all the data we have for all variables in our regression. When more data becomes
available, one could of course "update" the model by recalculating the OLS estimates based on the
new observations.
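The "Days after start" formula can be written out as a short Python sketch and compared with a calendar-exact day count. The function names are hypothetical, and the exact variant is not the method used in the report; it is included only to show the size of the deviations mentioned above.

```python
from datetime import date

START = date(2011, 1, 1)


def days_after_start_approx(day, month, year):
    """The report's simplified rule: 30-day months and 365-day years."""
    return (day - 1) + 30 * (month - 1) + 365 * (year - 2011)


def days_after_start_exact(day, month, year):
    """Calendar-exact alternative (an assumption, not the report's method)."""
    return (date(year, month, day) - START).days


print(days_after_start_approx(1, 1, 2011))     # 0
print(days_after_start_approx(30, 11, 2012))   # 694
print(days_after_start_exact(30, 11, 2012))    # 699
print(days_after_start_approx(31, 1, 2011),
      days_after_start_approx(1, 2, 2011))     # 30 30
```

Two things stand out: by the end of November 2012 the simplified rule is five days behind the calendar, and consecutive days around a month boundary can receive the same value (January 31st and February 1st both map to 30), so the variable is not strictly increasing in time.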

8.2 Results
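We obtain the following estimated multiple regression model (with winter as the excluded season
category):

\[
\begin{aligned}
\text{cnt} ={} & \hat{\beta}_0 + \hat{\beta}_1\,\text{temp} + \hat{\beta}_2\,\text{windspeed} + \hat{\beta}_3\,\text{hum} \\
& + \hat{\beta}_4\,\text{spring} + \hat{\beta}_5\,\text{summer} + \hat{\beta}_6\,\text{autumn} \\
& + \hat{\beta}_7\,\text{Monday} + \hat{\beta}_8\,\text{Tuesday} + \hat{\beta}_9\,\text{Wednesday} + \hat{\beta}_{10}\,\text{Thursday} + \hat{\beta}_{11}\,\text{Friday} + \hat{\beta}_{12}\,\text{Saturday} \\
& + \hat{\beta}_{13}\,\text{holiday} + \hat{\beta}_{14}\,\text{weathersit2} + \hat{\beta}_{15}\,\text{weathersit3} + \hat{\beta}_{16}\,\text{days after start}
\end{aligned}
\]

The corresponding OLS estimates are given in the following table: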

Table 5: OLS estimates of the multiple regression model including the "time effect" (based on data
from 2011 and 2012)

Variable                    Parameter   Value
(intercept)                 β̂0          2079.216
temp                        β̂1          101.598
windspeed                   β̂2          -41.549
hum                         β̂3          -14.293
spring (season)             β̂4          786.512
summer (season)             β̂5          38.855
autumn (season)             β̂6          206.117
Monday (weekday)            β̂7          176.196
Tuesday (weekday)           β̂8          240.640
Wednesday (weekday)         β̂9          299.477
Thursday (weekday)          β̂10         327.816
Friday (weekday)            β̂11         341.969
Saturday (weekday)          β̂12         398.110
holiday (holiday)           β̂13         -653.928
weathersit2 (weathersit)    β̂14         -362.803
weathersit3 (weathersit)    β̂15         -1904.065
days after start            β̂16         5.627
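To show how the estimates in table 5 turn into a prediction, the sketch below evaluates the regression equation for one made-up day. The coefficients are copied from table 5; the helper `predict_cnt`, the key names and the feature values are hypothetical, and the units of temp, windspeed and hum are assumed to match those used when the model was estimated.

```python
COEF = {
    "intercept": 2079.216, "temp": 101.598, "windspeed": -41.549, "hum": -14.293,
    "spring": 786.512, "summer": 38.855, "autumn": 206.117,
    "Monday": 176.196, "Tuesday": 240.640, "Wednesday": 299.477,
    "Thursday": 327.816, "Friday": 341.969, "Saturday": 398.110,
    "holiday": -653.928, "weathersit2": -362.803, "weathersit3": -1904.065,
    "days_after_start": 5.627,
}


def predict_cnt(features):
    """features: regressor name -> value; regressors that are left out count as 0."""
    unknown = set(features) - set(COEF)
    if unknown:
        raise KeyError(f"unknown regressors: {sorted(unknown)}")
    return COEF["intercept"] + sum(COEF[name] * value for name, value in features.items())


# Made-up autumn Wednesday with weathersit 1 (reference), 700 days after the start.
example_day = {"temp": 20, "windspeed": 10, "hum": 60,
               "autumn": 1, "Wednesday": 1, "days_after_start": 700}
prediction = predict_cnt(example_day)
print(round(prediction, 1))   # 7282.6
```

Flagging the same day as a holiday would lower the prediction by exactly the holiday coefficient, 653.928.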

8.3 Interpretations and comments


8.3.1 OLS estimates
We notice that the estimated parameter of the variable which denotes how many days have passed since
01-01-2011 is positive. Hence the sum of squared distances between the predicted and the observed
cnt values is minimized (since we used the OLS method) when the variable which gives an indication
of time has a positive coefficient, meaning that, according to the OLS method, there is indeed an
increasing trend over time in the number of rentals in our data set. This reinforces the presumption
of our colleague.

We continue by discussing the changes in the other parameter estimates that result from adding the
variable "days after start". One cannot compare the OLS estimates in table 5 with the ones from
table 3, since the latter are based on the observations from 2012 only, because that model was built
to predict the number of daily rentals in December 2012. For a proper comparison we therefore build
a model that uses the observations from both years but does not include the increasing linear trend.
The results are given in table 6 at the end of this section.

We immediately notice that the intercept (first parameter) is much lower in the model which includes
the increasing linear trend. This makes sense, since we used the OLS method, which minimizes the
squared distances between the predicted and the observed numbers of rentals. Since an increasing
linear trend has been added to the model and the estimates still minimize these squared distances,
R has decreased the intercept in order to preserve the "least squares property" of our model.

The magnitude of the (positive) estimated parameter of "days after start" is furthermore relatively
great, given that the variable takes on large values as time passes. One might expect the model to
compensate for the resulting increase in predictions by lowering the coefficients of variables which
only take on non-negative values, in order to prevent large residuals from predicted cnt values that
would be far too high. The estimates for hum and windspeed do shrink in magnitude in the model with
the trend (from -25.822 to -14.293 and from -50.435 to -41.549), but since both are negative this
makes the predictions higher rather than lower, so these changes cannot be read as such a
compensation. Temperature is the only variable that can take on negative values (the time variable
cannot, since we assume that the model will not be used for days before our starting point). Its
parameter is positive in both models, but smaller in the model which includes the time effect (101.598
instead of 134.383), which could be a way of making room for the effect of "days after start" on the
predicted values. The decrease in the parameter estimate of the dummy variable of the category
"holiday" might have the same reason: it prevents the model from predicting values that are too high
as more and more time passes.

This reasoning does not hold in general for the coefficients of the other dummy variables: we
observe both increases and decreases in the parameter estimates of the weekday and weathersit dummy
variables. We do, however, notice a strong decrease in the coefficients of the dummy variables which
indicate whether a unit falls in the summer or in the autumn, meaning that the values added to the
predicted number of rentals when a day falls in the summer or the autumn have decreased strongly.
This could also be a way in which the model offsets the positive effect of our new variable on the
predicted bike rentals as the year progresses (and as "days after start" takes on greater values).
We cannot, however, confirm whether our intuitive interpretations of the (changes in the) estimated
parameters in the model which includes the time effect are correct.
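The relation between the lower intercept and the new trend term can be made concrete with the numbers from tables 5 and 6. This is rough arithmetic under my own framing, not a formal decomposition: it asks after how many days the trend term has added back exactly what the intercept lost.

```python
intercept_without_trend = 3945.899   # table 6
intercept_with_trend = 2079.216      # table 5
trend_per_day = 5.627                # table 5, "days after start"

intercept_drop = intercept_without_trend - intercept_with_trend
break_even_day = intercept_drop / trend_per_day

print(round(intercept_drop, 3))   # 1866.683
print(round(break_even_day, 1))   # 331.7
```

The break-even point lies near the middle of the roughly 700-day estimation period, which is in line with the idea that the intercept was lowered to make room for the trend; the changes in the other coefficients are ignored in this comparison.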

8.3.2 Performance
We are fairly certain that the increasing linear trend has actually been implemented in the model.
The relatively great magnitude of our new parameter and the constantly increasing value of the
corresponding time variable ensure that the predicted number of daily rentals goes up over time,
since the values of the other regular variables do not generally increase or decrease strictly or
significantly over time². The number of units belonging to a certain season or weekday group will of
course not change significantly either. The number of holidays in a year is usually stable and we do
not believe that the number of days with a specific weathersit generally increases or decreases
strongly over time. We hence suspect that the predicted number of rentals will in general keep
increasing as more time passes, which is what we wanted to implement in our model.

We continue by examining to what extent the model is able to predict accurate values of the daily
number of rentals. Since relatively many parameters are included in the model, it is necessary to
take into account the possibility of overfitting, which is why we again use the adjusted R-squared
to get a notion of how well the regressors describe the variability of the regressand ("cnt"). This
quantity equals 0.8209, which is greater than all other values of the adjusted R-squared we have
obtained so far. To make an appropriate comparison of performance possible, we use the adjusted
R-squared of a new model which uses the data from both 2011 and 2012 but does not include the
variable "days after start". Thus, we essentially consider the same model as in exercise 7, while
using observations from both 2011 and 2012. The adjusted R-squared of that model is approximately
0.5472, which is much lower than the value we obtained from the model which does include the
increasing linear trend. We hence believe that adding a time-measuring variable has increased the
degree to which the model can describe the variability in cnt. Despite the penalty for the extra
parameter, the adjusted R-squared of the model including the "time effect" is greater than the ones
from all other models considered throughout our analysis. This supports the suggestion from our
colleague that the number of daily bike rentals is increasing over time, according to the data we
have access to.

Even though the model which includes the "time effect" seems to perform relatively well based on the
adjusted R-squared, the predictions may become less trustworthy in the long term: the actual number
of total daily bike rentals could reach a bound, whereas "days after start" keeps increasing as more
days pass, resulting in ever greater predicted numbers of daily rentals which could start deviating
(strongly) from the true values. In this case, the model could be updated by using more relevant
data, as was stated before.

² The validity of this claim with respect to the variable "temperature" may be subject to discussion.
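The long-term concern can be illustrated by extrapolating the trend term on its own. The sketch uses the coefficient from table 5 and the report's 365-day year convention; the function name and the chosen horizons are illustrative assumptions.

```python
TREND_PER_DAY = 5.627   # table 5, coefficient of "days after start"


def trend_contribution(years):
    """Rentals per day added by the trend term after the given number of 365-day years."""
    return TREND_PER_DAY * 365 * years


for years in (1, 5, 10):
    print(years, round(trend_contribution(years)))
```

The contribution grows by about 2054 predicted rentals per day for every year that passes and never levels off, which is why refitting the model on newer data becomes important if actual demand saturates.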

Table 6: OLS estimates of the multiple regression model not including the "time effect" (based on
data from 2011 and 2012)

Variable                    Parameter   Value
(intercept)                 β̂0          3945.899
temp                        β̂1          134.383
windspeed                   β̂2          -50.435
hum                         β̂3          -25.822
spring (season)             β̂4          883.629
summer (season)             β̂5          401.452
autumn (season)             β̂6          1371.316
Monday (weekday)            β̂7          157.994
Tuesday (weekday)           β̂8          234.642
Wednesday (weekday)         β̂9          314.201
Thursday (weekday)          β̂10         302.376
Friday (weekday)            β̂11         319.894
Saturday (weekday)          β̂12         395.376
holiday (holiday)           β̂13         -562.405
weathersit2 (weathersit)    β̂14         -259.484
weathersit3 (weathersit)    β̂15         -1971.868
