Final Report
Subhasree Goswami
Atanu Banerjee
Srikanth Shankar
Sarasij Ghosh
Harrisburg University
Table of Contents
INTRODUCTION
Usage by city
References
Appendix
Time-Based Exploration of Bicycle Trip Data 3
INTRODUCTION
This analysis of bike trip data helps us study bike-sharing operators and gives us a better
understanding of the important factors that might affect bike usage, along with the patterns and
trends observed in that usage. The study can help us, as well as bike rental companies, to design
and modify business models. In this project we use Ford GoBike data for a time-based exploratory
analysis of bicycle trip data.
“Ford GoBike is a regional public bicycle sharing system in the San Francisco Bay Area,
California. Beginning operation in August 2013 as Bay Area Bike Share, the Ford GoBike
system currently has 2,500 bicycles in 260 stations across San Francisco, East Bay and San Jose.
On June 28, 2017, the system officially launched as Ford GoBike in a partnership with Ford
Motor Company. The system is expected to expand to 7,000 bicycles around 540 stations in San
In this data analysis, we answer a few key questions regarding bicycle trips. We want to
get a sense of how the frequency of use varies with time, whether the usage of bike trips has an
increasing or decreasing pattern over time, and how and at what time of the day, along with the time
Over the last ten years, the usage of public bike-sharing has grown tremendously as
transportation accessibility and mobility, reducing vehicle miles travelled while experiencing
positive impacts on public health (Shaheen et al., 2014). Researchers have been able to draw
models with bike-sharing’s growth, identify its impacts, and analyze the travel behavior and
the topic, few studies have focused on bike-sharing’s largest subset of adopters: casual users.
Casual users (those who purchase a membership that lasts 30 days or less) outnumbered annual
members 20:1 and provided between 45% and 67% of operational
subscriber and customer totals vary based on specific features. Hence, we predict each one
separately and then add them together to compute the total number of rentals for that day. Linear
regression resulted in high MSE values. To obtain a more reliable RMSE estimate, we applied
K-fold cross-validation (k = 10) to the linear models for the subscriber and customer totals.
Cross-validation reduced the MSE of each subtotal to nearly half its original value.
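The cross-validation step described above can be sketched roughly as follows. This is a minimal illustration in base R, not the code used for the reported results: the data are synthetic, and the predictors `temp` and `weekend` are hypothetical stand-ins for the real features.

```r
# Sketch only: synthetic daily data stands in for the real rental table.
set.seed(42)
n <- 200
days <- data.frame(
  temp    = runif(n, 5, 30),       # hypothetical predictor
  weekend = rbinom(n, 1, 2 / 7)    # hypothetical predictor
)
days$subscriber <- 600 + 10 * days$temp - 400 * days$weekend + rnorm(n, sd = 50)
days$customer   <- 50 + 3 * days$temp + 60 * days$weekend + rnorm(n, sd = 20)

# k-fold cross-validated RMSE of the summed (subscriber + customer) prediction.
cv_rmse_total <- function(data, k = 10) {
  # Assign each row to one of k folds at random.
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  sq_err <- numeric(0)
  for (i in seq_len(k)) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    # Fit one linear model per rider type on the training folds.
    fit_sub  <- lm(subscriber ~ temp + weekend, data = train)
    fit_cust <- lm(customer ~ temp + weekend, data = train)
    # Predict each subtotal on the held-out fold and add them together.
    pred_total   <- predict(fit_sub, newdata = test) + predict(fit_cust, newdata = test)
    actual_total <- test$subscriber + test$customer
    sq_err <- c(sq_err, (actual_total - pred_total)^2)
  }
  sqrt(mean(sq_err))
}

rmse <- cv_rmse_total(days, k = 10)
```

Because every day is held out exactly once, the resulting RMSE reflects out-of-sample error rather than the fit on the training data.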
New predictors, such as gas_value and sf_events, have been added in recent years, so it is
necessary to verify whether they significantly affect bike rentals. When we fit those variables
in the model, they yielded low p-values. Based on those low p-values we can reject the null
hypothesis that they have no effect. We attempted forward subset selection using adjusted R² as
our metric. We used linear regression with the variables giving the highest adjusted R², but this
did not decrease
Linear regression, even with cross-validation, failed to produce a model accurate enough to
predict the number of rentals. In the end, our random forest approach, which predicts the
Subscriber and Customer totals separately and sums them, gave us our lowest RMSE on the testing set.
This descriptive analysis is intended to help bike-sharing operators design and develop their
business models according to usage requirements, both in the Bay Area and elsewhere, while
gaining a better understanding of casual users and non-users. Non-users are defined as those who
approached a station and seemed interested in using it, but decided not to use it. Casual users
are defined as individuals who purchase a 24-hour or 3-day pass to the system. For a comparative
analysis, researchers used aggregated data from previous surveys of Bay Area Bike Share (BABS)
annual members. Researchers found that there are numerous socio-economic and demographic
similarities among casual users, non-users, and annual members. The majority have a four-year or
post-graduate degree (annual: 87%; casual: 82%; non-users: 79%); an annual household income of
$50,000 or more (annual: 89%; casual: 71%; non-users: 66%); and are Caucasian (annual: 75%;
casual: 70%; non-users: 71%). Understanding the geographic profile of casual users was also a
primary goal of this study. Of casual users surveyed, 27% are from outside of the United States;
57% are from the US but not the Bay Area; and 16% are from the Bay Area (n=106). Primary reasons
for being in the Bay Area include sightseeing (64%) and
Casual users were also asked about the pricing structure to gain insight into whether they
understood it. Interestingly, researchers found that at least 53% of respondents did not
understand the pricing structure, and the vast majority believed they were being charged less
than they actually were. General satisfaction with the system was high among casual users: 85%
were “satisfied” or “very satisfied” with the system’s ease of use; 82% were “satisfied” or “very
satisfied” with the Bay Area Bike Share bicycles; 81% were “satisfied” or “very satisfied” with
the pricing; and 46% were “satisfied” or “very
Bay Area Bike Share launched in late August 2013, with about 700 bicycles at 70 stations.
It is the first system in North America to launch as a regional public bikesharing system, and
it features docking stations in San Francisco, Palo Alto, Redwood City, Mountain View, and San
Jose. Caltrain, a Bay Area commuter rail line that connects San Francisco with San Jose, serves
as the regional link between each set of stations.
As of June 30, 2014, the entire system had accrued 253,309 trips, averaging 1.13 trips
per bicycle per day. While this number is relatively low compared to other public bikesharing
systems, 90% of the total usage took place in San Francisco, which is home to half of the
system’s bikes and stations (Bay Area Bike Share, 2014). Bicycles in San Francisco are used at
nearly double the rate of the system as a whole, with an average of 2.16 trips per bike per day.
Furthermore, stations in San Francisco account for 85% of all casual user memberships sold
RESEARCH OBJECTIVE
In this analysis we use R to analyze the open data provided by Ford GoBike in the Bay
Area. We organize the analysis into a few important sections, such as:
• total trips by hour of the day – peak hours, is it consistent across the year?
By investigating these broad areas, we can build a complete picture of customer usage and
the different parameters affecting it. We also compare casual user and subscriber usage
patterns, which should be valuable for the company's business model. The data originate from the
Ford GoBike site, which exposes them through several json files. We have simplified the approach
by consuming the data directly from a Kaggle site, mentioned in the reference section, where the
data have already been pulled from the company website and stored in the form of relational
database tables and .csv files. The time frame in this
R was used to load the bike trip data from the available csv files as well as to carry out
some exploratory analysis on them. The study has been
In this particular study we focus first on the total number of trips by calendar date,
which shows how the number of trips varies throughout the year. A general expectation is that
more trips are made in summer than in winter.
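As an illustration of how trips can be counted by calendar date, here is a minimal base R sketch on a toy table. The column name `start_date` follows the appendix code; the timestamp layout and the four example rows are assumptions made only for this sketch.

```r
# Toy stand-in for the trip table; the real data has one row per trip.
trip <- data.frame(
  start_date = c("2013-08-29 14:13:00", "2013-08-29 18:02:00",
                 "2013-08-30 09:00:00", "2014-01-15 08:45:00"),
  stringsAsFactors = FALSE
)

# Keep only the calendar date of each trip start.
trip$date <- as.Date(substr(trip$start_date, 1, 10))

# Count trips per calendar date.
daily <- as.data.frame(table(date = trip$date), responseName = "trips")
daily$date <- as.Date(as.character(daily$date))

# Compare average daily trips in summer (June-August) and winter (December-February).
month <- as.integer(format(daily$date, "%m"))
summer_mean <- mean(daily$trips[month %in% 6:8])
winter_mean <- mean(daily$trips[month %in% c(12, 1, 2)])
```

The `daily` table has one row per date, which is the shape needed for a plot of trips over calendar date.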
Fig.1: Plot showing total number of bicycle trips over calendar date.
The above plot shows the total trips made per day over a two-year period (Aug 2013 to Aug
2015). Looking at the fitted line, the number of trips made in July is higher than in January
2014. The number then decreases until January 2015 and rises again in July. This is consistent
with our expectation that more trips are made in summer than in winter. Another interesting
feature of the above scatter plot is a split in the data, which might be due to a confounding
variable. Other factors might also drive the data,
like whether that particular year has more tourists, health consciousness has increased amongst
Below is the plot for total number of trips by day of the week.
Weekday usage of bikes appears to be much higher than weekend usage. A possible explanation
is that daily commuters use the bikes far more than weekend leisure riders do. Customers
probably drive on the weekends to a farther location for vacation or pleasure. This is something
to keep in mind in future analysis (e.g., as a binary variable). We also want to check whether
this pattern explains the split in the data we saw earlier.
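The binary weekday/weekend variable mentioned above could be built along these lines. This is a sketch in base R on made-up daily counts; the name `is_weekend` is hypothetical and not taken from our analysis code.

```r
# Made-up daily totals for four consecutive days (Thursday to Sunday).
daily <- data.frame(
  date  = as.Date(c("2013-08-29", "2013-08-30", "2013-08-31", "2013-09-01")),
  trips = c(900, 950, 300, 250)
)

# "%u" gives the ISO day of week as "1" (Monday) through "7" (Sunday),
# so "6" and "7" mark Saturday and Sunday regardless of locale.
daily$is_weekend <- format(daily$date, "%u") %in% c("6", "7")

# Average number of trips per day for weekdays (FALSE) and weekends (TRUE).
mean_by_type <- tapply(daily$trips, daily$is_weekend, mean)
```

A flag like this can then be mapped to color in the calendar plot or used as a predictor in later modelling.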
To see how the data are split up, we will code the previous calendar plot with
Fig. 3: Plot showing trips by date with different color coding for weekday and weekend.
This confirms that some of the pattern in the data throughout the year was due to
weekday vs. weekend usage. Let us get a view of the same data
With the data split we can see the trends in weekday and weekend usage over time a bit
better. Weekday trips still seem to have a lot of variance. Part of this variance may be due to
holidays. Now that we have seen how the number of trips varies over the year, we examine how it
varies in the
The peaks occur around 9:00 AM and 5:00 PM, which suggests that riders use the bikes for
their daily commute to work and are therefore probably subscribers.
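To make the peak-hour reading concrete, the sketch below counts trips per hour and picks out the busiest hours. It uses base R and nine invented start times; the decimal `daytime` value mirrors the quantity computed in the appendix, but the data are purely illustrative.

```r
# Invented trip start times for a single day (UTC is used only to keep parsing stable).
start <- as.POSIXct(
  c("2014-03-03 08:55:00", "2014-03-03 09:05:00", "2014-03-03 09:10:00",
    "2014-03-03 09:40:00", "2014-03-03 12:15:00", "2014-03-03 17:05:00",
    "2014-03-03 17:20:00", "2014-03-03 17:45:00", "2014-03-03 22:00:00"),
  tz = "UTC"
)

# Hour and minute of each start, and the time of day as decimal hours.
hour_of_day    <- as.integer(format(start, "%H"))
minute_of_hour <- as.integer(format(start, "%M"))
daytime        <- hour_of_day + minute_of_hour / 60

# Trips per hour for all 24 hours (hours without trips count as zero).
counts <- as.vector(table(factor(hour_of_day, levels = 0:23)))
names(counts) <- 0:23

# Hours with the largest number of trips.
peak_hours <- as.integer(names(counts)[counts == max(counts)])
```

On this toy data the busiest hours are 9 and 17, matching the two commute peaks described above.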
Each plot corresponds to a different quarter of the year: Quarter 1 (January - March),
Quarter 2 (April - June), Quarter 3 (July - September), or Quarter 4 (October - December). From
these plots it looks like the pattern from the previous analysis holds: the total number of trips
peaks around rush hour in every quarter. It is also interesting to note that usage increases
around noon each day, which is typically lunch time. Now let's take a look at how the city and
type of bicycle rider (subscriber vs. customer) may be influencing these trends.
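The quarterly split can be derived directly from the month of each trip. The following base R sketch shows one way to do this on six invented rows; the column names `quarter` and `hour` are assumptions for illustration.

```r
# Invented trip start times spread over the year.
trip <- data.frame(
  start_date = c("2014-01-10 08:30:00", "2014-03-31 17:10:00",
                 "2014-04-01 09:00:00", "2014-07-15 12:05:00",
                 "2014-10-01 17:45:00", "2014-12-31 08:15:00"),
  stringsAsFactors = FALSE
)

# Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
month <- as.integer(substr(trip$start_date, 6, 7))
trip$quarter <- (month - 1) %/% 3 + 1

# Hour of day of each trip start.
trip$hour <- as.integer(substr(trip$start_date, 12, 13))

# Trips cross-tabulated by quarter and hour, plus totals per quarter.
by_quarter_hour   <- table(quarter = trip$quarter, hour = trip$hour)
trips_per_quarter <- rowSums(by_quarter_hour)
```

Each row of `by_quarter_hour` corresponds to one panel of the quarterly plots.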
Usage by city
As you can see, San Francisco dominates the use of the program. One thing we
It looks as if subscribers dominate usage on weekdays. Weekend usage is more
balanced. Does the strength of this trend hold for different cities? Perhaps with all those travelers
It looks like the trend does indeed hold for San Francisco. However, Palo Alto seems to have
a more balanced usage. You can also get a sense from these graphs of how unbalanced the usage is
across cities. Redwood City peaks at about 25 trips a day, compared with San Francisco, which
peaks at close to 1,300 trips a day. We've seen how the number of trips fluctuates across the
whole year, how it fluctuates by weekend vs. weekday, and by hour of the day. We've also seen
the balance of the number of trips by
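A comparison of rider types across cities can be summarized with a simple cross-tabulation. The sketch below is base R on ten invented trips; the column names `city` and `subscription_type` are assumed for illustration and the counts bear no relation to the real figures.

```r
# Invented trips: eight in San Francisco, two in Palo Alto.
trip <- data.frame(
  city = c(rep("San Francisco", 8), rep("Palo Alto", 2)),
  subscription_type = c(rep("Subscriber", 6), rep("Customer", 2),
                        "Subscriber", "Customer"),
  stringsAsFactors = FALSE
)

# Number of trips for each city and rider type.
counts <- table(city = trip$city, type = trip$subscription_type)

# Share of each rider type within a city (each row sums to 1).
share <- prop.table(counts, margin = 1)
```

Row-wise shares make cities of very different sizes comparable, which raw trip counts do not.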
References
1. Martin, Elliot and Susan Shaheen. 2014. “Evaluating public transit modal shift dynamics in
response to bikesharing: a tale of two U.S. cities.” Journal of Transport Geography, Volume 41,
315-324
2. Etherington, Darrell (June 27, 2017). "Ford GoBike launches in the Bay Area starting
tomorrow". TechCrunch.
4. "World leaders in bike share Motivate and 8D Technologies merge". Cycling Industry News.
exploration/notebook
6. Shaheen, Susan, et al. 2014. Public Bikesharing in North America During a Period
Appendix
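## Data preparation
## Loading required packages
library(lubridate)  # date-time parsing and helpers
library(ggplot2)    # plotting
library(dplyr)      # data manipulation
## `trip` is assumed to be a data frame already loaded from the trip csv file
## List of variables
names(trip)
## Time formatting
## ymd_hms() assumes start_date looks like "2013-08-29 14:13:00";
## a different layout needs another lubridate parser (e.g., mdy_hm())
t2 <- ymd_hms(trip$start_date)
t3 <- hour(t2) + minute(t2)/60   # time of day as decimal hours
trip$daytime <- t3
trip$quarter <- quarter(t2)      # assumed step: facet_wrap(~quarter) needs this column
rm(t2, t3) # Cleanup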
## Histogram of trips by time of day, with the 9:00 AM and 5:00 PM peaks marked
ggplot(trip, aes(daytime)) +
  geom_histogram(binwidth = 0.25) +
  geom_vline(xintercept = 9, color = 'orange') +
  geom_vline(xintercept = 17, color = 'red', alpha = 0.7) +
  annotate("text", x = 9, y = 27000, label = "9:00 AM", color = "orange",
           size = 7) +
  annotate("text", x = 17, y = 27000, label = "5:00 PM", color = "red",
           size = 7) +
  xlab("Time of day on 24 hour clock") +
  ylab("Total number of bicycle trips")
## Same histogram, one panel per quarter (requires a `quarter` column in trip)
ggplot(trip, aes(daytime)) +
  geom_histogram(binwidth = 0.25) +
  geom_vline(xintercept = 9, color = 'orange') +
  geom_vline(xintercept = 17, color = 'red', alpha = 0.7) +
  xlab("Time of day on 24 hour clock") +
  ylab("Total number of bicycle trips") +
  facet_wrap(~quarter)