
Final Report

This document analyzes bicycle trip data from the Ford GoBike system in the San Francisco Bay Area. It explores trends in bike usage by date, day of week, hour of day, and year. It finds that bike usage varies significantly by these time factors. The document also compares usage between customers and subscribers, and between cities in the system. The analysis aims to help bike sharing operators understand usage patterns and improve their business models.


TIME-BASED EXPLORATION OF BICYCLE TRIP DATA

Subhasree Goswami

Atanu Banerjee

Srikanth Shankar

Sarasij Ghosh

Harrisburg University

ANLY 500-90: Analytics: Prin & Appl


Table of Contents

INTRODUCTION
REVIEW OF THE LITERATURE
RESEARCH OBJECTIVE
RESULTS AND DISCUSSION
Trips by calendar date
Total number of trips by day of the week
Total number of trips by calendar date - weekend vs. weekday
Separate plots for weekend and weekday
Total trips by hour of the day
Number of trips by hour, across the year
Usage by city
Customers vs. Subscribers
SUMMARY AND CONCLUSIONS
References
Appendix

INTRODUCTION

This analysis of bike trip data helps us study bike-sharing operators and gives us a better understanding of the factors that affect bike usage, along with the patterns and trends in that usage. The study should help us, as well as bike rental companies, to design and modify business models. In this project we use Ford GoBike data for a time-based exploratory analysis of bicycle trips.

“Ford GoBike is a regional public bicycle sharing system in the San Francisco Bay Area, California. Beginning operation in August 2013 as Bay Area Bike Share, the Ford GoBike system currently has 2,500 bicycles in 260 stations across San Francisco, East Bay and San Jose. On June 28, 2017, the system officially launched as Ford GoBike in a partnership with Ford Motor Company. The system is expected to expand to 7,000 bicycles around 540 stations in San Francisco, Oakland, Berkeley, Emeryville, and San Jose.”

In this analysis, we answer a few key questions about bicycle trips. We want to get a sense of how the frequency of use varies with time: whether bike usage shows an increasing or decreasing pattern over time, and how the time of day and the time of year affect those usage patterns.

REVIEW OF THE LITERATURE

Over the last ten years, the use of public bike-sharing has grown tremendously as governmental and non-profit organizations have recognized it as a means of increasing transportation accessibility and mobility and reducing vehicle miles travelled, with positive impacts on public health (Shaheen et al., 2014). Researchers have been able to model bike-sharing’s growth, identify its impacts, and analyze the travel behavior and demographics of its annual members. Although there is a growing body of academic literature on the topic, few studies have focused on bike-sharing’s largest subset of adopters: casual users. Casual users, those who purchase a membership that lasts 30 days or less, outnumbered annual members 20:1 and provided between 45% and 67% of operational revenue for a given program in 2012 (Shaheen et al., 2014).

We began with linear regression to predict subscriber and customer rental totals. Subscriber and customer totals vary with different features, so we predict each one separately and then add them together to compute the total number of rentals for a day. The linear regression resulted in high MSE values. To obtain a more reliable error estimate, we applied k-fold cross-validation with k = 10 to the linear models for the subscriber and customer totals. Cross-validation reduced the MSE of each subtotal to nearly half its value. New predictors added in recent years, such as gas_value and sf_events, needed to be checked for a significant effect on bike rentals. When we fit those variables in the model, they had low p-values, so we can reject the null hypothesis that they have no effect on rentals. We also attempted forward subset selection using adjusted R² as our metric. Linear regression on the variables with the highest adjusted R² did not decrease the MSE, so we decided to use all predictors in our model.
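The 10-fold cross-validation step described above can be sketched roughly as follows. This is a minimal base R illustration, not the original modelling code: the data frame daily, its columns temp, weekend and rentals, and all of the values are made up for the example.

```r
# Hypothetical daily data: column names and values are invented for illustration.
set.seed(42)
n <- 200
daily <- data.frame(
  temp    = runif(n, 5, 30),
  weekend = rbinom(n, 1, 2 / 7)
)
daily$rentals <- 300 + 12 * daily$temp - 150 * daily$weekend + rnorm(n, sd = 40)

# Assign each row to one of k folds at random (folds are equal-sized here).
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, predict the held-out fold, and record its MSE.
fold_mse <- sapply(seq_len(k), function(i) {
  fit  <- lm(rentals ~ temp + weekend, data = daily[fold != i, ])
  pred <- predict(fit, newdata = daily[fold == i, ])
  mean((daily$rentals[fold == i] - pred)^2)
})

# Cross-validated RMSE: square root of the average held-out MSE.
cv_rmse <- sqrt(mean(fold_mse))
cv_rmse
```

Because every row is held out exactly once, cv_rmse estimates out-of-sample error rather than the in-sample fit of a single model.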

Linear regression, even with cross-validation, did not yield a model accurate enough to predict the number of rentals. In the end, a random forest approach that predicts the subscriber and customer totals separately and sums them gave us our most accurate RMSE on the testing set.

This descriptive analysis is intended to help bike-sharing operators, both in the Bay Area and elsewhere, design and develop their business models around actual usage while gaining a better understanding of casual users and non-users. Non-users are defined as those who approached a station and seemed interested in using it, but decided not to use it. Casual users are defined as individuals who purchase a 24-hour or 3-day pass to the system. For a comparative analysis, researchers used aggregated data from previous surveys of Bay Area Bike Share (BABS) annual members. Researchers found numerous socio-economic and demographic similarities among casual users, non-users, and annual members. The majority have a four-year or post-graduate degree (annual: 87%; casual: 82%; non-users: 79%); an annual household income of $50,000 or more (annual: 89%; casual: 71%; non-users: 66%); and are Caucasian (annual: 75%; casual: 70%; non-users: 71%). Understanding the geographic profile of casual users was also a primary goal of this study. Of casual users surveyed, 27% are from outside the United States; 57% are from the US but not the Bay Area; and 16% are from the Bay Area (n=106). Primary reasons for being in the Bay Area include sightseeing (64%) and work/business (19%).

Casual users were also asked about the pricing structure to gain insight into whether they understood it. Interestingly, researchers found that at least 53% of respondents did not understand the pricing structure, and the vast majority believed they were being charged less than they actually were. General satisfaction with the system was high among casual users: 85% were “satisfied” or “very satisfied” with the system’s ease of use; 82% were “satisfied” or “very satisfied” with the Bay Area Bike Share bicycles; 81% were “satisfied” or “very satisfied” with the pricing; and 46% were “satisfied” or “very satisfied” with the station locations.

Bay Area Bike Share launched in late August 2013, with about 700 bicycles at 70 stations. It is the first system in North America to launch as a regional public bikesharing system, and it features docking stations in San Francisco, Palo Alto, Redwood City, Mountain View, and San Jose. Caltrain, a Bay Area commuter rail line that connects San Francisco with San Jose, serves as the regional link between each set of stations.

As of June 30, 2014, the entire system had accrued 253,309 trips, an average of 1.13 trips per bicycle per day. While this number is relatively low compared to other public bikesharing systems, 90% of the total usage took place in San Francisco, which is home to half of the system’s bikes and stations (Bay Area Bike Share, 2014). Bicycles in San Francisco are used at nearly double the rate of the system as a whole, with an average of 2.16 trips per bike per day. Furthermore, stations in San Francisco account for 85% of all casual user memberships sold within the system.

RESEARCH OBJECTIVE

In this analysis we use R to analyze the open data provided by Ford GoBike for the Bay Area. We organize the analysis into a few key sections:

• trips by calendar date – how do they vary with time; are they increasing or decreasing?
• total number of trips by day of the week – weekday vs. weekend?
• total trips by hour of the day – when are the peak hours, and are they consistent across the year?
• number of trips by hour across the year, and usage by city.
• customers vs. subscribers – who dominates usage?

Investigating these broad areas gives a fuller picture of customer usage and the parameters that affect it. We also compare casual-user and subscriber usage patterns, which is valuable for the company's business model. The data originates from the Ford GoBike site, which exposes several JSON files. We simplified the approach by taking the data directly from Kaggle (see the References section), where it had already been pulled from the company website and stored as relational database tables and .csv files. The analysis covers a two-year period, August 2013 to August 2015.

RESULTS AND DISCUSSION

We used R to load the bike trip data from the available csv files and to carry out the exploratory analysis. The analysis is organized into the sections below.

Trips by calendar date

In this section we focus on the total number of trips by calendar date, which shows how the number of trips varies throughout the year. Our general expectation is that more trips are made in summer than in winter.
Fig.1: Plot showing total number of bicycle trips over calendar date.

The above plot shows the total trips made per day over a two-year period (Aug 2013 – Aug 2015). The fitted line shows that the number of trips made in July 2014 is higher than in January 2014. The number then decreases until January 2015 and rises again toward July. This is consistent with our expectation that more trips are made in summer than in winter. Another interesting feature of the scatter plot is a split in the data, which might be due to a confounding variable. Other factors might also drive the data, such as whether a particular year had more tourists or whether health consciousness increased among customers.
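As a rough illustration of the counting behind this plot, the sketch below tallies trips per calendar date and compares average daily trips in summer and winter. It uses base R only. The timestamps are made up, and the season definitions (June to August as summer, December to February as winter) are assumptions for the example rather than definitions taken from the analysis.

```r
# Hypothetical trip start times; the real values would come from trip.csv.
start_date <- as.POSIXct(
  c("2014-01-06 08:15", "2014-01-06 17:40", "2014-01-07 09:05",
    "2014-07-14 08:30", "2014-07-14 12:10", "2014-07-14 17:20",
    "2014-07-15 08:45", "2014-07-15 18:00"),
  format = "%Y-%m-%d %H:%M", tz = "UTC"
)

# Count trips per calendar date.
trip_date <- as.Date(start_date, tz = "UTC")
datefreq <- as.data.frame(table(date = trip_date), responseName = "n",
                          stringsAsFactors = FALSE)
datefreq$date <- as.Date(datefreq$date)

# Label each date with an assumed season and average the daily counts.
month <- as.integer(format(datefreq$date, "%m"))
season <- ifelse(month %in% c(6, 7, 8), "summer",
                 ifelse(month %in% c(12, 1, 2), "winter", "other"))
season_mean <- tapply(datefreq$n, season, mean)
season_mean
```

Comparing mean trips per day, rather than raw totals, keeps the comparison fair when the seasons contain different numbers of observed days.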

Total number of trips by day of the week

Below is the plot for total number of trips by day of the week.

Fig.2: Plot showing number of bicycle trips over days of week.


Weekday usage of bikes is much higher than weekend usage. A possible explanation is that daily commuters use the bikes far more than weekend leisure riders do. Customers probably drive on weekends to more distant locations for vacation or pleasure. This is worth remembering in future analysis (e.g., as a binary variable). We also want to check whether this pattern explains the split in the data we saw earlier. To see how the data are split, we color the previous calendar plot by weekday versus weekend.
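The binary weekday/weekend variable mentioned above could be built along these lines. This is a base R sketch with made-up daily counts. It uses the "%u" format code (ISO weekday, Monday = 1) so that the result does not depend on the locale's day names.

```r
# Hypothetical daily trip counts for one week (Monday 7 July to Sunday 13 July 2014).
datefreq <- data.frame(
  date = as.Date(c("2014-07-07", "2014-07-08", "2014-07-09", "2014-07-10",
                   "2014-07-11", "2014-07-12", "2014-07-13")),
  n = c(1100, 1200, 1150, 1180, 1050, 400, 350)  # invented values
)

# ISO weekday number: 1 = Monday, ..., 6 = Saturday, 7 = Sunday.
iso_day <- as.integer(format(datefreq$date, "%u"))

# Binary flag stored as a labelled factor, ready for colouring or faceting.
datefreq$weekend <- factor(iso_day >= 6, levels = c(FALSE, TRUE),
                           labels = c("Weekday", "Weekend"))

# Average trips per day for each group.
mean_by_type <- tapply(datefreq$n, datefreq$weekend, mean)
mean_by_type
```

Setting the factor levels explicitly keeps the "Weekday" and "Weekend" labels correctly attached even if a subset of the data happens to contain only one kind of day.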

Total number of trips by calendar date - weekend vs. weekday

Below is the plot for weekday vs weekend.

Fig. 3: Plot showing trips by date with different color coding for weekday and weekend.
This confirms that some of the pattern in the data throughout the year was due to weekday vs. weekend usage. Next, we view the same data plotted separately for weekends and weekdays.

Separate plots for weekend and weekday

With the data split, we can see the trends in weekday and weekend usage over time somewhat better. Weekday trips still seem to have a lot of variance. Part of this variance may be due to holidays. Now that we have seen how the number of trips varies over the year, we examine how it varies over the course of the day.

Total trips by hour of the day

Below is the plot for trips by hour of the day.


Fig.4: Plot showing number of trips over 24 hours in a day.

The peak times correspond to 9:00 AM and 5:00 PM, which suggests that riders use the bikes to commute to and from work and are probably subscribers.
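One way to locate the busiest hours numerically, instead of reading them off the histogram, is sketched below in base R. The timestamps are invented for illustration, so the peak hours they produce say nothing about the real data.

```r
# Hypothetical trip start times, including one at midnight as an edge case.
start_date <- as.POSIXct(
  c("2014-07-14 08:05", "2014-07-14 08:30", "2014-07-14 08:50",
    "2014-07-14 09:10", "2014-07-14 12:15", "2014-07-14 17:05",
    "2014-07-14 17:20", "2014-07-14 17:45", "2014-07-14 00:00"),
  format = "%Y-%m-%d %H:%M", tz = "UTC"
)

# Time of day as a decimal hour on a 24-hour clock (8.5 means 8:30 AM).
lt <- as.POSIXlt(start_date)
daytime <- lt$hour + lt$min / 60

# Count trips in each clock hour 0..23 (tabulate needs 1-based bins).
counts <- tabulate(lt$hour + 1L, nbins = 24)

# Hours with the maximum count, shifted back to the 0..23 scale.
peak_hours <- which(counts == max(counts)) - 1L
peak_hours
```

Taking the hour and minute directly from the parsed date-time avoids converting the value back to text, which can drop the time part for trips that start exactly at midnight.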

Number of trips by hour, across the year

Below is the plot of trips by hour of the day, split by quarter.

Fig.5: Plot showing trips over 24 hours, across the year.

Each plot corresponds to a different quarter of the year: Quarter 1 (January - March), Quarter 2 (April - June), Quarter 3 (July - September), and Quarter 4 (October - December). From these plots it looks like the pattern from the previous analysis holds: the total number of trips peaks around rush hour in every quarter. It is also interesting that usage increases around noon each day, which is typically lunch time. Now let's look at how the city and the type of rider (subscriber vs. customer) may be influencing these trends.
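The quarter grouping used for these panels can be reproduced with a small calculation. This is a base R sketch on made-up timestamps: the quarter is derived from the month number and then cross-tabulated against the hour of day.

```r
# Hypothetical trip start times spread over the year.
start_date <- as.POSIXct(
  c("2014-02-03 08:30", "2014-02-03 17:10", "2014-05-12 08:45",
    "2014-08-18 17:30", "2014-11-10 08:15", "2014-11-10 12:05"),
  format = "%Y-%m-%d %H:%M", tz = "UTC"
)

# Quarter from month: months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4.
month <- as.integer(format(start_date, "%m"))
qtr <- (month - 1) %/% 3 + 1

# Clock hour of each trip.
hr <- as.integer(format(start_date, "%H"))

# Trips per hour within each quarter (rows: quarter, columns: hour).
by_quarter <- table(quarter = qtr, hour = hr)
by_quarter
```

Each row of by_quarter corresponds to one panel of the faceted plot, so the rows can be compared directly to check whether the rush-hour peaks persist across quarters.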

Usage by city

As the plot shows, San Francisco dominates use of the program. We must also consider whether the mix of customers and subscribers influences this pattern.

Customers vs. Subscribers

The plot for customers and subscribers is given below.


Fig.6: Plot showing usage comparison between customers and subscribers.

It looks as if subscribers dominate usage on weekdays, while weekend usage is more balanced. Does this trend hold for individual cities? Perhaps, with all its tourists, San Francisco has more customers relative to subscribers.
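The weekday versus weekend balance can also be expressed as shares rather than stacked bars. The base R sketch below uses a tiny made-up table of trips. The column names mirror those in the appendix code, but the values are invented.

```r
# Hypothetical trips: four on weekdays, two on weekends.
trips <- data.frame(
  weekend = factor(c("Weekday", "Weekday", "Weekday", "Weekday",
                     "Weekend", "Weekend"),
                   levels = c("Weekday", "Weekend")),
  subscription_type = c("Subscriber", "Subscriber", "Subscriber", "Customer",
                        "Subscriber", "Customer")
)

# Trip counts by day type (rows) and rider type (columns).
counts <- table(trips$weekend, trips$subscription_type)

# Row-wise proportions: share of each rider type within weekdays and weekends.
share <- prop.table(counts, margin = 1)
share
```

Row-wise shares make the two day types comparable even though weekdays contribute far more trips in total.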

The trend does indeed hold for San Francisco, but Palo Alto seems to have more balanced usage. These graphs also give a sense of how unbalanced usage is across cities: Redwood City peaks at about 25 trips a day, compared with San Francisco, which peaks near 1,300 trips a day. We have seen how the number of trips fluctuates across the whole year, by weekend vs. weekday, and by hour of the day. We have also seen how trips are distributed across cities and by subscription type.


References

1. Martin, Elliot and Susan Shaheen. 2014. “Evaluating public transit modal shift dynamics in response to bikesharing: a tale of two U.S. cities.” Journal of Transport Geography, Volume 41, 315-324.
2. Etherington, Darrell (June 27, 2017). “Ford GoBike launches in the Bay Area starting tomorrow”. TechCrunch.
3. “Ford GoBike”. Wikipedia, https://en.wikipedia.org/wiki/Ford_GoBike.
4. “World leaders in bike share Motivate and 8D Technologies merge”. Cycling Industry News. 2017-02-10. Retrieved 2018-01-26.
5. “Time-based exploration”. Kaggle, https://www.kaggle.com/parryfg/time-based-data-exploration/notebook
6. Shaheen, Susan, et al. 2014. Public Bikesharing in North America During a Period of Rapid Expansion: Understanding Business Models, Industry Trends and User Impacts. Mineta Transportation Institute, MTI Report 12-29.


Appendix

## Data preparation
## Loading required packages
library(lubridate)
library(ggplot2)
library(dplyr)

## Loading data files
trip <- read.csv("../input/trip.csv")
station <- read.csv("../input/station.csv")

## Prepare the data

## Date format: parse the month/day/year hour:minute strings into date-times
trip$start_date <- mdy_hm(trip$start_date)
trip$end_date <- mdy_hm(trip$end_date)

## Calendar date of each trip (time of day dropped)
trip$date <- as.Date(trip$start_date)

## Merge the city variable into trip
## Keep the trip id in id2 and reuse id as the join key on the station table
trip$id2 <- trip$id
trip$id <- trip$start_station_id
trip <- left_join(trip, station, by = "id")

## List of variables
names(trip)

## Trips by calendar date
datefreq <- count(trip, date)

ggplot(data = datefreq, aes(date, n)) +
  geom_point() +
  geom_smooth() +
  ggtitle("Trips Each Day") +
  ylab("Total Number of Bicycle Trips") +
  xlab("Date")

## Trips by day of the week
dailyfreq <- as.data.frame(table(wday(trip$date, label = TRUE)))

ggplot(data = dailyfreq, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity") +
  ggtitle("Total Number of Trips Per Day") +
  ylab("Total Number of Bicycle Trips") +
  xlab("Day of the Week")

## Weekend flag: TRUE if the date is a Sunday (1) or a Saturday (7)
datefreq <- mutate(datefreq, weekend = wday(date) %in% c(1, 7))

## Labeling the flag
datefreq$weekend <- factor(datefreq$weekend, levels = c(FALSE, TRUE),
                           labels = c("Weekday", "Weekend"))

## Trips by calendar date, colored by weekday vs. weekend
ggplot(data = datefreq, aes(date, n)) +
  geom_point(aes(color = weekend), size = 3, alpha = 0.65) +
  ggtitle("Total Number of Trips Per Day") +
  ylab("Total Number of Bicycle Trips") +
  xlab("Date")

## Separate panels for weekday and weekend
ggplot(data = datefreq, aes(date, n)) +
  geom_point(size = 3, alpha = 0.65) +
  facet_grid(. ~ weekend) +
  geom_smooth(se = FALSE) +
  ylab("Total Number of Bicycle Trips") +
  xlab("Date")

## Time formatting: hour of day as a decimal (e.g. 8.5 = 8:30 AM)
## start_date is already a date-time, so no re-parsing is needed
trip$daytime <- hour(trip$start_date) + minute(trip$start_date) / 60

## Trips by hour of the day, with the 9:00 AM and 5:00 PM peaks marked
ggplot(trip, aes(daytime)) +
  geom_histogram(binwidth = 0.25) +
  geom_vline(xintercept = 9, color = "orange") +
  geom_vline(xintercept = 17, color = "red", alpha = 0.7) +
  annotate("text", x = 9, y = 27000, label = "9:00 AM", color = "orange",
           size = 7) +
  annotate("text", x = 17, y = 27000, label = "5:00 PM", color = "red",
           size = 7) +
  xlab("Time of day on 24 hour clock") +
  ylab("Total number of bicycle trips")

## Quarter of the year (1 to 4) for each trip
trip$quarter <- quarter(trip$date)

## Trips by hour of the day, one panel per quarter
ggplot(trip, aes(daytime)) +
  geom_histogram(binwidth = 0.25) +
  geom_vline(xintercept = 9, color = "orange") +
  geom_vline(xintercept = 17, color = "red", alpha = 0.7) +
  xlab("Time of day on 24 hour clock") +
  ylab("Total number of bicycle trips") +
  facet_wrap(~quarter)

## Weekend variable for data trip: Sunday (1) or Saturday (7)
trip <- mutate(trip, weekend = wday(date) %in% c(1, 7))
trip$weekend <- factor(trip$weekend, levels = c(FALSE, TRUE),
                       labels = c("Weekday", "Weekend"))

## Plotting usage by city
ggplot(data = trip, aes(date)) +
  geom_bar(aes(color = weekend), stat = "count",
           position = "stack") +
  ggtitle("Trips by City Across Time") +
  ylab("Total Number of Bicycle Trips") +
  xlab("Trend Across Time") +
  facet_grid(~city) +
  theme(axis.text.x = element_blank())

## Customers vs. subscribers, split by weekday and weekend
ggplot(data = trip, aes(date)) +
  geom_bar(aes(color = subscription_type), stat = "count",
           position = "stack") +
  ggtitle("Customer Vs. Subscriber on Weekends and Weekdays") +
  ylab("Total Number of Bicycle Trips") +
  xlab("Trend Across Time") +
  facet_grid(~weekend) +
  theme(axis.text.x = element_blank())

## Customers vs. subscribers by city, each panel with its own y scale
ggplot(data = trip, aes(date)) +
  geom_bar(aes(color = subscription_type), stat = "count",
           position = "stack") +
  ggtitle("Subscribers Vs. Customers - Trips Per Day by City") +
  ylab("Total Number of Bicycle Trips") +
  xlab("Trend Across Time") +
  facet_wrap(~city, scales = "free_y") +
  theme(axis.text.x = element_blank())
