Portfolio Project Solution Sheet
Analysis
INTRODUCTION: Here’s what you need to know: Lyft purchased its bike share
program from Ford (who owned GoBike) and needs a data analyst – that’s you! – to
help the marketing team use data-driven approaches in their new marketing
efforts. You’ve been tasked by your manager to investigate the differences
between Lyft users and Ford users. Lyft wants to increase memberships in its
rideshare program and needs to determine how their users, both past and present,
use their product.
HOW IT WORKS: Follow the prompts in the questions below to investigate your
data. Post your answers in the provided boxes: the yellow boxes for the queries you
write, purple boxes for visualizations and blue boxes for text-based answers. When
you're done, export your document as a pdf file and submit it on the Milestone page
– see instructions for creating a PDF at the end of the Milestone.
RESOURCES: If you need hints on the Milestone or are feeling stuck, there are
multiple ways of getting help. Attend Drop-In Hours to work on these problems with
your peers, or reach out to the HelpHub if you have questions. Good luck!
PROMPT: Congratulations are in order! You’ve been hired as an intern by Lyft, one
of the largest ride-sharing transportation providers in the country. In your new role,
you’ll be working on the Lyft Bay Wheels product: their latest initiative that provides
rental bikes all across San Francisco through the Lyft app.
SQL App: Here’s that link to our specialized SQL app, where you’ll write your SQL
queries and interact with the data.
— Data Set Description
To begin, you’ll query a total of 3 datasets. You’ll start with the lyft.baywheels and
ford.gobike datasets available in your schema. Later, you will join the sf.weather
dataset.
The lyft.baywheels dataset reports information about rentals made on the Bay
Wheels bike share system. Each row represents a single rental; we will be making
use of the following fields in this project:
The ford.gobike dataset has information very similar to the lyft.baywheels table, but
reports rides prior to Lyft’s takeover of the bikeshare system. One major distinction
between the two tables is different field names. The field names in the ford.gobike
dataset will be explained through the course of the project tasks.
Before you can start analyzing customer activity, you first need to combine the data
needed from Ford and Lyft. While the datasets are currently captured in your SQL
database in separate data tables, your manager has assured you that they are the
same data, though with different variable names. Below is a table of equivalent
columns between the two datasets, detailing which columns in the lyft.baywheels
data set match which columns in the ford.gobike data table.
lyft.baywheels          ford.gobike
started_date            start_date
started_at              start_time
ended_at                end_time
start_station_name      start_station_name
end_station_name        end_station_name
start_lat               start_station_latitude
start_lng               start_station_longitude
end_lat                 end_station_latitude
end_lng                 end_station_longitude
member_casual           user_type
A. Write a query that filters the ford.gobike data to only include data from the
year 2020. HINT: Use the date_part function in SQL!
SELECT *
FROM ford.gobike
WHERE date_part('year', start_date) = 2020
B. Write a query that unions the ford.gobike dataset and the lyft.baywheels
dataset using the corresponding columns above. Make sure that you are still
filtering to the year 2020 on the Ford data.
Note: You will want the Lyft data to be the first table in your query so that the
column names from the Lyft dataset become the standard ones for the
remainder of your analysis.
SELECT
started_date,
started_at,
ended_at,
start_station_name,
end_station_name,
start_lat,
start_lng,
end_lat,
end_lng,
member_casual
FROM lyft.baywheels
UNION ALL
SELECT
start_date,
start_time,
end_time,
start_station_name,
end_station_name,
start_station_latitude,
start_station_longitude,
end_station_latitude,
end_station_longitude,
user_type
FROM ford.gobike
WHERE date_part('year', start_date) = 2020
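A note on combining ride tables: UNION and UNION ALL behave differently. UNION removes duplicate rows from the combined result, while UNION ALL keeps every row. The sketch below illustrates this with Python's built-in sqlite3 module. The tables lyft_rides and ford_rides and their rows are made up for illustration and are not the project data. SQLite has no date_part function, so strftime stands in for the year filter.

```python
import sqlite3

# Hypothetical miniature tables; NOT the real lyft.baywheels / ford.gobike data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lyft_rides (started_date TEXT, start_station_name TEXT);
    CREATE TABLE ford_rides (start_date TEXT, start_station_name TEXT);
    -- Two identical Lyft rows stand in for two rides that look the same.
    INSERT INTO lyft_rides VALUES ('2020-06-01', 'Market St'), ('2020-06-01', 'Market St');
    -- One Ford ride in 2020 and one in 2019 (the latter is filtered out).
    INSERT INTO ford_rides VALUES ('2020-01-15', 'Market St'), ('2019-12-31', 'Market St');
""")

def combined_row_count(set_operator):
    """Count rows after combining both tables with UNION or UNION ALL."""
    query = f"""
        SELECT COUNT(*) FROM (
            SELECT started_date, start_station_name FROM lyft_rides
            {set_operator}
            SELECT start_date, start_station_name FROM ford_rides
            WHERE strftime('%Y', start_date) = '2020'
        )
    """
    return conn.execute(query).fetchone()[0]

union_count = combined_row_count("UNION")          # duplicates collapsed
union_all_count = combined_row_count("UNION ALL")  # every row kept
print(union_count, union_all_count)
```

If two rides could ever share the same values in every selected column, UNION ALL is the safer choice for counting rentals.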
After showing the result of the query to your manager, she tells you that she wants
to know which data source is attributed to each row. She asks you to create a new
column called data_source that has the value ‘Lyft’ if the data came from the Lyft
dataset and the value ‘Ford’ if it came from the Ford dataset.
A colleague teaches you a simple method to do this. When writing your query, add
an additional column after your select statement. Here is an example of this for the
Lyft table:
SELECT
*,
'Lyft' AS data_source
FROM lyft.baywheels
Modify your query from part B to include the data_source column.
SELECT
started_date,
started_at,
ended_at,
start_station_name,
end_station_name,
start_lat,
start_lng,
end_lat,
end_lng,
member_casual,
'Lyft' AS data_source
FROM lyft.baywheels
UNION ALL
SELECT
start_date,
start_time,
end_time,
start_station_name,
end_station_name,
start_station_latitude,
start_station_longitude,
end_station_latitude,
end_station_longitude,
user_type,
'Ford' AS data_source
FROM ford.gobike
WHERE date_part('year', start_date) = 2020
Great! Since you and other members on your team will be referencing the output of
your query for deeper analysis, your manager asked the Engineering team to store it
specially in your schema. For the remainder of this project, you’ll query
project.ford_lyft_analysis.
SELECT
*,
CASE
WHEN member_casual = 'Subscriber'
THEN 'member'
WHEN member_casual = 'Customer'
THEN 'casual'
-- Assumption: Lyft rows already hold 'member' or 'casual'; keep them as-is
ELSE member_casual
END AS member_type
FROM project.ford_lyft_analysis
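One thing to watch with this kind of relabeling: a CASE expression returns NULL for any value that matches none of its WHEN conditions, unless an ELSE branch is present. Here is a minimal sketch using Python's sqlite3 module. It assumes Ford rows carry 'Subscriber' or 'Customer' and Lyft rows already carry 'member' or 'casual'. The rides table is hypothetical.

```python
import sqlite3

# Hypothetical table with one row per label; not the project data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rides (member_casual TEXT);
    INSERT INTO rides VALUES ('Subscriber'), ('Customer'), ('member'), ('casual');
""")

rows = conn.execute("""
    SELECT
        member_casual,
        -- No ELSE: unmatched labels become NULL
        CASE
            WHEN member_casual = 'Subscriber' THEN 'member'
            WHEN member_casual = 'Customer' THEN 'casual'
        END AS without_else,
        -- With ELSE: unmatched labels pass through unchanged
        CASE
            WHEN member_casual = 'Subscriber' THEN 'member'
            WHEN member_casual = 'Customer' THEN 'casual'
            ELSE member_casual
        END AS with_else
    FROM rides
    ORDER BY rowid
""").fetchall()

for row in rows:
    print(row)
```

A quick count of NULL values in member_type is an easy way to confirm that every row received a label.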
B. Almost there! After going over the table with your manager, she hypothesises
that patterns are driven by changes in weather and wants you to incorporate
weather data into your analysis.
You both decide San Francisco's average daily temperature and amount of
precipitation are the best metrics to base your weather analysis on. These are
located in the temperature_avg and precipitation columns, respectively, of
the sf.weather table.
Modify your query from part A to join the table with the sf.weather data on the
started_date field. From the sf.weather table, return the average daily
temperature and the amount of precipitation.
SELECT
analysis.*,
CASE
WHEN member_casual = 'Subscriber'
THEN 'member'
WHEN member_casual = 'Customer'
THEN 'casual'
-- Assumption: Lyft rows already hold 'member' or 'casual'; keep them as-is
ELSE member_casual
END AS member_type,
weather.temperature_avg,
weather.precipitation
FROM project.ford_lyft_analysis AS analysis
INNER JOIN sf.weather AS weather
ON analysis.started_date = weather.date
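Because this is an INNER JOIN, any ride whose started_date has no matching row in the weather table is dropped from the result. A LEFT JOIN that counts unmatched rides is one way to check for that. The sketch below uses Python's sqlite3 module with made-up tables named rides and weather. Both tables and their values are assumptions for illustration only.

```python
import sqlite3

# Hypothetical data: three rides, but weather recorded for only one of the two dates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rides (started_date TEXT);
    CREATE TABLE weather (date TEXT, temperature_avg REAL, precipitation REAL);
    INSERT INTO rides VALUES ('2020-01-01'), ('2020-01-01'), ('2020-01-02');
    INSERT INTO weather VALUES ('2020-01-01', 52.0, 0.0);
""")

# Rides that survive the inner join (those with a weather row for their date).
inner_count = conn.execute("""
    SELECT COUNT(*)
    FROM rides
    INNER JOIN weather ON rides.started_date = weather.date
""").fetchone()[0]

# Rides with no weather row: the LEFT JOIN keeps them, with NULL weather columns.
unmatched_count = conn.execute("""
    SELECT COUNT(*)
    FROM rides
    LEFT JOIN weather ON rides.started_date = weather.date
    WHERE weather.date IS NULL
""").fetchone()[0]

print(inner_count, unmatched_count)
```

If the unmatched count is zero on the real tables, the inner join loses nothing.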
That’s it! Now this query will result in almost 2 million records for the year
2020! Since SQLPad will only let you download 150,000 records in a .csv, the
engineering team used some extra tools they have to download the result of
your query. It’s loaded for you in a Tableau Workbook, where you’ll complete
the rest of your project.
Once you’ve published your Tableau Workbook, paste the Share Link in the box
below.
https://prod-useast-b.online.tableau.com/#/site/globaltech/workbooks/746907?:origin=card_share_link
Continue to post your answers in the provided boxes: purple boxes for your
visualizations, and blue boxes for text-based answers.
Using your visualization, when did operations transfer over from Ford to Lyft?
Are there any major differences in the volume of rentals before and after the
transfer?
The visualization shows that operations most likely transferred from
Ford to Lyft around Week 14, or around March 14th, 2020. Before
the switch, Ford’s daily usage was much higher, almost double the
daily usage that Lyft saw afterward. However, Lyft’s daily usage
was noticeably more consistent than Ford’s. The visualization also
suggests that daily usage had already plummeted before the
switch: Week 12 shows a major decline in daily usage while Ford
still owned the company.
B. Next, on Sheet 2, create a bar chart to depict the total number of rides during
each hour of the day. No need to include this visualization in this report just
yet! During which hours of the day are customers most likely to rent a bike?
C. Let’s break the hourly usage patterns down by data source. Using the Data
Source field, modify your visualization from part B to create two
side-by-side bar charts: one to illustrate the total number of rides during each
hour of the day for Ford GoBike data, and the other for Lyft Bay Wheels.
Regarding popular hours of the day, what differences do you notice between
Lyft users and Ford users?
Looking at the hourly usage patterns by data source, Ford shows
significantly more rides in the morning hours, plateaus at
midday, and picks back up in the evening. Lyft riders’ usage, by
contrast, increases gradually from 6am to a peak at 5pm and
then declines for the remainder of the day.
How does the temperature affect ridership? Which riders are more willing to
use a bike on cold days, and which riders are more likely to ride on warmer
days?
From the visualization, it appears that in the Ford data, members
are more likely than casual riders to ride regardless of
temperature, and are most likely to ride when the temperature is
around 55 degrees; casual riders follow a very similar trend. In
the Lyft data, casual riders are more likely to ride regardless of
temperature, and both casual riders and members are most likely
to ride when it is around 60-65 degrees. This is interesting
because there is a 5 to 10 degree difference between Ford and
Lyft riders’ “preferred” temperatures. It also seems that there are
more Ford members than Lyft members.
That’s it! Submit your final project for evaluation, and go celebrate your
achievement! You just completed a rich, complex data analysis project
representing real-world level work. You’ve gained some impressive skills! Well
done, and never stop learning 😀
— LevelUp
The dataset in your Tableau workbook is rich – there’s much more that can be done
with the data! Below you’ll find three additional LevelUp tasks. Have fun exploring
them!
A. Your manager tells you that Lyft is interested in determining the distance
riders travel between start and end points. Take a look in your Tableau
workbook. You’ll find a variable called RIDE DISTANCE that is the distance
between the start and end points on a map.
Note: this is not the same as the total distance traveled on the bike. For
instance, if a ride began and ended at the same location, the distance would
show up as a zero in the data regardless of how long the bike was rented for.
Instead, it lets Lyft know the typical distance riders travel when they start and
end their rides at different points. The formula used is the Haversine distance.
It calculates the distance between two GPS coordinates, taking Earth’s
curvature into consideration.
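For reference, here is a minimal sketch of the Haversine distance in Python. The exact formula and Earth radius used by the RIDE DISTANCE calculated field are not shown in this document, so the function name, the radius constant, and the choice of miles are assumptions.

```python
import math

# Mean Earth radius in miles; an assumed constant, not taken from the workbook.
EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between two (latitude, longitude) points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lng2 - lng1)

    # Haversine of the central angle between the two points.
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)

    # Convert the central angle to a distance along the sphere.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# A ride that starts and ends at the same station has zero distance.
same_station = haversine_miles(37.7749, -122.4194, 37.7749, -122.4194)

# One degree of latitude along a meridian, as a sanity check.
one_degree_north = haversine_miles(37.0, -122.0, 38.0, -122.0)
print(same_station, round(one_degree_north, 2))
```

Note that the function takes the same four inputs as the start_lat, start_lng, end_lat, and end_lng columns.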
On Sheet 5, use this new calculated field to plot a histogram of the distance
riders traveled. To make your visualization more useful, filter to values that are
less than 7 miles and use a bin size of 0.1.
Analyze the histogram: how far do the majority of the rides typically go?
The most common value is zero: riders most often end their ride
at the same place they began it, so the data shows no distance
traveled. The second most common distance is 0.8 miles.
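Tableau does the filtering and binning for you, but it can help to see what "filter below 7 miles, bin size 0.1" means in code. The sketch below is only an illustration: the function name bin_distances and the sample distances are made up and are not values from the workbook.

```python
from collections import Counter

def bin_distances(distances_miles, bin_size=0.1, max_miles=7.0):
    """Count rides per bin index, keeping only distances below max_miles.

    Bin index k covers distances from k * bin_size up to (k + 1) * bin_size.
    Values sitting exactly on a bin edge can land in a neighbouring bin
    because of floating-point rounding.
    """
    counts = Counter()
    for distance in distances_miles:
        if 0 <= distance < max_miles:
            counts[int(distance // bin_size)] += 1
    return counts

# Hypothetical ride distances in miles; 7.5 is removed by the filter.
sample = [0.0, 0.0, 0.0, 0.83, 0.85, 1.27, 7.5]
counts = bin_distances(sample)

# The two fullest bins, as (bin index, number of rides) pairs.
top_two = counts.most_common(2)
print(top_two)
```

Multiplying a bin index by the bin size gives the lower edge of that bin, so index 8 corresponds to rides between 0.8 and 0.9 miles.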
B. While you were assigned the analysis against temperature, one of your
colleagues looked at the other weather feature you joined into the data:
precipitation. She has interpreted the data to say that there are no major
differences between Member Types in terms of ridership due to the weather.
She’s asked that you verify her work. Can you create a plot to illustrate how
precipitation affects ridership? Compare between Ford and Lyft users and
again between member and casual riders.
From my visualization, there is a slight difference between casual
riders and members in their willingness to ride, but only when
there is no precipitation. Ford members appear most willing to
ride when there is no precipitation, but the pattern for casual
riders and members is quite similar once there is any
precipitation. Lyft casual riders, on the other hand, are most
willing to ride when there is no precipitation, but again both
members and casual riders follow very similar trends when rain is
involved. The line graphs for both data sources show that all
riders are far less willing to ride in the rain, even with the
slightest amount of precipitation.
C. One of your colleagues has looked at the rentals by temperature plot you
created and the rentals by precipitation plot your colleague created. With
the colder season approaching in San Francisco, they’re afraid of a dropoff in
the number of casual riders on the system and want to suggest additional
marketing efforts to increase casual rider engagement over the next few
months.
How much do you agree with, or disagree with your colleague’s assessment?
Are there aspects of the data that they haven’t considered in their analysis
that can be addressed with other plots you created? Is there information
outside of the available data that would be useful to make a better judgment
of where to put the marketing focus for the next winter season?
— Submission
Great work completing your Final Project!!!! To submit your completed project file,
you will need to download / export this document as a PDF and then upload it to the
Milestone submission page. You can find the option to download as a PDF from the
File menu in the upper-left corner of the Google Doc interface.