BigQuery Lab
do Analysis
Overview
In this lab you analyze two public datasets, querying them first separately and
then in combination, to derive insights.
Prerequisites
This is a fundamental-level lab that assumes some experience with BigQuery and
SQL.
Introduction
This lab uses two public datasets in BigQuery: weather data from the US National
Oceanic and Atmospheric Administration (NOAA), and bicycle rental data from New
York City.
You will encounter, for the first time, several aspects of Google Cloud Platform that
are of great benefit to scientists:
2. Click Done.
SELECT
  MIN(start_station_name) AS start_station_name,
  MIN(end_station_name) AS end_station_name,
  APPROX_QUANTILES(tripduration, 10)[OFFSET (5)] AS typical_duration,
  COUNT(tripduration) AS num_trips
FROM
  `bigquery-public-data.new_york_citibike.citibike_trips`
WHERE
  start_station_id != end_station_id
GROUP BY
  start_station_id,
  end_station_id
ORDER BY
  num_trips DESC
LIMIT
7. Click Run. Look at the result and try to determine what this query does.
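One part of the query that may not be obvious is APPROX_QUANTILES(tripduration, 10)[OFFSET (5)]. It splits the trip durations into 10 buckets, which yields 11 boundaries (minimum, nine cut points, maximum), and picks the middle boundary, that is, an approximate median. The following is a minimal Python sketch of that idea. It computes exact rather than approximate boundaries, the helper name quantile_boundaries is our own, and the durations are made up for illustration.

```python
def quantile_boundaries(values, num_buckets):
    """Return num_buckets + 1 boundaries: the minimum, the interior
    cut points, and the maximum (exact, unlike BigQuery's approximation)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / num_buckets)] for i in range(num_buckets + 1)]


# Hypothetical trip durations in seconds for one station pair.
durations = [300, 420, 480, 540, 600, 660, 720, 900, 1200, 1500, 3600]

boundaries = quantile_boundaries(durations, 10)
typical_duration = boundaries[5]  # same position as [OFFSET (5)] in the query
print(typical_duration)
```

Using the median instead of the mean keeps a single very long rental (3600 seconds in the sample) from distorting the "typical" duration.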
8. Next, run the query below to find another interesting fact: the total distance
traveled by each bicycle in the dataset. Note that the query limits the results to
the top 5.
WITH
  trip_distance AS (
  SELECT
    bikeid,
    ST_Distance(ST_GeogPoint(s.longitude, s.latitude),
                ST_GeogPoint(e.longitude, e.latitude)) AS distance
  FROM
    `bigquery-public-data.new_york_citibike.citibike_trips`,
    `bigquery-public-data.new_york_citibike.citibike_stations` AS s,
    `bigquery-public-data.new_york_citibike.citibike_stations` AS e
  WHERE
    start_station_name = s.name
    AND end_station_name = e.name)
SELECT
  bikeid,
  SUM(distance)/1000 AS total_distance
FROM
  trip_distance
GROUP BY
  bikeid
ORDER BY
  total_distance DESC
LIMIT
  5
Note: This query also uses the other table in the dataset, citibike_stations,
to get the location of each bicycle station.
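To make the distance calculation concrete: ST_GeogPoint takes longitude first and latitude second, ST_Distance returns meters, and the query divides by 1000 to report kilometers. The distance is the straight line between the start and end stations, so it is a lower bound on what each bicycle actually rode, and a round trip to the same station counts as zero. The Python sketch below approximates this with the haversine formula. The Earth radius, the function name station_distance_m, and all trip rows are assumptions for illustration, so the numbers will not exactly match BigQuery's.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in meters


def station_distance_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters (haversine formula).
    Argument order is longitude, latitude, as in ST_GeogPoint."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


# Hypothetical trips: (bikeid, start_lon, start_lat, end_lon, end_lat).
trips = [
    (101, -73.99, 40.73, -73.97, 40.76),
    (101, -73.99, 40.73, -73.97, 40.76),
    (202, -73.99, 40.73, -73.97, 40.76),
    (303, -73.99, 40.73, -73.99, 40.73),  # round trip: contributes 0
]

# SUM(distance)/1000 ... GROUP BY bikeid
totals_km = {}
for bikeid, lon1, lat1, lon2, lat2 in trips:
    km = station_distance_m(lon1, lat1, lon2, lat2) / 1000
    totals_km[bikeid] = totals_km.get(bikeid, 0.0) + km

# ORDER BY total_distance DESC LIMIT 5
top_bikes = sorted(totals_km.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top_bikes)
```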
2. Then click on the Preview tab. Your console should resemble the following:
3. Click the blue + button to compose a new query and enter the following:
SELECT
  wx.date,
  wx.value/10.0 AS prcp
FROM
  `bigquery-public-data.ghcn_d.ghcnd_2015` AS wx
WHERE
  id = 'USW00094728'
  AND qflag IS NULL
  AND element = 'PRCP'
ORDER BY
  wx.date
4. Click Run.
This query returns the rainfall (in mm) for every day in 2015 at the New York
weather station whose ID appears in the query (USW00094728, which corresponds to
NEW YORK CNTRL PK TWR). GHCN-D stores precipitation in tenths of a millimeter,
hence the division by 10.
1. Click the blue + button to compose a new query and enter the following:
WITH bicycle_rentals AS (
  SELECT
    COUNT(starttime) AS num_trips,
    EXTRACT(DATE FROM starttime) AS trip_date
  FROM `bigquery-public-data.new_york_citibike.citibike_trips`
  GROUP BY trip_date
),
rainy_days AS (
  SELECT
    date,
    (MAX(prcp) > 5) AS rainy
  FROM (
    SELECT
      wx.date AS date,
      IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp
    FROM
      `bigquery-public-data.ghcn_d.ghcnd_2015` AS wx
    WHERE
      wx.id = 'USW00094728'
  )
  GROUP BY
    date
)
SELECT
  ROUND(AVG(bk.num_trips)) AS num_trips,
  wx.rainy
FROM bicycle_rentals AS bk
JOIN rainy_days AS wx
ON wx.date = bk.trip_date
GROUP BY wx.rainy
2. Click Run.
Now you can see the results of joining the bicycle rental dataset with a weather
dataset that comes from a completely different source. The query shows that, yes,
New Yorkers take 47% fewer bicycle trips on rainy days, where a rainy day is one
with more than 5 mm of precipitation.
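The combined query has three stages: count trips per day, flag each day as rainy when its maximum precipitation exceeds 5 mm, then join the two on date and average the daily trip counts per flag. The Python sketch below walks through the same stages on a handful of made-up rows. All data values are hypothetical and serve only to show how a "percent fewer trips" figure is derived; they are not the lab's results.

```python
from collections import Counter, defaultdict
from datetime import date

RAINY_THRESHOLD_MM = 5  # same cutoff as (MAX(prcp) > 5) in the query

# Hypothetical trip start dates, one entry per trip.
trip_start_dates = (
    [date(2015, 6, 1)] * 4 + [date(2015, 6, 2)] * 2
    + [date(2015, 6, 3)] * 6 + [date(2015, 6, 4)] * 2
)

# Hypothetical weather rows: (date, element, value); PRCP is in tenths of mm.
weather_rows = [
    (date(2015, 6, 1), "PRCP", 0),
    (date(2015, 6, 1), "TMAX", 250),  # ignored: not precipitation
    (date(2015, 6, 2), "PRCP", 120),
    (date(2015, 6, 3), "PRCP", 30),
    (date(2015, 6, 4), "PRCP", 80),
]

# Stage 1 (bicycle_rentals): number of trips per day.
trips_per_day = Counter(trip_start_dates)

# Stage 2 (rainy_days): daily maximum precipitation in mm, then the flag.
# Every sample date has a PRCP row; in SQL a date without one would get NULL.
max_prcp_mm = {}
for day, element, value in weather_rows:
    if element == "PRCP":
        mm = value / 10
        max_prcp_mm[day] = max(mm, max_prcp_mm.get(day, mm))
rainy = {day: mm > RAINY_THRESHOLD_MM for day, mm in max_prcp_mm.items()}

# Stage 3: inner join on date, then average the daily counts per flag.
counts_by_flag = defaultdict(list)
for day, num_trips in trips_per_day.items():
    if day in rainy:
        counts_by_flag[rainy[day]].append(num_trips)
avg_trips = {flag: sum(c) / len(c) for flag, c in counts_by_flag.items()}

percent_fewer = 100 * (avg_trips[False] - avg_trips[True]) / avg_trips[False]
print(avg_trips, percent_fewer)
```

Note that the average is taken over days, not over trips: each day contributes one count, regardless of how busy it was.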
Loading Taxi Data into Google Cloud SQL 2.5
Overview
BigQuery is Google's fully managed, NoOps, low-cost analytics database. With
BigQuery you can query terabytes of data without managing any infrastructure or
needing a database administrator. BigQuery uses SQL and offers a pay-as-you-go
pricing model, which lets you focus on analyzing data to find meaningful insights.
In this lab you will ingest subsets of the NYC taxi trips data into BigQuery
tables.
2. Click Done.
2. Next, enter nyctaxi as the Dataset ID, leave all other options at their
default values, and then click Create dataset.
You'll now see the nyctaxi dataset under your project name.
1. Download a subset of the NYC taxi 2018 trips data to your computer from this
link.
2. In the BigQuery console, select the nyctaxi dataset, then click Create Table
Source:
● Check Auto Detect (tip: Not seeing the checkbox? Ensure the file format is
CSV and not Avro)
Advanced Options
1. In the Query Editor, write a query to list the top 5 most expensive trips of the
year:
#standardSQL
SELECT
  *
FROM
  nyctaxi.2018trips
ORDER BY
  fare_amount DESC
LIMIT 5
What was the highest fare amount in the year?
● 339
● 300 (correct)
● 250
Click Check my progress to verify the objective.
3. Back in the BigQuery console, select the 2018trips table and view its details.
Confirm that the row count has now almost doubled.
4. You may want to run the same query as earlier to see whether the top 5 most
expensive trips have changed.
#standardSQL
CREATE TABLE
  nyctaxi.january_trips AS
SELECT
  *
FROM
  nyctaxi.2018trips
WHERE
  EXTRACT(MONTH FROM pickup_datetime) = 1;
2. Now run the query below in the Query Editor to find the longest distance
traveled in the month of January:
#standardSQL
SELECT
  *
FROM
  nyctaxi.january_trips
ORDER BY
  trip_distance DESC
LIMIT
  1
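The three taxi queries in this lab follow a simple pattern: sort by fare and keep the top 5, filter rows whose pickup month is 1, then take the row with the greatest distance. As a rough illustration, the same steps look like this in Python. Only the column names pickup_datetime, trip_distance, and fare_amount come from the queries above; the rows themselves are invented.

```python
from datetime import datetime

# Hypothetical rows; the real table is loaded from the lab's CSV files.
trips = [
    {"pickup_datetime": datetime(2018, 1, 3, 8, 15), "trip_distance": 2.1, "fare_amount": 9.5},
    {"pickup_datetime": datetime(2018, 1, 20, 23, 40), "trip_distance": 18.7, "fare_amount": 52.0},
    {"pickup_datetime": datetime(2018, 2, 11, 14, 5), "trip_distance": 30.2, "fare_amount": 95.0},
    {"pickup_datetime": datetime(2018, 7, 4, 19, 30), "trip_distance": 5.4, "fare_amount": 21.0},
]

# ORDER BY fare_amount DESC LIMIT 5
most_expensive = sorted(trips, key=lambda t: t["fare_amount"], reverse=True)[:5]

# WHERE EXTRACT(MONTH FROM pickup_datetime) = 1
# (matches January of any year present in the data)
january_trips = [t for t in trips if t["pickup_datetime"].month == 1]

# ORDER BY trip_distance DESC LIMIT 1
longest_january_trip = max(january_trips, key=lambda t: t["trip_distance"])
print(longest_january_trip["trip_distance"])
```

Materializing january_trips as its own table, as the lab does, means later queries scan only the January rows instead of the full year.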
Click Check my progress to verify the objective.