
BigQuery Lab

This document outlines a lab focused on using BigQuery for data analysis, specifically analyzing public datasets related to bicycle rentals and weather data. Participants will learn to run interactive SQL queries, combine datasets, and explore insights such as the impact of rain on bike rentals. Additionally, the document covers loading NYC taxi trip data into BigQuery and performing various operations, including creating tables and ingesting data from different sources.

Uploaded by

Muneeba Kaleem

Using BigQuery to do Analysis

Overview
In this lab, you analyze two public datasets, querying them first separately
and then together, to derive interesting insights.

What you'll learn

In this lab, you will:

● Carry out interactive queries on the BigQuery console.


● Combine and run analytics on multiple datasets.

Prerequisites
This is a fundamental-level lab that assumes some experience with BigQuery and
SQL.

Introduction
This lab uses two public datasets in BigQuery: weather data from the US National
Oceanic and Atmospheric Administration (NOAA), and bicycle rental data from New
York City.

You will encounter, for the first time, several aspects of Google Cloud Platform that
are of great benefit to scientists:

1. Serverless. There is no need to download data to your machine in order to work
with it; the dataset remains in the cloud.
2. Ease of use. Run ad-hoc SQL queries on your dataset without having to
prepare the data beforehand, for example by building indexes. This is
invaluable for data exploration.
3. Scale. Carry out data exploration on extremely large datasets interactively.
You don't need to sample the data in order to work with it in a timely manner.
4. Shareability. You will be able to run queries on data from different datasets
without any issues. BigQuery is a convenient way to share datasets. Of
course, you can also keep your data private, or share it only with specific
people; not all data needs to be public.

By the end of the lab, you will have found out whether there are fewer bike
rentals on rainy days.

Open BigQuery Console

1. In the Google Cloud Console, select Navigation menu > BigQuery.


The Welcome to BigQuery in the Cloud Console message box opens. This
message box provides a link to the quickstart guide and lists UI updates.

2. Click Done.

Task 1. Explore bicycle rental data


1. In the left pane, click + Add, then click Star a project by name. In the
pop-up window, type bigquery-public-data, and then click Star.
2. In the BigQuery console, you now see two projects in the left pane: one named
after your Qwiklabs project ID, and one named bigquery-public-data.

3. In the left pane of the BigQuery console, select
bigquery-public-data > new_york_citibike > citibike_trips table.

4. In the Table (citibike_trips) window, click the Schema tab.

5. Examine the column names and the datatypes.

6. Click the blue + button to compose a new query.

Enter the following query:

SELECT
  MIN(start_station_name) AS start_station_name,
  MIN(end_station_name) AS end_station_name,
  APPROX_QUANTILES(tripduration, 10)[OFFSET(5)] AS typical_duration,
  COUNT(tripduration) AS num_trips
FROM
  `bigquery-public-data.new_york_citibike.citibike_trips`
WHERE
  start_station_id != end_station_id
GROUP BY
  start_station_id,
  end_station_id
ORDER BY
  num_trips DESC
LIMIT
  10

7. Click Run. Look at the result and try to work out what this query does.

(Hint: it returns the typical duration of the 10 most common one-way rentals.)
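
To make the hint concrete, here is a minimal Python sketch of the same logic on a handful of made-up trips. It is not how BigQuery executes the query: quantile_boundaries is a hypothetical helper that computes exact nearest-rank boundaries, whereas APPROX_QUANTILES returns approximate ones. What it illustrates is that APPROX_QUANTILES(x, 10) yields 11 boundaries, so OFFSET(5) picks the middle one, which is roughly the median.

```python
from collections import defaultdict


def quantile_boundaries(values, n):
    """Exact stand-in for APPROX_QUANTILES(values, n): returns n + 1 boundaries
    (minimum, n - 1 interior cut points, maximum) using nearest-rank lookup."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / n)] for i in range(n + 1)]


def most_common_one_way_rentals(trips, limit=10):
    """trips: iterable of (start_station_id, end_station_id, tripduration)."""
    durations = defaultdict(list)
    for start_id, end_id, duration in trips:
        if start_id != end_id:  # WHERE start_station_id != end_station_id
            durations[(start_id, end_id)].append(duration)  # GROUP BY station pair

    rows = [
        {
            "route": route,
            "typical_duration": quantile_boundaries(d, 10)[5],  # [OFFSET(5)]
            "num_trips": len(d),  # COUNT(tripduration)
        }
        for route, d in durations.items()
    ]
    rows.sort(key=lambda r: r["num_trips"], reverse=True)  # ORDER BY num_trips DESC
    return rows[:limit]  # LIMIT


# Made-up sample data, for illustration only.
sample_trips = [
    (1, 2, 300), (1, 2, 420), (1, 2, 360),
    (2, 3, 900), (2, 3, 600),
    (4, 4, 1200),  # round trip, filtered out by the WHERE condition
]
result = most_common_one_way_rentals(sample_trips)
```

In the sample, the round trip from station 4 back to station 4 is dropped, which is why the hint speaks of one-way rentals.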

8. Next, run the query below to find another interesting fact: the total distance
traveled by each bicycle in the dataset. Note that the query limits the results to
the top 5.

WITH
  trip_distance AS (
    SELECT
      bikeid,
      ST_Distance(ST_GeogPoint(s.longitude, s.latitude),
                  ST_GeogPoint(e.longitude, e.latitude)) AS distance
    FROM
      `bigquery-public-data.new_york_citibike.citibike_trips`,
      `bigquery-public-data.new_york_citibike.citibike_stations` AS s,
      `bigquery-public-data.new_york_citibike.citibike_stations` AS e
    WHERE
      start_station_name = s.name
      AND end_station_name = e.name)
SELECT
  bikeid,
  SUM(distance)/1000 AS total_distance
FROM
  trip_distance
GROUP BY
  bikeid
ORDER BY
  total_distance DESC
LIMIT
  5
Note: For this query, we also used the other table in the dataset
called citibike_stations to get bicycle station information.
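
As a rough illustration of what this query measures, the Python sketch below computes the straight-line (great-circle) distance between two stations and totals it per bike. It assumes a spherical Earth with a mean radius of 6,371,008.8 m and uses the haversine formula as a stand-in for ST_Distance; it is not BigQuery's implementation, and the station names and coordinates are made up. Note that, like ST_GeogPoint, the helper takes longitude first and latitude second.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius in metres


def distance_m(lon1, lat1, lon2, lat2):
    """Haversine distance in metres; longitude comes first, as in ST_GeogPoint."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def total_distance_km(trips, stations, limit=5):
    """trips: (bikeid, start_station_name, end_station_name) tuples."""
    totals = defaultdict(float)
    for bikeid, start_name, end_name in trips:
        # Inner join on station name: trips with an unknown station are dropped.
        if start_name in stations and end_name in stations:
            (lon1, lat1), (lon2, lat2) = stations[start_name], stations[end_name]
            totals[bikeid] += distance_m(lon1, lat1, lon2, lat2) / 1000  # metres to km
    # ORDER BY total_distance DESC LIMIT 5
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Hypothetical stations as (longitude, latitude), for illustration only.
stations = {"A": (-74.00, 40.70), "B": (-73.99, 40.71), "C": (-73.98, 40.72)}
trips = [(101, "A", "B"), (101, "B", "C"), (202, "A", "A"), (303, "A", "Z")]
ranking = total_distance_km(trips, stations)
```

Because only the station-to-station distance is counted, a total computed this way most likely underestimates the distance each bicycle was actually ridden.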

Task 2. Explore the weather dataset
1. In the left pane of the BigQuery Console, select the newly added
bigquery-public-data project and select ghcn_d > ghcnd_2015.

2. Then click the Preview tab and examine the columns and some of the data values.

3. Click the blue + button to compose a new query and enter the following:

SELECT
  wx.date,
  wx.value/10.0 AS prcp
FROM
  `bigquery-public-data.ghcn_d.ghcnd_2015` AS wx
WHERE
  id = 'USW00094728'
  AND qflag IS NULL
  AND element = 'PRCP'
ORDER BY
  wx.date
4. Click Run.
This query returns rainfall (in mm) for all days in 2015 from a weather station in
New York whose ID is provided in the query (the station corresponds to NEW YORK
CNTRL PK TWR).

Task 3. Find correlation between rain and bicycle rentals
How about joining the bicycle rentals data against weather data to learn whether
there are fewer bicycle rentals on rainy days?

1. Click the blue + button to compose a new query and enter the following:

WITH bicycle_rentals AS (
  SELECT
    COUNT(starttime) AS num_trips,
    EXTRACT(DATE FROM starttime) AS trip_date
  FROM `bigquery-public-data.new_york_citibike.citibike_trips`
  GROUP BY trip_date
),
rainy_days AS (
  SELECT
    date,
    (MAX(prcp) > 5) AS rainy
  FROM (
    SELECT
      wx.date AS date,
      IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp
    FROM
      `bigquery-public-data.ghcn_d.ghcnd_2015` AS wx
    WHERE
      wx.id = 'USW00094728'
  )
  GROUP BY
    date
)
SELECT
  ROUND(AVG(bk.num_trips)) AS num_trips,
  wx.rainy
FROM bicycle_rentals AS bk
JOIN rainy_days AS wx
ON wx.date = bk.trip_date
GROUP BY wx.rainy
2. Click Run.
Now you can see the results of joining the bicycle rental dataset with a weather
dataset that comes from a completely different source:

The query shows that, yes, New Yorkers take 47% fewer bicycle trips on days when
it rains.
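
The logic of this query can be sketched offline in a few lines of Python. The sketch below is only an illustration under stated assumptions: the observations and trip counts are made up (they are not results from the lab), a day counts as rainy when its precipitation exceeds 5 mm as in the query, and the percentage is derived as (dry average - rainy average) / dry average, which is one plausible way to turn the two averages into a single figure. Python's round() stands in for SQL ROUND.

```python
from collections import defaultdict
from datetime import date


def rainy_flags(observations, threshold_mm=5):
    """observations: (date, element, value); PRCP values are tenths of a millimetre.
    Days without any PRCP reading are omitted here (SQL would yield NULL for them)."""
    max_prcp = {}
    for day, element, value in observations:
        if element == "PRCP":
            mm = value / 10
            max_prcp[day] = max(mm, max_prcp.get(day, mm))  # MAX(prcp) per date
    return {day: mm > threshold_mm for day, mm in max_prcp.items()}


def average_trips_by_rain(trips_per_day, rainy):
    """Inner join on date, then AVG(num_trips) grouped by the rainy flag."""
    buckets = defaultdict(list)
    for day, num_trips in trips_per_day.items():
        if day in rainy:
            buckets[rainy[day]].append(num_trips)
    return {flag: round(sum(v) / len(v)) for flag, v in buckets.items()}


# Made-up numbers, for illustration only.
observations = [
    (date(2015, 6, 1), "PRCP", 120),  # 12.0 mm -> rainy
    (date(2015, 6, 1), "TMAX", 250),  # ignored: not a precipitation reading
    (date(2015, 6, 2), "PRCP", 0),    # 0.0 mm -> dry
    (date(2015, 6, 3), "PRCP", 30),   # 3.0 mm -> dry (below the 5 mm threshold)
]
trips_per_day = {date(2015, 6, 1): 500, date(2015, 6, 2): 1100, date(2015, 6, 3): 900}

averages = average_trips_by_rain(trips_per_day, rainy_flags(observations))
drop_pct = 100 * (averages[False] - averages[True]) / averages[False]
```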

Loading Taxi Data into Google Cloud SQL 2.5

Loading data into BigQuery

Overview
BigQuery is Google's fully managed, NoOps, low-cost analytics database. With
BigQuery you can query terabytes of data without having any infrastructure to
manage or needing a database administrator. BigQuery uses SQL and offers a
pay-as-you-go pricing model, so you can focus on analyzing data to find
meaningful insights.

In this lab you will ingest subsets of the NYC taxi trips data into tables in
BigQuery.

What you'll learn

● Loading data into BigQuery from various sources

● Loading data into BigQuery using the CLI and Console

● Using DDL to create tables

Open BigQuery Console

1. In the Google Cloud Console, select Navigation menu > BigQuery.


The Welcome to BigQuery in the Cloud Console message box opens. This
message box provides a link to the quickstart guide and lists UI updates.

2. Click Done.

Task 1. Create a new dataset to store tables
1. To create a dataset, click on the View actions icon (the three vertical dots)
next to your project ID and select Create dataset.

2. Next, name your Dataset ID nyctaxi and leave all other options at their
default values, and then click Create dataset.
You'll now see the nyctaxi dataset under your project name.

Click Check my progress to verify the objective.


Task 2. Ingest a new dataset from a CSV
In this section, you will load a local CSV into a BigQuery table.

1. Download a subset of the NYC taxi 2018 trips data locally onto your computer
from this link.
2. In the BigQuery Console, select the nyctaxi dataset, then click Create Table.

Specify the following table options:

Source:

● Create table from: Upload
● Choose File: select the file you downloaded locally earlier
● File format: CSV

Destination:

● Table name: 2018trips

Leave all other settings at their defaults.

Schema:

● Check Auto Detect (tip: Not seeing the checkbox? Ensure the file format is
CSV and not Avro)

Advanced Options:

● Leave at default values

Click Create Table.
3. You should now see the 2018trips table below the nyctaxi dataset.
Select the 2018trips table and view details:

How many rows are in the table?

● 900
● 10,018 (correct answer)
● 1,090
● 1,200
4. Select Preview and confirm that all columns have been loaded.

You have successfully loaded a CSV file into a new BigQuery table.

Running SQL Queries

Next, practice with a basic query on the 2018trips table.

1. In the Query Editor, write a query to list the top 5 most expensive trips of the
year:

#standardSQL
SELECT
  *
FROM
  nyctaxi.2018trips
ORDER BY
  fare_amount DESC
LIMIT 5
What was the highest fare amount in the year?

● 339
● 300 (correct answer)
● 250

Click Check my progress to verify the objective.

Task 3. Ingest a new dataset from Google Cloud Storage
Now, load another subset of the same 2018 trip data, which is available on
Cloud Storage. This time, use the CLI tool to do it.

1. In your Cloud Shell, run the following command:

bq load \
  --source_format=CSV \
  --autodetect \
  --noreplace \
  nyctaxi.2018trips \
  gs://cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_2.csv
Note: With the above load job, you are specifying that this subset is to be appended
to the existing 2018trips table that you created above.
2. When the load job is complete, you will get a confirmation on the screen.

3. Back on your BigQuery console, select the 2018trips table and view details.
Confirm that the row count has now almost doubled.

4. You may want to run the same query as earlier to see whether the top 5 most
expensive trips have changed.

Click Check my progress to verify the objective.


Task 4. Create tables from other tables with DDL
The 2018trips table now has trips from throughout the year. What if you were only
interested in January trips? For the purpose of this lab, we will keep it simple and
focus only on pickup date and time. Let's use DDL to extract this data and store it in
another table.

1. In the Query Editor, run the following CREATE TABLE command:

#standardSQL
CREATE TABLE
  nyctaxi.january_trips AS
SELECT
  *
FROM
  nyctaxi.2018trips
WHERE
  EXTRACT(MONTH FROM pickup_datetime) = 1;
2. Now run the query below in your Query Editor to find the longest distance
traveled in the month of January:
#standardSQL
SELECT
  *
FROM
  nyctaxi.january_trips
ORDER BY
  trip_distance DESC
LIMIT
  1
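
The two steps above (materialize the January rows into their own table, then pick the row with the largest trip_distance) can be mimicked offline with a short Python sketch. The rows below are made up for illustration; only the column names pickup_datetime and trip_distance come from the lab, and the helper names january_trips and longest_trip are hypothetical.

```python
from datetime import datetime


def january_trips(trips):
    """Keep rows whose pickup month is January:
    EXTRACT(MONTH FROM pickup_datetime) = 1."""
    return [t for t in trips if t["pickup_datetime"].month == 1]


def longest_trip(trips):
    """Row with the largest trip_distance: ORDER BY trip_distance DESC LIMIT 1."""
    return max(trips, key=lambda t: t["trip_distance"])


# Made-up rows, for illustration only.
trips_2018 = [
    {"pickup_datetime": datetime(2018, 1, 5, 8, 30), "trip_distance": 2.4},
    {"pickup_datetime": datetime(2018, 1, 20, 23, 10), "trip_distance": 17.9},
    {"pickup_datetime": datetime(2018, 3, 2, 12, 0), "trip_distance": 31.0},
]
january = january_trips(trips_2018)  # plays the role of nyctaxi.january_trips
longest = longest_trip(january)
```

The March row has the greatest distance overall, but it never reaches the second step because the January table no longer contains it.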
Click Check my progress to verify the objective.
