Lab 4: Creating a Streaming Data Pipeline for a Real-Time Dashboard with Dataflow
Overview
Set up your environments
Task 1. Source a public Pub/Sub topic and create a BigQuery dataset
Task 2. Create a Cloud Storage bucket
Task 3. Set up a Dataflow Pipeline
Task 4. Analyze the taxi data using BigQuery
Task 5. Perform aggregations on the stream for reporting
Task 6. Stop the Dataflow job
Task 7. Create a real-time dashboard
Task 8. Create a time series dashboard
Congratulations!
End your lab
1 hour, 1 Credit
Overview
In this lab, you own a fleet of New York City taxi cabs and are looking to monitor how well your
business is doing in real time. You will build a streaming data pipeline to capture taxi revenue,
passenger count, ride status, and much more, and visualize the results in a management
dashboard.
Qwiklabs setup
For each lab, you get a new GCP project and set of resources for a fixed time at no cost.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
3. When ready, click Start Lab.
Task 1. Source a public Pub/Sub topic and create a BigQuery dataset
Google maintains a few public Pub/Sub streaming data topics for labs like this one. We'll be
using the NYC Taxi & Limousine Commission’s open dataset.
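If you want to peek at the raw stream before wiring up the pipeline, you can attach a temporary subscription to the public topic from Cloud Shell. This is an optional sketch, not a lab step; it assumes the public topic is projects/pubsub-public-data/topics/taxirides-realtime and that it accepts subscriptions from your project, and the subscription name is made up here for illustration:

# Create a throwaway subscription on the public taxi topic, pull a few messages, then clean up.
gcloud pubsub subscriptions create taxi-peek-sub \
  --topic=projects/pubsub-public-data/topics/taxirides-realtime
gcloud pubsub subscriptions pull taxi-peek-sub --limit=3 --auto-ack
gcloud pubsub subscriptions delete taxi-peek-sub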
BigQuery is a serverless data warehouse. Tables in BigQuery are organized into datasets. In this
lab, messages published into Pub/Sub will be aggregated and stored in BigQuery.
To create a new BigQuery dataset:
bq --location=us-west1 mk \
--time_partitioning_field timestamp \
--schema ride_id:string,point_idx:integer,latitude:float,longitude:float,\
timestamp:timestamp,meter_reading:float,meter_increment:float,ride_status:string,\
passenger_count:integer -t taxirides.realtime
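The command above creates the realtime table inside a dataset named taxirides and partitions it on the timestamp field. If that dataset does not already exist in your lab project, create it first. A minimal sketch using the same bq CLI, with the dataset name and location matching the table command above:

# Create the taxirides dataset (only needed if it doesn't exist yet).
bq --location=us-west1 mk taxirides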
Note: Skip these steps if you created the table using the command line.
1. In the Google Cloud Console, select Navigation menu > Analytics > BigQuery:
ride_id:string,
point_idx:integer,
latitude:float,
longitude:float,
timestamp:timestamp,
meter_reading:float,
meter_increment:float,
ride_status:string,
passenger_count:integer
10. Under Partition and cluster settings, select the timestamp option for the Partitioning field.
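Whichever method you used, you can confirm the table and its schema from Cloud Shell. This is an optional check, not a lab step:

# Print the table schema as JSON, then the table details (including time partitioning).
bq show --schema --format=prettyjson taxirides.realtime
bq show taxirides.realtime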
3. Click Manage.
6. Click Enable.
7. Click Save.
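If the Manage and Enable clicks above refer to restarting the Dataflow API (as they do in other versions of this lab), a rough Cloud Shell equivalent is sketched below. This is an assumption about what those steps do; prefer the console steps the lab gives you:

# Disable and re-enable the Dataflow API for the lab project.
gcloud services disable dataflow.googleapis.com --force
gcloud services enable dataflow.googleapis.com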
Note: There is a colon : between the project and dataset name and a dot . between the dataset
and table name.
8. Under Temporary location, enter gs://<mybucket>/tmp/.
Max workers: 2
Number of workers: 2
Note: If the Dataflow job fails the first time, create a new job from the template with a new job
name and run it again.
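The console steps above create the job from the Google-provided Pub/Sub Topic to BigQuery template. For reference, the same job can be launched from Cloud Shell. This is a sketch under a few assumptions: the job name is made up here, the template path and public topic name are taken from other versions of this lab, and the Task 2 bucket <mybucket> already exists (for example, created with gsutil mb gs://<mybucket>/). Verify the parameter names against the template before relying on it:

# Launch the streaming template job (hypothetical job name; replace the placeholders).
gcloud dataflow jobs run streaming-taxi-pipeline \
  --gcs-location=gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --region=<region> \
  --staging-location=gs://<mybucket>/tmp/ \
  --max-workers=2 \
  --num-workers=2 \
  --parameters=inputTopic=projects/pubsub-public-data/topics/taxirides-realtime,outputTableSpec=<myprojectid>:taxirides.realtime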
Task 4. Analyze the taxi data using BigQuery
To analyze the data as it is streaming:
WITH streaming_data AS (
SELECT
timestamp,
TIMESTAMP_TRUNC(timestamp, HOUR, 'UTC') AS hour,
TIMESTAMP_TRUNC(timestamp, MINUTE, 'UTC') AS minute,
TIMESTAMP_TRUNC(timestamp, SECOND, 'UTC') AS second,
ride_id,
latitude,
longitude,
meter_reading,
ride_status,
passenger_count
FROM
taxirides.realtime
WHERE ride_status = 'dropoff'
ORDER BY timestamp DESC
LIMIT 1000
)
# calculate aggregations on stream for reporting:
SELECT
ROW_NUMBER() OVER() AS dashboard_sort,
minute,
COUNT(DISTINCT ride_id) AS total_rides,
SUM(meter_reading) AS total_revenue,
SUM(passenger_count) AS total_passengers
FROM streaming_data
GROUP BY minute, timestamp
Note: Ensure Dataflow is registering data in BigQuery before proceeding to the next task.
The result shows key metrics by the minute for every taxi drop-off.
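If you want to confirm from Cloud Shell that rows are arriving (per the note above), a quick illustrative check, not part of the lab steps, is:

bq query --use_legacy_sql=false 'SELECT COUNT(*) AS row_count FROM taxirides.realtime'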
Task 8. Create a time series dashboard
SELECT
*
FROM
taxirides.realtime
WHERE
ride_status='dropoff'
6. Click Add > ADD TO REPORT.
Dimension: timestamp
Metric: meter_reading(SUM)
Your time series chart should look similar to this:
Note: If the Dimension is timestamp(Date), click the calendar icon next to timestamp(Date) and
select the type Date & Time > Date Hour Minute.
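As an optional refinement (not a lab step), the drop-off query used for the dashboard could be saved as a BigQuery view so the report references a named object instead of pasted SQL. A minimal sketch, with the view name chosen here for illustration:

-- Wrap the dashboard's custom query in a reusable view.
CREATE OR REPLACE VIEW taxirides.taxi_dropoffs AS
SELECT *
FROM taxirides.realtime
WHERE ride_status = 'dropoff';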
Congratulations!
In this lab, you used Pub/Sub to collect streaming data messages from taxis and fed them through
your Dataflow pipeline into BigQuery.
End your lab
You will be given an opportunity to rate the lab experience. Select the applicable number of
stars, type a comment, and then click Submit.
Note: If you end this lab, you will lose all your work. You may not be able to restart the lab if
there is a quota limit.