
Creating a Streaming Data Pipeline for a Real-Time Dashboard with Dataflow

 Overview
 Set up your environments
 Task 1. Source a public Pub/Sub topic and create a BigQuery dataset
 Task 2. Create a Cloud Storage bucket
 Task 3. Set up a Dataflow Pipeline
 Task 4. Analyze the taxi data using BigQuery
 Task 5. Perform aggregations on the stream for reporting
 Task 6. Stop the Dataflow job
 Task 7. Create a real-time dashboard
 Task 8. Create a time series dashboard
 Congratulations!
 End your lab

1 hour · 1 credit

Overview
In this lab, you own a fleet of New York City taxi cabs and want to monitor how well your
business is doing in real time. You will build a streaming data pipeline to capture taxi revenue,
passenger count, ride status, and more, and visualize the results in a management dashboard.

Set up your environments

Qwiklabs setup
For each lab, you get a new GCP project and set of resources for a fixed time at no cost.

1. Make sure you are signed in to Qwiklabs using an incognito window.

2. Note the lab's access time (for example, 01:00:00) and make sure you can finish in that time block.

There is no pause feature. You can restart if needed, but you have to start at the beginning.

3. When ready, click Start Lab.

4. Note your lab credentials. You will use them to sign in to the Cloud Platform Console.

5. Click Open Google Console.

6. Click Use another account and copy/paste the credentials for this lab into the prompts.

If you use other credentials, you'll get errors or incur charges.

7. Accept the terms and skip the recovery resource page.

Do not click End Lab unless you are finished with the lab or want to restart it. This clears your
work and removes the project.

Task 1. Source a public Pub/Sub topic and create a BigQuery dataset
Pub/Sub is an asynchronous global messaging service. By decoupling senders and receivers, it
allows for secure and highly available communication between independently written
applications. Pub/Sub delivers low-latency, durable messaging.
In Pub/Sub, publisher applications and subscriber applications connect with one another through
the use of a shared string called a topic. A publisher application creates and sends messages to a
topic. Subscriber applications create a subscription to a topic to receive messages from it.

Google maintains a few public Pub/Sub streaming data topics for labs like this one. We'll be
using the NYC Taxi & Limousine Commission’s open dataset.
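If you want to peek at the raw stream before building the pipeline, you can attach a temporary subscription to the public topic from Cloud Shell. This is a minimal sketch, not a required lab step; the subscription name taxi-peek is a placeholder of our choosing, and it assumes the public topic permits attaching subscriptions (the Dataflow template you launch later relies on the same permission):

# Attach a subscription to the public taxi topic.
gcloud pubsub subscriptions create taxi-peek \
  --topic=projects/pubsub-public-data/topics/taxirides-realtime

# Pull and acknowledge a few messages to inspect the JSON payload.
gcloud pubsub subscriptions pull taxi-peek --limit=3 --auto-ack

# Clean up the subscription when done.
gcloud pubsub subscriptions delete taxi-peek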
BigQuery is a serverless data warehouse. Tables in BigQuery are organized into datasets. In this
lab, messages published into Pub/Sub will be aggregated and stored in BigQuery.
To create a new BigQuery dataset:

Option 1: The command-line tool

1. Open Cloud Shell and run the below command to create the taxirides dataset.

bq --location=us-west1 mk taxirides
2. Run this command to create the taxirides.realtime table (an empty schema that you will stream into later).

bq --location=us-west1 mk \
--time_partitioning_field timestamp \
--schema ride_id:string,point_idx:integer,latitude:float,longitude:float,\
timestamp:timestamp,meter_reading:float,meter_increment:float,\
ride_status:string,passenger_count:integer -t taxirides.realtime
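To confirm the table was created with the expected schema, a quick check from Cloud Shell:

# Print the realtime table's schema as JSON.
bq show --schema --format=prettyjson taxirides.realtime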

Option 2: The BigQuery Console UI

Note: Skip these steps if you created the tables using the command line.
1. In the Google Cloud Console, select Navigation menu > Analytics > BigQuery.

2. The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.

3. Click the View actions icon next to your Project ID and click Create dataset.

4. Set the Dataset ID to taxirides, for Data location select us-west1 (Oregon), leave all the other fields as they are, and click CREATE DATASET.

5. In the left-hand resources menu, you should see your newly created dataset.

6. Click the View actions icon next to the taxirides dataset and click Open in current tab.

7. Click CREATE TABLE.

8. Name the table realtime.

9. For the schema, click Edit as text and paste in the below:


ride_id:string,
point_idx:integer,
latitude:float,
longitude:float,
timestamp:timestamp,
meter_reading:float,
meter_increment:float,
ride_status:string,
passenger_count:integer
10. Under Partition and cluster settings,
select the timestamp option for the
Partitioning field.

11. Click the CREATE TABLE button.

Task 2. Create a Cloud Storage bucket
Cloud Storage allows world-wide storage and retrieval of any amount of data at any time. You
can use Cloud Storage for a range of scenarios including serving website content, storing data for
archival and disaster recovery, or distributing large data objects to users via direct download. In
this lab, you use Cloud Storage to provide working space for your Dataflow pipeline.

1. In the Cloud Console, go to Navigation menu > Cloud Storage.
2. Click CREATE BUCKET.
3. For Name, paste in your GCP Project ID and then click Continue.
4. For Location type, click Multi-region if it is not already selected.
5. Click CREATE.
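Equivalently, the bucket can be created from Cloud Shell. A minimal sketch, assuming the default US multi-region and using the project ID (exposed in Cloud Shell as $GOOGLE_CLOUD_PROJECT) as the bucket name:

# Make a multi-region US bucket named after the project.
gsutil mb -l us gs://$GOOGLE_CLOUD_PROJECT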

Task 3. Set up a Dataflow Pipeline

Dataflow is a serverless way to carry out data analysis. In this lab, you set up a streaming data
pipeline that reads taxi ride messages from Pub/Sub and writes them out to BigQuery.

Restart the connection to the Dataflow API.

1. In the Cloud Console, enter Dataflow API in the top search bar.
2. Click on the result for Dataflow API.
3. Click Manage.
4. Click Disable API.
5. If asked to confirm, click Disable.
6. Click Enable.

To create a new streaming pipeline:

1. In the Cloud Console, go to Navigation menu > Dataflow.
2. In the top menu bar, click CREATE JOB FROM TEMPLATE.
3. Enter streaming-taxi-pipeline as the Job name for your Dataflow job.
4. Under Regional endpoint, select us-west1 (Oregon).
5. Under Dataflow template, select the Pub/Sub Topic to BigQuery template.
6. Under Input Pub/Sub topic, click Enter Topic Manually and enter projects/pubsub-public-data/topics/taxirides-realtime.
7. Click Save.
8. Under BigQuery output table, enter <myprojectid>:taxirides.realtime.

Note: There is a colon (:) between the project and dataset name and a dot (.) between the dataset and table name.

9. Under Temporary location, enter gs://<mybucket>/tmp/.
10. Click Show Optional Parameters and input the following values:

- Max workers: 2
- Number of workers: 2
- Worker region: us-west1

11. Click the RUN JOB button.


A new streaming job has started! You can now see a visual representation of the data pipeline.

Note: If the Dataflow job fails the first time, re-create the job from the template with a new job name and run it again.
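The same job can also be launched from Cloud Shell. A hedged sketch, assuming the classic Pub/Sub Topic to BigQuery template published at gs://dataflow-templates/latest/PubSub_to_BigQuery and its inputTopic/outputTableSpec parameters; replace <myprojectid> and <mybucket> with your own values:

# Launch the streaming template with the same settings as the console steps above.
gcloud dataflow jobs run streaming-taxi-pipeline \
  --region us-west1 \
  --max-workers 2 --num-workers 2 \
  --staging-location gs://<mybucket>/tmp/ \
  --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --parameters inputTopic=projects/pubsub-public-data/topics/taxirides-realtime,outputTableSpec=<myprojectid>:taxirides.realtime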
Task 4. Analyze the taxi data using BigQuery
To analyze the data as it is streaming:

1. In the Cloud Console, select Navigation menu > BigQuery.
2. Enter the following query in the query EDITOR and click RUN:

SELECT * FROM taxirides.realtime LIMIT 10


3. If no records are returned, wait another minute and re-run the above query (Dataflow takes 3-5 minutes to set up the stream). You will receive a similar output:
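You can run the same check from Cloud Shell without opening the console. A small sketch; the row count grows as the stream runs:

# Count rows landed so far in the streaming table.
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS rows_so_far FROM taxirides.realtime'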
Task 5. Perform aggregations on the stream for reporting
1. Copy and paste the below query and
click RUN.

WITH streaming_data AS (
  SELECT
    timestamp,
    TIMESTAMP_TRUNC(timestamp, HOUR, 'UTC') AS hour,
    TIMESTAMP_TRUNC(timestamp, MINUTE, 'UTC') AS minute,
    TIMESTAMP_TRUNC(timestamp, SECOND, 'UTC') AS second,
    ride_id,
    latitude,
    longitude,
    meter_reading,
    ride_status,
    passenger_count
  FROM
    taxirides.realtime
  WHERE ride_status = 'dropoff'
  ORDER BY timestamp DESC
  LIMIT 1000
)
# calculate aggregations on stream for reporting:
SELECT
  ROW_NUMBER() OVER() AS dashboard_sort,
  minute,
  COUNT(DISTINCT ride_id) AS total_rides,
  SUM(meter_reading) AS total_revenue,
  SUM(passenger_count) AS total_passengers
FROM streaming_data
GROUP BY minute, timestamp
Note: Ensure Dataflow is registering data in BigQuery before proceeding to the next task.
The result shows key metrics by the minute for every taxi drop-off.

Task 6. Stop the Dataflow job

1. Navigate back to Dataflow.
2. Click the streaming-taxi-pipeline or the new job name.
3. Click STOP and select Cancel > STOP JOB.

This will free up resources for your project.
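The job can also be cancelled from Cloud Shell. A minimal sketch: list active jobs to find the job ID, then cancel it.

# Find the ID of the running streaming job.
gcloud dataflow jobs list --region us-west1 --status active

# Cancel it by ID (replace JOB_ID with the value from the list above).
gcloud dataflow jobs cancel JOB_ID --region us-west1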

Task 7. Create a real-time dashboard


1. Open this Google Data Studio link in a
new incognito browser tab.
2. On the Reports page, in the Start with
a Template section, click the [+]
Blank Report template.

3. To complete account setup, select your Country from the drop-down and enter a Company name if applicable.

4. Check the checkbox to acknowledge


the Google Data Studio Additional
Terms, and click Continue.

5. Select No to all the questions, then


click Continue.

6. Switch back to the BigQuery Console.

7. On the BigQuery page, click EXPLORE DATA > Explore with Data Studio.
8. Specify the below settings:

- Chart type: Combo chart
- Date range Dimension: dashboard_sort
- Dimension: dashboard_sort
- Drill Down: dashboard_sort (make sure the Drill down option is turned ON)
- Metric: SUM() total_rides, SUM() total_passengers, SUM() total_revenue
- Sort: dashboard_sort, Ascending (latest rides first)
Your chart should look similar to this:
Note: Visualizing data at a minute-level granularity is currently not supported in Data Studio as
a timestamp. This is why we created our own dashboard_sort dimension.
9. When you're happy with your
dashboard, click Save and share to
save this data source.

10. If prompted with the Review data


access before saving window,
click Acknowledge and save.

11. Click Add to report.

12. Whenever anyone visits your dashboard, it will be up to date with the latest transactions. You can try it yourself by clicking the More options menu and Refresh data.

Task 8. Create a time series dashboard


1. Click this Google Data Studio link to
open Data Studio in a new browser tab.
2. On the Reports page, in the Start with
a Template section, click the [+]
Blank Report template.

3. A new, empty report opens with Add


data to report.

4. From the list of Google Connectors,


select the BigQuery tile.

5. Under CUSTOM QUERY, click qwiklabs-gcp-xxxxxxx > Enter Custom Query, and add the following query.

SELECT
  *
FROM
  taxirides.realtime
WHERE
  ride_status = 'dropoff'
6. Click Add > ADD TO REPORT.
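If the custom query ever errors in Data Studio, you can sanity-check it from Cloud Shell first. A sketch using BigQuery's dry-run mode, which validates the query and reports bytes scanned without actually running it:

# Validate the dashboard query without executing it.
bq query --use_legacy_sql=false --dry_run 'SELECT * FROM taxirides.realtime WHERE ride_status = "dropoff"'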

Create a time series chart

1. In the Data panel, click ADD A FIELD, then click All Fields in the left corner.

2. Change the field timestamp type to Date & Time > Date Hour Minute (YYYYMMDDhhmm).

3. Click Continue and then click Done.

4. Click Add a chart.

5. Choose Time series chart.

6. Position the chart in the bottom-left corner, in the blank space.

7. In the Data panel on the right, change


the following:

- Dimension: timestamp
- Metric: meter_reading(SUM)
Your time series chart should look similar to this:
Note: If the Dimension is timestamp(Date), click the calendar icon next to timestamp(Date) and set the type to Date & Time > Date Hour Minute.

Congratulations!
In this lab, you used Pub/Sub to collect streaming data messages from taxis and fed them through
your Dataflow pipeline into BigQuery.

End your lab


When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the
resources you’ve used and cleans the account for you.

You will be given an opportunity to rate the lab experience. Select the applicable number of
stars, type a comment, and then click Submit.

The number of stars indicates the following:

 1 star = Very dissatisfied


 2 stars = Dissatisfied
 3 stars = Neutral
 4 stars = Satisfied
 5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.

For feedback, suggestions, or corrections, please use the Support tab.


Copyright 2021 Google LLC All rights reserved. Google and the Google logo are trademarks of
Google LLC. All other company and product names may be trademarks of the respective
companies with which they are associated.

