OD 03 PDE Building and Operationalizing Data Processing Systems
Operationalizing Data
Processing Systems
Tom Stern
The next section of the exam guide covers building data processing systems.
That includes assembling data processing systems from individual components, as well as using fully managed services.
Building and
Maintaining Data
Structures and
Databases
Tom Stern
The first area of data processing we will look at is building and maintaining structures
and databases.
So, not just selecting a particular kind of database or service, but also thinking about
the qualities that are provided and starting to consider how to organize the data.
Selecting storage options (decision tree):
● Start: Is your data structured? If no: do you need application caching? Yes: Memorystore. No: Cloud Storage.
● If it is structured, does your workload involve analytics? If yes: do you need extensive updates and/or low latency? Yes: Cloud Bigtable (high throughput). No: BigQuery (data warehouse, tabular data).
● If the workload is not analytics, is your data relational? If no: Firestore (transactions). If yes: do you need global scalability? Yes: Cloud Spanner (high availability). No: Cloud SQL.
Here is some concrete advice on flexible data representation. You want the data
divided up in a way that makes the most sense for your given use case.
If the data is divided up too much, it creates additional work.
In the example on the left, each data item is stored separately, making it easy to filter
on specific fields and to perform updates.
In the example on the right, all of the data is stored in a single record, like one long
string. Editing or updating it is difficult, and filtering on a particular field would be hard.
In the example on the bottom, a relation is defined by two tables. This might make it
easier to manage and report on the list of locations.
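To make these trade-offs concrete, here is a small illustrative sketch in Python; the field names and values are invented for this example and are not from the case.

```python
# Divided into separate fields: easy to filter on a field and to update one value.
order = {"order_id": 1001, "city": "Austin", "state": "TX", "country": "US"}

# Everything packed into a single string: filtering or updating one field is hard.
order_blob = "1001|Austin|TX|US"

# A relation split across two "tables": the list of locations can be managed
# and reported on independently of the orders that reference it.
orders = [{"order_id": 1001, "location_id": 7}]
locations = [{"location_id": 7, "city": "Austin", "state": "TX", "country": "US"}]

# Filtering on state is trivial in the first and third forms, awkward in the second.
texas_orders = [o for o in orders
                if any(l["location_id"] == o["location_id"] and l["state"] == "TX"
                       for l in locations)]
print(texas_orders)
```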
What transaction qualities are required? (CAP theorem)
ACID:
● Atomicity
● Consistency
● Isolation
● Durability
BASE:
● Basically Available
● Soft state
● Eventual consistency
ACID vs BASE is essential data knowledge that you will want to be very familiar with
so that you can easily determine whether a particular data solution is compatible with
the requirements identified in the case. Example: for a financial transaction, a service
that provides only eventual consistency might be incompatible.
Cloud Storage
Cloud Storage is persistent. It has storage classes: Nearline, Coldline, Regional, and
Multi-Regional. There is granular access control. You should be familiar with all the
methods of access control, including IAM roles and Signed URLs.
Cloud Storage has a ton of features that people often miss. They end up trying to
duplicate the functionality in code, when in fact all they need to do is use the capability
that is already available.
For example, you can change storage classes. You can stream data to Cloud
Storage. Cloud Storage supports a kind of versioning. And there are multiple
encryption options to meet different needs. Also, you can automate some of these
features using Lifecycle management. For example, you could change the class of
storage for an object or delete that object after a period of time.
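For example, a lifecycle policy like the following, shown here as a minimal sketch with the Python client library (the bucket name and the 90/365-day ages are placeholder assumptions), would move objects to Coldline after 90 days and delete them after a year:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-reports-bucket")  # placeholder bucket name

# Add lifecycle rules: change storage class after 90 days, delete after 365 days.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # apply the updated lifecycle configuration to the bucket
```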
Cloud SQL
● Familiar: Cloud SQL supports most MySQL statements and functions, even stored procedures, triggers, and views.
● Fully managed: Cloud SQL is a fully managed relational database service for MySQL, PostgreSQL, and SQL Server.
● Not supported: user-defined functions, and statements and functions related to files and plugins.
● Flexible pricing: You can pay per use or per hour. Backups, replication, and so forth are managed for you.
● Connect from anywhere: You can assign a static IP address and use typical SQL connector libraries.
● Fast: You can place your Cloud SQL instance in the same region as your App Engine or Compute Engine applications and get great bandwidth.
● Google security: Cloud SQL resides in secure Google data centers. There are several ways to securely access a Cloud SQL instance (see Cloud SQL Access in the security part of this course).
I want to highlight to you that there are several ways to securely connect to a Cloud
SQL instance. And it would be important for you to be familiar with the different
approaches and the benefits of each.
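One such approach, sketched below with the Cloud SQL Python Connector (the instance connection name, user, password, and database are placeholders), opens encrypted connections through the connector instead of allowlisting IP addresses by hand:

```python
import sqlalchemy
from google.cloud.sql.connector import Connector

connector = Connector()

def getconn():
    # Placeholder connection name in the form project:region:instance.
    return connector.connect(
        "example-project:us-central1:example-instance",
        "pymysql",
        user="app_user",
        password="change-me",
        db="inventory",
    )

# SQLAlchemy engine that creates its connections through the connector.
pool = sqlalchemy.create_engine("mysql+pymysql://", creator=getconn)

with pool.connect() as conn:
    print(conn.execute(sqlalchemy.text("SELECT NOW()")).scalar())
```

The same pattern works for PostgreSQL and SQL Server by swapping in the corresponding database driver.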
Cloud Bigtable
Properties:
● Cloud Bigtable is meant for high throughput data
● Millisecond latency, NoSQL
● Access is designed to optimize for a range of Row Key prefixes
Important features:
● Schema design and time-series support
● Access control
● Performance design
● Choosing between SSD and HDD
Cloud Bigtable is meant for high-throughput data. It has millisecond latency, so it is
much faster than BigQuery, for example. It is NoSQL, which makes it a good fit as a
wide-column store.
When would you want to select SSD rather than HDD for the machines in the cluster?
Generally, when you need faster performance and lower latency.
Cloud Bigtable: High throughput data where access
is primarily for a range of Row Key prefixes
Each trade is its own row. This will result in hundreds of millions of rows per day, which
is fine for Cloud Bigtable.
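As an illustration of that access pattern, here is a minimal sketch with the Python client; the project, instance, table, and the symbol#date row key scheme are all assumptions, not part of the case:

```python
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="example-project")
instance = client.instance("trading-instance")
table = instance.table("trades")

# Read every trade whose row key starts with a symbol#date prefix.
row_set = RowSet()
row_set.add_row_range_with_prefix("GOOG#20240115")

for row in table.read_rows(row_set=row_set):
    print(row.row_key.decode(), len(row.cells))
```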
Cloud Spanner
Properties:
● Global, fully managed, relational database with transactional consistency.
● Data in Cloud Spanner is strongly typed: you must define a schema for each database, and that schema must specify the data types of each column of each table.
Important features:
● Schema design, data model, and updates
● Secondary indexes
● Timestamp bounds and commit timestamps
● Data types
● Transactions
Cloud Spanner is strongly typed and globally consistent. The two characteristics that
distinguish it from Cloud SQL are globally consistent transactions and size. Cloud
Spanner can work with much larger databases than Cloud SQL.
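As one small illustration of the timestamp bounds mentioned above, the sketch below (Python client; instance, database, table, and the 15-second bound are placeholders) performs a read with an exact-staleness bound:

```python
import datetime
from google.cloud import spanner

client = spanner.Client(project="example-project")
database = client.instance("example-instance").database("orders-db")

# Read-only snapshot with an exact-staleness timestamp bound of 15 seconds.
with database.snapshot(exact_staleness=datetime.timedelta(seconds=15)) as snapshot:
    for row in snapshot.execute_sql("SELECT OrderId, Total FROM Orders LIMIT 10"):
        print(row)
```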
Use Cloud Spanner if you need globally consistent
data or more than one Cloud SQL instance
Chart: mean latency (ms) as throughput (queries per second) increases, comparing MySQL with Cloud Spanner at 9, 15, and 30 nodes.
Cloud SQL is fine if you can get by with a single database. But if your needs are such
that you need multiple databases, Cloud Spanner is a great choice.
MySQL hits a performance wall. If you look at the 99th percentile of latency, it is clear
that performance degrades. Distributing MySQL is hard. However, Spanner distributes
easily (even globally) and provides consistent performance. To support more
throughput, just add more nodes.
Firestore is the new version of Datastore
Properties:
● NoSQL document database
● ACID transactions
● Massive scalability with high performance
● Flexible storage and querying of data
● Fully managed, serverless
Important features:
● Effortlessly scales up or down to meet demand
● Built-in live synchronization and offline mode for multi-user, collaborative mobile and web applications
● Introduces several improvements over Datastore
Datastore is a NoSQL solution that used to be private to App Engine. It offers many
features that are mainly useful to applications, such as persisting state information. It
is now available to clients besides App Engine.
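As an illustration of the ACID transaction support, here is a minimal sketch with the Python client; the collection, document IDs, and field name are placeholders:

```python
from google.cloud import firestore

db = firestore.Client(project="example-project")

@firestore.transactional
def transfer_points(transaction, from_ref, to_ref, points):
    # Reads must happen before writes inside a Firestore transaction.
    from_snapshot = from_ref.get(transaction=transaction)
    to_snapshot = to_ref.get(transaction=transaction)
    transaction.update(from_ref, {"points": from_snapshot.get("points") - points})
    transaction.update(to_ref, {"points": to_snapshot.get("points") + points})

users = db.collection("users")
transfer_points(db.transaction(), users.document("alice"), users.document("bob"), 10)
```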
Comparing storage options: Use cases
● Firestore: NoSQL document. Use cases: mobile and web applications with transactions.
● Cloud Bigtable: NoSQL wide column. Use cases: AdTech, financial, and IoT data.
● Cloud Storage: Blobstore. Use cases: images, large media files, backups.
● Cloud SQL: Relational SQL for OLTP. Use cases: user credentials, customer orders.
● Cloud Spanner: Relational SQL for OLTP. Use cases: whenever high I/O and global consistency are needed.
● BigQuery: Relational SQL for OLAP. Use cases: data warehousing.
Commit this table to memory, and be able to use it backwards. Example: If the exam
question contains "Data Warehouse" you should be thinking "BigQuery is a
candidate".
If the case says something about large media files, you should immediately be
thinking Cloud Storage.
Building and
Operationalizing
Pipelines
Tom Stern
The next section is on Building and Maintaining pipelines. We've already covered a lot
of this information in the design section.
Dataflow does batch and streaming
Diagram: an event stream divided into simple windows.
Apache Beam is an open source, unified programming model for batch and streaming.
Before Apache Beam, you needed two separate pipelines to balance latency, throughput,
and fault tolerance.
Continuous data can arrive out of order. Simple windowing can separate related
events into independent windows, losing relationship information.
Dataflow resources are deployed on demand, per job, and work is constantly
rebalanced across resources
Dataflow solves many stream processing issues, including changes in size (spikes)
and growth over time. It can scale while remaining fault tolerant. And it has a flexible
programming model and methods to work with data arriving late or out of order.
Dataflow windowing for streams
Triggering controls how results are delivered to the next transforms in the pipeline.
The watermark is a heuristic that tracks how far behind event time the system's
processing is. Windowing answers the question: where in event time does processing occur?
All data processing is behind or lags events simply due to latency in the delivery of
the event message.
Windowing is too complicated to explain here. I just want to highlight that you might
need to know it, so make sure you understand it.
There really is no replacement for the Dataflow windowing capability for streaming
data.
Windows are the answer to "Where in event time?"
Windowing divides data into event time–based finite chunks.
Diagram: events for three keys grouped into finite, event-time-based windows along a time axis.
Often required when doing aggregations over unbounded data.
Windowing divides a PCollection up into finite chunks based on the event time of
each message. It can be useful in many contexts but is required when aggregating
over infinite data.
Do you know the basic windowing methods, including Fixed time (such as a daily
window), Sliding and overlapping windows (such as the last 24 hours), and
session-based windows that are triggered to capture bursts of activity?
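If you want to see what these choices look like in code, here is a minimal Apache Beam sketch in Python. The element format, timestamps, window sizes, and trigger settings are all assumptions for illustration; a real streaming job would read from an unbounded source rather than beam.Create.

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, AfterWatermark

# Assumed element format: (user_id, score, unix_timestamp) tuples.
events = [("user1", 5, 1700000000), ("user1", 2, 1700003600), ("user2", 7, 1700000100)]

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "Create" >> beam.Create(events)  # a real job would read from Pub/Sub instead
        | "Timestamp" >> beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
        | "DailyWindows" >> beam.WindowInto(
            window.FixedWindows(24 * 60 * 60),
            # Sliding alternative: window.SlidingWindows(24 * 60 * 60, 60 * 60)
            # Session alternative: window.Sessions(gap_size=10 * 60)
            trigger=AfterWatermark(late=AfterCount(1)),  # fire at the watermark, refire per late element
            accumulation_mode=AccumulationMode.ACCUMULATING,
            allowed_lateness=60 * 60)
        | "SumPerKey" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print))
```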
Side inputs in Dataflow
Diagram: a pipeline that gets messages from Pub/Sub, extracts data, and detects accidents using a side input, with BigQuery in the pipeline.
Tom Stern
We have covered a lot of information about the processing infrastructure already. Just
a few points about building and maintaining.
Building a streaming pipeline
Stream from Pub/Sub into BigQuery; BigQuery can provide streaming ingest to
unbounded datasets.
Your project can stream up to 300 MB per second in all locations except the US and
EU multi-regions, where your project can stream up to 1 GB per second.
Pub/Sub guarantees at-least-once delivery, but not the order of messages. "At least
once" means that repeated delivery of the same message is possible.
All data processing is behind, or lags, the events simply due to latency in the delivery
of the event message.
Because Pub/Sub might deliver messages out of order, a timestamp on each message
lets Dataflow remove duplicates and work out the correct order of messages.
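A minimal sketch of such a pipeline in Python follows; the topic, dataset, table, schema, and the assumption of one JSON object per message are illustrative, not prescribed:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # run as a streaming job

with beam.Pipeline(options=options) as pipeline:
    _ = (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/example-project/topics/events")
        | "ParseJson" >> beam.Map(json.loads)  # assumes one JSON object per message
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "example-project:analytics.events",
            schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```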
Data processing solutions
Diagram: stages of a data processing solution (ingest/capture, process, store, analyze, visualize), with options such as Google Analytics 360 on the ingest side and data scientists consuming the visualizations.
This diagram is useful because it shows the progression and options for input and
visualization on the edges of the common solution design.
Scaling streaming beyond BigQuery
Why Bigtable and not Cloud Spanner? Cost! Note that we can support 100,000 qps
with 10 nodes in Bigtable, but would need ~150 nodes in Cloud Spanner.
Scenario #1
Question
An application that relies on Cloud SQL to read infrequently changing data is predicted
to grow dramatically. How can you increase capacity for more read-only clients?
Time for some more practice exam questions. Here's the first one.
An application that relies on Cloud SQL to read infrequently changing data is predicted
to grow dramatically. How can you increase capacity for more read-only clients?
D is correct.
https://cloud.google.com/sql/docs/mysql/replication/tips#read-replica
The clue is that the clients are read-only and the challenge is scale.
Note that a high availability configuration wouldn't help in this scenario because it
would not necessarily increase throughput.
Scenario #2
Question
Your client wants a transactionally consistent, global, relational repository where they
can monitor and adjust node count for unpredictable traffic spikes.
A. Use Cloud Spanner. Monitor storage usage and increase node count if more
than 70% utilized.
B. Use Cloud Spanner. Monitor CPU utilization and increase node count if more
than 70% utilized for your time span.
C. Use Cloud Bigtable. Monitor data stored and increase node count if more
than 70% utilized.
D. Use Cloud Bigtable. Monitor CPU utilization and increase node count if more
than 70% utilized for your time span.
The answer is B. Use Cloud Spanner, monitor CPU utilization and increase the
number of nodes as needed.
Rationale
https://cloud.google.com/spanner/docs/monitoring
https://cloud.google.com/bigtable/docs/monitoring-instance
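The monitoring step that the correct answers describe can also be scripted. Here is a minimal sketch (the project ID and one-hour lookback are placeholders) that reads recent Cloud Spanner CPU utilization from Cloud Monitoring:

```python
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    end_time={"seconds": now},
    start_time={"seconds": now - 3600},  # look back one hour
)

results = client.list_time_series(
    request={
        "name": "projects/example-project",  # placeholder project
        "filter": 'metric.type = "spanner.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    latest = series.points[0].value.double_value  # points are returned newest first
    print(series.resource.labels["instance_id"], f"{latest:.1%}")
```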
Review
Diagram: a reporting pipeline made up of multiple chained ETL stages.
Case Study 01
I worked with a client that had a very complex reporting pipeline. They had a very
large amount of data that had to be processed into a report that was going to
regulators on a daily basis. They had to demonstrate that the risk in the financial data
from the previous day indicated that they were following the regulatory rules. They put
the data through multiple systems or stages. Each stage had a separate
Extract-Transform-Load sequence and then performed unique processing on the
data.
The complexity in the simplified diagram makes it look like it was a linear progression
from start-to-finish. But this is meant to be symbolic. The actual processes were much
more complicated and it took 30 hours from start to finish to generate one report. The
processes were actually more of a spider-web, with many dependencies. So one part
would run and then halt, waiting until other dependent parts were complete before
proceeding. And there were some processes that could run in parallel.
Data Engineer Case Study 01:
We mapped that to technical requirements like this...
BigQuery: Reduce the overall time to run, with BigQuery as the data warehouse and analytics engine.
Apache Airflow: Used to automate the pipeline, handle dependencies as code, and start each query
only when the queries it depends on are done.
First we diagrammed out all the processes. And then we started to look at how to
implement this on Google Cloud using the available services. We initially considered
using Dataproc or Dataflow. However, the customer already had analysts that were
familiar with BigQuery and SQL. So if we developed in BigQuery it was going to make
the solution more maintainable and usable to the group. If we developed in Dataproc,
for example, they would have had to rely on another team that had Spark
programmers. So this is an example where the technical solution was influenced by
the business context.
To make this solution work, we needed some automation. And for that we chose
Apache Airflow. In the original design we ran Airflow on a Compute Engine instance.
You might be familiar with the Google service called Cloud Composer, which provides
a managed Apache Airflow service. Cloud Composer was not yet available when we
began the design.
Diagram: ETL into the data warehouse.
In this particular case we used open source Apache Airflow. But if we were
implementing it today we would use Cloud Composer.
We were able to implement all their processing as SQL queries in BigQuery. And we
were able to implement all the dependencies through Airflow.
One of the time-sinks in their original process had to do with that 30 hour
start-to-finish window. What they would do is start processing jobs and sometimes
they would fail because the data from a previous dependency wasn't yet available.
And they had a manual process for restarting those jobs. We were able to automate
away that toil and the re-work by implementing the logic in Apache Airflow.
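To give a flavor of what that orchestration looks like, here is a generic Airflow sketch, not the client's actual DAG; the DAG ID, task names, SQL, and table names are invented, and it uses the BigQuery operator from the Google provider package:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="daily_risk_report",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stage_positions = BigQueryInsertJobOperator(
        task_id="stage_positions",
        configuration={"query": {
            "query": "CREATE OR REPLACE TABLE staging.positions AS "
                     "SELECT * FROM raw.positions",
            "useLegacySql": False,
        }},
    )

    compute_risk = BigQueryInsertJobOperator(
        task_id="compute_risk",
        configuration={"query": {
            "query": "CREATE OR REPLACE TABLE reports.daily_risk AS "
                     "SELECT desk, SUM(exposure) AS exposure "
                     "FROM staging.positions GROUP BY desk",
            "useLegacySql": False,
        }},
    )

    # The downstream query starts only after its dependency has finished,
    # which removes the failed-start-and-manual-restart toil described above.
    stage_positions >> compute_risk
```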
Challenge Lab 01
PDE Prep—BigQuery Essentials:
Challenge Lab
A Challenge Lab has minimal instructions. It explains the circumstance and the
expected results; you have to figure out how to implement them.