0% found this document useful (0 votes)

20 views

OD 03 PDE Building and Operationalizing Data Processing Systems

Uploaded by

sanaraheel

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

20 views

OD 03 PDE Building and Operationalizing Data Processing Systems

Uploaded by

sanaraheel

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 34

Building and

Operationalizing Data
Processing Systems

Tom Stern

The next section of the exam guide covers building data processing systems.

So that includes assembling data processing from parts, as well as using full services.
Building and
Maintaining Data
Structures and
Databases

Tom Stern

The first area of data processing we will look at is building and maintaining structures
and databases.

So, not just selecting a particular kind of database or service, but also thinking about
the qualities that are provided and starting to consider how to organize the data.
Selecting storage NO YES

options
Is your data
Start
structured?

Does your
workload involve
analytics?
Cloud Storage Do you need
extensive
updates and/or
Is your data low latency?
relational?
Do you
need global Cloud Bigtable BigQuery
scalability? High throughput Data warehouse
Do you need Tabular data
application
caching?

Cloud Cloud
Spanner SQL
Memorystore Firestore
High availability Transactions

You can familiarize yourself with this diagram as well.

● BigQuery is recommended as a data warehouse.

● BigQuery is the default storage for tabular data.
● Use Cloud Bigtable if you want low-latency/high-throughput.
Building and maintaining flexible
data representations
Databases are systems for the storage and retrieval of information.
● One approach is to previously deﬁne a complex structure that is tailored to the intended operations.
This approach may be faster and more eﬃcient for queries and reports.
● Another approach is to retrieve information from documents geared toward the business logic.
This approach may be easier for data entry and maintenance.

Last name First name

Street address Last name, First name, Street address,
City City, State, Country

State Country

Last First Street LOC LOC City State Country

Here is some concrete advice on flexible data representation. You want the data
divided up in a way that makes the most sense for your given use case.
If the data is divided up too much, it creates additional work.

In the example on the left, each data item is stored separately, making it easy to filter
on a specific fields and to perform updates.

In the example on the right, all of the data is stored in a single record, like a single
string. Editing/updating is difficult. Filtering on a particular field would be hard.

In the example on the bottom, a relation is defined by two tables. This might make it
easier to manage and report on the list of locations.
What transaction qualities are required?

CAP Theorem

Pick two: Consistency Availability Partition Tolerance

ACID BASE
● Atomicity ● Basically Available
● Consistency ● Soft state
● Isolation ● Eventual consistency
● Durability

ACID vs BASE is essential data knowledge that you will want to be very familiar with
so that you can easily determine whether a particular data solution is compatible with
the requirements identified in the case. Example: for a financial transaction, a service
that provides only eventual consistency might be incompatible.
Cloud Storage
Cluster node options Features Best Practices

● Persistent storage ● Versioning ● Traﬃc estimation

● Staging area for other services ● Encryption options: default
● Storage classes (Google), CMEK, CSEK
● Lifecycles
Access ● Change storage class
● Granular access control: ● Streaming
control access at Project, ● Data transfer/
Bucket, or Object synchronization
● IAM roles, ACLs, and ● Storage Transfer Service
Signed URLs
● JSON and XML APIs

Cloud Storage is persistent. It has storage classes -- nearline, coldline, regional and
multiregional. There is granular access control. You should be familiar with all the
methods of access control, including IAM roles and Signed URLs.

Cloud Storage has a ton of features that people often miss. They end up trying to
duplicate the function in code, when in fact, all they need to do is use the capacity
already available.

For example, you can change storage classes. You can stream data to Cloud
Storage. Cloud Storage supports a kind of versioning. And there are multiple
encryption options to meet different needs. Also, you can automate some of these
features using Lifecycle management. For example, you could change the class of
storage for an object or delete that object after a period of time.
Cloud SQL
Familiar: Cloud SQL supports most MySQL Connect from anywhere (can assign a static
statements and functions, even Stored IP address, and use typical SQL connector
procedures, Triggers, and Views. libraries).

Fully-managed: Cloud SQL is a Fast: You can place your Cloud SQL instance
fully-managed relational database service in the same region as your App Engine or
for MySQL, PostgreSQL, and SQL Server. Compute Engine applications and get great
bandwidth.
Not supported: User-deﬁned functions,
statements, and functions related to ﬁles
Google security: Cloud SQL resides in secure
and plugins.
Google data centers. There are several ways
Flexible pricing: You can pay per use to securely access a Cloud SQL instance (see
or per hour. Backups, replication, and so Cloud SQL Access in the Security part of this
forth are managed for you. course).

Cloud SQL is the managed service that provides a MySQL instance.

I want to highlight to you that there are several ways to securely connect to a Cloud
SQL instance. And it would be important for you to be familiar with the different
approaches and the benefits of each.
Cloud Bigtable
Properties: Important features:

● Cloud Bigtable is meant for high ● Schema design and time-series support
throughput data
● Access control
● Millisecond latency, NoSQL
● Performance design
● Access is designed to optimize for a
● Choosing between SSD and HDD
range of Row Key preﬁxes

Cloud Bigtable is meant for high throughput data, it has millisecond latency, so it is
much faster than BigQuery, for example. It is NoSQL. So this is good for a columnar
store.

When would you want to select an SSD for the machines in the cluster rather than
HDD? I'd guess if you needed faster performance.
Cloud Bigtable: High throughput data where access
is primarily for a range of Row Key prefixes

Row key Column data

NASDAQ#ZXZZT#14265 MD:SYMBO MD:LASTSA MD:LASTSI MD:TRADET MD:EXCHA

35612045 L:ZXZZT LE:600.58 ZE:300 IME:142653 NGE:NASD
5612045 AQ

... ... ... ... ... ...

Cloud Bigtable will automatically

Store changes as new rows
compact the table

This example shows stock trades.

Each trade is its own row. This will result in 100s of millions of rows per day. Which is
fine with Cloud Bigtable.
Cloud Spanner
Properties: Important features:

● Global, fully managed, relational ● Schema design, Data Model, and updates
database with transactional
● Secondary indexes
consistency.
● Timestamp bounds and
● Data in Cloud Spanner is strongly typed:
Commit timestamps
you must deﬁne a schema for each
database, and that schema must ● Data types
specify the data types of each column
● Transactions
of each table.

Cloud Spanner is strongly typed and globally consistent. The two characteristics that
distinguish it from Cloud SQL are globally consistent transactions and size. Cloud
Spanner can work with much larger databases than Cloud SQL.
Use Cloud Spanner if you need globally consistent
data or more than one Cloud SQL instance
Mean Latency as Throughput Increases A Latency (ms)
A B Throughput (queries per second)
24 ━ MySQL (mean)
━ spanner 9 nodes (mean)
━ spanner 15 nodes (mean)
18
━ spanner 30 nodes (mean)

0 B
4000 6000 8000 10.000 20.000

Cloud SQL is fine if you can get by with a single database. But if your needs are such
that you need multiple databases, Cloud Spanner is a great choice.

MySQL hits a performance wall. If you look at the 99th percentile of latency, it is clear
that performance degrades. Distributing MySQL is hard. However, Spanner distributes
easily (even globally) and provides consistent performance. To support more
throughput, just add more nodes.
Firestore is the new version of Datastore
Properties: Important features:
● NoSQL document database ● Effortlessly scales up or down to meet
● ACID transactions demand

● High availability of reads and writes ● No maintenance downtime

● Massive scalability with high performance ● Built-in live synchronization and oﬄine
mode for multi-user, collaborative mobile
● Flexible storage and querying of data
and web applications
● Fully managed, serverless
● Introduces several improvements over
Datastore

Datastore is a NoSQL solution that used to be private to App Engine. It offers many
features that are mainly useful to applications, such as persisting state information. It
is now available to clients besides App Engine.
Comparing storage options: Use cases
Cloud Cloud Cloud
Firestore Cloud SQL BigQuery
Bigtable Storage Spanner

Type NoSQL NoSQL Blobstore Relational SQL Relational SQL Relational SQL
document wide column for OLTP for OLTP for OLAP

Best Using “Flat” data, Structured and Web Large-scale Interactive

for mobile/client Heavy unstructured frameworks, database querying,
libraries to read/write, binary or existing applications (> oﬄine
read and add events, object data applications ~2 TB) analytics
data analytical data

Use Mobile and AdTech, Images, large User Whenever high Data
cases web Financial and media ﬁles, credentials, I/O, global warehousing
applications IoT data backups customer consistency is
with orders needed
transactions

Commit this table to memory, and be able to use it backwards. Example: If the exam
question contains "Data Warehouse" you should be thinking "BigQuery is a
candidate".

If the case says something about large media files, you should immediately be
thinking Cloud Storage.
Building and
Operationalizing
Pipelines

Tom Stern

The next section is on Building and Maintaining pipelines. We've already covered a lot
of this information in the design section.
Dataflow does batch and streaming
Stream 9:00am

Simple Windows

Time-based Windows (shuffle)

Apache Beam

8:00am 9:00am 10:00am 11:00am

Apache Beam is an open programming platform for unifying batch and streaming.

Before Apache Beam, you needed two pipelines to balance latency, throughput, and
fault tolerance.

Dataflow is Apache Beam as a service; a fully-managed autoscaling service that runs

Beam pipelines.

Continuous data can arrive out of order. Simple windowing can separate related
events into independent windows, losing relationship information.

Time-based windowing (shuffling) overcomes this limitation.

Dataflow solves many stream processing issues

Scalability and Programming

Size Unbounded data
Fault-tolerance Model

Autoscaling and On-demand and Eﬃcient pipelines Windowing,

rebalancing handles distribution of (Apache Beam) + triggering,
variable volumes of processing scales eﬃcient execution incremental
data and growth. with fault tolerance. (Dataﬂow). processing, and
out-of-order/late data
are addressed in the
streaming model.

Dataflow resources are deployed on demand, per job, and work is constantly
rebalanced across resources

Dataflow solves many stream processing issues, including changes in size (spikes)
and growth over time. It can scale while remaining fault tolerant. And it has a flexible
programming model and methods to work with data arriving late or out of order.
Dataflow windowing for streams

To compute averages on streaming data, we need to bound the computation within

time windows.

Triggering controls how results are delivered to the next transforms in the pipeline.

Watermark is a heuristic that tracks how far behind the system is in processing data
from the event time. Where in event time does processing occur?

Fixed, sliding, and session-based windows.

Updated results (late), or speculative results (early).

All data processing is behind or lags events simply due to latency in the delivery of
the event message.

Windowing is too complicated to explain here. I just want to highlight that you might
need to know it, so make sure you understand it.

There really is no replacement for the Dataflow windowing capability for streaming
data.
Windows are the answer to "Where in event time?"
Windowing divides data into event time–based ﬁnite chunks.

Fixed Sliding Sessions

3
2 4
1 2 3 4 1 1 3

Key 1

Key 2

Key 3

5 2
4
Time
Often required when doing aggregations over unbounded data.

Windowing creates individual results for different slices of event time.

Windowing divides a PCollection up into finite chunks based on the event time of
each message. It can be useful in many contexts but is required when aggregating
over infinite data.

Do you know the basic windowing methods, including Fixed time (such as a daily
window), Sliding and overlapping windows (such as the last 24 hours), and
session-based windows that are triggered to capture bursts of activity?
Side inputs in Dataflow
Get Messages Side Input

Pub/Sub
Extract Data

Pipeline to detect accidents Window

DetectAccidents uses the average
speed at each location as a side input. Average

Detect Accidents

BigQuery

Remember to study side-inputs. If you understand side-inputs you will necessarily

understand many dependent concepts that are part of Dataflow.
Building and
Operationalizing
Processing
Infrastructure

Tom Stern

We have covered a lot of information about the processing infrastructure already. Just
a few points about building and maintaining.
Building a streaming pipeline

Stream from Pub/Sub into BigQuery; BigQuery can provide streaming ingest to
unbounded datasets.

Your project can stream up to 300 MB per second in all locations except the us and
eu multi-regions, where your project can stream up to 1 GB per second.

Pub/Sub guarantees delivery, but not the order of messages. "At least once" means
that repeated delivery of the same message is possible.

Dataﬂow stream processing can remove duplicates based on internal Pub/Sub ID

and can work with out-of-order messages when computing aggregates.

All data processing is behind or lags events simply due to latency in the delivery of
the event message.

You can stream unbounded data into BigQuery.

Pub/Sub guarantees delivery but might deliver the messages out of order.

If you have a timestamp, then Dataflow can remove duplicates and work out the order
of messages.
Data processing solutions
Ingest/Capture Process Store Analyze Visualize

App Engine Batch third-party tools:

Tableau Qlik iCharts Data
Stream Analysts
Cloud Dataflow BigQuery Storage BigQuery
Logging (tables) Analytics (SQL)

Google Data
Analytics 360 Scientists

Pub/Sub Cloud Storage

Dataproc Notebooks
(objects)
Smart
Cloud Apps
Monitoring

BigQuery is an inexpensive data store for tabular data. It is cost-comparable with

Cloud Storage, so it makes sense to ingest into BigQuery and leave the data there.

This diagram is useful because it shows the progression and options for input and
visualization on the edges of the common solution design.
Scaling streaming beyond BigQuery

BigQuery Cloud Bigtable

Easy, inexpensive Low latency/high-throughput
● latency in order of seconds ● 100,000 QPS at 6ms latency
● 100k rows/second streaming for a 10-node cluster

Why Bigtable and not Cloud Spanner? Cost! Note that we can support 100,000 qps
with 10 nodes in Bigtable, but would need ~150 nodes in Cloud Spanner.
Scenario #1
Question

An application that relies on Cloud SQL to read infrequently changing data is predicted
to grow dramatically. How can you increase capacity for more read-only clients?

A. Conﬁgure high availability on the primary node.

B. Establish an external replica in the customer's data center.

C. Use backups, so you can restore if there is an outage.

D. Conﬁgure read replicas.

Time for some more practice exam questions. Here's the first one.

An application that relies on Cloud SQL to read infrequently changing data is

predicted to grow dramatically. How can you increase capacity for more read-only
clients?

● Configure high availability on the primary node.

● Establish an external replica in the customer's data center.
● Use backups, so you can restore if there is an outage.
● Configure read replicas.

Do you have your answer?

Scenario #1
Answer

An application that relies on Cloud SQL to read infrequently changing data is predicted
to grow dramatically. How can you increase capacity for more read-only clients?

A. Conﬁgure high availability on the primary node.

B. Establish an external replica in the customer's data center.

C. Use backups, so you can restore if there is an outage.

D. Conﬁgure read replicas.

The answer is D, configure Read Replicas.

Do you know why?

Scenario #1
Rationale

D is correct.

A: High Availability does nothing to improve

throughput; it makes the service more accessible.

B: An external replica is more of a backup/D.R.

activity; it doesn't add to throughput on the cloud.

C: Backups would not make sense in this scenario.

https://cloud.google.com/sql/docs/mysql/replicat
ion/tips#read-replica

The clue is that the clients are read-only and need the challenge is scale.

Read Replicas increase capacity for simultaneous reads.

Note that a high availability configuration wouldn't help in this scenario because it
would not necessarily increase throughput.

Ready for another question?

Scenario #2
Question

Your client wants a transactionally consistent global relational repository.

You need to be able to monitor and adjust node count for unpredictable traﬃc spikes.

A. Use Cloud Spanner. Monitor storage usage and increase node count if more
than 70% utilized.
B. Use Cloud Spanner. Monitor CPU utilization and increase node count if more
than 70% utilized for your time span.
C. Use Cloud Bigtable. Monitor data stored and increase node count if more
than 70% utilized.
D. Use Cloud Bigtable. Monitor CPU utilization and increase node count if more
than 70% utilized for your time span.

Your client wants a transactionally consistent global relational repository where they
can monitor and adjust node count for unpredictable traffic spikes.

● Use Cloud Spanner. Monitor storage usage and increase node count if more
than 70% utilized.
● Use Cloud Spanner. Monitor CPU utilization and increase node count if more
than 70% utilized for your time span.
● Use Cloud Bigtable. Monitor data stored and increase node count if more than
70% utilized.
● Use Cloud Bigtable. Monitor CPU utilization and increase node count if more
than 70% utilized for your time span.

Got your answer?

Scenario #2
Answer

Your client wants a transactionally consistent global relational repository.

You need to be able to monitor and adjust node count for unpredictable traﬃc spikes.

The answer is B. Use Cloud Spanner, monitor CPU utilization and increase the
number of nodes as needed.
Scenario #2
Rationale

B is correct because of the requirement to globally

scale transactions—use Cloud Spanner. CPU
utilization is the recommended metric for scaling,
per Google best practices, linked below.
A is not correct because you should not use storage
utilization as a scaling metric.
C, D are not correct because you should not use
Cloud Bigtable for this scenario.

https://cloud.google.com/spanner/docs/monitoring
https://cloud.google.com/bigtable/docs/monitorin
g-instance

B is correct because of the requirement to globally scalable transactions, therefore

use Cloud Spanner. CPU utilization is the recommended metric for scaling, per
Google best practices.
Data Engineer Case Study 01:
A customer had this interesting business requirement...

A daily reporting pipeline with multiple sources and complex dependencies

Human intervention to data check quality, inputs, and proceed to next stage
Need daily updates on yesterday’s data, but takes >24 hours to run.

Integrated Model Calculations Reporting Daily report

Review
ETL ETL ETL and challenge

Case Study 01

I worked with a client that had a very complex reporting pipeline. They had a very
large amount of data that had to be processed into a report that was going to
regulators on a daily basis. They had to demonstrate that the risk in the financial data
from the previous day indicated that they were following the regulatory rules. They put
the data through multiple systems or stages. Each stage had a separate
Extract-Transform-Load sequence and then performed unique processing on the
data.

A customer had this interesting business requirement...

A daily reporting pipeline with multiple sources and complex dependencies

● Human intervention to data check quality, inputs, and proceed to next stage
● Need daily updates on yesterday’s data, but takes >24 hours to run

The complexity in the simplified diagram makes it look like it was a linear progression
from start-to-finish. But this is meant to be symbolic. The actual processes were much
more complicated and it took 30 hours from start to finish to generate one report. The
processes were actually more of a spider-web, with many dependencies. So one part
would run and then halt, waiting until other dependent parts were complete before
proceeding. And there were some processes that could run in parallel.
Data Engineer Case Study 01:
We mapped that to technical requirements like this...

BigQuery and Cloud Composer (aka Apache Airﬂow)

BigQuery: Reduce overall time to run with BigQuery as data warehouse and analytics engine.

Apache Airﬂow: Control to automate pipeline, handle dependencies as code, start query
when preceding queries were done.

First we diagrammed out all the processes. And then we started to look at how to
implement this on Google Cloud using the available services. We initially considered
using Dataproc or Dataflow. However, the customer already had analysts that were
familiar with BigQuery and SQL. So if we developed in BigQuery it was going to make
the solution more maintainable and usable to the group. If we developed in Dataproc,
for example, they would have had to rely on another team that had Spark
programmers. So this is an example where the technical solution was influenced by
the business context.

To make this solution work, we needed some automation. And for that we chose
Apache Airflow. In the original design we ran Airflow on a Compute Engine instance.
You might be familiar with the Google service called Cloud Composer, which provides
a managed Apache Airflow service. Cloud Composer was not yet available when we
began the design.

BigQuery and Cloud Composer (aka Apache Airflow)

BigQuery: Reduce overall time to run with BQ as data warehouse and analytics
engine
Apache Airflow: Control to automate pipeline, handle dependencies as code, start
query when preceding queries were done.
Data Engineer Case Study 01:
And this is how we implemented that technical requirement
Common data warehouse in BigQuery. Apache Airﬂow to automate query dependencies.
On-premises

Data Sources Orchestration

(Airﬂow)
Compute
Engine
Daily report

Dedicated BigQuery BigQuery

Interconnect

Data ETL
Warehouse

And this is how we implemented that technical requirement. Common data

warehouse in BigQuery. Apache Airflow to automate query dependencies

In this particular case we used open source Apache Airflow. But if we were
implementing it today we would use Cloud Composer.

Cloud Composer / Apache Airflow allowed us to establish the dependencies between

different queries that existed in the original reporting process. BigQuery served as
both the data storage solution and the data processing / query solution.

We were able to implement all their processing as SQL queries in BigQuery. And we
were able to implement all the dependencies through Airflow.

One of the time-sinks in their original process had to do with that 30 hour
start-to-finish window. What they would do is start processing jobs and sometimes
they would fail because the data from a previous dependency wasn't yet available.
And they had a manual process for restarting those jobs. We were able to automate
away that toil and the re-work by implementing the logic in Apache Airflow.
Challenge Lab 01
PDE Prep—BigQuery Essentials:
Challenge Lab

Estimated completion: 30 min

Lab expires after: 2 hours

A Challenge Lab has minimal instructions. It explains the circumstance and the
expected results; you have to figure out how to implement them.

This is a timed lab.

The lab can be completed in 30 minutes.
The Qwiklabs system will track and verify your progress in completing the lab.

PL-300 Study Guide PDF
100% (3)
PL-300 Study Guide PDF
2 pages
Google - Cloud Digital Leader.v2023 06 22.q106
No ratings yet
Google - Cloud Digital Leader.v2023 06 22.q106
50 pages
Exam Overview: GCP Data Engineer
No ratings yet
Exam Overview: GCP Data Engineer
12 pages
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
From Everand
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
alasdair gilchrist
5/5 (1)
Assignment 1 17bcs2733
No ratings yet
Assignment 1 17bcs2733
22 pages
Storage_and_Database_Services GCP
No ratings yet
Storage_and_Database_Services GCP
69 pages
Week 5 GCP Notes
No ratings yet
Week 5 GCP Notes
5 pages
GCP Technologies
No ratings yet
GCP Technologies
12 pages
05 Data Storage Services
No ratings yet
05 Data Storage Services
75 pages
GCP Storage
No ratings yet
GCP Storage
12 pages
GC Week 3
No ratings yet
GC Week 3
1 page
dc201 Choosing The Right Database On Google Cloud
No ratings yet
dc201 Choosing The Right Database On Google Cloud
30 pages
2.2 Storage and Database Services
No ratings yet
2.2 Storage and Database Services
64 pages
2.2 Storage and Database Services
No ratings yet
2.2 Storage and Database Services
64 pages
Exam Overview: GCP Data Engineer
100% (5)
Exam Overview: GCP Data Engineer
12 pages
Seminar Nosql
No ratings yet
Seminar Nosql
56 pages
GCP Notes For Certification
No ratings yet
GCP Notes For Certification
24 pages
CloudComputing DATABASE
No ratings yet
CloudComputing DATABASE
27 pages
Google Cloud Database Engineer Exam Prep Sheet
No ratings yet
Google Cloud Database Engineer Exam Prep Sheet
8 pages
3.2 - Data Storage Services
No ratings yet
3.2 - Data Storage Services
98 pages
M4 _ T-GCPFCI-B _ Core Infrastructure v5.1.0 _ ILT
No ratings yet
M4 _ T-GCPFCI-B _ Core Infrastructure v5.1.0 _ ILT
52 pages
Seminar Nosql
No ratings yet
Seminar Nosql
59 pages
6th Choosing Your Database
No ratings yet
6th Choosing Your Database
11 pages
1.1 GCP - Storage - Options PDF
No ratings yet
1.1 GCP - Storage - Options PDF
20 pages
Cloud Databases 1
No ratings yet
Cloud Databases 1
23 pages
Ace3 HTML
No ratings yet
Ace3 HTML
41 pages
41 NoSQL Introduction.pptx
No ratings yet
41 NoSQL Introduction.pptx
18 pages
Bcse302l Dbms Module-7 Nosql
No ratings yet
Bcse302l Dbms Module-7 Nosql
30 pages
Lecture 1
No ratings yet
Lecture 1
31 pages
Week 5 GCP Lec Notes
No ratings yet
Week 5 GCP Lec Notes
13 pages
More About Google Compute Engine: Disk Packages On First Boot, You Can Do It With Start-Up Scripts. There Are Also
No ratings yet
More About Google Compute Engine: Disk Packages On First Boot, You Can Do It With Start-Up Scripts. There Are Also
3 pages
Week 6 GCP Notes
No ratings yet
Week 6 GCP Notes
4 pages
04 Choosing Storage Solutions
No ratings yet
04 Choosing Storage Solutions
29 pages
NoSql 2024 Assign2
No ratings yet
NoSql 2024 Assign2
189 pages
Introduction To NoSQL
No ratings yet
Introduction To NoSQL
29 pages
Cheat Sheet v2
No ratings yet
Cheat Sheet v2
3 pages
Lecture 1 - NoSQL
No ratings yet
Lecture 1 - NoSQL
31 pages
20 - 04 - 2024 Cheatsheet
No ratings yet
20 - 04 - 2024 Cheatsheet
3 pages
04 Surveys Cattell PDF
No ratings yet
04 Surveys Cattell PDF
16 pages
09 - Cloud-Enabling Technologies - v2
No ratings yet
09 - Cloud-Enabling Technologies - v2
45 pages
The Snowflake Handbook: Optimizing Data Warehousing and Analytics
From Everand
The Snowflake Handbook: Optimizing Data Warehousing and Analytics
Robert Johnson
No ratings yet
Storage Options for Transformed Data
No ratings yet
Storage Options for Transformed Data
3 pages
Mastering Delta Lake: Optimizing Data Lakes for Performance and Reliability
From Everand
Mastering Delta Lake: Optimizing Data Lakes for Performance and Reliability
Robert Johnson
No ratings yet
NOSQL Lecture 1 Notes
No ratings yet
NOSQL Lecture 1 Notes
31 pages
Unit 2 - BD - Big Data Technology Foundations
No ratings yet
Unit 2 - BD - Big Data Technology Foundations
76 pages
Unit3 - Cloud Data Storage
No ratings yet
Unit3 - Cloud Data Storage
7 pages
dc200 Google Clouds Database Strategy Roadmap
No ratings yet
dc200 Google Clouds Database Strategy Roadmap
30 pages
No SQL
No ratings yet
No SQL
109 pages
Cs 620 / Dasc 600 Introduction To Data Science & Analytics: Lecture 6-Nosql
No ratings yet
Cs 620 / Dasc 600 Introduction To Data Science & Analytics: Lecture 6-Nosql
31 pages
Week 7 GCP Notes
No ratings yet
Week 7 GCP Notes
4 pages
CIS - 468 - 04 - NOSQL Databases and Big Data Storage Systems
No ratings yet
CIS - 468 - 04 - NOSQL Databases and Big Data Storage Systems
102 pages
4 - Key-Value Storage
No ratings yet
4 - Key-Value Storage
109 pages
MODULE 3
No ratings yet
MODULE 3
37 pages
Big Data Storage and Processing
No ratings yet
Big Data Storage and Processing
49 pages
Unit 2 - BD - Big Data Technology Foundations
No ratings yet
Unit 2 - BD - Big Data Technology Foundations
44 pages
Google Cloud Services
No ratings yet
Google Cloud Services
27 pages
Bda Notes (Unit-2)
No ratings yet
Bda Notes (Unit-2)
26 pages
05_Storage_and_Database_Services
No ratings yet
05_Storage_and_Database_Services
74 pages
Cheat Sheet v4
No ratings yet
Cheat Sheet v4
3 pages
Associate Cloud Engineer - Study Notes
No ratings yet
Associate Cloud Engineer - Study Notes
14 pages
Introduction to NoSQL
No ratings yet
Introduction to NoSQL
13 pages
CC - Lecture 6-Data
No ratings yet
CC - Lecture 6-Data
44 pages
Overview of Transaction Processing and Enterprise Resource Planning Systems
No ratings yet
Overview of Transaction Processing and Enterprise Resource Planning Systems
17 pages
Test2 chap5678 HonsDBMS-11feb25
No ratings yet
Test2 chap5678 HonsDBMS-11feb25
1 page
N Chandrasekhar - Mainframe - Resume
No ratings yet
N Chandrasekhar - Mainframe - Resume
2 pages
Spencer Kane Kennelly, Exp. Associate: Data Standardization and Master/Reference Data Management
No ratings yet
Spencer Kane Kennelly, Exp. Associate: Data Standardization and Master/Reference Data Management
1 page
Program 1
No ratings yet
Program 1
3 pages
References Between Tables - AppSheet Help Center
No ratings yet
References Between Tables - AppSheet Help Center
4 pages
Passing Multiple Parameters in SSRS
No ratings yet
Passing Multiple Parameters in SSRS
2 pages
Introduction To AWS
No ratings yet
Introduction To AWS
8 pages
BDA Presentations
No ratings yet
BDA Presentations
26 pages
Jan Feb 2023
No ratings yet
Jan Feb 2023
2 pages
Android Sqlite Example Application
No ratings yet
Android Sqlite Example Application
10 pages
Rajat Awasthi-
No ratings yet
Rajat Awasthi-
2 pages
Introduction To Transact-SQL
No ratings yet
Introduction To Transact-SQL
22 pages
Data Analyst Job Roles
No ratings yet
Data Analyst Job Roles
3 pages
BI PRACTICAL
No ratings yet
BI PRACTICAL
60 pages
Dbms Lab (R22a0584)
No ratings yet
Dbms Lab (R22a0584)
108 pages
Recent Advances in Text-To-SQL- A Survey of What We Have and What We Expect
No ratings yet
Recent Advances in Text-To-SQL- A Survey of What We Have and What We Expect
22 pages
Oracle Database 21c - Data Warehousing
No ratings yet
Oracle Database 21c - Data Warehousing
37 pages
Interface Python With MySQL-1
No ratings yet
Interface Python With MySQL-1
6 pages
DBMS Important Questions
No ratings yet
DBMS Important Questions
7 pages
Data Structures and Algorithms: (CS210/ESO207/ESO211)
No ratings yet
Data Structures and Algorithms: (CS210/ESO207/ESO211)
35 pages
DB Ass2
No ratings yet
DB Ass2
3 pages
RDBMS Full Programs
No ratings yet
RDBMS Full Programs
2 pages
Lecture Notes Course Outcome 1 & Session 4 Topic: SFS File System Implementation
No ratings yet
Lecture Notes Course Outcome 1 & Session 4 Topic: SFS File System Implementation
8 pages
CERTYIQ-DP900 Part6
No ratings yet
CERTYIQ-DP900 Part6
28 pages
ADO NET Entity Framework
No ratings yet
ADO NET Entity Framework
66 pages
Addmrpt 1 36557 36558
No ratings yet
Addmrpt 1 36557 36558
5 pages
3 Retrieval Models
No ratings yet
3 Retrieval Models
87 pages

OD 03 PDE Building and Operationalizing Data Processing Systems

Uploaded by

OD 03 PDE Building and Operationalizing Data Processing Systems

Uploaded by

Building and

You can familiarize yourself with this diagram as well.

● BigQuery is recommended as a data warehouse.

Last name First name

Last First Street LOC LOC City State Country

Pick two: Consistency Availability Partition Tolerance

● Persistent storage ● Versioning ● Traﬃc estimation

Cloud SQL is the managed service that provides a MySQL instance.

Row key Column data

NASDAQ#ZXZZT#14265 MD:SYMBO MD:LASTSA MD:LASTSI MD:TRADET MD:EXCHA

... ... ... ... ... ...

Cloud Bigtable will automatically

This example shows stock trades.

● High availability of reads and writes ● No maintenance downtime

Best Using “Flat” data, Structured and Web Large-scale Interactive

Time-based Windows (shuffle)

8:00am 9:00am 10:00am 11:00am

Dataflow is Apache Beam as a service; a fully-managed autoscaling service that runs

Time-based windowing (shuffling) overcomes this limitation.

Scalability and Programming

Autoscaling and On-demand and Eﬃcient pipelines Windowing,

To compute averages on streaming data, we need to bound the computation within

Fixed, sliding, and session-based windows.

Updated results (late), or speculative results (early).

Fixed Sliding Sessions

Windowing creates individual results for different slices of event time.

Pipeline to detect accidents Window

Remember to study side-inputs. If you understand side-inputs you will necessarily

Dataﬂow stream processing can remove duplicates based on internal Pub/Sub ID

You can stream unbounded data into BigQuery.

App Engine Batch third-party tools:

Pub/Sub Cloud Storage

BigQuery is an inexpensive data store for tabular data. It is cost-comparable with

BigQuery Cloud Bigtable

A. Conﬁgure high availability on the primary node.

B. Establish an external replica in the customer's data center.

C. Use backups, so you can restore if there is an outage.

D. Conﬁgure read replicas.

An application that relies on Cloud SQL to read infrequently changing data is

● Configure high availability on the primary node.

Do you have your answer?

A. Conﬁgure high availability on the primary node.

B. Establish an external replica in the customer's data center.

C. Use backups, so you can restore if there is an outage.

D. Conﬁgure read replicas.

The answer is D, configure Read Replicas.

Do you know why?

A: High Availability does nothing to improve

B: An external replica is more of a backup/D.R.

C: Backups would not make sense in this scenario.

Read Replicas increase capacity for simultaneous reads.

Ready for another question?

Your client wants a transactionally consistent global relational repository.

Got your answer?

Your client wants a transactionally consistent global relational repository.

B is correct because of the requirement to globally

B is correct because of the requirement to globally scalable transactions, therefore

A daily reporting pipeline with multiple sources and complex dependencies

Integrated Model Calculations Reporting Daily report

A customer had this interesting business requirement...

A daily reporting pipeline with multiple sources and complex dependencies

BigQuery and Cloud Composer (aka Apache Airﬂow)

BigQuery and Cloud Composer (aka Apache Airflow)

Data Sources Orchestration

Dedicated BigQuery BigQuery

And this is how we implemented that technical requirement. Common data

Cloud Composer / Apache Airflow allowed us to establish the dependencies between

Estimated completion: 30 min

This is a timed lab.

You might also like