GCP ACE Notes 3

This presentation provides an overview of the Google Cloud Associate Cloud Engineer certification exam. It discusses the topics covered in the exam, including setting up cloud environments, planning and configuring cloud solutions, deploying and implementing cloud solutions, and ensuring security and operations of cloud solutions. It also provides resources for the exam guide, sample questions, and information on how to access training and vouchers through the Partner Certification Academy website.

Uploaded by

Mohan Muddaliar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
503 views130 pages

GCP ACE Notes 3

This presentation provides an overview of the Google Cloud Associate Cloud Engineer certification exam. It discusses the topics covered in the exam, including setting up cloud environments, planning and configuring cloud solutions, deploying and implementing cloud solutions, and ensuring security and operations of cloud solutions. It also provides resources for the exam guide, sample questions, and information on how to access training and vouchers through the Partner Certification Academy website.

Uploaded by

Mohan Muddaliar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 130

Proprietary + Confidential

Partner Certification Academy

Associate Cloud Engineer

pls-academy-ace-student-slides-4-2303
Proprietary + Confidential

The information in this presentation is classified:

Google confidential & proprietary


⚠ This presentation is shared with you under NDA.

● Do not record or take screenshots of this presentation.

● Do not share or otherwise distribute the information in this presentation with anyone inside or outside of your organization.

Thank you!
Proprietary + Confidential

Session logistics
● When you have a question, please:
○ Click the Raise hand button in Google Meet.
○ Or add your question to the Q&A section of Google Meet.
○ Please note that answers may be deferred until the end of the session.

● These slides are available in the Student Lecture section of your Qwiklabs classroom.

● The session is not recorded.

● Google Meet does not have persistent chat.


○ If you get disconnected, you will lose the chat history.
○ Please copy any important URLs to a local text file as they appear in the chat.
Proprietary + Confidential

Program issues or concerns?

● Problems with accessing Cloud Skills Boost for Partners


[email protected]

● Problems with a lab (locked out, etc.)


[email protected]

● Problems with accessing Partner Advantage


○ https://support.google.com/googlecloud/topic/9198654
Proprietary + Confidential

The Google Cloud Certified Associate Cloud Engineer exam assesses your ability to:

● Set up a cloud solution environment
● Plan and configure a cloud solution
● Deploy and implement a cloud solution
● Ensure successful operation of a cloud solution
● Configure access and security

For more information:


https://cloud.google.com/certification/cloud-engineer

Associate Cloud Engineer
https://cloud.google.com/certification/cloud-engineer

Exam Guide
https://cloud.google.com/certification/guides/cloud-engineer

Sample Questions
https://docs.google.com/forms/d/e/1FAIpQLSfexWKtXT2OSFJ-obA4iT3GmzgiOCGvjrT9OfxilWC1yPtmfQ/viewform
Proprietary + Confidential

Learning Path - Partner Certification Academy Website


Go to: https://rsvp.withgoogle.com/events/partner-learning/google-cloud-certifications

(Screenshot of the website: a "Click here" callout next to the Professional Cloud Architect certification entry.)

Proprietary + Confidential

(Screenshot callout: "Needed for Exam Voucher".)

Proprietary + Confidential

Associate Cloud Engineer (ACE) Exam Guide


Each module of this course covers Google Cloud
services based on the topics in the ACE Exam Guide

The primary topics are:

● Compute Engine
● VPC Networks
● Google Kubernetes Engine
● Cloud Run, Cloud Functions and App Engine
● Storage and database options (Next discussion)
● Resource Hierarchy / Identity and Access Management (IAM)
● Logging and Monitoring

https://cloud.google.com/certification/guides/cloud-engineer/
Proprietary + Confidential

02  Storage and database options
Proprietary + Confidential

Exam Guide Overview - Storage and Data Transfer Options


2.3 Planning and configuring data storage options, including:
2.3.1 Product choice (e.g., Cloud SQL, BigQuery, Firestore, Cloud Spanner, Cloud Bigtable)
2.3.2 Choosing storage options (e.g., Zonal persistent disk, Regional balanced persistent disk, Standard, Nearline, Coldline, Archive)

3.4 Deploying and implementing data solutions. Tasks include:
3.4.1 Initializing data systems with products (e.g., Cloud SQL, Firestore, BigQuery, Cloud Spanner, Pub/Sub, Cloud Bigtable, Dataproc, Dataflow, Cloud Storage)
3.4.2 Loading data (e.g., command line upload, API transfer, import/export, load data from Cloud Storage, streaming data to Pub/Sub)

4.4 Managing storage and database solutions. Tasks include:
4.4.1 Managing and securing objects in and between Cloud Storage buckets
4.4.2 Setting object life cycle management policies for Cloud Storage buckets
4.4.3 Executing queries to retrieve data from data instances (e.g., Cloud SQL, BigQuery, Cloud Spanner, Datastore, Cloud Bigtable)
4.4.4 Estimating costs of data storage resources
4.4.5 Backing up and restoring database instances (e.g., Cloud SQL, Datastore)
4.4.6 Reviewing job status in Dataproc, Dataflow, or BigQuery

(Products pictured on the slide: Cloud Storage, Storage Transfer Service, Transfer Appliance, Filestore, Cloud SQL, Cloud Spanner, Cloud Bigtable, Firestore, Memorystore, BigQuery)
Proprietary + Confidential

Exam Guide - Storage options


2.3 Planning and configuring data storage options. Considerations include:
2.3.1 Product choice (e.g., Cloud SQL, BigQuery, Firestore, Cloud Spanner, Cloud Bigtable)
2.3.2 Choosing storage options (e.g., Zonal persistent disk, Regional balanced persistent disk, Standard,
Nearline, Coldline, Archive)

3.4 Deploying and implementing data solutions. Tasks include:


3.4.1 Initializing data systems with products (e.g., Cloud SQL, Firestore, BigQuery, Cloud Spanner, Pub/Sub,
Cloud Bigtable, Dataproc, Dataflow, Cloud Storage)
3.4.2 Loading data (e.g., command line upload, API transfer, import/export, load data from
Cloud Storage, streaming data to Pub/Sub)

4.4 Managing storage and database solutions. Tasks include:


4.4.1 Managing and securing objects in and between Cloud Storage buckets
4.4.2 Setting object life cycle management policies for Cloud Storage buckets
4.4.3 Executing queries to retrieve data from data instances (e.g., Cloud SQL, BigQuery,
Cloud Spanner, Datastore, Cloud Bigtable)
4.4.4 Estimating costs of data storage resources
4.4.5 Backing up and restoring database instances (e.g., Cloud SQL, Datastore)
4.4.6 Reviewing job status in Dataproc, Dataflow, or BigQuery
Storage Overview
Proprietary + Confidential

(Diagram: a map of storage options in Google Cloud. Callouts: "Compute Engine topics" and "Object storage is covered next".)


Proprietary + Confidential

Storage and database services

Object: Cloud Storage. Good for binary or object data, such as images, media serving, backups.
File: Filestore. Good for Network Attached Storage (NAS), such as latency-sensitive workloads.
Relational: Cloud SQL. Good for web frameworks, such as CMS, eCommerce.
Relational: Cloud Spanner. Good for RDBMS + scale, HA, HTAP, such as user metadata, Ad/Fin/MarTech.
Non-relational: Firestore. Good for hierarchical, mobile, web data, such as user profiles, game state.
Non-relational: Cloud Bigtable. Good for heavy read + write, events, such as AdTech, financial, IoT.
Warehouse: BigQuery. Good for enterprise data warehouse, such as analytics, dashboards.
In memory: Memorystore. Good for caching for Web/Mobile apps, such as game state, user sessions.

(Next discussion: Cloud Storage)

This table shows the storage and database services and highlights the storage
service type (object, file, relational, non-relational, and data warehouse), what each
service is good for, and intended use.
Proprietary + Confidential

All you need to know about Cloud Storage

https://cloud.google.com/blog/topics/developers-practitioners/all-you-need-know-about-cloud-storage
Proprietary + Confidential

Cloud Storage is a fully managed storage service

Binary large-object (BLOB) storage used for

Online content

Backup and archiving

Storage of intermediate results

And much more….

Object storage for companies of all sizes


https://cloud.google.com/storage

Cloud Storage’s primary use is whenever binary large-object storage (also known as a
“BLOB”) is needed for online content such as videos and photos, for backup and
archived data, and for storage of intermediate results in processing workflows.
Proprietary + Confidential

Files are organized into buckets

Geographic location

Unique name

Cloud Storage files are organized into buckets. A bucket needs a globally-unique
name and a specific geographic location for where it should be stored, and an ideal
location for a bucket is where latency is minimized. For example, if most of your users
are in Europe, you probably want to pick a European location so either a specific
Google Cloud region in Europe, or else the EU multi-region.
Proprietary + Confidential

Cloud Storage Classes - Options for any use case

Standard: in multi-region locations for serving content globally, or in regional or dual-regional locations for data accessed frequently or with high throughput needs. Examples: streaming videos, images, websites, documents; video transcoding, genomics, general data analytics & compute.

Nearline: for data accessed less than once a month. Examples: serving rarely accessed docs, backup.

Coldline: for data accessed roughly less than once a quarter. Examples: rarely used data, disaster recovery.

Archive: for long-term retention. Examples: regulatory archives, tape replacement, movie archive.

There are four primary storage classes in Cloud Storage, and stored data is managed and billed according to the class it belongs to.

The first is Standard Storage. Standard Storage is considered best for frequently
accessed, or “hot,” data. It’s also great for data that is stored for only brief periods of
time.

The second storage class is Nearline Storage. This is best for storing infrequently
accessed data, like reading or modifying data once per month or less, on average.
Examples might include data backups, long-tail multimedia content, or data archiving.

The third storage class is Coldline Storage. This is also a low-cost option for storing
infrequently accessed data. However, as compared to Nearline Storage, Coldline
Storage is meant for reading or modifying data, at most, once every 90 days.

The fourth storage class is Archive Storage. This is the lowest-cost option, used
ideally for data archiving, online backup, and disaster recovery. It’s the best choice for
data that you plan to access less than once a year, because it has higher costs for
data access and operations and a 365-day minimum storage duration.
Proprietary + Confidential

Characteristics applicable to all storage classes

Although each of these four classes has differences, it’s worth noting that several
characteristics apply across all these storage classes.
These include:
● Unlimited storage with no minimum object size requirement,
● Worldwide accessibility and locations,
● Low latency and high durability,
● A uniform experience, which extends to security, tools, and APIs, and
● Geo-redundancy if data is stored in a multi-region or dual-region. This means placing physical servers in geographically diverse data centers to protect against catastrophic events and natural disasters, and load-balancing traffic for optimal performance.
Proprietary + Confidential

Additional Cloud Storage features

Cloud Storage has no minimum fee because you pay only for what you use, and prior
provisioning of capacity isn’t necessary.

And from a security perspective, Cloud Storage always encrypts data on the server
side, before it’s written to disk, at no additional charge. Data traveling between a
customer’s device and Google is encrypted by default using HTTPS/TLS (Transport
Layer Security).
Proprietary + Confidential

Choosing a location type

Regional: your data is stored in a specific region, with replication across availability zones in that region.

Dual-region: your data is replicated across a specific pair of regions.

Multi-region: your data is distributed redundantly across the US, EU, or Asia.

How to choose between regional, dual-region and multi-region Cloud Storage

Bucket locations
https://cloud.google.com/storage/docs/locations

How to migrate Cloud Storage data from multi-region to regional


https://cloud.google.com/blog/products/storage-data-transfer/multi-region-google-cloud-storage-to-regional-data-migration

How to choose between regional, dual-region and multi-region Cloud Storage


https://cloud.google.com/blog/products/storage-data-transfer/choose-between-regional-dual-region-and-multi-region-cloud-storage/
Proprietary + Confidential

Choosing a storage class

Standard
Use case: "hot" data and/or data stored for only brief periods of time, like data-intensive computations.
Minimum storage duration*: none. Retrieval cost: none.
Availability SLA: 99.95% (multi/dual-region), 99.90% (regional).

Nearline
Use case: infrequently accessed data, like data backup, long-tail multimedia content, and data archiving.
Minimum storage duration*: 30 days. Retrieval cost: $0.01 per GB.
Availability SLA: 99.90% (multi/dual-region), 99.00% (regional).

Coldline
Use case: infrequently accessed data that you read or modify at most once a quarter.
Minimum storage duration*: 90 days. Retrieval cost: $0.02 per GB.
Availability SLA: 99.90% (multi/dual-region), 99.00% (regional).

Archive
Use case: data archiving, online backup, and disaster recovery.
Minimum storage duration*: 365 days. Retrieval cost: $0.05 per GB.
Availability SLA: none.

Durability (all classes): 99.999999999%

*Minimum storage duration: if you delete an object before x days, you still pay for x days of storage.

Cloud Storage has four storage classes: Standard, Nearline, Coldline and Archive and
each of those storage classes provide 3 location types:
● A multi-region is a large geographic area, such as the United States, that
contains two or more geographic places.
● A dual-region is a specific pair of regions, such as Finland and the
Netherlands.
● A region is a specific geographic place, such as London.

Objects stored in a multi-region or dual-region are geo-redundant.


● Standard Storage is best for data that is frequently accessed ("hot" data)
and/or stored for only brief periods of time. This is the most expensive storage
class but it has no minimum storage duration and no retrieval cost. When used
in a:
○ region, Standard Storage is appropriate for storing data in the same
location as Google Kubernetes Engine clusters or Compute Engine
instances that use the data. Co-locating your resources maximizes the
performance for data-intensive computations and can reduce network
charges.
○ dual-region, you still get optimized performance when accessing Google Cloud products that are located in one of the associated regions, but you also get the improved availability that comes from storing data in geographically separate locations.
○ multi-region, Standard Storage is appropriate for storing data that is
accessed around the world, such as serving website content,
streaming videos, executing interactive workloads, or serving data
supporting mobile and gaming applications.
● Nearline Storage is a low-cost, highly durable storage service for storing
infrequently accessed data like data backup, long-tail multimedia content, and
data archiving. Nearline Storage is a better choice than Standard Storage in
scenarios where slightly lower availability, a 30-day minimum storage duration,
and costs for data access are acceptable trade-offs for lowered at-rest storage
costs.
● Coldline Storage is a very-low-cost, highly durable storage service for storing
infrequently accessed data. Coldline Storage is a better choice than Standard
Storage or Nearline Storage in scenarios where slightly lower availability, a
90-day minimum storage duration, and higher costs for data access are
acceptable trade-offs for lowered at-rest storage costs.
● Archive Storage is the lowest-cost, highly durable storage service for data
archiving, online backup, and disaster recovery. Unlike the "coldest" storage
services offered by other Cloud providers, your data is available within
milliseconds, not hours or days. Unlike other Cloud Storage storage classes,
Archive Storage has no availability SLA, though the typical availability is
comparable to Nearline Storage and Coldline Storage. Archive Storage also
has higher costs for data access and operations, as well as a 365-day
minimum storage duration. Archive Storage is the best choice for data that you
plan to access less than once a year.

Let's focus on durability and availability. All of these storage classes have 11 nines of durability, but what does that mean? Does it mean you have access to your files at all times? No; it means you won't lose the data. You may still be unable to access it at a given moment. It's like going to your bank: your money is in there, it's 11 nines durable, but when the bank is closed you don't have access to it. That access is availability, and availability is what differs between the storage classes and the location types.
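To make exam topic 4.4.4 (estimating costs of data storage resources) concrete, here is a small, hypothetical Python sketch. The retrieval prices come from the table above; the per-GB storage prices are placeholder assumptions, so check the current Cloud Storage pricing page before relying on them.

# Rough monthly cost estimate (storage + retrieval only; ignores operations,
# egress, and early-deletion charges). Retrieval prices are from the table
# above; STORAGE_PRICE_PER_GB values are assumed placeholders.
RETRIEVAL_PRICE_PER_GB = {"STANDARD": 0.00, "NEARLINE": 0.01, "COLDLINE": 0.02, "ARCHIVE": 0.05}
STORAGE_PRICE_PER_GB = {"STANDARD": 0.020, "NEARLINE": 0.010, "COLDLINE": 0.004, "ARCHIVE": 0.0012}

def estimate_monthly_cost(storage_class, gb_stored, gb_retrieved):
    storage = gb_stored * STORAGE_PRICE_PER_GB[storage_class]
    retrieval = gb_retrieved * RETRIEVAL_PRICE_PER_GB[storage_class]
    return storage + retrieval

# 500 GB kept in Nearline for a month, 50 GB read back during that month.
print("Nearline estimate: $%.2f" % estimate_monthly_cost("NEARLINE", 500, 50))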
Proprietary + Confidential

Exam Guide - Storage options

(The exam guide excerpt shown earlier is repeated here; the next slides cover loading and transferring data, 3.4.2.)
Proprietary + Confidential

(BigQuery is discussed later.)

How to transfer data to Google Cloud?

How to transfer data to Google Cloud


https://www.youtube.com/watch?v=lt9bOxlsKs4
https://cloud.google.com/blog/topics/developers-practitioners/how-transfer-your-data-google-cloud
Proprietary + Confidential

Bringing data into Cloud Storage


● Online transfer
  ○ Drag and drop in the Google Cloud console
  ○ Upload / download via the command line
    ■ gsutil / gcloud
● Storage Transfer Service
  ○ Create jobs to run once or on a scheduled basis
  ○ Transfer data from:
    ■ Other clouds (AWS, Azure), URLs, POSIX filesystems
    ■ On-premises storage
    ■ Cloud Storage bucket to Cloud Storage bucket
● Transfer Appliance
  ○ Rackable, high-capacity storage server leased from Google
  ○ Used when the amount of data to be transferred would take too much time given the network bandwidth between the customer location and Google

How long will it take to transfer data?

Cloud Storage - data transfer options:


https://cloud.google.com/architecture/migration-to-google-cloud-transferring-your-large-datasets#transfer-options

How long will it take to transfer data?


https://cloud.google.com/architecture/migration-to-google-cloud-transferring-your-large-datasets#time

Online transfer
gsutil: https://cloud.google.com/storage/docs/gsutil
gcloud: https://cloud.google.com/sdk/gcloud/reference/storage

Storage Transfer Service


https://cloud.google.com/storage-transfer/docs/overview

Transfer Appliance
https://cloud.google.com/transfer-appliance

Transfer appliance Youtube video:


https://www.youtube.com/watch?v=4g2ntSRU2pI

There are several ways to bring your data into Cloud Storage.

Many customers simply carry out their own online transfer using gsutil, which is the
Cloud Storage command from the Cloud SDK. Data can also be moved in by using a
drag and drop option in the Google Cloud console, if accessed through the Google
Chrome web browser.

But what if you have to upload terabytes or even petabytes of data? Storage Transfer
Service enables you to import large amounts of online data into Cloud Storage quickly
and cost-effectively. The Storage Transfer Service lets you schedule and manage
batch transfers to Cloud Storage from another cloud provider, from a different Cloud
Storage region, or from an HTTP(S) endpoint.

And then there is the Transfer Appliance, which is a rackable, high-capacity storage
server that you lease from Google Cloud. You connect it to your network, load it with
data, and then ship it to an upload facility where the data is uploaded to Cloud
Storage. You can transfer up to a petabyte of data on a single appliance.
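As a concrete illustration of the Storage Transfer Service, here is a hypothetical Python sketch that creates a daily job copying an Amazon S3 bucket into a Cloud Storage bucket with the google-cloud-storage-transfer client library. The project, bucket names, credentials, and start date are placeholders, and the field names are written from memory of the public samples, so verify them against the current API reference.

from google.cloud import storage_transfer

def create_daily_s3_transfer(project_id, s3_bucket, gcs_bucket,
                             aws_access_key_id, aws_secret_access_key):
    client = storage_transfer.StorageTransferServiceClient()
    transfer_job = {
        "project_id": project_id,
        "status": storage_transfer.TransferJob.Status.ENABLED,
        # Daily job starting on the given (placeholder) date.
        "schedule": {"schedule_start_date": {"year": 2023, "month": 1, "day": 1}},
        "transfer_spec": {
            "aws_s3_data_source": {
                "bucket_name": s3_bucket,
                "aws_access_key": {
                    "access_key_id": aws_access_key_id,
                    "secret_access_key": aws_secret_access_key,
                },
            },
            "gcs_data_sink": {"bucket_name": gcs_bucket},
        },
    }
    return client.create_transfer_job({"transfer_job": transfer_job})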
Proprietary + Confidential

Transferring data into the cloud can be challenging


Data size   1 Mbps      10 Mbps     100 Mbps    1 Gbps      10 Gbps     100 Gbps
1 GB        3 hrs       18 mins     2 mins      11 secs     1 sec       0.1 secs
10 GB       30 hrs      3 hrs       18 mins     2 mins      11 secs     1 sec
100 GB      12 days     30 hrs      3 hrs       18 mins     2 mins      11 secs
1 TB        124 days    12 days     30 hrs      3 hrs       18 mins     2 mins
10 TB       3 years     124 days    12 days     30 hrs      3 hrs       18 mins
100 TB      34 years    3 years     124 days    12 days     30 hrs      3 hrs
1 PB        340 yrs     34 years    3 years     124 days    12 days     30 hrs     <- typical enterprise
10 PB       3,404 yrs   340 yrs     34 years    3 years     124 days    12 days
100 PB      34,048 yrs  3,404 yrs   340 yrs     34 years    3 years     124 days

Data transfer speeds


https://cloud.google.com/transfer-appliance/docs/4.0/overview#transfer-speeds

One of the data transfer barriers that all companies face is their available network. The rate at which we are creating data is accelerating far beyond our capacity to send it all over typical networks, so all companies battle with this.

In working with our customers we’ve found the typical enterprise organization has
about 10 petabytes of information and available network bandwidth of somewhere
between 100 Mbps and 1 Gbps. This works out to somewhere between 3 and 34
years of ingress. Too long. Now all of that data has different storage requirements,
use cases and will likely have many different eventual homes. Because of this
organizations need data transfer options across this x and y axis.
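The numbers in the table above follow from simple arithmetic: transfer time is data volume divided by bandwidth, assuming an ideal, fully utilized link. A quick check in Python:

# time = bytes * 8 / bits-per-second; ignores protocol overhead and contention.
def transfer_time_seconds(data_bytes, link_bits_per_sec):
    return data_bytes * 8 / link_bits_per_sec

PB = 10 ** 15
GBPS = 10 ** 9
seconds = transfer_time_seconds(10 * PB, 1 * GBPS)      # 10 PB over a 1 Gbps link
print(round(seconds / (3600 * 24 * 365), 1), "years")   # ~2.5, i.e. roughly the "3 years" in the table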
Proprietary + Confidential

Saving files to Cloud Storage - gsutil

● gsutil - command-line tool to upload/download files
  ○ For smaller amounts of data (<1 TB), if you have adequate bandwidth

● Upload
gsutil cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/
gsutil cp desktop/myfile.png gs://my-bucket/

● Download

gsutil cp gs://BUCKET_NAME/OBJECT_NAME SAVE_TO_LOCATION


gsutil cp gs://my-bucket/* desktop/file-folder/

New: gcloud storage commands were introduced in 2022
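Exam topic 3.4.2 also lists "API transfer" alongside command-line upload. The following is a minimal sketch of the same copy operations using the Cloud Storage Python client library (google-cloud-storage); the bucket and file names are the same placeholders used above.

from google.cloud import storage

client = storage.Client()                      # uses Application Default Credentials
bucket = client.bucket("my-bucket")

# Upload: local file -> gs://my-bucket/myfile.png
bucket.blob("myfile.png").upload_from_filename("desktop/myfile.png")

# Download: gs://my-bucket/myfile.png -> local file
bucket.blob("myfile.png").download_to_filename("desktop/file-folder/myfile.png")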

Cloud Storage - command line upload:


https://cloud.google.com/storage/docs/uploading-objects#prereq-cli

Cloud Storage - command line download:


https://cloud.google.com/storage/docs/downloading-objects

Gsutil copy syntax:


https://cloud.google.com/storage/docs/gsutil/commands/cp#description

Creating and managing data transfers programmatically:


https://cloud.google.com/storage-transfer/docs/create-manage-transfer-program

Create and manage data transfers with gcloud (Cloud Storage):


https://cloud.google.com/storage-transfer/docs/create-manage-transfer-gcloud
Proprietary + Confidential

Summary: Choosing a transfer option


Where you're moving data from / Scenario / Suggested products:

● Another cloud provider (for example, Amazon Web Services or Microsoft Azure) to Google Cloud: Storage Transfer Service
● Cloud Storage to Cloud Storage (two different buckets): Storage Transfer Service
● Your private data center to Google Cloud, with enough bandwidth to meet your project deadline, for less than 1 TB of data: gsutil
● Your private data center to Google Cloud, with enough bandwidth to meet your project deadline, for more than 1 TB of data: Storage Transfer Service for on-premises data
● Your private data center to Google Cloud, without enough bandwidth to meet your project deadline: Transfer Appliance
Proprietary + Confidential

Exam Guide - Storage options

(The exam guide excerpt shown earlier is repeated here; the next slides cover object management and lifecycle policies, 4.4.1-4.4.2.)
Proprietary + Confidential

Cloud Storage has many object management features

(Diagrams: Retention Policy; Versioning (object #1, object #2, LIVE object); Lifecycle Management (an object moves to Nearline after 30 days and to Coldline after 90 days).)

Best practices for Cloud Storage cost optimization


https://cloud.google.com/blog/products/storage-data-transfer/best-practices-for-cloud-storage-cost-optimization

Cloud Storage has many object management features. For example, you can set a retention policy on all objects in a bucket, so that objects must be kept for, say, 30 days before they can be deleted or replaced.

You can also use versioning, so that multiple versions of an object are tracked and available if necessary. You might even set up lifecycle management, to automatically move objects that haven't been accessed in 30 days to Nearline and after 90 days to Coldline.
Proprietary + Confidential

Options for controlling data lifecycles


● A retention policy specifies a retention period to be placed on a bucket.
○ An object cannot be deleted or replaced until it reaches the specified age.
● Object Versioning can be enabled on a bucket in order to retain older versions of objects.
  ○ When the live version of an object is deleted or replaced, it becomes noncurrent.
  ○ If a live object version is accidentally deleted, you can restore the noncurrent version back to the live version.
  ○ Object Versioning increases storage costs, but this can be mitigated by Lifecycle Management rules that delete older objects.
● Object Lifecycle Management can be configured for a bucket, which provides automated control over deleting objects and changing storage classes (see the Python sketch after the reference links below).

Options for controlling data lifecycles


https://cloud.google.com/storage/docs/control-data-lifecycles

Object Lifecycle Management:


https://cloud.google.com/storage/docs/lifecycle

Lifecycle conditions:
https://cloud.google.com/storage/docs/lifecycle#conditions

gsutil command:
https://cloud.google.com/storage/docs/gsutil/commands/lifecycle

Object versioning:
https://cloud.google.com/storage/docs/object-versioning
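The same controls can be applied programmatically. Below is a hypothetical sketch using the Cloud Storage Python client library; the bucket name and durations are placeholders, and the helper-method names are written from memory of that library, so confirm them against its reference documentation.

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("acme-data-bucket")

# Object Versioning: keep noncurrent versions when objects are overwritten or deleted.
bucket.versioning_enabled = True

# Lifecycle Management: move objects to NEARLINE after 30 days and COLDLINE after 90,
# and delete noncurrent versions once two newer versions exist.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(number_of_newer_versions=2)

bucket.patch()   # apply the changes

# A retention policy would be set via bucket.retention_period (in seconds); note that
# a retention policy and Object Versioning cannot be enabled on the same bucket.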
Proprietary + Confidential

Set Lifecycle policy - Console

Object Lifecycle Management


https://cloud.google.com/storage/docs/lifecycle
Proprietary + Confidential

Set Lifecycle Policy - CLI

Create a JSON config file containing the rules, then run:

gsutil lifecycle set [config-file] gs://acme-data-bucket

{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {
          "numNewerVersions": 2,
          "isLive": false
        }
      },
      {
        "action": {"type": "Delete"},
        "condition": {
          "daysSinceNoncurrentTime": 7
        }
      }
    ]
  }
}

Rule 1: delete noncurrent objects if there are 2 newer ones.
Rule 2: delete noncurrent versions of objects after they've been noncurrent for 7 days.

Lifecycle - Get or set lifecycle configuration for a bucket


https://cloud.google.com/storage/docs/gsutil/commands/lifecycle
Exam Guide - Storage options
Proprietary + Confidential

(The exam guide excerpt shown earlier is repeated here; the next slides cover Cloud Storage encryption options.)
Proprietary + Confidential

Cloud Storage encryption options

● Default: Google Cloud manages encryption
● Customer-managed encryption keys (CMEK)
  ○ Encryption occurs after Cloud Storage receives data but before the data is written to disk
  ○ Keys are managed through Cloud Key Management Service (KMS)
● Customer-supplied encryption keys (CSEK)
  ○ Keys are created and managed externally to Google Cloud
  ○ Keys act as an additional encryption layer on top of the Google default encryption
● Client-side encryption (external to Google Cloud)
  ○ Data is sent to Cloud Storage already encrypted
  ○ Data also undergoes server-side encryption by Google

(Bucket access permissions (IAM) are discussed in a different module. A Python sketch of CMEK and CSEK follows the reference links below.)

Lab - Getting Started with Cloud KMS


https://partner.cloudskillsboost.google/catalog_lab/368

Provide support for external keys with EKM


You can encrypt data in BigQuery and Compute Engine with encryption keys that are
stored and managed in a third-party key management system that’s deployed outside
Google’s infrastructure. External Key Manager allows you to maintain separation
between your data at rest and your encryption keys while still leveraging the power of
cloud for compute and analytics.
https://cloud.google.com/security-key-management
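As a hypothetical illustration of CMEK and CSEK with the Cloud Storage Python client library (the bucket, project, and key names are placeholders):

import os
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-secure-bucket")

# CMEK: new objects in this bucket default to encryption with a Cloud KMS key.
bucket.default_kms_key_name = (
    "projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key")
bucket.patch()

# CSEK: supply your own 32-byte AES-256 key for one object. Google keeps only a
# hash of the key, so the same key must be presented again to read the object.
csek = os.urandom(32)
blob = bucket.blob("secret.txt", encryption_key=csek)
blob.upload_from_string("sensitive data")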
Exam Guide - Storage options
Proprietary + Confidential

(The exam guide excerpt shown earlier is repeated here. Annotation on this copy: the persistent disk options in 2.3.2 are discussed in the Compute Engine module.)
Exam Guide - Storage options
Proprietary + Confidential

(The exam guide excerpt shown earlier is repeated here; the next slides move on to database options.)
Proprietary + Confidential

Custom and Managed Solutions

You can manage databases yourself by creating a VM and installing the database software, but then you are responsible for everything: backups, patches, OS updates, etc. As an alternative, Google offers fully managed storage options where Google manages the underlying infrastructure.
Proprietary + Confidential

Database Options

Your Google Cloud database options, explained


Proprietary + Confidential

Storage and database services

(Table repeated from the overview. Annotations on this copy: "Discussed in Compute Engine module" and "Next discussion"; the next topic is Cloud SQL.)
Proprietary + Confidential

What does data in a relational database look like?

● Tables contain fields, indexes, and constraints


● Primary key ensures each row in a table is unique
● Relationships are constraints that ensure a parent row cannot be deleted if there
are child rows in another table

Customers: ID: int (primary key); FirstName: string; LastName: string; ...
Orders: ID: int (primary key); CustomerID: int (foreign key to Customers, one-to-many); OrderDate: date; ...
OrderDetails: ID: int (primary key); OrderID: int (foreign key to Orders, one-to-many); Qty: int; Description: string; ...

This is a brief (and very, very simplistic) overview of a relational database.
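To make the primary-key / foreign-key idea concrete, here is a small illustrative sketch (plain SQLite from Python, not a Google Cloud service) using the Customers and Orders tables described above:

# Illustrative only: a primary key plus a foreign-key constraint that prevents
# deleting a parent row while child rows still reference it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""CREATE TABLE Customers (
                  ID INTEGER PRIMARY KEY,
                  FirstName TEXT, LastName TEXT)""")
conn.execute("""CREATE TABLE Orders (
                  ID INTEGER PRIMARY KEY,
                  CustomerID INTEGER REFERENCES Customers(ID),
                  OrderDate TEXT)""")
conn.execute("INSERT INTO Customers VALUES (1, 'Ada', 'Lovelace')")
conn.execute("INSERT INTO Orders VALUES (10, 1, '2023-05-01')")
try:
    conn.execute("DELETE FROM Customers WHERE ID = 1")    # parent row with a child
except sqlite3.IntegrityError as e:
    print("Constraint enforced:", e)                       # FOREIGN KEY constraint failed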


Proprietary + Confidential

Cloud SQL
A fully managed, cloud-based alternative to on-premises MySQL, PostgreSQL, and SQL Server databases

The business value of Cloud SQL: how companies speed up deployments, lower costs and boost agility
https://cloud.google.com/blog/products/databases/the-business-value-of-cloud-sql/

Cloud SQL:
https://cloud.google.com/sql/docs/mysql/introduction

What is Cloud SQL


https://cloud.google.com/blog/topics/developers-practitioners/what-cloud-sql

Cloud SQL offers fully managed relational databases, including MySQL, PostgreSQL,
and SQL Server as a service. It’s designed to hand off mundane, but necessary and
often time-consuming, tasks to Google—like applying patches and updates, managing
backups, and configuring replications—so your focus can be on building great
applications.
Proprietary + Confidential

Cloud SQL - shared responsibilities

Managed by Customer: Application Development, Schema Design, Query Optimization

Managed by Google: Storage Scaling, Availability, Database Maintenance, Monitoring, Security, OS, Hardware & Network

Google Cloud manages the infrastructure work while the customer focuses on their
goals.
Proprietary + Confidential

Cloud SQL supports a variety of use cases

● Storing and managing relational data such as customer information, product


inventory, and financial transactions
● Backend database for web and mobile applications
● Migrating on-premises databases to the cloud
● Building and deploying data-driven applications with minimal setup and
management overhead
● Replicating data for disaster recovery and high availability
○ Provides automatic failover for high availability, ensuring that databases are
always available in case of an unexpected outage or hardware failure
Proprietary + Confidential

Cloud SQL - customer use case


● Qlue: Boosting intelligence about activity in
Indonesian cities
● Applications include
○ Dashboard to track and resolve
incidents of flooding, crime, fire, illegal
rubbish dumping, and more
○ Monitor energy usage and traffic
congestion in real time
○ Mobile app that enables citizens to
report incidents

Qlue: Boosting intelligence about activity in Indonesian cities


https://cloud.google.com/customers/qlue/

Note that Qlue uses multiple Google Cloud services, not just Cloud SQL. It’s highly
recommended that you understand the use case for all these services, and how they
are used together to meet customer needs.
Proprietary + Confidential

Scaling Cloud SQL Databases


● Cloud SQL can scale in the following ways
○ Vertical scaling:
■ Increasing the amount of computing resources (such as CPU and memory)
allocated to a single database instance.
■ Can be done with a few clicks in the Cloud Console, and requires no
downtime.
○ Horizontal scaling:
■ Adding additional read replicas to distribute read workloads and improve
performance.
■ Can be added on demand, and there's no limit to the number of replicas
that can be added.
Proprietary + Confidential

Cloud SQL Read Replicas

Fully managed read replica in a different region (or regions) than that of the primary instance
● Enhance DR
● Bring data closer to your applications (performance and cost implications)
● Migrate data across regions
● Data and other changes on the primary instance are updated in almost real time on the read replicas

Cloud SQL Read Replicas:


https://cloud.google.com/sql/docs/mysql/replication

Cloud SQL read replicas provide


● Enhanced Disaster recovery. Can convert a read replica to the Primary (if the
primary is no longer available)
● In terms of data closer to your app, you can have different connection strings
for an application for different endpoints
○ For example:
■ Read-Write connection strings will go to the primary
■ Read only connections hit a local read replica
● If reads comprise most of the use case for your app, then accessing the local
replica saves on data transfer and egress charges since you are not accessing
the primary sitting in the eastern United States in this example

Introducing cross-region replica for Cloud SQL (June 2, 2020)


https://cloud.google.com/blog/products/databases/introducing-cross-region-replica-for-cloud-sql
Proprietary + Confidential

Cloud SQL - High Availability

● Provides automatic failover if a zone or instance becomes unavailable
● The primary instance is in one zone in a region
○ The failover instance is in another zone
● Synchronous replication is used to copy all
writes from the primary disk to the replica

About high availability


https://cloud.google.com/sql/docs/mysql/high-availability
Proprietary + Confidential

Connecting and running a query in Cloud SQL

● Connect to Cloud SQL

gcloud sql connect myinstancename --user=root

● Execute a query using standard SQL

SELECT firstname, lastname, empid FROM employee;

Connect to Cloud SQL for MySQL from Cloud Shell


https://cloud.google.com/sql/docs/mysql/connect-instance-cloud-shell

All Cloud SQL for MySQL code samples:


https://cloud.google.com/sql/docs/mysql/samples
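Beyond gcloud sql connect, applications typically connect through a database driver or the Cloud SQL Python Connector. The sketch below is hypothetical: the instance connection name, credentials, database, and table are placeholders, and it assumes the cloud-sql-python-connector and PyMySQL packages are installed.

from google.cloud.sql.connector import Connector

connector = Connector()
conn = connector.connect(
    "my-project:us-central1:myinstancename",   # project:region:instance
    "pymysql",
    user="root",
    password="my-password",
    db="hr",
)
with conn.cursor() as cursor:
    cursor.execute("SELECT firstname, lastname, empid FROM employee")
    for row in cursor.fetchall():
        print(row)
conn.close()
connector.close()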
Proprietary + Confidential

Storage and database services

(Table repeated from the overview. Next discussion: Cloud Spanner.)
Proprietary + Confidential

Cloud Spanner
An enterprise-grade, globally distributed, externally consistent relational database with unlimited scalability and industry-leading 99.999% availability
● Powers Google's most popular, globally available products, like YouTube, Drive, and Gmail
● Capable of processing more than 1 billion queries per second at peak
● For any workload, large or small, that cannot tolerate downtime and requires high availability
● Regional and multi-regional deployments
  ○ SLA: Multi-regional: 99.999%
  ○ SLA: Regional: 99.99%
● Supports ANSI standard SQL

(Linked article: Cloud Spanner myths busted)

Cloud Spanner:
https://cloud.google.com/blog/topics/developers-practitioners/what-cloud-spanner
https://cloud.google.com/spanner

Cloud Spanner myths busted:


https://cloud.google.com/blog/products/databases/cloud-spanner-myths-busted

Cloud Spanner is an enterprise-grade, globally distributed, externally consistent


database that offers unlimited scalability and industry-leading 99.999% availability. It
requires no maintenance windows and combines the benefits of relational databases
with the unmatched scalability and availability of non-relational databases.

Spanner powers Google's most popular, globally available products, like YouTube, Drive, and Gmail, and it can process more than 1 billion queries per second at peak.

But don't be misled into thinking that Spanner is only for large, enterprise-level applications. Many customers use Spanner for their smaller workloads (both in terms of transactions per second and storage size) for availability and scalability reasons. Spanner is appropriate for any customer that cannot tolerate downtime and needs high availability for their applications.

Limitless scalability and high availability is critical in many industry verticals such as
gaming and retail, especially when a newly launched game goes viral and becomes
an overnight success or when a retailer has to handle a sudden surge in traffic due to
a Black Friday/Cyber Monday sale.

Spanner offers the flexibility to interact with the database via a SQL dialect based on
ANSI 2011 standard as well as via a REST or gRPC API interface, which are
optimized for performance and ease-of-use. In addition to Spanner’s interface, there
is a PostgreSQL interface for Spanner, that leverages the ubiquity of PostgreSQL and
provides development teams with an interface that they are familiar with. The
PostgreSQL interface provides a rich subset of the open-source PostgreSQL SQL
dialect, including common query syntax, functions, and operators. It also supports a
core collection of open-source PostgreSQL data types, DDL syntax, and information
schema views. You get the PostgreSQL familiarity, and relational semantics at
Spanner scale.
Proprietary + Confidential

Cloud Spanner supports a variety of use cases


● Large-scale, multi-regional data storage
○ Can store and manage terabytes or petabytes of data across multiple regions
● Online transaction processing (OLTP) applications:
○ Can support OLTP workloads with low latency and high throughput, making it suitable
for applications that require real-time data processing, such as e-commerce or
financial services
● Banking and financial services
○ Spanner's high-availability and consistency guarantees make it well-suited for use
cases in the financial services industry, such as stock trading or payment processing
● Geographically distributed data management
○ Spanner supports multiple geographic locations and provides a globally consistent
view of data, making it ideal for use cases that require a database that can handle
data distributed across multiple regions or continents
Proprietary + Confidential

Cloud Spanner - customer use case


● Dragon Ball Legends - mobile game from
Bandai Namco Entertainment
● Requirements were:
○ Global backend that could scale with
millions of players and still perform well.
○ Global reliable, low latency network to
support multi-region
player-versus-player battles
○ Real-time data analytics to measure and evaluate how people are playing the game and adjust it on the fly.

Behind the scenes with the Dragon Ball Legends GC backend

This is another example where multiple Google Cloud services are used together to
provide a solution.
Proprietary + Confidential

Scaling Cloud Spanner


Scales out (horizontal scaling)
● Manually add nodes/processing
units to support more data and
users as needed
● Turn on autoscaling to
automatically adjust the number of
nodes in an instance to handle
changing traffic patterns and load

Autoscaling Cloud Spanner


https://cloud.google.com/architecture/autoscaling-cloud-spanner
Proprietary + Confidential

Querying Spanner

Using a client library (Python example):

def query_data(instance_id, database_id):
    """Queries sample data from the database using SQL."""
    spanner_client = spanner.Client()
    instance = spanner_client.instance(instance_id)
    database = instance.database(database_id)

    with database.snapshot() as snapshot:
        results = snapshot.execute_sql(
            "SELECT SingerId, AlbumId, AlbumTitle FROM Albums"
        )
        for row in results:
            print("SingerId: {}, AlbumId: {}, AlbumTitle: {}".format(*row))

Using the CLI:

gcloud spanner databases execute-sql example-db \
    --sql='SELECT SingerId, AlbumId, AlbumTitle FROM Albums'

Query syntax in Google Standard SQL


https://cloud.google.com/spanner/docs/reference/standard-sql/query-syntax

Python quickstart
https://cloud.google.com/spanner/docs/getting-started/python

Spanner code examples:


https://cloud.google.com/spanner/docs/samples

If you already know database programming, you will be comfortable using Spanner.
Like all relational databases, you create tables, tables have fields, fields have data
types and constraints. You can set up relationships between tables. When you want to
store data, you add rows to tables.

Once you have data, you can retrieve it with a standard SQL SELECT statement.
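For the write side mentioned above (adding rows to tables), here is a minimal sketch with the Spanner Python client library, using the same example database and placeholder instance name as the samples:

from google.cloud import spanner

client = spanner.Client()
database = client.instance("test-instance").database("example-db")

# Insert rows in a single atomic batch; they can then be read back with the
# query_data() example shown earlier.
with database.batch() as batch:
    batch.insert(
        table="Singers",
        columns=("SingerId", "FirstName", "LastName"),
        values=[(1, "Marc", "Richards")],
    )
    batch.insert(
        table="Albums",
        columns=("SingerId", "AlbumId", "AlbumTitle"),
        values=[(1, 1, "Total Junk"), (1, 2, "Go, Go, Go")],
    )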
Proprietary + Confidential

Cloud SQL:
● Max 64 TB data
● High Availability via master / standby
● 99.95% SLA
● Vertically scalable via reprovisioning
● Horizontally scalable for reads via read replicas
● Planned maintenance windows

Cloud Spanner:
● 4 TB data per node
● Distributed service - always available
● 99.99% SLA (regional), 99.999% SLA (multi-region)
● Horizontally scalable
● No-maintenance managed service
Proprietary + Confidential

Storage and database services

(Table repeated from the overview. Next discussion: non-relational databases, starting with Firestore.)
Proprietary + Confidential

Types of NoSQL Database


● Key-value stores (Cloud Memorystore)
○ Data is stored in key-value pairs
○ Examples include Redis and SimpleDB
● Document stores (Cloud Firestore)
○ Data is stored in some standard format like XML or JSON
○ Nested and hierarchical data can be stored together
○ MongoDB, CouchDB, and DynamoDB are examples
● Wide-column stores (Cloud Bigtable)
○ Key identifies a row in a table
○ Columns can be different within each row
○ Cassandra and HBase are examples

There are different types of NoSQL databases. (The Google Cloud services are in
bold text in the above slide.)

Key-value stores are the simplest. These act like a map or dictionary. To save data,
just specify a key and assign a value to that key.

Document stores allow you to store hierarchical data. So, instead of having an orders
table and a details table, you can store an order in a single document. That document
can have properties that represent the order along with an array of subdocuments
representing the details. The data can be stored in the database as JSON or XML, or
a binary format called BSON.

Wide-column stores have data in tables. Each row in the table has a unique value or
key. Associated with that key can be any number of columns. Each column can have
any type of data. There’s no schema, so different rows in the same table can have
different columns.
Proprietary + Confidential

Cloud Firestore

All you need to know about Firestore

All you need to know about Firestore:


https://cloud.google.com/blog/topics/developers-practitioners/all-you-need-know-about-firestore-cheatsheet
Proprietary + Confidential

Cloud Firestore
● Completely managed, document store, NoSQL Database
○ No administration, no maintenance, nothing to provision
● 1 GB per month free tier
● Indexes created for every property by default
○ Secondary indexes and composite indexes are supported
● Supports ACID* transactions
● For mobile, web, and IoT apps at global scale
○ Live synchronization and offline support
● Multi-region replication

*ACID explained: https://en.wikipedia.org/wiki/ACID

Firestore:
https://cloud.google.com/firestore/docs/

Cloud Firestore is a fast, fully managed, serverless, cloud-native NoSQL document


database

It simplifies storing, syncing, and querying data for your mobile, web, and IoT apps at
global scale

Its client libraries provide live synchronization and offline support, and its security
features and integrations with Firebase and Google Cloud accelerate building truly
serverless apps.

Cloud Firestore also supports ACID transactions, so if any of the operations in the
transaction fail and cannot be retried, the whole transaction will fail.

Also, with automatic multi-region replication and strong consistency, your data is safe
and available, even when disasters strike.

Cloud Firestore allows you to run sophisticated queries against your NoSQL data
without any degradation in performance. This gives you more flexibility in the way you
structure your data.

Some features recently introduced are


● VPC Service Controls for Firestore. This allows you to define a perimeter to
mitigate data exfiltration risks.
● Firestore triggers for Cloud Functions. When certain events happen in
Firestore, Cloud Functions can run in response.
● The same Key Visualizer service that’s available for Bigtable and Spanner is
also available for Firestore. It allows developers to quickly and visually identify
performance issues
Proprietary + Confidential

Firestore example data

(Diagram of hierarchical example data. An index is created for every property so that queries are extremely fast.)

From: https://firebase.google.com/docs/firestore/data-model#hierarchical-data
Proprietary + Confidential

Firestore - customer use case


● Forbes created Bertie - an AI
assistant for journalists
● Journalists upload their content and
Bertie provides feedback
○ Strength of the article’s headline
○ Keywords needed for search
engine optimization
○ Words to add to the headline to
increase search performance

YouTube video: https://www.youtube.com/watch?v=KVRxsRPhmoo

Forbes website (for anyone that wants to know more about the company)
https://www.forbes.com/
Proprietary + Confidential

Forbes - Bertie Architecture


● Content is stored in Firestore
● Data updates trigger Cloud
Functions
○ Each Cloud Function
performs a different task
○ Results are written back to
Firestore
○ Website is refreshed with
the recommendations

YouTube video: https://www.youtube.com/watch?v=KVRxsRPhmoo


Proprietary + Confidential

Querying Firestore - Python example


# Get a reference to a collection
cities_ref = db.collection(u'cities')

# Return cities that are capitals
query = cities_ref.where(u'capital', u'==', True).stream()
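A slightly fuller, hypothetical sketch (collection and field names are placeholders) showing a write followed by the same capital-cities query, end to end:

from google.cloud import firestore

db = firestore.Client()

# Create or overwrite a document with ID "TOK".
db.collection(u'cities').document(u'TOK').set(
    {u'name': u'Tokyo', u'country': u'Japan', u'capital': True})

# Stream every document where capital == True.
for doc in db.collection(u'cities').where(u'capital', u'==', True).stream():
    print(doc.id, doc.to_dict())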

Choosing between Native mode and Datastore mode


https://cloud.google.com/datastore/docs/firestore-or-datastore

Firestore in Datastore Mode Queries


https://cloud.google.com/datastore/docs/concepts/queries

Firestore in Native Mode Queries


https://cloud.google.com/firestore/docs/samples

Firestore code examples:


https://cloud.google.com/firestore/docs/samples

Firestore - Querying and filtering data:


https://cloud.google.com/firestore/docs/query-data/queries
Proprietary + Confidential

Storage and database services

(Table repeated from the overview. Next discussion: Cloud Bigtable.)
Proprietary + Confidential

Cloud Bigtable
Highly scalable NoSQL database that can handle billions of rows and petabytes of
data, making it ideal for use cases that require large-scale data storage and
processing with up to 99.999% availability

● High throughput at low latency


○ Ideal for use cases that require real-time data processing and
analysis
● Integrates easily with big data tools
○ Same API as HBase
○ Allows on-premises HBase applications to be easily migrated
● Consistent sub-10ms latency

Bigtable:
https://cloud.google.com/bigtable

● Bigtable is a fully managed, scalable NoSQL database service for large


analytical and operational workloads with up to 99.999% availability
● Bigtable is built with proven infrastructure that powers Google products used
by billions such as Search and Maps.
● Bigtable is ideal for storing very large amounts of data in a key-value store and
supports high read and write throughput at low latency for fast access to large
amounts of data.
● It is a fully managed service that integrates easily with big data tools like
Hadoop, Dataflow, and Dataproc. In addition, support for the open source HBase
API standard makes it easy for development teams to get started.
● Some of the features are
○ Autoscaling: Autoscaling helps prevent over-provisioning or
under-provisioning by letting Cloud Bigtable automatically add or
remove nodes to a cluster when usage changes.
■ In addition, metrics are available to help you understand how
autoscaling is working.
○ You can use customer managed encryption keys (CMEK) in Cloud
Bigtable instances, including ones that are replicated across multiple
regions.
○ App profile cluster groups let you route an app profile's traffic to a
subset of an instance's clusters.
Proprietary + Confidential

Cloud Bigtable

How BIG is Cloud Bigtable?

How BIG is Cloud Bigtable:


https://cloud.google.com/blog/topics/developers-practitioners/how-big-cloud-bigtable
Proprietary + Confidential

Cloud Bigtable example use cases

Healthcare: Patient data and other health-related information
Financial Services: Storage of large amounts of transaction, risk management, and compliance data
Retail: Storage of data related to customer behavior, which can be used to generate product recommendations
Gaming: Player data for game analytics
IoT: Data generated by connected devices and sensors
Logs and Metrics: Storage of log data and metrics for analysis
Time series data: Storage of resource consumption, such as CPU and memory usage over time for multiple servers
Proprietary + Confidential

Cloud Bigtable example data


Data that’s likely to be accessed via the same request is grouped into column families. The maximum size of a column is 256 MB each, and rows can have millions of columns.

Row Key                   | Flight_Information (column family)                                        | Aircraft_Information (column family)
                          | Origin | Destination | Departure     | Arrival       | Passengers | Capacity | Make | Model | Age
ATL#arrival#20190321-1121 | ATL    | LON         | 20190321-0311 | 20190321-1121 | 158        | 162      | B    | 737   | 18
ATL#arrival#20190321-1201 | ATL    | MEX         | 20190321-0821 | 20190321-1201 | 187        | 189      | B    | 737   | 8
ATL#arrival#20190321-1716 | ATL    | YVR         | 20190321-1014 | 20190321-1716 | 201        | 259      | B    | 757   | 23

Cloud Bigtable provides Column Families. By accessing the Column Family, you can
pull some of the data you need without pulling all of the data from the row or having to
search for it and assemble it. This makes access more efficient.
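To illustrate that point, here is a hedged sketch of reading only one column family with the Python client, assuming a table laid out like the flight example above; the project, instance, table, and family names are illustrative.

from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project='my-project', admin=True)
instance = client.instance('my-instance')
table = instance.table('flights')

# Only return cells from the Flight_Information column family,
# skipping Aircraft_Information entirely.
family_filter = row_filters.FamilyNameRegexFilter('Flight_Information')

row = table.read_row(b'ATL#arrival#20190321-1121', filter_=family_filter)
if row is not None:
    for column, cells in row.cells['Flight_Information'].items():
        print(column, cells[0].value)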
Proprietary + Confidential

Bigtable - customer use case


● Macy’s sells a wide range of merchandise,
including apparel and accessories (men’s,
women’s and children’s), cosmetics, home
furnishings and other consumer goods
○ 700+ stores across the US
● Bigtable provides data for the pricing system
○ Access pattern entails finding an item’s
ticket price based on a given division,
location, and the universal price code
○ Can search millions of product codes
within single digit milliseconds

How Macy’s enhances the customer experience with


Google Cloud services

How Macy’s enhances the customer experience with Google Cloud services
https://cloud.google.com/blog/products/databases/how-macys-enhances-customer-ex
perience-google-cloud-services
Proprietary + Confidential

Querying Bigtable using Python SDK


from google.cloud import bigtable          # 1. Import the Bigtable SDK

instance_id = 'big-pets'
table_id = 'pets_table'
column_family_id = 'pets_family'

client = bigtable.Client(admin=True)       # 2. Connect to the Bigtable service
instance = client.instance(instance_id)
table = instance.table(table_id)

table.create()                             # 3. Create a table and a column family
column_family = table.column_family(column_family_id)
column_family.create()

pet = ['Noir', 'Dog', 'Schnoodle']
row_key = 'pet:{}'.format(pet[0]).encode('utf-8')
row = table.row(row_key)                   # 4. Add a row to the table
row.set_cell(column_family_id, 'type', pet[1].encode('utf-8'))
row.set_cell(column_family_id, 'breed', pet[2].encode('utf-8'))
row.commit()

Python Client for Google Cloud Bigtable


https://googleapis.dev/python/bigtable/latest/index.html

Here’s some Python code that uses the Bigtable SDK. The takeaway should be that
the code is pretty simple. Connect to the database, create a table, tables have column
families. You can then add rows. Rows require a unique ID or row key. Rows have
columns that are in column families.
Proprietary + Confidential

Querying Bigtable (continued)


# 1. Retrieve a row (note: the client returns column qualifiers as bytes keys)
key = 'pet:Noir'.encode('utf-8')
row = table.read_row(key)
breed = row.cells[column_family_id][b'breed'][0].value.decode('utf-8')
type = row.cells[column_family_id][b'type'][0].value.decode('utf-8')

# 2. Retrieve many rows
results = table.read_rows()
results.consume_all()

for row_key, row in results.rows.items():
    key = row_key.decode('utf-8')
    type = row.cells[column_family_id][b'type'][0].value.decode('utf-8')
    breed = row.cells[column_family_id][b'breed'][0].value.decode('utf-8')

Once you have some data, you can read individual or multiple rows. Remember, to
get high performance, you want to retrieve rows using the row key.
Proprietary + Confidential

Storage and database services

Object: Cloud Storage
  Good for: Binary or object data
  Such as: Images, media serving, backups
File: Filestore
  Good for: Network Attached Storage (NAS)
  Such as: Latency sensitive workloads
Relational: Cloud SQL
  Good for: Web frameworks
  Such as: CMS, eCommerce
Relational: Cloud Spanner
  Good for: RDBMS + scale, HA, HTAP
  Such as: User metadata, Ad/Fin/MarTech
Non-relational: Firestore
  Good for: Hierarchical, mobile, web
  Such as: User profiles, game state
Non-relational: Cloud Bigtable
  Good for: Heavy read + write, events
  Such as: AdTech, financial, IoT
Warehouse: BigQuery
  Good for: Enterprise data warehouse
  Such as: Analytics, dashboards
In memory: Memorystore
  Good for: Caching for Web/Mobile apps
  Such as: Game state, user sessions

Next discussion
Proprietary + Confidential

What is a Data Warehouse?


● Allows data from multiple sources to be combined and
analyzed
○ Historical archive of data
● Data sources could be:
○ Relational databases
○ Logs
○ Web data
○ Etc.
● Optimized for analytical processing
○ Can handle large amounts of data and complex queries, and is well-suited for reporting and data analysis

A data warehouse is a central hub for business data. Different types of data can be transformed and consolidated into the warehouse for analysis.
Proprietary + Confidential

BigQuery
● Fully managed, serverless, highly scalable data warehouse
● Multi-cloud capabilities using standard SQL
● Processes multi-terabytes of data in minutes
● Automatic high availability
● Supports federated queries
○ Cloud SQL & Cloud Spanner
○ Cloud Bigtable BigQuery
○ Files in Cloud Storage
● Use cases:
○ Near real-time analytics of streaming data to predict business outcomes with
built-in machine learning, geospatial analysis and more
○ Analysis of historical data

Cloud data warehouse to power your data-driven innovation


https://cloud.google.com/bigquery
Proprietary + Confidential

BigQuery - customer use case


● The Home Depot is the world’s largest home
improvement retailer
○ 2,300 stores in North America + online retail
○ Annual sales > $100 billion
● BigQuery provides timely data to help keep 50,000+
items stocked at over 2,000 locations, to ensure
website availability, and provide relevant information
through the call center
● No two Home Depots are alike, and the stock in each has to be managed at maximum efficiency.
○ Migrating to Google Cloud, THD’s engineers
built one of the industry’s most efficient stock
replenishment systems

About The Home Depot


https://corporate.homedepot.com/page/about-us

The Home Depot's data-driven focus on customer success


https://cloud.google.com/customers/featured/the-home-depot
Proprietary + Confidential

BigQuery: Executing queries in the console

● Can run queries interactively in the console or schedule them to run later
● The console shows the amount of data processed by the query
● That number can be plugged into the Pricing Calculator for cost estimation

Run interactive and batch query jobs


https://cloud.google.com/bigquery/docs/running-queries#bigquery-query-cl
Proprietary + Confidential

BigQuery: Executing queries in the CLI

bq query --use_legacy_sql=false \
'SELECT
word,
SUM(word_count) AS count
FROM
`bigquery-public-data`.samples.shakespeare
WHERE
word LIKE "%raisin%"
GROUP BY
word'

bq command line tool:


https://cloud.google.com/bigquery/docs/bq-command-line-tool
Proprietary + Confidential

BigQuery: Executing queries using client library


# Python example
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
SELECT name, SUM(number) as total_people
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX'
GROUP BY name, state
ORDER BY total_people DESC
LIMIT 20
"""
query_job = client.query(query) # Make an API request.

print("The query data:")


for row in query_job:
# Row values can be accessed by field name or index.
print("name={}, count={}".format(row[0], row["total_people"]))

All BigQuery code samples


https://cloud.google.com/bigquery/docs/samples
Proprietary + Confidential

Storage and database decision chart


Start: Is your data structured?
  No → Do you need a shared file system? Yes → Filestore; No → Cloud Storage
  Yes → Is your workload analytics?
    Yes → Do you need updates or low latency? Yes → Cloud Bigtable; No → BigQuery
    No → Is your data relational?
      No → Firestore
      Yes → Do you need horizontal scalability? Yes → Cloud Spanner; No → Cloud SQL

Design an optimal storage strategy for your cloud workload


https://cloud.google.com/architecture/storage-advisor

Let’s summarize the services in this module with this decision chart:
● First, ask yourself: Is your data structured, and will it need to be accessed
using its structured data format? If the answer is no, then ask yourself if you
need a shared file system. If you do, then choose Filestore.
● If you don't, then choose Cloud Storage.
● If your data is structured and needs to be accessed in this way, then ask
yourself, does your workload focus on analytics? If it does, you will want to
choose Cloud Bigtable or BigQuery, depending on your latency and update
needs.
● Otherwise, check whether your data is relational. If it’s not relational, choose
Firestore.
● If it is relational, you will want to choose Cloud SQL or Cloud Spanner,
depending on your need for horizontal scalability.

Depending on your application, you might use one or several of these services to get
the job done. For more information on how to choose between these different
services, please refer to the following two links:
https://cloud.google.com/storage-options/
https://cloud.google.com/products/databases/
Storage and database services
Proprietary + Confidential

Object: Cloud Storage
  Good for: Binary or object data
  Such as: Images, media serving, backups
File: Filestore
  Good for: Network Attached Storage (NAS)
  Such as: Latency sensitive workloads
Relational: Cloud SQL
  Good for: Web frameworks
  Such as: CMS, eCommerce
Relational: Cloud Spanner
  Good for: RDBMS + scale, HA, HTAP
  Such as: User metadata, Ad/Fin/MarTech
Non-relational: Firestore
  Good for: Hierarchical, mobile, web
  Such as: User profiles, game state
Non-relational: Cloud Bigtable
  Good for: Heavy read + write, events
  Such as: AdTech, financial, IoT
Warehouse: BigQuery
  Good for: Enterprise data warehouse
  Such as: Analytics, dashboards
In memory: Memorystore
  Good for: Caching for Web/Mobile apps
  Such as: Game state, user sessions

Next discussion
Proprietary + Confidential

Memorystore
● Fully managed implementation of the open source
in-memory databases Redis and Memcached
● High availability, failover, patching and monitoring
● Sub-millisecond latency
● Instances up to 300 GB
● Network throughput of 12 Gbps
● Use cases:
○ Lift and shift of Redis, Memcached
○ Anytime need a managed service for cached
data

Memorystore
https://cloud.google.com/memorystore
Proprietary + Confidential

In-memory caching

# Both commands also require a region for the instance.
gcloud redis instances create my-redis-instance --region=us-central1

gcloud memcache instances create my-memcache-instance --region=us-central1
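Once an instance exists, applications connect to it like any other Redis or Memcached endpoint. Below is a minimal sketch using the open source redis-py library, assuming a client running inside the same VPC; the private IP shown is a made-up illustrative value.

import redis

# The host is the Memorystore instance's private IP (illustrative value);
# you can look it up with: gcloud redis instances describe my-redis-instance --region=REGION
cache = redis.Redis(host='10.0.0.3', port=6379)

# Cache a value with a 60-second expiry, then read it back.
cache.set('session:1234', 'some-serialized-state', ex=60)
print(cache.get('session:1234'))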
Proprietary + Confidential

When to use Cloud Memorystore


Proprietary + Confidential

Memorystore - customer use case


● Opinary creates polls that appear alongside
news articles on various sites around the
world
○ Machine learning is used to decide
which poll to display by which article
● The polls let users share their opinion with
one click and see how they compare to
other readers.
● Publishers benefit by increased reader retention and increased subscriptions
● Advertisers benefit from high-performing interaction with their audiences

Opinary generates recommendations faster on Cloud Run


https://cloud.google.com/blog/topics/developers-practitioners/opinary-generates-reco
mmendations-faster-cloud-run/

More about Opinary (if interested)


https://opinary.com/
Proprietary + Confidential

Google white paper - database migration

Covers the services discussed in this module plus a few more

Migrating Databases
Proprietary + Confidential

Which database should I use?

Make your database your secret advantage


Exam Guide - Storage options
Proprietary + Confidential

2.3 Planning and configuring data storage options. Considerations include:


2.3.1 Product choice (e.g., Cloud SQL, BigQuery, Firestore, Cloud Spanner, Cloud Bigtable)
2.3.2 Choosing storage options (e.g., Zonal persistent disk, Regional balanced persistent disk, Standard,
Nearline, Coldline, Archive)

3.4 Deploying and implementing data solutions. Tasks include:


3.4.1 Initializing data systems with products (e.g., Cloud SQL, Firestore, BigQuery, Cloud Spanner, Pub/Sub,
Cloud Bigtable, Dataproc, Dataflow, Cloud Storage)
3.4.2 Loading data (e.g., command line upload, API transfer, import/export, load data from
Cloud Storage, streaming data to Pub/Sub)

4.4 Managing storage and database solutions. Tasks include:


4.4.1 Managing and securing objects in and between Cloud Storage buckets
4.4.2 Setting object life cycle management policies for Cloud Storage buckets
4.4.3 Executing queries to retrieve data from data instances (e.g., Cloud SQL, BigQuery,
Cloud Spanner, Datastore , Cloud Bigtable)
4.4.4 Estimating costs of data storage resources
4.4.5 Backing up and restoring database instances (e.g., Cloud SQL, Datastore)
4.4.6 Reviewing job status in Dataproc, Dataflow, or BigQuery
Proprietary + Confidential

Typical Data Processing Pipeline

How to Build a data pipeline with Google Cloud

How to Build a data pipeline with Google Cloud


https://www.youtube.com/watch?v=yVUXvabnMRU
Proprietary + Confidential

Google Cloud big data services are fully managed and scalable

Pub/Sub: Scalable & flexible enterprise messaging
Dataflow: Stream & batch processing; unified and simplified pipelines
BigQuery: Analytics database; stream data at 100,000 rows per second
Dataproc: Managed Hadoop, MapReduce, Spark, Pig, and Hive service

Google Cloud Big Data solutions are designed to help you transform your business
and user experiences with meaningful data insights. It is an integrated, serverless
platform. “Serverless” means you don’t have to provision compute instances to run
your jobs. The services are fully managed, and you pay only for the resources you
consume. The platform is “integrated” so Google Cloud data services work together to
help you create custom solutions.
Proprietary + Confidential

Pub/Sub is scalable, reliable messaging


● Fully managed, massively scalable messaging service
○ It allows messages to be sent between independent
applications
○ Can scale to millions of messages per second
● Messages are sent and received via HTTP(S)
● Supports multiple senders and receivers simultaneously
● Global service
○ Messages are copied to multiple zones for greater fault
tolerance
○ Dedicated resources in every region for fast delivery
worldwide
● Pub/Sub messages are encrypted at rest and in transit

What is Cloud Pub/Sub?

What is Cloud Pub/Sub?


https://www.youtube.com/watch?v=JrKEErlWvzA&list=PLTWE_lmu2InBzuPmOcgAY
P7U80a87cpJd

Cloud Pub/Sub is a fully managed, massively scalable messaging service that can be
configured to send messages between independent applications, and can scale to
millions of messages per second.

Pub/Sub messages can be sent and received via HTTP and HTTPS.

It also supports multiple senders and receivers simultaneously.

Pub/Sub is a global service. Fault tolerance is achieved by copying the message to


multiple zones and using dedicated resources in every region for fast worldwide
delivery.

All Pub/Sub messages are encrypted at rest and in transit.


Proprietary + Confidential

Why use Pub/Sub?

● Building block for data ingestion in Dataflow, Internet of


Things (IoT), Marketing Analytics, etc.
● Provides push notifications for cloud-based
applications.
● Connects applications across Google Cloud (push/pull between components, e.g., Compute Engine and App Engine)

Pub/Sub is an important building block for applications where data arrives at high and
unpredictable rates, like Internet of Things systems. If you’re analyzing streaming
data, Dataflow is a natural pairing with Pub/Sub.

Pub/Sub also works well with applications built on Google Cloud’s compute platforms.
You can configure your subscribers to receive messages on a “push” or a “pull” basis.
In other words, subscribers can get notified when new messages arrive for them, or
they can check for new messages at intervals.
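To make the publish side of that model concrete, here is a minimal sketch using the Pub/Sub Python client library; the project ID, topic name, message body, and attribute are placeholders.

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')

# Message data is bytes; attributes (key/value strings) are optional.
future = publisher.publish(topic_path, b'New order received', origin='web-frontend')

# publish() is asynchronous; result() blocks until the server acknowledges
# the message and returns its ID.
print(future.result())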
Proprietary + Confidential

Pub/Sub - customer use case


● Sky is one of Europe’s leading media and
communications companies, providing Sky TV,
streaming, mobile TV, broadband, talk, and line rental
services to millions of customers in seven countries
● Pub/Sub is used to stream diagnostic data from
millions of Sky Q TV boxes
● Data is then parsed through Cloud Dataflow to
Cloud Storage and BigQuery, monitored on its way
by Stackdriver (Operations), which triggers email
and Slack alerts should issues occur.

Sky: Scaling for success with Sky Q diagnostics


Proprietary + Confidential

Dataproc is managed Hadoop


● Fast, easy, managed way to run Hadoop and
Spark/Hive/Pig on Google Cloud
● Create clusters in 90 seconds or less on average
● Scale clusters up and down even when jobs are running
● Dataproc storage
○ Automatically installs the HDFS-compatible Cloud
Storage connector
■ Run Apache Hadoop or Apache Spark jobs
directly on data in Cloud Storage
○ Alternatively can use boot disks to store data
■ Deleted when the Dataproc cluster is deleted

Overview
https://cloud.google.com/dataproc

Dataproc Cloud Storage connector


https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage

Customer use case: Best practices for migrating Hadoop to Dataproc by LiveRamp
https://cloud.google.com/blog/products/data-analytics/best-practices-for-migrating-had
oop-to-gcp-dataproc

Apache Hadoop is an open-source framework for big data. It is based on the


MapReduce programming model, which Google invented and published. The
MapReduce model, at its simplest, means that one function -- traditionally called the
“map” function -- runs in parallel across a massive dataset to produce intermediate
results; and another function -- traditionally called the “reduce” function -- builds a final
result set based on all those intermediate results. The term “Hadoop” is often used
informally to encompass Apache Hadoop itself and related projects, such as Apache
Spark, Apache Pig, and Apache Hive.

Dataproc is a fast, easy, managed way to run Hadoop, Spark, Hive, and Pig on
Google Cloud. All you have to do is to request a Hadoop cluster. It will be built for you
in 90 seconds or less, on top of Compute Engine virtual machines whose number and
type you can control. If you need more or less processing power while your cluster’s
running, you can scale it up or down. You can use the default configuration for the
Hadoop software in your cluster, or you can customize it. And you can monitor your
cluster using Operations.
Proprietary + Confidential

Hadoop history (simplified)


● Google and Yahoo were looking for ways to analyze mountains of user internet
search results
○ Google published a white paper on MapReduce in 2004
○ Yahoo implemented the concepts and open sourced it in 2008
● Two main components when running on-premise
○ Multiple nodes (VMs) to process data
■ May consist of 1,000s of nodes
■ Shares computational workloads and works on data in parallel
○ Each node has persistent disk storage
■ Hadoop Distributed File System (HDFS) to store the data

Google’s whitepaper:
https://static.googleusercontent.com/media/research.google.com/en//archive/mapredu
ce-osdi04.pdf

Yahoo released Hadoop as an open source project to Apache Software Foundation in


2008
Proprietary + Confidential

Example use case - Clickstream data


● Websites often track every click made by every user on every page visited
○ Results in millions of rows of data
● Analysts would like to know why someone added items to a shopping cart, proceeded to
checkout, and then abandoned the cart
○ Don’t need all the data - just data for users who abandoned their cart
■ Phase 1 (aka “Map”)
● Process the data and get the users, the contents of their carts and the
last page visited
■ Phase 2 (aka “Reduce”)
● Aggregate the total the number and value of carts abandoned per
month
● Plus total the most common final pages that someone viewed before
ending the user session

This is a very simplistic example.
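To show the shape of those two phases, here is a toy, single-machine Python sketch with made-up record fields; a real Hadoop or Spark job would distribute the same logic across many nodes.

from collections import defaultdict

clicks = [
    {'user': 'u1', 'page': '/checkout/shipping', 'cart_value': 120.0, 'abandoned': True},
    {'user': 'u2', 'page': '/checkout/payment',  'cart_value': 45.5,  'abandoned': True},
    {'user': 'u3', 'page': '/order/confirmed',   'cart_value': 80.0,  'abandoned': False},
]

# "Map" phase: keep only abandoned carts, keyed by the last page visited.
mapped = [(c['page'], c['cart_value']) for c in clicks if c['abandoned']]

# "Reduce" phase: aggregate the count and total value per final page.
totals = defaultdict(lambda: {'carts': 0, 'value': 0.0})
for page, value in mapped:
    totals[page]['carts'] += 1
    totals[page]['value'] += value

for page, agg in totals.items():
    print(page, agg)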


Proprietary + Confidential

Hadoop: initial lift and shift into Google Cloud, followed by optimization

● Initial lift and shift mirrors Hadoop on-premises:
  ○ Dataproc cluster with primary nodes (1-3) and primary workers on standard Compute Engine VMs
  ○ Optional secondary workers on preemptible/Spot VMs (subject to availability; Google will attempt to keep the specified number available)
  ○ HDFS stored on persistent disks (HDDs and SSDs)
  ○ This is the most costly storage option, and the cluster can't be deleted because the data disks would be deleted with it
● Over time, move storage off the cluster for greater cost savings:
  ○ HDFS connector: move file data to a Cloud Storage cluster bucket
  ○ HBase connector: move NoSQL tables to Cloud Bigtable
  ○ BigQuery connector: BigQuery is often the output destination for data analysis
● With storage externalized, delete the Dataproc cluster when jobs complete; the next time jobs run, pull data from Cloud Storage or Bigtable

Keeping HDFS on the cluster is usually the initial implementation. Customers gain greater
cost savings when they transition to the connector-based flow. This requires making some
modifications to the jobs to use the connectors.
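Often that modification is as small as changing the path scheme, because Dataproc installs the Cloud Storage connector by default. A hedged PySpark sketch, where the bucket name, file layout, and column name are illustrative only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('clickstream-job').getOrCreate()

# Before: spark.read.csv('hdfs:///data/clicks/*.csv')
# After:  read the same files directly from a Cloud Storage bucket.
df = spark.read.csv('gs://my-clickstream-bucket/clicks/*.csv', header=True)

df.groupBy('final_page').count().show()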

HDFS: Hadoop Distributed File System. Cloud Storage was originally named DFS
(distributed file system)
HBase: open-source, NoSQL, distributed big data store. Runs on top of HDFS. The
Google Cloud equivalent is Bigtable

DFS - distributed file system = Cloud Storage

HA Dataproc has 3 masters (one is a witness); with no HA, there is 1 master and no
failover. All are in the same zone.

Primary workers can use autoscaling. Secondary workers are MIGs but do not scale.
However, if you tell Google you want 4 workers, it will try to keep 4 workers at all times.
These can be Spot VMs.

HBASE - database that lives in HDFS. Equivalent to Cloud Bigtable


Proprietary + Confidential

Dataflow offers managed data pipelines

● Processes data using Compute Engine


instances.
○ Clusters are sized for you
○ Automated scaling, no instance provisioning
required
● Write code once and get batch and streaming
○ Transform-based programming model
Dataflow, the backbone of data analytics

Dataflow, the backbone of data analytics


https://cloud.google.com/blog/topics/developers-practitioners/dataflow-backbone-data
-analytics

Dataflow Under the Hood: Comparing Dataflow with other tools (Aug 24, 2020)
https://cloud.google.com/blog/products/data-analytics/dataflow-vs-other-stream-batch-
processing-engines

Dataproc is great when you have a dataset of known size, or when you want to
manage your cluster size yourself. But what if your data shows up in realtime?
Or it’s of unpredictable size or rate? That’s where Dataflow is a particularly
good choice. It’s both a unified programming model and a managed service,
and it lets you develop and execute a big range of data processing patterns:
extract-transform-and-load, batch computation, and continuous computation.
You use Dataflow to build data pipelines, and the same pipelines work for both
batch and streaming data.

Dataflow is a unified programming model and a managed service for


developing and executing a wide range of data processing patterns including
ETL, batch computation, and continuous computation. Dataflow frees you from
operational tasks like resource management and performance optimization.

Dataflow features:
Resource Management: Dataflow fully automates management of required
processing resources. No more spinning up instances by hand.

On Demand: All resources are provided on demand, enabling you to scale to


meet your business needs. No need to buy reserved compute instances.

Intelligent Work Scheduling: Automated and optimized work partitioning which


can dynamically rebalance lagging work. No more chasing down “hot keys” or
pre-processing your input data.

Auto Scaling: Horizontal auto scaling of worker resources to meet optimum


throughput requirements results in better overall price-to-performance.

Unified Programming Model: The Dataflow API enables you to express


MapReduce like operations, powerful data windowing, and fine grained
correctness control regardless of data source.

Open Source: Developers wishing to extend the Dataflow programming model


can fork and or submit pull requests on the Java-based Dataflow SDK.
Dataflow pipelines can also run on alternate runtimes like Spark and Flink.

Monitoring: Integrated into the Cloud Console, Dataflow provides statistics


such as pipeline throughput and lag, as well as consolidated worker log
inspection—all in near-real time.

Integrated: Integrates with Cloud Storage, Pub/Sub, Datastore, Cloud Bigtable,


and BigQuery for seamless data processing. And can be extended to interact
with others sources and sinks like Apache Kafka and HDFS.

Reliable & Consistent Processing: Dataflow provides built-in support for


fault-tolerant execution that is consistent and correct regardless of data size,
cluster size, processing pattern or pipeline complexity.
Proprietary + Confidential

Dataflow pipelines flow data from a source through transforms to a sink

● Source: for example, a BigQuery table
● Transforms: map- and reduce-style processing steps
● Sink (destination): for example, Cloud Storage

This example Dataflow pipeline reads data from a BigQuery table (the “source”),
processes it in various ways (the “transforms”), and writes its output to Cloud Storage
(the “sink”). Some of those transforms you see here are map operations, and some
are reduce operations. You can build really expressive pipelines.

Each step in the pipeline is elastically scaled. There is no need to launch and manage
a cluster. Instead, the service provides all resources on demand. It has automated
and optimized work partitioning built in, which can dynamically rebalance lagging
work. That reduces the need to worry about “hot keys” -- that is, situations where
disproportionately large chunks of your input get mapped to the same cluster.
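A minimal Apache Beam (Python SDK) sketch of that source-transform-sink shape; the runner, project, region, bucket, and query values are placeholders, not part of the original example.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',        # or 'DirectRunner' for local testing
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Source: read rows from BigQuery via a SQL query.
        | 'Read' >> beam.io.ReadFromBigQuery(
            query='SELECT name, state FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 100',
            use_standard_sql=True)
        # Transform: reshape each row into a CSV line.
        | 'ToCsv' >> beam.Map(lambda row: '{},{}'.format(row['name'], row['state']))
        # Sink: write the results to Cloud Storage.
        | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/names')
    )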
Proprietary + Confidential

Why use Dataflow?

● ETL (extract/transform/load) pipelines to move, filter,


enrich, shape data
● Data analysis: batch computation or continuous
computation using streaming
● Orchestration: create pipelines that coordinate services,
including external services
● Integrates with Google Cloud services like Cloud Storage,
Pub/Sub, BigQuery, and Cloud Bigtable
○ Open source Java, Python SDKs

People use Dataflow in a variety of use cases. For one, it serves well as a
general-purpose ETL tool.

And its use case as a data analysis engine comes in handy in things like these: fraud
detection in financial services; IoT analytics in manufacturing, healthcare, and
logistics; and clickstream, Point-of-Sale, and segmentation analysis in retail.

And, because those pipelines we saw can orchestrate multiple services, even
external services, it can be used in real time applications such as personalizing
gaming user experiences.
Proprietary + Confidential

BigQuery is a fully managed data warehouse Revisiting this


topic

● Provides near real-time interactive analysis of massive


datasets (hundreds of TBs).
● Query using SQL syntax (ANSI SQL 2011)
● No cluster maintenance is required
● Compute and storage are separated with a terabit
network in between.
● You only pay for storage and processing used.
● Automatic discount for long-term data storage.

https://cloud.google.com/bigquery

Query BIG with BigQuery: A cheat sheet


https://cloud.google.com/blog/topics/developers-practitioners/query-big-bigquery-chea
t-sheet

If, instead of a dynamic pipeline, you want to do ad-hoc SQL queries on a massive
dataset, that is what BigQuery is for. BigQuery is Google's fully managed, petabyte
scale, low cost analytics data warehouse.

BigQuery is NoOps: there is no infrastructure to manage and you don't need a


database administrator, so you can focus on analyzing data to find meaningful
insights, use familiar SQL, and take advantage of our pay-as-you-go model. BigQuery
is a powerful big data analytics platform used by all types of organizations, from
startups to Fortune 500 companies.

BigQuery’s features:

Flexible Data Ingestion: Load your data from Cloud Storage or Datastore, or stream it
into BigQuery at 100,000 rows per second to enable real-time analysis of your data.

Global Availability: You have the option to store your BigQuery data in European
locations while continuing to benefit from a fully managed service, now with the option
of geographic data control, without low-level cluster maintenance.

Security and Permissions: You have full control over who has access to the data
stored in BigQuery. If you share datasets, doing so will not impact your cost or
performance; those you share with pay for their own queries.

Cost Controls: BigQuery provides cost control mechanisms that enable you to cap
your daily costs at an amount that you choose. For more information, see Cost
Controls.

Highly Available: Transparent data replication in multiple geographies means that your
data is available and durable even in the case of extreme failure modes.

Super Fast Performance: Run super-fast SQL queries against multiple terabytes of
data in seconds, using the processing power of Google's infrastructure.

Fully Integrated In addition to SQL queries, you can easily read and write data in
BigQuery via Dataflow, Spark, and Hadoop.

Connect with Google Products: You can automatically export your data from Google
Analytics Premium into BigQuery and analyze datasets stored in Google Cloud
Storage, Google Drive, and Google Sheets.

BigQuery can make Create, Replace, Update, and Delete changes to databases,
subject to some limitations and with certain known issues.
Exam Guide - Storage options
Proprietary + Confidential

2.3 Planning and configuring data storage options. Considerations include:


2.3.1 Product choice (e.g., Cloud SQL, BigQuery, Firestore, Cloud Spanner, Cloud Bigtable)
2.3.2 Choosing storage options (e.g., Zonal persistent disk, Regional balanced persistent disk, Standard,
Nearline, Coldline, Archive)

3.4 Deploying and implementing data solutions. Tasks include:


3.4.1 Initializing data systems with products (e.g., Cloud SQL, Firestore, BigQuery, Cloud Spanner, Pub/Sub,
Cloud Bigtable, Dataproc, Dataflow, Cloud Storage)
3.4.2 Loading data (e.g., command line upload, API transfer, import/export, load data from
Cloud Storage, streaming data to Pub/Sub)

4.4 Managing storage and database solutions. Tasks include:


4.4.1 Managing and securing objects in and between Cloud Storage buckets
4.4.2 Setting object life cycle management policies for Cloud Storage buckets
4.4.3 Executing queries to retrieve data from data instances (e.g., Cloud SQL, BigQuery,
Cloud Spanner, Datastore , Cloud Bigtable)
4.4.4 Estimating costs of data storage resources
4.4.5 Backing up and restoring database instances (e.g., Cloud SQL, Datastore)
4.4.6 Reviewing job status in Dataproc, Dataflow, or BigQuery
Proprietary + Confidential

BigQuery Data Transfer Service


● Automates data movement into BigQuery on a scheduled, managed basis
● Transfers data into BigQuery (does not transfer data out of BigQuery)
● Supports loading data from the following data sources*
○ External cloud storage providers
■ Amazon S3
○ Data warehouses
■ Teradata
■ Amazon Redshift
○ Google Software as a Service (SaaS) apps
○ Campaign Manager
○ Cloud Storage
○ Google Ad Manager

*Check the documentation for a current list of data sources

BigQuery Data Transfer Service


https://cloud.google.com/bigquery-transfer/docs/introduction
Proprietary + Confidential

BigQuery - other data ingestion methods


● In addition to the Data Transfer Service, there are several other ways to ingest data into
BigQuery:
  ○ Batch load a set of data records
    ■ Sources could be Cloud Storage, or a file stored on a local machine
    ■ Data could be formatted in Avro, CSV, JSON, ORC, or Parquet
  ○ Stream individual records or batches of records
    ■ A Storage Write API for BigQuery introduced in 2021
      ● Data must be in the ProtoBuf format
      ● This is a detailed coding solution that provides 100% control
    ■ Alternatively, use Pub/Sub and Dataflow
      ● Much of the complexity is handled by Google, e.g., autoscaling
  ○ Use SQL queries to generate new data and append or overwrite the results to a table
  ○ Use a third-party application or service

Introduction to loading data


https://cloud.google.com/bigquery/docs/loading-data
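To illustrate the batch load option listed above, here is a minimal hedged example using the bq tool; the dataset, table, bucket path, and schema are placeholders.

bq load \
    --source_format=CSV \
    --skip_leading_rows=1 \
    mydataset.mytable \
    gs://my-bucket/data/sales.csv \
    region:STRING,sale_date:DATE,amount:FLOAT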
Proprietary + Confidential

Cloud Storage - choosing a transfer option (saw this slide earlier)

Where you're moving data from | Scenario | Suggested products
Another cloud provider (for example, Amazon Web Services or Microsoft Azure) to Google Cloud | n/a | Storage Transfer Service
Cloud Storage to Cloud Storage (two different buckets) | n/a | Storage Transfer Service
Your private data center to Google Cloud | Enough bandwidth to meet your project deadline for less than 1 TB of data | gsutil
Your private data center to Google Cloud | Enough bandwidth to meet your project deadline for more than 1 TB of data | Storage Transfer Service for on-premises data
Your private data center to Google Cloud | Not enough bandwidth to meet your project deadline | Transfer Appliance
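For the gsutil row above, a typical invocation looks like the following; the local path and bucket name are placeholders, and -m enables parallel (multi-threaded) copies.

gsutil -m cp -r /data/on-prem-export gs://my-migration-bucket/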
Proprietary + Confidential

Cloud SQL
● Export data to a storage bucket
gcloud sql export csv my-cloudsqlserver \
gs://my-bucket/sql-export.csv \
--database=hr-database \
--query='select * from employees'

● Import data from a bucket


gcloud sql import sql [INSTANCE_NAME]
gs://[BUCKET_NAME]/[IMPORT_FILE_NAME] \
--database=[DATABASE_NAME]

● For REST API examples, see


https://cloud.google.com/sql/docs/mysql/import-export/import-export-csv#ex
port_data_to_a_csv_file

Cloud SQL - Export and import using CSV files:


https://cloud.google.com/sql/docs/mysql/import-export/import-export-csv

Cloud SQL - Best practices for importing and exporting data:


https://cloud.google.com/sql/docs/mysql/import-export/
Proprietary + Confidential

Streaming data to Pub/Sub

● Suggest looking at various quickstarts


○ Stream messages from Pub/Sub by using Dataflow
■ https://cloud.google.com/pubsub/docs/stream-messages-dataflow
○ Using client libraries
■ https://cloud.google.com/pubsub/docs/publish-receive-messages-c
lient-library
○ Using gcloud CLI
■ https://cloud.google.com/pubsub/docs/publish-receive-messages-g
cloud
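As a quick illustration of the gcloud CLI path above, a minimal hedged sequence; the topic and subscription names are placeholders.

# Create a topic and a pull subscription attached to it.
gcloud pubsub topics create my-topic
gcloud pubsub subscriptions create my-sub --topic=my-topic

# Publish a message, then pull and acknowledge it.
gcloud pubsub topics publish my-topic --message="hello world"
gcloud pubsub subscriptions pull my-sub --auto-ack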
Exam Guide - Storage options
Proprietary + Confidential

2.3 Planning and configuring data storage options. Considerations include:


2.3.1 Product choice (e.g., Cloud SQL, BigQuery, Firestore, Cloud Spanner, Cloud Bigtable)
2.3.2 Choosing storage options (e.g., Zonal persistent disk, Regional balanced persistent disk, Standard,
Nearline, Coldline, Archive)

3.4 Deploying and implementing data solutions. Tasks include:


3.4.1 Initializing data systems with products (e.g., Cloud SQL, Firestore, BigQuery, Cloud Spanner, Pub/Sub,
Cloud Bigtable, Dataproc, Dataflow, Cloud Storage)
3.4.2 Loading data (e.g., command line upload, API transfer, import/export, load data from
Cloud Storage, streaming data to Pub/Sub)

4.4 Managing storage and database solutions. Tasks include:


4.4.1 Managing and securing objects in and between Cloud Storage buckets
4.4.2 Setting object life cycle management policies for Cloud Storage buckets
4.4.3 Executing queries to retrieve data from data instances (e.g., Cloud SQL, BigQuery,
Cloud Spanner, Datastore , Cloud Bigtable)
4.4.4 Estimating costs of data storage resources
4.4.5 Backing up and restoring database instances (e.g., Cloud SQL, Datastore)
4.4.6 Reviewing job status in Dataproc, Dataflow, or BigQuery
Proprietary + Confidential

Use the Google Cloud Pricing Calculator to estimate costs


● Create cost estimates based on
forecasting and capacity planning.
● The parameters entered will vary according
to the service, e.g.,
○ Compute Engine - machine type,
operating system, usage/day, disk size,
etc
○ Cloud Storage - Location, storage class,
storage amount, ingress and egress
estimates
● Can save and email estimates for later use,
e.g., presentations
https://cloud.google.com/products/calculator

The pricing calculator is the go-to resource for gaining cost estimates. Remember that
the costs are just an estimate, and actual cost may be higher or lower. The estimates
by default use the timeframe of one month. If any inputs vary from this, they will state
this. For example, Firestore document operations read, write, and delete are asked
for on a per day basis.
Proprietary + Confidential

BigQuery: Executing queries in the CLI with dry_run

Add the --dry_run flag to receive an estimate of how much data will be processed. Plug that number into the Pricing Calculator.

bq query --use_legacy_sql=false --dry_run \
'SELECT
  word,
  SUM(word_count) AS count
FROM
  `bigquery-public-data`.samples.shakespeare
WHERE
  word LIKE "%raisin%"
GROUP BY
  word'

BigQuery - Estimate storage and query costs:


https://cloud.google.com/bigquery/docs/estimate-costs

BigQuery - estimate query costs:


https://cloud.google.com/bigquery/docs/estimate-costs#estimate_query_costs
Exam Guide - Storage options
Proprietary + Confidential

2.3 Planning and configuring data storage options. Considerations include:


2.3.1 Product choice (e.g., Cloud SQL, BigQuery, Firestore, Cloud Spanner, Cloud Bigtable)
2.3.2 Choosing storage options (e.g., Zonal persistent disk, Regional balanced persistent disk, Standard,
Nearline, Coldline, Archive)

3.4 Deploying and implementing data solutions. Tasks include:


3.4.1 Initializing data systems with products (e.g., Cloud SQL, Firestore, BigQuery, Cloud Spanner, Pub/Sub,
Cloud Bigtable, Dataproc, Dataflow, Cloud Storage)
3.4.2 Loading data (e.g., command line upload, API transfer, import/export, load data from
Cloud Storage, streaming data to Pub/Sub)

4.4 Managing storage and database solutions. Tasks include:


4.4.1 Managing and securing objects in and between Cloud Storage buckets
4.4.2 Setting object life cycle management policies for Cloud Storage buckets
4.4.3 Executing queries to retrieve data from data instances (e.g., Cloud SQL, BigQuery,
Cloud Spanner, Datastore , Cloud Bigtable)
4.4.4 Estimating costs of data storage resources
4.4.5 Backing up and restoring database instances (e.g., Cloud SQL, Datastore)
4.4.6 Reviewing job status in Dataproc, Dataflow, or BigQuery

Cloud Spanner backup and restore:


https://cloud.google.com/spanner/docs/backup

Bigtable - Manage backups:


https://cloud.google.com/bigtable/docs/managing-backups
Proprietary + Confidential

Cloud SQL Backup - Console

About Cloud SQL backups:


https://cloud.google.com/sql/docs/mysql/backup-recovery/backups

Cloud SQL - Create and manage on-demand and automatic backups:


https://cloud.google.com/sql/docs/mysql/backup-recovery/backing-up

Cloud SQL - Schedule Cloud SQL database backups:


https://cloud.google.com/sql/docs/mysql/backup-recovery/scheduling-backups
Proprietary + Confidential

Cloud SQL Restore - Console


Proprietary + Confidential

Cloud SQL Backup/Restore - CLI

Note: --async means don't wait for the command to complete

● Create backup

gcloud sql backups create --async --instance mytest

● Backups - get the instance IDs


gcloud sql backups list --instance mytest

● Restore backup - get the backup ID


gcloud sql backups restore [Backup ID] \
--restore-instance mytest \
--backup-instance mytest \
--async
Proprietary + Confidential

Cloud Firestore Backup/Restore - Console

Exports are stored in a Cloud Storage bucket

Firestore (Datastore) - Exporting and Importing Entities:


https://cloud.google.com/datastore/docs/export-import-entities

Firestore (Datastore) - Move data between projects:


https://firebase.google.com/docs/firestore/manage-data/move-data
Proprietary + Confidential

Cloud Firestore Export/Import - CLI

● Export data

gcloud firestore export gs://my-firestore-data

● Import data

gcloud firestore import gs://my-firestore-data/file-created-by-export/
Exam Guide - Storage options
Proprietary + Confidential

2.3 Planning and configuring data storage options. Considerations include:


2.3.1 Product choice (e.g., Cloud SQL, BigQuery, Firestore, Cloud Spanner, Cloud Bigtable)
2.3.2 Choosing storage options (e.g., Zonal persistent disk, Regional balanced persistent disk, Standard,
Nearline, Coldline, Archive)

3.4 Deploying and implementing data solutions. Tasks include:


3.4.1 Initializing data systems with products (e.g., Cloud SQL, Firestore, BigQuery, Cloud Spanner, Pub/Sub,
Cloud Bigtable, Dataproc, Dataflow, Cloud Storage)
3.4.2 Loading data (e.g., command line upload, API transfer, import/export, load data from
Cloud Storage, streaming data to Pub/Sub)

4.4 Managing storage and database solutions. Tasks include:


4.4.1 Managing and securing objects in and between Cloud Storage buckets
4.4.2 Setting object life cycle management policies for Cloud Storage buckets
4.4.3 Executing queries to retrieve data from data instances (e.g., Cloud SQL, BigQuery,
Cloud Spanner, Datastore , Cloud Bigtable)
4.4.4 Estimating costs of data storage resources
4.4.5 Backing up and restoring database instances (e.g., Cloud SQL, Datastore )
4.4.6 Reviewing job status in Dataproc, Dataflow, or BigQuery
Proprietary + Confidential

Dataproc: Reviewing job status - CLI


● Getting job status

All jobs:

gcloud dataproc jobs list

Specific job:

gcloud dataproc jobs describe job-id --region=region

What is Dataproc?:
https://cloud.google.com/dataproc/docs/concepts/overview

Life of a Dataproc Job:


https://cloud.google.com/dataproc/docs/concepts/jobs/life-of-a-job

gcloud command:
https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/list
Proprietary + Confidential

Dataflow: Reviewing job status - CLI


● Getting job status

All jobs:

gcloud dataflow jobs list

Specific job:

gcloud dataflow jobs describe job-id --region=region

Using the Dataflow monitoring interface:


https://cloud.google.com/dataflow/docs/guides/using-monitoring-intf

gcloud command:
https://cloud.google.com/sdk/gcloud/reference/dataflow/jobs/describe
BigQuery: Reviewing job status - CLI
Proprietary + Confidential

● Listing jobs

bq ls --jobs=true --all=true

● Getting job status

bq --location=LOCATION show --job=true JOB_ID

● Example

bq show --job=true
myproject:US.bquijob_123x456_123y123z123c
● Sample output
Job Type State Start Time Duration User Email Bytes Processed Bytes Billed
---------- --------- ----------------- ---------- ------------------- ----------------- --------------
extract SUCCESS 06 Jul 11:32:10 0:01:41 [email protected]

BigQuery - Introduction to BigQuery jobs:


https://cloud.google.com/bigquery/docs/jobs-overview

BigQuery - Managing jobs:


https://cloud.google.com/bigquery/docs/managing-jobs

View job details with the bq CLI:


https://cloud.google.com/bigquery/docs/managing-jobs#bq
Exam Guide - Storage options
Proprietary + Confidential

2.3 Planning and configuring data storage options. Considerations include:


2.3.1 Product choice (e.g., Cloud SQL, BigQuery, Firestore, Cloud Spanner, Cloud Bigtable)
2.3.2 Choosing storage options (e.g., Zonal persistent disk, Regional balanced persistent disk, Standard,
Nearline, Coldline, Archive)

3.4 Deploying and implementing data solutions. Tasks include:


3.4.1 Initializing data systems with products (e.g., Cloud SQL, Firestore, BigQuery, Cloud Spanner, Pub/Sub,
Cloud Bigtable, Dataproc, Dataflow, Cloud Storage)
3.4.2 Loading data (e.g., command line upload, API transfer, import/export, load data from
Cloud Storage, streaming data to Pub/Sub)

4.4 Managing storage and database solutions. Tasks include:


4.4.1 Managing and securing objects in and between Cloud Storage buckets
4.4.2 Setting object life cycle management policies for Cloud Storage buckets
4.4.3 Executing queries to retrieve data from data instances (e.g., Cloud SQL, BigQuery,
Cloud Spanner, Datastore , Cloud Bigtable)
4.4.4 Estimating costs of data storage resources
4.4.5 Backing up and restoring database instances (e.g., Cloud SQL, Datastore )
4.4.6 Reviewing job status in Dataproc, Dataflow, or BigQuery
Proprietary + Confidential
