GCP ACE Notes 3
pls-academy-ace-student-slides-4-2303
Thank you!
Session logistics
● When you have a question, please:
○ Click the Raise hand button in Google Meet.
○ Or add your question to the Q&A section of Google Meet.
○ Please note that answers may be deferred until the end of the session.
● These slides are available in the Student Lecture section of your Qwiklabs classroom.
Exam Guide
https://cloud.google.com/certification/guides/cloud-engineer
Sample Questions
https://docs.google.com/forms/d/e/1FAIpQLSfexWKtXT2OSFJ-obA4iT3GmzgiOCGvjrT9OfxilWC1yPtmfQ/viewform
Click here: Professional Cloud Architect
Needed for exam voucher
https://cloud.google.com/certification/guides/cloud-engineer/
02 Storage options
Some of these are Compute Engine topics; object storage is covered next.
Storage and database services (one column per service):
● Cloud Storage. Good for: binary or object data. Such as: images, media serving, backups.
● Filestore. Good for: Network Attached Storage (NAS). Such as: latency-sensitive workloads.
● Cloud SQL. Good for: web frameworks. Such as: CMS, eCommerce.
● Cloud Spanner. Good for: RDBMS + scale, HA, HTAP. Such as: user metadata, Ad/Fin/MarTech.
● Firestore. Good for: hierarchical, mobile, web. Such as: user profiles, game state.
● Cloud Bigtable. Good for: heavy read + write, events. Such as: AdTech, financial, IoT.
● BigQuery. Good for: enterprise data warehouse. Such as: analytics, dashboards.
● Memorystore. Good for: caching for web/mobile apps. Such as: game state, user sessions.
This table shows the storage and database services and highlights each service’s type (object, file, relational, non-relational, and data warehouse), what it is good for, and its intended use.
https://cloud.google.com/blog/topics/developers-practitioners/all-you-need-know-about-cloud-storage
Online content
Cloud Storage’s primary use is whenever binary large object (“BLOB”) storage is needed: for online content such as videos and photos, for backup and archived data, and for storage of intermediate results in processing workflows.
Geographic location
Unique name
Cloud Storage files are organized into buckets. A bucket needs a globally unique name and a specific geographic location where it should be stored; an ideal location is one that minimizes latency. For example, if most of your users are in Europe, you probably want to pick a European location: either a specific Google Cloud region in Europe, or else the EU multi-region.
● Standard: in multi-region locations for serving content globally; in regional or dual-region locations for data accessed frequently or for high-throughput needs. Examples: images, websites, documents; genomics; general data analytics & compute.
● Nearline: for data accessed less than once a month. Example: backup.
● Coldline: for data accessed roughly less than once a quarter. Example: disaster recovery.
● Archive: for long-term retention. Examples: movie archive, tape replacement.
There are four primary storage classes in Cloud Storage and stored data is managed
and billed according to which “class” it belongs.
The first is Standard Storage. Standard Storage is considered best for frequently
accessed, or “hot,” data. It’s also great for data that is stored for only brief periods of
time.
The second storage class is Nearline Storage. This is best for storing infrequently
accessed data, like reading or modifying data once per month or less, on average.
Examples might include data backups, long-tail multimedia content, or data archiving.
The third storage class is Coldline Storage. This is also a low-cost option for storing
infrequently accessed data. However, as compared to Nearline Storage, Coldline
Storage is meant for reading or modifying data, at most, once every 90 days.
The fourth storage class is Archive Storage. This is the lowest-cost option, used
ideally for data archiving, online backup, and disaster recovery. It’s the best choice for
data that you plan to access less than once a year, because it has higher costs for
data access and operations and a 365-day minimum storage duration.
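To make the class choice concrete, here is a minimal sketch of creating a bucket with a non-default storage class using the google-cloud-storage Python client; the bucket name and location are placeholders, not from the slides:

from google.cloud import storage

client = storage.Client()

# Bucket names are globally unique; this one is hypothetical.
bucket = client.bucket("ace-notes-demo-bucket")
bucket.storage_class = "NEARLINE"  # or STANDARD, COLDLINE, ARCHIVE

# Location may be a region ("europe-west1") or a multi-region ("EU").
client.create_bucket(bucket, location="EU")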
Although each of these four classes has differences, it’s worth noting that several
characteristics apply across all these storage classes.
These include:
● Unlimited storage with no minimum object size requirement,
● Worldwide accessibility and locations,
● Low latency and high durability,
● A uniform experience, which extends to security, tools, and APIs, and
● Geo-redundancy if data is stored in a multi-region or dual-region. This means placing physical servers in geographically diverse data centers to protect against catastrophic events and natural disasters, and load-balancing traffic for optimal performance.
Cloud Storage has no minimum fee because you pay only for what you use, and prior
provisioning of capacity isn’t necessary.
And from a security perspective, Cloud Storage always encrypts data on the server
side, before it’s written to disk, at no additional charge. Data traveling between a
customer’s device and Google is encrypted by default using HTTPS/TLS (Transport
Layer Security).
Bucket locations: regional, dual-region, multi-region
https://cloud.google.com/storage/docs/locations
Use case by storage class:
● Standard: “hot” data and/or data stored for only brief periods of time, like data-intensive computations
● Nearline: infrequently accessed data like data backup, long-tail multimedia content, and data archiving
● Coldline: infrequently accessed data that you read or modify at most once a quarter
● Archive: data archiving, online backup, and disaster recovery
Durability: 99.999999999% (all classes)
*Minimum storage duration = if you delete a file before X days, you still pay for X days
Cloud Storage has four storage classes: Standard, Nearline, Coldline, and Archive. Each storage class is available in three location types:
● A multi-region is a large geographic area, such as the United States, that contains two or more geographic places.
● A dual-region is a specific pair of regions, such as Finland and the Netherlands.
● A region is a specific geographic place, such as London.
Let’s focus on durability and availability. All of these storage classes have 11 nines of durability, but what does that mean? Does it mean you have access to your files at all times? No. It means you won’t lose data. You may still be unable to access the data, which is like going to your bank and saying “my money is in there, it’s 11 nines durable” even though, when the bank is closed, you don’t have access to it. That access is the availability, which is what differs between the storage classes and the location types.
BigQuery: discussed later
Online transfer
gsutil: https://cloud.google.com/storage/docs/gsutil
gcloud: https://cloud.google.com/sdk/gcloud/reference/storage
Transfer Appliance
https://cloud.google.com/transfer-appliance
There are several ways to bring your data into Cloud Storage.
Many customers simply carry out their own online transfer using gsutil, which is the
Cloud Storage command from the Cloud SDK. Data can also be moved in by using a
drag and drop option in the Google Cloud console, if accessed through the Google
Chrome web browser.
But what if you have to upload terabytes or even petabytes of data? Storage Transfer
Service enables you to import large amounts of online data into Cloud Storage quickly
and cost-effectively. The Storage Transfer Service lets you schedule and manage
batch transfers to Cloud Storage from another cloud provider, from a different Cloud
Storage region, or from an HTTP(S) endpoint.
And then there is the Transfer Appliance, which is a rackable, high-capacity storage
server that you lease from Google Cloud. You connect it to your network, load it with
data, and then ship it to an upload facility where the data is uploaded to Cloud
Storage. You can transfer up to a petabyte of data on a single appliance.
Transfer times for 100 PB across increasing network speeds: 34,048 years, 3,404 years, 340 years, 34 years, 3 years, 124 days.
One of the data transfer barriers that all companies face is their available network. The rate at which we are creating data is accelerating far beyond our capacity to send it all over typical networks, so all companies battle with this.
In working with our customers we’ve found the typical enterprise organization has about 10 petabytes of information and available network bandwidth of somewhere between 100 Mbps and 1 Gbps. This works out to somewhere between 3 and 34 years of ingress. Too long. All of that data has different storage requirements and use cases, and will likely have many different eventual homes. Because of this, organizations need data transfer options across both of these axes (data size and available bandwidth).
● Upload
gsutil cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/
gsutil cp desktop/myfile.png gs://my-bucket/
● Download
gsutil cp gs://BUCKET_NAME/OBJECT_NAME SAVE_TO_LOCATION
gsutil cp gs://my-bucket/myfile.png desktop/
Transfer options:
● Your private data center to Google Cloud, less than 1 TB of data, enough bandwidth to meet your project deadline: gsutil
● Your private data center to Google Cloud, more than 1 TB of data, enough bandwidth to meet your project deadline: Storage Transfer Service for on-premises data
● Your private data center to Google Cloud, not enough bandwidth to meet your project deadline: Transfer Appliance
Cloud Storage has many object management features. For example, you can set a retention policy on all objects in the bucket, or configure objects to expire after 30 days.
You can also use versioning, so that multiple versions of an object are tracked and
available if necessary. You might even set up lifecycle management, to automatically
move objects that haven’t been accessed in 30 days to Nearline and after 90 days to
Coldline.
Lifecycle conditions:
https://cloud.google.com/storage/docs/lifecycle#conditions
gsutil command:
https://cloud.google.com/storage/docs/gsutil/commands/lifecycle
Object versioning:
https://cloud.google.com/storage/docs/object-versioning
Create a JSON config file containing the rules, then run:
gsutil lifecycle set [config-file] gs://acme-data-bucket
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {
          "numNewerVersions": 2,
          "isLive": false
        }
      },
      {
        "action": {"type": "Delete"},
        "condition": {
          "daysSinceNoncurrentTime": 7
        }
      }
    ]
  }
}
Rule 1: Delete noncurrent objects if there are 2 newer versions.
Rule 2: Delete noncurrent versions of objects after they've been noncurrent for 7 days.
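The same lifecycle rules can also be managed from code. A hedged sketch using the google-cloud-storage Python client; the bucket name reuses the example above:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("acme-data-bucket")

# Rule 1: delete noncurrent objects once 2 newer versions exist.
bucket.add_lifecycle_delete_rule(number_of_newer_versions=2, is_live=False)
# Rule 2: delete versions 7 days after they become noncurrent.
bucket.add_lifecycle_delete_rule(days_since_noncurrent_time=7)

bucket.patch()  # persist the updated lifecycle configuration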
You can manage databases yourself by creating a VM and installing the database software, but then you are responsible for everything: backups, patches, OS updates, etc. As an alternative, Google offers fully managed storage options where Google manages the underlying infrastructure.
Database Options
Cloud SQL
A fully managed, cloud-based alternative to on-premises MySQL, PostgreSQL, and SQL Server databases
The business value of Cloud SQL: how companies speed up deployments, lower costs and boost agility
https://cloud.google.com/blog/products/databases/the-business-value-of-cloud-sql/
Cloud SQL:
https://cloud.google.com/sql/docs/mysql/introduction
Cloud SQL offers fully managed relational databases, including MySQL, PostgreSQL,
and SQL Server as a service. It’s designed to hand off mundane, but necessary and
often time-consuming, tasks to Google—like applying patches and updates, managing
backups, and configuring replications—so your focus can be on building great
applications.
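Because a Cloud SQL instance behaves like an ordinary MySQL server, any standard driver can connect to it. A minimal sketch with PyMySQL; the host IP, credentials, database, and table here are hypothetical (in practice you might also connect through the Cloud SQL Auth Proxy):

import pymysql

connection = pymysql.connect(
    host="10.1.2.3",        # instance IP (hypothetical)
    user="app_user",
    password="change-me",
    database="hr-database",
)
with connection.cursor() as cursor:
    cursor.execute("SELECT id, name FROM employees LIMIT 5")
    for row in cursor.fetchall():
        print(row)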
Managed by Google: storage scaling, availability, database maintenance, monitoring, security, OS, hardware & network.
Google Cloud manages the infrastructure work while the customer focuses on their
goals.
Note that Qlue uses multiple Google Cloud services, not just Cloud SQL. It’s highly
recommended that you understand the use case for all these services, and how they
are used together to meet customer needs.
Cloud Spanner
An enterprise-grade, globally distributed, externally consistent relational database with unlimited scalability and industry-leading 99.999% availability
● Powers Google’s most popular, globally available products, like YouTube, Drive, and Gmail
● Capable of processing more than 1 billion queries per second at peak
● For any workload, large or small, that cannot tolerate downtime and requires high availability
● Regional and multi-regional deployments
○ SLA: Multi-regional: 99.999%
○ SLA: Regional: 99.99%
Cloud Spanner myths busted:
● Supports ANSI standard SQL
Cloud Spanner:
https://cloud.google.com/blog/topics/developers-practitioners/what-cloud-spanner
https://cloud.google.com/spanner
Spanner powers Google’s most popular, globally available products, like YouTube, Drive, and Gmail, and it can process more than 1 billion queries per second at peak.
But don’t be misled into thinking that Spanner is only for large, enterprise-level applications. Many customers use Spanner for their smaller workloads (both in terms of transactions per second and storage size) for availability and scalability reasons. Spanner is appropriate for any customer that cannot tolerate downtime and needs high availability for their applications.
Limitless scalability and high availability are critical in many industry verticals such as gaming and retail, especially when a newly launched game goes viral and becomes an overnight success, or when a retailer has to handle a sudden surge in traffic due to a Black Friday/Cyber Monday sale.
Spanner offers the flexibility to interact with the database via a SQL dialect based on the ANSI 2011 standard, as well as via REST or gRPC API interfaces, which are optimized for performance and ease of use. In addition to Spanner’s own interface, there is a PostgreSQL interface for Spanner that leverages the ubiquity of PostgreSQL and provides development teams with an interface they are familiar with. The PostgreSQL interface provides a rich subset of the open-source PostgreSQL SQL dialect, including common query syntax, functions, and operators. It also supports a core collection of open-source PostgreSQL data types, DDL syntax, and information schema views. You get PostgreSQL familiarity and relational semantics at Spanner scale.
This is another example where multiple Google Cloud services are used together to
provide a solution.
Querying Spanner
Using a client library (Python example):

from google.cloud import spanner

def query_data(instance_id, database_id):
    """Queries sample data from the database using SQL."""
    spanner_client = spanner.Client()
    instance = spanner_client.instance(instance_id)
    database = instance.database(database_id)

    with database.snapshot() as snapshot:
        results = snapshot.execute_sql(
            "SELECT SingerId, AlbumId, AlbumTitle FROM Albums"
        )
        for row in results:
            print("SingerId: {}, AlbumId: {}, AlbumTitle: {}".format(*row))

Python quickstart
https://cloud.google.com/spanner/docs/getting-started/python
If you already know database programming, you will be comfortable using Spanner.
Like all relational databases, you create tables, tables have fields, fields have data
types and constraints. You can set up relationships between tables. When you want to
store data, you add rows to tables.
Once you have data, you can retrieve it with a standard SQL SELECT statement.
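For the write side, a hedged sketch of adding rows with the Spanner Python client's batch API; the instance and database IDs are placeholders, and the values are modeled on the Albums table queried above:

from google.cloud import spanner

spanner_client = spanner.Client()
database = spanner_client.instance("my-instance").database("my-database")

# Insert rows atomically in a single batch.
with database.batch() as batch:
    batch.insert(
        table="Albums",
        columns=("SingerId", "AlbumId", "AlbumTitle"),
        values=[(1, 1, "Total Junk"), (1, 2, "Go, Go, Go")],
    )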
There are different types of NoSQL databases. (The Google Cloud services are in
bold text in the above slide.)
Key-value stores are the simplest. These act like a map or dictionary. To save data,
just specify a key and assign a value to that key.
Document stores allow you to store hierarchical data. So, instead of having an orders
table and a details table, you can store an order in a single document. That document
can have properties that represent the order along with an array of subdocuments
representing the details. The data can be stored in the database as JSON or XML, or
a binary format called BSON.
Wide-column stores have data in tables. Each row in the table has a unique value or
key. Associated with that key can be any number of columns. Each column can have
any type of data. There’s no schema, so different rows in the same table can have
different columns.
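To illustrate the document-store idea, an order and its details can live in one hierarchical document instead of two tables. A sketch with hypothetical field names (a Python dict mirrors the JSON/BSON a document store would hold):

order = {
    "order_id": "A-1001",
    "customer": "alice",
    "placed": "2024-01-15",
    # Subdocuments replace the separate details table.
    "details": [
        {"sku": "widget-9", "qty": 2, "price": 4.99},
        {"sku": "gizmo-3", "qty": 1, "price": 19.99},
    ],
}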
Cloud Firestore
● Completely managed, document store, NoSQL Database
○ No administration, no maintenance, nothing to provision
● 1 GB per month free tier
● Indexes created for every property by default
○ Secondary indexes and composite indexes are supported
● Supports ACID* transactions
● For mobile, web, and IoT apps at global scale
○ Live synchronization and offline support
● Multi-region replication
Firestore:
https://cloud.google.com/firestore/docs/
It simplifies storing, syncing, and querying data for your mobile, web, and IoT apps at global scale.
Its client libraries provide live synchronization and offline support, and its security
features and integrations with Firebase and Google Cloud accelerate building truly
serverless apps.
Cloud Firestore also supports ACID transactions, so if any of the operations in the
transaction fail and cannot be retried, the whole transaction will fail.
Also, with automatic multi-region replication and strong consistency, your data is safe
and available, even when disasters strike.
Cloud Firestore allows you to run sophisticated queries against your NoSQL data
without any degradation in performance. This gives you more flexibility in the way you
structure your data.
An index is created for every property so that queries are extremely fast.
From: https://firebase.google.com/docs/firestore/data-model#hierarchical-data
Forbes website (for anyone that wants to know more about the company)
https://www.forbes.com/
cities_ref = db.collection(u'cities')
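The stray line above is from the Firestore querying slide. A fuller, hedged sketch around it using the Firestore Python client; the document ID, fields, and filter are assumptions:

from google.cloud import firestore

db = firestore.Client()
cities_ref = db.collection(u'cities')

# Write a document (fields are hypothetical).
cities_ref.document(u'LA').set({u'name': u'Los Angeles', u'state': u'CA'})

# Query; the per-property indexes keep filters like this fast.
for doc in cities_ref.where(u'state', u'==', u'CA').stream():
    print(doc.id, doc.to_dict())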
Cloud Bigtable
Highly scalable NoSQL database that can handle billions of rows and petabytes of
data, making it ideal for use cases that require large-scale data storage and
processing with up to 99.999% availability
Bigtable:
https://cloud.google.com/bigtable
Cloud Bigtable provides Column Families. By accessing the Column Family, you can
pull some of the data you need without pulling all of the data from the row or having to
search for it and assemble it. This makes access more efficient.
How Macy’s enhances the customer experience with Google Cloud services
https://cloud.google.com/blog/products/databases/how-macys-enhances-customer-experience-google-cloud-services
client = bigtable.Client(admin=True)
Here’s some Python code that uses the Bigtable SDK. The takeaway should be that
the code is pretty simple. Connect to the database, create a table, tables have column
families. You can then add rows. Rows require a unique ID or row key. Rows have
columns that are in column families.
results = table.read_rows()
results.consume_all()
Once you have some data, you can read individual or multiple rows. Remember, to
get high performance, you want to retrieve rows using the row key.
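Putting those fragments together, a hedged end-to-end sketch with the Bigtable Python client; the project, instance, table, and column names are placeholders:

from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")

# Create a table with one column family.
table = instance.table("greetings")
table.create(column_families={"cf1": column_family.MaxVersionsGCRule(2)})

# Write a row; the row key is the unique ID.
row = table.direct_row("greeting#0001")
row.set_cell("cf1", b"message", "Hello, Bigtable!".encode("utf-8"))
row.commit()

# Read it back by row key for the fastest access path.
row_data = table.read_row("greeting#0001")
print(row_data.cells["cf1"][b"message"][0].value.decode("utf-8"))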
BigQuery
● Fully managed, serverless, highly scalable data warehouse
● Multi-cloud capabilities using standard SQL
● Processes multi-terabytes of data in minutes
● Automatic high availability
● Supports federated queries
○ Cloud SQL & Cloud Spanner
○ Cloud Bigtable
○ Files in Cloud Storage
● Use cases:
○ Near real-time analytics of streaming data to predict business outcomes with
built-in machine learning, geospatial analysis and more
○ Analysis of historical data
Amount of data processed by the query.
bq query --use_legacy_sql=false \
'SELECT
word,
SUM(word_count) AS count
FROM
`bigquery-public-data`.samples.shakespeare
WHERE
word LIKE "%raisin%"
GROUP BY
word'
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT name, SUM(number) as total_people
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX'
GROUP BY name, state
ORDER BY total_people DESC
LIMIT 20
"""
query_job = client.query(query) # Make an API request.
Decision chart:
● Is your data structured? If no: do you need a shared file system? Yes: Filestore. No: Cloud Storage.
● If structured: is your workload analytics? If yes: do you need updates or low latency? Yes: Cloud Bigtable. No: BigQuery.
● If not analytics: is your data relational? If no: Firestore.
● If relational: do you need horizontal scalability? Yes: Cloud Spanner. No: Cloud SQL.
Let’s summarize the services in this module with this decision chart:
● First, ask yourself: Is your data structured, and will it need to be accessed
using its structured data format? If the answer is no, then ask yourself if you
need a shared file system. If you do, then choose Filestore.
● If you don't, then choose Cloud Storage.
● If your data is structured and needs to be accessed in this way, then ask yourself: does your workload focus on analytics? If it does, you will want to choose Cloud Bigtable or BigQuery, depending on your latency and update needs.
● Otherwise, check whether your data is relational. If it’s not relational, choose
Firestore.
● If it is relational, you will want to choose Cloud SQL or Cloud Spanner,
depending on your need for horizontal scalability.
Depending on your application, you might use one or several of these services to get
the job done. For more information on how to choose between these different
services, please refer to the following two links:
https://cloud.google.com/storage-options/
https://cloud.google.com/products/databases/
Storage and database services
Memorystore
● Fully managed implementation of the open source
in-memory databases Redis and Memcached
● High availability, failover, patching and monitoring
● Sub-millisecond latency
● Instances up to 300 GB
● Network throughput of 12 Gbps
● Use cases:
○ Lift and shift of Redis, Memcached
○ Anytime need a managed service for cached
data
Memorystore
https://cloud.google.com/memorystore
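Because Memorystore is wire-compatible with open-source Redis, the standard redis-py client works unchanged. A minimal sketch; the instance IP is hypothetical:

import redis

r = redis.Redis(host="10.0.0.3", port=6379)

# Cache a user session with a 5-minute TTL.
r.set("session:42", "cached-profile-data", ex=300)
print(r.get("session:42"))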
In-memory caching
Covers the services discussed in this module, plus a few more.
Migrating Databases
● Pub/Sub: scalable & flexible enterprise messaging
● Dataflow: stream & batch processing; unified and simplified pipelines
● BigQuery: analytics database; stream data at 100,000 rows per second
● Dataproc: managed Hadoop, MapReduce, Spark, Pig, and Hive service
Google Cloud Big Data solutions are designed to help you transform your business
and user experiences with meaningful data insights. It is an integrated, serverless
platform. “Serverless” means you don’t have to provision compute instances to run
your jobs. The services are fully managed, and you pay only for the resources you
consume. The platform is “integrated” so Google Cloud data services work together to
help you create custom solutions.
Cloud Pub/Sub is a fully managed, massively scalable messaging service that can be
configured to send messages between independent applications, and can scale to
millions of messages per second.
Pub/Sub messages can be sent and received via HTTP and HTTPS.
Pub/Sub is an important building block for applications where data arrives at high and
unpredictable rates, like Internet of Things systems. If you’re analyzing streaming
data, Dataflow is a natural pairing with Pub/Sub.
Pub/Sub also works well with applications built on Google Cloud’s compute platforms.
You can configure your subscribers to receive messages on a “push” or a “pull” basis.
In other words, subscribers can get notified when new messages arrive for them, or
they can check for new messages at intervals.
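A minimal publishing sketch with the Pub/Sub Python client; the project, topic, and attribute names are placeholders:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "iot-readings")

# Message data must be bytes; attributes are optional string key/values.
future = publisher.publish(topic_path, b"temp=21.5", device_id="sensor-7")
print("Published message ID:", future.result())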
Overview
https://cloud.google.com/dataproc
Customer use case: Best practices for migrating Hadoop to Dataproc by LiveRamp
https://cloud.google.com/blog/products/data-analytics/best-practices-for-migrating-hadoop-to-gcp-dataproc
Dataproc is a fast, easy, managed way to run Hadoop, Spark, Hive, and Pig on
Google Cloud. All you have to do is to request a Hadoop cluster. It will be built for you
in 90 seconds or less, on top of Compute Engine virtual machines whose number and
type you can control. If you need more or less processing power while your cluster’s
running, you can scale it up or down. You can use the default configuration for the
Hadoop software in your cluster, or you can customize it. And you can monitor your
cluster using Operations.
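A hedged sketch of requesting a cluster programmatically with the Dataproc Python client; the project, region, cluster name, and machine types are assumptions (the console or gcloud works just as well):

from google.cloud import dataproc_v1

region = "us-central1"
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)
cluster = {
    "project_id": "my-project",
    "cluster_name": "ace-demo-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}
operation = cluster_client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)  # blocks until the cluster is ready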
Google’s whitepaper:
https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
The lines in black are usually the initial implementation. Customers gain greater cost savings when they transition to the red flow. This requires making some modifications to the jobs to use the connectors.
HDFS: Hadoop Distributed File System. Cloud Storage was originally named DFS (distributed file system).
HBase: open-source, NoSQL, distributed big data store that runs on top of HDFS. The Google Cloud equivalent is Bigtable.
Primary workers can use autoscaling. Secondary workers are managed instance groups (MIGs) but do not autoscale; however, if you tell Google you want 4 workers, it will try to keep 4 workers at all times. These can be Spot VMs.
Dataflow Under the Hood: Comparing Dataflow with other tools (Aug 24, 2020)
https://cloud.google.com/blog/products/data-analytics/dataflow-vs-other-stream-batch-processing-engines
Dataproc is great when you have a dataset of known size, or when you want to
manage your cluster size yourself. But what if your data shows up in realtime?
Or it’s of unpredictable size or rate? That’s where Dataflow is a particularly
good choice. It’s both a unified programming model and a managed service,
and it lets you develop and execute a big range of data processing patterns:
extract-transform-and-load, batch computation, and continuous computation.
You use Dataflow to build data pipelines, and the same pipelines work for both
batch and streaming data.
Dataflow features:
Resource Management: Dataflow fully automates management of required
processing resources. No more spinning up instances by hand.
This example Dataflow pipeline reads data from a BigQuery table (the “source”),
processes it in various ways (the “transforms”), and writes its output to Cloud Storage
(the “sink”). Some of those transforms you see here are map operations, and some
are reduce operations. You can build really expressive pipelines.
Each step in the pipeline is elastically scaled. There is no need to launch and manage a cluster. Instead, the service provides all resources on demand. It has automated and optimized work partitioning built in, which can dynamically rebalance lagging work. That reduces the need to worry about “hot keys” -- that is, situations where disproportionately large chunks of your input get mapped to the same cluster.
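A tiny Apache Beam (Python) sketch of that unified model; the same mix of map-style and reduce-style transforms runs locally for tests or on Dataflow by switching runners. The input strings are made up:

import apache_beam as beam

# DirectRunner by default; pass DataflowRunner via pipeline options to run at scale.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.Create(["to be or not to be", "that is the question"])
        | "Split" >> beam.FlatMap(str.split)            # map-style transform
        | "Count" >> beam.combiners.Count.PerElement()  # reduce-style transform
        | "Print" >> beam.Map(print)
    )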
People use Dataflow in a variety of use cases. For one, it serves well as a
general-purpose ETL tool.
And its use case as a data analysis engine comes in handy in things like these: fraud
detection in financial services; IoT analytics in manufacturing, healthcare, and
logistics; and clickstream, Point-of-Sale, and segmentation analysis in retail.
And, because those pipelines we saw can orchestrate multiple services, even
external services, it can be used in real time applications such as personalizing
gaming user experiences.
https://cloud.google.com/bigquery
If, instead of a dynamic pipeline, you want to do ad-hoc SQL queries on a massive
dataset, that is what BigQuery is for. BigQuery is Google's fully managed, petabyte
scale, low cost analytics data warehouse.
BigQuery’s features:
Flexible Data Ingestion: Load your data from Cloud Storage or Datastore, or stream it
into BigQuery at 100,000 rows per second to enable real-time analysis of your data.
Global Availability: You have the option to store your BigQuery data in European
locations while continuing to benefit from a fully managed service, now with the option
of geographic data control, without low-level cluster maintenance.
Security and Permissions: You have full control over who has access to the data
stored in BigQuery. If you share datasets, doing so will not impact your cost or
performance; those you share with pay for their own queries.
Cost Controls: BigQuery provides cost control mechanisms that enable you to cap
your daily costs at an amount that you choose. For more information, see Cost
Controls.
Highly Available: Transparent data replication in multiple geographies means that your
data is available and durable even in the case of extreme failure modes.
Super Fast Performance: Run super-fast SQL queries against multiple terabytes of
data in seconds, using the processing power of Google's infrastructure.
Fully Integrated: In addition to SQL queries, you can easily read and write data in BigQuery via Dataflow, Spark, and Hadoop.
Connect with Google Products: You can automatically export your data from Google
Analytics Premium into BigQuery and analyze datasets stored in Google Cloud
Storage, Google Drive, and Google Sheets.
BigQuery can make Create, Replace, Update, and Delete changes to databases,
subject to some limitations and with certain known issues.
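To illustrate the streaming ingestion mentioned above, a hedged sketch using the BigQuery Python client's insert_rows_json; the table ID and row fields are hypothetical, and the table must already exist:

from google.cloud import bigquery

client = bigquery.Client()

errors = client.insert_rows_json(
    "my-project.my_dataset.events",
    [{"user_id": "u123", "event": "click"}],
)
print(errors or "Rows streamed successfully")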
Exam Guide - Storage options
● Your private data center to Google Cloud, less than 1 TB of data, enough bandwidth to meet your project deadline: gsutil
● Your private data center to Google Cloud, more than 1 TB of data, enough bandwidth to meet your project deadline: Storage Transfer Service for on-premises data
● Your private data center to Google Cloud, not enough bandwidth to meet your project deadline: Transfer Appliance
Cloud SQL
● Export data to a storage bucket
gcloud sql export csv my-cloudsqlserver gs://my-bucket/sql-export.csv \
  --database=hr-database \
  --query='select * from employees'
The pricing calculator is the go-to resource for gaining cost estimates. Remember that
the costs are just an estimate, and actual cost may be higher or lower. The estimates
by default use the timeframe of one month. If any inputs vary from this, they will state
this. For example, Firestore document operations read, write, and delete are asked
for on a per day basis.
● Export data
● Import data
Specific job
What is Dataproc?:
https://cloud.google.com/dataproc/docs/concepts/overview
gcloud command:
https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/list
Specific job
gcloud command:
https://cloud.google.com/sdk/gcloud/reference/dataflow/jobs/describe
BigQuery: Reviewing job status - CLI
● Listing jobs
bq ls --jobs=true --all=true
● Example
bq show --job=true myproject:US.bquijob_123x456_123y123z123c
● Sample output
Job Type   State     Start Time        Duration   User Email          Bytes Processed   Bytes Billed
---------- --------- ----------------- ---------- ------------------- ----------------- --------------
extract    SUCCESS   06 Jul 11:32:10   0:01:41    [email protected]