GCP Data
Question 1:
You currently have a single on-premises Kafka cluster in a data center in the
us-east region that is responsible for ingesting messages from IoT devices
globally. Because large parts of the globe have poor internet connectivity,
messages sometimes batch at the edge, come in all at once, and cause a
spike in load on your Kafka cluster. This is becoming difficult to manage and
prohibitively expensive. What is the Google-recommended cloud-native
architecture for this scenario?
Edge TPUs as sensor devices for storing and transmitting the messages.
Explanation
A is incorrect as Edge TPU enables broad deployment of high-quality AI at the
edge; it is an inference accelerator, not a service for ingesting IoT messages.
Links:
https://cloud.google.com/edge-tpu
Question 2:
You decided to use Cloud Datastore to ingest vehicle telemetry data in real-
time. You want to build a storage system that will account for the long-term
data growth while keeping the costs low. You also want to create snapshots of
the data periodically, so that you can make a point-in-time (PIT) recovery, or
clone a copy of the data for Cloud Datastore in a different environment. You
want to archive these snapshots for a long time. Which two methods can
accomplish this? (Choose two.)
Use managed export, and then import the data into a BigQuery
table created just for that export, and delete temporary export
files. (Correct)
Write an application that uses Cloud Datastore client libraries to read all
the entities. Treat each entity as a BigQuery table row via BigQuery
streaming insert. Assign an export timestamp for each export, and attach
it as an extra column for each row. Make sure that the BigQuery table is
partitioned using the export timestamp column.
Write an application that uses Cloud Datastore client libraries to read all
the entities. Format the exported data into a JSON file. Apply compression
before storing the data in Cloud Source Repositories.
Explanation
A is correct as a Cloud Storage bucket is a good option for archiving the data.
E is incorrect as Cloud Source Repositories is meant for storing source code,
not data.
Links:
https://cloud.google.com/datastore/docs/export-import-entities
https://cloud.google.com/datastore/docs/export-import-entities#import-into-
bigquery
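Example (Python):
As an illustration of the managed export-and-import flow described above, the
sketch below loads a Datastore export file into a BigQuery table using the
BigQuery Python client. The bucket, export path, project, dataset, and table
names are placeholders, not values from the question.

from google.cloud import bigquery

client = bigquery.Client()

# Path to the export_metadata file produced by a Datastore managed export.
export_uri = (
    "gs://my-archive-bucket/2024-01-01T00:00:00_12345/"
    "all_namespaces/kind_Vehicle/all_namespaces_kind_Vehicle.export_metadata"
)

# Load the snapshot into a table created just for this export.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP,
)
load_job = client.load_table_from_uri(
    export_uri, "my_project.snapshots.vehicle_2024_01_01", job_config=job_config
)
load_job.result()  # Wait for the load job to finish.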
Question 3:
You need to create a data pipeline that copies time-series transaction data so
that it can be queried from within BigQuery by your data science team for
analysis. Every hour, thousands of transactions are updated with a new status.
The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The
data is heavily structured, and your data science team will build machine
learning models based on this data. You want to maximize performance and
usability for your data science team. Which two strategies should you adopt?
(Choose two.)
Explanation
A is correct as denormalizing the data improves query performance and makes the
data easier for the data science team to work with in BigQuery.
B is incorrect as preserving the original normalized structure reduces the
usability and query performance of the data.
Links:
https://cloud.google.com/architecture/bigquery-data-
warehouse#handling_change
https://cloud.google.com/bigquery/docs/best-practices-performance-
input#denormalizing_data
Question 4:
You are designing a cloud-native historical data processing system to meet
the following conditions:
- The data being analyzed is in CSV, Avro, and PDF formats and will be
accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and
Compute Engine.
- A streaming data pipeline stores new data daily.
- Performance is not a factor in the solution.
- The solution design should maximize availability.
Create a Cloud Dataproc cluster with high availability. Store the data in
HDFS, and perform analysis as needed.
Store the data in BigQuery. Access the data using the BigQuery Connector
on Cloud Dataproc and Compute Engine.
Store the data in a regional Cloud Storage bucket. Access the bucket
directly using Cloud Dataproc, BigQuery, and Compute Engine.
Explanation
A is incorrect as creating a Cloud Dataproc cluster with high availability does
not solve the problem: Dataproc is not a durable storage system; it is a lift-
and-shift solution for Hadoop jobs.
B is incorrect as BigQuery cannot store PDF files; it is a data warehouse.
Links:
https://jayendrapatil.com/tag/multi-regional-storage/
https://medium.com/google-cloud/google-cloud-storage-what-bucket-class-
for-the-best-performance-5c847ac8f9f2
Question 5:
You have a petabyte of analytics data and need to design a storage and
processing platform for it. You must be able to perform data warehouse-
style analytics on the data in Google Cloud and expose the dataset as
files for batch analysis tools in other cloud providers. What should you
do?
Store the warm data as files in Cloud Storage, and store the active data in
BigQuery. Keep this ratio as 80% warm and 20% active.
Explanation
A is incorrect as BigQuery cannot both store the data and expose it as files.
With this option, external cloud vendors would have to connect directly to
BigQuery, which could incur BigQuery egress costs.
Links:
https://cloud.google.com/storage/docs/collaboration
Question 6:
You work for a manufacturing company that sources up to 750 different
components, each from a different supplier. You've collected a labeled
dataset that has on average 1000 examples for each unique component.
Your team wants to implement an app to help warehouse workers
recognize incoming components based on a photo of the component.
You want to implement the first working version of this app (as Proof-Of-
Concept) within a few working days. What should you do?
Explanation
A is correct as AutoML, trained on the full dataset, is a good fit for a proof
of concept. AutoML is a managed, serverless GCP service, so it can scale to the
whole dataset.
D is incorrect as training our own model takes time, and the solution must be
delivered within a few working days.
Links:
https://cloud.google.com/vision/automl/object-detection/docs/prepare
https://cloud.google.com/vision/automl/docs/beginners-guide
https://cloud.google.com/vision
https://cloud.google.com/vision/automl/docs/beginners-
guide#data_preparation
Question 7:
You are working on a niche product in the image recognition domain.
Your team has developed a model that is dominated by custom C++
TensorFlow ops your team has implemented. These ops are used inside
your main training loop and are performing bulky matrix multiplications.
It currently takes up to several days to train a model. You want to
decrease this time significantly and keep the cost low by using an
accelerator on Google Cloud. What should you do?
Use Cloud TPUs after implementing GPU kernel support for your custom ops.
Use Cloud GPUs after implementing GPU kernel support for your custom
ops. (Correct)
Stay on CPUs, and increase the size of the cluster you're training your
model on.
Explanation
A is incorrect as TPUs can be costlier than GPUs for this workload. Moreover,
Cloud TPUs do not support models dominated by custom TensorFlow operations
inside the main training loop.
Links:
https://cloud.google.com/tpu/docs/tpus
https://cloud.google.com/tpu/docs/tpus#when_to_use_tpu
Question 8:
You work on a regression problem in a natural language processing domain,
and you have 100M labeled examples in your dataset. You have randomly
shuffled your data and split your dataset into train and test samples (in a
90/10 ratio). After you trained the neural network and evaluated your model on
a test set, you discover that the root-mean-squared error (RMSE) of your
model is twice as high on the train set as on the test set. How should you
improve the performance of your model?
Try to collect more data and increase the size of your dataset.
Explanation
A is incorrect as increasing the share of the test sample in the train-test
split will not help and leaves the model performing poorly.
Links:
https://developers.google.com/machine-learning/crash-
course/generalization/peril-of-overfitting
https://towardsdatascience.com/deep-learning-3-more-on-cnns-handling-
overfitting-2bd5d99abe5d?gi=12a885894aa6
Question 9:
You use BigQuery as your centralized analytics platform. New data is
loaded every day, and an ETL pipeline modifies the original data and
prepares it for the final users. This ETL pipeline is regularly modified and
can generate errors, but sometimes the errors are detected only after 2
weeks. You need to provide a method to recover from these errors, and
your backups should be optimized for storage costs. How should you
organize your data in BigQuery and store your backups?
Organize your data in a single table, export, and compress and store the
BigQuery data in Cloud Storage.
Organize your data in separate tables for each month, and duplicate your
data on a separate dataset in BigQuery.
Organize your data in separate tables for each month, and use snapshot
decorators to restore the table to a time prior to the corruption.
Explanation
A is incorrect as organizing the data in a single table is not optimal for
BigQuery workloads, and because the ETL errors are detected only after 2 weeks,
a single table does not help with recovery.
C is incorrect as duplicating the data into another BigQuery dataset incurs
extra storage cost and is not a cost-effective solution.
Links:
https://cloud.google.com/solutions/dr-scenarios-for-data#managed-
database-services-on-gcp
Question 10:
The marketing team at your organization provides regular updates of a
segment of your customer dataset. The marketing team has given you a
CSV with 1 million records that must be updated in BigQuery. When you
use the UPDATE statement in BigQuery, you receive a quotaExceeded
error. What should you do?
Reduce the number of records updated each day to stay within the
BigQuery UPDATE DML statement limit.
Split the source CSV file into smaller CSV files in Cloud Storage to reduce
the number of BigQuery UPDATE DML statements per BigQuery job.
Import the new records from the CSV file into a new BigQuery
table. Create a BigQuery job that merges the new records with
the existing records and writes the results to a new BigQuery
table. (Correct)
Explanation
A is incorrect as reducing the number of UPDATE statements does not solve the
business problem; the requirement is to keep the records up to date.
C is incorrect as splitting the CSV into smaller CSVs creates manual workload.
Links:
https://cloud.google.com/blog/products/bigquery/performing-large-scale-
mutations-in-bigquery
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-
syntax#merge_statement
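Example (Python):
Following the explanation above, this is a minimal sketch of running a single
MERGE job from a staging table (loaded from the CSV) into the existing table
via the BigQuery Python client. The project, dataset, table, and column names
are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `my_project.marketing.customers` AS target
USING `my_project.marketing.customers_staging` AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.segment = source.segment
WHEN NOT MATCHED THEN
  INSERT (customer_id, segment) VALUES (source.customer_id, source.segment)
"""

# One MERGE job replaces what would otherwise be many UPDATE DML statements.
client.query(merge_sql).result()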
Question 11:
As your organization expands its usage of GCP, many teams have started
to create their own projects. Projects are further multiplied to
accommodate different stages of deployments and target audiences.
Each project requires unique access control configurations. The central
IT team needs to have access to all projects. Furthermore, data from
Cloud Storage buckets and BigQuery datasets must be shared for use in
other projects in an ad hoc way. You want to simplify access control
management by minimizing the number of policies. Which two steps
should you take? (Choose two.)
Create distinct groups for various teams, and specify groups in Cloud IAM
policies.
Only use service accounts when sharing data for Cloud Storage
buckets and BigQuery datasets. (Correct)
Explanation
A is incorrect as Deployment Manager is an infrastructure-as-code deployment
service; it does not simplify access control management.
Links:
https://cloud.google.com/docs/enterprise/best-practices-for-enterprise-
organizations
Question 12:
Your United States-based company has created an application for assessing
and responding to user actions. The primary table's data volume grows by
250,000 records per second. Many third parties use your application's APIs to
build the functionality into their own frontend applications. Your application's
APIs should comply with the following requirements:
Implement Cloud SQL for PostgreSQL with the master in North America
and read replicas in Asia and Europe.
Implement Cloud Bigtable with the primary cluster in North America and
secondary clusters in Asia and Europe.
Explanation
A is incorrect as BQ is appropriate for analytical workloads instead of app
workload.
B is correct as Spanner meets all the requirements as ANSI SQL can only be
accommodated in the Spanner and it can be highly available globally.
Links:
https://cloud.google.com/spanner/#section-2
Question 13:
A data scientist has created a BigQuery ML model and asks you to create an
ML pipeline to serve predictions. You have a REST API application with the
requirement to serve predictions for an individual user ID with latency under
100 milliseconds. You use the following query to generate predictions:
Add a WHERE clause to the query, and grant the BigQuery Data Viewer
role to the application service account.
Create an Authorized View with the provided query. Share the dataset that
contains the view with the application service account.
Explanation
A is incorrect as adding a WHERE clause to the query and granting the BigQuery
Data Viewer role to the application service account still runs a BigQuery query
per request, which incurs query cost and cannot be guaranteed to complete in
100 ms.
B is incorrect for the same reason: an Authorized View is still served by
BigQuery, and BigQuery read operations do not provide 100 ms response times.
C is incorrect as invoking Dataflow for every request incurs cost and also does
not guarantee a 100 ms response.
Links:
https://cloud.google.com/blog/products/databases/getting-started-with-time-
series-trend-predictions-using-gcp
https://cloud.google.com/bigquery-ml/docs/exporting-models
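Example (Python):
The exporting-models link above is the basis for this hedged sketch: it exports
the BigQuery ML model to Cloud Storage so it can be served outside BigQuery for
low-latency predictions. The model name and bucket path are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

export_sql = """
EXPORT MODEL `my_project.analytics.user_prediction_model`
OPTIONS (URI = 'gs://my-model-bucket/user_prediction_model/')
"""
client.query(export_sql).result()
# The exported model can then be deployed to an online prediction service
# and called from the REST API application within the latency budget.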
Question 14:
You are building an application to share financial market data with consumers,
who will receive data feeds. Data is collected from the markets in real-time.
Explanation
A is incorrect as the proposed services do not match the stated requirements.
Links:
https://cloud.google.com/architecture/processing-logs-at-scale-using-
dataflow
Question 15:
You are building a new application that you need to collect data from in a
scalable way. Data arrives continuously from the application throughout the
day, and you expect to generate approximately 150 GB of JSON data per day
by the end of the year. Your requirements are:
Create an application that provides an API. Write a tool to poll the API and
write data to Cloud Storage as gzipped JSON files.
Explanation
A is incorrect as the proposed services do not fulfill all the requirements;
writing directly to Cloud Storage does not decouple the producer from the
consumer.
Question 16:
You are running a pipeline in Cloud Dataflow that receives messages from a
Cloud Pub/Sub topic and writes the results to a BigQuery dataset in the EU.
Currently, your pipeline is located in europe-west4 and has a maximum of 3
workers, for instance, type n1-standard-1. You notice that during peak
periods, your pipeline is struggling to process records in a timely fashion
when all 3 workers are at maximum CPU utilization. Which two actions can
you take to increase the performance of your pipeline? (Choose two.)
Use a larger instance type for your Cloud Dataflow workers (Correct)
Create a temporary table in Cloud Bigtable that will act as a buffer for new
data. Create a new step in your pipeline to write to this table first, and
then create a new pipeline to write from Cloud Bigtable to BigQuery
Create a temporary table in Cloud Spanner that will act as a buffer for new
data. Create a new step in your pipeline to write to this table first, and
then create a new pipeline to write from Cloud Spanner to BigQuery
Explanation
A is correct as raising the maximum number of workers relieves the load on the
existing three workers and allows autoscaling to add capacity.
B is correct as using a larger worker instance type gives each worker more CPU
and throughput.
E is incorrect as creating two pipelines does not solve the problem; it only
increases the operational workload.
Links:
https://cloud.google.com/bigtable/docs/performance
https://cloud.google.com/dataflow/docs/guides/setting-pipeline-options
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#workers
https://cloud.google.com/dataflow/docs/guides/deploying-a-
pipeline#locations
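Example (Python):
As a sketch of the two scaling changes discussed above, these Beam pipeline
options request a larger Dataflow worker machine type and a higher worker cap
so autoscaling has headroom. The project, bucket, and machine type values are
placeholders.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=europe-west4",
    "--temp_location=gs://my-bucket/temp",
    "--worker_machine_type=n1-standard-4",   # larger instance type per worker
    "--max_num_workers=10",                  # allow scaling beyond 3 workers
    "--autoscaling_algorithm=THROUGHPUT_BASED",
])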
Question 17:
You have a data pipeline with a Cloud Dataflow job that aggregates and writes
time-series metrics to Cloud Bigtable. This data feeds a dashboard used by
thousands of users across the organization. You need to support additional
concurrent users and reduce the amount of time required to write the data.
Which two actions should you take? (Choose two.)
Modify your Cloud Dataflow pipeline to use the Flatten transform before
writing to Cloud Bigtable
Explanation
A is incorrect as that Dataflow pipeline configuration is not feasible, since
the pipeline is a global resource.
Links:
https://cloud.google.com/bigtable/docs/performance#performance-write-
throughput
https://cloud.google.com/dataflow/docs/guides/specifying-exec-
params#setting-other-cloud-pipeline-options
Question 18:
You have several Spark jobs that run on a Cloud Dataproc cluster on a
schedule. Some of the jobs run in sequence, and some of the jobs run
concurrently. You need to automate this process. What should you do?
Create a Bash script that uses the Cloud SDK to create a cluster, execute
jobs, and then tear down the cluster
Explanation
A is correct as all the jobs run on Dataproc, so Dataproc workflow templates
are sufficient to orchestrate them.
Links:
https://cloud.google.com/dataproc/docs/concepts/workflows/using-workflows
Question 19:
You are building a new data pipeline to share data between two different
types of applications: jobs generators and job runners. Your solution
must scale to accommodate increases in usage and must accommodate
the addition of new applications without negatively affecting the
performance of existing ones. What should you do?
Create an API using App Engine to receive and send messages to the
applications
Create a table on Cloud SQL, and insert and delete rows with the job
information
Create a table on Cloud Spanner, and insert and delete rows with the job
information
Explanation
A is incorrect as building a custom API on App Engine is not an effective
solution: it involves manual effort and does not ensure the pipelines are
decoupled.
B is correct as Pub/Sub fits this purpose: publishers can generate the jobs and
subscribers can consume them without affecting the existing jobs.
Links:
https://cloud.google.com/pubsub/docs/publisher
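Example (Python):
This is a minimal sketch of the publisher/subscriber split described above: job
generators publish messages to a topic and job runners consume them from their
own subscriptions. The project, topic, and subscription IDs are placeholders.

import json
from google.cloud import pubsub_v1

# A job generator publishes a job description.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "jobs")
future = publisher.publish(topic_path, json.dumps({"job_id": 42}).encode("utf-8"))
print(future.result())  # message ID once the publish succeeds

# A job runner pulls jobs from its own subscription.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "jobs-runner-1")

def handle(message):
    print("received job:", message.data)
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=handle)
# Keep the process alive (e.g. streaming_pull.result()) in a real runner.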
Question 20:
You need to create a new transaction table in Cloud Spanner that stores
product sales data. You are deciding what to use as a primary key. From a
performance perspective, which strategy should you choose?
The original order identification number from the sales system, which is a
monotonically increasing integer
Explanation
A is incorrect as the current epoch time cannot avoid hotspots, because it
stores monotonically increasing values.
B is incorrect as the product name is not a good choice for optimizing
performance; an identifier should be used instead.
C is correct as a random universally unique identifier (UUID) avoids hotspots
and distributes the write traffic evenly. It is the best option from a
performance standpoint.
Links:
https://cloud.google.com/spanner/docs/schema-and-data-model
Notes:
The primary key uniquely identifies each row in a table. If you want to update
or delete existing rows in a table, then the table must have a primary key
composed of one or more columns. (A table with no primary key columns can
have only one row.) Often your application already has a field that's a natural
fit for use as the primary key.
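Example (Python):
A minimal sketch of the UUID-keyed insert discussed above, using the Cloud
Spanner Python client. The instance, database, table, and column names are
placeholders.

import uuid
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("sales-db")

with database.batch() as batch:
    batch.insert(
        table="Transactions",
        columns=("TransactionId", "ProductId", "Amount"),
        # A random UUIDv4 key spreads writes across the key space and
        # avoids the hotspots caused by monotonically increasing keys.
        values=[(str(uuid.uuid4()), "P-1001", 19.99)],
    )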
Question 21:
Data Analysts in your company have the Cloud IAM Owner role assigned
to them in their projects to allow them to work with multiple GCP
products in their projects. Your organization requires that all BigQuery
data access logs be retained for 6 months. You need to ensure that only
audit personnel in your company can access the data access logs for all
projects. What should you do?
Enable data access logs in each Data Analyst's project. Restrict access to
Stackdriver Logging via Cloud IAM roles.
Export the data access logs via a project-level export sink to a Cloud
Storage bucket in the Data Analysts' projects. Restrict access to the
Cloud Storage bucket.
Export the data access logs via a project-level export sink to a Cloud
Storage bucket in a newly created project for audit logs. Restrict access
to the project with the exported logs.
Explanation
A is incorrect as the Data Analysts already have the Owner role in their
projects, so they could still access the logs.
D is correct as an aggregated log sink creates a single sink for all projects;
the destination can be a Cloud Storage bucket, a Pub/Sub topic, a BigQuery
table, or a Cloud Logging bucket. Without an aggregated sink, this would have
to be configured for each project individually, which is cumbersome.
Links:
https://cloud.google.com/logging/docs/export/aggregated_sinks
https://cloud.google.com/iam/docs/job-
functions/auditing#scenario_operational_monitoring
Question 22:
Each analytics team in your organization is running BigQuery jobs in their
own projects. You want to enable each team to monitor slot usage within
their projects. What should you do?
Create a log export for each project, capture the BigQuery job execution
logs, create a custom metric based on the totalSlotMs, and create a
Stackdriver Monitoring dashboard based on the custom metric
Explanation
A is incorrect as the query/scanned_bytes metric gives total bytes scanned, not
slot usage.
C is incorrect as totalSlotMs is reported per job rather than per project. For
example, a 20-second query that continuously consumes 4 slots uses 80,000
totalSlotMs (4 * 20,000).
D is incorrect as totalSlotMs likewise reflects per-job slot usage, not a
per-project metric.
Links:
https://cloud.google.com/bigquery/docs/monitoring
https://cloud.google.com/monitoring/api/metrics_gcp
https://cloud.google.com/bigquery/docs/monitoring-dashboard#metrics
https://cloud.google.com/monitoring/api/metrics_gcp#gcp-bigquery
Question 23:
You are operating a streaming Cloud Dataflow pipeline. Your engineers
have a new version of the pipeline with a different windowing algorithm
and triggering strategy. You want to update the running pipeline with the
new version. You want to ensure that no data is lost during the update.
What should you do?
Stop the Cloud Dataflow pipeline with the Cancel option. Create a new
Cloud Dataflow job with the updated code
Stop the Cloud Dataflow pipeline with the Drain option. Create a new
Cloud Dataflow job with the updated code
Explanation
A is correct as this is the supported way to update an existing Dataflow
pipeline.
B is incorrect as setting a new job name launches a new Dataflow pipeline while
the old one keeps running.
C is incorrect as cancelling the pipeline stops it immediately and loses
in-flight data.
D is incorrect as the Drain option is meant for stopping a pipeline without
losing in-flight data, not for updating it.
Links:
https://cloud.google.com/dataflow/docs/guides/updating-a-
pipeline#Launching
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline
Notes:
To update your job, you'll need to launch a new job to replace the ongoing job.
When you launch your replacement job, you'll need to set the following
pipeline options to perform the update process in addition to the job's regular
options:
- Pass the --update option.
- Set the --job_name option in PipelineOptions to the same name as the job
you want to update.
- If any transform names in your pipeline have changed, you must supply a
transform mapping and pass it using the --transform_name_mapping option.
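Example (Python):
A sketch of launching the replacement job with the options listed in the notes,
using Beam's Python pipeline options. The job name and transform mapping shown
are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=europe-west4",
    "--update",                           # replace the running job in place
    "--job_name=metrics-stream",          # must match the job being updated
    '--transform_name_mapping={"OldWindowing": "NewWindowing"}',
])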
Question 24:
You need to move 2 PB of historical data from an on-premises storage
appliance to Cloud Storage within six months, and your outbound
network capacity is constrained to 20 Mb/sec. How should you migrate
this data to Cloud Storage?
Create a private URL for the historical data, and then use Storage Transfer
Service to copy the data to Cloud Storage
Use trickle or ionice along with gsutil cp to limit the amount of bandwidth
gsutil utilizes to less than 20 Mb/sec so it does not interfere with the
production traffic
Explanation
A is correct as Transfer Appliance is designed for very large transfers, such
as multi-petabyte datasets, which are impractical over a constrained network
link (see the arithmetic after the links).
Links:
https://cloud.google.com/transfer-appliance/docs/4.0
https://cloud.google.com/storage-transfer-service
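Example (Python):
A quick back-of-the-envelope check of why an online transfer is ruled out here.

# How long would 2 PB take over a 20 Mb/sec link?
data_bits = 2 * 10**15 * 8           # 2 PB expressed in bits
rate_bits_per_sec = 20 * 10**6       # 20 Mb/sec
seconds = data_bits / rate_bits_per_sec
years = seconds / (60 * 60 * 24 * 365)
print(round(years, 1))               # roughly 25 years, far beyond six months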
Question 25:
You receive data files in CSV format monthly from a third party. You need to
cleanse this data, but every third month the schema of the files changes. Your
requirements for implementing these transformations include:
Load each month's CSV data into BigQuery, and write a SQL query to
transform the data to a standard schema. Merge the transformed tables
together with a SQL query
Help the analysts write a Cloud Dataflow pipeline in Python to perform the
transformation. The Python code should be stored in a revision control
system and modified as the incoming data's schema changes
Use Apache Spark on Cloud Dataproc to infer the schema of the CSV file
before creating a Dataframe. Then implement the transformations in
Spark SQL before writing the data out to Cloud Storage and loading into
BigQuery
Explanation
A is correct as Cloud Dataprep is designed for analysts rather than developers;
transformations are built mainly through a drag-and-drop interface.
Links:
https://cloud.google.com/dataprep/docs/quickstarts/quickstart-dataprep
https://docs.trifacta.com/display/DP/Getting+Started+with+Cloud+Dataprep
https://docs.trifacta.com/display/DP/Overview+of+RapidTarget
Question 26:
You want to migrate an on-premises Hadoop system to Cloud Dataproc.
Hive is the primary tool in use, and the data format is Optimized Row
Columnar (ORC). All ORC files have been successfully copied to a Cloud
Storage bucket. You need to replicate some data to the cluster's local
Hadoop Distributed File System (HDFS) to maximize performance. What
are two ways to start using Hive in Cloud Dataproc? (Choose two.)
Run the gsutil utility to transfer all ORC files from the Cloud Storage
bucket to HDFS. Mount the Hive tables locally.
Run the gsutil utility to transfer all ORC files from the Cloud Storage
bucket to any node of the Dataproc cluster. Mount the Hive tables locally.
Run the gsutil utility to transfer all ORC files from the Cloud Storage
bucket to the master node of the Dataproc cluster. Then run the Hadoop
utility to copy them to HDFS. Mount the Hive tables from HDFS.
Explanation
A is incorrect as gsutil cannot move data directly from Cloud Storage into
HDFS; a Hadoop utility is needed for that step.
Links:
https://mrjob.readthedocs.io/en/latest/guides/dataproc-quickstart.html
https://cloud.google.com/dataproc/docs/tutorials/gcs-connector-spark-
tutorial
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-orc
https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery
https://cloud.google.com/bigquery/docs/samples/bigquery-create-table-
external-hivepartitioned
Question 27:
You are implementing several batch jobs that must be executed on a
schedule. These jobs have many interdependent steps that must be
executed in a specific order. Portions of the jobs involve executing shell
scripts, running Hadoop jobs, and running queries in BigQuery. The jobs
are expected to run for many minutes up to several hours. If the steps
fail, they must be retried a fixed number of times. Which service should
you use to manage the execution of these jobs?
Cloud Scheduler
Cloud Dataflow
Cloud Functions
Explanation
A is incorrect as Cloud Scheduler is essentially a managed crontab service; it
cannot express dependencies between steps.
B is incorrect as Cloud Dataflow is used to run data pipelines, not to
orchestrate individual components such as shell scripts, Hadoop jobs, and
BigQuery queries.
Question 28:
You work for a shipping company that has distribution centers where
packages move on delivery lines to route them properly. The company
wants to add cameras to the delivery lines to detect and track any visual
damage to the packages in transit. You need to create a way to automate
the detection of damaged packages and flag them for human review in
real time while the packages are in transit. Which solution should you
choose?
Use the Cloud Vision API to detect for damage, and raise an alert through
Cloud Functions. Integrate the package tracking applications with this
function.
Explanation
A is incorrect as BigQuery ML cannot be trained on image data.
B is correct as AutoML can train the model at scale and serve it from an API
endpoint.
Links:
https://cloud.google.com/vision/automl/object-detection/docs
Question 29:
You are migrating your data warehouse to BigQuery. You have migrated all of
your data into tables in a dataset. Multiple users from your organization will be
using the data. They should only see certain tables based on their team
membership. How should you set user permissions?
Assign the users/groups data viewer access at the table level for each
table
Create SQL views for each team in the same dataset in which the data
resides, and assign the users/groups data viewer access to the SQL views
Create authorized views for each team in datasets created for each team.
Assign the authorized views data viewer access to the dataset in which
the data resides. Assign the users/groups data viewer access to the
datasets in which the authorized views reside
Explanation
A is incorrect as BQ doesn’t provide table-level access.
Links:
https://cloud.google.com/bigquery/docs/authorized-views
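Example (Python):
A hedged sketch of authorizing one team's view against the source dataset with
the BigQuery Python client, so users can query the view without access to the
underlying tables. The project, dataset, and view names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

source_dataset = client.get_dataset("my_project.warehouse")
view_reference = {
    "projectId": "my_project",
    "datasetId": "team_a_views",
    "tableId": "team_a_sales_view",
}

# Grant the view itself access to the source dataset (the "authorized view").
entries = list(source_dataset.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view_reference))
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])
# Users/groups then get the Data Viewer role on the team_a_views dataset only.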
Question 30:
You want to build a managed Hadoop system as your data lake. The data
transformation process is composed of a series of Hadoop jobs executed in
sequence. To accomplish the design of separating storage from compute, you
decided to use the Cloud Storage connector to store all input data, output
data, and intermediary data. However, you noticed that one Hadoop job runs
very slowly with Cloud Dataproc when compared with the on-premises bare-
metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows
that this particular Hadoop job is disk I/O intensive. You want to resolve the
issue. What should you do?
Allocate more CPU cores of the virtual machine instances of the Hadoop
cluster so that the networking bandwidth for each instance can scale up
Explanation
A is incorrect as increasing the Dataproc VMs' CPU or memory does not reduce
the run time, because the job is disk I/O intensive.
Links:
https://cloud.google.com/architecture/hadoop/hadoop-gcp-migration-jobs
Question 31:
You work for an advertising company, and you've developed a Spark ML
model to predict click-through rates at advertisement blocks. You've been
developing everything at your on-premises data center, and now your
company is migrating to Google Cloud. Your data center will be closing soon,
so a rapid lift-and-shift migration is necessary. However, the data you've been
using will be migrated to BigQuery. You periodically retrain your Spark ML
models, so you need to migrate existing training pipelines to Google Cloud.
What should you do?
Explanation
A is incorrect as Cloud ML is not a lift-and-shift option for Spark ML models;
it would only be appropriate if the model were retrained from scratch.
B is incorrect as, given the time limit, we cannot rewrite and retrain the
models as TensorFlow models on Cloud ML.
C is correct as Dataproc is a true lift-and-shift solution for the Hadoop
ecosystem, so moving the whole retraining pipeline from on-premises to Dataproc
is straightforward.
Links:
https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml
Question 32:
You work for a global shipping company. You want to train a model on 40
TB of data to predict which ships in each geographic region are likely to
cause delivery delays on any given day. The model will be based on
multiple attributes collected from multiple sources. Telemetry data,
including location in GeoJSON format, will be pulled from each ship and
loaded every hour. You want to have a dashboard that shows how many
and which ships are likely to cause delays within a region. You want to
use a storage solution that has native functionality for prediction and
geospatial processing. Which storage solution should you use?
BigQuery (Correct)
Cloud Bigtable
Cloud Datastore
Explanation
A is correct as BigQuery fulfills all the stated requirements: BigQuery ML
provides native prediction and BigQuery GIS provides native geospatial
processing.
D is incorrect as Cloud SQL is not meant for model training or data
warehousing; it is a lift-and-shift solution for relational databases.
Links:
https://cloud.google.com/bigquery/docs/gis-intro
Question 33:
You operate an IoT pipeline built around Apache Kafka that normally
receives around 5000 messages per second. You want to use Google
Cloud Platform to create an alert as soon as the moving average over 1
hour drops below 4000 messages per second. What should you do?
Consume the stream of data in Cloud Dataflow using Kafka IO. Set a fixed
time window of 1 hour. Compute the average when the window closes, and
send an alert if the average is less than 4000 messages.
Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub.
Use a Cloud Dataflow template to write your messages from Cloud
Pub/Sub to Cloud Bigtable. Use Cloud Scheduler to run a script every hour
that counts the number of rows created in Cloud Bigtable in the last hour.
If that number falls below 4000, send an alert.
Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub.
Use a Cloud Dataflow template to write your messages from Cloud
Pub/Sub to BigQuery. Use Cloud Scheduler to run a script every five
minutes that counts the number of rows created in BigQuery in the last
hour. If that number falls below 4000, send an alert.
Explanation
A is correct as Cloud Dataflow can calculate the moving average and Kafka IO
can connect to the on-premises Kafka cluster.
B is incorrect as a fixed time window cannot calculate a moving average of the
number of messages.
C is incorrect as it over-complicates the task; a simple Dataflow sliding
window can calculate the average message rate.
Links:
https://cloud.google.com/architecture/processing-messages-from-kafka-
hosted-outside-gcp
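Example (Python):
A hedged sketch of the sliding-window count in the Beam Python SDK: a one-hour
window advancing every five minutes, reading from Kafka and flagging windows
whose count falls below 4000 msg/sec sustained for an hour (14,400,000
messages). Broker and topic names are placeholders, and streaming/Dataflow
options are omitted for brevity.

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.transforms.window import SlidingWindows

def alert_if_low(count, threshold=4000 * 3600):
    if count < threshold:
        print("ALERT: moving average dropped below 4000 messages/sec")
    return count

with beam.Pipeline() as p:
    (
        p
        | ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka-broker:9092"},
            topics=["iot-events"],
        )
        | beam.WindowInto(SlidingWindows(size=3600, period=300))
        | beam.CombineGlobally(beam.combiners.CountCombineFn()).without_defaults()
        | beam.Map(alert_if_low)
    )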
Question 34:
You plan to deploy Cloud SQL using MySQL. You need to ensure high
availability in the event of a zone failure. What should you do?
Create a Cloud SQL instance in one zone, and create a failover replica in
another zone within the same region.
Create a Cloud SQL instance in one zone, and configure an external read
replica in a zone in a different region.
Explanation
A is incorrect as there is no separate failover-replica option to configure.
B is correct as a read replica can be used once Cloud SQL high availability is
enabled on the primary instance.
Links:
https://cloud.google.com/sql/docs/mysql/high-availability
https://cloud.google.com/sql/docs/mysql/replication#read-replicas
Notes:
As a best practice, put read replicas in a different zone than the primary
instance when you use HA on your primary instance. This practice ensures
that read replicas continue to operate when the zone that contains the
primary instance has an outage.
Question 35:
Your company is selecting a system to centralize data ingestion and delivery.
You are considering messaging and data integration systems to address the
requirements. The key requirements are:
- The ability to seek a particular offset in a topic, possibly back to the start of
all data ever captured
- Support for publish/subscribe semantics on hundreds of topics
- Retain per-key ordering
Cloud Storage
Cloud Pub/Sub
Explanation
A is correct as Apache Kafka fulfills all the stated requirements; in Kafka, a
consumer can seek to any offset in a topic, back to the start of retained data.
Links:
https://kafka.apache.org/081/documentation.html
https://cloud.google.com/pubsub/quotas#resource_limits
https://firebase.google.com/products/cloud-messaging
Question 36:
You are planning to migrate your current on-premises Apache Hadoop
deployment to the cloud. You need to ensure that the deployment is as
fault-tolerant and cost-effective as possible for long-running batch jobs.
You want to use a managed service. What should you do?
Deploy a Cloud Dataproc cluster. Use an SSD persistent disk and 50%
preemptible workers. Store data in Cloud Storage, and change references
in scripts from hdfs:// to gs://
Explanation
A is correct as it is best practice to store the Hadoop data in a Cloud Storage
bucket and to use 50% preemptible worker VMs to keep costs low.
Links:
https://cloud.google.com/bigtable/docs/choosing-ssd-hdd
https://cloud.google.com/blog/products/data-analytics/optimize-dataproc-
costs-using-vm-machine-type
Question 37:
Your team is working on a binary classification problem. You have trained a
support vector machine (SVM) classifier with default parameters and received
an area under the Curve (AUC) of 0.87 on the validation set. You want to
increase the AUC of the model. What should you do?
Deploy the model and measure the real-world AUC; it's always higher
because of generalization
Scale predictions you get out of the model (tune a scaling factor
as a hyperparameter) in order to get the highest AUC (Correct)
Explanation
A is incorrect as performing hyperparameter tuning helps to build a model but
does not necessarily increase the AUC score.
B is incorrect as a deep neural network is appropriate when there is far more
data; it is not in itself a solution for increasing AUC.
D is correct as scaling the predictions out of the model with a tuned scaling
factor is expected to yield the highest AUC, since changing how the outputs are
processed can improve the score.
Links:
https://developers.google.com/machine-learning/crash-
course/classification/roc-and-auc?hl=en
Notes:
One way of interpreting AUC is as the probability that the model ranks a
random positive example more highly than a random negative example.
Question 38:
You need to deploy additional dependencies to all of a Cloud Dataproc
cluster at startup using an existing initialization action. Company
security policies require that Cloud Dataproc nodes do not have access
to the Internet so public initialization actions cannot fetch resources.
What should you do?
Use an SSH tunnel to give the Cloud Dataproc cluster access to the
Internet
Use Resource Manager to add the service account used by the Cloud
Dataproc cluster to the Network User role
Explanation
A is incorrect as using Cloud SQL Proxy will not help as Cloud SQL cannot
store dependencies.
B is incorrect as the SSH tunnel will still expose the Dataproc cluster to the
internet.
C is correct as the GCS bucket can store the dependencies and be used in
the init script.
D is incorrect as the Network user role can access a shared VPC network but
we still need to store the dependencies.
Links:
https://cloud.google.com/dataproc/docs/concepts/configuring-
clusters/network#create_a_cloud_dataproc_cluster_with_internal_ip_address
_only
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-
actions
Question 39:
You need to choose a database for a new project that has the following
requirements:
- Fully managed
- Able to automatically scale-up
- Transactionally consistent
- Able to scale up to 6 TB
- Able to be queried using SQL
Cloud Bigtable
Cloud Spanner
Cloud Datastore
Explanation
A is correct as Cloud SQL meets all the requirements.
Links:
https://cloud.google.com/sql
Question 40:
You work for a mid-sized enterprise that needs to move its operational
system transaction data from an on-premises database to GCP. The
database is about 20 TB in size. Which database should you choose?
Cloud Bigtable
Cloud Spanner
Cloud Datastore
Explanation
A is correct as Cloud SQL can store up to 30TB of data.
Links:
https://cloud.google.com/sql/docs/quotas#storage_limits
Question 41:
You need to choose a database to store time-series CPU and memory usage
for millions of computers. You need to store this data in one-second interval
samples. Analysts will be performing real-time, ad hoc analytics against the
database. You want to avoid being charged for every query executed and
ensure that the schema design will allow for future growth of the dataset.
Which database and data model should you choose?
Create a table in BigQuery, and append the new samples for CPU and
memory to the table
Create a wide table in BigQuery, create a column for the sample value at
each second, and update the row with the interval for each second
Create a wide table in Cloud Bigtable with a row key that combines the
computer identifier with the sample time at each minute, and combine the
values for each second as column data.
Explanation
A is incorrect as BigQuery doesn’t support dynamic schema whereas BigTable
is a NoSql database so it essentially comes with a schema-less design.
Links:
https://cloud.google.com/bigtable/docs/schema-design-time-
series#patterns_for_row_key_design
Notes:
A tall and narrow table has a small number of events per row, which could be
just one event, whereas a short and wide table has a large number of events
per row. As explained in a moment, tall and narrow tables are best suited for
time-series data.
For time series, you should generally use tall and narrow tables. This is for two
reasons: Storing one event per row makes it easier to run queries against your
data. Storing many events per row makes it more likely that the total row size
will exceed the recommended maximum (see Rows can be big but are not
infinite).
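Example (Python):
A sketch of the tall-and-narrow pattern from the notes, writing one Bigtable
row per machine per sample with a row key that combines the machine identifier
and a timestamp. The instance, table, and column family names are placeholders.

import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("metrics-instance").table("machine_metrics")

now = datetime.datetime.utcnow()
# Machine ID first, then time, so reads for one machine are contiguous
# while writes from many machines spread across the key space.
row_key = "machine-0042#{}".format(now.strftime("%Y%m%d%H%M%S")).encode("utf-8")

row = table.direct_row(row_key)
row.set_cell("stats", "cpu", str(0.73), timestamp=now)
row.set_cell("stats", "memory", str(0.41), timestamp=now)
row.commit()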
Question 42:
You want to archive data in Cloud Storage. Because some data is very
sensitive, you want to use the "Trust No One" (TNO) approach to encrypt
your data to prevent the cloud provider staff from decrypting your data.
What should you do?
Use gcloud kms keys create to create a symmetric key. Then use gcloud
kms encrypt to encrypt each archival file with the key and unique
additional authenticated data (AAD). Use gsutil cp to upload each
encrypted file to the Cloud Storage bucket, and keep the AAD outside of
Google Cloud.
Use gcloud kms keys create to create a symmetric key. Then use gcloud
kms encrypt to encrypt each archival file with the key. Use gsutil cp to
upload each encrypted file to the Cloud Storage bucket. Manually destroy
the key previously used for encryption, and rotate the key once.
Explanation
A is incorrect as AAD serves a different purpose; it does not strengthen the
encryption (see the note below).
B is incorrect as the Trust No One approach rules out letting Google hold the
key, and this option relies on Cloud KMS keys.
Links:
https://cloud.google.com/kms/docs/additional-authenticated-
data#when_to_use_aad
https://cloud.google.com/storage/docs/encryption/customer-supplied-keys
Notes:
AAD does not increase the cryptographic strength of the ciphertext. Instead,
it is an additional check by Cloud KMS to authenticate a decryption request.
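Example (Python):
As a hedged sketch of a Trust No One upload, this uses a customer-supplied
encryption key (CSEK) generated and kept outside Google Cloud, so Google stores
only ciphertext it cannot decrypt. The bucket, object, and file names are
placeholders.

import base64
import os
from google.cloud import storage

# Generate a 256-bit AES key locally and keep it only in your own key store.
csek = os.urandom(32)
print("Store this key securely offline:", base64.b64encode(csek).decode())

client = storage.Client()
bucket = client.bucket("my-archive-bucket")
blob = bucket.blob("archive/2024-01.tar.gz", encryption_key=csek)
blob.upload_from_filename("2024-01.tar.gz")  # encrypted with the supplied key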
Question 43:
You have data pipelines running on BigQuery, Cloud Dataflow, and Cloud
Dataproc. You need to perform health checks and monitor their behavior, and
then notify the team managing the pipelines if they fail. You also need to be
able to work across multiple projects. Your preference is to use managed
products/features of the platform. What should you do?
Run a Virtual Machine in Compute Engine with Airflow, and export the
information to Stackdriver
Export the logs to BigQuery, and set up App Engine to read that
information and send emails if you find a failure in the logs
Develop an App Engine application to consume logs using GCP API calls,
and send emails if you find a failure in the logs
Explanation
A is correct as Stackdriver Monitoring can monitor multiple projects at once.
There is a concept of workspace in which we can include multiple projects.
Links:
https://cloud.google.com/monitoring/workspaces/#account-project
https://cloud.google.com/monitoring/settings/multiple-projects#create-multi
Question 44:
You are building storage for files for a data pipeline on Google Cloud. You
want to support JSON files. The schema of these files will occasionally
change. Your analyst teams will use running aggregate ANSI SQL queries on
this data. What should you do?
A. Use BigQuery for storage. Provide format files for data load. Update the
format files as needed.
Explanation
B is correct because of the requirement to support occasionally (schema)
changing JSON files and aggregate ANSI SQL queries: you need to use
BigQuery, and it is quickest to use 'Automatically detect' for schema changes.
A is not correct because you should not provide format files: you can simply
turn on the 'Automatically detect' schema changes flag.
C, D are not correct because you should not use Cloud Storage for this
scenario: it is cumbersome and doesn't add value.
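Example (Python):
A minimal sketch of loading newline-delimited JSON with schema auto-detection,
which absorbs occasional schema changes as the explanation suggests. The
bucket, dataset, and table names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery detect the schema automatically
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.json",
    "my_project.analytics.events",
    job_config=job_config,
)
load_job.result()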
Question 45:
Your company is loading CSV files into BigQuery. The data is fully imported
successfully; however, the imported data is not matching byte-to-byte to the
source file. What is the most likely cause of this problem?
B. The CSV data had invalid rows that were skipped on import.
D. The CSV data has not gone through an ETL phase before loading into
BigQuery.
Explanation
A is not correct because if another data format other than CSV was selected
then the data would not import successfully.
B is not correct because the data was fully imported meaning no rows were
skipped.
C is correct because an encoding mismatch is the only situation that would
allow the import to succeed while the loaded data does not match the source
file byte-to-byte.
D is not correct because whether the data has been previously transformed
will not affect whether the source file will match the BigQuery table.
Links:
https://cloud.google.com/bigquery/docs/loading-data#loading_encoded_data