
Question 1:
You currently have a single on-premises Kafka cluster in a data center in the
us-east region that is responsible for ingesting messages from IoT devices
globally. Because large parts of the globe have poor internet connectivity,
messages sometimes batch at the edge, come in all at once, and cause a
spike in load on your Kafka cluster. This is becoming difficult to manage and
prohibitively expensive. What is the Google-recommended cloud-native
architecture for this scenario?

Edge TPUs as sensor devices for storing and transmitting the messages.

Cloud Dataflow connected to the Kafka cluster to scale the processing of incoming messages.

An IoT gateway connected to Cloud Pub/Sub, with Cloud Dataflow to read and process the messages from Cloud Pub/Sub. (Correct)

A Kafka cluster virtualized on Compute Engine in us-east with Cloud Load Balancing to connect to the devices around the world.

Explanation
A is incorrect as Edge TPU enables broad deployment of high-quality AI at the edge; it is not an IoT ingestion service.

B is incorrect as running a Kafka cluster, whether on-premises or in the cloud, is not the recommended practice when Google already offers an equivalent managed service, Cloud Pub/Sub.

C is correct as an IoT gateway is a managed way to connect devices, Cloud Pub/Sub handles message ingestion at global scale, and Cloud Dataflow processes the data in windows; this is the Google-recommended cloud-native architecture.

D is incorrect as a Kafka cluster virtualized on Compute Engine is not a scalable solution.

Links:
https://cloud.google.com/edge-tpu
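
For reference, a minimal sketch of how an IoT gateway could publish device messages to Cloud Pub/Sub with the Python client library; the project ID, topic name, and payload fields below are hypothetical.

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-iot-project", "device-telemetry")

reading = {"device_id": "sensor-42", "temperature": 21.7}
future = publisher.publish(
    topic_path,
    data=json.dumps(reading).encode("utf-8"),
    origin="us-east",  # message attributes must be strings
)
print("Published message ID:", future.result())

Downstream, a Dataflow pipeline reads from the corresponding subscription and processes the messages in windows.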

Question 2:
You decided to use Cloud Datastore to ingest vehicle telemetry data in real-
time. You want to build a storage system that will account for the long-term
data growth while keeping the costs low. You also want to create snapshots of
the data periodically, so that you can make a point-in-time (PIT) recovery, or
clone a copy of the data for Cloud Datastore in a different environment. You
want to archive these snapshots for a long time. Which two methods can
accomplish this? (Choose two.)

Use managed export, and store the data in a Cloud Storage bucket using Nearline or Coldline class. (Correct)

Use managed export, and then import to Cloud Datastore in a separate project under a unique namespace reserved for that export.

Use managed export, and then import the data into a BigQuery table created just for that export, and delete temporary export files. (Correct)

Write an application that uses Cloud Datastore client libraries to read all
the entities. Treat each entity as a BigQuery table row via BigQuery
streaming insert. Assign an export timestamp for each export, and attach
it as an extra column for each row. Make sure that the BigQuery table is
partitioned using the export timestamp column.

Write an application that uses Cloud Datastore client libraries to read all
the entities. Format the exported data into a JSON file. Apply compression
before storing the data in Cloud Source Repositories.

Explanation
A is correct as a Nearline or Coldline Cloud Storage bucket is a low-cost option well suited to archiving the snapshots.

B is incorrect as importing into another Datastore project does not serve the objective of recoverable, long-term archives.

C is correct as importing the export into BigQuery supports point-in-time recovery, and BigQuery storage is inexpensive for this purpose.

D is incorrect as a custom replication application would have to handle a dynamic schema, and we do not need to run analysis on the data.

E is incorrect as Cloud Source Repositories is meant for storing code, not data.

Links:
https://cloud.google.com/datastore/docs/export-import-entities

https://cloud.google.com/datastore/docs/export-import-entities#import-into-
bigquery
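
As an illustration of option C, the sketch below loads a kind's managed-export output into a BigQuery table using the DATASTORE_BACKUP source format; the bucket path, kind, and table names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical path: the managed export writes one .export_metadata file per kind.
uri = ("gs://my-archive-bucket/2023-01-01/default_namespace/kind_Vehicle/"
       "default_namespace_kind_Vehicle.export_metadata")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    uri, "my-project.snapshots.vehicle_2023_01_01", job_config=job_config
)
load_job.result()  # wait for the load job to finish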

Question 3:
You need to create a data pipeline that copies time-series transaction data so
that it can be queried from within BigQuery by your data science team for
analysis. Every hour, thousands of transactions are updated with a new status.
The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The
data is heavily structured, and your data science team will build machine
learning models based on this data. You want to maximize performance and
usability for your data science team. Which two strategies should you adopt?
(Choose two.)

Denormalize the data as much as possible. (Correct)

Preserve the structure of the data as much as possible.

Use BigQuery UPDATE to further reduce the size of the dataset.

Develop a data pipeline where status updates are appended to BigQuery instead of updated.

Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query. (Correct)

Explanation
A is correct as denormalizing the data improves BigQuery query performance for the data science team.

B is incorrect as preserving the heavily normalized structure reduces the usability of the data in BigQuery.

C is incorrect as running UPDATE statements on a dataset of this size would incur significant cost.

D is incorrect: although appending is cheaper than updating the BigQuery table, in this case it would significantly increase storage size.

E is correct as BigQuery supports federated queries over external data sources, so storing daily snapshots as Avro files in Cloud Storage and querying them works well.

Links:
https://cloud.google.com/architecture/bigquery-data-
warehouse#handling_change

https://cloud.google.com/bigquery/docs/best-practices-performance-
input#denormalizing_data
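
A minimal sketch of option E, defining a BigQuery external table over daily Avro snapshot files in Cloud Storage; the project, dataset, and bucket names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and snapshot location.
table = bigquery.Table("my-project.analytics.transactions_snapshot")
external_config = bigquery.ExternalConfig("AVRO")
external_config.source_uris = ["gs://transaction-snapshots/2023-01-01/*.avro"]
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# The data science team can then query the snapshot with standard SQL.
query = "SELECT COUNT(*) AS n FROM `my-project.analytics.transactions_snapshot`"
print(list(client.query(query).result()))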

Question 4:
You are designing a cloud-native historical data processing system to meet
the following conditions:

- The data being analyzed is in CSV, Avro, and PDF formats and will be
accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and
Compute Engine.
- A streaming data pipeline stores new data daily.
- Performance is not a factor in the solution.
- The solution design should maximize availability.

How should you design data storage for this solution?

Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.

Store the data in BigQuery. Access the data using the BigQuery Connector
on Cloud Dataproc and Compute Engine.

Store the data in a regional Cloud Storage bucket. Access the bucket
directly using Cloud Dataproc, BigQuery, and Compute Engine.

Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine. (Correct)

Explanation
A is incorrect as creating a Cloud Dataproc cluster with high availability does not solve the problem; Dataproc is not a storage system but a lift-and-shift solution for Hadoop jobs.

B is incorrect as BigQuery cannot store PDF files; it is a data warehouse solution.

C is incorrect as a regional bucket offers lower availability than a multi-regional GCS bucket.

D is correct as a multi-regional GCS bucket is meant for exactly this kind of storage. GCS is blob storage that can hold any file format and maximizes availability for historical data.

Links:
https://jayendrapatil.com/tag/multi-regional-storage/

https://medium.com/google-cloud/google-cloud-storage-what-bucket-class-
for-the-best-performance-5c847ac8f9f2

Question 5:
You have a petabyte of analytics data and need to design a storage and
processing platform for it. You must be able to perform data warehouse-
style analytics on the data in Google Cloud and expose the dataset as
files for batch analysis tools in other cloud providers. What should you
do?

Store and process the entire dataset in BigQuery.

Store and process the entire dataset in Cloud Bigtable.

Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket. (Correct)

Store the warm data as files in Cloud Storage, and store the active data in
BigQuery. Keep this ratio as 80% warm and 20% active.

Explanation
A is incorrect as BigQuery alone cannot expose the dataset as files; external cloud tools would have to connect directly to BigQuery, which could incur egress costs.

B is incorrect as Bigtable is not meant for data-warehouse-style analysis.

C is correct as performing the analysis in BigQuery and exporting a compressed copy to a Cloud Storage bucket is appropriate: BigQuery is the data warehouse solution, and Cloud Storage is easily accessible from batch tools in other cloud providers.

D is incorrect as BQ should not be used for data archiving. It will be costly.

Links:
https://cloud.google.com/storage/docs/collaboration
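
A sketch of the export step in option C, extracting a BigQuery table to compressed Avro files in Cloud Storage for batch tools in other clouds; the table and bucket names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.AVRO,
    compression=bigquery.Compression.SNAPPY,  # compressed copy for sharing
)
extract_job = client.extract_table(
    "my-project.analytics.events",               # hypothetical source table
    "gs://shared-exports/events/events-*.avro",  # hypothetical destination
    job_config=job_config,
)
extract_job.result()  # wait for the export to complete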

Question 6:
You work for a manufacturing company that sources up to 750 different
components, each from a different supplier. You've collected a labeled
dataset that has on average 1000 examples for each unique component.
Your team wants to implement an app to help warehouse workers
recognize incoming components based on a photo of the component.
You want to implement the first working version of this app (as Proof-Of-
Concept) within a few working days. What should you do?

Use Cloud Vision AutoML with the existing dataset. (Correct)

Use Cloud Vision AutoML, but reduce your dataset twice.

Use Cloud Vision API by providing custom labels as recognition hints.

Train your own image recognition model leveraging transfer learning techniques.

Explanation
A is correct as AutoML with the full, existing dataset is a good fit for a quick proof of concept. AutoML is a managed GCP service, so it scales to the whole dataset.

B is incorrect as there is no need to reduce the data; reducing it can hurt the quality of the AutoML model.

C is incorrect as the Cloud Vision API is built for generic object and label detection; custom label hints cannot reliably recognize 750 specific components, and it is a paid per-request service.

D is incorrect as training our model will take time and we need to come up
with the solution as soon as possible.

Links:
https://cloud.google.com/vision/automl/object-detection/docs/prepare

https://cloud.google.com/vision/automl/docs/beginners-guide

https://cloud.google.com/vision

https://cloud.google.com/vision/automl/docs/beginners-
guide#data_preparation

Question 7:
You are working on a niche product in the image recognition domain.
Your team has developed a model that is dominated by custom C++
TensorFlow ops your team has implemented. These ops are used inside
your main training loop and are performing bulky matrix multiplications.
It currently takes up to several days to train a model. You want to
decrease this time significantly and keep the cost low by using an
accelerator on Google Cloud. What should you do?

Use Cloud TPUs without any additional adjustment to your code.

Use Cloud TPUs after implementing GPU kernel support for your custom ops.

Use Cloud GPUs after implementing GPU kernel support for your custom ops. (Correct)

Stay on CPUs, and increase the size of the cluster you're training your
model on.

Explanation
A is incorrect as TPUs are more costly than GPUs for this case and, more importantly, Cloud TPUs do not support models dominated by custom TensorFlow ops inside the main training loop.

B is incorrect for the same reason: even with GPU kernel support implemented, the custom ops in the training loop make TPUs a poor and costly fit.

C is correct as GPUs keep the cost lower than TPUs and support custom ops once GPU kernel support is implemented for them.

D is incorrect as CPUs cannot provide enough compute power to train a large model quickly.

Links:
https://cloud.google.com/tpu/docs/tpus

https://cloud.google.com/tpu/docs/tpus#when_to_use_tpu

Question 8:
You work on a regression problem in a natural language processing domain,
and you have 100M labeled examples in your dataset. You have randomly
shuffled your data and split your dataset into train and test samples (in a
90/10 ratio). After you trained the neural network and evaluated your model on
a test set, you discover that the root-mean-squared error (RMSE) of your
model is twice as high on the train set as on the test set. How should you
improve the performance of your model?

Increase the share of the test sample in the train-test split.

Try to collect more data and increase the size of your dataset.

Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.

Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of vocabularies or n-grams used. (Correct)

Explanation
A is incorrect as increasing the share of the test sample in the train-test split does not address the underlying problem; the model remains too weak.

B is incorrect as more training data alone does not guarantee better learning; if the network does not have enough capacity, it cannot benefit from a larger dataset.

C is incorrect as regularization techniques are used when a model overfits, i.e., it has learned the training data so well that it cannot generalize to test or production data. Here the error is higher on the training set, which indicates underfitting.

D is correct as increasing the complexity of the model gives it enough capacity to fit the training data and stop underfitting.

Links:
https://developers.google.com/machine-learning/crash-
course/generalization/peril-of-overfitting

https://towardsdatascience.com/deep-learning-3-more-on-cnns-handling-
overfitting-2bd5d99abe5d?gi=12a885894aa6

Question 9:
You use BigQuery as your centralized analytics platform. New data is
loaded every day, and an ETL pipeline modifies the original data and
prepares it for the final users. This ETL pipeline is regularly modified and
can generate errors, but sometimes the errors are detected only after 2
weeks. You need to provide a method to recover from these errors, and
your backups should be optimized for storage costs. How should you
organize your data in BigQuery and store your backups?

Organize your data in a single table, export, and compress and store the
BigQuery data in Cloud Storage.

Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage. (Correct)

Organize your data in separate tables for each month, and duplicate your
data on a separate dataset in BigQuery.

Organize your data in separate tables for each month, and use snapshot
decorators to restore the table to a time prior to the corruption.

Explanation
A is incorrect as keeping everything in a single table is not optimized for BigQuery workloads, and since ETL errors are detected only after 2 weeks, restoring a single monolithic export is cumbersome.

B is correct as monthly tables keep analytical workloads efficient, and exporting compressed data to Cloud Storage gives cost-optimized backups that meet the requirements.

C is incorrect as duplicating the data in another BigQuery dataset is not a cost-effective backup.

D is incorrect as snapshot decorators can restore a table only within 7 days, while errors here may be detected only after 2 weeks.

Links:
https://cloud.google.com/solutions/dr-scenarios-for-data#managed-
database-services-on-gcp

Question 10:
The marketing team at your organization provides regular updates of a
segment of your customer dataset. The marketing team has given you a
CSV with 1 million records that must be updated in BigQuery. When you
use the UPDATE statement in BigQuery, you receive a quotaExceeded
error. What should you do?

Reduce the number of records updated each day to stay within the
BigQuery UPDATE DML statement limit.

Increase the BigQuery UPDATE DML statement limit in the Quota management section of the Google Cloud Platform Console.

Split the source CSV file into smaller CSV files in Cloud Storage to reduce
the number of BigQuery UPDATE DML statements per BigQuery job.

Import the new records from the CSV file into a new BigQuery table. Create a BigQuery job that merges the new records with the existing records and writes the results to a new BigQuery table. (Correct)

Explanation
A is incorrect as reducing the number of UPDATE statements does not solve the business problem; the requirement is to keep the records up to date.

B is incorrect as the DML statement limit is not a quota that can be raised from the Quota management section of the console.

C is incorrect as splitting the CSV into smaller files only adds manual work without avoiding the DML limits.

D is correct as Google suggests using a single MERGE statement to reconcile a staging table with the existing table in one job.

Links:
https://cloud.google.com/blog/products/bigquery/performing-large-scale-
mutations-in-bigquery

https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-
syntax#merge_statement
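
A sketch of the MERGE approach in option D, assuming the CSV has already been loaded into a staging table; all project, dataset, table, and column names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `my-project.marketing.customers` AS target
USING `my-project.marketing.customer_updates` AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.segment = source.segment, target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, segment, updated_at)
  VALUES (source.customer_id, source.segment, source.updated_at)
"""
# One DML job applies all 1 million updates instead of many UPDATE statements.
client.query(merge_sql).result()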

Question 11:
As your organization expands its usage of GCP, many teams have started
to create their own projects. Projects are further multiplied to
accommodate different stages of deployments and target audiences.
Each project requires unique access control configurations. The central
IT team needs to have access to all projects. Furthermore, data from
Cloud Storage buckets and BigQuery datasets must be shared for use in
other projects in an ad hoc way. You want to simplify access control
management by minimizing the number of policies. Which two steps
should you take? (Choose two.)

Use Cloud Deployment Manager to automate access provision.

Introduce resource hierarchy to leverage access control policy inheritance. (Correct)

Create distinct groups for various teams, and specify groups in Cloud IAM policies.

Only use service accounts when sharing data for Cloud Storage buckets and BigQuery datasets. (Correct)

For each Cloud Storage bucket or BigQuery dataset, decide which projects need access. Find all the active members who have access to these projects, and create a Cloud IAM policy to grant access to all these users.

Explanation
A is incorrect as Deployment Manager is an infrastructure-as-code service for provisioning resources, not for managing access.

B is correct as Google's suggested approach is to use the resource hierarchy so that access control policies are inherited, minimizing the number of policies.

C is incorrect as this is an insecure way to give access to users.

D is correct as a service account is the best way to share these resources across projects in an ad hoc way.

E is incorrect as this approach involves a manual workload.

Links:
https://cloud.google.com/docs/enterprise/best-practices-for-enterprise-
organizations

Question 12:
Your United States-based company has created an application for assessing
and responding to user actions. The primary table's data volume grows by
250,000 records per second. Many third parties use your application's APIs to
build the functionality into their own frontend applications. Your application's
APIs should comply with the following requirements:

- Single global endpoint
- ANSI SQL support
- Consistent access to the most up-to-date data

What should you do?

Implement BigQuery with no region selected for storage or processing.

Implement Cloud Spanner with the leader in North America and read-only replicas in Asia and Europe. (Correct)

Implement Cloud SQL for PostgreSQL with the master in North America and read replicas in Asia and Europe.

Implement Cloud Bigtable with the primary cluster in North America and
secondary clusters in Asia and Europe.

Explanation
A is incorrect as BigQuery is appropriate for analytical workloads, not for serving an application.

B is correct as Spanner meets all the requirements: it supports ANSI SQL, offers strong consistency, and provides a single globally available endpoint.

C is incorrect as Cloud SQL cannot provide a single global endpoint or keep up with 250,000 records ingested per second.

D is incorrect as Cloud Bigtable does not support SQL.

Links:
https://cloud.google.com/spanner/#section-2

Question 13:
A data scientist has created a BigQuery ML model and asks you to create an
ML pipeline to serve predictions. You have a REST API application with the
requirement to serve predictions for an individual user ID with latency under
100 milliseconds. You use the following query to generate predictions:

SELECT predicted_label, user_id
FROM ML.PREDICT(MODEL 'dataset.model', TABLE user_features)

How should you create the ML pipeline?

Add a WHERE clause to the query, and grant the BigQuery Data Viewer
role to the application service account.

Create an Authorized View with the provided query. Share the dataset that
contains the view with the application service account.

Create a Cloud Dataflow pipeline using BigQueryIO to read results from the query. Grant the Dataflow Worker role to the application service account.

Create a Cloud Dataflow pipeline using BigQueryIO to read predictions for all users from the query. Write the results to Cloud Bigtable using BigtableIO. Grant the Bigtable Reader role to the application service account so that the application can read predictions for individual users from Cloud Bigtable. (Correct)

Explanation
A is incorrect as adding a WHERE clause and granting the BigQuery Data Viewer role still runs an interactive BigQuery query per request, which incurs query cost and cannot guarantee sub-100 ms latency.

B is incorrect for the same reason: reading from an authorized view is still a BigQuery query and does not deliver 100 ms responses.

C is incorrect as invoking a Dataflow job per request would be costly and also cannot guarantee a 100 ms response.

D is correct as Bigtable is a NoSQL store optimized for low-latency key-based reads, so precomputing predictions and serving individual user lookups from Bigtable meets the latency requirement.

Links:
https://cloud.google.com/blog/products/databases/getting-started-with-time-
series-trend-predictions-using-gcp

https://cloud.google.com/bigquery-ml/docs/exporting-models
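
To illustrate the serving side of option D, a sketch of a low-latency lookup of a precomputed prediction from Bigtable; the instance, table, column family, and row-key scheme are hypothetical.

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("predictions-instance").table("user_predictions")

# Row key = user ID, written earlier by the batch Dataflow job.
row = table.read_row(b"user-12345")
if row is not None:
    cell = row.cells["predictions"][b"predicted_label"][0]
    print("Predicted label:", cell.value.decode("utf-8"))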

Question 14:
You are building an application to share financial market data with consumers,
who will receive data feeds. Data is collected from the markets in real-time.

Consumers will receive the data in the following ways:


- Real-time event stream
- ANSI SQL access to real-time stream and historical data
- Batch historical exports

Which solution should you use?

Cloud Dataflow, Cloud SQL, Cloud Spanner

Cloud Pub/Sub, Cloud Storage, BigQuery (Correct)

Cloud Dataproc, Cloud Dataflow, BigQuery

Cloud Pub/Sub, Cloud Dataproc, Cloud SQL

Explanation
A is incorrect as services don’t match with the stated requirements.

B is correct as the requirements map directly to the services: real-time event stream --> Cloud Pub/Sub, ANSI SQL access --> BigQuery, and batch historical exports --> Cloud Storage.

C is incorrect as services don’t match with the stated requirements.

D is incorrect as services don’t match with the stated requirements.

Links:
https://cloud.google.com/architecture/processing-logs-at-scale-using-
dataflow

Question 15:
You are building a new application that you need to collect data from in a
scalable way. Data arrives continuously from the application throughout the
day, and you expect to generate approximately 150 GB of JSON data per day
by the end of the year. Your requirements are:

- Decoupling producer from consumer
- Space and cost-efficient storage of the raw ingested data, which is to be stored indefinitely
- Near real-time SQL query
- Maintain at least 2 years of historical data, which will be queried with SQL

Which pipeline should you use to meet these requirements?

Create an application that provides an API. Write a tool to poll the API and
write data to Cloud Storage as gzipped JSON files.

Create an application that writes to a Cloud SQL database to store the data. Set up periodic exports of the database to write to Cloud Storage and load into BigQuery.

Create an application that publishes events to Cloud Pub/Sub, and create Spark jobs on Cloud Dataproc to convert the JSON data to Avro format, stored on HDFS on Persistent Disk.

Create an application that publishes events to Cloud Pub/Sub, and create a Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Cloud Storage and BigQuery. (Correct)

Explanation
A is incorrect as polling an API and writing directly to Cloud Storage does not decouple producer and consumer, and gzipped JSON files do not give near real-time SQL access.

B is incorrect as Cloud SQL is not meant for indefinitely archiving raw data, periodic exports require extra scripting, and near real-time SQL queries are not satisfied by batch loads into BigQuery.

C is incorrect as HDFS on persistent disks is not cost-efficient for indefinite storage, and Dataproc does not provide near real-time SQL queries over the data.

D is correct as the requirements map as follows:

1. Decoupling producer from consumer -- Cloud Pub/Sub fulfills the requirement.
2. Space and cost-efficient storage of the raw ingested data, stored indefinitely -- the Cloud Storage bucket (with Avro) fulfills the requirement.
3. Near real-time SQL query -- BigQuery fulfills the requirement.
4. Maintain at least 2 years of historical data, queried with SQL -- federated queries can read the data stored in the Cloud Storage bucket.

Question 16:
You are running a pipeline in Cloud Dataflow that receives messages from a
Cloud Pub/Sub topic and writes the results to a BigQuery dataset in the EU.
Currently, your pipeline is located in europe-west4 and has a maximum of 3 workers of instance type n1-standard-1. You notice that during peak
periods, your pipeline is struggling to process records in a timely fashion
when all 3 workers are at maximum CPU utilization. Which two actions can
you take to increase the performance of your pipeline? (Choose two.)

Increase the number of max workers (Correct)

Use a larger instance type for your Cloud Dataflow workers (Correct)

Change the zone of your Cloud Dataflow pipeline to run in us-central1

Create a temporary table in Cloud Bigtable that will act as a buffer for new
data. Create a new step in your pipeline to write to this table first, and
then create a new pipeline to write from Cloud Bigtable to BigQuery

Create a temporary table in Cloud Spanner that will act as a buffer for new
data. Create a new step in your pipeline to write to this table first, and
then create a new pipeline to write from Cloud Spanner to BigQuery

Explanation
A is correct as raising the maximum number of workers lets autoscaling add capacity beyond the three workers that are currently saturated.

B is correct as a larger instance type gives each worker more CPU, increasing the pipeline's processing throughput.

C is incorrect as changing the zone to us-central1 does not add capacity and moves processing away from the data in the EU.

D is incorrect as a temporary Bigtable buffer adds overhead and a second pipeline instead of solving the CPU bottleneck.

E is incorrect as creating two pipelines will not solve the problem instead it’ll
increase the workload.

Links:
https://cloud.google.com/bigtable/docs/performance

https://cloud.google.com/dataflow/docs/guides/setting-pipeline-options

https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#workers

https://cloud.google.com/dataflow/docs/guides/deploying-a-
pipeline#locations
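
A sketch of the relevant Dataflow pipeline options for the two correct actions, raising the autoscaling ceiling and choosing a larger worker machine type; the project, bucket, and machine type values are hypothetical.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=europe-west4",
    "--temp_location=gs://my-dataflow-temp/tmp",
    "--streaming",
    "--max_num_workers=10",                 # raise the autoscaling ceiling
    "--worker_machine_type=n1-standard-4",  # larger instance type per worker
])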

Question 17:
You have a data pipeline with a Cloud Dataflow job that aggregates and writes
time-series metrics to Cloud Bigtable. This data feeds a dashboard used by
thousands of users across the organization. You need to support additional
concurrent users and reduce the amount of time required to write the data.
Which two actions should you take? (Choose two.)

Configure your Cloud Dataflow pipeline to use local execution

Increase the maximum number of Cloud Dataflow workers by setting maxNumWorkers in PipelineOptions (Correct)

Increase the number of nodes in the Cloud Bigtable cluster (Correct)

Modify your Cloud Dataflow pipeline to use the Flatten transform before
writing to Cloud Bigtable

Modify your Cloud Dataflow pipeline to use the CoGroupByKey transform before writing to Cloud Bigtable

Explanation
A is incorrect as local execution runs the pipeline on a single machine and cannot scale to the required throughput.

B is correct as maxNumWorkers sets the maximum number of Compute Engine instances made available to the pipeline during execution, allowing the job to scale up and write the data faster.

C is correct as a Bigtable cluster distributes the write load across its nodes, so adding nodes increases write throughput and supports more concurrent readers.

D is incorrect as the Flatten transform merely merges PCollections and does not improve write throughput.

E is incorrect as CoGroupByKey is a join/grouping transform and is not meant to increase write throughput.

Links:
https://cloud.google.com/bigtable/docs/performance#performance-write-
throughput

https://cloud.google.com/dataflow/docs/guides/specifying-exec-
params#setting-other-cloud-pipeline-options

Question 18:
You have several Spark jobs that run on a Cloud Dataproc cluster on a
schedule. Some of the jobs run in sequence, and some of the jobs run
concurrently. You need to automate this process. What should you do?

Create a Cloud Dataproc Workflow Template (Correct)

Create an initialization action to execute the jobs

Create a Directed Acyclic Graph in Cloud Composer

Create a Bash script that uses the Cloud SDK to create a cluster, execute
jobs, and then tear down the cluster

Explanation
A is correct as all the jobs run on Dataproc, so a Dataproc Workflow Template is sufficient to schedule and order them.

B is incorrect as initialization actions run when cluster VMs are created and are meant for setting up the environment, not for orchestrating jobs.

C is incorrect as Cloud Composer is for orchestrating pipelines that span different services; since all the jobs are on Dataproc, a workflow template is the better fit.

D is incorrect as a bash script is a manual effort and is not a managed way to orchestrate Dataproc jobs.

Links:
https://cloud.google.com/dataproc/docs/concepts/workflows/using-workflows

Question 19:
You are building a new data pipeline to share data between two different
types of applications: job generators and job runners. Your solution
must scale to accommodate increases in usage and must accommodate
the addition of new applications without negatively affecting the
performance of existing ones. What should you do?

Create an API using App Engine to receive and send messages to the
applications

Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them (Correct)

Create a table on Cloud SQL, and insert and delete rows with the job
information

Create a table on Cloud Spanner, and insert and delete rows with the job
information

Explanation
A is incorrect as building a custom API on App Engine requires extra development effort and does not decouple the applications.

B is correct as Pub/Sub fits this purpose: job generators publish to the topic, and each job runner consumes from its own subscription without affecting existing applications.

C is incorrect as Cloud SQL cannot ensure decoupling of the jobs.

D is incorrect as Cloud Spanner cannot ensure decoupling of the jobs.

Links:
https://cloud.google.com/pubsub/docs/publisher
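
A minimal sketch of a job runner pulling from its own subscription on the shared jobs topic, as in option B; the project and subscription names are hypothetical.

from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "job-runner-sub")

def callback(message):
    print("Running job:", message.data.decode("utf-8"))
    message.ack()  # acknowledge so the job is not redelivered

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result(timeout=60)  # process jobs for one minute
    except TimeoutError:
        streaming_pull_future.cancel()

Adding a new runner application only requires creating another subscription; existing subscribers are unaffected.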

Question 20:
You need to create a new transaction table in Cloud Spanner that stores
product sales data. You are deciding what to use as a primary key. From a
performance perspective, which strategy should you choose?

The current epoch time

A concatenation of the product name and the current epoch time

A random universally unique identifier number (version 4 UUID) (Correct)

The original order identification number from the sales system, which is a
monotonically increasing integer

Explanation
A is incorrect as the current epoch time is monotonically increasing and therefore creates hotspots.

B is incorrect as leading with the product name concentrates writes on popular products, and the epoch-time suffix does not remove the hotspotting risk.

C is correct as a random version 4 UUID avoids hotspots by distributing writes evenly across the key space; it is the best option from a performance standpoint.

D is incorrect as a monotonically increasing integer leads to hotspots and ultimately lower performance.

Links:
https://cloud.google.com/spanner/docs/schema-and-data-model

Notes:
The primary key uniquely identifies each row in a table. If you want to update
or delete existing rows in a table, then the table must have a primary key
composed of one or more columns. (A table with no primary key columns can
have only one row.) Often your application already has a field that's a natural
fit for use as the primary key.

For example, in the Customers table example above, there might be an application-supplied CustomerId that serves well as the primary key. In other cases, you may need to generate a primary key when inserting the row, like a unique INT64 value that you generate.
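
A sketch of inserting a row keyed by a random version 4 UUID, as in option C; the instance, database, table, and column names are hypothetical.

import datetime
import uuid
from google.cloud import spanner

client = spanner.Client(project="my-project")
database = client.instance("sales-instance").database("sales-db")

with database.batch() as batch:
    batch.insert(
        table="ProductSales",
        columns=("SaleId", "ProductName", "SaleTimestamp"),
        values=[(
            str(uuid.uuid4()),  # random key spreads writes across splits
            "widget",
            datetime.datetime.now(datetime.timezone.utc),
        )],
    )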

Question 21:
Data Analysts in your company have the Cloud IAM Owner role assigned
to them in their projects to allow them to work with multiple GCP
products in their projects. Your organization requires that all BigQuery
data access logs be retained for 6 months. You need to ensure that only
audit personnel in your company can access the data access logs for all
projects. What should you do?

Enable data access logs in each Data Analyst's project. Restrict access to
Stackdriver Logging via Cloud IAM roles.

Export the data access logs via a project-level export sink to a Cloud
Storage bucket in the Data Analysts' projects. Restrict access to the
Cloud Storage bucket.

Export the data access logs via a project-level export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project with the exported logs.

Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs. (Correct)

Explanation
A is incorrect as the Data Analysts already have the Owner role in their projects, so they could still read the logs in Stackdriver Logging.

B is incorrect as the Data Analysts already have the Owner role in their projects and would therefore retain access to buckets in those projects.

C is incorrect as without an aggregated sink the export must be configured per project, which is time-consuming.

D is correct as an aggregated log sink creates a single sink for all projects in the organization; the destination can be a Cloud Storage bucket, Pub/Sub topic, BigQuery table, or Cloud Logging bucket. Without an aggregated sink, this would have to be configured for each project individually, which is cumbersome.

Links:
https://cloud.google.com/logging/docs/export/aggregated_sinks

https://cloud.google.com/iam/docs/job-
functions/auditing#scenario_operational_monitoring

Question 22:
Each analytics team in your organization is running BigQuery jobs in their
own projects. You want to enable each team to monitor slot usage within
their projects. What should you do?

Create a Stackdriver Monitoring dashboard based on the BigQuery metric query/scanned_bytes

Create a Stackdriver Monitoring dashboard based on the BigQuery metric slots/allocated_for_project (Correct)

Create a log export for each project, capture the BigQuery job execution
logs, create a custom metric based on the totalSlotMs, and create a
Stackdriver Monitoring dashboard based on the custom metric

Create an aggregated log export at the organization level, capture the BigQuery job execution logs, create a custom metric based on the totalSlotMs, and create a Stackdriver Monitoring dashboard based on the custom metric

Explanation
A is incorrect as the metric query/scanned_bytes measures the bytes scanned by queries, not slot usage.

B is correct as the metric slots/allocated_for_project reports slot allocation per project, which is exactly what each team needs since all analysts run queries in their own projects.

C is incorrect as totalSlotMs reports per-job slot consumption rather than a per-project metric, and it requires building a custom metric from exported logs. For example, a 20-second query continuously consuming 4 slots uses 80,000 totalSlotMs (4 * 20,000).

D is incorrect for the same reason: totalSlotMs is a per-job value, and an organization-level export is unnecessary for per-project monitoring.

Links:
https://cloud.google.com/bigquery/docs/monitoring

https://cloud.google.com/monitoring/api/metrics_gcp

https://cloud.google.com/bigquery/docs/monitoring-dashboard#metrics

https://cloud.google.com/monitoring/api/metrics_gcp#gcp-bigquery

Question 23:
You are operating a streaming Cloud Dataflow pipeline. Your engineers
have a new version of the pipeline with a different windowing algorithm
and triggering strategy. You want to update the running pipeline with the
new version. You want to ensure that no data is lost during the update.
What should you do?

Update the Cloud Dataflow pipeline in-flight by passing the --update option with the --jobName set to the existing job name (Correct)

Update the Cloud Dataflow pipeline in-flight by passing the --update option with the --jobName set to a new unique job name

Stop the Cloud Dataflow pipeline with the Cancel option. Create a new
Cloud Dataflow job with the updated code

Stop the Cloud Dataflow pipeline with the Drain option. Create a new
Cloud Dataflow job with the updated code

Explanation
A is correct as this is the supported way to update a running Dataflow pipeline in place without losing data.

B is incorrect as setting a new job name launches a separate pipeline while the old one keeps running.

C is incorrect as cancelling the pipeline discards in-flight data.

D is incorrect: although Drain avoids losing in-flight data when stopping a pipeline, the requirement is to update the running pipeline, which the --update option handles directly.

Links:
https://cloud.google.com/dataflow/docs/guides/updating-a-
pipeline#Launching

https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline

Notes:
To update your job, you'll need to launch a new job to replace the ongoing job.
When you launch your replacement job, you'll need to set the following
pipeline options to perform the update process in addition to the job's regular
options:
- Pass the --update option.
- Set the --job_name option in PipelineOptions to the same name as the job
you want to update.
- If any transform names in your pipeline have changed, you must supply a
transform mapping and pass it using the --transform_name_mapping option.
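
A sketch of the pipeline options used when launching the replacement job, following the steps above; the project, bucket, and job name are hypothetical, and the job name must exactly match the running job.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=europe-west4",
    "--temp_location=gs://my-dataflow-temp/tmp",
    "--streaming",
    "--update",                        # replace the running job in place
    "--job_name=iot-events-pipeline",  # same name as the job being updated
])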

Question 24:
You need to move 2 PB of historical data from an on-premises storage
appliance to Cloud Storage within six months, and your outbound
network capacity is constrained to 20 Mb/sec. How should you migrate
this data to Cloud Storage?

Use Transfer Appliance to copy the data to Cloud Storage (Correct)

Use gsutil cp -J to compress the content being uploaded to Cloud Storage

Create a private URL for the historical data, and then use Storage Transfer
Service to copy the data to Cloud Storage

Use trickle or ionice along with gsutil cp to limit the amount of bandwidth
gsutil utilizes to less than 20 Mb/sec so it does not interfere with the
production traffic

Explanation
A is correct as Transfer Appliance is designed for huge offline transfers, such as multiple petabytes, when network capacity is limited.

B is incorrect as compressing the upload with gsutil does not overcome the 20 Mb/sec network constraint; 2 PB cannot be moved online within six months.

C is incorrect as Storage Transfer Service still moves the data over the network (it is mainly used for other cloud providers' buckets or HTTP sources), so the slow outbound link makes it far too slow.

D is incorrect as throttling gsutil below 20 Mb/sec makes the transfer even slower and still cannot move 2 PB within six months.

Links:
https://cloud.google.com/transfer-appliance/docs/4.0
https://cloud.google.com/storage-transfer-service
Question 25:
You receive data files in CSV format monthly from a third party. You need to
cleanse this data, but every third month the schema of the files changes. Your
requirements for implementing these transformations include:

- Executing the transformations on a schedule
- Enabling non-developer analysts to modify transformations
- Providing a graphical tool for designing transformations

What should you do?

Use Cloud Dataprep to build and maintain the transformation recipes, and execute them on a scheduled basis (Correct)

Load each month's CSV data into BigQuery, and write a SQL query to
transform the data to a standard schema. Merge the transformed tables
together with a SQL query

Help the analysts write a Cloud Dataflow pipeline in Python to perform the
transformation. The Python code should be stored in a revision control
system and modified as the incoming data's schema changes

Use Apache Spark on Cloud Dataproc to infer the schema of the CSV file
before creating a Dataframe. Then implement the transformations in
Spark SQL before writing the data out to Cloud Storage and loading into
BigQuery

Explanation
A is correct as Cloud Dataprep is aimed at analysts rather than developers: it provides a graphical, drag-and-drop interface for building transformation recipes and can run them on a schedule.

B is incorrect as maintaining SQL transformations is not practical for non-developer analysts, and it provides no graphical design tool.

C is incorrect as modifying Python Dataflow code is not feasible for non-developer analysts.

D is incorrect as Spark SQL on Dataproc is even more developer-oriented and offers no graphical tool, so it is not suitable for this use case.

Links:
https://cloud.google.com/dataprep/docs/quickstarts/quickstart-dataprep

https://docs.trifacta.com/display/DP/Getting+Started+with+Cloud+Dataprep

https://docs.trifacta.com/display/DP/Overview+of+RapidTarget

Question 26:
You want to migrate an on-premises Hadoop system to Cloud Dataproc.
Hive is the primary tool in use, and the data format is Optimized Row
Columnar (ORC). All ORC files have been successfully copied to a Cloud
Storage bucket. You need to replicate some data to the cluster's local
Hadoop Distributed File System (HDFS) to maximize performance. What
are two ways to start using Hive in Cloud Dataproc? (Choose two.)

Run the gsutil utility to transfer all ORC files from the Cloud Storage
bucket to HDFS. Mount the Hive tables locally.

Run the gsutil utility to transfer all ORC files from the Cloud Storage
bucket to any node of the Dataproc cluster. Mount the Hive tables locally.

Run the gsutil utility to transfer all ORC files from the Cloud Storage
bucket to the master node of the Dataproc cluster. Then run the Hadoop
utility to copy them to HDFS. Mount the Hive tables from HDFS.

Leverage Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate external Hive tables to the native ones. (Correct)

Load the ORC files into BigQuery. Leverage BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables. Replicate external Hive tables to the native ones. (Correct)

Explanation
A is incorrect as gsutil cannot write directly to HDFS.

B is incorrect as copying the files to a single node's local disk does not place them in HDFS, so Hive cannot use them as distributed tables.

C is incorrect as manually staging the files on the master node's disk and then copying them into HDFS is an unnecessary, non-scalable extra hop compared with using the connectors.

D is correct as the Cloud Storage connector lets Hive read the ORC files in GCS as external tables, which can then be replicated into native tables on HDFS.

E is correct as the BigQuery connector for Hadoop lets Hive work with the BigQuery tables as external tables, which can likewise be replicated to native ones.

Links:
https://mrjob.readthedocs.io/en/latest/guides/dataproc-quickstart.html

https://cloud.google.com/dataproc/docs/tutorials/gcs-connector-spark-
tutorial

https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-orc

https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery

https://cloud.google.com/bigquery/docs/samples/bigquery-create-table-
external-hivepartitioned

Question 27:
You are implementing several batch jobs that must be executed on a
schedule. These jobs have many interdependent steps that must be
executed in a specific order. Portions of the jobs involve executing shell
scripts, running Hadoop jobs, and running queries in BigQuery. The jobs
are expected to run for many minutes up to several hours. If the steps
fail, they must be retried a fixed number of times. Which service should
you use to manage the execution of these jobs?

Cloud Scheduler

Cloud Dataflow

Cloud Functions

Cloud Composer (Correct)

Explanation
A is incorrect as Cloud Scheduler is essentially a managed crontab service; it cannot express dependencies between steps or handle retries across a multi-step workflow.

B is incorrect as Cloud Dataflow runs data processing pipelines; it is not an orchestrator for shell scripts, Hadoop jobs, and BigQuery queries.

C is incorrect as Cloud Functions is a serverless service for executing short-lived code, not long-running, interdependent batch jobs.

D is correct as Cloud Composer is built to orchestrate workflows with dependencies, scheduling, and configurable retries across different services.

Question 28:
You work for a shipping company that has distribution centers where
packages move on delivery lines to route them properly. The company
wants to add cameras to the delivery lines to detect and track any visual
damage to the packages in transit. You need to create a way to automate
the detection of damaged packages and flag them for human review in
real time while the packages are in transit. Which solution should you
choose?

Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.

Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications. (Correct)

Use the Cloud Vision API to detect for damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.

Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.

Explanation
A is incorrect as BigQuery ML cannot be trained on image data.

B is correct as AutoML can train a custom image model on the labeled dataset and expose it behind an API endpoint for real-time use.

C is incorrect as the pre-trained Cloud Vision API detects generic objects and cannot recognize damage specific to your packages.

D is incorrect as training and operating a custom TensorFlow model takes far more effort when AutoML already covers the need, and a Datalab notebook is not a real-time solution.

Links:
https://cloud.google.com/vision/automl/object-detection/docs

Question 29:
You are migrating your data warehouse to BigQuery. You have migrated all of
your data into tables in a dataset. Multiple users from your organization will be
using the data. They should only see certain tables based on their team
membership. How should you set user permissions?

Assign the users/groups data viewer access at the table level for each
table

Create SQL views for each team in the same dataset in which the data
resides, and assign the users/groups data viewer access to the SQL views

Create authorized views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the authorized views (Correct)

Create authorized views for each team in datasets created for each team.
Assign the authorized views data viewer access to the dataset in which
the data resides. Assign the users/groups data viewer access to the
datasets in which the authorized views reside

Explanation
A is incorrect as BQ doesn’t provide table-level access.

B is incorrect as there is no such concept of SQL views.

C is correct as authorized views are a suitable option when we have separate


teams accessing the same dataset.

D is incorrect as there is no need to create separate datasets for different


teams as it will increase management workload.

Links:
https://cloud.google.com/bigquery/docs/authorized-views
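
A sketch of authorizing one team's view against the source dataset using the BigQuery client library; the project, dataset, and view names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

source_dataset = client.get_dataset("my-project.warehouse")   # holds the data
view = client.get_table("my-project.warehouse.team_a_view")   # the team's view

# Authorize the view to read the dataset's tables, then grant the team's
# group the BigQuery Data Viewer role on the view itself.
entries = list(source_dataset.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])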

Question 30:
You want to build a managed Hadoop system as your data lake. The data
transformation process is composed of a series of Hadoop jobs executed in
sequence. To accomplish the design of separating storage from compute, you
decided to use the Cloud Storage connector to store all input data, output
data, and intermediary data. However, you noticed that one Hadoop job runs
very slowly with Cloud Dataproc when compared with the on-premises bare-
metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows
that this particular Hadoop job is disk I/O intensive. You want to resolve the
issue. What should you do?

Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory

Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS (Correct)

Allocate more CPU cores of the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up

Allocate additional network interface card (NIC), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage

Explanation
A is incorrect as adding memory to the Dataproc VMs does not help much when the job is disk I/O intensive.

B is correct as keeping the intermediate data of this I/O-intensive job on native HDFS, with sufficient persistent disk allocated, avoids the overhead of going through Cloud Storage for intermediary writes.

C is incorrect as more CPU cores address compute power, not disk throughput.

D is incorrect as additional NICs and link aggregation target network bandwidth; they may reduce latency somewhat but do not solve a disk I/O bottleneck.

Links:
https://cloud.google.com/architecture/hadoop/hadoop-gcp-migration-jobs

Question 31:
You work for an advertising company, and you've developed a Spark ML
model to predict click-through rates at advertisement blocks. You've been
developing everything at your on-premises data center, and now your
company is migrating to Google Cloud. Your data center will be closing soon,
so a rapid lift-and-shift migration is necessary. However, the data you've been
using will be migrated to BigQuery. You periodically retrain your Spark ML
models, so you need to migrate existing training pipelines to Google Cloud.
What should you do?

Use Cloud ML Engine for training existing Spark ML models

Rewrite your models on TensorFlow, and start using Cloud ML Engine

Use Cloud Dataproc for training existing Spark ML models, but start reading data directly from BigQuery (Correct)

Spin up a Spark cluster on Compute Engine, and train Spark ML models on the data exported from BigQuery

Explanation
A is incorrect as Cloud ML Engine cannot run Spark ML models as-is; it is not a lift-and-shift option.

B is incorrect as rewriting the models in TensorFlow does not fit the tight migration timeline.

C is correct as Dataproc is a true lift-and-shift destination for the Spark ecosystem, and the BigQuery connector lets the existing training pipelines read data directly from BigQuery.

D is incorrect as running a self-managed Spark cluster on Compute Engine is neither a managed nor a scalable solution.

Links:
https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml

Question 32:
You work for a global shipping company. You want to train a model on 40
TB of data to predict which ships in each geographic region are likely to
cause delivery delays on any given day. The model will be based on
multiple attributes collected from multiple sources. Telemetry data,
including location in GeoJSON format, will be pulled from each ship and
loaded every hour. You want to have a dashboard that shows how many
and which ships are likely to cause delays within a region. You want to
use a storage solution that has native functionality for prediction and
geospatial processing. Which storage solution should you use?

BigQuery (Correct)

Cloud Bigtable

Cloud Datastore

Cloud SQL for PostgreSQL

Explanation
A is correct as BigQuery fulfills all the stated requirements: BigQuery ML provides native prediction and BigQuery GIS provides native geospatial processing.

B is incorrect as we cannot train the model natively on BigTable.

C is incorrect as we cannot train the model natively on Datastore.

D is incorrect as Cloud SQL is not meant for model training and data
warehousing. It’s a lift and shift solution for a relational database.

Links:
https://cloud.google.com/bigquery/docs/gis-intro

Question 33:
You operate an IoT pipeline built around Apache Kafka that normally
receives around 5000 messages per second. You want to use Google
Cloud Platform to create an alert as soon as the moving average over 1
hour drops below 4000 messages per second. What should you do?

Consume the stream of data in Cloud Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes, and send an alert if the average is less than 4000 messages. (Correct)

Consume the stream of data in Cloud Dataflow using Kafka IO. Set a fixed
time window of 1 hour. Compute the average when the window closes, and
send an alert if the average is less than 4000 messages.

Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub.
Use a Cloud Dataflow template to write your messages from Cloud
Pub/Sub to Cloud Bigtable. Use Cloud Scheduler to run a script every hour
that counts the number of rows created in Cloud Bigtable in the last hour.
If that number falls below 4000, send an alert.

Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub.
Use a Cloud Dataflow template to write your messages from Cloud
Pub/Sub to BigQuery. Use Cloud Scheduler to run a script every five
minutes that counts the number of rows created in BigQuery in the last
hour. If that number falls below 4000, send an alert.

Explanation
A is correct as Cloud Dataflow with KafkaIO can consume directly from the on-premises Kafka cluster, and a sliding window of 1 hour every 5 minutes produces the moving average needed for the alert.

B is incorrect as a fixed 1-hour window produces only hourly totals, not a moving average.

C is incorrect as it over-engineers the task; a simple Dataflow sliding window can compute the average without Bigtable and Cloud Scheduler.

D is incorrect as a script counting BigQuery rows every five minutes is an indirect, delayed approximation rather than a streaming moving average.

Links:
https://cloud.google.com/architecture/processing-messages-from-kafka-
hosted-outside-gcp
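
A minimal sketch of option A's windowing logic, assuming messages are read from the existing Kafka cluster with Beam's Kafka connector (which requires a Java expansion service at runtime); the broker address, topic name, and the print-based alert are hypothetical placeholders.

import apache_beam as beam
from apache_beam import window
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

def alert_if_low(rate):
    # In a real pipeline this would publish to Pub/Sub or Cloud Monitoring.
    if rate < 4000:
        print(f"ALERT: moving average dropped to {rate:.0f} msg/s")
    return rate

with beam.Pipeline(options=PipelineOptions(["--streaming"])) as p:
    (
        p
        | "ReadKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka-broker:9092"},
            topics=["iot-messages"],
        )
        | "1hWindowEvery5min" >> beam.WindowInto(
            window.SlidingWindows(size=3600, period=300)
        )
        | "CountPerWindow" >> beam.CombineGlobally(
            beam.combiners.CountCombineFn()
        ).without_defaults()
        | "ToMsgPerSecond" >> beam.Map(lambda count: count / 3600.0)
        | "CheckThreshold" >> beam.Map(alert_if_low)
    )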

Question 34:
You plan to deploy Cloud SQL using MySQL. You need to ensure high
availability in the event of a zone failure. What should you do?

Create a Cloud SQL instance in one zone, and create a failover replica in
another zone within the same region.

Create a Cloud SQL instance in one zone, and create a read replica in another zone within the same region. (Correct)

Create a Cloud SQL instance in one zone, and configure an external read replica in a zone in a different region.

Create a Cloud SQL instance in a region, and configure automatic backup to a Cloud Storage bucket in the same region.

Explanation
A is incorrect as current Cloud SQL high availability no longer uses a separate failover-replica option.

B is correct as, with high availability enabled, placing a read replica in another zone of the same region keeps reads available if the primary's zone fails.

C is incorrect as external read replicas in another region are not a native high-availability mechanism; the data would have to be replicated to the external server manually.

D is incorrect as restoring from backups would disrupt the service in case of a zone failure.

Links:
https://cloud.google.com/sql/docs/mysql/high-availability
https://cloud.google.com/sql/docs/mysql/replication#read-replicas

Notes:
As a best practice, put read replicas in a different zone than the primary
instance when you use HA on your primary instance. This practice ensures
that read replicas continue to operate when the zone that contains the
primary instance has an outage.

Question 35:
Your company is selecting a system to centralize data ingestion and delivery.
You are considering messaging and data integration systems to address the
requirements. The key requirements are:

- The ability to seek to a particular offset in a topic, possibly back to the start of all data ever captured
- Support for publish/subscribe semantics on hundreds of topics
- Retain per-key ordering

Which system should you choose?

Apache Kafka (Correct)

Cloud Storage

Cloud Pub/Sub

Firebase Cloud Messaging

Explanation
A is correct as Apache Kafka fulfills all the stated requirements: consumers can seek to any retained offset in a topic (back to the start of the log if retention allows), it supports publish/subscribe across hundreds of topics, and it preserves per-key ordering within a partition.

B is incorrect as Cloud Storage is blob storage, not a messaging service.

C is incorrect as Cloud Pub/Sub retains messages for at most 7 days, so you cannot seek back to the start of all data ever captured.

D is incorrect as Firebase Cloud Messaging is a push-notification service for mobile and web applications, not a data-integration messaging system, and it does not meet the stated requirements.

Links:
https://kafka.apache.org/081/documentation.html
https://cloud.google.com/pubsub/quotas#resource_limits
https://firebase.google.com/products/cloud-messaging
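
Example (Python sketch):
A short kafka-python sketch of the offset-seek requirement; the broker address, topic, partition, and consumer group are assumptions.

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="kafka-broker:9092",  # assumed broker
    group_id="replay-consumer",             # assumed group
    enable_auto_commit=False,
)

partition = TopicPartition("device-events", 0)  # assumed topic/partition
consumer.assign([partition])

# Rewind to the start of all retained data for the partition...
consumer.seek_to_beginning(partition)
# ...or jump to an arbitrary offset instead:
# consumer.seek(partition, 12345)

for record in consumer:
    print(record.offset, record.key, record.value)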

Question 36:
You are planning to migrate your current on-premises Apache Hadoop
deployment to the cloud. You need to ensure that the deployment is as
fault-tolerant and cost-effective as possible for long-running batch jobs.
You want to use a managed service. What should you do?

Deploy a Cloud Dataproc cluster. Use a standard persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs:// (Correct)

Deploy a Cloud Dataproc cluster. Use an SSD persistent disk and 50%
preemptible workers. Store data in Cloud Storage, and change references
in scripts from hdfs:// to gs://

Install Hadoop and Spark on a 10-node Compute Engine instance group with standard instances. Install the Cloud Storage connector, and store the data in Cloud Storage. Change references in scripts from hdfs:// to gs://

Install Hadoop and Spark on a 10-node Compute Engine instance group with preemptible instances. Store data in HDFS. Change references in scripts from hdfs:// to gs://

Explanation
A is correct as it is best practice to store the data in Cloud Storage (gs://) instead of HDFS and to use standard persistent disks with up to 50% preemptible workers, which keeps the managed cluster both fault-tolerant and cost-effective.

B is incorrect as SSD persistent disks cost more than standard persistent disks, so they are not the most cost-effective choice.

C is incorrect as self-managing Hadoop and Spark on a Compute Engine instance group is not a managed service, adds operational overhead, and does not guarantee fault tolerance.

D is incorrect for the same reasons, and storing data in HDFS on preemptible instances risks data loss whenever instances are reclaimed.

Links:
https://cloud.google.com/bigtable/docs/choosing-ssd-hdd
https://cloud.google.com/blog/products/data-analytics/optimize-dataproc-
costs-using-vm-machine-type
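
Example (Python sketch):
A hedged sketch of creating a Dataproc cluster with standard persistent disks and preemptible secondary workers using the google-cloud-dataproc client; the project, region, cluster name, and machine types are assumptions. Jobs submitted to such a cluster would then read from and write to gs:// paths instead of hdfs://.

from google.cloud import dataproc_v1

project_id = "my-project"   # assumed
region = "us-central1"      # assumed

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "batch-hadoop",  # assumed
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # Standard persistent disks keep long-running batch jobs cost-effective.
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-4",
            "disk_config": {"boot_disk_type": "pd-standard"},
        },
        # Roughly 50% of the workers as preemptible secondary workers.
        "secondary_worker_config": {"num_instances": 2, "is_preemptible": True},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print("Created cluster:", operation.result().cluster_name)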

Question 37:
Your team is working on a binary classification problem. You have trained a
support vector machine (SVM) classifier with default parameters and received
an area under the curve (AUC) of 0.87 on the validation set. You want to
increase the AUC of the model. What should you do?

Perform hyperparameter tuning

Train a classifier with deep neural networks, because neural networks would always beat SVMs

Deploy the model and measure the real-world AUC; it's always higher
because of generalization

Scale predictions you get out of the model (tune a scaling factor as a hyperparameter) in order to get the highest AUC (Correct)

Explanation
A is incorrect as hyperparameter tuning helps you search for a better model configuration but does not necessarily increase the AUC score.

B is incorrect as deep neural networks do not always beat SVMs; they are typically chosen when much more data is available, not as a guaranteed way to raise AUC.

C is incorrect as performance on the validation set is already the estimate of real-world AUC; deploying the model does not automatically yield a higher AUC through generalization.

D is correct as scaling the predictions produced by the model, with the scaling factor tuned as a hyperparameter, is the approach intended here to obtain the highest AUC.

Links:
https://developers.google.com/machine-learning/crash-
course/classification/roc-and-auc?hl=en

Notes:
One way of interpreting AUC is as the probability that the model ranks a
random positive example more highly than a random negative example.
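
Example (Python sketch):
For reference, a minimal scikit-learn sketch of how validation AUC is measured for an SVM trained with default parameters; the dataset here is synthetic and purely illustrative.

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary-classification data standing in for the real problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC()  # default SVM parameters, as in the question
clf.fit(X_train, y_train)

# Use the decision scores to compute AUC on the validation set.
val_scores = clf.decision_function(X_val)
print("Validation AUC:", roc_auc_score(y_val, val_scores))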

Question 38:
You need to deploy additional dependencies to all nodes of a Cloud Dataproc
cluster at startup using an existing initialization action. Company
security policies require that Cloud Dataproc nodes do not have access
to the Internet so public initialization actions cannot fetch resources.
What should you do?

Deploy the Cloud SQL Proxy on the Cloud Dataproc master

Use an SSH tunnel to give the Cloud Dataproc cluster access to the
Internet

Copy all dependencies to a Cloud Storage bucket within your VPC security perimeter (Correct)

Use Resource Manager to add the service account used by the Cloud
Dataproc cluster to the Network User role

Explanation
A is incorrect as the Cloud SQL Proxy only brokers connections to Cloud SQL; it cannot serve the cluster's dependencies.

B is incorrect as an SSH tunnel would still give the Dataproc cluster internet access, violating the security policy.

C is correct as a Cloud Storage bucket inside your VPC security perimeter can hold the dependencies, and the initialization action can copy them from there without internet access.

D is incorrect as the Network User role only grants use of a shared VPC network; you still need somewhere inside the perimeter to store the dependencies.

Links:
https://cloud.google.com/dataproc/docs/concepts/configuring-
clusters/network#create_a_cloud_dataproc_cluster_with_internal_ip_address
_only

https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-
actions
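
Example (Python sketch):
A hedged google-cloud-storage sketch of staging the dependencies in a bucket inside the VPC security perimeter so the initialization action can copy them without internet access; the bucket name and file names are assumptions. The existing initialization action would then reference the resulting gs:// paths instead of public URLs.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-perimeter-init-bucket")  # assumed bucket inside the perimeter

for dependency in ("connectors.jar", "python-deps.tar.gz"):  # assumed local files
    blob = bucket.blob(f"dataproc/init/{dependency}")
    blob.upload_from_filename(dependency)
    print(f"Uploaded gs://{bucket.name}/{blob.name}")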

Question 39:
You need to choose a database for a new project that has the following
requirements:

- Fully managed
- Able to automatically scale-up
- Transactionally consistent
- Able to scale up to 6 TB
- Able to be queried using SQL

Which database do you choose?

Cloud SQL (Correct)

Cloud Bigtable

Cloud Spanner

Cloud Datastore

Explanation
A is correct as Cloud SQL meets all the requirements: it is fully managed, supports automatic storage increases, is transactionally consistent, scales well beyond 6 TB, and is queried with SQL.

B is incorrect as Cloud Bigtable is not transactionally consistent (it only offers single-row transactions) and is not queried with SQL.

C is incorrect as Cloud Spanner is generally not recommended, or cost-effective, for databases of this size; Cloud SQL already covers the 6 TB requirement.

D is incorrect as Cloud Datastore is a NoSQL document database and does not support querying with standard SQL.

Links:
https://cloud.google.com/sql

Question 40:
You work for a mid-sized enterprise that needs to move its operational
system transaction data from an on-premises database to GCP. The
database is about 20 TB in size. Which database should you choose?

Cloud SQL (Correct)

Cloud Bigtable

Cloud Spanner

Cloud Datastore

Explanation
A is correct as Cloud SQL can store up to 30 TB of data, which comfortably covers the 20 TB operational database.

B is incorrect as Cloud Bigtable is not transactionally consistent, making it a poor fit for operational transaction data.

C is incorrect as Cloud Spanner is designed for much larger, horizontally scaled workloads and costs more; Cloud SQL can already handle 30 TB of data.

D is incorrect as Cloud Datastore is a NoSQL document database and is not suited to relational transaction data.

Links:
https://cloud.google.com/sql/docs/quotas#storage_limits

Question 41:
You need to choose a database to store time-series CPU and memory usage
for millions of computers. You need to store this data in one-second interval
samples. Analysts will be performing real-time, ad hoc analytics against the
database. You want to avoid being charged for every query executed and
ensure that the schema design will allow for future growth of the dataset.
Which database and data model should you choose?

Create a table in BigQuery, and append the new samples for CPU and
memory to the table

Create a wide table in BigQuery, create a column for the sample value at
each second, and update the row with the interval for each second

Create a narrow table in Cloud Bigtable with a row key that combines the Compute Engine computer identifier with the sample time at each second (Correct)

Create a wide table in Cloud Bigtable with a row key that combines the
computer identifier with the sample time at each minute, and combine the
values for each second as column data.

Explanation
A is incorrect as BigQuery charges per query and its schema is less flexible for future growth, whereas Bigtable is a NoSQL database with an essentially schemaless design.

B is incorrect as a wide BigQuery table with one column per second is unmanageable, and the constant row updates would be slow and costly.

C is correct as Bigtable is a NoSQL database that accommodates growth of the dataset, and with an effective row-key design using the tall-and-narrow pattern you can run real-time, ad hoc analytics at minimal cost, with no per-query charges.

D is incorrect as the wide pattern is meant for rows that hold many related values; for simple time-series samples like these, the tall-and-narrow pattern is recommended.

Links:
https://cloud.google.com/bigtable/docs/schema-design-time-
series#patterns_for_row_key_design

Notes:
A tall and narrow table has a small number of events per row, which could be
just one event, whereas a short and wide table has a large number of events
per row. As explained in a moment, tall and narrow tables are best suited for
time-series data.

For time series, you should generally use tall and narrow tables. This is for two
reasons: Storing one event per row makes it easier to run queries against your
data. Storing many events per row makes it more likely that the total row size
will exceed the recommended maximum (see Rows can be big but are not
infinite).

BigQuery charges per query, whereas Bigtable charges for the provisioned infrastructure.
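
Example (Python sketch):
A hedged google-cloud-bigtable sketch of the tall-and-narrow pattern: one sample per row, with a row key combining the machine identifier and the sample timestamp. The project, instance, table, and column family names are assumptions.

import datetime

from google.cloud import bigtable

client = bigtable.Client(project="my-project")                          # assumed project
table = client.instance("metrics-instance").table("machine_metrics")    # assumed IDs

now = datetime.datetime.utcnow()
# One row per machine per second: <machine id>#<timestamp to the second>.
row_key = f"machine-0001#{now:%Y%m%d%H%M%S}".encode()

row = table.direct_row(row_key)
row.set_cell("stats", b"cpu_percent", str(37.5).encode(), timestamp=now)
row.set_cell("stats", b"memory_percent", str(62.1).encode(), timestamp=now)
row.commit()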

Question 42:
You want to archive data in Cloud Storage. Because some data is very
sensitive, you want to use the "Trust No One" (TNO) approach to encrypt
your data to prevent the cloud provider staff from decrypting your data.
What should you do?

Use gcloud kms keys create to create a symmetric key. Then use gcloud
kms encrypt to encrypt each archival file with the key and unique
additional authenticated data (AAD). Use gsutil cp to upload each
encrypted file to the Cloud Storage bucket, and keep the AAD outside of
Google Cloud.

Use gcloud kms keys create to create a symmetric key. Then use gcloud
kms encrypt to encrypt each archival file with the key. Use gsutil cp to
upload each encrypted file to the Cloud Storage bucket. Manually destroy
the key previously used for encryption, and rotate the key once.

Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in Cloud Memorystore as permanent storage of the secret.

Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in a different project that only the security team can access. (Correct)

Explanation
A is incorrect as the key still lives in Cloud KMS, and AAD does not increase cryptographic strength; it is only an additional authentication check on decryption requests.

B is incorrect as this approach still relies on a Cloud KMS key managed by the cloud provider, so it is not Trust No One; destroying and rotating the key afterwards does not change that.

C is incorrect as Cloud Memorystore is an in-memory cache, not a secure, durable place to keep encryption keys.

D is correct as a customer-supplied encryption key is never stored by Google, and keeping the key in a separate project that only the security team can access decouples the key from the data and keeps key management out of the cloud provider's reach.

Links:
https://cloud.google.com/kms/docs/additional-authenticated-
data#when_to_use_aad
https://cloud.google.com/storage/docs/encryption/customer-supplied-keys

Notes:
AAD does not increase the cryptographic strength of the ciphertext. Instead,
it is an additional check by Cloud KMS to authenticate a decryption request.
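
Example (Python sketch):
A hedged sketch of a TNO-style upload with a customer-supplied encryption key, using the google-cloud-storage client rather than the .boto file; the bucket and object names are assumptions, and the base64-encoded key would be stored in the separate, security-team-only project (for example in Secret Manager), never alongside the data.

import base64
import os

from google.cloud import storage

csek = os.urandom(32)  # 256-bit AES-256 key generated locally; Google does not store it

client = storage.Client()
bucket = client.bucket("sensitive-archive")                           # assumed bucket
blob = bucket.blob("archives/2021-01.tar.gz", encryption_key=csek)    # assumed object
blob.upload_from_filename("2021-01.tar.gz")                           # assumed local file

# Hand this off to the security team's project; it is required to read the object later.
print("CSEK (base64):", base64.b64encode(csek).decode())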

Question 43:
You have data pipelines running on BigQuery, Cloud Dataflow, and Cloud
Dataproc. You need to perform health checks and monitor their behavior, and
then notify the team managing the pipelines if they fail. You also need to be
able to work across multiple projects. Your preference is to use managed
products/features of the platform. What should you do?

Export the information to Cloud Stackdriver, and set up an Alerting policy (Correct)

Run a Virtual Machine in Compute Engine with Airflow, and export the
information to Stackdriver

Export the logs to BigQuery, and set up App Engine to read that
information and send emails if you find a failure in the logs

Develop an App Engine application to consume logs using GCP API calls,
and send emails if you find a failure in the logs

Explanation
A is correct as Stackdriver Monitoring is a managed feature that can monitor multiple projects at once through a Workspace, and an alerting policy can notify the team when a pipeline fails.

B is incorrect as Airflow is a data-pipeline orchestration tool, not a monitoring service, and running it on a VM is not a managed product.

C is incorrect as exporting logs to BigQuery and building an App Engine reader is more work and cost than simply using the managed Stackdriver Monitoring features.

D is incorrect as developing a custom App Engine application means a full development lifecycle, a service account with the right permissions, and ongoing maintenance, whereas Stackdriver can already monitor services across multiple projects.

Links:
https://cloud.google.com/monitoring/workspaces/#account-project
https://cloud.google.com/monitoring/settings/multiple-projects#create-multi
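
Example (Python sketch):
A hedged Cloud Monitoring (monitoring_v3) sketch of creating an alerting policy in the Workspace host project; the metric filter, threshold, duration, and notification channels are assumptions and would differ per pipeline and product.

from google.cloud import monitoring_v3

project_id = "my-monitoring-workspace"  # assumed Workspace host project

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Dataflow job failed",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Assumed filter: fires when any Dataflow job reports failure.
        filter=(
            'metric.type="dataflow.googleapis.com/job/is_failed" '
            'AND resource.type="dataflow_job"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0,
        duration={"seconds": 60},
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Pipeline health",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
    # notification_channels=["projects/.../notificationChannels/..."],  # assumed channel
)

client = monitoring_v3.AlertPolicyServiceClient()
created = client.create_alert_policy(name=f"projects/{project_id}", alert_policy=policy)
print("Created policy:", created.name)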

Question 44:
You are building storage for files for a data pipeline on Google Cloud. You
want to support JSON files. The schema of these files will occasionally
change. Your analyst teams will use running aggregate ANSI SQL queries on
this data. What should you do?

A. Use BigQuery for storage. Provide format files for data load. Update the
format files as needed.

B. Use BigQuery for storage. Select "Automatically detect" in the


(Correct)
Schema section.

C. Use Cloud Storage for storage. Link data as temporary tables in BigQuery and turn on the "Automatically detect" option in the Schema section of BigQuery.

D. Use Cloud Storage for storage. Link data as permanent tables in BigQuery and turn on the "Automatically detect" option in the Schema section of BigQuery.

Explanation
B is correct because of the requirement to support occasionally (schema)
changing JSON files and aggregate ANSI SQL queries: you need to use
BigQuery, and it is quickest to use 'Automatically detect' for schema changes.

A is not correct because you should not provide format files: you can simply
turn on the 'Automatically detect' schema changes flag.

C, D are not correct because you should not use Cloud Storage for this
scenario: it is cumbersome and doesn't add value.
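
Example (Python sketch):
A minimal BigQuery load sketch with schema auto-detection for newline-delimited JSON; the source URI and destination table are assumptions.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the (occasionally changing) schema
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.json",    # assumed source files
    "my-project.analytics.events",     # assumed destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to finish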

Question 45:
Your company is loading CSV files into BigQuery. The data is fully imported
successfully; however, the imported data is not matching byte-to-byte to the
source file. What is the most likely cause of this problem?

A. The CSV data loaded in BigQuery is not flagged as CSV.

B. The CSV data had invalid rows that were skipped on import.

C. The CSV data loaded in BigQuery is not using BigQuery’s default encoding. (Correct)

D. The CSV data has not gone through an ETL phase before loading into
BigQuery.

Explanation
A is not correct because if a format other than CSV had been selected, the data would not have imported successfully.

B is not correct because the data was fully imported, meaning no rows were skipped.

C is correct because a source file that does not use BigQuery's default UTF-8 encoding (for example, ISO-8859-1) is the only listed cause that still allows a fully successful import while producing data that does not match the source byte for byte.

D is not correct because whether the data has been previously transformed does not affect whether the source file matches the BigQuery table.

Links:
https://cloud.google.com/bigquery/docs/loading-data#loading_encoded_data
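
Example (Python sketch):
A minimal sketch of loading the CSV with its real source encoding declared, so BigQuery converts the data to UTF-8 correctly; the encoding value, source URI, and destination table are assumptions.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    encoding="ISO-8859-1",  # assumed: match the file's actual encoding instead of the UTF-8 default
    skip_leading_rows=1,
    autodetect=True,
)

client.load_table_from_uri(
    "gs://my-bucket/exports/data.csv",    # assumed source
    "my-project.staging.imported_csv",    # assumed destination
    job_config=job_config,
).result()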