
Google Cloud Certified - Professional Data Engineer Practice Exam 4 - Results

Attempt 1
Question 1: Skipped
A company has its data stored within a single project acme-company-project. Users
across teams need to be able to access various tables within that dataset. Each team has
a separate project acme-company-team-00x created. How can access be controlled
while billing only the team querying the dataset?

A. Create Authorized views for tables required by the team in their respective project. Grant
BigQuery User role for acme-company-team-00x and data viewer role to acme-company-project
dataset

B. Create Authorized views for tables required by the team in their respective project. Grant
BigQuery User role for acme-company-team-00x and data viewer role to acme-company-team-
00x dataset

(Correct)

C. Create Authorized views for tables required by the team in their respective project. Grant
BigQuery JobUser role for acme-company-team-00x and data viewer role to acme-company-
team-00x dataset

D. Create Authorized views for tables required by the team in the acme-company-project project.
Grant BigQuery User role for acme-company-team-00x and data viewer role to acme-company-
team-00x dataset
Explanation
Correct answer is B, as controlled access can be provided using authorized views created
in a separate project. Users should be granted the BigQuery User role on the team project
so they can run queries, and the Data Viewer role on the dataset containing the views so
they can view it within that project.
Refer GCP documentation - BigQuery Authorized View
Giving a view access to a dataset is also known as creating an authorized view in BigQuery.
An authorized view allows you to share query results with particular users and groups
without giving them access to the underlying tables. You can also use the view's SQL query to
restrict the columns (fields) the users are able to query.
When you create the view, it must be created in a dataset separate from the source data
queried by the view. Because you can assign access controls only at the dataset level, if the
view is created in the same dataset as the source data, your data analysts would have access
to both the view and the data.
In order to query the view, your data analysts need permission to run query jobs. The
bigquery.user role includes permissions to run jobs, including query jobs, within the project.
If you grant a user or group the bigquery.user role at the project level, the user can create
datasets and can run query jobs against tables in those datasets. The bigquery.user role does
not give users permission to query data, view table data, or view table schema details for
datasets the user did not create.
Assigning your data analysts the project-level bigquery.user role does not give them the
ability to view or query table data in the dataset containing the tables queried by the view.
Most individuals (data scientists, business intelligence analysts, data analysts) in an
enterprise should be assigned the project-level bigquery.user role.
In order for your data analysts to query the view, they need READER access to the dataset
containing the view. The bigquery.user role gives your data analysts the permissions
required to create query jobs, but they cannot successfully query the view unless they also
have at least READER access to the dataset containing the view.
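As an illustration of the mechanism described above, the following is a minimal Python sketch using the google-cloud-bigquery client; the project, dataset, and table names are hypothetical stand-ins for the acme-company projects.

from google.cloud import bigquery

client = bigquery.Client(project="acme-company-team-001")   # hypothetical team project

# 1. Create the view in the team's own dataset, separate from the source data.
view = bigquery.Table("acme-company-team-001.team_views.orders_view")
view.view_query = (
    "SELECT order_id, order_total "
    "FROM `acme-company-project.sales.orders`"               # hypothetical source table
)
view = client.create_table(view)

# 2. Authorize the view on the source dataset so it can read the underlying
#    table without granting the analysts direct access to that table.
source_dataset = client.get_dataset("acme-company-project.sales")
entries = list(source_dataset.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])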
Option A is wrong as viewer role should be provided to the dataset within the respective team
project.
Option C is wrong as the user should be provided with the User role.
Option D is wrong as Authorized views should be created in a separate project. If they are
created in the same project, the users would have access to the underlying tables as well.
Question 2: Skipped
You are tasked with building an online analytical processing (OLAP) marketing
analytics and reporting tool. This requires a relational database that can operate on
hundreds of terabytes of data. What is the Google recommended tool for such
applications?

A. Cloud Spanner, because it is globally distributed

B. Cloud SQL, because it is a fully managed relational database


C. Cloud Firestore, because it offers real-time synchronization across devices

D. BigQuery, because it is designed for large-scale processing of tabular data

(Correct)

Explanation
Correct answer is D as BigQuery is a fully managed data warehouse solution with analytics
and reporting capabilities that is able to handle large amounts of data.
Refer GCP documentation - Storage Options

 BigQuery - A scalable, fully managed enterprise data warehouse (EDW) with SQL and
fast ad-hoc queries. Good for: OLAP workloads up to petabyte scale, big data exploration
and processing, reporting via business intelligence (BI) tools, analytical reporting on large
data, data science and advanced analyses, and big data processing using SQL.
Options A & B are wrong as they are relational databases and suitable for OLTP workloads.
Option C is wrong as Cloud Firestore is a NoSQL document database geared toward mobile and
web application data with real-time synchronization. It does not provide analytics capabilities.
Question 3: Skipped
You work for a manufacturing plant that batches application log files together into a
single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to
process that log file. You need to make sure the log file is processed once per day as
inexpensively as possible. What should you do?

A. Change the processing job to use Google Cloud Dataproc instead.

B. Manually start the Cloud Dataflow job each morning when you get into the office.

C. Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.

(Correct)

D. Configure the Cloud Dataflow job as a streaming job so that it processes the log data
immediately.
Explanation
Correct answer is C as the Cloud Dataflow job can be triggered using a cron job hosted on
the GCP infrastructure.
Refer GCP documentation - Scheduling Dataflow pipelines using App Engine Cron Service
App Engine Cron Service allows you to configure and run cron jobs at regular intervals.
These cron jobs are a little different from regular Linux cron jobs in that they cannot run any
script or command. They can only invoke a URL defined as part of your App Engine app via
HTTP GET. In return, you don’t have to worry about how or where the cron job is running.
App Engine infrastructure takes care of making sure that your cron job runs at the interval
that you want it to run.
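As a rough sketch of this pattern (not part of the original answer), an App Engine handler such as the one below could be the URL invoked by the cron entry each day at 2:00 AM; the project ID, template path, job name, and parameters are hypothetical, and the Dataflow job is assumed to already be staged as a template.

from flask import Flask
from googleapiclient.discovery import build

app = Flask(__name__)

@app.route("/launch-dataflow")            # URL targeted by the App Engine cron entry (HTTP GET)
def launch_dataflow():
    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().templates().launch(
        projectId="my-project",                          # hypothetical project ID
        gcsPath="gs://my-bucket/templates/log-job",      # hypothetical staged template
        body={
            "jobName": "daily-log-processing",
            "parameters": {"inputFile": "gs://my-bucket/logs/latest.log"},
        },
    )
    response = request.execute()
    return "Launched Dataflow job {}".format(response["job"]["id"])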
Option A is wrong as Dataproc is more suitable for existing Hadoop or Spark jobs and it is not
an inexpensive approach.
Option B is wrong as manually triggering the pipeline is not an efficient approach.
Option D is wrong as a Cloud Dataflow streaming job only supports Cloud Pub/Sub as a source.
What data sources and sinks are supported in streaming mode?
You can read streaming data from Cloud Pub/Sub, and you can write streaming data to
Cloud Pub/Sub or BigQuery.
Question 4: Skipped
Your globally distributed auction application allows users to bid on items. Occasionally,
users place identical bids at nearly identical times, and different application servers
process those bids. Each bid event contains the item, amount, user, and timestamp. You
want to collate those bid events into a single location in real time to determine which
user bid first. What should you do?

A. Create a file on a shared file system and have the application servers write all bid events to that file.
Process the file with Apache Hadoop to identify which user bid first.

B. Have each application server write the bid events to Cloud Pub/Sub as they occur. Push the
events from Cloud Pub/Sub to a custom endpoint that writes the bid event information into Cloud
SQL.


C. Set up a MySQL database for each application server to write bid events into. Periodically
query each of those distributed MySQL databases and update a master MySQL database with bid
event information.

D. Have each application server write the bid events to Google Cloud Pub/Sub as they occur. Use
a pull subscription to pull the bid events using Google Cloud Dataflow. Give the bid for each
item to the user in the bid event that is processed first.

(Correct)

Explanation
Correct answer is D as Cloud Pub/Sub with Cloud Dataflow can be used to buffer the bids
and process them in order.
Refer GCP documentation - Cloud Pub/Sub Subscriber
Cloud Pub/Sub provides a highly-available, scalable message delivery service. The tradeoff
for having these properties is that the order in which messages are received by subscribers is
not guaranteed. While the lack of ordering may sound burdensome, there are very few use
cases that actually require strict ordering.
Typically, Cloud Pub/Sub delivers each message once and in the order in which it was
published. However, messages may sometimes be delivered out of order or more than once.
In general, accommodating more-than-once delivery requires your subscriber to
be idempotent when processing messages. You can achieve exactly once processing of Cloud
Pub/Sub message streams using Cloud Dataflow  PubsubIO .  PubsubIO  de-duplicates
messages on custom message identifiers or those assigned by Cloud Pub/Sub. You can also
achieve ordered processing with Cloud Dataflow by using the standard sorting APIs of the
service. Alternatively, to achieve ordering, the publisher of the topic to which you subscribe
can include a sequence token in the message.
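For illustration only, a minimal Apache Beam (Python SDK) sketch of option D; the topic name, message fields, and the 60-second window are assumptions, not part of the question.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

def earliest_bid(bids):
    # The bid with the smallest timestamp wins the item.
    return min(bids, key=lambda bid: bid["timestamp"])

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadBids" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/bids")
        | "Parse" >> beam.Map(json.loads)
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "KeyByItem" >> beam.Map(lambda bid: (bid["item"], bid))
        | "GroupByItem" >> beam.GroupByKey()
        | "FirstBid" >> beam.Map(lambda kv: (kv[0], earliest_bid(kv[1])))
    )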
Options A, B & C are wrong as they do not provide a scalable, real-time approach to collate
the bids and determine which user bid first.
Question 5: Skipped
You want to use a database of information about tissue samples to classify future tissue
samples as either normal or mutated. You are evaluating an unsupervised anomaly
detection method for classifying the tissue samples. Which two characteristics support
this method? (Choose two.)

A. There are very few occurrences of mutations relative to normal samples.

(Correct)


B. There are roughly equal occurrences of both normal and mutated samples in the database.

C. You expect future mutations to have different features from the mutated samples in the
database.

D. You expect future mutations to have similar features to the mutated samples in the database.

(Correct)

E. You already have labels for which samples are mutated and which are normal in the database.
Explanation
Correct answers are A & D as Unsupervised Anomaly Detection would need the data to have
fewer occurrences of mutation as compared to normal data and expect future mutations to
have similar features.
Unsupervised Anomaly Detection - These techniques do not need training data. Instead, they
rely on two basic assumptions. First, they presume that most of the network connections are
normal traffic and only a very small percentage of the traffic is abnormal. Second, they
anticipate that malicious traffic is statistically different from normal traffic. Based on these
two assumptions, groups of similar instances that appear frequently are assumed to be normal
traffic, while infrequent instances that differ considerably from the majority are regarded as
malicious.
Option B is wrong as an equal number of mutations to normal data would not allow anomaly
detection.
Option C is wrong as with different features for future mutations, the anomaly detection
would not work.
Option E is wrong as it would be best to use supervised learning, as we already have labels
for samples.
Supervised Anomaly Detection - Supervised methods (also known as classification methods)
require a labeled training set containing both normal and anomalous samples to construct
the predictive model. Theoretically, supervised methods provide a better detection rate than
semi-supervised and unsupervised methods, since they have access to more information.
However, some technical issues make these methods less accurate than they are supposed
to be.
Question 6: Skipped
Your organization has been collecting and analyzing data in Google BigQuery for 6
months. The majority of the data analyzed is placed in a time-partitioned table named
events_partitioned. To reduce the cost of queries, your organization created a view
called events, which queries only the last 14 days of data. The view is described in legacy
SQL. Next month, existing applications will be connecting to BigQuery to read the
events data via an ODBC connection. You need to ensure the applications can connect.
Which two actions should you take? (Choose two.)

A. Create a new view over events using standard SQL

B. Create a new partitioned table using a standard SQL query

C. Create a new view over events_partitioned using standard SQL

D. Create a service account for the ODBC connection to use for authentication

(Correct)

E. Create a Google Cloud Identity and Access Management (Cloud IAM) role for the ODBC
connection and shared "events"

(Correct)

Explanation
Correct answers are D & E as BigQuery supports authentication using Service Accounts and
User accounts.
Refer GCP documentation - BigQuery with ODBC driver
You'll need to provide credentials, either with a service account key or user authentication.
Service accounts - A service account is a Google account that is associated with your GCP
project. Use a service account to access the BigQuery API if your application can run jobs
associated with service credentials rather than an end-user's credentials, such as a batch
processing pipeline.
User accounts - Use user credentials to ensure that your application has access only to
BigQuery tables that are available to the end user. A user credential can run queries against
only the end user's Cloud Platform project rather than the application's project, meaning the
user is billed for queries instead of the application.
Options A, B & C are wrong as the applications can connect to the legacy SQL view over ODBC
using a service account key or user authentication.
Question 7: Skipped
You are implementing security best practices on your data pipeline. Currently, you are
manually executing jobs as the Project Owner. You want to automate these jobs by
taking nightly batch files containing non-public information from Google Cloud
Storage, processing them with a Spark Scala job on a Google Cloud Dataproc cluster,
and depositing the results into Google BigQuery. How should you securely run this
workload?

A. Restrict the Google Cloud Storage bucket so only you can see the files

B. Grant the Project Owner role to a service account, and run the job with it

C. Use a service account with the ability to read the batch files and to write to BigQuery

(Correct)

D. Use a user account with the Project Viewer role on the Cloud Dataproc cluster to read the
batch files and write to BigQuery
Explanation
Correct answer is C as the best practice is to use a service account with least privilege.
Refer GCP documentation - IAM Best Practices - Service Accounts
A service account is a special type of Google account intended to represent a non-human
user that needs to authenticate and be authorized to access data in Google APIs.
Typically, service accounts are used in scenarios such as:
Running workloads on virtual machines (VMs).
Running workloads on on-premises workstations or data centers that call Google APIs.
Running workloads which are not tied to the lifecycle of a human user.
Option A is wrong as the best practice is to use a service account, i.e. a non-human user, for
jobs.
Option B is wrong as Project Owner role does not align with the IAM best practices of least
privilege.
All editor permissions and permissions for the following actions:
Manage roles and permissions for a project and all resources within the project.
Set up billing for a project.
Option D is wrong as the Project Viewer role does not grant access to write to BigQuery.
Permissions for read-only actions that do not affect state, such as viewing (but not
modifying) existing resources or data.
Question 8: Skipped
Your company’s customer and order databases are often under heavy load. This makes
performing analytics against them difficult without harming operations. The databases
are in a MySQL cluster, with nightly backups taken using mysqldump. You want to
perform analytics with minimal impact on operations. What should you do?

A. Add a node to the MySQL cluster and build an OLAP cube there.

B. Use an ETL tool to load the data from MySQL into Google BigQuery.

(Correct)

C. Connect an on-premises Apache Hadoop cluster to MySQL and perform ETL.

D. Mount the backups to Google Cloud SQL, and then process the data using Google Cloud
Dataproc.
Explanation
Correct answer is B as moving data to BigQuery would reduce the load on the MySQL
instances and allow the data to be queried using the same SQL.
Options A & C are wrong as they do not reduce the load on the existing MySQL instance.
Option D is wrong as backups cannot be mounted to Google Cloud SQL, but have to be
restored or imported. Also, it needs operational effort.
Question 9: Skipped
You are training a spam classifier. You notice that you are overfitting the training data.
Which three actions can you take to resolve this problem? (Choose three.)

A. Get more training examples

(Correct)

B. Reduce the number of training examples

C. Use a smaller set of features

(Correct)

D. Use a larger set of features

E. Increase the regularization parameters

(Correct)

F. Decrease the regularization parameters


Explanation
Correct answers are A, C & E
Refer documentation - Tensorflow Overfit vs Underfit
Overfitting is a phenomenon where a machine learning model models the training data too
well but fails to perform well on the testing data.
If you train for too long though, the model will start to overfit and learn patterns from the
training data that don't generalize to the test data. We need to strike a balance.
Understanding how to train for an appropriate number of epochs as we'll explore below is a
useful skill.
To prevent overfitting, the best solution is to use more training data. A model trained on
more data will naturally generalize better. When that is no longer possible, the next best
solution is to use techniques like regularization. These place constraints on the quantity and
type of information your model can store. If a network can only afford to memorize a small
number of patterns, the optimization process will force it to focus on the most prominent
patterns, which have a better chance of generalizing well.
Train with more data - It won’t work every time, but training with more data can help
algorithms detect the signal better.
Remove features - Some algorithms have built-in feature selection. For those that don’t, you
can manually improve their generalizability by removing irrelevant input features.
Regularization - Regularization refers to a broad range of techniques for artificially forcing
your model to be simpler. The method will depend on the type of learner you’re using. For
example, you could prune a decision tree, use dropout on a neural network, or add a penalty
parameter to the cost function in regression. Oftentimes, the regularization method is a
hyperparameter as well, which means it can be tuned through cross-validation.
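To make options C and E concrete, a brief TensorFlow/Keras sketch is shown below; the layer sizes, feature count, and regularization strength are arbitrary illustrations, not values from the question.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        16,                                                    # smaller feature/capacity budget
        activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(0.001),    # raise to regularize harder
        input_shape=(1000,),                                   # hypothetical feature vector size
    ),
    tf.keras.layers.Dropout(0.5),                              # randomly drop units during training
    tf.keras.layers.Dense(1, activation="sigmoid"),            # spam vs. not-spam
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])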
Question 10: Skipped
Your infrastructure includes a set of YouTube channels. You have been tasked with
creating a process for sending the YouTube channel data to Google Cloud for analysis.
You want to design a solution that allows your world-wide marketing teams to perform
ANSI SQL and other types of analysis on up-to-date YouTube channels log data. How
should you set up the log data transfer into Google Cloud?

A. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-
Regional storage bucket as a final destination.

B. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Regional
bucket as a final destination.

C. Use BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage
Multi-Regional storage bucket as a final destination.

(Correct)

D. Use BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage
Regional storage bucket as a final destination.
Explanation
Correct answer is C as the BigQuery Data Transfer Service provides integration with YouTube
to transfer data to Cloud Storage. Using a multi-regional storage bucket allows the data to be
stored and queried from across the globe.
Refer GCP documentation - BigQuery Transfer Service & Dataset Locations
BigQuery Data Transfer Service automates data movement from Software as a Service
(SaaS) applications such as Google Ads and Google Ad Manager on a scheduled, managed
basis. Your analytics team can lay the foundation for a data warehouse without writing a
single line of code.
Like BigQuery, the BigQuery Data Transfer Service is a multi-regional resource.
Data locality is specified when you create a dataset to store your BigQuery Data Transfer
Service core customer data. When you set up a transfer, the transfer configuration is set to
the same locality as the dataset. The BigQuery Data Transfer Service processes and stages
data in the same location as the target BigQuery dataset.
If your BigQuery dataset is in a multi-regional location, the Cloud Storage bucket containing
the data you're loading must be in a regional or multi-regional bucket in the same location.
When you export data, the regional or multi-regional Cloud Storage bucket must be in the
same location as the BigQuery dataset.
Options A & B are wrong as Storage Transfer Service transfers data from an online data
source to a data sink. Your data source can be an Amazon Simple Storage Service (Amazon
S3) bucket, an HTTP/HTTPS location, or a Cloud Storage bucket. Your data sink (the
destination) is always a Cloud Storage bucket.
Option D is wrong as Multi-regional storage should be preferred over Regional storage.
Question 11: Skipped
Your company is performing data preprocessing for a learning algorithm in Google
Cloud Dataflow. Numerous data logs are being generated during this step, and the team
wants to analyze them. Due to the dynamic nature of the campaign, the data is growing
exponentially every hour. The data scientists have written the following code to read the
data for new key features in the logs.

BigQueryIO.Read
    .named("ReadLogData")
    .from("clouddataflow-readonly:samples.log_data")

You want to improve the performance of this data read. What should you do?

A. Specify the TableReference object in the code.

B. Use  .fromQuery  operation to read specific fields from the table.

(Correct)

C. Use of both the Google BigQuery TableSchema and TableFieldSchema classes.

D. Call a transform that returns TableRow objects, where each element in the PCollection
represents a single row in the table.
Explanation
Correct answer is B as the best practice is to limit the data queried.
BigQueryIO.read.from()   directly reads the whole table from BigQuery. This function
exports the whole table to temporary files in Google Cloud Storage, where it will later be
read from. This requires almost no computation, as it only performs an export job, and later
Dataflow reads from GCS (not from BigQuery).
BigQueryIO.read.fromQuery()   executes a query and then reads the results received after
the query execution. Therefore, this function is more time-consuming, given that it requires
that a query is first executed (which incurs the corresponding economic and computational
costs).
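In the current Beam SDKs the same idea (option B) looks roughly like the Python sketch below; the selected fields are assumptions, and in practice the pipeline also needs a temp/GCS location configured for the BigQuery export.

import apache_beam as beam

with beam.Pipeline() as p:
    log_features = p | "ReadLogData" >> beam.io.ReadFromBigQuery(
        query=(
            "SELECT user_id, event_type, event_timestamp "    # only the needed columns
            "FROM `clouddataflow-readonly.samples.log_data`"
        ),
        use_standard_sql=True,
    )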
Refer GCP documentation - BigQuery Best Practices
Best practice: Control projection — Query only the columns that you need.
Projection refers to the number of columns that are read by your query. Projecting excess
columns incurs additional (wasted) I/O and materialization (writing results).
Using  SELECT *  is the most expensive way to query data. When you use  SELECT * ,
BigQuery does a full scan of every column in the table.
If you are experimenting with data or exploring data, use one of the data preview options
instead of  SELECT * .

Applying a  LIMIT   clause to a  SELECT *   query does not affect the amount of data read. You
are billed for reading all bytes in the entire table, and the query counts against your free tier
quota.
Instead, query only the columns you need. For example, use  SELECT * EXCEPT  to exclude
one or more columns from the results.
If you do require queries against every column in a table, but only against a subset of data,
consider:
Materializing results in a destination table and querying that table instead
Partitioning your tables by date and querying the relevant partition; for example,  WHERE
_PARTITIONDATE="2017-01-01"   only scans the January 1, 2017 partition
Querying a subset of data or using  SELECT * EXCEPT   can greatly reduce the amount of data
that is read by a query. In addition to the cost savings, performance is improved by reducing
the amount of data I/O and the amount of materialization that is required for the query
results.
Options A & C are wrong as they do not improve query performance
Option D is wrong as performing inline transformation is not recommended and would
reduce the performance.
Question 12: Skipped
You are designing storage for two relational tables that are part of a 10-TB database on
Google Cloud. You want to support transactions that scale horizontally. You also want
to optimize data for range queries on non-key columns. What should you do?

A. Use Cloud SQL for storage. Add secondary indexes to support query patterns.

B. Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.

C. Use Cloud Spanner for storage. Add secondary indexes to support query patterns.

(Correct)

D. Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query
patterns.
Explanation
Correct answer is C as Cloud Spanner provides the ability to scale horizontally and
Secondary Indexes help to query non-key fields effectively.
Refer GCP documentation - Spanner & Secondary Indexes
Cloud Spanner is the first scalable, enterprise-grade, globally-distributed, and strongly
consistent database service built for the cloud specifically to combine the benefits of
relational database structure with non-relational horizontal scale. This combination delivers
high-performance transactions and strong consistency across rows, regions, and continents
with an industry-leading 99.999% availability SLA, no planned downtime, and enterprise-
grade security. Cloud Spanner revolutionizes database administration and management and
makes application development more efficient.
In a Cloud Spanner database, Cloud Spanner automatically creates an index for each table's
primary key column.
You can also create secondary indexes for other columns. Adding a secondary index on a
column makes it more efficient to look up data in that column.
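A minimal sketch using the google-cloud-spanner client is shown below; the instance, database, table, and column names are hypothetical.

from google.cloud import spanner

client = spanner.Client(project="my-project")
database = client.instance("my-instance").database("orders-db")

# Secondary index to make range queries on the non-key OrderDate column efficient.
operation = database.update_ddl([
    "CREATE INDEX OrdersByOrderDate ON Orders(OrderDate)",
])
operation.result()   # wait for the schema change to finish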
Options A & B are wrong as Cloud SQL does not provide the ability to scale horizontally.
Option D is wrong as using Dataflow is not an effective approach.
Question 13: Skipped
Your company is streaming real-time sensor data from their factory floor into Bigtable
and they have noticed extremely poor performance. How should the row key be
redesigned to improve Bigtable performance on queries that populate real-time
dashboards?

A. Use a row key of the form  <timestamp> .

B. Use a row key of the form  <sensorid> .

C. Use a row key of the form  <timestamp>#<sensorid> .

D. Use a row key of the form  <sensorid>#<timestamp> .

(Correct)

Explanation
Correct answer is D. As the data is time-series data, it is recommended to use tall and narrow
tables with a row key combining both the sensorid and the timestamp. Also, it is recommended
not to use the timestamp at the start of the row key, as most writes would be pushed to a single node.
Refer GCP documentation - Bigtable Schema Design & Time-Series Schema Design
A tall and narrow table has a small number of events per row, which could be just one event,
whereas a short and wide table has a large number of events per row.
For time series, you should generally use tall and narrow tables.  This is for two reasons:
Storing one event per row makes it easier to run queries against your data. Storing many
events per row makes it more likely that the total row size will exceed the recommended
maximum
If you often need to retrieve data based on the time when it was recorded, it's a good idea to
include a timestamp as part of your row key.  Using the timestamp by itself as the row key is
not recommended, as most writes would be pushed onto a single node. For the same
reason, avoid placing a timestamp at the start of the row key.
For example, your application might need to record performance-related data, such as CPU
and memory usage, once per second for a large number of machines. Your row key for this
data could combine an identifier for the machine with a timestamp for the data (for
example,  machine_4223421#1425330757685 ).
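A short google-cloud-bigtable sketch of writing with a <sensorid>#<timestamp> row key follows; the instance, table, and column family names are hypothetical.

import time

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("sensors-instance").table("sensor-readings")

sensor_id = "sensor-4223421"
ts_millis = int(time.time() * 1000)
row_key = "{}#{}".format(sensor_id, ts_millis).encode()   # sensor first, so writes spread across nodes

row = table.direct_row(row_key)
row.set_cell("metrics", "temperature", b"21.7")
row.commit()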

Options A & B are wrong as they would not support querying based on sensor and time together
to build the dashboard.
Option C is wrong as it is recommended to NOT have timestamp at the start of the row key.
Question 14: Skipped
Your company receives both batch- and stream-based event data. You want to process
the data using Google Cloud Dataflow over a predictable time period. However, you
realize that in some instances data can arrive late or out of order. How should you
design your Cloud Dataflow pipeline to handle data that is late or out of order?

A. Set a single global window to capture all the data.

B. Set sliding windows to capture all the lagged data.

C. Use watermarks and timestamps to capture the lagged data.

(Correct)

D. Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to
define the logic for lagged data.
Explanation
Correct answer is C as you would need both watermarks and timestamps to identify and handle the lagged data.
Refer GCP documentation - Dataflow Streaming Basics & Beam Windowing
In any data processing system, there is a certain amount of lag between the time a data event
occurs (the “event time”, determined by the timestamp on the data element itself) and the
time the actual data element gets processed at any stage in your pipeline (the “processing
time”, determined by the clock on the system processing the element). In addition, there are
no guarantees that data events will appear in your pipeline in the same order that they were
generated.
Watermarks are the notion of when the system expects that all data in a certain window has
arrived in the pipeline. Cloud Dataflow tracks watermarks because data is not guaranteed to
arrive in time order or at predictable intervals. In addition, there are no guarantees that data
events appear in the pipeline in the same order that they were generated. After the
watermark progresses past the end of a window, any further elements that arrive with a
timestamp in that window are considered late data.
However, data isn’t always guaranteed to arrive in a pipeline in time order, or to always
arrive at predictable intervals. Beam tracks a watermark, which is the system’s notion of
when all data in a certain window can be expected to have arrived in the pipeline. Once the
watermark progresses past the end of a window, any further element that arrives with a
timestamp in that window is considered  late data.
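A hedged Beam (Python SDK) sketch of option C follows; the sample data, field names, window size, trigger, and lateness bound are all assumptions made for illustration.

import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
    events = (
        p
        | "SampleEvents" >> beam.Create([
            {"sensor": "a", "reading": 1.0, "event_time": 1000},
            {"sensor": "a", "reading": 2.0, "event_time": 1005},
        ])
        # Attach each element's own event time so windowing uses it, not arrival time.
        | "AttachTimestamps" >> beam.Map(
            lambda e: window.TimestampedValue(e, e["event_time"]))
    )

    windowed_counts = (
        events
        | "WindowWithLateness" >> beam.WindowInto(
            window.FixedWindows(5 * 60),                    # 5-minute event-time windows
            trigger=trigger.AfterWatermark(
                late=trigger.AfterProcessingTime(60)),      # re-fire when late data arrives
            allowed_lateness=30 * 60,                       # accept data up to 30 minutes late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
        )
        | "KeyBySensor" >> beam.Map(lambda e: (e["sensor"], 1))
        | "CountPerWindow" >> beam.CombinePerKey(sum)
    )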
Option A is wrong as for unbounded data you need to choose a non-global window.
Option B is wrong as Sliding windows do not catch late data.
Hopping windowing also represents time intervals in the data stream; however, hopping
windows can overlap. For example, each window might capture five minutes worth of data,
but a new window starts every ten seconds. The frequency with which hopping windows
begin is called the period. Therefore, our example would have a window duration of five
minutes and a period of ten seconds.
Because multiple windows overlap, most elements in a dataset belong to more than one
window. Hopping windowing is useful for taking running averages of data; in our example,
you can compute a running average of the past five minutes' worth of data, updated every ten
seconds.
Option D is wrong as you would need watermarks to identify late data.
Question 15: Skipped
Your company is currently setting up data pipelines for their campaign. For all the
Google Cloud Pub/Sub streaming data, one of the important business requirements is to
be able to periodically identify the inputs and their timings during their campaign.
Engineers have decided to use windowing and transformation in Google Cloud
Dataflow for this purpose. However, when testing this feature, they find that the Cloud
Dataflow job fails for all the streaming inserts. What is the most likely cause of this
problem?

A. They have not assigned the timestamp, which causes the job to fail

B. They have not set the triggers to accommodate the data coming in late, which causes the job to
fail

C. They have not applied a global windowing function, which causes the job to fail when the
pipeline is created

D. They have not applied a non-global windowing function, which causes the job to fail when the
pipeline is created

(Correct)

Explanation
Correct answer is D as with an unbounded Pub/Sub collection you need to apply a non-global
windowing function.
Refer GCP documentation - Dataflow Streaming Pipeline Basics & Beam Windowing
Windowing enables grouping over unbounded collections by dividing the collection into
windows according to the timestamps of the individual elements. Each window contains a
finite number of elements. Grouping operations work implicitly on a per-window basis;
grouping operations process each collection as a succession of multiple, finite windows,
though the entire collection might be of unbounded size.
If you are using unbounded  PCollection s, you must use either  non-global windowing  or
an  aggregation trigger in order to perform a  GroupByKey  or CoGroupByKey. This is
because a bounded  GroupByKey   or  CoGroupByKey  must wait for all the data with a certain
key to be collected, but with unbounded collections, the data is unlimited. Windowing and/or
triggers allow grouping to operate on logical, finite bundles of data within the unbounded
data streams.
If you do apply  GroupByKey   or  CoGroupByKey  to a group of
unbounded  PCollection s without setting either a non-global windowing strategy, a
trigger strategy, or both for each collection, Beam generates an IllegalStateException
error at pipeline construction time.
Option A is wrong as PubsubIO will read the message from Pub/Sub and assign the message
publish time to the element as the record timestamp.
Option B is wrong as trigger and watermarks are not mandatory. A related concept,
called triggers, determines when to emit the results of aggregation as unbounded data
arrives. You can use triggers to refine the windowing strategy for your  PCollection .
Triggers allow you to deal with late-arriving data or to provide early results.
Option C is wrong as with an unbounded collection you need to apply a non-global windowing
function.
Question 16: Skipped
You need to store and analyze social media postings in Google BigQuery at a rate of
10,000 messages per minute in near real-time. Initially, the application was designed to
use streaming inserts for individual postings. Your application also performs data
aggregations right after the streaming inserts. You discover that the queries after
streaming inserts do not exhibit strong consistency, and reports from the queries might
miss in-flight data. How can you adjust your application design?

A. Re-write the application to load accumulated data every 2 minutes.

B. Convert the streaming insert code to batch load for individual messages.

C. Load the original message to Google Cloud SQL, and export the table every hour to BigQuery
via streaming inserts.

D. Estimate the average latency for data availability after streaming inserts, and always run
queries after waiting twice as long.

(Correct)

Explanation
Correct answer is D as the application can be adjusted to check the average latency and wait
for a variable time.
Refer GCP documentation - BigQuery Streaming Inserts
Streamed data is available for real-time analysis within a few seconds of the first streaming
insertion into a table. In rare circumstances (such as an outage), data in the streaming buffer
may be temporarily unavailable. When data is unavailable, queries continue to run
successfully, but they skip some of the data that is still in the streaming buffer. These queries
will contain a warning in the  errors  field of  bigquery.jobs.getQueryResults , in the
response to bigquery.jobs.query   or in the  status.errors  field of  bigquery.jobs.get .

Data can take up to 90 minutes to become available for copy and export operations. Also,
when streaming to a partitioned table, data in the streaming buffer has a NULL value for
the  _PARTITIONTIME  pseudo column. To see whether data is available for copy and export,
check the  tables.get   response for a section named  streamingBuffer . If that section is
absent, your data should be available for copy or export, and should have a non-null value
for the  _PARTITIONTIME  pseudo column. Additionally,
the  streamingBuffer.oldestEntryTime   field can be leveraged to identify the age of records
in the streaming buffer.
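As a hedged illustration of checking the streaming buffer from application code with the google-cloud-bigquery client (the table name is hypothetical):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.social.postings")   # calls tables.get under the hood

if table.streaming_buffer is None:
    print("No streaming buffer: data is available for copy and export.")
else:
    print("Oldest record still in the streaming buffer:",
          table.streaming_buffer.oldest_entry_time)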
Option A is wrong as the data availability is variable; a fixed wait time would not address the
problem.
Option B is wrong as batch load is not ideal for individual messages.
Option C is wrong as Cloud SQL is not an ideal choice to support streaming data inserts.
Question 17: Skipped
You are building a model to make clothing recommendations. You know a user’s
fashion preference is likely to change over time, so you build a data pipeline to stream
new data back to the model as it becomes available. How should you use this data to
train the model?

A. Continuously retrain the model on just the new data.

B. Continuously retrain the model on a combination of existing data and the new data.

(Correct)

C. Train on the existing data while using the new data as your test set.

D. Train on the new data while using the existing data as your test set.
Explanation
Correct answer is B as the preference is going to change over a period of time, so it is more
logical to retrain the model on both the new data and the existing data.
Another way to keep your models up-to-date is to have an automated system to continuously
evaluate and retrain your models. This type of system is often referred to as continuous
learning, and may look something like this:
Save new training data as you receive it.
When you have enough new data, test its accuracy against your machine learning model.
If you see the accuracy of your model degrading over time, use the new data, or a
combination of the new data and old training data to build and deploy a new model.
The benefit to a continuous learning system is that it can be completely automated.
Option A is wrong as the model can be improved by taking into account both the new and old
data, which changes over a period of time.
Options C & D are wrong as the training needs to happen on both new and old data. Training
on one set of data and testing on the other would result in an inaccurate model and results.
Question 18: Skipped
You are designing storage for very large text files for a data pipeline on Google Cloud.
You want to support ANSI SQL queries. You also want to support compression and
parallel load from the input locations using Google recommended practices. What
should you do?

A. Transform text files to compressed Avro using Cloud Dataflow. Use BigQuery for storage and
query.

(Correct)

B. Transform text files to compressed Avro using Cloud Dataflow. Use Cloud Storage and
BigQuery permanent linked tables for query.

C. Compress text files to gzip using the Grid Computing Tools. Use BigQuery for storage and
query.

D. Compress text files to gzip using the Grid Computing Tools. Use Cloud Storage, and then
import into Cloud Bigtable for query.
Explanation
Correct answer is A as BigQuery can be used to store and query the text data. BigQuery
natively supports Avro and can work with compressed blocks.
Refer GCP documentation - BigQuery Loading Data
The Avro binary format is the preferred format for loading compressed data. Avro data is
faster to load because the data can be read in parallel, even when the data blocks are
compressed. Compressed Avro files are not supported, but compressed data blocks are.
BigQuery supports the DEFLATE and Snappy codecs for compressed data blocks in Avro
files.
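A minimal loading sketch with the google-cloud-bigquery client is shown below; the Cloud Storage path and destination table are hypothetical Dataflow outputs.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
load_job = client.load_table_from_uri(
    "gs://my-bucket/transformed/*.avro",      # hypothetical compressed-block Avro output
    "my-project.analytics.text_events",       # hypothetical destination table
    job_config=job_config,
)
load_job.result()   # wait for the load job to complete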
Option B is wrong as although it works, Google recommends using BigQuery for storage, if
possible, as it results in better performance.
Query performance for external data sources may not be as high as querying data in a native
BigQuery table. If query speed is a priority, load the data into BigQuery instead of setting up
an external data source. The performance of a query that includes an external data source
depends on the external storage type. For example, querying data stored in Cloud Storage is
faster than querying data stored in Google Drive. In general, query performance for external
data sources should be equivalent to reading the data directly from the external storage.
Options C & D are wrong as Grid Computing Tools are not needed and Dataflow works fine.
Also, for text files (CSV and JSON), BigQuery can load uncompressed files faster.
For other data formats such as CSV and JSON, BigQuery can load uncompressed files
significantly faster than compressed files because uncompressed files can be read in parallel.
Because uncompressed files are larger, using them can lead to bandwidth limitations and
higher Cloud Storage costs for data staged in Cloud Storage prior to being loaded into
BigQuery. You should also note that line ordering is not guaranteed for compressed or
uncompressed files. It's important to weigh these tradeoffs depending on your use case.
In general, if bandwidth is limited, compress your CSV and JSON files using gzip before
uploading them to Cloud Storage. Currently, when loading data into BigQuery, gzip is the
only supported file compression type for CSV and JSON files. If loading speed is important to
your app and you have a lot of bandwidth to load your data, leave your files uncompressed.
Question 19: Skipped
You are designing storage for 20 TB of text files as part of deploying a data pipeline on
Google Cloud. Your input data is in CSV format. You want to minimize the cost of
querying aggregate values for multiple users who will query the data in Cloud Storage
with multiple engines. Which storage service and schema design should you use?

A. Use Cloud Bigtable for storage. Install the HBase shell on a Compute Engine instance to query
the Cloud Bigtable data.

B. Use Cloud Bigtable for storage. Link as permanent tables in BigQuery for query.

C. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.

(Correct)

D. Use Cloud Storage for storage. Link as temporary tables in BigQuery for query.
Explanation
Correct answer is C as Cloud Storage provides a cost-effective solution to store data and
BigQuery Permanent tables can use Cloud Storage as an external data store and be shared.
Refer GCP documentation - BigQuery Temporary vs Permanent Tables
Permanent versus temporary external tables
You can query an external data source in BigQuery by using a permanent table or a
temporary table. When you use a permanent table, you create a table in a BigQuery dataset
that is linked to your external data source. Because the table is permanent, you can use
dataset-level access controls to share the table with others who also have access to the
underlying external data source, and you can query the table at any time.
When you query an external data source using a temporary table, you submit a command
that includes a query and creates a non-permanent table linked to the external data source.
When you use a temporary table, you do not create a table in one of your BigQuery datasets.
Because the table is not permanently stored in a dataset, it cannot be shared with others.
Querying an external data source using a temporary table is useful for one-time, ad-hoc
queries over external data, or for extract, transform, and load (ETL) processes.
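For illustration, a sketch of creating such a permanent external table with the google-cloud-bigquery client; the bucket, dataset, and table names are assumptions.

from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/text-data/*.csv"]
external_config.autodetect = True             # infer the schema from the CSV files

table = bigquery.Table("my-project.shared_dataset.text_files")
table.external_data_configuration = external_config
table = client.create_table(table)            # permanent table, shareable via dataset ACLs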
Options A & B are wrong as Bigtable is not a cost-effective storage solution.
Option D is wrong as BigQuery temporary tables are useful for one-time jobs and cannot be
shared with others.
Question 20: Skipped
You have enabled the free integration between Firebase Analytics and Google
BigQuery. Firebase now automatically creates a new table daily in BigQuery in the
format  app_events_YYYYMMDD . You want to query all of the tables for the past 30 days in
legacy SQL. What should you do?

A. Use the  TABLE_DATE_RANGE  function

(Correct)

B. Use the  WHERE _PARTITIONTIME  pseudo column

C. Use  WHERE date BETWEEN YYYY-MM-DD AND YYYY-MM-DD

D. Use  SELECT IF(date >= YYYY-MM-DD AND date <= YYYY-MM-DD)

Explanation
Correct answer is A as the data is already split into daily tables, so it would be best to
use  TABLE_DATE_RANGE  to filter based on a range of dates.
Refer GCP documentation - BigQuery with Firebase Analytics & Legacy SQL Reference
TABLE_DATE_RANGE() Queries multiple daily tables that span a date range.

What if we want to run a query across both platforms of our app over a specific date range?
Since Firebase Analytics data is split into tables for each day, we can do this using
BigQuery’s TABLE_DATE_RANGE  function. This query returns a count of the cities users
are coming from over a one week period:
SELECT
  user_dim.geo_info.city,
  COUNT(user_dim.geo_info.city) as city_count
FROM
  TABLE_DATE_RANGE([firebase-analytics-sample-data:xx.app_events_],
    DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP())
GROUP BY
  user_dim.geo_info.city
ORDER BY
  city_count DESC

Option B is wrong as _PARTITIONTIME is valid only for ingestion-time partitioned tables, not for daily sharded tables.


Options C & D are wrong as they are not valid wildcard date functions for Legacy SQL.
Question 21: Skipped
Your analytics team wants to build a simple statistical model to determine which
customers are most likely to work with your company again, based on a few different
metrics. They want to run the model on Apache Spark, using data housed in Google
Cloud Storage, and you have recommended using Google Cloud Dataproc to execute
this job. Testing has shown that this workload can run in approximately 30 minutes on
a 15-node cluster, outputting the results into Google BigQuery. The plan is to run this
workload weekly. How should you optimize the cluster for cost?

A. Migrate the workload to Google Cloud Dataflow

B. Use pre-emptible virtual machines (VMs) for the cluster

(Correct)

C. Use a higher-memory node so that the job runs faster


D. Use SSDs on the worker nodes so that the job can run faster
Explanation
Correct answer is B as the key requirement is to optimize cost, and pre-emptible VMs can be
used with Dataproc to lower it.
Refer GCP documentation - Dataproc Preemptible-VMs
In addition to using standard Compute Engine virtual machines (VMs), Cloud Dataproc
clusters can use preemptible VM instances, also known as preemptible VMs. You may decide
to use preemptible instances to lower per-hour compute costs for non-critical data
processing or to create very large clusters at a lower total cost.
All preemptible instances added to a cluster use the machine type of the cluster's non-
preemptible worker nodes. For example, if you create a cluster with workers that use  n1-
standard-4   machine types, all preemptible instances added to the cluster will also use  n1-
standard-4   machines. The addition or removal of preemptible workers from a cluster does
not affect the number of non-preemptible workers in the cluster.
Because preemptible instances are reclaimed if they are required for other tasks, Cloud
Dataproc adds preemptible instances as secondary workers in a managed instance group,
which contains only preemptible workers. The managed group automatically re-adds
workers lost due to reclamation as capacity permits. For example, if two preemptible
machines are reclaimed and removed from a cluster, these instances will be re-added to the
cluster if and when capacity is available to re-add them.
Option A is wrong as Dataflow would require redesigning the application, as it cannot reuse
the Spark scripts.
Options C & D are wrong as they would not reduce the cost.
Question 22: Skipped
You are building a data pipeline on Google Cloud. You need to prepare data using a
casual method for a machine-learning process. You want to support a logistic regression
model. You also need to monitor and adjust for null values, which must remain real-
valued and cannot be removed. What should you do?

A. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to ‘none’
using a Cloud Dataproc job.

B. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 0 using a
Cloud Dataprep job.

(Correct)

C. Use Cloud Dataflow to find null values in sample source data. Convert all nulls to ‘none’
using a Cloud Dataprep job.

D. Use Cloud Dataflow to find null values in sample source data. Convert all nulls to using a
custom script.
Explanation
Correct answer is B as Cloud Dataprep would help find null values as well as help convert
the null values as required.
Refer GCP documentation - DataPrep Manage Null values
Option A is wrong as Dataproc is not efficient for converting null values.
Options C & D are wrong as Dataflow is not efficient at finding nulls in the data.
Question 23: Skipped
You are developing an application that uses a recommendation engine on Google Cloud.
Your solution should display new videos to customers based on past views. Your
solution needs to generate labels for the entities in videos that the customer has viewed.
Your design must be able to provide very fast filtering suggestions based on data from
other customer preferences on several TB of data. What should you do?

A. Build and train a complex classification model with Spark MLlib to generate labels and filter
the results. Deploy the models using Cloud Dataproc. Call the model from your application.

B. Build and train a classification model with Spark MLlib to generate labels. Build and train a
second classification model with Spark MLlib to filter results to match customer preferences.
Deploy the models using Cloud Dataproc. Call the models from your application.

C. Build an application that calls the Cloud Video Intelligence API to generate labels. Store data
in Cloud Bigtable, and filter the predicted labels to match the user’s viewing history to generate
preferences.

(Correct)


D. Build an application that calls the Cloud Video Intelligence API to generate labels. Store data
in Cloud SQL, and join and filter the predicted labels to match the user’s viewing history to
generate preferences.
Explanation
Correct answer is C as Cloud Video Intelligence API provides an out of the box solution to
generate labels from videos. Storing data in Bigtable would provide low latency and very fast
filtering capability of TBs of data.
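A minimal label-detection sketch with the google-cloud-videointelligence client follows; the Cloud Storage URI is hypothetical, and storing the labels in Bigtable is left out.

from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()
operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.LABEL_DETECTION],
        "input_uri": "gs://my-bucket/videos/viewed_video.mp4",   # hypothetical video
    }
)
result = operation.result(timeout=300)

for annotation in result.annotation_results[0].segment_label_annotations:
    print(annotation.entity.description)   # label to persist (e.g. in Bigtable) for filtering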
Options A & B are wrong as building a model for label extraction is cumbersome as
compared to using already available Cloud Video Intelligence service.
Option D is wrong as Cloud SQL is not ideal for low latency access on TBs of data.
Question 24: Skipped
You are integrating one of your internal IT applications and Google BigQuery, so users
can query BigQuery from the application’s interface. You do not want individual users
to authenticate to BigQuery and you do not want to give them access to the dataset. You
need to securely access BigQuery from your IT application. What should you do?

A. Create groups for your users and give those groups access to the dataset

B. Integrate with a single sign-on (SSO) platform, and pass each user’s credentials along with the
query request

C. Create a service account and grant dataset access to that account. Use the service account’s
private key to access the dataset

(Correct)

D. Create a dummy user and grant dataset access to that user. Store the username and password
for that user in a file on the files system, and use those credentials to access the BigQuery dataset
Explanation
Correct answer is C as the application itself needs to access BigQuery, so it can be configured to
use a service account.
Refer GCP documentation - BigQuery Service Account File
A service account is a Google account that is associated with your GCP project. Use a service
account to access the BigQuery API if your application can run jobs associated with service
credentials rather than an end-user's credentials, such as a batch processing pipeline.
Manually create and obtain service account credentials to use BigQuery when an application
is deployed on-premises or to other public clouds. You can set the environment variable to
load the credentials using Application Default Credentials, or you can specify the path to
load the credentials manually in your application code.
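A short sketch of option C from the application's side (google-cloud-bigquery client; the key path and query are assumptions):

from google.cloud import bigquery

# The service account has been granted access to the dataset; individual users have not.
client = bigquery.Client.from_service_account_json("/secure/path/sa-key.json")

query_job = client.query("SELECT COUNT(*) AS row_count FROM `my-project.reports.orders`")
for row in query_job.result():
    print(row["row_count"])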
Options A, B & D are wrong as either they are not best practices or would provide users
access to the dataset.
Question 25: Skipped
You set up a streaming data insert into a Redis cluster via a Kafka cluster. Both clusters
are running on Compute Engine instances. You need to encrypt data at rest with
encryption keys that you can create, rotate, and destroy as needed. What should you
do?

A. Create a dedicated service account, and use encryption at rest to reference your data stored in
your Compute Engine cluster instances as part of your API service calls.

B. Create encryption keys in Cloud Key Management Service. Use those keys to encrypt your
data in all of the Compute Engine cluster instances.

(Correct)

C. Create encryption keys locally. Upload your encryption keys to Cloud Key Management
Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.

D. Create encryption keys in Cloud Key Management Service. Reference those keys in your API
service calls when accessing the data in your Compute Engine cluster instances.
Explanation
Correct answer is B as encryption keys in Cloud KMS can be used by Compute Engine to
encrypt data, and Cloud KMS provides the ability to create, rotate, and destroy keys as needed.
Refer GCP documentation - Compute Engine Encryption & Encryption at Rest
By default, Compute Engine encrypts customer content at rest. Compute Engine handles and
manages this encryption for you without any additional actions on your part. However, if you
want to control and manage this encryption yourself, you can use key encryption keys. Key
encryption keys do not directly encrypt your data but are used to encrypt the data encryption
keys that encrypt your data.
You have two options for key encryption keys in Compute Engine:
Use Cloud Key Management Service to create and manage key encryption keys. For more
information, see  Key management. This topic provides details about this option, known as
customer-managed encryption keys (CMEK).
Create and manage your own key encryption keys. For information about this option, known
as customer-supplied encryption keys (CSEK), see Encrypting Disks with Customer-Supplied
Encryption Keys.
After you create a Compute Engine resource that is protected by Cloud KMS, you do not
need to specify the key because Compute Engine knows which KMS key was used. This is
different from how Compute Engine accesses resources protected by customer-supplied keys.
For that access, you need to specify the customer-supplied key.
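For illustration, a minimal sketch (project, location, and key names are placeholder assumptions) of creating a key ring and a symmetric key with automatic rotation using the google-cloud-kms Python client; the resulting key can then be referenced as the customer-managed key for the cluster instances' disks:

import time
from google.cloud import kms

client = kms.KeyManagementServiceClient()

# Key rings group keys within a location (placeholder project and location).
location_name = client.common_location_path("my-project", "us-central1")
key_ring = client.create_key_ring(
    request={"parent": location_name, "key_ring_id": "cluster-keys", "key_ring": {}})

# Symmetric key that rotates automatically every 90 days.
crypto_key = client.create_crypto_key(
    request={
        "parent": key_ring.name,
        "crypto_key_id": "cluster-disk-key",
        "crypto_key": {
            "purpose": kms.CryptoKey.CryptoKeyPurpose.ENCRYPT_DECRYPT,
            "rotation_period": {"seconds": 60 * 60 * 24 * 90},
            "next_rotation_time": {"seconds": int(time.time()) + 60 * 60 * 24},
        },
    })
print(crypto_key.name)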
Option A is wrong as the default encryption provided by Compute Engine does not allow
creation, management and rotation.
Option C is wrong as CSEK does not need to be uploaded to Cloud KMS.
Option D is wrong as the approach does not encrypt the data.
Question 26: Skipped
You are selecting services to write and transform JSON messages from Cloud Pub/Sub
to BigQuery for a data pipeline on Google Cloud. You want to minimize service costs.
You also want to monitor and accommodate input data volume that will vary in size
with minimal manual intervention. What should you do?

A. Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster.
Resize the number of worker nodes in your cluster via the command line.

B. Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an
operational output archive. Locate the bottleneck and adjust cluster resources.

C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver.
Use the default autoscaling setting for worker instances.

(Correct)


D. Use Cloud Dataflow to run your transformations. Monitor the total execution time for a
sampling of jobs. Configure the job to use non-default Compute Engine machine types when
needed.
Explanation
Correct answer is C as Dataflow provides a cost-effective way to run the transformations on streaming data, and its autoscaling adjusts the number of workers to the input volume without manual intervention. Monitoring system lag with Stackdriver covers the streaming pipeline.
Refer GCP documentation - Dataflow Monitoring
With autoscaling enabled, the Cloud Dataflow service automatically chooses the appropriate
number of worker instances required to run your job. The Cloud Dataflow service may also
dynamically re-allocate more workers or fewer workers during runtime to account for the
characteristics of your job. Certain parts of your pipeline may be computationally heavier
than others, and the Cloud Dataflow service may automatically spin up additional workers
during these phases of your job (and shut them down when they're no longer needed).
Stackdriver provides powerful monitoring, logging, and diagnostics. Cloud Dataflow
integration with Stackdriver Monitoring allows you to access Cloud Dataflow job metrics
such as Job Status, Element Counts, System Lag (for streaming jobs), and User Counters
from the Stackdriver dashboards. You can also employ Stackdriver alerting capabilities to be
notified of a variety of conditions, such as long streaming system lag or failed jobs.
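A minimal sketch (topic, table, and schema names are placeholder assumptions) of such a streaming pipeline in the Apache Beam Python SDK; run on the Dataflow runner with default autoscaling, it needs no manual resizing:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder resources for illustration.
TOPIC = "projects/my-project/topics/json-events"
TABLE = "my-project:analytics.events"

options = PipelineOptions(streaming=True)  # submit with --runner=DataflowRunner for autoscaling

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic=TOPIC)
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Write" >> beam.io.WriteToBigQuery(
         TABLE,
         schema="user_id:STRING,event:STRING,ts:TIMESTAMP",
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))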
Options A & B are wrong as Dataproc is not a cost-effective fit here because the cluster must be provisioned and resized manually to follow the varying input volume.

Option D is wrong as using non-default Compute Engine machine types as needed would
need manual intervention.
Question 27: Skipped
Your startup has never implemented a formal security policy. Currently, everyone in
the company has access to the datasets stored in Google BigQuery. Teams have freedom
to use the service as they see fit, and they have not documented their use cases. You
have been asked to secure the data warehouse. You need to discover what everyone is
doing. What should you do first?

A. Use Google Stackdriver Audit Logs to review data access.

(Correct)

B. Get the identity and access management (IAM) policy of each table


C. Use Stackdriver Monitoring to see the usage of BigQuery query slots.

D. Use the Google Cloud Billing API to see what account the warehouse is being billed to.
Explanation
Correct answer is A as Stackdriver BigQuery Data Access audit logs can provide the
information what users are accessing what BigQuery datasets.
Refer GCP documentation - BigQuery Audit Logs
Cloud Audit Logs are a collection of logs provided by Google Cloud Platform that provide
insight into operational concerns related to your use of Google Cloud services. This page
provides details about BigQuery specific log information, and it demonstrates how to use
BigQuery to analyze logged activity.
Option B is wrong as IAM policy is not attached to the tables.
Option C is wrong as Stackdriver Monitoring only provides information on available and allocated query slots.
Option D is wrong as billing does not show which users are accessing which tables.
Question 28: Skipped
Your company uses a proprietary system to send inventory data every 6 hours to a data
ingestion service in the cloud. Transmitted data includes a payload of several fields and
the timestamp of the transmission. If there are any concerns about a transmission, the
system re-transmits the data. How should you deduplicate the data most efficiently?

A. Assign global unique identifiers (GUID) to each data entry.

(Correct)

B. Compute the hash value of each data entry, and compare it with all historical data.

C. Store each data entry as the primary key in a separate database and apply an index.

D. Maintain a database table to store the hash value and other metadata for each data entry.
Explanation
Correct answer is A as a globally unique identifier assigned when the entry is created allows duplicates to be detected when the message is re-transmitted.
Refer GCP documentation - Pub/Sub Duplicates
Cloud Pub/Sub assigns a unique `message_id` to each message, which can be used to detect
duplicate messages received by the subscriber. This will not, however, allow you to detect
duplicates resulting from multiple publish requests on the same data.
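As a simple illustration (field names are hypothetical), the sender attaches a GUID when the entry is created, and the ingestion side keeps only the first occurrence of each GUID:

import uuid

# Sender side: attach a globally unique identifier once, when the entry is created.
def build_entry(payload, transmitted_at):
    return {"guid": str(uuid.uuid4()), "payload": payload, "ts": transmitted_at}

# Ingestion side: drop any entry whose GUID has already been processed.
seen_guids = set()

def deduplicate(entries):
    for entry in entries:
        if entry["guid"] in seen_guids:
            continue  # a re-transmission of an entry we already have
        seen_guids.add(entry["guid"])
        yield entry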
Option B is wrong as the hash would include the transmission timestamp, so a re-transmitted entry would never match the hash of the original, and comparing against all historical data is expensive.
Options C & D are wrong as maintaining a separate database purely for deduplication would not be a cost-effective solution.
Question 29: Skipped
You are responsible for writing your company’s ETL pipelines to run on an Apache
Hadoop cluster. The pipeline will require some checkpointing and splitting pipelines.
Which method should you use to write the Pipelines?

A. PigLatin using Pig

(Correct)

B. HiveQL using Hive

C. Java using MapReduce

D. Python using MapReduce


Explanation
Correct answer is A as Pig Latin follows a procedural programming model, which makes it a natural fit for building data pipelines such as ETL jobs. It gives fine-grained control over how data flows through the pipeline and when to checkpoint data, supports DAG constructs such as SPLIT, and offers more control over optimization. Many Pig concepts, such as filtering and GROUP BY, come from SQL, although the syntax differs slightly.
Option B is wrong as Hive is declarative and SQL-like, better suited to ad-hoc analytical queries than to pipelines that need checkpointing and splitting.
Options C & D are wrong as Java and Python are just implementation languages, and plain MapReduce does not natively support checkpointing and splitting pipelines.
Question 30: Skipped
Your financial services company is moving to cloud technology and wants to store 50
TB of financial timeseries data in the cloud. This data is updated frequently and new
data will be streaming in all the time. Your company also wants to move their existing
Apache Hadoop jobs to the cloud to get insights into this data. Which product should
they use to store the data?

A. Cloud Bigtable

(Correct)

B. Google BigQuery

C. Google Cloud Storage

D. Google Cloud Datastore


Explanation
Correct answer is A as Bigtable is ideal for storing time-series data with frequent updates, and it integrates with existing Hadoop jobs through the HBase API.
Refer GCP documentation - Big data products
Cloud Bigtable  provides a massively scalable NoSQL database suitable for low-latency and
high-throughput workloads. It integrates easily with popular big-data tools like Hadoop and
Spark, and it supports the open-source, industry-standard HBase API. Cloud Bigtable is a
great choice for both operational and analytical applications, including IoT, user analytics,
and financial data analysis.
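For illustration, a minimal sketch (instance, table, column family, and values are placeholder assumptions) of writing a financial time-series point to Cloud Bigtable with the Python client; the row key combines the instrument and a timestamp so points for one instrument can be read as a contiguous range:

import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("timeseries-instance").table("ticks")

# Row key: instrument id plus timestamp keeps an instrument's points together.
ts = datetime.datetime.utcnow()
row_key = "EURUSD#{}".format(ts.strftime("%Y%m%d%H%M%S%f")).encode("utf-8")

row = table.direct_row(row_key)
row.set_cell("quote", b"bid", b"1.0841", timestamp=ts)
row.set_cell("quote", b"ask", b"1.0843", timestamp=ts)
row.commit()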
Option B is wrong as BigQuery is not suitable for data with frequent updates.
Option C is wrong as Cloud Storage is not ideal for time-series data with frequent updates.
Option D is wrong as Datastore is not ideal for analytics time-series workload.
Question 31: Skipped
Government regulations in your industry mandate that you have to maintain an
auditable record of access to certain types of data. Assuming that all expiring logs will
be archived correctly, where should you store data that is subject to that mandate?

A. Encrypted on Cloud Storage with user-supplied encryption keys. A separate decryption key
will be given to each authorized user.

B. In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log
used to provide the auditability.

C. In Cloud SQL, with separate database user names to each user. The Cloud SQL Admin
activity logs will be used to provide the auditability.


D. In a bucket on Cloud Storage that is accessible only by an App Engine service that collects
user information and logs the access before providing a link to the bucket.

(Correct)

Explanation
Correct answer is D as Cloud Storage is an ideal option for storing this data, and access can be controlled and recorded by an App Engine service that collects user information and logs each access before providing a link to the bucket, which yields the auditable record.
Option A is wrong as encryption helps protect data but does not capture who accessed it.
Options B & C are wrong as BigQuery and Cloud SQL are not ideal storage options for this kind of data.
Question 32: Skipped
Your company maintains a hybrid deployment with GCP, where analytics are
performed on your anonymized customer data. The data are imported to Cloud Storage
from your data center through parallel uploads to a data transfer server running on
GCP. Management informs you that the daily transfers take too long and have asked
you to fix the problem. You want to maximize transfer speeds. Which action should you
take?

A. Increase the CPU size on your server.

B. Increase the size of the Google Persistent Disk on your server.

C. Increase your network bandwidth from your datacenter to GCP.

(Correct)

D. Increase your network bandwidth from Compute Engine to Cloud Storage


Explanation
Correct answer is C as the network bandwidth between the data center and GCP needs to be increased to improve transfer speed, given that parallel uploads are already being performed.
Refer GCP documentation - Transferring Big Data sets to GCP
Increase network bandwidth
Methods to increase your network bandwidth depends on how you choose to connect to GCP.
You can connect to GCP in three main ways:
Public internet connection
Direct peering
Cloud Interconnect
Options A & B are wrong as they do not help increase transfer speeds.
Option D is wrong as you cannot increase network bandwidth from Compute Engine to Cloud Storage; that path already runs over Google's network, and Private Google Access can be used for transfers from Compute Engine to Cloud Storage over the internal network.
Question 33: Skipped
You are creating a model to predict housing prices. Due to budget constraints, you must
run it on a single resource-constrained virtual machine. Which learning algorithm
should you use?

A. Linear regression

(Correct)

B. Logistic classification

C. Recurrent neural network

D. Feedforward neural network


Explanation
Correct answer is A as linear regression can help predict housing prices and also run on a
single resource-constrained virtual machine.
Refer documentation - Machine learning
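A minimal sketch (with made-up features and prices) showing how lightweight a linear regression model is to train, comfortably within a single small VM; scikit-learn is used here purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: square footage, bedrooms, age of house (years).
X = np.array([[1400, 3, 20], [1600, 3, 15], [1700, 4, 10], [1875, 4, 5]])
y = np.array([245000.0, 312000.0, 279000.0, 308000.0])   # sale prices

model = LinearRegression().fit(X, y)      # closed-form least-squares fit, very cheap
print(model.predict([[1500, 3, 12]]))     # predicted price for a new listing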
Option B is wrong as housing price is a continuous value to be predicted; logistic classification predicts discrete classes.
Options C & D are wrong as neural networks are resource-intensive and would not run well on a single resource-constrained virtual machine.
Question 34: Skipped
You are designing a basket abandonment system for an ecommerce company. The
system will send a message to a user based on these rules:
A. No interaction by the user on the site for 1 hour
B. Has added more than $30 worth of products to the basket
C. Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be
sent. How should you design the pipeline?

A. Use a fixed-time window with a duration of 60 minutes.

B. Use a sliding time window with a duration of 60 minutes.

C. Use a session window with a gap time duration of 60 minutes.

(Correct)

D. Use a global window with a time based trigger with a delay of 60 minutes.
Explanation
Correct answer is C as the key here is to track user inactivity for an hour. Session windows
can be easily used to track the activity and trigger events based on the conditions.
Refer Beam documentation - Windowing
A session window function defines windows that contain elements that are within a certain
gap duration of another element. Session windowing applies on a per-key basis and is useful
for data that is irregularly distributed with respect to time. For example, a data stream
representing user mouse activity may have long periods of idle time interspersed with high
concentrations of clicks. If data arrives after the minimum specified gap duration time, this
initiates the start of a new window.
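A minimal runnable sketch (user IDs, basket values, and timestamps are made up) of applying a 60-minute session window in the Beam Python SDK:

import apache_beam as beam
from apache_beam import window

def to_timestamped(event):
    # event = (user_id, basket_value, unix_ts); key by user so sessions are per user.
    user_id, basket_value, unix_ts = event
    return window.TimestampedValue((user_id, basket_value), unix_ts)

with beam.Pipeline() as p:
    (p
     | beam.Create([("u1", 35.0, 1000), ("u1", 5.0, 1200), ("u2", 50.0, 9000)])
     | beam.Map(to_timestamped)
     | beam.WindowInto(window.Sessions(60 * 60))  # new window after 60 minutes of inactivity
     | beam.GroupByKey()
     | beam.Map(print))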

Options A, B & D are wrong as they would not be able to track and reset the window based
on user activity.
Question 35: Skipped
By default, which of the following windowing behavior does Dataflow apply to
unbounded data sets?

A. Windows at every 100 MB of data.


B. Single, Global Window.

(Correct)

C. Windows at every 1 minute.

D. Windows at every 10 minutes.


Explanation
Correct answer is B as Dataflow, based on Apache Beam, by default applies a single, global
window to unbounded datasets.
Refer Beam documentation - Windowing
Beam’s default windowing behavior is to assign all elements of a  PCollection  to a single,
global window and discard late data, even for unbounded  PCollection s. Before you use a
grouping transform such as  GroupByKey  on an unbounded  PCollection , you must do at
least one of the following:
Set a non-global windowing function.
Set a non-default  trigger. This allows the global window to emit results under other
conditions, since the default windowing behavior (waiting for all data to arrive) will never
occur.
If you don’t set a non-global windowing function or a non-default trigger for your
unbounded  PCollection   and subsequently use a grouping transform such
as  GroupByKey  or  Combine , your pipeline will generate an error upon construction and
your job will fail.
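A hedged sketch (subscription name is a placeholder) of keeping the global window on an unbounded PCollection but adding a non-default trigger so grouping becomes legal:

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger
from apache_beam.options.pipeline_options import PipelineOptions

SUBSCRIPTION = "projects/my-project/subscriptions/events-sub"   # placeholder

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
     | "GlobalWithTrigger" >> beam.WindowInto(
         window.GlobalWindows(),                               # keep the single global window
         trigger=trigger.Repeatedly(trigger.AfterCount(100)),  # but emit a pane every 100 elements
         accumulation_mode=trigger.AccumulationMode.DISCARDING)
     | "KeyBySize" >> beam.Map(lambda msg: (None, len(msg)))
     | "Group" >> beam.GroupByKey()                            # allowed because a trigger is set
     | "Print" >> beam.Map(print))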
Question 36: Skipped
You are a retailer that wants to integrate your online sales capabilities with different in-
home assistants, such as Google Home. You need to interpret customer voice commands
and issue an order to the backend systems. Which solutions should you choose?

A. Cloud Speech-to-Text API

B. Cloud Natural Language API


C. Dialogflow Enterprise Edition

(Correct)

D. Cloud AutoML Natural Language


Explanation
Correct answer is C as Dialogflow Enterprise Edition would provide an ideal solutionas the
key requirement is to interpret voice commands and fire events.
Refer GCP documentation - AI Products
Dialogflow is an end-to-end, build-once deploy-everywhere development suite for creating
conversational interfaces for websites, mobile applications, popular messaging platforms,
and IoT devices. You can use it to build interfaces (such as chatbots and conversational IVR)
that enable natural and rich interactions between your users and your business. Dialogflow
Enterprise Edition users have access to Google Cloud Support and a service level agreement
(SLA) for production deployments.
You can expand your conversational interface to recognize voice interactions and generate a
voice response, all with a single API call. Powered by Google Cloud Speech-to-
Text and Cloud Text-to-Speech, it supports real-time streaming and synchronous modes.
Option A is wrong as Cloud Speech-to-Text API just provides speech-to-text conversion
powered by ML.
Option B is wrong as Cloud Natural Language API only helps derive insights from unstructured text; it does not build conversational interfaces.
Option D is wrong as AutoML Natural Language only helps reveal the structure and meaning of text through custom machine learning models; it does not handle voice interactions or order fulfillment flows.
Question 37: Skipped
You are choosing a NoSQL database to handle telemetry data submitted from millions
of Internet-of- Things (IoT) devices. The volume of data is growing at 100 TB per year,
and each data entry has about 100 attributes. The data processing pipeline does not
require atomicity, consistency, isolation, and durability (ACID). However, high
availability and low latency are required. You need to analyze the data by querying
against individual fields. Which three databases meet your requirements? (Choose
three.)

A. Redis

B. HBase

(Correct)

C. MySQL

D. MongoDB

(Correct)

E. Cassandra

(Correct)

F. HDFS with Hive


Explanation
Correct answers are B, D & E as HBase, MongoDB and Cassandra are NoSQL databases that provide low-latency access to the data, scale horizontally, and can be made highly available.
Option A is wrong as Redis is more of a caching engine.
Option C is wrong as MySQL is a relational database and would not scale.
Option F is wrong as HDFS with Hive is better suited to batch jobs and does not provide low-latency access to the data.
Question 38: Skipped
You need to migrate a 2TB relational database to Google Cloud Platform. You do not
have the resources to significantly refactor the application that uses this database and
cost to operate is of primary concern. Which service do you select for storing and
serving your data?


A. Cloud Spanner

B. Cloud Bigtable

C. Cloud Firestore

D. Cloud SQL

(Correct)

Explanation
Correct answer is D as Cloud SQL provides a managed relational database that can host a 2TB database without refactoring the application and at a low operating cost.
Refer GCP documentation - Databases & Migrating from MySQL to Cloud Spanner
Option A is wrong as, although Cloud Spanner provides relational database capability, the migration is not seamless and the application would likely need modification; it is also considerably more expensive to operate than Cloud SQL.
Cloud Spanner uses certain concepts differently from other enterprise database management
tools, so you might need to adjust your application's architecture to take full advantage of its
capabilities. You might also need to supplement Cloud Spanner with other services from
Google Cloud Platform (GCP) to meet your needs.
Options B & C are wrong as Bigtable and Firestore are NoSQL, non-relational database types and would require modification of the application.
Question 39: Skipped
Your company is loading comma-separated values (CSV) files into Google BigQuery.
The data is fully imported successfully; however, the imported data is not matching
byte-to-byte to the source file. What is the most likely cause of this problem?

A. The CSV data loaded in BigQuery is not flagged as CSV.

B. The CSV data has invalid rows that were skipped on import.


C. The CSV data loaded in BigQuery is not using BigQuery’s default encoding.

(Correct)

D. The CSV data has not gone through an ETL phase before loading into BigQuery.
Explanation
Correct answer is C as the data imported successfully, so the mismatch is most likely due to the CSV file using a different encoding than BigQuery's default of UTF-8.
Refer GCP documentation - BigQuery Load CSV
CSV encoding
BigQuery expects CSV data to be UTF-8 encoded. If you have CSV files with data encoded in
ISO-8859-1 (also known as Latin-1) format, you should explicitly specify the encoding when
you load your data so it can be converted to UTF-8.
Delimiters in CSV files can be any ISO-8859-1 single-byte character. To use a character in
the range 128-255, you must encode the character as UTF-8. BigQuery converts the string to
ISO-8859-1 encoding and uses the first byte of the encoded string to split the data in its raw,
binary state.
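A minimal sketch (bucket, file, and table names are placeholders) of explicitly declaring ISO-8859-1 encoding when loading a CSV with the BigQuery Python client, so the data is converted to UTF-8 on load:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    encoding="ISO-8859-1",     # tell BigQuery the file is Latin-1 so it converts it to UTF-8
    skip_leading_rows=1,
    autodetect=True,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/data.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()   # wait for the load to complete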
Question 40: Skipped
You are managing a Cloud Dataproc cluster. You need to make a job run faster while
minimizing costs, without losing work in progress on your clusters. What should you
do?

A. Increase the cluster size with more non-preemptible workers.

B. Increase the cluster size with preemptible worker nodes, and configure them to forcefully
decommission.

C. Increase the cluster size with preemptible worker nodes, and use Cloud Stackdriver to trigger a
script to preserve work.

D. Increase the cluster size with preemptible worker nodes, and configure them to use graceful
decommissioning.
(Correct)

Explanation
Correct answer is D as the Dataproc cluster can be scaled out with preemptible worker nodes to run the job faster at lower cost, and graceful decommissioning prevents losing work in progress when those nodes are removed.
Refer GCP documentation - Dataproc Scaling Clusters
After creating a Cloud Dataproc cluster, you can adjust ("scale") the cluster by increasing or
decreasing the number of primary or secondary worker nodes in the cluster. You can scale a
Cloud Dataproc cluster at any time, even when jobs are running on the cluster.
Why scale a Cloud Dataproc cluster?
to increase the number of workers to make a job run faster
to decrease the number of workers to save money (see Graceful Decommissioning as an
option to use when downsizing a cluster to avoid losing work in progress).
to increase the number of nodes to expand available Hadoop Distributed Filesystem (HDFS)
storage
Because clusters can be scaled more than once, you might want to increase/decrease the
cluster size at one time, and then decrease/increase the size later.
When you downscale a cluster, work in progress may terminate before completion. If you are
using Cloud Dataproc v 1.2 or later, you can use Graceful Decommissioning, which
incorporates Graceful Decommission of YARN Nodes to finish work in progress on a worker
before it is removed from the Cloud Dataproc cluster.
Option A is wrong as non-preemptible workers would increase cost.
Option B & C are wrong as the approaches would lead to losing in-progress work.
Question 41: Skipped
You have Cloud Functions written in Node.js that pull messages from Cloud Pub/Sub
and send the data to BigQuery. You observe that the message processing rate on the
Pub/Sub topic is orders of magnitude higher than anticipated, but there is no error
logged in Stackdriver Log Viewer. What are the two most likely causes of this problem?
Choose 2 answers.

A. Publisher throughput quota is too small.

B. Total outstanding messages exceed the 10-MB maximum.

C. Error handling in the subscriber code is not handling run-time errors properly.
(Correct)

D. The subscriber code cannot keep up with the messages.

E. The subscriber code does not acknowledge the messages that it pulls.

(Correct)

Explanation
Correct answers are C & E. Since the processing rate is far higher than anticipated with no errors logged, the most likely cause is that messages are being redelivered, either because the subscriber never acknowledges them or because run-time errors are swallowed before the acknowledgement happens.
Refer GCP documentation - Pub/Sub Troubleshooting
Dealing with duplicates and forcing retries - When you do not acknowledge a message
before its acknowledgement deadline has expired, Cloud Pub/Sub resends the message. As a
result, Cloud Pub/Sub can send duplicate messages. Use Stackdriver to monitor acknowledge
operations with the  expired  response code to detect this condition. To get this data, select
the Acknowledge message operations metric, then group or filter it by
the  response_code  label. Note that  response_code  is a system label on a metric - it is not
a metric.
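A hedged sketch (subscription path and processing function are placeholders) of a subscriber callback that handles run-time errors and acknowledges only messages it processed successfully, avoiding the redelivery loop described above:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "events-sub")

def process(data):
    # Placeholder for the real transformation / BigQuery insert.
    pass

def callback(message):
    try:
        process(message.data)
        message.ack()       # acknowledge only after successful processing
    except Exception:
        # Handle the failure explicitly: nack() for redelivery, or ack() and
        # route the payload to a dead-letter topic; never let it escape silently.
        message.nack()

future = subscriber.subscribe(subscription_path, callback=callback)
future.result()   # block while messages are pulled and handled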
Options A & D are wrong as the Cloud Function is processing more, not fewer, messages than anticipated and no errors are logged.
Option B is wrong as exceeding that limit would surface as errors.
Question 42: Skipped
You need to copy millions of sensitive patient records from a relational database to
BigQuery. The total size of the database is 10 TB. You need to design a solution that is
secure and time-efficient. What should you do?

A. Export the records from the database as an Avro file. Upload the file to GCS using gsutil, and
then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.

B. Export the records from the database as an Avro file. Copy the file onto a Transfer Appliance
and send it to Google, and then load the Avro file into BigQuery using the BigQuery web UI in
the GCP Console.
(Correct)

C. Export the records from the database into a CSV file. Create a public URL for the CSV file,
and then use Storage Transfer Service to move the file to Cloud Storage. Load the CSV file into
BigQuery using the BigQuery web UI in the GCP Console.

D. Export the records from the database as an Avro file. Create a public URL for the Avro file,
and then use Storage Transfer Service to move the file to Cloud Storage. Load the Avro file into
BigQuery using the BigQuery web UI in the GCP Console.
Explanation
Correct answer is B as exporting the records as an Avro file gives a compact, compressed, self-describing format. Using a Transfer Appliance to move the data from on-premises to Cloud Storage is both secure and time-efficient for 10 TB, and the Avro file can then be loaded through the BigQuery web UI.
Refer GCP documentation - Transfer Appliance & BigQuery Avro
Transfer Appliance is a high-capacity storage device that enables you to transfer and
securely ship your data to a Google upload facility, where we upload your data to Google
Cloud Storage.
Avro is the preferred format for loading data into BigQuery. Loading Avro files has the following advantages over CSV and JSON (newline delimited). The Avro binary format:
Is faster to load. The data can be read in parallel, even if the data blocks are compressed.
Doesn't require typing or serialization.
Is easier to parse because there are no encoding issues found in other formats such as ASCII.
When you load Avro files into BigQuery, the table schema is automatically retrieved from the
self-describing source data.
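Once the Avro files have been rehydrated into Cloud Storage from the Transfer Appliance, the load itself is a single job; a minimal sketch with placeholder URIs and table names, relying on Avro's self-describing schema:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

load_job = client.load_table_from_uri(
    "gs://my-transfer-bucket/patients/*.avro",   # placeholder source location
    "my-project.clinical.patient_records",       # placeholder destination table
    job_config=job_config,
)
load_job.result()
print(client.get_table("my-project.clinical.patient_records").num_rows)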
Options A, C & D are wrong as they would all transfer the data over the public internet to Cloud Storage, which is neither time-efficient nor secure for 10 TB of sensitive records.
Question 43: Skipped
Your team is responsible for developing and maintaining ETLs in your company. One
of your Dataflow jobs is failing because of some errors in the input data, and you need
to improve reliability of the pipeline (incl. being able to reprocess all failing data). What
should you do?

A. Add a filtering step to skip these types of errors in the future, extract erroneous rows from
logs.


B. Add a  try… catch  block to your  DoFn  that transforms the data, extract erroneous rows from
logs.

C. Add a  try… catch  block to your  DoFn  that transforms the data, write erroneous rows to
PubSub directly from the  DoFn .

D. Add a  try… catch  block to your  DoFn  that transforms the data, use a sideOutput to create a
PCollection that can be stored to PubSub later.

(Correct)

Explanation
Correct answer is D as the reliability of the Dataflow pipeline can be improved by catching errors in a try... catch block and using a sideOutput to write the failing records to a Pub/Sub topic that acts as a dead-letter queue for later reprocessing.
Refer GCP documentation - Dataflow Handling Input Errors
If the failure is within the processing code of a  DoFn , one way to handle this is to catch the
exception, log an error, and then drop the input. The rest of the elements in the pipeline will
be processed successfully, so progress can be made as normal. But just logging the elements
isn’t ideal because it doesn’t provide an easy way to see these malformed inputs and
reprocess them later.
A better way to solve this would be to have a dead letter file where all of the failing inputs
are written for later analysis and reprocessing. We can use a side output in Dataflow to
accomplish this goal. For example:
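(A sketch with the Beam Python SDK; the transformation, input PCollection, and dead-letter topic below are placeholder assumptions rather than the documentation's original sample.)

import json
import apache_beam as beam
from apache_beam import pvalue

class TransformWithDeadLetter(beam.DoFn):
    DEAD_LETTER_TAG = "dead_letter"

    def process(self, element):
        try:
            yield transform(element)   # hypothetical transformation that may raise
        except Exception:
            # Route the raw failing input to the side output instead of failing the pipeline.
            yield pvalue.TaggedOutput(self.DEAD_LETTER_TAG, element)

results = (
    input_rows   # assumed upstream PCollection
    | "Transform" >> beam.ParDo(TransformWithDeadLetter()).with_outputs(
        TransformWithDeadLetter.DEAD_LETTER_TAG, main="main"))

good, failed = results.main, results.dead_letter

# Failing rows go to Pub/Sub (the dead-letter queue) for analysis and replay.
_ = (failed
     | "ToBytes" >> beam.Map(lambda e: json.dumps(e).encode("utf-8"))
     | "ToDeadLetter" >> beam.io.WriteToPubSub("projects/my-project/topics/etl-dead-letter"))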
Question 44: Skipped
You have historical data covering the last three years in BigQuery and a data pipeline
that delivers new data to BigQuery daily. You have noticed that when the Data Science
team runs a query filtered on a date column and limited to 30–90 days of data, the
query scans the entire table. You also noticed that your bill is increasing more quickly
than you expected. You want to resolve the issue as cost-effectively as possible while
maintaining the ability to conduct SQL queries. What should you do?

A. Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP
or DATE Type.

(Correct)

B. Recommend that the Data Science team export the table to a CSV file on Cloud Storage and
use Cloud Datalab to explore the data by reading the files directly.

C. Modify your pipeline to maintain the last 30–90 days of data in one table and the longer
history in a different table to minimize full table scans over the entire history.


D. Write an Apache Beam pipeline that creates a BigQuery table per day. Recommend that the
Data Science team use wildcards on the table name suffixes to select the data they need.
Explanation
Correct answer is A as the table can be partitioned by TIMESTAMP or DATE. This would
limit the number of records queried based on the predicate filters.
Refer GCP documentation - BigQuery Partitioned Tables
BigQuery also allows partitioned tables. Partitioned tables allow you to bind the partitioning
scheme to a specific TIMESTAMP  or  DATE  column. Data written to a partitioned table is
automatically delivered to the appropriate partition based on the date value (expressed in
UTC) in the partitioning column.
Partitioning versus sharding
As an alternative to partitioned tables, you can shard tables using a time-based naming
approach such as  [PREFIX]_YYYYMMDD . This is referred to as creating date-sharded tables.
Using either standard SQL or legacy SQL, you can specify a query with a  UNION  operator to
limit the tables scanned by the query.
Partitioned tables perform better than tables sharded by date. When you create date-named
tables, BigQuery must maintain a copy of the schema and metadata for each date-named
table. Also, when date-named tables are used, BigQuery might be required to verify
permissions for each queried table. This practice also adds to query overhead and impacts
query performance. The recommended best practice is to use partitioned tables instead of
date-sharded tables.
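A hedged sketch (project, dataset, table, and column names are assumptions) of the DDL re-creation issued through the Python client; queries that filter on the date column then scan only matching partitions:

from google.cloud import bigquery

client = bigquery.Client()

# Re-create the table partitioned on the event date (all names are placeholders).
ddl = """
CREATE TABLE `my-project.analytics.events_partitioned`
PARTITION BY DATE(event_ts) AS
SELECT * FROM `my-project.analytics.events`
"""
client.query(ddl).result()

# A date-filtered query now prunes partitions instead of scanning three years of data.
sql = """
SELECT COUNT(*) AS cnt
FROM `my-project.analytics.events_partitioned`
WHERE DATE(event_ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
"""
print(list(client.query(sql).result()))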
Option B is wrong as exporting the data to CSV is a cumbersome approach and takes away the SQL querying capability.
Option C is wrong as keeping a separate 30-90 day table would work, but it is not cost-effective: queries over the longer history still scan that entire table, and maintaining two tables adds pipeline overhead.
Option D is wrong as although sharding is a valid option, partitioning is preferred over
sharding.
Question 45: Skipped
You launched a new gaming app almost three years ago. You have been uploading log
files from the previous day to a separate Google BigQuery table with the table name
format  LOGS_yyyymmdd . You have been using table wildcard functions to generate daily
and monthly reports for all time ranges. Recently, you discovered that some queries
that cover long date ranges are exceeding the limit of 1,000 tables and failing. How can
you resolve this issue?

A. Convert all daily log tables into date-partitioned tables


B. Convert the sharded tables into a single partitioned table

(Correct)

C. Enable query caching so you can cache data from previous months

D. Create separate views to cover each month, and query from these views
Explanation
Correct answer is B as Google Cloud recommends using partitioned tables instead of sharded
tables, which would help query a single table and improve performance.
Refer GCP documentation - BigQuery Partitioned Tables
BigQuery also allows partitioned tables. Partitioned tables allow you to bind the partitioning
scheme to a specific TIMESTAMP  or  DATE  column. Data written to a partitioned table is
automatically delivered to the appropriate partition based on the date value (expressed in
UTC) in the partitioning column.
Partitioning versus sharding
As an alternative to partitioned tables, you can shard tables using a time-based naming
approach such as  [PREFIX]_YYYYMMDD . This is referred to as creating date-sharded tables.
Using either standard SQL or legacy SQL, you can specify a query with a  UNION  operator to
limit the tables scanned by the query.
Partitioned tables perform better than tables sharded by date. When you create date-named
tables, BigQuery must maintain a copy of the schema and metadata for each date-named
table. Also, when date-named tables are used, BigQuery might be required to verify
permissions for each queried table. This practice also adds to query overhead and impacts
query performance. The recommended best practice is to use partitioned tables instead of
date-sharded tables.
Option A is wrong as converting each daily table into its own date-partitioned table still leaves thousands of tables, so wildcard queries over long ranges would keep hitting the 1,000-table limit.
Option C is wrong as query caching does not work for wildcard queries
Currently, cached results are not supported for queries against multiple tables using a
wildcard even if the Use Cached Results option is checked. If you run the same wildcard
query multiple times, you are billed for each query
Option D is wrong as the daily reports would still fail.
Question 46: Skipped
A shipping company has live package-tracking data that is sent to an Apache Kafka
stream in real time. This is then loaded into BigQuery. Analysts in your company want
to query the tracking data in BigQuery to analyze geospatial trends in the lifecycle of a
package. The table was originally created with ingest-date partitioning. Over time, the
query processing time has increased. You need to implement a change that would
improve query performance in BigQuery. What should you do?

A. Implement clustering in BigQuery on the ingest date column.

B. Implement clustering in BigQuery on the package-tracking ID column.

(Correct)

C. Tier older data onto Cloud Storage files, and leverage extended tables.

D. Re-create the table using data partitioning on the package delivery date.
Explanation
Correct answer is B as the table is already partitioned and the analysts filter by package, so clustering on the package-tracking ID column improves query performance.
Refer GCP documentation - BigQuery Cluster Tables
When you create a clustered table in BigQuery, the table data is automatically organized
based on the contents of one or more columns in the table’s schema. The columns you specify
are used to colocate related data. When you cluster a table using multiple columns, the order
of columns you specify is important. The order of the specified columns determines the sort
order of the data.
Clustering can improve the performance of certain types of queries such as queries that use
filter clauses and queries that aggregate data. When data is written to a clustered table by a
query job or a load job, BigQuery sorts the data using the values in the clustering columns.
These values are used to organize the data into multiple blocks in BigQuery storage. When
you submit a query containing a clause that filters data based on the clustering columns,
BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the
clustering columns, performance is improved because the sorted blocks colocate rows with
similar values.
When to use clustering
Currently, BigQuery supports clustering over a partitioned table. Use clustering over a
partitioned table when:
Your data is already partitioned on a date or timestamp column.
You commonly use filters or aggregation against particular columns in your queries.
Table clustering is supported for both ingestion time partitioned tables and for
tables partitioned on a  DATE  or  TIMESTAMP  column. Currently, clustering is not supported
for non-partitioned tables.
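A hedged sketch (project, dataset, table, and column names are assumptions, and an explicit ingest timestamp column is used for simplicity) of re-creating the table with clustering on the tracking ID and querying it with the Python client:

from google.cloud import bigquery

client = bigquery.Client()

# Clustering requires re-creating (or copying into) a clustered table.
ddl = """
CREATE TABLE `my-project.logistics.tracking_clustered`
PARTITION BY DATE(ingest_ts)
CLUSTER BY tracking_id AS
SELECT * FROM `my-project.logistics.tracking`
"""
client.query(ddl).result()

# Filtering on tracking_id now reads only the relevant sorted blocks.
sql = "SELECT * FROM `my-project.logistics.tracking_clustered` WHERE tracking_id = @tid"
job = client.query(sql, job_config=bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("tid", "STRING", "PKG-12345")]))
print(list(job.result()))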
Option A is wrong as clustering needs to be on the column queried, which is the package
identifier.
Option C is wrong as tiering older data to Cloud Storage and querying it through external tables reduces performance; keeping the data in native BigQuery storage is recommended.
Option D is wrong as partitioning on package delivery date would not improve the
performance for queries for a package.
Question 47: Skipped
You are deploying MariaDB SQL databases on GCE VM Instances and need to
configure monitoring and alerting. You want to collect metrics including network
connections, disk IO and replication status from MariaDB with minimal development
effort and use StackDriver for dashboards and alerts. What should you do?

A. Install the OpenCensus Agent and create a custom metric collection application with a
StackDriver exporter.

B. Place the MariaDB instances in an Instance Group with a Health Check.

C. Install the StackDriver Logging Agent and configure fluentd in_tail plugin to read MariaDB
logs.

D. Install the StackDriver Agent and configure the MySQL plugin.

(Correct)

Explanation
Correct answer is D as MariaDB is a drop-in replacement for MySQL, so the Stackdriver agent's MySQL plugin can be used to capture network connections, disk IO, and replication status for monitoring and alerting with minimal effort.
Option A is wrong as the approach does not have minimal development effort.
Option B is wrong as placing in an Instance group with health check does not provide
metrics.
Option C is wrong as Stackdriver Logging agent would only capture MariaDB logs.
Question 48: Skipped
You need to set access to BigQuery for different departments within your company.
Your solution should comply with the following requirements:
Each department should have access only to their data.
Each department will have one or more leads who need to be able to create and update
tables and provide them to their team.
Each department has data analysts who need to be able to query but not modify data.
How should you set access to the data in BigQuery?

A. Create a dataset for each department. Assign the department leads the role of OWNER, and
assign the data analysts the role of WRITER on their dataset.

B. Create a dataset for each department. Assign the department leads the role of WRITER, and
assign the data analysts the role of READER on their dataset.

(Correct)

C. Create a table for each department. Assign the department leads the role of Owner, and assign
the data analysts the role of Editor on the project the table is in.

D. Create a table for each department. Assign the department leads the role of Editor, and assign
the data analysts the role of Viewer on the project the table is in.
Explanation
Correct answer is B. Each department needs its own dataset, because BigQuery access control is applied at the dataset level, not on individual tables. Data analysts should be given the READER role so they can query but not modify data, and department leads should be given the WRITER role so they can create and update tables for their team.
Refer GCP documentation - BigQuery Access Control
READER: Can read, query, copy or export tables in the dataset. Can read routines in the dataset. Can call get on the dataset. Can call get and list on tables in the dataset. Can call get and list on routines in the dataset. Can call list on table data for tables in the dataset. Maps to the bigquery.dataViewer predefined role.
WRITER: Same as READER, plus: Can edit or append data in the dataset. Can call insert, insertAll, update or delete on tables. Can use tables in the dataset as destinations for load, copy or query jobs. Can call insert, update, or delete on routines. Maps to the bigquery.dataEditor predefined role.
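A minimal sketch (emails and dataset name are placeholders) of assigning these dataset-level roles with the BigQuery Python client:

from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-project.marketing_dept")   # placeholder dataset

entries = list(dataset.access_entries)
# Department lead: WRITER, can create and update tables in the dataset.
entries.append(bigquery.AccessEntry(role="WRITER", entity_type="userByEmail",
                                    entity_id="lead@example.com"))
# Analysts group: READER, can query but not modify data.
entries.append(bigquery.AccessEntry(role="READER", entity_type="groupByEmail",
                                    entity_id="marketing-analysts@example.com"))

dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])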

Option A is wrong as WRITER access to data analysts would enable them to modify the data.
Options C & D are wrong as BigQuery access control works at the dataset level only.
Question 49: Skipped
You have developed three data processing jobs. One executes a Cloud Dataflow pipeline
that transforms data uploaded to Cloud Storage and writes results to BigQuery. The
second ingests data from on-premises servers and uploads it to Cloud Storage. The
third is a Cloud Dataflow pipeline that gets information from third-party data
providers and uploads the information to Cloud Storage. You need to be able to
schedule and monitor the execution of these three workflows and manually execute
them when needed. What should you do?

A. Create a Direct Acyclic Graph in Cloud Composer to schedule and monitor the jobs.

(Correct)

B. Use Stackdriver Monitoring and set up an alert with a Webhook notification to trigger the jobs.

C. Develop an App Engine application to schedule and request the status of the jobs using GCP
API calls.

D. Set up cron jobs in a Compute Engine instance to schedule and monitor the pipelines using
GCP API calls.
Explanation
Correct answer is A as Cloud Composer allows you to schedule and monitor the three workflows as a DAG, and to trigger them manually when needed.
Refer GCP documentation - Cloud Composer
Cloud Composer is a fully managed workflow orchestration service that empowers you to
author, schedule, and monitor pipelines that span across clouds and on-premises data
centers. Built on the popular Apache Airflow open source project and operated using the
Python programming language, Cloud Composer is free from lock-in and easy to use.
Cloud Composer pipelines are configured as directed acyclic graphs (DAGs) using Python,
making it easy for users of any experience level to author and schedule a workflow
Cloud Composer is deeply integrated within the Google Cloud Platform, giving users the
ability to orchestrate their full pipeline. Cloud Composer has robust, built-in integration with
many products, including Google BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud
Datastore, Cloud Storage, Cloud Pub/Sub, and Cloud ML Engine.
Cloud Composer gives you the ability to connect your pipeline through a single orchestration
tool whether your workflow lives on-premises, in multiple clouds, or fully within GCP. The
ability to author, schedule, and monitor your workflows in a unified manner means you can
break down the silos in your environment and focus less on infrastructure.
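A hedged sketch of the DAG skeleton (operator choices, IDs, and commands are assumptions; in practice the Dataflow and transfer steps would use the corresponding Google provider operators). The schedule plus the Airflow UI provide scheduling, monitoring, and manual triggering:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="three_ingestion_workflows",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",        # also triggerable manually from the Airflow UI/CLI
    catchup=False,
    default_args=default_args,
) as dag:
    ingest_on_prem = BashOperator(
        task_id="upload_on_prem_to_gcs",
        bash_command="echo 'placeholder: copy on-prem data to Cloud Storage'",
    )
    third_party_to_gcs = BashOperator(
        task_id="dataflow_third_party_to_gcs",
        bash_command="echo 'placeholder: launch Dataflow third-party ingestion job'",
    )
    gcs_to_bigquery = BashOperator(
        task_id="dataflow_gcs_to_bigquery",
        bash_command="echo 'placeholder: launch Dataflow transform-to-BigQuery job'",
    )

    # The transform job runs after both upload jobs have landed data in Cloud Storage.
    [ingest_on_prem, third_party_to_gcs] >> gcs_to_bigquery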
Options B, C & D are wrong as they do not satisfy all the requirements.
Question 50: Skipped
You are a head of BI at a large enterprise company with multiple business units that
each have different priorities and budgets. You use on-demand pricing for BigQuery
with a quota of 2K concurrent on-demand slots per project. Users at your organization
sometimes don’t get slots to execute their query and you need to correct this. You’d like
to avoid introducing new projects to your account. What should you do?

A. Convert your batch BQ queries into interactive BQ queries.

B. Create an additional project to overcome the 2K on-demand per-project quota.

C. Switch to flat-rate pricing and establish a hierarchical priority model for your projects.

(Correct)

D. Increase the amount of concurrent slots per project at the Quotas page at the Cloud Console.
Explanation
Correct answer is C as flat-rate pricing should be considered when more than 2,000 slots are needed. Flat-rate pricing provides dedicated slot capacity with predictable and consistent month-to-month costs, and the purchased slots can be distributed across projects in a hierarchical priority model.
Refer GCP documentation - BigQuery Slots
Maximum concurrent slots per project for on-demand pricing — 2,000
The default number of slots for on-demand queries is shared among all queries in a single
project. As a rule, if you're processing less than 100 GB of queries at once, you're unlikely to
be using all 2,000 slots.
To check how many slots you're using, see Monitoring BigQuery using Stackdriver. If you
need more than 2,000 slots, contact your sales representative to discuss whether flat-rate
pricing meets your needs.
BigQuery offers flat-rate pricing for customers who prefer a stable monthly cost for queries
rather than paying the on-demand price per TB of data processed.
When you enroll in flat-rate pricing, you purchase dedicated query processing capacity
which is measured in BigQuery slots. The cost of all bytes processed is included in the
monthly flat-rate price. If your queries exceed your flat-rate capacity, your queries will run
proportionally more slowly until more of your flat-rate resources become available.
Option A is wrong as the concurrent slot limit applies to both batch and interactive queries.
Option B is wrong as it does not meet the requirement of avoiding introducing new projects
to the account.
Option D is wrong as you cannot increase the amount of concurrent slots per project beyond
2000.
Question 51: Skipped
You are using Google BigQuery as your data warehouse. Your users report that the
following simple query is running very slowly, no matter when they run the query:

SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country

You check the query plan for the query and see the following output in the Read section
of Stage:1:

What is the most likely cause of the delay for this query?

A. Users are running too many concurrent queries in the system

B. The  [myproject:mydataset.mytable]  table has too many partitions


C. Either the state or the city columns in the  [myproject:mydataset.mytable]  table have too
many NULL values

D. Most rows in the  [myproject:mydataset.mytable]  table have the same value in the country
column, causing data skew

(Correct)

Explanation
Correct answer is D as the query plan shows the average time workers spent reading input data alongside the time taken by the slowest worker. A large gap between the two typically indicates data skew, such as most rows sharing the same value in the country column.
Refer GCP documentation - BigQuery Query Plan Execution
The query stages also provide stage timing classifications, in both relative and absolute
form. As each stage of execution represents work undertaken by one or more independent
workers, information is provided in both average and worst-case times, representing the
average performance for all workers in a stage as well as the long-tail slowest worker
performance for a given classification. The average and max times are furthermore broken
down into absolute and relative representations. For the ratio-based statistics, the data is
provided as a fraction of the longest time spent by any worker in any segment.
readRatioAvg / readMsAvg: Time the average worker spent reading input data.
readRatioMax / readMsMax: Time the slowest worker spent reading input data.
