Sertif GCP

The document outlines various scenarios involving Google Cloud services, focusing on solutions for data processing, access control, and real-time analytics. Key methods discussed include using Dropout for overfitting in TensorFlow models, employing Cloud Composer for orchestrating data pipelines, and leveraging BigQuery's features for data sharing and partitioning. Additionally, it emphasizes the importance of security and efficiency in handling sensitive data and optimizing costs in cloud environments.

1. Your company built a TensorFlow neural-network model with a large
number of neurons and layers. The model fits well for the training data.
However, when tested against new data, it performs poorly. What method
can you employ to address this?
Dropout is a powerful and widely used regularization technique for
addressing overfitting in neural networks, including TensorFlow models. By
randomly deactivating neurons during training, dropout helps the model
learn more robust representations and prevents it from relying too heavily
on any particular feature. This is the right answer for a situation where
the model performs well on training data but poorly on new test data.
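As a minimal sketch (assuming a Keras model; the layer sizes, input shape,
and the 0.5 rate are illustrative, not from the original question), dropout
is typically added between dense layers like this:

  import tensorflow as tf

  # Dropout layers randomly zero out activations during training,
  # which discourages co-adaptation and reduces overfitting.
  model = tf.keras.Sequential([
      tf.keras.layers.Dense(512, activation="relu", input_shape=(100,)),
      tf.keras.layers.Dropout(0.5),   # drop 50% of activations while training
      tf.keras.layers.Dense(256, activation="relu"),
      tf.keras.layers.Dropout(0.5),
      tf.keras.layers.Dense(1, activation="sigmoid"),
  ])
  model.compile(optimizer="adam", loss="binary_crossentropy",
                metrics=["accuracy"])
  # Dropout is active only during model.fit(); it is disabled automatically
  # at inference time (model.predict / model.evaluate).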

2. Your company is in a highly regulated industry. One of your requirements


is to ensure individual users have access only to the minimum amount of
information required to do their jobs. You want to enforce this requirement
with Google BigQuery. Which three approaches can you take? (Choose
three.)
B. Restrict access to tables
by role: You can define roles
in BigQuery and grant
specific permissions to these
roles to control who can
access particular tables.
D. Restrict BigQuery API
access to approved users:
You can control access to the BigQuery API and, consequently, to the
underlying data by ensuring that only approved users or services can
make API requests.
E. Segregate data across multiple tables or databases: You can separate
data into different tables or databases based on user access requirements,
which allows you to limit users' access to specific data sets.
These approaches, when used together, can help you enforce data access
controls in a regulated environment. Options A, C, and F are also important
considerations but are not direct methods for enforcing fine-grained
access control to specific data.

3. You have a requirement to insert minute-resolution data from 50,000


sensors into a BigQuery table. You expect significant growth in data
volume and need the data to be available within 1 minute of ingestion for
real-time analysis of aggregated trends. What should you do?
For real-time analysis and quick data availability, the appropriate option
is the one that combines a streaming pipeline with BigQuery, so that
aggregated sensor records become queryable within about a minute of
ingestion.

4. You need to copy millions of sensitive patient records from a relational


database to BigQuery. The total size of the database is 10 TB. You need to
design a solution that is secure and time-efficient. What should you do?
For your scenario with 10 TB of data in Cloud SQL, if you export to Avro
without specifying compression, you can expect the resulting Avro file to
be around the same size, potentially slightly smaller depending on the
data characteristics. The question does not mention compression, so let's
not assume that the Avro data will be compressed.
Since Cloud Storage itself can't handle a single object larger than 5 TB,
there is no point in using gsutil here.
Given the sensitivity of the patient records and the large size of the
data, using Google's Transfer Appliance is a secure and efficient method.
The Transfer Appliance is a hardware solution provided by Google for
transferring large amounts of data. It enables you to securely transfer
data without exposing it over the internet.

5. You need to create a near real-time inventory dashboard that reads the
main inventory tables in your BigQuery data warehouse. Historical
inventory data is stored as inventory balances by item and location. You
have several thousand updates to inventory every hour. You want to
maximize performance of the dashboard and ensure that the data is
accurate. What should you do?

Streams
inventory
changes near
real-time:
BigQuery
streaming ingests data immediately, keeping the inventory movement
table constantly updated.
Daily balance calculation: Joining the movement table with the historical
balance table provides an accurate view of current inventory levels
without affecting the actual balance table.
Nightly update for historical data: Updating the main inventory balance
table nightly ensures long-term data consistency while maintaining near
real-time insights through the view.
This approach balances near real-time updates with efficiency and data
accuracy, making it the optimal solution for the given scenario.
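A hedged sketch of the view described above; all project, dataset, table,
and column names are hypothetical. The view joins the nightly balance
snapshot with the intraday movements streamed into BigQuery:

  from google.cloud import bigquery

  client = bigquery.Client()

  # Hypothetical names: combine the nightly balance snapshot with today's
  # streamed movements so the dashboard always sees current inventory.
  view = bigquery.Table("my-project.inventory.current_balance_view")
  view.view_query = """
      SELECT b.item_id, b.location,
             b.balance + IFNULL(SUM(m.quantity_delta), 0) AS current_balance
      FROM `my-project.inventory.balance` AS b
      LEFT JOIN `my-project.inventory.inventory_movement` AS m
        ON m.item_id = b.item_id AND m.location = b.location
       AND m.event_time >= TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY)
      GROUP BY b.item_id, b.location, b.balance
  """
  client.create_table(view)  # the dashboard queries this view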
6. You have data stored in BigQuery. The data in the BigQuery dataset must
be highly available. You need to define a storage, backup, and recovery
strategy for this data that minimizes cost. How should you configure the
BigQuery tables that have a recovery point objective (RPO) of 30 days?

7. You used Dataprep to create a recipe on a sample of data in a BigQuery


table. You want to reuse this recipe on a daily upload of data with the
same schema, after the load job with variable execution time completes.
What should you do?

We have
external
dependency
"after the load
job with variable
execution time completes"
which requires DAG -> Airflow (Cloud Composer)
The reasons:
A scheduler like Cloud Scheduler won't handle the dependency on the
BigQuery load completion time
Using Composer allows creating a DAG workflow that can:
Trigger the BigQuery load
Wait for BigQuery load to complete
Trigger the Dataprep Dataflow job
Dataflow template allows easy reuse of the Dataprep transformation logic
Composer coordinates everything based on the dependencies in an
automated workflow
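A sketch of such a Composer DAG, assuming recent Google provider operators
are available in the environment; the bucket, table names, and the
Dataprep-exported Dataflow template path are placeholders:

  from datetime import datetime
  from airflow import DAG
  from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
  from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

  with DAG("daily_load_then_dataprep", schedule_interval="@daily",
           start_date=datetime(2024, 1, 1), catchup=False) as dag:

      # Load the daily file into BigQuery; execution time can vary.
      load_to_bq = GCSToBigQueryOperator(
          task_id="load_daily_file",
          bucket="my-bucket",
          source_objects=["daily/{{ ds }}/data.csv"],
          destination_project_dataset_table="my_project.my_dataset.raw_daily",
          write_disposition="WRITE_TRUNCATE",
      )

      # Run the Dataprep recipe exported as a Dataflow template.
      run_dataprep_template = DataflowTemplatedJobStartOperator(
          task_id="run_dataprep_recipe",
          template="gs://my-bucket/templates/dataprep_recipe_template",
          location="us-central1",
      )

      # The Dataflow job starts only after the load task completes,
      # however long the load takes.
      load_to_bq >> run_dataprep_template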

8. You want to automate execution of a multi-step data pipeline running on


Google Cloud. The pipeline includes Dataproc and Dataflow jobs that have
multiple dependencies on each other. You want to use managed services
where possible, and the pipeline will run every day. Which tool should you
use?
Cloud Composer is a managed service
that allows you to create and run Apache
Airflow workflows. Airflow is a workflow
management platform that can be used
to automate complex data pipelines. It is
a good choice for this use case because
it is a managed service, which means that Google will take care of the
underlying infrastructure. It also supports multiple dependencies, so you
can chain the Dataproc and Dataflow jobs in the required order.

9. You are managing a Cloud Dataproc cluster. You need to make a job run
faster while minimizing costs, without losing work in progress on your
clusters. What should you do?

All your workers need to be the same machine type. Use graceful
decommissioning so that you don't lose any work in progress, and add more
preemptible workers (increasing the cluster size) because they are more
cost-effective.

10.You work for a shipping company that uses handheld scanners to read
shipping labels. Your company has strict data privacy standards that
require scanners to only transmit tracking numbers when events are sent
to Kafka topics. A recent software update caused the scanners to
accidentally transmit recipients' personally identifiable information (PII) to
analytics systems, which violates user privacy rules. You want to quickly
build a scalable solution using cloud-native managed services to prevent
exposure of PII to the analytics systems. What should you do?
Quick to implement: Using managed services reduces development time and
effort compared to building solutions from scratch.
Scalability: Cloud Functions and the Cloud DLP API can easily handle large
volumes of data.
Accuracy: The Cloud DLP API has advanced PII detection capabilities.
Flexibility: You can customize the processing logic in the Cloud Function
to meet your specific needs.
Security: Sensitive data is handled securely within a controlled cloud
environment.

11.You have developed three data processing jobs. One executes a Cloud
Dataflow pipeline that transforms data uploaded to Cloud Storage and
writes results to BigQuery. The second ingests data from on-premises
servers and uploads it to Cloud Storage. The third is a Cloud Dataflow
pipeline that gets information from third-party data providers and uploads
the information to Cloud Storage. You need to be able to schedule and
monitor the execution of these three workflows and manually execute
them when needed. What should you do?
12.You have Cloud Functions written in Node.js that pull messages from Cloud
Pub/Sub and send the data to BigQuery. You observe that the message
processing rate on the Pub/Sub topic is orders of magnitude higher than
anticipated, but there is no error logged in Cloud Logging. What are the
two most likely causes of this problem? (Choose two.)
By not acknowledging the pulled message, it gets put back into Cloud
Pub/Sub, meaning the messages accumulate instead of being consumed and
removed from Pub/Sub. The same thing can happen if the subscriber
maintains the lease on the message it receives in case of an error. This
reduces the overall rate of processing because messages get stuck on the
first subscriber. Also, errors in the Cloud Function do not show up in
Stackdriver Log Viewer.

13.You are designing a basket abandonment system for an ecommerce


company. The system will send a message to a user based on these rules:
✑ No interaction by the user on the site for 1 hour
✑ Has added more than $30 worth of products to the basket
✑ Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a
message should be sent. How should you design the pipeline?
Session windows group data based on periods of activity and inactivity. If
there's no interaction for the duration of the gap time (in this case, 60
minutes), a new window is started. This would help identify users who
haven't interacted with the site for the specified duration, fulfilling
the requirement for the basket abandonment system.
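A minimal Beam sketch of session windowing with a 60-minute gap; the topic
name, the message format assumed by the parser, and the downstream
threshold logic are illustrative assumptions:

  import apache_beam as beam
  from apache_beam import window
  from apache_beam.options.pipeline_options import PipelineOptions

  def parse_event(msg):
      # Hypothetical parser: real code would decode the Pub/Sub payload.
      user_id, basket_value = msg.decode("utf-8").split(",")
      return user_id, float(basket_value)

  # Key events by user and group them into session windows with a
  # 60-minute gap, so a window closes after an hour of inactivity.
  with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
      per_user_baskets = (
          p
          | "ReadEvents" >> beam.io.ReadFromPubSub(
              topic="projects/my-project/topics/site-events")
          | "KeyByUser" >> beam.Map(parse_event)
          | "SessionWindow" >> beam.WindowInto(window.Sessions(gap_size=60 * 60))
          | "BasketTotalPerUser" >> beam.CombinePerKey(sum)
      )
      # Downstream logic would check the total against the $30 threshold
      # and whether a transaction was completed before sending a message.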

14.You are creating a new pipeline in Google Cloud to stream IoT data from
Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the
data, you notice that roughly 2% of the data appears to be corrupt. You
need to modify the Cloud Dataflow pipeline to filter out this corrupt data.
What should you do?

Filtering a data set. You


can use ParDo to consider
each element in a
PCollection and either
output that element to a new collection or discard it.
Formatting or type-converting each element in a data set. If your input
PCollection contains elements that are of a different type or format than
you want, you can use ParDo to perform a conversion on each element
and output the result to a new PCollection.
Extracting parts of each element in a data set. If you have a PCollection of
records with multiple fields, for example, you can use a ParDo to parse out
just the fields you want to consider into a new PCollection.
Performing computations on each element in a data set. You can use
ParDo to perform simple or complex computations on every element, or
certain elements, of a PCollection and output the results as a new
PCollection.
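A small sketch of a ParDo that drops corrupt elements; the JSON field names
used for validation are hypothetical:

  import json
  import apache_beam as beam

  class FilterCorruptRecords(beam.DoFn):
      """Emit only records that parse and contain the expected fields."""
      def process(self, record):
          try:
              payload = json.loads(record)
              if "device_id" in payload and "reading" in payload:  # hypothetical fields
                  yield payload
              # otherwise drop the element (or route it to a dead-letter output)
          except (ValueError, TypeError):
              pass  # corrupt element: discard

  # Usage inside a pipeline:
  #   clean = raw_messages | "DropCorrupt" >> beam.ParDo(FilterCorruptRecords())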

15.You have historical data covering the last three years in BigQuery and a
data pipeline that delivers new data to BigQuery daily. You have noticed
that when the Data Science team runs a query filtered on a date column
and limited to 30-90 days of data, the query scans the entire table. You
also noticed that your bill is increasing more quickly than you expected.
You want to resolve the issue as cost-effectively as possible while
maintaining the ability to conduct SQL queries. What should you do?
A partitioned table
is a special table
that is divided into
segments, called
partitions, that
make it easier to
manage and query
your data. By dividing a large table into smaller partitions, you can
improve query performance, and you can control costs by reducing the
number of bytes read by a query.
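A sketch of creating a date-partitioned table with the BigQuery Python
client; the table and column names are hypothetical:

  from google.cloud import bigquery

  client = bigquery.Client()

  # Partition the table by its date column so queries filtered on
  # event_date only scan the partitions they need.
  table = bigquery.Table(
      "my-project.analytics.events",
      schema=[
          bigquery.SchemaField("event_date", "DATE"),
          bigquery.SchemaField("user_id", "STRING"),
          bigquery.SchemaField("payload", "STRING"),
      ],
  )
  table.time_partitioning = bigquery.TimePartitioning(
      type_=bigquery.TimePartitioningType.DAY, field="event_date"
  )
  client.create_table(table)

  # A query filtered on the partitioning column now prunes partitions:
  #   SELECT ... FROM `my-project.analytics.events`
  #   WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
  #                        AND CURRENT_DATE()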

16.You operate a logistics company, and you want to improve event delivery
reliability for vehicle-based sensors. You operate small data centers around
the world to capture these events, but leased lines that provide
connectivity from your event collection infrastructure to your event
processing infrastructure are unreliable, with unpredictable latency. You
want to address this issue in the most cost-effective way. What should you
do?
Have the data acquisition devices publish data to Cloud Pub/Sub. This
would provide a reliable messaging service for your event data, allowing
you to ingest and process your data in a timely manner, regardless of the
reliability of the leased lines. Cloud Pub/Sub also offers automatic retries
and fault-tolerance, which would further improve the reliability of your
event delivery.
Additionally, using Cloud
Pub/Sub would allow you
to easily scale up or
down your event
processing infrastructure
as needed, which would
help to minimize costs.
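A minimal publisher sketch using the Pub/Sub client library; the project,
topic, and payload are placeholders:

  from google.cloud import pubsub_v1

  # Devices (or a thin gateway in each data center) publish events
  # directly to a Pub/Sub topic instead of relying on leased lines.
  publisher = pubsub_v1.PublisherClient()
  topic_path = publisher.topic_path("my-project", "vehicle-events")

  event = b'{"vehicle_id": "V123", "speed_kmh": 74}'
  future = publisher.publish(topic_path, data=event)
  print("Published message ID:", future.result())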

17.You are a retailer that wants to integrate your online sales capabilities with
different in-home assistants, such as Google Home. You need to interpret
customer voice commands and issue an order to the backend systems.
Which solutions should you choose?

Dialogflow is a powerful natural


language understanding platform
developed by Google. It allows you to
build conversational interfaces,
interpret user voice commands, and integrate with various platforms and
devices like Google Home. The "Enterprise Edition" provides additional
features and support for more complex use cases, making it a good choice
for a retailer looking to integrate with in-home assistants and handle
customer voice commands effectively.

18.Your company has a hybrid cloud initiative. You have a complex data
pipeline that moves data between cloud provider services and leverages
services from each of the cloud providers. Which cloud-native service
should you use to orchestrate the entire pipeline?
Cloud Composer is considered suitable across
multiple cloud providers, as it is built on Apache
Airflow, which allows for workflow orchestration
across different cloud environments and even on-
premises data centers, making it a good choice
for multi-cloud strategies; however, its tightest
integration is with Google Cloud Platform services.

19.You use a dataset in BigQuery for analysis. You want to provide third-party
companies with access to the same dataset. You need to keep the costs of
data sharing low and ensure that the data is current. Which solution
should you choose?
Shared datasets are collections of tables and views in BigQuery defined by
a data publisher and make up the unit of cross-project / cross-
organizational sharing. Data subscribers get an opaque, read-only, linked
dataset inside their project and VPC perimeter that they can combine with
their own datasets and connect to solutions from Google Cloud or our
partners. For example, a retailer might create a single exchange to share
demand forecasts to the thousands of vendors in their supply chain, having
joined historical sales data with weather, web clickstream, and Google
Trends data in their own BigQuery project, then sharing real-time outputs
via Analytics Hub. The publisher can add metadata, track subscribers, and
see aggregated usage
metrics.

20.Your company is in the process of migrating its on-premises data


warehousing solutions to BigQuery. The existing data warehouse uses
trigger-based change data capture (CDC) to apply updates from multiple
transactional database sources on a daily basis. With BigQuery, your
company hopes to improve its handling of CDC so that changes to the
source systems are available to query in BigQuery in near-real time using
log-based CDC streams, while also optimizing for the performance of
applying changes to the data warehouse. Which two steps should they
take to ensure that changes are available in the BigQuery reporting table
with minimal latency while reducing compute overhead? (Choose two.)

Delta tables contain all change events for a particular table since the
initial load. Having all change events available can be valuable for
identifying trends, the state of the entities that a table represents at a
particular moment, or change frequency.
The best way to merge data frequently and consistently is to use a MERGE
statement, which lets you combine multiple INSERT, UPDATE, and DELETE
statements into a single atomic operation.
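A hedged example of such a MERGE, assuming a delta table keyed by
customer_id with an op column recording the change type; all project,
dataset, table, and column names are hypothetical:

  from google.cloud import bigquery

  client = bigquery.Client()

  # Apply the latest change event per key from the delta table to the
  # reporting table in a single atomic MERGE.
  merge_sql = """
  MERGE `my-project.dwh.customers` AS main
  USING (
    SELECT * EXCEPT(row_num) FROM (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id
                                   ORDER BY change_timestamp DESC) AS row_num
      FROM `my-project.dwh.customers_delta`
    )
    WHERE row_num = 1
  ) AS delta
  ON main.customer_id = delta.customer_id
  WHEN MATCHED AND delta.op = 'DELETE' THEN DELETE
  WHEN MATCHED THEN UPDATE SET name = delta.name, email = delta.email
  WHEN NOT MATCHED AND delta.op != 'DELETE' THEN
    INSERT (customer_id, name, email)
    VALUES (delta.customer_id, delta.name, delta.email)
  """
  client.query(merge_sql).result()  # runs as one atomic operation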

21.You are designing a data processing pipeline. The pipeline must be able to
scale automatically as load increases. Messages must be processed at
least once and must be ordered within windows of 1 hour. How should you
design the solution?
the combination of
Cloud Pub/Sub for
scalable ingestion
and Cloud Dataflow
for scalable stream
processing with
windowing capabilities makes option D the most appropriate
solution for the given requirements. It minimizes management
overhead, ensures scalability, and provides the necessary features for at-
least-once processing and ordered processing within time windows.

22.You need to set access to BigQuery for different departments within your
company. Your solution should comply with the following requirements:
✑ Each department should have access only to their data.
✑ Each department will have one or more leads who need to be able to
create and update tables and provide them to their team.
✑ Each department has data analysts who need to be able to query but
not modify data.
How should you set access to the data in BigQuery?
Option B provides the
most secure and
appropriate solution by
leveraging dataset-level
access control. It adheres
to the principle of least
privilege, granting leads
the specific permissions they need to manage their department's data (via
WRITER) while allowing analysts to perform their tasks without the risk of
accidental or malicious modifications (via READER). The dataset acts as a
natural container for data isolation, fulfilling all the requirements outlined
in the scenario.
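A sketch of granting dataset-level WRITER and READER access with the
BigQuery Python client; the dataset, user, and group addresses are
placeholders:

  from google.cloud import bigquery

  client = bigquery.Client()
  dataset = client.get_dataset("my-project.finance_dept")  # hypothetical dataset

  entries = list(dataset.access_entries)
  # Department lead: can create and update tables in the dataset.
  entries.append(bigquery.AccessEntry(role="WRITER",
                                      entity_type="userByEmail",
                                      entity_id="lead@example.com"))
  # Analysts group: can query but not modify data.
  entries.append(bigquery.AccessEntry(role="READER",
                                      entity_type="groupByEmail",
                                      entity_id="finance-analysts@example.com"))
  dataset.access_entries = entries
  client.update_dataset(dataset, ["access_entries"])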

23.You operate a database that stores stock trades and an application that
retrieves average stock price for a given company over an adjustable
window of time. The data is stored in Cloud Bigtable where the datetime of
the stock trade is the beginning of the row key. Your application has
thousands of concurrent users, and you notice that performance is starting
to degrade as more stocks are added. What should you do to improve the
performance of your application?

The core issue is the


inefficient data
retrieval due to the
current row key
structure. Changing the row key to start with the stock symbol (option A)
addresses this directly by improving data locality and reducing the amount
of data scanned during reads. This is the most straightforward and
effective way to improve the application's performance in this scenario.
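A sketch of the symbol-first row key layout and a prefix scan with the
Bigtable Python client; the instance, table, and key format are
assumptions:

  from google.cloud import bigtable
  from google.cloud.bigtable.row_set import RowSet

  client = bigtable.Client(project="my-project")
  table = client.instance("trades-instance").table("stock_trades")

  # Row keys start with the stock symbol followed by the trade timestamp,
  # e.g. "GOOG#20240115T143000". A prefix scan then reads only the rows
  # for one symbol instead of scanning across all symbols.
  row_set = RowSet()
  row_set.add_row_range_with_prefix("GOOG#2024")
  for row in table.read_rows(row_set=row_set):
      print(row.row_key)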

24.Your company handles data processing for a number of different clients.


Each client prefers to use their own suite of analytics tools, with some
allowing direct query access via Google BigQuery. You need to secure the
data so that clients cannot see each other's data. You want to ensure
appropriate access to the data. Which three steps should you take?
(Choose three.)
B. Load data into a different
dataset for each client: Organize
the data into separate datasets for
each client. This ensures data
isolation and simplifies access
control.
D. Restrict a client's dataset to
approved users: Implement access
controls by specifying which users or groups are allowed to access each
client's dataset. This restricts data access to approved users only.
F. Use the appropriate identity and access management (IAM) roles for
each client's users: Assign IAM roles based on client-specific requirements
to manage permissions effectively. IAM roles help control access at a more
granular level, allowing you to tailor access to specific users or groups
within each client's dataset.
These steps ensure that each client's data is separated, and access is
controlled based on client-specific requirements. Options A, C, and E, while
important in other contexts, are not sufficient on their own to ensure client
data isolation and access control in a multi-client environment.

25.You are operating a Cloud Dataflow streaming pipeline. The pipeline


aggregates events from a Cloud Pub/Sub subscription source, within a
window, and sinks the resulting aggregation to a Cloud Storage bucket.
The source has consistent throughput. You want to monitor an alert on
behavior of the pipeline with Cloud Stackdriver to ensure that it is
processing data. Which Stackdriver alerts should you create?
Option B provides the most reliable monitoring strategy. An increase in
subscription/num_undelivered_messages coupled with a decrease in the rate
of change of instance/storage/used_bytes strongly suggests that the
pipeline is having issues processing and writing data, which is precisely
what you want to be alerted on.

26.You currently have a single on-premises Kafka cluster in a data center in


the us-east region that is responsible for ingesting messages from IoT
devices globally. Because large parts of globe have poor internet
connectivity, messages sometimes batch at the edge, come in all at once,
and cause a spike in load on your Kafka cluster. This is becoming difficult
to manage and prohibitively expensive. What is the Google-recommended
cloud native architecture for this scenario?
Pub/Sub can act like a shock absorber and rate leveller for both incoming
data streams and application architecture changes. Many devices have
limited ability to store and retry sending telemetry data. Pub/Sub scales
to handle data spikes that can occur when swarms of devices respond to
events in the physical world, and buffers these spikes to help isolate
them from applications monitoring the data.

27.You decided to use Cloud Datastore to ingest vehicle telemetry data in real
time. You want to build a storage system that will account for the long-
term data growth, while keeping the costs low. You also want to create
snapshots of the data periodically, so that you can make a point-in-time
(PIT) recovery, or clone a copy of the data for Cloud Datastore in a
different environment. You want to archive these snapshots for a long
time. Which two methods can accomplish this? (Choose two.)

Options A and B leverage Datastore's managed export functionality and


appropriate storage solutions (Cloud Storage and separate Datastore
projects) to create cost-effective, consistent snapshots for long-term
archival, PIT recovery, and cloning. They are the most efficient and
practical approaches for the given requirements.

28.You need to create a data pipeline that copies time-series transaction data
so that it can be queried from within BigQuery by your data science team
for analysis. Every hour, thousands of transactions are updated with a new
status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per
day. The data is heavily structured, and your data science team will build
machine learning models based on this data. You want to maximize
performance and usability for your data science team. Which two
strategies should you adopt? (Choose two.)

Use nested and repeated fields to denormalize data storage and increase
query performance.
Denormalization is a common strategy for increasing read performance for
relational datasets that were previously normalized. The recommended
way to denormalize data in BigQuery is to use nested and repeated fields.
It's best to use this strategy when the relationships are hierarchical and
frequently queried together, such as in parent-child relationships.
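A hedged schema sketch using a nested, repeated line_items field; all
names are hypothetical:

  from google.cloud import bigquery

  client = bigquery.Client()

  # Each transaction row embeds its line items as a repeated (ARRAY)
  # nested (STRUCT) field instead of a separate child table.
  ddl = """
  CREATE TABLE `my-project.sales.transactions` (
    transaction_id STRING,
    transaction_ts TIMESTAMP,
    status STRING,
    line_items ARRAY<STRUCT<sku STRING, quantity INT64, unit_price NUMERIC>>
  )
  PARTITION BY DATE(transaction_ts)
  """
  client.query(ddl).result()

  # Querying nested data stays in one table, avoiding joins:
  #   SELECT transaction_id, item.sku, item.quantity
  #   FROM `my-project.sales.transactions`, UNNEST(line_items) AS item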

29.You are designing a cloud-native historical data processing system to meet


the following conditions:
✑ The data being analyzed is in CSV, Avro, and PDF formats and will be
accessed by multiple analysis tools including Dataproc, BigQuery, and
Compute Engine.
✑ A batch pipeline moves daily data.
✑ Performance is not a factor in the solution.
✑ The solution design should maximize availability.
How should you design data storage for this solution?

Option D provides the most straightforward, cost-effective, and highly


available solution. It leverages the strengths of Cloud Storage for storing
diverse data formats and its multi-regional nature to ensure maximum
availability, fulfilling all the specified requirements. The ability for
Dataproc, BigQuery, and Compute Engine to directly access the data
simplifies the architecture and makes data access seamless.

30.You have a petabyte of analytics data and need to design a storage and
processing platform for it. You must be able to perform data warehouse-
style analytics on the data in Google Cloud and expose the dataset as files
for batch analysis tools in other cloud providers. What should you do?

Option C provides the best balance of performance, cost-effectiveness,


and meeting all the requirements. BigQuery handles the analytical
workload within GCP, while Cloud Storage provides a readily accessible
copy of the data for external use. Compressing the data in Cloud Storage
further optimizes costs and data transfer. This approach leverages the
strengths of each service and avoids unnecessary complexity.

31.You work for a manufacturing company that sources up to 750 different


components, each from a different supplier. You've collected a labeled
dataset that has on average 1000 examples for each unique component.
Your team wants to implement an app to help warehouse workers
recognize incoming components based on a photo of the component. You
want to implement the first working version of this app (as Proof-Of-
Concept) within a few working days. What should you do?

The bare minimum required by AutoML Vision training is 100 image


examples per category/label. The likelihood of successfully recognizing a
label goes up with the number of high quality examples for each; in
general, the more labeled data you can bring to the training process, the
better your model will be. Target at least 1000 examples per label.
32.You are working on a niche product in the image recognition domain. Your
team has developed a model that is dominated by custom C++
TensorFlow ops your team has implemented. These ops are used inside
your main training loop and are performing bulky matrix multiplications. It
currently takes up to several days to train a model. You want to decrease
this time significantly and keep the cost low by using an accelerator on
Google Cloud. What should you do?

The question emphasizes the need for a quick solution with low cost. While
GPUs and TPUs offer greater potential performance, they require
significant development effort (writing kernels) before they can be utilized.
Sticking with CPUs and scaling the cluster is the fastest and most cost-
effective way to improve training time immediately, given the reliance on
custom C++ ops without existing GPU/TPU kernel support.

33.You work on a regression problem in a natural language processing


domain, and you have 100M labeled examples in your dataset. You have
randomly shuffled your data and split your dataset into train and test
samples (in a 90/10 ratio). After you trained the neural network and
evaluated your model on a test set, you discover that the root-mean-
squared error (RMSE) of your model is twice as high on the train set as on
the test set. How should you improve the performance of your model?

The large discrepancy in RMSE, with the training error being higher, points
directly to underfitting. Increasing the model's complexity by adding layers
or expanding the input representation is the most appropriate strategy to
address this issue and improve the model's performance.

34.You use BigQuery as your centralized analytics platform. New data is


loaded every day, and an ETL pipeline modifies the original data and
prepares it for the final users. This ETL pipeline is regularly modified and
can generate errors, but sometimes the errors are detected only after 2
weeks. You need to provide a method to recover from these errors, and
your backups should be optimized for storage costs. How should you
organize your data in BigQuery and store your backups?
Option D provides the best balance of functionality and cost-effectiveness.
Partitioned tables improve data management and query performance,
while snapshot decorators offer a very efficient way to recover data to a
previous state without incurring significant storage costs. This approach
aligns perfectly with the requirements of the scenario.
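A sketch of point-in-time recovery using BigQuery time travel (the
standard-SQL counterpart of snapshot decorators); the table names and the
72-hour offset are illustrative, and recovery must fall within the table's
time-travel window:

  from google.cloud import bigquery

  client = bigquery.Client()

  # Recover a table to its state before a faulty ETL run by querying a
  # point-in-time snapshot and writing the result to a restore table.
  restore_sql = """
  CREATE OR REPLACE TABLE `my-project.dwh.orders_restored` AS
  SELECT *
  FROM `my-project.dwh.orders`
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 72 HOUR)
  """
  client.query(restore_sql).result()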

35.You want to process payment transactions in a point-of-sale application


that will run on Google Cloud Platform. Your user base could grow
exponentially, but you do not want to manage infrastructure scaling.
Which Google database service should you use?

Cloud Datastore (now part of Google Cloud Firestore in Datastore mode) is


designed for high scalability and ease of management for applications. It
is a NoSQL document database built for automatic scaling, high
performance, and ease of application development. It's serverless,
meaning it handles the scaling, performance, and management
automatically, fitting your requirement of not wanting to manage
infrastructure scaling.
Cloud SQL, while a fully-managed relational database service that makes it
easy to set up, manage, and administer your SQL databases, is not as
automatically scalable as Datastore. It's better suited for applications that
require a traditional relational database.

36.The marketing team at your organization provides regular updates of a


segment of your customer dataset. The marketing team has given you a
CSV with 1 million records that must be updated in BigQuery. When you
use the UPDATE statement in BigQuery, you receive a quotaExceeded
error. What should you do?
BigQuery is primarily designed as an append-only technology with some
limited DML statements.
It's not a relational database where you constantly update user records
when they edit their profile. Instead, you need to architect your code so
that each edit is a new row in BigQuery, and you always query the latest
row. The DML quota is low because BigQuery targets different scenarios,
not live updates of individual rows. You could ingest your data into a
separate table and issue one UPDATE statement per day.
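A sketch of that append-only pattern: each update arrives as a new row
(for example via a load job), and a query picks the latest row per
customer. All names are hypothetical:

  from google.cloud import bigquery

  client = bigquery.Client()

  # Read only the most recent row per customer instead of running
  # per-row UPDATE statements.
  latest_sql = """
  SELECT * EXCEPT(rn) FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY updated_at DESC) AS rn
    FROM `my-project.marketing.customer_updates`
  )
  WHERE rn = 1
  """
  for row in client.query(latest_sql).result():
      pass  # use row.customer_id, row.email, etc.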

37.As your organization expands its usage of GCP, many teams have started
to create their own projects. Projects are further multiplied to
accommodate different stages of deployments and target audiences. Each
project requires unique access control configurations. The central IT team
needs to have access to all projects. Furthermore, data from Cloud Storage
buckets and BigQuery datasets must be shared for use in other projects in
an ad hoc way. You want to simplify access control management by
minimizing the number of policies. Which two steps should you take?
(Choose two.)

1. Define your resource hierarchy: Google Cloud resources are organized


hierarchically. This hierarchy allows you to map your enterprise's
operational structure to Google Cloud, and to manage access control and
permissions for groups of related resources.
2. Delegate responsibility with groups and service accounts: we
recommend collecting users with the same responsibilities into groups and
assigning IAM roles to the groups rather than to individual users.

38.Your United States-based company has created an application for


assessing and responding to user actions. The primary table's data volume
grows by 250,000 records per second. Many third parties use your
application's APIs to build the functionality into their own frontend
applications. Your application's APIs should comply with the following
requirements:
✑ Single global endpoint
✑ ANSI SQL support
✑ Consistent access to the most up-to-date data
What should you do?
Cloud Spanner is the only option that meets all the requirements of this
application. It provides the scalability, global distribution, SQL support, and
strong consistency needed to handle the high data volume and global user
base. The other options are either not scalable enough, don't support SQL,
or don't offer the required level of consistency.

39.A data scientist has created a BigQuery ML model and asks you to create
an ML pipeline to serve predictions. You have a REST API application with
the requirement to serve predictions for an individual user ID with latency
under 100 milliseconds. You use the following query to generate
predictions: SELECT predicted_label, user_id FROM ML.PREDICT (MODEL
'dataset.model', table user_features). How should you create the ML
pipeline?

The key requirements are serving predictions for individual user IDs with
low (sub-100ms) latency.
Option D meets this by batch predicting for all users in BigQuery ML,
writing predictions to Bigtable for fast reads, and allowing the application
access to query Bigtable directly for low latency reads.
Since the application needs to serve low-latency predictions for individual
user IDs, using Dataflow to batch predict for all users and write to Bigtable
allows low-latency reads. Granting the Bigtable Reader role allows the
application to retrieve predictions for a specific user ID from Bigtable.

40.You are building an application to share financial market data with


consumers, who will receive data feeds. Data is collected from the markets
in real time. Consumers will receive the data in the following ways:
✑ Real-time event stream
✑ ANSI SQL access to real-time stream and historical data
✑ Batch historical exports
Which solution should you use?
Real-time Event Stream: Cloud Pub/Sub is a managed messaging service
that can handle real-time event streams efficiently. You can use Pub/Sub to
ingest and publish real-time market data to consumers.
ANSI SQL Access: BigQuery supports ANSI SQL queries, making it suitable
for both real-time and historical data analysis. You can stream data into
BigQuery tables from Pub/Sub and provide ANSI SQL access to consumers.
Batch Historical Exports: Cloud Storage can be used for batch historical
exports. You can export data from BigQuery to Cloud Storage in batch,
making it available for consumers to download.

41.You are building a new application that you need to collect data from in a
scalable way. Data arrives continuously from the application throughout
the day, and you expect to generate approximately 150 GB of JSON data
per day by the end of the year. Your requirements are:
✑ Decoupling producer from consumer
✑ Space and cost-efficient storage of the raw ingested data, which is to be
stored indefinitely
✑ Near real-time SQL query
✑ Maintain at least 2 years of historical data, which will be queried with
SQL
Which pipeline should you use to meet these requirements?

Decoupling Producer from Consumer: Cloud Pub/Sub provides a decoupled


messaging system where the producer publishes events, and consumers
(like Dataflow) can subscribe to these events. This decoupling ensures
flexibility and scalability.
Space and Cost-Efficient Storage: Storing data in Avro format is more
space-efficient than JSON, and Cloud Storage is a cost-effective storage
solution. Additionally, Cloud Pub/Sub and Dataflow allow you to process
and transform data efficiently, reducing storage costs.
Near Real-time SQL Query: By using Dataflow to transform and load data
into BigQuery, you can achieve near real-time data availability for SQL
queries. BigQuery is well-suited for ad-hoc SQL queries and provides
excellent query performance.

42.You are running a pipeline in Dataflow that receives messages from a


Pub/Sub topic and writes the results to a BigQuery dataset in the EU.
Currently, your pipeline is located in europe-west4 and has a maximum of
3 workers, instance type n1-standard-1. You notice that during peak
periods, your pipeline is struggling to process records in a timely fashion,
when all 3 workers are at maximum CPU utilization. Which two actions can
you take to increase performance of your pipeline? (Choose two.)

The most effective way to address the performance issue in this Dataflow
pipeline is to increase the processing capacity by either adding more
workers (horizontal scaling) or using more powerful workers (vertical
scaling). Both options A and B directly address the identified CPU
bottleneck and are the most appropriate solutions.

43.You have a data pipeline with a Dataflow job that aggregates and writes
time series metrics to Bigtable. You notice that data is slow to update in
Bigtable. This data feeds a dashboard used by thousands of users across
the organization. You need to support additional concurrent users and
reduce the amount of time required to write the data. Which two actions
should you take? (Choose two.)

To improve the data pipeline's performance and address the slow updates
in Bigtable, the most effective solutions are to increase the processing
power of the Dataflow job (by adding workers) and increase the capacity
of the Bigtable cluster (by adding nodes). Both options B and C directly
target the potential bottlenecks and are the most appropriate actions to
take.
44.You have several Spark jobs that run on a Cloud Dataproc cluster on a
schedule. Some of the jobs run in sequence, and some of the jobs run
concurrently. You need to automate this process. What should you do?
For orchestrating Spark jobs on Dataproc with specific sequencing and
concurrency requirements, Cloud Composer with Airflow DAGs provides
the most flexible, scalable, and manageable solution. It allows you to
define dependencies, schedule execution, and monitor the entire workflow
in a centralized and reliable manner.

45.You are building a new data pipeline to share data between two different
types of applications: jobs generators and job runners. Your solution must
scale to accommodate increases in usage and must accommodate the
addition of new applications without negatively affecting the performance
of existing ones. What should you do?

Cloud Pub/Sub is the best solution for this data pipeline because it
provides the necessary decoupling, scalability, and extensibility to meet
the requirements. It enables independent scaling of job generators and
runners, simplifies the addition of new applications, and ensures reliable
message delivery.

46.You want to use a database of information about tissue samples to classify


future tissue samples as either normal or mutated. You are evaluating an
unsupervised anomaly detection method for classifying the tissue
samples. Which two characteristics support this method? (Choose two.)

A. There are very few occurrences of mutations relative to normal


samples. This characteristic is supportive of using an unsupervised
anomaly detection method, as it is well suited for identifying rare events
or anomalies in large amounts of data. By training the algorithm on the
normal tissue samples in the database, it can then identify new tissue
samples that have different features from the normal samples and classify
them as mutated.
D. You expect future mutations to have similar features to the mutated
samples in the database. This characteristic is supportive of using an
unsupervised anomaly detection method, as it is well suited for identifying
patterns or anomalies in the data. By training the algorithm on the
mutated tissue samples in the database, it can then identify new tissue
samples that have similar features and classify them as mutated.

47.You need to create a new transaction table in Cloud Spanner that stores
product sales data. You are deciding what to use as a primary key. From a
performance perspective, which strategy should you choose?

According to the documentation:


Use a Universally Unique Identifier (UUID)
You can use a Universally Unique Identifier (UUID) as defined by RFC 4122
as the primary key. Version 4 UUID is recommended, because it uses
random values in the bit sequence. Version 1 UUID stores the timestamp
in the high order bits and is not recommended.
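A minimal sketch of inserting a row keyed by a version 4 UUID with the
Cloud Spanner Python client; the instance, database, table, and columns
are hypothetical:

  import uuid
  from google.cloud import spanner

  # A random (version 4) UUID spreads writes evenly across Spanner splits
  # instead of hotspotting on a monotonically increasing key.
  client = spanner.Client()
  database = client.instance("sales-instance").database("sales-db")

  with database.batch() as batch:
      batch.insert(
          table="product_sales",
          columns=("sale_id", "product_id", "amount"),
          values=[(str(uuid.uuid4()), "SKU-123", 19.99)],
      )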

48.Data Analysts in your company have the Cloud IAM Owner role assigned to
them in their projects to allow them to work with multiple GCP products in
their projects. Your organization requires that all BigQuery data access
logs be retained for 6 months. You need to ensure that only audit
personnel in your company can access the data access logs for all
projects. What should you do?

Aggregated Export Sink: By using an aggregated export sink, you can


consolidate data access logs from multiple projects into a single location.
This simplifies log management and retention policies.
Newly Created Project for Audit Logs: Creating a dedicated project for
audit logs allows you to centralize access control and manage logs
separately from individual Data Analyst projects.
Access Restriction: By restricting access to the project containing the
exported logs, you ensure that only authorized audit personnel have
access to the logs while preventing Data Analysts from accessing them.

49.Each analytics team in your organization is running BigQuery jobs in their


own projects. You want to enable each team to monitor slot usage within
their projects. What should you do?

You should create a Cloud Monitoring dashboard based on the BigQuery


metric slots/allocated_for_project.
This metric represents the number of BigQuery slots allocated for a
project. By creating a Cloud Monitoring dashboard based on this metric,
you can monitor the slot usage within each project in your organization.
This will allow each team to monitor their own slot usage and ensure that
they are not exceeding their allocated quota.

50.You are operating a streaming Cloud Dataflow pipeline. Your engineers


have a new version of the pipeline with a different windowing algorithm
and triggering strategy. You want to update the running pipeline with the
new version. You want to ensure that no data is lost during the update.
What should you do?

Your engineers have a new version of the pipeline with a different
windowing algorithm and triggering strategy.
The new version is a major change. Stopping the pipeline with the drain
option and then launching the new code is the safer way.
We recommend that you attempt only smaller changes to your pipeline's
windowing, such as changing the duration of fixed- or sliding-time
windows. Making major changes to windowing or triggers, like changing the
windowing algorithm, might have unpredictable results on your pipeline
output.

51.You need to move 2 PB of historical data from an on-premises storage


appliance to Cloud Storage within six months, and your outbound network
capacity is constrained to 20 Mb/sec. How should you migrate this data to
Cloud Storage?

Physical Transfer: Transfer Appliance is a physical device provided by


Google Cloud that you can use to physically transfer large volumes of data
to the cloud. It allows you to avoid the limitations of network bandwidth
and transfer data much faster.
Capacity: Transfer Appliance can handle large volumes of data, including
the 2 PB you need to migrate, without the constraints of slow network
speeds.
Efficiency: It is highly efficient for large-scale data transfers and is a
practical choice for transferring multi-terabyte or petabyte-scale datasets.

52.You receive data files in CSV format monthly from a third party. You need
to cleanse this data, but every third month the schema of the files
changes. Your requirements for implementing these transformations
include:
✑ Executing the transformations on a schedule
✑ Enabling non-developer analysts to modify transformations
✑ Providing a graphical tool for designing transformations
What should you do?
Dataprep by Trifacta is an intelligent data service for visually exploring,
cleaning, and preparing structured and unstructured data for analysis,
reporting, and machine learning. Because Dataprep is serverless and
works at any scale, there is no infrastructure to deploy or manage. Your
next ideal data transformation is suggested and predicted with each UI
input, so you don’t have to write code.

53.You want to migrate an on-premises Hadoop system to Cloud Dataproc.


Hive is the primary tool in use, and the data format is Optimized Row
Columnar (ORC). All ORC files have been successfully copied to a Cloud
Storage bucket. You need to replicate some data to the cluster's local
Hadoop Distributed File System (HDFS) to maximize performance. What
are two ways to start using Hive in Cloud Dataproc? (Choose two.)

The most efficient ways to start using Hive in Cloud Dataproc with ORC
files already in Cloud Storage are:
1. Copy to HDFS using gsutil and Hadoop tools for maximum
performance.
2. Use the Cloud Storage connector for initial access, then replicate
key data to HDFS for optimized performance.
Both options allow you to leverage the benefits of having data in the local
HDFS for improved Hive query performance. Option D provides more
flexibility by allowing you to choose what data to replicate based on your
needs.

54.You are implementing several batch jobs that must be executed on a


schedule. These jobs have many interdependent steps that must be
executed in a specific order. Portions of the jobs involve executing shell
scripts, running Hadoop jobs, and running queries in BigQuery. The jobs
are expected to run for many minutes up to several hours. If the steps fail,
they must be retried a fixed number of times. Which service should you
use to manage the execution of these jobs?
Workflow Orchestration: Cloud Composer is a fully managed workflow
orchestration service based on Apache Airflow. It allows you to define,
schedule, and manage complex workflows with multiple steps, including
shell scripts, Hadoop jobs, and BigQuery queries.
Dependency Management: You can define dependencies between different
steps in your workflow to ensure they are executed in a specific order.
Retry Mechanism: Cloud Composer provides built-in retry mechanisms, so
if any step fails, it can be retried a fixed number of times according to your
configuration.
Scheduled Execution: Cloud Composer allows you to schedule the
execution of your workflows on a regular basis, meeting the requirement
for executing the jobs on a schedule.

55.You work for a shipping company that has distribution centers where
packages move on delivery lines to route them properly. The company
wants to add cameras to the delivery lines to detect and track any visual
damage to the packages in transit. You need to create a way to automate
the detection of damaged packages and flag them for human review in
real time while the packages are in transit. Which solution should you
choose?

For this scenario, where you need to automate the detection of damaged
packages in real time while they are in transit, the most suitable solution
among the provided options would be B.
Here's why this option is the most appropriate:
Real-Time Analysis: AutoML provides the capability to train a custom
model specifically tailored to recognize patterns of damage in packages.
This model can process images in real-time, which is essential in your
scenario.
Integration with Existing Systems: By building an API around the AutoML
model, you can seamlessly integrate this solution with your existing
package tracking applications. This ensures that the system can flag
damaged packages for human review efficiently.
Customization and Accuracy: Since the model is trained on your specific
corpus of images, it can be more accurate in detecting damages relevant
to your use case compared to pre-trained models.

56.You are migrating your data warehouse to BigQuery. You have migrated all
of your data into tables in a dataset. Multiple users from your organization
will be using the data. They should only see certain tables based on their
team membership. How should you set user permissions?

The simplest and most effective way to control user access to specific
tables in BigQuery is to assign the bigquery.dataViewer role (or a custom
role) at the table level. This provides the necessary granular control, is
easy to manage, and scales well.

57.You need to store and analyze social media postings in Google BigQuery at
a rate of 10,000 messages per minute in near real-time. You initially design
the application to use streaming inserts for individual postings. Your
application also performs data aggregations right after the streaming
inserts. You discover that the queries after streaming inserts do not exhibit
strong consistency, and reports from the queries might miss in-flight data.
How can you adjust your application design?

Option D provides the most practical and efficient way to address the
consistency issues with BigQuery streaming inserts while maintaining near
real-time data availability. It leverages the benefits of streaming inserts for
high-volume data ingestion and balances data freshness with accuracy by
waiting for a period based on estimated latency.
58.You want to build a managed Hadoop system as your data lake. The data
transformation process is composed of a series of Hadoop jobs executed in
sequence. To accomplish the design of separating storage from compute,
you decided to use the Cloud Storage connector to store all input data,
output data, and intermediary data. However, you noticed that one
Hadoop job runs very slowly with Cloud Dataproc, when compared with
the on-premises bare-metal Hadoop environment (8-core nodes with 100-
GB RAM). Analysis shows that this particular Hadoop job is disk I/O
intensive. You want to resolve the issue. What should you do?

The most effective way to resolve the performance issue for a disk I/O
intensive Hadoop job in Cloud Dataproc is to allocate sufficient persistent
disks and store the intermediate data on the local HDFS. This reduces
network overhead and allows the job to access data much faster,
improving overall performance.

59.You work for an advertising company, and you've developed a Spark ML


model to predict click-through rates at advertisement blocks. You've been
developing everything at your on-premises data center, and now your
company is migrating to Google Cloud. Your data center will be closing
soon, so a rapid lift-and-shift migration is necessary. However, the data
you've been using will be migrated to BigQuery. You
periodically retrain your Spark ML models, so you need to migrate existing
training pipelines to Google Cloud. What should you do?

For a rapid lift-and-shift migration of Spark ML model training pipelines to


Google Cloud with data in BigQuery, Dataproc provides the most suitable
solution. It allows you to run your existing Spark ML code with minimal
changes, leverages the BigQuery connector for direct data access, and
simplifies cluster management through its managed service offering. This
approach balances speed, cost-effectiveness, and ease of use.
60.You work for a global shipping company. You want to train a model on 40
TB of data to predict which ships in each geographic region are likely to
cause delivery delays on any given day. The model will be based on
multiple attributes collected from multiple sources. Telemetry data,
including location in GeoJSON format, will be pulled from each ship and
loaded every hour. You want to have a dashboard that shows how many
and which ships are likely to cause delays within a region. You want to use
a storage solution that has native functionality for prediction and
geospatial processing. Which storage solution should you use?

BigQuery is the most appropriate storage solution for this use case due to
its scalability, geospatial processing capabilities, high-speed ingestion,
machine learning integration, and suitability for dashboard creation. It
directly addresses all the key requirements for storing, processing, and
analyzing the ship telemetry data to predict delivery delays.

61.You operate an IoT pipeline built around Apache Kafka that normally
receives around 5000 messages per second. You want to use Google Cloud
Platform to create an alert as soon as the moving average over 1 hour
drops below 4000 messages per second. What should you do?

Dataflow with Sliding Time Windows: Dataflow allows you to work with
event-time windows, making it suitable for time-series data like incoming
IoT messages. Using sliding windows every 5 minutes allows you to
compute moving averages efficiently.
Sliding Time Window: The sliding time window of 1 hour every 5 minutes
enables you to calculate the moving average over the specified time
frame.
Computing Averages: You can efficiently compute the average when each
sliding window closes. This approach ensures that you have real-time
visibility into the message rate and can detect deviations from the
expected rate.
Alerting: When the calculated average drops below 4000 messages per
second, you can trigger an alert from within the Dataflow pipeline, sending
it to your desired alerting mechanism, such as Cloud Monitoring, Pub/Sub,
or another notification service.
Scalability: Dataflow can scale automatically based on the incoming data
volume, ensuring that you can handle the expected rate of 5000
messages per second.
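A Beam sketch of the sliding-window moving average (1-hour windows
advancing every 5 minutes); the topic name, the rate check, and the alert
mechanism are assumptions:

  import apache_beam as beam
  from apache_beam import window
  from apache_beam.options.pipeline_options import PipelineOptions
  from apache_beam.transforms import combiners

  def check_rate(count):
      # Convert the per-window message count to a per-second rate.
      rate = count / 3600.0
      if rate < 4000:
          # A real pipeline would publish an alerting event here
          # (e.g. to Pub/Sub or a custom Cloud Monitoring metric).
          print("ALERT: moving average %.0f msg/s below threshold" % rate)
      return rate

  with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
      (
          p
          | "ReadMessages" >> beam.io.ReadFromPubSub(
              topic="projects/my-project/topics/iot-ingest")
          | "HourlySlidingWindows" >> beam.WindowInto(
              window.SlidingWindows(size=3600, period=300))
          | "CountPerWindow" >> beam.CombineGlobally(
              combiners.CountCombineFn()).without_defaults()
          | "CheckThreshold" >> beam.Map(check_rate)
      )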

62.You plan to deploy Cloud SQL using MySQL. You need to ensure high
availability in the event of a zone failure. What should you do?

The HA configuration provides data redundancy. A Cloud SQL instance


configured for HA is also called a regional instance and has a primary and
secondary zone within the configured region. Within a regional instance,
the configuration is made up of a primary instance and a standby
instance. Through synchronous replication to each zone's persistent disk,
all writes made to the primary instance are replicated to disks in both
zones before a transaction is reported as committed. In the event of an
instance or zone failure, the standby instance becomes the new primary
instance. Users are then rerouted to the new primary instance. This
process is called a failover.

63.Your company is selecting a system to centralize data ingestion and


delivery. You are considering messaging and data integration systems to
address the requirements. The key requirements are:
✑ The ability to seek to a particular offset in a topic, possibly back to the
start of all data ever captured
✑ Support for publish/subscribe semantics on hundreds of topics
✑ Retain per-key ordering
Which system should you choose?
Ability to Seek to a Particular Offset: Kafka allows consumers to seek to a
specific offset in a topic, enabling you to read data from a specific point,
including back to the start of all data ever captured. This is a fundamental
capability of Kafka.
Support for Publish/Subscribe Semantics: Kafka supports publish/subscribe
semantics through topics. You can have hundreds of topics in Kafka, and
consumers can subscribe to these topics to receive messages in a
publish/subscribe fashion.
Retain Per-Key Ordering: Kafka retains the order of messages within a
partition. If you have a key associated with your messages, you can
ensure per-key ordering by sending messages with the same key to the
same partition.
Scalability: Kafka is designed to handle high-throughput data streaming
and is capable of scaling to meet your needs.
Apache Kafka aligns well with the requirements you've outlined for
centralized data ingestion and delivery. It's a robust choice for scenarios
that involve data streaming, publish/subscribe, and retaining message
ordering.
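A small sketch with the kafka-python client showing offset seeking; the
broker address, topic, and consumer group are placeholders:

  from kafka import KafkaConsumer, TopicPartition

  # The consumer can seek to any retained offset, including the very
  # beginning of the topic, and replay data from there.
  consumer = KafkaConsumer(bootstrap_servers="broker:9092",
                           enable_auto_commit=False,
                           group_id="replay-consumer")
  partition = TopicPartition("market-events", 0)
  consumer.assign([partition])

  consumer.seek_to_beginning(partition)   # replay all data ever captured
  # consumer.seek(partition, 42000)       # or jump to a specific offset

  for message in consumer:
      print(message.offset, message.key, message.value)
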
64.You are planning to migrate your current on-premises Apache Hadoop
deployment to the cloud. You need to ensure that the deployment is as
fault-tolerant and cost-effective as possible for long-running batch jobs.
You want to use a managed service. What should you do?

Dataproc Managed Service: Dataproc is a fully managed service for running Apache Hadoop and Spark. It provides ease of management and
automation.
Standard Persistent Disk: Using standard persistent disks for Dataproc
workers ensures durability and is cost-effective compared to SSDs.
Preemptible Workers: By using 50% preemptible workers, you can
significantly reduce costs while maintaining fault tolerance. Preemptible
VMs are cheaper but can be preempted by Google, so having a mix of
preemptible and non-preemptible workers provides cost savings with
redundancy.
Storing Data in Cloud Storage: Storing data in Cloud Storage is highly
durable, scalable, and cost-effective. It also makes data accessible to
Dataproc clusters, and you can leverage native connectors for reading
data from Cloud Storage.
Changing References to gs://: Updating your scripts to reference data in
Cloud Storage using gs:// ensures that your jobs work seamlessly with the
cloud storage infrastructure.

65.Your team is working on a binary classification problem. You have trained a support vector machine (SVM) classifier with default parameters, and received an area under the curve (AUC) of 0.87 on the validation set. You
want to increase the AUC of the model. What should you do?

Hyperparameter tuning is the most effective and efficient way to improve the AUC of the SVM classifier. It allows you to systematically optimize the
model's parameters to find the best settings for the given data and
problem. The other options are either not guaranteed to improve
performance or are premature and risky.

66.You need to deploy additional dependencies to all nodes of a Cloud Dataproc cluster at startup using an existing initialization action. Company
security policies require that Cloud Dataproc nodes do not have access to
the Internet so public initialization actions cannot fetch resources. What
should you do?

Security Compliance: This option aligns with your company's security policies, which prohibit public Internet access from Cloud Dataproc nodes.
Placing the dependencies in a Cloud Storage bucket within your VPC
security perimeter ensures that the data remains within your private
network.
VPC Security: By placing the dependencies within your VPC security
perimeter, you maintain control over network access and can restrict
access to the necessary nodes only.
Dataproc Initialization Action: You can use a custom initialization action or
script to fetch and install the dependencies from the secure Cloud Storage
bucket to the Dataproc cluster nodes during startup.
By copying the dependencies to a secure Cloud Storage bucket and using
an initialization action to install them on the Dataproc nodes, you can
meet your security requirements while providing the necessary
dependencies to your cluster.
67.You need to choose a database for a new project that has the following
requirements:
✑ Fully managed
✑ Able to automatically scale up
✑ Transactionally consistent
✑ Able to scale up to 6 TB
✑ Able to be queried using SQL
Which database do you choose?

While Cloud SQL is a fully managed service that scales up automatically and supports SQL queries, it does not inherently guarantee transactional consistency or the ability to scale up to 6 TB for all its database engines.

68.Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google
BigQuery. Teams have freedom to use the service as they see fit, and they
have not documented their use cases. You have been asked to secure the
data warehouse. You need to discover what everyone is doing. What
should you do first?

To effectively discover what everyone is doing with BigQuery datasets, Google Stackdriver Audit Logs are the most appropriate tool. They provide
a comprehensive record of data access, including user identity, accessed
data, timestamps, and query details. This information is crucial for
understanding data usage patterns and securing the data warehouse.

69.You work for a mid-sized enterprise that needs to move its operational
system transaction data from an on-premises database to GCP. The
database is about 20 TB in size. Which database should you choose?
While Cloud SQL is a fully managed service that scales up automatically and supports SQL queries, it does not inherently guarantee transactional consistency or the ability to scale to the required size for all its database engines.

70.You need to choose a database to store time series CPU and memory
usage for millions of computers. You need to store this data in one-second
interval samples. Analysts will be performing real-time, ad hoc analytics
against the database. You want to avoid being charged for every query
executed and ensure that the schema design will allow for future growth of
the dataset. Which database and data model should you choose?

Bigtable with a narrow table design is the most suitable solution for this
scenario. It provides the scalability, low-latency reads, cost-effectiveness,
and schema flexibility needed to store and analyze time series data from
millions of computers. The narrow table model ensures efficient storage
and retrieval of data, while the Bigtable's pricing model avoids per-query
charges.
71.You want to archive data in Cloud Storage. Because some data is very
sensitive, you want to use the `Trust No One` (TNO) approach to encrypt
your data to prevent the cloud provider staff from decrypting your data.
What should you do?

Additional authenticated data (AAD) is any string that you pass to Cloud
Key Management Service as part of an encrypt or decrypt request. AAD is
used as an integrity check and can help protect your data from a confused
deputy attack. The AAD string must be no larger than 64 KiB.
Cloud KMS will not decrypt ciphertext unless the same AAD value is used
for both encryption and decryption.
AAD is bound to the encrypted data, because you cannot decrypt the
ciphertext unless you know the AAD, but it is not stored as part of the
ciphertext. AAD also does not increase the cryptographic strength of the
ciphertext. Instead it is an additional check by Cloud KMS to authenticate
a decryption request.

72.You have data pipelines running on BigQuery, Dataflow, and Dataproc. You
need to perform health checks and monitor their behavior, and then notify
the team managing the pipelines if they fail. You also need to be able to
work across multiple projects. Your preference is to use managed products
or features of the platform. What should you do?

Cloud Monitoring (formerly known as Stackdriver) is a fully managed monitoring service provided by GCP, which can collect metrics, logs, and
other telemetry data from various GCP services, including BigQuery,
Dataflow, and Dataproc.
Alerting Policies: Cloud Monitoring allows you to define alerting policies
based on specific conditions or thresholds, such as pipeline failures,
latency spikes, or other custom metrics. When these conditions are met,
Cloud Monitoring can trigger notifications (e.g., emails) to alert the team
managing the pipelines.
Cross-Project Monitoring: Cloud Monitoring supports monitoring resources
across multiple GCP projects, making it suitable for your requirement to
monitor pipelines in multiple projects.
Managed Solution: Cloud Monitoring is a managed service, reducing the
operational overhead compared to running your own virtual machine
instances or building custom solutions.

73.You are working on a linear regression model on BigQuery ML to predict a customer's likelihood of purchasing your company's products. Your model
uses a city name variable as a key predictive component. In order to train
and serve the model, your data must be organized in columns. You want to
prepare your data using the least amount of coding while maintaining the
predictable variables. What should you do?
One-hot encoding is a common technique used to handle categorical data
in machine learning. This approach will transform the city name variable
into a series of binary columns, one for each city. Each row will have a "1"
in the column corresponding to the city it represents and "0" in all other
city columns. This method is effective for linear regression models as it
enables the model to use city data as a series of numeric, binary
variables. BigQuery supports SQL operations that can easily implement
one-hot encoding, thus minimizing the amount of coding required and
efficiently preparing the data for the model.
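As a minimal sketch of this approach, the query below one-hot encodes a city column with plain SQL expressions issued through the BigQuery Python client; the dataset, table, and city values are illustrative assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Turn the categorical `city` column into binary indicator columns.
# Illustrative dataset/table names; replace with your own.
query = """
CREATE OR REPLACE TABLE my_dataset.customers_encoded AS
SELECT
  * EXCEPT(city),
  IF(city = 'London', 1, 0)   AS city_london,
  IF(city = 'Paris', 1, 0)    AS city_paris,
  IF(city = 'New York', 1, 0) AS city_new_york
FROM my_dataset.customers
"""

client.query(query).result()  # waits for the encoded table to be created
```

In practice, BigQuery ML also one-hot encodes non-numeric feature columns automatically for linear models, which further reduces the coding effort.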

74.You work for a large bank that operates in locations throughout North
America. You are setting up a data storage system that will handle bank
account transactions. You require ACID compliance and the ability to
access data with SQL. Which solution is appropriate?

Since the banking transaction system requires ACID compliance and SQL
access to the data, Cloud Spanner is the most appropriate solution. Unlike
Cloud SQL, Cloud Spanner natively provides ACID transactions and
horizontal scalability.
Enabling stale reads in Spanner (option A) would reduce data consistency,
violating the ACID compliance requirement of banking transactions.
BigQuery (option C) does not natively support ACID transactions or SQL
writes which are necessary for a banking transactions system.
Cloud SQL (option D) provides ACID compliance but does not scale
horizontally like Cloud Spanner can to handle large transaction volumes.
By using Cloud Spanner and specifically locking read-write transactions,
ACID compliance is ensured while providing fast, horizontally scalable SQL
processing of banking transactions.

75.A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time. This is then loaded into BigQuery.
Analysts in your company want to query the tracking data in BigQuery to
analyze geospatial trends in the lifecycle of a package. The table was
originally created with ingest-date partitioning. Over time, the query
processing time has increased. You need to implement a change that
would improve query performance in BigQuery. What should you do?

Clustered tables in BigQuery are tables that have a user-defined column sort order using clustered columns. Clustered tables can improve query
performance and reduce query costs.
In BigQuery, a clustered column is a user-defined table property that sorts
storage blocks based on the values in the clustered columns. The storage
blocks are adaptively sized based on the size of the table. A clustered
table maintains the sort properties in the context of each operation that
modifies it. Queries that filter or aggregate by the clustered columns only
scan the relevant blocks based on the clustered columns instead of the
entire table or table partition.

76.Your company currently runs a large on-premises cluster using Spark, Hive, and HDFS in a colocation facility. The cluster is designed to
accommodate peak usage on the system; however, many jobs are batch
in nature, and usage of the cluster fluctuates quite dramatically. Your
company is eager to move to the cloud to reduce the overhead associated
with on-premises infrastructure and maintenance and to benefit from the
cost savings. They are also hoping to modernize their existing
infrastructure to use more serverless offerings in order to take advantage
of the cloud. Because of the timing of their contract renewal with the
colocation facility, they have only 2 months for their initial migration. How
would you recommend they approach their upcoming migration strategy
so they can maximize their cost savings in the cloud while still executing
the migration in time?

When you want to move your Apache Spark workloads from an on-
premises environment to Google Cloud, we recommend using Dataproc to
run Apache Spark/Apache Hadoop clusters. Dataproc is a fully managed,
fully supported service offered by Google Cloud. It allows you to separate
storage and compute, which helps you to manage your costs and be more
flexible in scaling your workloads.
https://cloud.google.com/bigquery/docs/migration/hive#data_migration
Migrating Hive data from your on-premises or other cloud-based source
cluster to BigQuery has two steps:
1. Copying data from a source cluster to Cloud Storage
2. Loading data from Cloud Storage into BigQuery

77.You work for a financial institution that lets customers register online. As
new customers register, their user data is sent to Pub/Sub before being
ingested into BigQuery. For security reasons, you decide to redact your
customers' Government issued Identification Number while allowing
customer service representatives to view the original values when
necessary. What should you do?

Before loading the data into BigQuery, use Cloud Data Loss Prevention
(DLP) to replace input values with a cryptographic format-preserving
encryption token.
The key reasons are:
DLP allows redacting sensitive PII like SSNs before loading into BigQuery.
This provides security by default for the raw SSN values.
Using format-preserving encryption keeps the column format intact while
still encrypting, allowing business logic relying on SSN format to continue
functioning.
The encrypted tokens can be reversed to view original SSNs when
required, meeting the access requirement for customer service reps.

78.You are migrating a table to BigQuery and are deciding on the data model.
Your table stores information related to purchases made across several
store locations and includes information like the time of the transaction,
items purchased, the store ID, and the city and state in which the store is
located. You frequently query this table to see how many of each item
were sold over the past 30 days and to look at purchasing trends by state,
city, and individual store. How would you model this table for the best
query performance?

BigQuery supports partitioned tables. A partitioned table is a special table that is divided into segments, called
partitions, that make it easier to manage and query your data. By dividing
a large table into smaller partitions, you can improve query performance,
and you can control costs by reducing the number of bytes read by a
query.
You can partition BigQuery tables by:
- Time-unit column: Tables are partitioned based on a TIMESTAMP, DATE, or
DATETIME column in the table.
https://cloud.google.com/bigquery/docs/clustered-tables
Clustered tables in BigQuery are tables that have a user-defined column
sort order using clustered columns. Clustered tables can improve query
performance and reduce query costs.
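For illustration, here is a minimal DDL sketch of this partition-plus-cluster design, run through the BigQuery Python client; the dataset, table, and column names are assumptions for this example.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative dataset/table/column names; replace with your own.
ddl = """
CREATE TABLE my_dataset.purchases_optimized
PARTITION BY DATE(transaction_time)   -- limits scans to the last 30 days of partitions
CLUSTER BY state, city, store_id      -- co-locates rows for the common filters
AS
SELECT * FROM my_dataset.purchases
"""

client.query(ddl).result()
```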

79.Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and
minimize the management of the cluster as much as possible. They also
want to be able to persist data beyond the life of the cluster. What should
you do?

Cloud Dataproc allows you to run Apache Hadoop jobs with minimal
management. It is a managed Hadoop service.
Using the Google Cloud Storage (GCS) connector, Dataproc can access
data stored in GCS, which allows data persistence beyond the life of the
cluster. This means that even if the cluster is deleted, the data in GCS
remains intact. Moreover, using GCS is often cheaper and more durable
than using HDFS on persistent disks.

80.You are updating the code for a subscriber to a Pub/Sub feed. You are
concerned that upon deployment the subscriber may erroneously
acknowledge messages, leading to message loss. Your subscriber is not
set up to retain acknowledged messages. What should you do to ensure
that you can recover from errors after deployment?

By creating a snapshot of the subscription before deploying new code, you can preserve the state of unacknowledged messages. If after deployment
you find that the new subscriber code is erroneously acknowledging
messages, you can use the Seek operation with the snapshot to reset the
subscription's acknowledgment state to the time the snapshot was
created. This would effectively re-deliver messages available since the
snapshot, ensuring you can recover from errors. This approach does not
require setting up a local emulator and directly addresses the concern of
message loss due to erroneous acknowledgments.
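A minimal sketch of the snapshot-and-seek flow with the Pub/Sub Python client; the project, subscription, and snapshot names are illustrative assumptions.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Illustrative project/subscription/snapshot names; replace with your own.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")
snapshot_path = subscriber.snapshot_path("my-project", "pre-deploy-snapshot")

# 1. Before deploying the new subscriber code, capture the current ack state.
subscriber.create_snapshot(
    request={"name": snapshot_path, "subscription": subscription_path}
)

# 2. If the new code acknowledges messages erroneously, roll the subscription back.
subscriber.seek(request={"subscription": subscription_path, "snapshot": snapshot_path})
```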

81.You work for a large real estate firm and are preparing 6 TB of home sales
data to be used for machine learning. You will use SQL to transform the
data and use BigQuery ML to create a machine learning model. You plan to
use the model for predictions against a raw dataset that has not been
transformed. How should you set up your workflow in order to prevent
skew at prediction time?

Option A is the correct answer because it leverages BigQuery ML's built-in mechanisms (TRANSFORM and ML.PREDICT) to guarantee that the
exact same preprocessing steps are applied during both training and
prediction, preventing training-serving skew in the most reliable and
maintainable way. And, the use of ML.PREDICT in option A is needed to
generate predictions.
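A minimal sketch of the TRANSFORM-based workflow in BigQuery ML, issued through the Python client; the dataset, table, column, and model names are illustrative assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Preprocessing is declared once in the TRANSFORM clause of the model.
# Illustrative dataset/table/column names; replace with your own.
client.query("""
CREATE OR REPLACE MODEL my_dataset.home_price_model
TRANSFORM(
  ML.STANDARD_SCALER(square_feet) OVER () AS square_feet_scaled,
  sale_price AS label
)
OPTIONS(model_type = 'linear_reg', input_label_cols = ['label']) AS
SELECT square_feet, sale_price FROM my_dataset.home_sales
""").result()

# ML.PREDICT re-applies the same TRANSFORM to the raw, untransformed rows.
predictions = client.query("""
SELECT * FROM ML.PREDICT(MODEL my_dataset.home_price_model,
                         TABLE my_dataset.new_home_sales_raw)
""").result()
```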

82.You are analyzing the price of a company's stock. Every 5 seconds, you
need to compute a moving average of the past 30 seconds' worth of data.
You are reading data from Pub/Sub and using DataFlow to conduct the
analysis. How should you set up your windowed pipeline?
Since you need to compute a moving average of the past 30 seconds'
worth of data every 5 seconds, a sliding window is appropriate. A sliding
window allows overlapping intervals and is well-suited for computing
rolling aggregates.
Window Duration: The window duration should be set to 30 seconds to
cover the required 30 seconds' worth of data for the moving average
calculation.
Window Period: The window period or sliding interval should be set to 5
seconds to move the window every 5 seconds and recalculate the moving
average with the latest data.
Trigger: The trigger should be set to AfterWatermark.pastEndOfWindow()
to emit the computed moving average results when the watermark
advances past the end of the window. This ensures that all data within the
window is considered before emitting the result.

83.You are designing a pipeline that publishes application events to a Pub/Sub topic. Although message ordering is not important, you need to be able to
aggregate events across disjoint hourly intervals before loading the results
to BigQuery for analysis. What technology should you use to process and
load this data to BigQuery while ensuring that it will scale with large
volumes of events?

Option D is the best choice. Dataflow's streaming mode, combined with tumbling windows, provides the perfect combination of continuous
processing, accurate hourly aggregations, scalability, and seamless
integration with BigQuery. It's the most robust and efficient solution for
this real-time data processing requirement. The other options either
introduce latency, have scalability limitations, or are not well-suited for the
specific aggregation requirement.

84.You work for a large financial institution that is planning to use Dialogflow
to create a chatbot for the company's mobile app. You have reviewed old
chat logs and tagged each conversation for intent based on each
customer's stated intention for contacting customer service. About 70% of
customer requests are simple requests that are solved within 10 intents.
The remaining 30% of inquiries require much longer, more complicated
requests. Which intents should you automate first?

This is the best approach because it follows the Pareto principle (80/20
rule). By automating the most common 10 intents that address 70% of
customer requests, you free up the live agents to focus their time and
effort on the more complex 30% of requests that likely require human
insight/judgement. Automating the simpler high-volume requests first
allows the chatbot to handle those easily, efficiently routing only the
trickier cases to agents. This makes the best use of automation for high-
volume simple cases and human expertise for lower-volume complex
issues.

85.Your company is implementing a data warehouse using BigQuery, and you have been tasked with designing the data model. You move your on-
premises sales data warehouse with a star data schema to BigQuery but
notice performance issues when querying the data of the past 30 days.
Based on Google's recommended practices, what should you do to speed
up the query without increasing storage costs?

BigQuery supports partitioned tables, where the data is divided into smaller, manageable portions based on a chosen column (e.g., transaction
date). By partitioning the data based on the transaction date, BigQuery
can efficiently query only the relevant partitions that contain data for the
past 30 days, reducing the amount of data that needs to be
scanned. Partitioning does not increase storage costs. It organizes existing
data in a more structured manner, allowing for better query performance
without any additional storage expenses.
86.You have uploaded 5 years of log data to Cloud Storage. A user reported
that some data points in the log data are outside of their expected ranges,
which indicates errors. You need to address this issue and be able to run
the process again in the future while keeping the original data for
compliance reasons. What should you do?

Option C, using a Dataflow workflow to read from Cloud Storage, correct the data by setting default values, and write the corrected data to a new
location in Cloud Storage, perfectly addresses all the requirements of the
problem: correcting errors, preserving original data, and creating a
repeatable process.

87.You want to rebuild your batch pipeline for structured data on Google
Cloud. You are using PySpark to conduct data transformations at scale, but
your pipelines are taking over twelve hours to run. To expedite
development and pipeline run time, you want to use a serverless tool and
SQL syntax. You have already moved your raw data into Cloud Storage.
How should you build the pipeline on Google Cloud while meeting speed
and processing requirements?

BigQuery SQL provides a fast, scalable, and serverless method for transforming structured data, and it is easier to develop than PySpark.
Directly ingesting the raw Cloud Storage data into BigQuery avoids
needing an intermediate processing cluster like Dataproc.
Transforming the data via BigQuery SQL queries will be faster than
PySpark, especially since the data is already loaded into BigQuery.
Writing the transformed results to a new BigQuery table keeps the original
raw data intact and provides a clean output.
So migrating to BigQuery SQL for transformations provides a fully
managed serverless architecture that can significantly expedite
development and reduce pipeline runtime versus PySpark. The ability to
avoid clusters and conduct transformations completely within BigQuery is
the most efficient approach here.
88.You are testing a Dataflow pipeline to ingest and transform text files. The
files are compressed gzip, errors are written to a dead-letter queue, and
you are using SideInputs to join data. You noticed that the pipeline is
taking longer to complete than expected; what should you do to expedite
the Dataflow job?

The core issue is the use of SideInputs for joining data, leading to
materialization and replication overhead. CoGroupByKey provides a more
efficient, parallel approach to join operations in Dataflow by avoiding
materialization and reducing replication. Therefore, switching to
CoGroupByKey is the most effective way to expedite the Dataflow job in
this scenario.
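A minimal Apache Beam (Python) sketch of replacing a side-input join with CoGroupByKey; the keyed collections and field names here are illustrative assumptions rather than the pipeline from the question.

```python
import apache_beam as beam

def merge(element):
    customer_id, grouped = element
    # grouped holds one iterable per tagged input collection.
    for order in grouped["orders"]:
        for name in grouped["customers"]:
            yield {"customer_id": customer_id, "order": order, "customer_name": name}

with beam.Pipeline() as p:
    # Illustrative in-memory inputs; in practice these would come from your sources.
    orders = p | "Orders" >> beam.Create([("c1", "order-100"), ("c2", "order-200")])
    customers = p | "Customers" >> beam.Create([("c1", "Alice"), ("c2", "Bob")])

    joined = (
        {"orders": orders, "customers": customers}
        | "Join" >> beam.CoGroupByKey()   # parallel join, no side-input materialization
        | "Merge" >> beam.FlatMap(merge)
    )
```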

89.You are building a real-time prediction engine that streams files, which
may contain PII (personal identifiable information) data, into Cloud Storage
and eventually into BigQuery. You want to ensure that the sensitive data is
masked but still maintains referential integrity, because names and emails
are often used as join keys. How should you use the Cloud Data Loss
Prevention API (DLP API) to ensure that the PII data is not accessible by
unauthorized individuals?

The best approach is to use format-preserving encryption or tokenization from the start. This ensures that sensitive data is protected while
maintaining the format and structure needed for referential integrity and
downstream processing. Option D specifically highlights the importance of
format preservation, making it the most suitable answer.

90.Business owners at your company have given you a database of bank transactions. Each row contains the user ID, transaction type, transaction
location, and transaction amount. They ask you to investigate what type of
machine learning can be applied to the data. Which three machine
learning applications can you use? (Choose three.)

i) Detecting fraudulent transactions is anomaly detection, which falls under unsupervised learning.
ii) Transactions can be grouped by type and other attributes using a clustering algorithm.
iii) Using location as a label, a supervised classification model can be trained to predict the transaction location.

91.You are migrating an application that tracks library books and information
about each book, such as author or year published, from an on-premises
data warehouse to BigQuery. In your current relational database, the
author information is kept in a separate table and joined to the book
information on a common key. Based on Google's recommended practice
for schema design, how would you structure the data to ensure optimal
speed of queries about the author of each book that has been borrowed?

Best practice: Use nested and repeated fields to denormalize data storage and increase query performance.
Denormalization is a common strategy for increasing read performance for
relational datasets that were previously normalized. The recommended
way to denormalize data in BigQuery is to use nested and repeated fields.
It's best to use this strategy when the relationships are hierarchical and
frequently queried together, such as in parent-child relationships.
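A minimal sketch of such a denormalized schema using the BigQuery Python client, with the former authors table nested as a repeated record on the books table; the project, dataset, and field names are illustrative assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("book_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("title", "STRING"),
    bigquery.SchemaField("year_published", "INTEGER"),
    # The separate authors table becomes a nested, repeated field.
    bigquery.SchemaField(
        "authors", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("name", "STRING"),
            bigquery.SchemaField("birth_year", "INTEGER"),
        ],
    ),
]

# Illustrative project/dataset/table names; replace with your own.
table = bigquery.Table("my-project.library.books", schema=schema)
client.create_table(table)  # queries about a book's authors no longer need a join
```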

92.You need to give new website users a globally unique identifier (GUID)
using a service that takes in data points and returns a GUID. This data is
sourced from both internal and external systems via HTTP calls that you
will make via microservices within your pipeline. There will be tens of
thousands of messages per second, and the processing can be multi-threaded. You worry about the backpressure on the system. How should you design
your pipeline to minimize that backpressure?
Option D is the best approach to minimize backpressure in this scenario.
By batching the jobs into 10-second increments, you can throttle the rate
at which requests are made to the external GUID service. This prevents
too many simultaneous requests from overloading the service.

93.You are migrating your data warehouse to Google Cloud and decommissioning your on-premises data center. Because this is a priority
for your company, you know that bandwidth will be made available for the
initial data load to the cloud. The files being transferred are not large in
number, but each file is 90 GB. Additionally, you want your transactional
systems to continually update the warehouse on Google Cloud in real
time. What tools should you use to migrate the data and ensure that it
continues to write to your warehouse?

Considering the requirement for handling large files and the need for real-
time data integration, Option C (gsutil for the migration; Pub/Sub and
Dataflow for the real-time updates) seems to be the most appropriate.
gsutil will effectively handle the large file transfers, while Pub/Sub and
Dataflow provide a robust solution for real-time data capture and
processing, ensuring continuous updates to your warehouse on Google
Cloud.

94.You are using Bigtable to persist and serve stock market data for each of
the major indices. To serve the trading application, you need to access
only the most recent stock prices that are streaming in. How should you
design your row key and tables to ensure that you can access the data
with the simplest query?
A single table for all indices keeps the structure simple.
Using a reverse timestamp as part of the row key ensures that the most
recent data comes first in the sorted order. This design is beneficial for
quickly accessing the latest data.
For example, you can compute a reverse timestamp (for instance, a large constant minus the event timestamp in milliseconds) and append it to the row key, ensuring newer values are sorted lexicographically before older ones.
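A minimal Python sketch of a reverse-timestamp row key; the index symbol and key layout are illustrative assumptions.

```python
import time

# Any constant larger than every timestamp you will ever write (here, in milliseconds).
MAX_TIMESTAMP_MS = 10**13

def row_key(index_symbol: str, event_time_ms: int) -> bytes:
    reverse_ts = MAX_TIMESTAMP_MS - event_time_ms
    # Newer events produce smaller reverse timestamps, so they sort first in Bigtable.
    return f"{index_symbol}#{reverse_ts:013d}".encode("utf-8")

# Illustrative index symbol; replace with your own identifiers.
key = row_key("NASDAQ", int(time.time() * 1000))
# A simple prefix scan on "NASDAQ#" now returns the most recent prices first.
```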

95.You are building a report-only data warehouse where the data is streamed
into BigQuery via the streaming API. Following Google's best practices, you
have both a staging and a production table for the data. How should you
design your data loading to ensure that there is only one master dataset
without affecting performance on either the ingestion or reporting pieces?

Following common extract, transform, load (ETL) best practices, we used a staging table and a separate production table so that we could load data
into the staging table without impacting users of the data. The design we
created based on ETL best practices called for first deleting all the records
from the staging table, loading the staging table, and then replacing the
production table with the contents.
When using the streaming API, the BigQuery streaming buffer remains
active for about 30 to 60 minutes or more after use, which means that you
can’t delete or change data during that time. Since we used the streaming
API, we scheduled the load every three hours to balance getting data into
BigQuery quickly and being able to subsequently delete the data from the
staging table during the load process.

96.You issue a new batch job to Dataflow. The job starts successfully,
processes a few elements, and then suddenly fails and shuts down. You
navigate to the Dataflow monitoring interface where you find errors
related to a particular DoFn in your pipeline. What is the most likely cause
of the errors?

While your job is running, you might encounter errors or exceptions in your
worker code. These errors generally mean that the DoFns in your pipeline
code have generated unhandled exceptions, which result in failed tasks in
your Dataflow job.
Exceptions in user code (for example, your DoFn instances) are reported in
the Dataflow monitoring interface.

97.Your new customer has requested daily reports that show their net
consumption of Google Cloud compute resources and who used the
resources. You need to quickly and efficiently generate these daily reports.
What should you do?

Integration with BigQuery: BigQuery is a powerful tool for analyzing large datasets. By exporting Cloud Logging data directly to BigQuery, you can
leverage its fast querying capabilities and advanced analysis features.
Automated Daily Exports: Setting up automated daily exports to BigQuery
streamlines the reporting process, ensuring that data is consistently and
efficiently transferred.
Creating Views for Specific Filters: By creating views in BigQuery that filter
data by project, log type, resource, and user, you can tailor the reports to
the specific needs of your customer. Views also simplify repeated analysis
by encapsulating complex SQL queries.
Efficiency and Scalability: This method is highly efficient and scalable,
handling large volumes of data without the manual intervention required
for CSV exports and data cleansing.

98.The Development and External teams have the project viewer Identity and
Access Management (IAM) role in a folder named Visualization. You want
the Development Team to be able to read data from both Cloud Storage
and BigQuery, but the External Team should only be able to read data from
BigQuery. What should you do?
Development team: needs to access both Cloud Storage and BQ ->
therefore we put the Development team inside a perimeter so it can
access both the Cloud Storage and the BQ
External team: allowed to access only BQ -> therefore we put Cloud
Storage behind the restricted API and leave the external team outside of
the perimeter, so it can access BQ, but is prohibited from accessing the
Cloud Storage

99.Your startup has a web application that currently serves customers out of a
single region in Asia. You are targeting funding that will allow your startup
to serve customers globally. Your current goal is to optimize for cost, and
your post-funding goal is to optimize for global presence and performance.
You must use a native JDBC driver. What should you do?
This option allows for optimization for cost initially with a single region
Cloud Spanner instance, and then optimization for global presence and
performance after funding with multi-region instances.
Cloud Spanner supports native JDBC drivers and is horizontally scalable,
providing very high performance. A single region instance minimizes costs
initially. After funding, multi-region instances can provide lower latency
and high availability globally.
Cloud SQL does not scale as well and has higher costs for multiple high
availability regions. Bigtable does not support JDBC drivers natively.
Therefore, Spanner is the best choice here for optimizing both for cost
initially and then performance and availability globally post-funding.

100. You need to migrate 1 PB of data from an on-premises data center to Google Cloud. Data transfer time during the migration should take only
a few hours. You want to follow Google-recommended practices to
facilitate the large data transfer over a secure connection. What should
you do?

Cloud Interconnect provides a dedicated private connection between on-prem and Google Cloud for high bandwidth (up to 100 Gbps) and low
latency. This facilitates large, fast data transfers.
Storage Transfer Service supports parallel data transfers over Cloud
Interconnect. It can transfer petabyte-scale datasets faster by transferring
objects in parallel.
Storage Transfer Service uses HTTPS encryption in transit and at rest by
default for secure data transfers.
It follows Google-recommended practices for large data migrations vs ad
hoc methods like gsutil or scp.
The other options would take too long for a 1 PB transfer (VPN capped at 3
Gbps, manual transfers) or introduce extra steps like batching and
checksums. Cloud Interconnect + Storage Transfer is the recommended
Google solution.

101. Your company's on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would
require 50 TB of Google Persistent Disk per node. The CIO is concerned
about the cost of using that much block storage. You want to minimize the
storage cost of the migration. What should you do?
Moving the data to Cloud Storage directly addresses the CIO's concern
about storage costs. It's the most cost-effective, scalable, and easily
integrated solution for this scenario. While other options might offer some
benefits, they don't directly tackle the core issue of high Persistent Disk
costs like Option A does.

102. You are loading CSV files from Cloud Storage to BigQuery. The files
have known data quality issues, including mismatched data types, such as
STRINGs and INT64s in the same column, and inconsistent formatting of
values such as phone numbers or addresses. You need to create the data
pipeline to maintain data quality and perform the required cleansing and
transformation. What should you do?

Data Fusion is the best choice for this scenario because it provides a
comprehensive platform for building and managing data pipelines,
including data quality features and pre-built transformations for handling
the specific data issues in your CSV files. It simplifies the process and
reduces the amount of manual coding required compared to using SQL-
based approaches.

103. You are developing a new deep learning model that predicts a
customer's likelihood to buy on your ecommerce site. After running an
evaluation of the model against both the original training data and new
test data, you find that your model is overfitting the data. You want to
improve the accuracy of the model when predicting new data. What
should you do?
To improve the accuracy of a model that's overfitting, the most effective
strategies are to:
- Increase the amount of training data: This helps the model learn more generalizable patterns.
- Decrease the number of input features: This helps the model focus on the most relevant information and avoid learning noise.
Therefore, option B is the most suitable approach to address overfitting
and improve the model's accuracy on new data.

104. You are implementing a chatbot to help an online retailer streamline their customer service. The chatbot must be able to respond to both text and voice inquiries. You are looking for a low-code or no-code option, and
you want to be able to easily train the chatbot to provide answers to
keywords. What should you do?

Dialogflow is the most appropriate choice because it's a purpose-built platform for creating conversational interfaces. It handles both text and
voice input, provides easy training for keyword recognition, and minimizes
the need for coding, aligning perfectly with the requirements of the
problem.

105. An aerospace company uses a proprietary data format to store its flight data. You need to connect this new data source to BigQuery and
stream the data into BigQuery. You want to efficiently import the data into
BigQuery while consuming as few resources as possible. What should you
do?

Option D provides the most efficient and streamlined approach for this
scenario. By using an Apache Beam custom connector with Dataflow and
Avro format, you can directly read, transform, and stream the proprietary
data into BigQuery while minimizing resource consumption and
maximizing performance.
106. An online brokerage company requires a high volume trade
processing architecture. You need to create a secure queuing system that
triggers jobs. The jobs will run in Google Cloud and call the company's
Python API to execute trades. You need to efficiently implement a solution.
What should you do?

Assume the company wants to buy immediately, within the same second, when a stock price moves up or down. The trading system is connected to Pub/Sub as a sink; Pub/Sub then immediately pushes each message to a subscriber (a Cloud Function) that calls the company's internal Python API to execute the trade.

107. Your company wants to be able to retrieve large result sets of medical information from your current system, which has over 10 TBs in
the database, and store the data in new tables for further query. The
database must have a low-maintenance architecture and be accessible via
SQL. You need to implement a cost-effective solution that can support data
analytics for large result sets. What should you do?

The key reasons why BigQuery fits the requirements:
- It is a fully managed data warehouse built to scale to handle massive datasets and perform fast SQL analytics.
- It has a low maintenance architecture with no infrastructure to manage.
- SQL capabilities allow easy querying of the medical data.
- Output destinations allow configurable caching for fast retrieval of large result sets.
- It provides a very cost-effective solution for these large scale analytics use cases.
In contrast, Cloud Spanner and Cloud SQL would not scale as cost
effectively for 10TB+ data volumes. Self-managed MySQL on Compute
Engine also requires more maintenance. Hence, leveraging BigQuery as a
fully managed data warehouse is the optimal solution here.

108. You have 15 TB of data in your on-premises data center that you
want to transfer to Google Cloud. Your data changes weekly and is stored
in a POSIX-compliant source. The network operations team has granted
you 500 Mbps bandwidth to the public internet. You want to follow Google-
recommended practices to reliably transfer your data to Google Cloud on a
weekly basis. What should you do?

Like gsutil, Storage Transfer Service for on-premises data enables transfers
from network file system (NFS) storage to Cloud Storage. Although gsutil
can support small transfer sizes (up to 1 TB), Storage Transfer Service for
on-premises data is designed for large-scale transfers (up to petabytes of
data, billions of files).

109. You are designing a system that requires an ACID-compliant database. You must ensure that the system requires minimal human
intervention in case of a failure. What should you do?

Cloud SQL for PostgreSQL provides full ACID compliance, unlike Bigtable
which provides only atomicity and consistency guarantees.
Enabling high availability removes the need for manual failover as Cloud
SQL will automatically failover to a standby replica if the leader instance
goes down.
Point-in-time recovery in MySQL requires manual intervention to restore
data if needed.
BigQuery does not provide transactional guarantees required for an ACID
database.
Therefore, a Cloud SQL for PostgreSQL instance with high availability
meets the ACID and minimal intervention requirements best. The
automatic failover will ensure availability and uptime without
administrative effort.

110. You are implementing workflow pipeline scheduling using open source-based tools and Google Kubernetes Engine (GKE). You want to use
a Google managed service to simplify and automate the task. You also
want to accommodate Shared VPC networking considerations. What
should you do?
Shared VPC requires that you designate a host project to which networks
and subnetworks belong and a service project, which is attached to the
host project. When Cloud Composer participates in a Shared VPC, the
Cloud Composer environment is in the service project.

111. You are using BigQuery and Data Studio to design a customer-facing
dashboard that displays large quantities of aggregated data. You expect a
high volume of concurrent users. You need to optimize the dashboard to
provide quick visualizations with minimal latency. What should you do?

In BigQuery, materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency.
BigQuery leverages precomputed results from materialized views and
whenever possible reads only delta changes from the base tables to
compute up-to-date results. Materialized views can be queried directly or
can be used by the BigQuery optimizer to process queries to the base
tables.
Queries that use materialized views are generally faster and consume
fewer resources than queries that retrieve the same data only from the
base tables. Materialized views can significantly improve the performance
of workloads that have the characteristic of common and repeated
queries.
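A minimal sketch of a materialized view for the dashboard aggregation, created through the BigQuery Python client; the dataset, table, and column names are illustrative assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# The aggregation backing the dashboard is precomputed and incrementally refreshed.
# Illustrative dataset/table/column names; replace with your own.
client.query("""
CREATE MATERIALIZED VIEW my_dataset.daily_sales_mv AS
SELECT
  customer_region,
  DATE(order_time) AS order_date,
  SUM(amount) AS total_sales
FROM my_dataset.orders
GROUP BY customer_region, order_date
""").result()
```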
112. You are building a model to make clothing recommendations. You
know a user's fashion preference is likely to change over time, so you build
a data pipeline to stream new data back to the model as it becomes
available. How should you use this data to train the model?

This approach allows the model to benefit from both the historical data
(existing data) and the new data, ensuring that it adapts to changing
preferences while retaining knowledge from the past. By combining both
types of data, the model can learn to make recommendations that are up-
to-date and relevant to users' evolving preferences.

113. You work for a car manufacturer and have set up a data pipeline
using Google Cloud Pub/Sub to capture anomalous sensor events. You are
using a push subscription in Cloud Pub/Sub that calls a custom HTTPS
endpoint that you have created to take action of these anomalous events
as they occur. Your custom HTTPS endpoint keeps getting an inordinate
amount of duplicate messages. What is the most likely cause of these
duplicate messages?

Pub/Sub guarantees at-least-once message delivery, which means that occasional duplicates are to be expected. However, a high rate of
duplicates may indicate that the client is not acknowledging messages
within the configured ack_deadline_seconds, and Pub/Sub is retrying the
message delivery. This can be observed in the monitoring metrics
pubsub.googleapis.com/subscription/pull_ack_message_operation_count
for pull subscriptions, and
pubsub.googleapis.com/subscription/push_request_count for push
subscriptions. Look for elevated expired or webhook_timeout values in
the /response_code. This is particularly likely if there are many small
messages, since Pub/Sub may batch messages internally and a partially
acknowledged batch will be fully redelivered.
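If the cause is confirmed, one mitigation is to raise the subscription's acknowledgement deadline so the HTTPS endpoint has more time to respond. Below is a minimal sketch with the Pub/Sub Python admin client; the project and subscription names are illustrative assumptions.

```python
from google.cloud import pubsub_v1
from google.protobuf import field_mask_pb2

subscriber = pubsub_v1.SubscriberClient()
# Illustrative project/subscription names; replace with your own.
subscription_path = subscriber.subscription_path("my-project", "anomaly-push-sub")

subscription = pubsub_v1.types.Subscription(
    name=subscription_path,
    ack_deadline_seconds=60,  # give the custom HTTPS endpoint more time to respond
)
update_mask = field_mask_pb2.FieldMask(paths=["ack_deadline_seconds"])

subscriber.update_subscription(
    request={"subscription": subscription, "update_mask": update_mask}
)
```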

114. Government regulations in the banking industry mandate the protection of clients' personally identifiable information (PII). Your
company requires PII to be access controlled, encrypted, and compliant
with major data protection standards. In addition to using Cloud Data Loss
Prevention (Cloud DLP), you want to follow Google-recommended practices
and use service accounts to control access to PII. What should you do?

Google Cloud Storage is designed to comply with major data protection standards. Creating multiple service accounts and attaching them to IAM
groups provides granular control over who has access to the data. This
approach is aligned with the principle of least privilege, a security best
practice where a user is given the minimum levels of access necessary to
complete their tasks.

115. You need to migrate a Redis database from an on-premises data center to a Memorystore for Redis instance. You want to follow Google-
recommended practices and perform the migration for minimal cost, time
and effort. What should you do?

The import and export feature uses the native RDB snapshot feature of
Redis to import data into or export data out of a Memorystore for Redis
instance. The use of the native RDB format prevents lock-in and makes it
very easy to move data within Google Cloud or outside of Google Cloud.
Import and export uses Cloud Storage buckets to store RDB files.

116. Your platform on your on-premises environment generates 100 GB of data daily, composed of millions of structured JSON text files. Your on-
premises environment cannot be accessed from the public internet. You
want to use Google Cloud products to query and explore the platform
data. What should you do?

You need a service to transfer data from on-premises to Cloud Storage, so Storage Transfer Service is the best option; you can also configure the network so that the data flows through a private network.
Cloud Scheduler, on the other hand, is mostly used for automation. It can schedule a service, but it cannot transfer data on its own.

117. A TensorFlow machine learning model on Compute Engine virtual machines (n2-standard-32) takes two days to complete training. The
model has custom TensorFlow operations that must run partially on a CPU.
You want to reduce the training time in a cost-effective manner. What
should you do?
The best way to reduce the TensorFlow training time in a cost-effective
manner is to use a VM with a GPU hardware accelerator. TensorFlow can
take advantage of GPUs to significantly speed up training time for many
models.
Specifically, option C is the best choice.
Changing the VM to another standard type like n2-highmem-32 or e2-
standard-32 (options A and B) may provide some improvement, but likely
not a significant speedup.

118. You want to create a machine learning model using BigQuery ML and create an endpoint for hosting the model using Vertex AI. This will
enable the processing of continuous streaming data in near-real time from
multiple vendors. The data may contain invalid values. What should you
do?

Option D provides the most comprehensive and scalable solution. It leverages Pub/Sub for reliable ingestion, Dataflow for processing and
cleaning, and BigQuery for storing the prepared data for your ML model.
This approach ensures data quality, efficient streaming, and seamless
integration with BigQuery ML and Vertex AI.

119. You have a data processing application that runs on Google Kubernetes Engine (GKE). Containers need to be launched with their latest
available configurations from a container registry. Your GKE nodes need to
have GPUs, local SSDs, and 8 Gbps bandwidth. You want to efficiently
provision the data processing infrastructure and manage the deployment
process. What should you do?

120. You need ads data to serve AI models and historical data for
analytics. Longtail and outlier data points need to be identified. You want
to cleanse the data in near-real time before running it through AI models.
What should you do?

Dataflow for Real-Time Processing: Dataflow allows you to process data in near-real time, making it well-suited for identifying longtail and outlier
data points as they occur. You can use Dataflow to implement custom data
cleansing and outlier detection algorithms that operate on streaming data.
BigQuery as a Sink: Using BigQuery as a sink allows you to store the
cleaned and processed data efficiently for further analysis or use in AI
models. Dataflow can write the cleaned data to BigQuery tables, enabling
seamless integration with downstream processes.

121. You are collecting IoT sensor data from millions of devices across
the world and storing the data in BigQuery. Your access pattern is based
on recent data, filtered by location_id and device_version with the
following query:
You want to optimize your queries for cost and performance. How should
you structure your data?

Partitioning by create_date:
Aligns with query pattern: Filters for recent data based on create_date, so
partitioning by this column allows BigQuery to quickly narrow down the
data to scan, reducing query costs and improving performance.
Manages data growth: Partitioning effectively segments data by date,
making it easier to manage large datasets and optimize storage costs.
Clustering by location_id and device_version:
Enhances filtering: because queries frequently filter by location_id and device_version, clustering physically co-locates related data within partitions, further reducing scan time and improving performance.

122. A live TV show asks viewers to cast votes using their mobile phones.
The event generates a large volume of data during a 3-minute period. You
are in charge of the "Voting infrastructure" and must ensure that the
platform can handle the load and that all votes are processed. You must
display partial results while voting is open. After voting closes, you need to
count the votes exactly once while optimizing cost. What should you do?

- Google Cloud Pub/Sub can manage the high-volume data ingestion.
- Google Cloud Dataflow can efficiently process and route data to both Bigtable and BigQuery.
- Bigtable is excellent for handling high-throughput writes and reads,
making it suitable for real-time vote tallying.
- BigQuery is ideal for exact vote counting and deeper analysis once voting
concludes.

123. A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time. This is then loaded into BigQuery.
Analysts in your company want to query the tracking data in BigQuery to
analyze geospatial trends in the lifecycle of a package. The table was
originally created with ingest-date partitioning. Over time, the query
processing time has increased. You need to copy all the data to a new
clustered table. What should you do?

Query Focus: Analysts are interested in geospatial trends within individual package lifecycles. Clustering by package-tracking ID physically co-locates
related data, significantly improving query performance for these
analyses.
Addressing Slow Queries: Clustering addresses the query slowdown issue
by optimizing data organization for the specific query patterns.
Partitioning vs. Clustering:
Partitioning: Divides data into segments based on a column's values,
primarily for managing large datasets and optimizing query costs.
Clustering: Organizes data within partitions for faster querying based on
specific columns.

124. Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data
includes a payload of several fields and the timestamp of the
transmission. If there are any concerns about a transmission, the system
re-transmits the data. How should you deduplicate the data most efficiently?

Assigning GUIDs to each data entry is the most efficient way to deduplicate data in this scenario. GUIDs ensure uniqueness, simplify the
deduplication process, and require minimal computational overhead. This
makes it the ideal solution for handling re-transmissions and ensuring data
quality.
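A minimal Python sketch of the idea: the sender attaches a GUID once, and the ingestion side drops any entry whose GUID has already been seen. The payload structure and in-memory set are illustrative assumptions; a production pipeline would keep the seen-GUID state in a durable store or use a streaming dedup transform.

```python
import uuid

def tag_with_guid(payload: dict) -> dict:
    # The sender assigns the GUID once; re-transmissions reuse the same GUID.
    payload.setdefault("guid", str(uuid.uuid4()))
    return payload

def deduplicate(entries):
    # Illustrative in-memory dedup; real pipelines would persist this state.
    seen = set()
    for entry in entries:
        if entry["guid"] in seen:
            continue  # drop the re-transmitted copy
        seen.add(entry["guid"])
        yield entry
```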
125. You are designing a data mesh on Google Cloud with multiple
distinct data engineering teams building data products. The typical data
curation design pattern consists of landing files in Cloud Storage,
transforming raw data in Cloud Storage and BigQuery datasets, and
storing the final curated data product in BigQuery datasets. You need to
configure Dataplex to ensure that each team can access only the assets
needed to build their data products. You also need to ensure that teams
can easily share the curated data product. What should you do?

Option D best embodies the data mesh principles by providing strong isolation, clear ownership, and simplified sharing. Separate virtual lakes
and multiple zones within each lake allow for granular access control and
organization, ensuring that each team can independently manage their
data product while following the typical data curation workflow.

126. You are using BigQuery with a multi-region dataset that includes a
table with the daily sales volumes. This table is updated multiple times per
day. You need to protect your sales table in case of regional failures with a
recovery point objective (RPO) of less than 24 hours, while keeping costs
to a minimum. What should you do?

Option A, scheduling a daily export of the sales table to a redundant Cloud Storage bucket, provides the best balance of meeting the RPO, minimizing
costs, and leveraging existing Google Cloud services for disaster recovery.
It's a simple, efficient, and cost-effective approach to protecting your
critical sales data.

127. You are troubleshooting your Dataflow pipeline that processes data
from Cloud Storage to BigQuery. You have discovered that the Dataflow
worker nodes cannot communicate with one another. Your networking
team relies on Google Cloud network tags to define firewall rules. You need
to identify the issue while following Google-recommended networking
security practices. What should you do?

The best approach would be to check if there is a firewall rule allowing traffic on TCP ports 12345 and 12346 for the Dataflow network tag.
Dataflow uses TCP ports 12345 and 12346 for communication between
worker nodes. Using network tags and associated firewall rules is a
Google-recommended security practice for controlling access between
Compute Engine instances like Dataflow workers.
So the key things to check would be:
1. Ensure your Dataflow pipeline is using the Dataflow network tag on the
worker nodes. This tag is applied by default unless overridden.
2. Check if there is a firewall rule allowing TCP 12345 and 12346 ingress
and egress traffic for instances with the Dataflow network tag. If not, add
the rule.

128. Your company's customer_order table in BigQuery stores the order history for 10 million customers, with a table size of 10 PB. You need to
create a dashboard for the support team to view the order history. The
dashboard has two filters, country_name and username. Both are string
data types in the BigQuery table. When a filter is applied, the dashboard
fetches the order history from the table and displays the query results.
However, the dashboard is slow to show the results when applying the
filters to the following query:

How should you redesign the BigQuery table to support faster access?

If country were represented by an integer code, then partitioning by country and clustering by username would be a better solution. Because the country code is a string, the best available solution is option A: cluster the table by the country and username fields.
129. You have a Standard Tier Memorystore for Redis instance deployed
in a production environment. You need to simulate a Redis instance
failover in the most accurate disaster recovery situation, and ensure that
the failover has no impact on production data. What should you do?

• The failover should be tested in a separate development environment, not production, to avoid impacting real data.
• The force-data-loss mode will simulate a full failover and restart, which is
the most accurate test of disaster recovery.
• Limited-data-loss mode only fails over reads which does not fully test
write capabilities.
• Increasing replicas in production and failing over (C) risks losing real
production data.
• Failing over production (D) also risks impacting real data and traffic.
So option B isolates the test from production and uses the most rigorous
failover mode to fully validate disaster recovery capabilities.

130. You are administering a BigQuery dataset that uses a customer-managed encryption key
(CMEK). You need to share the dataset with a
partner organization that does not have access to your CMEK. What should
you do?

- Create a copy of the necessary tables into a new dataset that doesn't use
CMEK, ensuring the data is accessible without requiring the partner to
have access to the encryption key.
- Analytics Hub can then be used to share this data securely and efficiently
with the partner organization, maintaining control and governance over
the shared data.

131. You are developing an Apache Beam pipeline to extract data from a
Cloud SQL instance by using JdbcIO. You have two projects running in
Google Cloud. The pipeline will be deployed and executed on Dataflow in
Project A. The Cloud SQL instance is running in Project B and does not
have a public IP address. After deploying the pipeline, you noticed that the
pipeline failed to extract data from the Cloud SQL instance due to
connection failure. You verified that VPC Service Controls and shared VPC
are not in use in these projects. You want to resolve this error while
ensuring that the data does not go through the public internet. What
should you do?

To allow the Dataflow workers in Project A to connect to the private Cloud SQL instance in
Project B, you need to set up VPC Network Peering
between the two projects.
Then create a Compute Engine instance without external IP in Project B on
the peered subnet. This instance can serve as a proxy server to connect to
the private Cloud SQL instance.
The Dataflow workers can connect through the peered network to the
proxy instance, which then connects to Cloud SQL. This allows accessing
the private Cloud SQL instance without going over the public internet.
Option A would allow access but still goes over the public internet.
Option B and C would not work since the Cloud SQL instance does not
have a public IP address.
So D is the right approach to resolve the connection issue while keeping
the data private.

132. You have a BigQuery table that contains customer data, including
sensitive information such as names and addresses. You need to share the
customer data with your data analytics and consumer support teams
securely. The data analytics team needs to access the data of all the
customers, but must not be able to access the sensitive data. The
consumer support team needs access to all data columns, but must not be
able to access customers that no longer have active contracts. You
enforced these requirements by using an authorized dataset and policy
tags. After implementing these steps, the data analytics team reports that
they still have access to the sensitive columns. You need to ensure that
the data analytics team does not have access to restricted data. What
should you do? (Choose two.)
The two best answers are D and E. You need to both enforce the policy
tags (E) and remove the broad data viewing permission (D) to effectively
restrict the data analytics team's access to sensitive information. This
combination ensures that the policy tags are actually enforced and that
the team lacks the underlying permissions to bypass those restrictions.

133. You have a Cloud SQL for PostgreSQL instance in Region1 with one read replica in
Region2 and another read replica in Region3. An unexpected event in Region1 requires that
you perform disaster recovery
by promoting a read replica in Region2. You need to ensure that your
application has the same database capacity available before you switch
over the connections. What should you do?

Capacity Restoration: Promoting the Region2 replica makes it the new primary. You need to
replicate from this new primary to maintain
redundancy and capacity. Creating two replicas (Region3, new region)
accomplishes this.
Geographic Distribution: Distributing replicas across regions ensures
availability if another regional event occurs.
Speed: Creating new replicas from the promoted primary is likely faster
than promoting another existing replica.

134. You orchestrate ETL pipelines by using Cloud Composer. One of the
tasks in the Apache Airflow directed acyclic graph (DAG) relies on a third-
party service. You want to be notified when the task does not succeed.
What should you do?
Direct Trigger:
The on_failure_callback parameter is specifically designed to invoke a
function when a task fails, ensuring immediate notification.
Customizable Logic:
You can tailor the notification function to send emails, create alerts, or
integrate with other notification systems, providing flexibility.
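A minimal sketch of such a callback in an Airflow 2 DAG follows; the third-party task,
script path, and recipient address are assumptions.

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.email import send_email

def notify_on_failure(context):
    """Invoked by Airflow when the task fails; context carries task metadata."""
    ti = context["task_instance"]
    send_email(
        to=["data-oncall@example.com"],   # assumed recipient
        subject=f"Task {ti.task_id} failed in DAG {ti.dag_id}",
        html_content=f"Failure at {context['ts']}. Logs: {ti.log_url}",
    )

with DAG(
    dag_id="third_party_sync",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    call_third_party = BashOperator(
        task_id="call_third_party_service",
        bash_command="python /opt/scripts/call_service.py",   # placeholder command
        on_failure_callback=notify_on_failure,
    )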

135. Your company has hired a new data scientist who wants to perform
complicated analyses across very large datasets stored in Google Cloud
Storage and in a Cassandra cluster on Google Compute Engine. The
scientist primarily wants to create labelled data sets for machine learning
projects, along with some visualization tasks. She reports that her laptop is
not powerful enough to perform her tasks and it is slowing her down. You
want to help her perform her tasks. What should you do?

Google Cloud Datalab is a powerful interactive tool for data exploration, analysis, and
machine learning. By deploying it to a VM on Google
Compute Engine, you can provide her with a robust and scalable
environment where she can work with large datasets, create labeled
datasets, and perform data analyses efficiently.

136. You are migrating your on-premises data warehouse to BigQuery. One of the upstream
data sources resides on a MySQL database that runs
in your on-premises data center with no public IP addresses. You want to
ensure that the data ingestion into BigQuery is done securely and does not
go through the public internet. What should you do?
Option B provides the most secure and reliable solution by leveraging
Cloud Interconnect for private connectivity and Datastream's ability to
utilize this connection. It avoids the complexities and security risks
associated with the other options. It ensures that data ingestion happens
over a private connection, fulfilling the requirements of the problem
statement.

137. You store and analyze your relational data in BigQuery on Google
Cloud with all data that resides in US regions. You also have a variety of
object stores across Microsoft Azure and Amazon Web Services (AWS), also
in US regions. You want to query all your data in BigQuery daily with as
little movement of data as possible. What should you do?

BigQuery Omni with BigLake tables is specifically designed for querying data across clouds
without needing to migrate or copy it. It's the most
efficient and cost-effective approach when you want to minimize data
movement and query data in place.

138. You have a variety of files in Cloud Storage that your data science
team wants to use in their models. Currently, users do not have a method
to explore, cleanse, and validate the data in Cloud Storage. You are
looking for a low code solution that can be used by your data science team
to quickly cleanse and explore data within Cloud Storage. What should you
do?
Dataprep is the most suitable option because it's a low-code tool
specifically designed for data exploration, cleansing, and validation
directly within Cloud Storage. It aligns perfectly with the requirements
outlined in the problem statement.

139. You are building an ELT solution in BigQuery by using Dataform. You
need to perform uniqueness and null value checks on your final tables.
What should you do to efficiently integrate these checks into your
pipeline?

- Dataform provides a feature called "assertions," which are essentially SQL-based tests
that you can define to verify the quality of your data.
- Assertions in Dataform are a built-in way to perform data quality checks,
including checking for uniqueness and null values in your tables.

140. A web server sends click events to a Pub/Sub topic as messages. The web server
includes an eventTimestamp attribute in the messages,
which is the time when the click occurred. You have a Dataflow streaming
job that reads from this Pub/Sub topic through a subscription, applies
some transformations, and writes the result to another Pub/Sub topic for
use by the advertising department. The advertising department needs to
receive each message within 30 seconds of the corresponding click
occurrence, but they report receiving the messages late. Your Dataflow
job's system lag is about 5 seconds, and the data freshness is about 40
seconds. Inspecting a few messages show no more than 1 second lag
between their eventTimestamp and publishTime. What is the problem and
what should you do?
In summary, the issue is not with the web server, the Dataflow processing
time itself, or the advertising department's consumption. The problem lies
with the Dataflow job's ability to keep up with the incoming message rate
from the Pub/Sub subscription due to the high data freshness/backlog.

141. Your organization stores customer data in an on-premises Apache Hadoop cluster in
Apache Parquet format. Data is processed on a daily
basis by Apache Spark jobs that run on the cluster. You are migrating the
Spark jobs and Parquet data to Google Cloud. BigQuery will be used on
future transformation pipelines so you need to ensure that your data is
available in BigQuery. You want to use managed services, while minimizing
ETL data processing changes and overhead costs. What should you do?

Option C provides the most direct, efficient, and cost-effective way to meet the
requirements. It minimizes ETL changes, leverages managed
services, and ensures the data is readily available in BigQuery for future
transformations. Moving the data directly to BigQuery simplifies the
architecture and reduces the number of moving parts.

142. Your organization has two Google Cloud projects, project A and
project B. In project A, you have a Pub/Sub topic that receives data from
confidential sources. Only the resources in project A should be able to
access the data in that topic. You want to ensure that project B and any
future project cannot access data in the project A topic. What should you
do?
-It allows us to create a secure boundary around all resources in Project A,
including the Pub/Sub topic.
- It prevents data exfiltration to other projects and ensures that only
resources within the perimeter (Project A) can access the sensitive data.
- VPC Service Controls are specifically designed for scenarios where you
need to secure sensitive data within a specific context or boundary in
Google Cloud.

143. You stream order data by using a Dataflow pipeline, and write the
aggregated result to Memorystore. You provisioned a Memorystore for
Redis instance with Basic Tier, 4 GB capacity, which is used by 40 clients
for read-only access. You are expecting the number of read-only clients to
increase significantly to a few hundred and you need to be able to support
the demand. You want to ensure that read and write access availability is
not impacted, and any changes you make can be deployed quickly. What
should you do?

Scalability for Read-Only Clients: Read replicas distribute read traffic across multiple
instances, significantly enhancing read capacity to support
a large number of clients without impacting write performance.
High Availability: Standard Tier ensures high availability with automatic
failover, minimizing downtime in case of instance failure.
Minimal Code Changes: Redis clients can seamlessly connect to read
replicas without requiring extensive code modifications, enabling a quick
deployment.

144. You have a streaming pipeline that ingests data from Pub/Sub in
production. You need to update this streaming pipeline with improved
business logic. You need to ensure that the updated pipeline reprocesses
the previous two days of delivered Pub/Sub messages. What should you
do? (Choose two.)
D&E
Both retain-acked-messages and Seek are required to achieve the desired reprocessing.
retain-acked-messages keeps the messages available, and Seek allows the updated pipeline
to rewind and read those messages again. They are complementary functionalities that solve
different parts of the problem.
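As an illustrative sketch (project and subscription names are assumptions), the updated
pipeline's subscription could be rewound two days with the Pub/Sub client's seek call,
provided the subscription was configured to retain acknowledged messages:

import datetime
from google.cloud import pubsub_v1
from google.protobuf import timestamp_pb2

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "orders-sub")  # assumed names

# Rewind the subscription by 48 hours. This only works if the subscription was
# created (or updated) with retain_acked_messages=True and enough retention.
target = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=2)
ts = timestamp_pb2.Timestamp()
ts.FromDatetime(target)

subscriber.seek(request={"subscription": subscription_path, "time": ts})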

145. You currently use a SQL-based tool to visualize your data stored in
BigQuery. The data visualizations require the use of outer joins and
analytic functions. Visualizations must be based on data that is no less
than 4 hours old. Business users are complaining that the visualizations
are too slow to generate. You want to improve the performance of the
visualization queries while minimizing the maintenance overhead of the
data preparation pipeline. What should you do?

In scenarios where data staleness is acceptable, for example for batch data processing or
reporting, non-incremental materialized views can improve query performance and reduce
cost. Use the allow_non_incremental_definition option, which must be accompanied by the
max_staleness option. To ensure a periodic refresh of the materialized view, you should
also configure a refresh policy.
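A hedged example of such a materialized view definition, with placeholder project,
dataset, table, and column names, could look like the following DDL run through the
BigQuery client:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names; the OPTIONS follow the BigQuery DDL for non-incremental
# materialized views with a staleness bound and a refresh policy.
ddl = """
CREATE MATERIALIZED VIEW `my_project.reporting.sales_mv`
OPTIONS (
  allow_non_incremental_definition = true,
  max_staleness = INTERVAL "4:0:0" HOUR TO SECOND,
  enable_refresh = true,
  refresh_interval_minutes = 60
)
AS
SELECT o.region, COUNT(*) AS orders, SUM(o.amount) AS revenue
FROM `my_project.sales.orders` AS o
LEFT OUTER JOIN `my_project.sales.refunds` AS r USING (order_id)
GROUP BY o.region
"""
client.query(ddl).result()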

146. You are deploying 10,000 new Internet of Things devices to collect
temperature data in your warehouses globally. You need to process, store
and analyze these very large datasets in real time. What should you do?
Google Cloud Pub/Sub allows for efficient ingestion and real-time data
streaming.
Google Cloud Dataflow can process and transform the streaming data in
real-time.
Google BigQuery is a fully managed, highly scalable data warehouse that
is well-suited for real-time analysis and querying of large datasets.

147. You need to modernize your existing on-premises data strategy. Your
organization currently uses:
• Apache Hadoop clusters for processing multiple large data sets,
including on-premises Hadoop Distributed File System (HDFS) for data
replication.
• Apache Airflow to orchestrate hundreds of ETL pipelines with thousands
of job steps.
You need to set up a new architecture in Google Cloud that can handle
your Hadoop workloads and requires minimal changes to your existing
orchestration processes. What should you do?

Option B provides the most straightforward approach to migrating the existing data
strategy to Google Cloud with minimal changes. It leverages
managed services that directly replace the existing on-premises
infrastructure and tools, ensuring a smooth transition and reducing
operational overhead. Using Dataproc, Cloud Storage, and Cloud
Composer allows the organization to focus on modernizing their data
strategy without having to re-engineer their core data processing and
orchestration workflows.

148. You recently deployed several data processing jobs into your Cloud
Composer 2 environment. You notice that some tasks are failing in Apache
Airflow. On the monitoring dashboard, you see an increase in the total
workers memory usage, and there were worker pod evictions. You need to
resolve these errors. What should you do? (Choose two.)

Both increasing worker memory (D) and increasing the Cloud Composer
environment size (B) are crucial for solving the problem. The environment
size provides the necessary resources, while increasing worker memory
allows the workers to utilize those resources effectively. They work
together to address the root cause of worker memory issues and pod
evictions.

149. You are on the data governance team and are implementing
security requirements to deploy resources. You need to ensure that
resources are limited to only the europe-west3 region. You want to follow
Google-recommended practices.
What should you do?

Using Organization Policy with the constraints/gcp.resourceLocations constraint is the
most straightforward, centralized, and recommended way
to enforce location restrictions on Google Cloud resources. It ensures that
the policy is applied consistently across all projects and deployments
within your organization.

150. You are a BigQuery admin supporting a team of data consumers who run ad hoc queries
and downstream reporting in tools such as Looker.
All data and users are combined under a single organizational project. You
recently noticed some slowness in query results and want to troubleshoot
where the slowdowns are occurring. You think that there might be some
job queuing or slot contention occurring as users run jobs, which slows
down access to results. You need to investigate the query job information
and determine where performance is being affected. What should you do?
Option C provides the most direct and effective way to investigate query
job information and pinpoint the source of performance slowdowns in
BigQuery. By combining the overview provided by administrative resource
charts with the detailed information available in the
INFORMATION_SCHEMA, you can effectively diagnose and address
performance bottlenecks.

151. You migrated a data backend for an application that serves 10 PB of historical
product data for analytics. Only the last known state for a
product, which is about 10 GB of data, needs to be served through an API
to the other applications. You need to choose a cost-effective persistent
storage solution that can accommodate the analytics requirements and
the API performance of up to 1000 queries per second (QPS) with less than
1 second latency. What should you do?

Cost-Effective Analytics: BigQuery excels at handling large datasets (10 PB) and complex
analytical queries. Its columnar storage and massively
parallel processing make it ideal for analyzing historical product data.
High-Performance API: Cloud SQL provides a managed relational database
service optimized for transactional workloads. It can easily handle the
1000 QPS requirement with low latency, ensuring fast API responses.
Separation of Concerns: Storing historical data in BigQuery and the last
known state in Cloud SQL separates analytical and transactional
workloads, optimizing performance and cost for each use case.
152. You want to schedule a number of sequential load and
transformation jobs. Data files will be added to a Cloud Storage bucket by
an upstream process. There is no fixed schedule for when the new data
arrives. Next, a Dataproc job is triggered to perform some transformations
and write the data to BigQuery. You then need to run additional
transformation jobs in BigQuery. The transformation jobs are different for
every table. These jobs might take hours to complete. You need to
determine the most efficient and maintainable workflow to process
hundreds of tables and provide the freshest data to your end users. What
should you do?

Option D provides the most efficient and maintainable workflow by combining the modularity
of separate DAGs with the event-driven
approach of a Cloud Storage object trigger. This ensures that data is
processed as soon as it's available, providing the freshest data to end-
users, while also keeping the transformation logic for each table organized
and manageable.

153. You are deploying a MySQL database workload onto Cloud SQL. The
database must be able to scale up to support several readers from various
geographic regions. The database must be highly available and meet low
RTO and RPO requirements, even in the event of a regional outage. You
need to ensure that interruptions to the readers are minimal during a
database failover. What should you do?
Option C provides the most robust and highly available solution by
combining a highly available primary instance with a highly available read
replica in another region. This approach ensures that the database can
withstand both zonal and regional failures, while cascading read replicas
provide scalability and low latency for read workloads.

154. You are planning to load some of your existing on-premises data
into BigQuery on Google Cloud. You want to either stream or batch-load
data, depending on your use case. Additionally, you want to mask some
sensitive data before loading into BigQuery. You need to do this in a
programmatic way while keeping costs to a minimum. What should you
do?

- Programmatic Flexibility: Apache Beam provides extensive control over pipeline design,
allowing for customization of data transformations,
including integration with Cloud DLP for sensitive data masking.
- Streaming and Batch Support: Beam seamlessly supports both streaming
and batch data processing modes, enabling flexibility in data loading
patterns.
- Cost-Effective Processing: Dataflow offers a serverless model, scaling
resources as needed, and only charging for resources used, helping
optimize costs.
- Integration with Cloud DLP: Beam integrates well with Cloud DLP for
sensitive data masking, ensuring data privacy before loading into
BigQuery.
155. You want to encrypt the customer data stored in BigQuery. You need
to implement per-user crypto-deletion on data stored in your tables. You
want to adopt native features in Google Cloud to avoid custom solutions.
What should you do?

- AEAD cryptographic functions in BigQuery allow for encryption and decryption of data at
the column level.
- You can encrypt specific data fields using a unique key per user and
manage these keys outside of BigQuery (for example, in your application
or using a key management system).
- By "deleting" or revoking access to the key for a specific user, you
effectively make their data unreadable, achieving crypto-deletion.
- This method provides fine-grained encryption control but requires careful
key management and integration with your applications.
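A minimal sketch of this pattern follows, with assumed dataset, table, and column names:
one keyset is generated per user and kept in a key table, and deleting a user's keyset row
makes their encrypted values unreadable (crypto-deletion).

from google.cloud import bigquery

client = bigquery.Client()

# Assumed dataset/table/column names throughout.
script = """
CREATE TABLE IF NOT EXISTS `my_project.secure.user_keys` AS
SELECT user_id, KEYS.NEW_KEYSET('AEAD_AES_GCM_256') AS keyset
FROM `my_project.crm.users`;

CREATE OR REPLACE TABLE `my_project.crm.customers_encrypted` AS
SELECT c.user_id,
       AEAD.ENCRYPT(k.keyset, c.address, CAST(c.user_id AS STRING)) AS address_enc
FROM `my_project.crm.customers` AS c
JOIN `my_project.secure.user_keys` AS k USING (user_id);
"""
client.query(script).result()   # runs as a multi-statement script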

156. The data analyst team at your company uses BigQuery for ad-hoc
queries and scheduled SQL pipelines in a Google Cloud project with a slot
reservation of 2000 slots. However, with the recent introduction of
hundreds of new non time-sensitive SQL pipelines, the team is
encountering frequent quota errors. You examine the logs and notice that
approximately 1500 queries are being triggered concurrently during peak
time. You need to resolve the concurrency issue. What should you do?

Option B provides the most effective and efficient solution by prioritizing interactive
queries (ad-hoc) over batch queries (scheduled pipelines). This
ensures that ad-hoc queries have the resources they need, reducing quota
errors and improving the overall user experience. It also optimizes slot
utilization and avoids unnecessary increases in slot capacity.
157. You have spent a few days loading data from comma-separated
values (CSV) files into the Google BigQuery table CLICK_STREAM. The
column DT stores the epoch time of click events. For convenience, you
chose a simple schema where every field is treated as the STRING type.
Now, you want to compute web session durations of users who visit your
site, and you want to change its data type to the TIMESTAMP. You want to
minimize the migration effort without making future queries
computationally expensive. What should you do?

Option E provides the best balance between minimizing migration effort and ensuring
computationally inexpensive future queries. It efficiently
transforms the data into the correct format and creates a new table for
future use, avoiding unnecessary overhead and complexity.
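A hedged sketch of that one-time conversion, assuming the dataset and table names, is a
single CREATE TABLE AS SELECT that casts the epoch string into a proper TIMESTAMP column:

from google.cloud import bigquery

client = bigquery.Client()

# Assumed project/dataset names; DT holds epoch seconds stored as STRING.
ddl = """
CREATE TABLE `my_project.web.CLICK_STREAM_TS` AS
SELECT * EXCEPT (DT),
       TIMESTAMP_SECONDS(CAST(DT AS INT64)) AS DT
FROM `my_project.web.CLICK_STREAM`
"""
client.query(ddl).result()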

158. You are designing a data mesh on Google Cloud by using Dataplex
to manage data in BigQuery and Cloud Storage. You want to simplify data
asset permissions. You are creating a customer virtual lake with two user
groups:
• Data engineers, which require full data lake access
• Analytic users, which require access to curated data
You need to assign access rights to these two groups. What should you do?
Option A provides the most straightforward and efficient way to manage
permissions in Dataplex by using its built-in roles (dataplex.dataOwner and
dataplex.dataReader). This simplifies permission management and
ensures that each user group has the appropriate level of access to the
data lake.

159. You are designing the architecture of your application to store data
in Cloud Storage. Your application consists of pipelines that read data from
a Cloud Storage bucket that contains raw data, and write the data to a
second bucket after processing. You want to design an architecture with
Cloud Storage resources that are capable of being resilient if a Google
Cloud regional failure occurs. You want to minimize the recovery point
objective (RPO) if a failure occurs, with no impact on applications that use
the stored data. What should you do?

Option C provides the best balance of high availability, low RPO, and
minimal impact on applications. Dual-region buckets with turbo replication
offer a robust and efficient solution for storing data in Cloud Storage with
regional failure resilience.

160. You have designed an Apache Beam processing pipeline that reads
from a Pub/Sub topic. The topic has a message retention duration of one
day, and writes to a Cloud Storage bucket. You need to select a bucket
location and processing strategy to prevent data loss in case of a regional
outage with an RPO of 15 minutes. What should you do?
Option D provides the most robust and efficient solution for preventing
data loss and ensuring business continuity during a regional outage. It
combines the high availability of dual-region buckets with turbo
replication, proactive monitoring, and a well-defined failover process.

161. You are preparing data that your machine learning team will use to
train a model using BigQueryML. They want to predict the price per square
foot of real estate. The training data has a column for the price and a
column for the number of square feet. Another feature column called
‘feature1’ contains null values due to missing data. You want to replace
the nulls with zeros to keep more data points. Which query should you
use?
Option A is the correct choice because it retains all the original columns
and specifically addresses the issue of null values in ‘feature1’ by
replacing them with zeros, without altering any other columns or
performing unnecessary calculations. This makes the data ready for use in
BigQueryML without losing any important information.
Option C is not the best choice because it includes the EXCEPT clause for
the price and square_feet columns, which would exclude these columns
from the results. This is not desirable since you need these columns for
the machine learning model to predict the price per square foot.
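For illustration only (the answer options themselves are not reproduced here), a query in
the spirit of option A would keep every row and rewrite only the null feature1 values; the
table name is a placeholder.

from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT price,
       square_feet,
       IFNULL(feature1, 0) AS feature1
FROM `my_project.real_estate.training_data`
"""
rows = client.query(query).result()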

162. Different teams in your organization store customer and performance data in
BigQuery. Each team needs to keep full control of
their collected data, be able to query data within their projects, and be
able to exchange their data with other teams. You need to implement an
organization-wide solution, while minimizing operational tasks and costs.
What should you do?
Centralized Data Exchange: Analytics Hub provides a unified platform for
data sharing across teams and organizations. It simplifies the process of
publishing, discovering, and subscribing to datasets, reducing operational
overhead.
Data Ownership and Control: Each team retains full control over their data,
deciding which datasets to publish and who can access them. This ensures
data governance and security.
Cross-Project Querying: Once a team subscribes to a dataset in Analytics
Hub, they can query it directly from their own BigQuery project, enabling
seamless data access without data replication.
Cost Efficiency: Analytics Hub eliminates the need for data duplication or
complex ETL processes, reducing storage and processing costs.

163. You are developing a model to identify the factors that lead to sales
conversions for your customers. You have completed processing your data.
You want to continue through the model development lifecycle. What
should you do next?

You've just concluded processing the data, ending up with a clean, prepared dataset for
the model. Now you need to decide how to split the data into training and test sets. Only
afterwards can you train the model, evaluate it, fine-tune it, and, eventually, predict
with it.

164. You have one BigQuery dataset which includes customers’ street
addresses. You want to retrieve all occurrences of street addresses from
the dataset. What should you do?

- Cloud Data Loss Prevention (Cloud DLP) provides powerful inspection capabilities for
sensitive data, including predefined detectors for infoTypes
such as STREET_ADDRESS.
- By creating a deep inspection job for each table with the
STREET_ADDRESS infoType, you can accurately identify and retrieve rows
that contain street addresses.

165. Your company operates in three domains: airlines, hotels, and ride-
hailing services. Each domain has two teams: analytics and data science,
which create data assets in BigQuery with the help of a central data
platform team. However, as each domain is evolving rapidly, the central
data platform team is becoming a bottleneck. This is causing delays in
deriving insights from data, and resulting in stale data when pipelines are
not kept up to date. You need to design a data mesh architecture by using
Dataplex to eliminate the bottleneck. What should you do?

Option C aligns perfectly with the principles of a data mesh architecture by promoting
domain-centricity, decentralized ownership, and independent
data management. It empowers each domain to manage its data assets
effectively, reducing the bottleneck on the central data platform team and
enabling faster insights.

166. dataset.inventory_vm sample records:

You have an inventory of VM data stored in the BigQuery table. You want
to prepare the data for regular reporting in the most cost-effective way.
You need to exclude VM rows with fewer than 8 vCPU in your report. What
should you do?

The regular reporting doesn't justify a materialized view, since the frequency of access
is not that high; a simple view would do the trick. Moreover, the vCPU data is in a nested
field and requires UNNEST.
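Because the sample-record screenshot is not reproduced here, the nested field names below
(machine_type.vcpus) are purely hypothetical; the point is that a logical view combining
UNNEST with the vCPU filter is sufficient for this kind of regular reporting.

from google.cloud import bigquery

client = bigquery.Client()

# Field names below are illustrative only; adapt to the actual inventory schema.
ddl = """
CREATE VIEW `my_project.dataset.inventory_vm_8plus_vcpu` AS
SELECT i.*
FROM `my_project.dataset.inventory_vm` AS i, UNNEST(i.machine_type) AS mt
WHERE mt.vcpus >= 8
"""
client.query(ddl).result()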

167. Your team is building a data lake platform on Google Cloud. As a part of the data
foundation design, you are planning to store all the raw
data in Cloud Storage. You are expecting to ingest approximately 25 GB of
data a day and your billing department is worried about the increasing
cost of storing old data. The current business requirements are:
• The old data can be deleted anytime.
• There is no predefined access pattern of the old data.
• The old data should be available instantly when accessed.
• There should not be any charges for data retrieval.
What should you do to optimize for cost?

Autoclass is the most suitable option for this scenario because it automatically manages
storage classes based on access patterns,
ensuring cost optimization, instant availability, and no retrieval charges. It
eliminates the need for defining complex lifecycle management policies
and simplifies storage management.

168. You want to use Google Stackdriver Logging to monitor Google BigQuery usage. You need
an instant notification to be sent to your
monitoring tool when new data is appended to a certain table using an
insert job, but you do not want to receive notifications for other tables.
What should you do?

This approach allows you to set up a custom log sink with an advanced
filter that targets the specific table and then export the log entries to
Google Cloud Pub/Sub. Your monitoring tool can subscribe to the Pub/Sub
topic, providing you with instant notifications when relevant events occur
without being inundated with notifications from other tables.
Options A and B do not offer the same level of customization and
specificity in targeting notifications for a particular table.
Option C is almost correct but doesn't mention the use of an advanced log
filter in the sink configuration, which is typically needed to filter the logs to
a specific table effectively. Using the Stackdriver API for more advanced
configuration is often necessary for fine-grained control over log filtering.

169. Your company's data platform ingests CSV file dumps of booking
and user profile data from upstream sources into Cloud Storage. The data
analyst team wants to join these datasets on the email field available in
both the datasets to perform analysis. However, personally identifiable
information (PII) should not be accessible to the analysts. You need to de-
identify the email field in both the datasets before loading them into
BigQuery for analysts. What should you do?
Format-preserving encryption (FPE) with FFX in Cloud DLP is a strong
choice for de-identifying PII like email addresses. FPE maintains the format
of the data and ensures that the same input results in the same encrypted
output consistently. This means the email fields in both datasets can be
encrypted to the same value, allowing for accurate joins in BigQuery while
keeping the actual email addresses hidden.

170. You have important legal hold documents in a Cloud Storage bucket.
You need to ensure that these documents are not deleted or modified.
What should you do?

- Setting a retention policy on a Cloud Storage bucket prevents objects from being deleted
for the duration of the retention period.
- Locking the policy makes it immutable, meaning that the retention period
cannot be reduced or removed, thus ensuring that the documents cannot
be deleted or overwritten until the retention period expires.
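A small sketch with the Python Cloud Storage client follows; the bucket name and retention
period are assumptions, and note that locking the policy is irreversible.

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("legal-hold-docs")   # assumed bucket name

# Set a retention period (here 10 years, in seconds), then lock the policy.
# Once locked, the period can no longer be reduced or removed.
bucket.retention_period = 10 * 365 * 24 * 60 * 60
bucket.patch()
bucket.lock_retention_policy()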

171. You are designing a data warehouse in BigQuery to analyze sales data for a
telecommunication service provider. You need to create a data
model for customers, products, and subscriptions. All customers, products,
and subscriptions can be updated monthly, but you must maintain a
historical record of all data. You plan to use the visualization layer for
current and historical reporting. You need to ensure that the data model is
simple, easy-to-use, and cost-effective. What should you do?

Option D provides the best balance of simplicity, ease of use, performance, and
cost-effectiveness. It leverages the features of BigQuery to create a data model that is
optimized for analytical workloads while preserving the historical record of all changes.

172. You are deploying a batch pipeline in Dataflow. This pipeline reads
data from Cloud Storage, transforms the data, and then writes the data
into BigQuery. The security team has enabled an organizational constraint
in Google Cloud, requiring all Compute Engine instances to use only
internal IP addresses and no external IP addresses. What should you do?

- Private Google Access for services allows VM instances with only internal
IP addresses in a VPC network or on-premises networks (via Cloud VPN or
Cloud Interconnect) to reach Google APIs and services.
- When you launch a Dataflow job, you can specify that it should use
worker instances without external IP addresses if Private Google Access is
enabled on the subnetwork where these instances are launched.
- This way, your Dataflow workers will be able to access Cloud Storage and
BigQuery without violating the organizational constraint of no external IPs.
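A hedged sketch of the relevant pipeline options in the Beam Python SDK is shown below;
the project, subnetwork, bucket, and table names are placeholders, and the subnetwork is
assumed to have Private Google Access enabled.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# use_public_ips=False is the programmatic equivalent of --no_use_public_ips.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west3",
    temp_location="gs://my-bucket/temp",
    subnetwork="regions/europe-west3/subnetworks/dataflow-subnet",
    use_public_ips=False,
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
     | "ToRow" >> beam.Map(lambda line: {"raw_line": line})   # placeholder transform
     | "Write" >> beam.io.WriteToBigQuery(
         "my-project:analytics.events",                       # table assumed to exist
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))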

173. You are running a Dataflow streaming pipeline, with Streaming Engine and Horizontal
Autoscaling enabled. You have set the maximum
number of workers to 1000. The input of your pipeline is Pub/Sub
messages with notifications from Cloud Storage. One of the pipeline
transforms reads CSV files and emits an element for every CSV line. The
job performance is low, the pipeline is using only 10 workers, and you
notice that the autoscaler is not spinning up additional workers. What
should you do to improve performance?

- Fusion optimization in Dataflow can lead to steps being "fused" together, which can
sometimes hinder parallelization.
- Introducing a Reshuffle step can prevent fusion and force the distribution
of work across more workers.
- This can be an effective way to improve parallelism and potentially
trigger the autoscaler to increase the number of workers.
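A minimal sketch of where such a Reshuffle would sit in the Beam Python SDK, with
placeholder names and a placeholder function standing in for the real CSV-expansion logic:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def expand_csv_lines(gcs_path):
    """Placeholder: the real pipeline would open the file and yield one element per line."""
    yield f"{gcs_path},example-line"

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "ReadNotifications" >> beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/gcs-notifications")
     | "ToPath" >> beam.Map(lambda msg: msg.decode("utf-8"))
     | "ExpandCsvLines" >> beam.FlatMap(expand_csv_lines)
     | "BreakFusion" >> beam.Reshuffle()      # prevents fusion so work is redistributed
     | "Process" >> beam.Map(lambda line: line.upper()))   # placeholder processing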

174. You have an Oracle database deployed in a VM as part of a Virtual Private Cloud (VPC)
network. You want to replicate and continuously
synchronize 50 tables to BigQuery. You want to minimize the need to
manage infrastructure. What should you do?
- Datastream is a serverless and easy-to-use change data capture (CDC)
and replication service.
- You would create a Datastream service that sources from your Oracle
database and targets BigQuery, with private connectivity configuration to
the same VPC.
- This option is designed to minimize the need to manage infrastructure
and is a fully managed service.

175. You are deploying an Apache Airflow directed acyclic graph (DAG) in
a Cloud Composer 2 instance. You have incoming files in a Cloud Storage
bucket that the DAG processes, one file at a time. The Cloud Composer
instance is deployed in a subnetwork with no Internet access. Instead of
running the DAG based on a schedule, you want to run the DAG in a
reactive way every time a new file is received. What should you do?

- Enable Airflow REST API: In Cloud Composer, enable the "Airflow web
server" option.
- Set Up Cloud Storage Notifications: Create a notification for new files,
routing to a Cloud Function.
- Create PSC Endpoint: Establish a PSC endpoint for Cloud Composer.
- Write Cloud Function: Code the function to use the Airflow REST API (via
PSC endpoint) to trigger the DAG.

176. You are planning to use Cloud Storage as part of your data lake
solution. The Cloud Storage bucket will contain objects ingested from
external systems. Each object will be ingested once, and the access
patterns of individual objects will be random. You want to minimize the
cost of storing and retrieving these objects. You want to ensure that any
cost optimization efforts are transparent to the users and applications.
What should you do?
- Autoclass automatically analyzes access patterns of objects and
automatically transitions them to the most cost-effective storage class
within Standard, Nearline, Coldline, or Archive.
- This eliminates the need for manual intervention or setting specific age
thresholds.
- No user or application interaction is required, ensuring transparency.

177. You have several different file type data sources, such as Apache
Parquet and CSV. You want to store the data in Cloud Storage. You need to
set up an object sink for your data that allows you to use your own
encryption keys. You want to use a GUI-based solution. What should you
do?

- Cloud Data Fusion is a fully managed, code-free, GUI-based data integration service that
allows you to visually connect, transform, and move data between various sources and sinks.
- It supports various file formats and can write to Cloud Storage.
- You can configure it to use Customer-Managed Encryption Keys (CMEK)
for the buckets where it writes data.

178. Your business users need a way to clean and prepare data before
using the data for analysis. Your business users are less technically savvy
and prefer to work with graphical user interfaces to define their
transformations. After the data has been transformed, the business users
want to perform their analysis directly in a spreadsheet. You need to
recommend a solution that they can use. What should you do?

• It uses Dataprep to address the need for a graphical interface for data cleaning.
• It leverages BigQuery for scalable data storage.
• It employs Connected Sheets to enable analysis directly within a spreadsheet, fulfilling
all the stated requirements.

179. You are working on a sensitive project involving private user data.
You have set up a project on Google Cloud Platform to house your work
internally. An external consultant is going to assist with coding a complex
transformation in a Google Cloud Dataflow pipeline for your project. How
should you maintain users' privacy?

By creating an anonymized sample of the data, you can provide the consultant with a
realistic dataset that doesn't contain sensitive or private
information. This way, the consultant can work on the project without
direct access to sensitive data, reducing privacy risks.

180. You have two projects where you run BigQuery jobs:
• One project runs production jobs that have strict completion time SLAs.
These are high priority jobs that must have the required compute
resources available when needed. These jobs generally never go below a
300 slot utilization, but occasionally spike up an additional 500 slots.
• The other project is for users to run ad-hoc analytical queries. This
project generally never uses more than 200 slots at a time. You want these
ad-hoc queries to be billed based on how much data users scan rather
than by slot capacity.
You need to ensure that both projects have the appropriate compute
resources available. What should you do?

Separate Reservations: This approach provides tailored resource allocation and billing
models to match the distinct needs of each project.
SLA Project Reservation:
Enterprise Edition: Guarantees consistent slot availability for your
production jobs.
Baseline of 300 slots: Ensures resources are always available to meet your
core usage at a predictable cost.
Autoscaling up to 500 slots: Accommodates bursts in workload while
controlling costs.
Ad-hoc Project On-demand:
On-demand billing: Charges based on data scanned, ideal for
unpredictable and variable query patterns by your ad-hoc users.

181. You want to migrate your existing Teradata data warehouse to BigQuery. You want to
move the historical data to BigQuery by using the
most efficient method that requires the least amount of programming, but
local storage space on your existing data warehouse is limited. What
should you do?

Extraction using a JDBC driver with FastExport connection. If there are constraints on the
local storage space available for extracted files, or if
there is some reason you can't use TPT, then use this extraction method.
In this mode, the migration agent extracts tables into a collection of AVRO
files on the local file system. It then uploads these files to a Cloud Storage
bucket, where they are used by the transfer job. Once the files are
uploaded to Cloud Storage, the migration agent deletes them from the
local file system.
In this mode, you can limit the amount of space used by the AVRO files on
the local file system. If this limit is exceeded, extraction is paused until
space is freed up by the migration agent uploading and deleting existing
AVRO files.

182. You are on the data governance team and are implementing
security requirements. You need to encrypt all your data in BigQuery by
using an encryption key managed by your team. You must implement a
mechanism to generate and store encryption material only on your on-
premises hardware security module (HSM). You want to rely on Google
managed solutions. What should you do?
- Cloud EKM allows you to use encryption keys managed in external key
management systems, including on-premises HSMs, while using Google
Cloud services.
- This means that the key material remains in your control and
environment, and Google Cloud services use it via the Cloud EKM
integration.
- This approach aligns with the need to generate and store encryption
material only on your on-premises HSM and is the correct way to integrate
such keys with BigQuery.

183. You maintain ETL pipelines. You notice that a streaming pipeline
running on Dataflow is taking a long time to process incoming data, which
causes output delays. You also noticed that the pipeline graph was
automatically optimized by Dataflow and merged into one step. You want
to identify where the potential bottleneck is occurring. What should you
do?

From the Dataflow documentation: "There are a few cases in your pipeline
where you may want to prevent the Dataflow service from performing
fusion optimizations. These are cases in which the Dataflow service might
incorrectly guess the optimal way to fuse operations in the pipeline, which
could limit the Dataflow service's ability to make use of all available
workers.
You can insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the
data, and performs deduplication of records. Reshuffle is supported by
Dataflow even though it is marked deprecated in the Apache Beam
documentation."

184. You are running your BigQuery project in the on-demand billing
model and are executing a change data capture (CDC) process that
ingests data. The CDC process loads 1 GB of data every 10 minutes into a
temporary table, and then performs a merge into a 10 TB target table.
This process is very scan intensive and you want to explore options to
enable a predictable cost model. You need to create a BigQuery
reservation based on utilization information gathered from BigQuery
Monitoring and apply the reservation to the CDC process. What should you
do?
The most effective and recommended way to ensure a BigQuery
reservation applies to your CDC process, which involves multiple jobs and
potential different datasets/service accounts, is to create the reservation
at the project level. This guarantees that all BigQuery workloads within
the project, including your CDC process, will utilize the reserved capacity.

185. You are designing a fault-tolerant architecture to store data in a regional BigQuery
dataset. You need to ensure that your application is able
to recover from a corruption event in your tables that occurred within the
past seven days. You want to adopt managed services with the lowest RPO
and most cost-effective solution. What should you do?

- Lowest RPO: Time travel offers point-in-time recovery for the past seven
days by default, providing the shortest possible recovery point objective
(RPO) among the given options. You can recover data to any state within
that window.
- No Additional Costs: Time travel is a built-in feature of BigQuery,
incurring no extra storage or operational costs.
- Managed Service: BigQuery handles time travel automatically,
eliminating manual backup and restore processes.
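For example, a corrupted table could be restored from its state 24 hours earlier with a
time travel query like the following (project, dataset, and table names are assumptions):

from google.cloud import bigquery

client = bigquery.Client()

restore_sql = """
CREATE OR REPLACE TABLE `my_project.analytics.orders_restored` AS
SELECT *
FROM `my_project.analytics.orders`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
"""
client.query(restore_sql).result()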

186. You are building a streaming Dataflow pipeline that ingests noise
level data from hundreds of sensors placed near construction sites across
a city. The sensors measure noise level every ten seconds, and send that
data to the pipeline when levels reach above 70 dBA. You need to detect
the average noise level from a sensor when data is received for a duration
of more than 30 minutes, but the window ends when no data has been
received for 15 minutes. What should you do?

To detect average noise levels from sensors, the best approach is to use
session windows with a 15-minute gap duration (Option A). Session
windows are ideal for cases like this where the events (sensor data) are
sporadic. They group events that occur within a certain time interval (15
minutes in your case) and a new window is started if no data is received
for the duration of the gap. This matches your requirement to end the
window when no data is received for 15 minutes, ensuring that the
average noise level is calculated over periods of continuous data.
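A small, runnable sketch of session windowing in the Beam Python SDK, with an assumed
(sensor_id, dBA reading, event time in epoch seconds) element format:

import apache_beam as beam
from apache_beam import window

# Assumed element format: (sensor_id, dBA reading, event time in epoch seconds).
events = [("sensor-1", 72.5, 0), ("sensor-1", 75.0, 60), ("sensor-2", 80.1, 3000)]

with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create(events)
     | "Timestamp" >> beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
     | "SessionWindow" >> beam.WindowInto(window.Sessions(gap_size=15 * 60))
     | "AvgPerSensor" >> beam.combiners.Mean.PerKey()
     | "Print" >> beam.Map(print))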

187. You are creating a data model in BigQuery that will hold retail
transaction data. Your two largest tables, sales_transaction_header and
sales_transaction_line, have a tightly coupled immutable relationship.
These tables are rarely modified after load and are frequently joined when
queried. You need to model the sales_transaction_header and
sales_transaction_line tables to improve the performance of data analytics
queries. What should you do?

- In BigQuery, nested and repeated fields can significantly improve performance for
certain types of queries, especially joins, because the data is co-located and can be read
efficiently.
- This approach is often used in data warehousing scenarios where query performance is a
priority, and the data relationships are immutable and rarely modified.
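A hedged example of the denormalization, with assumed column names: the line items are
aggregated into a repeated STRUCT column on a copy of the header table.

from google.cloud import bigquery

client = bigquery.Client()

# Assumed column names; every header row ends up with a repeated STRUCT of its lines.
ddl = """
CREATE TABLE `my_project.retail.sales_transaction_nested` AS
SELECT h.transaction_id,
       h.transaction_date,
       h.customer_id,
       ARRAY_AGG(STRUCT(l.product_id, l.quantity, l.amount)) AS lines
FROM `my_project.retail.sales_transaction_header` AS h
JOIN `my_project.retail.sales_transaction_line` AS l USING (transaction_id)
GROUP BY h.transaction_id, h.transaction_date, h.customer_id
"""
client.query(ddl).result()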

188. You created a new version of a Dataflow streaming data ingestion pipeline that reads
from Pub/Sub and writes to BigQuery. The previous
version of the pipeline that runs in production uses a 5-minute window for
processing. You need to deploy the new version of the pipeline without
losing any data, creating inconsistencies, or increasing the processing
latency by more than 10 minutes. What should you do?

- Draining the old pipeline ensures that it finishes processing all in-flight
data before stopping, which prevents data loss and inconsistencies.
- After draining, you can start the new pipeline, which will begin processing
new data from where the old pipeline left off.
- This approach maintains a smooth transition between the old and new
versions, minimizing latency increases and avoiding data gaps or overlaps.

189. Your organization's data assets are stored in BigQuery, Pub/Sub, and
a PostgreSQL instance running on Compute Engine. Because there are
multiple domains and diverse teams using the data, teams in your
organization are unable to discover existing data assets. You need to
design a solution to improve data discoverability while keeping
development and configuration efforts to a minimum. What should you
do?

• It leverages Data Catalog's automatic crawling for BigQuery and Pub/Sub, minimizing
effort.
• It utilizes custom connectors for the non-native PostgreSQL integration, providing a
sustainable and discoverable solution with a reasonable initial investment.
By combining automatic cataloging with custom connectors, Option C
strikes the right balance between comprehensive data discoverability and
minimal development/configuration effort.

190. You are building a model to predict whether or not it will rain on a
given day. You have thousands of input features and want to see if you can
improve training speed by removing some features while having a
minimum effect on model accuracy. What can you do?

Combining highly correlated features into a single representative feature can reduce the
dimensionality of your dataset, making the training
process faster while preserving relevant information. This approach often
helps eliminate redundancy in the input data.

191. You need to create a SQL pipeline. The pipeline runs an aggregate
SQL transformation on a BigQuery table every two hours and appends the
result to another existing BigQuery table. You need to configure the
pipeline to retry if errors occur. You want the pipeline to send an email
notification after three consecutive failures. What should you do?
Option B leverages the power of Cloud Composer's workflow orchestration
and the BigQueryInsertJobOperator's capabilities to create a
straightforward, reliable, and maintainable SQL pipeline that meets all the
specified requirements, including retries and email notifications after three
consecutive failures.
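A sketch of what option B could look like in a Cloud Composer DAG follows; the SQL, table
names, and recipient address are assumptions, and retries=2 yields three consecutive
attempts before the failure email fires.

import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="two_hourly_aggregate_pipeline",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule_interval="0 */2 * * *",
    catchup=False,
    default_args={
        "retries": 2,                          # 1 run + 2 retries = 3 consecutive failures
        "email": ["data-oncall@example.com"],  # assumed recipient
        "email_on_failure": True,              # fires only after retries are exhausted
    },
) as dag:
    aggregate = BigQueryInsertJobOperator(
        task_id="aggregate_and_append",
        configuration={
            "query": {
                "query": """
                    SELECT store_id, SUM(amount) AS total_sales
                    FROM `my_project.sales.transactions`
                    GROUP BY store_id
                """,
                "useLegacySql": False,
                "destinationTable": {
                    "projectId": "my_project",
                    "datasetId": "sales",
                    "tableId": "sales_summary",
                },
                "writeDisposition": "WRITE_APPEND",
            }
        },
    )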

192. You are monitoring your organization’s data lake hosted on BigQuery. The ingestion
pipelines read data from Pub/Sub and write the
data into tables on BigQuery. After a new version of the ingestion pipelines
is deployed, the daily stored data increased by 50%. The volumes of data
in Pub/Sub remained the same and only some tables had their daily
partition data size doubled. You need to investigate and fix the cause of
the data increase. What should you do?

• It begins with a direct hypothesis (duplicates) and a method to verify it.
• It utilizes audit logs and monitoring to precisely identify the problematic jobs and
pipeline versions.
• It focuses on stopping old pipelines as a targeted remediation step.
By following this systematic approach, you can effectively diagnose and
resolve the data increase issue while minimizing disruption and ensuring a
long-term fix.
193. You have a BigQuery dataset named “customers”. All tables will be
tagged by using a Data Catalog tag template named “gdpr”. The template
contains one mandatory field, “has_sensitive_data”, with a boolean value.
All employees must be able to do a simple search and find tables in the
dataset that have either true or false in the “has_sensitive_data’ field.
However, only the Human Resources (HR) group should be able to see the
data inside the tables for which “has_sensitive data” is true. You give the
all employees group the bigquery.metadataViewer and
bigquery.connectionUser roles on the dataset. You want to minimize
configuration overhead. What should you do next?

• It makes the tag template public, enabling all employees to search for tables based on
the tags without needing extra permissions.
• It directly grants BigQuery data access to the HR group only on the necessary tables,
minimizing configuration overhead and ensuring compliance with the restricted data access
requirement.
By combining public tag visibility with targeted BigQuery permissions,
Option C provides the most straightforward and least complex way to
achieve the desired access control and searchability for your BigQuery
data and Data Catalog tags.

194. You are creating the CI/CD cycle for the code of the directed acyclic
graphs (DAGs) running in Cloud Composer. Your team has two Cloud
Composer instances: one instance for development and another instance
for production. Your team is using a Git repository to maintain and develop
the code of the DAGs. You want to deploy the DAGs automatically to Cloud
Composer when a certain tag is pushed to the Git repository. What should
you do?
• It uses Cloud Build to automate the deployment process based on Git tags.
• It directly deploys DAG code to the Cloud Storage buckets used by Cloud Composer,
eliminating the need for additional infrastructure.
• It aligns with the recommended approach for managing DAGs in Cloud Composer.
By leveraging Cloud Build and Cloud Storage, Option A minimizes the
configuration overhead and complexity while providing a robust and
automated CI/CD pipeline for your Cloud Composer DAGs.

195. You have a BigQuery table that ingests data directly from a Pub/Sub
subscription. The ingested data is encrypted with a Google-managed
encryption key. You need to meet a new organization policy that requires
you to use keys from a centralized Cloud Key Management Service (Cloud
KMS) project to encrypt data at rest. What should you do?

• It creates a new BigQuery table with the required CMEK configuration.
• It migrates the existing data to the new table, ensuring all data is encrypted according
to the new policy.
This approach ensures compliance with the organization's encryption
policy while minimizing unnecessary changes to other parts of the data
pipeline.
196. You created an analytics environment on Google Cloud so that your
data scientist team can explore data without impacting the on-premises
Apache Hadoop solution. The data in the on-premises Hadoop Distributed
File System (HDFS) cluster is in Optimized Row Columnar (ORC) formatted
files with multiple columns of Hive partitioning. The data scientist team
needs to be able to explore the data in a similar way as they used the on-
premises HDFS cluster with SQL on the Hive query engine. You need to
choose the most cost-effective storage and processing solution. What
should you do?

- It leverages the strengths of BigQuery for SQL-based exploration while avoiding
additional costs and complexity associated with data
transformation or migration.
- The data remains in ORC format in Cloud Storage, and BigQuery's
external tables feature allows direct querying of this data.

197. You are designing a Dataflow pipeline for a batch processing job.
You want to mitigate multiple zonal failures at job submission time. What
should you do?

• It directly addresses the problem of zonal failures during job submission by targeting a
region instead of specific zones.
• It leverages Dataflow's built-in capabilities for distributing resources across zones
within a region.
By specifying a worker region, you ensure that your Dataflow job is
submitted and executed in a way that is resilient to zonal failures,
providing a more robust and reliable pipeline.

198. You are designing a real-time system for a ride hailing app that
identifies areas with high demand for rides to effectively reroute available
drivers to meet the demand. The system ingests data from multiple
sources to Pub/Sub, processes the data, and stores the results for
visualization and analysis in real-time dashboards. The data sources
include driver location updates every 5 seconds and app-based booking
events from riders. The data processing involves real-time aggregation of
supply and demand data for the last 30 seconds, every 2 seconds, and
storing the results in a low-latency system for visualization. What should
you do?

The requirement is to aggregate supply and demand over the last 30 seconds, emitted every 2 seconds, which is by definition an overlapping window: a hopping (sliding) window with a 30-second size and a 2-second period produces an up-to-date 30-second aggregate every 2 seconds.
Tumbling windows do not overlap, so a 2-second tumbling window would only ever aggregate the most recent 2 seconds of data and could not provide the 30-second lookback the dashboards need. The aggregated results should then be written to a low-latency store (for example, Memorystore or Bigtable) for the real-time dashboards. A minimal windowing sketch follows below.
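As an illustration of the windowing just described, here is a minimal Apache Beam (Python) sketch. The topic name, element format, and the simple count-per-cell aggregation are assumptions for the example, not part of the original question.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            # Hypothetical topic; replace with your own.
            | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/ride-events")
            | "ParseCellId" >> beam.Map(lambda msg: (msg.decode("utf-8"), 1))
            # Last 30 seconds of data, emitted every 2 seconds (hopping/sliding window).
            | "HoppingWindow" >> beam.WindowInto(window.SlidingWindows(size=30, period=2))
            | "CountPerCell" >> beam.CombinePerKey(sum)
            # In the real pipeline this would be written to a low-latency sink for the dashboard.
            | "Print" >> beam.Map(print)
        )

if __name__ == "__main__":
    run()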

199. Your car factory is pushing machine measurements as messages into a Pub/Sub topic in your Google Cloud project. A Dataflow streaming
job, that you wrote with the Apache Beam SDK, reads these messages,
sends acknowledgment to Pub/Sub, applies some custom business logic in
a DoFn instance, and writes the result to BigQuery. You want to ensure that
if your business logic fails on a message, the message will be sent to a
Pub/Sub topic that you want to monitor for alerting purposes. What should
you do?

Side Output for Failed Messages: Dataflow allows you to use side outputs
to handle messages that fail processing. In your DoFn , you can catch
exceptions and write the failed messages to a separate PCollection . This
PCollection can then be written to a new Pub/Sub topic.
New Pub/Sub Topic for Monitoring: Creating a dedicated Pub/Sub topic for
failed messages allows you to monitor it specifically for alerting purposes.
This provides a clear view of any issues with your business logic.
topic/num_unacked_messages_by_region Metric: This Cloud Monitoring
metric tracks the number of unacknowledged messages in a Pub/Sub
topic. By monitoring this metric on your new topic, you can identify when
messages are failing to be processed correctly.
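A minimal sketch of this pattern in the Apache Beam Python SDK, assuming a hypothetical failure topic and subscription name: the DoFn tags failed elements as a side output, and the tagged PCollection is published to the monitoring topic.

import json
import apache_beam as beam
from apache_beam import pvalue

class ApplyBusinessLogic(beam.DoFn):
    FAILED_TAG = "failed"

    def process(self, element):
        try:
            record = json.loads(element.decode("utf-8"))
            # ... custom business logic would go here ...
            yield record
        except Exception:
            # Route the raw message to the side output instead of failing the bundle.
            yield pvalue.TaggedOutput(self.FAILED_TAG, element)

def build(p):
    results = (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/measurements-sub")
        | "Process" >> beam.ParDo(ApplyBusinessLogic()).with_outputs(
            ApplyBusinessLogic.FAILED_TAG, main="ok")
    )
    # Failed messages go to a dedicated topic that is monitored for alerting.
    _ = results[ApplyBusinessLogic.FAILED_TAG] | "PublishFailures" >> beam.io.WriteToPubSub(
        topic="projects/my-project/topics/failed-measurements")
    return results.ok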

200. You want to store your team’s shared tables in a single dataset to
make data easily accessible to various analysts. You want to make this
data readable but unmodifiable by analysts. At the same time, you want to
provide the analysts with individual workspaces in the same project, where
they can create and store tables for their own use, without the tables
being accessible by other analysts. What should you do?

- Data Viewer on Shared Dataset: Grants read-only access to the shared dataset.
- Data Editor on Individual Datasets: Giving each analyst Data Editor role
on their respective dataset creates private workspaces where they can
create and store personal tables without exposing them to other analysts.

201. Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour. The data scientists have written the following code to read the data for new key features in the logs.

You want to improve the performance of this data read. What should you
do?
This function exports the whole table to temporary files in Google Cloud
Storage, where it will later be read from.
This requires almost no computation, as it only performs an export job,
and later Dataflow reads from GCS (not from BigQuery).
BigQueryIO.read.fromQuery() executes a query and then reads the results
received after the query execution. Therefore, this function is more time-consuming, given that a query must first be executed (which will incur the corresponding economic and computational costs).
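In the Beam Python SDK the same trade-off appears with beam.io.ReadFromBigQuery, as in the sketch below (project, dataset, and column names are placeholders): reading the whole table uses the export-based path, while passing a query forces BigQuery to execute it first.

import apache_beam as beam

with beam.Pipeline() as p:
    # Faster for full scans: exports the table to temporary files in Cloud Storage,
    # and Dataflow reads those files instead of querying BigQuery.
    full_table = p | "ReadTable" >> beam.io.ReadFromBigQuery(
        table="my-project:logs_dataset.campaign_logs")

    # Slower: runs the query first, then reads its results.
    # filtered = p | "ReadQuery" >> beam.io.ReadFromBigQuery(
    #     query="SELECT new_feature FROM `my-project.logs_dataset.campaign_logs`",
    #     use_standard_sql=True)
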
202. You are running a streaming pipeline with Dataflow and are using
hopping windows to group the data as the data arrives. You noticed that
some data is arriving late but is not being marked as late data, which is
resulting in inaccurate aggregations downstream. You need to find a
solution that allows you to capture the late data in the appropriate
window. What should you do?

A watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived. If the watermark has progressed past
the end of the window and new data arrives with a timestamp within the
window, the data is considered late data. For more information, see
Watermarks and late data in the Apache Beam documentation.
Dataflow tracks watermarks because of the following reasons:
Data is not guaranteed to arrive in time order or at predictable intervals.
Data events are not guaranteed to appear in pipelines in the same order
that they were generated.
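A minimal Beam Python sketch of capturing late data by extending the allowed lateness on the window; the window sizes, timestamps, and element values are illustrative only.

import apache_beam as beam
from apache_beam.transforms import window, trigger

with beam.Pipeline() as p:
    late_tolerant = (
        p
        | "Create" >> beam.Create([("sensor-1", 1.0), ("sensor-1", 2.0)])
        | "AddTimestamps" >> beam.Map(
            lambda kv: window.TimestampedValue(kv, 1700000000))
        | "Window" >> beam.WindowInto(
            window.SlidingWindows(size=60, period=10),
            trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            # Keep each window open for 10 extra minutes so late elements
            # are still assigned to it instead of being dropped.
            allowed_lateness=600)
        | "SumPerSensor" >> beam.CombinePerKey(sum)
    )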

203. You work for a large ecommerce company. You store your
customer's order data in Bigtable. You have a garbage collection policy set
to delete the data after 30 days and the number of versions is set to 1.
When the data analysts run a query to report total customer spending, the
analysts sometimes see customer data that is older than 30 days. You
need to ensure that the analysts do not see customer data older than 30
days while minimizing cost and overhead. What should you do?

Because it can take up to a week for expired data to be deleted, you should never rely solely on garbage collection policies to ensure that read
requests return the desired data. Always apply a filter to your read
requests that excludes the same values as your garbage collection rules.
You can filter by limiting the number of cells per column or by specifying a
timestamp range.
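A minimal sketch with the google-cloud-bigtable Python client, assuming instance, table, project, and column-family names; the filter mirrors the 30-day garbage collection rule so reads never return expired-but-undeleted cells.

import datetime
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("orders-instance").table("customer-orders")

# Only return cells written in the last 30 days, matching the GC policy.
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=30)
recent_only = row_filters.TimestampRangeFilter(
    row_filters.TimestampRange(start=cutoff))

rows = table.read_rows(filter_=recent_only)
for row in rows:
    print(row.row_key)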

204. You are using a Dataflow streaming job to read messages from a
message bus that does not support exactly-once delivery. Your job then
applies some transformations, and loads the result into BigQuery. You want
to ensure that your data is being streamed into BigQuery with exactly-
once delivery semantics. You expect your ingestion throughput into
BigQuery to be about 1.5 GB per second. What should you do?

To achieve exactly-once delivery into BigQuery from a source that only guarantees at-least-once delivery, have the Dataflow pipeline write with the BigQuery Storage Write API in its default exactly-once mode. The Storage Write API deduplicates writes by using stream offsets, so messages that the bus redelivers do not become duplicate rows, and its default throughput quota accommodates an ingestion rate of about 1.5 GB per second. A sketch of the sink configuration follows below.
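If the pipeline uses a recent Beam Python SDK, the sink can be pointed at the Storage Write API as in this sketch; the table name and schema are placeholders, and the default configuration of this method provides exactly-once semantics.

import apache_beam as beam

def write_to_bq(rows):
    return rows | "WriteToBQ" >> beam.io.WriteToBigQuery(
        table="my-project:analytics.events",
        schema="event_id:STRING,payload:STRING,event_ts:TIMESTAMP",
        # Storage Write API sink; its default configuration provides
        # exactly-once delivery semantics for streaming writes.
        method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)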

205. You have created an external table for Apache Hive partitioned data
that resides in a Cloud Storage bucket, which contains a large number of
files. You notice that queries against this table are slow. You want to
improve the performance of these queries. What should you do?

- BigLake Table: BigLake allows for more efficient querying of data lakes
stored in Cloud Storage. It can handle large datasets more effectively than
standard external tables.
- Metadata Caching: Enabling metadata caching can significantly improve
query performance by reducing the time taken to read and process
metadata from a large number of files.

206. You have a network of 1000 sensors. The sensors generate time
series data: one metric per sensor per second, along with a timestamp.
You already have 1 TB of data, and expect the data to grow by 1 GB every
day. You need to access this data in two ways. The first access pattern
requires retrieving the metric from one specific sensor stored at a specific
timestamp, with a median single-digit millisecond latency. The second
access pattern requires running complex analytic queries on the data,
including joins, once a day. How should you store this data?
- Bigtable excels at incredibly fast lookups by row key, often reaching
single-digit millisecond latencies.
- Constructing the row key with sensor ID and timestamp enables efficient
retrieval of specific sensor readings at exact timestamps.
- Bigtable's wide-column design effectively stores time series data,
allowing for flexible addition of new metrics without schema changes.
- Bigtable scales horizontally to accommodate massive datasets
(petabytes or more), easily handling the expected data growth.
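A short sketch of the first access pattern with the google-cloud-bigtable client, assuming the row key layout described above (sensor ID and timestamp joined with '#'); the project, instance, table, column family, and qualifier names are illustrative.

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("sensors-instance").table("sensor-metrics")

# Row key pattern: <sensor_id>#<timestamp>, e.g. "sensor-0042#1700000000"
row_key = "sensor-0042#1700000000".encode("utf-8")

row = table.read_row(row_key)
if row is not None:
    # Column family "metrics", qualifier "value" (assumed names).
    cell = row.cells["metrics"][b"value"][0]
    print(cell.value, cell.timestamp)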

207. You have 100 GB of data stored in a BigQuery table. This data is
outdated and will only be accessed one or two times a year for analytics
with SQL. For backup purposes, you want to store this data to be
immutable for 3 years. You want to minimize storage costs. What should
you do?

 It utilizes the cheapest storage option (Archive storage) for long-term preservation.
 It ensures immutability through a locked retention policy, fulfilling the
backup requirement.
 It maintains queryability using BigQuery external tables, allowing for
occasional analytics without incurring the cost of keeping the data active
in BigQuery.
By combining these techniques, Option D provides the best balance of
cost-effectiveness, immutability, and accessibility for your outdated
BigQuery data.
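A sketch of the bucket side of this setup with the google-cloud-storage client; the bucket name and location are placeholders, while the Archive storage class and three-year retention follow the scenario. Locking the retention policy is irreversible, so it appears here only as the final, deliberate step.

from google.cloud import storage

client = storage.Client(project="my-project")

# Archive bucket that will hold the exported table data.
bucket = client.bucket("outdated-sales-backup")
bucket.storage_class = "ARCHIVE"
new_bucket = client.create_bucket(bucket, location="US")

# Three-year retention policy, then lock it to make the objects immutable.
new_bucket.retention_period = 3 * 365 * 24 * 60 * 60  # seconds
new_bucket.patch()
new_bucket.lock_retention_policy()  # irreversible once called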

208. You have thousands of Apache Spark jobs running in your on-
premises Apache Hadoop cluster. You want to migrate the jobs to Google
Cloud. You want to use managed services to run your jobs instead of
maintaining a long-lived Hadoop cluster yourself. You have a tight timeline
and want to keep code changes to a minimum. What should you do?
Dataproc is the most suitable choice for migrating your existing Apache
Spark jobs to Google Cloud because it is a fully managed service that
supports Apache Spark and Hadoop workloads with minimal changes to
your existing code. Moving your data to Cloud Storage and running jobs on
Dataproc offers a fast, efficient, and scalable solution for your needs.

209. You are administering shared BigQuery datasets that contain views
used by multiple teams in your organization. The marketing team is
concerned about the variability of their monthly BigQuery analytics spend
using the on-demand billing model. You need to help the marketing team
establish a consistent BigQuery analytics spend each month. What should
you do?

This option provides the marketing team with a predictable monthly cost
by reserving a fixed number of slots, ensuring that they have dedicated
resources without the variability introduced by autoscaling or on-demand
pricing. This setup also simplifies budgeting and financial planning for the
marketing team, as they will have a consistent expense each month.

210. You are part of a healthcare organization where data is organized and managed by respective data owners in various storage services. As a
result of this decentralized ecosystem, discovering and managing data has
become difficult. You need to quickly identify and implement a cost-
optimized solution to assist your organization with the following:

• Data management and discovery


• Data lineage tracking
• Data quality validation

How should you build the solution?


 It provides a centralized platform to manage and govern data across
various storage systems.
 It automates metadata discovery, enabling easy data discovery.
 It tracks data lineage, helping understand the data flow and identify
potential issues.
 It allows you to define and monitor data quality rules.
 It's designed to be cost-effective.
By using Dataplex, the healthcare organization can quickly implement a
solution to address its data management challenges in a cost-optimized
and efficient manner.

211. You have data located in BigQuery that is used to generate reports
for your company. You have noticed some weekly executive report fields
do not correspond to format according to company standards. For
example, report errors include different telephone formats and different
country code identifiers. This is a frequent issue, so you need to create a
recurring job to normalize the data. You want a quick solution that requires
no coding. What should you do?

 It leverages Cloud Data Fusion's visual environment to build data pipelines without coding.
 It utilizes Wrangler's point-and-click interface to easily perform data
normalization tasks.
 It allows scheduling recurring jobs to automate the normalization process.
By using Cloud Data Fusion and Wrangler, you can quickly implement a
recurring data normalization solution without writing any code, addressing
the specific requirements of the problem.

212. Your company is streaming real-time sensor data from their factory
floor into Bigtable and they have noticed extremely poor performance.
How should the row key be redesigned to improve Bigtable performance
on queries that populate real-time dashboards?
 It enables efficient range scans for retrieving data for specific sensors
across time.
 It distributes writes to prevent hotspots and maintain write performance.
 It ensures data locality for recent queries, improving read performance for
real-time dashboards.
By using <sensorid>#<timestamp> as the row key structure, you
optimize Bigtable for the specific access patterns of your real-time
dashboards, resulting in improved query performance and a better user
experience.

213. You are designing a messaging system by using Pub/Sub to process clickstream data with an event-driven consumer app that relies on a push
subscription. You need to configure the messaging system that is reliable
enough to handle temporary downtime of the consumer app. You also
need the messaging system to store the input messages that cannot be
consumed by the subscriber. The system needs to retry failed messages
gradually, avoiding overloading the consumer app, and store the failed
messages after a maximum of 10 retries in a topic. How should you
configure the Pub/Sub subscription?

 It uses exponential backoff to ensure gradual retries, preventing consumer overload.
 It utilizes dead-lettering to store failed messages in a separate topic for
later analysis or processing.
 It sets a maximum delivery attempt limit to prevent messages from being
stuck in a retry loop.
This configuration ensures that your messaging system is reliable, resilient
to temporary downtime, and able to manage failed messages effectively.
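A minimal sketch with the google-cloud-pubsub client, assuming topic, subscription, and endpoint names; it combines an exponential-backoff retry policy with a dead-letter topic after 10 delivery attempts. (Granting the Pub/Sub service agent publish rights on the dead-letter topic is also required, but not shown.)

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()

subscription = subscriber.create_subscription(
    request={
        "name": "projects/my-project/subscriptions/clickstream-push-sub",
        "topic": "projects/my-project/topics/clickstream",
        "push_config": {"push_endpoint": "https://consumer.example.com/push"},
        # Gradual redelivery: exponential backoff between retries.
        "retry_policy": {
            "minimum_backoff": {"seconds": 10},
            "maximum_backoff": {"seconds": 600},
        },
        # After 10 failed delivery attempts, park the message in a dead-letter topic.
        "dead_letter_policy": {
            "dead_letter_topic": "projects/my-project/topics/clickstream-dead-letter",
            "max_delivery_attempts": 10,
        },
    }
)
print(subscription.name)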

214. You designed a data warehouse in BigQuery to analyze sales data. You want a self-serving, low-maintenance, and cost-effective solution to share the sales dataset with other business units in your organization. What should you do?
 It enables self-service data discovery for other business units.
 It offers centralized data governance and control.
 It minimizes maintenance overhead.
 It's a cost-effective way to share data within your organization.
By using Analytics Hub, you can empower other business units to access
the sales data they need while maintaining control and minimizing costs.

215. You have terabytes of customer behavioral data streaming from Google Analytics into BigQuery daily. Your customers’ information, such as
their preferences, is hosted on a Cloud SQL for MySQL database. Your CRM
database is hosted on a Cloud SQL for PostgreSQL instance. The marketing
team wants to use your customers’ information from the two databases
and the customer behavioral data to create marketing campaigns for
yearly active customers. You need to ensure that the marketing team can
run the campaigns over 100 times a day on typical days and up to 300
during sales. At the same time, you want to keep the load on the Cloud
SQL databases to a minimum. What should you do?

 It minimizes the load on Cloud SQL by replicating the data to BigQuery.


 It enables efficient and frequent queries using BigQuery.
 It provides near real-time updates to ensure data freshness.
By replicating the required data to BigQuery, you create a dedicated
analytical environment that can handle the high query frequency without
impacting the performance of the transactional Cloud SQL databases.

216. Your organization is modernizing their IT services and migrating to Google
Cloud. You need to organize the data that will be stored in Cloud Storage
and BigQuery. You need to enable a data mesh approach to share the data
between sales, product design, and marketing departments. What should
you do?

- Decentralized ownership: Each department controls its data lake, aligning with the core principle of data ownership in a data mesh.
- Self-service data access: Departments can create and manage their own
Cloud Storage buckets and BigQuery datasets within their data lakes,
enabling self-service data access.
- Interdepartmental sharing: Dataplex facilitates data sharing by enabling
departments to publish their data products from their data lakes, making it
easily discoverable and usable by other departments.

217. You work for a large ecommerce company. You are using Pub/Sub to
ingest the clickstream data to Google Cloud for analytics. You observe that
when a new subscriber connects to an existing topic to analyze data, they
are unable to subscribe to older data. For an upcoming yearly sale event in
two months, you need a solution that, once implemented, will enable any
new subscriber to read the last 30 days of data. What should you do?

- Topic Retention Policy: This policy determines how long messages are
retained by Pub/Sub after they are published, even if they have not been
acknowledged (consumed) by any subscriber.
- 30 Days Retention: By setting the retention policy of the topic to 30 days,
all messages published to this topic will be available for consumption for
30 days. This means any new subscriber connecting to the topic can
access and analyze data from the past 30 days.
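A sketch of setting topic-level retention with the Python client, assuming the topic name and the current admin API shape that exposes message_retention_duration on the topic; with topic retention in place, a new subscription can seek back and replay the retained messages.

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()

topic = publisher.update_topic(
    request={
        "topic": {
            "name": "projects/my-project/topics/clickstream",
            # Retain published messages for 30 days, even after acknowledgment.
            "message_retention_duration": {"seconds": 30 * 24 * 60 * 60},
        },
        "update_mask": {"paths": ["message_retention_duration"]},
    }
)
print(topic.message_retention_duration)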

218. You are designing the architecture to process your data from Cloud
Storage to BigQuery by using Dataflow. The network team provided you
with the Shared VPC network and subnetwork to be used by your
pipelines. You need to enable the deployment of the pipeline on the
Shared VPC network. What should you do?

Shared VPC and Network Access: When using a Shared VPC, you need to
grant specific permissions to service accounts in the service project
(where your Dataflow pipeline runs) to access resources in the host
project's network.
compute.networkUser Role: This role grants the necessary permissions for
a service account to use the network resources in the Shared VPC. This
includes accessing subnets, creating instances, and communicating with
other services within the network.
Service Account for Pipeline Execution: The service account that executes
your Dataflow pipeline is the one that needs these network permissions.
This is because the Dataflow service uses this account to create and
manage worker instances within the Shared VPC network.
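For completeness, a sketch of how the pipeline itself points at the Shared VPC subnetwork once the worker service account has the compute.networkUser role; all project, bucket, and account names are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="service-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    # Full URL of the subnetwork in the Shared VPC *host* project.
    subnetwork=(
        "https://www.googleapis.com/compute/v1/projects/host-project/"
        "regions/us-central1/subnetworks/shared-subnet"),
    # Worker service account that was granted roles/compute.networkUser.
    service_account_email="dataflow-workers@service-project.iam.gserviceaccount.com",
)

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(print)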

219. Your infrastructure team has set up an interconnect link between Google Cloud and the on-premises network. You are designing a high-
throughput streaming pipeline to ingest data in streaming from an Apache
Kafka cluster hosted on-premises. You want to store the data in BigQuery,
with as minimal latency as possible. What should you do?

This approach allows you to directly connect Dataflow to your Kafka cluster, ensuring minimal latency by avoiding additional intermediaries like Pub/Sub. Dataflow is designed to handle high-throughput data processing and can efficiently ingest and process streaming data from Kafka, then write it to BigQuery. This setup leverages the interconnect link for a direct and efficient data flow.

220. You migrated your on-premises Apache Hadoop Distributed File
System (HDFS) data lake to Cloud Storage. The data scientist team needs
to process the data by using Apache Spark and SQL. Security policies need
to be enforced at the column level. You need a cost-effective solution that
can scale into a data mesh. What should you do?

- BigLake Integration: BigLake allows you to define tables on top of data in Cloud Storage, providing a bridge between data lake storage and
BigQuery's powerful analytics capabilities. This approach is cost-effective
and scalable.
- Data Catalog for Governance: Creating a taxonomy of policy tags in
Google Cloud's Data Catalog and applying these tags to specific columns
in your BigLake tables enables fine-grained, column-level access control.
- Processing with Spark and SQL: The Spark-BigQuery connector allows
data scientists to process data using Apache Spark directly against
BigQuery (and BigLake tables). This supports both Spark and SQL
processing needs.
- Scalability into a Data Mesh: BigLake and Data Catalog are designed to
scale and support the data mesh architecture, which involves
decentralized data ownership and governance.

221. One of your encryption keys stored in Cloud Key Management Service (Cloud KMS) was exposed. You need to re-encrypt all of your
CMEK-protected Cloud Storage data that used that key, and then delete
the compromised key. You also want to reduce the risk of objects getting
written without customer-managed encryption key (CMEK) protection in
the future. What should you do?
- New Key Creation: A new Cloud KMS key ensures a secure replacement
for the compromised one.
- New Bucket: A separate bucket prevents potential conflicts with existing
objects and configurations.
- Default CMEK: Setting the new key as default enforces encryption for all
objects in the bucket, reducing the risk of unencrypted data.
- Copy Without Key Specification: Copying objects without specifying a key
leverages the default key, simplifying the process and ensuring consistent
encryption.
- Old Key Deletion: After copying, the compromised key can be safely
deleted.

222. You have an upstream process that writes data to Cloud Storage.
This data is then read by an Apache Spark job that runs on Dataproc.
These jobs are run in the us-central1 region, but the data could be stored
anywhere in the United States. You need to have a recovery process in
place in case of a catastrophic single region failure. You need an approach
with a maximum of 15 minutes of data loss (RPO=15 mins). You want to
ensure that there is minimal latency when reading the data. What should
you do?

- Rapid Replication: Turbo replication ensures near-real-time data synchronization between regions, achieving an RPO of 15 minutes or less.
- Minimal Latency: Dataproc clusters can read from the bucket in the same
region, minimizing data transfer latency and optimizing performance.
- Disaster Recovery: In case of regional failure, Dataproc clusters can
seamlessly redeploy to the other region and continue reading from the
same bucket, ensuring business continuity.
223. You designed a database for patient records as a pilot project to
cover a few hundred patients in three clinics. Your design used a single
database table to represent all patients and their visits, and you used self-
joins to generate reports. The server resource utilization was at 50%. Since
then, the scope of the project has expanded. The database must now store
100 times more patient records. You can no longer run the reports,
because they either take too long or they encounter errors with insufficient
compute resources. How should you adjust the database design?

Normalizing the database into separate Patients and Visits tables, along
with creating other necessary tables, is the best solution for handling the
increased data size while ensuring efficient query performance and
maintainability. This approach addresses the root problem instead of
applying temporary fixes.
224. Your company's customer and order databases are often under
heavy load. This makes performing analytics against them difficult without
harming operations. The databases are in a MySQL cluster, with nightly
backups taken using mysqldump. You want to perform analytics with
minimal impact on operations. What should you do?

Based on these considerations, option B is likely the best approach. By using an ETL tool to load data from MySQL into Google BigQuery, you're
leveraging BigQuery's strengths in handling large-scale analytics
workloads without impacting the performance of the operational
databases. This option provides a clear separation of operational and
analytical workloads and takes advantage of BigQuery's fast analytics
capabilities.

225. You currently have transactional data stored on-premises in a PostgreSQL database. To modernize your data environment, you want to
run transactional workloads and support analytics needs with a single
database. You need to move to Google Cloud without changing database
management systems, and minimize cost and complexity. What should
you do?
The company currently has transactional data stored on-premises in a PostgreSQL database and wants a single database that supports both transactional workloads and analytics. Migrating to Cloud SQL for PostgreSQL keeps the same database management system while minimizing cost and complexity.

226. You are architecting a data transformation solution for BigQuery. Your developers are proficient with SQL and want to use the ELT
development technique. In addition, your developers need an intuitive
coding environment and the ability to manage SQL as code. You need to
identify a solution for your developers to build these pipelines. What
should you do?

- Aligns with ELT Approach: Dataform is designed for ELT (Extract, Load,
Transform) pipelines, directly executing SQL transformations within
BigQuery, matching the developers' preference.
- SQL as Code: It enables developers to write and manage SQL
transformations as code, promoting version control, collaboration, and
testing.
- Intuitive Coding Environment: Dataform provides a user-friendly interface
and familiar SQL syntax, making it easy for SQL-proficient developers to
adopt.
- Scheduling and Orchestration: It includes built-in scheduling capabilities
to automate pipeline execution, simplifying pipeline management.

227. You work for a farming company. You have one BigQuery table
named sensors, which is about 500 MB and contains the list of your 5000
sensors, with columns for id, name, and location. This table is updated
every hour. Each sensor generates one metric every 30 seconds along
with a timestamp, which you want to store in BigQuery. You want to run an
analytical query on the data once a week for monitoring purposes. You
also want to minimize costs. What data model should you use?
This approach offers several advantages:
Cost Efficiency: Partitioning the metrics table by timestamp helps reduce
query costs by allowing BigQuery to scan only the relevant partitions.
Data Organization: Keeping metrics in a separate table maintains a clear
separation between sensor metadata and sensor metrics, making it easier
to manage and query the data.
Performance: Using INSERT statements to append new metrics ensures efficient data ingestion without the overhead of frequent updates.
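A sketch of the metrics table definition with the google-cloud-bigquery client; the project, dataset, table, and column names are illustrative. Partitioning on the timestamp column is what keeps the weekly analytical query cheap.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

schema = [
    bigquery.SchemaField("sensor_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("value", "FLOAT"),
    bigquery.SchemaField("event_ts", "TIMESTAMP", mode="REQUIRED"),
]

table = bigquery.Table("my-project.sensors_dataset.sensor_metrics", schema=schema)
# Daily partitions on the event timestamp, so the weekly query scans only
# the partitions it needs.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
)
client.create_table(table)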

228. You are managing a Dataplex environment with raw and curated
zones. A data engineering team is uploading JSON and CSV files to a
bucket asset in the curated zone but the files are not being automatically
discovered by Dataplex. What should you do to ensure that the files are
discovered by Dataplex?

Raw zones store structured data, semi-structured data such as CSV files and JSON files, and unstructured data in any format from external sources. Curated zones store structured data. Data can be stored in Cloud Storage buckets or BigQuery datasets. Supported formats for Cloud Storage buckets in a curated zone include Parquet, Avro, and ORC. Because JSON and CSV are not supported formats in a curated zone, the files should be placed in the raw zone (or converted to a supported format such as Parquet) so that Dataplex discovery can pick them up.

229. You have a table that contains millions of rows of sales data,
partitioned by date. Various applications and users query this data many
times a minute. The query requires aggregating values by using AVG,
MAX, and SUM, and does not require joining to other tables. The required
aggregations are only computed over the past year of data, though you
need to retain full historical data in the base tables. You want to ensure
that the query results always include the latest data from the tables, while
also reducing computation cost, maintenance overhead, and duration.
What should you do?

- Materialized View: Materialized views in BigQuery are precomputed views that periodically cache the result of a query for increased performance and
efficiency. They are especially beneficial for heavy and repetitive
aggregation queries.
- Filter for Recent Data: Including a clause to focus on the last year of
partitions ensures that the materialized view is only storing and updating
the relevant data, optimizing storage and refresh time.
- Always Up-to-date: Materialized views are maintained by BigQuery and
automatically updated at regular intervals, ensuring they include the latest
data up to a certain freshness point.
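A sketch of such a materialized view, issued through the Python client; the project, dataset, table, and column names are illustrative, and the date boundary is written as a fixed literal for simplicity, so it would need to be adjusted as the year rolls over.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

ddl = """
CREATE MATERIALIZED VIEW `my-project.sales.daily_aggregates` AS
SELECT
  sale_date,
  AVG(amount) AS avg_amount,
  MAX(amount) AS max_amount,
  SUM(amount) AS total_amount
FROM `my-project.sales.transactions`
WHERE sale_date >= DATE '2025-01-01'  -- keep only the most recent partitions
GROUP BY sale_date
"""
client.query(ddl).result()  # waits for the DDL statement to finish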

230. Your organization uses a multi-cloud data storage strategy, storing data in Cloud Storage, and data in Amazon Web Services’ (AWS) S3
storage buckets. All data resides in US regions. You want to query up-to-
date data by using BigQuery, regardless of which cloud the data is stored
in. You need to allow users to query the tables from BigQuery without
giving direct access to the data in the storage buckets. What should you
do?

- BigQuery Omni: This is an extension of BigQuery that allows you to analyze data across Google Cloud, AWS, and Azure without having to
manage the infrastructure or move data across clouds. It's suitable for
querying data stored in AWS S3 buckets directly.
- BigLake: Allows you to create a logical abstraction (table) over data
stored in Cloud Storage and S3, so you can query data using BigQuery
without moving it.
- Unified Querying: By setting up BigQuery Omni to connect to AWS S3 and
creating BigLake tables over both Cloud Storage and S3 data, you can
query all data using BigQuery directly.

231. You are preparing an organization-wide dataset. You need to preprocess customer data stored in a restricted bucket in Cloud Storage.
The data will be used to create consumer analyses. You need to comply
with data privacy requirements. What should you do?

- Prioritizes Data Privacy: It protects sensitive information by masking it, reducing the risk of exposure in case of unauthorized access or accidental
leaks.
- Reduces Data Sensitivity: Masking renders sensitive data unusable for
attackers, even if they gain access to it.
- Preserves Data Utility: Masked data can still be used for consumer
analyses, as patterns and relationships are often preserved, allowing
meaningful insights to be derived.

232. You need to connect multiple applications with dynamic public IP addresses to a Cloud SQL instance. You configured users with strong
passwords and enforced the SSL connection to your Cloud SQL instance.
You want to use Cloud SQL public IP and ensure that you have secured
connections. What should you do?

- Using the Cloud SQL Auth proxy is a recommended method for secure
connections, especially when dealing with dynamic IP addresses.
- The Auth proxy provides secure access to your Cloud SQL instance
without the need for Authorized Networks or managing IP addresses.
- It works by encapsulating database traffic and forwarding it through a
secure tunnel, using Google's IAM for authentication.
- Leaving the Authorized Networks empty means you're not allowing any
direct connections based on IP addresses, relying entirely on the Auth
proxy for secure connectivity. This is a secure and flexible solution,
especially for applications with dynamic IPs.
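A sketch of an application-side connection using the Cloud SQL Python Connector, the language-level counterpart of the Auth proxy; the instance connection name, credentials, and database name are placeholders.

from google.cloud.sql.connector import Connector
import sqlalchemy

connector = Connector()

def getconn():
    # Connects through Google-managed TLS and IAM; no Authorized Networks needed.
    return connector.connect(
        "my-project:us-central1:orders-instance",  # instance connection name
        "pg8000",
        user="app-user",
        password="change-me",
        db="orders",
    )

pool = sqlalchemy.create_engine("postgresql+pg8000://", creator=getconn)

with pool.connect() as conn:
    print(conn.execute(sqlalchemy.text("SELECT 1")).scalar())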

233. You are migrating a large number of files from a public HTTPS
endpoint to Cloud Storage. The files are protected from unauthorized
access using signed URLs. You created a TSV file that contains the list of
object URLs and started a transfer job by using Storage Transfer Service.
You notice that the job has run for a long time and eventually failed.
Checking the logs of the transfer job reveals that the job was running fine
until one point, and then it failed due to HTTP 403 errors on the remaining
files. You verified that there were no changes to the source system. You
need to fix the problem to resume the migration process. What should you
do?

HTTP 403 errors: These errors indicate unauthorized access, but since you
verified the source system and signed URLs, the issue likely lies with
expired signed URLs. Renewing the URLs with a longer validity period
prevents this issue for the remaining files.
Separate jobs: Splitting the file into smaller chunks and submitting them
as separate jobs improves parallelism and potentially speeds up the
transfer process.
Avoid manual intervention: Options A and D require manual intervention
and complex setups, which are less efficient and might introduce risks.
Longer validity: While option B addresses expired URLs, splitting the file
offers additional benefits for faster migration.

234. You work for an airline and you need to store weather data in a
BigQuery table. Weather data will be used as input to a machine learning
model. The model only uses the last 30 days of weather data. You want to
avoid storing unnecessary data and minimize costs. What should you do?
 It uses partitioning to improve query performance when selecting data
within a date range.
 It automates data deletion through partition expiration, ensuring that only
the necessary data is stored.
By using a partitioned table with partition expiration, you can effectively
manage your weather data in BigQuery, optimize query performance, and
minimize storage costs.
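A sketch of the weather table with a 30-day partition expiration, using the google-cloud-bigquery client; the dataset, table, and column names are illustrative.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

schema = [
    bigquery.SchemaField("station_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("observation_date", "DATE", mode="REQUIRED"),
    bigquery.SchemaField("temperature_c", "FLOAT"),
]

table = bigquery.Table("my-project.weather.observations", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="observation_date",
    # Partitions older than 30 days are deleted automatically.
    expiration_ms=30 * 24 * 60 * 60 * 1000,
)
client.create_table(table)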

235. You have Google Cloud Dataflow streaming pipeline running with a
Google Cloud Pub/Sub subscription as the source. You need to make an
update to the code that will make the new Cloud Dataflow pipeline
incompatible with the current version. You do not want to lose any data
when making this update. What should you do?

 It leverages the drain flag to ensure that all data is processed before the
pipeline is shut down for the update.
 It allows for a seamless transition to the updated pipeline without any data
loss.
By using the drain flag, you can safely update your Dataflow pipeline with
incompatible changes while preserving data integrity.

236. You need to look at BigQuery data from a specific table multiple
times a day. The underlying table you are querying is several petabytes in
size, but you want to filter your data and provide simple aggregations to
downstream users. You want to run queries faster and get up-to-date
insights quicker. What should you do?
 It provides the best query performance by storing pre-computed results.
 It offers up-to-date insights through automatic refresh capabilities.
 It can be more cost-effective than repeatedly querying the large table.
By creating a materialized view, you can significantly improve query
performance and get up-to-date insights faster, while reducing the load on
your BigQuery table.

237. Your chemical company needs to manually check documentation for customer orders. You use a pull subscription in Pub/Sub so that sales agents get details from the order. You must ensure that you do not process orders
twice with different sales agents and that you do not add more complexity
to this workflow. What should you do?

 It directly addresses the requirement of preventing duplicate order processing.
 It's simple to implement and doesn't add unnecessary complexity to the
workflow.
By using Pub/Sub's exactly-once delivery feature, you can ensure that
each order is processed only once, without adding significant overhead or
complexity to your system.

238. You are migrating your on-premises data warehouse to BigQuery. As part of the migration, you want to facilitate cross-team collaboration to get
the most value out of the organization’s data. You need to design an
architecture that would allow teams within the organization to securely
publish, discover, and subscribe to read-only data in a self-service manner.
You need to minimize costs while also maximizing data freshness. What
should you do?

Analytics Hub is a fully managed data sharing platform provided by Google Cloud. It allows organizations to publish, discover, and subscribe to
datasets securely and efficiently. It facilitates collaboration across teams
or even across organizations by enabling self-service access to shared
data without duplicating or moving it.
239. You want to migrate an Apache Spark 3 batch job from on-premises
to Google Cloud. You need to minimally change the job so that the job
reads from Cloud Storage and writes the result to BigQuery. Your job is
optimized for Spark, where each executor has 8 vCPU and 16 GB memory,
and you want to be able to choose similar settings. You want to minimize
installation and management effort to run your job. What should you do?

Dataproc Serverless allows you to run Spark jobs without needing to manage the underlying infrastructure. It automatically handles resource provisioning and scaling, which simplifies the process and reduces management overhead.

240. You are configuring networking for a Dataflow job. The data pipeline
uses custom container images with the libraries that are required for the
transformation logic preinstalled. The data pipeline reads the data from
Cloud Storage and writes the data to BigQuery. You need to ensure cost-
effective and secure communication between the pipeline and Google APIs
and services. What should you do?

This approach ensures that your worker VMs can access Google APIs and
services securely without using external IP addresses, which reduces costs
and enhances security by keeping the traffic within Google's network

241. You are using Workflows to call an API that returns a 1KB JSON
response, apply some complex business logic on this response, wait for
the logic to complete, and then perform a load from a Cloud Storage file to
BigQuery. The Workflows standard library does not have sufficient
capabilities to perform your complex logic, and you want to use Python's
standard library instead. You want to optimize your workflow for simplicity
and speed of execution. What should you do?

Using a Cloud Function allows you to run your Python code in a serverless
environment, which simplifies deployment and management. It also
ensures quick execution and scalability, as Cloud Functions can handle the
processing of your JSON response efficiently.
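A sketch of the Cloud Function side, using the Functions Framework for Python; the business logic itself is a placeholder, and Workflows would call this HTTP endpoint with the 1KB JSON payload before moving on to the BigQuery load step.

import functions_framework

@functions_framework.http
def apply_business_logic(request):
    """HTTP Cloud Function invoked from the workflow with the JSON response."""
    payload = request.get_json(silent=True) or {}

    # Placeholder for the complex logic built on Python's standard library.
    result = {"processed": True, "item_count": len(payload)}

    # Workflows receives this JSON and then triggers the Cloud Storage to BigQuery load.
    return result, 200
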
242. You are administering a BigQuery on-demand environment. Your
business intelligence tool is submitting hundreds of queries each day that
aggregate a large (50 TB) sales history fact table at the day and month
levels. These queries have a slow response time and are exceeding cost
expectations. You need to decrease response time, lower query costs, and
minimize maintenance. What should you do?

Materialized views are precomputed views that cache the results of a query, which can significantly improve query performance and reduce
costs by avoiding repeated computation. They automatically update with
changes to the base table, ensuring data freshness without additional
maintenance.

243. You have several different unstructured data sources, within your
on-premises data center as well as in the cloud. The data is in various
formats, such as Apache Parquet and CSV. You want to centralize this data
in Cloud Storage. You need to set up an object sink for your data that
allows you to use your own encryption keys. You want to use a GUI-based
solution. What should you do?

Cloud Data Fusion is a fully managed, cloud-native data integration service that provides a graphical interface for building and managing data pipelines. It supports various data formats and allows you to use your own encryption keys for secure data transfer.

244. You are using BigQuery with a regional dataset that includes a table
with the daily sales volumes. This table is updated multiple times per day.
You need to protect your sales table in case of regional failures with a
recovery point objective (RPO) of less than 24 hours, while keeping costs
to a minimum. What should you do?

 It provides a cost-effective way to create a backup of your sales table.


 It meets the required RPO of less than 24 hours.
 It ensures regional redundancy by storing the backup in a dual or multi-
region Cloud Storage bucket.
By scheduling a daily export to Cloud Storage, you can effectively protect
your sales table against regional failures while minimizing costs and
complexity.

245. You are preparing an organization-wide dataset. You need to preprocess customer data stored in a restricted bucket in Cloud Storage.
The data will be used to create consumer analyses. You need to follow
data privacy requirements, including protecting certain sensitive data
elements, while also retaining all of the data for potential future use cases.
What should you do?

This approach ensures that sensitive data elements are protected through
masking, which meets data privacy requirements. At the same time, it
retains the data in a usable form for future analyses

246. Your company is running their first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The
data scientists are collecting terabytes of data that rapidly grows every
hour during their 30-day campaign. They are using Google Cloud Dataflow
to preprocess the data and collect the feature (signals) data that is needed
for the machine learning model in Google Cloud Bigtable. The team is
observing suboptimal performance with reads and writes of their initial
load of 10 TB of data. They want to improve this performance while
minimizing cost. What should they do?
Distributing reads and writes evenly across the row space helps prevent
hotspots and ensures that the load is spread evenly, avoiding performance
bottlenecks.
Google Cloud Bigtable's performance is influenced by how well the data is
distributed across the tablet servers, and evenly distributing the load can
lead to better performance.
This approach aligns with best practices for designing scalable and
performant Bigtable schemas.

247. Your software uses a simple JSON format for all messages. These
messages are published to Google Cloud Pub/Sub, then processed with
Google Cloud Dataflow to create a real-time dashboard for the CFO. During
testing, you notice that some messages are missing in the dashboard. You
check the logs, and all messages are being published to Cloud Pub/Sub
successfully. What should you do next?

This will allow you to determine if the issue is with the pipeline or with the
dashboard application. By analyzing the output, you can see if the
messages are being processed correctly and determine if there are any
discrepancies or missing messages. If the issue is with the pipeline, you
can then debug and make any necessary updates to ensure that all
messages are processed correctly. If the issue is with the dashboard
application, you can then focus on resolving that issue. This approach
allows you to isolate and identify the root cause of the missing messages
in a controlled and efficient manner.

248. Flowlogistic Case Study –


Company Overview –
Flowlogistic is a leading logistics and supply chain provider. They help
businesses throughout the world manage their resources and transport
them to their final destination. The company has grown rapidly, expanding
their offerings to include rail, truck, aircraft, and oceanic shipping.
Company Background –
The company started as a regional trucking company, and then expanded
into other logistics market. Because they have not updated their
infrastructure, managing and tracking orders and shipments has become a
bottleneck. To improve operations, Flowlogistic developed proprietary
technology for tracking shipments in real time at the parcel level.
However, they are unable to deploy it because their technology stack,
based on Apache Kafka, cannot support the processing volume. In
addition, Flowlogistic wants to further analyze their orders and shipments
to determine how best to deploy their resources.

Solution Concept –

Flowlogistic wants to implement two concepts using the cloud:


✑ Use their proprietary technology in a real-time inventory-tracking
system that indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain
both structured and unstructured data, to determine how best to deploy
resources, which markets to expand into. They also want to use predictive
analytics to learn earlier when a shipment will be delayed.

Existing Technical Environment –


Flowlogistic architecture resides in a single data center:
✑ Databases
8 physical servers in 2 clusters
- SQL Server – user data, inventory, static data
3 physical servers
- Cassandra – metadata, tracking messages
10 Kafka servers – tracking message aggregation and batch insert

✑ Application servers – customer front end, middleware for order/customs
60 virtual machines across 20 physical servers
- Tomcat – Java services
- Nginx – static content
- Batch servers

✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) – SQL Server storage
- Network-attached storage (NAS) – image storage, logs, backups

✑ 10 Apache Hadoop/Spark servers
- Core Data Lake
- Data analysis workloads

✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts

Business Requirements –
✑ Build a reliable and reproducible environment with scaled parity of production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid
provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met

Technical Requirements –
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing
demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud
environment

CEO Statement –
We have grown so quickly that our inability to upgrade our infrastructure is
really hampering further growth and efficiency. We are efficient at moving
shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand
where our customers are and what they are shipping.

CTO Statement –
IT has never been a priority for us, so as our data has grown, we have not
invested enough in our technology. I have a good staff to manage IT, but
they are so busy managing our infrastructure that I cannot get them to do
the things that really matter, such as organizing our data, building the
analytics, and figuring out how to implement the CFO' s tracking
technology.

CFO Statement –
Part of our competitive advantage is that we penalize ourselves for late
shipments and deliveries. Knowing where our shipments are at all times
has a direct correlation to our bottom line and profitability. Additionally, I
don't want to commit capital to building out a server environment.
Flowlogistic wants to use Google BigQuery as their primary analysis
system, but they still have Apache Hadoop and Spark workloads that they
cannot move to BigQuery. Flowlogistic does not know how to store the
data that is common to both workloads. What should they do?

This approach allows for interoperability between BigQuery and


Hadoop/Spark as Avro is a commonly used data serialization format that
can be read by both systems. Data stored in Google Cloud Storage can be
accessed by both BigQuery and Dataproc, providing a bridge between the
two environments. Additionally, you can set up data transformation
pipelines in Dataproc to work with this data.

249. Flowlogistic Case Study –


Company Overview –
Flowlogistic is a leading logistics and supply chain provider. They help
businesses throughout the world manage their resources and transport
them to their final destination. The company has grown rapidly, expanding
their offerings to include rail, truck, aircraft, and oceanic shipping.

Company Background –
The company started as a regional trucking company, and then expanded
into other logistics market. Because they have not updated their
infrastructure, managing and tracking orders and shipments has become a
bottleneck. To improve operations, Flowlogistic developed proprietary
technology for tracking shipments in real time at the parcel level.
However, they are unable to deploy it because their technology stack,
based on Apache Kafka, cannot support the processing volume. In
addition, Flowlogistic wants to further analyze their orders and shipments
to determine how best to deploy their resources.

Solution Concept –
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking
system that indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain
both structured and unstructured data, to determine how best to deploy
resources, which markets to expand into. They also want to use predictive
analytics to learn earlier when a shipment will be delayed.

Existing Technical Environment –


Flowlogistic architecture resides in a single data center:
✑ Databases
8 physical servers in 2 clusters
- SQL Server – user data, inventory, static data
3 physical servers
- Cassandra – metadata, tracking messages
10 Kafka servers – tracking message aggregation and batch insert

✑ Application servers – customer front end, middleware for order/customs
60 virtual machines across 20 physical servers
- Tomcat – Java services
- Nginx – static content
- Batch servers

✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) – SQL Server storage
- Network-attached storage (NAS) – image storage, logs, backups

✑ 10 Apache Hadoop/Spark servers
- Core Data Lake
- Data analysis workloads

✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts

Business Requirements –
✑ Build a reliable and reproducible environment with scaled parity of production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid
provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met

Technical Requirements –
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing
demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud
environment

CEO Statement –
We have grown so quickly that our inability to upgrade our infrastructure is
really hampering further growth and efficiency. We are efficient at moving
shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand
where our customers are and what they are shipping.

CTO Statement –
IT has never been a priority for us, so as our data has grown, we have not
invested enough in our technology. I have a good staff to manage IT, but
they are so busy managing our infrastructure that I cannot get them to do
the things that really matter, such as organizing our data, building the
analytics, and figuring out how to implement the CFO' s tracking
technology.
CFO Statement –
Part of our competitive advantage is that we penalize ourselves for late
shipments and deliveries. Knowing where our shipments are at all times
has a direct correlation to our bottom line and profitability. Additionally, I
don't want to commit capital to building out a server environment.

Flowlogistic's management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory
tracking system. You need to build a new system on Google Cloud Platform
(GCP) that will feed the proprietary tracking software. The system must be
able to ingest data from a variety of global sources, process and query in
real-time, and store the data reliably. Which combination of GCP products
should you choose?

Cloud Pub/Sub: It is a messaging service that allows you to asynchronously send and receive messages between independent applications.
Cloud Dataflow: It can handle both streaming and batch data, making it
suitable for real-time processing of data from various sources.
Cloud Storage: Cloud Storage can be used to store the processed and
analyzed data reliably. It provides scalable, durable, and globally
accessible object storage, making it suitable for storing large volumes of
data.

250. Flowlogistic Case Study –


Company Overview –
Flowlogistic is a leading logistics and supply chain provider. They help
businesses throughout the world manage their resources and transport
them to their final destination. The company has grown rapidly, expanding
their offerings to include rail, truck, aircraft, and oceanic shipping.

Company Background –
The company started as a regional trucking company, and then expanded
into other logistics market. Because they have not updated their
infrastructure, managing and tracking orders and shipments has become a
bottleneck. To improve operations, Flowlogistic developed proprietary
technology for tracking shipments in real time at the parcel level.
However, they are unable to deploy it because their technology stack,
based on Apache Kafka, cannot support the processing volume. In
addition, Flowlogistic wants to further analyze their orders and shipments
to determine how best to deploy their resources.

Solution Concept –
Flowlogistic wants to implement two concepts using the cloud:
Use their proprietary technology in a real-time inventory-tracking system
that indicates the location of their loads

✑ Perform analytics on all their orders and shipment logs, which contain
both structured and unstructured data, to determine how best to deploy
resources, which markets to expand into. They also want to use predictive
analytics to learn earlier when a shipment will be delayed.

Existing Technical Environment –


Flowlogistic architecture resides in a single data center:
✑ Databases
8 physical servers in 2 clusters
- SQL Server – user data, inventory, static data
3 physical servers
- Cassandra – metadata, tracking messages
10 Kafka servers – tracking message aggregation and batch insert

✑ Application servers – customer front end, middleware for order/customs
60 virtual machines across 20 physical servers
- Tomcat – Java services
- Nginx – static content
- Batch servers

✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) – SQL Server storage
- Network-attached storage (NAS) – image storage, logs, backups

✑ 10 Apache Hadoop/Spark servers
- Core Data Lake
- Data analysis workloads

✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts

Business Requirements –
✑ Build a reliable and reproducible environment with scaled parity of production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid
provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met

Technical Requirements –
Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing
demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud
environment

CEO Statement –
We have grown so quickly that our inability to upgrade our infrastructure is
really hampering further growth and efficiency. We are efficient at moving
shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand
where our customers are and what they are shipping.

CTO Statement –
IT has never been a priority for us, so as our data has grown, we have not
invested enough in our technology. I have a good staff to manage IT, but
they are so busy managing our infrastructure that I cannot get them to do
the things that really matter, such as organizing our data, building the
analytics, and figuring out how to implement the CFO' s tracking
technology.

CFO Statement –
Part of our competitive advantage is that we penalize ourselves for late
shipments and deliveries. Knowing where our shipments are at all times
has a direct correlation to our bottom line and profitability. Additionally, I
don't want to commit capital to building out a server environment.

Flowlogistic's CEO wants to gain rapid insight into their customer base so
his sales team can be better informed in the field. This team is not very
technical, so they've purchased a visualization tool to simplify the creation
of BigQuery reports. However, they've been overwhelmed by all the data
in the table, and are spending a lot of money on queries trying to find the
data they need. You want to solve their problem in the most cost-effective
way. What should you do?
Creating a view in BigQuery allows you to define a virtual table that is a
subset of the original data, containing only the necessary columns or
filtered data that the sales team requires for their reports. This approach is
cost-effective because it doesn't involve exporting data to external tools or
creating additional tables, and it ensures that the sales team is working
with the specific data they need without running expensive queries on the
full dataset. It simplifies the data for non-technical users while keeping the
data in BigQuery, which is a powerful and cost-efficient data warehousing
solution.
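
As an illustration only, here is a minimal sketch of such a view; the project, dataset, table, and column names are hypothetical and not taken from the case study:
-- Hypothetical names throughout; substitute the real project, dataset,
-- table, and the handful of columns the sales team actually reports on.
CREATE VIEW `myproject.sales_views.customer_summary` AS
SELECT
  customer_id,
  customer_name,
  region,
  last_shipment_date
FROM
  `myproject.warehouse.shipments`;
Pointing the visualization tool at the view means each report query scans only the columns the view exposes, rather than the full table.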

251. Flowlogistic Case Study –


(Refer to the Flowlogistic case study above.)

Flowlogistic is rolling out their real-time inventory tracking system. The
tracking devices will all send package-tracking messages, which will now
go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka
cluster. A subscriber application will then process the messages for real-
time reporting and store them in Google BigQuery for historical analysis.
You want to ensure the package data can be analyzed over time. Which
approach should you take?

Timestamping and adding the Package ID at the publisher device (option B)
is the best practice for ensuring data integrity, traceability, and
accurate historical analysis of package tracking data. It captures the
essential information at the earliest possible point in the data flow.

252. MJTelco Case Study –


Company Overview –
MJTelco is a startup that plans to build networks in rapidly growing,
underserved markets around the world. The company has patents for
innovative optical communications hardware. Based on these patents,
they can create many reliable, high-speed backbone links with inexpensive
hardware.

Company Background –
Founded by experienced telecom executives, MJTelco uses technologies
originally developed to overcome communications challenges in space.
Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine
learning to continuously optimize their topologies. Because their hardware
is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability
and cost. Their management and operations teams are situated all around
the globe, creating a many-to-many relationship between data consumers
and providers in their system. After careful consideration, they decided
public cloud is the perfect environment to support their needs.

Solution Concept –
MJTelco is running a successful proof-of-concept (PoC) project in its labs.
They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows
generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.
MJTelco will also use three separate operating environments –
development/test, staging, and production – to meet the needs of running
experiments, deploying new features, and serving production customers.

Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.

Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.

CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.

CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.

CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.

MJTelco's Google Cloud Dataflow pipeline is now ready to start receiving
data from the 50,000 installations. You want to allow Cloud Dataflow to
scale its compute power up as required. Which Cloud Dataflow pipeline
configuration setting should you update?

To enable Dataflow to scale its compute power in response to the
increased data load from 50,000 installations, you should configure the
maximum number of workers. This allows Dataflow to automatically
adjust the worker count within the specified limit, providing the necessary
processing power while also controlling costs.

253. MJTelco Case Study –


(Refer to the MJTelco case study above.)

You need to compose visualizations for operations teams with the
following requirements:
✑ The report must include telemetry data from all 50,000 installations for
the most recent 6 weeks (sampling once every minute).
✑ The report must not be more than 3 hours delayed from live data.
✑ The actionable report should only show suboptimal links.
✑ Most suboptimal links should be sorted to the top.
✑ Suboptimal links can be grouped and filtered by regional geography.
✑ User response time to load the report must be <5 seconds.
Which approach meets the requirements?

Loading the data into BigQuery and using Data Studio 360 provides the
best balance of scalability, performance, ease of use, and functionality to
meet MJTelco's visualization requirements.

254. You create an important report for your large team in Google Data
Studio 360. The report uses Google BigQuery as its data source. You notice
that visualizations are not showing data that is less than 1 hour old. What
should you do?

The most direct and effective way to ensure your Data Studio report shows
the latest data (less than 1 hour old) is to disable caching in the report
settings. This will force Data Studio to query BigQuery for fresh data each
time the report is accessed.

255. MJTelco Case Study –


(Refer to the MJTelco case study above.)

You create a new report for your large team in Google Data Studio 360.
The report uses Google BigQuery as its data source. It is company policy
to ensure employees can view only the data associated with their region,
so you create and populate a table for each region. You need to enforce
the regional access policy to the data. Which two actions should you take?
(Choose two.)

Organize your tables into regional datasets and then grant view access on
those datasets to the appropriate regional security groups. This ensures
that users only have access to the data relevant to their region.
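
As a sketch of the access-control step, BigQuery's SQL DCL can grant a viewer role on each regional dataset to the matching regional group; the project, dataset, and group names below are hypothetical, and the same grant can also be made in the console or with the bq tool:
-- Repeat per region with the corresponding dataset and Google group.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `myproject.sales_eu`
TO "group:eu-sales@example.com";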

256. MJTelco Case Study –


(Refer to the MJTelco case study above.)

MJTelco needs you to create a schema in Google Bigtable that will allow for
the historical analysis of the last 2 years of records. Each record that
comes in is sent every 15 minutes, and contains a unique identifier of the
device and a data record. The most common query is for all the data for a
given device for a given day.
Which schema should you use?

Optimized for Most Common Query: The most common query is for all data
for a given device on a given day. This schema directly matches the query
pattern by including both date and device_id in the row key. This enables
efficient retrieval of the required data using a single row key prefix scan.
Scalability: As the number of devices and data points increases, this
schema distributes the data evenly across nodes in the Bigtable cluster,
avoiding hotspots and ensuring scalability.
Data Organization: By storing data points as column values within each
row, you can easily add new data points or timestamps without modifying
the table structure.

257. Your company has recently grown rapidly and is now ingesting data at
a significantly higher rate than it was previously. You manage the daily
batch MapReduce analytics jobs in Apache Hadoop. However, the recent
increase in data has meant the batch jobs are falling behind. You were
asked to recommend ways the development team could increase the
responsiveness of the analytics without increasing costs. What should you
recommend they do?

Both Pig and Spark require rewriting the code, so either way there is
additional overhead, but as an architect I would aim for a long-lasting
solution. Resizing the Hadoop cluster can resolve the problem for the
workloads at this point in time, but not in the longer run. Spark is
therefore the right choice: although it has an upfront cost, it will be a
lasting solution.

258. You work for a large fast food restaurant chain with over 400,000
employees. You store employee information in Google BigQuery in a Users
table consisting of a FirstName field and a LastName field. A member of IT
is building an application and asks you to modify the schema and data in
BigQuery so the application can query a FullName field consisting of the
value of the FirstName field concatenated with a space, followed by the
value of the LastName field for each employee. How can you make that
data available while minimizing cost?

This option is the most cost-effective and efficient, as creating a view in


BigQuery doesn't result in any data duplication, and the view is
automatically updated whenever data in the underlying table changes.
Additionally, querying a view is as efficient as querying a table, so
performance will not be impacted.
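
A minimal sketch of that view, assuming a hypothetical project and dataset around the Users table from the question:
-- FullName is computed on the fly; no data is duplicated or stored twice.
CREATE VIEW `myproject.hr.users_with_fullname` AS
SELECT
  FirstName,
  LastName,
  CONCAT(FirstName, ' ', LastName) AS FullName
FROM
  `myproject.hr.Users`;
The application then queries FullName from the view while the underlying table keeps only the two original columns.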

259. You are deploying a new storage system for your mobile application,
which is a media streaming service. You decide the best fit is Google Cloud
Datastore. You have entities with multiple properties, some of which can
take on multiple values. For example, in the entity 'Movie' the property
'actors' and the property 'tags' have multiple values but the property 'date
released' does not. A typical query would ask for all movies with
actor=<actorname> ordered by date_released or all movies with
tag=Comedy ordered by date_released. How should you avoid a
combinatorial explosion in the number of indexes?

Option A provides the most efficient and scalable solution by creating


targeted indexes that match your specific query patterns. It avoids the
creation of unnecessary indexes, preventing the combinatorial explosion
and ensuring efficient query performance.

260. You work for a manufacturing plant that batches application log files
together into a single log file once a day at 2:00 AM. You have written a
Google Cloud Dataflow job to process that log file. You need to make sure
the log file is processed once per day as inexpensively as possible. What
should you do?

Using the Google App Engine Cron Service to run the Cloud Dataflow job
allows you to automate the execution of the job. By creating a cron job,
you can ensure that the Dataflow job is triggered exactly once per day at a
specified time. This approach is automated, reliable, and fits the
requirement of processing the log file once per day.

261. You work for an economic consulting firm that helps companies
identify economic trends as they happen. As part of your analysis, you use
Google BigQuery to correlate customer data with the average prices of the
100 most common goods sold, including bread, gasoline, milk, and others.
The average prices of these goods are updated every 30 minutes. You
want to make sure this data stays up to date so you can combine it with
other data in BigQuery as cheaply as possible. What should you do?
In summary, option B provides the most efficient and cost-
effective way to keep your economic data up-to-date in BigQuery
while minimizing overhead. You store the frequently changing data in a
cheaper storage service (Cloud Storage) and then use BigQuery's ability to
query data directly from that storage (federated tables) to combine it with
your other data. This avoids the need for constant, expensive data loading
into BigQuery.
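
A minimal sketch of such a federated (external) table, assuming a hypothetical bucket, dataset, and a simple three-column CSV layout:
-- The price data stays in Cloud Storage; BigQuery reads it at query time,
-- so refreshing it is just a matter of overwriting the CSV files.
CREATE EXTERNAL TABLE `myproject.economics.goods_prices` (
  good_name  STRING,
  avg_price  FLOAT64,
  updated_at TIMESTAMP
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-price-feed/goods/*.csv'],
  skip_leading_rows = 1
);
Queries can then join goods_prices directly against native BigQuery tables without any recurring load jobs.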

262. You are designing the database schema for a machine learning-
based food ordering service that will predict what users want to eat. Here
is some of the information you need to store:
✑ The user profile: What the user likes and doesn't like to eat
✑ The user account information: Name, address, preferred meal times
✑ The order information: When orders are made, from where, to whom
The database will be used to store all the transactional data of the
product. You want to optimize the data schema. Which Google Cloud
Platform product should you use?

For a machine learning-based food ordering service that needs to store a


variety of data, scale for transactions, and integrate with machine learning
tools, BigQuery is the most appropriate choice. It provides the necessary
flexibility, scalability, and analytical capabilities to support your needs.

263. Your company is loading comma-separated values (CSV) files into


Google BigQuery. The data is fully imported successfully; however, the
imported data is not matching byte-to-byte to the source file. What is the
most likely cause of this problem?

When importing CSV data into BigQuery, make sure to:


1. Identify the source file's encoding: Check the file's metadata or use a
text editor to determine its encoding.
2. Specify the encoding during import: BigQuery's import tools allow you
to specify the encoding. Choose the encoding that matches your source
file.
By correctly matching the encoding, you'll ensure that the data is loaded
accurately, and a byte-to-byte comparison will succeed.

264. Your company produces 20,000 files every hour. Each data file is
formatted as a comma separated values (CSV) file that is less than 4 KB.
All files must be ingested on Google Cloud Platform before they can be
processed. Your company site has a 200 ms latency to Google Cloud, and
your Internet connection bandwidth is limited as 50 Mbps. You currently
deploy a secure FTP (SFTP) server on a virtual machine in Google Compute
Engine as the data ingestion point. A local SFTP client runs on a dedicated
machine to transmit the CSV files as is. The goal is to make reports with
data from the previous day available to the executives by 10:00 a.m. each
day. This design is barely able to keep up with the current volume, even
though the bandwidth utilization is rather low. You are told that due to
seasonality, your company expects the number of files to double for the
next three months. Which two actions should you take? (Choose two.)

Combining parallel uploads using gsutil (option C) with reducing the


number of files by creating TAR archives (option D) provides the most
effective solution to scale your data ingestion pipeline and meet the
increased demand. These actions minimize overhead, maximize
bandwidth utilization, and address the limitations posed by latency.

265. An external customer provides you with a daily dump of data from
their database. The data flows into Google Cloud Storage GCS as comma-
separated values (CSV) files. You want to analyze this data in Google
BigQuery, but the data could have rows that are formatted incorrectly or
corrupted. How should you build this pipeline?

Using a Dataflow batch pipeline provides the most comprehensive and


flexible solution for handling data corruption in your CSV imports. It allows
you to:
1. Validate data: Implement custom validation logic to detect formatting
errors or corruption.
2. Isolate errors: Capture bad records and send them to a separate table
for analysis.
3. Ensure data quality: Load only clean, valid data into your main
BigQuery table.
4. Automate the process: Schedule the pipeline to run daily for seamless
data ingestion.
This approach ensures data quality, provides insights into data errors, and
automates the data loading process.

266. You are choosing a NoSQL database to handle telemetry data


submitted from millions of Internet-of-Things (IoT) devices. The volume of
data is growing at 100 TB per year, and each data entry has about 100
attributes. The data processing pipeline does not require atomicity,
consistency, isolation, and durability (ACID). However, high availability and
low latency are required. You need to analyze the data by querying against
individual fields. Which three databases meet your requirements? (Choose
three.)

A. Redis is a key-value store (in many cases used as an in-memory,
non-persistent cache). It is not designed for "100 TB per year" of highly
available storage.
B. HBase is similar to Google Bigtable and fits the requirements perfectly:
highly available, scalable, and with very low latency.
C. MySQL is a relational database, designed precisely for ACID transactions
and not for the stated requirements. Growth may also be an issue.
D. MongoDB is a document database used for high-volume data and keeps
currently used data in RAM, so performance is usually very good. It should
also fit the requirements well.
E. Cassandra is designed precisely for highly available, massive datasets,
and a fine-tuned cluster may offer low read latency. It fits the
requirements.
F. HDFS with Hive is great for OLAP and data warehouse scenarios, allowing
you to solve MapReduce problems using an SQL subset, but the latency is
usually quite high (seconds rather than milliseconds when obtaining
results), so it does not comply with the requirements.

267. You are training a spam classifier. You notice that you are overfitting
the training data. Which three actions can you take to resolve this
problem? (Choose three.)
To address the problem of overfitting in training a spam classifier, you
should consider the following three actions:
A. Get more training examples:
Why: More training examples can help the model generalize better to
unseen data. A larger dataset typically reduces the chance of overfitting,
as the model has more varied examples to learn from.
C. Use a smaller set of features:
Why: Reducing the number of features can help prevent the model from
learning noise in the data. Overfitting often occurs when the model is too
complex for the amount of data available, and having too many features
can contribute to this complexity.
E. Increase the regularization parameters:
Why: Regularization techniques (like L1 or L2 regularization) add a penalty
to the model for complexity. Increasing the regularization parameter will
strengthen this penalty, encouraging the model to be simpler and thus
reducing overfitting.

268. You are implementing security best practices on your data pipeline.
Currently, you are manually executing jobs as the Project Owner. You want
to automate these jobs by taking nightly batch files containing non-public
information from Google Cloud Storage, processing them with a Spark
Scala job on a Google Cloud Dataproc cluster, and depositing the results
into Google BigQuery. How should you securely run this workload?

It is best practice to use service accounts with the least privilege


necessary to perform a specific task when automating jobs. In this case,
the job needs to read the batch files from Cloud Storage and write the
results to BigQuery. Therefore, you should create a service account with
the ability to read from the Cloud Storage bucket and write to BigQuery,
and use that service account to run the job.

269. You are using Google BigQuery as your data warehouse. Your users
report that the following simple query is running very slowly, no matter
when they run the query:
SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP
BY country
You check the query plan for the query and see the following output in the
Read section of Stage:1:

What is the most likely cause of the delay for this query?

The most likely cause of the delay for this query is option D. Most rows in
the [myproject:mydataset.mytable] table have the same value in the
country column, causing data skew.
Group by queries in BigQuery can run slowly when there is significant data
skew on the grouped columns. Since the query is grouping by country, if
most rows have the same country value, all that data will need to be
shuffled to a single reducer to perform the aggregation. This can cause a
data skew slowdown.
Options A and B might cause general slowness but are unlikely to affect
this specific grouping query. Option C could also cause some slowness but
not to the degree that heavy data skew on the grouped column could. So
D is the most likely root cause. Optimizing the data distribution to reduce
skew on the grouped column would likely speed up this query.
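
A quick way to check the skew hypothesis is to look at row counts per country; the query below is a sketch in standard SQL, assuming the same table referenced in the question:
-- If one country dominates the counts, data skew is the likely culprit.
SELECT
  country,
  COUNT(*) AS row_count
FROM
  `myproject.mydataset.mytable`
GROUP BY
  country
ORDER BY
  row_count DESC
LIMIT 10;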

270. Your globally distributed auction application allows users to bid on


items. Occasionally, users place identical bids at nearly identical times,
and different application servers process those bids. Each bid event
contains the item, amount, user, and timestamp. You want to collate those
bid events into a single location in real time to determine which user bid
first. What should you do?
This is the most suitable solution for the requirements. Google Cloud
Pub/Sub can handle high throughput and low-latency data ingestion.
Coupled with Google Cloud Dataflow, which can process data streams in
real time, this setup allows for immediate processing of bid events.
Dataflow can also handle ordering and timestamp extraction, crucial for
determining which bid came first. This architecture supports scalability
and real-time analytics, which are essential for a global auction system.

271. Your organization has been collecting and analyzing data in Google
BigQuery for 6 months. The majority of the data analyzed is placed in a
time-partitioned table named events_partitioned. To reduce the cost of
queries, your organization created a view called events, which queries
only the last 14 days of data. The view is described in legacy SQL. Next
month, existing applications will be connecting to BigQuery to read the
events data via an ODBC connection. You need to ensure the applications
can connect. Which two actions should you take? (Choose two.)

C = A standard SQL query cannot reference a view defined using legacy SQL
syntax.
D = The ODBC drivers need a service account, which will be granted a
standard BigQuery role.

272. You have enabled the free integration between Firebase Analytics
and Google BigQuery. Firebase now automatically creates a new table
daily in BigQuery in the format app_events_YYYYMMDD. You want to query
all of the tables for the past 30 days in legacy SQL. What should you do?

The TABLE_DATE_RANGE function in BigQuery legacy SQL is a table wildcard
function that can be used to query a range of daily tables. It takes three
arguments: a table prefix and the start and end timestamps of the range.
The table prefix is the beginning of the table names, and the function
expands to cover all tables in the dataset that match the prefix and fall
within the date range. For example, if a dataset named mydataset contains
daily tables named my_table_20230804, my_table_20230805, and
my_table_20230806, you could use TABLE_DATE_RANGE to query all of the
tables between August 4, 2023 and August 6, 2023 as follows:
SELECT *
FROM TABLE_DATE_RANGE([mydataset.my_table_],
                      TIMESTAMP('2023-08-04'),
                      TIMESTAMP('2023-08-06'));

273. Your company is currently setting up data pipelines for their


campaign. For all the Google Cloud Pub/Sub streaming data, one of the
important business requirements is to be able to periodically identify the
inputs and their timings during their campaign. Engineers have decided to
use windowing and transformation in Google Cloud Dataflow for this
purpose. However, when testing this feature, they find that the Cloud
Dataflow job fails for the all streaming insert. What is the most likely cause
of this problem?

Beam's default windowing behavior is to assign all elements of a
PCollection to a single, global window and discard late data, even for
unbounded PCollections. Before you use a grouping transform such as
GroupByKey on an unbounded PCollection, you must do at least one of the
following:
- Set a non-global windowing function (see "Setting your PCollection's
windowing function" in the Beam documentation).
- Set a non-default trigger. This allows the global window to emit results
under other conditions, since the default windowing behavior (waiting for
all data to arrive) will never occur.

274. You architect a system to analyze seismic data. Your extract,


transform, and load (ETL) process runs as a series of MapReduce jobs on
an Apache Hadoop cluster. The ETL process takes days to process a data
set because some steps are computationally expensive. Then you discover
that a sensor calibration step has been omitted. How should you change
your ETL process to carry out sensor calibration systematically in the
future?
This approach would ensure that sensor calibration is systematically
carried out every time the ETL process runs, as the new MapReduce job
would be responsible for calibrating the sensors before the data is
processed by the other steps. This would ensure that all data is calibrated
before being analyzed, thus avoiding the omission of the sensor
calibration step in the future.
It also allows you to chain all other MapReduce jobs after this one, so that
the calibrated data is used in all the downstream jobs.

275. An online retailer has built their current application on Google App
Engine. A new initiative at the company mandates that they extend their
application to allow their customers to transact directly via the application.
They need to manage their shopping transactions and analyze combined
data from multiple datasets using a business intelligence (BI) tool. They
want to use only a single database for this purpose. Which Google Cloud
database should they choose?

Cloud SQL would be the most appropriate choice for the online retailer in
this scenario. Cloud SQL is a fully-managed relational database service
that allows for easy management and analysis of data using SQL. It is well-
suited for applications built on Google App Engine and can handle the
transactional workload of an e-commerce application, as well as the
analytical workload of a BI tool.

276. Your weather app queries a database every 15 minutes to get the
current temperature. The frontend is powered by Google App Engine and
serves millions of users. How should you design the frontend to respond to
a database failure?

Exponential backoff is a commonly used technique to handle temporary


failures, such as a database server becoming temporarily unavailable. This
approach retries the query, initially with a short delay and then with
increasingly longer intervals between retries. Setting a cap of 15 minutes
ensures that you don't excessively burden your system with constant
retries.

277. You launched a new gaming app almost three years ago. You have
been uploading log files from the previous day to a separate Google
BigQuery table with the table name format LOGS_yyyymmdd. You have
been using table wildcard functions to generate daily and monthly reports
for all time ranges. Recently, you discovered that some queries that cover
long date ranges are exceeding the limit of 1,000 tables and failing. How
can you resolve this issue?

Sharded tables, like LOGS_yyyymmdd, are useful for managing data, but
querying across a long date range with table wildcards can lead to
inefficiencies and exceed the 1,000 table limit in BigQuery. Instead of
using multiple sharded tables, you should consider converting these into a
partitioned table.
A partitioned table allows you to store all the log data in a single table, but
logically divides the data into partitions (e.g., by date). This way, you can
efficiently query data across long date ranges without hitting the 1,000
table limit.
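
A minimal sketch of the conversion, assuming a hypothetical project and dataset; it consolidates the LOGS_yyyymmdd shards into a single date-partitioned table (if the number of shards exceeds per-query table limits, run the backfill in smaller chunks by filtering on _TABLE_SUFFIX):
-- Each shard's date suffix becomes the partitioning column.
CREATE TABLE `myproject.gaming.logs_partitioned`
PARTITION BY log_date AS
SELECT
  *,
  PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS log_date
FROM
  `myproject.gaming.LOGS_*`;
Daily and monthly reports then filter on log_date, which prunes partitions instead of enumerating hundreds of sharded tables.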

278. Your analytics team wants to build a simple statistical model to


determine which customers are most likely to work with your company
again, based on a few different metrics. They want to run the model on
Apache Spark, using data housed in Google Cloud Storage, and you have
recommended using Google Cloud Dataproc to execute this job. Testing
has shown that this workload can run in approximately 30 minutes on a
15-node cluster, outputting the results into Google BigQuery. The plan is to
run this workload weekly. How should you optimize the cluster for cost?

Preemptible workers are the default secondary worker type. They are
reclaimed and removed from the cluster if they are required by Google
Cloud for other tasks. Although the potential removal of preemptible
workers can affect job stability, you may decide to use preemptible
instances to lower per-hour compute costs for non-critical data processing
or to create very large clusters at a lower total cost.

279. Your company receives both batch- and stream-based event data.
You want to process the data using Google Cloud Dataflow over a
predictable time period. However, you realize that in some instances data
can arrive late or out of order. How should you design your Cloud Dataflow
pipeline to handle data that is late or out of order?
Watermarks are a way to indicate that some data may still be in transit
and not yet processed. By setting a watermark, you can define a time
period during which Dataflow will continue to accept late or out-of-order
data and incorporate it into your processing. This allows you to maintain a
predictable time period for processing while still allowing for some
flexibility in the arrival of data.
Timestamps, on the other hand, are used to order events correctly, even if
they arrive out of order. By assigning timestamps to each event, you can
ensure that they are processed in the correct order, even if they don't
arrive in that order.

280. You have some data, which is shown in the graphic below. The two
dimensions are X and Y, and the shade of each dot represents what class
it is. You want to classify this data accurately using a linear algorithm. To
do this you need to add a synthetic feature. What should the value of that
feature be?

The synthetic feature that should be added in this case is the squared
value of the distance from the origin (0,0). This is equivalent to X² + Y². By
adding this feature, the classifier will be able to make more accurate
predictions by taking into account the distance of each data point from the
origin.
X² and Y² alone will not give enough information to classify the data
because they do not take into account the relationship between X and Y.

281. You are integrating one of your internal IT applications and Google
BigQuery, so users can query BigQuery from the application's interface.
You do not want individual users to authenticate to BigQuery and you do
not want to give them access to the dataset. You need to securely access
BigQuery from your IT application. What should you do?

Creating a service account and granting dataset access to that account is


the most secure way to access BigQuery from an IT application. Service
accounts are designed for use in automated systems and do not require
user interaction, eliminating the need for individual users to authenticate
to BigQuery. Additionally, by using the private key of the service account
to access the dataset, you can ensure that the authentication process is
secure and that only authorized users have access to the data.

282. You are building a data pipeline on Google Cloud. You need to
prepare data using a casual method for a machine-learning process. You
want to support a logistic regression model. You also need to monitor and
adjust for null values, which must remain real-valued and cannot be
removed. What should you do?

Option B leverages the user-friendly features of Cloud Dataprep to


efficiently identify and handle null values by replacing them with 0, a
suitable numerical value for logistic regression, fulfilling all the
requirements outlined in the prompt.

283. You set up a streaming data insert into a Redis cluster via a Kafka
cluster. Both clusters are running on Compute Engine instances. You need
to encrypt data at rest with encryption keys that you can create, rotate,
and destroy as needed. What should you do?
Cloud Key Management Service (KMS) is a fully managed service that
allows you to create, rotate, and destroy encryption keys as needed. By
creating encryption keys in Cloud KMS, you can use them to encrypt your
data at rest in the Compute Engine cluster instances, which is running
your Redis and Kafka clusters. This ensures that your data is protected
even when it is stored on disk.

284. You are developing an application that uses a recommendation


engine on Google Cloud. Your solution should display new videos to
customers based on past views. Your solution needs to generate labels for
the entities in videos that the customer has viewed. Your design must be
able to provide very fast filtering suggestions based on data from other
customer preferences on several TB of data. What should you do?

Option C is the correct choice because it utilizes the Cloud Video


Intelligence API to generate labels for the entities in the videos, which
would save time and resources compared to building and training a model
from scratch. Additionally, by storing the data in Cloud Bigtable, it allows
for fast and efficient filtering of the predicted labels based on the user's
viewing history and preferences. This is a more efficient and cost-effective
approach than storing the data in Cloud SQL and performing joins and
filters.

285. You are selecting services to write and transform JSON messages
from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You
want to minimize service costs. You also want to monitor and
accommodate input data volume that will vary in size with minimal
manual intervention. What should you do?
Using Cloud Dataflow for transformations, with monitoring via Stackdriver
and its default autoscaling settings, is the best choice. Cloud
Dataflow is purpose-built for this type of workload, providing seamless
scalability and efficient processing capabilities for streaming data. Its
autoscaling feature minimizes manual intervention and helps manage
costs by dynamically adjusting resources based on the actual processing
needs, which is crucial for handling fluctuating data volumes efficiently
and cost-effectively.

286. Your infrastructure includes a set of YouTube channels. You have


been tasked with creating a process for sending the YouTube channel data
to Google Cloud for analysis. You want to design a solution that allows your
world-wide marketing teams to perform ANSI SQL and other types of
analysis on up-to-date YouTube channels log data. How should you set up
the log data transfer into Google Cloud?

The BigQuery Data Transfer Service automates data movement into


BigQuery on a scheduled, managed basis. Your analytics team can lay the
foundation for a BigQuery data warehouse without writing a single line of
code.
After you configure a data transfer, the BigQuery Data Transfer Service
automatically loads data into BigQuery on a regular basis. You can also
initiate data backfills to recover from any outages or gaps. Currently, you
cannot use the BigQuery Data Transfer Service to transfer data out of
BigQuery.
287. You are creating a model to predict housing prices. Due to budget
constraints, you must run it on a single resource-constrained virtual
machine. Which learning algorithm should you use?

A tip here for deciding whether linear regression or logistic regression
should be used: if you are forecasting, that is, the value in the column
you are predicting is numeric, it is always linear regression. If you are
classifying, that is, buy or no buy (yes or no), you will be using logistic
regression.

288. You are designing storage for very large text files for a data pipeline
on Google Cloud. You want to support ANSI SQL queries. You also want to
support compression and parallel load from the input locations using
Google recommended practices. What should you do?

The advantages of creating external tables are that they are fast to
create, so you skip the step of importing data, and that no additional
monthly storage costs are accrued to your account, since you are only
charged for the data stored in the data lake, which is comparatively
cheaper than storing it in BigQuery.

289. You are developing an application on Google Cloud that will


automatically generate subject labels for users' blog posts. You are under
competitive pressure to add this feature quickly, and you have no
additional developer resources. No one on your team has experience with
machine learning. What should you do?
The Cloud Natural Language API is a pre-trained machine learning model
that can be used for natural language processing tasks such as entity
recognition, sentiment analysis, and syntax analysis. The API can be called
from your application using a simple API call, and it can generate entities
analysis that can be used as labels for the user's blog posts. This would be
the quickest and easiest option for your team since it would not require
any machine learning expertise or additional developer resources to build
and train a model. Additionally, it will give you accurate and up-to-date
results as the API is constantly updated by Google.

290. You are designing storage for 20 TB of text files as part of deploying
a data pipeline on Google Cloud. Your input data is in CSV format. You
want to minimize the cost of querying aggregate values for multiple users
who will query the data in Cloud Storage with multiple engines. Which
storage service and schema design should you use?

BigQuery can access data in external sources, known as federated


sources. Instead of first loading data into BigQuery, you can create a
reference to an external source. External sources can be Cloud Bigtable,
Cloud Storage, and Google Drive.
When accessing external data, you can create either permanent or
temporary external tables. Permanent tables are those that are created in
a dataset and linked to an external source. Dataset-level access controls
can be applied to these tables. When you are using a temporary table, a
table is created in a special dataset and will be available for
approximately 24 hours. Temporary tables are useful for one-time
operations, such as loading data into a data warehouse.

291. You are designing storage for two relational tables that are part of a
10-TB database on Google Cloud. You want to support transactions that
scale horizontally. You also want to optimize data for range queries on non-
key columns. What should you do?

Cloud Spanner is a fully-managed, horizontally scalable relational


database service that supports transactions and allows you to optimize
data for range queries on non-key columns. By using Cloud Spanner for
storage, you can ensure that your database can scale horizontally to meet
the needs of your application.
To optimize data for range queries on non-key columns, you can add
secondary indexes, this will allow you to perform range scans on non-key
columns, which can improve the performance of queries that filter on non-
key columns.
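
A minimal Cloud Spanner DDL sketch, with hypothetical table and column names, showing a secondary index on the non-key OrderDate column:
CREATE TABLE Orders (
  OrderId    STRING(36) NOT NULL,
  CustomerId STRING(36),
  OrderDate  DATE,
  Amount     NUMERIC
) PRIMARY KEY (OrderId);

-- Secondary index that serves range scans on the non-key column.
CREATE INDEX OrdersByOrderDate ON Orders(OrderDate);
A query such as SELECT OrderId FROM Orders WHERE OrderDate BETWEEN DATE '2024-01-01' AND DATE '2024-01-31' can then be answered from the index (Spanner may need a FORCE_INDEX hint to prefer it) instead of a full table scan.
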
292. Your financial services company is moving to cloud technology and
wants to store 50 TB of financial time-series data in the cloud. This data is
updated frequently and new data will be streaming in all the time. Your
company also wants to move their existing Apache Hadoop jobs to the
cloud to get insights into this data. Which product should they use to store
the data?

Bigtable is Google's NoSQL Big Data database service. It's the same
database that powers many core Google services, including Search,
Analytics, Maps, and Gmail. Bigtable is designed to handle massive
workloads at consistent low latency and high throughput, so it's a great
choice for both operational and analytical applications, including IoT, user
analytics, and financial data analysis.
Bigtable is an excellent option for any Apache Spark or Hadoop uses that
require Apache HBase. Bigtable supports the Apache HBase 1.0+ APIs and
offers a Bigtable HBase client in Maven, so it is easy to use Bigtable with
Dataproc.

293. An organization maintains a Google BigQuery dataset that contains
tables with user-level data. They want to expose aggregates of this data to
other Google Cloud projects, while still controlling access to the user-level
data. Additionally, they need to minimize their overall storage cost and
ensure the analysis cost for other projects is assigned to those projects.
What should they do?

An authorized view is a BigQuery feature that allows you to share only a
specific subset of data from a table, while still keeping the original data
private. This way, the organization can expose only the aggregate data to
other projects, while still controlling access to the user-level data. Because
an authorized view does not store a separate copy of the data, the
organization's overall storage cost stays minimal. Additionally, because
other projects run their own queries against the view, the analysis cost for
those projects is assigned to those projects.
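
A sketch of the authorized-view pattern with the BigQuery Python client, assuming hypothetical dataset and column names (private_ds for the user-level data, shared_ds for the aggregate view):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical datasets: private_ds holds the user-level table, shared_ds holds
# the aggregate view that other projects are allowed to query.
view = bigquery.Table("my-project.shared_ds.daily_aggregates")
view.view_query = """
    SELECT country, DATE(event_ts) AS day, COUNT(*) AS events
    FROM `my-project.private_ds.user_events`
    GROUP BY country, day
"""
view = client.create_table(view, exists_ok=True)

# Authorize the view to read the private dataset, so consumers of the view
# never need (or get) access to the underlying user-level table.
private_ds = client.get_dataset("my-project.private_ds")
entries = list(private_ds.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
private_ds.access_entries = entries
client.update_dataset(private_ds, ["access_entries"])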

294. Government regulations in your industry mandate that you have to
maintain an auditable record of access to certain types of data. Assuming
that all expiring logs will be archived correctly, where should you store
data that is subject to that mandate?

BigQuery provides built-in logging of all data access, including the user's
identity, the specific query run and the time of the query. This log can be
used to provide an auditable record of access to the data. Additionally,
BigQuery allows you to control access to the dataset using Identity and
Access Management (IAM) roles, so you can ensure that only authorized
personnel can view the dataset.

295. Your neural network model is taking days to train. You want to
increase the training speed. What can you do?

Subsampling your training dataset can help increase the training speed of
your neural network model. By reducing the size of your training dataset,
you can speed up the process of updating the weights in your neural
network. This can help you quickly test and iterate your model to improve
its accuracy.
Subsampling your test dataset, on the other hand, can lead to inaccurate
evaluation of your model's performance and may result in overfitting. It is
important to evaluate your model's performance on a representative test
dataset to ensure that it can generalize to new data.
Increasing the number of input features or layers in your neural network
can also improve its performance, but this may not necessarily increase
the training speed. In fact, adding more layers or features can increase the
complexity of your model and make it take longer to train. It is important
to balance the model's complexity with its performance and training time.
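
For example, with tf.data you might subsample the training set roughly like this; x_train, y_train, and model are hypothetical placeholders from an existing training script, and only the training data is subsampled.

import tensorflow as tf

# x_train, y_train, and model are hypothetical placeholders.
num_train_examples = len(x_train)
full_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))

# Keep roughly 20% of the examples to shorten each epoch while iterating.
sample_ds = (
    full_ds.shuffle(buffer_size=num_train_examples, seed=42)
           .take(int(0.2 * num_train_examples))
           .batch(256)
           .prefetch(tf.data.AUTOTUNE)
)

model.fit(sample_ds, epochs=5)
# Always evaluate on the full, untouched test set, never on a subsample.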

296. You are responsible for writing your company's ETL pipelines to run
on an Apache Hadoop cluster. The pipeline will require some checkpointing
and splitting pipelines. Which method should you use to write the
pipelines?

Pig Latin supports both splitting pipelines and checkpointing, allowing
users to create complex data processing workflows with the ability to
restart from specific points in the pipeline if necessary.

297. Your company maintains a hybrid deployment with GCP, where
analytics are performed on your anonymized customer data. The data are
imported to Cloud Storage from your data center through parallel uploads
to a data transfer server running on GCP. Management informs you that
the daily transfers take too long and have asked you to fix the problem.
You want to maximize transfer speeds. Which action should you take?

This will likely have the most impact on transfer speeds, because it
addresses the bottleneck in the transfer between your data center and
GCP. Increasing the CPU size or the size of the Google Persistent Disk on
the server may help with processing the data once it has been
transferred, but will not address the bottleneck in the transfer itself.
Likewise, increasing the network bandwidth from Compute Engine to
Cloud Storage only helps after the data has already reached GCP; it does
not address the bottleneck on the path between the data center and GCP.

298. You are building new real-time data warehouse for your company
and will use Google BigQuery streaming inserts. There is no guarantee
that data will only be sent in once but you do have a unique ID for each
row of data and an event timestamp. You want to ensure that duplicates
are not included while interactively querying data. Which query type
should you use?

This approach will assign a row number to each row within a unique ID
partition, and by selecting only rows with a row number of 1, you will
ensure that duplicates are excluded in your query results. It allows you to
filter out redundant rows while retaining the latest or earliest records
based on your timestamp column.
Options A, B, and C do not address the issue of duplicates effectively or
interactively as they do not explicitly remove duplicates based on the
unique ID and event timestamp.
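
A sketch of the deduplicating query shape, run here through the BigQuery Python client; the table and column names (unique_id, event_ts) are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()
dedup_sql = """
    SELECT * EXCEPT(rn)
    FROM (
      SELECT
        t.*,
        ROW_NUMBER() OVER (PARTITION BY unique_id ORDER BY event_ts DESC) AS rn
      FROM `my-project.realtime.events` AS t
    )
    WHERE rn = 1
"""
for row in client.query(dedup_sql).result():
    print(row)   # each unique_id appears exactly once; the latest event wins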

299. MJTelco Case Study –


Company Overview –
MJTelco is a startup that plans to build networks in rapidly growing,
underserved markets around the world. The company has patents for
innovative optical communications hardware. Based on these patents,
they can create many reliable, high-speed backbone links with inexpensive
hardware.

Company Background –
Founded by experienced telecom executives, MJTelco uses technologies
originally developed to overcome communications challenges in space.
Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine
learning to continuously optimize their topologies. Because their hardware
is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability
and cost.

Their management and operations teams are situated all around the globe
creating a many-to-many relationship between data consumers and
providers in their system. After careful consideration, they decided public
cloud is the perfect environment to support their needs.

Solution Concept –
MJTelco is running a successful proof-of-concept (PoC) project in its labs.
They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows
generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.
MJTelco will also use three separate operating environments
(development/test, staging, and production) to meet the needs of running
experiments, deploying new features, and serving production customers.

Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.

Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.

CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.

CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.

CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.
MJTelco is building a custom interface to share data. They have these
requirements:
1. They need to do aggregations over their petabyte-scale datasets.
2. They need to scan specific time range rows with a very fast
response time (milliseconds).
Which combination of Google Cloud Platform products should you
recommend?

BigQuery is the best choice for performing aggregations over petabyte-
scale datasets, while Cloud Bigtable is ideal for scanning specific time
range rows with very fast response times. Together, they provide a
powerful combination for MJTelco's data sharing interface.
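
A hedged sketch of the fast time-range scan on Bigtable, assuming a hypothetical row-key scheme of <link_id>#<YYYYMMDDHHMM> and a hypothetical column family and qualifier:

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("telemetry").table("link_metrics")

# Row keys are assumed to look like <link_id>#<YYYYMMDDHHMM>, so a time range
# for one link is a single contiguous key range.
rows = table.read_rows(
    start_key=b"link-0042#202401010000",
    end_key=b"link-0042#202401012359")
for row in rows:
    latency_cell = row.cells["metrics"][b"latency_ms"][0]  # hypothetical family/qualifier
    print(row.row_key, latency_cell.value)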

300. MJTelco Case Study –


Company Overview –
MJTelco is a startup that plans to build networks in rapidly growing,
underserved markets around the world. The company has patents for
innovative optical communications hardware. Based on these patents,
they can create many reliable, high-speed backbone links with inexpensive
hardware.

Company Background –
Founded by experienced telecom executives, MJTelco uses technologies
originally developed to overcome communications challenges in space.
Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine
learning to continuously optimize their topologies. Because their hardware
is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability
and cost.

Their management and operations teams are situated all around the globe
creating a many-to-many relationship between data consumers and
providers in their system. After careful consideration, they decided public
cloud is the perfect environment to support their needs.

Solution Concept –
MJTelco is running a successful proof-of-concept (PoC) project in its labs.
They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows
generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.
MJTelco will also use three separate operating environments
(development/test, staging, and production) to meet the needs of running
experiments, deploying new features, and serving production customers.

Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.

Technical Requirements –
✑Ensure secure and efficient transport and storage of telemetry data
✑Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.

CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.

CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.

CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.

You need to compose visualization for operations teams with the following
requirements:
✑ Telemetry must include data from all 50,000 installations for the most
recent 6 weeks (sampling once every minute)
✑ The report must not be more than 3 hours delayed from live data.
✑ The actionable report should only show suboptimal links.
✑ Most suboptimal links should be sorted to the top.

✑ Suboptimal links can be grouped and filtered by regional geography.
✑ User response time to load the report must be <5 seconds.

You create a data source to store the last 6 weeks of data, and create
visualizations that allow viewers to see multiple date ranges, distinct
geographic regions, and unique installation types. You always show the
latest data without any changes to your visualizations. You want to avoid
creating and updating new visualizations each month. What should you
do?

D is the most correct in this specific context because it enables complete
automation of the data processing, filtering, sorting, and visualization
updates, fulfilling the prompt's "no changes" and automation requirements
more directly. It prioritizes automation over interactive exploration.

301. MJTelco Case Study –


Company Overview –
MJTelco is a startup that plans to build networks in rapidly growing,
underserved markets around the world. The company has patents for
innovative optical communications hardware. Based on these patents,
they can create many reliable, high-speed backbone links with inexpensive
hardware.

Company Background –
Founded by experienced telecom executives, MJTelco uses technologies
originally developed to overcome communications challenges in space.
Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine
learning to continuously optimize their topologies. Because their hardware
is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability
and cost.
Their management and operations teams are situated all around the globe
creating a many-to-many relationship between data consumers and
providers in their system. After careful consideration, they decided public
cloud is the perfect environment to support their needs.

Solution Concept –
MJTelco is running a successful proof-of-concept (PoC) project in its labs.
They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows
generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.

MJTelco will also use three separate operating environments
(development/test, staging, and production) to meet the needs of running
experiments, deploying new features, and serving production customers.

Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.

Technical Requirements –
✑Ensure secure and efficient transport and storage of telemetry data
✑Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.

CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.
CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.

CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.

Given the record streams MJTelco is interested in ingesting per day, they
are concerned about the cost of Google BigQuery increasing. MJTelco asks
you to provide a design solution. They require a single large data table
called tracking_table. Additionally, they want to minimize the cost of daily
queries while performing fine-grained analysis of each day's events. They
also want to use streaming ingestion. What should you do?

Partitioned tables in BigQuery have tiered storage costs: if a partition is
not modified (by DML or loads) for 90 consecutive days, its storage is
billed at the long-term rate, roughly 50% less. At the same time, querying
stays efficient and cheap because partition pruning limits each daily query
to the relevant partition of the single large tracking_table.
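
A minimal sketch of this design with the BigQuery Python client: a single date-partitioned tracking_table that accepts streaming inserts. The project, dataset, and schema are assumptions for illustration.

from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table("my-project.telemetry.tracking_table", schema=[
    bigquery.SchemaField("installation_id", "STRING"),
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("metric", "FLOAT"),
])
# Partition the single large table by day so each daily analysis only scans
# (and is billed for) the relevant partition.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_ts")
client.create_table(table, exists_ok=True)

# Streaming ingestion straight into the partitioned table.
errors = client.insert_rows_json(
    "my-project.telemetry.tracking_table",
    [{"installation_id": "inst-1", "event_ts": "2024-01-01T00:01:00Z", "metric": 0.97}])
assert not errors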

302. Flowlogistic Case Study –


Company Overview –
Flowlogistic is a leading logistics and supply chain provider. They help
businesses throughout the world manage their resources and transport
them to their final destination. The company has grown rapidly, expanding
their offerings to include rail, truck, aircraft, and oceanic shipping.

Company Background –
The company started as a regional trucking company, and then expanded
into other logistics market. Because they have not updated their
infrastructure, managing and tracking orders and shipments has become a
bottleneck. To improve operations, Flowlogistic developed proprietary
technology for tracking shipments in real time at the parcel level.
However, they are unable to deploy it because their technology stack,
based on Apache Kafka, cannot support the processing volume. In
addition, Flowlogistic wants to further analyze their orders and shipments
to determine how best to deploy their resources.

Solution Concept –
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking
system that indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain
both structured and unstructured data, to determine how best to deploy
resources, which markets to expand info. They also want to use predictive
analytics to learn earlier when a shipment will be delayed.

Existing Technical Environment –


Flowlogistic architecture resides in a single data center:
✑ Databases
- 8 physical servers in 2 clusters
- SQL Server - user data, inventory, static data
- 3 physical servers
- Cassandra - metadata, tracking messages
- 10 Kafka servers - tracking message aggregation and batch insert

✑ Application servers - customer front end, middleware for order/customs
- 60 virtual machines across 20 physical servers
- Tomcat - Java services
- Nginx - static content
- Batch servers

✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) - SQL server storage
- Network-attached storage (NAS) - image storage, logs, backups

✑ 10 Apache Hadoop/Spark servers
- Core Data Lake
- Data analysis workloads

✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts,

Business Requirements –
✑ Build a reliable and reproducible environment with scaled parity of
production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid
provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met
Technical Requirements –
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing
demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑Connect a VPN between the production data center and cloud
environment

CEO Statement –
We have grown so quickly that our inability to upgrade our infrastructure is
really hampering further growth and efficiency. We are efficient at moving
shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand
where our customers are and what they are shipping.

CTO Statement –
IT has never been a priority for us, so as our data has grown, we have not
invested enough in our technology. I have a good staff to manage IT, but
they are so busy managing our infrastructure that I cannot get them to do
the things that really matter, such as organizing our data, building the
analytics, and figuring out how to implement the CFO' s tracking
technology.

CFO Statement –
Part of our competitive advantage is that we penalize ourselves for late
shipments and deliveries. Knowing where our shipments are at all times
has a direct correlation to our bottom line and profitability. Additionally, I
don't want to commit capital to building out a server environment.

Flowlogistic's management has determined that the current Apache Kafka
servers cannot handle the data volume for their real-time inventory
tracking system.

You need to build a new system on Google Cloud Platform (GCP) that will
feed the proprietary tracking software. The system must be able to ingest
data from a variety of global sources, process and query in real-time, and
store the data reliably. Which combination of GCP products should you
choose?

The problem statement is that Flowlogistic's management has determined
that the current Apache Kafka servers cannot handle the data volume for
their real-time inventory tracking system; it says the existing cluster
cannot keep up with the volume, not that the volume cannot be estimated.
The requirements are that the new system must be able to:
- ingest data from a variety of global sources
- process and query in real time
- store the data reliably
As Google's documentation notes, a single ingestion stream can feed
multiple downstream systems; for example, a Google Compute Engine
instance can write logs to a monitoring system, to a database for later
querying, and so on.

303. After migrating ETL jobs to run on BigQuery, you need to verify that
the output of the migrated jobs is the same as the output of the original.
You've loaded a table containing the output of the original job and want to
compare the contents with output from the migrated job to show that they
are identical. The tables do not contain a primary key column that would
enable you to join them together for comparison. What should you do?

Option C uses the power of distributed computing (Dataproc) and a clever
technique (sorting before hashing) to definitively compare the contents of
two large tables without relying on a primary key. It's the most robust and
accurate way to verify your ETL migration in this scenario.
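
A rough PySpark sketch of the sort-then-hash comparison, assuming the spark-bigquery connector is available on the Dataproc cluster and that the table names are hypothetical; for very large tables the final aggregation would need a more scalable order-insensitive combine, so treat this as an illustration of the idea rather than a production job.

# PySpark sketch for a Dataproc cluster; assumes the spark-bigquery connector
# is installed and both tables have identical column sets.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-output-diff").getOrCreate()

def table_fingerprint(table_id):
    df = spark.read.format("bigquery").option("table", table_id).load()
    cols = sorted(df.columns)  # fixed column order so hashes are comparable
    row_hash = F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in cols]), 256)
    # Sort the per-row hashes, then hash the sorted list into one fingerprint.
    return (df.select(row_hash.alias("h"))
              .agg(F.sha2(F.concat_ws("", F.sort_array(F.collect_list("h"))), 256)
                    .alias("fingerprint"))
              .first()["fingerprint"])

original = table_fingerprint("my-project.etl.original_output")   # hypothetical IDs
migrated = table_fingerprint("my-project.etl.migrated_output")
print("identical" if original == migrated else "different")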

304. You are a head of BI at a large enterprise company with multiple
business units that each have different priorities and budgets. You use
on-demand pricing for BigQuery with a quota of 2K concurrent on-demand
slots per project. Users at your organization sometimes don't get slots to
execute their query and you need to correct this. You'd like to avoid
introducing new projects to your account. What should you do?

Switching to flat-rate pricing would allow you to ensure a consistent level
of service and avoid running into the on-demand slot quota per project.
Additionally, by establishing a hierarchical priority model for your projects,
you could allocate resources based on the specific needs and priorities of
each business unit, ensuring that the most critical queries are executed
first. This approach would allow you to balance the needs of each business
unit while maximizing the use of your BigQuery resources.

305. You have an Apache Kafka cluster on-prem with topics containing
web application logs. You need to replicate the data to Google Cloud for
analysis in BigQuery and Cloud Storage. The preferred replication method
is mirroring to avoid deployment of Kafka Connect plugins. What should
you do?

This option involves setting up a separate Kafka cluster in Google Cloud,
and then configuring the on-prem cluster to mirror the topics to this
cluster. The data from the Google Cloud Kafka cluster can then be read
using either a Dataproc cluster or a Dataflow job and written to Cloud
Storage for analysis in BigQuery.

306. You've migrated a Hadoop job from an on-prem cluster to Dataproc
and GCS. Your Spark job is a complicated analytical workload that consists
of many shuffling operations and initial data are parquet files (on average
200-400 MB size each). You see some degradation in performance after
the migration to Dataproc, so you'd like to optimize for it. You need to keep
in mind that your organization is very cost-sensitive, so you'd like to
continue using Dataproc on preemptibles (with 2 non-preemptible workers
only) for this workload. What should you do?

By default, preemptible node disk sizes are limited to 100GB or the size of
the non-preemptible node disk sizes, whichever is smaller. However you
can override the default preemptible disk size to any requested size. Since
the majority of our cluster is using preemptible nodes, the size of the disk
used for caching operations will see a noticeable performance
improvement using a larger disk. Also, SSD's will perform better than HDD.
This will increase costs slightly, but is the best option available while
maintaining costs.

307. Your team is responsible for developing and maintaining ETLs in
your company. One of your Dataflow jobs is failing because of some errors
in the input data, and you need to improve reliability of the pipeline (incl.
being able to reprocess all failing data). What should you do?

Based on the given scenario, option D would be the best approach to
improve the reliability of the pipeline.
Adding a try-catch block to the DoFn that transforms the data would allow
you to catch and handle errors within the pipeline. However, storing
erroneous rows in Pub/Sub directly from the DoFn (Option C) could
potentially create a bottleneck in the pipeline, as it adds additional I/O
operations to the data processing. Option D of using a sideOutput to
create a PCollection of erroneous data would allow for reprocessing of the
failed data and would not create a bottleneck in the pipeline. Storing the
erroneous data in a separate PCollection would also make it easier to
debug and analyze the failed data.
Therefore, adding a try-catch block to the DoFn that transforms the data
and using a sideOutput to create a PCollection of erroneous data that can
be stored to Pub/Sub later would be the best approach to improve the
reliability of the pipeline.
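
A sketch of the side-output pattern in the Beam Python SDK (the same idea applies in Java); transform(), the bucket paths, and the choice of a text dead-letter sink rather than Pub/Sub are assumptions for illustration.

import apache_beam as beam

class ParseRecord(beam.DoFn):
    VALID = "valid"
    ERRORS = "errors"

    def process(self, line):
        try:
            yield transform(line)  # transform() is a hypothetical parsing/enrichment step
        except Exception as exc:
            # Route the failing input (plus the error message) to a side output
            # so it can be stored and reprocessed later.
            yield beam.pvalue.TaggedOutput(self.ERRORS, {"raw": line, "error": str(exc)})

with beam.Pipeline() as p:
    results = (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "Parse" >> beam.ParDo(ParseRecord()).with_outputs(
            ParseRecord.ERRORS, main=ParseRecord.VALID)
    )
    results[ParseRecord.VALID] | "GoodToStr" >> beam.Map(str) \
        | "WriteGood" >> beam.io.WriteToText("gs://my-bucket/clean/part")
    results[ParseRecord.ERRORS] | "ErrToStr" >> beam.Map(str) \
        | "WriteDeadLetter" >> beam.io.WriteToText("gs://my-bucket/errors/part")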

308. You're training a model to predict housing prices based on an
available dataset with real estate properties. Your plan is to train a fully
connected neural net, and you've discovered that the dataset contains
latitude and longitude of the property. Real estate professionals have told
you that the location of the property is highly influential on price, so you'd
like to engineer a feature that incorporates this physical dependency.
What should you do?
Option C provides the best approach by:
 Creating a feature cross to represent specific locations.
 Bucketizing at an appropriate granularity (minute level) to balance
precision and practicality.
 Using L1 regularization to perform feature selection and focus on the most
relevant location features.
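
The same idea can be sketched outside TensorFlow with pandas and scikit-learn: bucketize latitude and longitude at roughly minute granularity, cross the buckets into one location feature, one-hot encode it, and let L1 regularization prune the unhelpful cells. All column names and data here are synthetic placeholders.

import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real estate dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "latitude": rng.uniform(37.0, 38.0, 1000),
    "longitude": rng.uniform(-123.0, -122.0, 1000),
    "price": rng.uniform(3e5, 2e6, 1000),
})

# Bucketize lat/long at roughly "minute" granularity (1/60 of a degree),
# then cross the two buckets into a single location feature.
lat_bin = (df["latitude"] * 60).astype(int).astype(str)
lon_bin = (df["longitude"] * 60).astype(int).astype(str)
location_cross = (lat_bin + "_x_" + lon_bin).to_frame("loc")

X = OneHotEncoder(handle_unknown="ignore").fit_transform(location_cross)
model = Lasso(alpha=0.1)  # L1 regularization zeroes out unhelpful location cells
model.fit(X, df["price"])
print("non-zero location weights:", int(np.sum(model.coef_ != 0)))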

309. Your company is using WILDCARD tables to query data across
multiple tables with similar names. The SQL statement is currently failing
with the following error:

Which table name will make the SQL statement work correctly?

Option B is the only one that correctly uses the wildcard syntax without
quotes to specify a pattern for multiple tables and allows the
TABLE_SUFFIX to filter based on the matched portion of the table name.
This makes it the correct answer for querying across multiple tables using
WILDCARD tables in BigQuery.
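
For illustration, a working wildcard query has roughly this shape; the project, dataset, table prefix, and columns are hypothetical, and the key details are the backticked wildcard name (no quotation marks) and the _TABLE_SUFFIX filter.

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT station_id, AVG(temp) AS avg_temp
    FROM `my-project.weather.observations_*`               -- backticked wildcard, not quoted
    WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'  -- limits which tables are scanned
    GROUP BY station_id
"""
for row in client.query(sql).result():
    print(row.station_id, row.avg_temp)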

310. You are deploying MariaDB SQL databases on GCE VM Instances and
need to configure monitoring and alerting. You want to collect metrics
including network connections, disk IO and replication status from MariaDB
with minimal development effort and use StackDriver for dashboards and
alerts. What should you do?
StackDriver Agent: The StackDriver Agent is designed to collect system
and application metrics from virtual machine instances and send them to
StackDriver Monitoring. It simplifies the process of collecting and
forwarding metrics.
MySQL Plugin: The StackDriver Agent has a MySQL plugin that allows you
to collect MySQL-specific metrics without the need for additional custom
development. This includes metrics related to network connections, disk
IO, and replication status – which are the specific metrics you mentioned.
Option D is the most straightforward and least development-intensive
approach to achieve the monitoring and alerting requirements for MariaDB
on GCE VM Instances using StackDriver.

311. You work for a bank. You have a labelled dataset that contains
information on already granted loan application and whether these
applications have been defaulted. You have been asked to train a model to
predict default rates for credit applicants. What should you do?

Appropriate Model for Prediction: Linear regression is a common statistical
method used for predictive modeling, particularly when the outcome
variable (in this case, the likelihood of default) is continuous. In the
context of credit scoring, linear regression can be used to predict a risk
score that represents the probability of default.
Utilization of Labeled Data: Since you already have a labeled dataset
containing information on loans that have been granted and whether they
have defaulted, you can use this data to train the regression model. This
historical data provides the model with examples of borrower
characteristics and their corresponding default outcomes.
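
A minimal scikit-learn sketch of this approach, training a linear regression on the labeled loans so its prediction can serve as a default-risk score; the file name and feature columns are hypothetical.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical labeled dataset of granted loans with a 0/1 "defaulted" label.
df = pd.read_csv("granted_loans.csv")
X = df[["income", "loan_amount", "credit_history_len"]]  # hypothetical features
y = df["defaulted"]                                      # 1 = defaulted, 0 = repaid

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# The continuous prediction serves as a default-risk score for new applicants.
print(model.predict(X_test)[:5])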

312. You need to migrate a 2TB relational database to Google Cloud
Platform. You do not have the resources to significantly refactor the
application that uses this database and cost to operate is of primary
concern. Which service do you select for storing and serving your data?
Cloud SQL: maximum storage is 3 TB for shared-core machines and up to
64 TB for dedicated-core machines. Only use Spanner if you need
automatic horizontal scaling (Cloud SQL can scale, but not automatically
yet), if the database exceeds those limits, or if you need four or five nines
of availability (Cloud SQL offers 99.95%).

313. You're using Bigtable for a real-time application, and you have a
heavy load that is a mix of reads and writes. You've recently identified an
additional use case and need to run an hourly analytical job to
calculate certain statistics across the whole database. You need to ensure
both the reliability of your production application as well as the analytical
workload. What should you do?

When you use a single cluster to run a batch analytics job that performs
numerous large reads alongside an application that performs a mix of
reads and writes, the large batch job can slow things down for the
application's users. With replication, you can use app profiles with single-
cluster routing to route batch analytics jobs and application traffic to
different clusters, so that batch jobs don't affect your applications' users.

314. You are designing an Apache Beam pipeline to enrich data from
Cloud Pub/Sub with static reference data from BigQuery. The reference
data is small enough to fit in memory on a single worker. The pipeline
should write enriched results to BigQuery for analysis. Which job type and
transforms should this pipeline use?

In streaming analytics applications, data is often enriched with additional
information that might be useful for further analysis. For example, if you
have the store ID for a transaction, you might want to add information
about the store location. This additional information is often added by
taking an element and bringing in information from a lookup table.
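
A hedged sketch of the streaming-job-with-side-input pattern in the Beam Python SDK; the Pub/Sub topic, BigQuery tables, and field names are hypothetical, and ReadFromBigQuery may additionally need a temporary GCS location configured on the pipeline.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # streaming job, since the main input is Pub/Sub
with beam.Pipeline(options=options) as p:
    # Small, static reference data: loaded once and broadcast to workers as a side input.
    stores = (
        p
        | "ReadRef" >> beam.io.ReadFromBigQuery(
            query="SELECT store_id, city FROM `my-project.ref.stores`",
            use_standard_sql=True)
        | "KeyRef" >> beam.Map(lambda r: (r["store_id"], r["city"]))
    )

    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/transactions")
        | "Parse" >> beam.Map(json.loads)
        | "Enrich" >> beam.Map(
            lambda tx, lookup: {**tx, "city": lookup.get(tx["store_id"], "unknown")},
            lookup=beam.pvalue.AsDict(stores))
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:analytics.enriched_transactions",   # table assumed to exist
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )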

315. You have a data pipeline that writes data to Cloud Bigtable using
well-designed row keys. You want to monitor your pipeline to determine
when to increase the size of your Cloud Bigtable cluster. Which two actions
can you take to accomplish this? (Choose two.)

D: In general, do not use more than 70% of the hard limit on total storage,
so you have room to add more data. If you do not plan to add significant
amounts of data to your instance, you can use up to 100% of the hard
limit
C: If this value is frequently at 100%, you might experience increased
latency. Add nodes to the cluster to reduce the disk load percentage.
The key visualizer metrics options, suggest other things other than
increase the cluster size.

316. You want to analyze hundreds of thousands of social media posts
daily at the lowest cost and with the fewest steps.
You have the following requirements:
✑ You will batch-load the posts once per day and run them through the
Cloud Natural Language API.
✑ You will extract topics and sentiment from the posts.
✑ You must store the raw posts for archiving and reprocessing.
✑ You will create dashboards to be shared with people both inside and
outside your organization.
You need to store both the data extracted from the API to perform analysis
as well as the raw social media posts for historical archiving. What should
you do?
First store the raw posts on Cloud Storage, then extract only the relevant
information for analysis and load it into BigQuery. This way, large raw
content (e.g., audio or video attachments) stays on Cloud Storage and is
not lost; BigQuery cannot store audio/video. Note also that the Cloud
Natural Language API, which performs the analysis, takes text as its
source.

317. You store historic data in Cloud Storage. You need to perform
analytics on the historic data. You want to use a solution to detect invalid
data entries and perform data transformations that will not require
programming or knowledge of SQL. What should you do?

Cloud Dataprep is the best solution in this scenario because it provides a
code-free and SQL-free environment for detecting errors and performing
data transformations on historic data in Cloud Storage. Its visual interface
and recipe-based approach make it accessible to users without
programming or SQL skills.

318. Your company needs to upload their historic data to Cloud Storage.
The security rules don't allow access from external IPs to their on-premises
resources. After an initial upload, they will add new data from existing on-
premises applications every day. What should they do?

gsutil rsync is the most straightforward, secure, and efficient solution for
transferring data from on-premises servers to Cloud Storage, especially
when security rules restrict inbound connections to the on-premises
environment. It's well-suited for both the initial bulk upload and the
ongoing daily updates.

319. You have a query that filters a BigQuery table using a WHERE clause
on timestamp and ID columns. By using bq query --dry_run you learn that
the query triggers a full scan of the table, even though the filter on
timestamp and ID select a tiny fraction of the overall data. You want to
reduce the amount of data scanned by BigQuery with minimal changes to
existing SQL queries. What should you do?

Partitioning and clustering are the most effective way to optimize
BigQuery queries that filter on specific columns like timestamp and ID. By
reorganizing the table structure, BigQuery can significantly reduce the
amount of data scanned, leading to faster and cheaper queries.
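
A sketch of what that reorganization can look like with a CREATE TABLE ... AS SELECT statement issued through the BigQuery Python client; all table and column names are hypothetical, and existing queries keep their WHERE clauses unchanged.

from google.cloud import bigquery

client = bigquery.Client()

# Rebuild the table partitioned on the timestamp column and clustered on the
# ID column (hypothetical names), then point existing queries at the new table.
client.query("""
    CREATE TABLE `my-project.logs.events_partitioned`
    PARTITION BY DATE(event_ts)
    CLUSTER BY id
    AS SELECT * FROM `my-project.logs.events`
""").result()

# The unchanged WHERE clause now prunes partitions and clustered blocks.
job = client.query("""
    SELECT * FROM `my-project.logs.events_partitioned`
    WHERE event_ts BETWEEN '2024-01-01' AND '2024-01-02' AND id = 'abc'
""")
job.result()
print("bytes processed:", job.total_bytes_processed)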
