GCP Certification
5. You need to create a near real-time inventory dashboard that reads the
main inventory tables in your BigQuery data warehouse. Historical
inventory data is stored as inventory balances by item and location. You
have several thousand updates to inventory every hour. You want to
maximize performance of the dashboard and ensure that the data is
accurate. What should you do?
Streams inventory changes in near real-time: BigQuery streaming ingests data immediately, keeping the inventory movement table constantly updated.
Daily balance calculation: Joining the movement table with the historical
balance table provides an accurate view of current inventory levels
without affecting the actual balance table.
Nightly update for historical data: Updating the main inventory balance
table nightly ensures long-term data consistency while maintaining near
real-time insights through the view.
This approach balances near real-time updates with efficiency and data
accuracy, making it the optimal solution for the given scenario.
6. You have data stored in BigQuery. The data in the BigQuery dataset must
be highly available. You need to define a storage, backup, and recovery
strategy for this data that minimizes cost. How should you configure the
BigQuery table to have a recovery point objective (RPO) of 30 days?
We have an external dependency ("after the load job with variable execution time completes"), which requires a DAG -> Airflow (Cloud Composer).
The reasons:
A scheduler like Cloud Scheduler won't handle the dependency on the
BigQuery load completion time
Using Composer allows creating a DAG workflow that can:
Trigger the BigQuery load
Wait for BigQuery load to complete
Trigger the Dataprep Dataflow job
Dataflow template allows easy reuse of the Dataprep transformation logic
Composer coordinates everything based on the dependencies in an
automated workflow
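As a concrete illustration, here is a minimal Cloud Composer (Airflow) DAG sketch of this pattern. The bucket, table, and template paths are hypothetical placeholders, and it assumes the Dataprep transformation has already been exported as a Dataflow template.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

with DAG(
    dag_id="load_then_transform",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The operator waits for the BigQuery load job to finish, so the variable
    # execution time is handled by the DAG's own dependency ordering.
    load_to_bigquery = BigQueryInsertJobOperator(
        task_id="load_to_bigquery",
        configuration={
            "load": {
                "sourceUris": ["gs://my-bucket/raw/*.csv"],   # hypothetical source files
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "staging",
                    "tableId": "raw_events",
                },
                "sourceFormat": "CSV",
                "writeDisposition": "WRITE_TRUNCATE",
            }
        },
    )

    # Runs the Dataprep transformation that was exported as a Dataflow template.
    run_dataprep_transform = DataflowTemplatedJobStartOperator(
        task_id="run_dataprep_transform",
        template="gs://my-bucket/templates/dataprep_transform",  # hypothetical template path
        location="us-central1",
    )

    # Only start the Dataflow job after the load completes.
    load_to_bigquery >> run_dataprep_transform
```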
9. You are managing a Cloud Dataproc cluster. You need to make a job run
faster while minimizing costs, without losing work in progress on your
clusters. What should you do?
10.You work for a shipping company that uses handheld scanners to read
shipping labels. Your company has strict data privacy standards that
require scanners to only transmit tracking numbers when events are sent
to Kafka topics. A recent software update caused the scanners to
accidentally transmit recipients' personally identifiable information (PII) to
analytics systems, which violates user privacy rules. You want to quickly
build a scalable solution using cloud-native managed services to prevent
exposure of PII to the analytics systems. What should you do?
Quick to implement: Using managed services reduces development time and effort compared to building solutions from scratch.
Scalability: Cloud Functions and the Cloud DLP API are designed to handle large volumes of data and can scale easily.
Accuracy: The Cloud DLP API has advanced PII detection capabilities.
Flexibility: You can customize the processing logic in the Cloud Function to meet your specific needs.
Security: Sensitive data is handled securely within a controlled cloud environment.
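A minimal sketch of this pattern, assuming the scanner events are mirrored from the Kafka topics into a Pub/Sub topic that triggers the function; the project id, downstream topic, and info type list are hypothetical and would need to match your environment.

```python
import base64
import os

from google.cloud import dlp_v2, pubsub_v1

dlp = dlp_v2.DlpServiceClient()
publisher = pubsub_v1.PublisherClient()

PROJECT = os.environ.get("GCP_PROJECT", "my-project")        # hypothetical project id
CLEAN_TOPIC = publisher.topic_path(PROJECT, "scans-clean")   # hypothetical downstream topic

INSPECT_CONFIG = {
    "info_types": [
        {"name": "PERSON_NAME"},
        {"name": "EMAIL_ADDRESS"},
        {"name": "STREET_ADDRESS"},
        {"name": "PHONE_NUMBER"},
    ]
}
DEIDENTIFY_CONFIG = {
    "info_type_transformations": {
        "transformations": [
            # Replaces each finding with its info type, e.g. "Jane Doe" -> "[PERSON_NAME]".
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}


def redact_pii(event, context):
    """Background Cloud Function triggered by the raw scanner-events topic."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    response = dlp.deidentify_content(
        request={
            "parent": f"projects/{PROJECT}",
            "inspect_config": INSPECT_CONFIG,
            "deidentify_config": DEIDENTIFY_CONFIG,
            "item": {"value": payload},
        }
    )
    # Forward only the redacted payload to the analytics systems.
    publisher.publish(CLEAN_TOPIC, response.item.value.encode("utf-8"))
```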
11.You have developed three data processing jobs. One executes a Cloud
Dataflow pipeline that transforms data uploaded to Cloud Storage and
writes results to BigQuery. The second ingests data from on-premises
servers and uploads it to Cloud Storage. The third is a Cloud Dataflow
pipeline that gets information from third-party data providers and uploads
the information to Cloud Storage. You need to be able to schedule and
monitor the execution of these three workflows and manually execute
them when needed. What should you do?
12.You have Cloud Functions written in Node.js that pull messages from Cloud
Pub/Sub and send the data to BigQuery. You observe that the message
processing rate on the Pub/Sub topic is orders of magnitude higher than
anticipated, but there is no error logged in Cloud Logging. What are the
two most likely causes of this problem? (Choose two.)
By not acknowledging the pulled messages, they are put back into Cloud Pub/Sub, meaning the messages accumulate instead of being consumed and removed from Pub/Sub. The same thing can happen if the subscriber maintains the lease on the message it receives in case of an error. This reduces the overall rate of processing because messages get stuck on the first subscriber. Also, errors in the Cloud Function do not show up in Cloud Logging if runtime errors are not handled and surfaced properly.
14.You are creating a new pipeline in Google Cloud to stream IoT data from
Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the
data, you notice that roughly 2% of the data appears to be corrupt. You
need to modify the Cloud Dataflow pipeline to filter out this corrupt data.
What should you do?
15.You have historical data covering the last three years in BigQuery and a
data pipeline that delivers new data to BigQuery daily. You have noticed
that when the Data Science team runs a query filtered on a date column
and limited to 30–90 days of data, the query scans the entire table. You
also noticed that your bill is increasing more quickly than you expected.
You want to resolve the issue as cost-effectively as possible while
maintaining the ability to conduct SQL queries. What should you do?
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.
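To make the cost effect concrete, here is a small sketch with the BigQuery Python client: a dry run reports how many bytes a date-filtered query would scan, so you can verify that partitioning on the date column actually prunes the table. The project, dataset, and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical date-partitioned table; the date filter lets BigQuery prune partitions.
query = """
SELECT item_id, SUM(quantity) AS units_sold
FROM `my-project.sales.transactions`
WHERE transaction_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY) AND CURRENT_DATE()
GROUP BY item_id
"""

dry_run_job = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True))
print(f"Bytes that would be scanned: {dry_run_job.total_bytes_processed}")
```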
16.You operate a logistics company, and you want to improve event delivery
reliability for vehicle-based sensors. You operate small data centers around
the world to capture these events, but leased lines that provide
connectivity from your event collection infrastructure to your event
processing infrastructure are unreliable, with unpredictable latency. You
want to address this issue in the most cost-effective way. What should you
do?
Have the data acquisition devices publish data to Cloud Pub/Sub. This
would provide a reliable messaging service for your event data, allowing
you to ingest and process your data in a timely manner, regardless of the
reliability of the leased lines. Cloud Pub/Sub also offers automatic retries
and fault-tolerance, which would further improve the reliability of your
event delivery.
Additionally, using Cloud Pub/Sub would allow you to easily scale your event processing infrastructure up or down as needed, which would help to minimize costs.
17.You are a retailer that wants to integrate your online sales capabilities with
different in-home assistants, such as Google Home. You need to interpret
customer voice commands and issue an order to the backend systems.
Which solutions should you choose?
18.Your company has a hybrid cloud initiative. You have a complex data
pipeline that moves data between cloud provider services and leverages
services from each of the cloud providers. Which cloud-native service
should you use to orchestrate the entire pipeline?
Cloud Composer is considered suitable across multiple cloud providers, as it is built on Apache Airflow, which allows for workflow orchestration across different cloud environments and even on-premises data centers, making it a good choice for multi-cloud strategies; however, its tightest integration is with Google Cloud Platform services.
19.You use a dataset in BigQuery for analysis. You want to provide third-party
companies with access to the same dataset. You need to keep the costs of
data sharing low and ensure that the data is current. Which solution
should you choose?
Shared datasets are collections of tables and views in BigQuery defined by
a data publisher and make up the unit of cross-project / cross-
organizational sharing. Data subscribers get an opaque, read-only, linked
dataset inside their project and VPC perimeter that they can combine with
their own datasets and connect to solutions from Google Cloud or our
partners. For example, a retailer might create a single exchange to share
demand forecasts with the thousands of vendors in their supply chain, having joined historical sales data with weather, web clickstream, and Google Trends data in their own BigQuery project, then sharing real-time outputs via Analytics Hub. The publisher can add metadata, track subscribers, and see aggregated usage metrics.
Delta tables contain all change events for a particular table since the
initial load. Having all change events available can be valuable for
identifying trends, the state of the entities that a table represents at a
particular moment, or change frequency.
The best way to merge data frequently and consistently is to use a MERGE
statement, which lets you combine multiple INSERT, UPDATE, and DELETE
statements into a single atomic operation.
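As a sketch of that pattern, the query below (run here through the BigQuery Python client, with hypothetical table and column names) merges aggregated change events from a delta table into the main balance table in one atomic statement.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical tables: change events in `inventory_delta`, current state in `inventory_balance`.
merge_sql = """
MERGE `my-project.warehouse.inventory_balance` AS target
USING (
  SELECT item_id, location_id, SUM(quantity_delta) AS quantity_delta
  FROM `my-project.warehouse.inventory_delta`
  GROUP BY item_id, location_id
) AS source
ON target.item_id = source.item_id AND target.location_id = source.location_id
WHEN MATCHED THEN
  UPDATE SET target.quantity = target.quantity + source.quantity_delta
WHEN NOT MATCHED THEN
  INSERT (item_id, location_id, quantity)
  VALUES (source.item_id, source.location_id, source.quantity_delta)
"""

client.query(merge_sql).result()  # the MERGE executes as a single atomic operation
```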
21.You are designing a data processing pipeline. The pipeline must be able to
scale automatically as load increases. Messages must be processed at
least once and must be ordered within windows of 1 hour. How should you
design the solution?
The combination of Cloud Pub/Sub for scalable ingestion and Cloud Dataflow for scalable stream processing with windowing capabilities makes option D the most appropriate solution for the given requirements. It minimizes management overhead, ensures scalability, and provides the necessary features for at-least-once processing and ordered processing within time windows.
22.You need to set access to BigQuery for different departments within your
company. Your solution should comply with the following requirements:
✑ Each department should have access only to their data.
✑ Each department will have one or more leads who need to be able to
create and update tables and provide them to their team.
✑ Each department has data analysts who need to be able to query but
not modify data.
How should you set access to the data in BigQuery?
Option B provides the most secure and appropriate solution by leveraging dataset-level access control. It adheres to the principle of least privilege, granting leads the specific permissions they need to manage their department's data (via WRITER) while allowing analysts to perform their tasks without the risk of accidental or malicious modifications (via READER). The dataset acts as a natural container for data isolation, fulfilling all the requirements outlined in the scenario.
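A short sketch of dataset-level access with the BigQuery Python client, assuming hypothetical Google Groups for one department's leads and analysts:

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-project.finance_dept")   # hypothetical department dataset

entries = list(dataset.access_entries)
entries.append(bigquery.AccessEntry(
    role="WRITER", entity_type="groupByEmail", entity_id="[email protected]"))
entries.append(bigquery.AccessEntry(
    role="READER", entity_type="groupByEmail", entity_id="[email protected]"))

dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])   # leads can create/update tables, analysts can only query
```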
23.You operate a database that stores stock trades and an application that
retrieves average stock price for a given company over an adjustable
window of time. The data is stored in Cloud Bigtable where the datetime of
the stock trade is the beginning of the row key. Your application has
thousands of concurrent users, and you notice that performance is starting
to degrade as more stocks are added. What should you do to improve the
performance of your application?
27.You decided to use Cloud Datastore to ingest vehicle telemetry data in real
time. You want to build a storage system that will account for the long-
term data growth, while keeping the costs low. You also want to create
snapshots of the data periodically, so that you can make a point-in-time
(PIT) recovery, or clone a copy of the data for Cloud Datastore in a
different environment. You want to archive these snapshots for a long
time. Which two methods can accomplish this? (Choose two.)
28.You need to create a data pipeline that copies time-series transaction data
so that it can be queried from within BigQuery by your data science team
for analysis. Every hour, thousands of transactions are updated with a new
status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per
day. The data is heavily structured, and your data science team will build
machine learning models based on this data. You want to maximize
performance and usability for your data science team. Which two
strategies should you adopt? (Choose two.)
Use nested and repeated fields to denormalize data storage and increase
query performance.
Denormalization is a common strategy for increasing read performance for
relational datasets that were previously normalized. The recommended
way to denormalize data in BigQuery is to use nested and repeated fields.
It's best to use this strategy when the relationships are hierarchical and
frequently queried together, such as in parent-child relationships.
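For illustration, a sketch of a transactions table that nests line items inside each transaction instead of normalizing them into a separate table (all names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("transaction_id", "STRING"),
    bigquery.SchemaField("transaction_time", "TIMESTAMP"),
    bigquery.SchemaField("status", "STRING"),
    # Repeated RECORD: each transaction carries its own line items, so the
    # frequently queried parent-child relationship needs no join.
    bigquery.SchemaField(
        "line_items",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("unit_price", "NUMERIC"),
        ],
    ),
]

table = bigquery.Table("my-project.sales.transactions", schema=schema)  # hypothetical table id
table.time_partitioning = bigquery.TimePartitioning(field="transaction_time")
client.create_table(table)
```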
30.You have a petabyte of analytics data and need to design a storage and
processing platform for it. You must be able to perform data warehouse-
style analytics on the data in Google Cloud and expose the dataset as files
for batch analysis tools in other cloud providers. What should you do?
The question emphasizes the need for a quick solution with low cost. While
GPUs and TPUs offer greater potential performance, they require
significant development effort (writing kernels) before they can be utilized.
Sticking with CPUs and scaling the cluster is the fastest and most cost-
effective way to improve training time immediately, given the reliance on
custom C++ ops without existing GPU/TPU kernel support.
The large discrepancy in RMSE, with the training error being higher, points
directly to underfitting. Increasing the model's complexity by adding layers
or expanding the input representation is the most appropriate strategy to
address this issue and improve the model's performance.
37.As your organization expands its usage of GCP, many teams have started
to create their own projects. Projects are further multiplied to
accommodate different stages of deployments and target audiences. Each
project requires unique access control configurations. The central IT team
needs to have access to all projects. Furthermore, data from Cloud Storage
buckets and BigQuery datasets must be shared for use in other projects in
an ad hoc way. You want to simplify access control management by
minimizing the number of policies. Which two steps should you take?
(Choose two.)
39.A data scientist has created a BigQuery ML model and asks you to create
an ML pipeline to serve predictions. You have a REST API application with
the requirement to serve predictions for an individual user ID with latency
under 100 milliseconds. You use the following query to generate
predictions: SELECT predicted_label, user_id FROM ML.PREDICT (MODEL
'dataset.model', table user_features). How should you create the ML
pipeline?
The key requirements are serving predictions for individual user IDs with
low (sub-100ms) latency.
Option D meets this by batch predicting for all users in BigQuery ML,
writing predictions to Bigtable for fast reads, and allowing the application
access to query Bigtable directly for low latency reads.
Since the application needs to serve low-latency predictions for individual
user IDs, using Dataflow to batch predict for all users and write to Bigtable
allows low-latency reads. Granting the Bigtable Reader role allows the
application to retrieve predictions for a specific user ID from Bigtable.
41.You are building a new application that you need to collect data from in a
scalable way. Data arrives continuously from the application throughout
the day, and you expect to generate approximately 150 GB of JSON data
per day by the end of the year. Your requirements are:
✑ Decoupling producer from consumer
✑ Space and cost-efficient storage of the raw ingested data, which is to be
stored indefinitely
✑ Near real-time SQL query
✑ Maintain at least 2 years of historical data, which will be queried with
SQL
Which pipeline should you use to meet these requirements?
The most effective way to address the performance issue in this Dataflow
pipeline is to increase the processing capacity by either adding more
workers (horizontal scaling) or using more powerful workers (vertical
scaling). Both options A and B directly address the identified CPU
bottleneck and are the most appropriate solutions.
43.You have a data pipeline with a Dataflow job that aggregates and writes
time series metrics to Bigtable. You notice that data is slow to update in
Bigtable. This data feeds a dashboard used by thousands of users across
the organization. You need to support additional concurrent users and
reduce the amount of time required to write the data. Which two actions
should you take? (Choose two.)
To improve the data pipeline's performance and address the slow updates
in Bigtable, the most effective solutions are to increase the processing
power of the Dataflow job (by adding workers) and increase the capacity
of the Bigtable cluster (by adding nodes). Both options B and C directly
target the potential bottlenecks and are the most appropriate actions to
take.
44.You have several Spark jobs that run on a Cloud Dataproc cluster on a
schedule. Some of the jobs run in sequence, and some of the jobs run
concurrently. You need to automate this process. What should you do?
For orchestrating Spark jobs on Dataproc with specific sequencing and
concurrency requirements, Cloud Composer with Airflow DAGs provides
the most flexible, scalable, and manageable solution. It allows you to
define dependencies, schedule execution, and monitor the entire workflow
in a centralized and reliable manner.
45.You are building a new data pipeline to share data between two different
types of applications: jobs generators and job runners. Your solution must
scale to accommodate increases in usage and must accommodate the
addition of new applications without negatively affecting the performance
of existing ones. What should you do?
Cloud Pub/Sub is the best solution for this data pipeline because it
provides the necessary decoupling, scalability, and extensibility to meet
the requirements. It enables independent scaling of job generators and
runners, simplifies the addition of new applications, and ensures reliable
message delivery.
47.You need to create a new transaction table in Cloud Spanner that stores
product sales data. You are deciding what to use as a primary key. From a
performance perspective, which strategy should you choose?
48.Data Analysts in your company have the Cloud IAM Owner role assigned to
them in their projects to allow them to work with multiple GCP products in
their projects. Your organization requires that all BigQuery data access
logs be retained for 6 months. You need to ensure that only audit
personnel in your company can access the data access logs for all
projects. What should you do?
52.You receive data files in CSV format monthly from a third party. You need
to cleanse this data, but every third month the schema of the files
changes. Your requirements for implementing these transformations
include:
✑ Executing the transformations on a schedule
✑ Enabling non-developer analysts to modify transformations
✑ Providing a graphical tool for designing transformations
What should you do?
Dataprep by Trifacta is an intelligent data service for visually exploring,
cleaning, and preparing structured and unstructured data for analysis,
reporting, and machine learning. Because Dataprep is serverless and
works at any scale, there is no infrastructure to deploy or manage. Your
next ideal data transformation is suggested and predicted with each UI
input, so you don’t have to write code.
The most efficient ways to start using Hive in Cloud Dataproc with ORC
files already in Cloud Storage are:
1. Copy to HDFS using gsutil and Hadoop tools for maximum
performance.
2. Use the Cloud Storage connector for initial access, then replicate
key data to HDFS for optimized performance.
Both options allow you to leverage the benefits of having data in the local
HDFS for improved Hive query performance. Option D provides more
flexibility by allowing you to choose what data to replicate based on your
needs.
55.You work for a shipping company that has distribution centers where
packages move on delivery lines to route them properly. The company
wants to add cameras to the delivery lines to detect and track any visual
damage to the packages in transit. You need to create a way to automate
the detection of damaged packages and flag them for human review in
real time while the packages are in transit. Which solution should you
choose?
For this scenario, where you need to automate the detection of damaged
packages in real time while they are in transit, the most suitable solution
among the provided options would be B.
Here's why this option is the most appropriate:
Real-Time Analysis: AutoML provides the capability to train a custom
model specifically tailored to recognize patterns of damage in packages.
This model can process images in real-time, which is essential in your
scenario.
Integration with Existing Systems: By building an API around the AutoML
model, you can seamlessly integrate this solution with your existing
package tracking applications. This ensures that the system can flag
damaged packages for human review efficiently.
Customization and Accuracy: Since the model is trained on your specific
corpus of images, it can be more accurate in detecting damages relevant
to your use case compared to pre-trained models.
56.You are migrating your data warehouse to BigQuery. You have migrated all
of your data into tables in a dataset. Multiple users from your organization
will be using the data. They should only see certain tables based on their
team membership. How should you set user permissions?
The simplest and most effective way to control user access to specific
tables in BigQuery is to assign the bigquery.dataViewer role (or a custom
role) at the table level. This provides the necessary granular control, is
easy to manage, and scales well.
57.You need to store and analyze social media postings in Google BigQuery at
a rate of 10,000 messages per minute in near real-time. You initially design the application to use streaming inserts for individual postings. Your
application also performs data aggregations right after the streaming
inserts. You discover that the queries after streaming inserts do not exhibit
strong consistency, and reports from the queries might miss in-flight data.
How can you adjust your application design?
Option D provides the most practical and efficient way to address the
consistency issues with BigQuery streaming inserts while maintaining near
real-time data availability. It leverages the benefits of streaming inserts for
high-volume data ingestion and balances data freshness with accuracy by
waiting for a period based on estimated latency.
58.You want to build a managed Hadoop system as your data lake. The data
transformation process is composed of a series of Hadoop jobs executed in
sequence. To accomplish the design of separating storage from compute,
you decided to use the Cloud Storage connector to store all input data,
output data, and intermediary data. However, you noticed that one
Hadoop job runs very slowly with Cloud Dataproc, when compared with
the on-premises bare-metal Hadoop environment (8-core nodes with 100-
GB RAM). Analysis shows that this particular Hadoop job is disk I/O
intensive. You want to resolve the issue. What should you do?
The most effective way to resolve the performance issue for a disk I/O
intensive Hadoop job in Cloud Dataproc is to allocate sufficient persistent
disks and store the intermediate data on the local HDFS. This reduces
network overhead and allows the job to access data much faster,
improving overall performance.
BigQuery is the most appropriate storage solution for this use case due to
its scalability, geospatial processing capabilities, high-speed ingestion,
machine learning integration, and suitability for dashboard creation. It
directly addresses all the key requirements for storing, processing, and
analyzing the ship telemetry data to predict delivery delays.
61.You operate an IoT pipeline built around Apache Kafka that normally
receives around 5000 messages per second. You want to use Google Cloud
Platform to create an alert as soon as the moving average over 1 hour
drops below 4000 messages per second. What should you do?
Dataflow with Sliding Time Windows: Dataflow allows you to work with
event-time windows, making it suitable for time-series data like incoming
IoT messages. Using sliding windows every 5 minutes allows you to
compute moving averages efficiently.
Sliding Time Window: The sliding time window of 1 hour every 5 minutes
enables you to calculate the moving average over the specified time
frame.
Computing Averages: You can efficiently compute the average when each
sliding window closes. This approach ensures that you have real-time
visibility into the message rate and can detect deviations from the
expected rate.
Alerting: When the calculated average drops below 4000 messages per
second, you can trigger an alert from within the Dataflow pipeline, sending
it to your desired alerting mechanism, such as Cloud Monitoring, Pub/Sub,
or another notification service.
Scalability: Dataflow can scale automatically based on the incoming data
volume, ensuring that you can handle the expected rate of 5000
messages per second.
62.You plan to deploy Cloud SQL using MySQL. You need to ensure high
availability in the event of a zone failure. What should you do?
69.You work for a mid-sized enterprise that needs to move its operational
system transaction data from an on-premises database to GCP. The
database is about 20 TB in size. Which database should you choose?
While Cloud SQL is a fully managed service that scales up automatically and supports SQL queries, it does not inherently guarantee transactional consistency or the ability to scale up to 6 TB for all its database engines.
70.You need to choose a database to store time series CPU and memory
usage for millions of computers. You need to store this data in one-second
interval samples. Analysts will be performing real-time, ad hoc analytics
against the database. You want to avoid being charged for every query
executed and ensure that the schema design will allow for future growth of
the dataset. Which database and data model should you choose?
Bigtable with a narrow table design is the most suitable solution for this
scenario. It provides the scalability, low-latency reads, cost-effectiveness,
and schema flexibility needed to store and analyze time series data from
millions of computers. The narrow table model ensures efficient storage
and retrieval of data, while Bigtable's pricing model avoids per-query charges.
71.You want to archive data in Cloud Storage. Because some data is very
sensitive, you want to use the `Trust No One` (TNO) approach to encrypt
your data to prevent the cloud provider staff from decrypting your data.
What should you do?
Additional authenticated data (AAD) is any string that you pass to Cloud
Key Management Service as part of an encrypt or decrypt request. AAD is
used as an integrity check and can help protect your data from a confused
deputy attack. The AAD string must be no larger than 64 KiB.
Cloud KMS will not decrypt ciphertext unless the same AAD value is used
for both encryption and decryption.
AAD is bound to the encrypted data, because you cannot decrypt the
ciphertext unless you know the AAD, but it is not stored as part of the
ciphertext. AAD also does not increase the cryptographic strength of the
ciphertext. Instead it is an additional check by Cloud KMS to authenticate
a decryption request.
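A minimal sketch of AAD with the Cloud KMS Python client; the key ring, key, and AAD string are hypothetical. Decryption only succeeds when the same AAD value is presented.

```python
from google.cloud import kms

client = kms.KeyManagementServiceClient()
key_name = client.crypto_key_path("my-project", "us-central1", "archive-ring", "archive-key")  # hypothetical key

plaintext = b"sensitive archive payload"
aad = b"gs://archive-bucket/2024/backup-001"  # binds the ciphertext to this context

encrypted = client.encrypt(
    request={"name": key_name, "plaintext": plaintext, "additional_authenticated_data": aad}
)

# Supplying a different AAD value here would make the decrypt call fail.
decrypted = client.decrypt(
    request={"name": key_name, "ciphertext": encrypted.ciphertext, "additional_authenticated_data": aad}
)
assert decrypted.plaintext == plaintext
```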
72.You have data pipelines running on BigQuery, Dataflow, and Dataproc. You
need to perform health checks and monitor their behavior, and then notify
the team managing the pipelines if they fail. You also need to be able to
work across multiple projects. Your preference is to use managed products
or features of the platform. What should you do?
74.You work for a large bank that operates in locations throughout North
America. You are setting up a data storage system that will handle bank
account transactions. You require ACID compliance and the ability to
access data with SQL. Which solution is appropriate?
Since the banking transaction system requires ACID compliance and SQL
access to the data, Cloud Spanner is the most appropriate solution. Unlike
Cloud SQL, Cloud Spanner natively provides ACID transactions and
horizontal scalability.
Enabling stale reads in Spanner (option A) would reduce data consistency,
violating the ACID compliance requirement of banking transactions.
BigQuery (option C) does not natively support ACID transactions or SQL
writes which are necessary for a banking transactions system.
Cloud SQL (option D) provides ACID compliance but does not scale
horizontally like Cloud Spanner can to handle large transaction volumes.
By using Cloud Spanner, and specifically its locking read-write transactions, ACID compliance is ensured while providing fast, horizontally scalable SQL processing of banking transactions.
When you want to move your Apache Spark workloads from an on-
premises environment to Google Cloud, we recommend using Dataproc to
run Apache Spark/Apache Hadoop clusters. Dataproc is a fully managed,
fully supported service offered by Google Cloud. It allows you to separate
storage and compute, which helps you to manage your costs and be more
flexible in scaling your workloads.
https://cloud.google.com/bigquery/docs/migration/hive#data_migration
Migrating Hive data from your on-premises or other cloud-based source
cluster to BigQuery has two steps:
1. Copying data from a source cluster to Cloud Storage
2. Loading data from Cloud Storage into BigQuery
77.You work for a financial institution that lets customers register online. As
new customers register, their user data is sent to Pub/Sub before being
ingested into BigQuery. For security reasons, you decide to redact your
customers' Government issued Identification Number while allowing
customer service representatives to view the original values when
necessary. What should you do?
Before loading the data into BigQuery, use Cloud Data Loss Prevention
(DLP) to replace input values with a cryptographic format-preserving
encryption token.
The key reasons are:
DLP allows redacting sensitive PII like SSNs before loading into BigQuery.
This provides security by default for the raw SSN values.
Using format-preserving encryption keeps the column format intact while
still encrypting, allowing business logic relying on SSN format to continue
functioning.
The encrypted tokens can be reversed to view original SSNs when
required, meeting the access requirement for customer service reps.
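A sketch of format-preserving tokenization with the DLP API, applied here to a structured field; the project, KMS key, wrapped key bytes, and column name are hypothetical placeholders.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
PROJECT = "my-project"  # hypothetical

fpe_transformation = {
    "crypto_replace_ffx_fpe_config": {
        "crypto_key": {
            "kms_wrapped": {
                "wrapped_key": b"<key wrapped by Cloud KMS>",  # placeholder bytes
                "crypto_key_name": "projects/my-project/locations/global/keyRings/dlp-ring/cryptoKeys/dlp-key",
            }
        },
        "common_alphabet": "NUMERIC",  # digits map to digits, so the column format is preserved
    }
}

deidentify_config = {
    "record_transformations": {
        "field_transformations": [
            {"fields": [{"name": "gov_id"}], "primitive_transformation": fpe_transformation}
        ]
    }
}

item = {
    "table": {
        "headers": [{"name": "gov_id"}],
        "rows": [{"values": [{"string_value": "123456789"}]}],
    }
}

response = dlp.deidentify_content(
    request={"parent": f"projects/{PROJECT}", "deidentify_config": deidentify_config, "item": item}
)
# Passing the same configuration to reidentify_content reverses the tokens for
# authorized customer service representatives.
print(response.item.table.rows[0].values[0].string_value)
```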
78.You are migrating a table to BigQuery and are deciding on the data model.
Your table stores information related to purchases made across several
store locations and includes information like the time of the transaction,
items purchased, the store ID, and the city and state in which the store is
located. You frequently query this table to see how many of each item
were sold over the past 30 days and to look at purchasing trends by state,
city, and individual store. How would you model this table for the best
query performance?
Cloud Dataproc allows you to run Apache Hadoop jobs with minimal
management. It is a managed Hadoop service.
Using the Google Cloud Storage (GCS) connector, Dataproc can access
data stored in GCS, which allows data persistence beyond the life of the
cluster. This means that even if the cluster is deleted, the data in GCS
remains intact. Moreover, using GCS is often cheaper and more durable
than using HDFS on persistent disks.
80.You are updating the code for a subscriber to a Pub/Sub feed. You are
concerned that upon deployment the subscriber may erroneously
acknowledge messages, leading to message loss. Your subscriber is not
set up to retain acknowledged messages. What should you do to ensure
that you can recover from errors after deployment?
81.You work for a large real estate firm and are preparing 6 TB of home sales
data to be used for machine learning. You will use SQL to transform the
data and use BigQuery ML to create a machine learning model. You plan to
use the model for predictions against a raw dataset that has not been
transformed. How should you set up your workflow in order to prevent
skew at prediction time?
82.You are analyzing the price of a company's stock. Every 5 seconds, you
need to compute a moving average of the past 30 seconds' worth of data.
You are reading data from Pub/Sub and using DataFlow to conduct the
analysis. How should you set up your windowed pipeline?
Since you need to compute a moving average of the past 30 seconds'
worth of data every 5 seconds, a sliding window is appropriate. A sliding
window allows overlapping intervals and is well-suited for computing
rolling aggregates.
Window Duration: The window duration should be set to 30 seconds to
cover the required 30 seconds' worth of data for the moving average
calculation.
Window Period: The window period or sliding interval should be set to 5
seconds to move the window every 5 seconds and recalculate the moving
average with the latest data.
Trigger: The trigger should be set to AfterWatermark.pastEndOfWindow()
to emit the computed moving average results when the watermark
advances past the end of the window. This ensures that all data within the
window is considered before emitting the result.
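A minimal Apache Beam sketch of that windowing setup, assuming each Pub/Sub message carries a single price as text (the topic name is hypothetical):

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadTrades" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/trades")  # hypothetical topic
        | "ParsePrice" >> beam.Map(lambda msg: float(msg.decode("utf-8")))
        | "SlidingWindow" >> beam.WindowInto(
            window.SlidingWindows(size=30, period=5),        # 30 s windows, starting every 5 s
            trigger=trigger.AfterWatermark(),                # emit when the watermark passes the window end
            accumulation_mode=trigger.AccumulationMode.DISCARDING,
        )
        | "MovingAverage" >> beam.CombineGlobally(beam.combiners.MeanCombineFn()).without_defaults()
        | "Emit" >> beam.Map(print)
    )
```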
84.You work for a large financial institution that is planning to use Dialogflow
to create a chatbot for the company's mobile app. You have reviewed old
chat logs and tagged each conversation for intent based on each
customer's stated intention for contacting customer service. About 70% of
customer requests are simple requests that are solved within 10 intents.
The remaining 30% of inquiries require much longer, more complicated
requests. Which intents should you automate first?
This is the best approach because it follows the Pareto principle (80/20
rule). By automating the most common 10 intents that address 70% of
customer requests, you free up the live agents to focus their time and
effort on the more complex 30% of requests that likely require human
insight/judgement. Automating the simpler high-volume requests first
allows the chatbot to handle those easily, efficiently routing only the
trickier cases to agents. This makes the best use of automation for high-
volume simple cases and human expertise for lower-volume complex
issues.
87.You want to rebuild your batch pipeline for structured data on Google
Cloud. You are using PySpark to conduct data transformations at scale, but
your pipelines are taking over twelve hours to run. To expedite
development and pipeline run time, you want to use a serverless tool and
SQL syntax. You have already moved your raw data into Cloud Storage.
How should you build the pipeline on Google Cloud while meeting speed
and processing requirements?
The core issue is the use of SideInputs for joining data, leading to
materialization and replication overhead. CoGroupByKey provides a more
efficient, parallel approach to join operations in Dataflow by avoiding
materialization and reducing replication. Therefore, switching to
CoGroupByKey is the most effective way to expedite the Dataflow job in
this scenario.
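A small illustration of the CoGroupByKey pattern, with in-memory test data standing in for the real inputs that were previously joined via a side input:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Hypothetical keyed inputs; in the real pipeline these come from the two
    # sources that were previously combined with a side input.
    orders = p | "Orders" >> beam.Create([("u1", {"order_id": 1}), ("u2", {"order_id": 2})])
    profiles = p | "Profiles" >> beam.Create([("u1", {"country": "US"}), ("u2", {"country": "DE"})])

    (
        {"orders": orders, "profiles": profiles}
        # Groups both inputs by key in parallel; nothing is materialized and
        # broadcast to every worker the way a side input is.
        | "JoinByUser" >> beam.CoGroupByKey()
        | "Format" >> beam.MapTuple(
            lambda user_id, grouped: {
                "user_id": user_id,
                "orders": list(grouped["orders"]),
                "profiles": list(grouped["profiles"]),
            }
        )
        | "Print" >> beam.Map(print)
    )
```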
89.You are building a real-time prediction engine that streams files, which
may contain PII (personal identifiable information) data, into Cloud Storage
and eventually into BigQuery. You want to ensure that the sensitive data is
masked but still maintains referential integrity, because names and emails
are often used as join keys. How should you use the Cloud Data Loss
Prevention API (DLP API) to ensure that the PII data is not accessible by
unauthorized individuals?
91.You are migrating an application that tracks library books and information
about each book, such as author or year published, from an on-premises
data warehouse to BigQuery. In your current relational database, the
author information is kept in a separate table and joined to the book
information on a common key. Based on Google's recommended practice
for schema design, how would you structure the data to ensure optimal
speed of queries about the author of each book that has been borrowed?
92.You need to give new website users a globally unique identifier (GUID)
using a service that takes in data points and returns a GUID. This data is
sourced from both internal and external systems via HTTP calls that you
will make via microservices within your pipeline. There will be tens of
thousands of messages per second that can be multi-threaded, and
you worry about the backpressure on the system. How should you design
your pipeline to minimize that backpressure?
Option D is the best approach to minimize backpressure in this scenario.
By batching the jobs into 10-second increments, you can throttle the rate
at which requests are made to the external GUID service. This prevents
too many simultaneous requests from overloading the service.
Considering the requirement for handling large files and the need for real-
time data integration, Option C (gsutil for the migration; Pub/Sub and
Dataflow for the real-time updates) seems to be the most appropriate.
gsutil will effectively handle the large file transfers, while Pub/Sub and
Dataflow provide a robust solution for real-time data capture and
processing, ensuring continuous updates to your warehouse on Google
Cloud.
94.You are using Bigtable to persist and serve stock market data for each of
the major indices. To serve the trading application, you need to access
only the most recent stock prices that are streaming in. How should you
design your row key and tables to ensure that you can access the data
with the simplest query?
A single table for all indices keeps the structure simple.
Using a reverse timestamp as part of the row key ensures that the most
recent data comes first in the sorted order. This design is beneficial for
quickly accessing the latest data.
For example, you can compute a reverse timestamp by subtracting the event timestamp from a large maximum value (such as Long.MAX_VALUE) and zero-padding the result, ensuring newer rows sort lexicographically before older ones.
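A small sketch of a reverse-timestamp row key; the index name and key layout are illustrative.

```python
import datetime

# Any constant comfortably larger than epoch milliseconds for the foreseeable future.
MAX_TS_MS = 10 ** 13


def reversed_row_key(index: str, event_time: datetime.datetime) -> bytes:
    """Builds a key like b'SP500#8254...' in which newer events sort first."""
    ts_ms = int(event_time.timestamp() * 1000)
    reverse_ts = MAX_TS_MS - ts_ms
    # Zero-pad so that lexicographic order matches numeric order.
    return f"{index}#{reverse_ts:013d}".encode("utf-8")


# With this layout, the most recent price for an index is the first row returned
# by a prefix scan on b'SP500#' with a limit of 1.
print(reversed_row_key("SP500", datetime.datetime.now(datetime.timezone.utc)))
```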
95.You are building a report-only data warehouse where the data is streamed
into BigQuery via the streaming API. Following Google's best practices, you
have both a staging and a production table for the data. How should you
design your data loading to ensure that there is only one master dataset
without affecting performance on either the ingestion or reporting pieces?
96.You issue a new batch job to Dataflow. The job starts successfully,
processes a few elements, and then suddenly fails and shuts down. You
navigate to the Dataflow monitoring interface where you find errors
related to a particular DoFn in your pipeline. What is the most likely cause
of the errors?
While your job is running, you might encounter errors or exceptions in your
worker code. These errors generally mean that the DoFns in your pipeline
code have generated unhandled exceptions, which result in failed tasks in
your Dataflow job.
Exceptions in user code (for example, your DoFn instances) are reported in
the Dataflow monitoring interface.
97.Your new customer has requested daily reports that show their net
consumption of Google Cloud compute resources and who used the
resources. You need to quickly and efficiently generate these daily reports.
What should you do?
98.The Development and External teams have the project viewer Identity and
Access Management (IAM) role in a folder named Visualization. You want
the Development Team to be able to read data from both Cloud Storage
and BigQuery, but the External Team should only be able to read data from
BigQuery. What should you do?
Development team: needs to access both Cloud Storage and BQ ->
therefore we put the Development team inside a perimeter so it can
access both the Cloud Storage and the BQ
External team: allowed to access only BQ -> therefore we add Cloud Storage to the perimeter's restricted services and leave the External Team outside of the perimeter, so it can access BQ but is prohibited from accessing Cloud Storage.
99.Your startup has a web application that currently serves customers out of a
single region in Asia. You are targeting funding that will allow your startup
to serve customers globally. Your current goal is to optimize for cost, and
your post-funding goal is to optimize for global presence and performance.
You must use a native JDBC driver. What should you do?
This option allows for optimization for cost initially with a single region
Cloud Spanner instance, and then optimization for global presence and
performance after funding with multi-region instances.
Cloud Spanner supports native JDBC drivers and is horizontally scalable,
providing very high performance. A single region instance minimizes costs
initially. After funding, multi-region instances can provide lower latency
and high availability globally.
Cloud SQL does not scale as well and has higher costs for multiple high
availability regions. Bigtable does not support JDBC drivers natively.
Therefore, Spanner is the best choice here for optimizing both for cost
initially and then performance and availability globally post-funding.
102. You are loading CSV files from Cloud Storage to BigQuery. The files
have known data quality issues, including mismatched data types, such as
STRINGs and INT64s in the same column, and inconsistent formatting of
values such as phone numbers or addresses. You need to create the data
pipeline to maintain data quality and perform the required cleansing and
transformation. What should you do?
Data Fusion is the best choice for this scenario because it provides a
comprehensive platform for building and managing data pipelines,
including data quality features and pre-built transformations for handling
the specific data issues in your CSV files. It simplifies the process and
reduces the amount of manual coding required compared to using SQL-
based approaches.
103. You are developing a new deep learning model that predicts a
customer's likelihood to buy on your ecommerce site. After running an
evaluation of the model against both the original training data and new
test data, you find that your model is overfitting the data. You want to
improve the accuracy of the model when predicting new data. What
should you do?
To improve the accuracy of a model that's overfitting, the most effective
strategies are to:
Increase the amount of training data: This helps the model learn
more generalizable patterns.
Decrease the number of input features: This helps the model focus on
the most relevant information and avoid learning noise.
Therefore, option B is the most suitable approach to address overfitting
and improve the model's accuracy on new data.
Option D provides the most efficient and streamlined approach for this
scenario. By using an Apache Beam custom connector with Dataflow and
Avro format, you can directly read, transform, and stream the proprietary
data into BigQuery while minimizing resource consumption and
maximizing performance.
106. An online brokerage company requires a high volume trade
processing architecture. You need to create a secure queuing system that
triggers jobs. The jobs will run in Google Cloud and call the company's
Python API to execute trades. You need to efficiently implement a solution.
What should you do?
108. You have 15 TB of data in your on-premises data center that you
want to transfer to Google Cloud. Your data changes weekly and is stored
in a POSIX-compliant source. The network operations team has granted
you 500 Mbps bandwidth to the public internet. You want to follow Google-
recommended practices to reliably transfer your data to Google Cloud on a
weekly basis. What should you do?
Like gsutil, Storage Transfer Service for on-premises data enables transfers
from network file system (NFS) storage to Cloud Storage. Although gsutil
can support small transfer sizes (up to 1 TB), Storage Transfer Service for
on-premises data is designed for large-scale transfers (up to petabytes of
data, billions of files).
Cloud SQL for PostgreSQL provides full ACID compliance, unlike Bigtable
which provides only atomicity and consistency guarantees.
Enabling high availability removes the need for manual failover as Cloud
SQL will automatically failover to a standby replica if the leader instance
goes down.
Point-in-time recovery in MySQL requires manual intervention to restore
data if needed.
BigQuery does not provide transactional guarantees required for an ACID
database.
Therefore, a Cloud SQL for PostgreSQL instance with high availability
meets the ACID and minimal intervention requirements best. The
automatic failover will ensure availability and uptime without
administrative effort.
111. You are using BigQuery and Data Studio to design a customer-facing
dashboard that displays large quantities of aggregated data. You expect a
high volume of concurrent users. You need to optimize the dashboard to
provide quick visualizations with minimal latency. What should you do?
This approach allows the model to benefit from both the historical data
(existing data) and the new data, ensuring that it adapts to changing
preferences while retaining knowledge from the past. By combining both
types of data, the model can learn to make recommendations that are up-
to-date and relevant to users' evolving preferences.
113. You work for a car manufacturer and have set up a data pipeline
using Google Cloud Pub/Sub to capture anomalous sensor events. You are
using a push subscription in Cloud Pub/Sub that calls a custom HTTPS
endpoint that you have created to take action of these anomalous events
as they occur. Your custom HTTPS endpoint keeps getting an inordinate
amount of duplicate messages. What is the most likely cause of these
duplicate messages?
The import and export feature uses the native RDB snapshot feature of
Redis to import data into or export data out of a Memorystore for Redis
instance. The use of the native RDB format prevents lock-in and makes it
very easy to move data within Google Cloud or outside of Google Cloud.
Import and export uses Cloud Storage buckets to store RDB files.
120. You need ads data to serve AI models and historical data for
analytics. Longtail and outlier data points need to be identified. You want
to cleanse the data in near-real time before running it through AI models.
What should you do?
121. You are collecting IoT sensor data from millions of devices across
the world and storing the data in BigQuery. Your access pattern is based
on recent data, filtered by location_id and device_version with the
following query:
You want to optimize your queries for cost and performance. How should
you structure your data?
Partitioning by create_date:
Aligns with query pattern: Filters for recent data based on create_date, so
partitioning by this column allows BigQuery to quickly narrow down the
data to scan, reducing query costs and improving performance.
Manages data growth: Partitioning effectively segments data by date,
making it easier to manage large datasets and optimize storage costs.
Clustering by location_id and device_version:
Enhances filtering: Because queries frequently filter by location_id and device_version, clustering physically co-locates related data within partitions, further reducing scan time and improving performance.
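A sketch of the corresponding DDL, run through the BigQuery client; the table, dataset, and column names mirror the query pattern in the question but are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE `my-project.iot.sensor_readings`
PARTITION BY DATE(create_date)
CLUSTER BY location_id, device_version
AS SELECT * FROM `my-project.iot.sensor_readings_staging`
"""

client.query(ddl).result()
```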
122. A live TV show asks viewers to cast votes using their mobile phones.
The event generates a large volume of data during a 3-minute period. You
are in charge of the "Voting infrastructure" and must ensure that the
platform can handle the load and that all votes are processed. You must
display partial results while voting is open. After voting closes, you need to
count the votes exactly once while optimizing cost. What should you do?
126. You are using BigQuery with a multi-region dataset that includes a
table with the daily sales volumes. This table is updated multiple times per
day. You need to protect your sales table in case of regional failures with a
recovery point objective (RPO) of less than 24 hours, while keeping costs
to a minimum. What should you do?
127. You are troubleshooting your Dataflow pipeline that processes data
from Cloud Storage to BigQuery. You have discovered that the Dataflow
worker nodes cannot communicate with one another. Your networking
team relies on Google Cloud network tags to define firewall rules. You need
to identify the issue while following Google-recommended networking
security practices. What should you do?
How should you redesign the BigQuery table to support faster access?
- Create a copy of the necessary tables into a new dataset that doesn't use
CMEK, ensuring the data is accessible without requiring the partner to
have access to the encryption key.
- Analytics Hub can then be used to share this data securely and efficiently
with the partner organization, maintaining control and governance over
the shared data.
131. You are developing an Apache Beam pipeline to extract data from a
Cloud SQL instance by using JdbcIO. You have two projects running in
Google Cloud. The pipeline will be deployed and executed on Dataflow in
Project A. The Cloud SQL. instance is running in Project B and does not
have a public IP address. After deploying the pipeline, you noticed that the
pipeline failed to extract data from the Cloud SQL instance due to
connection failure. You verified that VPC Service Controls and shared VPC
are not in use in these projects. You want to resolve this error while
ensuring that the data does not go through the public internet. What
should you do?
132. You have a BigQuery table that contains customer data, including
sensitive information such as names and addresses. You need to share the
customer data with your data analytics and consumer support teams
securely. The data analytics team needs to access the data of all the
customers, but must not be able to access the sensitive data. The
consumer support team needs access to all data columns, but must not be
able to access customers that no longer have active contracts. You
enforced these requirements by using an authorized dataset and policy
tags. After implementing these steps, the data analytics team reports that
they still have access to the sensitive columns. You need to ensure that
the data analytics team does not have access to restricted data. What
should you do? (Choose two.)
The two best answers are D and E. You need to both enforce the policy
tags (E) and remove the broad data viewing permission (D) to effectively
restrict the data analytics team's access to sensitive information. This
combination ensures that the policy tags are actually enforced and that
the team lacks the underlying permissions to bypass those restrictions.
133. You have a Cloud SQL for PostgreSQL instance in Region1 with one
read replica in Region2 and another read replica in Region3. An
unexpected event in Region1 requires that you perform disaster recovery
by promoting a read replica in Region2. You need to ensure that your
application has the same database capacity available before you switch
over the connections. What should you do?
134. You orchestrate ETL pipelines by using Cloud Composer. One of the
tasks in the Apache Airflow directed acyclic graph (DAG) relies on a third-
party service. You want to be notified when the task does not succeed.
What should you do?
Direct Trigger:
The on_failure_callback parameter is specifically designed to invoke a
function when a task fails, ensuring immediate notification.
Customizable Logic:
You can tailor the notification function to send emails, create alerts, or
integrate with other notification systems, providing flexibility.
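A minimal Airflow sketch of the callback wiring; the DAG and task names are hypothetical, and the notification body is just a print standing in for an email or alerting call.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # The context carries the failed task instance, run date, and the exception.
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed: {context.get('exception')}")
    # Replace the print with an email, chat, or Cloud Monitoring notification.


def call_third_party_service():
    raise RuntimeError("third-party service unavailable")  # simulated failure


with DAG(
    dag_id="etl_with_failure_alerts",      # hypothetical DAG
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="call_third_party",
        python_callable=call_third_party_service,
        on_failure_callback=notify_on_failure,  # invoked only when the task ends in failure
        retries=1,
        retry_delay=timedelta(minutes=5),
    )
```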
135. Your company has hired a new data scientist who wants to perform
complicated analyses across very large datasets stored in Google Cloud
Storage and in a Cassandra cluster on Google Compute Engine. The
scientist primarily wants to create labelled data sets for machine learning
projects, along with some visualization tasks. She reports that her laptop is
not powerful enough to perform her tasks and it is slowing her down. You
want to help her perform her tasks. What should you do?
137. You store and analyze your relational data in BigQuery on Google
Cloud with all data that resides in US regions. You also have a variety of
object stores across Microsoft Azure and Amazon Web Services (AWS), also
in US regions. You want to query all your data in BigQuery daily with as
little movement of data as possible. What should you do?
138. You have a variety of files in Cloud Storage that your data science
team wants to use in their models. Currently, users do not have a method
to explore, cleanse, and validate the data in Cloud Storage. You are
looking for a low code solution that can be used by your data science team
to quickly cleanse and explore data within Cloud Storage. What should you
do?
Dataprep is the most suitable option because it's a low-code tool
specifically designed for data exploration, cleansing, and validation
directly within Cloud Storage. It aligns perfectly with the requirements
outlined in the problem statement.
139. You are building an ELT solution in BigQuery by using Dataform. You
need to perform uniqueness and null value checks on your final tables.
What should you do to efficiently integrate these checks into your
pipeline?
142. Your organization has two Google Cloud projects, project A and
project B. In project A, you have a Pub/Sub topic that receives data from
confidential sources. Only the resources in project A should be able to
access the data in that topic. You want to ensure that project B and any
future project cannot access data in the project A topic. What should you
do?
- It allows us to create a secure boundary around all resources in Project A,
including the Pub/Sub topic.
- It prevents data exfiltration to other projects and ensures that only
resources within the perimeter (Project A) can access the sensitive data.
- VPC Service Controls are specifically designed for scenarios where you
need to secure sensitive data within a specific context or boundary in
Google Cloud.
143. You stream order data by using a Dataflow pipeline, and write the
aggregated result to Memorystore. You provisioned a Memorystore for
Redis instance with Basic Tier, 4 GB capacity, which is used by 40 clients
for read-only access. You are expecting the number of read-only clients to
increase significantly to a few hundred and you need to be able to support
the demand. You want to ensure that read and write access availability is
not impacted, and any changes you make can be deployed quickly. What
should you do?
144. You have a streaming pipeline that ingests data from Pub/Sub in
production. You need to update this streaming pipeline with improved
business logic. You need to ensure that the updated pipeline reprocesses
the previous two days of delivered Pub/Sub messages. What should you
do? (Choose two.)
D&E
Both retain-acked-messages and Seek are required to achieve the
desired reprocessing. retain-acked-messages keeps the messages
available, and Seek allows the updated pipeline to rewind and read those
messages again. They are complementary functionalities that solve
different parts of the problem.
145. You currently use a SQL-based tool to visualize your data stored in
BigQuery. The data visualizations require the use of outer joins and
analytic functions. Visualizations must be based on data that is no less
than 4 hours old. Business users are complaining that the visualizations
are too slow to generate. You want to improve the performance of the
visualization queries while minimizing the maintenance overhead of the
data preparation pipeline. What should you do?
146. You are deploying 10,000 new Internet of Things devices to collect
temperature data in your warehouses globally. You need to process, store
and analyze these very large datasets in real time. What should you do?
Google Cloud Pub/Sub allows for efficient ingestion and real-time data
streaming.
Google Cloud Dataflow can process and transform the streaming data in
real-time.
Google BigQuery is a fully managed, highly scalable data warehouse that
is well-suited for real-time analysis and querying of large datasets.
147. You need to modernize your existing on-premises data strategy. Your
organization currently uses:
• Apache Hadoop clusters for processing multiple large data sets,
including on-premises Hadoop Distributed File System (HDFS) for data
replication.
• Apache Airflow to orchestrate hundreds of ETL pipelines with thousands
of job steps.
You need to set up a new architecture in Google Cloud that can handle
your Hadoop workloads and requires minimal changes to your existing
orchestration processes. What should you do?
148. You recently deployed several data processing jobs into your Cloud
Composer 2 environment. You notice that some tasks are failing in Apache
Airflow. On the monitoring dashboard, you see an increase in the total
workers memory usage, and there were worker pod evictions. You need to
resolve these errors. What should you do? (Choose two.)
Both increasing worker memory (D) and increasing the Cloud Composer
environment size (B) are crucial for solving the problem. The environment
size provides the necessary resources, while increasing worker memory
allows the workers to utilize those resources effectively. They work
together to address the root cause of worker memory issues and pod
evictions.
149. You are on the data governance team and are implementing
security requirements to deploy resources. You need to ensure that
resources are limited to only the europe-west3 region. You want to follow
Google-recommended practices.
What should you do?
153. You are deploying a MySQL database workload onto Cloud SQL. The
database must be able to scale up to support several readers from various
geographic regions. The database must be highly available and meet low
RTO and RPO requirements, even in the event of a regional outage. You
need to ensure that interruptions to the readers are minimal during a
database failover. What should you do?
Option C provides the most robust and highly available solution by
combining a highly available primary instance with a highly available read
replica in another region. This approach ensures that the database can
withstand both zonal and regional failures, while cascading read replicas
provide scalability and low latency for read workloads.
154. You are planning to load some of your existing on-premises data
into BigQuery on Google Cloud. You want to either stream or batch-load
data, depending on your use case. Additionally, you want to mask some
sensitive data before loading into BigQuery. You need to do this in a
programmatic way while keeping costs to a minimum. What should you
do?
156. The data analyst team at your company uses BigQuery for ad-hoc
queries and scheduled SQL pipelines in a Google Cloud project with a slot
reservation of 2000 slots. However, with the recent introduction of
hundreds of new non time-sensitive SQL pipelines, the team is
encountering frequent quota errors. You examine the logs and notice that
approximately 1500 queries are being triggered concurrently during peak
time. You need to resolve the concurrency issue. What should you do?
158. You are designing a data mesh on Google Cloud by using Dataplex
to manage data in BigQuery and Cloud Storage. You want to simplify data
asset permissions. You are creating a customer virtual lake with two user
groups:
• Data engineers, which require full data lake access
• Analytic users, which require access to curated data
You need to assign access rights to these two groups. What should you do?
Option A provides the most straightforward and efficient way to manage
permissions in Dataplex by using its built-in roles (dataplex.dataOwner and
dataplex.dataReader). This simplifies permission management and
ensures that each user group has the appropriate level of access to the
data lake.
159. You are designing the architecture of your application to store data
in Cloud Storage. Your application consists of pipelines that read data from
a Cloud Storage bucket that contains raw data, and write the data to a
second bucket after processing. You want to design an architecture with
Cloud Storage resources that are capable of being resilient if a Google
Cloud regional failure occurs. You want to minimize the recovery point
objective (RPO) if a failure occurs, with no impact on applications that use
the stored data. What should you do?
Option C provides the best balance of high availability, low RPO, and
minimal impact on applications. Dual-region buckets with turbo replication
offer a robust and efficient solution for storing data in Cloud Storage with
regional failure resilience.
160. You have designed an Apache Beam processing pipeline that reads
from a Pub/Sub topic. The topic has a message retention duration of one
day, and writes to a Cloud Storage bucket. You need to select a bucket
location and processing strategy to prevent data loss in case of a regional
outage with an RPO of 15 minutes. What should you do?
Option D provides the most robust and efficient solution for preventing
data loss and ensuring business continuity during a regional outage. It
combines the high availability of dual-region buckets with turbo
replication, proactive monitoring, and a well-defined failover process.
161. You are preparing data that your machine learning team will use to
train a model using BigQueryML. They want to predict the price per square
foot of real estate. The training data has a column for the price and a
column for the number of square feet. Another feature column called
‘feature1’ contains null values due to missing data. You want to replace
the nulls with zeros to keep more data points. Which query should you
use?
Option A is the correct choice because it retains all the original columns
and specifically addresses the issue of null values in ‘feature1’ by
replacing them with zeros, without altering any other columns or
performing unnecessary calculations. This makes the data ready for use in
BigQueryML without losing any important information.
Option C is not the best choice because it includes the EXCEPT clause for
the price and square_feet columns, which would exclude these columns
from the results. This is not desirable since you need these columns for
the machine learning model to predict the price per square foot.
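Without reproducing the exact option text, a query in this spirit (table
and column names are assumptions) can be run programmatically, for
example with the BigQuery Python client:

    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT
          price,
          square_feet,
          IFNULL(feature1, 0) AS feature1
        FROM `my-project.real_estate.training_data`
    """
    for row in client.query(sql).result():
        print(row.price, row.square_feet, row.feature1)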
163. You are developing a model to identify the factors that lead to sales
conversions for your customers. You have completed processing your data.
You want to continue through the model development lifecycle. What
should you do next?
You've just concluded processing the data, ending up with a clean and
prepared dataset for the model. Now you need to decide how to split the
data for training and testing. Only afterwards can you train the model,
evaluate it, fine-tune it and, eventually, predict with it.
164. You have one BigQuery dataset which includes customers’ street
addresses. You want to retrieve all occurrences of street addresses from
the dataset. What should you do?
165. Your company operates in three domains: airlines, hotels, and ride-
hailing services. Each domain has two teams: analytics and data science,
which create data assets in BigQuery with the help of a central data
platform team. However, as each domain is evolving rapidly, the central
data platform team is becoming a bottleneck. This is causing delays in
deriving insights from data, and resulting in stale data when pipelines are
not kept up to date. You need to design a data mesh architecture by using
Dataplex to eliminate the bottleneck. What should you do?
You have an inventory of VM data stored in the BigQuery table. You want
to prepare the data for regular reporting in the most cost-effective way.
You need to exclude VM rows with fewer than 8 vCPU in your report. What
should you do?
This approach allows you to set up a custom log sink with an advanced
filter that targets the specific table and then export the log entries to
Google Cloud Pub/Sub. Your monitoring tool can subscribe to the Pub/Sub
topic, providing you with instant notifications when relevant events occur
without being inundated with notifications from other tables.
Options A and B do not offer the same level of customization and
specificity in targeting notifications for a particular table.
Option C is almost correct but doesn't mention the use of an advanced log
filter in the sink configuration, which is typically needed to filter the logs to
a specific table effectively. Using the Stackdriver API for more advanced
configuration is often necessary for fine-grained control over log filtering.
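As a hedged sketch, a sink with an advanced filter scoped to a single
BigQuery table could be created with the Cloud Logging Python client; the
project, dataset, table, topic names, and the exact filter fields are
illustrative only.

    from google.cloud import logging

    client = logging.Client(project="my-project")

    log_filter = (
        'resource.type="bigquery_resource" '
        'AND protoPayload.resourceName:"datasets/my_dataset/tables/my_table"'
    )
    destination = "pubsub.googleapis.com/projects/my-project/topics/table-events"

    sink = client.sink("my-table-events-sink", filter_=log_filter, destination=destination)
    sink.create()  # the monitoring tool then subscribes to the Pub/Sub topic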
169. Your company's data platform ingests CSV file dumps of booking
and user profile data from upstream sources into Cloud Storage. The data
analyst team wants to join these datasets on the email field available in
both the datasets to perform analysis. However, personally identifiable
information (PII) should not be accessible to the analysts. You need to de-
identify the email field in both the datasets before loading them into
BigQuery for analysts. What should you do?
Format-preserving encryption (FPE) with FFX in Cloud DLP is a strong
choice for de-identifying PII like email addresses. FPE maintains the format
of the data and ensures that the same input results in the same encrypted
output consistently. This means the email fields in both datasets can be
encrypted to the same value, allowing for accurate joins in BigQuery while
keeping the actual email addresses hidden.
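A rough Python sketch of an FPE-with-FFX de-identification call using the
Cloud DLP API; the key handling, custom alphabet, and names are
illustrative only (in production you would typically use a KMS-wrapped
key and run this inside your ingestion pipeline).

    from google.cloud import dlp_v2

    dlp = dlp_v2.DlpServiceClient()
    parent = "projects/my-project"

    deidentify_config = {
        "info_type_transformations": {
            "transformations": [{
                "info_types": [{"name": "EMAIL_ADDRESS"}],
                "primitive_transformation": {
                    "crypto_replace_ffx_fpe_config": {
                        # 32-byte placeholder key; use a KMS-wrapped key in production
                        "crypto_key": {"unwrapped": {"key": b"0123456789abcdef0123456789abcdef"}},
                        # illustrative alphabet covering characters found in email addresses
                        "custom_alphabet": "abcdefghijklmnopqrstuvwxyz0123456789.@-_",
                    }
                },
            }]
        }
    }

    item = {"value": "Booking confirmed for jane.doe@example.com"}
    response = dlp.deidentify_content(request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        "item": item,
    })
    # the same input email always maps to the same token, so joins still work
    print(response.item.value)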
170. You have important legal hold documents in a Cloud Storage bucket.
You need to ensure that these documents are not deleted or modified.
What should you do?
172. You are deploying a batch pipeline in Dataflow. This pipeline reads
data from Cloud Storage, transforms the data, and then writes the data
into BigQuery. The security team has enabled an organizational constraint
in Google Cloud, requiring all Compute Engine instances to use only
internal IP addresses and no external IP addresses. What should you do?
- Private Google Access for services allows VM instances with only internal
IP addresses in a VPC network or on-premises networks (via Cloud VPN or
Cloud Interconnect) to reach Google APIs and services.
- When you launch a Dataflow job, you can specify that it should use
worker instances without external IP addresses if Private Google Access is
enabled on the subnetwork where these instances are launched.
- This way, your Dataflow workers will be able to access Cloud Storage and
BigQuery without violating the organizational constraint of no external IPs.
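A minimal sketch of launching such a pipeline with internal IPs only
(project, region, subnetwork, bucket, and table names are assumptions);
Private Google Access must already be enabled on the subnetwork.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        subnetwork=("https://www.googleapis.com/compute/v1/projects/my-project/"
                    "regions/us-central1/subnetworks/my-subnet"),
        use_public_ips=False,  # workers get internal IP addresses only
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
            | "ToRow" >> beam.Map(lambda line: {"raw": line})
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:analytics.raw_rows", schema="raw:STRING")
        )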
175. You are deploying an Apache Airflow directed acyclic graph (DAG) in
a Cloud Composer 2 instance. You have incoming files in a Cloud Storage
bucket that the DAG processes, one file at a time. The Cloud Composer
instance is deployed in a subnetwork with no Internet access. Instead of
running the DAG based on a schedule, you want to run the DAG in a
reactive way every time a new file is received. What should you do?
- Enable Airflow REST API: In Cloud Composer, enable the "Airflow web
server" option.
- Set Up Cloud Storage Notifications: Create a notification for new files,
routing to a Cloud Function.
- Create PSC Endpoint: Establish a PSC endpoint for Cloud Composer.
- Write Cloud Function: Code the function to use the Airflow REST API (via
PSC endpoint) to trigger the DAG.
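A hedged sketch of the Cloud Function step described above; the web
server URL, DAG id, and event fields are hypothetical, and the URL is
assumed to be reachable through the PSC endpoint.

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    AIRFLOW_WEB_SERVER = "https://example-composer-web-server"  # reached via the PSC endpoint
    DAG_ID = "process_incoming_file"  # hypothetical DAG id

    def trigger_dag(event, context):
        """Cloud Storage-triggered function that starts one DAG run per new file."""
        credentials, _ = google.auth.default(
            scopes=["https://www.googleapis.com/auth/cloud-platform"])
        session = AuthorizedSession(credentials)
        payload = {"conf": {"bucket": event["bucket"], "name": event["name"]}}
        response = session.post(
            f"{AIRFLOW_WEB_SERVER}/api/v1/dags/{DAG_ID}/dagRuns", json=payload)
        response.raise_for_status()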
176. You are planning to use Cloud Storage as part of your data lake
solution. The Cloud Storage bucket will contain objects ingested from
external systems. Each object will be ingested once, and the access
patterns of individual objects will be random. You want to minimize the
cost of storing and retrieving these objects. You want to ensure that any
cost optimization efforts are transparent to the users and applications.
What should you do?
- Autoclass automatically analyzes access patterns of objects and
automatically transitions them to the most cost-effective storage class
within Standard, Nearline, Coldline, or Archive.
- This eliminates the need for manual intervention or setting specific age
thresholds.
- No user or application interaction is required, ensuring transparency.
177. You have several different file type data sources, such as Apache
Parquet and CSV. You want to store the data in Cloud Storage. You need to
set up an object sink for your data that allows you to use your own
encryption keys. You want to use a GUI-based solution. What should you
do?
178. Your business users need a way to clean and prepare data before
using the data for analysis. Your business users are less technically savvy
and prefer to work with graphical user interfaces to define their
transformations. After the data has been transformed, the business users
want to perform their analysis directly in a spreadsheet. You need to
recommend a solution that they can use. What should you do?
It uses Dataprep to address the need for a graphical interface for data
cleaning.
It leverages BigQuery for scalable data storage.
It employs Connected Sheets to enable analysis directly within a
spreadsheet, fulfilling all the stated requirements.
179. You are working on a sensitive project involving private user data.
You have set up a project on Google Cloud Platform to house your work
internally. An external consultant is going to assist with coding a complex
transformation in a Google Cloud Dataflow pipeline for your project. How
should you maintain users' privacy?
180. You have two projects where you run BigQuery jobs:
• One project runs production jobs that have strict completion time SLAs.
These are high priority jobs that must have the required compute
resources available when needed. These jobs generally never go below a
300 slot utilization, but occasionally spike up an additional 500 slots.
• The other project is for users to run ad-hoc analytical queries. This
project generally never uses more than 200 slots at a time. You want these
ad-hoc queries to be billed based on how much data users scan rather
than by slot capacity.
You need to ensure that both projects have the appropriate compute
resources available. What should you do?
182. You are on the data governance team and are implementing
security requirements. You need to encrypt all your data in BigQuery by
using an encryption key managed by your team. You must implement a
mechanism to generate and store encryption material only on your on-
premises hardware security module (HSM). You want to rely on Google
managed solutions. What should you do?
- Cloud EKM allows you to use encryption keys managed in external key
management systems, including on-premises HSMs, while using Google
Cloud services.
- This means that the key material remains in your control and
environment, and Google Cloud services use it via the Cloud EKM
integration.
- This approach aligns with the need to generate and store encryption
material only on your on-premises HSM and is the correct way to integrate
such keys with BigQuery.
183. You maintain ETL pipelines. You notice that a streaming pipeline
running on Dataflow is taking a long time to process incoming data, which
causes output delays. You also noticed that the pipeline graph was
automatically optimized by Dataflow and merged into one step. You want
to identify where the potential bottleneck is occurring. What should you
do?
From the Dataflow documentation: "There are a few cases in your pipeline
where you may want to prevent the Dataflow service from performing
fusion optimizations. These are cases in which the Dataflow service might
incorrectly guess the optimal way to fuse operations in the pipeline, which
could limit the Dataflow service's ability to make use of all available
workers.
You can insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the
data, and performs deduplication of records. Reshuffle is supported by
Dataflow even though it is marked deprecated in the Apache Beam
documentation."
184. You are running your BigQuery project in the on-demand billing
model and are executing a change data capture (CDC) process that
ingests data. The CDC process loads 1 GB of data every 10 minutes into a
temporary table, and then performs a merge into a 10 TB target table.
This process is very scan intensive and you want to explore options to
enable a predictable cost model. You need to create a BigQuery
reservation based on utilization information gathered from BigQuery
Monitoring and apply the reservation to the CDC process. What should you
do?
The most effective and recommended way to ensure a BigQuery
reservation applies to your CDC process, which involves multiple jobs and
potential different datasets/service accounts, is to create the reservation
at the project level. This guarantees that all BigQuery workloads within
the project, including your CDC process, will utilize the reserved capacity.
- Lowest RPO: Time travel offers point-in-time recovery for the past seven
days by default, providing the shortest possible recovery point objective
(RPO) among the given options. You can recover data to any state within
that window.
- No Additional Costs: Time travel is a built-in feature of BigQuery,
incurring no extra storage or operational costs.
- Managed Service: BigQuery handles time travel automatically,
eliminating manual backup and restore processes.
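For example, a point-in-time read with time travel looks like this
(dataset and table names are assumptions):

    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT *
        FROM `my-project.sales.orders`
          FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
    """
    rows = client.query(sql).result()
    print(f"Recovered {rows.total_rows} rows as of one hour ago")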
186. You are building a streaming Dataflow pipeline that ingests noise
level data from hundreds of sensors placed near construction sites across
a city. The sensors measure noise level every ten seconds, and send that
data to the pipeline when levels reach above 70 dBA. You need to detect
the average noise level from a sensor when data is received for a duration
of more than 30 minutes, but the window ends when no data has been
received for 15 minutes. What should you do?
To detect average noise levels from sensors, the best approach is to use
session windows with a 15-minute gap duration (Option A). Session
windows are ideal for cases like this where the events (sensor data) are
sporadic. They group events that occur within a certain time interval (15
minutes in this case), and a new window is started if no data is received
for the duration of the gap. This matches the requirement to end the
window when no data is received for 15 minutes, ensuring that the
average noise level is calculated over periods of continuous data.
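A minimal Beam (Python) sketch of the windowing described above; the
topic name, message format, and field names are assumptions, and the
additional 30-minute minimum-duration check could be applied to the
windowed results afterwards.

    import json
    import apache_beam as beam
    from apache_beam.transforms.window import Sessions
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/noise-readings")
            | "Parse" >> beam.Map(lambda m: json.loads(m.decode("utf-8")))
            | "KeyBySensor" >> beam.Map(lambda r: (r["sensor_id"], r["dba"]))
            | "SessionWindow" >> beam.WindowInto(Sessions(gap_size=15 * 60))
            | "AvgNoise" >> beam.combiners.Mean.PerKey()
            | "Print" >> beam.Map(print)
        )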
187. You are creating a data model in BigQuery that will hold retail
transaction data. Your two largest tables, sales_transaction_header and
sales_transaction_line, have a tightly coupled immutable relationship.
These tables are rarely modified after load and are frequently joined when
queried. You need to model the sales_transaction_header and
sales_transaction_line tables to improve the performance of data analytics
queries. What should you do?
- Draining the old pipeline ensures that it finishes processing all in-flight
data before stopping, which prevents data loss and inconsistencies.
- After draining, you can start the new pipeline, which will begin processing
new data from where the old pipeline left off.
- This approach maintains a smooth transition between the old and new
versions, minimizing latency increases and avoiding data gaps or overlaps.
189. Your organization's data assets are stored in BigQuery, Pub/Sub, and
a PostgreSQL instance running on Compute Engine. Because there are
multiple domains and diverse teams using the data, teams in your
organization are unable to discover existing data assets. You need to
design a solution to improve data discoverability while keeping
development and configuration efforts to a minimum. What should you
do?
190. You are building a model to predict whether or not it will rain on a
given day. You have thousands of input features and want to see if you can
improve training speed by removing some features while having a
minimum effect on model accuracy. What can you do?
191. You need to create a SQL pipeline. The pipeline runs an aggregate
SQL transformation on a BigQuery table every two hours and appends the
result to another existing BigQuery table. You need to configure the
pipeline to retry if errors occur. You want the pipeline to send an email
notification after three consecutive failures. What should you do?
Option B leverages the power of Cloud Composer's workflow orchestration
and the BigQueryInsertJobOperator's capabilities to create a
straightforward, reliable, and maintainable SQL pipeline that meets all the
specified requirements, including retries and email notifications after three
consecutive failures.
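A hedged sketch of such a DAG; the schedule, SQL, and e-mail address are
assumptions. With retries=2, one attempt plus two retries gives three
consecutive failures before email_on_failure sends the notification.

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

    with DAG(
        dag_id="two_hourly_aggregate",
        start_date=datetime(2024, 1, 1),
        schedule_interval=timedelta(hours=2),
        catchup=False,
        default_args={
            "retries": 2,
            "retry_delay": timedelta(minutes=5),
            "email": ["data-team@example.com"],
            "email_on_failure": True,
        },
    ) as dag:
        BigQueryInsertJobOperator(
            task_id="aggregate_and_append",
            configuration={
                "query": {
                    "query": """
                        INSERT INTO `my-project.reporting.daily_totals`
                        SELECT DATE(order_ts) AS day, SUM(amount) AS total
                        FROM `my-project.sales.orders`
                        GROUP BY day
                    """,
                    "useLegacySql": False,
                }
            },
        )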
It makes the tag template public, enabling all employees to search for
tables based on the tags without needing extra permissions.
It directly grants BigQuery data access to the HR group only on the
necessary tables, minimizing configuration overhead and ensuring
compliance with the restricted data access requirement.
By combining public tag visibility with targeted BigQuery permissions,
Option C provides the most straightforward and least complex way to
achieve the desired access control and searchability for your BigQuery
data and Data Catalog tags.
194. You are creating the CI/CD cycle for the code of the directed acyclic
graphs (DAGs) running in Cloud Composer. Your team has two Cloud
Composer instances: one instance for development and another instance
for production. Your team is using a Git repository to maintain and develop
the code of the DAGs. You want to deploy the DAGs automatically to Cloud
Composer when a certain tag is pushed to the Git repository. What should
you do?
It uses Cloud Build to automate the deployment process based on Git tags.
It directly deploys DAG code to the Cloud Storage buckets used by Cloud
Composer, eliminating the need for additional infrastructure.
It aligns with the recommended approach for managing DAGs in Cloud
Composer.
By leveraging Cloud Build and Cloud Storage, Option A minimizes the
configuration overhead and complexity while providing a robust and
automated CI/CD pipeline for your Cloud Composer DAGs.
195. You have a BigQuery table that ingests data directly from a Pub/Sub
subscription. The ingested data is encrypted with a Google-managed
encryption key. You need to meet a new organization policy that requires
you to use keys from a centralized Cloud Key Management Service (Cloud
KMS) project to encrypt data at rest. What should you do?
197. You are designing a Dataflow pipeline for a batch processing job.
You want to mitigate multiple zonal failures at job submission time. What
should you do?
198. You are designing a real-time system for a ride hailing app that
identifies areas with high demand for rides to effectively reroute available
drivers to meet the demand. The system ingests data from multiple
sources to Pub/Sub, processes the data, and stores the results for
visualization and analysis in real-time dashboards. The data sources
include driver location updates every 5 seconds and app-based booking
events from riders. The data processing involves real-time aggregation of
supply and demand data for the last 30 seconds, every 2 seconds, and
storing the results in a low-latency system for visualization. What should
you do?
Tumbling windows are the best choice for this ride-hailing app because
they provide accurate 2-second aggregations without the complexities of
overlapping data. This is crucial for real-time decision-making and
ensuring accurate visualization of supply and demand.
Hopping windows introduce potential inaccuracies and complexity, making
them less suitable for this scenario. While they can be useful in other
situations, they are not the optimal choice for real-time aggregation with
strict accuracy requirements.
Side Output for Failed Messages: Dataflow allows you to use side outputs
to handle messages that fail processing. In your DoFn, you can catch
exceptions and write the failed messages to a separate PCollection. This
PCollection can then be written to a new Pub/Sub topic.
New Pub/Sub Topic for Monitoring: Creating a dedicated Pub/Sub topic for
failed messages allows you to monitor it specifically for alerting purposes.
This provides a clear view of any issues with your business logic.
topic/num_unacked_messages_by_region Metric: This Cloud Monitoring
metric tracks the number of unacknowledged messages in a Pub/Sub
topic. By monitoring this metric on your new topic, you can identify when
messages are failing to be processed correctly.
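A minimal sketch of this dead-letter pattern (topic names and parsing
logic are illustrative):

    import json
    import apache_beam as beam
    from apache_beam import pvalue
    from apache_beam.options.pipeline_options import PipelineOptions

    class ApplyBusinessLogic(beam.DoFn):
        def process(self, message):
            try:
                record = json.loads(message.decode("utf-8"))
                record["processed"] = True
                yield record
            except Exception:
                # failed messages go to the side output instead of crashing the pipeline
                yield pvalue.TaggedOutput("failed", message)

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        results = (
            p
            | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/input")
            | "Process" >> beam.ParDo(ApplyBusinessLogic()).with_outputs("failed", main="ok")
        )
        results.ok | "WriteOk" >> beam.Map(print)  # stand-in for the real sink
        results.failed | "WriteDeadLetter" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/failed-messages")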
200. You want to store your team’s shared tables in a single dataset to
make data easily accessible to various analysts. You want to make this
data readable but unmodifiable by analysts. At the same time, you want to
provide the analysts with individual workspaces in the same project, where
they can create and store tables for their own use, without the tables
being accessible by other analysts. What should you do?
You want to improve the performance of this data read. What should you
do?
This function exports the whole table to temporary files in Google Cloud
Storage, where it will later be read from.
This requires almost no computation, as it only performs an export job,
and later Dataflow reads from GCS (not from BigQuery).
BigQueryIO.read.fromQuery() executes a query and then reads the results
received after the query execution. Therefore, this function is more time-
consuming, given that it requires that a query is first executed (which
incurs the corresponding economic and computational costs).
202. You are running a streaming pipeline with Dataflow and are using
hopping windows to group the data as the data arrives. You noticed that
some data is arriving late but is not being marked as late data, which is
resulting in inaccurate aggregations downstream. You need to find a
solution that allows you to capture the late data in the appropriate
window. What should you do?
203. You work for a large ecommerce company. You store your
customer's order data in Bigtable. You have a garbage collection policy set
to delete the data after 30 days and the number of versions is set to 1.
When the data analysts run a query to report total customer spending, the
analysts sometimes see customer data that is older than 30 days. You
need to ensure that the analysts do not see customer data older than 30
days while minimizing cost and overhead. What should you do?
204. You are using a Dataflow streaming job to read messages from a
message bus that does not support exactly-once delivery. Your job then
applies some transformations, and loads the result into BigQuery. You want
to ensure that your data is being streamed into BigQuery with exactly-
once delivery semantics. You expect your ingestion throughput into
BigQuery to be about 1.5 GB per second. What should you do?
This approach directly addresses the issue by filtering out data older than
30 days at query time, ensuring that only the relevant data is retrieved. It
avoids the overhead and potential delays associated with garbage
collection and manual deletion processes.
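For example, with the Bigtable Python client a timestamp range filter can
drop cells older than 30 days at read time (project, instance, and table
ids are assumptions):

    import datetime
    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters

    client = bigtable.Client(project="my-project")
    table = client.instance("orders-instance").table("customer_orders")

    cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=30)
    recent_only = row_filters.TimestampRangeFilter(row_filters.TimestampRange(start=cutoff))

    for row in table.read_rows(filter_=recent_only):
        print(row.row_key)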
205. You have created an external table for Apache Hive partitioned data
that resides in a Cloud Storage bucket, which contains a large number of
files. You notice that queries against this table are slow. You want to
improve the performance of these queries. What should you do?
- BigLake Table: BigLake allows for more efficient querying of data lakes
stored in Cloud Storage. It can handle large datasets more effectively than
standard external tables.
- Metadata Caching: Enabling metadata caching can significantly improve
query performance by reducing the time taken to read and process
metadata from a large number of files.
206. You have a network of 1000 sensors. The sensors generate time
series data: one metric per sensor per second, along with a timestamp.
You already have 1 TB of data, and expect the data to grow by 1 GB every
day. You need to access this data in two ways. The first access pattern
requires retrieving the metric from one specific sensor stored at a specific
timestamp, with a median single-digit millisecond latency. The second
access pattern requires running complex analytic queries on the data,
including joins, once a day. How should you store this data?
- Bigtable excels at incredibly fast lookups by row key, often reaching
single-digit millisecond latencies.
- Constructing the row key with sensor ID and timestamp enables efficient
retrieval of specific sensor readings at exact timestamps.
- Bigtable's wide-column design effectively stores time series data,
allowing for flexible addition of new metrics without schema changes.
- Bigtable scales horizontally to accommodate massive datasets
(petabytes or more), easily handling the expected data growth.
207. You have 100 GB of data stored in a BigQuery table. This data is
outdated and will only be accessed one or two times a year for analytics
with SQL. For backup purposes, you want to store this data to be
immutable for 3 years. You want to minimize storage costs. What should
you do?
208. You have thousands of Apache Spark jobs running in your on-
premises Apache Hadoop cluster. You want to migrate the jobs to Google
Cloud. You want to use managed services to run your jobs instead of
maintaining a long-lived Hadoop cluster yourself. You have a tight timeline
and want to keep code changes to a minimum. What should you do?
Dataproc is the most suitable choice for migrating your existing Apache
Spark jobs to Google Cloud because it is a fully managed service that
supports Apache Spark and Hadoop workloads with minimal changes to
your existing code. Moving your data to Cloud Storage and running jobs on
Dataproc offers a fast, efficient, and scalable solution for your needs.
209. You are administering shared BigQuery datasets that contain views
used by multiple teams in your organization. The marketing team is
concerned about the variability of their monthly BigQuery analytics spend
using the on-demand billing model. You need to help the marketing team
establish a consistent BigQuery analytics spend each month. What should
you do?
This option provides the marketing team with a predictable monthly cost
by reserving a fixed number of slots, ensuring that they have dedicated
resources without the variability introduced by autoscaling or on-demand
pricing. This setup also simplifies budgeting and financial planning for the
marketing team, as they will have a consistent expense each month.
211. You have data located in BigQuery that is used to generate reports
for your company. You have noticed some weekly executive report fields
do not correspond to format according to company standards. For
example, report errors include different telephone formats and different
country code identifiers. This is a frequent issue, so you need to create a
recurring job to normalize the data. You want a quick solution that requires
no coding. What should you do?
212. Your company is streaming real-time sensor data from their factory
floor into Bigtable and they have noticed extremely poor performance.
How should the row key be redesigned to improve Bigtable performance
on queries that populate real-time dashboards?
It enables efficient range scans for retrieving data for specific sensors
across time.
It distributes writes to prevent hotspots and maintain write performance.
It ensures data locality for recent queries, improving read performance for
real-time dashboards.
By using <sensorid>#<timestamp> as the row key structure, you
optimize Bigtable for the specific access patterns of your real-time
dashboards, resulting in improved query performance and a better user
experience.
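A small sketch of building such a row key and writing a metric with the
Bigtable Python client; the instance, table, column family, and value are
assumptions.

    import datetime
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("sensors-instance").table("sensor_metrics")

    sensor_id = "sensor-0042"
    event_time = datetime.datetime.now(datetime.timezone.utc)
    row_key = f"{sensor_id}#{event_time.strftime('%Y%m%d%H%M%S')}".encode("utf-8")

    row = table.direct_row(row_key)
    # assumes a column family named "metrics" already exists on the table
    row.set_cell("metrics", b"noise_dba", b"72.5", timestamp=event_time)
    row.commit()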
216. Your organization is modernizing their IT services and migrating to Google
Cloud. You need to organize the data that will be stored in Cloud Storage
and BigQuery. You need to enable a data mesh approach to share the data
between sales, product design, and marketing departments. What should
you do?
217. You work for a large ecommerce company. You are using Pub/Sub to
ingest the clickstream data to Google Cloud for analytics. You observe that
when a new subscriber connects to an existing topic to analyze data, they
are unable to subscribe to older data. For an upcoming yearly sale event in
two months, you need a solution that, once implemented, will enable any
new subscriber to read the last 30 days of data. What should you do?
- Topic Retention Policy: This policy determines how long messages are
retained by Pub/Sub after they are published, even if they have not been
acknowledged (consumed) by any subscriber.
- 30 Days Retention: By setting the retention policy of the topic to 30 days,
all messages published to this topic will be available for consumption for
30 days. This means any new subscriber connecting to the topic can
access and analyze data from the past 30 days.
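A hedged sketch of setting the topic-level retention with the Pub/Sub
Python client (project and topic names are assumptions):

    from google.cloud import pubsub_v1
    from google.protobuf import duration_pb2, field_mask_pb2

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "clickstream")

    topic = pubsub_v1.types.Topic(
        name=topic_path,
        message_retention_duration=duration_pb2.Duration(seconds=30 * 24 * 60 * 60),
    )
    update_mask = field_mask_pb2.FieldMask(paths=["message_retention_duration"])
    publisher.update_topic(request={"topic": topic, "update_mask": update_mask})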
218. You are designing the architecture to process your data from Cloud
Storage to BigQuery by using Dataflow. The network team provided you
with the Shared VPC network and subnetwork to be used by your
pipelines. You need to enable the deployment of the pipeline on the
Shared VPC network. What should you do?
Shared VPC and Network Access: When using a Shared VPC, you need to
grant specific permissions to service accounts in the service project
(where your Dataflow pipeline runs) to access resources in the host
project's network.
compute.networkUser Role: This role grants the necessary permissions for
a service account to use the network resources in the Shared VPC. This
includes accessing subnets, creating instances, and communicating with
other services within the network.
Service Account for Pipeline Execution: The service account that executes
your Dataflow pipeline is the one that needs these network permissions.
This is because the Dataflow service uses this account to create and
manage worker instances within the Shared VPC network.
222. You have an upstream process that writes data to Cloud Storage.
This data is then read by an Apache Spark job that runs on Dataproc.
These jobs are run in the us-central1 region, but the data could be stored
anywhere in the United States. You need to have a recovery process in
place in case of a catastrophic single region failure. You need an approach
with a maximum of 15 minutes of data loss (RPO=15 mins). You want to
ensure that there is minimal latency when reading the data. What should
you do?
Normalizing the database into separate Patients and Visits tables, along
with creating other necessary tables, is the best solution for handling the
increased data size while ensuring efficient query performance and
maintainability. This approach addresses the root problem instead of
applying temporary fixes.
224. Your company's customer and order databases are often under
heavy load. This makes performing analytics against them difficult without
harming operations. The databases are in a MySQL cluster, with nightly
backups taken using mysqldump. You want to perform analytics with
minimal impact on operations. What should you do?
- Aligns with ELT Approach: Dataform is designed for ELT (Extract, Load,
Transform) pipelines, directly executing SQL transformations within
BigQuery, matching the developers' preference.
- SQL as Code: It enables developers to write and manage SQL
transformations as code, promoting version control, collaboration, and
testing.
- Intuitive Coding Environment: Dataform provides a user-friendly interface
and familiar SQL syntax, making it easy for SQL-proficient developers to
adopt.
- Scheduling and Orchestration: It includes built-in scheduling capabilities
to automate pipeline execution, simplifying pipeline management.
227. You work for a farming company. You have one BigQuery table
named sensors, which is about 500 MB and contains the list of your 5000
sensors, with columns for id, name, and location. This table is updated
every hour. Each sensor generates one metric every 30 seconds along
with a timestamp, which you want to store in BigQuery. You want to run an
analytical query on the data once a week for monitoring purposes. You
also want to minimize costs. What data model should you use?
This approach offers several advantages:
Cost Efficiency: Partitioning the metrics table by timestamp helps reduce
query costs by allowing BigQuery to scan only the relevant partitions.
Data Organization: Keeping metrics in a separate table maintains a clear
separation between sensor metadata and sensor metrics, making it easier
to manage and query the data.
Performance: Using INSERT statements to append new metrics ensures
efficient data ingestion without the overhead of frequent updates.
228. You are managing a Dataplex environment with raw and curated
zones. A data engineering team is uploading JSON and CSV files to a
bucket asset in the curated zone but the files are not being automatically
discovered by Dataplex. What should you do to ensure that the files are
discovered by Dataplex?
Raw zones store structured data, semi-structured data such as CSV files
and JSON files, and unstructured data in any format from external sources.
Curated zones store structured data. Data can be stored in Cloud Storage
buckets or BigQuery datasets. Supported formats for Cloud Storage
buckets include Parquet, Avro, and ORC.
229. You have a table that contains millions of rows of sales data,
partitioned by date. Various applications and users query this data many
times a minute. The query requires aggregating values by using AVG,
MAX, and SUM, and does not require joining to other tables. The required
aggregations are only computed over the past year of data, though you
need to retain full historical data in the base tables. You want to ensure
that the query results always include the latest data from the tables, while
also reducing computation cost, maintenance overhead, and duration.
What should you do?
- Using the Cloud SQL Auth proxy is a recommended method for secure
connections, especially when dealing with dynamic IP addresses.
- The Auth proxy provides secure access to your Cloud SQL instance
without the need for Authorized Networks or managing IP addresses.
- It works by encapsulating database traffic and forwarding it through a
secure tunnel, using Google's IAM for authentication.
- Leaving the Authorized Networks empty means you're not allowing any
direct connections based on IP addresses, relying entirely on the Auth
proxy for secure connectivity. This is a secure and flexible solution,
especially for applications with dynamic IPs.
233. You are migrating a large number of files from a public HTTPS
endpoint to Cloud Storage. The files are protected from unauthorized
access using signed URLs. You created a TSV file that contains the list of
object URLs and started a transfer job by using Storage Transfer Service.
You notice that the job has run for a long time and eventually failed.
Checking the logs of the transfer job reveals that the job was running fine
until one point, and then it failed due to HTTP 403 errors on the remaining
files. You verified that there were no changes to the source system. You
need to fix the problem to resume the migration process. What should you
do?
HTTP 403 errors: These errors indicate unauthorized access, but since you
verified the source system and signed URLs, the issue likely lies with
expired signed URLs. Renewing the URLs with a longer validity period
prevents this issue for the remaining files.
Separate jobs: Splitting the file into smaller chunks and submitting them
as separate jobs improves parallelism and potentially speeds up the
transfer process.
Avoid manual intervention: Options A and D require manual intervention
and complex setups, which are less efficient and might introduce risks.
Longer validity: While option B addresses expired URLs, splitting the file
offers additional benefits for faster migration.
234. You work for an airline and you need to store weather data in a
BigQuery table. Weather data will be used as input to a machine learning
model. The model only uses the last 30 days of weather data. You want to
avoid storing unnecessary data and minimize costs. What should you do?
It uses partitioning to improve query performance when selecting data
within a date range.
It automates data deletion through partition expiration, ensuring that only
the necessary data is stored.
By using a partitioned table with partition expiration, you can effectively
manage your weather data in BigQuery, optimize query performance, and
minimize storage costs.
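For example, the table could be created as a day-partitioned table with a
30-day partition expiration (dataset, table, and column names are
assumptions):

    from google.cloud import bigquery

    client = bigquery.Client()
    ddl = """
        CREATE TABLE `my-project.weather.observations`
        (
          station_id STRING,
          observation_ts TIMESTAMP,
          temp_c FLOAT64
        )
        PARTITION BY DATE(observation_ts)
        OPTIONS (partition_expiration_days = 30)
    """
    client.query(ddl).result()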
235. You have a Google Cloud Dataflow streaming pipeline running with a
Google Cloud Pub/Sub subscription as the source. You need to make an
update to the code that will make the new Cloud Dataflow pipeline
incompatible with the current version. You do not want to lose any data
when making this update. What should you do?
It leverages the drain flag to ensure that all data is processed before the
pipeline is shut down for the update.
It allows for a seamless transition to the updated pipeline without any data
loss.
By using the drain flag, you can safely update your Dataflow pipeline with
incompatible changes while preserving data integrity.
236. You need to look at BigQuery data from a specific table multiple
times a day. The underlying table you are querying is several petabytes in
size, but you want to filter your data and provide simple aggregations to
downstream users. You want to run queries faster and get up-to-date
insights quicker. What should you do?
It provides the best query performance by storing pre-computed results.
It offers up-to-date insights through automatic refresh capabilities.
It can be more cost-effective than repeatedly querying the large table.
By creating a materialized view, you can significantly improve query
performance and get up-to-date insights faster, while reducing the load on
your BigQuery table.
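For example, a materialized view that pre-aggregates the large table
might be defined as follows (dataset, table, and column names are
assumptions):

    from google.cloud import bigquery

    client = bigquery.Client()
    ddl = """
        CREATE MATERIALIZED VIEW `my-project.analytics.daily_order_totals` AS
        SELECT
          DATE(order_ts) AS order_day,
          COUNT(*) AS orders,
          SUM(amount) AS revenue
        FROM `my-project.analytics.orders`
        GROUP BY order_day
    """
    client.query(ddl).result()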
240. You are configuring networking for a Dataflow job. The data pipeline
uses custom container images with the libraries that are required for the
transformation logic preinstalled. The data pipeline reads the data from
Cloud Storage and writes the data to BigQuery. You need to ensure cost-
effective and secure communication between the pipeline and Google APIs
and services. What should you do?
This approach ensures that your worker VMs can access Google APIs and
services securely without using external IP addresses, which reduces costs
and enhances security by keeping the traffic within Google's network.
241. You are using Workflows to call an API that returns a 1KB JSON
response, apply some complex business logic on this response, wait for
the logic to complete, and then perform a load from a Cloud Storage file to
BigQuery. The Workflows standard library does not have sufficient
capabilities to perform your complex logic, and you want to use Python's
standard library instead. You want to optimize your workflow for simplicity
and speed of execution. What should you do?
Using a Cloud Function allows you to run your Python code in a serverless
environment, which simplifies deployment and management. It also
ensures quick execution and scalability, as Cloud Functions can handle the
processing of your JSON response efficiently.
242. You are administering a BigQuery on-demand environment. Your
business intelligence tool is submitting hundreds of queries each day that
aggregate a large (50 TB) sales history fact table at the day and month
levels. These queries have a slow response time and are exceeding cost
expectations. You need to decrease response time, lower query costs, and
minimize maintenance. What should you do?
243. You have several different unstructured data sources, within your
on-premises data center as well as in the cloud. The data is in various
formats, such as Apache Parquet and CSV. You want to centralize this data
in Cloud Storage. You need to set up an object sink for your data that
allows you to use your own encryption keys. You want to use a GUI-based
solution. What should you do?
244. You are using BigQuery with a regional dataset that includes a table
with the daily sales volumes. This table is updated multiple times per day.
You need to protect your sales table in case of regional failures with a
recovery point objective (RPO) of less than 24 hours, while keeping costs
to a minimum. What should you do?
This approach ensures that sensitive data elements are protected through
masking, which meets data privacy requirements. At the same time, it
retains the data in a usable form for future analyses.
247. Your software uses a simple JSON format for all messages. These
messages are published to Google Cloud Pub/Sub, then processed with
Google Cloud Dataflow to create a real-time dashboard for the CFO. During
testing, you notice that some messages are missing in the dashboard. You
check the logs, and all messages are being published to Cloud Pub/Sub
successfully. What should you do next?
This will allow you to determine if the issue is with the pipeline or with the
dashboard application. By analyzing the output, you can see if the
messages are being processed correctly and determine if there are any
discrepancies or missing messages. If the issue is with the pipeline, you
can then debug and make any necessary updates to ensure that all
messages are processed correctly. If the issue is with the dashboard
application, you can then focus on resolving that issue. This approach
allows you to isolate and identify the root cause of the missing messages
in a controlled and efficient manner.
Flowlogistic wants to use Google BigQuery as their primary analysis
system, but they still have Apache Hadoop and Spark workloads that they
cannot move to BigQuery. Flowlogistic does not know how to store the
data that is common to both workloads. What should they do?
Company Background –
The company started as a regional trucking company, and then expanded
into other logistics markets. Because they have not updated their
infrastructure, managing and tracking orders and shipments has become a
bottleneck. To improve operations, Flowlogistic developed proprietary
technology for tracking shipments in real time at the parcel level.
However, they are unable to deploy it because their technology stack,
based on Apache Kafka, cannot support the processing volume. In
addition, Flowlogistic wants to further analyze their orders and shipments
to determine how best to deploy their resources.
Solution Concept –
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking
system that indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain
both structured and unstructured data, to determine how best to deploy
resources and which markets to expand into. They also want to use predictive
analytics to learn earlier when a shipment will be delayed.
✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) – SQL Server storage
- Network-attached storage (NAS) image storage, logs, backups
✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts,
Business Requirements –
✑ Build a reliable and reproducible environment with scaled parity of
production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid
provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met
Technical Requirements –
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing
demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud
environment
CEO Statement –
We have grown so quickly that our inability to upgrade our infrastructure is
really hampering further growth and efficiency. We are efficient at moving
shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand
where our customers are and what they are shipping.
CTO Statement –
IT has never been a priority for us, so as our data has grown, we have not
invested enough in our technology. I have a good staff to manage IT, but
they are so busy managing our infrastructure that I cannot get them to do
the things that really matter, such as organizing our data, building the
analytics, and figuring out how to implement the CFO' s tracking
technology.
CFO Statement –
Part of our competitive advantage is that we penalize ourselves for late
shipments and deliveries. Knowing where our shipments are at all times
has a direct correlation to our bottom line and profitability. Additionally, I
don't want to commit capital to building out a server environment.
Flowlogistic's CEO wants to gain rapid insight into their customer base so
his sales team can be better informed in the field. This team is not very
technical, so they've purchased a visualization tool to simplify the creation
of BigQuery reports. However, they've been overwhelmed by all the data
in the table, and are spending a lot of money on queries trying to find the
data they need. You want to solve their problem in the most cost-effective
way. What should you do?
Creating a view in BigQuery allows you to define a virtual table that is a
subset of the original data, containing only the necessary columns or
filtered data that the sales team requires for their reports. This approach is
cost-effective because it doesn't involve exporting data to external tools or
creating additional tables, and it ensures that the sales team is working
with the specific data they need without running expensive queries on the
full dataset. It simplifies the data for non-technical users while keeping the
data in BigQuery, which is a powerful and cost-efficient data warehousing
solution.
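For example, such a view could be defined as follows (dataset, table, and
column names are assumptions):

    from google.cloud import bigquery

    client = bigquery.Client()
    ddl = """
        CREATE VIEW `my-project.sales_reporting.customer_summary` AS
        SELECT customer_id, customer_name, region, last_order_date, total_spend
        FROM `my-project.warehouse.customer_orders`
    """
    client.query(ddl).result()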
Company Background –
The company started as a regional trucking company, and then expanded
into other logistics market. Because they have not updated their
infrastructure, managing and tracking orders and shipments has become a
bottleneck. To improve operations, Flowlogistic developed proprietary
technology for tracking shipments in real time at the parcel level.
However, they are unable to deploy it because their technology stack,
based on Apache Kafka, cannot support the processing volume. In
addition, Flowlogistic wants to further analyze their orders and shipments
to determine how best to deploy their resources.
Solution Concept –
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking
system that indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain
both structured and unstructured data, to determine how best to deploy
resources, which markets to expand info. They also want to use predictive
analytics to learn earlier when a shipment will be delayed.
✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) `" SQL server storage
- Network-attached storage (NAS) image storage, logs, backups
✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts,
Business Requirements –
✑ Build a reliable and reproducible environment with scaled panty of
production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid
provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met
Technical Requirements –
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing
demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data flight and at rest
✑ Connect a VPN between the production data center and cloud
environment
SEO Statement –
We have grown so quickly that our inability to upgrade our infrastructure is
really hampering further growth and efficiency. We are efficient at moving
shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand
where our customers are and what they are shipping.
CTO Statement –
IT has never been a priority for us, so as our data has grown, we have not
invested enough in our technology. I have a good staff to manage IT, but
they are so busy managing our infrastructure that I cannot get them to do
the things that really matter, such as organizing our data, building the
analytics, and figuring out how to implement the CFO' s tracking
technology.
CFO Statement –
Part of our competitive advantage is that we penalize ourselves for late
shipments and deliveries. Knowing where our shipments are at all times
has a direct correlation to our bottom line and profitability. Additionally, I
don't want to commit capital to building out a server environment.
Company Background –
Founded by experienced telecom executives, MJTelco uses technologies
originally developed to overcome communications challenges in space.
Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine
learning to continuously optimize their topologies. Because their hardware
is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability
and cost. Their management and operations teams are situated all around
the globe, creating a many-to-many relationship between data consumers
and providers in their system. After careful consideration, they decided
public cloud is the perfect environment to support their needs.
Solution Concept –
MJTelco is running a successful proof-of-concept (PoC) project in its labs.
They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows
generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.
MJTelco will also use three separate operating environments –
development/test, staging, and production – to meet the needs of running
experiments, deploying new features, and serving production customers.
Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.
Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.
CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.
CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.
CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.
You need to compose visualizations for operations teams with the
following requirements:
✑ The report must include telemetry data from all 50,000 installations for
the most recent 6 weeks (sampling once every minute).
✑ The report must not be more than 3 hours delayed from live data.
✑ The actionable report should only show suboptimal links.
✑ Most suboptimal links should be sorted to the top.
✑ Suboptimal links can be grouped and filtered by regional geography.
✑ User response time to load the report must be <5 seconds.
Which approach meets the requirements?
Loading the data into BigQuery and using Data Studio 360 provides the
best balance of scalability, performance, ease of use, and functionality to
meet MJTelco's visualization requirements.
254. You create an important report for your large team in Google Data
Studio 360. The report uses Google BigQuery as its data source. You notice
that visualizations are not showing data that is less than 1 hour old. What
should you do?
The most direct and effective way to ensure your Data Studio report shows
the latest data (less than 1 hour old) is to disable caching in the report
settings. This will force Data Studio to query BigQuery for fresh data each
time the report is accessed.
Their management and operations teams are situated all around the globe
creating a many-to-many relationship between data consumers and
providers in their system. After careful consideration, they decided public
cloud is the perfect environment to support their needs.
Solution Concept –
MJTelco is running a successful proof-of-concept (PoC) project in its labs.
They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows
generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.
MJTelco will also use three separate operating environments –
development/test, staging, and production – to meet the needs of running
experiments, deploying new features, and serving production customers.
Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.
Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.
CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.
CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.
CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.
You create a new report for your large team in Google Data Studio 360.
The report uses Google BigQuery as its data source. It is company policy
to ensure employees can view only the data associated with their region,
so you create and populate a table for each region. You need to enforce
the regional access policy to the data. Which two actions should you take?
(Choose two.)
Organize your tables into regional datasets and then grant view access on
those datasets to the appropriate regional security groups. This ensures
that users only have access to the data relevant to their region.
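A minimal sketch of granting that view access with the google-cloud-bigquery Python client follows; the dataset name and group address are assumptions for illustration:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")
dataset = client.get_dataset("my-project.sales_emea")

# Grant the regional security group read access to its regional dataset.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="emea-sales@example.com",
    )
)
dataset.access_entries = entries
# Only the access policy is updated; table data is untouched.
client.update_dataset(dataset, ["access_entries"])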
Company Background –
Founded by experienced telecom executives, MJTelco uses technologies
originally developed to overcome communications challenges in space.
Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine
learning to continuously optimize their topologies. Because their hardware
is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability
and cost.
Their management and operations teams are situated all around the globe
creating a many-to-many relationship between data consumers and
providers in their system. After careful consideration, they decided public
cloud is the perfect environment to support their needs.
Solution Concept –
Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.
Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.
CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.
CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.
CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.
MJTelco needs you to create a schema in Google Bigtable that will allow for
the historical analysis of the last 2 years of records. Each record that
comes in is sent every 15 minutes, and contains a unique identifier of the
device and a data record. The most common query is for all the data for a
given device for a given day.
Which schema should you use?
Optimized for Most Common Query: The most common query is for all data
for a given device on a given day. This schema directly matches the query
pattern by including both date and device_id in the row key. This enables
efficient retrieval of the required data using a single row key prefix scan.
Scalability: As the number of devices and data points increases, this
schema distributes the data evenly across nodes in the Bigtable cluster,
avoiding hotspots and ensuring scalability.
Data Organization: By storing data points as column values within each
row, you can easily add new data points or timestamps without modifying
the table structure.
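For illustration, assuming row keys of the form date#device_id#timestamp (the instance, table, and key layout below are hypothetical, not from the question), the most common query becomes a single prefix scan with the Python Bigtable client:

from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")
table = client.instance("telemetry-instance").table("device_records")

def read_device_day(date: str, device_id: str):
    # Row keys are assumed to look like "<date>#<device_id>#<timestamp>",
    # so one prefix covers all records for that device on that day.
    row_set = RowSet()
    row_set.add_row_range_with_prefix(f"{date}#{device_id}#")
    return table.read_rows(row_set=row_set)

for row in read_device_day("20240115", "device-42"):
    print(row.row_key)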
257. Your company has recently grown rapidly and is now ingesting data
at a significantly higher rate than it was previously. You manage the daily
batch MapReduce analytics jobs in Apache Hadoop. However, the recent
increase in data has meant the batch jobs are falling behind. You were
asked to recommend ways the development team could increase the
responsiveness of the analytics without increasing costs. What should you
recommend they do?
Both Pig & Spark requires rewriting the code so its an additional overhead,
but as an architect I would think about a long lasting solution. Resizing
Hadoop cluster can resolve the problem statement for the workloads at
that point in time but not on longer run. So Spark is the right choice,
although its a cost to start with, it will certainly be a long lasting solution
258. You work for a large fast food restaurant chain with over 400,000
employees. You store employee information in Google BigQuery in a Users
table consisting of a FirstName field and a LastName field. A member of IT
is building an application and asks you to modify the schema and data in
BigQuery so the application can query a FullName field consisting of the
value of the FirstName field concatenated with a space, followed by the
value of the LastName field for each employee. How can you make that
data available while minimizing cost?
259. You are deploying a new storage system for your mobile application,
which is a media streaming service. You decide the best fit is Google Cloud
Datastore. You have entities with multiple properties, some of which can
take on multiple values. For example, in the entity 'Movie' the property
'actors' and the property 'tags' have multiple values but the property 'date
released' does not. A typical query would ask for all movies with
actor=<actorname> ordered by date_released or all movies with
tag=Comedy ordered by date_released. How should you avoid a
combinatorial explosion in the number of indexes?
260. You work for a manufacturing plant that batches application log files
together into a single log file once a day at 2:00 AM. You have written a
Google Cloud Dataflow job to process that log file. You need to make sure
the log file is processed once per day as inexpensively as possible. What
should you do?
Using the Google App Engine Cron Service to run the Cloud Dataflow job
allows you to automate the execution of the job. By creating a cron job,
you can ensure that the Dataflow job is triggered exactly once per day at a
specified time. This approach is automated, reliable, and fits the
requirement of processing the log file once per day.
261. You work for an economic consulting firm that helps companies
identify economic trends as they happen. As part of your analysis, you use
Google BigQuery to correlate customer data with the average prices of the
100 most common goods sold, including bread, gasoline, milk, and others.
The average prices of these goods are updated every 30 minutes. You
want to make sure this data stays up to date so you can combine it with
other data in BigQuery as cheaply as possible. What should you do?
In summary, option B provides the most efficient and cost-
effective way to keep your economic data up-to-date in BigQuery
while minimizing overhead. You store the frequently changing data in a
cheaper storage service (Cloud Storage) and then use BigQuery's ability to
query data directly from that storage (federated tables) to combine it with
your other data. This avoids the need for constant, expensive data loading
into BigQuery.
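A hedged sketch of this setup with the google-cloud-bigquery Python client is shown below; the bucket path and table names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# External (federated) table over the CSV files in Cloud Storage, so
# BigQuery always reads the latest prices without reloading them.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://price-feed-bucket/average_prices/*.csv"]
external_config.autodetect = True
external_config.options.skip_leading_rows = 1

table = bigquery.Table("my-project.economics.average_prices")
table.external_data_configuration = external_config
client.create_table(table)

# Queries can now join the external price table directly with native
# BigQuery tables, with no load jobs every 30 minutes.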
262. You are designing the database schema for a machine learning-
based food ordering service that will predict what users want to eat. Here
is some of the information you need to store:
✑ The user profile: What the user likes and doesn't like to eat
✑ The user account information: Name, address, preferred meal times
✑ The order information: When orders are made, from where, to whom
The database will be used to store all the transactional data of the
product. You want to optimize the data schema. Which Google Cloud
Platform product should you use?
264. Your company produces 20,000 files every hour. Each data file is
formatted as a comma separated values (CSV) file that is less than 4 KB.
All files must be ingested on Google Cloud Platform before they can be
processed. Your company site has a 200 ms latency to Google Cloud, and
your Internet connection bandwidth is limited to 50 Mbps. You currently
deploy a secure FTP (SFTP) server on a virtual machine in Google Compute
Engine as the data ingestion point. A local SFTP client runs on a dedicated
machine to transmit the CSV files as is. The goal is to make reports with
data from the previous day available to the executives by 10:00 a.m. each
day. This design is barely able to keep up with the current volume, even
though the bandwidth utilization is rather low. You are told that due to
seasonality, your company expects the number of files to double for the
next three months. Which two actions should you take? (Choose two.)
265. An external customer provides you with a daily dump of data from
their database. The data flows into Google Cloud Storage (GCS) as comma-
separated values (CSV) files. You want to analyze this data in Google
BigQuery, but the data could have rows that are formatted incorrectly or
corrupted. How should you build this pipeline?
267. You are training a spam classifier. You notice that you are overfitting
the training data. Which three actions can you take to resolve this
problem? (Choose three.)
To address the problem of overfitting in training a spam classifier, you
should consider the following three actions:
A. Get more training examples:
Why: More training examples can help the model generalize better to
unseen data. A larger dataset typically reduces the chance of overfitting,
as the model has more varied examples to learn from.
C. Use a smaller set of features:
Why: Reducing the number of features can help prevent the model from
learning noise in the data. Overfitting often occurs when the model is too
complex for the amount of data available, and having too many features
can contribute to this complexity.
E. Increase the regularization parameters:
Why: Regularization techniques (like L1 or L2 regularization) add a penalty
to the model for complexity. Increasing the regularization parameter will
strengthen this penalty, encouraging the model to be simpler and thus
reducing overfitting.
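As a small illustration only (scikit-learn is not part of the question), increasing the regularization strength of a logistic-regression spam classifier could look like this, where a larger strength value means a stronger L2 penalty:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fit_spam_classifier(X, y, reg_strength: float = 10.0):
    # In scikit-learn, C is the inverse of regularization strength, so a
    # larger reg_strength (smaller C) applies a stronger L2 penalty and
    # yields a simpler, less overfit model.
    model = LogisticRegression(C=1.0 / reg_strength, penalty="l2",
                               max_iter=1000)
    scores = cross_val_score(model, X, y, cv=5)
    return model.fit(X, y), scores.mean()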
268. You are implementing security best practices on your data pipeline.
Currently, you are manually executing jobs as the Project Owner. You want
to automate these jobs by taking nightly batch files containing non-public
information from Google Cloud Storage, processing them with a Spark
Scala job on a Google Cloud Dataproc cluster, and depositing the results
into Google BigQuery. How should you securely run this workload?
269. You are using Google BigQuery as your data warehouse. Your users
report that the following simple query is running very slowly, no matter
when they run the query:
SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP
BY country
You check the query plan for the query and see the following output in the
Read section of Stage:1:
What is the most likely cause of the delay for this query?
The most likely cause of the delay for this query is option D. Most rows in
the [myproject:mydataset.mytable] table have the same value in the
country column, causing data skew.
Group by queries in BigQuery can run slowly when there is significant data
skew on the grouped columns. Since the query is grouping by country, if
most rows have the same country value, all that data will need to be
shuffled to a single reducer to perform the aggregation. This can cause a
data skew slowdown.
Options A and B might cause general slowness but are unlikely to affect
this specific grouping query. Option C could also cause some slowness but
not to the degree that heavy data skew on the grouped column could. So
D is the most likely root cause. Optimizing the data distribution to reduce
skew on the grouped column would likely speed up this query.
271. Your organization has been collecting and analyzing data in Google
BigQuery for 6 months. The majority of the data analyzed is placed in a
time-partitioned table named events_partitioned. To reduce the cost of
queries, your organization created a view called events, which queries
only the last 14 days of data. The view is described in legacy SQL. Next
month, existing applications will be connecting to BigQuery to read the
events data via an ODBC connection. You need to ensure the applications
can connect. Which two actions should you take? (Choose two.)
275. An online retailer has built their current application on Google App
Engine. A new initiative at the company mandates that they extend their
application to allow their customers to transact directly via the application.
They need to manage their shopping transactions and analyze combined
data from multiple datasets using a business intelligence (BI) tool. They
want to use only a single database for this purpose. Which Google Cloud
database should they choose?
Cloud SQL would be the most appropriate choice for the online retailer in
this scenario. Cloud SQL is a fully-managed relational database service
that allows for easy management and analysis of data using SQL. It is well-
suited for applications built on Google App Engine and can handle the
transactional workload of an e-commerce application, as well as the
analytical workload of a BI tool.
276. Your weather app queries a database every 15 minutes to get the
current temperature. The frontend is powered by Google App Engine and
serves millions of users. How should you design the frontend to respond to
a database failure?
277. You launched a new gaming app almost three years ago. You have
been uploading log files from the previous day to a separate Google
BigQuery table with the table name format LOGS_yyyymmdd. You have
been using table wildcard functions to generate daily and monthly reports
for all time ranges. Recently, you discovered that some queries that cover
long date ranges are exceeding the limit of 1,000 tables and failing. How
can you resolve this issue?
Sharded tables, like LOGS_yyyymmdd, are useful for managing data, but
querying across a long date range with table wildcards can lead to
inefficiencies and exceed the 1,000 table limit in BigQuery. Instead of
using multiple sharded tables, you should consider converting these into a
partitioned table.
A partitioned table allows you to store all the log data in a single table, but
logically divides the data into partitions (e.g., by date). This way, you can
efficiently query data across long date ranges without hitting the 1,000
table limit.
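A minimal sketch of that conversion using the google-cloud-bigquery Python client follows; the dataset, destination table, and column names are assumptions:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# One-time consolidation of the LOGS_yyyymmdd shards into a single
# date-partitioned table.
consolidate = """
CREATE TABLE `my-project.game_logs.logs_partitioned`
PARTITION BY event_date AS
SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS event_date, *
FROM `my-project.game_logs.LOGS_*`
"""
client.query(consolidate).result()

# Long date ranges now prune partitions instead of opening thousands of
# sharded tables.
report = """
SELECT event_date, COUNT(*) AS events
FROM `my-project.game_logs.logs_partitioned`
WHERE event_date BETWEEN '2022-01-01' AND '2023-12-31'
GROUP BY event_date
"""
rows = client.query(report).result()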
Preemptible workers are the default secondary worker type. They are
reclaimed and removed from the cluster if they are required by Google
Cloud for other tasks. Although the potential removal of preemptible
workers can affect job stability, you may decide to use preemptible
instances to lower per-hour compute costs for non-critical data processing
or to create very large clusters at a lower total cost
279. Your company receives both batch- and stream-based event data.
You want to process the data using Google Cloud Dataflow over a
predictable time period. However, you realize that in some instances data
can arrive late or out of order. How should you design your Cloud Dataflow
pipeline to handle data that is late or out of order?
Watermarks are a way to indicate that some data may still be in transit
and not yet processed. By setting a watermark, you can define a time
period during which Dataflow will continue to accept late or out-of-order
data and incorporate it into your processing. This allows you to maintain a
predictable time period for processing while still allowing for some
flexibility in the arrival of data.
Timestamps, on the other hand, are used to order events correctly, even if
they arrive out of order. By assigning timestamps to each event, you can
ensure that they are processed in the correct order, even if they don't
arrive in that order.
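A minimal Apache Beam (Python) sketch of both ideas is shown below; the window size, lateness bound, and sample events are assumptions for illustration:

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode
from apache_beam.utils.timestamp import Duration

events = [
    {"id": "a", "event_time": 1700000000.0},
    {"id": "b", "event_time": 1700000065.0},
    {"id": "c", "event_time": 1700000002.0},  # arrives out of order
]

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(events)
        # Attach each event's own timestamp so out-of-order arrivals land
        # in the correct window.
        | "Timestamps" >> beam.Map(
            lambda e: window.TimestampedValue(e, e["event_time"]))
        # Fixed one-minute windows; accept data up to 10 minutes late.
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),
            trigger=AfterWatermark(),
            accumulation_mode=AccumulationMode.DISCARDING,
            allowed_lateness=Duration(seconds=600))
        | "CountPerWindow" >> beam.CombineGlobally(
            beam.combiners.CountCombineFn()).without_defaults()
        | "Print" >> beam.Map(print)
    )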
280. You have some data, which is shown in the graphic below. The two
dimensions are X and Y, and the shade of each dot represents what class
it is. You want to classify this data accurately using a linear algorithm. To
do this you need to add a synthetic feature. What should the value of that
feature be?
The synthetic feature that should be added in this case is the squared
value of the distance from the origin (0,0). This is equivalent to X² + Y².
By adding this feature, the classifier will be able to make more accurate
predictions by taking into account the distance of each data point from the
origin.
X² and Y² alone will not give enough information to classify the data
because they do not take into account the relationship between X and Y.
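A tiny illustrative computation of the synthetic feature (the sample points are made up):

import numpy as np

def add_radius_feature(X: np.ndarray) -> np.ndarray:
    """X has columns [x, y]; returns columns [x, y, x^2 + y^2]."""
    radius_sq = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
    return np.hstack([X, radius_sq])

# Points near the origin and far from it now differ in a single linear
# feature, so a linear classifier can separate the two classes.
print(add_radius_feature(np.array([[0.1, 0.2], [1.5, -1.2]])))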
281. You are integrating one of your internal IT applications and Google
BigQuery, so users can query BigQuery from the application's interface.
You do not want individual users to authenticate to BigQuery and you do
not want to give them access to the dataset. You need to securely access
BigQuery from your IT application. What should you do?
282. You are building a data pipeline on Google Cloud. You need to
prepare data using a casual method for a machine-learning process. You
want to support a logistic regression model. You also need to monitor and
adjust for null values, which must remain real-valued and cannot be
removed. What should you do?
283. You set up a streaming data insert into a Redis cluster via a Kafka
cluster. Both clusters are running on Compute Engine instances. You need
to encrypt data at rest with encryption keys that you can create, rotate,
and destroy as needed. What should you do?
Cloud Key Management Service (KMS) is a fully managed service that
allows you to create, rotate, and destroy encryption keys as needed. By
creating encryption keys in Cloud KMS, you can use them to encrypt your
data at rest on the Compute Engine instances that are running your Redis
and Kafka clusters. This ensures that your data is protected
even when it is stored on disk.
285. You are selecting services to write and transform JSON messages
from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You
want to minimize service costs. You also want to monitor and
accommodate input data volume that will vary in size with minimal
manual intervention. What should you do?
Using Cloud Dataflow for transformations, with monitoring via Stackdriver
and its default autoscaling settings, is the best choice. Cloud
Dataflow is purpose-built for this type of workload, providing seamless
scalability and efficient processing capabilities for streaming data. Its
autoscaling feature minimizes manual intervention and helps manage
costs by dynamically adjusting resources based on the actual processing
needs, which is crucial for handling fluctuating data volumes efficiently
and cost-effectively.
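As a hedged sketch, the relevant Dataflow pipeline options might look like this; the project, region, bucket, and worker cap are hypothetical:

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    streaming=True,
    # THROUGHPUT_BASED enables Dataflow autoscaling; max_num_workers
    # caps how far the autoscaler can grow the worker pool.
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=10,
)

# Passing these options to beam.Pipeline(options=options) lets Dataflow
# add or remove workers as the Pub/Sub input volume changes.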
288. You are designing storage for very large text files for a data pipeline
on Google Cloud. You want to support ANSI SQL queries. You also want to
support compression and parallel load from the input locations using
Google recommended practices. What should you do?
The advantages of external tables are that they are fast to create, so you
skip importing the data, and that no additional monthly storage costs are
billed to your account: you are only charged for the data stored in the
data lake, which is comparatively cheaper than storing it in BigQuery.
290. You are designing storage for 20 TB of text files as part of deploying
a data pipeline on Google Cloud. Your input data is in CSV format. You
want to minimize the cost of querying aggregate values for multiple users
who will query the data in Cloud Storage with multiple engines. Which
storage service and schema design should you use?
291. You are designing storage for two relational tables that are part of a
10-TB database on Google Cloud. You want to support transactions that
scale horizontally. You also want to optimize data for range queries on non-
key columns. What should you do?
Bigtable is Google's NoSQL Big Data database service. It's the same
database that powers many core Google services, including Search,
Analytics, Maps, and Gmail. Bigtable is designed to handle massive
workloads at consistent low latency and high throughput, so it's a great
choice for both operational and analytical applications, including IoT, user
analytics, and financial data analysis.
Bigtable is an excellent option for any Apache Spark or Hadoop workloads
that require Apache HBase. Bigtable supports the Apache HBase 1.0+ APIs and
offers a Bigtable HBase client in Maven, so it is easy to use Bigtable with
Dataproc.
BigQuery provides built-in logging of all data access, including the user's
identity, the specific query run and the time of the query. This log can be
used to provide an auditable record of access to the data. Additionally,
BigQuery allows you to control access to the dataset using Identity and
Access Management (IAM) roles, so you can ensure that only authorized
personnel can view the dataset.
295. Your neural network model is taking days to train. You want to
increase the training speed. What can you do?
Subsampling your training dataset can help increase the training speed of
your neural network model. By reducing the size of your training dataset,
you can speed up the process of updating the weights in your neural
network. This can help you quickly test and iterate your model to improve
its accuracy.
Subsampling your test dataset, on the other hand, can lead to inaccurate
evaluation of your model's performance and may result in overfitting. It is
important to evaluate your model's performance on a representative test
dataset to ensure that it can generalize to new data.
Increasing the number of input features or layers in your neural network
can also improve its performance, but this may not necessarily increase
the training speed. In fact, adding more layers or features can increase the
complexity of your model and make it take longer to train. It is important
to balance the model's complexity with its performance and training time.
296. You are responsible for writing your company's ETL pipelines to run
on an Apache Hadoop cluster. The pipeline will require some checkpointing
and splitting pipelines. Which method should you use to write the
pipelines?
This will likely have the most impact on transfer speeds, as it addresses
the bottleneck in the transfer between your data center and GCP.
Increasing the CPU size or the size of the Google Persistent Disk on the
server may help with processing the data once it has been transferred,
but will not address the transfer bottleneck itself. Likewise, increasing
the network bandwidth from Compute Engine to Cloud Storage would only
help after the data has been transferred and does not address the
transfer bottleneck either.
298. You are building a new real-time data warehouse for your company
and will use Google BigQuery streaming inserts. There is no guarantee
that data will only be sent once, but you do have a unique ID for each
row of data and an event timestamp. You want to ensure that duplicates
are not included while interactively querying data. Which query type
should you use?
This approach will assign a row number to each row within a unique ID
partition, and by selecting only rows with a row number of 1, you will
ensure that duplicates are excluded in your query results. It allows you to
filter out redundant rows while retaining the latest or earliest records
based on your timestamp column.
Options A, B, and C do not address the issue of duplicates effectively or
interactively as they do not explicitly remove duplicates based on the
unique ID and event timestamp.
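A minimal sketch of that deduplicating query, run through the google-cloud-bigquery Python client, follows; the table and column names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

dedup_query = """
SELECT * EXCEPT(row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY unique_id
      ORDER BY event_timestamp DESC) AS row_num
  FROM `my-project.warehouse.events`
)
WHERE row_num = 1
"""
# Each unique_id keeps only its most recent row, so duplicate streaming
# inserts never appear in the interactive results.
for row in client.query(dedup_query).result():
    print(dict(row))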
Company Background –
Founded by experienced telecom executives, MJTelco uses technologies
originally developed to overcome communications challenges in space.
Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine
learning to continuously optimize their topologies. Because their hardware
is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability
and cost.
Their management and operations teams are situated all around the globe
creating a many-to-many relationship between data consumers and
providers in their system. After careful consideration, they decided public
cloud is the perfect environment to support their needs.
Solution Concept –
MJTelco is running a successful proof-of-concept (PoC) project in its labs.
They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows
generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.
MJTelco will also use three separate operating environments –
development/test, staging, and production – to meet the needs of running
experiments, deploying new features, and serving production customers.
Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.
Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.
CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.
CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.
CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.
MJTelco is building a custom interface to share data. They have these
requirements:
1. They need to do aggregations over their petabyte-scale datasets.
2. They need to scan specific time range rows with a very fast
response time (milliseconds).
Which combination of Google Cloud Platform products should you
recommend?
Company Background –
Founded by experienced telecom executives, MJTelco uses technologies
originally developed to overcome communications challenges in space.
Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine
learning to continuously optimize their topologies. Because their hardware
is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability
and cost.
Their management and operations teams are situated all around the globe
creating a many-to-many relationship between data consumers and
providers in their system. After careful consideration, they decided public
cloud is the perfect environment to support their needs.
Solution Concept –
MJTelco is running a successful proof-of-concept (PoC) project in its labs.
They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows
generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.
MJTelco will also use three separate operating environments –
development/test, staging, and production – to meet the needs of running
experiments, deploying new features, and serving production customers.
Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.
Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.
CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.
CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.
CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.
You need to compose visualizations for operations teams with the following
requirements:
✑ Telemetry must include data from all 50,000 installations for the most
recent 6 weeks (sampling once every minute)
✑ The report must not be more than 3 hours delayed from live data.
✑ The actionable report should only show suboptimal links.
✑ Most suboptimal links should be sorted to the top.
You create a data source to store the last 6 weeks of data, and create
visualizations that allow viewers to see multiple date ranges, distinct
geographic regions, and unique installation types. You always show the
latest data without any changes to your visualizations. You want to avoid
creating and updating new visualizations each month. What should you
do?
Company Background –
Founded by experienced telecom executives, MJTelco uses technologies
originally developed to overcome communications challenges in space.
Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine
learning to continuously optimize their topologies. Because their hardware
is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability
and cost.
Their management and operations teams are situated all around the globe
creating a many-to-many relationship between data consumers and
providers in their system. After careful consideration, they decided public
cloud is the perfect environment to support their needs.
Solution Concept –
MJTelco is running a successful proof-of-concept (PoC) project in its labs.
They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows
generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.
Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.
Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.
CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.
CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.
CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.
Given the record streams MJTelco is interested in ingesting per day, they
are concerned about the cost of Google BigQuery increasing. MJTelco asks
you to provide a design solution. They require a single large data table
called tracking_table. Additionally, they want to minimize the cost of daily
queries while performing fine-grained analysis of each day's events. They
also want to use streaming ingestion. What should you do?
Company Background –
The company started as a regional trucking company, and then expanded
into other logistics markets. Because they have not updated their
infrastructure, managing and tracking orders and shipments has become a
bottleneck. To improve operations, Flowlogistic developed proprietary
technology for tracking shipments in real time at the parcel level.
However, they are unable to deploy it because their technology stack,
based on Apache Kafka, cannot support the processing volume. In
addition, Flowlogistic wants to further analyze their orders and shipments
to determine how best to deploy their resources.
Solution Concept –
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking
system that indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain
both structured and unstructured data, to determine how best to deploy
resources, and which markets to expand into. They also want to use
predictive analytics to learn earlier when a shipment will be delayed.
✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) – SQL Server storage
- Network-attached storage (NAS) – image storage, logs, backups
✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts
Business Requirements –
✑ Build a reliable and reproducible environment with scaled parity of
production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid
provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met
Technical Requirements –
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing
demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud
environment
CEO Statement –
We have grown so quickly that our inability to upgrade our infrastructure is
really hampering further growth and efficiency. We are efficient at moving
shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand
where our customers are and what they are shipping.
CTO Statement –
IT has never been a priority for us, so as our data has grown, we have not
invested enough in our technology. I have a good staff to manage IT, but
they are so busy managing our infrastructure that I cannot get them to do
the things that really matter, such as organizing our data, building the
analytics, and figuring out how to implement the CFO' s tracking
technology.
CFO Statement –
Part of our competitive advantage is that we penalize ourselves for late
shipments and deliveries. Knowing where our shipments are at all times
has a direct correlation to our bottom line and profitability. Additionally, I
don't want to commit capital to building out a server environment.
Flowlogistic's management has determined that the current Apache Kafka
servers cannot handle the data volume for their real-time inventory
tracking system.
You need to build a new system on Google Cloud Platform (GCP) that will
feed the proprietary tracking software. The system must be able to ingest
data from a variety of global sources, process and query in real-time, and
store the data reliably. Which combination of GCP products should you
choose?
303. After migrating ETL jobs to run on BigQuery, you need to verify that
the output of the migrated jobs is the same as the output of the original.
You've loaded a table containing the output of the original job and want to
compare the contents with output from the migrated job to show that they
are identical. The tables do not contain a primary key column that would
enable you to join them together for comparison. What should you do?
305. You have an Apache Kafka cluster on-prem with topics containing
web application logs. You need to replicate the data to Google Cloud for
analysis in BigQuery and Cloud Storage. The preferred replication method
is mirroring to avoid deployment of Kafka Connect plugins. What should
you do?
By default, preemptible node disk sizes are limited to 100 GB or the size
of the non-preemptible node disks, whichever is smaller. However, you can
override the default preemptible disk size to any requested size. Since the
majority of the cluster uses preemptible nodes, the disk used for caching
operations will see a noticeable performance improvement with a larger
disk. SSDs will also perform better than HDDs. This increases costs
slightly, but it is the best option available while keeping costs under
control.
307. Your team is responsible for developing and maintaining ETLs in
your company. One of your Dataflow jobs is failing because of some errors
in the input data, and you need to improve reliability of the pipeline (incl.
being able to reprocess all failing data). What should you do?
Which table name will make the SQL statement work correctly?
Option B is the only one that correctly uses the wildcard syntax without
quotes to specify a pattern for multiple tables and allows the
TABLE_SUFFIX to filter based on the matched portion of the table name.
This makes it the correct answer for querying across multiple tables using
WILDCARD tables in BigQuery.
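For illustration, a standard SQL wildcard query of this shape (the dataset and shard prefix are hypothetical) filters shards through _TABLE_SUFFIX:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
SELECT _TABLE_SUFFIX AS day, COUNT(*) AS events
FROM `my-project.game_logs.LOGS_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
GROUP BY day
ORDER BY day
"""
# _TABLE_SUFFIX holds the part of each table name matched by the
# wildcard, so the scan is limited to the January 2024 shards.
for row in client.query(query).result():
    print(row.day, row.events)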
310. You are deploying MariaDB SQL databases on GCE VM Instances and
need to configure monitoring and alerting. You want to collect metrics
including network connections, disk IO and replication status from MariaDB
with minimal development effort and use StackDriver for dashboards and
alerts. What should you do?
StackDriver Agent: The StackDriver Agent is designed to collect system
and application metrics from virtual machine instances and send them to
StackDriver Monitoring. It simplifies the process of collecting and
forwarding metrics.
MySQL Plugin: The StackDriver Agent has a MySQL plugin that allows you
to collect MySQL-specific metrics without the need for additional custom
development. This includes metrics related to network connections, disk
IO, and replication status – which are the specific metrics you mentioned.
Option D is the most straightforward and least development-intensive
approach to achieve the monitoring and alerting requirements for MariaDB
on GCE VM Instances using StackDriver.
311. You work for a bank. You have a labelled dataset that contains
information on already granted loan applications and whether these
applications have defaulted. You have been asked to train a model to
predict default rates for credit applicants. What should you do?
313. You're using Bigtable for a real-time application, and you have a
heavy load that is a mix of reads and writes. You've recently identified an
additional use case and need to perform an hourly analytical job to
calculate certain statistics across the whole database. You need to ensure
both the reliability of your production application as well as the analytical
workload. What should you do?
When you use a single cluster to run a batch analytics job that performs
numerous large reads alongside an application that performs a mix of
reads and writes, the large batch job can slow things down for the
application's users. With replication, you can use app profiles with single-
cluster routing to route batch analytics jobs and application traffic to
different clusters, so that batch jobs don't affect your applications' users.
314. You are designing an Apache Beam pipeline to enrich data from
Cloud Pub/Sub with static reference data from BigQuery. The reference
data is small enough to fit in memory on a single worker. The pipeline
should write enriched results to BigQuery for analysis. Which job type and
transforms should this pipeline use?
315. You have a data pipeline that writes data to Cloud Bigtable using
well-designed row keys. You want to monitor your pipeline to determine
when to increase the size of your Cloud Bigtable cluster. Which two actions
can you take to accomplish this? (Choose two.)
D: In general, do not use more than 70% of the hard limit on total storage,
so you have room to add more data. If you do not plan to add significant
amounts of data to your instance, you can use up to 100% of the hard
limit
C: If this value is frequently at 100%, you might experience increased
latency. Add nodes to the cluster to reduce the disk load percentage.
The Key Visualizer metric options suggest actions other than increasing
the cluster size.
318. Your company needs to upload their historic data to Cloud Storage.
The security rules don't allow access from external IPs to their on-premises
resources. After an initial upload, they will add new data from existing on-
premises applications every day. What should they do?
gsutil rsync is the most straightforward, secure, and efficient solution for
transferring data from on-premises servers to Cloud Storage, especially
when security rules restrict inbound connections to the on-premises
environment. It's well-suited for both the initial bulk upload and the
ongoing daily updates.
319. You have a query that filters a BigQuery table using a WHERE clause
on timestamp and ID columns. By using bq query --dry_run you learn that
the query triggers a full scan of the table, even though the filter on
timestamp and ID selects a tiny fraction of the overall data. You want to
reduce the amount of data scanned by BigQuery with minimal changes to
existing SQL queries. What should you do?
Partitioning and clustering are the most effective way to optimize
BigQuery queries that filter on specific columns like timestamp and ID. By
reorganizing the table structure, BigQuery can significantly reduce the
amount of data scanned, leading to faster and cheaper queries.
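A minimal sketch of reorganizing the table this way with BigQuery DDL, issued through the Python client, is shown below; the table and column names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Recreate the table partitioned on the timestamp column and clustered
# on the ID column.
ddl = """
CREATE TABLE `my-project.analytics.events_optimized`
PARTITION BY DATE(event_timestamp)
CLUSTER BY id AS
SELECT * FROM `my-project.analytics.events`
"""
client.query(ddl).result()

# Existing queries keep their WHERE filters on event_timestamp and id;
# BigQuery now prunes partitions and clustered blocks instead of
# scanning the whole table.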