PMLE Book
Cover
Table of Contents
Title Page
Copyright
Dedication
Acknowledgments
About the Author
About the Technical Editors
About the Technical Proofreader
Google Technical Reviewer
Introduction
Google Cloud Professional Machine Learning Engineer
Certification
Who Should Buy This Book
How This Book Is Organized
Bonus Digital Content
Conventions Used in This Book
Google Cloud Professional ML Engineer Objective Map
How to Contact the Publisher
Assessment Test
Answers to Assessment Test
Chapter 1: Framing ML Problems
Translating Business Use Cases
Machine Learning Approaches
ML Success Metrics
Responsible AI Practices
Summary
Exam Essentials
Review Questions
Chapter 2: Exploring Data and Building Data Pipelines
Visualization
Statistics Fundamentals
Data Quality and Reliability
Establishing Data Constraints
Running TFDV on Google Cloud Platform
Organizing and Optimizing Training Datasets
Handling Missing Data
Data Leakage
Summary
Exam Essentials
Review Questions
Chapter 3: Feature Engineering
Consistent Data Preprocessing
Encoding Structured Data Types
Class Imbalance
Feature Crosses
TensorFlow Transform
GCP Data and ETL Tools
Summary
Exam Essentials
Review Questions
Chapter 4: Choosing the Right ML Infrastructure
Pretrained vs. AutoML vs. Custom Models
Pretrained Models
AutoML
Custom Training
Provisioning for Predictions
Summary
Exam Essentials
Review Questions
Chapter 5: Architecting ML Solutions
Designing Reliable, Scalable, and Highly Available ML
Solutions
Choosing an Appropriate ML Service
Data Collection and Data Management
Automation and Orchestration
Serving
Summary
Exam Essentials
Review Questions
Chapter 6: Building Secure ML Pipelines
Building Secure ML Systems
Identity and Access Management
Privacy Implications of Data Usage and Collection
Summary
Exam Essentials
Review Questions
Chapter 7: Model Building
Choice of Framework and Model Parallelism
Modeling Techniques
Transfer Learning
Semi-supervised Learning
Data Augmentation
Model Generalization and Strategies to Handle Overfitting
and Underfitting
Summary
Exam Essentials
Review Questions
Chapter 8: Model Training and Hyperparameter Tuning
Ingestion of Various File Types into Training
Developing Models in Vertex AI Workbench by Using
Common Frameworks
Training a Model as a Job in Different Environments
Hyperparameter Tuning
Tracking Metrics During Training
Retraining/Redeployment Evaluation
Unit Testing for Model Training and Serving
Summary
Exam Essentials
Review Questions
Chapter 9: Model Explainability on Vertex AI
Model Explainability on Vertex AI
Summary
Exam Essentials
Review Questions
Chapter 10: Scaling Models in Production
Scaling Prediction Service
Serving (Online, Batch, and Caching)
Google Cloud Serving Options
Hosting Third-Party Pipelines (MLflow) on Google Cloud
Testing for Target Performance
Configuring Triggers and Pipeline Schedules
Summary
Exam Essentials
Review Questions
Chapter 11: Designing ML Training Pipelines
Orchestration Frameworks
Identification of Components, Parameters, Triggers, and
Compute Needs
System Design with Kubeflow/TFX
Hybrid or Multicloud Strategies
Summary
Exam Essentials
Review Questions
Chapter 12: Model Monitoring, Tracking, and Auditing Metadata
Model Monitoring
Model Monitoring on Vertex AI
Logging Strategy
Model and Dataset Lineage
Vertex AI Experiments
Vertex AI Debugging
Summary
Exam Essentials
Review Questions
Chapter 13: Maintaining ML Solutions
MLOps Maturity
Retraining and Versioning Models
Feature Store
Vertex AI Permissions Model
Common Training and Serving Errors
Summary
Exam Essentials
Review Questions
Chapter 14: BigQuery ML
BigQuery – Data Access
BigQuery ML Algorithms
Explainability in BigQuery ML
BigQuery ML vs. Vertex AI Tables
Interoperability with Vertex AI
BigQuery Design Patterns
Summary
Exam Essentials
Review Questions
Appendix: Answers to Review Questions
Chapter 1: Framing ML Problems
Chapter 2: Exploring Data and Building Data Pipelines
Chapter 3: Feature Engineering
Chapter 4: Choosing the Right ML Infrastructure
Chapter 5: Architecting ML Solutions
Chapter 6: Building Secure ML Pipelines
Chapter 7: Model Building
Chapter 8: Model Training and Hyperparameter Tuning
Chapter 9: Model Explainability on Vertex AI
Chapter 10: Scaling Models in Production
Chapter 11: Designing ML Training Pipelines
Chapter 12: Model Monitoring, Tracking, and Auditing
Metadata
Chapter 13: Maintaining ML Solutions
Chapter 14: BigQuery ML
Index
End User License Agreement
List of Tables
Chapter 1
TABLE 1.1 ML problem types
TABLE 1.2 Structured data
TABLE 1.3 Time Series Data
TABLE 1.4 Confusion matrix for a binary classification
example
TABLE 1.5 Summary of metrics
Chapter 2
TABLE 2.1 Mean, median, and mode for outlier detection
Chapter 3
TABLE 3.1 One-hot encoding example
TABLE 3.2 Run a TFX pipeline on GCP
Chapter 4
TABLE 4.1 Vertex AI AutoML Tables algorithms
TABLE 4.2 AutoML algorithms
TABLE 4.3 Problems solved using AutoML
TABLE 4.4 Summary of the recommendation types available
in Retail AI
Chapter 5
TABLE 5.1 ML workflow to GCP services mapping
TABLE 5.2 When to use BigQuery ML vs. AutoML vs. a
custom model
TABLE 5.3 Google Cloud tools to read BigQuery data
TABLE 5.4 NoSQL data store options
Chapter 6
TABLE 6.1 Difference between server-side and client-side
encryption
TABLE 6.2 Strategies for handling sensitive data
TABLE 6.3 Techniques to handle sensitive fields in data
Chapter 7
TABLE 7.1 Distributed training strategies using TensorFlow
TABLE 7.2 Summary of loss functions based on ML problems
TABLE 7.3 Differences between L1 and L2 regularization
Chapter 8
TABLE 8.1 Dataproc connectors
TABLE 8.2 Data storage guidance on GCP for machine
learning
TABLE 8.3 Differences between managed and user-managed
notebooks
TABLE 8.4 Worker pool tasks in distributed training
TABLE 8.5 Search algorithm options for hyperparameter
tuning on GCP
TABLE 8.6 Tools to track or profile training metrics
TABLE 8.7 Retraining strategies
Chapter 9
TABLE 9.1 Explainable techniques used by Vertex AI
Chapter 10
TABLE 10.1 Static vs. dynamic features
TABLE 10.2 Input data options for batch training in Vertex
AI
TABLE 10.3 ML orchestration options
Chapter 11
TABLE 11.1 Kubeflow Pipelines vs. Vertex AI Pipelines vs.
Cloud Composer
Chapter 13
TABLE 13.1 Table of baseball batters
Chapter 14
TABLE 14.1 Models available on BigQuery ML
TABLE 14.2 Model types
List of Illustrations
Chapter 1
FIGURE 1.1 Business case to ML problem
FIGURE 1.2 AUC
FIGURE 1.3 AUC PR
Chapter 2
FIGURE 2.1 Box plot showing quartiles
FIGURE 2.2 Line plot
FIGURE 2.3 Bar plot
FIGURE 2.4 Data skew
FIGURE 2.5 TensorFlow Data Validation
FIGURE 2.6 Dataset representation
FIGURE 2.7 Credit card data representation
FIGURE 2.8 Downsampling credit card data
Chapter 3
FIGURE 3.1 Difficult to separate by line or a linear method
FIGURE 3.2 Difficult to separate classes by line
FIGURE 3.3 Summary of feature columns. Source: Google Cloud via
Coursera, www.coursera...
FIGURE 3.4 TensorFlow Transform
Chapter 4
FIGURE 4.1 Pretrained, AutoML, and custom models
FIGURE 4.2 Analyzing a photo using Vision AI
FIGURE 4.3 Vertex AI AutoML, providing a “budget”
FIGURE 4.4 Choosing the size of model in Vertex AI
FIGURE 4.5 TPU system architecture
Chapter 5
FIGURE 5.1 Google AI/ML stack
FIGURE 5.2 Kubeflow Pipelines and Google Cloud managed
services
FIGURE 5.3 Google Cloud architecture for performing offline
batch prediction...
FIGURE 5.4 Google Cloud architecture for online prediction
FIGURE 5.5 Push notification architecture for online
prediction
Chapter 6
FIGURE 6.1 Creating a user-managed Vertex AI Workbench
notebook
FIGURE 6.2 Managed Vertex AI Workbench notebook
FIGURE 6.3 Permissions for a managed Vertex AI
Workbench notebook
FIGURE 6.4 Creating a private endpoint in the Vertex AI
console
FIGURE 6.5 Architecture for de-identification of PII on large
datasets using...
Chapter 7
FIGURE 7.1 Asynchronous data parallelism
FIGURE 7.2 Model parallelism
FIGURE 7.3 Training strategy with TensorFlow
FIGURE 7.4 Artificial or feedforward neural network
FIGURE 7.5 Deep neural network
Chapter 8
FIGURE 8.1 Google Cloud data and analytics overview
FIGURE 8.2 Cloud Dataflow source and sink
FIGURE 8.3 Summary of processing tools on GCP
FIGURE 8.4 Creating a managed notebook
FIGURE 8.5 Opening the managed notebook
FIGURE 8.6 Exploring frameworks available in a managed
notebook
FIGURE 8.7 Data integration with Google Cloud Storage
within a managed noteb...
FIGURE 8.8 Data Integration with BigQuery within a
managed notebook
FIGURE 8.9 Scaling up the hardware from a managed
notebook
FIGURE 8.10 Git integration within a managed notebook
FIGURE 8.11 Scheduling or executing code in the notebook
FIGURE 8.12 Submitting the notebook for execution
FIGURE 8.13 Scheduling the notebook for execution
FIGURE 8.14 Choosing TensorFlow framework to create a
user-managed notebook...
FIGURE 8.15 Create a user-managed TensorFlow notebook
FIGURE 8.16 Exploring the network
FIGURE 8.17 Training in the Vertex AI console
FIGURE 8.18 Vertex AI training architecture for a prebuilt
container
FIGURE 8.19 Vertex AI training console for pre-built
containers. Source: Googl...
FIGURE 8.20 Vertex AI training architecture for custom
containers
FIGURE 8.21 ML model parameter and hyperparameter
FIGURE 8.22 Configure hyperparameter tuning in the training
pipeline UI. Sourc...
FIGURE 8.23 Enabling an interactive shell in the Vertex AI
console. Source: Go...
FIGURE 8.24 Web terminal to access an interactive
shell. Source: Google LLC.
Chapter 9
FIGURE 9.1 SHAP model explainability
FIGURE 9.2 Feature attribution using integrated gradients
for cat image
Chapter 10
FIGURE 10.1 TF model serving options
FIGURE 10.2 Static reference architecture
FIGURE 10.3 Dynamic reference architecture
FIGURE 10.4 Caching architecture
FIGURE 10.5 Deploying to an endpoint
FIGURE 10.6 Sample prediction request
FIGURE 10.7 Batch prediction job in Console
Chapter 11
FIGURE 11.1 Relation between model data and ML code for
MLOps
FIGURE 11.2 End-to-end ML development workflow
FIGURE 11.3 Kubeflow architecture
FIGURE 11.4 Kubeflow components and pods
FIGURE 11.5 Vertex AI Pipelines
FIGURE 11.6 Vertex AI Pipelines condition for deployment
FIGURE 11.7 Lineage tracking with Vertex AI Pipelines
FIGURE 11.8 Lineage tracking in Vertex AI Metadata store
FIGURE 11.9 Continuous training and CI/CD
FIGURE 11.10 CI/CD with Kubeflow Pipelines
FIGURE 11.11 Kubeflow Pipelines on GCP
FIGURE 11.12 TFX pipelines, libraries, and components
Chapter 12
FIGURE 12.1 Categorical features
FIGURE 12.2 Numerical values
FIGURE 12.3 Vertex Metadata data model
FIGURE 12.4 Vertex AI Pipelines showing lineage
Chapter 13
FIGURE 13.1 Steps in MLOps level 0
FIGURE 13.2 MLOps Level 1 or strategic phase
FIGURE 13.3 MLOps level 2, the transformational phase
Chapter 14
FIGURE 14.1 Running a SQL query in the web console
FIGURE 14.2 Running the same SQL query through a
Jupyter Notebook on Vertex ...
FIGURE 14.3 SQL options for DNN_CLASSIFIER and
DNN_REGRESSOR
Mona Mona
Pratap Ramamurthy
Copyright © 2024 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada and the United Kingdom.
ISBNs: 9781119944461 (paperback), 9781119981848 (ePDF), 9781119981565 (ePub)
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means, electronic, mechanical, photocopying, recording, scanning, or
otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright
Act, without either the prior written permission of the Publisher, or authorization through
payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222
Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at
www.copyright.com. Requests to the Publisher for permission should be addressed to the
Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030,
(201) 748-6011, fax (201) 748-6008, or online at www.wiley.com/go/permission.
Trademarks: WILEY and the Wiley logo are trademarks or registered trademarks of John
Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not
be used without written permission. Google Cloud is a trademark of Google, Inc. All other
trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not
associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used
their best efforts in preparing this book, they make no representations or warranties with
respect to the accuracy or completeness of the contents of this book and specifically disclaim
any implied warranties of merchantability or fitness for a particular purpose. No warranty
may be created or extended by sales representatives or written sales materials. The advice and
strategies contained herein may not be suitable for your situation. You should consult with a
professional where appropriate. Further, readers should be aware that websites listed in this
work may have changed or disappeared between when this work was written and when it is
read. Neither the publisher nor authors shall be liable for any loss of profit or any other
commercial damages, including but not limited to special, incidental, consequential, or other
damages.
For general information on our other products and services or for technical support, please
contact our Customer Care Department within the United States at (800) 762-2974, outside
the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print may not be available in electronic formats. For more information about Wiley products,
visit our website at www.wiley.com.
Library of Congress Control Number: 2023931675
Cover image: © Getty Images Inc./Jeremy Woodhouse
Cover design: Wiley
To my late father, grandparents, mom, and husband (Pratyush
Ranjan), mentor (Mark Smith), and friends. Also to anyone trying to
study for this exam. Hope this book helps you pass the exam with
flying colors!
—Mona Mona
Chapter Features
Each chapter begins with a list of the objectives that are covered in the
chapter. The book doesn't cover the objectives in order. Thus, you
shouldn't be alarmed at some of the odd ordering of the objectives
within the book.
At the end of each chapter, you'll find several elements you can use to
prepare for the exam.
Exam Essentials This section summarizes important
information that was covered in the chapter. You should be able
to perform each of the tasks or convey the information requested.
Review Questions Each chapter concludes with 8+ review
questions. You should answer these questions and check your
answers against the ones provided after the questions. If you can't
answer at least 80 percent of these questions correctly, go back
and review the chapter, or at least those sections that seem to be
giving you difficulty.
The review questions, assessment test, and other
testing elements included in this book are not derived from the
PMLE exam questions, so don't memorize the answers to these
questions and assume that doing so will enable you to pass the
exam. You should learn the underlying topic, as described in the
text of the book. This will let you answer the questions provided
with this book and pass the exam. Learning the underlying topic is
also the approach that will serve you best in the workplace.
To get the most out of this book, you should read each chapter from
start to finish and then check your memory and understanding with
the chapter-end elements. Even if you're already familiar with a topic,
you should skim the chapter; machine learning is complex enough that
there are often multiple ways to accomplish a task, so you may learn
something even if you're already competent in an area.
Assessment Test
1. How would you split the data to predict a user lifetime value
(LTV) over the next 30 days in an online recommendation system
to avoid data and label leakage? (Choose three.)
A. Perform data collection for 30 days.
B. Create a training set for data from day 1 to day 29.
C. Create a validation set for data for day 30.
D. Create random data split into training, validation, and test
sets.
2. You have a highly imbalanced dataset and you want to focus on
the positive class in the classification problem. Which metrics
would you choose?
A. Area under the precision-recall curve (AUC-PR)
B. Area under the ROC curve (AUC-ROC)
C. Recall
D. Precision
3. A feature cross is created by ________________ two or more
features.
A. Swapping
B. Multiplying
C. Adding
D. Dividing
4. You can use Cloud Pub/Sub to stream data in GCP and use Cloud
Dataflow to transform the data.
A. True
B. False
5. You have training data, and you are writing the model training
code. You have a team of data engineers who prefer to code in
SQL. Which service would you recommend?
A. BigQuery ML
B. Vertex AI custom training
C. Vertex AI AutoML
D. Vertex AI pretrained APIs
6. What are the benefits of using a Vertex AI managed dataset?
(Choose three.)
A. Integrated data labeling for unlabeled, unstructured data
such as video, text, and images using Vertex data labeling.
B. Track lineage to models for governance and iterative
development.
C. Automatically splitting data into training, test, and validation
sets.
D. Manual splitting of data into training, test, and validation
sets.
7. Masking, encrypting, and bucketing are de-identification
techniques to obscure PII data using the Cloud Data Loss
Prevention API.
A. True
B. False
8. Which strategy would you choose to handle the sensitive data that
exists within images, videos, audio, and unstructured free-form
data?
A. Use NLP API, Cloud Speech API, Vision AI, and Video
Intelligence AI to identify sensitive data such as email and
location out of the box, and then redact or remove it.
B. Use Cloud DLP to address this type of data.
C. Use Healthcare API to hide sensitive data.
D. Create a view that doesn't provide access to the columns in
question. The data engineers cannot view the data, but at the
same time the data is live and doesn't require human
intervention to de-identify it for continuous training.
9. You would use __________________ when you are trying to
reduce features while trying to solve an overfitting problem with
large models.
A. L1 regularization
B. L2 regularization
C. Both A and B
D. Vanishing gradient
10. If the weights in a network are very large, then the gradients for
the lower layers involve products of many large terms leading to
exploding gradients that get too large to converge. What are some
of the ways this can be avoided? (Choose two.)
A. Batch normalization
B. Lower learning rate
C. The ReLU activation function
D. Sigmoid activation function
11. You have a Spark and Hadoop environment on premises, and you
are planning to move your data to Google Cloud. Your ingestion
pipeline is both real time and batch. Your ML customer engineer
recommended a scalable way to move your data using Cloud
Dataproc to BigQuery. Which of the following Dataproc
connectors would you not recommend?
A. Pub/Sub Lite Spark connector
B. BigQuery Spark connector
C. BigQuery connector
D. Cloud Storage connector
12. You have moved your Spark and Hadoop environment and your
data is in Google Cloud Storage. Your ingestion pipeline is both
real time and batch. Your ML customer engineer recommended a
scalable way to run Apache Hadoop or Apache Spark jobs directly
on data in Google Cloud Storage. Which of the following Dataproc
connector would you recommend?
A. Pub/Sub Lite Spark connector
B. BigQuery Spark connector
C. BigQuery connector
D. Cloud Storage connector
13. Which of the following is not a technique to speed up
hyperparameter optimization?
A. Parallelize the problem across multiple machines by using
distributed training with hyperparameter optimization.
B. Avoid redundant computations by precomputing or caching
the results of computations that can be reused for subsequent
model fits.
C. Use grid search rather than random search.
D. If you have a large dataset, use a simple validation set instead
of cross validation.
14. Vertex AI Vizier is an independent service for optimizing complex
models with many parameters. It can be used only for non-ML use
cases.
A. True
B. False
15. Which of the following is not a tool to track metrics when training
a neural network?
A. Vertex AI interactive shell
B. What-If Tool
C. Vertex AI TensorBoard Profiler
D. Vertex AI hyperparameter tuning
16. You are a data scientist working to select features with structured
datasets. Which of the following techniques will help?
A. Sampled Shapley
B. Integrated gradient
C. XRAI (eXplanation with Ranked Area Integrals)
D. Gradient descent
17. Variable selection and avoiding target leakage are the benefits of
feature importance.
A. True
B. False
18. A TensorFlow SavedModel is what you get when you call
__________________. Saved models are stored as a directory
on disk. The file within that directory, saved_model.pb, is a
protocol buffer describing the functional tf.Graph.
A. tf.saved_model.save()
B. tf.Variables
C. tf.predict()
D. tf.keras.models.load_model
19. What steps would you recommend a data engineer trying to
deploy a TensorFlow model trained locally to set up real time
prediction using Vertex AI? (Choose three.)
A. Import the model to Model Registry.
B. Deploy the model.
C. Create an endpoint for the deployed model.
D. Create a model in Model Registry.
20. You are an MLOps engineer and you deployed a Kubeflow
pipeline on Vertex AI pipelines. Which Google Cloud feature will
help you track lineage with your Vertex AI pipelines?
A. Vertex AI Model Registry
B. Vertex AI Artifact Registry
C. Vertex AI ML metadata
D. Vertex AI Model Monitoring
21. What is not a recommended way to invoke a Kubeflow pipeline?
A. Using Cloud Scheduler
B. Responding to an event, using Pub/Sub and Cloud Functions
C. Cloud Composer and Cloud Build
D. Directly using BigQuery
22. You are a software engineer working at a start up that works on
organizing personal photos and pet photos. You have been asked
to use machine learning to identify and tag which photos have
pets and also identify public landmarks in the photos. These
features are not available today and you have a week to create a
solution for this. What is the best approach?
A. Find the best cat/dog dataset and train a custom model on
Vertex AI using the latest algorithm available. Do the same
for identifying landmarks.
B. Find a pretrained cat/dog dataset (available) and train a
custom model on Vertex AI using the latest deep neural
network TensorFlow algorithm.
C. Use the cat/dog dataset to train a Vertex AI AutoML image
classification model on Vertex AI. Do the same for identifying
landmarks.
D. Vision AI already identifies pets and landmarks. Use that to
see if it meets the requirements. If not, use the Vertex AI
AutoML model.
23. You are building a product that will accurately throw a ball into
the basketball net. This should work no matter where it is placed
on the court. You have created a very large TensorFlow model
(size more than 90 GB) based on thousands of hours of video. The
model uses custom operations, and it has optimized the training
loop to not have any I/O operations. What are your hardware
options to train this model?
A. Use a TPU slice because the model is very large and has been
optimized to not have any I/O operations.
B. Use a TPU pod because the model size is larger than 50 GB.
C. Use a GPU only instance.
D. Use a CPU only instance to build your model.
24. You work in the fishing industry and have been asked to use
machine learning to predict the age of lobster based on size and
color. You have thousands of images of lobster from Arctic fishing
boats, from which you have extracted the size of the lobster that is
passed to the model, and you have built a regression model for
predicting age. Your model has performed very well in your test
and validation data. Users want to use this model from their
boats. What are your next steps? (Choose three.)
A. Deploy the model on Vertex AI, expose a REST endpoint.
B. Enable monitoring on the endpoint and see if there is any
training-serving skew and drift detection. The original dataset
was only from Arctic boats.
C. Also port this model to BigQuery for batch prediction.
D. Enable Vertex AI logging and analyze the data in BigQuery.
25. You have built a custom model and deployed it in Vertex AI. You
are not sure if the predictions are being served fast enough (low
latency). You want to measure this by enabling Vertex AI logging.
Which type of logging will give you information like the timestamp
and latency for each request?
A. Container logging
B. Timestamp logging
C. Access logging
D. Request response logging
26. You are part of a growing ML team in your company that has
started to use machine learning to improve your business. You
were initially building models using Vertex AI AutoML and
providing the trained models to the deployment teams. How
should you scale this?
A. Create a Python script to train multiple models using Vertex
AI.
B. You are now in level 0, and your organization needs level 1
MLOps maturity. Automate the training using Vertex AI
Pipelines.
C. You are in the growth phase of the organization, so it is
important to grow the team to leverage more ML engineers.
D. Move to Vertex AI custom models to match the MLOps
maturity level.
27. What is not a reason to use Vertex AI Feature Store?
A. It is a managed service.
B. It extracts features from images and videos and stores them.
C. All data is a time series, so you can track when the feature
values change over time.
D. The features created by the feature engineering teams are
available during training time but not during serving time. So
this helps in bridging that.
28. You are a data analyst in an organization that has thousands of
insurance agents, and you have been asked to predict the revenue
by each agent for the next quarter. You have the historical data for
the last 10 years. You are familiar with all AI services on Google
Cloud. What is the most efficient way to do this?
A. Build a Vertex AI AutoML forecast, deploy the model, and
make predictions using REST API.
B. Build a Vertex AI AutoML forecast model, import the model
into BigQuery, and make predictions using BigQuery ML.
C. Build a BigQuery ML ARIMA+ model using data in BigQuery,
and make predictions in BigQuery.
D. Build a BigQuery ML forecast model, export the model to
Vertex AI, and run a batch prediction in Vertex AI.
29. You are an expert in Vertex AI Pipelines, Vertex AI training, and
Vertex AI deployment and monitoring. A data analyst team has
built a highly accurate model, and this has been brought to you.
Your manager wants you to make predictions using the model and
use those predictions. What do you do?
A. Retrain the model on Vertex AI with the same data and
deploy the model on Vertex AI as part of your CD.
B. Run predictions on BigQuery ML and export the predictions
into GCS and then load into your pipeline.
C. Export the model from BigQuery into the Vertex AI model
repository and run predictions in Vertex AI.
D. Download the BigQuery model, and package into a Vertex AI
custom container and deploy it in Vertex AI.
30. Which of the following statements about Vertex AI and BigQuery
ML is incorrect?
A. BigQuery ML supports both unsupervised and supervised
models.
B. BigQuery ML is very portable. Vertex AI supports all models
trained on BigQuery ML.
C. Vertex AI model monitoring and logs data is stored in
BigQuery tables.
D. BigQuery ML also has algorithms to predict
recommendations for users.
On the exam, you will be given the details of a use case and
will be expected to understand the nature of the problem and find
the appropriate machine learning approach to solve it. To
accomplish that, you need to have wide knowledge of the landscape
of these machine learning approaches.
Forecasting is another type where the input is time series data and the
model predicts the future values. In a time series dataset (Table 1.3),
you get a series of input values that are indexed in time order. For
example, you have a series of temperature measurements taken every
hour for 10 hours from a sensor. In this case, one temperature reading
is related to the previous and next reading because they are from the
same sensor, in subsequent hours, and usually only vary to a small
extent by the hour, so they are not considered to be “independent” (an
important distinction from other types of structured data).
TABLE 1.3 Time Series Data
         Temperature
Series 1 29, 30, 40, 39, 23, 20
Series 2 10, 11, 13, 23, 43, 34
Series 3 19, 18, 19, 20, 38, 20
Series 4 14, 17, 34, 34, 12, 43
Some forecasting problems can be converted to regression problems
by modifying the time series data into independent and identically
distributed (IID) values. This is done for convenience, because of data
availability, or because a certain type of ML model is preferred. In
other cases, regression problems can be converted into classification
problems by bucketizing the values. We will look into the details in the
following chapters. There is an art to fitting an ML model to a use case.
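To make the conversion concrete, the following is a minimal sketch of
the common sliding-window approach: each row uses the previous few
readings as features and the current reading as the regression target.
The window size of 3 and the column names are illustrative
assumptions, not anything prescribed by the exam.

import pandas as pd

# Hypothetical hourly temperature readings (values from Series 1 in Table 1.3).
temps = pd.Series([29, 30, 40, 39, 23, 20], name="temp")

# Sliding window: the previous 3 readings become features (t-3, t-2, t-1)
# and the current reading becomes the regression target.
window = 3
frame = pd.concat([temps.shift(i) for i in range(window, 0, -1)] + [temps], axis=1)
frame.columns = [f"t-{i}" for i in range(window, 0, -1)] + ["target"]
frame = frame.dropna()  # the first `window` rows lack a full history
print(frame)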
Clustering is another type of problem, where the algorithm creates
groups in the data based on inherent similarities and differences
among the different data points. For example, if we are given the
latitude and longitude of every house on Earth, the algorithm might
group each of these data points into clusters of cities based on the
distances between groups of houses. K-means is a popular algorithm for
this type of problem.
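As a hedged illustration of the house-clustering example above, here is
a minimal scikit-learn sketch; the coordinates, cluster count, and
random seed are all made-up assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic (latitude, longitude) pairs around two hypothetical city centers.
rng = np.random.default_rng(0)
city_a = rng.normal(loc=[40.7, -74.0], scale=0.05, size=(100, 2))
city_b = rng.normal(loc=[34.0, -118.2], scale=0.05, size=(100, 2))
houses = np.vstack([city_a, city_b])

# K-means groups the points purely by distance; no labels are involved.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(houses)
print(kmeans.cluster_centers_)  # approximately the two city centers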
ML Success Metrics
A business problem can be solved using many different machine
learning algorithms, so which one to choose? An ML metric (or a suite
of metrics) is used to determine if the trained model is accurate
enough. After you train the model (supervised learning), you will
predict values (ŷ) for, say, N data points for which you know the
actual values (y). We will use a formula to calculate the metric from
these N predictions.
There are several metrics, each with different properties. So what is our
metric? What is the formula for calculating the metric? Does the
metric align with the business success criteria? To answer these
questions, let us look at each class of problems, starting with
classification.
Say you are trying to detect a rare fatal disease from an X-ray. This is a
binary classification problem with two possible outcomes:
positive/negative. You are given a set of a million labeled X-ray images
with only 1 percent of the cases having the disease, a positive data point.
In this case, a wrong negative (false negative), where we predict that
the patient does not have the disease when they actually do have it,
might cause the patient to not take timely action and cause harm due
to inaction. But a wrong positive prediction (false positive), where we
predict that the patient has the disease when in fact they do not, might
cause undue concern for the patient. This will result in further medical
tests to confirm the prediction. In this case, accuracy (the percentage
of correct predictions) is not the right metric.
Let us now consider an example with prediction numbers for a binary
classification for an unbalanced dataset, shown in Table 1.4.
TABLE 1.4 Confusion matrix for a binary classification example

                  Predicted
Actual            Positive prediction    Negative prediction
Positive class    5                      2
Negative class    3                      990
There are two possible prediction classes, positive and negative.
Usually the smaller class (almost always the more important class) is
represented as the positive class. In Table 1.4, we have a total of 1,000
data points and have predictions for each. We have tabulated the
predictions against the actual values. Out of 1,000 data points, there
are 7 belonging to the positive class and 993 belonging to the negative
class. The model has predicted 8 to be in the positive class and 992 in
the negative class. The bottom right represents true negatives (990
correctly predicted negatives) and the top left represents true positives
(5 correctly predicted positives). The bottom left represents false
positives (3 incorrectly predicted as positive) and the top right
represents false negatives (2 incorrectly predicted as negative). Now,
using the numbers in this confusion matrix, we can calculate various
metrics based on our needs.
If this model is to detect cancer, we do not want to miss detecting the
disease; in other words, we want a low false negative rate. In this case,
recall is a good metric.
In our case, recall = 5/(5 + 2) = 0.714. If false negatives are higher, the
recall metric will be lower because false negatives are in the
denominator. Recall can range from 0 to 1, and a higher score is
better. Intuitively, recall is the measure of what percentage of the
positive data points the model was able to predict correctly.
On the other hand, if this is a different use case and you are trying to
reduce false positives, then you can use the precision metric.
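To make the formulas explicit, here is a small sketch that recomputes
the metrics from the counts in Table 1.4 using plain arithmetic; only
the counts come from the chapter, the rest is illustrative.

# Counts from Table 1.4 (positive = disease present).
tp, fn = 5, 2    # actual positives: correctly / incorrectly predicted
fp, tn = 3, 990  # actual negatives: incorrectly / correctly predicted

recall = tp / (tp + fn)     # share of actual positives the model caught
precision = tp / (tp + fp)  # share of positive predictions that were right
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"recall={recall:.3f}, precision={precision:.3f}, accuracy={accuracy:.3f}")
# recall=0.714, precision=0.625, accuracy=0.995
# Note how misleading the high accuracy is on this imbalanced dataset.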
Regression
Regression predicts a numerical value. The metric should try to show
the quantitative difference between the actual value and the predicted
value.
MAE The mean absolute error (MAE) is the average absolute
difference between the actual values and the predicted values.
RMSE The root mean squared error (RMSE) is the square root of
the average squared difference between the target and predicted
values. If you are worried that your model might incorrectly
predict a very large value and want to penalize large errors, you
can use this metric. It ranges from 0 to infinity.
RMSLE The root mean squared logarithmic error (RMSLE)
metric is similar to RMSE, except that it uses the natural
logarithm of the predicted and actual values plus 1. This is an
asymmetric metric, which penalizes under-prediction (the
predicted value is lower than the actual) more heavily than
over-prediction.
MAPE Mean absolute percentage error (MAPE) is the average
absolute percentage difference between the labels and the
predicted values. You would choose MAPE when you care about
proportional difference between actual and predicted value.
R2 R-squared (R2) is the square of the Pearson correlation
coefficient (r) between the labels and predicted values. This
metric ranges from zero to one, and generally a higher value
indicates a better fit for the model.
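The regression metrics above are all one-liners in practice. The
following sketch computes them with NumPy and scikit-learn on
made-up actual/predicted arrays; note that scikit-learn's r2_score is
the coefficient of determination, which for these purposes plays the
role of R2 described above.

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error, r2_score)

y = np.array([3.0, 5.0, 2.5, 7.0])      # illustrative actual values
y_hat = np.array([2.5, 5.0, 4.0, 8.0])  # illustrative predictions

mae = mean_absolute_error(y, y_hat)
rmse = np.sqrt(mean_squared_error(y, y_hat))
rmsle = np.sqrt(mean_squared_log_error(y, y_hat))  # uses log(1 + value)
mape = np.mean(np.abs((y - y_hat) / y)) * 100      # percentage error
r2 = r2_score(y, y_hat)
print(mae, rmse, rmsle, mape, r2)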
Responsible AI Practices
AI and machine learning are powerful new tools, and with power
comes responsibility. You should consider fairness, interpretability,
privacy, and security in your ML solution. You can borrow from best
practices in software engineering in tandem with considerations
unique to machine learning.
General Best Practices Always have the end user in mind as
well as their user experience. How does your solution change
someone's life? Solicit feedback early in the design process.
Engage and test with a diverse set of users you would expect to
use your solution. This will build a rich variety of perspectives and
will allow you to adjust early in the design phase.
Fairness Fairness is very important because machine learning
models can reflect and reinforce unfair biases. Fairness is also
difficult in practice because there are several definitions of
fairness from different perspectives (academic, legal, cultural,
etc.). Also, it is not possible to apply the same “fairness” to all
situations as it is very contextual. To start with, you can use
statistical methods to measure bias in datasets and to test ML
models for bias in the evaluation phase.
Interpretability Some popular state-of-the-art machine learning
models, like neural networks, are too complex for human beings to
comprehend, so they are treated as black boxes. This lack of
visibility creates doubt and could hide biases.
Interpretability is the science of gaining insights into models and
predictions. Some models are inherently more interpretable (like
linear regression, decision trees) and others are less interpretable
(deep learning models). One way to improve interpretability is to
use model explanations. Model explanations quantify the
contributions of each input feature toward making a prediction.
However, not all algorithms support model explanations. In some
domains, model explanations are mandated, so your choice of
algorithms is restricted.
Privacy The only connection between the training data and
prediction is the ML model. While the model only provides
predictions from input values, there are some cases where it can
reveal some details about the training data. This becomes a
serious issue if you trained with sensitive data like medical
history, for example. Although the science of detecting and
preventing data leakage is still an area of active research,
fortunately there are now techniques to minimize leakage in a
precise and principled fashion.
Security Cybersecurity threats are very much applicable to
machine learning. In addition to the usual threats to any digital
application, there are some unique security challenges to machine
learning applications. These threats are ever-present, from the
data collection phase (data poisoning) and training phase (leakage
of training data) to the deployment phase (model stealing). It is
important to identify potential threats to the system, keep
learning to stay ahead of the curve, and develop approaches to
combat these threats.
You can read more at https://ai.google/responsibilities.
Summary
In this chapter, you learned how to take a business use case,
understand the different dimensions of the ask, and frame a machine
learning problem statement as a first step.
Exam Essentials
Translate business challenges to machine learning.
Understand the business use case that wants to solve a problem
using machine learning. Understand the type of problem, the data
availability, expected outcomes, stakeholders, budget, and
timelines.
Understand the problem types. Understand regression,
classification, and forecasting. Be able to tell the difference in data
types and popular algorithms for each problem type.
Know how to use ML metrics. Understand what a metric is,
and match the metric with the use case. Know the different
metrics for each problem type, like precision, recall, F1,
AUC-ROC, RMSE, and MAPE.
Understand Google's Responsible AI principles.
Understand the recommended practices for AI in the context of
fairness, interpretability, privacy, and security.
Review Questions
1. When analyzing a potential use case, what are the first things you
should look for? (Choose three.)
A. Impact
B. Success criteria
C. Algorithm
D. Budget and time frames
2. When you try to find the best ML problem for a business use case,
which of these aspects is not considered?
A. Model algorithm
B. Hyperparameters
C. Metric
D. Data availability
3. Your company wants to predict the amount of rainfall for the next
7 days using machine learning. What kind of ML problem is this?
A. Classification
B. Forecasting
C. Clustering
D. Reinforcement learning
4. You work for a large company that gets thousands of support
tickets daily. Your manager wants you to create a machine
learning model to detect if a support ticket is valid or not. What
type of model would you choose?
A. Linear regression
B. Binary classification
C. Topic modeling
D. Multiclass classification
5. You are building an advanced camera product for sports, and you
want to track the ball. What kind of problem is this?
A. Not possible with current state of the art algorithms
B. Image detection
C. Video object tracking
D. Scene detection
6. Your company has millions of academic papers from several
research teams. You want to organize them in some way, but there
is no company policy on how to classify the documents. You are
looking for any way to cluster the documents and gain any insight
into popular trends. What can you do?
A. Not much. The problem is not well defined.
B. Use a simple regression problem.
C. Use binary classification.
D. Use topic modeling.
7. What metric would you never choose for linear regression?
A. RMSE
B. MAPE
C. Precision
D. MAE
8. You are building a machine learning model to predict house
prices. You want to make sure the prediction does not have
extreme errors. What metric would you choose?
A. RMSE
B. RMSLE
C. MAE
D. MAPE
9. You are building a plant classification model to predict variety1
and variety2, which are found in equal numbers in the field. What
metric would you choose?
A. Accuracy
B. RMSE
C. MAPE
D. R2
10. You work for a large car manufacturer and are asked to detect
hidden cracks in engines using X-ray images. However, missing a
crack could mean the engine could fail at some random time while
someone is driving the car. Cracks are relatively rare and happen
in about 1 in 100 engines. A special camera takes an X-ray image
of the engine as it comes through the assembly line. You are going
to build a machine learning model to classify if an engine has a
crack or not. If a crack is detected, the engine would go through
further testing to verify. What metric would you choose for your
classification model?
A. Accuracy
B. Precision
C. Recall
D. RMSE
11. You are asked to build a classification model and are given a
training dataset but the data is not labeled. You are asked to
identify ways of using machine learning with this data. What type
of learning will you use?
A. Supervised learning
B. Unsupervised learning
C. Semi-supervised learning
D. Reinforcement learning
12. You work at a company that hosts millions of videos and you have
thousands of users. The website has a Like button for users to
click, and some videos get thousands of “likes.” You are asked to
create a machine learning model to recommend videos to users
based on all the data collected to increase the amount of time
users spend on your website. What would be your ML approach?
A. Supervised learning to predict based on the popularity of
videos
B. Deep learning model based on the amount of time users
watch the videos
C. Collaborative filtering method based on explicit feedback
D. Semi-supervised learning because you have some data about
some videos
13. You work for the web department of a large hardware store chain.
You have built a visual search engine for the website. You want to
build a model to classify whether an image contains a product.
There are new products being introduced on a weekly basis to
your product catalog and these new products must be
incorporated into the visual search engine. Which of the following
options is a bad idea?
A. Create a pipeline to automate the step: take the dataset, train
a model.
B. Create a golden dataset and do not change the dataset for at
least a year because creating a dataset is time-consuming.
C. Extend the dataset to include new products frequently and
retrain the model.
D. Add evaluation of the model as part of the pipeline.
14. Which of the following options is not a type of machine learning
approach?
A. Supervised learning
B. Unsupervised learning
C. Semi-supervised learning
D. Hyper-supervised learning
15. Your manager is discussing a machine learning approach and is
asking you about feeding the output of one model to another
model. Select two statements that are true about this kind of
approach.
A. There are many ML pipelines where the output of one model
is fed into another.
B. This is a poor design and never done in practice.
C. Never feed the output of one model into another model. It
may amplify errors.
D. There are several design patterns where the output of one
model (like encoder or transformer) is passed into a second
model and so on.
16. You are building a model that is going to predict creditworthiness
and will be used to approve loans. You have created a model and
it is performing extremely well and has high impact. What next?
A. Deploy the model.
B. Deploy the model and integrate it with the system.
C. Hand it over to the software integration team.
D. Test your model and data for biases (gender, race, etc.).
17. You built a model to predict creditworthiness, and your training
data was checked for biases. Your manager still wants to know the
reason for each prediction and what the model does. What do you
do?
A. Get more testing data.
B. The ML model is a black box. You cannot satisfy this
requirement.
C. Use model interpretability/explanations.
D. Remove all fields that may cause bias (race, gender, etc.).
18. Your company is building an Android app to add funny
moustaches on photos. You built a deep learning model to detect
the location of a face in a photo, and your model had very high
accuracy based on a public photo dataset that you found online.
When integrated into an Android phone app, it got negative
feedback on accuracy. What could be the reason?
A. The model was not deployed properly.
B. Android phones could not handle a deep learning model.
C. Your dataset was not representative of all users.
D. The metric was wrong.
19. You built a deep learning model to predict cancer based on
thousands of personal records and scans. The data was used in
training and testing. The model is secured behind a firewall, and
all cybersecurity precautions have been taken. Are there any
privacy concerns? (Choose two.)
A. No. There are no privacy concerns. This does not contain
photographs, only scans.
B. Yes. This is sensitive data being used.
C. No. Although sensitive data is used, it is only for training and
testing.
D. The model could reveal some detail about the training data.
There is a risk.
20. You work for an online shoe store and the company wants to
increase revenue. You have a large dataset that includes the
browsing history of thousands of customers, and also their
shopping cart history. You have been asked to create a
recommendation model. Which of the following is not a valid next
step?
A. Use your ML model to recommend products at checkout.
B. Creatively use all the data to get maximum value because
there is no privacy concern.
C. Periodically retrain the model to adjust for performance and
also to include new products.
D. In addition to the user history, you can use the data about
product (description, images) in training your model.
Chapter 2
Exploring Data and Building Data Pipelines
Visualization
Data visualization is an exploratory data analysis technique for finding
trends and outliers in the data. Data visualization helps in the data
cleaning process because you can find out whether your data is
imbalanced by plotting it on a chart. It also helps in the feature
engineering process because, by visualizing a feature, you can see how
it will influence your model and decide whether to keep or discard it.
There are two ways to visualize data:
Univariate Analysis In this analysis, each of the features is
analyzed independently, such as the range of the feature and
whether outliers exist in the data. The most common visuals used
for this are box plots and distribution plots.
Bivariate Analysis In this analysis, we compare the data
between two features. This analysis can be helpful in finding
correlation between features. Some of the ways you can perform
this analysis are by using line plots, bar plots, and scatterplots.
Box Plot
A box plot helps visualize how observations divide into defined
intervals known as quartiles and how each compares to the entire set
of observations. It represents the data as the 25th, 50th, and 75th
percentiles. It consists of the body, or interquartile range, where most
observations fall. Whiskers, or straight lines, represent the
maximum and minimum. Points that lie outside the whiskers are
considered outliers.
Figure 2.1 shows a box plot.
Line Plot
A line plot plots the relationship between two variables and is used to
analyze the trends for data changes over time.
Figure 2.2 shows a line plot.
Bar Plot
A bar plot is used for analyzing trends in data and comparing
categorical data such as sales figures every week, the number of
visitors to a website, or revenue from a product every month.
Figure 2.3 shows a bar plot.
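A minimal matplotlib sketch of the three plot types discussed above;
the data is randomly generated and purely illustrative.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.normal(50, 10, 200)  # illustrative univariate sample

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
ax1.boxplot(sample)  # quartiles, whiskers, and outlier points
ax1.set_title("Box plot")
ax2.plot(range(1, 13), rng.integers(80, 160, 12))  # e.g., monthly visitors
ax2.set_title("Line plot")
ax3.bar(["Jan", "Feb", "Mar"], [120, 90, 150])  # categorical comparison
ax3.set_title("Bar plot")
plt.tight_layout()
plt.show()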
Statistics Fundamentals
In statistics, we have three measures of central tendency: mean,
median, and mode. They help us describe the data and can be used to
clean data statistically.
Mean
The mean is an accurate measure to describe the data when no
outliers are present.
Median
The median is used if there is an outlier in the dataset. You can find the
median by arranging data values from the lowest to the highest value.
If there is an even number of values, the median is the average of the
two middle values, and if there is an odd number, the median is the
middle value. For example, in the dataset 1, 1, 2, 4, 6, 6, 9, the median
is 4. For the dataset 1, 1, 4, 6, 6, 9, the median is 5: take the mean of 4
and 6, or (4+6) / 2 = 5.
Mode
The mode is used if there is an outlier and the majority of the data is
the same. The mode is the value or values that occur most often in the
dataset.
For example, for the dataset 1, 1, 2, 5, 5, 5, 9, the mode is 5.
Outlier Detection
The mean is the measure of central tendency that is most affected by
outliers, which in turn impacts the standard deviation.
For example, consider the following small dataset:
[15, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9, 210]
By looking at it, one can quickly say 210 is an outlier that is much
larger than the other values.
As you can see from Table 2.1, adding the outlier changes the mean
significantly, while the median and mode barely move. Variance is
the average of the squared differences from the mean.
TABLE 2.1 Mean, median, and mode for outlier detection
With Outlier    Without Outlier
Mean: 29.17     Mean: 12.73
Median: 14      Median: 13
Mode: 15        Mode: 15
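You can verify Table 2.1 with Python's built-in statistics module; the
dataset is the one introduced above, and the printout is a sketch
rather than exam material.

import statistics

data = [15, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]
with_outlier = data + [210]

for name, d in [("without outlier", data), ("with outlier", with_outlier)]:
    print(name, round(statistics.mean(d), 2),
          statistics.median(d), statistics.mode(d))
# without outlier: mean 12.73, median 13, mode 15
# with outlier:    mean 29.17, median 14, mode 15  (only the mean moves sharply)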
Standard Deviation
Standard deviation is the square root of the variance. Standard
deviation is an excellent way to identify outliers. Data points that lie
more than a few standard deviations (commonly three) from the mean
can be considered unusual.
Correlation
Correlation is simply a normalized form of covariance. The value of the
correlation coefficient ranges from –1 to +1. The correlation coefficient
is also known as Pearson's correlation coefficient.
Positive Correlation When we increase the value of one
variable, the value of the other variable increases correspondingly;
this is called positive correlation.
Negative Correlation When we increase the value of one
variable, the value of the other variable decreases correspondingly;
this is called negative correlation.
Zero Correlation When the change in the value of one variable
does not impact the other substantially, then it is called zero
correlation.
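A quick way to see correlation numerically is NumPy's corrcoef; the
arrays below are illustrative.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_pos = 2 * x + 1    # moves with x
y_neg = -3 * x + 10  # moves against x

print(np.corrcoef(x, y_pos)[0, 1])  #  1.0 (positive correlation)
print(np.corrcoef(x, y_neg)[0, 1])  # -1.0 (negative correlation)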
Data Skew
Data skew means the distribution curve is not symmetric. It indicates
that there are outliers in the data or that the data distribution is
uneven.
The skewness of a normal distribution is 0.
The data can be right-skewed or left-skewed (see Figure 2.4). You can
analyze skew using statistical measures such as the mean, median,
and standard deviation of the dataset.
Scaling
Scaling means converting floating-point feature values from their
natural range into a standard range—for example, from 1,000–5,000
to 0 to 1 or –1 to +1. Scaling is useful when a feature set consists of
multiple features with different ranges. It has the following benefits:
In deep neural network training, scaled features help gradient
descent converge better than unscaled features.
Scaling reduces the possibility of “NaN traps,” where a value
exceeds the floating-point limit during training and becomes NaN.
Without scaling, the model will give too much importance to
features having a wider range.
Log Scaling
Log scaling is used when the data follows a power-law distribution,
that is, when a few samples are very large while most are small. For
example, you would use log scaling when some samples are around
100,000 while most are in the range 0–100.
Taking the log brings the values into a comparable range:
log10(100,000) = 5 and log10(100) = 2, so values spanning several
orders of magnitude are compressed into a small range.
Z‐score
This is another scaling method where the value is calculated as
standard deviations away from the mean. You would calculate the
z-score as follows when you have a few outliers:
Scaled value = (value − mean) / stddev
For example, given
Mean = 100
Standard deviation = 20
Original value = 130
the scaled value is (130 − 100) / 20 = 1.5. The z-score typically lies
between –3 and +3, so anything outside of that range can be treated
as an outlier.
Clipping
In the case of extreme outliers, you can cap all feature values above or
below to a certain fixed value. You can perform feature clipping before
or after other normalization techniques.
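The four techniques above each reduce to a line or two of NumPy. The
following sketch applies them to the small dataset used earlier in this
chapter; the clipping bound of 30 is an arbitrary illustrative choice.

import numpy as np

values = np.array([15, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9, 210], dtype=float)

# Min-max scaling to the range [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Log scaling compresses values that span orders of magnitude
log_scaled = np.log10(values)

# Z-score: distance from the mean in standard deviations
z = (values - values.mean()) / values.std()

# Clipping caps extreme values at a fixed bound (here, 30)
clipped = np.clip(values, None, 30)

print(z.round(2))  # the outlier 210 stands out with a large z-score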
Handling Outliers
An outlier is a value that is the odd one out: an observation that lies
far from the rest of the data points because it is too large or too small.
Outliers may exist in data due to human error or skew.
You can use the following visualization and statistical techniques
(some of which were discussed in previous sections) to detect
outliers:
Box plots
Z-score
Clipping
Interquartile range (IQR)
Once an outlier is detected, you can either remove it from the dataset
so that it does not affect model training, or impute or replace the
outlier with the mean, median, mode, or a boundary value.
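As a sketch of the interquartile range (IQR) method listed above, the
usual rule flags anything beyond 1.5 × IQR from the quartiles;
applied to the earlier example dataset, it isolates the 210.

import numpy as np

values = np.array([15, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9, 210])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the common 1.5*IQR fences

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # [210]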
Imbalanced Data
When two classes in a dataset are not equally represented, the result
is imbalanced data. In the example shown in Figure 2.7, fraud cases
are much rarer in the dataset than no-fraud cases.
In the case of credit card transactions, suppose, out of all transactions,
1,000 are no-fraud examples and only five are fraud transactions. This
is a classic representation of imbalanced data. In this scenario, we do
not have enough fraud transactions to train the model to classify
whether a credit card transaction is fraud. The model will spend most
of its training on no-fraud scenarios.
In random sampling, you can perform either oversampling, which
means duplicating samples from the minority class, or
undersampling, which means deleting samples from the majority
class. Both approaches introduce bias because they add or remove
samples to correct the imbalance.
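A minimal sketch of the two random-sampling fixes, using index
arrays as stand-ins for real transactions; the class sizes and the target
of 200 samples are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
fraud = np.arange(5)           # indices of the 5 minority-class samples
no_fraud = np.arange(5, 1005)  # indices of the 1,000 majority-class samples

# Oversampling: duplicate minority samples (sampling with replacement).
oversampled = rng.choice(fraud, size=200, replace=True)

# Undersampling: keep only a random subset of the majority class.
undersampled = rng.choice(no_fraud, size=200, replace=False)

print(len(oversampled), len(undersampled))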
Data Splitting
With general data cleaning, you would usually start with a random
split of the data. However, consider datasets that have naturally
clustered examples.
Say you want your model to classify the topics in the text of a book.
The topics can be horror, love story, and drama. A random split would
be a problem in that case.
Why would a random split cause a skew? It can cause a skew because
stories on the same topic tend to be written on the same timeline. If
the data is split randomly, the test set and training set might contain
the same stories, leaking information between the two.
To fix this, split the data based on the time the story was published.
For example, you can put stories published in June in the training set
and stories published in July in the test set to prevent overlap.
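A small pandas sketch of such a time-based split; the dates, cutoff,
and column names are illustrative assumptions.

import pandas as pd

stories = pd.DataFrame({
    "published": pd.to_datetime(
        ["2023-06-05", "2023-06-20", "2023-07-02", "2023-07-15"]),
    "text": ["...", "...", "...", "..."],
})

# Split on time rather than at random, so stories from the same period
# cannot straddle the training and test sets.
cutoff = pd.Timestamp("2023-07-01")
train = stories[stories["published"] < cutoff]
test = stories[stories["published"] >= cutoff]
print(len(train), len(test))  # 2 2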
Data Leakage
Data leakage happens when you expose your machine learning model
to the test data during training. As a result, your model performs great
during training and testing, but when you expose the model to unseen
data, it underperforms. Data leakage leads to overfitting in the model
as the model has already learned from the test and training data.
The following are some of the reasons for data leakage:
The target variable is the output that your model is trying to
predict, and features are the data fed into the model to predict
the target variable. One cause of data leakage is mistakenly
including the target variable as one of your features.
While splitting the data into test and training sets for model
training, you have included some of the test data in the training
data.
The presence of features that expose information about the target
variable but that will not be available after the model is
deployed. This is also called label leakage and can be detected by
checking the correlation between the target variable and the
feature.
Applying preprocessing techniques (normalizing features,
removing outliers, etc.) to the entire dataset will cause the model
to learn not only the training set but also the test set, which leads
to data leakage.
A classic example of data leakage is time series data. For example,
when dealing with time series data, if we use data from the future
when doing computations for current features or predictions, we
will very likely end up with a leaked model. This generally happens
when the data is randomly split into train and test subsets.
These are the situations where you might have data leakage:
If the model's predicted output is as good as the actual output, it
might be because of data leakage. This means the model might be
somehow memorizing the data or might have been exposed to the
actual data.
While doing the exploratory data analysis, features that are very
highly correlated with the target variable might indicate data
leakage.
Data leakage happens primarily because of the way we split our data
and when we split our data. Now, let's understand how to prevent data
leakage:
Select features that are not correlated with a given target variable
or that don't contain information about the target variable.
Split the data into test, train, and validation sets. The purpose of
the validation set is to mimic the real life scenario and it will help
identify any possible case of overfitting.
Preprocess the training and test data separately. Perform normalization on the training data rather than on the complete dataset to avoid any leakage (see the sketch after this list).
In case of time series data, have a cutoff value on time as it
prevents you from getting any information after the time of
prediction.
Cross validation is another approach to avoid data leakage when you have limited data. To avoid leaking preprocessing statistics, scale or normalize the data by computing the parameters on each fold of cross validation separately.
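The following is a minimal sketch of leak-free preprocessing with scikit-learn, assuming a generic numeric feature matrix; the scaler is fit on the training split only and then applied unchanged to the test split:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)              # hypothetical features
y = np.random.randint(0, 2, size=100)   # hypothetical labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from training data only
X_test_scaled = scaler.transform(X_test)        # reuse, don't refit, on test data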
Furthermore, the difference in production data versus training data
must be reflected in the difference between the validation data split
and the training data split and between the testing data split and the
validation data split.
For example, if you are planning on making predictions about user lifetime value (LTV) over the next 30 days, then make sure that the data in your validation data split is from 30 days after the data in your training data split and that the data in your testing data split is from 30 days after your validation data split.
Summary
In this chapter, we discussed why we need to visualize data and the
various ways to visualize data, such as using box plots, line plots, and
scatterplots. Then we covered statistical fundamentals such as mean,
median, mode, and standard deviation and why they are relevant
when finding outliers in data. Also, you learned how to check data
correlation using a line plot.
You learned about various data cleaning and normalizing techniques
such as log scaling, scaling, clipping, and using a z score to improve
the quality of data.
We also discussed establishing data constraints and why it's important
to define a data schema in an ML pipeline and the need to validate
data. We covered using TFDV for validating data at scale and why you
need TFDV to validate data schema for large scale deep learning
systems. Then we discussed the strategy used for splitting the data and
spoke about the data splitting strategy for an imbalanced dataset. We
covered splitting based on time for online systems and clustered data.
Last, we covered strategies for how to deal with missing data and data
leakage.
Exam Essentials
Be able to visualize data. Understand why we need to visualize
data and various ways to do so, such as using box plots, line plots,
and scatterplots.
Understand the fundamentals of statistical terms. Be able
to describe mean, median, mode, and standard deviation and how
they are relevant in finding outliers in data. Also know how to
check data correlation using a line plot.
Determine data quality and reliability or feasibility.
Understand why you want data without outliers and what data
skew is, and learn about various data cleaning and normalizing
techniques such as log scaling, scaling, clipping, and z score.
Establish data constraints. Understand why it's important to
define a data schema in an ML pipeline and the need to validate
data. Also, you need to understand TFDV for validating data at
scale.
Organize and optimize training data. You need to
understand how to split your dataset into training data, test data,
and validation data and how to apply the data splitting technique
when you have clustered and online data. Also understand the
sampling strategy when you have imbalanced data.
Handle missing data. Know the various ways to handle missing data, such as removing missing values; replacing missing values with the mean, median, or mode; or using ML to impute missing values.
Avoid data leaks. Know the various ways data leakage and label
leakage can happen in the data and how to avoid it.
Review Questions
1. You are the data scientist for your company. You have a dataset
that includes credit card transactions, and 1 percent of those
credit card transactions are fraudulent. Which data
transformation strategy would likely improve the performance of
your classification model?
A. Write your data in TFRecords.
B. Z normalize all the numeric features.
C. Use one hot encoding on all categorical features.
D. Oversample the fraudulent transactions.
2. You are a research scientist building a cancer prediction model
from medical records. Features of the model are patient name,
hospital name, age, vitals, and test results. This model performed
really well on held out test data but performed poorly on new
patient data. What is the reason for this?
A. Strong correlation between feature hospital name and
predicted result.
B. Random splitting of data between all the features available.
C. Missing values in the feature hospital name and age.
D. Negative correlation between the feature hospital name and
age.
3. Your team trained and tested a deep neural network model with
99 percent accuracy. Six months after model deployment, the
model is performing poorly due to change in input data
distribution. How should you address input data distribution?
A. Create alerts to monitor for skew and retrain your model.
B. Perform feature selection and retrain the model.
C. Retrain the model after hyperparameter tuning.
D. Retrain your model monthly to detect data skew.
4. You are an ML engineer who builds and manages a production
system to predict sales. Model accuracy is important as the
production model has to keep up with market changes. After a
month in production, the model did not change but the model
accuracy was reduced. What is the most likely cause of the
reduction in model accuracy?
A. Accuracy dropped due to poor quality data.
B. Lack of model retraining.
C. Incorrect data split ratio in validation, test, and training data.
D. Missing data for training.
5. You are a data scientist in a manufacturing firm. You have been
asked to investigate failure of a production line based on sensor
readings. You realize that 1 percent of the data samples are
positive examples of a faulty sensor reading. How will you resolve
the class imbalance problem?
A. Generate 10 percent positive examples using class
distribution.
B. Downsample the majority data with upweighting to create 10
percent samples.
C. Delete negative examples until positive and negative
examples are equal.
D. Use a convolutional neural network with the softmax
activation function.
6. You are the data scientist of a meteorological department asked to
build a model to predict daily temperatures. You split the data
randomly and then transform the training and test datasets.
Temperature data for model training is uploaded hourly. During
testing, your model performed with 99 percent accuracy;
however, in production, accuracy dropped to 70 percent. How can
you improve the accuracy of your model in production?
A. Split the training and test data based on time rather than a
random split to avoid leakage.
B. Normalize the data for the training and test datasets as two
separate steps.
C. Add more data to your dataset so that you have fair
distribution.
D. Transform data before splitting, and cross validate to make
sure the transformations are applied to both the training and
test sets.
7. You are working on a neural network based project. The dataset
provided to you has columns with different ranges and a lot of
missing values. While preparing the data for model training, you
discover that gradient optimization is having difficulty moving
weights. What should you do?
A. Use feature construction to combine the strongest features.
B. Use the normalization technique to transform data.
C. Improve the data cleaning step by removing features with
missing values.
D. Change the hyperparameter tuning steps to reduce the
dimension of the test set and have a larger training set.
8. You are an ML engineer working to set a model in production.
Your model performs well with training data. However, the model
performance degrades in production environment and your
model is overfitting. What can be the reason for this? (Choose
three.)
A. Applying normalizing features such as removing outliers to
the entire dataset
B. High correlation between the target variable and the feature
C. Removing features with missing values
D. Adding your target variable as your feature
Chapter 3
Feature Engineering
Some algorithms, such as decision trees, can work with categorical data directly. However, most ML algorithms cannot operate on categorical data directly. They require all input variables and output variables to be numeric, which is why categorical data must be converted to numeric data. If the categorical variable is an output variable, you also have to convert the numeric output back to categorical values during predictions.
Normalizing
We covered normalization techniques such as scaling, log scaling, z
score, and clipping in the Data Cleaning section of Chapter 2,
“Exploring Data and Building Data Pipelines.” You would perform
normalization in two cases with numeric data:
Numeric features that have distinctly different ranges (for example, age and income): In this case, gradient descent can converge slowly because of the differing ranges. Optimization techniques such as AdaGrad and Adam can help because they maintain a separate effective learning rate for each feature.
Numeric features that cover a very wide range (such as a city's population): A model trained on such a feature can generate NaN errors if the feature is not normalized. In this situation, even optimizers such as Adam and AdaGrad can't prevent NaN errors when there is a wide range of values in a single feature.
Bucketing
Bucketing means transforming numeric data into categorical data. For example, latitude is a floating point value, and it's difficult for a model to use raw latitude when predicting house price with respect to location. Two ways to do bucketing are as follows (a code sketch follows the list):
Creating buckets with equally spaced boundaries: You create a range of buckets, and some buckets might have more data points than others. For example, to represent rainfall, you can bucket by range (0–50 cm, 50–100 cm); in an area with less intense rainfall, you might have more data points in the 0–50 cm bucket.
Buckets with quantile boundaries: Each bucket has the
same number of points. The boundaries are not fixed and could
encompass a narrow or wide span of values.
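A minimal sketch of both bucketing strategies with pandas; the rainfall values are hypothetical:

import pandas as pd

rainfall_cm = pd.Series([5, 12, 31, 48, 52, 75, 99, 120])

# Equally spaced boundaries: fixed-width buckets, counts may differ.
equal_width = pd.cut(rainfall_cm, bins=[0, 50, 100, 150])

# Quantile boundaries: each bucket gets the same number of points.
quantiles = pd.qcut(rainfall_cm, q=4)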
One‐Hot Encoding
One hot encoding is the process of creating dummy variables. It is used for nominal categorical variables, that is, variables whose values have no inherent order. Let's first discuss what an ordinal relationship in categorical data is. Ordinal variables include socioeconomic status (low income, middle income, high income), education level (high school, BS, MS, PhD), income level (less than 50K, 50K–100K, over 100K), and satisfaction rating (extremely dislike, dislike, neutral, like, extremely like). For categorical variables where no such ordinal relationship exists, integer encoding is not enough.
In one hot encoding, a new binary variable is created for each unique category value, converting the integer encoding of a categorical variable into a binary representation.
Let's look at the example of two colors, as shown in Table 3.1. Each can
be represented by binary values.
TABLE 3.1 One hot encoding example
Categorical Value | Integer Encoding (Vocabulary Mapping) | One Hot Encoding
Red | 0 | 10
Blue | 1 | 01
When there are many categories, one hot encoding produces large binary vectors, which leads to a sparse representation with lots of 0s.
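A minimal sketch of one hot encoding with pandas, using the colors from Table 3.1:

import pandas as pd

colors = pd.DataFrame({"color": ["Red", "Blue", "Red"]})
one_hot = pd.get_dummies(colors["color"])  # one binary column per category value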
Feature Hashing
Hashing works by applying a hash function to the categorical features and using the hash values as indices directly, rather than looking the values up in a vocabulary. Hashing can cause collisions, which is why, for important terms, hashing can be worse than selecting a vocabulary. The advantage of hashing is that it doesn't require you to assemble a vocabulary, which helps when the feature distribution changes heavily over time.
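A minimal sketch of feature hashing with scikit-learn's FeatureHasher; the number of buckets and the category strings are hypothetical:

from sklearn.feature_extraction import FeatureHasher

# Hash each categorical value into one of 8 buckets (collisions are possible).
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([["user_12345"], ["user_67890"]]).toarray()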
Embedding
An embedding represents a categorical feature as a continuous valued vector. Deep learning models frequently convert integer indices into embeddings. Embeddings are mostly used for text or document classification, where you have a bag of words or a document of words.
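A minimal sketch of an embedding layer in Keras; the vocabulary size and vector dimension are hypothetical:

import tensorflow as tf

# Map each of 1,000 category indices to a learned 8-dimensional vector.
embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=8)
vectors = embedding(tf.constant([3, 42, 917]))  # shape (3, 8)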
Feature Selection
Feature selection means selecting a subset of features or reducing the
number of input variables that are most useful to a model in order to
predict the target variable.
Dimensionality reduction is one of the most popular techniques for
reducing the number of features. The advantage of this is that it
reduces the noise from data and overfitting problems. A lower number
of dimensions in data means less training time and computational
resources and increases the overall performance of machine learning
algorithms. There are two ways this can be accomplished:
Keep the most important features only: Some of the techniques used are backward selection and random forests.
Find a combination of new features: Some of the techniques used are principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE).
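A minimal sketch of dimensionality reduction with PCA in scikit-learn, assuming a generic numeric feature matrix:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 10)           # hypothetical 10-feature dataset
pca = PCA(n_components=3)             # keep 3 principal components
X_reduced = pca.fit_transform(X)      # shape (200, 3)
print(pca.explained_variance_ratio_)  # variance captured by each component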
Class Imbalance
We covered data class imbalance in Chapter 2; in this section, we will talk about class imbalance specific to classification models. Classification models rely on some key outcomes we covered in Chapter 1, “Framing ML Problems”:
A true positive is an outcome in which the model correctly predicts the positive class; for example, a patient who tested positive actually was sick with the virus.
A true negative is an outcome in which the model correctly predicts the negative class; the patient tested negative and actually is not sick.
A false positive is an outcome in which the model incorrectly predicts the positive class, meaning the patient was not actually sick but the test flagged them as sick.
A false negative is an outcome in which the model incorrectly predicts the negative class, meaning the patient was sick but the test determined them to be not sick.
In this scenario, false negatives are a problem because you do not want
sick patients identified as not sick. So you would work on minimizing
the false negatives outcome in your classification model.
AUC ROC
An ROC curve (receiver operating characteristic curve) is a graph
showing the performance of a classification model at all classification
thresholds. This curve plots two parameters: true positive rate and
false positive rate. You can refer to this link to see what the graph
looks like:
https://developers.google.com/machine-learning/crash-
course/classification/roc-and-auc
AUC ROC (area under the ROC curve) measures the two dimensional
area underneath the entire ROC curve. It refers to a number between
0.0 and 1.0 representing a binary classification model's ability to
separate positive classes from negative classes.
The closer the AUC is to 1.0, the better the model's ability to separate classes from each other. AUC ROC is used when the classes are balanced or when you want to give equal weight to the model's ability to predict both the negative and positive classes.
AUC PR
A PR curve is a graph with Precision values on the y axis and Recall
values on the x axis. The focus of the PR curve on the minority class
makes it an effective diagnostic for imbalanced binary classification
models.
The area under the precision recall curve (AUC PR) measures the two dimensional area underneath the precision recall (PR) curve. For imbalanced classes and highly skewed domains, PR curves are recommended because AUC PR gives more attention to the minority class. It can be used in conjunction with downsampling or upsampling, which we covered in Chapter 2.
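A minimal sketch of computing both metrics with scikit-learn; average_precision_score is a common summary of the PR curve, and the labels and scores are hypothetical:

from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1]               # imbalanced labels
y_score = [0.1, 0.3, 0.2, 0.8, 0.7, 0.9]  # model's predicted probabilities
print(roc_auc_score(y_true, y_score))             # AUC ROC
print(average_precision_score(y_true, y_score))   # summarizes the PR curve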
Feature Crosses
A feature cross, or synthetic feature, is created by multiplying
(crossing) two or more features. It can be multiplying the same feature
by itself [A * A] or it can be multiplying values of multiple features,
such as [A * B * C]. In machine learning, feature crosses are usually
performed on one hot encoded features—for example, binned_latitude
× binned_longitude. (For more information, see
https://developers.google.com/machine-learning/crash-
course/feature-crosses/video-lecture.)
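A minimal sketch of crossing two binned features by hand with pandas and then one hot encoding the result; the bin labels are hypothetical:

import pandas as pd

df = pd.DataFrame({"binned_latitude": ["lat_0", "lat_1", "lat_0"],
                   "binned_longitude": ["lon_2", "lon_2", "lon_3"]})
# Cross: one synthetic categorical feature per (latitude, longitude) pair.
df["lat_x_lon"] = df["binned_latitude"] + "_x_" + df["binned_longitude"]
crossed_one_hot = pd.get_dummies(df["lat_x_lon"])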
TensorFlow Transform
Increasing the performance of a TensorFlow model requires an
efficient input pipeline. First we will discuss the TF Data API and then
we'll talk about TensorFlow Transform.
TensorFlow Transform
The TensorFlow Transform library is part of TensorFlow Extended
(TFX) and allows you to perform transformations prior to training the
model and to emit a TensorFlow graph that reproduces these
transformations during training. Using tf.Transform avoids training-serving skew. In Google Cloud, you can create transform
pipelines using Cloud Dataflow. Some of the steps that TF Transform
takes for transformations during training and serving are analyzing
training data, transforming training data, transforming evaluation
data, producing metadata, feeding the model, and serving data, as
shown in Figure 3.4.
FIGURE 3.4 TensorFlow Transform
Source: Adapted from Google Cloud
You can run the previous pipeline using Cloud Dataflow and BigQuery:
1. Read training data from BigQuery.
2. Analyze and transform training data using tf.Transform Cloud
Dataflow.
3. Write transformed training data to Cloud Storage as TFRecords.
4. Read evaluation data from BigQuery.
5. Transform evaluation data using the transform_fn produced by
step 2 in Cloud Dataflow.
6. Write transformed evaluation data to Cloud Storage as TFRecords.
7. Write transformation artifacts to Cloud Storage for creating and
exporting the model.
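To make the analyze/transform split concrete, here is a minimal sketch of a tf.Transform preprocessing_fn, assuming hypothetical amount and category input features:

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Analyzers (mean, variance, vocabulary) run over the full training
    # dataset; the resulting constants are baked into the serving graph.
    return {
        "amount_scaled": tft.scale_to_z_score(inputs["amount"]),
        "category_id": tft.compute_and_apply_vocabulary(inputs["category"]),
    }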
Table 3.2 shows how you can run a TFX pipeline using Google Cloud
Platform (GCP) services.
TABLE 3.2 Run a TFX pipeline on GCP
Step | TFX library | GCP service
Data extraction & validation | TensorFlow Data Validation | Cloud Dataflow
Data transformation | TensorFlow Transform | Cloud Dataflow
Model training & tuning | TensorFlow (tf.Estimators and tf.Keras) | Vertex AI Training
Model evaluation & validation | TensorFlow Model Analysis | Cloud Dataflow
Model serving for prediction | TensorFlow Serving | Vertex AI Prediction
Summary
In this chapter, we discussed feature engineering and why it's
important to transform numerical and categorical features for model
training and serving.
Then we discussed various techniques to transform numerical features, such as bucketing and normalization. We also discussed techniques to transform categorical features, such as integer encoding, one hot encoding, out of vocabulary handling, hashing, and embedding.
You learned why it's important to select features and some of the
techniques for dimensionality reduction such as PCA. Then we covered
class imbalance and how precision and recall impact classification. For imbalanced classes, AUC PR is more effective than
AUC ROC.
We also discussed why feature crosses are important and the benefits
of feature crossing.
We covered how to represent data for TensorFlow using tf.data and
then we covered tf.Transform and how to process pipelines using
tf.Transform on Google Cloud. You learned about some of the Google
Cloud data processing and ETL tools such as Cloud Data Fusion and
Cloud Dataprep.
Exam Essentials
Use consistent data processing. Understand when to
transform data, either before training or during model training.
Also know the benefits and limitations of transforming data
before training.
Know how to encode structured data types. Understand
techniques to transform both numeric and categorical data such
as bucketing, normalization, hashing, and one hot encoding.
Understand feature selection. Understand why feature
selection is needed and some of the techniques of feature
selection, such as dimensionality reduction.
Understand class imbalance. Understand true positive, false
positive, accuracy, AUC, precision, and recall in classification
problems and how to effectively measure accuracy with class
imbalance.
Know where and how to use feature cross. You need to
understand why feature cross is important and the scenarios in
which you would need it.
Understand TensorFlow Transform. You need to understand TensorFlow Data and TensorFlow Transform and how to architect tf.Transform pipelines on Google Cloud using BigQuery and Cloud Dataflow.
Use GCP data and ETL tools. Know how and when to use
tools such as Cloud Data Fusion and Cloud Dataprep. For
example, in case you are looking for a no code solution to clean
data, you would use Dataprep for data processing and, in case you
are looking for a no code and UI–based solution for ETL (extract,
transform, load), you would use Cloud Data Fusion.
Review Questions
1. You are the data scientist for your company. You have a dataset that has all categorical features. You trained a model using several algorithms. With some algorithms this data gives good results, but when you change the algorithm, performance drops. Which data transformation strategy would likely improve the performance of your model?
A. Write your data in TFRecords.
B. Create a feature cross with categorical feature.
C. Use one hot encoding on all categorical features.
D. Oversample the features.
2. You are working on a neural network–based project. The dataset
provided to you has columns with different ranges. While
preparing the data for model training, you discover that gradient
optimization is having difficulty moving weights to an optimized
solution. What should you do?
A. Use feature construction to combine the strongest features.
B. Use the normalization technique.
C. Improve the data cleaning step by removing features with
missing values.
D. Change the partitioning step to reduce the dimension of the
test set and have a larger training set.
3. You work for a credit card company and have been asked to create
a custom fraud detection model based on historical data using
AutoML Tables. You need to prioritize detection of fraudulent
transactions while minimizing false positives. Which optimization
objective should you use when training the model?
A. An optimization objective that minimizes log loss.
B. An optimization objective that maximizes the precision at a
recall value of 0.50.
C. An optimization objective that maximizes the area under the
precision recall curve (AUC PR) value.
D. An optimization objective that maximizes the area under the
curve receiver operating characteristic (AUC ROC) curve
value.
4. You are a data scientist working on a classification problem with
time series data and achieved an area under the receiver
operating characteristic curve (AUC ROC) value of 99 percent for
training data with just a few experiments. You haven't explored
using any sophisticated algorithms or spent any time on
hyperparameter tuning. What should your next step be to identify
and fix the problem?
A. Address the model overfitting by using a less complex
algorithm.
B. Address data leakage by applying nested cross validation
during model training.
C. Address data leakage by removing features highly correlated
with the target value.
D. Address the model overfitting by tuning the hyperparameters
to reduce the AUC ROC value.
5. You are training a ResNet model on Vertex AI using TPUs to
visually categorize types of defects in automobile engines. You
capture the training profile using the Cloud TPU profiler plug in
and observe that it is highly input bound. You want to reduce the
bottleneck and speed up your model training process. Which
modifications should you make to the tf.data dataset? (Choose
two.)
A. Use the interleave option to read data.
B. Set the prefetch option equal to the training batch size.
C. Reduce the repeat parameters.
D. Decrease the batch size argument in your transformation.
E. Increase the buffer size for shuffle.
6. You have been asked to develop an input pipeline for an ML
training model that processes images from disparate sources at a
low latency. You discover that your input data does not fit in
memory. How should you create a dataset following Google
recommended best practices?
A. Create a tf.data.Dataset.prefetch transformation.
B. Convert the images into TFRecords, store the images in
Cloud Storage, and then use the tf.data API to read the
images for training.
C. Convert the images to tf.Tensor objects, and then run Dataset.from_tensor_slices().
D. Convert data into TFRecords.
7. Different cities in California have markedly different housing
prices. Suppose you must create a model to predict housing
prices. Which of the following sets of features or feature crosses
could learn city specific relationships between roomsPerPerson
and housing price?
A. Two feature crosses: [binned latitude x binned
roomsPerPerson] and [binned longitude x binned
roomsPerPerson]
B. Three separate binned features: [binned latitude], [binned
longitude], [binned roomsPerPerson]
C. One feature cross: [binned latitude x binned longitude x
binned roomsPerPerson]
D. One feature cross: [latitude x longitude x roomsPerPerson]
8. You are a data engineer for a finance company. You are
responsible for building a unified analytics environment across a
variety of on premises data marts. Your company is experiencing
data quality and security challenges when integrating data across
the servers, caused by the use of a wide range of disconnected
tools and temporary solutions. You need a fully managed, cloud
native data integration service that will lower the total cost of
work and reduce repetitive work. Some members on your team
prefer a codeless interface for building an extract, transform, load
(ETL) process. Which service should you use?
A. Cloud Data Fusion
B. Dataprep
C. Cloud Dataflow
D. Apache Flink
9. You work for a global footwear retailer and need to predict when
an item will be out of stock based on historical inventory data.
Customer behavior is highly dynamic since footwear demand is
influenced by many different factors. You want to serve models
that are trained on all available data but track your performance
on specific subsets of data before pushing to production. What is
the most streamlined, scalable, and reliable way to perform this
validation?
A. Use the tf.Transform to specify performance metrics for
production readiness of the data.
B. Use the entire dataset and treat the area under the receiver
operating characteristic curve (AUC ROC) as the main metric.
C. Use the last relevant week of data as a validation set to ensure
that your model is performing accurately on current data.
D. Use k fold cross validation as a validation strategy to ensure
that your model is ready for production.
10. You are transforming a complete dataset before model training.
Your model accuracy is 99 percent in training, but in production
its accuracy is 66 percent. What is a possible way to improve the
model in production?
A. Apply transformation during model training.
B. Perform data normalization.
C. Remove missing values.
D. Use tf.Transform for creating production pipelines for both
training and serving.
Chapter 4
Choosing the Right ML Infrastructure
Say you find that your use case is unique and there is no AutoML available for it. In that case, you can use custom models in Vertex AI. This is the top tier of the pyramid in Figure 4.1, and it offers you a lot of flexibility in the choice of algorithm, hardware provisioning, and data types. This tier sits at the top of the pyramid because of that flexibility, but it also has the smallest base because the number of customers who have the expertise to use custom models is small. We will look at the hardware provisioning options for this training method later in this chapter.
Pretrained Models
When you are trying to solve a problem with machine learning, the
first step is to see if there is a pretrained model. Pretrained models are
machine learning models that have been trained on extremely large
datasets and perform very well in benchmark tests. These models are
supported by very large engineering and research teams and are
retrained frequently.
As a customer, you can start using these models in just a few minutes
through the web console (or the CLI, Python, Java, or Node.js SDK).
Any developer who is trying to solve a problem using machine learning
should first check if there is a pretrained model available on the
Google Cloud platform, and if so, use it.
Google Cloud has several pretrained models available:
Vision AI
Video AI
Natural Language AI
Translation AI
Speech to Text and Text to Speech
In addition to pretrained models, Google Cloud has platforms that
offer solutions to certain kinds of problems and include pretrained
models as well as the ability to uptrain the existing models:
Document AI
Contact Center AI
Vision AI
Vision AI provides you with convenient access to ML algorithms for
processing images and photos without having to create a complete
machine learning infrastructure. Using this service, you can perform
image classification, detect objects and faces, and read handwriting
(through optical character recognition).
You can also try the service quickly from the convenience of your
browser at https://cloud.google.com/vision. In production, typically
you upload an image to the service or point to an image URL to
analyze.
When you try the service using your browser, you get four types of predictions, as shown in Figure 4.2. First you see objects detected in the photo.
Video AI
This API has pretrained machine learning models that recognize
objects, places, and actions in videos. It can be applied to stored video
or to streaming video where the results are returned in real time.
You can use this service to recognize more than 20,000 different
objects, places, and actions in videos. You can use the results as
metadata in your video that can be used to search videos from your
video catalog. For example, you can use the service to tag sports
videos, and more specifically the type of sport. You can also process
livestreams; for example, if you have a street camera looking at traffic,
you can count the number of cars that cross an intersection.
Here are several examples of use cases for Video AI:
Use Case 1: This API can be used to build a video
recommendation system, using the labels generated by the API
and a user's viewing history. This gives you the ability to recommend based on details from within the video, not just external metadata, and can greatly improve the user experience.
Use Case 2: Another use case is to create an index of your video
archives using the metadata from the API. This is perfect for mass
media companies that have petabytes of data that are not indexed.
Use Case 3: Advertisements inserted into videos can sometimes be completely irrelevant to the video content. This is another use case where you can improve the user experience, by comparing the time-frame-specific labels of the video content with the content of the advertisements.
Natural Language AI
The Natural Language AI provides insights from unstructured text
using pretrained machine learning models. The main services it
provides are entity extraction, sentiment analysis, syntax analysis, and
general categorization.
The entity extraction service identifies entities such as the names of
people, organizations, products, events, locations, and so on. This
service also enriches the entities with additional information like links
to Wikipedia articles if it finds any. Although entity extraction may
sound like a simple problem, it is a nontrivial task. For example, in the
sentence “Mr. Wood is a good actor,” it takes a good amount of
understanding that “Mr. Wood” is a person and not a type of wood.
Sentiment analysis provides a positive, negative, or neutral score, with a magnitude, for each sentence, each entity, and the text as a whole.
The syntax analysis can be used to identify the part of speech,
dependency between words, lemma, and the morphology of text.
Finally, it also classifies documents into one of more than 700
predefined categories. For more details, see
https://cloud.google.com/natural-language.
Translation AI
Use Translation AI to detect more than 100 languages, from Afrikaans
to Zulu, and translate between any pairs of languages in that list. It
uses Google Neural Machine Translation (GNMT) technology that was
pioneered by Google and is now considered industry standard. For
more information, refer to https://ai.googleblog.com/2016/09/a-
neural-network-for-machine.html.
This service has two levels, Basic and Advanced. There are many differences, but the main one is that the Advanced version can use a glossary (a dictionary of terms mapped from source language to target language) and can translate entire documents (PDFs, DOCs, etc.). There is also a price difference.
You can translate text in ASCII or in UTF-8 format. In addition, you
can also translate audio in real time using the Media Translation API,
typically used for streaming services.
The Media Translation API (separate from the Translation API)
directly translates audio in source language into audio in target
languages. This helps with low latency streaming applications and
scales quickly.
Speech‐to‐Text
You can use the Speech to Text service to convert recorded audio or
streaming audio into text. This is a popular service for creating
subtitles for video recordings and streaming video as well. This is also
commonly combined with a translate service to generate subtitles for
multiple languages. For more details, see
https://cloud.google.com/speech-to-text#section-10.
Text‐to‐Speech
Customers use the Text to Speech service to provide realistic speech
with humanlike intonation. This is based on the state of the art speech
synthesis expertise from DeepMind (an AI subsidiary of Google). It
currently supports 220+ voices across 40+ languages and variants.
You can create a unique voice to represent your brand at all your
touchpoints. See here for the list of languages supported:
https://cloud.google.com/text-to-speech/docs/voices.
AutoML
AutoML, or automated ML, is the process of automating the time
consuming tasks of model training. AutoML is available for popular,
well understood, and practically feasible ML problems like image
classification, text classification, translation, and so on. You as a user
only bring in the data and configure a few settings and the rest of the
training is automated. You either leverage the easy to use web console
or use a Python, Java, or Node.js SDK to initiate the AutoML training
job.
There is AutoML training available for many data types and use cases.
We can broadly categorize them into four categories:
Structured data
Images/video
Natural language
Recommendations AI/Retail AI
Recommendations AI/Retail AI
GCP has an AutoML solution for the retail domain. Retail Search
offers retailers a Google quality search that can be customized and
built upon Google's understanding of user intent and context.
The Vision API Product Search (a service under Vision AI) can be
trained on reference images of products in your catalog, which can
then be searched using an image.
The third part of the solution is Recommendations AI, which can
understand nuances behind customer behavior, context, and SKUs in
order to drive engagement across channels through relevant
recommendations.
In this solution, customers upload the product catalog with details
about each product, photos, and other metadata. The customer then
feeds in the “user events” such as what the customer clicks, views, and
buys. Recommendations AI uses this data to create models. Customers
are charged for training models, which are continuously fine tuned to
include updates to the “user events” data.
In addition, when the recommendations are served and get millions of hits, the customer is charged per 1,000 requests. This is a serverless approach to provisioning resources, and the customer does not have to worry about the exact hardware behind the scenes.
Recommendations AI has an easy to use automated machine learning
training method. It provides several different models that serve a
variety of purposes for an online retail presence. From the exam
perspective, it is important to understand Table 4.4, which describes
the different recommendation types.
TABLE 4.4 Summary of the recommendation types available in Retail AI
Source: Adapted from Google Cloud, https://cloud.google.com/retail/docs/models#model-types, last accessed December 16, 2022.
Document AI
When you want to extract details from documents, like digitized
scanned images of old printed documents, books, or forms, you can
use Document AI. These are pages that can contain text in paragraph
format and also tables and pictures. Some of this text could be printed
and sometimes it could be handwritten. This is also a common type of
document seen in government offices like the DMV where people fill
out forms.
Forms contain a mix of printed text, with blank spaces where people
fill out their details. This could be written with different types of ink
(blue or black pen, etc.), and sometimes people make mistakes while
writing. So, the ML model needs to understand the structure of the
document (where to expect “name” and “address”) and have a
tolerance for handwritten text.
Another example is government documents like passports, driver's
licenses, and tax filings. These are documents that have better
structure but still have some variability.
If we can extract important details (like “firstname,” “lastname,”
“address,” etc.) from forms, we have structured data, which can now
be stored in a database and can be analyzed. This is the extraction
phase.
Document AI is a platform that understands documents and helps you
to do the following:
Detect document quality.
Deskew.
Extract text and layout information.
Identify and extract key/value pairs.
Extract and normalize entities.
Split and classify documents.
Review documents (human in the loop).
Store, search, and organize documents (Document AI Warehouse)
Document AI has two important concepts: processors and a Document
AI Warehouse.
A Document AI processor is an interface between the document and a
machine learning model that performs the actions. There are general
processors, specialized processors (procurement, identity, lending,
and contract documents), and custom processors. You can train the
custom processor by providing a training dataset (labeled set of
documents) for your custom needs. For the full list of processor types,
visit https://cloud.google.com/document-ai/docs/processors-list.
A Document AI Warehouse is a platform to store, search, organize,
govern, and analyze documents along with their structured metadata.
Agent Assist
When a human agent is handling a call, Agent Assist can provide support by identifying intent, providing ready to send responses and answers from a centralized knowledge base, and transcribing calls in real time.
Insights
This service uses natural language processing to identify call drivers and measure sentiment, helping leadership understand call center operations so they can improve outcomes.
CCAI
The Contact Center AI platform is a complete cloud native platform to
support multichannel communications between customers and agents.
Although Dialogflow and CCAI use advanced machine learning
techniques, especially in natural language, they are mostly hidden
from the machine learning engineer. An in depth understanding of
CCAI is beyond the scope of this exam.
Custom Training
When you have chosen custom training, you have full flexibility to
choose a wide range of hardware options to train your model on.
Graphics processing units (GPUs) can accelerate the training process
of deep learning models. Models used for natural language, images,
and videos need compute intensive operations like matrix
multiplications that can benefit by running on massively parallel
architectures like GPUs.
If you train a deep learning model on a single CPU, it could take days,
weeks, or sometimes months to complete. However, if you can offload
the heavy computation to a GPU, it can reduce the time by an order of
magnitude. What used to take days might take hours to complete.
To understand the advantages of specialized hardware, let us first see
how a CPU works.
GPU
GPUs bring in additional firepower. A graphics processing unit (GPU)
is a specialized chip designed to rapidly process data in memory to
create images; it was originally intended to process movies and render
images in video games. A GPU does not work alone and is a
subprocessor that helps the CPU in some tasks.
GPUs contain thousands of arithmetic logic units (ALUs) in a single
processor. So instead of accessing the memory for each operation, a
GPU loads a block of memory and applies some operation using the
thousands of ALUs in parallel, thereby making it faster. Using the
GPUs for large matrix multiplications and differential operations
could improve the speed by an order of magnitude in time.
To use GPUs, you must use an A2 or N1 machine series. Vertex AI currently has the following GPUs available:
NVIDIA_TESLA_T4
NVIDIA_TESLA_K80
NVIDIA_TESLA_P4
NVIDIA_TESLA_P100
NVIDIA_TESLA_V100
NVIDIA_TESLA_A100
In your WorkerPoolSpec, specify the type of GPU that you want to use
in the machineSpec.acceleratorType field and the number of GPUs that
you want each VM in the worker pool to use in the
machineSpec.acceleratorCount field.
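The following is a minimal sketch of a worker pool spec as it could be passed to the Vertex AI Python SDK; the machine type, GPU count, and container image are hypothetical:

from google.cloud import aiplatform

worker_pool_specs = [{
    "machine_spec": {
        "machine_type": "n1-standard-8",
        "accelerator_type": "NVIDIA_TESLA_V100",  # must be available in the region
        "accelerator_count": 2,                   # GPUs per VM in this pool
    },
    "replica_count": 1,
    "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},  # hypothetical image
}]
job = aiplatform.CustomJob(display_name="gpu-training",
                           worker_pool_specs=worker_pool_specs)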
When you are trying to configure GPUs with instance types, there are
several restrictions based on instance types, instance memory, and so
on. Some restrictions are as follows:
The type of GPU that you choose must be available in the location
where you are performing custom training. Not all types of GPUs
are available in all regions.
Here is a page that lists the available locations:
cloud.google.com/vertex-ai/docs/general/locations.
TPU
As ML engineers used GPUs to train very large neural networks, they
started to notice the next bottleneck. The GPU is still a semi general
purpose processor that has to support many different applications,
including video processing software. Therefore, GPUs share the same fundamental problem as CPUs: for every calculation in the thousands of
ALUs, a GPU must access registers or shared memory to read
operands and store the intermediate calculation results. To take
performance to the next level, Google designed TPUs.
Tensor Processing Units (TPUs) are specialized hardware accelerators
designed by Google specifically for machine learning workloads. See
Figure 4.5 for the system architecture of a TPU v4.
Advantages of TPUs
TPUs accelerate the computational speed beyond GPUs. Models that
take months to train on CPUs might take a few days to train on GPUs
but might run in a matter of hours on TPUs. Simply put, TPUs could
provide an order of magnitude improvement over GPUs.
Scaling Behavior
If you use an autoscaling configuration, Vertex AI automatically scales
to use more prediction nodes when the CPU usage of your existing
nodes gets high. If you are using GPU nodes, make sure to configure
the appropriate trigger because there are three resources (CPU,
memory, and GPU) that have to be monitored for usage.
Edge TPU
An important part of the Internet of Things (IoT) are the edge devices.
These devices collect real time data, make decisions, take action, and
communicate with other devices or with the cloud. Since such devices
have limited bandwidth and sometimes may operate completely
offline, there is increasing demand for running inference on the device
itself, called edge inference.
The Google designed Edge TPU coprocessor accelerates ML inference
on these edge devices. A single Edge TPU can perform 4 trillion
operations per second (4 TOPS), on just 2 watts of power.
The Edge TPU is available for your own prototyping and production devices in several form factors, including a single board computer, a system on module, and accessory accelerators that add an Edge TPU to existing systems. These products are sold under the brand name Coral.ai.
TPUs are popularly used for training but usually are not
used for serving in the cloud. However, Edge TPUs are used for
deploying models at the edge.
Summary
In this chapter, you learned about different pretrained models that are
available on Google Cloud. You also learned about AutoML models
and the applicability to different scenarios. In the main part of the
chapter, you learned about the different hardware options available for
training your models and the difference between training workloads and prediction workloads. Google Cloud provides you with a wide variety of
hardware accelerators in the form of GPUs and TPUs. Finally, going
beyond the cloud, you were introduced to the ideas of deploying to the
edge devices.
Exam Essentials
Choose the right ML approach. Understand the requirements
to choose between pretrained models, AutoML, or custom
models. Understand the readiness of the solution, the flexibility,
and approach.
Provision the right hardware for training. Understand the
various hardware options available for machine learning. Also
understand the requirements of GPU and TPU hardware and the
instance types that support the specialized hardware. Also learn
about hardware differences in training and deployment.
Provision the right hardware for predictions. Learn the difference between provisioning during training time and during predictions. The requirements for predictions are usually scalability and CPU and memory constraints, so CPUs and GPUs are used in the cloud, whereas Edge TPUs are used in edge devices.
Understand the available ML solutions. Instead of
provisioning hardware, take a serverless approach by using
pretrained models and solutions that are built to solve a problem
in a domain.
Review Questions
1. Your company deals with real estate, and as part of a software
development team, you have been asked to add a machine
learning model to identify objects in photos uploaded to your
website. How do you go about this?
A. Use a custom model to get best results.
B. Use AutoML to create object detection.
C. Start with Vision AI, and if that does not work, use AutoML.
D. Combine AutoML and a custom model to get better results.
2. Your company is working with legal documentation (thousands of
pages) that needs to be translated to Spanish and French. You
notice that the pretrained model in Google’s Translation AI is
good, but there are a few hundred domain specific terms that are
not translated in the way you want. You don't have any labeled
data and you have only a few professional translators in your
company. What do you do?
A. Use Google's translate service and then have a human in the
loop (HITL) to fix each translation.
B. Use Google AutoML Translation to create a new translation
model for your case.
C. Use Google's Translation AI with a “glossary” of the terms
you need.
D. Not possible to translate because you don't seem to have
data.
3. You are working with a thousand hours of video recordings in
Spanish and need to create subtitles in English and French. You
already have a small dataset with hundreds of hours of video for
which subtitles have been created manually. What is your first
approach?
A. There is no “translated subtitle” service so use AutoML to
create a “subtitle” job using the existing dataset and then use
that model to create translated subtitles.
B. There is no “translated subtitle” service, and there is no
AutoML for this so you have to create a custom model using
the data and run it on GPUs.
C. There is no “translated subtitle” service, and there is no
AutoML for this so you have to create a custom model using
the data and run it on TPUs.
D. Use the pretrained Speech to Text (STT) service and then use
the pretrained Google Translate service to translate the text
and insert the subtitles.
4. You want to build a mobile app to classify the different kinds of
insects. You have enough labeled data to train but you want to go
to market quickly. How would you design this?
A. Use AutoML to train a classification model, with AutoML
Edge as the method. Create an Android app using ML Kit and
deploy the model to the edge device.
B. Use AutoML to train a classification model, with AutoML
Edge as the method. Use a Coral.ai device that has edge TPU
and deploy the model on that device.
C. Use AutoML to train an object detection model with AutoML
Edge as the method. Use a Coral.ai device that has edge TPU
and deploy the model on that device.
D. Use AutoML to train an image segmentation model, with
AutoML Edge as the method. Create an Android app using
ML Kit and deploy the model to the edge device.
5. You are training a deep learning model for object detection. It is
taking too long to converge, so you are trying to speed up the
training. While you are trying to launch an instance (with GPU)
with Deep Learning VM Image, you get an error that the
“NVIDIA_TESLA_V100 was not found.” What could be the
problem?
A. GPU was not available in the selected region.
B. GPU quota was not sufficient.
C. Preemptible GPU quota was not sufficient.
D. GPU did not have enough memory.
6. Your team is building a convolutional neural network for an
image segmentation problem on prem on a CPU only machine. It
takes a long time to train, so you want to speed up the process by
moving to the cloud. You experiment with VMs on Google Cloud
to use better hardware. You do not have any code for manual
placements and have not used any custom transforms. What
hardware should you use?
A. A deep learning VM with n1 standard 2 machine with 1 GPU
B. A deep learning VM with more powerful e2 highCPU 16
machines
C. A VM with 8 GPUs
D. A VM with 1 TPU
7. You work for a hardware retail store and have a website where
you get thousands of users on a daily basis. You want to display
recommendations on the home page for your users, using
Recommendations AI. What model would you choose?
A. “Others you may like”
B. “Frequently bought together”
C. “Similar items”
D. “Recommended for you”
8. You work for a hardware retail store and have a website where
you get thousands of users on a daily basis. You want to increase
your revenue by showing recommendations while customers
check out. What type of model in Recommendations AI would you
choose?
A. “Others you may like”
B. “Frequently bought together”
C. “Similar items”
D. “Recommended for you”
9. You work for a hardware retail store and have a website where
you get thousands of users on a daily basis. You have a customer's
browsing history and want to engage the customer more.
What model in Recommendations AI would you choose?
A. “Others you may like”
B. “Frequently bought together”
C. “Similar items”
D. “Recommended for you”
10. You work for a hardware retail store and have a website where
you get thousands of users on a daily basis. You do not have
browsing events data. What type of model in Recommendations
AI would you choose?
A. “Others you may like”
B. “Frequently bought together”
C. “Similar items”
D. “Recommended for you”
11. You work for a hardware retail store and have a website where
you get thousands of users on a daily basis. You want to show
details to increase cart size. You are going to use
Recommendations AI for this. What model and optimization do
you choose?
A. “Others you may like” with “click through rate” as the
objective
B. “Frequently bought together” with “revenue per order” as the
objective
C. “Similar items” with “revenue per order” as the objective
D. “Recommended for you” with “revenue per order” as the
objective
12. You are building a custom deep learning neural network model in
Keras that will summarize a large document into a 50 word
summary. You want to try different architectures and compare the
metrics and performance. What should you do?
A. Create multiple AutoML jobs and compare performance.
B. Use Cloud Composer to automate multiple jobs.
C. Use the pretrained Natural Language API first.
D. Run multiple jobs on the AI platform and compare results.
13. You are building a sentiment analysis tool that collates the
sentiment of all customer calls to the call center. The management
is looking for something to measure the sentiment; it does not
have to be super accurate, but it needs to be quick. What do you
think is the best approach for this?
A. Use the pretrained Natural Language API to predict
sentiment.
B. Use Speech to Text (STT) and then pass through the
pretrained Natural Language API to predict sentiment.
C. Build a custom model to predict the sentiment directly from
voice calls, which captures the intonation.
D. Convert Speech to Text and extract sentiment using BERT
algorithm.
14. You have built a very large deep learning model using some
custom TensorFlow operations written in C++ for object tracking
in videos. Your model has been tested on CPU and now you want
to speed up training. What would you do?
A. Use TPU v4 in default setting because it involves using very
large matrix operations.
B. Customize the TPU v4 size to match with the video and
recompile the custom TensorFlow operations for TPU.
C. Use GPU instances because TPUs do not support custom
operations.
D. You cannot use GPU or TPU because neither supports
custom operations.
15. You want to use GPUs for training your models that need about
50 GB of memory. What hardware options do you have?
A. n1 standard 64 with 8 NVIDIA_TESLA_P100
B. e2 standard 32 with 4 NVIDIA_TESLA_P100
C. n1 standard 32 with 3 NVIDIA_TESLA_P100
D. n2d standard 32 with 4 NVIDIA_TESLA_P100
16. You have built a deep neural network model to translate voice in real time on cloud TPUs, and now you want to push it to your end device. What is the best option?
A. Push the model to the end device running Edge TPU.
B. Models built on TPUs cannot be pushed to the edge. The
model has to be recompiled before deployment to the edge.
C. Push the model to any Android device.
D. Use ML Kit to reduce the size of the model to push the model
to any Android device.
17. You want to use cloud TPUs and are looking at all options. Which
of the below are valid options? (Choose two.)
A. A single TPU VM
B. An HPC cluster of instances with TPU
C. A TPU Pod or slice
D. An instance with both TPU and GPU to give additional boost
18. You want to train a very large deep learning TensorFlow model
(more than 100 GB) on a dataset that has a matrix in which most
values are zero. You do not have any custom TensorFlow
operations and have optimized the training loop to not have an
I/O operation. What are your options?
A. Use a TPU because you do not have any custom TensorFlow
operations.
B. Use a TPU Pod because the size of the model is very large.
C. Use a GPU.
D. Use an appropriately sized TPUv4 slice.
19. You have been tasked to use machine learning to precisely predict
the amount of liquid (down to the milliliter) in a large tank based
on pictures of the tank. You have decided to use a large deep
learning TensorFlow model. The model is more than 100 GB and
trained on a dataset that is very large. You do not have any custom
TensorFlow operations and have optimized the training loop to
not have I/O operations. What are your options?
A. Use a TPU because you do not have any custom TensorFlow
operations.
B. Use a TPU Pod because the size of the model is very large.
C. Use a GPU.
D. Use TPU v4 of appropriate size and shape for the use case.
20. You are a data scientist trying to build a model to estimate the
energy usage of houses based on photos, year built, and so on.
You have built a custom model and deployed this custom
container in Vertex AI. Your application is a big hit with home
buyers who are using it to predict energy costs for houses before
buying. You are now getting complaints that the latency is too
high. To fix the latency problem, you deploy the model on a bigger
instance (32 core) but the latency is still high. What is your next
step? (Choose two.)
A. Increase the size of the instance.
B. Use a GPU instance for prediction.
C. Deploy the model on a Compute Engine instance and test the memory and CPU usage.
D. Check the code to see if this is single threaded and other
software configurations for any bugs.
Chapter 5
Architecting ML Solutions
BigQuery
The best practice is to store tabular data in BigQuery. For training
data, it's better to store the data as tables instead of views for better
speed. BigQuery functionality is available by using the following:
The Google Cloud console (search for BigQuery)
The bq command line tool
The BigQuery REST API
Vertex AI Jupyter Notebooks using BigQuery Magic or the BigQuery
Python client
We are going to cover BigQuery ML in Chapter 14.
Table 5.3 lists Google Cloud tools that make it easier to use the API.
TABLE 5.3 Google Cloud tools to read BigQuery data
Framework | Google Cloud tool to read data from BigQuery
TensorFlow or Keras | tf.data.dataset reader for BigQuery and tfio.BigQuery.BigQueryClient() (www.tensorflow.org/io/api_docs/python/tfio/BigQuery/BigQueryClient)
TFX | BigQuery client
Dataflow | BigQuery I/O connector
Any other framework | BigQuery Python Client library
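For example, the TensorFlow I/O reader can stream a BigQuery table directly into a tf.data.Dataset. The following is a minimal sketch, assuming the tensorflow-io package is installed; the project, dataset, table, and column names are placeholders:

import tensorflow as tf
from tensorflow import dtypes
from tensorflow_io.bigquery import BigQueryClient

# Open a read session against a BigQuery table and stream its rows
# into a tf.data.Dataset.
client = BigQueryClient()
read_session = client.read_session(
    "projects/my-project",          # billing project (placeholder)
    "bigquery-public-data",         # project that owns the table
    "shakespeare",                  # table
    "samples",                      # dataset
    ["word", "word_count"],         # columns to read
    [dtypes.string, dtypes.int64],  # matching output types
    requested_streams=2,
)
dataset = read_session.parallel_read_rows()
for row in dataset.take(2):
    print(row)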
When you have unlabeled and unstructured data, you can use the
Vertex AI data labeling service to label the data in Google Cloud
Storage or Vertex AI–managed datasets. This is a service just to label
the data and it does not store data. You can use third party crowd
sourced human labelers or your own labelers to label the data.
Serving
After you train, evaluate, and tune a machine learning (ML) model, the
model is deployed to production for predictions. An ML model can
provide predictions in two ways: offline prediction and online
prediction.
Online Prediction
This prediction happens in near real time when you send a request to
your deployed model endpoint and get the predicted response back.
This can be a model deployed to an HTTPS endpoint, and you can use
a microservice architecture to call this endpoint from your web or
mobile applications. Use cases that need ML predictions in near real
time include real time bidding and real time sentiment analysis of
Twitter feeds.
There are two ways you can have online predictions, described here:
Synchronous In this approach, the caller waits until it receives the
prediction from the ML service before performing the subsequent
steps. You can use Vertex AI online predictions to deploy your
model as a real time HTTPS endpoint. You can also use App
Engine or GKE (Google Kubernetes Engine) as an ML gateway to
perform some feature preprocessing before sending your request
from client applications, as shown in Figure 5.4.
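As a sketch of the synchronous pattern, the Vertex AI SDK lets the caller block on the endpoint until the prediction comes back; the project, region, endpoint ID, and instance fields below are placeholders:

from google.cloud import aiplatform

# Synchronous online prediction: the call blocks until the endpoint
# returns a response.
aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint("1234567890")  # endpoint ID or full resource name
response = endpoint.predict(instances=[{"feature_a": 1.0, "feature_b": "x"}])
print(response.predictions)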
Summary
In this chapter, we discussed best practices for designing a reliable,
scalable, and highly available ML solution on Google Cloud Platform
(GCP). Then we discussed when to use which service from the three
layers of the GCP AI/ML stack. We covered data collection and data
management strategy for managing data in the integrated Vertex AI
platform with BigQuery. We also covered data storage options for
submillisecond and millisecond latency, such as NoSQL data stores.
Then we covered automation and orchestration techniques for ML
pipelines such as Vertex AI Pipelines, Kubeflow Pipelines, and TFX
pipelines.
We discussed how you can serve model predictions in both batch
mode and real time. We covered a few architecture
patterns to create batch predictions as well as online predictions using
Vertex AI Prediction. Last, we discussed some ways to improve model
latency while using real time serving.
Exam Essentials
Design reliable, scalable, and highly available ML
solutions. Understand why you need to design a scalable
solution and how Google Cloud AI/ML services can help architect
a scalable and highly available solution.
Choose an appropriate ML service. Understand the AI/ML
stack of GCP and when to use each layer of the stack based on
your use case and expertise with ML.
Understand data collection and management. Understand
various types of data stores for storing your data for various ML
use cases.
Know how to implement automation and orchestration.
Know when to use Vertex AI Pipelines vs. Kubeflow vs. TFX
pipelines. We will cover the details in Chapter 11, “Designing ML
Training Pipelines.”
Understand how to best serve data. You need to understand
the best practices when deploying models. Know when to use
batch prediction versus real time prediction and how to manage
latency with online real time prediction.
Review Questions
1. You work for an online travel agency that also sells advertising
placements on its website to other companies. You have been
asked to predict the most relevant web banner that a user should
see next. Security is important to your company. The model
latency requirements are 300ms@p99, the inventory is thousands
of web banners, and your exploratory analysis has shown that
navigation context is a good predictor.
You want to implement the simplest solution. How should you
configure the prediction pipeline?
A. Embed the client on the website, and then deploy the model
on the Vertex AI platform prediction.
B. Embed the client on the website, deploy the gateway on App
Engine, and then deploy the model on the Vertex AI platform
prediction.
C. Embed the client on the website, deploy the gateway on App
Engine, deploy the database on Cloud Bigtable for writing
and for reading the user's navigation context, and then
deploy the model on the Vertex AI Prediction.
D. Embed the client on the website, deploy the gateway on App
Engine, deploy the database on Memorystore for writing and
for reading the user's navigation context, and then deploy the
model on Google Kubernetes Engine (GKE).
2. You are training a TensorFlow model on a structured dataset with
100 billion records stored in several CSV files. You need to
improve the input/output execution performance. What should
you do?
A. Load the data into BigQuery and read the data from
BigQuery.
B. Load the data into Cloud Bigtable, and read the data from
Bigtable.
C. Convert the CSV files into shards of TFRecords, and store the
data in Google Cloud Storage.
D. Convert the CSV files into shards of TFRecords, and store the
data in the Hadoop Distributed File System (HDFS).
3. You are a data engineer who is building an ML model for a
product recommendation system in an e commerce site that's
based on information about logged in users. You will use Pub/Sub
to handle incoming requests. You want to store the results for
analytics and visualizing. How should you configure the pipeline?
Pub/Sub > Preprocess(1) > ML training/serving(2) >
Storage(3) > Data Studio/Looker Studio for visualization
A. 1 = Dataflow, 2 = Vertex AI platform, 3 = Cloud BigQuery
B. 1 = Dataproc, 2 = AutoML, 3 = Cloud Memorystore
C. 1 = BigQuery, 2 = AutoML, 3 = Cloud Functions
D. 1 = BigQuery, 2 = Vertex AI platform, 3 = Google Cloud
Storage
4. You are developing models to classify customer support emails.
You created models with TensorFlow Estimator using small
datasets on your on premises system, but you now need to train
the models using large datasets to ensure high performance. You
will port your models to Google Cloud and want to minimize code
refactoring and infrastructure overhead for easier migration from
on prem to cloud. What should you do?
A. Use the Vertex AI platform for distributed training.
B. Create a cluster on Dataproc for training.
C. Create a managed instance group with autoscaling.
D. Use Kubeflow Pipelines to train on a Google Kubernetes
Engine cluster.
5. You are a CTO wanting to implement a scalable solution on
Google Cloud to digitize documents such as PDF files and Word
DOC files in various silos. You are also looking for storage
recommendations for storing the documents in a data lake. Which
options have the least infrastructure efforts? (Choose two.)
A. Use the Document AI solution.
B. Use Vision AI OCR to digitize the documents.
C. Use Google Cloud Storage to store documents.
D. Use Cloud Bigtable to store documents.
E. Use a custom Vertex AI model to build a document
processing pipeline.
6. You work for a public transportation company and need to build a
model to estimate delay for multiple transportation routes.
Predictions are served directly to users in an app in real time.
Because different seasons and population increases impact the
data relevance, you will retrain the model every month. You want
to follow Google recommended best practices. How should you
configure the end to end architecture of the predictive model?
A. Configure Kubeflow Pipelines to schedule your multistep
workflow from training to deploying your model.
B. Use a model trained and deployed on BigQuery ML and
trigger retraining with the scheduled query feature in
BigQuery.
C. Write a Cloud Functions script that launches a training and
deploying job on the Vertex AI platform that is triggered by
Cloud Scheduler.
D. Use Cloud Composer to programmatically schedule a
Dataflow job that executes the workflow from training to
deploying your model.
7. You need to design a customized deep neural network in Keras
that will predict customer purchases based on their purchase
history. You want to explore model performance using multiple
model architectures, store training data, and be able to compare
the evaluation metrics in the same dashboard. What should you
do?
A. Create multiple models using AutoML Tables.
B. Automate multiple training runs using Cloud Composer.
C. Run multiple training jobs on the Vertex AI platform with
similar job names.
D. Create an experiment in Kubeflow Pipelines to organize
multiple runs.
8. You work with a data engineering team that has developed a
pipeline to clean your dataset and save it in a Google Cloud
Storage bucket. You have created an ML model and want to use
the data to refresh your model as soon as new data is available. As
part of your CI/CD workflow, you want to automatically run a
Kubeflow Pipelines training job on Google Kubernetes Engine
(GKE). How should you architect this workflow?
A. Configure your pipeline with Dataflow, which saves the files
in Google Cloud Storage. After the file is saved, start the
training job on a GKE cluster.
B. Use App Engine to create a lightweight Python client that
continuously polls Google Cloud Storage for new files. As
soon as a file arrives, initiate the training job.
C. Configure a Google Cloud Storage trigger to send a message
to a Pub/Sub topic when a new file is available in a storage
bucket. Use a Pub/Sub–triggered Cloud Function to start the
training job on a GKE cluster.
D. Use Cloud Scheduler to schedule jobs at regular intervals. For
the first step of the job, check the time stamp of objects in
your Google Cloud Storage bucket. If there are no new files
since the last run, abort the job.
9. Your data science team needs to rapidly experiment with various
features, model architectures, and hyperparameters. They need to
track the accuracy metrics for various experiments and use an API
to query the metrics over time. What should they use to track and
report their experiments while minimizing manual effort?
A. Use Kubeflow Pipelines to execute the experiments. Export
the metrics file, and query the results using the Kubeflow
Pipelines API.
B. Use Vertex AI Platform Training to execute the experiments.
Write the accuracy metrics to BigQuery, and query the results
using the BigQuery API.
C. Use Vertex AI Platform Training to execute the experiments.
Write the accuracy metrics to Cloud Monitoring, and query
the results using the Monitoring API.
D. Use Vertex AI Workbench Notebooks to execute the
experiments. Collect the results in a shared Google Sheets
file, and query the results using the Google Sheets API.
10. As the lead ML Engineer for your company, you are responsible
for building ML models to digitize scanned customer forms. You
have developed a TensorFlow model that converts the scanned
images into text and stores them in Google Cloud Storage. You
need to use your ML model on the aggregated data collected at the
end of each day with minimal manual intervention. What should
you do?
A. Use the batch prediction functionality of the Vertex AI
platform.
B. Create a serving pipeline in Compute Engine for prediction.
C. Use Cloud Functions for prediction each time a new data
point is ingested.
D. Deploy the model on the Vertex AI platform and create a
version of it for online inference.
11. As the lead ML architect, you are using TensorFlow and Keras as
the machine learning framework and your data is stored in disk
files as block storage. You are migrating to Google Cloud and you
need to store the data in BigQuery as tabular storage. Which of
the following techniques will you use to store TensorFlow storage
data from block storage to BigQuery?
A. tf.data.dataset reader for BigQuery
B. BigQuery Python Client library
C. BigQuery I/O Connector
D. tf.data.iterator
12. As the CTO of a financial company focusing on building AI
models for structured datasets, you decide to store most of the
data used for ML models in BigQuery. Your team is currently
working on TensorFlow and other frameworks. How would they
modify code to access BigQuery data to build their models?
(Choose three.)
A. tf.data.dataset reader for BigQuery
B. BigQuery Python Client library
C. BigQuery I/O Connector
D. BigQuery Omni
13. As the chief data scientist of a retail website, you develop many
ML models in PyTorch and TensorFlow for Vertex AI Training.
You also use Bigtable and Google Cloud Storage. In most cases,
the same data is used for multiple models and projects and also
updated. What is the best way to organize the data in Vertex AI?
A. Vertex AI–managed datasets
B. BigQuery
C. Vertex AI Feature Store
D. CSV
14. You are the data science team lead and your team is working for
a large consulting firm. You are working on an NLP model to
classify customer support requests. You are working on a data
storage strategy to store the data for NLP models. What type of
storage should you avoid in a managed GCP environment in
Vertex AI? (Choose two.)
A. Block storage
B. File storage
C. BigQuery
D. Google Cloud Storage
Chapter 6
Building Secure ML Pipelines
Encryption at Rest
For machine learning models, your data will be in either Cloud Storage
or BigQuery tables. Google encrypts data stored at rest by default for
both Cloud Storage and BigQuery. By default, Google manages the
encryption keys used to protect your data. You can also use customer
managed encryption keys. You can encrypt individual table values in
BigQuery using Authenticated Encryption with Associated Data
(AEAD) encryption functions. Please refer to
https://cloud.google.com/bigquery/docs/reference/standard-
sql/aead-encryption-concepts to understand AEAD BigQuery
encryption functions.
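As an illustration, you can call the AEAD functions from the BigQuery Python client. The following is a minimal sketch; the project, dataset, table, column, and keyset names are hypothetical:

from google.cloud import bigquery

# Encrypt one column with an AEAD keyset looked up from a keyset table.
client = bigquery.Client(project="my-project")
query = """
SELECT
  customer_id,
  AEAD.ENCRYPT(
    (SELECT keyset FROM my_dataset.keysets WHERE key_id = 'k1'),
    email,                        -- plaintext to protect
    CAST(customer_id AS STRING))  -- additional authenticated data
    AS encrypted_email
FROM my_dataset.customers
"""
for row in client.query(query).result():
    print(row["customer_id"])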
Table 6.1 shows the differences between server side encryption and
client side encryption for Cloud Storage and BigQuery.
TABLE 6.1 Difference between server side and client side encryption
Server Side Encryption | Client Side Encryption
Encryption that occurs after Cloud Storage receives your data, but before the data is written to disk and stored. | Encryption that occurs before data is sent to Cloud Storage and BigQuery. Such data arrives at Cloud Storage and BigQuery already encrypted but also undergoes server side encryption.
You can create and manage your encryption keys using a Google Cloud Key Management Service. | You are responsible for the client side keys and cryptographic operations.
Encryption in Transit
To protect your data as it travels over the Internet during read and
write operations, Google Cloud uses Transport Layer Security (TLS).
Encryption in Use
Encryption in use protects your data in memory from compromise or
data exfiltration by encrypting data while it's being processed.
Confidential Computing is one example; it lets you encrypt your data in use with
Confidential VMs and Confidential GKE Nodes. Read this blog for
more details on data security concepts:
https://cloud.google.com/blog/topics/developers-
practitioners/data-security-google-cloud.
The following are the IAM roles that can be used in Vertex AI:
Predefined roles allow you to grant a set of related permissions
to your Vertex AI resources at the project level. Two of the
common predefined roles for Vertex AI are Vertex AI
Administrator and Vertex AI User.
Basic roles such as Owner, Editor, and Viewer provide access
control to your Vertex AI resources at the project level. These
roles are common to all Google Cloud services.
Custom roles allow you to choose a specific set of permissions,
create your own role with those permissions, and grant the role to
users in your organization.
Not all Vertex AI predefined roles and resources support resource level
policies.
For the exam, you will not be tested on these concepts. However, basic
understanding of these concepts helps you to understand how data
and access are controlled in Google Cloud.
Now we will cover how you can secure the following:
Vertex AI Workbench notebook environment
Vertex AI endpoints (public vs. private endpoints)
Vertex AI training jobs
Federated Learning
According to the Google AI blog,
https://ai.googleblog.com/2017/04/federated-learning-
collaborative.html, federated learning is a technique that is used to
enable mobile phones to collaboratively learn a shared prediction
model while keeping all the training data on the device. The device
downloads the model and learns from the device data. This updated
model is then sent to the cloud with encrypted communication. Since
all the training data remains on your device, federated learning allows
for smarter models, lower latency, and less power consumption, all
while ensuring privacy.
An example would be a group of hospitals around the world that are
participating in the same clinical trial. The data that an individual
hospital collects about patients is not shared outside the hospital. As a
result, hospitals can't transfer or share patient data with third parties.
Federated learning lets affiliated hospitals train shared ML models
while still retaining security, privacy, and control of patient data
within each hospital by using a centralized model shared by all the
hospitals. The model is trained on local data in each hospital, and only the
model update is sent back to the centralized cloud server. The model
updates are decrypted, averaged, and integrated into the centralized
model. Iteration after iteration, the collaborative training continues
until the model is fully trained. This way federated learning decouples
the ability to do machine learning from the need to store the data in
the cloud. Refer to this link for more information:
https://cloud.google.com/architecture/federated-learning-google-
cloud.
Differential Privacy
According to https://en.wikipedia.org/wiki/Differential_privacy,
differential privacy (DP) is a system for publicly sharing information
about a dataset by describing the patterns within groups of individuals
within the dataset while withholding information about each
individual in the dataset. For example, when training a machine learning
model for medical diagnosis, we would like to have machine learning
algorithms that do not memorize sensitive information about the
training set, such as the specific medical histories of individual
patients. Differential privacy is a notion that allows quantifying the
degree of privacy protection provided by an algorithm for the
underlying (sensitive) dataset it operates on. Through differential
privacy, we can design machine learning algorithms that responsibly
train models on private data.
You can use both techniques together, federated
learning with differential privacy, to securely train a model with
PII data sitting in distributed silos.
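As an illustration, the open source TensorFlow Privacy library implements differentially private stochastic gradient descent (DP-SGD), which clips each example's gradient and adds calibrated noise. The following is a minimal sketch with illustrative hyperparameter values:

import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import (
    DPKerasSGDOptimizer)

# DP-SGD: clip per-example gradients and add Gaussian noise before updates.
optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,      # per-example gradient clipping bound
    noise_multiplier=1.1,  # noise stddev relative to the clipping bound
    num_microbatches=32,   # must evenly divide the batch size
    learning_rate=0.1,
)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(20,))])
model.compile(
    optimizer=optimizer,
    # reduction=NONE keeps per-example losses so each can be clipped.
    loss=tf.keras.losses.BinaryCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE),
)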
The Google Cloud Healthcare API has the de identify operation, which
removes PHI or otherwise sensitive information from healthcare data.
The healthcare API's de identification is highly configurable and
redacts PHI from text, images, Fast Healthcare Interoperability
Resources (FHIR), and Digital Imaging and Communications in
Medicine (DICOM) data (source:
https://cloud.google.com/healthcare-api/docs/concepts/de-identification).
The Cloud Healthcare API also detects sensitive data
in DICOM instances and FHIR data, such as PHI, and then
uses a de identification transformation to mask, delete, or otherwise
obscure the data.
The PHI targeted by the de identify command includes the 18
identifiers described in the HIPAA Privacy Rule de identification
standard. The HIPAA Privacy Rule does not restrict the use or
disclosure of de identified health information, as it is no longer
considered protected health information.
For CSVs, BigQuery tables, and text strings, the open source DLP API
Dataflow pipeline (see the GitHub repo
https://github.com/GoogleCloudPlatform/healthcare-deid) eases the
process of configuring and running the DLP API on healthcare data.
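To make the DLP API concrete, the following is a minimal de-identification sketch in Python; the project ID and sample text are placeholders:

import google.cloud.dlp_v2

# Mask email addresses found in free text.
dlp = google.cloud.dlp_v2.DlpServiceClient()
response = dlp.deidentify_content(
    request={
        "parent": "projects/my-project",
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [{
                    "primitive_transformation": {
                        "character_mask_config": {"masking_character": "#"}
                    }
                }]
            }
        },
        "item": {"value": "Contact jane.doe@example.com for trial results."},
    }
)
print(response.item.value)  # the email is replaced with # characters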
Summary
In this chapter, we discussed some of the security best practices used
to manage data for machine learning in Google Cloud, such as
encryption at rest and encryption in transit.
We also covered IAM briefly and how to use IAM to provide and
manage access to Vertex AI Workbench for your data science team. We
covered some secure ML development techniques such as federated
learning and differential privacy.
Last, we covered how you can manage PII and PHI data using the
Cloud DLP and Cloud Healthcare APIs. We also covered an
architecture pattern on how you can scale the PII identification and
de identification on a large dataset.
Exam Essentials
Build secure ML systems. Understand encryption at rest and
encryption in transit for Google Cloud. Know how encryption at
rest and in transit works for storing data for machine learning in
Cloud Storage and BigQuery. Know how you can set up IAM roles
to manage your Vertex AI Workbench and how to set up network
security for your Vertex AI Workbench. Last, understand some
concepts such as differential privacy, federated learning, and
tokenization.
Understand the privacy implications of data usage and
collection. Understand the Google Cloud Data Loss Prevention
(DLP) API and how it helps identify and mask PII type data. Also,
understand the Google Cloud Healthcare API to identify and mask
PHI type data. Finally, understand some of the best practices for
removing sensitive data.
Review Questions
1. You are an ML security expert at a bank that has a mobile
application. You have been asked to build an ML based
fingerprint authentication system for the app that verifies a
customer's identity based on their fingerprint. Fingerprints
cannot be downloaded into and stored in the bank databases.
Which learning strategy should you recommend to train and
deploy this ML model and make sure the fingerprints are secure
and protected?
A. Differential privacy
B. Federated learning
C. Tokenization
D. Data Loss Prevention API
2. You work on a growing team of more than 50 data scientists who
all use Vertex AI Workbench. You are designing a strategy to
organize your jobs, models, and versions in a clean and scalable
way. Which strategy is the most managed and requires the least
effort?
A. Set up restrictive IAM permissions on the Vertex AI platform
notebooks so that only a single user or group can access a
given instance.
B. Separate each data scientist's work into a different project to
ensure that the jobs, models, and versions created by each
data scientist are accessible only to that user.
C. Use labels to organize resources into descriptive categories.
Apply a label to each created resource so that users can filter
the results by label when viewing or monitoring the
resources.
D. Set up a BigQuery sink for Cloud Logging logs that is
appropriately filtered to capture information about AI
Platform resource usage. In BigQuery, create a SQL view that
maps users to the resources they are using.
3. You are an ML engineer of a Fintech company working on a
project to create a model for document classification. You have a
big dataset with a lot of PII that cannot be distributed or
disclosed. You are asked to replace the sensitive data with specific
surrogate characters. Which of the following techniques is best to
use?
A. Format preserving encryption or tokenization
B. K anonymity
C. Replacement
D. Masking
4. You are a data scientist of an EdTech company, and your team
needs to build a model on the Vertex AI platform. You need to set
up access to the Vertex AI Python client library from a Google Colab Jupyter
Notebook. What choices do you have? (Choose three.)
A. Create a service account key.
B. Set the environment variable named
GOOGLE_APPLICATION_CREDENTIALS.
C. Give your service account the Vertex AI user role.
D. Use console keys.
E. Create a private account key.
5. You are a data scientist training a deep neural network. The data
you are training contains PII. You have two challenges: first you
need to transform the data to hide PII, and you also need to
manage who has access to this data in various groups in the GCP
environment. What are the choices provided by Google that you
can use? (Choose two.)
A. Network firewall
B. Cloud DLP
C. VPC security control
D. Service keys
E. Differential privacy
6. You are a data science manager and recently your company
moved to GCP. You have to set up a JupyterLab environment for
20 data scientists on your team. You are looking for a low effort,
cost effective way to manage Vertex AI
Workbench so that your instances are only running when the data
scientists are using the notebooks. How would you architect this on
GCP?
A. Use Vertex AI–managed notebooks.
B. Use Vertex AI user managed notebooks.
C. Use Vertex AI user managed notebooks with a script to stop
the instances when not in use.
D. Use a Vertex AI pipeline.
7. You have Fast Healthcare Interoperability Resources (FHIR) data
and you are building a text classification model on patient
notes. You need to remove the PHI from the data. Which service
would you use?
A. Cloud DLP
B. Cloud Healthcare API
C. Cloud NLP API
D. Cloud Vision AI
8. You are an ML engineer of a Fintech company building a real time
prediction engine that streams files that may contain personally
identifiable information (PII) to GCP. You want to use the Cloud
Data Loss Prevention (DLP) API to scan the files. How should you
ensure that the PII is not accessible by unauthorized individuals?
A. Stream all files to Google Cloud, and then write the data to
BigQuery. Periodically conduct a bulk scan of the table using
the DLP API.
B. Stream all files to Google Cloud, and write batches of the data
to BigQuery. While the data is being written to BigQuery,
conduct a bulk scan of the data using the DLP API.
C. Create two buckets of data: sensitive and nonsensitive. Write
all data to the Nonsensitive bucket. Periodically conduct a
bulk scan of that bucket using the DLP API, and move the
sensitive data to the Sensitive bucket.
D. Periodically conduct a bulk scan of the Google Cloud Storage
bucket using the DLP API, and move the data to either the
Sensitive or Nonsensitive bucket.
Chapter 7
Model Building
In this chapter, we will talk about data parallel and model parallel
strategies to use while training a large neural network. Then we will
cover some of the modeling techniques by defining some important
concepts such as gradient descent, learning rate, batch size, and epoch
in a neural network. Then you will learn what happens when we
change these hyperparameters (batch size, learning rate) while
training a neural network.
We are going to discuss transfer learning and how pretrained models
are used to kickstart training when you have limited datasets.
Then we are going to cover semi supervised learning and when to use
this technique. We will also cover data augmentation techniques and
how they can be used in an ML pipeline. Last, we will cover key
concepts such as bias and variance and then discuss how they can lead
to underfit and overfit models. We will also cover strategies for
underfit models and overfit models and detail the regularization
strategy used for overfit models.
Data Parallelism
Data parallelism is when the dataset is split into parts and then
assigned to parallel computational machines or graphics processing
units (GPUs). For every GPU or node, the same parameters are used
for the forward propagation. A small batch of data is sent to every
node, and the gradient is computed normally and sent back to the
main node. There are two strategies when distributed training is
practiced, synchronous and asynchronous. For data parallelism, we
have to reduce the learning rate to keep a smooth training process if
there are too many computational nodes. Refer to
https://analyticsindiamag.com/data-parallelism-vs-model-
parallelism-how-do-they-differ-in-distributed-training for more
details.
Synchronous Training
In synchronous training, the model sends different parts of the data
into each accelerator or GPU. Every GPU has a complete copy of the
model and is trained solely on a part of the data. Every replica
starts a forward pass simultaneously and computes a different output
and gradient. Synchronous training uses an all reduce algorithm,
which collects all the trainable parameters from various workers and
accelerators.
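In TensorFlow, tf.distribute.MirroredStrategy implements this synchronous all reduce pattern on a single machine with multiple GPUs. A minimal sketch:

import tensorflow as tf

# Synchronous data parallelism: each GPU receives a slice of every batch,
# and gradients are combined with all-reduce before the update.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():  # variables created here are mirrored on every GPU
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
# model.fit(dataset) now shards each batch across the replicas.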
Asynchronous Training
Synchronous training can be harder to scale and can result in workers
staying idle at times. In asynchronous training, workers don't have to
wait for each other; all workers independently train over the input
data and update variables
asynchronously. An example is the parameter server strategy for
TensorFlow distributed training. See Figure 7.1 to understand data
parallelism with a parameter server.
Model Parallelism
In model parallelism, the model itself is partitioned into parts, just as
the data is in data parallelism. Each part of the model is then placed
on an individual GPU.
Model parallelism has some obvious benefits. It can be used to train a
model that does not fit into a single GPU. For example, say
we have 10 GPUs and we want to train a simple ResNet50 model. We
could assign the first five layers to GPU 1, the second five layers to
GPU 2, and so on, and the last five layers to GPU 10. During the
training, in each iteration, the forward propagation has to be done in
GPU 1 first and GPU 2 is waiting for the output from GPU 1. Once the
forward propagation is done, we calculate the gradients for the last
layers that reside in GPU 10 and update the model parameters for
those layers in GPU 10. Then the gradients back propagate to the
previous layers in GPU 9. Each GPU/node is like a compartment in the
factory production line; it waits for the products from its previous
compartment and sends its own products to the next compartment.
See Figure 7.2 where the model is split into various GPUs.
Modeling Techniques
Let's go over some basic terminology in neural networks that you
might see in exam questions.
Gradient Descent
The gradient descent algorithm calculates the gradient of the loss
curve at the starting point. The gradient of the loss is equal to the
derivative (slope) of the curve. The gradient has both magnitude and
direction (vector) and always points in the direction of the steepest
increase in the loss function. The gradient descent algorithm takes a
step in the direction of the negative gradient in order to reduce loss as
quickly as possible.
Learning Rate
As we know, the gradient vector has both a direction and a magnitude.
Gradient descent algorithms multiply the gradient by a scalar known
as the learning rate (also sometimes called step size) to determine the
next point. For example, if the gradient magnitude is 2.5 and the
learning rate is 0.01, then the gradient descent algorithm will pick the
next point 0.025 away from the previous point.
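The following toy loop makes this arithmetic concrete for a one variable loss, loss(w) = (w - 3)^2, whose minimum is at w = 3:

# One-variable gradient descent (a minimal sketch).
learning_rate = 0.1
w = 0.0
for step in range(50):
    gradient = 2 * (w - 3)            # d(loss)/dw at the current point
    w = w - learning_rate * gradient  # step in the negative gradient direction
print(round(w, 4))  # converges toward 3.0, the minimum of the loss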
Batch
In gradient descent, a batch is the total number of examples you use to
calculate the gradient in a single iteration. So far, we've assumed that
the batch has been the entire dataset. A very large batch may cause
even a single iteration to take a very long time to compute.
Batch Size
Batch size is the number of examples in a batch. For example, the
batch size of SGD is 1, while the batch size of a mini batch is usually
between 10 and 1,000. Batch size is usually fixed during training and
inference; however, TensorFlow does permit dynamic batch sizes.
Epoch
An epoch means an iteration for training the neural network with all
the training data. In an epoch, we use all of the data exactly once. A
forward pass and a backward pass together are counted as one pass.
An epoch is made up of one or more batches.
Hyperparameters
We covered loss, learning rate, batch size, and epoch. These are the
hyperparameters that you can change while training your ML model.
Most machine learning programmers spend a fair amount of time
tuning the learning rate.
If you pick a learning rate that is too small, learning will take too
long. Conversely, if you pick a learning rate that is too large, the
loss can oscillate and the model may never converge.
Similarly, if you specify a large batch size, each iteration might
take more time to compute.
Transfer Learning
According to the Wikipedia definition, transfer learning
(https://en.wikipedia.org/wiki/Transfer_learning) is a research
problem in machine learning (ML) that focuses on storing knowledge
gained while solving one problem and applying it to a different but
related problem. For example, knowledge gained while learning to
recognize cars could apply when trying to recognize trucks. In deep
learning, transfer learning is a technique whereby a neural network
model is first trained on a problem similar to the problem that is being
solved. One or more layers from the trained model are then used in a
new model trained on the problem of interest.
Transfer learning is an optimization to save time or get better
performance.
You can use an available pretrained model, which can be used as a
starting point for training your own model.
Transfer learning can enable you to develop models even for
problems where you may not have very much data.
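A minimal Keras sketch of this idea freezes an ImageNet pretrained base and trains only a new classification head; the input shape and the three class head are illustrative:

import tensorflow as tf

# Reuse pretrained ImageNet features; train only the new output layer.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pretrained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),  # new task's classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])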
Semi‐supervised Learning
Semi supervised learning (SSL) is a third type of learning. It is an
approach to machine learning that combines a small number of
labeled examples with a large amount of unlabeled data during
training. It falls between unsupervised learning (with no labeled
training data) and supervised learning (with only labeled training
data).
Limitations of SSL
With a minimal amount of labeled data and plenty of unlabeled data,
semi supervised learning shows promising results in classification
tasks. But it doesn't mean that semi supervised learning is applicable
to all tasks. If the portion of labeled data isn't representative of the
entire distribution, the approach may fall short.
Data Augmentation
Neural networks typically have a lot of parameters. You would need to
show your machine learning model a proportional number of
examples to get good performance. Also, the number of parameters
you need is proportional to the complexity of the task your model has
to perform.
You need a large amount of data examples to train neural networks. In
most of the use cases, it's difficult to find a relevant dataset with a
large number of examples. So, to get more data or examples to train
the neural networks, you need to make minor alterations to your
existing dataset—minor changes such as flips or translations or
rotations. Our neural network would think these are distinct images.
This is data augmentation, where we train our neural network with
synthetically modified data (orientation, flips, or rotation) in the case
of limited data.
Even if you have a large amount of data, it can help to increase the
amount of relevant data in your dataset. There are two ways you can
apply augmentation in your ML pipeline: offline augmentation and
online augmentation.
Offline Augmentation
In offline augmentation, you perform all the necessary
transformations beforehand, essentially increasing the size of your
dataset. This method is preferred for relatively smaller datasets
because you would end up increasing the size of the dataset by a factor
equal to the number of transformations you perform. For example, by
rotating all images, you can increase the size of the dataset by a factor
of 2.
Online Augmentation
In online augmentation, you perform data augmentation
transformations on a mini batch, just before feeding it to your
machine learning model. This is also known as augmentation on the
fly. This method is preferred for large datasets because the transformed
mini batches are generated on demand and never need to be stored.
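A minimal sketch of online augmentation with Keras preprocessing layers follows; the specific transformations and factors are illustrative:

import tensorflow as tf

# On-the-fly augmentation: each mini batch is randomly transformed
# during training only.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),  # up to +/-10% of a full turn
    tf.keras.layers.RandomZoom(0.1),
])
# Assuming train_ds yields (image, label) batches:
# train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))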
The following list includes some of the data augmentation techniques
for images:
Flip
Rotate
Crop
Scale
Gaussian noise (adding just the right amount of noise to enhance
the learning capability)
Translate
Conditional generative adversarial networks (GANs) to transform
an image from one domain to an image in another domain
Transfer learning to give the models a better chance with the
scarce amount of data
Underfitting
An underfit model fails to sufficiently learn the problem and performs
poorly on a training dataset and does not perform well on a test or
validation dataset. In the context of the bias variance trade off, an
underfit model has high bias and low variance. Regardless of the
specific samples in the training data, it cannot learn the problem.
There are a couple of reasons for model underfitting:
Data used for training is not cleaned.
The model has a high bias.
Unlike an overfit model, whose performance varies widely on unseen
examples, an underfit model performs poorly even on the data it was
trained on. Here are some of the ways to reduce underfitting:
Increase model complexity.
Increase the number of features by performing feature
engineering.
Remove noise from the data.
Increase the number of epochs or increase the duration of
training to get better results.
Overfitting
The model learns the training data too well and performance varies
widely with new unseen examples or even statistical noise added to
examples in the training dataset. An overfit model has low bias and
high variance. There are two ways to approach an overfit model:
Reduce overfitting by training the network on more examples.
Reduce overfitting by changing the complexity of network
structure and parameters.
Here are some of the ways to avoid overfitting:
Regularization technique: explained in the next section.
Dropout: Probabilistically remove inputs during training.
Noise: Add statistical noise to inputs during training.
Early stopping: Monitor model performance on a validation set
and stop training when performance degrades.
Data augmentation.
Cross validation.
Regularization
To combat overfitting, regularization comes into play and shrinks the learned estimates
toward 0. In other words, it tunes the loss function by adding a penalty
term that prevents excessive fluctuation of the coefficients, thereby
reducing the chances of overfitting.
L1 and L2 are two common regularization methods. You will use L1
when you are trying to reduce features and L2 when you are looking
for a stable model. Table 7.3 summarizes the difference between the
two techniques.
TABLE 7.3 Differences between L1 and L2 regularization
L1 Regularization | L2 Regularization
L1 regularization, also known as the L1 norm or lasso (in regression problems), combats overfitting by shrinking the parameters toward 0. This makes some features obsolete, so it works well for feature selection when you have a huge number of features. | L2 regularization, also known as the L2 norm or ridge (in regression problems), combats overfitting by forcing weights to be small but not making them exactly 0.
L1 regularization penalizes the sum of the absolute values of the weights. | L2 regularization penalizes the sum of the squares of the weights.
L1 regularization has built in feature selection. | L2 regularization doesn't perform feature selection.
L1 regularization is robust to outliers. | L2 regularization is not robust to outliers.
L1 regularization helps with feature selection and reduces model size, leading to smaller models. | L2 regularization always improves generalization in linear models.
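In Keras, you can attach either penalty to a layer through its kernel regularizer; the 0.01 factors below are illustrative:

import tensorflow as tf

# L1 drives some weights to exactly 0; L2 keeps all weights small.
l1_dense = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l1(0.01))
l2_dense = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(0.01))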
If your losses look good but you want to reduce the training loss
further, you can try the following techniques:
Increase the depth and width of your neural network (to increase
its predictive power).
If the features don't add information relative to existing features,
try a different feature.
Decrease the learning rate.
To evaluate your model, if you have lots of data, use held out test
data; if you have little data, use cross validation or bootstrapping.
If the model you have trained is not converging and the loss is
bouncing around, this can be due to the following:
Features might not have predictive power.
Raw data might not comply with the defined schema.
The learning rate is too high, and you need to decrease it.
To debug, reduce your training set to a few examples and confirm
that the model can reach a very low loss, and start with one or two
features (and a simple model) that you know have predictive power
to see whether the model outperforms your baseline.
Summary
In this chapter, we discussed model parallelism and data parallelism
and some strategies to use while training a TensorFlow model with a
model and data parallel approach.
You learned about modeling techniques such as what loss function to
choose while training a neural network. We covered important
concepts related to training neural networks such as gradient descent,
learning rate, batch size, epoch, and hyperparameters.
We also covered the importance of these hyperparameters when
training a neural network—for example, what happens when we
decrease learning rate or increase the epoch while training the
network.
We discussed transfer learning and the advantages of using it. We also
covered semi supervised learning: when you need semi supervised
learning along with its limitations.
We discussed data augmentation techniques. You use online
augmentation when you have a large dataset and offline augmentation
when you have a small dataset. We also covered techniques such as
rotation and flipping to augment your existing dataset.
Finally, we discussed model underfitting, model overfitting, and
regularization concepts.
Exam Essentials
Choose either framework or model parallelism.
Understand multinode training strategies to train a large neural
network model. The strategy can be data parallel or model
parallel. Also, know what strategies can be used for distributed
training of TensorFlow models.
Understand modeling techniques. Understand when to use
which loss function (sparse cross entropy versus categorical cross
entropy). Understand important concepts such as gradient
descent, learning rate, batch size, and epoch. Also understand that
these are hyperparameters and know some strategies to tune
these hyperparameters to minimize loss or error rate while
training your model.
Understand transfer learning. Understand what transfer
learning is and how it can help with training neural networks with
limited data as these are pretrained models trained on large
datasets.
Use semi supervised learning (SSL). Understand semi
supervised learning and when you need to use this method. Also
know the limitations of SSL.
Use data augmentation. You need to understand data
augmentation and how you can apply it in your ML pipeline
(online versus offline). You also need to learn some key data
augmentation techniques such as flipping, rotation, GANs, and
transfer learning.
Understand model generalization and strategies to
handle overfitting and underfitting. You need to understand
bias variance trade off while training a neural network. Know the
strategies to handle underfitting as well as strategies to handle
overfitting, such as regularization. You need to understand the
difference between L1 and L2 regularization and when to apply
which approach.
Review Questions
1. Your data science team trained and tested a deep neural net
regression model with good results in development. In
production, six months after deployment, the model is performing
poorly due to a change in the distribution of the input data. How
should you address the input differences in production?
A. Perform feature selection on the model using L1
regularization and retrain the model with fewer features.
B. Retrain the model, and select an L2 regularization parameter
with a hyperparameter tuning service.
C. Create alerts to monitor for skew, and retrain the model.
D. Retrain the model on a monthly basis with fewer features.
2. You are an ML engineer of a start up and have trained a deep
neural network model on Google Cloud. The model has low loss
on the training data but is performing worse on the validation
data. You want the model to be resilient to overfitting. Which
strategy should you use when retraining the model?
A. Optimize for the L1 regularization and dropout parameters.
B. Apply an L2 regularization parameter of 0.4, and decrease
the learning rate by a factor of 10.
C. Apply a dropout parameter of 0.2.
D. Optimize for the learning rate, and increase the number of
neurons by a factor of 2.
3. You are a data scientist of a Fintech company training a computer
vision model that predicts the type of government ID present in a
given image using a GPU powered virtual machine on Compute
Engine. You use the following parameters: Optimizer: SGD,
Image shape = 224x224, Batch size = 64, Epochs = 10, and
Verbose = 2.
During training you encounter the following error:
“ResourceExhaustedError: Out of Memory (OOM) when allocating
tensor.” What should you do?
A. Change the optimizer.
B. Reduce the batch size.
C. Change the learning rate.
D. Reduce the image shape.
4. You are a data science manager of an EdTech company and your
team needs to build a model that predicts whether images contain
a driver's license, passport, or credit card. The data engineering
team already built the pipeline and generated a dataset composed
of 20,000 images with driver's licenses, 2,000 images with
passports, and 2,000 images with credit cards. You now have to
train a model with the following label map: ['drivers_license',
'passport', 'credit_card']. Which loss function should you use?
A. Categorical hinge
B. Binary cross entropy
C. Categorical cross entropy
D. Sparse categorical cross entropy
5. You are a data scientist training a deep neural network. During
batch training of the neural network, you notice that there is an
oscillation in the loss. How should you adjust your model to
ensure that it converges?
A. Increase the size of the training batch.
B. Decrease the size of the training batch.
C. Increase the learning rate hyperparameter.
D. Decrease the learning rate hyperparameter.
6. You have deployed multiple versions of an image classification
model on the Vertex AI platform. You want to monitor the
performance of the model versions over time. How should you
perform this comparison?
A. Compare the loss performance for each model on a held out
dataset.
B. Compare the loss performance for each model on the
validation data.
C. Compare the mean average precision across the models using
the Continuous Evaluation feature.
D. Compare the ROC curve for each model.
7. You are training an LSTM based model to summarize text using
the following hyperparameters: epoch = 20, batch size =32, and
learning rate = 0.001. You want to ensure that training time is
minimized without significantly compromising the accuracy of
your model. What should you do?
A. Modify the epochs parameter.
B. Modify the batch size parameter.
C. Modify the learning rate parameter.
D. Increase the number of epochs.
8. Your team needs to build a model that predicts whether images
contain a driver's license or passport. The data engineering team
already built the pipeline and generated a dataset composed of
20,000 images with driver's licenses and 5,000 images with
passports. You have transformed the features into one hot
encoded value for training. You now have to train a model to
classify these two classes; which loss function should you use?
A. Sparse categorical cross entropy
B. Categorical cross entropy
C. Categorical hinge
D. Binary cross entropy
9. You have developed your own DNN model with TensorFlow to
identify products for an industry. During training, your custom
model converges but the tests are giving unsatisfactory results.
What do you think is the problem and how can you fix it? (Choose
two.)
A. You have to change the algorithm to XGBoost.
B. You have an overfitting problem.
C. You need to increase your learning rate hyperparameter.
D. The model is complex and you need to regularize the model
using L2.
E. Reduce the batch size.
10. As the lead ML engineer for your company, you are building a
deep neural network TensorFlow model to optimize customer
satisfaction. Your focus is to minimize bias and increase accuracy
for the model. Which other parameter do you need to consider so
that your model converges while training and doesn't lead to
underfit or overfit problems?
A. Learning rate
B. Batch size
C. Variance
D. Bagging
11. As a data scientist, you are working on building a DNN model for
text classification using Keras TensorFlow. Which of the following
techniques should not be used? (Choose two.)
A. Softmax function
B. Categorical cross entropy
C. Dropout layer
D. L1 regularization
E. K means
12. As the ML developer for a gaming company, you are asked to
create a game in which the characters look like human players.
You have been asked to generate the avatars for the game.
However, you have very limited data. Which technique would you
use?
A. Feedforward neural network
B. Data augmentation
C. Recurrent neural network
D. Transformers
13. You are working on building a TensorFlow model for binary
classification with a lot of categorical features. You have to encode
them with a limited set of numbers. Which activation function will
you use for the task?
A. One hot encoding
B. Sigmoid
C. Embeddings
D. Feature cross
14. You are the data scientist working on building a TensorFlow
model to optimize the level of customer satisfaction for after sales
service. You are struggling with learning rate, batch size, and
epoch to optimize and converge your model. What is your
problem in ML?
A. Regularization
B. Hyperparameter tuning
C. Transformer
D. Semi supervised learning
15. You are a data scientist working for a start up on several projects
with TensorFlow. You need to increase the performance of the
training and you are already using caching and prefetching. You
want to use GPU for training but you have to use only one
machine to be cost effective. Which of the following tf.distribute
strategies should you use?
A. MirroredStrategy
B. MultiWorkerMirroredStrategy
C. TPUStrategy
D. ParameterServerStrategy
Chapter 8
Model Training and Hyperparameter Tuning
In this chapter, we will talk about various file types and how they can
be stored and ingested for AI/ML workloads in GCP. Then we will talk
about how you can train your model using Vertex AI training. Vertex
AI training supports frameworks such as scikit learn, TensorFlow,
PyTorch, and XGBoost. We will talk about how you can train a model
using prebuilt containers and custom containers. We will also cover
why and how you can unit test the data and model for machine
learning. Then, we will cover hyperparameter tuning and various
search algorithms for hyperparameter tuning available in Google
Cloud. We will also cover Vertex AI Vizier and how it's different than
hyperparameter tuning. We will talk about how you can track and
debug your training model in Vertex AI metrics using the Vertex AI
interactive shell, TensorFlow Profiler, and What If Tool. Last, we are
going to talk about data drift, concept drift, and when you should
retrain your model to avoid drift.
Collect
If you need to collect batch or streaming data from various sources
such as IoT devices, e commerce websites, or any third party
applications, you can use the following Google Cloud services.
Pub/Sub and Pub/Sub Lite for real time streaming:
Pub/Sub is a serverless scalable service (1 KB to 100 GB with
consistent performance) for messaging and real time analytics.
Pub/Sub can both publish and subscribe across the globe,
regardless of where your ingestion or processing applications live.
It has deep integration with processing services (Dataflow) and
analytics services (BigQuery). You can directly stream data from a
third party to BigQuery using Pub/Sub (see the publishing sketch
after this list). Pub/Sub Lite is also a
serverless offering that optimizes for cost over reliability. Pub/Sub
Lite is good for workloads with a more predictable and consistent
load.
Datastream for moving on premise Oracle and MySQL
databases to Google Cloud data storage: Datastream is a
serverless and easy to use change data capture (CDC) and
replication service. It allows you to synchronize data across
heterogeneous databases and applications reliably and with
minimal latency and downtime. Datastream supports streaming
from Oracle and MySQL databases into Cloud Storage.
Datastream is integrated with Dataflow, and it leverages Dataflow
templates to load data into BigQuery, Cloud Spanner, and Cloud
SQL.
BigQuery Data Transfer Service: You can load data from the
following sources to BigQuery using the BigQuery Data Transfer
Service:
Data warehouses such as Teradata and Amazon Redshift
External cloud storage provider Amazon S3
Google software as a service (SaaS) apps such as Cloud
Storage, Google Ads, etc.
After you configure a data transfer, the BigQuery Data Transfer
Service automatically loads data into BigQuery on a regular basis.
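The publishing sketch referenced above is shown here; the project ID, topic name, and message payload are placeholders:

from google.cloud import pubsub_v1

# Publish one event to a Pub/Sub topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "clickstream-events")
future = publisher.publish(topic_path, b'{"user_id": 42, "action": "view"}')
print("Published message ID:", future.result())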
Process
Once you have collected the data from various sources, you need tools
to process or transform the data before it is ready for ML training. The
following sections cover some of the tools that can help.
Cloud Dataflow
Cloud Dataflow is a serverless, fully managed data processing or ETL
service to process streaming and batch data. Dataflow is based on Apache
Beam; Google open sourced the Dataflow SDK, which became the
Apache Beam project. Apache Beam offers exactly once
streaming semantics, which means it has mechanisms in place to
process each message not only at least once, but exactly one time. This
simplifies your business logic because you don't have to worry about
handling duplicates or errors.
Dataflow pipelines perform a set of actions, and
the service allows you to build pipelines, monitor their execution, and
transform and analyze data. Dataflow aims to address the performance issues
of MapReduce when building pipelines. Many Hadoop workloads can
be done easily and be more maintainable with Dataflow. Cloud
Dataflow allows you to process and read data from source Google
Cloud data services to sinks as shown in Figure 8.2.
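A minimal Apache Beam pipeline sketch is shown below; the bucket paths are placeholders, and you would pass --runner=DataflowRunner (plus project, region, and temp_location options) to execute it on Cloud Dataflow:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Read raw CSV lines, keep well-formed rows, and write cleaned output.
with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (pipeline
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv")
     | "Parse" >> beam.Map(lambda line: line.split(","))
     | "KeepValid" >> beam.Filter(lambda cols: len(cols) == 5)
     | "Format" >> beam.Map(",".join)
     | "Write" >> beam.io.WriteToText("gs://my-bucket/clean/part"))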
Cloud Dataproc
Dataproc is a fully managed and highly scalable service for running
Apache Spark, Apache Flink, Presto, and 30+ open source tools and
frameworks. Dataproc lets you take advantage of open source data
tools for batch processing, querying, streaming, and machine learning.
Dataproc automation helps you create clusters quickly, manage them
easily, and save money by turning them off when you do not need
them.
Dataproc has built in integration with other Google Cloud Platform
services such as BigQuery, Cloud Storage, Cloud Bigtable, Cloud
Logging, and Cloud Monitoring, which provides a complete data
platform. For example, you can use Dataproc to effortlessly ETL
(Extract Transform Load) terabytes of raw log data directly into
BigQuery for business reporting. Dataproc uses the Hadoop
Distributed File System (HDFS) for storage. Additionally, Dataproc
automatically installs the HDFS compatible Cloud Storage connector,
which enables the use of Cloud Storage in parallel with HDFS. Data
can be moved in and out of a cluster through upload/download to
HDFS or Cloud Storage. Table 8.1 summarizes connectors with
Dataproc.
TABLE 8.1 Dataproc connectors
Connector | Description
Cloud Storage connector | This is available by default on Dataproc, and it helps run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage. Store your data in Cloud Storage and access it directly with the Cloud Storage connector; you do not need to transfer it into HDFS first.
BigQuery connector | You can use the BigQuery connector to enable programmatic read/write access to BigQuery. This is an ideal way to process data that is stored in BigQuery, as command line access is not exposed. The BigQuery connector is a library that enables Spark and Hadoop applications to process data from BigQuery and write data to BigQuery. The BigQuery Spark connector is used for Spark, and the BigQuery Hadoop connector is used for Hadoop.
BigQuery Spark connector | An Apache Spark SQL connector for Google BigQuery. The connector supports reading Google BigQuery tables into Spark DataFrames and writing DataFrames back into BigQuery. This is done by using the Spark SQL Data Source API to communicate with BigQuery.
Cloud Bigtable with Dataproc | Bigtable is an excellent option for any Apache Spark or Hadoop uses that require Apache HBase. Bigtable supports the Apache HBase APIs, so it is easy to use Bigtable with Dataproc.
Pub/Sub Lite Spark connector | The Pub/Sub Lite Spark connector supports Pub/Sub Lite as an input source to Apache Spark Structured Streaming in the default micro batch processing mode and the experimental continuous processing mode.
All Cloud Dataproc clusters come with the BigQuery connector for
Hadoop built in so that you can easily and quickly read and write
BigQuery data to and from Cloud Dataproc.
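As an illustration, a minimal PySpark sketch using the Spark BigQuery connector on a Dataproc cluster might look like the following; the output table and temporary bucket names are placeholders.

# A minimal PySpark sketch using the Spark BigQuery connector; names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('bq-example').getOrCreate()

# Read a public BigQuery table into a Spark DataFrame
df = (spark.read.format('bigquery')
      .option('table', 'bigquery-public-data.samples.shakespeare')
      .load())

word_counts = df.groupBy('word').sum('word_count')

# Writing back to BigQuery requires a temporary Cloud Storage bucket
(word_counts.write.format('bigquery')
 .option('table', 'your_dataset.word_counts')
 .option('temporaryGcsBucket', 'your-temp-bucket')
 .save())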
Cloud Composer
There are multiple ways of creating, running, and managing workflows,
such as running cron jobs, writing scripts, and creating custom
applications. Each approach has pros and cons, and more importantly,
each carries management overhead.
That is why we have Cloud Composer, which is a fully managed data
workflow orchestration service that allows you to author, schedule,
and monitor pipelines. Cloud Composer is built on Apache Airflow,
and pipelines are configured as directed acyclic graphs (DAGs) using
Python.
It supports hybrid and multicloud architecture to manage your
workflow pipelines whether it's on premises, in multiple clouds, or
fully within Google Cloud.
Cloud Composer provides end to end integration with Google Cloud
products including BigQuery, Dataflow, Dataproc, Datastore, Cloud
Storage, Pub/Sub, and Vertex AI Platform, which gives users the
freedom to fully orchestrate their pipeline.
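As a flavor of what a Composer pipeline looks like, here is a minimal Airflow DAG sketch in Python; the DAG id, schedule, and commands are illustrative assumptions.

# A minimal Airflow DAG sketch for Cloud Composer; names and schedule are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='example_etl',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    extract = BashOperator(task_id='extract', bash_command='echo extract')
    transform = BashOperator(task_id='transform', bash_command='echo transform')
    load = BashOperator(task_id='load', bash_command='echo load')

    extract >> transform >> load  # task dependencies form the DAG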
Cloud Dataprep
We covered Cloud Dataprep in Chapter 3. Cloud Dataprep is a UI
based ETL tool to visually explore, clean, and prepare structured and
unstructured data for analysis, reporting, and machine learning at any
scale.
Data Integration
Click the Browse GCS icon on the left navigation bar (Figure 8.7) to
browse and load data from Cloud Storage folders.
FIGURE 8.7 Data integration with Google Cloud Storage within a
managed notebook
BigQuery Integration
Click the BigQuery icon on the left as shown in Figure 8.8 to get data
from your BigQuery tables. The interface also has an Open SQL editor
option to query these tables without leaving the JupyterLab interface.
Pre‐built Containers
Vertex AI supports scikit-learn, TensorFlow, PyTorch, and XGBoost
containers hosted on the container registry for prebuilt training.
Google manages all the container images and their versions. To set up
training with a prebuilt container, follow these steps:
1. You need to organize your code according to the application
structure as shown in Figure 8.18. You should have a root folder
with setup.py and a trainer folder with task.py (training code),
which is the entry point for a Vertex AI training job. You can use
standard dependencies, or libraries not in the prebuilt container,
by specifying them in setup.py.
2. You need to upload your training code as Python source
distribution to a Cloud Storage bucket before you start training
with a prebuilt container. You use the sdist command to create a
source distribution, for example, python setup.py sdist
--formats=gztar,zip. Figure 8.18 shows the folder structure and
architecture for a prebuilt container.
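A minimal setup.py sketch for this layout might look like the following; the package version and the extra dependency are illustrative assumptions.

# A minimal setup.py sketch for prebuilt-container training; values are placeholders.
from setuptools import find_packages, setup

setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    # Dependencies not already included in the prebuilt container
    install_requires=['cloudml-hypertune'],
    description='Vertex AI training application.',
)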
Custom Containers
A custom container is a Docker image that you create to run your
training application. The following are some of the benefits of using
a custom container versus a prebuilt one:
Faster start up time. If you use a custom container with your
dependencies preinstalled, you can save the time that your
training application would otherwise take to install dependencies
when starting up.
Use the ML framework of your choice. If you can't find a
Vertex AI prebuilt container with the ML framework you want to
use, you can build a custom container with your chosen
framework and use it to run jobs on Vertex AI. For example, you
can use a custom container to train with PyTorch.
Extended support for distributed training. With custom
containers, you can do distributed training using any ML
framework.
Use the newest version. You can also use the latest build or
minor version of an ML framework. For example, you can build a
custom container to train with tf-nightly.
Figure 8.20 shows the architecture of how custom container training
works on Google Cloud. You build a container using a Dockerfile and
training file with a recommended folder structure. You build your
Dockerfile and push it to an Artifact Registry. For Vertex AI training
with a custom container, you specify the dataset (managed), custom
container image URI you pushed to the repository, and compute (VM)
instances to train on.
FIGURE 8.20 Vertex AI training architecture for custom containers
To create a custom container for Vertex AI training, you need to create
a Dockerfile, build the Dockerfile, and push it to an Artifact Registry.
These are the steps:
1. Create a custom container and training file.
a. Set up your files as per required folder structure: you need to
create a root folder. Then create a Dockerfile and a folder
named trainer/. In that trainer folder you need to create
task.py (your training code). This task.py file is the entry
point file.
b. Create a Dockerfile. Your Dockerfile needs to include
commands such as those shown in the following code: choosing
a base image, installing any additional dependencies, copying
your training code to the image, and configuring the entry
point to invoke your training code.
# Specifies base image and tag
FROM image:tag
WORKDIR /root
# Copies the training code and configures the entry point
COPY trainer /root/trainer
ENTRYPOINT ["python", "-m", "trainer.task"]
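2. Build the Dockerfile into a container image, tagging it with an
Artifact Registry path. A hedged sketch, where LOCATION, PROJECT_ID,
REPO_NAME, and IMAGE_NAME are placeholders:
docker build -t LOCATION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE_NAME .
3. Push the image to Artifact Registry:
docker push LOCATION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE_NAME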
4. After pushing the image to the repository, you can start training
by creating a custom job using the following command:
gcloud ai custom-jobs create \
--region=LOCATION \
--display-name=JOB_NAME \
--worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI
Distributed Training
You need to specify multiple machines (nodes) in a training cluster in
order to run a distributed training job with Vertex AI. The training
service allocates the resources for the machine types you specify. The
running job on a given node is called a replica. A group of replicas
with the same configuration is called a worker pool. You can configure
any custom training job as a distributed training job by defining
multiple worker pools. You can also run distributed training within a
training pipeline or a hyperparameter tuning job. Use an ML
framework that supports distributed training. In your training code,
you can use the CLUSTER_SPEC or TF_CONFIG environment
variables to reference specific parts of your training cluster. Please
refer to Table 8.4 to understand worker pool tasks in distributed
training.
TABLE 8.4 Worker pool tasks in distributed training

First (workerPoolSpecs[0]): Primary, chief, scheduler, or “master.” Exactly one replica is designated the primary replica. This task manages the others and reports status for the job as a whole.

Second (workerPoolSpecs[1]): Secondary replicas, or workers. One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.

Third (workerPoolSpecs[2]): Parameter servers and Reduction Server. If supported by your ML framework, one or more replicas may be designated as parameter servers; these replicas store model parameters to coordinate shared model state between the workers. Reduction Server is an all-reduce implementation that can increase throughput and reduce latency for distributed training. You can use this option if you are doing distributed training with GPU workers and your training code uses TensorFlow or PyTorch configured for multi-host data-parallel training with GPUs using NCCL all-reduce.
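For example, you can define multiple worker pools by repeating the --worker-pool-spec flag on the custom jobs command. In the following hedged sketch, the first flag defines the primary replica and the second defines three workers; the machine types, counts, and image URI are placeholders:
gcloud ai custom-jobs create \
--region=LOCATION \
--display-name=JOB_NAME \
--worker-pool-spec=machine-type=n1-standard-8,replica-count=1,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI \
--worker-pool-spec=machine-type=n1-standard-8,replica-count=3,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI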
Hyperparameter Tuning
Hyperparameters are parameters of the training algorithm itself that
are not learned directly from the training process. Let us look at an
example of a simple feed-forward deep neural network (DNN) trained
using gradient descent. One of the hyperparameters in gradient
descent is the learning rate, which we covered in Chapter 7. The
learning rate must be set up front, before any learning can begin.
Therefore, finding the right learning rate involves choosing a value,
training a model, evaluating it, and trying again.
Figure 8.21 summarizes the difference between a parameter and
hyperparameter.
# DEFINE METRIC (requires the cloudml-hypertune package)
import hypertune

# Last validation accuracy from the Keras History object
hp_metric = history.history['val_accuracy'][-1]

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='accuracy',
    metric_value=hp_metric,
    global_step=NUM_EPOCHS)
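To launch the tuning job itself, a minimal sketch with the Vertex AI Python SDK might look like the following; the display names, metric, parameter range, and worker pool are illustrative assumptions.

# A hedged sketch of launching a hyperparameter tuning job; values are placeholders.
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(project='your-project-id', location='us-central1')

custom_job = aiplatform.CustomJob(
    display_name='trainer-job',
    worker_pool_specs=[{
        'machine_spec': {'machine_type': 'n1-standard-4'},
        'replica_count': 1,
        'container_spec': {'image_uri': 'CUSTOM_CONTAINER_IMAGE_URI'},
    }],
)

hp_job = aiplatform.HyperparameterTuningJob(
    display_name='hp-tuning-job',
    custom_job=custom_job,
    metric_spec={'accuracy': 'maximize'},   # matches the reported metric tag
    parameter_spec={
        'learning_rate': hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale='log'),
    },
    max_trial_count=20,
    parallel_trial_count=3,
)
hp_job.run()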
You can track the progress of this job in the Vertex AI console under
Vertex AI Training.
Vertex AI Vizier
Vertex AI Vizier is a black box optimization service that helps you tune
hyperparameters in complex ML models. Vertex AI Vizier is suited to
problems that meet the following criteria:
You don't have a known objective function to evaluate.
The objective function is too costly to evaluate, usually due to the
complexity of the system.
Vertex AI Vizier optimizes hyperparameters of ML models, but it can
also perform other optimization tasks, such as tuning model
parameters, and it works with any system that you can evaluate.
Some of the examples or use cases where you can use Vertex AI Vizier
for hyperparameter tuning are as follows:
Optimize the learning rate, batch size, and other hyperparameters
of a neural network recommendation engine.
Optimize usability of an application by testing different
arrangements of user interface elements.
Minimize computing resources for a job by identifying an ideal
buffer size and thread count.
Optimize the amounts of ingredients in a recipe to produce the
most delicious version.
TensorFlow Profiler
Vertex AI TensorBoard is an enterprise ready managed version of
TensorBoard. Vertex AI TensorBoard Profiler lets you monitor and
optimize your model training performance by helping you understand
the resource consumption of training operations. This can help you
pinpoint and fix performance bottlenecks so you can train models
faster and more cheaply.
There are two ways to access the Vertex AI TensorBoard Profiler
dashboard from the Google Cloud console:
From the custom jobs page
From the experiments page
To capture a profiling session, your training job must be in the
RUNNING state.
TF Profiler allows you to profile your remote Vertex AI training jobs
on demand and visualize the results in Vertex TensorBoard. For
details, see Profile model training performance using Profiler at
https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-
profiler.
What‐If Tool
You can use the What If Tool (WIT) within notebook environments to
inspect AI Platform Prediction models through an interactive
dashboard. The What If Tool integrates with TensorBoard, Jupyter
Notebooks, Colab notebooks, and JupyterHub. It is also preinstalled
on Vertex AI Workbench user managed notebooks and TensorFlow
instances.
To use WIT, you need to install the witwidget library, which is already
installed in Vertex AI Workbench. Then configure WitConfigBuilder to
either inspect a model or compare two models. The following is some
example code to inspect a model:
from witwidget.notebook.visualization import WitConfigBuilder, WitWidget

PROJECT_ID = 'YOUR_PROJECT_ID'
MODEL_NAME = 'YOUR_MODEL_NAME'
VERSION_NAME = 'YOUR_VERSION_NAME'
TARGET_FEATURE = 'mortgage_status'
LABEL_VOCAB = ['denied', 'approved']

# test_examples, features, and adjust_prediction are defined elsewhere
config_builder = (WitConfigBuilder(test_examples.tolist(),
                  features.columns.tolist() + ['mortgage_status'])
    .set_ai_platform_model(PROJECT_ID, MODEL_NAME, VERSION_NAME,
                           adjust_prediction=adjust_prediction)
    .set_target_feature(TARGET_FEATURE)
    .set_label_vocab(LABEL_VOCAB))
Then pass the config builder to WitWidget, and set a display height.
WitWidget(config_builder, height=800)
Retraining/Redeployment Evaluation
After the model is trained and deployed in the real world, model
performance changes over time; your model is sensitive to change
because user behavior and data keep changing. Although all machine
learning models decay, the speed of decay varies from model to model.
This decay is mostly caused by data drift, concept drift, or both.
Let us understand these terms.
Data Drift
Data drift is a change in the statistical distribution of production data
from the baseline data used to train or build the model. You can detect
data drift if the feature attribution of your model changes or the data
itself changes. For example, suppose you built a model with
temperature data collected from an IoT sensor in Fahrenheit degrees
but the unit changed to Celsius. This means there has been a change in
your input data, so the data has drifted. You can detect data drift by
examining the feature distribution or correlation between features or
checking the data schema over baseline using a monitoring system.
Concept Drift
Concept drift is a phenomenon where the statistical properties of the
target variable you're trying to predict change over time. For example,
say you build a model to classify positive and negative sentiment of
Reddit posts about certain topics. Over time, people's sentiments
about these topics change, and posts that once expressed positive
sentiment may evolve to be negative.
In order to detect drift, you need to monitor your deployed model,
which can be done by Vertex AI Model Monitoring. We will cover this
topic in detail in Chapter 12, “Model Monitoring, Tracking and
Auditing Metadata.”
Summary
In this chapter, we discussed various file types such as structured,
unstructured, and semi structured and how they can be stored and
ingested for AI/ML workloads in GCP. We divided the file ingestion
into Google Cloud Platform into stages such as collect, process, store,
and analyze and discussed services that can help at each stage. We
covered Pub/Sub and Pub/Sub Lite to collect real time data and
BigQuery Data Transfer Service and Datastream to migrate data from
third party sources and databases to Google Cloud. In the process
phase, we covered how we can transform the data using services such
as Cloud Dataflow, Cloud Data Fusion, Cloud Dataproc, Cloud
Composer, and Cloud Dataprep.
Then we talked about how you can train your model using Vertex AI
training. Vertex AI training supports frameworks such as scikit learn,
TensorFlow, PyTorch, and XGBoost. We talked about how you can
train a model using prebuilt containers and custom containers.
We also covered why and how you can unit test the data and model for
machine learning.
Then we covered hyperparameter tuning and various search
algorithms for hyperparameter tuning available in Google Cloud, as
well as Vertex AI Vizier and how it differs from standard
hyperparameter tuning.
You learned how you can track and debug your training model in
Vertex AI metrics using Vertex AI interactive shell, TensorFlow
Profiler, and the What If Tool.
You also learned about data drift and concept drift and when you
should retrain your model to avoid drift.
Exam Essentials
Know how to ingest various file types into training.
Understand the various file types, such as structured (for
example, CSV), unstructured (for example, text files), and semi
structured (for example, JSON files). Know how these file types
can be stored and ingested for AI/ML workloads in GCP.
Understand how file ingestion into Google Cloud works across the
stages of a Google Cloud data analytics platform, such as collect,
process, store, and analyze. For collecting data into Google
Cloud Storage, you can use Pub/Sub and Pub/Sub Lite to collect
real time data as well as BigQuery Data Transfer Service and
Datastream to migrate data from third party sources and
databases to Google Cloud. In the process phase, understand how
we can transform the data or run Spark/Hadoop jobs for ETL
using services such as Cloud Dataflow, Cloud Data Fusion, Cloud
Dataproc, Cloud Composer, and Cloud Dataprep.
Know how to use the Vertex AI Workbench environment
by using common frameworks. Understand the feature
differences and framework supported by both managed and user
managed notebooks. Understand when you should use user
managed notebooks versus managed notebooks. Understand how
to create these notebooks and what features they support out of
the box.
Know how to train a model as a job in different
environments. Understand options for Vertex AI training such
as AutoML and custom training. Then understand how you can
perform custom training by using either a prebuilt container or a
custom container using Vertex AI training along with
architecture. Understand using a training pipeline versus custom
jobs to set up training in Vertex AI. Vertex AI training supports
frameworks such as scikit learn, TensorFlow, PyTorch, and
XGBoost. Also, understand how to set up distributed training
using Vertex AI custom jobs.
Be able to unit test for model training and serving.
Understand why and how you can unit test the data and model for
machine learning. Understand how to test for updates in APIs
after model endpoints are updated and how to test for algorithm
correctness.
Understand hyperparameter tuning. Understand
hyperparameter tuning and various search algorithms for
hyperparameter tuning such as grid search, random search, and
Bayesian search. Understand when to use which search algorithm
to speed up performance. Know how to set up hyperparameter
tuning using custom jobs. Last, also understand Vertex AI Vizier
and how it's different from setting up hyperparameter tuning.
Track metrics during training. You can use the interactive
shell, TensorFlow Profiler, and the What-If Tool to track metrics
during model training.
Conduct a retraining/redeployment evaluation.
Understand how model performance decays in production due to
data drift and concept drift, and know the difference between the
two. Understand that monitoring a deployed model, for example
with Vertex AI Model Monitoring, helps you decide when to retrain
and redeploy a model.
Review Questions
1. You are a data scientist for a financial firm who is developing a
model to classify customer support emails. You created models
with TensorFlow Estimators using small datasets on your on
premises system, but you now need to train the models using
large datasets to ensure high performance. You will port your
models to Google Cloud and want to minimize code refactoring
and infrastructure overhead for easier migration from on prem to
cloud. What should you do?
A. Use Vertex AI custom jobs for training.
B. Create a cluster on Dataproc for training.
C. Create an AutoML model using Vertex AI training.
D. Create an instance group with autoscaling.
2. You are a data engineer building a demand forecasting pipeline in
production that uses Dataflow to preprocess raw data prior to
model training and prediction. During preprocessing, you
perform z score normalization on data stored in BigQuery and
write it back to BigQuery. Because new training data is added
every week, what should you do to make the process more
efficient by minimizing computation time and manual
intervention?
A. Translate the normalization algorithm into SQL for use with
BigQuery.
B. Normalize the data with Apache Spark using the Dataproc
connector for BigQuery.
C. Normalize the data with TensorFlow data transform.
D. Normalize the data by running jobs in Google Kubernetes
Engine clusters.
3. You are an ML engineer for a fashion apparel company designing
a customized deep neural network in Keras that predicts customer
purchases based on their purchase history. You want to explore
model performance using multiple model architectures, to store
training data, and to compare the evaluation metric while the job
is running. What should you do?
A. Create multiple models using AutoML Tables.
B. Create an experiment in Kubeflow Pipelines to organize
multiple runs.
C. Run multiple training jobs on the Vertex AI platform with an
interactive shell enabled.
D. Run multiple training jobs on the Vertex AI platform with
hyperparameter tuning.
4. You are a data scientist who has created an ML pipeline with
hyperparameter tuning jobs using Vertex AI custom jobs. One of
your tuning jobs is taking longer than expected and delaying the
downstream processes. You want to speed up the tuning job
without significantly compromising its effectiveness. Which
actions should you take? (Choose three.)
A. Decrease the number of parallel trials.
B. Change the search algorithm from grid search to random
search.
C. Decrease the range of floating point values.
D. Change the algorithm to grid search.
E. Set the early stopping parameter to TRUE.
5. You are a data engineer using PySpark data pipelines to conduct
data transformations at scale on Google Cloud. However, your
pipelines are taking over 12 hours to run. In order to expedite
pipeline runtime, you do not want to manage servers and need a
tool that can run SQL. You have already moved your raw data into
Cloud Storage. How should you build the pipeline on Google
Cloud while meeting speed and processing requirements?
A. Use Data Fusion's GUI to build the transformation pipelines,
and then write the data into BigQuery.
B. Convert your PySpark commands into Spark SQL queries to
transform the data and then run your pipeline on Dataproc to
write the data into BigQuery using BigQuery Spark
connector.
C. Ingest your data into BigQuery from Cloud Storage, convert
your PySpark commands into BigQuery SQL queries to
transform the data, and then write the transformations to a
new table.
D. Ingest your data into Cloud SQL, convert your PySpark
commands into Spark SQL queries to transform the data, and
then use SQL queries from BigQuery for machine learning.
6. You are a lead data scientist manager who is managing a team of
data scientists using a cloud based system to submit training jobs.
This system has become very difficult to administer. The data
scientists you work with use many different frameworks such as
Keras, PyTorch, Scikit, and custom libraries. What is the most
managed way to run the jobs in Google Cloud?
A. Use the Vertex AI training custom containers to run training
jobs using any framework.
B. Use the Vertex AI training prebuilt containers to run training
jobs using any framework.
C. Configure Kubeflow to run on Google Kubernetes Engine and
receive training jobs through TFJob.
D. Create containerized images on Compute Engine using GKE
and push these images on a centralized repository.
7. You are training a TensorFlow model on a structured dataset with
500 billion records stored in several CSV files. You need to
improve the input/output execution performance. What should
you do?
A. Load the data into HDFS.
B. Load the data into Cloud Bigtable, and read the data from
Bigtable using a TF Bigtable connector.
C. Convert the CSV files into shards of TFRecords, and store the
data in Cloud Storage.
D. Load the data into BigQuery using Dataflow jobs.
8. You are the senior solution architect of a gaming company. You
have to design a streaming pipeline for ingesting player
interaction data for a mobile game. You want to perform ML on
the streaming data. What should you do to build a pipeline with
the least overhead?
A. Use Pub/Sub with Cloud Dataflow streaming pipeline to
ingest data.
B. Use Apache Kafka with Cloud Dataflow streaming pipeline to
ingest data.
C. Use Apache Kafka with Cloud Dataproc to ingest data.
D. Use Pub/Sub Lite streaming connector with Cloud Data
Fusion.
9. You are a data scientist working on a smart city project to build an
ML model to detect anomalies in real time sensor data. You will
use Pub/Sub to handle incoming requests. You want to store the
results for analytics and visualization. How should you configure
the following pipeline?
Ingest data using Pub/Sub > 1. Preprocess > 2. ML training
> 3. Storage > Visualization in Data Studio
A. 1. Dataflow, 2. Vertex AI Training, 3. BigQuery
B. 1. Dataflow, 2. Vertex AI AutoML, 3. Bigtable
C. 1. BigQuery, 2. Vertex AI Platform, 3. Cloud Storage
D. 1. Dataflow, 2. Vertex AI AutoML, 3. Cloud Storage
10. You are a data scientist who works for a Fintech company. You
want to understand how effective your company's latest
advertising campaign for a financial product is. You have
streamed 900 MB of campaign data into BigQuery. You want to
query the table and then manipulate the results of that query with
a pandas DataFrame in a Vertex AI platform notebook. What will
be the least number of steps needed to do this?
A. Download your table from BigQuery as a local CSV file, and
upload it to your AI platform notebook instance. Use pandas
read_csv to ingest the file as a pandas DataFrame.
Explainable AI
Explainability is the extent to which you can explain the internal
mechanics of an ML or deep learning system in human terms. It is in
contrast to the concept of the black box, in which even designers
cannot explain why an AI arrives at a specific decision.
There are two types of explainability, global and local:
Global explainability aims at making the overall ML model
transparent and comprehensible.
Local explainability focuses on explaining the model's individual
predictions.
The ability to explain an ML model and its predictions builds trust and
improves ML adoption. The model is no longer a black box. This
increases the comfort level of the consumers of model predictions. For
model owners, the ability to understand the uncertainty inherent in
ML models helps with debugging the model when things go wrong and
improving the model for better business outcomes.
Debugging machine learning models is complex, especially for deep
neural nets. As the number of variables increases, it becomes really
hard to see which feature contributed to which outcome. Linear models
are easily explained and interpreted since the input parameters have a
linear relationship with the output: X = ax + by, where X is the
predicted output and x and y are the input parameters. With
models based on decision trees such as XGBoost and deep neural nets,
this mathematical relationship to determine the output from a set of
inputs gets complex, leading to difficulty in debugging these models.
That is why explainable techniques are needed to explain the model.
Interpretability and Explainability
Interpretability and explainability are often used interchangeably.
However, there is a slight difference in what they mean.
Interpretability has to do with how accurately a machine learning
model can associate a cause with an effect. Explainability has to do
with the ability of the parameters, often hidden in deep neural nets
(which we covered in Chapter 7, “Model Building”), to justify the
results.
Feature Importance
Feature importance is a technique that explains the features that make
up the training data using a score (importance). It indicates how
useful or valuable the feature is relative to other features. In the use
case of individual income prediction using XGBoost, the importance
score indicates the value of each feature in the construction of the
boosted decision trees within the model. The more a model uses an
attribute to make key decisions with decision trees, the higher the
attribute's relative importance.
The following are the most important benefits of using feature
importance:
Variable selection: Suppose you are training with 1,000
variables. You can easily figure out which variables are not
important or contributing less to your model prediction and easily
remove those variables before deploying the model in production.
This can save a lot of compute and infrastructure costs and
training time.
Target/label or data leakage in your model: Data leakage
occurs when by mistake you have added your target variable (the
feature you are trying to predict) in your training dataset as a
feature. We covered this in Chapter 2, “Exploring Data and
Building Data Pipelines.”
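Returning to the XGBoost example, the following minimal sketch shows how importance scores can be read off a trained model; the dataset here is synthetic and purely illustrative.

# A minimal sketch of feature importance with XGBoost; the data is synthetic.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Hypothetical stand-in for a real training set
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)

# Higher scores mean the feature drives more key decisions in the trees
for i, score in enumerate(model.feature_importances_):
    print(f'feature_{i}: {score:.3f}')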
Vertex Explainable AI
Vertex Explainable AI integrates feature attributions into Vertex AI
and helps you understand your model's outputs for classification and
regression tasks. Vertex AI tells you how much each feature in the data
contributed to the predicted result. You can then use this information
to verify that the model is behaving as expected, recognize bias in your
model, and get ideas for ways to improve your model and your training
data. These are supported services for Vertex Explainable AI:
AutoML image models (classification models only)
AutoML tabular models (classification and regression models
only)
Custom trained TensorFlow models based on tabular data
Custom trained TensorFlow models based on image data
Feature Attribution
Google Cloud's current offering in Vertex AI is centered around
instance level feature attributions, which provide a signed per feature
attribution score proportional to the feature's contribution to the
model's prediction. Feature attributions indicate how much each
feature in your model contributed to the predictions for each given
instance. When you request predictions, you get the predicted values
as appropriate for your model. When you request explanations, you
get the predictions along with feature attribution information.
For more information, refer to https://cloud.google.com/vertex-
ai/docs/explainable-ai/overview#feature-based_explanations.
You do not need to understand the details of these methods for the
exam. However, you would need to know when to use which technique
for the use cases and data types in Vertex AI; see Table 9.1.
TABLE 9.1 Explainable techniques used by Vertex AI

Sampled Shapley. Supported data types: tabular. Model types: nondifferentiable models (explained after the table), such as ensembles of trees and neural networks. Use case: classification and regression on tabular data. Vertex AI–equivalent models: custom-trained models (any prediction container) and AutoML tabular models.

Integrated gradients. Supported data types: image and tabular data. Model types: differentiable models (explained after the table), such as neural networks; recommended especially for models with large feature spaces and for low-contrast images, such as X-rays. Use case: classification and regression on tabular data, and classification on image data. Vertex AI–equivalent models: custom-trained TensorFlow models that use a TensorFlow prebuilt container to serve predictions, and AutoML image models.

XRAI (eXplanation with Ranked Area Integrals). Supported data types: image data. Model types: models that accept image inputs; recommended especially for natural images, which are any real-world scenes that contain multiple objects. Use case: classification on image data. Vertex AI–equivalent models: custom-trained TensorFlow models that use a TensorFlow prebuilt container to serve predictions, and AutoML image models.
We mention differentiable models and nondifferentiable models in
Table 9.1. Let's talk about the basic difference between them:
Differentiable models: You can calculate the derivative of all
the operations in your TensorFlow graph. This property helps to
make backpropagation possible in such models. For example,
neural networks are differentiable. To get feature attributions for
differentiable models, use the integrated gradients method.
Nondifferentiable models: These include nondifferentiable
operations in the TensorFlow graph, such as operations that
perform decoding and rounding tasks. For example, a model built
as an ensemble of trees and neural networks is nondifferentiable.
To get feature attributions for nondifferentiable models, use the
sampled Shapley method.
This feature is in public preview, and you might not get questions from
this topic on the exam. However, it's a good topic to understand for the
Explainable AI options provided by Google Cloud.
By using these tools and techniques, you can detect bias and fairness
in your dataset before training models on them during the data
exploration or data preprocessing phase, which we covered in Chapter
2, “Exploring Data and Building Data Pipelines,” and Chapter 3,
“Feature Engineering.”
ML Solution Readiness
We will talk about ML solution readiness in terms of these two
concepts:
Responsible AI
Model governance
Google believes in Responsible AI principles, which we covered in
Chapter 1, “Framing ML Problems.” Google shares best practices with
customers through Google Responsible AI practices, fairness best
practices, technical references, and technical ethics materials.
Summary
In this chapter, we discussed what explainable AI is and the difference
between explainability and interpretability. Then we covered the term
feature importance and why it's important to explain the models. We
covered data bias and fairness as well as ML solution readiness.
Last, we covered the explainable AI technique on the Vertex AI
platform and feature attribution. We covered three primary techniques
for model feature attribution used on the Vertex AI platform: sampled
Shapley, XRAI, and integrated gradients.
Exam Essentials
Understand model explainability on Vertex AI. Know what
explainability is and the difference between global and local
explanations. Why is it important to explain models? What is
feature importance? Understand the options of feature attribution
on the Vertex AI platform such as Sampled Shapley algorithm,
integrated gradients, and XRAI. We covered data bias and
fairness and how feature attributions can help with determining
bias and fairness from the data. ML Solution readiness talks about
Responsible AI and ML model governance best practices.
Understand that explainable AI in Vertex AI is supported for the
TensorFlow prediction container using the Explainable AI SDK
and for the Vertex AI AutoML tabular and AutoML image models.
Review Questions
1. You are a data scientist building a linear model with more than
100 input features, all with values between –1 and 1. You suspect
that many features are non informative. You want to remove the
non informative features from your model while keeping the
informative ones in their original form. Which technique should
you use?
A. Use principal component analysis to eliminate the least
informative features.
B. When building your model, use Shapley values to determine
which features are the most informative.
C. Use L1 regularization to reduce the coefficients of
noninformative features to 0.
D. Use an iterative dropout technique to identify which features
do not degrade the model when removed.
2. You are a data scientist at a startup and your team is working on a
number of ML projects. Your team trained a TensorFlow deep
neural network model for image recognition that works well and
is about to be rolled out in production. You have been asked by
leadership to demonstrate the inner workings of the model. What
explainability technique would you use on Google Cloud?
A. Sampled Shapley
B. Integrated gradient
C. PCA
D. What If Tool analysis
3. You are a data scientist working with Vertex AI and want to
leverage Explainable AI to understand which are the most
essential features and how they impact model predictions. Select
the model types and services supported by Vertex Explainable AI.
(Choose three.)
A. AutoML Tables
B. Image classification
C. Custom DNN models
D. Decision trees
E. Linear learner
4. You are an ML engineer working with Vertex Explainable AI. You
want to understand the most important features for training
models that use image and tabular datasets. Which of the feature
attribution techniques can you use? (Choose three.)
A. XRAI
B. Sampled Shapley
C. Minimum likelihood
D. Interpretability
E. Integrated gradients
5. You are a data scientist training a TensorFlow model whose graph
includes operations that perform decoding and rounding tasks.
Which technique would you use to debug or explain this
model in Vertex AI?
A. Sampled Shapley
B. Integrated gradients
C. XRAI
D. PCA
6. You are a data scientist working on creating an image
classification model on Vertex AI. You want these images to have
feature attribution. Which of the attribution techniques is
supported by Vertex AI AutoML images? (Choose two.)
A. Sampled Shapley
B. Integrated gradients
C. XRAI
D. DNN
7. You are a data scientist working on creating an image
classification model on Vertex AI. You want to set up an
explanation for testing your TensorFlow code in user managed
notebooks. What is the suggested approach with the least effort?
A. Set up local explanations using Explainable AI SDK in the
notebooks.
B. Configure explanations for the custom TensorFlow model.
C. Set up an AutoML classification model to get explanations.
D. Set the generateExplanation field to true when you create a
batch prediction job.
8. You are a data scientist who works in the aviation industry. You
have been given a task to create a model to identify planes. The
images in the dataset are of poor quality. Your model is
identifying birds as planes. Which approach would you use to help
explain the predictions with this dataset?
A. Use Vertex AI example–based explanations.
B. Use the integrated gradients technique for explanations.
C. Use the Sampled Shapley technique for explanations.
D. Use the XRAI technique for explanations.
Chapter 10
Scaling Models in Production
In this chapter, we will cover how you can deploy a model and scale it
in production using TensorFlow Serving. Then we will cover Google
Cloud architecture patterns for serving models online and in batches.
We will also cover caching strategies with serving models in Google
Cloud. We will talk about how you can set up real time endpoints and
batch jobs using Vertex AI Prediction. We will cover some of the
challenges while testing a model for target performance in production.
Last, we will cover ways to orchestrate triggers and pipelines for
automating model training and prediction pipelines.
TensorFlow Serving
TensorFlow (TF) Serving allows you to host a trained TensorFlow
model as an API endpoint through a model server. TensorFlow
Serving handles model serving and version management, lets you serve
multiple models or multiple versions of the same model, and allows you
to load your models from different sources.
TensorFlow Serving exposes two types of API endpoints: REST and
gRPC.
These are the steps to set up TF Serving:
1. Install TensorFlow Serving with Docker.
2. Train and save a model with TensorFlow.
3. Serve the saved model using TensorFlow Serving.
See www.tensorflow.org/tfx/serving/api_rest for more information.
URI: /v1/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]
VERB: classify|regress|predict
To call the predict() REST endpoint, you need to define a JSON data
payload because TF Serving expects data as JSON, as shown here:
import json
import requests

data = json.dumps({"signature_name": "serving_default",
                   "instances": instances.tolist()})
headers = {"content-type": "application/json"}
json_response = requests.post(url, data=data, headers=headers)
predictions = json.loads(json_response.text)['predictions']
To know what your predict() response format will be, you need to
look at the SavedModel's SignatureDef (for more information, see
www.tensorflow.org/tfx/serving/signature_defs). You can use the
saved_model_cli command-line tool to inspect a SavedModel's
SignatureDef. The given SavedModel SignatureDef contains the
following output(s):
outputs['class_ids'] tensor_info:
dtype: DT_INT64
shape: (-1, 1)
name: dnn/head/predictions/ExpandDims:0
outputs['classes'] tensor_info:
dtype: DT_STRING
shape: (-1, 1)
name: dnn/head/predictions/str_classes:0
outputs['logits'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 3)
name: dnn/logits/BiasAdd:0
outputs['probabilities'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 3)
name: dnn/head/predictions/probabilities:0
Method name is: tensorflow/serving/predict
The previous example has a class ID integer with shape (−1, 1),
classes as strings with shape (−1, 1), logits as floats with shape
(−1, 3), and probabilities as floats with shape (−1, 3).
Since the shape of both class ID and class is (−1, 1), there will be
only one value per prediction for them. Similarly, for both
probability and logit, the shape (−1, 3) will give three tensor values
in the response. The −1 means the batch dimension can be of any size,
so we can pass any number of instances to the saved model.
Online Predictions
To set up a real time prediction endpoint, you might have these two
use cases:
Models trained in Vertex AI using Vertex AI training:
This can be AutoML or custom models, as explained in Chapter 8,
“Model Training and Hyperparameter Tuning.”
Model trained elsewhere (on premise, on another cloud,
or in local device): If you already have a model trained
elsewhere, you need to import the model to Google Cloud before
deploying and creating an endpoint.
If you are importing as a prebuilt container, ensure that your model
artifacts have filenames that exactly match the following examples:
TensorFlow SavedModel: saved_model.pb
scikit learn: model.joblib or model.pkl
XGBoost: model.bst, model.joblib, or model.pkl
If you are importing a custom container, you need to create a
container image and push it to Artifact Registry (for example, by
using Cloud Build), following the requirements for custom container
hosting.
For both options, you need to follow these steps to set up Vertex AI
predictions:
1. Deploy the model resource to an endpoint.
2. Make a prediction.
3. Undeploy the model resource if the endpoint is not in use.
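As a minimal sketch of the first steps using the Vertex AI Python SDK, where the project, bucket, and serving container URI are illustrative placeholders:

# A hedged sketch of importing and deploying a model; names are placeholders.
from google.cloud import aiplatform

aiplatform.init(project='your-project-id', location='us-central1')

# Import the model artifacts with a prebuilt serving container
model = aiplatform.Model.upload(
    display_name='my-model',
    artifact_uri='gs://your-bucket/model/',
    serving_container_image_uri=(
        'us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest'),
)

# Deploy the model resource to an endpoint
endpoint = model.deploy(machine_type='n1-standard-2')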
Make Predictions
Before you can run the data through the endpoint, you need to
preprocess it to match the format that your custom model defined in
task.py expects. Use the Endpoint object's predict function, which
takes the instances to score as a parameter. The output for each
prediction includes the confidence level for the prediction.
predictions = endpoint.predict(instances=x_test)
When you deploy models to an endpoint, the traffic split parameter
takes a list of deployed model IDs and the share of traffic each
should receive. Check out this link to learn more about the
parameters to deploy models:
https://cloud.google.com/sdk/gcloud/reference/ai/endpoints/update
Since the resources are associated with the model rather than the
endpoint, you could deploy models of different types to the same
endpoint. However, the best practice is to deploy models of a specific
type (for example, AutoML text, AutoML tabular, custom trained) to
an endpoint. This configuration is easier to manage.
Out of the box, other than using traffic split, Vertex AI does not have
all the capabilities that you typically have for A/B testing like
controlling the model traffic, experimentation, tracking results,
comparison, and so on.
However, you can use the Vertex AI model evaluations feature with
Vertex AI Model Monitoring to create a setup for A/B testing. This
feature is in experimental release right now.
The Vertex AI model evaluation feature allows you to run model
evaluation jobs (measure model performance on a test dataset)
regardless of which Vertex service is used to train the model (AutoML,
managed pipelines, custom training, etc.), and store and visualize the
evaluation results across multiple models in the Vertex AI Model
Registry. With these capabilities, Vertex AI model evaluation enables
users to decide which model(s) can progress to online testing or be put
into production and, once they're in production, when models need to
be retrained.
Undeploy Endpoints
If your endpoints are not serving business-critical traffic and only
need to be up for a few hours a day or only on weekdays, undeploy
them when they're not in use; you incur charges for a running
endpoint. You can use the undeploy function, as shown in the
following Python code snippet:
deployed_model_id = endpoint.list_models()[0].id
endpoint.undeploy(deployed_model_id=deployed_model_id)
You can also request online explanations from a deployed endpoint,
where instance is a dictionary matching the model's expected input
format:
endpoint = aiplatform.Endpoint(endpoint_id)
response = endpoint.explain(instances=[instance], parameters={})
Batch Predictions
In batch prediction, you point to the model and the input data
(production data) in Google Cloud Storage and run a batch prediction
job. The job runs the prediction using the model on the input data and
saves the predictions output in Cloud Storage.
You need to make sure your input data is formatted as per the
requirements for either an AutoML model (vision, video, image, text,
tabular) or a custom model (prebuilt or custom container). To get
batch predictions from a custom trained model, prepare your input
data in one of the ways described in Table 10.2.
TABLE 10.2 Input data options for batch prediction in Vertex AI

JSON Lines: Use a JSON Lines file to specify a list of input instances to make predictions about. Store the JSON Lines file in a Cloud Storage bucket.

TFRecord: You can optionally compress the TFRecord files with gzip. Store the TFRecord files in a Cloud Storage bucket. Vertex AI reads each instance in your TFRecord files as binary, then base64-encodes the instance as a JSON object with a single key named b64.

CSV files: Specify one input instance per row in a CSV file. The first row must be a header row. You must enclose all strings in double quotation marks (").

File list: Create a text file where each row is the Cloud Storage URI to a file, for example, gs://path/to/image/image1.jpg.

BigQuery: Specify a BigQuery table as projectId.datasetId.tableId. Vertex AI transforms each row from the table to a JSON instance.
Similar to online prediction, you can use either Vertex AI APIs or the
Google Cloud console to create a batch prediction job (see Figure
10.7).
FIGURE 10.7 Batch prediction job in Console
You can select for output either a BigQuery table or a Google Cloud
Storage bucket. You can also enable model monitoring (in Preview) to
detect skew.
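Programmatically, a minimal sketch with the Vertex AI Python SDK looks like the following; the model resource name, bucket paths, and machine type are placeholders.

# A hedged sketch of a batch prediction job; resource names are placeholders.
from google.cloud import aiplatform

aiplatform.init(project='your-project-id', location='us-central1')

model = aiplatform.Model(
    'projects/PROJECT_ID/locations/us-central1/models/MODEL_ID')

batch_job = model.batch_predict(
    job_display_name='my-batch-prediction',
    gcs_source='gs://your-bucket/input.jsonl',       # JSON Lines input instances
    gcs_destination_prefix='gs://your-bucket/output/',
    machine_type='n1-standard-4',
)
batch_job.wait()  # blocks until the job completes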
Summary
In this chapter, we covered the details of TF Serving in scaling
prediction service.
We discussed the predict function in TF Serving and how to know the
output based on the SignatureDef of the saved TF model.
Then we discussed architecture for online serving. We dove deep into
static and dynamic reference feature architectures. Then we discussed
the architecture for using precomputing and caching while serving
predictions.
We also covered how to deploy models using online and batch mode
with Vertex AI Prediction and Google Cloud serving options. We
covered the reasons for performance degradation in production when
testing for target performance such as training serving skew, change in
data quality, and so on. You learned about tools such as Vertex AI
Model Monitoring that can help in testing for models in production.
Last, you learned about ways to configure triggers and schedules to
automate a model pipeline, such as Cloud Run, Cloud Build, Cloud
Scheduler, Vertex AI managed notebooks, and Cloud Composer.
Exam Essentials
Understand TensorFlow Serving. Understand what
TensorFlow Serving is and how to deploy a trained TensorFlow
model using TF Serving. Know the different ways to set up TF
Serving with Docker. Understand the TF Serving prediction
response based on a saved model's SignatureDef tensors.
Understand the scaling prediction services (online,
batch, and caching). Understand the differences among online
serving, batch serving, and caching. For online serving, understand the differences
in architecture and use cases with respect to input features that
are fetched in real time to invoke the model for prediction (static
reference features and dynamic reference features). Also,
understand the caching strategies to improve serving latency.
Understand the Google Cloud serving options. Understand
how to set up real time endpoints using Google Cloud Vertex AI
Prediction for custom models or models trained outside Vertex
AI; understand how to set up predictions using both APIs and the
GCP console setup. Also, understand how to set up a batch job for
any model using Vertex AI batch prediction.
Test for target performance. Understand why model
performance in production degrades. Also understand at a high
level how Vertex AI services such as Vertex AI Model Monitoring
can help with performance degradation issues.
Configure triggers and pipeline schedules. Understand
ways to set up a trigger to invoke a trained model or deploy a
model for prediction on Google Cloud. Know how to schedule the
triggers, such as using Cloud Scheduler and the Vertex AI
managed notebooks scheduler. Also, learn how to automate the
pipeline with Workflows, Vertex AI Pipelines, and Cloud
Composer.
Review Questions
1. You are a data scientist working for an online travel agency. You
have been asked to predict the most relevant web banner that a
user should see next in near real time. The model latency
requirements are 300ms@p99, and the inventory is thousands of
web banners. You want to implement the simplest solution on
Google Cloud. How should you configure the prediction pipeline?
A. Embed the client on the website, and cache the predictions in
a data store by creating a batch prediction job pointing to the
data warehouse. Deploy the gateway on App Engine, and
then deploy the model using Vertex AI Prediction.
B. Deploy the model using TF Serving.
C. Deploy the model using the Google Kubernetes engine.
D. Embed the client on the website, deploy the gateway on App
Engine, deploy the database on Cloud Bigtable for writing
and for reading the user's navigation context, and then
deploy the model on Vertex AI.
2. You are a data scientist training a text classification model in
TensorFlow using the Vertex AI platform. You want to use the
trained model for batch predictions on text data stored in
BigQuery while minimizing computational overhead. What
should you do?
A. Submit a batch prediction job on Vertex AI that points to
input data as a BigQuery table where text data is stored.
B. Deploy and version the model on the Vertex AI platform.
C. Use Dataflow with the SavedModel to read the data from
BigQuery.
D. Export the model to BigQuery ML.
3. You are a CTO of a global bank and you appointed an ML
engineer to build an application for the bank that will be used by
millions of customers. Your team has built a forecasting model
that predicts customers' account balances three days in the future.
Your team will use the results in a new feature that will notify
users when their account balance is likely to drop below a certain
amount. How should you serve your predictions?
A. Create a Pub/Sub topic for each user. Deploy a Cloud
Function that sends a notification when your model predicts
that a user's account balance will drop below the threshold.
B. Create a Pub/Sub topic for each user. Deploy an application
on the App Engine environment that sends a notification
when your model predicts that a user's account balance will
drop below the threshold.
C. Build a notification system on Firebase. Register each user
with a user ID on the Firebase Cloud Messaging server, which
sends a notification when the average of all account balance
predictions drops below the threshold.
D. Build a notification system on a Docker container. Set up
cloud functions and Pub/Sub, which sends a notification
when the average of all account balance predictions drops
below the threshold.
4. You are a data scientist and you trained a text classification
model using TensorFlow. You have downloaded the saved model
for TF Serving. The model has the following SignatureDef for the
input:
input ['text'] tensor_info:
dtype: DT_STRING
shape: (-1, 2)
name: dnn/head/predictions/textclassifier
And the following SignatureDef for the output:
output ['text'] tensor_info:
dtype: DT_STRING
shape: (-1, 2)
name: tfserving/predict
C. json.dumps({'signature_name': 'serving_default',
'instances': [['a', 'b', 'c'], ['d', 'e', 'f']]})
Orchestration Frameworks
You need an orchestrator to manage the various steps, such as
cleaning data, transforming data, and training a model, in the ML
pipeline. The orchestrator runs the pipeline in a sequence and
automatically moves from one step to the next based on the defined
conditions. For example, a defined condition might be executing the
model serving step after the model evaluation step if the evaluation
metrics meet the predefined thresholds. Orchestrating the ML pipeline
is useful in both the development and production phases:
During the development phase, orchestration helps the data
scientists to run the ML experiment instead of having to manually
execute each step.
During the production phase, orchestration helps automate the
execution of the ML pipeline based on a schedule or certain
triggering conditions.
We will cover Kubeflow Pipelines, Vertex AI Pipelines, Apache Airflow,
and Cloud Composer in the following sections as different ML pipeline
orchestrators you can use.
Kubeflow Pipelines
Before understanding how Kubeflow Pipelines works, you should
understand what Kubeflow is. Kubeflow is the ML toolkit for
Kubernetes. (You can learn more about it at
www.kubeflow.org/docs/started/architecture.) Kubeflow builds on
Kubernetes as a system for deploying, scaling, and managing complex
systems. Using Kubeflow, you can specify any ML framework required
for your workflow, such as TensorFlow, PyTorch, or MXNet. Then you
can deploy the workflow to various clouds or local and on premises
platforms for experimentation and for production use. Figure 11.3
shows how you can use Kubeflow as a platform for arranging the
components of your ML system on top of Kubernetes.
FIGURE 11.3 Kubeflow architecture
When you develop and deploy an ML system, the ML workflow
typically consists of several stages. Developing an ML system is an
iterative process. You need to evaluate the output at various stages of
the ML workflow and apply changes to the model and parameters
when necessary to ensure the model keeps producing the results you
need.
Kubeflow Pipelines is a platform for building, deploying, and
managing multistep ML workflows based on Docker containers.
A pipeline is a description of an ML workflow in the form of a graph,
including all of the components in the workflow and how the
components relate to each other. The pipeline configuration includes
the definition of the inputs (parameters) required to run the pipeline
and the inputs and outputs of each component.
When you run a pipeline, the system launches one or more Kubernetes
pods corresponding to the steps (components) in your workflow
(pipeline). The pods start Docker containers, and the containers in
turn start your programs, as shown in Figure 11.4.
FIGURE 11.4 Kubeflow components and pods
The Kubeflow Pipelines platform consists of the following:
A user interface (UI) for managing and tracking experiments,
jobs, and runs
An engine for scheduling multistep ML workflows
An SDK for defining and manipulating pipelines and components
Notebooks for interacting with the system using the SDK
Orchestration, experimentation, and reuse
You can install Kubeflow Pipelines on Google Cloud on GKE or use
managed Vertex AI Pipelines to run Kubeflow Pipelines on Google
Cloud.
Refer to https://cloud.google.com/vertex-
ai/docs/pipelines/migrate-kfp to learn more about how you can
migrate your existing pipelines in Kubeflow Pipelines to Vertex AI
Pipelines.
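To give a sense of the programming model, here is a minimal Kubeflow Pipelines (KFP v2 SDK) sketch; the component logic and pipeline name are illustrative assumptions.

# A minimal Kubeflow Pipelines (KFP v2 SDK) sketch; names and logic are placeholders.
from kfp import compiler, dsl

@dsl.component
def preprocess(message: str) -> str:
    # Each component runs in its own container (pod) at execution time
    return message.upper()

@dsl.component
def train(data: str):
    print(f'training on: {data}')

@dsl.pipeline(name='hello-pipeline')
def pipeline(message: str = 'hello'):
    step1 = preprocess(message=message)
    train(data=step1.output)

# Compile to a pipeline spec that Kubeflow Pipelines or Vertex AI Pipelines can run
compiler.Compiler().compile(pipeline, package_path='pipeline.json')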
Apache Airflow
Apache Airflow is an open source workflow management platform for
data engineering pipelines. It started at Airbnb in October 2014 as a
solution to manage the company's increasingly complex workflows.
Airflow (https://airflow.apache.org/docs/apache-
airflow/stable/concepts/overview.html) is a platform that lets you
build and run workflows. A workflow is represented as a directed
acyclic graph (DAG) and contains individual pieces of work called
tasks, arranged with dependencies and data flows taken into account.
It comes with a UI, a scheduler, and an executor.
Cloud Composer
Cloud Composer is a fully managed workflow orchestration service
built on Apache Airflow.
By using Cloud Composer instead of a local instance of Apache
Airflow, you can benefit from the best of Airflow with no installation or
management overhead. Cloud Composer helps you create Airflow
environments quickly and use Airflow native tools, such as the
powerful Airflow web interface and command line tools, so you can
focus on your workflows and not your infrastructure. Cloud Composer
is designed to orchestrate data driven workflows (particularly
ETL/ELT).
Cloud Composer is best for batch workloads that can handle a few
seconds of latency between task executions. You can use Cloud
Composer to orchestrate services in your data pipelines, such as
triggering a job in BigQuery or starting a Dataflow pipeline.
Comparison of Tools
Table 11.1 compares three orchestrators.
TABLE 11.1 Kubeflow Pipelines vs. Vertex AI Pipelines vs. Cloud Composer

Management and support for frameworks:
Kubeflow Pipelines: Used to orchestrate ML workflows in any supported framework, such as TensorFlow, PyTorch, or MXNet, using Kubernetes. It can be set up on premises or in any cloud.
Vertex AI Pipelines: A managed, serverless way to orchestrate either Kubeflow Pipelines or TFX pipelines. No need to manage the infrastructure.
Cloud Composer: A managed way to orchestrate ETL/ELT pipelines using Apache Airflow. It's a Python-based implementation. You would use it to solve complex data processing workflows in MLOps.

Failure handling:
Kubeflow Pipelines: You would need to handle failures on metrics yourself, as this is not supported out of the box.
Vertex AI Pipelines: Since Vertex AI Pipelines runs Kubeflow Pipelines, you can use the Kubeflow failure management on metrics.
Cloud Composer: Failure management for built-in GCP metrics to take action on failure or success.
Summary
In this chapter, we covered orchestration for ML pipelines using tools
such as Kubeflow, Vertex AI Pipelines, Apache Airflow, and Cloud
Composer. We also covered the difference between all these tools and
when to use each one for ML workflow automation. You saw that
Vertex AI Pipelines is a managed serverless way to run Kubeflow and
TFX workflows, while you can run Kubeflow on GCP on Google
Kubernetes Engine.
Then we covered ways to schedule ML workflows using Kubeflow and
Vertex AI Pipelines. For Kubeflow, you would use Cloud Build to
trigger a deployment, while for Vertex AI Pipelines, you can use Cloud
Function event triggers to schedule the pipeline. You can also run
these pipelines manually.
We covered system design with Kubeflow and TFX. In Kubeflow Pipelines, you turn every task into a component and orchestrate the components; Kubeflow comes with a UI and TensorBoard to visualize them. You can run TFX pipelines on Kubeflow. For TFX, we covered three TFX components and the TFX libraries used to define ML pipelines. To orchestrate an ML pipeline, TFX supports bringing your own orchestrator or runtime: you can use Cloud Composer, Kubeflow, or Vertex AI Pipelines to run TFX ML workflows.
Finally, we covered the high level definition of hybrid and multicloud.
You saw how to use BigQuery Omni and Anthos to set up hybrid and
multicloud environments on GCP. You can use BigQuery Omni
connectors to get data from AWS S3 and Azure storage. You can use
Anthos to set up Kubeflow Pipelines on GKE on premises.
Exam Essentials
Understand the different orchestration frameworks.
Know what an orchestration framework is and why it's needed.
You should know what Kubeflow Pipelines is and how you can run
Kubeflow Pipelines on GCP. You should also know Vertex AI
Pipelines and how you can run Kubeflow and TFX on Vertex AI
Pipelines. Also learn about Apache Airflow and Cloud Composer.
Finally, compare and contrast all four orchestration methods for
automating ML workflows.
Identify the components, parameters, triggers, and
compute needs on these frameworks. Know ways to
schedule ML workflows using Kubeflow and Vertex AI Pipelines.
For Kubeflow, understand how you would use Cloud Build to
trigger a deployment. For Vertex AI Pipelines, understand how
you can use Cloud Function event triggers to schedule the
pipeline.
Understand the system design of TFX/Kubeflow. Know system design with Kubeflow and TFX. Understand that in Kubeflow Pipelines, you turn every task into a component and orchestrate the components. Understand how you can run TFX
pipelines on Kubeflow and how to use TFX components and TFX
libraries to define ML pipelines. Understand that to orchestrate
ML pipelines using TFX, you can use any runtime or orchestrator
such as Kubeflow or Apache Airflow. You can also run TFX on
GCP using Vertex AI Pipelines.
Review Questions
1. You are a data scientist building a TensorFlow model with more
than 100 input features, all with values between –1 and 1. You
want to serve models that are trained on all available data but
track your performance on specific subsets of data before pushing
to production. What is the most streamlined and reliable way to
perform this validation?
A. Use the TFX ModelValidator component to specify
performance metrics for production readiness.
B. Use the entire dataset and treat the area under the curve
receiver operating characteristic (AUC ROC) as the main
metric.
C. Use L1 regularization to reduce the coefficients of
uninformative features to 0.
D. Use k fold cross validation as a validation strategy to ensure
that your model is ready for production.
2. Your team has developed an ML pipeline using Kubeflow to clean
your dataset and save it in a Google Cloud Storage bucket. You
created an ML model and want to use the data to refresh your
model as soon as new data is available. As part of your CI/CD
workflow, you want to automatically run a Kubeflow Pipelines job
on GCP. How should you design this workflow with the least effort
and in the most managed way?
A. Configure a Cloud Storage trigger to send a message to a
Pub/Sub topic when a new file is available in a storage
bucket. Use a Pub/Sub–triggered Cloud Function to start the
Vertex AI Pipelines.
B. Use Cloud Scheduler to schedule jobs at a regular interval.
For the first step of the job, check the time stamp of objects in
your Cloud Storage bucket. If there are no new files since the
last run, abort the job.
C. Use App Engine to create a lightweight Python client that
continuously polls Cloud Storage for new files. As soon as a
file arrives, initiate the Kubeflow Pipelines job on GKE.
D. Configure your pipeline with Dataflow, which saves the files
in Cloud Storage. After the file is saved, you start the job on
GKE.
3. You created an ML model and want to use the data to refresh your
model as soon as new data is available in a Google Cloud Storage
bucket. As part of your CI/CD workflow, you want to
automatically run a Kubeflow Pipelines training job on GKE. How
should you design this workflow with the least effort and in the
most managed way?
A. Configure a Cloud Storage trigger to send a message to a
Pub/Sub topic when a new file is available in a storage
bucket. Use a Pub/Sub–triggered Cloud Function to start the
training job on GKE.
B. Use Cloud Scheduler to schedule jobs at a regular interval.
For the first step of the job, check the time stamp of objects in
your Cloud Storage bucket to see if there are no new files
since the last run.
C. Use App Engine to create a lightweight Python client that
continuously polls Cloud Storage for new files. As soon as a
file arrives, initiate the Kubeflow Pipelines job on GKE.
D. Configure your pipeline with Dataflow, which saves the files
in Cloud Storage. After the file is saved, you can start the job
on GKE.
4. You are an ML engineer for a global retail company. You are
developing a Kubeflow pipeline on Google Kubernetes Engine for
a recommendation system. The first step in the pipeline is to issue
a query against BigQuery. You plan to use the results of that query
as the input to the next step in your pipeline. Choose two ways
you can create this pipeline.
A. Use the Google Cloud BigQuery component for Kubeflow
Pipelines. Copy that component's URL, and use it to load the
component into your pipeline. Use the component to execute
queries against a BigQuery table.
B. Use the Kubeflow Pipelines domain specific language to
create a custom component that uses the Python BigQuery
client library to execute queries.
C. Use the BigQuery console to execute your query and then
save the query results into a new BigQuery table.
D. Write a Python script that uses the BigQuery API to execute
queries against BigQuery. Execute this script as the first step
in your pipeline in Kubeflow Pipelines.
5. You are a data scientist training a TensorFlow model that includes graph operations, such as operations that perform decoding and rounding tasks. You are using TensorFlow Transform to create data transformations and TensorFlow Serving to serve your model. Your ML architect has asked you to set up MLOps and orchestrate the model serving only after the data transformation is complete. Which of the following orchestrators can you choose to orchestrate your ML workflow? (Choose two.)
A. Apache Airflow
B. Kubeflow
C. TFX
D. Dataflow
6. You are a data scientist working on creating an image
classification model on Vertex AI. You are using Kubeflow to
automate the current ML workflow. Which of the following
options will help you set up the pipeline on Google Cloud with the
least amount of effort?
A. Set up Kubeflow Pipelines on GKE.
B. Use Vertex AI Pipelines to set up Kubeflow ML pipelines.
C. Set up Kubeflow Pipelines on an EC2 instance with
autoscaling.
D. Set up Kubeflow Pipelines using Cloud Run.
7. As an ML engineer, you have written unit tests for a Kubeflow
pipeline that require custom libraries. You want to automate the
execution of unit tests with each new push to your development
branch in Cloud Source Repositories. What is the recommended
way?
A. Write a script that sequentially performs the push to your
development branch and executes the unit tests on Cloud
Run.
B. Create an event based Cloud Function when new code is
pushed to Cloud Source Repositories to trigger a build.
C. Using Cloud Build, set an automated trigger to execute the
unit tests when changes are pushed to your development
branch.
D. Set up a Cloud Logging sink to a Pub/Sub topic that captures
interactions with Cloud Source Repositories. Execute the unit
tests using a Cloud Function that is triggered when messages
are sent to the Pub/Sub topic.
8. Your team is building a training pipeline on premises. Due to
security limitations, they cannot move the data and model to the
cloud. What is the recommended way to scale the pipeline?
A. Use Anthos to set up Kubeflow Pipelines on GKE on
premises.
B. Use Anthos to set up Cloud Run to trigger training jobs on
GKE on premises. Orchestrate all of the runs manually.
C. Use Anthos to set up Cloud Run on premises to create a
Vertex AI Pipelines job.
D. Use Anthos to set up Cloud Run on premises to create a
Vertex AI Training job.
Chapter 12
Model Monitoring, Tracking, and Auditing
Metadata
Model Monitoring
You went through a great journey, from experimentation in a Jupyter
Notebook to deploying a model in production, and you are now
serving predictions using a serverless architecture. Deployment is not the end of this workflow; rather, it is the first iteration of the machine learning model's life cycle.
While your model might have scored high in your evaluation metrics,
how do you know if it will perform well on real time data? What if
there are massive changes in the world after you deploy, like a
worldwide pandemic that changes human behavior? What if there are
subtle changes to the input that are not obvious? In short, how do you
know that your model works after deployment? Post deployment, the
model may not be fit for the original purpose after some time.
The world is a dynamic place and things keep changing. However,
machine learning models are trained on historical data and, ideally,
recently collected data. Imagine that a model is deployed and the
environment slowly but surely starts to change; your model will
become more and more irrelevant as time passes. This concept is
called drift. There are two kinds of drift: concept drift and data drift.
Know how to detect each type of drift and the methods to recover from it.
Concept Drift
In general, there is a relationship between the input variables and
predicted variables that we try to approximate using a machine
learning model. When this relationship is not static and changes, it is
called concept drift. This often happens because the underlying
assumptions of your model have changed.
A classic example is email spam, which makes up the majority of all email sent. As spam gets detected and filtered, spammers modify their emails to bypass the detection filter. In cases like this, adversarial agents try to outdo one another and keep changing their behavior.
Data Drift
Data drift refers to a change in the input data fed to the machine learning model, compared to the data that was used to train it. One example is a change in statistical distribution, such as a model trained to predict customers' food preferences failing when the age distribution of the customer base changes.
Another reason for data drift could be the change in the input schema
at the source of your data. An example of this is the presence of new
product labels (SKUs) that were not present in the training data. A more subtle case is when the meaning of a column changes; for example, the meaning of the term diabetic might change over time based on the medical diagnostic levels for blood sugar.
The only way to act on model deterioration such as drift is to keep an eye on the data; this is called model monitoring. The most direct approach is to monitor the input data and continuously evaluate the model with the same metrics that were used during the training phase.
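As an illustration, the following sketch creates a monitoring job for a deployed endpoint with the Vertex AI SDK for Python. The endpoint ID, feature names, thresholds, and email address are placeholders, and this shows a minimal drift detection setup rather than a complete recipe:

from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

aiplatform.init(project="my-project", location="us-central1")

# Alert when the distribution of an input feature drifts past a threshold.
objective = model_monitoring.ObjectiveConfig(
    drift_detection_config=model_monitoring.DriftDetectionConfig(
        drift_thresholds={"age": 0.3, "zipcode": 0.3}  # per-feature thresholds
    )
)

job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="churn-model-monitoring",
    endpoint="1234567890",  # placeholder endpoint ID
    objective_configs=objective,
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.8),
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=1),  # hours
    alert_config=model_monitoring.EmailAlertConfig(user_emails=["ml-team@example.com"]),
)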
Input Schemas
The input values are part of the payload of the prediction requests, and Vertex AI must be able to parse them in order to monitor them. You can specify a schema when you configure model monitoring to help parse this input. The input schema is automatically parsed for AutoML models, so you don't have to provide one. You must provide a schema for custom trained models that don't use a key/value input format.
Custom Schema
To make sure that Vertex AI correctly parses the input values, you can specify the schema in what is called an analysis instance. The analysis instance is expected to be in OpenAPI schema format. Here are the details of the expected schema:
The type can be one of the following:
object: key/value pairs
array: array like format
string: CSV string
The properties field specifies the type of each feature.
For the array and CSV string formats, specify the order of the features.
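For example, a minimal analysis instance schema for a hypothetical model with two features might look like the following; the feature names and types here are illustrative, not taken from a specific model:

type: object
properties:
  age:
    type: number
  zipcode:
    type: string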
Logging Strategy
When you deploy a model for prediction, in addition to monitoring the
inputs (which is to keep track of the trends in the input features), it
may be useful to log the requests. In some domains (such as regulated
financial verticals), logging is mandatory for all models for future
audits. In other cases, it may be useful to collect monitoring data to
update your training data.
In Vertex AI, you can enable prediction logs for AutoML tabular
models, AutoML image models, and custom trained models. This can
be done during either model deployment or endpoint creation.
Container Logging
This logs the stdout and stderr from your prediction nodes to Cloud
Logging. This is highly useful and relevant for debugging your
container or the model. It may be helpful to understand the larger
logging platform on GCP.
Access Logging
This logs information such as a time stamp and latency for each
request to Cloud Logging. This is enabled by default on v1 service
endpoints; in addition, you can enable access logging when you deploy
your model to the endpoint.
Request‐Response Logging
This logs a sample of the online prediction requests and responses to a
BigQuery table. This is the primary mechanism for collecting prediction data, and with it you can create more data to augment your training or test data. You can enable it when you create the prediction endpoint or update the setting later.
Log Settings
Log settings can be updated when you create the endpoint or when
you deploy a model to the endpoint. It is important to be aware of the
default settings for logging. If you have already deployed your model
with default log settings but want to change the log settings, you must
undeploy your model and redeploy it with new settings.
If you expect to see a huge number of requests to your model, then in addition to scaling your deployment, you should consider the cost of logging: a high rate of queries per second (QPS) will produce a significant volume of logs.
Here is an example of the gcloud command to configure logging for an
image classification model. Notice the last two lines where the logging
configuration is specified.
gcloud ai endpoints deploy-model 1234567890 \
  --region=us-central1 \
  --model=model_id_12345 \
  --display-name=image_classification \
  --machine-type=a2-highgpu-2g \
  --accelerator=count=2,type=nvidia-tesla-a100 \
  --disable-container-logging \
  --enable-access-logging
Vertex ML Metadata
Using Vertex ML Metadata, you can record the metadata of the
artifacts and query the metadata for analyzing, debugging, or auditing
purposes. Vertex ML Metadata is based on the open source ML
Metadata (MLMD) library that was developed by Google's TensorFlow
Extended team.
The Vertex ML Metadata uses the following data model and the
associated terminology:
Metadata store: This is the top level container for all the
metadata resources. It is regional and within the scope of a Google
Cloud project. Usually, one metadata store is shared by the entire
organization.
Metadata resources: Vertex ML Metadata uses a graph like
data model for representing the relationship between the
resources. These resources are as follows:
Artifacts: An artifact is an entity or a piece of data that was
created by or can be consumed by an ML workflow. Datasets,
models, input files, training logs, and metrics are examples.
Context: A context is a group of artifacts and executions that
can be queried. Say you are optimizing hyperparameters; each
experiment would be a different execution with its own set of
parameters and metrics. You can group these experiments into
a context and then compare the metrics in this context to
identify the best model.
Execution: This represents a step in a machine learning
workflow and can be annotated with runtime parameters. An
example could be a “training” operation, with annotation
about time and number of GPUs used.
Events: An event connects artifacts and executions. Details such as an artifact being the output of one execution and the input of the next can be captured using events. Events help you determine the origin of an artifact when tracing lineage.
Figure 12.3 shows a simple graph containing events, artifacts,
execution, and context.
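As a sketch, the following records a dataset artifact, a training execution, and the events connecting them using the Vertex AI SDK for Python; the display names, URIs, and metadata values are placeholders:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# An input artifact: the training dataset.
dataset = aiplatform.Artifact.create(
    schema_title="system.Dataset",
    display_name="training-data",
    uri="gs://my-bucket/data/train.csv",
)

# An execution: the training step, annotated with runtime parameters.
with aiplatform.start_execution(
    schema_title="system.ContainerExecution",
    display_name="training-run",
    metadata={"num_gpus": 2},
) as execution:
    execution.assign_input_artifacts([dataset])  # input event
    model = aiplatform.Artifact.create(
        schema_title="system.Model",
        display_name="trained-model",
        uri="gs://my-bucket/models/model-1/",
    )
    execution.assign_output_artifacts([model])  # output event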
Managing ML Metadata Schemas
Every metadata resource stored in Vertex ML Metadata follows a schema called a MetadataSchema. There are predefined schemas, called system schemas, for the most common types of resources; these come under the namespace system. Here is an example of a predefined model system type in YAML format:
title: system.Model
type: object
properties:
  framework:
    type: string
Vertex AI Pipelines
When you use Vertex AI Pipelines, the model metadata and artifacts
are automatically stored in the metadata store for lineage tracking.
Whenever you run a pipeline, it generates a series of artifacts. These
could include dataset summaries, model evaluation metrics, metadata
on the specific pipeline execution, and so on. Vertex AI Pipelines also
provides a visual representation of the lineage, as shown in Figure 12.4. You can use it to understand which data a model was built on and which model version was deployed, and you can filter by data schema and date.
Vertex AI Experiments
When you are developing an ML model, the goal is to find the best model for the use case. You might experiment with various libraries, algorithms, model architectures, hyperparameters, and so on. Vertex AI Experiments helps you keep track of these trials and analyze the different variations.
In particular, Vertex AI Experiments helps you to:
Track the steps of an experiment run (like preprocessing,
embedding, training, etc.)
Track input like algorithms, hyperparameters, datasets, etc.
Track output of these steps like models, metrics, checkpoints, etc.
Based on this, you can understand what works and choose your direction of exploration. The Google Cloud console provides a single pane of glass to view your experiments, where you can slice and dice the results of the experiment runs and zoom into the results of a single run. Using the Vertex AI SDK for Python, you can access experiments, experiment runs, run parameters, metrics, and artifacts. When used with Vertex ML Metadata, you can track artifacts and view their lineage.
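For example, tracking a run with the Vertex AI SDK for Python might look like this sketch; the project, experiment name, run name, and logged values are placeholders:

from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    experiment="food-preference-models",
)

aiplatform.start_run("run-lr-001")  # one run per trial
aiplatform.log_params({"learning_rate": 0.01, "epochs": 10})
# ... train and evaluate the model here ...
aiplatform.log_metrics({"accuracy": 0.92, "loss": 0.31})
aiplatform.end_run()

# Compare all runs in the experiment as a pandas DataFrame.
runs_df = aiplatform.get_experiment_df()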
Vertex AI Debugging
Sometimes when training a model, you run into issues and suspect
that the GPU is not being used efficiently or a permission issue
restricts access to data. To debug these kinds of issues, Vertex AI
allows you to directly connect to the container where your training is
running. To do this, follow these steps:
1. Install an interactive Bash shell in the training container. (A Bash shell comes preinstalled in prebuilt containers.)
2. Run the custom training where interactive shells are supported.
3. Make sure that the user has the right permissions. If you are using
a service account, make sure the service account has the right
permissions.
4. Set the enableWebAccess API field to true to enable interactive
shells.
5. Navigate to the URI provided by Vertex AI when you initiate the
custom training job.
6. Use the interactive shell to do the following:
a. Check permissions for the service account that Vertex AI uses
for training code.
b. Visualize Python execution with profiling tools like py-spy, which you can install using pip3.
c. Analyze performance of your training node using perf.
d. Check CPU and GPU usage.
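As a sketch, steps 2 through 4 can be combined when launching a custom job from the Vertex AI SDK for Python; the training image, machine type, and bucket below are placeholders:

from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-bucket",
)

job = aiplatform.CustomJob(
    display_name="debuggable-training",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-8"},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
    }],
)

# enable_web_access corresponds to the enableWebAccess API field; the
# interactive shell URI is then surfaced in the Google Cloud console.
job.run(enable_web_access=True)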
Summary
In this chapter we looked at the steps beyond building and deploying a model. This includes monitoring a deployed model to detect performance degradation. We also looked at the various logging strategies available in Vertex AI. Finally, we looked at how to track the lineage of models using Vertex ML Metadata and how to track experiments using Vertex AI Experiments.
Exam Essentials
Understand model monitoring. Understand the need to
monitor the performance of the model after deployment. There
are two main types of degradation: data drift and concept drift.
Learn how to monitor continuously for these kinds of changes to
input.
Learn logging strategies. Logging after deployment is crucial
to be able to keep track of the deployment, including the
performance, as well as create new training data. Learn how to
use logging in addition to monitoring the models in Vertex AI.
Understand Vertex ML Metadata. ML metadata helps you to
track lineage of the models and other artifacts. Vertex ML
Metadata is a managed solution for storing and accessing
metadata on GCP. Learn the data model as well as the basic
operations of creating and querying metadata.
Review Questions
1. You spend several months fine tuning your model and the model
is performing very well in your evaluations based on test data.
You have deployed your model, and over time you notice that the
model accuracy is low. What happened and what should you do?
(Choose two.)
A. Nothing happened. There is only a temporary glitch.
B. You need to enable monitoring to establish if the input data
has drifted from the train/test data.
C. Throw away the model and retrain with a higher threshold of
accuracy.
D. Collect more data from your input stream and use that to
create training data, then retrain the model.
2. You spend several months fine tuning your model and the model
is performing very well in your evaluations based on test data.
You have deployed your model and it is performing well on real
time data as well based on an initial assessment. Do you still need
to monitor the deployment?
A. It is not necessary because it performed very well with test
data.
B. It is not necessary because it performed well with test data
and also on real time data on initial assessment.
C. Yes. Monitoring the model is necessary no matter how well it
might have performed on test data.
D. It is not necessary because of cost constraints.
3. Which of the following are two types of drift?
A. Data drift
B. Technical drift
C. Slow drift
D. Concept drift
4. You trained a regression model to predict the longevity of a tree,
and one of the input features was the height of the tree. When the
model is deployed, you find that the average height of trees you
are seeing is two standard deviations away from your input. What
type of drift is this?
A. Data drift
B. Technical drift
C. Slow drift
D. Concept drift
5. You trained a classification model to predict fraudulent
transactions and got a high F1 score. When the model was
deployed initially, you had good results, but after a year, your
model is not catching fraud. What type of drift is this?
A. Data drift
B. Technical drift
C. Slow drift
D. Concept drift
6. When there is a difference in the input feature distribution
between the training data and the data in production, what is this
called?
A. Distribution drift
B. Feature drift
C. Training serving skew
D. Concept drift
7. When statistical distribution of the input feature in production
data changes over time, what is this called in Vertex AI?
A. Distribution drift
B. Prediction drift
C. Training serving skew
D. Concept drift
8. You trained a classification model to predict the number of
plankton in an image of ocean water taken using a microscope to
measure the amount of plankton in the ocean. When the model is
deployed, you find that the average number of plankton is an
order of magnitude away from your training data. Later, you
investigate this and find out it is because the magnification of the
microscope was different in the training data. What type of drift is
this?
A. Data drift
B. Technical drift
C. Slow drift
D. Concept drift
9. What is needed to detect training serving skew? (Choose two.)
A. Baseline statistical distribution of input features in training
data
B. Baseline statistical distribution of input features in
production data
C. Continuous statistical distribution of features in training data
D. Continuous statistical distribution of features in production
data
10. What is needed to detect prediction drift? (Choose two.)
A. Baseline statistical distribution of input features in training
data
B. Baseline statistical distribution of input features in
production data
C. Continuous statistical distribution of features in training data
D. Continuous statistical distribution of features in production
data
11. What is the distance score used for categorical features in Vertex
AI?
A. L infinity distance
B. Count of the number of times the categorical value occurs
over time
C. Jensen Shannon divergence
D. Normalized percentage of the time the categorical values
differ
12. You deployed a model on an endpoint and enabled monitoring.
You want to reduce cost. Which of the following is a valid
approach?
A. Periodically switch off monitoring to save money.
B. Reduce the sampling rate to an appropriate level.
C. Reduce the inputs to the model to reduce the monitoring
footprint.
D. Choose a high threshold so that alerts are not sent too often.
13. Which of the following are features of Vertex AI model
monitoring? (Choose three.)
A. Sampling rate: Configure a prediction request sampling rate.
B. Monitoring frequency: Rate at which model's inputs are
monitored.
C. Choose different distance metrics: Choose one of the many
distance scores for each feature.
D. Alerting thresholds: Set the threshold at which alerts will be
sent.
14. Which of the following is not a correct combination of model
building and schema parsing in Vertex AI model monitoring?
A. AutoML model with automatic schema parsing
B. Custom model with automatic schema parsing with values in
key/value pairs
C. Custom model with automatic schema parsing with values
not in key/value pairs
D. Custom model with custom schema specified with values not
in key/value pairs
15. Which of the following is not a valid data type in the model
monitoring schema?
A. String
B. Number
C. Array
D. Category
16. Which of the following is not a valid logging type in Vertex AI?
A. Container logging
B. Input logging
C. Access logging
D. Request response logging
17. How can you get a log of a sample of the prediction requests and
responses?
A. Container logging
B. Input logging
C. Access logging
D. Request response logging
18. Which of the following is a not a valid reason for using a metadata
store?
A. To compare the effectiveness of different sets of
hyperparameters
B. To track lineage
C. To find the right proportion of train and test data
D. To track downstream usage of artifacts for audit purposes
19. What is an artifact in a metadata store?
A. Any piece of information in the metadata store
B. The train and test dataset
C. Any entity or a piece of data that was created by or can be
consumed by an ML workflow
D. A step in the ML workflow that can be annotated with
runtime parameters
20. Which of the following is not part of the data model in a Vertex
ML metadata store?
A. Artifact
B. Workflow step
C. Context
D. Execution
Chapter 13
Maintaining ML Solutions
MLOps Maturity
Organizations go through a journey starting with experimenting with
machine learning technology and then progressively bringing the
concepts of continuous integration/continuous deployment (CI/CD)
into machine learning. This application of DevOps principles to
machine learning is called MLOps.
While there are similarities between MLOps and DevOps, there are
some key differences.
We have found that organizations first start experimenting with
machine learning by manually training models and then bring
automation to the process using pipelines; they later enter a transformational phase as they fully automate. These three phases use different technologies and reflect an organization's "AI readiness."
Before we look at each of these phases in detail, let's first look at the
steps in ML:
1. Data
a. Extraction: Collect data from different sources, aggregate
the data, and make it available for the ML process
downstream.
b. Analysis: Perform exploratory data analysis (EDA) on the
data collected to understand the schema, statistical
distributions, and relationships. Identify the feature
engineering and data preparation requirements that will have
to be performed.
c. Preparation: Apply transformations and feature
engineering on the data. Split the data into train, test, and
validation for the ML task.
2. Model
a. Training: Set up ML training using the input data, and
predict the output. Experiment with different algorithms and
hyperparameters to identify the best performing model.
b. Evaluation: Evaluate the model using a holdout set and
assess the quality of the model based on predefined metrics.
c. Validation: Validate if the model is qualified for
deployment, that is, if the metrics meet a certain baseline
performance criterion.
3. Deployment
a. Serving: Deploy the model to serve online or batch
predictions. For online predictions, create and maintain a
RESTful endpoint for the model, and provision to scale based
on demand. This could also include making batch
predictions.
b. Monitor: Continuously monitor the model after deployment
to detect anomalies, drift, and skew.
We looked at all of these steps in the previous chapters, and so you
should be familiar with them. The level of automation of these steps
defines the maturity of the MLOps. Google characterizes MLOps as
having three levels: MLOps level 0 (manual phase), MLOps level 1 (strategic automation phase), and MLOps level 2 (CI/CD automation, the transformational phase).
Challenges
The most important challenge in the manual phase (MLOps level 0) is model degradation. Well trained models frequently underperform in real life due to differences between training data and real data. The only way to mitigate these problems is to actively monitor the quality of predictions, to retrain the models frequently, and to experiment frequently with new algorithms or implementations of the model to leverage improvements in technology.
Challenges
In the strategic automation phase (MLOps level 1), the expectation is that the team manages only a few pipelines, and new pipelines are deployed manually. Pipelines are triggered mainly when there are changes to the data, which works well for retraining a model with new features but not for deploying models based on new ML ideas: if you want to use technologies that are not part of the existing pipeline, the pipeline has to be deployed manually. To move beyond this, you need a CI/CD setup that automates building, testing, and deploying ML pipelines.
Versioning Models
When you are building multiple models and sharing them with other
teams, it is important to have some way to find them. One method is to
use metadata to identify artifacts; however, users external to the
organization access a deployed model using an API, and to them the
metadata store might not be accessible. This is where versioning can
be used.
The problem with having multiple models is that, when users are
accessing through an API, they expect a certain performance and
behavior from the model. If there is a sudden change in behavior from
the same API, it causes unexpected disruption. However, there is a
need for the model to update or change. For example, say there is a
model to identify objects and humans in an image and it accurately
detects all the human faces. Later, say you have updated the model
and it now has the ability to also detect dogs and other pets; this could
cause a disruption. In this case, the user needs the ability to choose the
older model.
The solution is to use model versioning. Model versioning allows you
to deploy an additional model to the existing model, with the ability to
select it using a version ID. In this case, both the models can be
accessed by the end user by specifying the version. This solves the
problem of backward compatibility; that is, this works well for models
that have the same inputs and outputs.
Models that behave in a different way (say, a model that gains the ability to provide explanations) should be deployed as a new model, not as a new version of an existing model.
In both cases, there should be the ability to access these models using
REST endpoints, separately and conveniently for the users. Enabling
monitoring on all the deployed model versions allows you the ability to
compare the different versions.
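For illustration, here is a sketch of uploading a new model version with the Vertex AI SDK for Python using the Model Registry's parent_model argument; all names, URIs, and the prebuilt serving container are placeholders:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Initial upload creates the model (version 1).
model_v1 = aiplatform.Model.upload(
    display_name="object-detector",
    artifact_uri="gs://my-bucket/models/v1/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest"
    ),
)

# Uploading with parent_model registers version 2 of the same model.
model_v2 = aiplatform.Model.upload(
    display_name="object-detector",
    parent_model=model_v1.resource_name,
    is_default_version=False,  # existing callers keep getting version 1
    artifact_uri="gs://my-bucket/models/v2/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest"
    ),
)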
Feature Store
Feature engineering is an important factor in the ability to build good
ML models. In practice, however, feature engineering is often more time consuming than experimenting with ML models, so a well engineered feature provides enormous value to the entire ML solution.
So, feature engineering is a huge investment for any organization. In
order to optimize their models, many teams work on creating new
features. At times these features would have been valuable to other
teams as well. Unfortunately, sharing these features is tricky and so
the same feature engineering tasks are done over and over again. This
creates several problems:
Non reusable: Features are created ad hoc and not intended to
be used by others. Each team creates a feature with the purpose of
only using it themselves. These ad hoc features are not automated
in pipelines and are sometimes derived from other expensive data
preprocessing pipelines.
Governance: The diversity of sources and methods used to create these features makes data governance very complex.
Cross collaboration: Because these ad hoc features are not shared, divisions between teams deepen, and the teams continue to go their separate ways.
Training and serving differences: When features are built ad
hoc and not automated, this creates differences between training
data and serving data and reduces the effectiveness of the ML
solution.
Productizing features: While these ad hoc features are useful
during experimentation, they cannot be productized because of
the lack of automation and the need for low latency retrieval of
the features.
Solution
The solution is to have a central location to store the features as well as
the metadata about the features that can be shared between the data
engineers and ML engineers. This also allows the application of the
software engineering principles of versioning, documentation, and
access control of these features.
A feature store also has two key capabilities: it can process large feature sets quickly, and it can serve features with low latency for real time prediction as well as in batch for training and batch predictions.
Feast is an open source feature store created by Google and Gojek that is available as a software download. Feast was designed around Redis, BigQuery, and Apache Beam. Google Cloud also offers a managed service, Vertex AI Feature Store, that scales dynamically based on your needs.
Data Model
We will now discuss the data model used by the Vertex AI Feature
Store service. It uses a time series model to store all the data, which
enables you to manage the data as it changes over time. All the data in
Vertex AI Feature Store is arranged in a hierarchy with the top level
called a featurestore. This featurestore is a container that can have one
or more entity types, which represents a certain type of feature. In
your entity type you can store similar or related features.
Featurestore → EntityType → Feature
As an example, Table 13.1 has data of baseball batters. The first step is
to create a featurestore called baseballfs.
TABLE 13.1 Table of baseball batters
Row player_id Team Batting_avg Age
1 player_1 RedSox 0.289 29
2 player_2 Giants 0.301 32
3 player_3 Yankees 0.241 35
You can then create an EntityType called batters and map the column
header player_id to that entity. You can then add team, batting_avg,
and age as features to this EntityType.
Here player_1, player_2, and player_3 are entities in this EntityType. Entities must always be unique and must always be of type String. RedSox, Giants, and Yankees are feature values in the featurestore.
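As a sketch, creating this hierarchy with the Vertex AI SDK for Python might look like the following; the project and location are placeholders, and the node count simply enables online serving:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Featurestore -> EntityType -> Feature, matching Table 13.1.
fs = aiplatform.Featurestore.create(
    featurestore_id="baseballfs",
    online_store_fixed_node_count=1,  # enables low latency online serving
)
batters = fs.create_entity_type(
    entity_type_id="batters",
    description="One entity per player_id",
)
batters.create_feature(feature_id="team", value_type="STRING")
batters.create_feature(feature_id="batting_avg", value_type="DOUBLE")
batters.create_feature(feature_id="age", value_type="INT64")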
We looked at general IAM best practices, and now we will look at some
special cases for Vertex AI.
Summary
In this chapter we looked at the long term maintenance of an ML application. ML operations, or MLOps, is based on the CI/CD principles used to maintain software applications. During this process we looked at how to automate training, deployment, and monitoring. A model retraining policy is an important concept that balances model quality against the cost of training. Another important problem in large enterprises is the inability to share features between departments, which causes many inefficiencies. To solve this problem, the idea of a feature store was invented; a feature store can be implemented using open source software or the managed Vertex AI Feature Store.
Exam Essentials
Understand MLOps maturity. Learn different levels of
maturity of MLOps and how it matches with the organizational
goals. Know the MLOps architecture at the experimental phase,
then a strategic phase where there is some automation, and finally
a fully mature CI/CD inspired MLOps architecture.
Understand model versioning and retraining triggers. A
common problem faced in MLOps is knowing when to trigger new
training. It could be based on model degradation as observed in
model monitoring, or it could be time based. When retraining a
model, learn how to add it as a new version or a new model.
Understand the use of feature store. Feature engineering is
an expensive operation, so the features generated using those
methods are more useful if shared between teams. Vertex AI
Feature Store is a managed service, and Feast is an open source
feature store by Google.
Review Questions
1. Which of the following is not one of the major steps in the MLOps
workflow?
A. Data processing, including extraction, analysis, and
preparation
B. Integration with third party software and identifying further
use cases for similar models
C. Model training, testing, and validation
D. Deployment of the model, monitoring, and triggering
retraining
2. You are on a small ML team in a very old retail organization, and
the organization is looking to start exploring machine learning for
predicting daily sales of products. What level of MLOps would you
implement in this situation?
A. No MLOps, will build ML models ad hoc
B. MLOps level 0
C. MLOps level 1
D. MLOps level 2
3. You are a data scientist working as part of an ML team that has
experimented with ML for its online fashion retail store. The
models you build match customers to the right size/fit of clothes.
The organization has decided to build this out, and you are leading this effort. What is the level of MLOps you would implement here?
A. No MLOps, will build ML models ad hoc
B. MLOps level 0
C. MLOps level 1
D. MLOps level 2
4. You have been hired as an ML engineer to work in a large
organization that works on processing photos and images. The
team creates models to identify objects in photos, faces in photos,
and the orientation of photos (to automatically turn) and also
models to adjust the colors of photos. The organization is also
experimenting with new algorithms that can automatically create
images from text. What is the level of MLOps you would
recommend?
A. No MLOps, ad hoc because they are using new algorithms
B. MLOps level 0
C. MLOps level 1
D. MLOps level 2
5. What problems does MLOps level 0 solve?
A. It is ad hoc building of models so it does not solve any
problems.
B. It automates training so building models is a repeatable
process.
C. Model training is manual but deployment is automated once
there is model handoff.
D. It is complete automation from data to deployment.
6. Which of these statements is false regarding MLOps level 1
(strategic phase)?
A. Building models becomes a repeatable process due to
training automation.
B. Model training is triggered automatically by new data.
C. Trained models are automatically packaged and deployed.
D. The pipeline is automated to handle new libraries and
algorithms.
7. You are part of an ML engineering team of a large organization
that has started using ML extensively across multiple products. It
is experimenting with different algorithms and even creating its
own new ML algorithms. What should be its MLOps maturity
level to be able to scale?
A. Ad hoc is the only level that works for the organization
because it is using custom algorithms.
B. MLOps level 0.
C. MLOps level 1.
D. MLOps level 2.
8. In MLOps level 1 of maturity (strategic phase), what is handed off
to deployment?
A. The model file
B. The container containing the model
C. The pipeline to train a model
D. The TensorFlow or ML framework libraries
9. In MLOps level 0 of maturity (tactical phase) what is handed off
to the deployment?
A. The model file
B. The container containing the model
C. The pipeline to train a model
D. The TensorFlow or ML framework libraries
10. What triggers building a new model in MLOps level 2?
A. Feature store
B. Random trigger
C. Performance degradation from monitoring
D. ML Metadata Store
11. What should you consider when you are setting the trigger for
retraining a model? (Choose two.)
A. The algorithm
B. The frequency of triggering retrains
C. Cost of retraining
D. Time to access data
12. What are reasonable policies to apply for triggering retraining
from a model monitoring data? (Choose two.)
A. The amount of prediction requests to a model
B. Model performance degradation below a threshold
C. Security breach
D. Sudden drop in performance of the model
13. When you train or retrain a model, when do you deploy a new
version (as opposed to deploy as a new model)?
A. Every time you train a model, it is deployed as a new version.
B. Only models that have been uptrained from pretrained
models get a new version.
C. Never create a new version, always a new model.
D. Whenever the model has similar inputs and outputs and is
used for the same purpose.
14. Which of the following are good reasons to use a feature store?
(Choose two.)
A. There are many features for a model.
B. There are many engineered features that have not been
shared between teams.
C. The features created by the data teams are not available
during serving time, and this is creating training/serving
differences.
D. The models are built on a variety of features, including
categorical variables and continuous variables.
15. Which service does Feast not use?
A. BigQuery
B. Redis
C. Gojek
D. Apache Beam
16. What is the hierarchy of the Vertex AI Feature Store data model?
A. Featurestore > EntityType > Feature
B. Featurestore > Entity > Feature
C. Featurestore > Feature > FeatureValue
D. Featurestore > Entity > FeatureValue
17. What is the highest level in the hierarchy of the data model of a
Vertex AI Feature Store called?
A. Featurestore
B. Entity
C. Feature
D. EntityType
18. You are working in a small organization and dealing with
structured data, and you have worked on creating multiple high
value features. Now you want to use these features for machine
learning training and make these features available for real time
serving as well. You are given only a day to implement a good
solution for this and then move on to a different project. Which
options work best for you?
A. Store the features in BigQuery and retrieve using the
BigQuery Python client.
B. Create a Feature Store from scratch using BigQuery, Redis,
and Apache Beam.
C. Download and install open source Feast.
D. Use Vertex AI Feature Store.
19. Which of these statements is false?
A. Vertex AI Feature Store can ingest from BigQuery.
B. Vertex AI Feature Store can ingest from Google Cloud
Storage.
C. Vertex AI Feature Store can even store images.
D. Vertex AI Feature Store serves features with low latency.
20. Which of these statements is true?
A. Vertex AI Feature Store uses a time series model to store all
data.
B. Vertex AI Feature Store cannot ingest from Google Cloud
Storage.
C. Vertex AI Feature Store can even store images.
D. Vertex AI Feature Store cannot serve features with low
latency.
Chapter 14
BigQuery ML
Data analysts and others who are familiar with SQL often prefer to use BigQuery ML instead of other methods.
BigQuery ML Algorithms
BigQuery ML (also known as BQML) allows you to create machine learning models using standard SQL queries. You can create, train, test, and validate models, and make predictions with them, using only SQL. You don't have to write any Python code to use machine learning in BigQuery. Moreover, it is completely serverless.
Model Training
To create a model, the keyword to use is CREATE MODEL. This statement is similar to the CREATE TABLE DDL statement. When you run a query with the CREATE MODEL statement, a query job is generated that processes the query. There are also CREATE MODEL IF NOT EXISTS and CREATE OR REPLACE MODEL variants, whose intuitive names make it convenient to reuse model names.
CREATE MODEL modelname1
OPTIONS(model_type='linear_reg', input_label_cols=['label_col'])
AS SELECT * FROM table1
In the preceding SQL command, you must provide two options: model_type and input_label_cols. The model type specifies what kind of model you are trying to build; there are regression, classification, and time series models to choose from (see Table 14.1 for the full list of models available today). The second option, input_label_cols, identifies the target column in the training data.
Finally, the last part of the SQL command (SELECT * FROM table1) identifies the table you are going to use for training. Notice that it is a simple selection query, which means the query result is passed to the training job. In this clause you can select specific columns or rows, join multiple tables, and so on to create your training dataset. The only restrictions are that the target column must exist and that there are enough rows to train a model.
TABLE 14.1 Models available on BigQuery ML

Regression (predict a real value): LINEAR_REG, BOOSTED_TREE_REGRESSOR, DNN_REGRESSOR, AUTOML_REGRESSION
Classification (predict a binary label or multiple labels): LOGISTIC_REG, BOOSTED_TREE_CLASSIFIER, DNN_CLASSIFIER, DNN_LINEAR_COMBINED_CLASSIFIER, AUTOML_CLASSIFIER
Deep and wide models (recommendation systems and personalization): DNN_LINEAR_COMBINED_REGRESSOR, DNN_LINEAR_COMBINED_CLASSIFIER
Clustering (unsupervised clustering models): KMEANS
Collaborative filtering (recommendations): MATRIX_FACTORIZATION
Dimensionality reduction (unsupervised preprocessing step): PCA, AUTOENCODER
Time series forecasting: ARIMA_PLUS
General (generic TensorFlow model): TENSORFLOW
If you look at the list in Table 14.1, there are several kinds of models available. The expected ones, such as linear regression, classification, and clustering, are easy to define using SQL. However, as you go down the list, you will see DNN, which stands for deep neural network. In BigQuery ML you have complete flexibility to define and train DNN models by passing the right parameters in the options section; see Figure 14.3 for the full list of options for DNN_CLASSIFIER and DNN_REGRESSOR. These models are built using TensorFlow estimators, and you have all the flexibility you need to build the model of your choice.
Model Evaluation
It is recommended to evaluate the model on a separate dataset not seen by the model during training, using the ML.EVALUATE function:
SELECT * FROM
ML.EVALUATE(MODEL `projectid.test.creditcard_model1`,
(SELECT * FROM `test.creditcardtable`))
Running this query returns the result shown in Figure 14.4 in a few seconds; the web interface shows the query with the results below it.
Prediction
The ML.PREDICT function is used for prediction in BigQuery ML. You can pass an entire table to predict on, and the output will be a table with all the input columns and the same number of rows, along with two new columns: predicted_<label_column_name> and predicted_<label_column_name>_probs, where <label_column_name> is the name of the label column in the training data.
Here is the example SQL for making predictions from the model we
created:
SELECT * FROM
ML.PREDICT(MODEL `dataset1.creditcard_model1`,
  (SELECT * FROM `dataset1.creditcardpredict` LIMIT 1))
The result of the preceding query is shown in Figure 14.5. Notice that the prediction probability is shown for each label.
FIGURE 14.5 Query results showing only the predictions
Explainability in BigQuery ML
Explainability is important to debug models and improve
transparency, and in some domains it is even a regulatory
requirement. In BigQuery, you can get global feature importance
values at the model level or you can get explanations for each
prediction. These are also accessed using SQL functions.
To have explanations at the global level, you must set
enable_global_explain=TRUE during training. Here is the sample SQL
for our previous example:
CREATE OR REPLACE MODEL `model1`
OPTIONS(model_type='logistic_reg',
  enable_global_explain=TRUE,
  input_label_cols=['defaultpaymentnextmonth'])
AS SELECT *
FROM `dataset1.creditcardtable`
After the model has been trained, you can query its global explanations, which are returned as a table (Figure 14.6); each row contains an input feature and a floating point number representing its importance.
SELECT *
FROM
ML.GLOBAL_EXPLAIN(MODEL `model1`)
The numbers next to the features represent the impact of each feature on the predictions: the higher the attribution, the higher the feature's relevance to the model, and vice versa. Note, however, that the attributions are not normalized (they do not add up to 1). See Table 14.2.
FIGURE 14.6 Global feature importance returned for our model
TABLE 14.2 Model types

Linear and logistic regression: Shapley values and standard errors, p values. This is the average of a feature's marginal contributions to all possible coalitions.
Boosted trees: Tree SHAP and Gini based feature importance. Tree SHAP provides Shapley values optimized for decision tree based models.
Deep neural network and wide and deep: Integrated gradients. A gradients based method to efficiently compute feature attributions with the same axiomatic properties as Shapley values.
ARIMA_PLUS: Time series decomposition. Decomposes the time series into its components, if present.
There is a computational cost to adding explainability to predictions.
This is especially true for methods like Shapley where the complexity
increases exponentially with the number of features.
In the following SQL code, we use the ML.EXPLAIN_PREDICT function instead of the ML.PREDICT function, selecting all columns and only one row from a table called dataset1.credit_card_test. The result is shown in Figure 14.7.
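A query of this form, sketched here using the model created earlier and the top_k_features option to limit the output to the top five features, looks like this:

SELECT * FROM
ML.EXPLAIN_PREDICT(MODEL `model1`,
  (SELECT * FROM `dataset1.credit_card_test` LIMIT 1),
  STRUCT(5 AS top_k_features))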
In our case, the predicted value is –1.281. Now let us look at the top five features reported as part of the query result (Figure 14.8). One thing to notice is that these individual feature contributions are two orders of magnitude smaller than the baseline.
Notice that we are not exporting the data from BigQuery and then
importing into Vertex AI. Thanks to this integration, you can now
seamlessly connect to data in BigQuery.
Hashed Feature
This solution addresses three problems faced by categorical variables:
Incomplete vocabulary: The input data might not have the full set of values that a categorical variable can take. This creates a problem if the data is fed directly into an ML model.
High cardinality: Zip code is an example of a categorical variable
with high cardinality, which creates some scaling issues in ML.
Cold start problem: There could be new values added to the
categorical variable that might not have existed in the training
dataset—for example, the creation of a new employee ID when a
person joins.
One method to deal with these problems is to transform the high cardinality variable into a low cardinality domain by hashing. This can be done very easily in BigQuery like this:
ABS(MOD(FARM_FINGERPRINT(zipcode), numbuckets))
Transforms
Sometimes inputs to models are modified, enhanced, or engineered before being fed into the model, as in the hashing example. There is valuable information in the transformations applied to the inputs, and the transformations applied to the training data must also be applied to the inputs when the model is deployed in production. The transformation code therefore becomes part of the pipeline when you are making predictions. BigQuery has an elegant solution to this called the TRANSFORM clause, which is part of the CREATE MODEL statement:
CREATE OR REPLACE MODEL m
TRANSFORM(ML.FEATURE_CROSS(STRUCT(f1, f2)) AS cross_f,
  ML.QUANTILE_BUCKETIZE(f3) OVER() AS buckets,
  label_col)
OPTIONS(model_type='linear_reg', input_label_cols=['label_col'])
AS SELECT * FROM t
The caveat for this design pattern is that these models with transforms
will not work outside BigQuery ML, say, if you export the model to
Vertex AI.
Summary
BigQuery is an important service, and BigQuery ML revolutionized the use of ML in the SQL community by democratizing machine learning and making it available to many more people.
In this chapter we saw how to use SQL to perform all the actions of an ML pipeline. We also learned how to apply transformations to input values directly in SQL, which reduces the time it takes to create models. Although BigQuery ML is a separate service, it is highly interoperable with Vertex AI. Lastly, we saw some interesting design patterns that are unique to BigQuery ML.
Exam Essentials
Understand BigQuery and ML. Learn the history of BigQuery and the innovation of bringing machine learning into a data warehouse, making it available to data analysts and anyone familiar with SQL. Learn how to train, predict, and provide model explanations using SQL.
Be able to explain the differences between BigQuery ML
and Vertex AI and how they work together. These services
offer similar features but are designed for different users.
BigQuery ML is designed for analysts and anyone familiar with
SQL, and Vertex AI is designed for ML engineers. Learn the
various different integration points that make it seamless to work
between the two services.
Understand BigQuery design patterns. BigQuery has
elegant solutions to recurring problems in machine learning.
Hashing, transforms, and serverless predictions are easy to apply
to your ML pipeline.
Review Questions
1. You work as part of a large data analyst team in a company that
owns a global footwear brand. The company manufactures in
South Asia and distributes all over the globe. Its sales were
affected during the COVID 19 pandemic and so was distribution.
Your team has been asked to forecast sales per country with new
data about the spread of the illness and a plan for recovery.
Currently your data is on prem and sales data comes from all over
the world weekly. What will you use to forecast?
A. Use Vertex AI AutoML Tables to forecast sales as this is a
distributed case.
B. Use Vertex AI AutoML Tables with custom models (TensorFlow) because this is a special case due to COVID 19.
C. Use BigQuery ML, experiment with a TensorFlow model and
DNN models to find the best results.
D. Use BigQuery ML with ARIMA_PLUS, and use the BigQuery COVID-19 public dataset for trends.
2. You are part of a startup that rents bicycles, and you want to
predict the amount of time a bicycle will be used and the distance
it will be taken based on current location and userid. You are part
of a small team of data analysts, and currently all the data is
sitting in a data warehouse. Your manager asks you to quickly
create a machine learning model so that they can evaluate this
idea. Your manager wants to show this prototype to the CEO to
improve sales. What will you choose?
A. Use a TensorFlow model on Vertex AI tables to predict time
and distance.
B. Use the advanced path prediction algorithm in Google Maps.
C. Use BigQuery ML.
D. Use a Vertex AI custom model to get better results because
the inputs include map coordinates.
3. You are a data analyst for a large video sharing website. The
website has thousands of users who provide 5-star ratings for
videos. You have been asked to provide recommendations per
user. What would you use?
A. Use BigQuery classification model_type.
B. Use a Vertex AI custom model to build a collaborative
filtering model and serve it online.
C. Use the matrix factorization model in BigQuery ML to create
recommendations using explicit feedback.
D. Use Vertex AI AutoML for matrix factorization.
4. You are a data analyst and your manager gave you a TensorFlow
SavedModel to use for classification. You need to get some
predictions quickly but don't want to set up any instances or
create pipelines. What would be your approach?
A. Use BigQuery ML and choose TensorFlow as the model type
to run predictions.
B. Use Vertex AI custom models, and create a custom container
with the TensorFlow SavedModel.
C. TensorFlow SavedModel can only be used locally, so
download the data onto a Jupyter Notebook and predict
locally.
D. Use Kubeflow to create predictions.
5. You are working as a data scientist in the finance industry and
there are regulations about collecting and storing explanations for
every machine learning prediction. You have been tasked to
provide an initial machine learning model to classify good loans
and loans that have defaulted. The model that you provide will be
used initially and is expected to be improved further by a data
analyst team. What is your solution?
A. Use Kubeflow Pipelines to create a Vertex AI AutoML Table
with explanations.
B. Use Vertex AI Pipelines to create a Vertex AI AutoML Table
with explanations and store them in BigQuery for analysts to
work on.
C. Use BigQuery ML, and select “classification” as the model
type and enable explanations.
D. Use Vertex AI AutoML Tables with explanations and store
the results in BigQuery ML for analysts.
6. You are a data scientist and have built extensive Vertex AI
Pipelines which use Vertex AI AutoML Tables. Your manager is
asking you to build a new model with data in BigQuery. How do
you want to proceed?
A. Create a Vertex AI pipeline component to download the
BigQuery dataset to a GCS bucket and then run Vertex AI
AutoML Tables.
B. Create a new Vertex AI pipeline component to train BigQuery
ML models on the BigQuery data.
C. Create a Vertex AI pipeline component to execute Vertex AI
AutoML by directly importing a BigQuery dataset.
D. Create a scheduled query to train a model in BigQuery.
7. You are a data scientist and have built extensive Vertex AI
Pipelines which use Vertex AI AutoML Tables. Your manager is
asking you to build a new model with a BigQuery public dataset.
How do you want to proceed?
A. Create a Vertex AI pipeline component to download the
BigQuery dataset to a GCS bucket and then run Vertex AI
AutoML Tables.
B. Create a new Vertex AI pipeline component to train BigQuery
ML models on the BigQuery data.
C. Create a Vertex AI pipeline component to execute Vertex AI
AutoML by directly importing the BigQuery public dataset.
D. Train a model in BigQuery ML because it is not possible to
access BigQuery public datasets from Vertex AI.
8. You are a data scientist, and your team extensively uses Jupyter
Notebooks. You are merging with the data analytics team, which
uses only BigQuery. You have been asked to build models with
new data that the analyst team created in BigQuery. How do you
want to access it?
A. Export the BigQuery data to GCS and then download it to the
Vertex AI notebook.
B. Create an automated Vertex AI pipeline job to download the
BigQuery data to a GCS bucket and then download it to the
Vertex AI notebook.
C. Use Vertex AI managed notebooks, which can directly access
BigQuery tables.
D. Start using the BigQuery console to accommodate the analysts.
9. You are a data scientist, and your team extensively uses Vertex AI
AutoML Tables and pipelines. Your manager wants you to send
the predictions of new test data to test for bias and fairness. The
fairness test will be done by the analytics team that is comfortable
with SQL. How do you want to access it?
A. Export the test prediction data from GCS and create an
automation job to transfer it to BigQuery for analysis.
B. Move your model to BigQuery ML and create predictions
there.
C. Deploy the model and run a batch prediction on the new
dataset to save in GCS and then transfer to BigQuery.
D. Add the new data to your AutoML Tables test set, and
configure the Vertex AI tables to export test results to
BigQuery.
10. You are a data scientist, and your team extensively uses Vertex AI
AutoML Tables and pipelines. Your manager wants you to send
predictions to test for bias and fairness. The fairness test will be
done by the analytics team that is comfortable with SQL. How do
you want to access it?
A. Export the test prediction data from GCS and create an
automation job to transfer it to BigQuery for analysis.
B. Move your model to BigQuery ML and create predictions
there.
C. Deploy the model and run a batch prediction on the new
dataset to save in GCS and then transfer to BigQuery.
D. Deploy the model and run a batch prediction on the new
dataset to export directly to BigQuery.
11. You are a data scientist, and your team extensively uses Vertex AI
AutoML Tables and pipelines. Another team of analysts has built
some highly accurate models on BigQuery ML. You want to use
those models also as part of your pipeline. What is your solution?
A. Run predictions in BigQuery and export the prediction data
from BigQuery into GCS and then load it into your pipeline.
B. Retrain the models on Vertex AI tables with the same data
and hyperparameters.
C. Load the models in the Vertex AI model repository and run
batch predictions in Vertex AI.
D. Download the model and create a container for Vertex AI
custom models and run batch predictions.
12. You are a data analyst working with structured data. You are exploring different machine learning options, including Vertex AI and BigQuery ML. You have found that your model accuracy is suffering, and you suspect that a categorical feature (zipcode) with high cardinality is the cause, but you are not sure. How can you fix this?
A. Use the hashing function ABS(MOD(FARM_FINGERPRINT(zipcode), buckets)) in BigQuery to bucketize.
B. Remove the input feature and train without it.
C. Don't change the input as it affects accuracy.
D. Vertex AI tables will automatically take care of this.
13. You are a data analyst working with structured data in BigQuery
and you want to perform some simple feature engineering
(hashing, bucketizing) to improve your model accuracy. What are
your options?
A. Use the BigQuery TRANSFORM clause during CREATE MODEL for your feature engineering.
B. Have a sequence of queries to transform your data and then
use this data for BigQuery ML training.
C. Use Data Fusion to perform feature engineering and then
load it into BigQuery.
D. Build Vertex AI AutoML Tables which can automatically take
care of this problem.
14. You are part of a data analyst team working with structured data
in BigQuery but also considering using Vertex AI AutoML. Which
of the following statements is wrong?
A. You can run BigQuery ML models in Vertex AI AutoML
Tables.
B. You can use BigQuery public datasets in AutoML Tables.
C. You can import data from BigQuery into AutoML.
D. You can use SQL queries on Vertex AI AutoML Tables.
15. Which of the following statements is wrong?
A. You can run SQL in BigQuery through Python.
B. You can run SQL in BigQuery through the CLI.
C. You can run SQL in BigQuery through R.
D. You can run SQL in BigQuery through Vertex AI.
16. You are training models on BigQuery but also use Vertex AI
AutoML Tables and custom models. You want flexibility in using
data and models and want portability. Which of the following is a
bad idea?
A. Bring TensorFlow models into BigQuery ML.
B. Use TRANSFORM functionality in BigQuery ML.
C. Use BigQuery public datasets for training.
D. Use Vertex AI Pipelines for automation.
17. You want to standardize your MLOps using Vertex AI, especially
AutoML Tables and Vertex AI Pipelines, etc., but some of your
team is using BigQuery ML. Which of the following is incorrect?
A. Vertex AI Pipelines will work with BigQuery.
B. BigQuery ML models that include TRANSFORM can also be
run on AutoML.
C. BigQuery public datasets can be used in Vertex AI AutoML
Tables.
D. You can use BigQuery and BigQuery ML through Python
from Vertex AI managed notebooks.
18. Which of these statements about BigQuery ML is incorrect?
A. BigQuery ML supports both supervised and unsupervised
models.
B. BigQuery ML supports models for recommendation engines.
C. You can control the various hyperparameters of a deep
learning model like dropouts in BigQuery ML.
D. BigQuery ML models with TRANSFORM clause can be
ported to Vertex AI.
19. Which of these statements about BigQuery ML explanations is incorrect?
A. All BigQuery ML models provide explanations with each
prediction.
B. Feature attributions are provided both at the global level and
for each prediction.
C. The explanations vary by the type of model used.
D. Not all models have global explanations.
20. You work as part of a large data analyst team in a company that owns hundreds of retail stores across the country. Its sales were affected by bad weather. Currently your data is on-premises, and sales data comes from all across the country. What will you use to forecast sales using weather data?
A. Use Vertex AI AutoML Tables to forecast with previous sales
data.
B. Use Vertex AI AutoML Tables with a custom model (TensorFlow) and augment the data with weather data.
C. Use BigQuery ML, and use the Wide and Deep model to forecast sales for a wide number of stores as well as deep into the future.
D. Use BigQuery ML with ARIMA_PLUS, and use the BigQuery
public weather dataset for trends.
Appendix
Answers to Review Questions
Chapter 1: Framing ML Problems
1. A, B, D. First understand the use case, and then look for the
details such as impact, success criteria, and budget and time
frames. Finding the algorithm comes later.
2. B. Hyperparameters are variables that cannot be learned. You will
use a hyperparameter optimization (HPO) algorithm to
automatically find the best hyperparameters. This is not
considered when you are trying to match a business case to an ML
problem.
3. B. The input data is time-series data, and predicting the next 7 days is a typical forecasting problem.
4. B. A prediction has only two outputs: either valid or not valid.
This is binary classification. If there are more than two classes, it
is multiclass classification. Linear regression is predicting a
number. Option C is popular with support tickets to identify
clusters of topics but cannot be used in this case.
5. C. When you are trying to identify an object across several frames,
this is video object tracking. Option A is factually incorrect.
Option B is for images, not video. Scene detection or action
detection classifies whether an “action” has taken place in video, a
different type of problem, so option D is also wrong.
6. D. Topic modeling is an unsupervised ML problem. Given a set of
documents, it would cluster them into groups and also provide the
keywords that define each cluster.
7. C. Precision is a metric for unbalanced classification problems.
8. A. The root mean squared error (RMSE) is the best option if you
are trying to reduce extreme errors.
9. A. RMSE, MAPE, and R² are regression metrics. Accuracy is the only classification metric here.
10. C. We can eliminate RMSE because it is a regression metric.
Accuracy is also wrong because it is a poor metric for imbalanced
(1:100) datasets. So, the correct answer is either precision or
recall. In this case, a false negative could cause severe problems
later on, so we want to choose a metric that minimizes false
negatives. So, the answer is recall.
11. B. “No labeled data” means you cannot have supervised learning
or semi supervised learning. Reinforcement learning is when an
agent actively explores an environment (like a robot), which is not
relevant here. Only unsupervised learning can be applied to
purely unlabeled data.
12. C. The Like button here is the explicit feedback that users provide
on content and can be used for training. Collaborative filtering is
the class of algorithm that can be used for recommendations such
as in this case.
13. B. Option B is the bad idea because you need to update the data
related to the new products. The idea of a “golden dataset” exists,
but in this case, the dataset needs to be updated.
14. D. Use supervised learning when you have labeled data. Use
unsupervised learning when you have unlabeled data. Use semi
supervised learning when you have a mix. There is no such thing
as hyper supervised learning.
15. A, D. Option A is absolutely true and is done throughout the
industry. Option B is incorrect because it is done frequently in
practice. Option C is partially true because it may amplify errors,
but that does not mean you can never feed one model's output into another. Option D is correct
because there is an entire class of models that help in
transforming data for downstream prediction.
16. D. Whenever dealing with customer data and sensitive data, it is
important to test your model for biases and apply responsible AI
practices.
17. C. More testing data is not going to achieve much here. But that
does not mean we cannot do anything. You can't always remove
all the fields that may cause bias because some details might be
hidden in other fields. The correct answer is to use model
interpretability and explanations.
18. C. The model was deployed properly. Most Android phones can
handle deep learning models very well. We cannot say much
about the metric because it is unknown. This fun Android app
could be used by a wide variety of people and was possibly not
tested on a representative sample dataset.
19. B, D. There are many kinds of private data, not just photographs.
Scans are also private data. There should always be concerns
when using sensitive data.
20. B. While you can use the data creatively, there is always a privacy
concern when dealing with customer data. Option A is true
because you usually recommend other products at checkout.
Option C is true because changes in user behavior and in the
product catalog require retraining. Option D is true because you
can use important information about products, like similar
products or complementary products, to sell more.
Chapter 2: Exploring Data and Building Data
Pipelines
1. D. Oversampling is a way to handle an imbalanced class distribution.
2. A. The model performed poorly on new patient data due to label
leakage because you are training the model on hospital name.
3. A. Monitoring the model for skew and retraining will help address changes in the data distribution.
4. B. Model retraining will help address changes in the data distribution and minimize data skew.
5. B. Downsample the majority data with unweighting to create 10
percent samples.
6. D. Transforming the data after splitting it into training and test sets will avoid data leakage and will lead to better performance during model training.
7. C. Removing features with missing values will help because the
dataset has columns with missing values.
8. A, B, and D. All of the options describe reasons for data leakage
except option C, removing features with missing values.
Chapter 3: Feature Engineering
1. C. With one-hot encoding you can convert categorical features to numeric features. Moreover, not all algorithms work well on categorical features.
2. B. Normalizing the data will help convert the range into a
normalized format and will help converge the model.
3. C. For imbalanced datasets, AUC PR is a way to minimize false
positives compared to AUC ROC.
4. B. Since the model is performing well with training data, it is a
case of data leakage. Cross validation is one of the strategies to
overcome data leakage. We covered this in Chapter 2.
5. A, B. With the TensorFlow tf.data API, prefetching and interleaving are techniques to improve processing time.
6. A. Use a tf.data.Dataset.prefetch transformation.
7. C. We will get one feature cross of binned latitude, binned
longitude, and binned roomsPerPerson.
8. A. Cloud Data Fusion is the UI based tool for ETL.
9. A. TensorFlow Transform is the most scalable way to transform
your training and testing data for production workloads.
10. D. Since the model is underperforming on production data, there is training-serving skew. Using a tf.Transform pipeline helps prevent this skew by applying the same transformations during training and serving.
Chapter 4: Choosing the Right ML
Infrastructure
1. C. Always start with a pretrained model and see how well it solves
your problem. If that does not work, you can move to AutoML.
Custom models should always be the last resort.
2. C. “Glossary” is a feature that is intended to solve this exact
problem. If you have some terms that need to be translated in a
certain way, you can create a list of these in an XML document and
pass it to Google Translate. Choose the Advanced option and not
Basic. Whenever these specific words/phrases appear, it will
replace them with your translation from the glossary.
3. D. It is true that there is no “translated subtitle” service; however,
you can combine two services to suit your needs. Option A is
wrong because there is no AutoML offering for this in Vertex AI today. Options B
and C are possible but should not be the first step.
4. A. This is a classification problem. Using the AutoML Edge model
type is the right approach because the model will be deployed on
the edge device. While both Coral.ai and Android app deployment
are right, if you want to go to market quickly, it is better to go with
Android application using ML Kit.
5. A. You get the error “not found” when a GPU is not available in
the selected region. Not all regions have all GPUs. If you have
insufficient quota, you will get the error “Quota
‘NVIDIA_V100_GPUS’ exceeded.”
6. D. Option A is wrong because n1-standard-2 is too small for GPUs,
and option B is wrong because it is still using CPUs. Option D is
better because it is better to go with 1 TPU than 8 GPUs,
especially when you don't have any manual placements.
7. D. “Recommended for you” is intended for home pages, which
brings attention to the most likely products based on current
trends and user behavior. “Similar items” is based on product
information only, which helps customers choose between similar
products. “Others you may like” is the right choice for content
based on the user's browsing history. “Frequently bought
together” is intended to be shown at checkout when they can
quickly add more into their cart.
8. B. “Frequently bought together” is intended to be shown at
checkout when customers can quickly add more into their cart.
“Recommended for you” is intended for home pages, which brings
attention to the most likely product. “Similar items” is based on
product information only, which helps customers choose between
similar products. “Others you may like” is the right choice for
showing content based on the user's browsing history.
9. A. When you want the customer to “engage more,” it means you
want them to spend more time in the website/app browsing
through the products. “Frequently bought together” is intended to
be shown at checkout when customers can quickly add more into
their cart. “Recommended for you” is intended for home pages
and brings attention to the most likely product, and “Similar
items” is based on product information only, which helps
customers to choose between similar products. “Others you may
like” is the right choice for showing content based on the user's
browsing history.
10. C. When you do not have browsing data, or “user events,” you
have to create recommendations based on product catalog
information only. The only model that does not require “user
information” in this list is “Similar items,” which shows the
products that are similar to the one the user is currently viewing.
11. B. The objective of “click through rate” is based on the number of
times the user clicks and follows a link, whereas the “revenue per
order” captures the effectiveness for a recommendation being
made at checkout.
12. D. Option A is wrong because there is no AutoML for this today.
Currently there is no pretrained API available for this on GCP. A
Vertex AI custom job is the most appropriate.
13. B. Option A is wrong because the Natural Language API does not
accept voice. While options C and D are also correct, these are
custom models that will take time.
14. C. TPUs do not support custom TensorFlow operations. GPUs are the best option here.
15. A. Only A2 and N1 machine series support GPUs. Option C is
wrong because you cannot have 3 GPUs in an instance.
16. A. Pushing a large model to an Android device without hardware
support might slow the device significantly. Using devices with
Edge TPU installed is the best answer here.
17. A, C. You cannot have TPU and GPU in a single instance. You
would not usually go for a cluster of TPU VMs.
18. C. If you have a sparse matrix, TPUs will not provide the
necessary efficiency.
19. C. TPUs are not used for high precision predictions.
20. C, D. Options A and B increase the size of the instance without identifying the root cause. The question mentions that the model has already been deployed on a big instance (32-core). The next step should be to identify the root cause of the latency, so option C is a correct choice. Checking whether the code is single-threaded is also correct, because it is not always a hardware problem; it could be a configuration issue or a software issue such as single-threaded code.
Chapter 5: Architecting ML Solutions
1. B. The question is asking for the simplest solution, so we do not
need Memorystore and Bigtable as the latency requirement is
300ms@p99. The best and simplest way to handle this is using
App Engine to deploy the applications and call the model
endpoint on the Vertex AI Prediction.
2. B. Bigtable is designed for very low latency reads of very large
datasets.
3. A. To preprocess data you will use Dataflow, and then you can use the Vertex AI platform for training and serving. Since it's a recommendation use case, Cloud Bigtable is the recommended NoSQL store to manage this use case's storage at scale and reduce latency.
4. A. Since you want to minimize the infrastructure overhead, you
can use the Vertex AI platform for distributed training.
5. A, C. With Document AI, you can get started quickly because it is a solution offering built on the top layer of your AI stack, with the least infrastructure heavy lifting needed from you to set it up. Cloud Storage is the recommended data storage solution to create a document data lake.
6. A. When the question asks for retraining, look for a pipeline that
can automate and orchestrate the task. Kubeflow Pipelines is the
only option here that can help automate the retraining workflow.
7. D. Kubeflow Pipelines is the only choice that comes with an
experiment tracking feature. See
www.kubeflow.org/docs/components/pipelines/concepts/experiment
Index
A
access logging, 247
AdaGrad optimization, 42
Adam optimization, 42
AEAD (Authenticated Encryption with Associated Data), 104
AI (artificial intelligence)
best practices, 13–14
fairness, 13
interpretability, 13
intro, 2
model explanations, 13
privacy, 13
security, 14
AI/ML stack, 86
ANNs (artificial neural networks), 126
Apache Airflow, 228–229
Apache Beam, 28
asynchronous predictions, 95–96
asynchronous training, 123
AUC (area under the curve), 11
AUC ROC (Area Under the Curve Receiver Operating Characteristic),
11, 46
AUC PR curve, 12, 46
augmentation on the fly, 132
automation, 91
review question answers, 304–305
AutoML
CCAI (Contact Center AI), 69–70
compared to others, 58–60
Dialogflow, 69–70
Document AI, 69
images, 66–67
versus others, 87
Retail AI, 68
review question answers, 302–304
tables/structured data
BigQuery ML, 64
Vertex AI tables, 64–65
text, 67
Vertex AI, 60
video, 66–67
B
bar plots, 21–22
batch data, collecting, 146–147
batch prediction, 94
input data, 212–213
BigQuery, 28, 64, 87, 88
Dataproc and, 148
encryption and, 104
Jupyter Notebooks, 280–281
versus others, 87
Pub/Sub and, 146
Python API, 281
SQL queries, 280
tools for reading data, 88
BigQuery Data Transfer Service, 146
BigQuery integration, 155–156
BigQuery ML
data
import to Vertex AI, 290
test prediction, 290
Vertex AI Workbench Notebooks, 290
design
hashed feature, 291
transforms, 291–292
DNNs (deep neural networks), 283
explainability in, 286–288
export to Vertex AI, 291
Jupyter Notebooks, 285
ML.evaluate keyword, 284
model creation, 282
model training, 282–284
models, 283
predictions, 285–286
public dataset access, 289
review question answers, 313–314
Vertex AI, prediction results export, 290
Vertex AI tables comparison, 289
BigQuery Omni, 235
BigQuery REST API, 88
BigQuery Spark connector, 149
Bigtable, 90–91
binary classification, 7, 9
bivariate analysis, data visualization, 20
bucketing, 42
C
caching architecture, 206
categorical data, 41
categorical values, mapping
embedding, 44
feature hashing, 44
hybrid of hashing and vocabulary, 44
integer encoding, 43
label encoding, 43
one hot encoding, 43
OOV (out of vocab), 43
CCAI (Contact Center AI), 69–70
CI/CD pipeline, 230
class imbalance, 44–45
AUC ROC (Area Under the Curve Receiver Operating
Characteristic), 46
classification threshold, 45
false negative, 45
false positive, 45
true negative, 45
true positive, 44
classification, 7
binary, 7, 9
multiclass, 7
prediction classes, 9–10
classification threshold, 45
classification threshold invariance, 11
client side encryption, 105
clipping, 26
Cloud Bigtable, 149
Cloud Build, 215
Cloud Composer, 149, 229
Cloud Data Fusion, 51, 148
Cloud Dataflow, 147–148
Cloud Dataprep, 149
Cloud Dataproc, 148–149
BigQuery, 148
BigQuery Spark connector, 149
Cloud Bigtable, 149
Pub/Sub Lite Spark, 149
Cloud Run, 215
Cloud Scheduler, 215
Cloud Storage, Dataproc and, 148
clustering, 8
CNNs (convolutional neural networks), 126–127
concept drift, 178, 242
confusion matrix, 9
container logging, 247
containers
custom, model training, 166–168
prebuilt, model training and, 163–165
correlation
negative, 24
positive, 24
zero, 24
CT (continuous training) pipeline, 230
custom ML models
compared to others, 58–60
CPU, 71
GPUs (graphics processing units), 70
ALUs (arithmetic logic units), 71
restrictions, 71–72
virtual CPUs, 72
versus others, 87
TPUs (Tensor Processing Units), 72–73
advantages, 73
Cloud TPU model, 74
D
DAG (directed acyclic graph), 93
data
missing, 32
review question answers, 301–302
semi structured, model training and, 145
structured
modeling training and, 145
regression and, 7
unstructured, model training and, 145
data augmentation, 132
augmentation on the fly, 132
data cleaning, 25
data collection
batch data, 146–147
BigQuery, 88
Bigtable, 90–91
Cloud Composer, 149
Cloud Data Fusion, 148
Cloud Dataflow, 147–148
Cloud Dataprep, 149
Cloud Dataproc, 148–149
Datastore, 90–91
GCS (Google Cloud Storage), 88
Memorystore, 90–91
model training, 146–147
NoSQL data store, 90–91
privacy implications, 113–117
review question answers, 304–305
sensitive data removal, 116–117
streaming data, 146–147
Vertex AI, managed datasets, 89
Vertex AI Feature Store, 89
data compatibility. See data transformation
data constraints, validation, 27–28
data drift, 178, 243
data lakes, 28
data leakage, 33–34
data parallelism, 122
asynchronous training, 123
synchronous training, 123
data quality. See also data transformation; quality of data
data reliability. See reliability of data
data sampling, 29
data skew, 25
data splitting, 31
online systems, 31
data transformation
bucketing, 42
Cloud Data Fusion, 51
data compatibility and, 40
data quality and, 40
Dataprep by Trifacta, 51
dimensionality, 44
feature selection, 44
inside models, 41
mapping categorical values
embedding, 44
feature hashing, 44
hybrid of hashing and vocabulary, 44
integer encoding, 43
label encoding, 43
one hot encoding, 43
OOV (out of vocab), 43
mapping numeric values
bucketing, 42
normalizing, 42
normalizing, 42
pretraining, 40–41
structured data
categorical data, 41
numeric data, 41
TensorFlow Transform
library, 49–51
tf.data API, 49
TFX (TensorFlow Extended), 49–51
data validation
TFDV (TensorFlow Data Validation), 27–28, 272
TFX (TensorFlow Extended) platform, 27–28
data visualization
bar plots, 21–22
bivariate analysis, 20
box plots
outliers, 20–21
quartiles, 20–21
whiskers, 20–21
line plots, 21
scatterplots, 22
univariate analysis, 20
Dataflow, Pub/Sub and, 146
Dataprep by Trifacta, 51
datasets
imbalanced data, 29–31
model training, Vertex AI, 163
review question answers, 310–311
sampling
oversampling, 29
undersampling, 29
splitting data, 31
test datasets, 29
training datasets, 29
validation datasets, 29
Vertex AI managed datasets, 89
Datastore, 90–91
Datastream, 146
debugging, Vertex AI, 272
development workflow, 223
Dialogflow, 69
Agent Assist, 70
CCAI, 70
insights, 70
virtual agent, 70
DICOM (Digital Imaging and Communications in Medicine), 116
distributed training, model training, 168–169
DLP (Data Loss Prevention) API, 104, 114–115
DNNs (deep neural networks), 126
BigQuery ML, 283
Docker images, custom containers, 166–168
Document AI, 69
DP (differential privacy), Vertex AI and, 112
dynamic reference features, 203–204
architecture, 205
E
EDA (exploratory data analysis), 20
visualization
bar plots, 21–22
bivariate analysis, 20
box plots, 20–21
line plots, 21
scatterplots, 22
univariate analysis, 20
edge inference, 76
Edge TPU, 76
embedding, 44
encryption
BigQuery and, 104
client side, 105
FPE (Format Preserving Encryption), 113
at rest, 104–105
server side, 105
tokenization, 113
in transit, 105
in use, 105
explainability. See also Vertex Explainable AI
BigQuery ML, 286–288
global, 188
local, 188
F
false negative, 45
false positive, 45
feature crosses, 46–48
feature columns, 48
feature engineering, 40
class imbalance
AUC ROC, 46
classification threshold, 45
false negative, 45
false positive, 45
true negative, 45
true positive, 44
data preprocessing, 40–41
data transformation
Cloud Data Fusion, 51
data compatibility and, 40
data quality and, 40
Dataprep by Trifacta, 51
inside models, 41
pretraining, 40–41
TensorFlow Transform, 49–51
TFX (TensorFlow Extended), 49
feature crosses, 46–48
feature columns, 48
predictions, 74–75
deploy to Android, 76
deploy to iOS devices, 76
Edge TPU, 76
machine types, 75–76
ML Kit, 76
scaling, 75
review question answers, 302
feature hashing, 44
feature importance, 189
federated learning, Vertex AI and, 112
FHIR (Fast Healthcare Interoperability Resources), 116
forecasting, 8
FPE (Format Preserving Encryption), 113
G
GANs (generative adversarial networks), 132
GCP AI APIs, 235
GCS (Google Cloud Storage), 88
Github integration, 156–157
GNMT (Google Neural Machine Translation), 63
Google Cloud Healthcare API, 115–116
DICOM (Digital Imaging and Communications in Medicine), 116
FHIR (Fast Healthcare Interoperability Resources), 116
GPUs (graphics processing units), 70
ALUs (arithmetic logic units), 71
model parallelism, 124–125
restrictions, 71–72
virtual CPUs, 72
gradient descent, 128
H
host third party pipelines (MLflow) on Google Cloud, 213–214
hybrid cloud strategies, 235–236
hybrid of hashing and vocabulary, 44
hyperparameter tuning
algorithm options, 170
Bayesian search, 170
grid search, 170
importance, 170–171
optimization speed, 171
parameter comparison, 169
random search, 170
review question answers, 307–308
Vertex AI, 171–174
Vertex AI Vizier, 174
I
IAM (identity and access management), 104
FPE (Format Preserving Encryption), 113
project level roles, 105
resource level roles, 105
Vertex AI and, 106
federated learning, 112
Vertex AI Workbench permissions, 106–108
infrastructure, 86
review question answers, 302–304
inside model data transformation, 41
integer encoding, 43
interactive shells, 175–176
J
JupyterLab features, 154–155
K
k-NN (k-nearest neighbors) algorithm, missing data, 32
Kubeflow DSL, system design, 232–233
pipeline components, 233
Kubeflow Pipelines, 92–93, 224–225, 229
workflow scheduling, 230–232
Kubernetes Engine, 87
L
label encoding, 43
latency, online prediction, 96
line plots, 21
lineage tracking, metadata, 252
LOCF (last observation carried forward), 32
log scaling, 26
logging
log settings, 248
model monitoring and, 248
prediction logs
access logging, 247
container logging, 247
request response logging, 248
request response logging, 248
review question answers, 310–311
loss functions, 127–128
LSTMs (long short term memory), 127
M
machine types, 75
QPS (queries per second), 75
restrictions, 75–76
MAE (mean absolute error), 12
managed datasets, 89
managed notebook, Vertex AI Workbench
BigQuery integration, 155–156
creation, 153–154
data integration, 155
Github integration, 156–157
JupyterLab features, 154–155
scaling up, 156, 157
scheduling or executing code, 158–159
vs. user managed notebooks, 152–153
mapping categorical values
embedding, 44
feature hashing, 44
hybrid of hashing and vocabulary, 44
integer encoding, 43
label encoding, 43
one hot encoding, 43
OOV (out of vocab), 43
mapping numeric values, 42
mean, 22
skewed data, 25
variance, 23
Media Translation API, 63
median, 22
skewed data, 25
Memorystore, 90–91
metadata
lineage tracking, 252
review question answers, 310–311
Vertex ML, 249–250
artifacts, 249
context, 249
events, 250
execution, 249
metadataschema, 250
schema management, 250–252
Vertex AI Pipelines, 252
metrics, model training
interactive shells, 175–176
TensorFlow Profiler, 177
WIT (What If Tool), 177–178
missing data, 32
ML (machine learning), 2
business use cases, 3–4
classification, 7
clustering, 8
forecasting, 8
problem types, 6
problems, review question answers, 300–301
regression, 7
semi supervised learning, 7
supervised learning, 6
unsupervised learning, 6
topic modeling, 6–7
ML Kit, 76–77
ML metrics, 8–9
AUC ROC (Area Under the Curve Receiver Operating
Characteristic), 11
AUC PR curve, 12
regression
MAE (mean absolute error), 12
RMSE (root mean squared error), 12
RMSLE (root mean squared logarithmic error), 12
summary, 10
ML models
AutoML
CCAI (Contact Center AI), 69–70
compared to others, 58–60
Dialogflow, 69–70
Document AI, 69
images, 66–67
Retail AI, 68
structured data, 64–65
tables, 64–65
text, 67
video, 66–67
custom
compared to others, 58–60
CPU, 71
GPUs (graphics processing units), 70–72
TPUs (Tensor Processing Units), 72–74
pretrained, 60
compared to others, 58–60
Natural Language AI, 62–63
Speech to Text service, 63
Text to Speech service, 64
Translation AI, 63
Video AI, 62
Vision AI, 61–62
review question answers, 302–304
ML workflow, Google Cloud services, 85
MLOps (machine learning operations), 222, 260–261
data
analysis, 261
extraction, 261
preparation, 261
deployment
monitor, 261
serving, 261
Level 0 Manual/Tactical phase, 261–263
Level 1 Strategic Automation phase, 263–264
Level 2 CI/CD Automation, Transformational phase, 264–266
models
evaluation, 261
training, 261
validation, 261
mode, 23
model building
ANNs (artificial neural networks), 126
batches, 129
size, 129
tuning batch size, 129–130
bias, 133
variance trade off, 133
CNNs (convolutional neural networks), 126–127
data parallelism, 122
asynchronous training, 123
synchronous training, 123
DNNs (deep neural networks), 126
epoch, 129
gradient descent, 128
hyperparameters, 129
learning rate, 129
tuning, 130
loss functions, 127–128
model parallelism, 123–125
overfitting, 134
regularization, 134–136
dropout, 136
exploding gradients, 135
L1, 135
L2, 135
losses, 136
ReLU units, dead, 135
vanishing gradients, 135–136
review question answers, 306
RNNs (recurrent neural networks), 127
step size, 129
underfitting, 133–134
variance, 133
bias variance trade off, 133
model deployment, 207–209
model monitoring, 242
concept drift, 242–243
data drift, 243
review question answers, 310–311
Vertex AI, 243–244
drift, 244–245
input schemas, 245–247
skew, 244–245
Model Registry, 209
model retraining, 266–267
model servers
deployment, 200
serving time errors, 271
TensorFlow, 200
model training
algorithmic correctness, testing for, 180
AutoML, 161
BigQuery ML, 282–284
custom, 162
custom containers, 166–168
data
semi structured, 145
structured, 145
unstructured, 145
data analysis, 150–151
data collection, 146–147
Cloud Composer, 149
Cloud Dataprep, 149
data storage, 150–151
datasets, Vertex AI and, 163
distributed training, 168–169
metrics
interactive shells, 175–176
TensorFlow Profiler, 177
WIT (What If Tool), 177–178
MLOps (machine learning operations), 222
new data and, 230
prebuilt containers, 163–165
review question answers, 307–308
training time errors, 271
unit testing, 179
updates to API call, 180
Vertex AI Workbench, 109–110
managed notebook, 151–159
user managed notebook, 151–153, 159–161
workflow
custom jobs, 162
hyperparameter tuning jobs, 162
training pipelines, 162
Vertex AI and, 162
model versioning, 267–268
models, testing, performance, 214–215
multiclass classification, 7
multicloud strategies, 235–236
N
Naive Bayes, missing data, 32
NaN data error, 42
NaN values, 26, 32
Natural Language AI, 62–63
negative correlation, 24
neural networks
ANNs (artificial neural networks), 126
CNNs (convolutional neural networks), 126–127
data augmentation, 132
DNNs (deep neural networks), 126
RNNs (recurrent neural networks), 127
normalization, 42
NoSQL data store, 90–91
numeric data, 41
numeric values, mapping
bucketing, 42
normalizing, 42
O
offline data augmentation, 132
offline prediction, 94
one hot encoding, 43
online data augmentation, 132
online prediction
asynchronous
poll notifications, 95–96
push notifications, 95–96
endpoint setup, 207
latency, 96
making predictions, 210
model deployment, 207–209
synchronous, 95
online predictions
endpoints, undeploy, 211
explanation requests, 212
OOV (out of vocab), 43
optimization, hyperparameters, 159
orchestration, 91
frameworks, 223
Apache Airflow, 228–229
Cloud Composer, 229
Kubeflow Pipelines, 224–225, 229
Vertex AI Pipelines, 225–229
review question answers, 304–305
TensorFlow, 92
Vertex AI pipelines, 92
orchestrators, 94
outliers
clipping, 26
detecting, 23
handling, 26–27
P
parameters, hyperparameter comparison, 169
PCA (principal component analysis), 44
performance testing, 214–215
PHI (protected health information), 104, 113–116, 118
PII (personally identifiable information), 104, 113–115, 118
DLP (Data Loss Prevention), 114–115
Google Cloud Healthcare API, 115–116
pipelines
Apache Airflow and, 228–229
artifact lineage, 226
artifacts, 226
CI/CD pipeline, 230
Cloud Composer and, 229
CT (continuous training), 230
functionalities, 222
Kubeflow, 92–93
workflow scheduling, 230–232
Kubeflow DSL, 232–233
Kubeflow Pipelines, 224–225, 229
metadata, 226
review question answers, 301–302, 305–306, 309
TensorFlow Extended SDK, 93
triggering schedules, 215–216
Vertex AI Pipelines, 225–229
scheduling, 232
when to use, 93–94
poll notifications, 95–96
online prediction and, 95–96
positive correlation, 24
prebuilt containers, model training and, 163–165
precomputing prediction, 204–207
prediction, 74–75
batch, 94
input data, 212
BigQuery ML, 285–286
caching, architecture, 206
deploy to Android, 76
deploy to iOS devices, 76
dynamic reference features, 203–204
architecture, 205
Edge TPU, 76
lookup keys, 206–207
machine types, 75–76
ML Kit, 76
offline, 94
online
A/B testing versions, 210–211
asynchronous, 95–96
endpoint setup, 207
endpoints, undeploy, 211
explanation requests, 212
latency, 96
making, 211
model deployment, 207–209
synchronous, 95
precomputing, 204–207
review question answers, 304–305, 308–309
scaling, 75
scaling prediction service, 200–203
static reference features, 203–204
architecture, 205
TensorFlow Serving, 201–202
triggering jobs, 215–216
prediction classes, 9–10
prediction logs
access logging, 247
container logging, 247
request response logging, 248
pretrained ML models, 60
compared to others, 58–60
Natural Language AI, 62–63
review question answers, 302–304
Speech to Text service, 63
Text to Speech service, 64
Translation AI, 63
Video AI, 62
Vision AI, 61–62
pretraining data transformation, 40–41
Private endpoints, Vertex AI, 110–111
problem types, 6
Public endpoint, Vertex AI, 110
Pub/Sub, 146, 215
Pub/Sub Lite, 146
Pub/Sub Lite Spark connector, 149
push notifications, online prediction and, 95–96
Q
quality of data, 24–27
R
Random Forest algorithm, missing data, 32
redeployment evaluation
concept drift and, 178
data changes trigger, 179
data drift and, 178
on demand, 179
performance based trigger, 179
periodic, 179
when to retrain, 178–179
regression, 7
MAE (mean absolute error), 12
RMSE (root mean squared error), 12
RMSLE (root mean squared logarithmic error), 12
structured data and, 7
regularization, 134
dropout, 136
exploding gradients, 135
L1, 135
L2, 135
losses, 136
ReLU units, dead, 135
vanishing gradients, 135–136
reliability of data, 24–27
request response logging, 248
Retail AI, 68
retraining
concept drift and, 178
data changes trigger, 179
data drift and, 178
on demand, 179
models, 266–267
performance based trigger, 179
periodic, 179
when to retrain, 178–179
RMSE (root mean squared error), 12
RMSLE (root mean squared logarithmic error), 12
RNNs (recurrent neural networks), 127
ROC (receiver operating characteristics), 11
S
SaaS (Software as a Service), 86
scale invariance, 11
scaling, 25–26
prediction, 200–203
z score, 26
scatterplots, 22
semi structured data, model training and, 145
semi supervised learning, 7
sensitive data, removing, 116–117
Seq2seq+, 66
server side encryption, 105
services, 86
skewed data, 25
SMOTE (Synthetic Minority Oversampling Technique), 25
solutions, 86, 311–313
Speech to Text service, 63
splitting data, 31
SSL (semi supervised learning)
limitations, 131
need for, 131
standard deviation, 23
static reference features, 203–204
architecture, 205
statistics
correlation
negative, 24
positive, 24
zero, 24
mean, 22
variance, 23
median, 22
mode, 23
outliers, detecting, 23
standard deviation, 23
streaming data, collecting, 146–147
structured data
categorical data, 41
modeling training and, 145
numeric data, 41
regression and, 7
supervised learning, 6
synchronous training, 123
system design
Kubeflow DSL, 232–233
review question answers, 309
TFX (TensorFlow Extended), 234–235
T
t-SNE (t-distributed stochastic neighbor embedding), 44
Temporal Fusion Transformer, 66
TensorFlow
model serving, 200
multiclass classification, 128
orchestration, 92
training strategies, 124–125
TensorFlow Extended SDK, 93
TensorFlow ModelServer, 201–202
TensorFlow Profiler, 177
TensorFlow Serving, 201–202
TensorFlow Transform
library, 49–51
tf.data API, 49
TFX (TensorFlow Extended), 49–51
test datasets, 29
testing, for performance, 214–215
Text to Speech service, 64
tf.data API
tf.data.Dataset.cache, 49
tf.data.Dataset.prefetch, 49
TFDV (TensorFlow Data Validation), 20, 272
APIs (application programming interfaces), 28
exploratory data analysis phase, 28
production pipeline phase, 28
TFX (TensorFlow Extended), 27–28, 49–51
system design and, 234–235
time series data
data leakage and, 33
forecasting and, 8
TLS (Transport Layer Security), 105
tokenization, 113
TPUs (Tensor Processing Units), 58
training datasets, 29
training jobs, Vertex AI, 111
transfer learning, 130
Translation AI, 63
triggering prediction jobs, 215–216
true negative, 45
true positive, 44
U
univariate analysis, data visualization, 20
unstructured data, model training and, 145
unsupervised learning, 6
topic modeling, 6–7
user managed notebook, Vertex AI Workbench, 151–153, 159–161
V
validation datasets, 29
TFDV (TensorFlow Data Validation), 272
versioning models, 267–268
Vertex AI
APIs, 86
AutoML, 86
batch predictions, 212–213
BigQuery ML comparison, 289
data bias and fairness, 193–194
debugging shell, 272
endpoints, 110–111
DP (differential privacy) and, 112
example based explanations, 193
experiments, 252–253
review question answers, 310–311
federated learning, 112
IAM roles and, 106
interpretability term, 189
ML solution readiness, 194–195
model monitoring, 243–244
drift, 244–245
input schemas, 245–247
skew, 244–245
model training
custom containers, 166–168
datasets, 163
prebuilt containers, 163–165
workflow, 162
permissions
Access Transparency logs, 271
Cloud Audit logs, 271
service accounts, custom, 270
platform, 86
training jobs, 111
VPC network, 110
Workbench, 86, 109–110
Vertex AI AutoML, 60, 87
Vertex AI Feature Store, 89
data model, 269
ingestion, 269–270
serving, 269–270
solution, 268–269
Vertex AI Jupyter Notebooks, 88
Vertex AI Model Monitoring, retraining, 266–267
Vertex AI Pipelines, 28, 93–94, 225–229
scheduling, 232
Vertex AI tables, 64–65
Vertex AI Workbench, 109–110
IAM permissions, 106–108
managed notebook, 151–159
user managed notebook, 151–153, 159–161
Vertex Explainable AI, 189–190
explainability term, 189
explanations
batch, 195
online, 195
feature attributions
differentiable models, 192
integrated gradients method, 191, 192
nondifferentiable models, 192
Sampled Shapley method, 190–192
XRAI, 191, 192
feature importance, 189
global explainability, 188
local explainability, 188
local kernel explanations, 195
review question answers, 308
Vertex ML, metadata, 249–250
artifacts, 249
context, 249
events, 250
execution, 249
metadataschema, 250–252
Vertex AI Pipelines, 252
Vertex ML metadata, 92
Video AI, 62
Vision AI, 61–62
VPC network, Vertex AI, 110
W
WIT (What If Tool), 177–178
Workflow, 216
workpool tasks, distributed training, 168–169
Z
z score, 26
zero correlation, 24
Online Test Bank
To help you study for your Google Cloud Professional Machine
Learning Engineer certification exams, register to gain one year of
FREE access after activation to the online interactive test bank—
included with your purchase of this book! All of the practice questions
in this book are included in the online test bank so you can study in a
timed and graded setting.
5. Find your book on that page and click the “Register or Login” link with it. Then enter the pin code you received and click the “Activate PIN” button.
6. On the Create an Account or Login page, enter your username and password, and click Login or, if you don't have an account already, create a new account.
7. At this point, you should be in the test bank site with your new test bank listed at the top of the page. If you do not see it there, please refresh the page or log out and log back in.
WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.