Operationalizing Generative AI on Vertex AI
Generative AI on Vertex AI using MLOps
Authors: Anant Nawalgaria,
Gabriela Hernandez Larios, Elia Secchi,
Mike Styer, Christos Aniftos
and Onofrio Petragallo
Acknowledgements
Nenshad Bardoliwalla
Warren Barkley
Mikhail Chrestkha
Chase Lyall
Lakshmanan Sethu
Erwan Menard
Antonio Gulli
Anant Nawalgaria
Grace Mollison
Technical Writer
Joey Haymaker
Designer
Michael Lanning
September 2024
Table of contents
Introduction
Discover
Data Practices
Evaluate
Deploy
Version control
Infrastructure validation
Compression and optimization
Govern
Train
Tune
Orchestrate
Chain & Augment: Vertex AI Grounding, Extensions, and RAG building blocks
Experiment
Evaluation
Conclusion
Endnotes
Introduction
The emergence of foundation models and generative AI (gen AI) has introduced a new era
for building AI systems. Selecting the right model from a diverse range of architectures
and sizes, curating data, engineering optimal prompts, tuning models for specific tasks,
grounding model outputs in real-world data, optimizing hardware - these are just a few of the
novel challenges that large models introduce.
This whitepaper delves into the fundamental tenets of MLOps and the necessary adaptations
required for the domain of gen AI and Foundation Models. We also examine the diverse range
of Vertex AI products, specifically tailored to address the unique demands of foundation
models and gen AI-based applications. Through this exploration we uncover how Vertex AI,
with its solid foundations of AI infrastructure and MLOps tools, expands its capabilities to
provide a comprehensive MLOps platform for gen AI.
DevOps is a software engineering methodology that aims to bridge the gap between
development (Dev) and operations (Ops). It promotes collaboration, automation, and
continuous improvement to streamline the software development lifecycle, introducing
practices such as continuous integration and continuous delivery.
MLOps builds upon DevOps principles to address the unique challenges of operationalizing
Machine Learning systems rapidly and reliably. In particular, MLOps tackles the experimental
nature of ML through practices like:
• Model monitoring: Tracking model behavior in production to detect and mitigate drift.
• Tracking & reproducibility: Maintaining meticulous records for experiment tracking and
result reproduction.
Imagine deploying your first chatbot after months of dedicated work, and it's now interacting
with users and answering questions. Behind this seemingly simple interaction lies the
complex and fascinating life cycle of a gen AI System, which can be broken down into five
key moments.
First, in the discovery phase, developers and AI engineers must navigate the expanding
landscape of available models to identify the most suitable one for their specific gen AI
application. They must consider each model's strengths, weaknesses, and costs to make an
informed decision.
Data engineering practices have a critical role across all development stages, with factual
grounding (ensuring the model's outputs are based on accurate, up-to-date information) and
recent data from internal and enterprise systems being essential for reliable outputs. Tuning
data is often needed to adapt models to specific tasks, styles, or to rectify persistent errors.
Deployment must manage many new artifacts, including prompt templates, chain definitions, embedding models, retrieval data stores, and fine-tuned model adapters, among others. These artifacts each have unique governance requirements,
necessitating careful management throughout development and deployment. Gen AI system
deployment also needs to account for the technical capabilities of the target infrastructure,
ensuring that system hardware requirements are fulfilled.
Continuous Improvement as a concept is still key for Gen AI-based applications, though
with a twist. For most Gen AI applications, instead of training models from scratch, we’re
taking foundation models (FMs) and then adapting them to our specific use case. This means
constantly tweaking these FMs through prompting techniques, swapping them out for newer
versions, or even combining multiple models for enhanced performance, cost efficiency, or
reduced latency. Traditional continuous training still holds relevance for scenarios where recurrent fine-tuning or human feedback loops are needed.
Naturally, this lifecycle assumes that the foundational model powering the gen AI system is
already operationalized. It's important to recognize that not all organizations will be directly
involved in this part of the process. In particular, the operationalization of foundational
models is a specialized set of tasks that is typically only relevant for a select few companies
with the necessary resources and expertise.
Because of that, this whitepaper will focus on the practices required to operationalize gen AI applications by using and adapting existing foundation models, referring you to other whitepapers in the book should you want to deep dive into how foundation models are operationalized. That work includes active areas of research such as model pre-training, alignment (ensuring the model's outputs align with the desired goals and values), evaluation, and serving.
Figure 2. Lifecycle of a foundation model and gen AI system, and the related operationalization practices
Discover
As mentioned before, building foundational models from scratch is resource-intensive.
Training costs and data requirements are substantial, pushing most practitioners towards
adapting existing foundation models through techniques like fine-tuning and prompt
engineering. This shift highlights a crucial need: efficiently discovering the optimal foundation
model for a given use case.
These two characteristics of the gen AI landscape make model discovery an essential MLOps practice. When comparing candidate models, practitioners typically weigh factors such as:
1. Quality: Early assessments can involve running test prompts or analyzing public
benchmarks and metrics to gauge output quality.
2. Latency & throughput: These factors directly impact user experience. A chatbot
demands lower latency than batch-processed summarization tasks.
3. Development & maintenance time: Consider the time investment for both initial
development and ongoing maintenance. Managed models often require less effort than
self-deployed open-source alternatives.
4. Usage cost: Factor in infrastructure and consumption costs associated with using the
chosen model.
Because the activity of discovery has become so important for gen AI systems, many model discovery platforms have been created to support this need. One example is Vertex Model Garden,1 which is explored later in this whitepaper.
Foundation models differ from predictive models most importantly because they are multi-
purpose models. Instead of being trained for a single purpose, on data specific to that
task, foundation models are trained on broad datasets, and therefore can be applied to
many different use cases. This distinction brings with it several more important differences
between foundation models and predictive models.
Foundation models also exhibit what are known as ‘emergent properties’,2 capabilities that
emerge in response to specific input without additional training. Predictive models are
only able to perform the single function they were trained for; a traditional French-English
translation model, for instance, cannot also solve math problems.
Foundation models are also highly sensitive to changes in their input. The output of the
model and the task it performs are strongly affected, indeed determined, by the input to the
model. A foundation model can be made to perform translation, generation, or classification
tasks simply by changing the input. Even insignificant changes to the input can affect its
ability to correctly perform that task.
These new properties of foundation models have created a corresponding paradigm shift
in the practices required to develop and operationalize Gen AI systems. While models in
the predictive AI context are self-sufficient and task-specific, gen AI models are multi-
purpose and need an additional element beyond the user input to function as part of a
gen AI application: a prompt, and more specifically, a prompt template, defined as a set of instructions and examples along with placeholders to accommodate user input. A prompt template can be combined with dynamic data, such as user input, to create a complete prompt: the text that is passed as input to the foundation model.
Figure 3. How Prompt Template and User input can be combined to create a prompt
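To make this concrete, below is a minimal sketch in plain Python of how a prompt template with placeholders can be combined with dynamic user input to produce the complete prompt passed to the model. The template text and placeholder names are illustrative only, not taken from this whitepaper.

```python
# A minimal sketch of combining a prompt template with dynamic user input.
# The template text and placeholder names are illustrative, not from this whitepaper.

PROMPT_TEMPLATE = """You are a helpful support assistant for an online store.
Answer the user's question using only the context provided.

Context:
{context}

Examples:
{few_shot_examples}

User question:
{user_input}
"""

def build_prompt(user_input: str, context: str, few_shot_examples: str) -> str:
    """Fill the template's placeholders to produce the complete prompt."""
    return PROMPT_TEMPLATE.format(
        context=context,
        few_shot_examples=few_shot_examples,
        user_input=user_input,
    )

complete_prompt = build_prompt(
    user_input="Where is my order #12345?",
    context="Orders ship within 2 business days; tracking is emailed on dispatch.",
    few_shot_examples="Q: Can I change my address?\nA: Yes, before the order ships.",
)
# `complete_prompt` is the text that is passed as input to the foundation model.
```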
This introduces an important distinction when it comes to MLOps practices for gen AI. In
the development of a gen AI System, experimentation and iteration need to be done in the
context of a prompted model component, the combination of a model and a prompt. The Gen
AI experimentation cycle typically begins with testing variations of the prompt – changing the wording of the instructions, providing additional context, or including relevant examples – and evaluating the impact of those changes. This practice is commonly referred to as prompt engineering.
• Prompting: Crafting and refining prompts to elicit desired behaviors from a foundation model for a specific use case.
Prompts, however, raise the question of what type of artifact they are. In the days of predictive AI, the lines were clear: data was one thing, pipelines and code another. But with the prompt paradigm in gen AI, those lines get blurry. Prompts can include anything from context, instructions, examples, and guardrails to actual internal or external data pulled from elsewhere. So, are prompts data? Are they code?
To address these questions, a hybrid approach is needed, recognizing that a prompt has
different components and requires different management strategies. Let’s break it down:
Prompt as Data: Some parts of the prompt will act just like data. Elements like few-shot
examples, knowledge bases, and user queries are essentially data points. For these
components, we need data-centric MLOps practices such as data validation, drift detection,
and lifecycle management.
Prompt as Code: Other components, such as context, prompt templates, and guardrails, are more code-like. They define the structure and rules of the prompt itself. Here, we need code-centric practices such as approval processes, code versioning, and testing.
As a result, when applying MLOps practices to gen AI, it becomes important to have in place
processes that give developers easy storage, retrieval, tracking, and modification of prompts.
This allows for fast iteration and principled experimentation. Often one version of a prompt
will work well with a specific version of the model and less well with a different version. In
tracking the results of an experiment, the versions of the prompt and its components, as well as the model version, must be recorded and stored along with the metrics and output data produced by the prompted model.
The fact that development and experimentation in gen AI requires working with the prompt
and the model together introduces changes in some of the common MLOps practices,
compared to the predictive AI case in which experimentation is done by changing the model
alone. Specifically, several of the MLOps practices need to be expanded to consider the
prompted model component together as a unit. This includes practices like evaluation,
experiment tracking, model adaptation and deployment, and artifact management,
which will be discussed below in this whitepaper.
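As a rough illustration of tracking the prompted model component as a single unit, the sketch below records the prompt template version, model version, and evaluation metrics as one run using Vertex AI Experiments through the Python SDK. The project, experiment, run, parameter, and metric names are placeholders, not prescriptions.

```python
from google.cloud import aiplatform

# Placeholder project, location, and experiment names; replace with your own.
aiplatform.init(
    project="my-project",
    location="us-central1",
    experiment="prompted-model-experiments",
)

# Track the prompted model component (prompt + model) as one unit.
aiplatform.start_run("summarizer-prompt-v3-gemini")
aiplatform.log_params({
    "model_name": "gemini-pro",           # model version under test
    "prompt_template_version": "v3",      # version of the prompt template
    "temperature": 0.2,
})
# Metrics produced by evaluating the prompted model on an evaluation dataset.
aiplatform.log_metrics({"rougeL": 0.41, "groundedness": 0.87})
aiplatform.end_run()
```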
Gen AI models, particularly large language models (LLMs), face inherent challenges in
maintaining recency and avoiding hallucinations. Encoding new information into LLMs
requires expensive and data-intensive pre-training, posing a significant hurdle. Additionally,
LLMs might be unable to solve complex challenges, especially when step-by-step reasoning
is required. Depending on the use case, leveraging only one prompted model to perform
a particular generation might not be sufficient. To solve this issue, a divide-and-conquer approach can be used: several prompted models are connected together, along with calls to external APIs and logic expressed as code. A sequence of prompted model components connected together in this way is commonly known as a chain.
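For illustration, here is a minimal sketch of such a chain, assuming the Vertex AI Python SDK's GenerativeModel class with a gemini-pro model; the prompts and the fetch_order_status helper (standing in for an external API call) are hypothetical.

```python
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # placeholder project
model = GenerativeModel("gemini-pro")

def fetch_order_status(order_id: str) -> str:
    """Hypothetical stand-in for a call to an internal order-tracking API."""
    return f"Order {order_id} shipped yesterday."

def answer_order_question(user_question: str) -> str:
    # Step 1: a prompted model component extracts the order id from the question.
    extraction = model.generate_content(
        "Extract the order id from this question and reply with the id only:\n"
        + user_question
    )
    order_id = extraction.text.strip()

    # Step 2: an external API call supplies grounding data.
    status = fetch_order_status(order_id)

    # Step 3: a second prompted model component drafts the final answer.
    answer = model.generate_content(
        f"Using this order status: '{status}', answer the question:\n{user_question}"
    )
    return answer.text

print(answer_order_question("Where is my order 12345?"))
```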
Two common chain-based patterns that have emerged to mitigate recency limitations and hallucinations are retrieval augmented generation (RAG)3 and agents.
RAG and Agents approaches can be combined to create multi-agent systems connected
to large information networks, enabling sophisticated query handling and real-time
decision-making.
The orchestration of different models, logic, and APIs is not unique to gen AI applications. For example, recommendation engines have long combined collaborative
filtering models, content-based models, and business rules to generate personalized
product recommendations for users. Similarly, in fraud detection, machine learning
models are integrated with rule-based systems and external data sources to identify
suspicious activities.
What makes these chains of gen AI components different is that we can't characterize or cover the distribution of component inputs a priori, which makes the individual components much harder to evaluate and maintain in isolation.
This results in a paradigm shift in how AI applications are being developed for gen AI.
Unlike Predictive AI where it is often possible to iterate on the separate models and
components in isolation to then chain in the AI application, in gen AI it’s often easier to
develop a chain in integration, performing experimentation on the chain end-to-end, iterating
over chaining strategies, prompts, the underlying foundational models and other APIs in
a coordinated manner to achieve a specific goal. Often, no feature engineering, data collection, or further model training cycles are needed; just changes to the wording of the prompt template.
The shift towards MLOps for gen AI, in contrast to predictive AI, brings forth a new set of
demands. Let's break down these key differences:
1. Evaluation: Because of their tight coupling, chains need end-to-end evaluation, not just
on a per-component basis, to gauge their overall performance and the quality of their
output. In terms of evaluation techniques and metrics, evaluating chains is not dissimilar
to evaluating prompted models. Please refer to the below segment on evaluation for more
details on these approaches.
2. Versioning: A chain needs to be managed as a complete artifact in its entirety. The chain
configuration should be tracked with its own revision history for analysis, reproducibility,
and understanding the impact of changes on output. Logging should also include the
inputs, outputs, and intermediate states of the chain, and any chain configurations used
during each execution.
3. Introspection: The ability to inspect the internal data flows of a chain (inputs and outputs
from each component) as well as the inputs and outputs of the entire chain is paramount.
By providing visibility into the data flowing through the chain and the resulting content,
developers can pinpoint the sources of errors, biases, or undesirable behavior.
There are several products in Vertex AI that can support the need for chaining and augmentation, including Grounding as a service,5 Extensions,6 Vector Search,7 and Agent Builder.8 We discuss these products in the section "Role of an AI Platform". LangChain9 is also integrated with the Vertex SDK,10 and can be used alongside the core Vertex products to define and configure gen AI chained applications.
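A sketch of that integration, assuming the langchain-google-vertexai package, might compose a prompt template and a Vertex AI model into a simple chain as shown below; class names, the model name, and the example ticket text reflect common usage and may differ in your environment.

```python
from langchain_google_vertexai import VertexAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# A prompt template (a code-like artifact) with a placeholder for user data.
prompt = PromptTemplate.from_template(
    "Summarize the following support ticket in two sentences:\n{ticket}"
)

llm = VertexAI(model_name="gemini-pro", temperature=0.2)  # assumed model name

# Compose the chain: template -> model -> plain-string output.
chain = prompt | llm | StrOutputParser()

summary = chain.invoke(
    {"ticket": "Customer reports the mobile app crashes on login after the update."}
)
print(summary)
```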
When developing a gen AI use case for a specific task that involves LLMs, it can be difficult, especially for complex tasks, to rely on prompt engineering and chaining alone to solve it. To improve task performance, practitioners often also need to fine-tune the model directly. Fine-tuning lets you actively change the layers or a subset of layers of the LLM to optimize the model's capability to perform a certain task. Two of the most common ways of tuning a model are:
1. Supervised fine-tuning: This is where we train the model in a supervised manner, teaching
it to predict the right output sequence for a given input.
2. Reinforcement Learning from Human Feedback (RLHF): In this approach, we first train
a reward model to predict what humans would prefer as a response. Then, we use this
reward model to nudge the LLM in the right direction during the tuning process. It is like having a panel of human judges guiding the model's learning.
When viewed through the MLOps lens, fine-tuning shares similar requirements with
model training:
1. The capability to track the artifacts that are part of the tuning job, including, for example, the input data and the parameters used to tune the model.
2. The capability to measure the impact of the tuning. This translates into the capability
to perform evaluation of the tuned model for the specific tasks it was trained on and to
compare results with previously tuned models or frozen models for the same task.
Platforms like Vertex AI11 (and the Google Cloud platform more broadly) provide a robust
suite of services designed to address these MLOps requirements: Vertex Model Registry,12
for instance, provides a centralized storage location for all the artifacts created during the
tuning job, and Vertex Pipelines13 streamlines the development and management of these
tuning jobs. Dataplex,14 meanwhile, provides an organization-wide data fabric for data lineage
and governance and integrates well with both Vertex AI and BigQuery.15 What's more, these products provide the same governance capability for both predictive and gen AI applications, meaning customers do not need separate products or configurations to manage generative versus predictive AI development.
The approach to continuous tuning depends on the specific use case and goals. For relatively
static tasks like text summarization, the continuous tuning requirements may be lower. But
for dynamic applications like chatbots that need constant human alignment, more frequent
tuning using techniques like RLHF based on human feedback is necessary.
To determine the right continuous tuning strategy, AI practitioners must carefully evaluate
the nature of their use case and how the input data evolves over time. Cost is also a major
consideration, as the compute infrastructure greatly impacts the speed and expense of
tuning. We discuss in detail monitoring of GenAI systems in the Logging and Monitoring
section of this whitepaper.
Graphics processing units (GPUs) and tensor processing units (TPUs) are key hardware for
fine-tuning. GPUs, known for their parallel processing power, are highly effective at handling the computationally intensive workloads often associated with training and running complex machine learning models. TPUs, on the other hand, are specifically designed
by Google for accelerating machine learning tasks. TPUs excel in handling large matrix
operations common in deep learning neural networks.
To manage costs, techniques like model quantization can be applied. This represents model
weights and activations using lower-precision 8-bit integers rather than 32-bit floats, which
reduces computational and memory requirements.
We discuss in detail the support for tuning in Vertex AI in the Customize: Vertex AI Training &
Tuning section.
Data Practices
Traditionally, ML model behavior was dictated solely by its training data. While this still holds
true for foundation models – trained on massive, multilingual, multimodal datasets – gen AI
applications built on top of them introduce a new twist: model behavior is now determined by
how you adapt the model using different types of input data (Figure 9).
Figure 9. Examples of data spectrum for foundation models – creation (left) vs. adaptation (right)
The key difference between traditional predictive ML and gen AI lies in where you start. In
predictive ML, the data is paramount. You spend a lot of time on data engineering, and if you
don’t have the right data, you cannot build an application. Gen AI takes a unique approach to
this matter. You start with a foundation model, some instructions and maybe a few example
inputs (in-context learning). You can prototype and launch an application with surprisingly
little data.
This ease of prototyping, however, comes with a challenge. Traditional predictive AI relies on well-defined datasets specified a priori. In gen AI, a single application can leverage various data types,
from completely different data sources, all working together (Figure 10). Let’s explore some
of these data types:
• Conditioning prompts: These are essentially instructions given to the Foundation Model
(FM) to guide its output, setting boundaries of what it can generate.
• Few-shot examples: A way to show the model what you want to achieve through input-
output pairs. This helps the model grasp the specific task(s) at hand, and in many cases, it
boosts performance.
• Grounding/augmentation data: Data coming from either external APIs (like Google
Search) or internal APIs and data sources. This data permits the FM to produce answers for a specific context, keeping responses current and relevant without retraining the entire FM. This type of data also helps reduce hallucinations.
• Task-specific datasets: These are used to fine-tune an existing FM for a particular task,
improving its performance in that specific area.
• Full pre-training corpora: These are massive datasets used to initially train foundation models. While application builders may not have access to them or to the tokenizers,
the information encoded in the model itself will influence the application’s output
and performance.
This is not an exhaustive list. The variety of data used in gen AI applications is constantly
growing and evolving.
Figure 10. Example of high-level data and adaptations landscape for developing gen AI applications using
existing foundation models
This diverse range of data adds another complexity layer in terms of data organization,
tracking and lifecycle management. Take a RAG-based application as an example: it might
involve rewriting user queries, dynamically gathering relevant examples using a curated set
of examples, querying a vector database, and combining it all with a prompt template. This
involves managing multiple data types: user queries, vector databases with curated few-shot
examples and company information, and prompt templates.
Each data type needs careful organization and maintenance. For example, the vector
database requires processing data into embeddings, optimizing chunking strategies, and
ensuring only relevant information is available. The prompt template itself needs versioning
and tracking, the user queries need rewriting, etc. This is where traditional MLOps and
DevOps best practices come into play, with a twist. We need to ensure reproducibility, adaptability, governance, and continuous improvement for all the data an application requires, both as a whole and for each data type individually. Think of it this way: in predictive AI, the focus was on well-defined data pipelines for extraction, transformation, and loading. In gen AI, it's about building pipelines to manage, evolve, adapt, and integrate different data types in a versionable, trackable, and reproducible way.
As mentioned earlier, fine-tuning foundation models (FMs) can boost gen AI app
performance, but it needs data. You can get this data by launching your app and gathering
real-world data, generating synthetic data, or a mix of both. Using large models to generate
synthetic data is becoming popular because it speeds things up, but it's still good to have a
human check the results for quality assurance. Here are a few ways to leverage large models for data engineering purposes:
1. Synthetic data generation: This process involves creating artificial data that closely resembles real-world data in terms of its characteristics and statistical properties, often using a large and capable model (a minimal sketch follows this list). This synthetic data serves as additional training data for gen AI, enabling it to learn patterns and relationships even when labeled real-world data is scarce.
2. Synthetic data correction: This technique focuses on identifying and correcting errors
and inconsistencies within existing labeled datasets. By leveraging the power of larger
models, gen AI can flag potential labeling mistakes and propose corrections, improving the
quality and reliability of the training data.
3. Synthetic data augmentation: This approach goes beyond simply generating new
data. It involves intelligently manipulating existing data to create diverse variations while
preserving essential features and relationships. Thus, gen AI can encounter a broader
range of scenarios during training, leading to improved generalization and ability to
generate nuanced and relevant outputs.
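As a rough sketch of the first approach above, the snippet below uses a large model through the Vertex AI SDK to generate synthetic question-answer pairs; the prompt, topics, and JSON schema are illustrative, and human review is assumed before the data is used for tuning or evaluation.

```python
import json

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # placeholder project
generator = GenerativeModel("gemini-pro")

SEED_TOPICS = ["return policy", "shipping times", "warranty claims"]  # illustrative

def generate_synthetic_pairs(topic: str, n: int = 3) -> list[dict]:
    """Ask a large model to produce synthetic Q&A pairs for a topic as JSON."""
    response = generator.generate_content(
        f"Generate {n} realistic customer questions about '{topic}' for an online "
        "store, each with a concise answer. Return only JSON in the form "
        '[{"question": "...", "answer": "..."}].'
    )
    return json.loads(response.text)  # assumes the model returned valid JSON

synthetic_dataset = []
for topic in SEED_TOPICS:
    synthetic_dataset.extend(generate_synthetic_pairs(topic))

# A human review step for quality assurance would follow before this data is used.
print(f"Generated {len(synthetic_dataset)} synthetic examples.")
```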
Evaluating gen AI, unlike predictive AI, is tricky. You don't usually know the training data
distribution of the foundational models. Building a custom evaluation dataset reflecting your
use case is essential. This dataset should cover essential, average, and edge cases. Similar
to fine-tuning data, you can leverage powerful language models to generate, curate, and
augment data for building robust evaluation datasets.
Evaluate
Even when only prompt engineering is performed, it remains an experimental process that requires evaluation in order to iterate and improve. This makes the evaluation process a core activity in the development of any gen AI system.
In the context of gen AI systems, evaluation might have different degrees of automation: from
entirely driven by humans to entirely automated by a process.
In the early days of a project, when you're still prototyping, evaluation is often a manual
process. Developers eyeball the model's outputs, getting a qualitative sense of how it's
performing. But as the project matures and the number of test cases balloons, manual
evaluation becomes a bottleneck. That's when automation becomes key.
Automating evaluation has two big benefits. First, it lets you move faster. Instead of spending
time manually checking each test case, you can let the machines do the heavy lifting.
This means more iterations, more experiments, and ultimately, a better product. Second,
automation makes evaluation more reliable. It takes human subjectivity out of the equation,
ensuring that results are reproducible.
But automating evaluation for gen AI comes with its own set of challenges.
For one, both the inputs (prompts) and outputs can be incredibly complex. A single prompt
might include multiple instructions and constraints that the model needs to juggle. And the
outputs themselves are often high-dimensional - think a generated image or a block of text.
Capturing the quality of these outputs in a simple metric is tough.
There are some established metrics, like BLEU for translations and ROUGE for summaries,
but they don't always tell the full story. That's where custom evaluation methods come in.
One approach is to use another foundational model as a judge. For example, you could
prompt a large language model to score the quality of generated texts across various
dimensions. This is the idea behind techniques like AutoSxS.16
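A minimal sketch of this idea, using one model to rate another model's output (not the actual AutoSxS implementation), might look like the following; the rubric wording and the 1-to-5 scale are illustrative assumptions.

```python
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # placeholder project
judge = GenerativeModel("gemini-pro")  # the model acting as an autorater

RUBRIC = (
    "You are an evaluator. Given a question, a context, and a candidate answer, "
    "rate the answer from 1 (poor) to 5 (excellent) for helpfulness and factual "
    "consistency with the context. Reply with the number only."
)

def judge_answer(question: str, context: str, answer: str) -> int:
    response = judge.generate_content(
        f"{RUBRIC}\n\nQuestion: {question}\nContext: {context}\nAnswer: {answer}"
    )
    return int(response.text.strip())  # assumes the judge replies with a bare number

score = judge_answer(
    question="What is the return window?",
    context="Items can be returned within 30 days of delivery.",
    answer="You can return items within 30 days of delivery.",
)
print(score)
```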
Another challenge is the subjective nature of many evaluation metrics for gen AI. What
makes one output ‘better’ than another can often be a matter of opinion. The key here is to
make sure your automated evaluation aligns with human judgment. You want your metrics
to be a reliable proxy for what people would think. And to ensure comparability between
experiments, it's crucial to lock down your evaluation approach and metrics early in the
development process.
Lack of ground truth data is another common hurdle, especially in the early stages of a
project. One workaround is to generate synthetic data to serve as a temporary ground truth,
which can be refined over time with human feedback.
Depending on the use case, the evaluation process will require a high degree
of customization.
It is possible to generate synthetic ground truth data to compensate for the lack of real ground truth data.
Deploy
It should be clear by this point that production gen AI applications are complex systems with
many interacting components. Some of the common components discussed include multiple
prompts, models, adapter layers and external data sources. In deploying a gen AI system to
production, all these components need to be managed and coordinated with the previous
stages of gen AI system development. Given the novelty of these systems, best practices
for deployment and management are still evolving, but we can discuss observations and
recommendations for these components and indicate how to address the major concerns.
Deploying gen AI solutions necessarily involves multiple steps. For example, a single
application might utilize several large language models (LLMs) alongside a database, all
fed by a dynamic data pipeline. Each of these components potentially requires its own
deployment process.
Version control
• Chain definitions: The code defining the chain (including API integrations, database calls,
functions, etc.) should also be versioned using tools like Git. This provides a clear history
and enables easy rollback if needed.
• Adapter models: The landscape of techniques like LoRA tuning for adapter models is constantly evolving. You can leverage established data storage solutions (e.g., Cloud Storage) to manage and version these assets effectively.
In a continuous integration framework, every code change goes through automatic testing
before merging to catch issues early. Here, unit and integration testing are key for quality
and reliability. Unit tests act like a microscope, zooming in on individual code pieces, while
integration testing verifies that different components work together.
• Catch bugs early: Identifying issues through testing prevents them from causing bigger problems downstream. It also makes the system more robust and resilient to edge cases and unexpected inputs.
These benefits are applicable to gen AI Systems as much as any software product.
Continuous Integration should be applied to all elements of the system, including the prompt
templates, the chain and chaining logic, and any embedding models and retrieval systems.
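For example, a CI pipeline could include lightweight unit tests for prompt templates; the sketch below uses pytest with a hypothetical build_prompt helper and template, and would run automatically on every code change.

```python
# test_prompts.py -- example unit tests executed by CI (pytest).
import pytest

PROMPT_TEMPLATE = (
    "Answer the user's question using only the context.\n"
    "Context: {context}\nQuestion: {user_input}\n"
)

def build_prompt(user_input: str, context: str) -> str:
    """Hypothetical helper that fills the template's placeholders."""
    return PROMPT_TEMPLATE.format(user_input=user_input, context=context)

def test_prompt_contains_user_input_and_context():
    prompt = build_prompt("Where is my order?", "Orders ship in 2 days.")
    assert "Where is my order?" in prompt
    assert "Orders ship in 2 days." in prompt

def test_template_fails_loudly_when_a_placeholder_value_is_missing():
    # Guard against template edits that silently drop a required placeholder value.
    with pytest.raises(KeyError):
        PROMPT_TEMPLATE.format(user_input="hi")  # no value supplied for {context}
```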
Applying CI to gen AI systems, however, comes with challenges. For example, it is difficult to generate comprehensive test cases: the complex and open-ended nature of gen AI outputs makes it hard to define and create an exhaustive set of test cases that cover all possibilities.
These challenges are closely related to the broader question of how to evaluate gen AI
systems. Many of the same techniques discussed in the Evaluation section above can also
be applied to the development of CI systems for gen AI. This is an ongoing area of research,
however, and more techniques will undoubtedly emerge in the near future.
Once the code is merged, a continuous delivery process begins to move the built and tested
code through environments that closely resemble production for further testing before the
final deployment.
As mentioned in the "Develop and Experiment" segment, chain elements become one of the main components to deploy, as they fundamentally constitute the gen AI application serving users.
The delivery process of the gen AI application containing the chain may vary depending on
the latency requirements and whether the use case is batch or online:
1. Batch use cases require deploying a batch process executed on a schedule in production.
The delivery process should focus on testing the entire pipeline in integration in an
environment close to production before deployment. As part of the testing process, developers can assert specific requirements around the throughput of the batch process itself and check that all components of the application are functioning correctly (e.g., permissioning, infrastructure, code dependencies).
2. Online use cases require deploying an API, in this case, the application containing the
chain, capable of responding to users at low latency. The delivery process should involve
testing the API in integration in an environment close to production, with tests to assert
that all components of the application are functioning correctly (e.g., permissioning,
infrastructure, code dependencies). Non-functional requirements (e.g., scalability,
reliability, performance) can be verified through a series of tests, including load tests.
Because foundation models are so large and complex, deployment and serving of these
models raises a number of issues – most obviously, the compute and storage resources
needed to run these massive models successfully. At a minimum, a foundation model
deployment needs to include several key considerations: selecting and securing necessary
compute resources, such as GPUs or TPUs; choosing appropriate data storage services
like BigQuery or Google Cloud Storage that can scale to deal with the large datasets; and
implementing model optimization or compression techniques.
Infrastructure validation
One technique that can be applied to address the resource requirements of gen AI systems is
infrastructure validation. This refers to the introduction of an additional verification step, prior
to deploying the training and serving systems, to check both the compatibility of the model
with the defined serving configuration and the availability of the required hardware. There
are a number of optional infrastructure validation layers that can perform some of these
checks automatically. For instance, TFX19 has an infrastructure validation layer that checks
whether the model will run correctly on a specified hardware configuration, which can help
catch configuration issues before deployment. Nevertheless, the availability of the required
hardware still needs to be verified by hand by the engineer or the system administrator.
Some techniques for model compression and optimization include quantization, distillation
and model pruning. Quantization reduces the size and computational requirements of the
model by converting its weights and activations from higher-precision floating-point numbers
to lower-precision representations, such as 8-bit integers or 16-bit floating-point numbers.
This can significantly reduce the memory footprint and computational overhead of the model.
Model pruning is a technique that eliminates unnecessary weight parameters or selects only the important subnetworks within the model. This reduces model size while keeping accuracy as high as possible. Finally, distillation trains a smaller model, using the responses
generated by a larger LLM, to reproduce the output of the larger LLM for a specific domain.
This can significantly reduce the amount of training data, compute, and storage resources
needed for the application.
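To make the quantization technique above concrete, the sketch below loads an open-weights model with 8-bit quantized weights using the Hugging Face Transformers and bitsandbytes libraries; the model name is a placeholder, and this is only one of several ways to quantize a model.

```python
# A hedged sketch: load an open model with 8-bit weights to cut memory use.
# Requires the transformers, accelerate, and bitsandbytes packages and a GPU;
# the model id is a placeholder for an open model such as those in Model Garden.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # 8-bit integer weights

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available accelerators
)

inputs = tokenizer("Summarize: the cat sat on the mat.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```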
In certain situations, model distillation can also improve the performance of the model itself
in addition to reducing resource requirements. This happens because the smaller model can
combine the knowledge of the larger model with labeled data, which can help it to generalize
better to new data on a limited use case. The process of distillation usually involves training a large foundational LLM (the teacher model) and having it generate responses to certain tasks, and then having the smaller LLM (the student model) learn from a combination of the teacher's knowledge and a task-specific supervised dataset. The size and complexity of the smaller
LLM can be adjusted to achieve the desired trade-off between performance and resource
requirements. A technique known as step-by-step distillation20 has proven to achieve
great results.
Following are the important steps to take when deploying a model on Vertex AI; a hedged code sketch covering some of these steps appears after the checklist.
□ Configure version control: Implement version control practices for LLM deployments.
This allows you to roll back to previous versions if necessary and track changes made to
the model or deployment configuration.
□ Optimize the model: Perform any model optimization (distillation, quantization, pruning,
etc.) before packaging or deploying the model.
□ Containerize the model: Package the trained LLM model into a container.
□ Define model endpoint: Define the endpoint configuration using Vertex AI's endpoint
creation interface or the Vertex AI SDK. Specify the model container, input and output
formats, and any additional configuration parameters.
□ Allocate resources: Allocate the appropriate compute resources for the endpoint based
on the expected traffic and performance requirements.
□ Create model endpoint: Create a Vertex AI endpoint to deploy21 the LLM as a REST API service. This allows clients to send requests to the endpoint and receive responses from the LLM.
□ Configure monitoring and logging: Establish monitoring and logging systems to track
the endpoint's performance, resource utilization, and error logs.
□ Deploy custom integrations: Integrate the LLM into custom applications or services
using the model's SDK or APIs. This provides more flexibility for integrating the LLM into
specific workflows or frameworks.
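The following is a hedged sketch of the endpoint-related steps in the checklist above, using the Vertex AI Python SDK; the serving container image, artifact location, and machine configuration are placeholders to be adjusted to your model's actual requirements.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project

# Register the containerized model (serving image + artifacts) in the Model Registry.
model = aiplatform.Model.upload(
    display_name="my-tuned-llm",
    serving_container_image_uri="us-docker.pkg.dev/my-project/serving/llm-server:latest",
    artifact_uri="gs://my-bucket/tuned-llm/",  # placeholder artifact location
)

# Create an endpoint and deploy the model with explicit compute resources.
endpoint = aiplatform.Endpoint.create(display_name="my-llm-endpoint")
model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-12",   # placeholder machine type
    accelerator_type="NVIDIA_L4",    # placeholder accelerator
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=2,
)

# Clients can now send REST requests to this endpoint and receive model responses.
print(endpoint.resource_name)
```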
Logging is necessary for applying monitoring and debugging on your gen AI system in
production. An input to the application triggers multiple components. Imagine the output for a given input is factually inaccurate: how can you find out which of the components didn't perform well? To answer this question, it is necessary to apply logging at both the application level and the component level. We need lineage in our logging for all
components executed. For every component we need to log their inputs and outputs. We
also need to be able to map those with any additional artifacts and parameters they depend
on so we can easily analyze those inputs and outputs.
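As a simple sketch of component-level logging with lineage, the snippet below tags every component's inputs and outputs with a shared trace id using the standard Python logging module; the component names and payloads are illustrative.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("genai-app")

def log_component(trace_id: str, component: str, inputs: dict, outputs: dict) -> None:
    """Log each chain component's inputs and outputs, keyed by a shared trace id."""
    logger.info(json.dumps({
        "trace_id": trace_id,
        "component": component,
        "inputs": inputs,
        "outputs": outputs,
    }))

trace_id = str(uuid.uuid4())  # one id per application-level request
log_component(trace_id, "retriever", {"query": "return policy"}, {"doc_ids": ["doc_17"]})
log_component(trace_id, "prompted_model", {"prompt_version": "v3"}, {"answer": "30 days."})
```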
Monitoring can be applied to the overall gen AI application and to individual components. We
prioritize monitoring at the application level. This is because if the application is performant
and monitoring proves that, it implies that all components are also performant. You can also
apply the same practices to each of the prompted model components to get more granular
results and understanding of your application.
Skew detection in traditional ML systems refers to training-serving skew that occurs when
the feature data distribution in production deviates from the feature data distribution
observed during model training. In the case of Gen AI systems using pretrained models in
components chained together to produce the output, we need to modify our approach. We
can measure skew by comparing the distribution of the input data we used to evaluate our
application (the test set as described under the Data Curation and Principles section above)
and the distribution of the inputs to our application in production. Once the two distributions
drift apart, further investigation is needed. The same process can be applied to the output
data as well.
Like skew detection, the drift detection process checks for statistical differences between
two datasets. However, instead of comparing the evaluation dataset against serving inputs, drift looks for changes in the production input data over time. This allows you to check how the inputs, and therefore the behavior of your users, change over time. This is the same as in traditional MLOps.
Given that the input to the application is typically text, there are a few approaches to
measuring skew and drift. In general all the methods are trying to identify significant
changes in production data, both textual (size of input) and conceptual (topics in input),
when compared to the evaluation dataset. All these methods are looking for changes that
could potentially indicate the application might not be prepared to successfully handle the
nature of the new data that are now coming in. Some common approaches are calculating
embeddings and distances, counting text length and number of tokens, and tracking
vocabulary changes, new concepts and intents, prompts and topics in datasets, as well
as statistical approaches such as least-squares density difference,22 maximum mean
discrepancy (MMD),23 learned kernel MMD,24 or context-aware MMD.25 As gen AI use cases
are so diverse, it is often necessary to create additional custom metrics that better capture
abnormal changes in your data.
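As one example of the embeddings-and-distances approach, the sketch below embeds the evaluation inputs and a sample of production inputs with a Vertex embedding model and compares the centroids of the two distributions; the threshold is illustrative, and an MMD-style statistic could be substituted for the cosine distance.

```python
import numpy as np
from vertexai.language_models import TextEmbeddingModel

embedder = TextEmbeddingModel.from_pretrained("textembedding-gecko")  # assumed model

def embed(texts: list[str]) -> np.ndarray:
    """Embed a small batch of texts with a Vertex embedding model."""
    return np.array([e.values for e in embedder.get_embeddings(texts)])

def centroid_cosine_distance(eval_texts: list[str], prod_texts: list[str]) -> float:
    """Compare the mean embedding of evaluation inputs vs. production inputs."""
    a, b = embed(eval_texts).mean(axis=0), embed(prod_texts).mean(axis=0)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

eval_inputs = ["Where is my order?", "How do I return an item?"]       # evaluation set
prod_inputs = ["Can I pay with crypto?", "Is the API rate limited?"]   # production sample

distance = centroid_cosine_distance(eval_inputs, prod_inputs)
if distance > 0.1:  # illustrative threshold; tune per use case
    print(f"Possible skew or drift detected (distance={distance:.3f}); investigate further.")
```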
As with traditional monitoring in MLOps, an alerting process should be deployed to notify application owners when drift, skew, or performance decay on evaluation tasks is
detected. This can help you promptly intervene and resolve issues. This is achieved by
integrating alerting and notification tools into your monitoring process.
Monitoring expands beyond drift, skew and evaluation tasks. Monitoring in MLOps includes
efficiency metrics like resource utilization and latency. Efficiency metrics are as relevant and
important in gen AI as they are in any other AI application.
Vertex AI provides a set of tools that can help with monitoring. Model Evaluation for gen AI26 can be used for classification, summarization, question answering, and text generation tasks. Vertex Pipelines can be used to allow the recurrent execution of evaluation jobs in
production as well as running pipelines for skew and drift detection processes.
Govern
In the context of MLOps, governance encompasses all the practices and policies that
establish control, accountability, and transparency over the development, deployment, and
ongoing management of machine learning (ML) models, including all the activities related to
the code, data and models lifecycle.
As mentioned in the Develop & Experiment section, the chain and its components become a new type of asset that needs to be governed over the full lifecycle, from development to deployment to monitoring.
The governance of the chain element lifecycle extends to lineage tracking practices as well.
While for predictive AI systems lineage focuses on tracking and understanding the complete
journey of a machine learning model, in gen AI, lineage goes beyond the model artifact
extending to all the components in the chain. This includes the data and models used and their lineage, the code involved, and the related evaluation data and metrics. This can help with auditing, debugging, and improving the models.
Along with these new practices, existing MLOps and DevOps practices still apply to MLOps
for gen AI:
• The need to govern the tuned model lifecycle; see “Tuning and Training”.
The next segment will introduce a set of products that allow developers to perform
governance of the data, model, and code assets. We will discuss products like Google Cloud Dataplex, which centralizes the governance of models and data, and Vertex ML Metadata and Vertex AI Experiments, which allow developers to register experiments, their metrics, and artifacts.
At the heart of an AI platform lies its ability to support diverse AI development needs.
Whether you seek to utilize pre-trained AI solutions, adapt existing models through tuning
or transfer learning, or embark on training your own large models, AI platforms provide the
infrastructure and tools necessary to support these journeys. The advent of these platforms
has revolutionized the way organizations approach AI, enabling them to productionize AI
applications in a secure, enterprise-ready, responsible, controlled and scalable manner.
These platforms accelerate innovation as well as foster reproducibility and collaboration
while reducing costs and maximizing Return on Investment (ROI).
The new gen AI paradigm discussed in prior sections demands a robust and reliable AI
platform that can seamlessly integrate and orchestrate a wide range of functionalities.
These functionalities include model tuning for specific tasks; leveraging paradigms like
retrieval augmented generation3 (RAG) to connect to internal and external data sources;
and pre-training or instruction fine-tuning large models from scratch. Complex applications
also often require chaining with other models, such as classifiers to route inputs to the
appropriate LLM/ML model, extraction of customer information from a knowledge base,
inclusion of safety checks, or even creation of caching systems for cost optimization.
Vertex AI eliminates the complexities of managing the entire infrastructure required for AI
development and deployment. Instead, Vertex AI offers a user-centric approach, providing
on-demand access to the needed resources. This flexibility empowers organizations to
focus on innovation and collaboration rather than infrastructure management and up-front hardware purchases. The features of Vertex AI that support gen AI development can be
grouped into eight areas.
As discussed before, there is already a wide variety of available foundation models, trained
on a broad range of datasets, and the cost of training a new foundation model can be
prohibitive. Thus it often makes sense for companies to adapt existing foundation models
rather than creating their own from scratch. As a result, a platform facilitating seamless
discovery and integration of diverse model types is critical.
Vertex AI Model Garden1 supports these needs, offering a curated collection of over
150 Machine Learning and gen AI models from Google, Google partners, and the open-
source community. It simplifies the discovery, customization, and deployment of both
Google’s proprietary foundational models and diverse open-source models across a
vast spectrum of modalities, tasks, and features. This comprehensive repository permits
developers to leverage the collective research on artificial intelligence models within a single
streamlined environment.
Model Garden encompasses a diverse range of modalities such as Language, Vision, Tabular,
Document, Speech, Video, and Multimodal data. This broad coverage enables developers
to tackle a multitude of tasks, including generation, classification, regression, extraction,
recognition, segmentation, tracking, translation, and embedding. Model Garden houses
Google’s proprietary and foundational models (like Gemini,27 PaLM 2,28 Imagen29) alongside
numerous popular open-source and third-party partner models like Llama 3,30 T5 Flan,31
BERT,32 Stable Diffusion,33 Claude 3 (Anthropic),34 and Mistral AI.35 Additionally, it offers task-
specific models for occupancy analysis, watermark detection, text-moderation, text-to-video,
hand-gesture recognition, product identification, and tag recognition, among others. Every
model36 in Vertex Model Garden has a model card which includes a description of the model,
the main use cases it can cover, and the option (if available) to tune the model or deploy
it directly.
Table 2. An overview of some of the models in Model Garden [Last updated: March 18, 2024]
Rapid development and prototyping capabilities are also essential for developing gen AI
applications. Vertex AI prioritizes inclusivity and flexibility in its development environments,
catering to a wide range of developer preferences and proficiency levels. This platform
provides options for both console-driven and programmatic development workflows. Users
can leverage the intuitive web interface for end-to-end application creation or utilize various
APIs for deeper customization and control. These include the REST API56 and dedicated
SDKs for Python,57 NodeJS58 and Java,59 ensuring compatibility with diverse programming
languages and ecosystems. Developers can choose to use the tools and IDEs of their
choice for interacting with the platform, or take advantage of Vertex-native tools like Vertex
Colab Enterprise or Vertex Workbench to explore and experiment with code within familiar
notebook environments.
Vertex AI Studio60 provides a unified console-driven entry point to access and leverage the
full spectrum of Vertex AI's gen AI services. It facilitates exploration and experimentation with
various Google first-party foundation models (for example, PaLM 2, Gemini, Codey, Imagen,
and Universal Speech Model). Additionally, it offers prompt examples and functionalities
for testing distinct prompts and models with diverse parameters. It’s also possible to adapt
existing models through various techniques like supervised fine-tuning (SFT), reinforcement learning tuning, and distillation, and to deploy gen AI applications in just a few
clicks. Vertex AI Studio considerably simplifies and democratizes gen AI adoption, catering
to a variety of users, from business analysts to machine learning engineers. You can see the
homepage of Vertex AI Studio in Figure 13.
While prompt engineering and augmentation are sufficient for some gen AI use cases, other
cases require training, tuning and adapting the models to get the best results. Vertex AI
provides a comprehensive platform for training and adapting LLMs, supporting a range of
techniques and approaches from prompt engineering to training models from scratch.
Train
For full-scale LLM training, TPUs and GPUs are vital because of their superior processing
power and memory capacity compared to CPUs. GPUs excel at parallel processing, enabling
faster model training. TPUs, specifically designed for machine learning tasks, offer even
faster processing and higher energy efficiency. This makes them ideal for large-scale,
complex models. Google Cloud provides a range of offerings to support LLM training,
including TPU VMs with various configurations, pre-configured AI platforms like Vertex AI,
and dedicated resources like Cloud TPU Pods for scaling up training. These offerings allow
users to choose the right infrastructure for their needs, accelerating LLM development and
enabling cutting-edge research and applications.
Tune
Vertex AI also provides a comprehensive solution for adapting pre-trained LLMs. It supports
a spectrum of techniques from a non-technical prompt engineering playground at inference
time, to data-driven approaches involving tuning, reinforcement learning and distillation
methods during the development or adaptation phase. These techniques – many of which are unique to Vertex AI – can be explored and implemented effectively on the platform. This applies to both proprietary and open-source LLMs, allowing you to achieve superior results while optimizing for cost and latency requirements.
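For supervised fine-tuning specifically, a hedged sketch using the Vertex AI SDK's tuning module might look like the following; the source model, the Cloud Storage path to the JSONL training data, and the display name are placeholders, and the exact module path can vary across SDK versions.

```python
import time

import vertexai
from vertexai.tuning import sft  # module path may differ across SDK versions

vertexai.init(project="my-project", location="us-central1")  # placeholder project

# Launch a managed supervised fine-tuning job on a Google foundation model.
tuning_job = sft.train(
    source_model="gemini-1.0-pro-002",                   # placeholder base model
    train_dataset="gs://my-bucket/tuning/train.jsonl",   # prompt/response pairs
    tuned_model_display_name="support-assistant-sft-v1",
)

# Poll until the managed tuning pipeline finishes.
while not tuning_job.has_ended:
    time.sleep(60)
    tuning_job.refresh()

print(tuning_job.tuned_model_endpoint_name)  # endpoint serving the tuned model
```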
Orchestrate
Any training or tuning job you run can be orchestrated and then operationalized using Vertex
Pipelines,13 a service that aims to simplify and automate the deployment, management, and
scaling of your ML workflows.
It provides a platform for building, orchestrating, scheduling and monitoring complex and
custom ML pipelines, enabling you to efficiently translate your models from prototypes
to production.
Vertex Pipelines is also the platform behind all the managed tuning and evaluation services
for the Google Foundation Models on Vertex AI. This ensures consistency as you can
consume and extend those pipelines easily, without having to familiarize yourself with
many services.
Getting started with Vertex Pipelines is simple: you define the pipeline's step sequence in a Python file using the Kubeflow Pipelines SDK.67 For further details and comprehensive onboarding,
consult the official documentation.68
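A minimal sketch of that workflow, assuming the Kubeflow Pipelines (kfp) v2 SDK together with the Vertex AI SDK, is shown below; the component bodies, bucket paths, and names are placeholders rather than a working tuning pipeline.

```python
from google.cloud import aiplatform
from kfp import compiler, dsl

@dsl.component
def prepare_tuning_data(output_path: str) -> str:
    # Placeholder step: write prompt/response pairs to Cloud Storage.
    return output_path

@dsl.component
def run_tuning(train_data: str) -> str:
    # Placeholder step: launch a tuning job and return the tuned model name.
    return "tuned-model"

@dsl.pipeline(name="llm-tuning-pipeline")
def tuning_pipeline(output_path: str = "gs://my-bucket/tuning/train.jsonl"):
    data_step = prepare_tuning_data(output_path=output_path)
    run_tuning(train_data=data_step.output)

# Compile the pipeline definition, then submit it to Vertex Pipelines.
compiler.Compiler().compile(tuning_pipeline, "tuning_pipeline.json")

aiplatform.init(project="my-project", location="us-central1")  # placeholder project
aiplatform.PipelineJob(
    display_name="llm-tuning-pipeline",
    template_path="tuning_pipeline.json",
    pipeline_root="gs://my-bucket/pipeline-root",  # placeholder staging location
).run()
```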
Beyond training, tuning and adapting models and prompts directly, Vertex AI offers a
comprehensive ecosystem for augmenting LLMs, to address the challenges of factual
grounding and hallucination. The platform incorporates emerging techniques like RAG and
agent-based approaches.
RAG overcomes limitations by enriching prompts with data retrieved from vector databases,
circumventing pre-training requirements and ensuring the integration of up-to-date
information. Agent-based approaches, popularized by ReAct prompting, leverage LLMs as
mediators interacting with tools like RAG systems, APIs, and custom extensions. Vertex AI
facilitates this dynamic information source selection, enabling complex queries, real-time
actions, and the creation of multi-agent systems connected to vast information networks for
sophisticated query processing and real-time decision-making.
Vertex AI Grounding5 helps users connect large models with verifiable information by
grounding them to internal data corpora on Vertex AI Agent Builder70 or external sources
using Google Search. This enables two key functionalities: verifying model-generated outputs
against internal or external sources and creating RAG systems using Google’s advanced
search capabilities that produce quality content grounded in your own or web search data.
Vertex AI extensions6 let developers integrate Vertex Foundation Models with real-time
data and real-world actions through APIs and functions, enabling task execution and allowing
enhanced capabilities. This extends to leveraging 1st party extensions like Vertex AI Search7
and Code Interpreter,71 or 3rd party extensions for triggering and completing transactions.
Imagine building an application that leverages the LLM's knowledge to plan a trip and
seamlessly utilizes internal APIs to book hotels and flights, all within a single interface.
Additionally, Vertex Extensions facilitate function calling with the gemini-pro model, letting you define function descriptions, pass them to the model, receive JSON with the function name and arguments, and call the function automatically (see the sketch below).
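A minimal sketch of this function-calling flow with the Vertex AI Python SDK might look as follows; the function schema, model name, and prompt are illustrative assumptions rather than a fixed recipe.

```python
from vertexai.generative_models import FunctionDeclaration, GenerativeModel, Tool

# Describe a callable function to the model; the schema below is a hypothetical example.
get_weather = FunctionDeclaration(
    name="get_current_weather",
    description="Get the current weather for a city",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
)
weather_tool = Tool(function_declarations=[get_weather])

model = GenerativeModel("gemini-1.0-pro")
response = model.generate_content(
    "What is the weather like in Paris right now?", tools=[weather_tool]
)

# The model returns a structured function call instead of free text.
function_call = response.candidates[0].content.parts[0].function_call
print(function_call.name, dict(function_call.args))
# The application then invokes the real API with these arguments and can pass the
# result back to the model to produce the final user-facing answer.
```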
Vertex AI Agent Builder70 is an out-of-the-box solution that allows you to quickly build gen
AI agents, to be used as conversational chatbots or as part of a search engine. With Vertex
AI Agent Builder, you can easily ground your agents by pointing to a diverse range of data
sources, including structured datastores such as BigQuery, Spanner, and Cloud SQL,
unstructured sources like crawled website content and Cloud Storage, as well as connectors
to Google Drive and other APIs. Agent Builder utilizes a robust foundation of Google Search
technologies, encompassing semantic search, content chunking, ranking algorithms,
and user intent understanding. Under the hood it optimizes document loading, chunking,
embedding models, and ranking strategies. It abstracts away these complexities and allows
users to simply specify their data source to initiate the gen AI-powered agent. This approach
is ideal for organizations seeking to build robust search experiences for standard use cases
without extensive technical expertise.
Vector databases are specialized systems for managing multi-dimensional data. This data,
encompassing images, text, audio, video, and other structured or unstructured formats,
is represented as vectors capturing its semantic meaning. Vector databases accelerate
searching and retrieval within these high-dimensional spaces, enabling efficient tasks like
finding similar images among billions or extracting relevant text snippets based on various
inputs. For a deeper dive into these topics, refer to 4 and 19. Vertex AI offers three flexible
solutions for storing and serving embeddings at scale, catering to diverse use cases and
user profiles.
Vertex AI Vector Search7 is a fully managed, highly scalable, low-latency similarity search and
vector database that auto-scales to billions of vector embeddings. This technology, built upon
ScaNN72 (a Google-developed technology used in products like Search, YouTube, and Play),
allows you to search billions of items in your stored data for semantically similar or related
results. In the context of gen AI, the most common use cases
where Vertex Vector Search can be used are:
1. Finding similar items (either text or image) based solely on their semantic meaning, in
conjunction with an embedding model.
2. Creating a hybrid search approach that combines semantic and keyword or metadata
search to refine the results.
3. Extracting relevant information from the database to feed into LLMs, enabling them to
generate more accurate and informed responses.
Vertex AI Vector Search primarily functions as a vector database for storing pre-generated
embeddings. These embeddings must be created beforehand using separate models like
Vertex Embedding models73 (namely textembedding-gecko, text-embedding-gecko-
multilingual, or multimodalembedding). Choosing Vertex Vector Search is optimal
when you require control over aspects like the chunking, retrieval, query, and model strategy.
This includes fine-tuning an embedding model for your specific data. However, if your use
case is a standard one requiring little customization, a readily available solution like Vertex
Search might be a better choice.
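For illustration, a retrieval step against an already deployed Vector Search index might look roughly like the sketch below; the project, endpoint resource name, deployed index ID, and query are placeholders.

```python
from google.cloud import aiplatform
from vertexai.language_models import TextEmbeddingModel

aiplatform.init(project="your-project", location="us-central1")  # hypothetical project

# 1. Embed the query with a Vertex embedding model (here textembedding-gecko).
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko")
query_vector = embedding_model.get_embeddings(["How do I reset my password?"])[0].values

# 2. Query the deployed Vector Search index for the nearest neighbors.
index_endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="projects/123/locations/us-central1/indexEndpoints/456"  # placeholder
)
neighbors = index_endpoint.find_neighbors(
    deployed_index_id="my_deployed_index",  # placeholder
    queries=[query_vector],
    num_neighbors=5,
)

# 3. Map the returned IDs back to source documents to build the LLM prompt context.
for neighbor in neighbors[0]:
    print(neighbor.id, neighbor.distance)
```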
Vertex AI Feature Store74 is a centralized and fully managed repository for ML features
and embeddings. It enables teams to share, serve, and reuse machine learning features and
embeddings effortlessly alongside other data. Its native BigQuery23 integration eliminates
duplication, simplifies lineage tracking and preserves data governance. Vertex AI Feature
Store supports offline retrieval and fast, easy online serving of machine learning
features and embeddings. Vertex AI Feature Store is a good choice when you want to iterate
and maintain different embedding versions alongside other machine learning features in a
single place.
Vertex AI offers the flexibility to seamlessly create and connect various products to build
your own custom grounding, RAG, and Agent systems. This includes utilizing diverse
embedding models (multimodal, multilingual), various vector stores (Vector Search, Feature
Store) and search engines like Vertex AI Agent Builder, extensions, grounding, and even SQL
query generation for complex natural language queries. Moreover, Vertex AI provides SDK
integration with LangChain9 to easily build and prototype applications using the umbrella
of Vertex AI products. For further details and integration information, consult the official
documentation75 and official examples.76
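As a minimal prototyping sketch, a simple retrieval-augmented chain can be assembled with the langchain-google-vertexai integration; the local FAISS store (from langchain_community, requiring the faiss-cpu package), the document snippets, and the model names are illustrative assumptions for quick iteration rather than a production setup.

```python
from langchain_community.vectorstores import FAISS  # in-memory store for prototyping
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings

# Embed and index a handful of documents locally while iterating on the prompt.
embeddings = VertexAIEmbeddings(model_name="textembedding-gecko")
store = FAISS.from_texts(
    [
        "Our return policy allows refunds within 30 days of purchase.",
        "Customer support is available 24/7 via chat.",
    ],
    embedding=embeddings,
)
retriever = store.as_retriever(search_kwargs={"k": 2})

llm = ChatVertexAI(model_name="gemini-1.0-pro")

question = "How long do customers have to request a refund?"
context = "\n".join(doc.page_content for doc in retriever.invoke(question))
answer = llm.invoke(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```

Once the prompt and retrieval logic are settled, the in-memory store can be swapped for Vector Search or Feature Store and the chain promoted to a managed, orchestrated workload.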
In the dynamic world of gen AI, experimentation and evaluation are the cornerstones of
iterative development and continuous improvement. With a multitude of variables influencing
gen AI models (prompt engineering, model selection, data interaction, pretraining,
and tuning), evaluation goes hand-in-hand with experimentation. The more seamlessly
experiments and evaluations can be integrated into the development process, the
smoother and more efficient the overall development becomes. Vertex AI provides cohesive
experimentation and evaluation products permitting connected iterations over applications
and models alongside their evaluations.
Experiment
The process of selecting, creating, and customizing machine learning models (including large
models) and their applications involves significant experimentation, collaboration, and iteration.
Vertex AI also provides two tools for tracking and visualizing the output of many experiment
cycles and training runs. Vertex AI Experiments79 facilitates meticulous tracking and
analysis of model architectures, hyperparameters, and training environments. It logs
experiments, artifacts, and metrics, enabling comparison and reproducibility across multiple
runs. This comprehensive tracking permits data scientists to select the optimal model
and architecture for their specific use case. Vertex AI TensorBoard80 complements the
experimentation process by providing detailed visualizations for tracking, visualizing, and
sharing ML experiments. It offers a range of visualizations, including loss and accuracy
metrics tracking, model computational graph visualization, and weight and bias histograms,
which - for example - can be used for tracking various metrics pertaining to training and
evaluation of gen AI models with different prompting and tuning strategies. It also projects
embeddings to lower-dimensional space, and displays image, text, and audio samples.
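For example, logging the parameters and metrics of a prompting or tuning run with Vertex AI Experiments might look roughly like the following sketch; the project, experiment name, parameters, and metric values are illustrative placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(
    project="your-project",             # hypothetical project
    location="us-central1",
    experiment="prompt-strategy-eval",  # hypothetical experiment name
)

aiplatform.start_run("gemini-few-shot-v1")

# Record the configuration being tested ...
aiplatform.log_params({
    "model": "gemini-1.0-pro",
    "prompt_strategy": "few-shot",
    "temperature": 0.2,
})

# ... and the evaluation metrics produced for it, so runs can be compared later.
aiplatform.log_metrics({"groundedness": 0.87, "fluency": 0.92})

aiplatform.end_run()
```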
Evaluation
Vertex AI also provides a comprehensive set of evaluation tools for gen AI, from ground truth
metrics to using LLMs as raters.
For ground truth-based metrics, Automatic Metrics in Vertex AI81 lets you evaluate a model
based on a defined task and “ground truth” dataset. For LLM-based evaluation, Automatic
Side by Side (Auto SxS) in Vertex AI82 uses a large model to evaluate the output of multiple
models or configurations being tested, helping to augment human evaluation at scale.
In addition, users can leverage the Rapid Evaluation API, which offers a set of pre-built metrics for evaluating gen AI applications together with an SDK, integrated into the Vertex AI Python SDK for rapid, flexible, notebook-based prototyping. To get started with the Rapid Evaluation SDK, see the example in the official documentation.83
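A minimal sketch of such a notebook-based evaluation is shown below, assuming the preview Rapid Evaluation SDK (vertexai.preview.evaluation); the dataset, metric names, prompt template, model, and experiment name are illustrative, and the preview API surface may change.

```python
import pandas as pd
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="your-project", location="us-central1")  # hypothetical project

# A tiny illustrative evaluation dataset; real datasets would contain many examples.
eval_dataset = pd.DataFrame({
    "context": ["Refunds are available within 30 days of purchase."],
    "instruction": ["How long do customers have to request a refund?"],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["groundedness", "fluency"],  # example pre-built metrics
    experiment="rapid-eval-demo",         # hypothetical experiment name
)

result = eval_task.evaluate(
    model=GenerativeModel("gemini-1.0-pro"),
    prompt_template="Answer based on the context: {context}\nQuestion: {instruction}",
)
print(result.summary_metrics)
```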
Once developed, a production gen AI application must be deployed, including all its model
components. If the application uses any models that have been trained or adapted, those
models need to be deployed to their own serving endpoints. You can serve any model in the
Model Garden through Vertex AI Endpoints,21 which act as the gateway for deploying your
trained machine learning models. Endpoints allow you to serve online predictions with low
latency, manage access controls, and easily monitor model performance through Model Monitoring.
Endpoints also offer scaling options to handle varying traffic demands, ensuring optimal user
experience and reliability.
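As an illustration, deploying a registered model to an endpoint and requesting an online prediction with the Vertex AI Python SDK might look like the following sketch; the project, model resource name, machine type, and request payload are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")  # hypothetical project

# Reference a model already registered in the Vertex AI Model Registry (placeholder name).
model = aiplatform.Model("projects/123/locations/us-central1/models/456")

# Deploy the model to an endpoint; replica counts let the endpoint scale with traffic.
endpoint = model.deploy(
    deployed_model_display_name="my-tuned-model",
    machine_type="n1-standard-8",  # choose GPU machine types for larger models
    min_replica_count=1,
    max_replica_count=3,
)

# Request an online prediction; the payload format depends on the deployed model.
prediction = endpoint.predict(instances=[{"prompt": "Summarize this support ticket ..."}])
print(prediction.predictions)
```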
Along with the prediction service, Vertex AI offers the following features for all Google-managed models:
• Citation checkers: Gen AI on Vertex AI performs citation checks71. Citations are important for LLMs and gen AI for several reasons: citing sources ensures proper acknowledgment, prevents plagiarism, and demonstrates transparency and accountability. Citations also help identify and understand potential biases and enable reproducibility and verification. For example, in Google Cloud,84 the gen AI models are designed to produce original content, limiting the possibility of copying existing content; if copying does occur, Google Cloud provides citations for the source websites and code repositories.
• Safety scores: Safety attributes are crucial for LLMs and gen AI to mitigate potential risks like bias, lack of explainability, and misuse. These attributes help detect biased outputs and curb misuse, enabling these tools to be used responsibly. As LLMs and gen AI evolve, incorporating safety attributes will be increasingly essential for responsible and ethical use. For example, Google Cloud added safety scores to the Vertex AI PaLM API and Vertex AI Gemini API85: content processed through the API is checked against a list of safety attributes, including harmful categories and sensitive topics. Each attribute has a confidence score between 0.0 and 1.0, indicating the likelihood of the input belonging to that category. These safety filters can be used with all models, whether proprietary ones like PaLM 2 and Gemini or OSS ones like those available in Model Garden (see the sketch after this list).
• Content moderation and bias detection: By using the content moderation88 and bias89 detection tools on Vertex AI, you can add an extra layer of protection to LLM responses, mitigating the risk that model training and tuning have swayed the model toward outputs that aren't fair or appropriate for the task.
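As a sketch of how the safety scores above surface in the Vertex AI Gemini API, the following example configures a blocking threshold and reads the per-category ratings returned with a response; the model name, prompt, and threshold are illustrative.

```python
from vertexai.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
)

model = GenerativeModel("gemini-1.0-pro")
response = model.generate_content(
    "Write a short product description for a kitchen knife.",
    safety_settings={
        # Block only responses with a high probability of dangerous content.
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    },
)

# Each candidate carries per-category safety ratings that can be logged or acted upon.
for rating in response.candidates[0].safety_ratings:
    print(rating.category, rating.probability)
```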
Addressing the multifaceted requirements of data and model lineage and governance in
gen AI requires a comprehensive strategy that tackles both conventional challenges and
novel regulatory or technical complexities associated with large models. By adopting robust
governance, observability, and lineage practices in the development of gen AI solutions,
organizations can ensure comprehensive tracking, iteration, and evolution of data. They
can also track the large models used, prompt adaptations, tuning, and other artifacts. This
facilitates reproducibility of results, transparency and understanding of generated content
sources, troubleshooting, compliance enforcement, and enhanced reliability and security.
These practices collectively enable the ethical and responsible development and deployment
of gen AI solutions. This fosters internal and external trust and fairness in gen AI models and
practices. Vertex AI and Google Cloud offer the following comprehensive suite of tools for
unified lineage, governance and monitoring, effectively addressing these critical concerns.
• Monitor feature (prompt), embedding, and response drift, and identify potential issues proactively
• Store feature formulas and discover relevant features or embeddings for different
use cases
• Consolidate and unify all machine learning data within a singular repository encompassing numerical data, categorical data, textual data, and embedding representations
Dataplex integration permits users across an organization to identify ‘champion models’ and ‘golden
datasets and features’ across projects and regions in a secure way, by adhering to Identity and
Access Management (IAM)92 boundaries. In short, Dataplex encapsulates a framework within
an organization that governs the interaction between people, processes and technology
across all the products in Google Cloud.
Conclusion
The explosion of gen AI in the last several years introduced fundamental changes in the way
AI applications are developed – but far from upending the MLOps discipline, these changes
have only reinforced its basic principles and processes. As we have seen, the principles of
MLOps that emphasize reliability, repeatability, and dependability in ML systems development
are comfortably extended to include the innovations of gen AI. Some of the necessary
changes are deeper and more far-reaching than others, but nowhere do we find any change
that MLOps cannot accommodate.
As a result, many tools and processes built to support traditional MLOps can also support
the requirements of gen AI. Vertex AI, for instance, is a powerful platform that can be used to
build and deploy machine learning models and AI applications. It provides a comprehensive
suite of functions for developing both predictive AI and gen AI systems, encompassing data
preparation, pre-trained APIs, AutoML capabilities, training and serving hardware, advanced
fine-tuning techniques and deployment tools, and a diverse selection of proprietary and
open-source foundation models. It also offers evaluation methods, monitoring capabilities,
and governance tools, all unified within a single platform to streamline the AI development
lifecycle. It’s built on Google Cloud Platform, which provides a scalable, reliable, secure and
compliant infrastructure for machine learning. It’s a good choice for organizations that want
to build and deploy machine learning models and AI applications.
The next few years will undoubtedly see gen AI extended in directions that today are
unimaginable. Regardless of the direction these developments take, it will continue to
be important to build on solid engineering processes that embody the basic principles
of MLOps. These principles support the development of scalable, robust production AI
applications today, and no doubt will continue to do so into the future.
Endnotes
2. Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus. 2022. Emergent Abilities of Large Language Models. Available at: https://arxiv.org/pdf/2206.07682.pdf
3. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich
Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. 2022. Retrieval-Augmented
Generation for Knowledge-Intensive NLP Tasks. Available at: https://arxiv.org/pdf/2005.11401.pdf
4. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao. 2022. ReAct: Synergizing Reasoning and Acting in Language Models. Available at: https://arxiv.org/pdf/2210.03629.pdf
6. Vertex Extensions. Connect models to APIs by using extensions. Available at: https://cloud.google.com/vertex-ai/docs/generative-ai/extensions/overview
9. LangChain. Get your LLM application from prototype to production. Available at: https://www.langchain.com/
10. Introduction to the Vertex AI SDK for Python. Available at: https://cloud.google.com/vertex-ai/docs/python-sdk/use-vertex-ai-python-sdk
19. TFX is an end-to-end platform for deploying production ML pipelines. Available at: https://www.tensorflow.org/tfx
20. Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay
Krishna, Chen-Yu Lee, Tomas Pfister. 2023. Distilling Step-by-Step! Outperforming Larger Language Models with
Less Training Data and Smaller Model Sizes. Available at: https://arxiv.org/pdf/2305.02301.pdf
21. Vertex Endpoints. Use private endpoints for online prediction. Available at: https://cloud.google.com/vertex-ai/docs/predictions/using-private-endpoints
22. Tuan Duong Nguyen, Marthinus Christoffel du Plessis, Takafumi Kanamori, Masashi Sugiyama. 2014. Constrained Least-Squares Density-Difference Estimation. Available at: https://www.ms.k.u-tokyo.ac.jp/sugi/2014/CLSDD.pdf
23. Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola, 2012. A Kernel
Two-Sample Test. Available at: https://jmlr.csail.mit.edu/papers/v13/gretton12a.html
24. Oliver Cobb, Arnaud Van Looveren. 2022. Context-Aware Drift Detection. Available at: https://arxiv.org/pdf/2203.08644.pdf
27. Gemini Team, Google. 2023. Gemini: A Family of Highly Capable Multimodal Models. Available at: https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
28. Anil, Dai et al., 2023. PaLM 2 Technical Report. Available at: https://arxiv.org/abs/2305.10403
29. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed
Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David
J Fleet, Mohammad Norouzi, 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language
Understanding. Available at: https://arxiv.org/abs/2205.11487
30. Build the future of AI with Meta Llama 3. Available at: https://llama.meta.com/llama3
31. Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang,
Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun
Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang,
Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff
Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei. 2022. Scaling Instruction-Finetuned
Language Models. Available at: https://arxiv.org/abs/2210.11416
32. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2018. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. Available at: https://arxiv.org/abs/1810.04805
37. Vertex AI Studio. Customize and deploy generative models. Available at: https://cloud.google.com/generative-ai-studio
38. vLLM. Easy, fast, and cheap LLM serving for everyone. Available at: https://github.com/vllm-project/vllm
46. Translate docs, audio, and videos in real time with Google AI. Available at: https://cloud.google.com/translate
52. Hugging Face, 2024. Vision Transformer (ViT) Documentation. Hugging Face, [online] Available at:
https://huggingface.co/docs/transformers/en/model_doc/vit
53. Mingxing Tan, Quoc V. Le, 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.
Available at: https://arxiv.org/abs/1905.11946
55. Anthropic Claude 3 on Google Cloud Model Garden. Available at: https://cloud.google.com/blog/products/ai-machine-learning/announcing-anthropics-claude-3-models-in-google-cloud-vertex-ai
72. Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating Large-Scale Inference with Anisotropic Vector Quantization. Available at: https://arxiv.org/pdf/1908.10396.pdf
87. SynthID. Identifying AI-generated content with SynthID. Available at: https://deepmind.google/technologies/synthid/
89. Model bias metrics for Vertex AI. Available at: https://cloud.google.com/vertex-ai/docs/evaluation/model-bias-metrics