
CHAPTER 9

Data Fabric and Data Mesh for the AI Lifecycle
With the development of AI in technology and business, AI is no longer an experiment limited to a select few data scientists. It is penetrating all aspects of enterprise business operations and continues to drive innovation and optimization for new business scenarios. The focus is now shifting from competition over AI algorithms to combining the strength of expert teams and AI technology to meet the actual needs of the enterprise and its industry and to generate business value.
To put data and AI into production, enterprises not only need to create AI models but also operationalize AI workloads, which means practicing AI effectively and efficiently and, more importantly, doing so in a way that instills confidence in the outcome. Therefore, it is crucial to establish a verifiable full-lifecycle AI management system within the enterprise.
Let us dive into this topic by introducing the AI lifecycle.


Introduction to the AI Lifecycle


AI and software engineering have fundamental differences. Traditional software is rule-driven: the programmer translates the solution to a problem into clear logical rules. AI is more data-driven: it is based on training sets of data and a set of selected algorithms. Traditional software decomposes the problem into components, modules, and functions, down to individual lines of code. Each building block has a deterministic output, and all building blocks are orchestrated so that the system produces deterministic outputs for the same input.
AI works in a very different way. Without explicit rules defined, the AI model is trained to find an approximation of the optimal solution to the problem. The quality of the solution depends on the quality of the data used for training, the effectiveness of the selected features, and the sophistication of the algorithm. Training uses stepwise approximation plus a search strategy to find a set of parameters (the model) that minimizes the loss function.
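To make the idea of stepwise approximation concrete, the following is a minimal sketch of such a training loop, assuming a simple linear model and a mean-squared-error loss; the variable names and learning rate are illustrative, not taken from any specific framework:

import numpy as np

# Toy training set: inputs x and targets y for a linear relationship
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])  # true relationship: y = 2x + 1

w, b = 0.0, 0.0          # model parameters, initialized arbitrarily
learning_rate = 0.05     # step size of the approximation

for step in range(1000):
    y_pred = w * x + b                 # current model prediction
    error = y_pred - y
    loss = np.mean(error ** 2)         # mean squared error loss
    # Gradients of the loss with respect to each parameter
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Stepwise update: move the parameters against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, loss={loss:.6f}")

Each iteration nudges the parameters toward lower loss; no rule for computing y from x is ever written down explicitly.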
Therefore, AI engineering is a whole new field. According to the Gartner technology trends for 2022,¹ “AI engineering automates updates to data, models, and applications to streamline AI delivery. Combined with strong AI governance, AI engineering will operationalize the delivery of AI to ensure its ongoing business value.” In essence, AI engineering is a collection of methods, tools, and practices that expedites the entire AI lifecycle and ensures the efficient delivery of AI models that are robust, trustworthy, and interpretable and that continue to create value for the enterprise.

1 See Reference [1] for more information on the Gartner Top Strategic Technology Trends for 2022.


Figure 9-1. The Stages of the AI Lifecycle (understand business problems, collect data, prepare data, build AI models, deploy AI models, and monitor AI models, arranged around AI governance)

Although there are multiple reference models² of the AI lifecycle in the industry, most of them comprise the following common stages from conception to maintenance, as shown in Figure 9-1:

1. Understand the business problem: Data scientists learn from domain experts to understand business problems, research the necessity and feasibility of AI, and define key metrics of the AI project – not only performance metrics of the model itself but also service-level metrics³:

   a. Service latency: The time it takes to load models and prepare required features

   b. Inference latency: The time the model takes to make a prediction for a given input

2. Collect data: Data scientists request the data access that is required for AI projects. This stage is often the most time-consuming and labor-intensive if the enterprise has not established a trusted enterprise-level data foundation for analytics. Data scientists who do not know what kind of business data is available in what format may struggle to make specific data requests, so much so that the process may be repeated several times until the required data is finally obtained. Therefore, it is not an overstatement that “74% of AI adopters say that data access is a challenge.” New industry regulations are making this even more challenging.

2 See References [2] and [3] for more information on reference models of the AI lifecycle.
3 See Reference [4] for more information on service-level requirements.

3. Prepare data: When datasets are available, data scientists start to wrangle, explore, and cleanse them. The challenges in this stage for data scientists are the following:

   a. How to identify flaws in the data? Since datasets come from disparate data sources and are often produced by different business applications, they have various types of quality issues, for example, missing values, duplicated records, inconsistent values, etc.

   b. How to visualize large datasets? Visualization helps data scientists identify outliers in the data, understand the statistics of datasets, and prepare for the feature engineering in the next stage.

4. Build models: This stage requires the creativity of data scientists. First, data scientists extract features from datasets, and quite often they derive new features from raw data by aggregation or transformation for better prediction results. Second, data scientists build models by splitting datasets between training and testing, training models including HPO (Hyperparameter Optimization), and evaluating models. This is an iterative process.

5. Deploy models: This is where ML engineers take control from data scientists. It may involve the reimplementation of models in a scalable fashion because of the service-level requirements defined in the first stage. There are several examples:

   a. Rewrite the Python model in Java/C++ for better performance.

   b. Rearchitect the model to run in a parallel way.

   c. Build a feature store for the features less likely to change.

   d. Deploy the models to an environment close to the data source (data gravity).

6. Monitor models: After the model is deployed in a production environment, ML engineers continue to monitor the quality or accuracy of the model as well as drift, the drop in accuracy and data consistency over time. The degradation of predictive performance triggers the pullout of models from the production environment and retraining with recent datasets. In highly regulated industries, the monitoring stage also comprises fairness and explainability (a minimal monitoring sketch follows this list).

7. Govern models: This stage captures necessary information and calculates risk scores on an ongoing basis to ensure enterprises govern the creation and adoption of AI throughout the entire lifecycle. For example, facts and lineage provide metadata tracking of underlying datasets and algorithms. This is very useful to become enterprise-ready for any regulatory requirement that may arise.
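As a concrete illustration of the monitoring stage (6), here is a minimal sketch of accuracy-drift detection, assuming predictions and ground-truth labels arrive over time; the window size and tolerated drop are illustrative assumptions, not prescribed values:

from collections import deque

class AccuracyDriftMonitor:
    """Tracks rolling accuracy and flags drift against a baseline."""

    def __init__(self, baseline_accuracy, window=500, max_drop=0.05):
        self.baseline = baseline_accuracy     # accuracy measured at deployment time
        self.max_drop = max_drop              # tolerated drop before retraining is triggered
        self.outcomes = deque(maxlen=window)  # 1 = correct prediction, 0 = wrong

    def record(self, prediction, actual):
        self.outcomes.append(1 if prediction == actual else 0)

    def drifted(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough recent data to judge
        rolling = sum(self.outcomes) / len(self.outcomes)
        return (self.baseline - rolling) > self.max_drop

# Usage: feed each scored record back once its true label is known
monitor = AccuracyDriftMonitor(baseline_accuracy=0.92)
monitor.record(prediction="approved", actual="approved")
if monitor.drifted():
    print("Drift detected: pull the model and retrain with recent data")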

AI engineering is the operationalization of AI models through a sustainable, repeatable, and measurable process. It involves data engineers, data administrators, data scientists, ML engineers, business analysts, and operations engineers. The goal is to empower every data-related role in the enterprise, regardless of background and skills, to collaborate closely and smoothly to deliver the full value of AI investments and improve time-to-market.

Key Aspects: DataOps, ModelOps, MLOps


When it comes to operationalizing AI, it is important to understand a few concepts, namely, DataOps, ModelOps, and MLOps. Like DevOps, the systematic software development practice that aims to deliver software from code to production rapidly, with high quality, and on a continuous basis, DataOps follows the same principles and practices but applies them to data. It is designed to accelerate the collection, processing, and analysis of data to produce high-quality data for data citizens to fulfill their needs in a compliant way.


Figure 9-2. DataOps, ModelOps, and DevOps (DataOps covers data discovery, data quality, data cleansing, data enrichment, and data governance; ModelOps covers data preparation, feature engineering, model selection, hyperparameter tuning, model evaluation, model deployment, model governance, and model monitoring; DevOps covers the code, test, build, deploy, and operate loop)

As shown in Figure 9-2, DataOps covers two stages in the AI lifecycle: Collect Data and Prepare Data. The implementation of DataOps streamlines the processes in these two stages by automating data tasks into data pipelines. A Data Fabric architecture and a Data Mesh solution⁴ are an ideal way to implement DataOps practices.⁵
ModelOps was coined at almost the same time as DataOps. It is a combination of AI with analytical models and DevOps, designed to bring some of the proven capabilities of agile software engineering to AI and the analytics space. ModelOps is a DevOps-like framework and set of toolchains and processes that brings together data engineers, data scientists, developers, and operators to accelerate the delivery of models from data preparation to model production through an effective enterprise data and AI strategy. It continuously monitors the models, retrains them when needed, and instills trust throughout the entire process. It covers all stages of the AI lifecycle presented in Figure 9-1.

4 Please review Chapter 5.
5 See Reference [5] for more information on DataOps.


MLOps is a subfield of ModelOps. Its primary goal is to help data scientists with rapid prototyping of ML models and fast delivery of ML models to production systems. MLOps focuses on the automation of tasks specific to ML, as outlined in Figure 9-2: feature engineering, HPO, version control, model evaluation, and finally deployment of inference models at scale. It increasingly employs tools for interpretability, transparency, security, governance, and reproducibility of experiments to incorporate ethics and remove bias throughout the entire ML lifecycle.
ModelOps and MLOps are often used interchangeably, and they are very similar in terms of capabilities. But according to Gartner,⁶ ModelOps has a broader scope, including not just ML models but also knowledge graphs, rules, optimization, and natural language techniques and agents, while MLOps focuses on ML model operationalization only.
As presented in Figure 9-2, the data preparation task in ModelOps triggers DataOps pipelines. The output of DataOps, which is high-quality governed data, goes into the next stage of ModelOps. So where does the output of ModelOps go? The answer is the business application. Typically, there are several ways to integrate the predictions from ML models with business applications. The first option is offline inference, where the prediction results are stored in a database and the application reads them directly from the database. The second is batch prediction, which makes inferences on a set of records on a regular basis, for example, weekly, monthly, or quarterly; this is commonly used for accumulated data and situations that do not need immediate results. The third option is online inference, where the model is called via an API and results are usually expected instantly. Alternatively, prediction requests can be queued by a message broker and processed later.
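As a minimal sketch of the online inference option, the following exposes a trained model behind an HTTP API; the model file name, feature layout, and port are illustrative assumptions, and Flask stands in for whatever serving framework an enterprise actually uses:

import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a previously trained and serialized model (hypothetical file name)
with open("mortgage_approval_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # The application sends raw features as JSON, e.g.
    # {"features": [[credit_score, total_debt, income]]}
    payload = request.get_json()
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=8080)  # in production, this would sit behind an API gateway

An offline or batch variant would instead run model.predict over a whole table on a schedule and write the results back to a database for applications to read.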

6 See Reference [6] for more information on the difference between ModelOps and MLOps.


Either way, model inference is consumed by business applications. Therefore, DevOps also comes into play. Both new deployments of models and anomalies detected during the monitoring phase trigger the DevOps pipelines and actions accordingly. DevOps practices are essential to ensure that AI models can be deployed and injected into the business workflow of the enterprise. Model repositories need to be built for AI model lifecycle management, champion testing, and system testing, and model rollout/rollback mechanisms need to be set up to ensure availability, backed by CI/CD (Continuous Integration/Continuous Deployment).

In summary, AI engineering consists of three pillars: data, model, and code. To achieve best practice in AI engineering, it is recommended to adopt a platform that either provides capabilities for DataOps, ModelOps, and DevOps itself or integrates seamlessly with external Ops frameworks for whichever of the three it does not support natively.
The following are some motivational aspects to implement DataOps,
MLOps, and ModelOps in the context of the AI lifecycle:

1. Complexity of data domains: The data challenges that enterprises are faced with are not just the explosive growth of data but also the complexity of the data domains and the heterogeneity of data sources across hybrid cloud and multiple regions. Data silos are becoming an increasingly serious problem. The Data Fabric and Data Mesh concepts aim to resolve this issue by providing smart integration capabilities that help users decide whether to virtualize, replicate, or transform data depending on various factors, for example, policies, performance, latency, etc. Especially Data Mesh solutions, with their organizational and federated approach, are geared toward breaking data silos through clear data source and data ownership.


2. AI governance: One critical aspect of implementing DataOps is to address the need for data and AI governance and privacy. Data Fabric and Data Mesh provide a knowledge-augmented central catalog that contains

   a. An inventory of data assets with enriched business semantics

   b. A set of governance artifacts including business glossaries, regulations, data privacy policies, data protection rules, and privacy data classifications

3. Knowledge catalog: The Data Fabric architecture with its knowledge catalog capabilities automatically enriches data assets via data discovery and data integration and enforces quality and protection rules throughout all activities in the data preparation stage.

4. Democratization of data and AI: Furthermore, one key benefit of DataOps is data and AI democratization. The intelligent catalog and semantic search enable everyone in the company to find the data they need to perform their job. In the context of the AI lifecycle, this greatly improves the collaboration and communication among data scientists, data engineers, and IT specialists and reduces the time spent in the Collect Data stage.

5. Data and AI orchestration: Last but not least, the core of DataOps lies in orchestration, which is responsible for moving data and AI between different stages in the pipeline and instantiating the data tools that operate on it. Orchestration also monitors progress and issues alerts for specific data problems. Data Fabric and Data Mesh provide a unified pipeline capability that composes an end-to-end workflow out of reusable data pipelines, which is the core objective of DataOps orchestration (see the sketch after this list).
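To illustrate what composing an end-to-end workflow from reusable pipelines can look like, here is a minimal, framework-neutral sketch; the task names and chaining mechanism are illustrative assumptions rather than any specific orchestration product's API:

class Pipeline:
    """A tiny orchestrator: runs named tasks in order, passing outputs along."""

    def __init__(self):
        self.tasks = []

    def add_task(self, name, fn):
        self.tasks.append((name, fn))
        return self  # allow chaining so pipelines compose fluently

    def run(self):
        data = None
        for name, fn in self.tasks:
            print(f"running task: {name}")
            data = fn(data)  # each task consumes the previous task's output

# Reusable task definitions (stand-ins for real connectors and transforms)
def collect(_):
    return {"raw": [1, 2, None, 4]}

def cleanse(d):
    return {"clean": [v for v in d["raw"] if v is not None]}

def publish(d):
    print("publishing governed dataset:", d["clean"])

# Compose an end-to-end DataOps workflow from the reusable pieces
Pipeline().add_task("collect data", collect) \
          .add_task("prepare data", cleanse) \
          .add_task("publish to catalog", publish) \
          .run()

A real orchestrator adds scheduling, retries, alerting, and lineage capture on top of this basic composition idea.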

To illustrate DataOps and ModelOps further, let us dive into a couple of case studies.

Case Study 1: Consolidating Fragmented Data in a Hybrid Cloud Environment
The data needed to build models is scattered across multiple data sources and often across multiple clouds. According to IDC's State of the CDO 2021 study,⁷ data fragmentation and complexity is the number one barrier to digital transformation in 2022. Nearly 80% of the organizations surveyed store more than half of their data in hybrid cloud infrastructures. Seventy-nine percent of organizations use more than 100 data sources, and 30% use more than 1,000 data sources. However, 75% of organizations do not yet have a complete architecture to manage the full set of end-to-end data activities, including integration, access, governance, and protection.

One of the ultimate goals of a Data Fabric architecture and Data Mesh solution⁸ is to achieve a single source of truth, which asserts enterprise-wide data coverage across applications that is not limited by any single platform or tool.⁹ Achieving this creates both technical and process challenges for groups seeking to access and explore their data. The technical challenges arise from the logistics of extracting data from multiple sources or if

7 See Reference [7] for more information on the CDO 2021 study.
8 See Chapter 8.
9 See Reference [8] for more information on Gartner's vision for data and analytics.


that data is stored in different formats. Making data usable requires a significant amount of time and effort from highly skilled data engineers. More complex are the organizational-level restrictions around who is accessing what data and for what purpose. This is particularly difficult in industries such as healthcare and finance, where sensitive data needs to be handled with care and often under strict regulatory requirements. At the same time, people struggle to analyze data without replicating it. In the majority of analytics projects, multiple copies of the data are stored in different locations and formats, which creates additional issues such as cost, latency, untrustworthy data, security risks, and more.

Figure 9-3. The Data Architecture for a Mortgage Application (mortgage application data and applicants' PII in a cloud data warehouse; applicants' credit scores, updated per transaction, in an on-premises PostgreSQL database; interest rates, updated daily, in a MongoDB database)

As most organizations struggle with data fragmentation, there is a strong need for a unified enterprise data architecture. This section presents a case study on integrating data from multiple sources with the implementation of a Data Fabric architecture. The data and sample projects used in the case study can be found in the IBM Cloud Pak for Data Gallery.¹⁰

10 See Reference [9] for more information on data and sample projects in IBM Cloud Pak for Data Gallery.


In this case study, a bank in the United States wants to offer a smart mortgage service for California residents. The interest rate on the loan is based on a combination of the applicant's personal credit score and the latest interest rate regulations. To implement this service, data engineers in this bank need to collect all key information about the applicant and the recommended rates. The key information is spread across different database systems, as depicted in Figure 9-3:

• The anonymized mortgage application data and mortgage applicants' PII (Personally Identifiable Information) are stored in a cloud data warehouse.

• The mortgage applicants' credit score data is stored in a PostgreSQL database.

• The interest rate is subject to market change and is scheduled to be refreshed daily; the latest interest rate data can be retrieved from a MongoDB database.

Data engineers need to find mortgage applicants' credit scores, filter the data by state (to only include records from California), calculate the total debt for each applicant, and then merge in credit score ranges to produce a data asset to be consumed by the data science team.

As explained in the previous section, being able to collect and prepare data of good quality is critical to building a data pipeline. It is important to understand the metadata of the ingested data. Sample data previews can help us understand the values within the data, and statistics and visualizations can help us determine a strategy for connecting multiple datasets.
From the data preview, as depicted in Figure 9-4, we find that the column ID can be used to connect application data with PII data. We also find that the column STATE_CODE can be used to filter the data instead of using the state name (which would require an additional data transformation step). Noticeably, the mortgage applicants' credit score data uses a different code in its column ID, which therefore cannot be used as a key for further joins, so the column EMAIL_ADDRESS is chosen instead.
Let’s see how to build data pipelines to consolidate fragmented data
from disparate data sources, which is a goal of DataOps.

Figure 9-4. Preview Sample Data

First, adding the loan and credit card debt amounts gives the total debt of an applicant. Then, the interest rate table in MongoDB is queried with the credit score to find the corresponding interest rate. Finally, a recommended interest rate is generated for each mortgage applicant. The whole data pipeline looks like Figure 9-5.
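Expressed in code, the transformation steps might look like the following sketch, assuming pandas and column names based on the preview (ID, STATE_CODE, EMAIL_ADDRESS) plus invented ones (LOAN_AMOUNT, CREDIT_CARD_DEBT, CREDIT_SCORE, MIN_SCORE, RATE); the actual pipeline is built with the Data Fabric's visual pipeline tooling rather than hand-written code:

import pandas as pd

# Stand-ins for the three sources: warehouse, PostgreSQL, and MongoDB extracts
applications = pd.DataFrame({
    "ID": [1, 2], "STATE_CODE": ["CA", "NY"],
    "EMAIL_ADDRESS": ["a@x.com", "b@x.com"],
    "LOAN_AMOUNT": [300_000, 250_000], "CREDIT_CARD_DEBT": [12_000, 8_000],
})
credit_scores = pd.DataFrame({
    "EMAIL_ADDRESS": ["a@x.com", "b@x.com"], "CREDIT_SCORE": [710, 650],
})
interest_rates = pd.DataFrame({          # daily refresh from MongoDB
    "MIN_SCORE": [300, 670, 740], "RATE": [0.072, 0.055, 0.049],
})

# Filter to California, join on EMAIL_ADDRESS, and derive total debt
ca = applications[applications["STATE_CODE"] == "CA"]
merged = ca.merge(credit_scores, on="EMAIL_ADDRESS")
merged["TOTAL_DEBT"] = merged["LOAN_AMOUNT"] + merged["CREDIT_CARD_DEBT"]

# Look up the rate band for each credit score (merge_asof needs sorted keys)
merged = pd.merge_asof(
    merged.sort_values("CREDIT_SCORE"),
    interest_rates.sort_values("MIN_SCORE"),
    left_on="CREDIT_SCORE", right_on="MIN_SCORE",
)
print(merged[["ID", "CREDIT_SCORE", "TOTAL_DEBT", "RATE"]])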
However, interest rate data keeps changing. Loan and credit card debt amounts are updated monthly. Credit scores are calculated by other applications and pushed to the PostgreSQL database daily. Interest rates are updated daily. If the tasks of joining the data, calculating total debt, and querying interest rates are separated and run independently, there is no guarantee that the result keeps the expected level of accuracy. Therefore, the pipeline, as depicted in Figure 9-5, needs to be deployed as one job and scheduled to run regularly. That ensures that the data request for interest rates is fulfilled with up-to-date data.
High-quality ML models require high-quality data. ML pioneer Andrew Ng believes that focusing on the quality of the data that powers AI systems will help unlock their full power.¹¹ Analyst firms like Gartner observe that low-quality data induces high cost and undermines business.¹² Data only delivers business value when it is correlated and can be accessed by any user or application in the organization. Organizations looking to improve data quality typically start with a data and analytics governance program that consolidates fragmented data from hybrid cloud environments. When implemented properly, Data Fabric and Data Mesh help ensure that this value is available throughout the organization in the most efficient and automated manner.

11 See Reference [10] for more information on why Andrew Ng advocates for data-centric AI.
12 See Reference [11] for more information on the impact of poor-quality data on business from Forbes.


Figure 9-5. Create Data Pipelines


The ideal implementation of both concepts puts together self-service data tools into an intelligent fabric over a heterogeneous data landscape. It provides everyone in the organization with the ability to find, explore, and interrogate all available data, whether on-premises or in a hybrid cloud landscape.

Case Study 2: Operationalizing AI


Operationalizing AI refers to AI lifecycle management, which introduces integration between the data engineering team that builds the data pipeline, the data science team that builds the AI models, and the operations team that deploys and maintains the AI models. Operationalizing AI and AI engineering are often used interchangeably. As explained in the previous section, AI engineering involves the core management competencies of ModelOps, DataOps, and DevOps to enable organizations to improve the performance, scalability, interpretability, and reliability of AI models while delivering the full value of AI investments.

In the mortgage business, change is constantly happening based on changing regulations, products, and processes. It is imperative that customers get up-to-date information and timely support to streamline their home-buying experience. The bank in our case study wants to expand its business by offering low-interest-rate mortgage renewals for online applications. The task for the data science team is to train a mortgage approval model to predict which applicants qualify for a mortgage and to deploy the model for real-time evaluation based on the applicant's requirements.

Figure 9-6. Explain How Prediction Is Being Made in Plain English

There is a wealth of algorithms for training the model. Once the model is available, the data scientists save it to the project and use a holdout dataset to evaluate it. Figure 9-6 explains in plain English why the model makes a prediction with a high degree of confidence. When the performance of the model is satisfactory, it can be taken to production. Moreover, the status and production performance of the models can be monitored at any time from the model inventory, as shown in Figure 9-7.

There are a few challenges when operationalizing AI. The most common one is that deploying AI models into production is expensive and time-consuming. For many organizations, over 80% of their models have never been operationalized. While data science teams build many models, very few are actually deployed into production, which is where the real value comes from. For many organizations, the time it takes to build, train, and deploy models is 6–12 months.
The issue of AI bias has attracted increasing public attention. Drift occurs as data patterns change, which leads to a reduction in the accuracy of a model's predictions. When this happens, line-of-business leaders increasingly lose confidence that their models are producing actionable insights for their business. Fairness is also an area of concern. If the model produces favorable predictions for specific groups (by gender, age, or nationality), it can lead to AI ethics discussions and possibly even legal risks.


Figure 9-7. The Status of the Model in the Model Inventory


Another AI trust issue that affects model deployment comes from the lack of model lineage analysis. This includes two aspects. One is how the model is built and which features play a decisive role in the final scoring results of the model; this is where the interpretability of the model comes into play. The other is data lineage: where the data used to train the model comes from, whether it is accurate and secure, and whether there is a possibility of tampering.

The fact sheets shown in Figure 9-7 are an example of what helps business users understand and trust the model.
These challenges need to be considered when an organization chooses
a Data Fabric and Data Mesh implementation. The goal of acquiring data
is to use it for a particular business purpose. Therefore, the best solution
is to have the capabilities to operationalize data and AI implemented
within the Data Fabric architecture. It helps organizations reduce the skills
required to build and manipulate AI models, speed up delivery time by
minimizing mundane tasks and data preparation challenges, and, at the
same time, optimize the quality and accuracy of AI models with real-time
governance.

Accelerate MLOps with AutoAI


There is a chasm between using Jupyter notebooks to develop models for experimentation and deploying those models to production systems. While enterprise investment in AI has been increasing, the percentage of AI models that are delivered to production is still small. Enterprises are recognizing that crossing this chasm has become critical to realizing the value of AI. Due to the success of DevOps, many enterprises are looking to MLOps to solve this problem. As explained in the earlier section, MLOps is a set of practices that connect data preparation, model creation, deployment, and monitoring, with a focus on operating ML models effectively.

The success criteria for MLOps are to establish a sustainable set of disciplines for enterprises to roll out experimental models to production smoothly. Listed in the following are challenges that need to be dealt with when implementing MLOps practices.

The first is data preparation. Since ML models are built on data, they are very sensitive to the quality of the data, such as the semantics, quantity, and completeness of the data being used to train the model. However, data preprocessing can be very time-consuming, depending on the condition of the data. As explained in Chapter 6, a sequence of data understanding and data preparation tasks needs to be performed. Here are some of the most common examples, followed by a short code sketch:

• Feature selection: Discard features that are less important to prediction. First, exclude the features that have a constant value throughout the dataset. Second, if the data type is not time or date, ignore the columns in which every value is unique, which very likely indicates an ID.

• Missing or incorrect values: For missing or inaccurate values in the datasets, one option is to estimate with a statistical approach – the mean, median, or average of adjacent records. Another option is to use an ML algorithm to predict the missing or inaccurate value.

• Feature encoding and scaling: There are many transformation methods based on data types. For example, you can encode categorical features as ordinal numbers and scale numerical features to see how they affect the performance of the model.
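The sketch below illustrates these three preparation steps with pandas and scikit-learn; the DataFrame and column names are invented for illustration:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.DataFrame({
    "applicant_id": [101, 102, 103, 104],          # unique: likely an ID
    "branch": ["west", "west", "west", "west"],    # constant: no signal
    "income": [52_000, None, 61_000, 48_000],      # has a missing value
    "education": ["BSc", "MSc", "BSc", "PhD"],     # categorical feature
})

# Feature selection: drop constant columns and non-date unique-valued columns
constant = [c for c in df.columns if df[c].nunique() == 1]
id_like = [c for c in df.columns
           if df[c].nunique() == len(df) and df[c].dtype != "datetime64[ns]"]
df = df.drop(columns=constant + id_like)

# Missing values: impute with a simple statistic (the median here)
df["income"] = df["income"].fillna(df["income"].median())

# Encoding and scaling: ordinal-encode categories, standardize numerics
df["education"] = OrdinalEncoder().fit_transform(df[["education"]]).ravel()
df[["income"]] = StandardScaler().fit_transform(df[["income"]])
print(df)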

The next phase is model creation. Based on the characteristics of the datasets and the nature of the business problem, there is a wealth of ML algorithms to choose from. In reality, the choice of algorithm needs to balance accuracy against the time spent on training. Another thing to consider is the metrics used to evaluate ML models, such as ROC and AUC for binary classification¹³; F1, precision, and recall for multi-class classification; mean squared error (MSE), root mean squared error (RMSE), and R² for regression; etc.
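As a small illustration of such evaluation metrics, the following computes a few of them with scikit-learn on made-up predictions; the arrays are placeholders for a real holdout set:

from sklearn.metrics import (accuracy_score, f1_score,
                             mean_squared_error, roc_auc_score)

# Binary classification: true labels, predicted labels, and model scores
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1]  # model confidence for class 1

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:      ", f1_score(y_true, y_pred))
print("ROC AUC: ", roc_auc_score(y_true, y_score))

# Regression: RMSE is the square root of the mean squared error
y_true_reg = [3.0, 5.0, 2.5]
y_pred_reg = [2.8, 5.4, 2.9]
rmse = mean_squared_error(y_true_reg, y_pred_reg) ** 0.5
print("RMSE:    ", rmse)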
Feature engineering also plays an important role in model creation. ML algorithms deal with tabular data, and if relationships exist between different features, using data transformations to derive new features and combine them will help reveal the insights hidden in the data.
While the previously described tasks are complex, the approach to when and what method is needed to solve a problem is clear. This lays a good foundation for automation. Many vendors offer AutoAI solutions to automate the preceding tasks.

AutoAI is a no-code/low-code platform that automates several aspects of the MLOps lifecycle. As shown in Figure 9-8, AutoAI provides an easy-to-follow wizard to help you define the model training settings and to choose the target (predicted) feature. AutoAI also provides HPO capabilities that help optimize the hyperparameters of the best-performing pipelines from the previous phases. It uses a model-based, derivative-free global search algorithm, called RBFOpt,¹⁴ which is tailored for the costly training and scoring evaluations of ML models.

For each folding and algorithm type, AutoAI creates two pipelines optimizing the algorithm type using HPO: the first optimizes the algorithm type based on the preprocessed (imputed/encoded/scaled) dataset, and the second optimizes it based on optimized feature engineering of the preprocessed dataset. The final model can be selected from the set of candidate models. Once the model is built and exported to a Jupyter notebook, the data scientist can further customize the results, which is often required for more complex use cases.
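To give a feel for what automated HPO does behind the wizard, here is a minimal sketch using scikit-learn's randomized search as a stand-in; AutoAI's actual RBFOpt algorithm is a different, model-based global search, so this only illustrates the general idea of searching a hyperparameter space under a training-cost budget:

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Search space for the hyperparameters to optimize
param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 12),
}

# Randomized search evaluates candidate configurations by cross-validation,
# trading search cost against model quality
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20,          # number of candidate configurations to try
    cv=3,               # 3-fold cross-validation per candidate
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
print("best CV ROC AUC:     ", round(search.best_score_, 4))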

13 Please refer to Chapter 6 for an explanation of ROC, AUC, etc.
14 RBFOpt is an open source library for black-box optimization with costly function evaluations.

Figure 9-8. AutoAI


AutoAI automates the highly complex process of finding and optimizing the best ML models, features, and model hyperparameters for training.¹⁵ It allows people without deep data science expertise to create models of all types, and it lets even those with deep data science expertise prototype and iterate more quickly. AutoAI reduces the effort of building models and increases productivity and accuracy. It provides significant productivity gains for enterprises implementing MLOps.

15 See Reference [12] for more information about the benefits AutoAI could bring to MLOps and the AI lifecycle.

Deployment Patterns for AI Engineering


When it comes to the training and deployment of AI models, IT often defines requirements that data science practices must comply with. For example, if training data is stored in a public cloud, model training using that data very likely also must take place in the same cloud to minimize data export (outbound) costs or to comply with governance rules. There are many other factors around model training and deployment, including security, latency, performance, and data residency. Here are a few examples:

• Meet the service-level requirements, for example, response time, latency, and throughput:

   • Co-locating with the data source for easy access to features

   • Co-locating with the applications to reduce network overhead

   • Easy scale-out to accommodate massive inference calls

• Optimize deployment cost related to hardware, integration, and operation:

   • Reuse existing computational resources.

   • Reduce the effort of integrating with existing applications.

   • Reduce operational costs by reusing existing operations infrastructure.

Depending on the importance of these factors, enterprises employ one or more of three common patterns when deciding on an ML deployment architecture. The first pattern is to deploy the runtime environment of the ML model to the platform where the data originated.
Business-critical applications in large enterprises are currently deployed in highly reliable and regulated environments. For instance, two of three large financial institutions in the world run their core banking systems on IBM zSystems. As a result, a large amount of training data for ML, as well as the raw data needed for inference after the ML models go live, is still generated and stored on-premises. Moving this data from a secure environment to the public cloud not only introduces latency but also increases the risk of data leakage and data tampering.

The deployment pattern in Figure 9-9 can be an option for organizations that do not want to take legal risks and potential financial losses but still want to benefit from the flexible, scalable, and cost-effective computing resources of the public cloud.
First, the data is masked according to data protection regulations and migrated to the public cloud platform through intelligent integration technology as the dataset for model training. When model training is completed, the model is deployed to the on-premises system. To obtain better scalability, it is recommended to use container-based deployment. In addition, multiple versions of the same model can be deployed for subsequent A/B testing¹⁶ and gray release.

Figure 9-9. Pattern 1: Co-locate the ML Runtime with Data for Easy Access of Features (data is virtualized, replicated, or transformed for training on the public cloud; trained models, for example, in PMML or ONNX format, are deployed back to containerized ML runtimes on-premises, where core business applications invoke model inference APIs through an API gateway)

There are several ways in which models and applications can be integrated. To continuously monitor and easily update models, invoking model inference APIs from applications is the most common integration approach. Considering the scale of the workloads in production systems, which may reach up to 10,000 requests per second, enterprises usually use an API gateway for load balancing.

In this pattern, the application directly obtains the raw data needed to invoke the model APIs and then sends the request for model inference through the API gateway, which forwards it to the runtime container of the corresponding model version based on the specific information in the request. If too many inference requests are received at the same time, the API gateway may create a new replica of the runtime with the specific model version to maintain the service level for the model.

16 A/B testing is a method of comparing two versions of a web page or app.


The core idea behind this deployment pattern is to train models on the public cloud with regulated data from on-premises systems but deploy the models back on-premises to easily access raw data, thereby improving the performance and quality of model inference. One potential drawback of this pattern is that the massive amount of scoring requests may have an impact on business-critical applications that reside on the same system. Another disadvantage is that the skills and toolchains for operationalization are more complex due to cross-platform deployments.

In contrast with data gravity, the second deployment pattern is cloud-native. Assuming most data for training and scoring is generated on-premises and modernized applications are running on the public cloud, enterprises can consider the deployment pattern presented in Figure 9-10.

Figure 9-10. Pattern 2: Co-locate the ML Runtime with the Application on Cloud (training data is moved to the public cloud, raw data for scoring is replicated in near time, and applications invoke containerized ML runtimes on the public cloud through an API gateway)

Like the previous pattern, during the training phase, data is moved off the on-premises system to the public cloud, and after training is complete, the model is deployed directly to the public cloud. Applications running on the public cloud retrieve raw data either directly from the on-premises data source or through access to a near-real-time cache of that data source. Once applications have the data needed for model inference APIs, they send requests to the API gateway, which dispatches the requests to a specific model runtime as described in the first pattern.

There are two important factors to consider when deploying this pattern:

1. Whether the latency of copying data or accessing data for model inference is acceptable. Depending on the data source and technology, the latency varies from hours to seconds.

2. Whether the data movement complies with security and privacy regulations. For example, data generated by data centers in the countries of the European Union is not allowed for transit use by applications running on public cloud in the United States.

If the above-mentioned factors are not hindrances, then this deployment pattern has advantages. First, the environment in which the models run can easily scale out through a public cloud infrastructure. Second, the cost of operations and maintenance is relatively low due to the unified operations toolchain. Last but not least, it has minimal impact on on-premises systems.

The third and final common deployment pattern is edge deployment. AI models for image recognition and video analysis are widely used in the manufacturing industry. One typical use case for image recognition is to spot defects in parts on the production line to reduce the manual effort of quality inspection. Another use case is the use of video analytics to monitor whether workers are following safety regulations.


Figure 9-11. Deploy an Inference Service on an Edge Device (models are trained on the public cloud or on-premises and deployed to multiple edge devices, where DL models, often in C++, are integrated with edge applications)

In these two examples, the production lines on the shop floor very likely do not have connections to the public cloud. Even if they do, the network overhead increases the delay in returning model inference results, and the performance requirements of the application cannot be met. That is why edge deployment has traction.

In this deployment pattern, as depicted in Figure 9-11, all data captured at edge devices (including images and videos) is sent to the public cloud or on-premises systems for training. Since images and videos are unstructured data, manual annotation is usually required. When model training is completed, the model is deployed to multiple edge devices. One difficulty with this deployment pattern is that the model may need to be rewritten in a language like C++ due to the resource constraints of edge devices and the extremely high performance requirements, which imposes additional difficulties for model upgrades and version management. It often requires an additional component to dispatch the models to edge devices and manage the lifecycle of models at the edge.
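One common way to reduce the need for a full rewrite is to export the trained model to a portable format such as ONNX, which compact edge runtimes (including C++ ones) can execute. The following sketch assumes scikit-learn with the skl2onnx converter and onnxruntime; these are illustrative tooling choices, not the pattern's prescribed stack, and the model file name is hypothetical:

import numpy as np
import onnxruntime as ort
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model in the cloud/on-premises training environment
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Export to ONNX: a portable graph that an edge runtime can execute
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, 4]))]
)
with open("defect_detector.onnx", "wb") as f:  # hypothetical model name
    f.write(onnx_model.SerializeToString())

# On the edge device: load and score with a lightweight runtime
session = ort.InferenceSession("defect_detector.onnx")
sample = X[:1].astype(np.float32)
outputs = session.run(None, {"input": sample})
print("edge prediction:", outputs[0])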

Key Takeaways
We conclude this chapter with a few key takeaways, as summarized in Table 9-1.

Table 9-1. Key Takeaways

1. AI and software engineering have fundamental differences: Traditional software is rule-driven, where the problem can be coded with clear logical rules and produces deterministic output, while AI is more data-driven, based on training sets of data and algorithms that approximate an optimal solution.

2. The AI lifecycle comprises multiple stages: business problem understanding, collecting data, preparing data, building the model, deploying the model, monitoring the model, and governing the model.

3. AI operationalization domains include DataOps, ModelOps, and DevOps: AI operationalization consists of three pillars – data, model, and code. It is recommended to adopt a platform that can provide capabilities for DataOps, ModelOps, and DevOps.

4. Differences between ModelOps and MLOps: MLOps is a subfield of ModelOps. ModelOps covers operationalization not just for machine learning but also for knowledge graphs, rules, optimization, Natural Language Processing, agents, etc.

5. The Data Fabric architecture and Data Mesh solution implement DataOps practices: They provide a unified enterprise data architecture and solution for consolidating dispersed data from a hybrid cloud environment through automated data discovery, smart data integration, and intelligent cataloging.

6. The majority of models never get into production: Over 80% of models are never operationalized, either because the effort involved in deploying them is enormous or because deployed models produce drift or fairness issues that outweigh the benefits.

7. AutoAI accelerates MLOps: AutoAI is a no-code/low-code platform that automates several aspects of the MLOps lifecycle. It allows citizen data scientists to create models of all types and even seasoned data scientists to prototype and iterate more quickly.

8. There are multiple deployment architecture patterns: The choice of deployment architecture is determined by various factors, including but not limited to data locality, performance and latency requirements, and security requirements.

References

[1] Gartner Top Strategic Technology Trends for 2022, www.gartner.com/en/information-technology/insights/top-technology-trends

[2] Mark Haakman, Luís Cruz, Hennie Huijgens, & Arie van Deursen, AI lifecycle models need to be revised, https://link.springer.com/article/10.1007/s10664-021-09993-1

[3] Walch, K., Forbes, Operationalizing AI, 2020, www.forbes.com/sites/cognitiveworld/2020/01/26/operationalizing-ai/#49ef691c33df (accessed March 2, 2020)

[4] Aparna Dhinakaran, Two Essentials for ML Service-Level Performance Monitoring – A guide to optimizing ML service latency and ML inference latency, https://towardsdatascience.com/two-essentials-for-ml-service-level-performance-monitoring-2637bdabc0d2

[5] How and Why to DataOps, www.ibm.com/blogs/academy-of-technology/wp-content/uploads/2022/02/IBMDataOpsHowandWhy_Whitepaper.pdf

[6] Natasha Sharma, What Is ModelOps and How Is It Different From MLOps?, https://neptune.ai/blog/modelops

[7] 2021 State of the CDO study, www.informatica.com/about-us/news/news-releases/2021/12/20211209-informatica-unveils-2021-state-of-the-cdo-study.html

[8] Leadership Vision for 2022: Top 3 Strategic Priorities for Data and Analytics Leaders, www.gartner.com/en/information-technology/insights/leadership-vision-for-data-and-analytics

[9] Data and Sample projects in IBM Cloud Pak for Data Gallery, https://dataplatform.cloud.ibm.com/gallery?context=cpdaas&format=project-template&topic=Data-fabric

[10] Why it's time for "data-centric artificial intelligence," https://mitsloan.mit.edu/ideas-made-to-matter/why-its-time-data-centric-artificial-intelligence

[11] Flying Blind: How Bad Data Undermines Business, www.forbes.com/sites/forbestechcouncil/2021/10/14/flying-blind-how-bad-data-undermines-business/

[12] MLOps and Trustworthy AI, www.ibm.com/products/cloud-pak-for-data/scale-trustworthy-ai

[13] Meenu Mary John, Helena Holmström Olsson, and Jan Bosch, Architecting AI Deployment: A Systematic Review of State-of-the-Art and State-of-Practice Literature, www.researchgate.net/publication/348655621_Architecting_AI_Deployment_A_Systematic_Review_of_State-of-the-Art_and_State-of-Practice_Literature

[14] Chaoyu Yang, Design Considerations for Model Deployment Systems, https://towardsdatascience.com/design-considerations-of-model-deployment-system-c16a4472e2be

[15] Ryan Dawson, Navigate ML Deployment, https://towardsdatascience.com/navigating-ml-deployment-34e35a18d514
