With Early Release ebooks, you get books in their earliest form—the
author’s raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.
Implementing Data Mesh
by JG Perrin and Eric Broda
Copyright © 2024 Oplo LLC and Broda Group Software Inc. All rights
reserved.
The views expressed in this work are those of the authors and do not
represent the publisher’s views. While the publisher and the authors have
used good faith efforts to ensure that the information and instructions
contained in this work are accurate, the publisher and the authors disclaim
all responsibility for errors or omissions, including without limitation
responsibility for damages resulting from the use of or reliance on this
work. Use of the information and instructions contained in this work is at
your own risk. If any code samples or other technology this work contains
or describes is subject to open source licenses or the intellectual property
rights of others, it is your responsibility to ensure that your use thereof
complies with such licenses and/or rights.
978-1-09815-616-9
[FILL IN]
Brief Table of Contents (Not Yet Final)
Part 1: The basics (available)
Part 3: Teams, operating models, and roadmaps for Data Mesh (available)
Chapter 14: Defining and Establishing the Data Mesh Team (available)
Part I. The basics
This first part of the book sets the stage for the rest of the book: by the end of this part, you will be familiar with our terminology and our use case.
Chapter 1. Understanding Data Mesh -
The Essentials
With Early Release ebooks, you get books in their earliest form—the
author’s raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.
This will be the 1st chapter of the final book. Please note that the GitHub
repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the editor at [email protected].
Data as a Product
Now, since the Agile Manifesto was published, there have been over twenty
years of turning core agile principles into practice. We now deliver software
faster, better, and cheaper: McKinsey, a consulting firm, has shown that
“agile organizations have a 70 percent chance of being in the top quartile of
organizational health, the best indicator of long-term performance.” Simply
put, the software engineering world has never been the same.
Similarly, Data Mesh introduces agility into the data landscape,
emphasizing decentralized ownership, responsive data management, and
collaborative cross-functional teams. Just as Agile promotes self-organizing
teams, Data Mesh advocates for domain-oriented decentralized ownership,
putting the power of data in the hands of individual domain teams. In an
Agile context, customer collaboration involves continuous engagement with
stakeholders to understand their evolving needs. Likewise, Data Mesh
encourages domain teams to engage with data consumers within their
organization, gathering feedback, and iterating on their data products to
meet their specific requirements.
With its focus on self-serve data infrastructure, Data Mesh enables domain
teams to access and manage their data independently. This eliminates the
need for bureaucratic processes and time-consuming requests to centralized
data teams, reducing wait times and accelerating the data development
lifecycle. By putting the necessary tools and resources into the hands of
data practitioners, Data Mesh enables rapid iteration, experimentation, and
delivery of data products. This increased speed allows organizations to
capitalize on data insights more efficiently, gaining a competitive advantage
in today’s fast-paced business landscape.
And with local autonomy comes speed and agility: By distributing data
ownership and fostering collaboration, Data Mesh enables teams to respond
swiftly to changing business needs and data requirements. Domain teams
have the flexibility to adapt their data products and infrastructure to meet
evolving demands, avoiding the constraints of rigid centralized systems.
This agility empowers organizations to seize emerging opportunities, make
data-driven decisions in real time, and stay ahead of the competition.
Consider data silos. Data silos hinder data accessibility and collaboration,
making it difficult to gain a holistic view and leverage the full potential of
the available data. They present a real, present, and formidable challenge
that almost all data practitioners experience in modern enterprises.
Data silos, much like isolated islands in an immense ocean, are repositories
of data that are confined within specific departments or systems,
disconnected from the broader organizational data landscape. This
segregation results in a fragmented data ecosystem, where valuable insights
remain untapped, and the collective intelligence of the enterprise is
underutilized.
In this context, approaches like Data Mesh become highly relevant, offering
a decentralized, yet cohesive framework for data management. Data Mesh
advocates for domain-driven ownership of data, enabling individual teams
to manage and share their data effectively while aligning with the overall
organizational objectives. By embracing this paradigm, enterprises can
gradually dismantle the barriers of data silos, paving the way for a more
integrated, agile, and data-centric organizational culture.
Now, let’s consider data complexity. In the digital era, the complexity of
data - and its management - has become a central challenge for modern
enterprises. This complexity is not merely a byproduct of the volume of
data but also its variety and velocity.
And as data volume and variety grow, ensuring data quality and integrity
becomes an increasingly difficult task. Poor data quality can lead to
incorrect or bad business decisions, misguided strategies, and ultimately, a
detrimental impact on business outcomes. Making matters worse, the sheer
complexity of data can obstruct compliance efforts, as understanding the
nuances of data privacy regulations becomes more difficult when data is
scattered and convoluted. For global organizations, this challenge is
amplified by the need to navigate a patchwork of regional and international
data laws.
Mastering this complexity requires a multifaceted approach, blending
technology, strategy, and organizational culture. Advanced technologies
such as machine learning and artificial intelligence offer powerful tools for
analyzing complex data sets, uncovering patterns, and generating insights
that would be impossible for humans to discern unaided. However,
technology alone is not a panacea; it must be coupled with a robust data
strategy that prioritizes data governance, quality, and integration.
Organizations need to foster a data-literate culture where employees across
departments understand the importance of data and are equipped with the
skills and tools to leverage it effectively.
The complexity of this task is magnified by the sheer volume and diversity
of data that organizations handle. With data collected from myriad sources –
customer databases, online transactions, IoT devices, and more – the risk of
exposure and vulnerability increases exponentially.
This clear ownership not only fosters a more focused and vigilant approach
to securing data but also enables swifter action in addressing urgent security
issues. When security threats emerge, domain teams can respond rapidly,
applying targeted measures to protect their data products without the delays
often associated with centralized decision-making processes.
By virtue of their close involvement with their data products, these teams
possess detailed knowledge of the data’s nature, use cases, and potential
vulnerabilities. This intimate understanding enables them to implement
security measures that are precisely tailored to the specific characteristics
and risks associated with their data.
And the decentralized nature of Data Mesh facilitates a more adaptive and
responsive security posture. As domain teams are deeply versed in the
relevant legal and regulatory requirements, they can ensure compliance in a
dynamic legal landscape, adapting quickly to new regulations and privacy
standards. This approach not only strengthens the security of data assets but
also enhances the overall resilience of the organization against the myriad
threats that define the contemporary data security landscape.
In the digital age, the velocity of data creation and consumption has become
a defining challenge for organizations. Data is generated at an
unprecedented rate from a plethora of sources – social media, mobile
devices, IoT sensors, and countless enterprise applications. This rapid
generation and consumption of data, akin to a high-speed train, necessitates
a continuous and agile approach to data management.
The agility and responsiveness that Data Mesh brings to data management
are crucial in an era where the speed of data is continuously accelerating.
As domain teams are closer to the data sources and more attuned to the
specific requirements of their domain, they can implement more effective
and timely data strategies.
This not only includes the technical aspects of data handling but also
encompasses a more nuanced understanding of the data’s context and
potential value. Data Mesh’s emphasis on viewing data as a product means
that each data product is designed with its use cases in mind, ensuring that it
is not just processed efficiently but also utilized effectively.
And last but not least comes every data practitioner's favorite topic: data
governance.
Given the penalties for non-compliance and the risks associated with data
breaches, governance is not just a compliance issue but a critical business
necessity. In this evolving landscape, data governance must be agile,
responsive, and deeply integrated into the day-to-day handling of data.
But far too often today, data governance is viewed as a task that must be done, a command from on high, rather than a task that drives inherent value. Data Mesh offers an alternative.
In the Data Mesh governance model, the data governance team is akin to
the ASA, setting overarching governance standards and policies. Data
product owners, on the other hand, are like the product vendors. They
ensure that their data products comply with the established governance
standards and, once compliant, can be certified as meeting the enterprise’s
governance criteria.
This certification not only serves as a mark of trust and quality within the
organization but also streamlines the process of governance by empowering
those closest to the data. It ensures that governance is not a top-down,
bureaucratic process but a collaborative, integrated practice that enhances
the value and security of data across the enterprise.
Furthermore, data product owners, who are closest to the data and its use
cases, are in a unique position to understand and manage the compliance
requirements effectively. They can publish and update their certification
statuses, making this information transparent and accessible within the Data
Mesh ecosystem.
This method contrasts starkly with the conventional centralized governance
models, where compliance is often managed by a central group that
oversees and polices all data activities. While this model has its strengths in
maintaining control and uniformity, it can also lead to bottlenecks, delays,
and a disconnect between the governance process and the real-world
application of data.
Our aspiration in writing this book is rooted in a humble yet bold vision:
two decades from now, we hope to look back and see Data Mesh as a
pivotal force in bringing agile methodologies to the realm of data
management. Our contribution, though a modest part of this larger
movement, aims to empower organizations to derive better, faster, and more
cost-effective insights and business value from their data. Through the
pages of this book, we seek to inspire a new generation of data
professionals, equipping them with the knowledge and tools to
revolutionize data management practices and drive their organizations
towards a future where data is not just an asset, but a catalyst for innovation
and growth.
We will define Data Products (Chapter 2) and how they act as members of the Data Mesh ecosystem. We will introduce our case
study (Chapter 3) - applying Data Mesh to make climate data easy to
find, consume, share, and trust - that will be used throughout the
book to demonstrate how to implement Data Mesh practices. And, of
course, we will have a perspective on Data Mesh architecture
(Chapter 4).
We will describe how to run and operate your Data Mesh (Chapter 10) and the "team topology" required to deliver your Data Mesh (Chapter 11). We will show how to establish autonomous domain teams responsible for specific areas of the business and describe the role of the Data Product owner, who has ownership of their domain's data, including its quality, governance, and accessibility.
We will provide a tried and tested "roadmap" (Chapter 13) that starts with a strategy and then shows how to implement the core Data Product and Data Mesh foundational elements, as well as how to establish Data Product teams and the broader Data Mesh operating model. We
will also show how to establish channels for collaboration and
knowledge sharing among domain teams through communities of
practice, regular cross-functional meetings, or data councils. We will
show how to socialize Data Mesh within your organization to
encourage teams to share best practices, lessons learned, and data
assets to leverage the collective knowledge and expertise across the
organization.
Enjoy!
Chapter 2. Applying Data Mesh
Principles
With Early Release ebooks, you get books in their earliest form—the
author’s raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.
This will be the 2nd chapter of the final book. Please note that the GitHub
repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the editor at [email protected].
For those who have read Zhamak Dehghani's visionary book, this will be a review of sorts. But for those who are new to Data Mesh, our hope is that
this is a valuable - but brief - introduction to Data Mesh, its core principles,
and its core components.
What is a Data Mesh?
At its simplest, a Data Mesh is just an ecosystem of interacting data
products as shown in Figure 2-1. But like any ecosystem, there are many
moving parts, each operating somewhat independently, that are connected
through common standards and a communications fabric. And data products
in a data mesh, ideally, have a common technical implementation with a
consistent set of interfaces.
So, at its foundation, Data Mesh is a conceptual framework for data architecture that emphasizes decentralized data ownership. It recognizes that in large organizations data is vast and varied, and that each business domain should have autonomy over its own data. By
decentralizing control, it empowers individual domains to manage and
make decisions about their data while maintaining a cohesive overall
structure. And presumably, with this autonomy come better, more localized, and faster decisions, which, in turn, lead to speed and agility.
Figure 2-1. Data Mesh: An Ecosystem of Interacting Data Products
Obviously, we will have much more to say about the Data Mesh ecosystem
in the architecture chapter (Chapter 4). Nevertheless, let’s continue.
Data Mesh Principles
A set of guiding principles stands at the core of Data Mesh, with each
playing a crucial role in the framework’s efficacy and sustainability. The
first of these principles is the establishment of a clear boundary for each
data product. This boundary demarcation is essential in defining what each
data product represents, its scope, and its limitations. It’s an exercise similar
to mapping out a city’s boroughs, where each area is distinctly outlined and
understood for its unique characteristics and contributions. The principle of
a clear boundary in a Data Mesh ensures that every data product is a well-
defined entity within the larger ecosystem. This clarity prevents overlap and
confusion, establishing a clear understanding of the data product’s purpose
and scope. It aids in managing expectations and directs efforts and
resources appropriately, ensuring that each data product can effectively
fulfill its intended role.
A third principle central to the Data Mesh concept is the provision of self-
serve capability. This feature is pivotal in democratizing data within an
organization. It allows users from various departments and skill levels to
access, manipulate, and analyze data independently, without relying on
specialized technical support. This approach to data accessibility is
comparable to empowering city residents to utilize public spaces and
services on their own terms, fostering a sense of ownership and
engagement.
Self-serve capability in a Data Mesh not only empowers users but also
fosters a culture of innovation and agility. It enables individuals to leverage
data for their specific needs, encouraging experimentation and personalized
analysis. This capability reduces bottlenecks typically associated with
centralized data systems, where requests for data access and analysis can
slow down decision-making processes.
"Good" data products also align with the FAIR principles, which state that data should be:
Findable
Accessible
Interoperable
Reusable
According to FAIR, “the principles emphasize machine-actionability (i.e.,
the capacity of computational systems to find, access, interoperate, and
reuse data with none or minimal human intervention) because humans
increasingly rely on computational support to deal with data as a result of
the increase in volume, complexity, and creation speed of data.”
So, let’s elaborate on these principles and apply them to data products.
Findability is the first of the FAIR principles. For a data product to be
valuable, it must be easily discoverable within the organization’s broader
data landscape. This involves implementing a system for cataloging and
effective metadata management. For instance, a data product containing
sales figures should have detailed metadata that includes information on the
time period covered, the geographical scope, and the type of sales data
included. This enables users to quickly find and identify the data they need
without wading through irrelevant information.
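To make this concrete, here is a minimal sketch of what such catalog metadata might look like for a hypothetical sales data product, expressed in Python; the field names, the tiny tag search, and the product itself are illustrative assumptions rather than a prescribed schema.

# Illustrative catalog entry for a hypothetical "retail-sales" data product.
sales_product_metadata = {
    "name": "retail-sales",
    "owner": "sales-domain-team",
    "description": "Monthly retail sales figures by region.",
    "time_period": {"start": "2020-01-01", "end": "2023-12-31"},
    "geography": ["EMEA", "NA"],
    "tags": ["sales", "retail", "monthly", "finance"],
}

def find_products(catalog, required_tags):
    """Return catalog entries whose tags include every required tag."""
    return [
        entry for entry in catalog
        if set(required_tags).issubset(entry["tags"])
    ]

catalog = [sales_product_metadata]
print(find_products(catalog, ["sales", "monthly"]))  # finds the entry above

Even this toy example shows how consistent metadata turns discovery into a simple, automatable query rather than a manual hunt.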
The fourth principle, Reuse, focuses on the ability to apply data in multiple
contexts. This principle is particularly important in maximizing the value of
data. By designing data products to be modular and reusable, they can be
used across different projects and applications. For instance, a data product
containing customer demographic information can be used by marketing
teams for campaign planning, by sales teams for sales strategy
development, and by product development teams for market analysis.
In conclusion, “good” data products in a Data Mesh are those that are FAIR:
findable, accessible, interoperable, and reusable. These principles ensure
that data is not just stored but is actively managed and used in a way that
adds value to the organization. By adhering to FAIR principles, data
products become more than just repositories of information; they transform
into dynamic assets that drive innovation and decision-making across the
enterprise.
Yet, security alone is not sufficient. The reliability of the data product is
equally important. Users need to trust that the data product will provide
accurate and consistent information at all times. Ensuring reliability
involves implementing validation checks, error detection algorithms, and
maintaining high data availability. This is where the concept of reliability
intersects with security; a secure data product is inherently more reliable as
it protects against data tampering and loss.
Observability extends the concept of reliability. It’s about having the ability
to monitor the health and performance of the data product. By using tools to
track various metrics like response times and error rates, organizations can
proactively manage the data product’s health. This proactive management
plays a crucial role in maintaining the product’s reliability, as it allows for
the early identification and resolution of potential issues before they
escalate.
User experience is another crucial aspect that ties these attributes together.
An enterprise-grade data product should not only be robust and reliable but
also intuitive and user-friendly. This involves considering diverse user
needs and incorporating thoughtful UI/UX design, ensuring that the product
is accessible to a wide range of users with varying technical expertise.
Queries are also an essential artifact in data products. These can include
pre-written SQL queries or other access methods that provide users with
ready-to-use insights. These queries are particularly valuable for users who
may not have deep technical expertise but need to derive meaningful
information from the data product.
The inclusion of these diverse artifacts transforms the data product from a
static repository of information into a dynamic toolkit for analysis and
insight. The value of the data product is thus not just in the raw data it
contains but also in how it enables users to interact with and make use of
that data in varied and meaningful ways.
The process of integrating these artifacts into the data product also demands
thoughtful consideration. It involves ensuring compatibility among different
elements and creating an intuitive user experience. This requires a deep
understanding of both the technical aspects of the artifacts and the user
journey within the data product.
The target state of the data product should be ambitious yet achievable. It
needs to strike a balance between aspirational goals and practical realities.
The target state should challenge the status quo but remain grounded in
what is realistically attainable given the current technological capabilities
and organizational context. So, clearly a valuable data product must also
have a well-defined target state or end goal. This target state should reflect a
clear vision of what the data product aims to achieve or contribute to within
the organization. Establishing this target state ensures that the development
of the data product remains focused and aligned with the intended
objectives.
Clearly linked to the target state is the need for a roadmap - a way to get to
the target state. It’s a strategic plan that details the progression from the
current state of the data product to its desired future state, including the
technologies, resources, and timelines involved. This is clearly a big topic
and much more detail will be available in the “Establishing a practical Data
Mesh roadmap” chapter (Chapter 13). Developing the roadmap for the data
product is a collaborative process. It involves input from various teams,
including data scientists, IT professionals, and business stakeholders. The
roadmap should be flexible enough to accommodate changes and agile
enough to respond to new findings or shifts in business priorities.
Funding ensures that the data product has the necessary resources for
development, deployment, and ongoing maintenance. It’s about having a
financial plan that supports the entire lifecycle of the data product. The
funding mechanism should align with the overall value proposition of the
data product. It’s important that the investment in the data product is seen in
the context of the returns it promises, whether in terms of efficiency gains,
revenue generation, or other strategic benefits.
Let’s dig a bit deeper. The Data Product Owner holds a position of
significant responsibility and authority, overseeing the overall health,
performance, and strategic alignment of the data product with the business’s
needs. Their role is multifaceted, encompassing various aspects of data
product management, from conceptualization to implementation and
ongoing maintenance. So, perhaps it goes without saying, but without an
empowered data product owner, you do not have a “good” data product.
Now, one of the critical powers vested in a Data Product Owner is decision
rights. They have the authority to make key decisions regarding the
development, deployment, and evolution of the data product. This includes
decisions about features, functionalities, and the overall direction of the
product. Their decision-making authority is essential for maintaining the
product’s relevance and effectiveness in a rapidly changing business
environment.
And with these decision rights, an empowered Data Product Owner also has
a high degree of autonomy. This autonomy allows them to operate
independently within the defined boundaries of the data product, making
decisions and implementing strategies that foster innovation and agility. The
autonomy granted to them is not unfettered but is balanced with the need
for alignment with broader organizational goals and strategies.
But let’s make this a bit more concrete. Here is a common scenario that
illustrates the need for clear decision rights - it involves the selection of
technology tools and platforms for the data product. More specifically, it is quite common for an enterprise to have a preferred set of tools and
platforms that it mandates across its operations. However, the Data Product
Owner might identify alternative tools that they believe would be more
effective for their specific data product.
In such cases, if the principles of Data Mesh are adhered to, the decision
rests with the Data Product Owner. They have the authority to choose the
tools and technologies that best suit the needs of their data product. This
autonomy is crucial for ensuring that the data product is built with the most
suitable and effective technologies.
Conclusion
So, at this point we understand what a “good” data product is: it follows
data mesh principles and is aligned to FAIR principles. It is enterprise
grade. It delivers real tangible value. It balances both cost as well as agility
and speed concerns. It is much more than just data. And it has an empowered owner who can deliver on the data product's promise.
So, with all of this going for it, the next obvious question is “how do I build
a ‘good’ data product that has all of these attributes?”. Well, that is a big
question, and the next two chapters will kickstart the process: we will first introduce a scenario that is used throughout the book to show how to
put these principles and characteristics into practice. And then we will do a
deep dive on the architecture components of both a data mesh and its
constituent data products.
Chapter 3. Our Case Study - Climate
Quantum Inc.
With Early Release ebooks, you get books in their earliest form—the
author’s raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.
This will be the 3rd chapter of the final book. Please note that the GitHub
repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the editor at [email protected].
In this chapter we will introduce our case study - Climate Quantum Inc. -
where we will apply data mesh capabilities to an important and pressing
need: Climate Change.
Second, climate data volumes are huge and incredibly diverse: the sheer volume of climate data is staggering. With thousands of data sources, each governed by its own licensing terms, some publicly available and others proprietary, the challenge lies not just in data acquisition but in harmonizing and interpreting it.
And if that wasn’t enough, the axiom upon which foundational decisions
are made - that past behavior is indicative of future behavior - is proving to
be false. Historically, the past has been our compass, guiding predictions
and decisions. However, the accelerating pace of climate change makes the
past an unreliable predictor, thrusting businesses into largely uncharted
waters.
The central issue with traditional, centralized data platforms lies in their
inherent limitations when dealing with the sheer volume and complexity of
climate data. In a standard enterprise setting, these systems often struggle
with managing internal data effectively. The climate data challenge
amplifies this complexity exponentially, rendering centralized systems
inadequate. In such scenarios, these systems often become overwhelmed,
leading to inefficiencies and data silos.
Viewing data as a product under Data Mesh transforms the way data is
managed. It assigns clear ownership and responsibility, paralleling a
product owner’s role in a company. Owners of climate data domains are
thus accountable for the data’s accuracy, timeliness, and relevance. This
shift not only elevates the quality of climate data but also bolsters user trust,
as each dataset is meticulously curated and maintained.
Data Mesh recognizes the need for diverse expertise in managing the broad
spectrum of climate data. By allocating specific datasets to domain experts
or designated entities, the framework ensures that data is handled by those
best equipped to understand and interpret it. This decentralized ownership
model not only enhances data accuracy and reliability but also expedites
decision-making. Since domain owners are closer to the source of the data,
they can implement real-time updates and modifications more effectively.
This approach draws inspiration from the “shift-left” concept in software
development, incorporating agile methodologies into data management,
thereby making the entire system more dynamic and responsive to changes.
One of the most significant hurdles in the realm of climate data is its
fragmentation. Vital datasets are dispersed across myriad sources,
creating a labyrinthine challenge in locating specific data. This
dispersion not only leads to inefficiencies but also creates significant
gaps in data availability. Climate Quantum Inc. utilizes Data Mesh’s
discovery capabilities, including a comprehensive set of APIs and a
meticulously curated catalog of data products. This approach
transforms the search for climate data into a streamlined, efficient
process, ensuring every data product and the information it contains
is readily accessible.
Secondly, the complexity and diversity of climate data sources have their
counterpart in the varied and often inconsistent data within large
enterprises. Climate Quantum’s use of standardized data contracts and
consumption mechanisms to streamline data structures can be replicated in
an enterprise setting. By establishing uniform data formats and access
protocols, businesses can simplify the consumption of data across different
departments, making it more usable and reducing the time and resources
spent on data preparation and interpretation.
Finally, the need for trust and verification in climate data is a universal
concern in any data-driven decision-making process. Climate Quantum’s
approach of employing robust data contracts and a certification program to
ensure data quality and transparency can be applied to enterprise data
management. Establishing similar governance structures and quality
standards in an enterprise will build trust among stakeholders, ensuring that
decisions are based on reliable and verified data. This approach not only
enhances the integrity of the data but also reinforces the organization’s
commitment to data-driven excellence.
The book will also cover the Data Mesh contracts that facilitate interactions
between the various components and Data Products within the Data Mesh.
Climate Quantum data contracts will be used as sample artifacts. These
contracts are pivotal in defining how data is shared, accessed, and utilized
across different domains, ensuring seamless interoperability and
maintaining data integrity.
Finally, we will present a roadmap for building a global climate Data Mesh.
This roadmap will outline the step-by-step process for developing and
implementing a Data Mesh system on a global scale, tailored specifically to
the complex and dynamic nature of climate data. Climate Quantum will be
a vehicle that will show strategies for scaling the system, integrating new
data sources, and evolving and governing the architecture to meet changing
needs and challenges.
Part II. Designing, building, and
deploying Data Mesh
With the basic concepts understood, you can now begin your journey of designing and building Data Mesh components. Most of the work might be done by software engineers under the supervision of data engineers and architects. This section will target both groups of engineers (software and data) but will ensure that terms that may not be familiar to data engineers are explained.
Chapter 5 will dive deeply into data contracts and why they are essential to creating data products and, later, a Data Mesh.
In Chapter 6, you will implement your first data quantum (or data product).
Chapter 7 will fly you through the different planes of Data Mesh.
In Chapter 8, you will mesh your different data products (or data quanta) to create a Data Mesh.
Finally, in Chapter 9, you will read more about how AI and Generative AI fit into the Data Mesh model.
Happy reading!
Chapter 4. Defining the Data Mesh
Architecture
With Early Release ebooks, you get books in their earliest form—the
author’s raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.
This will be the 4th chapter of the final book. Please note that the GitHub
repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the editor at [email protected].
At the heart of this ecosystem are data products, which form the
foundational units of Data Mesh. Each data product in the ecosystem is
designed to be discoverable, observable, and operable, ensuring that data
can be efficiently shared and utilized across different parts of an
organization. Figure 4-1 illustrates the data product architecture which will
be elaborated upon below. There are three key groups of capability within
the data product architecture: Definition, Run-time, and Operations.
Figure 4-1. Data Product Architecture
Definition
Data Product Definition is a crucial process in managing and utilizing data
effectively. It encompasses the detailed characterization of a data product,
including who owns it, what it contains, and the rules governing its use.
This clear and comprehensive definition is essential for ensuring that data
products are not only functional and accessible but also align with
organizational policies and user needs. By thoroughly defining each data
product, organizations can maximize their data’s value and utility, making
this process a foundational aspect of modern data management strategies.
Defining the core high-level information about a data product begins with
identifying the data product owner. This individual plays a pivotal role,
acting as the decision-maker and primary source of knowledge for the data
product. The owner is responsible for overseeing the development,
maintenance, and overall strategy of the data product. They ensure that it
meets the evolving needs of its users and stays aligned with the broader
goals of the organization. The owner’s deep understanding of the data
product’s capabilities and uses makes them an invaluable resource for
answering queries and guiding users in leveraging the data product
effectively.
Next, a clear and concise data product summary, along with relevant tags, is
essential for users to quickly grasp the essence and scope of the data
product, and support effective search capabilities that make it easy to find
the data product within Data Mesh. The summary should provide an
overview of the data product’s purpose, its primary features, and potential
applications, enabling users to ascertain its relevance to their needs at a
glance. Tags act as keywords or labels, helping categorize and organize data
products within the larger ecosystem. They facilitate easier discovery and
retrieval, especially in environments with a multitude of data products.
Well-chosen tags can greatly enhance the user experience by simplifying
navigation and search processes within the data product landscape.
Last but certainly not least, the creation of a data product glossary is a
critical step in fostering a shared understanding among the community that
interacts with the data product. This glossary should include definitions of
terms and concepts specific to the data product, clarifying any technical
jargon or industry-specific language. It serves as a reference tool that
ensures consistency in terminology, helping new users acclimate and
enabling effective communication among experienced users. A well-
constructed glossary not only aids in comprehension but also contributes to
building a cohesive community of users who are well-versed in the
language and nuances of the data product.
Artifacts in a data product can include programs and models integrated with
the data, providing enhanced functionality and analytical capabilities. These
programs and models can be crucial for interpreting the data, drawing
insights, and supporting decision-making processes. They add a dynamic
aspect to the data product, transforming it from a static repository of
information into a versatile tool for analysis and exploration.
Queries that act upon data within the data product are also considered
artifacts. These queries can range from simple retrieval operations to
complex analytical processes, enabling users to interact with and extract
value from the data. The inclusion of queries as artifacts underscores the
interactive nature of modern data products, where engagement with data is
as important as the data itself.
Access rights are the first key consideration in policy definition. They
determine who can access the data product and what level of access they are
granted. This can range from read-only access for some users to full
administrative rights for others. Defining access rights involves assessing
the needs and responsibilities of different users or user groups and assigning
access levels accordingly. Effective management of access rights is crucial
for ensuring that users can perform their roles efficiently while preventing
unauthorized access to sensitive data.
Defining authorized users and/or roles forms the third cornerstone of policy
development. This involves specifying which individuals or roles within the
organization are permitted to access the data product and what actions they
are authorized to perform. This distinction is crucial for maintaining
operational security and for ensuring that each user has access to the
appropriate level of data and functionality. By clearly defining authorized
users and roles, organizations can maintain tight control over their data and
prevent unauthorized use.
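As a simple illustration, the sketch below maps hypothetical roles to permitted operations and checks a request against that mapping; the role names and operations are assumptions, not a required model.

# Illustrative role-to-permission mapping for a data product.
ROLE_PERMISSIONS = {
    "consumer": {"read"},
    "analyst": {"read", "query"},
    "data_product_owner": {"read", "query", "write", "configure"},
}

def is_authorized(role, operation):
    """Return True if the given role may perform the given operation."""
    return operation in ROLE_PERMISSIONS.get(role, set())

assert is_authorized("analyst", "query")
assert not is_authorized("consumer", "write")  # read-only users cannot write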
Finally, the policy definition must encompass any other security or privacy
rules that are necessary to comply with enterprise standards or regulatory
imperatives. This includes adherence to industry-specific regulations, data
protection laws, and internal security policies. Ensuring compliance with
these regulations and standards is imperative for protecting sensitive data
and for maintaining the organization’s reputation and legal standing. This
aspect of policy definition requires staying abreast of evolving regulatory
landscapes and ensuring that data product policies are regularly updated to
reflect these changes.
The data product harness is the framework and code that implements the
various interfaces for a data product. An architecture goal for any enterprise
data mesh is to make these "harnesses" consistent for all data products, thereby making it simpler and more intuitive to interact with any data
product in the enterprise data mesh. Now, indeed, every data product is
different and embodies unique capabilities, but the mechanisms by which
they interact, and the signatures for their interactions can be made largely
consistent. Yes, even for ingestion ("/ingest" interface) and consumption ("/consume" interface): while the parameters required for each interaction may differ, the mechanisms for those interactions can be standardized.
And this is even more practical for other core and foundational interfaces
for discovery (“/discover” interface), observability (“/observe” interface),
and control or management capabilities ("/control" interface). By doing this,
all interactions with data products become consistent and standardized -
which means that templating and “factories” become a practical
consideration that streamlines and speeds the creation and management of
data products.
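A minimal sketch of such a harness, written as a Python abstract base class, might look like the following; the method names mirror the interfaces discussed above, but the signatures are assumptions rather than a prescribed standard.

from abc import ABC, abstractmethod

class DataProductHarness(ABC):
    """Common shape that every data product in the mesh could implement."""

    @abstractmethod
    def ingest(self, payload: dict) -> dict:
        """Accept new data, ideally validating it against the data contract."""

    @abstractmethod
    def consume(self, parameters: dict) -> dict:
        """Return data or artifacts according to the caller's parameters."""

    @abstractmethod
    def discover(self) -> dict:
        """Return name, owner, description, and tags for the mesh catalog."""

    @abstractmethod
    def observe(self) -> dict:
        """Return usage, performance, and health metrics."""

    @abstractmethod
    def control(self, command: str) -> dict:
        """Handle operational commands such as 'start', 'stop', or 'pause'."""

Because every data product implements the same shape, a "factory" can stamp out new products from a template, and any client can interact with any product in the same way.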
Now, since all interactions with a data product occur through its "harness", it
makes sense to have each interaction with the data product also be verified
and validated against its data contracts. So, the data product harness
becomes the integration point for data contracts and associated policy
enforcement mechanisms (which is covered in more detail in Chapter X).
Run-Time
The run-time architecture describes the components in a running data
product. Let’s dive a bit deeper.
Ingestion Interfaces
Ingesting data into a data product is a fundamental process that determines
how data is collected, processed, and made available for use. The method
chosen for data ingestion depends on various factors such as the volume of
data, frequency of updates, and the specific requirements of the data
product. Understanding and selecting the right ingestion method is crucial
for ensuring the efficiency and effectiveness of the data product. From APIs
to bulk ingestion and pipelines, each technique has its own advantages and
ideal use cases.
Queries are a common method for ingesting data, particularly useful when
dealing with real-time or near-real-time data updates. They are ideal for
situations where data needs to be pulled frequently and in small amounts.
Queries allow for specific data to be selected and retrieved based on certain
criteria, making them efficient for targeted data ingestion. This method is
particularly useful for data products that require up-to-date information and
have the capability to handle frequent, incremental updates.
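The sketch below illustrates one way an incremental, query-based ingestion run might look; the table name, column names, and the run_query callable are hypothetical.

def ingest_incremental(run_query, last_watermark):
    """Pull only rows changed since the previous ingestion run.

    run_query is any callable that executes SQL against the source system
    and returns rows as dictionaries.
    """
    sql = (
        "SELECT order_id, amount, updated_at "
        "FROM source_orders "
        "WHERE updated_at > :watermark"
    )
    rows = run_query(sql, {"watermark": last_watermark})
    # Advance the watermark so the next run picks up where this one stopped.
    new_watermark = max((r["updated_at"] for r in rows), default=last_watermark)
    return rows, new_watermark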
Bulk ingestion methods, such as file transfers, are suited for scenarios
where large volumes of data need to be imported into the data product. This
method is typically used for initial data loads or periodic updates where a
significant amount of data is transferred at once. Bulk ingestion is efficient
in terms of resource usage and time, especially when dealing with large
datasets that do not require frequent updates. It is often used in conjunction
with data warehousing or when integrating historical data into a data
product.
Data pipelines, using tools like Airflow or DBT (Data Build Tool), are
designed for more complex data ingestion scenarios. These tools allow for
the automation of data workflows, enabling the ingestion, transformation,
and loading of data in a more controlled and systematic manner. Pipelines
are particularly useful when the data ingestion process involves multiple
steps, such as data cleansing, transformation, or integration from multiple
sources. They provide a robust framework for managing complex data
flows, ensuring consistency and reliability in the data ingestion process.
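As an illustration only, here is what a small ingestion pipeline might look like in Airflow 2.x; the DAG name, schedule, and task bodies are assumptions for a hypothetical climate-readings data product.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # pull raw readings from the source system

def transform():
    pass  # cleanse and standardize the readings

def load():
    pass  # write the results into the data product's store

with DAG(
    dag_id="climate_readings_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task  # run the steps in order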
Other ingestion methods include streaming data ingestion, used for real-
time data processing, and web scraping, for extracting data from web
sources. Streaming data ingestion is ideal for scenarios where data is
continuously generated and needs to be processed in real-time, such as
sensor data or user interactions. Web scraping, on the other hand, is useful
for extracting data from websites or web applications, especially when other
methods of data integration are not available.
Consumption Interfaces
Obviously, once the data has been ingested into the data product, it needs to
be consumed. The consumption of a data product involves how end-users
access and utilize the data and artifacts provided by the data product owner.
It’s a process distinct from data ingestion, focusing on the output and user
interaction rather than the input of data into the system. While ingestion is
about how data gets into the data product, consumption is about how that
data, along with other artifacts created by the data product owner, is made
available and useful to users. The methods of consumption are diverse, each
suited to different needs and scenarios, ranging from queries and APIs to
bulk transfers and pipelines.
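For example, a consumer calling a hypothetical /consume endpoint exposed by a data product's harness might look like the following Python sketch; the URL and parameters are made up for illustration.

import requests

# Hypothetical consumption endpoint exposed by a data product.
CONSUME_URL = "https://datamesh.example.com/products/retail-sales/consume"

response = requests.post(
    CONSUME_URL,
    json={"region": "EMEA", "period": "2023-Q4"},  # illustrative parameters
    timeout=30,
)
response.raise_for_status()
sales_rows = response.json()  # rows (or artifacts) returned by the data product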
Bulk consumption methods, such as file transfers, are used when large
volumes of data need to be accessed, often for offline processing or
analysis. This method is typical in scenarios where the entire dataset or
large parts of it are required, such as for data warehousing or big data
analytics. Bulk consumption is less about real-time interaction and more
about comprehensive access, making it suitable for use cases where
extensive data processing is necessary.
Data pipelines, utilizing tools like Airflow or DBT, are used for more
complex consumption scenarios where data needs to be processed,
transformed, or integrated into other systems. These pipelines allow for
automated workflows that can handle large volumes of data efficiently.
They are particularly useful in scenarios where the data needs to be
enriched, aggregated, or transformed before being consumed, ensuring that
users have access to high-quality and relevant data.
Beyond raw data, a data product may include a variety of other artifacts for
consumption. This can range from pre-defined and vetted queries that
facilitate easy access to common data sets, to more complex offerings like
Jupyter Notebooks, programs, AI/ML models, and documents. These
artifacts add significant value to the data product, allowing users not just to
access data, but to interact with it in more meaningful and sophisticated
ways. For example, Jupyter Notebooks can provide an interactive
environment for data exploration and analysis, while AI/ML models can
offer advanced insights and predictions based on the data.
Policy Enforcement
Enforcing policies in the context of a data product is a critical aspect of
ensuring that the data is used appropriately and securely, and, ultimately, is
trusted. Policy enforcement goes beyond merely defining what the policies
are; it involves implementing mechanisms and controls that actively ensure
compliance with these policies during runtime. This is where the theoretical
framework of a policy meets the practical aspects of its application.
Effective policy enforcement is essential for maintaining data integrity,
security, and privacy, and for ensuring that the data product aligns with both
organizational standards and regulatory requirements.
Enforcing data quality and related data contracts is vital for ensuring that
the data within the product remains accurate, consistent, and reliable. This
involves implementing checks and validations both at the point of data
ingestion and during data usage. Data quality rules might include
constraints on data formats, ranges, or the presence of mandatory fields,
while data contracts could enforce rules about data relationships and
integrity. Automation plays a key role here, with systems set up to
continuously monitor and validate data against these predefined standards
and rules.
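A minimal sketch of such checks, assuming a hypothetical order record with mandatory fields, a numeric range, and an allowed set of currency codes, might look like this:

def validate_row(row):
    """Apply illustrative data quality rules and return a list of violations."""
    errors = []
    for field in ("order_id", "amount", "currency"):  # mandatory fields
        if row.get(field) in (None, ""):
            errors.append(f"missing field: {field}")
    amount = row.get("amount")
    if amount is not None and (not isinstance(amount, (int, float)) or amount < 0):
        errors.append("amount must be a non-negative number")
    if row.get("currency") not in ("USD", "EUR", "GBP"):
        errors.append("unsupported currency code")
    return errors

def ingest_batch(rows):
    """Reject the whole batch if any row violates the rules (one possible policy)."""
    problems = {}
    for index, row in enumerate(rows):
        violations = validate_row(row)
        if violations:
            problems[index] = violations
    if problems:
        raise ValueError(f"contract violations: {problems}")
    return rows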
Security and privacy enforcement is critical, particularly in the context of
increasing data breaches and stringent regulatory requirements. This
involves a multi-layered approach, including encryption, regular security
audits, and adherence to compliance standards such as GDPR or HIPAA.
Encryption ensures that data is protected both in transit and at rest, while
security audits help in identifying and mitigating vulnerabilities.
Compliance with legal and regulatory standards requires a deep
understanding of these regulations and the implementation of processes and
controls that align with them.
Operations Experience
In the intricate world of data management, particularly within a data mesh
framework, the operational considerations for data products encompass a
spectrum of interfaces that ensure these products are not only functional but
also align with overarching organizational standards and user expectations.
These considerations include discoverability, observability, governance, and
control interfaces, each playing a pivotal role in the lifecycle and utility of a
data product. Discoverability interfaces enable data products to be
registered and located within the data mesh, ensuring that they are visible
and accessible to users. Observability interfaces offer insights into the
performance and usage of data products, allowing for effective monitoring
and management. Governance interfaces are critical for maintaining data
integrity and compliance, whereas control interfaces provide essential tools
for managing the operational state of data products.
Discoverability Interfaces
Discoverability interfaces in data products play a crucial role in the data
mesh ecosystem, acting as the beacon that signals the presence of a data
product to the rest of the system. These interfaces allow data products to
register themselves, a process akin to placing a pin on a digital map,
marking their location and existence in the Data Mesh landscape. This
registration is vital, as it ensures that the data product is not just a
standalone entity but a recognized part of the larger network. The
information provided during registration typically includes basic details like
the data product’s name, a brief description, and the identity of its owner,
providing a quick but essential overview for potential users.
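A registration call might be as simple as the following sketch, which posts basic details to a hypothetical mesh registry endpoint; the URL and fields are assumptions.

import requests

# Hypothetical registry endpoint for the mesh catalog.
REGISTRY_URL = "https://datamesh.example.com/registry/products"

registration = {
    "name": "climate-observations",
    "description": "Curated surface temperature readings.",
    "owner": "climate-data-team",
    "tags": ["climate", "temperature", "observations"],
    "discover_endpoint": "/products/climate-observations/discover",
}

# Place the "pin on the map": tell the registry this data product exists.
requests.post(REGISTRY_URL, json=registration, timeout=30).raise_for_status()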
Observability Interfaces
Observability interfaces within data products play a crucial role in ensuring
that these resources are not just operational, but also efficient, reliable, and
transparent in their functioning. These interfaces provide a comprehensive
view into the various operational aspects of a data product, including usage
statistics, performance metrics, and overall operating health. By offering
real-time insights into how a data product is performing, observability
interfaces allow data product owners and users to monitor and evaluate the
product’s effectiveness continuously. This visibility is key to maintaining
high performance, as it enables quick identification and resolution of any
issues that may arise.
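The sketch below shows the kind of snapshot a hypothetical /observe interface might return, together with a simple health check against illustrative thresholds.

# Illustrative payload a data product's /observe interface might return.
observe_snapshot = {
    "product": "climate-observations",
    "window": "last_24h",
    "requests": 1842,
    "error_rate": 0.004,            # fraction of failed requests
    "p95_response_ms": 230,         # 95th-percentile response time
    "last_successful_ingestion": "2024-05-01T06:00:00Z",
    "status": "healthy",
}

def needs_attention(snapshot, max_error_rate=0.01, max_p95_ms=500):
    """Flag the product for review when metrics cross the given thresholds."""
    return (
        snapshot["error_rate"] > max_error_rate
        or snapshot["p95_response_ms"] > max_p95_ms
    )

print(needs_attention(observe_snapshot))  # False for the healthy snapshot above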
Control Interfaces
Control interfaces in the context of data products are essential tools that
provide data product owners with the ability to manage the operational state
of their offerings. These interfaces encompass a range of functionalities,
including the capability to start or stop the data product, pause or resume its
operations, and configure various settings and parameters. This level of
control is crucial for ensuring that data products can be managed effectively
and can respond flexibly to different operational requirements or situations.
In essence, control interfaces act as the command center for data product
owners, offering them direct oversight and management capabilities over
their products.
The importance of these control interfaces becomes particularly evident in
dynamic operational environments. The ability to start or stop a data
product enables owners to manage resource allocation efficiently, ensuring
that the data product is active only when needed, thereby optimizing
resource utilization and reducing unnecessary costs. Similarly, the pause
and resume functionalities are vital in scenarios where temporary
interruptions are required, such as during maintenance, updates, or in
response to unforeseen issues. These capabilities ensure minimal disruption
to the end-users while allowing for necessary adjustments or repairs to be
made.
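One way to reason about these commands is as a small state machine, sketched below; the states and allowed transitions are illustrative, not a mandated lifecycle.

# Which control commands are valid from each operational state.
VALID_TRANSITIONS = {
    "stopped": {"start"},
    "running": {"stop", "pause"},
    "paused": {"resume", "stop"},
}

# The state a data product ends up in after each command.
RESULTING_STATE = {
    "start": "running",
    "stop": "stopped",
    "pause": "paused",
    "resume": "running",
}

def apply_command(current_state, command):
    """Apply a control command, rejecting transitions that make no sense."""
    if command not in VALID_TRANSITIONS.get(current_state, set()):
        raise ValueError(f"cannot '{command}' while {current_state}")
    return RESULTING_STATE[command]

state = "stopped"
state = apply_command(state, "start")   # -> "running"
state = apply_command(state, "pause")   # -> "paused"
state = apply_command(state, "resume")  # -> "running"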
Artifacts, ranging from basic data sets to complex programs and models,
add significant value to the data product, making it more than just a data
store. They enable a more integrated and user-friendly approach to data
management, where the focus is not just on providing data but on delivering
a complete, valuable, and ready-to-use data solution. This holistic approach
is what sets modern data products apart, making them indispensable tools in
the data-driven world.
These artifacts represent a broad and inclusive array of elements that extend
well beyond traditional data sets. They embody any set of data or related
objects that a data product owner deems valuable for users and wishes to
make available for consumption. This expansive view of what constitutes
an artifact reflects a modern approach to data management, where the value
of a data product is significantly enhanced by the versatility and
comprehensiveness of its components.
At the most basic level, data product artifacts include conventional data
elements like databases, tables, and files. These fundamental components
form the backbone of any data product, providing the core data that users
seek for various applications. In a traditional sense, these data sets are what
one might typically expect to find within a data product. However, the
scope of artifacts in the Data Mesh concept goes far beyond these basic
elements, embracing a more holistic and integrated approach to data
management.
Beyond these standard data components, artifacts in a data product can also
include programs that are directly related to or integrated with the data.
These programs can range from simple scripts that facilitate data processing
to complex software applications that provide advanced analytics
capabilities. The integration of such programs with the data enhances the
overall value of the data product, enabling users to not only access data but
also to process and analyze it within the same environment. This integration
represents a shift from static data repositories to dynamic, interactive
platforms that offer greater utility and efficiency.
Queries are yet another type of artifact that can be included in a data
product. Pre-defined queries or query templates can significantly ease the
process of data extraction and analysis for users. These queries can be
tailored to common use cases or specific data analysis tasks, enabling users
to quickly and efficiently access the data they need. The inclusion of such
queries not only makes the data product more user-friendly but also reduces
the time and effort required to derive value from the data.
Data product artifacts can also encompass more comprehensive bundles that
integrate data with models, programs, or queries. These bundles provide a
packaged solution to users, combining all the necessary components for
specific data applications or analyses. This approach simplifies the user
experience by providing a ready-to-use set of tools and data, thereby
enhancing the efficiency and effectiveness of the data product.
Data Contracts
Data contracts are a fundamental concept in data management, particularly
in environments where data sharing and interoperability are key, such as in
a Data Mesh architecture. They serve as formal agreements that define the
specifics of how data is structured, accessed, and used within and between
different data products and systems. These contracts are crucial for ensuring
consistency, reliability, and security in data exchanges. Let’s break down
the key aspects of data contracts (a minimal sketch follows this list):
Data contracts specify the format and structure of the data. This
includes the data types, schemas, and layout that the data adheres to.
By defining these parameters, data contracts ensure that when data is
shared or transferred between systems, it is understood and
interpreted correctly by all parties involved.
Data contracts outline how data can be accessed and used. This
might include specifying API endpoints for data access, defining
what operations (like read, write, update, delete) are allowed, and
setting out any other usage restrictions. This helps in managing data
access rights and maintaining data integrity across different users and
applications.
They often include provisions for data quality, ensuring that the data
meets certain standards of accuracy, completeness, and consistency.
This is crucial for maintaining the trustworthiness of data, especially
when it is used for decision-making or analytical purposes.
They typically address security and privacy as well, specifying who may access the data and how sensitive or personal information must be handled and protected.
They may also include details on version control and how updates to
the data or the contract itself are managed. This ensures that all
parties are working with the most current and accurate data and that
changes are tracked and communicated effectively.
Finally, data contracts can include protocols for error handling and
conflict resolution, outlining the steps to be taken in case of
discrepancies or data issues.
So, why do we include data contracts in this architecture chapter? Yes, there
will be dedicated chapters that address data contracts, but the simple reason
is that they are a foundational concept and, like any foundational concept,
applying them after the fact is expensive and impractical. Hence, we are
introducing the concept here and ensuring that data contracts are addressed
explicitly in our data product architecture.
The process used by the ANSI for governing product quality, safety, and
reliability is a blend of centralized and federated approaches. The process
begins with the identification of a need for standardization. Once a need is
identified, ANSI facilitates the creation of a consensus body which is
responsible for the drafting of the standard.
If the product meets the standards, the certification body issues a certificate
and allows the manufacturer to label the product as compliant. This
certification is a mark of quality and reliability that can be used in
marketing and product documentation. Additionally, to maintain
certification, manufacturers are typically required to undergo regular
follow-up assessments and audits. This ensures that their products continue
to meet the necessary standards over time.
Now, how does this apply to data governance, and what role do data
products and their owners play?
Once a data product owner has verified that their product meets the required
standards, they can then “certify” their data product, much like a product
earning the right to display the ANSI mark. This certification is a public
declaration that the data product has been vetted and meets the prescribed
criteria for quality, safety, and reliability. This process encourages a culture
of accountability and transparency among data product owners, as their
products’ compliance is openly demonstrated and can be scrutinized by
users and stakeholders.
The governance interfaces, or APIs, in this system play a crucial role, much
like the ASA’s oversight in the product certification process. These
interfaces allow anyone within the organization to query a data product’s
governance status at any time. This feature ensures that the governance
information is not just a static badge but a dynamic and continually up-to-
date reflection of the data product’s compliance with the established
standards.
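As an illustrative sketch only, a query against such a governance interface might return something like the following; the endpoint behavior and fields shown here are hypothetical assumptions, not part of any defined standard.

# Hypothetical governance-status response for a data product (illustration only)
dataProduct: sea-level-observations
certification:
  status: certified
  certifiedBy: data-governance-team
  lastReviewed: 2024-03-15
policies:
  - name: data-quality-baseline
    compliant: true
  - name: access-control
    compliant: true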
For producers who provide data to these products, the marketplace offers
valuable tools to manage and monitor their offerings. This includes
dashboards that display commonly consumed or viewed data products and
their usage patterns. These insights are crucial for producers to understand
the impact and reach of their data products, allowing them to make
informed decisions on how to evolve and improve their offerings.
The marketplace also evolves with user feedback and changing data
landscapes. It continuously adapts to meet the emerging needs of its users,
whether it’s through updating its interface for better user experience or
integrating new functionalities to handle different types of data products
and artifacts.
The Data Product Registry within a data mesh ecosystem functions much
like the Domain Name System (DNS) does for the internet. It’s a
streamlined and efficient directory that provides basic yet essential
information about the various data products available within the data mesh.
This Registry is designed to facilitate easy discovery of data products,
acting as the foundational layer that connects users to the data they seek. It
maintains only the most crucial information – data product summaries and
tags – to assist users in finding the right data products through simple
keyword and natural language searches. This design mirrors the DNS’s role
in the internet, where it serves as a phone book, directing users to the right
web addresses with minimal yet crucial information.
The process of populating the Data Product Registry is intentionally made
straightforward and user-friendly. Data product owners are responsible for
registering their products, a task that requires them to provide a concise
summary, limited to a couple of paragraphs, and a set of relevant tags. The
emphasis here is on simplicity and ease of use. The design of the Registry is
such that even junior staff members can complete the registration process
without difficulty. This approach ensures that the barrier to entry for
registering data products is low, encouraging comprehensive participation
from all parts of the organization.
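As a sketch of what such a registration could contain, assuming a simple summary-and-tags schema (the field names here are illustrative, not prescriptive), a data product owner might submit something like:

# Hypothetical Data Product Registry entry (summary and tags only)
dataProduct: sea-level-observations
owner: climate-coastal-team
summary: >
  Aggregated sea level measurements from tide gauges and satellite
  altimetry, used to analyze long-term trends for coastal areas.
tags:
  - sea-level
  - coastal
  - climate-change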
Once a user selects a data product from the marketplace, they are then
directed to more detailed interfaces of the data product, such as ‘discovery’,
‘observability’, or ‘governance’. These interfaces provide deeper insights
into the data product, allowing users to understand its structure, usage
patterns, compliance status, and more. This two-tiered approach – starting
with the simplicity of the Registry and moving to the more detailed
interfaces – ensures a balanced user experience, combining ease of
discovery with depth of information.
The Data Mesh Console, a critical component of the Data Mesh ecosystem,
serves as a command line interface (CLI) that provides comprehensive
management capabilities across the entire data mesh. This console is
designed for users who prefer or require a more direct, script-based
interaction with the data mesh, as opposed to the graphical user interface of
the marketplace. It offers similar functionalities to the marketplace but in a
format that is more aligned with traditional command line operations. For
instance, through the CLI, users can interact with the Data Mesh Registry,
accessing and managing data product information efficiently. This
command line approach is particularly appealing to those who seek a faster,
more streamlined way to navigate and manipulate the data mesh, especially
for tasks that are repetitive or require automation.
At the heart of the Data Mesh fabric are the infrastructure services, which
include compute, network, and storage capabilities. These services form the
core platform that supports almost all components of the data mesh. They
provide the essential resources required for data processing and storage,
ensuring that data products have the necessary computational power and
space to operate effectively. The robustness and scalability of these
infrastructure services are crucial for the smooth functioning of the data
mesh, especially as the number and complexity of data products grow.
Data access services form another layer of the Data Mesh fabric. This
includes tools and technologies for federated query, data pipelines, and bulk
data transfer. These services allow for efficient and flexible access to data
across the mesh, regardless of where it resides. Federated query capabilities,
for instance, enable users to retrieve and combine data from multiple
sources without moving the data, while pipelines and bulk transfer
mechanisms provide efficient ways to move large volumes of data when
necessary.
Data platforms are another vital component of the Data Mesh fabric. This
includes a range of storage and processing solutions like databases, data
marts, data warehouses, data lakes, and data lakehouses. Each of these
platforms serves a specific purpose, from structured data storage in
databases and data marts to large-scale data storage and processing in data
lakes and lakehouses. The availability of diverse data platforms within the
fabric allows for greater flexibility and choice in how data is stored,
processed, and accessed, catering to the diverse needs of different data
products.
Collaboration services within the Data Mesh fabric play a key role in
fostering a community-centric environment. Knowledge management tools,
forums, threads, and social features like likes and comments enable
developers, data scientists, and other stakeholders to share insights, ask
questions, and collaborate on data projects. These collaborative tools not
only enhance the user experience but also drive innovation and knowledge
sharing across the data mesh.
Data Product Actors
In a Data Mesh ecosystem, various actors play pivotal roles in ensuring the
smooth functioning and effectiveness of the system. These roles can be
broadly categorized into five key groups: data product owners, data
producers, data consumers, data mesh admins, and data governance
professionals. Each group has its own unique set of responsibilities and
contributions, making them essential components of the Data Mesh
architecture. Understanding the nuances of these roles is crucial for the
successful implementation and operation of a Data Mesh, as it ensures that
all aspects of data handling, from production to consumption, are
effectively managed.
Data Product Owners are at the forefront of the Data Mesh, overseeing
the lifecycle of data products. They are responsible for defining the
vision, strategy, and functionality of their data products. This group
ensures that the data products align with business objectives and user
needs, managing everything from data sourcing to the presentation of
data. Data Product Owners play a crucial role in bridging the gap
between technical capabilities and business requirements, making sure
that the data products deliver real value to the organization.
Data Producers include operational source systems, other analytics
data sources, and even other data products within the Data Mesh. This
group is responsible for providing the raw data that is ingested by
various data products. Their role is crucial in ensuring that high-
quality, relevant, and timely data is available for further processing and
analysis. Data Producers are the foundation of the Data Mesh, as the
quality and reliability of their output directly impact the effectiveness
of the data products.
Data Consumers encompass business users, data scientists, and
analysts who utilize the data and artifacts contained within data
products. They rely on data products to gain insights, make informed
decisions, and drive business strategies. This group is the end-user of
the Data Mesh, and their feedback often shapes the evolution of data
products. Their interaction with the data products is a key indicator of
the effectiveness and relevance of the Data Mesh in addressing
business needs.
Data Mesh Administrators are responsible for the overall management
and control of the Data Mesh infrastructure. They handle the technical
aspects, including system performance, resource allocation, and
ensuring the smooth operation of the Data Mesh. This group plays a
critical role in maintaining the health of the Data Mesh, ensuring that it
is scalable, secure, and efficient.
Data Governance Professionals facilitate policy management and the
data product “certification” process within the Data Mesh. They ensure
that data products adhere to defined standards, compliance
requirements, and quality benchmarks. This group is pivotal in
instilling a culture of accountability and trust in the Data Mesh,
ensuring that data is handled responsibly and ethically across the
organization.
Now, let’s see how we can put the pieces together for our Climate Quantum,
our use case. As you recall, Climate Quantum’s mission is to make climate
data easy to find, consume, share, and trust using data products and a
broader data mesh ecosystem.
Figure 4-4. Climate Quantum Data Products
The sea level data product complements the temperature and precipitation
products by focusing on changes in sea levels – a critical aspect of climate
change studies. By aggregating data from various sources, this product
provides invaluable insights into sea level trends, which are essential for
understanding the long-term impacts of climate change on coastal areas.
The owner of this data product is responsible for its accuracy, completeness,
and ongoing relevance, making it a vital component of the Climate
Quantum framework.
At the apex of Climate Quantum’s data offerings is the primary public data
product known as the Physical Risk model. This model integrates and
synthesizes data from the temperature, precipitation, and sea level products
to create a detailed analysis of the physical risks posed by climate change.
The consumers of these data products are as diverse as the data itself.
Climate scientists delve into the data to predict and anticipate climate
change trends, utilizing the detailed analyses to inform their research and
studies. Business users leverage this data to communicate the impacts of
climate change, translating complex climate metrics into actionable insights
for organizational strategies. Financial analysts, particularly interested in
the physical risk data product, use these insights to understand how climate
change might affect the assets they manage, integrating this information
into their financial models and risk assessments.
While the marketplace is a primary interface for users, the Data Mesh
Console and its command-line interface provide another layer of
interaction. This tool is especially useful for data mesh administrators and
data product owners, who often use it to execute specific commands and
manage various aspects of the data products. The console’s advanced
capabilities make it a versatile tool for more technical users, enabling them
to manage and interact with the data mesh in a more direct and granular
way.
These platforms and tools are supported by the enterprise support team, a
group dedicated to maintaining the underlying infrastructure and common
platforms that support the data mesh. Their role is crucial in ensuring that
the data mesh operates smoothly, efficiently, and without interruption.
Chapter 5. Driving Data Products with
Data Contracts
With Early Release ebooks, you get books in their earliest form—the
author’s raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.
This will be the 5th chapter of the final book. Please note that the GitHub
repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the editor at [email protected].
In this chapter, we will start by looking at what Data Mesh is from an
implementation perspective, answering the question: what are its main
components? I will then draw the parallel with product thinking, explore
what a data product is, and finally jump into data contracts. The examples
in this chapter follow the theme of Climate Quantum Inc.
Bringing Value Through Trust
Do not worry, I am not switching from an engineering-oriented book to a
business book, but I am convinced that, collectively, we need to keep clear
goals in mind with Data Mesh, and one of them is trust.
When I am thinking about a good product, I think a lot about the trust I
have in this product. When I have the time, I love to take photos and use my
rather nice Nikon mirrorless camera. As I hike and get my camera out, I
expect it to be available as soon as I turn it on, not having some weird boot
time. When I press the actuator, I expect the photo to be available instantly
on my CFexpress card and within ten seconds on my iPhone.
The quality here is that the image is saved on the card, matching my
expectations in resolution, compression or not, color balance, and a few
more attributes. The time needed to transfer my photo from the camera to
my phone is a service-level objective.
In many conversations with fellow engineers and scientists, I often tell them
that I am not smart enough to know what to do with data, but I know how to
bring them the data they need (and sometimes want). And my teams deliver
either the data or the tools to access & process the data.
As a result, my customers get access to data they can trust and build upon.
Trust is the value you should aim to deliver.
Figure 5-1. Assuming your goal is to provide trusted data to your consumer, the data contracts are
going to be the vehicle of this trust.
Now that I have established the need for trust as the main value, let’s dive
into product thinking before we can articulate this trust and focus on
implementing it.
Navigating the data contract
As you saw in the first part of this chapter, the need for a data contract is
pretty obvious: you need an artifact to support the trust you want to build in
your data and, as you read in Chapter 2, to capture many of the characteristics
and features of the data products. Let's first go through the theory to
understand the information you want in the contract, and then dive into a
couple of examples.
Let’s start with the boring part: the theory. A contract establishes a formal
relationship between two or more parties. We have all signed implicitly or
explicitly contracts: you’re working for someone (or you have customers),
you most likely have a cell phone, and you probably don’t live under a
bridge. As you most likely imagined, a data contract is a similar agreement
that now involves data (duh).
In some cases, problems can be a total deal breaker: too many duplicated
records, too many NULLs in a required field, or heavier nitrogen-oxide
(NOx) emissions released into the atmosphere. The constraints can come
from the producer/seller or the regulator; this opens the door to federated
computational governance, which I won't cover in this chapter, but you can
check TBD.
Stacking up good information
There are many flavors of data contracts; in this book, I will actively follow
the Open Data Contract Standard (ODCS), which is part of the Bitol project
supported by the Linux Foundation AI & Data (see https://bitol.io).
So, what do you put in a data contract? I'd say virtually anything. In the
open source data contract standard (see https://aidaug.org/odcs), the
community has divided the information into the following eight categories;
a skeleton sketch follows the list:
Dataset and schema
This section describes the dataset and the schema of the data
contract. It is the support for data quality, which I detail in the next
section. A data contract focuses on a single dataset with several
tables (and, obviously, columns).
Data quality
This category describes data quality rules & parameters. They are
tightly linked to the schema defined in the dataset & schema section.
Pricing
This section covers pricing when you bill your customer for using
this data product. Pricing is currently experimental.
Stakeholders
This important part lists the stakeholders and the history of their
relationship with this data contract.
Roles
This section lists the roles that a consumer may need to access the
dataset depending on the type of access they require.
Service-level agreement
This section describes the service-level commitments attached to the
data, such as availability, retention, or end of support; I detail these
indicators later in this chapter.
Custom properties
This last section holds any additional properties not covered by the
other categories, leaving room for extensibility.
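Here is the skeleton sketch announced above. It is a simplified illustration in the spirit of ODCS rather than a literal copy of the standard, and some key names may differ from the current specification.

# Simplified sketch of the main data contract sections (not the literal ODCS schema)
datasetName: air_quality
version: 1.0.0
dataset:                       # dataset & schema
  - table: Air_Quality
    columns:
      - column: UniqueID
        quality: []            # data quality rules (examples later in this chapter)
price:                         # pricing (experimental)
  priceAmount: 0
stakeholders:                  # stakeholders and their history
  - username: cjane
    role: dpo
roles:                         # roles needed to access the dataset
  - role: air-quality-reader
slaProperties:                 # service-level agreement
  - property: retention
    value: 100
    unit: y
customProperties: []           # custom properties for extensibility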
Let’s see how data contracts relate to data products and Data Mesh. It will
help you understand how you can optimally use them.
Like my car example, data can be sold (or consumed) as a product. Like my
Beetle, it can have functional requirements (four wheels, a steering wheel,
and a few other details), non-functional requirements (the quantity of NOx
released in the atmosphere while driving), and SLA (delivery date,
replacement car when stranded).
For a data product, it’s the same thing. I can add or remove some
information and deal with the changes through versions. I can also have
different variations on the product I share, like raw, curated, aggregated,
and more.
To summarize, a data product can have multiple data contracts, exactly one
data contract per pair of datasets/versions. It gives us the required
flexibility.
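As a hypothetical sketch (this descriptor format is an assumption for illustration, not part of ODCS), a single data product could reference one contract per dataset/version pair like this:

# Hypothetical data product descriptor referencing several data contracts
dataProduct: payments-metrics
dataContracts:
  - dataset: payment_metrics_raw
    version: 1.2.0
  - dataset: payment_metrics_raw
    version: 2.0.0             # breaking change, published alongside 1.2.0
  - dataset: payment_metrics_aggregated
    version: 1.0.0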
Data Mesh tracks data product owners (DPO) as stakeholders in any data
product. When they move to a new role, we add the new DPO to the
contract, keeping a human lineage. This results in a patch version: my data
contract can move from 1.0.1 to 1.0.2.
Over time, the data producer realized that the prospect and customer tables
should be combined. It is a major change, breaking much of the consumer
code. This evolution is a reason to bump the updated contract to version
2.0.0.
A major change is a change that does not provide backward compatibility.
Such a change will cause existing downstream apps/solutions/queries to
break.
Table 5-1 lists changes considered patch, minor, and major, with
examples. These lists are not exhaustive.
SEVERITY OF CHANGES
Changes are not equal in intensity. Table 5-1 describes the severity of the
change, whether it is a patch, a minor, or a major change for tables, APIs,
and data contracts.
Table 5-1. Severity of changes in your data product
It may not sound easy, but it will allow data producers and consumers to
evolve at their own pace, increasing trust between them and the robustness
of our systems.
To ease the management of those contracts, Data Mesh needs tooling. For
example, the Rosewall team developed a comparison service to analyze two
contracts and suggest a version number.
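As a hedged sketch of what such a comparison could look at (the input/output shape here is hypothetical, not the Rosewall team's actual service), consider two successive schema definitions and the suggested bump:

# Hypothetical input/output of a contract comparison service (illustration only)
previousContract:
  version: 1.2.0
  dataset:
    - table: customer
    - table: prospect
candidateContract:
  dataset:
    - table: customer          # prospect merged into customer
suggestion:
  nextVersion: 2.0.0           # breaks existing consumers, so a major bump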
A philosophy change may also be needed, as a strict contract does not mean
that its consumption should be as uncompromising. After all, Jon Postel,
one of the creators of the TCP protocol, said, “Be conservative in what you
do, be liberal in what you accept from others.” It became the robustness
principle in computer science and is often called Postel’s law.
As this chapter focuses on data products and data contracts, I ignore the role
of Data Mesh here. Still, you can easily picture that Data Mesh is a way to
organize data products, as illustrated in Figure 5-2.
Figure 5-2. The hierarchy between the different artifacts.
Let’s dive into a complex business problem that the data contract solves.
There are a few examples in the GitHub repository at
https://github.com/AIDAUserGroup/data-contract-template, and the team
will continue adding more examples.
Usually, engineers do not mind documenting. They often under- or over-
document things, but in the 21st century, documentation usually exists.
The problem is documenting change. The data contract provides a solution
for documenting at the right level and keeping that documentation
synchronized with change.
Here, the data contract can help: for example, you can leverage the description
and other informational fields. Note: this data contract format focuses for now
on tables and structured data; extensions are in progress.
- table: tbl
  description: Provides core payment metrics
  dataGranularity: Aggregation on column txn_ref_dt
  columns:
    - column: txn_ref_dt
      businessName: Transaction reference date
      logicalType: date
      physicalType: date
      description: Reference date for the transaction
      sampleValues:
        - 2022-10-03
        - 2025-01-28
But this is an easy example; sometimes, finding the right field is not as easy
as it seems!
- table: tbl
  columns:
    - column: rcvr_cntry_code
      businessName: Receiver country code
      logicalType: string
      physicalType: varchar(2)
      authoritativeDefinitions:
        - url: https://collibra.com/asset/748f-71
          type: Business definition
        - url: https://github.com/myorg/myrepo
          type: Reference implementation
Let’s imagine this scenario: Clint Eastwood joined Climate Quantum, Inc. a
few years ago as a data product owner (DPO).
stakeholders:
  - username: ceastwood
    role: dpo
    dateIn: 2014-08-02
stakeholders:
  - username: ceastwood
    role: dpo
    dateIn: 2014-08-02
    dateOut: 2014-10-01
    replacedByUsername: jwayne
  - username: jwayne
    role: dpo
    dateIn: 2014-10-01
John Wayne’s style was a better fit, and he got promoted. He hired Calamity
Jane to replace him as a DPO.
stakeholders:
  - username: ceastwood
    role: dpo
    dateIn: 2014-08-02
    dateOut: 2014-10-01
    replacedByUsername: jwayne
  - username: jwayne
    role: dpo
    dateIn: 2014-10-01
    dateOut: 2019-03-14
    replacedByUsername: cjane
  - username: cjane
    role: dpo
    dateIn: 2019-03-14
Great things happened, and Calamity eventually took her sabbatical. Although
she was still going to be the DPO when she returned, she asked a kid, Billy, to
cover for her.
stakeholders:
  - username: ceastwood
    role: dpo
    dateIn: 2014-08-02
    dateOut: 2014-10-01
    replacedByUsername: jwayne
  - username: jwayne
    role: dpo
    dateIn: 2014-10-01
    dateOut: 2019-03-14
    replacedByUsername: cjane
  - username: cjane
    role: dpo
    dateIn: 2019-03-14
    comment: Minor interruption due to sabbatical
    dateOut: 2021-04-01
    replacedByUsername: bkid
  - username: bkid
    role: dpo
    dateIn: 2021-04-01
And that’s when Billy got into trouble and needed an emergency leave. As I
jinxed Murphy’s law too many times, this is when I needed critical
information from the DPO. Although Billy and Calamity were out, thanks
to the human lineage described in the contract, I could happily reach out to
Clint (yes, John was also off this week).
This example illustrates a fictional situation, but I am pretty sure you have
experienced something similar in your career.
As your need for observing your data grows with the maturity of your
business, you will realize that the number of attributes you want to measure
brings more complexity than simplicity. That's why, back in 2021, I came up
with the idea of combining data quality and service levels into a single table,
inspired by Mendeleev's (and many others') work on classifying atomic
elements. You can see the result in Figure 5-3.
Figure 5-3. Inspired by Mendeleev's periodic table for classifying atomic elements, the Data QoS
table represents the finest elements used for measuring data quality and service levels for data.
Representation
Figure 5-4. Each element has a name, an abbreviation, a group, an order in this group, and a
category.
The periods are time-sensitive elements. Some elements are pretty obvious,
as "end of life" is definitely after "general availability," as illustrated by
Figure 5-5.
Figure 5-5. General availability comes before the end of support, which comes before the end of
life.
The second classification to find was about grouping. How can we group
those elements? Is there a logical relation between the elements that would
make sense?
Here are the seven data quality dimensions, as represented by figure 5-7.
Figure 5-7. The seven data quality dimensions are on the Data QoS table.
Accuracy (Ac)
The measurement of the veracity of data against its authoritative source: an
accuracy issue means the data is provided but incorrect. Accuracy refers to
how precise data is, and it can be assessed by comparing it to the original
documents and trusted sources or confirming it against business rules.
Examples:
Fun fact: a lot of accuracy problems come from the data input. If you have
data entry people on your team, reward them for accuracy, not only speed!
Completeness (Cp)
Data is required to be populated with a value (aka not null, not nullable).
Completeness checks if all necessary data attributes are present in the
dataset.
Examples:
Conformity (Cf)
Data content must align with required standards, syntax (format, type,
range), or permissible domain values. Conformity assesses how closely data
adheres to standards, whether internal, external, or industry-wide.
Examples:
Fun fact: ISO country codes are 2 or 3 letters (like FR and FRA for France).
If you mix up the two in the same dataset, it's not a conformity problem;
it's a consistency problem.
Consistency (Cs)
Data should not contradict itself: the same information, represented the same
way, should hold within and across datasets and over time.
Examples:
Coverage (Cv)
All records are contained in a data store or data source. Coverage relates to
the extent of the data: data that should be present but is absent from a dataset
is a coverage issue.
Examples:
1. Every customer must be stored in the Customer database.
2. The replicated database has missing rows or columns from the source.
Timeliness (Tm)
The data must represent current conditions; the data is available and can be
used when needed. Timeliness gauges how well data reflects current
market/business conditions and its availability when needed.
Examples:
A file delivered too late or a source table not fully updated for a
business process or operation.
A credit rating change was not updated on the day it was issued.
An address is not up to date for a physical mailing.
Uniqueness (Uq)
How much data is duplicated? Uniqueness supports the idea that no record or
attribute is recorded more than once: each record and attribute should be
one-of-a-kind, aiming for a single, unique data entry (yeah, one can dream,
right?).
Examples:
Two instances of the same customer, product, or partner with different
identifiers or spelling.
A share is represented as equity and debt in the same database.
Fun fact: data replication is not bad per se; involuntary data replication is!
But I still feel it is not enough. Data quality does not answer questions
about end-of-life, retention period, and time to repair when broken. Let’s
look at service levels.
Figure 5-8 lists service-level indicators that can be applied to your data and
its delivery. You will have to set objectives (service-level objectives, or
SLOs) for your production systems and agree with your users on their
expectations (service-level agreements, or SLAs).
Availability (Av)
Availability measures whether, and how much of the time, the data can be
accessed when needed.
Throughput (Th)
Throughput is about how fast I can access the data. It can be measured in
bytes or records per unit of time.
Error rate
How often will your data have errors, and over what period? What is your
tolerance for those errors?
End of support
The date at which your product will not have support anymore.
For data, it means that the data may still be available after this date, but if
you have an issue with it, you won’t be offered a fix. It also means that you,
as a consumer, will expect a replacement version.
End of life
The date at which your product will not be available anymore. No support,
no access. Rien. Nothing. Nada. Nichts.
For data, this means that the connection will fail or the file will not be
available. It can also be that the contract with an external data provider has
ended.
Fun fact: Google Plus was shut down in April 2019. You can’t access
anything from Google’s social network after this date.
Retention (Re)
How long are we keeping the records and documents? There is nothing
extraordinary here, as with most service-level indicators, it can vary by use
case and legal constraints.
Frequency of update (Fy)
How often the data is refreshed or updated.
Latency (Ly)
Measures the time between the production of the data and its availability for
consumption.
Time to detect
How fast can you detect a problem? Sometimes, a problem can be breaking,
like your car not starting on a cold morning, or slow, like data feeding your
SEC (the U.S. Securities and Exchange Commission) filings being wrong
for several months. How fast do you guarantee the detection of the
problem? You can also see this service-level indicator called "failure
detection time."
Fun fact: squirrels (or some similar creature) ate the gas line on my wife's
car. We detected the problem quickly, as the gauge went down noticeably
after only a few miles. Can you even drive the car to the mechanic?
Time to notify (Tn)
Once you see a problem, how much time do you need to notify your users?
This is, of course, assuming you know your users.
Time to repair
How long do you need to fix the issue once it is detected? This is a very
common metric for network operators running backbone-level fiber
networks.
Of course, there are a lot more service-level indicators that will come over
time. Agreements follow indicators; agreements can include penalties. You
see that the description of the service can become very complex.
In the next section, let’s apply Data QoS to the data contract, in the context
of Climate Quantum, Inc.
To demonstrate the use of those dimensions in a data contract, I will use the
NYC Air Quality dataset, which could be used by Climate Quantum, Inc.
The dataset’s metadata looks like table 5-2.
Table 5-2. Name, description, and type of the columns of the NYC Air Quality dataset.
StartDate (Date & Time)
Date value for the start of the TimePeriod; always a date value; could be
useful for plotting a time series.
Message (Plain Text)
Notes that apply to the data value; for example, if an estimate is based on
small numbers, we will detail it here.
Let’s make sure that the information around measurements conforms to our
expectations.
In the data contract, this is how you could add a conformity data quality
rule. In this scenario, the measurement identifier requires a minimum value
of 100,000; if not, the error is considered as having an operational business
impact.
- table: Air_Quality
  description: Air quality of the city of New York
  dataGranularity: Raw records
  columns:
    - column: UniqueID
      isPrimary: true
      businessName: Unique identifier
      logicalType: number
      physicalType: int
      quality:
        - templateName: RangeCheck
          toolName: ClimateQuantumDataQualityPackage
          description: 'This column should not contain values below 100,000'
          dimension: conformity
          severity: error
          businessImpact: operational
          customProperties:
            - property: min
              value: 100000
Completeness
Data is required to be populated with a value: you don’t want NULL values.
- table: Air_Quality
  description: Air quality of the city of New York
  columns:
    - column: UniqueID
      isPrimary: true
      businessName: Unique identifier
      logicalType: number
      physicalType: int
      quality:
        - templateName: NullCheck
          toolName: ClimateQuantumDataQualityPackage
          description: 'This column should not contain null values'
          dimension: completeness
          severity: error
          businessImpact: operational
Accuracy
Ensures that the provided data is correct. For example, in the data contract,
you can specify that the value of the air quality measurement should be
between 0 and 500. As you can see, it is the same template (templateName)
used for the conformity dimension: the same tool can be used for multiple
data quality dimensions.
- table: Air_Quality
  description: Air quality of the city of New York
  dataGranularity: Raw records
  columns:
    - column: DataValue
      businessName: Measured value
      logicalType: number
      physicalType: float(3,2)
      quality:
        - templateName: RangeCheck
          toolName: ClimateQuantumDataQualityPackage
          description: 'This column should contain values between 0 and 500'
          dimension: accuracy
          severity: error
          businessImpact: operational
          customProperties:
            - property: min
              value: 0
            - property: max
              value: 500
Service levels usually apply to the entire data product. You will not have a
table with a retention of six months and another with a retention of three years.
slaDefaultColumn: StartDate
slaProperties:
  - property: endOfSupport
    value: 2030-01-01T00:00:00-04:00
  - property: retention
    value: 100
    unit: y
  - property: generalAvailability
    value: 2014-10-23T00:00:00-04:00
  - property: latency
    value: -1
    unit: As needed
As you can see, service-level indicators are properties, leaving room for
extensibility.
Summary
In this chapter, you learned that trust is really fundamental when it comes to
data. It is achieved by following three qualities: having a positive
relationship, showing expertise, and being consistent.
Data contracts are key to enabling this trust and to building reliable data
products. They can also be used outside the scope of data products, for
example, to document data pipelines.
Data QoS combines the seven data quality dimensions, as recognized by the
EDM Council, with an extensive list of service levels. They can be grouped
and organized through a timeline, like a periodic table. Data QoS is an
extensible framework that defines the values you can use when
implementing data contracts.
Chapter 6. Building your first data
quantum
Chapter 7. Aligning with the experience
planes
Chapter 8. Meshing your data quanta
Chapter 9. Data Mesh and Generative AI
With Early Release ebooks, you get books in their earliest form—the
author’s raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.
This will be the 9th chapter of the final book. Please note that the GitHub
repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the editor at [email protected].
GenAI represents a seismic shift in the way we perceive and interact with
data. Using cutting-edge machine learning techniques, this emerging
technology enables systems to create original content, mimic human-like
behavior, and generate synthetic data that rivals the real thing.
While still in its early stages, it is probably fair to say this capability is
nothing short of a game-changer. So, what is GenAI? Broadly speaking, I
include several components in GenAI: Large Language Models, Embeddings
(floating-point representations of text or other types of content) that are
stored in Vector Databases, and related technologies and libraries. So, let's
establish a few definitions…
Embeddings in GenAI are essential because they enable the models to learn
meaningful representations of the data, capturing important patterns and
relationships. By learning a compact representation, generative models can
generate new data that closely resembles the training data, allowing them to
create realistic and diverse samples in various domains, such as images,
music, text, and more.
Vector Databases
Beyond nearest neighbor search, a vector database also offers the benefit of
efficient retrieval and manipulation of data. By storing data in vector
representations, we can work with lower-dimensional and continuous
representations of the original data. This can speed up computation times
and allow us to perform tasks like dimensionality reduction, data clustering,
and classification more effectively.
Challenges
Nevertheless, despite GenAI’s seemingly great power, there are still gaps
and limitations. First, there are clear data gaps where the LLM is unable to
answer queries in a realistic manner. This is actually a very practical
consideration: simply put, the cost and time involved in training LLMs are
extensive, measured in tens to hundreds of millions of dollars and
many months of time. This means that only the largest companies—today,
only the large internet giants—are able to make the investment necessary to
create a competitive LLM.
Another limitation is the potential bias in the training data. Since LLMs
learn from publicly available text data, they might unintentionally absorb
biases present in the data sources. For instance, if a specific website or
forum has a biased view on certain topics, the model could reflect that bias
in its outputs. Consider gender bias: models trained on text containing
gender stereotypes, such as associating "traditional jobs" with a particular
gender, may reproduce that bias. As a result, LLMs might not always provide fully objective or
balanced answers, which could be problematic, especially in sensitive or
controversial topics.
Last, the lack of access to private data can impact the performance of LLMs
in specialized domains. For example, a company working on cutting-edge
research or proprietary technology might require AI models that have an in-
depth understanding of their domain-specific jargon and knowledge. Since
LLMs lack exposure to such private data, they might not be as effective in
these specialized contexts.
But most of the headlines today relate to GenAI's capabilities using data
that originates beyond the four walls of the enterprise. Almost all large
language models were trained on the data riches of the whole internet but,
literally, know nothing about the data that exists within an enterprise.
Now, the fusion of Data Mesh and GenAI gives organizations a tantalizing
new opportunity: Use Data Mesh as the foundational data platform that
makes it easy for Generative AI to find, consume, share, and trust enterprise
data. And together, Generative AI and Data Mesh make it easy to create
insights. Simply put, Data Mesh supercharges Generative AI.
So, imagine the ability to deploy GenAI models across various domains
within an organization, enabling stakeholders to effortlessly generate
synthetic data, simulate complex scenarios, and extract profound patterns
that fuel strategic initiatives with an unprecedented fervor.
The synergy between GenAI and Data Mesh holds immense potential to
reshape the enterprise data landscape. With the incredible power of
generative AI to consume enterprise data, privacy concerns are alleviated,
propelling innovation and experimentation to new heights and in new directions.
So, how can these challenges be addressed? How can the various different
forms of content be “normalized” into a usable form? And, ideally, how can
the semantic richness of the content be maintained as it is normalized?
Moreover, how can we make this internal enterprise content known to
GenAI so that it can provide meaningful responses to user requests and
queries?
Rather, first let’s define the components in our architecture. First, Content
(#1 in diagram) that is consumed by and used by our architecture as the
basis for addressing requests and answering queries. This content may come
from internal or external sources and documents such as data descriptions,
developer documentation, tutorials, and guides that are related to the
queries and requests we have. Today, tools are available to process just
about any data types including text, CSV (comma separated values),
Microsoft Office documents (practically, any textual data type) and most
access methods (raw files, storage buckets, URLs).
Next is the Embeddings Function (#2), which converts textual content into
vector representations, typically called embeddings. The most interesting
and useful thing about embeddings is that modern techniques allow these
embeddings to have strong and rich relationships to the actual semantics of
the original content.
Next, Prompts (#5) are a combination of the user query and a “context”.
The context is particularly important as it provides guidance and, in some
cases, constraints on how GenAI responds to user requests and queries. But
where does the context come from? Well, the context is the output from a
search from our vector database. But, again, what is the actual search
request presented to the vector database? Well, the search is the user query.
But why the original query? Well, the original query, when converted to an
embedding, provides the semantics of the user request—so, when we search
the vector database we are actually searching for items that are semantically
similar to our query. And with some basic prompt engineering, the original
query is combined with the context to guide GenAI as it responds to a user
request/query.
Large Language Models (#6) respond to prompts. The so-called super-
power of these models—the thing that has only now been achievable—is
that they respond to user requests and queries in a compelling, human-like,
and conversational way based upon the provided data combined with the
knowledge gleaned from being trained on the vast riches of the internet.
And by using our architecture, we allow the models to also respond to our
data, the information in our Data Products.
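To tie these components together, here is a minimal, hypothetical configuration sketch in the same YAML style used for data contracts elsewhere in this book; every key and model name below is an assumption made for illustration, not part of any product, library, or standard.

# Hypothetical GenAI pipeline configuration (illustration only)
contentSources:                  # (1) content ingested as the knowledge base
  - type: url
    location: https://example.com/data-product-docs
  - type: files
    location: ./docs/tutorials/
embeddingsFunction:              # (2) converts text into vector embeddings
  model: an-embedding-model      # placeholder model name
vectorDatabase:                  # stores embeddings for semantic search
  collection: data-product-content
prompt:                          # (5) user query combined with retrieved context
  template: |
    Answer the question using only the context below.
    Context: {context}
    Question: {question}
languageModel:                   # (6) LLM that responds to the prompt
  name: a-large-language-model   # placeholder model name
  temperature: 0.2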
Code/Document Generation
While strictly speaking this may be initially considered an aid to
developers, this capability nevertheless is much more than that. It is
quite common and relatively simple to generate prompts that use
our GenAI architecture to act upon user, developer, and data
documentation to create OpenAPI specifications, JSON schemas, and
software fragments used by both Data Product developers and
Data Product consumers. And with some creative prompts, we can
recommendations and predictions based upon data that is unique to
the Data Products in our enterprise.
Lastly, High Value Use Cases (#8) can be built using raw GenAI
capabilities, or better yet, accelerated through the use of our composable
components. Again, while limited only by our imagination, several useful
solutions can be considered:
Data Analytics and Reporting
Operational Insights
Content Management
But, how does this work? First, since our GenAI architecture can ingest any
form of data using just about any access mechanism, we can easily acquire
vast amounts of content about the data for our Data Product. Next, once
acquired, our Summarization and Tagging composable component can
capture the semantic richness of the content and generate consistent sets of
content summaries.
Next, once data products are on-boarded, they can easily be made
discoverable. To do this, we can load our summarizations and tags
into the Data Product Registry (see Chapter 3 for more information), which
acts as the catalog for an enterprise Data Mesh. Now, any user can easily
find the data products based upon semantically rich summaries and tags.
So, now that we have rich content in our Data Product Registry, we make it
searchable using our Natural Language Search composable components.
Now, all of the semantically rich information about Data Products is
available using natural language and semantically rich search. No more
SQL. No more guessing keywords. Just simple, intuitive questions.
So, how do these components interact in our Climate Quantum Inc. scenario?
GenAI can assist in tagging climate data, making it easier to categorize and
organize the information. By identifying key variables, regions, and
temporal aspects within the data, the LLMs can create relevant tags or
keywords. For instance, the model can tag weather data with attributes like
temperature, precipitation, humidity, and location details. These tags help in
efficient data retrieval and analysis, making it simpler to compare and
correlate different climate variables and their impacts over time.
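As a hedged illustration of this tagging step, an auto-generated entry for a weather data product might look like the following; the structure is an assumption made for illustration and simply mirrors the summary-and-tags registry described earlier.

# Hypothetical GenAI-generated summary and tags for a climate data product
dataProduct: nyc-weather-observations
summary: >
  Daily weather observations for New York City, covering temperature,
  precipitation, and humidity from public monitoring stations.
tags:
  - "variable:temperature"
  - "variable:precipitation"
  - "variable:humidity"
  - "region:new-york-city"
  - "period:daily"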
With the vast amounts of climate data being collected from diverse sources,
integrating this data into applications and analytical tools can be time-
consuming and challenging. GenAI can automate code generation for data
integration, creating code snippets that fetch, process, and preprocess
climate data from various repositories. This automation not only saves time
but also ensures consistency and accuracy in the integration process.
Summary
By leveraging GenAI’s capabilities in summarization, tagging, knowledge
graph creation, and code generation, managing vast amounts of climate data
in our Data Mesh becomes much more efficient and insightful. It facilitates
faster decision-making, scientific research, and policy analysis, contributing
to a better understanding of climate patterns, impacts, and potential
mitigation strategies. And, as climate data continues to evolve, GenAI can
adapt and update its models, ensuring that the climate data management
process remains relevant and effective over time.
Part III. Teams, operating models, and
roadmaps for Data Mesh
With a basic Data Mesh in place, how do you successfully set up the teams,
operating model, and roadmap required to establish, nurture, and grow your
Data Mesh? You will also discover a maturity model that will enable you to
measure your progression.
Chapter 10. Running and operating your
Data Mesh
Chapter 11. Implementing a Data Mesh
Marketplace
Chapter 12. Implementing Data Mesh
Governance
Chapter 13. Running your Data Mesh
Factory
Chapter 14. Defining and Establishing
the Data Mesh Team
With Early Release ebooks, you get books in their earliest form—the
author’s raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.
This will be the 14th chapter of the final book. Please note that the GitHub
repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the editor at [email protected].
Data Governance
To establish effective data product teams within Data Mesh, I leverage the
concept of team topologies adapted from Matthew Skelton and Manuel Pais
in their Team Topologies book. As shown in Figure 14-1, the core team in
our discussion is the Data Product team (Pais and Skelton's stream-aligned
team) that owns, manages, and governs data products; however, we will also
touch briefly upon the Data Platform and Data Enabling teams that support
Data Product teams.
Figure 14-1. Data Mesh Team Topology
In Figure 14-1, we show three types of teams: the "Data Product" team, the
"Data Platform" team, and the "Data Enabling" team. So, what does each of
them do?
Data Product teams are the core of the Data Mesh team ecosystem. They are
the experts in consuming data from data providers, transforming data to
deliver business value, and making it available to data consumers. In a Data
Mesh ecosystem there are many data product teams, each taking on the
responsibility of end-to-end delivery of services related to a specific data
product. A data product team interacts with the groups that create, manage,
and consume its data, as well as with the platform and enabling teams that
provide technical and support capabilities.
In our “ecosystem” metaphor, data products are analogous to different
species within the ecosystem. Each Data Product team acts as a unique
species, specializing in a specific domain or data-related service. Just as
diverse species contribute to the overall biodiversity and functionality of an
ecosystem, data products contribute their distinct data capabilities and
expertise to the data mesh ecosystem.
There are two key teams that support the Data Product teams: Data
Platform teams and Data Enabling teams.
Let’s touch upon Data Platform teams first. Where Data Product teams are
experts in consuming data technology, the Data Platform team are experts
in managing data technology. The Data Platform teams play a crucial role
in supporting and enabling the work of other Data Product teams within the
Data Mesh ecosystem. These teams focus on providing the necessary tools,
utilities, and technical services that make it easier for data product teams to
perform their tasks efficiently and effectively. Platform teams act as a
central resource, offering shared services and capabilities that can be
utilized by multiple teams across the organization.
The scope of Data Platform teams can vary depending on the organization’s
needs and priorities. They may include teams specializing in areas such as
cloud infrastructure, APIs, security, networking, or any other technical
domain that is critical for supporting the organization’s software
development efforts. Data Platform teams collaborate closely with other
teams, understanding their requirements and continuously improving the
services they offer to ensure smooth and efficient operations.
Continuing with our ecosystem metaphor, the Data Platform teams can be
seen as the infrastructure that supports and nourishes the ecosystem. They
provide the necessary tools, services, and technical capabilities akin to the
fertile soil, clean water, and favorable climate that enable the growth and
sustainability of the ecosystem. These platform teams ensure that the data
products have the necessary resources and support to operate effectively
and deliver value.
Last, but definitely not least, are Data Enabling teams. They play a vital
role by providing consultative support and expertise to Data Product teams.
They help address obstacles, offer guidance, and promote best practices in
data management and governance.
The role of Data Enabling teams is to identify and understand the unique
requirements and constraints faced by other teams. They collaborate closely
with these teams, working in short bursts or on a project basis to provide
targeted assistance. Enabling teams bring their expertise and knowledge to
bear on specific problems or areas where additional support is needed.
Data Enabling teams can take different forms depending on the organization
and its specific needs. They may include steering groups, enterprise
governance and architecture teams, training groups, or any other specialized
teams that can offer insights and assistance in specific domains. These
teams typically have deep knowledge and experience in their respective
areas and can provide valuable guidance, best practices, and resources to
help teams succeed.
By leveraging the expertise of Data Enabling teams, Data Product teams
can benefit from specialized support and knowledge without having to build
the same capabilities within every individual team. Data Enabling teams
help foster collaboration, knowledge sharing, and innovation across the
organization by providing targeted support to teams facing challenges or
pursuing opportunities.
Still, the Data Product team is the core team within a Data Mesh, so let's dig a
bit deeper here. What does a Data Product team do? What are the key roles
on the team? And what benefits and challenges do they experience?
A Data Product team has a clear scope and boundaries, typically centered
around a specific database, set of tables, or files. They are accountable for
all aspects of the Data Product lifecycle, including data ingestion,
consumption, discovery, observability, and ensuring its overall success in
delivering value to the organization.
But most importantly, each Data Product team works independently and has
the autonomy to make decisions regarding their Data Products. And it is
this autonomy and independence that allows for faster decision-making and
shorter feedback loops which are essential for delivering data-driven
solutions efficiently.
Data Product teams interact with various other teams within the data mesh.
Producer teams, which manage the source of data, collaborate with data
product teams to ensure smooth data ingestion. Consumer teams, on the
other hand, access and utilize the data offered by the data product team for
various analytical purposes. Platform teams provide essential “X-as-a-
Service” capabilities to support the data product team’s data ingestion,
consumption, and sharing processes. Enabling teams assist the data product
team in overcoming short-term obstacles or addressing specific needs.
It’s important to note that while these roles are commonly found within data
product teams, the specific structure and composition can vary depending
on the organization and the nature of the data product being developed. The
size of a data product team varies depending on its particular purpose and
objective. In some cases, Data Product teams may be quite small, perhaps
where the scope is limited, but in other cases they may be somewhat larger.
However, in most cases, the old AWS maxim of a "two-pizza team" (about
10-12 people) is probably a practical maximum.
Think of the Data Product team as an orchestra. In this metaphor, the Data
Product team leader can be seen as the conductor of the orchestra. They set
the vision, provide guidance, and ensure that all
team members are aligned and working towards a common goal. Like a
conductor, they coordinate and bring together the diverse talents and
expertise of the team members to create a cohesive and impactful
performance.
The different roles within the Data Product team can be compared to
various musical instruments. Each role has its own specific responsibilities
and skills, similar to how each instrument has a distinct sound and purpose
in an orchestra. Whether it’s the Data Product owner, metadata manager,
consumption services manager, or ingestion services manager, each member
contributes their expertise and skills to the overall symphony of Data
Product delivery.
Now, let’s take a look at how an individual Data product team is organized
and explore key team roles and responsibilities.
The Data Product team is led by a “Data Product Owner” who has overall
accountability for the success of the data product. This role is responsible
for setting the direction and roadmap of the data product, acquiring funding,
and liaising with stakeholders and other teams involved.
Strategic Direction
The Data Product Owner sets the vision and roadmap for the data
product and ensures it stays aligned with business objectives and user
needs.
Cross-Functional Collaboration
The Data Product Owner needs to collaborate with various teams and
individuals across the organization. They work closely with data
engineers, data analysts, domain experts, and other stakeholders to
define and deliver the data product. Collaboration skills are essential
for fostering a culture of teamwork, encouraging knowledge sharing,
and ensuring alignment between different teams and functions.
Product Ownership
The Data Product Owner is responsible for the data product’s overall success. This
includes defining the product strategy, prioritizing features and
enhancements, and managing the product backlog. They should
continuously monitor and evaluate the performance of the data
product, making data-driven decisions to optimize its value,
usability, and impact.
Domain Knowledge
The Data Product Owner should have a solid understanding of the business
domain the data product serves, enabling them to translate domain needs
into product decisions and to communicate credibly with domain experts and
data consumers.
Release Manager
The Release Manager plays a crucial role in coordinating and managing the
release process of data products within a data mesh. Their responsibilities
revolve around ensuring the successful deployment, communication, and
adoption of data product releases. Here are some of the most important
responsibilities and skills of a data product release manager:
Release Planning and Coordination
Data Governance
Data Contracts
Through the marketplace metaphor, we can grasp the vital role of metadata
management and governance in establishing a well-functioning data mesh.
They facilitate the discovery, accessibility, and trustworthiness of metadata,
allowing data consumers to navigate and leverage the diverse range of data
products within the ecosystem effectively.
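To make the idea of a data contract a bit more tangible, here is a minimal, purely illustrative sketch of the kind of metadata such a contract might carry. The structure, field names, and values are assumptions for illustration, not a formal data contract standard.

# Illustrative sketch of a data contract expressed as a simple Python structure;
# the schema, service levels, and classification fields are assumptions.
customer_orders_contract = {
    "data_product": "customer-orders",
    "version": "1.2.0",
    "owner_team": "orders-data-product-team",
    "schema": {
        "order_id": "string, unique, not null",
        "customer_id": "string, not null",
        "amount": "decimal(10,2), >= 0",
        "order_date": "date, ISO 8601",
    },
    "service_levels": {
        "freshness": "updated by 06:00 UTC daily",
        "availability": "99.5%",
    },
    "classification": {"pii_fields": ["customer_id"]},
}

A contract like this is what a metadata catalog can surface to consumers, which is exactly the marketplace behavior described above: it lets them discover the product, judge whether it meets their needs, and trust how it will behave.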
The Data and Security Manager has responsibility for the overall
architecture of the Data Product. They collaborate with other team
members to design and maintain a scalable and efficient data architecture
that supports the organization’s data needs. The Data and Security Manager
has skills that include:
Data Modeling
Performance Optimization
Like the other roles, the Ingestion Services Manager must take into
account the constantly changing data management landscape. The
Ingestion Services Manager should have a passion for continuous
learning and adaptability to stay updated with industry trends and
incorporate new techniques into their data ingestion processes. They
should be open to exploring innovative approaches and leveraging
new tools and technologies to improve data ingestion efficiency.
The above figure provides a skills matrix for the Data Product team, but
there are some clear skills that are emphasized for each team role:
Metadata Manager
This role secures and manages the data within the Data Product, and
hence requires an outstanding knowledge of the data platform
technology (database, data lake, data warehouse, etc.). With security
and privacy playing a crucial role in all modern enterprises, this role
is an expert in practices that ensure that access to the data product
and its data is available only to authorized individuals and that all
data is secure and protected.
Consumption Manager
This role ensures that the data product is accessible and available to
its consumers. As such, this role has deep technical data-access
skills, such as programming and APIs, as well as skills in data science
and analytics, which are common ways in which the Data Product is used.
Ingestion Manager
This role ensures that data is ingested and stored in the data product
in a secure and reliable manner. As such, this role has deep technical
skills in data pipelines and data transformation, as well as data
storage skills related to the data platform (database, data lake,
data warehouse) used by the data product; a short sketch of this
ingestion/consumption split follows this list.
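As a rough illustration of how the Ingestion Manager’s and Consumption Manager’s concerns divide, here is a minimal sketch in Python. It is not a reference implementation: it uses an in-memory SQLite database as a stand-in for the data platform, and the table, columns, and function names are hypothetical.

# Illustrative-only sketch of the ingestion/consumption split described above.
# SQLite stands in for the data platform; all names are hypothetical.
import sqlite3

def ingest(conn, rows):
    """Ingestion side: validate incoming records and load them into the product's storage."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
    clean = [(r["order_id"], float(r["amount"])) for r in rows if r.get("order_id")]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", clean)
    conn.commit()

def consume(conn, min_amount=0.0):
    """Consumption side: expose a stable, read-only view to data consumers."""
    cur = conn.execute(
        "SELECT order_id, amount FROM orders WHERE amount >= ?", (min_amount,)
    )
    return [{"order_id": oid, "amount": amt} for oid, amt in cur.fetchall()]

conn = sqlite3.connect(":memory:")
ingest(conn, [{"order_id": "A-1", "amount": "19.99"}, {"order_id": "A-2", "amount": "5.00"}])
print(consume(conn, min_amount=10))   # -> [{'order_id': 'A-1', 'amount': 19.99}]

In practice the ingestion side would be a managed pipeline and the consumption side an output port such as a governed table or API, but the division of responsibility is the same.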
Benefits
Data Product teams bring several benefits to organizations when it comes to
effectively managing and leveraging data. Here are some key benefits of
having a Data Product team:
Faster Time-to-Value
Domain Expertise
Data Product teams develop deep domain expertise in the specific
areas they work on. They understand the intricacies and nuances of
the data, as well as the unique challenges and opportunities within
the domain. This expertise enables them to provide valuable insights
and solutions tailored to the specific needs of the business.
Continuous Improvement
Challenges
Nevertheless, Data Product teams can face several challenges throughout
their journey. Here are some common challenges that data product teams
may encounter that will likely require constant care and feeding:
Scalability and Performance
As Data Products grow and more users rely on them, scalability and
performance become crucial challenges. Managing increasing data
volumes, handling data processing and storage efficiently, and
ensuring responsiveness and reliability can be demanding tasks for
data product teams.
Balancing Innovation and Stability
Data Product teams often need to balance the need for innovation
and introducing new features with the stability and reliability of
existing data products. Striking the right balance between pushing
boundaries and maintaining robustness can be a delicate challenge.
Summary
In this chapter, we explored the concept of establishing an effective data
product team within a data mesh. We discussed the organizational structure
and roles within a data mesh, including stream-aligned teams, platform
teams, enabling teams, and data product teams. A data product team is
responsible for the end-to-end delivery of services required by a data
product and interacts with producer teams, consumer teams, platform
teams, enabling teams, and complicated subsystem teams.
We also delved into the responsibilities and skills of key roles within a data
product team, such as the data product owner, release manager,
metadata manager, consumption services manager, and ingestion services
manager. These roles encompass a range of responsibilities, including
strategic planning, stakeholder management, data governance, data pipeline
design and development, performance optimization, and compliance.
Chapter 15. Defining an Operating Model for Data Mesh
Introduction
With Early Release ebooks, you get books in their earliest form—the
author’s raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.
This will be the 15th chapter of the final book. Please note that the GitHub
repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the editor at [email protected].
However, the distributed model also comes with its own set of challenges,
particularly in terms of decision-making complexity. The lack of a
centralized decision-making authority can sometimes lead to
inconsistencies and difficulties in maintaining a unified strategic direction.
In the case of the ASF, coordinating efforts and maintaining consistency
across numerous independent projects can be a complex task. The challenge
lies in ensuring that all the autonomous units or teams are aligned with the
overall objectives and values of the organization while retaining their
independence. Additionally, in the absence of a central authority, conflict
resolution and the enforcement of standards and policies can become more
complicated. The success of a distributed organization like the ASF depends
heavily on the establishment of strong, shared values and objectives, along
with effective communication channels and collaborative tools that enable
disparate teams to work towards common goals while respecting each
other’s autonomy.
Now, at the end of the day, each organization is unique and probably has a
mix of several operating models, each implemented in different parts of the
enterprise. Nevertheless, it is important to understand the advantages and
disadvantages of each operating model to be able to plan your approach for
your data mesh and data products.
Our Data Mesh Ecosystem Operating Model extends beyond the realm of
individual Data Product teams, encapsulating the entire spectrum of data
management within an organization. This model is pivotal
in orchestrating the interactions and collaborative efforts of multiple data
product teams, aligning them to form an integrated and efficient Data Mesh.
This holistic approach is crucial in harnessing the full potential of a
decentralized data architecture, ensuring that the collective output of
various data products synergizes into a unified, strategic asset for the
organization.
Figure 15-3. Operating Model Interactions
Central to this model is the integration of the individual Data Product team
operating models into a larger, interconnected ecosystem. While these
teams concentrate on their domain-specific roles, the Ecosystem Operating
Model ensures they do not operate in isolation. It promotes a culture of
collaboration and shared learning, encouraging teams to exchange best
practices, tools, and insights. This synergy is essential for balancing the
autonomy of individual teams with the overarching goals of the Data Mesh,
facilitating data products that are not only effective in their own right but
also interoperable and complementary across the organization.
One of the main objectives of this model is to foster a seamless flow of data
and collaboration among different Data Product teams. This goal is
achieved by creating an environment conducive to easy sharing of data,
knowledge, and resources, thus amplifying the collective value of the Data
Mesh. Such an environment paves the way for a more agile and responsive
data infrastructure, adaptable to the dynamic needs of business and
technology landscapes. Another critical objective is maintaining
consistency and uniformity in data practices and standards across the Mesh,
which is instrumental in upholding data quality and reliability.
A central, but lightweight and nimble, governance entity, mirroring the role
of ANSI, is integral to this model, as shown in Figure 15-4. This body sets
forth the core policies, standards, and certification processes for the entire
Data Mesh. Rather than enforcing compliance top-down, it establishes a
framework within which all Data Product teams operate, ensuring that these
standards stay relevant, current, and aligned with both organizational
objectives and industry best practices. By doing so, it upholds a high
standard of data quality, security, and compliance across the Mesh while
fostering the flexibility needed for teams to innovate and address their
unique data challenges.
Governance within the Data Product Team Operating Model aligns more
with a “certification” approach rather than a traditional centralized policing
style. A central governance team establishes policies and standards, and it is
up to the individual entities, in this case, the Data Product teams, to adhere
to these standards. The teams are responsible for ensuring their data
products meet the established criteria, much like a vendor ensuring their
product meets certain quality standards before it gets certified.
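One way to picture this certification style of governance is as policy-as-code: the central team publishes the checks, and each Data Product team runs them against its own product before release. The sketch below is a simplified assumption of how that might look in Python; the rule names and descriptor fields are invented for illustration.

# Hypothetical sketch of "certification" governance: centrally published checks
# that each Data Product team runs itself. Rules and fields are illustrative.
CENTRAL_STANDARDS = {
    "has_owner": lambda p: bool(p.get("owner_team")),
    "has_data_contract": lambda p: bool(p.get("contract_version")),
    "pii_is_classified": lambda p: "pii_fields" in p,
}

def certify(product: dict) -> dict:
    """Return the pass/fail result of each central standard for one data product."""
    return {rule: check(product) for rule, check in CENTRAL_STANDARDS.items()}

report = certify({
    "name": "customer-orders",
    "owner_team": "orders-data-product-team",
    "contract_version": "1.2.0",
    "pii_fields": ["email"],
})
print(report)  # {'has_owner': True, 'has_data_contract': True, 'pii_is_classified': True}

The central team evolves the checks; the Data Product teams remain responsible for passing them, which preserves both the shared standards and the teams’ autonomy.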
Some of these tradeoffs are described quite well through Conway’s Law, so
let’s start there.
Figure 15-5. Conway’s Law and Its Implications
Let’s explore each of the operating models and their implications for
architecture, as shown in Figure 15-5.
In centralized organizations, decision-making and control are highly
concentrated at the top of the hierarchy. This model is often found in sectors
where uniformity and precision are paramount, such as manufacturing or
finance. The centralized nature of these organizations lends itself well to
monolithic data architectures, like centralized data warehouses. The key
advantage of this architecture is its ability to maintain consistency and
control over data. A centralized data warehouse ensures that all
organizational data adheres to a uniform set of standards and policies,
mirroring the centralized decision-making process. However, the rigidity of
this model can be a disadvantage in rapidly changing environments, as it
may not adapt quickly to new data sources or analytics needs.
The regional focus fostered by Conway’s Law often results in data products
that are finely tuned to local market needs and regulatory landscapes. While
this approach enhances the effectiveness of data products in specific
regions, it can inadvertently lead to fragmentation in global data strategy,
manifesting as data silos and challenges in cross-regional collaboration.
Organizations striving for a unified global data perspective might struggle
to integrate these diverse, regionally-focused data products into a
harmonious whole.
Clearly, the choice of an operating model for a Data Mesh is a decision
that shapes an organization’s data future. It requires a careful balance
between local autonomy and global coherence, between innovation and
standardization. As organizations navigate this complex landscape, the key
to success lies in aligning these choices with their long-term strategic
vision, ensuring that their data ecosystem is not only responsive to current
needs but also poised for future challenges and opportunities. The interplay
between an organization’s operating model and its data architecture is a
fundamental aspect of effective data management and strategy.
Understanding this relationship helps organizations choose
the right architecture to support their operational needs and strategic goals,
whether it be a centralized data warehouse, a data lake, a data mesh, or a
microservices architecture. For organizations looking to implement a data
mesh, aligning its principles with their operating model, especially in
federated structures, is key to leveraging its full potential.
All that being said, the evolution of Data Mesh into loosely coupled
regional ecosystems may actually be a testament to the dynamic nature of
data management in today’s global business environment. This architectural
shift, where data solutions are crafted to reflect the specific needs and
contexts of various regions, is deeply rooted in the principles of Conway’s
Law. According to this law, the structure of systems developed within an
organization is often a mirror of its communication patterns. In the realm of
Data Mesh, this means that an organization’s geographic or regional
structure profoundly influences its data architecture, leading to the creation
of region-specific data products that are both locally relevant and
responsive.
The move towards regional Data Mesh models, then, represents a
sophisticated approach to data management, one that respects the unique
characteristics of different regions while maintaining a unified vision. By
carefully balancing regional autonomy with centralized oversight and
integrating technological solutions that facilitate collaboration and
standardization, organizations can create a data ecosystem that is not only
regionally effective but also globally coherent. This strategy ensures that
organizations are well-positioned to harness the full potential of their data
assets in a rapidly evolving global business landscape.
Chapter 16. Establishing a Practical Data Mesh Roadmap