Data Mesh
Principles, patterns, architecture, and strategies for data-driven decision making
Pradeep Menon
www.bpbonline.com
First Edition 2024
ISBN: 978-93-55519-962
All Rights Reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored, and executed in a computer system but may not be reproduced for publication by photocopy, recording, or any electronic or mechanical means.
All trademarks referred to in the book are acknowledged as properties of their respective owners but
BPB Publications cannot guarantee the accuracy of this information.
www.bpbonline.com
In the rapidly evolving world of data management, the shift from traditional
centralized architectures like data lakes and warehouses to a decentralized,
domain-oriented approach marks a revolutionary change. Architecting the
Data Mesh: Patterns and Strategies dives deep into this transformative
concept known as Data Mesh, which redefines how data is handled across
organizations. This book is crafted for data professionals eager to understand
and implement a structure that promotes agility, scalability, and resilience
within their data ecosystems.
Data Mesh represents a paradigm shift, focusing on treating data as a product
and emphasizing decentralized governance. This approach aligns closely with
the needs of modern businesses that require rapid access to diverse,
distributed data sources. By breaking down the traditional silos, Data Mesh
enables a more collaborative and flexible data management environment.
This book is designed not only to introduce the concept but also to provide a
detailed guide on implementing Data Mesh effectively.
Architecting the Data Mesh: Patterns and Strategies embarks on a
comprehensive exploration of Data Mesh, guiding readers through the
transformative shift from traditional centralized data architectures to a
decentralized, domain-oriented framework. The journey begins by
establishing a contextual foundation for Data Mesh, followed by a historical
overview of data architecture evolution, highlighting the necessity for such an
innovative approach. As the chapters progress, readers delve into the core
principles and patterns of Data Mesh, gaining insights into how it fosters
agility, scalability, and resilience in data management. The book then
navigates through the practical aspects of implementing Data Mesh, covering
data governance, cataloging, sharing, and security, each treated with depth
and precision to facilitate understanding and application. Finally, the book
culminates with practical examples and real-world applications, illustrating
how to operationalize Data Mesh effectively within various organizational
contexts. This structured journey equips data professionals with the
knowledge to not only understand but also implement Data Mesh to enhance
their data management practices and stay ahead in the rapidly evolving data
landscape.
By the conclusion of this book, readers will not only grasp the theoretical
underpinnings of Data Mesh but will also be equipped with practical
knowledge and strategies to implement these concepts in their day-to-day
operations. Whether you are a seasoned data architect, a Chief Data Officer,
or a curious analyst, Architecting the Data Mesh: Patterns and Strategies
offers valuable insights and guidelines that will help you stay at the forefront
of data management technology. This book is your comprehensive guide to
navigating the complexities of modern data architectures and leveraging the
full potential of Data Mesh to drive business value.
Chapter 1: Establishing the Data Mesh Context – This chapter introduces
the Data Mesh concept by delineating its need within modern data
management paradigms. It sets the stage by describing the shift from
centralized systems to a more fluid, decentralized architecture, explaining
how this approach aligns with the demands of big data and agile enterprises.
Chapter 2: Evolution of Data Architectures – This chapter traces the
development of data architectures from traditional databases and data
warehouses to modern data lakes and beyond. It highlights the limitations of
earlier systems and sets the rationale for the adoption of Data Mesh,
presenting a historical perspective that underscores the evolution toward
decentralized data domains.
Chapter 3: Principles of Data Mesh Architecture – This chapter delves
into the core principles that define the Data Mesh framework. It explains each
principle in detail, providing the theoretical foundation necessary for
understanding and implementing Data Mesh.
Chapter 4: The Patterns of Data Mesh Architecture – This chapter
explores various architectural patterns within Data Mesh, including
decentralized topologies and hybrid models. It offers guidelines on how to
select and implement these patterns based on specific organizational needs
and data strategies.
Chapter 5: Data Governance in a Data Mesh – This chapter discusses the
unique challenges and solutions for governing data in a decentralized context.
It covers strategies for maintaining data quality, managing metadata, ensuring
compliance, and aligning data governance with organizational goals within
the Data Mesh framework.
Chapter 6: Data Cataloging in a Data Mesh – This chapter focuses on
effective data cataloging practices that enhance the discoverability and
usability of data across decentralized domains. It details the strategy,
processes, and tools for building a comprehensive data catalog that supports
the Data Mesh’s collaborative and agile nature.
Chapter 7: Data Sharing in a Data Mesh – This chapter examines the
topologies for secure and efficient data sharing across different domains
within a Data Mesh. It provides insights into designing data-sharing strategies
that balance autonomy with oversight, which is crucial for fostering an
integrated yet flexible data environment.
Chapter 8: Data Security in a Data Mesh – This chapter addresses the
critical aspects of securing a decentralized data architecture. It lays out the
detailed framework for data security in Data Mesh environments, covering organization-level, inter-domain, and intra-domain security.
Chapter 9: Data Mesh in Practice – This chapter consolidates the learnings from all previous chapters, synthesizing the principles, patterns, governance,
cataloging, sharing, and security strategies into a cohesive framework for
implementing Data Mesh in practice. It lays out step-by-step guidelines for
operationalizing Data Mesh within various organizational contexts, providing
a comprehensive roadmap that translates theoretical concepts into actionable
strategies.
Coloured Images
Please follow the link to download the
Coloured Images of the book:
https://rebrand.ly/e8b279
We have code bundles from our rich catalogue of books and videos available
at https://github.com/bpbpublications. Check them out!
Errata
We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and to provide our subscribers with an engaging reading experience. Our readers are our mirrors, and we use their inputs to reflect and improve upon human errors, if any, that may have occurred during the publishing processes involved. To let us maintain the quality and help us reach out to any readers who might be having difficulties due to any unforeseen errors, please write to us at:
[email protected]
Your support, suggestions, and feedback are highly appreciated by the BPB Publications' Family.
Did you know that BPB offers eBook versions of every book published, with PDF and ePub files
available? You can upgrade to the eBook version at www.bpbonline.com and as a print book
customer, you are entitled to a discount on the eBook copy. Get in touch with us at:
[email protected] for more details.
At www.bpbonline.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters, and receive exclusive discounts and offers on BPB books and eBooks.
Piracy
If you come across any illegal copies of our works in any form on the internet, we would be
grateful if you would provide us with the location address or website name. Please contact us at
[email protected] with a link to the material.
Reviews
Please leave a review. Once you have read and used this book, why not leave a review on the site
that you purchased it from? Potential readers can then see and use your unbiased opinion to make
purchase decisions. We at BPB can understand what you think about our products, and our authors
can see your feedback on their book. Thank you!
For more information about BPB, please visit www.bpbonline.com.
CHAPTER 1
Establishing the Data Mesh Context
Introduction
Decades ago, Clive Humby, a respected mathematician and data science pioneer, stated, "Data is the new oil." Today, his words hold even greater significance as we are in a data-driven era where effective data management has become a critical aspect of transformation.
In the digital age, data has emerged as one of the most valuable assets for
organizations worldwide. In this chapter, we embark on a journey through the
intricate maze of the modern data landscape. We begin by navigating the
contemporary data ecosystem and understanding its complexities and
challenges. From the structured realms of Data Warehouses to the vast
expanses of Data Lakes and the hybrid environment of the Data Lakehouse,
we explore each architecture’s nuances, strengths, and limitations. As we
progress, we recognize the growing need for a more encompassing solution –
a macro data architecture pattern. This pattern seeks to address the unique
challenges extensive and multifaceted organizations face in today’s data-
driven world. Join us as we unravel the intricacies of these architectures and
pave the way for a more holistic approach to data management.
Structure
In this chapter, we will introduce the following:
Navigating the modern data landscape.
Need for a macro data architecture pattern.
Objectives
The primary objective of this chapter is to provide readers with a
foundational understanding of the contemporary data landscape. We aim to
demystify the core architectures that dominate today’s data management
practices, from the structured world of Data Warehouses to the expansive
domains of Data Lakes and the integrative approach of Data Lakehouses. By exploring these architectures, we highlight their
merits and challenges. Furthermore, we underscore the emerging need for a
macro data architecture pattern, emphasizing its significance in addressing
the complexities of large-scale data management.
Lastly, this chapter serves as a precursor to the deeper discussions in the
subsequent chapters, offering a brief overview of the topics and insights.
Through this chapter, we aspire to equip readers with a holistic perspective
on modern data architectures and set the stage for the following
comprehensive exploration.
Data Lake: A centralized repository that allows you to store all your structured and unstructured data at any scale.
Advantages: Better flexibility; better scalability; relatively cost-effective.
Disadvantages: Prone to becoming data swamps; challenging security implementations; complexity in processing unstructured data; requires greater governance.
Data Warehouses
The concept of a Data Warehouse is not new. Bill Inmon first introduced it in the 1970s. He defined it as "a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process." The idea was to create a central
repository where data from various sources could be stored and analyzed.
Over time, data warehousing has evolved with technological advancements,
but the core concept remains the same.
A Data Warehouse is a centralized repository where data from various
sources is consolidated, transformed, and stored. This data is typically
structured and processed, making it suitable for analysis and reporting. Data
Warehouses are used by organizations to support business intelligence
activities, including data analytics, reporting, and decision-making. They
provide a historical data view, enabling trend analysis and strategic planning.
Data Warehouses, like any other system, have their advantages and
disadvantages. Here are a few to be considered:
Advantages
Integrated data: Data Warehouses consolidate data from various
sources, providing a unified view of the data. Data integration
makes it easier to perform cross-functional analysis.
Improved data quality and consistency: Data from different
sources is cleaned and transformed into a standard format in a Data
Warehouse, improving data quality and consistency.
Better decision-making: Data Warehouses support business
intelligence tools and analytics, enabling better decision-making
based on data.
Disadvantages
Complexity and cost: Setting up a Data Warehouse can be complex
and costly. It requires significant upfront design and ongoing
maintenance.
Data latency: Since data is typically batch-loaded into a Data
Warehouse, there can be a delay (latency) in data availability for
analysis.
Limited flexibility: Data Warehouses are schema-on-write systems.
Schema-on-write means the schema (structure of the data) needs to
be defined before writing the data, which can limit flexibility in
handling unstructured data or changes in the data structure.
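To make the schema-on-write idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module; the table and column names are invented for illustration, not taken from any specific warehouse product:

```python
import sqlite3

# Schema-on-write: the structure must be declared before any data lands.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        amount     REAL NOT NULL,
        order_date TEXT NOT NULL
    )
""")

# A row that matches the declared schema loads cleanly...
conn.execute("INSERT INTO sales VALUES (?, ?, ?, ?)",
             (1, "Acme Corp", 199.99, "2024-01-15"))

# ...but a record with an unexpected shape is rejected at write time.
try:
    conn.execute("INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
                 (2, "Globex", 75.00, "2024-01-16", "unexpected-extra-field"))
except sqlite3.OperationalError as err:
    print(f"Rejected at write time: {err}")
```

A schema-on-read system, by contrast, accepts the raw record as-is and defers structural interpretation to query time, which is precisely the trade-off the next section on Data Lakes explores.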
Data Lakes
A Data Lake was developed to address the growing need for organizations to
store large amounts of structured and unstructured raw data in a centralized
location. The Hadoop ecosystem’s emergence, which permits the storage and
processing of big data, was a significant factor in the rise and adoption of
Data Lakes. Hadoop’s adaptable and scalable architecture enables data to be
stored in its original format, a substantial feature of Data Lakes.
A Data Lake is a centralized repository that allows you to store all your
structured and unstructured data at any scale. You can store your data as-is, without having to structure it first, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning—to guide better decisions.
Data Lakes, like any other system, have their advantages and disadvantages.
Here are a few to be considered:
Advantages:
Flexibility: Data Lakes allow storing all types of data (structured,
semi-structured, and unstructured) in their raw format.
Scalability: They can store vast amounts of data and are easily
scalable.
Cost-effective: Data Lakes are often more cost-effective than
traditional data warehousing solutions.
Disadvantages:
Data swamps: Without proper data governance and management,
Data Lakes can quickly become data swamps—unorganized, raw
repositories unusable for insights.
Security: Ensuring data security and privacy can be challenging due
to the diverse nature of data.
Complexity: Extracting meaningful insights requires advanced
tools and skills, which can add to the complexity.
While Data Lakes offer a flexible and scalable solution for storing
vast amounts of data, they require robust data governance strategies
to prevent them from becoming data swamps.
Data Lakehouse
In early 2020, data management experts proposed a new architecture pattern
that combined the best aspects of Data Lakes and Data Warehouses, called
the Data Lakehouse architecture. Its goal was to leverage the low-cost storage
and flexibility of Data Lakes with the reliable performance and data
structuring of Data Warehouses.
A Data Lakehouse provides a unified platform for various data workloads,
such as descriptive, predictive, and prescriptive analytics. It can handle
structured and unstructured data and enforce schema at both read and write
times, enabling traditional business intelligence tasks and advanced analytics
on the same platform. The advantages and disadvantages of a Data Lakehouse are similar to those of a Data Lake. Here are a few to be considered:
Advantages:
Flexibility: A Data Lakehouse can handle all types of data,
including structured and unstructured data, like a Data Lake.
Performance: It delivers reliable performance for complex queries,
drawing on Data Warehouse features.
Unified platform: A Data Lakehouse reduces the need for moving
data between systems by providing a unified platform for all types
of analytics.
Disadvantages:
Data swamps: Like Data Lakes, Data Lakehouses can become data
swamps without proper data governance and management.
Complexity: Implementing a Data Lakehouse architecture can be
complex, requiring a blend of technologies and skills from the Data
Lake and Data Warehouse worlds.
Maturity: As a relatively new concept, Data Lakehouse technologies and best practices are still evolving.
While these systems have served their purpose, they are fundamentally
simple patterns that may only partially meet the intricate requirements of
large and complex organizations. Therefore, there is a growing need to
explore new, scalable designs to address these complexities. In the next
section, we will discuss the need for a macro data architecture pattern that
strives to address these complexities.
How can large and complex organizations ensure that their decision support systems are governed appropriately yet still have the flexibility to innovate at their own pace?
Conclusion
This chapter introduces the concept of a Data Mesh, presenting it as a novel
approach to address the complexities and challenges of managing data at
scale in large and complex organizations. It emphasizes the limitations of
traditional data management architectures—Data Warehouses, Data Lakes,
and Data Lakehouses—in meeting the needs of such organizations,
particularly in balancing governance with flexibility. The chapter outlines the
evolution of data architectures and the need for a macro architecture pattern,
Data Mesh. This pattern is a decentralized, flexible, scalable, and governed
solution to data management.
The next chapter traces the evolution of data architecture, from the early
structured Relational Database Management Systems (RDBMS) to
expansive data lakes, and onto the innovative hybrid Data Lakehouse model.
This journey reflects the broader technological advancements and the
continuous pursuit of more efficient, scalable, and insightful data
management solutions. Understanding this historical progression sheds light
on the design decisions and trade-offs that have shaped today’s data
management practices, preparing us for future trends and informing strategic
decisions in adopting Data Mesh for complex organizational landscapes.
Key takeaways
Here are the key takeaways from the introductory chapter of the book:
Data management challenges: Traditional data management
architectures like Data Warehouses, Data Lakes, and Data Lakehouses
have strengths and weaknesses. However, they may not fully meet the
needs of large and complex organizations, especially in balancing
governance and flexibility.
Data Mesh concept: Data Mesh is a new approach to data architecture
that addresses the challenges of managing data at scale in large and
complex organizations. It combines the best aspects of Data Lakes and
Data Warehouses, providing a flexible, scalable, and governed solution
for data management.
Introduction
Structure
The chapter covers the following topics:
Era of monolithic data architecture
Era of Data Warehouses
The perfect storm
The era of Data Lakes
The era of Data Lakehouses
Introduction to Data Mesh
Objectives
This chapter examines the evolution of data architecture, tracing its progress
from the early monolithic systems through to the Data Mesh era. It aims to
provide insights into the design principles, the advancements, and the
decision-making processes that have shaped modern data management
practices, highlighting the transition from Data Warehouses and Data Lakes
to the innovative hybrid Data Lakehouse model and, finally, to the
decentralized Data Mesh approach.
By the end of this chapter, readers will understand how data architecture has
evolved over time, the importance of this evolution, and the potential benefits
of adopting a Data Mesh approach in today’s complex organizational
landscapes.
As illustrated in Figure 2.2, the EDW architecture consists of seven key
components:
Source systems: The organization’s operational data is stored in the
original databases and systems. These data stores could include CRM,
ERP, financial databases, and other transactional systems.
Extract, transform, and load (ETL) process: In the ETL process, data is first extracted from diverse source systems, gathered from various points of origin. The data then undergoes a transformation phase, in which it is cleaned, enriched, and reformatted to ensure consistency and usability. Finally, the refined data is loaded into a designated target database or data warehouse, making it readily accessible for subsequent analysis and reporting (a minimal sketch of this flow appears after this component list).
Staging area: This is a temporary storage area where data is processed
before being loaded into the EDW. It is used to hold data extracted
from source systems and to prepare it for loading. The tables in the
staging area are a replica of the data sources, aiming to decouple online transaction processing (OLTP) systems from the EDW.
Data warehouse: This is the central repository of processed and
transformed data. It is designed for efficient querying and reporting.
The underlying design principle of such a warehouse is the use of third normal form (3NF) schemas. The 3NF schema is a pivotal
concept in relational database design that aims to eliminate redundancy
and ensure data integrity. In simpler terms, 3NF mandates that there are
no transitive dependencies between non-key attributes. By adhering to
3NF, databases can maintain high consistency and accuracy, ensuring
that data is stored in its most granular form without unnecessary
repetitions. This streamlined structure optimizes data retrieval and
insertion processes and simplifies database maintenance and updates.
Data marts: These are subsets of the data warehouse tailored for
specific business areas or departments. Data marts can improve
performance by providing more localized access to data.
OLAP cubes (online analytical processing): These multi-dimensional
data structures allow for complex analytical and ad-hoc queries with
rapid execution times. They enable users to view data from different
perspectives.
Presentation layer: The front-end layer where users interact with the
data. It includes reporting tools, dashboards, and business intelligence
platforms that visualize and present data to end-users.
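The ETL flow referenced above can be sketched end to end in a few lines of Python. This is a minimal, illustrative example: the sample CSV, the cleaning rules, and the staging-table name are all invented assumptions, and a production pipeline would use a dedicated ETL tool rather than the standard library.

```python
import csv
import io
import sqlite3

# An invented CSV export standing in for a source system (note the messy rows).
RAW_CSV = """customer_id,country,revenue
 c001 ,us,100.456
,de,50.0
c002,De,75.5
"""

def extract(source):
    """Extract: pull raw rows out of the source system export."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows):
    """Transform: clean, standardize, and reformat for consistency."""
    cleaned = []
    for row in rows:
        if not row["customer_id"].strip():        # drop incomplete records
            continue
        cleaned.append({
            "customer_id": row["customer_id"].strip(),
            "country": row["country"].strip().upper(),   # one standard format
            "revenue": round(float(row["revenue"]), 2),
        })
    return cleaned

def load(rows, conn):
    """Load: land the refined rows in a staging table for the warehouse."""
    conn.execute(
        "CREATE TABLE stg_customers (customer_id TEXT, country TEXT, revenue REAL)"
    )
    conn.executemany(
        "INSERT INTO stg_customers VALUES (:customer_id, :country, :revenue)", rows
    )

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT * FROM stg_customers").fetchall())
# [('c001', 'US', 100.46), ('c002', 'DE', 75.5)]
```

The three functions mirror the three stages: raw rows leave the source untouched, the transform enforces a standard format, and the load lands the refined rows in a staging table ready for the warehouse.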
On the other hand, Ralph Kimball advocated for a bottom-up approach. His
methodology starts with creating data marts, which are smaller, subject-
specific subsets of a data warehouse tailored to specific business departments
or functions. Over time, these data marts can be integrated to form a
comprehensive data warehouse aligned with subject areas. This approach
allows for quicker delivery of business value, as individual data marts can be
developed and deployed rapidly.
The principle of dimensional modeling is central to Kimball's methodology. This design technique is tailored to enhance the efficiency of databases for querying and analytical tasks. A version of Ralph Kimball's approach is shown in the following figure:
Figure 2.3: EDW Architecture using dimensional modeling
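To contrast Kimball's dimensional modeling with the 3NF approach described earlier, the following sketch builds a minimal star schema: a central fact table of measures joined to descriptive dimension tables. The tables and the query are illustrative assumptions, not an excerpt from the book's figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables carry descriptive context (the "who" and "when").
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, segment TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);

-- The fact table carries numeric measures, keyed to the dimensions.
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    amount       REAL
);

INSERT INTO dim_customer VALUES (1, 'Acme Corp', 'Enterprise'), (2, 'Globex', 'SMB');
INSERT INTO dim_date     VALUES (1, '2024-01-15', '2024-01'), (2, '2024-02-03', '2024-02');
INSERT INTO fact_sales   VALUES (1, 1, 199.99), (2, 1, 75.00), (1, 2, 120.00);
""")

# Analytical queries join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT d.month, c.segment, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_key = c.customer_key
    JOIN dim_date d     ON f.date_key = d.date_key
    GROUP BY d.month, c.segment
""").fetchall()
print(rows)  # e.g., [('2024-01', 'Enterprise', 199.99), ('2024-01', 'SMB', 75.0), ...]
```

Where 3NF eliminates redundancy for consistent writes, the star schema deliberately denormalizes descriptive attributes into dimensions so that analytical reads need only a few predictable joins.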
Conclusion
As we have navigated the foundational principles underpinning the Data
Mesh paradigm, it is evident that each carries profound implications for the
future of data architecture. This exploration into the evolution of data
architecture demonstrates a continuous quest for more efficient, scalable, and
insightful data management solutions. The journey from monolithic systems
to the decentralized Data Mesh framework reveals an industry adapting to the
expanding needs of complex organizations. The chapter underscores the
pivotal shift towards a Data Mesh architecture, offering a nuanced approach
that balances the need for stringent governance with the agility required for
innovation. It sets the stage for a deeper dive into the principles underpinning
Data Mesh, preparing readers for the challenges and transformative potential
of this emerging paradigm.
In the next chapter, we will meticulously unpack these foundational
principles, exploring their transformative potential and the challenges they
present. The next chapter introduces domains and nodes as foundational
elements, facilitating logical data grouping and technical capabilities.
Key takeaways
Here are the key takeaways from this chapter:
The chapter outlines the transformation of data architecture over the
years, emphasizing the shift from monolithic systems and Data
Warehouses to Data Lakes and Data Lakehouses, culminating in the
introduction of Data Mesh.
It underscores the challenges and limitations of traditional data
management systems in accommodating the complexities of modern,
large-scale organizations.
The discussion highlights the significance of adopting a Data Mesh
approach to enhance scalability, flexibility, and governance in today’s
complex organizational landscapes.
With a foundational understanding now in place, our journey ahead will take
a deeper dive into each of these principles.
Introduction
Throughout this book, we have covered a lot of ground. We explored the
evolution of data architectures, including Monolithic Systems, Data
Warehouses, Data Lakes, and Data Lakehouses. While each of these systems
was groundbreaking in its time, each presented its own challenges, especially for
large, multifaceted organizations. As these entities expanded their horizons,
the need for a more adaptable, scalable, and nuanced data architecture
became increasingly evident. As discussed in the previous chapters, this is
where the Data Mesh, a novel macro data architecture, positions itself in the
ecosystem.
The Data Mesh has garnered significant attention in the industry, promising
to address the age-old governance-flexibility dilemma. It offers a fresh
perspective, emphasizing decentralization, autonomy, and viewing data as a
product.
The Data Mesh is built on three foundational principles that are central to its
success. In this chapter, we will delve deep into these architectural principles.
Structure
The chapter covers the following topics:
Understanding domains and nodes
The foundations of the principles
Principle 1: Domain-oriented ownership
Principle 2: Reimagining data as a product
Principle 3: Empowering with self-serve data infrastructure
Objectives
The objective of this chapter is to simplify the intricate architecture of Data
Mesh by breaking down its core elements. Our aim is to provide readers with
a clear understanding of domains and nodes, which are essential to the Data
Mesh structure.
The chapter aims to dissect the Data Mesh architecture by elucidating its core
principles and components—domains, nodes, and three foundational
principles, including Domain-oriented ownership, Reimagining Data as a
Product, and Empowering with Self-Serve Data Infrastructure. It strives to
detail how these elements collaborate to transform data management
practices, promoting a scalable, adaptable, and user-centric data ecosystem
within organizations.
By the end of this chapter, you will have a holistic understanding of where
the Data Mesh fits within the broader data ecosystem. Let us explore the
principles of data mesh.
Domain
In Data Mesh, the concept of the domain is crucial because it establishes the
scope and limitations of data ownership, governance, and collaboration.
Rather than a static or predetermined entity, a domain is dynamic and
contextual, shaped by an organization’s structure, operations, and problem-
solving approach.
Simply put, a domain is any logical grouping of organizational units that
serves a functional context while adhering to organizational constraints.
In a broader sense, this organizational system is a constantly evolving
interplay between a central unit and its subunits, which affects the
organization’s coherence and function.
The figure below illustrates the relationship between a central unit and its
subunits:
Central unit
At the top of this structure is the central unit, which serves as the
organizational hub. This unit provides guidance and direction, issuing
directives to be carried out by the various subunits. Its responsibilities include
allocating budgets for initiatives across subunits and providing platforms
catering to the organization’s needs. It creates a roadmap that unites the entire
organization’s efforts under a common goal. These platforms serve as shared
resources, ensuring everyone follows the same practices and shares a
collective purpose.
Subunits
In this complex network, subunits appear as diverse nodes, each with distinct
functions and degrees of autonomy. The level of independence subunits enjoy
varies depending on the organization’s structure and culture. Subunits usually
fall into three categories:
Different organizations within a group: This includes entities that
operate within the same group organization, sometimes across other
geographical regions. While these entities remain interconnected, they
retain their identities. They may receive guidance from the central unit
or share resources but maintain a certain level of self-sufficiency.
Independent business units: Separate business units often emerge
within the same organization. These units operate autonomously and
cater to diverse markets, products, or services. Their independence
allows them to tailor strategies to their specific objectives.
Intra-organizational departments: These microcosms within the
organization serve specific functions such as marketing or sales. They
represent specialized domains of expertise that contribute to the
organization’s overall functioning.
The interplay between the central unit and its subunits gives rise to the concept of domains, which encapsulate function and constraint. A domain is a logical grouping of organizational units designed to fulfill a specific functional context while adhering to organizational constraints. A domain consists of two
critical elements:
Functional context: The functional context refers to the task that the
domain is assigned to perform. It is what gives the domain its purpose.
For instance, a domain may be focused on providing customer service,
generating insights from data, or developing a new feature.
Organizational constraints: Domains operate within boundaries defined by business constraints such as regulations, people and skills, and operational dependencies. These constraints shape a domain's operations and align them with the overarching organizational objectives, and they can limit or influence the domain's scope, boundaries, and interactions. For example, a domain may be subject to different legal or regulatory requirements depending on its geography, industry, or customer segment.
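As a rough sketch (an illustrative model, not a formal specification from the book), a domain's two elements can be captured in a small Python structure:

```python
from dataclasses import dataclass, field

@dataclass
class Domain:
    """Illustrative model: a logical grouping of organizational units."""
    name: str
    functional_context: str      # the task the domain is assigned to perform
    constraints: list = field(default_factory=list)  # regulations, skills, dependencies

claims = Domain(
    name="EU Insurance Claims",
    functional_context="Process and analyze customer claims",
    constraints=["GDPR (EU customer data)", "Actuarial skills availability",
                 "Depends on the central policy-admin platform"],
)
```

The same structure applies whether the domain is a product group, a department, or a subsidiary; only the functional context and the constraint list change.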
Domains appear in different contexts, including:
Product groups: These domains focus on creating and delivering
specific products or services. They work towards the goal of product
development.
Departmental domains: These are the functional departments of an
organization, like marketing or sales. Each department operates under
the organization’s umbrella while catering to its specialized functions.
Subsidiaries: When an organization spans different geographic
regions, each subsidiary becomes a domain. Subsidiaries retain their
individual identities and operational dynamics while being connected
to the larger entity.
Domains are essential in a data mesh architecture as they manage their own
data products. By understanding the dynamic relationship between Central
Units, Subunits, and Domains, organizations can skillfully handle the
challenges and opportunities of functions, constraints, and goals, creating a
cohesive and effective ecosystem.
Now that we understand the concept of the domain, let us investigate the concept of a Node.
Node
Each domain has unique data requirements that must be fulfilled for
reporting, analytics, or machine learning to support decision-making
processes. Nodes provide technical capabilities, such as decision support, for
specific domains. These nodes play a crucial role in ensuring a seamless flow
of data and insights tailored to the unique functional contexts of each domain.
A node is a technical component that enables a domain to produce, consume,
and share its data products. Nodes can have various sub-components or data
products that provide different functionalities and services to meet the data
needs of the domain.
For instance, a node that supports decision-making for a domain may have
sub-components such as:
A Data Warehouse, Data Lake, or Data Lakehouse that stores and
organizes the domain’s data in a structured or semi-structured format.
A Data Catalog that provides metadata and documentation about the
domain’s data products, such as schemas, definitions, lineage, quality,
etc.
A Data Processing Engine that performs transformations, aggregations,
and calculations on the domain’s data using batch or streaming
methods.
A Machine Learning Platform that enables the domain to build, train,
and deploy predictive models using its data.
A Data Visualization Tool that allows the domain to explore, analyze,
and present its data using charts, dashboards, and reports.
The figure below illustrates how a Data Lakehouse can serve as the node that
meets the technical requirements of a domain:
Figure 3.2: The concept of a domain node
Aspects of the principle of reimagining data as a product
The principle of Data as a product is simple. It means treating the
information your organization collects and uses as a good or service you
sell. Rather than viewing data as merely a set of numbers or facts that result
from doing business, you recognize it as something valuable that can be
improved, packaged, and provided to others in a way that adds value to your
organization and helps them.
Just as you would carefully create and sell a physical product, you put the
same care and attention into collecting, storing, and sharing your data. This
enables you to use the data to make better decisions, improve services, or
create new forms of value, such as supporting teams in working more
efficiently or providing insights that lead to better customer experiences. The
Data as a Product principle focuses on five key aspects.
Let us now deep-dive into the five key aspects of this principle.
Conclusion
This chapter has distilled the fundamental principles that drive the concept of
Data Mesh. We began by exploring the Domain and the Node concepts,
recognizing their crucial role in orchestrating the data ecosystem and building
a functional data mesh infrastructure.
Next, we introduced a structured framework for analyzing principles through
three lenses: Aspects, Rationale, and Implications. This framework helped us
identify different aspects of principles, provide the logical foundation for
each principle, and outline the consequential impacts that result from
implementing these principles.
We focused on the three principles and discussed their philosophical
underpinnings. The three principles of Data Mesh Architecture aim to
modernize and optimize how data is managed and utilized within
organizations:
Domain-oriented ownership: This principle advocates for each
business domain to take full responsibility for its data throughout its
lifecycle, including ingestion, transformation, quality, and distribution.
It emphasizes the importance of treating data as a distinct entity within
each domain, with clear definitions, documentation, and interfaces
tailored to the needs of its users. The goal is to decentralize data
ownership to enhance quality, relevance, and agility by empowering
individual domains.
Reimagining data as a product: Data is repositioned from being a
secondary by-product to a primary asset, underscoring the importance
of viewing and treating data as a product in its own right. This
perspective shift encourages the creation of high-quality, well-
documented, and easily accessible data products that serve the needs of
their consumers, thereby increasing the overall value derived from
data.
Empowering with self-serve data infrastructure: This principle
focuses on making data easily accessible and usable to all
organizational members without the need for extensive specialist
intervention. By establishing a self-serve data infrastructure,
individuals can access and utilize data as needed, enhancing efficiency,
fostering innovation, and promoting a culture of data-driven decision-
making across the organization.
Together, these principles aim to create a more dynamic, decentralized, and
user-centric data architecture, leading to improved data quality, accessibility,
and utility across the organization.
Looking ahead, we will delve deeper into the architectural nuances that
dictate the configuration and operationality of Data Mesh systems. The next
chapter will provide an in-depth exploration of Data Mesh topologies,
specifically:
The authoritative structure in the Fully Governed Data Mesh Pattern.
The decentralized freedom in the Fully Federated Data Mesh Pattern.
The harmonious blend encountered in the Hybrid Data Mesh Pattern.
Key takeaways
Following are the key takeaways from this chapter:
Data Mesh architecture hinges on domains and nodes, structuring the
data ecosystem for clarity and efficiency.
The Governance-Flexibility Spectrum underlines the balance between
strict governance and operational flexibility within domains, promoting
a tailored approach to data management that aligns with specific
business goals and regulatory requirements.
Domain-oriented Ownership emphasizes autonomous lifecycle
management of data within domains.
Reimagining Data as a Product advocates for a consumer-focused,
value-driven approach to data management.
Empowering with Self-Serve Data Infrastructure champions accessible,
self-manageable data infrastructure for agility and independence in
team operations.
Introduction
In the previous chapter, we explored three key principles of data mesh:
domain-oriented ownership, data as a product, and self-serve data infrastructure. These principles serve as a guide for designing and
implementing a data mesh. It is important to note that there is no one-size-
fits-all architecture for data mesh, as different domains and use cases may
require different patterns.
In macro architecture patterns, especially when discussing data mesh, it is
essential to understand the fluidity between certain terminologies. Notably,
conceptual architecture is often synonymous with topology. This
interchangeability stems from the inherent nature of these patterns, where the
overarching design (or architecture) often dictates the arrangement and
interrelation of parts (or topology). For clarity and consistency in this chapter,
readers should know that architecture and topology will be used
interchangeably. Both words aim to convey the data mesh framework’s
structural design and interconnections.
This chapter will first delve into the Component Model of Data Mesh, laying
the groundwork for understanding the building blocks that constitute this
architecture. This model provides a blueprint detailing the essential elements
and their interplay within the Data Mesh ecosystem.
Following this foundational understanding, we will explore the three distinct
architectural patterns that shape the Data Mesh landscape and understand
how organizations choose between these patterns.
Structure
The chapter covers the following topics:
Data mesh component model
Fully governed data mesh architecture
Fully federated data mesh architecture
Hybrid data mesh architecture
Domain placement methodology
Objectives
This chapter delineates the architectural patterns within Data Mesh, including
the Fully Governed, Fully Federated, and Hybrid Data Mesh Architectures. It
focuses on understanding these patterns’ implications for governance and
flexibility, offering a methodology for determining the appropriate
architecture for different organizational domains based on their specific needs
and characteristics.
By the end of this chapter, you will have a comprehensive understanding of
the components of the data mesh, its architectural topologies, and the
considerations for choosing the right topology.
Let us explore each of these components. We will focus on the purpose, the
functionality, and the usage of each component.
Domain
Let us start by revisiting the domain concept and its role, as discussed in Chapter 3, The Principles of Data Mesh Architecture.
These constructs are logical components of the data mesh architecture. As
mentioned earlier, the domain concept is essential in a Data Mesh
architecture because it establishes the scope and boundaries for data
ownership, governance, and collaboration. A domain is dynamic and context-
specific, influenced by an organization’s structure, operations, and problem-
solving approach.
The functional context and organizational constraints shape a domain. In a
broader sense, this organizational system is a continuously evolving interplay
between a central unit and its subunits, which impacts the organization’s
coherence and functioning.
Let us focus on the next data mesh architecture component, the domain node.
Domain node
We also briefly discussed the concept of the domain node in Chapter 3, The Principles of Data Mesh Architecture.
A domain is a logical grouping of business functions or processes with
common data needs and objectives.
A data product is a piece of data that brings value to the domain or other
consumers.
A domain node is a component that allows a domain to create, use, and share
its data products. The main objective of a domain node is to provide specific
technical capabilities, such as decision support, tailored to the unique data
requirements of each domain. Data products can be raw, refined, or derived
data stored, processed, or visualized using various technologies and methods.
The purpose of a domain node is to provide technical capabilities and
services that support the data requirements and objectives of the domain. A
domain node allows the domain to:
Manage and govern its data products, such as defining schemas, quality
standards, access policies, and so on.
Perform data operations, such as ingestion, transformation, enrichment,
analysis, and so on.
Deliver data insights, such as reports, dashboards, models, predictions,
and so on.
Collaborate with other domains and consumers, such as publishing
metadata, sharing data products, providing feedback, and so on.
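As a hedged illustration of the first responsibility above—managing and governing data products—the following sketch shows the kind of descriptor a domain node might keep for each product. The field names and the 99% completeness threshold are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Illustrative descriptor a domain node might hold for one data product."""
    name: str
    owner_domain: str
    schema: dict                 # column name -> declared type
    completeness_slo: float      # minimum share of fully populated rows
    access_policy: list = field(default_factory=list)

monthly_sales = DataProduct(
    name="monthly_sales",
    owner_domain="Retail",
    schema={"month": "TEXT", "region": "TEXT", "revenue": "REAL"},
    completeness_slo=0.99,
    access_policy=["analytics-team", "finance-domain"],
)

def meets_completeness_slo(rows, product):
    """Quality check the node could run before publishing the product."""
    complete = sum(
        1 for r in rows
        if all(r.get(col) not in (None, "") for col in product.schema)
    )
    return complete / max(len(rows), 1) >= product.completeness_slo
```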
A domain node’s functionality depends on the domain’s specific needs and
context. A domain node can have various sub-components or data products
that provide different functionalities and services. For example, a domain
node that supports decision-making for a domain may have sub-components
such as:
A Data Warehouse, Data Lake, or Data Lakehouse that stores and
organizes the domain’s data in a structured or semi-structured format.
A Data Processing Engine that performs transformations, aggregations,
and calculations on the domain’s data using batch or streaming
methods.
A Machine Learning and AI Engine that enables the domain to build,
train, and deploy predictive models using its data.
A Data Analytics Engine that allows the domain to explore, analyze,
and present its data using charts, dashboards, and reports or create
datasets that other domains can use.
The usage of a domain node is determined by the roles and responsibilities of
the domain and its stakeholders. A domain node can be used by:
Data producers who create and maintain the data products within the
domain node.
Data consumers who access and use the data products from the domain
node or other nodes.
Data stewards who oversee and ensure the quality, security, and
compliance of the data products within the domain node.
Data engineers who design and implement the technical infrastructure
and architecture of the domain node.
Data scientists who apply advanced analytics and machine learning
techniques to the data products within the domain node.
Data analysts who perform descriptive and exploratory analysis on the
data products within the domain node.
A domain node is a key component of the data mesh architecture that enables
a decentralized and distributed approach to data management. By
empowering domains to own and operate their nodes, the data mesh
architecture aims to achieve scalability, agility, autonomy, and alignment
across the organization.
Data is one of the most valuable assets of any organization. However, data
can also be complex, diverse, and distributed across domains and systems. To
make the most of data, it is essential to have a clear and consistent understanding of what data is available, where it comes from, how it is used, and what it means. This is where data cataloging and curation come in.
Let us focus on the next data mesh architecture component, the data catalog.
Data Catalog
A Data Catalog is a component of the data mesh architecture that provides an
inventory of data assets. It helps users to discover, understand, and trust data
by providing metadata, documentation, lineage, quality, and governance
information. A data catalog enables users to search, browse, and access data
through a user-friendly interface.
The purpose of the data catalog component is to facilitate data discovery and
consumption by providing a unified view of the data landscape. It also
supports data governance and compliance by properly documenting,
classifying, and securing data. Using a data catalog component, users can
find the right data for their needs, understand its context and meaning, and
use it confidently. The following diagram distills the data catalog’s purpose,
functionality, and usage.
The Data Catalog component offers the following functionality:
Data discovery: Users can search for data using keywords, filters,
facets, and natural language queries. It also provides recommendations
and suggestions based on user preferences and behavior.
Data understanding: Rich metadata and documentation are provided
for each asset, including name, description, owner, source, schema,
format, tags, categories, and so on. It also shows the lineage and
relationships of the data, such as how it was created, transformed, and
consumed.
Data quality: The component monitors and measures the quality of the
data assets using various metrics and indicators, such as completeness,
accuracy, validity, timeliness, consistency, and so on. It also provides
alerts and notifications for any quality issues or anomalies.
Data governance: Policies and rules for managing and using the data
assets are enforced. It also tracks and audits the changes and activities
on the data assets to ensure compliance and accountability.
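To make the discovery and understanding functions tangible, here is a toy in-memory catalog in Python. Real data catalogs are dedicated products with far richer capabilities; every entry below is invented.

```python
# A toy, in-memory data catalog: each entry carries the metadata that makes
# an asset discoverable and understandable. All entries are invented.
catalog = [
    {
        "name": "customer_360",
        "description": "Unified customer profile across CRM and billing",
        "owner": "Marketing domain",
        "tags": ["customer", "profile", "golden-record"],
        "lineage": ["crm.contacts", "billing.accounts"],
        "quality": {"completeness": 0.97, "last_checked": "2024-03-01"},
    },
    {
        "name": "daily_orders",
        "description": "Order transactions aggregated per day",
        "owner": "Sales domain",
        "tags": ["orders", "transactions"],
        "lineage": ["erp.orders"],
        "quality": {"completeness": 0.99, "last_checked": "2024-03-02"},
    },
]

def discover(keyword):
    """Data discovery: match a keyword against names, descriptions, and tags."""
    kw = keyword.lower()
    return [
        asset for asset in catalog
        if kw in asset["name"] or kw in asset["description"].lower() or kw in asset["tags"]
    ]

for asset in discover("customer"):
    # Data understanding: surface ownership and lineage alongside each hit.
    print(asset["name"], "owned by", asset["owner"], "| lineage:", asset["lineage"])
```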
The usage of the Data Catalog component involves the following:
Data producers create and publish the data assets to the catalog
component. They also provide metadata and documentation to make
them discoverable and understandable.
Data consumers access and use data assets through the data catalog component. They can search for relevant data assets using various criteria
and methods. They can also view the metadata and documentation of
the data assets to understand their context and meaning.
Data stewards oversee and maintain the quality and governance of the
data catalog component’s data assets. They can define and apply
policies and rules for the data assets. They can also monitor and review
the quality and usage of the data assets.
The data catalog component is a crucial element of the data mesh
architecture that enables a decentralized and distributed approach to
managing and sharing data across domains. Using a data catalog
component, users can leverage the power of data more efficiently and
effectively.
Let us focus on the next data mesh architecture component, the data share
component.
Data Share
Data sharing in the Data Mesh architecture is the conduit for exchanging
information between domains. It involves the structured dissemination of
data from multiple sources, regardless of format or size, to ensure that
information can be easily accessed and utilized across different domains.
The main purpose of data sharing is to provide controlled access to data. It
allows for implementing data-sharing policies, ensuring that data is shared in accordance with organizational guidelines, regulatory requirements, and legal constraints. This is
particularly important in highly regulated industries, where selective data
sharing is necessary to comply with industry standards and regulations.
The data-sharing component offers a service that enables data to be shared in
any format and size from multiple sources, both within and outside an
organization. This service also provides the necessary controls to facilitate
data sharing and allows for creating data-sharing policies. Additionally, it
enables data sharing in a structured manner and provides complete visibility
into how the data is shared and utilized.
The data-sharing component supports various use cases that involve data
integration, analysis, or consumption across domains or organizations. For
instance, data sharing can be utilized to:
Share data for cross-domain analytics or reporting.
Share data for external collaboration or partnership.
Share data for compliance or regulatory purposes.
Share data for innovation or experimentation.
A Data Share service can leverage the data catalog component’s existing
metadata and governance capabilities to discover, describe, and document the
shared data. It can also employ encryption, authentication, and authorization
mechanisms to ensure the security and privacy of the shared data.
The extent to which the data landscape needs to be shared within an
organization’s subunits and central unit depends on various factors, such as
business objectives, organizational culture, and regulatory constraints.
Ideally, complete data sharing would involve every subunit having a
comprehensive view of the data available with the central unit and other
subunits. However, this may not always be the case. For example, there may
be scenarios where subunits are bound by legal or ethical restrictions,
especially in highly regulated industries. In such cases, selective data sharing
between subunits and the central unit would be necessary.
Data sharing is an integral component of the mesh architecture that facilitates
distributed and collaborative data management. It empowers data producers
and consumers to share and access data in a self-service and interoperable
manner while ensuring security and governance. Furthermore, data sharing
promotes a culture of openness and trust among parties that utilize data for
various purposes.
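A minimal sketch of the selective-sharing logic described above follows; the dataset names, consumers, and required controls are hypothetical placeholders.

```python
# Hypothetical sharing policies: which consumers may receive which data
# products, and which controls must be in place first.
SHARING_POLICIES = {
    "eu_customer_pii": {
        "allowed_consumers": {"eu-analytics"},            # GDPR-style restriction
        "required_controls": ["purpose_statement", "dpa_signed"],
    },
    "product_telemetry": {
        "allowed_consumers": {"eu-analytics", "us-analytics", "central-unit"},
        "required_controls": [],
    },
}

def approve_share(dataset, consumer, provided_controls):
    """Grant access only if the consumer is allowed and all controls are met."""
    policy = SHARING_POLICIES.get(dataset)
    if policy is None or consumer not in policy["allowed_consumers"]:
        return False
    return all(c in provided_controls for c in policy["required_controls"])

print(approve_share("eu_customer_pii", "us-analytics", ["purpose_statement"]))  # False
print(approve_share("product_telemetry", "central-unit", []))                   # True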
As depicted in the figure, the architecture of the fully federated data mesh
consists of the following components:
Domain and domain node: As in the fully governed topology, a domain is a logical boundary representing a business area or function. Each domain has its data products, which are the data units
that provide value to the consumers. A domain can be a producer, a
consumer of data, or both. Every domain in this architecture can have
its node, a technical component that supports its decision-making
processes.
Domain Data Catalog: Each domain maintains its data catalog
without a central hub. This catalog is a comprehensive metadata
repository detailing the domain’s data assets, lineage, and other
pertinent details. It ensures that within the domain, there is clarity on
data assets and their characteristics.
Domain Data Share: Data sharing in the fully federated pattern is a
peer-to-peer affair. Domains share data, but there is no central entity
mediating this exchange. This direct sharing ensures quicker data
access and reduces dependencies.
The interaction between the domains in the fully federated data mesh is based
on the principle of self-service. Domains interact with each other directly.
Without a central hub to mediate, interactions are more streamlined.
However, this also means that domains must be proactive in ensuring they
adhere to the overarching governance framework of the organization.
Each domain can discover and consume data products from other domains
using their respective data catalogs and shares. The domains do not need to
coordinate or synchronize with each other, as they are responsible for their
own data quality and availability. The domains can also publish their data
products in a governed manner to other domains using their respective data
shares.
The governance model in the fully federated data mesh is based on the
principle of decentralization. While the Fully Governed model has a hub
domain setting the governance framework, each domain is responsible for its
governance in the fully federated pattern. They own their data products end-
to-end, from cataloging to curation. However, they still align with the broader
governance framework of the organization, ensuring consistency in data
operations. The governance model ensures that the domains are accountable
for their data products while also ensuring compliance and trustworthiness at
the enterprise level.
Cataloging in a fully federated data mesh topology is domain-specific. Each
domain details its data assets, ensuring that within its realm, there is a clear
understanding of the available data, its sources, and its characteristics. The
cataloging of each domain in the fully federated data mesh is based on the
principle of interoperability. Each domain uses its schema and vocabulary to
describe its data products, but it also maps them to a common ontology that
enables semantic understanding across domains. The common ontology can
be based on any industry or domain-specific standard that facilitates cross-
domain discovery and integration. The cataloging of each domain also
follows a common metadata model that captures the essential attributes and
relationships of the data products.
Data sharing is direct and governed. The sharing of data between the domains
in the fully federated data mesh is based on the principle of openness.
Domains share data, ensuring the exchange aligns with the overarching
governance framework. This peer-to-peer sharing is quicker and more
efficient than the hub-spoke model but requires domains to be more vigilant
in maintaining data integrity. Each domain exposes its data products to other
domains using standard APIs and protocols that enable easy and secure
access. Data sharing also follows a common contract model that specifies the
terms and conditions for data consumption and usage across domains. The
contract model can include service-level agreements (SLAs), pricing, quality, privacy, and security terms.
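A data-sharing contract of this kind could be expressed as a small, machine-readable document. The structure below mirrors the terms the text lists (SLA, pricing, quality, privacy, security), but every concrete value is an invented assumption:

```python
import json

# An illustrative peer-to-peer data-sharing contract between two domains.
contract = {
    "product": "orders_curated",
    "producer_domain": "sales",
    "consumer_domain": "finance",
    "sla": {"freshness_hours": 24, "availability": "99.5%"},
    "pricing": {"model": "internal-chargeback", "unit": "per-query"},
    "quality": {"completeness_min": 0.98, "schema_version": "1.2.0"},
    "privacy": {"pii_columns": [], "retention_days": 365},
    "security": {"transport": "TLS", "auth": "OAuth2 client credentials"},
}

print(json.dumps(contract, indent=2))
```

Because the contract is machine-readable, both the producing and consuming domains can validate it automatically as part of publishing or subscribing to a data product.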
In summary, the fully federated data mesh architecture is a pattern that
empowers each domain to be autonomous and independent of any central
hub. It enables each domain to manage its own data end-to-end while also
allowing it to share and consume data from other domains in a governed way.
It is a pattern that supports scalability, agility, and innovation in a complex
and dynamic environment. This balance between autonomy and alignment
makes it a compelling choice for organizations seeking flexibility in their
data operations.
Figure 4.5: Hybrid Data Mesh Architecture with a spoke domain acting as a shared domain.
Figure 4.6: Hybrid Data Mesh Architecture with a hub domain acting as a shared domain
As depicted in both flavors of the hybrid data mesh, the interaction between
the domain networks is mediated by the shared domain. The shared domain is
a hub for the fully governed domain network and a spoke for the fully
federated domain network. The shared domain provides a common interface
for accessing and sharing data across different patterns.
The governance model for hybrid topology combines the governance models
for the fully governed and fully federated topologies. The governance model
defines the roles, responsibilities, policies, standards, and processes for
managing data quality, security, privacy, ethics, and compliance. The
governance model also defines how conflicts and inconsistencies between the different patterns are resolved.
The cataloging and sharing of data in each domain depend on the pattern of
the domain network. In a fully governed domain network, the cataloging and
sharing of data are centralized and controlled by a central authority. In a fully
federated domain network, the cataloging and sharing of data are
decentralized and self-managed by each domain. In a shared domain, the
cataloging and sharing of data are aligned with both patterns, depending on
the source and destination of the data.
The hybrid topology recognizes that organizations are multifaceted entities
and that a rigid approach might not always fit. By combining the strengths of
the fully governed and fully federated patterns, the hybrid approach offers a
flexible, robust solution for
complex organizations, ensuring that data is consistent, reliable, secure,
ethical, and compliant across different domains.
Once organizations have identified their domains, the decision between a
fully governed or a fully federated data mesh topology hinges on where the
domain lies in the governance-flexibility spectrum. In our subsequent section,
we will delve into the methodology to determine the appropriate topology for
a domain.
A domain’s position on this spectrum is assessed against parameters such as
its functional context, regulations, operations, and technical capabilities.
Let us discuss each of these parameters in detail and how they affect the
placement of a domain in the spectrum.
Functional context
A domain’s functional context is the task it is assigned to perform. A
domain’s degree of autonomy for fulfilling its functional context determines
its governance flexibility. A domain with more autonomy for its functional
context can be more flexible in defining its data products, contracts, quality
standards, and access policies. It can also be more responsive to changing
business needs and customer demands. Such a domain is suitable for being
part of a fully federated domain network. A domain with less autonomy for
its functional context may have to adhere to strict requirements and
specifications from other domains or external stakeholders. It may also have
to coordinate with other domains or centralized services for data integration,
validation, security, and governance. Such a domain is a better candidate for
being part of a fully governed domain network.
Regulations
The regulations of a domain are the rules and laws that it has to comply with.
These can be internal or external regulations that affect its functional context,
data products, data quality, data security, data privacy, or data ethics. A
domain’s degree of independence for complying with regulations determines
its governance flexibility.
A domain with more independence for complying with regulations can be
more flexible in interpreting and implementing them. It can also proactively
identify and mitigate potential risks and issues. Such a domain is an optimal
candidate for being part of a fully federated domain network.
A domain with less independence for complying with regulations may have
to follow strict guidelines and standards from other domains or external
authorities. It may also have to report and audit its compliance regularly and
transparently. Such a domain is an adequate candidate for being part of a
fully governed domain network.
Operations
The operations of a domain are the activities and resources it uses to fulfill its
functional context. This includes the planning, execution, monitoring,
optimization, and maintenance of its data products and services. A domain’s
degree of independence for controlling its operations determines its
governance flexibility.
A domain with more independence for controlling its operations can be more
flexible in allocating and managing its resources, such as time, money,
infrastructure, tools, etc. It can also be more efficient in delivering value to its
customers and stakeholders. Such a domain is an excellent candidate for
being part of a fully federated domain network.
A domain with less independence for controlling its operations may have to
follow predefined plans and budgets from other domains or centralized
services. It may also have to share or outsource some of its resources or
capabilities to other domains or external providers. Such a domain is an
acceptable candidate for being part of a fully governed domain network.
Technical capabilities
The technical capabilities of a domain are the technologies and services it
uses to fulfill its functional context. This includes selecting, implementing,
and managing its data platforms, pipelines, models, APIs, analytics, data
visualization, etc. A domain’s degree of independence for choosing and
implementing its technical capabilities determines its governance flexibility.
A domain with more independence for choosing and implementing its
technical capabilities can be more flexible in adopting and innovating with
new technologies and services. It can also be more scalable and resilient in
handling data volume, velocity, variety, and veracity. Such a domain is an
outstanding candidate for being part of a fully federated domain network.
A domain with less independence for choosing and implementing its
technical capabilities may have to use standardized or prescribed
technologies and services from other domains or centralized platforms. It
may also have to integrate or migrate its data products and services to other
domains or external systems. Such a domain is a reasonable candidate for
being part of a fully governed domain network.
Placing a domain within the Data Mesh topology is not a one-size-fits-all
decision. It is a calculated choice influenced by multiple parameters
determining a domain’s relative independence. While the spectrum provides a
guideline, the organization’s unique context will dictate the final placement.
As we delve deeper into Data Mesh patterns, understanding this spectrum
becomes pivotal for organizations aiming to harness the full potential of their
data domains.
Let us now look at an example of how this methodology can be applied to
place a domain.
Methodology in action
In this section, we will illustrate how to apply the methodology we
introduced in the previous section to determine the placement of a domain in
a data mesh architecture. The placement of the domain is guided by its
domain placement score.
The domain placement score is computed as the sum of the products of each
parameter’s weight and score.
Parameter weightage
The weightage reflects the significance of a particular parameter for a
domain. Represented as a number between 0 and 1, it quantifies the
importance. A higher value indicates greater relevance. It is crucial that the
cumulative weightage of all parameters equals one, so that the evaluation
remains balanced.
Parameter score
The score, ranging between 1 and 5, gauges the domain’s flexibility
concerning a specific parameter. A higher score signifies greater flexibility,
indicating the domain’s ability to operate autonomously and adapt to
changes. Consider the following:
DomainPlacementScore = Σᵢ (ParameterWeightᵢ × ParameterScoreᵢ)
The domain placement score is the compass:
If it is three or above (the median score), the domain aligns more with a
fully federated domain network, suggesting it can operate with
significant autonomy and flexibility.
Conversely, scores below three indicate a better fit for a fully governed
domain network, where centralized governance is more appropriate.
Let us look at an example. The following figure shows the parameter
weightages and scores for a domain (Domain 1):
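As a minimal sketch in the spirit of that figure, the following assumes
hypothetical weights and scores for Domain 1 and computes the placement
score as defined above:

# Hypothetical weights and scores for Domain 1 (parameter: (weight, score)).
# The weights must sum to one; scores range from 1 to 5.
domain_1 = {
    "functional_context":     (0.30, 4),
    "regulations":            (0.25, 2),
    "operations":             (0.25, 4),
    "technical_capabilities": (0.20, 3),
}

assert abs(sum(w for w, _ in domain_1.values()) - 1.0) < 1e-9

placement_score = sum(w * s for w, s in domain_1.values())
print(f"Domain placement score: {placement_score:.2f}")  # 3.30

# Three or above suggests a fully federated network; below three, fully governed.
topology = "fully federated" if placement_score >= 3 else "fully governed"
print(f"Suggested topology: {topology}")

With these assumed values, Domain 1 scores 3.30 and would therefore lean
toward a fully federated domain network.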
Conclusion
This chapter explores the various architectural patterns or topologies of Data
Mesh. We begin by discussing the component model of Data Mesh, which
defines the essential elements and their interactions within the architecture.
We also examine three architectural patterns that shape the Data Mesh
landscape: fully governed, fully federated, and hybrid. These patterns
represent different trade-offs between centralized governance and
decentralized flexibility. We also discuss how to determine the placement of
a domain within a hybrid mesh topology based on its position on the
governance-flexibility spectrum. Here are the key takeaways from this
chapter:
Data Mesh consists of four components: domains, domain nodes, the data
catalog, and the data sharing component.
Data Mesh can be implemented using three architectural patterns: fully
governed, fully federated, and hybrid. Each pattern has advantages and
disadvantages, depending on the organization’s context and goals.
The placement of a domain within a hybrid mesh topology depends on
where it falls on the governance-flexibility spectrum. This spectrum
represents the trade-off between governance and the domain’s
flexibility.
In the next chapter, we delve into the crucial role of data governance within
the Data Mesh, underscoring its significance and the consequences of
inadequate governance structures. We address the limitations of
conventional, centralized governance models in the context of a Data Mesh,
advocating for a novel, decentralized approach to governance that harmonizes
with the Mesh’s inherent structure. Further, we outline a practical governance
framework tailored for the Data Mesh, detailing its objectives, goals, and
essential components.
Key takeaways
Following are the key takeaways from this chapter:
Data Mesh architecture is built around domains, domain nodes, data
catalogs, and data sharing components, each playing a vital role in
decentralizing data management across an organization.
The architecture manifests in three patterns: Fully Governed, with
centralized control; Fully Federated, granting domain autonomy; and
Hybrid, a balanced mix of centralized governance and domain
flexibility.
Selection of an architectural pattern relies on a domain’s placement on
the governance-flexibility spectrum, balancing between centralized
governance for consistency and domain autonomy for flexibility.
A methodology for determining a domain’s architectural alignment
assesses its independence in functional context, personnel skills,
regulatory adherence, operational control, and technological
capabilities.
Implementing the right Data Mesh topology—governed, federated, or
hybrid—requires a nuanced understanding of an organization’s specific
needs, ensuring effective data governance and operational efficiency.
Introduction
– Louis Brandeis.
As we have emphasized so far, the data mesh architecture treats data as a
product, with each domain responsible for managing its data. This approach
enables agility, autonomy, and scalability for data-driven organizations but
also introduces complexities and risks for data governance. Here is where the
practice of data governance comes into play. Data governance establishes
policies, standards, and practices to ensure data quality, security, privacy, and
compliance. It is crucial for organizations that want to use data as a strategic
asset and derive value from it. Data governance is an evolving practice that
adapts to data consumers’ and stakeholders’ changing needs and
expectations.
In the context of a data mesh, data governance becomes even more critical
and challenging. It must address questions such as how to maintain consistent
data quality, security, privacy, and compliance across multiple domains, and
how to enable seamless data discovery, access, and collaboration across the mesh.
It must also find the right balance between decentralization and centralization
in data governance.
This chapter will explore these topics and provide practical guidance on
implementing effective data governance in a data mesh.
Structure
The chapter covers the following topics:
Importance of data governance
Traditional data governance: A centralized approach
Data mesh governance framework
The governance goal
The seven objectives
The three governance components
Objectives
The chapter aims to dissect the intricate nature of Data Governance within the
Data Mesh model, highlighting its pivotal role in the mesh’s success. It
addresses the challenges of traditional governance approaches in a
decentralized setup and proposes a novel governance framework tailored to
Data Mesh. This framework outlines clear goals, objectives, and essential
components, ensuring data integrity, compliance, and efficient collaboration
across diverse domains. The objective is to equip organizations with the
knowledge to implement robust governance practices that align with the
decentralized, domain-oriented essence of Data Mesh, fostering a reliable,
agile, and compliant data ecosystem.
By the end of this chapter, you will have a solid understanding of how to
design and implement effective data governance in a data mesh.
Now that we have an overview of the framework, let us discuss each of its
elements.
The three organizational bodies crucial in achieving this are the Data
Management Office (DMO), the Data Governance Council, and the Data
Domain Leadership. They are explained in the following points:
Data Management Office (DMO): The DMO is responsible for
defining policies and standards, empowering data leaders, and ensuring
coordination and consistency across key roles in the data life cycle. As
a facilitator and enabler for the data domains, the DMO provides them
with the necessary guidance, tools, and support to effectively manage
their data assets. Additionally, the DMO monitors and reports on the
performance and compliance of the data domains, identifying and
resolving any cross-domain issues or conflicts that may arise.
Data Governance Council: The data governance council oversees the
DMO structure, defines and approves data policies, and reviews and
initiates data projects. It comprises senior executives from various
business units and functions representing the organization’s strategic
interests and priorities. The council sets the vision and direction for the
data mesh, allocates resources and budgets for data initiatives, and
promotes alignment and collaboration among the data domains. The
council also fosters a culture of data-driven decision-making.
Data Domain Leadership: The role of data domain leadership is to
develop and implement data strategies, understand domain needs, and
manage data assets and models. The data leadership team includes
domain experts, data owners, data stewards, data engineers, data
analysts, and data consumers, who collaborate to deliver high-quality
and valuable data products to stakeholders. The data domain leadership
owns and governs its data assets and defines and implements domain-
specific data policies and standards.
These organizational entities do not operate independently. The DMO
establishes standards, while the data governance council ensures compliance
and guides strategic decision-making. The data domain leadership
collaborates closely with both entities, implementing standards and
contributing to the overall strategy. This three-way interaction fosters a
balanced and efficient decentralized governance model designed for the Data
Mesh framework.
Data sharing
This process involves disseminating and granting access to data products
among various organizational domains, teams, or entities. It also includes
defining and implementing data-sharing policies, agreements, and protocols
for the data products. It involves the following steps:
1. Identifying the data sharing needs and objectives: This step entails
determining the data sharing needs and objectives of both data
producers and consumers. This includes considering the data sharing’s
type, scope, frequency, and purpose. Identifying the potential benefits
and risks associated with data sharing, such as the value, impact, or
challenges it may bring, is also important.
2. Defining the data sharing policies, agreements, and protocols: This
step involves establishing the policies, agreements, and protocols that
govern the data sharing process. This includes data ownership,
stewardship, licensing, consent, privacy, security, quality, and ethics.
It is also necessary to define the roles and responsibilities of the data
producers and consumers, such as the data provider, requester, broker,
or mediator involved in the data sharing.
3. Implementing the data sharing mechanisms and platforms: This
step focuses on implementing the mechanisms and platforms that
enable data sharing. This can include using a data catalog, registry,
exchange, marketplace, or hub to facilitate the discovery and access of
data products. It also involves implementing data integration,
transformation, delivery, or consumption methods to facilitate the
transfer and use of data products.
4. Monitoring and evaluating the data sharing performance and
outcomes: This step involves monitoring and evaluating the
performance and outcomes of the data sharing process. This can be
done by using data-sharing metrics, indicators, or scores. It is also
important to report and communicate the results and feedback of the
data sharing to the relevant stakeholders or authorities.
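To illustrate step 4, the following minimal sketch evaluates data sharing
against two simple metrics; the events, metric names, and the 95 percent
threshold are illustrative assumptions:

# A minimal sketch of monitoring data sharing performance. Events and
# thresholds are illustrative assumptions.
sharing_events = [
    {"consumer": "marketing", "latency_ms": 240, "succeeded": True},
    {"consumer": "finance",   "latency_ms": 610, "succeeded": True},
    {"consumer": "marketing", "latency_ms": 190, "succeeded": False},
]

total = len(sharing_events)
success_rate = sum(e["succeeded"] for e in sharing_events) / total
avg_latency = sum(e["latency_ms"] for e in sharing_events) / total

# Report the results back to the relevant stakeholders (here, printed).
print(f"Success rate: {success_rate:.0%}, average latency: {avg_latency:.0f} ms")
if success_rate < 0.95:
    print("Alert: sharing success rate is below the agreed threshold.")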
This process addresses the challenge of making distributed and decentralized
data products accessible while ensuring their integrity, security, and privacy.
It also addresses the challenge of ensuring consistent and trustworthy data
products across domains or functions.
This process supports the adoption of the data mesh principles in the
following ways:
It promotes domain-oriented ownership by allowing data producers to
retain control over their data products while sharing them with other
domains or functions. It also enables data consumers to access and use
relevant and valuable data products for their specific needs or
problems.
It encourages data producers to provide clear and comprehensive
metadata and documentation, aligning with the principle of
reimagining data as a product. It also encourages data consumers to
provide feedback and ratings to objectively evaluate data products.
It empowers users with self-serve data infrastructure by providing tools
and platforms for data producers and consumers to autonomously and
efficiently share and access data products.
Conclusion
In this chapter, we delved into the significant topic of Data Governance
within the context of the revolutionary Data Mesh paradigm. It underscored
the pivotal role that robust Data Governance plays in ensuring the success
and reliability of the Data Mesh model. The unique challenges that emerge
when attempting to apply traditional, centralized governance methods to the
inherently decentralized character of the Data Mesh were explored in depth,
highlighting the need for a fresh perspective.
A comprehensive Data Mesh Governance Framework was presented,
showcasing its well-defined goals and objectives. The framework’s key
components were dissected, including the various organizational bodies and
roles that form its backbone, as well as the processes and policies that guide
its operation. This examination revealed the interplay between these diverse
elements and their collective contribution to the functioning of the Data
Mesh.
The chapter delved into the specifics of data product lifecycle processes, such
as security, sharing, and monitoring. These processes are critical for ensuring
the integrity and reliability of data products, and their in-depth exploration
provided valuable insights into their implementation within the Data Mesh
context.
Furthermore, the chapter discussed crucial governance policies, including
Data Product Policies, Data Catalog Policies, and Data Sharing Policies.
These policies guide decision-making, establish rules, and set expectations
within the Data Mesh ecosystem, and their discussion provided a deeper
understanding of their role and significance.
In the following chapter, Data Cataloging in Data Mesh, we will examine the
workings of data cataloging, a vital component of efficient data governance.
This essential process makes sure data is not only stored but also available,
comprehensible, and manageable. We will look at the different aspects of
data cataloging within a Data Mesh, such as the role data cataloging plays,
its principles, and the steps for creating and applying a data cataloging
strategy in a Data Mesh.
Key takeaways
Following are the key takeaways from this chapter:
Strong data governance is essential for the success and reliability of
data mesh.
Traditional centralized governance methods present difficulties in the
decentralized data mesh model.
An interrelated set of organizational bodies, roles, processes, and
policies form the foundation of data mesh Governance, each fulfilling
distinct yet interconnected functions.
CHAPTER 6
Data Cataloging in a Data Mesh
Introduction
Data cataloging is a fundamental aspect of the data mesh architectural
paradigm. This fundamental process ensures that data is not just stored but is
also accessible, understandable, and governable. This chapter aims to
demystify data cataloging within the context of a data mesh, presenting it not
as a mere technical requirement but as a strategic asset.
Data cataloging, in its essence, is organizing data assets so they are easily
discoverable and usable. Cataloging becomes the linchpin that holds the
system together in a data mesh, where data is distributed across various
domains. It allows for data identification, description, and arrangement,
ensuring it is stored and effectively used. This chapter will unravel how
cataloging underpins the entire data ecosystem in a data mesh, aiding data
democratization and enhancing data sovereignty across teams.
The key to understanding data cataloging in a data mesh lies in recognizing
its dual role in governance and utility. It is not merely about keeping a record
of data assets; it is about making data usable and governable at scale.
This chapter will explore the various facets of data cataloging within a data
mesh. We will start with understanding the role of data cataloging in a
decentralized architecture. This is crucial because a data mesh inherently
involves multiple teams and domains, each with its data prerogatives. The
decentralized nature of a data mesh presents unique challenges and
opportunities in cataloging.
Following this, we will discuss the core principles of effective data
cataloging: simplicity, consistency, and integration. These pillars form the
foundation of a robust cataloging strategy that aligns with the unique
dynamics of data mesh. The emphasis here is on creating a cataloging system
that is easy to understand and navigate, consistent in its approach, and well-
integrated with the broader data ecosystem.
Implementing a cataloging strategy is where theory meets practice. This
chapter will offer practical insights into developing and rolling out a
cataloging strategy within a data mesh. We will look at the steps involved,
the challenges likely to be encountered, and the considerations for
overcoming these challenges. The focus will be on creating a cataloging
system that addresses immediate needs and is scalable and sustainable in the
long run.
Structure
This chapter will cover the following topics:
The role of data cataloging
Principles of data cataloging
Developing a data cataloging strategy
Implementing a data cataloging strategy
Objectives
By the end of this chapter, you will have a comprehensive understanding of
data cataloging in a data mesh. You will learn how to design and implement
a data catalog that supports the principles and goals of a data mesh.
The following section will delve deeper into the strategic implications of data
cataloging within a decentralized data architecture. We will explore how
cataloging is not just a technical exercise but also a strategic enabler in a data mesh,
facilitating data accessibility, enhancing data quality, and ultimately driving
business value.
The role of data cataloging
As emphasized in the previous chapters, the concept of data cataloging is
central to the evolution of the data mesh. This process is both fundamental
and transformative.
Data cataloging in a data mesh is not just about organizing data; it is about
transforming how data is perceived, accessed, and utilized across an
organization. At its core, the main goal of data cataloging is to ensure that
data, irrespective of where it resides in the organization, is easily
discoverable and usable. This objective is pivotal in a decentralized system
like data mesh, where data is spread across various domains, each operating
autonomously. Cataloging serves as the connecting thread, weaving together
these disparate strands of data into a cohesive, navigable, and functional
ecosystem.
It is not merely a technical process but a strategic enabler, aligning with the
broader goals of data democratization and sovereignty. By cataloging data
effectively, organizations empower their teams to locate and leverage the
correct data at the right time, thereby unlocking the full potential of their data
assets. This approach facilitates a deeper understanding of data, fostering a
culture where data is available, meaningful, and actionable.
Moreover, effective data cataloging in a data mesh addresses one of the most
pressing challenges in modern data ecosystems: the siloed nature of data. By
creating a unified catalog that spans across domains, organizations can break
down these silos, ensuring that data is not just stored but is also
interconnected and interoperable. This interconnectedness is crucial for
deriving insights and driving innovation in a fast-paced, data-driven world.
The following figure depicts the two critical roles that the data cataloging
process plays in a data mesh:
Figure 6.1: The role of data cataloging in a data mesh
As depicted in the figure, the two critical roles that the data cataloging
process plays in a data mesh are as follows:
Data cataloging as a means of data utility, ensuring data
discoverability, accessibility, and usability in a data mesh.
Data cataloging as a means of data governance, ensuring data
quality, security, and compliance in a data mesh.
Let us further explore these critical roles.
Conclusion
In this chapter, we have explored the multifaceted role of data cataloging
within the data mesh framework, a critical component for unlocking the full
potential of this innovative data architecture. The journey through the chapter
reveals the essentiality of data cataloging in enhancing data utility,
governance, and the overall functionality of the data mesh.
We delved into how data cataloging serves as a powerful tool for ensuring
data discoverability, accessibility, and usability. It plays a pivotal role in data
governance, upholding data quality, security, and compliance. By effectively
cataloging data, organizations can transform their data assets into more
valuable and governable entities, enabling better decision-making and
adherence to regulatory standards.
The chapter emphasized three fundamental principles of data cataloging -
simplicity, consistency, and integration. Adhering to these principles is
crucial for creating a data catalog that is not only functional but also enhances
the user experience within the data ecosystem. These principles act as guiding
beacons, ensuring that the data catalog remains an integral and effective part
of the data mesh.
We outlined a structured approach to developing and implementing a data
cataloging strategy. This approach encompasses defining the scope and
objectives, assessing current states and gaps, designing the desired state, and
creating a roadmap for effective catalog implementation.
The chapter also detailed the steps involved in implementing a domain data
catalog, including understanding the domain, establishing its structure,
identifying cataloging elements, cataloging the domain, and monitoring and
measuring its effectiveness. These steps are critical for ensuring that the data
catalog aligns with business goals and objectives, and provides clarity,
validity, and integrity to the data domain. Let us summarize the key takeaways
from the chapter.
Key takeaways
Data cataloging is a fundamental aspect of the data mesh architectural
paradigm, ensuring that data is not just stored but also accessible,
understandable, and governable.
Data cataloging plays a dual role in data utility and data governance. It
ensures data discoverability, accessibility, and usability, while also
ensuring data quality, security, and compliance.
The principles of data cataloging in a data mesh are simplicity,
consistency, and integration. These principles guarantee that the
catalog fulfills its core purpose and improves the overall functionality
and user-friendliness of the data ecosystem.
Developing a data cataloging strategy involves defining the scope and
objectives, assessing the current state and gaps, and designing the
desired state and roadmap. This strategy ensures alignment between
data cataloging, business goals, system principles, and data mesh
objectives.
Implementing a data cataloging strategy in a data mesh requires
understanding the domain, establishing the domain structure,
identifying cataloging elements, cataloging the domain, and monitoring
catalog usage and effectiveness. These steps ensure effective data
management, discoverability, and governance within the data mesh.
This chapter lays the foundation for organizations to harness the power of
data cataloging in their journey towards a more interconnected, efficient, and
innovative data landscape.
In the next chapter, our focus will shift to a critical aspect of the
architecture: data sharing in a data mesh.
Introduction
Data is a precious thing and will last longer than the systems themselves.
– Tim Berners-Lee
Truer words have never been spoken, especially in the context of the
evolving digital ecosystem. As we delve into this chapter, we build upon the
foundations laid in the previous chapters. This chapter shifts focus to a
critical component: data sharing.
In a data mesh architecture, the significance of data sharing cannot be
overstated. It is the lifeblood of modern data ecosystems, pivotal for
unlocking the value buried in vast troves of data. We have seen how data
mesh revolutionizes data architecture and data governance. Now, we turn our
attention to how it transforms data sharing. This chapter aims to dissect the
complexities of data sharing in a decentralized landscape.
Understanding the role of data sharing is the first step. It is not merely about
moving data from point A to point B. It is about creating a synergy that allows data to be a
catalyst for informed decision-making and innovation. This section of the
chapter aims to articulate why data sharing is a cornerstone for extracting
value from data products and driving data-driven cultures.
Following this, we delve into the principles of data sharing within a data
mesh. These principles are not mere guidelines. They are the foundations
upon which effective and ethical data-sharing practices are built. They ensure
that the sharing of data is not only efficient but also aligns with the
overarching goals of data sovereignty and integrity.
There are different patterns for data sharing, and implementing these patterns
within a data mesh framework is a complex yet rewarding journey. This
chapter aims to elucidate these patterns and then provide clear steps and
considerations for crafting a scalable and robust data-sharing strategy. We
cover the practical aspects of implementation. These aspects ensure that data
sharing seamlessly integrates into the operational fabric of an organization.
Structure
This chapter will cover the following topics:
Role of data sharing
Principles of data sharing
Patterns of data sharing
Implementing a data sharing strategy
Objectives
By the end of this chapter, the reader will have gained an understanding of
the critical role of data sharing within a data mesh framework. Let us get
started by exploring the role of data sharing in data mesh.
There are two key roles of data sharing that are pivotal to a Data Mesh:
Data sharing enables information dissemination.
Data sharing enables data value creation.
Let us discuss each of these roles in detail.
Information dissemination
Data sharing enables information dissemination. Information dissemination is
the process of making data available and accessible to a wider audience. It is
crucial for a data mesh, as it allows different domains to share their insights,
learn from each other, and collaborate on common goals.
Information dissemination in a data mesh has several benefits, such as:
Increased data availability and accessibility: Data products are
published to a common platform or registry, where they can be easily
discovered and accessed by data consumers. Data consumers do not
need to request or wait for data from data producers, as they can
subscribe to the data products they require and receive updates
automatically.
Reduced data duplication and redundancy: Because data products are
published once to a common platform or registry, data consumers reuse
the same governed products instead of extracting and storing their own
copies. This reduces duplicated pipelines and redundant datasets across
the organization.
Improved data quality and trust: Data products are self-describing
and self-governing, which means that they are cataloged. They provide
metadata and documentation about their origin, purpose, and quality.
Data consumers can use this information to assess the reliability and
suitability of the data products for their use cases. Data producers and
consumers can also provide feedback and ratings to each other. This
interaction can improve the quality and trust of the data products over
time.
Enhanced data collaboration and innovation: Data products are
interconnected and composable. This feature means that they can be
combined and enriched by other data products or applications.
Data interoperability
Data interoperability is a fundamental principle of data mesh. It plays a
crucial role in ensuring secure, efficient, and compliant data exchanges across
different domains. In a data mesh, where data is inherently distributed and
decentralized, interoperability is key to the seamless integration and
utilization of data. This principle requires the establishment of standardized
data formats and protocols. This standardization facilitates the smooth
exchange and compatibility of data across the network.
It ensures that data, when transferred between domains, adheres to a uniform
structure. This uniformity reduces the risk of data breaches and loss of
integrity. This uniformity also streamlines the process of implementing
security measures.
Data interoperability reduces the complexity and time involved in data
processing. Domains can exchange and integrate data without the need for
extensive data transformation or correction processes. This results in quicker
access to and analysis of data. It drives faster decision-making and
operational efficiency.
Furthermore, data interoperability aligns with compliance standards.
Standardized data formats ensure that data-sharing practices meet various
regulatory requirements. This is especially useful in environments where data
needs to be shared across geographic or regulatory boundaries.
In essence, data interoperability in a data mesh fosters a harmonious and
efficient data-sharing environment.
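As a small illustration of interoperability through standardized formats, the
following sketch validates an event against a schema shared across domains,
using the jsonschema package; the schema and the event fields are illustrative
assumptions:

import jsonschema  # pip install jsonschema

# A shared, standardized schema that every domain agrees to publish
# order events against. The fields are illustrative assumptions.
ORDER_EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "occurred_at": {"type": "string"},
    },
    "required": ["order_id", "amount", "currency", "occurred_at"],
}

event = {
    "order_id": "ORD-1042",
    "amount": 99.5,
    "currency": "EUR",
    "occurred_at": "2024-01-15T10:30:00Z",
}

# Raises jsonschema.ValidationError if the event violates the shared
# schema, so malformed data is rejected before crossing domain boundaries.
jsonschema.validate(instance=event, schema=ORDER_EVENT_SCHEMA)
print("Event conforms to the shared schema.")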
Quality-first approach
The quality-first approach is a fundamental principle for data sharing in a data
mesh. It means prioritizing data quality over data quantity or speed, ensuring
that data products shared across domains are accurate, complete, and
reliable. It also means ensuring
that the products meet the expectations and standards of data consumers and
producers. This approach is essential for building trust in data and its
subsequent analyses, as it ensures that data products are fit for their intended
use and purpose. A quality-first approach also enhances the overall value of
data within the organization. It leads to better analytics and insights. These
can support and improve business decisions and outcomes.
A quality-first approach requires implementing robust data validation and
cleansing processes. These processes can detect and correct any errors,
inconsistencies, or anomalies in data products before sharing them. Data
validation and cleansing processes can ensure data products are accurate,
complete, and consistent. They can also ensure that products conform to the
data quality rules and criteria defined by data producers and consumers. Data
validation and cleansing processes can also improve the security and
compliance of data products. They remove any sensitive or confidential data
that should not be shared and adhere to data policies and regulations for data
sharing.
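A minimal sketch of such a validation and cleansing step might look as
follows; the required fields, rules, and sensitive attributes are illustrative
assumptions:

SENSITIVE_FIELDS = {"ssn", "credit_card"}  # must never leave the domain

def validate_and_cleanse(record: dict) -> dict:
    """Reject incomplete or anomalous records and strip sensitive fields."""
    required = {"customer_id", "order_total"}
    missing = required - record.keys()
    if missing:
        raise ValueError(f"Record rejected, missing fields: {sorted(missing)}")
    if record["order_total"] < 0:
        raise ValueError("Record rejected: negative order_total is an anomaly")
    # Cleansing: drop confidential attributes that must not be shared.
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

shared = validate_and_cleanse(
    {"customer_id": "C-17", "order_total": 42.0, "ssn": "***-**-****"}
)
print(shared)  # {'customer_id': 'C-17', 'order_total': 42.0}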
A quality-first approach requires implementing rigorous data quality checks
and monitoring. This approach can measure and evaluate the quality of data
products before and after sharing them. Data quality checks and monitoring
can ensure that data products are reliable, trustworthy, and up to date. They
can also maintain their quality throughout their lifecycle. Data quality checks
and monitoring can also provide feedback and ratings to data producers and
consumers. This helps them improve and maintain the quality of data
products over time.
In summary, adopting a Quality-First Approach in data sharing within a Data
Mesh is crucial for ensuring secure, efficient, and compliant data exchanges.
It forms the backbone of a robust Data Mesh system, facilitating the flow of
trustworthy and valuable data across various domains.
Collaborative data stewardship
Collaborative data stewardship is a key principle for data sharing in a Data
Mesh. This principle is the practice of managing and sharing data as a shared
responsibility. It facilitates collaboration across various domains.
Collaborative data stewardship recognizes that data is not just a domain-
specific asset but a shared resource that benefits the entire organization. This
principle involves developing and following shared standards, practices, and
policies for data management, which ensure consistency, quality, and
compliance across domains. It not only ensures data integrity and security
but also encourages the sharing of best practices and insights, enhancing the
value and utility of the data.
Collaborative data stewardship requires establishing common data standards
and practices. These can guide and govern data management and sharing
across domains. Common data standards and practices can ensure that data
products are consistent and compatible with each other. These practices
ensure that they meet the expectations and requirements of data consumers
and producers. Common data standards and practices can also improve the
efficiency and performance of data products. They do so by enabling data
reuse, automation, and optimization.
Collaborative data stewardship requires enforcing common data policies and
regulations. These policies and regulations can protect and control data
access and usage across domains. Common data policies and regulations can
ensure that data products are secure and private, and that they follow the
relevant data laws and ethics. Common data policies and regulations can also
improve the compliance and accountability of data products. They achieve
this by defining the permissions, restrictions, and obligations for data sharing,
and by providing data audit and reporting mechanisms.
Collaborative data stewardship requires facilitating data communication and
collaboration among data consumers and producers across domains. Data
communication and collaboration can ensure that data products are
transparent and trustworthy. They can also ensure that they provide relevant
and useful information and insights. Data communication and collaboration
can also improve the value and utility of data products. They do this by
enabling data feedback and ratings, data discovery and exploration, and data
innovation and experimentation.
In essence, Collaborative Data Stewardship in a Data Mesh is not just about
sharing data; it is about creating a synergistic environment where data is
managed and utilized in a way that benefits the entire organization. It is a
principle that recognizes the interdependent nature of data in modern
enterprises and seeks to harness this interdependency for greater
organizational success.
Publish-subscribe pattern
The publish-subscribe pattern is one of the common patterns of data sharing
in a data mesh. In this pattern, data producers publish their data products to a
common platform or registry, from which data consumers can discover and
subscribe to the data products they require. Consumers then access and use
the data products according to their preferences and needs. This scalable and
flexible pattern facilitates decoupled data sharing, catering to varied
consumer preferences. The publish-subscribe pattern is based on the principle
of decoupling data producers and consumers, allowing them to communicate
and collaborate without direct dependencies or interactions.
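To make the decoupling concrete, here is a minimal in-memory sketch of the
pattern. A production mesh would use a shared platform such as a message
broker or registry service; the registry class below is an illustrative stand-in:

from collections import defaultdict

class DataProductRegistry:
    """Common platform that decouples data producers from consumers."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # product name -> callbacks

    def subscribe(self, product_name, callback):
        # Consumers register interest without knowing the producer.
        self._subscribers[product_name].append(callback)

    def publish(self, product_name, data):
        # Producers publish without knowing who consumes.
        for callback in self._subscribers[product_name]:
            callback(data)

registry = DataProductRegistry()
registry.subscribe("orders", lambda data: print("marketing received:", data))
registry.subscribe("orders", lambda data: print("finance received:", data))
registry.publish("orders", {"order_id": "ORD-1", "amount": 10.0})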
The following diagram depicts key components of the publish-subscribe
pattern:
Request-response
This pattern involves direct requests from data consumers to producers,
enabling a more tightly coupled, synchronous, and interactive exchange of
data products. It suits scenarios where immediate data exchange is essential.
Data consumers request data products from data producers, who then respond
with the data products they can provide. Consumers then access and utilize
the data products according to their preferences and needs. The interface that
facilitates data sharing is owned by the data producer.
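A minimal sketch of this producer-owned interface might look as follows; the
class, method, and data are illustrative assumptions:

class OrdersDomainAPI:
    """Producer-owned interface that answers consumer requests directly."""

    def __init__(self):
        self._orders = {"ORD-1": {"order_id": "ORD-1", "amount": 10.0}}

    def get_order(self, order_id):
        # Synchronous exchange: the consumer waits for the reply.
        order = self._orders.get(order_id)
        if order is None:
            raise KeyError(f"Unknown order: {order_id}")
        return order

# The consumer calls the producer directly, a tightly coupled exchange.
producer = OrdersDomainAPI()
print(producer.get_order("ORD-1"))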
The following diagram depicts the key components of the request-response
pattern:
Push-pull
This pattern is ideal for asynchronous and batched data sharing. Data
producers push their data products to a common data platform, from which
data consumers pull the products they require as needed. Consumers then
access and utilize the data products according to their preferences and needs.
The pattern is useful for scenarios that require buffering and periodic data
updates. The common data platform is a shared component between the
producer and the consumer.
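The following minimal sketch models the shared platform as a simple queue;
in practice it could be object storage or a staging area, and all names are
illustrative assumptions:

from collections import deque

class SharedDataPlatform:
    """Buffer shared by producer and consumer for asynchronous exchange."""

    def __init__(self):
        self._buffer = deque()

    def push(self, data_product):
        # The producer pushes batches on its own schedule.
        self._buffer.append(data_product)

    def pull(self):
        # The consumer pulls when ready; None means nothing is waiting.
        return self._buffer.popleft() if self._buffer else None

platform = SharedDataPlatform()
platform.push({"batch": "orders-2024-01-15", "rows": 1200})  # producer side
print(platform.pull())  # consumer side, pulled when needed
print(platform.pull())  # None: the buffer is empty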
The following diagram depicts the key components of the push-pull pattern:
Now, let us deep dive into the four steps of implementing a data sharing
pattern.
Conclusion
This chapter meticulously explores the multifaceted aspects of data sharing
within a data mesh environment. We began by underscoring the pivotal role
of data sharing. We emphasized its significance in disseminating information
and fostering data value creation. This foundational understanding set the
stage for delving into the core principles that underpin effective data sharing.
We discussed five key principles: Domain data autonomy, data
interoperability, contextual data sharing, a quality-first approach, and
collaborative data stewardship. Each principle was dissected to reveal its
importance in building a robust data mesh framework. It demonstrated how
these principles collectively contribute to a cohesive and efficient data-
sharing ecosystem.
The exploration of data sharing patterns formed the crux of this chapter. We
examined three predominant patterns: Publish-subscribe, request-response,
and push-pull. Each pattern was analyzed for its unique characteristics,
suitability for different scenarios, and its role in enhancing data sharing
within the mesh.
The chapter then transitioned to the practical aspects of implementing a data-
sharing strategy. This journey through the implementation process was
structured into key steps. It began with identifying the appropriate data-
sharing pattern. Then, it established the data-sharing protocol. It culminated
in the creation of a secure infrastructure with robust Access Control
Interfaces. We emphasized the significance of monitoring and performance
optimization. We highlighted how continuous evaluation and refinement are
critical to the success of a data mesh.
In the next chapter, we will discuss another important aspect of a data mesh:
data security. Data security is the protection of data products and data-sharing
activities from unauthorized access, use, modification, or disclosure. This
upcoming chapter will build upon the foundations laid in data sharing,
focusing on how to protect data within the mesh, ensuring its integrity and
confidentiality. We will explore the strategies, tools, and best practices for
securing data in a distributed environment, a critical aspect for any
organization embarking on a data mesh journey.
Key takeaways
Here are the key takeaways from this chapter:
Data sharing in a data mesh is guided by five principles: data
autonomy, data interoperability, contextual data sharing, quality-first
approach, and collaborative data stewardship. These principles ensure
that data sharing is decentralized, domain-oriented, self-descriptive,
trustworthy, and cooperative.
Data sharing in a data mesh can be implemented using three patterns:
publish-subscribe, request-response, and push-pull. These patterns
provide alternatives for exchanging data products between domains,
depending on the nature and volume of the data, the frequency and
urgency of the data sharing, the level of coupling and coordination
between data producers and consumers, and the trade-offs and
implications of each data sharing pattern.
Data sharing in a data mesh requires a step-by-step strategy that
involves identifying the appropriate data-sharing pattern, establishing
the data-sharing protocol, creating a secure infrastructure and access
control interfaces, and monitoring and optimizing the data-sharing
performance and quality. These steps ensure that data sharing is
smooth and secure, as well as useful and valuable.
CHAPTER 8
Data Security in a Data Mesh
Introduction
– Bruce Schneier.
As we delve into the eighth chapter of this book, this quote resonates more
than ever.
Data mesh, by its very nature, challenges traditional security models. The
architecture is decentralized, and control over data is distributed across
diverse domains, introducing many security considerations. The chapter is
not just about understanding these challenges. It is about rethinking data
security in a landscape where the conventional perimeters have dissolved.
We begin by dissecting the security challenges in a decentralized system. The
distributed nature of data ownership and control in a Data Mesh makes the
task of safeguarding data not only complex but also critical. This section will
unravel these complexities and offer insights into addressing them
effectively.
The chapter then moves on to the principles of data mesh security. Here, we
establish the foundation for strong security in a Data Mesh framework. We
will review principles like confidentiality, integrity, and availability in this
new context. These principles will guide us in navigating the security of
decentralized data architectures.
Finally, we will dive into the components of Data Mesh security, which cover
the three main aspects of data security:
Data security: We explore strategies to protect data itself, emphasizing
advanced encryption and anonymization techniques. This is crucial
in a setting where data is not just stored centrally, but also exchanged
and processed across multiple domains.
Network security: This section underscores the importance of
securing data in transit. With data frequently moving across various
nodes in a Data Mesh, ensuring secure transfer and effective intrusion
prevention is vital.
Access management: We discuss how to manage who has access to
what data, covering mechanisms like role-based access control (RBAC)
and attribute-based access control (ABAC), as sketched after this list,
and the role of encryption for data at rest.
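As a minimal sketch of the difference between these two mechanisms,
consider the following; the roles, attributes, and rules are illustrative
assumptions rather than a prescribed model:

ROLE_PERMISSIONS = {  # RBAC: permissions attach to roles
    "data_steward": {"read", "write"},
    "analyst": {"read"},
}

def rbac_allows(role: str, action: str) -> bool:
    """Grant access purely on the user's role."""
    return action in ROLE_PERMISSIONS.get(role, set())

def abac_allows(user: dict, resource: dict, action: str) -> bool:
    """Grant access on attributes of the user, resource, and request."""
    same_domain = user["domain"] == resource["domain"]
    sensitive = resource["sensitivity"] == "high"
    # Example rule: reads are open within a domain, but highly sensitive
    # resources additionally require security clearance.
    return action == "read" and same_domain and (not sensitive or user["cleared"])

print(rbac_allows("analyst", "read"))   # True
print(rbac_allows("analyst", "write"))  # False
print(abac_allows(
    {"domain": "sales", "cleared": False},
    {"domain": "sales", "sensitivity": "high"},
    "read",
))  # False: high sensitivity requires clearance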
By the end of this chapter, you will have a comprehensive understanding of
data security in a data mesh, and you will be able to apply the best practices
and recommendations to secure your data assets in a decentralized setting.
Structure
This chapter will cover the following topics:
Security challenges in a decentralized system
SECURE: Principles of data mesh security
Data Mesh Security Strategy: The three-circle approach
Components of data mesh security
Objectives
The objective of this chapter is to provide a comprehensive understanding of
data security within the data mesh framework. It aims to elucidate the unique
challenges and strategies associated with ensuring data confidentiality,
integrity, and availability in a decentralized system. The chapter seeks to
equip readers with the knowledge of effectively preventing data breaches,
facilitating safe data sharing, and maintaining high data quality. Additionally,
it aims to offer insights into how these security measures can support and
enhance the core functionalities of a data mesh.
Let us begin by discussing the security challenges in a decentralized system.
Let us deep dive into each circle and discuss the policies relevant to it. We
will dissect each policy based on the challenge it addresses in a Data Mesh
architecture, along with the goal and the impact of the policy.
Organizational security covers the following policies that align with the
SECURE principles of Data Mesh security:
Policy for Scalable Security Architecture (S: Scalable Security
Protocols): This policy underscores the necessity of a security
infrastructure that is inherently scalable and adaptable. It mandates the
development and implementation of security protocols that can
dynamically adjust to the organization’s evolving needs, ensuring
robust protection as the enterprise grows. This policy category
advocates for flexible security frameworks. The goal and the impact of
this policy can be summarized as follows:
Goal: This policy aims to ensure that the organization’s security
infrastructure is scalable and adaptable, so that it can accommodate the
growth and change of the Data Mesh environment, as well as the
security needs and demands of the data domains and platforms.
Impact: This policy has two impacts. First, it provides a clear and
coherent direction and governance for the security of data, assets,
and resources in the Data Mesh environment. Second, it ensures that
the security protocols and practices align and comply with relevant
laws and regulations.
Data Encryption and Transfer Policy (E: Encryption and Secure
Data Transfer): This policy mandates the use of advanced encryption
protocols for data at rest and in transit, aligning with global best
practices and compliance requirements. It ensures that all data transfers
occur over secure channels and that encryption standards are
consistently applied. This policy category focuses on safeguarding data
in all its states, ensuring the confidentiality, integrity, and availability
of data (a minimal encryption sketch follows this policy list). The goal
and the impact of this policy can be summarized as
follows:
Goal: This policy has two main goals. First, it ensures that data is
encrypted and transferred securely and reliably. This stops potential
threats or breaches of the data. Second, it ensures compliance with
relevant privacy and data protection regulations.
Impact: This policy has a few major impacts. It protects data from
any unauthorized or inappropriate access, misuse, or loss. It also
ensures the rights and interests of the data subjects and owners.
Moreover, it ensures the consistency and compatibility of the
encryption protocols and standards across the data domains and
platforms.
Access Control Policy (U: Unified Access Control): This policy
establishes a unified framework for access control. It defines and
enforces the access rights and permissions of different users and
groups. It ensures that access to data and resources is governed by a
comprehensive set of rules. The rules consider user roles, context, and
data sensitivity. This policy aims to streamline access management. It
ensures that access is secure and conducive to the organization’s
operational efficiency. The goal and the impact of this policy can be
summarized as follows:
Goal: The goal of this policy is to ensure that access to data and
resources is governed by a unified framework. This framework
considers user roles, context, and data sensitivity. It ensures that
access is granted only to authorized and appropriate users and
groups. It also ensures compliance with relevant laws and
regulations.
Impact: This policy has two impacts. It protects data and resources
from unauthorized or inappropriate access, misuse, or loss. It also
ensures the rights and interests of the data subjects and owners. The
policy also ensures the consistency and compatibility of the access
control models and mechanisms across the data domains and
platforms.
Privacy and Data Protection Policy (R: Robust Privacy Standards,
E: End-to-End Data Protection): This dual-faceted policy intertwines
Robust Privacy Standards with End-to-End Data Protection. It enforces
stringent privacy measures across all domains. This ensures the
protection of data throughout its entire lifecycle. It ensures that data is
treated with respect and care. It also safeguards data from any potential
threats or breaches. It ensures that data is compliant with the relevant
privacy and data protection regulations. It also ensures that data
respects the rights and interests of the data subjects and owners. The
goal and the impact of this policy can be summarized as follows:
Goal: This policy aims to treat data with respect and care. It also
aims to safeguard data from threats or breaches. It ensures
compliance with privacy and data protection regulations.
Furthermore, it also protects the rights and interests of the data
subjects and owners.
Impact: This policy protects data from any unauthorized or
inappropriate access, misuse, or loss. It also ensures the rights and
interests of the data subjects and owners. The policy also maintains
consistency and compatibility of privacy and data protection
measures across the data domains and platforms.
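To ground the encryption policies above, here is a minimal sketch of
encrypting a record at rest with the cryptography package's Fernet recipe.
Key management is deliberately simplified; in production the key would come
from a managed key vault, and data in transit would typically rely on TLS
rather than application-level encryption:

from cryptography.fernet import Fernet  # pip install cryptography

# Fernet provides authenticated symmetric encryption (AES-CBC with an
# HMAC), covering confidentiality and integrity for data at rest.
key = Fernet.generate_key()  # assumption: in production, fetch from a key vault
fernet = Fernet(key)

record = b'{"customer_id": "C-17", "order_total": 42.0}'
token = fernet.encrypt(record)    # store or transfer only the ciphertext token
restored = fernet.decrypt(token)  # authorized readers decrypt with the key

assert restored == record
print("Round-trip succeeded; ciphertext length:", len(token))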
Inter-Domain Security covers the following policies that align with the
SECURE principles of Data Mesh security:
Data Contract and Agreement Policy (S: Scalable Security
Protocols): This policy defines and enforces the data contracts and
agreements between the data domains. It specifies the terms and
conditions of data exchange. This includes data scope, format, quality,
frequency, and duration of data exchange. It also specifies the security
protocols and requirements for data exchange. This includes
encryption, authentication, authorization, and integrity mechanisms.
This policy category ensures that data exchange is governed by a
scalable and adaptable security framework and can accommodate the
diverse and dynamic needs and demands of the data domains. The goal
and the impact of this policy can be summarized as follows:
Goal: This policy aims to ensure that data exchange is governed by
a scalable and adaptable security framework. This framework can
accommodate the diverse and dynamic needs and demands of the
data domains. The policy will ensure that data exchange is
performed in a secure and reliable manner.
Impact: This policy has a clear impact. It provides direction and
governance for the data exchange between the data domains. It
ensures alignment and compliance of the data contracts and
agreements with the relevant laws and regulations.
Data Integrity and Validation Policy (C: Consistent Data Integrity
Checks): This policy enforces data integrity and validation between
the data domains, ensuring that data exchange is performed in a
consistent and coherent manner. It ensures that data exchange is
performed in accordance with the data formats, schemas, and
semantics, as well as the data quality and integrity standards and
practices. It also ensures that data exchange is verified and validated by
various tools and methods, such as data profiling, cleansing, or
reconciliation. This policy category ensures that data exchange is
performed in a consistent and reliable manner, ensuring the data
integrity and compatibility across the data domains. The goal and the
impact of this policy can be summarized as follows:
Goal: The goal of this policy is to ensure that data exchange is
performed in a consistent and reliable manner. This ensures data
integrity and compatibility across the data domains. It also ensures
the verification and validation of the data quality and integrity.
Impact: This policy has several impacts. It ensures the quality and
usability of the data exchanged between the data domains. It also
ensures the reliability and trustworthiness of the data. Additionally,
it helps to detect and rectify any errors, inconsistencies, or
anomalies in the data.
Data Sharing and Collaboration Policy (E: Encryption and Secure
Data Transfer): This policy regulates and facilitates data sharing and
collaboration between data domains. It establishes data sharing and
collaboration models and mechanisms, such as the data catalog, data
registry, data marketplace, or data federation, along with standards and
practices for data discovery, consumption, and governance. This policy
category ensures that data sharing and collaboration are performed in
a secure and efficient manner, with encryption and secure transfer of
data across the data domains. The goal and the impact of this policy
can be summarized as follows:
Goal: This policy has two goals. First, it aims to make data sharing
and collaboration secure and efficient, enabling data discovery and
consumption across data domains and improving data governance
and quality across the Data Mesh. Second, it aims to foster a data
culture that encourages and supports data sharing and collaboration
between data domains and promotes data innovation and value
creation across the Data Mesh.
Impact: This policy enhances the usability and value of data shared
between domains and ensures its secure transfer. It aligns data
formats, schemas, semantics, and ontologies, and it improves the
efficiency and productivity of cross-domain data sharing by putting
in place the sharing models and mechanisms (data catalog, data
registry, data marketplace, or data federation) and the standards and
practices for data discovery, consumption, and governance.
Data Interoperability and Compatibility Policy (C: Consistent
Data Integrity Checks): This policy ensures data interoperability and
compatibility between data domains, so that data exchange is
performed in a reliable and trustworthy manner. It requires that data
exchange conforms to data interoperability and compatibility standards
and practices, including data formats, schemas, semantics, or
ontologies, and that exchanged data is verified and validated by tools
and methods such as data mapping, transformation, or integration. The
goal and the impact of this policy can be summarized as follows:
Goal: The goal of this policy is to ensure that data exchange is
performed in a reliable and trustworthy manner, that data is
interoperable and compatible across data domains, and that the
usability and value of the data are verified and validated.
Impact: This policy ensures the usability and value of data
exchanged between data domains. It also ensures the alignment and
harmonization of data formats, schemas, semantics, or ontologies,
including the detection and rectification of any errors,
inconsistencies, or anomalies in the data.
Data Exchange Privacy and Protection Policy (R: Robust Privacy
Standards, E: End-to-End Data Protection): This policy ensures the
privacy and protection of data exchanged between the data domains. It
ensures that data exchange follows privacy and data protection
regulations and respects the rights and interests of data subjects and
owners. It requires privacy and data protection measures such as
anonymization, pseudonymization, and encryption, and it ensures that
data exchange is secure and resilient, protecting data through its entire
exchange lifecycle, from initiation to termination. This policy category
ensures that data exchange is respectful and careful, preserving the
privacy and protection of data across the data domains. The goal and
the impact of this policy can be summarized as follows:
Goal: This policy has two goals. First, it ensures that data exchange
is respectful and careful, preserving the privacy and protection of
data. Second, it ensures compliance with relevant privacy and data
protection regulations and protects the rights and interests of data
subjects and owners.
Impact: This policy protects data from unauthorized access, misuse,
or loss. It protects the rights and interests of the data subjects and
owners, and it ensures that privacy and data protection measures
remain consistent and compatible across all data domains and
platforms.
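To make the Data Contract and Agreement Policy and the Data Integrity and
Validation Policy concrete, the following Python sketch encodes a minimal
data contract between two domains and validates an outgoing dataset against
it before exchange. The field names, thresholds, and checks are illustrative
assumptions rather than a standard Data Mesh API; real implementations
typically build on schema registries and dedicated data quality tooling.

from dataclasses import dataclass

@dataclass
class DataContract:
    producer: str                 # owning domain
    consumer: str                 # consuming domain
    schema: dict                  # column name -> expected Python type
    required_completeness: float  # minimum share of non-null values
    frequency: str                # agreed exchange cadence, e.g. "daily"

def validate(records, contract):
    """Check records against the contract; return a list of violations."""
    violations = []
    for i, row in enumerate(records):
        for column, expected_type in contract.schema.items():
            value = row.get(column)
            if value is not None and not isinstance(value, expected_type):
                violations.append(f"row {i}: {column} is not {expected_type.__name__}")
    # Completeness check across all contracted columns.
    cells = len(records) * len(contract.schema)
    non_null = sum(1 for row in records for c in contract.schema
                   if row.get(c) is not None)
    if cells and non_null / cells < contract.required_completeness:
        violations.append("completeness below contracted threshold")
    return violations

contract = DataContract(
    producer="sales", consumer="finance",
    schema={"order_id": str, "amount": float},
    required_completeness=0.99, frequency="daily",
)
print(validate([{"order_id": "A1", "amount": 10.0}], contract))  # []

A failing validation would block the exchange until the producing domain
remediates the data, operationalizing the consistent integrity checks that
these policies mandate.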
Intra-Domain Security covers the following policy categories that align with
the SECURE principles of Data Mesh security:
Domain-Specific Security Architecture Policy (S: Scalable Security
Protocols): This policy defines and enforces the domain-specific
security architecture for each domain. It specifies the security
components and elements that make up the domain’s security
infrastructure. This includes the security devices, systems, and
networks used to protect the domain’s data, assets, and resources.
Additionally, it outlines the security protocols and requirements that
govern the domain’s security operations. These include security
monitoring, detection, and response. This policy category ensures that
each domain has a scalable and adaptable security architecture that can
accommodate its specific security needs and demands. The goal and
the impact of this policy can be summarized as follows:
Goal: This policy aims to ensure that each domain has a scalable
and adaptable security architecture that accommodates its specific
security needs and demands, as well as the types of data, assets, and
resources it manages.
Impact: This policy provides clear and coherent direction and
governance for the security of data, assets, and resources within
each domain, ensuring that the security architecture aligns with the
relevant laws and regulations.
In summary, the three-circle security framework maps each of these policy
categories to the SECURE principles that the policy supports.
Components of Data Mesh security
First, we explore Data Security, the cornerstone that ensures the safeguarding
of data both at rest within the domains and during transit. This component is
about implementing stringent measures and protocols that ensure data is
encrypted, anonymized, or otherwise protected against unauthorized access
or breaches.
Next, we delve into Network Security. This component emphasizes the
importance of protecting data as it traverses the intricate network of the Data
Mesh. This section highlights the strategies and technologies employed to
secure data in transit, ensuring that the channels through which data moves
are fortified against interception, intrusion, and other cyber threats.
Finally, we examine Access Management, a critical component that ensures
data is accessible to the right stakeholders under the right conditions. This
segment discusses the mechanisms and policies for managing and monitoring
data access, ensuring that every interaction with data is authenticated,
authorized, and compliant with established security policies.
Together, these components form a comprehensive framework for Data Mesh
security. It addresses the multifaceted challenges of protecting data in a
decentralized environment.
Let us now deep dive into each of these components.
Data encryption
Data encryption is a security mechanism that converts readable data into an
unreadable format, referred to as ciphertext, using an algorithm and an
encryption key. This process ensures that the data remains unreadable and
secure unless decrypted with the correct key. Encryption comes in two types:
symmetric, where the same key is used for encryption and decryption, and
asymmetric, which involves a public key for encryption and a private key for
decryption. Encryption plays an important role as it protects sensitive
information from unauthorized access during data storage and transmission.
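As a minimal illustration of symmetric encryption, the following Python
sketch uses the Fernet recipe from the widely used cryptography package.
The payload and key handling are simplified for demonstration; a production
Data Mesh would source keys from a key management service rather than
generating them inline.

from cryptography.fernet import Fernet

# Symmetric encryption: the same key encrypts and decrypts.
key = Fernet.generate_key()          # in practice, fetched from a KMS
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"order_id=A1,amount=10.0")
plaintext = cipher.decrypt(ciphertext)

assert plaintext == b"order_id=A1,amount=10.0"
# Without the key, the ciphertext is indecipherable to an interceptor.

Asymmetric encryption follows the same idea but uses a recipient's public
key to encrypt and their private key to decrypt, which suits exchanges
between domains that do not share a secret.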
In a Data Mesh architecture, data encryption is important. The distributed
nature of a Data Mesh creates multiple points of vulnerability. These include
data in transit and data at rest. Encryption ensures that even if data pathways
or storage mechanisms are compromised, the data remains secure. This is
important for maintaining trust and ensuring compliance with privacy
regulations.
Data Encryption is a robust barrier against data breaches and cyber threats. It
ensures that even if data is intercepted or accessed by unauthorized
individuals, it remains indecipherable and useless without the corresponding
decryption key. Encryption is particularly crucial for safeguarding sensitive
data. This includes Personal Identifiable Information (PII), financial
details, and intellectual property. It is a fundamental aspect of data security
strategies. It offers a last line of defense by ensuring data confidentiality and
integrity. Often, it is mandated by data protection regulations and standards.
Implementing Data Encryption in a Data Mesh involves several strategies
and methodologies:
Key management: Establish a robust key management system to
securely store, manage, and rotate encryption keys. Consider using a
centralized key management service that supports the Data Mesh’s
distributed architecture.
Encryption at rest and in transit: Implement encryption for data at
rest within each domain and ensure that data is encrypted when
transmitted between domains. Use strong encryption standards like
AES for data at rest and TLS for data in transit.
Policy-driven encryption: Define and enforce encryption policies
based on data sensitivity, compliance requirements, and domain-
specific needs. Use policy engines to automate encryption processes
and ensure consistency across the mesh.
Regular audits and compliance checks: Conduct regular audits to
ensure encryption standards are properly implemented and maintained.
Align encryption practices with industry standards and regulatory
requirements to ensure compliance.
End-to-end encryption: Where possible, implement end-to-end
encryption to ensure that data remains encrypted throughout its entire
lifecycle, providing maximum security against unauthorized access.
By employing these strategies, organizations can effectively implement Data
Encryption within a Data Mesh, ensuring robust protection of data across the
distributed environment.
Data masking
Data masking, also known as data obfuscation or anonymization, is a process
that disguises original data to protect sensitive information while maintaining
its usability. It involves altering or hiding specific data elements within a data
store: the data structure remains intact, but the information content is
securely concealed. This technique is particularly useful for protecting
personal, financial, or other sensitive data used in environments for
development, testing, or analysis purposes. Data masking can be static or
dynamic. In static masking, the data is masked in the source and copied to the
target. In dynamic masking, data is masked on-the-fly as queries are made.
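The following Python sketch illustrates one common masking technique,
deterministic pseudonymization, applied to an email address. The hashing
scheme and field choice are illustrative assumptions; real deployments would
use vetted masking tools and keyed or salted hashes managed as secrets.

import hashlib

def mask_email(email, salt="demo-salt"):
    """Replace the local part of an email with a deterministic token.

    Deterministic masking maps the same input to the same output, so
    joins and relationships across datasets still work after masking.
    """
    local, _, domain = email.partition("@")
    token = hashlib.sha256((salt + local).encode()).hexdigest()[:10]
    return f"{token}@{domain}"

print(mask_email("jane.doe@example.com"))  # e.g. '3fa1b2c4d5@example.com'

Because the mapping is deterministic, masked records in different datasets
still join on the masked value, which preserves the data relationships
discussed later in this section.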
In a Data Mesh, ensuring data privacy and compliance is challenging. Data
masking becomes indispensable in such scenarios to maintain the utility of
the data while ensuring that sensitive information is not exposed, especially
when moving data between domains or using it in less secure environments
like development or testing. Implementing data masking in a Data Mesh
ensures that while domains can independently manage and utilize their data,
they also uphold privacy standards and regulatory requirements, thereby
maintaining the overall security posture of the data ecosystem.
Data masking secures sensitive information from unauthorized access by
making it unreadable or meaningless without proper authorization. It enables
organizations to utilize real datasets for non-production purposes without
risking data exposure. For example, developers can work with production-
like data without having access to the actual sensitive data. This is crucial for
maintaining privacy and compliance, especially under regulations like GDPR
or HIPAA. These regulations mandate stringent controls over personal data.
Data Masking also helps reduce the risk of data breaches. Even if the masked
data is compromised, the actual sensitive information remains safe.
Implementing data masking in a Data Mesh involves the following strategies
and methodologies:
Identify sensitive data: Use data discovery and classification tools to
identify sensitive data that needs masking within each domain of the
Data Mesh.
Choose the right masking technique: Depending on the use case and
data type, choose an appropriate masking technique (for example,
substitution, shuffling, encryption, tokenization) that maintains data
utility while ensuring security.
Apply masking consistently: Ensure that masking rules are
consistently applied across all domains. This may involve centralized
policy management or coordination between domain teams to ensure
uniformity in masking standards.
Preserve data relationships: When masking data, ensure that
relationships between data elements are preserved to maintain data
integrity and utility for non-production workloads.
Monitor and audit: Regularly monitor and audit masked data to
ensure that masking policies are correctly implemented and that the
masked data does not inadvertently reveal sensitive information.
By employing these strategies, organizations can effectively integrate Data
Masking into their Data Mesh, ensuring that sensitive information is
protected while still enabling productive use of data across domains.
Data backup
Data backup is a critical data protection strategy that involves creating and
storing copies of data to safeguard against loss, corruption, or disasters. The
core purpose of data backup is to ensure data availability and continuity by
providing a means to restore data, either to its original state or to a specific
point in time before an incident occurred. Backups can be full, copying all
data; incremental, copying only data that has changed since the last backup;
or differential, copying data changed since the last full backup. Data backup
is an essential component of disaster recovery plans and business continuity
strategies, emphasizing its crucial role in maintaining operational resilience.
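To make the distinction between backup types concrete, the following
Python sketch selects the files an incremental backup would copy, that is,
only files modified since the last backup. The directory layout and timestamp
bookkeeping are illustrative assumptions; production backups rely on
dedicated tooling.

from pathlib import Path
import time

def incremental_candidates(root, last_backup_epoch):
    """Files changed since the last backup (incremental strategy).

    A full backup copies every file; a differential backup would
    compare against the last *full* backup instead of the most recent.
    """
    base = Path(root)
    if not base.exists():
        return []
    return [p for p in base.rglob("*")
            if p.is_file() and p.stat().st_mtime > last_backup_epoch]

# Example: files under ./domain-data changed in the last 24 hours.
changed = incremental_candidates("./domain-data", time.time() - 24 * 3600)
print(f"{len(changed)} files to back up")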
When data is distributed across multiple domains and locations, the risk of
data loss or corruption increases. This is due to the complexity of data
management and potential vulnerabilities. Data Backup is indispensable in
such environments. It ensures that no single point of failure can lead to
catastrophic data loss. It also enables swift data recovery and ensures that
each domain within the Data Mesh can maintain its operations and data
integrity, even in adverse scenarios. Backups also facilitate data versioning
and historical analysis, allowing organizations to track data changes and
aiding in data forensics and anomaly detection.
Data Backup serves as an insurance policy for data. It ensures that critical
information can be recovered in the event of data loss scenarios. These
include hardware failures, accidental deletions, software malfunctions, or
cyber-attacks. By maintaining up-to-date and secure copies of data,
organizations can quickly recover and minimize downtime. This helps
maintain business operations and service delivery. Regular data backups also
help in compliance with data retention policies and regulations. They provide
auditable records and evidence of data integrity and security.
Implementing Data Backup in a Data Mesh involves thoughtful planning and
execution of the following strategies:
Regular and automated backups: Schedule regular and automated
backup processes to ensure that data is consistently backed up without
relying on manual intervention. Automation helps in maintaining
backup consistency and reducing human errors.
Multi-location storage: Store backups in multiple locations, including
on-premises, in the cloud, or hybrid environments, to protect against
localized disasters. This geographical distribution of backups enhances
data resilience.
Implement backup redundancy: Use strategies like mirroring or
replication to create redundant backup copies, ensuring that if one
backup is compromised or unavailable, others can be used for
recovery.
Test backup and recovery procedures: Regularly test backup and
recovery processes to ensure that data can be effectively restored when
needed. Testing helps identify potential issues and improves the
reliability of the backup strategy.
Encrypt backup data: Secure backup data by encrypting it both
during transfer and at rest. Encryption protects backup data from
unauthorized access and ensures that sensitive information remains
confidential.
Monitor and audit backup processes: Continuously monitor backup
processes and maintain logs for auditing purposes. Monitoring helps
detect potential issues early, and auditing ensures compliance with
policies and regulations.
By integrating these strategies, organizations can establish robust Data
Backup mechanisms within their Data Mesh, ensuring data durability and
minimizing the impact of data loss incidents.
Data classification
Data classification is the systematic process of categorizing and labeling data
based on its sensitivity, value, and criticality to an organization. It involves
sorting data into various classes. These are often determined by data privacy
regulations, industry standards, or company policies. The primary categories
typically include public, internal, confidential, and highly confidential. This
process is vital for understanding the data landscape, enforcing appropriate
security measures, and ensuring that data handling aligns with compliance
requirements. By classifying data, organizations can prioritize their security
efforts, applying the most stringent controls to the most sensitive data while
optimizing resources across the data spectrum.
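As a minimal sketch of automated classification, the following Python
snippet assigns a sensitivity label based on simple pattern rules. The patterns
and labels are illustrative assumptions; real classification tools combine
content scanning, context, and machine learning.

import re

# Ordered rules: first match wins, from most to least sensitive.
# Patterns and labels are hypothetical examples.
RULES = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "highly confidential"),  # SSN-like
    (re.compile(r"\b\d{16}\b"), "confidential"),                    # card-like
    (re.compile(r"@[\w.-]+\.\w+"), "internal"),                     # email-like
]

def classify(text):
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return "public"

print(classify("Contact: jane@example.com"))   # internal
print(classify("SSN on file: 123-45-6789"))    # highly confidential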
In a Data Mesh architecture, domains manage data autonomously. This calls
for consistent and comprehensive data classification. It ensures that despite
the decentralized nature of the data, all domains adhere to a unified
understanding and treatment of data sensitivity and compliance requirements.
Data classification in a Data Mesh helps in maintaining data integrity and
trustworthiness across domains. It also enables secure data sharing and
collaboration by clearly defining data access and usage policies based on
classification levels. In addition, it supports compliance with global data
protection regulations by providing clear guidelines on data handling and
processing.
Data classification serves multiple purposes. First, it enhances data security
by identifying which data requires more stringent protection measures, such
as encryption or access controls. Second, it aids in regulatory compliance,
ensuring that sensitive data, such as PII or financial records, is handled in
accordance with legal and industry standards. Third, it streamlines data
management, enabling more efficient data search and retrieval and
facilitating effective data lifecycle management, so that data is stored,
archived, or deleted in line with its classification. Lastly, it fosters a culture
of data awareness and responsibility, because stakeholders understand the
importance and sensitivity of the data they handle.
Implementing data classification in a Data Mesh requires a coordinated
approach, considering the distributed nature of the architecture:
Develop a unified classification framework: Establish a common
classification framework that is consistently applied across all domains
in the Data Mesh. This framework should include clear definitions for
each classification level and criteria for categorizing data.
Automate classification processes: Leverage data classification tools
and solutions that can automatically classify data based on content,
context, and predefined rules. Automation helps scale the classification
process and ensures consistency.
Integrate classification with data governance: Embed data
classification within the broader data governance framework to ensure
it is an integral part of data management practices across all domains in
the Data Mesh.
Educate and train stakeholders: Ensure that all stakeholders,
including data producers, consumers, and domain owners, understand
the classification framework and their responsibilities related to data
handling and compliance.
Regularly review and update classification: Periodically review and
update the classification of data to reflect changes in business needs,
regulatory requirements, or the data itself. This ensures the
classification remains relevant and effective.
Monitor and enforce compliance: Implement monitoring mechanisms
to ensure data is classified correctly and that handling policies
appropriate to its classification are followed. Address any deviations
promptly to maintain the integrity of the Data Classification strategy.
Through these strategies, Data Classification becomes a foundational element
of data security in a Data Mesh. It ensures sensitive data is identified,
protected, and handled appropriately across the distributed environment.
Now that we have discussed the data security component in detail, let us deep
dive into the network security component.
Network security
Network security is a vital aspect of Data Mesh, as it ensures that data is
secure in transit within a domain and between domains. Network security
prevents attacks such as eavesdropping, spoofing, or denial-of-service that
could compromise data integrity, availability, or confidentiality. Let us
explore each of the network security elements in detail.
Firewall
A firewall is a network security device or software that monitors and controls
incoming and outgoing network traffic based on predetermined security rules.
It acts as a barrier between a trusted internal network and untrusted external
networks, such as the Internet. Firewalls can be hardware-based, software-
based, or a combination of both. They are designed to prevent unauthorized
access to or from a private network, ensuring that only legitimate network
traffic is allowed.
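The following Python sketch models the core behavior of the traffic filtering
described above: evaluating a connection attempt against an ordered set of
rules with a default-deny stance. The rule fields and addresses are illustrative
assumptions; real firewalls operate on packets and connection state rather
than dictionaries.

import ipaddress

# Ordered rules evaluated first-match; fields are hypothetical examples.
RULES = [
    {"source": ipaddress.ip_network("10.0.1.0/24"), "port": 443, "action": "allow"},
    {"source": ipaddress.ip_network("0.0.0.0/0"),   "port": 22,  "action": "deny"},
]

def evaluate(source_ip, port):
    """Return the action for a connection attempt; the default is deny."""
    addr = ipaddress.ip_address(source_ip)
    for rule in RULES:
        if addr in rule["source"] and port == rule["port"]:
            return rule["action"]
    return "deny"  # traffic not explicitly allowed is blocked

print(evaluate("10.0.1.5", 443))  # allow: trusted domain subnet
print(evaluate("8.8.8.8", 80))    # deny: no matching rule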
In a Data Mesh architecture, a firewall is essential for maintaining domain
isolation and protecting each domain from external threats. It provides a
critical checkpoint for all data entering or leaving a domain, ensuring that
only traffic that complies with the security policies is permitted. Firewalls are
vital for preventing unauthorized access, mitigating network-based attacks,
and maintaining the overall security posture of the Data Mesh.
Firewalls perform several key functions for network security:
Traffic filtering: Analyze and filter incoming and outgoing network
traffic based on an established set of security rules.
Protection from external threats: Prevent unauthorized access and
protect the network from various threats such as cyber-attacks,
malware, and intrusions.
Monitoring and logging: Keep records of network traffic and events,
which can be used for auditing, investigating security incidents, or
improving security policies.
Segmentation: Divide the network into different segments or zones,
each with its own security policies, to reduce the potential impact of
breaches.
Firewalls play a crucial role in securing the Data Mesh, providing a
foundational layer of protection against external threats, and ensuring that
each domain within the mesh maintains its integrity and security.
Authentication
Authentication is the process of verifying the identity of a user, device, or
entity before granting access to data or resources. It’s a critical first step in
ensuring that access to sensitive information is restricted to authorized
individuals or systems. Authentication mechanisms can vary widely, from
simple password-based methods to more complex Multi-Factor
Authentication (MFA) involving a combination of something the user
knows (password), something the user has (token or mobile device), and
something the user is (biometric verification like fingerprints or facial
recognition).
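As a minimal sketch of multi-factor authentication, the following Python
snippet combines a password check (something the user knows) with a
time-based one-time password (something the user has), using the pyotp
library. The stored hash and secret handling are simplified assumptions for
illustration; real systems use salted, slow password hashes and secure secret
storage.

import hashlib
import pyotp  # third-party TOTP library (RFC 6238)

# Simplified user record; values here are hypothetical examples.
USER = {
    "password_sha256": hashlib.sha256(b"correct horse").hexdigest(),
    "totp_secret": pyotp.random_base32(),  # provisioned to the user's device
}

def authenticate(password, otp_code):
    knows = hashlib.sha256(password.encode()).hexdigest() == USER["password_sha256"]
    has = pyotp.TOTP(USER["totp_secret"]).verify(otp_code)
    return knows and has  # both factors must pass

code = pyotp.TOTP(USER["totp_secret"]).now()
print(authenticate("correct horse", code))  # True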
Authentication ensures that every entity interacting with the data is who it
claims to be, thereby protecting the data from unauthorized access. This is
particularly important in a Data Mesh, as the distributed nature of the
architecture could potentially increase the attack surface if not properly
secured.
Authentication ensures the following:
Ensures data confidentiality: By verifying the identity of users or
systems before allowing access to data, authentication ensures that
sensitive information is not disclosed to unauthorized entities.
Minimizes data breaches: Proper authentication mechanisms can
significantly reduce the likelihood of security breaches, as only
authenticated users or systems have access to the data.
Supports compliance requirements: Many regulatory frameworks
require strong authentication controls to ensure that data is accessed
securely and in compliance with privacy laws and industry standards.
A few strategies and methodologies for implementing authentication in a
Data Mesh include:
Implement strong authentication mechanisms: Use MFA to add an
extra layer of security. Employ biometrics, One-Time Passwords
(OTPs), or hardware tokens as part of the authentication process.
Use centralized identity management: Implement a centralized
Identity and Access Management (IAM) solution to manage user
identities and authentication across all domains in the Data Mesh.
Employ certificate-based authentication: Use digital certificates for
devices and services to ensure mutual authentication in machine-to-
machine communication within the Data Mesh.
Regularly update and rotate credentials: Ensure that passwords and
other credentials are regularly updated and rotated to reduce the risk of
credential-related security breaches.
By effectively implementing robust authentication mechanisms within a Data
Mesh, organizations can create a secure foundation for data access and
interaction, ensuring that every entity is verified and authorized, thereby
maintaining the overall security and integrity of the distributed data
ecosystem.
Authorization
Authorization is the process of determining the rights and privileges of
authenticated users, devices, or entities to access specific resources or
perform certain operations within a system. Authentication verifies identity,
while authorization grants permissions based on predefined policies. In a
Data Mesh, it is important to manage authorization effectively so that entities
can only access or manipulate the data they are permitted to.
Authorization mechanisms often involve roles, groups, or attributes to define
what an authenticated entity is allowed to do within the system.
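The following Python sketch shows a role-based permission check in its
simplest form. The roles, actions, and mapping are hypothetical examples;
production systems would delegate this to an IAM or policy engine such as
those discussed in the strategies below.

# Hypothetical role-to-permission mapping for a data domain.
ROLE_PERMISSIONS = {
    "data_consumer": {"read"},
    "data_steward": {"read", "annotate"},
    "data_owner": {"read", "annotate", "grant_access"},
}

def is_authorized(role, action):
    """Least privilege: unknown roles or actions are denied."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_authorized("data_consumer", "read"))          # True
print(is_authorized("data_consumer", "grant_access"))  # False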
Authorization plays a critical role in securing a Data Mesh by ensuring that
each entity can only access data and services they are entitled to based on
their role, context, or attributes. It helps in enforcing the principle of least
privilege, minimizing the risk of unauthorized data exposure or manipulation.
Effective authorization mechanisms prevent privilege escalation and ensure
that operations performed on the data are compliant with the organization’s
security policies and regulations.
The Authorization element does the following for Data Mesh:
Controls data access: Ensures that only authorized entities can access
specific data assets, services, or functionalities, based on their
permissions.
Enforces security policies: Helps in implementing and enforcing
security policies at granular levels, ensuring that data access and
operations are in line with organizational security standards.
Reduces insider threats: Minimizes the risk of data leaks or
unauthorized data manipulation by insiders by strictly defining what
actions each user or system can perform.
Supports compliance and auditing: Facilitates compliance with
regulatory requirements by enforcing access controls and providing an
audit trail of who accessed what data and when.
A few strategies that could be employed to implement authorization in a
Data Mesh are:
Role-Based Access Control (RBAC): Implement RBAC to assign
permissions based on roles, ensuring that entities can perform actions
according to their responsibilities within the organization.
Attribute-Based Access Control (ABAC): Use ABAC to define
access permissions based on attributes (characteristics) of users,
resources, and the environment, providing more dynamic and context-
aware authorization.
Policy-Based Access Control (PBAC): Define and enforce access
policies centrally, using a Policy Decision Point (PDP) to determine
access rights based on policies and a Policy Enforcement Point (PEP)
to enforce those decisions.
Regular policy review and update: Regularly review and update
access control policies to adapt to changes in the organization, such as
new roles, users, or data assets.
Continuous monitoring and auditing: Implement solutions to
monitor authorization mechanisms continuously, detect policy
violations or anomalies, and maintain comprehensive audit logs for
forensic analysis and compliance reporting.
By effectively implementing authorization mechanisms within a Data Mesh,
organizations can ensure that data access and operations are securely
managed, supporting the overall data governance and security strategy while
facilitating compliance with internal policies and regulatory standards.
Key management
Key management refers to the administration of cryptographic keys in a
cryptosystem. This includes generating, using, storing, exchanging, and
revoking keys as required. In a cryptosystem, keys are used to encrypt and
decrypt data, ensuring confidentiality and integrity. Proper key management
is crucial because the security of encrypted data is directly linked to the
security of the keys. In a Data Mesh architecture, with its inherent distributed
nature, managing keys securely and efficiently becomes even more critical to
ensure that data remains protected across various domains.
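To illustrate one slice of the key lifecycle, the following Python sketch
rotates encrypted data to a new key using MultiFernet from the cryptography
package, which decrypts with any known key and re-encrypts with the
newest. Key storage and distribution are simplified assumptions here and
belong in a KMS or HSM in practice.

from cryptography.fernet import Fernet, MultiFernet

old_key = Fernet.generate_key()              # previously active key
token = Fernet(old_key).encrypt(b"sensitive record")

# The new key is listed first, so it is used for encryption, while the
# old key is still accepted for decryption during the transition.
new_key = Fernet.generate_key()
rotator = MultiFernet([Fernet(new_key), Fernet(old_key)])

rotated = rotator.rotate(token)              # re-encrypted under new_key
assert Fernet(new_key).decrypt(rotated) == b"sensitive record"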
In a Data Mesh, data is often distributed across different domains, each
possibly having its own encryption requirements and key management
policies. Effective key management ensures that:
Data remains secure and encrypted, protecting it from unauthorized
access.
Keys are safely generated, stored, and accessed, reducing the risk of
key exposure.
Cryptographic processes are streamlined and standardized across the
mesh.
Compliance with data protection regulations is maintained by ensuring
the confidentiality and integrity of data through proper encryption and
key management practices.
Key management ensures the following for a Data Mesh:
Secures data: By managing cryptographic keys effectively, key
management ensures that data encrypted with these keys remains
secure.
Facilitates encryption and decryption: Provides the necessary
infrastructure to encrypt data when it is being stored or transmitted and
decrypt it when needed, ensuring data confidentiality and integrity.
Manages key lifecycle: Handles the entire lifecycle of keys, from
creation, distribution, rotation, and revocation to archiving and
destruction, ensuring that keys are valid and secure throughout their
lifecycle.
Enforces access control: Ensures that only authorized entities can
access and use the cryptographic keys, reducing the risk of
unauthorized data access.
Following are a few strategies and methodologies that can be used to enforce
key management:
Centralized Key Management System: Implement a centralized Key
Management System (KMS) to manage keys across the mesh,
providing a single point of control while ensuring high availability and
reliability.
Automated key lifecycle management: Automate key lifecycle
processes, including key generation, rotation, and revocation, to
minimize human errors and ensure that keys are always up-to-date and
secure.
Secure key storage: Store keys securely using hardware security
modules (HSMs) or equivalent secure storage solutions to prevent
unauthorized access or key leakage.
Access control for keys: Implement strict access control policies for
cryptographic keys, ensuring that only authorized applications and
users can access or use the keys.
Audit and compliance: Regularly audit key management practices and
maintain comprehensive logs of key usage, ensuring compliance with
security policies and regulatory requirements.
Key backup and recovery: Ensure that backup and recovery
procedures are in place for cryptographic keys, protecting against data
loss in case of key corruption or accidental deletion.
Conclusion
In this penultimate chapter of the book, we have delved deeply into the
crucial aspect of security within the decentralized paradigm of a Data Mesh.
Our journey began with an exploration of the unique security challenges that
a decentralized system faces. These included concerns over data privacy
across multiple domains. We also looked at the prevention of unauthorized
data access, ensuring data integrity and consistency, safeguarding network
security in a distributed environment, and maintaining the scalability of
security measures.
We introduced the SECURE principles of Data Mesh security to address
these challenges. These principles are:
Scalable Security Protocols
Encryption and Secure Data Transfer
Consistent Data Integrity Checks
Unified Access Control
Robust Privacy Standards
End-to-End Data Protection
These principles guide organizations in creating a robust security framework.
This framework is comprehensive and adaptable to the dynamic nature of
Data Mesh.
Further, we outlined the Data Mesh Security Strategy through the lens of the
Three-Circle Approach, which encapsulates Organizational Security, Inter-
Domain Security, and Intra-Domain Security. This structured yet
interconnected approach ensures a formidable defense against the myriad of
security challenges inherent in a Data Mesh environment. It underscores the
importance of holistic security policies that are aligned with the SECURE
principles across different layers of the Data Mesh architecture.
The chapter also dissected the components of Data Mesh security—Data
Security, Network Security, and Access Management—highlighting key
elements like data encryption, firewall implementation, and authentication
methods. This detailed exploration provides readers with the insights needed
to implement these components effectively, ensuring that data remains
secure, accessible, and compliant across the mesh.
Looking ahead, the final chapter will focus on weaving together the concepts
discussed so far. It will offer a pragmatic guide on successfully deploying a
Data Mesh, ensuring that organizations can leverage this innovative
architecture to its fullest potential.
Key takeaways
As we conclude, the key takeaways from this chapter are:
Address Decentralization Challenges: Actively tackle the security
challenges inherent in Data Mesh’s decentralized architecture, such as
ensuring data privacy across domains, preventing unauthorized access,
maintaining data integrity and consistency, enhancing network
security, and ensuring scalability of security measures.
Implement “SECURE” Principles: Adopt and integrate the
“SECURE” principles—Scalable Security Protocols, Encryption and
Secure Data Transfer, Consistent Data Integrity Checks, Unified
Access Control, Robust Privacy Standards, and End-to-End Data
Protection—into your Data Mesh security strategy to create a robust
defense mechanism.
Apply the Three-Circle Approach: Utilize the Three-Circle Approach
for Data Mesh Security Strategy, focusing on Organizational Security,
Inter-Domain Security, and Intra-Domain Security, to establish a
comprehensive, layered security framework.
Deploy key security components: Implement essential security
components within your Data Mesh, including Data Security, Network
Security, and Access Management. Focus on deploying data
encryption, firewalls, VPNs, IDS/IPS, SSL/TLS, PKI, authentication
and authorization mechanisms, key management, and access audits to
safeguard your data infrastructure.
Align security policies with “SECURE” principles: Ensure that
security policies at organizational, inter-domain, and intra-domain
levels are aligned with the “SECURE” principles, reinforcing a unified
and effective security posture across the Data Mesh.
Enforce practical security measures: Put into practice specific
measures and strategies for the security components discussed,
emphasizing encryption, secure access management, and continuous
monitoring, to uphold data integrity and guard against security
breaches.
Introduction
"Data is not information, information is not knowledge, knowledge is not
understanding, understanding is not wisdom."
– Clifford Stoll
This quote captures the essence of the challenge that many organizations face
today: how to transform the vast amount of data they collect into meaningful
insights that can drive their business decisions and actions. The underlying
theme of this book has been the architectural paradigm of Data Mesh, which
addresses the limitations of traditional data architectures, such as centralized
data warehouses and data lakes.
In this book, we have explored the concepts, principles, and patterns of Data
Mesh and how they can help organizations overcome the common challenges
of data integration, quality, governance, security, and scalability. We have
also discussed the benefits and trade-offs of adopting Data Mesh and how it
can enable a more agile, collaborative, and decentralized data culture.
But how does Data Mesh work in practice? How can you implement it in
your organization, and what are the best practices and tools to use? How can
you measure the success and impact of Data Mesh, and what are the common
pitfalls and risks to avoid?
These are the questions that we will address in this final chapter. This
concluding chapter aims to bridge theory with practice, offering a
comprehensive guide to operationalizing Data Mesh within real-world
contexts. Through the Domain-Architecture-Operations (DAO)
framework, we elucidate a structured methodology for designing,
implementing, and managing a Data Mesh, ensuring organizations can
adeptly navigate the architectural shift toward a more agile, collaborative, and
decentralized data culture. By addressing key considerations for deployment,
from establishing a governance structure to selecting appropriate technologies
and measuring the Data Mesh’s impact, this chapter serves as a pragmatic
roadmap for organizations ready to embark on their Data Mesh journey.
Structure
This chapter has the following structure:
Domain-Architecture-Operations overview
Domain: The foundation
Architecture: Building the blueprint
Operations: From blueprint to action
Objective
The objective of this chapter is to provide a practical guide on how to
implement the Data Mesh in practice, and how to ensure that your
organization can leverage this innovative architecture to its fullest potential.
The chapter will use the Domain-Architecture-Operations (DAO)
framework, which is a tool to help you design, implement, and operate your
Data Mesh.
Let us start with an overview of the DAO framework.
Domain-Architecture-Operations overview
The book’s ideas culminate in DAO, a practical framework for Data Mesh. It
guides organizations through implementing this architecture effectively. The
framework comprises three pillars, Domain, Architecture, and Operations,
each with its own objectives and implementation steps. Let us now talk
about these pillars.
Considering the link between the domain and its node, choose the right tools
and empower domains as self-sufficient data hubs that boost insights and
innovation in your organization. The domain node connects business needs
to technical data management and is crucial for a successful Data Mesh
implementation.
With this step, the domain has been defined, it has been placed in the right
spectrum, and the domain node has been defined. Now, we move on to the
next step of the framework: architecture.
We recognize the data catalog as a core pillar of the domain unit. It functions
as the central repository for metadata, acting as the knowledge base for the
domain’s data products. This comprehensive catalog empowers users to
navigate the vast landscape of domain-specific data assets, fostering informed
decision-making and collaboration.
The domain data catalog offers a rich set of functionalities that cater to the
diverse needs of users:
Data discovery: Users can embark on efficient data exploration
journeys. The catalog provides intuitive search capabilities, allowing
users to find relevant data assets using keywords, filters, and even
natural language queries. Additionally, the catalog can offer
recommendations and suggestions based on past searches and user
behavior, streamlining the discovery process.
Data understanding: Each data asset within the catalog is
accompanied by rich metadata and comprehensive documentation. This
includes details like name, description, data ownership, source,
schema, format, and relevant tags or categories. Lineage information is
also crucial, providing insights into how the data was created,
transformed, and ultimately consumed. This transparency fosters trust
and understanding among data users.
Data quality: The catalog acts as a vigilant guardian of data quality. It
employs various metrics and indicators to monitor and assess the health
of data assets, including completeness, accuracy, validity, timeliness,
and consistency. Anomaly detection capabilities can identify potential
issues, enabling proactive measures to ensure data integrity.
Data governance: The catalog upholds established data governance
policies and rules. It enforces access controls, tracks changes made to
data assets, and maintains audit logs, ensuring compliance with
regulations and organizational standards. This fosters a culture of
accountability and responsible data stewardship.
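As a minimal sketch of what a catalog entry might hold, the following
Python snippet models a data product's metadata record with the fields
described above. The structure is an illustrative assumption, not the schema
of any specific catalog product.

from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Illustrative metadata record for one domain data product."""
    name: str
    description: str
    owner: str                       # accountable data owner
    source: str                      # producing system or pipeline
    schema: dict                     # column -> type description
    format: str                      # e.g. "parquet", "delta"
    tags: list = field(default_factory=list)
    lineage: list = field(default_factory=list)  # upstream datasets

entry = CatalogEntry(
    name="orders_daily",
    description="Daily order snapshots for the sales domain",
    owner="sales-domain-team",
    source="orders-ingestion-pipeline",
    schema={"order_id": "string", "amount": "double"},
    format="parquet",
    tags=["sales", "pii:none"],
    lineage=["raw_orders"],
)
print(entry.name, entry.owner)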
The effectiveness of the domain data catalog hinges on the collective efforts
of various stakeholders:
Data producers: Responsible for creating, publishing, and enriching
the catalog with comprehensive metadata, making data assets
discoverable and understandable.
Data consumers: Leverage the catalog to search for, understand, and
utilize relevant data assets to inform their work.
Data stewards: Oversee the catalog’s overall health and governance.
They define and enforce data policies, monitor data quality and usage
patterns, and ensure the catalog remains a reliable source of
information.
As outlined in Chapter 6, Data Catalog in a Data Mesh, a well-defined data
cataloging strategy is essential. This strategy should encompass:
Scope and objective definition: Articulate the goals and intended use
cases for the domain data catalog.
Current state assessment: Identify existing data cataloging practices
and any gaps that need to be addressed.
Desired state design: Envision the ideal state of the catalog,
considering factors like functionality, accessibility, and integration.
The core principles of simplicity, consistency, and integration should guide
the development of the catalog. This ensures a user-friendly experience,
maintains consistent data descriptions, and facilitates seamless integration
with the broader data ecosystem. Implementing a data cataloging strategy
within the Data Mesh architecture necessitates a harmonious integration with
existing data management tools and processes. This integration ensures a
seamless transition to a decentralized, domain-oriented approach without
disrupting the current operational flow. By embedding the data cataloging
strategy into the fabric of existing systems, organizations can leverage the full
potential of Data Mesh, enhancing data discovery, governance, and
collaboration across domains.
Central to this governance model are the organizational structures and roles
crucial for nurturing a collaborative and efficient Data Mesh environment.
The following figure recaps the organizations and the roles, processes, and
policies that are pivotal for a data mesh implementation:
Figure 9.10: Data Mesh Roles, Processes, Policies
As depicted in the figure, at the forefront are the Data Product Teams,
entrusted with the ownership and operational excellence of data products
within each domain. Their role is pivotal, as they not only ensure the quality,
security, ethics, and compliance of data products but also foster innovation
and agility by enabling seamless data collaboration and sharing across
domains.
Equally vital are the Data Owners, who provide strategic oversight and define
the vision and scope for their domain’s data products and services. They act
as the custodians of data, ensuring that access and usage align with business
goals and regulatory requirements. Their strategic insights and decision-
making authority ensure that data assets drive value and align with the
broader organizational objectives.
Supporting these roles are the Data Stewards, who operationalize the
governance framework by managing the day-to-day aspects of data products
and services. They work closely with data product teams to adhere to
established policies and standards, ensuring that data products are
discoverable, accessible, and usable, thus fulfilling the domain’s
commitments in cross-domain transactions.
The governance framework is further enriched by defining key governance
processes critical for each domain’s success. These include data product
definition, which lays the groundwork for data product development; data
product cataloging, ensuring data products are discoverable and reusable;
data product quality assurance, guaranteeing data integrity and reliability;
data product security, safeguarding data against unauthorized access; and data
sharing, facilitating controlled access to data products across the
organization. These processes are instrumental in realizing the governance
objectives, each backed by specific policies that guide their implementation.
The next step in this pillar is Data Mesh Technology Selection. Let us briefly
elaborate on that step.
Conclusion
This final chapter culminates the journey through the transformative
landscape of the Data Mesh architecture, presenting a practical guide to
deploying this innovative framework in real-world scenarios. By distilling the
essence of prior discussions into the DAO framework, this chapter serves as
a cornerstone for organizations aiming to navigate the complexities of Data
Mesh and unlock its full potential.
The journey commenced with the Domain pillar, where the focus was
on defining the domain, its placement, and the domain node — each
step a critical foundation for constructing a Data Mesh that is both
resilient and aligned with organizational goals.
Following this, the Architecture pillar was explored, detailing the
creation of a domain data cataloging strategy, defining domain data
sharing patterns, and establishing a comprehensive data mesh security
strategy. These steps are crucial for building the blueprint of a Data
Mesh that ensures data is discoverable, shareable, and secure.
Lastly, the Operations pillar transitioned the blueprint into action,
focusing on establishing a governance structure, selecting the right
technology stack, and operationalizing the Data Mesh with a strategic
rollout plan underpinned by key metrics and feedback mechanisms.
This was the final chapter of the book. This book has embarked on a journey
to demystify the Data Mesh architecture, exploring its core principles,
practical applications, and the considerations for successful implementation.
We’ve delved into the fundamental building blocks – domains as the
cornerstones of data ownership, a focus on self-service data products, and the
importance of a well-defined governance structure. The Data Mesh presents a
paradigm shift from centralized data lakes and warehouses to a decentralized,
domain-oriented architecture that champions data democratization, agility,
and innovation. This book has provided you with the map and compass to
navigate this journey, equipping you with the knowledge to adapt the Data
Mesh to your organization’s landscape.
Let this not be the end, but the beginning of a transformative journey towards
realizing the full potential of your data. The path ahead may be complex, but
the rewards are substantial. By embracing the principles of Data Mesh, your
organization can foster a culture where data is not just an asset but a catalyst
for innovation, growth, and enduring success. As you embark on this journey,
remember that the future of data is not in the hands of a select few. It is
distributed across the domains of your organization, empowering every team
member to contribute to and benefit from the collective intelligence of your
data ecosystem.
The Data Mesh is not just a technological shift; it is a cultural transformation.
Embrace the power of decentralized data ownership and empower your
teams to unlock the true potential of your data ecosystem.
Key takeaways
The DAO framework highlights the importance of a strategic approach
to implementing Data Mesh, emphasizing that success hinges not only
on technological infrastructure but also on organizational readiness,
governance, and continuous improvement.
A key takeaway is the centrality of the domain in the Data Mesh
architecture, acting as the foundational unit upon which data products
are built and shared. Moreover, the architecture’s resilience and the
operational strategies underscore the need for adaptive, secure, and
user-centric data management practices.
Through the lens of DAO, organizations are guided on how to tailor the
Data Mesh to their unique contexts, fostering a culture of innovation,
collaboration, and data-driven decision-making.
APPENDIX
Key terms
C
Centralized Key Management System 215
Certificate Authority (CA) 211
collaborative data stewardship 162
components, Data Mesh security 202
access management 212
data security component 203
network security 208
contextual data sharing 161
CRUD operations 109
D
data as a product principle 57, 58
accessibility, ensuring 60
aspects 58, 59
compliance, ensuring 60
consistency, ensuring 60
continuous feedback and iterative improvement 60, 61
data products, aligning with business domains and use cases 59, 60
data products, redefining as first-class citizens 59
discoverability, ensuring 60
interoperability, ensuring 60
reliability, ensuring 60
data as a product principle, implication 64
data consumers, empowering 65, 66
data management, transforming with agile, lean and DevOps practices 65
data roles, redefining with data products 64
enriched insights, facilitating through cross-domain collaboration 66
technological innovation, facilitating 65
data as a product principle, rationale 61
data assets, leveraging strategically 63, 64
data consumer experience, enhancing 63
data products, managing with ownership and lifecycle 62, 63
domain teams, empowering 61
quality, enhancing 62
silos, breaking down 62
Data Catalog 85
data cataloging 131
as means of data governance 135, 136
as means of data utility 134, 135
principles 136-138
role 132-134
data cataloging strategy 138
current state and gaps, assessing 139, 140
desired state and roadmap, designing 141, 142
developing 138
scope and objectives, defining 138, 139
data cataloging strategy implementation 142, 143
cataloging elements, identifying 146-149
catalog usage and effectiveness, monitoring 150, 151
domain 143-145
domain, cataloging 149
domain structure, establishing 145, 146
data domain leadership 116
data governance 105, 106
consequences of lax 107, 108
vitality 106, 107
data governance council 116
Data Lake 2, 4
advantages 3, 5
architecture pattern 24-26
benefits, over traditional EDW pattern 26, 27
challenges 27
disadvantages 3, 5
era 21, 22
features 3
Hadoop ecosystem origins 22
to Data Swamp 27
Data Lakehouse 2, 5
adoption 30
advantages 3, 5
architecture 30
architecture pattern 27, 28
challenges 31
cloud computing 28, 29
disadvantages 3, 6
era 28
features 3
pattern 29
Data Management Office (DMO) 116
data mart 169
Data Mesh 31, 32
architectural principles 42
domain 36
node 39, 40
principles 32, 33
Data Mesh component model 82, 83
Data Catalog 85, 86
Data Share 87, 88
domain 83, 85
domain unit, forming 88, 89
Data Mesh governance framework 111
goals 113
overview 112
seven key objectives 113, 114
three governance components 114, 115
Data Mesh governance policies 124
data catalog policies 126, 127
data product policies 125, 126
data sharing policies 127-129
Data Mesh governance processes
data product cataloging 120, 121
data product definition 119, 120
data product quality assurance 121, 122
data product security 122, 123
data sharing 123, 124
Data Mesh security
components 202, 203
SECURE principles 185
the three-circle approach 191-193
DataOps 71
data owners 118, 146
data product teams 118
data security component 203
data backup 205, 206
data classification 206-208
data encryption 203, 204
data masking 204, 205
data sharing
data value creation 158, 159
information dissemination 157, 158
patterns 163
role 156, 157
data sharing principles 159
collaborative data stewardship 162, 163
contextual data sharing 161
data interoperability 160
domain data autonomy 159, 160
quality-first approach 161, 162
data-sharing strategy implementation 171, 172
appropriate data sharing pattern, identifying 172-174
data sharing protocol, establishing 174, 175
monitoring and performance optimization 177
secure infrastructure and access control interfaces, creating 175, 176
data stewards 118
Data Swamp 111
Data Warehouse 2, 3
advantages 2, 4
decoupling analytics and online transaction processing 14
disadvantages 2, 4
divergent approaches 15
era 14
key features 2
decentralized system security challenges 181
data integrity and consistency 183
data privacy across domains 182
network security, in distributed environment 184
scalability, of security measures 184, 185
unauthorized data access 182, 183
Digitally Native Businesses (DNB) 21
Disciplined Core with Peripheral Flexibility 43
Domain-Architecture-Operations (DAO) framework
architecture 228
Domain 222
operations 236
overview 220-222
Domain, DAO framework 222, 223
defining 223-225
domain node, defining 227, 228
placement 225-227
domain, Data Mesh 36
central unit 37
interplay, between node 40-42
subunits 37, 38
domain node 40, 85
domain-oriented ownership 46
aspects 47
business alignment and domain autonomy 49
complete lifecycle ownership 47, 48
context preservation, in data management 48
decentralized governance, to enhance data quality 48, 49
domain-oriented ownership, implications 54
budget allocation, decentralizing for data ownership 57
data intelligence and value creation, enhancing 55
resilient operational framework, creating through data decentralization 55
roles and responsibilities, realigning 54
domain-oriented ownership, rationale 51
data insights and intelligence, enriching through domain diversity 53
organizational learning, facilitating 53, 54
organizational silos, overcoming 51
responsibility, cultivating through 51
domain placement methodology 97, 98
applying 100
functional context 98
operations 99
parameter score 101
parameter weightage 101
people and skills 99
regulations 99
technical capabilities 100
E
Empowering with Self-Serve Data Infrastructure principle 66, 67
agile self-serve data infrastructure, creating with DataOps 71-73
aspects 68
cross-functional collaboration, enhancing 74, 75
data scalability and resilience, achieving with distributed architecture 73
decentralized data infrastructure, fostering 68
domain-driven design 70, 71
platform thinking, leveraging 69
resource efficiency and cost-effectiveness, promoting 74
self-service tools, adopting 69
Empowering with Self-Serve Data Infrastructure principle, implication 75
data security and compliance, ensuring 76, 77
enhanced data discovery and accessibility 77, 78
resilient data architecture, building 77
teams, empowering through training and skill development 76
tools and platforms, integrating 75, 76
Enterprise Data Warehouse (EDW) 15
challenges 17, 18
components 15-17
F
fully federated Data Mesh architecture 92
components 93, 94
fully governed Data Mesh architecture 89
components 90
hub and spoke domains 91, 92
hub data catalog 91
hub data share 91
hub domain 90
spoke data catalog 91
spoke data share 91
spoke domains 90
G
Google File System (GFS) 23
governance-flexibility spectrum 43
governance-flexibility trade-off 6
H
Hadoop Common 24
Hadoop Distributed File System (HDFS) 23
Hadoop ecosystem
key components 23
origins 22
Hard Disk Drive (HDD) 20
HBase 22
Hive 22
Host-based IDS (HIDS) 209
hub-spoke model 89
hybrid Data Mesh architecture 95-97
I
Inmon, Bill 15
International Data Corporation (IDC) 19
Intrusion Detection System (IDS) 209
K
Kafka 23
key performance indicators (KPIs) 64, 119
Kimball, Ralph 15
L
Lines of Business (LoBs) 6, 32
M
macro data architecture pattern
need for 6
MapReduce 23, 24
modern data landscape
navigating 2
monolithic data architecture
challenges 11-14
era 11
rise 11, 12
Multi-Factor Authentication (MFA) 212
N
Network-based IDS (NIDS) 209
network security 208
firewall 208
Intrusion Detection System (IDS) 209, 210
Public Key Infrastructure (PKI) 211
Transport Layer Security (TLS) 210, 211
Virtual Private Network (VPN) 209
node, Data Mesh 39
O
One-Time Passwords (OTPs) 212
Online Analytical Processing (OLAP) 11
Online Transaction Processing (OLTP) 11, 109
operations, DAO framework
continuous improvement through learning 242
Data Mesh, operationalizing 241
Data Mesh technology selection 239-241
from blueprint to action 236-239
progress and impact, tracking 241, 242
overarching goals 42
P
patterns for data sharing
publish-subscribe 163
push-pull 163
request-response 163
perfect storm 18
AI advancements 21
decrease in storage cost 20
exponential growth of data 19
increase in computing power 20
rise of cloud computing 20, 21
Personally Identifiable Information (PII) 174, 203
Pig 22
Policy-Based Access Control (PBAC) 214
Presto 23
Public Key Infrastructure (PKI) 211
publish-subscribe pattern 163
advantages 165
components 164
disadvantages 165
methods for data sharing 164, 165
push-pull pattern 163, 168
advantages 170
components 168, 169
disadvantages 170
methods for data sharing 170
Q
Quality Assurance (QA) 126
quality-first approach 161, 162
R
Registration Authority (RA) 211
Relational Database Management System (RDBMS) 9
origin 11
request-response pattern 163, 166
advantages 167
components 166
disadvantages 167, 168
methods for data sharing 167
return on investment (ROI) 64
Role-Based Access Control (RBAC) 176, 214
S
SECURE principles, Data Mesh security 185, 186
Consistent Data Integrity Checks 188, 189
Encryption and Secure Data Transfer 187, 188
End-to-End Data Protection 191
Robust Privacy Standards 190
Scalable Security Protocols 186, 187
Unified Access Control 189, 190
Spark 23
Storm 22
strategic asset 131
Structured Query Language (SQL) 11
T
ten specific security policies 192
third normal form (3NF) schemas 16
three-circle security strategy
inter-domain security 192, 196-200
intra-domain security 192, 200, 201
organization security 192-196
three governance components 114, 115
data governance processes 118, 119
key roles and interactions 117, 118
organizational bodies and roles 115, 116
traditional data governance 108
challenges 110, 111
in other architectural patterns 108-110
Transport Layer Security (TLS) 210, 211
V
Virtual Private Network (VPN) 209
Y
Yet Another Resource Negotiator (YARN) 24
Z
ZooKeeper 22