Data Lakehouse, Data Mesh, and Data Fabric - SqlBits
James Serra
Data & AI Solution Architect
Microsoft
[email protected]
Blog: JamesSerra.com
About Me
▪ Microsoft, Data & AI Solution Architect in Microsoft Consulting Services (MCS), now called Industry
Solutions Delivery (ISD)
▪ At Microsoft for most of the last eight years, with a brief stop at EY
▪ Was previously a Data & AI Architect at Microsoft for seven years
▪ In IT for 35 years, worked on many BI and DW projects
▪ Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
▪ Been perm employee, contractor, consultant, business owner
▪ Presenter at PASS Summit, SQLBits, Enterprise Data World conference, Big Data Conference
Europe, SQL Saturdays
▪ Blog at JamesSerra.com
▪ Former SQL Server MVP
▪ Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
▪ Data Warehouse
▪ Data Lake
▪ Modern Data Warehouse
▪ Data Fabric
▪ Data Lakehouse
▪ Data Mesh
I tried to figure out all these data platform buzzwords on my own…
[Diagram labels: Technical Requirements; Data sources; ETL Design; ETL Development; Setup Infrastructure; Install and Tune]
The “data lake” Uses A Bottom-Up Approach
▪ Ingest all data regardless of requirements
▪ Store all data in native format without schema definition
▪ Do analysis using analytic engines like Hadoop
[Diagram labels: Devices; Batch queries; Interactive queries; Real-time analytics; Machine Learning; Data warehouse]
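Storing data "in native format without schema definition" means the schema is applied when the data is read, not when it is written. A minimal Python sketch of that idea follows. The event records, the field names, and the `read_with_schema` helper are all hypothetical, made up for illustration; a real lake would use an analytic engine rather than a Python loop.

```python
import json

# Raw events land in the "lake" exactly as produced: nothing is validated on write.
# (Hypothetical sample records.)
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": 21.5, "ts": "2024-01-01T00:00:00"}',
    '{"device": "sensor-2", "temp_c": "22.1"}',
    '{"device": "sensor-3", "humidity": 40}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    Records that lack a required field, or whose value cannot be cast
    to the requested type, are skipped instead of failing the whole read.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (KeyError, TypeError, ValueError):
            continue
    return rows


# Each analysis chooses its own schema over the same raw files.
TEMPERATURE_SCHEMA = {"device": str, "temp_c": float}
temperatures = read_with_schema(RAW_EVENTS, TEMPERATURE_SCHEMA)
```

The same raw lines could be read again with a humidity schema without re-ingesting anything, which is the trade-off of the bottom-up approach: cheap ingestion, with the interpretation work pushed to each reader.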
Data Lake + Data Warehouse Better Together
STAGE 1: Reactive
Structured data is transacted and locally managed. Data used reactively.
STAGE 2: Informative
Structured data is managed and analyzed centrally and informs the business.
STAGE 3: Predictive
Data capture is comprehensive and scalable and leads business decisions based on advanced analytics.
STAGE 4: Transformative
Data transforms business to drive desired outcomes. Any data, any source, anywhere at scale.
[Diagram axis: Rear-view mirror → Real-time intelligence]
Modern Data Warehouse
Data Fabric
Data Fabric adds to a modern data warehouse:
• Data access
• Data policies
• Metadata catalog/Lineage
• Master Data Management (MDM)
• Data virtualization
• Real-time processing
• Data scientist tools
• APIs
• Building blocks/Services
• Products
Bottom line: Additional technology to source more data, secure it, and make it available
Data Fabric defined
Data Lakehouse
Delta Lake
Top features:
• ACID transactions
• Time travel (data versioning enables rollbacks, audit trail)
• Streaming and batch unification
• Schema enforcement
• Upserts and deletes (MERGE)
• Performance improvement
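To make a few of these features concrete, here is a toy Python model of a versioned table. It is not the Delta Lake API and does not reflect how Delta Lake is implemented (Delta Lake keeps a transaction log alongside Parquet files); the class `ToyVersionedTable` and its methods are hypothetical names used only to illustrate schema enforcement, MERGE-style upserts, deletes, and time travel.

```python
import copy


class ToyVersionedTable:
    """Toy stand-in for a versioned table. NOT the Delta Lake API."""

    def __init__(self, columns, key):
        self.columns = set(columns)
        self.key = key
        self._versions = [{}]  # version 0 is the empty table

    def merge(self, rows):
        """Upsert rows by key and commit the result as one new version."""
        # Schema enforcement: validate every row before changing anything,
        # so a bad batch commits nothing (all-or-nothing).
        for row in rows:
            if set(row) != self.columns:
                raise ValueError(f"schema mismatch: {sorted(row)}")
        snapshot = copy.deepcopy(self._versions[-1])
        for row in rows:
            snapshot[row[self.key]] = dict(row)  # insert new key or overwrite existing
        self._versions.append(snapshot)
        return len(self._versions) - 1

    def delete(self, key_value):
        """Delete one row by key and commit the result as a new version."""
        snapshot = copy.deepcopy(self._versions[-1])
        snapshot.pop(key_value, None)
        self._versions.append(snapshot)
        return len(self._versions) - 1

    def read(self, version=None):
        """Read the latest version, or "time travel" to an earlier one."""
        index = len(self._versions) - 1 if version is None else version
        return sorted(self._versions[index].values(), key=lambda r: r[self.key])


table = ToyVersionedTable(columns=["id", "city"], key="id")
v1 = table.merge([{"id": 1, "city": "London"}, {"id": 2, "city": "Paris"}])
v2 = table.merge([{"id": 2, "city": "Berlin"}, {"id": 3, "city": "Oslo"}])
v3 = table.delete(1)
```

After these calls, `table.read()` returns the current rows, while `table.read(v1)` still returns the table as it was after the first commit. Keeping every version readable is what makes rollbacks and audit trails possible.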
Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
Concerns about skipping the relational database
• Speed: Relational databases faster, especially Massively Parallel Processing
(MPP)
• Security: No row-level security (RLS), column-level security, or dynamic data masking
• Complexity: Metadata separate from data, file-based world
• Concurrency: Multiple reads of a file at the same time
• Missing features: Referential integrity, TDE, workload management; other
features lock you into Spark
• Having to use Spark SQL instead of T-SQL
• People used to using a relational database
Azure Synapse: starting to see data-lake-only solutions because you can use T-SQL and
Power BI (speed, RLS), with cost savings from Serverless
Data Lakehouse & Synapse
Data Mesh
Data Mesh in theory
Lots of things sound great in theory…
Data Mesh - Overview
Data mesh is an intentionally designed distributed data architecture, under centralized governance and standardization for
interoperability, enabled by a shared and harmonized self-serve data infrastructure
#1) Domain ownership: Decentralize and distribute responsibility to the people who are closest to the data in order to support continuous change and scalability (e.g. manufacturing, sales, supplier)
#2) Data as a product: Analytical data provided by the domains is treated as a product, and the consumers of that data are treated as customers (domain teams, API code, data and metadata, infrastructure)
#3) Self-serve data infrastructure as a platform: High-level abstraction of diverse infrastructure that removes the complexity and friction of provisioning and managing the lifecycle of data products (e.g. storage, compute, data pipeline, access control)
#4) Federated computational governance: Architect global decisions and standards for interoperability, while respecting the autonomy of local domains, and implement global policies effectively (e.g. data quality, data security, regulations, data modeling)
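One way to picture how the principles fit together is a data product descriptor that each domain publishes, plus global policies written as code that are checked against it. The Python sketch below assumes exactly that. The `DataProduct` fields, the rule names in `GLOBAL_POLICIES`, and the sample products are hypothetical and chosen for illustration; no data mesh standard defines them.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain publishes for one data product."""
    name: str
    domain: str                 # principle 1: the owning domain
    owner: str                  # accountable team or person in that domain
    description: str = ""       # principle 2: documentation for consumers
    schema: dict = field(default_factory=dict)
    contains_pii: bool = False
    access_policy: str = ""


# Principle 4: global policies are defined once, centrally, as code,
# and evaluated against every domain's products.
GLOBAL_POLICIES = {
    "has an owner": lambda p: bool(p.owner),
    "is documented": lambda p: bool(p.description),
    "publishes a schema": lambda p: bool(p.schema),
    "PII has an access policy": lambda p: not p.contains_pii or bool(p.access_policy),
}


def policy_violations(product):
    """Return the names of the global policies this product fails."""
    return [rule for rule, check in GLOBAL_POLICIES.items() if not check(product)]


sales_orders = DataProduct(
    name="sales_orders",
    domain="sales",
    owner="sales-data-team",
    description="One row per confirmed order",
    schema={"order_id": "string", "amount": "decimal"},
)
supplier_contacts = DataProduct(
    name="supplier_contacts",
    domain="supplier",
    owner="supplier-data-team",
    schema={"supplier_id": "string", "email": "string"},
    contains_pii=True,
)
```

Here `policy_violations(sales_orders)` is empty, while `supplier_contacts` fails the documentation and PII rules. In this sketch the self-serve platform (principle 3) would be whatever runs these checks automatically when a domain registers or updates a product.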
Data Mesh
• Lack of ownership: who owns the data – the data source team or the infrastructure team?
• Lack of quality: the infrastructure team is responsible for quality but does not know the data
well
• Organizational scaling: the central team becomes the bottleneck, such as with an enterprise
data lake/warehouse
• Technical scaling: current big data solutions can’t keep up with additional data requirements
Data Mesh – Logical Architecture
[Diagram labels: Data Sources; Supplier; P&L (consumer-aligned); (aggregate); Consumers]
Concerns with Data Mesh
• No standard definition of a data mesh
• Huge investment in organizational change and technical implementation
• Performance of combining data from multiple domains
• Duplication of data for performance reasons
• Getting quality engineering people for each domain
• Inconsistent technical implementations for the domains
• Domains don’t want to wait for a data mesh
• Need incentives for each domain to counter extra work
• Self-serve approach of data requests could be challenging
• Duplication of data and ingestion platform
• Creation of data silos for domains not able to join data mesh
• Not seeing the big picture for combining data
Enterprise Scale Landing Zones is a prerequisite for Enterprise Scale Analytics, which is built on the core foundation of Enterprise Scale Landing
Zones. It consists of:
• Prescriptive architecture
• Designed by Subject Matter Experts
• Documented End to End Technical Solution
• Deployment Templates
• Operational Usage Model
Data Mesh on Azure Resources
• Piethein Strengholt: Blog - Implementing Data Mesh on Azure (public), Blog – Data Mesh topologies
(public), Book - Data Management at Scale: Best Practices for Enterprise Architecture (public)
• Cloud Adoption Framework: Azure data management and analytics scenario (public)
• Data Management & Analytics Scenario - Data Management Zone: GitHub (public)
• Data Management & Analytics Scenario - Data Landing Zone: GitHub (public)
• Enterprise-Scale - Reference Implementation: GitHub (public)
• Microsoft doc: A financial institution scenario for data mesh (public)
Governance Topologies: Different Approaches
[Diagram axis: Centralised (Control) ↔ Distributed (Agility)]
Mesh Type 1
• Domains use the same technology
• Data is kept in one enterprise data lake with each domain getting its own container/folder
Mesh Type 2
• Domains use the same technology
• Each domain has its own storage that is the same technology
Mesh Type 3
• Domains can use any technology they want
• Each domain has its own storage that can be any technology
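The three mesh types differ on two questions: do the domains share one technology, and do they share one storage location? A rough Python sketch of that decision follows. The `DomainStorage` fields, the `mesh_type` function, and the sample technology and storage names are hypothetical, and mixed setups that the slide does not describe (for example, some domains sharing storage and others not) are simply reported as Type 2 here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainStorage:
    """Hypothetical record of where one domain keeps its data."""
    domain: str
    technology: str   # e.g. the storage or engine product the domain uses
    storage: str      # identifier of the physical store (lake, account, cluster)


def mesh_type(domains):
    """Classify a set of domains as mesh type 1, 2, or 3."""
    if not domains:
        raise ValueError("at least one domain is required")
    technologies = {d.technology for d in domains}
    stores = {d.storage for d in domains}
    if len(technologies) > 1:
        return 3  # domains chose different technologies
    if len(stores) == 1:
        return 1  # same technology, one shared enterprise data lake
    return 2      # same technology, separate storage per domain


shared_lake = [
    DomainStorage("sales", "data-lake", "enterprise-lake"),
    DomainStorage("supplier", "data-lake", "enterprise-lake"),
]
own_lakes = [
    DomainStorage("sales", "data-lake", "sales-lake"),
    DomainStorage("supplier", "data-lake", "supplier-lake"),
]
any_tech = [
    DomainStorage("sales", "data-lake", "sales-lake"),
    DomainStorage("supplier", "relational-db", "supplier-db"),
]
```

With these inputs, `mesh_type(shared_lake)` is 1, `mesh_type(own_lakes)` is 2, and `mesh_type(any_tech)` is 3, moving from the centralised (control) end of the axis toward the distributed (agility) end.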
Data Fabric vs Data Mesh
If Data Fabric uses data virtualization, how is it different from Data Mesh? By the four Data Mesh principles:
1) Domain ownership
2) Data as a product
3) Self-serve data infrastructure as a platform
4) Federated computational governance
Future
This view is my own and not that of Microsoft!
In the end, I predict data mesh will become an extension to a centralized data solution for a small percentage of solutions.
There will be a very small percentage of solutions that are 100% true to the data mesh concept (assuming mesh type 1 and 2 are true to
the data mesh concept). Ask ten people what a data mesh is and you will get eleven answers!
[Diagram: Centralized Data Solution with Domain A and Domain B]
1) Domain ownership (90%)
2) Data as a product (50%)
3) Self-serve data infrastructure as a platform (1%)
4) Federated computational governance (25%)
https://sqlb.it/?7106
Q&A
James Serra, Microsoft, Data & AI Solution Architect
Email me at: [email protected]
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com
Comparisons of Data Fabric and Data Mesh
Implementation
• Data Mesh: Is complex, even to start a small implementation, due to the need to understand and segregate domain data
• Data Fabric: By far simpler, due to the inherent use of data virtualization, metadata, and knowledge graphs