02 - Introduction To Data Lakehouse Open-Source Technologies
Introduction to Data Lakehouse
Open-Source Technologies
Content by:
Kevin Shen
Product Manager | Data & AI Software
[email protected]
Anson Kokkat
Product Manager | Data & AI Software
[email protected]
Joshua Kim
Program Director | Data & AI Software
[email protected]
Adam Learmonth
Advisory, Learning Content Development
[email protected]
Kelly Schlamb
WW Technology Sales Enablement | Data & AI
[email protected]
Presenter:
Ahmad Muzaffar Baharudin
Technical Enablement Specialist | Data & AI
[email protected]
Agenda
• Data Warehouse vs Data Lake
• Data management evolution &
first-generation data lakehouse
• Overview of watsonx.data components
• Storage
• File format
• Table format
• ACID transactions
• Metastore
• Query engines
Traditional approaches: Data Warehouse vs Data Lake
How can we ask enterprise-wide questions requiring historical data? How can we discover what we don't know?

Data warehouse:
• A data warehouse gathers structured data from multiple sources into a central repository
• Supports multiple data analytics and business intelligence applications such as enterprise reporting, to answer questions about your business
• Normalized and structured data made it easy to analyze, but was an expensive choice

Data lake:
• As volume, velocity, and variety of data grew, data lakes were designed for data discovery and data science/machine learning use cases
• Commonly built on large data platforms such as Hadoop (HDFS)
• Data is stored in raw and unstructured format = lower cost for large volumes of data
• Highly flexible and scalable
• Difficult to use and govern, and complex to maintain; required data scientists, and many have become data swamps
Additional links:
Databases 101
What is a data warehouse
What is a data lake
Learn more about the data lakehouse kitchen analogy in this 101 video
Data management evolution and first-generation lakehouses
The data warehouse remains the center of analytics at most organizations

[Timeline: late 1990s to present – the data warehouse]

• Data warehouses emerged as the dominant method to analyze data
• Warehouse technology has evolved continuously to improve, from appliance form factors to in-memory technologies

Challenges:
− High up-front costs
− Structured data only
− ETL required
− Vendor lock-in
− Limited scalability
The emergence of the data lake

[Timeline: late 1990s to present – data warehouse, then data lake]

• As volume, velocity, and variety of data grew, data lakes emerged as the new technology to replace data warehouses
• Data stored in raw and unstructured format = lower cost for large volumes of data
Cloud data warehouses evolved to address specific challenges of data warehouses

[Timeline: late 1990s to present]

• Specifically, cloud data warehouses introduced the separation of compute and storage
Traditional approaches to addressing these challenges have created more overall complexity and cost, which has led to the emergence of data lakehouse architectures
The data lakehouse
Lakehouses are a new approach meant to combine the advantages of data warehouses and data lakes, but first-generation lakehouses still have key constraints
IBM watsonx.data – the next-generation data lakehouse

[Architecture diagram: watsonx.data alongside your existing ecosystem of data warehouse and data lake technologies, on shared infrastructure.]

• Infrastructure: hybrid-cloud deployments and workload portability across hyperscalers and on-prem with Red Hat OpenShift
A more flexible, scalable, and cost-effective solution for data management
Storage, data file formats, and table formats
Storage
[watsonx.data stack diagram: query engines; governance, metadata, and access control; data format; storage; infrastructure – shown here with the storage layer in focus.]
What is object storage?
Object storage:
• Low cost
• Near unlimited scalability
• Extreme durability and reliability (99.999999999%)
• High throughput
• High latency (but can be compensated for)
• Basic units are objects, which are organized in buckets (see the sketch below)
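As a concrete illustration, the sketch below uses the S3 API via boto3 to create a bucket, upload an object, and read it back. The endpoint, credentials, bucket, and object names are invented for this example; any S3-compatible object store (such as IBM Cloud Object Storage or MinIO) exposes an equivalent interface.

```python
import boto3

# Connect to an S3-compatible object store; the endpoint and credentials
# below are placeholders for this sketch.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Buckets are flat containers; objects are the basic unit of storage.
s3.create_bucket(Bucket="lakehouse-demo")

# Upload a small object (a CSV file in this example).
s3.put_object(
    Bucket="lakehouse-demo",
    Key="sales/2024/orders.csv",
    Body=b"order_id,amount\n1,100\n2,250\n",
)

# Read it back; per-request latency is relatively high, but throughput scales well.
response = s3.get_object(Bucket="lakehouse-demo", Key="sales/2024/orders.csv")
print(response["Body"].read().decode())
```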
The rise of cloud object storage for data lakes and lakehouses
Cloud object storage technology is displacing HDFS as the de facto storage technology for data lakes
File format
Common data file formats

Computer systems and applications store data in files. Data can be stored in binary or text format.

CSV:
• Human-readable text
• Each row corresponds to a single data record
• Each record consists of one or more fields, delimited by commas

JSON:
• Human-readable text
• Open file and data interchange format
• Consists of attribute-value pairs and arrays
• JSON = JavaScript Object Notation
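A minimal sketch of the two formats follows, using only the Python standard library; the file names and record fields are invented for the example.

```python
import csv
import json

# The same two records written as CSV (comma-delimited rows) and as JSON
# (attribute-value pairs); file names are placeholders for this sketch.
records = [
    {"order_id": 1, "amount": 100},
    {"order_id": 2, "amount": 250},
]

with open("orders.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
    writer.writeheader()
    writer.writerows(records)

with open("orders.json", "w") as f:
    json.dump(records, f)

# Both formats are human-readable text and round-trip the same data.
with open("orders.csv", newline="") as f:
    print(list(csv.DictReader(f)))

with open("orders.json") as f:
    print(json.load(f))
```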
Apache Parquet
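Apache Parquet is an open, columnar file format widely used in data lakes and lakehouses. The sketch below writes and reads a Parquet file with the pyarrow library; the file name and columns are invented for the example.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table; column names are invented for this sketch.
table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [100.0, 250.0, 80.0],
    "region": ["EMEA", "APAC", "AMER"],
})

# Parquet stores data column by column, with per-column compression and
# statistics that let engines skip data they do not need.
pq.write_table(table, "orders.parquet", compression="snappy")

# Reading back only the columns a query needs is cheap with a columnar format.
subset = pq.read_table("orders.parquet", columns=["order_id", "amount"])
print(subset)
```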
Table management and formats
• Sits "above" the data file layer
• Organizes and manages table metadata and data
• Typically supports multiple underlying disk file formats (Parquet, Avro, ORC, etc.)
• May offer transactional concurrency, I/U/D, indexing, time-based queries, and other capabilities

Apache Iceberg:
• Open-source
• Designed for large, petabyte (PB)-scale tables
• ACID-compliant transaction support
• Capabilities not traditionally available with other table formats, including schema evolution, partition evolution, and table version rollback – all without re-writing data
• Advanced data filtering

Apache Hudi:
• Open-source
• Manages the storage of large datasets on HDFS and cloud object storage
• Includes support for tables, ACID transactions, upserts/deletes, advanced indexes, streaming ingestion services, concurrency, data clustering, and asynchronous compaction
• Multiple query options: snapshot, incremental, and read-optimized
• Time-travel queries let you see data at points in the past

Delta Lake:
• Open-source, but Databricks is primary contributor and user, and controls all commits to the project – so "closed"
• Foundation for storing data in the Databricks Lakehouse Platform
• Extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling
• Capabilities include indexing, data skipping, compression, caching, and time-travel queries
• Designed to handle batch as well as streaming data
Why Apache Iceberg for data lakehouses?
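The comparison above highlights Iceberg capabilities such as ACID transactions, schema evolution, and time travel. The sketch below shows how these might look through Spark SQL; it assumes the Iceberg Spark runtime is on the classpath and uses an invented catalog name and local warehouse path, so treat it as illustrative rather than a watsonx.data configuration.

```python
from pyspark.sql import SparkSession

# Minimal Iceberg sketch: a local Hadoop catalog named "demo" pointing at an
# invented warehouse path. Requires a Spark build with the Iceberg runtime JAR.
spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create an Iceberg table and write to it with ordinary SQL (ACID commits).
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.orders (order_id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO demo.db.orders VALUES (1, 100.0), (2, 250.0)")

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE demo.db.orders ADD COLUMN region STRING")

# Time travel: query the table as of an earlier snapshot.
snapshots = spark.sql(
    "SELECT snapshot_id FROM demo.db.orders.snapshots ORDER BY committed_at"
).collect()
first_snapshot = snapshots[0].snapshot_id
spark.sql(f"SELECT * FROM demo.db.orders VERSION AS OF {first_snapshot}").show()
```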
ACID transactions
Atomicity
Guarantees that each transaction is a single event that either succeeds or fails completely; there is no half-way state.

Consistency
Ensures that data is in a consistent state when a transaction starts and when it ends, guaranteeing that data is accurate and reliable.

Isolation
Allows multiple transactions to occur at the same time without interfering with each other, ensuring that each transaction executes independently.

Durability
Means that data is not lost or corrupted once a transaction is submitted. Data can be recovered in the event of a system failure, such as a power outage.
[Illustration slides: each ACID property is shown with a bank-transfer example in which a debit from one account and a credit to another either both complete or both fail (atomicity), leave the balances consistent (consistency), run concurrently without interfering (isolation), and survive a system failure once committed (durability).]
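To make the bank-transfer example concrete, here is a small sketch using Python's built-in sqlite3 module, which provides ACID transactions; the table, accounts, and amounts are invented, and any transactional database would behave the same way.

```python
import sqlite3

# Set up two accounts; the schema and balances are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 0)])
conn.commit()

def transfer(amount):
    """Debit account 1 and credit account 2 as a single atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 1", (amount,))
            if amount > 400:
                raise ValueError("transfer limit exceeded")  # simulated mid-transaction failure
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 2", (amount,))
    except ValueError:
        pass  # rollback leaves both balances untouched: no half-way state

transfer(100)   # succeeds: both the debit and the credit are applied
transfer(450)   # fails partway: atomicity rolls back the debit as well
print(conn.execute("SELECT id, balance FROM accounts").fetchall())
# [(1, 400), (2, 100)] – the total stays consistent at 500
```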
Metastore
Table format
What is a metastore?

Hive Metastore (HMS) is the metastore used by watsonx.data.
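The sketch below illustrates how a query engine uses a Hive-compatible metastore, using PySpark with Hive support enabled; the thrift URI, schema, table, and storage location are placeholders for this example, not a watsonx.data configuration.

```python
from pyspark.sql import SparkSession

# Point Spark at a Hive-compatible metastore so that table definitions
# (schemas, partitions, file locations) are shared across engines.
# The metastore URI below is a placeholder for this sketch.
spark = (
    SparkSession.builder
    .appName("metastore-sketch")
    .config("hive.metastore.uris", "thrift://metastore.example.com:9083")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS sales")

# Registering a table stores only metadata in the metastore;
# the data itself stays in files on object storage or HDFS.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders (order_id BIGINT, amount DOUBLE)
    USING parquet
    LOCATION 's3a://lakehouse-demo/sales/orders/'
""")

# Any engine that talks to the same metastore can now discover the table.
spark.sql("SHOW TABLES IN sales").show()
```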
Query engines
Presto
• Presto is an open-source distributed SQL engine suitable for querying large amounts of data
• Supports both relational (e.g. MySQL, Db2) and non-relational sources (e.g. MongoDB, Elasticsearch)
• Easy to use with data analytics and business intelligence tools
• Supports both interactive and batch workloads
• Presto connectors allow access to data in place, enabling no-copy data access and federated querying (see the sketch below)
• Consumers are abstracted from the physical location of data
• A wide variety of data sources are supported
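As an illustration of submitting SQL to Presto from an application, the sketch below uses the presto-python-client package (prestodb); the coordinator host, catalog, schema, and table names are placeholders for this example.

```python
import prestodb

# Connect to a Presto coordinator; host, catalog, and schema are placeholders.
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="sales",
)

cur = conn.cursor()

# The same SQL interface works across connectors, so queries can reach data
# in place, wherever it lives, without copying it first.
cur.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
""")

for row in cur.fetchall():
    print(row)
```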
Presto architecture

The structure of Presto is similar to that of classical MPP database management systems: a client submits SQL to a coordinator, which plans the query and distributes work across worker nodes that read data from object storage. Query performance is improved by data caching, provided by Alluxio and RaptorX.

[Diagram: client → coordinator → workers → object storage, with Alluxio and RaptorX caching]
Apache Spark
Apache Spark is an open-source data-processing engine for large data sets. It is designed
to deliver the computational speed, scalability, and programmability required for big data,
specifically for streaming data, graph data, ML, and AI applications.
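A minimal PySpark sketch is shown below to give a feel for the programming model; the input path is a placeholder, and it assumes pyspark is installed and run locally.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; the same code runs distributed on a cluster.
spark = SparkSession.builder.appName("spark-sketch").master("local[*]").getOrCreate()

# Read a Parquet file into a distributed DataFrame (the path is a placeholder).
orders = spark.read.parquet("/data/orders.parquet")

# Transformations are lazy and optimized; this aggregation runs in parallel
# across partitions when the show() action is triggered.
(orders
 .groupBy("region")
 .agg(F.sum("amount").alias("total_amount"))
 .orderBy(F.desc("total_amount"))
 .show())

spark.stop()
```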
Apache Spark and machine learning
• Spark Streaming
• Spark SQL
• Spark GraphX
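Spark also ships a machine learning library, MLlib. The sketch below trains a simple logistic regression model with the pyspark.ml API; the training data and column names are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").master("local[*]").getOrCreate()

# A tiny, invented training set: two features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.1, 0), (1.5, 0.3, 1), (2.0, 2.2, 1), (0.2, 0.1, 0)],
    ["f1", "f2", "label"],
)

# MLlib estimators expect a single vector column of features.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train_vec = assembler.transform(train)

# Fit a logistic regression model and inspect its predictions on the training data.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_vec)
model.transform(train_vec).select("f1", "f2", "label", "prediction").show()

spark.stop()
```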