02 - Introduction To Data Lakehouse Open-Source Technologies


IBM watsonx.data
Introduction to Data Lakehouse Open-Source Technologies
Content by:
Kevin Shen
Product Manager | Data & AI Software
[email protected]

Anson Kokkat
Product Manager | Data & AI Software
[email protected]

Joshua Kim
Program Director | Data & AI Software
[email protected]

Adam Learmonth
Advisory, Learning Content Development
[email protected]

Kelly Schlamb
WW Technology Sales Enablement | Data & AI
[email protected]

Presenter:
Ahmad Muzaffar Baharudin
Technical Enablement Specialist | Data & AI
[email protected]
Agenda
• Data Warehouse vs Data Lake
• Data management evolution &
first-generation data lakehouse
• Overview of watsonx.data components
• Storage
• File format
• Table format
• ACID transactions
• Metastore
• Query engines

Traditional approaches: Data Warehouse vs Data Lake

101: Data Warehouse
How can we ask enterprise-wide questions requiring historical data?
• A data warehouse gathers structured data from multiple sources into a central repository.
• Supports multiple data analytics and business intelligence applications, such as enterprise reporting, to answer questions about your business.
• Normalized and structured data made it easy to analyze, but it was an expensive choice.
• High performance and governed, serving BI and AI/ML workloads, but expensive, highly complex, non-scalable, and prone to vendor lock-in.

101: Data Lake
How can we discover what we don't know?
• As the volume, velocity, and variety of data grew, data lakes were designed for data discovery and data science/machine learning use cases.
• Commonly built on large data platforms such as Hadoop (HDFS).
• Data is stored in raw, unstructured format = lower cost for large volumes of data.
• Highly flexible and scalable.
• Difficult to use and govern, and complex to maintain; required data scientists. Many are now data swamps.
• Low-cost storage for unstructured data, but with limited performance and a heavy maintenance burden.

Additional links:
Databases 101
What is a data warehouse
What is a data lake
Learn more about the data lakehouse kitchen analogy in this 101 video
Data management evolution and
first-generation lakehouses

The data warehouse remains the center of analytics at most organizations

(Timeline: late 1990s to the present)

• Data warehouses emerged as the dominant method to analyze data.
• Normalized and trusted data made it easy to analyze; however, it is an expensive choice.
• Warehouse technology has evolved continuously to improve, from appliance form factors to in-memory technologies.

Drawbacks:
− High up-front costs
− Structured data only
− ETL required
− Vendor lock-in
− Limited scalability
The emergence of the data lake

• As the volume, velocity, and variety of data grew, data lakes emerged as the new technology to replace data warehouses.
• Data is stored in raw, unstructured format = lower cost for large volumes of data.
• Highly flexible and scalable.
• Difficult to use and complex to maintain; required a data scientist.
• Ultimately, most data lakes failed and required a two-tier architecture.

Data warehouse drawbacks:
− High up-front costs
− Structured data only
− ETL required
− Vendor lock-in
− Limited scalability

Data lake drawbacks:
− High complexity
− Poor data quality
− Limited performance
− Expensive to maintain
Cloud data warehouses evolved to address specific challenges of data warehouses

• Specifically, cloud data warehouses introduced the separation of compute and storage.
• This addressed the scalability challenge of traditional warehouses, with no data redistribution required.
• More compute resources can be added against the same data to solve the problem.
• Easier to manage; however, much more expensive than on-premises warehouses.

Cloud data warehouse drawbacks:
− Data migration
− Vendor lock-in
− High costs
− Limited AI/ML use cases
Traditional approaches to addressing these challenges have created more overall complexity and cost, which has led to the emergence of data lakehouse architectures

Today, leaders at most large enterprises manage their data and workloads using a mix of data repositories and data stores in hybrid environments. The overall cost across all these repositories remains high. It is difficult for leaders to effectively leverage and govern the data across multiple environments and use enterprise data for analytics and AI.
The data lakehouse

Data lakehouse (n): A data lakehouse combines the high-performance characteristics of a data warehouse with the cost-efficiency, flexibility, and scalability of a data lake to support highly complex data transformations and a wide variety of use cases.
Lakehouses are a new approach meant to combine the advantages of data warehouses and data lakes, but first-generation lakehouses still have key constraints

First-generation lakehouses are still limited in their ability to address cost and complexity challenges:

• Single query engines set up to support limited workloads, typically just BI or ML
• Typically deployed over cloud only, with no support for multicloud or hybrid-cloud deployments
• Minimal governance and metadata capabilities to deploy across your entire ecosystem
IBM watsonx.data: the next-generation data lakehouse

Completely open, with no lock-in, and built on a foundation of industry-embraced open-source technologies. It connects to your existing ecosystem of data warehouses and data lakes, and is layered as follows:

• Query engines: multiple engines, including Presto and Spark, that provide fast, reliable, and efficient processing of big data at scale.
• Governance and metadata: a metadata store and access control management with built-in governance that is compatible with existing solutions such as watsonx.governance and IBM Knowledge Catalog.
• Data format: vendor-agnostic open formats for analytic data sets, allowing different engines to access and share the same data at the same time.
• Storage: cost-effective, simple object storage available across hybrid-cloud and multicloud environments.
• Infrastructure: hybrid-cloud deployments and workload portability across hyperscalers and on-prem with Red Hat OpenShift.
A more flexible, scalable, and cost-effective solution for data management

Classic data landscape: a separate data lake and data warehouse, each with its own metadata and its own copies of the data (fact and dimension tables, customer profiles, third-party data, snapshots, views), feeding data science notebooks and ML model training, and connected by constant data movement and transformation.

Data lakehouse with watsonx.data: workloads remain in place, with access to all data cataloged; the old data lake and warehouse engines are joined by new ones; metadata and storage are sharable, so the same assets (credit card transactions, loan payments, bank transfers) serve every engine without duplication.
Let's get familiar with some common open-source technologies used in watsonx.data and data lakehouse architectures in general.

Storage, data file formats, and table formats
Storage
What is object storage?

Object storage:
• Low cost
• Near-unlimited scalability
• Extreme durability and reliability (99.999999999%)
• High throughput
• High latency (but this can be compensated for)
• Basic units are objects, which are organized in buckets

Many vendors (including IBM Cloud) offer S3-compatible object storage.
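The bucket-and-object model above can be sketched in a few lines. This is a toy in-memory stand-in, not a real client such as an S3 SDK; the class and method names loosely echo S3-style APIs but are made up for illustration. Note the flat key namespace: "folders" are only key prefixes by convention.

```python
# Toy model of object storage: buckets holding objects under flat string keys.
class ObjectStore:
    def __init__(self):
        self.buckets = {}  # bucket name -> {object key: bytes}

    def create_bucket(self, name):
        self.buckets[name] = {}

    def put_object(self, bucket, key, data):
        # Keys are flat strings; "directories" are just key prefixes.
        self.buckets[bucket][key] = data

    def get_object(self, bucket, key):
        return self.buckets[bucket][key]

    def list_objects(self, bucket, prefix=""):
        # Prefix listing is how lakehouse engines enumerate a table's files.
        return sorted(k for k in self.buckets[bucket] if k.startswith(prefix))

store = ObjectStore()
store.create_bucket("lakehouse")
store.put_object("lakehouse", "sales/2024/part-0001.parquet", b"...")
store.put_object("lakehouse", "sales/2024/part-0002.parquet", b"...")
store.put_object("lakehouse", "customers/part-0001.parquet", b"...")

print(store.list_objects("lakehouse", prefix="sales/"))
# ['sales/2024/part-0001.parquet', 'sales/2024/part-0002.parquet']
```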
The rise of cloud object storage for data lakes and lakehouses

Cloud object storage (COS) is displacing HDFS as the de facto storage technology for data lakes:

• Elasticity: COS, yes (compute and storage decoupled); HDFS, no. S3 is more elastic.
• Cost/TB/month: COS $23; HDFS $206. Roughly 10x cheaper.
• Performance: COS 20 MB/s/core; HDFS 90 MB/s/core. About 2x better price/performance for COS.
• Availability: COS 99.99%; HDFS 99.9% (estimated). Roughly 10x.
• Durability: COS 99.999999999%; HDFS 99.9999% (estimated). 10x or more.
• Transactional writes: most COS technologies now provide strong consistency; HDFS, yes. Comparable.
File format
Common data file formats

Computer systems and applications store data in files. Data can be stored in binary or text format, and file formats can be open or closed (proprietary/lock-in). Open formats (Parquet, ORC, and Avro) are commonly used in data lakes and lakehouses.

CSV:
• Human-readable text
• Each row corresponds to a single data record
• Each record consists of one or more fields, delimited by commas

JSON (JavaScript Object Notation):
• Human-readable text
• Open file and data interchange format
• Consists of attribute-value pairs and arrays

Parquet:
• Open-source, binary columnar storage
• Designed for efficient data storage and fast retrieval
• Highly compressible
• Self-describing

ORC:
• Open-source, binary columnar storage
• Designed and optimized for Hive data
• Self-describing
• Similar in concept to Parquet

Avro:
• Open-source, row-oriented data format and serialization framework
• Robust support for schema evolution
• Mix of text and binary
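To make the two human-readable text formats concrete, here are the same three records serialized both ways using only the Python standard library. The field names are invented for the example.

```python
# Serialize identical records as CSV (one comma-delimited row per record)
# and as JSON (an array of attribute-value objects).
import csv
import io
import json

records = [
    {"id": 1, "name": "Ada", "country": "UK"},
    {"id": 2, "name": "Grace", "country": "US"},
    {"id": 3, "name": "Ming", "country": "SG"},
]

# CSV: a header row for field names, then one row per record.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "country"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: attribute-value pairs inside an array.
json_text = json.dumps(records)

print(csv_text)
print(json_text)
```

Both outputs are plain text you can open in any editor, which is exactly why CSV and JSON are convenient for interchange but bulkier and slower to scan than the binary columnar formats below.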
Apache Parquet

Parquet is designed to support fast data processing for complex data. It uses column-oriented rather than row-oriented storage:
• Open-source
• Columnar storage
• Highly compressible, with configurable compression options and extendable encoding schemes by data type
• Self-describing: schema and structure metadata is included
• Schema evolution, with support for automatic schema merging

Why do these things matter in a lakehouse?
• Query performance is directly impacted by the size and number of files being read.
• The ability to read and write data in an open format from multiple runtime engines enables collaboration.
• The size of data stored, the amount of data scanned, and the amount of data transported affect the charges (cost) incurred in using a lakehouse (depending on the pricing model).
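The row-versus-column distinction can be sketched without Parquet itself. This is a conceptual illustration in plain Python (the table and column names are made up): an analytic query that touches one column only needs that column in the columnar layout, which is the core reason formats like Parquet suit lakehouse analytics.

```python
# The same small table in two layouts.
rows = [  # row-oriented: whole records stored together
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 2, "customer": "b", "amount": 20.0},
    {"order_id": 3, "customer": "a", "amount": 30.0},
]

columns = {  # column-oriented: each column stored contiguously
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [10.0, 20.0, 30.0],
}

# A query like SELECT SUM(amount): the row layout walks every field of
# every record; the column layout touches only the one list it needs.
total_from_rows = sum(r["amount"] for r in rows)
total_from_columns = sum(columns["amount"])
assert total_from_rows == total_from_columns == 60.0

# A contiguous column of one type also compresses well (e.g. the repeated
# customer value "a" invites run-length encoding), another Parquet property.
print(total_from_columns)
```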
Apache ORC

• Open-source, columnar storage format (column-oriented rather than row-oriented)
• Similar in concept to Parquet, but a different design
• Parquet is considered to be more widely used than ORC
• Highly compressible, with multiple compression options
• Considered to have higher compression rates than Parquet
• Self-describing and type-aware
• Support for schema evolution
• Built-in indexes to enable skipping of data not relevant to a query
• Excellent performance for read-heavy workloads

• ORC is generally better for workloads involving frequent updates or appends
• Parquet is generally better for write-once, read-many analytics
Apache Avro

• Open-source, row-based storage and serialization format
• Can be used for file storage or message passing
• Beneficial for write-intensive workloads
• Format contains a mix of text and binary:
− Data definition: text-based JSON
− Data blocks: binary
• Robust support for schema evolution:
− Handles missing/added/changed fields
• Language-neutral data serialization:
− APIs included for Java, Python, Ruby, C, C++, and more
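Avro's schema evolution works by resolving a record written under an older writer schema against the current reader schema. The sketch below illustrates that idea in plain Python; it is not the Avro library, and the schema structure here (a field list plus a defaults map) is a simplification of Avro's real JSON schema format.

```python
# Conceptual sketch of Avro-style schema resolution: fields added in the
# reader schema are filled from defaults, fields removed from it are dropped.
writer_schema = {"fields": ["id", "name", "fax"]}          # how data was written
reader_schema = {
    "fields": ["id", "name", "email"],                      # current schema
    "defaults": {"email": "unknown"},                       # default for new field
}

def resolve(record, reader):
    out = {}
    for field in reader["fields"]:
        if field in record:
            out[field] = record[field]          # field present: keep it
        else:
            out[field] = reader["defaults"][field]  # missing: use default
    return out  # fields the reader no longer knows ("fax") are dropped

old_record = {"id": 7, "name": "Ming", "fax": "555-0100"}  # written long ago
print(resolve(old_record, reader_schema))
# {'id': 7, 'name': 'Ming', 'email': 'unknown'}
```

This is why old Avro files stay readable as the schema changes: resolution happens at read time, with no rewrite of stored data.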
Table format
Table management and formats

A table format sits "above" the data file layer: it organizes and manages table metadata and data, typically supports multiple underlying disk file formats (Parquet, Avro, ORC, etc.), and may offer transactional concurrency, insert/update/delete, indexing, time-based queries, and other capabilities. The three most common open table formats are Apache Iceberg, Apache Hudi, and Delta Lake.

Apache Iceberg:
• Open-source
• Designed for large, petabyte (PB)-scale tables
• ACID-compliant transaction support
• Capabilities not traditionally available with other table formats, including schema evolution, partition evolution, and table version rollback, all without rewriting data
• Advanced data filtering

Apache Hudi:
• Open-source
• Manages the storage of large datasets on HDFS and cloud object storage
• Includes support for tables, ACID transactions, upserts/deletes, advanced indexes, streaming ingestion services, concurrency, data clustering, and asynchronous compaction
• Multiple query options: snapshot, incremental, and read-optimized
• Time-travel queries let you see data at points in the past

Delta Lake:
• Open-source, but Databricks is the primary contributor and user and controls all commits to the project, so effectively "closed"
• Foundation for storing data in the Databricks Lakehouse Platform
• Extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling
• Capabilities include indexing, data skipping, compression, caching, and time-travel queries
• Designed to handle batch as well as streaming data
Why Apache Iceberg for data lakehouses?

Iceberg is an open-source table format that helps simplify data processing on large datasets stored in data lakes.

People love it because it has:

• SQL: use it to build the data lake and perform most operations without learning a new language
• Data consistency: ACID compliance (not just append operations on tables)
• Schema evolution: add or remove columns without disrupting the underlying table structure
• Data versioning: time travel support that lets you analyze data changes between updates and deletes
• Cross-platform support: supports a variety of storage systems and query engines (Spark, Presto, Hive, and more)
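The data-versioning point rests on snapshots: each commit records an immutable list of the data files that make up the table, so older versions remain queryable. The sketch below is a toy model of that idea, not the Iceberg API; the class, method, and file names are invented for illustration.

```python
# Toy sketch of snapshot-based versioning, the idea behind time travel.
class Table:
    def __init__(self):
        self.snapshots = []  # list of (snapshot_id, tuple of data files)

    def commit(self, files):
        # Each commit appends a new immutable snapshot; nothing is rewritten.
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append((snapshot_id, tuple(files)))
        return snapshot_id

    def current_files(self):
        return self.snapshots[-1][1]  # what a normal query reads

    def files_as_of(self, snapshot_id):
        # Time travel: read the file list recorded by an older snapshot.
        for sid, files in self.snapshots:
            if sid == snapshot_id:
                return files
        raise KeyError(snapshot_id)

t = Table()
s1 = t.commit(["part-0001.parquet"])
s2 = t.commit(["part-0001.parquet", "part-0002.parquet"])  # append
s3 = t.commit(["part-0003.parquet"])                       # compaction/rewrite

print(t.current_files())   # latest snapshot
print(t.files_as_of(s1))   # how the table looked at snapshot 1
```

Real Iceberg exposes the same capability through SQL (for example, snapshot-id or timestamp clauses on a SELECT), with the snapshot list kept in table metadata files rather than in memory.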
ACID transactions

ACID transactions

ACID refers to a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps.

Atomicity
Guarantees that each transaction is a single event that either succeeds or fails completely; there is no half-way state.

Consistency
Ensures that data is in a consistent state when a transaction starts and when it ends, guaranteeing that data is accurate and reliable.

Isolation
Allows multiple transactions to occur at the same time without interfering with each other, ensuring that each transaction executes independently.

Durability
Means that data is not lost or corrupted once a transaction is submitted. Data can be recovered in the event of a system failure, such as a power outage.
ACID transactions in practice

Scenario: a bank application transfers funds from one account to another.

Atomicity: the debit from account 1 and the credit to account 2 form a single transaction. Both succeeding, or both failing, are acceptable outcomes; a debit that commits without its matching credit (or a credit without its debit) leaves the transfer half done and must not happen.
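The bank-transfer scenario can be run directly with SQLite from the Python standard library, whose transactions are ACID: when a failure occurs between the debit and the credit, the whole transaction rolls back. Account IDs and amounts below are invented for the demo.

```python
# Atomicity demo: a transfer's debit and credit either both commit or
# both roll back when an error occurs partway through.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
con.execute("INSERT INTO accounts VALUES (1, 500), (2, 0)")
con.commit()

def transfer(amount):
    try:
        with con:  # opens a transaction; commits on success, rolls back on error
            con.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = 1", (amount,))
            if amount > 500:
                # Simulated failure after the debit but before the credit.
                raise ValueError("insufficient funds")
            con.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = 2", (amount,))
    except ValueError:
        pass  # the partial debit was rolled back automatically

transfer(100)   # succeeds: both sides applied
transfer(9999)  # fails halfway: neither side applied

balances = dict(con.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 400, 2: 100} -- no half-completed transfer
```

After the failed transfer, account 1 still holds 400: there is no state in which money left one account without arriving in the other.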
Consistency: the amounts must balance. Moving 100 out of one account must put exactly 100 into the other (likewise for a 5-unit transfer); a transaction that debits 100 but credits only 10, or debits 5 and credits nothing, would leave the data in an invalid state and must be rejected.
Isolation: two transfers running at the same time (transaction 1 moving 100 and transaction 2 moving 5) must not interfere with each other. Each executes as if it were alone, so the combined result is the same as running them one after the other.
Durability: once transaction 1 (moving 100) and transaction 2 (moving 5) have committed, their effects are permanent. If a system failure such as a power outage occurs afterwards, both transfers must still be reflected when the system recovers; a committed transfer that partially or fully disappears after a crash violates durability.
Metastore

Metastore
What is a metastore?

• Manages metadata for the tables in the lakehouse, including:
− Schema information (column names, types)
− Location and type of data files
• Similar in principle to the system catalogs of a relational database
• A shared metastore ensures query engines see schema and data consistently
• May be a built-in component of a larger integration/governance solution

Hive metastore (HMS), used by watsonx.data:
• A component of Hive, but can run standalone
• Open-source
• Manages tables on HDFS and cloud object storage
• Pervasive use in industry

AWS Glue Data Catalog:
• Component of the AWS Glue integration service
• Inventories data assets of AWS data sources
• Includes location, schema, and runtime metrics

Microsoft Purview Data Catalog:
• Component of the Microsoft Purview data governance solution
• Helps manage on-premises, multicloud, and SaaS data
• Offers discovery, classification, and lineage

Databricks Unity Catalog:
• Provides centralized access control, auditing, lineage, and data discovery across a Databricks lakehouse
• Contains data and AI assets including files, tables, machine learning models, and dashboards
Hive Metastore (HMS)

• Open-source Apache Hive was built to provide an SQL-like query interface for data stored in Hadoop.
• Hive Metastore (HMS) is a component of Hive that stores metadata for tables, including schema and location.
• HMS can be deployed standalone, without the rest of Hive, which is often what lakehouses such as watsonx.data need.
• Query engines use the metadata in HMS to optimize query execution plans: the metastore references metadata about the data in storage, and the engines then read the data itself from the lakehouse object storage.
• The metadata is stored in a traditional relational database (PostgreSQL in the case of watsonx.data).
• In watsonx.data, IBM Knowledge Catalog integrates with HMS to provide policy-based access and governance.
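What a metastore holds, and how an engine uses it, can be sketched as a small lookup table. This is a conceptual stand-in, not the HMS Thrift API; the table name, location, and `plan_scan` helper are invented for illustration.

```python
# Toy sketch of a metastore: per-table schema, storage location, and file
# format, shared so every engine plans the same read against the same data.
metastore = {
    "sales.orders": {
        "schema": [("order_id", "bigint"), ("amount", "double")],
        "location": "s3://lakehouse/sales/orders/",
        "format": "parquet",
    },
}

def plan_scan(table_name):
    # A query engine looks up the table once, then reads the data files
    # directly from object storage at the recorded location.
    entry = metastore[table_name]
    columns = [name for name, _type in entry["schema"]]
    return {"read": entry["location"], "as": entry["format"], "columns": columns}

print(plan_scan("sales.orders"))
```

Because every engine consults the same entry, they agree on the table's schema and location, which is the consistency guarantee the slide describes.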
Query engines

Query engines
Presto

• Presto is an open-source distributed SQL engine suitable for querying large amounts of data.
• Supports both relational sources (e.g. MySQL, Db2) and non-relational sources (e.g. MongoDB, Elasticsearch).
• Easy to use with data analytics and business intelligence tools.
• Supports both interactive and batch workloads.
• Presto connectors allow access to data in place, enabling no-copy data access and federated querying.
• Consumers are abstracted from the physical location of data.
• A wide variety of data sources are supported.
• In watsonx.data, you can spin up one or more Presto compute engines of various sizes. This is cost-effective because engines are ephemeral and can be spun up and shut down as needed.
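Federated querying means one SQL statement spanning multiple independent sources without first copying the data into a warehouse. Presto itself is a distributed service, so as a small stdlib-only stand-in for the idea (not Presto), the sketch below joins tables that live in two separate SQLite database files in a single query; the file, table, and column names are made up for the demo.

```python
# "Federation" in miniature: one SQL query across two separate databases.
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
crm_path = os.path.join(tmp, "crm.db")      # source 1: customer data
sales_path = os.path.join(tmp, "sales.db")  # source 2: order data

with sqlite3.connect(crm_path) as crm:
    crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    crm.execute("INSERT INTO customers VALUES (1, 'Ada'), (2, 'Ming')")

with sqlite3.connect(sales_path) as sales:
    sales.execute("CREATE TABLE orders (customer_id INTEGER, amount INTEGER)")
    sales.execute("INSERT INTO orders VALUES (1, 40), (1, 60), (2, 25)")

# Attach both sources and join across them in place, with no data copied
# into a central store first -- the role Presto connectors play at scale.
con = sqlite3.connect(":memory:")
con.execute("ATTACH DATABASE ? AS crm", (crm_path,))
con.execute("ATTACH DATABASE ? AS sales", (sales_path,))
result = con.execute("""
    SELECT c.name, SUM(o.amount)
    FROM crm.customers c JOIN sales.orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(result)  # [('Ada', 100), ('Ming', 25)]
```

The consumer of `result` never sees where either table physically lives, which is the abstraction the bullet points above describe.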
Presto architecture

The structure of Presto is similar to that of classical MPP database management systems:

• Client: issues the user query and receives the final result.
• Coordinator: parses the statement, plans query execution, and manages the worker nodes; its scheduler consults the Hive Metastore for table metadata. It gathers results from the workers and returns the final result to the client.
• Workers: execute tasks and process data, reading from object storage.
• Connectors: integrate Presto with external data sources such as object stores, relational databases, or Hive.
• Caching: query execution is accelerated through metadata caching and data caching (provided by Alluxio and RaptorX).
Apache Spark

Apache Spark is an open-source data-processing engine for large data sets. It is designed to deliver the computational speed, scalability, and programmability required for big data, specifically for streaming data, graph data, ML, and AI applications.

[Slide shows the basic Apache Spark architecture diagram.]
Apache Spark and machine learning

Spark has libraries that extend its capabilities to ML, AI, and stream processing:

• Apache Spark MLlib
• Spark Streaming
• Spark SQL
• Spark GraphX
