
Databricks Lakehouse Fundamentals Slide Deck


What is a data lakehouse?
History of data management

©2022 Databricks Inc. — All rights reserved 1


Learning objectives

• Describe the origin and purpose of the data lakehouse.


• Explain the challenges of managing and using big data.



The history of data management and analytics


1980s
Businesses need more than relational databases


Data Warehouses

Pros:
● Business intelligence (BI)
● Analytics
● Structured & clean data
● Predefined schemas


Cons:
● No support for semi-structured or unstructured data
● Inflexible schemas
● Struggled with increases in data volume and velocity
● Long processing times


2000s
Big Data explosion


Data Lakes

Pros:
● Flexible data storage
● Streaming support
● Cost efficient in the cloud
● Support for AI and Machine Learning

[Diagram: Data Lake holding structured, semi-structured, and unstructured data]


Data Lakes

Cons:
● No transactional support
● Poor data reliability
● Slow analysis performance
● Data governance concerns
● Data warehouses still needed

[Diagram: Data Lake holding structured, semi-structured, and unstructured data]


Businesses required two disparate, incompatible data platforms


Companies reporting measurable value from data



Key features of a data lakehouse:

• Transaction support
• Schema enforcement and governance
• Data governance
• BI support
• Decoupled storage from compute
• Open storage formats
• Support for diverse data types
• Support for diverse workloads
• End-to-end streaming


What is the Databricks Lakehouse Platform?
Databricks and the Data Lakehouse Platform


Learning objectives

• Recall the origins of Databricks and the Databricks Lakehouse Platform.
• Define the Databricks Lakehouse Platform.
• Give examples of how the Databricks Lakehouse Platform solves big data challenges.
• Describe how the Databricks Lakehouse Platform benefits data engineers, data analysts, and data scientists.


Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. M. Armbrust, A. Ghodsi, R. Xin, M. Zaharia. 11th Annual Conference on Innovative Data Systems Research (CIDR ’21), January 11–15, 2021, Online.


Databricks Lakehouse Platform

Simple
Unify your data warehousing and AI use cases on a single platform

Open
Built on open source and open standards

Multicloud
One consistent data platform across clouds


Databricks Lakehouse Platform Architecture and Security Fundamentals
Data reliability and performance


Learning objectives

• Explain the importance of data reliability and performance for platform architecture.
• Define Delta Lake and its features.
• Describe how Photon improves the performance of the Databricks Lakehouse Platform.


Why are data reliability and performance important?


bad data in = bad data out



Problems encountered when using data lakes

● Lack of ACID transaction support


● Lack of schema enforcement
● Lack of integration with a data catalog
● Ineffective partitioning
● Too many small files



Databricks Lakehouse Platform

Photon



Delta Lake

● ACID transaction guarantees


● Scalable data and metadata handling
● Audit history and time travel
● Schema enforcement and schema evolution
● Support for deletes, updates, and merges
● Unified streaming and batch data processing
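
As a rough illustration of how a transaction log can give all-or-nothing commits, schema enforcement, and time travel, the sketch below models a table as an ordered list of committed versions. It is a toy in-memory model, not Delta Lake's implementation or API; the class name `ToyDeltaTable`, its methods, and the sample rows are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaTable:
    """Toy in-memory table with an append-only commit log (hypothetical, not the Delta Lake API)."""
    schema: dict                               # column name -> expected Python type
    log: list = field(default_factory=list)    # one entry per committed version

    def append(self, rows):
        """Validate every row first, then commit the whole batch as one new version."""
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"schema mismatch: {sorted(row)}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise TypeError(f"column {col!r} expects {typ.__name__}")
        # Nothing is written unless the whole batch passed validation.
        self.log.append(list(rows))
        return len(self.log) - 1               # version number of this commit

    def read(self, version=None):
        """Return rows as of `version` (latest by default): a simple form of time travel."""
        if version is None:
            version = len(self.log) - 1
        rows = []
        for commit in self.log[: version + 1]:
            rows.extend(commit)
        return rows


table = ToyDeltaTable(schema={"id": int, "city": str})
v0 = table.append([{"id": 1, "city": "Oslo"}])
v1 = table.append([{"id": 2, "city": "Lima"}, {"id": 3, "city": "Pune"}])

try:
    # The second row has a string id, so the whole batch is rejected.
    table.append([{"id": 4, "city": "Rome"}, {"id": "5", "city": "Kyiv"}])
except TypeError:
    pass

latest_count = len(table.read())
first_version_count = len(table.read(version=v0))
```

Reading with `version=v0` returns the table as it was after the first commit, while the rejected batch leaves no trace in the log.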



Additional points:

● Compatible with Apache Spark™
● Uses Delta Tables
● Has a transaction log
● Is an open-source project

[Diagram: Lakehouse Platform layers. Data Warehousing, Data Engineering, Data Streaming, and Data Science and ML sit on top of Unity Catalog (fine-grained governance for data and AI), Delta Lake (data reliability and performance), and the Cloud Data Lake (all structured and unstructured data).]

What is Photon?



[Diagram labels: Spark Instructions, Photon Instructions, Photon Engine, Delta/Parquet, Photon Writer to Delta/Parquet]

Reported workloads impacted by Photon

● SQL-based jobs
● IoT use cases
● Data privacy and compliance
● Loading data into Delta and Parquet



Databricks Lakehouse Platform Architecture and Security Fundamentals
Unified governance and security


Learning objectives

• Explain the importance of unified governance and security to platform architecture.
• Recognize the security features of the Databricks Lakehouse Platform.
• Explain how Unity Catalog and Delta Sharing are used.
• Differentiate between the control plane and the data plane in the Databricks Lakehouse Platform architecture.


Why is a unified governance and security structure important?


Challenges to data and AI governance

● Diversity of data and AI assets


● Using two disparate and incompatible data platforms
● Rise of multi-cloud adoption
● Fragmented tool usage for data governance



How the Databricks Lakehouse Platform solves data and AI governance challenges

Control plane

Unity Catalog

Data plane


Unity Catalog

Data and cloud storage

Data governance and catalog partners



Data sharing with Delta Sharing

Benefits of Delta Sharing

● Open cross-platform sharing


● Share live data without copying it
● Centralized administration and governance
● Marketplace for data products
● Privacy-safe data clean rooms



Divided security architecture

The control plane and the data plane


Control plane



Security of the data plane

Networking Servers Databricks



User identity and access

● Table ACLs feature


● IAM instance profiles
● Securely stored access key
● The Secrets API



Data security



Compliance

● SOC 2 Type II
● ISO 27001
● ISO 27017
● ISO 27018
● FedRAMP High
● HITRUST
● HIPAA
● PCI

GDPR and CCPA ready


Databricks Lakehouse Platform Architecture and Security Fundamentals
Instant compute and serverless


Learning objectives

• Describe compute resources for your Databricks Lakehouse Platform.


• Define serverless compute.
• Explain the benefits of using Databricks Serverless SQL.



Classic data plane

[Diagram: three columns. Customer: the data plane (clusters) and your cloud storage (data, DBFS root). Databricks: the control plane (web application, configurations, notebooks, repos, DBSQL, cluster manager). Users: interactive users and BI apps.]


Compute resource challenges

● Cluster creation is complicated
● Environment startup is slow
● Business cloud account limitations and resource options
● Long running clusters
● Over provisioning of resources
● Higher resource costs
● High admin overhead
● Unproductive users


Serverless data plane

[Diagram: three columns. Customer: your cloud storage (data, DBFS root). Databricks: the serverless data plane (clusters, unallocated pool) and the control plane (web application, configurations, notebooks, repos, DBSQL, cluster manager). Users: interactive users and BI apps.]


Databricks Serverless SQL

Users: Increase productivity
Instant query execution
Built-in connectors

Admins: Reduce effort
Databricks optimally configures the cluster, manages updates to the VMs

Finance: Lower cost
Reduce idle time
No over-provisioning


Serverless SQL Compute (Databricks cloud account)

Managed Servers
Always-running server fleet
Patched and upgraded automatically

Elastic
Scale up and down automatically

Instant Compute
Allocation in seconds

Secure
Three-layer isolation with data encryption


Databricks Lakehouse Platform Architecture and Security Fundamentals
Introduction to lakehouse data management terminology


Learning objectives

• Define the terms metastore, catalog, schema, table, view, and function.
• Describe how these terms relate to data management in the Databricks Lakehouse Platform.


Databricks Lakehouse Platform
Unity Catalog

[Diagram: Lakehouse Platform layers. Data Warehousing, Data Engineering, Data Streaming, and Data Science and ML sit on top of Unity Catalog (fine-grained governance for data and AI), Delta Lake (data reliability and performance), and the Cloud Data Lake (all structured and unstructured data).]


Databricks Unity Catalog
Unified governance for all data and AI assets

● Centralized governance for data and AI
● Built-in data search and discovery
● Performance and scale
● Automated lineage for all workloads
● Integrated with your existing tools

[Diagram: Unity Catalog shared by multiple Databricks workspaces. Access is expressed as GRANT … ON … TO … and REVOKE … ON … FROM … over catalogs, schemas, tables, views, storage credentials, and external locations.]
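
To make the GRANT … ON … TO … and REVOKE … ON … FROM … pattern concrete, here is a minimal sketch that tracks privileges per securable object and principal. It is a toy bookkeeping model, not a Databricks or Unity Catalog API; the class `ToyPrivileges`, the object name `main.sales.orders`, and the group `analysts` are made up for illustration.

```python
from collections import defaultdict


class ToyPrivileges:
    """Toy model of GRANT ... ON ... TO ... and REVOKE ... ON ... FROM ... (hypothetical)."""

    def __init__(self):
        # (securable, principal) -> set of privilege names
        self._grants = defaultdict(set)

    def grant(self, privilege, securable, principal):
        """GRANT <privilege> ON <securable> TO <principal>."""
        self._grants[(securable, principal)].add(privilege)

    def revoke(self, privilege, securable, principal):
        """REVOKE <privilege> ON <securable> FROM <principal>."""
        self._grants[(securable, principal)].discard(privilege)

    def is_allowed(self, privilege, securable, principal):
        """Check whether the principal currently holds the privilege."""
        return privilege in self._grants.get((securable, principal), set())


acl = ToyPrivileges()
acl.grant("SELECT", "main.sales.orders", "analysts")
can_read = acl.is_allowed("SELECT", "main.sales.orders", "analysts")

acl.revoke("SELECT", "main.sales.orders", "analysts")
can_read_after_revoke = acl.is_allowed("SELECT", "main.sales.orders", "analysts")
```

Keeping every grant in one place is the point of the sketch: a single store of privileges can answer access checks for any workspace that consults it.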


Metastore

[Diagram: Unity Catalog object hierarchy. A metastore contains storage credentials, external locations, catalogs, shares, and recipients. A catalog contains schemas, and a schema contains tables, views, and functions.]


Catalog



Three-level namespace

Traditional SQL two-level namespace:
SELECT * FROM schema.table

Unity Catalog three-level namespace:
SELECT * FROM catalog.schema.table
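
The relationship between the two naming schemes can be sketched as a small resolver that expands a shorter name into the full catalog.schema.table form. This is an illustrative helper only; the function `resolve_table_name` and the default catalog and schema names are assumptions, not part of any Databricks API.

```python
def resolve_table_name(name, default_catalog="main", default_schema="default"):
    """Expand a one-, two-, or three-part table name to (catalog, schema, table).

    The defaults are illustrative assumptions, not values taken from the slides.
    """
    parts = name.split(".")
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"invalid table name: {name!r}")
    # Fill in the missing leading levels from the defaults.
    defaults = [default_catalog, default_schema]
    missing = 3 - len(parts)
    return tuple(defaults[:missing] + parts)


two_level = resolve_table_name("sales.orders")
three_level = resolve_table_name("prod.sales.orders")
```

A two-level name picks up the default catalog, while a three-level name is already fully qualified and is returned unchanged.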



Schema (Database)



Table



Managed and external tables

[Diagram: the metastore hierarchy, with Table divided into managed tables and external tables]


View



Function



Storage credentials and External locations



Delta Sharing



Metastore data storage

[Diagram: the metastore hierarchy, shown alongside the control plane and cloud storage]


Supported Workloads on the Databricks Lakehouse Platform
Data warehousing


Learning objectives

• Recognize how the Databricks Lakehouse Platform supports data warehousing with Databricks SQL.
• Describe the benefits of data warehousing with the Databricks Lakehouse Platform.


Two disparate, incompatible data platforms



SQL Analytics Data Science / ML Data Sharing



Key benefits of data warehousing with the Databricks Lakehouse Platform
● Best price/performance
● Built-in governance
● A rich ecosystem
● Break down silos


Supported Workloads on the Databricks Lakehouse Platform
Data engineering


Learning objectives

• Explain why data quality is important for data engineering.
• Describe how the Databricks Lakehouse Platform benefits the data engineer.
• Define what Delta Live Tables is.
• Explain how Databricks Workflows support data orchestration.


Data is a valuable business asset.


Challenges for the data engineering workload:
● Complex data ingestion methods
● Support for data engineering principles
● Third-party orchestration tools
● Pipeline and architecture performance tuning
● Inconsistencies between data warehouse and data lake providers


A unified data platform with managed data ingestion; schema detection, enforcement, and evolution; and declarative, auto-scaling data flow, integrated with a lakehouse-native orchestrator that supports all kinds of workflows.


Key capabilities of data engineering on the lakehouse:
● Easy data ingestion
● Automated ETL pipelines
● Data quality checks
● Batch and streaming tuning
● Automatic recovery
● Data pipeline observability
● Simplified operations
● Scheduling and orchestration



Data Engineering on the Lakehouse
Reliable data, analytics and AI

Ingest & Transform: Delta Live Tables | Orchestrate: Databricks Workflows

[Diagram: Sources (Kinesis, CSV, JSON, TXT…, Data Lake) flow through three table layers: BRONZE (raw ingestion and history), SILVER (filtered, cleaned, augmented), and GOLD (business-level aggregates). Outputs: Streaming Analytics, BI & Reporting, Data Science & ML, and Data Sharing. Data Quality & Governance spans the pipeline.]
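
The bronze, silver, and gold layers can be sketched with plain Python lists and dictionaries. This is only an illustration of the layering idea, not Delta Live Tables code; the sample records, the function names `to_silver` and `to_gold`, and the cleaning rules are made up.

```python
from collections import defaultdict

# Bronze: raw records kept exactly as ingested (sample data is made up).
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "2", "region": "eu", "amount": "4.50"},
    {"order_id": "3", "region": "US", "amount": None},     # missing amount
    {"order_id": "4", "region": "US", "amount": "20.00"},
]


def to_silver(rows):
    """Filter out unusable rows and normalize types and casing."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].upper(),
            "amount": float(row["amount"]),
        })
    return cleaned


def to_gold(rows):
    """Aggregate cleaned rows into a business-level total per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Because the bronze records are never modified, the silver and gold layers can be rebuilt from them whenever the cleaning or aggregation rules change.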


Data ingestion

Auto Loader COPY INTO



Data transformation


Databricks Workflows
Orchestrate... ...any task... ...across any platform

[Diagram labels: Clicks, Sessions, Orders, Match, Train, Delta Live Tables, Auto Loader, non-Spark, +more]


Orchestrate anything



Supported Workloads on the Databricks Lakehouse Platform
Data streaming


Learning objectives

• Explain what streaming data is.
• Explain how the Databricks Lakehouse Platform supports data streaming.


New opportunities from real-time data

Every organization generates vast amounts of real-time data:
Transactional records: Point of sale (POS), Banking transactions, Airline reservations, Call center records
Third-party: News feeds, Weather, Market data, Real-time traffic
Interactions: Web clicks, Social posts, Emails, Instant messages
IoT events: Sensors, Geolocation, Machine logs, Mobile devices

Creating opportunities for new kinds of real-time applications:
Fraud detection, Personalized offer, Vaccine distribution, Smart pricing, In-game analytics, Connected cars and smart devices, Predictive maintenance, Content recommendations


Data Streaming Use Cases

Real-Time Analytics
Analyze streaming data for instant insights and faster decisions.

Real-Time Machine Learning
Train models on the freshest data. Score in real-time.

Real-Time Applications
Embed automatic and real-time actions into business applications.
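
As a small illustration of the real-time analytics use case, the sketch below counts events per fixed time window, a common building block for streaming aggregations. It is plain Python over an in-memory list, not Spark Structured Streaming; the function `tumbling_window_counts` and the sample click events are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping time window.

    `events` is an iterable of (timestamp_seconds, payload) pairs.
    Returns {window_start_seconds: count}.
    """
    counts = Counter()
    for timestamp, _payload in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


clicks = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (25, "e")]
per_window = tumbling_window_counts(clicks, window_seconds=10)
```

A streaming engine would update such counts incrementally as events arrive instead of scanning a finished list.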


Industry Specific Use Cases

● Retail
● Industrial automation
● Healthcare
● Financial Institutions
● and many more!



Top 3 differentiating capabilities for data streaming on the lakehouse

1 Build streaming pipelines and applications faster

2 Simplify operations with automation

3 Unified governance for real-time and historical data


Reference architecture for streaming use cases

[Diagram: Data sources (data warehouse, on-premises systems, SaaS applications, machine & application logs, application events, mobile & IoT data) feed object stores (Amazon S3, Azure Data Lake Store, Google Cloud Storage) and a message store (Amazon MSK, Kinesis Data Streams, Azure Event Hubs, GCP Pub/Sub). Lakehouse Platform components: Workflows (end-to-end orchestration), Delta Live Tables (streaming ingestion & transformation), Databricks SQL (real-time analytics), Databricks ML (real-time machine learning), Spark Structured Streaming (real-time applications), Unity Catalog for governance, and Photon. Outputs: analytics applications; machine learning applications (real-time predictive maintenance, real-time personalization, real-time patient alert/diagnostics); real-time applications (real-time alerts, real-time fraud detection, real-time dynamic pricing).]

Supported Workloads on the Databricks Lakehouse Platform
Data science and machine learning


Learning objectives

• Explain the challenges of harnessing machine learning and AI.
• Describe how the Databricks Lakehouse Platform supports the data science and machine learning workload.


Challenges to successful machine learning and AI endeavors
● Siloed and disparate data systems
● Complex experimentation environments
● Getting models to production
● Multiple tools available
● Experiments are hard to track
● Reproducing results is difficult
● ML is hard to deploy


Compute Platform
Any ML workload optimized and accelerated.

Databricks Machine Learning Runtime

● Optimized and preconfigured ML Frameworks
● Turnkey distributed ML
● Built-in AutoML
● GPU support out of the box

Built-in ML Frameworks and model explainability
Built-in support for distributed training
Built-in support for AutoML and hyperparameter tuning
Built-in support for hardware accelerators


Databricks Machine Learning
