Databricks Lakehouse Fundamentals Slide Deck
lakehouse?
History of data management
Data Lake
Pros:
● Flexible data storage
● Streaming support
● Cost efficient in the cloud
● Support for AI and Machine Learning
Structured, Semi-Structured and Unstructured Data
Open
Built on open source and open standards
Multicloud
One consistent data platform across clouds
Photon
Lakehouse Platform
● Compatible with Apache Spark™
[Diagram: Photon Engine on Delta/Parquet; Photon Writer to Delta/Parquet]
● SQL-based jobs
● IoT use cases
● Data privacy and compliance
● Loading data into Delta and Parquet
[Diagram labels: Control plane, Unity Catalog, Data plane, Cluster Manager, Data, DBFS Root, Notebooks, BI Apps, Unallocated pool, Repos, DBSQL]
Databricks
Serverless SQL
Lower cost
Reduce idle time
No over-provisioning
Finance
• Define the terms metastore, catalog, schema, table, view, and function.
• Describe how these terms relate to data management in the Databricks
Lakehouse Platform.
Lakehouse Platform
Unity Catalog
Fine-grained governance for data and AI
Delta Lake
Data reliability and performance
● Automated lineage for all workloads
GRANT … ON … TO …
REVOKE … ON … FROM …
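The GRANT and REVOKE patterns above can be sketched as small string builders. This is a minimal illustration, not Databricks tooling: the function names, the table `main.sales.orders`, and the principal `analysts` are all hypothetical, and the generated SQL is only printed, not executed.

```python
# Hypothetical helpers that fill in the "GRANT … ON … TO …" and
# "REVOKE … ON … FROM …" patterns. They only build SQL text.

def grant(privilege: str, securable_type: str, securable: str, principal: str) -> str:
    """Build a GRANT statement for one privilege on one securable object."""
    return f"GRANT {privilege} ON {securable_type} {securable} TO `{principal}`"


def revoke(privilege: str, securable_type: str, securable: str, principal: str) -> str:
    """Build the matching REVOKE statement."""
    return f"REVOKE {privilege} ON {securable_type} {securable} FROM `{principal}`"


if __name__ == "__main__":
    # Assumed example values; substitute real object and group names.
    print(grant("SELECT", "TABLE", "main.sales.orders", "analysts"))
    print(revoke("SELECT", "TABLE", "main.sales.orders", "analysts"))
```

In a workspace, statements of this shape would typically be run from a SQL editor or notebook by a user who is allowed to manage the object's permissions.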
[Diagram: Unity Catalog object hierarchy]
Metastore (Control Plane)
  Storage Credential, External Location, Catalog, Share, Recipient
Catalog
  Schema
Schema
  Table (Managed table or External table), View, Function
Cloud Storage
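The containment of catalog, schema, and table can be illustrated with a three-part name of the form catalog.schema.table. The sketch below is a plain Python model of that naming, not a Databricks API; `TableRef`, `parse_table_ref`, and the name `main.sales.orders` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """Hypothetical model of a table addressed as catalog.schema.table."""
    catalog: str
    schema: str
    table: str

    def qualified_name(self) -> str:
        # Join the three levels back into one dotted name.
        return f"{self.catalog}.{self.schema}.{self.table}"


def parse_table_ref(name: str) -> TableRef:
    """Split a dotted name into its three levels, rejecting other shapes."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return TableRef(*parts)


if __name__ == "__main__":
    ref = parse_table_ref("main.sales.orders")  # assumed example name
    print(ref.catalog, ref.schema, ref.table)
    print(ref.qualified_name())
```

The sketch deliberately ignores quoting and special characters in names; it only shows how the levels nest.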
[Diagram: medallion architecture]
Sources: Kinesis; CSV, JSON, TXT…; Data Lake
BRONZE: Raw ingestion
SILVER: Filtered, cleaned
GOLD: Business-level
Outputs: Streaming Analytics, BI & Reporting, Data Science & ML
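The bronze, silver, and gold stages can be sketched with plain Python lists standing in for tables. This is an illustration of the layering idea only: it does not use Spark or Delta Lake, the sample records and function names are invented, and treating gold as a per-country total is an assumption about what "business-level" data might look like.

```python
# Invented sample records, standing in for raw files or a stream.
raw_events = [
    {"order_id": "1", "amount": "19.99", "country": "US"},
    {"order_id": "2", "amount": "not-a-number", "country": "US"},  # bad value
    {"order_id": "3", "amount": "5.00", "country": "DE"},
    {"order_id": "1", "amount": "19.99", "country": "US"},  # duplicate
]


def to_bronze(events):
    """Bronze: keep every raw record as it arrived."""
    return [dict(e) for e in events]


def to_silver(bronze):
    """Silver: drop unparseable amounts and duplicate order ids, fix types."""
    seen = set()
    silver = []
    for row in bronze:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # filtered out: amount is not numeric
        if row["order_id"] in seen:
            continue  # filtered out: order already present
        seen.add(row["order_id"])
        silver.append(
            {"order_id": int(row["order_id"]), "amount": amount, "country": row["country"]}
        )
    return silver


def to_gold(silver):
    """Gold (assumed shape): total order amount per country."""
    totals = {}
    for row in silver:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


if __name__ == "__main__":
    bronze = to_bronze(raw_events)
    silver = to_silver(bronze)
    gold = to_gold(silver)
    print(len(bronze), len(silver), gold)
```

Keeping the raw copy in bronze means the silver and gold steps can be rerun with different cleaning rules without re-ingesting the source.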
Auto Loader
Sessions Orders
Match
non-Spark
Train
+more +more
● Analyze streaming data for instant insights and faster decisions.
● Train models on the freshest data. Score in real-time.
● Embed automatic and real-time actions into business applications.
● Retail
● Industrial automation
● Healthcare
● Financial Institutions
● and many more!
Lakehouse Platform
[Diagram: real-time data flow through the Lakehouse Platform]
Sources: Data Warehouse, On-premises Systems, SaaS Applications, Message Store, Machine & Application Logs
Cloud storage: Amazon S3, Azure Data Lake Store, Google Cloud Storage
Workflows: End-to-End Orchestration
Databricks SQL: Real-Time Analytics
Delta Live Tables: Streaming Ingestion & Transformation
Databricks ML: Real-Time Machine Learning
Outputs: Analytics Applications, Machine Learning Applications
Real-time use cases: Predictive Maintenance, Personalization, Patient Alert / Diagnostics
● Built-in ML Frameworks and model explainability
● Built-in support for distributed training
● Built-in support for AutoML and hyperparameter tuning
● Built-in support for hardware accelerators
AutoML