Databricks Unity Catalog - TechSession-Spain Oct. 2022
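[Diagram: without Unity Catalog, the data lake, the data warehouse, and ML models / dashboards each have their own metadata and permissions (permissions on ML models, dashboards, features, … are "yet another governance model"). With Unity Catalog, data engineers, data analysts, and data scientists work against a single metadata layer covering the data lake, the data warehouse, ML models, and dashboards.]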
©2022 Databricks Inc. — All rights reserved
Centralized governance for data and AI
Automated lineage for all workloads
Built-in search and discovery
Delta Sharing
Secure sharing of data assets → Now GA on AWS / Azure!
[Diagram: a data provider on Databricks shares with Data Consumer #1 over a Databricks-managed sharing connection, and with Data Consumer #2 (not on Databricks) over the token-based protocol.]
Features listed on the slide:
✔ Access shared views
✔ Unified data governance
✔ SQL API and UI
✔ Partition filtering
✔ IP access / cloud region restrictions
[Diagram: data domains, each with data products and insights, inside the org perimeter of a company and spread across cloud regions and cloud providers. Databricks-to-Databricks (D2D) Delta Sharing connects domains across cloud regions, cloud providers, and organisations (inter-org). A Delta Sharing Server using the Delta Sharing open protocol serves external domains (Company XYZ) on any platform, any provider, or on-premise.]
Unity Catalog manages all internal governance and access controls, and processes Delta Sharing requests.
Share gold-level aggregations with pre-processed call center data, CDC-ed POS data, web/mobile clickstream aggregates, financial product aggregates and more.
Delta Sharing is an open and vendor-agnostic protocol
The new Model
A quick recap
[Diagram: two Databricks workspaces, each with its own user management, metastore, clusters, and SQL endpoints, on top of cloud storage (S3, ADLS, GCS) containers / buckets.]
Multi-Cloud Lakehouse Principles
Autonomy within workspaces, governance across workspaces
[Diagram labels: Unity Catalog; Delta Share; Gold Data; common infrastructure; orchestration, automation, and optimised runtimes]
Data organization and namespaces
Metastore > Catalog > Schema/DB
[Diagram: a metastore containing hive_metastore, Catalog 1, and Catalog 2]
Centralized Access Controls
Centrally grant and manage access permissions across workloads
Privilege Model
BEST PRACTICE
1 - Leverage the Databricks SQL API and connectors to automate management of UC assets and permissions
2 - Use groups instead of individuals to grant access
3 - Always make groups, not individuals, the owners of Production Catalogs/Schemas
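The three practices above can be sketched as a small script that generates group-based GRANT and ownership statements. This is only an illustration: the helper names (`quote`, `group_grants`, `set_group_owner`), the group names, and the `prod.sales` schema are hypothetical, and the `ALTER ... OWNER TO` form is assumed to be available in the SQL dialect in use. Submitting the statements through a SQL connector or API is left out.

```python
from typing import Iterable, List


def quote(identifier: str) -> str:
    """Backtick-quote a principal name, doubling any embedded backticks."""
    return "`" + identifier.replace("`", "``") + "`"


def group_grants(securable_type: str, securable_name: str,
                 privileges: Iterable[str], groups: Iterable[str]) -> List[str]:
    """Build one GRANT statement per group (never per individual user)."""
    privilege_list = ", ".join(privileges)
    return [
        f"GRANT {privilege_list} ON {securable_type} {securable_name} TO {quote(group)};"
        for group in groups
    ]


def set_group_owner(securable_type: str, securable_name: str, group: str) -> str:
    """Build a statement that hands ownership of a securable to a group.

    Assumption: the target dialect accepts ALTER <type> <name> OWNER TO <principal>.
    """
    return f"ALTER {securable_type} {securable_name} OWNER TO {quote(group)};"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for statement in group_grants("SCHEMA", "prod.sales", ["USAGE"],
                                  ["data_analysts", "data_engineers"]):
        print(statement)
    print(set_group_owner("CATALOG", "prod", "platform_admins"))
```

Generating the statements from a list of groups keeps the permission set reviewable in version control before anything is executed.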
BEST PRACTICE
● Automate your logging pipelines on a dedicated SecOps workspace
● Use Databricks SQL to visualize KPIs, and set up automatic alerts for the key events
[Diagram: Unity Catalog provides user management and the metastore, shared by two Databricks workspaces that each keep their own clusters and SQL warehouses.]
Identity Management
Account: Billing and Usage; Users, Service Principals, Groups
BEST PRACTICE
Set up SCIM on the account level and deactivate it on the UC workspaces
● Unity Catalog enabled is a pre-requisite for Identity Federation.
● Attaching a metastore to a workspace enables Identity Federation. We recommend that customers attach a metastore to all new workspaces.
● Only principals that have been added to the Account level can be assigned to a Workspace. Users and groups should be SCIM synced to the Account level, and assigned to Workspaces.
● For existing workspaces, new principals can be added from the Account level. Pre-existing local workspace groups can still be used. Users and groups should be SCIM synced to the Account level, and we recommend not using Workspace level SCIM.
● For workspaces without a metastore attached, local principals and Workspace level SCIM will still be available for the time being.
Who can do what?
• Account Admin - Can create Metastores and Workspaces, and manage users
• Metastore Admin - Can manage and delegate access to data, and create Catalogs / External Locations
• Workspace Admin - Can create clusters and endpoints, manage ACLs on workspace resources, and manage users and local groups within the workspace
• Catalog/Database/Table Owner - Can assign access to other users
• Account User - Can access a workspace, if assigned
Capabilities Chart
                    Account   Metastore   Workspace   Catalog, DB,   Account
                    Admin     Admin       Admin       TBL Owner      User
Create Metastores   Y         N           N           N              N
Create Catalog      N*        Y           N           N              N
Create Credential   Y         N           N           N              N
Suggested metastore structure
The catalog level of the 3-level namespace allows to structure databases:
● across SDLC environment scopes: dev, staging, prod
● across team sandboxes: team_x_sandbox, team_y_sandbox
Each catalog contains schemas / databases, which contain tables / views.
BEST PRACTICE
Use catalogs to structure schemas & tables per business and SDLC scopes (e.g. dev, prod, sandbox / BU)
Catalog + Schema owned by central team. Usage grants performed by central team:
GRANT USAGE on <catalog>
GRANT USAGE, CREATE on <schema>
Tables owned by team. Grants performed by teams X/Y. Teams X, Y cannot share outside of team.
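As a rough sketch of the suggested structure, the script below generates statements for the SDLC catalogs and one sandbox catalog per team. The catalog names and the USAGE / CREATE privilege names follow the slide and may differ in other Unity Catalog versions; the schema name `scratch`, the function names, and the convention that each team's group is named after the team are assumptions made for illustration.

```python
from typing import List

# Catalog names taken from the slide.
SDLC_CATALOGS = ["dev", "staging", "prod"]


def sandbox_statements(team: str, schema: str = "scratch") -> List[str]:
    """Statements the central team would run for one team sandbox.

    Assumptions: the sandbox catalog is named <team>_sandbox, the team's
    group has the same name as the team, and `scratch` is a hypothetical schema.
    """
    catalog = f"{team}_sandbox"
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog};",
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema};",
        f"GRANT USAGE ON CATALOG {catalog} TO `{team}`;",
        f"GRANT USAGE, CREATE ON SCHEMA {catalog}.{schema} TO `{team}`;",
    ]


def metastore_layout(teams: List[str]) -> List[str]:
    """SDLC catalogs first, then one sandbox catalog per team."""
    statements = [f"CREATE CATALOG IF NOT EXISTS {name};" for name in SDLC_CATALOGS]
    for team in teams:
        statements.extend(sandbox_statements(team))
    return statements


if __name__ == "__main__":
    for statement in metastore_layout(["team_x", "team_y"]):
        print(statement)
```

Because the central team owns the catalogs and schemas, each team only receives USAGE and CREATE, and owns (and grants on) the tables it creates.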
Topology: multi-region / multi-cloud UC
Powered by Delta Sharing
● Metastore boundary = region / cloud (due to latency, cost)
● Use a single metastore per region for all SDLC scopes and business units
● Use Databricks-to-Databricks Delta Sharing between cloud regions and cloud providers
[Diagram: cloud regions 1, 2, and 3 each have one metastore with Dev, Stg, and Prd catalogs, used by Dev, Stg, and Prd workspaces (WS = Workspace).]
What’s the difference between Managed and External tables again?
                            Managed                    External
DROP TABLE                  Deletes data               Does NOT delete data
Data location               Metastore’s default        Custom S3 / ADLS location
                            S3/ADLS location
Performance Optimizations   YES                        NO
Management                  Much simpler               More complex
Best For                    Delta tables               1) R/W to data outside DB
                            (RECOMMENDED)              2) Requirements of data isolation on infra-level
                                                       3) Non-Delta tables
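The "Best For" row of the comparison can be read as a simple decision rule, sketched below. The function and parameter names are hypothetical, and reading "outside DB" as "outside Databricks" is an assumption.

```python
def recommended_table_type(is_delta: bool = True,
                           accessed_outside_databricks: bool = False,
                           needs_infra_level_isolation: bool = False) -> str:
    """Return "MANAGED" or "EXTERNAL" following the "Best For" row.

    External tables are suggested for the three listed cases:
    1) data read or written outside the platform (assumed meaning of "outside DB"),
    2) data isolation requirements at the infrastructure level,
    3) non-Delta tables.
    Everything else defaults to managed tables, the recommended option.
    """
    if accessed_outside_databricks or needs_infra_level_isolation or not is_delta:
        return "EXTERNAL"
    return "MANAGED"


if __name__ == "__main__":
    print(recommended_table_type())                # Delta table, no special needs
    print(recommended_table_type(is_delta=False))  # non-Delta data
```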
BEST PRACTICE
Minimize the number of Credentials and External Locations, e.g.:
● 1 credential / bucket or 1 credential / team
● 1 location / user or 1 location / team
Shared tmp directory (tmp/) for all users:
● GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION tmp TO `team_x`;
● GRANT CREATE TABLE ON EXTERNAL LOCATION tmp TO `team_x`;
Best practices
Clusters
• Use Cluster Policies to enforce usage of Unity-enabled clusters and enable Lineage by default on Clusters and Endpoints
• Set Service Principals as the owners of Production jobs and run these jobs as those Service Principals
• Use Single-User clusters for Job clusters
• If you have dynamic views, you must use User-Isolation clusters
• Use DBR 11.1+ if you would like to use the SYNC command
[Diagram: within a Databricks account, a (Unity) Catalog metastore contains catalogs; each catalog contains schemas (databases); each schema contains managed and external tables. Databricks workspaces are assigned to the metastore.]
Unity Catalog
[Diagram: upgrading an external table from hive_metastore (legacy) to Unity Catalog. Source: hive_metastore.database_to_migrate.external_table in an external cloud storage (external S3 bucket / ADLS), reached by users through legacy cloud storage access (instance profile / service principal). Target: catalog.database_migrated.external_table in the same external cloud storage location.]
1 - Storage Credentials
2 - External Location
3 - CREATE TABLE catalog.database_migrated.external_table
    LIKE hive_metastore.database_to_migrate.external_table
    COPY LOCATION;
GRANT USAGE ON CATALOG catalog TO <user/group>;
GRANT USAGE ON SCHEMA catalog.database_migrated TO <user/group>;
GRANT <privilege> ON TABLE catalog.database_migrated.external_table TO <user/group>;
5 - Update your views and your upstream and downstream processes to use the newly-migrated external table
[Diagram: the upgrade steps above repeated, without the legacy hive_metastore side.]
BEST PRACTICE
● Use the wizards and the SYNC command to streamline the upgrade of External Tables
● Don’t use DBFS mounts anymore
● Remove any direct file access after migration
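For several tables, the CREATE TABLE and GRANT statements shown on the upgrade slides could be generated rather than typed by hand. The sketch below only assembles statement text in the form shown on the slides; the function name, the default `SELECT` privilege, and the example table and group names are assumptions, and the storage credential and external location are assumed to exist already.

```python
from typing import List


def upgrade_statements(table: str, source_schema: str, target_catalog: str,
                       target_schema: str, principal: str,
                       privilege: str = "SELECT") -> List[str]:
    """Build the statements for upgrading one external table.

    The CREATE TABLE ... LIKE ... COPY LOCATION form is copied from the slide.
    """
    source = f"hive_metastore.{source_schema}.{table}"
    target = f"{target_catalog}.{target_schema}.{table}"
    return [
        f"CREATE TABLE {target} LIKE {source} COPY LOCATION;",
        f"GRANT USAGE ON CATALOG {target_catalog} TO `{principal}`;",
        f"GRANT USAGE ON SCHEMA {target_catalog}.{target_schema} TO `{principal}`;",
        f"GRANT {privilege} ON TABLE {target} TO `{principal}`;",
    ]


if __name__ == "__main__":
    # Hypothetical table and group names, for illustration only.
    for name in ["orders", "customers"]:
        for statement in upgrade_statements(name, "database_to_migrate", "catalog",
                                            "database_migrated", "data_analysts"):
            print(statement)
```

Views and upstream or downstream jobs still need to be repointed to the new three-level names afterwards, as the slide notes.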
Solution
Unity Catalog provides automated, real-time data lineage, giving data teams end-to-end visibility into how data flows across the lakehouse
Solution
• Environment-aware ACLs allow additional isolation by restricting access to a catalog to a particular workspace
• Per-catalog storage locations for cost attribution and data isolation per team
• Out-of-the-box system tables
• Audit logs
• Information schema
[Diagram: two workspaces, each with clusters and warehouses]
©2022 Databricks Inc. — Confidential & Subject to Change 45
Auto Tune for Managed Tables
Complete table management for best performance out of the box
Problem
For optimal performance, tables require periodic tuning operations.
Solution
Unity Catalog will auto tune tables.
• For tables under 1TB, no tuning required
• Auto Tune takes care of the data layout
• Run maintenance operations automatically
• New Liquid Partitioning to avoid small-file problems while giving high performance and concurrency
Solution
• Support for cataloging and governing ML feature tables in Unity Catalog.