
Databricks Unity Catalog - TechSession-Spain Oct. 2022

The document discusses Databricks Unity Catalog, which provides unified governance for data and AI assets across various platforms. It outlines key capabilities such as centralized metadata management, fine-grained access controls, and automated data lineage, along with best practices for implementation. Additionally, it highlights new features like secure data sharing and a roadmap for future enhancements.


Unity Catalog: Best Practices & New Features
Unified governance for Lakehouse

Rafael Arana / Sr. Solutions Architect

October 2022

©2022 Databricks Inc. — All rights reserved 1


Agenda
■ Overview of Unity Catalog
■ The new model - a quick recap
■ Getting started - Guidelines & Best Practices
■ Upgrading to Unity
■ New Features & Roadmap



Governance for data and AI is complex

Personas: data analyst, data engineer, data scientist

● Data lake: permissions on files. No row- and column-level permissions. Inflexible when policies change
● Metadata: permissions on tables and views. Can be out of sync with data
● Data warehouse: permissions on tables, columns, rows. Different governance model
● ML models, dashboards: permissions on ML models, dashboards, features, … Yet another governance model



Databricks Unity Catalog
Unified governance for all data and AI assets

[Diagram: Unity Catalog sits between the personas (data analyst, data engineer, data scientist) and the assets (data lake, metadata, data warehouse, ML models and dashboards)]



Databricks Unity Catalog
Unified governance for all data and AI assets

● Centralized governance for data and AI
● Built-in data search and discovery
● Performance and scale
● Automated lineage for all workloads
● Integrated with your existing tools

[Diagram: Lakehouse Platform stack, top to bottom]
Data Warehousing | Data Engineering | Data Streaming | Data Science and ML
Unity Catalog: fine-grained governance for data and AI
Delta Lake: data reliability and performance
Cloud Data Lake: all structured and unstructured data



Unity Catalog - Key Capabilities

● Centralized metadata and user management (GA)
● Centralized data access controls (GA)
● Data lineage (Public Preview)
● Data access auditing (GA)
● Data search and discovery (Public Preview)
● Secure data sharing with Delta Sharing (GA)

[Diagram: one Unity Catalog shared by several Databricks workspaces]

GRANT … ON … TO …
REVOKE … ON … FROM …

Securables: Catalogs, Databases (schemas), Tables, Views, Storage credentials, External locations

Centralized governance for data and AI

● Create a single source of truth for your data with centralized metadata and user management
● Centrally manage access permissions and audit controls for files and tables across all workspaces and workloads using open-standard ANSI SQL
● Enable fine-grained access controls on tables, files, rows, and columns

Automated lineage for all workloads

● Get end-to-end visibility into how data flows in your organization, from source to consumption, with built-in data lineage
● View lineage across tables, columns, notebooks, workflows, and dashboards
● Captured in real time across all workloads in SQL, Python, Scala, and R
● Export API to enable partner integrations

Built-in search and discovery

● Quickly find, understand, and reference data from across your data estate with a unified data browsing experience
● Secure by default: leverages the common permission model from Unity Catalog

Delta Sharing
Secure sharing of data assets → Now GA on AWS / Azure!

Data provider on Databricks
✔ Unified data governance
✔ SQL API and UI
✔ Partition filtering
✔ IP access / cloud region restrictions

Data consumer #1, on Databricks (Databricks-managed sharing connection)
✔ Access shared views
✔ Unified data governance
✔ SQL API and UI

Data consumer #2, not on Databricks (token-based protocol)



Open Data Sharing to Accelerate Use Case Development
Secure access across organisational and regional boundaries

[Diagram: multi-cloud deployment (today) over the Delta Sharing protocol. (1) Inside the org perimeter, data domains in different cloud regions and on different cloud providers exchange data products through Databricks-to-Databricks (D2D) Delta Sharing. (2) Across the perimeter, data products are shared with external domains (company XYZ, any platform, any provider, on-premise) through a Delta Sharing server using the open protocol.]

● Unity Catalog manages all internal governance and access controls, and processes Delta Sharing requests
● Delta Sharing, Databricks-to-Databricks (inter-org): share gold-level aggregations with pre-processed call center data, CDC-ed POS data, web/mobile clickstream aggregates, financial product aggregates and more
● Delta Sharing is an open and vendor-agnostic protocol

The new Model
A quick recap



Unity Catalog - Architecture

[Diagram: architecture components]
Account level: Metastore, User Mgmt, Audit Log
Unity Catalog: Lineage Explorer, Data Explorer, Access Control, ACL Store, Storage Credentials
Cloud Storage (S3, ADLS, GCS): container / bucket
Databricks Workspace: User


Centralized Metadata and User Management
Create a unified view of your data estate

Without Unity Catalog: each Databricks workspace (Workspace 1, Workspace 2) has its own user management and its own metastore, alongside its clusters and SQL endpoints.

With Unity Catalog: user management and the metastore live in Unity Catalog and are shared by all Databricks workspaces; each workspace keeps its own clusters and SQL endpoints.

Multi-Cloud Lakehouse Principles
Autonomy within workspaces, governance across workspaces

Lakehouse architecture (top to bottom):
● Consumption: BI reports & dashboards, real-time apps, ML/AI models, data sharing
● Collaborative workspaces: Workspace 1, Workspace 2, … (clusters / endpoints)
● Common infrastructure: orchestration, automation, and optimised runtimes
● Global governance: Unity Catalog (data governance and access controls)
● Data quality & integrity: Delta Lake (data reliability and performance)
● Scalable storage & compute: Cloud Data Lake (all structured and unstructured data)

Organize multi-cloud workspaces with UC to enable sharing across data domains:
● Workspace as the data domain boundary
● Unity Catalog manages governance, access and discovery
● Data domains publish gold data as data products (Delta Share), with federated metadata and governance


Identity Federation with Unity Catalog

[Diagram: identity federation flow]
Identity Provider: Group 1, Group 2, Group 3, Group 4
→ (selective) sync
Unity Catalog account console (account-level user management): Group 1, Group 2, Group 4
→ assignment to workspaces
Databricks Workspaces 1, 2 and 3: each workspace is assigned a subset of the account-level groups

Data organization and namespaces
Metastore > Catalog > Schema/DB

Without Unity Catalog: hive_metastore > Database 1, Database 2 > external tables, managed tables, views

With Unity Catalog: regional metastore > Catalog 1, Catalog 2 > Database 1, Database 2 > external tables, managed tables, views

Centralized Access Controls
Centrally grant and manage access permissions across workloads

Using ANSI SQL DCL

GRANT <privilege> ON <securable_type> <securable_name> TO `<principal>`

GRANT SELECT ON iot.events TO engineers

(‘Table’ = collection of files in S3/ADLS)

Using UI

● Choose permission level
● Sync groups from your identity provider

Privilege Model

BEST PRACTICE
1 - Leverage the Databricks SQL API and connectors to automate management of UC assets and permissions
2 - Use groups instead of individuals to grant access
3 - Always make groups, not individuals, the owners of production catalogs/schemas

• Securable objects in Unity Catalog are hierarchical and privileges are inherited downward
• Example: enable the ML team to create tables within a sandbox schema and read each other’s tables:

CREATE CATALOG ml;

CREATE SCHEMA ml.team_sandbox;

GRANT USE CATALOG ON CATALOG ml TO `ml_team`;

GRANT USE SCHEMA ON SCHEMA ml.team_sandbox TO `ml_team`;

GRANT CREATE TABLE ON SCHEMA ml.team_sandbox TO `ml_team`;

GRANT SELECT ON SCHEMA ml.team_sandbox TO `ml_team`;

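The inheritance rule on this slide can be illustrated with a small Python model. This is a simplified sketch, not Unity Catalog's implementation: the `GRANTS` dictionary, the `ancestors` and `has_privilege` helpers and the table name `ml.team_sandbox.features` are hypothetical, and the model ignores ownership and the catalog/schema usage prerequisites.

```python
# Simplified model of downward privilege inheritance (illustrative only).
# Keys are securable names in catalog[.schema[.table]] form; values map a
# principal to the set of privileges granted directly on that securable.
GRANTS = {
    "ml.team_sandbox": {"ml_team": {"CREATE TABLE", "SELECT"}},
}


def ancestors(securable):
    """Yield the securable itself, then each parent up to the catalog."""
    parts = securable.split(".")
    for depth in range(len(parts), 0, -1):
        yield ".".join(parts[:depth])


def has_privilege(grants, principal, privilege, securable):
    """True if the privilege is granted on the securable or any ancestor."""
    return any(
        privilege in grants.get(name, {}).get(principal, set())
        for name in ancestors(securable)
    )


# SELECT granted on the schema is inherited by a table inside it.
print(has_privilege(GRANTS, "ml_team", "SELECT", "ml.team_sandbox.features"))
# Nothing was granted to this principal anywhere in the hierarchy.
print(has_privilege(GRANTS, "analysts", "SELECT", "ml.team_sandbox.features"))
```

Because the lookup only walks upward, a grant on the schema never leaks to the parent catalog or to sibling schemas.
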


Audit logs
• Unity Catalog captures an audit log of actions performed, and these logs are delivered as part of Databricks audit logs.
• How to: enable audit logs at the account level (AWS, GCP) or at the workspace level (Azure)

→ Incorporate your audit logs into your wider logging ecosystem (cloud provider, identity provider, …)

BEST PRACTICE
● Automate your logging pipelines on a dedicated SecOps workspace
● Use Databricks SQL to visualize KPIs, and set up automatic alerts for key events

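As a rough illustration of the alerting idea, the sketch below counts Unity Catalog events per action from audit records that have already been parsed into dictionaries. The `serviceName` and `actionName` field names follow the audit log query shown on the System Table Enhancements slide of this deck; the sample records, the `count_actions` and `actions_over_threshold` helpers and the threshold are made up for the example. In practice this aggregation would more likely run as a Databricks SQL query.

```python
from collections import Counter

# Hypothetical, already-parsed audit records (real logs carry many more fields).
records = [
    {"serviceName": "unityCatalog", "actionName": "getTable"},
    {"serviceName": "unityCatalog", "actionName": "getTable"},
    {"serviceName": "unityCatalog", "actionName": "updatePermissions"},
    {"serviceName": "clusters", "actionName": "create"},
]


def count_actions(records, service="unityCatalog"):
    """Count events per action name for a single service."""
    return Counter(
        r["actionName"] for r in records if r.get("serviceName") == service
    )


def actions_over_threshold(counts, threshold):
    """Return the actions whose event count reaches the alert threshold."""
    return sorted(action for action, n in counts.items() if n >= threshold)


counts = count_actions(records)
print(actions_over_threshold(counts, threshold=2))
```
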


Unity Catalog UI
Demo



Getting Started
Guidelines &
Best practices



User Management

[Diagram: Unity Catalog (user management, metastore) shared by Databricks workspaces, each with its own clusters and SQL warehouses; this section covers user management]

Identity Management

BEST PRACTICE
Set up SCIM at the account level and deactivate it on the UC workspaces

[Diagram: the account holds billing and usage, account administrators, the metastore, and users, service principals and groups (Group 1, Group 2, Group 3). Account-level groups are assigned to new and existing workspaces; existing workspaces may also retain local groups (Group 0).]

● Enabling Unity Catalog is a prerequisite for Identity Federation
● Attaching a metastore to a workspace enables Identity Federation. We recommend that customers attach a metastore to all new workspaces
● Only principals that have been added at the account level can be assigned to a workspace. Users and groups should be SCIM-synced to the account level and assigned to workspaces
● For existing workspaces, new principals can be added from the account level. Pre-existing local workspace groups can still be used. Users and groups should be SCIM-synced to the account level, and we recommend not using workspace-level SCIM
● For workspaces without a metastore attached, local principals and workspace-level SCIM will still be available for the time being

Who can do what?
• Account Admin - Create metastores and workspaces, manage users
• Metastore Admin
  • Manage and delegate access to data
  • Create catalogs / external locations
• Workspace Admin - Create clusters and endpoints; manage ACLs on workspace resources, and users and (local) groups within the workspace
• Catalog/Database/Table Owner - Assign access to other users
• Account User - Access a workspace, if assigned
Capabilities Chart

Capability                                             Account  Metastore  Workspace  Catalog/DB/    Account
                                                       Admin    Admin      Admin      Table Owner    User
Create Metastores                                      Y        N          N          N              N
Manage Users and Groups, Assign Groups to Workspaces   Y        N          N          N              N
Create Workspaces, Assign Metastores to Workspace      Y        N          N          N              N
Create Clusters, Workflows, Delegate Access to Compute N*       N          Y          N              N
Create Catalog                                         N*       Y          N          N              N
Create Credential                                      Y        N          N          N              N
Create External Location                               N*       Y          N          N              N
Delegate Access to Data (Can Manage)                   N*       Y          N          Y              N
Access Workspaces and Data                             N*       N          Y          Y              Y

* Can gain access by assigning oneself to roles


Identity Onboarding Steps

• All UC workspaces use Identity Federation


• Identify Account Administrator (Azure)
• Enable SSO at the account console (OIDC/SAML)
• Workspace SSO is still required
• Identify Business Groups for SCIM
• Enable SCIM for the Account Console
• Set up service principals for workflows (SPN/MI/Profiles)
• Assign users and groups to workspaces
• Existing relationships will be maintained
• Test federation with a small number of dev/test workspaces first
Metastores

[Diagram: Unity Catalog (user management, metastore) shared by Databricks workspaces, each with its own clusters and SQL warehouses; this section covers the metastore]

Suggested metastore structure
The catalog level of the 3-level namespace allows you to structure databases and tables / views according to technical or business needs.

Unity Metastore > catalog > schemas (databases) > tables / views

● Catalogs across SDLC environment scopes: dev, staging, prod
● Catalogs across BUs: bu_dev, bu_staging, bu_prod
● Catalogs across team sandboxes: team_x_sandbox, team_y_sandbox

BEST PRACTICE
Use catalogs to structure schemas & tables per business and SDLC scopes (e.g. dev, prod, sandbox / BU)

● Catalog + schema owned by central team. Usage grants performed by central team:
  GRANT USAGE on <catalog>
  GRANT USAGE, CREATE on <schema>
● Tables owned by team. Grants performed by teams X/Y. Teams X, Y cannot share outside of team

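A layout like this lends itself to scripting. The following sketch generates the catalog DDL and the usage grant for a set of scopes; the `catalog_statements` helper, the `layout` mapping and the group names are assumptions for illustration, and the `GRANT USAGE` form follows the wording on this slide, so the privilege names should be checked against the current Unity Catalog documentation before the output is run.

```python
def catalog_statements(catalog, group):
    """Build the SQL a central team might run to set up one catalog."""
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog};",
        f"GRANT USAGE ON CATALOG {catalog} TO `{group}`;",
    ]


# Hypothetical mapping of catalogs (SDLC scopes and sandboxes) to groups.
layout = {
    "dev": "developers",
    "staging": "developers",
    "prod": "prod_readers",
    "team_x_sandbox": "team_x",
}

for catalog, group in layout.items():
    for statement in catalog_statements(catalog, group):
        print(statement)
```

Keeping the mapping in one place makes it easy to review who can use which catalog before any statement is executed.
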
Topology: multi-region / multi-cloud UC
Powered by Delta Sharing
● Metastore boundary = region / cloud (due to latency, cost)
● Use a single regional metastore for all SDLC scopes and business units
● Use Databricks-to-Databricks Delta Sharing between cloud regions and cloud providers

[Diagram: cloud regions 1, 2 and 3 each have one metastore holding Dev, Stg and Prd catalogs, used by that region's Dev, Stg and Prd workspaces (WS = workspace)]

What’s the difference between Managed and External tables again?

                            Managed (RECOMMENDED)     External
DROP TABLE                  Deletes data              Does NOT delete data
Data location               Metastore’s default       Custom S3 / ADLS location
                            S3/ADLS location
Performance optimizations   YES                       NO
Management                  Much simpler              More complex
Best for                    Delta tables              1) R/W to data outside DB
                                                      2) Requirements of data isolation on infra-level
                                                      3) Non-Delta tables



Metastore, external locations and credentials
[Diagram: on the catalog side, the metastore contains catalogs, schemas (databases), managed tables, external tables and directories*. On the storage side, managed tables live in the container/bucket for managed tables, while external tables and external files live in containers/buckets reached through external locations, each backed by a storage credential for cloud storage (S3, ADLS, GCS).]

Example (permissions ignored for simplicity):

CREATE STORAGE CREDENTIAL finance_cred …

CREATE EXTERNAL LOCATION finance
URL 's3://depts/finance'
WITH (STORAGE CREDENTIAL finance_cred);

CREATE EXTERNAL TABLE forecast
LOCATION 's3://depts/finance/forecast';

* CREATE DIRECTORY eu_invoices
LOCATION 's3://depts/finance/eu/invoices';

* coming soon

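To make the relationship between paths, external locations and credentials concrete, the sketch below resolves a cloud storage path to the external location that covers it. It is an illustrative model only: the `EXTERNAL_LOCATIONS` registry and `resolve_location` are hypothetical, and the `marketing` location and its credential are invented for the example.

```python
# Hypothetical registry: external location name -> (URL prefix, credential).
EXTERNAL_LOCATIONS = {
    "finance": ("s3://depts/finance", "finance_cred"),
    "marketing": ("s3://depts/marketing", "marketing_cred"),
}


def resolve_location(path, locations=EXTERNAL_LOCATIONS):
    """Return (location name, credential) covering the path, or None."""
    for name, (prefix, credential) in locations.items():
        # Match whole path segments so "finance" does not cover "finance2".
        if path == prefix or path.startswith(prefix + "/"):
            return name, credential
    return None


print(resolve_location("s3://depts/finance/forecast"))
print(resolve_location("s3://depts/hr/payroll"))
```

A path that no external location covers resolves to `None`, which mirrors the idea that access to it is not governed through this mechanism.
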


External Locations
Demo



Suggested external location structure
How to store and secure external data
Multiple External Locations declared against one storage bucket reuse the same Storage Credential.

[Diagram: bucket root (/) containing users/ (a physical intermediate folder with user1/ and user2/), shared/, tables/ (one directory per table, e.g. table1/) and tmp/; each governed directory is a Databricks External Location]

Personal directory for each user. Only the user has access via UC:
● CREATE EXTERNAL LOCATION user1loc URL 'abfss://[email protected]/users/user1' WITH (CREDENTIAL team_x_cred);
● GRANT READ FILES, WRITE FILES, CREATE TABLE ON EXTERNAL LOCATION user1loc TO `[email protected]`;

External tables of the same catalog can go in the same bucket (and use the same Storage Credential).

Shared tmp directory for all users:
● GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION tmp TO `team_x`;
● GRANT CREATE TABLE ON EXTERNAL LOCATION tmp TO `team_x`;

BEST PRACTICE
Minimize the number of Credentials and External Locations, e.g.:
● 1 credential / bucket or 1 credential / team
● 1 location / user or 1 location / team



Workloads

[Diagram: Unity Catalog (user management, metastore) shared by Databricks workspaces, each with its own clusters and SQL warehouses; this section covers workloads]

Best practices
Clusters
• Use cluster policies to enforce usage of Unity-enabled clusters and enable lineage by default on clusters and endpoints
• Set service principals as the owners of production jobs and run these jobs as those service principals
• Use Single-User clusters for job clusters
• If you have dynamic views, you must use User-Isolation clusters
• Use DBR 11.1+ if you would like to use the SYNC command

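A cluster policy that restricts clusters to Unity Catalog access modes might look like the sketch below, built as a Python dictionary and serialized to JSON. It assumes the `data_security_mode` policy attribute with the values `SINGLE_USER` and `USER_ISOLATION`; treat the attribute name, the values and the default as assumptions to verify against the cluster policy documentation for your Databricks release.

```python
import json

# Sketch of a policy definition; attribute names are assumptions to verify.
policy = {
    # Only allow access modes that work with Unity Catalog.
    "data_security_mode": {
        "type": "allowlist",
        "values": ["SINGLE_USER", "USER_ISOLATION"],
        "defaultValue": "USER_ISOLATION",
    },
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```
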


Upgrading to Unity Catalog



Hive Metastore is integrated into UC
Create a unified view of your data estate
BEST PRACTICE
Upgrade your legacy hive_metastore tables and views into Unity Catalog

[Diagram: a Databricks account has a (Unity) metastore assigned to its Databricks workspaces. The metastore contains Unity Catalog catalogs (schema (database) > managed tables, external tables, views) and, for each workspace, the associated Hive metastore (schema (database) > managed tables, external tables, views).]

SELECT * FROM catalog1.database1.table1;

SELECT * FROM hive_metastore.database2.table2;



Upgrading Hive tables to Unity
External tables
Starting point: hive_metastore (legacy) contains database_to_migrate.external_table, stored in an external cloud storage location (external S3 bucket / ADLS) that users reach through legacy cloud storage access (instance profile / service principal).

1. Create Storage Credentials
2. Create an External Location
3. Create the table in Unity Catalog, pointing at the same external cloud storage location:
   CREATE TABLE catalog.database_migrated.external_table
   LIKE hive_metastore.database_to_migrate.external_table
   COPY LOCATION;
4. Grant users access to the new external table:
   GRANT USAGE ON CATALOG catalog TO <user/group>;
   GRANT USAGE ON SCHEMA catalog.database_migrated TO <user/group>;
   GRANT <privilege> ON TABLE catalog.database_migrated.external_table TO <user/group>;
5. Update your views and your upstream and downstream processes to use the newly migrated external table



Upgrading Hive tables to Unity
External tables

BEST PRACTICE
● Use the wizards and the SYNC command to streamline the upgrade of external tables
● Don’t use DBFS mounts anymore
● Remove any direct file access after migration

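For more than a handful of tables, the statements in steps 3 and 4 can be generated rather than typed. The sketch below assumes the `CREATE TABLE … LIKE … COPY LOCATION` and `GRANT` forms exactly as written on this slide; `upgrade_statements`, the `analysts` group and the default `SELECT` privilege are hypothetical choices for the example.

```python
def upgrade_statements(catalog, src_db, dst_db, table, principal, privilege="SELECT"):
    """Build the statements that upgrade one external table and grant access."""
    target = f"{catalog}.{dst_db}.{table}"
    source = f"hive_metastore.{src_db}.{table}"
    return [
        f"CREATE TABLE {target} LIKE {source} COPY LOCATION;",
        f"GRANT USAGE ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USAGE ON SCHEMA {catalog}.{dst_db} TO `{principal}`;",
        f"GRANT {privilege} ON TABLE {target} TO `{principal}`;",
    ]


for statement in upgrade_statements(
    "catalog", "database_to_migrate", "database_migrated", "external_table", "analysts"
):
    print(statement)
```
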


Upgrading Hive tables to Unity
Managed tables & Views
BEST PRACTICE
Clean up DBFS after upgrading your tables to Unity.

Managed tables:
• Legacy managed tables whose files reside entirely within DBFS can only be upgraded by recreating them with CLONE or CTAS commands, which rewrite the data
• Legacy managed tables stored outside of DBFS must either go through the same CLONE or CTAS commands or become external tables if you wish to avoid rewriting the data
Views:
• Views should be upgraded last, once the underlying tables have all been upgraded to Unity Catalog



External tables Upgrade Wizard
& SYNC command
Demo



New Features &
Roadmap



Data Lineage
Automated lineage for all workloads
Problem
It is difficult for organizations to understand how data flows and is consumed across the organization, and to ensure data quality

Solution
Unity Catalog provides automated, real-time data lineage, giving data teams end-to-end visibility into how data flows across the lakehouse

What's coming next?
• Support for files and Delta-to-Delta streaming
• System tables for lineage
• Improved graph visualization



Data Access Controls
Centralized governance for Data and AI
Problem
Large organizations require fail-safe isolation mechanisms to protect sensitive data and allow line-of-business separation

Solution
• Environment-aware ACLs allow additional isolation by restricting access to a catalog to a particular workspace
• Per-catalog storage locations for cost attribution and data isolation per team
• Out-of-the-box system tables
  • Audit logs
  • Information schema

[Diagram: a Unity Catalog metastore with Catalog 1, Catalog 2 and Catalog 3, accessed by two workspaces, each with its own clusters and warehouses]
©2022 Databricks Inc. — Confidential & Subject to Change 45
Auto Tune for Managed Tables
Complete table management for best performance out of the box

Problem
For optimal performance, tables require periodic tuning operations.

Solution
Unity Catalog will auto tune tables.
• For tables under 1TB, no tuning required
• Auto Tune takes care of the data layout
• Run maintenance operations automatically
• New Liquid Partitioning to avoid small-file problems while still giving high performance and concurrency



Governing ML Workloads
Centralized governance for Data and AI
Problem
ML assets such as models and feature tables represent an aggregated form of data, yet they are typically governed separately from other data assets, adding complexity and risk for organizations.

Solution
• Support for cataloging and governing ML feature tables in Unity Catalog.
• Support for ML Runtime clusters that allow multiple users and any language, running under a service principal’s identity. Eliminates the need for single-user clusters.



Row Level Filters & Column UDFs
Changing the “view” of governance on the lakehouse.

CREATE FUNCTION hide_ssn(ssn STRING)
RETURN IF(is_member('admin'), ssn, '***');

ALTER TABLE sales SET MASK hide_ssn ON (ssn);

CREATE FUNCTION us_filter(region STRING, ssn STRING)
RETURN IF(is_member('admin'), true, region = 'US');

CREATE TABLE sales (region STRING) USING delta
WITH ROW FILTER us_filter ON (region)
COMMENT 'Access policy for Sales';

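The intended semantics of the two functions can be mimicked in plain Python. This is only a behavioural sketch of the roadmap feature shown on this slide, not Databricks code: `is_member` is replaced by a hypothetical set of group names passed in by the caller, `read_sales` is an invented helper, and the sample rows are made up.

```python
def hide_ssn(ssn, groups):
    """Column mask: admins see the value, everyone else sees a placeholder."""
    return ssn if "admin" in groups else "***"


def us_filter(region, groups):
    """Row filter: admins see every row, everyone else only US rows."""
    return "admin" in groups or region == "US"


def read_sales(rows, groups):
    """Apply the row filter first, then mask the ssn column."""
    return [
        {"region": r["region"], "ssn": hide_ssn(r["ssn"], groups)}
        for r in rows
        if us_filter(r["region"], groups)
    ]


rows = [
    {"region": "US", "ssn": "111-11-1111"},
    {"region": "EU", "ssn": "222-22-2222"},
]
print(read_sales(rows, groups={"analysts"}))
print(read_sales(rows, groups={"admin"}))
```

The same query therefore returns different results depending on who runs it, without maintaining a separate view per audience.
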


Attribute Based Access Control
Govern based on the semantic value of tags

CREATE TAG finance

ALTER TABLE discounts ADD TAG finance

ALTER SCHEMA sales ADD TAG finance

GRANT SELECT ON TAG finance TO finance_team

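Conceptually, a grant on a tag applies to every object that carries the tag. The Python sketch below models that idea; it is not the Unity Catalog implementation, and `TAGS`, `TAG_GRANTS`, `can_select` and the untagged `hr.salaries` table are hypothetical.

```python
# Tags attached to securables, mirroring the ALTER ... ADD TAG statements.
TAGS = {
    "discounts": {"finance"},
    "sales": {"finance"},
    "hr.salaries": set(),
}

# Privileges granted on tags, mirroring GRANT SELECT ON TAG finance.
TAG_GRANTS = {("finance", "SELECT"): {"finance_team"}}


def can_select(principal, securable):
    """True if any tag on the securable grants SELECT to the principal."""
    return any(
        principal in TAG_GRANTS.get((tag, "SELECT"), set())
        for tag in TAGS.get(securable, set())
    )


print(can_select("finance_team", "discounts"))
print(can_select("finance_team", "hr.salaries"))
```

Tagging a new table with `finance` would extend access to it without issuing another grant, which is the appeal of governing by attribute.
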


System Table Enhancements: Audit Logs
High Leverage Centralized Governance

● Audit logs will be available as tables within the “System” catalog
● Significant improvement over exhaust out → capture → ETL back into Databricks
● Immediately actionable - show me everything a user accessed in the last 24 hours:

SELECT * FROM system.audit_logs.audit_logs
WHERE user = "[email protected]"
AND serviceName = "unityCatalog"
AND actionName = "generateTemporaryTableCredential"
AND datediff(now(), timestamp) < 1
