Data Governance On Unity Catalog - Jul 2024
Unity Catalog
for data and AI
©2024 Databricks Inc. — All rights reserved
Agenda
▪ Introduction
▪ Unity Catalog overview
▪ Key concepts and capabilities
▪ Centralised metadata and access controls
▪ Query federation & Volumes
▪ Discover your data with search and lineage
▪ Audit your data
▪ Open collaboration
▪ This presentation is being recorded; we will share the recording and other
materials within 48 hours after the session
▪ There is no hands-on component so you only need to take notes
▪ Use the Q&A function to ask questions
▪ If we do not answer your question during the event, we will follow up with
you afterwards to get you the information you need
▪ Please fill out the survey at the end of the session so that we can improve
our future sessions
“Organizations seeing the highest returns from AI, have a framework for
AI governance to cover every step of the model development process”
—
The State of AI in 2022, McKinsey & Co
[Diagram]
ML engineer, ML Models: “Are we meeting the regulatory compliance?”
Permissions on reports, dashboards
Applications, BI dashboards
Key dimensions of enterprise data governance
Access Controls, Discovery, Lineage, Quality, Classification, Monitoring, Auditing, Data Sharing
Open Interfaces
Attribute-based Access Controls: central policies across all data + AI assets
Safeguard data across data sources: PostgreSQL, Amazon Redshift, Google BigQuery
Bring external catalogs for Apache Hive
[Diagram labels: Users, Dashboards, ML Models; polars, pandas, ray; Iceberg, Hudi ecosystem; Google Cloud]
Any asset (Data + AI assets): Tables (Managed, External), Views, Volumes, Functions, ML Models, Vector DBs
Asset categories shown: Tables, Objects, AI / ML
Any format: Delta, Parquet, UniForm, Iceberg, Hudi, JSON, CSV, Image, Audio, PDF
UC OSS interoperability
[Logo diagram: Unity connected to the Delta ecosystem, Iceberg REST catalog APIs, HMS clients, and Delta Sharing]
Engines and tools shown: Apache Spark, Flink, Hive, Impala, Presto, PrestoDB, Trino, StarRocks, Doris, Dremio, Clickhouse, Daft, DuckDB, Delta-rs, DataFusion, ballista, Pandas, Polars, Ray, Python, Rust, Athena, Glue, EMR, BigQuery, Snowflake, Salesforce, Pulsar, PuppyGraph, Startree (Pinot), Carto
Clouds shown: Amazon Web Services, Google Cloud, Microsoft Azure
Data Science & AI: Mosaic AI. Create, tune, and serve custom LLMs
ETL & Real-time Analytics: Delta Live Tables. Automated data quality
Orchestration: Workflows. Job cost optimized based on past runs
Data Warehousing: Databricks SQL. Text-to-SQL, Text-to-Viz
Use generative AI to understand the semantics of your data
Key concepts
Centralised metadata and access controls
Query federation & Volumes
Discover your data with search and lineage
Audit your data
Open collaboration
● Metastore
○ A high-level entity, one per region, that holds multiple catalogs
● Storage credentials
○ Cloud provider credential to connect to storage
● External Locations
○ Storage location used for external tables, external volumes, or
arbitrary files, or default managed location for a catalog or
schema
● Connections
○ Credential and connection information to connect to an […]
and can be queried alongside managed data sources and file sources
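The objects above can be created with SQL DDL. A minimal sketch, assuming a storage credential named my_storage_cred already exists and that the connection targets an external PostgreSQL database; the bucket, host, secret scope, and every object name here are hypothetical:

```sql
-- Hypothetical names throughout; my_storage_cred is assumed to exist already.
CREATE EXTERNAL LOCATION IF NOT EXISTS raw_landing
  URL 's3://example-bucket/landing'
  WITH (STORAGE CREDENTIAL my_storage_cred)
  COMMENT 'Landing zone for external tables and volumes';

-- Connection holding the credentials for an external PostgreSQL database.
-- The secret scope and keys are placeholders.
CREATE CONNECTION IF NOT EXISTS pg_conn TYPE postgresql
OPTIONS (
  host 'pg.example.com',
  port '5432',
  user secret('example_scope', 'pg_user'),
  password secret('example_scope', 'pg_password')
);

-- Foreign catalog that exposes the remote database for federated queries.
CREATE FOREIGN CATALOG IF NOT EXISTS pg_sales
  USING CONNECTION pg_conn
  OPTIONS (database 'sales');
```

Once the foreign catalog exists, its tables can be queried with the same catalog.schema.table naming as managed data.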
[Diagram: Databricks Workspace components, shown twice: User Management, Metastore, Access Controls, Clusters, SQL Warehouses]
Catalogs: hive_metastore (legacy), Catalog 1, Foreign Catalog
Schemas: default (database), Schema 1, Foreign Schema
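The catalog and schema levels give every table a three-part name, and privileges are granted at each level. A short sketch with hypothetical catalog, schema, table, and group names:

```sql
-- Three-level namespace: catalog.schema.table (all names are hypothetical).
SELECT * FROM catalog_1.schema_1.orders LIMIT 10;

-- Legacy Hive metastore tables are addressed through the hive_metastore catalog.
SELECT * FROM hive_metastore.default.orders_legacy LIMIT 10;

-- A group needs usage on the parent catalog and schema before it can read a table.
GRANT USE CATALOG ON CATALOG catalog_1 TO `analysts`;
GRANT USE SCHEMA ON SCHEMA catalog_1.schema_1 TO `analysts`;
GRANT SELECT ON TABLE catalog_1.schema_1.orders TO `analysts`;
```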
Row filter: test for group membership, specify filter predicates, assign reusable filter to table
ALTER TABLE sales SET ROW FILTER us_filter ON (region);
Column mask: test for group membership, specify mask or function to mask, assign reusable mask to column
ALTER TABLE users ALTER COLUMN table_ssn SET MASK ssn_mask;
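The filter and mask named in these statements are ordinary SQL functions. A minimal sketch of what us_filter and ssn_mask might look like; the function bodies and the group names 'admin' and 'hr' are assumptions for illustration, not taken from the slides:

```sql
-- Row filter: members of the (hypothetical) 'admin' group see every row,
-- everyone else sees only rows where region = 'US'.
CREATE OR REPLACE FUNCTION us_filter(region STRING)
RETURN IF(is_account_group_member('admin'), true, region = 'US');

ALTER TABLE sales SET ROW FILTER us_filter ON (region);

-- Column mask: members of the (hypothetical) 'hr' group see the real value,
-- everyone else sees a fixed placeholder.
CREATE OR REPLACE FUNCTION ssn_mask(ssn STRING)
RETURN CASE WHEN is_account_group_member('hr') THEN ssn ELSE '***-**-****' END;

ALTER TABLE users ALTER COLUMN table_ssn SET MASK ssn_mask;
```

Because the functions are separate objects, the same filter or mask can be attached to several tables or columns.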
[Diagram: attribute-based masking rule]
1. Juan (wants to govern data) creates a rule: SET Mask UDF ON Column WHEN has_tag('phone')
4. Masking rule automatically applied
Scope: Catalog ⇨ business_unit, Schema ⇨ sales_prod, Data ⇨ txn_data
Result: full_name = Todd G, cell_phone = 321-123-****
Access to data and availability of data can be isolated across workspaces and groups
Metastore
staging ⇨ staging_ws ⇨ Testers
prod ⇨ prod_ws ⇨ Analysts
bu_1_dev ⇨ bu_dev_stg_ws ⇨ BU Developers
bu_1_staging ⇨ bu_dev_stg_ws ⇨ BU Testers
- Unstructured data, e.g., image, audio, video, or PDF files, used for ML
- Semi-structured training, validation, test data sets, used in ML model
training
- Raw data files used for ad-hoc or early stage data exploration, or saved
outputs
- Library or config files used across workspaces
- Tables are registered in Managed / External Locations, not in Volumes
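A minimal sketch of working with Volumes for the file types listed above. The catalog, schema, volume, bucket, and group names are hypothetical, and the external volume assumes its path is covered by an existing external location:

```sql
-- Managed volume: Unity Catalog decides where the files are stored.
CREATE VOLUME IF NOT EXISTS main.ml_assets.training_images
  COMMENT 'Images used for model training';

-- External volume: files stay at a path covered by an existing external location.
CREATE EXTERNAL VOLUME IF NOT EXISTS main.ml_assets.raw_files
  LOCATION 's3://example-bucket/landing/raw_files';

-- Files in a volume are addressed as /Volumes/<catalog>/<schema>/<volume>/<path>.
LIST '/Volumes/main/ml_assets/raw_files/';

-- Explore raw files in place before deciding whether to load them into a table.
SELECT *
FROM read_files('/Volumes/main/ml_assets/raw_files/events/', format => 'json')
LIMIT 10;

-- Access to the files is governed like any other securable.
GRANT READ VOLUME ON VOLUME main.ml_assets.raw_files TO `data_scientists`;
```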
Defining file-based data sources in Unity
Simplify data access management across clouds
[Diagram: Unity Catalog (External Locations & Credentials, Access Control) in front of Cloud Storage (S3, ADLS, GCS)]
[Diagram steps shown: 6. Return data from Cloud Storage (S3, ADLS); 7. Enforce policies]
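Access to file-based sources is granted on the external location rather than on the cloud storage itself. A sketch, assuming an external location named raw_landing already exists; the location, group, table, and bucket names are hypothetical:

```sql
-- Let a group read files and register external tables under the location.
GRANT READ FILES ON EXTERNAL LOCATION raw_landing TO `data_engineers`;
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION raw_landing TO `data_engineers`;

-- Register an external table over existing Delta data inside that location.
-- The path is assumed to already contain a Delta table.
CREATE TABLE IF NOT EXISTS main.bronze.events
  USING DELTA
  LOCATION 's3://example-bucket/landing/events';

-- Review who has which privileges on the location.
SHOW GRANTS ON EXTERNAL LOCATION raw_landing;
```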
Key concepts and capabilities
Key concepts
Centralised metadata and access controls
Query federation & Volumes
What tables are in the sales catalog?
SELECT table_name
FROM system.information_schema.tables
WHERE table_catalog = 'sales'
  AND table_schema != 'information_schema';

Who last updated the gold tables and when?
SELECT table_name, last_altered_by, last_altered
FROM system.information_schema.tables
WHERE table_schema = 'churn_gold'
ORDER BY 1, 3 DESC;
Who accesses this table the most?
SELECT user_identity.email, count(*)
FROM system.operational_data.audit_logs
WHERE request_params.table_full_name = 'main.uc_deep_dive.login_data_silver'
  AND service_name = 'unityCatalog'
  AND action_name = 'generateTemporaryTableCredential'
GROUP BY 1 ORDER BY 2 DESC LIMIT 1;

What has this user accessed in the last 24 hours?
SELECT request_params.table_full_name
FROM system.operational_data.audit_logs
WHERE user_identity.email = '[email protected]'
  AND service_name = 'unityCatalog'
  AND action_name = 'generateTemporaryTableCredential'
  AND created_at >= now() - INTERVAL 24 HOURS;
Who deleted this table?
SELECT user_identity.email
FROM system.operational_data.audit_logs
WHERE request_params.full_name_arg = 'main.uc_deep_dive.login_data_silver'
  AND service_name = 'unityCatalog'
  AND action_name = 'deleteTable';

What tables does this user access most frequently?
SELECT request_params.table_full_name, count(*)
FROM system.operational_data.audit_logs
WHERE user_identity.email = '[email protected]'
  AND service_name = 'unityCatalog'
  AND action_name = 'generateTemporaryTableCredential'
GROUP BY 1 ORDER BY 2 DESC LIMIT 1;
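The audit queries on these slides use a system.operational_data.audit_logs table. A hedged sketch of the same "who accesses this table the most" question, assuming your workspace exposes audit logs as system.access.audit with event_date and request_params columns; check the table and column names available in your account before relying on it:

```sql
-- Assumption: audit logs are available as system.access.audit.
-- Top readers of one table over the last 7 days.
SELECT user_identity.email AS user_email,
       count(*)            AS accesses
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name = 'generateTemporaryTableCredential'
  AND request_params['table_full_name'] = 'main.uc_deep_dive.login_data_silver'
  AND event_date >= date_sub(current_date(), 7)
GROUP BY user_identity.email
ORDER BY accesses DESC
LIMIT 10;
```

Filtering on the date column first keeps the scan small, which matters because audit tables grow quickly.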
What is the daily trend in DBU consumption?
SELECT date(created_on) as `Date`, sum(dbus) as `DBUs Consumed`
FROM system.operational_data.billing_logs
GROUP BY date(created_on)
ORDER BY date(created_on) ASC;

Which 10 users consumed the most DBUs?
SELECT tags.creator as `User`, sum(dbus) as `DBUs`
FROM system.operational_data.billing_logs
GROUP BY tags.creator
ORDER BY `DBUs` DESC
LIMIT 10;
How many DBUs of each SKU have been used so far this month?
SELECT sku as `SKU`, sum(dbus) as `DBUs`
FROM system.operational_data.billing_logs
WHERE month(created_on) = month(CURRENT_DATE)
  AND year(created_on) = year(CURRENT_DATE)
GROUP BY sku
ORDER BY `DBUs` DESC;

Which Jobs consumed the most DBUs?
SELECT tags.JobId as `Job ID`, sum(dbus) as `DBUs`
FROM system.operational_data.billing_logs
GROUP BY `Job ID`
ORDER BY `DBUs` DESC;
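The billing queries on these slides read from system.operational_data.billing_logs. A hedged sketch of the month-to-date SKU question, assuming billing data is exposed as system.billing.usage with usage_date, sku_name, usage_quantity, and usage_unit columns; verify the names in your account:

```sql
-- Assumption: billing data is available as system.billing.usage.
-- DBUs per SKU per day since the first day of the current month.
SELECT usage_date,
       sku_name,
       sum(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_unit = 'DBU'
  AND usage_date >= trunc(current_date(), 'MM')
GROUP BY usage_date, sku_name
ORDER BY usage_date, dbus DESC;
```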
What tables are sourced from this table?
SELECT DISTINCT target_table_full_name
FROM system.access.table_lineage
WHERE source_table_name = 'login_data_bronze';

What user queries read from this table?
SELECT DISTINCT entity_type, entity_id, source_table_full_name
FROM system.access.table_lineage
WHERE source_table_name = 'login_data_silver';
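Lineage can also be read in the other direction, from a table back to its sources. A sketch of an upstream lookup, assuming the table_lineage system table also carries event_date and created_by columns; the 30-day window is an arbitrary choice:

```sql
-- Which tables fed login_data_silver in the last 30 days, and through what kind of entity?
SELECT DISTINCT source_table_full_name,
                entity_type,
                created_by
FROM system.access.table_lineage
WHERE target_table_full_name = 'main.uc_deep_dive.login_data_silver'
  AND source_table_full_name IS NOT NULL
  AND event_date >= date_sub(current_date(), 30);
```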
Open collaboration
Databricks Marketplace: public marketplace, private exchanges; Dashboards, Solution Accelerators
Data Clean Room
Collaborator 1 (e.g. Publisher): Hashed_user_id, age, income, ad_id, imp, clicks
Collaborator 2 (e.g. Advertiser): Hashed_user_id, conversion_event
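The clean room example comes down to a join on the shared hashed identifier that returns only aggregates. A sketch of such an analysis, using the column names from the slide; the table names publisher.impressions and advertiser.conversions are hypothetical, and how tables are exposed inside a clean room depends on its setup:

```sql
-- Collapse conversions to one row per user first, so the join cannot multiply impression rows.
WITH conversions_per_user AS (
  SELECT hashed_user_id,
         count(conversion_event) AS conversions
  FROM advertiser.conversions
  GROUP BY hashed_user_id
)
SELECT p.ad_id,
       sum(p.imp)                      AS impressions,
       sum(p.clicks)                   AS clicks,
       count(DISTINCT c.hashed_user_id) AS converting_users
FROM publisher.impressions AS p
LEFT JOIN conversions_per_user AS c
  ON p.hashed_user_id = c.hashed_user_id
GROUP BY p.ad_id
ORDER BY impressions DESC;
```

Only per-ad totals leave the query, so neither collaborator sees the other's row-level data.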
One Time Setup
● Create metastore
● Identity federation
● Join workspaces
Upgrade Legacy Metadata
● SYNC external tables/schemas
● Migrate managed tables, files
Upgrade Workloads
● Create clusters
● Create jobs
● Update notebooks
● Downstream tools
Phases: PLANNING/SETUP, then DATA / WORKLOAD UPGRADE
Upgrading Hive tables to Unity
Managed & External tables - use SYNC command
• Run multiple times to pull changes from the Hive/Glue database into Unity over time
• Use a job for long-term synchronization
• Use the DRY RUN option to test the sync without making any changes to the target table
• Works on Hive managed tables where schema locations are defined
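The bullets above translate into a few SYNC statements. A minimal sketch, assuming a target catalog named main and a Hive schema named sales; the schema, table, and owner names are hypothetical:

```sql
-- Preview what would be upgraded without changing the target.
SYNC SCHEMA main.sales FROM hive_metastore.sales DRY RUN;

-- Upgrade every eligible table in the schema; rerun later to pick up new changes.
SYNC SCHEMA main.sales FROM hive_metastore.sales;

-- Upgrade a single table and hand ownership to a group.
SYNC TABLE main.sales.orders FROM hive_metastore.sales.orders
  SET OWNER `data_engineers`;
```

Each statement returns a per-table status, so a scheduled job can check the result and alert on tables that failed to sync.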
Capabilities
❏ Assessment
❏ Group Upgrade
❏ Table Upgrade
❏ ACLs
❏ Workflows / Notebook Upgrade (Soon)