Data Governance On Unity Catalog - Jul 2024


Governance with

Unity Catalog
for data and AI

©2024 Databricks Inc. — All rights reserved


Gopala Raju

• ~3 years with Databricks


• Many years of experience in cloud,
data and analytics

Governance Product Specialist - APJ

Agenda
▪ Introduction
▪ Unity Catalog overview
▪ Key concepts and capabilities
▪ Centralised metadata and access controls
▪ Query federation & Volumes
▪ Discover your data with search and lineage
▪ Audit your data
▪ Open collaboration

▪ Upgrading to Unity Catalog


▪ Demo



Housekeeping

▪ This presentation is being recorded, we will share the recording and other
materials after the session, within 48 hours
▪ There is no hands-on component so you only need to take notes
▪ Use the Q&A function to ask questions
▪ If we do not answer your question during the event, we will follow-up with
you afterwards to get you the information you need
▪ Please fill out the survey at the end of the session so that we can improve
our future sessions



Data and AI governance
drives business value
“Organizations are finally realizing the value of data as an asset that needs
to be protected, managed and maintained to increase asset value”

IDC

“Organizations seeing the highest returns from AI, have a framework for
AI governance to cover every step of the model development process”

The State of AI in 2022, McKinsey & Co

“AI is now an enterprise essential, and as such, AI governance will join cybersecurity and compliance as a board-level topic”

Forrester, 2023 AI Predictions report



Today, data and AI governance is complex
Data Consumers (data analyst, data engineer, ML engineer) ask:
● “Where to discover the datasets, models, notebooks, dashboards?”
● “Can I trust the data and ML models?”
Data Governance Team asks:
● “How to secure these assets?”
● “Who is accessing these assets and how?”
● “Are we meeting the regulatory compliance?”
Each asset type carries its own permission model:
● Data lake: permissions on files
● Data warehouse: permissions on tables, rows and columns
● ML models: permissions on ML models, features
● Applications and BI dashboards: permissions on reports, dashboards


Today, data and AI governance is complex
The questions raised by data consumers (“Where are the datasets, models, notebooks, dashboards that I need?”) and by the data governance team point to four challenges, each with a business cost:
● Fragmented view of the data and AI estate: reduced pace of innovation
● Disjoint tools for access management: increased data breach risk, operational expenses
● Incomplete monitoring and observability: non-compliance risk, reputational harm
● Lack of cross-platform data sharing: costly data sharing, untapped monetization


Governance is not just about protection

Key dimensions of
enterprise data governance



Databricks Unity Catalog
Unified governance for data and AI
Unified visibility and single permission model for data and AI
AI-powered monitoring and observability
Discovery based on metadata
Open source & Open interfaces
Open data sharing

Databricks Unity Catalog

Capabilities: Discovery, Access Controls, Lineage, Data Sharing, Quality, Classification, Monitoring, Auditing
Governed assets: Tables, Files, Models, Notebooks, Dashboards


Databricks Lakehouse unifies
data and AI governance
External Compute Platforms

BI & Data Warehousing, Data Engineering, Data Streaming, Data Science & ML

Open Interfaces

Databricks Unity Catalog


One governance model for structured and unstructured data + AI

Cloud Data Lake External Catalogs External Data Sources


All structured, semi-structured, and unstructured data



Unified visibility into data and AI
■ Discover and classify structured and unstructured data, files, notebooks, ML models, and dashboards in one place
■ Consolidate and query data from other databases and data warehouses using a single point of access, without moving or copying the data
■ Build better understanding of your data estate with automated lineage, tags and auto-generated data insights
■ Boost productivity by searching, understanding and gaining insights from your data and AI assets, using natural language
Attribute-based Access Controls (Available Soon)
■ Central policies across all data + AI assets
■ Use tags, location, identity, and time attributes
■ UI and SQL interfaces


Lakehouse Federation
Discover, query, and govern all your data - no matter where it lives
■ Build a unified view of your data estate
■ Safeguard data across data sources
■ Efficient execution and caching
[Diagram: users and dashboards on one side; PostgreSQL, Amazon Redshift, Snowflake and Google BigQuery as federated sources]


Metastore Federation (Available Soon)
Bring external catalogs to Unity Catalog
■ Bring external catalogs for seamless transition
■ Overlay access controls
[Diagram: users and dashboards on one side; Apache Hive and AWS Glue as external catalogs]


Open data sharing
■ Avoid vendor lock-in with open source Delta Sharing for seamless data sharing across clouds, regions, and platforms, without replication
■ Share more than just data - Notebooks, ML models, dashboards, applications
■ Explore and monetize data products through an open marketplace
■ Collaborate securely on sensitive data with scalable data clean rooms


OPEN DATA LAKEHOUSE
Goal: any customer asset from Databricks should be accessible from any external engine
External engines shown: polars, pandas, ray
● Tables: Delta, Iceberg, Hudi, Views
● Objects: Volumes
● AI / ML: Functions, ML Models, Vector DBs
Unity Catalog: The industry’s only universal catalog for Data and AI
● Any engine (client ecosystem): major cloud platforms, data + AI platforms, compute engines (logos shown include Microsoft Fabric, LlamaIndex, Google Cloud)
● Any client (universal standard): UNITY CATALOG (OSS) with the Unity REST API, Iceberg REST catalog API and Delta Sharing, under unified governance
● Any asset (data + AI assets): Tables (Managed, External, Views), Objects (Volumes), AI / ML (Functions, ML Models, Vector DBs)
● Any format: Delta, UniForm, Iceberg, Hudi, Parquet, JSON, CSV, Image, Audio, PDF
UC OSS interoperability
[Logo grid in three groups: DELTA ECOSYSTEM, ICEBERG ECOSYSTEM (via the Iceberg REST catalog APIs) and UNITY CATALOG PARTNERS. Names shown include Amazon Web Services, Microsoft Azure, Google Cloud, Snowflake, Salesforce, Apache Spark, Trino, Presto, PrestoDB, Flink, Hive, Impala, Dremio, Doris, Clickhouse, StarRocks, Startree (pinot), DuckDB, DataFusion, Delta-RS, Daft, Dask, Pandas, Polars, Ray, Pulsar, Airbyte, Arrow, Athena, Ballista, EMR, Glue, PuppyGraph, Python and Rust]


Most open and interoperable catalog for data and AI
[Logo grid around Unity Catalog in five groups: DELTA ECOSYSTEM, HMS CLIENTS, ICEBERG ECOSYSTEM, UNITY CATALOG partners and DELTA SHARING. Names shown include Amazon Web Services, Google Cloud, Microsoft Azure, Salesforce, Snowflake, LangChain, LlamaIndex, LanceDB, Unstructured, PuppyGraph, Tecton, dbt Labs, Informatica, Immuta, Confluent, Granica, Fivetran, Apache XTable, Carto, Apache Spark, Trino, Presto, Dremio, Impala, Flink, Hive, Clickhouse, Doris, StarRocks, DuckDB, Daft, Athena, EMR, Glue, BigQuery, Azure Synapse, Delta-rs, pandas, polars, ray and dask]


However, it is not just a catalog … in Databricks


Unity Catalog Powers Databricks Data Intelligence Platform

Workloads
● Mosaic AI (Data Science & AI): create, tune, and serve custom LLMs
● Delta Live Tables (ETL & Real-time Analytics): automated data quality
● Workflows (Orchestration): job cost optimized based on past runs
● Databricks SQL (Data Warehousing): Text-to-SQL, Text-to-Viz

Platform layers, top to bottom
● Data Intelligence Engine: use generative AI to understand the semantics of your data
● Unity Catalog: unified security, governance, and cataloging; data discovery using natural language
● Delta Lake: unified data storage for reliability and sharing; data layout is automatically optimized based on usage patterns
● Open Data Lake: all raw data (Logs, Texts, Audio, Video, Images)
Unlock the full Databricks experience
Unity Catalog is central to the Data Intelligence Platform

● Lakehouse Monitoring: end-to-end monitoring of data and ML models for quality and drift
● Databricks Assistant: a context-aware AI assistant that integrates throughout the platform to improve productivity via a conversational interface
● Lakehouse IQ: a knowledge engine that learns the unique nuances of your business and data to power natural language access
● AI Accelerated Performance: automated organization of data, performance improvements, predictive IO and serverless


Key concepts and capabilities

Key concepts
Centralised metadata and access controls
Query federation & Volumes
Discover your data with search and lineage
Audit your data
Open collaboration



Key Concepts

Working with file based data sources
● Metastore
○ A high-level entity, one per region, that holds multiple catalogs
● Catalogs
○ Entities to organize multiple schemas together
● Credentials
○ Cloud provider credential to connect to storage
● External Locations
○ Storage location used for external tables, external volumes, or arbitrary files, or default managed location for a catalog or schema
● Managed / External Tables
○ Tabular data stored in managed or external locations
● Managed / External Volumes
○ Arbitrary file container inside a managed or external location

Working with databases
● Connections
○ Credential and connection information to connect to an external database
● Foreign Catalogs
○ A catalog that represents an external database in UC and can be queried alongside managed data sources and file sources


Centralized metadata and controls
One metadata layer across file and database sources superpowers governance

Without Unity Catalog
● Each Databricks workspace (Workspace 1, Workspace 2) has its own user management, metastore and access controls, alongside its clusters and SQL warehouses

With Unity Catalog
● Databricks Unity Catalog centralizes user management, the metastore and access controls, including foreign databases
● Databricks workspaces keep only compute: clusters and SQL warehouses


Governed namespace across file and database sources
Access legacy metastore and foreign databases powered by Lakehouse Federation

Unity Catalog
● hive_metastore (legacy)
  ○ default (database)
    - customers (table)
● Catalog 1
  ○ Schema 1
    - Managed / External Tables, Views, Volumes, Models / Functions
● Foreign Catalog
  ○ Foreign Schema
    - Foreign Table

SELECT * FROM main.paul.red_wine; -- <catalog>.<database>.<table>

SELECT * FROM hive_metastore.default.customers;

SELECT * FROM snowflake_warehouse.some_schema.some_table;
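
The three queries above all follow the same `<catalog>.<schema>.<table>` pattern. A minimal Python sketch of that naming rule follows. `TableRef`, `parse_table_name` and `is_legacy` are hypothetical helpers, not part of any Databricks library, and the sketch assumes plain identifiers without backtick-quoted dots.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """One fully qualified table name in the three-level namespace."""
    catalog: str
    schema: str
    table: str


def parse_table_name(full_name: str) -> TableRef:
    """Split '<catalog>.<schema>.<table>' into its three parts."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <catalog>.<schema>.<table>, got {full_name!r}")
    return TableRef(*parts)


def is_legacy(ref: TableRef) -> bool:
    """True when the table is addressed through the legacy hive_metastore catalog."""
    return ref.catalog == "hive_metastore"
```

A check like `is_legacy` could be used, for instance, to list which references in a notebook still point at the legacy metastore before an upgrade.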



Centralized Access Controls
Centrally grant and manage access permissions across workloads and foreign databases

Using ANSI SQL DCL (or using the UI)

GRANT <privilege> ON <securable_type> <securable_name> TO `<principal>`

GRANT SELECT ON iot.events TO engineers

● Choose permission level
● ‘Table’ = collection of files in S3/ADLS
● Sync groups from your identity provider
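
When grants are generated by a script rather than typed by hand, the GRANT template above can be filled in programmatically. The sketch below is one way to do that in Python. `quote_ident` and `build_grant` are hypothetical helpers, and the keyword check is a deliberately simple assumption (uppercase words only), not a full validation of the privilege model.

```python
import re

# Privileges and securable types are SQL keywords such as SELECT or TABLE.
_KEYWORDS = re.compile(r"[A-Z][A-Z_ ]*")


def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_grant(privilege: str, securable_type: str,
                securable_name: str, principal: str) -> str:
    """Render GRANT <privilege> ON <securable_type> <securable_name> TO `<principal>`."""
    for keyword in (privilege, securable_type):
        if not _KEYWORDS.fullmatch(keyword):
            raise ValueError(f"unexpected keyword: {keyword!r}")
    qualified = ".".join(quote_ident(part) for part in securable_name.split("."))
    return f"GRANT {privilege} ON {securable_type} {qualified} TO {quote_ident(principal)}"
```

For example, `build_grant("SELECT", "TABLE", "iot.events", "engineers")` renders the slide's example with every identifier quoted.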



Row Level Security and Column Level Masking
Provide differentiated, fine-grained access to file-based datasets and foreign tables

Only show specific rows

CREATE FUNCTION <name> (<parameter_name> <parameter_type>, ...)
RETURN {filter clause whose output must be a boolean}

CREATE FUNCTION us_filter(region STRING)
RETURN IF(IS_MEMBER('admin'), true, region = 'US');

ALTER TABLE sales SET ROW FILTER us_filter ON (region);

● Test for group membership
● Assign reusable filter to table
● Specify filter predicates

Mask or redact sensitive columns

CREATE FUNCTION <name> (<parameter_name> <parameter_type> [, <column> ...])
RETURN {expression with the same type as the first parameter}

CREATE FUNCTION ssn_mask(ssn STRING)
RETURN IF(IS_MEMBER('admin'), ssn, '****');

ALTER TABLE users ALTER COLUMN table_ssn SET MASK ssn_mask;

● Test for group membership
● Assign reusable mask to column
● Specify mask or function to mask
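
To make the behaviour of a row filter and a column mask concrete, the following Python sketch mimics what a reader would see. It is a plain simulation, not Databricks code: `is_admin` stands in for `IS_MEMBER('admin')`, and for brevity it assumes a single hypothetical table that has both a `region` and an `ssn` column.

```python
def us_filter(region: str, is_admin: bool) -> bool:
    """Row filter: admins see every row, everyone else only US rows."""
    return is_admin or region == "US"


def ssn_mask(ssn: str, is_admin: bool) -> str:
    """Column mask: admins see the real value, everyone else a redacted one."""
    return ssn if is_admin else "****"


def read_rows(rows: list[dict], is_admin: bool) -> list[dict]:
    """Apply the filter first, then mask the ssn column of the surviving rows."""
    return [
        {**row, "ssn": ssn_mask(row["ssn"], is_admin)}
        for row in rows
        if us_filter(row["region"], is_admin)
    ]
```

The point of the design is that the same query returns different results per caller, so no separate "secure view" has to be maintained for each audience.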



Scalable governance with ABAC (Available Soon)

Objects in the example: Unity Catalog Metastore ⇨ Catalog business_unit ⇨ Schema sales_prod ⇨ Data txn_data ⇶ columns full_name, cell_phone

1 Juan (wants to govern data) creates a rule: SET Mask UDF ON Column WHEN has_tag(‘phone’)
2 Malik (wants to produce data) applies tags: cell_phone is tagged phone
3 Sarita (wants to query data) reads data
4 Masking rule automatically applied
5 Sarita can only see the masked value: full_name Todd G, cell_phone 321-123-****
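
The tag-driven flow on this slide can be illustrated with a small Python simulation. This is only a sketch of the idea (a rule keyed on a tag, applied to whichever columns carry that tag) and not the ABAC feature's actual interface, which the slide marks as not yet available. The function names and the unmasked phone number are hypothetical.

```python
def mask_phone(value: str) -> str:
    """Hide the last four characters, as in the slide's 321-123-**** example."""
    return value[:-4] + "****"


def apply_tag_rules(row: dict, column_tags: dict, rules: dict) -> dict:
    """Return a copy of row where every column carrying a ruled tag is masked."""
    masked = {}
    for column, value in row.items():
        tags = column_tags.get(column, set())
        for tag, mask in rules.items():
            if tag in tags:
                value = mask(value)
        masked[column] = value
    return masked


# Step 1: the governor defines one rule per tag.
RULES = {"phone": mask_phone}
# Step 2: the producer tags columns; no per-table policy is written.
COLUMN_TAGS = {"cell_phone": {"phone"}}
```

Because the rule refers to the tag rather than to a table, any newly tagged column is covered without a new policy.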
Catalog binding
Restrict catalog access by environment or purpose

Access to data and availability of data can be isolated across workspaces and groups within one metastore:

Catalogs      Workspaces      Groups
dev           dev_ws          Developers
staging       staging_ws      Testers
prod          prod_ws         Analysts
bu_1_dev      bu_dev_stg_ws   BU Developers
bu_1_staging  bu_dev_stg_ws   BU Testers
bu_1_prod     bu_prod_ws      BU Users
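
The isolation described on this slide combines two checks: the catalog must be bound to the workspace the user is working in, and the user's group must hold a grant on the catalog. A minimal Python sketch of that logic follows. The dictionaries reflect one reading of the slide's diagram and are assumptions, as is the helper name `can_access`.

```python
# Which workspaces each catalog is bound to (assumed from the diagram).
BINDINGS = {
    "dev": {"dev_ws"},
    "staging": {"staging_ws"},
    "prod": {"prod_ws"},
    "bu_1_dev": {"bu_dev_stg_ws"},
    "bu_1_staging": {"bu_dev_stg_ws"},
    "bu_1_prod": {"bu_prod_ws"},
}

# Which groups hold a grant on each catalog (assumed from the diagram).
GRANTS = {
    "dev": {"Developers"},
    "staging": {"Testers"},
    "prod": {"Analysts"},
    "bu_1_dev": {"BU Developers"},
    "bu_1_staging": {"BU Testers"},
    "bu_1_prod": {"BU Users"},
}


def can_access(catalog: str, workspace: str, group: str) -> bool:
    """Access needs both a workspace binding and a group grant."""
    bound = workspace in BINDINGS.get(catalog, set())
    granted = group in GRANTS.get(catalog, set())
    return bound and granted
```

Under these assumptions an analyst with a grant on `prod` still cannot read it from `dev_ws`, which is the environment isolation the slide describes.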



High Leverage Governance with Terraform & APIs
Use data-sec-ops, policies as code patterns to scale your efforts

• Privileges for UC objects can be managed programmatically using our Terraform provider, especially for teams already using Terraform
• This will pair naturally with the management of the UC objects (Metastore, Catalog, Assignments etc.) themselves.
(If not already using Terraform, maybe now is a good time!)

resource "databricks_grants" "sandbox" {
  provider = databricks.workspace
  catalog  = databricks_catalog.sandbox.name
  grant {
    principal  = "Data Scientists"
    privileges = ["USAGE", "CREATE"]
  }
  grant {
    principal  = "Data Engineers"
    privileges = ["USAGE"]
  }
}


Key concepts and capabilities
Key concepts
Centralised metadata and access controls

Query federation & Volumes


Discover your data with search and lineage
Audit your data
Open collaboration



Query Federation
Unify your entire data estate with lakehouse

Query Federation provides one single point of secure access to all your data - no matter where it lives - and one way to access, catalog, govern, and query all your data - no ingestion required.

● Unified permission controls
● Intelligent pushdown optimizations
● Accelerated query performance with Materialized Views
● Support for read-only operations today

CREATE FOREIGN CATALOG <catalog_name>
USING CONNECTION <connection_name>
OPTIONS (database '<remote_database>')

SELECT * FROM <catalog_name>.<schema_name>.<table_name>


Volumes in Unity Catalog
Access, store, organize and process files with Unity Catalog governance
- Volumes can be accessed by some POSIX commands

dbutils.fs.ls("s3://my_external_location/Volumes/catalog/schema/volume123")
ls /Volumes/catalog/schema/volume123

- Volumes are created under Managed or External Locations and show up in UC Lineage
- Volumes add governance over non-tabular data sets
  - Unstructured data, e.g., image, audio, video, or PDF files, used for ML
  - Semi-structured training, validation, test data sets, used in ML model training
  - Raw data files used for ad-hoc or early stage data exploration, or saved outputs
  - Library or config files used across workspaces
  - Operational data, e.g., logging or checkpointing output files
- Tables are registered in Managed / External Locations, not in Volumes

[Diagram: Cloud Storage (S3, ADLS, GCS) contains a Managed / External Location, which contains Volumes (holding data) and a Table]
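
Volume paths follow the `/Volumes/<catalog>/<schema>/<volume>` layout shown in the `ls` example. The Python sketch below builds such a path and rejects inputs that would escape the volume. `volume_path` is a hypothetical helper; it only manipulates strings and does not touch storage.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, *relative: str) -> str:
    """Build /Volumes/<catalog>/<schema>/<volume>[/<relative>...]."""
    for part in (catalog, schema, volume):
        if not part or "/" in part:
            raise ValueError(f"invalid name: {part!r}")
    for part in relative:
        # An absolute segment or '..' would point outside the volume.
        if part.startswith("/") or ".." in PurePosixPath(part).parts:
            raise ValueError(f"invalid relative path: {part!r}")
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *relative))
```

For instance, `volume_path("catalog", "schema", "volume123")` gives the path used in the `ls` example, and extra arguments address files inside the volume.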
Defining file based data sources in Unity
Simplify data access management across clouds

● A user works through a cluster or SQL warehouse
● Unity Catalog applies access control, using external locations & credentials to reach cloud storage (S3, ADLS, GCS)
● Where the data lives:
○ Managed tables: managed location on schema/catalog
○ External tables: path in external location
○ External volumes: path in external location


Querying file based data sources with Unity

Admin (one-time setup)
● Creates an IAM role (AWS) / Managed Identity (Azure) / Service Account (GCP)
● Creates storage credentials / external locations in Unity Catalog
● Defines access policies in Unity Catalog

Query flow
1 User sends a query (SQL, Python, R, Scala) to a cluster or SQL warehouse
2 Unity Catalog checks namespace, metadata and grants, and writes to the audit log
3 Unity Catalog assumes the IAM Role / Managed Identity / Service Account
4 Unity Catalog returns the list of paths/data files and scoped down temporary tokens
5 The cluster or SQL warehouse requests/ingests data from the paths/data files with the temporary tokens
6 Cloud storage (S3, ADLS) returns the data
7 The cluster or SQL warehouse enforces policies
8 The result is sent to the user
Key concepts and capabilities
Key concepts
Centralised metadata and access controls
Query federation & Volumes

Discover your data with search and lineage


Audit your data
Open collaboration



Why is data lineage important?

Compliance
● Regulatory requirements to verify data lineage
● Track the spread of sensitive data across datasets

Discovery
● Understand context and trustworthiness of data before using it in analytics
● Prevent duplicative work and data

Observability
● Track down issues / discrepancies in reports by tracing back the data
● Analyze impact of proposed changes to downstream reports e.g. column deprecation


Automated lineage for all workloads
End-to-end visibility into how data flows and is consumed in your organization

● Auto-capture runtime data lineage on a Databricks cluster or SQL warehouse
● Leverage common permission model from Unity Catalog
● Lineage across tables, columns, dashboards, workflows, notebooks, files, external sources, and models


Built-in search and discovery
Accelerate time to value with low latency data discovery

● Unified UI to search for data assets stored in Unity Catalog
● Leverage common permission model from Unity Catalog
● Tag Column, Table, Schema, Catalog objects in UC
● Search for objects on tags

Recommendation: Use comments and tag your data assets on ingest
Key concepts and capabilities
Key concepts
Centralised metadata and access controls
Query federation & Volumes
Discover your data with search and lineage

Audit your data


Open collaboration



System Tables: Object Metadata
Answer questions about the state of objects in the catalog

What tables are in the sales catalog?
SELECT table_name
FROM system.information_schema.tables
WHERE table_catalog = "sales"
AND table_schema != "information_schema";

Who last updated the gold tables and when?
SELECT table_name, last_altered_by, last_altered
FROM system.information_schema.tables
WHERE table_schema = "churn_gold"
ORDER BY 1, 3 DESC;

Who owns this gold table?
SELECT table_owner
FROM system.information_schema.tables
WHERE table_catalog = "retail_prod" AND table_schema = "churn_gold" AND table_name = "churn_features";

Who has access to this table?
SELECT grantee, table_name, privilege_type
FROM system.information_schema.table_privileges
WHERE table_name = "login_data_silver";


System Tables: Audit Logs
See who accessed what, and when, in near-real time

Who accesses this table the most?
SELECT user_identity.email, count(*)
FROM system.operational_data.audit_logs
WHERE request_params.table_full_name = "main.uc_deep_dive.login_data_silver"
AND service_name = "unityCatalog"
AND action_name = "generateTemporaryTableCredential"
GROUP BY 1 ORDER BY 2 DESC LIMIT 1;

What has this user accessed in the last 24 hours?
SELECT request_params.table_full_name
FROM system.operational_data.audit_logs
WHERE user_identity.email = "[email protected]"
AND service_name = "unityCatalog"
AND action_name = "generateTemporaryTableCredential"
AND datediff(now(), created_at) < 1;

Who deleted this table?
SELECT user_identity.email
FROM system.operational_data.audit_logs
WHERE request_params.full_name_arg = "main.uc_deep_dive.login_data_silver"
AND service_name = "unityCatalog"
AND action_name = "deleteTable";

What tables does this user access most frequently?
SELECT request_params.table_full_name, count(*)
FROM system.operational_data.audit_logs
WHERE user_identity.email = "[email protected]"
AND service_name = "unityCatalog"
AND action_name = "generateTemporaryTableCredential"
GROUP BY 1 ORDER BY 2 DESC LIMIT 1;


System Tables: Billing Logs
Understand cost allocation across your data estate

What is the daily trend in DBU consumption?
SELECT date(created_on) as `Date`, sum(dbus) as `DBUs Consumed`
FROM system.operational_data.billing_logs
GROUP BY date(created_on)
ORDER BY date(created_on) ASC;

Which 10 users consumed the most DBUs?
SELECT tags.creator as `User`, sum(dbus) as `DBUs`
FROM system.operational_data.billing_logs
GROUP BY tags.creator
ORDER BY `DBUs` DESC
LIMIT 10;

How many DBUs of each SKU have been used so far this month?
SELECT sku as `SKU`, sum(dbus) as `DBUs`
FROM system.operational_data.billing_logs
WHERE month(created_on) = month(CURRENT_DATE)
GROUP BY sku
ORDER BY `DBUs` DESC;

Which Jobs consumed the most DBUs?
SELECT tags.JobId as `Job ID`, sum(dbus) as `DBUs`
FROM system.operational_data.billing_logs
GROUP BY `Job ID`;


System Tables: Lineage Data
Query upstream and downstream sources in one place

What tables are sourced from this table?
SELECT DISTINCT target_table_full_name
FROM system.access.table_lineage
WHERE source_table_name = "login_data_bronze";

What user queries read from this table?
SELECT DISTINCT entity_type, entity_id, source_table_full_name
FROM system.access.table_lineage
WHERE source_table_name = "login_data_silver";
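
The first lineage query returns only direct children of a table. For impact analysis it is often useful to follow those edges transitively. The Python sketch below does that with a breadth-first walk over (source, target) pairs, such as the rows one might export from the lineage table. `downstream_tables` is a hypothetical helper and runs on plain tuples, not on a live system table.

```python
from collections import deque


def downstream_tables(edges: list[tuple[str, str]], start: str) -> set[str]:
    """Return every table reachable from start by following source -> target edges."""
    children: dict[str, set[str]] = {}
    for source, target in edges:
        children.setdefault(source, set()).add(target)

    seen: set[str] = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            # Skip tables already visited so cycles cannot loop forever.
            if child not in seen and child != start:
                seen.add(child)
                queue.append(child)
    return seen
```

Run against a bronze table, the result is the full set of silver, gold and later tables that a proposed column change could affect.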



Key concepts and capabilities
Key concepts
Centralised metadata and access controls
Query federation & Volumes
Discover your data with search and lineage
Audit your data

Open collaboration



Delta Sharing
An open standard for secure sharing of tables, views, files, models, and more

● Data provider: Delta Lake table / view / file / model / etc, exposed through a Delta Sharing server
● Delta Sharing protocol connects provider and consumer
● Data consumer: any compatible client … and more

Share cross-platform w/ open protocol
Share data with no replication


Databricks Marketplace
An open marketplace for data, analytics, and AI

■ Data sets, Notebooks, ML models and applications from top data & solution providers
■ Public marketplace, private exchanges
■ Open for Databricks & non-Databricks users

[Diagram: Databricks Marketplace offering Data Files, Data Tables, Dashboards, Solution Accelerators, ML Models, Notebooks]
Clean rooms
Secure environments to run computations on joint data

● Collaborator 1 (e.g. Publisher) owns sensitive data: Hashed_user_id, age, income, ad_id, imp, clicks
● Collaborator 2 (e.g. Advertiser) owns sensitive data: Hashed_user_id, conversion_event
● The Data Clean Room is a secure, privacy preserving environment in which they can answer: How did my campaign do for our common users?


Upgrade to Unity Catalog



How to upgrade to Unity Catalog
Steps to consider for a full upgrade
Phases: PLANNING/SETUP, DATA / WORKLOAD UPGRADE, CLEANUP

1 UC Design
● Catalogs
● Workspaces
● Account groups
● Default roles

2 One Time Setup
● Create metastore
● Identity federation
● Join workspaces

3 Create UC Objects
● Storage Credentials
● External locations
● Catalogs
● Set Owners

4 Upgrade Legacy Metadata
● SYNC external tables/schemas
● Migrate managed tables, files

5 Grant Access
● Catalogs
● Schemas
● Tables
● Files

6 Upgrade Workloads
● Create clusters
● Create jobs
● Update notebooks
● Downstream tools

7 Decommissioning
● Old pipelines
● Old clusters
● Hive_metastores
● Mounts
Upgrading Hive tables to Unity
Managed & External tables - use SYNC command

• Run multiple times to pull changes from the hive/glue database into Unity over time
• Use a job for long term synchronization
• Use the DRY RUN option to test the sync without making any changes to the target table
• Works on Hive Managed Tables where schema locations are defined.

SYNC SCHEMA main.my_db_uc FROM hive_metastore.my_db DRY RUN

SYNC TABLE main.my_db_uc.my_tbl FROM hive_metastore.my_db.my_tbl



UCX - your best resource for UC upgrades
github.com/databrickslabs/ucx

UCX is a Databricks Labs project aimed at streamlining UC adoption. It is an open source project by the Databricks UC field team.

Capabilities
❏ Assessment
❏ Group Upgrade
❏ Table Upgrade
❏ ACLs
❏ Workflows/ Notebook Upgrade (Soon)



Demo
https://www.databricks.com/resources/demos/tutorials/governance/data-lineage-with-unity-catalog?itm_data=demo_center


Thank you!
