Kent Graziano Intro To Datavault Short
ENGINEERING
Introduction to Data Vault 2.0
• Bio
• Agile & DW
• What is a Data Vault & Where does it fit?
• How to design a Data Vault model
• Foundational Keys
• Benefits of Data Vault
• Who is using Data Vault?
• References
© 2018 Snowflake Computing Inc. All Rights Reserved. Snowflake Proprietary. Not for Redistribution.
My Bio
› Chief Technical Evangelist, Snowflake Computing
› OakTable Member
Founded 2012 by industry veterans with over 120 database patents
First customers 2014, general availability 2015
Fun facts:
• Queries processed in Snowflake per day: 100 million
• Largest single table: 68 trillion rows
• Largest number of tables single DB: 200,000
• Single customer most data: > 40PB
• Single customer most users: > 10,000
Manifesto for Agile Software Development
http://agilemanifesto.org
"We are uncovering better ways of developing software by doing it and helping others do it.
Through this work we have come to value:
Applying Agile to DW
Data Vault Model Definition
The Data Vault is a detail oriented, historical tracking and uniquely linked
set of normalized tables that support one or more functional areas of
business.
It is a hybrid approach encompassing the best of breed between 3rd normal
form (3NF) and star schema. The design is flexible, scalable, consistent
and adaptable to the needs of the enterprise.
Data Vault Timeline
• E.F. Codd invented relational modeling
• Chris Date and Hugh Darwen refined modeling concepts
• Mid 70’s: AC Nielsen popularized Dimension & Fact Terms
• 1976: Dr. Peter Chen created E-R diagramming
• Mid 80’s: Bill Inmon popularized Data Warehousing
• 1990: Dan Linstedt begins R&D on Data Vault Modeling
EDW Data Vault
1 HUB
2 LINK
3 SATELLITE
H HUB_SUPPLIER
  P * MD5_HUB_SUPPLIER VARCHAR2 (32)
  U * S_NAME VARCHAR2 (80)
    * LDTS TIMESTAMP
    * RSCR VARCHAR2 (256)
    * S_SUPPKEY INTEGER
  HUB_SUPPLIER_PK (MD5_HUB_SUPPLIER)
  HUB_SUPPLIER__UN (S_NAME)
H HUB_PART
  P * MD5_HUB_PART VARCHAR2 (32)
  U * P_NAME VARCHAR2 (80)
  U * P_BRAND VARCHAR2 (80)
  U * P_TYPE VARCHAR2 (80)
  U * P_SIZE INTEGER
  U * P_CONTAINER VARCHAR2 (80)
    * LDTS TIMESTAMP
    * RSCR VARCHAR2 (256)
      P_PARTKEY INTEGER
  HUB_PART_PK (MD5_HUB_PART)
  HUB_PART__UN (P_NAME, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER)
What Does the MD5 Look Like?
• MD5 hash function – Snowflake
• MD5 (UPPER(RTRIM(RMC.CAFCUSCHN)))
• MD5 hash function – Oracle
• rawtohex(sys.utl_raw.cast_to_raw(dbms_obfuscation_toolkit.md5(input_string => ...)))
• NEW: dbms_crypto.HASH(utl_raw.cast_to_raw(<input string>), 2);
• 2 is for MD5 algorithm option
• MD5 hash function - SQL Server
• CONVERT(CHAR(32), HASHBYTES('MD5', UPPER(RTRIM(RMC.CAFCUSCHN))), 2)
• Style 2 returns the hash as a 32-character hex string
• Need to minimize chance of duplicates
• Without a separator, 12||3||45 and 1||2||345 hash to the same value
• Need a separator between each column
• Example: Col1||'^'||Col2||'^'||Col3
• Need to account for NULLs too
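The separator and NULL points above can be illustrated with a small Python sketch. It is not taken from the deck: the helper names `md5_hex` and `hash_key`, and the choice to replace NULL (`None`) with an empty string, are assumptions made for illustration. Only the '^' separator and the trim-and-uppercase normalization come from the slides.

```python
import hashlib

SEPARATOR = "^"   # separator from the slide's Col1||'^'||Col2 example
NULL_TOKEN = ""   # assumption: a NULL column contributes an empty string


def md5_hex(text: str) -> str:
    """Return the 32-character hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def hash_key(*columns) -> str:
    """Hypothetical helper: trim, uppercase, join with the separator, then hash."""
    parts = []
    for value in columns:
        if value is None:
            parts.append(NULL_TOKEN)
        else:
            parts.append(str(value).strip().upper())
    return md5_hex(SEPARATOR.join(parts))


# Plain concatenation: both inputs collapse to the string "12345".
naive_a = md5_hex("12" + "3" + "45")
naive_b = md5_hex("1" + "2" + "345")

# With a separator the inputs stay distinct: "12^3^45" vs "1^2^345".
safe_a = hash_key("12", "3", "45")
safe_b = hash_key("1", "2", "345")

print(naive_a == naive_b)  # the collision the slide warns about
print(safe_a == safe_b)
```

Because the separator keeps column positions intact, a NULL in the first column and a NULL in the second column also produce different input strings ("^A" versus "A^").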
Other Considerations
Final Input String
(
UPPER(TRIM(T1.GENERICNAME)) ||'^'||
UPPER(TRIM(TO_CHAR(T1.MED_STRNG_AMT))) ||'^'||
UPPER(TRIM(T1.UOM_CD)) ||'^'||
UPPER(TRIM(T1.MED_FORM_NM)) ||'^'
)
2: Links = Associations
H HUB_CUSTOMER
  P * MD5_HUB_CUSTOMER VARCHAR2 (32)
  U * C_NAME VARCHAR2 (80)
      LDTS TIMESTAMP
      RSCR VARCHAR2 (256)
      C_CUSTKEY INTEGER
  HUB_CUSTOMER_PK (MD5_HUB_CUSTOMER)
  HUB_CUSTOMER__UN (C_NAME)
H HUB_ORDER
  P * MD5_HUB_ORDER VARCHAR2 (32)
  U * O_ORDERID INTEGER
    * LDTS TIMESTAMP
    * RSCR VARCHAR2 (256)
      O_ORDERKEY INTEGER
  HUB_ORDER_PK (MD5_HUB_ORDER)
  HUB_ORDER__UN (O_ORDERID)
L LNK_CUSTOMER_ORDER
P * MD5_LNK_CUSTOMER_ORDER VARCHAR2 (32)
UF * MD5_HUB_CUSTOMER VARCHAR2 (32)
UF * MD5_HUB_ORDER VARCHAR2 (32)
* C_NAME VARCHAR2 (80)
* O_ORDERID INTEGER
* LDTS TIMESTAMP
* RSCR VARCHAR2 (256)
LNK_CUSTOMER_ORDER_PK (MD5_LNK_CUSTOMER_ORDER)
LNK_CUSTOMER_ORDER__UN (MD5_HUB_CUSTOMER, MD5_HUB_ORDER)
LNK_CUSTOMER_ORDER_HUB_CUSTOMER_FK (MD5_HUB_CUSTOMER)
LNK_CUSTOMER_ORDER_HUB_ORDER_FK (MD5_HUB_ORDER)
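As a rough sketch of how a row for LNK_CUSTOMER_ORDER might be assembled, the Python below derives each hub key from its business key and the link key from both business keys together. The deck does not show how the link's MD5 is computed, so hashing the combined business keys is an assumption here, as are the helper names `md5_key` and `build_link_row`. Column names follow the diagram.

```python
import hashlib


def md5_key(*business_keys) -> str:
    """Hypothetical helper: normalize business keys, join with '^', and hash."""
    normalized = [str(key).strip().upper() for key in business_keys]
    return hashlib.md5("^".join(normalized).encode("utf-8")).hexdigest()


def build_link_row(c_name, o_orderid, ldts, rscr) -> dict:
    """Assemble one link row; the link key is assumed to hash both business keys."""
    return {
        "MD5_LNK_CUSTOMER_ORDER": md5_key(c_name, o_orderid),
        "MD5_HUB_CUSTOMER": md5_key(c_name),    # same value the customer hub would hold
        "MD5_HUB_ORDER": md5_key(o_orderid),    # same value the order hub would hold
        "C_NAME": c_name,
        "O_ORDERID": o_orderid,
        "LDTS": ldts,
        "RSCR": rscr,
    }


# Example values are made up for illustration.
link_row = build_link_row("Acme Corp", 1001, "2018-01-01 00:00:00", "ORDERS_FEED")
print(link_row["MD5_LNK_CUSTOMER_ORDER"])
```

Since every key is computed from values already present in the incoming record, no lookup against the hub tables is needed to build the link row.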
Modeling Links - 1:1 or 1:M?
L LNK_PART_SUPPLIER
  P  * MD5_LNK_PART_SUPPLIER VARCHAR2 (32)
  UF * MD5_HUB_SUPPLIER VARCHAR2 (32)
  UF * MD5_HUB_PART VARCHAR2 (32)
     * S_NAME VARCHAR2 (80)
     * P_NAME VARCHAR2 (80)
     * P_BRAND VARCHAR2 (80)
     * P_TYPE VARCHAR2 (80)
     * P_SIZE INTEGER
     * P_CONTAINER VARCHAR2 (80)
     * LDTS TIMESTAMP
     * RSRC VARCHAR2 (256)
  LNK_PART_SUPPLIER_PK (MD5_LNK_PART_SUPPLIER)
  LNK_PART_SUPPLIER__UN (MD5_HUB_SUPPLIER, MD5_HUB_PART)
  LNK_PART_SUPPLIER_HUB_PART_FK (MD5_HUB_PART)
  LNK_PART_SUPPLIER_HUB_SUPPLIER_FK (MD5_HUB_SUPPLIER)
S SAT_LNK_PART_SUPPLIER
  PF * MD5_LNK_PART_SUPPLIER VARCHAR2 (32)
  P  * LDTS TIMESTAMP
       PS_AVAILQTY INTEGER
       PS_SUPPLYCOST NUMBER (12,2)
       PS_COMMENT VARCHAR2 (199)
     * HASH_DIFF VARCHAR2 (32)
     * RSRC VARCHAR2 (256)
  PK_SAT_LNK_PART_SUPPLIER (LDTS, MD5_LNK_PART_SUPPLIER)
  SAT_LNK_PART_SUPPLIER_LNK_PART_SUPPLIER_FK (MD5_LNK_PART_SUPPLIER)
MD5-Based Change Detection
• Think Type 2 SCD (Slowly Changing Dimensions)
• Old Way:
• Compare column by column
• Source value != Current value in DW table
• 20 columns, then 20 compares
• New Way:
• Concatenate all columns to one string
• Convert to one char(32) string with hash function
• Compare to hashed value (HASH_DIFF) in target table
• Does not matter how many columns
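The "new way" can be sketched in Python as follows. This is an illustration under stated assumptions, not the deck's load code: the function names `hash_diff` and `has_changed` are hypothetical, NULLs are assumed to become empty strings, and the '^' separator is borrowed from the earlier hashing slide. The column names come from SAT_LNK_PART_SUPPLIER.

```python
import hashlib

# Descriptive columns tracked by the satellite.
SAT_COLUMNS = ["PS_AVAILQTY", "PS_SUPPLYCOST", "PS_COMMENT"]


def hash_diff(row: dict, columns: list) -> str:
    """Concatenate all tracked columns into one string and hash it to 32 hex chars."""
    parts = ["" if row.get(col) is None else str(row[col]) for col in columns]
    return hashlib.md5("^".join(parts).encode("utf-8")).hexdigest()


def has_changed(source_row: dict, stored_hash_diff: str, columns: list) -> bool:
    """One comparison, however many columns the satellite tracks."""
    return hash_diff(source_row, columns) != stored_hash_diff


# Made-up rows for illustration.
current = {"PS_AVAILQTY": 100, "PS_SUPPLYCOST": "12.50", "PS_COMMENT": None}
stored = hash_diff(current, SAT_COLUMNS)   # value held in HASH_DIFF on the target

unchanged = dict(current)
updated = dict(current, PS_AVAILQTY=90)

print(has_changed(unchanged, stored, SAT_COLUMNS))
print(has_changed(updated, stored, SAT_COLUMNS))
```

A new satellite row would be inserted only when `has_changed` returns True, which gives the Type 2 style history the slide refers to.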
Easily Getting Current Rows
Example – Virtual Expire Date!
CASE
WHEN LEAD(stg.LOAD_DTS)
OVER (PARTITION BY stg.CDC_KEY
ORDER BY stg.LOAD_DTS) IS NULL
THEN 'Y'
ELSE 'N'
END CURR_FLG,
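For readers less familiar with window functions, the Python sketch below mimics what the LEAD expression computes: within each CDC_KEY, rows are ordered by LOAD_DTS, and the row with no later load is flagged as current. The function name `add_current_flag` and the output column `EXPIRE_DTS` are assumptions for illustration; the deck only shows the CURR_FLG expression.

```python
from collections import defaultdict


def add_current_flag(rows):
    """Emulate LEAD(LOAD_DTS) OVER (PARTITION BY CDC_KEY ORDER BY LOAD_DTS)."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row["CDC_KEY"]].append(row)

    result = []
    for key_rows in by_key.values():
        ordered = sorted(key_rows, key=lambda r: r["LOAD_DTS"])
        for i, row in enumerate(ordered):
            # The next row's load date acts as a virtual expire date; None means no later row.
            next_load = ordered[i + 1]["LOAD_DTS"] if i + 1 < len(ordered) else None
            result.append({
                **row,
                "EXPIRE_DTS": next_load,                        # hypothetical column name
                "CURR_FLG": "Y" if next_load is None else "N",
            })
    return result


# Made-up staging rows; ISO date strings sort chronologically.
staged = [
    {"CDC_KEY": "A", "LOAD_DTS": "2018-01-01"},
    {"CDC_KEY": "A", "LOAD_DTS": "2018-02-01"},
    {"CDC_KEY": "B", "LOAD_DTS": "2018-01-15"},
]
flagged = add_current_flag(staged)
for row in flagged:
    print(row)
```

Nothing is stored or updated on older rows: the expire date and the flag are derived at query time, which is why the slide calls the expire date virtual.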
Foundational Keys
Highly normalized
Goes beyond standard 3NF
› Hubs and Links only hold keys and meta data
› Satellites split by rate of change and/or source
Enables Agile data modeling
› Easy to add to model without having to change existing structures and load routines
• Relationships (links) can be dropped and created on-demand.
› No more reloading history because of a missed requirement
Based on natural business keys
Not system surrogate keys
› Allows for integrating data across functions and source systems more easily
› All data relationships are key driven
Data Vault Agility
› Loading Processes
› Data Model
› Reports & BI Functions
› Downstream Systems
› Star Schemas or Data Marts
Perhaps You Wish to Split for Security Reasons?
From This: DV “Logical” Partitioning
To THIS! In Snowflake – DV “Physical” Partitioning (Snowflake DB 1, Snowflake DB 2)
DV Physical Partitioning might split for security reasons
Productivity
• Standardized modeling rules
• Highly repeatable and learnable modeling technique
• Can standardize load routines
• Delta Driven process
• Re-startable, consistent loading patterns.
• Load multiple objects in parallel!
• Can standardize extract routines
• Rapid build of new or revised Data Marts
• Can be automated (e.g. WhereScape)
• Can use a BI-meta layer to virtualize the reporting structures
• Example: Looker using LookML semantic layer
• Example: BOBJ Universe Business Layer
• Can put views on the DV structures as well
• Simulate ODS/3NF or Star Schemas
Productivity - Loading Less Scheduling
Non-Deterministic Keys
Deterministic Keys
Other Benefits of a Data Vault
• Modeling the EDW as a DV forces integration of the Business Keys upfront
• Good for organizational alignment
• An integrated data set with raw data extends its value beyond BI:
• Source for data quality projects
• Source for master data
• Source for data mining
• Source for Data as a Service (DaaS) in an SOA (Service Oriented Architecture)
Other Benefits of Data Vault
• Upfront Hub integration simplifies the data integration routines required to load data marts
• Helps divide the work a bit
• Much easier to implement security on these granular pieces
• Granular, re-startable processes enable pinpoint failure correction
• Designed and optimized for real-time loading in its core architecture (without any tweaks or mods)
Organizations Using Data Vault
• University of Texas, MD Anderson Cancer Center
• Denver Public Schools
• Micron
• Independent Purchasing Cooperative (IPC, Miami)
• Kaplan
• US Defense Department
• Colorado Springs Utilities
• State Court of Wyoming
• Federal Express
• US Dept. of Agriculture
Snowflake Customers using Data Vault
› Aptus Health
› ResearchNow
› F+W Media
› Sainsbury’s
Data Vault Training & Certification
The Experts Say…
“The Data Vault is the optimal choice
for modeling the EDW in the DW 2.0
framework.” Bill Inmon
References
› Available on Amazon:
http://www.amazon.com/Better-Data-Modeling-Introduction-Engineering-ebook/dp/B018BREV1C/
Super Charge Your Data Warehouse
› Available on Amazon.com
New DV 2.0 Book
from Dan Linstedt
› Available on Amazon:
http://www.amazon.com/Building-Scalable-Data-Warehouse-Vault/dp/0128025107/
Contact Information
Kent Graziano
Snowflake Computing
On Twitter @KentGraziano
More info at http://snowflake.com
Visit my blog at http://kentgraziano.com