Dimensional Data Modeling

Day 1
What is a dimension?

- Dimensions are attributes of an entity (e.g. a user’s birthday, a user’s favorite food)
- Some of these dimensions may IDENTIFY an entity (e.g. a user’s ID)
- Others are just attributes
- Dimensions come in two flavors
  - Slowly-changing
  - Fixed
What we’ll cover today

- Knowing your data consumer
- OLTP vs OLAP data modeling
- Cumulative table design
- The compactness vs usability tradeoff
- Temporal cardinality explosion
- Run-length encoding compression gotchas
Knowing your Consumer

- Data analysts / data scientists
  - Should be very easy to query. Not many complex data types
- Other data engineers
  - Should be compact and probably harder to query. Nested types are okay
- ML models
  - Depends on the model and how it’s trained
- Customers
  - Should be a very easy-to-interpret chart
OLTP vs master data vs OLAP

- OLTP (online transaction processing)
  - Optimizes for low-latency, low-volume queries
- OLAP (online analytical processing)
  - Optimizes for large-volume GROUP BY queries, minimizes JOINs (both query shapes are sketched below)
- Master data
  - Optimizes for completeness of entity definitions, deduped
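To make the contrast concrete, here are two hedged query sketches; the table and column names are made up for illustration, not from the slides:

```python
# Illustrative query shapes only -- table and column names are hypothetical.

# OLTP: a low-latency point lookup of a single entity, typically by primary key.
oltp_query = """
SELECT *
FROM users
WHERE user_id = 1234;
"""

# OLAP: a large scan that aggregates across many rows with GROUP BY, ideally
# against a pre-joined, deduped master-data table so JOINs are minimized.
olap_query = """
SELECT country, COUNT(*) AS user_count
FROM users_master
GROUP BY country;
"""
```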
Mismatching needs = less business value!!!!

- Some of the biggest problems in data engineering occur when data is modeled for the wrong consumer!
OLTP and OLAP ARE A CONTINUUM
[Diagram: the continuum runs from Production Database Snapshots → Master Data → OLAP Cubes → Metrics]
Cumulative Table Design

- Core components (sketched in code after the diagram below)
  - 2 dataframes (yesterday and today)
  - FULL OUTER JOIN the two dataframes together
  - COALESCE values to keep everything around
  - Hang onto all of history
- Usages
  - Growth analytics at Facebook (dim_all_users)
  - State transition tracking (we will cover this more in the Analytics track, in “Applying analytical patterns”)
Diagram of cumulative table design
[Diagram: Yesterday and Today are FULL OUTER JOINed; ids and unchanging dimensions are COALESCEd; arrays and changing values are combined; cumulation metrics (e.g. days since x) are computed; the result is the Cumulated Output]
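A minimal sketch of this pattern in pandas (the column names, and the use of pandas itself, are illustrative assumptions; in practice this is usually a daily SQL or Spark job):

```python
import pandas as pd

# Hypothetical snapshots: yesterday's cumulated table and today's new data.
yesterday = pd.DataFrame({
    "user_id": [1, 2],
    "dates_active": [["2023-01-01"], ["2023-01-01"]],
    "last_active": ["2023-01-01", "2023-01-01"],
})
today = pd.DataFrame({
    "user_id": [2, 3],
    "active_date": ["2023-01-02", "2023-01-02"],
})

# FULL OUTER JOIN: users present on either side are kept; merging on
# user_id coalesces the ids automatically.
merged = yesterday.merge(today, on="user_id", how="outer")

# COALESCE-style logic: take today's value where present, else yesterday's.
merged["last_active"] = merged["active_date"].combine_first(merged["last_active"])

# Combine arrays: append today's activity to the running history.
merged["dates_active"] = merged.apply(
    lambda row: (row["dates_active"] if isinstance(row["dates_active"], list) else [])
    + ([row["active_date"]] if pd.notna(row["active_date"]) else []),
    axis=1,
)

cumulated = merged.drop(columns=["active_date"])
print(cumulated)  # one row per user ever seen, with full history carried forward
```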
Cumulative Table Design (cont)

- Strengths
  - Historical analysis without shuffle
  - Easy “transition” analysis
- Drawbacks
  - Can only be backfilled sequentially
  - Handling PII data can be a mess since deleted/inactive users get carried forward
The compactness vs usability tradeoff

- The most usable tables usually
  - Have no complex data types
  - Can easily be manipulated with WHERE and GROUP BY
- The most compact tables (not human-readable)
  - Are compressed to be as small as possible and can’t be queried directly until they’re decoded
- The middle-ground tables
  - Use complex data types (e.g. ARRAY, MAP and STRUCT), making querying trickier but also compacting more
The compactness vs usability tradeoff (cont)

- When would you use each type of table? (a layout sketch follows this list)
  - Most compact
    - Online systems where latency and data volumes matter a lot. Consumers are usually highly technical
  - Middle-ground
    - Upstream staging / master data where the majority of consumers are other data engineers
  - Most usable
    - When analytics is the main consumer and the majority of consumers are less technical
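A toy illustration of the “most usable” vs “middle-ground” layouts for the same data (all names are made up):

```python
import pandas as pd

# Most usable: flat columns, one row per user per date -- trivial to filter
# (WHERE-style masks) and aggregate (GROUP BY).
usable = pd.DataFrame({
    "user_id": [1, 1, 2],
    "date": ["2023-01-01", "2023-01-02", "2023-01-01"],
    "event_count": [3, 5, 2],
})
print(usable.groupby("user_id")["event_count"].sum())

# Middle-ground: one row per user with a nested array -- more compact on disk,
# but consumers must unnest it before aggregating.
middle_ground = pd.DataFrame({
    "user_id": [1, 2],
    "activity": [
        [{"date": "2023-01-01", "event_count": 3},
         {"date": "2023-01-02", "event_count": 5}],
        [{"date": "2023-01-01", "event_count": 2}],
    ],
})
```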
Struct vs Array vs Map

- Struct
  - Keys are rigidly defined; compression is good!
  - Values can be any type
- Map
  - Keys are loosely defined; compression is okay!
  - Values all have to be the same type
- Array
  - Ordinal
  - A list of values that all have to be the same type
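For example, in a Spark schema the three types can be sketched like this (the field names are made up for illustration):

```python
from pyspark.sql.types import (
    ArrayType, IntegerType, MapType, StringType, StructField, StructType,
)

schema = StructType([
    # STRUCT: rigidly defined keys, stored once in the schema, so compression
    # is good; each field's value can have a different type.
    StructField("profile", StructType([
        StructField("birthday", StringType()),
        StructField("login_count", IntegerType()),
    ])),
    # MAP: loosely defined (arbitrary) keys, but every value shares one type.
    StructField("properties", MapType(StringType(), StringType())),
    # ARRAY: an ordinal list whose elements all share one type.
    StructField("dates_active", ArrayType(StringType())),
])
```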
Temporal Cardinality Explosions of Dimensions

- When you add a temporal aspect to your dimensions, the cardinality increases by at least 1 order of magnitude
- Example
  - Airbnb has ~6 million listings
  - If we want to know the nightly pricing and availability of each night for the next year
  - That’s 365 * 6 million, or about ~2 billion nights
- Should this dataset be (both layouts are sketched below):
  - Listing-level with an array of nights?
  - Listing-night level with 2 billion rows?
- If you do the sorting right, Parquet will keep these two about the same size
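A toy sketch of the two candidate layouts (the column names are illustrative, not Airbnb’s real schema):

```python
import pandas as pd

# Listing-level: one row per listing, with an array of night records.
listing_level = pd.DataFrame({
    "listing_id": [101],
    "nights": [[
        {"night": "2023-01-01", "price": 120.0, "is_available": True},
        {"night": "2023-01-02", "price": 150.0, "is_available": False},
    ]],
})

# Listing-night level: one row per (listing, night) pair -- at Airbnb scale,
# 365 nights * ~6 million listings is roughly 2.2 billion rows.
exploded = listing_level.explode("nights").reset_index(drop=True)
nights = pd.json_normalize(exploded["nights"].tolist())
listing_night_level = pd.concat([exploded[["listing_id"]], nights], axis=1)
print(listing_night_level)
```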
Badness of denormalized temporal dimensions

If you explode it out and need to join other dimensions, Spark shuffle will ruin your compression!
Run-length encoding compression

- Probably the most important compression technique in big data right now
- It’s why the Parquet file format has become so successful
- Shuffle can ruin this. BE CAREFUL!
- Shuffle happens in distributed environments when you do JOIN and GROUP BY

Run-length encoding compression (cont)

[Figure: a column of repeated values, e.g. 5 5 5, stored once with a run length instead of repeating each value]
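A minimal sketch of the idea in plain Python (this shows the concept, not Parquet’s actual encoder):

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

# Sorted data collapses into a handful of pairs...
print(run_length_encode([5, 5, 5, 5, 2, 2, 2]))   # [(5, 4), (2, 3)]

# ...but the same values after a shuffle yield one pair per row: no savings.
print(run_length_encode([5, 2, 5, 2, 5, 2, 5]))   # [(5, 1), (2, 1), (5, 1), ...]
```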
Spark Shuffle
After a JOIN, Spark may mix up the ordering of the rows and ruin your compression.

[Figure: after the shuffle, runs of equal values (e.g. 2 2 2) are broken apart and interleaved, so run-length encoding no longer compresses them]
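One common guard, sketched below in PySpark with made-up table names (the slides don’t prescribe a fix, so treat this as an assumption): re-sort after the join and before writing, so the long runs of repeated values are restored.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy stand-ins for real tables; all names are illustrative.
listings = spark.createDataFrame(
    [(1, 10, "2023-01-01"), (1, 10, "2023-01-02"), (2, 11, "2023-01-01")],
    ["listing_id", "host_id", "night"],
)
hosts = spark.createDataFrame(
    [(10, "alice"), (11, "bob")],
    ["host_id", "host_name"],
)

# The join's shuffle can interleave rows and break up runs of repeated values.
joined = listings.join(hosts, on="host_id", how="left")

# Re-sorting within partitions before the write restores long runs, so
# Parquet's run-length encoding can compress them again.
(joined
    .repartition("listing_id")
    .sortWithinPartitions("listing_id", "night")
    .write.mode("overwrite")
    .parquet("/tmp/listing_nights"))
```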
Let’s start the workshop!