Dimensional Data Modeling Day 1
What is a dimension?
OLTP and OLAP ARE A CONTINUUM
The continuum: Production Database Snapshots → Master Data → OLAP Cubes → Metrics
Cumulative Table Design
- Core components
  - 2 dataframes (yesterday and today)
  - FULL OUTER JOIN the two data frames together
  - COALESCE values to keep everything around
  - Hang onto all of history (see the sketch after this list)
- Usages
  - Growth analytics at Facebook (dim_all_users)
  - State transition tracking (we will cover this more in the Analytics track, in Applying Analytical Patterns, later)
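A minimal PySpark sketch of the pattern, assuming a tiny hypothetical users table; the column names (user_id, dates_active, event_date) and sample values are illustrative, not from the deck:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

ds = "2023-01-02"  # the partition being built (assumption for the sketch)

# Yesterday's cumulated table and today's daily data, both keyed on user_id.
yesterday = spark.createDataFrame(
    [("u1", ["2023-01-01"]), ("u2", ["2022-12-30"])],
    ["user_id", "dates_active"],
)
today = spark.createDataFrame(
    [("u1", "2023-01-02"), ("u3", "2023-01-02")],
    ["user_id", "event_date"],
)

# FULL OUTER JOIN the two frames, then COALESCE so users present on either
# side are kept, hanging onto all of history.
cumulated = (
    yesterday.alias("y")
    .join(today.alias("t"), F.col("y.user_id") == F.col("t.user_id"), "full_outer")
    .select(
        F.coalesce(F.col("y.user_id"), F.col("t.user_id")).alias("user_id"),
        F.when(F.col("t.event_date").isNull(), F.col("y.dates_active"))           # inactive today: carry history forward
         .when(F.col("y.dates_active").isNull(), F.array(F.col("t.event_date")))  # brand-new user
         .otherwise(F.concat(F.col("y.dates_active"), F.array(F.col("t.event_date"))))
         .alias("dates_active"),
        F.lit(ds).alias("as_of_date"),
    )
)
cumulated.show(truncate=False)
```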
Diagram of cumulative table design
[Diagram: yesterday's and today's data are combined, cumulation metrics are computed (e.g. days since x), and the result is written as the cumulated output]
- Strengths
- Historical analysis without shuffle
- Easy “transition” analysis (see the sketch after this list)
- Drawbacks
- Can only be backfilled sequentially
- Handling PII data can be a mess since deleted/inactive users get carried forward
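A hedged sketch of why transition analysis gets easy: once yesterday's and today's state live on the same row, labelling each user is a single pass with no extra join. The flag and label names here are assumptions, not from the deck:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical cumulated rows carrying yesterday's and today's activity flags.
users = spark.createDataFrame(
    [("u1", True, True), ("u2", True, False), ("u3", False, True)],
    ["user_id", "is_active_yesterday", "is_active_today"],
)

# Because both days sit on one row, the transition label needs no shuffle-heavy join.
labelled = users.withColumn(
    "transition",
    F.when(F.col("is_active_yesterday") & F.col("is_active_today"), "retained")
     .when(F.col("is_active_today"), "new_or_resurrected")
     .when(F.col("is_active_yesterday"), "churned")
     .otherwise("stale"),
)
labelled.groupBy("transition").count().show()
```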
The compactness vs usability tradeoff
- Struct
  - Keys are rigidly defined, compression is good!
  - Values can be any type
- Map
  - Keys are loosely defined, compression is okay!
  - Values all have to be the same type
- Array
  - Ordinal
  - List of values that all have to be the same type (see the schema sketch after this list)
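A small PySpark schema sketch contrasting the three types; the field names are made up for illustration:

```python
from pyspark.sql.types import (
    ArrayType, BooleanType, IntegerType, MapType, StringType, StructField, StructType,
)

listing_schema = StructType([
    StructField("listing_id", StringType()),
    # Struct: keys are rigidly defined up front; each value can have its own type.
    StructField("host", StructType([
        StructField("host_id", StringType()),
        StructField("is_superhost", BooleanType()),
        StructField("listings_count", IntegerType()),
    ])),
    # Map: keys are loose (can differ row to row), but every value must share one type.
    StructField("attributes", MapType(StringType(), StringType())),
    # Array: an ordinal list whose elements all share one type.
    StructField("nightly_prices", ArrayType(IntegerType())),
])

print(listing_schema.simpleString())
```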
Temporal Cardinality Explosions of Dimensions
- When you add a temporal aspect to your dimensions and the cardinality increases by at least 1 order of magnitude
- Example
  - Airbnb has ~6 million listings
  - If we want to know the nightly pricing and availability of each night for the next year
  - That's 365 * 6 million, or about 2.2 billion listing-nights
- Should this dataset be:
  - Listing-level with an array of nights?
  - Listing-night level with 2 billion rows?
- If you do the sorting right, Parquet will keep these two about the same size (see the sketch below)
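A rough sketch of the two layouts in PySpark, with a toy listing standing in for the real data; the schema and values are assumptions for illustration:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Back-of-the-envelope cardinality from the slide:
print(f"{6_000_000 * 365:,}")  # 2,190,000,000 listing-nights

# Option A: listing level, one row per listing with an array of nights.
listing_level = spark.createDataFrame(
    [("l1", [("2024-01-01", 120, True), ("2024-01-02", 130, False)])],
    "listing_id string, nights array<struct<night:string,price:int,is_available:boolean>>",
)

# Option B: listing-night level, one row per (listing, night) -- ~2 billion rows at Airbnb scale.
listing_night_level = (
    listing_level
    .select("listing_id", F.explode("nights").alias("n"))
    .select("listing_id", "n.night", "n.price", "n.is_available")
)
listing_night_level.show()
```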
Badness of denormalized temporal dimensions
Run-length encoding compression
- Probably the most important compression technique in big data right now
- It’s why Parquet file format has become so successful
- Shuffle can ruin this. BE CAREFUL!
- Shuffle happens in distributed environments when you do JOIN and GROUP BY (see the sketch below)
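A hedged sketch of setting up long runs before writing Parquet, so run-length and dictionary encoding have something to compress; the table, column names, and output path are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical events table with a low-cardinality dimension column.
events = spark.range(1_000_000).selectExpr(
    "id",
    "id % 200 as country_id",   # low-cardinality dimension
    "id % 50000 as user_id",
)

# Sorting within partitions on the low-cardinality column first creates long runs
# of repeated values, which Parquet's run-length encoding compresses well.
(
    events
    .repartition(8, "country_id")
    .sortWithinPartitions("country_id", "user_id")
    .write.mode("overwrite")
    .parquet("/tmp/events_rle_sorted")   # path is an assumption for the sketch
)
```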
[Figure: run-length encoding collapses a run of repeated values, e.g. 5 5 5, into the value plus a run count]
Spark Shuffle
After a join, Spark may mix up the ordering of the rows and ruin your compression (see the sketch below)
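A sketch of the failure mode and the usual fix, re-sorting after the join before writing; table and path names are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = spark.range(1_000_000).selectExpr("id", "id % 200 as country_id")
countries = spark.range(200).selectExpr("id as country_id", "concat('country_', id) as name")

# The join forces a shuffle, so rows come out grouped by the hashed join key,
# not in the nicely sorted order the input had.
joined = events.join(countries, "country_id")

# Re-sorting (within partitions is enough for compression) restores the long runs
# before the data is written back out.
(
    joined
    .sortWithinPartitions("country_id")
    .write.mode("overwrite")
    .parquet("/tmp/events_after_join")   # path is an assumption for the sketch
)
```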
Let’s start the workshop!