9 - Analytics Databases
9 - Analytics Databases
This is done using an ETL (Extract, Transform, Load) process which is typically
scheduled as a batch job (we will discuss these later).
Stars and Snowflakes
In the transaction database, there may be many
tables with all different types of relationships
between them. However, in an analytics
database, there is typically one central table
known as the “fact” table. The fact table has many
foreign keys which reference other tables known
as dimension tables - this is called the “star”
schema.
If the dimensions tables reference sub-dimensions tables, this is known as a snowflake schema.
Column Oriented Storage Background
Most transactions based databases use row oriented storage: they store the entire
contents of a row together on disk to improve with locality.
However, in analytical queries, it is rare that we need the entire row, but rather are
aggregating the value of one column over a certain table. Hence, it makes more
sense to store the columns of the table together, known as column oriented
storage.
8, 8, 8, 4, 5, 2, 2, 2, 2, 1
Compression
If each column has a lot of duplicate values, we can compress it to save space.
Imagine the following column representing coding proficiency out of 10:
8, 8, 8, 4, 5, 2, 2, 2, 2, 1
Bitmap encodings:
For val 8: 1 1 1 0 0 0 0 0 0 0
For val 4: 0 0 0 1 0 0 0 0 0 0
For val 5: 0 0 0 0 1 0 0 0 0 0
For val 2: 0 0 0 0 0 1 1 1 1 0
For val 1: 0 0 0 0 0 0 0 0 0 1
Compression
If each column has a lot of duplicate values, we can compress it to save space.
Imagine the following column representing coding proficiency out of 10:
8, 8, 8, 4, 5, 2, 2, 2, 2, 1
If you want columns sorted in a different way, can have a replica of the analytics
database with a different sort order:
● Acts as an index for efficient querying if you have a common query pattern
● Allows more column compression
Writing to Column Oriented Storage
Inefficient writes because would have to modify every column file
Instead all writes go to a sorted tree in memory (LSM tree), which is eventually
written in bulk to all of the column files once it gets too big
● Reads must check both the tree and the column files and merge them
Materialized Views
Idea: Database precomputes common queries so they do not constantly have to
be rerun
Pros:
● Do not have to rerun certain expensive common aggregations
Cons:
● Writes take longer since materialized views must be updated
● Less flexibility than querying raw data
Data Cube
Idea: Special type of materialized view that precomputes a multi dimensional table
(e.g. sales of every product_id on every day in the database)
Analytics Databases Summary
Very useful to decouple analytics databases from transactional ones as analytics
queries can take a very long time.