0% found this document useful (0 votes)
17 views12 pages

9 - Analytics Databases

Analytics databases are often used to store data copied from transactional databases to optimize analytics queries. The data is structured using star or snowflake schemas and stored column-oriented to better utilize CPUs. Writes go to an LSM tree then bulk writes to columns. Materialized views precompute common queries but slow down writes and reduce flexibility.

Uploaded by

rohit kumar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views12 pages

9 - Analytics Databases

Analytics databases are often used to store data copied from transactional databases to optimize analytics queries. The data is structured using star or snowflake schemas and stored column-oriented to better utilize CPUs. Writes go to an LSM tree then bulk writes to columns. Materialized views precompute common queries but slow down writes and reduce flexibility.

Uploaded by

rohit kumar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 12

Analytics Databases

Analytics Databases Background


Businesses will often want to run large internal queries across their data, that do
full table scans. However, doing such a thing can take a huge performance hit on
their database that deals with client interaction. Hence, they will typically have a
second database, for analytics processing, where data is copied some period of
time after the fact.

This is done using an ETL (Extract, Transform, Load) process which is typically
scheduled as a batch job (we will discuss these later).
Stars and Snowflakes
In the transaction database, there may be many
tables with all different types of relationships
between them. However, in an analytics
database, there is typically one central table
known as the “fact” table. The fact table has many
foreign keys which reference other tables known
as dimension tables - this is called the “star”
schema.

If the dimensions tables reference sub-dimensions tables, this is known as a snowflake schema.
Column Oriented Storage Background
Most transactions based databases use row oriented storage: they store the entire
contents of a row together on disk to improve with locality.

However, in analytical queries, it is rare that we need the entire row, but rather are
aggregating the value of one column over a certain table. Hence, it makes more
sense to store the columns of the table together, known as column oriented
storage.

Note that each column must be stored in the same order.


Compression
If each column has a lot of duplicate values, we can compress it to save space.
Imagine the following column representing coding proficiency out of 10:

8, 8, 8, 4, 5, 2, 2, 2, 2, 1
Compression
If each column has a lot of duplicate values, we can compress it to save space.
Imagine the following column representing coding proficiency out of 10:

8, 8, 8, 4, 5, 2, 2, 2, 2, 1

Bitmap encodings:

For val 8: 1 1 1 0 0 0 0 0 0 0

For val 4: 0 0 0 1 0 0 0 0 0 0

For val 5: 0 0 0 0 1 0 0 0 0 0

For val 2: 0 0 0 0 0 1 1 1 1 0

For val 1: 0 0 0 0 0 0 0 0 0 1
Compression
If each column has a lot of duplicate values, we can compress it to save space.
Imagine the following column representing coding proficiency out of 10:

8, 8, 8, 4, 5, 2, 2, 2, 2, 1

Bitmap encodings: Run-Length encodings:

For val 8: 1 1 1 0 0 0 0 0 0 0 For val 8: 0, 3, 7

For val 4: 0 0 0 1 0 0 0 0 0 0 For val 4: 3, 1, 6

For val 5: 0 0 0 0 1 0 0 0 0 0 For val 5: 4, 1, 6

For val 2: 0 0 0 0 0 1 1 1 1 0 For val 2: 5, 4, 1

For val 1: 0 0 0 0 0 0 0 0 0 1 For val 1: 9, 1


Compression Continued
Perform bitwise operations on the encodings to find rows where multiple fields
match certain values (column A = 10 AND column B = 20, could also do column A
= 10 OR column A = 15)

Allows more data to fit in CPU cache

If you want columns sorted in a different way, can have a replica of the analytics
database with a different sort order:
● Acts as an index for efficient querying if you have a common query pattern
● Allows more column compression
Writing to Column Oriented Storage
Inefficient writes because would have to modify every column file

Instead all writes go to a sorted tree in memory (LSM tree), which is eventually
written in bulk to all of the column files once it gets too big
● Reads must check both the tree and the column files and merge them
Materialized Views
Idea: Database precomputes common queries so they do not constantly have to
be rerun

Pros:
● Do not have to rerun certain expensive common aggregations

Cons:
● Writes take longer since materialized views must be updated
● Less flexibility than querying raw data
Data Cube
Idea: Special type of materialized view that precomputes a multi dimensional table
(e.g. sales of every product_id on every day in the database)
Analytics Databases Summary
Very useful to decouple analytics databases from transactional ones as analytics
queries can take a very long time.

Additionally, using an analytics database allows us to store our data in a column


oriented manner, which once compressed, can be better utilized by a CPU’s cache
and iterated on in tight loops (no function calls). When writing to column oriented
storage, we can use an LSM tree as a buffer.

Finally, materialized views are a potentially very useful caching mechanism by


analytics databases in order to avoid recomputation of popular aggregations, at
the cost of slower writes.

You might also like