HANA DB Column Store
Jordan Jordanov
Agenda
Concepts of Column Store
Structure compared to Row Store
Performance issues compared to Row Store
Go through examples to make the points
Performance bottleneck
Orders of Magnitude
presented by Jeff Dean (Google)
Activity                                 Time in ns
L1 cache reference                       0.5
Branch mis-prediction                    5
L2 cache reference                       7
Mutex lock/unlock                        25
Main memory reference                    100
Compress 1K bytes with Zippy             3,000
Send 2K bytes over 1 Gbps network        20,000
Read 1 MB sequentially from memory       250,000
Round trip within same datacenter        500,000
Disk seek                                10,000,000
Read 1 MB sequentially from disk         20,000,000
Send packet CA->Netherlands->CA          150,000,000
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
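To make the orders of magnitude concrete, the short Python sketch below derives two ratios from the figures above. The dictionary keys and variable names are illustrative only and are not part of the original talk.

```python
# Latencies in nanoseconds, copied from the figures above.
latency_ns = {
    "main_memory_reference": 100,
    "read_1mb_from_memory": 250_000,
    "disk_seek": 10_000_000,
    "read_1mb_from_disk": 20_000_000,
}

# How much slower is disk than memory for a comparable access?
seek_vs_memory_ref = latency_ns["disk_seek"] / latency_ns["main_memory_reference"]
disk_vs_memory_read = latency_ns["read_1mb_from_disk"] / latency_ns["read_1mb_from_memory"]

print(f"Disk seek vs. main memory reference: {seek_vs_memory_ref:,.0f}x")
print(f"1 MB sequential read, disk vs. memory: {disk_vs_memory_read:,.0f}x")
```

A single disk seek costs about as much as 100,000 main memory references, which is why an in-memory store changes where the bottleneck sits.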
2011 SAP AG. All rights reserved.
[Figure: row-store physical layout. Rows 1 to N are stored one after another, each spanning consecutive data pages: Row 1 starts at Data Page 1, Row 2 occupies Data Pages 6 to 10, and Row n occupies Data Pages (5n-4) to (5n).]
Row n = (the size of all columns) * n
[Figure: the same row-store layout of data pages, shown a second time.]
Table
Country  Product  Sales
US       Alpha    3000
US       Beta     1250
JP       Alpha    700
UK       Alpha    450

Row Store
Row1: US, Alpha, 3000
Row2: US, Beta, 1250
Row3: JP, Alpha, 700
Row4: UK, Alpha, 450

Column Store
Country: US, US, JP, UK
Product: Alpha, Beta, Alpha, Alpha
Sales:   3000, 1250, 700, 450
Example (cont.)
For Column Store: how is the logical structure preserved? Row ID

Column Store
Country: US (Row ID 1), US, JP, UK
Product: Alpha (Row ID 1), Beta, Alpha, Alpha
Sales:   3000 (Row ID 1), 1250, 700, 450
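The idea can be sketched in a few lines of Python. This is a minimal illustration, assuming the Row ID is simply the position of a value inside each column; the function name `get_row` is hypothetical.

```python
# The example table from the slides, as a list of rows (row-store view).
rows = [
    ("US", "Alpha", 3000),
    ("US", "Beta", 1250),
    ("JP", "Alpha", 700),
    ("UK", "Alpha", 450),
]
column_names = ("Country", "Product", "Sales")

# Column-store view: one list per column, all kept in the same row order.
columns = {name: [row[i] for row in rows] for i, name in enumerate(column_names)}

def get_row(row_id):
    """Rebuild a logical row from the columns; Row IDs start at 1."""
    return tuple(columns[name][row_id - 1] for name in column_names)

print(columns["Country"])  # ['US', 'US', 'JP', 'UK']
print(get_row(3))          # ('JP', 'Alpha', 700)
```

An aggregate such as `sum(columns["Sales"])` touches only one list, while `get_row` has to visit every column.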
Data Dictionary
We will work through the pros and cons of each method using an example. Let's look at the following School table:

Family   Father
Smith    Richard
Galway   Stephen
Bush     John
Brown    Jack
Taylor   John
Moore    Peter
Harris   Clark
Taylor   James
2nd Grade
null Eric
3rd Grade
null Alex Roland null David null
4th Grade
Donna null null Donald null Ruth Frank Melissa
5th Grade
null Jeffrey null null Brian Karen Janet null
6th Grade
Kevin null Alexis Susan Larry Laura null Brenda
Timothy null Sandra Jessica Ronald Dennis Shirley Jason null Angela
Again, recall the physical structure discussed earlier and try to answer:
Which storage method enables a faster read?
Can we give a definite answer here?
What are the pros and cons?
Now, a mistake was found in the table's data: David Taylor from 3rd grade is actually in 4th grade. So we need to update the table accordingly:
    update School set 3rd_grade = null, 4th_grade = 'David' where Family = 'Taylor' and Father = 'John' and Mother = 'Ginny'
Again, recall the physical structure discussed earlier and try to answer: which storage method enables a faster update?
Each column sorted by value, with the Row ID of every entry:
Family: Row ID 4 3 2 7 6 1 5 8
Father: Row ID 7 4 8 3 5 6 1 2
Mother (presumably): Row ID 4 3 5 2 8 1 6 7
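The first two Row ID sequences above match the Family and Father columns of the School table when each column is sorted by value. The sketch below reproduces them; the helper name `sorted_with_row_ids` is hypothetical, and ties are assumed to keep their original row order.

```python
# Family and Father columns of the School table, in Row ID order 1..8.
family = ["Smith", "Galway", "Bush", "Brown", "Taylor", "Moore", "Harris", "Taylor"]
father = ["Richard", "Stephen", "John", "Jack", "John", "Peter", "Clark", "James"]

def sorted_with_row_ids(values):
    """Return (value, row_id) pairs sorted by value; Row IDs start at 1."""
    pairs = [(value, row_id) for row_id, value in enumerate(values, start=1)]
    # sorted() is stable, so equal values keep their original row order.
    return sorted(pairs, key=lambda pair: pair[0])

family_row_ids = [row_id for _, row_id in sorted_with_row_ids(family)]
father_row_ids = [row_id for _, row_id in sorted_with_row_ids(father)]

print(family_row_ids)  # [4, 3, 2, 7, 6, 1, 5, 8]
print(father_row_ids)  # [7, 4, 8, 3, 5, 6, 1, 2]
```

Because every column is sorted independently, the Row ID stored next to each value is what ties the values of one logical row back together.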
3rd Grade, sorted by value: Row ID 2 5 7 3 1 4 6 8
4th Grade, sorted by value: Row ID 1 4 7 8 6 2 3 5
3rd Grade after the update, sorted by value: Row ID 2 7 3 5 1 4 6 8
A new family has moved into town and enrolled their children in the school. We want to reflect this with an insert command:
    insert into School values ('Donovan', 'Harry', 'Pamela', null, 'Martha', null, 'Brenda', 'Albert', 'Justin')
How would we implement this action in both methods?
For Column Store, for each column:
- Add the new value
- Re-sort the column, and possibly reorder it, assuming we want the values to be contiguous.
For Row Store, we simply allocate new data pages at the end of the table and write the data there. This should take O(1) time. So the advantage of Row Store is straightforward when inserting new data is involved.
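The difference in insert cost can be sketched as follows. This is a toy model, not HANA's implementation: the class names are hypothetical, the grade column names other than 3rd_grade and 4th_grade are assumed, and nulls are arbitrarily sorted after real values.

```python
import bisect

# Column names are an assumption: the slides name only Family, Father, Mother,
# 3rd_grade and 4th_grade explicitly.
COLUMNS = ["Family", "Father", "Mother", "1st_grade", "2nd_grade",
           "3rd_grade", "4th_grade", "5th_grade", "6th_grade"]

class RowStore:
    """Rows kept in arrival order; an insert is a plain append."""
    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(row)   # amortized O(1)
        return len(self.rows)   # Row ID of the new row (1-based)

class ColumnStore:
    """Each column kept sorted by value, with the Row ID next to every value."""
    def __init__(self):
        self.columns = {name: [] for name in COLUMNS}
        self.row_count = 0

    def insert(self, row):
        self.row_count += 1
        for name, value in zip(COLUMNS, row):
            # Nulls (None) sort after real values; this ordering is illustrative.
            entry = (value is None, value or "", self.row_count)
            # Finding the slot is O(log n), shifting later entries is O(n),
            # and the work is repeated for every column.
            bisect.insort(self.columns[name], entry)
        return self.row_count

row_store, column_store = RowStore(), ColumnStore()
# Existing row: Family, Father, Mother and 3rd_grade come from the slides;
# the remaining cells are None placeholders, not real data.
existing_row = ("Taylor", "John", "Ginny", None, None, "David", None, None, None)
new_row = ("Donovan", "Harry", "Pamela", None, "Martha",
           None, "Brenda", "Albert", "Justin")
for row in (existing_row, new_row):
    row_store.insert(row)
    column_store.insert(row)

# Row IDs of the Family column in sorted order: Donovan (2) now precedes Taylor (1).
print([row_id for _, _, row_id in column_store.columns["Family"]])  # [2, 1]
```

The row store only appended at the end, while the column store had to find and open a slot in each of the nine columns.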
So when does Column Store have a clear-cut advantage over Row Store?
- Calculations are typically executed on a single column or only a few columns
- The table is searched based on the values of a few columns
- The table has a large number of columns
- The table has a large number of rows and columnar operations are required (aggregate, scan, etc.)
- High compression rates can be achieved because the majority of the columns contain only a few distinct values (compared to the number of rows)
- Elimination of indexes
- Parallelization
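The compression point can be illustrated with dictionary encoding, a common column-store technique. This is a minimal sketch that makes no claim about HANA's exact on-disk or in-memory format; the function name `dictionary_encode` is hypothetical.

```python
def dictionary_encode(values):
    """Replace each value by an integer Value ID into a sorted dictionary."""
    dictionary = sorted(set(values))
    value_id = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [value_id[v] for v in values]

# Country column of the earlier example table.
country = ["US", "US", "JP", "UK"]
dictionary, encoded = dictionary_encode(country)

print(dictionary)  # ['JP', 'UK', 'US']
print(encoded)     # [2, 2, 0, 1]

# Bits needed per encoded value: enough to distinguish the distinct values.
bits_per_value = max(1, (len(dictionary) - 1).bit_length())
print(bits_per_value)  # 2

# Decoding is a plain lookup, so no information is lost.
decoded = [dictionary[i] for i in encoded]
```

The fewer distinct values a column has relative to its row count, the smaller each encoded entry becomes, which is the effect the compression bullet describes.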
Row Store tables are better when:
- The application needs to process only a single record at a time (many selects and/or updates of single records)
- The application typically needs to access the complete record
- The columns contain mainly distinct values, so the compression rate would be low
- Neither aggregations nor fast searching are required
- The table has a small number of rows (for example, configuration tables)
So we saw that inserting a new row (and sometimes updating one) is a very expensive operation for Column Store. What do we do to ease the pain?
A write operation (insert or update) in Column Store does not directly modify the compressed data, but goes into a separate area called the Delta Storage.
The changes are taken over from the delta storage asynchronously at some later point in time. This operation is called Delta Merge.
The Delta Merge operation integrates the committed changes collected in delta storage into main storage.
The following steps are taken when a write operation occurs:
Delta Merge
Executed at table level when:
- The number of lines in delta storage for this table exceeds a specified number
- The memory consumption of delta storage exceeds a specified limit
- The merge is triggered explicitly by a client using SQL
- The delta log for a columnar table exceeds the defined limit. As the delta log is truncated only during a merge operation, a merge needs to be performed in this case.
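The interplay of delta storage and delta merge can be sketched with a toy single-column table. This is only an illustration under simplifying assumptions: the class name, the `max_delta_rows` threshold and its value are hypothetical, the merge runs synchronously instead of asynchronously, and transactions, the delta log and the memory-based trigger are left out.

```python
class ColumnTable:
    """Toy single-column table: sorted main storage plus an append-only delta."""

    def __init__(self, max_delta_rows=3):
        self.main = []    # read-optimized: kept sorted
        self.delta = []   # write-optimized: plain appends
        self.max_delta_rows = max_delta_rows  # hypothetical threshold

    def insert(self, value):
        self.delta.append(value)  # writes never touch main storage directly
        if len(self.delta) > self.max_delta_rows:
            self.merge()          # trigger: too many lines in delta storage

    def merge(self):
        """Delta merge: fold the delta into main storage and empty the delta."""
        self.main = sorted(self.main + self.delta)
        self.delta = []

    def scan(self):
        """Reads have to consider both storages."""
        return self.main + self.delta

table = ColumnTable(max_delta_rows=3)
for name in ["Smith", "Galway", "Bush"]:
    table.insert(name)
print(table.main, table.delta)  # [] ['Smith', 'Galway', 'Bush']

table.insert("Brown")           # fourth line exceeds the limit and triggers a merge
print(table.main, table.delta)  # ['Brown', 'Bush', 'Galway', 'Smith'] []
```

Calling `table.merge()` directly corresponds to the case where a client triggers the merge explicitly.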
Data Compression
Indirect Encoding
Can calculate inner joins, right outer joins, left outer joins, and full outer joins.
Limited to equi-joins only.
Following is a join example (using Value ID):
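As a rough illustration of joining on Value IDs, the sketch below performs an inner equi-join between two dictionary-encoded columns. It is not taken from the slides and does not claim to describe HANA's join engine: the second table (`region_country`) is hypothetical data, and all function and variable names are made up for this example.

```python
def dictionary_encode(values):
    """Return a sorted dictionary and the Value ID of each entry."""
    dictionary = sorted(set(values))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[v] for v in values]

sales_country = ["US", "US", "JP", "UK"]   # Country column from the earlier example
region_country = ["JP", "US", "DE"]        # hypothetical second table

left_dict, left_ids = dictionary_encode(sales_country)
right_dict, right_ids = dictionary_encode(region_country)

# Translate left Value IDs to right Value IDs once per distinct value.
right_id_of = {value: i for i, value in enumerate(right_dict)}
translate = {lid: right_id_of.get(value) for lid, value in enumerate(left_dict)}

# Row positions in the right column, grouped by right Value ID.
right_rows = {}
for row, rid in enumerate(right_ids):
    right_rows.setdefault(rid, []).append(row)

# Inner equi-join: per row only integers are compared; strings were
# compared once per distinct value while building `translate`.
matches = [
    (left_row + 1, right_row + 1)          # 1-based Row IDs
    for left_row, lid in enumerate(left_ids)
    for right_row in right_rows.get(translate[lid], [])
]
print(matches)  # [(1, 2), (2, 2), (3, 1)]
```

UK has no partner in the second table, so row 4 drops out of the inner join; an outer join would keep it with nulls on the other side.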
Thank You!