0% found this document useful (0 votes)
173 views64 pages

Vertica Column-vs-Row

The document compares and contrasts column-oriented and row-oriented database storage approaches. It discusses how column-oriented databases store data by columns on disk rather than by rows, allowing them to more efficiently retrieve only the relevant columns for a query. It also examines ways row-oriented databases can emulate some performance benefits of column-oriented databases through vertical partitioning, index-only plans, and materialized views, but with certain disadvantages. The document also outlines optimizations specific to column-oriented databases like compression, late materialization, block iteration, and invisible joins.

Uploaded by

surredd
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
173 views64 pages

Vertica Column-vs-Row

The document compares and contrasts column-oriented and row-oriented database storage approaches. It discusses how column-oriented databases store data by columns on disk rather than by rows, allowing them to more efficiently retrieve only the relevant columns for a query. It also examines ways row-oriented databases can emulate some performance benefits of column-oriented databases through vertical partitioning, index-only plans, and materialized views, but with certain disadvantages. The document also outlines optimizations specific to column-oriented databases like compression, late materialization, block iteration, and invisible joins.

Uploaded by

surredd
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 64

Column-Stores vs.

Row-Stores
How Different are they Really?
Daniel J. Abadi, Samuel Madden, and Nabil Hachem,
SIGMOD 2008

Presented By,

Paresh
Modak(09305060)
Souman
Mandal(09305066)

Contents

Column-store introduction
Column-store data model
Emulation of Column Store in Row store
Column store optimization
Experiment and Results
Conclusion

Row Store and Column Store

Figure taken form [2]

In row store data are stored in the disk tuple


by tuple.
Where in column store data are stored in the
disk column by column

Row Store and Column Store

Most of the queries does not process all the attributes of


a particular relation.

For example the query


Select c.name and c.address
From CUSTOMES as c
Where c.region=Mumbai;

Only process three attributes of the relation CUSTOMER.


But the customer relation can have more than three
attributes.

Column-stores are more I/O efficient for read-only queries


as they read, only those attributes which are accessed by
a query.
4

Row Store and Column Store


Row Store

Column Store

(+) Easy to add/modify a record

(+) Only need to read in relevant


data

(-) Might read in unnecessary


data

(-) Tuple writes require multiple


accesses

So column stores are suitable for read-mostly,


read-intensive, large data repositories

Why Column Stores?

Can be significantly faster than row stores for some


applications

But can be slower for other applications

Fetch only required columns for a query


Better cache effects
Better compression (similar attribute values within a
column)
OLTP with many row inserts, ..

Long war between the column store and row store


camps :-)

This paper tries to give a balanced picture of advantages


and disadvantages, after adding/ subtracting a number
of optimizations for each approach

Column Stores - Data Model

Standard relational logical data model

EMP(name, age, salary, dept)


DEPT(dname, floor)

Table collection of projections


Projection set of columns
Horizontally partitioned into segments with
segment identifier

Column Stores - Data Model

To answer queries, projections are joined using


Storage keys and join indexes
Storage Keys:

Within a segment, every data value of every


column is associated with a unique Skey
Values from different columns with matching Skey
belong to the same logical row

Column Stores Data Model

Join Indexes

T1 and T2 are projections on T


M segments in T1 and N segments in T2
Join Index from T1 to T2 is a table of the form:

(s: Segment ID in T2, k: Storage key in Segment s)


Each row in join index matches corresponding row in T1

Join indexes are built such that T could be


efficiently reconstructed from T1 and T2

Column Stores Data Model

Construct EMP(name, age, salary) from EMP1


and EMP3 using join index on EMP3

10

Compression

Trades I/O for CPU

Increased column-store opportunities:


Higher data value locality in column stores
Techniques such as run length encoding far more
useful

Schemes

11

Null Suppression
Dictionary encoding
Run Length encoding
Bit-Vector encoding
Heavyweight schemes

Query Execution - Operators

Select: Same as relational algebra, but


produces a bit string
Project: Same as relational algebra
Join: Joins projections according to predicates
Aggregation: SQL like aggregates
Sort: Sort all columns of a projection

12

Query Execution - Operators

Decompress: Converts compressed column


to uncompressed representation
Mask(Bitstring B, Projection Cs) => emit only
those values whose corresponding bits are 1
Concat: Combines one or more projections
sorted in the same order into a single
projection
Permute: Permutes a projection according to
the ordering defined by a join index
Bitstring operators: Band Bitwise AND,
Bor Bitwise OR, Bnot complement

13

Row Store Vs Column Store

How much of the buzz around column-stores


is marketing hype?

Do you really need to buy Sybase IQ or Vertica?

How far will your current row-store take you?


Can you get column-store performance from a
row-store?
Can you simulate a column-store in a rowstore?

14

Row Store Vs Column Store

Now the simplistic view about the difference


in storage layout leads to that one can obtain
the performance benefits of a column-store
using a row-store by making some changes to
the physical structure of the row store.

This changes can be

15

Vertically partitioning
Using index-only plans
Using materialized views

Vertical Partitioning

Process:

Full Vertical partitioning of each relation

Each column =1 Physical table


This can be achieved by adding integer position column
to every table
Adding integer position is better than adding primary key

Join on Position for multi column fetch

Problems:

Position - Space and disk bandwidth


Header for every tuple further space wastage

16

e.g. 24 byte overhead in PostgreSQL

Vertical Partitioning: Example

17

Index-only plans

Process:

Add B+Tree index for every Table.column


Plans never access the actual tuples on disk
Headers are not stored, so per tuple overhead is less

Problem:

Separate indices may require full index scan, which


is slower
Eg: SELECT AVG(salary)
FROM emp
WHERE age > 40

18

Composite index with (age, salary) key helps.

Index-only plans: Example

19

Materialized Views

Process:

Create optimal' set of MVs for given query workload


Objective:

Provide just the required data


Avoid overheads
Performs better

Expected to perform better than other two


approach

Problems:

20

Practical only in limited situation


Require knowledge of query workloads in advance

Materialized Views: Example

Select F.custID
from Facts as F
where F.price>20

21

Optimizing Column oriented Execution

Different optimization for column oriented


database

22

Compression
Late Materialization
Block Iteration
Invisible Join

Compression

Low information entropy (high data value


locality) leads to High compression ratio
Advantage

Disk Space is saved


Less I/O
CPU cost decrease if we can perform operation
without decompressing

Light weight compression schemes do better

23

Compression

If data is sorted on one column that column


will be super-compressible in row store

eg. Run length encoding

Figure taken form [2]


24

Late Materialization

Most query results entity-at-a-time not


column-at-a-time
So at some point of time multiple column
must be combined
One simple approach is to join the columns
relevant for a particular query
But further performance can be improve using
late-materialization

25

Late Materialization

Delay Tuple Construction


Might avoid constructing it altogether
Intermediate position lists might need to be
constructed
Eg: SELECT R.a FROM R WHERE R.c = 5 AND
R.b = 10

26

Output of each predicate is a bit string


Perform Bitwise AND
Use final position list to extract R.a

Late Materialization

Advantages

27

Unnecessary construction of tuple is avoided


Direct operation on compressed data
Cache performance is improved (PAX)

N-ary storage model (NSM)

28

Decomposition Storage Model(DSM)

High degree of
spatial locality for
sequential access
of single attribute
Performance
deteriorates
significantly for
queries that
involve multiple
attributes

29

PAX - Partition Attributes Across*

Cache utilization and performance is very


important
In-page data placement is the key to high cache
performance
PAX groups together all values of each attribute
within each page
Only affects layout inside the pages, incurs no
storage penalty and does not affect I/O behavior
Exhibits superior cache performance and utilization
over traditional methods
* A. Ailamaki, D. J. DeWitt, et. al. Weaving relations for cache
performance VLDB, 2001.
30

PAX Model

Maximizes interrecord spatial


locality within each
column in the page
Incurs minimal
record
reconstruction cost
Orthogonal to other
design decisions
because it only
affects the layout of
data stored on a
single page
31

Block Iteration

Operators operate on blocks of tuples at once


Iterate over blocks rather than tuples
Like batch processing
If column is fixed width, it can be operated as
an array
Minimizes per-tuple overhead
Exploits potential for parallelism
Can be applied even in Row stores IBM DB2
implements it

32

Star Schema Benchmark

SSBM is a data warehousing benchmark


derived from TPC-H
It consist of a single fact table LINE-ORDER
There are four dimension table.

CUSTOMER
PART
SUPPLIER
DATE

LINEORDER table consist of 60,000,000 tuples


SSBM consist of thirteen queries divided into
four category

33

Star Schema Benchmark

34

Figure taken
form [1]

Invisible Join

Queries over data warehouse (particularly


modeled with star schema) often have
following structure

Restrict set of tuple in the fact table using


selection predicates on dimension table
Perform some aggregation on the restricted fact
table
Often grouping by other dimension table attribute

For each selection predicate and for each


aggregate grouping join between fact table
and dimension table is required

35

Invisible Join

36

Find Total revenue from Asian customers who purchase a prod


supplied by an Asian supplier between 1992 and 1997 groupe
nation of the customer, supplier and year of transaction

Invisible Join

Traditional plan for this type of query is to


pipeline join in order of predicate selectivity
Alternate plan is late materialized join
technique
But both have disadvantages

37

Traditional plan lacks all the advantages described


previously of late materialization
In the late materialized join technique group by
columns need to be extracted in out-of-position
order

Invisible Join

Invisible join is a late materialized join but


minimize the values that need to be extracted
out of order
Invisible join

38

Rewrite joins into predicates on the foreign key


columns in the fact table
These predicates evaluated either by hash-lookup
Or by between-predicate rewriting

Invisible Join

Phase 1
39

Figure taken form [1]

Invisible Join

40

Phase
2

Figure taken form [1]

Invisible Join

41

Phase
3

Figure taken form [1]

Invisible Join

Between-Predicate rewriting

42

Use of range predicates instead of hash lookup in


phase 1
Useful if contiguous set of keys are valid after
applying a predicate
Dictionary encoding for key reassignment if not
contiguous
Query optimizer is not altered. Predicate is
rewritten at runtime

Between-predicate rewriting

Figure taken form [2]


43

Experiments

Goal

44

Comparison of attempts to emulate a column


store in a row-store with baseline performance of
C-Store
Is it possible for an unmodified row-store to obtain
the benefits of column oriented design
Effect of different optimization technique in
column-store

Experiment setup

Environment

2.8GHz Dual Core Pentium(R) workstation


3 GB RAM
RHEL 5
4 disk array mapped as a single logical volume
Reported numbers are average of several runs
Warm buffer (30% improvement for both systems)

45

Data read exceeds the size of buffer pool

C-Store Vs Commercial row oriented DB

Figure taken form [1]

RS: Base System X

CS: Base C-Store case

RS (MV): System X with optimal collection of MVs

System
X= Commercial
CS (Row-MV): Column store constructed from
RS(MV)
row oriented database

46

Results and Analysis

From the graph we can see

C-Store out performs System X by a

Factor of six in the base case


Factor of three when System x use materialized view

However CS (Row-MV) perform worse than RS


(MV)

System X provide advance performance feature


C-Store has multiple known performance
bottleneck

47

C-Store doesn't support Partitioning, multithreading

Column Store simulation in Row Store

Partitioning improves the performance of row


store if done on a predicate of the query
Authors found it improve the speed by a factor of
two
System X implement star join
Optimizer will bloom filters if it feels necessary
Other configuration parameter

48

32 KB disk pages
1.5 GB maximum memory for sort joins, intermediate
result
500 MB buffer pool

Different configuration of System X

Experimented with five different configuration


1.
2.
3.
4.
5.

49

Traditional row oriented representation with


bitmap and bloom filter
Traditional (bitmap): Biased to use bitmaps;
might be inferior sometimes
Vertical Partitioning: Each column is a relation
Index-Only: B+Tree on each column
Materialized Views: Optimal set of views for
every query

Different configuration of System X

Figure taken form [1]

50

Different configuration of System X

Figure taken form [1]

51

Different configuration of System X


T Traditional

T(B)
Traditional(bitmap)

MV materialized
views

VP vertical
partitioning
Better
AI Allperformance
indexes

of
traditional system is
because of
partitioning.
Partitioning on
52

Figure taken form [1]

Different configuration of System X

Materialized view performs best


Index only plans are the worst
Expensive column joins on fact table

53

System X use hash join by default


Nested loop join, merge join also doesnt help

Column Store simulation in Row Store:


Analysis

Tuple overheads:

LineOrder Table 60 million tuples, 17 columns


Compressed data
8 bytes of over head per row
4 bytes of record-id
1 column

Whole table

RS

0.7-1.1 GB

4 GB

CS

240 MB

2.3 GB

For SSBM scale 10 lineorder table


54

Column Store simulation in Row Store:


Analysis

Query selectivity is 8.0x10-3

55

Method

Time

Traditional

43

Vertical partitioning

65

Index-only plans

360

Column Store simulation in Row Store:


Analysis

Traditional

Vertical Partitioning

Scans the entire lineorder table


Hash joins with dwdate, part and supplier
Hash-joins partkey column with the filtered part table
Hash-joins suppkey column with filtered supplier table
Hash join the result of the two above join

Index-Only plans

56

Access all columns through unclustred B+Tree


indexes

Column Store Performence

Column Store perform better than the best


case of row store (4.0sec Vs 10.2sec)
Though they access the same amount of I/O is
similar

57

Tuple overhead and Join costs

Row Store

Column Store

Store the record-id explicitly

Dont explicitly Store the


record-id

Headers are stored with each


column

Header are stored in separate


column

Use index-based merge join

Use merge join

This differences are not fundamental

58

Breakdown of Column-Store Advantages

Block processing improves the performance


by a factor of 5% to 50%
Compression improves the performance by
almost a factor of two on avg
Late materialization improves performance by
almost a factor of three
Invisible join improves the performance by 5075%

59

Breakdown of Column-Store Advantages

Figure taken form [1]

T=tuple-at-a-time processing,; t=block processing; I=invisible


join enabled; i=disabled; C=compression enabled,
c=disabled; L=late materialization enabled; l=disabled;
60

Conclusion

To emulate column store in row store, techniques like

Vertical portioning
Index only plan

does not yield good performance


High per-tuple overheads, high tuple reconstruction
cost are the reason
Where in column store

Late materialization
Compression
Block iteration
Invisible join

are the reason for good performance


61

Conclusion

Successful emulation column store in row


store require

62

Virtual record-ids
Reduced tuple over head
Fast merge join
Run length encoding across multiple tuples
Operating directly on compressed data
Block processing
Invisible join
Late materialization

References
1.

Column-stores vs. row-stores: how


different are they really? Daniel J. Abadi,
Samuel Madden, Nabil Hachem: SIGMOD
Conference 2008: 967-980

2.

Column-Oriented Database Systems, VLDB


2009 Tutorial; Stavros Harizopoulos, Daniel
Abadi, Peter Boncz

63

Thank
You!

64

You might also like