Vertica Column-vs-Row

The document compares and contrasts column-oriented and row-oriented database storage approaches. It discusses how column-oriented databases store data by columns on disk rather than by rows, allowing them to more efficiently retrieve only the relevant columns for a query. It also examines ways row-oriented databases can emulate some performance benefits of column-oriented databases through vertical partitioning, index-only plans, and materialized views, but with certain disadvantages. The document also outlines optimizations specific to column-oriented databases like compression, late materialization, block iteration, and invisible joins.


Column-Stores vs. Row-Stores
How Different Are They Really?
Daniel J. Abadi, Samuel Madden, and Nabil Hachem,
SIGMOD 2008

Presented by

Paresh Modak (09305060)
Souman Mandal (09305066)

Contents

Column-store introduction
Column-store data model
Emulation of Column Store in Row store
Column store optimization
Experiment and Results
Conclusion

Row Store and Column Store

Figure taken from [2]

In a row store, data is stored on disk tuple by tuple.
In a column store, data is stored on disk column by column.

Row Store and Column Store

Most queries do not process all the attributes of a particular relation.

For example, the query

SELECT c.name, c.address
FROM CUSTOMER AS c
WHERE c.region = 'Mumbai';

processes only three attributes of the relation CUSTOMER (name, address and region).

But the CUSTOMER relation can have more than three attributes.

Column-stores are more I/O efficient for read-only queries because they read only those attributes that a query accesses.
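The I/O argument can be made concrete with a small back-of-the-envelope calculation. The sketch below is only illustrative: the CUSTOMER attribute names other than name, address and region, and all byte widths, are assumptions and are not taken from the paper.

```python
# Hypothetical CUSTOMER schema: attribute -> width in bytes (assumed values).
CUSTOMER_WIDTHS = {
    "custkey": 4, "name": 25, "address": 40, "city": 10,
    "nation": 15, "region": 12, "phone": 15, "mktsegment": 10,
}


def bytes_read_row_store(widths, num_tuples):
    """A row store reads whole tuples, so every attribute is fetched."""
    return num_tuples * sum(widths.values())


def bytes_read_column_store(widths, num_tuples, accessed):
    """A column store reads only the columns that the query touches."""
    return num_tuples * sum(widths[attr] for attr in accessed)


accessed = ["name", "address", "region"]  # attributes used by the query
num_tuples = 1_000_000
row_bytes = bytes_read_row_store(CUSTOMER_WIDTHS, num_tuples)
col_bytes = bytes_read_column_store(CUSTOMER_WIDTHS, num_tuples, accessed)
print(row_bytes, col_bytes)
```

The gap grows with the number of attributes the query does not touch; it ignores compression and per-tuple headers, which are discussed later.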

Row Store and Column Store

Row Store
(+) Easy to add/modify a record
(-) Might read in unnecessary data

Column Store
(+) Only need to read in relevant data
(-) Tuple writes require multiple accesses

So column stores are suitable for read-mostly, read-intensive, large data repositories

Why Column Stores?

Can be significantly faster than row stores for some applications:
Fetch only required columns for a query
Better cache effects
Better compression (similar attribute values within a column)

But can be slower for other applications:
OLTP with many row inserts, ..

Long war between the column store and row store camps :-)

This paper tries to give a balanced picture of the advantages and disadvantages, after adding/subtracting a number of optimizations for each approach

Column Stores - Data Model

Standard relational logical data model

EMP(name, age, salary, dept)

DEPT(dname, floor)

Table = collection of projections
Projection = set of columns
Horizontally partitioned into segments, each with a segment identifier

Column Stores - Data Model

To answer queries, projections are joined using storage keys and join indexes

Storage keys:

Within a segment, every data value of every column is associated with a unique Skey
Values from different columns with matching Skey belong to the same logical row

Column Stores Data Model

Join Indexes

T1 and T2 are projections on T

M segments in T1 and N segments in T2
Join Index from T1 to T2 is a table of the form:

(s: Segment ID in T2, k: Storage key in Segment s)

Each row in join index matches corresponding row in T1

Join indexes are built such that T could be

efficiently reconstructed from T1 and T2

Column Stores Data Model

Construct EMP(name, age, salary) from EMP1 and EMP3 using join index on EMP3
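The reconstruction can be sketched in a few lines. This is a minimal illustration, assuming EMP1 holds (name, age) sorted by age and EMP3 holds (name, salary) sorted by salary; the sample values, the segment split, and the helper name reconstruct_emp are all made up for the example.

```python
# EMP1(name, age): sorted by age, kept in a single segment (assumed data).
EMP1 = [("Jill", 24), ("Bob", 25), ("Bill", 27)]

# EMP3(name, salary): sorted by salary and split into two segments.
# The storage key of a value is its position within its segment.
EMP3 = {
    0: [("Bob", 10_000), ("Bill", 50_000)],
    1: [("Jill", 80_000)],
}

# Join index from EMP1 to EMP3: entry i tells where row i of EMP1 lives in
# EMP3, as (segment ID in EMP3, storage key within that segment).
JOIN_INDEX = [(1, 0), (0, 0), (0, 1)]


def reconstruct_emp(emp1, emp3, join_index):
    """Rebuild EMP(name, age, salary) by following the join index."""
    rows = []
    for (name, age), (segment_id, skey) in zip(emp1, join_index):
        other_name, salary = emp3[segment_id][skey]
        # Both projections must agree on the shared column.
        assert other_name == name
        rows.append((name, age, salary))
    return rows


emp = reconstruct_emp(EMP1, EMP3, JOIN_INDEX)
print(emp)
```

The result follows the sort order of EMP1, since the join index has one entry per EMP1 row.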

Compression

Trades I/O for CPU

Increased column-store opportunities:

Higher data value locality in column stores
Techniques such as run length encoding far more
useful

Schemes

Null Suppression
Dictionary encoding
Run Length encoding
Bit-Vector encoding
Heavyweight schemes
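Two of the lightweight schemes listed above can be sketched as follows. This is a simplified illustration rather than C-Store's actual on-disk format; the function names and the sample region column are assumptions.

```python
def dictionary_encode(column):
    """Replace each value by a small integer code into a sorted dictionary."""
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in column]


def bit_vector_encode(column):
    """Keep one bit vector per distinct value; bit i is 1 if row i has it."""
    vectors = {}
    for position, value in enumerate(column):
        vectors.setdefault(value, [0] * len(column))[position] = 1
    return vectors


region = ["ASIA", "EUROPE", "ASIA", "AFRICA", "ASIA"]
dictionary, codes = dictionary_encode(region)
vectors = bit_vector_encode(region)
print(dictionary, codes)
print(vectors["ASIA"])
```

Both schemes pay off when a column has few distinct values, which is more common within a single column than across a whole tuple.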

Query Execution - Operators

Select: Same as relational algebra, but

produces a bit string
Project: Same as relational algebra
Join: Joins projections according to predicates
Aggregation: SQL like aggregates
Sort: Sort all columns of a projection

Query Execution - Operators

Decompress: Converts compressed column

to uncompressed representation
Mask(Bitstring B, Projection Cs) => emit only
those values whose corresponding bits are 1
Concat: Combines one or more projections
sorted in the same order into a single
projection
Permute: Permutes a projection according to
the ordering defined by a join index
Bitstring operators: Band (bitwise AND), Bor (bitwise OR), Bnot (complement)
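A toy version of these operators shows how they compose. The Python function names mirror the operator names on the slides, but the implementations and the sample columns are assumptions; bit strings are modeled as plain lists of 0/1.

```python
def select(column, predicate):
    """Select: evaluate a predicate on one column and emit a bit string."""
    return [1 if predicate(value) else 0 for value in column]


def band(a, b):
    """Band: bitwise AND of two bit strings."""
    return [x & y for x, y in zip(a, b)]


def bor(a, b):
    """Bor: bitwise OR of two bit strings."""
    return [x | y for x, y in zip(a, b)]


def bnot(a):
    """Bnot: complement of a bit string."""
    return [1 - x for x in a]


def mask(bitstring, column):
    """Mask: emit only the values whose corresponding bit is 1."""
    return [value for bit, value in zip(bitstring, column) if bit]


name = ["Jill", "Bob", "Bill", "Ann"]
age = [24, 25, 27, 41]
salary = [80, 10, 50, 60]

older = select(age, lambda v: v > 24)
well_paid = select(salary, lambda v: v >= 50)
print(mask(band(older, well_paid), name))
print(mask(bnot(bor(older, well_paid)), name))
```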

Row Store Vs Column Store

How much of the buzz around column-stores

is marketing hype?

Do you really need to buy Sybase IQ or Vertica?

How far will your current row-store take you?

Can you get column-store performance from a row-store?
Can you simulate a column-store in a row-store?

Row Store Vs Column Store

The simplistic view that the two differ only in storage layout suggests that one can obtain the performance benefits of a column-store from a row-store by making some changes to the physical structure of the row store.

These changes can be

Vertical partitioning
Using index-only plans
Using materialized views

Vertical Partitioning

Process:

Full vertical partitioning of each relation

Each column = 1 physical table
This can be achieved by adding an integer position column to every table
Adding an integer position is better than adding the primary key
Join on position for multi-column fetch

Problems:

The position column costs space and disk bandwidth
A header for every tuple wastes further space
e.g. 24-byte overhead in PostgreSQL
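The scheme can be tried out in any row store. The sketch below uses Python's built-in sqlite3 module purely as a convenient stand-in for a row-oriented engine (the paper used a commercial system); the emp relation, its data, and the table names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One physical table per column of a hypothetical emp(name, age, salary),
# each carrying an integer position column.
cur.execute("CREATE TABLE emp_name (pos INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE emp_age (pos INTEGER PRIMARY KEY, age INTEGER)")
cur.execute("CREATE TABLE emp_salary (pos INTEGER PRIMARY KEY, salary INTEGER)")

rows = [("Jill", 24, 80000), ("Bob", 45, 10000), ("Bill", 52, 50000)]
for pos, (name, age, salary) in enumerate(rows):
    cur.execute("INSERT INTO emp_name VALUES (?, ?)", (pos, name))
    cur.execute("INSERT INTO emp_age VALUES (?, ?)", (pos, age))
    cur.execute("INSERT INTO emp_salary VALUES (?, ?)", (pos, salary))

# A multi-column fetch has to join the column tables on position.
cur.execute(
    """
    SELECT n.name, s.salary
    FROM emp_age AS a
    JOIN emp_name AS n ON n.pos = a.pos
    JOIN emp_salary AS s ON s.pos = a.pos
    WHERE a.age > 40
    ORDER BY a.pos
    """
)
result = cur.fetchall()
print(result)
conn.close()
```

Every column table repeats the position value and the engine's per-tuple header, which is the space problem listed above.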

Vertical Partitioning: Example

Index-only plans

Process:

Add a B+Tree index for every Table.column
Plans never access the actual tuples on disk
Headers are not stored, so per-tuple overhead is lower

Problem:

Separate indices may require a full index scan, which is slower
E.g.: SELECT AVG(salary)
FROM emp
WHERE age > 40

A composite index with an (age, salary) key helps.
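Why the composite key helps can be seen with a sorted list standing in for the leaf level of a B+Tree. This is a simplified model, not how any particular database implements indexes; the emp data and the function name are assumptions.

```python
from bisect import bisect_right

# Hypothetical emp(name, age, salary) tuples.
emp = [
    ("Jill", 24, 80000),
    ("Bob", 45, 10000),
    ("Bill", 52, 50000),
    ("Ann", 41, 60000),
]

# Composite index on (age, salary): sorted keys, no pointer to the base tuple
# is needed because the query only uses these two attributes.
index_age_salary = sorted((age, salary) for _, age, salary in emp)


def avg_salary_older_than(index, min_age):
    """Answer SELECT AVG(salary) ... WHERE age > min_age from the index alone."""
    # (min_age, inf) sorts after every key with age == min_age, so the slice
    # starts at the first entry with age > min_age: a range scan, not a full scan.
    start = bisect_right(index, (min_age, float("inf")))
    salaries = [salary for _, salary in index[start:]]
    return sum(salaries) / len(salaries) if salaries else None


print(avg_salary_older_than(index_age_salary, 40))
```

With two separate single-column indexes, the age index would yield record-ids only, and the salary index would have to be scanned in full to match them.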

Index-only plans: Example

Materialized Views

Process:

Create an 'optimal' set of MVs for a given query workload

Objective:

Provide just the required data
Avoid overheads
Perform better

Expected to perform better than the other two approaches

Problems:

Practical only in limited situations
Requires knowledge of query workloads in advance

Materialized Views: Example

Select F.custID
from Facts as F
where F.price>20

Optimizing Column oriented Execution

Optimizations for column-oriented databases

Compression
Late Materialization
Block Iteration
Invisible Join

Compression

Low information entropy (high data value locality) leads to a high compression ratio

Advantages

Disk space is saved
Less I/O
CPU cost decreases if operations can be performed without decompressing

Lightweight compression schemes do better

Compression

If data is sorted on one column, that column will be super-compressible in a column store

e.g. run-length encoding

Figure taken from [2]
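A short sketch of run-length encoding shows why sort order matters and how an aggregate can run directly on the compressed form. The quarter column and the helper names are assumptions made for the example.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def count_equal(runs, target):
    """COUNT(*) WHERE col = target, computed without decompressing."""
    return sum(length for value, length in runs if value == target)


sorted_quarter = ["Q1"] * 4 + ["Q2"] * 3 + ["Q3"] * 2
unsorted_quarter = ["Q1", "Q2", "Q1", "Q3", "Q2", "Q1", "Q2", "Q1", "Q3"]

sorted_runs = rle_encode(sorted_quarter)
unsorted_runs = rle_encode(unsorted_quarter)
print(sorted_runs)
print(len(sorted_runs), len(unsorted_runs))
print(count_equal(sorted_runs, "Q2"))
```

The sorted column collapses to one run per distinct value, while the unsorted one gains nothing from the encoding.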

Late Materialization

Most query results are entity-at-a-time, not column-at-a-time
So at some point multiple columns must be combined
One simple approach is to join the columns relevant to a particular query
But performance can be improved further using late materialization

Late Materialization

Delay Tuple Construction

Might avoid constructing it altogether
Intermediate position lists might need to be
constructed
E.g.: SELECT R.a FROM R WHERE R.c = 5 AND R.b = 10

Output of each predicate is a bit string
Perform bitwise AND
Use the final position list to extract R.a
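The example query can be traced on a tiny relation, comparing an early-materialization plan with the late-materialization steps listed above. The contents of R and the two function names are assumptions; the point of the sketch is only the number of tuples each plan has to construct.

```python
# Relation R stored column by column (made-up values).
R = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [10, 10, 7, 10, 10, 3],
    "c": [5, 9, 5, 5, 2, 5],
}


def early_materialization(r):
    """Stitch every row into a tuple first, then filter and project."""
    tuples = list(zip(r["a"], r["b"], r["c"]))
    result = [a for a, b, c in tuples if c == 5 and b == 10]
    return result, len(tuples)


def late_materialization(r):
    """Filter each column on its own and touch R.a only at the end."""
    bits_c = [1 if value == 5 else 0 for value in r["c"]]
    bits_b = [1 if value == 10 else 0 for value in r["b"]]
    both = [x & y for x, y in zip(bits_c, bits_b)]  # bitwise AND
    positions = [i for i, bit in enumerate(both) if bit]
    result = [r["a"][i] for i in positions]
    return result, positions


early_result, tuples_built = early_materialization(R)
late_result, positions = late_materialization(R)
print(early_result, tuples_built)
print(late_result, positions)
```

Both plans return the same answer, but the late plan reads only two values of R.a and never builds a full tuple.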

Late Materialization

Advantages

Unnecessary construction of tuple is avoided

Direct operation on compressed data
Cache performance is improved (PAX)

N-ary storage model (NSM)

Decomposition Storage Model (DSM)

High degree of spatial locality for sequential access of a single attribute
Performance deteriorates significantly for queries that involve multiple attributes

PAX - Partition Attributes Across*

Cache utilization and performance is very important
In-page data placement is the key to high cache performance
PAX groups together all values of each attribute within each page
Only affects layout inside the pages, incurs no storage penalty and does not affect I/O behavior
Exhibits superior cache performance and utilization over traditional methods

* A. Ailamaki, D. J. DeWitt, et al. Weaving relations for cache performance. VLDB, 2001.

PAX Model

Maximizes inter-record spatial locality within each column in the page
Incurs minimal record reconstruction cost
Orthogonal to other design decisions because it only affects the layout of data stored on a single page

Block Iteration

Operators operate on blocks of tuples at once

Iterate over blocks rather than tuples
Like batch processing
If column is fixed width, it can be operated as
an array
Minimizes per-tuple overhead
Exploits potential for parallelism
Can be applied even in row stores; IBM DB2 implements it
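The difference between the two iteration styles can be sketched by counting how many times an operator is invoked. This is an illustrative model only: the function names, the block size of 256, and the price column are assumptions, and Python's array type stands in for a fixed-width column.

```python
from array import array


def sum_tuple_at_a_time(column):
    """Iterator-style execution: one next() call per value."""
    total, calls = 0, 0
    for value in iter(column):
        total += value
        calls += 1
    return total, calls


def sum_block_at_a_time(column, block_size):
    """Block iteration: one call hands over a whole slice of the array."""
    total, calls = 0, 0
    for start in range(0, len(column), block_size):
        block = column[start:start + block_size]  # fixed-width slice
        total += sum(block)
        calls += 1
    return total, calls


prices = array("i", range(1, 1001))  # fixed-width 'int' column
print(sum_tuple_at_a_time(prices))
print(sum_block_at_a_time(prices, 256))
```

The per-call overhead is paid 1000 times in the first version and 4 times in the second, while the tight loop over each block is the part that can be vectorized or parallelized.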

Star Schema Benchmark

SSBM is a data warehousing benchmark derived from TPC-H
It consists of a single fact table, LINEORDER
There are four dimension tables:

CUSTOMER
PART
SUPPLIER
DATE

The LINEORDER table consists of 60,000,000 tuples
SSBM consists of thirteen queries divided into four categories

Star Schema Benchmark

Figure taken from [1]

Invisible Join

Queries over a data warehouse (particularly one modeled with a star schema) often have the following structure

Restrict the set of tuples in the fact table using selection predicates on dimension tables
Perform some aggregation on the restricted fact table
Often grouping by other dimension table attributes

For each selection predicate and for each aggregate grouping, a join between the fact table and a dimension table is required

Invisible Join

Find total revenue from Asian customers who purchase a product supplied by an Asian supplier between 1992 and 1997, grouped by nation of the customer, supplier, and year of transaction

Invisible Join

The traditional plan for this type of query is to pipeline joins in order of predicate selectivity
An alternative plan is the late materialized join technique
But both have disadvantages

The traditional plan lacks all the previously described advantages of late materialization
In the late materialized join technique, group-by columns need to be extracted in out-of-position order

Invisible Join

Invisible join is a late materialized join that minimizes the values that need to be extracted out of order
Invisible join

Rewrites joins into predicates on the foreign key columns in the fact table
These predicates are evaluated either by hash lookup
or by between-predicate rewriting
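A rough sketch of the hash-lookup variant, split into three phases in the way [1] describes the technique, may help. The tiny star schema, its keys and values, and the variable names are all assumptions, and Python sets and dicts stand in for the hash tables.

```python
# Dimension tables: key -> attributes (made-up data).
customer = {1: ("ASIA", "INDIA"), 2: ("EUROPE", "FRANCE"), 3: ("ASIA", "CHINA")}
supplier = {1: ("ASIA", "JAPAN"), 2: ("AMERICA", "PERU")}
date = {19920101: 1992, 19950101: 1995, 19980101: 1998}

# Fact table stored column by column; positions line up across columns.
fact = {
    "custkey": [1, 2, 3, 1, 3],
    "suppkey": [1, 1, 2, 1, 1],
    "orderdate": [19920101, 19950101, 19950101, 19980101, 19950101],
    "revenue": [100, 200, 300, 400, 500],
}

# Phase 1: apply each predicate to its dimension table, keep matching keys.
cust_keys = {k for k, (region, _) in customer.items() if region == "ASIA"}
supp_keys = {k for k, (region, _) in supplier.items() if region == "ASIA"}
date_keys = {k for k, year in date.items() if 1992 <= year <= 1997}

# Phase 2: probe each foreign key column, producing one bit per fact row,
# then intersect the bit strings.
bitmap = [
    int(c in cust_keys and s in supp_keys and d in date_keys)
    for c, s, d in zip(fact["custkey"], fact["suppkey"], fact["orderdate"])
]
positions = [i for i, bit in enumerate(bitmap) if bit]

# Phase 3: only for surviving positions, fetch the foreign keys, look up the
# grouping attributes in the dimension tables, and aggregate.
totals = {}
for i in positions:
    group = (
        customer[fact["custkey"][i]][1],
        supplier[fact["suppkey"][i]][1],
        date[fact["orderdate"][i]],
    )
    totals[group] = totals.get(group, 0) + fact["revenue"][i]
print(totals)
```

Dimension attributes are looked up only for the rows that survive all predicates, which is what keeps out-of-order extraction small.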

Invisible Join

Phase 1

Figure taken from [1]

Invisible Join

Phase 2

Figure taken from [1]

Invisible Join

Phase 3

Figure taken from [1]

Invisible Join

Between-Predicate rewriting

Use of range predicates instead of hash lookup in phase 1
Useful if a contiguous set of keys is valid after applying a predicate
Dictionary encoding for key reassignment if not contiguous
The query optimizer is not altered; the predicate is rewritten at runtime
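The runtime decision can be sketched as a small helper that inspects the keys that satisfied a dimension predicate. The tuple-based predicate representation and the function names are assumptions made for this illustration, not the paper's implementation.

```python
def rewrite_predicate(matching_keys):
    """Pick a range test when the matching keys form one contiguous run."""
    keys = sorted(set(matching_keys))
    if keys and keys[-1] - keys[0] + 1 == len(keys):
        return ("between", keys[0], keys[-1])
    return ("hash", set(keys))


def evaluate(predicate, fk_column):
    """Apply the rewritten predicate to a fact-table foreign key column."""
    if predicate[0] == "between":
        _, low, high = predicate
        return [1 if low <= key <= high else 0 for key in fk_column]
    _, keys = predicate
    return [1 if key in keys else 0 for key in fk_column]


# Keys 3..8 are contiguous, so a cheap range comparison replaces the lookup.
date_predicate = rewrite_predicate([3, 4, 5, 6, 7, 8])
# Keys 1 and 3 are not contiguous, so the hash lookup is kept.
customer_predicate = rewrite_predicate([1, 3])

print(date_predicate, evaluate(date_predicate, [2, 3, 8, 9, 5]))
print(customer_predicate[0], evaluate(customer_predicate, [1, 2, 3]))
```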

Between-predicate rewriting

Figure taken from [2]

Experiments

Goal

Compare attempts to emulate a column store in a row store against the baseline performance of C-Store
Is it possible for an unmodified row store to obtain the benefits of a column-oriented design?
Effect of different optimization techniques in a column store

Experiment setup

Environment

2.8GHz Dual Core Pentium(R) workstation

3 GB RAM
RHEL 5
4 disk array mapped as a single logical volume
Reported numbers are average of several runs
Warm buffer (30% improvement for both systems)

Data read exceeds the size of buffer pool

C-Store Vs Commercial row oriented DB

Figure taken from [1]

RS: Base System X
CS: Base C-Store case
RS (MV): System X with optimal collection of MVs
CS (Row-MV): Column store constructed from RS (MV)

System X = commercial row-oriented database

Results and Analysis

From the graph we can see

C-Store outperforms System X by a
factor of six in the base case
factor of three when System X uses materialized views

However, CS (Row-MV) performs worse than RS (MV)

System X provides advanced performance features
C-Store has multiple known performance bottlenecks
C-Store doesn't support partitioning or multithreading

Column Store simulation in Row Store

Partitioning improves the performance of a row store if done on a predicate of the query
The authors found it improves speed by a factor of two
System X implements star joins
The optimizer will use bloom filters if it deems them necessary
Other configuration parameters

32 KB disk pages
1.5 GB maximum memory for sorts, joins, intermediate results
500 MB buffer pool

Different configuration of System X

Experimented with five different configurations

1. Traditional: row-oriented representation with bitmap and bloom filter
2. Traditional (bitmap): biased to use bitmaps; might be inferior sometimes
3. Vertical Partitioning: each column is a relation
4. Index-Only: B+Tree on each column
5. Materialized Views: optimal set of views for every query

Different configuration of System X

Figure taken from [1]

Different configuration of System X

Figure taken from [1]

Different configuration of System X

T = Traditional
T(B) = Traditional (bitmap)
MV = Materialized views
VP = Vertical partitioning
AI = All indexes

Better performance of the traditional system is because of partitioning.
Partitioning on

Figure taken from [1]

Different configuration of System X

Materialized views perform best
Index-only plans are the worst
Expensive column joins on the fact table

System X uses hash join by default
Nested loop join and merge join also don't help

Column Store simulation in Row Store:

Analysis

Tuple overheads for the SSBM scale 10 lineorder table (60 million tuples, 17 columns):

Row store (compressed data; 8 bytes of overhead per row, 4 bytes of record-id)
1 column: 0.7-1.1 GB
Whole table: 4 GB

Column store
1 column: 240 MB
Whole table: 2.3 GB
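The single-column figures can be sanity-checked with simple arithmetic. The sketch assumes a 4-byte integer value per row (the slides do not state the column width) and ignores compression and page-level overhead, so it is an estimate rather than a reproduction of the reported numbers.

```python
NUM_TUPLES = 60_000_000   # lineorder rows at SSBM scale 10
HEADER_BYTES = 8          # per-row overhead in the row store
RECORD_ID_BYTES = 4       # explicit record-id per row
VALUE_BYTES = 4           # assumed width of one integer column value


def to_mb(num_bytes):
    """Convert bytes to megabytes (decimal, 10**6 bytes)."""
    return num_bytes / 10**6


# Column store: values only, positions are implicit.
column_store_mb = to_mb(NUM_TUPLES * VALUE_BYTES)

# Vertically partitioned row store: header + record-id + value per row.
row_store_mb = to_mb(NUM_TUPLES * (HEADER_BYTES + RECORD_ID_BYTES + VALUE_BYTES))

print(column_store_mb, row_store_mb, row_store_mb / column_store_mb)
```

Under these assumptions the column store needs 240 MB for one column, and the vertically partitioned table needs four times as much, which lies within the 0.7-1.1 GB range quoted above.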

Column Store simulation in Row Store:

Analysis

Query selectivity is 8.0 x 10^-3

Method                  Time
Traditional
Vertical partitioning
Index-only plans        360

Column Store simulation in Row Store:

Analysis

Traditional

Scans the entire lineorder table
Hash joins with dwdate, part and supplier

Vertical Partitioning

Hash-joins the partkey column with the filtered part table
Hash-joins the suppkey column with the filtered supplier table
Hash-joins the results of the two joins above

Index-Only plans

Access all columns through unclustered B+Tree indexes

Column Store Performance

Column store performs better than the best case of row store (4.0 sec vs. 10.2 sec)
Though the amount of I/O they perform is similar

Tuple overhead and Join costs

Row Store
Stores the record-id explicitly
Headers are stored with each column
Uses index-based merge join

Column Store
Doesn't explicitly store the record-id
Headers are stored in a separate column
Uses merge join

These differences are not fundamental

Breakdown of Column-Store Advantages

Block processing improves performance by 5% to 50%
Compression improves performance by almost a factor of two on average
Late materialization improves performance by almost a factor of three
Invisible join improves performance by 50-75%

Breakdown of Column-Store Advantages

Figure taken from [1]

T = tuple-at-a-time processing, t = block processing; I = invisible join enabled, i = disabled; C = compression enabled, c = disabled; L = late materialization enabled, l = disabled

Conclusion

To emulate a column store in a row store, techniques like

Vertical partitioning
Index-only plans

do not yield good performance

High per-tuple overheads and high tuple reconstruction costs are the reason
Whereas in a column store

Late materialization
Compression
Block iteration
Invisible join

are the reasons for good performance

Conclusion

Successful emulation of a column store in a row store requires

Virtual record-ids
Reduced tuple over head
Fast merge join
Run length encoding across multiple tuples
Operating directly on compressed data
Block processing
Invisible join
Late materialization

References

1. Daniel J. Abadi, Samuel Madden, Nabil Hachem: Column-stores vs. row-stores: how different are they really? SIGMOD Conference 2008: 967-980
2. Stavros Harizopoulos, Daniel Abadi, Peter Boncz: Column-Oriented Database Systems, VLDB 2009 Tutorial

Thank You!
