
File Format Benchmark - Avro, JSON, ORC, & Parquet

Owen O'Malley
[email protected]
@owen_omalley
September 2016

Who Am I?

Worked on Hadoop since Jan 2006
  MapReduce, Security, Hive, and ORC
Worked on different file formats
  Sequence File, RCFile, ORC File, T-File, and Avro requirements


Goal

Seeking to discover unknowns
  How do the different formats perform?
  What could they do better?
  Best part of open source is looking inside!
Use real & diverse data sets
  Over-reliance on similar datasets leads to weakness


The File Formats


Avro

Cross-language file format for Hadoop
Schema evolution was a primary goal
Schema segregated from data
  Unlike Protobuf and Thrift
Row major format
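Not in the original deck: a minimal Java sketch of writing an Avro container file, to make the "schema segregated from data" point concrete. The schema is written once into the file header and rows follow in row-major blocks; the record name, fields, and file name are invented for illustration.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
  public static void main(String[] args) throws Exception {
    // The schema is defined once and stored in the file header,
    // separate from the row-major data blocks that follow.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Ride\",\"fields\":["
        + "{\"name\":\"fare\",\"type\":\"double\"},"
        + "{\"name\":\"vendor\",\"type\":\"string\"}]}");

    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.snappyCodec());   // general-purpose block compression
      writer.create(schema, new File("rides.avro"));

      GenericRecord row = new GenericData.Record(schema);
      row.put("fare", 12.5);
      row.put("vendor", "CMT");
      writer.append(row);                            // rows are appended one at a time
    }
  }
}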


JSON

Serialization format for HTTP & Javascript
Text format with MANY parsers
Schema completely integrated with data
Row major format


ORC

Originally part of Hive to replace RCFile
  Now a top-level Apache project
Schema segregated into footer
Column major format with stripes
Rich type model, stored top-down
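Again not from the deck: a hedged sketch of writing the same two columns with the ORC core Java API (org.apache.orc). The schema ends up in the file footer and values are written column by column into stripes. Field and file names are invented, and exact APIs vary by ORC release.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The schema is stored in the file footer; data is laid out column by column in stripes.
    TypeDescription schema = TypeDescription.fromString("struct<fare:double,vendor:string>");

    Writer writer = OrcFile.createWriter(new Path("rides.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(schema)
            .compress(CompressionKind.ZLIB));   // general-purpose codec on top of RLE & dictionaries

    VectorizedRowBatch batch = schema.createRowBatch();
    DoubleColumnVector fare = (DoubleColumnVector) batch.cols[0];
    BytesColumnVector vendor = (BytesColumnVector) batch.cols[1];

    int row = batch.size++;
    fare.vector[row] = 12.5;
    byte[] vendorBytes = "CMT".getBytes(StandardCharsets.UTF_8);
    vendor.setRef(row, vendorBytes, 0, vendorBytes.length);

    writer.addRowBatch(batch);   // full batches accumulate into stripes
    writer.close();
  }
}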


Parquet

Design based on Google's Dremel paper
Schema segregated into footer
Column major format with stripes
Simpler type model with logical types
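A comparable hedged sketch for Parquet using the parquet-avro bindings (one of several ways to write Parquet, not the benchmark's own code). It reuses the invented Avro schema from above; note that Parquet names its zlib-style codec GZIP.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Ride\",\"fields\":["
        + "{\"name\":\"fare\",\"type\":\"double\"},"
        + "{\"name\":\"vendor\",\"type\":\"string\"}]}");

    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("rides.parquet"))
                 .withSchema(schema)
                 .withCompressionCodec(CompressionCodecName.GZIP)  // Parquet's zlib-style codec
                 .build()) {
      GenericRecord row = new GenericData.Record(schema);
      row.put("fare", 12.5);
      row.put("vendor", "CMT");
      writer.write(row);   // rows are buffered, then flushed column by column per row group
    }
  }
}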


Data Sets


NYC Taxi Data

Every taxi cab ride in NYC from 2009
  Publicly available
  http://tinyurl.com/nyc-taxi-analysis
18 columns with no null values
  Doubles, integers, decimals, & strings
2 months of data - 22.7 million rows


Github Logs

All actions on Github public repositories
  Publicly available
  https://www.githubarchive.org/
704 columns with a lot of structure & nulls
  Pretty much the kitchen sink


Finding the Github Schema

The data is all in JSON.
No schema for the data is published.
We wrote a JSON schema discoverer.
  Scans the document and figures out the type
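The deck does not show the discoverer itself, so the following is only a sketch of the idea using Jackson, with invented names: walk each JSON document and record a type for every field path. A real tool would also merge the types seen across millions of documents and resolve conflicts and nulls.

import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonSchemaSketch {
  // Collect a dotted-path -> type-name map for one document.
  static void discover(String path, JsonNode node, Map<String, String> types) {
    if (node.isObject()) {
      Iterator<Map.Entry<String, JsonNode>> fields = node.fields();
      while (fields.hasNext()) {
        Map.Entry<String, JsonNode> f = fields.next();
        discover(path.isEmpty() ? f.getKey() : path + "." + f.getKey(), f.getValue(), types);
      }
    } else if (node.isArray()) {
      for (JsonNode child : node) {
        discover(path + "[]", child, types);
      }
    } else if (node.isTextual()) {
      types.put(path, "string");
    } else if (node.isIntegralNumber()) {
      types.put(path, "long");
    } else if (node.isNumber()) {
      types.put(path, "double");
    } else if (node.isBoolean()) {
      types.put(path, "boolean");
    } else {
      types.put(path, "null");   // nulls need merging with types seen in other documents
    }
  }

  public static void main(String[] args) throws Exception {
    JsonNode doc = new ObjectMapper().readTree(
        "{\"actor\": {\"login\": \"someone\"}, \"public\": true, \"id\": 12345}");
    Map<String, String> types = new TreeMap<>();
    discover("", doc, types);
    types.forEach((k, v) -> System.out.println(k + ": " + v));
  }
}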


Sales

Generated data
  Real schema from a production Hive deployment
  Random data based on the data statistics
55 columns with lots of nulls
  A little structure
  Timestamps, strings, longs, booleans, list, & struct


Storage costs


Compression

Data size matters!
  Hadoop stores all of your data, but requires hardware
  Size is one factor in read speed
ORC and Parquet use RLE & dictionaries
All the formats support general-purpose compression (zlib, Snappy)

Taxi Size Analysis

Don't use JSON
Use either Snappy or Zlib compression
Avro's small compression window hurts
Parquet Zlib is smaller than ORC
Group the column sizes by type


Sales Size Analysis

ORC did better than expected
  String columns have small cardinality
  Lots of timestamp columns
  No doubles
Need to revalidate results with the original data
  Improve random data generator


Github Size Analysis

Surprising win for JSON and Avro
  Worst when uncompressed
  Best with zlib
Many partially shared strings
  ORC and Parquet don't compress across columns

Need to investigate shared dictionaries


Use Cases


Full Table Scans

Read all columns & rows
All formats except JSON are splittable
  Different workers read different parts of the file


Taxi Read Performance Analysis

JSON is very slow to read
  Large storage size for this data set
  Needs to do a LOT of string parsing
Tradeoff between space & time
  Less compression is sometimes faster


Sales Read Performance Analysis

Read performance is dominated by format
  Compression matters less for this data set
  Straight ordering: ORC, Avro, Parquet, & JSON
Garbage collection is important
  ORC: 0.3 to 1.4% of time


Github Read Performance Analysis

Garbage collection is critical
  ORC: 2.1 to 3.4% of time
  Avro: 0.1% of time
  Parquet: 11.4 to 12.8% of time
A lot of columns need more space
  Suspect that we need bigger stripes


Column Projection

Often just need a few columns
  Only ORC & Parquet are columnar
  Only read, decompress, & deserialize some columns

Dataset | format  | compression | us/row (all) | us/row (projection) | percent time
github  | orc     | zlib        | 21.319       | 0.185               | 0.87%
github  | parquet | zlib        | 72.494       | 0.585               | 0.81%
sales   | orc     | zlib        |  1.866       | 0.056               | 3.00%
sales   | parquet | zlib        | 12.893       | 0.329               | 2.55%
taxi    | orc     | zlib        |  2.766       | 0.063               | 2.28%
taxi    | parquet | zlib        |  3.496       | 0.718               | 20.54%
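A hedged sketch of what column projection looks like with the ORC core Java reader (not code from the deck; API details vary across ORC releases). Only the included columns are read, decompressed, and decoded; the file and column names carry over from the earlier writer sketch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

public class OrcProjectionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("rides.orc"),
        OrcFile.readerOptions(conf));

    // Column ids are assigned in pre-order: 0 = root struct, 1 = fare, 2 = vendor.
    TypeDescription schema = reader.getSchema();
    boolean[] include = new boolean[schema.getMaximumId() + 1];
    include[0] = true;   // root struct
    include[1] = true;   // read only the "fare" column

    RecordReader rows = reader.rows(reader.options().include(include));
    VectorizedRowBatch batch = schema.createRowBatch();
    while (rows.nextBatch(batch)) {
      DoubleColumnVector fare = (DoubleColumnVector) batch.cols[0];
      for (int r = 0; r < batch.size; r++) {
        System.out.println(fare.vector[r]);   // "vendor" is never decompressed or decoded
      }
    }
    rows.close();
  }
}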


Projection & Predicate Pushdown

Sometimes have a filter predicate on the table
  Select a superset of rows that match
  Selective filters have a huge impact
Improves data layout options
  Better than partition pruning with sorting
ORC has added optional bloom filters
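A hedged sketch of handing ORC a search argument so it can skip row groups whose statistics (or bloom filters) rule the predicate out. The exact searchArgument signature and column-name mapping differ across ORC versions, and the column and file names are again invented.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class OrcPushdownSketch {
  public static void main(String[] args) throws Exception {
    // Predicate: fare < 10.0. The reader uses stripe/row-group min/max statistics
    // (and bloom filters, if present) to skip row groups that cannot match; what
    // comes back is a superset of the matching rows, so the filter is still applied later.
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startAnd()
        .lessThan("fare", PredicateLeaf.Type.FLOAT, 10.0)
        .end()
        .build();

    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("rides.orc"), OrcFile.readerOptions(conf));
    RecordReader rows = reader.rows(
        reader.options().searchArgument(sarg, new String[]{"fare"}));
    // ... read batches as in the projection sketch ...
    rows.close();
  }
}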


Metadata Access

ORC & Parquet store metadata in the file footer
  File schema
  Number of records
  Min, max, count of each column
Provides O(1) access
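A hedged sketch, not from the deck, of reading that footer metadata with the ORC core Java API; only the footer is touched, which is what keeps the access O(1) in the number of rows. The file name is the one invented earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.ColumnStatistics;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class OrcMetadataSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("rides.orc"),
        OrcFile.readerOptions(conf));

    // All of this comes from the footer; no data stripes are read.
    System.out.println("schema: " + reader.getSchema());
    System.out.println("rows:   " + reader.getNumberOfRows());

    ColumnStatistics[] stats = reader.getStatistics();   // one entry per column id
    for (int i = 0; i < stats.length; i++) {
      System.out.println("column " + i + ": " + stats[i]);
    }
  }
}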


Conclusions


Recommendations

Disclaimer: Everything changes!
  Both these benchmarks and the formats will change.
Don't use JSON for processing.
If your use case needs column projection or predicate pushdown, use ORC or Parquet.


Recommendations

For complex tables with common strings
  Avro with Snappy is a good fit (w/o projection)
For other tables
  ORC with Zlib or Snappy is a good fit
Tweet benchmark suggestions to @owen_omalley


Fun Stuff

Built an open benchmark suite for files
Built pieces of a tool to convert files
  Avro, CSV, JSON, ORC, & Parquet
Built a random parameterized generator


Remaining work

Extend benchmark with LZO and LZ4.
Finish predicate pushdown benchmark.
Add C++ reader for ORC, Parquet, & Avro.
Add Presto ORC reader.


Thank you!

Twitter: @owen_omalley
Email: [email protected]
