File Format Benchmark - Avro, JSON, ORC, and Parquet
Owen O'Malley
[email protected]
@owen_omalley
September 2016
Who Am I?
Worked on Hadoop since Jan 2006: MapReduce, Security, Hive, and ORC
Worked on different file formats
Goal
Avro
Cross-language file format for Hadoop
Schema evolution was the primary goal
Schema segregated from data
JSON
Serialization format for HTTP & Javascript
Text format with MANY parsers
Schema completely integrated with data
Row major format
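The contrast between JSON's schema-in-every-record approach and Avro-style schema segregation can be sketched in a few lines. This is a toy illustration using JSON for both encodings, not Avro's actual binary format, and the records are made up:

```python
import json

records = [
    {"name": "alice", "age": 31},
    {"name": "bob", "age": 27},
]

# JSON style: field names travel with every record (schema integrated with data)
json_rows = [json.dumps(r) for r in records]

# Schema-segregated style: field names stored once, rows hold only values
schema = ["name", "age"]
value_rows = [json.dumps([r[k] for k in schema]) for r in records]

json_bytes = sum(len(s) for s in json_rows)
value_bytes = len(json.dumps(schema)) + sum(len(s) for s in value_rows)
print(json_bytes, value_bytes)  # the schema-once layout is smaller
```

The gap widens with row count, since the schema cost is paid once while JSON repeats the field names per record.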
ORC
Parquet
Design based on Google's Dremel paper
Schema segregated into footer
Column major format with stripes
Simpler type model with logical types
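The "schema segregated into footer" layout can be sketched in pure Python: column chunks are written first, and a footer holding the schema plus the chunk offsets goes at the end of the file, located via a fixed-size trailer. This is a simplified illustration using JSON for the chunks, not Parquet's real encoding:

```python
import io
import json
import struct

def write_columnar(buf, schema, columns):
    """Write each column's values contiguously, then a footer with schema + offsets."""
    chunks = {}
    for name in schema:
        start = buf.tell()
        data = json.dumps(columns[name]).encode()
        buf.write(data)
        chunks[name] = (start, len(data))
    footer = json.dumps({"schema": schema, "chunks": chunks}).encode()
    footer_start = buf.tell()
    buf.write(footer)
    buf.write(struct.pack("<I", footer_start))  # trailer points at the footer

def read_footer(buf):
    buf.seek(-4, io.SEEK_END)
    (footer_start,) = struct.unpack("<I", buf.read(4))
    end = buf.seek(0, io.SEEK_END) - 4
    buf.seek(footer_start)
    return json.loads(buf.read(end - footer_start))

def read_column(buf, name):
    """Read one column chunk without touching the others."""
    start, length = read_footer(buf)["chunks"][name]
    buf.seek(start)
    return json.loads(buf.read(length))

buf = io.BytesIO()
write_columnar(buf, ["name", "age"], {"name": ["alice", "bob"], "age": [31, 27]})
print(read_column(buf, "age"))  # only the "age" chunk is read
```

Putting the footer last means a writer can stream column data without knowing offsets up front, while a reader pays one seek to the end before it can find any column.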
Data Sets
Github Logs
Sales
Generated data
Real schema from a production Hive deployment
Random data based on the data statistics
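Generating random rows that respect the statistics of a real schema can be sketched as follows; the column names and statistics here are invented for illustration:

```python
import random

# Hypothetical per-column statistics gathered from a production table
stats = {
    "price":    {"kind": "float", "min": 0.99, "max": 999.0},
    "quantity": {"kind": "int",   "min": 1,    "max": 12},
    "region":   {"kind": "enum",  "values": ["east", "west", "north", "south"]},
}

def synthesize_row(stats, rng):
    """Draw one random row whose values stay inside the recorded statistics."""
    row = {}
    for name, s in stats.items():
        if s["kind"] == "float":
            row[name] = rng.uniform(s["min"], s["max"])
        elif s["kind"] == "int":
            row[name] = rng.randint(s["min"], s["max"])
        else:
            row[name] = rng.choice(s["values"])
    return row

rng = random.Random(42)  # fixed seed for reproducible benchmark data
rows = [synthesize_row(stats, rng) for _ in range(1000)]
```

Data generated this way compresses and encodes roughly like the production data without exposing any real values.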
Storage costs
Compression
Use Cases
Column Projection
Dataset  Format   Compression  All columns (s)  Projected (s)  % of time
github   parquet  zlib                  72.494          0.585      0.81%
sales    orc      zlib                   1.866          0.056      3.00%
sales    parquet  zlib                  12.893          0.329      2.55%
taxi     orc      zlib                   2.766          0.063      2.28%
taxi     parquet  zlib                   3.496          0.718     20.54%
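The "% of time" figures are simply the projected read time divided by the all-columns read time; a quick arithmetic check of the table:

```python
# (dataset, format, all-columns seconds, projected seconds) from the table above
rows = [
    ("github", "parquet", 72.494, 0.585),
    ("sales",  "orc",      1.866, 0.056),
    ("sales",  "parquet", 12.893, 0.329),
    ("taxi",   "orc",      2.766, 0.063),
    ("taxi",   "parquet",  3.496, 0.718),
]
for dataset, fmt, full, projected in rows:
    print(f"{dataset}/{fmt}: {projected / full:.2%}")
```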
Metadata Access
Conclusions
Recommendations
Disclaimer: Everything changes!
Both these benchmarks and the formats will change.
Recommendations
Fun Stuff
Built open benchmark suite for files
Built pieces of a tool to convert files
Remaining work
Extend benchmark with LZO and LZ4.
Finish predicate pushdown benchmark.
Add C++ reader for ORC, Parquet, & Avro.
Add Presto ORC reader.
Thank you!
Twitter: @owen_omalley
Email: [email protected]