
File Format Benchmark - Avro, JSON, ORC, & Parquet

Owen O'Malley
[email protected]
@owen_omalley
September 2016

Who Am I?

Worked on Hadoop since Jan 2006
  MapReduce, Security, Hive, and ORC
Worked on different file formats
  Sequence File, RCFile, ORC File, T-File, and Avro requirements


Goal

Seeking to discover unknowns
  How do the different formats perform?
  What could they do better?
  Best part of open source is looking inside!
Use real & diverse data sets
  Over-reliance on similar datasets leads to weakness


The File Formats


Avro

Cross-language file format for Hadoop
Schema evolution was a primary goal
Schema segregated from data
  Unlike Protobuf and Thrift
Row major format
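Not in the original deck: a minimal Java sketch of writing an Avro container file, to make the "schema segregated from data" point concrete. The schema is written once into the file header and rows follow in row-major blocks; the record name, fields, and file name are invented for illustration.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
  public static void main(String[] args) throws Exception {
    // The schema is defined once and stored in the file header,
    // separate from the row-major data blocks that follow.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Ride\",\"fields\":["
        + "{\"name\":\"fare\",\"type\":\"double\"},"
        + "{\"name\":\"vendor\",\"type\":\"string\"}]}");

    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.snappyCodec());   // general-purpose block compression
      writer.create(schema, new File("rides.avro"));

      GenericRecord row = new GenericData.Record(schema);
      row.put("fare", 12.5);
      row.put("vendor", "CMT");
      writer.append(row);                            // rows are appended one at a time
    }
  }
}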


JSON

Serialization format for HTTP & Javascript
Text format with MANY parsers
Schema completely integrated with data
Row major format


ORC

Originally part of Hive to replace RCFile
  Now a top-level Apache project
Schema segregated into footer
Column major format with stripes
Rich type model, stored top-down
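Again not from the deck: a hedged sketch of writing the same two columns with the ORC core Java API (org.apache.orc). The schema ends up in the file footer and values are written column by column into stripes. Field and file names are invented, and exact APIs vary by ORC release.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The schema is stored in the file footer; data is laid out column by column in stripes.
    TypeDescription schema = TypeDescription.fromString("struct<fare:double,vendor:string>");

    Writer writer = OrcFile.createWriter(new Path("rides.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(schema)
            .compress(CompressionKind.ZLIB));   // general-purpose codec on top of RLE & dictionaries

    VectorizedRowBatch batch = schema.createRowBatch();
    DoubleColumnVector fare = (DoubleColumnVector) batch.cols[0];
    BytesColumnVector vendor = (BytesColumnVector) batch.cols[1];

    int row = batch.size++;
    fare.vector[row] = 12.5;
    byte[] vendorBytes = "CMT".getBytes(StandardCharsets.UTF_8);
    vendor.setRef(row, vendorBytes, 0, vendorBytes.length);

    writer.addRowBatch(batch);   // full batches accumulate into stripes
    writer.close();
  }
}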


Parquet

Design based on Google's Dremel paper
Schema segregated into footer
Column major format with stripes
Simpler type model with logical types
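A comparable hedged sketch for Parquet using the parquet-avro bindings (one of several ways to write Parquet, not the benchmark's own code). It reuses the invented Avro schema from above; note that Parquet names its zlib-style codec GZIP.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Ride\",\"fields\":["
        + "{\"name\":\"fare\",\"type\":\"double\"},"
        + "{\"name\":\"vendor\",\"type\":\"string\"}]}");

    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("rides.parquet"))
                 .withSchema(schema)
                 .withCompressionCodec(CompressionCodecName.GZIP)  // Parquet's zlib-style codec
                 .build()) {
      GenericRecord row = new GenericData.Record(schema);
      row.put("fare", 12.5);
      row.put("vendor", "CMT");
      writer.write(row);   // rows are buffered, then flushed column by column per row group
    }
  }
}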


Data Sets


NYC Taxi Data

Every taxi cab ride in NYC from 2009
  Publicly available
  http://tinyurl.com/nyc-taxi-analysis
18 columns with no null values
  Doubles, integers, decimals, & strings
2 months of data - 22.7 million rows


Github Logs

All actions on Github public repositories
  Publicly available
  https://www.githubarchive.org/
704 columns with a lot of structure & nulls
  Pretty much the kitchen sink


Finding the Github Schema

The data is all in JSON.
No schema for the data is published.
We wrote a JSON schema discoverer.
  Scans the document and figures out the type
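The deck does not show the discoverer itself, so the following is only a sketch of the idea using Jackson, with invented names: walk each JSON document and record a type for every field path. A real tool would also merge the types seen across millions of documents and resolve conflicts and nulls.

import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonSchemaSketch {
  // Collect a dotted-path -> type-name map for one document.
  static void discover(String path, JsonNode node, Map<String, String> types) {
    if (node.isObject()) {
      Iterator<Map.Entry<String, JsonNode>> fields = node.fields();
      while (fields.hasNext()) {
        Map.Entry<String, JsonNode> f = fields.next();
        discover(path.isEmpty() ? f.getKey() : path + "." + f.getKey(), f.getValue(), types);
      }
    } else if (node.isArray()) {
      for (JsonNode child : node) {
        discover(path + "[]", child, types);
      }
    } else if (node.isTextual()) {
      types.put(path, "string");
    } else if (node.isIntegralNumber()) {
      types.put(path, "long");
    } else if (node.isNumber()) {
      types.put(path, "double");
    } else if (node.isBoolean()) {
      types.put(path, "boolean");
    } else {
      types.put(path, "null");   // nulls need merging with types seen in other documents
    }
  }

  public static void main(String[] args) throws Exception {
    JsonNode doc = new ObjectMapper().readTree(
        "{\"actor\": {\"login\": \"someone\"}, \"public\": true, \"id\": 12345}");
    Map<String, String> types = new TreeMap<>();
    discover("", doc, types);
    types.forEach((k, v) -> System.out.println(k + ": " + v));
  }
}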


Sales

Generated data
  Real schema from a production Hive deployment
  Random data based on the data statistics
55 columns with lots of nulls
  A little structure
  Timestamps, strings, longs, booleans, list, & struct


Storage costs


Compression

Data size matters!
  Hadoop stores all of your data, but requires hardware
  Size is one factor in read speed
ORC and Parquet use RLE & dictionaries
All the formats support general-purpose compression (zlib, Snappy)

Taxi Size Analysis

Don't use JSON
Use either Snappy or Zlib compression
Avro's small compression window hurts
Parquet Zlib is smaller than ORC
Group the column sizes by type


Sales Size Analysis

ORC did better than expected
  String columns have small cardinality
  Lots of timestamp columns
  No doubles
Need to revalidate results with the original data
  Improve random data generator


Github Size Analysis

Surprising win for JSON and Avro
  Worst when uncompressed
  Best with zlib
Many partially shared strings
  ORC and Parquet don't compress across columns

Need to investigate shared dictionaries


Use Cases


Full Table Scans

Read all columns & rows
All formats except JSON are splittable
  Different workers read different parts of the file


Taxi Read Performance Analysis

JSON is very slow to read
  Large storage size for this data set
  Needs to do a LOT of string parsing
Tradeoff between space & time
  Less compression is sometimes faster


Sales Read Performance Analysis

Read performance is dominated by format
  Compression matters less for this data set
  Straight ordering: ORC, Avro, Parquet, & JSON
Garbage collection is important
  ORC: 0.3 to 1.4% of time


Github Read Performance Analysis

Garbage collection is critical
  ORC: 2.1 to 3.4% of time
  Avro: 0.1% of time
  Parquet: 11.4 to 12.8% of time
A lot of columns need more space
  Suspect that we need bigger stripes


Column Projection

Often just need a few columns
  Only ORC & Parquet are columnar
  Only read, decompress, & deserialize some columns

Dataset | format  | compression | us/row (all) | us/row (projection) | percent time
github  | orc     | zlib        | 21.319       | 0.185               | 0.87%
github  | parquet | zlib        | 72.494       | 0.585               | 0.81%
sales   | orc     | zlib        |  1.866       | 0.056               | 3.00%
sales   | parquet | zlib        | 12.893       | 0.329               | 2.55%
taxi    | orc     | zlib        |  2.766       | 0.063               | 2.28%
taxi    | parquet | zlib        |  3.496       | 0.718               | 20.54%
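A hedged sketch of what column projection looks like with the ORC core Java reader (not code from the deck; API details vary across ORC releases). Only the included columns are read, decompressed, and decoded; the file and column names carry over from the earlier writer sketch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

public class OrcProjectionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("rides.orc"),
        OrcFile.readerOptions(conf));

    // Column ids are assigned in pre-order: 0 = root struct, 1 = fare, 2 = vendor.
    TypeDescription schema = reader.getSchema();
    boolean[] include = new boolean[schema.getMaximumId() + 1];
    include[0] = true;   // root struct
    include[1] = true;   // read only the "fare" column

    RecordReader rows = reader.rows(reader.options().include(include));
    VectorizedRowBatch batch = schema.createRowBatch();
    while (rows.nextBatch(batch)) {
      DoubleColumnVector fare = (DoubleColumnVector) batch.cols[0];
      for (int r = 0; r < batch.size; r++) {
        System.out.println(fare.vector[r]);   // "vendor" is never decompressed or decoded
      }
    }
    rows.close();
  }
}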


Projection & Predicate Pushdown

Sometimes have a filter predicate on the table
  Select a superset of rows that match
  Selective filters have a huge impact
Improves data layout options
  Better than partition pruning with sorting
ORC has added optional bloom filters
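A hedged sketch of handing ORC a search argument so it can skip row groups whose statistics (or bloom filters) rule the predicate out. The exact searchArgument signature and column-name mapping differ across ORC versions, and the column and file names are again invented.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class OrcPushdownSketch {
  public static void main(String[] args) throws Exception {
    // Predicate: fare < 10.0. The reader uses stripe/row-group min/max statistics
    // (and bloom filters, if present) to skip row groups that cannot match; what
    // comes back is a superset of the matching rows, so the filter is still applied later.
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startAnd()
        .lessThan("fare", PredicateLeaf.Type.FLOAT, 10.0)
        .end()
        .build();

    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("rides.orc"), OrcFile.readerOptions(conf));
    RecordReader rows = reader.rows(
        reader.options().searchArgument(sarg, new String[]{"fare"}));
    // ... read batches as in the projection sketch ...
    rows.close();
  }
}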


Metadata Access

ORC & Parquet store metadata in the file footer
  File schema
  Number of records
  Min, max, count of each column
Provides O(1) access
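A hedged sketch, not from the deck, of reading that footer metadata with the ORC core Java API; only the footer is touched, which is what keeps the access O(1) in the number of rows. The file name is the one invented earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.ColumnStatistics;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class OrcMetadataSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("rides.orc"),
        OrcFile.readerOptions(conf));

    // All of this comes from the footer; no data stripes are read.
    System.out.println("schema: " + reader.getSchema());
    System.out.println("rows:   " + reader.getNumberOfRows());

    ColumnStatistics[] stats = reader.getStatistics();   // one entry per column id
    for (int i = 0; i < stats.length; i++) {
      System.out.println("column " + i + ": " + stats[i]);
    }
  }
}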


Conclusions


Recommendations

Disclaimer: Everything changes!
  Both these benchmarks and the formats will change.
Don't use JSON for processing.
If your use case needs column projection or predicate pushdown, use ORC or Parquet.


Recommendations

For complex tables with common strings
  Avro with Snappy is a good fit (w/o projection)
For other tables
  ORC with Zlib or Snappy is a good fit
Tweet benchmark suggestions to @owen_omalley


Fun Stuff

Built an open benchmark suite for files
Built pieces of a tool to convert files
  Avro, CSV, JSON, ORC, & Parquet
Built a random parameterized generator


Remaining work

Extend benchmark with LZO and LZ4.
Finish predicate pushdown benchmark.
Add C++ reader for ORC, Parquet, & Avro.
Add Presto ORC reader.


Thank you!

Twitter: @owen_omalley
Email: [email protected]
