
Hadoop File Formats

We are going to compare the different file formats with respect to the following characteristics:

- Readability.
- Splitability.
- Row/Column oriented.
- Block compression.
- Support for schema evolution.
- Better for read or write.

That is why it’s important to first discuss the following concepts.

MapReduce Input Splits:


Definition: Hadoop divides the input to a MapReduce job into fixed‐size pieces called input splits, or just splits. Hadoop
creates one map task for each split, which runs the user‐defined map function for each record in the split.

Having many splits means the time taken to process each split is small compared to the time to process the whole input. So
if we are processing the splits in parallel, the processing is better load balanced when the splits are small.

On the other hand, if splits are too small, the overhead of managing the splits and map task creation begins to dominate
the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by
default.
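
As a concrete illustration, here is a minimal sketch (using the newer mapreduce API) of how a job driver could pin the split-size bounds; the 128 MB figure just mirrors the default block size mentioned above, and the job setup is otherwise hypothetical.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // Lower and upper bounds (in bytes) on the computed split size; leaving the
    // defaults alone makes the split size work out to one HDFS block (128 MB).
    FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
  }
}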

Hadoop does its best to run the map task on a node where the input data resides in HDFS, because it doesn’t use valuable
cluster bandwidth. This is called the data locality optimization. Sometimes, however, all the nodes hosting the HDFS block
replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node
in the same rack as one of the blocks. Very occasionally even this is not possible, so an off‐rack node is used, which results
in an inter‐rack network transfer.

It should now be clear why the optimal split size is the same as the block size: it is the largest size of input that can be
guaranteed to be stored on a single node. If the split spanned two blocks, it would be unlikely that any HDFS node stored
both blocks, so some of the split would have to be transferred across the network to the node running the map task, which
is clearly less efficient than running the whole map task using local data.

Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it’s processed
by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So, storing
it in HDFS with replication would be overkill. If the node running the map task fails before the map output has been
consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re‐create the map
output.

Reduce tasks don’t have the advantage of data locality; the input to a single reduce task is normally the output from all
mappers.
When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task.
There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single
partition. The partitioning can be controlled by a user‐defined partitioning function, but normally the default partitioner —
which buckets keys using a hash function — works very well.
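
As a rough sketch of that default behaviour, the stock HashPartitioner works like the class below: it masks off the sign bit of the key's hash and buckets by the number of reduce tasks, which is why all records for a given key land in the same partition.

import org.apache.hadoop.mapreduce.Partitioner;

public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Same key => same hash => same partition, so all values for a given key
    // end up at the same reducer.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}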

Finally, it’s also possible to have zero reduce tasks.

When we say that a file format is splittable, it means that Hadoop is able to divide the file into input splits that can be distributed to and processed by different mappers.
Schema Evolution:
Definition: The ability to add new fields or to remove, rename, or modify existing fields in a file's schema without the need to create new files.

Schema evolution is nothing but a term for how the store behaves when the schema changes. Users can
start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may
end up with multiple files with different but mutually compatible schemas.

So let's say you have one Avro/Parquet file and you want to change its schema: you can rewrite that file
with the new schema inside. But what if you have terabytes of Avro/Parquet files and you want to change their
schema? Will you rewrite all of the data every time the schema changes?

Schema evolution allows you to update the schema used to write new data, while maintaining backwards
compatibility with the schema(s) of your old data. Then you can read it all together, as if all of the data has
one schema. Of course there are precise rules governing the changes allowed, to maintain compatibility.
Those rules are listed under Schema Resolution.
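
As a small illustration of those rules, the sketch below (the record name, field names, and file path are hypothetical) reads old Avro data with a newer reader schema that adds an "email" field with a default value; records written before the field existed come back with that default, and no files are rewritten.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SchemaEvolutionRead {
  public static void main(String[] args) throws Exception {
    // The newer (reader) schema: an "email" field was added, with a default.
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"unknown\"}]}");

    // The writer schema comes from the file header; Avro resolves it against
    // the reader schema using its schema resolution rules.
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(readerSchema);
    try (DataFileReader<GenericRecord> fileReader =
             new DataFileReader<>(new File("users-old.avro"), datumReader)) {
      for (GenericRecord user : fileReader) {
        // Records written before "email" existed come back with the default.
        System.out.println(user.get("id") + " " + user.get("email"));
      }
    }
  }
}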

Block compression
Definition: The ability to compress multiple records together.

Please note that the word block here isn’t equivalent to the HDFS block.

Different file formats have different ways of compressing data. For example, Sequence files (explained later; in general,
they are files stored as key‐value pairs) can be configured with one of three compression types: none (the
default), record compression, or block compression.

If no compression is enabled (the default), each record is made up of the record length (in bytes), the key length, the key,
and then the value.

The format for record compression is almost identical to that for no compression, except the value bytes are compressed.
Block compression (Figure 5‐3) compresses multiple records at once; it is therefore more compact than and should
generally be preferred over record compression because it has the opportunity to take advantage of similarities between
records. The format of a block is a field indicating the number of records in the block, followed by four compressed fields:
the key lengths, the keys, the value lengths, and the values.
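
For reference, a minimal sketch of writing a block-compressed sequence file with the Hadoop API is shown below; the output path and the key/value contents are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BlockCompressedWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("events.seq")),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        // BLOCK compresses batches of records together; RECORD compresses each
        // value on its own, and NONE stores records uncompressed.
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
      writer.append(new IntWritable(1), new Text("first record"));
      writer.append(new IntWritable(2), new Text("second record"));
    }
  }
}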

Row vs. Column oriented files


Sequence files, map files, and Avro datafiles are all row‐oriented file formats, which means that the values for each row are
stored contiguously in the file. In a column‐oriented format, the rows in a file (or, equivalently, a table in Hive) are broken
up into row splits, then each split is stored in column‐oriented fashion: the values for each row in the first column are
stored first, followed by the values for each row in the second column, and so on. This is shown diagrammatically in Figure
5‐4.

A column‐oriented layout permits columns that are not accessed in a query to be skipped. Consider a query of the table in
Figure 5‐4 that processes only column 2. With row‐oriented storage, like a sequence file, the whole row (stored in a
sequence file record) is loaded into memory.

With column‐oriented storage, only the column 2 parts of the file (highlighted in the figure) need to be read into memory.
In general, column‐oriented formats work well when queries access only a small number of columns in the table.
Conversely, row‐oriented formats are appropriate when a large number of columns of a single row are needed for
processing at the same time.
File formats comparison
1. Text/CSV Files

CSV files are still quite common and often used for exchanging data between Hadoop and external systems.
They are readable and ubiquitously parsable. They come in handy when doing a dump from a database or
bulk loading data from Hadoop into an analytic database. However, CSV files do not support block
compression, thus compressing a CSV file in Hadoop often comes at a significant read performance cost.
When working with Text/CSV files in Hadoop, never include header or footer lines. Each line of the file should
contain a record. This, of course, means that there is no metadata stored with the CSV file. You must know
how the file was written in order to make use of it. Also, since the file structure is dependent on field order,
new fields can only be appended at the end of records while existing fields can never be deleted. As such,
CSV files have limited support for schema evolution.
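
To see why field order matters, here is a small, hypothetical sketch of parsing a CSV record where a new field was appended at the end of each record; older files simply lack the trailing column, so the reader has to handle both cases itself because there is no metadata in the file to consult.

public class CsvRecordParse {
  public static void main(String[] args) {
    // An illustrative record; older files contain only the first two fields.
    String line = "42,alice,alice@example.com";
    String[] fields = line.split(",", -1);
    String id = fields[0];
    String name = fields[1];
    // The appended field is read defensively, since nothing in the file says
    // whether it is present.
    String email = fields.length > 2 ? fields[2] : "";
    System.out.println(id + " " + name + " " + email);
  }
}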

2. JSON Records

JSON records are different from JSON Files in that each line is its own JSON datum -- making the files
splittable. Unlike CSV files, JSON stores metadata with the data, fully enabling schema evolution. However,
like CSV files, JSON files do not support block compression. Additionally, JSON support was a relative
latecomer to the Hadoop toolset, and many of the native serdes contain significant bugs. Fortunately, third-party
serdes are frequently available and often solve these challenges. You may have to do a little experimentation
and research for your use cases.

3. Avro Files

Avro files are quickly becoming the best multi-purpose storage format within Hadoop. Avro files store
metadata with the data but also allow specification of an independent schema for reading the file. This
makes Avro the epitome of schema evolution support, since you can rename, add, delete, and change the
data types of fields by defining a new, independent schema. Additionally, Avro files are splittable, support block
compression and enjoy broad, relatively mature, tool support within the Hadoop ecosystem.
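
As a small sketch (the schema, field values, and output path are illustrative), the code below writes an Avro data file: the schema is embedded in the file header, and a codec option enables block compression.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWrite {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"name\",\"type\":\"string\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("id", 1L);
    user.put("name", "alice");

    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.deflateCodec(6)); // block compression (built-in deflate codec)
      writer.create(schema, new File("users.avro")); // the schema goes into the file header
      writer.append(user);
    }
  }
}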

4. Sequence Files

Sequence files store data in a binary format with a similar structure to CSV. Like CSV, sequence files do not
store metadata with the data so the only schema evolution option is appending new fields. However, unlike
CSV, sequence files do support block compression. Due to the complexity of reading sequence files, they are
often only used for in-flight data, such as intermediate data storage within a sequence of MapReduce
jobs.

5. RC Files

RC Files or Record Columnar Files were the first columnar file format adopted in Hadoop. Like columnar
databases, the RC file enjoys significant compression and query performance benefits. However, the current
serdes for RC files in Hive and other tools do not support schema evolution. In order to add a column to your
data, you must rewrite every pre-existing RC file. Also, although RC files are good for querying, writing an RC file
requires more memory and computation than non-columnar file formats. They are generally slower to write.
6. ORC Files

ORC Files or Optimized RC Files were invented to optimize performance in Hive and are primarily backed by
HortonWorks. ORC files enjoy the same benefits and limitations as RC files, just done better for Hadoop. This
means ORC files compress better than RC files, enabling faster queries. However, they still don't support
schema evolution. Some benchmarks indicate that ORC files compress to be the smallest of all file formats in
Hadoop. It is worthwhile to note that, at the time of this writing, Cloudera Impala does not support ORC files.

7. Parquet Files

Parquet Files are yet another columnar file format that originated from Hadoop creator Doug Cutting's
Trevni project. Like RC and ORC, Parquet enjoys compression and query performance benefits, and is
generally slower to write than non-columnar file formats. However, unlike RC and ORC files, Parquet serdes
support limited schema evolution. In Parquet, new columns can be added at the end of the structure. At
present, Hive and Impala are able to query newly added columns, but other tools in the ecosystem such as
Hadoop Pig may face challenges. Parquet is supported by Cloudera and optimized for Cloudera Impala.
Native Parquet support is rapidly being added for the rest of the Hadoop ecosystem. One note on Parquet
file support with Hive: it is very important that Parquet column names are lowercase. If your Parquet file
contains mixed-case column names, Hive will not be able to read the column; queries on that column will
return null values without logging any errors. Unlike Hive, Impala handles mixed-case column names. This is a
truly perplexing problem when you encounter it.

How to choose a file format?


There are three types of performance to consider:

- Write performance -- how fast can the data be written.
- Partial read performance -- how fast can you read individual columns within a file.
- Full read performance -- how fast can you read every data element in a file.

Columnar, compressed file formats like Parquet or ORC may optimize partial and full read performance, but
they do so at the expense of write performance. Conversely, uncompressed CSV files are fast to write but, due
to the lack of compression and column orientation, are slow for reads. You may end up with multiple copies
of your data, each formatted for a different performance profile.

As discussed, each file format is optimized by purpose. Your choice of format is driven by your use case and
environment. Here are the key factors to consider:

- Hadoop Distribution: Cloudera and Hortonworks support/favor different formats.
- Schema Evolution: Will the structure of your data evolve?
- Processing Requirements: Will you be crunching the data, and with what tools?
- Read/Query Requirements: Will you be using SQL on Hadoop? Which engine?
- Extract Requirements: Will you be extracting the data from Hadoop for import into an external database engine or other platform?
- Storage Requirements: Is data volume a significant factor? Will you get significantly more bang for your storage buck through compression?

So, with all the options and considerations, are there any obvious choices? If you are storing intermediate
data between MapReduce jobs, then Sequence files are preferred. If query performance against the data is
most important, ORC (HortonWorks/Hive) or Parquet (Cloudera/Impala) are optimal -- but these files will
take longer to write. (We have also seen order-of-magnitude query performance improvements when using
Parquet with Spark SQL.) Avro is great if your schema is going to change over time, but query performance
will be slower than ORC or Parquet. CSV files are excellent if you are going to extract data from Hadoop to
bulk load into a database.
