
Hadoop File Formats

We are going to compare the different file formats with respect to the following characteristics:

- Readability.
- Splitability.
- Row/Column oriented.
- Block compression.
- Support for schema evolution.
- Better for read or write.

That is why it’s important to first discuss the following concepts.

MapReduce Input Splits:


Definition: Hadoop divides the input to a MapReduce job into fixed‐size pieces called input splits, or just splits. Hadoop
creates one map task for each split, which runs the user‐defined map function for each record in the split.

Having many splits means the time taken to process each split is small compared to the time to process the whole input. So
if we are processing the splits in parallel, the processing is better load balanced when the splits are small.

On the other hand, if splits are too small, the overhead of managing the splits and map task creation begins to dominate
the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by
default.
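
As a concrete illustration, here is a minimal sketch (using the newer mapreduce API) of how a job driver could pin the split-size bounds; the 128 MB figure just mirrors the default block size mentioned above, and the job setup is otherwise hypothetical.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // Lower and upper bounds (in bytes) on the computed split size; leaving the
    // defaults alone makes the split size work out to one HDFS block (128 MB).
    FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
  }
}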

Hadoop does its best to run the map task on a node where the input data resides in HDFS, because it doesn’t use valuable
cluster bandwidth. This is called the data locality optimization. Sometimes, however, all the nodes hosting the HDFS block
replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node
in the same rack as one of the blocks. Very occasionally even this is not possible, so an off‐rack node is used, which results
in an inter‐rack network transfer.

It should now be clear why the optimal split size is the same as the block size: it is the largest size of input that can be
guaranteed to be stored on a single node. If the split spanned two blocks, it would be unlikely that any HDFS node stored
both blocks, so some of the split would have to be transferred across the network to the node running the map task, which
is clearly less efficient than running the whole map task using local data.

Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it’s processed
by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So, storing
it in HDFS with replication would be overkill. If the node running the map task fails before the map output has been
consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re‐create the map
output.

Reduce tasks don’t have the advantage of data locality; the input to a single reduce task is normally the output from all
mappers.
When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task.
There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single
partition. The partitioning can be controlled by a user‐defined partitioning function, but normally the default partitioner —
which buckets keys using a hash function — works very well.
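
As a rough sketch of that default behaviour, the stock HashPartitioner works like the class below: it masks off the sign bit of the key's hash and buckets by the number of reduce tasks, which is why all records for a given key land in the same partition.

import org.apache.hadoop.mapreduce.Partitioner;

public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Same key => same hash => same partition, so all values for a given key
    // end up at the same reducer.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}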

Finally, it’s also possible to have zero reduce tasks.

When we say that a file format is splittable, it means that Hadoop is able to divide the file into input splits that can be distributed to and processed by different mappers.
Schema Evolution:
Definition: The ability to add new fields or to remove, rename, or modify existing fields in a file's schema without the need to create new files.

Schema evolution is nothing but a term for how the store behaves when the schema changes. Users can
start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may
end up with multiple files with different but mutually compatible schemas.

So let's say you have one Avro/Parquet file and you want to change its schema: you can rewrite that file
with the new schema inside. But what if you have terabytes of Avro/Parquet files and you want to change their
schema? Will you rewrite all of the data every time the schema changes?

Schema evolution allows you to update the schema used to write new data, while maintaining backwards
compatibility with the schema(s) of your old data. Then you can read it all together, as if all of the data has
one schema. Of course there are precise rules governing the changes allowed, to maintain compatibility.
Those rules are listed under Schema Resolution.
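
As a small illustration of those rules, the sketch below (the record name, field names, and file path are hypothetical) reads old Avro data with a newer reader schema that adds an "email" field with a default value; records written before the field existed come back with that default, and no files are rewritten.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SchemaEvolutionRead {
  public static void main(String[] args) throws Exception {
    // The newer (reader) schema: an "email" field was added, with a default.
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"unknown\"}]}");

    // The writer schema comes from the file header; Avro resolves it against
    // the reader schema using its schema resolution rules.
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(readerSchema);
    try (DataFileReader<GenericRecord> fileReader =
             new DataFileReader<>(new File("users-old.avro"), datumReader)) {
      for (GenericRecord user : fileReader) {
        // Records written before "email" existed come back with the default.
        System.out.println(user.get("id") + " " + user.get("email"));
      }
    }
  }
}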

Block compression
Definition: The ability to compress multiple records together.

Please note that the word block here isn’t equivalent to the HDFS block.

Different file formats have different ways of compressing data. For example, Sequence files (explained later; in general,
they are files stored as key‐value pairs) can be configured with one of three compression types: none (the
default), record compression, or block compression.

If no compression is enabled (the default), each record is made up of the record length (in bytes), the key length, the key,
and then the value.

The format for record compression is almost identical to that for no compression, except the value bytes are compressed.
Block compression (Figure 5‐3) compresses multiple records at once; it is therefore more compact than and should
generally be preferred over record compression because it has the opportunity to take advantage of similarities between
records. The format of a block is a field indicating the number of records in the block, followed by four compressed fields:
the key lengths, the keys, the value lengths, and the values.
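
For reference, a minimal sketch of writing a block-compressed sequence file with the Hadoop API is shown below; the output path and the key/value contents are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BlockCompressedWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("events.seq")),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        // BLOCK compresses batches of records together; RECORD compresses each
        // value on its own, and NONE stores records uncompressed.
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
      writer.append(new IntWritable(1), new Text("first record"));
      writer.append(new IntWritable(2), new Text("second record"));
    }
  }
}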

Row vs. Column oriented files


Sequence files, map files, and Avro datafiles are all row‐oriented file formats, which means that the values for each row are
stored contiguously in the file. In a column‐oriented format, the rows in a file (or, equivalently, a table in Hive) are broken
up into row splits, then each split is stored in column‐oriented fashion: the values for each row in the first column are
stored first, followed by the values for each row in the second column, and so on. This is shown diagrammatically in Figure
5‐4.

A column‐oriented layout permits columns that are not accessed in a query to be skipped. Consider a query of the table in
Figure 5‐4 that processes only column 2. With row‐oriented storage, like a sequence file, the whole row (stored in a
sequence file record) is loaded into memory.

With column‐oriented storage, only the column 2 parts of the file (highlighted in the figure) need to be read into memory.
In general, column‐oriented formats work well when queries access only a small number of columns in the table.
Conversely, row‐oriented formats are appropriate when a large number of columns of a single row are needed for
processing at the same time.
File formats comparison
1. Text/CSV Files

CSV files are still quite common and often used for exchanging data between Hadoop and external systems.
They are readable and ubiquitously parsable. They come in handy when doing a dump from a database or
bulk loading data from Hadoop into an analytic database. However, CSV files do not support block
compression, thus compressing a CSV file in Hadoop often comes at a significant read performance cost.
When working with Text/CSV files in Hadoop, never include header or footer lines. Each line of the file should
contain a record. This, of course, means that there is no metadata stored with the CSV file. You must know
how the file was written in order to make use of it. Also, since the file structure is dependent on field order,
new fields can only be appended at the end of records while existing fields can never be deleted. As such,
CSV files have limited support for schema evolution.
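
To see why field order matters, here is a small, hypothetical sketch of parsing a CSV record where a new field was appended at the end of each record; older files simply lack the trailing column, so the reader has to handle both cases itself because there is no metadata in the file to consult.

public class CsvRecordParse {
  public static void main(String[] args) {
    // An illustrative record; older files contain only the first two fields.
    String line = "42,alice,alice@example.com";
    String[] fields = line.split(",", -1);
    String id = fields[0];
    String name = fields[1];
    // The appended field is read defensively, since nothing in the file says
    // whether it is present.
    String email = fields.length > 2 ? fields[2] : "";
    System.out.println(id + " " + name + " " + email);
  }
}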

2. JSON Records

JSON records are different from JSON Files in that each line is its own JSON datum -- making the files
splittable. Unlike CSV files, JSON stores metadata with the data, fully enabling schema evolution. However,
like CSV files, JSON files do not support block compression. Additionally, JSON support was a relative
latecomer to the Hadoop toolset, and many of the native serdes contain significant bugs. Fortunately, third-party
serdes are frequently available and often solve these challenges. You may have to do a little experimentation
and research for your use cases.

3. Avro Files

Avro files are quickly becoming the best multi-purpose storage format within Hadoop. Avro files store
metadata with the data but also allow specification of an independent schema for reading the file. This
makes Avro the epitome of schema evolution support, since you can rename, add, delete, and change the
data types of fields by defining a new, independent schema. Additionally, Avro files are splittable, support block
compression and enjoy broad, relatively mature, tool support within the Hadoop ecosystem.
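
As a small sketch (the schema, field values, and output path are illustrative), the code below writes an Avro data file: the schema is embedded in the file header, and a codec option enables block compression.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWrite {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"name\",\"type\":\"string\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("id", 1L);
    user.put("name", "alice");

    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.deflateCodec(6)); // block compression (built-in deflate codec)
      writer.create(schema, new File("users.avro")); // the schema goes into the file header
      writer.append(user);
    }
  }
}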

4. Sequence Files

Sequence files store data in a binary format with a similar structure to CSV. Like CSV, sequence files do not
store metadata with the data so the only schema evolution option is appending new fields. However, unlike
CSV, sequence files do support block compression. Due to the complexity of reading sequence files, they are
often only used for in-flight data, such as intermediate data storage within a sequence of MapReduce
jobs.

5. RC Files

RC Files or Record Columnar Files were the first columnar file format adopted in Hadoop. Like columnar
databases, the RC file enjoys significant compression and query performance benefits. However, the current
serdes for RC files in Hive and other tools do not support schema evolution. In order to add a column to your
data, you must rewrite every pre-existing RC file. Also, although RC files are good for querying, writing an RC file
requires more memory and computation than non-columnar file formats. They are generally slower to write.
6. ORC Files

ORC Files or Optimized RC Files were invented to optimize performance in Hive and are primarily backed by
HortonWorks. ORC files enjoy the same benefits and limitations as RC files, just done better for Hadoop. This
means ORC files compress better than RC files, enabling faster queries. However, they still don't support
schema evolution. Some benchmarks indicate that ORC files compress to be the smallest of all file formats in
Hadoop. It is worthwhile to note that, at the time of this writing, Cloudera Impala does not support ORC files.

7. Parquet Files

Parquet Files are yet another columnar file format that originated from Hadoop creator Doug Cutting's
Trevni project. Like RC and ORC, Parquet enjoys compression and query performance benefits, and is
generally slower to write than non-columnar file formats. However, unlike RC and ORC files, Parquet serdes
support limited schema evolution. In Parquet, new columns can be added at the end of the structure. At
present, Hive and Impala are able to query newly added columns, but other tools in the ecosystem such as
Hadoop Pig may face challenges. Parquet is supported by Cloudera and optimized for Cloudera Impala.
Native Parquet support is rapidly being added for the rest of the Hadoop ecosystem. One note on Parquet
file support with Hive: it is very important that Parquet column names are lowercase. If your Parquet file
contains mixed-case column names, Hive will not be able to read the column; queries on that column will
return null values without logging any errors. Unlike Hive, Impala handles mixed-case column names. This is a
truly perplexing problem when you encounter it.

How to choose a file format?


There are three types of performance to consider:

- Write performance -- how fast can the data be written.
- Partial read performance -- how fast can you read individual columns within a file.
- Full read performance -- how fast can you read every data element in a file.

Columnar, compressed file formats like Parquet or ORC may optimize partial and full read performance, but
they do so at the expense of write performance. Conversely, uncompressed CSV files are fast to write but, due
to the lack of compression and column orientation, are slow for reads. You may end up with multiple copies
of your data, each formatted for a different performance profile.

As discussed, each file format is optimized by purpose. Your choice of format is driven by your use case and
environment. Here are the key factors to consider:

- Hadoop Distribution: Cloudera and Hortonworks support/favor different formats.
- Schema Evolution: Will the structure of your data evolve?
- Processing Requirements: Will you be crunching the data, and with what tools?
- Read/Query Requirements: Will you be using SQL on Hadoop? Which engine?
- Extract Requirements: Will you be extracting the data from Hadoop for import into an external database engine or other platform?
- Storage Requirements: Is data volume a significant factor? Will you get significantly more bang for your storage buck through compression?

So, with all the options and considerations, are there any obvious choices? If you are storing intermediate
data between MapReduce jobs, then Sequence files are preferred. If query performance against the data is
most important, ORC (HortonWorks/Hive) or Parquet (Cloudera/Impala) are optimal -- but these files will
take longer to write. (We have also seen order-of-magnitude query performance improvements when using
Parquet with Spark SQL.) Avro is great if your schema is going to change over time, but query performance
will be slower than ORC or Parquet. CSV files are excellent if you are going to extract data from Hadoop to
bulk load into a database.
