Hadoop Archive

Hadoop Archive (HAR) is designed to efficiently manage small files in Hadoop by packing them into a single compact HDFS block, reducing the memory burden on the namenode. The archiving process involves running a MapReduce job to create an archive file with a .har extension, which can be used as input for further MapReduce jobs. However, HAR files have limitations, including the need for additional disk space during creation, the inability to modify archives without recreation, and potential inefficiencies due to the requirement for many map tasks.

What is Hadoop Archive?

Hadoop is built to deal with large files, so a huge number of small files is problematic and has to be handled efficiently. The namenode stores the metadata for all HDFS data, and when input is split into many small files spread across the datanodes, every one of those files adds a metadata record on the namenode, making it inefficient. If, say, a 1 GB dataset is broken into 1000 small files, the namenode must keep metadata for all 1000 of them, and its memory is wasted storing and managing that metadata.

Hadoop Archive (HAR) was created to handle this problem. It packs small files together into one compact HDFS block to avoid wasting namenode memory, and the archived files can be used directly as input to MapReduce jobs. An archive always carries the *.har extension. A HAR is built from a collection of files by an archiving tool that runs a MapReduce job; its map tasks process the input files in parallel to create the archive file.

HAR syntax:
hadoop archive -archiveName NAME -p <parent path> <src>* <dest>

Example:
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo

This archives the directories dir1 and dir2, resolved relative to the parent path /user/hadoop, into /user/zoo/foo.har. To archive everything under a single directory, the sources can be omitted:

hadoop archive -archiveName myhar.har -p /input/location /output/location
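As a fuller illustration, here is a minimal end-to-end sketch; the directory and file names (smallfiles, file1.txt, and so on) are hypothetical, chosen only for this example:

# copy a few small files into HDFS
$ hadoop fs -mkdir -p /user/hadoop/smallfiles
$ hadoop fs -put file1.txt file2.txt file3.txt /user/hadoop/smallfiles

# pack the directory into an archive; this launches a MapReduce job
$ hadoop archive -archiveName small.har -p /user/hadoop smallfiles /user/archives

After the job finishes, the archive appears as /user/archives/small.har, and its contents can be addressed as har:///user/archives/small.har/smallfiles.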



Using a HAR as MapReduce input:
If you have a Hadoop archive stored in HDFS at /user/zoo/foo.har, then to use it as MapReduce input all you need to do is specify the input directory as har:///user/zoo/foo.har.

If we list the archive file:
$ hadoop fs -ls /data/myArch.har
/data/myArch.har/_index
/data/myArch.har/_masterindex
/data/myArch.har/part-0

The part files are the original small files concatenated together into big files, and the index files are used to look up each small file inside the big part file. Files inside an archive are read through the har:// scheme, as shown in the sketch after the limitations list.

Limitations of HAR files:
1) Creating a HAR file makes a copy of the original files, so during creation we need as much extra disk space as the size of the files being archived. The originals can be deleted once the archive is created to release that space.
2) Once an archive is created, adding files to it or removing files from it requires re-creating the archive.
3) Reading a HAR file can require a large number of map tasks, which is inefficient.
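As an illustration, assuming the foo.har archive from the example above (small1.txt and wordcount.jar are hypothetical names used only here):

# list the logical contents of the archive
$ hadoop fs -ls har:///user/zoo/foo.har/dir1

# read one archived file directly
$ hadoop fs -cat har:///user/zoo/foo.har/dir1/small1.txt

# feed the archive to a MapReduce job as its input directory
$ hadoop jar wordcount.jar WordCount har:///user/zoo/foo.har/dir1 /user/zoo/wc-out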

Pawan Kumar Singh, AP, Deptt of CSE
