Hadoop Archive
Small files are problematic in HDFS and need to be handled efficiently. The name node stores the metadata for all HDFS data, so when a large input is split into many small files stored across the data nodes, the name node must keep a metadata record for every one of them. For example, if a 1 GB file is broken into 1000 pieces, the name node has to store metadata about all 1000 small files, and name node memory is wasted on storing and managing that metadata, which makes the name node inefficient.

Hadoop Archive (HAR) is a facility that addresses this problem by packing small files into compact HDFS blocks, avoiding the memory wastage on the name node. A HAR is created from a collection of files, and the archiving tool runs a MapReduce job whose map tasks process the input files in parallel to build the archive. The resulting archive always carries the *.har extension, and we can use it directly as input to MapReduce jobs.
HAR Syntax:
hadoop archive -archiveName NAME -p <parent path> <src>* <dest>

Examples:
hadoop archive -archiveName myhar.har -p /input/location /output/location
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
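
Once created, the archive is exposed through the har:// URI scheme, so its contents can be listed, read, and passed to MapReduce jobs like any other HDFS path. A minimal sketch using the foo.har archive from the example above; the file name dir1/somefile.txt, the examples jar path, and the output directory are hypothetical and depend on the installation:

# List everything stored inside the archive
hdfs dfs -ls -R har:///user/zoo/foo.har

# Read one of the archived files (somefile.txt is a hypothetical name)
hdfs dfs -cat har:///user/zoo/foo.har/dir1/somefile.txt

# Use an archived directory as MapReduce input (jar path is illustrative)
hadoop jar hadoop-mapreduce-examples.jar wordcount har:///user/zoo/foo.har/dir1 /output/wordcount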
The part files are the original small files concatenated together into big files, and the index files are used to look up the location of each small file inside the big part files.
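
Listing the archive directory itself (without the har:// scheme) shows this layout. A sketch of what the foo.har directory from the example above would typically contain:

hdfs dfs -ls /user/zoo/foo.har
# Typically shows:
#   /user/zoo/foo.har/_masterindex   index into the _index file
#   /user/zoo/foo.har/_index         per-file index (name, offset, length)
#   /user/zoo/foo.har/part-0         the archived file contents concatenated together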