A Report On
HADOOP
Index of Topics:
Abstract
Introduction
What is MapReduce?
HDFS
Design
Concepts
Cluster Rebalancing
Data Integrity
Accessibility
Hadoop Archives
Bibliography
Abstract
Problem Statement:
The total amount of digital data in the world has exploded in recent years. This has
happened primarily due to information (or data) generated by various enterprises all over
the globe. In 2006, the universal digital data was estimated to be 0.18 zettabytes, and
was forecast to grow tenfold by 2011 to 1.8 zettabytes.
1 zettabyte = 10^21 bytes
The problem is that while the storage capacities of hard drives have increased
massively over the years, access speeds (the rate at which data can be read from
drives) have not kept up. A typical drive from 1990 could store 1,370 MB of data and
had a transfer speed of 4.4 MB/s, so all the data on a full drive could be read in
around 300 seconds. In 2010, 1 TB drives were the standard hard disk size, but the
transfer speed is around 100 MB/s, so it takes more than two and a half hours to read
all the data off the disk.
Solution Proposed:
Parallelisation:
An obvious solution to this problem is parallelisation. The input data is usually large,
and the computations have to be distributed across hundreds or thousands of machines in
order to finish in a reasonable amount of time. Reading 1 TB from a single hard drive
takes a long time, but if the work is spread over 100 machines, each machine only has to
read about 10 GB, which at 100 MB/s takes roughly two minutes.
The key issues involved in this solution are:
Hardware failure
Even though HDFS and MapReduce are the most significant features of Hadoop,
other subprojects provide complementary services:
Core
Avro
Pig
HBase
ZooKeeper
Hive
Chukwa
Introduction
The features of Hadoop that stand out are its simplified programming model and its
efficient, automatic distribution of data and work across the machines of a cluster.
We now take a deeper look at these two main features of Hadoop and describe their
important characteristics.
1. Data Distribution:
In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in.
The Hadoop Distributed File System (HDFS) splits large data files into chunks which are
managed by different nodes in the cluster. In addition, each chunk is replicated across
several machines, so that a single machine failure does not result in any data being
unavailable. If a failure does leave only partial copies of some chunks, the affected data is
re-replicated from the surviving copies. Even though the file chunks are replicated and
distributed across several machines, they form a single namespace, so their contents are
universally accessible.
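As an illustration of how this distribution is visible to clients, the sketch below uses the standard FileSystem API to ask the namenode where the blocks of a file, and their replicas, are stored. The namenode address and file path are hypothetical.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Hypothetical namenode address and file path, for illustration only.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    FileStatus status = fs.getFileStatus(new Path("/data/big-input.txt"));

    // One BlockLocation per block; each lists the DataNodes holding a replica.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset %d, length %d, hosts %s%n",
          block.getOffset(), block.getLength(),
          String.join(", ", block.getHosts()));
    }
  }
}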
Data is conceptually record-oriented in the Hadoop programming framework. Individual
input files are broken into segments, and each segment is processed by a node.
The Hadoop framework schedules the processes to be run in proximity to the location of
the data/records, using knowledge from the distributed file system. Each computation
process running on a node operates on a subset of the data, and which data is operated on
by which node is decided based on its proximity to the node: most data is read from the
local disk straight into the CPU, alleviating strain on network bandwidth and preventing
unnecessary network transfers. This strategy of moving the computation to the data,
instead of moving the data to the computation, allows Hadoop to achieve high data
locality, which in turn results in high performance.
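The locality information that this scheduling relies on is exposed through the input splits of a MapReduce job. The short sketch below (hypothetical input path, newer org.apache.hadoop.mapreduce API) simply prints the hosts that hold the data for each split of a file; it is an illustration, not part of a real job.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ShowSplitLocations {
  public static void main(String[] args) throws IOException, InterruptedException {
    Job job = Job.getInstance(new Configuration());
    FileInputFormat.addInputPath(job, new Path("/data/big-input.txt"));

    // Each split knows which hosts hold its data; the scheduler uses this
    // to run map tasks close to their records.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    for (InputSplit split : splits) {
      System.out.println(String.join(", ", split.getLocations()));
    }
  }
}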
Hadoop limits the amount of communication that can be performed by the processes,
as each individual record is processed by a task in isolation from the others. This makes
the whole framework much more reliable. Programs must be written to conform to a
particular programming model, named "MapReduce". In MapReduce, records are processed
in isolation by tasks called Mappers. The output from the Mappers is then brought together
by Reducers, where results from different mappers are merged together.
Separate nodes in a Hadoop cluster communicate only implicitly. Pieces of data can be tagged
with key names which inform Hadoop how to send related bits of information to a common
destination node. Hadoop internally manages all of the data transfer and cluster topology issues.
By restricting the communication between nodes, Hadoop makes the distributed system
much more reliable. Individual node failures can be worked around by restarting tasks on
other machines. The other workers continue to operate as though nothing went wrong,
leaving the challenging aspects of partially restarting the program to the underlying
Hadoop layer.
What is MapReduce?
MapReduce is a programming model for processing and generating large data sets. Users specify a
map function that processes a key/value pair to generate a set of intermediate key/value
pairs, and a reduce function that merges all intermediate values associated with the
same intermediate key.
Programs written in this functional style are automatically parallelized and executed on a large
cluster of commodity machines. The run-time system takes care of the details of partitioning the
input data, scheduling the program's execution across a set of machines, handling machine
failures, and managing the required inter-machine communication (i.e., this whole procedure is
abstracted away from the user, who can focus on the computational problem).
Note: This abstraction was inspired by the map and reduce primitives present in Lisp and many
other functional languages.
The computation takes a set of input key/value pairs, and produces a set of output key/value
pairs. The user of the MapReduce library expresses the computation as two functions: Map and
Reduce.
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.
The MapReduce library groups together all intermediate values associated with the same
intermediate key I and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of values
for that key. It merges together these values to form a possibly smaller set of values. Typically just
zero or one output value is produced per Reduce invocation. The intermediate values are supplied
to the user's reduce function via an iterator, which allows the model to handle lists of values that
are too large to fit in memory.
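A minimal word-count sketch in the style of the standard Hadoop MapReduce API illustrates the two functions: the map function emits an intermediate (word, 1) pair for every word it sees, and the reduce function sums the values for each word. The class and method bodies below are illustrative and assume the org.apache.hadoop.mapreduce API.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: emit an intermediate (word, 1) pair for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all the values received for the same intermediate key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // typically one output value per key
    }
  }
}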
When there is a single reduce task that is fed by all of the map tasks, the sorted map
outputs have to be transferred across the network to the node where the reduce task is
running, where they are merged and then passed to the user-defined reduce function.
The output of the reducer is normally stored in HDFS for reliability. For each HDFS block
of the reduce output, the first replica is stored on the local node, with other replicas being
stored on off-rack nodes.
The map tasks partition their output, each creating one partition for each reduce task.
There can be many keys (and their associated values) in each partition, but the records
for every key are all in a single partition.
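The grouping of keys into partitions is done by a partitioner. The sketch below mirrors the behaviour of Hadoop's default hash partitioner: records with the same key always hash to the same partition, and hence end up at the same reduce task.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: same key -> same partition -> same reduce task.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}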
HDFS
Filesystems that manage the storage across a network of machines are called
distributed filesystems. They are network-based, and thus all the complications of
network programming are also present in a distributed filesystem. Hadoop comes with a
distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold
very large amounts of data (terabytes or even petabytes), and provide high-throughput
access to this information.
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or
terabytes in size. There are Hadoop clusters running today that store petabytes of data.
HDFS is built around the idea that the most efficient data processing pattern is a write-
once, read-many-times pattern. A dataset is typically generated or copied from source,
then various analyses are performed on that dataset over time. Each analysis will
involve a large proportion, if not all, of the dataset, so the time to read the whole dataset
is more important than the latency in reading the first record.
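From a client's point of view, this write-once, read-many-times pattern looks roughly like the sketch below, which uses the standard FileSystem API. The namenode address and file path are hypothetical.

import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteOnceReadMany {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Hypothetical namenode address and path, for illustration only.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    Path data = new Path("/data/events.log");

    // Write the dataset once...
    try (FSDataOutputStream out = fs.create(data)) {
      out.write("one record\n".getBytes(StandardCharsets.UTF_8));
    }

    // ...then read it back (typically many times, for different analyses).
    try (FSDataInputStream in = fs.open(data)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}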
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. A scheme might
automatically move data from one DataNode to another if the free space on a DataNode falls
below a certain threshold. In the event of a sudden high demand for a particular file, a
scheme might dynamically create additional replicas and rebalance other data in the cluster.
These types of data rebalancing schemes are not yet implemented.
Data Integrity
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption
can occur because of faults in a storage device, network faults, or buggy software. The
HDFS client software implements checksum checking on the contents of HDFS files. When a
client creates an HDFS file, it computes a checksum of each block of the file and stores
these checksums in a separate hidden file in the same HDFS namespace. When a client
retrieves file contents it verifies that the data it received from each DataNode matches the
checksum stored in the associated checksum file. If not, then the client can opt to retrieve
that block from another DataNode that has a replica of that block.
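The sketch below is not the HDFS implementation, only an illustration of the idea: a CRC-32 checksum is computed for every fixed-size chunk of a block when the data is written, and recomputed and compared when the data is read back; a mismatch signals a corrupt replica, and the client can then try another DataNode.

import java.util.zip.CRC32;

public class ChecksumSketch {
  static final int BYTES_PER_CHECKSUM = 512;   // assumed chunk size

  // Compute one CRC-32 value per chunk of the block.
  static long[] checksum(byte[] block) {
    int chunks = (block.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
    long[] sums = new long[chunks];
    for (int i = 0; i < chunks; i++) {
      CRC32 crc = new CRC32();
      int off = i * BYTES_PER_CHECKSUM;
      crc.update(block, off, Math.min(BYTES_PER_CHECKSUM, block.length - off));
      sums[i] = crc.getValue();
    }
    return sums;
  }

  // On read: recompute and compare against the stored checksums.
  static boolean verify(byte[] block, long[] storedSums) {
    long[] actual = checksum(block);
    if (actual.length != storedSums.length) {
      return false;
    }
    for (int i = 0; i < actual.length; i++) {
      if (actual[i] != storedSums[i]) {
        return false;   // corrupt replica: fall back to another DataNode
      }
    }
    return true;
  }
}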
Hadoop Archives:
HDFS stores small files inefficiently, since each file is stored in a block, and block metadata
is held in memory by the namenode. Thus, a large number of small files can eat up a lot of
memory on the namenode. (Note, however, that small files do not take up any more disk
space than is required to store the raw contents of the file. For example, a 1 MB file stored
with a block size of 128 MB uses 1 MB of disk space, not 128 MB.) Hadoop Archives, or
HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently,
thereby reducing namenode memory usage while still allowing transparent access to files. In
particular, Hadoop Archives can be used as input to MapReduce.
A Hadoop Archive is created from a collection of files using the archive tool. The tool
runs a MapReduce job to process the input files in parallel, so you need a running
MapReduce cluster in order to use it.
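For illustration, an archive might be created and then read back roughly as follows. The paths are hypothetical, and the exact flags of the archive tool vary between Hadoop versions, so check the tool's usage message; the Java code simply lists the archive's contents through the "har" URI scheme.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHarContents {
  public static void main(String[] args) throws IOException {
    // The archive itself is built from the command line, for example:
    //   hadoop archive -archiveName files.har -p /my/files /my
    // (paths and flags are illustrative; check the archive tool's usage message)
    Configuration conf = new Configuration();
    Path har = new Path("har:///my/files.har");   // hypothetical archive location
    FileSystem fs = FileSystem.get(URI.create("har:///my/files.har"), conf);
    for (FileStatus status : fs.listStatus(har)) {
      System.out.println(status.getPath());
    }
  }
}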
Limitations
There are a few limitations to be aware of with HAR files. Creating an archive creates a
copy of the original files, so you need as much disk space as the files you are archiving
to create the archive (although you can delete the originals once you have created the
archive). There is currently no support for archive compression, although the files that go
into the archive can be compressed (HAR files are like tar files in this respect). Archives
are immutable once they have been created. To add or remove files, you must recreate
the archive. In practice, this is not a problem for files that don’t change after being
written, since they can be archived in batches on a regular basis, such as daily or
weekly. As noted earlier, HAR files can be used as input to MapReduce. However, there
is no archive-aware InputFormat that can pack multiple files into a single MapReduce
split, so processing lots of small files, even in a HAR file, can still be inefficient.