
Real-Time Processing of Events (Sensor, Telecommunications, Fraud Etc.) Even

Apache Hadoop v2 represents a major shift in Hadoop's architecture with the introduction of YARN. YARN separates resource management from job processing, allowing Hadoop to support various workloads beyond batch processing like real-time analytics and interactive SQL queries. Key features of Hadoop v2 include YARN, high availability for HDFS, HDFS federation, and improved performance and Windows support. The community continues enhancing capabilities like YARN scheduling and long-running services.

Uploaded by

amitbcm007

Apache Hadoop v2 is not just a major release number; it represents a generational shift
in the architecture of Apache Hadoop. With YARN, Apache Hadoop is recast as a
significantly more powerful platform, one that takes Hadoop beyond merely batch
applications to taking its position as a data operating system.
To recap, Apache Hadoop v1 comprised HDFS and MapReduce.
With HDFS one could store data of all kinds; however, MapReduce was the only
programming model you could use to process that data in parallel. That was very limiting
since MapReduce, although very general, proved inadequate to satisfy all the demands
being placed on Apache Hadoop.
As Apache Hadoop crystallizes into a key component of a Modern Data Architecture,
users and customers want to store all data in HDFS and interact with that data in
multiple ways:

Real-time processing of events (sensor, telecommunications, fraud etc.) even
before they land on HDFS

Interactive query capabilities for interrogating new data for data analysts (SQL)
and data scientists (SQL plus scripting etc.)

The need to productionize the insight, i.e. batch processing, reporting etc., in a
well-defined and timely manner

The community has worked together to make HDFS itself a much more scalable,
efficient and enterprise-friendly storage platform by addressing key functionality:
High Availability for the HDFS NameNode, Federation for scaling, and HDFS Snapshots,
to list a few.
With YARN, Apache Hadoop now clearly delineates the system (resource management,
security, SLAs etc.) from the application framework (e.g. MapReduce) and allows for
multiple ways to interact with the data in HDFS (batch with MapReduce, streaming with
Apache Storm, interactive SQL with Apache Hive and Apache Tez).

We are already seeing the benefits of this vision in the form of many and varied
applications and services being re-vectored on top of YARN such as Apache Storm for
event processing, Apache Giraph for graph processing, Apache Tez for interactive SQL
queries, HOYA for running services such as Apache HBase and Apache Accumulo on
YARN and so on. Exciting times indeed!
As a result the Hadoop stack looks very different with Hadoop v2:

Personally, it's a huge thrill to see this baby grow up and reach adulthood since
the original Jira ticket (MAPREDUCE-279) opened more than 5 years ago!

Apache Hadoop v2
As a lot of people are aware, Apache Hadoop 2 landed the Beta tag a few months ago.
Since then the community has spent a lot of time validating the APIs, protocols and the
system itself. As a result we are now very confident in our ability to not only handle the
workloads that will be thrown at Apache Hadoop, but also in our ability to do so in a
forward-compatible manner such that Apache Hadoop v2 represents a stable base atop
which the ecosystem can flourish in the future.
For those who, like me, are more comfortable with simplified lists (*smile*), here are the
enhancements and major features:

YARN

High Availability for HDFS

HDFS Federation

HDFS Snapshots

NFSv3 access to data in HDFS

Binary Compatibility for MapReduce applications between Hadoop v1 and
Hadoop v2 to ease migration

Performance

Support for running Hadoop on Microsoft Windows

Integration testing for the entire Apache Hadoop ecosystem at the ASF.

Onwards
Although it's a major milestone and a big reason to celebrate, the Apache Hadoop
community will continue to drive it forward under the aegis of the ASF. There are
ever more things to do, use cases to fulfill and users to thrill. The HDFS community is
striving hard to finish up the addition of symlinks to HDFS, which just didn't make the
cut at the last minute. On the YARN side we plan to add more enhancements such as
advanced scheduling features, high availability for the YARN ResourceManager, enhanced
support for long-running services, and generally make it easier to run other applications
such as Apache Storm within YARN. Stay tuned!

Terminology and Architecture


MapReduce from Hadoop 1 (MapReduce 1) has been split into two components. The cluster resource
management capabilities have become YARN (Yet Another Resource Negotiator), while the
MapReduce-specific capabilities remain MapReduce. In the MapReduce 1 architecture, the cluster was
managed by a service called the JobTracker. TaskTracker services lived on each node and would launch
tasks on behalf of jobs. The JobTracker would also serve information about completed jobs. In
MapReduce 2, the functions of the JobTracker have been split among three services. The ResourceManager
is a persistent YARN service that receives and runs applications (a MapReduce job is an application) on
the cluster. It contains the scheduler, which, as previously, is pluggable. The MapReduce-specific
capabilities of the JobTracker have been moved into the MapReduce Application Master, one of which is
started to manage each MapReduce job and terminated when the job completes. The JobTracker's function
of serving information about completed jobs has been moved to the JobHistoryServer. The TaskTracker has
been replaced with the NodeManager, a YARN service that manages resources and deployment on a node. It
is responsible for launching containers, each of which can house a map or reduce task.
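
The division of labour described above can be sketched as a toy model in plain Python. This is only an illustration of the roles, not Hadoop code: the class names mirror the services named in the text, but every method, the "container slot" unit and the placement rule are assumptions made up for this sketch, and tasks "finish" instantly.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for the per-node YARN NodeManager."""
    name: str
    capacity: int  # container slots; a hypothetical unit for this sketch
    running: list = field(default_factory=list)

    def free_slots(self):
        return self.capacity - len(self.running)

    def launch(self, label):
        if self.free_slots() <= 0:
            raise RuntimeError(f"{self.name} has no free container slots")
        self.running.append(label)

    def release(self, label):
        self.running.remove(label)


class JobHistoryServer:
    """Keeps information about completed jobs."""
    def __init__(self):
        self.completed = {}

    def record(self, job_id, summary):
        self.completed[job_id] = summary


class ResourceManager:
    """Accepts applications and hands out containers on NodeManagers."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label):
        # Simplistic placement (an assumption): node with the most free slots.
        node = max(self.nodes, key=lambda n: n.free_slots())
        node.launch(label)
        return node

    def submit(self, job_id, maps, reduces, history):
        # One Application Master is started per job and stopped when it ends.
        am_label = f"{job_id}/appmaster"
        am_node = self.allocate(am_label)
        try:
            MapReduceAppMaster(job_id, self, history).run(maps, reduces)
        finally:
            am_node.release(am_label)


class MapReduceAppMaster:
    """Holds the MapReduce-specific logic for a single job."""
    def __init__(self, job_id, rm, history):
        self.job_id, self.rm, self.history = job_id, rm, history

    def run(self, maps, reduces):
        tasks = [f"{self.job_id}/map-{i}" for i in range(maps)]
        tasks += [f"{self.job_id}/reduce-{i}" for i in range(reduces)]
        placements = []
        for task in tasks:
            node = self.rm.allocate(task)  # each container houses one task
            placements.append((task, node.name))
            node.release(task)             # the task "finishes" immediately
        self.history.record(
            self.job_id, {"tasks": len(tasks), "placements": placements}
        )


nodes = [NodeManager("node-1", 2), NodeManager("node-2", 2)]
history = JobHistoryServer()
ResourceManager(nodes).submit("job_1", maps=2, reduces=1, history=history)
print(history.completed["job_1"]["tasks"])
```

The point of the sketch is that the ResourceManager never needs to know what a map or reduce task is; only the per-job Application Master does, which is what lets other frameworks plug in their own masters.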

The new architecture has its advantages. First, by breaking up the JobTracker into a few different
services, it avoids many of the scaling issues faced by MapReduce in Hadoop 1. More importantly, it
makes it possible to run frameworks other than MapReduce on a Hadoop cluster. For example, Impala
can also run on YARN and share resources on a cluster with MapReduce.
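
To make the idea of a pluggable scheduler sharing one cluster between frameworks concrete, here is a minimal sketch in plain Python. It is not how Hadoop's own schedulers are implemented: `fifo_scheduler`, `fair_scheduler` and `share_cluster` are hypothetical names, and the two policies are deliberately simplified stand-ins for "first come, first served" and "even shares".

```python
def fifo_scheduler(pending):
    """Serve the application that arrived first until it is satisfied."""
    return min(pending, key=lambda app: app["arrival"])


def fair_scheduler(pending):
    """Serve the application currently holding the fewest containers."""
    return min(pending, key=lambda app: (app["held"], app["arrival"]))


def share_cluster(apps, total_containers, scheduler):
    """Hand out containers one at a time using the given policy.

    apps is a list of (name, containers_wanted) pairs in arrival order.
    Returns a dict mapping each application name to containers received.
    """
    state = [
        {"name": name, "arrival": i, "wanted": wanted, "held": 0}
        for i, (name, wanted) in enumerate(apps)
    ]
    for _ in range(total_containers):
        pending = [a for a in state if a["held"] < a["wanted"]]
        if not pending:
            break  # every application has everything it asked for
        scheduler(pending)["held"] += 1
    return {a["name"]: a["held"] for a in state}


# Two frameworks each ask for 6 containers on a cluster that has 8.
apps = [("mapreduce", 6), ("impala", 6)]
fifo_result = share_cluster(apps, 8, fifo_scheduler)
fair_result = share_cluster(apps, 8, fair_scheduler)
print(fifo_result)
print(fair_result)
```

Because the policy is just a function passed in, swapping it changes how the same cluster is divided without touching either framework, which is the property the text attributes to the ResourceManager's scheduler.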

http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
