
Big Data Analytics

Unit-II: Hadoop Framework

– Requirement of Hadoop Framework


Hadoop Framework - Components, and Uses
If you are learning about Big Data, you are bound to come across mentions of the "Hadoop
Framework". The rise of big data and its analytics have made the Hadoop framework very popular.
Hadoop is open-source software, meaning the bare software is easily available for free and customizable
according to individual needs.
This helps in tailoring the software to the specific needs of the big data being handled. As we know, big data is a term used to refer to data that cannot be stored, processed, or analyzed using traditional mechanisms. This is because of the defining characteristics of big data: it has a high volume, it is generated at great speed, and it comes in
many varieties.

Since the traditional frameworks are ineffective in handling big data, new techniques had to be
developed to combat it. This is where the Hadoop framework comes in. The Hadoop framework is
primarily based on Java and is used to deal with big data.

What is Hadoop?

Hadoop is a data handling framework written primarily in Java, with some secondary code in shell script
and C. It uses a simple programming model and is able to deal with large datasets. It was developed
by Doug Cutting and Mike Cafarella. This framework uses distributed storage and parallel processing to
store and manage big data. It is one of the most widely used pieces of big data software.

Hadoop consists mainly of three components: Hadoop HDFS, Hadoop MapReduce, and Hadoop YARN.
These components come together to handle big data effectively. These components are also known
as Hadoop modules.
Hadoop is slowly becoming a mandatory skill required from a data scientist. Companies looking to invest
in Big Data technology are increasingly giving more importance to Hadoop, making it a valuable skill
upgrade for professionals. Hadoop 3.x is the latest version of Hadoop.
How Does Hadoop Work?

Hadoop's concept is rather straightforward. The volume, variety, and velocity of big data pose problems. Building a single server with a heavy setup that could handle such a vast, ever-growing data pool would not be viable. It is simpler instead to connect numerous commodity computers into a single cluster.

This turns them into a distributed system that works as one system. The clustered computers can then work in parallel towards the same objective, which makes handling large amounts of data both faster and cheaper.

This can be better understood with the help of an example. Imagine a carpenter who primarily makes
chairs and stores them at his warehouse before they are sold. At some point, the market demands other
products like tables and cupboards. So now the same carpenter is working on all three products.
However, this is depleting his energy, and he is not able to keep up with producing all three.

He decides to enlist the help of two apprentices, who each work on one product. Now they are able to produce at a good rate, but a problem regarding storage arises. The carpenter cannot keep buying a bigger and bigger warehouse every time demand or the product range grows. Instead, he takes three smaller storage units for the three different products.

The carpenter in this analogy can be compared to the server that manages data. The increase in demand, expressed in the variety, velocity, and volume of the products, creates big data that is too much for the server to handle alone.

Now that he has two apprentices reporting to him, they all work towards the same objective, much like a cluster of machines working in parallel under one coordinating master. Storage is split into units curated by product variety to prevent a bottleneck. This is essentially how Hadoop functions.

Main Components of Hadoop Framework


There are three core components of Hadoop as mentioned earlier. They are HDFS, MapReduce, and
YARN. These together form the Hadoop framework architecture.
 HDFS (Hadoop Distributed File System):
It is the data storage system of Hadoop. Since the data sets are huge, it uses a distributed system to store the data. Files are stored in blocks, where each block is 128 MB by default. HDFS consists of a NameNode and DataNodes; a basic cluster has a single active NameNode but multiple DataNodes.

Features:

 The storage is distributed to handle a large data pool


 Distribution increases data security
 It is fault-tolerant: if a block on one node fails, replicas of that block on other nodes can take over
 MapReduce:
The MapReduce framework is the processing unit. All data is distributed and processed in parallel. There is a MasterNode that distributes data amongst SlaveNodes. The SlaveNodes do the processing and send the results back to the MasterNode.
Features:

 Consists of two phases, Map Phase and Reduce Phase.


 Processes big data faster, with multiple nodes working in parallel under one coordinating master (see the WordCount sketch after this component list)
 YARN (Yet Another Resource Negotiator):
It is the resource management unit of the Hadoop framework. The data stored in HDFS can be processed with the help of YARN by different data processing engines, such as batch, interactive, and stream processing engines. YARN allocates the cluster resources needed for any sort of data analysis.
Features:

 It acts as an operating system for the cluster, managing the resources used to process the data stored on HDFS
 It helps to schedule the tasks to avoid overloading any system
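
To make the Map and Reduce phases above concrete, here is a minimal word count sketch written against the standard Hadoop MapReduce Java API. The class names are illustrative, and a real job would also need a driver that sets the input/output paths and submits the job.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountExample {

    // Map phase: emit (word, 1) for every word in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts that the shuffle grouped under each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}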

Advantages of the Hadoop framework

The Hadoop framework has become the most widely used tool for handling big data because of the various benefits it offers.
 Data Locality:
The concept is rather simple. The pool of data is very large, and it would be very slow and tiresome to move the data to the computation logic. With data locality, the computation logic is instead moved toward the data. This makes processing much faster.

 Faster Data Processing:


As we saw earlier, the data is stored in small blocks by the HDFS file system. This makes it possible to process the data in parallel across the cluster with the help of MapReduce. This makes the performance level very high compared to any traditional system.

 Inbuilt fault tolerance:


The problem with using clusters of smaller computers is that the risk of one of them crashing is very real. This is solved by the high level of fault tolerance built into the Hadoop platform. Because of the many DataNodes present, combined with parallel processing and distributed storage, data is available on multiple nodes, so healthy nodes can take over and provide cover for any node that crashes. In fact, Hadoop keeps three copies of each file block by default. This ensures that any fault in the system is tolerated.
 High Availability:
This refers to the high and easy availability of data on the Hadoop cluster. Due to the inbuilt fault tolerance, the data is reliable and easily accessible. Processed data can also be accessed easily using YARN.

 Highly Scalable:
This basically refers to the flexibility one has in scaling the machines, or nodes, used for data processing up or down. Since multiple machines work in parallel within the same cluster, nodes can be added or removed according to changes in the volume of data or the requirements of the organization.

 Flexibility:
The Hadoop framework is written mainly in Java (with some C), so it can easily be run on any system. Further, it can be
curated to suit the specific needs of the type of data. It can handle both structured and unstructured
data efficiently. It can handle very different kinds of data sets, ranging from social media analysis to data
warehousing.
 Open Source:
It means it is free to use. Since it is an open-source project, the source code is available online for
anyone to make modifications to. This allows the Hadoop software to be curated according to very
specific needs.
 Easy to Use:
Hadoop is easy to use because developers need not worry about the distributed processing work; it is managed by Hadoop itself. The Hadoop ecosystem is also very large and comes with many tools like
Hive, Pig, Spark, HBase, Mahout, etc.

 Cost-Effective:
Not only is it highly efficient and customizable, but it also reduces the cost of processing such data
significantly. Traditional data processing would require investments in very large server systems for a
less efficient model. This framework instead uses clusters of cheaper commodity hardware to deliver a very efficient system. This makes it highly preferred by organizations.

Design Principles of the Hadoop Distributed File System (HDFS)


HDFS can be considered the ‘secret sauce’ behind the flexibility and scalability of Hadoop. Let’s review
the key principles underlying this innovative system.

Data Replication and Fault Tolerance

HDFS is designed with data replication in mind, to offer fault tolerance and high availability. Data is
divided into blocks, and each block is replicated across multiple nodes in the cluster. This strategy
ensures that even if a node fails, the data is not lost, as it can be accessed from the other nodes where
the blocks are replicated.
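
As a small, hedged illustration of how replication can be controlled from client code, the sketch below uses the Hadoop FileSystem API; the file path is hypothetical, and the cluster-wide default normally comes from the dfs.replication property (3 by default) in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of one (hypothetical) file to 4 replicas
        fs.setReplication(new Path("/user/demo/important.log"), (short) 4);
        // Print the default replication used for new files
        System.out.println("Default replication: " + fs.getDefaultReplication(new Path("/")));
        fs.close();
    }
}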

Data Locality

Hadoop architecture is designed considering data locality to improve the efficiency of data processing.
Data locality refers to the ability to move the computation close to where the data resides in the
network, rather than moving large amounts of data to where the application is running. This approach
minimizes network congestion and increases the overall throughput of the system.
Storage Formats

HDFS provides a choice of storage formats. These formats can significantly impact the system’s
performance in terms of processing speed and storage space requirements. Hadoop supports file
formats like Text files, Sequence Files, Avro files, and Parquet files. The best file format for your specific
use case will depend on the data characteristics and the specific requirements of the application.
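
As one hedged example of these formats, the sketch below writes a small Sequence File of binary key-value pairs with the Hadoop SequenceFile API; the output path and the key/value contents are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/pairs.seq");   // hypothetical output path
        // Open a writer for (Text, IntWritable) key-value pairs
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("alpha"), new IntWritable(1));
            writer.append(new Text("beta"), new IntWritable(2));
        }
    }
}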

Comparison with Other Systems - Hadoop Components – Hadoop 1 vs. Hadoop 2
Hadoop 1 vs Hadoop 2 Daemons

If you look into the Hadoop 1.0 daemons, you will find the following important ones:

Namenode

Datanode

JobTracker

Tasktracker

But in Hadoop 2, JobTracker and TaskTracker no longer exist. In Hadoop 1, both application management and resource management were done by MapReduce; with Hadoop 2, resource management has been handed over to a new component called YARN (Yet Another Resource Negotiator).

And so, with Hadoop 2, MapReduce handles application management and YARN handles resource management.

YARN introduces two new daemons with Hadoop 2, and those are:

Resource Manager

Node Manager

These two new Hadoop 2 daemons have replaced the JobTracker and TaskTracker of Hadoop 1.
Hadoop 1 vs Hadoop 2

Feature: Architecture
Hadoop 1: Single NameNode architecture containing one NameNode.
Hadoop 2: Introduces High Availability (HA): multiple active and standby NameNodes for fault tolerance and no single point of failure.

Feature: Scalability
Hadoop 1: Limited scalability, with a few thousand nodes per cluster.
Hadoop 2: Improved scalability with tens of thousands of nodes, making it suitable for large-scale data processing.

Feature: Job Execution
Hadoop 1: Uses the MapReduce processing model for job execution.
Hadoop 2: Brought in the YARN framework, which splits the tasks of managing resources and scheduling jobs out of the MapReduce framework.

Feature: Compatibility
Hadoop 1: Compatible with Hadoop 2.
Hadoop 2: Maintains backward compatibility with Hadoop 1.

Feature: Data Processing
Hadoop 1: Primarily focuses on batch processing of data.
Hadoop 2: Supports both real-time and batch processing; real-time processing happens with frameworks like Spark and Storm.

Feature: Ecosystem Integration
Hadoop 1: Limited ecosystem integration; supports fewer data processing tools than Hadoop 2.
Hadoop 2: Enhanced ecosystem integration; supports diverse data processing tools like MapReduce, Apache Hive, HBase, Pig, and more.
Conclusion

Hadoop 1 laid the groundwork for big data processing, but Hadoop 2 delivered substantial upgrades and
innovations.

Introducing YARN (Yet Another Resource Negotiator) in Hadoop 2 improved scalability and resource
management.

Hadoop 2 enabled the concurrent use of many data processing frameworks, such as MapReduce, Apache Spark, and others.

Hadoop 2 addressed Hadoop 1's constraints regarding scalability, dependability, and task management.

Hadoop 2 improved its data processing capabilities, making it better suited for real-time and interactive
applications.

Hadoop – Daemons and Their Features


A daemon means a process. Hadoop daemons are the set of processes that run on Hadoop. Since Hadoop is a framework written in Java, all these processes are Java processes.

Apache Hadoop 2 consists of the following Daemons:

NameNode

DataNode

Secondary Name Node

Resource Manager

Node Manager

Namenode, Secondary NameNode, and Resource Manager work on a Master System while the Node
Manager and DataNode work on the Slave machine.
1. NameNode

NameNode works on the master system. The primary purpose of the NameNode is to manage all the metadata. Metadata is the information about the files stored in HDFS (the Hadoop Distributed File System). As we know, the data is stored in the form of blocks in a Hadoop cluster, so the DataNode, or location, at which each block of a file is stored is recorded in the metadata. All information regarding the logs of the transactions happening in a Hadoop cluster (when or who read/wrote the data) is also kept in the metadata. The metadata is stored in memory.

Features:

It never stores the data that is present in the file.

As Namenode works on the Master System, the Master system should have good processing power and
more RAM than Slaves.

It stores information about the DataNodes, such as their block IDs and number of blocks.
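
To see the block-location metadata that the NameNode serves, a client can ask for it through the FileSystem API. The sketch below is only illustrative and assumes a hypothetical file path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt")); // hypothetical file
        // The NameNode answers with the DataNodes that hold each block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}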

How to start Name Node?

hadoop-daemon.sh start namenode

How to stop Name Node?

hadoop-daemon.sh stop namenode

2. DataNode

DataNode works on the slave system. The NameNode always instructs the DataNodes where to store the data. A DataNode is a program that runs on the slave system and serves the read/write requests from the client. As the actual data is stored on these DataNodes, they should have a large storage capacity to hold more data.

How to start Data Node?


hadoop-daemon.sh start datanode

How to stop Data Node?

hadoop-daemon.sh stop datanode

3. Secondary NameNode

The Secondary NameNode is used for taking hourly backups of the metadata. In case the Hadoop cluster fails or crashes, the Secondary NameNode has the hourly backups, or checkpoints, of that metadata stored in a file named fsimage. This file can then be transferred to a new system, the metadata is assigned to that new system, a new master is created with this metadata, and the cluster is made to run correctly again.
This is the benefit of the Secondary NameNode. In Hadoop 2, the High Availability and Federation features minimize the importance of the Secondary NameNode.

Major Function Of Secondary NameNode:

It merges the edit logs and the fsimage from the NameNode together.

It regularly reads the metadata from the NameNode's RAM and writes it to the hard disk.

As secondary NameNode keeps track of checkpoints in a Hadoop Distributed File System, it is also
known as the checkpoint Node.
The Hadoop Daemon’s Port

Name Node 50070

Data Node 50075

Secondary Name Node 50090

These ports can be configured manually in hdfs-site.xml and mapred-site.xml files.

4. Resource Manager

The Resource Manager, also known as the global master daemon, works on the master system. The Resource Manager manages the resources for the applications running in a Hadoop cluster. It mainly consists of two components:

1. ApplicationsManager
2. Scheduler
The ApplicationsManager is responsible for accepting a client's request and for negotiating a resource container on the slaves in the Hadoop cluster to host the ApplicationMaster. The Scheduler is responsible for allocating resources to the applications running in the Hadoop cluster.

How to start ResourceManager?

yarn-daemon.sh start resourcemanager

How to stop ResourceManager?

yarn-daemon.sh stop resourcemanager

5. Node Manager

The Node Manager works on the slave systems and manages the memory and disk resources within its node. Each slave node in a Hadoop cluster has a single NodeManager daemon running on it. It also sends this monitoring information to the Resource Manager.

How to start Node Manager?

yarn-daemon.sh start nodemanager

How to stop Node Manager?

yarn-daemon.sh stop nodemanager

In a Hadoop cluster, the Resource Manager and Node Manager can be tracked with specific URLs of the form http://<hostname>:port_number
The Hadoop Daemon’s Port

ResourceManager 8088

NodeManager 8042

Hadoop InputFormat & Types of InputFormat in MapReduce
What is Hadoop InputFormat?

Hadoop InputFormat describes the input-specification for execution of the Map-Reduce job.

InputFormat describes how to split up and read input files. In MapReduce job execution, InputFormat is
the first step. It is also responsible for creating the input splits and dividing them into records.

Input files store the data for a MapReduce job. Input files reside in HDFS. Although the format of these files is arbitrary, line-based log files and binary formats can also be used. Hence, in MapReduce, the InputFormat class is one of the fundamental classes, and it provides the functionality below:

InputFormat selects the files or other objects for input.

It also defines the data splits. It defines both the size of individual map tasks and their potential execution server.

Hadoop InputFormat defines the RecordReader. It is also responsible for reading actual records from the
input files.

How does the mapper get its data?

The InputFormat methods used to feed data to the mapper are getSplits() and createRecordReader(), which are declared as follows:

public abstract class InputFormat<K, V> {

    public abstract List<InputSplit> getSplits(JobContext context)
        throws IOException, InterruptedException;

    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
        TaskAttemptContext context)
        throws IOException, InterruptedException;
}
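
For reference, this is roughly how a job driver selects an InputFormat and supplies the input path; the path and job name are placeholders, and TextInputFormat is only set explicitly here because it is the default anyway.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDriverExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        job.setJarByClass(InputFormatDriverExample.class);
        // Tell the job which InputFormat to use for creating splits and records
        job.setInputFormatClass(TextInputFormat.class);
        // FileInputFormat supplies the directory whose files will be split and read
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));   // hypothetical path
        // ... mapper, reducer, and output settings would follow, then job.waitForCompletion(true)
    }
}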


Types of InputFormat in MapReduce

There are different types of MapReduce InputFormat in Hadoop, each used for a different purpose.
Let’s discuss the Hadoop InputFormat types below:

1. File Input Format

It is the base class for all file-based InputFormats. FileInputFormat also specifies the input directory where the data files are located. When we start a MapReduce job execution, FileInputFormat provides a path containing the files to read.

This InputFormat reads all the files in that directory and then divides them into one or more InputSplits.

2. Text Input Format

It is the default InputFormat. This InputFormat treats each line of each input file as a separate record. It
performs no parsing. TextInputFormat is useful for unformatted data or line-based records like log files.
Hence,

Key – It is the byte offset of the beginning of the line within the file (not globally unique on its own), so it will be unique if combined with the file name.

Value – It is the contents of the line. It excludes line terminators.

3. Key Value Text Input Format

It is similar to TextInputFormat. This InputFormat also treats each line of input as a separate record. The difference is that TextInputFormat treats the entire line as the value, whereas KeyValueTextInputFormat breaks the line itself into a key and a value at a tab character ('\t'). Hence,

Key – Everything up to the tab character.

Value – The remaining part of the line after the tab character.
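
A hedged driver-side sketch: the separator defaults to a tab but can be changed through the configuration property shown (a comma here is just an illustrative choice, and the job name is a placeholder).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the default tab separator with a comma (illustrative choice)
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
        Job job = Job.getInstance(conf, "kv-input-demo");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}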

4. Sequence File Input Format


It is an InputFormat which reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. They are block-compressed and provide direct serialization and deserialization of several arbitrary data types. Hence,

Key & Value both are user-defined.

5. Sequence File As Text Input Format

It is a variant of SequenceFileInputFormat. This format converts the sequence file keys and values to Text objects. It performs the conversion by calling 'toString()' on the keys and values. Hence, SequenceFileAsTextInputFormat makes sequence files suitable input for Streaming.

6. Sequence File As Binary Input Format

By using SequenceFileAsBinaryInputFormat, we can extract the sequence file's keys and values as opaque binary objects.

7. N line Input Format

It is another form of TextInputFormat where the keys are the byte offsets of the lines and the values are the contents of the lines. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input; the number depends on the size of the split and on the length of the lines. If we want our mappers to receive a fixed number of lines of input, we use NLineInputFormat.

N- It is the number of lines of input that each mapper receives.

By default (N=1), each mapper receives exactly one line of input.

Suppose N=2; then each split contains two lines, so one mapper receives the first two key-value pairs and another mapper receives the next two key-value pairs.
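
A small driver-side sketch of the N=2 case above; the job name is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineInputExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "nline-demo");
        job.setInputFormatClass(NLineInputFormat.class);
        // Each mapper now receives exactly two input lines (N = 2)
        NLineInputFormat.setNumLinesPerSplit(job, 2);
    }
}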

8. DB Input Format

This InputFormat reads data from a relational database using JDBC. It loads small datasets, perhaps for joining with large datasets from HDFS using MultipleInputs. Hence,

Key – LongWritable
Value – DBWritable.
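
A hedged sketch of wiring up DBInputFormat: the JDBC driver, connection details, table name, and the EmployeeRecord class are all hypothetical placeholders.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DBInputExample {
    // Hypothetical record type mapping one row of an "employees" table
    public static class EmployeeRecord implements Writable, DBWritable {
        private long id;
        private String name;
        public void readFields(ResultSet rs) throws SQLException { id = rs.getLong(1); name = rs.getString(2); }
        public void write(PreparedStatement ps) throws SQLException { ps.setLong(1, id); ps.setString(2, name); }
        public void readFields(DataInput in) throws IOException { id = in.readLong(); name = in.readUTF(); }
        public void write(DataOutput out) throws IOException { out.writeLong(id); out.writeUTF(name); }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // JDBC driver, URL and credentials below are placeholders
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost/demo", "user", "password");
        Job job = Job.getInstance(conf, "db-input-demo");
        job.setInputFormatClass(DBInputFormat.class);
        // Read the id and name columns of the employees table, ordered by id
        DBInputFormat.setInput(job, EmployeeRecord.class, "employees",
                null, "id", "id", "name");
    }
}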

What are Map Side Join and Reduce Side Join?

Two large datasets can also be joined in MapReduce programming. A join performed in the map phase is referred to as a map side join, while a join performed at the reduce side is called a reduce side join. Let's go into detail about why we would need to join data in MapReduce. Suppose dataset A has master data and dataset B has transactional data (A and B are just for reference); we need to join them on a common key to produce a result. It is important to realize that if the master data set is small, we can share it with side data sharing techniques (passing key-value pairs in the job configuration, or distributed caching), as sketched below. We use a MapReduce join only when both datasets are too big to use data sharing techniques.
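
As a hedged sketch of the side data sharing idea mentioned above, the mapper below loads a small master file from the distributed cache in setup() and joins it against the large transactional input; the file names and the comma-separated record format are assumptions made for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> master = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // The small master data set A was added in the driver with:
        //   job.addCacheFile(new URI("/user/demo/master.txt#master.txt"));
        // so it is available under "master.txt" in the task's working directory.
        try (BufferedReader reader = new BufferedReader(new FileReader("master.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);     // assumed format: key,description
                master.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",", 2); // transactional record B: key,details
        String lookup = master.get(parts[0]);
        if (lookup != null) {                            // inner join: keep only matching keys
            context.write(new Text(parts[0]), new Text(lookup + "," + parts[1]));
        }
    }
}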
Joining data directly in MapReduce is not the recommended way; the same problem can be addressed through high-level frameworks like Hive or Cascading. But if you are in that situation, you can use the methods described below.

Map side Join

Joining at the map side performs the join before the data reaches the map function. It imposes strong prerequisites on the data before it can be joined at the map side. Both joining techniques come with their own pros and cons. A map side join can be more efficient than a reduce side join, but its strict format requirements are very tough to meet natively; if we have to prepare such data through other MapReduce jobs first, we may lose the expected performance advantage over a reduce side join. The prerequisites are:

The data should be partitioned and sorted in a particular way.

Each input dataset should be divided into the same number of partitions.

Each dataset must be sorted on the same key.

All the records for a particular key must reside in the same partition.

Reduce Side Join

A reduce side join, also called a repartitioned join or repartitioned sort-merge join, is the most commonly used join type. This type of join is performed at the reduce side, i.e. the data has to go through the sort and shuffle phase, which incurs network overhead. To keep it simple, we will list the steps that need to be performed for a reduce side join. A reduce side join uses a few terms, such as data source, tag, and group key; let's get familiar with them.

Data source refers to the data source files, probably taken from an RDBMS.

The tag is used to mark every record with its source name, so that its source can be identified at any given point of time, whether in the map or the reduce phase; why this is required is covered below.

The group key refers to the column used as the join key between the two data sources.

As we know, we are going to join this data on the reduce side, so we must prepare it in a way that it can be used for joining in the reduce phase. Let's have a look at the steps that need to be performed.
Map Phase

A routine map function is expected to emit (key, value). For a reduce side join, we instead design the map function so that it emits (key, source tag + value) for every record of each data source. This output then goes through the sort and shuffle phase; as we know, these operations are based on the key, so all the values from all sources for a particular key are clubbed together in one place, and this grouped data reaches the reducer.
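
A hedged sketch of such a tagging mapper; the input file names (customers and orders) and the comma-separated layout are assumptions made for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Derive the source tag from the input file name (assumed: customers*.txt and orders*.txt)
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        String tag = file.startsWith("customers") ? "A" : "B";
        String[] parts = value.toString().split(",", 2);  // assumed format: joinKey,rest
        // Emit (group key, tag + value) so the shuffle groups both sources under one key
        context.write(new Text(parts[0]), new Text(tag + "|" + parts[1]));
    }
}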

Reduce Phase

The reducer creates a cross product of the map output records for each key and hands it over to the combine function.

Combine function

Whether this reduce function performs an inner join or an outer join is decided in the combine function, and the desired output format is also decided at this place.

Please do not confuse this combine function with a Combiner; the two are different.
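
A hedged sketch of the reduce side, separating the tagged values and producing the inner-join cross product for each group key; the tag prefixes match the mapper sketch above.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> sourceA = new ArrayList<>();
        List<String> sourceB = new ArrayList<>();
        // Separate the tagged values back into their two data sources
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("A|")) sourceA.add(v.substring(2));
            else sourceB.add(v.substring(2));
        }
        // Inner join: cross product of the two lists for this group key.
        // For an outer join, keys with an empty opposite list would also be emitted.
        for (String a : sourceA) {
            for (String b : sourceB) {
                context.write(key, new Text(a + "," + b));
            }
        }
    }
}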
