Big Data - Unit 2 Hadoop Framework
Since traditional frameworks are ineffective at handling big data, new techniques had to be
developed to deal with it. This is where the Hadoop framework comes in. The Hadoop framework is
primarily based on Java and is used to deal with big data.
What is Hadoop?
Hadoop is a data handling framework written primarily in Java, with some secondary code in shell script
and C. It uses a simple programming model and is able to deal with large datasets. It was developed
by Doug Cutting and Mike Cafarella. The framework uses distributed storage and parallel processing to
store and manage big data, and it is one of the most widely used pieces of big data software.
Hadoop consists mainly of three components: Hadoop HDFS, Hadoop MapReduce, and Hadoop YARN.
These components come together to handle big data effectively. These components are also known
as Hadoop modules.
Hadoop is slowly becoming a mandatory skill required from a data scientist. Companies looking to invest
in Big Data technology are increasingly giving more importance to Hadoop, making it a valuable skill
upgrade for professionals. Hadoop 3.x is the latest version of Hadoop.
How Does Hadoop Work?
Hadoop's concept is rather straightforward. The volume, variety, and velocity of big data pose problems.
Building servers with ever-heavier setups to handle such a vast and ever-growing data pool would not be
viable. A simpler alternative is to connect numerous commodity computers, each with a single CPU, into a cluster.
This turns them into a distributed system that works as one system, which means the clustered
computers can work together in parallel to achieve the same objective. This speeds up the handling of
large amounts of data and reduces its cost.
This can be better understood with the help of an example. Imagine a carpenter who primarily makes
chairs and stores them at his warehouse before they are sold. At some point, the market demands other
products like tables and cupboards. So now the same carpenter is working on all three products.
However, this is depleting his energy, and he is not able to keep up with producing all three.
He decides to enlist the help of two other apprentices, who each work on one product. Now they are
able to produce at a good rate, but a problem regarding storage arises. The carpenter cannot keep buying a
bigger warehouse every time demand or the product range grows. Instead, he takes three smaller
storage units for the three different products.
The carpenter in this analogy can be compared to the server that manages data. The increase in demand,
expressed in the variety, velocity, and volume of the products, creates big data that is too much for the
server to handle alone.
With two apprentices reporting to him, they all work towards the same objective, mirroring the concept
of many computers, each with a single CPU, working together. Products are assigned to dedicated
storage units based on their variety to prevent a bottleneck. This is essentially how Hadoop functions.
Features:
It is a file system that acts as an operating system for the data stored on HDFS
It helps schedule tasks to avoid overloading any system
The Hadoop framework has become the most widely used tool for handling big data because of the
various benefits it offers.
Data Locality:
The concept is rather simple. The pool of data is very large, and it would be very slow and tiresome to
move the data to the computation logic. With data locality, the computation logic is instead
moved toward the data. This makes processing much faster.
Highly Scalable:
This refers to the flexibility one has in scaling the machines, or nodes, used for data processing
up or down. This is possible because multiple machines work in parallel within the same cluster. Scaling is
done according to changes in the volume of data or the requirements of the organization.
Flexibility:
Since the Hadoop framework is written in Java and C, it can easily run on any system. Further, it can be
curated to suit the specific needs of the type of data. It can handle both structured and unstructured
data efficiently, and it can handle very different kinds of data sets, ranging from social media analysis to
data warehousing.
Open Source:
Hadoop is free to use. Since it is an open-source project, the source code is available online for
anyone to modify. This allows the Hadoop software to be curated according to very
specific needs.
Easy to Use:
Hadoop is easy to use since developers need not worry about any of the distributed processing work; it is
managed by Hadoop itself. The Hadoop ecosystem is also very large and comes with many tools like
Hive, Pig, Spark, HBase, Mahout, etc.
Cost-Effective:
Not only is it highly efficient and customizable, but it also reduces the cost of processing such data
significantly. Traditional data processing would require investments in very large server systems for a
less efficient model. This framework instead relies on clusters of cheaper commodity machines to deliver
a very efficient system, which makes it highly preferred by organizations.
Data Replication
HDFS is designed with data replication in mind, to offer fault tolerance and high availability. Data is
divided into blocks, and each block is replicated across multiple nodes in the cluster. This strategy
ensures that even if a node fails, the data is not lost, as it can be accessed from the other nodes where
the blocks are replicated.
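The replication factor can be set cluster-wide, but it can also be inspected or changed per file from client code. Below is a minimal Java sketch using the Hadoop FileSystem API; the file path and the replication value of 3 are illustrative assumptions, not part of the original text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // handle to the configured file system (HDFS)
        Path file = new Path("/data/sample.txt");      // hypothetical file

        // Read the replication factor currently recorded by the NameNode for this file
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication: " + current);

        // Ask HDFS to keep three copies of every block of this file
        fs.setReplication(file, (short) 3);
    }
}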
Data Locality
Hadoop architecture is designed considering data locality to improve the efficiency of data processing.
Data locality refers to the ability to move the computation close to where the data resides in the
network, rather than moving large amounts of data to where the application is running. This approach
minimizes network congestion and increases the overall throughput of the system.
Storage Formats
HDFS provides a choice of storage formats. These formats can significantly impact the system’s
performance in terms of processing speed and storage space requirements. Hadoop supports file
formats like Text files, Sequence Files, Avro files, and Parquet files. The best file format for your specific
use case will depend on the data characteristics and the specific requirements of the application.
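As an illustration of one of these formats, the short Java sketch below writes a few key-value records into a SequenceFile; the output path and the record contents are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/data/events.seq");      // hypothetical output path in HDFS

        // Open a writer for a SequenceFile with IntWritable keys and Text values
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 0; i < 3; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));   // append one key-value pair
            }
        }
    }
}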
If you look at the Hadoop 1.0 daemons, you will find the following important ones:
Namenode
Datanode
JobTracker
Tasktracker
But in Hadoop 2, JobTracker and TaskTracker no longer exist. In Hadoop 1, both application management
and resource management were done by MapReduce, but in Hadoop 2, resource management was handed
over to a new component called YARN (Yet Another Resource Negotiator).
So, with Hadoop 2, MapReduce manages the applications and YARN manages the resources.
YARN has introduced two new daemons with Hadoop 2 and those are-
Resource Manager
Node Manager
These two new Hadoop 2 daemons replace the JobTracker and TaskTracker of Hadoop 1.
Hadoop 1 vs Hadoop 2
Hadoop 1 laid the groundwork for big data processing, but Hadoop 2 delivered substantial upgrades and
innovations.
Introducing YARN (Yet Another Resource Negotiator) in Hadoop 2 improved scalability and resource
management.
Hadoop 2 enabled many data processing frameworks, such as MapReduce, Apache Spark, and others, to run
concurrently.
Hadoop 2 addressed Hadoop 1's constraints regarding scalability, dependability, and task management.
Hadoop 2 improved its data processing capabilities, making it better suited for real-time and interactive
applications.
The main Hadoop 2 daemons are:
NameNode
DataNode
Resource Manager
Node Manager
The NameNode, Secondary NameNode, and Resource Manager run on the Master system, while the Node
Manager and DataNode run on the Slave machines.
1. NameNode
NameNode runs on the Master system. The primary purpose of the NameNode is to manage all the
metadata. Metadata is the information about the files stored in HDFS (Hadoop Distributed File System). As
we know, the data is stored in the form of blocks in a Hadoop cluster, so the metadata records the
DataNode, i.e., the location, at which each block of a file is stored. All information regarding the logs of
the transactions happening in a Hadoop cluster (when and who read or wrote the data) is also stored in
the metadata. The metadata is kept in memory.
Features:
Since the NameNode runs on the Master system, the Master should have good processing power and
more RAM than the Slaves.
It stores information about the DataNodes, such as their block IDs and the number of blocks.
2. DataNode
DataNode runs on the Slave system. The NameNode always instructs the DataNode about storing the data.
A DataNode is a program running on the slave system that serves read/write requests from the client.
As the data is stored on these DataNodes, they should have large storage capacity to hold more data.
3. Secondary NameNode
The Secondary NameNode is used for taking hourly backups of the metadata. In case the Hadoop cluster
fails or crashes, the Secondary NameNode takes hourly backups, or checkpoints, of that metadata and
stores it in a file named fsimage. This file can then be transferred to a new system, the metadata is
assigned to that new system, a new Master is created with this metadata, and the cluster is made to
run again correctly.
This is the benefit of the Secondary NameNode. In Hadoop 2, the High-Availability and Federation
features minimize the importance of the Secondary NameNode.
It continuously reads the metadata from the RAM of the NameNode and writes it to the hard disk.
As the Secondary NameNode keeps track of checkpoints in the Hadoop Distributed File System, it is also
known as the checkpoint node.
4. Resource Manager
The Resource Manager, also known as the Global Master Daemon, runs on the Master system. The
Resource Manager manages the resources for the applications running in a Hadoop cluster. It mainly
consists of two components:
1. ApplicationsManager
2. Scheduler
The ApplicationsManager is responsible for accepting a client's request and for arranging memory
resources on the Slaves of the Hadoop cluster to host the ApplicationMaster. The Scheduler is responsible
for allocating resources to the applications running in a Hadoop cluster and for monitoring them.
5. Node Manager
The Node Manager runs on the Slave system and manages the memory and disk resources within that
node. Each Slave node in a Hadoop cluster has a single NodeManager daemon running on it.
It also sends this monitoring information to the Resource Manager.
In a Hadoop cluster, the Resource Manager and Node Manager can be tracked through URLs of the form
http://<hostname>:port_number, using the following default ports:
ResourceManager    8088
NodeManager        8042
Hadoop InputFormat & Types of InputFormat in MapReduce
What is Hadoop InputFormat?
Hadoop InputFormat describes the input-specification for execution of the Map-Reduce job.
InputFormat describes how to split up and read input files. In MapReduce job execution, InputFormat is
the first step. It is also responsible for creating the input splits and dividing them into records.
Input files store the data for a MapReduce job and reside in HDFS. Although the format of these files is
arbitrary, line-based log files and binary formats can be used. Hence, in MapReduce, the InputFormat
class is one of the fundamental classes and provides the functionality below:
It defines the data splits, which determine both the size of individual Map tasks and their potential
execution server.
It defines the RecordReader, which is responsible for reading actual records from the input files.
The two methods used to obtain the data for the mapper are getSplits() and createRecordReader(), declared in the InputFormat class as follows:

public abstract class InputFormat<K, V> {

    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}
There are different types of MapReduce InputFormat in Hadoop, each used for a different purpose.
Let’s discuss the Hadoop InputFormat types below:
FileInputFormat is the base class for all file-based InputFormats. It also specifies the input directory that
contains the data files. When we start a MapReduce job, FileInputFormat provides a path containing the
files to read.
This InputFormat reads all those files and then divides them into one or more InputSplits.
TextInputFormat is the default InputFormat. It treats each line of each input file as a separate record and
performs no parsing. TextInputFormat is useful for unformatted data or line-based records like log files.
Hence,
Key – It is the byte offset of the beginning of the line within the file (not the whole file, just one split). So
it will be unique if combined with the file name.
Value – It is the contents of the line, excluding line terminators.
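A brief Java sketch follows, showing a mapper written against TextInputFormat's key and value types; the pass-through mapper body and its output types are illustrative assumptions (TextInputFormat is the default, so setting it explicitly is optional).

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PassThroughMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    // key = byte offset of the line within the file, value = the line itself
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        context.write(key, value);    // simply pass the offset and the line through
    }

    public static void configure(Job job) {
        job.setInputFormatClass(TextInputFormat.class);   // explicit, though it is the default
    }
}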
KeyValueTextInputFormat is similar to TextInputFormat. This InputFormat also treats each line of input as a
separate record. The difference is that TextInputFormat treats the entire line as the value, while
KeyValueTextInputFormat breaks the line itself into key and value at a tab character (‘\t’). Hence,
Key – Everything up to the first tab character.
Value – The remaining part of the line after the tab character.
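The sketch below configures a job to use KeyValueTextInputFormat and changes the separator from the default tab to a comma; the property name shown is the one used by Hadoop 2.x and should be treated as an assumption for other versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueJobSetup {
    public static Job build() throws Exception {
        Configuration conf = new Configuration();
        // split each line into key and value at the first comma instead of '\t'
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "kv-example");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        return job;
    }
}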
SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat. This format converts the sequence
file keys and values to Text objects; it performs the conversion by calling toString() on the keys and values.
Hence, SequenceFileAsTextInputFormat makes sequence files suitable input for streaming.
By using SequenceFileAsBinaryInputFormat, another variant, we can instead extract the sequence file’s keys
and values as opaque binary objects.
NLineInputFormat is another form of TextInputFormat where the keys are byte offsets of the lines and the
values are the contents of the lines. With TextInputFormat and KeyValueTextInputFormat, each mapper
receives a variable number of lines of input; the number depends on the size of the split and on the
length of the lines. If we want our mapper to receive a fixed number of lines of input, we use
NLineInputFormat.
Suppose N=2: then each split contains two lines, so one mapper receives the first two key-value pairs and
another mapper receives the next two key-value pairs.
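A minimal sketch of the N=2 case described above, using NLineInputFormat's helper to fix the number of lines per split:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineSetup {
    public static void configure(Job job) {
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 2);   // N = 2 lines of input per mapper
    }
}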
DBInputFormat reads data from a relational database using JDBC. It is best suited to loading relatively
small datasets, perhaps for joining them with large datasets from HDFS using MultipleInputs. Hence,
Key – LongWritable
Value – DBWritable
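A hedged sketch of wiring DBInputFormat to a table over JDBC is shown below; the JDBC driver class, connection URL, credentials, table, and column names are all hypothetical placeholders, and the record class is assumed to implement DBWritable.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DbJobSetup {
    public static void configure(Job job, Class<? extends DBWritable> recordClass) throws Exception {
        job.setInputFormatClass(DBInputFormat.class);

        // JDBC driver, URL, user, and password are illustrative only
        DBConfiguration.configureDB(job.getConfiguration(),
                "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/sales", "user", "password");

        // read the hypothetical "orders" table, ordered by order_id, fetching two columns
        DBInputFormat.setInput(job, recordClass, "orders", null, "order_id",
                "order_id", "amount");
    }
}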
Two different large datasets can also be joined in MapReduce programming. A join performed in the Map
phase is called a map-side join, while a join performed at the reduce side is called a reduce-side join.
Let's go into detail on why we would need to join data in MapReduce. Suppose Dataset A has master data
and Dataset B has transactional data (A and B are just for reference); we need to join them on a common
key to produce a result. It is important to realize that if the master dataset is small, we can share it with
side-data sharing techniques (passing key-value pairs in the job configuration, or the distributed cache),
as sketched below. We use a MapReduce join only when both datasets are too big for such data sharing
techniques.
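A hedged sketch of the side-data approach follows: the small master dataset is shipped through the distributed cache and loaded into a HashMap, so the join happens inside the mapper. The file name, the comma-separated record layout, and the driver call job.addCacheFile(...) implied here are all assumptions for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> master = new HashMap<>();

    @Override
    protected void setup(Context context) throws java.io.IOException {
        // "master.txt" is assumed to have been cached in the driver, e.g.
        // job.addCacheFile(new URI("/data/master.txt#master.txt"));
        try (BufferedReader reader = new BufferedReader(new FileReader("master.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);      // assumed layout: key,masterValue
                master.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        String[] parts = value.toString().split(",", 2);  // assumed layout: key,transactionValue
        String masterValue = master.get(parts[0]);
        if (masterValue != null) {                        // inner join on the common key
            context.write(new Text(parts[0]), new Text(masterValue + "," + parts[1]));
        }
    }
}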
Joining in MapReduce is not the recommended way; the same problem can be addressed through
higher-level frameworks like Hive or Cascading. Even so, if you are in that situation, you can use the
methods mentioned below.
A map-side join performs the join before the data reaches the map function, and it places strong
prerequisites on the data before joining at the map side. Both joining techniques come with their own
pros and cons. A map-side join can be more efficient than a reduce-side join, but its strict format
requirements are very hard to meet natively; and if we have to prepare that kind of data through other
MapReduce jobs, we lose the expected performance advantage over the reduce-side join.
One such prerequisite is that all the records for a particular key must reside in the same partition.
A reduce-side join is also called a repartitioned join or a repartitioned sort-merge join, and it is the most
commonly used join type. This type of join is performed at the reduce side, i.e., it has to go through the
sort and shuffle phase, which incurs network overhead. To keep it simple, we will list the steps that need
to be performed for a reduce-side join. A reduce-side join uses a few terms such as data source, tag, and
group key; let's get familiar with them.
Data source refers to the data source files, probably taken from an RDBMS.
A tag is used to mark every record with its source name, so that its source can be identified at any given
point of time, be it in the map or reduce phase; why this is required is covered later.
Group key refers to the column used as the join key between the two data sources.
Since we are going to join this data on the reduce side, we must prepare it in a way that it can be used
for joining in the reduce phase. Let's look at the steps that need to be performed.
Map Phase
A routine map function is expected to emit (key, value). For a reduce-side join, we instead design the map
so that it emits (key, source tag + value) for every record of each data source, as sketched below. This
output then goes through the sort and shuffle phase; since those operations are based on the key, all the
values from all sources for a particular key are clubbed together, and this data reaches the reducer.
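A minimal Java sketch of such a tagging mapper follows; deriving the tag from the input file name and the comma-separated record layout are assumptions for illustration.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        // derive the source tag ("A" for master files, "B" for transaction files) from the file name
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        String tag = fileName.startsWith("master") ? "A" : "B";

        String[] parts = value.toString().split(",", 2);  // assumed layout: groupKey,restOfRecord
        context.write(new Text(parts[0]), new Text(tag + "|" + parts[1]));
    }
}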
Reduce Phase
The reducer creates a cross product of every record of the map output for one key and hands it over to
the combine function.
Combine function
Whether this reduce step performs an inner join or an outer join is decided in the combine function, and
the desired output format is also decided at this place.
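A hedged sketch of the reduce and combine steps is given below: records for one key are separated by their tag and cross-multiplied, and emitting only matching pairs makes this an inner join (emitting unmatched records as well would turn it into an outer join). The tag prefixes match the hypothetical mapper above.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws java.io.IOException, InterruptedException {
        List<String> sourceA = new ArrayList<>();   // records tagged "A" (master)
        List<String> sourceB = new ArrayList<>();   // records tagged "B" (transactions)

        // split the tagged values back into their two sources
        for (Text value : values) {
            String record = value.toString();
            if (record.startsWith("A|")) {
                sourceA.add(record.substring(2));
            } else {
                sourceB.add(record.substring(2));
            }
        }

        // cross product of the two sources for this key (the "combine" step, inner join)
        for (String a : sourceA) {
            for (String b : sourceB) {
                context.write(key, new Text(a + "," + b));
            }
        }
    }
}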