Introduction To Hadoop
Examples of Big Data
The New York Stock Exchange is an example of Big Data: it generates about one terabyte of new trade data per day.
Social Media
Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousand flights per day, data generation reaches many petabytes.
Types Of Big Data
Following are the types of Big Data:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value out of it. However, nowadays we are foreseeing issues as the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.
Looking at these figures one can easily understand why the name Big Data is given and
imagine the challenges involved in its storage and processing.
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Data Growth over the years
Please note that web application data, which is unstructured, consists of log files, transaction
history files etc. OLTP systems are built to work with structured data wherein data is stored
in relations (tables).
Big data is exactly what the name suggests, a “big” amount of data. Big Data means a data set
that is large in terms of volume and is more complex. Because of the large volume and higher
complexity of Big Data, traditional data processing software cannot handle it. Big Data
simply means datasets containing a large amount of diverse data, both structured as well as
unstructured.
Big Data allows companies to address issues they are facing in their business, and solve these
problems effectively using Big Data Analytics. Companies try to identify patterns and draw
insights from this sea of data so that it can be acted upon to solve the problem(s) at hand.
Although companies have been collecting a huge amount of data for decades, the concept of
Big Data only gained popularity in the early-mid 2000s. Corporations realized the amount of
data that was being collected on a daily basis, and the importance of using this data
effectively.
Big Data is commonly described in terms of the following characteristics, the 'Vs' of Big Data:
1. Volume refers to the amount of data that is being collected. The data could be structured or unstructured.
2. Velocity refers to the rate at which data is coming in.
3. Variety refers to the different kinds of data (data types, formats, etc.) that are coming in for analysis. Over the last few years, two additional Vs of data have also emerged – value and veracity.
4. Value refers to the usefulness of the collected data.
5. Veracity refers to the quality of the data that is coming in from different sources.
How Does Big Data Work?
Big data involves collecting, processing, and analyzing vast amounts of data from multiple
sources to uncover patterns, relationships, and insights that can inform decision-making. The
process involves several steps:
1. Data Collection
Big data is collected from various sources such as social media, sensors, transactional
systems, customer reviews, and other sources.
2. Data Storage
The collected data then needs to be stored in a way that it can be easily accessed and
analyzed later. This often requires specialized storage technologies capable of
handling large volumes of data.
3. Data Processing
Once the data is stored, it needs to be processed before it can be analyzed. This
involves cleaning and organizing the data to remove any errors or inconsistencies, and
transform it into a format suitable for analysis.
4. Data Analysis
After the data has been processed, it is time to analyze it using tools like statistical
models and machine learning algorithms to identify patterns, relationships, and trends.
5. Data Visualization
The insights derived from data analysis are then presented in visual formats such as
graphs, charts, and dashboards, making it easier for decision-makers to understand
and act upon them.
Use Cases
Big Data helps corporations in making better and faster decisions, because they have more information available to solve problems and more data to test their hypotheses on.
Customer experience is a major field that has been revolutionized with the advent of Big
Data. Companies are collecting more data about their customers and their preferences than
ever. This data is being leveraged in a positive way, by giving personalized recommendations
and offers to customers, who are more than happy to allow companies to collect this data in
return for the personalized services. The recommendations you get on Netflix, or
Amazon/Flipkart are a gift of Big Data!
Machine Learning
Machine Learning is another field that has benefited greatly from the increasing popularity
of Big Data. More data means we have larger datasets to train our ML models, and a more
trained model (generally) results in a better performance. Also, with the help of Machine
Learning, we are now able to automate tasks that were earlier being done manually, all thanks
to Big Data.
Demand Forecasting
Demand forecasting has become more accurate with more and more data being collected
about customer purchases. This helps companies build forecasting models that help them
forecast future demand, and scale production accordingly. It helps companies, especially
those in manufacturing businesses, to reduce the cost of storing unsold inventory in
warehouses.
Big data also has extensive use in applications such as product development and fraud
detection.
The volume and velocity of Big Data can be huge, which makes it almost impossible to store it in traditional data warehouses. Although some sensitive information can be stored on company premises, for most of the data, companies have to opt for cloud storage or Hadoop.
Cloud storage allows businesses to store their data on the internet with the help of a cloud
service provider (like Amazon Web Services, Microsoft Azure, or Google Cloud Platform)
who takes the responsibility of managing and storing the data. The data can be accessed
easily and quickly with an API.
Hadoop also does the same thing, by giving you the ability to store and process large
amounts of data at once. Hadoop is an open-source software framework and is free. It allows
users to process large datasets across clusters of computers.
Several tools and technologies are commonly used to store and process Big Data:
1. Apache Hadoop is an open-source big data tool designed to store and process large amounts of data across multiple servers. Hadoop comprises a distributed file system (HDFS) and a MapReduce processing engine.
2. Apache Spark is a fast and general-purpose cluster computing system that supports in-memory processing to speed up iterative algorithms. Spark can be used for batch processing, real-time stream processing, machine learning, graph processing, and SQL queries.
3. Apache Cassandra is a distributed NoSQL database management system designed to
handle large amounts of data across commodity servers with high availability and
fault tolerance.
4. Apache Flink is an open-source streaming data processing framework that supports
batch processing, real-time stream processing, and event-driven applications. Flink
provides low-latency, high-throughput data processing with fault tolerance and
scalability.
5. Apache Kafka is a distributed streaming platform that enables the publishing and
subscribing to streams of records in real-time. Kafka is used for building real-time
data pipelines and streaming applications.
6. Splunk is a software platform used for searching, monitoring, and analyzing machine-generated big data in real-time. Splunk collects and indexes data from various sources and provides insights into operational and business intelligence.
7. Talend is an open-source data integration platform that enables organizations to
extract, transform, and load (ETL) data from various sources into target systems.
Talend supports big data technologies such as Hadoop, Spark, Hive, Pig, and HBase.
8. Tableau is a data visualization and business intelligence tool that allows users to
analyze and share data using interactive dashboards, reports, and charts. Tableau
supports big data platforms and databases such as Hadoop, Amazon Redshift, and
Google BigQuery.
9. Apache NiFi is a data flow management tool used for automating the movement of
data between systems. NiFi supports big data technologies such as Hadoop, Spark,
and Kafka and provides real-time data processing and analytics.
10. QlikView is a business intelligence and data visualization tool that enables users to
analyze and share data using interactive dashboards, reports, and charts. QlikView
supports big data platforms such as Hadoop, and provides real-time data processing
and analytics.
To effectively manage and utilize Big Data, organizations must also address several challenges:
1. Data Growth
Managing datasets having terabytes of information can be a big challenge for companies. As
datasets grow in size, storing them not only becomes a challenge but also becomes an
expensive affair for companies.
To overcome this, companies are now starting to pay attention to data compression and
deduplication. Data compression reduces the number of bits that the data needs, resulting in a
reduction in space being consumed. Data de-duplication is the process of making sure
duplicate and unwanted data does not reside in our database.
2. Data Security
Data security is often prioritized quite low in the Big Data workflow, which can backfire at
times. With such a large amount of data being collected, security challenges are bound to
come up sooner or later.
Mining of sensitive information, fake data generation, and lack of cryptographic protection
(encryption) are some of the challenges businesses face when trying to adopt Big Data
techniques.
Companies need to understand the importance of data security, and need to prioritize it. To
help them, there are professional Big Data consultants nowadays, that help businesses move
from traditional data storage and analysis methods to Big Data.
3. Data Integration
Data is coming in from a lot of different sources (social media applications, emails, customer
verification documents, survey forms, etc.). It often becomes a very big operational challenge
for companies to combine and reconcile all of this data.
There are several Big Data solution vendors that offer ETL (Extract, Transform, Load) and
data integration solutions to companies that are trying to overcome data integration problems.
There are also several APIs that have already been built to tackle issues related to data
integration.
Effective use of Big Data offers several benefits:
• Improved decision-making: Big data can provide insights and patterns that help organizations make more informed decisions.
• Increased efficiency: Big data analytics can help organizations identify inefficiencies
in their operations and improve processes to reduce costs.
• Better customer targeting: By analyzing customer data, businesses can develop
targeted marketing campaigns that are relevant to individual customers, resulting in
better customer engagement and loyalty.
• New revenue streams: Big data can uncover new business opportunities, enabling
organizations to create new products and services that meet market demand.
• Competitive advantage: Organizations that can effectively leverage big data have a
competitive advantage over those that cannot, as they can make faster, more informed
decisions based on data-driven insights.
Here are some of the top industries that use Big Data in their favor:
• Finance – detect fraud, assess risks, and make informed investment decisions.
• Manufacturing – optimize supply chain processes, reduce costs, and improve product quality through predictive maintenance.
• Energy – monitor and analyze energy usage patterns, optimize production, and reduce waste through predictive analytics.
• Government and public sector – address issues such as preventing crime, improving traffic management, and predicting natural disasters.
Big Data technologies can be used for creating a staging area or landing zone for new data
before identifying what data should be moved to the data warehouse. In addition, such
integration of Big Data technologies and data warehouse helps an organization to offload
infrequently accessed data.
A Distributed File System (DFS) as the name suggests, is a file system that is distributed on
multiple file servers or multiple locations. It allows programs to access or store isolated files
as they do with the local ones, allowing programmers to access files from any network or
computer.
The main purpose of the Distributed File System (DFS) is to allow users of physically distributed systems to share their data and resources by using a Common File System. A collection of workstations and mainframes connected by a Local Area Network (LAN) is a typical configuration for a Distributed File System. A DFS is executed as a part of the operating system. In DFS, a namespace is created and this process is transparent for the clients.
DFS has two components:
• Location Transparency – achieved through the namespace component.
• Redundancy – achieved through a file replication component.
In the case of failure and heavy load, these components together improve data availability by
allowing the sharing of data in different locations to be logically grouped under one folder,
which is known as the “DFS root”.
It is not necessary to use both components of DFS together; it is possible to use the namespace component without the file replication component, and it is perfectly possible to use the file replication component without the namespace component between servers.
File system replication:
Early iterations of DFS made use of Microsoft’s File Replication Service (FRS), which allowed for straightforward file replication between servers. FRS recognises new or updated files and distributes the most recent versions of the whole file to all servers. Windows Server 2003 R2 introduced “DFS Replication” (DFSR). It improves on FRS by copying only the portions of files that have changed and by minimising network traffic with data compression. Additionally, it provides flexible configuration options to manage network traffic on a configurable schedule.
Features of DFS :
• Transparency :
• Structure transparency –
There is no need for the client to know about the number or locations
of file servers and the storage devices. Multiple file servers should be
provided for performance, adaptability, and dependability.
• Access transparency –
Both local and remote files should be accessible in the same manner. The file system should automatically locate the accessed file and send it to the client’s side.
• Naming transparency –
There should not be any hint in the name of the file as to its location. Once a name is given to a file, it should not be changed while the file is transferred from one node to another.
• Replication transparency –
If a file is copied on multiple nodes, the copies of the file and their locations should be hidden from the clients.
• User mobility :
It will automatically bring the user’s home directory to the node where the user
logs in.
• Performance :
Performance is based on the average amount of time needed to satisfy client requests. This time covers the CPU time + time taken to access secondary storage + network access time. It is advisable that the performance of the Distributed File System be similar to that of a centralized file system.
• Simplicity and ease of use :
The user interface of the file system should be simple and the number of commands should be small.
• High availability :
A Distributed File System should be able to continue functioning in case of partial failures like a link failure, a node failure, or a storage drive crash.
A highly available and adaptable distributed file system should have different and independent file servers for controlling different and independent storage devices.
• Scalability :
Since growing the network by adding new machines or joining two networks
together is routine, the distributed system will inevitably grow over time. As a
result, a good distributed file system should be built to scale quickly as the
number of nodes and users in the system grows. Service should not be
substantially disrupted as the number of nodes and users grows.
• High reliability :
The likelihood of data loss should be minimized as much as feasible in a suitable
distributed file system. That is, because of the system’s unreliability, users should
not feel forced to make backup copies of their files. Rather, a file system should
create backup copies of key files that can be used if the originals are lost. Many
file systems employ stable storage as a high-reliability strategy.
• Data integrity :
Multiple users frequently share a file system. The integrity of data saved in a
shared file must be guaranteed by the file system. That is, concurrent access
requests from many users who are competing for access to the same file must be
correctly synchronized using a concurrency control method. Atomic transactions
are a high-level concurrency management mechanism for data integrity that is
frequently offered to users by a file system.
• Security :
A distributed file system should be secure so that its users may trust that their
data will be kept private. To safeguard the information contained in the file
system from unwanted & unauthorized access, security mechanisms must be
implemented.
• Heterogeneity :
Heterogeneity in distributed systems is unavoidable as a result of huge scale.
Users of heterogeneous distributed systems have the option of using multiple
computer platforms for different purposes.
History :
The server component of the Distributed File System was initially introduced as an add-on
feature. It was added to Windows NT 4.0 Server and was known as “DFS 4.1”. Then later on
it was included as a standard component for all editions of Windows 2000 Server. Clientside
support has been included in Windows NT 4.0 and also in later on version of Windows.
Linux kernels 2.6.14 and versions after it come with an SMB client VFS known as “cifs”
which supports DFS. Mac OS X 10.7 (lion) and onwards supports Mac OS X DFS.
Properties:
• File transparency: users can access files without knowing where they are physically stored on the network.
• Load balancing: the file system can distribute file access requests across multiple computers to improve performance and reliability.
• Data replication: the file system can store copies of files on multiple computers to ensure that the files are available even if one of the computers fails.
• Security: the file system can enforce access control policies to ensure that only authorized users can access files.
• Scalability: the file system can support a large number of users and a large number of files.
• Concurrent access: multiple users can access and modify the same file at the same time.
• Fault tolerance: the file system can continue to operate even if one or more of its components fail.
• Data integrity: the file system can ensure that the data stored in the files is accurate and has not been corrupted.
• File migration: the file system can move files from one location to another without interrupting access to the files.
• Data consistency: changes made to a file by one user are immediately visible to all other users.
• Support for different file types: the file system can support a wide range of file types, including text files, image files, and video files.
Applications :
• NFS –
NFS stands for Network File System. It is a client-server architecture that allows a
computer user to view, store, and update files remotely. The protocol of NFS is
one of the several distributed file system standards for Network-Attached Storage
(NAS).
• CIFS –
CIFS stands for Common Internet File System. CIFS is a dialect of SMB; that is, CIFS is an implementation of the SMB protocol, designed by Microsoft.
• SMB –
SMB stands for Server Message Block. It is a file-sharing protocol that was invented by IBM. The SMB protocol was created to allow computers to perform read and write operations on files on a remote host over a Local Area Network (LAN). The directories present on the remote host that can be accessed via SMB are called “shares”.
• Hadoop –
Hadoop is a group of open-source software services. It gives a software
framework for distributed storage and operating of big data using the MapReduce
programming model. The core of Hadoop contains a storage part, known as
Hadoop Distributed File System (HDFS), and an operating part which is a
MapReduce programming model.
• NetWare –
NetWare is a discontinued computer network operating system developed by Novell, Inc. It primarily used cooperative multitasking to run different services on a personal computer, using the IPX network protocol.
Working of DFS :
There are two ways in which DFS can be implemented:
• Standalone DFS namespace –
It allows only for those DFS roots that exist on the local computer and does not use Active Directory. A standalone DFS can only be accessed on the computer on which it is created. It does not provide any fault tolerance and cannot be linked to any other DFS. Standalone DFS roots are rarely encountered because of their limited advantages.
• Domain-based DFS namespace –
It stores the configuration of DFS in Active Directory, making the DFS namespace root accessible at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>.
Advantages :
• DFS allows multiple users to access or store the data.
• It allows the data to be shared remotely.
• It improves the availability of files, access time, and network efficiency.
• It improves the capacity to change the size of the data and also improves the ability to exchange the data.
• Distributed File System provides transparency of data even if the server or disk fails.
Disadvantages :
• In a Distributed File System, nodes and connections need to be secured, therefore we can say that security is at stake.
• There is a possibility of loss of messages and data in the network while moving from one node to another.
• Database connection in the case of a Distributed File System is complicated.
• Handling of the database is also not easy in a Distributed File System as compared to a single-user system.
• There are chances that overloading will take place if all nodes try to send data at once.
Algorithm using map reduce
MapReduce is a framework using which we can write applications to process huge amounts
of data, in parallel, on large clusters of commodity hardware in a reliable manner.
What is MapReduce?
MapReduce is a processing technique and a program model for distributed computing based
on java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). Secondly, reduce task, which takes the output from
a map as an input and combines those data tuples into a smaller set of tuples. As the sequence
of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers
is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling
the application to run over hundreds, thousands, or even tens of thousands of machines in a
cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
• MapReduce program executes in three stages, namely map stage, shuffle stage,
and reduce stage.
o Map stage − The map or mapper’s job is to process the input
data. Generally the input data is in the form of file or directory
and is stored in the Hadoop file system (HDFS). The input file is
passed to the mapper function line by line. The mapper processes
the data and creates several small chunks of data.
o Reduce stage − This stage is the
combination of the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from the mapper.
After processing, it produces a new set of output, which will be
stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the
nodes.
• Most of the computing takes place on nodes with data on local disks that
reduces the network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
Consider, for example, the following data showing the monthly electrical consumption and the annual average for several years:
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
If the above data is given as input, we have to write applications to process it and produce results such as finding the year of maximum usage, the year of minimum usage, and so on. This is a walkover for programmers with a finite number of records. They will simply write the logic to produce the required output, and pass the data to the application written.
But think of the data representing the electrical consumption of all the large-scale industries of a particular state, since its formation. When we write applications to process such bulk data,
• they will take a lot of time to execute, and
• there will be heavy network traffic when we move data from the source to the network server, and so on.
To solve these problems, we have the MapReduce framework.
Input Data
The above data is saved as sample.txt and given as input. The input file looks as shown below.
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
Example Program
Given below is the program to process the sample data using the MapReduce framework.
package hadoop;
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits {
   //Mapper class: reads one line per year and emits (year, annual average) pairs
   public static class E_EMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
      //Map function
      public void map(LongWritable key, Text value,
         OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line, "\t");
         String year = s.nextToken();
         while (s.hasMoreTokens()) {
            lasttoken = s.nextToken();
         }
         int avgprice = Integer.parseInt(lasttoken);
         output.collect(new Text(year), new IntWritable(avgprice));
      }
   }
   //Reducer class
   public static class E_EReduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
      //Reduce function: emits the yearly value only if it exceeds the threshold
      public void reduce(Text key, Iterator<IntWritable> values,
         OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
         int maxavg = 30;
         int val = Integer.MIN_VALUE;
         while (values.hasNext()) {
            if ((val = values.next().get()) > maxavg) {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }
   //Main (driver) function
   public static void main(String args[]) throws Exception {
      JobConf conf = new JobConf(ProcessUnits.class);
      conf.setJobName("max_eletricityunits");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(E_EMapper.class);
      conf.setCombinerClass(E_EReduce.class);
      conf.setReducerClass(E_EReduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
   }
}
Save the above program as ProcessUnits.java. The compilation and execution of the
program is explained below.
Compilation and Execution of Process Units Program
Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1
The following command is to create a directory to store the compiled java classes.
$ mkdir units
Step 2
Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce
program. Visit the following link mvnrepository.com to download the jar. Let us assume the
downloaded folder is /home/hadoop/.
Step 3
The following commands are used for compiling the ProcessUnits.java program and creating
a jar for the program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
Step 4
The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5
The following command is used to copy the input file named sample.txt into the input directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6
The following command is used to verify the files in the input directory.
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7
The following command is used to run the Eleunit_max application by taking the input files
from the input directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir
Another classic MapReduce example is one-step matrix multiplication. Here matrix A is a 2×2 matrix, which means the number of rows (i) = 2 and the number of columns (j) = 2. Matrix B is also a 2×2 matrix, where the number of rows (j) = 2 and the number of columns (k) = 2. Each cell of the matrices is labelled as Aij and Bjk; for example, element 3 in matrix A is called A21, i.e. 2nd row, 1st column. One-step matrix multiplication has 1 mapper and 1 reducer. The formulas are:
Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i
Reducer: for each key (i, k), collect the A-values and the B-values, multiply Aij by Bjk for each matching j, and sum the products to obtain element C[i][k].
Computing the mapper for Matrix A, each of k, i, and j ranges over the number of rows/columns it refers to. Here all are 2; therefore when k = 1, i can have 2 values (1 and 2), and each case can have 2 further values of j (1 and 2). Substituting all values in the formula gives the full set of intermediate key/value pairs.
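As a rough illustration of the formulas above, here is a small, self-contained Java sketch (not part of the original example) that simulates the map and reduce steps in a single program. The 2×2 matrices A and B used here are hypothetical values chosen only for demonstration, and 0-based indices are used instead of the 1-based A21-style notation.

import java.util.*;

public class OneStepMatrixMultiply {
    public static void main(String[] args) {
        int[][] A = {{1, 2}, {3, 4}};   // hypothetical matrix A
        int[][] B = {{5, 6}, {7, 8}};   // hypothetical matrix B
        int n = 2;

        // "Map" phase: every A[i][j] contributes to output cell (i,k) for all k,
        // and every B[j][k] contributes to output cell (i,k) for all i.
        Map<String, int[]> aVals = new HashMap<>();  // key "i,k" -> A-values indexed by j
        Map<String, int[]> bVals = new HashMap<>();  // key "i,k" -> B-values indexed by j
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++) {
                    aVals.computeIfAbsent(i + "," + k, x -> new int[n])[j] = A[i][j];
                    bVals.computeIfAbsent(i + "," + k, x -> new int[n])[j] = B[j][k];
                }

        // "Reduce" phase: for each cell (i,k), multiply matching j-entries and sum them.
        for (String key : new TreeSet<>(aVals.keySet())) {
            int sum = 0;
            for (int j = 0; j < n; j++)
                sum += aVals.get(key)[j] * bVals.get(key)[j];
            System.out.println("C[" + key + "] = " + sum);
        }
    }
}

Running this prints C[0,0] = 19, C[0,1] = 22, C[1,0] = 43 and C[1,1] = 50, which matches the ordinary matrix product A × B.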
What is Hadoop
Hadoop is an open-source framework from Apache that is used to store, process, and analyze data which is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing. It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the
basis of that HDFS was developed. It states that the files will be broken into blocks
and stored in nodes over the distributed architecture.
2. Yarn: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.
3. Map Reduce: This is a framework which helps Java programs to do the parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set which can be computed in key-value pairs. The output of the Map task is consumed by the Reduce task, and then the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by
other Hadoop modules.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes the JobTracker and the NameNode, whereas the slave nodes include the TaskTracker and the DataNode.
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It follows a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.
Both NameNode and DataNode are capable enough to run on commodity machines. The Java
language is used to develop HDFS. So any machine that supports Java language can easily
run the NameNode and DataNode software.
NameNode
o It is a single master server that exists in the HDFS cluster.
o As it is a single node, it may become the reason for a single point of failure.
o It manages the file system namespace by executing operations like opening, renaming, and closing files.
o It simplifies the architecture of the system.
Job Tracker
o The role of the Job Tracker is to accept the MapReduce jobs from the client and process the data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.
MapReduce Layer
The MapReduce comes into existence when the client application submits the MapReduce
job to Job Tracker. In response, the Job Tracker sends the request to the appropriate Task
Trackers. Sometimes, the TaskTracker fails or time out. In such a case, that part of the job is
rescheduled.
Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and is mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. It is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective as compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property with which it can replicate data over the network, so if one node is down or some other network failure happens, then Hadoop takes the other copy of the data and uses it. Normally, data is replicated thrice, but the replication factor is configurable.
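Because the replication factor mentioned above is configurable, a short sketch may help; it is an assumption-based example (a reachable HDFS cluster and a hypothetical file path) showing the dfs.replication setting being overridden for one client and changed for one existing file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "2");   // override the default replication factor of 3 for this client

        FileSystem fs = FileSystem.get(conf);
        // change the replication factor of an existing (hypothetical) file to 2
        fs.setReplication(new Path("/user/hadoop/sample.txt"), (short) 2);
        fs.close();
    }
}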
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper, published by Google.
o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open-source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing that data was becoming very costly, and this problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on MapReduce. This technique simplifies the data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.
o In 2006, Doug Cutting joined Yahoo and, on the basis of the Nutch project, introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo ran two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900-node cluster within 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.
Year – Event
2009
o Yahoo runs 17 clusters of 24,000 machines.
o Hadoop becomes capable enough to sort a petabyte.
o MapReduce and HDFS become separate subprojects.
2010
o Hadoop adds support for Kerberos.
o Hadoop operates 4,000 nodes with 40 petabytes.
o Apache Hive and Pig released.
Note: Apart from the above-mentioned components, there are many other components too
that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one term, i.e., data. That is the beauty of Hadoop: it revolves around data, which makes its processing easier.
HDFS:
• HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes and thereby maintaining the metadata in the form of log files.
• HDFS consists of two core components i.e.
1. Name node
2. Data Node
• Name Node is the prime node which contains metadata (data about data), requiring comparatively fewer resources than the data nodes that store the actual data. These data nodes are commodity hardware in the distributed environment, undoubtedly making Hadoop cost-effective.
• HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
YARN:
• Yet Another Resource Negotiator, as the name implies, helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
• It consists of three major components, i.e.:
1. Resource Manager
2. Node Manager
3. Application Manager
• Resource manager has the privilege of allocating resources for the applications in
a system whereas Node managers work on the allocation of resources such as
CPU, memory, bandwidth per machine and later on acknowledges the resource
manager. Application manager works as an interface between the resource
manager and node manager and performs negotiations as per the requirement of
the two.
HIVE:
• With the help of SQL methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
• It is highly scalable as it allows real-time processing and batch processing both.
Also, all the SQL datatypes are supported by Hive thus, making the query
processing easier.
• Similar to the Query Processing frameworks, HIVE too comes with two
components: JDBC Drivers and HIVE Command Line.
• JDBC, along with ODBC drivers work on establishing the data storage
permissions and connection whereas HIVE Command line helps in the processing
of queries.
Apache Spark:
• It is a platform that handles all the process-consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, visualization, etc.
• It consumes in-memory resources, thus being faster than the prior systems in terms of optimization.
• Spark is best suited for real-time data whereas Hadoop is best suited for structured
data or batch processing, hence both are used in most of the companies
interchangeably.
Apache HBase:
• It is a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google's BigTable, and is thus able to work on Big Data sets effectively.
• At times when we need to search or retrieve the occurrences of something small in a huge database, the request must be processed within a short, quick span of time. At such times, HBase comes in handy as it gives us a tolerant way of storing limited data.
Other Components: Apart from all of these, there are some other components too that carry
out a huge task in order to make Hadoop capable of processing large datasets. They are as
follows:
• Solr, Lucene: These are two services that perform the task of searching and indexing with the help of some Java libraries. Lucene in particular is based on Java and also provides a spell-check mechanism. Solr is built on top of Lucene.
• Zookeeper: There was a huge issue of management of coordination and
synchronization among the resources or the components of Hadoop which resulted
in inconsistency, often. Zookeeper overcame all the problems by performing
synchronization, inter-component based communication, grouping, and
maintenance.
• Oozie: Oozie simply performs the task of a scheduler, thus scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow and Oozie coordinator jobs. Oozie workflow jobs need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are triggered when some data or external stimulus is given to them.
The Map task includes splitting and mapping of the data by taking a dataset and converting it
into another set of data, where the individual elements get broken down into tuples i.e.
key/value pairs. After which the Reduce task shuffles and reduces the data, which means it
combines the data tuples based on the key and modifies the value of the key accordingly.
In the Hadoop framework, MapReduce model is the core component for data processing.
Using this model, it is very easy to scale an application to run over hundreds, thousands and
many more machines in a cluster by only making a configuration change. This is also
because the programs of the model in cloud computing are parallel in nature. Hadoop has the
capability of running MapReduce in many languages such as Java, Ruby, Python and C++.
The MapReduce model operates on <key, value> pairs. It views the input to the jobs as a set of <key, value> pairs and produces a different set of <key, value> pairs as the output of the jobs. Data input is supported by two classes in this framework, namely InputFormat and RecordReader.
The first is consulted to determine how the input data should be partitioned for the map tasks,
while the latter reads the data from the inputs. For the data output also there are two classes,
OutputFormat and RecordWriter. The first class performs a basic validation of the data sink
properties and the second class is used to write each reducer output to the data sink.
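To make the roles of these four classes concrete, the following is a minimal sketch using Hadoop's newer org.apache.hadoop.mapreduce API; the class name, job name, and input/output paths are hypothetical, and the job relies on the default (identity) mapper and reducer so that only the format wiring is visible.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format-wiring");
        job.setJarByClass(FormatWiring.class);

        // TextInputFormat decides how the input is split; its RecordReader
        // turns each split into <byte offset, line> pairs for the map tasks.
        job.setInputFormatClass(TextInputFormat.class);
        // TextOutputFormat supplies the RecordWriter that writes each
        // reducer output record as a "key<TAB>value" line in the data sink.
        job.setOutputFormatClass(TextOutputFormat.class);

        // matches what the default (identity) mapper and reducer pass through
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("input_dir"));      // hypothetical path
        FileOutputFormat.setOutputPath(job, new Path("output_dir"));   // hypothetical path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}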
What are the Phases of MapReduce?
Input Splits: An input in the MapReduce model is divided into small fixed-size parts called
input splits. This part of the input is consumed by a single map. The input data is generally a
file or directory stored in the HDFS.
Mapping: This is the first phase in the map-reduce program execution where the data in each
split is passed line by line, to a mapper function to process it and produce the output values.
Shuffling: It is a part of the output phase of Mapping where the relevant records are consolidated from the output. It consists of merging and sorting: all the key-value pairs which have the same keys are combined, and in sorting, the inputs from the merging step are taken and sorted. The result is a set of key-value pairs sorted by key.
Reduce: All the values from the shuffling phase are combined and a single output value is
returned. Thus, summarizing the entire dataset.
Hadoop divides a task into two parts, Map tasks which includes Splits and Mapping, and
Reduce tasks which includes Shuffling and Reducing. These were mentioned in the phases in
the above section. The execution of these tasks is controlled by two entities called the JobTracker and multiple TaskTrackers.
With every job that gets submitted for execution, there is a JobTracker that resides on the
NameNode and multiple task trackers that reside on the DataNode. A job gets divided into
multiple tasks that run onto multiple data nodes in the cluster. The JobTracker coordinates the
activity by scheduling tasks to run on various data nodes.
The task tracker looks after the execution of individual tasks. It also sends the progress report
to the JobTracker. Periodically, it sends a signal to the JobTracker to notify the current state
of the system. When there is a task failure, the JobTracker reschedules it on a different task
tracker.
Advantages of MapReduce
There are a number of advantages for applications which use this model, including the simple scalability and data locality discussed above.
Data Serialization
Data serialization is the process of converting an object into a stream of bytes to more easily
save or transmit it.
The reverse process—constructing a data structure or object from a series of bytes— is
deserialization. The deserialization process recreates the object, thus making the data easier
to read and modify as a native structure in a programming language.
Serialization and deserialization work together to transform/recreate data objects to/from a
portable format.
Serialization enables us to save the state of an object and recreate the object in a new location.
Serialization encompasses both the storage of the object and exchange of data. Since objects
are composed of several components, saving or delivering all the parts typically requires
significant coding effort, so serialization is a standard way to capture the object into a
sharable format. With serialization, we can transfer objects between applications and store them for later use.
In some distributed systems, data and its replicas are stored in different partitions on multiple
cluster members. If data is not present on the local member, the system will retrieve that data
from another member. This requires serialization of the data that moves between members.
Big data systems often include technologies/data that are described as “schemaless.” This
means that the managed data in these systems are not structured in a strict format, as defined
by a schema. Serialization provides several benefits in this type of environment.
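As a concrete illustration of serialization and deserialization in the Hadoop context, the sketch below (an assumption-based example, not taken from the text above) round-trips one of Hadoop's own Writable objects through a byte array.

import java.io.*;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws IOException {
        IntWritable original = new IntWritable(163);

        // Serialization: object -> stream of bytes
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytesOut));
        byte[] serialized = bytesOut.toByteArray();

        // Deserialization: bytes -> object, recreated in a "new location"
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(serialized)));

        System.out.println(restored.get());   // prints 163
    }
}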
With the help of the HDFS command, we can perform Hadoop HDFS file operations like
changing the file permissions, viewing the file contents, creating files or directories, copying
file/directory from the local file system to HDFS or vice-versa, etc.
Before starting with the HDFS command, we have to start the Hadoop services. To start the
Hadoop services do the following:
Below, we mention some common Hadoop HDFS commands with their usage and description. Let us now start with the HDFS commands.
1. version
The hadoop version command prints the version of Hadoop installed.
2. mkdir
The Hadoop fs shell command mkdir creates a new directory in HDFS.
Note: If the directory already exists in HDFS, then we will get an error message that the directory already exists.
Use hadoop fs -mkdir -p /path/directoryname so that the command does not fail even if the directory already exists.
3. ls
The Hadoop fs shell command ls lists the files and directories in the given HDFS path.
6. get
Hadoop HDFS get Command Description: The Hadoop fs shell command get
copies the file or directory from the Hadoop file system to the local file system.
We can cross-check whether the file is copied or not using the ls command.
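For completeness, the same copy can also be performed programmatically through the FileSystem API; the sketch below is a minimal, assumption-based Java equivalent of the get command, with both paths being hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsGetExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // equivalent of: hadoop fs -get /user/hadoop/sample.txt /home/hadoop/sample.txt
        fs.copyToLocalFile(new Path("/user/hadoop/sample.txt"),
                           new Path("/home/hadoop/sample.txt"));
        fs.close();
    }
}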
Anatomy of File Write in HDFS
Next, we’ll check out how files are written to HDFS. Consider figure 1.2 to get a better understanding of the concept.
Note: HDFS follows the Write once Read many times model. In HDFS we cannot edit the
files which are already stored in HDFS, but we can append data by reopening the files.
Step 1: The client creates the file by calling create() on DistributedFileSystem(DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s
namespace, with no blocks associated with it. The name node performs various checks to
make sure the file doesn’t already exist and that the client has the right permissions to create
the file. If these checks pass, the name node prepares a record of the new file; otherwise, the
file can’t be created and therefore the client is thrown an error, i.e., an IOException. The DFS returns an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last)
data node in the pipeline.
Step 5: The DFSOutputStream sustains an internal queue of packets that are waiting to be
acknowledged by data nodes, called an “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This action flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
HDFS follows the Write Once Read Many model. So, we can’t edit files that are already stored in HDFS, but we can append data by reopening the file. This design allows HDFS to
scale to a large number of concurrent clients because the data traffic is spread across all the
data nodes in the cluster. Thus, it increases the availability, scalability, and throughput of the
system.
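The client-side view of the steps above can be sketched in a few lines of Java; this is a minimal illustration that assumes the Hadoop configuration on the classpath points at an HDFS cluster and uses a hypothetical output path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // FileSystem.get() returns the DistributedFileSystem when the default FS is hdfs://
        FileSystem fs = FileSystem.get(new Configuration());

        // Steps 1-2: create() asks the NameNode to add the new file to the namespace
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/write_demo.txt"));

        // Steps 3-5: writes are split into packets and streamed down the DataNode pipeline
        out.writeBytes("Write once, read many times\n");

        // Step 6: close() flushes the remaining packets and signals completion to the NameNode
        out.close();
        fs.close();
    }
}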
1. NameNode
NameNode works on the Master System. The primary purpose of Namenode is to manage all
the MetaData. Metadata is the list of files stored in HDFS(Hadoop Distributed File System).
As we know the data is stored in the form of blocks in a Hadoop cluster. So the DataNode on
which or the location at which that block of the file is stored is mentioned in MetaData. All
information regarding the logs of the transactions happening in a Hadoop cluster (when or
who read/wrote the data) will be stored in MetaData. MetaData is stored in the memory.
Features:
• It never stores the data that is present in the file.
• As Namenode works on the Master System, the Master system should have good
processing power and more RAM than Slaves.
• It stores the information of DataNodes, such as their block IDs and the number of blocks.
How to start Name Node?
hadoop-daemon.sh start namenode
How to stop Name Node?
hadoop-daemon.sh stop namenode
2. DataNode
DataNode works on the Slave system. The NameNode always instructs DataNode for storing
the Data. DataNode is a program that runs on the slave system that serves the read/write
request from the client. As the data is stored on these DataNodes, they should possess high storage capacity to store more data.
How to start Data Node?
hadoop-daemon.sh start datanode
How to stop Data Node?
hadoop-daemon.sh stop datanode
3. Secondary NameNode
Secondary NameNode is used for taking hourly backups of the data. In case the Hadoop cluster fails or crashes, the Secondary NameNode takes the hourly backup or checkpoint of that data and stores it in a file named fsimage. This file can then be transferred to a new system; a new Master is created with this metadata, and the cluster is made to run again correctly. This is the benefit of the Secondary NameNode. Now in Hadoop 2, the High Availability and Federation features minimize the importance of the Secondary NameNode.
Major Function Of Secondary NameNode:
• It groups the Edit logs and Fsimage from NameNode together.
• It continuously reads the MetaData from the RAM of NameNode and writes into
the Hard Disk.
As secondary NameNode keeps track of checkpoints in a Hadoop Distributed File System, it
is also known as the checkpoint Node.
4. Resource Manager
Resource Manager is also known as the global master daemon that works on the Master System. The Resource Manager manages the resources for the applications that are running in a Hadoop cluster. The Resource Manager mainly consists of 2 things:
1. ApplicationsManager
2. Scheduler
The ApplicationsManager is responsible for accepting requests from clients and also arranging a memory resource on the slaves in a Hadoop cluster to host the Application Master. The scheduler is utilized for providing resources for applications in a Hadoop cluster and for monitoring these applications.
How to start ResourceManager?
yarn-daemon.sh start resourcemanager
How to stop ResourceManager?
yarn-daemon.sh stop resourcemanager
5. Node Manager
The Node Manager works on the slave system and manages the memory and disk resources within the node. Each slave node in a Hadoop cluster has a single NodeManager daemon running on it. It monitors resource usage and sends this monitoring information to the Resource Manager.
How to start Node Manager?
yarn-daemon.sh start nodemanager
How to stop Node Manager?
yarn-daemon.sh stop nodemanager
In a Hadoop cluster, the Resource Manager and Node Manager can be tracked with specific URLs of the form http://<hostname>:<port_number>:
• ResourceManager – port 8088
• NodeManager – port 8042
The MapReduce paradigm was created in 2003 to enable processing of large data sets in a
massively parallel manner. The goal of the MapReduce model is to simplify the approach to
transformation and analysis of large datasets, as well as to allow developers to focus on
algorithms instead of data management. The model allows for simple implementation of
data-parallel algorithms. There are a number of implementations of this model, including
Google’s approach, programmed in C++, and Apache’s Hadoop implementation,
programmed in Java. Both run on large clusters of commodity hardware in a shared-nothing,
peer-to-peer environment.
The MapReduce model consists of two phases: the map phase and the reduce phase,
expressed by the map function and the reduce function, respectively. The functions are
specified by the programmer and are designed to operate on key/value pairs as input and
output. The keys and values can be simple data types, such as an integer, or more complex,
such as a commercial transaction.
Map
The map function, also referred to as the map task, processes a single key/value input
pair and produces a set of intermediate key/value pairs.
Reduce
The reduce function, also referred to as the reduce task, consists of taking all
key/value pairs produced in the map phase that share the same intermediate key and
producing zero, one, or more data items.
Note that the map and reduce functions do not address the parallelization and execution of
the MapReduce jobs. This is the responsibility of the MapReduce model, which automatically
takes care of distribution of input data, as well as scheduling and managing map and reduce
tasks.
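A compact way to see these two functions in practice is the classic word-count example. The sketch below (written against Hadoop's newer org.apache.hadoop.mapreduce API, with hypothetical class names) shows a map function that turns one input line into intermediate (word, 1) pairs and a reduce function that receives all values sharing the same key and emits the total count.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountFunctions {

    // Map task: one input key/value pair (byte offset, line) -> many (word, 1) pairs
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // intermediate (word, 1) pair
            }
        }
    }

    // Reduce task: all values sharing one intermediate key -> one (word, count) pair
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();             // combine counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }
}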
The ResourceManager has two main components that are Schedulers and
ApplicationsManager.
The scheduler performs scheduling based on the resource requirements of the applications.
It has some pluggable policies that are responsible for partitioning the cluster resources
among the various queues, applications, etc.
The FIFO Scheduler, CapacityScheduler, and FairScheduler are such pluggable policies that
are responsible for allocating resources to the applications.
1. FIFO Scheduler
First In First Out is the default scheduling policy used in Hadoop. The FIFO Scheduler gives more preference to applications submitted earlier than to those submitted later. It places the applications in a queue and executes them in the order of their submission (first in, first out). Here, irrespective of size and priority, the requests of the first application in the queue are served first. Only once the first application's request is satisfied is the next application in the queue served.
2. Capacity Scheduler
The CapacityScheduler allows multiple-tenants to securely share a large Hadoop cluster. It is
designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing the
throughput and the utilization of the cluster.
It supports hierarchical queues to reflect the structure of organizations or groups that utilize
the cluster resources. A queue hierarchy contains three types of queues that are root, parent,
and leaf.
The root queue represents the cluster itself, parent queue represents organization/group or
sub-organization/sub-group, and the leaf accepts application submission.
The Capacity Scheduler allows the sharing of the large cluster while giving capacity
guarantees to each organization by allocating a fraction of cluster resources to each queue.
Also, if a queue has finished its work and has free resources while other queues still have unmet demand, those free resources are assigned to the applications in the other queues. This provides elasticity for the organizations in a cost-effective manner.
Apart from this, the CapacityScheduler provides a comprehensive set of limits to ensure that a single application, user, or queue cannot consume a disproportionate amount of the cluster's resources.
To ensure fairness and stability, it also provides limits on initialized and pending apps from a
single user and queue.
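As a rough sketch of how such a hierarchy and its limits are expressed, the snippet below sets a handful of the standard capacity-scheduler properties programmatically; on a real cluster these live in capacity-scheduler.xml, and the queue names engineering and marketing are invented for illustration.

import org.apache.hadoop.conf.Configuration;

public class CapacityQueuesSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Two leaf queues under the root queue (hypothetical names).
    conf.set("yarn.scheduler.capacity.root.queues", "engineering,marketing");

    // Guaranteed shares of cluster capacity, in percent (they must add up to 100).
    conf.set("yarn.scheduler.capacity.root.engineering.capacity", "60");
    conf.set("yarn.scheduler.capacity.root.marketing.capacity", "40");

    // Elasticity: a queue may grow beyond its guarantee, up to this cap, while others are idle.
    conf.set("yarn.scheduler.capacity.root.engineering.maximum-capacity", "80");

    // Limits that keep a single user or the cluster as a whole from being overrun.
    conf.set("yarn.scheduler.capacity.root.engineering.user-limit-factor", "1");
    conf.set("yarn.scheduler.capacity.maximum-applications", "10000");

    System.out.println("Queues under root: "
        + conf.get("yarn.scheduler.capacity.root.queues"));
  }
}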
Advantages:
• It maximizes the utilization of resources and throughput in the Hadoop cluster.
• Provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organizations utilizing the cluster.
Disadvantage:
• It is the most complex of the three schedulers.
3. Fair Scheduler
FairScheduler allows YARN applications to fairly share resources in large Hadoop clusters.
With FairScheduler, there is no need for reserving a set amount of capacity because it will
dynamically balance resources between all running applications.
It assigns resources to applications in such a way that all applications get, on average, an
equal amount of resources over time.
The FairScheduler, by default, takes scheduling fairness decisions only on the basis of
memory. We can configure it to schedule with both memory and CPU.
When a single application is running, that application can use the entire cluster's resources. When other applications are submitted, freed-up resources are assigned to the new applications so that every application eventually gets roughly the same amount of resources. The FairScheduler enables short applications to finish in a reasonable time without starving long-lived applications.
Apart from fair scheduling, the FairScheduler allows minimum shares to be assigned to queues, ensuring that certain users, groups, or production applications always get sufficient resources. When an application is present in a queue, it receives at least its minimum share, but when a queue does not need its full guaranteed share, the excess is split among the other running applications.
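A minimal sketch of how minimum shares and memory-plus-CPU fairness are expressed: the FairScheduler reads its queue definitions from an allocation file whose location is given by the yarn.scheduler.fair.allocation.file property. The snippet below only writes an illustrative allocation file to a temporary path; the queue name production and the resource figures are invented.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FairAllocationSketch {
  public static void main(String[] args) throws Exception {
    // Illustrative allocation file: one production queue with a minimum share and extra weight,
    // plus DRF so that both memory and vcores count towards fairness.
    String allocations =
        "<?xml version=\"1.0\"?>\n"
      + "<allocations>\n"
      + "  <queue name=\"production\">\n"
      + "    <minResources>10240 mb,10 vcores</minResources>\n"
      + "    <weight>2.0</weight>\n"
      + "  </queue>\n"
      + "  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>\n"
      + "</allocations>\n";

    Path file = Files.createTempFile("fair-scheduler", ".xml");
    Files.write(file, allocations.getBytes(StandardCharsets.UTF_8));

    // On a real cluster, this path would be the value of yarn.scheduler.fair.allocation.file
    // in yarn-site.xml on the ResourceManager host.
    System.out.println("Wrote sample allocation file to " + file);
  }
}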
Advantages:
• It provides a reasonable way to share the Hadoop cluster among a number of users.
• Also, the FairScheduler can work with app priorities where the priorities are used as
weights in determining the fraction of the total resources that each application should
get.
Disadvantage:
• It requires configuration.
Hadoop 2.0 also introduces a solution to the long-awaited High Availability problem.
• Hadoop 2.0 introduced YARN, which can process terabytes and petabytes of data present in HDFS using various non-MapReduce applications such as Giraph and MPI.
• Hadoop 2.0 divides the responsibilities of the overloaded JobTracker into two separate components: the per-application ApplicationMaster and the global ResourceManager.
• Hadoop 2.0 improves horizontal scalability of the NameNode through HDFS Federation and eliminates the single point of failure problem with NameNode High Availability.
The Hadoop 1.0 NameNode has a single point of failure (SPOF) problem, which means that if the NameNode fails, the entire Hadoop cluster becomes unavailable. Admittedly, this is anticipated to be a rare occurrence, as deployments typically use business-critical hardware with RAS features (Reliability, Availability and Serviceability) for the NameNode servers. Even so, if a NameNode failure does occur, manual intervention by the Hadoop administrators is required to recover the NameNode with the help of the Secondary NameNode. The NameNode SPOF problem therefore limits the overall availability of the Hadoop cluster.
The main goal of the Hadoop 2.0 High Availability project is to make big data applications available 24/7 by deploying two Hadoop NameNodes: one in active configuration and the other as a Standby NameNode in passive configuration.
Earlier, a single Hadoop NameNode maintained the tree hierarchy of the HDFS files and tracked data storage in the cluster. Hadoop 2.0 High Availability allows users to configure Hadoop clusters with redundant NameNodes, eliminating the chance of a SPOF in a given Hadoop cluster. The federation capability further allows users to scale clusters horizontally with several NameNodes that operate autonomously over a common pool of data storage, thereby offering better scalability compared to Hadoop 1.0.
With Hadoop 2.0, the Hadoop architecture is configured so that it supports automated failover with complete stack resiliency and a hot Standby NameNode.
Both the active and the passive (Standby) NameNodes keep up-to-date metadata, which ensures seamless failover for large Hadoop clusters, meaning there is no downtime for the Hadoop cluster and it remains available at all times.
Hadoop 2.0 is designed to detect failures of the NameNode host and processes so that it can automatically switch to the passive NameNode, i.e. the Standby Node, to ensure high availability of the HDFS services to big data applications. With the advent of Hadoop 2.0 HA, Hadoop administrators can take a breather, as this process does not require manual intervention.
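For reference, the kind of hdfs-site.xml settings that enable a two-NameNode, automatic-failover deployment look roughly like the sketch below, here set programmatically only for illustration. The nameservice ID mycluster, the host names, and the ZooKeeper quorum are placeholders.

import org.apache.hadoop.conf.Configuration;

public class HdfsHaConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // One logical nameservice served by two NameNodes (placeholder IDs and hosts).
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

    // Clients discover which NameNode is currently active through this proxy provider.
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    // Automatic failover relies on ZooKeeper-based failover controllers (ZKFC).
    conf.set("dfs.ha.automatic-failover.enabled", "true");
    conf.set("ha.zookeeper.quorum",
        "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");

    System.out.println("Active/standby pair configured for nameservice "
        + conf.get("dfs.nameservices"));
  }
}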
With HDP 2.0 High Availability, the complete Hadoop stack, i.e. HBase, Pig, Hive, MapReduce, and Oozie, is equipped to tolerate a NameNode failure without losing job progress or any related data. Thus, any critical long-running jobs that are scheduled to complete at a specific time will not be affected by a NameNode failure.
• Ease of installation: according to Hadoop users, setting up High Availability should be a trivial activity that does not require the Hadoop administrator to install any other open source or commercial third-party software.
• No demand for additional hardware: Hadoop users say that the Hadoop 2.0 High Availability feature should not require them to deploy, maintain, or purchase additional hardware. 100% commodity hardware must be used to achieve high availability, i.e. there should not be any further dependencies on non-commodity hardware such as load balancers.
HDFS federation
HDFS Federation gives a Hadoop cluster the ability to run multiple HDFS namespaces, each managed by its own NameNode. The namespaces, which run on separate hosts, are independent and do not require coordination with each other, so the failure of one NameNode does not affect the availability of the other namespaces. The DataNodes are used as common storage by all the namespaces and register with every NameNode in the cluster.
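A rough sketch of the corresponding configuration: under federation, dfs.nameservices simply lists several independent namespaces, and every DataNode registers with each of them. The nameservice IDs and host names below are placeholders.

import org.apache.hadoop.conf.Configuration;

public class HdfsFederationSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Two independent namespaces, each managed by its own NameNode (placeholder IDs and hosts).
    conf.set("dfs.nameservices", "ns1,ns2");
    conf.set("dfs.namenode.rpc-address.ns1", "namenode-a.example.com:8020");
    conf.set("dfs.namenode.rpc-address.ns2", "namenode-b.example.com:8020");

    // DataNodes read the same list and register with every NameNode,
    // acting as common block storage for all namespaces.
    for (String ns : conf.getStrings("dfs.nameservices")) {
      System.out.println(ns + " -> " + conf.get("dfs.namenode.rpc-address." + ns));
    }
  }
}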
YARN stands for “Yet Another Resource Negotiator”. It was introduced in Hadoop 2.0 to remove the JobTracker bottleneck that was present in Hadoop 1.0. YARN was described as a “Redesigned Resource Manager” at the time of its launch, but it has since evolved into what is effectively a large-scale distributed operating system for Big Data processing. YARN also allows different data processing engines, such as graph processing,
interactive processing, stream processing as well as batch processing to run and process data
stored in HDFS (Hadoop Distributed File System) thus making the system much more
efficient. Through its various components, it can dynamically allocate various resources and
schedule the application processing. For large volume data processing, it is quite necessary to
manage the available resources properly so that every application can leverage them.
Running MRv1 in YARN.
YARN uses the ResourceManager web interface for monitoring applications running on a
YARN cluster. The ResourceManager UI shows the basic cluster metrics, list of applications,
and nodes associated with the cluster. In this section, we'll discuss the monitoring of MRv1
applications over YARN.
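Besides the web interface, the same information can be pulled through the YARN client API. The short sketch below lists the applications known to the ResourceManager; it assumes a reachable cluster whose yarn-site.xml is on the classpath.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
  public static void main(String[] args) throws Exception {
    // Picks up yarn-site.xml from the classpath to locate the ResourceManager.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // One report per application; MapReduce jobs appear here alongside other YARN applications.
    List<ApplicationReport> apps = yarnClient.getApplications();
    for (ApplicationReport app : apps) {
      System.out.println(app.getApplicationId() + "  "
          + app.getName() + "  "
          + app.getYarnApplicationState());
    }

    yarnClient.stop();
  }
}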
The ResourceManager is the core component of YARN (Yet Another Resource Negotiator). By analogy, it occupies the place of the JobTracker of MRv1. Hadoop YARN is designed to provide a generic and flexible framework to administer the computing resources in the Hadoop cluster.
In this role, the YARN ResourceManager service (RM) is the central controlling authority for resource management and makes allocation decisions. As noted earlier, the ResourceManager has two main components: the Scheduler and the ApplicationsManager.