Introduction To Hadoop

Unit 1 :- Introduction to Big data

1.1 Introduction – Distributed file System


1.2 Big Data and its importance
1.3 Four Vs Driver for Big Data
1.4 Big data applications
1.5 Algorithm using map reduce
1.6 Matrix-Vector Multiplication by Map Reduce
Unit 2 :- Introduction to Hadoop
2.1 Big Data
2.2 Apache Hadoop & Hadoop Ecosystem
2.3 Moving Data in and out of Hadoop
2.4 Understanding inputs and outputs of MapReduce
2.5 Data Serialization
Unit 3 :- Hadoop Architecture
3.1 Hadoop Architecture
3.2 Hadoop Storage: HDFS, Common Hadoop Shell Commands
3.3 Anatomy of File Write and Read
3.4 NameNode, Secondary NameNode
3.5 DataNode, Hadoop MapReduce Paradigm, Map and Reduce Tasks

Unit 4 :- Hadoop Ecosystem and Yarn


4.1 Hadoop Ecosystem components – Schedulers – Fair and Capacity
4.2 Hadoop 2.0 New Features NameNode High Availability
4.3 HDFS Federation , MRv2 , YARN , Running MRv1 in Yarn

Introduction to Big Data

What is Data?
The quantities, characters, or symbols on which operations are performed by a computer,
which may be stored and transmitted in the form of electrical signals and recorded on
magnetic, optical, or mechanical recording media.

What is Big Data?


Big Data is a collection of data that is huge in volume, yet growing exponentially with time.
It is data of such large size and complexity that no traditional data management tool can
store or process it efficiently. In short, big data is also data, but with huge size.

What is an Example of Big Data?
Following are some of the Big Data examples-

The New York Stock Exchange is an example of Big Data that generates about one terabyte
of new trade data per day.

Social Media

Statistics show that 500+ terabytes of new data get ingested into the databases of the social
media site Facebook every day. This data is mainly generated from photo and video
uploads, message exchanges, comments, etc.

A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With
many thousands of flights per day, data generation reaches many petabytes.
Types Of Big Data
Following are the types of Big Data:

1. Structured
2. Unstructured
3. Semi-structured

Structured
Any data that can be stored, accessed and processed in a fixed format is termed
'structured' data. Over time, computer science has achieved great success in developing
techniques for working with such data (where the format is well known in advance) and in
deriving value from it. However, we now foresee issues as the size of such data grows to a
huge extent, with typical sizes reaching the range of multiple zettabytes.

Looking at these figures one can easily understand why the name Big Data is given and
imagine the challenges involved in its storage and processing.

Examples Of Structured Data

An ‘Employee’ table in a database is an example of Structured Data


Employee_ID   Employee_Name     Gender   Department   Salary_In_lacs
2365          Rajesh Kulkarni   Male     Finance      650000
3398          Pratibha Joshi    Female   Admin        650000
7465          Shushil Roy       Male     Admin        500000
7500          Shubhojit Das     Male     Finance      500000
7699          Priya Sane        Female   Finance      550000

Unstructured
Any data with unknown form or structure is classified as unstructured data. In addition to
its huge size, unstructured data poses multiple challenges in terms of processing it to
derive value. A typical example of unstructured data is a heterogeneous data source
containing a combination of simple text files, images, videos, etc. Nowadays organizations
have a wealth of data available to them, but unfortunately they don't know how to derive
value from it, since this data is in its raw, unstructured form.

Examples Of Un-structured Data

The output returned by ‘Google Search’


Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as
structured in form, but it is not actually defined with, for example, a table definition in a
relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file-

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>

Data Growth over the years
Please note that web application data, which is unstructured, consists of log files, transaction
history files etc. OLTP systems are built to work with structured data wherein data is stored
in relations (tables).

What is Big Data?

Big data is exactly what the name suggests, a “big” amount of data. Big Data means a data set
that is large in terms of volume and is more complex. Because of the large volume and higher
complexity of Big Data, traditional data processing software cannot handle it. Big Data
simply means datasets containing a large amount of diverse data, both structured as well as
unstructured.

Big Data allows companies to address issues they are facing in their business, and solve these
problems effectively using Big Data Analytics. Companies try to identify patterns and draw
insights from this sea of data so that it can be acted upon to solve the problem(s) at hand.

Although companies have been collecting a huge amount of data for decades, the concept of
Big Data only gained popularity in the early-mid 2000s. Corporations realized the amount of
data that was being collected on a daily basis, and the importance of using this data
effectively.

5Vs of Big Data

1. Volume refers to the amount of data that is being collected. The data could be
structured or unstructured.
2. Velocity refers to the rate at which data is coming in.
3. Variety refers to the different kinds of data (data types, formats, etc.) that are coming
in for analysis. Over the last few years, 2 additional Vs of data have also emerged –
value and veracity.
4. Value refers to the usefulness of the collected data.
5. Veracity refers to the quality of data that is coming in from different sources.
How Does Big Data Work?

Big data involves collecting, processing, and analyzing vast amounts of data from multiple
sources to uncover patterns, relationships, and insights that can inform decision-making. The
process involves several steps:

1. Data Collection
Big data is collected from various sources such as social media, sensors, transactional
systems, customer reviews, and other sources.

2. Data Storage
The collected data then needs to be stored in a way that it can be easily accessed and
analyzed later. This often requires specialized storage technologies capable of
handling large volumes of data.

3. Data Processing
Once the data is stored, it needs to be processed before it can be analyzed. This
involves cleaning and organizing the data to remove any errors or inconsistencies, and
transform it into a format suitable for analysis.

4. Data Analysis
After the data has been processed, it is time to analyze it using tools like statistical
models and machine learning algorithms to identify patterns, relationships, and trends.

5. Data Visualization
The insights derived from data analysis are then presented in visual formats such as
graphs, charts, and dashboards, making it easier for decision-makers to understand
and act upon them.

Use Cases

Big Data helps corporations in making better and faster decisions, because they have more
information available to solve problems, and have more data to test their hypothesis on.

Customer experience is a major field that has been revolutionized with the advent of Big
Data. Companies are collecting more data about their customers and their preferences than
ever. This data is being leveraged in a positive way, by giving personalized recommendations
and offers to customers, who are more than happy to allow companies to collect this data in
return for the personalized services. The recommendations you get on Netflix, or
Amazon/Flipkart are a gift of Big Data!

Machine Learning
Machine Learning is another field that has benefited greatly from the increasing popularity
of Big Data. More data means we have larger datasets to train our ML models, and a more
trained model (generally) results in a better performance. Also, with the help of Machine
Learning, we are now able to automate tasks that were earlier being done manually, all thanks
to Big Data.

Demand Forecasting
Demand forecasting has become more accurate with more and more data being collected
about customer purchases. This helps companies build forecasting models, that help them
forecast future demand, and scale production accordingly. It helps companies, especially
those in manufacturing businesses, to reduce the cost of storing unsold inventory in
warehouses.

Big data also has extensive use in applications such as product development and fraud
detection.

How to Store and Process Big Data?

The volume and velocity of Big Data can be huge, which makes it almost impossible to store
it in traditional data warehouses. Although some sensitive information can be stored on
company premises, for most of the data, companies have to opt for cloud storage or Hadoop.

Cloud storage allows businesses to store their data on the internet with the help of a cloud
service provider (like Amazon Web Services, Microsoft Azure, or Google Cloud Platform)
who takes the responsibility of managing and storing the data. The data can be accessed
easily and quickly with an API.

Hadoop also does the same thing, by giving you the ability to store and process large
amounts of data at once. Hadoop is an open-source software framework and is free. It allows
users to process large datasets across clusters of computers.

Big Data Tools

1. Apache Hadoop is an open-source big data tool designed to store and process large
amounts of data across multiple servers. Hadoop comprises a distributed file system
(HDFS) and a MapReduce processing engine.
2. Apache Spark is a fast and general-purpose cluster computing system that supports
in-memory processing to speed up iterative algorithms. Spark can be used for batch
processing, real-time stream processing, machine learning, graph processing, and SQL
queries.
3. Apache Cassandra is a distributed NoSQL database management system designed to
handle large amounts of data across commodity servers with high availability and
fault tolerance.
4. Apache Flink is an open-source streaming data processing framework that supports
batch processing, real-time stream processing, and event-driven applications. Flink
provides low-latency, high-throughput data processing with fault tolerance and
scalability.
5. Apache Kafka is a distributed streaming platform that enables the publishing and
subscribing to streams of records in real-time. Kafka is used for building real-time
data pipelines and streaming applications.
6. Splunk is a software platform used for searching, monitoring, and analyzing
machine-generated big data in real-time. Splunk collects and indexes data from various
sources and provides insights into operational and business intelligence.
7. Talend is an open-source data integration platform that enables organizations to
extract, transform, and load (ETL) data from various sources into target systems.
Talend supports big data technologies such as Hadoop, Spark, Hive, Pig, and HBase.
8. Tableau is a data visualization and business intelligence tool that allows users to
analyze and share data using interactive dashboards, reports, and charts. Tableau
supports big data platforms and databases such as Hadoop, Amazon Redshift, and
Google BigQuery.
9. Apache NiFi is a data flow management tool used for automating the movement of
data between systems. NiFi supports big data technologies such as Hadoop, Spark,
and Kafka and provides real-time data processing and analytics.
10. QlikView is a business intelligence and data visualization tool that enables users to
analyze and share data using interactive dashboards, reports, and charts. QlikView
supports big data platforms such as Hadoop, and provides real-time data processing
and analytics.

Big Data Best Practices

To effectively manage and utilize big data, organizations should follow some best practices:

• Define clear business objectives: Organizations should define clear business
objectives while collecting and analyzing big data. This can help avoid wasting time
and resources on irrelevant data.
• Collect and store relevant data only: It is important to collect and store only the
relevant data that is required for analysis. This can help reduce data storage costs and
improve data processing efficiency.
• Ensure data quality: It is critical to ensure data quality by removing errors,
inconsistencies, and duplicates from the data before storage and processing.
• Use appropriate tools and technologies: Organizations must use appropriate tools and
technologies for collecting, storing, processing, and analyzing big data. This includes
specialized software, hardware, and cloud-based technologies.
• Establish data security and privacy policies: Big data often contains sensitive
information, and therefore organizations must establish rigorous data security and
privacy policies to protect this data from unauthorized access or misuse.
• Leverage machine learning and artificial intelligence: Machine learning and artificial
intelligence can be used to identify patterns and predict future trends in big data.
Organizations must leverage these technologies to gain actionable insights from their
data.
• Focus on data visualization: Data visualization can simplify complex data into
intuitive visual formats such as graphs or charts, making it easier for decision-makers
to understand and act upon the insights derived from big data.

Challenges Faced by Big Data

1. Data Growth
Managing datasets having terabytes of information can be a big challenge for companies. As
datasets grow in size, storing them not only becomes a challenge but also becomes an
expensive affair for companies.
To overcome this, companies are now starting to pay attention to data compression and
deduplication. Data compression reduces the number of bits that the data needs, resulting in a
reduction in space being consumed. Data de-duplication is the process of making sure
duplicate and unwanted data does not reside in our database.

2. Data Security
Data security is often prioritized quite low in the Big Data workflow, which can backfire at
times. With such a large amount of data being collected, security challenges are bound to
come up sooner or later.

Mining of sensitive information, fake data generation, and lack of cryptographic protection
(encryption) are some of the challenges businesses face when trying to adopt Big Data
techniques.

Companies need to understand the importance of data security, and need to prioritize it. To
help them, there are professional Big Data consultants nowadays, that help businesses move
from traditional data storage and analysis methods to Big Data.

3. Data Integration
Data is coming in from a lot of different sources (social media applications, emails, customer
verification documents, survey forms, etc.). It often becomes a very big operational challenge
for companies to combine and reconcile all of this data.

There are several Big Data solution vendors that offer ETL (Extract, Transform, Load) and
data integration solutions to companies that are trying to overcome data integration problems.
There are also several APIs that have already been built to tackle issues related to data
integration.

Advantages and Disadvantages of Big Data

Advantages of Big Data

• Improved decision-making: Big data can provide insights and patterns that help
organizations make more informed decisions.
• Increased efficiency: Big data analytics can help organizations identify inefficiencies
in their operations and improve processes to reduce costs.
• Better customer targeting: By analyzing customer data, businesses can develop
targeted marketing campaigns that are relevant to individual customers, resulting in
better customer engagement and loyalty.
• New revenue streams: Big data can uncover new business opportunities, enabling
organizations to create new products and services that meet market demand.
• Competitive advantage: Organizations that can effectively leverage big data have a
competitive advantage over those that cannot, as they can make faster, more informed
decisions based on data-driven insights.

Disadvantages of Big Data


• Privacy concerns: Collecting and storing large amounts of data can raise privacy
concerns, particularly if the data includes sensitive personal information.
• Risk of data breaches: Big data increases the risk of data breaches, leading to loss of
confidential data and negative publicity for the organization.
• Technical challenges: Managing and processing large volumes of data requires
specialized technologies and skilled personnel, which can be expensive and
time-consuming.
• Difficulty in integrating data sources: Integrating data from multiple sources can be
challenging, particularly if the data is unstructured or stored in different formats.
• Complexity of analysis: Analyzing large datasets can be complex and time-
consuming, requiring specialized skills and expertise.

Applications of Big Data

Here are top 10 industries that use big data in their favor –
• Healthcare: Analyze patient data to improve healthcare outcomes, identify trends and
patterns, and develop personalized treatment.
• Retail: Track and analyze customer data to personalize marketing campaigns, improve
inventory management and enhance customer experience (CX).
• Finance: Detect fraud, assess risks and make informed investment decisions.
• Manufacturing: Optimize supply chain processes, reduce costs and improve product
quality through predictive maintenance.
• Transportation: Optimize routes, improve fleet management and enhance safety by
predicting accidents before they happen.
• Energy: Monitor and analyze energy usage patterns, optimize production, and reduce
waste through predictive analytics.
• Telecommunications: Manage network traffic, improve service quality, and reduce
downtime through predictive maintenance and outage prediction.
• Government and public sector: Address issues such as preventing crime, improving
traffic management, and predicting natural disasters.
• Advertising and marketing: Understand consumer behavior, target specific audiences
and measure the effectiveness of campaigns.
• Education: Personalize learning experiences, monitor student progress and improve
teaching methods through adaptive learning.

Big Data technologies can be used for creating a staging area or landing zone for new data
before identifying what data should be moved to the data warehouse. In addition, such
integration of Big Data technologies and data warehouse helps an organization to offload
infrequently accessed data.

A Distributed File System (DFS) as the name suggests, is a file system that is distributed on
multiple file servers or multiple locations. It allows programs to access or store isolated files
as they do with the local ones, allowing programmers to access files from any network or
computer.

The main purpose of the Distributed File System (DFS) is to allow users of physically
distributed systems to share their data and resources by using a common file system. A
collection of workstations and mainframes connected by a Local Area Network (LAN) is a
typical configuration for a Distributed File System. A DFS is implemented as part of the
operating system. In DFS, a namespace is created, and this process is transparent to the clients.
DFS has two components:

• Location Transparency –
Location transparency is achieved through the namespace component.
• Redundancy –
Redundancy is achieved through a file replication component.
In the case of failure and heavy load, these components together improve data availability by
allowing data in different locations to be logically grouped under one folder, which is known
as the "DFS root".
It is not necessary to use both components of DFS together; it is possible to use the
namespace component without the file replication component, and it is perfectly possible to
use the file replication component without the namespace component between servers.
File system replication:
Early iterations of DFS made use of Microsoft's File Replication Service (FRS), which
allowed for straightforward file replication between servers. FRS recognises new or updated
files and distributes the most recent version of the whole file to all servers. Windows Server
2003 R2 introduced "DFS Replication" (DFSR). It improves on FRS by copying only the
portions of files that have changed and by minimising network traffic with data compression.
Additionally, it provides users with flexible configuration options to manage network traffic
on a configurable schedule.
Features of DFS :
• Transparency :
• Structure transparency –
There is no need for the client to know about the number or locations
of file servers and the storage devices. Multiple file servers should be
provided for performance, adaptability, and dependability.
• Access transparency –
Both local and remote files should be accessible in the same manner.
The file system should automatically locate the accessed file and
send it to the client's side.
• Naming transparency –
There should not be any hint in the name of the file to the location of
the file. Once a name is given to the file, it should not be changed
during transferring from one node to another.
• Replication transparency –
If a file is copied on multiple nodes, both the copies of the file and
their locations should be hidden from one node to another.
• User mobility :
It will automatically bring the user’s home directory to the node where the user
logs in.
• Performance :
Performance is based on the average amount of time needed to service client
requests. This time covers the CPU time + the time taken to access secondary storage
+ the network access time. It is desirable that the performance of a Distributed File
System be comparable to that of a centralized file system.

• Simplicity and ease of use :
The user interface of a file system should be simple and the number of commands
should be small.
• High availability :
A Distributed File System should be able to continue operating in the case of any
partial failure such as a link failure, a node failure, or a storage drive crash.
A highly reliable and adaptable distributed file system should have different and
independent file servers for controlling different and independent storage devices.
• Scalability :
Since growing the network by adding new machines or joining two networks
together is routine, the distributed system will inevitably grow over time. As a
result, a good distributed file system should be built to scale quickly as the
number of nodes and users in the system grows. Service should not be
substantially disrupted as the number of nodes and users grows.
• High reliability :
The likelihood of data loss should be minimized as much as feasible in a suitable
distributed file system. That is, because of the system’s unreliability, users should
not feel forced to make backup copies of their files. Rather, a file system should
create backup copies of key files that can be used if the originals are lost. Many
file systems employ stable storage as a high-reliability strategy.
• Data integrity :
Multiple users frequently share a file system. The integrity of data saved in a
shared file must be guaranteed by the file system. That is, concurrent access
requests from many users who are competing for access to the same file must be
correctly synchronized using a concurrency control method. Atomic transactions
are a high-level concurrency management mechanism for data integrity that is
frequently offered to users by a file system.
• Security :
A distributed file system should be secure so that its users may trust that their
data will be kept private. To safeguard the information contained in the file
system from unwanted & unauthorized access, security mechanisms must be
implemented.
• Heterogeneity :
Heterogeneity in distributed systems is unavoidable as a result of huge scale.
Users of heterogeneous distributed systems have the option of using multiple
computer platforms for different purposes.
History :
The server component of the Distributed File System was initially introduced as an add-on
feature. It was added to Windows NT 4.0 Server and was known as "DFS 4.1". Later it was
included as a standard component in all editions of Windows 2000 Server. Client-side
support has been included in Windows NT 4.0 and in later versions of Windows.
Linux kernels 2.6.14 and later come with an SMB client VFS known as "cifs" which
supports DFS. Mac OS X 10.7 (Lion) and later also support DFS.
Properties:
• File transparency: users can access files without knowing where they are
physically stored on the network.

• Load balancing: the file system can distribute file access requests across multiple
computers to improve performance and reliability.
• Data replication: the file system can store copies of files on multiple computers to
ensure that the files are available even if one of the computers fails.
• Security: the file system can enforce access control policies to ensure that only
authorized users can access files.
• Scalability: the file system can support a large number of users and a large number
of files.
• Concurrent access: multiple users can access and modify the same file at the same
time.
• Fault tolerance: the file system can continue to operate even if one or more of its
components fail.
• Data integrity: the file system can ensure that the data stored in the files is
accurate and has not been corrupted.
• File migration: the file system can move files from one location to another without
interrupting access to the files.
• Data consistency: changes made to a file by one user are immediately visible to all
other users.
• Support for different file types: the file system can support a wide range of file
types, including text files, image files, and video files.
Applications :
• NFS –
NFS stands for Network File System. It is a client-server architecture that allows a
computer user to view, store, and update files remotely. The protocol of NFS is
one of the several distributed file system standards for Network-Attached Storage
(NAS).
• CIFS –
CIFS stands for Common Internet File System. CIFS is a dialect of SMB; that is,
CIFS is an implementation of the SMB protocol, designed by Microsoft.
• SMB –
SMB stands for Server Message Block. It is a file-sharing protocol that was
invented by IBM. The SMB protocol was created to allow computers to perform
read and write operations on files on a remote host over a Local Area Network
(LAN). The directories present on the remote host can be accessed via SMB and
are called "shares".
• Hadoop –
Hadoop is a group of open-source software services. It gives a software
framework for distributed storage and processing of big data using the MapReduce
programming model. The core of Hadoop contains a storage part, known as the
Hadoop Distributed File System (HDFS), and a processing part which is the
MapReduce programming model.
• NetWare –
NetWare is a discontinued computer network operating system developed by Novell,
Inc. It primarily used cooperative multitasking to run different services on a
personal computer, using the IPX network protocol.
Working of DFS :
There are two ways in which DFS can be implemented:

• Standalone DFS namespace –
It allows only for DFS roots that exist on the local computer and do not use
Active Directory. A standalone DFS can only be accessed on the computer on
which it is created. It does not provide any fault tolerance and cannot be linked
to any other DFS. Standalone DFS roots are rarely encountered because of their
limited advantages.
• Domain-based DFS namespace –
It stores the configuration of DFS in Active Directory, making the DFS namespace
root accessible at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>

Advantages :
• DFS allows multiple users to access or store data.
• It allows data to be shared remotely.
• It improves file availability, access time, and network efficiency.
• It improves the ability to change the size of the data and also the ability to
exchange data.
• A Distributed File System provides transparency of data even if a server or disk fails.
Disadvantages :
• In a Distributed File System, nodes and connections need to be secured, therefore we
can say that security is at stake.
• There is a possibility of loss of messages and data in the network while moving
from one node to another.
• Database connection in the case of a Distributed File System is complicated.
• Handling of a database is also not easy in a Distributed File System as compared
to a single-user system.
• There are chances of overloading if all nodes try to send data at once.
Algorithm using map reduce
MapReduce is a framework using which we can write applications to process huge amounts
of data, in parallel, on large clusters of commodity hardware in a reliable manner.
What is MapReduce?
MapReduce is a processing technique and a program model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). Secondly, reduce task, which takes the output from
a map as an input and combines those data tuples into a smaller set of tuples. As the sequence
of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers
is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling
the application to run over hundreds, thousands, or even tens of thousands of machines in a
cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where the
data resides!
• MapReduce program executes in three stages, namely map stage, shuffle stage,
and reduce stage.
o Map stage − The map or mapper’s job is to process the input
data. Generally the input data is in the form of file or directory
and is stored in the Hadoop file system (HDFS). The input file is
passed to the mapper function line by line. The mapper processes
the data and creates several small chunks of data.
o Reduce stage − This stage is the
combination of the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from the mapper.
After processing, it produces a new set of output, which will be
stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the
nodes.
• Most of the computing takes place on nodes with data on local disks that
reduces the network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.

Inputs and Outputs (Java Perspective)


The MapReduce framework operates on <key, value> pairs, that is, the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.
The key and value classes must be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. The input and output
types of a MapReduce job are − (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
Input Output

Map <k1, v1> list (<k2, v2>)

Reduce <k2, list(v2)> list (<k3, v3>)
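
To make the type flow above concrete, the skeleton below marks where <k1, v1>, <k2, v2>
and <k3, v3> appear in Mapper and Reducer declarations. It is a minimal sketch using the
classic org.apache.hadoop.mapred API (the same API as the example program later in this
unit); the class names and the summing logic are illustrative, not from the original text.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: <k1, v1> = <LongWritable, Text>  -->  list(<k2, v2>) = list(<Text, IntWritable>)
class SkeletonMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
   public void map(LongWritable k1, Text v1,
         OutputCollector<Text, IntWritable> output, Reporter reporter)
         throws IOException {
      // Emit zero or more intermediate <k2, v2> pairs for this input record.
      output.collect(new Text("someKey"), new IntWritable(1));
   }
}

// Reduce: <k2, list(v2)> = <Text, list(IntWritable)>  -->  list(<k3, v3>) = list(<Text, IntWritable>)
class SkeletonReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
   public void reduce(Text k2, Iterator<IntWritable> v2s,
         OutputCollector<Text, IntWritable> output, Reporter reporter)
         throws IOException {
      int sum = 0;
      while (v2s.hasNext()) {
         sum += v2s.next().get();   // combine the values that share this key
      }
      output.collect(k2, new IntWritable(sum));  // final <k3, v3>
   }
}

Both classes must use Writable types (Text, IntWritable, etc.) for the reasons described above.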


Terminology
• PayLoad − Applications implement the Map and the Reduce functions, and
form the core of the job.
• Mapper − Mapper maps the input key/value pairs to a set of intermediate
key/value pair.
• NamedNode − Node that manages the Hadoop Distributed File System
(HDFS).
• DataNode − Node where data is presented in advance before any processing
takes place.
• MasterNode − Node where JobTracker runs and which accepts job requests
from clients.
• SlaveNode − Node where Map and Reduce program runs.
• JobTracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
• Task Tracker − Tracks the task and reports status to JobTracker.
• Job − A program is an execution of a Mapper and Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a
SlaveNode.
Example Scenario
Given below is the data regarding the electrical consumption of an organization. It contains
the monthly electrical consumption and the annual average for various years.
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg

1979 23 23 2 43 24 25 26 26 26 26 25 26 25

1980 26 27 28 28 28 30 31 31 31 30 30 30 29

1981 31 32 32 32 33 34 35 36 36 34 34 34 34

1984 39 38 39 39 39 41 42 43 40 39 38 38 40

1985 38 39 39 39 39 41 41 41 00 40 39 39 45
If the above data is given as input, we have to write applications to process it and produce
results such as finding the year of maximum usage, year of minimum usage, and so on. This
is a walkover for the programmers with finite number of records. They will simply write the
logic to produce the required output, and pass the data to the application written.
But, think of the data representing the electrical consumption of all the large-scale industries
of a particular state, since its formation.
When we write applications to process such bulk data,
• They will take a lot of time to execute.
• There will be heavy network traffic when we move data from the source to the network
server and so on.
To solve these problems, we have the MapReduce framework.
Input Data
The above data is saved as sample.txt and given as input. The input file looks as shown
below.
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
Example Program
Given below is the program to process the sample data using the MapReduce framework.
package hadoop;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits {

   //Mapper class: emits (year, annual average) for every input line
   public static class E_EMapper extends MapReduceBase implements
         Mapper<LongWritable, /*Input key Type */
                Text,         /*Input value Type*/
                Text,         /*Output key Type*/
                IntWritable>  /*Output value Type*/
   {
      //Map function
      public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {

         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line, "\t");
         String year = s.nextToken();      // first token is the year
         while (s.hasMoreTokens()) {
            lasttoken = s.nextToken();     // last token is the annual average
         }
         int avgprice = Integer.parseInt(lasttoken);
         output.collect(new Text(year), new IntWritable(avgprice));
      }
   }

   //Reducer class: emits the values above the threshold (30 units) for each year
   public static class E_EReduce extends MapReduceBase implements
         Reducer<Text, IntWritable, Text, IntWritable> {

      //Reduce function
      public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {

         int maxavg = 30;
         int val = Integer.MIN_VALUE;

         while (values.hasNext()) {
            if ((val = values.next().get()) > maxavg) {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }

   //Main function: configures and submits the job
   public static void main(String args[]) throws Exception {
      JobConf conf = new JobConf(ProcessUnits.class);

      conf.setJobName("max_electricityunits");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(E_EMapper.class);
      conf.setCombinerClass(E_EReduce.class);
      conf.setReducerClass(E_EReduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
   }
}
Save the above program as ProcessUnits.java. The compilation and execution of the
program is explained below.
Compilation and Execution of Process Units Program
Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1
The following command is to create a directory to store the compiled java classes.
$ mkdir units
Step 2
Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce
program. Visit the following link mvnrepository.com to download the jar. Let us assume the
downloaded folder is /home/hadoop/.
Step 3
The following commands are used for compiling the ProcessUnits.java program and creating
a jar for the program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
Step 4
The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5
The following command is used to copy the input file named sample.txt into the input
directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6
The following command is used to verify the files in the input directory.
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7
The following command is used to run the Eleunit_max application by taking the input files
from the input directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir

Matrix-Vector Multiplication by Map Reduce


MapReduce is a technique in which a huge program is subdivided into small tasks and run
in parallel to make computation faster and save time; it is mostly used in distributed systems.
It has 2 important parts:
• Mapper: It takes raw data input and organizes it into key-value pairs. For example, in a
dictionary, you search for the word "Data" and its associated meaning is "facts and
statistics collected together for reference or analysis". Here the key is "Data" and the
value associated with it is "facts and statistics collected together for reference or
analysis".
• Reducer: It is responsible for processing the data in parallel and producing the final output.
Let us consider the matrix multiplication example to visualize MapReduce. Consider the
following 2×2 matrices A and B (these are the element values used in the computation below):

A = | 1  2 |      B = | 5  6 |
    | 3  4 |          | 7  8 |

Here matrix A is a 2×2 matrix which means the number of rows(i)=2 and the number of
columns(j)=2. Matrix B is also a 2×2 matrix where number of rows(j)=2 and number of
columns(k)=2. Each cell of matrix A is labelled Aij and each cell of matrix B is labelled Bjk.
For example, element 3 in matrix A is called A21, i.e. 2nd row, 1st column. One-step matrix
multiplication has 1 mapper and 1 reducer. The formula is:
Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i
Therefore, computing the mapper for Matrix A:
# k, i and j each take the values 1 and 2.
# Therefore when k=1, i can have 2 values (1 and 2),
# and each case can have 2 further values of j (1 and 2).
# Substituting all values in the formula:

k=1  i=1  j=1  ((1, 1), (A, 1, 1))
          j=2  ((1, 1), (A, 2, 2))
     i=2  j=1  ((2, 1), (A, 1, 3))
          j=2  ((2, 1), (A, 2, 4))

k=2  i=1  j=1  ((1, 2), (A, 1, 1))
          j=2  ((1, 2), (A, 2, 2))
     i=2  j=1  ((2, 2), (A, 1, 3))
          j=2  ((2, 2), (A, 2, 4))

Computing the mapper for Matrix B:

i=1  j=1  k=1  ((1, 1), (B, 1, 5))
          k=2  ((1, 2), (B, 1, 6))
     j=2  k=1  ((1, 1), (B, 2, 7))
          k=2  ((1, 2), (B, 2, 8))

i=2  j=1  k=1  ((2, 1), (B, 1, 5))
          k=2  ((2, 2), (B, 1, 6))
     j=2  k=1  ((2, 1), (B, 2, 7))
          k=2  ((2, 2), (B, 2, 8))
The formula for the Reducer is:
Reducer(k, v): for each key (i, k), make sorted Alist and Blist
(i, k) => sum over j of (Aij * Bjk)
Output => ((i, k), sum)
Therefore, computing the reducer:

# We can observe from the Mapper computation
# that 4 keys are common: (1, 1), (1, 2),
# (2, 1) and (2, 2).
# Make separate lists for Matrix A and Matrix B
# with the adjoining values taken from the Mapper step above:

(1, 1) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 5), (B, 2, 7)}
          Now Aij x Bjk: [(1*5) + (2*7)] = 19 -------(i)

(1, 2) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 6), (B, 2, 8)}
          Now Aij x Bjk: [(1*6) + (2*8)] = 22 -------(ii)

(2, 1) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 5), (B, 2, 7)}
          Now Aij x Bjk: [(3*5) + (4*7)] = 43 -------(iii)

(2, 2) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 6), (B, 2, 8)}
          Now Aij x Bjk: [(3*6) + (4*8)] = 50 -------(iv)

From (i), (ii), (iii) and (iv) we conclude that

((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)

Therefore the final matrix is:

C = A x B = | 19  22 |
            | 43  50 |
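
The worked example above multiplies two matrices. For the matrix-vector case named in the
section title, the idea is simpler: if the vector v fits in memory, the mapper can emit
(i, Aij * vj) for every matrix element and the reducer sums the partial products for each
row i. The sketch below is a minimal illustration under those assumptions, using the same
classic mapred API as the ProcessUnits program; the input format ("i j Aij" per line) and
the hypothetical vector.txt file are assumptions, not part of the original text.

package hadoop;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MatrixVector {

   // Mapper: for a matrix entry "i j Aij", emit (i, Aij * v[j]).
   public static class MVMapper extends MapReduceBase
         implements Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

      private final List<Double> vector = new ArrayList<Double>();

      public void configure(JobConf job) {
         // Assumption: the vector is small and available to every mapper as a
         // local file (for example via the distributed cache), one value per line.
         try {
            BufferedReader br = new BufferedReader(new FileReader("vector.txt"));
            String line;
            while ((line = br.readLine()) != null) {
               vector.add(Double.parseDouble(line.trim()));
            }
            br.close();
         } catch (IOException e) {
            throw new RuntimeException("Could not read vector.txt", e);
         }
      }

      public void map(LongWritable key, Text value,
            OutputCollector<LongWritable, DoubleWritable> output,
            Reporter reporter) throws IOException {
         String[] parts = value.toString().trim().split("\\s+");
         long i = Long.parseLong(parts[0]);          // row index
         int j = Integer.parseInt(parts[1]);         // column index (assumed 0-based)
         double aij = Double.parseDouble(parts[2]);  // matrix entry Aij
         output.collect(new LongWritable(i),
               new DoubleWritable(aij * vector.get(j)));
      }
   }

   // Reducer: sum the partial products for each row i to get (A x v)[i].
   public static class MVReducer extends MapReduceBase
         implements Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {

      public void reduce(LongWritable row, Iterator<DoubleWritable> values,
            OutputCollector<LongWritable, DoubleWritable> output,
            Reporter reporter) throws IOException {
         double sum = 0.0;
         while (values.hasNext()) {
            sum += values.next().get();
         }
         output.collect(row, new DoubleWritable(sum));
      }
   }
}

A driver similar to the main() of ProcessUnits (a JobConf with mapper, reducer and
input/output paths) would submit this job; in practice the vector would be shipped to the
worker nodes, for example with the distributed cache.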

Unit 2 :- Introduction to Hadoop

What is Hadoop

Hadoop is an open source framework from Apache and is used to store, process and analyze
data which is very huge in volume. Hadoop is written in Java and is not OLAP (online
analytical processing). It is used for batch/offline processing. It is used by Facebook,
Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by
adding nodes to the cluster.

Modules of Hadoop

1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the
basis of that HDFS was developed. It states that the files will be broken into blocks
and stored in nodes over the distributed architecture.
2. Yarn: Yet Another Resource Negotiator is used for job scheduling and managing the
cluster.
3. Map Reduce: This is a framework which helps Java programs to do parallel
computation on data using key-value pairs. The Map task takes input data and converts
it into a data set which can be computed as key-value pairs. The output of the Map task
is consumed by the Reduce task, and the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by
other Hadoop modules.

Hadoop Architecture

The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node
includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node
includes DataNode and TaskTracker.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It
contains a master/slave architecture. This architecture consists of a single NameNode
performing the role of master, and multiple DataNodes performing the role of slaves.


Both NameNode and DataNode are capable enough to run on commodity machines. The Java
language is used to develop HDFS. So any machine that supports Java language can easily
run the NameNode and DataNode software.
NameNode
o It is a single master server that exists in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming
and closing files.
o It simplifies the architecture of the system.

DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file
system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.

Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the
data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.

Task Tracker
o It works as a slave node for the Job Tracker.
o It receives the task and code from the Job Tracker and applies that code to the file. This
process can also be called a Mapper.

MapReduce Layer

The MapReduce comes into existence when the client application submits the MapReduce
job to Job Tracker. In response, the Job Tracker sends the request to the appropriate Task
Trackers. Sometimes, the TaskTracker fails or time out. In such a case, that part of the job is
rescheduled.

Advantages of Hadoop

o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in
faster retrieval. Even the tools to process the data are often on the same servers, thus
reducing the processing time. Hadoop is able to process terabytes of data in minutes and
petabytes in hours.
o Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so
it is really cost-effective as compared to a traditional relational database management
system.
o Resilient to failure: HDFS has the property of replicating data over the network, so if
one node is down or some other network failure happens, Hadoop takes the other copy
of the data and uses it. Normally, data is replicated three times, but the replication
factor is configurable (see the configuration sketch below).
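
As a minimal sketch (an assumption about a typical installation, not something stated in the
text above), the replication factor is normally controlled by the dfs.replication property in
the cluster's hdfs-site.xml configuration file:

<configuration>
  <!-- Number of copies HDFS keeps of each block; 3 is the usual default.
       A lower value is often used on small test clusters. -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>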

History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the
Google File System paper, published by Google.

Let's focus on the history of Hadoop in the following steps: -

o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache
Nutch. It is an open source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. To store that data
they had to spend a lot on storage, which became a problem for the project. This
problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google File System). It is a
proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on MapReduce. This technique simplifies the
data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as
NDFS (Nutch Distributed File System). This file system also includes Map reduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting
introduced a new project, Hadoop, with a file system known as HDFS (Hadoop
Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo ran two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900-node
cluster within 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.

Year   Event

2003   Google released the Google File System (GFS) paper.

2004   Google released a white paper on MapReduce.

2006   o Hadoop introduced.
       o Hadoop 0.1.0 released.
       o Yahoo deploys 300 machines and within this year reaches 600 machines.

2007   o Yahoo runs 2 clusters of 1000 machines.
       o Hadoop includes HBase.

2008   o YARN JIRA opened.
       o Hadoop becomes the fastest system to sort 1 terabyte of data on a 900-node
         cluster within 209 seconds.
       o Yahoo clusters loaded with 10 terabytes per day.
       o Cloudera was founded as a Hadoop distributor.

2009   o Yahoo runs 17 clusters of 24,000 machines.
       o Hadoop becomes capable enough to sort a petabyte.
       o MapReduce and HDFS become separate subprojects.

2010   o Hadoop added support for Kerberos.
       o Hadoop operates 4,000 nodes with 40 petabytes.
       o Apache Hive and Pig released.

2011   o Apache Zookeeper released.
       o Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.

2012   Apache Hadoop 1.0 version released.

2013   Apache Hadoop 2.2 version released.

2014   Apache Hadoop 2.6 version released.

2015   Apache Hadoop 2.7 version released.

2017   Apache Hadoop 3.0 version released.

2018   Apache Hadoop 3.1 version released.

2.1 Big Data


Big data is a collection of large datasets that cannot be processed using traditional computing
techniques. It is not a single technique or a tool; rather, it has become a complete subject,
which involves various tools, techniques and frameworks.

2.2 Apache Hadoop & Hadoop Ecosystem


Apache Hadoop is an open source, Java-based software platform that manages data
processing and storage for big data applications. The platform works by distributing Hadoop
big data and analytics jobs across nodes in a computing cluster, breaking them down into
smaller workloads that can be run in parallel. Some key benefits of Hadoop are scalability,
resilience and flexibility. The Hadoop Distributed File System (HDFS) provides reliability
and resiliency by replicating the data held on any node of the cluster to other nodes of the
cluster to protect against hardware or software failures. Hadoop's flexibility allows the
storage of any data format, including structured and unstructured data.
Hadoop Ecosystem :-
Hadoop Ecosystem is a platform or a suite which provides various services to solve the big
data problems. It includes Apache projects and various commercial tools and solutions. There
are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop Common.
Most of the tools or solutions are used to supplement or support these major elements. All
these tools work collectively to provide services such as absorption, analysis, storage and
maintenance of data etc.
Following are the components that collectively form a Hadoop ecosystem:

• HDFS: Hadoop Distributed File System


• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling

Note: Apart from the above-mentioned components, there are many other components too
that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one term, i.e. data. That's the beauty of
Hadoop: it revolves around data and hence makes its synthesis easier.

HDFS:

• HDFS is the primary or major component of the Hadoop ecosystem and is responsible
for storing large data sets of structured or unstructured data across various nodes
and thereby maintaining the metadata in the form of log files.
• HDFS consists of two core components i.e.
1. Name node
2. Data Node
• Name Node is the prime node which contains metadata (data about data) requiring
comparatively fewer resources than the data nodes that stores the actual data.
These data nodes are commodity hardware in the distributed environment.
Undoubtedly, making Hadoop cost effective.
• HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
YARN:

• Yet Another Resource Negotiator, as the name implies, YARN helps to manage the
resources across the clusters. In short, it performs scheduling and resource allocation
for the Hadoop system.
• Consists of three major components i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
• Resource manager has the privilege of allocating resources for the applications in
a system whereas Node managers work on the allocation of resources such as
CPU, memory, bandwidth per machine and later on acknowledges the resource
manager. Application manager works as an interface between the resource
manager and node manager and performs negotiations as per the requirement of
the two.
MapReduce:

• By making use of distributed and parallel algorithms, MapReduce moves the
processing logic to the data and helps to write applications which transform big data
sets into manageable ones (a short word-count sketch follows this list).
• MapReduce makes the use of two functions i.e. Map() and Reduce() whose task
is:
1. Map() performs sorting and filtering of data and thereby organizing
them in the form of group. Map generates a key-value pair based result
which is later on processed by the Reduce() method.
2. Reduce(), as the name suggests does the summarization by aggregating
the mapped data. In simple, Reduce() takes the output generated by
Map() as input and combines those tuples into smaller set of tuples.
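
As a concrete illustration of Map() and Reduce(), the sketch below shows the classic word
count job, written against the same classic mapred API as the ProcessUnits example in
Unit 1; the class names and structure here are illustrative, not from the original text.
Map() emits (word, 1) for every word in the input, and Reduce() sums the counts per word.

package hadoop;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

   // Map(): split each input line into words and emit (word, 1).
   public static class WCMapper extends MapReduceBase
         implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final IntWritable one = new IntWritable(1);

      public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
         StringTokenizer tokens = new StringTokenizer(value.toString());
         while (tokens.hasMoreTokens()) {
            output.collect(new Text(tokens.nextToken()), one);
         }
      }
   }

   // Reduce(): sum the 1s emitted for each word to get its total count.
   public static class WCReducer extends MapReduceBase
         implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text word, Iterator<IntWritable> counts,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
         int sum = 0;
         while (counts.hasNext()) {
            sum += counts.next().get();
         }
         output.collect(word, new IntWritable(sum));
      }
   }
}

A driver similar to the main() of ProcessUnits (JobConf, mapper/reducer classes, input and
output paths, JobClient.runJob) would submit this job to the cluster.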
PIG:
Pig was basically developed by Yahoo which works on a pig Latin language, which is Query
based language similar to SQL.
• It is a platform for structuring the data flow, processing and analyzing huge data
sets.

• Pig does the work of executing commands and in the background, all the activities
of MapReduce are taken care of. After the processing, pig stores the result in
HDFS.
• Pig Latin language is specially designed for this framework and runs on Pig
Runtime, just the way Java runs on the JVM.
• Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
HIVE:

• With the help of SQL methodology and interface, HIVE performs reading and
writing of large data sets. However, its query language is called as HQL (Hive
Query Language).
• It is highly scalable as it allows real-time processing and batch processing both.
Also, all the SQL datatypes are supported by Hive thus, making the query
processing easier.
• Similar to the Query Processing frameworks, HIVE too comes with two
components: JDBC Drivers and HIVE Command Line.
• JDBC, along with ODBC drivers work on establishing the data storage
permissions and connection whereas HIVE Command line helps in the processing
of queries.
Mahout:

• Mahout allows machine learning capability to be added to a system or application.
Machine learning, as the name suggests, helps the system to develop itself based on
some patterns, user/environmental interaction, or on the basis of algorithms.
• It provides various libraries or functionalities such as collaborative filtering,
clustering, and classification which are nothing but concepts of Machine learning.
It allows invoking algorithms as per our need with the help of its own libraries.
Apache Spark:
• It is a platform that handles process-intensive tasks such as batch
processing, interactive or iterative real-time processing, graph conversions, and
visualization.
• It uses in-memory resources and is therefore faster than MapReduce in terms
of optimization.
• Spark is best suited for real-time data, whereas Hadoop is best suited for structured
data or batch processing; hence most companies use both.
Apache HBase:
• It is a NoSQL database that supports all kinds of data and is thus capable of
handling any workload of a Hadoop database. It provides the capabilities of Google's
BigTable and is therefore able to work on big data sets effectively.
• At times when we need to search for or retrieve a few small records
in a huge database, the request must be processed within a short span of
time. At such times, HBase comes in handy, as it gives us a fast, fault-tolerant way of
storing and reading limited amounts of data.
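A minimal sketch of such a point lookup with the HBase Java client is shown below; the table name, column family, qualifier, and row key are invented for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Looks up a single row by key from an HBase table.
public class HBaseGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            Get get = new Get(Bytes.toBytes("row-key-123"));          // row key to look up
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"),      // column family
                                           Bytes.toBytes("name"));     // column qualifier
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}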
Other Components: Apart from all of these, there are some other components too that carry
out a huge task in order to make Hadoop capable of processing large datasets. They are as
follows:

• Solr, Lucene: These are two services that perform the task of searching and
indexing with the help of Java libraries. Lucene is a Java library that also
provides a spell-check mechanism, and Solr is built on top of Lucene.
• Zookeeper: There was a huge issue of management of coordination and
synchronization among the resources or the components of Hadoop which resulted
in inconsistency, often. Zookeeper overcame all the problems by performing
synchronization, inter-component based communication, grouping, and
maintenance.
• Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and
binding them together as a single unit. There are two kinds of jobs, i.e. Oozie
workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be
executed in a sequentially ordered manner, whereas Oozie coordinator jobs are
those that are triggered when some data or external stimulus is given to them.

2.3 Moving Data in and out of Hadoop

2.4 Understanding inputs and outputs of MapReduce


Hadoop MapReduce is a programming model and software framework used for writing
applications that process large amounts of data. There are two phases in the MapReduce
program, Map and Reduce.

The Map task includes splitting and mapping of the data: it takes a dataset and converts it
into another set of data, where the individual elements are broken down into tuples, i.e.
key/value pairs. The Reduce task then shuffles and reduces the data, which means it
combines the data tuples based on the key and modifies the value of the key accordingly.

In the Hadoop framework, the MapReduce model is the core component for data processing.
Using this model, it is very easy to scale an application to run over hundreds, thousands or
more machines in a cluster by only making a configuration change. This is because
MapReduce programs are inherently parallel. Hadoop has the capability of running
MapReduce programs written in many languages such as Java, Ruby, Python and C++.

Inputs and Outputs

The MapReduce model operates on <key, value> pairs. It views the input to the jobs as a set
of <key, value> pairs and produces a different set of <key, value> pairs as the output of the

jobs. Data input is supported by two classes in this framework, namely InputFormat and
RecordReader.

The first is consulted to determine how the input data should be partitioned for the map tasks,
while the latter reads the data from the inputs. For the data output also there are two classes,
OutputFormat and RecordWriter. The first class performs a basic validation of the data sink
properties and the second class is used to write each reducer output to the data sink.
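To make the roles of these classes concrete, the hedged driver sketch below wires a TextInputFormat (whose RecordReader produces byte-offset/line pairs) and a TextOutputFormat (whose RecordWriter writes key-and-value lines) into a job. It reuses the word-count mapper and reducer sketched earlier, and the input/output paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Driver that wires an InputFormat and an OutputFormat into a MapReduce job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        // TextInputFormat splits the input and its RecordReader produces
        // (byte offset, line) pairs; TextOutputFormat's RecordWriter writes
        // each reducer output as a tab-separated key/value line.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}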
What are the Phases of MapReduce?

In MapReduce a data goes through the following phases.

Input Splits: An input in the MapReduce model is divided into small fixed-size parts called
input splits. This part of the input is consumed by a single map. The input data is generally a
file or directory stored in the HDFS.

Mapping: This is the first phase in the map-reduce program execution where the data in each
split is passed line by line, to a mapper function to process it and produce the output values.

Shuffling: This is part of the output phase of mapping, where the relevant records are
consolidated from the map output. It consists of merging and sorting: all the key-value pairs
which have the same key are combined, and the merged pairs are then sorted by key. For
example, in a word count job the pairs (cat, 1) and (cat, 1) emitted by different mappers are
grouped as (cat, [1, 1]) during shuffling.

Reduce: All the values grouped in the shuffling phase are combined and a single output value
is returned for each key, thus summarizing the entire dataset. In the word count example,
(cat, [1, 1]) is reduced to (cat, 2).

How does MapReduce Organize Work?

Hadoop divides a job into two sets of tasks: Map tasks, which include splits and mapping, and
Reduce tasks, which include shuffling and reducing. These were mentioned in the phases in
the above section. The execution of these tasks is controlled by two entities: a JobTracker
and multiple TaskTrackers.

With every job that gets submitted for execution, there is a JobTracker that resides on the
NameNode and multiple task trackers that reside on the DataNode. A job gets divided into
multiple tasks that run on multiple data nodes in the cluster. The JobTracker coordinates the
activity by scheduling tasks to run on various data nodes.

The task tracker looks after the execution of individual tasks. It also sends the progress report
to the JobTracker. Periodically, it sends a signal to the JobTracker to notify the current state
of the system. When there is a task failure, the JobTracker reschedules it on a different task
tracker.

Advantages of MapReduce

There are a number of advantages for applications which use this model. These are

• Big data can be easily handled.
• Datasets can be processed in parallel.
• All types of data, such as structured, unstructured and semi-structured, can be
easily processed.
• High scalability is provided.
• Counting occurrences of words is easy, and such applications can work on
massive data collections.
• Large samples of respondents can be accessed quickly.
• In data analysis, generic tools can be used for searching.
• Load balancing is provided in large clusters.
• The process of extracting contexts such as user locations and situations is easily
possible.
• Good generalization performance and convergence are provided to these
applications.

2.5 Data Serialization


Serialization is the process of converting a data object—a combination of code and data
represented within a region of data storage—into a series of bytes that saves the state of the
object in an easily transmittable form. In this serialized form, the data can be delivered to
another data store (such as an in-memory computing platform), application, or some other
destination.

Data serialization is the process of converting an object into a stream of bytes to more easily
save or transmit it.
The reverse process—constructing a data structure or object from a series of bytes— is
deserialization. The deserialization process recreates the object, thus making the data easier
to read and modify as a native structure in a programming language.
Serialization and deserialization work together to transform/recreate data objects to/from a
portable format.
Serialization enables us to save the state of an object and recreate the object in a new location.
Serialization encompasses both the storage of the object and exchange of data. Since objects
are composed of several components, saving or delivering all the parts typically requires
significant coding effort, so serialization is a standard way to capture the object into a
sharable format. With serialization, we can transfer objects:

• Over the wire for messaging use cases


• From application to application via web services such as REST APIs
• Through firewalls (as JSON or XML strings)
• Across domains
• To other data stores
• To identify changes in data over time
• While honoring security and user-specific details across applications
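As a minimal illustration of this round trip, the sketch below uses Java's built-in serialization; the Person class and its fields are invented for the example.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Serializes an object to bytes and deserializes it back using Java's built-in mechanism.
public class SerializationExample {
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    public static void main(String[] args) throws Exception {
        Person original = new Person("Asha", 30);

        // Serialize: object -> byte stream
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(original);
        }
        byte[] bytes = bos.toByteArray();

        // Deserialize: byte stream -> object
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            Person copy = (Person) in.readObject();
            System.out.println(copy.name + ", " + copy.age);
        }
    }
}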

Why Is Data Serialization Important for Distributed Systems?

In some distributed systems, data and its replicas are stored in different partitions on multiple
cluster members. If data is not present on the local member, the system will retrieve that data
from another member. This requires serialization for use cases such as:

• Adding key/value objects to a map


• Putting items into a queue, set, or list
• Sending a lambda function to another server
• Processing an entry within a map
• Locking an object
• Sending a message to a topic

What Are Common Languages for Data Serialization?

A number of popular object-oriented programming languages provide either native support
for serialization or have libraries that add non-native capabilities for serialization to their
feature set. Java, .NET, C++, Node.js, Python, and Go, for example, all either have native
serialization support or integrate with libraries for serialization.
Data formats such as JSON and XML are often used as the format for storing serialized data.
Custom binary formats are also used; these tend to be more space-efficient because there is
less markup/tagging in the serialized form.

What Is Data Serialization in Big Data?

Big data systems often include technologies/data that are described as “schemaless.” This
means that the managed data in these systems are not structured in a strict format, as defined
by a schema. Serialization provides several benefits in this type of environment:

• Structure. By inserting some schema or criteria for a data structure through
serialization on read, we can avoid reading data that misses mandatory fields, is
incorrectly classified, or lacks some other quality control requirement.
• Portability. Big data comes from a variety of systems and may be written in a variety
of languages. Serialization can provide the necessary uniformity to transfer such data
to other enterprise systems or applications.
• Versioning. Big data is constantly changing. Serialization allows us to apply version
numbers to objects for lifecycle management.
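Within the Hadoop ecosystem specifically, intermediate keys and values are serialized with Hadoop's compact Writable mechanism rather than Java's default serialization. A minimal custom Writable might look like the sketch below; the class and field names are illustrative, not part of any standard API.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// A custom Writable: Hadoop's own compact serialization format used for
// keys/values moving between map and reduce tasks.
public class TransactionWritable implements Writable {
    private final Text customerId = new Text();
    private long amountInCents;

    @Override
    public void write(DataOutput out) throws IOException {
        customerId.write(out);          // serialize each field in a fixed order
        out.writeLong(amountInCents);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        customerId.readFields(in);      // deserialize in the same order
        amountInCents = in.readLong();
    }

    public void set(String id, long cents) {
        customerId.set(id);
        amountInCents = cents;
    }
}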

3.1 Hadoop Architecture


3.2 Hadoop Storage : HDFS , Common Hadoop Shell Commands
Hadoop HDFS is a distributed file system that provides redundant storage space for files
having huge sizes. It is used for storing files that are in the range of terabytes to petabytes.
Hadoop HDFS Commands

With the help of the HDFS command, we can perform Hadoop HDFS file operations like
changing the file permissions, viewing the file contents, creating files or directories, copying
file/directory from the local file system to HDFS or vice-versa, etc.

Before starting with the HDFS command, we have to start the Hadoop services. To start the
Hadoop services do the following:

1. Move to the ~/hadoop-3.1.2 directory


2. Start Hadoop service by using the command sbin/start-dfs.sh

In this section, we cover ten common Hadoop HDFS commands
with their usage, examples, and descriptions. Let us now start with the HDFS commands.
1. version

Hadoop HDFS version Command Usage:
hadoop version

Hadoop HDFS version Command Example:
hadoop version

Hadoop HDFS version Command Description:
The hadoop version command prints the version of Hadoop that is installed.
2. mkdir

Hadoop HDFS mkdir Command Usage:
hadoop fs -mkdir /path/directory_name
Hadoop HDFS mkdir Command Example 1:
In this example, we are trying to create a directory named newDataFlair in HDFS using the
mkdir command:
hadoop fs -mkdir /newDataFlair

Using the ls command, we can check for the directories in HDFS.



Hadoop HDFS mkdir Command Description:


This command creates the directory in HDFS if it does not already exist.

Note: If the directory already exists in HDFS, then we will get an error message that the file
already exists. Use hadoop fs -mkdir -p /path/directory_name so that the command does not
fail even if the directory already exists.
3. ls

Hadoop HDFS ls Command Usage:


hadoop fs -ls /path
Hadoop HDFS ls Command Example 1:
Here in the below example, we are using the ls command to enlist the files and directories
present in HDFS.

Hadoop HDFS ls Command Description:


The Hadoop fs shell command ls displays a list of the contents of a directory specified in the
path provided by the user. It shows the name, permissions, owner, size, and modification date
for each file or directories in the specified directory.
Hadoop HDFS ls Command Example 2:
hadoop fs -ls -R /path
Hadoop HDFS ls -R Description:
This Hadoop fs command behaves like -ls, but recursively displays entries in all
subdirectories of a path.
4. put

Hadoop HDFS put Command Usage:


hadoop fs -put <localsrc> <dest>
Hadoop HDFS put Command Example:
Here in this example, we are trying to copy localfile1 of the local file system to the Hadoop
filesystem.

Hadoop HDFS put Command Description:


The Hadoop fs shell command put is similar to the copyFromLocal, which copies files or
directory from the local filesystem to the destination in the Hadoop filesystem.
5. copyFromLocal

Hadoop HDFS copyFromLocal Command Usage:


hadoop fs -copyFromLocal <localsrc> <hdfs destination>
Hadoop HDFS copyFromLocal Command Example:
Here in the below example, we are trying to copy the ‘test1’ file present in the local file
system to the newDataFlair directory of Hadoop.

Hadoop HDFS copyFromLocal Command Description:


This command copies the file from the local file system to HDFS.

6. get

Hadoop HDFS get Command Usage:


hadoop fs -get <src> <localdest>
Hadoop HDFS get Command Example:
In this example, we are trying to copy the ‘testfile’ of the hadoop filesystem to the local file
system.

Hadoop HDFS get Command Description: The Hadoop fs shell command get
copies the file or directory from the Hadoop file system to the local file system.



7. copyToLocal

Hadoop HDFS copyToLocal Command Usage:


hadoop fs -copyToLocal <hdfs source> <localdst>
Hadoop HDFS copyToLocal Command Example:
Here in this example, we are trying to copy the ‘sample’ file present in the newDataFlair
directory of HDFS to the local file system.

We can cross-check whether the file is copied or not using the ls command.

Hadoop HDFS copyToLocal Description:


copyToLocal command copies the file from HDFS to the local file system.
8. cat

Hadoop HDFS cat Command Usage:


hadoop fs -cat /path_to_file_in_hdfs
Hadoop HDFS cat Command Example:
Here in this example, we are using the cat command to display the content of the ‘sample’ file
present in newDataFlair directory of HDFS.
Hadoop HDFS cat Command Description:
The cat command reads the file in HDFS and displays the content of the file on console or
stdout.
9. mv

Hadoop HDFS mv Command Usage:
hadoop fs -mv <src> <dest>
Hadoop HDFS mv Command Example:
In this example, we have a directory ‘DR1’ in HDFS. We are using mv command to move the
DR1 directory to the DataFlair directory in HDFS.

Hadoop HDFS mv Command Description:


The HDFS mv command moves the files or directories from the source to a destination within
HDFS.
10. cp

Hadoop HDFS cp Command Usage:
hadoop fs -cp <src> <dest>

Hadoop HDFS cp Command Description:
The HDFS cp command copies files or directories from a source to a destination within HDFS.
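The same operations can also be performed programmatically through the HDFS Java API (FileSystem). The sketch below mirrors a few of the shell commands above; the paths are placeholders and assume the client is configured to reach the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Programmatic equivalents of some of the shell commands above.
public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/newDataFlair"));                           // hadoop fs -mkdir
        fs.copyFromLocalFile(new Path("/tmp/test1"),
                             new Path("/newDataFlair/test1"));          // hadoop fs -put / -copyFromLocal
        for (FileStatus status : fs.listStatus(new Path("/"))) {        // hadoop fs -ls /
            System.out.println(status.getPath());
        }
        fs.copyToLocalFile(new Path("/newDataFlair/test1"),
                           new Path("/tmp/test1-copy"));                // hadoop fs -get / -copyToLocal
        fs.close();
    }
}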

3.3 Anatomy of File write and Read

Anatomy of File Read in HDFS


Let's get an idea of how data flows between the client interacting with HDFS, the name node,
and the data nodes. Consider the following steps:
Step 1: The client opens the file it wishes to read by calling open() on the File System
Object(which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System( DFS) calls the name node, using remote procedure calls
(RPCs), to determine the locations of the first few blocks in the file. For each block, the name
node returns the addresses of the data nodes that have a copy of that block. The DFS returns
an FSDataInputStream to the client for it to read data from. FSDataInputStream in turn wraps
a DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data
node addresses for the first few blocks of the file, then connects to the first
(closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly
on the stream.
Step 5: When the end of the block is reached, DFSInputStream will close the connection to
the data node and then find the best data node for the next block. This happens transparently
to the client, which from its point of view is simply reading a continuous stream. Blocks are
read in order, with the DFSInputStream opening new connections to data nodes as the client
reads through the stream. It will also call the name node to retrieve the data node locations for
the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the
FSDataInputStream.
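From the client's side, these steps are hidden behind a few calls on the FileSystem API. A minimal read sketch (the file path is a placeholder) looks like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Client-side view of the read path: open() returns an FSDataInputStream,
// and read()/close() drive the steps described above.
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);               // DistributedFileSystem instance

        try (FSDataInputStream in = fs.open(new Path("/newDataFlair/sample"));   // Steps 1-2
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {    // Steps 3-5: repeated read()
                System.out.println(line);
            }
        }                                                    // Step 6: close()
        fs.close();
    }
}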

Anatomy of File Write in HDFS

Next, let's look at how files are written to HDFS.
Note: HDFS follows the Write Once Read Many model. In HDFS we cannot edit files
which are already stored in HDFS, but we can append data by reopening the files.
Step 1: The client creates the file by calling create() on DistributedFileSystem(DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s
namespace, with no blocks associated with it. The name node performs various checks to
make sure the file doesn’t already exist and that the client has the right permissions to create
the file. If these checks pass, the name node prepares a record of the new file; otherwise, the
file can’t be created and therefore the client is thrown an error i.e. IOException. The DFS
returns an FSDataOutputStream for the client to start out writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it
writes to an internal queue called the data queue. The data queue is consumed by the
DataStreamer, which is responsible for asking the name node to allocate new blocks by picking
a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline,
and here we'll assume the replication level is three, so there are three nodes in the pipeline.
The DataStreamer streams the packets to the first data node in the pipeline, which
stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last)
data node in the pipeline.
Step 5: The DFSOutputStream sustains an internal queue of packets that are waiting to be
acknowledged by data nodes, called an “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This
action flushes all the remaining packets to the data node pipeline and waits for
acknowledgments before contacting the name node to signal that the file is complete.
HDFS follows the Write Once Read Many model. So, we can't edit files that are already stored
in HDFS, but we can append to them by reopening the file. This design allows HDFS to
scale to a large number of concurrent clients because the data traffic is spread across all the
data nodes in the cluster. Thus, it increases the availability, scalability, and throughput of the
system.
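The corresponding client-side view of the write path is equally small: create(), write(), and close() drive the packet pipeline described above. A minimal sketch with a placeholder path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Client-side view of the write path.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (FSDataOutputStream out = fs.create(new Path("/newDataFlair/output.txt"))) { // Steps 1-2
            out.writeUTF("hello hdfs");   // Step 3: data is split into packets internally
        }                                  // Steps 5-6: flush, wait for acks, signal completion
        fs.close();
    }
}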

Hadoop – Daemons and Their Features


Daemons are processes. Hadoop daemons are the set of processes that run on Hadoop.
Hadoop is a framework written in Java, so all these processes are Java processes.
Apache Hadoop 2 consists of the following Daemons:
• NameNode
• DataNode
• Secondary Name Node
• Resource Manager
• Node Manager
Namenode, Secondary NameNode, and Resource Manager work on a Master System while
the Node Manager and DataNode work on the Slave machine.

1. NameNode

NameNode works on the Master System. The primary purpose of Namenode is to manage all
the MetaData. Metadata is the list of files stored in HDFS(Hadoop Distributed File System).
As we know the data is stored in the form of blocks in a Hadoop cluster. So the DataNode on
which or the location at which that block of the file is stored is mentioned in MetaData. All
information regarding the logs of the transactions happening in a Hadoop cluster (when or
who read/wrote the data) will be stored in MetaData. MetaData is stored in the memory.
Features:
• It never stores the data that is present in the file.
• As Namenode works on the Master System, the Master system should have good
processing power and more RAM than Slaves.
• It stores information about the DataNodes, such as their block IDs and the number of
blocks.
How to start Name Node?
hadoop-daemon.sh start namenode
How to stop Name Node?
hadoop-daemon.sh stop namenode

2. DataNode

DataNode works on the Slave system. The NameNode always instructs DataNode for storing
the Data. DataNode is a program that runs on the slave system that serves the read/write
request from the client. As the data is stored on these DataNodes, they should possess large
storage capacity to hold more data.
How to start Data Node?
hadoop-daemon.sh start datanode
How to stop Data Node?
hadoop-daemon.sh stop datanode
3. Secondary NameNode

Secondary NameNode is used for taking the hourly backup of the data. In case the Hadoop
cluster fails or crashes, the Secondary NameNode will take the hourly backup or checkpoints
of that data and store this data in a file named fsimage. This file then gets transferred to a
new system. The MetaData is assigned to that new system, a new master is created with this
MetaData, and the cluster is made to run again correctly. This is the benefit of the Secondary
NameNode. Now in Hadoop 2, we have the High Availability and Federation features that
minimize the importance of this Secondary NameNode in Hadoop 2.
Major Function Of Secondary NameNode:
• It merges the edit logs and fsimage from the NameNode.
• It continuously reads the MetaData from the RAM of the NameNode and writes it to
the hard disk.
As secondary NameNode keeps track of checkpoints in a Hadoop Distributed File System, it
is also known as the checkpoint Node.

The Hadoop Daemons' Ports

Name Node: 50070
Data Node: 50075
Secondary Name Node: 50090

These ports can be configured manually in hdfs-site.xml and mapred-site.xml files.


4. Resource Manager

The Resource Manager is also known as the global master daemon; it works on the master
system. The Resource Manager manages the resources for the applications that are running
in a Hadoop cluster. The Resource Manager mainly consists of two components:
1. ApplicationsManager
2. Scheduler
The ApplicationsManager is responsible for accepting job requests from a client and also
negotiates a memory resource (container) on the slaves in a Hadoop cluster to host the
ApplicationMaster. The Scheduler is utilized for allocating resources to the applications
running in a Hadoop cluster.
How to start ResourceManager?
yarn-daemon.sh start resourcemanager
How to stop ResourceManager?
yarn-daemon.sh stop resourcemanager

5. Node Manager

The Node Manager works on the slave systems and manages the resources (memory, CPU,
and disk) on each node. Each slave node in a Hadoop cluster has a single NodeManager
daemon running on it. It also sends this monitoring information to the Resource Manager.
How to start Node Manager?
yarn-daemon.sh start nodemanager
How to stop Node Manager?
yarn-daemon.sh stop nodemanager

In a Hadoop cluster, the Resource Manager and Node Manager can be tracked with specific
URLs of the form http://<hostname>:<port_number>

The Hadoop Daemons' Ports

ResourceManager: 8088
NodeManager: 8042



3.4 NameNode , Secondary NameNode

3.5 DataNode , HadoopReduce Paradigm , map and Reduce tasks

The MapReduce paradigm

The MapReduce paradigm was created in 2003 to enable processing of large data sets in a
massively parallel manner. The goal of the MapReduce model is to simplify the approach to
transformation and analysis of large datasets, as well as to allow developers to focus on
algorithms instead of data management. The model allows for simple implementation of
data-parallel algorithms. There are a number of implementations of this model, including
Google’s approach, programmed in C++, and Apache’s Hadoop implementation,
programmed in Java. Both run on large clusters of commodity hardware in a shared-nothing,
peer-to-peer environment.

The MapReduce model consists of two phases: the map phase and the reduce phase,
expressed by the map function and the reduce function, respectively. The functions are
specified by the programmer and are designed to operate on key/value pairs as input and
output. The keys and values can be simple data types, such as an integer, or more complex,
such as a commercial transaction.
Map
The map function, also referred to as the map task, processes a single key/value input
pair and produces a set of intermediate key/value pairs.
Reduce
The reduce function, also referred to as the reduce task, consists of taking all
key/value pairs produced in the map phase that share the same intermediate key and
producing zero, one, or more data items.

Note that the map and reduce functions do not address the parallelization and execution of
the MapReduce jobs. This is the responsibility of the MapReduce model, which automatically
takes care of distribution of input data, as well as scheduling and managing map and reduce
tasks.
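The contract of the two functions can be sketched as type signatures: map turns one (K1, V1) pair into a list of intermediate (K2, V2) pairs, and reduce turns a (K2, list of V2) group into zero or more output items. The generic interface below is only an abstraction for illustration, not Hadoop's actual API.

import java.util.List;
import java.util.Map.Entry;

// Abstract shape of the two user-supplied functions (types K1..V3 are placeholders):
// map:    (K1, V1)         -> list of (K2, V2) intermediate pairs
// reduce: (K2, list of V2) -> list of (K3, V3) output items
public interface MapReduceFunctions<K1, V1, K2, V2, K3, V3> {
    List<Entry<K2, V2>> map(K1 key, V1 value);
    List<Entry<K3, V3>> reduce(K2 key, List<V2> values);
}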

Introduction to Hadoop Scheduler

Prior to Hadoop 2, Hadoop MapReduce was a software framework for writing applications
that process huge amounts of data (terabytes to petabytes) in parallel on a large Hadoop
cluster, and it was also responsible for scheduling tasks, monitoring them, and re-executing
failed tasks.
In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced. The basic
idea behind the YARN introduction is to split the functionalities of resource management and
job scheduling/monitoring into separate daemons: the ResourceManager, the
ApplicationMaster, and the NodeManager.
The ResourceManager is the master daemon that arbitrates resources among all the
applications in the system. The NodeManager is the slave daemon responsible for containers,
monitoring their resource usage, and reporting the same to the ResourceManager or Scheduler.
The ApplicationMaster negotiates resources from the ResourceManager and works with the
NodeManager in order to execute and monitor the tasks.

The ResourceManager has two main components: the Scheduler and the
ApplicationsManager.

Schedulers in YARN

The Scheduler in the YARN ResourceManager is a pure scheduler, responsible for
allocating resources to the various running applications.
It is not responsible for monitoring or tracking the status of an application. Also, the
scheduler does not guarantee the restarting of tasks that fail due to either hardware
failure or application failure.

The scheduler performs scheduling based on the resource requirements of the applications.

It has some pluggable policies that are responsible for partitioning the cluster resources
among the various queues, applications, etc.

The FIFO Scheduler, CapacityScheduler, and FairScheduler are such pluggable policies that
are responsible for allocating resources to the applications.

Let us now study each of these Schedulers in detail.

TYPES OF HADOOP SCHEDULER

1. FIFO Scheduler
First In First Out is the default scheduling policy used in Hadoop. FIFO Scheduler gives
more preferences to the application coming first than those coming later. It places the
applications in a queue and executes them in the order of their submission (first in, first out).
Here, irrespective of size and priority, the requests of the first application in the queue are
allocated first. Only once the first application's request is satisfied is the next application in
the queue served.

Advantages:
• It is simple to understand and doesn't need any configuration.
• Jobs are executed in the order of their submission.
Disadvantage:
• It is not suitable for shared clusters. If the large application comes before the
shorter one, then the large application will use all the resources in the cluster, and
the shorter application has to wait for its turn. This leads to starvation.
• It does not take into account the balance of resource allocation between the long
applications and short applications.

2. Capacity Scheduler
The CapacityScheduler allows multiple-tenants to securely share a large Hadoop cluster. It is
designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing the
throughput and the utilization of the cluster.
It supports hierarchical queues to reflect the structure of organizations or groups that utilizes
the cluster resources. A queue hierarchy contains three types of queues that are root, parent,
and leaf.

The root queue represents the cluster itself, parent queue represents organization/group or
sub-organization/sub-group, and the leaf accepts application submission.

The Capacity Scheduler allows the sharing of the large cluster while giving capacity
guarantees to each organization by allocating a fraction of cluster resources to each queue.

Also, when free resources are available on a queue that has completed its tasks, these
resources can be assigned to applications on queues that are running below capacity. This
provides elasticity for the organization in a cost-effective manner.

Apart from it, the CapacityScheduler provides a comprehensive set of limits to ensure that a
single application/user/queue cannot use a disproportionate amount of resources in the
cluster.

To ensure fairness and stability, it also provides limits on initialized and pending apps from a
single user and queue.

Advantages:
• It maximizes the utilization of resources and throughput in the Hadoop cluster.
• Provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organization utilizing
cluster.
Disadvantage:
• It is the most complex amongst the schedulers.
3. Fair Scheduler

FairScheduler allows YARN applications to fairly share resources in large Hadoop clusters.
With FairScheduler, there is no need for reserving a set amount of capacity because it will
dynamically balance resources between all running applications.

It assigns resources to applications in such a way that all applications get, on average, an
equal amount of resources over time.

The FairScheduler, by default, takes scheduling fairness decisions only on the basis of
memory. We can configure it to schedule with both memory and CPU.

When a single application is running, that app uses the entire cluster's resources. When
other applications are submitted, freed-up resources are assigned to the new apps so that
every app eventually gets roughly the same amount of resources. FairScheduler enables short
apps to finish in a reasonable time without starving the long-lived apps.

Similar to the CapacityScheduler, the FairScheduler supports hierarchical queues to reflect
the structure of a shared cluster.

Apart from fair scheduling, the FairScheduler allows for assigning minimum shares to queues
for ensuring that certain users, production, or group applications always get sufficient
resources. When an app is present in the queue, then the app gets its minimum share, but
when the queue doesn’t need its full guaranteed share, then the excess share is split between
other running applications.

Advantages:
• It provides a reasonable way to share the Hadoop Cluster between the number of
users.
• Also, the FairScheduler can work with app priorities where the priorities are used as
weights in determining the fraction of the total resources that each application should
get.
Disadvantage:
• It requires configuration.
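The scheduler in use is selected through the yarn.resourcemanager.scheduler.class property, which is normally set in yarn-site.xml on the ResourceManager. The Java snippet below only illustrates the key/value pair; setting it on a client-side Configuration object does not by itself change the cluster's scheduler.

import org.apache.hadoop.conf.Configuration;

// Illustrates the property that selects the YARN scheduler implementation.
public class SchedulerConfigExample {
    public static void main(String[] args) {
        // In practice this key/value pair lives in yarn-site.xml on the ResourceManager;
        // it is set here on a Configuration object purely for illustration.
        Configuration conf = new Configuration();
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
    }
}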

Hadoop 2.0 – An Overview


Hadoop 2.0 boasts improved scalability and availability of the system via a set of bundled
features that represent a generational shift in the Hadoop architecture with the introduction
of YARN.

Hadoop 2.0 also introduces the solution to the much awaited High Availability problem.
• Hadoop 2.0 introduced YARN, which has the ability to process terabytes and petabytes of
data present in HDFS with the use of various non-MapReduce applications such as
Giraph and MPI.
• Hadoop 2.0 divides the responsibilities of the overloaded JobTracker into two separate
components, i.e. the Application Master (per application) and the global
Resource Manager.
• Hadoop 2.0 improves horizontal scalability of the NameNode through HDFS
Federation and eliminates the Single Point of Failure Problem with the NameNode
High Availability​

Hadoop NameNode High Availability problem:

Hadoop 1.0's NameNode has a single point of failure (SPOF) problem, which means that if the
NameNode fails, the Hadoop cluster becomes unavailable. Nevertheless, this is
anticipated to be a rare occurrence, as deployments make use of business-critical hardware
with RAS features (Reliability, Availability and Serviceability) for the NameNode servers.
In case a NameNode failure occurs, it requires manual intervention by the Hadoop
administrators to recover the NameNode with the help of a secondary NameNode.

NameNode SPOF problem limits the overall availability of the Hadoop Cluster in the
following ways:

• If there are any planned maintenance activities, such as hardware or software upgrades on
the NameNode, then it will result in overall downtime of the Hadoop cluster.
• If any unplanned event triggers, which results in the machine crashing, then the
Hadoop cluster would not be available unless the Hadoop Administrator restarts the
NameNode.

What is high availability in Hadoop?


Hadoop 2.0 overcomes this SPOF shortcoming by providing support for multiple
NameNodes. It introduces Hadoop 2.0 High Availability feature that brings in an extra
NameNode (Passive Standby NameNode) to the Hadoop Architecture which is configured for
automatic failover.

The main motive of the Hadoop 2.0 High Availability project is to render availability to big
data applications 24/7 by deploying two Hadoop NameNodes - one in active configuration and
the other as the Standby Node in passive configuration.

Earlier there was one Hadoop NameNode for maintaining the tree hierarchy of the HDFS
files and tracking the data storage in the cluster. Hadoop 2.0 High Availability allows users to
configure Hadoop clusters with redundant NameNodes, so as to eliminate the probability of
SPOF in a given Hadoop cluster. The Hadoop configuration capability also allows users to
build clusters horizontally with several NameNodes, which can operate autonomously over a
common data storage pool, thereby offering better computing scalability when compared to
Hadoop 1.0.

With Hadoop 2.0, Hadoop architecture is now configured in a manner that it supports
automated failover with complete stack resiliency and a hot Standby NameNode.


Both the active and passive (Standby) NameNodes have up-to-date metadata, which ensures
seamless failover for large Hadoop clusters, indicating that there would not be any downtime
for the Hadoop cluster and that it will be available all the time.

Hadoop 2.0 is designed to identify any failures in the NameNode host and processes, so that it
can automatically switch to the passive NameNode, i.e. the Standby Node, to ensure high
availability of the HDFS services to big data applications. With the advent of Hadoop 2.0
HA, this failover process does not require manual intervention by Hadoop administrators.

With HDP 2.0 High Availability, the complete Hadoop Stack i.e. HBase, Pig, Hive,
MapReduce, Oozie are equipped to tackle the NameNode failure problem- without having to
lose the job progress or any related data. Thus, any critical long running jobs that are
scheduled to be completed at a specific time will not be affected by the NameNode failure.


Hadoop Users Expectations from Hadoop 2.0 High Availability


When Hadoop users were interviewed about the High Availability Requirements from
Hadoop 2.0 Architecture, some of the most common High Availability requirements that they
came up with are:
• No Data Loss on Failure/No Job Failure/No Downtime
Hadoop users stated that with Hadoop 2.0 High Availability should ensure that there should
not be any impact on the applications due to any individual software or hardware failure.

• Tolerate Multiple Failures -
Hadoop users stated that with Hadoop 2.0 High Availability the Hadoop cluster must be able
to tolerate more than one failure simultaneously. Preferably, the Hadoop configuration must
allow the administrator to configure the degree of tolerance, or let the user make a choice at
the resource level on how many failures can be tolerated by the cluster.

• Self Recovery from a Failure


Hadoop users stated that with Hadoop 2.0 High Availability, the Hadoop Cluster must heal
automatically (self healing) without any manual intervention to restore it back to a highly
available state after the failure, with the pre-assumption that sufficient physical resources are
already available.

• Ease of Installation
According to Hadoop users, setting up High Availability should be a simple activity that does
not require the Hadoop Administrator to install any other open source or commercial third
party software.

• No Demand for Additional Hardware
Hadoop users say that the Hadoop 2.0 High Availability feature should not demand that users
deploy, maintain or purchase additional hardware. 100% commodity hardware must be used to
achieve high availability, i.e. there should not be any further dependencies on non-commodity
hardware such as load balancers.

HDFS federation


HDFS federation provides MapReduce with the ability to start multiple HDFS namespaces in
the cluster, monitor their health, and fail over in case of daemon or host failure. Namespaces,
which run on separate hosts, are independent and do not require coordination with each other.
The DataNodes are used as common storage by all the namespaces, and register with all the
namespaces in the cluster.

A namespace consists of two daemons: NameNode daemon and Secondary NameNode


daemon. To integrate HDFS with IBM® Spectrum Symphony, the NameNode and Secondary
NameNode daemons are packaged as two EGO services. For HDFS federation, there are
multiple NameNode and Secondary NameNode daemons with each daemon treated as an
instance of the NameNode service and Secondary NameNode service, respectively.
MRv2
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a Resource Manager for each
cluster, and each data node runs a Node Manager. For each job, one slave node will act as the
Application Master, monitoring resources/tasks, etc. The MapReduce framework in the
Hadoop 1.x version is also known as MRv1. The MRv1 framework includes client
communication, job execution and management, resource scheduling and resource
management. The Hadoop daemons associated with MRv1 are the JobTracker and the
TaskTracker.
YARN

YARN stands for “Yet Another Resource Negotiator“. It was introduced in Hadoop 2.0 to
remove the bottleneck on Job Tracker which was present in Hadoop 1.0. YARN was
described as a “Redesigned Resource Manager” at the time of its launching, but it has now
evolved to be known as a large-scale distributed operating system used for Big Data
processing. YARN also allows different data processing engines like graph processing,
interactive processing, stream processing as well as batch processing to run and process data
stored in HDFS (Hadoop Distributed File System) thus making the system much more
efficient. Through its various components, it can dynamically allocate various resources and
schedule the application processing. For large volume data processing, it is quite necessary to
manage the available resources properly so that every application can leverage them.
Running MRv1 in YARN.

YARN uses the ResourceManager web interface for monitoring applications running on a
YARN cluster. The ResourceManager UI shows the basic cluster metrics, list of applications,
and nodes associated with the cluster. In this section, we'll discuss the monitoring of MRv1
applications over YARN.
The Resource Manager is the core component of YARN – Yet Another Resource Negotiator.
By analogy, it occupies the place of the JobTracker of MRv1. Hadoop YARN is designed to
provide a generic and flexible framework to administer the computing resources in the
Hadoop cluster.
In this direction, the YARN ResourceManager service (RM) is the central controlling
authority for resource management and makes allocation decisions. The ResourceManager has
two main components: the Scheduler and the ApplicationsManager.
