Big Data - Unit 2 Hadoop Framework
Since traditional frameworks are ineffective at handling big data, new techniques had to be
developed to deal with it. This is where the Hadoop framework comes in. The Hadoop framework is
primarily based on Java and is used to deal with big data.
What is Hadoop?
Hadoop is a data handling framework written primarily in Java, with some secondary code in shell script
and C. It uses a simple programming model and is able to deal with large datasets. It was developed
by Doug Cutting and Mike Cafarella. The framework uses distributed storage and parallel processing to
store and manage big data, and it is one of the most widely used pieces of big data software.
Hadoop consists mainly of three components: Hadoop HDFS, Hadoop MapReduce, and Hadoop YARN.
These components come together to handle big data effectively. These components are also known
as Hadoop modules.
Hadoop is slowly becoming a mandatory skill required from a data scientist. Companies looking to invest
in Big Data technology are increasingly giving more importance to Hadoop, making it a valuable skill
upgrade for professionals. Hadoop 3.x is the latest version of Hadoop.
How Does Hadoop Work?
Hadoop's concept is rather straightforward. The volume, variety, and velocity of big data pose problems.
Building servers with ever-heavier setups to handle such a vast and ever-growing data pool would not be
viable. A simpler alternative is to connect numerous commodity computers, each with a single CPU, into a cluster.
This turns them into a distributed system that works as one system, which means the clustered
computers can work together in parallel to achieve the same objective. This speeds up the handling of
large amounts of data and reduces its cost.
This can be better understood with the help of an example. Imagine a carpenter who primarily makes
chairs and stores them at his warehouse before they are sold. At some point, the market demands other
products like tables and cupboards. So now the same carpenter is working on all three products.
However, this is depleting his energy, and he is not able to keep up with producing all three.
He decides to enlist the help of two other apprentices, who each work on one product. Now they are
able to produce at a good rate, but a problem regarding storage arises. The carpenter cannot keep buying a
bigger warehouse every time demand or the product range grows. Instead, he takes three smaller
storage units for the three different products.
The carpenter in this analogy can be compared to the server that manages data. The increase in demand,
expressed in the variety, velocity, and volume of the products, creates big data that is too much for the
server to handle alone.
With two apprentices reporting to him, they all work towards the same objective, mirroring the concept
of many computers, each with a single CPU, working together. Products are assigned to dedicated
storage units based on their variety to prevent a bottleneck. This is essentially how Hadoop functions.
Features:
It is a file system that acts as an operating system for the data stored on HDFS
It helps schedule tasks to avoid overloading any system
The Hadoop framework has become the most widely used tool for handling big data because of the
various benefits it offers.
Data Locality:
The concept is rather simple. The pool of data is very large, and it would be very slow and tiresome to
move the data to the computation logic. With data locality, the computation logic is instead
moved toward the data. This makes processing much faster.
Highly Scalable:
This refers to the flexibility one has in scaling the machines, or nodes, used for data processing
up or down. This is possible because multiple machines work in parallel within the same cluster. Scaling is
done according to changes in the volume of data or the requirements of the organization.
Flexibility:
Since the Hadoop framework is written in Java and C, it can easily run on any system. Further, it can be
curated to suit the specific needs of the type of data. It can handle both structured and unstructured
data efficiently, and it can handle very different kinds of data sets, ranging from social media analysis to
data warehousing.
Open Source:
Hadoop is free to use. Since it is an open-source project, the source code is available online for
anyone to modify. This allows the Hadoop software to be curated according to very
specific needs.
Easy to Use:
Hadoop is easy to use since developers need not worry about any of the distributed processing work; it is
managed by Hadoop itself. The Hadoop ecosystem is also very large and comes with many tools like
Hive, Pig, Spark, HBase, Mahout, etc.
Cost-Effective:
Not only is it highly efficient and customizable, but it also reduces the cost of processing such data
significantly. Traditional data processing would require investments in very large server systems for a
less efficient model. This framework instead relies on clusters of cheaper commodity machines to deliver
a very efficient system, which makes it highly preferred by organizations.
Data Replication
HDFS is designed with data replication in mind, to offer fault tolerance and high availability. Data is
divided into blocks, and each block is replicated across multiple nodes in the cluster. This strategy
ensures that even if a node fails, the data is not lost, as it can be accessed from the other nodes where
the blocks are replicated.
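The replication factor can be set cluster-wide, but it can also be inspected or changed per file from client code. Below is a minimal Java sketch using the Hadoop FileSystem API; the file path and the replication value of 3 are illustrative assumptions, not part of the original text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // handle to the configured file system (HDFS)
        Path file = new Path("/data/sample.txt");      // hypothetical file

        // Read the replication factor currently recorded by the NameNode for this file
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication: " + current);

        // Ask HDFS to keep three copies of every block of this file
        fs.setReplication(file, (short) 3);
    }
}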
Data Locality
Hadoop architecture is designed considering data locality to improve the efficiency of data processing.
Data locality refers to the ability to move the computation close to where the data resides in the
network, rather than moving large amounts of data to where the application is running. This approach
minimizes network congestion and increases the overall throughput of the system.
Storage Formats
HDFS provides a choice of storage formats. These formats can significantly impact the system’s
performance in terms of processing speed and storage space requirements. Hadoop supports file
formats like Text files, Sequence Files, Avro files, and Parquet files. The best file format for your specific
use case will depend on the data characteristics and the specific requirements of the application.
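As an illustration of one of these formats, the short Java sketch below writes a few key-value records into a SequenceFile; the output path and the record contents are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/data/events.seq");      // hypothetical output path in HDFS

        // Open a writer for a SequenceFile with IntWritable keys and Text values
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 0; i < 3; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));   // append one key-value pair
            }
        }
    }
}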
If you look at the Hadoop 1.0 daemons, you will find the following important ones:
Namenode
Datanode
JobTracker
Tasktracker
But in Hadoop 2, JobTracker and TaskTracker no longer exist. In Hadoop 1, both application management
and resource management were done by MapReduce, but in Hadoop 2, resource management was handed
over to a new component called YARN (Yet Another Resource Negotiator).
So, with Hadoop 2, MapReduce manages the applications and YARN manages the resources.
YARN has introduced two new daemons with Hadoop 2 and those are-
Resource Manager
Node Manager
These two new Hadoop 2 daemons replace the JobTracker and TaskTracker of Hadoop 1.
Hadoop 1 vs Hadoop 2
Hadoop 1 laid the groundwork for big data processing, but Hadoop 2 delivered substantial upgrades and
innovations.
Introducing YARN (Yet Another Resource Negotiator) in Hadoop 2 improved scalability and resource
management.
Hadoop 2 enabled many data processing frameworks, such as MapReduce, Apache Spark, and others, to run
concurrently.
Hadoop 2 addressed Hadoop 1's constraints regarding scalability, dependability, and task management.
Hadoop 2 improved its data processing capabilities, making it better suited for real-time and interactive
applications.
The main Hadoop 2 daemons are:
NameNode
DataNode
Resource Manager
Node Manager
The NameNode, Secondary NameNode, and Resource Manager run on the Master system, while the Node
Manager and DataNode run on the Slave machines.
1. NameNode
NameNode runs on the Master system. The primary purpose of the NameNode is to manage all the
metadata. Metadata is the information about the files stored in HDFS (Hadoop Distributed File System). As
we know, the data is stored in the form of blocks in a Hadoop cluster, so the metadata records the
DataNode, i.e., the location, at which each block of a file is stored. All information regarding the logs of
the transactions happening in a Hadoop cluster (when and who read or wrote the data) is also stored in
the metadata. The metadata is kept in memory.
Features:
Since the NameNode runs on the Master system, the Master should have good processing power and
more RAM than the Slaves.
It stores information about the DataNodes, such as their block IDs and the number of blocks.
2. DataNode
DataNode runs on the Slave system. The NameNode always instructs the DataNode about storing the data.
A DataNode is a program running on the slave system that serves read/write requests from the client.
As the data is stored on these DataNodes, they should have large storage capacity to hold more data.
3. Secondary NameNode
The Secondary NameNode is used for taking hourly backups of the metadata. In case the Hadoop cluster
fails or crashes, the Secondary NameNode takes hourly backups, or checkpoints, of that metadata and
stores it in a file named fsimage. This file can then be transferred to a new system, the metadata is
assigned to that new system, a new Master is created with this metadata, and the cluster is made to
run again correctly.
This is the benefit of the Secondary NameNode. In Hadoop 2, the High-Availability and Federation
features minimize the importance of the Secondary NameNode.
It continuously reads the metadata from the RAM of the NameNode and writes it to the hard disk.
As the Secondary NameNode keeps track of checkpoints in the Hadoop Distributed File System, it is also
known as the checkpoint node.
4. Resource Manager
The Resource Manager, also known as the Global Master Daemon, runs on the Master system. The
Resource Manager manages the resources for the applications running in a Hadoop cluster. It mainly
consists of two components:
1. ApplicationsManager
2. Scheduler
The ApplicationsManager is responsible for accepting a client's request and for arranging memory
resources on the Slaves of the Hadoop cluster to host the ApplicationMaster. The Scheduler is responsible
for allocating resources to the applications running in a Hadoop cluster and for monitoring them.
5. Node Manager
The Node Manager runs on the Slave system and manages the memory and disk resources within that
node. Each Slave node in a Hadoop cluster has a single NodeManager daemon running on it.
It also sends this monitoring information to the Resource Manager.
In a Hadoop cluster, the Resource Manager and Node Manager can be tracked through URLs of the form
http://<hostname>:port_number, using the following default ports:
ResourceManager    8088
NodeManager        8042
Hadoop InputFormat & Types of InputFormat in MapReduce
What is Hadoop InputFormat?
Hadoop InputFormat describes the input-specification for execution of the Map-Reduce job.
InputFormat describes how to split up and read input files. In MapReduce job execution, InputFormat is
the first step. It is also responsible for creating the input splits and dividing them into records.
Input files store the data for a MapReduce job and reside in HDFS. Although the format of these files is
arbitrary, line-based log files and binary formats can be used. Hence, in MapReduce, the InputFormat
class is one of the fundamental classes and provides the functionality below:
It defines the data splits, which determine both the size of individual Map tasks and their potential
execution server.
It defines the RecordReader, which is responsible for reading actual records from the input files.
The two methods used to obtain the data for the mapper are getSplits() and createRecordReader(), declared in the InputFormat class as follows:

public abstract class InputFormat<K, V> {

    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}
There are different types of MapReduce InputFormat in Hadoop, each used for a different purpose.
Let’s discuss the Hadoop InputFormat types below:
FileInputFormat is the base class for all file-based InputFormats. It also specifies the input directory that
contains the data files. When we start a MapReduce job, FileInputFormat provides a path containing the
files to read.
This InputFormat reads all those files and then divides them into one or more InputSplits.
TextInputFormat is the default InputFormat. It treats each line of each input file as a separate record and
performs no parsing. TextInputFormat is useful for unformatted data or line-based records like log files.
Hence,
Key – It is the byte offset of the beginning of the line within the file (not the whole file, just one split). So
it will be unique if combined with the file name.
Value – It is the contents of the line, excluding line terminators.
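A brief Java sketch follows, showing a mapper written against TextInputFormat's key and value types; the pass-through mapper body and its output types are illustrative assumptions (TextInputFormat is the default, so setting it explicitly is optional).

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PassThroughMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    // key = byte offset of the line within the file, value = the line itself
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        context.write(key, value);    // simply pass the offset and the line through
    }

    public static void configure(Job job) {
        job.setInputFormatClass(TextInputFormat.class);   // explicit, though it is the default
    }
}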
KeyValueTextInputFormat is similar to TextInputFormat. This InputFormat also treats each line of input as a
separate record. The difference is that TextInputFormat treats the entire line as the value, while
KeyValueTextInputFormat breaks the line itself into key and value at a tab character (‘\t’). Hence,
Key – Everything up to the first tab character.
Value – The remaining part of the line after the tab character.
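The sketch below configures a job to use KeyValueTextInputFormat and changes the separator from the default tab to a comma; the property name shown is the one used by Hadoop 2.x and should be treated as an assumption for other versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueJobSetup {
    public static Job build() throws Exception {
        Configuration conf = new Configuration();
        // split each line into key and value at the first comma instead of '\t'
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "kv-example");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        return job;
    }
}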
SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat. This format converts the sequence
file keys and values to Text objects; it performs the conversion by calling toString() on the keys and values.
Hence, SequenceFileAsTextInputFormat makes sequence files suitable input for streaming.
By using SequenceFileAsBinaryInputFormat, another variant, we can instead extract the sequence file’s keys
and values as opaque binary objects.
NLineInputFormat is another form of TextInputFormat where the keys are byte offsets of the lines and the
values are the contents of the lines. With TextInputFormat and KeyValueTextInputFormat, each mapper
receives a variable number of lines of input; the number depends on the size of the split and on the
length of the lines. If we want our mapper to receive a fixed number of lines of input, we use
NLineInputFormat.
Suppose N=2: then each split contains two lines, so one mapper receives the first two key-value pairs and
another mapper receives the next two key-value pairs.
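A minimal sketch of the N=2 case described above, using NLineInputFormat's helper to fix the number of lines per split:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineSetup {
    public static void configure(Job job) {
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 2);   // N = 2 lines of input per mapper
    }
}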
DBInputFormat reads data from a relational database using JDBC. It is best suited to loading relatively
small datasets, perhaps for joining them with large datasets from HDFS using MultipleInputs. Hence,
Key – LongWritable
Value – DBWritable
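A hedged sketch of wiring DBInputFormat to a table over JDBC is shown below; the JDBC driver class, connection URL, credentials, table, and column names are all hypothetical placeholders, and the record class is assumed to implement DBWritable.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DbJobSetup {
    public static void configure(Job job, Class<? extends DBWritable> recordClass) throws Exception {
        job.setInputFormatClass(DBInputFormat.class);

        // JDBC driver, URL, user, and password are illustrative only
        DBConfiguration.configureDB(job.getConfiguration(),
                "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/sales", "user", "password");

        // read the hypothetical "orders" table, ordered by order_id, fetching two columns
        DBInputFormat.setInput(job, recordClass, "orders", null, "order_id",
                "order_id", "amount");
    }
}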
Two different large datasets can also be joined in MapReduce programming. A join performed in the Map
phase is called a map-side join, while a join performed at the reduce side is called a reduce-side join.
Let's go into detail on why we would need to join data in MapReduce. Suppose Dataset A has master data
and Dataset B has transactional data (A and B are just for reference); we need to join them on a common
key to produce a result. It is important to realize that if the master dataset is small, we can share it with
side-data sharing techniques (passing key-value pairs in the job configuration, or the distributed cache),
as sketched below. We use a MapReduce join only when both datasets are too big for such data sharing
techniques.
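A hedged sketch of the side-data approach follows: the small master dataset is shipped through the distributed cache and loaded into a HashMap, so the join happens inside the mapper. The file name, the comma-separated record layout, and the driver call job.addCacheFile(...) implied here are all assumptions for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> master = new HashMap<>();

    @Override
    protected void setup(Context context) throws java.io.IOException {
        // "master.txt" is assumed to have been cached in the driver, e.g.
        // job.addCacheFile(new URI("/data/master.txt#master.txt"));
        try (BufferedReader reader = new BufferedReader(new FileReader("master.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);      // assumed layout: key,masterValue
                master.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        String[] parts = value.toString().split(",", 2);  // assumed layout: key,transactionValue
        String masterValue = master.get(parts[0]);
        if (masterValue != null) {                        // inner join on the common key
            context.write(new Text(parts[0]), new Text(masterValue + "," + parts[1]));
        }
    }
}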
Joining in MapReduce is not the recommended way; the same problem can be addressed through
higher-level frameworks like Hive or Cascading. Even so, if you are in that situation, you can use the
methods mentioned below.
A map-side join performs the join before the data reaches the map function, and it places strong
prerequisites on the data before joining at the map side. Both joining techniques come with their own
pros and cons. A map-side join can be more efficient than a reduce-side join, but its strict format
requirements are very hard to meet natively; and if we have to prepare that kind of data through other
MapReduce jobs, we lose the expected performance advantage over the reduce-side join.
One such prerequisite is that all the records for a particular key must reside in the same partition.
A reduce-side join is also called a repartitioned join or a repartitioned sort-merge join, and it is the most
commonly used join type. This type of join is performed at the reduce side, i.e., it has to go through the
sort and shuffle phase, which incurs network overhead. To keep it simple, we will list the steps that need
to be performed for a reduce-side join. A reduce-side join uses a few terms such as data source, tag, and
group key; let's get familiar with them.
Data source refers to the data source files, probably taken from an RDBMS.
A tag is used to mark every record with its source name, so that its source can be identified at any given
point of time, be it in the map or reduce phase; why this is required is covered later.
Group key refers to the column used as the join key between the two data sources.
Since we are going to join this data on the reduce side, we must prepare it in a way that it can be used
for joining in the reduce phase. Let's look at the steps that need to be performed.
Map Phase
A routine map function is expected to emit (key, value). For a reduce-side join, we instead design the map
so that it emits (key, source tag + value) for every record of each data source, as sketched below. This
output then goes through the sort and shuffle phase; since those operations are based on the key, all the
values from all sources for a particular key are clubbed together, and this data reaches the reducer.
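A minimal Java sketch of such a tagging mapper follows; deriving the tag from the input file name and the comma-separated record layout are assumptions for illustration.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        // derive the source tag ("A" for master files, "B" for transaction files) from the file name
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        String tag = fileName.startsWith("master") ? "A" : "B";

        String[] parts = value.toString().split(",", 2);  // assumed layout: groupKey,restOfRecord
        context.write(new Text(parts[0]), new Text(tag + "|" + parts[1]));
    }
}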
Reduce Phase
The reducer creates a cross product of every record of the map output for one key and hands it over to
the combine function.
Combine function
Whether this reduce step performs an inner join or an outer join is decided in the combine function, and
the desired output format is also decided at this place.
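A hedged sketch of the reduce and combine steps is given below: records for one key are separated by their tag and cross-multiplied, and emitting only matching pairs makes this an inner join (emitting unmatched records as well would turn it into an outer join). The tag prefixes match the hypothetical mapper above.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws java.io.IOException, InterruptedException {
        List<String> sourceA = new ArrayList<>();   // records tagged "A" (master)
        List<String> sourceB = new ArrayList<>();   // records tagged "B" (transactions)

        // split the tagged values back into their two sources
        for (Text value : values) {
            String record = value.toString();
            if (record.startsWith("A|")) {
                sourceA.add(record.substring(2));
            } else {
                sourceB.add(record.substring(2));
            }
        }

        // cross product of the two sources for this key (the "combine" step, inner join)
        for (String a : sourceA) {
            for (String b : sourceB) {
                context.write(key, new Text(a + "," + b));
            }
        }
    }
}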