
HADOOP [Unit 2]

Hadoop Distributed File System:

● In Hadoop, data resides in a distributed file system called the Hadoop Distributed
File System (HDFS).

● HDFS splits files into blocks and distributes them across the various nodes of large
clusters.

● The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS)
and provides a distributed file system that is designed to run on commodity hardware.

● Commodity hardware is cheap and widely available, which makes it useful for achieving
greater computational power at a low cost.

● It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

● It provides high-throughput access to application data and is suitable for applications
with large datasets.

● The Hadoop framework includes the following two modules:

▪ Hadoop Common: These are Java libraries and utilities required by other Hadoop
modules.

▪ Hadoop YARN: This is a framework for job scheduling and cluster resource
management (managing the resources of the cluster).

Hadoop Ecosystem And Components:


There are three main components of Hadoop:

Hadoop HDFS  -

● Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.

● HDFS consists of two core components, i.e.:

▪ Name Node: The Name Node is the prime (master) node. It stores only metadata and
therefore requires comparatively fewer resources than the Data Nodes, which store the
actual data.
▪ Data Node: Data Nodes are commodity hardware in the distributed environment that
store the actual data, which undoubtedly makes Hadoop cost-effective.

● HDFS splits files into blocks and distributes them across the various nodes of large
clusters.

● It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

● It provides high-throughput access to application data and is suitable for applications
with large datasets (a minimal sketch of the HDFS Java API follows this section).
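Below is a minimal sketch of how an application can write a file to HDFS and read it back through the HDFS Java API. The NameNode address hdfs://namenode:9000 and the path /user/demo/hello.txt are illustrative assumptions, not taken from these notes; on a real cluster the address normally comes from core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; in practice this is taken from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/hello.txt");

            // Write a small file; large files written this way are transparently
            // split into blocks and replicated across Data Nodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back through the same FileSystem abstraction.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }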

Hadoop MapReduce  -

● Hadoop MapReduce is the processing unit of Hadoop.

● MapReduce is a computational model and software framework for writing applications
that are run on Hadoop.

● These MapReduce programs are capable of processing enormous amounts of data in parallel
on large clusters of computation nodes.

● MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are as
follows (a minimal word-count sketch follows this list):

▪ Map() performs sorting and filtering of the data, thereby organizing it into
groups. Map() generates key-value pairs as its result, which are later
processed by the Reduce() method.
▪ Reduce(), as the name suggests, performs summarization by aggregating the
mapped data. In simple terms, Reduce() takes the output generated by Map() as
input and combines those tuples into a smaller set of tuples.
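The following word-count sketch, written against Hadoop's Java MapReduce API, makes the Map()/Reduce() division concrete. It assumes the input and output HDFS paths are passed as command-line arguments; the class names are illustrative and not taken from these notes.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map(): emits a (word, 1) key-value pair for every word in its input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce(): aggregates all the counts for a word into a single total.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }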

Hadoop YARN –

● Hadoop YARN is the resource management unit of Hadoop.

● This is a framework for job scheduling and cluster resource management.


● YARN helps to open up Hadoop by allowing data stored in HDFS to be processed through
batch processing, stream processing, interactive processing and graph processing.

● It helps to run different types of distributed applications other than MapReduce.

● YARN consists of three major components, i.e.:

▪ Resource Manager: It has the privilege of allocating resources to the
applications in the system.
▪ Node Manager: It manages the allocation of resources such as CPU, memory and
bandwidth on each machine and later reports back to the Resource Manager.
▪ Application Manager: It works as an interface between the Resource Manager and
the Node Managers and negotiates resources as per the requirements of the two.

Other components:

PIG: 

● Pig was developed by Yahoo. It works on the Pig Latin language, a query-based
language similar to SQL.
● It is a platform for structuring the data flow, processing and analyzing huge data sets.
● Pig does the work of executing the commands, and in the background all the activities of
MapReduce are taken care of. After the processing, Pig stores the result in HDFS.

● The Pig Latin language is specially designed for this framework and runs on Pig Runtime,
just the way Java runs on the Java Virtual Machine (JVM).
● Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.

Mahout:

● Mahout adds machine-learning capability to a system or application.

● It provides various libraries and functionalities such as collaborative filtering, clustering,
and classification, which are core concepts of machine learning. It allows us to invoke
algorithms as per our needs with the help of its own libraries.

HIVE: 

● With the help of an SQL-like methodology and interface, Hive performs reading and writing
of large data sets. Its query language is called HQL (Hive Query Language).
● It is highly scalable, as it allows both real-time and batch processing. Also, all
the SQL data types are supported by Hive, thus making query processing easier.
● Like other query-processing frameworks, Hive comes with two components: Java
Database Connectivity (JDBC) drivers and the Hive command line.
● The JDBC driver, along with the ODBC drivers (which use the Open Database Connectivity
interface by Microsoft), establishes the connection and data-storage permissions, whereas
the Hive command line helps in the processing of queries (a JDBC sketch follows this section).
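As a small illustration of the JDBC path mentioned above, the Java sketch below connects to a HiveServer2 instance and runs an HQL query. The host hiveserver, port 10000, the credentials and the employees table are all assumptions made for the example; a running HiveServer2 and the Hive JDBC driver on the classpath are required.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // Hive JDBC driver class shipped with the Hive distribution.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Hypothetical HiveServer2 endpoint and database.
            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 // HQL query against an assumed 'employees' table.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }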

HBase: 

● HBase is a NoSQL database which supports all kinds of data and is thus capable of handling
anything in a Hadoop database. It provides capabilities similar to Google’s BigTable and is
therefore able to work on big data sets effectively.
● At times when we need to search for or retrieve the occurrences of something small in a
huge database, the request must be processed within a very short span of time. At such
times, HBase comes in handy, as it gives us a fault-tolerant way of storing such data and
retrieving it quickly (a small client-API sketch follows this section).
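The sketch below illustrates this kind of fast, targeted access using the HBase Java client API. The users table, the info column family and the row key are assumptions made for the example, and the table is presumed to already exist on a reachable cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath to locate the cluster.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Write one cell: row key "user1", column family "info", qualifier "city".
                Put put = new Put(Bytes.toBytes("user1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
                table.put(put);

                // Random read of the same row with low latency.
                Result result = table.get(new Get(Bytes.toBytes("user1")));
                byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
                System.out.println("city = " + Bytes.toString(city));
            }
        }
    }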

Zookeeper:

● There was a huge problem in managing the coordination and synchronization of the
resources and components of Hadoop, which often resulted in inconsistency.
● Zookeeper overcomes these problems by providing synchronization, inter-component
communication, grouping, and maintenance.

Oozie:

● Oozie simply performs the task of a scheduler: it schedules jobs and binds them
together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and
Oozie coordinator jobs.
● Oozie workflow jobs are jobs that need to be executed in a sequentially ordered manner,
whereas Oozie coordinator jobs are those that are triggered when some data or an
external stimulus is given to them.

Ambari (for management and monitoring):

● It eliminates the need for the manual tasks that were previously required to watch over Hadoop operations.
● It gives a simple and secure platform for provisioning, managing, and monitoring
Hortonworks Data Platform (HDP) deployments.

Sqoop (SQL-to-Hadoop):

● It is a big data tool that offers the capability to extract data from non-Hadoop data
stores, transform the data into a form usable by Hadoop, and then load the data into
HDFS.
● Open Database Connectivity (ODBC) is an open standard Application Programming
Interface (API) for accessing a database.
DATA FORMAT:

● A data/file format defines how information is stored in HDFS.


● Hadoop does not have a default file format and the choice of a format depends on its
use.
● Managing the processing and storage of large volumes of information is very complex,
which is why an appropriate data format is required.
● The choice of an appropriate file format can produce the following benefits:
▪ Optimum writing time
▪ Optimum reading time
▪ File divisibility
▪ Adaptive schema and compression support
● Some of the most commonly used formats of the Hadoop ecosystem are:
▪ Text/CSV: A plain text file or CSV is the most common format both outside and
within the Hadoop ecosystem.
▪ SequenceFile: The SequenceFile format stores data in binary form; it supports
compression but does not store metadata (see the sketch after this list).
▪ Avro: Avro is a row-based storage format. This format includes the definition of
the schema of your data in JSON format. Avro allows block compression along
with its divisibility, making it a good choice for most cases when using Hadoop.
▪ Parquet: Parquet is a column-based binary storage format that can store nested
data structures. This format is very efficient in terms of disk input/output
operations when the necessary columns to be used are specified.
▪ RCFile (Record Columnar File): RCFile is a columnar format that divides data into
groups of rows, and inside it, data is stored in columns.
▪ ORC (Optimized Row Columnar): ORC is considered an evolution of the RCFile
format and has all its benefits alongside some improvements such as better
compression, allowing faster queries.
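As a small illustration of one of these formats, the Java sketch below writes a few key-value records to a SequenceFile using Hadoop's own API. The output path is an assumption made for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/user/demo/counts.seq"); // hypothetical HDFS path

            // Keys and values are stored in binary form; no schema metadata is kept.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class))) {
                writer.append(new Text("hadoop"), new IntWritable(3));
                writer.append(new Text("hdfs"), new IntWritable(5));
            }
        }
    }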

Analysing Data with Hadoop :

● While the MapReduce programming model is at the heart of Hadoop, it is low-level and
as such becomes an unproductive way for developers to write complex analysis jobs.

● To increase developer productivity, several higher-level languages and APIs have been
created that abstract away the low-level details of the MapReduce programming model.

● There are several choices available for writing data analysis jobs.

● The Hive and Pig projects are popular choices that provide SQL-like and procedural data
flow-like languages, respectively.

● HBase is also a popular way to store and analyze data in HDFS. It is a column-oriented
database, and unlike MapReduce, it provides random read and write access to data with
low latency (that is, with little or no delay).
● MapReduce jobs can read and write data in HBase’s table format, but data processing is
often done via HBase’s own client API.
Scaling Up Vs Scaling Out :
● Once a decision has been made for data scaling, the specific scaling approach must be chosen.
● There are two commonly used types of data scaling :
▪ Up
▪ Out
● Scaling up, or vertical scaling :
▪ It involves obtaining a faster server with more powerful processors and more memory.
▪ This solution uses less network hardware, and consumes less power; but ultimately for
many platforms, it may only provide a short-term fix, especially if continued growth is
expected.
● Scaling out, or horizontal scaling :
▪ It involves adding servers for parallel computing.
▪ The scale-out technique is a long-term solution, as more and more servers may be
added when needed. But going from one monolithic system to this type of cluster may
be difficult, although it is an extremely effective solution.

Difference between scaling up and scaling out:

● Databases:
▪ Horizontal scaling (scaling out): In a database world, horizontal scaling is usually
based on the partitioning of data (each node contains only part of the data).
▪ Vertical scaling (scaling up): The data lives on a single node, and scaling is done
through multi-core, e.g. spreading the load between the CPU and RAM resources of the
machine.

● Downtime:
▪ Horizontal scaling (scaling out): In theory, adding more machines to the existing pool
means you are not limited to the capacity of a single unit, making it possible to scale
with less downtime.
▪ Vertical scaling (scaling up): Vertical scaling is limited to the capacity of one
machine; scaling beyond that capacity can involve downtime and has a hard upper limit,
i.e. the scale of the hardware on which you are currently running.

● Concurrency:
▪ Horizontal scaling (scaling out): Also described as distributed programming, as it
involves distributing jobs across machines over the network. Several patterns are
associated with this model: Master/Worker, Tuple Spaces, Blackboard, MapReduce.
▪ Vertical scaling (scaling up): Also described as the actor model; concurrent
programming on multi-core machines is often performed via multi-threading and
in-process message passing.

● Message passing:
▪ Horizontal scaling (scaling out): In distributed computing, the lack of a shared
address space makes data sharing more complex. It also makes the process of sharing,
passing or updating data more costly, since you have to pass copies of the data.
▪ Vertical scaling (scaling up): In a multi-threaded scenario, you can assume the
existence of a shared address space, so data sharing and message passing can be done by
passing a reference.

● Examples:
▪ Horizontal scaling (scaling out): Cassandra, MongoDB, Google Cloud Spanner
▪ Vertical scaling (scaling up): MySQL, Amazon RDS
