HADOOP ECOSYSTEM
Hadoop is a framework for processing Big Data, but unlike
most frameworks it is not a single, simple tool: it has
its own family of components for different processing tasks,
tied together under one umbrella known as the Hadoop Ecosystem.
Data is mainly categorized into 3 types under a Big Data platform.
Structured Data - Data which has a proper structure and which can be easily
stored in tabular form in any relational database like MySQL, Oracle etc. is
known as structured data. Example - employee data.
Semi-Structured Data - Data which has some structure but cannot be saved in
tabular form in relational databases is known as semi-structured data.
Example - XML data, email messages etc.
Unstructured Data - Data which does not have any structure and cannot be saved
in the tabular form of relational databases is known as unstructured data.
Example - video files, audio files, text files etc.
SQOOP : SQL + HADOOP = SQOOP
When we import structured data from a table (RDBMS) into
HDFS, a file is created in HDFS which we can process either by a
MapReduce program directly or by Hive or Pig.
FLUME
Flume is a distributed, reliable, and available system for efficiently
collecting, aggregating, and moving large amounts of log data from
many different sources to a centralized data store.
Flume can be used to transport massive quantities of event data
including, but not limited to, network traffic data,
social-media-generated data, email messages and pretty much any
data source possible.
HDFS
(HADOOP DISTRIBUTED FILE
SYSTEM)
HDFS is a main component of Hadoop and a
technique for storing data in a distributed
manner in order to compute quickly.
HDFS saves data in blocks of 128 MB (by default),
which is a logical splitting of data across
DataNodes (the physical storage of data) in a Hadoop
cluster (a formation of several DataNodes: a
collection of commodity hardware connected through a
single network).
All information about how data is split across DataNodes,
known as metadata, is kept in the NameNode, which is also
a part of HDFS.
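The logical block splitting described above can be sketched in plain Python (an illustration only; real block placement and replication are handled by the NameNode, and the 128 MB default is configurable):

```python
# Sketch: how a file is logically split into HDFS-style blocks.
# BLOCK_SIZE_MB matches the 128 MB default mentioned above.
BLOCK_SIZE_MB = 128

def split_into_blocks(file_size_mb):
    """Return the sizes of the blocks a file of the given size occupies."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE_MB, remaining))
        remaining -= BLOCK_SIZE_MB
    return blocks

# A 300 MB file occupies two full 128 MB blocks plus one 44 MB block.
print(split_into_blocks(300))  # [128, 128, 44]
```

Note that the last block is only as large as the remaining data; HDFS does not pad files out to a full block.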
YET ANOTHER RESOURCE NEGOTIATOR
(YARN) IN HADOOP 2.0
• Allocates resources for all scheduled tasks
• Two services
• Resource Manager (MASTER DAEMON)
• Manages resources and schedules applications running on top of
YARN.
• Node Manager (SLAVE DAEMON)
• Manages containers and monitors resource utilization in each
container.
MAPREDUCE FRAMEWORK
• It is another main component of Hadoop and a method of programming
against distributed data stored in HDFS.
• We can write a MapReduce program using any programming language, like
C, C++, Java, R or Python.
• A job is divided into n Map and n Reduce tasks. Map does the calculation and
Reduce aggregates it.
• Input and output are key/value pairs.
• A MapReduce program can be applied to any type of data, whether
structured or unstructured, stored in HDFS. Example - word count using
MapReduce.
• MAP = filtering, grouping and sorting
• REDUCE = aggregates and summarizes the result produced by the
map function
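The word-count example can be simulated in plain Python to show the map → shuffle/sort → reduce flow (a sketch only; a real MapReduce job runs distributed across the cluster, typically written in Java or via Hadoop Streaming):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # MAP: emit a (word, 1) key/value pair for each word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # REDUCE: aggregate all counts for one key.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle/sort phase: collect all map outputs and sort them by key.
pairs = sorted(kv for line in lines for kv in mapper(line))

# Reduce phase: one reducer call per distinct key.
result = dict(
    reducer(word, (count for _, count in group))
    for word, group in groupby(pairs, key=itemgetter(0))
)
print(result["the"])  # 3
```

In a real job, the framework performs the shuffle/sort between the Map and Reduce phases automatically; only the mapper and reducer are written by the programmer.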
HBASE
Hadoop Database, or HBase, is a non-relational (NoSQL)
database that runs on top of HDFS.
HBase was created for large tables which have billions of
rows and millions of columns, with fault-tolerance
capability and horizontal scalability, and is based on Google's
Bigtable.
Hadoop can perform only batch
processing, and data will be accessed only in a sequential
manner; for random access to huge data, HBase is used.
HIVE
Hive was created by Facebook and later donated to the
Apache Software Foundation.
Hive mainly deals with structured data which is stored in
HDFS, with a query language similar to SQL known
as HQL (Hive Query Language).
Hive also runs MapReduce programs in the backend to
process data in HDFS.
• 2 basic components
• Hive Command Line
• Java Database Connectivity (JDBC) and Open Database
Connectivity (ODBC)
• Supports User Defined Function (UDF) to accomplish specific
needs.
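HQL reads almost identically to standard SQL. As a rough illustration, the query below uses Python's built-in sqlite3 module, NOT Hive itself (the table name and data are made up); in Hive, the same SELECT would be compiled into MapReduce jobs over files in HDFS rather than running locally:

```python
import sqlite3

# Local stand-in table; in Hive this would be a table over HDFS files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("alice", "IT", 100), ("bob", "IT", 80), ("carol", "HR", 90)],
)

# An HQL query for average salary per department looks just like this SQL.
rows = sorted(
    conn.execute(
        "SELECT dept, AVG(salary) FROM employees GROUP BY dept"
    ).fetchall()
)
print(rows)  # [('HR', 90.0), ('IT', 90.0)]
```

This similarity to SQL is exactly why Hive lowered the barrier to entry for analysts who did not want to write MapReduce code directly.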
PIG
Similar to Hive, Pig also deals with structured data, using
the Pig Latin language.
Pig was originally developed at Yahoo! to answer a similar need
to Hive.
It is an alternative provided to programmers who love scripting
and don't want to use Java/Python or SQL to process data.
A Pig Latin program is made up of a series of operations, or
transformations, that are applied to the input data; it
runs MapReduce programs in the backend to produce the output.
• By Yahoo!
• 1 line of Pig Latin ≈ 100 lines of a MapReduce job
• The compiler internally converts Pig Latin to MapReduce
• It gives you a platform for building data flows for Extract, Transform and Load (ETL)
• Pig
• Loads the data
• Group
• Filter
• Join
• Sort
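The load/filter/group steps above can be sketched as a dataflow in plain Python (the records and field names are made up for illustration); in Pig Latin each step would be a single LOAD, FILTER, GROUP, or FOREACH statement compiled to MapReduce:

```python
from collections import defaultdict

# LOAD: (name, dept, score) records, stand-ins for data loaded from HDFS.
records = [
    ("alice", "IT", 90), ("bob", "IT", 70),
    ("carol", "HR", 80), ("dave", "HR", 60),
]

# FILTER: keep only records with a score above 65.
filtered = [r for r in records if r[2] > 65]

# GROUP BY department.
groups = defaultdict(list)
for name, dept, score in filtered:
    groups[dept].append(score)

# FOREACH ... GENERATE: aggregate each group.
averages = {dept: sum(scores) / len(scores) for dept, scores in groups.items()}
print(averages)  # {'IT': 80.0, 'HR': 80.0}
```

Each of these four steps corresponds to one line of Pig Latin, which is the sense in which one line of Pig Latin replaces many lines of hand-written MapReduce code.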
HIVE AND PIG ARCHITECTURE
[Diagram: Hive architecture vs. Pig dataflow.
Hive: CLI/UI/API (JDBC/ODBC) and Thrift clients (PHP/Perl/Python/C++/Java)
connect through the Thrift server; the HiveQL driver compiles queries using
the Metastore, and the execution engine submits MapReduce jobs to Hadoop.
Pig: Load -> Map (filter, local aggregation) -> Group/Distribute ->
Reduce (foreach, global aggregation) -> Store.]
MAHOUT
Mahout is an open-source machine learning library from
Apache, written in Java.
The algorithms it implements fall under the broad umbrella of
machine learning, or collective intelligence.
Mahout aims to be the machine learning tool of choice when
the collection of data to be processed is very large, perhaps far
too large for a single machine.
Tasks: predictive analysis (recommenders), clustering,
classification
OOZIE
It is a workflow scheduler system to manage Hadoop jobs.
Oozie is implemented as a Java web application that runs in
a Java servlet container.
Hadoop basically deals with Big Data, and a programmer often
wants to run many jobs in a sequential manner:
the output of job A is the input to job B, the output of
job B is the input to job C, and the final output is the
output of job C. To automate this sequence we need a
workflow, and to execute it we need an engine; Oozie serves
both purposes.
• OOZIE WORKFLOW
• Sequential set of actions to be executed.
• OOZIE COORDINATOR
• Oozie jobs which are triggered when data is made available to them, or
triggered based on time.
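The job-A → job-B chaining described above is exactly what an Oozie workflow definition expresses. A minimal sketch of such a workflow.xml (action names and EL variables like ${jobTracker} are placeholders, not from the original document; per-job configuration such as input/output paths is omitted):

```xml
<workflow-app name="chained-jobs" xmlns="uri:oozie:workflow:0.5">
    <start to="job-A"/>
    <action name="job-A">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- configuration omitted: job A writes its output
                 where job B expects its input -->
        </map-reduce>
        <ok to="job-B"/>      <!-- on success, run the next job -->
        <error to="fail"/>
    </action>
    <action name="job-B">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each action's ok/error transitions are what encode the sequence: the workflow only advances to job-B after job-A succeeds.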
ZOOKEEPER
• ZooKeeper is a centralized service for maintaining
configuration information, naming, providing distributed
synchronization, and providing group services.
• Writing distributed applications is difficult because partial
failures may occur between nodes; to overcome this, Apache
ZooKeeper was developed as an open-source
server which enables highly reliable distributed coordination.
• In case of any partial failure, clients can connect to any node
and be assured that they will receive the correct, up-to-date
information.
APACHE SPARK
• A framework for real-time data analytics
• Up to 100x faster than Hadoop MapReduce (for in-memory workloads)
• Supports standalone and distributed processing
HDFS FILE SYSTEM
• Creating a file system object
• PATH = HDFS object
• Hadoop-specific file system types
FEATURES OF HDFS
• Data Replication
• Data Resilience
• Data Integrity
• Maintaining Transaction logs
• Validating Checksums = A numerical value is assigned to a
transmitted message and used to verify the content of a file.
• Creating Data Blocks = DataNodes are also called block servers.
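The checksum idea can be shown in a few lines of Python using CRC32 (a simplified sketch; HDFS computes CRC checksums per fixed-size chunk of each block, not one per file):

```python
import zlib

def checksum(data: bytes) -> int:
    # Compute a CRC32 checksum over the content: a numerical value
    # derived from the bytes, used later to verify integrity.
    return zlib.crc32(data)

original = b"block contents stored on a DataNode"
stored_checksum = checksum(original)

# On a later read, recompute the checksum and compare with the stored one.
assert checksum(original) == stored_checksum                 # content intact
assert checksum(b"corrupted contents!") != stored_checksum   # mismatch detected
print("checksum verified")
```

Any change to the bytes almost certainly changes the CRC32 value, which is how corrupted blocks are detected and then re-replicated from a healthy copy.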
FUNCTIONS PERFORMED BY BLOCK SERVER
• Storage of data on a local file system.
• Storage of metadata of a block on the local file system, on the basis of a
similar template on the NameNode.
• Conduct of periodic validations of file checksums.
• Intimation about availability of blocks to the NameNode by sending reports
regularly.
• On-demand supply of metadata and data to clients
• Movement of data to connected nodes on the basis of pipelining model
• Rebalancing of data across DataNodes can be triggered with “sudo -u hdfs hdfs balancer”