Hadoop Ecosystem



The Hadoop Ecosystem is a group of software tools: a platform, or framework, for solving big data
problems. Hadoop is a framework that manages big data storage by means of parallel and distributed
processing, and it comprises various tools dedicated to different areas of data management, such as
storing, processing, and analyzing data.

HDFS (Hadoop Distributed File System)

• In the traditional approach, all data was stored in a single central database. With the rise of
big data, a single database was not enough to handle the task.
• HDFS is the storage component of Hadoop; it stores data in the form of files.
• Each file is divided into blocks of 128 MB (configurable), which are stored on different
machines across the cluster in a distributed fashion.
• The default block size can be changed depending on the processing speed and the data
distribution.
• HDFS makes it possible to store different types of large data sets (i.e., structured,
semi-structured, and unstructured data).
• It stores data across various nodes and maintains a log file about the stored data
(metadata).

There are two components in HDFS:

1. NameNode - The NameNode is the master node; it does not store the actual data. It contains
metadata, much like a log file or a table of contents. It therefore requires less storage but
high computational resources.
2. DataNode - It stores the actual data, and a cluster usually has many DataNodes. A client
always communicates with the NameNode first when writing data; the NameNode then tells the
client which DataNodes to write to, and the data is replicated across those DataNodes.

To store data in a distributed fashion, HDFS splits it into multiple blocks with a default maximum
size of 128 MB. As noted above, this default block size can be changed depending on the processing
speed and the data distribution.

For example, consider 300 MB of data. It is broken down into blocks of 128 MB, 128 MB, and
44 MB; the final block only holds the remaining data, so it does not need to be a full 128 MB. This
is how data gets stored in a distributed manner in HDFS.
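To make this concrete, here is a minimal sketch (not from the original article) of writing a file into HDFS with Hadoop's Java FileSystem API, requesting a 128 MB block size explicitly; the NameNode URI, the local file, and the HDFS path are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        // Request a 128 MB block size for files written with this configuration.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // Copy a hypothetical 300 MB local file into HDFS; the NameNode records
        // the metadata while the blocks land on different DataNodes.
        fs.copyFromLocalFile(new Path("/tmp/input-300mb.dat"),
                             new Path("/user/demo/input-300mb.dat"));
        fs.close();
    }
}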

YARN (Yet Another Resource Negotiator)


YARN handles the cluster of nodes and acts as Hadoop’s resource management unit. It allocates
memory, CPU, and other resources to different applications.

YARN has two components (a short client-side sketch follows this list):


1. ResourceManager (Master) - This is the master daemon. It manages the assignment of resources
such as CPU, memory, and network bandwidth.
2. NodeManager (Slave) - This is the slave daemon; it runs on every worker node, launches and
monitors containers, and reports resource usage back to the ResourceManager.
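As a small client-side sketch (an illustration, not part of the original text), YARN's Java client API can ask the ResourceManager which NodeManagers are running and what resources each one offers; the ResourceManager address is assumed to be available on the classpath via yarn-site.xml.

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath; the ResourceManager
        // address is assumed to be configured there.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for the running NodeManagers and the
        // capacity (memory, vCores) each one reports.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " -> " + node.getCapability());
        }
        yarnClient.stop();
    }
}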

MapReduce
Hadoop data processing is built on MapReduce, which processes large volumes of data in a parallel,
distributed manner; it essentially divides a single job into multiple tasks and processes them on
different machines. MapReduce works as follows:
We start with big data that needs to be processed, with the goal of eventually arriving at an output.
First, the input data is divided up to form input splits. In the Map phase, the data in each split is
processed to produce intermediate key-value pairs. In the shuffle and sort phase, the map output is
grouped so that values belonging to the same key end up together. Finally, in the Reduce phase, the
grouped values are aggregated and a single consolidated output is returned.
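To make the Map, shuffle, and Reduce phases concrete, here is a hedged sketch of the classic word-count job written against Hadoop's Java MapReduce API; the input and output paths are assumed to be passed as command-line arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input split is read line by line and turned into
    // intermediate (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: after shuffle and sort, all counts for the same word
    // arrive together and are summed into a single output value.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // assumed input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // assumed output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}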
Sqoop
Sqoop is used to transfer data between Hadoop and external datastores such as relational
databases and enterprise data warehouses. It imports data from external datastores into HDFS, Hive,
and HBase.
In a typical workflow, the client machine submits a Sqoop command, which Sqoop translates into map
tasks. These tasks connect to the enterprise data warehouse, document-based systems, or an RDBMS,
and transfer the data into Hadoop.

Flume
Flume is another data collection and ingestion tool: a distributed service for collecting, aggregating,
and moving large amounts of log data. It ingests online streaming data from sources such as social
media, log files, and web servers into HDFS.
Data is taken from various sources, depending on your organization’s needs, and flows through a
source, a channel, and a sink: the source receives the events, the channel buffers them, and the sink
delivers them to their destination. Finally, the data is dumped into HDFS.

Pig
Apache Pig was developed by Yahoo researchers and is targeted mainly towards non-programmers. It
was designed to analyze and process large datasets without writing complex Java code. It provides a
high-level data processing language that can perform numerous operations.
It consists of:
1. Pig Latin - This is the language for scripting
2. Pig Latin Compiler - This converts Pig Latin code into executable code

Programmers write scripts in Pig Latin to analyze data using Pig. Grunt Shell is Pig’s interactive shell,
used to execute Pig commands; if the Pig script is written in a script file, the Pig Server executes it.
The parser checks the syntax of the Pig script and produces a DAG (Directed Acyclic Graph) of the
operations, i.e., the logical plan. The DAG is passed to the logical optimizer, the compiler converts the
optimized plan into MapReduce jobs, and the execution engine runs those jobs.
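As a rough illustration of the PigServer path mentioned above (the input file words.txt and the aliases are hypothetical), Pig Latin can also be embedded in a Java program and run in local mode:

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Local mode keeps the sketch self-contained; MAPREDUCE mode would
        // run the same Pig Latin on a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical input file with one word per line.
        pig.registerQuery("words = LOAD 'words.txt' AS (word:chararray);");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Iterate over the result of the last alias instead of storing it to a file.
        Iterator<Tuple> it = pig.openIterator("counts");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}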
Hive
Hive is a distributed data warehouse system developed by Facebook. It allows for easy reading,
writing, and managing of files on HDFS. It has its own query language for this purpose, known as Hive
Query Language (HQL), which is very similar to SQL. This makes life very easy for programmers:
instead of writing MapReduce code, they express the processing as simple HQL queries, which Hive
translates into MapReduce jobs.
Apache Hive has two major components:
• Hive Command Line
• JDBC/ODBC driver
A Java Database Connectivity (JDBC) application connects through the JDBC driver, and an Open
Database Connectivity (ODBC) application connects through the ODBC driver; commands can also be
executed directly in the CLI. The Hive driver is responsible for every submitted query, performing the
three steps of compilation, optimization, and execution internally. It then uses the MapReduce
framework to process the query.
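As a hedged sketch of the JDBC path (the host name, port, credentials, and the sales table are assumptions), a Java client might connect to HiveServer2 like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // The Hive JDBC driver ships with Hive; HiveServer2 listens on
        // port 10000 by default (assumed here, along with host and table).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server-host:10000/default", "hiveuser", "");
             Statement stmt = conn.createStatement();
             // The HQL below is compiled, optimized, and executed by the Hive driver.
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}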

Spark
Spark is a huge framework in and of itself: an open-source distributed computing engine for
processing and analyzing vast volumes of data, including real-time streams. For in-memory workloads
it can run up to 100 times faster than MapReduce. Spark provides in-memory computation of data and
is used to process and analyze real-time streaming data such as stock market and banking data,
among other things.

In Spark’s architecture, the master node runs the driver program. The Spark application code acts as
the driver and creates a SparkContext, which is the gateway to all of Spark’s functionality. Spark
applications run as independent sets of processes on a cluster; the driver program and the
SparkContext take care of job execution within the cluster. A job is split into multiple tasks that are
distributed over the worker nodes, and when an RDD is created in the SparkContext it can be
distributed across those nodes. Worker nodes are the slaves that run the tasks, with an executor on
each node responsible for executing them. The worker nodes carry out the tasks assigned by the
cluster manager and return the results to the SparkContext.
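As a minimal sketch of the driver-program, SparkContext, and RDD flow described above (the application name, the local[*] master URL, and the sample numbers are assumptions), a simple Spark job in Java might look like this:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkDriverExample {
    public static void main(String[] args) {
        // The driver program creates the SparkContext, the gateway to Spark.
        SparkConf conf = new SparkConf()
                .setAppName("spark-driver-sketch")
                .setMaster("local[*]"); // assumed: run locally instead of on a cluster
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An RDD is created in the SparkContext and distributed across the
        // available workers (here, local threads).
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // The map and reduce tasks run on executors; the result comes back
        // to the driver through the SparkContext.
        int sumOfSquares = numbers.map(x -> x * x).reduce(Integer::sum);
        System.out.println("Sum of squares: " + sumOfSquares);

        sc.stop();
    }
}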
Mahout
Mahout is used to create scalable, distributed machine learning applications for tasks such as
clustering, linear regression, and classification. Its library contains built-in algorithms for
collaborative filtering, classification, and clustering.

Ambari
Next up, we have Apache Ambari. It is an open-source tool responsible for keeping track of running
applications and their statuses. Ambari manages, monitors, and provisions Hadoop clusters, and it
also provides a central management service to start, stop, and configure Hadoop services.
The Ambari Web UI, which is your interface, connects to the Ambari server. Apache Ambari follows a
master/slave architecture. The master node is accountable for keeping track of the state of the
infrastructure; to do this, it uses a database server that can be configured during setup. Most of the
time, the Ambari server is located on the master node and is connected to this database. Ambari
agents run on all the nodes that you want to manage under Ambari, and each agent periodically
sends heartbeats to the master node to show that it is alive. Through these agents, the Ambari
server is able to execute many tasks on the cluster hosts.

Kafka
Kafka is a distributed streaming platform designed to store and process streams of records. It is
written in Scala. It is used to build real-time streaming data pipelines that reliably move data
between applications, as well as real-time applications that transform or react to streams of data.
Kafka uses a publish-subscribe messaging system for transferring data from one application to
another: data transfer involves a sender (the producer), the message queue (the Kafka brokers), and
a receiver (the consumer).
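As a hedged sketch of the sender side (the broker address, topic name, key, and value are assumptions), a minimal Java producer could look like this:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaSenderExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; a real cluster would list several brokers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // The producer is the "sender": it publishes records to a topic,
        // which acts as the message queue that consumers read from.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("clickstream-events",
                                               "user-42", "page_view:/home"));
            producer.flush();
        }
    }
}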
Storm
Storm is an engine that processes real-time streaming data at very high speed. It is written in
Clojure. Storm is commonly cited as processing over a million tuples per second per node. It is
integrated with Hadoop to harness higher throughput.
Now that we have looked at the various data ingestion tools and streaming services, let us take a
look at the security frameworks in the Hadoop ecosystem.
Ranger
Ranger is a framework designed to enable, monitor, and manage data security across the Hadoop
platform. It provides centralized administration for managing all security-related tasks. Ranger
standardizes authorization across all Hadoop components and provides enhanced support for
different authorization methods, such as role-based access control and attribute-based access
control, to name a few.
Knox
Apache Knox is an application gateway used in conjunction with Hadoop deployments, interacting
with REST APIs and UIs. The gateway delivers three types of user-facing services:
1. Proxying Services - These provide access to Hadoop by proxying HTTP requests
2. Authentication Services - These provide authentication for REST API access and a WebSSO flow for
the user interfaces
3. Client Services - These support client development, either through scripting with the DSL or by
using the Knox shell classes
Oozie
Oozie is a workflow scheduler system used to manage Hadoop jobs. It consists of two parts:
1. Workflow engine - This consists of Directed Acyclic Graphs (DAGs), which specify a sequence of
actions to be executed
2. Coordinator engine - This engine runs workflow jobs that are triggered by time and data
availability
In a typical workflow, the process begins with a MapReduce action. The action can either succeed or
end in an error. If it succeeds, the client is notified by email; if the action fails, the client is similarly
notified, and the workflow is terminated.
