Hadoop Ecosystem



The Hadoop Ecosystem is a group of software tools: a platform, or framework, for solving big data
problems. Hadoop is a framework that manages big data storage by means of parallel and distributed
processing, and it comprises various tools dedicated to different areas of data management, such as
storing, processing, and analyzing data.

HDFS (Hadoop Distributed File System)

• In the traditional approach, all data was stored in a single central database. With the rise of
big data, a single database was not enough to handle the task.
• HDFS is the storage component of Hadoop; it stores data in the form of files.
• Each file is divided into blocks of 128 MB (configurable), which are stored on different
machines across the cluster in a distributed fashion.
• The default block size can be changed depending on the processing speed and the data
distribution.
• HDFS makes it possible to store different types of large data sets (i.e., structured,
semi-structured, and unstructured data).
• It stores data across various nodes and maintains a log file about the stored data
(metadata).

There are two components in HDFS:

1. NameNode - The NameNode is the master node; it does not store the actual data. It contains
metadata, much like a log file or a table of contents. It therefore requires less storage but
high computational resources.
2. DataNode - It stores the actual data, and a cluster usually has many DataNodes. A client
always communicates with the NameNode first when writing data; the NameNode then tells the
client which DataNodes to write to, and the data is replicated across those DataNodes.

To store data in a distributed fashion, HDFS splits it into multiple blocks with a default maximum
size of 128 MB. As noted above, this default block size can be changed depending on the processing
speed and the data distribution.

For example, consider 300 MB of data. It is broken down into blocks of 128 MB, 128 MB, and
44 MB; the final block only holds the remaining data, so it does not need to be a full 128 MB. This
is how data gets stored in a distributed manner in HDFS.
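To make this concrete, here is a minimal sketch (not from the original article) of writing a file into HDFS with Hadoop's Java FileSystem API, requesting a 128 MB block size explicitly; the NameNode URI, the local file, and the HDFS path are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        // Request a 128 MB block size for files written with this configuration.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // Copy a hypothetical 300 MB local file into HDFS; the NameNode records
        // the metadata while the blocks land on different DataNodes.
        fs.copyFromLocalFile(new Path("/tmp/input-300mb.dat"),
                             new Path("/user/demo/input-300mb.dat"));
        fs.close();
    }
}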

YARN (Yet Another Resource Negotiator)


YARN handles the cluster of nodes and acts as Hadoop’s resource management unit. It allocates
memory, CPU, and other resources to different applications.

YARN has two components (a short client-side sketch follows this list):


1. ResourceManager (Master) - This is the master daemon. It manages the assignment of resources
such as CPU, memory, and network bandwidth.
2. NodeManager (Slave) - This is the slave daemon; it runs on every worker node, launches and
monitors containers, and reports resource usage back to the ResourceManager.
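As a small client-side sketch (an illustration, not part of the original text), YARN's Java client API can ask the ResourceManager which NodeManagers are running and what resources each one offers; the ResourceManager address is assumed to be available on the classpath via yarn-site.xml.

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath; the ResourceManager
        // address is assumed to be configured there.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for the running NodeManagers and the
        // capacity (memory, vCores) each one reports.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " -> " + node.getCapability());
        }
        yarnClient.stop();
    }
}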

MapReduce
Hadoop data processing is built on MapReduce, which processes large volumes of data in a parallel,
distributed manner; it essentially divides a single job into multiple tasks and processes them on
different machines. MapReduce works as follows:
We start with big data that needs to be processed, with the goal of eventually arriving at an output.
First, the input data is divided up to form input splits. In the Map phase, the data in each split is
processed to produce intermediate key-value pairs. In the shuffle and sort phase, the map output is
grouped so that values belonging to the same key end up together. Finally, in the Reduce phase, the
grouped values are aggregated and a single consolidated output is returned.
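To make the Map, shuffle, and Reduce phases concrete, here is a hedged sketch of the classic word-count job written against Hadoop's Java MapReduce API; the input and output paths are assumed to be passed as command-line arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input split is read line by line and turned into
    // intermediate (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: after shuffle and sort, all counts for the same word
    // arrive together and are summed into a single output value.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // assumed input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // assumed output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}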
Sqoop
Sqoop is used to transfer data between Hadoop and external datastores such as relational
databases and enterprise data warehouses. It imports data from external datastores into HDFS, Hive,
and HBase.
In a typical workflow, the client machine submits a Sqoop command, which Sqoop translates into map
tasks. These tasks connect to the enterprise data warehouse, document-based systems, or an RDBMS,
and transfer the data into Hadoop.

Flume
Flume is another data collection and ingestion tool: a distributed service for collecting, aggregating,
and moving large amounts of log data. It ingests online streaming data from sources such as social
media, log files, and web servers into HDFS.
Data is taken from various sources, depending on your organization’s needs, and flows through a
source, a channel, and a sink: the source receives the events, the channel buffers them, and the sink
delivers them to their destination. Finally, the data is dumped into HDFS.

Pig
Apache Pig was developed by Yahoo researchers and is targeted mainly towards non-programmers. It
was designed to analyze and process large datasets without writing complex Java code. It provides a
high-level data processing language that can perform numerous operations.
It consists of:
1. Pig Latin - This is the language for scripting
2. Pig Latin Compiler - This converts Pig Latin code into executable code

Programmers write scripts in Pig Latin to analyze data using Pig. Grunt Shell is Pig’s interactive shell,
used to execute Pig commands; if the Pig script is written in a script file, the Pig Server executes it.
The parser checks the syntax of the Pig script and produces a DAG (Directed Acyclic Graph) of the
operations, i.e., the logical plan. The DAG is passed to the logical optimizer, the compiler converts the
optimized plan into MapReduce jobs, and the execution engine runs those jobs.
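As a rough illustration of the PigServer path mentioned above (the input file words.txt and the aliases are hypothetical), Pig Latin can also be embedded in a Java program and run in local mode:

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Local mode keeps the sketch self-contained; MAPREDUCE mode would
        // run the same Pig Latin on a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical input file with one word per line.
        pig.registerQuery("words = LOAD 'words.txt' AS (word:chararray);");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Iterate over the result of the last alias instead of storing it to a file.
        Iterator<Tuple> it = pig.openIterator("counts");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}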
Hive
Hive is a distributed data warehouse system developed by Facebook. It allows for easy reading,
writing, and managing of files on HDFS. It has its own query language for this purpose, known as Hive
Query Language (HQL), which is very similar to SQL. This makes life very easy for programmers:
instead of writing MapReduce code, they express the processing as simple HQL queries, which Hive
translates into MapReduce jobs.
Apache Hive has two major components:
• Hive Command Line
• JDBC/ODBC driver
A Java Database Connectivity (JDBC) application connects through the JDBC driver, and an Open
Database Connectivity (ODBC) application connects through the ODBC driver; commands can also be
executed directly in the CLI. The Hive driver is responsible for every submitted query, performing the
three steps of compilation, optimization, and execution internally. It then uses the MapReduce
framework to process the query.
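As a hedged sketch of the JDBC path (the host name, port, credentials, and the sales table are assumptions), a Java client might connect to HiveServer2 like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // The Hive JDBC driver ships with Hive; HiveServer2 listens on
        // port 10000 by default (assumed here, along with host and table).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server-host:10000/default", "hiveuser", "");
             Statement stmt = conn.createStatement();
             // The HQL below is compiled, optimized, and executed by the Hive driver.
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}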

Spark
Spark is a huge framework in and of itself: an open-source distributed computing engine for
processing and analyzing vast volumes of data, including real-time streams. For in-memory workloads
it can run up to 100 times faster than MapReduce. Spark provides in-memory computation of data and
is used to process and analyze real-time streaming data such as stock market and banking data,
among other things.

In Spark’s architecture, the master node runs the driver program. The Spark application code acts as
the driver and creates a SparkContext, which is the gateway to all of Spark’s functionality. Spark
applications run as independent sets of processes on a cluster; the driver program and the
SparkContext take care of job execution within the cluster. A job is split into multiple tasks that are
distributed over the worker nodes, and when an RDD is created in the SparkContext it can be
distributed across those nodes. Worker nodes are the slaves that run the tasks, with an executor on
each node responsible for executing them. The worker nodes carry out the tasks assigned by the
cluster manager and return the results to the SparkContext.
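As a minimal sketch of the driver-program, SparkContext, and RDD flow described above (the application name, the local[*] master URL, and the sample numbers are assumptions), a simple Spark job in Java might look like this:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkDriverExample {
    public static void main(String[] args) {
        // The driver program creates the SparkContext, the gateway to Spark.
        SparkConf conf = new SparkConf()
                .setAppName("spark-driver-sketch")
                .setMaster("local[*]"); // assumed: run locally instead of on a cluster
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An RDD is created in the SparkContext and distributed across the
        // available workers (here, local threads).
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // The map and reduce tasks run on executors; the result comes back
        // to the driver through the SparkContext.
        int sumOfSquares = numbers.map(x -> x * x).reduce(Integer::sum);
        System.out.println("Sum of squares: " + sumOfSquares);

        sc.stop();
    }
}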
Mahout
Mahout is used to create scalable, distributed machine learning applications for tasks such as
clustering, linear regression, and classification. Its library contains built-in algorithms for
collaborative filtering, classification, and clustering.

Ambari
Next up, we have Apache Ambari. It is an open-source tool responsible for keeping track of running
applications and their statuses. Ambari manages, monitors, and provisions Hadoop clusters, and it
also provides a central management service to start, stop, and configure Hadoop services.
The Ambari Web UI, which is your interface, connects to the Ambari server. Apache Ambari follows a
master/slave architecture. The master node is accountable for keeping track of the state of the
infrastructure; to do this, it uses a database server that can be configured during setup. Most of the
time, the Ambari server is located on the master node and is connected to this database. Ambari
agents run on all the nodes that you want to manage under Ambari, and each agent periodically
sends heartbeats to the master node to show that it is alive. Through these agents, the Ambari
server is able to execute many tasks on the cluster hosts.

Kafka
Kafka is a distributed streaming platform designed to store and process streams of records. It is
written in Scala. It is used to build real-time streaming data pipelines that reliably move data
between applications, as well as real-time applications that transform or react to streams of data.
Kafka uses a publish-subscribe messaging system for transferring data from one application to
another: data transfer involves a sender (the producer), the message queue (the Kafka brokers), and
a receiver (the consumer).
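As a hedged sketch of the sender side (the broker address, topic name, key, and value are assumptions), a minimal Java producer could look like this:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaSenderExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; a real cluster would list several brokers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // The producer is the "sender": it publishes records to a topic,
        // which acts as the message queue that consumers read from.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("clickstream-events",
                                               "user-42", "page_view:/home"));
            producer.flush();
        }
    }
}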
Storm
Storm is an engine that processes real-time streaming data at very high speed. It is written in
Clojure. Storm is commonly cited as processing over a million tuples per second per node. It is
integrated with Hadoop to harness higher throughput.
Now that we have looked at the various data ingestion tools and streaming services, let us take a
look at the security frameworks in the Hadoop ecosystem.
Ranger
Ranger is a framework designed to enable, monitor, and manage data security across the Hadoop
platform. It provides centralized administration for managing all security-related tasks. Ranger
standardizes authorization across all Hadoop components and provides enhanced support for
different authorization methods, such as role-based access control and attribute-based access
control, to name a few.
Knox
Apache Knox is an application gateway used in conjunction with Hadoop deployments, interacting
with REST APIs and UIs. The gateway delivers three types of user-facing services:
1. Proxying Services - These provide access to Hadoop by proxying HTTP requests
2. Authentication Services - These provide authentication for REST API access and a WebSSO flow for
the user interfaces
3. Client Services - These support client development, either through scripting with the DSL or by
using the Knox shell classes
Oozie
Oozie is a workflow scheduler system used to manage Hadoop jobs. It consists of two parts:
1. Workflow engine - This consists of Directed Acyclic Graphs (DAGs), which specify a sequence of
actions to be executed
2. Coordinator engine - This engine runs workflow jobs that are triggered by time and data
availability
In a typical workflow, the process begins with a MapReduce action. The action can either succeed or
end in an error. If it succeeds, the client is notified by email; if the action fails, the client is similarly
notified, and the workflow is terminated.
