Hadoop Ecosystem

1. Hadoop Distributed File System

HDFS is the most important component of the Hadoop ecosystem and its primary storage system.
The Hadoop Distributed File System (HDFS) is a Java-based file system that provides
scalable, fault-tolerant, reliable and cost-efficient data storage for big data. It is a
distributed file system that runs on commodity hardware. HDFS ships with a default
configuration that suits many installations; large clusters usually need further
configuration. Users interact directly with HDFS through shell-like commands.

HDFS Components:

HDFS has two major components: NameNode and DataNode.

i. NameNode

The NameNode is also known as the master node. It does not store the actual data or dataset.
It stores metadata, i.e. the number of blocks, their locations, and the rack and DataNode on
which each block is stored, among other details. This metadata describes the files and
directories of the file system.
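
The kind of metadata described above can be pictured with a small in-memory model. This is only an illustrative sketch: the class names `BlockInfo` and `NameNodeMetadata`, the file path, and the rack and DataNode names are all hypothetical and are not part of the Hadoop API.

```python
from dataclasses import dataclass, field


@dataclass
class BlockInfo:
    block_id: int
    # (rack, datanode) pairs that hold a replica of this block
    replicas: list = field(default_factory=list)


class NameNodeMetadata:
    """Toy namespace: file path -> ordered list of blocks (no file data is stored)."""

    def __init__(self):
        self.namespace = {}

    def add_file(self, path, blocks):
        self.namespace[path] = blocks

    def block_count(self, path):
        return len(self.namespace[path])

    def locate(self, path):
        """Return, for each block of the file, the DataNodes that hold a replica."""
        return {b.block_id: [dn for _rack, dn in b.replicas]
                for b in self.namespace[path]}


meta = NameNodeMetadata()
meta.add_file("/logs/app.log", [
    BlockInfo(1, [("rack1", "dn1"), ("rack2", "dn3")]),
    BlockInfo(2, [("rack1", "dn2"), ("rack2", "dn3")]),
])
print(meta.block_count("/logs/app.log"))
print(meta.locate("/logs/app.log"))
```

Note that the model holds only locations, never the block contents, which matches the division of work between NameNode and DataNode.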

Tasks of HDFS NameNode

 Manages the file system namespace.

 Regulates client access to files.

 Executes file system operations such as renaming, closing and opening files and directories.

ii. DataNode

The DataNode is also known as the slave node. It is responsible for storing the actual data
in HDFS and performs read and write operations as requested by clients. Each block replica
on a DataNode consists of two files on the local file system: the first holds the data and
the second records the block's metadata, which includes checksums for the data. At startup,
each DataNode connects to its NameNode and performs a handshake. The handshake verifies the
namespace ID and the software version of the DataNode. If a mismatch is found, the DataNode
shuts down automatically.
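
The two mechanisms in this paragraph, per-block checksums and the startup handshake, can be sketched as follows. This is a simplified illustration under stated assumptions: `zlib.crc32` stands in for whatever checksum HDFS actually uses, and the function names and dictionary fields are hypothetical.

```python
import zlib


def write_block(data: bytes):
    """Return the two 'files' of a replica: the data and its checksum metadata."""
    return data, {"crc32": zlib.crc32(data)}


def verify_block(data: bytes, meta: dict) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(data) == meta["crc32"]


def handshake(namenode: dict, datanode: dict) -> bool:
    """True only when both the namespace ID and the software version match."""
    return (namenode["namespace_id"] == datanode["namespace_id"]
            and namenode["version"] == datanode["version"])


data, meta = write_block(b"hello hdfs")
nn = {"namespace_id": 42, "version": "3.3"}      # assumed example values
ok_dn = {"namespace_id": 42, "version": "3.3"}
bad_dn = {"namespace_id": 7, "version": "3.3"}

print(verify_block(data, meta))   # intact block
print(handshake(nn, ok_dn))       # matching DataNode may join
print(handshake(nn, bad_dn))      # mismatching DataNode would shut down
```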

Tasks of HDFS DataNode

 DataNode performs operations such as block replica creation, deletion, and replication
according to instructions from the NameNode.

 DataNode manages data storage of the system.


2. MapReduce

Hadoop MapReduce is the core Hadoop ecosystem component that provides data processing.
MapReduce is a software framework for writing applications that process vast amounts
of structured and unstructured data stored in the Hadoop Distributed File System.
MapReduce programs are parallel in nature, which makes them well suited to large-scale data
analysis across multiple machines in the cluster. This parallel processing improves the
speed and reliability of the cluster.

Working of MapReduce

MapReduce works by breaking the processing into two phases:

 Map phase

 Reduce phase

Each phase has key-value pairs as input and output. The programmer specifies two
functions: the map function and the reduce function.

The map function takes a set of data and converts it into another set of data, in which
individual elements are broken down into tuples (key/value pairs).
The reduce function takes the output of the map function as its input and combines those
tuples by key, producing a new value for each key.
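
The classic word-count job shows the two functions at work. The sketch below runs both phases in a single Python process, so it only illustrates the data flow; it does not use the Hadoop API, and the names `map_fn`, `reduce_fn` and `run_job` are hypothetical.

```python
from collections import defaultdict


def map_fn(line):
    """Map: break a line into (word, 1) key/value pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


def run_job(lines):
    # Group the map output by key, standing in for the framework's shuffle step.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


result = run_job(["big data big cluster", "data node"])
print(result)
```

On a real cluster the map calls run in parallel on different machines, and the grouping step moves data across the network before the reduce calls run.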

Features of MapReduce

 Simplicity – MapReduce jobs are easy to run. Applications can be written in languages
such as Java, C++, and Python.

 Scalability – MapReduce can process petabytes of data.

 Speed – Through parallel processing, problems that would take days to solve are solved
in hours or minutes.

 Fault Tolerance – MapReduce takes care of failures. If one copy of the data is unavailable,
another machine holds a copy of the same key-value pair, which can be used to solve the same
subtask.

3. YARN

Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component that
provides resource management, and it is one of the most important components of the
ecosystem. YARN is often called the operating system of Hadoop because it is responsible for
managing and monitoring workloads. It allows multiple data processing engines, such as
real-time streaming and batch processing, to handle data stored on a single platform.

YARN has been presented as a data operating system for Hadoop 2. Its main features are:

 Flexibility – Enables other purpose-built data processing models beyond MapReduce
(batch), such as interactive and streaming. Because of this, other applications can run
alongside MapReduce programs in Hadoop 2.

 Efficiency – Because many applications run on the same cluster, the efficiency of Hadoop
increases without much effect on quality of service.

 Shared – Provides a stable, reliable, secure foundation and shared operational services
across multiple workloads. Additional programming models such as graph processing
and iterative modeling are now possible for data processing.

4. Hive

Apache Hive is an open source data warehouse system for querying and analyzing large
datasets stored in Hadoop files. Hive performs three main functions: data summarization,
query, and analysis.

Hive uses a language called HiveQL (HQL), which is similar to SQL. Hive automatically
translates HiveQL queries into MapReduce jobs that execute on Hadoop.
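
To see how an SQL-like query can map onto a MapReduce job, consider a simple GROUP BY. The sketch below is a conceptual illustration only: it does not reflect how Hive's compiler works internally, and the `employees` table, the query text and the function names are assumed for the example.

```python
from collections import defaultdict

# The query being illustrated (assumed example table and columns).
QUERY = "SELECT dept, COUNT(*) FROM employees GROUP BY dept"

employees = [
    {"name": "ana", "dept": "eng"},
    {"name": "bo", "dept": "ops"},
    {"name": "cy", "dept": "eng"},
]


def map_row(row):
    """Map: emit the GROUP BY column as the key and 1 as the value."""
    return row["dept"], 1


def reduce_group(dept, ones):
    """Reduce: COUNT(*) is the sum of the 1s emitted for this key."""
    return dept, sum(ones)


def run_group_by(rows):
    shuffled = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        shuffled[key].append(value)
    return sorted(reduce_group(key, values) for key, values in shuffled.items())


counts = run_group_by(employees)
print(counts)
```

The user writes only the one-line query; the map and reduce steps are what the engine would have to generate on the user's behalf.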

5. Pig

Apache Pig is a high-level language platform for analyzing and querying huge datasets
stored in HDFS. Pig uses the Pig Latin language, a data-flow language with SQL-like
operations. A typical script loads the data, applies the required filters and dumps the data
in the required format. To execute programs, Pig requires a Java runtime environment.
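
The load, filter and dump steps can be mimicked in a few lines of Python. This is a rough analogy rather than Pig itself: the sample data, field names and helper functions are assumptions, and the Pig Latin in the comment is an illustrative script, not one taken from the source.

```python
import csv
import io

# Roughly equivalent Pig Latin, for illustration:
#   A = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray, age:int);
#   B = FILTER A BY age > 30;
#   DUMP B;

RAW = "ana,34\nbo,27\ncy,41\n"  # stands in for a file stored in HDFS


def load(text):
    """LOAD: parse comma-separated records into rows with named fields."""
    for name, age in csv.reader(io.StringIO(text)):
        yield {"name": name, "age": int(age)}


def filter_by(rows, predicate):
    """FILTER: keep only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def dump(rows):
    """DUMP: render each row as a tuple-like string."""
    return [f"({row['name']},{row['age']})" for row in rows]


output = dump(filter_by(load(RAW), lambda row: row["age"] > 30))
print(output)
```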

Features of Apache Pig:

 Extensibility – For special-purpose processing, users can create their own functions.

 Optimization opportunities – Pig allows the system to optimize execution automatically.
This allows the user to pay attention to semantics instead of efficiency.

 Handles all kinds of data – Pig analyzes both structured and unstructured data.

6. HBase

Apache HBase is a distributed database designed to store structured data in tables that
can have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL
database built on top of HDFS. It provides real-time read and write access to data in HDFS.

Components of HBase

There are two HBase components: HBase Master and RegionServer.

i. HBase Master

It is not part of the actual data storage but negotiates load balancing across all
RegionServers.

 Maintains and monitors the Hadoop cluster.

 Performs administration (interface for creating, updating and deleting tables).

 Controls failover.

 Handles DDL operations.

ii. RegionServer

It is the worker node that handles read, write, update and delete requests from clients.
The RegionServer process runs on every node in the Hadoop cluster, alongside the HDFS
DataNode.
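
The split of work between the master and the RegionServers can be sketched as a routing table. The example assumes, beyond what the text above states, that a table is divided into contiguous row-key ranges (regions) and that each region is assigned to one RegionServer. The `ToyMaster` class and the server names are hypothetical, and in a real deployment clients locate regions through a catalog table rather than by asking the master; the lookup is folded into one class here for brevity.

```python
import bisect


class ToyMaster:
    """Assigns row-key ranges (regions) to RegionServers in round-robin order."""

    def __init__(self, split_keys, servers):
        # Regions: keys below split_keys[0], then [split_keys[0], split_keys[1]), and so on.
        self.split_keys = sorted(split_keys)
        region_count = len(self.split_keys) + 1
        self.assignment = [servers[i % len(servers)] for i in range(region_count)]

    def server_for(self, row_key):
        """Find the region containing row_key and return the server that hosts it."""
        region = bisect.bisect_right(self.split_keys, row_key)
        return self.assignment[region]


master = ToyMaster(split_keys=["g", "p"], servers=["rs1", "rs2"])
for key in ["apple", "kiwi", "zebra"]:
    print(key, "->", master.server_for(key))
```

The master never touches the row data itself; it only decides which server is responsible for which range.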

7. HCatalog

HCatalog is a table and storage management layer for Hadoop. It lets different components
of the Hadoop ecosystem, such as MapReduce, Hive, and Pig, easily read and write data
on the cluster. HCatalog is a key component of Hive that enables users to store their data
in any format and structure.
By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile and ORC file formats.

Benefits of HCatalog:

 Enables notifications of data availability.

 With the table abstraction, HCatalog frees the user from the overhead of data storage.

 Provides visibility for data cleaning and archiving tools.

8. Apache Mahout

Mahout is an open source framework for creating scalable machine learning algorithms and a
data mining library. Once data is stored in HDFS, Mahout provides the data science tools to
automatically find meaningful patterns in those big data sets.

Algorithms of Mahout are:

 Clustering – Takes items in a particular class and organizes them into naturally
occurring groups, such that items belonging to the same group are similar to each other.

 Collaborative filtering – Mines user behavior and makes product recommendations
(e.g. Amazon recommendations).

 Classification – Learns from existing categorizations and then assigns unclassified
items to the best category.

 Frequent pattern mining – Analyzes items in a group (e.g. items in a shopping cart or
terms in a query session) and then identifies which items typically appear together.
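
Frequent pattern mining, the last item above, can be illustrated with a plain co-occurrence count over shopping carts. This is a minimal sketch of the idea, not Mahout's API or its algorithms; the function name, the sample carts and the `min_support` threshold are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count how often each pair of items shares a basket; keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


carts = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "milk", "jam"],
]
pairs = frequent_pairs(carts, min_support=2)
print(pairs)
```

Counting every pair this way grows quickly with basket size, which is one reason scalable libraries use more specialized algorithms.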

9. Apache Sqoop

Sqoop imports data from external sources into related Hadoop ecosystem components such as
HDFS, HBase or Hive. It also exports data from Hadoop to external sources. Sqoop works
with relational databases such as Teradata, Netezza, Oracle and MySQL.

Features of Apache Sqoop:

 Import sequential datasets from mainframe – Sqoop satisfies the growing need to
move data from the mainframe to HDFS.

 Import direct to ORC files – Improves compression and lightweight indexing, and
improves query performance.

 Parallel data transfer – For faster performance and optimal system utilization.

 Efficient data analysis – Improves the efficiency of data analysis by combining structured
and unstructured data in a schema-on-read data lake.

 Fast data copies – From an external system into Hadoop.
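
Parallel data transfer relies on dividing the source table into ranges that separate workers can copy at the same time. The sketch below shows one simple way to cut a numeric key range into contiguous chunks. It is an assumption-laden illustration: the function name is hypothetical and Sqoop's own splitting logic may differ in detail.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Split the inclusive key range [min_id, max_id] into contiguous chunks."""
    total = max_id - min_id + 1
    size, extra = divmod(total, num_mappers)
    ranges, start = [], min_id
    for i in range(num_mappers):
        # The first `extra` chunks take one additional key each.
        length = size + (1 if i < extra else 0)
        if length == 0:
            break  # more workers than keys: stop once every key is covered
        ranges.append((start, start + length - 1))
        start += length
    return ranges


splits = split_ranges(1, 10, 4)
print(splits)
```

Each tuple would become the bounds of one worker's query, for example a WHERE clause on the key column, so the chunks can be imported concurrently.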

10. Apache Flume

Flume efficiently collects, aggregates and moves large amounts of data from its origin
into HDFS. It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem
component lets data flow from the source into the Hadoop environment. It uses a simple,
extensible data model that allows for online analytic applications. Using Flume, we can get
data from multiple servers into Hadoop immediately.
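
The flow from several servers into Hadoop can be pictured as sources feeding a buffer that a sink drains in batches. The source, channel and sink terms come from Flume's own vocabulary rather than from the text above, and everything else in this sketch (class and function names, event fields, the list standing in for HDFS) is hypothetical.

```python
from collections import deque


class Channel:
    """In-memory buffer that decouples sources from the sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take_batch(self, n):
        batch = []
        while self._events and len(batch) < n:
            batch.append(self._events.popleft())
        return batch


def source(server_name, lines, channel):
    """Source: wrap each log line from one server as an event."""
    for line in lines:
        channel.put({"host": server_name, "body": line})


def sink(channel, store, batch_size=2):
    """Sink: drain the channel in batches into the destination store."""
    while batch := channel.take_batch(batch_size):
        store.extend(batch)


channel, hdfs_store = Channel(), []  # the list stands in for files in HDFS
source("web1", ["GET /", "GET /a"], channel)
source("web2", ["POST /b"], channel)
sink(channel, hdfs_store)
print([event["host"] for event in hdfs_store])
```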

11. Ambari

Ambari is a management platform for provisioning, managing, monitoring and securing Apache
Hadoop clusters. Hadoop management gets simpler because Ambari provides a consistent,
secure platform for operational control.

Features of Ambari:

 Simplified installation, configuration, and management – Ambari easily and
efficiently creates and manages clusters at scale.

 Centralized security setup – Ambari reduces the complexity of administering and
configuring cluster security across the entire platform.

 Highly extensible and customizable – Ambari is highly extensible for bringing custom
services under management.

 Full visibility into cluster health – Ambari ensures that the cluster is healthy and
available with a holistic approach to monitoring.

12. Zookeeper

Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for maintaining
configuration information, naming, providing distributed synchronization, and providing group
services. Zookeeper manages and coordinates a large cluster of machines.

Features of Zookeeper:

 Fast – Zookeeper is fast with workloads where reads are more common than writes. It
performs best at read/write ratios of around 10:1.

 Ordered – Zookeeper maintains an ordered record of all transactions.
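
The ordering property can be sketched as a store that stamps every write with an increasing transaction number. This is a toy illustration and not the ZooKeeper client API: the `OrderedLog` class, its methods and the example paths are all hypothetical.

```python
import itertools


class OrderedLog:
    """Toy key/value store that records every write in a totally ordered log."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.data = {}
        self.log = []

    def write(self, path, value):
        txn_id = next(self._next_id)  # each write gets the next number in sequence
        self.log.append((txn_id, path, value))
        self.data[path] = value
        return txn_id

    def read(self, path):
        """Reads are served from current state and add nothing to the log."""
        return self.data.get(path)


zk = OrderedLog()
first = zk.write("/config/mode", "batch")
second = zk.write("/config/mode", "streaming")
print(first, second, zk.read("/config/mode"))
```

Because reads never append to the log, a read-heavy workload puts little pressure on the ordered write path, which fits the read/write observation above.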
