Hadoop Ecosystem
1. HDFS
HDFS is the most important component of the Hadoop ecosystem and its primary storage system. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for big data. It is a distributed file system that runs on commodity hardware. The default configuration is sufficient for many installations, but large clusters usually need additional configuration. Users interact directly with HDFS through shell-like commands.
HDFS Components:
There are two major components of Hadoop HDFS: the NameNode and the DataNode. Let's now discuss these HDFS components.
i. NameNode
It is also known as the Master node. The NameNode does not store the actual data or dataset; it stores metadata, i.e. the number of blocks, their locations, on which rack and on which DataNode the data is stored, and other details. This metadata consists of files and directories.
The NameNode executes file system operations such as naming, closing, and opening files and directories.
ii. DataNode
It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS. A DataNode performs read and write operations as requested by clients. Each block replica on a DataNode consists of two files on the local file system: the first file holds the data, and the second records the block's metadata, which includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and performs a handshake that verifies the namespace ID and software version of the DataNode. If a mismatch is found, the DataNode shuts down automatically.
DataNodes perform operations such as block replica creation, deletion, and replication according to the instructions of the NameNode.
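To illustrate this division of labour, the sketch below asks the NameNode for the block metadata of a file, while the blocks themselves stay on the DataNodes. It uses the standard FileSystem API; the path is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/hello.txt")); // placeholder

        // The NameNode answers this metadata query; the listed hosts are DataNodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}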
2. MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component that provides data processing. MapReduce is a software framework for easily writing applications that process the vast amounts of structured and unstructured data stored in the Hadoop Distributed File System.
MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.
Working of MapReduce
The Hadoop ecosystem component MapReduce works by breaking the processing into two phases:
Map phase
Reduce phase
Each phase has key-value pairs as input and output. In addition, the programmer specifies two functions: the map function and the reduce function.
The map function takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs).
The reduce function takes the output from the map as its input and combines those data tuples based on the key, modifying the value of the key accordingly.
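The classic word-count example makes the two functions concrete. This is a minimal sketch of the mapper and reducer only (the driver/job setup is omitted), and the class names are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input line is broken down into (word, 1) tuples.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: all values for the same key are combined into one count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}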
Features of MapReduce
Simplicity – MapReduce jobs are easy to run. Applications can be written in any language, such as Java, C++, and Python.
Speed – By means of parallel processing, problems that would take days to solve are solved in hours or minutes by MapReduce.
Fault Tolerance – MapReduce takes care of failures. If one copy of the data is unavailable, another machine has a copy of the same key-value pair that can be used to solve the same subtask.
3. YARN
Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component that provides resource management. YARN is also one of the most important components of the Hadoop ecosystem. YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads. It allows multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.
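As a small illustration of YARN's monitoring role, the sketch below lists the applications currently known to the ResourceManager through the YarnClient API; it assumes yarn-site.xml is available on the classpath:

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListApplications {
    public static void main(String[] args) throws Exception {
        // The ResourceManager address is taken from yarn-site.xml.
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());
        client.start();

        // One report per application submitted to the cluster.
        for (ApplicationReport app : client.getApplications()) {
            System.out.println(app.getApplicationId() + "  "
                    + app.getName() + "  " + app.getYarnApplicationState());
        }
        client.stop();
    }
}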
YARN has been projected as the data operating system of Hadoop 2. The main features of YARN are:
Efficiency – Many applications run on the same cluster, so the efficiency of Hadoop increases without much effect on quality of service.
Shared – Provides a stable, reliable, and secure foundation and shared operational services across multiple workloads. Additional programming models, such as graph processing and iterative modeling, are now possible for data processing.
4. Hive
The Hadoop ecosystem component Apache Hive is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hive performs three main functions: data summarization, querying, and analysis.
Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop.
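For example, a HiveQL aggregation can be submitted from Java over JDBC, and Hive then compiles it into jobs on the cluster. This is a minimal sketch assuming a HiveServer2 endpoint and a sales table, both of which are placeholders; the Hive JDBC driver must be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 endpoint and default database.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hive translates this SQL-like query into jobs that run on Hadoop.
             ResultSet rs = stmt.executeQuery(
                     "SELECT year, COUNT(*) FROM sales GROUP BY year")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}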
5. Pig
Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS. As a component of the Hadoop ecosystem, Pig uses the Pig Latin language, which is very similar to SQL. It loads the data, applies the required filters, and dumps the data in the required format. To execute programs, Pig requires a Java runtime environment.
Features of Apache Pig:
Extensibility – Users can create their own functions to carry out special-purpose processing.
Handles all kinds of data – Pig analyzes both structured and unstructured data.
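A minimal sketch of the load, filter, and dump flow described above, using Pig's embedded PigServer Java API; the input path, field names, and output path are placeholders:

import org.apache.pig.PigServer;

public class PigLoadFilterStore {
    public static void main(String[] args) throws Exception {
        // "mapreduce" runs the script on the cluster; "local" can be used for testing.
        PigServer pig = new PigServer("mapreduce");

        // Load the data and apply the required filter (Pig Latin statements).
        pig.registerQuery("logs = LOAD '/user/demo/access.log' USING PigStorage('\\t') "
                + "AS (ip:chararray, url:chararray, status:int);");
        pig.registerQuery("errors = FILTER logs BY status >= 500;");

        // Dump the filtered data back to HDFS in the required format.
        pig.store("errors", "/user/demo/errors");
        pig.shutdown();
    }
}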
6. HBase
Apache HBase is a Hadoop ecosystem component: a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access to read or write data in HDFS.
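A minimal read/write sketch with the HBase Java client API; the users table with column family info is assumed to exist already, and all names are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to find the cluster.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user1", column info:name.
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back in real time.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}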
Components of HBase
There are two HBase components, namely HBase Master and RegionServer.
i. HBase Master
It is not part of the actual data storage, but it negotiates load balancing across all RegionServers.
ii. RegionServer
It is the worker node that handles read, write, update, and delete requests from clients. The RegionServer process runs on every node in the Hadoop cluster, on the HDFS DataNode.
7. HCatalog
It is a table and storage management layer for Hadoop. HCatalog enables the different components of the Hadoop ecosystem, such as MapReduce, Hive, and Pig, to easily read and write data from the cluster. HCatalog is a key component of Hive and enables users to store their data in any format and structure.
By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats.
Benefits of HCatalog:
With the table abstraction, HCatalog frees the user from the overhead of data storage.
8. Apache Mahout
Mahout is an open source framework for creating scalable machine learning algorithms and a data mining library. Once data is stored in HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.
Clustering – Takes the items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
Frequent pattern mining – Analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and identifies which items typically appear together.
9. Apache Sqoop
Sqoop imports data from external sources into related Hadoop ecosystem components such as HDFS, HBase, or Hive. It also exports data from Hadoop to other external stores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.
Import sequential datasets from mainframe – Sqoop satisfies the growing need to move data from the mainframe to HDFS.
Import directly to ORC files – Improves compression and lightweight indexing and improves query performance.
Parallel data transfer – For faster performance and optimal system utilization.
10. Flume
Flume efficiently collects, aggregates, and moves large amounts of data from their origin and sends them back to HDFS. It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem component allows data to flow from the source into the Hadoop environment. It uses a simple extensible data model that allows for online analytic applications. Using Flume, we can get data from multiple servers into Hadoop immediately.
11. Ambari
Ambari is a management platform for provisioning, managing, monitoring, and securing Hadoop clusters.
Features of Ambari:
Centralized security setup – Ambari reduces the complexity of administering and configuring cluster security across the entire platform.
Highly extensible and customizable – Ambari is highly extensible for bringing custom
services under management.
Full visibility into cluster health – Ambari ensures that the cluster is healthy and
available with a holistic approach to monitoring.
12. ZooKeeper
Apache ZooKeeper is a centralized service and a Hadoop ecosystem component for maintaining configuration information, naming, and providing distributed synchronization and group services. ZooKeeper manages and coordinates a large cluster of machines.
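A minimal sketch of storing and reading a small piece of configuration with the ZooKeeper Java client; the connection string, znode path, and value are placeholders:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Placeholder connection string; 5-second session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small configuration value in a znode, then read it back.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=100".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}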
Features of ZooKeeper:
Fast – ZooKeeper is fast with workloads where reads of the data are more common than writes. The ideal read-to-write ratio is 10:1.