Unit 3
Modules of Hadoop
HDFS: Hadoop Distributed File System. HDFS was developed on the basis of
Google's GFS paper. Files are broken into blocks and stored on nodes
across the distributed architecture.
YARN: Yet Another Resource Negotiator is used for job scheduling and for
managing the cluster's resources.
Hadoop Common: These Java libraries are used to start Hadoop and
are used by other Hadoop modules.
Features of Hadoop
1. Open Source:
Hadoop is open-source, which means it is free to use. Since it is an
open-source project, the source code is available online for anyone to
understand it or modify it as per their industry requirements.
6. Flexible:
Hadoop is designed in such a way that it can deal with any kind of
dataset, such as structured (MySQL data), semi-structured (XML, JSON),
and unstructured (images and videos) data, very efficiently. This means
it can easily process any kind of data independent of its structure,
which makes it highly flexible.
7. Easy to Use:
Hadoop is easy to use since developers need not worry about any of the
distributed processing work, which is managed by Hadoop itself. The
Hadoop ecosystem is also very large and comes with lots of tools like
Hive, Pig, Spark, HBase, Mahout, etc.
Hadoop Architecture
Hadoop is a framework written in Java that utilizes a large cluster of
commodity hardware to store and process big data. Hadoop works on the
MapReduce programming model that was introduced by Google. Today many
big-brand companies use Hadoop in their organizations to deal with big
data, e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture
mainly consists of 4 components:
1. MapReduce
2. HDFS (Hadoop Distributed File System)
3. YARN (Yet Another Resource Negotiator)
4. Common Utilities or Hadoop Common
Architecture of Hadoop (diagram: distributed processing with MapReduce)
1. MapReduce
MapReduce is the distributed processing layer of Hadoop: a job is split
into Map tasks that process blocks of input data in parallel and Reduce
tasks that aggregate the mapped output.
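As an illustration (a minimal sketch, not taken from the source material), the classic word-count job below shows the Map and Reduce phases using the Hadoop Java API. The class name WordCount and the input/output paths passed on the command line are assumptions made for the example.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}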
2. HDFS (Hadoop Distributed File System)
In HDFS, the NameNode stores metadata such as the file name, file size,
and the location (block numbers, block IDs) of the DataNodes holding
each block, which lets it direct clients to the closest DataNode for
faster communication. The NameNode also instructs the DataNodes to carry
out operations like create, delete, and replicate.
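To make the block and metadata behaviour concrete, here is a small sketch (an assumption-laden example, not from the source) that writes a file to HDFS through the Java FileSystem API and then asks the NameNode which DataNodes hold its blocks. The NameNode address and file path are hypothetical.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; in a real cluster this comes from core-site.xml (fs.defaultFS).
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/notes.txt"); // hypothetical path

    // The client writes an ordinary stream; HDFS splits it into blocks and replicates them.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // The NameNode's metadata tells us which DataNodes hold each block of the file.
    for (BlockLocation block : fs.getFileBlockLocations(fs.getFileStatus(file), 0, Long.MAX_VALUE)) {
      System.out.println(Arrays.toString(block.getHosts()));
    }

    fs.close();
  }
}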
3. YARN (Yet Another Resource Negotiator)
Features of YARN
1. Multi-Tenancy
2. Scalability
3. Cluster-Utilization
4. Compatibility
4. Common Utilities or Hadoop Common
Hadoop Common, or the common utilities, is the set of Java libraries and
files that all the other components in a Hadoop cluster need. These
utilities are used by HDFS, YARN, and MapReduce for running the cluster.
Hadoop Common works on the assumption that hardware failure in a Hadoop
cluster is common, so failures need to be handled automatically in
software by the Hadoop framework.
1. Flume
To perform data management using Flume, you first need to define the
sources of the data that you want to collect. Flume supports various
sources such as syslog, Avro, Thrift, and HTTP.
Next, you need to define the channel that you want to use for storing
the collected data temporarily before it is processed by the sink. Flume
supports various channels such as memory, JDBC, and file-based
channels.
Finally, you need to define the sink that you want to use for storing the
collected data permanently. Flume supports various sinks such as HDFS,
HBase, and Kafka.
A Flume agent is a JVM process with three components (Flume Source,
Flume Channel, and Flume Sink) through which events propagate after
being initiated at an external source.
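As a hedged illustration of how source, channel, and sink fit together, a Flume agent is normally described in a properties file like the sketch below; the agent name agent1, the netcat source, and the HDFS path are assumptions made for this example.

# Hypothetical agent "agent1" with one source, one memory channel, and one HDFS sink.
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = snk1

# Source: listen for newline-separated events on a local port (illustrative values).
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory until the sink drains them.
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Sink: write the events permanently to HDFS (path is an assumption).
agent1.sinks.snk1.type = hdfs
agent1.sinks.snk1.hdfs.path = hdfs://namenode:9000/flume/events
agent1.sinks.snk1.hdfs.fileType = DataStream
agent1.sinks.snk1.channel = ch1

Such an agent would typically be started with the flume-ng command-line tool, pointing it at this file and at the agent name agent1.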
Overall, Flume is a powerful tool for data management in big data
environments. It provides a flexible and scalable architecture that can
be customized to meet the specific needs of your data management
tasks.
2. Oozie
Apache Oozie is a workflow scheduler for Hadoop. It is a system which
runs workflows of dependent jobs. Here, users are permitted to create
Directed Acyclic Graphs (DAGs) of workflow actions, which can be run in
parallel or sequentially in Hadoop.
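For illustration only (a sketch under assumed names, not from the source), an Oozie workflow DAG is defined in an XML file, commonly called workflow.xml. The action name and the ${jobTracker}, ${nameNode}, and ${queueName} parameters below are placeholders that would normally be supplied from a job.properties file.

<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
  <start to="wordcount-node"/>

  <!-- A single MapReduce action; its outcome decides the next transition. -->
  <action name="wordcount-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>${queueName}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>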
3. Zookeeper
Apache Zookeeper is an open-source distributed coordination service
that helps to manage a large set of hosts. Management and coordination
in a distributed environment are tricky. Zookeeper automates this
process and allows developers to focus on building software features
rather than worrying about the distributed nature of the application.
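As a small, hedged example of the kind of coordination primitive Zookeeper provides (not from the source), the Java client below creates a znode that other processes in the cluster can read and watch; the connection string and znode path are assumptions made for the example.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
  public static void main(String[] args) throws Exception {
    // Connect to a (hypothetical) Zookeeper ensemble with a 3000 ms session timeout.
    // A production client would wait for the connection event before issuing requests.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> {});

    // Create a znode holding a small piece of shared configuration.
    String path = zk.create("/demo-config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Other hosts can read (and watch) the same znode to stay coordinated.
    byte[] data = zk.getData(path, false, null);
    System.out.println(new String(data));

    zk.close();
  }
}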
4. Hive
Hive is a data warehouse system which is used to analyze structured
data. It is built on top of Hadoop and was developed by Facebook. The
Hive architecture has three main parts:
1. Hive Clients
2. Hive Services
3. Hive Storage and Computing
Using Hive, we can avoid the traditional approach of writing complex
MapReduce programs. Hive supports Data Definition Language (DDL), Data
Manipulation Language (DML), and User Defined Functions (UDFs).
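As an illustration of DDL and querying through Hive (a hedged sketch, not from the source), the Java JDBC client below talks to a hypothetical HiveServer2 instance; the host name, database, table, and columns are assumptions made for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 usually listens on port 10000; host and credentials here are assumptions.
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://hiveserver:10000/default", "demo", "");
    Statement stmt = con.createStatement();

    // DDL: define a table over data already sitting in HDFS.
    stmt.execute("CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING) "
        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

    // DML/query: Hive compiles this into distributed jobs behind the scenes.
    ResultSet rs = stmt.executeQuery(
        "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url");
    while (rs.next()) {
      System.out.println(rs.getString("url") + " -> " + rs.getLong("hits"));
    }

    rs.close();
    stmt.close();
    con.close();
  }
}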
5. Pig
Pig represents big data as data flows. Pig is a high-level platform or
tool which is used to process large datasets. It provides a high level
of abstraction over MapReduce, along with a high-level scripting
language, known as Pig Latin, which is used to develop data analysis
code. First, to process the data stored in HDFS, the programmers write
scripts using the Pig Latin language. Internally, the Pig Engine (a
component of Apache Pig) converts these scripts into specific map and
reduce tasks, but this is not visible to the programmers, in order to
provide a high level of abstraction. Pig Latin and the Pig Engine are
the two main components of the Apache Pig tool. The result of Pig is
always stored in HDFS.
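For illustration (a hedged sketch, not from the source), the Pig Latin script below counts words; the input and output HDFS paths are assumptions, and the Pig Engine would translate these statements into map and reduce tasks behind the scenes.

-- Load raw lines from HDFS (path is hypothetical).
lines   = LOAD '/user/demo/input.txt' AS (line:chararray);

-- Split each line into words and flatten them into one word per record.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words and count each group.
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;

-- The result of Pig is stored back into HDFS.
STORE counts INTO '/user/demo/wordcount';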
6. Avro
To transfer data over a network or store it persistently, you need to
serialize the data. In addition to the serialization APIs provided by
Java and Hadoop, there is a special utility called Avro, a schema-based
serialization technique.
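As a hedged sketch of schema-based serialization with Avro's Java API (not from the source), the record schema and the output file name below are assumptions made for the example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // A minimal Avro schema, defined inline for the example.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    // Build a record that conforms to the schema.
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 30);

    // Serialize it to an Avro container file (file name is an assumption).
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("users.avro"));
      writer.append(user);
    }
  }
}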
HBase
In HBase, tables are split into regions and are served by the region servers.
Regions are vertically divided by column families into “Stores”. Stores are
saved as files in HDFS. The main components of the HBase architecture are described below.
The term ‘store’ is used for regions to explain the storage structure.
HBase has three major components: the client library, a master server,
and region servers. Region servers can be added or removed as per
requirement.
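To show how the client library talks to the region servers (a hedged sketch, not from the source), the Java client below writes and reads one cell; the Zookeeper quorum, table name page_views, and column family cf are assumptions, and the table is assumed to exist already. The client locates the region server responsible for the given row key automatically.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // hypothetical quorum

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("page_views"))) {

      // Write one cell: row key "row1", column family "cf", qualifier "url".
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("url"), Bytes.toBytes("/home"));
      table.put(put);

      // Read it back; the region server holding "row1" serves the request.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("url"))));
    }
  }
}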
MasterServer
1. Assigns regions to the region servers and takes the help of Apache
ZooKeeper for this task.
2. Handles load balancing of the regions across region servers. It
unloads the busy servers and shifts the regions to less occupied
servers.
3. Maintains the state of the cluster by negotiating the load
balancing.
4. Is responsible for schema changes and other metadata operations
such as creation of tables and column families.
Regions
Regions are nothing but tables that are split up and spread across the
region servers.
Region server