Unit 3

Hadoop is an open-source framework that allows for the distributed storage and processing of large datasets across clusters of computers. It uses HDFS for storage, MapReduce as its processing framework, and YARN for resource management. Hadoop provides advantages like scalability, fault tolerance, flexibility with different data types, and cost effectiveness. Common tools used with Hadoop include Flume for data ingestion, Oozie for workflow scheduling, and Zookeeper for coordination across nodes.


Introduction to Hadoop

Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very large in volume. Hadoop is written in Java and is not used for OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up simply by adding nodes to the cluster.

Modules of Hadoop
HDFS: Hadoop Distributed File System. Google published its paper on GFS (Google File System), and HDFS was developed on the basis of it. It states that files are broken into blocks and stored on nodes over the distributed architecture.

YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.

MapReduce: This is a framework which helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set which can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.

Hadoop Common: These Java libraries are used to start Hadoop and are used by the other Hadoop modules.

Features of Hadoop
1. Open Source:
Hadoop is open-source, which means it is free to use. Since it is an open-source project, the source code is available online for anyone to understand it or modify it as per their industry requirements.

2. Highly Scalable Cluster:

Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements. In a traditional RDBMS (Relational Database Management System), the system cannot be scaled to handle large amounts of data.

3. Fault Tolerance is Available:

Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In Hadoop, data is replicated on various DataNodes in the cluster, which ensures the availability of data if any of the systems crashes. If the machine you are reading from faces a technical issue, the data can also be read from other nodes in the Hadoop cluster, because the data is copied or replicated by default.

4. High Availability is provided:

Fault tolerance provides high availability in the Hadoop cluster. High availability means the availability of data on the Hadoop cluster. Due to fault tolerance, if any DataNode goes down, the same data can be retrieved from any other node where the data is replicated. A highly available Hadoop cluster also has two or more NameNodes, i.e., an Active NameNode and a Passive NameNode, also known as a standby NameNode.

5. Cost-Effective:

Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with Big Data. The problem with traditional relational databases is that storing massive volumes of data is not cost-effective, so companies started to discard the raw data.

6. Hadoop Provide Flexibility:

Hadoop is designed in such a way that it can deal with any kind of dataset, such as structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and videos), very efficiently. This means it can easily process any kind of data independent of its structure, which makes it highly flexible.

7. Easy to Use:

Hadoop is easy to use since developers need not worry about any of the distributed processing work, as it is managed by Hadoop itself. The Hadoop ecosystem is also very large and comes with lots of tools like Hive, Pig, Spark, HBase, Mahout, etc.

8. Hadoop uses Data Locality:

The concept of data locality is used to make Hadoop processing fast. In the data locality concept, the computation logic is moved near the data rather than moving the data to the computation logic. Moving data in HDFS is the costliest operation, and with the help of the data locality concept, bandwidth utilization in the system is minimized.

Hadoop Architecture
Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to maintain and store very large data. Hadoop works on the MapReduce programming algorithm that was introduced by Google. Today, lots of big-brand companies use Hadoop in their organizations to deal with big data, e.g., Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of 4 components.

1. MapReduce
2. HDFS (Hadoop Distributed File System)
3. YARN (Yet Another Resource Negotiator)
4. Common Utilities or Hadoop Common

Architecture of Hadoop (component roles):

1. MapReduce: distributed processing
2. HDFS: distributed storage
3. YARN (Yet Another Resource Negotiator): job scheduling and resource management
4. Hadoop Common: Java libraries and utilities

1. MapReduce

MapReduce is nothing but an algorithm, or programming model, that runs on top of the YARN framework. The major feature of MapReduce is that it performs distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop work so fast. When you are dealing with Big Data, serial processing is no longer of any use. MapReduce has mainly 2 tasks, which are divided phase-wise:

In the first phase, Map is used, and in the next phase, Reduce is used.
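To make the two phases concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API. The mapper emits (word, 1) pairs and the reducer sums them; the input and output paths taken from the command line are illustrative assumptions.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: turn each line of input into (word, 1) key-value pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner is optional and reuses the reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job is typically launched with the hadoop jar command, and YARN schedules its map and reduce tasks across the cluster.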

2. HDFS (Hadoop Distributed File System)


HDFS (Hadoop Distributed File System) is used as the storage layer. It is mainly designed for working on commodity hardware (inexpensive devices) and follows a distributed file system design. HDFS is designed in such a way that it prefers storing data in large blocks rather than storing many small blocks.

HDFS in Hadoop provides fault tolerance and high availability to the storage layer and the other devices present in that Hadoop cluster.

 Data storage Nodes in HDFS.


1. NameNode (Master)
2. DataNode (Slave)

NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing metadata, i.e., data about the data. The metadata can be the transaction logs that keep track of the user's activity in the Hadoop cluster.

Metadata can also be the name of a file, its size, and the information about the location (block number, block IDs) of the DataNodes, which the NameNode stores to find the closest DataNode for faster communication. The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.

DataNode: DataNodes work as slaves. DataNodes are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store. It is therefore advised that DataNodes have a high storage capacity so they can store a large number of file blocks.
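As a rough illustration of how a client interacts with this NameNode/DataNode layer, the sketch below uses the Hadoop FileSystem Java API to write and then read back a small file. The fs.defaultFS address and the file path are placeholder assumptions; in a real deployment they come from the cluster configuration files.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; in practice this comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/user/demo/hello.txt");

      // Write: the client asks the NameNode where to place the blocks,
      // then streams the data to the chosen DataNodes.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
      }

      // Read: the NameNode supplies the metadata (block IDs and locations),
      // and the bytes are fetched from the DataNodes holding the replicas.
      try (FSDataInputStream in = fs.open(file)) {
        IOUtils.copyBytes(in, System.out, 4096, false);
      }
    }
  }
}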
3. YARN (Yet Another Resource Negotiator)

YARN is the framework on which MapReduce works. YARN performs 2 operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which job is important, which job has more priority, the dependencies between jobs, and other information like job timing, etc. The resource manager manages all the resources that are made available for running the Hadoop cluster.
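As a small illustration of the resource-management side, the following sketch uses the YarnClient Java API to connect to the ResourceManager and list the applications it is tracking. The ResourceManager address shown is a placeholder assumption that would normally be supplied by yarn-site.xml.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    // Placeholder ResourceManager address; usually configured in yarn-site.xml.
    conf.set(YarnConfiguration.RM_ADDRESS, "resourcemanager-host:8032");

    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();
    try {
      // Ask the ResourceManager for the applications it knows about.
      List<ApplicationReport> apps = yarnClient.getApplications();
      for (ApplicationReport app : apps) {
        System.out.println(app.getApplicationId() + "  "
            + app.getName() + "  " + app.getYarnApplicationState());
      }
    } finally {
      yarnClient.stop();
    }
  }
}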

 Features of YARN
1. Multi-Tenancy
2. Scalability
3. Cluster-Utilization
4. Compatibility

4. Hadoop common or Common Utilities

Hadoop Common, or Common Utilities, is nothing but the Java libraries and Java files (Java scripts) that are needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so failures need to be handled automatically in software by the Hadoop framework.

Introduction to Data Management and Data Access tools


1. Data Management using Flume
Flume is a distributed system used for efficiently collecting,
aggregating, and moving large amounts of streaming data from various
sources to a centralized data store. It is commonly used for log
aggregation in big data environments.

To perform data management using Flume, you first need to define the
sources of the data that you want to collect. Flume supports various
sources such as syslog, Avro, Thrift, and HTTP.

Next, you need to define the channel that you want to use for storing
the collected data temporarily before it is processed by the sink. Flume
supports various channels such as memory, JDBC, and file-based
channels.

Finally, you need to define the sink that you want to use for storing the
collected data permanently. Flume supports various sinks such as HDFS,
HBase, and Kafka.
A Flume agent is a JVM process which has 3 components (Flume Source, Flume Channel, and Flume Sink) through which events propagate after being initiated at an external source.
Overall, Flume is a powerful tool for data management in big data
environments. It provides a flexible and scalable architecture that can
be customized to meet the specific needs of your data management
tasks.
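For example, an application can hand events to a running Flume agent through the Flume client SDK. The sketch below is a minimal illustration that assumes a hypothetical agent with an Avro source listening on flume-host:41414; the host, port, and event body are placeholder values.

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeEventSender {
  public static void main(String[] args) {
    // Connect to a Flume agent whose Avro source listens on this host/port (assumed values).
    RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
    try {
      // Build an event and append it; the agent's channel buffers it
      // until the sink (e.g. HDFS) writes it out.
      Event event = EventBuilder.withBody("sample log line", StandardCharsets.UTF_8);
      client.append(event);
    } catch (EventDeliveryException e) {
      // The event could not be delivered; a real client would retry or rebuild the connection.
      e.printStackTrace();
    } finally {
      client.close();
    }
  }
}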

2. Oozie
Apache Oozie is a workflow scheduler for Hadoop. It is a system which
runs the workflow of dependent jobs. Here, users are permitted to
create Directed Acyclic Graphs of workflows, which can be run in
parallel and sequentially in Hadoop.
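The workflow itself is described in an XML file stored in HDFS. As a rough sketch, assuming a workflow application has already been uploaded to the hypothetical HDFS path below and that the Oozie server runs at the placeholder URL shown, the Oozie Java client can submit and start it as follows.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitOozieWorkflow {
  public static void main(String[] args) throws Exception {
    // Placeholder Oozie server URL.
    OozieClient oozieClient = new OozieClient("http://oozie-host:11000/oozie");

    // Job properties; the application path points at a workflow.xml already in HDFS (assumed).
    Properties conf = oozieClient.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode-host:9000/user/demo/my-workflow");
    conf.setProperty("nameNode", "hdfs://namenode-host:9000");
    conf.setProperty("jobTracker", "resourcemanager-host:8032");

    // Submit and start the workflow, then check its status.
    String jobId = oozieClient.run(conf);
    WorkflowJob job = oozieClient.getJobInfo(jobId);
    System.out.println("Workflow " + jobId + " is " + job.getStatus());
  }
}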

3. Zookeeper
Apache Zookeeper is an open-source distributed coordination service that helps to manage a large set of hosts. Management and coordination in a distributed environment are tricky. Zookeeper automates this process and allows developers to focus on building software features rather than worrying about the distributed nature of the application.

Zookeeper is thus an important part of Hadoop that takes care of these small but important matters so that the developer can focus more on the application's functionality. Zookeeper helps you maintain configuration information, naming, and group services for distributed applications. It implements different protocols on the cluster so that applications do not have to implement them on their own. It provides a single coherent view of multiple machines.
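As a minimal sketch of how an application might use Zookeeper for shared configuration, the code below connects to a hypothetical ensemble, creates a znode holding one setting, and reads it back; the connection string, znode path, and value are illustrative assumptions.

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperConfigExample {
  public static void main(String[] args) throws Exception {
    // Wait until the session with the ensemble is actually established.
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Store a piece of configuration as a persistent znode (path and value are assumptions).
    String path = "/batch-size";
    if (zk.exists(path, false) == null) {
      zk.create(path, "100".getBytes(StandardCharsets.UTF_8),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any node in the cluster can now read the same value.
    byte[] data = zk.getData(path, false, null);
    System.out.println("batch-size = " + new String(data, StandardCharsets.UTF_8));

    zk.close();
  }
}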

4. Hive
Hive is a data warehouse system which is used to analyze structured data. It is built on top of Hadoop. It was developed by Facebook.

Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.

Apache Hive architecture consists mainly of three components:

1. Hive Client
2. Hive Services
3. Hive Storage and Computing

Using Hive, we can skip the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDF).
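As a small sketch of a typical access path from Java, the example below runs an HQL query through the standard Hive JDBC driver. The HiveServer2 address, credentials, and the employees table are placeholder assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Hive JDBC driver (provided by the hive-jdbc artifact).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Placeholder HiveServer2 URL, database, and credentials.
    try (Connection conn = DriverManager.getConnection(
            "jdbc:hive2://hiveserver2-host:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // HQL query; Hive translates it into MapReduce (or Tez/Spark) jobs behind the scenes.
      try (ResultSet rs = stmt.executeQuery(
              "SELECT department, COUNT(*) FROM employees GROUP BY department")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}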

5. Pig
Pig represents Big Data as data flows. Pig is a high-level platform or tool which is used to process large datasets. It provides a high level of abstraction for processing over MapReduce. It provides a high-level scripting language, known as Pig Latin, which is used to develop data analysis code. To process data stored in HDFS, programmers write scripts using the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts into specific map and reduce tasks, but these are not visible to the programmers, in order to provide a high level of abstraction. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
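As a rough sketch, Pig Latin statements can also be driven from Java through the PigServer API. The example below runs in local mode and assumes a hypothetical input file input.txt with one line of text per record; the aliases and output name are illustrative.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinExample {
  public static void main(String[] args) throws Exception {
    // Local mode for illustration; ExecType.MAPREDUCE would run on the cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Register Pig Latin statements; the Pig Engine turns them into map and reduce tasks.
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("filtered = FILTER lines BY line MATCHES '.*error.*';");

    // Store the result; on a cluster this would land in HDFS.
    pig.store("filtered", "error_lines_out");

    pig.shutdown();
  }
}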
6. Avro
To transfer data over a network or for its persistent storage, you need to serialize the data. In addition to the serialization APIs provided by Java and Hadoop, there is a special utility called Avro, a schema-based serialization technique.

Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop's Writable classes lack language portability, Avro becomes quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool to serialize data in Hadoop.

Avro is a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes data together with its schema into a compact binary format, which can be deserialized by any application.

Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby. The key feature of Avro is that it can efficiently handle changes in the data schema over time, i.e., schema evolution. It handles schema changes like missing fields, added fields, and changed fields.
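A minimal sketch of schema-based serialization with the Avro generic API follows; the JSON schema, field values, and file name are illustrative assumptions.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
  public static void main(String[] args) throws Exception {
    // A language-independent schema declared in JSON.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

    // Build a record that conforms to the schema.
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Asha");
    user.put("age", 30);

    // Serialize to a compact binary Avro container file (the schema is embedded in the file).
    File file = new File("users.avro");
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(user);
    }

    // Deserialize: any application (in any supported language) can read the file back.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      while (reader.hasNext()) {
        System.out.println(reader.next());
      }
    }
  }
}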

7. SQOOP for data access


Apache Sqoop, a command-line interface tool, moves data between relational databases and Hadoop. It is used to export data from the Hadoop file system to relational databases and to import data from relational databases such as MySQL and Oracle into the Hadoop file system.

Apache Sqoop is a component of the Hadoop ecosystem. A specialized tool was needed to perform this process quickly because a lot of data had to be moved from relational database systems into Hadoop. This is when Apache Sqoop entered the scene, and it is now widely used for moving data from RDBMS tables into the Hadoop ecosystem for MapReduce processing and other uses.

HBase
In HBase, tables are split into regions and are served by region servers. Regions are vertically divided by column families into “Stores”. Stores are saved as files in HDFS. The term ‘store’ is used for regions to explain the storage structure.

HBase has three major components: the client library, a master server, and region servers. Region servers can be added or removed as per requirement.
MasterServer

1. Assigns regions to the region servers and takes the help of Apache
ZooKeeper for this task.
2. Handles load balancing of the regions across region servers. It
unloads the busy servers and shifts the regions to less occupied
servers.
3. Maintains the state of the cluster by negotiating the load
balancing.
4. Is responsible for schema changes and other metadata operations
such as creation of tables and column families.

Regions

Regions are nothing but tables that are split up and spread across the
region servers.

Region server

The region servers have regions that:

1. Communicate with the client and handle data-related operations.
2. Handle read and write requests for all the regions under them.
3. Decide the size of the region by following the region size thresholds.
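As a minimal sketch of how a client talks to these components, the code below uses the HBase Java client to write and read one cell. The ZooKeeper quorum, table name, and column family are placeholder assumptions, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // The client locates regions via ZooKeeper (placeholder quorum address).
    conf.set("hbase.zookeeper.quorum", "zk-host");

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row key "u1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("u1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
      table.put(put);

      // Read it back; the client library routes the request to the region server
      // serving the region that contains row "u1".
      Result result = table.get(new Get(Bytes.toBytes("u1")));
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println("name = " + Bytes.toString(value));
    }
  }
}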
