

UNIT II:
Big Data Technologies and Databases: Requirement of the Hadoop Framework - Design Principles of Hadoop - Comparison with Other Systems (SQL and RDBMS) - Hadoop Components and Architecture - Hadoop 1 vs Hadoop 2.

2.1 Requirement of Hadoop Framework


Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models.

The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers.

Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

2.2 Design Principles of Hadoop


Hadoop is built on the following design principles:

a) The system shall manage and heal itself
Automatically and transparently route around failures (fault tolerance)
Speculatively execute redundant copies of a task when certain nodes are detected to be slow

b) Performance shall scale linearly
Capacity changes proportionally with resource changes (scalability)

c) Computation should move to the data
Lower latency, lower bandwidth (data locality)

d) Simple core, modular and extensible (economical)
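The self-healing and speculative-execution principles above can be sketched in a toy simulation (this is not Hadoop's actual scheduler; the node records and estimated times are made up for illustration):

```python
def run_with_speculation(task, nodes, slow_threshold=2.0):
    """Toy model of speculative execution: if the scheduled node looks
    slow (a straggler), launch a redundant copy of the task on another
    node and keep whichever attempt would finish first."""
    primary, backup = nodes[0], (nodes[1] if len(nodes) > 1 else None)
    winner = primary
    if backup and primary["est_time"] > slow_threshold:
        # Both attempts compute the same result; take the faster node.
        winner = backup if backup["est_time"] < primary["est_time"] else primary
    return task(), winner["name"]

nodes = [{"name": "node-a", "est_time": 5.0},   # detected straggler
         {"name": "node-b", "est_time": 0.5}]
result, ran_on = run_with_speculation(lambda: 42, nodes)
print(result, ran_on)  # 42 node-b
```

The key property is that the redundant attempt is wasted work in the common case but bounds the damage a single slow node can do to the whole job.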

BIG DATA ANALYTICS DEPT.OF INFORMATION TECHNOLOGY



2.3 Comparison with other systems like SQL

Parameter-by-parameter comparison of Hadoop and SQL:

Architecture: Hadoop is an open-source framework in which data sets are distributed across computer/server clusters with parallel data processing features. SQL stands for Structured Query Language; it is a domain-specific language used to handle database management operations in relational databases.

Operations: Hadoop is used for storing, processing, retrieving, and pattern extraction from data across a wide range of formats like XML, text, JSON, etc. SQL is used to store, process, retrieve, and pattern-mine data stored in a relational database only.

Data Type / Data Update: Hadoop handles both structured and unstructured data formats. For data updates, Hadoop writes data once but reads it multiple times. SQL works only for structured data but, unlike Hadoop, data can be written and read multiple times.

Data Volume Processed: Hadoop is developed for Big Data, hence it usually handles data volumes up to terabytes and petabytes. SQL works better on low volumes of data, usually in gigabytes.

Data Storage: Hadoop stores data in the form of key-value pairs, hashes, maps, tables, etc. in distributed systems with dynamic schemas. SQL stores structured data in a tabular format using tables only, with fixed schemas.

Schema Structure: Hadoop supports a dynamic schema structure. SQL supports a static schema structure.

Data Structures Supported: Hadoop supports NoSQL data structures, columnar data structures, etc., meaning you will have to provide code for implementation or for rolling back during a transaction. SQL works on the properties of Atomicity, Consistency, Isolation, and Durability (ACID), which are fundamental to an RDBMS.

Fault Tolerance: Hadoop is highly fault-tolerant. SQL has good fault tolerance.

Availability: As Hadoop uses the notion of distributed computing and the principle of MapReduce, it handles data availability on multiple systems across multiple geo-locations. SQL-supporting databases are usually available on-premises or in the cloud, so they lack the benefits of distributed computing.

Integrity: Hadoop has low integrity. SQL has high integrity.

Scaling: Scaling a Hadoop-based system requires connecting computers over the network; horizontal scaling with Hadoop is cheap and flexible. Scaling in SQL requires purchasing additional SQL servers and configuration, which is expensive and time-consuming.

Data Processing: Hadoop supports large-scale batch data processing known as Online Analytical Processing (OLAP). SQL supports real-time data processing known as Online Transaction Processing (OLTP), making it interactive rather than batch-oriented.

Execution Time: Statements in Hadoop are executed very quickly even when millions of queries are executed at once. SQL can be slow when executed over millions of rows.

Interaction: Hadoop uses Java Database Connectivity (JDBC) to interact with SQL systems to send and receive data between them. SQL systems can read and write data to Hadoop systems.

Support for ML and AI: Hadoop supports advanced machine learning and artificial intelligence techniques. SQL's support for ML and AI is limited compared to Hadoop.

Skill Level: Hadoop requires an advanced skill level to be proficient, and learning Hadoop as a beginner can be moderately difficult, as it requires certain kinds of skill sets. The SQL skill level required is intermediate, as it can be learned easily by beginners and entry-level professionals.

Language Supported: The Hadoop framework is built with the Java programming language. SQL is a traditional database language used to perform database management operations on relational databases such as MySQL, Oracle, SQL Server, etc.

Use Case: When you need to manage unstructured, structured, or semi-structured data in huge volumes, Hadoop is a good fit. SQL performs well on moderate volumes of data and supports structured data only.

Hardware Configuration: In Hadoop, commodity hardware installation is required on the server. With SQL-supported systems, proprietary hardware installation is required.

Pricing: Hadoop is a free open-source framework. SQL-supporting systems are mostly licensed.
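The static-versus-dynamic schema contrast above can be made concrete with a small sketch, using Python's built-in sqlite3 module for the SQL side and plain key-value records for the Hadoop-style side (the table and field names are made up for illustration):

```python
import sqlite3

# Static schema (SQL side): columns are fixed when the table is created.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'Alice')")

rejected = False
try:
    # A row with an extra column violates the fixed schema.
    db.execute("INSERT INTO users VALUES (2, 'Bob', 'extra')")
except sqlite3.OperationalError:
    rejected = True
print("extra column rejected:", rejected)

# Dynamic schema (Hadoop-style key-value records): rows can differ freely.
records = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob", "city": "Pune"},  # extra field, no error
]
print("records stored:", len(records))
```

The relational engine enforces the schema on write, while the key-value store accepts any shape and leaves interpretation to the code that reads it (schema on read).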

2.4 Comparison with other systems like RDBMS


Below is a comparison between Hadoop and RDBMS.
Feature            RDBMS                                  Hadoop
Data Variety       Mainly structured data                 Structured, semi-structured, and unstructured data
Data Storage       Average-size data (GBs)                Large data sets (TBs and PBs)
Querying           SQL                                    HQL (Hive Query Language)
Schema             Required on write (static schema)      Required on read (dynamic schema)
Speed              Reads are fast                         Both reads and writes are fast
Cost               Licensed                               Free
Use Case           OLTP (online transaction processing)   Analytics (audio, video, logs, etc.), data discovery
Data Objects       Works on relational tables             Works on key/value pairs
Throughput         Low                                    High
Scalability        Vertical                               Horizontal
Hardware Profile   High-end servers                       Commodity/utility hardware
Integrity          High (ACID)                            Low


2.5 Components of Hadoop

There are three core components of Hadoop: HDFS, MapReduce, and YARN. Together they form the Hadoop framework architecture.

1. HDFS (Hadoop Distributed File System):


It is the data storage system of Hadoop. Since the data sets are huge, it uses a distributed system to store the data. Data is stored in blocks of 128 MB each (the default block size in Hadoop 2; Hadoop 1 used 64 MB). HDFS consists of a NameNode and DataNodes; there is only one active NameNode but multiple DataNodes.

Features:
The storage is distributed to handle a large data pool
Distribution increases data security
It is fault-tolerant: if a block on one node is lost, replicas on other nodes take over
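As a quick illustration of the 128 MB block size, here is the arithmetic by which a file maps to HDFS blocks (plain division, not the actual HDFS client API):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default HDFS block size

def num_blocks(file_size_bytes):
    """Number of HDFS blocks needed to store a file; the last
    block may be smaller than a full 128 MB."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

# A 500 MB file needs 4 blocks: three full 128 MB blocks plus one 116 MB block.
print(num_blocks(500 * 1024 * 1024))  # 4
```

Note that, unlike a local file system, a block smaller than 128 MB only occupies its actual size on disk; the block size is an upper bound per block, not an allocation unit.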

2. MapReduce:
The MapReduce framework is the processing unit. All data is distributed and processed in parallel. A MasterNode distributes data amongst the SlaveNodes; the SlaveNodes do the processing and send the results back to the MasterNode.

Features:
Consists of two phases, the Map phase and the Reduce phase
Processes big data faster, with multiple nodes working in parallel
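The Map and Reduce phases can be simulated in-process with a toy word count (the classic MapReduce example). Real Hadoop jobs implement Mapper and Reducer classes in Java and run across many nodes, but the data flow is the same:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle groups pairs by key; Reduce then sums the counts per key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big", "data tools"]
pairs = [p for line in lines for p in map_phase(line)]
counts_out = reduce_phase(pairs)
print(counts_out)  # {'big': 2, 'data': 2, 'tools': 1}
```

Because each map call depends only on its own input line, the map phase can run on thousands of nodes at once; the shuffle and reduce steps then combine the partial results.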


3. YARN (Yet Another Resource Negotiator):


It is the resource management unit of the Hadoop framework. The stored data can be processed with the help of YARN using data processing engines such as interactive processing. It supports many kinds of data analysis.

Features:
It acts as an operating system for the data stored on HDFS
It helps schedule tasks so that no single system is overloaded
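The scheduling idea above can be sketched as a least-loaded placement policy. This is a toy model, not YARN's actual Capacity or Fair scheduler, and the task and node names are made up:

```python
def schedule(tasks, nodes):
    """Assign each task to the node with the lightest current load,
    so that no single node is overloaded."""
    load = {n: 0 for n in nodes}
    placement = {}
    for task in tasks:
        target = min(load, key=load.get)  # pick the least-loaded node
        placement[task] = target
        load[target] += 1
    return placement, load

placement, load = schedule(["t1", "t2", "t3", "t4"], ["n1", "n2"])
print(load)  # {'n1': 2, 'n2': 2}
```

Real YARN schedulers additionally weigh memory and CPU requests, queue capacities, and data locality, but the balancing goal is the same.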

2.6 Hadoop Architecture


The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master
node includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the
slave node includes DataNode and TaskTracker.


Hadoop Distributed File System


The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It follows a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves. Both the NameNode and DataNodes are capable of running on commodity machines. HDFS is developed in the Java language, so any machine that supports Java can easily run the NameNode and DataNode software.

NameNode
o It is a single master server that exists in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming, and closing files.
o It simplifies the architecture of the system.

DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of a DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
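The replication behaviour in the last bullet can be sketched as the NameNode choosing distinct DataNodes to hold each block's copies. This is a simplification (real HDFS placement is rack-aware), and the block and node names are made up:

```python
def place_replicas(blocks, datanodes, replication=3):
    """For each block, pick `replication` distinct DataNodes to hold a
    copy, rotating through the nodes so copies spread evenly."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [datanodes[(i + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

p = place_replicas(["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
print(p["blk_1"])  # ['dn1', 'dn2', 'dn3']
print(p["blk_2"])  # ['dn2', 'dn3', 'dn4']
```

With three copies on three distinct nodes, any single DataNode failure leaves at least two live replicas, which is what makes the fault tolerance described earlier possible.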

Job Tracker
o The role of the JobTracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, the NameNode provides metadata to the JobTracker.

Task Tracker
o It works as a slave node for the JobTracker.
o It receives tasks and code from the JobTracker and applies that code to the file. This process can also be called a Mapper.


MapReduce Layer
MapReduce comes into play when the client application submits a MapReduce job to the JobTracker. In response, the JobTracker sends the request to the appropriate TaskTrackers. Sometimes a TaskTracker fails or times out; in such a case, that part of the job is rescheduled.

2.7 Difference between Hadoop 1 and Hadoop 2


Hadoop is an open-source software programming framework for storing large amounts of data and performing computation. Its framework is based on Java programming, with some native code in C and shell scripts.

Hadoop 1 vs Hadoop 2

1. Components: In Hadoop 1 we have MapReduce, but Hadoop 2 has YARN (Yet Another Resource Negotiator) and MapReduce version 2.

Hadoop 1 Hadoop 2

HDFS HDFS

Map Reduce YARN / MRv2

2. Daemons:

Hadoop 1 Hadoop 2

Namenode Namenode

Datanode Datanode

Secondary Namenode Secondary Namenode

Job Tracker Resource Manager


Task Tracker Node Manager

3. Working:

In Hadoop 1, HDFS is used for storage and, on top of it, MapReduce works as both resource management and data processing. Because of this double workload on MapReduce, performance suffers.
In Hadoop 2, HDFS is again used for storage, and on top of HDFS sits YARN, which handles resource management. It basically allocates the resources and keeps everything running.

4. Limitations:

Hadoop 1 is a master-slave architecture. It consists of a single master and multiple slaves. If the master node crashes then, irrespective of your best slave nodes, your cluster is destroyed. Recreating that cluster, i.e., copying system files, image files, etc. onto another system, is too time-consuming, which organizations cannot tolerate.

Hadoop 2 is also a master-slave architecture, but it consists of multiple masters (i.e., active NameNodes and standby NameNodes) and multiple slaves. If the master node crashes here, a standby master node takes over. You can make multiple combinations of active-standby nodes. Thus Hadoop 2 eliminates the problem of a single point of failure.
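The active/standby takeover described above can be sketched as a small failover routine. This is a toy state machine (real HDFS high availability uses ZooKeeper-based automatic failover), and the node dictionaries are hypothetical:

```python
def failover(namenodes):
    """Toy failover: if the active NameNode is unhealthy (or missing),
    promote the first healthy standby so the cluster keeps running."""
    active = next((n for n in namenodes if n["role"] == "active"), None)
    if active is not None and active["healthy"]:
        return active["name"]          # no failover needed
    for n in namenodes:
        if n["role"] == "standby" and n["healthy"]:
            if active is not None:
                active["role"] = "failed"
            n["role"] = "active"       # standby takes over
            return n["name"]
    raise RuntimeError("no healthy NameNode available")

cluster = [{"name": "nn1", "role": "active", "healthy": False},
           {"name": "nn2", "role": "standby", "healthy": True}]
new_active = failover(cluster)
print(new_active)  # nn2
```

The standby can take over quickly because it continuously keeps an up-to-date copy of the namespace metadata, rather than rebuilding it from scratch as Hadoop 1 required.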

5. Ecosystem

Oozie is basically a workflow scheduler. It decides the particular time for jobs to execute according to their dependencies.
Pig, Hive, and Mahout are data processing tools that work on top of Hadoop.
Sqoop is used to import and export structured data. You can directly import and export data between HDFS and an SQL database.
Flume is used to import and export unstructured data and streaming data.
