

UNIT II:
Big Data Technologies and Databases: Requirement of the Hadoop Framework - Design Principles of Hadoop - Comparison with Other Systems (SQL and RDBMS) - Hadoop Components and Architecture - Hadoop 1 vs Hadoop 2.

2.1 Requirement of Hadoop Framework


Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models.

The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers.

Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

2.2 Design Principles of Hadoop


Hadoop is built on the following design principles:

a) The system shall manage and heal itself
Automatically and transparently route around failures (fault tolerance)
Speculatively execute redundant copies of a task when certain nodes are detected to be slow

b) Performance shall scale linearly
Capacity changes proportionally with resource changes (scalability)

c) Computation should move to the data
Lower latency, lower bandwidth (data locality)

d) Simple core, modular and extensible (economical)
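The self-healing and speculative-execution principles above can be sketched in a toy simulation (this is not Hadoop's actual scheduler; the node records and estimated times are made up for illustration):

```python
def run_with_speculation(task, nodes, slow_threshold=2.0):
    """Toy model of speculative execution: if the scheduled node looks
    slow (a straggler), launch a redundant copy of the task on another
    node and keep whichever attempt would finish first."""
    primary, backup = nodes[0], (nodes[1] if len(nodes) > 1 else None)
    winner = primary
    if backup and primary["est_time"] > slow_threshold:
        # Both attempts compute the same result; take the faster node.
        winner = backup if backup["est_time"] < primary["est_time"] else primary
    return task(), winner["name"]

nodes = [{"name": "node-a", "est_time": 5.0},   # detected straggler
         {"name": "node-b", "est_time": 0.5}]
result, ran_on = run_with_speculation(lambda: 42, nodes)
print(result, ran_on)  # 42 node-b
```

The key property is that the redundant attempt is wasted work in the common case but bounds the damage a single slow node can do to the whole job.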

BIG DATA ANALYTICS DEPT.OF INFORMATION TECHNOLOGY



2.3 Comparison with other systems like SQL

Parameter-by-parameter comparison of Hadoop and SQL:

Architecture: Hadoop is an open-source framework in which data sets are distributed across computer/server clusters with parallel data processing features. SQL stands for Structured Query Language; it is a domain-specific language used to handle database management operations in relational databases.

Operations: Hadoop is used for storing, processing, retrieving, and pattern extraction from data across a wide range of formats like XML, text, JSON, etc. SQL is used to store, process, retrieve, and pattern-mine data stored in a relational database only.

Data Type / Data Update: Hadoop handles both structured and unstructured data formats. For data updates, Hadoop writes data once but reads it multiple times. SQL works only for structured data but, unlike Hadoop, data can be written and read multiple times.

Data Volume Processed: Hadoop is developed for Big Data, hence it usually handles data volumes up to terabytes and petabytes. SQL works better on low volumes of data, usually in gigabytes.

Data Storage: Hadoop stores data in the form of key-value pairs, hashes, maps, tables, etc. in distributed systems with dynamic schemas. SQL stores structured data in a tabular format using tables only, with fixed schemas.

Schema Structure: Hadoop supports a dynamic schema structure. SQL supports a static schema structure.

Data Structures Supported: Hadoop supports NoSQL data structures, columnar data structures, etc., meaning you will have to provide code for implementation or for rolling back during a transaction. SQL works on the properties of Atomicity, Consistency, Isolation, and Durability (ACID), which are fundamental to an RDBMS.

Fault Tolerance: Hadoop is highly fault-tolerant. SQL has good fault tolerance.

Availability: As Hadoop uses the notion of distributed computing and the principle of MapReduce, it handles data availability on multiple systems across multiple geo-locations. SQL-supporting databases are usually available on-premises or in the cloud, so they lack the benefits of distributed computing.

Integrity: Hadoop has low integrity. SQL has high integrity.

Scaling: Scaling a Hadoop-based system requires connecting computers over the network; horizontal scaling with Hadoop is cheap and flexible. Scaling in SQL requires purchasing additional SQL servers and configuration, which is expensive and time-consuming.

Data Processing: Hadoop supports large-scale batch data processing known as Online Analytical Processing (OLAP). SQL supports real-time data processing known as Online Transaction Processing (OLTP), making it interactive rather than batch-oriented.

Execution Time: Statements in Hadoop are executed very quickly even when millions of queries are executed at once. SQL can be slow when executed over millions of rows.

Interaction: Hadoop uses Java Database Connectivity (JDBC) to interact with SQL systems to send and receive data between them. SQL systems can read and write data to Hadoop systems.

Support for ML and AI: Hadoop supports advanced machine learning and artificial intelligence techniques. SQL's support for ML and AI is limited compared to Hadoop.

Skill Level: Hadoop requires an advanced skill level to be proficient, and learning Hadoop as a beginner can be moderately difficult, as it requires certain kinds of skill sets. The SQL skill level required is intermediate, as it can be learned easily by beginners and entry-level professionals.

Language Supported: The Hadoop framework is built with the Java programming language. SQL is a traditional database language used to perform database management operations on relational databases such as MySQL, Oracle, SQL Server, etc.

Use Case: When you need to manage unstructured, structured, or semi-structured data in huge volumes, Hadoop is a good fit. SQL performs well on moderate volumes of data and supports structured data only.

Hardware Configuration: In Hadoop, commodity hardware installation is required on the server. With SQL-supported systems, proprietary hardware installation is required.

Pricing: Hadoop is a free open-source framework. SQL-supporting systems are mostly licensed.
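The static-versus-dynamic schema contrast above can be made concrete with a small sketch, using Python's built-in sqlite3 module for the SQL side and plain key-value records for the Hadoop-style side (the table and field names are made up for illustration):

```python
import sqlite3

# Static schema (SQL side): columns are fixed when the table is created.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'Alice')")

rejected = False
try:
    # A row with an extra column violates the fixed schema.
    db.execute("INSERT INTO users VALUES (2, 'Bob', 'extra')")
except sqlite3.OperationalError:
    rejected = True
print("extra column rejected:", rejected)

# Dynamic schema (Hadoop-style key-value records): rows can differ freely.
records = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob", "city": "Pune"},  # extra field, no error
]
print("records stored:", len(records))
```

The relational engine enforces the schema on write, while the key-value store accepts any shape and leaves interpretation to the code that reads it (schema on read).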

2.4 Comparison with other systems like RDBMS


Below is a comparison between Hadoop and RDBMS.
Feature            RDBMS                                  Hadoop
Data Variety       Mainly structured data                 Structured, semi-structured, and unstructured data
Data Storage       Average-size data (GBs)                Large data sets (TBs and PBs)
Querying           SQL                                    HQL (Hive Query Language)
Schema             Required on write (static schema)      Required on read (dynamic schema)
Speed              Reads are fast                         Both reads and writes are fast
Cost               Licensed                               Free
Use Case           OLTP (online transaction processing)   Analytics (audio, video, logs, etc.), data discovery
Data Objects       Works on relational tables             Works on key/value pairs
Throughput         Low                                    High
Scalability        Vertical                               Horizontal
Hardware Profile   High-end servers                       Commodity/utility hardware
Integrity          High (ACID)                            Low


2.5 Components of Hadoop

There are three core components of Hadoop: HDFS, MapReduce, and YARN. Together they form the Hadoop framework architecture.

1. HDFS (Hadoop Distributed File System):


It is the data storage system of Hadoop. Since the data sets are huge, it uses a distributed system to store the data. Data is stored in blocks of 128 MB each (the default block size in Hadoop 2; Hadoop 1 used 64 MB). HDFS consists of a NameNode and DataNodes; there is only one active NameNode but multiple DataNodes.

Features:
The storage is distributed to handle a large data pool
Distribution increases data security
It is fault-tolerant: if a block on one node is lost, replicas on other nodes take over
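As a quick illustration of the 128 MB block size, here is the arithmetic by which a file maps to HDFS blocks (plain division, not the actual HDFS client API):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default HDFS block size

def num_blocks(file_size_bytes):
    """Number of HDFS blocks needed to store a file; the last
    block may be smaller than a full 128 MB."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

# A 500 MB file needs 4 blocks: three full 128 MB blocks plus one 116 MB block.
print(num_blocks(500 * 1024 * 1024))  # 4
```

Note that, unlike a local file system, a block smaller than 128 MB only occupies its actual size on disk; the block size is an upper bound per block, not an allocation unit.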

2. MapReduce:
The MapReduce framework is the processing unit. All data is distributed and processed in parallel. A MasterNode distributes data amongst the SlaveNodes; the SlaveNodes do the processing and send the results back to the MasterNode.

Features:
Consists of two phases, the Map phase and the Reduce phase
Processes big data faster, with multiple nodes working in parallel
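The Map and Reduce phases can be simulated in-process with a toy word count (the classic MapReduce example). Real Hadoop jobs implement Mapper and Reducer classes in Java and run across many nodes, but the data flow is the same:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle groups pairs by key; Reduce then sums the counts per key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big", "data tools"]
pairs = [p for line in lines for p in map_phase(line)]
counts_out = reduce_phase(pairs)
print(counts_out)  # {'big': 2, 'data': 2, 'tools': 1}
```

Because each map call depends only on its own input line, the map phase can run on thousands of nodes at once; the shuffle and reduce steps then combine the partial results.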


3. YARN (Yet Another Resource Negotiator):


It is the resource management unit of the Hadoop framework. The stored data can be processed with the help of YARN using data processing engines such as interactive processing. It supports many kinds of data analysis.

Features:
It acts as an operating system for the data stored on HDFS
It helps schedule tasks so that no single system is overloaded
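The scheduling idea above can be sketched as a least-loaded placement policy. This is a toy model, not YARN's actual Capacity or Fair scheduler, and the task and node names are made up:

```python
def schedule(tasks, nodes):
    """Assign each task to the node with the lightest current load,
    so that no single node is overloaded."""
    load = {n: 0 for n in nodes}
    placement = {}
    for task in tasks:
        target = min(load, key=load.get)  # pick the least-loaded node
        placement[task] = target
        load[target] += 1
    return placement, load

placement, load = schedule(["t1", "t2", "t3", "t4"], ["n1", "n2"])
print(load)  # {'n1': 2, 'n2': 2}
```

Real YARN schedulers additionally weigh memory and CPU requests, queue capacities, and data locality, but the balancing goal is the same.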

2.6 Hadoop Architecture


The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master
node includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the
slave node includes DataNode and TaskTracker.


Hadoop Distributed File System


The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It follows a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves. Both the NameNode and DataNodes are capable of running on commodity machines. HDFS is developed in the Java language, so any machine that supports Java can easily run the NameNode and DataNode software.

NameNode
o It is a single master server that exists in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming, and closing files.
o It simplifies the architecture of the system.

DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of a DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
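The replication behaviour in the last bullet can be sketched as the NameNode choosing distinct DataNodes to hold each block's copies. This is a simplification (real HDFS placement is rack-aware), and the block and node names are made up:

```python
def place_replicas(blocks, datanodes, replication=3):
    """For each block, pick `replication` distinct DataNodes to hold a
    copy, rotating through the nodes so copies spread evenly."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [datanodes[(i + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

p = place_replicas(["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
print(p["blk_1"])  # ['dn1', 'dn2', 'dn3']
print(p["blk_2"])  # ['dn2', 'dn3', 'dn4']
```

With three copies on three distinct nodes, any single DataNode failure leaves at least two live replicas, which is what makes the fault tolerance described earlier possible.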

Job Tracker
o The role of the JobTracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, the NameNode provides metadata to the JobTracker.

Task Tracker
o It works as a slave node for the JobTracker.
o It receives tasks and code from the JobTracker and applies that code to the file. This process can also be called a Mapper.


MapReduce Layer
MapReduce comes into play when the client application submits a MapReduce job to the JobTracker. In response, the JobTracker sends the request to the appropriate TaskTrackers. Sometimes a TaskTracker fails or times out; in such a case, that part of the job is rescheduled.

2.7 Difference between Hadoop 1 and Hadoop 2


Hadoop is an open-source software programming framework for storing large amounts of data and performing computation. Its framework is based on Java programming, with some native code in C and shell scripts.

Hadoop 1 vs Hadoop 2

1. Components: In Hadoop 1 we have MapReduce, but Hadoop 2 has YARN (Yet Another Resource Negotiator) and MapReduce version 2.

Hadoop 1 Hadoop 2

HDFS HDFS

Map Reduce YARN / MRv2

2. Daemons:

Hadoop 1 Hadoop 2

Namenode Namenode

Datanode Datanode

Secondary Namenode Secondary Namenode

Job Tracker Resource Manager


Task Tracker Node Manager

3. Working:

In Hadoop 1, HDFS is used for storage and, on top of it, MapReduce works as both resource management and data processing. Because of this double workload on MapReduce, performance suffers.
In Hadoop 2, HDFS is again used for storage, and on top of HDFS sits YARN, which handles resource management. It basically allocates the resources and keeps everything running.

4. Limitations:

Hadoop 1 is a master-slave architecture. It consists of a single master and multiple slaves. If the master node crashes then, irrespective of your best slave nodes, your cluster is destroyed. Recreating that cluster, i.e., copying system files, image files, etc. onto another system, is too time-consuming, which organizations cannot tolerate.

Hadoop 2 is also a master-slave architecture, but it consists of multiple masters (i.e., active NameNodes and standby NameNodes) and multiple slaves. If the master node crashes here, a standby master node takes over. You can make multiple combinations of active-standby nodes. Thus Hadoop 2 eliminates the problem of a single point of failure.
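The active/standby takeover described above can be sketched as a small failover routine. This is a toy state machine (real HDFS high availability uses ZooKeeper-based automatic failover), and the node dictionaries are hypothetical:

```python
def failover(namenodes):
    """Toy failover: if the active NameNode is unhealthy (or missing),
    promote the first healthy standby so the cluster keeps running."""
    active = next((n for n in namenodes if n["role"] == "active"), None)
    if active is not None and active["healthy"]:
        return active["name"]          # no failover needed
    for n in namenodes:
        if n["role"] == "standby" and n["healthy"]:
            if active is not None:
                active["role"] = "failed"
            n["role"] = "active"       # standby takes over
            return n["name"]
    raise RuntimeError("no healthy NameNode available")

cluster = [{"name": "nn1", "role": "active", "healthy": False},
           {"name": "nn2", "role": "standby", "healthy": True}]
new_active = failover(cluster)
print(new_active)  # nn2
```

The standby can take over quickly because it continuously keeps an up-to-date copy of the namespace metadata, rather than rebuilding it from scratch as Hadoop 1 required.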

5. Ecosystem

Oozie is basically a workflow scheduler. It decides the particular time for jobs to execute according to their dependencies.
Pig, Hive, and Mahout are data processing tools that work on top of Hadoop.
Sqoop is used to import and export structured data. You can directly import and export data between HDFS and an SQL database.
Flume is used to import and export unstructured data and streaming data.
