HADOOP
CONTENTS
• WHAT IS HADOOP
• HISTORY OF HADOOP
• COMPONENTS
• ADVANTAGES AND LIMITATIONS
• ARCHITECTURE OF HADOOP
• WORKING AND USE OF HADOOP
• HADOOP TOOLS
05/20/2024
WHAT IS HADOOP?
• Hadoop is an open-source framework designed for
distributed storage and processing of large datasets using a
cluster of commodity hardware.
• It consists of two main components: the Hadoop Distributed
File System (HDFS) for storing data across multiple
machines, and the MapReduce programming model for
processing and analyzing data in parallel.
• Hadoop is widely used in big data applications due to its
scalability, fault tolerance, and ability to handle diverse data
types.
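The MapReduce model described above can be sketched in plain Python. This is a minimal, single-machine illustration of the map, shuffle, and reduce phases only; a real Hadoop job distributes these phases across the cluster's nodes.

```python
from collections import defaultdict

# Map phase: emit (word, 1) pairs from each input line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle phase: group all values by key
# (Hadoop performs this between the map and reduce phases).
def shuffle_phase(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: combine the grouped values for each key.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In Hadoop, each mapper and reducer runs on the node holding its slice of the data, which is what makes the parallelism scale.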
HDFS is designed to achieve the following goals:
Manage large datasets - Organizing and storing huge datasets is a hard task to handle. HDFS is used to manage applications that have to deal with huge datasets; to do this, an HDFS cluster may scale to hundreds of nodes.
Detect faults - Because HDFS runs on a large amount of commodity hardware, failure of components is a common issue, so HDFS needs mechanisms to scan for and detect faults quickly and effectively.
Hardware efficiency - By moving computation close to where the data is stored, HDFS reduces network traffic and increases processing speed when large datasets are involved.
HISTORY OF HADOOP

The design of HDFS was based on the Google File System. It was originally built as infrastructure for the Apache Nutch web search engine project but has since become a member of the Hadoop Ecosystem.

The early internet spawned web crawlers for searching information, leading to search engines like Yahoo and Google. Nutch emerged, aiming to distribute data and computation across multiple computers. Nutch later joined Yahoo and was then split into two projects. Apache Spark and Hadoop are now their own separate entities, where Hadoop is designed to handle batch processing and Spark is made to handle real-time data efficiently.

What is HDFS in the world of big data?
• HDFS big data is data organized into the HDFS filing system.
• Hadoop is a framework that works by using parallel processing and distributed storage. This can be used to sort and store big data, as it can't be stored in traditional ways.
• It's the most commonly used software for handling big data, and is used for data storage by companies such as Netflix, Expedia, and British Airways.
HDFS components
• It's important to know that there are three main components of Hadoop: HDFS for storage, MapReduce for processing, and YARN for resource management.
Advantages of HADOOP Distributed File System
• Cost: Hadoop is open-source and runs on cost-effective commodity hardware, unlike traditional relational databases, which require expensive hardware and high-end processors to deal with Big Data. Because storing massive volumes of data in a relational database is not cost-effective, companies started discarding raw data, which could leave them without a complete picture of their business. Hadoop therefore provides two main cost benefits: it is open-source (free to use), and it runs on inexpensive commodity hardware.
• Scalability: Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel, and the number of these machines or nodes can be increased or decreased as the enterprise requires. Traditional RDBMSs (Relational Database Management Systems) cannot be scaled to handle such large amounts of data.
• Flexibility: Hadoop is designed to deal efficiently with any kind of dataset: structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and videos). It can process any kind of data independent of its structure, which makes it highly flexible. This is very useful for enterprises, as they can analyze data from sources like social media and email for valuable insights. With this flexibility, Hadoop can be used for log processing, data warehousing, fraud detection, and more.
• Speed: Hadoop's distributed file system (HDFS) breaks large files into smaller blocks
distributed across nodes in a cluster, enabling parallel processing for faster performance
compared to traditional database systems. This architecture allows rapid access to
terabytes of unstructured data, making Hadoop ideal for handling large-scale data with
speed and efficiency.
• High Throughput: Hadoop works on a distributed file system where various jobs are assigned to various DataNodes in a cluster, and the data is processed in parallel across the Hadoop cluster, which produces high throughput. Throughput is simply the amount of work done per unit time.
• Minimum Network Traffic: In Hadoop, each task is divided into various small sub-tasks, which are then assigned to the DataNodes available in the Hadoop cluster. Each DataNode processes a small amount of data, which leads to low traffic in the Hadoop cluster.
Limitations:
• Low-latency data access: Applications that require low-latency access to data, i.e. in the range of milliseconds, will not work well with HDFS, because HDFS is designed for high throughput of data even at the cost of latency.
• Small file problem: Having lots of small files results in lots of seeks and lots of movement from one DataNode to another to retrieve each small file, which is a very inefficient data access pattern.
• Complexity of Configuration and Management: Setting up and managing an HDFS cluster can be complex, requiring
expertise in distributed systems and configuration management tools. Organizations may need dedicated resources for cluster
administration and maintenance.
• Single Point of Failure: While HDFS is designed to be fault-tolerant, the NameNode, which stores metadata about file
locations and block assignments, can be a single point of failure. Although HDFS provides mechanisms like standby
NameNodes and high availability configurations to mitigate this risk, the potential for NameNode failure remains a concern.
• High Overhead for Replication: HDFS achieves fault tolerance by replicating data across multiple nodes, typically three
replicas by default. This replication incurs overhead in terms of storage space and network bandwidth, especially for large
datasets.
HDFS ARCHITECTURE
• NameNode (Master)
• DataNode (Slave)

NameNode: The NameNode works as the Master in a Hadoop cluster and guides the DataNodes (Slaves).

DataNode: DataNodes work as Slaves and are mainly utilized for storing the data in a Hadoop cluster.

HDFS is an open-source component of the Apache Software Foundation that manages data. HDFS has scalability, availability, and replication as key features. NameNodes, secondary NameNodes, DataNodes, checkpoint nodes, backup nodes, and blocks all make up the architecture of HDFS. HDFS is fault-tolerant and replicated. Files are distributed across the cluster systems using the NameNode and DataNodes.
Hadoop: Working and Using
Hadoop has three components that are specifically designed to work on big data.
• HDFS (Hadoop Distributed File System): storing of data.
• Data is distributed among many computers and stored as blocks. 128 MB is the default size of each block.
• Replication method: makes copies of data and stores them across multiple systems.
• HDFS is fault tolerant.
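The block and replication figures above imply some simple arithmetic, sketched below. The 128 MB block size and replication factor of 3 are the HDFS defaults; the 1 GB file is a made-up example.

```python
import math

BLOCK_SIZE_MB = 128      # HDFS default block size
REPLICATION_FACTOR = 3   # HDFS default replication factor

def hdfs_footprint(file_size_mb):
    """Number of blocks a file is split into, and the raw storage its
    replicas consume. HDFS does not pad the final block, so raw storage
    is the file size times the replication factor."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, raw_storage_mb

blocks, raw = hdfs_footprint(1024)  # a hypothetical 1 GB file
print(blocks, raw)  # 8 blocks, 3072 MB of raw storage across the cluster
```

The tripled raw storage is the replication overhead noted in the limitations above: fault tolerance is paid for in disk space and network bandwidth.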
To use HDFS you need to install and set up a Hadoop cluster. This can be a single-node setup, which is more appropriate for first-time users, or a multi-node setup for large, distributed deployments. You then need to familiarize yourself with HDFS commands.
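A few of the basic HDFS shell commands look like this. This is a sketch only: the paths and file names are made-up examples, and the commands assume a running cluster with `hdfs` on the PATH.

```shell
# Create a directory in HDFS
hdfs dfs -mkdir -p /user/demo

# Copy a local file into HDFS (it is split into blocks and replicated)
hdfs dfs -put localdata.txt /user/demo/

# List the directory and print the file back
hdfs dfs -ls /user/demo
hdfs dfs -cat /user/demo/localdata.txt

# Inspect how the file's blocks are placed and replicated across DataNodes
hdfs fsck /user/demo/localdata.txt -files -blocks
```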
TOOLS OF HADOOP
HADOOP HIVE

• Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was developed by Facebook.
• Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries called HQL (Hive Query Language), which get internally converted to MapReduce jobs.
• It is capable of analyzing large datasets stored in HDFS.
• It allows different storage types such as plain text, RCFile, and HBase.

WHY HIVE?
• Hive is commonly used by Data Analysts.
• It follows SQL-like queries.
• Hive is fast and scalable.
• It can handle structured data.
• It works on the server side of an HDFS cluster.
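As a sketch of what HQL looks like: the `page_views` table and its columns are hypothetical, and Hive compiles the query down to MapReduce jobs as described above.

```sql
-- Hypothetical table of web page views, stored in HDFS as plain text
CREATE TABLE page_views (user_id STRING, url STRING, view_time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- A familiar SQL-style aggregation over data living in HDFS
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```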
HADOOP PIG

• Pig represents Big Data as data flows. Pig is a high-level platform or tool used to process large datasets.
• Pig Latin and Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
• Pig Latin is the language used to develop data analysis code. To process data stored in HDFS, programmers write scripts in the Pig Latin language.
• The Pig Engine (a component of Apache Pig) converts all these scripts into specific map and reduce tasks, but these are not visible to the programmers, in order to provide a high level of abstraction.

WHY PIG?
• Pig is commonly used by programmers.
• It follows the data-flow language.
• It can handle semi-structured data.
• It works on the client side of an HDFS cluster.
• Pig is faster compared to Hive.
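A classic Pig Latin script, showing the data-flow style, looks like this word-count sketch; the input and output paths are hypothetical.

```pig
-- Load lines of text from HDFS
lines   = LOAD '/user/demo/input.txt' AS (line:chararray);

-- Split each line into words (each step transforms the previous data flow)
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words and count them
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;

-- Pig Engine compiles these steps into map and reduce tasks;
-- as noted above, the result always lands in HDFS
STORE counts INTO '/user/demo/wordcount';
```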
THANK YOU