HADOOP
CONTENTS
• WHAT IS HADOOP
• HISTORY OF HADOOP
• COMPONENTS
• ADVANTAGES AND LIMITATIONS
• ARCHITECTURE OF HADOOP
• WORKING AND USE OF HADOOP
• HADOOP TOOLS
05/20/2024
WHAT IS HADOOP?
• Hadoop is an open-source framework designed for
distributed storage and processing of large datasets using a
cluster of commodity hardware.
• It consists of two main components: the Hadoop Distributed
File System (HDFS) for storing data across multiple
machines, and the MapReduce programming model for
processing and analyzing data in parallel.
• Hadoop is widely used in big data applications due to its
scalability, fault tolerance, and ability to handle diverse data
types.
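The MapReduce model described above can be sketched in plain Python. This is a minimal, single-machine illustration of the map, shuffle, and reduce phases only; a real Hadoop job distributes these phases across the cluster's nodes.

```python
from collections import defaultdict

# Map phase: emit (word, 1) pairs from each input line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle phase: group all values by key
# (Hadoop performs this between the map and reduce phases).
def shuffle_phase(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: combine the grouped values for each key.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In Hadoop, each mapper and reducer runs on the node holding its slice of the data, which is what makes the parallelism scale.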
HDFS is designed to achieve the following goals:
Manage large datasets - Organizing and storing huge datasets is a hard task to handle. HDFS is used to manage applications that have to deal with huge datasets; to do this, an HDFS cluster may scale to hundreds of nodes.
Detect faults - Because HDFS runs on a large amount of commodity hardware, failure of components is a common issue, so HDFS needs mechanisms to scan for and detect faults quickly and effectively.
Hardware efficiency - By moving computation close to where the data is stored, HDFS reduces network traffic and increases processing speed when large datasets are involved.
HISTORY OF HADOOP

The design of HDFS was based on the Google File System. It was originally built as infrastructure for the Apache Nutch web search engine project but has since become a member of the Hadoop Ecosystem.

The early internet spawned web crawlers for searching information, leading to search engines like Yahoo and Google. Nutch emerged, aiming to distribute data and computation across multiple computers. Nutch later joined Yahoo and was then split into two projects. Apache Spark and Hadoop are now their own separate entities, where Hadoop is designed to handle batch processing and Spark is made to handle real-time data efficiently.

What is HDFS in the world of big data?
• HDFS big data is data organized into the HDFS filing system.
• Hadoop is a framework that works by using parallel processing and distributed storage. This can be used to sort and store big data, as it can't be stored in traditional ways.
• It's the most commonly used software for handling big data, and is used for data storage by companies such as Netflix, Expedia, and British Airways.
HDFS components
• It's important to know that there are three main components of Hadoop: HDFS for storage, MapReduce for processing, and YARN for resource management.
Advantages of HADOOP Distributed File System
• Cost: Hadoop is open-source and runs on cost-effective commodity hardware, unlike traditional relational databases, which require expensive hardware and high-end processors to deal with Big Data. Because storing massive volumes of data in a relational database is not cost-effective, companies started discarding raw data, which could leave them without a complete picture of their business. Hadoop therefore provides two main cost benefits: it is open-source (free to use), and it runs on inexpensive commodity hardware.
• Scalability: Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel, and the number of these machines or nodes can be increased or decreased as the enterprise requires. Traditional RDBMSs (Relational Database Management Systems) cannot be scaled to handle such large amounts of data.
• Flexibility: Hadoop is designed to deal efficiently with any kind of dataset: structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and videos). It can process any kind of data independent of its structure, which makes it highly flexible. This is very useful for enterprises, as they can analyze data from sources like social media and email for valuable insights. With this flexibility, Hadoop can be used for log processing, data warehousing, fraud detection, and more.
• Speed: Hadoop's distributed file system (HDFS) breaks large files into smaller blocks
distributed across nodes in a cluster, enabling parallel processing for faster performance
compared to traditional database systems. This architecture allows rapid access to
terabytes of unstructured data, making Hadoop ideal for handling large-scale data with
speed and efficiency.
• High Throughput: Hadoop works on a distributed file system where various jobs are assigned to various DataNodes in a cluster, and the data is processed in parallel across the Hadoop cluster, which produces high throughput. Throughput is simply the amount of work done per unit time.
• Minimum Network Traffic: In Hadoop, each task is divided into various small sub-tasks, which are then assigned to the DataNodes available in the Hadoop cluster. Each DataNode processes a small amount of data, which leads to low traffic in the Hadoop cluster.
Limitations:
• Low-latency data access: Applications that require low-latency access to data, i.e. in the range of milliseconds, will not work well with HDFS, because HDFS is designed for high throughput of data even at the cost of latency.
• Small file problem: Having lots of small files results in lots of seeks and lots of movement from one DataNode to another to retrieve each small file, which is a very inefficient data access pattern.
• Complexity of Configuration and Management: Setting up and managing an HDFS cluster can be complex, requiring
expertise in distributed systems and configuration management tools. Organizations may need dedicated resources for cluster
administration and maintenance.
• Single Point of Failure: While HDFS is designed to be fault-tolerant, the NameNode, which stores metadata about file
locations and block assignments, can be a single point of failure. Although HDFS provides mechanisms like standby
NameNodes and high availability configurations to mitigate this risk, the potential for NameNode failure remains a concern.
• High Overhead for Replication: HDFS achieves fault tolerance by replicating data across multiple nodes, typically three
replicas by default. This replication incurs overhead in terms of storage space and network bandwidth, especially for large
datasets.
HDFS ARCHITECTURE
• NameNode (Master)
• DataNode (Slave)

NameNode: The NameNode works as the Master in a Hadoop cluster and guides the DataNodes (Slaves).

DataNode: DataNodes work as Slaves and are mainly utilized for storing the data in a Hadoop cluster.

HDFS is an open-source component of the Apache Software Foundation that manages data. HDFS has scalability, availability, and replication as key features. NameNodes, secondary NameNodes, DataNodes, checkpoint nodes, backup nodes, and blocks all make up the architecture of HDFS. HDFS is fault-tolerant and replicated. Files are distributed across the cluster systems using the NameNode and DataNodes.
Hadoop: Working and Using
Hadoop has three components that are specifically designed to work on big data.
• HDFS (Hadoop Distributed File System): storing of data.
• Data is distributed among many computers and stored as blocks. 128 MB is the default size of each block.
• Replication method: makes copies of data and stores them across multiple systems.
• HDFS is fault tolerant.
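The block and replication figures above imply some simple arithmetic, sketched below. The 128 MB block size and replication factor of 3 are the HDFS defaults; the 1 GB file is a made-up example.

```python
import math

BLOCK_SIZE_MB = 128      # HDFS default block size
REPLICATION_FACTOR = 3   # HDFS default replication factor

def hdfs_footprint(file_size_mb):
    """Number of blocks a file is split into, and the raw storage its
    replicas consume. HDFS does not pad the final block, so raw storage
    is the file size times the replication factor."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, raw_storage_mb

blocks, raw = hdfs_footprint(1024)  # a hypothetical 1 GB file
print(blocks, raw)  # 8 blocks, 3072 MB of raw storage across the cluster
```

The tripled raw storage is the replication overhead noted in the limitations above: fault tolerance is paid for in disk space and network bandwidth.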
To use HDFS you need to install and set up a Hadoop cluster. This can be a single-node setup, which is more appropriate for first-time users, or a multi-node setup for large, distributed deployments. You then need to familiarize yourself with HDFS commands.
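A few of the basic HDFS shell commands look like this. This is a sketch only: the paths and file names are made-up examples, and the commands assume a running cluster with `hdfs` on the PATH.

```shell
# Create a directory in HDFS
hdfs dfs -mkdir -p /user/demo

# Copy a local file into HDFS (it is split into blocks and replicated)
hdfs dfs -put localdata.txt /user/demo/

# List the directory and print the file back
hdfs dfs -ls /user/demo
hdfs dfs -cat /user/demo/localdata.txt

# Inspect how the file's blocks are placed and replicated across DataNodes
hdfs fsck /user/demo/localdata.txt -files -blocks
```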
TOOLS OF HADOOP
HADOOP HIVE

• Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was developed by Facebook.
• Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries called HQL (Hive Query Language), which get internally converted to MapReduce jobs.
• It is capable of analyzing large datasets stored in HDFS.
• It allows different storage types such as plain text, RCFile, and HBase.

WHY HIVE?
• Hive is commonly used by Data Analysts.
• It follows SQL-like queries.
• Hive is fast and scalable.
• It can handle structured data.
• It works on the server side of an HDFS cluster.
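As a sketch of what HQL looks like: the `page_views` table and its columns are hypothetical, and Hive compiles the query down to MapReduce jobs as described above.

```sql
-- Hypothetical table of web page views, stored in HDFS as plain text
CREATE TABLE page_views (user_id STRING, url STRING, view_time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- A familiar SQL-style aggregation over data living in HDFS
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```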
HADOOP PIG

• Pig represents Big Data as data flows. Pig is a high-level platform or tool used to process large datasets.
• Pig Latin and Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
• Pig Latin is the language used to develop data analysis code. To process data stored in HDFS, programmers write scripts in the Pig Latin language.
• The Pig Engine (a component of Apache Pig) converts all these scripts into specific map and reduce tasks, but these are not visible to the programmers, in order to provide a high level of abstraction.

WHY PIG?
• Pig is commonly used by programmers.
• It follows the data-flow language.
• It can handle semi-structured data.
• It works on the client side of an HDFS cluster.
• Pig is faster compared to Hive.
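A classic Pig Latin script, showing the data-flow style, looks like this word-count sketch; the input and output paths are hypothetical.

```pig
-- Load lines of text from HDFS
lines   = LOAD '/user/demo/input.txt' AS (line:chararray);

-- Split each line into words (each step transforms the previous data flow)
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words and count them
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;

-- Pig Engine compiles these steps into map and reduce tasks;
-- as noted above, the result always lands in HDFS
STORE counts INTO '/user/demo/wordcount';
```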
THANK YOU