Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It uses HDFS for storage, which partitions and replicates data across nodes for reliability. MapReduce is used for parallel processing, where jobs are split into tasks that are mapped and reduced across nodes.
Hadoop:
HDFS & MapReduce
Traditional Approach
• Traditionally, a single computer is used to store and process big data.
• For storage, programmers take the help of their preferred database vendors, such as Oracle, IBM, etc.
• Limitation: processing a huge and ever-growing amount of data through a single database becomes a bottleneck.
Google's Solution
• Google solved this problem using an algorithm called MapReduce.
• This algorithm divides the task into small parts and assigns them to many computers.
• It then collects the results from those computers, which, when integrated, form the result dataset.
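As a rough illustration of that divide-and-combine idea (plain Java, not Hadoop itself), the sketch below splits a summation task into parts, hands each part to a worker thread standing in for a separate machine, and then integrates the partial results. The class name and the input values are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DivideAndCombine {
    public static void main(String[] args) throws Exception {
        // The task's input, already divided into small parts.
        List<List<Integer>> parts = List.of(
                List.of(1, 2, 3), List.of(4, 5, 6), List.of(7, 8, 9));

        // "Assigns them to many computers": here a thread pool stands in
        // for the cluster, one worker per part.
        ExecutorService workers = Executors.newFixedThreadPool(parts.size());
        List<Future<Integer>> partials = new ArrayList<>();
        for (List<Integer> part : parts) {
            partials.add(workers.submit(
                    () -> part.stream().mapToInt(Integer::intValue).sum()));
        }

        // "Collects the results from them, which, when integrated, form the
        // result dataset": combine the partial sums into one answer.
        int total = 0;
        for (Future<Integer> partial : partials) {
            total += partial.get();
        }
        workers.shutdown();
        System.out.println("total = " + total); // prints: total = 45
    }
}
```

MapReduce applies the same pattern, except that the parts are data blocks stored on different machines and the framework takes care of distribution, fault handling, and the final integration.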
Hadoop
• Using the solution provided by Google, Doug Cutting and his team developed an open-source project called HADOOP.
• Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different nodes.
How Does Hadoop Work?
• It is quite expensive to build bigger servers with heavy configurations.
• As an alternative, we can tie together many single-CPU commodity computers as a single functional distributed system.
• The clustered machines can read the dataset in parallel and provide a much higher throughput.
• Moreover, it is cheaper than one high-end server.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:
• Data is initially divided into directories and files.
• Files are divided into uniform-sized blocks of 128 MB or 64 MB (preferably 128 MB).
• These files are then distributed across various cluster nodes for further processing.
• HDFS, sitting on top of the local file system, supervises the processing.
• Blocks are replicated to handle hardware failure.
• The sort that takes place between the map and reduce stages is performed.
• The sorted data is sent to a certain computer.
• Debugging logs are written for each job.
Hadoop Architecture
• At its core, Hadoop has two major layers, namely:
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
1. Hadoop - MapReduce
• A software framework for distributed processing of large data sets.
• The framework takes care of scheduling tasks, monitoring them, and re-executing any failed tasks.
• It splits the input data set into independent chunks that are processed in a completely parallel manner.
• The MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
MapReduce Architecture
Dataflow in MapReduce
• An input reader
• A Map function
• A partition function
• A compare function
• A Reduce function
• An output writer
A minimal word-count job that exercises this dataflow is sketched below.
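The sketch uses the standard org.apache.hadoop.mapreduce API for the classic word-count example. The class names and input/output paths are illustrative; details such as the combiner are optional choices, not requirements of the framework.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map function: for each input line, emit (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce function: the framework has already sorted and grouped the map
  // output by key, so each call sees one word with all of its counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework splits the input into chunks, runs one map task per chunk, sorts the map output, and feeds it to the reduce tasks, as described above. The job would typically be packaged into a JAR and submitted with something like `hadoop jar wordcount.jar WordCount /input /output` (paths assumed for the example).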
JobTracker
• JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop.
• JobTracker performs the following actions in Hadoop:
• It accepts MapReduce jobs from client applications.
• It talks to the NameNode to determine the data location.
• It locates an available TaskTracker node.
• It submits the work to the chosen TaskTracker node.
TaskTracker
• A TaskTracker node accepts map, reduce, or shuffle operations from a JobTracker.
• It is configured with a set of slots; these indicate the number of tasks it can accept.
• The JobTracker looks for a free slot to assign a job.
• The TaskTracker notifies the JobTracker about the job success status.
• The TaskTracker also sends heartbeat signals to the JobTracker to confirm its availability, and it reports the number of free slots it has available.
2. Hadoop - HDFS Overview
• HDFS holds a very large amount of data and provides easier access.
• To store such huge data, the files are stored across multiple machines.
• These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure.
• HDFS also makes applications available for parallel processing.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the namenode and datanode help users easily check the status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS Architecture
• HDFS follows the master-slave architecture, and it has the following elements:
• Namenode
• Datanode
• Block
Namenode
• The namenode is the commodity hardware that contains the namenode software.
• The system having the namenode acts as the master server, and it does the following tasks:
• Manages the file system namespace.
• Regulates clients' access to files.
• It also executes file system operations such as renaming, closing, and opening files and directories.
Datanode
• The datanode is commodity hardware having the datanode software.
• For every node (commodity hardware/system) in a cluster, there will be a datanode.
• These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client request.
• They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Block
• Generally, the user data is stored in the files of HDFS.
• A file in the file system is divided into one or more segments and/or stored on individual data nodes.
• These file segments are called blocks.
• In other words, the minimum amount of data that HDFS can read or write is called a block.
• The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
A small client-side sketch that touches files, blocks, and replication follows below.
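The sketch uses Hadoop's FileSystem API to write a file into HDFS and then asks the namenode for its metadata, including the block size and replication factor. The namenode host, port, and paths are placeholders, not values from this document.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    // Point the client at the cluster; host and port are placeholders.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

    FileSystem fs = FileSystem.get(conf);

    // Create a file. The client asks the namenode where to place the blocks,
    // then streams the data to the chosen datanodes.
    Path file = new Path("/user/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("Hello, HDFS");
    }

    // Metadata such as block size and replication factor comes back from the
    // namenode (set cluster-wide via dfs.blocksize and dfs.replication).
    FileStatus status = fs.getFileStatus(file);
    System.out.println("block size  = " + status.getBlockSize());
    System.out.println("replication = " + status.getReplication());

    fs.close();
  }
}
```

The command interface mentioned above performs the same operations from a shell, e.g. `hdfs dfs -put` to copy a file in and `hdfs dfs -ls` to list it.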
Roles of Components
Data Replication
• Replica placement: creating replicas on all machines would have a high initialization time.
• An approximate solution: keep only 3 replicas.
• One replica resides on the current node.
• One replica resides in the current rack.
• One replica resides in another rack.
Goals of HDFS
• Fault detection and recovery: since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
• Huge datasets: HDFS should have hundreds of nodes per cluster to manage applications having huge datasets.
• Hardware at data: a requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces the network traffic and increases the throughput.
Thank You