Lecture 1-3: Hadoop - HDFS - MapReduce (Self Study)

Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It uses HDFS for storage, which partitions and replicates data across nodes for reliability. MapReduce is used for parallel processing, where jobs are split into tasks that are mapped and reduced across nodes.


Hadoop:

HDFS & MapReduce


Traditional Approach

• A single computer is used to store and process big data.

• For storage, programmers rely on the database vendor of their choice, such as Oracle or IBM.
• Limitation: processing a huge and ever-growing amount of data through a single database becomes a bottleneck.
Google’s Solution
• Google solved this problem using an algorithm called MapReduce.

• This algorithm divides the task into small parts and assigns them to many computers.

• The results are collected from these computers and, when integrated, form the result dataset.
Hadoop
• Using the solution provided by Google, Doug Cutting and his team developed an open-source project called HADOOP.

• Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel across nodes.
How Does Hadoop Work?
• It is quite expensive to build bigger servers with heavy configurations.

• As an alternative, we can tie together many single-CPU commodity computers into a single functional distributed system.

• The clustered machines can read the dataset in parallel and provide much higher throughput.

• Moreover, it is cheaper than one high-end server.


Hadoop runs code across a cluster of computers. This process
includes the following core tasks that Hadoop performs −
• Data is initially divided into directories and files.
• Files are divided into uniform-sized blocks of 128 MB or 64 MB (preferably 128 MB).
• These blocks are then distributed across various cluster nodes for further processing.
• HDFS, sitting on top of the local file system, supervises the processing.
• Blocks are replicated to handle hardware failure.
• The sort that takes place between the map and reduce stages is performed.
• The sorted data is sent to a particular computer.
• Debugging logs are written for each job.
Hadoop Architecture
• At its core, Hadoop has two major layers, namely:
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
1. Hadoop - MapReduce
• A software framework for distributed processing of large data sets.
• The framework takes care of scheduling tasks, monitoring them and re-
executing any failed tasks.
• It splits the input data set into independent chunks that are processed in a
completely parallel manner.
• The MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
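
To make the map and reduce stages concrete, here is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API (not part of the original slides; class names are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map task: emit (word, 1) for every word in an input line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: the framework has already sorted the map outputs by key,
    // so each call receives one word together with all of its counts.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

The independent chunks mentioned above correspond to input splits: each split is handed to one map task, and the sorted map outputs are then pulled by the reduce tasks.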
MapReduce Architecture
Dataflow in MapReduce
• An input reader
• A Map function
• A partition function
• A compare function
• A Reduce function
• An output writer
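
Of these stages, the partition function decides which reduce task receives each map output key. A minimal sketch of a custom Partitioner that restates the default hash-based behaviour (the class name is illustrative and assumes the word-count types from the earlier sketch):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each (key, value) pair emitted by the mappers to one of the reduce tasks.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask the sign bit so the partition index is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

The compare function, in turn, determines the sort order of keys within each partition before they reach the reducer.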
JobTracker
• JobTracker is the daemon service for submitting and tracking MapReduce
jobs in Hadoop.

• JobTracker performs the following actions in Hadoop:


• It accepts the MapReduce Jobs from client applications
• Talks to NameNode to determine data location
• Locates available TaskTracker Node
• Submits the work to the chosen TaskTracker Node
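
From a client application's point of view, submitting a job looks roughly like the driver sketch below (using the org.apache.hadoop.mapreduce.Job API; input and output paths are placeholders, and the mapper/reducer classes are the ones sketched earlier). In classic Hadoop 1.x this submission is what hands the job to the JobTracker.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);     // mapper from the earlier sketch
            job.setReducerClass(WordCountReducer.class);   // reducer from the earlier sketch
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

            // Submits the job to the cluster and blocks until it completes.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }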
TaskTracker
• A TaskTracker node accepts map, reduce or shuffle operations from a
JobTracker.
• It is configured with a set of slots; these indicate the number of tasks that it can accept.
• The JobTracker looks for a free slot when assigning a task.
• The TaskTracker notifies the JobTracker about the success status of each task.
• The TaskTracker also sends heartbeat signals to the JobTracker to confirm its availability, and it reports the number of free slots it currently has.
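
The heartbeat protocol itself is internal to Hadoop, but the information it carries can be pictured with a small, purely illustrative status record (plain Java, not a Hadoop API):

    // Illustrative only: the kind of information a TaskTracker reports
    // to the JobTracker in each heartbeat.
    public class HeartbeatStatus {
        private final String trackerName;
        private final int freeMapSlots;
        private final int freeReduceSlots;

        public HeartbeatStatus(String trackerName, int freeMapSlots, int freeReduceSlots) {
            this.trackerName = trackerName;
            this.freeMapSlots = freeMapSlots;
            this.freeReduceSlots = freeReduceSlots;
        }

        // The JobTracker uses the free-slot counts to decide whether
        // this node can be assigned further map or reduce tasks.
        public boolean canAcceptMapTask()    { return freeMapSlots > 0; }
        public boolean canAcceptReduceTask() { return freeReduceSlots > 0; }
    }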
2. Hadoop - HDFS Overview
• HDFS holds very large amounts of data and provides easy access to it.
• To store such huge data, the files are stored across multiple machines.
• These files are stored in a redundant fashion to protect the system against possible data loss in case of failure.
• HDFS also makes applications available for parallel processing.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the namenode and datanode help users easily check the status of the cluster.
• It provides streaming access to file system data.
• HDFS provides file permissions and authentication.
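
Besides the command interface, applications can interact with HDFS programmatically through the org.apache.hadoop.fs.FileSystem API. A minimal read sketch (the file path is a placeholder, and the configuration is assumed to point at a running cluster):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/input.txt");   // placeholder path
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }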
HDFS Architecture
• HDFS follows the master-slave
architecture and it has the
following elements.
• Namenode
• Datanode
• Block
Namenode
• The namenode is commodity hardware that runs the namenode software.
• The system hosting the namenode acts as the master server and performs the following tasks −
• Manages the file system namespace.
• Regulates client’s access to files.
• It also executes file system operations such as renaming, closing, and opening files
and directories.
Datanode
• The datanode is commodity hardware that runs the datanode software.
• For every node (commodity hardware/system) in a cluster, there will be a datanode.
• These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file system, as per client requests.
• They also perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.
Block
• Generally the user data is stored in the files of HDFS.
• A file in HDFS is divided into one or more segments, which are stored on individual data nodes.
• These file segments are called blocks.
• In other words, the minimum amount of data that HDFS can read or write is called a block.
• The default block size is 64 MB (128 MB in newer Hadoop releases), but it can be changed as needed in the HDFS configuration.
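
The block size is chosen when a file is written. A sketch of requesting 128 MB blocks from the client side, assuming the Hadoop 2.x property name dfs.blocksize (older releases used dfs.block.size); the output path is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Request 128 MB blocks for files written by this client.
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.dat"))) {
                out.writeBytes("data written with the requested block size\n");
            }
        }
    }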
Data Replication
• Replica placement
• Creating a replica on every machine would require too much initialization time.
• An approximate solution: keep only 3 replicas.
• One replica resides on the current node.
• One replica resides on another node in the current rack.
• One replica resides on a node in another rack.
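
The replication factor can also be adjusted per file. A sketch using FileSystem.setReplication (the path is a placeholder; 3 matches the default described above). The namenode then takes care of the rack-aware placement of the copies:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Ask HDFS to keep 3 copies of an existing file.
            boolean accepted = fs.setReplication(new Path("/user/demo/input.txt"), (short) 3);
            System.out.println("replication change accepted: " + accepted);
        }
    }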
Goals of HDFS
• Fault detection and recovery −
• Since HDFS includes a large amount of commodity hardware, component failure is frequent.
• Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
• Huge datasets −
• HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets.
• Hardware at data −
• A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
Thank You
