
The material in this presentation belongs to St. Francis Institute of Technology and is solely for educational purposes.

Distribution and modifications of the content is prohibited.

BIG DATA ANALYTICS (BDA)


ITDO8011

Subject In-charge
Sonali Suryawanshi
Assistant Professor, Department of Information Technology, SFIT
Room No. 328
email: [email protected]


Module 3: MapReduce Paradigm


CO3: Implement several data-intensive tasks using the MapReduce paradigm.


Contents:
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce
Execution, Coping With Node Failures.

Algorithms Using MapReduce:


Matrix-Vector Multiplication by MapReduce,
Relational-Algebra Operations,
Computing Selections by MapReduce,
Computing Projections by MapReduce,
Union, Intersection, and Difference by MapReduce,
Computing Natural Join by MapReduce,
Grouping and Aggregation by MapReduce,
Matrix Multiplication, Matrix Multiplication with One MapReduce Step.

Illustrating the use of MapReduce with real-life databases and applications.

Self-learning Topics: Implementation of MapReduce algorithms like Word Count, Matrix-Vector and Matrix-Matrix multiplication.

MapReduce
Definition:
MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.


MapReduce

Features:
• Two functions, called Map and Reduce.
• A programming model for distributed computing, written in Java.
• Easy to scale data processing over multiple computing nodes.
• Decomposing a data processing application into mappers and reducers is sometimes nontrivial.
• But once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change.


MapReduce: batch processing framework


MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,


MapReduce: batch processing framework


Implementation of MapReduce in 3 phases
In brief, a MapReduce computation executes as follows:

1. Some number of Map tasks are each given one or more chunks from a distributed file system; they turn each chunk into a sequence of key-value pairs. The key-value pairs produced from the input data are determined by the code written by the user for the Map function.


MapReduce: batch processing framework


Implementation of MapReduce
In brief, a MapReduce computation executes as follows:

2. The key-value pairs from each Map task are collected by a master controller and sorted by key. The keys are divided among all the Reduce tasks, so that all key-value pairs with the same key wind up at the same Reduce task.


MapReduce: batch processing framework


Implementation of MapReduce

3. The Reduce tasks work on one key at a time, combining all the values associated with that key in some way. The manner of combination of values is determined by the code written by the user for the Reduce function.


The map function takes as input a key/value pair, which represents a logical record from the input data source. In the case of a file this could be a line; if the input source is a table in a database, it could be a row. The map function produces zero or more output key/value pairs for one input pair. For example, if the map function is a filtering map function, it may only produce output if a certain condition is met; or it could perform a demultiplexing operation, where a single key/value pair yields multiple key/value pairs.

Map(key1, value1) → list(key2, value2)

Example: a 200 MB input file File.txt is stored as chunks A.txt (64 MB), B.txt (64 MB), c.txt (64 MB) and d.txt (8 MB). For the text "Hi how are you / How is big data class / you are learning class about mapreduce", each line is presented to the Map function as a (byte offset, value) pair, e.g. (1, "hi how are you"), (16, "how is big data class"), and the word-count Map function emits (Hi,1), (how,1), (are,1), (how,1), ..., (you,1).



MapReduce: batch processing framework


Main Components of MapReduce
The main components of MapReduce are listed
below:
1. JobTracker: The JobTracker is the master which manages the jobs and resources in the cluster. The JobTracker tries to schedule each map task on the TaskTracker which is running on the same DataNode as the underlying block.

2. TaskTrackers: TaskTrackers are slaves which are deployed on each machine in the cluster. They are responsible for running the map and reduce tasks as instructed by the JobTracker.

3. JobHistoryServer: The JobHistoryServer is a daemon that saves historical information about completed tasks/applications.



Sequence diagram explanation

DataNodes are the slaves which provide the actual storage and are deployed on each machine. They are
responsible for processing read and write requests for the clients. The following are the other functions of
DataNode:

1. Handles block storage on multiple volumes and also maintains block integrity.
2. Periodically sends heartbeats and also the block reports to NameNode.
Figure 2.2 shows how HDFS handles job processing requests from the user in the form of Sequence Diagram.
User copies the input files into DFS and submits the job to the client. Client gets the input file information from
DFS, creates splits and uploads the job information to DFS. JobTracker puts ready job into the internal queue.
JobScheduler picks job from the queue and initializes the job by creating job object. JobTracker creates a list of
tasks and assigns one map task for each input split.
TaskTrackers send heartbeats to the JobTracker to indicate they are ready to run new tasks. The JobTracker chooses a task from the first job in the priority queue and assigns it to the TaskTracker.

Secondary NameNode is responsible for performing periodic checkpoints. These are used to restart the
NameNode in case of failure. MapReduce can then process the data where it is located.

MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,

Map task
• The input consists of elements: a tuple, a line, or a document.
• A chunk is a collection of elements.
• All inputs to a Map task are key–value pairs.
• The Map function converts the elements to zero or more key–value pairs.
• Keys need not be unique.
• Several identical key–value pairs from the same element are possible.


MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,

Map task
• Example: count the number of words in a document.
• A document is an element.
• The Map function has to read a document and break it into a sequence of words.
• Each occurrence of a word is counted as one. The same word may appear many times, so the key is not unique.
• Output of the Map function → (word1, 1), (word2, 1), …, (wordn, 1).
• The input can be a repository or collection of documents, the output is the number of occurrences of each word, and a single Map task can process all the documents in one or more chunks.


MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,

The sample code for the Mapper class:

public class SampleMapper extends Mapper<K1, V1, K2, V2> {
    void map(K1 key, V1 value, Context context) throws IOException, InterruptedException {
        ...
    }
}

The Hadoop Java API includes a Mapper class. An abstract function map() is present in the Mapper class. Any specific Mapper implementation should be a subclass of this class and override the abstract function map().
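For illustration (not part of the original slides), a minimal word-count mapper written against the Hadoop Java API could look like the sketch below; the class name WordCountMapper is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper sketch: emits (word, 1) for every word in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself (from TextInputFormat)
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}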


MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,

• Number of maps = number of input splits (by default, one per block of the input files).

• E.g., if the input file size is 1 TB and the block size is 128 MB, then there will be 8192 maps.

• The number of map tasks can be explicitly suggested by using setNumMapTasks(int); the framework treats this only as a hint.

• A typical guideline is 10–100 maps per node.
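For illustration (not from the slides), the arithmetic above and the old-API hint might look like the sketch below; the class name is illustrative, and setNumMapTasks() is only a hint to the framework:

import org.apache.hadoop.mapred.JobConf;

// 1 TB of input with 128 MB blocks => 1,048,576 MB / 128 MB = 8192 splits => 8192 map tasks.
public class MapCountSketch {
    public static void main(String[] args) {
        long inputSizeMB = 1024L * 1024L;           // 1 TB expressed in MB
        long blockSizeMB = 128L;
        long numMaps = inputSizeMB / blockSizeMB;   // 8192

        JobConf conf = new JobConf();               // old (mapred) API
        conf.setNumMapTasks((int) numMaps);         // treated as a hint, not a hard setting
        System.out.println("Expected map tasks: " + numMaps);
    }
}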


MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,

• 1) Data should first be converted to key-value pairs before it is passed to the mapper, as the mapper only understands key-value pairs of data.
• 2) The pairs are generated as follows:
• i) InputSplit: defines a logical representation of the data and presents the split data for processing at individual maps.
• ii) RecordReader: communicates with the InputSplit and converts the split into records (key, value).
• It uses TextInputFormat by default for converting data into key-value pairs.
• (Other formats: KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat.)



MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,

Grouping by Key
❖ As soon as the Map tasks have all completed successfully, the key-value pairs are grouped by key, and the values associated with each key are formed into a list of values, e.g. (w1, [1,1,1]), (w2, [1,1]).
❖ The grouping is performed by the system, regardless of what the Map and Reduce tasks do. The master controller process knows how many Reduce tasks there will be, say r such tasks.
❖ The user typically tells the MapReduce system what r should be. Then, the master controller picks a hash function that applies to keys and produces a bucket number from 0 to r − 1.
❖ Each key that is output by a Map task is hashed and its key-value pair is put in one of r local files. Each file is destined for one of the Reduce tasks.


MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,

Grouping by Key
❖ The master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-list-of-values pairs.

❖ That is, for each key k, the input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, . . . , vn]), where (k, v1), (k, v2), . . . , (k, vn) are all the key-value pairs with key k coming from all the Map tasks.


MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,

Partitioner
❖ Semi-mappers in MapReduce; an optional class (the driver class can specify the Partitioner).
❖ A partitioner processes the mapper output before submitting it to the reducer tasks.
❖ It is an optimization in MapReduce that allows local partitioning before the reduce-task phase.
❖ Its main function is to split the map output records by key.
❖ The function of the MapReduce partitioner is to make sure that all the values of a single key go to the same reducer, which eventually helps even distribution of the map output over the reducers.
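As an illustration (not from the slides), a minimal custom partitioner mirroring what Hadoop's default HashPartitioner does might look like the sketch below, assuming Text keys and IntWritable values; the class name is illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends every occurrence of the same key to the same reducer:
// bucket = hash(key) mod r, where r = number of reduce tasks.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

In the driver, such a class would be registered with job.setPartitionerClass(WordPartitioner.class).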


MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,

Combiners
❖ Semi-reducers; an optional class (the driver class can specify the combiner).
❖ Sometimes a Reduce function is associative and commutative.
❖ That is, the values to be combined can be combined in any order, with the same result.
❖ It does not matter how we group a list of numbers v1, v2, . . . , vn; the sum will be the same.
❖ When the Reduce function is associative and commutative, we can push some of what the reducers do to the Map tasks. This reduces the input/output traffic between the mappers and the reducers.
❖ In word count, for example, the (w, 1) pairs would thus be replaced by one pair with key w and value equal to the sum of all the 1's in all those pairs.


MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,

Combiners
❖ That is, the pairs with key w generated by a single Map task would be replaced by a pair (w, m), where m is the number of times that w appears among the documents handled by this Map task.

✔ Note that it is still necessary to do grouping and aggregation and to pass the result to the Reduce tasks, since there will typically be one key-value pair with key w coming from each of the Map tasks.


MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,

The Reduce Tasks


❖ The Hadoop Java API includes a Reducer class.
❖ An abstract function reduce() is present in the Reducer class.
❖ Any specific Reducer implementation should be a subclass of this class and override the abstract method reduce().
❖ The Reduce function's argument is a pair consisting of a key and its list of associated values.
❖ The output of the Reduce function is a sequence of zero or more key-value pairs.
❖ A Reduce task receives one or more keys and their associated value lists.
❖ That is, a Reduce task executes one or more reducers.
❖ The outputs from all the Reduce tasks are merged into a single file.
❖ In the word count problem, the Reduce function adds up all the values in the list.
❖ The output of the Reduce task is a sequence of (w, v) pairs, where w is a word (key) that appears at least once among all the input documents and v is the total number of times it was observed:
❖ (w1, v1), (w2, v2), ...


MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,

The sample code for the Reducer class:

public class SampleReducer extends Reducer<K2, V2, K3, V3> {
    void reduce(K2 key, Iterable<V2> values, Context context) throws IOException, InterruptedException {
        ...
    }
}

The Hadoop Java API includes a Reducer class. An abstract function reduce() is present in the Reducer class. Any specific Reducer implementation should be a subclass of this class and override the abstract function reduce().
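For illustration (not part of the original slides), a minimal word-count reducer matching the mapper sketch earlier could look as follows; the class name WordCountReducer is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer sketch: sums the 1's emitted by the mappers for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();         // add up all counts for this word
        }
        total.set(sum);
        context.write(key, total);  // emit (word, total count)
    }
}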



MapReduce Execution Pipeline components


Driver : main program

Input data : in HDFS

Mapper : for each map task

Shuffle and sort

Reducer: user defined code

Optimizing the MapReduce process by using:


Combiners

Distributed cache

MapReduce Execution Pipeline components


1) Driver
• The main program that initializes the MapReduce jobs and gets back the status of job execution.
• For each job, it defines the configuration and specification of all its components (Mapper, Reducer, Combiner and custom Partitioner), including the input/output formats.
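A minimal word-count driver along these lines might look like the sketch below (not from the slides); it assumes the illustrative WordCountMapper and WordCountReducer classes sketched earlier, and the input/output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");           // job name is illustrative
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);                // mapper sketch above
        job.setCombinerClass(WordCountReducer.class);             // optional combiner
        job.setReducerClass(WordCountReducer.class);              // reducer sketch above

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}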


MapReduce Execution Pipeline components


2) Input data
• May reside in HDFS or HBase.
• InputFormat defines the number of map tasks in the mapping phase.
• The job Driver invokes the InputFormat directly to decide the number of splits (InputSplits) and the location of the map task execution.
• Each map task is given a single InputSplit to work on.
• The RecordReader class, defined by the InputFormat, reads the data inside the mapper task.
• The RecordReader converts the data into key–value pairs and delivers them to the map method.


MapReduce Execution Pipeline components


3) Mapper
• For each map task, a new instance of the mapper is instantiated.

• Mappers do not communicate with each other.

• The key space produced by the mapper, that is, all intermediate data from the mapper, is partitioned and given as input to the reducers.

• The partitioner determines the reduce node for a given key–value pair.

• All map values with the same key are reduced together, so all the map nodes must agree on the reducer node for each key.

MapReduce Execution Pipeline components


Shuffle and sort:

Moves the map outputs to the reducers.

Triggered when a mapper completes its job.

The sort process groups the key–value pairs to form the list of values per key.

All data from a partition goes to the same reducer.

Shuffling: pairs with the same key are grouped together and passed to a single machine that will run the reducer code over them.

MapReduce Execution Pipeline components


Reducer:

user-defined code.

receives a key along with an iterator over all the values associated with
the key and produces the output key– value pairs.

RecordWriter is used for storing data in a location specified by OutputFormat.

The final output comes from the reducer, or directly from the mapper if no reducer is present.



MapReduce Execution Pipeline components


Optional Combiners:
For optimizing the key-value processing between the map and reduce phases.

Combiners push some of the Reduce work to the Map tasks.

The Combiner takes the mapper output as input and combines the values with the same key, to reduce the number of keys (the key space) that must be sorted and shuffled.


MapReduce Execution Pipeline components

Distributed cache:

• A mechanism for sharing data globally with all nodes in the cluster.

• Acts as a shared library that each task can access.

• The user's code for the driver, map and reduce, along with the configuration parameters, can be packaged into a single jar file and placed in this cache.
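For illustration (not from the slides), a mapper can load a side file shipped through the distributed cache in its setup() method; the stop-word file, its path and the class name below are all hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper that skips stop words loaded from the distributed cache.
// In the driver: job.addCacheFile(new URI("/shared/stopwords.txt"));   // path is a placeholder
public class StopWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Set<String> stopWords = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] cached = context.getCacheFiles();   // files registered via the distributed cache
        if (cached != null && cached.length > 0) {
            // Cached files are localized on each node; by default they are also
            // symlinked into the task's working directory under their base name.
            try (BufferedReader in = new BufferedReader(new FileReader("stopwords.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    stopWords.add(line.trim());
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                context.write(new Text(token), ONE);   // emit (word, 1) for non-stop words
            }
        }
    }
}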


MapReduce Execution Pipeline components


Process Pipeline
1. The Job driver uses the InputFormat to partition a map's execution and initiates a JobClient.
2. The JobClient communicates with the JobTracker and submits the job for execution.
3. The JobTracker creates one Map task for each split, as well as a set of reducer tasks.
4. The TaskTracker that is present on every node of the cluster controls the actual execution of the Map tasks.
5. Once the TaskTracker starts a Map task, it periodically sends a heartbeat message to the JobTracker to communicate that it is alive and also to indicate that it is ready to accept a new task for execution.
6. The JobTracker then uses a scheduler to allocate the task to the TaskTracker by using the heartbeat return value.


MapReduce Execution Pipeline components


7. Once the task is assigned to the TaskTracker, it copies the job JAR file, along with other files needed for the execution of the task, to the TaskTracker's local file system, and creates an instance of a task runner (a child process).
8. The child process informs the parent (TaskTracker) about the task progress every few seconds until it completes the task.
9. When the last task of the job is complete, the JobTracker receives a notification and changes the status of the job to "completed".
10. By periodically polling the JobTracker, the JobClient learns the job status.


MapReduce Execution Pipeline components

Why is there a need for a combiner?

Without a Combiner:
"hello hello there" => mapper1 => (hello, 1), (hello, 1), (there, 1)
"howdy howdy again" => mapper2 => (howdy, 1), (howdy, 1), (again, 1)
Both outputs get to the reducer => (again, 1), (hello, 2), (howdy, 2), (there, 1)

Using the Reducer as the Combiner:
"hello hello there" => mapper1 with combiner => (hello, 2), (there, 1)
"howdy howdy again" => mapper2 with combiner => (howdy, 2), (again, 1)
Both outputs get to the reducer => (again, 1), (hello, 2), (howdy, 2), (there, 1)
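Because word-count addition is associative and commutative, the reducer itself can be reused as the combiner. A one-line sketch, referring to the Job object configured in the hypothetical driver shown earlier:

// Pre-aggregates (word, 1) pairs into (word, partial count) on the map side,
// so fewer pairs are shuffled to the reducers.
job.setCombinerClass(WordCountReducer.class);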

MapReduce

Working of Mapreduce :

https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

https://www.geeksforgeeks.org/map-reduce-in-hadoop/


The difference between chunks (splits) and block size:

Block: HDFS talks in terms of blocks. For example, if you have a file of 256 MB and you have configured your block size as 128 MB, then 2 blocks get created for the 256 MB file.

Block size is configurable across the cluster and even on a per-file basis.

Split: A split is a MapReduce concept. You have the option to change the split size: you can make the split size greater than or less than the block size. By default, if you do no configuration, the split size is approximately equal to the block size.

In MapReduce processing, the number of mappers spawned equals the number of splits: for a file with 10 splits, 10 mappers are spawned.

When a put command is fired, it goes to the NameNode. The NameNode asks the client (in this case the hadoop fs utility behaves like a client) to break the file into blocks as per the block size, which can be defined in hdfs-site.xml, and then asks the client to write the different blocks to different DataNodes.

The actual data is stored on the DataNodes, while the metadata (the file's block locations and file attributes) is stored on the NameNode.

The client first establishes a connection with the NameNode; once it gets confirmation about where to store a block, it makes a TCP connection directly with the DataNodes and writes the data.

Based on the replication factor, other copies are maintained in the Hadoop cluster and their block information is stored on the NameNode.

But in no scenario does a DataNode hold duplicate copies of a block, i.e., the same block is never replicated on the same node.
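For illustration (not from the slides), the block size can also be chosen per file when writing to HDFS; the path and sizes below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: create an HDFS file with an explicit 128 MB block size, overriding the cluster default.
public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        int bufferSize = 4096;
        short replication = 3;                    // replication factor
        long blockSize = 128L * 1024 * 1024;      // 128 MB per block

        Path p = new Path("/tmp/example.txt");    // illustrative path
        try (FSDataOutputStream out = fs.create(p, true, bufferSize, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }
    }
}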
