
Hadoop MapReduce Flow Chart
BEN FTIMA MOHAMED
WhatsApp: 21692971305
Map Flow Chart: my job in Hadoop is WordCountJob.

I take one small file, big.txt, whose lines look like this:

Hi how are you
Hi how is your job
How is your family
How is your sister
How is your brother
What is time now
What is the strength of hadoop

If it is 400 MB and I keep this file in HDFS, how many blocks will be given? 128 + 128 + 128 = 384 MB, plus the remaining 16 MB: 4 blocks will be given. Right, ok. So let's split this file into 4 input splits, because your HDFS block size is 128 MB (Hadoop 2.0).
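As a quick check, assuming big.txt has already been copied into HDFS (the path /user/benftima/big.txt below is hypothetical), the real hdfs fsck command lists a file's blocks:

>benftima$ hdfs fsck /user/benftima/big.txt -files -blocks

For a 400 MB file with a 128 MB block size it should report 4 blocks: three full 128 MB blocks and one 16 MB block.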
I have to count the number of times each word occurs in this file big.txt:
how many times "hi" occurs: 2 times
how many times "how" occurs: 5 times

The output should be:
(hi,2)
(how,5)
(are,1)
(is,6)
(you,3)
(your,1)
(job,1)
(family,1)
(father,1)
……
(what,3)
>benftima$ hadoop jar wordcount.jar word.class big.txt outputDir (outputDir is optional)
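Before going further, here is a minimal sketch of what the entry point inside wordcount.jar might look like, assuming the Hadoop 2.x Java API. The class names WordCount, WordMapper and WordReducer are placeholders of mine, not names from the slides; WordMapper and WordReducer are sketched later in this deck.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);    // sketched later in this deck
        job.setReducerClass(WordReducer.class);  // sketched later in this deck
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // big.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // outputDir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}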

Generally, when applying the wordcount.jar file to the whole big.txt file, if I want to process something I have to process the entire file, not each block separately.
Input split 1 -> one mapper will be created (and likewise one mapper per split). This is parallel processing.
Hadoop can only run MapReduce on key-value pairs (K, V), so we have to convert the lines into key-value pairs. How? With the RecordReader interface; lines are called records in Hadoop terminology. We need to tell the RecordReader the input file format; by default it is TextInputFormat.
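TextInputFormat is the default, so the driver does not have to set it; but if you want to state it explicitly, this one call (a real Hadoop 2.x method) goes inside main() of the driver sketch above, right after the Job is created:

job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);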
take a break
How do we convert a record into key-value pairs, given that our MapReduce can only take key-value pairs? The RecordReader interface will do the job for us. But how?

We have to know the file input format in the RecordReader.

The RecordReader only knows how to read one record (one line) at a time from the input split:

1st record: hi how are you

(byteoffset, entireline) = (0, hi how are you)

2nd record: hi how is your job

(byteoffset, entireline) = (16, hi how is your job)

byteoffset: the first record "hi how are you" plus its line terminator occupies 16 bytes, so the second record starts at offset 16.
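To make the offset arithmetic concrete, here is a small standalone Java sketch (plain Java, not Hadoop code) that prints each record keyed by its byte offset, the way the record reader does. The second offset depends on the line terminator: with a one-byte \n it would be 15; the slides' figure of 16 matches a two-byte \r\n ending, which this sketch assumes.

import java.nio.charset.StandardCharsets;

public class OffsetDemo {
    public static void main(String[] args) {
        String[] records = { "hi how are you", "hi how is your job" };
        String terminator = "\r\n"; // assumption: two-byte line ending, matching the slides' offset of 16
        long offset = 0;
        for (String record : records) {
            System.out.println("(" + offset + ", " + record + ")");
            // the next record starts after this record's bytes plus its terminator
            offset += record.getBytes(StandardCharsets.UTF_8).length
                    + terminator.getBytes(StandardCharsets.UTF_8).length;
        }
    }
}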

For each key-value pair created, i.e. each (byte offset, line), the mapper runs once. This is parallel processing: instead of sending the whole task to one system, we send multiple tasks to multiple systems. The map interface works on key-value pairs, and just as the entire collections framework in Java is based on object types (no primitive types in collections), the key and value must be of object (box-class) types. In the pair (0, hi how are you), the key is the byte offset into the file, so it must be LongWritable, and the value can be Text. Ok.
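Under those assumptions, a minimal sketch of the mapper: the input key is the LongWritable byte offset, the input value is the Text line, and it emits a (word, 1) pair per word occurrence. WordMapper is the placeholder name used in the driver sketch above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable byteOffset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString().toLowerCase());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // e.g. (how, 1), emitted once per occurrence
        }
    }
}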
To summarize: your file input format is TextInputFormat, and your RecordReader converts that file into (key, value) = (byteoffset, line), where byteoffset is of type LongWritable and line is of type Text (Hadoop box types).
The mapper will accept duplicate keys. A mapper can take only a (key, value) pair as input and give only (key, value) pairs as output for the reducer; the output format depends on the job you are running.
Now comes the reducer, which will combine all your (key, value) pairs. Here the map function's output comes into the picture: a key should not be duplicated, but a value can be.


Shuffling

All wrapper classes implement the Comparable interface and provide an implementation of the compare method, so one key can be compared to another to decide the sorting order. All Hadoop box classes implement the WritableComparable interface. After shuffling there are no duplicate keys.
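A quick standalone illustration of that point, using the real Text box class (the demo itself is mine, not from the slides):

import org.apache.hadoop.io.Text;

public class CompareDemo {
    public static void main(String[] args) {
        Text a = new Text("hi");
        Text b = new Text("how");
        // prints a negative number: "hi" sorts before "how", which is the
        // order the framework uses for keys between the map and reduce phases
        System.out.println(a.compareTo(b));
    }
}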
Now sorting can be done automatically. After shuffling and sorting, the result is given to the reducer. The reducer is executed as many times as it has inputs, just as in Java web programming the doGet and doPost methods are executed as many times as requests reach the controller. Hadoop provides an identity reducer, which does the sorting only, and it is used if you don't write your own reducer; if you write your own reducer, both shuffling and sorting are done. The mapper doesn't know how to sort your key-value pairs; only the reducer side knows that. The mapper emits (how, 1) once per occurrence, so (how, 1) appears twice for two occurrences. When we write the reducer code, shuffling and sorting come into play because of the iterator interface in the reducer class, as we will see when we discuss that logic. The identity reducer built into Hadoop knows how to sort but not how to shuffle, whereas your own reducer gets both sorting and shuffling.
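Under the same assumptions, a minimal sketch of the reducer: for each unique word it receives an Iterable over all the 1s the mappers emitted and sums them, e.g. (how, [1,1,1,1,1]) becomes (how, 5). WordReducer is the placeholder name used in the driver sketch.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum)); // handed to the RecordWriter
    }
}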
The reducer gives its result to the RecordWriter, which knows how to write these key-value pairs to the output.
