Introduction to MapReduce Programming

• Introduction
• Mapper
• RecordReader
• Map
• Combiner
• Partitioner
• Reducer
• Shuffle
• Sort
• Reduce
• Output Format
• Combiner
• Partitioner
• Searching
• Sorting
• Compression
"The alchemists in their search for gold discovered many other things ofgreater value."
- Arthur Schopenhauer, German Philosopher
WHAT'S IN STORE?
We assume that you are familiar with the basic concepts of HDFS and MapReduce Programming discussed in Chapters 4 and 5. The focus of this chapter will be to build on this knowledge to understand optimization techniques of MapReduce Programming such as combiner, partitioner, and compression. We will also discuss how to write MapReduce Programming for sorting and searching.
We suggest you refer to some of the learning resources provided at the end of this chapter for better learning and comprehension.
Big Data and Analytics
8.1 INTRODUCTION
In MapReduce Programming, Jobs (Applications) are split into a set of map tasks and reduce tasks. Then these tasks are executed in a distributed fashion on the Hadoop cluster. Each task processes a small subset of data that has been assigned to it. This way, Hadoop distributes the load across the cluster. A MapReduce job takes a set of files that is stored in HDFS (Hadoop Distributed File System) as input.
A map task takes care of loading, parsing, transforming, and filtering. The responsibility of a reduce task is grouping and aggregating the data that is produced by map tasks to generate the final output. Each map task is broken into the following phases:
1. RecordReader
2. Mapper
3. Combiner
4. Partitioner
The output produced by a map task is known as intermediate keys and values. These intermediate keys and values are sent to the reducer. The reduce tasks are broken into the following phases:
1. Shuffle
2. Sort
3. Reducer
4. Output Format
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides. This way, Hadoop ensures data locality. Data locality means that data is not moved over the network; only computational code is moved to process data, which saves network bandwidth.
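The phases above can be traced end to end with a small plain-Java simulation (illustrative only; this is not Hadoop code, and the class and method names are invented for the sketch): map emits (word, 1) pairs, shuffle/sort groups them by key, and reduce sums each group.

```java
import java.util.*;

public class WordCountFlow {
    // Map phase: emit an intermediate (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Shuffle and sort: group intermediate values by key, with keys in sorted order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: aggregate the list of values for each key into a single count.
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(shuffle(map("to be or not to be")));
        System.out.println(counts); // {be=2, not=1, or=1, to=2}
    }
}
```

In real Hadoop, the framework performs the shuffle and sort between the map and reduce phases; the simulation just makes the hand-off between the phases visible.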
8.2 MAPPER
A mapper maps the input key-value pairs into a set of intermediate key-value pairs. Maps are individual
tasks that have the responsibility of transforming input records into intermediate key-value pairs.
1. RecordReader: RecordReader converts a byte-oriented view of the input (as generated by the InputSplit) into a record-oriented view and presents it to the Mapper tasks. It presents the tasks with keys and values. Generally, the key is the positional information and the value is the chunk of data that constitutes the record.
2. Map: The map function works on the key-value pair produced by RecordReader and generates zero or more intermediate key-value pairs. The intermediate pairs are emitted through the MapReduce context.
3. Combiner: It is an optional function but provides high performance in terms of network bandwidth and disk space. It takes the intermediate key-value pairs provided by the mapper and applies a user-specific aggregate function to only that mapper. It is also known as a local reducer.
4. Partitioner: The partitioner takes the intermediate key-value pairs produced by the mapper, splits them into shards, and sends each shard to a particular reducer as per the user-specific code. Usually, all values for the same key go to the same reducer. The partitioned data of each map task is written to the local disk of that machine and pulled by the respective reducer.
8.3 REDUCER
The primary chore of the Reducer is to reduce a set of intermediate values (the ones that share a key) to a smaller set of values. The Reducer has three primary phases: Shuffle and Sort, Reduce, and Output Format.
1. Shuffle and Sort: This phase takes the output of all the partitioners and downloads them into the local machine where the reducer is running. Then these individual data pipes are sorted by keys, which produce larger data pipes. The main purpose of this sort is grouping similar words so that their values can be easily iterated over by the reduce task.
2. Reduce: The reducer takes the grouped data produced by the shuffle and sort phase, applies the reduce function, and processes one group at a time. The reduce function iterates all the values associated with that key. The reducer function provides various operations such as aggregation, filtering, and combining data. Once it is done, the output (zero or more key-value pairs) of the reducer is sent to the output format.
3. Output Format: The output format separates the key-value pair with a tab (default) and writes it out to a file using record writer.
The chores of Mapper, Combiner, Partitioner, and Reducer for the word count problem are discussed under "Combiner" and "Partitioner".
8.4 COMBINER
It is an optimization technique for a MapReduce job. Generally, the reducer class is set to be the combiner class. The difference between the combiner class and the reducer class is as follows:
1. Output generated by the combiner is intermediate data and it is passed to the reducer.
2. Output of the reducer is passed to the output file on disk.
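The saving a combiner buys can be seen in a small plain-Java simulation (not Hadoop code; the class and method names are invented for this sketch): pre-aggregating (word, count) pairs on the map side shrinks the number of intermediate records shipped to the reducer, while the final result is unchanged.

```java
import java.util.*;

public class CombinerEffect {
    // Without a combiner, every word in the split produces one intermediate record.
    static List<String> mapOutput(String line) {
        return Arrays.asList(line.toLowerCase().split("\\s+"));
    }

    // A combiner acts as a local reducer: it sums counts per word on the map side,
    // so each distinct word leaves the mapper as a single record.
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> local = new TreeMap<>();
        for (String w : words) local.merge(w, 1, Integer::sum);
        return local;
    }

    public static void main(String[] args) {
        List<String> raw = mapOutput("big data big ideas big jobs");
        Map<String, Integer> combined = combine(raw);
        System.out.println(raw.size());      // 6 records without the combiner
        System.out.println(combined.size()); // 4 records after the combiner
    }
}
```

The reducer then merges these partial counts exactly as it would merge the raw pairs, which is why the reducer class itself can usually serve as the combiner class.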
Objective: Write a MapReduce program to count the occurrence of similar words in a file. Use combiner for optimization.
Note: Refer Chapter 5 - Hadoop for Mapper Class and Reduce Class and Driver Program.
Input Data:
Act: In the driver program, set the combiner class as shown below.
job.setCombinerClass(WordCounterRed.class);
hadoop jar <jar name> <driver class> <input path> <output path>
Here driver class name, input path, and output path are optional arguments.
Output:
(HDFS browser screenshot: the job is run with "hadoop jar wordcount.jar ..." and the word count output is browsed in HDFS)
8.5 PARTITIONER
The partitioning phase happens after the map phase and before the reduce phase. Usually the number of partitions is equal to the number of reducers. The default partitioner is the hash partitioner.
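The core logic of the default hash partitioner can be sketched in plain Java; the formula below is the one used by Hadoop's HashPartitioner, though the surrounding class is invented for illustration.

```java
public class HashPartitionDemo {
    // The same formula Hadoop's default HashPartitioner uses:
    // mask off the sign bit, then take the remainder by the number of reducers.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition,
        // so all of its values meet at a single reducer.
        System.out.println(getPartition("hadoop", 4) == getPartition("hadoop", 4)); // true
        System.out.println(getPartition("hadoop", 4)); // some value in 0..3
    }
}
```

A custom partitioner, such as the alphabet-based one below, replaces this formula with user-specific routing while keeping the same contract: return a partition number in the range 0 to numReduceTasks - 1.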
Objective: Write a MapReduce program to count the occurrence of similar words in a file. Use a partitioner to partition keys based on the first letter of the word.
Note: Refer Chapter 5 - Hadoop for Mapper Class and Reduce Class and Driver Program.
Input Data:
Act:
WordCountPartitioner.java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        char alphabet = word.toUpperCase().charAt(0);
        int partitionNumber = 0;
        switch (alphabet) {
            case 'A': partitionNumber = 1; break;
            case 'B': partitionNumber = 2; break;
            case 'C': partitionNumber = 3; break;
            case 'D': partitionNumber = 4; break;
            case 'E': partitionNumber = 5; break;
            case 'F': partitionNumber = 6; break;
            case 'G': partitionNumber = 7; break;
            case 'H': partitionNumber = 8; break;
            case 'I': partitionNumber = 9; break;
            case 'J': partitionNumber = 10; break;
            case 'K': partitionNumber = 11; break;
            case 'L': partitionNumber = 12; break;
            case 'M': partitionNumber = 13; break;
            case 'N': partitionNumber = 14; break;
            case 'O': partitionNumber = 15; break;
            case 'P': partitionNumber = 16; break;
            case 'Q': partitionNumber = 17; break;
            case 'R': partitionNumber = 18; break;
            case 'S': partitionNumber = 19; break;
            case 'T': partitionNumber = 20; break;
            case 'U': partitionNumber = 21; break;
            case 'V': partitionNumber = 22; break;
            case 'W': partitionNumber = 23; break;
            case 'X': partitionNumber = 24; break;
            case 'Y': partitionNumber = 25; break;
            case 'Z': partitionNumber = 26; break;
            default: partitionNumber = 0; break;
        }
        return partitionNumber;
    }
}
8.6 SEARCHING
Input Data:
1001,John,45
1002,Jack,39
1003,Alex,44
1004,Smith,38
1005,Bob,33
Act:
WordSearcher.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordSearchMapper extends Mapper<LongWritable, Text, Text, Text> {
    static String keyword;
    static int pos = 0;

    protected void setup(Context context) throws IOException,
            InterruptedException {
        // ... (the remainder of the mapper is not legible in the source)
WordSearchReducer.java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordSearchReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
Output:
1002,Jack,39    student.csv,2,5
(HDFS browser screenshot of the search output file part-r-00000)
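The searching job can be simulated in plain Java (not Hadoop code; all names here are invented for the sketch): scan records for the keyword and tag each hit with the file it came from and its position, much as the mapper above does with the FileSplit information.

```java
import java.util.*;

public class KeywordSearchDemo {
    // Scan records for a keyword; emit "record -> source,lineNumber" for each hit,
    // mimicking a search mapper that tags matches with their origin.
    static List<String> search(List<String> records, String keyword, String source) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            if (records.get(i).contains(keyword)) {
                hits.add(records.get(i) + " -> " + source + "," + (i + 1));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList(
                "1001,John,45", "1002,Jack,39", "1003,Alex,44");
        System.out.println(search(data, "Jack", "student.csv"));
        // [1002,Jack,39 -> student.csv,2]
    }
}
```

In the real job, many mappers run this scan in parallel, each over its own input split, and the reducer merely collects the matches.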
8.7 SORTING
Input Data:
1001,John,45
1002,Jack,39
1003,Alex,44
1004,Smith,38
1005,Bob,33
Act:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortStudNames {
    public static class SortMapper extends
            Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] token = value.toString().split(",");
            context.write(new Text(token[1]), new Text(token[0] + " - " + token[1]));
        }
    }
    // here, the value is sorted by key (the student name)
    public static class SortReducer extends
            Reducer<Text, Text, NullWritable, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // (body reconstructed: emit each record, now ordered by name)
            for (Text value : values) {
                context.write(NullWritable.get(), value);
            }
        }
    }
}
Output:
(HDFS browser screenshot of the sorted output file part-r-00000)
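The sorting trick rests on the shuffle phase: whatever the mapper emits as the key reaches the reducer in sorted order. A plain-Java simulation of that idea (not Hadoop code; the names are invented for the sketch):

```java
import java.util.*;

public class SortByNameDemo {
    // Emitting the student name as the key makes the framework's
    // shuffle/sort phase deliver the records ordered by name.
    static List<String> sortByName(List<String> records) {
        SortedMap<String, String> byKey = new TreeMap<>();
        for (String record : records) {
            String[] token = record.split(",");
            byKey.put(token[1], token[0] + " - " + token[1]); // key = name
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        List<String> out = sortByName(Arrays.asList(
                "1001,John,45", "1002,Jack,39", "1003,Alex,44",
                "1004,Smith,38", "1005,Bob,33"));
        System.out.println(out);
        // [1003 - Alex, 1005 - Bob, 1002 - Jack, 1001 - John, 1004 - Smith]
    }
}
```

This is why the mapper in SortStudNames writes token[1] (the name) as the key: no explicit sort is coded, the framework's sort on keys does the work.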
8.8 COMPRESSION
In MapReduce programming, you can compress the MapReduce output file. Compression provides two benefits as follows:
1. Reduces the space to store files.
2. Speeds up data transfer across the network.
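Both benefits can be illustrated with Java's built-in gzip support (plain Java, independent of Hadoop; gzip is the same algorithm behind Hadoop's GzipCodec):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipSavings {
    // Compress a byte array with gzip, the algorithm wrapped by Hadoop's GzipCodec.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Repetitive output, like word count results, compresses very well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append("hadoop\t1\n");
        byte[] raw = sb.toString().getBytes();
        byte[] packed = gzip(raw);
        System.out.println(raw.length);                 // 9000 bytes uncompressed
        System.out.println(packed.length < raw.length); // true: less space, less transfer
    }
}
```

Fewer bytes on disk is the first benefit directly; the second follows because the same smaller payload is what crosses the network between map and reduce nodes.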
You can specify compression format in the Driver Program as shown below:
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
Here, codec is the implementation of a compression and decompression algorithm. GzipCodec is the compression codec for gzip. This compresses the output file.
REMIND ME
• Mapper maps the input key-value pairs to intermediate key-value pairs.
• Reducer then reduces the set of intermediate key-value pairs that share a common key to a smaller set of values.
• The Reducer has three primary phases:
  • Shuffle and Sort
  • Reduce
  • Output Format
• Combiner and Partitioner are optimization techniques.
POINT ME (BOOK)