
CHAPTER 8

Introduction to MAPREDUCE Programming

• Introduction
• Mapper
• RecordReader
• Map
• Combiner
• Partitioner
• Reducer
• Shuffle
• Sort
• Reduce
• Output Format
• Combiner
• Partitioner
• Searching
• Sorting
• Compression

"The alchemists in their search for gold discovered many other things of greater value."
- Arthur Schopenhauer, German Philosopher

WHAT'S IN STORE?

We assume that you are familiar with the basic concepts of HDFS and MapReduce Programming discussed in Chapters 4 and 5. The focus of this chapter will be to build on this knowledge to understand optimization techniques of MapReduce Programming such as combiner, partitioner, and compression. We will also discuss how to write MapReduce programs for sorting and searching.
We suggest you refer to some of the learning resources provided at the end of this chapter for better learning and comprehension.
Big Data and Analytics

8.1 INTRODUCTION
In MapReduce Programming, Jobs (Applications) are split into a set of map tasks and reduce tasks. These tasks are executed in a distributed fashion on the Hadoop cluster. Each task processes a small subset of data that has been assigned to it. This way, Hadoop distributes the load across the cluster. A MapReduce job takes a set of files that is stored in HDFS (Hadoop Distributed File System) as input.
Map task takes care of loading, parsing, transforming, and filtering. The responsibility of the reduce task is grouping and aggregating the data that is produced by map tasks to generate the final output. Each map task is broken into the following phases:

1. RecordReader.
2. Mapper.
3. Combiner.
4. Partitioner.
The output produced by the map task is known as intermediate keys and values. These intermediate keys and values are sent to the reducer. The reduce tasks are broken into the following phases:

1. Shuffle.
2. Sort.
3. Reducer.
4. Output Format.
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides. This way, Hadoop ensures data locality. Data locality means that data is not moved over the network; only computational code is moved to process data, which saves network bandwidth.
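The flow described above can be simulated end to end in plain Java, without a Hadoop cluster. The sketch below is not part of Hadoop's API (the class and method names here are our own); it only illustrates how map, shuffle-and-sort, and reduce hand data to one another for the word count problem used later in this chapter.

```java
import java.util.*;

public class MiniMapReduce {
    // Map phase: emit an intermediate (word, 1) pair for every word in every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // Shuffle and sort phase: group the intermediate values by key, sorted by key.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce phase: iterate each group once and aggregate (here, sum) its values.
    static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            result.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Welcome to Hadoop Session",
                                           "Introduction to Hadoop");
        System.out.println(reduce(shuffle(map(input))));
        // {Hadoop=2, Introduction=1, Session=1, Welcome=1, to=2}
    }
}
```

In a real job, each of these stages runs on different machines, and only the intermediate pairs cross the network.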

8.2 MAPPER
A mapper maps the input key-value pairs into a set of intermediate key-value pairs. Maps are individual
tasks that have the responsibility of transforming input records into intermediate key-value pairs.
1. RecordReader: RecordReader converts a byte-oriented view of the input (as generated by the InputSplit) into a record-oriented view and presents it to the Mapper tasks. It presents the tasks with keys and values. Generally, the key is the positional information and the value is the chunk of data that constitutes the record.
2. Map: The map function works on the key-value pair produced by RecordReader and generates zero or more intermediate key-value pairs, which are written to the MapReduce context.
3. Combiner: It is an optional function but provides high performance in terms of network bandwidth and disk space. It takes the intermediate key-value pairs provided by the mapper and applies a user-specific aggregate function to only that mapper's output. It is also known as a local reducer.
4. Partitioner: The partitioner takes the intermediate key-value pairs produced by the mapper, splits them into shards, and sends each shard to a particular reducer as per the user-specific code. Usually, records with the same key go to the same reducer. The partitioned data of each map task is written to the local disk of that machine and pulled by the respective reducer.

8.3 REDUCER

The job of the Reducer is to reduce a set of intermediate values (the ones that share a key) to a smaller set of values. The Reducer has three primary phases: Shuffle and Sort, Reduce, and Output Format.
1. Shuffle and Sort: This phase takes the output of all the partitioners and downloads them into the local machine where the reducer is running. Then these individual data pieces are sorted by keys into one larger data list. The main purpose of this sort is grouping similar keys together so that their values can be easily iterated over by the reduce task.
2. Reduce: The reducer takes the grouped data produced by the shuffle and sort phase, applies a reduce function, and processes one group at a time. The reduce function iterates over all the values associated with a key. The reducer function provides various operations such as aggregation, filtering, and combining data. Once it is done, the output (zero or more key-value pairs) of the reducer is sent to the output format.
3. Output Format: The output format separates each key-value pair with a tab (default) and writes it out to a file using a record writer.
Figure 8.1 describes the chores of Mapper, Combiner, Partitioner, and Reducer for the word count problem. The word count problem has been discussed under "Combiner" and "Partitioner".
Figure 8.1 The chores of Mapper, Combiner, Partitioner, and Reducer.



8.4 COMBINER
It is an optimization technique for a MapReduce Job. Generally, the reducer class is set to be the combiner class. The difference between the combiner class and the reducer class is as follows:
1. Output generated by combiner is intermediate data and it is passed to the reducer.
2. Output of the reducer is passed to the output file on disk.

The sections have been designed as follows:


Objective: What is it that we are trying to achieve here?
Input Data: What is the input that has been given to us to act upon?
Act: The actual statement/command to accomplish the task at hand.
Output: The result/output as a consequence of executing the statement.

Objective: Write a MapReduce program to count the occurrence of similar words in a file. Use combiner for optimization.
Note: Refer Chapter 5 - Hadoop for Mapper Class and Reduce Class and Driver Program.
Input Data:

Welcome to Hadoop Session


Introduction to Hadoop
Introducing Hive
Hive Session
Pig Session

Act: In the driver program, set the combiner class as shown below.

job.setCombinerClass(WordCounterRed.class);

// Input and Output Path


FileInputFormat.addInputPath(job, new Path("/mapreducedemos/lines.txt"));
FileOutputFormat.setOutputPath(job, new Path("/mapreducedemos/output/wordcount/"));

hadoop jar «jar name» «driver class» «input path» «output path»
Here driver class name, input path, and output path are optional arguments.
Output:
[root@volgalnx010 mapreducedemos]# hadoop jar wordcount.jar

(Screenshot: contents of the directory /mapreducedemos and the job output in the HDFS browser.)
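To see why routing the reducer class through the combiner pays off, the following plain-Java sketch (our own class and method names, not Hadoop API) counts how many intermediate records a single mapper would ship across the network with and without local aggregation.

```java
import java.util.*;

public class CombinerEffect {
    // Simulate one mapper's raw output for a line of text: one (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> mapperOutput(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // A combiner acts as a "local reducer": it pre-aggregates one mapper's pairs
    // before they are shuffled to the reducers.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            combined.merge(p.getKey(), p.getValue(), Integer::sum);
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = mapperOutput("to be or not to be");
        Map<String, Integer> combined = combine(raw);
        // 6 raw records shrink to 4 combined records sent over the network.
        System.out.println(raw.size() + " -> " + combined.size());
    }
}
```

On real data with many repeated words, this shrinkage is what saves network bandwidth and disk space during the shuffle.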

8.5 PARTITIONER
The partitioning phase happens after the map phase and before the reduce phase. Usually the number of partitions is equal to the number of reducers. The default partitioner is the hash partitioner.
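The default hash partitioner picks a reducer from the key's hash code. The sketch below mirrors that computation in plain Java (the class here is our own; only the formula reflects Hadoop's default behavior): mask off the sign bit so the result is non-negative, then take the remainder modulo the reducer count.

```java
public class HashPartitionDemo {
    // Mirrors the default hash partitioner: non-negative hash modulo reducer count.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String word : new String[] {"Hadoop", "Hive", "Pig", "Hadoop"})
            System.out.println(word + " -> partition " + getPartition(word, reducers));
        // Identical keys always land in the same partition, and hence the same reducer.
    }
}
```

This is why records sharing a key are guaranteed to meet at one reducer, whatever machine mapped them.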

Objective: Write a MapReduce program to count the occurrence of similar words in a file. Use parti-
tioner to partition key based on alphabets.
Note: Refer Chapter 5 - Hadoop for Mapper Class and Reduce Class and Driver Program.
Input Data:

Welcome to Hadoop Session


Introduction to Hadoop
Introducing Hive
Hive Session
Pig Session
- -----------------------~B~isDa~ and~
\.:V
-
220 • n~

Act:
WordCountPartitioner.java

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        char alphabet = word.toUpperCase().charAt(0);
        int partitionNumber = 0;
        switch (alphabet) {
            case 'A': partitionNumber = 1; break;
            case 'B': partitionNumber = 2; break;
            case 'C': partitionNumber = 3; break;
            case 'D': partitionNumber = 4; break;
            case 'E': partitionNumber = 5; break;
            case 'F': partitionNumber = 6; break;
            case 'G': partitionNumber = 7; break;
            case 'H': partitionNumber = 8; break;
            case 'I': partitionNumber = 9; break;
            case 'J': partitionNumber = 10; break;
            case 'K': partitionNumber = 11; break;
            case 'L': partitionNumber = 12; break;
            case 'M': partitionNumber = 13; break;
            case 'N': partitionNumber = 14; break;
            case 'O': partitionNumber = 15; break;
            case 'P': partitionNumber = 16; break;
            case 'Q': partitionNumber = 17; break;
            case 'R': partitionNumber = 18; break;
            case 'S': partitionNumber = 19; break;
            case 'T': partitionNumber = 20; break;
            case 'U': partitionNumber = 21; break;
            case 'V': partitionNumber = 22; break;
            case 'W': partitionNumber = 23; break;
            case 'X': partitionNumber = 24; break;
            case 'Y': partitionNumber = 25; break;
            case 'Z': partitionNumber = 26; break;
            default: partitionNumber = 0; break;
        }
        return partitionNumber;
    }
}
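The 26-case switch above maps 'A' to 1 through 'Z' to 26, with everything else in partition 0. Since consecutive letters have consecutive character codes, the same mapping can be written as one line of arithmetic. The sketch below is our own compact equivalent, not the book's listing:

```java
public class AlphabetPartition {
    // Equivalent to the 26-case switch: 'A' -> 1 ... 'Z' -> 26, anything else -> 0.
    static int partitionFor(String word) {
        char c = Character.toUpperCase(word.charAt(0));
        return (c >= 'A' && c <= 'Z') ? (c - 'A' + 1) : 0;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("Hadoop"));  // 8  ('H' is the 8th letter)
        System.out.println(partitionFor("Session")); // 19
        System.out.println(partitionFor("42"));      // 0  (non-alphabetic bucket)
    }
}
```

Either way, the job must be run with 27 reduce tasks (partitions 0 through 26) for every return value to address a real reducer.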

',

The output file part-r-00008 is associated with the alphabet 'H'.

(Screenshot: contents of the output file part-r-00008 in the HDFS browser.)

8.6 SEARCHING

Objective: To write a MapReduce program to search for a specific keyword in a file.
Input Data:

1001,John,45
1002,Jack,39
1003,Alex,44
1004,Smith,38
1005,Bob,33

Act:
WordSearcher.java

import java.io.lOExcepcion;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs. Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FilelnputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextlnputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordSearcher {


public static void main(String[] args) throws IOException,
        InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job= new Job(conf);
job.setJarByClass(WordSearcher.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(WordSearchMapper.class);
job.setReducerClass(WordSearchReducer.class);
job.setlnputFormatClass(TextlnputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setNumReduceTasks(1);
job.getConfiguration().set("keyword", "Jack");
FileInputFormat.setInputPaths(job, new Path("/mapreduce/student.csv"));
FileOutputFormat.setOutputPath(job, new Path("/mapreduce/output/search/"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

WordSearchMapper.java

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WordSearchMapper extends Mapper<LongWritable, Text, Text, Text> {
    static String keyword;
    static int pos = 0;

    protected void setup(Context context) throws IOException,
            InterruptedException {
        Configuration configuration = context.getConfiguration();
        keyword = configuration.get("keyword");
    }

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        InputSplit i = context.getInputSplit(); // Get the input split for this map.
        FileSplit f = (FileSplit) i;
        String fileName = f.getPath().getName();
        Integer wordPos;
        pos++;
        if (value.toString().contains(keyword)) {
            wordPos = value.find(keyword);
            context.write(value, new Text(fileName + "," + new IntWritable(pos).
                toString() + "," + wordPos.toString()));
        }
    }
}
--~lll illld A11.tt >iit,
lU ,, ______ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ __ B....;i~'D..::.

WordSearchReducer.java

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordSearchReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
Output:
(Output file part-r-00000)

1002,Jack,39	student.csv,2,5
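The output line reads: the matching record, the file it came from, the line number (the mapper's pos counter), and the character position of the keyword within the line (Text.find returns a byte offset, which for ASCII data equals the character index). The plain-Java sketch below (our own class, not part of the book's listing) reproduces that bookkeeping with String.indexOf:

```java
public class KeywordSearchDemo {
    // Mirrors the mapper: report "line<TAB>file,lineNo,charPos" for the first match.
    static String search(String[] lines, String keyword, String fileName) {
        int lineNo = 0;
        for (String line : lines) {
            lineNo++;                            // mirrors the mapper's pos counter
            int wordPos = line.indexOf(keyword); // Text.find() analogue for ASCII
            if (wordPos >= 0)
                return line + "\t" + fileName + "," + lineNo + "," + wordPos;
        }
        return null;
    }

    public static void main(String[] args) {
        String[] student = {"1001,John,45", "1002,Jack,39", "1003,Alex,44",
                            "1004,Smith,38", "1005,Bob,33"};
        System.out.println(search(student, "Jack", "student.csv"));
        // 1002,Jack,39	student.csv,2,5  -- same as the job's output above
    }
}
```

"Jack" sits at index 5 of "1002,Jack,39" (after the four digits and the comma) on line 2 of the input, which is exactly the "2,5" in the output.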

8.7 SORTING

Objective: To write a MapReduce program to sort data by student name (value).

Input Data:

1001,John,45
1002,Jack,39
1003,Alex,44
1004,Smith,38
1005,Bob,33

Act:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortStudNames {
    public static class SortMapper extends
            Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] token = value.toString().split(",");
            context.write(new Text(token[1]), new Text(token[0] + " - " + token[1]));
        }
    }
    // Here, the values arrive at the reducer sorted by student name (the map output key).
    public static class SortReducer extends
            Reducer<Text, Text, NullWritable, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text details : values) {
                context.write(NullWritable.get(), details);
            }
        }
    }
    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(SortStudNames.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("/mapreduce/student.csv"));
        FileOutputFormat.setOutputPath(job, new
            Path("/mapreduce/output/sorted/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


Output:

(Output file part-r-00000: the student records sorted by name.)
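The trick in this listing is that no explicit sorting code is written: emitting the student name as the map output key lets the framework's shuffle-and-sort phase order the records. The in-memory sketch below (our own class, not Hadoop API) reproduces that sorted-by-key behavior with a TreeMap:

```java
import java.util.*;

public class SortByNameDemo {
    // Emitting the student name as the map output key means the shuffle-and-sort
    // phase orders the records by name; a TreeMap reproduces that in memory.
    static List<String> sortByName(String[] records) {
        SortedMap<String, String> byName = new TreeMap<>();
        for (String r : records)
            byName.put(r.split(",")[1], r); // key = name, value = whole record
        return new ArrayList<>(byName.values());
    }

    public static void main(String[] args) {
        String[] student = {"1001,John,45", "1002,Jack,39", "1003,Alex,44",
                            "1004,Smith,38", "1005,Bob,33"};
        for (String record : sortByName(student))
            System.out.println(record);
        // Records print in name order: Alex, Bob, Jack, John, Smith.
    }
}
```

Note that a TreeMap keeps one value per key, whereas MapReduce groups all values for a duplicate key; for distinct student names the two behave alike.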

8.8 COMPRESSION
In MapReduce programming, you can compress the MapReduce output file. Compression provides two benefits:
1. Reduces the space to store files.
2. Speeds up data transfer across the network.
You can specify compression format in the Driver Program as shown below:

conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
Here, a codec is the implementation of a compression and decompression algorithm. GzipCodec is the compression codec for gzip. This compresses the output file.
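The mapred.output.* property names above are the older configuration names. On newer Hadoop releases the same settings are usually expressed through FileOutputFormat's helper methods instead. A sketch of the equivalent driver lines (assuming a Job object named job, as in the earlier examples of this chapter):

```java
// Equivalent compression settings via the helper methods of
// org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
```

Either form must be set before the job is submitted; compression of output that has already been written cannot be switched on retroactively.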

REMIND ME
• Mapper maps the input key-value pairs to intermediate key-value pairs.
• Reducer then reduces the set of key-value pairs that share a common key to a smaller set of val
• The Reducer has three primary phases: ues.
• Shuffle and Sort
• Reduce
• Output Format
• Combiner and Partitioner are optimization techniques.
POINT ME (BOOK)

• MapReduce Design Patterns, O'Reilly, Donald Miner and Adam Shook.


