
MapReduce Programming Framework

This document provides an overview of MapReduce, a programming framework for processing large datasets in a distributed manner. It discusses basic concepts like map and reduce functions, and how programmers implement mappers and reducers. It also describes the programming paradigm, provides examples like word count, and discusses challenges in working with large-scale data and applications of MapReduce.

Uploaded by

Linh Ngo

MapReduce Programming Framework
Contents
Motivation
Basic Concepts
Programming Paradigm
Set up development environment on Palmetto
WordCount
Supporting Software Infrastructure
More examples
Air Traffic Data
Reality of Working with Big Data
Physical storage realization for data-intensive computing
100 TB of data uploaded to Facebook daily
Total Facebook data is more than 30 petabytes (1 petabyte = 1,000,000 gigabytes)
Latest Newegg hard drive: 4 TB (1 terabyte = 1,000 gigabytes)
Facebook would need a total of 7680 hard drives (not counting data replication, supporting software infrastructure, and machine redundancy to mitigate failure), with at least 25 new hard drives added daily.
This is not counting computing nodes
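The drive counts above can be reproduced with a short calculation. This is a minimal sketch: the class and method names are illustrative, and it assumes binary units (1 PB = 1024 TB), since that is the conversion that yields 7680 from the figures on the slide.

```java
// Back-of-the-envelope drive count using the slide's estimates.
class DriveEstimate {
    // Assumption: binary units, 1 PB = 1024 TB (this reproduces the 7680 figure).
    static final long TB_PER_PB = 1024;

    /** Number of drives needed to hold totalTb terabytes, rounding up. */
    static long drivesNeeded(long totalTb, long driveTb) {
        return (totalTb + driveTb - 1) / driveTb;
    }

    public static void main(String[] args) {
        long totalTb = 30 * TB_PER_PB; // 30 PB of stored data
        long driveTb = 4;              // capacity of one drive
        long dailyTb = 100;            // data uploaded per day
        System.out.println("Drives for stored data: " + drivesNeeded(totalTb, driveTb));
        System.out.println("New drives per day: " + drivesNeeded(dailyTb, driveTb));
    }
}
```

With decimal units (1 PB = 1000 TB) the first figure would be 7500 instead, so the totals are estimates either way.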
Reality of Working with Big Data
Hundreds or thousands of machines to support big data
Parallelize the computation
Distribute the data
Handle failure
What MapReduce does:
- Automate the parallelization and distribution of computation
- The parallelization and distribution of data is done by another separate framework
- Automate the handling of failure
Basic Concepts
What is "map"?
A function/procedure that is applied to every individual element of a collection/list/array/...
int square(int x) { return x*x; }
map square [1,2,3,4] -> [1,4,9,16]
What is "reduce"?
A function/procedure that performs an operation on a list.
This operation will "fold/reduce" this list into a single value (or a smaller subset)
reduce ([1,2,3,4]) using sum -> 10
reduce ([1,2,3,4]) using multiply -> 24
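The same two ideas can be tried in plain Java. The sketch below uses the standard java.util.stream API rather than Hadoop, and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapReduceBasics {
    static int square(int x) {
        return x * x;
    }

    /** map: apply square to every element of the list. */
    static List<Integer> mapSquare(List<Integer> xs) {
        return xs.stream().map(MapReduceBasics::square).collect(Collectors.toList());
    }

    /** reduce: fold the list into a single value using addition. */
    static int reduceSum(List<Integer> xs) {
        return xs.stream().reduce(0, Integer::sum);
    }

    /** reduce: fold the list into a single value using multiplication. */
    static int reduceProduct(List<Integer> xs) {
        return xs.stream().reduce(1, (a, b) -> a * b);
    }

    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4);
        System.out.println(mapSquare(xs));     // [1, 4, 9, 16]
        System.out.println(reduceSum(xs));     // 10
        System.out.println(reduceProduct(xs)); // 24
    }
}
```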
Basic Programming Paradigm
Programmers implement:
- Map function: take in the input data and return a <key,value> pair
- Reduce function: receive the <key,value> pairs from the mapper and provide a final output as a reduction operation on the pairs
Optional functions:
- Partition function: determines the distribution of mappers' <key,value> pairs to the reducers
- Combine function: initial reduction on the mappers to reduce network traffic
The MapReduce framework handles everything else
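To make the two optional functions concrete, here is a plain-Java sketch that does not use Hadoop. The names partition and combine are illustrative. The hash-modulo rule is one common partitioning scheme (Hadoop's default hash partitioner works along these lines), and the combiner assumes each emitted key carries a count of 1.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartitionCombineSketch {
    /** Partition: pick a reducer index in [0, numReducers) from the key's hash. */
    static int partition(String key, int numReducers) {
        // Masking with Integer.MAX_VALUE keeps the hash non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Combine: pre-sum one mapper's <key,1> pairs so fewer pairs cross the network. */
    static Map<String, Integer> combine(List<String> emittedKeys) {
        Map<String, Integer> local = new TreeMap<>();
        for (String key : emittedKeys) {
            local.merge(key, 1, Integer::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        // Four <key,1> pairs from one mapper shrink to two pairs after combining.
        Map<String, Integer> combined = combine(Arrays.asList("a", "b", "a", "a"));
        System.out.println(combined); // {a=3, b=1}
        for (String key : combined.keySet()) {
            System.out.println(key + " -> reducer " + partition(key, 2));
        }
    }
}
```

Because the partition depends only on the key, every pair with the same key reaches the same reducer, which is what lets the reducer see all values for that key.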
Word Count
Count how many times each unique word occurs in a file/multiple files
Standard parallel programming approach:
- Count number of files
- Set number of processes
- Possibly set up dynamic workload assignment
- A lot of data transfer
- Significant coding effort
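By contrast, the MapReduce formulation of word count needs only a map step and a reduce step. The sketch below simulates the three phases in a single JVM with plain Java collections. It is not Hadoop code, the shuffle method stands in for grouping that the framework would perform, and all names are illustrative.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordCountSimulation {
    /** Map phase: emit a <word,1> pair for every token in one line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle phase: group all values by key (done by the framework in a real job). */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the counts collected for one word. */
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    /** Run map, shuffle, and reduce over all input lines. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog", "the fox");
        System.out.println(wordCount(lines));
        // {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
    }
}
```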
MapReduce WordCount Example
http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
MapReduce PageRank Example 1
http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop
(A,B): There is a referral (link) from site A to site B. Google looks at how many referrals site B has in order to determine the ranking of site B.
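The referral-counting step described here fits the same map and reduce pattern: map each (A,B) pair to <B,1>, then sum per site. The sketch below is plain Java with illustrative names. It only counts inbound referrals and is not the full PageRank algorithm.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReferralCount {
    /**
     * Each link is a two-element array {from, to}.
     * Returns the number of referrals received by each target site.
     */
    static Map<String, Integer> countReferrals(List<String[]> links) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] link : links) {
            // Map step: emit <target,1>. Reduce step: sum per target (merged here).
            counts.merge(link[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> links = Arrays.asList(
                new String[] {"A", "B"},
                new String[] {"C", "B"},
                new String[] {"B", "A"});
        System.out.println(countReferrals(links)); // {A=1, B=2}
    }
}
```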
MapReduce PageRank Example 2
http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop
Basic Anatomy of a Java MR Program
- Main Class
- Mapper Class
- Reducer Class
Main Class
    public class <MRJob> {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "Job Name");
            job.setJarByClass(<MRJob>.class);
            job.setOutputKeyClass(<Hadoop type>.class);
            job.setOutputValueClass(<Hadoop type>.class);
            job.setMapperClass(<MapperClass>.class);
            job.setReducerClass(<ReducerClass>.class);
            job.setInputFormatClass(<predefined InputFormat>.class);
            job.setOutputFormatClass(<predefined OutputFormat>.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }
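WordCount Main Class
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {
        // Map and Reduce are static nested classes declared inside WordCount.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");
            job.setJarByClass(WordCount.class);
            // Types of the <key,value> pairs written by the job
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            // args[0] is the input path, args[1] is the output path
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }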
Mapper/Reducer Classes
    public class <MapperClass> extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        public void map(KEYIN, VALUEIN, Context context) throws IOException, InterruptedException {
            // do something with the value and produce a pair of <KEYOUT, VALUEOUT>
            context.write(KEYOUT, VALUEOUT);
        }
    }
    public class <ReducerClass> extends Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        public void reduce(KEYIN, Iterable<VALUEIN type> ARRAYVALUEIN, Context context) throws IOException, InterruptedException {
            // do some reduction operations on the ARRAYVALUEIN
            context.write(KEYOUT, VALUEOUT);
        }
    }
WordCount Mapper/Reducer
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit <word, 1> for every token in the input line.
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Sum all counts received for one word and emit <word, total>.
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
Testing MapReduce at Small Scale
Requires a Linux environment
For Windows machines:
- Cygwin
- Set up Linux virtual machines using VMware or VirtualBox (http://www.psychocats.net/ubuntu/virtualbox)
For compilation and small-scale testing in class:
- hadoop-1.1.2.tar.gz (http://archive.apache.org/dist/hadoop/core/hadoop-1.1.2/)
- jdk1.7.0_25 (http://download.oracle.com/otn/java/jdk/7u25-b15/jdk-7u25-linux-x64.tar.gz)
Optional:
- Eclipse (http://www.eclipse.org/downloads/)
Demo (VirtualBox)
    export JAVA_HOME=[location of the Java directory]
    export HADOOP_HOME=[location of the Hadoop directory]
    mkdir classes
    $JAVA_HOME/bin/javac -classpath $HADOOP_HOME/hadoop-core-1.1.2.jar -d classes WordCount.java
    jar -cvf WordCount.jar -C classes/ .
    $JAVA_HOME/bin/java -cp WordCount.jar:$HADOOP_HOME/hadoop-core-1.1.2.jar:$HADOOP_HOME/lib/*:. WordCount gutenberg-shakespeare.txt output/
[Figure: data flow between mappers and reducers. Labels: Mapper, Reducer, other mappers, other reducers, circular buffer (in memory), spills (on disk), merged spills (on disk), intermediate files (on disk), Combiner.]
What is "everything else"?
"Everything else":
- Scheduling
- Data distribution
- Synchronization
- Error and fault handling
Limited control over data and execution flow
All algorithms must be expressed as a combination of mapping, reducing, combining, and partitioning functions
Extremely limited knowledge of:
- Location of mappers and reducers
- Life cycle of individual mappers and reducers
- Information about which mapper handles which data block
- Information about which reducer handles which intermediate key
Challenges in working with MR
All algorithms must be expressed as a combination of mapping, reducing, combining, and partitioning functions
Large-scale debugging is difficult
Functional errors are difficult to follow at large scale
Data-dependent errors are even more difficult to catch and fix
Applications of MapReduce
Text tokenization, indexing, and search
Web access log stats
Inverted index construction
Term-vector per host
Distributed grep/sort
Graph creation
Web link-graph reversal
Google's PageRank
Data mining and machine learning
Document clustering
Machine learning
Statistical machine translation
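As one illustration from this list, inverted index construction maps each document to <term, documentId> pairs and reduces by collecting the document IDs for each term. The sketch below runs in plain Java on an in-memory map of documents. The class name, the whitespace tokenization, and the lowercasing are assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    /** Build term -> set of document IDs from documentId -> text. */
    static Map<String, Set<String>> build(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Map step: emit <term, documentId> for every term in the document.
            for (String term : doc.getValue().toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                // Reduce step: collect the document IDs seen for each term.
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("d1", "hadoop map reduce");
        docs.put("d2", "map side join");
        System.out.println(build(docs));
        // {hadoop=[d1], join=[d2], map=[d1, d2], reduce=[d1], side=[d2]}
    }
}
```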
