Intro to Hadoop and MapReduce

Lesson 4 Notes

Designing With Patterns


My name is Andy. I'm going to be presenting this lesson, where we are going to learn about
design patterns in MapReduce. The patterns that we will cover are templates for solving
common problems with MapReduce. These patterns achieve a good balance between flexibility
and rigidity: they are general enough that they can be adapted to solve lots of problems,
but specific enough that they don't require too much effort to use. The goal here is to give
you familiarity (but not necessarily mastery) with some of these patterns. We won't discuss
where these patterns came from or why they were chosen. We are just going to introduce them.
Note: these patterns come from the book MapReduce Design Patterns by Donald Miner & Adam Shook.

What we will Cover

Not every problem can be solved with MapReduce. In fact, problems have to be manipulated to
fit into this framework of mapping and reducing. But luckily programmers have developed a
sophisticated arsenal of patterns to make some potentially unexpected problems
MapReducible. In this lesson, we'll introduce the following patterns.

Copyright 2014 Udacity, Inc. All Rights Reserved.

Filtering Patterns: don't change records. Only get a part of the data!
  Examples: sampling, random sampling, top 10
Summarization Patterns: give a top-level view of data.
  Examples: counting, minimum, maximum, mean, median, standard deviation, inverted index
Take a break to discuss combiners.
Structured to Hierarchical Patterns:
  Examples: combining 2 datasets.

This lesson will go fast. You probably won't remember everything about these patterns, but
that's okay. You'll know that they exist and you'll be able to consult them when you find
yourself facing a problem that you think could be molded to fit one of these patterns.

Filtering Patterns

These patterns have one thing in common: they don't change the actual records. Records that
pass the filter are output exactly the way they came in. These patterns all find a subset of
data, whether it be small, like a top-ten listing, or large, like all results from the last
year from a dataset that contains records for 5 years. There are several possible
applications of filtering patterns:



A simple filter boils down to having a function that, given an input, returns "keep" or
"throw away". The mapper applies this function to every input element, and accordingly the
output will be a filtered subset of the input containing only the data that is interesting
to you.
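As a minimal sketch of this idea in Python: the mapper below takes any keep/throw-away function and emits the records that pass, unchanged. The names `filter_mapper` and `is_error_line`, and the sample log lines, are made up for illustration; in a Hadoop Streaming job the records would typically come from standard input rather than a list.

```python
from typing import Callable, Iterable, Iterator


def filter_mapper(lines: Iterable[str], keep: Callable[[str], bool]) -> Iterator[str]:
    """Yield each input record unchanged if keep(record) is true."""
    for line in lines:
        record = line.rstrip("\n")
        if keep(record):
            yield record  # records that pass are emitted exactly as they came in


def is_error_line(record: str) -> bool:
    """Hypothetical predicate: keep only records that mention ERROR."""
    return "ERROR" in record


# Illustrative input; not data from the lesson.
records = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]
kept = list(filter_mapper(records, is_error_line))
```

No reducer is needed for a plain filter like this, because each record is judged on its own.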

Sampling is about pulling out a subset of the data for future processing. Typically, you'd
use sampling to extract a smaller but representative dataset on which you could then perform
further analysis. However, taking just the first 10,000 records from a massive dataset might
not be the best idea because it might not be representative. Instead, you could use Python's
random number generator to return just, say, 1% of the data by only passing a record through
the mapper if a random value between 1 and 100 equalled 1.
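The random-sampling idea could be sketched roughly as follows. The function name `sampling_mapper`, the `percent` parameter, and the fixed seed are assumptions made for this illustration; a real job would normally not fix the seed.

```python
import random
from typing import Iterable, Iterator


def sampling_mapper(lines: Iterable[str], rng: random.Random, percent: int = 1) -> Iterator[str]:
    """Pass a record through only when a random draw from 1..100 is <= percent."""
    for line in lines:
        # With percent=1 this is the same as "the random value equalled 1".
        if rng.randint(1, 100) <= percent:
            yield line


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
data = [f"record-{i}" for i in range(20000)]  # made-up records
sample = list(sampling_mapper(data, rng))
```

Each record is kept with probability of about 1%, so the sample size varies from run to run but stays close to 1% of the input.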

Top 10: a very interesting case. We all love to see top lists!

Sampling

Let's have some practice with filtering! This time we will work with a dataset that is
generated by our very own Udacity students: our forums! We are interested in getting a
subset of data from the forum that contains only posts that are 1 sentence or less (so, the
post contains only one period/!/? character, and that character is the last significant
character in the post).
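One possible reading of this rule, sketched in Python. The function name is made up, and treating a post with no terminator at all as "less than one sentence" is an assumption of this sketch, not something the exercise states.

```python
def is_one_sentence_or_less(body: str) -> bool:
    """Return True if the post has no sentence terminator, or exactly one
    terminator that is also the last non-whitespace character."""
    text = body.strip()
    terminators = [ch for ch in text if ch in ".!?"]
    if not terminators:
        return True  # assumption: no terminator counts as "less than a sentence"
    # Exactly one terminator, and it sits at the very end of the post.
    return len(terminators) == 1 and text[-1] in ".!?"
```

A mapper would then output a post unchanged whenever this function returns True for the post body.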


Top 10

An interesting application of MapReduce is making top-N record lists. In an RDBMS you would
normally first sort the data, then take the top N records. In MapReduce this kind of approach
will not work, because the data is not sorted and is processed on several machines. Thus,
the mappers will first have to find their own top-N lists, without sorting the data, and then
send the local lists to the reducers, which can then find the global top-N list.
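The two-stage approach described above can be sketched with Python's `heapq.nlargest`, which picks the largest items without fully sorting the input. The record shape `(length, post id)` and the two sample splits are invented for this illustration.

```python
import heapq
from typing import Iterable, List, Tuple

Post = Tuple[int, str]  # (length of post, post id); a hypothetical record shape


def local_top_n(records: Iterable[Post], n: int) -> List[Post]:
    """Mapper side: keep only the n largest records seen in this split."""
    return heapq.nlargest(n, records)


def global_top_n(local_lists: Iterable[List[Post]], n: int) -> List[Post]:
    """Reducer side: merge the mappers' short lists into the final top n."""
    merged = (record for local in local_lists for record in local)
    return heapq.nlargest(n, merged)


# Two made-up input splits, as if processed on two machines.
split_a = [(120, "p1"), (45, "p2"), (300, "p3")]
split_b = [(500, "p4"), (10, "p5"), (250, "p6")]
top2 = global_top_n([local_top_n(split_a, 2), local_top_n(split_b, 2)], 2)
```

Each mapper sends at most N records, so the reducer only has to look at N records per mapper instead of the whole dataset.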


Let's find the top 5 longest posts in our forum!

Summarization Patterns

The next set of design patterns we will discuss are patterns that produce a top-level,
summarized view of your data. This is a very powerful approach to get a bird's-eye view of
your data, something that you cannot get by just manually looking at parts of the data. You
can group similar data together, for example by day, or by time of day, or by username, and
then calculate a statistic of some value, like min, max, mean, etc. You can also build an
index, or just simply count occurrences of something.

There are a couple of problems that can be solved by using a similar approach:

numerical summarizations: finding the count of a value, as well as min, max, avg, median
and other numerical values for your dataset
inverted index: important when building a search engine or full-text search functionality
for your own website or application.

You can use built-in Hadoop functionality to perform some of these operations more
efficiently, and we talk about that as well.

Inverted Index

You often need to build a reverse index from a dataset to enable faster searching. The
obvious example would be a web search engine. You need to create a mapping from keywords to
web links, to enable faster finding of relevant information. Think of it as an index for a
book: you have a word, or term, and all the pages where you can find this term.

While the pattern itself is simple (the mapper outputs each word as the key, with the link
as the value), you have to be conscious of the fact that this type of MapReduce job is
susceptible to uneven key distribution. For example, you can assume that the word "the" will
be found a lot more in a text than other words. You have to think about whether you really
need to include such words in the index.
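A small in-memory sketch of the inverted index pattern, including one way to leave very common words out. The stop-word list, the function names, and the two sample documents are illustrative assumptions; the reducer here stands in for what Hadoop would do per key after the shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative list, not from the lesson


def index_mapper(docs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Emit (word, link) for every word that is not a stop word."""
    for link, text in docs:
        for word in text.lower().split():
            if word not in STOP_WORDS:
                yield word, link


def index_reducer(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect, for each word, the sorted list of distinct links containing it."""
    index = defaultdict(set)
    for word, link in pairs:
        index[word].add(link)
    return {word: sorted(links) for word, links in index.items()}


docs = [("page1", "the quick fox"), ("page2", "the lazy fox")]  # made-up pages
index = index_reducer(index_mapper(docs))
```

Dropping "the" in the mapper means the reducer responsible for that key never receives the disproportionately large group it would otherwise have to handle.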

Numerical summarizations

Common uses for this type of analysis are:

word or record count (which you already used when analyzing web server log files). The
mapper just outputs the thing that you are interested in as a key, and 1 as a value. The
reducer can then just sum up the values. Another very popular example is word count, which
might be the canonical "Hello, World" for MapReduce: count how many times a word appears in
a dataset.

min/max/count for a particular event, for example the first/last time a user posted on a
forum, or the first/last time and how many times a particular item was bought in a shop.
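For the min/max/count case, the reducer logic might look like the sketch below. The function name `summarize` and the sample data (user names with made-up integer timestamps) are assumptions for illustration.

```python
from typing import Dict, Iterable, Tuple


def summarize(pairs: Iterable[Tuple[str, int]]) -> Dict[str, Tuple[int, int, int]]:
    """Return {key: (min, max, count)} for a stream of (key, value) pairs."""
    summary: Dict[str, Tuple[int, int, int]] = {}
    for key, value in pairs:
        if key in summary:
            lo, hi, count = summary[key]
            summary[key] = (min(lo, value), max(hi, value), count + 1)
        else:
            summary[key] = (value, value, 1)  # first value seen for this key
    return summary


# (user, timestamp of a post); hypothetical values.
posts = [("alice", 1400), ("bob", 900), ("alice", 1000), ("alice", 2000)]
stats = summarize(posts)
```

Here the minimum and maximum timestamps per user correspond to the first and last time that user posted, and the count is the number of posts.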

Mean and Standard Deviation


Let's say that you want to know if there is any correlation between day of week and how much
money people spend on items. Your task is to find the mean and standard deviation of sales
per day of week.

The mapper only needs to output the day of week as the key and the sale amount as the
value. The reducer has to do all the maths.

In general, if you need to find the average, median or standard deviation, these have to be
calculated in the reducer.
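A rough sketch of the reducer-side maths using Python's `statistics` module. The function name and sample amounts are invented, and the use of the population standard deviation (`pstdev`) rather than the sample version (`stdev`) is a choice made for this sketch; the lesson does not say which one is expected.

```python
import statistics
from collections import defaultdict


def sales_stats(pairs):
    """Group (day, amount) pairs by day and return {day: (mean, std deviation)}."""
    by_day = defaultdict(list)
    for day, amount in pairs:
        by_day[day].append(amount)
    return {
        day: (statistics.mean(amounts), statistics.pstdev(amounts))
        for day, amounts in by_day.items()
    }


# Made-up (day of week, sale amount) pairs as the mapper might emit them.
sales = [("Mon", a) for a in (2, 4, 4, 4, 5, 5, 7, 9)] + [("Tue", 10.0)]
stats = sales_stats(sales)
```

Note that this sketch holds all values for a day in memory, which is fine for illustration but may not be for a very large key.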

Combiners

To calculate the mean in that last problem, what did you do? You may have done something
like this:
1. Your mapper may have gone through the records and output a key-value pair that looked
like: day of week, value.
2. For each day of the week, your reducer kept a running total of the value as well as a
count of the number of records.
3. You divided the total value by the number of records to get the mean.

Now, let's think about WHERE each of these steps took place.
1. On various machines throughout the network.
2. On ONE machine in the network.
3. Again, ONE machine.






But there's a problem here. That second step involves moving a lot of data around your
network. What if we could do some of the reduction locally before sending the data to the
reducers?

We can! We can use combiners! Combiners will, in essence, do as much reduction as possible
locally before sending that data to the reducers.
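The effect of a combiner can be simulated in plain Python, as in the sketch below. It is not Hadoop code: the lists stand in for the output of two mapper machines, and `sum_by_key` plays the role of both combiner and reducer. All names and values are made up.

```python
from collections import defaultdict


def sum_by_key(pairs):
    """Sum the values for each key; usable as both combiner and reducer."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Output of two mapper machines (illustrative values).
mapper_outputs = [
    [("Mon", 10.0), ("Mon", 5.0), ("Tue", 1.0)],
    [("Mon", 2.5), ("Tue", 4.0), ("Tue", 6.0)],
]

# Without a combiner: every mapper record is shuffled to the reducer.
shuffled_plain = [pair for out in mapper_outputs for pair in out]
# With a combiner: each machine pre-sums its own output first.
shuffled_combined = [pair for out in mapper_outputs for pair in sum_by_key(out)]

result_plain = sum_by_key(shuffled_plain)
result_combined = sum_by_key(shuffled_combined)
```

Both routes give the same totals, but the combined route sends 4 records across the network instead of 6. The gap grows with the number of records per key.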

This might save significant network traffic if you have a lot of records but far fewer
unique keys. You will need to add an additional command to the command-line script to use
this functionality, or use the full Java command. Please see the instructor comments for
detailed information.
When you run a job, you will see some output on the screen, which includes a tracking
URL:

Open it in a browser.
You will see a job page containing information about the job as it is being run.
Here are the comparison screenshots from 2 jobs, one run without a combiner and one with a
combiner:


As you can see, when using a combiner (second screenshot), the reducers get significantly
fewer records and have to shuffle fewer bytes than without a combiner: 4,138,476 records
without a combiner, 412 records with one. While it does not lead to time savings on a
single-node pseudo-distributed cluster run in a VM, like the one you have, in the real world
it could save a lot of network traffic.

Generally, we want our combiners to do the same thing as our reducers. Also, keep in mind
that this intermediate step may introduce some complications when we try to do our
computation. You'll encounter that in the next exercise.
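One complication of this kind, offered here as an illustration (the notes do not spell out which complication they have in mind): a mean cannot be combined by simply averaging local means. A common workaround is for the combiner to emit partial (sum, count) pairs and let the reducer do the division. The function names and numbers below are made up.

```python
def mean_combiner(values):
    """Local step: collapse one machine's values into a (sum, count) pair."""
    return (sum(values), len(values))


def mean_reducer(partials):
    """Final step: add up partial sums and counts, then divide once."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


machine_a = [10.0, 20.0, 30.0]  # local mean would be 20.0
machine_b = [100.0]             # local mean would be 100.0

correct = mean_reducer([mean_combiner(machine_a), mean_combiner(machine_b)])
naive = (20.0 + 100.0) / 2      # averaging the two local means instead
```

The correct mean of all four values is 40.0, while the mean of the local means is 60.0, because the second machine's single record is given the same weight as the first machine's three.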

Calculating the Sum with Combiners

Try calculating the sum of values, like you did in lesson 3, but this time use a combiner.

Structured to Hierarchical Patterns

When migrating data from an RDBMS to a Hadoop system, one of the first things you should
consider doing is reformatting your data. Since Hadoop doesn't care what format your data is
in, you can take advantage of hierarchical data to avoid doing live joins of several
datasets at the time of analysis. If you know what kind of information you will want to get
later, you might save significant time by reformatting the data.

You can use this pattern if you have data sources that are linked by some set of foreign
keys and your data is structured and row-based.

Combine 2 datasets

You might want to know what the reputation of the author of a post is. Combine the post and
user tables to produce a simple dataset that has all the relevant information together, or,
in RDBMS terms, is denormalized.

The post table will contain information like this:

The user table will have this:

We want to combine the tables so that each post contains information about the reputation
and badges of the author, so that we can run some analysis on the content of a post in
relation to the reputation of a user. For example, is there a correlation between post
length and user reputation?
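A minimal in-memory sketch of how such a combination might work: the mapper tags every record with its source table and keys it by user id, and the reducer attaches the user's reputation to each of that user's posts. The field names (`user_id`, `author_id`, `reputation`, `post_id`, `body`) and the sample rows are hypothetical, since the actual table layouts are not reproduced in these notes; badges are left out to keep the sketch short.

```python
from collections import defaultdict


def join_mapper(users, posts):
    """Tag each record with its source and key it by the author's user id."""
    for user in users:
        yield user["user_id"], ("user", user)
    for post in posts:
        yield post["author_id"], ("post", post)


def join_reducer(tagged):
    """For each user id, attach the user's reputation to every post by that user."""
    grouped = defaultdict(lambda: {"user": None, "posts": []})
    for key, (source, record) in tagged:
        if source == "user":
            grouped[key]["user"] = record
        else:
            grouped[key]["posts"].append(record)
    for group in grouped.values():
        if group["user"] is None:
            continue  # post by an unknown user: skipped in this sketch
        for post in group["posts"]:
            yield {**post, "reputation": group["user"]["reputation"]}


# Hypothetical rows standing in for the user and post tables.
users = [{"user_id": 1, "reputation": 250}, {"user_id": 2, "reputation": 10}]
posts = [
    {"post_id": 100, "author_id": 1, "body": "Hello"},
    {"post_id": 101, "author_id": 2, "body": "Hi there"},
    {"post_id": 102, "author_id": 1, "body": "Bye"},
]
joined = sorted(join_reducer(join_mapper(users, posts)), key=lambda r: r["post_id"])
```

Once the data is stored in this denormalized form, a question like the post length versus reputation one above needs only a single pass over `joined`, with no join at analysis time.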

Conclusion

We've covered several patterns. What good are they?

Well, you'll have to practice with them to have them become second nature, but for now you
can ask yourself when solving a problem: Does this problem fit into one of the patterns I
learned? Is it a filtering problem? A summarization problem? A hierarchical data problem?


These are not all the patterns you may want to use. And getting yourself to ask this
question is not simple. In fact, just identifying when MapReduce is the appropriate tool for
a problem is a large part of what it means to be a MapReduce expert.


If you want to gain this expertise, you'll have to practice. This may happen naturally in a
job environment, or you could make it happen deliberately by seeking out problems and trying
to solve them with this framework.

If you enjoy learning about these patterns, you can find more in the book MapReduce Design
Patterns by Donald Miner & Adam Shook.

Nice work and congratulations. Now, on to the final project! Good luck.
