Lesson1-1 Introduction PDF
The Multithreaded DAG Model
DAG = Directed Acyclic Graph: a collection of vertices and directed edges (lines with arrows).
Each edge connects two vertices. The net result of these connections is that there is no way to
start at some vertex A, follow a sequence of vertices along directed paths, and end up back at
A.
DAGs can be used for a variety of tasks, including modeling processes in which data flows in a
consistent direction through a network of processors.
Each vertex is an operation such as a function call, an addition, or a branch.
Directed edges show how operations depend on one another.
For each edge, the sink depends on the output of the source.
Assume there is always one starting vertex and one exit vertex.
Begin the analysis by looking for a starting vertex: a vertex whose inputs are all satisfied.
This vertex can be assigned to any open processor.
Scheduling: taking units of work and assigning them to processors.
How long will it take to run the DAG? A cost model is needed.
Cost Model Assumptions:
All processors run at the same speed.
1 operation = 1 unit of time.
Edges do not have any cost associated with them.
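The scheduling idea and the cost model above can be tried out with a small simulation. The sketch below is only an illustration: the function name schedule_time and the dictionary representation of the DAG (vertex -> list of predecessors) are assumptions made for this example, and the greedy policy (run up to P ready vertices per time step) is one possible scheduler, not the only one.

```python
from collections import deque


def schedule_time(preds, p):
    """Return the number of unit time steps a greedy schedule needs.

    preds maps each vertex to the list of vertices it depends on.
    Cost model: every vertex costs 1 time unit, edges cost nothing,
    and all p processors run at the same speed.
    """
    remaining = {v: len(ps) for v, ps in preds.items()}
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for u in ps:
            succs[u].append(v)

    # A vertex is ready once all of its inputs are satisfied.
    ready = deque(v for v, k in remaining.items() if k == 0)
    steps = 0
    while ready:
        # Assign up to p ready vertices to processors in this time step.
        batch = [ready.popleft() for _ in range(min(p, len(ready)))]
        for u in batch:
            for v in succs[u]:
                remaining[v] -= 1
                if remaining[v] == 0:
                    ready.append(v)  # becomes runnable in a later step
        steps += 1
    return steps


# A chain of four dependent operations and a diamond-shaped DAG.
chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
diamond = {"s": [], "l": ["s"], "r": ["s"], "t": ["l", "r"]}
print(schedule_time(chain, 4), schedule_time(diamond, 1), schedule_time(diamond, 2))
```

The chain gains nothing from extra processors because every vertex waits for the previous one, while the diamond finishes one step sooner with two processors.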
Example: Sequential Reduction
Reduction: reduce an array to the sum of its elements.
To find the cost of this reduction, we will only count the cost of array accesses and the
cost of additions.
How long will it take to execute this DAG with P processors?
$T_P(n) \ge \lceil n/P \rceil$ (the time depends on the size of the array) and
$T_P(n) \ge n$ (one time unit for each of the n additions).
The additions must be done sequentially.
Both conditions must hold.
Since P is always at least one, $\lceil n/P \rceil \le n$, so the second bound dominates. This means a
sequential reduction will take n units of time on a PRAM, no matter how many processors are available.
QUIZ: A Reduction Tree
Assume associativity: (a + b) + c = a + (b + c).
Assume n processors.
Assume addition is done in pairs.
What is the minimum time on a PRAM with P = n processors?
The DAG is executed level by level and each level takes constant time, so all that is needed to
calculate the time is the number of levels: O(log n).
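The level-by-level execution can be sketched in a few lines. This is a sequential simulation, not a parallel program: each pass of the while loop stands for one PRAM time step in which all pairs would be added at once. The name tree_reduce is chosen for this example only.

```python
def tree_reduce(values):
    """Sum values by pairwise addition, level by level.

    Returns (total, levels). Each level models one time step in which
    every pair is added simultaneously (assumes enough processors).
    """
    level = list(values)
    if not level:
        return 0, 0
    levels = 0
    while len(level) > 1:
        # Add neighbours in pairs; relies on addition being associative.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an unpaired element carries over
        level = nxt
        levels += 1
    return level[0], levels


total, levels = tree_reduce(range(1, 9))  # n = 8 elements
print(total, levels)
```

For n = 8 the simulation uses 3 levels, matching log2(8).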
Work and Span
Work = number of vertices in the DAG = W(n)
Span = longest path through the DAG = D(n) = number of vertices on the longest path
The longest path is also known as the critical path.
$T_1(n) = W(n)$
$T_\infty(n) = D(n)$
QUIZ: Work and Span for Reduction
For the sequential DAG, span = O(n).
For the tree DAG, span = O(log n).
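Work and span can be computed directly from a DAG. The sketch below builds two small reduction DAGs and measures them. The helper names and the vertex labelling are assumptions made for this example. The tree builder assumes n is a power of two, and the recursive depth computation is only meant for small inputs.

```python
from functools import lru_cache


def work_and_span(preds):
    """Return (work, span) for a DAG given as vertex -> list of predecessors.

    Work counts all vertices; span counts the vertices on the longest path.
    """
    @lru_cache(maxsize=None)
    def depth(v):
        return 1 + max((depth(u) for u in preds[v]), default=0)

    return len(preds), max((depth(v) for v in preds), default=0)


def sequential_reduction_dag(n):
    # Vertex i adds element i to the running sum, so it depends on vertex i - 1.
    return {i: ([i - 1] if i > 0 else []) for i in range(n)}


def tree_reduction_dag(n):
    # Leaves load array elements; each internal vertex adds two results
    # from the level below. Assumes n is a power of two.
    preds = {("load", i): [] for i in range(n)}
    level = list(preds)
    height = 0
    while len(level) > 1:
        height += 1
        nxt = []
        for j in range(0, len(level), 2):
            v = ("add", height, j // 2)
            preds[v] = [level[j], level[j + 1]]
            nxt.append(v)
        level = nxt
    return preds


print(work_and_span(sequential_reduction_dag(8)))
print(work_and_span(tree_reduction_dag(8)))
```

In this model the sequential DAG has a span equal to its work, while the tree DAG has a span of log2(n) additions plus one load.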
Basic Work-Span Laws
W(n)/D(n) = the amount of work per critical-path vertex = the average available parallelism in the
DAG.
How many processors should be used for the problem? About W(n)/D(n).
Span Law: $T_P(n) \ge D(n)$
Work Law: $T_P(n) \ge \lceil W(n)/P \rceil$
Together: $T_P(n) \ge \max\{D(n), \lceil W(n)/P \rceil\}$
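As a quick numeric illustration, the two laws can be combined in a one-line function. The function name lower_bound and the sample values (W = 15, D = 4) are assumptions chosen for this example.

```python
def lower_bound(work, span, p):
    """Lower bound on T_P from the span law and the work law."""
    ceil_work_over_p = -(-work // p)  # integer ceiling of work / p
    return max(span, ceil_work_over_p)


work, span = 15, 4
print("average available parallelism:", work / span)
for p in (1, 2, 4, 100):
    print(p, lower_bound(work, span, p))
```

Once P exceeds roughly W/D, the span law takes over and adding processors no longer lowers the bound.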
Brent's Theorem Part 1 (setup)
Is there an upper bound on the time to execute the DAG? Yes, according to Brent's Theorem.
Given a PRAM with P processors.
Break the execution into phases:
1. Each phase has 1 critical path vertex.
2. Non-critical-path vertices in each phase are independent. This means the vertices in the
phase can have edges that enter or exit the phase, but they cannot depend on one
another.
3. Every vertex has to be in some phase, and only one phase.
How long will it take to execute phase k?
QUIZ: Brent's Theorem Aside
Use the following equivalencies.
Brent's Theorem Part 2
The upper bound on the time to execute the DAG is the sum of the times of the phases,
which becomes:
$T_P(n) \le D(n) + \frac{W(n) - D(n)}{P}$
This is Brent's Theorem. It says:
The time to execute the DAG using P processors is at most the time to execute the
critical path plus the time to execute everything off the critical path using P processors.
**This sets the goal for any scheduler.**
The upper bound and the lower bound are within a factor of 2 of each other.
This implies that you may be able to execute the DAG in less time than Brent's bound predicts, but
never in less time than the lower bound.
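For reference, here is a sketch of the standard argument behind the bound, written with the phase decomposition from Part 1. The symbols $w_k$ (the number of vertices in phase k) and $t_k$ (the time to execute phase k) are notation assumed for this sketch. The last line shows why the upper and lower bounds are within a factor of 2.

```latex
\begin{align*}
t_k &= \left\lceil \frac{w_k}{P} \right\rceil
  && \text{the $w_k$ vertices of phase $k$ are independent} \\
T_P(n) &\le \sum_{k=1}^{D} \left\lceil \frac{w_k}{P} \right\rceil
  = \sum_{k=1}^{D} \left( \left\lfloor \frac{w_k - 1}{P} \right\rfloor + 1 \right)
  && \text{since } \lceil a/b \rceil = \lfloor (a-1)/b \rfloor + 1 \\
 &\le \sum_{k=1}^{D} \left( \frac{w_k - 1}{P} + 1 \right)
  = \frac{W - D}{P} + D
  && \text{since } \textstyle\sum_{k} w_k = W \\
\frac{W - D}{P} + D &\le \frac{W}{P} + D
  \le 2 \max\left\{ D, \left\lceil \frac{W}{P} \right\rceil \right\}
  && \text{twice the lower bound}
\end{align*}
```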
Desiderata: Speedup, Work Optimality, and Weak Scaling
How can we tell if a DAG is good or bad?
Speedup = best sequential time / parallel time: $S_P(n) = T_*(n) / T_P(n)$
$T_*(n)$ depends on the work done by the best sequential algorithm.
$T_P(n)$ depends on the work, the span, n, and P.
Ideal speedup: linear in P (you want the speedup to grow linearly with the number of processors).
$S_P(n) = \Theta(P)$, where $S_P(n)$ = best sequential work / parallel time = $W_*(n) / T_P(n)$
Use Brent's Theorem to get an upper bound on time, and therefore a lower bound on speedup.
In the equation below there is still a dependence on n; it is just not shown on the right
side.
$S_P(n) \ge \frac{W_*}{\frac{W - D}{P} + D} = \frac{P}{\frac{W}{W_*} + \frac{P - 1}{W_*/D}}$
P = number of processors.
The denominator is the penalty: to get linear scaling, the denominator needs to be a constant.
To get a constant in the denominator, two conditions are needed:
Work optimality: $W(n) = O(W_*(n))$.
Weak scalability: $P = O(W_*/D)$, equivalently $W_*/P = \Omega(D)$. The work per processor has to
grow proportionally to the span, and the span depends on the problem size n.
Recap:
Linear speedup is the goal.
To achieve linear scaling, the work of the parallel algorithm should match that of the best sequential
algorithm, and the work per processor should grow as a function of n.
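To see the two conditions at work, the sketch below evaluates the speedup lower bound that follows from Brent's bound, $T_P \le (W - D)/P + D$. The function name and the sample numbers (a work-optimal algorithm with W = W* = n and span log2 n) are assumptions for illustration.

```python
import math


def speedup_lower_bound(w_star, w, d, p):
    """Speedup guaranteed by Brent's bound: W* / ((W - D)/P + D),
    rewritten as P divided by a penalty term."""
    penalty = w / w_star + (p - 1) * d / w_star
    return p / penalty


n = 2 ** 20
w_star = w = n            # work-optimal: parallel work matches sequential work
d = int(math.log2(n))     # logarithmic span, here 20

for p in (16, 1024, 2 ** 16):
    s = speedup_lower_bound(w_star, w, d, p)
    print(f"P = {p:6d}  guaranteed speedup >= {s:10.1f}  ({s / p:.0%} of linear)")
```

While P stays well below W*/D the guaranteed speedup is close to P. Once P approaches W*/D, the second term of the penalty is no longer negligible and the guarantee falls away from linear.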
Basic Concurrency Primitives
The Divide-and-Conquer Scheme
Start from the sequential version of the divide-and-conquer scheme: a recursive reduction that
computes a from one half of the array and b from the other half, then returns a + b.
Note that the two recursive calls are independent. They will now be marked with SPAWN.
Spawn is a signal to either the compiler or the runtime system that the target is an
independent unit of work. The target may be executed asynchronously from the caller.
SYNC: there is a dependence between a and b and the return statement, because the two results
have to be combined. Sync is used to wait for the spawned work before the dependent statements run.
To which spawn does a given sync apply? The sync matches any spawn in the same frame.
Nested parallelism: there is always an implicit sync before returning to the caller.
The spawn creates two independent paths: one path carries the new work, and one path
continues carrying on after the spawn.
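The spawn and sync pattern can be imitated with ordinary threads. The sketch below is an illustration only: it models spawn as starting a thread and sync as joining it, it spawns only the first recursive call and lets the caller continue with the second, and the name reduce_spawn and the cutoff parameter are assumptions made for this example. In CPython the global interpreter lock means this shows the structure rather than a real speedup.

```python
import threading


def reduce_spawn(a, lo, hi, out, cutoff=1024):
    """Sum a[lo:hi] and store the result in out[0]."""
    if hi - lo <= cutoff:
        out[0] = sum(a[lo:hi])  # small enough: finish sequentially
        return
    mid = (lo + hi) // 2
    left, right = [0], [0]

    # spawn: the left half is an independent unit of work
    t = threading.Thread(target=reduce_spawn, args=(a, lo, mid, left, cutoff))
    t.start()

    # the caller carries on with the right half in the meantime
    reduce_spawn(a, mid, hi, right, cutoff)

    # sync: both results are needed before they can be combined
    t.join()
    out[0] = left[0] + right[0]


data = list(range(10_000))
result = [0]
reduce_spawn(data, 0, len(data), result)
print(result[0] == sum(data))
```

The cutoff keeps the number of threads small. Without it, every recursive call down to single elements would start its own thread.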
QUIZ: A Subtle Point About Spawns
The recursive reduction above uses two spawns. Are they both necessary? You can eliminate B
but not A.
If you eliminate the A path, you eliminate the
concurrency. This is bad.
If you eliminate the B path, the two subgraphs
can still be executed concurrently.
Basic Analysis of Work and Span
Many of the analysis tools used on sequential algorithms can be used on parallel algorithms.
We want to analyze work and span.
Assume each spawn and sync is a constant-time
operation, so they can be ignored for the analysis.
Analyzing work is counting total operations. We end up with
linear work, O(n).
Analyzing span: a spawn creates two paths, and the critical
path is the longer of the two paths.
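For the recursive reduction, the two analyses can be written as recurrences. This is a sketch that assumes the array is split into two halves of size n/2 and that the base case costs constant time.

```latex
\begin{align*}
W(n) &= 2\,W(n/2) + O(1), & W(1) &= O(1) & &\Rightarrow\; W(n) = O(n) \\
D(n) &= \max\{D(n/2),\, D(n/2)\} + O(1) = D(n/2) + O(1), & D(1) &= O(1) & &\Rightarrow\; D(n) = O(\log n)
\end{align*}
```

The work recurrence adds the costs of both halves because every operation is counted. The span recurrence takes the maximum because the two halves run side by side and only the longer path matters.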
Desiderata for Work and Span
The goals of a parallel algorithm designer:
1. Work optimality: achieve work that matches the best sequential algorithm.
2. Find algorithms with polylogarithmic span, $D(n) = O(\log^k n)$. This is low span.
This ensures that the average available parallelism grows with n.
Concurrency Primitive: Parallel For
All iterations are independent of one another.
A par-for creates n independent subpaths.
The end of a par-for loop includes an implicit sync point.
The work of a par-for is $W_{\text{par-for}}(n) = O(n)$.
The span of a par-for is $D_{\text{par-for}}(n) = O(1)$ in theory, but in practice it will grow with n, especially if n
is really large.
QUIZ: Implementing Par-For
If the par-for is implemented as a loop that issues one spawn per iteration, the DAG executes the
spawns sequentially, one after another. This leads to a bottleneck. The span grows linearly with n. This is bad.
Implementing Par-For Part 2
Implement par-for as a procedure call (ParForT). This is a better way to implement a parallel for
loop. The span will now grow logarithmically with n.
For the rest of this course, assume the ParForT implementation.
$D(n) = O(\log n)$
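The difference between the two implementations can be made concrete with a small span model. The sketch below runs everything sequentially and only counts the span a parallel run would see, assuming each spawn and each loop body costs one time unit. The source gives only the name ParForT, so the recursive halving shown here is an assumption about how it works. The function names are likewise chosen for this example.

```python
def par_for_loop(body, lo, hi):
    """Model of a par-for built from a loop of spawns; returns the span."""
    spawns = 0
    for i in range(lo, hi):
        body(i)       # would be spawned
        spawns += 1   # but each spawn is issued only after the previous one
    # the last body can start only after all spawns have been issued
    return spawns + (1 if hi > lo else 0)


def par_for_t(body, lo, hi):
    """Model of a par-for built by recursive halving; returns the span."""
    if hi <= lo:
        return 0
    if hi - lo == 1:
        body(lo)
        return 1
    mid = (lo + hi) // 2
    left = par_for_t(body, lo, mid)    # would be spawned
    right = par_for_t(body, mid, hi)   # the caller continues with this half
    return 1 + max(left, right)        # one spawn step, then the longer path


squares = [0] * 1024

def body(i):
    squares[i] = i * i

print(par_for_loop(body, 0, 1024), par_for_t(body, 0, 1024))
```

Both versions do the same O(n) work. Only the shape of the spawn structure, and with it the span, differs.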
QUIZ: Matrix-Vector Multiply
If a loop carries a dependence, then it cannot be parallelized with a par-for.
Data Races and Race Conditions
If we look at the nested loops, we
see that in the innermost loop there
are iterations of j that write to the
same location y[i].
Data race = at least one read and one write can happen at the same memory location at the same
time.
Race condition = a data race that causes an error.
**A data race does not always lead to a race condition.**
Vector Notation
t[1:n] ← A[i, 1:n] * x[1:n]   This is a more compact form of the par-for loop.
t[:] ← A[i, :] * x[:]
This can be further reduced to: y[i] ← y[i] + reduce(t)
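The vector form maps directly onto code. The sketch below is a plain sequential Python version of matrix-vector multiply, written so that each row follows the two vector statements above. The function name matvec and the list-of-lists matrix layout are assumptions for this example. The comments mark which loop could be a par-for and where the reduction replaces the racy inner loop.

```python
def matvec(A, x):
    """Return y = A * x for a matrix given as a list of rows."""
    y = [0.0] * len(A)
    for i in range(len(A)):  # rows are independent: a par-for candidate
        # t[:] <- A[i, :] * x[:]  (element-wise products, all independent)
        t = [a_ij * x_j for a_ij, x_j in zip(A[i], x)]
        # y[i] <- y[i] + reduce(t)  (one write to y[i] instead of n racing writes)
        y[i] = y[i] + sum(t)
    return y


A = [[1, 2], [3, 4]]
x = [1, 1]
print(matvec(A, x))
```

Computing the products into a temporary t and then reducing avoids having many iterations of j update y[i] at the same time.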