GPU Programming in MATLAB
Figure 1. Comparison of the number of cores on a CPU system and a GPU.
The greatly increased throughput made possible by a GPU, however, comes at a cost. First, memory access becomes a much more likely bottleneck for your calculations. Data must be sent from the CPU to the GPU before calculation and then retrieved from it afterwards. Because a GPU is attached to the host CPU via the PCI Express bus, the memory access is slower than with a traditional CPU.¹ This means that your overall computational speedup is limited by the amount of data transfer that occurs in your algorithm. Second, programming for GPUs in C or Fortran requires a different mental model and a skill set that can be difficult and time-consuming to acquire. Additionally, you must spend time fine-tuning your code for your specific GPU to optimize your applications for peak performance.
This article demonstrates features in Parallel Computing Toolbox that enable you to run your MATLAB code on a GPU by making a few simple changes to your code. We illustrate this approach by solving a second-order wave equation using spectral methods.
Why Parallelize a Wave Equation Solver?
Wave equations are used in a wide range of engineering disciplines, including seismology, fluid dynamics, acoustics, and electromagnetics, to describe sound, light, and fluid waves.
An algorithm that uses spectral methods to solve wave equations is a good candidate for parallelization because it meets both of the criteria for acceleration using the GPU (see "Will Execution on a GPU Accelerate My Application?"):

It is computationally intensive. The algorithm performs many fast Fourier transforms (FFTs) and inverse fast Fourier transforms (IFFTs). The exact number depends on the size of the grid (Figure 2) and the number of time steps included in the simulation. Each time step requires two FFTs and four IFFTs on different matrices, and a single computation can involve hundreds of thousands of time steps.

It is massively parallel. The parallel FFT algorithm is designed to "divide and conquer" so that a similar task is performed repeatedly on different data. Additionally, the algorithm requires substantial communication between processing threads and plenty of memory bandwidth. The IFFT can similarly be run in parallel.
Figure 2. A solution for a second-order wave equation on a 32 x 32 grid (see animation (http://www.mathworks.com/videos/solutionofsecondorderwaveequationanimation79288.html?type=shadow)).
Will Execution on a GPU Accelerate My Application?

A GPU can accelerate an application if it fits both of the following criteria:

Computationally intensive: The time spent on computation significantly exceeds the time spent on transferring data to and from GPU memory.

Massively parallel: The computations can be broken down into hundreds or thousands of independent units of work.

Applications that do not satisfy these criteria might actually run slower on a GPU than on a CPU.
GPU Computing in MATLAB
Before continuing with the wave equation example, let's quickly review how MATLAB works with the GPU.

FFT, IFFT, and linear algebraic operations are among more than 100 built-in MATLAB functions that can be executed directly on the GPU by providing an input argument of the type GPUArray, a special array type provided by Parallel Computing Toolbox. These GPU-enabled functions are overloaded; in other words, they operate differently depending on the data type of the arguments passed to them.

For example, the following code uses an FFT algorithm to find the discrete Fourier transform of a vector of pseudorandom numbers on the CPU:
A = rand(2^16,1);
B = fft(A);
To perform the same operation on the GPU, we first use the gpuArray command to transfer data from the MATLAB workspace to device memory. Then we can run fft, which is one of the overloaded functions, on that data:
A = gpuArray(rand(2^16,1));
B = fft(A);
The fft operation is executed on the GPU rather than the CPU, since its input (a GPUArray) is held on the GPU.

The result, B, is stored on the GPU. However, it is still visible in the MATLAB workspace. By running class(B), we can see that it is a GPUArray.
class(B)
ans =
parallel.gpu.GPUArray
We can continue to manipulate B on the device using GPU-enabled functions. For example, to visualize our results, the plot command automatically works on GPUArrays:
plot(B);
To return the data to the MATLAB workspace, you can use the gather command; for example:
C = gather(B);
C is now a double in MATLAB and can be operated on by any of the MATLAB functions that work on doubles.

In this simple example, the time saved by executing a single FFT function is often less than the time spent transferring the vector from the MATLAB workspace to the device memory. This is generally true, but it depends on your hardware and the size of the array. Data transfer overhead can become so significant that it degrades the application's overall performance, especially if you repeatedly exchange data between the CPU and GPU to execute relatively few computationally intensive operations. It is more efficient to perform several operations on the data while it is on the GPU, bringing the data back to the CPU only when required.²
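For example, a small illustrative sequence (not taken from the article) chains several operations on the GPU and gathers only the final scalar:

A = gpuArray(rand(2^16,1));  % one transfer from the workspace to the GPU
B = fft(A);                  % executes on the GPU
C = abs(B).^2;               % element-wise operations also run on the GPU
s = gather(sum(C));          % one transfer back, for the final result only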
Note that GPUs, like CPUs, have finite memories. However, unlike CPUs, they do not have the ability to swap memory to and from disk. Thus, you must verify that the data you want to keep on the GPU does not exceed its memory limits, particularly when you are working with large matrices. By running gpuDevice, you can query your GPU card, obtaining information such as name, total memory, and available memory.
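As a minimal sketch (the property names below are those of the gpuDevice object, though they may vary slightly across releases):

d = gpuDevice;    % object describing the currently selected GPU
d.Name            % device name, e.g., 'Tesla C2050'
d.TotalMemory     % total device memory, in bytes
d.FreeMemory      % memory currently available on the device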
Implementing and Accelerating the Algorithm to Solve a Wave Equation in MATLAB
To put the above example into context, let's implement the GPU functionality on a real problem. Our computational goal is to solve the second-order wave equation

∂²u/∂t² = ∂²u/∂x² + ∂²u/∂y²

with the condition u = 0 on the boundaries. We use an algorithm based on spectral methods to solve the equation in space and a second-order central finite difference method to solve the equation in time.
Spectral methods are commonly used to solve partial differential equations. With spectral methods, the solution is approximated as a linear combination of continuous basis functions, such as sines and cosines. In this case, we apply the Chebyshev spectral method, which uses Chebyshev polynomials as the basis functions.
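For reference, the method operates on a grid of Chebyshev points, which cluster near the boundaries (this is the standard construction, not code taken from the article):

N = 32;
x = cos(pi*(0:N)/N);        % Chebyshev collocation points on [-1, 1]
[xx, yy] = meshgrid(x, x);  % 2-D grid for the square domain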
At every time step, we calculate the second derivative of the current solution in both the x and y dimensions using the Chebyshev spectral method. Using these derivatives together with the old solution and the current solution, we apply a second-order central difference method (also known as the leapfrog method) to calculate the new solution. We choose a time step that maintains the stability of this leapfrog method.
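In standard form, the update at each step is u_new = 2u_current − u_old + Δt²(∂²u/∂x² + ∂²u/∂y²), where the two spatial second derivatives are the ones computed spectrally.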
The MATLAB algorithm is computationally intensive, and as the number of elements in the grid over which we compute the solution grows, the time the algorithm takes to execute increases dramatically. When executed on a single CPU using a 2048 x 2048 grid, it takes more than a minute to complete just 50 time steps. Note that this time already includes the performance benefit of the inherent multithreading in MATLAB. Since R2007a, MATLAB has supported multithreaded computation for a number of functions. These functions automatically execute on multiple threads without the need to explicitly specify commands to create threads in your code.
When considering how to accelerate this computation using Parallel Computing Toolbox, we will focus on the code that performs computations for each time step. Figure 3 illustrates the changes required to get the algorithm running on the GPU. Note that the computations involve MATLAB operations for which GPU-enabled overloaded functions are available through Parallel Computing Toolbox. These operations include FFT and IFFT, matrix multiplication, and various element-wise operations. As a result, we do not need to change the algorithm in any way to execute it on a GPU. We simply transfer the data to the GPU using gpuArray before entering the loop that computes results at each time step.
http://www.mathworks.com/company/newsletters/articles/gpu-programming-in-matlab.html
3/6
12/19/2014
Figure 3. Code Comparison Tool showing the differences between the CPU and GPU versions of the code. The two versions share over 84% of their code (94 lines out of 111).
After the computations are performed on the GPU, we transfer the results from the GPU to the CPU. Each variable referenced by the GPU-enabled functions must be created on the GPU or transferred to the GPU before it is used.
To convert one of the weights used for spectral differentiation to a GPUArray variable, we use
W1T = gpuArray(W1T);
Certain types of arrays can be constructed directly on the GPU, without our having to transfer them from the MATLAB workspace. For example, to create a matrix of zeros directly on the GPU, we use
uxx = parallel.gpu.GPUArray.zeros(N+1,N+1);
We use the gather function to bring data back from the GPU; for example:
vvg = gather(vv);
Note that there is a single transfer of data to the GPU, followed by a single transfer of data from the GPU. All the computations for each time step are performed on the GPU.
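A minimal sketch of this structure, using Fourier rather than Chebyshev spectral derivatives so that it stays short and self-contained (the variable names are illustrative, not the article's actual code):

% Toy periodic analogue of the solver: transfer once, compute every time
% step on the GPU, gather once at the end.
N = 64; x = 2*pi*(0:N-1)/N; [X,Y] = meshgrid(x,x);
k = [0:N/2-1, -N/2:-1];                            % FFT wavenumber ordering
kx = repmat(k, N, 1); ky = kx.';
dt = 1e-3;
vv = gpuArray(exp(-5*((X-pi).^2 + (Y-pi).^2)));    % initial pulse, sent to the GPU
vvold = vv;
for n = 1:50
    uxx = real(ifft(-(kx.^2).*fft(vv,[],2),[],2)); % d2u/dx2, computed on the GPU
    uyy = real(ifft(-(ky.^2).*fft(vv,[],1),[],1)); % d2u/dy2, computed on the GPU
    vnew = 2*vv - vvold + dt^2*(uxx + uyy);        % leapfrog update
    vvold = vv; vv = vnew;
end
vvg = gather(vv);                                  % single transfer back to the CPU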
Comparing CPU and GPU Execution Speeds
To evaluate the benefits of using the GPU to solve second-order wave equations, we ran a benchmark study in which we measured the amount of time the algorithm took to execute 50 time steps for grid sizes of 64, 128, 512, 1024, and 2048, first on an Intel Xeon Processor X5650 and then on an NVIDIA Tesla C2050 GPU.
For a grid size of 2048, the algorithm shows a 7.5x speedup, with compute time dropping from more than a minute on the CPU to less than 10 seconds on the GPU (Figure 4). The log-scale plot shows that the CPU is actually faster for small grid sizes. As the technology evolves and matures, however, GPU solutions are increasingly able to handle smaller problems, a trend that we expect to continue.
Figure 4. Plot of benchmark results showing the time required to complete 50 time steps for different grid sizes, using either a linear scale (left) or a log scale (right).
http://www.mathworks.com/company/newsletters/articles/gpu-programming-in-matlab.html
4/6
12/19/2014
Advanced GPU Programming with MATLAB
Parallel Computing Toolbox provides a straightforward way to speed up MATLAB code by executing it on a GPU. You simply change the data type of a function's input to take advantage of the many MATLAB commands that have been overloaded for GPUArrays. (A complete list of built-in MATLAB functions that support GPUArray is available in the Parallel Computing Toolbox documentation (http://www.mathworks.com/help/toolbox/distcomp/bsic4fr1.html#bsloua31).)
To accelerate an algorithm with multiple simple operations on a GPU, you can use arrayfun, which applies a function to each element of an array. Because arrayfun is a GPU-enabled function, you incur the memory transfer overhead only on the single call to arrayfun, not on each individual operation.
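As an illustration (the anonymous function below is hypothetical, not from the article), several element-wise operations can be fused into a single arrayfun call:

f = @(x, y) 1 ./ (1 + exp(-(x .* y + 0.5)));  % hypothetical element-wise function
A = gpuArray(rand(1e6, 1));
B = gpuArray(rand(1e6, 1));
C = arrayfun(f, A, B);   % one GPU invocation instead of one per operation
result = gather(C);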
Finally, experienced programmers who write their own CUDA code can use the CUDAKernel interface in Parallel Computing Toolbox to integrate this code with MATLAB. The CUDAKernel interface enables even more fine-grained control to speed up portions of code that were performance bottlenecks. It creates a MATLAB object that provides access to your existing kernel, compiled into PTX code (PTX is a low-level parallel thread execution instruction set). You then invoke the feval command to evaluate the kernel on the GPU, using MATLAB arrays as input and output.
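As a sketch, with hypothetical kernel file names (addVectors.cu compiled to addVectors.ptx, containing a vector-addition kernel):

% Hypothetical files: addVectors.ptx / addVectors.cu are assumed to define
% __global__ void add(double *c, const double *a, const double *b)
k = parallel.gpu.CUDAKernel('addVectors.ptx', 'addVectors.cu');
n = 1e6;
k.ThreadBlockSize = [256, 1, 1];
k.GridSize = [ceil(n/256), 1, 1];
a = gpuArray(rand(n, 1));
b = gpuArray(rand(n, 1));
c = parallel.gpu.GPUArray.zeros(n, 1);  % output array created on the GPU
c = feval(k, c, a, b);                  % evaluate the kernel on the GPU
result = gather(c);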
Summary
Engineers and scientists are successfully employing GPU technology, originally intended for accelerating graphics rendering, to accelerate their discipline-specific calculations. With minimal effort, and without extensive knowledge of GPUs, you can now harness the power of GPUs with MATLAB. GPUArrays and GPU-enabled MATLAB functions help you speed up MATLAB operations without low-level CUDA programming. If you are already familiar with programming for GPUs, MATLAB also lets you integrate your existing CUDA kernels into MATLAB applications without requiring any additional C programming.
To achieve speedups with GPUs, your application must satisfy certain criteria, among them that the time spent transferring data between the CPU and GPU must be small compared with the time saved by running the computation on the GPU. If your application satisfies these criteria, it is a good candidate for the range of GPU functionality available with MATLAB.
GPU Glossary
CPU (central processing unit). The central unit in a computer responsible for calculations and for controlling or supervising other parts of the computer. The CPU performs logical and floating-point operations on data held in the computer memory.
GPU (graphics processing unit). A programmable chip originally intended for graphics rendering. The highly parallel structure of a GPU makes it more effective than a general-purpose CPU for algorithms that process large blocks of data in parallel.
Core. A single independent computational unit within a CPU or GPU chip. CPU and GPU cores are not equivalent to each other; GPU cores perform specialized operations, whereas CPU cores are designed for general-purpose programs.
CUDA. A parallel computing technology from NVIDIA that consists of a parallel computing architecture and developer tools, libraries, and programming directives for GPU computing.
Device. A hardware card containing the GPU and its associated memory.
Host. The CPU and system memory.
Kernel. Code written for execution on the GPU. Kernels are functions that can run on a large number of threads. The parallelism arises from each thread independently running the same program on different data.
Published 2011 - 91967v01
References

1. See Chapter 6 (Memory Optimization) of the NVIDIA CUDA C Best Practices documentation for further information about potential GPU computing bottlenecks and optimization of GPU memory access.

2. See Chapter 6 (Memory Optimization) of the NVIDIA CUDA C Best Practices documentation for further information about improving performance by minimizing data transfers.
Products Used
http://www.mathworks.com/company/newsletters/articles/gpu-programming-in-matlab.html
5/6
12/19/2014
MATLAB (http://www.mathworks.com/products/matlab)
Parallel Computing Toolbox (http://www.mathworks.com/products/parallelcomputing)
Learn More

Spectral Methods, Lloyd N. Trefethen (http://www.mathworks.com/support/books/book48110.html?category=6&language=1&view=category)
Introduction to MATLAB GPU Computing (http://www.mathworks.com/discovery/matlabgpu.html)
Accelerating Signal Processing Algorithms with GPUs and MATLAB (http://www.mathworks.com/discovery/gpusignalprocessing.html)