
Andrej Karpathy blog

Hacker's guide to Neural Networks

Hi there, I'm a CS PhD student at Stanford. I've worked on Deep Learning for a few years as part of my research and among several of my related pet projects is ConvNetJS - a Javascript library for training Neural Networks. Javascript allows one to nicely visualize what's going on and to play around with the various hyperparameter settings, but I still regularly hear from people who ask for a more thorough treatment of the topic. This article (which I plan to slowly expand out to lengths of a few book chapters) is my humble attempt. It's on web instead of PDF because all books should be, and eventually it will hopefully include animations/demos etc.

My personal experience with Neural Networks is that everything became much clearer when I started ignoring full-page, dense derivations of backpropagation equations and just started writing code. Thus, this tutorial will contain very little math (I don't believe it is necessary and it can sometimes even obfuscate simple concepts). Since my background is in Computer Science and Physics, I will instead develop the topic from what I refer to as a hacker's perspective. My exposition will center around code and physical intuitions instead of mathematical derivations. Basically, I will strive to present the algorithms in a way that I wish I had come across when I was starting out.

everything became much clearer when I started writing code.

You might be eager to jump right in and learn about Neural Networks, backpropagation, how they can be applied to datasets in practice, etc. But before we get there, I'd like us to first forget about all that. Let's take a step back and understand what is really going on at the core. Let's first talk about real-valued circuits.

Update note: I suspended my work on this guide a while ago and redirected a lot of my energy to teaching the CS231n (Convolutional Neural Networks) class at Stanford. The notes are on cs231n.github.io and the course slides can be found here. These materials are highly related to material here, but more comprehensive and sometimes more polished.

Chapter 1: Real-valued Circuits


In my opinion, the best way to think of Neural Networks is as real-valued circuits, where real values (instead of boolean values {0,1}) flow along edges and interact in gates. However, instead of gates such as AND, OR, NOT, etc., we have binary gates such as * (multiply), + (add), max or unary gates such as exp, etc. Unlike ordinary boolean circuits, however, we will eventually also have gradients flowing on the same edges of the circuit, but in the opposite direction. But we're getting ahead of ourselves. Let's focus and start out simple.

Base Case: Single Gate in the Circuit
Let's first consider a single, simple circuit with one gate. Here's an example:

(Circuit diagram: the inputs x and y feed into a single * gate, which produces the output.)

The circuit takes two real-valued inputs x and y and computes x * y with the * gate.

The Javascript version of this would very simply look something like this:

var forwardMultiplyGate = function(x, y) {
  return x * y;
};
forwardMultiplyGate(-2, 3); // returns -6. Exciting.

And in math form we can think of this gate as implementing the real-valued function:

$$ f(x, y) = x y $$

As with this example, all of our gates will take one or two inputs and produce a single output value.

The Goal
The problem we are interested in studying looks as follows:

1. We provide a given circuit some specific input values (e.g. x = -2, y = 3)

2. The circuit computes an output value (e.g. -6)


3. The core question then becomes: How should one tweak the input slightly to increase the output?

In this case, in what direction should we change x, y to get a number larger than -6? Note that, for example, x = -1.99 and y = 2.99 gives x * y = -5.95, which is higher than -6.0. Don't get confused by this: -5.95 is better (higher) than -6.0. It's an improvement of 0.05, even though the magnitude of -5.95 (the distance from zero) happens to be lower.

Strategy #1: Random Local Search

Okay. So wait, we have a circuit, we have some inputs and we just want to tweak them slightly to increase the output value? Why is this hard? We can easily "forward" the circuit to compute the output for any given x and y. So isn't this trivial? Why don't we tweak x and y randomly and keep track of the tweak that works best:

// circuit with single gate for now
var forwardMultiplyGate = function(x, y) { return x * y; };
var x = -2, y = 3; // some input values

// try changing x,y randomly small amounts and keep track of what works best
var tweak_amount = 0.01;
var best_out = -Infinity;
var best_x = x, best_y = y;
for(var k = 0; k < 100; k++) {
  var x_try = x + tweak_amount * (Math.random() * 2 - 1); // tweak x a bit
  var y_try = y + tweak_amount * (Math.random() * 2 - 1); // tweak y a bit
  var out = forwardMultiplyGate(x_try, y_try);
  if(out > best_out) {
    // best improvement yet! Keep track of the x and y
    best_out = out;
    best_x = x_try, best_y = y_try;
  }
}

When I run this, I get best_x = -1.9928, best_y = 2.9901, and best_out = -5.9588. Again, -5.9588 is higher than -6.0. So, we're done, right? Not quite: This is a perfectly fine strategy for tiny problems with a few gates if you can afford the compute time, but it won't do if we want to eventually consider huge circuits with millions of inputs. It turns out that we can do much better.

Strategy #2: Numerical Gradient

Here's a better way. Remember again that in our setup we are given a circuit (e.g. our circuit with a single * gate) and some particular input (e.g. x = -2, y = 3). The gate computes the output (-6) and now we'd like to tweak x and y to make the output higher.

A nice intuition for what we're about to do is as follows: Imagine taking the output value that comes out from the circuit and tugging on it in the positive direction. This positive tension will in turn translate through the gate and induce forces on the inputs x and y. Forces that tell us how x and y should change to increase the output value.

What might those forces look like in our specific example? Thinking through it, we can intuit that the force on x should also be positive, because making x slightly larger improves the circuit's output. For example, increasing x from x = -2 to x = -1 would give us output -3, much larger than -6. On the other hand, we'd expect a negative force induced on y that pushes it to become lower (since a lower y, such as y = 2, down from the original y = 3, would make the output higher: 2 x -2 = -4, again, larger than -6). That's the intuition to keep in mind, anyway. As we go through this, it will turn out that the forces I'm describing will in fact turn out to be the derivative of the output value with respect to its inputs (x and y). You may have heard this term before.

The derivative can be thought of as a force on each input as we pull on the output to become higher.

So how do we exactly evaluate this force (derivative)? It turns out that there is a very simple procedure for this. We will work backwards: Instead of pulling on the circuit's output, we'll iterate over every input one by one, increase it very slightly and look at what happens to the output value. The amount the output changes in response is the derivative. Enough intuitions for now. Let's look at the mathematical definition. We can write down the derivative for our function with respect to the inputs. For example, the derivative with respect to x can be computed as:

$$ \frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h} $$

Where h is small - it's the tweak amount. Also, if you're not very familiar with calculus it is important to note that in the left-hand side of the equation above, the horizontal line does not indicate division. The entire symbol ∂f(x,y)/∂x is a single thing: the derivative of the function f(x,y) with respect to x. The horizontal line on the right is division. I know it's confusing but it's standard notation. Anyway, I hope it doesn't look too scary because it isn't: The circuit was giving some initial output f(x,y), and then we changed one of the inputs by a tiny amount h and read the new output f(x+h,y). Subtracting those two quantities tells us the change, and the division by h just normalizes this change by the (arbitrary) tweak amount we used. In other words it's expressing exactly what I described above and translates directly to this code:

var x = -2, y = 3;
var out = forwardMultiplyGate(x, y); // -6
var h = 0.0001;

// compute derivative with respect to x
var xph = x + h; // -1.9999
var out2 = forwardMultiplyGate(xph, y); // -5.9997
var x_derivative = (out2 - out) / h; // 3.0

// compute derivative with respect to y
var yph = y + h; // 3.0001
var out3 = forwardMultiplyGate(x, yph); // -6.0002
var y_derivative = (out3 - out) / h; // -2.0

Let's walk through x for example. We turned the knob from x to x + h and the circuit responded by giving a higher value (note again that yes, -5.9997 is higher than -6: -5.9997 > -6). The division by h is there to normalize the circuit's response by the (arbitrary) value of h we chose to use here. Technically, you want the value of h to be infinitesimal (the precise mathematical definition of the gradient is defined as the limit of the expression as h goes to zero), but in practice h = 0.00001 or so works fine in most cases to get a good approximation. Now, we see that the derivative w.r.t. x is +3. I'm making the positive sign explicit, because it indicates that the circuit is tugging on x to become higher. The actual value, 3, can be interpreted as the force of that tug.

The derivative with respect to some input can be computed by tweaking that input by a small amount and observing the change on the output value.

By the way, we usually talk about the derivative with respect to a single input, or about a gradient with respect to all the inputs. The gradient is just made up of the derivatives of all the inputs concatenated in a vector (i.e. a list). Crucially, notice that if we let the inputs respond to the tug by following the gradient a tiny amount (i.e. we just add the derivative on top of every input), we can see that the value increases, as expected:

var step_size = 0.01;
var out = forwardMultiplyGate(x, y); // before: -6
x = x + step_size * x_derivative; // x becomes -1.97
y = y + step_size * y_derivative; // y becomes 2.98
var out_new = forwardMultiplyGate(x, y); // -5.87! exciting.

As expected, we changed the inputs by the gradient and the circuit now gives a slightly higher value (-5.87 > -6.0). That was much simpler than trying random changes to x and y, right? A fact to appreciate here is that if you take calculus you can prove that the gradient is, in fact, the direction of the steepest increase of the function. There is no need to monkey around trying out random perturbations as done in Strategy #1. Evaluating the gradient requires just three evaluations of the forward pass of our circuit instead of hundreds, and gives the best tug you can hope for (locally) if you are interested in increasing the value of the output.

Bigger step is not always better. Let me clarify on this point a bit. It is important to note that in this very simple example, using a bigger step_size than 0.01 will always work better. For example, step_size = 1.0 gives output 1 (higher, better!), and indeed infinite step size would give infinitely good results. The crucial thing to realize is that once our circuits get much more complex (e.g. entire neural networks), the function from inputs to the output value will be more chaotic and wiggly. The gradient guarantees that if you have a very small (indeed, infinitesimally small) step size, then you will definitely get a higher number when you follow its direction, and for that infinitesimally small step size there is no other direction that would have worked better. But if you use a bigger step size (e.g. step_size = 0.01) all bets are off. The reason we can get away with a larger step size than infinitesimally small is that our functions are usually relatively smooth. But really, we're crossing our fingers and hoping for the best.
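As a quick sanity check of that step_size = 1.0 claim, here is a sketch reusing forwardMultiplyGate and the gradient we computed above (my own check, not in the original text):

var x = -2, y = 3;
var x_derivative = 3, y_derivative = -2; // the gradients we found numerically above
var big_step = 1.0;
x = x + big_step * x_derivative; // x becomes 1
y = y + big_step * y_derivative; // y becomes 1
forwardMultiplyGate(x, y); // returns 1, far higher than -6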

Hill climbing analogy. One analogy I've heard before is that the output value of our circuit is like the height of a hill, and we are blindfolded and trying to climb upwards. We can sense the steepness of the hill at our feet (the gradient), so when we shuffle our feet a bit we will go upwards. But if we took a big, overconfident step, we could have stepped right into a hole.

Great, I hope I've convinced you that the numerical gradient is indeed a very useful thing to evaluate, and that it is cheap. But. It turns out that we can do even better.

Strategy #3: Analytic Gradient
In the previous section we evaluated the gradient by probing the circuit's output value, independently for every input. This procedure gives you what we call a numerical gradient. This approach, however, is still expensive because we need to compute the circuit's output as we tweak every input value independently a small amount. So the complexity of evaluating the gradient is linear in the number of inputs. But in practice we will have hundreds, thousands or (for neural networks) even tens to hundreds of millions of inputs, and the circuits aren't just one multiply gate but huge expressions that can be expensive to compute. We need something better.

Luckily, there is an easier and much faster way to compute the gradient: we can use calculus to derive a direct expression for it that will be as simple to evaluate as the circuit's output value. We call this an analytic gradient and there will be no need for tweaking anything. You may have seen other people who teach Neural Networks derive the gradient in huge and, frankly, scary and confusing mathematical equations (if you're not well-versed in maths). But it's unnecessary. I've written plenty of Neural Nets code and I rarely have to do a mathematical derivation longer than two lines, and 95% of the time it can be done without writing anything at all. That is because we will only ever derive the gradient for very small and simple expressions (think of it as the base case) and then I will show you how we can compose these very simply with chain rule to evaluate the full gradient (think inductive/recursive case).

The analytic derivative requires no tweaking of the inputs. It can be derived using mathematics (calculus).

If you remember your product rules, power rules, quotient rules, etc. (see e.g. derivative rules or wiki page), it's very easy to write down the derivative with respect to both x and y for a small expression such as x * y. But suppose you don't remember your calculus rules. We can go back to the definition. For example, here's the expression for the derivative w.r.t. x:

$$ \frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h} $$

(Technically I'm not writing the limit as h goes to zero, forgive me math people). Okay and let's plug our function ( f(x,y) = x y ) into the expression. Ready for the hardest piece of math of this entire article? Here we go:

$$ \frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h} = \frac{(x+h)y - xy}{h} = \frac{xy + hy - xy}{h} = \frac{hy}{h} = y $$

That's interesting. The derivative with respect to x is just equal to y. Did you notice the coincidence in the previous section? We tweaked x to x + h and calculated x_derivative = 3.0, which exactly happens to be the value of y in that example. It turns out that wasn't a coincidence at all because that's just what the analytic gradient tells us the x derivative should be for f(x,y) = x * y. The derivative with respect to y, by the way, turns out to be x, unsurprisingly by symmetry. So there is no need for any tweaking! We invoked powerful mathematics and can now transform our derivative calculation into the following code:


var x = -2, y = 3;
var out = forwardMultiplyGate(x, y); // before: -6
var x_gradient = y; // by our complex mathematical derivation above
var y_gradient = x;

var step_size = 0.01;
x += step_size * x_gradient; // -1.97
y += step_size * y_gradient; // 2.98
var out_new = forwardMultiplyGate(x, y); // -5.87. Higher output! Nice.
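For completeness, the symmetric derivation for the y gradient follows exactly the same steps as the one for x (my own check, not spelled out in the original):

$$ \frac{\partial f(x,y)}{\partial y} = \frac{f(x,y+h) - f(x,y)}{h} = \frac{x(y+h) - xy}{h} = \frac{xh}{h} = x $$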

To compute the gradient we went from forwarding the circuit hundreds of times (Strategy #1) to forwarding it only on the order of twice the number of inputs (Strategy #2), to forwarding it a single time! And it gets EVEN better, since the more expensive strategies (#1 and #2) only give an approximation of the gradient, while #3 (the fastest one by far) gives you the exact gradient. No approximations. The only downside is that you should be comfortable with some calculus 101.

Let's recap what we have learned:

INPUT: We are given a circuit, some inputs and compute an output value.
OUTPUT: We are then interested in finding small changes to each input (independently) that would make the output higher.
Strategy #1: One silly way is to randomly search for small perturbations of the inputs and keep track of what gives the highest increase in output.
Strategy #2: We saw we can do much better by computing the gradient. Regardless of how complicated the circuit is, the numerical gradient is very simple (but relatively expensive) to compute. We compute it by probing the circuit's output value as we tweak the inputs one at a time.
Strategy #3: In the end, we saw that we can be even more clever and analytically derive a direct expression to get the analytic gradient. It is identical to the numerical gradient, it is fastest by far, and there is no need for any tweaking.

In practice by the way (and we will get to this once again later), all Neural Network libraries always compute the analytic gradient, but the correctness of the implementation is verified by comparing it to the numerical gradient. That's because the numerical gradient is very easy to evaluate (but can be a bit expensive to compute), while the analytic gradient can contain bugs at times, but is usually extremely efficient to compute. As we will see, evaluating the gradient (i.e. while doing backprop, or backward pass) will turn out to cost about as much as evaluating the forward pass.
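As a rough sketch of what such a gradient check might look like in the style of this article (the helper name numericalGradient and its signature are my own, not from any particular library):

// numerically estimate the gradient of f at the given inputs, one tweak at a time
var numericalGradient = function(f, inputs) {
  var h = 0.0001;
  var base = f(inputs);
  var grad = [];
  for(var i = 0; i < inputs.length; i++) {
    var nudged = inputs.slice(); // copy the inputs
    nudged[i] += h; // tweak only the i-th input
    grad.push((f(nudged) - base) / h);
  }
  return grad;
}
// e.g. numericalGradient(function(v) { return v[0] * v[1]; }, [-2, 3]) gives approximately [3, -2],
// which we would compare against the analytic gradient to catch bugs.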

Recursive Case: Circuits with Multiple Gates

But hold on, you say: "The analytic gradient was trivial to derive for your super-simple expression. This is useless. What do I do when the expressions are much larger? Don't the equations get huge and complex very fast?" Good question. Yes the expressions get much more complex. No, this doesn't make it much harder. As we will see, every gate will be hanging out by itself, completely unaware of any details of the huge and complex circuit that it could be part of. It will only worry about its inputs and it will compute its local derivatives as seen in the previous section, except now there will be a single extra multiplication it will have to do.

A single extra multiplication will turn a single (useless) gate into a cog in the complex machine that is an entire neural network.

I should stop hyping it up now. I hope I've piqued your interest! Let's drill down into details and get two gates involved with this next example:

(Circuit diagram: x and y feed into a + gate that outputs q; q and z then feed into a * gate that outputs f.)

The expression we are computing now is f(x, y, z) = (x + y) z. Let's structure the code as follows to make the gates explicit as functions:

var forwardMultiplyGate = function(a, b) {
  return a * b;
};
var forwardAddGate = function(a, b) {
  return a + b;
};
var forwardCircuit = function(x, y, z) {
  var q = forwardAddGate(x, y);
  var f = forwardMultiplyGate(q, z);
  return f;
};

var x = -2, y = 5, z = -4;
var f = forwardCircuit(x, y, z); // output is -12

In the above, I am using a and b as the local variables in the gate functions so that we don't get these confused with our circuit inputs x, y, z. As before, we are interested in finding the derivatives with respect to the three inputs x, y, z. But how do we compute it now that there are multiple gates involved? First, let's pretend that the + gate is not there and that we only have two variables in the circuit: q, z and a single * gate. Note that q is the output of the + gate. If we don't worry about x and y but only about q and z, then we are back to having only a single gate, and as far as that single * gate is concerned, we know what the (analytic) derivatives are from the previous section. We can write them down (except here we're replacing x, y with q, z):

$$ f(q, z) = q z \quad\Rightarrow\quad \frac{\partial f(q,z)}{\partial q} = z, \quad \frac{\partial f(q,z)}{\partial z} = q $$

Simple enough: these are the expressions for the gradient with respect to q and z. But wait, we don't want the gradient with respect to q, but with respect to the inputs: x and y. Luckily, q is computed as a function of x and y (by addition in our example). We can write down the gradient for the addition gate as well, it's even simpler:

$$ q(x, y) = x + y \quad\Rightarrow\quad \frac{\partial q(x,y)}{\partial x} = 1, \quad \frac{\partial q(x,y)}{\partial y} = 1 $$

That's right, the derivatives are just 1, regardless of the actual values of x and y. If you think about it, this makes sense because to make the output of a single addition gate higher, we expect a positive tug on both x and y, regardless of their values.
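If you want to convince yourself of that, the same limit-definition steps as before give (my own check, not in the original text):

$$ \frac{\partial q(x,y)}{\partial x} = \frac{q(x+h,y) - q(x,y)}{h} = \frac{(x+h+y) - (x+y)}{h} = \frac{h}{h} = 1 $$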

Backpropagation

We are finally ready to invoke the Chain Rule: We know how to compute the gradient of q with respect to x and y (that's a single gate case with + as the gate). And we know how to compute the gradient of our final output with respect to q. The chain rule tells us how to combine these to get the gradient of the final output with respect to x and y, which is what we're ultimately interested in. Best of all, the chain rule very simply states that the right thing to do is to simply multiply the gradients together to chain them. For example, the final derivative for x will be:

$$ \frac{\partial f(q,z)}{\partial x} = \frac{\partial q(x,y)}{\partial x} \frac{\partial f(q,z)}{\partial q} $$

There are many symbols there so maybe this is confusing again, but it's really just two numbers being multiplied together. Here is the code:

// initial conditions
var x = -2, y = 5, z = -4;
var q = forwardAddGate(x, y); // q is 3
var f = forwardMultiplyGate(q, z); // output is -12

// gradient of the MULTIPLY gate with respect to its inputs
// wrt is short for "with respect to"
var derivative_f_wrt_z = q; // 3
var derivative_f_wrt_q = z; // -4

// derivative of the ADD gate with respect to its inputs
var derivative_q_wrt_x = 1.0;
var derivative_q_wrt_y = 1.0;

// chain rule
var derivative_f_wrt_x = derivative_q_wrt_x * derivative_f_wrt_q; // -4
var derivative_f_wrt_y = derivative_q_wrt_y * derivative_f_wrt_q; // -4

That's it. We computed the gradient (the forces) and now we can let our inputs respond to it by a bit. Let's add the gradients on top of the inputs. The output value of the circuit better increase, up from -12!

// final gradient, from above: [-4, -4, 3]
var gradient_f_wrt_xyz = [derivative_f_wrt_x, derivative_f_wrt_y, derivative_f_wrt_z];

// let the inputs respond to the force/tug:
var step_size = 0.01;
x = x + step_size * derivative_f_wrt_x; // -2.04
y = y + step_size * derivative_f_wrt_y; // 4.96
z = z + step_size * derivative_f_wrt_z; // -3.97

// Our circuit now better give higher output:
var q = forwardAddGate(x, y); // q becomes 2.92
var f = forwardMultiplyGate(q, z); // output is -11.59, up from -12! Nice!

Looks like that worked! Let's now try to interpret intuitively what just happened. The circuit wants to output higher values. The last gate saw inputs q = 3, z = -4 and computed output -12. Pulling upwards on this output value induced a force on both q and z: To increase the output value, the circuit wants z to increase, as can be seen by the positive value of the derivative (derivative_f_wrt_z = +3). Again, the size of this derivative can be interpreted as the magnitude of the force. On the other hand, q felt a stronger and downward force, since derivative_f_wrt_q = -4. In other words the circuit wants q to decrease, with a force of 4.

Now we get to the second, + gate which outputs q. By default, the + gate computes its derivatives which tell us how to change x and y to make q higher. BUT! Here is the crucial point: the gradient on q was computed as negative (derivative_f_wrt_q = -4), so the circuit wants q to decrease, and with a force of 4! So if the + gate wants to contribute to making the final output value larger, it needs to listen to the gradient signal coming from the top. In this particular case, it needs to apply tugs on x, y opposite of what it would normally apply, and with a force of 4, so to speak. The multiplication by -4 seen in the chain rule achieves exactly this: instead of applying a positive force of +1 on both x and y (the local derivative), the full circuit's gradient on both x and y becomes 1 x -4 = -4. This makes sense: the circuit wants both x and y to get smaller because this will make q smaller, which in turn will make f larger.

If this makes sense, you understand backpropagation.

Let's recap once again what we learned:

In the previous chapter we saw that in the case of a single gate (or a single expression), we can derive the analytic gradient using simple calculus. We interpreted the gradient as a force, or a tug on the inputs that pulls them in a direction which would make this gate's output higher.
In case of multiple gates everything stays pretty much the same way: every gate is hanging out by itself completely unaware of the circuit it is embedded in. Some inputs come in and the gate computes its output and the derivative with respect to the inputs. The only difference now is that suddenly, something can pull on this gate from above. That's the gradient of the final circuit output value with respect to the output this gate computed. It is the circuit asking the gate to output higher or lower numbers, and with some force. The gate simply takes this force and multiplies it to all the forces it computed for its inputs before (chain rule). This has the desired effect:

1. If a gate experiences a strong positive pull from above, it will also pull harder on its own inputs, scaled by the force it is experiencing from above
2. And if it experiences a negative tug, this means that the circuit wants its value to decrease not increase, so it will flip the force of the pull on its inputs to make its own output value smaller.

A nice picture to have in mind is that as we pull on the circuit's output value at the end, this induces pulls downward through the entire circuit, all the way down to the inputs.

Isn't it beautiful? The only difference between the case of a single gate and multiple interacting gates that compute arbitrarily complex expressions is this additional multiply operation that now happens in each gate.

Patterns in the backward flow

Let's look again at our example circuit with the numbers filled in. The first circuit shows the raw values, and the second circuit shows the gradients that flow back to the inputs as discussed. Notice that the gradient always starts off with +1 at the end to start off the chain. This is the (default) pull on the circuit to have its value increased.

(Circuit diagrams: in the forward pass, x = -2 and y = 5 feed the + gate giving q = 3, which together with z = -4 feeds the * gate giving the output -12. In the backward pass, the gradient is +1 on the output, -4 on q and on both x and y, and +3 on z.)

After a while you start to notice patterns in how the gradients flow backward in the circuits. For example, the + gate always takes the gradient on top and simply passes it on to all of its inputs (notice the example with -4 simply passed on to both of the inputs of the + gate). This is because its own derivative for the inputs is just +1, regardless of what the actual values of the inputs are, so in the chain rule, the gradient from above is just multiplied by 1 and stays the same. Similar intuitions apply to, for example, a max(x,y) gate. Since the gradient of max(x,y) with respect to its input is +1 for whichever one of x, y is larger and 0 for the other, this gate is during backprop effectively just a gradient "switch": it will take the gradient from above and route it to the input that had a higher value during the forward pass.
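A tiny numerical sketch of that switching behavior (my own illustration, using made-up values rather than the circuit above):

var x = 2, y = 5;
var out = Math.max(x, y); // forward pass: 5, since y was the larger input
var dout = 1.0; // some gradient arriving from above
var dx = (x > y) ? dout : 0.0; // 0.0: x did not flow through the gate
var dy = (y > x) ? dout : 0.0; // 1.0: the entire gradient is routed to y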


Numerical Gradient Check. Before we finish with this section, let's just make sure that the (analytic) gradient we computed by backprop above is correct as a sanity check. Remember that we can do this simply by computing the numerical gradient and making sure that we get [-4, -4, 3] for x, y, z. Here's the code:

// initial conditions
var x = -2, y = 5, z = -4;

// numerical gradient check
var h = 0.0001;
var x_derivative = (forwardCircuit(x+h, y, z) - forwardCircuit(x, y, z)) / h; // -4
var y_derivative = (forwardCircuit(x, y+h, z) - forwardCircuit(x, y, z)) / h; // -4
var z_derivative = (forwardCircuit(x, y, z+h) - forwardCircuit(x, y, z)) / h; // 3

and we get [-4, -4, 3], as computed with backprop. Phew! :)

Example: Single Neuron
In the previous section you hopefully got the basic intuition behind backpropagation. Let's now look at an even more complicated and borderline practical example. We will consider a 2-dimensional neuron that computes the following function:

$$ f(x, y, a, b, c) = \sigma(a x + b y + c) $$

In this expression, σ is the sigmoid function. It's best thought of as a "squashing function", because it takes the input and squashes it to be between zero and one: Very negative values are squashed towards zero and positive values get squashed towards one. For example, we have sig(-5) = 0.006, sig(0) = 0.5, sig(5) = 0.993. The sigmoid function is defined as:

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$

The gradient with respect to its single input, as you can check on Wikipedia or derive yourself if you know some calculus, is given by this expression:

$$ \frac{\partial \sigma(x)}{\partial x} = \sigma(x) (1 - \sigma(x)) $$

For example, if the input to the sigmoid gate is x = 3, the gate will compute output f = 1.0 / (1.0 + Math.exp(-x)) = 0.95, and then the (local) gradient on its input will simply be dx = (0.95) * (1 - 0.95) = 0.0475.
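A quick sketch to verify those two numbers in code (my own check):

var sig = function(x) { return 1 / (1 + Math.exp(-x)); };
var f = sig(3); // 0.9526, which rounds to the 0.95 above
var dx = f * (1 - f); // 0.0452 (plugging in the rounded 0.95 gives the 0.0475 above)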

That's all we need to use this gate: we know how to take an input and forward it through the sigmoid gate, and we also have the expression for the gradient with respect to its input, so we can also backprop through it. Another thing to note is that technically, the sigmoid function is made up of an entire series of gates in a line that compute more atomic functions: an exponentiation gate, an addition gate and a division gate. Treating it so would work perfectly fine but for this example I chose to collapse all of these gates into a single gate that just computes sigmoid in one shot, because the gradient expression turns out to be simple.
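If you are curious, here is a sketch of what that decomposition into atomic gates could look like, written in the condensed style used later in this article (my own illustration; x and the incoming gradient df are assumed given, and the chained local gradients multiply out to sig(x) * (1 - sig(x)) * df, as claimed):

// forward pass of sigmoid built from atomic gates
var x1 = -x; // negation
var x2 = Math.exp(x1); // exponentiation gate
var x3 = 1 + x2; // addition gate
var f = 1.0 / x3; // division gate; f = sig(x)
// backward pass, chaining the local gradients with df from above
var dx3 = (-1.0 / (x3 * x3)) * df;
var dx2 = 1.0 * dx3;
var dx1 = Math.exp(x1) * dx2;
var dx = -1.0 * dx1; // equals sig(x) * (1 - sig(x)) * df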

Let's take this opportunity to carefully structure the associated code in a nice and modular way. First, I'd like you to note that every wire in our diagrams has two numbers associated with it:

1. the value it carries during the forward pass
2. the gradient (i.e. the pull) that flows back through it in the backward pass

Let's create a simple Unit structure that will store these two values on every wire. Our gates will now operate over Units: they will take them as inputs and create them as outputs.

// every Unit corresponds to a wire in the diagrams
var Unit = function(value, grad) {
  // value computed in the forward pass
  this.value = value;
  // the derivative of circuit output w.r.t this unit, computed in backward pass
  this.grad = grad;
}

In addition to Units we also need 3 gates: +, * and sig (sigmoid). Let's start out by implementing a multiply gate. I'm using Javascript here which has a funny way of simulating classes using functions. If you're not familiar with Javascript, all that's going on here is that I'm defining a class that has certain properties (accessed with use of the this keyword), and some methods (which in Javascript are placed into the function's prototype). Just think about these as class methods. Also keep in mind that the way we will use these eventually is that we will first forward all the gates one by one, and then backward all the gates in reverse order. Here is the implementation:

var multiplyGate = function(){ };
multiplyGate.prototype = {
  forward: function(u0, u1) {
    // store pointers to input Units u0 and u1 and output unit utop
    this.u0 = u0;
    this.u1 = u1;
    this.utop = new Unit(u0.value * u1.value, 0.0);
    return this.utop;
  },
  backward: function() {
    // take the gradient in output unit and chain it with the
    // local gradients, which we derived for multiply gate before
    // then write those gradients to those Units.
    this.u0.grad += this.u1.value * this.utop.grad;
    this.u1.grad += this.u0.value * this.utop.grad;
  }
}

The multiply gate takes two units that each hold a value and creates a unit that stores its output. The gradient is initialized to zero. Then notice that in the backward function call we get the gradient from the output unit we produced during the forward pass (which will by now hopefully have its gradient filled in) and multiply it with the local gradient for this gate (chain rule!). This gate computes multiplication (u0.value * u1.value) during forward pass, so recall that the gradient w.r.t u0 is u1.value and w.r.t u1 is u0.value. Also note that we are using += to add onto the gradient in the backward function. This will allow us to possibly use the output of one gate multiple times (think of it as a wire branching out), since it turns out that the gradients from these different branches just add up when computing the final gradient with respect to the circuit output. The other two gates are defined analogously:

var addGate = function(){ };
addGate.prototype = {
  forward: function(u0, u1) {
    this.u0 = u0;
    this.u1 = u1; // store pointers to input units
    this.utop = new Unit(u0.value + u1.value, 0.0);
    return this.utop;
  },
  backward: function() {
    // add gate. derivative wrt both inputs is 1
    this.u0.grad += 1 * this.utop.grad;
    this.u1.grad += 1 * this.utop.grad;
  }
}

var sigmoidGate = function() {
  // helper function
  this.sig = function(x) { return 1 / (1 + Math.exp(-x)); };
};
sigmoidGate.prototype = {
  forward: function(u0) {
    this.u0 = u0;
    this.utop = new Unit(this.sig(this.u0.value), 0.0);
    return this.utop;
  },
  backward: function() {
    var s = this.sig(this.u0.value);
    this.u0.grad += (s * (1 - s)) * this.utop.grad;
  }
}

Note that, again, the backward function in all cases just computes the local derivative with respect to its input and then multiplies on the gradient from the unit above (i.e. chain rule). To fully specify everything let's finally write out the forward and backward flow for our 2-dimensional neuron with some example values:

// create input units
var a = new Unit(1.0, 0.0);
var b = new Unit(2.0, 0.0);
var c = new Unit(-3.0, 0.0);
var x = new Unit(-1.0, 0.0);
var y = new Unit(3.0, 0.0);

// create the gates
var mulg0 = new multiplyGate();
var mulg1 = new multiplyGate();
var addg0 = new addGate();
var addg1 = new addGate();
var sg0 = new sigmoidGate();

// do the forward pass
var forwardNeuron = function() {
  ax = mulg0.forward(a, x); // a*x = -1
  by = mulg1.forward(b, y); // b*y = 6
  axpby = addg0.forward(ax, by); // a*x + b*y = 5
  axpbypc = addg1.forward(axpby, c); // a*x + b*y + c = 2
  s = sg0.forward(axpbypc); // sig(a*x + b*y + c) = 0.8808
};
forwardNeuron();

console.log('circuit output: ' + s.value); // prints 0.8808

And now let's compute the gradient: Simply iterate in reverse order and call the backward function! Remember that we stored the pointers to the units when we did the forward pass, so every gate has access to its inputs and also the output unit it previously produced.

s.grad = 1.0;
sg0.backward(); // writes gradient into axpbypc
addg1.backward(); // writes gradients into axpby and c
addg0.backward(); // writes gradients into ax and by
mulg1.backward(); // writes gradients into b and y
mulg0.backward(); // writes gradients into a and x

Note that the first line sets the gradient at the output (very last unit) to be 1.0 to start off the gradient chain. This can be interpreted as tugging on the last gate with a force of +1. In other words, we are pulling on the entire circuit to induce the forces that will increase the output value. If we did not set this to 1, all gradients would be computed as zero due to the multiplications in the chain rule. Finally, let's make the inputs respond to the computed gradients and check that the function increased:

var step_size = 0.01;
a.value += step_size * a.grad; // a.grad is -0.105
b.value += step_size * b.grad; // b.grad is 0.315
c.value += step_size * c.grad; // c.grad is 0.105
x.value += step_size * x.grad; // x.grad is 0.105
y.value += step_size * y.grad; // y.grad is 0.210

forwardNeuron();
console.log('circuit output after one backprop: ' + s.value); // prints 0.8825

Success! 0.8825 is higher than the previous value, 0.8808. Finally, let's verify that we implemented the backpropagation correctly by checking the numerical gradient:

var forwardCircuitFast = function(a, b, c, x, y) {
  return 1 / (1 + Math.exp(-(a*x + b*y + c)));
};
var a = 1, b = 2, c = -3, x = -1, y = 3;
var h = 0.0001;
var a_grad = (forwardCircuitFast(a+h, b, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var b_grad = (forwardCircuitFast(a, b+h, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var c_grad = (forwardCircuitFast(a, b, c+h, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var x_grad = (forwardCircuitFast(a, b, c, x+h, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var y_grad = (forwardCircuitFast(a, b, c, x, y+h) - forwardCircuitFast(a, b, c, x, y)) / h;

Indeed, these all give the same values as the backpropagated gradients [-0.105, 0.315, 0.105, 0.105, 0.210]. Nice!

I hope it is clear that even though we only looked at an example of a single neuron, the code I gave above generalizes in a very straightforward way to compute gradients of arbitrary expressions (including very deep expressions #foreshadowing). All you have to do is write small gates that compute local, simple derivatives w.r.t their inputs, wire it up in a graph, do a forward pass to compute the output value and then a backward pass that chains the gradients all the way to the input.
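For example, here is a sketch of one more gate written in exactly the same style (my own addition, not from the article: a ReLU gate, which will come up again in Chapter 2):

var reluGate = function(){ };
reluGate.prototype = {
  forward: function(u0) {
    this.u0 = u0;
    this.utop = new Unit(Math.max(0, u0.value), 0.0);
    return this.utop;
  },
  backward: function() {
    // local derivative is 1 if the input was positive, 0 otherwise; chain with the gradient from above
    this.u0.grad += (this.u0.value > 0 ? 1.0 : 0.0) * this.utop.grad;
  }
}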

Becoming a Backprop Ninja
Over time you will become much more efficient in writing the backward pass, even for complicated circuits and all at once. Let's practice backprop a bit with a few examples. In what follows, let's not worry about the Unit, Circuit classes because they obfuscate things a bit, and let's just use variables such as a, b, c, x, and refer to their gradients as da, db, dc, dx respectively. Again, we think of the variables as the "forward flow" and their gradients as "backward flow" along every wire. Our first example was the * gate:

var x = a * b;
// and given gradient on x (dx), we saw that in backprop we would compute:
var da = b * dx;
var db = a * dx;

In the code above, I'm assuming that the variable dx is given, coming from somewhere above us in the circuit while we're doing backprop (or it is +1 by default otherwise). I'm writing it out because I want to explicitly show how the gradients get chained together. Note from the equations that the * gate acts as a switcher during backward pass, for lack of a better word. It remembers what its inputs were, and the gradients on each one will be the value of the other during the forward pass. And then of course we have to multiply with the gradient from above, which is the chain rule. Here's the + gate in this condensed form:

var x = a + b;
// ->
var da = 1.0 * dx;
var db = 1.0 * dx;

Where 1.0 is the local gradient, and the multiplication is our chain rule. What about adding three numbers?:

// lets compute x = a + b + c in two steps:
var q = a + b; // gate 1
var x = q + c; // gate 2

// backward pass:
dc = 1.0 * dx; // backprop gate 2
dq = 1.0 * dx;
da = 1.0 * dq; // backprop gate 1
db = 1.0 * dq;

You can see what's happening, right? If you remember the backward flow diagram, the + gate simply takes the gradient on top and routes it equally to all of its inputs (because its local gradient is always simply 1.0 for all its inputs, regardless of their actual values). So we can do it much faster:

var x = a + b + c;
var da = 1.0 * dx; var db = 1.0 * dx; var dc = 1.0 * dx;

Okay, how about combining gates?:

var x = a * b + c;
// given dx, backprop in one sweep would be =>
da = b * dx;
db = a * dx;
dc = 1.0 * dx;

If you don't see how the above happened, introduce a temporary variable q = a * b and then compute x = q + c to convince yourself. And here is our neuron, let's do it in two steps:

// lets do our neuron in two steps:
var q = a*x + b*y + c;
var f = sig(q); // sig is the sigmoid function
// and now backward pass, we are given df, and:
var df = 1;
var dq = (f * (1 - f)) * df;
// and now we chain it to the inputs
var da = x * dq;
var dx = a * dq;
var dy = b * dq;
var db = y * dq;
var dc = 1.0 * dq;


I hope this is starting to make a little more sense. Now how about this:

var x = a * a;
var da = //???

You can think of this as value a flowing to the * gate, but the wire gets split and becomes both inputs. This is actually simple because the backward flow of gradients always adds up. In other words nothing changes:

var da = a * dx; // gradient into a from first branch
da += a * dx; // and add on the gradient from the second branch

// short form instead is:
var da = 2 * a * dx;

In fact, if you know your power rule from calculus you would also know that if you have f(a) = a^2 then ∂f(a)/∂a = 2a, which is exactly what we get if we think of it as the wire splitting up and being two inputs to a gate.

Let's do another one:

var x = a*a + b*b + c*c;
// we get:
var da = 2*a*dx;
var db = 2*b*dx;
var dc = 2*c*dx;

Okay now let's start to get more complex:

var x = Math.pow(((a * b + c) * d), 2); // pow(x,2) squares the input in JS

When more complex cases like this come up in practice, I like to split the expression into manageable chunks which are almost always composed of simpler expressions and then I chain them together with chain rule:

var x1 = a * b + c;
var x2 = x1 * d;
var x = x2 * x2; // this is identical to the above expression for x
// and now in backprop we go backwards:
var dx2 = 2 * x2 * dx; // backprop into x2
var dd = x1 * dx2; // backprop into d
var dx1 = d * dx2; // backprop into x1
var da = b * dx1;
var db = a * dx1;
var dc = 1.0 * dx1; // done!

That wasn't too difficult! Those are the backprop equations for the entire expression, and we've done them piece by piece and backpropped to all the variables. Notice again how for every variable during forward pass we have an equivalent variable during backward pass that contains its gradient with respect to the circuit's final output. Here are a few more useful functions and their local gradients that are useful in practice:

var x = 1.0 / a; // division
var da = -1.0 / (a * a);

Here's what division might look like in practice then:

var x = (a + b) / (c + d);
// lets decompose it in steps:
var x1 = a + b;
var x2 = c + d;
var x3 = 1.0 / x2;
var x = x1 * x3; // equivalent to above
// and now backprop, again in reverse order:
var dx1 = x3 * dx;
var dx3 = x1 * dx;
var dx2 = (-1.0 / (x2 * x2)) * dx3; // local gradient as shown above, and chain rule
var da = 1.0 * dx1; // and finally into the original variables
var db = 1.0 * dx1;
var dc = 1.0 * dx2;
var dd = 1.0 * dx2;

Hopefully you see that we are breaking down expressions, doing the forward pass, and then for every variable (such as a) we derive its gradient da as we go backwards, one by one, applying the simple local gradients and chaining them with gradients from above. Here's another one:

var x = Math.max(a, b);
var da = a === x ? 1.0 * dx : 0.0;
var db = b === x ? 1.0 * dx : 0.0;


Okay this is making a very simple thing hard to read. The max function passes on the value of the input that was largest and ignores the other ones. In the backward pass then, the max gate will simply take the gradient on top and route it to the input that actually flowed through it during the forward pass. The gate acts as a simple switch based on which input had the highest value during forward pass. The other inputs will have zero gradient. That's what the === is about, since we are testing for which input was the actual max and only routing the gradient to it.

Finally, let's look at the Rectified Linear Unit non-linearity (or ReLU), which you may have heard of. It is used in Neural Networks in place of the sigmoid function. It is simply thresholding at zero:

var x = Math.max(a, 0);
// backprop through this gate will then be:
var da = a > 0 ? 1.0 * dx : 0.0;

In other words this gate simply passes the value through if it's larger than 0, or it stops the flow and sets it to zero. In the backward pass, the gate will pass on the gradient from the top if it was activated during the forward pass, or if the original input was below zero, it will stop the gradient flow.

I will stop at this point. I hope you got some intuition about how you can compute entire expressions (which are made up of many gates along the way) and how you can compute backprop for every one of them.

Everything we've done in this chapter comes down to this: We saw that we can feed some input through an arbitrarily complex real-valued circuit, tug at the end of the circuit with some force, and backpropagation distributes that tug through the entire circuit all the way back to the inputs. If the inputs respond slightly along the final direction of their tug, the circuit will "give" a bit along the original pull direction. Maybe this is not immediately obvious, but this machinery is a powerful hammer for Machine Learning.

Maybe this is not immediately obvious, but this machinery is a powerful hammer for Machine Learning.

Let's now put this machinery to good use.

Chapter 2: Machine Learning


In the last chapter we were concerned with real-valued circuits that computed possibly complex expressions of their inputs (the forward pass), and also we could compute the gradients of these expressions on the original inputs (backward pass). In this chapter we will see how useful this extremely simple mechanism is in Machine Learning.

Binary Classification
As we did before, let's start out simple. The simplest, common and yet very practical problem in Machine Learning is binary classification. A lot of very interesting and important problems can be reduced to it. The setup is as follows: We are given a dataset of N vectors and every one of them is labeled with a +1 or a -1. For example, in two dimensions our dataset could look as simple as:

vector -> label
[1.2, 0.7] -> +1
[-0.3, -0.5] -> -1
[3, 1] -> +1
[-0.1, -1.0] -> -1
[-3.0, 1.1] -> -1
[2.1, -3] -> +1

Here, we have N = 6 datapoints, where every datapoint has two features (D = 2). Three of the datapoints have label +1 and the other three label -1. This is a silly toy example, but in practice a +1/-1 dataset could be very useful things indeed: For example spam/no spam emails, where the vectors somehow measure various features of the content of the email, such as the number of times certain enhancement drugs are mentioned.

Goal. Our goal in binary classification is to learn a function that takes a 2-dimensional vector and predicts the label. This function is usually parameterized by a certain set of parameters, and we will want to tune the parameters of the function so that its outputs are consistent with the labeling in the provided dataset. In the end we can discard the dataset and use the learned parameters to predict labels for previously unseen vectors.

Training protocol
We will eventually build up to entire neural networks and complex expressions, but let's start out simple and train a linear classifier very similar to the single neuron we saw at the end of Chapter 1. The only difference is that we'll get rid of the sigmoid because it makes things unnecessarily complicated (I only used it as an example in Chapter 1 because sigmoid neurons are historically popular but modern Neural Networks rarely, if ever, use sigmoid non-linearities). Anyway, let's use a simple linear function:

$$ f(x, y) = a x + b y + c $$

In this expression we think of x and y as the inputs (the 2D vectors) and a, b, c as the parameters of the function that we will want to learn. For example, if a = 1, b = -2, c = -1, then the function will take the first datapoint ([1.2, 0.7]) and output 1 * 1.2 + (-2) * 0.7 + (-1) = -1.2. Here is how the training will work:

1. We select a random datapoint and feed it through the circuit
2. We will interpret the output of the circuit as a confidence that the datapoint has class +1. (i.e. very high values = circuit is very certain datapoint has class +1 and very low values = circuit is certain this datapoint has class -1.)
3. We will measure how well the prediction aligns with the provided labels. Intuitively, for example, if a positive example scores very low, we will want to tug in the positive direction on the circuit, demanding that it should output a higher value for this datapoint. Note that this is the case for the first datapoint: it is labeled as +1 but our predictor function only assigns it value -1.2. We will therefore tug on the circuit in the positive direction; we want the value to be higher.
4. The circuit will take the tug and backpropagate it to compute tugs on the inputs a, b, c, x, y
5. Since we think of x, y as (fixed) datapoints, we will ignore the pull on x, y. If you're a fan of my physical analogies, think of these inputs as pegs, fixed in the ground.
6. On the other hand, we will take the parameters a, b, c and make them respond to their tug (i.e. we'll perform what we call a parameter update). This, of course, will make it so that the circuit will output a slightly higher score on this particular datapoint in the future.
7. Iterate! Go back to step 1.

The training scheme I described above is commonly referred to as Stochastic Gradient Descent. The interesting part I'd like to reiterate is that a, b, c, x, y are all made up of the same stuff as far as the circuit is concerned: They are inputs to the circuit and the circuit will tug on all of them in some direction. It doesn't know the difference between parameters and datapoints. However, after the backward pass is complete we ignore all tugs on the datapoints (x, y) and keep swapping them in and out as we iterate over examples in the dataset. On the other hand, we keep the parameters (a, b, c) around and keep tugging on them every time we sample a datapoint. Over time, the pulls on these parameters will tune these values in such a way that the function outputs high scores for positive examples and low scores for negative examples.

Learning a Support Vector Machine

As a concrete example, let's learn a Support Vector Machine. The SVM is a very popular linear classifier; its functional form is exactly as I've described in the previous section, f(x, y) = a x + b y + c. At this point, if you've seen an explanation of SVMs you're probably expecting me to define the SVM loss function and plunge into an explanation of slack variables, geometrical intuitions of large margins, kernels, duality, etc. But here, I'd like to take a different approach. Instead of defining loss functions, I would like to base the explanation on the force specification (I just made this term up by the way) of a Support Vector Machine, which I personally find much more intuitive. As we will see, talking about the force specification and the loss function are identical ways of seeing the same problem. Anyway, here it is:

Support Vector Machine "Force Specification":

If we feed a positive datapoint through the SVM circuit and the output value is less than 1, pull on the circuit with force +1. This is a positive example so we want the score to be higher for it.
Conversely, if we feed a negative datapoint through the SVM and the output is greater than -1, then the circuit is giving this datapoint a dangerously high score: Pull on the circuit downwards with force -1.
In addition to the pulls above, always add a small amount of pull on the parameters a, b (notice, not on c!) that pulls them towards zero. You can think of both a, b as being attached to a physical spring that is attached at zero. Just as with a physical spring, this will make the pull proportional to the value of each of a, b (Hooke's law in physics, anyone?). For example, if a becomes very high it will experience a strong pull of magnitude |a| back towards zero. This pull is something we call regularization, and it ensures that neither of our parameters a or b gets disproportionally large. This would be undesirable because both a, b get multiplied to the input features x, y (remember the equation is a*x + b*y + c), so if either of them is too high, our classifier would be overly sensitive to these features. This isn't a nice property because features can often be noisy in practice, so we want our classifier to change relatively smoothly if they wiggle around.

Let's quickly go through a small but concrete example. Suppose we start out with a random parameter setting, say, a = 1, b = -2, c = -1. Then:

If we feed the point [1.2, 0.7], the SVM will compute score 1 * 1.2 + (-2) * 0.7 - 1 = -1.2. This point is labeled as +1 in the training data, so we want the score to be higher than 1. The gradient on top of the circuit will thus be positive: +1, which will backpropagate to a, b, c. Additionally, there will also be a regularization pull on a of -1 (to make it smaller) and a regularization pull on b of +2 (to make it larger, toward zero).
Suppose instead that we fed the datapoint [-0.3, 0.5] to the SVM. It computes 1 * (-0.3) + (-2) * 0.5 - 1 = -2.3. The label for this point is -1, and since -2.3 is smaller than -1, we see that according to our force specification the SVM should be happy: The computed score is very negative, consistent with the negative label of this example. There will be no pull at the end of the circuit (i.e. it is zero), since no changes are necessary. However, there will still be the regularization pull on a of -1 and on b of +2.

Okay, there's been too much text. Let's write the SVM code and take advantage of the circuit machinery we have from Chapter 1:

// A circuit: it takes 5 Units (x,y,a,b,c) and outputs a single Unit
// It can also compute the gradient w.r.t. its inputs
var Circuit = function() {
  // create some gates
  this.mulg0 = new multiplyGate();
  this.mulg1 = new multiplyGate();
  this.addg0 = new addGate();
  this.addg1 = new addGate();
};
Circuit.prototype = {
  forward: function(x, y, a, b, c) {
    this.ax = this.mulg0.forward(a, x); // a*x
    this.by = this.mulg1.forward(b, y); // b*y
    this.axpby = this.addg0.forward(this.ax, this.by); // a*x + b*y
    this.axpbypc = this.addg1.forward(this.axpby, c); // a*x + b*y + c
    return this.axpbypc;
  },
  backward: function(gradient_top) { // takes pull from above
    this.axpbypc.grad = gradient_top;
    this.addg1.backward(); // sets gradient in axpby and c
    this.addg0.backward(); // sets gradient in ax and by
    this.mulg1.backward(); // sets gradient in b and y
    this.mulg0.backward(); // sets gradient in a and x
  }
}

That's a circuit that simply computes a*x + b*y + c and can also compute the gradient. It uses the gates code we developed in Chapter 1. Now let's write the SVM, which doesn't care about the actual circuit. It is only concerned with the values that come out of it, and it pulls on the circuit.

// SVM class
var SVM = function() {

  // random initial parameter values
  this.a = new Unit(1.0, 0.0);
  this.b = new Unit(-2.0, 0.0);
  this.c = new Unit(-1.0, 0.0);

  this.circuit = new Circuit();
};
SVM.prototype = {
  forward: function(x, y) { // assume x and y are Units
    this.unit_out = this.circuit.forward(x, y, this.a, this.b, this.c);
    return this.unit_out;
  },
  backward: function(label) { // label is +1 or -1

    // reset pulls on a,b,c
    this.a.grad = 0.0;
    this.b.grad = 0.0;
    this.c.grad = 0.0;

    // compute the pull based on what the circuit output was
    var pull = 0.0;
    if(label === 1 && this.unit_out.value < 1) {
      pull = 1; // the score was too low: pull up
    }
    if(label === -1 && this.unit_out.value > -1) {
      pull = -1; // the score was too high for a negative example: pull down
    }
    this.circuit.backward(pull); // writes gradient into x,y,a,b,c

    // add regularization pull for parameters: towards zero and proportional to value
    this.a.grad += -this.a.value;
    this.b.grad += -this.b.value;
  },
  learnFrom: function(x, y, label) {
    this.forward(x, y); // forward pass (set .value in all Units)
    this.backward(label); // backward pass (set .grad in all Units)
    this.parameterUpdate(); // parameters respond to tug
  },
  parameterUpdate: function() {
    var step_size = 0.01;
    this.a.value += step_size * this.a.grad;
    this.b.value += step_size * this.b.grad;
    this.c.value += step_size * this.c.grad;
  }
};

Now let's train the SVM with Stochastic Gradient Descent:

var data = []; var labels = [];
data.push([1.2, 0.7]); labels.push(1);
data.push([-0.3, -0.5]); labels.push(-1);
data.push([3.0, 0.1]); labels.push(1);
data.push([-0.1, -1.0]); labels.push(-1);
data.push([-1.0, 1.1]); labels.push(-1);
data.push([2.1, -3]); labels.push(1);
var svm = new SVM();

// a function that computes the classification accuracy
var evalTrainingAccuracy = function() {
  var num_correct = 0;
  for(var i = 0; i < data.length; i++) {
    var x = new Unit(data[i][0], 0.0);
    var y = new Unit(data[i][1], 0.0);
    var true_label = labels[i];

    // see if the prediction matches the provided label
    var predicted_label = svm.forward(x, y).value > 0 ? 1 : -1;
    if(predicted_label === true_label) {
      num_correct++;
    }
  }
  return num_correct / data.length;
};

// the learning loop
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = new Unit(data[i][0], 0.0);
  var y = new Unit(data[i][1], 0.0);
  var label = labels[i];
  svm.learnFrom(x, y, label);

  if(iter % 25 == 0) { // every 25 iterations...
    console.log('training accuracy at iter ' + iter + ': ' + evalTrainingAccuracy());
  }
}

This code prints the following output:

training accuracy at iteration 0: 0.3333333333333333
training accuracy at iteration 25: 0.3333333333333333
training accuracy at iteration 50: 0.5
training accuracy at iteration 75: 0.5
training accuracy at iteration 100: 0.3333333333333333
training accuracy at iteration 125: 0.5
training accuracy at iteration 150: 0.5
training accuracy at iteration 175: 0.5
training accuracy at iteration 200: 0.5
training accuracy at iteration 225: 0.6666666666666666
training accuracy at iteration 250: 0.6666666666666666
training accuracy at iteration 275: 0.8333333333333334
training accuracy at iteration 300: 1
training accuracy at iteration 325: 1
training accuracy at iteration 350: 1
training accuracy at iteration 375: 1

We see that initially our classifier only had 33% training accuracy, but by the end all training examples are correctly classified as the parameters a, b, c adjusted their values according to the pulls we exerted. We just trained an SVM! But please don't use this code anywhere in production :) We will see how we can make things much more efficient once we understand what is going on at the core.

Number of iterations needed. With this example data, with this example initialization, and with the setting of step size we used, it took about 300 iterations to train the SVM. In practice, this could be many more or many fewer depending on how hard or large the problem is, how you're initializing, normalizing your data, what step size you're using, and so on. This is just a toy demonstration, but later we will go over all the best practices for actually training these classifiers in practice. For example, it will turn out that the setting of the step size is very important and tricky. A small step size will make your model slow to train. A large step size will train faster, but if it is too large, it will make your classifier chaotically jump around and not converge to a good final result. We will eventually use withheld validation data to properly tune it to be just in the sweet spot for your particular data.

One thing I'd like you to appreciate is that the circuit can be an arbitrary expression, not just the linear prediction function we used in this example. For example, it can be an entire neural network.

By the way, I intentionally structured the code in a modular way, but we could have trained an SVM with much simpler code. Here is really what all of these classes and computations boil down to:

var a = 1, b = -2, c = -1; // initial parameters
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];

  // compute pull
  var score = a*x + b*y + c;
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1;
  if(label === -1 && score > -1) pull = -1;

  // compute gradient and update parameters
  var step_size = 0.01;
  a += step_size * (x * pull - a); // -a is from the regularization
  b += step_size * (y * pull - b); // -b is from the regularization
  c += step_size * (1 * pull);
}

This code gives an identical result. Perhaps by now you can glance at the code and see how these equations came about.
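To unpack one of them (my own spelling-out, following the force specification above): the score is a*x + b*y + c, so its local derivative with respect to a is x; chaining that with the pull from the top and adding the regularization tug of -a gives

$$ a \leftarrow a + \text{step\_size} \cdot (x \cdot \text{pull} - a) $$

which is exactly the first update line. The updates for b and c follow the same pattern (with no regularization on c).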

Variable pull? A quick note to make at this point: You may have noticed that the pull is always
-1, 0, or 1. You could imagine doing other things, for example making this pull proportional to
how bad the mistake was. This leads to a variation on the SVM that some people refer to as the
squared hinge loss SVM, for reasons that will later become clear. Depending on various
features of your dataset, that may work better or worse. For example, if you have very bad
outliers in your data, e.g. a negative data point that gets a score +100, its influence will be
relatively minor on our classifier because we will only pull with a force of -1 regardless of how
bad the mistake was. In practice we refer to this property of a classifier as robustness to
outliers.
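
To sketch the idea (this is only an illustration of the variant, not code we will rely on later), the pull computation inside the training loop would change to something like:

// squared hinge loss variant (sketch): pull proportionally to how badly the
// example violates its margin, instead of always pulling with force 1
var score = a*x + b*y + c;
var margin_violation = Math.max(0, -label * score + 1); // 0 if the example is fine
var pull = label * margin_violation; // e.g. a negative outlier with score +100 now pulls with force -101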

Let's recap. We introduced the binary classification problem, where we are given N D-
dimensional vectors and a label +1/-1 for each. We saw that we can combine these features
with a set of parameters inside a real-valued circuit (such as a Support Vector Machine circuit
in our example). Then, we can repeatedly pass our data through the circuit and each time tweak
the parameters so that the circuit's output value is consistent with the provided labels. The
tweaking relied, crucially, on our ability to backpropagate gradients through the circuit. In the
end, the final circuit can be used to predict values for unseen instances!

Generalizing the SVM into a Neural Network
Of interest is the fact that an SVM is just a particular type of a very simple circuit (a circuit that
computes score = a*x + b*y + c where a,b,c are weights and x,y are data points). This
can be easily extended to more complicated functions. For example, let's write a 2-layer Neural
Network that performs the binary classification. The forward pass will look like this:

// assume inputs x,y
var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron
var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron
var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron
var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score

The specification above is a 2-layer Neural Network with 3 hidden neurons (n1, n2, n3) that
uses a Rectified Linear Unit (ReLU) non-linearity on each hidden neuron. As you can see, there
are now several parameters involved, which means that our classifier is more complex and can
represent more intricate decision boundaries than just a simple linear decision rule such as an
SVM. Another way to think about it is that every one of the three hidden neurons is a linear
classifier and now we're putting an extra linear classifier on top of that. Now we're starting to go
deeper :). Okay, let's train this 2-layer Neural Network. The code looks very similar to the SVM
example code above; we just have to change the forward pass and the backward pass:

// random initial parameters
var a1 = Math.random() - 0.5; // a random number between -0.5 and 0.5
// ... similarly initialize all other parameters to randoms
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];

  // compute forward pass
  var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron
  var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron
  var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron
  var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score

  // compute the pull on top
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1; // we want higher output! Pull up.
  if(label === -1 && score > -1) pull = -1; // we want lower output! Pull down.

  // now compute backward pass to all parameters of the model

  // backprop through the last "score" neuron
  var dscore = pull;
  var da4 = n1 * dscore;
  var dn1 = a4 * dscore;
  var db4 = n2 * dscore;
  var dn2 = b4 * dscore;
  var dc4 = n3 * dscore;
  var dn3 = c4 * dscore;
  var dd4 = 1.0 * dscore; // phew

  // backprop the ReLU non-linearities, in place
  // i.e. just set gradients to zero if the neurons did not "fire"
  var dn3 = n3 === 0 ? 0 : dn3;
  var dn2 = n2 === 0 ? 0 : dn2;
  var dn1 = n1 === 0 ? 0 : dn1;

  // backprop to parameters of neuron 1
  var da1 = x * dn1;
  var db1 = y * dn1;
  var dc1 = 1.0 * dn1;

  // backprop to parameters of neuron 2
  var da2 = x * dn2;
  var db2 = y * dn2;
  var dc2 = 1.0 * dn2;

  // backprop to parameters of neuron 3
  var da3 = x * dn3;
  var db3 = y * dn3;
  var dc3 = 1.0 * dn3;

  // phew! End of backprop!
  // note we could have also backpropped into x,y
  // but we do not need these gradients. We only use the gradients
  // on our parameters in the parameter update, and we discard x,y

  // add the pulls from the regularization, tugging all multiplicative
  // parameters (i.e. not the biases) downward, proportional to their value
  da1 += -a1; da2 += -a2; da3 += -a3;
  db1 += -b1; db2 += -b2; db3 += -b3;
  da4 += -a4; db4 += -b4; dc4 += -c4;

  // finally, do the parameter update
  var step_size = 0.01;
  a1 += step_size * da1;
  b1 += step_size * db1;
  c1 += step_size * dc1;
  a2 += step_size * da2;
  b2 += step_size * db2;
  c2 += step_size * dc2;
  a3 += step_size * da3;
  b3 += step_size * db3;
  c3 += step_size * dc3;
  a4 += step_size * da4;
  b4 += step_size * db4;
  c4 += step_size * dc4;
  d4 += step_size * dd4;
  // wow this is tedious, please use for loops in prod.
  // we're done!
}

And that's how you train a neural network. Obviously, you want to modularize your code nicely,
but I expanded this example for you in the hope that it makes things much more concrete and
simpler to understand. Later, we will look at best practices when implementing these networks
and we will structure the code much more neatly in a modular and more sensible way.
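
As a teaser of what such a cleanup might look like, here is one possible way to fold the parameters into arrays so the backward pass and the update become loops. Treat this as a sketch only: the array layout and names are just for illustration (they are not the library we will build later), and it assumes the same data and labels arrays as before:

function rand() { return Math.random() - 0.5; }

// one [weight_x, weight_y, bias] triple per hidden neuron, plus output weights and bias
var hidden = [ [rand(), rand(), rand()],
               [rand(), rand(), rand()],
               [rand(), rand(), rand()] ];
var out_w = [rand(), rand(), rand()];
var out_b = rand();
var step_size = 0.01;

for(var iter = 0; iter < 400; iter++) {
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0], y = data[i][1], label = labels[i];

  // forward pass: three ReLU hidden neurons, then a linear score on top
  var n = [];
  for(var k = 0; k < 3; k++) {
    n[k] = Math.max(0, hidden[k][0]*x + hidden[k][1]*y + hidden[k][2]);
  }
  var score = out_w[0]*n[0] + out_w[1]*n[1] + out_w[2]*n[2] + out_b;

  // the pull on the score, exactly as before
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1;
  if(label === -1 && score > -1) pull = -1;

  // backward pass and parameter update, one hidden neuron at a time
  out_b += step_size * pull; // gradient on the output bias is just the pull
  for(var k = 0; k < 3; k++) {
    var dn = (n[k] === 0) ? 0 : out_w[k] * pull; // backprop through the ReLU
    var dout_wk = n[k] * pull - out_w[k];        // -out_w[k] is the regularization
    hidden[k][0] += step_size * (x * dn - hidden[k][0]); // -hidden[k][0]: regularization
    hidden[k][1] += step_size * (y * dn - hidden[k][1]);
    hidden[k][2] += step_size * dn; // biases are not regularized
    out_w[k] += step_size * dout_wk;
  }
}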

But for now, I hope your takeaway is that a 2-layer Neural Net is really not such a scary thing:
we write a forward pass expression, interpret the value at the end as a score, and then we pull
on that value in a positive or negative direction depending on what we want that value to be for
our current particular example. The parameter update after backprop will ensure that when we
see this particular example in the future, the network will be more likely to give us a value we
desire, not the one it gave just before the update.

A more Conventional Approach: Loss Functions
Now that we understand the basics of how these circuits function with data, let's adopt a more
conventional approach that you might see elsewhere on the internet and in other tutorials and
books. You won't see people talking too much about force specifications. Instead, Machine
Learning algorithms are specified in terms of loss functions (or cost functions, or
objectives).

As I develop this formalism I would also like to start to be a little more careful with how we name
our variables and parameters. I'd like these equations to look similar to what you might see in a
book or some other tutorial, so let me use more standard naming conventions.

Example: 2D Support Vector Machine
Let's start with an example of a 2-dimensional SVM. We are given a dataset of $N$ examples
$(x_{i0}, x_{i1})$ and their corresponding labels $y_i$, which are allowed to be either +1/-1 for
a positive or negative example respectively. Most importantly, as you recall we have three
parameters $(w_0, w_1, w_2)$. The SVM loss function is then defined as follows:

$$L = \left[ \sum_{i=1}^{N} \max\left(0, -y_i(w_0 x_{i0} + w_1 x_{i1} + w_2) + 1\right) \right] + \alpha \left[ w_0^2 + w_1^2 \right]$$

Notice that this expression is always positive, due to the thresholding at zero in the first
expression and the squaring in the regularization. The idea is that we will want this expression
to be as small as possible. Before we dive into some of its subtleties let me first translate it to
code:

var X = [ [1.2, 0.7], [-0.3, 0.5], [3, 2.5] ]; // array of 2-dimensional data
var y = [1, -1, 1]; // array of labels
var w = [0.1, 0.2, 0.3]; // example: random numbers
var alpha = 0.1; // regularization strength

function cost(X, y, w) {

  var total_cost = 0.0; // L, in SVM loss function above
  N = X.length;
  for(var i = 0; i < N; i++) {
    // loop over all data points and compute their score
    var xi = X[i];
    var score = w[0] * xi[0] + w[1] * xi[1] + w[2];

    // accumulate cost based on how compatible the score is with the label
    var yi = y[i]; // label
    var costi = Math.max(0, -yi * score + 1);
    console.log('example ' + i + ': xi = (' + xi + ') and label = ' + yi);
    console.log('score computed to be ' + score.toFixed(3));
    console.log('=> cost computed to be ' + costi.toFixed(3));
    total_cost += costi;
  }

  // regularization cost: we want small weights
  reg_cost = alpha * (w[0]*w[0] + w[1]*w[1]);
  console.log('regularization cost for current model is ' + reg_cost.toFixed(3));
  total_cost += reg_cost;

  console.log('total cost is ' + total_cost.toFixed(3));
  return total_cost;
}

And here is the output:

cost for example 0 is 0.440
cost for example 1 is 1.370
cost for example 2 is 0.000
regularization cost for current model is 0.005
total cost is 1.815

Notice how this expression works: It measures how bad our SVM classifier is. Let's step through
this explicitly:

The first data point xi = [1.2, 0.7] with label yi = 1 will give score 0.1*1.2 +
0.2*0.7 + 0.3, which is 0.56. Notice, this is a positive example so we want the
score to be greater than +1. 0.56 is not enough. And indeed, the expression for cost
for this data point will compute: costi = Math.max(0, -1*0.56 + 1), which is 0.44.
You can think of the cost as quantifying the SVM's unhappiness.

The second data point xi = [-0.3, 0.5] with label yi = -1 will give score 0.1*
(-0.3) + 0.2*0.5 + 0.3, which is 0.37. This isn't looking very good: This score is
very high for a negative example. It should be less than -1. Indeed, when we compute the
cost: costi = Math.max(0, 1*0.37 + 1), we get 1.37. That's a very high cost from
this example, as it is being misclassified.

The last example xi = [3, 2.5] with label yi = 1 gives score 0.1*3 + 0.2*2.5 +
0.3, and that is 1.1. In this case, the SVM will compute costi = Math.max(0, -1*1.1
+ 1), which is in fact zero. This data point is being classified correctly and there is no cost
associated with it.

A cost function is an expression that measures how bad your classifier is. When the training
set is perfectly classified, the cost (ignoring the regularization) will be zero.

Notice that the last term in the loss is the regularization cost, which says that our model
parameters should be small values. Due to this term the cost will never actually become zero
(because this would mean all parameters of the model except the bias are exactly zero), but the
closer we get, the better our classifier will become.

The majority of cost functions in Machine Learning consist of two parts: 1. A part that
measures how well a model fits the data, and 2. Regularization, which measures some notion
of how complex or likely a model is.

I hope I convinced you then, that to get a very good SVM we really want to make the cost as
small as possible. Sounds familiar? We know exactly what to do: The cost function written
above is our circuit. We will forward all examples through the circuit, compute the backward
pass and update all parameters such that the circuit will output a smaller cost in the future.
Specifically, we will compute the gradient and then update the parameters in the opposite
direction of the gradient (since we want to make the cost small, not large).

We know exactly what to do: The cost function written above is our circuit.
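
As a rough sketch of that loop (not the efficient way we will eventually do it), we could even reuse the numerical gradient strategy from earlier in the guide directly on the cost function above. You would want to silence the console.log calls inside cost for this:

// minimal sketch: minimize cost(X, y, w) with a numerical gradient.
// assumes the X, y, w, alpha and cost() defined above are in scope.
var h = 0.0001;
var step_size = 0.01;
for(var iter = 0; iter < 100; iter++) {
  var cost_current = cost(X, y, w);

  // numerical gradient: nudge each weight by h and see how the cost responds
  var gradient = [];
  for(var j = 0; j < w.length; j++) {
    w[j] += h;
    gradient[j] = (cost(X, y, w) - cost_current) / h;
    w[j] -= h; // undo the nudge
  }

  // step in the opposite direction of the gradient, since we want a smaller cost
  for(var j = 0; j < w.length; j++) {
    w[j] += -step_size * gradient[j];
  }
}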

todo: clean up this section and flesh it out a bit

Chapter 3: Backprop in Practice

Building up a library

Example: Practical Neural Network Classifier
Multiclass: Structured SVM
Multiclass: Logistic Regression, Softmax

Example: Regression
Tiny changes needed to cost function. L2 regularization.

Example: Structured Prediction
Basic idea is to train an (unnormalized) energy model

Vectorized Implementations
Writing a Neural Net classifier in Python with numpy.

Backprop in practice: Tips/Tricks
Monitoring of Cost function
Monitoring training/validation performance
Tweaking initial learning rates, learning rate schedules
Optimization: Using Momentum
Optimization: LBFGS, Nesterov accelerated gradient
Importance of Initialization: weights and biases
Regularization: L2, L1, Group sparsity, Dropout
Hyperparameter search, cross-validations
Common pitfalls: (e.g. dying ReLUs)
Handling unbalanced datasets
Approaches to debugging nets when something doesn't work

Chapter 4: Networks in the Wild
Case studies of models that work well in practice and have been deployed in the wild.

Case Study: Convolutional Neural Networks for images
Convolutional layers, pooling, AlexNet, etc.

Case Study: Recurrent Neural Networks for Speech and Text
Vanilla Recurrent nets, bidirectional recurrent nets. Maybe an overview of LSTM.

Case Study: Word2Vec
Training word vector representations in NLP

Case Study: t-SNE
Training embeddings for visualizing data

Acknowledgements
Thanks a lot to the following people who made this guide better: wodenokoto (HN), zackmorris
(HN).

Comments
This guide is a work in progress and I appreciate feedback, especially regarding parts that were
unclear or only made half sense. Thank you!

Some of the Javascript code in this tutorial has been translated to Python by Ajit, find it over on
Github.
