
Andrej Karpathy blog

Hacker's guide to Neural Networks

Hi there, I'm a CS PhD student at Stanford. I've worked on Deep Learning for a few years as part of my research and among several of my related pet projects is ConvNetJS - a Javascript library for training Neural Networks. Javascript allows one to nicely visualize what's going on and to play around with the various hyperparameter settings, but I still regularly hear from people who ask for a more thorough treatment of the topic. This article (which I plan to slowly expand out to lengths of a few book chapters) is my humble attempt. It's on web instead of PDF because all books should be, and eventually it will hopefully include animations/demos etc.

My personal experience with Neural Networks is that everything became much clearer when I started ignoring full-page, dense derivations of backpropagation equations and just started writing code. Thus, this tutorial will contain very little math (I don't believe it is necessary and it can sometimes even obfuscate simple concepts). Since my background is in Computer Science and Physics, I will instead develop the topic from what I refer to as a hacker's perspective. My exposition will center around code and physical intuitions instead of mathematical derivations. Basically, I will strive to present the algorithms in a way that I wish I had come across when I was starting out.

everything became much clearer when I started writing code.

You might be eager to jump right in and learn about Neural Networks, backpropagation, how they can be applied to datasets in practice, etc. But before we get there, I'd like us to first forget about all that. Let's take a step back and understand what is really going on at the core. Let's first talk about real-valued circuits.

Update note: I suspended my work on this guide a while ago and redirected a lot of my energy to teaching the CS231n (Convolutional Neural Networks) class at Stanford. The notes are on cs231n.github.io and the course slides can be found here. These materials are highly related to material here, but more comprehensive and sometimes more polished.

Chapter 1: Real-valued Circuits


In my opinion, the best way to think of Neural Networks is as real-valued circuits, where real values (instead of boolean values {0,1}) flow along edges and interact in gates. However, instead of gates such as AND, OR, NOT, etc., we have binary gates such as * (multiply), + (add), max or unary gates such as exp, etc. Unlike ordinary boolean circuits, however, we will eventually also have gradients flowing on the same edges of the circuit, but in the opposite direction. But we're getting ahead of ourselves. Let's focus and start out simple.

Base Case: Single Gate in the Circuit
Let's first consider a single, simple circuit with one gate. Here's an example:

(Circuit diagram: the inputs x and y feed into a single * gate, which produces the output.)

The circuit takes two real-valued inputs x and y and computes x * y with the * gate.

The Javascript version of this would very simply look something like this:

var forwardMultiplyGate = function(x, y) {
  return x * y;
};
forwardMultiplyGate(-2, 3); // returns -6. Exciting.

And in math form we can think of this gate as implementing the real-valued function:

$$ f(x, y) = x y $$

As with this example, all of our gates will take one or two inputs and produce a single output value.

The Goal
The problem we are interested in studying looks as follows:

1. We provide a given circuit some specific input values (e.g. x = -2, y = 3)

2. The circuit computes an output value (e.g. -6)


3. The core question then becomes: How should one tweak the input slightly to increase the output?

In this case, in what direction should we change x, y to get a number larger than -6? Note that, for example, x = -1.99 and y = 2.99 gives x * y = -5.95, which is higher than -6.0. Don't get confused by this: -5.95 is better (higher) than -6.0. It's an improvement of 0.05, even though the magnitude of -5.95 (the distance from zero) happens to be lower.

Strategy #1: Random Local Search

Okay. So wait, we have a circuit, we have some inputs and we just want to tweak them slightly to increase the output value? Why is this hard? We can easily "forward" the circuit to compute the output for any given x and y. So isn't this trivial? Why don't we tweak x and y randomly and keep track of the tweak that works best:

// circuit with single gate for now
var forwardMultiplyGate = function(x, y) { return x * y; };
var x = -2, y = 3; // some input values

// try changing x,y randomly small amounts and keep track of what works best
var tweak_amount = 0.01;
var best_out = -Infinity;
var best_x = x, best_y = y;
for(var k = 0; k < 100; k++) {
  var x_try = x + tweak_amount * (Math.random() * 2 - 1); // tweak x a bit
  var y_try = y + tweak_amount * (Math.random() * 2 - 1); // tweak y a bit
  var out = forwardMultiplyGate(x_try, y_try);
  if(out > best_out) {
    // best improvement yet! Keep track of the x and y
    best_out = out;
    best_x = x_try, best_y = y_try;
  }
}

When I run this, I get best_x = -1.9928, best_y = 2.9901, and best_out = -5.9588. Again, -5.9588 is higher than -6.0. So, we're done, right? Not quite: This is a perfectly fine strategy for tiny problems with a few gates if you can afford the compute time, but it won't do if we want to eventually consider huge circuits with millions of inputs. It turns out that we can do much better.

Strategy #2: Numerical Gradient

Here's a better way. Remember again that in our setup we are given a circuit (e.g. our circuit with a single * gate) and some particular input (e.g. x = -2, y = 3). The gate computes the output (-6) and now we'd like to tweak x and y to make the output higher.

A nice intuition for what we're about to do is as follows: Imagine taking the output value that comes out from the circuit and tugging on it in the positive direction. This positive tension will in turn translate through the gate and induce forces on the inputs x and y. Forces that tell us how x and y should change to increase the output value.

What might those forces look like in our specific example? Thinking through it, we can intuit that the force on x should also be positive, because making x slightly larger improves the circuit's output. For example, increasing x from x = -2 to x = -1 would give us output -3, much larger than -6. On the other hand, we'd expect a negative force induced on y that pushes it to become lower (since a lower y, such as y = 2, down from the original y = 3, would make the output higher: 2 x -2 = -4, again, larger than -6). That's the intuition to keep in mind, anyway. As we go through this, it will turn out that the forces I'm describing will in fact turn out to be the derivative of the output value with respect to its inputs (x and y). You may have heard this term before.

The derivative can be thought of as a force on each input as we pull on the output to become higher.

So how do we exactly evaluate this force (derivative)? It turns out that there is a very simple procedure for this. We will work backwards: Instead of pulling on the circuit's output, we'll iterate over every input one by one, increase it very slightly and look at what happens to the output value. The amount the output changes in response is the derivative. Enough intuitions for now. Let's look at the mathematical definition. We can write down the derivative for our function with respect to the inputs. For example, the derivative with respect to x can be computed as:

$$ \frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h} $$

Where h is small - it's the tweak amount. Also, if you're not very familiar with calculus it is important to note that in the left-hand side of the equation above, the horizontal line does not indicate division. The entire symbol ∂f(x,y)/∂x is a single thing: the derivative of the function f(x,y) with respect to x. The horizontal line on the right is division. I know it's confusing but it's standard notation. Anyway, I hope it doesn't look too scary because it isn't: The circuit was giving some initial output f(x,y), and then we changed one of the inputs by a tiny amount h and read the new output f(x+h,y). Subtracting those two quantities tells us the change, and the division by h just normalizes this change by the (arbitrary) tweak amount we used. In other words it's expressing exactly what I described above and translates directly to this code:

var x = -2, y = 3;
var out = forwardMultiplyGate(x, y); // -6
var h = 0.0001;

// compute derivative with respect to x
var xph = x + h; // -1.9999
var out2 = forwardMultiplyGate(xph, y); // -5.9997
var x_derivative = (out2 - out) / h; // 3.0

// compute derivative with respect to y
var yph = y + h; // 3.0001
var out3 = forwardMultiplyGate(x, yph); // -6.0002
var y_derivative = (out3 - out) / h; // -2.0

Let's walk through x for example. We turned the knob from x to x + h and the circuit responded by giving a higher value (note again that yes, -5.9997 is higher than -6: -5.9997 > -6). The division by h is there to normalize the circuit's response by the (arbitrary) value of h we chose to use here. Technically, you want the value of h to be infinitesimal (the precise mathematical definition of the gradient is defined as the limit of the expression as h goes to zero), but in practice h = 0.00001 or so works fine in most cases to get a good approximation. Now, we see that the derivative w.r.t. x is +3. I'm making the positive sign explicit, because it indicates that the circuit is tugging on x to become higher. The actual value, 3, can be interpreted as the force of that tug.

The derivative with respect to some input can be computed by tweaking that input by a small amount and observing the change on the output value.

By the way, we usually talk about the derivative with respect to a single input, or about a gradient with respect to all the inputs. The gradient is just made up of the derivatives of all the inputs concatenated in a vector (i.e. a list). Crucially, notice that if we let the inputs respond to the tug by following the gradient a tiny amount (i.e. we just add the derivative on top of every input), we can see that the value increases, as expected:

var step_size = 0.01;
var out = forwardMultiplyGate(x, y); // before: -6
x = x + step_size * x_derivative; // x becomes -1.97
y = y + step_size * y_derivative; // y becomes 2.98
var out_new = forwardMultiplyGate(x, y); // -5.87! exciting.

As expected, we changed the inputs by the gradient and the circuit now gives a slightly higher value (-5.87 > -6.0). That was much simpler than trying random changes to x and y, right? A fact to appreciate here is that if you take calculus you can prove that the gradient is, in fact, the direction of the steepest increase of the function. There is no need to monkey around trying out random perturbations as done in Strategy #1. Evaluating the gradient requires just three evaluations of the forward pass of our circuit instead of hundreds, and gives the best tug you can hope for (locally) if you are interested in increasing the value of the output.

Bigger step is not always better. Let me clarify on this point a bit. It is important to note that in this very simple example, using a bigger step_size than 0.01 will always work better. For example, step_size = 1.0 gives output 1 (higher, better!), and indeed infinite step size would give infinitely good results. The crucial thing to realize is that once our circuits get much more complex (e.g. entire neural networks), the function from inputs to the output value will be more chaotic and wiggly. The gradient guarantees that if you have a very small (indeed, infinitesimally small) step size, then you will definitely get a higher number when you follow its direction, and for that infinitesimally small step size there is no other direction that would have worked better. But if you use a bigger step size (e.g. step_size = 0.01) all bets are off. The reason we can get away with a larger step size than infinitesimally small is that our functions are usually relatively smooth. But really, we're crossing our fingers and hoping for the best.
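As a quick sanity check of that step_size = 1.0 claim, here is a sketch reusing forwardMultiplyGate and the gradient we computed above (my own check, not in the original text):

var x = -2, y = 3;
var x_derivative = 3, y_derivative = -2; // the gradients we found numerically above
var big_step = 1.0;
x = x + big_step * x_derivative; // x becomes 1
y = y + big_step * y_derivative; // y becomes 1
forwardMultiplyGate(x, y); // returns 1, far higher than -6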

Hill climbing analogy. One analogy I've heard before is that the output value of our circuit is like the height of a hill, and we are blindfolded and trying to climb upwards. We can sense the steepness of the hill at our feet (the gradient), so when we shuffle our feet a bit we will go upwards. But if we took a big, overconfident step, we could have stepped right into a hole.

Great, I hope I've convinced you that the numerical gradient is indeed a very useful thing to evaluate, and that it is cheap. But. It turns out that we can do even better.

Strategy #3: Analytic Gradient
In the previous section we evaluated the gradient by probing the circuit's output value, independently for every input. This procedure gives you what we call a numerical gradient. This approach, however, is still expensive because we need to compute the circuit's output as we tweak every input value independently a small amount. So the complexity of evaluating the gradient is linear in the number of inputs. But in practice we will have hundreds, thousands or (for neural networks) even tens to hundreds of millions of inputs, and the circuits aren't just one multiply gate but huge expressions that can be expensive to compute. We need something better.

Luckily, there is an easier and much faster way to compute the gradient: we can use calculus to derive a direct expression for it that will be as simple to evaluate as the circuit's output value. We call this an analytic gradient and there will be no need for tweaking anything. You may have seen other people who teach Neural Networks derive the gradient in huge and, frankly, scary and confusing mathematical equations (if you're not well-versed in maths). But it's unnecessary. I've written plenty of Neural Nets code and I rarely have to do a mathematical derivation longer than two lines, and 95% of the time it can be done without writing anything at all. That is because we will only ever derive the gradient for very small and simple expressions (think of it as the base case) and then I will show you how we can compose these very simply with chain rule to evaluate the full gradient (think inductive/recursive case).

The analytic derivative requires no tweaking of the inputs. It can be derived using mathematics (calculus).

If you remember your product rules, power rules, quotient rules, etc. (see e.g. derivative rules or wiki page), it's very easy to write down the derivative with respect to both x and y for a small expression such as x * y. But suppose you don't remember your calculus rules. We can go back to the definition. For example, here's the expression for the derivative w.r.t. x:

$$ \frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h} $$

(Technically I'm not writing the limit as h goes to zero, forgive me math people). Okay and let's plug our function ( f(x,y) = x y ) into the expression. Ready for the hardest piece of math of this entire article? Here we go:

$$ \frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h} = \frac{(x+h)y - xy}{h} = \frac{xy + hy - xy}{h} = \frac{hy}{h} = y $$

That's interesting. The derivative with respect to x is just equal to y. Did you notice the coincidence in the previous section? We tweaked x to x + h and calculated x_derivative = 3.0, which exactly happens to be the value of y in that example. It turns out that wasn't a coincidence at all because that's just what the analytic gradient tells us the x derivative should be for f(x,y) = x * y. The derivative with respect to y, by the way, turns out to be x, unsurprisingly by symmetry. So there is no need for any tweaking! We invoked powerful mathematics and can now transform our derivative calculation into the following code:


var x = -2, y = 3;
var out = forwardMultiplyGate(x, y); // before: -6
var x_gradient = y; // by our complex mathematical derivation above
var y_gradient = x;

var step_size = 0.01;
x += step_size * x_gradient; // -1.97
y += step_size * y_gradient; // 2.98
var out_new = forwardMultiplyGate(x, y); // -5.87. Higher output! Nice.
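For completeness, the symmetric derivation for the y gradient follows exactly the same steps as the one for x (my own check, not spelled out in the original):

$$ \frac{\partial f(x,y)}{\partial y} = \frac{f(x,y+h) - f(x,y)}{h} = \frac{x(y+h) - xy}{h} = \frac{xh}{h} = x $$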

To compute the gradient we went from forwarding the circuit hundreds of times (Strategy #1) to forwarding it only on the order of twice the number of inputs (Strategy #2), to forwarding it a single time! And it gets EVEN better, since the more expensive strategies (#1 and #2) only give an approximation of the gradient, while #3 (the fastest one by far) gives you the exact gradient. No approximations. The only downside is that you should be comfortable with some calculus 101.

Let's recap what we have learned:

INPUT: We are given a circuit, some inputs and compute an output value.
OUTPUT: We are then interested in finding small changes to each input (independently) that would make the output higher.
Strategy #1: One silly way is to randomly search for small perturbations of the inputs and keep track of what gives the highest increase in output.
Strategy #2: We saw we can do much better by computing the gradient. Regardless of how complicated the circuit is, the numerical gradient is very simple (but relatively expensive) to compute. We compute it by probing the circuit's output value as we tweak the inputs one at a time.
Strategy #3: In the end, we saw that we can be even more clever and analytically derive a direct expression to get the analytic gradient. It is identical to the numerical gradient, it is fastest by far, and there is no need for any tweaking.

In practice by the way (and we will get to this once again later), all Neural Network libraries always compute the analytic gradient, but the correctness of the implementation is verified by comparing it to the numerical gradient. That's because the numerical gradient is very easy to evaluate (but can be a bit expensive to compute), while the analytic gradient can contain bugs at times, but is usually extremely efficient to compute. As we will see, evaluating the gradient (i.e. while doing backprop, or backward pass) will turn out to cost about as much as evaluating the forward pass.
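As a rough sketch of what such a gradient check might look like in the style of this article (the helper name numericalGradient and its signature are my own, not from any particular library):

// numerically estimate the gradient of f at the given inputs, one tweak at a time
var numericalGradient = function(f, inputs) {
  var h = 0.0001;
  var base = f(inputs);
  var grad = [];
  for(var i = 0; i < inputs.length; i++) {
    var nudged = inputs.slice(); // copy the inputs
    nudged[i] += h; // tweak only the i-th input
    grad.push((f(nudged) - base) / h);
  }
  return grad;
}
// e.g. numericalGradient(function(v) { return v[0] * v[1]; }, [-2, 3]) gives approximately [3, -2],
// which we would compare against the analytic gradient to catch bugs.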

Recursive Case: Circuits with Multiple Gates

But hold on, you say: "The analytic gradient was trivial to derive for your super-simple expression. This is useless. What do I do when the expressions are much larger? Don't the equations get huge and complex very fast?" Good question. Yes the expressions get much more complex. No, this doesn't make it much harder. As we will see, every gate will be hanging out by itself, completely unaware of any details of the huge and complex circuit that it could be part of. It will only worry about its inputs and it will compute its local derivatives as seen in the previous section, except now there will be a single extra multiplication it will have to do.

A single extra multiplication will turn a single (useless) gate into a cog in the complex machine that is an entire neural network.

I should stop hyping it up now. I hope I've piqued your interest! Let's drill down into details and get two gates involved with this next example:

(Circuit diagram: x and y feed into a + gate that outputs q; q and z then feed into a * gate that outputs f.)

The expression we are computing now is f(x, y, z) = (x + y) z. Let's structure the code as follows to make the gates explicit as functions:

var forwardMultiplyGate = function(a, b) {
  return a * b;
};
var forwardAddGate = function(a, b) {
  return a + b;
};
var forwardCircuit = function(x, y, z) {
  var q = forwardAddGate(x, y);
  var f = forwardMultiplyGate(q, z);
  return f;
};

var x = -2, y = 5, z = -4;
var f = forwardCircuit(x, y, z); // output is -12

In the above, I am using a and b as the local variables in the gate functions so that we don't get these confused with our circuit inputs x, y, z. As before, we are interested in finding the derivatives with respect to the three inputs x, y, z. But how do we compute it now that there are multiple gates involved? First, let's pretend that the + gate is not there and that we only have two variables in the circuit: q, z and a single * gate. Note that q is the output of the + gate. If we don't worry about x and y but only about q and z, then we are back to having only a single gate, and as far as that single * gate is concerned, we know what the (analytic) derivatives are from the previous section. We can write them down (except here we're replacing x, y with q, z):

$$ f(q, z) = q z \quad\Rightarrow\quad \frac{\partial f(q,z)}{\partial q} = z, \quad \frac{\partial f(q,z)}{\partial z} = q $$

Simple enough: these are the expressions for the gradient with respect to q and z. But wait, we don't want the gradient with respect to q, but with respect to the inputs: x and y. Luckily, q is computed as a function of x and y (by addition in our example). We can write down the gradient for the addition gate as well, it's even simpler:

$$ q(x, y) = x + y \quad\Rightarrow\quad \frac{\partial q(x,y)}{\partial x} = 1, \quad \frac{\partial q(x,y)}{\partial y} = 1 $$

That's right, the derivatives are just 1, regardless of the actual values of x and y. If you think about it, this makes sense because to make the output of a single addition gate higher, we expect a positive tug on both x and y, regardless of their values.
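If you want to convince yourself of that, the same limit-definition steps as before give (my own check, not in the original text):

$$ \frac{\partial q(x,y)}{\partial x} = \frac{q(x+h,y) - q(x,y)}{h} = \frac{(x+h+y) - (x+y)}{h} = \frac{h}{h} = 1 $$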

Backpropagation

We are finally ready to invoke the Chain Rule: We know how to compute the gradient of q with respect to x and y (that's a single gate case with + as the gate). And we know how to compute the gradient of our final output with respect to q. The chain rule tells us how to combine these to get the gradient of the final output with respect to x and y, which is what we're ultimately interested in. Best of all, the chain rule very simply states that the right thing to do is to simply multiply the gradients together to chain them. For example, the final derivative for x will be:

$$ \frac{\partial f(q,z)}{\partial x} = \frac{\partial q(x,y)}{\partial x} \frac{\partial f(q,z)}{\partial q} $$

There are many symbols there so maybe this is confusing again, but it's really just two numbers being multiplied together. Here is the code:

// initial conditions
var x = -2, y = 5, z = -4;
var q = forwardAddGate(x, y); // q is 3
var f = forwardMultiplyGate(q, z); // output is -12

// gradient of the MULTIPLY gate with respect to its inputs
// wrt is short for "with respect to"
var derivative_f_wrt_z = q; // 3
var derivative_f_wrt_q = z; // -4

// derivative of the ADD gate with respect to its inputs
var derivative_q_wrt_x = 1.0;
var derivative_q_wrt_y = 1.0;

// chain rule
var derivative_f_wrt_x = derivative_q_wrt_x * derivative_f_wrt_q; // -4
var derivative_f_wrt_y = derivative_q_wrt_y * derivative_f_wrt_q; // -4

That's it. We computed the gradient (the forces) and now we can let our inputs respond to it by a bit. Let's add the gradients on top of the inputs. The output value of the circuit better increase, up from -12!

// final gradient, from above: [-4, -4, 3]
var gradient_f_wrt_xyz = [derivative_f_wrt_x, derivative_f_wrt_y, derivative_f_wrt_z];

// let the inputs respond to the force/tug:
var step_size = 0.01;
x = x + step_size * derivative_f_wrt_x; // -2.04
y = y + step_size * derivative_f_wrt_y; // 4.96
z = z + step_size * derivative_f_wrt_z; // -3.97

// Our circuit now better give higher output:
var q = forwardAddGate(x, y); // q becomes 2.92
var f = forwardMultiplyGate(q, z); // output is -11.59, up from -12! Nice!

Looks like that worked! Let's now try to interpret intuitively what just happened. The circuit wants to output higher values. The last gate saw inputs q = 3, z = -4 and computed output -12. Pulling upwards on this output value induced a force on both q and z: To increase the output value, the circuit wants z to increase, as can be seen by the positive value of the derivative (derivative_f_wrt_z = +3). Again, the size of this derivative can be interpreted as the magnitude of the force. On the other hand, q felt a stronger and downward force, since derivative_f_wrt_q = -4. In other words the circuit wants q to decrease, with a force of 4.

Now we get to the second, + gate which outputs q. By default, the + gate computes its derivatives which tell us how to change x and y to make q higher. BUT! Here is the crucial point: the gradient on q was computed as negative (derivative_f_wrt_q = -4), so the circuit wants q to decrease, and with a force of 4! So if the + gate wants to contribute to making the final output value larger, it needs to listen to the gradient signal coming from the top. In this particular case, it needs to apply tugs on x, y opposite of what it would normally apply, and with a force of 4, so to speak. The multiplication by -4 seen in the chain rule achieves exactly this: instead of applying a positive force of +1 on both x and y (the local derivative), the full circuit's gradient on both x and y becomes 1 x -4 = -4. This makes sense: the circuit wants both x and y to get smaller because this will make q smaller, which in turn will make f larger.

If this makes sense, you understand backpropagation.

Let's recap once again what we learned:

In the previous chapter we saw that in the case of a single gate (or a single expression), we can derive the analytic gradient using simple calculus. We interpreted the gradient as a force, or a tug on the inputs that pulls them in a direction which would make this gate's output higher.
In case of multiple gates everything stays pretty much the same way: every gate is hanging out by itself completely unaware of the circuit it is embedded in. Some inputs come in and the gate computes its output and the derivative with respect to the inputs. The only difference now is that suddenly, something can pull on this gate from above. That's the gradient of the final circuit output value with respect to the output this gate computed. It is the circuit asking the gate to output higher or lower numbers, and with some force. The gate simply takes this force and multiplies it to all the forces it computed for its inputs before (chain rule). This has the desired effect:

1. If a gate experiences a strong positive pull from above, it will also pull harder on its own inputs, scaled by the force it is experiencing from above
2. And if it experiences a negative tug, this means that the circuit wants its value to decrease not increase, so it will flip the force of the pull on its inputs to make its own output value smaller.

A nice picture to have in mind is that as we pull on the circuit's output value at the end, this induces pulls downward through the entire circuit, all the way down to the inputs.

Isn't it beautiful? The only difference between the case of a single gate and multiple interacting gates that compute arbitrarily complex expressions is this additional multiply operation that now happens in each gate.

Patterns in the backward flow

Let's look again at our example circuit with the numbers filled in. The first circuit shows the raw values, and the second circuit shows the gradients that flow back to the inputs as discussed. Notice that the gradient always starts off with +1 at the end to start off the chain. This is the (default) pull on the circuit to have its value increased.

(Circuit diagrams: in the forward pass, x = -2 and y = 5 feed the + gate giving q = 3, which together with z = -4 feeds the * gate giving the output -12. In the backward pass, the gradient is +1 on the output, -4 on q and on both x and y, and +3 on z.)

After a while you start to notice patterns in how the gradients flow backward in the circuits. For example, the + gate always takes the gradient on top and simply passes it on to all of its inputs (notice the example with -4 simply passed on to both of the inputs of the + gate). This is because its own derivative for the inputs is just +1, regardless of what the actual values of the inputs are, so in the chain rule, the gradient from above is just multiplied by 1 and stays the same. Similar intuitions apply to, for example, a max(x,y) gate. Since the gradient of max(x,y) with respect to its input is +1 for whichever one of x, y is larger and 0 for the other, this gate is during backprop effectively just a gradient "switch": it will take the gradient from above and route it to the input that had a higher value during the forward pass.
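A tiny numerical sketch of that switching behavior (my own illustration, using made-up values rather than the circuit above):

var x = 2, y = 5;
var out = Math.max(x, y); // forward pass: 5, since y was the larger input
var dout = 1.0; // some gradient arriving from above
var dx = (x > y) ? dout : 0.0; // 0.0: x did not flow through the gate
var dy = (y > x) ? dout : 0.0; // 1.0: the entire gradient is routed to y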


Numerical Gradient Check. Before we finish with this section, let's just make sure that the (analytic) gradient we computed by backprop above is correct as a sanity check. Remember that we can do this simply by computing the numerical gradient and making sure that we get [-4, -4, 3] for x, y, z. Here's the code:

// initial conditions
var x = -2, y = 5, z = -4;

// numerical gradient check
var h = 0.0001;
var x_derivative = (forwardCircuit(x+h, y, z) - forwardCircuit(x, y, z)) / h; // -4
var y_derivative = (forwardCircuit(x, y+h, z) - forwardCircuit(x, y, z)) / h; // -4
var z_derivative = (forwardCircuit(x, y, z+h) - forwardCircuit(x, y, z)) / h; // 3

and we get [-4, -4, 3], as computed with backprop. Phew! :)

Example: Single Neuron
In the previous section you hopefully got the basic intuition behind backpropagation. Let's now look at an even more complicated and borderline practical example. We will consider a 2-dimensional neuron that computes the following function:

$$ f(x, y, a, b, c) = \sigma(a x + b y + c) $$

In this expression, σ is the sigmoid function. It's best thought of as a "squashing function", because it takes the input and squashes it to be between zero and one: Very negative values are squashed towards zero and positive values get squashed towards one. For example, we have sig(-5) = 0.006, sig(0) = 0.5, sig(5) = 0.993. The sigmoid function is defined as:

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$

The gradient with respect to its single input, as you can check on Wikipedia or derive yourself if you know some calculus, is given by this expression:

$$ \frac{\partial \sigma(x)}{\partial x} = \sigma(x) (1 - \sigma(x)) $$

For example, if the input to the sigmoid gate is x = 3, the gate will compute output f = 1.0 / (1.0 + Math.exp(-x)) = 0.95, and then the (local) gradient on its input will simply be dx = (0.95) * (1 - 0.95) = 0.0475.
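A quick sketch to verify those two numbers in code (my own check):

var sig = function(x) { return 1 / (1 + Math.exp(-x)); };
var f = sig(3); // 0.9526, which rounds to the 0.95 above
var dx = f * (1 - f); // 0.0452 (plugging in the rounded 0.95 gives the 0.0475 above)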

That's all we need to use this gate: we know how to take an input and forward it through the sigmoid gate, and we also have the expression for the gradient with respect to its input, so we can also backprop through it. Another thing to note is that technically, the sigmoid function is made up of an entire series of gates in a line that compute more atomic functions: an exponentiation gate, an addition gate and a division gate. Treating it so would work perfectly fine but for this example I chose to collapse all of these gates into a single gate that just computes sigmoid in one shot, because the gradient expression turns out to be simple.
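If you are curious, here is a sketch of what that decomposition into atomic gates could look like, written in the condensed style used later in this article (my own illustration; x and the incoming gradient df are assumed given, and the chained local gradients multiply out to sig(x) * (1 - sig(x)) * df, as claimed):

// forward pass of sigmoid built from atomic gates
var x1 = -x; // negation
var x2 = Math.exp(x1); // exponentiation gate
var x3 = 1 + x2; // addition gate
var f = 1.0 / x3; // division gate; f = sig(x)
// backward pass, chaining the local gradients with df from above
var dx3 = (-1.0 / (x3 * x3)) * df;
var dx2 = 1.0 * dx3;
var dx1 = Math.exp(x1) * dx2;
var dx = -1.0 * dx1; // equals sig(x) * (1 - sig(x)) * df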

Let's take this opportunity to carefully structure the associated code in a nice and modular way. First, I'd like you to note that every wire in our diagrams has two numbers associated with it:

1. the value it carries during the forward pass
2. the gradient (i.e. the pull) that flows back through it in the backward pass

Let's create a simple Unit structure that will store these two values on every wire. Our gates will now operate over Units: they will take them as inputs and create them as outputs.

// every Unit corresponds to a wire in the diagrams
var Unit = function(value, grad) {
  // value computed in the forward pass
  this.value = value;
  // the derivative of circuit output w.r.t this unit, computed in backward pass
  this.grad = grad;
}

In addition to Units we also need 3 gates: +, * and sig (sigmoid). Let's start out by implementing a multiply gate. I'm using Javascript here which has a funny way of simulating classes using functions. If you're not familiar with Javascript, all that's going on here is that I'm defining a class that has certain properties (accessed with use of the this keyword), and some methods (which in Javascript are placed into the function's prototype). Just think about these as class methods. Also keep in mind that the way we will use these eventually is that we will first forward all the gates one by one, and then backward all the gates in reverse order. Here is the implementation:

var multiplyGate = function(){ };
multiplyGate.prototype = {
  forward: function(u0, u1) {
    // store pointers to input Units u0 and u1 and output unit utop
    this.u0 = u0;
    this.u1 = u1;
    this.utop = new Unit(u0.value * u1.value, 0.0);
    return this.utop;
  },
  backward: function() {
    // take the gradient in output unit and chain it with the
    // local gradients, which we derived for multiply gate before
    // then write those gradients to those Units.
    this.u0.grad += this.u1.value * this.utop.grad;
    this.u1.grad += this.u0.value * this.utop.grad;
  }
}

The multiply gate takes two units that each hold a value and creates a unit that stores its output. The gradient is initialized to zero. Then notice that in the backward function call we get the gradient from the output unit we produced during the forward pass (which will by now hopefully have its gradient filled in) and multiply it with the local gradient for this gate (chain rule!). This gate computes multiplication (u0.value * u1.value) during forward pass, so recall that the gradient w.r.t u0 is u1.value and w.r.t u1 is u0.value. Also note that we are using += to add onto the gradient in the backward function. This will allow us to possibly use the output of one gate multiple times (think of it as a wire branching out), since it turns out that the gradients from these different branches just add up when computing the final gradient with respect to the circuit output. The other two gates are defined analogously:

var addGate = function(){ };
addGate.prototype = {
  forward: function(u0, u1) {
    this.u0 = u0;
    this.u1 = u1; // store pointers to input units
    this.utop = new Unit(u0.value + u1.value, 0.0);
    return this.utop;
  },
  backward: function() {
    // add gate. derivative wrt both inputs is 1
    this.u0.grad += 1 * this.utop.grad;
    this.u1.grad += 1 * this.utop.grad;
  }
}

var sigmoidGate = function() {
  // helper function
  this.sig = function(x) { return 1 / (1 + Math.exp(-x)); };
};
sigmoidGate.prototype = {
  forward: function(u0) {
    this.u0 = u0;
    this.utop = new Unit(this.sig(this.u0.value), 0.0);
    return this.utop;
  },
  backward: function() {
    var s = this.sig(this.u0.value);
    this.u0.grad += (s * (1 - s)) * this.utop.grad;
  }
}

Note that, again, the backward function in all cases just computes the local derivative with respect to its input and then multiplies on the gradient from the unit above (i.e. chain rule). To fully specify everything let's finally write out the forward and backward flow for our 2-dimensional neuron with some example values:

// create input units
var a = new Unit(1.0, 0.0);
var b = new Unit(2.0, 0.0);
var c = new Unit(-3.0, 0.0);
var x = new Unit(-1.0, 0.0);
var y = new Unit(3.0, 0.0);

// create the gates
var mulg0 = new multiplyGate();
var mulg1 = new multiplyGate();
var addg0 = new addGate();
var addg1 = new addGate();
var sg0 = new sigmoidGate();

// do the forward pass
var forwardNeuron = function() {
  ax = mulg0.forward(a, x); // a*x = -1
  by = mulg1.forward(b, y); // b*y = 6
  axpby = addg0.forward(ax, by); // a*x + b*y = 5
  axpbypc = addg1.forward(axpby, c); // a*x + b*y + c = 2
  s = sg0.forward(axpbypc); // sig(a*x + b*y + c) = 0.8808
};
forwardNeuron();

console.log('circuit output: ' + s.value); // prints 0.8808

And now let's compute the gradient: Simply iterate in reverse order and call the backward function! Remember that we stored the pointers to the units when we did the forward pass, so every gate has access to its inputs and also the output unit it previously produced.

s.grad = 1.0;
sg0.backward(); // writes gradient into axpbypc
addg1.backward(); // writes gradients into axpby and c
addg0.backward(); // writes gradients into ax and by
mulg1.backward(); // writes gradients into b and y
mulg0.backward(); // writes gradients into a and x

Note that the first line sets the gradient at the output (very last unit) to be 1.0 to start off the gradient chain. This can be interpreted as tugging on the last gate with a force of +1. In other words, we are pulling on the entire circuit to induce the forces that will increase the output value. If we did not set this to 1, all gradients would be computed as zero due to the multiplications in the chain rule. Finally, let's make the inputs respond to the computed gradients and check that the function increased:

var step_size = 0.01;
a.value += step_size * a.grad; // a.grad is -0.105
b.value += step_size * b.grad; // b.grad is 0.315
c.value += step_size * c.grad; // c.grad is 0.105
x.value += step_size * x.grad; // x.grad is 0.105
y.value += step_size * y.grad; // y.grad is 0.210

forwardNeuron();
console.log('circuit output after one backprop: ' + s.value); // prints 0.8825

Success! 0.8825 is higher than the previous value, 0.8808. Finally, let's verify that we implemented the backpropagation correctly by checking the numerical gradient:

var forwardCircuitFast = function(a, b, c, x, y) {
  return 1 / (1 + Math.exp(-(a*x + b*y + c)));
};
var a = 1, b = 2, c = -3, x = -1, y = 3;
var h = 0.0001;
var a_grad = (forwardCircuitFast(a+h, b, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var b_grad = (forwardCircuitFast(a, b+h, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var c_grad = (forwardCircuitFast(a, b, c+h, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var x_grad = (forwardCircuitFast(a, b, c, x+h, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var y_grad = (forwardCircuitFast(a, b, c, x, y+h) - forwardCircuitFast(a, b, c, x, y)) / h;

Indeed, these all give the same values as the backpropagated gradients [-0.105, 0.315, 0.105, 0.105, 0.210]. Nice!

I hope it is clear that even though we only looked at an example of a single neuron, the code I gave above generalizes in a very straightforward way to compute gradients of arbitrary expressions (including very deep expressions #foreshadowing). All you have to do is write small gates that compute local, simple derivatives w.r.t their inputs, wire it up in a graph, do a forward pass to compute the output value and then a backward pass that chains the gradients all the way to the input.
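For example, here is a sketch of one more gate written in exactly the same style (my own addition, not from the article: a ReLU gate, which will come up again in Chapter 2):

var reluGate = function(){ };
reluGate.prototype = {
  forward: function(u0) {
    this.u0 = u0;
    this.utop = new Unit(Math.max(0, u0.value), 0.0);
    return this.utop;
  },
  backward: function() {
    // local derivative is 1 if the input was positive, 0 otherwise; chain with the gradient from above
    this.u0.grad += (this.u0.value > 0 ? 1.0 : 0.0) * this.utop.grad;
  }
}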

Becoming a Backprop Ninja
Over time you will become much more efficient in writing the backward pass, even for complicated circuits and all at once. Let's practice backprop a bit with a few examples. In what follows, let's not worry about the Unit, Circuit classes because they obfuscate things a bit, and let's just use variables such as a, b, c, x, and refer to their gradients as da, db, dc, dx respectively. Again, we think of the variables as the "forward flow" and their gradients as "backward flow" along every wire. Our first example was the * gate:

var x = a * b;
// and given gradient on x (dx), we saw that in backprop we would compute:
var da = b * dx;
var db = a * dx;

In the code above, I'm assuming that the variable dx is given, coming from somewhere above us in the circuit while we're doing backprop (or it is +1 by default otherwise). I'm writing it out because I want to explicitly show how the gradients get chained together. Note from the equations that the * gate acts as a switcher during backward pass, for lack of a better word. It remembers what its inputs were, and the gradients on each one will be the value of the other during the forward pass. And then of course we have to multiply with the gradient from above, which is the chain rule. Here's the + gate in this condensed form:

var x = a + b;
// ->
var da = 1.0 * dx;
var db = 1.0 * dx;

Where 1.0 is the local gradient, and the multiplication is our chain rule. What about adding three numbers?:

// lets compute x = a + b + c in two steps:
var q = a + b; // gate 1
var x = q + c; // gate 2

// backward pass:
dc = 1.0 * dx; // backprop gate 2
dq = 1.0 * dx;
da = 1.0 * dq; // backprop gate 1
db = 1.0 * dq;

You can see what's happening, right? If you remember the backward flow diagram, the + gate simply takes the gradient on top and routes it equally to all of its inputs (because its local gradient is always simply 1.0 for all its inputs, regardless of their actual values). So we can do it much faster:

var x = a + b + c;
var da = 1.0 * dx; var db = 1.0 * dx; var dc = 1.0 * dx;

Okay, how about combining gates?:

var x = a * b + c;
// given dx, backprop in one sweep would be =>
da = b * dx;
db = a * dx;
dc = 1.0 * dx;

If you don't see how the above happened, introduce a temporary variable q = a * b and then compute x = q + c to convince yourself. And here is our neuron, let's do it in two steps:

// lets do our neuron in two steps:
var q = a*x + b*y + c;
var f = sig(q); // sig is the sigmoid function
// and now backward pass, we are given df, and:
var df = 1;
var dq = (f * (1 - f)) * df;
// and now we chain it to the inputs
var da = x * dq;
var dx = a * dq;
var dy = b * dq;
var db = y * dq;
var dc = 1.0 * dq;


I hope this is starting to make a little more sense. Now how about this:

var x = a * a;
var da = //???

You can think of this as value a flowing to the * gate, but the wire gets split and becomes both inputs. This is actually simple because the backward flow of gradients always adds up. In other words nothing changes:

var da = a * dx; // gradient into a from first branch
da += a * dx; // and add on the gradient from the second branch

// short form instead is:
var da = 2 * a * dx;

In fact, if you know your power rule from calculus you would also know that if you have f(a) = a^2 then ∂f(a)/∂a = 2a, which is exactly what we get if we think of it as the wire splitting up and being two inputs to a gate.

Let's do another one:

var x = a*a + b*b + c*c;
// we get:
var da = 2*a*dx;
var db = 2*b*dx;
var dc = 2*c*dx;

Okay now let's start to get more complex:

var x = Math.pow(((a * b + c) * d), 2); // pow(x,2) squares the input in JS

When more complex cases like this come up in practice, I like to split the expression into manageable chunks which are almost always composed of simpler expressions and then I chain them together with chain rule:

var x1 = a * b + c;
var x2 = x1 * d;
var x = x2 * x2; // this is identical to the above expression for x
// and now in backprop we go backwards:
var dx2 = 2 * x2 * dx; // backprop into x2
var dd = x1 * dx2; // backprop into d
var dx1 = d * dx2; // backprop into x1
var da = b * dx1;
var db = a * dx1;
var dc = 1.0 * dx1; // done!

That wasn't too difficult! Those are the backprop equations for the entire expression, and we've done them piece by piece and backpropped to all the variables. Notice again how for every variable during forward pass we have an equivalent variable during backward pass that contains its gradient with respect to the circuit's final output. Here are a few more useful functions and their local gradients that are useful in practice:

var x = 1.0 / a; // division
var da = -1.0 / (a * a);

Here's what division might look like in practice then:

var x = (a + b) / (c + d);
// lets decompose it in steps:
var x1 = a + b;
var x2 = c + d;
var x3 = 1.0 / x2;
var x = x1 * x3; // equivalent to above
// and now backprop, again in reverse order:
var dx1 = x3 * dx;
var dx3 = x1 * dx;
var dx2 = (-1.0 / (x2 * x2)) * dx3; // local gradient as shown above, and chain rule
var da = 1.0 * dx1; // and finally into the original variables
var db = 1.0 * dx1;
var dc = 1.0 * dx2;
var dd = 1.0 * dx2;

Hopefully you see that we are breaking down expressions, doing the forward pass, and then for every variable (such as a) we derive its gradient da as we go backwards, one by one, applying the simple local gradients and chaining them with gradients from above. Here's another one:

var x = Math.max(a, b);
var da = a === x ? 1.0 * dx : 0.0;
var db = b === x ? 1.0 * dx : 0.0;


Okay this is making a very simple thing hard to read. The max function passes on the value of the input that was largest and ignores the other ones. In the backward pass then, the max gate will simply take the gradient on top and route it to the input that actually flowed through it during the forward pass. The gate acts as a simple switch based on which input had the highest value during forward pass. The other inputs will have zero gradient. That's what the === is about, since we are testing for which input was the actual max and only routing the gradient to it.

Finally, let's look at the Rectified Linear Unit non-linearity (or ReLU), which you may have heard of. It is used in Neural Networks in place of the sigmoid function. It is simply thresholding at zero:

var x = Math.max(a, 0);
// backprop through this gate will then be:
var da = a > 0 ? 1.0 * dx : 0.0;

In other words this gate simply passes the value through if it's larger than 0, or it stops the flow and sets it to zero. In the backward pass, the gate will pass on the gradient from the top if it was activated during the forward pass, or if the original input was below zero, it will stop the gradient flow.

I will stop at this point. I hope you got some intuition about how you can compute entire expressions (which are made up of many gates along the way) and how you can compute backprop for every one of them.

Everything we've done in this chapter comes down to this: We saw that we can feed some input through an arbitrarily complex real-valued circuit, tug at the end of the circuit with some force, and backpropagation distributes that tug through the entire circuit all the way back to the inputs. If the inputs respond slightly along the final direction of their tug, the circuit will "give" a bit along the original pull direction. Maybe this is not immediately obvious, but this machinery is a powerful hammer for Machine Learning.

Maybe this is not immediately obvious, but this machinery is a powerful hammer for Machine Learning.

Let's now put this machinery to good use.

Chapter 2: Machine Learning


In the last chapter we were concerned with real-valued circuits that computed possibly complex expressions of their inputs (the forward pass), and also we could compute the gradients of these expressions on the original inputs (backward pass). In this chapter we will see how useful this extremely simple mechanism is in Machine Learning.

Binary Classification
As we did before, let's start out simple. The simplest, common and yet very practical problem in Machine Learning is binary classification. A lot of very interesting and important problems can be reduced to it. The setup is as follows: We are given a dataset of N vectors and every one of them is labeled with a +1 or a -1. For example, in two dimensions our dataset could look as simple as:

vector -> label
[1.2, 0.7] -> +1
[-0.3, -0.5] -> -1
[3, 1] -> +1
[-0.1, -1.0] -> -1
[-3.0, 1.1] -> -1
[2.1, -3] -> +1

Here, we have N = 6 datapoints, where every datapoint has two features (D = 2). Three of the datapoints have label +1 and the other three label -1. This is a silly toy example, but in practice a +1/-1 dataset could be very useful things indeed: For example spam/no spam emails, where the vectors somehow measure various features of the content of the email, such as the number of times certain enhancement drugs are mentioned.

Goal. Our goal in binary classification is to learn a function that takes a 2-dimensional vector and predicts the label. This function is usually parameterized by a certain set of parameters, and we will want to tune the parameters of the function so that its outputs are consistent with the labeling in the provided dataset. In the end we can discard the dataset and use the learned parameters to predict labels for previously unseen vectors.

Training protocol
We will eventually build up to entire neural networks and complex expressions, but let's start out simple and train a linear classifier very similar to the single neuron we saw at the end of Chapter 1. The only difference is that we'll get rid of the sigmoid because it makes things unnecessarily complicated (I only used it as an example in Chapter 1 because sigmoid neurons are historically popular but modern Neural Networks rarely, if ever, use sigmoid non-linearities). Anyway, let's use a simple linear function:

$$ f(x, y) = a x + b y + c $$

In this expression we think of x and y as the inputs (the 2D vectors) and a, b, c as the parameters of the function that we will want to learn. For example, if a = 1, b = -2, c = -1, then the function will take the first datapoint ([1.2, 0.7]) and output 1 * 1.2 + (-2) * 0.7 + (-1) = -1.2. Here is how the training will work:

1. We select a random datapoint and feed it through the circuit
2. We will interpret the output of the circuit as a confidence that the datapoint has class +1. (i.e. very high values = circuit is very certain datapoint has class +1 and very low values = circuit is certain this datapoint has class -1.)
3. We will measure how well the prediction aligns with the provided labels. Intuitively, for example, if a positive example scores very low, we will want to tug in the positive direction on the circuit, demanding that it should output a higher value for this datapoint. Note that this is the case for the first datapoint: it is labeled as +1 but our predictor function only assigns it value -1.2. We will therefore tug on the circuit in the positive direction; we want the value to be higher.
4. The circuit will take the tug and backpropagate it to compute tugs on the inputs a, b, c, x, y
5. Since we think of x, y as (fixed) datapoints, we will ignore the pull on x, y. If you're a fan of my physical analogies, think of these inputs as pegs, fixed in the ground.
6. On the other hand, we will take the parameters a, b, c and make them respond to their tug (i.e. we'll perform what we call a parameter update). This, of course, will make it so that the circuit will output a slightly higher score on this particular datapoint in the future.
7. Iterate! Go back to step 1.

The training scheme I described above is commonly referred to as Stochastic Gradient Descent. The interesting part I'd like to reiterate is that a, b, c, x, y are all made up of the same stuff as far as the circuit is concerned: They are inputs to the circuit and the circuit will tug on all of them in some direction. It doesn't know the difference between parameters and datapoints. However, after the backward pass is complete we ignore all tugs on the datapoints (x, y) and keep swapping them in and out as we iterate over examples in the dataset. On the other hand, we keep the parameters (a, b, c) around and keep tugging on them every time we sample a datapoint. Over time, the pulls on these parameters will tune these values in such a way that the function outputs high scores for positive examples and low scores for negative examples.

Learning a Support Vector Machine

As a concrete example, let's learn a Support Vector Machine. The SVM is a very popular linear classifier; its functional form is exactly as I've described in the previous section, f(x, y) = a x + b y + c. At this point, if you've seen an explanation of SVMs you're probably expecting me to define the SVM loss function and plunge into an explanation of slack variables, geometrical intuitions of large margins, kernels, duality, etc. But here, I'd like to take a different approach. Instead of defining loss functions, I would like to base the explanation on the force specification (I just made this term up by the way) of a Support Vector Machine, which I personally find much more intuitive. As we will see, talking about the force specification and the loss function are identical ways of seeing the same problem. Anyway, here it is:

Support Vector Machine "Force Specification":

If we feed a positive datapoint through the SVM circuit and the output value is less than 1, pull on the circuit with force +1. This is a positive example so we want the score to be higher for it.
Conversely, if we feed a negative datapoint through the SVM and the output is greater than -1, then the circuit is giving this datapoint a dangerously high score: Pull on the circuit downwards with force -1.
In addition to the pulls above, always add a small amount of pull on the parameters a, b (notice, not on c!) that pulls them towards zero. You can think of both a, b as being attached to a physical spring that is attached at zero. Just as with a physical spring, this will make the pull proportional to the value of each of a, b (Hooke's law in physics, anyone?). For example, if a becomes very high it will experience a strong pull of magnitude |a| back towards zero. This pull is something we call regularization, and it ensures that neither of our parameters a or b gets disproportionally large. This would be undesirable because both a, b get multiplied to the input features x, y (remember the equation is a*x + b*y + c), so if either of them is too high, our classifier would be overly sensitive to these features. This isn't a nice property because features can often be noisy in practice, so we want our classifier to change relatively smoothly if they wiggle around.

Let's quickly go through a small but concrete example. Suppose we start out with a random parameter setting, say, a = 1, b = -2, c = -1. Then:

If we feed the point [1.2, 0.7], the SVM will compute score 1 * 1.2 + (-2) * 0.7 - 1 = -1.2. This point is labeled as +1 in the training data, so we want the score to be higher than 1. The gradient on top of the circuit will thus be positive: +1, which will backpropagate to a, b, c. Additionally, there will also be a regularization pull on a of -1 (to make it smaller) and a regularization pull on b of +2 (to make it larger, toward zero).
Suppose instead that we fed the datapoint [-0.3, 0.5] to the SVM. It computes 1 * (-0.3) + (-2) * 0.5 - 1 = -2.3. The label for this point is -1, and since -2.3 is smaller than -1, we see that according to our force specification the SVM should be happy: The computed score is very negative, consistent with the negative label of this example. There will be no pull at the end of the circuit (i.e. it is zero), since no changes are necessary. However, there will still be the regularization pull on a of -1 and on b of +2.

Okay, there's been too much text. Let's write the SVM code and take advantage of the circuit machinery we have from Chapter 1:

// A circuit: it takes 5 Units (x,y,a,b,c) and outputs a single Unit
// It can also compute the gradient w.r.t. its inputs
var Circuit = function() {
  // create some gates
  this.mulg0 = new multiplyGate();
  this.mulg1 = new multiplyGate();
  this.addg0 = new addGate();
  this.addg1 = new addGate();
};
Circuit.prototype = {
  forward: function(x, y, a, b, c) {
    this.ax = this.mulg0.forward(a, x); // a*x
    this.by = this.mulg1.forward(b, y); // b*y
    this.axpby = this.addg0.forward(this.ax, this.by); // a*x + b*y
    this.axpbypc = this.addg1.forward(this.axpby, c); // a*x + b*y + c
    return this.axpbypc;
  },
  backward: function(gradient_top) { // takes pull from above
    this.axpbypc.grad = gradient_top;
    this.addg1.backward(); // sets gradient in axpby and c
    this.addg0.backward(); // sets gradient in ax and by
    this.mulg1.backward(); // sets gradient in b and y
    this.mulg0.backward(); // sets gradient in a and x
  }
}

That's a circuit that simply computes a*x + b*y + c and can also compute the gradient. It uses the gates code we developed in Chapter 1. Now let's write the SVM, which doesn't care about the actual circuit. It is only concerned with the values that come out of it, and it pulls on the circuit.

// SVM class
var SVM = function() {

  // random initial parameter values
  this.a = new Unit(1.0, 0.0);
  this.b = new Unit(-2.0, 0.0);
  this.c = new Unit(-1.0, 0.0);

  this.circuit = new Circuit();
};
SVM.prototype = {
  forward: function(x, y) { // assume x and y are Units
    this.unit_out = this.circuit.forward(x, y, this.a, this.b, this.c);
    return this.unit_out;
  },
  backward: function(label) { // label is +1 or -1

    // reset pulls on a,b,c
    this.a.grad = 0.0;
    this.b.grad = 0.0;
    this.c.grad = 0.0;

    // compute the pull based on what the circuit output was
    var pull = 0.0;
    if(label === 1 && this.unit_out.value < 1) {
      pull = 1; // the score was too low: pull up
    }
    if(label === -1 && this.unit_out.value > -1) {
      pull = -1; // the score was too high for a negative example: pull down
    }
    this.circuit.backward(pull); // writes gradient into x,y,a,b,c

    // add regularization pull for parameters: towards zero and proportional to value
    this.a.grad += -this.a.value;
    this.b.grad += -this.b.value;
  },
  learnFrom: function(x, y, label) {
    this.forward(x, y); // forward pass (set .value in all Units)
    this.backward(label); // backward pass (set .grad in all Units)
    this.parameterUpdate(); // parameters respond to tug
  },
  parameterUpdate: function() {
    var step_size = 0.01;
    this.a.value += step_size * this.a.grad;
    this.b.value += step_size * this.b.grad;
    this.c.value += step_size * this.c.grad;
  }
};

Now let's train the SVM with Stochastic Gradient Descent:

var data = []; var labels = [];
data.push([1.2, 0.7]); labels.push(1);
data.push([-0.3, -0.5]); labels.push(-1);
data.push([3.0, 0.1]); labels.push(1);
data.push([-0.1, -1.0]); labels.push(-1);
data.push([-1.0, 1.1]); labels.push(-1);
data.push([2.1, -3]); labels.push(1);
var svm = new SVM();

// a function that computes the classification accuracy
var evalTrainingAccuracy = function() {
  var num_correct = 0;
  for(var i = 0; i < data.length; i++) {
    var x = new Unit(data[i][0], 0.0);
    var y = new Unit(data[i][1], 0.0);
    var true_label = labels[i];

    // see if the prediction matches the provided label
    var predicted_label = svm.forward(x, y).value > 0 ? 1 : -1;
    if(predicted_label === true_label) {
      num_correct++;
    }
  }
  return num_correct / data.length;
};

// the learning loop
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = new Unit(data[i][0], 0.0);
  var y = new Unit(data[i][1], 0.0);
  var label = labels[i];
  svm.learnFrom(x, y, label);

  if(iter % 25 == 0) { // every 25 iterations...
    console.log('training accuracy at iter ' + iter + ': ' + evalTrainingAccuracy());
  }
}

This code prints the following output:

training accuracy at iteration 0: 0.3333333333333333
training accuracy at iteration 25: 0.3333333333333333
training accuracy at iteration 50: 0.5
training accuracy at iteration 75: 0.5
training accuracy at iteration 100: 0.3333333333333333
training accuracy at iteration 125: 0.5
training accuracy at iteration 150: 0.5
training accuracy at iteration 175: 0.5
training accuracy at iteration 200: 0.5
training accuracy at iteration 225: 0.6666666666666666
training accuracy at iteration 250: 0.6666666666666666
training accuracy at iteration 275: 0.8333333333333334
training accuracy at iteration 300: 1
training accuracy at iteration 325: 1
training accuracy at iteration 350: 1
training accuracy at iteration 375: 1

We see that initially our classifier only had 33% training accuracy, but by the end all training examples are correctly classified as the parameters a, b, c adjusted their values according to the pulls we exerted. We just trained an SVM! But please don't use this code anywhere in production :) We will see how we can make things much more efficient once we understand what is going on at the core.

Number of iterations needed. With this example data, with this example initialization, and with the setting of step size we used, it took about 300 iterations to train the SVM. In practice, this could be many more or many fewer depending on how hard or large the problem is, how you're initializing, normalizing your data, what step size you're using, and so on. This is just a toy demonstration, but later we will go over all the best practices for actually training these classifiers in practice. For example, it will turn out that the setting of the step size is very important and tricky. A small step size will make your model slow to train. A large step size will train faster, but if it is too large, it will make your classifier chaotically jump around and not converge to a good final result. We will eventually use withheld validation data to properly tune it to be just in the sweet spot for your particular data.

One thing I'd like you to appreciate is that the circuit can be an arbitrary expression, not just the linear prediction function we used in this example. For example, it can be an entire neural network.

By the way, I intentionally structured the code in a modular way, but we could have trained an SVM with much simpler code. Here is really what all of these classes and computations boil down to:

var a = 1, b = -2, c = -1; // initial parameters
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];

  // compute pull
  var score = a*x + b*y + c;
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1;
  if(label === -1 && score > -1) pull = -1;

  // compute gradient and update parameters
  var step_size = 0.01;
  a += step_size * (x * pull - a); // -a is from the regularization
  b += step_size * (y * pull - b); // -b is from the regularization
  c += step_size * (1 * pull);
}

This code gives an identical result. Perhaps by now you can glance at the code and see how these equations came about.
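To unpack one of them (my own spelling-out, following the force specification above): the score is a*x + b*y + c, so its local derivative with respect to a is x; chaining that with the pull from the top and adding the regularization tug of -a gives

$$ a \leftarrow a + \text{step\_size} \cdot (x \cdot \text{pull} - a) $$

which is exactly the first update line. The updates for b and c follow the same pattern (with no regularization on c).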

Variable pull? A quick note to make at this point: You may have noticed that the pull is always
-1, 0, or 1. You could imagine doing other things, for example making this pull proportional to
how bad the mistake was. This leads to a variation on the SVM that some people refer to as the
squared hinge loss SVM, for reasons that will later become clear. Depending on various
features of your dataset, that may work better or worse. For example, if you have very bad
outliers in your data, e.g. a negative data point that gets a score +100, its influence will be
relatively minor on our classifier because we will only pull with a force of -1 regardless of how
bad the mistake was. In practice we refer to this property of a classifier as robustness to
outliers.
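
To sketch the idea (this is only an illustration of the variant, not code we will rely on later), the pull computation inside the training loop would change to something like:

// squared hinge loss variant (sketch): pull proportionally to how badly the
// example violates its margin, instead of always pulling with force 1
var score = a*x + b*y + c;
var margin_violation = Math.max(0, -label * score + 1); // 0 if the example is fine
var pull = label * margin_violation; // e.g. a negative outlier with score +100 now pulls with force -101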

Let's recap. We introduced the binary classification problem, where we are given N D-
dimensional vectors and a label +1/-1 for each. We saw that we can combine these features
with a set of parameters inside a real-valued circuit (such as a Support Vector Machine circuit
in our example). Then, we can repeatedly pass our data through the circuit and each time tweak
the parameters so that the circuit's output value is consistent with the provided labels. The
tweaking relied, crucially, on our ability to backpropagate gradients through the circuit. In the
end, the final circuit can be used to predict values for unseen instances!

Generalizing the SVM into a Neural Network
Of interest is the fact that an SVM is just a particular type of a very simple circuit (a circuit that
computes score = a*x + b*y + c where a,b,c are weights and x,y are data points). This
can be easily extended to more complicated functions. For example, let's write a 2-layer Neural
Network that performs the binary classification. The forward pass will look like this:

// assume inputs x,y
var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron
var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron
var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron
var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score

The specification above is a 2-layer Neural Network with 3 hidden neurons (n1, n2, n3) that
uses a Rectified Linear Unit (ReLU) non-linearity on each hidden neuron. As you can see, there
are now several parameters involved, which means that our classifier is more complex and can
represent more intricate decision boundaries than just a simple linear decision rule such as an
SVM. Another way to think about it is that every one of the three hidden neurons is a linear
classifier and now we're putting an extra linear classifier on top of that. Now we're starting to go
deeper :). Okay, let's train this 2-layer Neural Network. The code looks very similar to the SVM
example code above; we just have to change the forward pass and the backward pass:

// random initial parameters
var a1 = Math.random() - 0.5; // a random number between -0.5 and 0.5
// ... similarly initialize all other parameters to randoms
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];

  // compute forward pass
  var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron
  var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron
  var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron
  var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score

  // compute the pull on top
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1; // we want higher output! Pull up.
  if(label === -1 && score > -1) pull = -1; // we want lower output! Pull down.

  // now compute backward pass to all parameters of the model

  // backprop through the last "score" neuron
  var dscore = pull;
  var da4 = n1 * dscore;
  var dn1 = a4 * dscore;
  var db4 = n2 * dscore;
  var dn2 = b4 * dscore;
  var dc4 = n3 * dscore;
  var dn3 = c4 * dscore;
  var dd4 = 1.0 * dscore; // phew

  // backprop the ReLU non-linearities, in place
  // i.e. just set gradients to zero if the neurons did not "fire"
  var dn3 = n3 === 0 ? 0 : dn3;
  var dn2 = n2 === 0 ? 0 : dn2;
  var dn1 = n1 === 0 ? 0 : dn1;

  // backprop to parameters of neuron 1
  var da1 = x * dn1;
  var db1 = y * dn1;
  var dc1 = 1.0 * dn1;

  // backprop to parameters of neuron 2
  var da2 = x * dn2;
  var db2 = y * dn2;
  var dc2 = 1.0 * dn2;

  // backprop to parameters of neuron 3
  var da3 = x * dn3;
  var db3 = y * dn3;
  var dc3 = 1.0 * dn3;

  // phew! End of backprop!
  // note we could have also backpropped into x,y
  // but we do not need these gradients. We only use the gradients
  // on our parameters in the parameter update, and we discard x,y

  // add the pulls from the regularization, tugging all multiplicative
  // parameters (i.e. not the biases) downward, proportional to their value
  da1 += -a1; da2 += -a2; da3 += -a3;
  db1 += -b1; db2 += -b2; db3 += -b3;
  da4 += -a4; db4 += -b4; dc4 += -c4;

  // finally, do the parameter update
  var step_size = 0.01;
  a1 += step_size * da1;
  b1 += step_size * db1;
  c1 += step_size * dc1;
  a2 += step_size * da2;
  b2 += step_size * db2;
  c2 += step_size * dc2;
  a3 += step_size * da3;
  b3 += step_size * db3;
  c3 += step_size * dc3;
  a4 += step_size * da4;
  b4 += step_size * db4;
  c4 += step_size * dc4;
  d4 += step_size * dd4;
  // wow this is tedious, please use for loops in prod.
  // we're done!
}

And that's how you train a neural network. Obviously, you want to modularize your code nicely,
but I expanded this example for you in the hope that it makes things much more concrete and
simpler to understand. Later, we will look at best practices when implementing these networks
and we will structure the code much more neatly in a modular and more sensible way.
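
As a teaser of what such a cleanup might look like, here is one possible way to fold the parameters into arrays so the backward pass and the update become loops. Treat this as a sketch only: the array layout and names are just for illustration (they are not the library we will build later), and it assumes the same data and labels arrays as before:

function rand() { return Math.random() - 0.5; }

// one [weight_x, weight_y, bias] triple per hidden neuron, plus output weights and bias
var hidden = [ [rand(), rand(), rand()],
               [rand(), rand(), rand()],
               [rand(), rand(), rand()] ];
var out_w = [rand(), rand(), rand()];
var out_b = rand();
var step_size = 0.01;

for(var iter = 0; iter < 400; iter++) {
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0], y = data[i][1], label = labels[i];

  // forward pass: three ReLU hidden neurons, then a linear score on top
  var n = [];
  for(var k = 0; k < 3; k++) {
    n[k] = Math.max(0, hidden[k][0]*x + hidden[k][1]*y + hidden[k][2]);
  }
  var score = out_w[0]*n[0] + out_w[1]*n[1] + out_w[2]*n[2] + out_b;

  // the pull on the score, exactly as before
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1;
  if(label === -1 && score > -1) pull = -1;

  // backward pass and parameter update, one hidden neuron at a time
  out_b += step_size * pull; // gradient on the output bias is just the pull
  for(var k = 0; k < 3; k++) {
    var dn = (n[k] === 0) ? 0 : out_w[k] * pull; // backprop through the ReLU
    var dout_wk = n[k] * pull - out_w[k];        // -out_w[k] is the regularization
    hidden[k][0] += step_size * (x * dn - hidden[k][0]); // -hidden[k][0]: regularization
    hidden[k][1] += step_size * (y * dn - hidden[k][1]);
    hidden[k][2] += step_size * dn; // biases are not regularized
    out_w[k] += step_size * dout_wk;
  }
}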

But for now, I hope your takeaway is that a 2-layer Neural Net is really not such a scary thing:
we write a forward pass expression, interpret the value at the end as a score, and then we pull
on that value in a positive or negative direction depending on what we want that value to be for
our current particular example. The parameter update after backprop will ensure that when we
see this particular example in the future, the network will be more likely to give us a value we
desire, not the one it gave just before the update.

A more Conventional Approach: Loss Functions
Now that we understand the basics of how these circuits function with data, let's adopt a more
conventional approach that you might see elsewhere on the internet and in other tutorials and
books. You won't see people talking too much about force specifications. Instead, Machine
Learning algorithms are specified in terms of loss functions (or cost functions, or
objectives).

As I develop this formalism I would also like to start to be a little more careful with how we name
our variables and parameters. I'd like these equations to look similar to what you might see in a
book or some other tutorial, so let me use more standard naming conventions.

Example: 2D Support Vector Machine
Let's start with an example of a 2-dimensional SVM. We are given a dataset of $N$ examples
$(x_{i0}, x_{i1})$ and their corresponding labels $y_i$, which are allowed to be either +1/-1 for
a positive or negative example respectively. Most importantly, as you recall we have three
parameters $(w_0, w_1, w_2)$. The SVM loss function is then defined as follows:

$$L = \left[ \sum_{i=1}^{N} \max\left(0, -y_i(w_0 x_{i0} + w_1 x_{i1} + w_2) + 1\right) \right] + \alpha \left[ w_0^2 + w_1^2 \right]$$

Notice that this expression is always positive, due to the thresholding at zero in the first
expression and the squaring in the regularization. The idea is that we will want this expression
to be as small as possible. Before we dive into some of its subtleties let me first translate it to
code:

var X = [ [1.2, 0.7], [-0.3, 0.5], [3, 2.5] ]; // array of 2-dimensional data
var y = [1, -1, 1]; // array of labels
var w = [0.1, 0.2, 0.3]; // example: random numbers
var alpha = 0.1; // regularization strength

function cost(X, y, w) {

  var total_cost = 0.0; // L, in SVM loss function above
  N = X.length;
  for(var i = 0; i < N; i++) {
    // loop over all data points and compute their score
    var xi = X[i];
    var score = w[0] * xi[0] + w[1] * xi[1] + w[2];

    // accumulate cost based on how compatible the score is with the label
    var yi = y[i]; // label
    var costi = Math.max(0, -yi * score + 1);
    console.log('example ' + i + ': xi = (' + xi + ') and label = ' + yi);
    console.log('score computed to be ' + score.toFixed(3));
    console.log('=> cost computed to be ' + costi.toFixed(3));
    total_cost += costi;
  }

  // regularization cost: we want small weights
  reg_cost = alpha * (w[0]*w[0] + w[1]*w[1]);
  console.log('regularization cost for current model is ' + reg_cost.toFixed(3));
  total_cost += reg_cost;

  console.log('total cost is ' + total_cost.toFixed(3));
  return total_cost;
}

And here is the output:

cost for example 0 is 0.440
cost for example 1 is 1.370
cost for example 2 is 0.000
regularization cost for current model is 0.005
total cost is 1.815

Notice how this expression works: It measures how bad our SVM classifier is. Let's step through
this explicitly:

The first data point xi = [1.2, 0.7] with label yi = 1 will give score 0.1*1.2 +
0.2*0.7 + 0.3, which is 0.56. Notice, this is a positive example so we want the
score to be greater than +1. 0.56 is not enough. And indeed, the expression for cost
for this data point will compute: costi = Math.max(0, -1*0.56 + 1), which is 0.44.
You can think of the cost as quantifying the SVM's unhappiness.

The second data point xi = [-0.3, 0.5] with label yi = -1 will give score 0.1*
(-0.3) + 0.2*0.5 + 0.3, which is 0.37. This isn't looking very good: This score is
very high for a negative example. It should be less than -1. Indeed, when we compute the
cost: costi = Math.max(0, 1*0.37 + 1), we get 1.37. That's a very high cost from
this example, as it is being misclassified.

The last example xi = [3, 2.5] with label yi = 1 gives score 0.1*3 + 0.2*2.5 +
0.3, and that is 1.1. In this case, the SVM will compute costi = Math.max(0, -1*1.1
+ 1), which is in fact zero. This data point is being classified correctly and there is no cost
associated with it.

A cost function is an expression that measures how bad your classifier is. When the training
set is perfectly classified, the cost (ignoring the regularization) will be zero.

Notice that the last term in the loss is the regularization cost, which says that our model
parameters should be small values. Due to this term the cost will never actually become zero
(because this would mean all parameters of the model except the bias are exactly zero), but the
closer we get, the better our classifier will become.

The majority of cost functions in Machine Learning consist of two parts: 1. A part that
measures how well a model fits the data, and 2. Regularization, which measures some notion
of how complex or likely a model is.

I hope I convinced you then, that to get a very good SVM we really want to make the cost as
small as possible. Sounds familiar? We know exactly what to do: The cost function written
above is our circuit. We will forward all examples through the circuit, compute the backward
pass and update all parameters such that the circuit will output a smaller cost in the future.
Specifically, we will compute the gradient and then update the parameters in the opposite
direction of the gradient (since we want to make the cost small, not large).

We know exactly what to do: The cost function written above is our circuit.
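
As a rough sketch of that loop (not the efficient way we will eventually do it), we could even reuse the numerical gradient strategy from earlier in the guide directly on the cost function above. You would want to silence the console.log calls inside cost for this:

// minimal sketch: minimize cost(X, y, w) with a numerical gradient.
// assumes the X, y, w, alpha and cost() defined above are in scope.
var h = 0.0001;
var step_size = 0.01;
for(var iter = 0; iter < 100; iter++) {
  var cost_current = cost(X, y, w);

  // numerical gradient: nudge each weight by h and see how the cost responds
  var gradient = [];
  for(var j = 0; j < w.length; j++) {
    w[j] += h;
    gradient[j] = (cost(X, y, w) - cost_current) / h;
    w[j] -= h; // undo the nudge
  }

  // step in the opposite direction of the gradient, since we want a smaller cost
  for(var j = 0; j < w.length; j++) {
    w[j] += -step_size * gradient[j];
  }
}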

todo: clean up this section and flesh it out a bit

Chapter 3: Backprop in Practice

Building up a library

Example: Practical Neural Network Classifier
Multiclass: Structured SVM
Multiclass: Logistic Regression, Softmax

Example: Regression
Tiny changes needed to cost function. L2 regularization.

Example: Structured Prediction
Basic idea is to train an (unnormalized) energy model

Vectorized Implementations
Writing a Neural Net classifier in Python with numpy.

Backprop in practice: Tips/Tricks
Monitoring of Cost function
Monitoring training/validation performance
Tweaking initial learning rates, learning rate schedules
Optimization: Using Momentum
Optimization: LBFGS, Nesterov accelerated gradient
Importance of Initialization: weights and biases
Regularization: L2, L1, Group sparsity, Dropout
Hyperparameter search, cross-validations
Common pitfalls: (e.g. dying ReLUs)
Handling unbalanced datasets
Approaches to debugging nets when something doesn't work

Chapter 4: Networks in the Wild
Case studies of models that work well in practice and have been deployed in the wild.

Case Study: Convolutional Neural Networks for images
Convolutional layers, pooling, AlexNet, etc.

Case Study: Recurrent Neural Networks for Speech and Text
Vanilla Recurrent nets, bidirectional recurrent nets. Maybe an overview of LSTM.

Case Study: Word2Vec
Training word vector representations in NLP

Case Study: t-SNE
Training embeddings for visualizing data

Acknowledgements
Thanks a lot to the following people who made this guide better: wodenokoto (HN), zackmorris
(HN).

Comments
This guide is a work in progress and I appreciate feedback, especially regarding parts that were
unclear or only made half sense. Thank you!

Some of the Javascript code in this tutorial has been translated to Python by Ajit, find it over on
Github.
