3rd Unit DL Final Class Notes
Prepared
by
Dr K Madan Mohan
Asst. Professor
Department of CSE (AI&ML)
Sreyas Institute of Engineering and Technology,
Nagole, Bandlaguda, Hyderabad
Course Outcomes:
Ability to understand the concepts of Neural Networks
Ability to select the Learning Networks in modeling real world systems
Ability to use an efficient algorithm for Deep Models
Ability to apply optimization strategies for large scale applications
UNIT-I
Artificial Neural Networks Introduction, Basic models of ANN, important terminologies, Supervised
Learning Networks, Perceptron Networks, Adaptive Linear Neuron, Back-propagation Network.
Associative Memory Networks. Training Algorithms for pattern association, BAM and Hopfield
Networks.
UNIT-II
Unsupervised Learning Network- Introduction, Fixed Weight Competitive Nets, Maxnet, Hamming
Network, Kohonen Self-Organizing Feature Maps, Learning Vector Quantization, Counter Propagation
Networks, Adaptive Resonance Theory Networks. Special Networks-Introduction to various networks.
UNIT - III
Introduction to Deep Learning, Historical Trends in Deep learning, Deep Feed - forward networks,
Gradient-Based learning, Hidden Units, Architecture Design, Back-Propagation and Other
Differentiation Algorithms
UNIT - IV
Regularization for Deep Learning: Parameter norm Penalties, Norm Penalties as Constrained
Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise
Robustness, Semi-Supervised Learning, Multi-task Learning, Early Stopping, Parameter Tying and
Parameter Sharing, Sparse Representations, Bagging and other Ensemble Methods, Dropout,
Adversarial Training, Tangent Distance, tangent Prop and Manifold, Tangent Classifier
UNIT - V
Optimization for Training Deep Models: Challenges in Neural Network Optimization, Basic Algorithms,
Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates, Approximate Second-
Order Methods, Optimization Strategies and Meta-Algorithms
Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language
Processing
TEXT BOOKS:
1. Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville, MIT Press.
2. Neural Networks and Learning Machines, Simon Haykin, 3rd Edition, Pearson Prentice Hall.
UNIT-III
Introduction to Deep Learning
Applications of deep learning:
a. Computer vision: Deep learning has revolutionized computer vision tasks such as image classification, object detection, and image segmentation. It enables machines to understand and interpret visual data, powering technologies like autonomous vehicles, facial recognition, and augmented reality.
b. Healthcare: Deep learning has shown promise in medical imaging analysis, disease diagnosis, and drug discovery. It aids in identifying patterns in medical images, predicting disease outcomes, and developing new treatments.
c. Finance: Deep learning is used in finance for tasks like fraud detection, algorithmic trading, and credit scoring. It helps identify fraudulent transactions, analyze market trends, and make data-driven investment decisions.
Challenges of deep learning:
a. Data availability: Deep learning models require large amounts of labeled data for training, which can be challenging to obtain in certain domains. Limited or biased data can affect model performance and generalization.
b. Interpretability: Deep learning models are often considered black boxes, making it difficult to understand the rationale behind their predictions. The lack of interpretability raises concerns in critical applications like healthcare and finance.
c. Overfitting: Deep learning models are prone to overfitting, where they memorize the training data instead of learning generalizable patterns. Overfitting can lead to poor performance on unseen data.
Addressing these challenges requires ongoing research and development in the field of
deep learning to improve data collection, model interpretability, regularization
techniques, and security measures.
2.1 Historical Trends in Deep Learning [2nd Topic]
1. Deep learning has a history that spans a long time and has been known by different
names, reflecting different perspectives and trends in the field.
2. The usefulness of deep learning has increased as the availability of training data has
grown. More data allows deep learning models to learn more effectively and make
better predictions.
3. Deep learning models have become larger over time due to advancements in
computer infrastructure. This includes improvements in both hardware (such as
GPUs) and software (such as optimized algorithms and frameworks) specifically
designed for deep learning.
4. As deep learning models have evolved, they have been able to tackle increasingly
complex applications with higher accuracy. This means that deep learning has been
successful in solving more challenging tasks and producing more reliable results.
2.1.1 The Many Names and Changing Fortunes of Neural Networks
1. Deep learning has a long history dating back to the 1940s, but it has recently gained
popularity and is often referred to as a new technology.
2. Deep learning has gone through various name changes over time, reflecting different
researchers and perspectives in the field.
3. There have been three waves of development in deep learning: cybernetics in the
1940s-1960s, connectionism in the 1980s-1990s, and the current resurgence known
as deep learning since 2006.
4. Deep learning models are sometimes called artificial neural networks (ANNs)
because they are inspired by the functioning of the biological brain.
5. While neural networks have been used to understand brain function, they are not
necessarily realistic models of how the brain works.
6. Deep learning is motivated by the idea of reverse engineering the brain's
computational principles to build intelligent systems and understand human
intelligence.
7. Deep learning also focuses on learning multiple levels of composition, which can be
applied in machine learning frameworks that are not necessarily based on neural
inspiration.
The figure represents two historical waves of artificial neural network research based
on Google Books. The first wave, cybernetics (1940s-1960s), focused on theories of
biological learning and the development of the perceptron, a model that could train a
single neuron. The second wave, connectionism (1980-1995), introduced back-
propagation to train neural networks with one or two hidden layers. The current third
wave, deep learning, began around 2006 and is just now being documented in books
since 2016. It's important to note that books on these waves usually appear later than
the actual research takes place.
MCCULLOCH-PITTS NEURON:
PERCEPTRON:
Linear Models
Deep learning was inspired by neuroscience but is not a direct simulation of the
brain.
Early neural networks were simple linear models that could only distinguish between
two categories of inputs.
The perceptron was the first model that could learn the weights defining these
categories from examples.
Linear models have limitations and cannot learn certain functions, such as the XOR
function.
Neuroscience is still an important source of inspiration for deep learning, but it is
not the predominant guide for the field.
We do not have enough information about the brain to use it as a complete guide for
deep learning research.
Deep learning researchers are more likely to cite the brain as an influence than
researchers working in other machine learning fields.
Deep learning and computational neuroscience are two separate fields of study that
are both concerned with understanding the brain.
Deep learning is focused on building AI systems, while computational neuroscience
is focused on building accurate models of the brain.
Connectionism is a movement in cognitive science that studies models of cognition
based on neural implementations.
Distributed representation is a key concept in connectionism that states that each
input should be represented by many features and each feature should be involved
in the representation of many possible inputs.
Back-propagation: A popular algorithm for training deep neural networks.
LSTM: A type of neural network that is well-suited for modeling sequences.
Decline of neural networks: In the 1990s, neural networks lost popularity due to
unrealistic expectations and advances in other machine learning fields.
CIFAR NCAP research initiative: A program that helped to keep neural networks
research alive during the decline.
Deep Networks: Were once thought to be very difficult to train, but this is no longer
the case.
Geoffrey Hinton: Developed a new technique for training deep neural
networks called greedy layer-wise pre-training.
Deep belief networks: A type of neural network that can be efficiently trained
using greedy layer-wise pre-training.
Deep learning: A term used to emphasize the ability to train deeper neural
networks.
Third wave of neural networks research: Began in 2006 and is still ongoing.
Focus of deep learning research: Has shifted from unsupervised learning to
supervised learning.
2.1.2 Important Points and Conclusions about Deep Feedforward Networks
1. Deep feedforward networks, also known as multilayer perceptrons (MLPs), are a
type of artificial neural network that approximate a function f* by defining a
mapping y = f(x; θ) and learning the parameters θ. These networks are called
feedforward because information flows in a single direction from the input x to the
output y, without feedback connections. When extended with feedback connections,
they become recurrent neural networks.
2. Feedforward networks consist of multiple layers of functions, where each layer is
connected to the next in a chain. The first layer is called the input layer, and the final
layer is called the output layer. The layers in between are called hidden layers
because their behavior is not directly specified by the training data.
3. During training, the network is presented with labeled examples (x, y) to learn the
desired output y for each input x. The learning algorithm determines how to use the
hidden layers to best approximate f*.
4. The width of the network is determined by the dimensionality of the hidden layers,
and the depth is determined by the number of layers. The choice of functions used
to compute the hidden layer values is inspired by neuroscience, but the goal of these
networks is not to perfectly model the brain.
5. To overcome the limitations of linear models, such as logistic regression and linear
regression, which can only represent linear functions, we can apply a nonlinear
transformation φ(x) to the input x to obtain a set of features describing x. This is
equivalent to using a kernel function in kernel machines.
6. In deep learning, we learn the function φ(x; θ) and map it to the desired output using
parameters w. This approach allows us to capture the benefits of both highly generic
feature mappings and manually engineered feature mappings, while avoiding the
limitations of either.
7. To train a feedforward network, we choose an optimizer, cost function, and output
units, which are similar to those used for linear models. We also choose the
activation functions used to compute the hidden layer values and design the
architecture of the network, including the number of layers, connections between
layers, and number of units in each layer.
8. Computing the gradients of complicated functions in deep neural networks requires
the back-propagation algorithm and its modern generalizations, which can
efficiently compute these gradients.
9. Deep feedforward networks are a type of artificial neural network that approximate
a function by defining a mapping y = f(x; θ) and learning the parameters θ. They
consist of multiple layers of functions, where each layer is connected to the next in
a chain, and can capture the benefits of both highly generic and manually engineered
feature mappings.
10. Training and optimization techniques, such as choosing an optimizer, cost function,
output units, activation functions, and designing the network architecture, are
required to effectively train these networks. The back-propagation algorithm is used
to efficiently compute the gradients required for learning.
Layer of input
It contains the neurons that receive input. The data is subsequently passed on to the next
tier. The input layer’s total number of neurons is equal to the number of variables in the
dataset.
Hidden layer
This is the intermediate layer, which sits between the input and output layers. It
contains neurons that apply transformations to the inputs and then pass the results
on to the output layer.
Output layer
This is the last layer; its form depends on how the model is constructed. The output
layer produces the predicted value of the target feature, i.e., the desired outcome.
Neuron weights
Weights describe the strength of a connection between neurons. They are typically
initialized to small values and are adjusted during training.
The output of a layer is computed as a = f(Wx + b), where
b = biases
W = weights
x = input
a = output vector.
The output layer contains as many neurons as there are classes, and the loss (e.g.,
cross-entropy) measures the difference between the predicted and actual distributions
of probabilities.
Example (learning XOR, following Goodfellow et al., Section 6.1): the complete solution uses first-layer weights W (a 2x2 matrix of ones), first-layer biases c = [0, -1]ᵀ, output weights

w = [1, -2]ᵀ, (6.6)

and output bias b = 0.
We can now walk through the way that the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

X = [[0, 0], [0, 1], [1, 0], [1, 1]]. (6.7)

The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:

XW = [[0, 0], [1, 1], [1, 1], [2, 2]]. (6.8)

Next, we add the bias vector c, to obtain

[[0, -1], [1, 0], [1, 0], [2, 1]]. (6.9)

In this space, all of the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

h = [[0, 0], [1, 0], [1, 0], [2, 1]]. (6.10)

This transformation has changed the relationship between the examples. They no longer lie on a single line. As shown in figure 6.1, they now lie in a space where a linear model can solve the problem. We finish by multiplying by the weight vector w, obtaining the correct XOR outputs:

ŷ = [0, 1, 1, 0]ᵀ. (6.11)

A NumPy check of this computation is given below.
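To verify the arithmetic in (6.6)-(6.11), here is a minimal NumPy sketch; the variable names X, W, c, w, b mirror the equations above (illustrative code written for these notes, not taken from the textbook):

import numpy as np

# Design matrix: all four binary inputs, one example per row (eq. 6.7)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

W = np.ones((2, 2))           # first-layer weights
c = np.array([0., -1.])       # first-layer biases
w = np.array([1., -2.])       # output weights (eq. 6.6)
b = 0.                        # output bias

h = np.maximum(0, X @ W + c)  # rectified linear hidden layer (eqs. 6.8-6.10)
y = h @ w + b                 # linear output layer (eq. 6.11)

print(y)                      # -> [0. 1. 1. 0.], the XOR function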
Gradient-Based Learning
- A feedforward network defines a mapping y = f(x; θ). The network is formed by composing many different functions in a chain, e.g.
f(x) = f(3)( f(2)( f(1)(x) ) ),
where f(1) is the first layer, f(2) is the second layer, and so on.
- Example of learning XOR: the XOR function gives the targets
x1 x2 : y
0  0  : 0
0  1  : 1
1  0  : 1
1  1  : 0
Treating this as a regression problem and using the mean squared error cost
J(θ) = (1/4) Σ_x ( f*(x) - f(x; θ) )²,
suppose we choose a simple linear model, f(x; w, b) = xᵀw + b. We can minimize J(θ) in closed form with respect to w and b using the normal equations. After solving, we obtain w = 0 and b = 1/2: the linear model outputs 0.5 everywhere and cannot represent XOR. A hidden layer with a nonlinear (e.g., rectified linear) activation is therefore needed, as in the worked example above.

Cost Functions
- To train a neural network we have to pick a cost function. The cost functions are more or less the same as those used for other parametric models such as linear models.
- Most modern neural networks are trained using maximum likelihood. The cost function is then the negative log-likelihood, equivalently the cross-entropy between the training data and the model distribution:
J(θ) = -E_{x,y ~ p̂_data} log p_model(y | x).
- The specific form of the cost changes from model to model, depending on the form of log p_model. For example, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost
J(θ) = (1/2) E_{x,y ~ p̂_data} || y - f(x; θ) ||² + const.

Learning Conditional Statistics
- Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x, for example a predictor f(x; θ) for the mean of y.
- Minimizing the mean squared error yields a function that predicts the mean of y for each x; minimizing the mean absolute error yields a function that predicts the median of y for each x.

Output Units
- The choice of cost function is tightly coupled with the choice of output unit.
- Linear units for Gaussian output distributions: a layer of linear output units produces ŷ = Wᵀh + b and is often used to produce the mean of a conditional Gaussian distribution, p(y | x) = N(y; ŷ, I). Maximizing the log-likelihood is then equivalent to minimizing the mean squared error. Linear units do not saturate, so they pose little difficulty for gradient-based optimization.
- Sigmoid units for Bernoulli output distributions: for a binary variable y, the network must predict P(y = 1 | x), which must lie in the interval [0, 1]. A sigmoid output unit uses ŷ = σ(z) with z = wᵀh + b. The maximum-likelihood loss is J(θ) = -log P(y | x) = ζ((1 - 2y)z), where ζ is the softplus function; the log in the cost undoes the exp in the sigmoid, so saturation of the sigmoid does not prevent gradient-based learning from making progress.
- Softmax units for Multinoulli output distributions: used when we want to represent a probability distribution over a discrete variable with n possible values,
softmax(z)_i = exp(z_i) / Σ_j exp(z_j).
(These output units and costs are covered again in more detail later in this unit; a short code sketch of the softmax cost follows.)
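As a concrete illustration of the maximum-likelihood costs above, here is a small NumPy sketch of the softmax output and its cross-entropy loss (hand-rolled example code for these notes, not a library API; subtracting the maximum logit is a standard stabilization trick assumed here):

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged mathematically.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    # Negative log-likelihood of the true class index y under softmax(z).
    return -np.log(softmax(z)[y])

z = np.array([2.0, 0.5, -1.0])   # example logits for 3 classes
print(softmax(z))                # predicted class probabilities (sum to 1)
print(cross_entropy(z, y=0))     # loss is small when the true class has high probability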
Training proceeds by repeatedly running the network forward, measuring the loss, and then:
1. Computing the gradients, which are like the slopes of the loss function with respect to each parameter; they tell us how the loss changes as we adjust each parameter.
2. Using these gradients, updating the parameters in the direction that reduces the loss.
Figure: gradient descent — starting from an initial weight, repeated incremental steps in the direction opposite to the derivative of the cost move the weight toward the minimum cost.
It's like adjusting the knobs on a machine until it produces the desired results.
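A minimal sketch of this idea in NumPy, assuming a simple squared-error cost on a single weight (the toy data and variable names are illustrative, not from the notes):

import numpy as np

# Toy data: y = 3x, so the ideal weight is 3.
x = np.array([1.0, 2.0, 3.0])
y = 3.0 * x

w = 0.0             # initial weight ("initial knob setting")
lr = 0.05           # learning rate (size of each incremental step)

for step in range(100):
    y_hat = w * x
    loss = np.mean((y_hat - y) ** 2)     # cost we want to minimize
    grad = np.mean(2 * (y_hat - y) * x)  # derivative of the cost w.r.t. w
    w -= lr * grad                       # step in the direction that reduces the loss

print(w)  # close to 3.0 after training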
Hidden Units
- A hidden unit takes a vector of inputs x, computes an affine transformation z = Wᵀx + b, and then applies an element-wise nonlinear activation function g(z). Hidden unit types are distinguished mainly by the choice of g.
- Rectified linear units (ReLU) use the activation function g(z) = max(0, z), applied on top of an affine transformation:
h = g(Wᵀx + b).
- Rectified linear units significantly improve training compared with saturating activation functions: they are easy to optimize because they behave much like linear units, and the gradient remains large whenever the unit is active.
- ReLU is not differentiable at z = 0, but in practice this causes no problems for gradient-based learning.
- Generalizations of ReLU use a small non-zero slope α when z < 0:
h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i),
for example the leaky ReLU (small fixed α) and absolute value rectification (α = -1, discussed below). Maxout units output the maximum over a group of inputs, g(z)_i = max_{j ∈ G(i)} z_j.
- Sigmoid units g(z) = σ(z) and hyperbolic tangent units g(z) = tanh(z) are now rarely used as hidden units because they saturate over most of their domain, which makes gradient-based learning difficult. They can still be used as output units (e.g., the sigmoid for Bernoulli outputs), and tanh typically behaves better than the logistic sigmoid when a sigmoidal unit must be used.
(A short code sketch of these activation functions follows.)
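A small NumPy sketch of the activation functions mentioned above (illustrative definitions only):

import numpy as np

def relu(z):
    # Rectified linear unit: g(z) = max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Generalized ReLU with a small non-zero slope alpha for z < 0
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def sigmoid(z):
    # Logistic sigmoid; saturates for large |z|, which slows gradient-based learning
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
print(relu(z))
print(leaky_relu(z))
print(np.tanh(z))   # tanh usually works better than the logistic sigmoid as a hidden unit
print(sigmoid(z))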
Architecture Design
- The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.
- Most neural networks are organized into groups of units called layers, arranged in a chain structure, with each layer being a function of the layer that preceded it. The first layer is
h(1) = g(1)( W(1)ᵀ x + b(1) ),
the second layer is
h(2) = g(2)( W(2)ᵀ h(1) + b(2) ),
and so on.
- The main architectural considerations are:
(i) Depth of the network: how many layers the signal passes through from input to output.
(ii) Width of the network: how many nodes (units) each layer has for processing the signal.
- Universal approximation theorem: a feedforward network with a linear output layer and at least one hidden layer with a "squashing" activation function (such as the logistic sigmoid) can approximate essentially any function we want to learn, to any desired degree of accuracy, provided the network is given enough hidden units.
- The theorem guarantees that such a network exists; it does not guarantee that training will find it, and in the worst case the required number of hidden units can be very large. Using deeper models can reduce the number of units required and can reduce generalization error.
(A code sketch of a forward pass through a network of configurable depth and width follows.)
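To make the layer-chain equations concrete, here is a minimal sketch of a forward pass through a network of configurable depth and width (a hypothetical helper written for these notes; the weights are random and tanh is an arbitrary choice for illustration):

import numpy as np

def forward(x, weights, biases, g=np.tanh):
    # h(k) = g( W(k)^T h(k-1) + b(k) ), with h(0) = x
    h = x
    for W, b in zip(weights, biases):
        h = g(W.T @ h + b)
    return h

rng = np.random.default_rng(0)
layer_sizes = [2, 4, 4, 1]          # depth = 3 layers, widths 4, 4, 1
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases  = [np.zeros(n) for n in layer_sizes[1:]]

x = np.array([0.5, -1.0])
print(forward(x, weights, biases))  # output of the final layer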
- Absolute value rectification fixes the slope α = -1 in the generalized ReLU to obtain g(z) = |z|: the activation function flips negative numbers to positive ones, so the unit responds identically to mirror-image changes of sign in its input.
- Hidden units use activation functions to introduce nonlinearity. Without a nonlinear activation (e.g., a sigmoid or ReLU) the network could only learn linear patterns; the nonlinearity introduces the ability to learn complex, non-linear patterns in the data.
- The method we use to choose an architecture is largely empirical:
(i) train the network on the training data under different settings (depth, width, activation functions), and
(ii) evaluate each trained network; if it does not work well, adjust the settings and train again.
- Choosing a model (a family of functions) means choosing which functions the network is able to learn; deeper rectifier networks can represent some functions with far fewer units than shallow ones.
- There are other architectural considerations beyond depth and width, such as how the layers are connected to each other (see the points on connections below).
Figure (effect of depth): test accuracy increases with the number of layers. Deeper neural networks perform better and generalize better than shallower ones when transcribing multi-digit numbers from photographs of addresses.
1. In a standard neural network, every input unit is connected to every output unit.
2. Specialized networks may have fewer connections, reducing the number of parameters and computation needed.
3. Different applications may require different connection strategies.
4. Convolutional networks in computer vision use sparse connections that work well for that type of problem.
5. Specific advice for architecture design may vary depending on the application.
6. Future chapters will explore more architectural strategies for different application domains.
Figure (effect of the number of parameters): curves for a 3-layer convolutional network, an 11-layer convolutional network, and a fully connected network. Deeper neural networks tend to perform better than shallow ones because they express a preference for composing many simpler functions together, allowing more complex representations and sequential processing steps to be learned.
Back-Propagation and Other Differentiation Algorithms
1. Back-propagation is a method for computing the gradient in a neural network.
2. It allows information to flow backwards through the network to compute the gradient.
3. Back-propagation is used in conjunction with a learning algorithm such as stochastic gradient descent.
4. Back-propagation can be used to compute derivatives for any function, not just neural networks.
5. We will describe how to compute the gradient ∇x f(x, y) for an arbitrary function f, where x is a set of variables whose derivatives are desired, and y is an additional set of variables.
6. The gradient we most often require in learning algorithms is the gradient of the cost function with respect to the parameters, ∇θ J(θ).
7. Back-propagation can also be applied to computing other derivatives and analyzing learned models.
8. The idea of computing derivatives by propagating information through a network is very general and can be used for multiple outputs.
Computational Graphs
1. Using computational graphs helps us describe neural networks more precisely.
2. Each node in the graph represents a variable.
3. An operation is a function of one or more variables, and our graph language is accompanied by a set of allowable operations.
4. The output of an operation is a single variable or a vector.
5. We draw a directed edge from the input variable to the output variable when an operation is applied to the input to produce the output.
Chain Rule of Calculus
- For scalar variables, if y = g(x) and z = f(y), then
dz/dx = (dz/dy)(dy/dx).
- For x ∈ R^m and y ∈ R^n with y = g(x) and z = f(y),
∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i).
- In vector notation this may be equivalently written as
∇x z = (∂y/∂x)ᵀ ∇y z,
where ∂y/∂x is the n×m Jacobian matrix of g. Back-propagation consists of performing such a Jacobian-gradient product for each operation in the graph.
Recursively Applying the Chain Rule (forward propagation)
- Each node u(i) in the computational graph is computed by applying an operation f(i) to the set of its parents,
A(i) = { u(j) : j ∈ Pa(u(i)) },   u(i) = f(i)( A(i) ).
- The forward propagation procedure is:
for i = 1, ..., n_i do
    u(i) ← x_i
end for
for i = n_i + 1, ..., n do
    A(i) ← { u(j) : j ∈ Pa(u(i)) }
    u(i) ← f(i)( A(i) )
end for
return u(n)
- The procedure maps the n_i input values x_1, ..., x_{n_i} to the output value u(n), which is used as the cost (loss).
- The algorithm begins at the inputs and proceeds forward through the graph to the output; back-propagation then starts from the cost and works backwards through the same graph to compute the gradients (a code sketch of the forward pass is given after this section).
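A minimal sketch of the forward propagation procedure in Python, using a dictionary-based graph representation invented for illustration (the node names, the parents structure, and the example graph are all assumptions, not from the notes):

# Example graph for u4 = (u1 + u2) * u2:
#   u3 = u1 + u2,  u4 = u3 * u2
parents = {"u3": ["u1", "u2"], "u4": ["u3", "u2"]}
ops     = {"u3": lambda a, b: a + b, "u4": lambda a, b: a * b}
order   = ["u3", "u4"]           # topological order of the non-input nodes

def forward(inputs):
    u = dict(inputs)             # u(i) <- x_i for the input nodes
    for node in order:           # for i = n_i + 1, ..., n
        args = [u[p] for p in parents[node]]
        u[node] = ops[node](*args)
    return u                     # u["u4"] is u(n), the output

values = forward({"u1": 2.0, "u2": 3.0})
print(values["u4"])              # (2 + 3) * 3 = 15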
Additional Notes: Gradient-Based Learning; Back-Propagation and Other Differentiation Algorithms
- Cost functions for neural networks are more or less the same as those for other parametric models, e.g., linear models.
- If the model can control the density of its outputs, it becomes possible to assign extremely high density to the correct training-set outputs, resulting in cross-entropy approaching negative infinity; regularization is needed to avoid overfitting of this kind.
6.2.1.2 Learning Conditional Statistics
- Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x, e.g., a predictor f(x; θ) that we wish to use to predict the mean of y.
- Using a sufficiently powerful neural network, we can think of the network as being able to represent any function f from a wide class of functions, with this class being limited only by features such as continuity and boundedness (rather than by having a specific parametric form).
- From this point of view, we can view the cost function as being a functional rather than just a function: a functional is a mapping from functions to real numbers. Thus, we can think of learning as choosing a function rather than merely choosing a set of parameters.
- We can design the cost functional to have its minimum at some specific function we desire, e.g., design it to have its minimum lie on the function that maps x to E[y | x]. Solving such an optimization problem requires calculus of variations.
- Two results derived from calculus of variations:
  (a) solving the mean squared error optimization problem yields a function that predicts the mean of y for each x;
  (b) solving the mean absolute error optimization problem yields a function that predicts the median value of y for each x, so long as such a function may be described by the family of functions we optimize over.
- Mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization: some output units that saturate produce very small gradients when combined with these cost functions. This is one reason the cross-entropy cost function is more popular, even when it is not necessary to estimate an entire distribution p(y | x).
6.2.2 Output Units
- The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we use the cross-entropy loss between the data distribution and the model distribution; the choice of how to represent the output then determines the form of the cross-entropy function.
- Any kind of neural network unit that may be used as an output can also be used as a hidden unit.
- Suppose that the feedforward network provides a set of hidden features defined by h = f(x; θ). The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.
6.2.2.1 Linear Units for Gaussian Output Distributions
- A linear output unit is based on an affine transformation with no nonlinearity: given features h, a layer of linear output units produces a vector ŷ = Wᵀh + b.
- Linear output layers are often used to produce the mean of a conditional Gaussian distribution: p(y | x) = N(y; ŷ, I).
- Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.
- Maximum likelihood also makes it straightforward to learn the covariance of the Gaussian, or to make the covariance a function of the input.
6.2.2.2 Sigmoid Units for Bernoulli Output Distributions
- For a binary target y, the output unit is ŷ = σ(wᵀh + b).
- A sigmoid output unit has two components: first, a linear layer computes z = wᵀh + b; then the sigmoid activation function converts z to a probability.
- To define a probability distribution over y using the value z, the sigmoid can be motivated by constructing an unnormalized probability distribution P̃(y) and then dividing by an appropriate constant to obtain a valid probability distribution.
- Begin with the assumption that the unnormalized log probabilities are linear in y and z:
log P̃(y) = yz,
P̃(y) = exp(yz),
P(y) = exp(yz) / Σ_{y′=0}^{1} exp(y′z) = σ( (2y − 1) z ),
where z is called the logit.
- This approach to defining a distribution over binary variables, predicting the probabilities in log-space, is natural to use with maximum likelihood learning.
- Because the cost function used with maximum likelihood is −log P(y | x), the log in the cost function undoes the exp of the sigmoid. This keeps the saturation of the sigmoid from preventing gradient-based learning from making progress.
- The loss function for maximum likelihood learning of a Bernoulli parameterized by a sigmoid is
J(θ) = −log P(y | x) = −log σ( (2y − 1) z ) = ζ( (1 − 2y) z ),
where ζ(x) = log(1 + exp(x)) is the softplus function. (A numerically stable implementation of this loss is sketched below.)
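A numerically stable NumPy sketch of this Bernoulli loss, written directly in terms of the logit z via the softplus identity above (illustrative code, not a library function):

import numpy as np

def softplus(x):
    # zeta(x) = log(1 + exp(x)), computed stably for large |x|
    return np.logaddexp(0.0, x)

def bernoulli_nll(z, y):
    # J = -log P(y | x) = softplus((1 - 2y) * z), with z the logit and y in {0, 1}
    return softplus((1.0 - 2.0 * y) * z)

z = np.array([-4.0, 0.0, 4.0])        # example logits
print(bernoulli_nll(z, y=1))          # small loss when z is large and positive
print(bernoulli_nll(z, y=0))          # small loss when z is large and negative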
6.5 Back-Propagation and Other Differentiation Algorithms
6.5.1 Computational Graphs
- To discuss backprop, it is useful to first develop a computational graph language.
- Let each node in the graph indicate a variable (a scalar, vector, matrix, tensor, or other type).
- Introduce the idea of an operation: a simple function of one or more variables. The graph language is accompanied by a set of allowable operations; functions more complicated than these operations may be described by composing many operations together.
- Without loss of generality, define an operation to return only a single output variable; the output variable can still have multiple entries, e.g., a vector.
- If a variable y is computed by applying an operation to a variable x, then we draw a directed edge from x to y.
- We sometimes annotate the output node with the name of the operation applied, and other times omit the label when the operation is clear from context.
6.5.2 Chain Rule of Calculus
- (Not to be confused with the chain rule of probability.) The chain rule of calculus is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. Backprop is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.
- Scalar case: let x ∈ R, let f, g : R → R, and let y = g(x) and z = f(y) = f(g(x)); then
dz/dx = (dz/dy)(dy/dx).
- This generalizes beyond the scalar case. Suppose x ∈ R^m, y ∈ R^n, g : R^m → R^n, f : R^n → R, y = g(x), and z = f(y); then
∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i),
and in vector notation
∇x z = (∂y/∂x)ᵀ ∇y z,
where ∂y/∂x is the n×m Jacobian matrix of g.
- That is, the gradient with respect to x can be obtained by multiplying a (transposed) Jacobian matrix ∂y/∂x by a gradient ∇y z. The backprop algorithm consists of performing such a Jacobian-gradient product for each operation in the graph.
- Usually, backprop is applied not just to vectors but to tensors of arbitrary dimensionality.
Recursively Applying the Chain Rule to Obtain Backprop
- Assume the nodes of the graph are ordered so that we can compute them one after another. Each node u(i) is associated with an operation f(i) and is computed by evaluating
u(i) = f(i)( A(i) ),
where A(i) is the set of all nodes that are parents of u(i).
- Algorithm 6.1 (forward computation):
for i = 1, ..., n_i do
    u(i) ← x_i
end for
for i = n_i + 1, ..., n do
    A(i) ← { u(j) : j ∈ Pa(u(i)) }
    u(i) ← f(i)( A(i) )
end for
return u(n)
- Algorithm 6.1 specifies the forward propagation computation, which can be placed in a graph G. To perform backprop, we construct a computational graph that depends on G and adds to it an extra set of nodes, one per node of the forward graph; each of these computes the derivative associated with the forward-graph node u(i):
∂u(n)/∂u(i) = Σ_{j : i ∈ Pa(u(j))} ( ∂u(n)/∂u(j) ) ( ∂u(j)/∂u(i) ).
- The amount of computation required for performing backprop scales linearly with the number of edges in G: the computation for each edge corresponds to computing a partial derivative (of one node with respect to one of its parents) as well as performing one multiplication and one addition.
- Algorithm 6.2 (simplified version of the backprop algorithm, for computing the derivatives of u(n) with respect to all the variables in the graph). Simplifications: all variables are scalars, and we compute the derivatives of all nodes in the graph.
Run forward propagation (Algorithm 6.1) to obtain the network activations.
Initialize grad_table, a data structure that will store the computed derivatives; grad_table[u(i)] will hold the value of ∂u(n)/∂u(i).
grad_table[u(n)] ← 1
for j = n − 1 down to 1 do
    grad_table[u(j)] ← Σ_{i : j ∈ Pa(u(i))} grad_table[u(i)] · ∂u(i)/∂u(j)
end for
return { grad_table[u(i)] : i = 1, ..., n_i }
This computes ∂u(n)/∂u(j) = Σ_{i : j ∈ Pa(u(i))} ( ∂u(n)/∂u(i) )( ∂u(i)/∂u(j) ). (A runnable sketch of this table-filling procedure follows.)
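A minimal Python sketch of Algorithm 6.2 for scalar nodes (the graph, its local derivative functions, and the node names are illustrative assumptions, reusing the small example graph from the forward-propagation sketch above):

# Graph for u4 = (u1 + u2) * u2, with scalar nodes (illustrative example).
parents = {"u3": ["u1", "u2"], "u4": ["u3", "u2"]}

# Local partial derivatives d u(i) / d u(parent), evaluated at the forward values u.
local_grad = {
    ("u3", "u1"): lambda u: 1.0,          # d(u1 + u2)/du1
    ("u3", "u2"): lambda u: 1.0,          # d(u1 + u2)/du2
    ("u4", "u3"): lambda u: u["u2"],      # d(u3 * u2)/du3
    ("u4", "u2"): lambda u: u["u3"],      # d(u3 * u2)/du2
}

def backprop(u, output="u4"):
    # grad_table[u(i)] stores d u(n) / d u(i); start with d u(n)/d u(n) = 1.
    grad_table = {output: 1.0}
    for node in ["u3", "u2", "u1"]:                      # reverse topological order
        grad_table[node] = sum(
            grad_table[child] * local_grad[(child, node)](u)
            for child, ps in parents.items() if node in ps
        )
    return grad_table

u = {"u1": 2.0, "u2": 3.0, "u3": 5.0, "u4": 15.0}        # forward values from Algorithm 6.1
print(backprop(u))   # d u4/d u1 = 3, d u4/d u2 = 8, d u4/d u3 = 3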
Example: forward and backward propagation in a deep MLP
- Algorithm 6.3 (forward propagation through a typical deep neural network and computation of the cost function). It maps the parameters to the supervised loss L(ŷ, y) associated with a single training example (x, y), where ŷ is the output of the network when x is provided as input.
Require: l, the network depth
Require: W(i), i ∈ {1, ..., l}, the weight matrices of the model
Require: b(i), i ∈ {1, ..., l}, the bias parameters of the model
Require: x, the input to process
Require: y, the target output
h(0) = x
for k = 1, ..., l do
    a(k) = b(k) + W(k) h(k−1)
    h(k) = f(a(k))
end for
ŷ = h(l)
J = L(ŷ, y) + λΩ(θ)
- Algorithm 6.4 (backward computation for the deep neural network of Algorithm 6.3). This computes the gradients on the activations a(k) for each layer k, starting from the output layer and going backwards to the first hidden layer.
After the forward computation, compute the gradient on the output layer:
g ← ∇ŷ J = ∇ŷ L(ŷ, y)
for k = l, l − 1, ..., 1 do
    Convert the gradient on the layer's output into a gradient on the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
    g ← ∇a(k) J = g ⊙ f′(a(k))
    Compute gradients on weights and biases (including the regularization term, where needed):
    ∇b(k) J = g + λ∇b(k) Ω(θ)
    ∇W(k) J = g h(k−1)ᵀ + λ∇W(k) Ω(θ)
    Propagate the gradients with respect to the next lower-level hidden layer's activations:
    g ← ∇h(k−1) J = W(k)ᵀ g
end for
(A NumPy sketch of these two algorithms is given below.)
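A compact NumPy sketch of Algorithms 6.3 and 6.4 for a small network with tanh units and a squared-error loss; the activation, loss, random weights, and omission of the regularizer Ω are simplifying assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
sizes = [2, 3, 1]                                   # input, hidden, output widths
W = [rng.normal(scale=0.5, size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]
f, df = np.tanh, lambda a: 1.0 - np.tanh(a) ** 2    # activation and its derivative

def forward(x):                                      # Algorithm 6.3 (without regularizer)
    h, A, H = x, [], [x]
    for Wk, bk in zip(W, b):
        a = bk + Wk @ h                              # a(k) = b(k) + W(k) h(k-1)
        h = f(a)                                     # h(k) = f(a(k))
        A.append(a); H.append(h)
    return h, A, H

def backward(x, y):                                  # Algorithm 6.4
    y_hat, A, H = forward(x)
    g = y_hat - y                                    # grad of L = 0.5*||y_hat - y||^2 w.r.t. y_hat
    grads_W, grads_b = [], []
    for k in reversed(range(len(W))):
        g = g * df(A[k])                             # g <- g (elementwise*) f'(a(k))
        grads_b.insert(0, g.copy())                  # gradient on b(k)
        grads_W.insert(0, np.outer(g, H[k]))         # gradient on W(k) = g h(k-1)^T
        g = W[k].T @ g                               # g <- W(k)^T g
    return grads_W, grads_b

gW, gb = backward(np.array([0.5, -1.0]), np.array([1.0]))
print([gw.shape for gw in gW])                       # [(3, 2), (1, 3)], same shapes as W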
- Algorithms 6.3 and 6.4 are simple, specialized algorithms. Modern software implementations are based on a generalized form of backprop that can accommodate any computational graph, by explicitly manipulating a data structure for representing symbolic computation.
6.5.5 Symbol-to-Symbol Derivatives
- Algebraic expressions and computational graphs both operate on symbols, i.e., variables that do not have specific values. These algebraic and graph-based representations are called symbolic representations.
- When actually using or training a neural network, symbolic inputs are replaced with specific numeric values.
- Some approaches to backprop take a "symbol-to-number" approach to differentiation: take a computational graph and a set of numerical values for the inputs to the graph, and return a set of numerical values describing the gradient at those input values (e.g., Torch, Caffe).
- Another approach is "symbol-to-symbol" differentiation: take a computational graph and add additional nodes to the graph that provide a symbolic description of the desired derivatives (e.g., Theano, TensorFlow).
- In this approach the derivatives are described in the same language as the original expression. Because the derivatives are just another computational graph, it is possible to run backprop again, differentiating the derivatives in order to obtain higher derivatives.
- Since the derivative rules are available symbolically, this allows us to avoid specifying exactly when each operation should be computed.
- We will describe backprop in terms of the computational graph built by the symbol-to-symbol approach, in which exactly the same computations are performed.
6.5.6 General Back-Propagation
- To compute the gradient of some scalar z with respect to one of its ancestors x in the graph:
  begin by observing that the gradient with respect to z itself is 1;
  compute the gradient with respect to each parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that produced z;
  continue multiplying by Jacobians, traveling backwards through the graph in this way, until we reach x;
  for any node that may be reached by going backwards from z through two or more paths, simply sum the gradients arriving from the different paths at that node.
- More formally, each node in the graph corresponds to a variable, described as a tensor V (subsuming scalars, vectors, and matrices).
- Each operation is associated with a bprop method that computes a Jacobian-vector product, i.e., the chain rule ∇X z = Σ_j (∇X Y_j) ∂z/∂Y_j.
- Example: consider a matrix multiplication operation creating a variable C = AB, and let the gradient of a scalar z with respect to C be given by G. The matrix multiplication operation is responsible for defining two backprop rules, one for each of its input arguments:
  if we call bprop to request the gradient with respect to A, given that the gradient on the output is G, then bprop must state that the gradient with respect to A is given by G Bᵀ;
  if we call bprop to request the gradient with respect to B, bprop must state that the gradient is Aᵀ G.
- The backprop algorithm itself does not need to know any differentiation rules; it only needs to call each operation's bprop rules with the right arguments. Formally, op.bprop(inputs, X, G) must return Σ_i (∇X op.f(inputs)_i) G_i, where:
  inputs is the list of inputs supplied to the operation,
  op.f is the mathematical function that the operation implements,
  X is the input whose gradient we wish to compute, and
  G is the gradient on the output of the operation.
- The op.bprop method should always treat all of its inputs as distinct from each other, even if they are not; e.g., if two copies of x are input to compute x², the derivative with respect to each input should still be x.
- Software implementations of backprop usually provide both the operations and their bprop methods; when building a new implementation of backprop, or adding a custom operation to an existing library, one usually needs to derive the op.bprop method for the new operation.
(A sketch of the matrix-multiplication bprop rules is given below.)
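A small NumPy check of the two matrix-multiplication bprop rules above, verified against numerical differentiation (the helper names are invented for this sketch):

import numpy as np

def bprop_matmul_A(A, B, G):
    # Gradient of z w.r.t. A when C = A @ B and dz/dC = G
    return G @ B.T

def bprop_matmul_B(A, B, G):
    # Gradient of z w.r.t. B when C = A @ B and dz/dC = G
    return A.T @ G

rng = np.random.default_rng(0)
A, B = rng.normal(size=(2, 3)), rng.normal(size=(3, 4))
G = np.ones((2, 4))                       # pretend z = sum(C), so dz/dC is all ones

# Numerical check of one entry of dz/dA
eps = 1e-6
z0 = (A @ B).sum()
A_pert = A.copy(); A_pert[0, 1] += eps
num = ((A_pert @ B).sum() - z0) / eps
print(bprop_matmul_A(A, B, G)[0, 1], num)  # the two values agree (up to rounding)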
- Algorithm 6.5 is the outermost skeleton of the general backprop algorithm; it handles simple setup and cleanup, and most of the important work happens in the build_grad subroutine of Algorithm 6.6.
Require: T, the target set of variables whose gradients must be computed
Require: grad_table, a data structure associating variables with their gradients
Require: G, the graph to modify
Require: G′, the restriction of G to the nodes that participate in the gradient
For each variable V in T, call build_grad(V, G, G′, grad_table).
- Algorithm 6.6, build_grad(V, G, G′, grad_table):
if V is in grad_table then
    return grad_table[V]
end if
i ← 1
for C in get_consumers(V, G′) do
    op ← get_operation(C)
    D ← build_grad(C, G, G′, grad_table)
    grad(i) ← op.bprop(get_inputs(C, G′), V, D)
    i ← i + 1
end for
grad ← Σ_i grad(i)
grad_table[V] ← grad
Insert grad and the operations creating it into G
return grad
(A minimal Python sketch of this recursive procedure follows.)
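A minimal Python sketch of the build_grad idea for scalar nodes; the graph, the consumers helper, and the bprop rules are invented for illustration and reuse the small example graph from the earlier sketches:

# Graph for u4 = (u1 + u2) * u2: node -> (operation name, list of inputs)
graph = {
    "u3": ("add", ["u1", "u2"]),
    "u4": ("mul", ["u3", "u2"]),
}
values = {"u1": 2.0, "u2": 3.0, "u3": 5.0, "u4": 15.0}   # from forward propagation

def consumers(v):
    # Nodes that use v as an input
    return [n for n, (_, ins) in graph.items() if v in ins]

def bprop(op, inputs, wrt, upstream):
    # d(output)/d(wrt) * upstream, for the two operations used here
    a, b = (values[i] for i in inputs)
    if op == "add":
        return upstream
    if op == "mul":
        return upstream * (b if wrt == inputs[0] else a)

def build_grad(v, grad_table, output="u4"):
    if v in grad_table:
        return grad_table[v]
    total = 0.0
    for c in consumers(v):
        op, inputs = graph[c]
        d = build_grad(c, grad_table, output)        # recurse on the consumer
        total += bprop(op, inputs, v, d)
    grad_table[v] = total
    return total

grad_table = {"u4": 1.0}                              # d u4 / d u4 = 1
print(build_grad("u1", grad_table), build_grad("u2", grad_table))   # 3.0, 8.0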
Computational Cost
- With the algorithm specified, we can examine its cost in terms of the number of operations. Assume that each operation evaluation has roughly the same cost, and analyze the computational cost in terms of the number of operations executed.
- Most neural networks are roughly chain-structured, so backprop has a cost that grows only linearly with the size of the graph.
- The potentially exponential cost of the naive approach can be seen by expanding and rewriting the recursive chain rule non-recursively:
∂u(n)/∂u(j) = Σ over all paths ( u(π1), u(π2), ..., u(πt) ) from π1 = j to πt = n of Π_{k=2}^{t} ∂u(πk)/∂u(π(k−1)),
i.e., a sum over all paths from node j to node n, multiplying all the derivatives along each path.
- The number of paths from node j to node n can grow exponentially in the length of these paths, so the number of terms in the sum above (the number of paths) can grow exponentially with the depth of the forward propagation graph.
- This large cost is incurred because the same sub-derivatives ∂u(i)/∂u(j) would be recalculated many times. Backprop is a dynamic programming strategy that avoids these recomputations.
- It can be thought of as a table-filling algorithm that takes advantage of storing intermediate results ∂u(n)/∂u(i): each node in the graph has a corresponding slot in a table to store the gradient for that node. By filling in these table entries in order, backprop avoids repeatedly evaluating common subexpressions.
- Example cost with weight decay: for a one-hidden-layer network trained with maximum likelihood and weight decay on both weight matrices, the total cost is
J = J_MLE + λ ( Σ_{i,j} ( W(1)_{i,j} )² + Σ_{i,j} ( W(2)_{i,j} )² ).