
Feed-forward Neural Networks (Part 2: learning)
Outline (part 2)
‣ Learning feed-forward neural networks
‣ SGD and back-propagation
Learning neural networks
Simple example
‣ A long chain-like neural network

[Diagram: chain network x → z1 → f1 → z2 → f2 → … → zL → fL, where weight wi produces zi from the previous activation and fi is the activation of zi]
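A minimal sketch of the forward pass for this chain, assuming tanh activations (the slides do not fix the activation function, so tanh here is an illustrative choice):

```python
import numpy as np

def chain_forward(x, w):
    """Forward pass through the chain: z_i = w_i * f_{i-1}, f_i = tanh(z_i), with f_0 = x."""
    f = x
    for w_i in w:
        z = w_i * f          # pre-activation of layer i
        f = np.tanh(z)       # activation of layer i (illustrative choice)
    return f                 # f_L, the network output

print(chain_forward(x=0.5, w=[1.0, -2.0, 0.5]))   # a chain of length L = 3
```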
Back-propagation
[Diagram: the same chain x → z1 → f1 → … → zL → fL, with derivatives propagated backwards from the loss to each weight wi]
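A sketch of back-propagation for the same chain. The tanh activations and squared loss are illustrative assumptions, not values fixed by the slides; the point is that the derivative with respect to each wi is obtained by walking the chain rule backwards, reusing the intermediate activations from the forward pass:

```python
import numpy as np

def chain_grad(x, w, y):
    """Back-propagation through the chain (illustrative: tanh units, squared loss (f_L - y)^2).
    Returns dLoss/dw_i for every layer."""
    # forward pass, remembering all intermediate activations f_0 .. f_L
    fs = [x]
    for w_i in w:
        fs.append(np.tanh(w_i * fs[-1]))
    # backward pass: walk the chain rule from the loss back to each weight
    grads = [0.0] * len(w)
    df = 2.0 * (fs[-1] - y)                  # dLoss/df_L
    for i in reversed(range(len(w))):
        dz = df * (1.0 - fs[i + 1] ** 2)     # tanh'(z_i) = 1 - f_i^2
        grads[i] = dz * fs[i]                # dz_i/dw_i = f_{i-1}
        df = dz * w[i]                       # dz_i/df_{i-1} = w_i
    return grads

print(chain_grad(x=0.5, w=[1.0, -2.0, 0.5], y=1.0))
```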
2 hidden units: training
[Figures: initial network (hidden units 1 and 2) and average hinge loss per epoch]
2 hidden units: training
‣ After ~10 passes through the data
[Figure: hidden unit activations (units 1 and 2) after training]
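A minimal sketch of the kind of training run these plots summarize, assuming tanh hidden units, a linear output unit, and the hinge loss named in the figure; the learning rate, initialization scale, and toy data are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_sgd(X, y, n_hidden=2, epochs=10, lr=0.05):
    """SGD on a one-hidden-layer network with hinge loss (illustrative sketch)."""
    d = X.shape[1]
    W = 0.1 * rng.standard_normal((n_hidden, d))   # hidden-unit weights
    b = np.zeros(n_hidden)                         # hidden-unit offsets
    v = 0.1 * rng.standard_normal(n_hidden)        # output weights
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            h = np.tanh(W @ X[i] + b)              # hidden activations
            score = v @ h
            if y[i] * score < 1:                   # hinge loss max(0, 1 - y*score) is active
                dh = y[i] * v * (1.0 - h ** 2)     # back-prop through tanh
                v += lr * y[i] * h                 # gradient step on output weights
                W += lr * np.outer(dh, X[i])       # gradient step on hidden weights
                b += lr * dh                       # gradient step on offsets
    return W, b, v

# Toy usage: 2-D points with labels in {-1, +1}
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)
W, b, v = train_sgd(X, y)
```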
10 hidden units
‣ Randomly initialized weights (zero offset) for the hidden units
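A minimal sketch of the initialization described above (random weight directions, all offsets zero); the dimensions and scale are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, d = 10, 2                               # 10 hidden units, 2-D inputs (illustrative)
W = 0.1 * rng.standard_normal((n_hidden, d))      # random hidden-unit weight directions
b = np.zeros(n_hidden)                            # zero offsets: every unit's boundary starts at the origin
```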
10 hidden units
‣ After ~10 epochs the hidden units are arranged in a manner sufficient for the task (but not otherwise perfect)
Decisions (and a harder task)
‣ 2 hidden units can no longer solve this task
[Figures: decision boundaries with 10 hidden units and with 100 hidden units]
Decision boundaries
‣ Symmetries introduced in initialization can persist… (e.g., with zero offsets every hidden-unit boundary initially passes through the origin)
[Figures: 100 hidden units with zero offset initialization vs. 100 hidden units with random offset initialization]
Size, optimization
‣ Many recent architectures use ReLU units (cheap to evaluate, sparsity); see the sketch below
‣ Easier to learn as large models…
[Figures: decision boundaries with 10, 100, and 500 hidden units]
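A small sketch of the ReLU point above: evaluating it is a single elementwise comparison (cheap), and it returns exact zeros for non-positive inputs (sparse activations). The comparison with tanh is illustrative:

```python
import numpy as np

def relu(z):
    # Cheap: one elementwise max; exactly zero for z <= 0, so activations are sparse
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))      # [0.  0.  0.  0.5 2. ]  -> many exact zeros
print(np.tanh(z))   # smooth, never exactly zero, needs exp() to evaluate
```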
Summary (part 2)
‣ Neural networks can be learned with SGD similarly to linear classifiers
‣ The derivatives necessary for SGD can be evaluated efficiently via back-propagation
‣ Multi-layer neural network models are complicated… we are no longer guaranteed to reach a global (only a local) optimum with SGD
‣ Larger models tend to be easier to learn… units only need to be adjusted so that they are, collectively, sufficient to solve the task
