Winter1516 Lecture52
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 5, 20 Jan 2016

A bit of history

Frank Rosenblatt, ~1957: Perceptron

The Mark I Perceptron machine was the first implementation of the perceptron algorithm. The machine was connected to a camera that used 20×20 cadmium sulfide photocells to produce a 400-pixel image, and it recognized letters of the alphabet.

update rule:

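The rule itself appears on the original slide only as an image. For reference, a standard statement of the perceptron update (my reconstruction, not copied from the slide) is

$$w \leftarrow w + \alpha\,(d - y)\,x$$

where $x$ is the input vector, $y$ the perceptron's output, $d$ the desired output, and $\alpha$ the learning rate; the weights only change when the prediction is wrong.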
A bit of history

Widrow and Hoff, ~1960: Adaline/Madaline

A bit of history

Rumelhart et al. 1986: first time back-propagation became popular (recognizable maths)


A bit of history

[Hinton and Salakhutdinov 2006]

Reinvigorated research in Deep Learning

First strong results

Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition
George Dahl, Dong Yu, Li Deng, Alex Acero, 2010

ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton, 2012

Overview

1. One time setup:
   activation functions, preprocessing, weight initialization, regularization, gradient checking
2. Training dynamics:
   babysitting the learning process, parameter updates, hyperparameter optimization
3. Evaluation:
   model ensembles

Activation Functions

- Sigmoid
- tanh: tanh(x)
- ReLU: max(0, x)
- Leaky ReLU: max(0.1x, x)
- Maxout
- ELU
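As a quick reference, here is a minimal NumPy sketch of the elementwise activations listed above. It is my own code, not from the lecture; the sigmoid and ELU formulas are the standard definitions, which the extracted text does not show.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                       # squashes to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)               # max(0, x)

def leaky_relu(x, alpha=0.1):
    return np.maximum(alpha * x, x)         # max(0.1x, x) for alpha = 0.1

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))   # standard ELU form

# Maxout is not elementwise: it takes the max over several linear projections
# of the input, so it does not fit this one-argument form.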
Activation Functions: Sigmoid

- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

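For reference, the sigmoid itself appears on the slide only as a plot; its standard definition is

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$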
3 problems:

1. Saturated neurons “kill” the gradients

Consider a single sigmoid gate with input x:

- What happens when x = -10?
- What happens when x = 0?
- What happens when x = 10?
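A quick numeric check makes the answer concrete (a sketch of my own, not from the slides); the local gradient of the sigmoid gate is σ(x)(1 − σ(x)):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)    # local gradient dsigma/dx

for x in [-10.0, 0.0, 10.0]:
    print(x, sigmoid(x), sigmoid_grad(x))

# x = -10: output ~0.000045, gradient ~0.000045 -> saturated, gradient is "killed"
# x =   0: output 0.5,       gradient 0.25      -> the maximum local gradient
# x =  10: output ~0.999955, gradient ~0.000045 -> saturated again

Whatever gradient arrives from above gets multiplied by this tiny local gradient during backprop, so saturated sigmoid neurons pass almost nothing back.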

2. Sigmoid outputs are not zero-centered

Consider what happens when the input to a neuron (x) is always positive:

What can we say about the gradients on w?

Always all positive or all negative :(
The allowed gradient update directions then cover only two quadrants, so the weights have to take a zig-zag path toward a hypothetical optimal w vector.
(this is also why you want zero-mean data!)
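To make this precise (my own derivation in standard notation, not copied from the slides): for a neuron $f = \sum_i w_i x_i + b$ feeding into a loss $L$,

$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial f}\,\frac{\partial f}{\partial w_i} = \frac{\partial L}{\partial f}\,x_i .$$

If every $x_i > 0$ (as it is when the previous layer used a sigmoid), every $\partial L/\partial w_i$ shares the sign of the single scalar $\partial L/\partial f$, so the gradient on $w$ is either all positive or all negative, which is what forces the zig-zag updates above.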
3. exp() is a bit compute expensive

Activation Functions: tanh(x)

- Squashes numbers to range [-1,1]
- zero centered (nice)
- still kills gradients when saturated :(

[LeCun et al., 1991]
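A short note of my own on why tanh still saturates: its local gradient is

$$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x),$$

which goes to 0 as tanh(x) approaches -1 or +1, exactly the flat regions of the curve. (tanh is in fact a scaled, shifted sigmoid: $\tanh(x) = 2\sigma(2x) - 1$.)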

Activation Functions: ReLU (Rectified Linear Unit)

- Computes f(x) = max(0, x)
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)

[Krizhevsky et al., 2012]

Problems:

- Not zero-centered output
- An annoyance (hint: what is the gradient when x < 0?)

Consider a single ReLU gate with input x:

- What happens when x = -10?
- What happens when x = 0?
- What happens when x = 10?
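The same kind of numeric check as for the sigmoid gate (again a sketch of my own, not from the slides):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # local gradient: 1 where x > 0, 0 where x < 0
    # (at x = 0 the gradient is undefined; implementations typically just use 0)
    return np.where(x > 0, 1.0, 0.0)

xs = np.array([-10.0, 0.0, 10.0])
print(relu(xs))        # [ 0.  0. 10.]
print(relu_grad(xs))   # [0. 0. 1.]

# x = -10: output 0 and local gradient 0, so no gradient flows back at all
# x =  10: the gate is a pass-through with local gradient exactly 1 (no saturation)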

