Lecture 2: Basic Artificial Neural Networks
Xuming He
SIST, ShanghaiTech
Fall 2020
Logistics
Course project
Each team consists of 3 to 5 members
An exception may be made if you are among the top 10% in the first 3 quizzes
Full course schedule on Piazza
HW1 out next Monday
Tutorial schedule: please vote on Piazza
TA office hours
See Piazza for detailed schedule and location
Outline
Artificial neuron
Perceptron algorithm
Single layer neural networks
Network models
Example: Logistic Regression
Multi-layer neural networks
Limitations of single layer networks
Networks with single hidden layer
Acknowledgement: Hugo Larochelle’s, Mehryar Mohri@NYU’s & Yingyu
Liang@Princeton’s course notes
Mathematical model of a neuron
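For reference, a standard formulation of the artificial neuron: a weighted sum of the inputs followed by a nonlinear activation.

\[
a = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{d} w_i x_i + b, \qquad y = g(a)
\]

where \(g(\cdot)\) is an activation function, e.g. the sigmoid \(g(a) = 1/(1 + e^{-a})\), tanh, or a hard threshold.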
Single neuron as a linear classifier
Binary classification
How do we determine the weights?
Learning problem
Linear classification
Learning problem: simple approach
Drawback: sensitive to “outliers”
1D Example
Compare two predictors
Perceptron algorithm
Learn a single neuron for binary classification
[Link]
Perceptron algorithm
Learn a single neuron for binary classification
Task formulation
Perceptron algorithm
Algorithm outline
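A minimal NumPy sketch of the algorithm (illustrative; names such as max_epochs are just for this sketch), assuming labels in {-1, +1} and the bias absorbed into w via a constant input feature:

import numpy as np

def perceptron(X, y, max_epochs=100):
    # X: (n, d) inputs with a constant 1 appended for the bias; y: labels in {-1, +1}
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            # a point is misclassified when y_i * <w, x_i> <= 0
            if y[i] * np.dot(w, X[i]) <= 0:
                w = w + y[i] * X[i]   # correct the current mistake
                mistakes += 1
        if mistakes == 0:             # all points classified correctly: stop
            break
    return w

If the data are linearly separable the loop terminates after finitely many updates (see the theorem below); otherwise it stops after max_epochs passes.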
Perceptron algorithm
Intuition: correct the current mistake
Perceptron algorithm
The Perceptron theorem
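A standard statement of the theorem (assuming the bias is absorbed into \(\mathbf{w}\)): if \(\|\mathbf{x}_i\| \le R\) for all \(i\), and there exists a unit vector \(\mathbf{w}^*\) with margin \(y_i\, \mathbf{w}^{*\top} \mathbf{x}_i \ge \gamma > 0\) for all \(i\), then the perceptron algorithm makes at most

\[
\left(\frac{R}{\gamma}\right)^2
\]

mistakes (weight updates), regardless of the order in which the examples are presented.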
Hyperplane Distance
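For reference, the distance from a point \(\mathbf{x}\) to the hyperplane \(\mathbf{w}^\top \mathbf{z} + b = 0\), used in the margin argument of the proof:

\[
d(\mathbf{x}) = \frac{|\mathbf{w}^\top \mathbf{x} + b|}{\|\mathbf{w}\|}
\]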
Perceptron algorithm
The Perceptron theorem: proof
Perceptron algorithm
The Perceptron theorem: proof
Perceptron algorithm
The Perceptron theorem: proof intuition
Perceptron algorithm
The Perceptron theorem: proof
Perceptron algorithm
The Perceptron theorem
Perceptron Learning problem
What loss function is minimized?
Perceptron algorithm
What loss function is minimized?
Perceptron algorithm
What loss function is minimized?
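The answer implied by the update rule: each perceptron update is a stochastic (sub)gradient step, with step size 1, on

\[
L(\mathbf{w}) = \sum_{i=1}^{n} \max\left(0,\; -y_i\, \mathbf{w}^\top \mathbf{x}_i\right)
\]

which is zero for correctly classified points and grows linearly with how far a misclassified point lies on the wrong side of the boundary.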
Outline
Artificial neuron
Perceptron algorithm
Single layer neural networks
Network models
Example: Logistic Regression
Multi-layer neural networks
Limitations of single layer networks
Networks with single hidden layer
Acknowledgement: Hugo Larochelle’s, Mehryar Mohri@NYU’s & Yingyu
Liang@Princeton’s course notes
Single layer neural network
Single layer neural network
Single layer neural network
What is the output?
Element-wise nonlinear functions
Independent feature/attribute detectors
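Illustrative NumPy implementations of common element-wise activations; each acts independently on every unit's pre-activation, so each hidden unit behaves as an independent feature/attribute detector:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))   # squashes each entry to (0, 1)

def tanh(a):
    return np.tanh(a)                 # squashes each entry to (-1, 1)

def relu(a):
    return np.maximum(0.0, a)         # zero for negative entries, identity otherwise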
What is the output?
Nonlinear functions with vector input
Competition between neurons
What is the output?
Nonlinear functions with vector input
Example: Winner-Take-All (WTA)
A probabilistic perspective
Change the output nonlinearity
From WTA to Softmax function
Multiclass linear classifiers
The WTA prediction: one-hot encoding of its predicted label
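A small sketch contrasting the hard winner-take-all output (a one-hot encoding of the predicted class) with its soft, differentiable relaxation, the softmax:

import numpy as np

def wta(scores):
    # hard decision: one-hot vector with a 1 at the highest-scoring class
    onehot = np.zeros_like(scores, dtype=float)
    onehot[np.argmax(scores)] = 1.0
    return onehot

def softmax(scores):
    # soft decision: a probability distribution over the classes
    z = scores - np.max(scores)       # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)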
Probabilistic outputs
How to learn a multiclass classifier?
Define a loss function and minimize it
Learning a multiclass linear classifier
Design a loss function for multiclass classifiers
Perceptron?
Yes, see homework
Hinge loss
The SVM and max-margin loss (see CS231n); a standard form is given after this list
Probabilistic formulation
Log loss and logistic regression
Generalization issue
Avoid overfitting by regularization
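For reference, a standard multiclass hinge (max-margin) loss over class scores \(s_k = \mathbf{w}_k^\top \mathbf{x}_i\), in the form used in CS231n:

\[
L_i = \sum_{k \ne y_i} \max\left(0,\; s_k - s_{y_i} + 1\right)
\]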
Example: Logistic Regression
Learning loss: negative log likelihood
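With softmax outputs \(p(y = k \mid \mathbf{x}) = \exp(\mathbf{w}_k^\top \mathbf{x}) / \sum_j \exp(\mathbf{w}_j^\top \mathbf{x})\), the negative log likelihood over training pairs \((\mathbf{x}_n, y_n)\) is

\[
L(\mathbf{W}) = -\sum_{n=1}^{N} \log p(y_n \mid \mathbf{x}_n; \mathbf{W})
= -\sum_{n=1}^{N} \left[ \mathbf{w}_{y_n}^\top \mathbf{x}_n - \log \sum_{k} \exp(\mathbf{w}_{k}^\top \mathbf{x}_n) \right]
\]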
Logistic Regression
Learning loss: example
Logistic Regression
Learning loss: questions
Logistic Regression
Learning loss: questions
Learning with regularization
Constraints on hypothesis space
Similar to Linear Regression
Learning with regularization
Regularization terms
Priors on the weights
Bayesian: integrating out weights
Empirical: computing MAP estimate of W
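A sketch of the MAP view: the regularized objective is the negative log-posterior, and the choice of prior determines the regularizer.

\[
\mathbf{W}_{\mathrm{MAP}} = \arg\max_{\mathbf{W}} \left[ \log p(\mathcal{D} \mid \mathbf{W}) + \log p(\mathbf{W}) \right]
\]

A zero-mean Gaussian prior gives \(\log p(\mathbf{W}) = -\lambda \|\mathbf{W}\|_2^2 + \mathrm{const}\) (an L2 / weight-decay penalty); a Laplace prior gives \(-\lambda \|\mathbf{W}\|_1 + \mathrm{const}\) (an L1 penalty).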
L1 vs L2 regularization
[Link]
L1 vs L2 regularization
Sparsity
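An illustrative sketch (function names are only for this example) of how the two penalties enter the objective and its gradient; the constant-magnitude gradient of the L1 term is what pushes weights exactly to zero, producing sparsity:

import numpy as np

def penalty(w, lam, kind="l2"):
    if kind == "l2":
        return lam * np.sum(w ** 2)      # shrinks all weights smoothly
    return lam * np.sum(np.abs(w))       # encourages exact zeros

def penalty_grad(w, lam, kind="l2"):
    if kind == "l2":
        return 2.0 * lam * w             # gradient proportional to the weight
    return lam * np.sign(w)              # constant-magnitude (sub)gradient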
Optimization: gradient descent
Gradient descent
Learning rate matters
Optimization: gradient descent
Stochastic gradient descent
Optimization: gradient descent
Stochastic gradient descent
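A minimal, illustrative mini-batch SGD loop for softmax (logistic) regression; lr is the learning rate, and each step uses the gradient of the negative log likelihood on a random mini-batch:

import numpy as np

def sgd_softmax_regression(X, y, num_classes, lr=0.1, batch_size=32, epochs=10):
    # X: (n, d) inputs; y: integer labels in {0, ..., num_classes - 1}
    n, d = X.shape
    W = np.zeros((d, num_classes))
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            scores = Xb @ W                              # (batch, K) class scores
            scores -= scores.max(axis=1, keepdims=True)  # numerical stability
            probs = np.exp(scores)
            probs /= probs.sum(axis=1, keepdims=True)    # softmax probabilities
            probs[np.arange(len(yb)), yb] -= 1.0          # d(NLL)/d(scores) = probs - onehot(y)
            W -= lr * (Xb.T @ probs) / len(yb)            # gradient step on the mini-batch
    return W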
Interpreting network weights
What are those weights?
Outline
Artificial neuron
Perceptron algorithm
Single layer neural networks
Network models
Example: Logistic Regression
Multi-layer neural networks
Limitations of single layer networks
Networks with single hidden layer
Acknowledgement: Hugo Larochelle’s, Mehryar Mohri@NYU’s & Yingyu
Liang@Princeton’s course notes
Capacity of single neuron
Binary classification
A neuron with sigmoid activation estimates p(y = 1 | x)
Its decision boundary is linear, determined by its weights
Capacity of single neuron
Can solve linearly separable problems
Examples
Capacity of single neuron
Can’t solve non-linearly separable problems
Can we use multiple neurons to achieve this?
Capacity of single neuron
Can’t solve non-linearly separable problems
Unless the input is transformed into a better representation
Capacity of single neuron
Can’t solve non-linearly separable problems
Unless the input is transformed into a better representation
Adding one more layer
Single hidden layer neural network
A 2-layer neural network (the input units are not counted as a layer)
Q: What if using linear activation in hidden layer?
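A forward-pass sketch of a single-hidden-layer network. It also answers the question above: with a linear (identity) hidden activation the two layers collapse into one linear map, since W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), so the hidden nonlinearity is essential.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)   # hidden layer: element-wise nonlinearity
    y = W2 @ h + b2            # output layer (linear output unit)
    return y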
Capacity of neural network
Single hidden layer neural network
Partition the input space into regions
Capacity of neural network
Single hidden layer neural network
Form a stump/delta function
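A concrete illustration of the stump/delta construction: the difference of two steep sigmoid units approximates the indicator of an interval, the building block behind partitioning the input space into regions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def stump(x, left, right, steepness=50.0):
    # approximately 1 for left < x < right and approximately 0 elsewhere;
    # realized by two hidden sigmoid units with output weights +1 and -1
    return sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right))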
Capacity of neural network
Single hidden layer neural network
Multi-layer perceptron
Boolean case
Multilayer perceptrons (MLPs) can compute more complex
Boolean functions
MLPs can compute any Boolean function
Since they can emulate individual gates
MLPs are universal Boolean functions
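A small sketch of how threshold neurons emulate individual gates, and how two layers compute XOR (which no single neuron can):

def step(a):
    # hard-threshold activation of a single neuron
    return 1.0 if a > 0 else 0.0

def AND(x1, x2):
    return step(x1 + x2 - 1.5)

def OR(x1, x2):
    return step(x1 + x2 - 0.5)

def NAND(x1, x2):
    return step(1.5 - x1 - x2)

def XOR(x1, x2):
    # requires two layers: XOR = AND(OR(x1, x2), NAND(x1, x2))
    return AND(OR(x1, x2), NAND(x1, x2))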
Capacity of neural network
Universal approximation
Theorem (Hornik, 1991)
A single hidden layer neural network with a linear output unit can approximate any continuous function on a compact domain arbitrarily well, given enough hidden units.
The result holds for the sigmoid, tanh, and many other hidden-layer activation functions
Caveat: a nice theoretical result, but not directly useful in practice
How many hidden units?
How to find the parameters by a learning algorithm?
General neural network
Multi-layer neural network
Multilayer networks
Why more layers (deeper)?
A deep architecture can represent certain functions more
compactly
(Montufar et al., NIPS’14)
Functions representable with a deep rectifier net can require an exponential number of hidden units in a shallow one.
Why more layers (deeper)?
A deep architecture can represent certain functions more
compactly
Example: Boolean functions
There are Boolean functions that require an exponential number of hidden units in the single-layer case, but only a polynomial number of hidden units if we can adapt the number of layers
Example: multivariate polynomials (Rolnick & Tegmark, ICLR’18)
Total number of neurons m required to approximate natural classes
of multivariate polynomials of n variables
grows only linearly with n for deep neural networks, but grows
exponentially when merely a single hidden layer is allowed.
Why more layers (deeper)?
Summary
Artificial neurons
Single-layer network
Multi-layer neural networks
Next time
Computation in neural networks
Convolutional neural networks