CS771: Introduction To Machine Learning Piyush Rai
[Figure: a feedforward neural network with one hidden layer]
Input Layer (with D = 3 visible units)
Hidden Layer (with K = 2 hidden units): the hidden units/nodes act as new features, connected to the inputs via learnable weights
Output Layer (with a scalar-valued output)
The effective input-to-output mapping is nonlinear (will see justification shortly)
Illustration: Neural Net with One Hidden Layer
Each input is transformed into several pre-activations using linear models.
The output activation can even be the identity (e.g., for regression, y_n = s_n).
Note: the hidden layer's pre-activations and post-activations will be shown together for brevity, and we will directly show the final output.
Different layers may use different nonlinear activations; the output layer may have none.
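As a concrete sketch of this computation (the variable names, sizes, and the sigmoid activation are illustrative assumptions, not taken from the slides), one forward pass through a single hidden layer might look as follows:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Illustrative sizes: D = 3 inputs, K = 2 hidden units, scalar output
D, K = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(K, D))   # hidden-layer weights (learnable)
b = np.zeros(K)               # hidden-layer biases
v = rng.normal(size=K)        # output-layer weights

x = rng.normal(size=D)        # one input
s = W @ x + b                 # pre-activations: one linear model per hidden unit
h = sigmoid(s)                # post-activations: the hidden units act as new features
y = v @ h                     # final output (identity output activation, as in regression)
```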
Activation Functions: Some Common Choices
Sigmoid and tanh: for both, the gradients saturate (become close to zero) as the function tends to its extreme values.
tanh is preferred over sigmoid: it helps keep the mean of the next layer's inputs close to zero (with sigmoid, it is close to 0.5).
ReLU and Leaky ReLU are among the most popular choices; Leaky ReLU helps fix the "dead neuron" problem of ReLU when the pre-activation is a negative number.
Without a nonlinear activation, a deep neural network is equivalent to a linear model, no matter how many layers we use.
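A minimal NumPy sketch of these four activations (the 0.01 slope for Leaky ReLU is a commonly used default, assumed here rather than taken from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))      # output in (0, 1); gradients saturate at extreme values

def tanh(a):
    return np.tanh(a)                     # output in (-1, 1); keeps next layer's inputs near zero mean

def relu(a):
    return np.maximum(0.0, a)             # zero for negative pre-activations ("dead neurons" possible)

def leaky_relu(a, slope=0.01):
    return np.where(a > 0, a, slope * a)  # small slope for negative inputs helps fix dead neurons
```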
MLP Can Learn Nonlinear Functions: A Brief Justification
An MLP can be seen as a composition of multiple linear models, combined nonlinearly.
Each individual linear model's score increases monotonically in one direction (a one-sided increase), which is not ideal for learning nonlinear decision boundaries.
Composing two such one-sided increasing score functions (using output weights of 1 and -1 to "flip" the second one before adding) gives a score that is high in the middle and low on either side of it.
This is exactly what we want for the given nonlinear classification problem: the composition can now learn a nonlinear decision boundary.
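A small numerical sketch of this justification (the sigmoid activation and the shift values are illustrative assumptions): subtracting one shifted one-sided increasing function from another produces a score that is high in the middle and low on both sides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-5, 5, 11)

# Two one-sided, monotonically increasing score functions (shifted sigmoids)
score1 = sigmoid(x + 2)    # "turns on" around x = -2
score2 = sigmoid(x - 2)    # "turns on" around x = +2

# Combine with weights 1 and -1: the second score is "flipped" before adding
combined = 1.0 * score1 + (-1.0) * score2

print(np.round(combined, 2))   # high near the middle, low on either side
```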
Single Hidden Layer and a Single Output
One hidden layer with K nodes and a single output (e.g., scalar-valued regression or binary classification)
Single Hidden Layer and Multiple Outputs
One hidden layer with K nodes and a vector of outputs (e.g., vector-valued regression, multi-class classification, or multi-label classification)
Multiple Hidden Layers (One/Multiple Outputs)
Most general case: multiple hidden layers (with the same or a different number of hidden nodes in each) and a scalar- or vector-valued output
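A minimal NumPy sketch of this general case (the layer sizes, the ReLU activation, and the random initialization are illustrative assumptions):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def mlp_forward(x, weights, biases):
    """Forward pass through an MLP with any number of hidden layers.

    weights/biases: lists of per-layer parameters; layer sizes may differ.
    The output layer here has no nonlinearity (e.g., regression or pre-softmax scores).
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                  # hidden layers: linear map + nonlinearity
    return weights[-1] @ h + biases[-1]      # output layer: linear only

# Illustrative sizes: 3 inputs -> 4 hidden -> 4 hidden -> 2 outputs
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = mlp_forward(rng.normal(size=3), weights, biases)   # vector-valued output
```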
Neural Nets are Feature Learners
Hidden layers can be seen as learning a feature representation for each input.
Kernel Methods vs Neural Nets
Recall the prediction rule for a kernel method (e.g., kernel SVM): the prediction is a weighted combination of kernel evaluations between the test input and the training inputs.
In a neural network, in contrast, the basis functions are learned from data (possibly after multiple layers of nonlinear transformations).
Both kernel methods and deep NNs can be seen as using nonlinear basis functions for making predictions: kernel methods use fixed basis functions (defined by the kernel), whereas a NN learns the basis functions adaptively from data.
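As a sketch of this contrast (the notation here is assumed, not copied from the slides), the two prediction rules can be written as:

```latex
% Kernel method: weights \alpha_n learned, basis functions fixed by the kernel k
f(\mathbf{x}) = \sum_{n=1}^{N} \alpha_n \, k(\mathbf{x}_n, \mathbf{x})

% Neural network: both the output weights w and the basis functions
% \phi_\theta (the hidden layers, with parameters \theta) are learned from data
f(\mathbf{x}) = \mathbf{w}^\top \phi_{\theta}(\mathbf{x})
```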
Features Learned by a Neural Network
Node values in each hidden layer tell us how much a "learned" feature is active in a given input.
Hidden-layer weights act like pattern detectors / feature detectors / filters.
Why Neural Networks Work Better: Another View
Linear models tend to learn only the "average" pattern.
Deep models can learn multiple patterns (each hidden node can learn one pattern).
Thus deep models can capture more subtle variations than a simpler linear model can.
Backpropagation
Backpropagation = gradient descent using the chain rule of derivatives.
Chain rule of derivatives: for example, if z = f(y) and y = g(x), then dz/dx = (dz/dy)(dy/dx).
Start by taking the derivatives of the loss function w.r.t. the parameters of the last layer, and then proceed backwards.
When moving backwards, each layer reuses the gradients already computed for the layer after it: derivative computations can be reused due to the recursive nature of the neural net architecture.
Backpropagation through an example
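The slide's worked example is a figure and is not reproduced here. As a substitute, the following is a minimal sketch, assuming a one-hidden-layer regression network with squared-error loss (names and sizes are my own), of the backward computation written out by hand:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D, K = 3, 2
W = rng.normal(size=(K, D))      # hidden-layer weights
v = rng.normal(size=K)           # output-layer weights
x, t = rng.normal(size=D), 1.0   # one training input and its target

# Forward pass
s = W @ x                        # hidden pre-activations
h = sigmoid(s)                   # hidden post-activations
y = v @ h                        # scalar output
loss = 0.5 * (y - t) ** 2        # squared-error loss

# Backward pass: start at the last layer, reuse earlier derivatives (chain rule)
dL_dy = y - t                    # dL/dy
dL_dv = dL_dy * h                # dL/dv   (reuses dL/dy)
dL_dh = dL_dy * v                # dL/dh   (reuses dL/dy)
dL_ds = dL_dh * h * (1 - h)      # dL/ds   (sigmoid'(s) = h * (1 - h))
dL_dW = np.outer(dL_ds, x)       # dL/dW   (reuses dL/ds)
```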
Backpropagation
Backprop iterates between a forward pass and a backward pass.
Forward pass: computes the loss using the current values of the parameters.
Backward pass: computes the gradients of the loss w.r.t. the parameters, using computational graphs.
Software frameworks such as TensorFlow and PyTorch already support this, so you don't need to implement it by hand (no worries about computing derivatives, etc.).
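For instance, a minimal PyTorch sketch (the model, data, and loss below are placeholders of my own choosing): the forward pass builds the computational graph, and a single .backward() call runs the backward pass automatically.

```python
import torch

# Tiny placeholder model and data (all shapes are illustrative)
model = torch.nn.Sequential(
    torch.nn.Linear(3, 4),
    torch.nn.ReLU(),
    torch.nn.Linear(4, 1),
)
x = torch.randn(8, 3)    # a batch of 8 inputs
t = torch.randn(8, 1)    # targets

y = model(x)                               # forward pass: computes outputs, builds the graph
loss = torch.nn.functional.mse_loss(y, t)  # loss under the current parameter values
loss.backward()                            # backward pass: gradients for every parameter
# Each parameter p of the model now has its gradient in p.grad
```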
Neural Nets: Some Aspects
Much of the magic lies in the hidden layers
Representational Power of Neural Nets
Consider a single hidden layer neural net with K hidden nodes.
[Figure: fits obtained with K = 3, K = 6, and K = 20 hidden units]
Recall that each hidden unit "adds" a function to the overall function.
Increasing K (the number of hidden units) will result in a more complex function.
A very large K seems to overfit (see the figure above). Should we instead prefer a small K?
No! It is better to use a large K and regularize well. Reason/justification:
A simple NN with small K will have a few local optima, some of which may be bad.
A complex NN with large K will have many local optima, all equally good (there are theoretical results on this).
We can also use multiple hidden layers (each sufficiently large) and regularize well.
Preventing Overfitting in Neural Nets
Neural nets can overfit. There are many ways to avoid overfitting, such as:
Standard regularization on the weights, such as L2, L1, etc. (L2 regularization is also called weight decay)
[Figure: single hidden layer NN with K = 20 hidden units and L2 regularization]
Early stopping (traditionally used): stop when the validation error starts increasing
Dropout: randomly remove units (with some probability p) during training
Various other tricks, such as weight sharing across different hidden units of the same layer (used in convolutional neural nets, or CNNs)
Fig courtesy: "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (Srivastava et al., 2014)
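A hedged PyTorch sketch of two of these remedies, L2 regularization via the optimizer's weight_decay argument and dropout via an nn.Dropout layer (the architecture and hyperparameter values are illustrative):

```python
import torch

# Single hidden layer net with dropout on the hidden units (sizes are illustrative)
model = torch.nn.Sequential(
    torch.nn.Linear(2, 20),     # K = 20 hidden units
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),    # randomly drop hidden units during training
    torch.nn.Linear(20, 1),
)

# L2 regularization ("weight decay") is applied through the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

model.train()   # dropout is active in training mode
# ... training loop: forward pass, loss, loss.backward(), optimizer.step() ...
model.eval()    # dropout is disabled at evaluation time
```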
Wide or Deep?
While a very wide single hidden layer can approximate any function, we often prefer many, less wide, hidden layers.
Higher layers help learn more directly useful/interpretable features (also useful for compressing data using a small number of features).
Using a Pre-trained Network
A deep NN already trained on some "generic" data can be useful for other tasks, e.g.:
Feature extraction: use a pre-trained net, remove the output layer, and use the rest of the network as a feature extractor for a related dataset.
[Figure: the part of a pre-trained net below the output layer can be used as a feature extractor on some new task]
Fine-tuning: use a pre-trained net's weights as the initialization to train a deep net for a new task.
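A hedged sketch of the feature-extraction use, with torchvision's ResNet-18 as an example pre-trained network (the choice of model and the nn.Identity replacement are my own illustration, not from the slides):

```python
import torch
import torchvision

# A network pre-trained on "generic" data (ImageNet weights)
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()   # remove the output layer; the rest becomes a feature extractor
model.eval()

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)   # placeholder batch from a new, related dataset
    features = model(images)               # learned feature representations (512-dim per image)
```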
Deep Neural Nets: Some Comments
Highly effective in learning good feature representations from data in an "end-to-end" manner.
Also useful for unsupervised learning problems (will see some examples):
Autoencoders for dimensionality reduction
Deep generative models for generating data and (unsupervisedly) learning features; examples include generative adversarial networks (GANs) and variational autoencoders (VAEs)
Coming up next
Convolutional neural nets
Neural nets for sequential data
Neural networks for unsupervised learning and generation