
Introduction to Deep Learning (1)

CS771: Introduction to Machine Learning


Piyush Rai
2
Limitation of Linear Models
 Linear models (e.g., linear regression, logistic regression, SVM): the output is produced by applying some monotonic function $f$ (e.g., the sigmoid) to a linear combination of the input features, i.e., $y = f(\mathbf{w}^\top \mathbf{x})$

 This basic architecture is classically also known as the "Perceptron" (not to be confused with the Perceptron "algorithm", which learns a linear classification model)

 This, however, can't learn nonlinear functions or nonlinear decision boundaries (although linear models can be kernelized to make them nonlinear)
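A classic concrete instance of this limitation (a standard textbook example, not taken from the slide) is the XOR problem: label $(0,0)$ and $(1,1)$ as negative and $(0,1)$ and $(1,0)$ as positive. A linear score $s(\mathbf{x}) = w_1 x_1 + w_2 x_2 + b$ would need

    $b < 0, \qquad w_1 + w_2 + b < 0, \qquad w_2 + b > 0, \qquad w_1 + b > 0$

Adding the first two inequalities gives $w_1 + w_2 + 2b < 0$ while adding the last two gives $w_1 + w_2 + 2b > 0$, a contradiction, so no linear model (without a kernel or feature transformation) can separate these points.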


3
Neural Networks: Multi-layer Perceptron (MLP)
 An MLP consists of an input layer, an output layer, and one or more hidden layers

[Figure: an MLP with an input layer (D = 3 visible units), a hidden layer (K = 2 hidden units), and an output layer producing a scalar-valued output]

The hidden layer units/nodes act as new features; the weights are learnable

Can think of this model as a combination of the predictions of two simpler models

The effective input-to-output mapping is nonlinear (will see justification shortly)
4
Illustration: Neural Net with One Hidden Layer
 Each input $\mathbf{x}_n$ is transformed into several pre-activations using linear models: $s_{nk} = \mathbf{w}_k^\top \mathbf{x}_n$

 A nonlinear activation is applied on each pre-activation: $h_{nk} = g(s_{nk})$

 A linear model is learned on the new "features" $\mathbf{h}_n = [h_{n1}, \dots, h_{nK}]$: $s_n = \mathbf{v}^\top \mathbf{h}_n$

 Finally, the output is produced as $y_n = o(s_n)$, where the output activation $o$ can even be the identity (e.g., for regression $y_n = s_n$)

 The unknowns ($\mathbf{W}$, $\mathbf{v}$) are learned by minimizing some loss function $\mathcal{L}(\mathbf{W}, \mathbf{v})$ over the training data (squared, logistic, softmax, etc.)
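As a concrete sketch of these computations (a minimal illustration, not the slide's own code: the sigmoid activation, identity output, and random weights are assumptions), here is the forward pass for a network like the one on the earlier slide, with D = 3 inputs, K = 2 hidden units, and a scalar output:

    import numpy as np

    rng = np.random.default_rng(0)
    D, K = 3, 2                      # input dimension and number of hidden units
    W = rng.normal(size=(K, D))      # hidden-layer weights (one w_k per hidden unit)
    v = rng.normal(size=K)           # output-layer weights

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    x = np.array([0.5, -1.0, 2.0])   # one input x_n
    s = W @ x                        # pre-activations s_nk = w_k^T x_n
    h = sigmoid(s)                   # post-activations h_nk = g(s_nk)
    score = v @ h                    # linear model on the new "features" h_n
    y = score                        # identity output (regression): y_n = s_n
    print(h, y)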
5
Neural Nets: A Compact Illustration

Each node will denote a linear combination of its inputs followed by a nonlinear operation on the result

 Note: Hidden layer pre-act and post-act will be shown together for brevity

[Figure: a single hidden layer network drawn compactly ("more succinctly"): the pre-activation and post-activation of each hidden node are combined and shown as a single value computed by that node, and the final output is shown directly]

 Different layers may use different non-linear activations. The output layer may have none.
6
Activation Functions: Some Common Choices
[Figure: plots of four common activation functions h = g(a): sigmoid, tanh, ReLU, and Leaky ReLU]

For sigmoid as well as tanh, gradients saturate (become close to zero as the function tends to its extreme values)

tanh is preferred more than sigmoid: it helps keep the mean of the next layer's inputs close to zero (with sigmoid, it is close to 0.5)

ReLU and Leaky ReLU are among the most popular choices; Leaky ReLU helps fix the "dead neuron" problem of ReLU when the pre-activation $a$ is a negative number

Without a nonlinear activation, a deep neural network is equivalent to a linear model, no matter how many layers we use
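For reference, the standard definitions of these activations (the leaky slope, set to 0.01 here, is a common but arbitrary choice):

    $\text{sigmoid}(a) = \frac{1}{1 + e^{-a}}, \qquad \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}, \qquad \text{ReLU}(a) = \max(0, a), \qquad \text{LeakyReLU}(a) = \max(0.01\,a,\ a)$

The last point above can be seen directly: without nonlinearities, stacking layers only multiplies matrices, $\mathbf{W}_2(\mathbf{W}_1\mathbf{x}) = (\mathbf{W}_2\mathbf{W}_1)\mathbf{x}$, which is again a linear model.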
7
MLP Can Learn Nonlin. Fn: A Brief Justification
 An MLP can be seen as a composition of multiple linear models combined nonlinearly

[Figure: a nonlinear classification problem, together with the score functions learned by a single "Perceptron" and by an MLP]

Standard single "Perceptron" classifier (no hidden units): its score monotonically increases in one direction (a one-sided increase), which is not ideal for learning nonlinear decision boundaries

A Multi-layer Perceptron classifier (one hidden layer with 2 units) can give a high score in the middle and a low score on either of the two sides of it, which is exactly what we want for the given classification problem. This is obtained by composing the two one-sided increasing score functions (using output weights $v_1 = 1$ and $v_2 = -1$ to "flip" the second one before adding), and it can now learn nonlinear decision boundaries

A single hidden layer MLP with a sufficiently large number of hidden units can approximate any function (Hornik, 1991)
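To make the "flip and add" construction concrete, here is a minimal 1-D sketch (the sigmoid activation and the particular thresholds are illustrative assumptions; the slide's exact parameters live in the figure):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # Two one-sided, monotonically increasing score functions with different thresholds
    def score(x):
        h1 = sigmoid(5 * (x + 1))          # turns on around x = -1
        h2 = sigmoid(5 * (x - 1))          # turns on around x = +1
        return 1.0 * h1 + (-1.0) * h2      # v1 = 1, v2 = -1 "flips" the second one

    for x in [-3.0, 0.0, 3.0]:
        print(x, round(score(x), 3))       # low score on both sides, high in the middle

The composed score is high only between the two thresholds, which is exactly the bump-shaped score the slide describes.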
8

Examples of some basic NN/MLP architectures

9
Single Hidden Layer and Single Outputs
 One hidden layer with $K$ nodes and a single output (e.g., scalar-valued regression or binary classification)
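A sketch of what this network computes, consistent with the earlier notation (not quoted from the slide, whose equation is part of the figure):

    $y = o\Big(\sum_{k=1}^{K} v_k \, g(\mathbf{w}_k^\top \mathbf{x})\Big)$

where $g$ is the hidden-layer activation and $o$ is the (possibly identity) output activation.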

10
Single Hidden Layer and Multiple Outputs
 One hidden layer with $K$ nodes and a vector of outputs (e.g., vector-valued regression or multi-class classification or multi-label classification)
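A corresponding sketch for a vector of outputs, with hidden-layer weight matrix $\mathbf{W}$ and output weight matrix $\mathbf{V}$ (again consistent with the earlier notation, not quoted from the slide):

    $\mathbf{y} = o\big(\mathbf{V}\, g(\mathbf{W}\mathbf{x})\big)$

where, for example, $o$ could be the softmax for multi-class classification or elementwise sigmoids for multi-label classification.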

11
Multiple Hidden Layers (One/Multiple Outputs)
 Most general case: Multiple hidden layers (with the same or a different number of hidden nodes in each) and a scalar or vector-valued output
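One compact way to write this most general case (a sketch in the same notation): with $L$ hidden layers,

    $\mathbf{h}^{(0)} = \mathbf{x}, \qquad \mathbf{h}^{(\ell)} = g\big(\mathbf{W}^{(\ell)} \mathbf{h}^{(\ell-1)}\big) \ \text{ for } \ell = 1, \dots, L, \qquad \mathbf{y} = o\big(\mathbf{W}^{(L+1)} \mathbf{h}^{(L)}\big)$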

12
Neural Nets are Feature Learners
 Hidden layers can be seen as learning a feature representation for each input

The last hidden layer's values act as the final learned features
13
Kernel Methods vs Neural Nets
 Recall the prediction rule for a kernel method (e.g., kernel SVM): $y = \sum_{n=1}^{N} \alpha_n\, k(\mathbf{x}_n, \mathbf{x})$

 This is analogous to a single hidden layer NN with $N$ fixed/pre-defined hidden nodes $k(\mathbf{x}_n, \cdot)$ and output weights $\alpha_n$

 The prediction rule for a deep neural network is of the form $y = \mathbf{v}^\top \boldsymbol{\phi}(\mathbf{x})$, where $\boldsymbol{\phi}(\mathbf{x})$ denotes the features computed by the hidden layers

(Also note that neural nets are faster than kernel methods at test time, since kernel methods need to store the training examples at test time whereas neural nets do not)

 Here, the $\boldsymbol{\phi}$'s are learned from data (possibly after multiple layers of nonlinear transformations)

 Both kernel methods and deep NNs can be seen as using nonlinear basis functions for making predictions. Kernel methods use fixed basis functions (defined by the kernel) whereas NNs learn the basis functions adaptively from data
14
Features Learned by a Neural Network
 Node values in each hidden layer tell us how much a "learned" feature is active in a given input
 Hidden layer weights are like patterns/feature-detectors/filters

All the incoming weights (a vector) on a hidden node can be seen as representing a template/pattern/feature-detector/filter: for a node in the first hidden layer this weight vector is $D$-dimensional (the same dimension as the input), while for a node in a deeper hidden layer its dimension equals the number of nodes in the layer below it

15
Why Neural Networks Work Better: Another View
 Linear models tend to only learn the “average” pattern
 Deep models can learn multiple patterns (each hidden node can learn one pattern)
 Thus deep models can learn to capture more subtle variations than a simpler linear model can

16
Backpropagation
 Backpropagation = Gradient descent using chain rule of derivatives
 Chain rule of derivatives: for example, if $z = f(y)$ and $y = g(x)$, then $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$

Start by taking the derivatives of the loss function w.r.t. the params of the last layer and then proceed backwards, reusing the already calculated gradients computed at the previous step. Previous derivative computations can be reused due to the recursive nature of the neural net architecture

17
Backpropagation through an example
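As a stand-in for the worked example (which is given as a figure), here is a minimal sketch of backprop through a one-hidden-layer network with sigmoid hidden units, identity output, and squared loss; these particular choices, and the data, are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x, t = np.array([0.5, -1.0, 2.0]), 1.0    # one input and its target
    W, v = rng.normal(size=(2, 3)), rng.normal(size=2)

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    # Forward pass
    s = W @ x                     # pre-activations
    h = sigmoid(s)                # hidden activations
    y = v @ h                     # identity output
    loss = 0.5 * (y - t) ** 2

    # Backward pass: chain rule, starting from the output and reusing earlier gradients
    dy = y - t                    # dloss/dy
    dv = dy * h                   # dloss/dv   (reuses dy)
    dh = dy * v                   # dloss/dh   (reuses dy)
    ds = dh * h * (1 - h)         # dloss/ds   (sigmoid derivative, reuses dh)
    dW = np.outer(ds, x)          # dloss/dW   (reuses ds)
    print(loss, dW, dv)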

18
Backpropagation
 Backprop iterates between a forward pass and a backward pass

Forward Pass: computes the loss using the current values of the parameters

Backward Pass: computes the gradient of the loss, starting with the params in the last layer and going backwards (using computational graphs)

 Software frameworks such as Tensorflow and PyTorch support this already, so you don't need to implement it by hand (so no worries about computing derivatives etc.)
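For instance, a minimal PyTorch sketch (the shapes, values, and dummy target are made up for illustration): autograd records the computational graph during the forward pass and fills in the gradients when .backward() is called:

    import torch

    x = torch.tensor([0.5, -1.0, 2.0])
    W = torch.randn(2, 3, requires_grad=True)
    v = torch.randn(2, requires_grad=True)

    y = v @ torch.relu(W @ x)    # forward pass: compute the output
    loss = (y - 1.0) ** 2        # squared loss against a dummy target
    loss.backward()              # backward pass: gradients via autograd
    print(W.grad, v.grad)        # ready for a gradient descent update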
19
Neural Nets: Some Aspects
 Much of the magic lies in the hidden layers

 Hidden layers learn and detect good features


Choosing the right NN architecture is important and a research area in itself. Neural Architecture Search (NAS) is an automated technique to do this

 Need to consider a few aspects
 Number of hidden layers, number of units in each hidden layer
 Why bother about many hidden layers and not use a single very wide hidden layer (recall Hornik's universal function approximator theorem)?
 Complex networks (several, very wide hidden layers) or simpler networks (few, moderately wide hidden layers)?
 Aren't deep neural networks prone to overfitting (since they contain a huge number of parameters)?

20
Representational Power of Neural Nets
 Consider a single hidden layer neural net with $K$ hidden nodes

[Figure: fits learned with K = 3, K = 6, and K = 20 hidden units]

 Recall that each hidden unit "adds" a function to the overall function
 Increasing $K$ (the number of hidden units) will result in a more complex function
 Very large $K$ seems to overfit (see above fig). Should we instead prefer small $K$?
 No! It is better to use large $K$ and regularize well. Reason/justification:
 Simple NN with small $K$ will have a few local optima, some of which may be bad
 Complex NN with large $K$ will have many local optima, all equally good (theoretical results on this)
 We can also use multiple hidden layers (each sufficiently large) and regularize well
21
Preventing Overfitting in Neural Nets
 Neural nets can overfit. Many ways to avoid overfitting, such as
 Standard regularization on the weights, such as $\ell_2$, $\ell_1$, etc. ($\ell_2$ reg. is also called weight decay)

[Figure: a single hidden layer NN with K = 20 hidden units and L2 regularization]

 Early stopping (traditionally used): Stop when the validation error starts increasing
 Dropout: Randomly remove units (with some probability $p$) during training (a brief sketch follows below)
 Various other tricks, such as weight sharing across different hidden units of the same layer (used in convolutional neural nets or CNNs)

Fig courtesy: Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al., 2014)
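A minimal sketch of dropout at training time, using the commonly implemented "inverted dropout" scaling (the original paper instead rescales the weights by p at test time); the value p = 0.5 and the activations are just examples:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(h, p, training=True):
        # During training, zero out each hidden unit independently with probability p.
        # Dividing the survivors by (1 - p) keeps the expected activation unchanged,
        # so no extra rescaling is needed at test time.
        if not training:
            return h
        mask = rng.random(h.shape) >= p
        return h * mask / (1.0 - p)

    h = np.array([0.5, 1.2, -0.3, 2.0])    # activations of one hidden layer
    print(dropout(h, p=0.5))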
22
Wide or Deep?
 While a very wide single hidden layer can approximate any function, often we prefer many, less wide, hidden layers

 Higher layers help learn more directly useful/interpretable features (also useful for
compressing data using a small number of features)
23
Using a Pre-trained Network
 A deep NN already trained on some "generic" data can be useful for other tasks, e.g.,
 Feature extraction: Use a pre-trained net, remove the output layer, and use the rest of the network as a feature extractor for a related dataset

[Figure: the part of a pre-trained net below the output layer can be used as a feature extractor on some new task]

Many packages, like Tensorflow and PyTorch, provide such pre-trained modules ready to be used

This is sometimes also known as "transfer learning" in the context of neural nets

 Fine-tuning: Use a pre-trained net and use its weights as the initialization to train a deep net for a new task (a brief sketch of both uses follows below)
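A hedged PyTorch-style sketch of both uses, assuming torchvision's resnet18 as the "generic" pre-trained net and a hypothetical 10-class new task (the exact loading API varies across torchvision versions):

    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(pretrained=True)    # net pre-trained on ImageNet

    # Feature extraction: freeze the pre-trained weights ...
    for param in model.parameters():
        param.requires_grad = False

    # ... and replace the output layer with a fresh one for the new task
    num_classes = 10                            # hypothetical number of classes
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    # Fine-tuning instead: keep requires_grad=True for all parameters and train the
    # whole network; the pre-trained weights then serve only as the initialization.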
24
Deep Neural Nets: Some Comments
 Highly effective in learning good feature rep. from data in an “end-to-end” manner

 The objective functions of these models are highly non-convex


 But fast and robust non-convex opt algos exist for learning such deep networks

 Training these models is computationally very expensive


 But GPUs can help to speed up many of the computations

 Also useful for unsupervised learning problems (will see some examples)
 Autoencoders for dimensionality reduction
 Deep generative models for generating data and learning features in an unsupervised manner; examples include generative adversarial networks (GAN) and variational auto-encoders (VAE)

25
Coming up next
 Convolutional neural nets
 Neural nets for sequential data
 Neural networks for unsupervised learning and generation
