Short Course on Deep Learning
IIIT Hyderabad
Welcome!!
Broad Plans
1. Day 1
Introduction and background setting
Learn to use a Deep Learning Toolbox/Library
How to Train a Deep Network?
2. Day 2
Overview of Popular Architectures
More on CNN
RNN and Applications
3. Day 3
Applications in Vision and Language.
More into the practical issues of training
4. Day 4
Building compact DL solutions (for Mobiles/FPGAs etc.)
Practicing what you learned
Introduction to Machine Learning
IIIT Hyderabad
C. V. Jawahar
www.iiit.ac.in/~jawahar
Image Classification
Example: Indoor scene classification
[Figure: an indoor scene image to be classified as Kitchen, Living Room, or Dining Room]
Object Recognition
[Figure: classify an input image; output label: dog]
Goal: To assign a class label to an input image X from a label set L.
Face Recognition
[Figure: a face image; predicted identity: Kate Winslet]
Goal: To predict the name of the person (many classes, finer
variations)
Challenge: Variation in lighting, occlusion, pose, expression,
multiple faces. Different people in the train and test sets.
Face Verification
[Figure: two face images; Same person? YES]
Goal: To predict whether the two input images X1 and X2 are of the
same person or not.
Challenge: Variation in lighting, occlusion, pose, expression,
multiple faces. Different people in the train and test sets.
Variations
Binary Classification
Multi Class Classification
Multi Label Classification
Structured Output Prediction
Outputs are complex (structured): images, text, audio, protein folds.
Problem Space
Feature Extraction: Find X corresponding to an entity/item I (such as an image, web page, ECG, etc.).
Classification: Find a parameterized function fW(X) which can make the right predictions Y.
End to End: Can we learn Y directly from I? (A minimal sketch follows.)
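To make the "parameterized function" concrete, here is a minimal sketch in plain Python/NumPy; the class count, feature dimension, and random values are illustrative assumptions, not from the slides. It shows a linear scoring function fW(X) whose highest-scoring class is the prediction Y.

import numpy as np

# Toy setting: 3 classes, 5-dimensional feature vector X extracted from an item I.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))          # parameters to be learned
b = np.zeros(3)

def f_W(X):
    """Parameterized classifier: return the index of the highest-scoring class."""
    scores = W @ X + b               # one score per class
    return int(np.argmax(scores))

X = rng.normal(size=5)               # feature vector for item I
print("predicted class:", f_W(X))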
Bag of Words (Text Domain)
Orderless document representation: frequencies of words
from a dictionary.
Classification to determine document categories.
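A toy sketch of the idea in plain Python; the tiny dictionary and example document are made-up illustrations, not from the slides. The document is represented by how often each dictionary word occurs in it.

from collections import Counter

dictionary = ["book", "library", "kitchen", "dog", "face"]   # toy vocabulary

def bag_of_words(document):
    """Orderless representation: word frequencies over a fixed dictionary."""
    counts = Counter(document.lower().split())
    return [counts[w] for w in dictionary]

doc = "the dog chased the dog into the library"
print(bag_of_words(doc))   # [0, 1, 0, 2, 0]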
BoW: Texture Recognition
[Figure: texture images represented as histograms over a universal texton dictionary]
Bag of Visual Words
Learn a visual vocabulary.
Now: Learned Representations
CNN Features can be used for wider applications:
1. Train the CNN (deep network) on a very large database such as ImageNet.
2. Reuse the CNN to solve smaller problems:
   1. Remove the last layer (classification layer).
   2. The output is the code/feature representation (a minimal sketch follows).
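A minimal sketch of this reuse, assuming PyTorch/torchvision with an ImageNet-pretrained ResNet-18; the model choice, weights string (torchvision >= 0.13), and input size are illustrative assumptions, not prescribed by the slides.

import torch
import torchvision

# Hypothetical choice: ResNet-18 pretrained on ImageNet.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.eval()

# Drop the last (classification) layer; what remains maps an image to a feature code.
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])

image = torch.randn(1, 3, 224, 224)              # stand-in for a preprocessed input image
with torch.no_grad():
    code = feature_extractor(image).flatten(1)   # feature/code representation, shape (1, 512)
print(code.shape)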
1-Hot vs. Rich Representations
1-Hot representation: the dimensionality of the vector is equal to the size of the vocabulary (e.g., millions for Google 1T, 500K for a big vocabulary).
book    [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
library [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
The 1-hot representation doesn't take into account similarity between words; replace it with a learned vector.
Credit: Ali Ghodsi, Deep Learning; Mikolov, 2013.
king − man + woman = queen
Mikolov, 2013
Word embeddings
Word2Vec
[Figure: word2vec architectures. CBOW: the context words w(t−2), w(t−1), w(t+1), w(t+2) are projected and summed to predict the current word w(t). Skip-gram: the current word w(t) is used to predict the surrounding context words.]
Mikolov, 2013
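A hedged sketch of training such word embeddings with the gensim library (assuming gensim 4.x; the toy corpus and parameter values are illustrative, not from the slides).

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (a real corpus would be far larger).
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk"],
]

# sg=1 selects the skip-gram objective; sg=0 would give CBOW.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

vec_king = model.wv["king"]          # learned continuous representation of "king"
# Analogy query in the spirit of king - man + woman ~ queen
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))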
Learned Representations
[Figure: continuous word representations]
[Figure: t-SNE image maps for typical representation spaces (Magnet and Triplet)]
Image Representation
Radford, Metz and Chintala, ICLR 2016
Class of ML Algorithms
Unsupervised learning
Supervised learning
Semi-supervised learning
Algorithms
Supervised learning (prediction)
  Classification (discrete labels), Regression (real values)
Unsupervised learning
  Clustering
  Probability distribution estimation
  Finding associations (in features)
  Dimension reduction
Semi-supervised learning
Reinforcement learning
  Decision making (robot, chess machine)
Classifiers: Nearest neighbor
[Figure: training examples from class 1 and class 2, with a test example to be labeled]
f(x) = label of the training example nearest to x
All we need is a distance function for our inputs
No training required!
Slide credit: L. Lazebnik
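A minimal NumPy sketch of the nearest-neighbor rule; the toy 2-D points are made-up, and Euclidean distance is just one possible choice of distance function.

import numpy as np

# Toy training set: 2-D points with class labels 1 and 2.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y_train = np.array([1, 1, 2, 2])

def nearest_neighbor(x):
    """f(x) = label of the training example nearest to x (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

print(nearest_neighbor(np.array([0.1, 0.0])))   # -> 1
print(nearest_neighbor(np.array([1.1, 0.9])))   # -> 2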
Classifiers: Linear
Find a linear function to separate the classes:
f(x) = sgn(w x + b)
Slide credit: L. Lazebnik
Linear Classifiers
A plane: ax + by + cz + d = 0, i.e. w · x + d = 0 with w = (a, b, c).
Distance from a point (x0, y0, z0) to the plane:
D = (a x0 + b y0 + c z0 + d) / sqrt(a^2 + b^2 + c^2) = (w · x0 + d) / ||w||
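A quick numerical check of the formula, using a made-up plane and point:

import numpy as np

w = np.array([1.0, 2.0, 2.0])    # plane: x + 2y + 2z - 3 = 0, so d = -3
d = -3.0
x0 = np.array([1.0, 1.0, 1.0])   # an arbitrary point

D = (w @ x0 + d) / np.linalg.norm(w)
print(D)   # (1 + 2 + 2 - 3) / 3 = 2/3 ~ 0.667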
Support Vector Machines
Want the line that maximizes the margin.
xi positive (yi = +1):  xi · w + b ≥ +1
xi negative (yi = −1):  xi · w + b ≤ −1
For support vectors:    xi · w + b = ±1
Distance between a point and the line: |xi · w + b| / ||w||
For support vectors this distance is 1/||w||, so the margin is M = 2/||w||.
[Figure: separating line with margin M; the support vectors lie on the margin boundaries]
Finding the Maximum Margin Plane
1. Maximize the margin 2/||w||
2. Correctly classify all training data points:
   xi positive (yi = +1): xi · w + b ≥ +1
   xi negative (yi = −1): xi · w + b ≤ −1
Quadratic optimization problem:
   minimize (1/2) ||w||^2  subject to  yi (xi · w + b) ≥ 1 for all i
One constraint for each training point. Note the sign trick: both cases combine into the single constraint above.
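A hedged sketch using scikit-learn's SVC, an off-the-shelf solver for exactly this kind of quadratic problem; the toy data and the C value are illustrative assumptions, not from the slides.

import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data: class +1 on the right, class -1 on the left.
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [-2.0, -2.0], [-2.5, -1.0], [-3.0, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e3)   # large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]
print("support vectors:\n", clf.support_vectors_)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))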
Machine Learning
Popular Problems
Classification
Regression
Density Estimation
Classification
K Nearest Neighbours
Naïve Bayes Classifier
Decision Trees
Random Forest
Logistic Regression
Ensemble Learning
Neural Networks
Support Vector Machines
Optimization: Find the best W
E.g., SVM.
Often the problem is formulated as an optimization problem over the set of known
(training) examples.
Machine learning structure
[Figure: supervised learning]
[Figure: unsupervised learning]
Some More Key Words
Training: Find f and W
Testing: Evaluate f on a specific example
Training, Testing and Validation splits of the data
Generalization: Goal is to do well on unseen
data
Error, Loss, Objective Functions
Complexity of the solution (e.g., number of free parameters)
Generative classifiers try to model the data.
Discriminative classifiers try to predict the label.
What are we seeking?
Under-fitting vs. over-fitting
[Figure: error as a function of model complexity, illustrating under-fitting and over-fitting]
(model = hypothesis + loss function)
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
Generative vs. Discriminative Classifiers
Generative Models:
  Represent both the data and the labels.
  Often make use of conditional independence and priors.
  Examples: Naïve Bayes classifier, Bayesian network.
  Models of the data may apply to future prediction problems.
Discriminative Models:
  Learn to directly predict the labels from the data.
  Often assume a simple boundary (e.g., linear).
  Examples: Logistic regression, SVM, Boosted decision trees.
  Often easier to predict a label from the data than to model the data.
Slide credit: D. Hoiem
Summary
Popular methods of today are
Supervised
Discriminative
SVMs were/are popular: a nice optimization problem to solve.
Deep neural networks are becoming the standard
for many problems
Feature extraction
End to end training
Trained models for evaluation
Porting/Transforming network to network.
Introduction to Deep Learning
IIIT Hyderabad
C. V. Jawahar
www.iiit.ac.in/~jawahar
What is deep learning?
Y. Bengio et al., "Deep Learning", MIT Press, 2015
Neural Networks
Biologically inspired networks.
Complex function approximation through composition of functions.
Can learn arbitrary nonlinear decision boundaries.
Neuron, Perceptron and MLP
[Figure: a perceptron (input layer → output layer); a hidden unit/neuron with, e.g., a sigmoid activation function; and a multi-layer perceptron (input layer → hidden layers → output layer)]
Loss or Objective
[Figure: multi-layer perceptron with weights W1 … Wn; the output layer is compared with the label to compute a LOSS]
Objective: Find the best parameters (weight vectors) which minimize the loss.
E.g., squared loss.
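A minimal sketch of an MLP with a squared loss, assuming PyTorch; the layer sizes, activation, and random data are illustrative assumptions, not from the slides.

import torch
import torch.nn as nn

# Hypothetical MLP: 4 inputs -> 16 hidden units (sigmoid) -> 1 output.
model = nn.Sequential(nn.Linear(4, 16), nn.Sigmoid(), nn.Linear(16, 1))

x = torch.randn(8, 4)        # a batch of 8 input vectors
label = torch.randn(8, 1)    # their target values

prediction = model(x)
loss = ((prediction - label) ** 2).mean()   # squared loss
print(loss.item())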
Back propagation
[Figure: multi-layer perceptron with weights W1 … Wn and a LOSS computed at the output layer]
Solution: Iteratively update W along the direction in which the loss decreases.
Each layer's weights are updated based on the derivative of its output w.r.t. its input and weights.
Gradient Descent
[Figure: visualization of the loss function L]
The loss decreases in the direction of the negative gradient.
Parameter update: W ← W − η ∂L/∂W (see the sketch below).
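A minimal gradient-descent training loop, again assuming PyTorch and continuing the MLP sketch above; the learning rate and number of steps are arbitrary assumptions. The backward pass is what back-propagation computes.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.Sigmoid(), nn.Linear(16, 1))
x, label = torch.randn(8, 4), torch.randn(8, 1)
lr = 0.1   # step size / learning rate

for step in range(100):
    loss = ((model(x) - label) ** 2).mean()
    model.zero_grad()
    loss.backward()                  # back-propagation: gradients of the loss w.r.t. all weights
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad         # W <- W - lr * dL/dW
print(loss.item())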
Training
[Figure: visualization of the loss function, with the initialization, step direction, step size/learning rate, and momentum annotated]
The loss is typically viewed as a highly non-convex function, but more recently it is believed to have smoother surfaces, with many saddle regions!
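A sketch of the momentum update on a simple made-up 1-D quadratic loss; the coefficients are arbitrary, and this is plain NumPy-style Python rather than any particular library's optimizer.

def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0       # parameter and velocity
lr, mu = 0.1, 0.9     # step size (learning rate) and momentum coefficient

for step in range(50):
    v = mu * v - lr * grad(w)   # accumulate velocity in the downhill direction
    w = w + v                   # step direction blends current and past gradients
print(w)   # converges towards the minimum at w = 3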
Other methods
Newton method, Quasi-Newton
Pros: hyper-parameter free.
Cons: computing the inverse of the Hessian matrix is very costly.
Animation courtesy: Fei-Fei et al., cs231n
Popular DL Architectures
Autoencoder networks, Restricted Boltzmann Machines (RBM), CNN, RNN
An RBM is an energy-based generative model that consists of a layer of binary visible units, v, and a layer of binary hidden units, h.
[Figure: RBM / autoencoder with visible units v1 … vI, hidden units h1 … hJ, bias units, and encoder/decoder connections. Credit: Anthony Knittel, 2013]
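A minimal autoencoder sketch, assuming PyTorch; the 784→32 bottleneck and random stand-in data are illustrative assumptions. (An RBM would instead be trained with, e.g., contrastive divergence, which is not shown here.)

import torch
import torch.nn as nn

# Encoder compresses the input to a small code; decoder reconstructs the input from it.
encoder = nn.Sequential(nn.Linear(784, 32), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())

x = torch.rand(16, 784)   # stand-in for a batch of flattened images
optimizer = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=0.1)

for step in range(100):
    code = encoder(x)
    reconstruction = decoder(code)
    loss = ((reconstruction - x) ** 2).mean()   # reconstruction (squared) loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(loss.item())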
CNNs
AlexNet (Object Recognition): The network that catapulted the
success of deep learning in 2012
Deep Learning Architectures
Recurrent Neural Networks for Time Series and Sequence Data Understanding
Deep Autoencoders for Dimensionality Reduction
AlexNet (NIPS 2012)
ImageNet Classification Task:
Previous Best: ~25% (CVPR-2011)
AlexNet: ~15% (NIPS-2012)
Recent Success of Deep Learning:
ImageNet Challenge
Top-5 Error on the ImageNet Classification Challenge (1000 classes)

Method                      Top-5 Error Rate
SIFT+FV [CVPR 2011]         ~25.7%
AlexNet [NIPS 2012]         ~15%
OverFeat [ICLR 2014]        ~13%
ZeilerNet [ImageNet 2013]   ~11%
Oxford-VGG [ICLR 2015]      ~7%
GoogLeNet [CVPR 2015]       ~6%, ~4.5%
MSRA [arXiv 2015]           ~3.5% (released on 10 December 2015!)
Human Performance           3 to 5%
Big Leap
What is this big leap?
What enabled this success?
Modern Features
Invariant to popular transformations
Capable of capturing local and global (shape, colour, texture)
characteristics reliably
Features that can be learnt
Machine Learning
Learn from examples rather than hand-coding
New algorithms: effective, efficient
Efficient algorithms to solve complex optimization tasks
Realistic Data
Huge amount; partly annotated
Regular competitions
Challenging problem statements. Evaluation Metrics
Advances in Computational Resources
GPUs
Industrial scale clusters
Summary
Deep Learning has revolutionized the perception
problems in recent years.
Popular architectures: CNN, RNN, Autoencoder.
Training: A variation/refinement of
backpropagation
Excellent libraries and implementations
IIIT Hyderabad
Thank you!!