
Introduction to Deep Learning

Radu Ionescu, Prof. PhD.


[email protected]

Faculty of Mathematics and Computer Science


University of Bucharest
What is this class about?

• Some of the most exciting developments in Machine Learning, Vision, NLP, Speech, Robotics & AI in general…
• …in the last decade!

Instructors
• Lectures:
 Radu Ionescu ([email protected])

• Labs:
 Mihaela Găman ([email protected])
 Antonio Bărbălău ([email protected])
Prerequisites
• Practical Machine Learning
– Classifiers, regressors, loss functions, normalization, MLE,
etc.

• Linear Algebra
– Matrix multiplication, eigenvalues, etc.

• Calculus
– Multi-variate gradients, Hessians, Jacobians, etc.

• Programming!
– Projects will require Python
– Libraries/Frameworks: NumPy, OpenCV, TensorFlow / PyTorch
Grading System
 Your final grade is based on 1 or 2 projects:

 Project 1 based on some vision classification/regression task

 Project 2 based on some NLP classification/regression task

 Both projects are individual! (NO collaboration allowed)

 Only one project is mandatory, but you can try both!

 Projects must be presented no later than the day of the “exam”

 There will be no other exam!


Grading System
 Each project consists of implementing some deep learning
method(s) for the proposed Kaggle challenge (TBA)
 The grades will be proportional to your accuracy:
- Top 1-5 => your grade can be up to 10
- Top 6-10 => your grade can be up to 9
- Top 11-15 => your grade can be up to 8
- Top 16-20 => your grade can be up to 7
- Top 21-25 => your grade can be up to 6
- Others => your grade can be up to 5
• Participants who rank in the same range in both challenges get an extra bonus point
Grading System
 For a grade higher than or equal to 5, you must beat the baseline!

 Project(s) must be presented (2 points awarded for presentation and documentation)

 The project consists of the code implementation in Python (any library is allowed) and a report/documentation including:
  a description of the implemented deep learning method(s): architecture, hyperparameters, loss, etc.
  figures and/or tables with results (including validation, hyperparameter tuning, grid search, random search)
  comments on the results
  conclusion
(NO) Collaboration Policy

• Collaboration
– Each student must write their own code for the project(s)

• No tolerance on plagiarism
– Neither ethical nor in your best interest
– Don’t cheat. We will find out (code will be checked!)
Topics in Practical ML
• Basics of Statistical Learning
• Loss function, MLE, MAP, Bayesian estimation, bias-variance tradeoff,
overfitting, regularization, cross-validation

• Supervised Learning
• Nearest Neighbour, Naïve Bayes, Logistic Regression, Support Vector
Machines, Kernels, Neural Networks, Decision Trees
• Ensemble Methods

• Unsupervised Learning
• Clustering: k-means, Gaussian mixture models, EM
• Dimensionality reduction: PCA, SVD, LDA

• Perception
• Applications to Vision, Natural Language Processing
What is Machine Learning?
• “the acquisition of knowledge or skills
through experience, study, or by being
taught”
What is Machine Learning?
• [Arthur Samuel, 1959]
– Field of study that gives computers
– the ability to learn without being explicitly programmed

• [Kevin Murphy] algorithms that


– automatically detect patterns in data
– use the uncovered patterns to predict future data or other
outcomes of interest

• [Tom Mitchell] algorithms that


– improve their performance (P)
– at some task (T)
– with experience (E)
What is Machine Learning?

Data → Machine Learning → Understanding
ML in a Nutshell
• Tens of thousands of machine learning algorithms
– Hundreds new every year

• Decades of ML research oversimplified:
– All of Machine Learning:
– Learn a mapping from input to output f: X → Y
• e.g. X: emails, Y: {spam, not-spam}
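As a concrete (toy) illustration of "learning a mapping f: X → Y" for the spam example, the sketch below fits a tiny logistic-regression spam detector; the vocabulary, emails, labels and hyperparameters are all invented for illustration.

# A minimal sketch of "learning a mapping f: X -> Y" for the spam example.
# The tiny dataset, vocabulary and hyperparameters below are made up for illustration.
import numpy as np

vocab = ["free", "money", "meeting", "project", "winner"]

def featurize(email: str) -> np.ndarray:
    """Bag-of-words: count how often each vocabulary word appears."""
    words = email.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

# X: emails, Y: {spam = 1, not-spam = 0}
emails = ["free money winner", "project meeting today",
          "free free money", "meeting about the project"]
labels = np.array([1, 0, 1, 0])
X = np.stack([featurize(e) for e in emails])

# Learn a linear mapping with a few steps of gradient descent on the logistic loss.
w, b = np.zeros(len(vocab)), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # predicted probability of spam
    w -= 0.5 * (X.T @ (p - labels)) / len(labels)
    b -= 0.5 * np.mean(p - labels)

print(1.0 / (1.0 + np.exp(-(featurize("free money") @ w + b))))  # high => spam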
Types of Learning
• Supervised learning
– Training data includes desired outputs

• Unsupervised learning
– Training data does not include desired outputs

• Weakly or Semi-supervised learning


– Training data includes a few desired outputs

• Reinforcement learning
– Rewards from sequence of actions
Tasks

Supervised Learning
• Classification: x → y (y discrete)
• Regression: x → y (y continuous)

Unsupervised Learning
• Clustering: x → y (y: discrete cluster ID)
• Dimensionality Reduction: x → y (y continuous)
Supervised Learning

Classification: x → y (y discrete)

Vision: Image Classification
[Figure: example images x with predicted labels y such as man, camel, carrot, car]
NLP: Machine Translation
Speech: Speech2Text
AI: Turing Test
“Can machines think?”

Q: Please write me a sonnet on the subject of the Forth Bridge.


A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: (Pause about 30 seconds and then give as answer) 105621.
AI: Visual Turing Test

Q: How many slices of pizza are there? (input x: image)
A: 6 (output y)
Supervised Learning
• Input: x (images, text, emails…)

• Output: y (spam or non-spam…)

• (Unknown) Target Function
– f: X → Y (the “true” mapping / reality)

• Data
– (x1,y1), (x2,y2), …, (xN,yN)

• Model / Hypothesis Class
– g: X → Y
– y = g(x) = sign(wᵀx)

• Learning = Search in hypothesis space (a toy sketch follows)
– Find best g in model class.
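The sketch below is a deliberately naive illustration of "learning = search in the hypothesis space": it tries random weight vectors w for g(x) = sign(wᵀx) and keeps the best one on a made-up 2-D dataset. Real learners (perceptron, SGD, …) search far more cleverly; everything here is a toy assumption.

# "Learning = search in the hypothesis space": keep the best random hypothesis.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 2 * X[:, 1])          # the "true" mapping f (unknown in practice)

def accuracy(w):
    return np.mean(np.sign(X @ w) == y)

best_w, best_acc = None, 0.0
for _ in range(1000):                        # try 1000 random hypotheses g
    w = rng.normal(size=2)
    acc = accuracy(w)
    if acc > best_acc:
        best_w, best_acc = w, acc

print(best_w, best_acc)                      # best g found in the hypothesis class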
Synonyms
• Representation Learning

• Deep (Machine) Learning


• Deep Neural Networks

• Deep Unsupervised Learning

• Simply: Deep Learning


So what is Deep (Machine) Learning?

• A few different ideas:

• (Hierarchical) Compositionality
– Cascade of non-linear transformations
– Multiple layers of representations

• End-to-End Learning
– Learning (goal-driven) representations
– Learning to extract features

• Distributed Representations
– No single neuron “encodes” everything
– Groups of neurons work together
Traditional Machine Learning

VISION:  hand-crafted features (SIFT/HOG, fixed) → your favorite classifier (learned) → “car”

SPEECH:  hand-crafted features (MFCC, fixed) → your favorite classifier (learned) → \ˈdēp\

NLP:  “This burrito place is yummy and fun!” → hand-crafted features (bag-of-words, fixed) → your favorite classifier (learned) → “+”
It’s an old paradigm
• The first learning machine: the Perceptron
 Built at Cornell in 1960
• The Perceptron was a linear classifier on top of a simple feature extractor
• The vast majority of practical applications of ML today use glorified linear classifiers or glorified template matching
• Designing a feature extractor requires considerable effort by experts

y = sign( Σ_{i=1}^{N} W_i F_i(X) + b )
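A minimal sketch of this formula, assuming a toy hand-crafted feature extractor F and made-up data (nothing here reproduces the original 1960 setup): a linear classifier y = sign(Σ W_i F_i(X) + b) trained with the classic perceptron mistake-driven update.

# Perceptron sketch: a linear classifier on top of a fixed feature extractor.
import numpy as np

def F(x):
    """Hand-crafted feature extractor: here just a few simple statistics."""
    return np.array([x.mean(), x.max(), x.min()])

rng = np.random.default_rng(1)
raw = rng.normal(size=(100, 16))                   # 100 toy "inputs" X
feats = np.stack([F(x) for x in raw])              # F_i(X), i = 1..N
labels = np.sign(feats[:, 0] + 0.5 * feats[:, 1])  # toy +/-1 labels

W, b = np.zeros(feats.shape[1]), 0.0
for _ in range(20):                                # perceptron learning rule
    for f, t in zip(feats, labels):
        y = np.sign(W @ f + b)                     # y = sign(sum_i W_i F_i(X) + b)
        if y != t:                                 # update only on mistakes
            W += t * f
            b += t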
Hierarchical Compositionality

VISION:  pixels → edge → texton → motif → part → object

SPEECH:  sample → spectral band → formant → motif → phone → word

NLP:  character → word → NP/VP/.. → clause → sentence → story
Building a Complicated Function

Given a library of simple functions, compose them into a complicated function.
Building a Complicated Function

Given a library of simple functions, compose them into a complicated function.

Idea 1: Linear Combinations (a weighted sum of the simple functions)
• Boosting
• Kernels
• …
Building a Complicated Function

Given a library of simple functions, compose them into a complicated function.

Idea 2: Compositions (a cascade of simple functions applied one after another; a toy comparison of the two ideas follows)
• Deep Learning
• Grammar models
• Scattering transforms
• …
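The short sketch below contrasts the two ideas on arbitrary simple functions g1, g2, g3 chosen purely for illustration: Idea 1 takes a flat weighted sum (the flavor of boosting/kernels), Idea 2 composes them into a cascade (the flavor of deep learning).

# Two ways to build a complicated function from a library of simple ones.
import numpy as np

g1 = np.tanh
g2 = np.sin
g3 = lambda x: np.maximum(x, 0.0)        # ReLU-like

# Idea 1: linear combination (flat, one level deep).
def f_linear(x, alphas=(0.5, 0.3, 0.2)):
    return alphas[0] * g1(x) + alphas[1] * g2(x) + alphas[2] * g3(x)

# Idea 2: composition (a cascade of non-linear transformations).
def f_composed(x):
    return g3(g2(g1(x)))

x = np.linspace(-3, 3, 5)
print(f_linear(x))
print(f_composed(x))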
Deep Learning = Hierarchical Compositionality

Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier → “car”

Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]
Sparse DBNs
[Lee et al. ICML ‘09]
Figure courtesy: Quoc Le
The Mammalian Visual Cortex is Hierarchical
• The ventral (recognition) pathway in the visual cortex

[picture from Simon Thorpe]


So what is Deep (Machine) Learning?

• A few different ideas:

• (Hierarchical) Compositionality
– Cascade of non-linear transformations
– Multiple layers of representations

• End-to-End Learning
– Learning (goal-driven) representations
– Learning to extract features

• Distributed Representations
– No single neuron “encodes” everything
– Groups of neurons work together
Traditional Machine Learning

VISION:  hand-crafted features (SIFT/HOG, fixed) → your favorite classifier (learned) → “car”

SPEECH:  hand-crafted features (MFCC, fixed) → your favorite classifier (learned) → \ˈdēp\

NLP:  “This burrito place is yummy and fun!” → hand-crafted features (bag-of-words, fixed) → your favorite classifier (learned) → “+”
Feature Engineering

SIFT, HoG, Spin Images, Textons, and many many more…


What are the current bottlenecks?
• Ablation studies on DPM [Parikh & Zitnick,
CVPR10]
– Replace every “part” in the model with a
human
• Key takeaway: “parts” or features are the most
important!
Seeing is worse than believing
• [Barbu et al. ECCV14]
Traditional Machine Learning (more accurately)

VISION:  SIFT/HOG (fixed) → K-Means / pooling (unsupervised, “learned”) → classifier (supervised) → “car”

SPEECH:  MFCC (fixed) → Mixture of Gaussians (unsupervised, “learned”) → classifier (supervised) → \ˈdēp\

NLP:  “This burrito place is yummy and fun!” → Parse Tree / Syntactic n-grams (fixed / unsupervised, “learned”) → classifier (supervised) → “+”
Deep Learning = End-to-End Learning

[Same VISION / SPEECH / NLP pipelines as above, but in deep learning the feature stages become trainable as well, so the whole pipeline is learned end-to-end.]
Deep Learning = End-to-End Learning
• A hierarchy of trainable feature transforms (a minimal sketch follows below)
– Each module transforms its input representation into a higher-level one.
– High-level features are more global and more invariant
– Low-level features are shared among categories

Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier
(Learned Internal Representations)
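A minimal PyTorch sketch of such a hierarchy of trainable feature transforms, where every stage is a trainable module and the whole stack is optimized end-to-end; the layer sizes and the fake data are placeholders chosen only for illustration.

# Every stage is trainable; gradients flow through the whole stack.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # low-level trainable feature transform
    nn.Linear(256, 64),  nn.ReLU(),   # mid-level trainable feature transform
    nn.Linear(64, 10),                # high-level transform / classifier
)

x = torch.randn(32, 784)              # a fake batch of flattened images
y = torch.randint(0, 10, (32,))       # fake labels

loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()                       # gradients reach every stage: end-to-end learning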


“Shallow” vs Deep Learning

• “Shallow” models
hand-crafted Feature Extractor (fixed) → “Simple” Trainable Classifier (learned)

• Deep models
Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier
(Learned Internal Representations)

Do we really need deep models?
So what is Deep (Machine) Learning?

• A few different ideas:

• (Hierarchical) Compositionality
– Cascade of non-linear transformations
– Multiple layers of representations

• End-to-End Learning
– Learning (goal-driven) representations
– Learning to extract features

• Distributed Representations
– No single neuron “encodes” everything
– Groups of neurons work together
Distributed Representations Toy Example
• Local vs Distributed

Distributed Representations Toy Example
• Can we interpret each dimension?
Power of distributed representations!

[Figure: local vs distributed codes]

Power of distributed representations!

• United States : Dollar :: Romania : ?
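A toy sketch of why distributed representations support analogies of this kind. The 3-D vectors below are hand-made for illustration (real word embeddings are learned from large corpora), and "leu" (Romania's currency) is assumed to be the expected answer.

# Word-analogy arithmetic on tiny made-up embeddings.
import numpy as np

emb = {
    "united_states": np.array([0.9, 0.1, 0.0]),
    "dollar":        np.array([0.9, 0.1, 1.0]),
    "romania":       np.array([0.1, 0.8, 0.0]),
    "leu":           np.array([0.1, 0.8, 1.0]),
    "pizza":         np.array([0.5, 0.5, -0.7]),
}

# dollar - united_states + romania should land near "leu"
query = emb["dollar"] - emb["united_states"] + emb["romania"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

best = max((w for w in emb if w != "romania"), key=lambda w: cosine(emb[w], query))
print(best)   # "leu" in this toy setup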
Power of distributed representations!

• Example: all face images of a person


– 1000x1000 pixels = 1,000,000 dimensions
– But the face has 3 Cartesian coordinates and 3 Euler angles
– And humans have less than about 50 muscles in the face
– Hence the manifold of face images for a person has <56 dimensions
• The perfect representation of a face image:
– Its coordinates on the face manifold
– Its coordinates away from the manifold

[Figure: an Ideal Feature Extractor maps a face image to a vector encoding face / not face, pose, lighting, expression, …]
Power of distributed representations!
The Ideal Disentangling Feature Extractor

[Figure: the Ideal Feature Extractor maps pixel space (Pixel 1, Pixel 2, …, Pixel n) to disentangled factors such as View and Expression]
Distributed Representations
• Q: What objects are in the image? Where?
Power of distributed representations!
So what is Deep (Machine) Learning?

• A few different ideas:

• (Hierarchical) Compositionality
– Cascade of non-linear transformations
– Multiple layers of representations

• End-to-End Learning
– Learning (goal-driven) representations
– Learning to extract features

• Distributed Representations
– No single neuron “encodes” everything
– Groups of neurons work together
Benefits of Deep/Representation Learning
• (Usually) Better Performance
– “Because gradient descent is better than you” (Yann LeCun)

• New domains without “experts”
– RGBD
– Multi-spectral data
– Gene-expression data
– Unclear how to hand-engineer
“Expert” intuitions can be misleading

• “Every time I fire a linguist, the performance of our speech recognition system goes up”
– Fred Jelinek, IBM ’98

• “Maybe the molecule didn’t go to graduate school”
– Will Welch defending the success of his approximate molecular screening algorithm, given that he’s a computer scientist, not a chemist

“Database Screening for HIV Protease Ligands: The Influence of Binding-Site Conformation and Representation on Ligand Selectivity”, Volker Schnecke, Leslie A. Kuhn, Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, Pages 242-251, AAAI Press, 1999.
Problems with Deep Learning
• Problem#1: Non-Convex! Non-Convex! Non-Convex!
– Depth >= 3: most losses non-convex in parameters
– Theoretically, all bets are off
– Leads to stochasticity
• different initializations → different local minima (see the sketch after this slide)

• Standard response #1
– “Yes, but all interesting learning problems are non-convex”
– For example, human learning
• Order matters → wave hands → non-convexity

• Standard response #2
– “Yes, but it often works!”
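A small sketch of the stochasticity: the same non-convex model trained twice on the same data, but with different random initializations, typically ends up at different solutions. The architecture, data and hyperparameters below are arbitrary toy choices.

# Same model, same data, different seeds => usually different final losses.
import torch
import torch.nn as nn

def train_once(seed: int) -> float:
    torch.manual_seed(seed)                       # different seed => different init
    x = torch.linspace(-3, 3, 128).unsqueeze(1)
    y = torch.sin(2 * x)                          # toy regression target
    model = nn.Sequential(nn.Linear(1, 8), nn.Tanh(), nn.Linear(8, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(500):
        opt.zero_grad()
        loss = ((model(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item()

print(train_once(seed=0), train_once(seed=1))     # typically two different final losses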
Problems with Deep Learning
• Problem#2: Hard to track down what’s failing
– Pipeline systems have “oracle” performances at
each step
– In end-to-end systems, it’s hard to know why
things are not working
Problems with Deep Learning
• Problem#2: Hard to track down what’s failing

[Figure: an image-captioning pipeline system [Fang et al. CVPR15] vs an end-to-end system [Vinyals et al. CVPR15]]
Problems with Deep Learning
• Problem#2: Hard to track down what’s failing
– Pipeline systems have “oracle” performances at each step
– In end-to-end systems, it’s hard to know why things are not
working

• Standard response #1
– Tricks of the trade: visualize features, add losses at
different layers, pre-train to avoid degenerate
initializations…
– “We’re working on it”

• Standard response #2
– “Yes, but it often works!”
Problems with Deep Learning
• Problem#2: Hard to track down what’s failing
– There are methods to visualize the features, e.g. GradCAM and GradCAM++ (a rough sketch of the idea follows)
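A rough sketch of the Grad-CAM idea (not the official implementation), run on a tiny, randomly initialized CNN: weight each feature map of a convolutional layer by the average gradient of a class score with respect to it, sum the weighted maps and apply a ReLU. The architecture and the chosen class index are arbitrary assumptions for illustration.

# Grad-CAM-style visualization via a forward hook and autograd.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),    # we visualize this conv layer
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)

activations = {}
def save_activation(module, inp, out):
    activations["feat"] = out
model[2].register_forward_hook(save_activation)   # hook on the second conv layer

x = torch.randn(1, 3, 32, 32)                     # a fake input image
score = model(x)[0, 3]                            # score of an arbitrary class (index 3)

grads = torch.autograd.grad(score, activations["feat"])[0]   # d(score)/d(feature maps)
weights = grads.mean(dim=(2, 3), keepdim=True)               # global-average-pool the gradients
cam = F.relu((weights * activations["feat"]).sum(dim=1))     # weighted sum over channels, then ReLU
print(cam.shape)                                             # a coarse spatial importance map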
Problems with Deep Learning
• Problem#3: Lack of easy reproducibility
– Direct consequence of stochasticity & non-convexity

• Standard response #1
– It’s getting much better
– Standard toolkits/libraries/frameworks now available
– TensorFlow, PyTorch, Caffe, Theano

• Standard response #2
– “Yes, but it often works!”
Yes it works, but how?

ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Classification: 1000 object classes, 1.4M/50k/100k images (train/val/test)
Detection: 200 object classes, 400k/20k/40k images (train/val/test)

http://image-net.org/challenges/LSVRC/{2010,…,2014}
Data Enabling Richer Models
• [Krizhevsky et al. NIPS12]
– 54 million parameters; 8 layers (5 conv, 3 fully-connected)
– Trained on 1.4M images in ImageNet
– Better Regularization (Dropout)

Input Image → Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP → 1k output units
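A heavily simplified PyTorch sketch in the spirit of this architecture: conv + non-linearity + pooling blocks followed by a fully-connected MLP with dropout and 1k outputs. Only two of the five conv layers are shown, and the channel sizes are illustrative, not the original ones.

# AlexNet-flavored toy stack (not the actual Krizhevsky et al. model).
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(192 * 13 * 13, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 1000),        # 1k output units
)

print(net(torch.randn(1, 3, 224, 224)).shape)      # torch.Size([1, 1000])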
ImageNet Classification 2012
• [Krizhevsky et al. NIPS12]: 15.4% error
• Next best team: 26.2% error



Other Domains & Applications
• Vision
• Natural Language Processing
• Speech
• Robotics
• Game Playing
• Medical Imaging
• Retail
• Surveillance
• Insurance
• Many others
Why are things working today?
• More compute power
– GPUs are ~50x faster

• More data
– 10^8 samples (compared to 10^3 in the 1990s)

• Better algorithms/models/regularizers (see the sketch after this list)
– Dropout
– ReLU
– Batch-Normalization
– …
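Minimal NumPy sketches of two of the ingredients listed above, ReLU and (inverted) dropout; the shapes and the drop probability are arbitrary examples, and batch normalization is omitted here for brevity.

# Toy ReLU and inverted dropout.
import numpy as np

def relu(x):
    """ReLU non-linearity: pass positives through, zero out negatives."""
    return np.maximum(x, 0.0)

def dropout(x, p_drop=0.5, train=True):
    """Inverted dropout: randomly zero units at train time, rescale to keep the expectation."""
    if not train:
        return x                                   # no-op at test time
    mask = (np.random.rand(*x.shape) >= p_drop)
    return x * mask / (1.0 - p_drop)

h = relu(np.random.randn(4, 8))                    # activations of a toy hidden layer
h = dropout(h, p_drop=0.5, train=True)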
THE SPACE OF MACHINE LEARNING METHODS

[Diagram: a map of methods (Perceptron, SVM, Boosting, Neural Net, Recurrent Neural Net, Convolutional Neural Net, Autoencoder, Deep (sparse/denoising) Autoencoder, Sparse Coding, GMM, Restricted BM, Deep Belief Net, BayesNP) laid out along a SHALLOW vs DEEP axis and a SUPERVISED vs UNSUPERVISED axis, with the PROBABILISTIC methods also marked. Disclaimer: showing only a subset of the known methods.]
Main types of deep architectures

• Feed-forward: Neural Nets, Conv Nets
• Feed-back: Hierarchical Sparse Coding, Deconv Nets
• Bi-directional: Stacked Auto-encoders, BiLSTM
• Recurrent: Recurrent Neural Nets, Recursive Nets, LSTM
Focus of this class

[Same diagram of architecture types as above, highlighting those covered in this course.]
Main types of learning protocols
• Purely supervised
• Backprop + SGD
– Good when there is lots of labeled data. (A minimal training-loop sketch follows this list.)

• Layer-wise unsupervised + supervised linear classifier
• Train each layer in sequence using regularized auto-encoders or RBMs
• Hold the feature extractor fixed, train a linear classifier on the features
– Good when labeled data is scarce but there is lots of unlabeled data.

• Layer-wise unsupervised + supervised backprop
• Train each layer in sequence
• Backprop through the whole system
– Good when the learning problem is very difficult.
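A minimal sketch of the purely supervised protocol (backprop + SGD on labeled data); the model, the fake dataset and the hyperparameters are placeholders chosen only for illustration.

# Purely supervised training: mini-batch SGD with backprop.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

x = torch.randn(256, 20)               # fake labeled dataset
y = torch.randint(0, 2, (256,))

for epoch in range(10):
    for i in range(0, len(x), 32):     # mini-batches
        xb, yb = x[i:i + 32], y[i:i + 32]
        opt.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()                # backprop
        opt.step()                     # SGD update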
Focus of this class

[Same list of learning protocols as above, highlighting the protocol(s) emphasized in this course.]
