
Deep Learning

http://bit.ly/DLSP20

Yann LeCun
NYU - Courant Institute & Center for Data Science
Facebook AI Research
http://yann.lecun.com
TAs: Alfredo Canziani, Mark Goldstein

NYU DL Spring 2020



Course information
Website: http://bit.ly/DLSP20
TAs: Alfredo Canziani & Mark Goldstein
Lectures:
9 lectures by YLC
3 guest lectures
Practical session
Tuesday evenings with Alfredo
Evaluation
Mid-term exam
Final project (on self-supervised learning & autonomous driving)

Course Plan (1/3)


Basics of Supervised Learning, Neural Nets, Deep Learning.
What DL can do
What are good features / representations
Backpropagation and architectural components
Modules, gradients, Architectures, losses, activations
Weight sharing / tying, Multiplicative interactions / sum-product / attention / gating
Mixture of experts, Siamese nets, hyper-networks
Convnets & applications 1
Convnets & applications 2
More DL Architectures
Recurrent nets, BPTT / applications, truck backer-upper
GRU / LSTM, Memory nets, Transformers / adapters
Graph NN

Course Plan (2/3)


Regularization tricks / Optimization tricks / understanding how DL works
Convergence of (convex) optimization
Geometry of the objective function
Initialization tricks, Normalization tricks, Dropout, gradient clipping...
Momentum, average SGD, Parallelization of SGD
Target prop, Lagrangian formulation
Energy-based models
Notations, Latent variable models, latent variable inference & regularization
Minimization, marginalization, free energy
Structured prediction / Reasoning as energy minimization
Sparse modeling / k-means / PCA / Convolutional sparse coding

Course Plan (3/3)

Self-supervised learning
Contrastive methods and Regularization methods for energy shaping
Accelerated inference: encoder, LISTA, VAE
Denoising AE, variational AE, contrastive divergence….
Generative adversarial Networks
SSL and beyond
How does learning work in humans and animals?
How do we get to human-level AI?
Building models of the world for control

Inspiration for Deep Learning: The Brain!


McCulloch & Pitts (1943): networks of binary neurons can do logic
Donald Hebb (1947): Hebbian synaptic plasticity
Norbert Wiener (1948): cybernetics, optimal filtering, feedback, autopoïesis, self-organization
Frank Rosenblatt (1957): Perceptron
Hubel & Wiesel (1960s): visual cortex architecture

Supervised Learning
Training a machine by showing examples instead of programming it
When the output is wrong, tweak the parameters of the machine
Works well for:
Speech → words
Image → categories
Portrait → name
Photo → caption
Text → topic
...

Supervised Learning goes back to the Perceptron & Adaline


The McCulloch-Pitts binary neuron: y = sign(∑_{i=1}^{N} W_i X_i + b)
Perceptron: weights are motorized potentiometers

Adaline: Weights are electrochemical “memistors”

https://youtu.be/X1G2g3SiCwU
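Below is a minimal NumPy sketch of the threshold unit defined by the formula above; the weights, bias, and input are made up for illustration and have nothing to do with the original motorized or memistor hardware.

```python
import numpy as np

def binary_neuron(x, w, b):
    """McCulloch-Pitts-style unit: y = sign(sum_i W_i * X_i + b)."""
    return np.sign(np.dot(w, x) + b)

# Toy usage with hand-picked weights
x = np.array([1.0, -1.0, 1.0])
w = np.array([0.5, 0.3, -0.2])
print(binary_neuron(x, w, b=0.1))   # prints 1.0 or -1.0 depending on the weighted sum
```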

The Standard Paradigm of Pattern Recognition

...and “traditional” Machine Learning

[Diagram: hand-engineered Feature Extractor → Trainable Classifier]


Multilayer Neural Nets and Deep Learning


Traditional Machine Learning: hand-engineered Feature Extractor → Trainable Classifier
Deep Learning: Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier (every stage is trainable)

(Deep) Multi-Layer Neural Nets

Multiple layers of simple units, e.g. ReLU(x) = max(x, 0)
Each unit computes a weighted sum of its inputs
The weighted sum is passed through a non-linear function
The learning algorithm changes the weights

[Figure: a stack of weight matrices and hidden layers mapping an input image to the output "This is a car"]
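As an illustration of this "weight matrix + non-linearity" stack, here is a tiny PyTorch sketch; the layer sizes are arbitrary assumptions, not taken from the course.

```python
import torch
import torch.nn as nn

# Each Linear layer computes weighted sums of its inputs (a weight matrix),
# and each ReLU passes those sums through the non-linearity max(x, 0).
mlp = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),   # one score per category (e.g. "this is a car")
)

x = torch.randn(1, 784)   # a fake flattened input image
print(mlp(x).shape)       # torch.Size([1, 10])
```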

Supervised Machine Learning = Function Optimization

[Diagram: input X → function with adjustable parameters W → objective function → error (e.g. desired output for "traffic light": -1)]

It's like walking in the mountains in a fog and following the direction of steepest descent to reach the village in the valley.
But each sample gives us a noisy estimate of the direction, so our path is a bit random.

Stochastic Gradient Descent (SGD): W_i ← W_i − η ∂L(W, X)/∂W_i
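A minimal sketch of the SGD update above, on an assumed toy one-parameter problem (fitting y = 3x); the only point here is the update rule W ← W − η ∂L/∂W applied to one noisy sample at a time.

```python
import torch

w = torch.zeros(1, requires_grad=True)     # the adjustable parameter W
eta = 0.05                                 # learning rate (step size) η

for step in range(300):
    x = torch.randn(1)                     # one sample -> a noisy gradient estimate
    y = 3.0 * x                            # desired output for this toy problem
    loss = ((w * x - y) ** 2).mean()       # per-sample objective L(W, X)
    loss.backward()                        # compute ∂L/∂W
    with torch.no_grad():
        w -= eta * w.grad                  # W <- W - η ∂L/∂W
        w.grad.zero_()

print(w.item())                            # close to 3.0
```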

Computing Gradients by Back-Propagation


A practical application of the chain rule.

[Diagram: a stack of modules F_1(X_0, W_1), ..., F_n(X_{n-1}, W_n) mapping the input X to the cost C(X, Y, Θ), with Y the desired output]

Backprop for the state gradients:
dC/dX_{i-1} = dC/dX_i · dX_i/dX_{i-1}
dC/dX_{i-1} = dC/dX_i · dF_i(X_{i-1}, W_i)/dX_{i-1}

Backprop for the weight gradients:
dC/dW_i = dC/dX_i · dX_i/dW_i
dC/dW_i = dC/dX_i · dF_i(X_{i-1}, W_i)/dW_i
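To make the chain rule concrete, here is a hand-rolled sketch for a two-module stack (the sizes and the tanh / squared-error choices are assumptions for illustration); an automatic differentiation library does exactly this bookkeeping for you.

```python
import numpy as np

# Forward: X1 = F1(X0, W1) = tanh(W1 @ X0),  C = 0.5 * ||X1 - Y||^2
rng = np.random.default_rng(0)
X0, Y = rng.normal(size=3), rng.normal(size=2)
W1 = rng.normal(size=(2, 3))

S1 = W1 @ X0
X1 = np.tanh(S1)
C = 0.5 * np.sum((X1 - Y) ** 2)

# Backward pass (chain rule), mirroring dC/dW_i = dC/dX_i · dF_i/dW_i etc.
dC_dX1 = X1 - Y                          # gradient of the cost w.r.t. the module output
dC_dS1 = dC_dX1 * (1.0 - X1 ** 2)        # back through tanh (tanh' = 1 - tanh^2)
dC_dW1 = np.outer(dC_dS1, X0)            # weight gradient
dC_dX0 = W1.T @ dC_dS1                   # state gradient, passed to the module below
print(dC_dW1.shape, dC_dX0.shape)        # (2, 3) (3,)
```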

Hubel & Wiesel's Model of the Architecture of the Visual Cortex


[Thorpe & Fabre-Thorpe 2001]
[Hubel & Wiesel 1962]:
simple cells detect local features
complex cells “pool” the outputs of simple cells within a retinotopic neighborhood
[Figure: multiple convolutions feed “simple cells”, whose outputs are pooled / subsampled by “complex cells”]
[Fukushima 1982], [LeCun 1989, 1998], [Riesenhuber 1999]...

Convolutional Network Architecture [LeCun et al. NIPS 1989]

Filter bank + non-linearity → pooling → filter bank + non-linearity → pooling → filter bank + non-linearity

Inspired by [Hubel & Wiesel 1962] & [Fukushima 1982] (Neocognitron):
simple cells detect local features
complex cells “pool” the outputs of simple cells within a retinotopic neighborhood

Convolutional Network (LeNet5, vintage 1990)


Filters-tanh → pooling → filters-tanh → pooling → filters-tanh
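For concreteness, here is a LeNet-style sketch in PyTorch following the filters-tanh → pooling pattern above; the exact layer sizes are illustrative, not the original 1990 network.

```python
import torch
import torch.nn as nn

lenet_like = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),    # filters + tanh
    nn.AvgPool2d(2),                              # pooling
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),   # filters + tanh
    nn.AvgPool2d(2),                              # pooling
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 120), nn.Tanh(),
    nn.Linear(120, 10),                           # 10 output classes (e.g. digits)
)

x = torch.randn(1, 1, 28, 28)   # a fake 28x28 grayscale character
print(lenet_like(x).shape)      # torch.Size([1, 10])
```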

LeNet character recognition demo 1992


Running on an AT&T DSP32C (floating-point DSP, 20 MFLOPS)

ConvNets can recognize multiple objects


All layers are convolutional
The network performs simultaneous segmentation and recognition
[LeCun, Bottou, Bengio, Haffner, Proc IEEE 1998]

Face & Pedestrian Detection with ConvNets (1993-2005)

[Osadchy, Miller & LeCun JMLR 2007], [Kavukcuoglu et al. NIPS 2010], [Sermanet et al. CVPR 2013]

Training a Robot to Drive Itself in Nature [Hadsell 2009]



Semantic Segmentation with ConvNets [Farabet 2012]


33 categories

1986-1996 Neural Net Hardware at Bell Labs, Holmdel


1986: 12x12 resistor array
Fixed resistor values
E-beam lithography: 6x6 microns
1988: 54x54 neural net
Programmable ternary weights, 6 microns
On-chip amplifiers and I/O
1991: Net32k: 256x128 net
Programmable ternary weights
320 GOPS, 1-bit convolver
1992: ANNA: 64x64 net
ConvNet accelerator: 4 GOPS
6-bit weights, 3-bit activations

FPGA ConvNet Accelerator: NeuFlow [Farabet 2011]


NeuFlow: Reconfigurable Dataflow architecture
Implemented on Xilinx Virtex6 FPGA
20 configurable tiles. 150GOPS, 10 Watts
Semantic Segmentation: 20 frames/sec at 320x240
Exploits the structure of convolutions
NeuFlow ASIC [Pham 2012]
150GOPS, 0.5 Watts (simulated)
The Deep Learning Revolution
Speech recognition: 2010
Image recognition: 2013
Natural language processing: 2015

Deep ConvNets for Object Recognition (on GPU)


AlexNet [Krizhevsky et al. NIPS 2012], OverFeat [Sermanet et al. 2013]
1 to 10 billion connections, 10 million to 1 billion parameters, 8 to 20 layers.

Error Rate on ImageNet

Depth inflation

(Figure: Anirudh Koul)



Deep ConvNets: depth inflation!

VGG [Simonyan 2013]
GoogLeNet [Szegedy 2014]
ResNet [He et al. 2015]
DenseNet [Huang et al. 2017]

GOPS vs Accuracy on ImageNet vs #Parameters


[Canziani 2016]

ResNet50 and ResNet100 are used routinely in production.
Each of the few billion photos uploaded on Facebook every day goes through a handful of ConvNets within 2 seconds.

Multilayer Architectures == Compositional Structure of Data


Natural data is compositional => it is efficiently representable hierarchically

Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier

Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]

Progress in Computer Vision


[He 2017]

Mask-RCNN, RetinaNet, feature pyramid network

Mask R-CNN [He et al. arXiv:1703.06870]
ConvNet produces an object mask for each region of interest
RetinaNet / FPN [Lin et al. arXiv:1708.02002]
One-pass object detection

Mask-RCNN Results on COCO dataset


Individual objects are segmented.

Mask R-CNN Results on COCO test set



Panoptic Feature Pyramid Network


Segments and recognizes object instances and regions
[Kirillov arXiv:1901.0244]

Detectron2 (FAIR) [Girshick 2019]


Panoptic instance segmentation, (dense) body pose estimation
Open source: https://github.com/facebookresearch/detectron2

Driving Cars with Convolutional Nets

MobilEye (2015)
NVIDIA

3D ConvNet for Medical Image Analysis (NYU)


Segmentation of the femur from MR images
[Deniz et al. Nature 2018]

3D ConvNet for Medical Image Analysis (NYU)



Breast Cancer Detection (NYU)


[Wu et al. arXiv:1903.08297] https://github.com/nyukat/breast_cancer_classifier

FastMRI (NYU+FAIR): 4x-8x speed up for MRI data acquisition

MRI images subsampled (in k-space) by 4x and 8x
[Zbontar et al. arXiv:1811.08839]
U-Net architecture
[Figure: reconstructions at 4-fold and 8-fold acceleration, with the corresponding k-space masks]

ConvNets (and Deep Learning) in Physics


Approximate solutions of PDEs with a learned update
Integration step of PDE solver: Z(t+1) = Z(t) + dt*G(Z(t))
where G() is a translation-invariant local operator.
Example: G(Z(t)) = V*f(W*Z(t)), i.e. conv → transfer function → conv (see the sketch after this list)
High energy Physics
Lattice QCD
Fluid Dynamics
Prediction of aero/hydro-dynamical properties of solids
Shape refinement by gradient descent
Cosmology / Astrophysics
Large-scale simulation of the early universe
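A minimal PyTorch sketch of the learned update G described above, under assumed shapes (a single-channel 2D field): G is two convolutions around a pointwise non-linearity, i.e. a translation-invariant local operator, plugged into the explicit step Z(t+1) = Z(t) + dt·G(Z(t)).

```python
import torch
import torch.nn as nn

class LearnedUpdate(nn.Module):
    """G(Z) = V * f(W * Z): conv -> non-linearity -> conv."""
    def __init__(self, channels=1, hidden=16):
        super().__init__()
        self.W = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)
        self.V = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, z):
        return self.V(torch.tanh(self.W(z)))

G = LearnedUpdate()
z = torch.randn(1, 1, 64, 64)   # the field Z(t) on a 64x64 grid
dt = 0.01
z_next = z + dt * G(z)          # one integration step: Z(t+1) = Z(t) + dt*G(Z(t))
print(z_next.shape)             # torch.Size([1, 1, 64, 64])
```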

ConvNets in Astrophysics [He et al. PNAS 07/2019]

1. Train a coarse-grained 3D U-Net to approximate a fine-grained simulation on a small volume
2. Use it for a simulation on a large volume (the early universe)

ConvNets (and Deep Learning) in Physics

Material Science / Molecular dynamics


Protein structure/function prediction
Prediction of material properties
High energy Physics
Jet filtering / analysis
“Deep learning in color: towards automated quark/gluon jet discrimination”, P. Komiske, E. Metodiev, M. Schwartz, arXiv:1612.01551
Cosmology / Astrophysics
Inferring constants from observations
Statistical studies of galaxies
Dark matter through gravitational lensing

Applications of ConvNets
Self-driving cars, visual perception
Medical signal and image analysis
Radiology, dermatology, EEG/seizure prediction….
Bioinformatics/genomics
Speech recognition
Language translation
Image restoration/manipulation/style transfer
Robotics, manipulation
Physics
High-energy physics, astrophysics
New applications appear every day
E.g. environmental protection,….

Applications of Deep Learning


Medical image analysis
Self-driving cars
Accessibility
Face recognition
Language translation
Virtual assistants
Content understanding for:
Filtering
Selection/ranking
Search
Games
Security, anomaly detection
Diagnosis, prediction
Science!
[Figure credits: Mnih 2015, MobilEye, Geras 2017, Esteva 2017]

ConvNets & The Visual System [Yamins et al. PNAS 2014]


ConvNets as Models of the Visual System?

[Yamins & Di Carlo 2016]



ConvNet models & fMRI


[Eickenberg et al. NeuroImage 2016]

Why does it work so well?

We can approximate any function with two layers.
Why do we need more layers?

What is so special about convolutional networks?
Why do they work so well on natural signals?

The objective functions are highly non-convex.
Why doesn’t SGD get trapped in local minima?

The networks are heavily over-parameterized.
Why do they not overfit?

The world is compositional


Convolutional networks learn hierarchical representations
Upper-layer representations are at a coarser spatial scale
Renormalization group theory
Multi-scale entanglement renormalization ansatz (MERA)

Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier

Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]

What current deep learning methods enable

What we can have:
Safer cars, autonomous cars
Better medical image analysis
Personalized medicine
Adequate language translation
Useful but stupid chatbots
Information search, retrieval, filtering
Numerous applications in energy, finance, manufacturing, environmental protection, commerce, law, artistic creation, games, ...

What we cannot have (yet):
Machines with common sense
Intelligent personal assistants
“Smart” chatbots
Household robots
Agile and dexterous robots
Artificial General Intelligence (AGI)
Learning Representations

What are good representations?


Why do networks need to be deep?
Deep Learning = Learning Representations/Features

The traditional model of pattern recognition (since the late 50's)


Fixed/engineered features (or fixed kernel) + trainable classifier
[Diagram: hand-crafted Feature Extractor → “simple” Trainable Classifier]

End-to-end learning / Feature learning / Deep learning
Trainable features (or kernel) + trainable classifier
[Diagram: Trainable Feature Extractor → Trainable Classifier]

Ideas for “generic” feature extraction

Basic principle: expand the dimension of the representation so that things are more likely to become linearly separable (a toy sketch follows the list below).

- space tiling
- random projections
- polynomial classifier (feature cross-products)
- radial basis functions
- kernel machines
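As a toy illustration of this principle (my own example, not from the slides): XOR is not linearly separable in two dimensions, but adding the single cross-product feature x1*x2, as a polynomial classifier would, makes it separable. The linear weights below are hand-picked.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                         # XOR labels: not linearly separable in 2D

X_expanded = np.hstack([X, X[:, :1] * X[:, 1:2]])    # add the cross-product feature x1*x2

w, b = np.array([1.0, 1.0, -2.0]), -0.5              # a linear classifier in the expanded space
print(np.sign(X_expanded @ w + b))                   # [-1.  1.  1. -1.] -- matches y
```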

Hierarchical representation
Hierarchy of representations with increasing level of abstraction
Each stage is a kind of trainable feature transform
Image recognition
Pixel → edge → texton → motif → part → object
Text
Character → word → word group → clause → sentence → story
Speech
Sample → spectral band → sound → … → phone → phoneme → word

Do we really need deep architectures?


Theoretician's dilemma: “We can approximate any function as closely as we want with a shallow architecture. Why would we need deep ones?”
Kernel machines (and 2-layer neural nets) are “universal”.

Deep learning machines
Deep machines are more efficient for representing certain classes of functions, particularly those involved in visual recognition
They can represent more complex functions with less “hardware”
We need an efficient parameterization of the class of functions that are useful for “AI” tasks (vision, audition, NLP...)
Why would deep architectures be more efficient?

[Bengio & LeCun 2007 “Scaling Learning Algorithms Towards AI”]


A deep architecture trades space for time (or breadth for depth)
More layers (more sequential computation), but less hardware (less parallel computation)
Example 1: N-bit parity (see the sketch after this list)
Requires N−1 XOR gates in a tree of depth log(N). Even easier if we use threshold gates.
Requires an exponential number of gates if we restrict ourselves to 2 layers (a DNF formula with an exponential number of minterms).
Example 2: circuit for addition of two N-bit binary numbers
Requires O(N) gates and O(N) layers using N one-bit adders with ripple carry propagation.
Requires lots of gates (some polynomial in N) if we restrict ourselves to two layers (e.g. Disjunctive Normal Form).
Bad news: almost all Boolean functions have a DNF formula with an exponential number of minterms, O(2^N)...
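A small sketch (my own, not from the slides) of the “deep” circuit in Example 1: N-bit parity computed by a balanced tree of XOR gates, using N−1 gates and depth about log2(N).

```python
def parity_tree(bits):
    """Compute the parity of the bits with a balanced tree of XOR gates."""
    layer, depth = list(bits), 0
    while len(layer) > 1:
        nxt = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:            # an odd leftover bit passes through unchanged
            nxt.append(layer[-1])
        layer, depth = nxt, depth + 1
    return layer[0], depth

print(parity_tree([1, 0, 1, 1, 0, 1, 0, 1]))   # (1, 3): 8 bits -> depth log2(8) = 3, using 7 gates
```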

Which Models are Deep?


2-layer models are not deep (even if you train the first layer)
Because there is no feature hierarchy
Neural nets with 1 hidden layer are not deep
SVMs and kernel methods are not deep
Layer 1: kernels; layer 2: linear
The first layer is “trained” with the simplest unsupervised method ever devised: using the samples as templates for the kernel functions.
Classification trees are not deep
No hierarchy of features. All decisions are made in the input space.
What are Good Features?
What are good representations?
Discovering the Hidden Structure in High-Dimensional Data: The manifold hypothesis
Learning Representations of Data:
Discovering & disentangling the independent explanatory factors
The Manifold Hypothesis:
Natural data lives in a low-dimensional (non-linear) manifold
Because variables in natural data are mutually dependent

Discovering the Hidden Structure in High-Dimensional Data

Example: all face images of a person


1000x1000 pixels = 1,000,000 dimensions
But the face has 3 Cartesian coordinates and 3 Euler angles
And humans have less than about 50 muscles in the face
Hence the manifold of face images for a person has <56 dimensions
The perfect representation of a face image:
Its coordinates on the face manifold
Its coordinates away from the manifold
We do not have good and general methods to learn functions that turn an image into this kind of representation
[Figure: an ideal feature extractor maps the image to a vector, e.g. [1.2, −3, 0.2, −2, ...], whose components encode face/not face, pose, lighting, expression, ...]

Disentangling factors of variation

The Ideal Disentangling Feature Extractor
[Figure: the ideal feature extractor maps pixel space (pixel 1 ... pixel n) to disentangled factors such as view and expression]
Data Manifold & Invariance: Some variations must be eliminated


[Hadsell et al. CVPR 2006]
Azimuth-Elevation manifold. Ignores lighting.

Basic Idea for Invariant Feature Learning

Embed the input non-linearly into a high(er)-dimensional space
In the new space, things that were not separable may become separable
Pool regions of the new space together
Bringing together things that are semantically similar, like pooling

[Diagram: Input → Non-Linear Function → unstable/non-smooth high-dimensional features → Pooling or Aggregation → stable/invariant features]
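A minimal PyTorch sketch of this recipe, with layer choices and sizes assumed purely for illustration: a non-linear convolutional embedding into more channels, followed by pooling over a neighborhood, so that shifting the input by one pixel barely changes the pooled features.

```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU())  # non-linear, higher-dim
pool = nn.MaxPool2d(kernel_size=4)                                            # aggregation over regions

x = torch.randn(1, 1, 32, 32)
x_shifted = torch.roll(x, shifts=1, dims=-1)    # the same input, shifted by one pixel

f, f_shifted = pool(embed(x)), pool(embed(x_shifted))
# The difference is small compared to the features themselves: pooling makes the
# representation much more stable under this variation.
print((f - f_shifted).abs().mean().item(), f.abs().mean().item())
```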

Non-Linear Expansion → Pooling

Entangled data manifolds
[Diagram: non-linear dimension expansion & disentangling, followed by pooling / aggregation]

Sparse Non-Linear Expansion → Pooling

Use clustering to break things apart, pool together similar things
[Diagram: clustering / quantization / sparse coding, followed by pooling / aggregation]
