Deep Learning 1.0 and Beyond: A Tutorial

The document provides an overview of deep learning 1.0 and beyond. It discusses classic deep learning models like convolutional neural networks and recurrent neural networks. It also covers more recent models like Transformers and graph neural networks. The document aims to provide a tutorial on deep learning concepts from early models to current research directions.


Deep Learning 1.0 and Beyond
A tutorial, Part I

A/Prof Truyen Tran, Deakin University
With contributions from Vuong Le, Hung Le, Thao Le, Tin Pham & Dung Nguyen

[email protected] | truyentran.github.io | @truyenoz
letdataspeak.blogspot.com | goo.gl/3jJ1O0 | linkedin.com/in/truyen-tran

December 2020
[Slide: "8 years snapshot" timeline - 2012, AusDM 2016, Turing Awards 2018, GPT-3 2020]

Why (still) DL?

Theoretical
• Expressiveness: neural nets can approximate any function.
• Learnability: neural nets are trained easily.
• Generalisability: neural nets generalize surprisingly well to unseen data.

Practical
• Generality: applicable to many domains.
• Competitive: DL is hard to beat as long as there are data to train on.
• Scalability: DL gets better with more data, and it is very scalable.
It is easy to get lost in the current DL zoo
[Slide figures: deep learning in the news (AAAI'20, Vietnam News) and the neural network zoo chart from asimovinstitute.org/neural-network-zoo/]
Model design goals
• Uniformity
• Universality
• Scalability
• Reusability
• Capture long-term dependencies in time and space
• Capture invariances natively
• Resource adaptive, compressible
• Easy to train
• Use (almost) no labels
• Ability to extrapolate
• Support both fast and slow learning
• Support both fast and slow inference
Agenda

Deep learning 1.0
• Classic models
• Transformers
• Graph neural networks
• Unsupervised learning

Deep learning 2.0
• A system view
• Neural memories
• Neural reasoning
• Theory of mind
Deep models via layer stacking
Theoretically powerful, but limited in practice.
[Figure: the integrate-and-fire neuron as a feature detector, and the block representation of stacked layers. Credit: andreykurenkov.com]
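A minimal PyTorch sketch of layer stacking (the layer sizes and the 10-way output are arbitrary assumptions for illustration):

```python
import torch
import torch.nn as nn

# A plain deep model: stacked blocks of linear map + nonlinearity.
# Each layer acts as a feature detector over the previous layer's output.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),              # class scores
)

x = torch.randn(32, 784)             # a batch of 32 flattened inputs
logits = model(x)                    # shape: (32, 10)
```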


Shorten path length with skip-connections
Easier information and gradient flows.
[Figure: residual networks in theory vs. in practice]
Credit: http://qiita.com/supersaiakujin/items/935bbc9610d0f87607e8 | http://torch.ch/blog/2016/02/04/resnets.html
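A toy residual block illustrating the identity shortcut; the two-layer body and ReLU placement are assumptions, not the exact ResNet recipe:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the skip connection shortens the path for gradients."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))   # identity shortcut + residual branch

x = torch.randn(8, 64)
y = ResidualBlock(64)(x)                      # same shape as x
```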
Sequence model with recurrence
Assumes a stationary world: the same cell and weights are applied at every time step.
[Figure: RNN input-output configurations for classification, image captioning, sentence classification, sequence labelling and neural machine translation]
Source: http://karpathy.github.io/assets/rnn/diags.jpeg
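A minimal many-to-one sketch (e.g., sentence classification); the GRU cell, dimensions and two-way output are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Many-to-one configuration: the recurrent state is updated with shared
# weights at every step, and only the final state feeds the classifier.
rnn = nn.GRU(input_size=100, hidden_size=128, batch_first=True)
classifier = nn.Linear(128, 2)

tokens = torch.randn(4, 20, 100)         # 4 sequences, 20 steps, 100-dim embeddings
_, h_last = rnn(tokens)                  # h_last: (1, 4, 128)
logits = classifier(h_last.squeeze(0))   # (4, 2)
```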
Spatial model with convolutions
Assumes filters/motifs are translation invariant.
Learnable kernels act as feature detectors, often many per layer.
Credit: http://colah.github.io/posts/2015-09-NN-Types-FP/ | andreykurenkov.com

Convolutional networks
Pooling summarizes filter responses, destroying location information.
Credit: adeshpande3.github.io
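A small convolution-plus-pooling sketch; channel counts and the 32x32 input are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# Convolution applies the same learnable kernels at every location
# (translation invariance); pooling summarizes responses and discards
# exact positions.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                            # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                    # summarize each filter over all locations
)

x = torch.randn(8, 3, 32, 32)
z = features(x).flatten(1)                      # (8, 32) location-free feature vector
```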
Operator on sets/bags: Attention
Not everything is created equal for a goal.
An attention model is needed to select or ignore certain computations or inputs.
Attention can be "soft" (differentiable) or "hard" (requires RL).
Attention provides a short-cut for long-term dependencies.
It also encourages sparsity if done right!
Credit: http://distill.pub/2016/augmented-rnns/
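A minimal sketch of soft (differentiable) attention over a set, using plain dot-product scores (the scoring function is one of many possible choices):

```python
import torch
import torch.nn.functional as F

def soft_attention(query, elements):
    """Soft attention: weight set elements by their relevance to a query.

    query:    (d,)   goal/context vector
    elements: (n, d) an unordered set/bag of input vectors
    """
    scores = elements @ query              # (n,) relevance of each element
    weights = F.softmax(scores, dim=0)     # soft selection; near-sparse if peaked
    return weights @ elements              # (d,) attended summary

summary = soft_attention(torch.randn(16), torch.randn(10, 16))
```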


Fast weights | HyperNet
The world is recursive.
Early ideas from the early 1990s by Juergen Schmidhuber and collaborators.
Data-dependent weights | using a controller to generate the weights of the main net.

Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).
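A toy hypernetwork-style layer in which a small controller generates the weights of the main linear map from the input itself; the sizes and the controller architecture are assumptions, not the design of Ha et al.:

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """Fast weights: a controller emits data-dependent weights for the main layer."""
    def __init__(self, d_in, d_out, d_ctrl=32):
        super().__init__()
        self.controller = nn.Sequential(
            nn.Linear(d_in, d_ctrl), nn.Tanh(),
            nn.Linear(d_ctrl, d_in * d_out),
        )
        self.d_in, self.d_out = d_in, d_out

    def forward(self, x):                                        # x: (batch, d_in)
        W = self.controller(x).view(-1, self.d_out, self.d_in)   # per-example weights
        return torch.bmm(W, x.unsqueeze(-1)).squeeze(-1)         # (batch, d_out)

y = HyperLinear(8, 4)(torch.randn(5, 8))
```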
Neural architecture search
When design is cheap and non-creative.
The search space is huge and discrete.
Can be done through meta-heuristics (e.g., genetic algorithms) or reinforcement learning (e.g., one discrete change in model structure is an action).

Bello, Irwan, et al. "Neural optimizer search with reinforcement learning." arXiv preprint arXiv:1709.07417 (2017).
Agenda

Deep learning 1.0
• Classic models
• Transformers
• Graph neural networks
• Unsupervised learning

Deep learning 2.0
• A system view
• Neural memories
• Neural reasoning
• Theory of mind
Motivations
RNNs are theoretically powerful, but purely sequential, hence slow, and have limited effective memory at finite size.
• Augmenting with external memories solves some problems, but is still slow.
CNNs are feed-forward nets and can be parallelized, but are theoretically not as strong: random long-term dependencies are hard to encode.
Prior to 2017, most architectures were mixtures of FNN, RNN and CNN → non-uniformity, hard to scale to a large number of tasks.
We need support for:
• Parallel computation
• Long-range dependency encoding (constant path length)
• Uniform construction (e.g., like the columnar structure of the neocortex)
Prelim: Memory networks
• Input is a set → loaded into a memory, which is NOT updated.
• State is an RNN with attention reading from the inputs.
• Concepts: query, key and content + content addressing.
• Deep models, but constant path length from input to output.
• Equivalent to an RNN with a shared input set.

Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in Neural Information Processing Systems. 2015.
Transformers: The triumph of self-attention
[Figure: self-attention as memory access - each element's state issues a query that is matched against the keys of all inputs held in memory]

Tay, Yi, et al. "Efficient transformers: A survey." arXiv preprint arXiv:2009.06732 (2020).
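A single-head scaled dot-product self-attention sketch (no multi-head, masking or positional encoding; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def self_attention(X, Wq, Wk, Wv):
    """Every element queries all others, so any pair is connected by a
    constant-length path and all positions are processed in parallel.

    X: (n, d) input set/sequence; Wq, Wk, Wv: (d, d_k) projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / K.shape[-1] ** 0.5     # (n, n) pairwise relevance
    return F.softmax(scores, dim=-1) @ V      # (n, d_k) updated states

n, d, dk = 6, 16, 16
out = self_attention(torch.randn(n, d), *(torch.randn(d, dk) for _ in range(3)))
```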
Transformers are (new) Hopfield nets

Ramsauer, Hubert, et al. "Hopfield Networks Is All You Need." arXiv preprint arXiv:2008.02217 (2020).
Transformer vs. memory networks
Memory network:
• Attention over the input set.
• One hidden state update at a time.
• The final state integrates information of the set, conditioned on the query.

Transformer:
• Loads all inputs into working memory.
• Assigns one hidden state per input element.
• All hidden states (including those from the query) are used to compute the answer.
Universal transformers

Dehghani, Mostafa, et al. "Universal Transformers." International Conference on Learning Representations. 2018.
https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
Efficient Transformers
The Transformer is quadratic in time with respect to input length → it cannot deal with large sets (or sequences).

Tay, Yi, et al. "Efficient transformers: A survey." arXiv preprint arXiv:2009.06732 (2020).
Agenda

Deep learning 1.0
• Classic models
• Transformers
• Graph neural networks
• Unsupervised learning

Deep learning 2.0
• A system view
• Neural memories
• Neural reasoning
• Theory of mind
Why graphs?
Graphs are pervasive in many scientific disciplines.
Deep learning needs to move beyond vector, fixed-size data.
The sub-area of graph representation has reached a certain maturity, with multiple reviews, workshops and papers at top AI/ML venues (e.g., NeurIPS 2020).
System medicine
https://www.frontiersin.org/articles/10.3389/fphys.2015.00225/full
Biology, pharmacy & chemistry
• Molecule as graph: atoms as nodes, chemical bonds as edges
• Computing molecular properties
• Chemical-chemical interaction
• Chemical reaction

Gilmer, Justin, et al. "Neural message passing for quantum chemistry." arXiv preprint arXiv:1704.01212 (2017).
Penmatsa, Aravind, Kevin H. Wang, and Eric Gouaux. "X-ray structure of dopamine transporter elucidates antidepressant mechanism." Nature 503.7474 (2013): 85-90.
Materials science
• Crystal properties
• Exploring/generating solid structures
• Inverse design

Xie, Tian, and Jeffrey C. Grossman. "Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties." Physical Review Letters 120.14 (2018): 145301.
Videos as space-time region graphs

(Abhinav Gupta et al, ECCV’18)


Basic neural graph mechanisms
• Relation graph
• Message passing, and generalized message passing
• GCN update rule (vector form and matrix form)

#REF: Pham, Trang, et al. "Column Networks for Collective Classification." AAAI. 2017.
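The GCN update rules referenced on the slide appear only as figures; below is a sketch of the commonly used matrix form, a standard formulation rather than necessarily the exact variant used in Column Networks:

```python
import torch

def gcn_layer(A, H, W):
    """One message-passing / GCN update in matrix form:
        H' = ReLU( D^-1/2 (A + I) D^-1/2  H  W )
    Each node aggregates normalized messages from its neighbours (and itself),
    then applies a shared linear map.
    """
    A_hat = A + torch.eye(A.shape[0])       # add self-loops
    d = A_hat.sum(dim=1)                    # node degrees
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])   # toy 3-node graph
H1 = gcn_layer(A, torch.randn(3, 8), torch.randn(8, 16))
```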
Attention: Not all messages are created equal
(Do et al., arXiv 2017; Veličković et al., ICLR 2018)

Do, Kien, Truyen Tran, and Svetha Venkatesh. "Learning Deep Matrix Representations." arXiv preprint arXiv:1703.01454 (2017).
Neural graph morphism
• Input: a graph
• Output: a new graph with the same nodes but different edges
• Model: graph morphism
• Method: graph transformation policy network (GTPN)

Kien Do, Truyen Tran, and Svetha Venkatesh. "Graph Transformation Policy Network for Chemical Reaction Prediction." KDD'19.
Neural graph recurrence
Graphs that represent interactions between entities through time:
• Spatial edges are node interactions at a time step.
• Temporal edges are consistency relationships through time.
ASSIGN: Asynchronous, Sparse Interaction Graph Network
(Morais et al., 2021 @ A2I2, Deakin – Work in Progress)
Graph generation
• No regular structure (e.g., grid, sequence, ...)
• Graphs are permutation invariant: #permutations is an exponential function of #nodes, and the probability of a generated graph G needs to be marginalized over all possible permutations.
• Generating graphs of variable size
• Aim for diversity of generated graphs

Generation methods
• Classical random graph models, e.g., an exponential family of probability distributions for directed graphs (Holland and Leinhardt, 1981)
• Deep generative models: GraphVAE, Graphite, Junction Tree VAE, GAN variants, etc.
• Sequence-based & RL methods
GraphRNN
A case of graph dynamics: nodes and edges are added sequentially.
Tractability is handled using a BFS node ordering.

You, Jiaxuan, et al. "GraphRNN: Generating realistic graphs with deep auto-regressive models." ICML (2018).
Step-wise graph construction using reinforcement learning
Graph representation (message passing) | graph validation (RL) | graph faithfulness (GAN)

You, Jiaxuan, et al. "Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation." NeurIPS (2018).
Agenda

Deep learning 1.0
• Classic models
• Transformers
• Graph neural networks
• Unsupervised learning

Deep learning 2.0
• A system view
• Neural memories
• Neural reasoning
• Theory of mind
Unsupervised learning
Humans mainly learn by exploring, without clear instructions and labelling.
Photo credit: Brandon/Flickr
Representation learning, a bit of history
"Representation is the use of signs that stand in for and take the place of something else."
• It has been a goal of neural networks since the 1980s and of the current wave of deep learning (2005-present) → replacing feature engineering.
• Between 2006 and 2012, many unsupervised learning models appeared with varying degrees of success: RBM, DBN, DBM, DAE, DDAE, PSD.
• Between 2013 and 2018, most models were supervised, following AlexNet.
• Since 2018, unsupervised learning has become competitive (with contrastive learning, self-supervised learning, BERT)!
[Figure: the neural network zoo. Source: asimovinstitute.org/neural-network-zoo/]
Criteria for a good representation
• Separates factors of variation (aka disentanglement), which are linearly correlated with desired outputs of downstream tasks.
• Provides abstraction that is invariant against deformations and small variations.
• Is distributed (one concept is represented by multiple units), which is compact and good for interpolation.
• Optionally, offers dimensionality reduction.
• Optionally, is sparse, giving room for emerging symbols.

Bengio, Yoshua, Aaron Courville, and Pascal Vincent. "Representation learning: A review and new perspectives." IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8 (2013): 1798-1828.
Why neural unsupervised learning?
Neural nets have representational richness:
• FFNs are function approximators.
• RNNs are program approximators: they can estimate a program's behaviour and generate strings.
• CNNs capture translation invariance.
• Transformers are powerful contextual encoders.
Compactness: representations are (sparse and) distributed.
• Essential for perception, compact storage and reasoning.
Accounting for uncertainty: neural nets can be stochastic to model distributions.
Symbolic representation: realised through sparse activations and gating mechanisms.
Neural autoregressive models: predict the next step given the history
The keys: (a) long-term dependencies, (b) ordering, & (c) parameter sharing.
Can be realized using:
• RNN
• CNN: one-sided CNN, dilated CNN (e.g., WaveNet), PixelCNN
• Transformers → the GPT-X family
• Masked autoencoder → MADE
Pros: general, good quality thus far.
Cons: slow; needs better inductive biases for scalability.
Credit: lyusungwon.github.io/studies/2018/07/25/nade/
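A minimal next-step-prediction sketch with a shared recurrent history encoder; the vocabulary and sizes are arbitrary assumptions, and a causal CNN or Transformer decoder could play the same role:

```python
import torch
import torch.nn as nn

# Autoregressive modelling: p(x) = prod_t p(x_t | x_<t).
vocab, emb, hid = 1000, 64, 128
embed = nn.Embedding(vocab, emb)
rnn = nn.GRU(emb, hid, batch_first=True)
head = nn.Linear(hid, vocab)

tokens = torch.randint(0, vocab, (2, 10))     # (batch, time)
h, _ = rnn(embed(tokens[:, :-1]))             # states over the history
logits = head(h)                              # predicts tokens[:, 1:]
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
```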
Generative models: discover the underlying process that generates data
Many applications:
• Text to speech
• Simulating data that are hard to obtain/share in real life (e.g., healthcare)
• Generating meaningful sentences conditioned on some input (foreign language, image, video)
• Semi-supervised learning
• Planning
Deep (Denoising) AutoEncoder: self-reconstruction of data
[Figure: raw data (optionally with added noise) → encoder → representation → decoder → reconstruction; a shallow auto-encoder vs. a deep auto-encoder of stacked feature detectors]
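A minimal denoising autoencoder sketch (sizes and noise level are arbitrary assumptions):

```python
import torch
import torch.nn as nn

# Denoising autoencoder: corrupt the input, reconstruct the clean version.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(16, 784)                     # raw data
x_noisy = x + 0.1 * torch.randn_like(x)     # optionally add noise
z = encoder(x_noisy)                        # representation
recon = decoder(z)                          # reconstruction
loss = nn.functional.mse_loss(recon, x)     # self-reconstruction objective
```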
Variational Autoencoder
Approximating the posterior by a neural net.
Two separate processes: generative (hidden → visible) versus recognition (visible → hidden).
[Figure: data → recognising net → Gaussian hidden variables → generative net → data]
Credit: kvfrans.com
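A sketch of the VAE core: the recognition net outputs a Gaussian over hidden variables, and a sample drawn via the reparameterization trick is decoded (single-layer nets and a 20-d latent are assumptions):

```python
import torch
import torch.nn as nn

recognise = nn.Linear(784, 2 * 20)          # -> mean and log-variance (20-d latent)
generate = nn.Linear(20, 784)

x = torch.rand(8, 784)
mu, logvar = recognise(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # sample q(z|x)
recon = torch.sigmoid(generate(z))

kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
loss = nn.functional.binary_cross_entropy(recon, x) + kl   # negative ELBO (up to scaling)
```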
GAN: Generative Adversarial Nets
Matching data statistics.
Yann LeCun: GAN is one of the best ideas of the past 10 years!
Instead of modeling the entire distribution of the data, a GAN learns to map ANY random distribution into the region of the data, so that no discriminator can distinguish sampled data from real data.
[Figure: a generator net maps any random distribution z → x; a binary discriminator, usually a neural classifier, tells generated samples from real data]
(Adapted from Goodfellow's, NIPS 2014)
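A bare-bones GAN sketch showing the discriminator and generator objectives (architectures and sizes are arbitrary assumptions):

```python
import torch
import torch.nn as nn

# G maps any random z to data space; D tries to tell real from fake.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784), nn.Sigmoid())
D = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 784)
fake = G(torch.randn(32, 16))

# Discriminator step: real -> 1, fake -> 0.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
# Generator step: fool D into predicting 1 on fakes.
g_loss = bce(D(fake), torch.ones(32, 1))
```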
Progressive GAN: generated images

Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
BERT
A Transformer that predicts its own masked parts.
BERT is like a parallel approximate pseudo-likelihood:
• ~ maximizing the conditional likelihood of some variables given the rest.
• When the number of variables is large, this converges to the MLE (maximum likelihood estimate).

https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
Contrastive learning: comparing samples

Le-Khac, Phuc H., Graham Healy, and Alan F. Smeaton. "Contrastive Representation Learning: A Framework and Review." arXiv preprint arXiv:2010.05113 (2020).
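A sketch of a contrastive (InfoNCE-style) objective over two augmented views of a batch; the temperature and in-batch negatives follow common practice rather than a specific paper:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Two views of the same sample should agree; all other samples in the
    batch act as negatives.

    z1, z2: (batch, d) embeddings of two augmented views.
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature          # (batch, batch) similarities
    targets = torch.arange(z1.shape[0])       # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```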
Unsupervised learning: a few more points
• No external labels, but rich training signals (thousands of bits per sample, as opposed to a few bits in supervised learning).
• A few techniques:
  - Compressing data as much as possible with little loss
  - Energy-based, i.e., pulling down the energy of observed data and pulling up everything else
  - Filling in the missing slots (aka predictive learning, self-supervised learning)
• We have not covered unsupervised learning on graphs (e.g., DeepWalk, GPT-GNN), but the general principles should hold.
• Question: multiple objectives, or no objective at all?
• Question: emergence from many simple interacting elements?

Liu, Xiao, et al. "Self-supervised learning: Generative or contrastive." arXiv preprint arXiv:2006.08218 (2020).
End of part I
