
WiSe 2023/24

Deep Learning 1

Lecture 1 Introduction
Organisational Matters

1/44
Organisational Matters

▶ Lectures
▶ Fridays, 10:15-11:45, HE2013
▶ First lecture: 20.10.2023
▶ Held by Prof. Dr. Grégoire Montavon
▶ Tutorials
▶ Fridays, 14:15-15:45, A151
▶ First tutorial: 03.11.2023
▶ Held by Lorenz Vaitl & Dr. Mihail Bogojeski
▶ Exams
▶ First Exam: 20.02.2024, 11:30–13:30, HE 101
▶ Second Exam: 04.04.2024, 11:30–13:30, H 104
▶ Prerequisite: pass (> 50%) 6 homework assignments

2/44
Homework

▶ 10 homework assignments in total


▶ Every week, starting 27.10
▶ Either theoretical or practical or hybrid
▶ Theoretical: Math heavy, pen and paper
▶ Practical: Programming, Python, PyTorch & Multiple Choice
▶ Submission via ISIS
▶ Either ISIS-quiz (alone) or assignment (in groups of up to 6)
▶ Assignments will be corrected by us
▶ Deadline for finding a group: 26.10, group selection via ISIS
▶ For general questions, please don't hesitate to use the forum

3/44
Lecture

4/44
Outline

▶ Review of Classical ML
▶ Linear & Nonlinear Models
▶ Deep Learning / Neural Networks
▶ Motivations
▶ Biological vs. Artificial Neuron
▶ Biological vs. Artificial Neural Networks
▶ Practical Architectures
▶ Applications of Deep Learning
▶ DL for Autonomous Decision Making
▶ DL for Data Science
▶ DL for Neuroscience
▶ Theoretical Considerations
▶ Universal Approximation Theorem
▶ Compactness of Representations
▶ Optimization

5/44
Book Suggestions

C. Bishop
Neural Networks for Pattern Recognition
Oxford University Press, 1995

I. Goodfellow, Y. Bengio, A. Courville


Deep Learning
MIT Press, 2016
(online version at: https://www.deeplearningbook.org/)

6/44
Part 1 Review of Classical ML

7/44
ML Review: Linear Models

A linear classification model takes as input a data point x ∈ Rd (a vector) and applies the linear function

f(x) = x1 w1 + x2 w2 + · · · + xd wd + b
     = w⊤ x + b

to the data point. It then classifies the data point to be of the first class if f(x) > 0 and of the other class if f(x) < 0.
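
As a minimal illustration of this decision rule, here is a short NumPy sketch (the weight vector, bias, and data point are made-up example values, not from the lecture):

```python
import numpy as np

def linear_classify(x, w, b):
    """Evaluate f(x) = w^T x + b and threshold it at zero."""
    f = np.dot(w, x) + b
    return 1 if f > 0 else 2   # class 1 if f(x) > 0, class 2 if f(x) < 0

# illustrative values
w = np.array([0.5, -1.0, 2.0])
b = -0.25
x = np.array([1.0, 0.0, 0.5])
print(linear_classify(x, w, b))   # -> 1, since f(x) = 0.5 + 1.0 - 0.25 = 1.25 > 0
```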

8/44
ML Review: Learning a Linear Model

In practice, we would like to learn a model from some training set of data points and label pairs D = {(x1, y1), . . . , (xN, yN)}. A popular formulation is given by the constrained optimization problem:

min_{w,b} ∥w∥²
s.t.  ∀ i ∈ class 1 : xi⊤ w + b ≥ 1
      ∀ i ∈ class 2 : xi⊤ w + b ≤ −1

which finds the decision boundary between the two classes that has the highest margin. This is a convex optimization problem (convex objective and convex constraints), and one can easily extract the global optimum.
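
Below is a sketch of how this optimization problem could be solved numerically. It assumes the cvxpy convex-optimization library and a made-up toy dataset; neither is part of the lecture.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data: class 1 around (2, 2), class 2 around (-2, -2)
rng = np.random.default_rng(0)
X1 = rng.normal(loc=+2.0, scale=0.5, size=(20, 2))   # class 1
X2 = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))   # class 2

w = cp.Variable(2)
b = cp.Variable()
constraints = [X1 @ w + b >= 1,    # all class-1 points on the >= +1 side
               X2 @ w + b <= -1]   # all class-2 points on the <= -1 side
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()
print(w.value, b.value)            # maximum-margin separating hyperplane
```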

9/44
ML Review: From Linear to Nonlinear Models

Most problems are however not linearly separable, and we need a way to
enable ML models to learn nonlinear decision boundaries. A simple approach
consists of nonlinearly mapping x to some high-dimensional feature space
ϕ(x), and classifying linearly in that space. The decision boundary becomes
nonlinear in input space.
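
A small sketch of this idea, assuming NumPy (the data, feature map, and weights below are illustrative): the XOR-style problem is not linearly separable in input space but becomes separable after a hand-picked feature map ϕ.

```python
import numpy as np

# XOR-style toy problem: no line in input space separates the two classes
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, -1])

def phi(x):
    """Hand-designed feature map: append the product feature x1 * x2."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2])

Phi = np.stack([phi(x) for x in X])
w, b = np.array([2., 2., -4.]), -1.   # a linear classifier in feature space
print(np.sign(Phi @ w + b))           # -> [-1.  1.  1. -1.], matching y
```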

Question: How to choose the feature map ϕ?

10/44
ML Review: Feature Engineering

Idea:
▶ Extract through some hand-designed algorithm input features that
make sense for the task, and store them in some feature vector ϕ(x).

Limitation:
▶ No guarantee that the first few features the algorithm generates are
good enough/sufficient to solve the task accurately. Making the
problem linearly separable may require an extremely large number of
features (→ computationally expensive).

11/44
Part 2 Deep Learning / Neural Networks

12/44
Beyond Feature Engineering: Deep Learning

Empirical Observation:
▶ Humans have proven capable of mastering tasks such as visual recognition, motion, speech, games, etc. All these tasks are highly nonlinear (i.e. they somehow require some nonlinear feature representation ϕ(x)).

Question:
▶ Can machine learning models take inspiration from some mechanisms in the human brain in order to learn the needed feature representation ϕ(x)?

13/44
The Human Brain as a Model for Machine Learning

▶ The human brain is a highly complex (and so far scarcely understood) system.
▶ Scientific research in the past century has however provided some understanding of what might enable these systems to learn successfully:
  ▶ Complex abstract representations result from the interconnection of many simple nonlinear neurons.
  ▶ The property of these neurons to modify their response when exposed repeatedly to certain stimuli enables the brain to learn.

14/44
Biological vs. Artificial Neurons

[Figure: biological neuron vs. artificial neuron]

▶ The biological neuron is a highly sophisticated physical system with complex spatio-temporal dynamics that transfers signals received by the dendrites to the axon.
▶ Artificial neurons only retain the most essential components of the biological neuron for practical purposes: nonlinearity and ability to learn.

15/44
The Artificial Neuron

▶ Simple multivariate, nonlinear and differentiable function.
▶ Ultra-simplification of the biological neuron that retains two key properties: (1) ability to produce complex nonlinear representations when many neurons are interconnected, (2) ability to learn from the data.

16/44
Interconnecting Multiple Neurons

[Figure: biological network vs. artificial network]

▶ The human brain is composed of a very large number of neurons (approx. 86 billion) that are interconnected (150 trillion synapses).
▶ An artificial neural network mimics the way biological neurons are connected in the brain by composing many artificial neurons. For practical purposes, the neurons of an artificial neural network can be organized in a layered structure.

17/44
Neural Networks: Forward Pass

The forward pass mapping the input of the network to the output is given by:

zj = Σi xi wij + bj      aj = g(zj)     (layer 1)
zk = Σj aj wjk + bk      ak = g(zk)     (layer 2)
y  = Σk ak vk + c                       (layer 3)
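
A direct per-neuron transcription of these sums in Python (a sketch only; the lecture leaves the nonlinearity g generic, tanh is assumed here, and the usage values are made up):

```python
import math

def g(z):
    return math.tanh(z)   # assumed nonlinearity; the lecture keeps g generic

def forward(x, w1, b1, w2, b2, v, c):
    """Per-neuron forward pass following the three summation formulas above.
    w1[i][j] connects input i to layer-1 unit j; w2[j][k] connects layer 1 to layer 2."""
    z1 = [sum(x[i] * w1[i][j] for i in range(len(x))) + b1[j] for j in range(len(b1))]
    a1 = [g(z) for z in z1]                                      # layer 1
    z2 = [sum(a1[j] * w2[j][k] for j in range(len(a1))) + b2[k] for k in range(len(b2))]
    a2 = [g(z) for z in z2]                                      # layer 2
    y = sum(a2[k] * v[k] for k in range(len(v))) + c             # layer 3
    return y

# tiny usage example with 2 inputs and 2 hidden units per layer
print(forward([1.0, -1.0], [[0.5, -0.3], [0.2, 0.8]], [0.0, 0.1],
              [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [0.7, -0.2], 0.05))
```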

18/44
Neural Networks: Forward Pass (Matrix Formulation)

Matrix formulation:

z(1) = W(1) x + b(1)        a(1) = g(z(1))     (layer 1)
z(2) = W(2) a(1) + b(2)     a(2) = g(z(2))     (layer 2)
y = v⊤ a(2) + c                                (layer 3)

where [W(1)]ji = wij, [W(2)]kj = wjk, and where g applies element-wise.
The matrix formulation makes it convenient to train neural networks with
hundreds, thousands, or more neurons.
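
The same computation in matrix form, sketched with PyTorch tensors (the layer sizes and the tanh nonlinearity are illustrative assumptions, not from the lecture):

```python
import torch

d, h1, h2 = 4, 32, 16                        # illustrative layer sizes
W1, b1 = torch.randn(h1, d), torch.zeros(h1)
W2, b2 = torch.randn(h2, h1), torch.zeros(h2)
v, c = torch.randn(h2), torch.zeros(1)
g = torch.tanh                               # element-wise nonlinearity (assumed)

x = torch.randn(d)                           # one input vector
z1 = W1 @ x + b1; a1 = g(z1)                 # layer 1
z2 = W2 @ a1 + b2; a2 = g(z2)                # layer 2
y = v @ a2 + c                               # layer 3 (scalar output)
print(y)
```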

19/44
Image Recognition: The Neocognitron (1979)

The Neocognitron [2] is an early neural network for recognizing images. It is designed in such a way that the produced output becomes approximately invariant to small local translations/distortions in the input image.

The Neocognitron consists of an alternation of `simple cells' (convolutions) and `complex cells' (pooling). It is a precursor of modern convolutional neural network architectures.

20/44
Image Recognition: Large ConvNets (2012...)

Example: The VGG-16 convolutional neural network [4]:

▶ The neural network takes the image as input and processes it by multiple layers to finally arrive at a prediction.
▶ Throughout the multiple layers, one progressively trades spatial resolution for more complex recognized shapes.

21/44
Image Recognition: Large ConvNets (2012...)
Examples of Prediction:

Krizhevsky et al. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012

▶ Can accurately classify images into a large number of classes (1000 possible classes).
▶ Even misclassifications of the model are somewhat reasonable (e.g. two different objects in the same image, similar classes).

22/44
Other Deep Learning Successes

Examples:
Speech Recognition Hard to manually extract good features from the
raw waveform or a spectrogram. Speech is entangled with
complex noise patterns (e.g. echo, reverberation, multiple
sources). Deep learning / neural networks have become
state-of-the-art on speech recognition (e.g. DeepSpeech2).
Natural Language Processing Unlike formal languages, there is no
simple way to parse a natural language. Yet, the complex
construction of the sentence needs to be extracted (e.g.
logical reasoning, sentiment, irony). Deep learning
architectures such as transformer networks have been highly
successful in practice (e.g. BERT/GPT/LLaMA language
models).
Playing Games Deep learning has been combined with other AI
techniques (e.g. search, RL) in order to achieve above-human
performance in many complex and competitive
games (e.g. AlphaGo, AlphaZero).

23/44
Part 3 Applications of Deep Learning

24/44
Applications of Deep Learning

Three main categories of applications:


Autonomous Decision Making Take good decisions in a given
environment (can be used as a substitute for a human
decider, or complement/support human decisions).
Application in e.g. robotics, recommender systems, medical
diagnosis.
Data Science / Knowledge Discovery Learn to approximate the
input-output relation of some complex process, or the
relation between different variables of interest. Then,
analyze the learned model in order to understand this
process/relation.
Neuroscience Use the neural network itself as a model for the brain (in
order to understand how the brain works). E.g. in which
way intermediate layers correlate with neuron activations in
specific areas of the brain.

25/44
Autonomous Decision Making Example
Autonomous Car Driving

Source: https://medium.com/self-driving-cars/nvidia-drive-labs-a09627d745f9

▶ Deep learning can process sensor data and produce fully or partly automated decisions about when to turn left/right, brake, accelerate, etc. Such automation makes it possible to lower the burden on (or to fully replace) the human driver.
▶ The neural network must make meaningful and safe driving decisions. Incorrect decisions can have severe consequences (crash, etc.). → Need for stringent model validation and testing.

26/44
Data Science Example (1)

▶ Train a neural network on a gene expression dataset to predict each gene's expression from the expression of the other genes.
▶ Retrieve from the trained model a model of interaction between genes (a gene regulatory network).

Keyl et al. Nucleic Acids Research, gkac1212, 2023

27/44
Data Science Example (2)

▶ Train several neural networks to predict internal parameters of a planet from various subsets of observables.
▶ This makes it possible to infer which subsets of observables have the highest predictive power, and which ones are therefore the most worth measuring in practice.
Agarwal et al. Earth and Space Science 8 (4), e2020EA001484, 2021

28/44
Neuroscience Example

Cadieu et al. PNAS 111 (23), 8619-8624, 2014

29/44
Part 4 Theoretical Considerations

30/44
Theoretical Considerations about Neural Networks

▶ Universality: Can they approximate any function (assuming we have enough neurons)?

▶ Compactness: Can functions (and the learning of these functions) be represented in a compact form (i.e. using finitely many neurons)?

▶ Optimization: Are neural networks easy/hard to optimize (e.g. can the optimization procedure get stuck in local optima)?

31/44
Universal Approximation Theorem (1)

Neural networks with sufficiently many neurons can approximate any function f of the input variables x1, x2, . . . , xd.
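
As an informal numerical illustration (not part of the theorem or its proof), a wide two-layer PyTorch network trained by gradient descent can drive the approximation error on a smooth 1-D target close to zero. The target function, network width, and optimizer settings below are arbitrary choices:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-math.pi, math.pi, 200).unsqueeze(1)
y = torch.sin(3 * x) + 0.5 * x                      # some smooth target function

net = nn.Sequential(nn.Linear(1, 256), nn.Tanh(), nn.Linear(256, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
print(loss.item())    # mean squared error; small after training
```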

32/44
Universal Approximation Theorem (2)

Theorem (simplified): With sufficiently many neurons, neural networks can approximate any nonlinear function.

Sketch proof taken from the book Bishop'95 Neural Networks for Pattern Recognition, p. 130–131 (after Jones'90 and Blum&Li'91):
▶ Consider the special class of functions y : R2 → R where the input variables are called x1, x2.
▶ We will show that any two-layer network with threshold functions as nonlinearity can approximate y(x1, x2) up to arbitrary accuracy.
▶ We first observe that any function of x2 (with x1 fixed) can be approximated as an infinite Fourier series:

y(x1, x2) ≃ Σs As(x1) cos(s x2)

33/44
Universal Approximation Theorem (3)

▶ We first observe that any function of x2 (with x1 fixed) can be approximated as an infinite Fourier series:

y(x1, x2) ≃ Σs As(x1) cos(s x2)

▶ Similarly, the coefficients themselves can be expressed as an infinite Fourier series:

y(x1, x2) ≃ Σs Σl Asl cos(l x1) cos(s x2)

▶ We now make use of a trigonometric identity to write the function above as a sum of cosines:

cos(α) cos(β) = ½ cos(α + β) + ½ cos(α − β)

▶ Thus, the function to approximate can be written as a sum of cosines, where each of them receives a linear combination of the input variables:

y(x1, x2) ≃ Σj vj cos(x1 w1j + x2 w2j)

34/44
Universal Approximation Theorem (4)

▶ Thus, the function to approximate can be written as a sum of cosines, where each of them receives a linear combination of the input variables:

y(x1, x2) ≃ Σj vj cos(x1 w1j + x2 w2j)

▶ This is a two-layer neural network, except for the cosine nonlinearity. The latter can however be approximated by a superposition of a large number of step functions:

cos(z) = lim_{τ→0} Σi [cos(τ(i+1)) − cos(τ i)] · 1{z > τ(i+1)} + const.

where the bracketed difference of cosines is a constant and 1{·} denotes a step function.
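
A quick numerical check of this last step, assuming NumPy (the interval and the step size τ are arbitrary): summing the step functions reproduces cos(z) up to an error that shrinks with τ.

```python
import numpy as np

tau = 0.01
z = np.linspace(0.0, 2 * np.pi, 1000)
approx = np.full_like(z, np.cos(0.0))               # the constant term
for i in range(int(2 * np.pi / tau)):
    jump = np.cos(tau * (i + 1)) - np.cos(tau * i)  # constant factor
    approx += jump * (z > tau * (i + 1))            # step function 1{z > tau (i+1)}
print(np.max(np.abs(approx - np.cos(z))))           # ~1e-2, shrinks as tau -> 0
```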

35/44
Neural Networks: Compactness (1)

Neural networks can express a broad range of `useful' functions in a compact manner (e.g. without having to use exponentially many neurons).

36/44
Neural Networks: Compactness (2)

[Figure: the exhaustive set of all possible neurons (exponentially many → intractable) vs. a randomly initialized network, shown before and after training]

▶ The neural network starts with a finite and typically small set of randomly initialized neurons (i.e. a subset of all possible neurons).
▶ The compact problem representation is progressively extracted during training under the simultaneous effect of optimization (minimizing the error) and the finite number of neurons in the model.
▶ The learned representation is almost as predictive as an exhaustive set of neurons, but much more compact.

37/44
Neural Networks: Compactness (3)

Example of the set of first-layer filters learned by a neural network trained on image classification (AlexNet):

These 96 filters capture most of the important low-level signal for image classification, and are much more compact than the exhaustive set of all possible filters (potentially thousands or millions of possible filters).
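
As a sketch of how such filters can be inspected in practice (assuming a recent torchvision with pretrained weights; note that torchvision's AlexNet variant uses 64 first-layer filters rather than the 96 of the original architecture):

```python
import torchvision

# Load a pretrained AlexNet and extract the weights of its first convolutional layer.
model = torchvision.models.alexnet(weights=torchvision.models.AlexNet_Weights.DEFAULT)
filters = model.features[0].weight.detach()   # shape (64, 3, 11, 11): 64 RGB filters of size 11x11
print(filters.shape)
# Each 11x11x3 slice can be rescaled to [0, 1] and displayed as a small image,
# revealing oriented edge and color-blob detectors similar to those on the slide.
```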

38/44
Neural Networks: Compactness (4)

▶ Progressive tradeoff between spatial resolution and semantic resolution ensures that the representation remains compact at every step.

39/44
Neural Networks: Optimization

Neural networks also have downsides:

▶ Non-convex objective (e.g. even the simplest two-layer network ϕ(x; θ) = θ1 θ2 x is already non-convex in θ, see the sketch below). Many hyperparameters (e.g. initialization, learning rate, etc.) can affect the result of learning.
▶ Multiple layers can cause pathological curvature, i.e. the gradient vanishes along certain directions of the parameter space. The optimizer may get stuck on large plateaus.

With heuristics on the neural network design (e.g. choice of layers and nonlinearities) and optimization (e.g. momentum, batch normalization), it is however still possible to train them efficiently.
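
A small numerical sketch of the non-convexity mentioned in the first point above, assuming NumPy and a single made-up training pair (x, y) = (1, 1):

```python
import numpy as np

# Squared loss of the toy model phi(x; theta) = theta1 * theta2 * x on (x, y) = (1, 1).
t1, t2 = np.meshgrid(np.linspace(-2, 2, 201), np.linspace(-2, 2, 201))
loss = (t1 * t2 * 1.0 - 1.0) ** 2

# The minimizers lie on the hyperbola theta1 * theta2 = 1 (two disconnected branches)
# and the origin is a saddle point of the loss, so the objective is not convex in theta.
print(loss[100, 100])   # loss at theta = (0, 0): 1.0
print(loss.min())       # global minimum: 0.0
```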

40/44
Neural Networks vs. Other Feature Extraction

                        Universal   Compact   Convex/Easy
Feature Engineering
  (few features)            ✗          ✓          ✓
  (many features)           ✓          ✗          ✓
Neural Networks             ✓          ✓          ✗

▶ Compared to feature engineering approaches, neural networks are able at the same time to solve a broad range of problems (universal) and to keep the model reasonably small (compact).
▶ However, this comes at the cost of a more complex optimization procedure. Heuristics will be presented in Lectures 3 and 4 on how to nevertheless optimize neural networks efficiently.

41/44
Summary

42/44
Summary

▶ Deep learning is a learning paradigm where both the classifier and the features supporting the classifier are learned from the data.
▶ Deep learning relies on neural networks, specifically, their ability to represent and learn complex nonlinear functions through the interconnection of many simple computational units (neurons).
▶ Deep learning provides a solution for difficult tasks where many classical ML techniques do not work well (e.g. image recognition, speech recognition, natural language processing), and has become state-of-the-art on many such tasks.
▶ Deep learning is often used in practice for its ability to produce accurate decisions autonomously; however, there is also a broad range of possible applications of deep learning in data science as well as in neuroscience.
▶ Deep learning can learn models that are both compact and highly adaptable to the task. At the same time, the optimization problem is non-convex and generally harder, which makes them more difficult to handle.

43/44
References

[1] S. Agarwal, N. Tosi, P. Kessel, S. Padovan, D. Breuer, and G. Montavon.
Toward constraining Mars' thermal evolution using machine learning.
Earth and Space Science, 8(4):e2020EA001484, 2021.

[2] K. Fukushima.
Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.
Biological Cybernetics, 36(4):193–202, 1980.

[3] P. Keyl, P. Bischoff, G. Dernbach, M. Bockmayr, R. Fritz, D. Horst, N. Blüthgen, G. Montavon, K.-R. Müller, and F. Klauschen.
Single-cell gene regulatory network prediction by explainable AI.
Nucleic Acids Research, Jan. 2023.

[4] K. Simonyan and A. Zisserman.
Very deep convolutional networks for large-scale image recognition.
In ICLR, 2015.

[5] D. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo.
Performance-optimized hierarchical models predict neural responses in higher visual cortex.
Proceedings of the National Academy of Sciences, 111:8619–8624, 2014.

44/44
