
Introduction to Neural Networks

Jan Hązła

AIMS Rwanda, Feb-Mar 2024

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 1 / 98


What is this course about?

(Artificial) Neural networks trained by Stochastic Gradient Descent.


HOW they work.
What they can be used for.
How to improve their performance.
Basics of the tensorflow library.
First focus: Thorough understanding of the basics.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 2 / 98


Literature

The main book we initially follow is


Michael Nielsen, “Neural Networks and Deep Learning”.
http://neuralnetworksanddeeplearning.com.

Other references:
Google crash course: https://developers.google.com/machine-learning/crash-course.
Goodfellow, Bengio, Courville, “Deep Learning”, https://www.deeplearningbook.org.
MIT fast course: http://introtodeeplearning.com/2023/index.html.
Hardt, Recht, “Patterns, Predictions and Actions”. https://mlstory.org.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 3 / 98


Running example: MNIST dataset

Source: wikipedia

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 4 / 98


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 5 / 98


One neuron

Definition
Let w ∈ R^d, b ∈ R, σ : R → R. A neuron is a function

$$\Phi_{w,b,\sigma}(x) = \sigma\Big(b + \sum_{i=1}^{d} w_i x_i\Big).$$

Vector notation: $\Phi_{w,b,\sigma}(x) = \sigma(w^T x + b)$.

(Convention: Normally we take w , x to be column vectors.)

w is called the weight vector, and b bias.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 6 / 98
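As a quick illustration (not on the original slide), here is a minimal NumPy sketch of this definition; the weight, bias and input values are made up for the example.

```python
import numpy as np

def sigmoid(z):
    # S(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    # Phi_{w,b,sigma}(x) = sigma(w^T x + b)
    return activation(np.dot(w, x) + b)

# made-up example with d = 3
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 0.0, 0.5])
print(neuron(x, w, b))  # sigmoid(0.5 + 1.0 + 0.1) = sigmoid(1.6) ≈ 0.832
```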




Activation functions

Recall: Φ(x) = σ(w^T x + b).

σ : R → R is called an activation function.

Choices for σ we will consider:


Step function H(x) = 1 if x ≥ 0, and 0 if x < 0 (a neuron with step activation is called a perceptron).
Sigmoid S(x) = 1/(1 + exp(−x)).
ReLU(x) = max(0, x). (default choice in many cases)

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 7 / 98


Activations: step function
H(x ) = 1(x ≥ 0)

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 8 / 98


Activations: sigmoid
S(x) = 1/(1 + exp(−x))

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 9 / 98


Activations: ReLU
ReLU(x ) = max(0, x )

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 10 / 98


What is a neural network?
In general, a neural network for inputs x1, . . . , xd ∈ R is a directed acyclic
graph (DAG) with:
Input vertices x1, . . . , xd.
Processing (hidden) neurons (vertices).
Designated output neurons.
Every neuron (except the inputs) has its own weights (for each incoming
graph edge) and bias term. Example:
(Diagram: inputs x1, x2 connected to hidden neurons a1, a2, a3, which are connected to output neurons y1, y2.)

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 11 / 98


Question

What is the reason for using activation functions?

In other words, what would go wrong if σ = id?

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 12 / 98


Logical gates

Let x1 , . . . , xk ∈ {0, 1}. Then:


NOT(xi ) = 1 − xi . (negation)
OR(x1 , . . . , xk ) = 0 if x1 = . . . = xk = 0. Otherwise,
OR(x1 , . . . , xk ) = 1. (disjunction)
AND(x1 , . . . , xk ) = 1 if x1 = . . . = xk = 1. Otherwise,
AND(x1 , . . . , xk ) = 0. (conjunction)

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 13 / 98


Boolean circuits

A boolean circuit on x1 , . . . , xn is a directed acyclic graph (DAG) with n


inputs, m outputs and boolean logical gates. A circuit defines a function
f : {0, 1}n → {0, 1}m .

(Diagram: an example circuit on inputs x1, x2, x3 built from NOT, AND and OR gates.)

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 14 / 98


Universality of boolean circuits

Theorem
For every function f : {0, 1}n → {0, 1}m there exists a boolean circuit that
computes it.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 15 / 98


Simulating boolean gates by neurons

Consider the NOT gate NOT(x ) = 1 − x for x ∈ {0, 1}.

Let Φ : R → R be the following neuron: w = −1, b = 0.5 and activation


step function H(x ). Then,
Φ(0) = H(−1 · 0 + 0.5) = 1 and Φ(1) = H(−1 · 1 + 0.5) = 0.
Therefore, Φ(x ) computes NOT(x ) for x ∈ {0, 1}.

We will see later that the OR and AND gates can also be computed by neural
networks (using more than one neuron).

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 16 / 98
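A small sketch (assuming NumPy) that checks this computation numerically:

```python
import numpy as np

def H(z):
    # step activation: 1 if z >= 0, else 0
    return np.where(z >= 0, 1, 0)

def not_neuron(x):
    # perceptron with w = -1, b = 0.5 and step activation
    return H(-1.0 * x + 0.5)

for x in (0, 1):
    print(x, int(not_neuron(x)))  # prints 0 -> 1 and 1 -> 0
```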


Universality of neural networks

Corollary
For every function f : {0, 1}n → {0, 1}m there exists a neural network Φ
with step activations for all hidden neurons such that

∀x ∈ {0, 1}n : f (x ) = Φ(x ) .

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 17 / 98


Universality of neural networks

There is a similar theorem for continuous functions:


Theorem
Let f : [0, 1]n → [0, 1] be a continuous function,
σ ∈ {H(x ), S(x ), ReLU(x )} and ε > 0.
Then, there exists a neural net Φ with activation σ for hidden neurons
(and identity activation for output neurons) such that

$$\sup_{x \in [0,1]^n} |\Phi(x) - f(x)| \le \varepsilon .$$

However, this theorem is false for σ = id. It is essential that there is


non-linearity in the activation.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 18 / 98




Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 19 / 98


We will discuss architectures that:
Have neurons arranged in layers.
Are feedforward, that is, data flows from lower to higher layers.
Are fully connected, that is, there are connections (weights) between
every pair of neurons in adjacent layers.
There is always one input layer, one output layer, and one or more
hidden layers.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 20 / 98


Example architecture with one hidden layer

(Diagram: inputs x1, x2 connected to hidden neurons a1, a2, a3, which are connected to output neurons y1, y2.)

Every neuron has its own weights and bias term.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 21 / 98


MNIST network: Input layer

Our data samples are grayscale images, each with 28x28=784 pixels. Pixel
intensity is a real value between 0 (white) and 1 (black).

INPUT LAYER: x ∈ [0, 1]784 .

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 22 / 98


MNIST network: Hidden layer

We will make one hidden layer with n neurons. Each neuron has the
sigmoid activation function.

The weights are given as W ∈ R^{n×784} and biases as b ∈ R^n.

Each neuron computes its activation (output) value as
$$a_i = S\Big(b_i + \sum_{j=1}^{784} W_{ij} x_j\Big), \qquad 1 \le i \le n,$$
or in vector notation: a = S(Wx + b).

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 23 / 98


MNIST network: Output layer

The output layer will have 10 sigmoid neurons (one for every digit from 0 to 9).
These have a similar formula. For weights V ∈ R^{10×n} and biases b′ ∈ R^{10}:
$$y_i = S\Big(b'_i + \sum_{j=1}^{n} V_{ij} a_j\Big), \qquad i = 0, \dots, 9,$$
so that y = S(Va + b′) = S(V S(Wx + b) + b′).

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 24 / 98
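A minimal NumPy sketch of this forward pass. The weights here are random placeholders standing in for a trained network, and the hidden width n = 30 is just an example value.

```python
import numpy as np

def S(z):
    # sigmoid, applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 30                                    # hidden layer size (example value)
W, b = rng.standard_normal((n, 784)), rng.standard_normal(n)
V, b2 = rng.standard_normal((10, n)), rng.standard_normal(10)

x = rng.uniform(0.0, 1.0, size=784)       # stand-in for one MNIST image
a = S(W @ x + b)                          # hidden activations, a = S(Wx + b)
y = S(V @ a + b2)                         # output activations, y = S(Va + b')
print(y.shape)                            # (10,)
```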


MNIST network: Prediction

The prediction of the network is the digit with the highest output activation
value:
$$\hat{y} = \operatorname{argmax}_{0 \le i \le 9} y_i .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 25 / 98


Prediction error and accuracy

Main quantity of interest that we will minimize: For the (test) data set

(x^{(1)}, y^{(1)}), . . . , (x^{(M)}, y^{(M)}),  x^{(i)} ∈ [0, 1]^784, y^{(i)} ∈ {0, . . . , 9},

we measure the prediction error
$$P_e = \frac{1}{M}\,\Big|\big\{1 \le i \le M : \hat{y}(x^{(i)}) \ne y^{(i)}\big\}\Big| .$$
1 − P_e is called the accuracy.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 26 / 98


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 27 / 98


Taylor’s theorem

Theorem (Taylor’s theorem for k = 2)


Let f : Rd → R be a twice continuously differentiable function and
x = (x1 , . . . , xd ) ∈ Rd .
Then, there exist C, c > 0, such that for all −c ≤ ε_1, . . . , ε_d ≤ c:
$$\Big| f(x_1+\varepsilon_1, \dots, x_d+\varepsilon_d) - f(x_1, \dots, x_d) - \sum_{i=1}^{d} \varepsilon_i \frac{\partial}{\partial x_i} f(x_1, \dots, x_d) \Big| \le C \sum_{i=1}^{d} \varepsilon_i^2 .$$

Vector notation: f(x + ε) = f(x) + ∇f(x) · ε + O(∥ε∥²).

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 28 / 98


O-notation

x, ε ∈ R^d. We write that f(x, ε) = g(x, ε) + O(h(ε)) if
$$\forall x\ \exists C, c > 0\ \forall \varepsilon \text{ s.t. } -c \le \varepsilon_1, \dots, \varepsilon_d \le c : \quad |f(x, \varepsilon) - g(x, \varepsilon)| \le C \cdot h(\varepsilon) .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 29 / 98


Loss function

For our MNIST network, we have
$$\frac{\partial}{\partial w_{ij}} \hat{y}(x) = 0 .$$

Therefore, we need a surrogate loss that operates directly on y .

For now, we will take the quadratic loss. Let x be the input and
y ∈ {0, . . . , 9} the correct label. Let ỹ = e_y ∈ R^10 be the vector which has
a one in position y and 0 everywhere else.
Note that the NN output is y(x) ∈ [0, 1]^10. Then
$$C(x, y) = \frac{1}{2} \, \| y(x) - \tilde{y} \|^2 .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 30 / 98




Loss function: Example

Let y(x) = (0, 0, 0.8, 0.9, 0, 0, 0, 0, 0.1, 0.2) and y = 2. Then
ỹ = (0, 0, 1, 0, 0, 0, 0, 0, 0, 0) and the loss is given as
$$C(x, y) = \frac{1}{2}\,(0 + 0 + 0.04 + 0.81 + 0 + 0 + 0 + 0 + 0.01 + 0.04) = 0.45 .$$
This is quite a high loss, due to the incorrect large activation 0.9 for label 3.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 31 / 98
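The same computation in NumPy (a sketch, reusing the numbers from the slide):

```python
import numpy as np

y_out = np.array([0, 0, 0.8, 0.9, 0, 0, 0, 0, 0.1, 0.2])  # network output y(x)
y_label = 2
y_tilde = np.zeros(10)
y_tilde[y_label] = 1.0                                     # one-hot vector e_y

C = 0.5 * np.sum((y_out - y_tilde) ** 2)
print(C)  # 0.45
```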


Gradient descent: Reminder

Let f : Rk → R be a differentiable function. The gradient descent


minimization algorithm proceeds according to the rule

$$w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)}) .$$

For us the function to minimize is
$$f(W, b, V, b') = \frac{1}{M} \sum_{i=1}^{M} C\big(x^{(i)}, y^{(i)}\big),$$
where (x^{(1)}, y^{(1)}), . . . , (x^{(M)}, y^{(M)}) is the training dataset. (Note that C is
also a function of W, b, V, b'.)

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 32 / 98




Gradient descent: Weight initialization

We will apply a variant of this algorithm called stochastic gradient descent


(SGD). This algorithm starts (initializes) with some weights and updates
them one step at a time.

Let us use the following initialization: W , b, V , b ′ ∼ N (0, 1). That is,


every weight is independent standard Gaussian.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 33 / 98


Gradient of a network

Let x ∈ R784 be an input and y ∈ {0, . . . , 9} the correct label on x .


Then, we write
$$\nabla_W C(x, y) = \left( \frac{\partial}{\partial W_{ij}} C(x, y) \right)_{1 \le i \le n,\ 1 \le j \le 784} .$$

For the training set (x^{(1)}, y^{(1)}), . . . , (x^{(M)}, y^{(M)}) we want to minimize the
average training loss. So, our gradient should be
$$\frac{1}{M} \sum_{i=1}^{M} \nabla_W C\big(x^{(i)}, y^{(i)}\big) .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 34 / 98




SGD mini-batch

However, in SGD we approximate the gradient by the mini-batch of size k.


We choose k inputs (indices i_1, . . . , i_k) at random and compute
$$\frac{1}{k} \sum_{j=1}^{k} \nabla_W C\big(x^{(i_j)}, y^{(i_j)}\big) .$$

A gradient of the mini-batch is less precise (more noisy) than the gradient
of the whole dataset. But, it is much faster to compute. For MNIST
M = 50000 and we will use k = 10. So, one SGD step will be 5000 times
faster.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 35 / 98




SGD: one step

Let (w (T ) , b (T ) ) be the weights at step T and i1 , . . . , ik a randomly


chosen minibatch.

Then, the weights at time T + 1 are given by
$$W^{(T+1)} = W^{(T)} - \frac{\eta}{k} \sum_{j=1}^{k} \nabla_W C\big(x^{(i_j)}, y^{(i_j)}\big) .$$

η is a hyperparameter called the learning rate.

The same formula holds for other weights and biases b, V , b ′ .

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 36 / 98
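A sketch of one such step in NumPy. The function grad_W_C below is a placeholder for the per-example gradient ∇_W C (computed by backpropagation, discussed in the next section); the learning rate and batch size are just example values.

```python
import numpy as np

def sgd_step(W, xs, ys, grad_W_C, eta=3.0, k=10, rng=np.random.default_rng()):
    """One mini-batch SGD update for the weight matrix W.

    grad_W_C(W, x, y) is assumed to return the gradient of the loss C(x, y)
    with respect to W (an array of the same shape as W).
    """
    idx = rng.choice(len(xs), size=k, replace=False)        # random mini-batch
    grad = sum(grad_W_C(W, xs[i], ys[i]) for i in idx) / k  # mini-batch gradient
    return W - eta * grad                                   # W^(T+1) = W^(T) - (eta/k) * sum(...)
```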




How long to run?

In every epoch, we divide the training dataset of M samples into M/k


minibatches of size k. Then, we run one step of SGD on each minibatch.

How long to run?


1 Option 1: A predetermined time. For example, let us initially use 30
epochs.
2 Option 2: Until the accuracy on the test set stops improving
(for example, stop training if the accuracy did not improve for 10
epochs).

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 37 / 98




Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 38 / 98


Objective: Gradient on one input
Let x ∈ R^784 be an input with label y ∈ {0, . . . , 9} and the correct output
vector ỹ.

We need to compute the gradients

∇W C , ∇b C , ∇V C , ∇b ′ C .

For this we need more notation:

z = Wx + b
a = S(z)
z ′ = Va + b ′
a′ = S(z ′ )

These values can be computed and remembered in the feedforward pass.


Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 39 / 98
Step 1: ∇z ′ C

Recall that $C = \frac{1}{2} \sum_{i=0}^{9} (a'_i - \tilde{y}_i)^2$.

$$\frac{\partial C}{\partial z'_i} = \frac{\partial C}{\partial a'_i} \cdot \frac{\partial a'_i}{\partial z'_i} = (a'_i - \tilde{y}_i) \cdot S'(z'_i) .$$

Vector notation: $\nabla_{z'} C = (a' - \tilde{y}) \odot S'(z')$,

where $(u \odot v)_i = u_i v_i$ is the Hadamard product.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 40 / 98


Step 2: ∇b ′ C and ∇V C

Recall that $C = \frac{1}{2} \sum_{i=0}^{9} (a'_i - \tilde{y}_i)^2$. Let $\delta' = \nabla_{z'} C$.

$$\frac{\partial C}{\partial b'_i} = \frac{\partial C}{\partial z'_i} \cdot \frac{\partial z'_i}{\partial b'_i} = \delta'_i \cdot 1 ,$$
$$\frac{\partial C}{\partial V_{ij}} = \frac{\partial C}{\partial z'_i} \cdot \frac{\partial z'_i}{\partial V_{ij}} = \delta'_i \cdot a_j .$$

Vector notation: $\nabla_{b'} C = \delta'$ and $\nabla_V C = \delta' a^T$, that is, the outer product
of $\delta'$ and $a$.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 41 / 98


Step 3: ∇z C

Recall that $C = \frac{1}{2} \sum_{i=0}^{9} (a'_i - \tilde{y}_i)^2$. Let $\delta' = \nabla_{z'} C$.

$$\frac{\partial C}{\partial z_i} = \sum_{j=0}^{9} \frac{\partial C}{\partial z'_j} \cdot \frac{\partial z'_j}{\partial z_i} = \sum_{j=0}^{9} \delta'_j \, \frac{\partial}{\partial z_i} \sum_{k=1}^{n} V_{jk} S(z_k) = \sum_{j=0}^{9} \delta'_j \cdot V_{ji} \, \frac{\partial}{\partial z_i} S(z_i) = \Big( \sum_{j=0}^{9} V_{ji} \cdot \delta'_j \Big) \cdot S'(z_i) .$$

Vector notation: $\nabla_z C = (V^T \delta') \odot S'(z)$.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 42 / 98


Step 4: ∇b C and ∇W C

Recall that $C = \frac{1}{2} \sum_{i=0}^{9} (a'_i - \tilde{y}_i)^2$. Let $\delta = \nabla_z C$.

$$\frac{\partial C}{\partial b_i} = \frac{\partial C}{\partial z_i} \cdot \frac{\partial z_i}{\partial b_i} = \delta_i \cdot 1 ,$$
$$\frac{\partial C}{\partial W_{ij}} = \frac{\partial C}{\partial z_i} \cdot \frac{\partial z_i}{\partial W_{ij}} = \delta_i \cdot x_j .$$

Vector notation: $\nabla_b C = \delta$ and $\nabla_W C = \delta x^T$.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 43 / 98
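Putting the four steps together, a NumPy sketch of one feedforward plus backward pass for this two-layer network with the quadratic loss (it uses the sigmoid identity S'(z) = S(z)(1 − S(z)); the weights passed in are assumed to be arrays of the shapes from the earlier slides):

```python
import numpy as np

def S(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(W, b, V, b2, x, y_tilde):
    # feedforward pass (values are remembered for the backward pass)
    z = W @ x + b
    a = S(z)
    z2 = V @ a + b2
    a2 = S(z2)
    # backward pass
    delta2 = (a2 - y_tilde) * a2 * (1 - a2)   # Step 1: (a' - ytilde) ⊙ S'(z')
    grad_b2 = delta2                          # Step 2: grad of b'
    grad_V = np.outer(delta2, a)              # Step 2: delta' a^T
    delta = (V.T @ delta2) * a * (1 - a)      # Step 3: (V^T delta') ⊙ S'(z)
    grad_b = delta                            # Step 4: grad of b
    grad_W = np.outer(delta, x)               # Step 4: delta x^T
    return grad_W, grad_b, grad_V, grad_b2
```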


Conclusions

This idea generalizes to many layers and any activation: One


feedforward pass and one backward pass.
Feedforward pass is matrix multiplication. In the backward pass we
have Hadamard products, outer products and multiplying by the
transposed weight matrices.
This method is faster than computing every partial derivative
separately.
Partial derivatives can be understood by decomposing them into
products. Most important: $\frac{\partial C}{\partial W_{ij}} = \delta_i \cdot x_j$.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 44 / 98


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 45 / 98


Exploding and vanishing gradients

As you will see yourself, there can be various problems occurring during
the network training. Diagnosing and preventing them is one of the main
practical challenges in the field.

There are two particular problems with gradients that can happen:
Exploding gradients (which are too large).
Vanishing gradients (too small).
In both cases it is unlikely the network will learn good weights. And both
problems are generally more likely to happen in deeper networks.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 46 / 98


Exploding/vanishing gradients: What to do?

If the gradients are exploding, sometimes it is effective to just make


them shorter. This is called gradient clipping. For example, set some
maximum M and if ∥∇_W C∥ > M, then set
$$\nabla_W C \leftarrow \frac{M}{\|\nabla_W C\|} \, \nabla_W C$$

(this is called clipping by norm).


Modifying the network architecture and hyperparameters can help
overcome either exploding or vanishing gradients. Gradient problems
are one of the inspirations for innovations in NN architecture.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 47 / 98
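A one-function NumPy sketch of clipping by norm (max_norm plays the role of M above):

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    # rescale the gradient if its Euclidean norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad
```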




Vanishing gradient example: Sigmoid and square loss

Let z ∈ R^10 be the input values for the final layer of our MNIST network
and y = S(z) the NN output. With the square loss $C_{sq} = \frac{1}{2} \|y - \tilde{y}\|^2$ one
can compute
$$\frac{\partial C_{sq}}{\partial z_i} = (y_i - \tilde{y}_i) \, y_i (1 - y_i) .$$
For example, if it happens that ỹ_i = 0 and y_i ≈ 1, then the loss is large
(and the NN is likely to output wrong predictions), but the gradient will be small
due to 1 − y_i ≈ 0. This is an example of a vanishing gradient.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 48 / 98


Cross-entropy loss

A popular choice for the loss function is cross-entropy loss. In our case
(MNIST with sigmoid output neurons) we define it as follows. For neural
network output y ∈ R^10 and the label vector ỹ ∈ R^10 we have
$$C_{cross} = \sum_{i=0}^{9} \big( -\tilde{y}_i \ln y_i - (1 - \tilde{y}_i) \ln(1 - y_i) \big) .$$

Note that this loss will help with sigmoid vanishing gradients, since
$$\frac{\partial C_{cross}}{\partial z_i} = (y_i - \tilde{y}_i) .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 49 / 98




Softmax

Another possibility is to replace the sigmoids in the output layer by
softmax. Let z ∈ R^10 be the inputs to the output layer. Then, we let
$$y_i = \mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j=0}^{9} \exp(z_j)} .$$

And we define the log-likelihood loss (this is also called cross-entropy in
some sources!)
$$C_{log} = -\sum_{i=0}^{9} \tilde{y}_i \ln y_i = -\ln y_y .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 50 / 98
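A small NumPy sketch of softmax and the log-likelihood loss. (Subtracting max(z) before exponentiating is a standard numerical-stability trick, not something from the slide; it does not change the result.)

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract max(z) for numerical stability
    return e / np.sum(e)

def log_likelihood_loss(z, y):
    # C_log = -ln y_y, where y is the index of the correct class
    return -np.log(softmax(z)[y])

z = np.array([1.0, 2.0, 0.5])      # made-up layer inputs, K = 3
print(softmax(z), log_likelihood_loss(z, 1))
```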


Softmax and probabilities

Cross-entropy and softmax are often used when the output can be
interpreted as a probability distribution.

From the definition it is easy to check that if y = softmax(z), then y_i > 0
and Σ_i y_i = 1.

In that case the values z_i can be used to compute log-likelihood ratios:
$$z_i - z_j = \Big( \ln(y_i) + \ln \sum_k \exp(z_k) \Big) - \Big( \ln(y_j) + \ln \sum_k \exp(z_k) \Big) = \ln \frac{y_i}{y_j} .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 51 / 98




Impact on SGD

In every case the SGD formulas have to be modified accordingly.

For example, for output y and label vector ỹ, we have
$$\frac{\partial C_{cross}}{\partial y_i} = \frac{y_i - \tilde{y}_i}{y_i (1 - y_i)} .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 52 / 98


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 53 / 98


Regularization

Another modification to avoid overfitting is regularization of the weights.


The idea is to modify the cost such that “simpler” (for us, smaller)
weights are preferred.

Normally we regularize only weights, not biases.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 54 / 98


L2 regularization

Most typically, L2 regularization is used. The cost function C_old (like
the square or cross-entropy loss) is modified to become
$$C_{total} = C_{old} + \frac{\lambda}{2} \cdot \Big( \sum_{ij} W_{ij}^2 + \sum_{ij} V_{ij}^2 \Big)$$
for a hyperparameter λ > 0.

The gradients are modified accordingly: ∇_W C_total = ∇_W C_old + λW,
∇_V C_total = ∇_V C_old + λV.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 55 / 98
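In code, the only change to the gradient is the additive λW (resp. λV) term; a minimal sketch (the gradient arrays are assumed to come from backpropagation as before):

```python
def l2_regularized_grads(grad_W, grad_V, W, V, lam):
    # gradients of C_total = C_old + (lam/2) * (sum W_ij^2 + sum V_ij^2)
    return grad_W + lam * W, grad_V + lam * V
```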




Normalized initialization

We have chosen the standard Gaussian initialization W_ij ∼ N(0, 1).

However, it might be more natural to choose W_ij ∼ N(0, 1/d), since in
that case for x ∈ {−1, 1}^d we will have $\sum_{j=1}^{d} W_{ij} x_j \sim N(0, 1)$.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 56 / 98


Synthetic data

More good quality data is always better. For certain datasets we can
artificially expand them into larger ones for a more robust network and less
overfitting.

For example, a picture from the MNIST dataset can be modified:


Shift left/right/up/down a little bit.
Rotate by a small angle (not too large!).
Change the brightness a little.
After the change it should still be the same digit. But for the network it
might make a difference.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 57 / 98
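A sketch of two such modifications in NumPy (shifts via np.roll and a small brightness change; rotations would need an image library and are omitted here):

```python
import numpy as np

def shift(img, dy, dx):
    # img: 28x28 array; shift by (dy, dx) pixels, filling the border with zeros
    out = np.roll(img, (dy, dx), axis=(0, 1))
    if dy > 0: out[:dy, :] = 0
    if dy < 0: out[dy:, :] = 0
    if dx > 0: out[:, :dx] = 0
    if dx < 0: out[:, dx:] = 0
    return out

def brighten(img, factor=1.1):
    # change the brightness a little, keeping pixel values in [0, 1]
    return np.clip(img * factor, 0.0, 1.0)
```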


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 58 / 98


Motivation

Problems of interest (e.g., image, speech and text related) often have
spatial/temporal and hierarchical structure.

For that, consider our MNIST input to be x ∈ R^{28×28}. Similarly, a color
picture can be represented as x ∈ R^{28×28×3} (RGB: red, green, blue). In that
case we call each of the three color coordinates a channel.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 59 / 98


Convolutional layer

Let x ∈ RK ×K ×C . So x is a 2D object with C channels.

Let k be a small integer. For example, we will take k = 5 for MNIST.


Then, let w ∈ R^{k×k×C} and b ∈ R. We call (w, b) a kernel or a filter, and
k × k are the kernel dimensions.
Then, we define the kernel applied to the position (i, j), where 1 ≤ i, j ≤ K:
$$z_{i,j} = \sum_{\alpha=0}^{k-1} \sum_{\beta=0}^{k-1} \sum_{\gamma=1}^{C} w_{\alpha,\beta,\gamma} \, x_{i+\alpha,\, j+\beta,\, \gamma} + b .$$

Then, the activation is defined as usual ai,j = σ(zi,j ) for an activation


function σ : R → R.

The operation of applying a filter to a position is called convolution.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 60 / 98
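A direct NumPy sketch of this definition (0-based indices, and only the positions where the filter fits, so the output has shape (K − k + 1) × (K − k + 1); padding is discussed on the next slide):

```python
import numpy as np

def convolve(x, w, b):
    # x: (K, K, C) input, w: (k, k, C) filter, b: scalar bias
    K, k = x.shape[0], w.shape[0]
    out = np.empty((K - k + 1, K - k + 1))
    for i in range(K - k + 1):
        for j in range(K - k + 1):
            # z_{i,j} = sum over alpha, beta, gamma of w * x[i+alpha, j+beta, gamma], plus b
            out[i, j] = np.sum(w * x[i:i+k, j:j+k, :]) + b
    return out
```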




Weight sharing

One kernel is applied to all input coordinates with the same weights. This
is called weight sharing.

The coordinates close to the borders require some attention. We can


either:
Imagine that the coordinates outside the input range are zeros. In
that case we can place one neuron for every input neuron, and the
filter output has shape K × K . This is called padding.
Or allow only the coordinates where the filter is well-defined. Then
the filter output shape is (K − k + 1) × (K − k + 1).

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 61 / 98




Putting things together: Multiple filters

It is not enough to apply just one filter (one set of weights). Instead,
many filters are applied to create many output channels: One output
channel for one filter.

So, if in one convolutional layer with padding, the input has shape
K × K × C , and we apply D filters with dimensions k × k, we have:
K²C neurons (activations) in the input to the layer.
K²D neurons in the output of the layer.
k²CD weights.
D biases.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 62 / 98




MNIST example: Counting the parameters

Let us consider an example convolutional layer for MNIST. In the input we


have 28 · 28 = 784 input neurons and just one channel (K = 28 and
C = 1).

Then, we create 20 filters of dimension 5 · 5 (k = 5 and D = 20). Let us


do it without padding. That means that every filter gives 24 × 24 = 576
neurons in the output layer (since K − k + 1 = 24).

All in all, we have 784 input neurons and 576 × 20 = 11520 (hidden)
output neurons. But, just 25 · 1 · 20 = 500 weights and 20 biases.

Compare this with 784 · 30 = 23520 weights for a fully connected layer with
30 hidden neurons, and 78400 weights with 100 hidden neurons.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 63 / 98




Pooling

A pooling layer summarizes the information extracted from a larger


number of neighboring neurons.

The most typical variants are max pooling and average pooling.

In pooling we take k × k (for some small k) neighboring activations and


create just one neuron that computes:
The maximum in case of max pooling.
The average in case of average pooling.

So, for input of shape K × K × C pooling with size k × k creates output


of shape K /k × K /k × C . Note that pooling is applied on each channel
separately!

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 64 / 98




Pooling example: MNIST

For our MNIST example, after applying the convolution layer, we have
activations in the shape 24 × 24 × 20.

After applying 2 × 2 max pooling, we remain with 12 × 12 × 20 = 2880
neurons.

neurons.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 65 / 98


Strides

Pooling and especially convolution can also be done using strides. Using
stride s means that the “convolution window” is not applied at every
position, but only every s steps.

For example, convolution applied with padding to input shape K × K × C


gives one filter with shape K × K . If we apply stride s, then the output
filter has shape K /s × K /s.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 66 / 98




Deep learning: Overall structure

For problems that require deep structure, many layers of convolution and
pooling are applied in sequence. Details are always different.

Most often, the first (close to the input) layers are convolutional and
pooling (these can be very deep), and the last (usually just one or two
layers) are fully connected.

As always, the SGD formulas have to be modified for the new architecture.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 67 / 98




Motivation

Convolutional layers are supposed to detect local spatial/temporal


structure.
Deep convolutional layers reflect hierarchical structure of problems,
where more advanced features are constructed from simpler features.
Pooling attempts to summarize and simplify information.
The fully connected layers on top are supposed to give flexibility at
the highest levels of abstraction.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 68 / 98


MNIST example: architecture 1

For MNIST, we will apply the following architecture:


1 Input shape 28 × 28 × 1.
2 2D Convolution without padding, ReLU activation, 20 filters, filter
size 5 × 5. The first hidden layer has shape 24 × 24 × 20.
3 2D max 2 × 2 pooling. The second hidden layer has shape
12 × 12 × 20.
4 Fully connected layer with 100 neurons, ReLU activation.
5 Output layer with 10 neurons, softmax.
6 Log-likelihood loss.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 69 / 98
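A keras sketch of this architecture. The optimizer and learning rate are not specified on the slide, so plain SGD with an example learning rate is used here; with integer labels, the sparse_categorical_crossentropy loss corresponds to the log-likelihood loss above.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(20, (5, 5), activation="relu"),  # -> 24 x 24 x 20
    tf.keras.layers.MaxPooling2D((2, 2)),                   # -> 12 x 12 x 20
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),  # example choice
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```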


MNIST example: architecture 2

1 Input shape 28 × 28 × 1.
2 2D convolution without padding, ReLU activation, 20 filters, filter
size 5 × 5. The first hidden layer has shape 24 × 24 × 20.
3 2D max 2 × 2 pooling. The second hidden layer has shape
12 × 12 × 20.
4 2D convolution without padding, ReLU activation, 40 filters, filter
size 5 × 5. Hidden layer shape 8 × 8 × 40.
5 2D max 2 × 2 pooling. Hidden layer shape 4 × 4 × 40.
6 Fully connected layer with 100 neurons, ReLU activation.
7 Output layer with 10 neurons, softmax.
8 Log-likelihood loss.
We might also add dropout to the fully connected layer.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 70 / 98


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 71 / 98


RNNs

Recurrent neural networks (RNNs) are used for inputs which are sequences
of variable length x (1) , . . . , x (T ) (for example, words in a sentence).

The main ideas:


Weights are shared across time.
The network keeps updating a hidden state as we apply inputs in the
sequence.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 72 / 98


RNN with one hidden layer
Example RNN with one hidden layer:
Input x^{(t)} ∈ R^k, hidden layer activations h^{(t)} ∈ R^n.
Weights W_hh ∈ R^{n×n} and W_hx ∈ R^{n×k}, biases b ∈ R^n.
$$h^{(t+1)} = \sigma\big( W_{hh} h^{(t)} + W_{hx} x^{(t+1)} + b \big) .$$
h^{(0)} can be initialized to zero.

One input sequence is x^{(1)}, . . . , x^{(T)}. We have
$$h^{(1)} = \sigma\big( W_{hh} h^{(0)} + W_{hx} x^{(1)} + b \big)$$
$$\vdots$$
$$h^{(T)} = \sigma\big( W_{hh} h^{(T-1)} + W_{hx} x^{(T)} + b \big) .$$

Note that the weights are shared (the same) for all 1 ≤ t ≤ T.
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 73 / 98
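A minimal NumPy sketch of this recurrence; the weights are assumed to be given, and tanh is used for σ just as an example choice of activation.

```python
import numpy as np

def rnn_forward(xs, Whh, Whx, b, sigma=np.tanh):
    # xs: list of T input vectors x^(1), ..., x^(T); returns h^(1), ..., h^(T)
    h = np.zeros(Whh.shape[0])                 # h^(0) initialized to zero
    hs = []
    for x in xs:
        h = sigma(Whh @ h + Whx @ x + b)       # same (shared) weights at every step
        hs.append(h)
    return hs
```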
Illustration: feedforward vs recurrent
(Diagram.) Feedforward: x ↦ z = Wx + b ↦ h = σ(z) ↦ . . .
Recurrent: starting from h^(0), each step t computes z^(t) = W_hh h^(t−1) + W_hx x^(t) + b
and h^(t) = σ(z^(t)), which is fed into the next step.
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 74 / 98
Example: GPT-like text prediction

Consider a problem where inputs are sentences: Each training sample is a


sequence of T words w (1) , . . . , w (T ) .

The objective is to train a network that will be good at predicting w (t)


given w (1) , . . . , w (t−1) .

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 75 / 98


Text prediction: Input encoding

First, we need to encode word w as a vector in Rk .


Option 1: Let (w1 , . . . , wK ) be a dictionary, that is a list of all
possible words in the training data. Recall the standard basis vector ei
which has 1 in coordinate i and 0 everywhere else.

Then, just define onehot(wi ) ∈ RK to be onehot(wi ) := ei . This is


called one-hot encoding.
Option 2: Use pre-trained (existing) encoding. Usually this encoding
comes from another NN model and tries to capture some geometric
language structure. Popular example is called word2vec.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 76 / 98


One-hot encoding: Example

Let the dictionary be D = (are, hello, how , you). Then, the one-hot
encoding sets

onehot(are) = (1, 0, 0, 0) ,
onehot(hello) = (0, 1, 0, 0) ,
onehot(how) = (0, 0, 1, 0) ,
onehot(you) = (0, 0, 0, 1) .

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 77 / 98


Text prediction: Network output

Let Φ be one-layer RNN with n hidden neurons, weights Whh , Whx and
biases b. As before, let
$$h^{(t)} = \sigma\big( W_{hh} h^{(t-1)} + W_{hx} x^{(t)} + b \big) .$$

We define the output y^{(t+1)} ∈ R^K as y^{(t+1)} = softmax(V h^{(t)} + b′),
where V ∈ R^{K×n}, b′ ∈ R^K.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 78 / 98


Text prediction: SGD training
For word w^(t) let x^(t) = onehot(w^(t)) ∈ R^K be its one-hot encoding. Note
that x_i^(t) ≥ 0 and Σ_i x_i^(t) = 1.

On an input sequence x^(1), . . . , x^(T) the network outputs
y^(2), . . . , y^(T+1) ∈ R^K. We can use the loss function
$$L\big(x^{(1)}, \dots, x^{(T)}, y^{(2)}, \dots, y^{(T)}\big) = \sum_{i=2}^{T} H\big(y^{(i)}, x^{(i)}\big)$$
where H is the log-likelihood loss $H(y, x) = -\sum_{i=1}^{K} x_i \ln y_i$.

Note: Somehow we created an unsupervised learning algorithm.

This can be trained with SGD (training variables W_hh, W_hx, V, b, b′)
using the usual backpropagation rules (note that this time we are also
backpropagating through time).
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 79 / 98
Example

Take the text “Hello how are you”. Recall the encoding from the previous
slide. Then the first input is x^(1) = (0, 1, 0, 0).

Assume the network outputs y^(2) = (0.2, 0.3, 0.4, 0.1). Then, the first term
in the loss is H(y^(2), x^(2)) = −ln 0.4, where x^(2) = (0, 0, 1, 0) is the
encoding of “how”.

y^(2) can be interpreted as the estimate of the network: 20% chance the
next token is “are”, 30% it is “hello”, 40% it is “how” and 10% it is “you”.

We continue likewise: the network gets input x^(2), computes output y^(3),
the next loss term is H(y^(3), x^(3)), where x^(3) = (1, 0, 0, 0) is the
encoding of “are”.
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 80 / 98




Text prediction: Running the model
After training, given a prompt (user question) x^(1), . . . , x^(T):
First, the network runs sequentially on inputs x^(1), . . . , x^(T). The
outputs y^(2), . . . , y^(T) are discarded. After that we have hidden state
h^(T) and output y^(T+1).
Let t be any time with t > T and let ŷ^(t) ∈ {1, . . . , K} be the argmax of
y^(t). The network runs for more steps: at step t the hidden input is given
by h^(t−1) and the input is given by onehot(w_i), where i = ŷ^(t). This
generates a new hidden state h^(t) and output y^(t+1).
At some time T′ we decide to stop the process. We do not discuss
how to choose this time.
Finally, the output of the network (the answer to the user) are the words
defined by y^(T+1), . . . , y^(T′).

How (early versions of) text generation networks work is based on this
idea. Current versions of GPT are based on attention and transformer
architectures.
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 81 / 98
Text prediction: Example

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 82 / 98


Text prediction: Example

t = 1: input x^(1) = E(please), output y^(2) discarded.
t = 2: input x^(2) = E(solve), output y^(3) discarded.
t = 3: input x^(3) = E(assignment), output y^(4).

argmax(y^(4)) → x^(4) = E(of) [PRINT “of”]

t = 4: input x^(4), output y^(5).

argmax(y^(5)) → x^(5) = E(course) [PRINT “course”]

t = 5: input x^(5), output y^(6), . . .

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 83 / 98




RNN: Conclusion and implementation

Due to vanishing gradients, more complicated layer architectures are


used in deep RNNs. Currently a popular choice is called long
short-term memory layers (LSTMs). We will not discuss these.
Some keras classes: tf.keras.layers.LSTM,
tf.keras.layers.SimpleRNN.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 84 / 98


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 85 / 98


Tricks round 2

Dropout.
Batch normalization.
SGD with momentum.
Skip connections.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 86 / 98


Dropout: Training

Dropout relies on using a randomly chosen subset of neurons during every


step of SGD.

Assume we have a fully connected layer with nin input neurons and nout
output neurons. Let 0 ≤ p ≤ 1 be a hyperparameter called the dropout
rate. For us p will be the probability of deleting an input neuron, but always
check as there are different conventions. Then, at every step (mini-batch)
of SGD:
1 Choose independently at random ≈ p · nin neurons which are deleted.
2 Temporarily delete those neurons (and their weights) from the
network.
3 Run one step of SGD update on the new, smaller network (using
gradient formulas for the small network).
4 Restore the deleted neurons.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 87 / 98


Dropout: Prediction

After the training is finished, all weights (not biases!) in the final network
should be multiplied by 1 − p. Then, prediction is run on the full network,
without deleting neurons.

The idea for dropout is to make neurons more independent (not rely on
the presence of other neurons in the network). It is considered a
regularization technique.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 88 / 98
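A NumPy sketch of the convention described on these slides (delete inputs with probability p during training, multiply the weights by 1 − p at prediction time). Note that many libraries, including tf.keras.layers.Dropout, instead rescale the kept activations during training, so always check which convention is used.

```python
import numpy as np

def dropout_mask(n_in, p, rng=np.random.default_rng()):
    # keep each input neuron with probability 1 - p (1 = kept, 0 = deleted)
    return (rng.random(n_in) >= p).astype(float)

def layer_train(x, W, b, mask):
    # forward pass through a fully connected layer with dropped inputs
    return W @ (x * mask) + b

def layer_predict(x, W, b, p):
    # at prediction time: no dropout, but weights multiplied by 1 - p
    return ((1 - p) * W) @ x + b
```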


Dropout: Technicalities

It is more typical to apply dropout in fully connected layers. This is


because weight sharing in convolutional layers is itself considered to have
regularization effect.

In keras: tf.keras.layers.Dropout.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 89 / 98


Batch normalization: training

Batch normalization is a recent popular regularization technique.

Consider a layer (it could be input layer) with output activations a ∈ Rn .


During a step of SGD with mini-batch of size k, in one feedforward pass
we compute activations a(1) , . . . , a(k) in this layer.

If a batch normalization is present, we then compute the empirical mean
$\mu = \frac{1}{k} \sum_{i=1}^{k} a^{(i)}$ and the empirical variance $\sigma^2 = \frac{1}{k} \sum_{i=1}^{k} (a^{(i)} - \mu)^2$ (where the
square function is applied on every coordinate independently).

Then, we let for a small hyperparameter ε > 0 (this is to avoid worrying
about σ² ≈ 0):
$$\hat{a}^{(i)} = \frac{a^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 90 / 98


Batch normalization: training

$$\hat{a}^{(i)} = \frac{a^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}} ,$$
which implies $\frac{1}{k} \sum_{i=1}^{k} \hat{a}^{(i)} = 0$ and $\frac{1}{k} \sum_{i=1}^{k} \big(\hat{a}^{(i)}\big)^2 \approx (1, \dots, 1)$ (if σ² ≫ ε).
Hence, the updated activations $\hat{a}^{(1)}, \dots, \hat{a}^{(k)}$ can be said to be normalized.

Finally, we let $y^{(i)} = \gamma \cdot \hat{a}^{(i)} + \beta$ for some β, γ ∈ R which are new
parameters trained by SGD. y^(1), . . . , y^(k) become inputs to the next layer.

Note that with batch normalization the activations (and gradients) depend
on all samples in the mini-batch. So, the feedforward and backpropagation
must run in parallel for the whole mini-batch.

Other than that, the updated gradient formulas can be derived as usual.
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 91 / 98
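A NumPy sketch of the training-time computation for one mini-batch (A holds the k activation vectors as rows; γ and β are scalars here, as on the slide):

```python
import numpy as np

def batch_norm_train(A, gamma, beta, eps=1e-5):
    # A: (k, n) activations of one mini-batch, one row per sample
    mu = A.mean(axis=0)                     # empirical mean over the mini-batch
    var = A.var(axis=0)                     # empirical variance
    A_hat = (A - mu) / np.sqrt(var + eps)   # normalized activations a_hat^(i)
    return gamma * A_hat + beta             # y^(i) = gamma * a_hat^(i) + beta
```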
Batch normalization: Prediction

After training is concluded, we let µ and σ² be the mean and variance
of the activations over the whole dataset.

New samples use those fixed values of µ and σ 2 . As a result, a network


trained with batch normalization might not work well for unseen samples
with significantly different statistics.

For keras, see tf.keras.layers.BatchNormalization.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 92 / 98




SGD with momentum

In state-of-the-art optimizers, the SGD formulas are often supplemented with
momentum. Recall that the SGD formula is
$$W^{(t+1)} = W^{(t)} - \eta \cdot \nabla_W L(x^{(t)}, y^{(t)}) .$$

SGD with momentum with hyperparameter 0 ≤ β ≤ 1 introduces the
momentum M^{(t)} (the same shape as the weights and biases) and uses the
formulas
$$M^{(t+1)} = \beta \cdot M^{(t)} + (1 - \beta) \cdot \nabla_W L(x^{(t)}, y^{(t)})$$
$$W^{(t+1)} = W^{(t)} - \eta \cdot M^{(t+1)} .$$

Note that β = 0 is the “standard” SGD.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 93 / 98
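The update in code (a sketch; grad stands for ∇_W L(x^(t), y^(t)) computed elsewhere, e.g. by backpropagation, and the arrays are NumPy arrays):

```python
def momentum_step(W, M, grad, eta, beta):
    # SGD with momentum; beta = 0 recovers the standard SGD update
    M_new = beta * M + (1 - beta) * grad
    W_new = W - eta * M_new
    return W_new, M_new
```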




Nesterov momentum
A fancy modification that has been popular recently is Nesterov
momentum:

$$M^{(t+1)} = \beta \cdot M^{(t)} + (1 - \beta) \cdot \nabla_W L(x^{(t)}, y^{(t)})\big|_{W = W^{(t)} - \beta M^{(t)}}$$
$$W^{(t+1)} = W^{(t)} - \eta \cdot M^{(t+1)} .$$

Source: http://cs231n.github.io/neural-networks-3/

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 94 / 98


Why is momentum useful?

Why do people think that momentum is useful?


Because it “smooths out” noise from the SGD.
Because it allows moving faster when there is a “steep cliff” (narrow
valley) in the gradient.

Source: https://www.willamette.edu/~gorr/classes/cs449/momrate.html

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 95 / 98


Momentum: technicalities

Various versions of momentum can be used with optimizers in Keras,


including tf.keras.optimizers.SGD and
tf.keras.optimizers.Adam.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 96 / 98


Skip connections
Normally neurons in the network are arranged in layers, and the
connections (weights) are placed only between neurons in neighboring
layers.

Skip connections or residual connections are additional connections


(weights) placed between layers which are not direct neighbors (thus
“skipping” over some layers).

Source: He, Zhang, Ren, Sun “Deep residual learning for image recognition”
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 97 / 98
Skip connections: Technicalities

For deep networks, it might be difficult to learn dependencies across many


layers, due to vanishing gradients. Skip connections allow for more direct
learning from features that are many layers behind.

In fact, it is not easy to implement skip connections using


tf.keras.Sequential construction.

Instead, it is preferable to use “the functional API”. See


https://keras.io/guides/functional_api/ for details.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 98 / 98


