
Introduction to Neural Networks

Jan Hązła

AIMS Rwanda, Feb-Mar 2024

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 1 / 98


What is this course about?

(Artificial) Neural networks trained by Stochastic Gradient Descent.


HOW they work.
What they can be used for.
How to improve their performance.
Basics of the tensorflow library.
First focus: Thorough understanding of the basics.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 2 / 98


Literature

The main book we initially follow is


Michael Nielsen, “Neural Networks and Deep Learning”.
http://neuralnetworksanddeeplearning.com.

Other references:
Google crash course: https://developers.google.com/machine-learning/crash-course.
Goodfellow, Bengio, Courville, “Deep Learning”, https://www.deeplearningbook.org.
MIT fast course: http://introtodeeplearning.com/2023/index.html.
Hardt, Recht, “Patterns, Predictions and Actions”. https://mlstory.org.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 3 / 98


Running example: MNIST dataset

Source: wikipedia

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 4 / 98


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 5 / 98


One neuron

Definition
Let w ∈ R^d, b ∈ R, σ : R → R. A neuron is a function

$$\Phi_{w,b,\sigma}(x) = \sigma\Big(b + \sum_{i=1}^{d} w_i x_i\Big).$$

Vector notation: $\Phi_{w,b,\sigma}(x) = \sigma(w^T x + b)$.

(Convention: Normally we take w , x to be column vectors.)

w is called the weight vector, and b bias.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 6 / 98
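As a quick illustration (not on the original slide), here is a minimal NumPy sketch of this definition; the weight, bias and input values are made up for the example.

```python
import numpy as np

def sigmoid(z):
    # S(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    # Phi_{w,b,sigma}(x) = sigma(w^T x + b)
    return activation(np.dot(w, x) + b)

# made-up example with d = 3
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 0.0, 0.5])
print(neuron(x, w, b))  # sigmoid(0.5 + 1.0 + 0.1) = sigmoid(1.6) ≈ 0.832
```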




Activation functions

Recall: Φ(x) = σ(w^T x + b).

σ : R → R is called an activation function.

Choices for σ we will consider:


Step function H(x) = 1 if x ≥ 0, and 0 if x < 0 (a neuron with step activation is called a perceptron).
Sigmoid S(x) = 1/(1 + exp(−x)).
ReLU(x) = max(0, x). (default choice in many cases)

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 7 / 98


Activations: step function
H(x ) = 1(x ≥ 0)

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 8 / 98


Activations: sigmoid
S(x) = 1/(1 + exp(−x))

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 9 / 98


Activations: ReLU
ReLU(x ) = max(0, x )

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 10 / 98


What is a neural network?
In general, a neural network for inputs x1, . . . , xd ∈ R is a directed acyclic
graph (DAG) with:
Input vertices x1, . . . , xd.
Processing (hidden) neurons (vertices).
Designated output neurons.
Every neuron (except the inputs) has its own weights (for each incoming
graph edge) and bias term. Example:
(Diagram: inputs x1, x2 connected to hidden neurons a1, a2, a3, which are connected to output neurons y1, y2.)

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 11 / 98


Question

What is the reason for using activation functions?

In other words, what would go wrong if σ = id?

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 12 / 98


Logical gates

Let x1 , . . . , xk ∈ {0, 1}. Then:


NOT(xi ) = 1 − xi . (negation)
OR(x1 , . . . , xk ) = 0 if x1 = . . . = xk = 0. Otherwise,
OR(x1 , . . . , xk ) = 1. (disjunction)
AND(x1 , . . . , xk ) = 1 if x1 = . . . = xk = 1. Otherwise,
AND(x1 , . . . , xk ) = 0. (conjunction)

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 13 / 98


Boolean circuits

A boolean circuit on x1 , . . . , xn is a directed acyclic graph (DAG) with n


inputs, m outputs and boolean logical gates. A circuit defines a function
f : {0, 1}n → {0, 1}m .

(Diagram: an example circuit on inputs x1, x2, x3 built from NOT, AND and OR gates.)

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 14 / 98


Universality of boolean circuits

Theorem
For every function f : {0, 1}n → {0, 1}m there exists a boolean circuit that
computes it.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 15 / 98


Simulating boolean gates by neurons

Consider the NOT gate NOT(x ) = 1 − x for x ∈ {0, 1}.

Let Φ : R → R be the following neuron: w = −1, b = 0.5 and activation


step function H(x ). Then,
Φ(0) = H(−1 · 0 + 0.5) = 1 and Φ(1) = H(−1 · 1 + 0.5) = 0.
Therefore, Φ(x ) computes NOT(x ) for x ∈ {0, 1}.

We will see later that the OR and AND gates can also be computed by neural
networks (using more than one neuron).

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 16 / 98
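A small sketch (assuming NumPy) that checks this computation numerically:

```python
import numpy as np

def H(z):
    # step activation: 1 if z >= 0, else 0
    return np.where(z >= 0, 1, 0)

def not_neuron(x):
    # perceptron with w = -1, b = 0.5 and step activation
    return H(-1.0 * x + 0.5)

for x in (0, 1):
    print(x, int(not_neuron(x)))  # prints 0 -> 1 and 1 -> 0
```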


Universality of neural networks

Corollary
For every function f : {0, 1}n → {0, 1}m there exists a neural network Φ
with step activations for all hidden neurons such that

∀x ∈ {0, 1}n : f (x ) = Φ(x ) .

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 17 / 98


Universality of neural networks

There is a similar theorem for continuous functions:


Theorem
Let f : [0, 1]n → [0, 1] be a continuous function,
σ ∈ {H(x ), S(x ), ReLU(x )} and ε > 0.
Then, there exists a neural net Φ with activation σ for hidden neurons
(and identity activation for output neurons) such that

$$\sup_{x \in [0,1]^n} |\Phi(x) - f(x)| \le \varepsilon .$$

However, this theorem is false for σ = id. It is essential that there is


non-linearity in the activation.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 18 / 98




Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 19 / 98


We will discuss architectures that:
Have neurons arranged in layers.
Are feedforward, that is, data flows from lower to higher layers.
Are fully connected, that is, there are connections (weights) between
every pair of neurons in adjacent layers.
There is always one input layer, one output layer, and one or more
hidden layers.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 20 / 98


Example architecture with one hidden layer

(Diagram: inputs x1, x2 connected to hidden neurons a1, a2, a3, which are connected to output neurons y1, y2.)

Every neuron has its own weights and bias term.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 21 / 98


MNIST network: Input layer

Our data samples are grayscale images, each with 28x28=784 pixels. Pixel
intensity is a real value between 0 (white) and 1 (black).

INPUT LAYER: x ∈ [0, 1]784 .

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 22 / 98


MNIST network: Hidden layer

We will make one hidden layer with n neurons. Each neuron has the
sigmoid activation function.

The weights are given as W ∈ R^{n×784} and biases as b ∈ R^n.

Each neuron computes its activation (output) value as
$$a_i = S\Big(b_i + \sum_{j=1}^{784} W_{ij} x_j\Big), \qquad 1 \le i \le n,$$
or in vector notation: a = S(Wx + b).

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 23 / 98


MNIST network: Output layer

The output layer will have 10 sigmoid neurons (one for every digit from 0 to 9).
These have a similar formula. For weights V ∈ R^{10×n} and biases b′ ∈ R^{10}:
$$y_i = S\Big(b'_i + \sum_{j=1}^{n} V_{ij} a_j\Big), \qquad i = 0, \dots, 9,$$
so that y = S(Va + b′) = S(V S(Wx + b) + b′).

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 24 / 98
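A minimal NumPy sketch of this forward pass. The weights here are random placeholders standing in for a trained network, and the hidden width n = 30 is just an example value.

```python
import numpy as np

def S(z):
    # sigmoid, applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 30                                    # hidden layer size (example value)
W, b = rng.standard_normal((n, 784)), rng.standard_normal(n)
V, b2 = rng.standard_normal((10, n)), rng.standard_normal(10)

x = rng.uniform(0.0, 1.0, size=784)       # stand-in for one MNIST image
a = S(W @ x + b)                          # hidden activations, a = S(Wx + b)
y = S(V @ a + b2)                         # output activations, y = S(Va + b')
print(y.shape)                            # (10,)
```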


MNIST network: Prediction

The prediction of the network is the digit with the highest output activation
value:
$$\hat{y} = \operatorname{argmax}_{0 \le i \le 9} y_i .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 25 / 98


Prediction error and accuracy

Main quantity of interest that we will minimize: For the (test) data set

(x^{(1)}, y^{(1)}), . . . , (x^{(M)}, y^{(M)}),  x^{(i)} ∈ [0, 1]^784, y^{(i)} ∈ {0, . . . , 9},

we measure the prediction error
$$P_e = \frac{1}{M}\,\Big|\big\{1 \le i \le M : \hat{y}(x^{(i)}) \ne y^{(i)}\big\}\Big| .$$
1 − P_e is called the accuracy.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 26 / 98


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 27 / 98


Taylor’s theorem

Theorem (Taylor’s theorem for k = 2)


Let f : Rd → R be a twice continuously differentiable function and
x = (x1 , . . . , xd ) ∈ Rd .
Then, there exist C, c > 0, such that for all −c ≤ ε_1, . . . , ε_d ≤ c:
$$\Big| f(x_1+\varepsilon_1, \dots, x_d+\varepsilon_d) - f(x_1, \dots, x_d) - \sum_{i=1}^{d} \varepsilon_i \frac{\partial}{\partial x_i} f(x_1, \dots, x_d) \Big| \le C \sum_{i=1}^{d} \varepsilon_i^2 .$$

Vector notation: f(x + ε) = f(x) + ∇f(x) · ε + O(∥ε∥²).

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 28 / 98


O-notation

x, ε ∈ R^d. We write that f(x, ε) = g(x, ε) + O(h(ε)) if
$$\forall x\ \exists C, c > 0\ \forall \varepsilon \text{ s.t. } -c \le \varepsilon_1, \dots, \varepsilon_d \le c : \quad |f(x, \varepsilon) - g(x, \varepsilon)| \le C \cdot h(\varepsilon) .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 29 / 98


Loss function

For our MNIST network, we have
$$\frac{\partial}{\partial w_{ij}} \hat{y}(x) = 0 .$$

Therefore, we need a surrogate loss that operates directly on y .

For now, we will take the quadratic loss. Let x be the input and
y ∈ {0, . . . , 9} the correct label. Let ỹ = e_y ∈ R^10 be the vector which has
a one in position y and 0 everywhere else.
Note that the NN output is y(x) ∈ [0, 1]^10. Then
$$C(x, y) = \frac{1}{2} \, \| y(x) - \tilde{y} \|^2 .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 30 / 98




Loss function: Example

Let y(x) = (0, 0, 0.8, 0.9, 0, 0, 0, 0, 0.1, 0.2) and y = 2. Then
ỹ = (0, 0, 1, 0, 0, 0, 0, 0, 0, 0) and the loss is given as
$$C(x, y) = \frac{1}{2}\,(0 + 0 + 0.04 + 0.81 + 0 + 0 + 0 + 0 + 0.01 + 0.04) = 0.45 .$$
This is quite a high loss, due to the incorrect large activation 0.9 for label 3.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 31 / 98
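The same computation in NumPy (a sketch, reusing the numbers from the slide):

```python
import numpy as np

y_out = np.array([0, 0, 0.8, 0.9, 0, 0, 0, 0, 0.1, 0.2])  # network output y(x)
y_label = 2
y_tilde = np.zeros(10)
y_tilde[y_label] = 1.0                                     # one-hot vector e_y

C = 0.5 * np.sum((y_out - y_tilde) ** 2)
print(C)  # 0.45
```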


Gradient descent: Reminder

Let f : Rk → R be a differentiable function. The gradient descent


minimization algorithm proceeds according to the rule

$$w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)}) .$$

For us the function to minimize is
$$f(W, b, V, b') = \frac{1}{M} \sum_{i=1}^{M} C\big(x^{(i)}, y^{(i)}\big),$$
where (x^{(1)}, y^{(1)}), . . . , (x^{(M)}, y^{(M)}) is the training dataset. (Note that C is
also a function of W, b, V, b'.)

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 32 / 98




Gradient descent: Weight initialization

We will apply a variant of this algorithm called stochastic gradient descent


(SGD). This algorithm starts (initializes) with some weights and updates
them one step at a time.

Let us use the following initialization: W , b, V , b ′ ∼ N (0, 1). That is,


every weight is independent standard Gaussian.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 33 / 98


Gradient of a network

Let x ∈ R784 be an input and y ∈ {0, . . . , 9} the correct label on x .


Then, we write
$$\nabla_W C(x, y) = \left( \frac{\partial}{\partial W_{ij}} C(x, y) \right)_{1 \le i \le n,\ 1 \le j \le 784} .$$

For the training set (x^{(1)}, y^{(1)}), . . . , (x^{(M)}, y^{(M)}) we want to minimize the
average training loss. So, our gradient should be
$$\frac{1}{M} \sum_{i=1}^{M} \nabla_W C\big(x^{(i)}, y^{(i)}\big) .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 34 / 98




SGD mini-batch

However, in SGD we approximate the gradient by the mini-batch of size k.


We choose k inputs (indices i_1, . . . , i_k) at random and compute
$$\frac{1}{k} \sum_{j=1}^{k} \nabla_W C\big(x^{(i_j)}, y^{(i_j)}\big) .$$

A gradient of the mini-batch is less precise (more noisy) than the gradient
of the whole dataset. But, it is much faster to compute. For MNIST
M = 50000 and we will use k = 10. So, one SGD step will be 5000 times
faster.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 35 / 98




SGD: one step

Let (w (T ) , b (T ) ) be the weights at step T and i1 , . . . , ik a randomly


chosen minibatch.

Then, the weights at time T + 1 are given by
$$W^{(T+1)} = W^{(T)} - \frac{\eta}{k} \sum_{j=1}^{k} \nabla_W C\big(x^{(i_j)}, y^{(i_j)}\big) .$$

η is a hyperparameter called the learning rate.

The same formula holds for other weights and biases b, V , b ′ .

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 36 / 98
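A sketch of one such step in NumPy. The function grad_W_C below is a placeholder for the per-example gradient ∇_W C (computed by backpropagation, discussed in the next section); the learning rate and batch size are just example values.

```python
import numpy as np

def sgd_step(W, xs, ys, grad_W_C, eta=3.0, k=10, rng=np.random.default_rng()):
    """One mini-batch SGD update for the weight matrix W.

    grad_W_C(W, x, y) is assumed to return the gradient of the loss C(x, y)
    with respect to W (an array of the same shape as W).
    """
    idx = rng.choice(len(xs), size=k, replace=False)        # random mini-batch
    grad = sum(grad_W_C(W, xs[i], ys[i]) for i in idx) / k  # mini-batch gradient
    return W - eta * grad                                   # W^(T+1) = W^(T) - (eta/k) * sum(...)
```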




How long to run?

In every epoch, we divide the training dataset of M samples into M/k


minibatches of size k. Then, we run one step of SGD on each minibatch.

How long to run?


1 Option 1: A predetermined time. For example, let us initially use 30
epochs.
2 Option 2: Until the accuracy on the test set stops improving
(for example, stop training if the accuracy did not improve for 10
epochs).

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 37 / 98




Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 38 / 98


Objective: Gradient on one input
Let x ∈ R^784 be an input with label y ∈ {0, . . . , 9} and the correct output
vector ỹ.

We need to compute the gradients

∇W C , ∇b C , ∇V C , ∇b ′ C .

For this we need more notation:

z = Wx + b
a = S(z)
z ′ = Va + b ′
a′ = S(z ′ )

These values can be computed and remembered in the feedforward pass.


Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 39 / 98
Step 1: ∇z ′ C

Recall that $C = \frac{1}{2} \sum_{i=0}^{9} (a'_i - \tilde{y}_i)^2$.

$$\frac{\partial C}{\partial z'_i} = \frac{\partial C}{\partial a'_i} \cdot \frac{\partial a'_i}{\partial z'_i} = (a'_i - \tilde{y}_i) \cdot S'(z'_i) .$$

Vector notation: $\nabla_{z'} C = (a' - \tilde{y}) \odot S'(z')$,

where $(u \odot v)_i = u_i v_i$ is the Hadamard product.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 40 / 98


Step 2: ∇b ′ C and ∇V C

Recall that $C = \frac{1}{2} \sum_{i=0}^{9} (a'_i - \tilde{y}_i)^2$. Let $\delta' = \nabla_{z'} C$.

$$\frac{\partial C}{\partial b'_i} = \frac{\partial C}{\partial z'_i} \cdot \frac{\partial z'_i}{\partial b'_i} = \delta'_i \cdot 1 ,$$
$$\frac{\partial C}{\partial V_{ij}} = \frac{\partial C}{\partial z'_i} \cdot \frac{\partial z'_i}{\partial V_{ij}} = \delta'_i \cdot a_j .$$

Vector notation: $\nabla_{b'} C = \delta'$ and $\nabla_V C = \delta' a^T$, that is, the outer product
of $\delta'$ and $a$.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 41 / 98


Step 3: ∇z C

Recall that $C = \frac{1}{2} \sum_{i=0}^{9} (a'_i - \tilde{y}_i)^2$. Let $\delta' = \nabla_{z'} C$.

$$\frac{\partial C}{\partial z_i} = \sum_{j=0}^{9} \frac{\partial C}{\partial z'_j} \cdot \frac{\partial z'_j}{\partial z_i} = \sum_{j=0}^{9} \delta'_j \, \frac{\partial}{\partial z_i} \sum_{k=1}^{n} V_{jk} S(z_k) = \sum_{j=0}^{9} \delta'_j \cdot V_{ji} \, \frac{\partial}{\partial z_i} S(z_i) = \Big( \sum_{j=0}^{9} V_{ji} \cdot \delta'_j \Big) \cdot S'(z_i) .$$

Vector notation: $\nabla_z C = (V^T \delta') \odot S'(z)$.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 42 / 98


Step 4: ∇b C and ∇W C

Recall that $C = \frac{1}{2} \sum_{i=0}^{9} (a'_i - \tilde{y}_i)^2$. Let $\delta = \nabla_z C$.

$$\frac{\partial C}{\partial b_i} = \frac{\partial C}{\partial z_i} \cdot \frac{\partial z_i}{\partial b_i} = \delta_i \cdot 1 ,$$
$$\frac{\partial C}{\partial W_{ij}} = \frac{\partial C}{\partial z_i} \cdot \frac{\partial z_i}{\partial W_{ij}} = \delta_i \cdot x_j .$$

Vector notation: $\nabla_b C = \delta$ and $\nabla_W C = \delta x^T$.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 43 / 98
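Putting the four steps together, a NumPy sketch of one feedforward plus backward pass for this two-layer network with the quadratic loss (it uses the sigmoid identity S'(z) = S(z)(1 − S(z)); the weights passed in are assumed to be arrays of the shapes from the earlier slides):

```python
import numpy as np

def S(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(W, b, V, b2, x, y_tilde):
    # feedforward pass (values are remembered for the backward pass)
    z = W @ x + b
    a = S(z)
    z2 = V @ a + b2
    a2 = S(z2)
    # backward pass
    delta2 = (a2 - y_tilde) * a2 * (1 - a2)   # Step 1: (a' - ytilde) ⊙ S'(z')
    grad_b2 = delta2                          # Step 2: grad of b'
    grad_V = np.outer(delta2, a)              # Step 2: delta' a^T
    delta = (V.T @ delta2) * a * (1 - a)      # Step 3: (V^T delta') ⊙ S'(z)
    grad_b = delta                            # Step 4: grad of b
    grad_W = np.outer(delta, x)               # Step 4: delta x^T
    return grad_W, grad_b, grad_V, grad_b2
```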


Conclusions

This idea generalizes to many layers and any activation: One


feedforward pass and one backward pass.
Feedforward pass is matrix multiplication. In the backward pass we
have Hadamard products, outer products and multiplying by the
transposed weight matrices.
This method is faster than computing every partial derivative
separately.
Partial derivatives can be understood by decomposing them into
products. Most important: $\frac{\partial C}{\partial W_{ij}} = \delta_i \cdot x_j$.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 44 / 98


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 45 / 98


Exploding and vanishing gradients

As you will see yourself, there can be various problems occurring during
the network training. Diagnosing and preventing them is one of the main
practical challenges in the field.

There are two particular problems with gradients that can happen:
Exploding gradients (which are too large).
Vanishing gradients (too small).
In both cases it is unlikely the network will learn good weights. And both
problems are generally more likely to happen in deeper networks.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 46 / 98


Exploding/vanishing gradients: What to do?

If the gradients are exploding, sometimes it is effective to just make


them shorter. This is called gradient clipping. For example, set some
maximum M and if ∥∇_W C∥ > M, then set
$$\nabla_W C \leftarrow \frac{M}{\|\nabla_W C\|} \, \nabla_W C$$

(this is called clipping by norm).


Modifying the network architecture and hyperparameters can help
overcome either exploding or vanishing gradients. Gradient problems
are one of the inspirations for innovations in NN architecture.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 47 / 98
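A one-function NumPy sketch of clipping by norm (max_norm plays the role of M above):

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    # rescale the gradient if its Euclidean norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad
```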




Vanishing gradient example: Sigmoid and square loss

Let z ∈ R^10 be the input values for the final layer of our MNIST network
and y = S(z) the NN output. With the square loss $C_{sq} = \frac{1}{2} \|y - \tilde{y}\|^2$ one
can compute
$$\frac{\partial C_{sq}}{\partial z_i} = (y_i - \tilde{y}_i) \, y_i (1 - y_i) .$$
For example, if it happens that ỹ_i = 0 and y_i ≈ 1, then the loss is large
(and the NN is likely to output wrong predictions), but the gradient will be small
due to 1 − y_i ≈ 0. This is an example of a vanishing gradient.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 48 / 98


Cross-entropy loss

A popular choice for the loss function is cross-entropy loss. In our case
(MNIST with sigmoid output neurons) we define it as follows. For neural
network output y ∈ R^10 and the label vector ỹ ∈ R^10 we have
$$C_{cross} = \sum_{i=0}^{9} \big( -\tilde{y}_i \ln y_i - (1 - \tilde{y}_i) \ln(1 - y_i) \big) .$$

Note that this loss will help with sigmoid vanishing gradients, since
$$\frac{\partial C_{cross}}{\partial z_i} = (y_i - \tilde{y}_i) .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 49 / 98




Softmax

Another possibility is to replace the sigmoids in the output layer by
softmax. Let z ∈ R^10 be the inputs to the output layer. Then, we let
$$y_i = \mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j=0}^{9} \exp(z_j)} .$$

And we define the log-likelihood loss (this is also called cross-entropy in
some sources!)
$$C_{log} = -\sum_{i=0}^{9} \tilde{y}_i \ln y_i = -\ln y_y .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 50 / 98
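A small NumPy sketch of softmax and the log-likelihood loss. (Subtracting max(z) before exponentiating is a standard numerical-stability trick, not something from the slide; it does not change the result.)

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract max(z) for numerical stability
    return e / np.sum(e)

def log_likelihood_loss(z, y):
    # C_log = -ln y_y, where y is the index of the correct class
    return -np.log(softmax(z)[y])

z = np.array([1.0, 2.0, 0.5])      # made-up layer inputs, K = 3
print(softmax(z), log_likelihood_loss(z, 1))
```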


Softmax and probabilities

Cross-entropy and softmax are often used when the output can be
interpreted as a probability distribution.

From the definition it is easy to check that if y = softmax(z), then y_i > 0
and Σ_i y_i = 1.

In that case the values z_i can be used to compute log-likelihood ratios:
$$z_i - z_j = \Big( \ln(y_i) + \ln \sum_k \exp(z_k) \Big) - \Big( \ln(y_j) + \ln \sum_k \exp(z_k) \Big) = \ln \frac{y_i}{y_j} .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 51 / 98




Impact on SGD

In every case the SGD formulas have to be modified accordingly.

For example, for output y and label vector ỹ, we have
$$\frac{\partial C_{cross}}{\partial y_i} = \frac{y_i - \tilde{y}_i}{y_i (1 - y_i)} .$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 52 / 98


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 53 / 98


Regularization

Another modification to avoid overfitting is regularization of the weights.


The idea is to modify the cost such that “simpler” (for us, smaller)
weights are preferred.

Normally we regularize only weights, not biases.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 54 / 98


L2 regularization

Most typically, L2 regularization is used. The cost function C_old (like
the square or cross-entropy loss) is modified to become
$$C_{total} = C_{old} + \frac{\lambda}{2} \cdot \Big( \sum_{ij} W_{ij}^2 + \sum_{ij} V_{ij}^2 \Big)$$
for a hyperparameter λ > 0.

The gradients are modified accordingly: ∇_W C_total = ∇_W C_old + λW,
∇_V C_total = ∇_V C_old + λV.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 55 / 98
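In code, the only change to the gradient is the additive λW (resp. λV) term; a minimal sketch (the gradient arrays are assumed to come from backpropagation as before):

```python
def l2_regularized_grads(grad_W, grad_V, W, V, lam):
    # gradients of C_total = C_old + (lam/2) * (sum W_ij^2 + sum V_ij^2)
    return grad_W + lam * W, grad_V + lam * V
```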




Normalized initialization

We have chosen the standard Gaussian initialization W_ij ∼ N(0, 1).

However, it might be more natural to choose W_ij ∼ N(0, 1/d), since in
that case for x ∈ {−1, 1}^d we will have $\sum_{j=1}^{d} W_{ij} x_j \sim N(0, 1)$.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 56 / 98


Synthetic data

More good quality data is always better. For certain datasets we can
artificially expand them into larger ones for a more robust network and less
overfitting.

For example, a picture from the MNIST dataset can be modified:


Shift left/right/up/down a little bit.
Rotate by a small angle (not too large!).
Change the brightness a little.
After the change it should still be the same digit. But for the network it
might make a difference.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 57 / 98
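A sketch of two such modifications in NumPy (shifts via np.roll and a small brightness change; rotations would need an image library and are omitted here):

```python
import numpy as np

def shift(img, dy, dx):
    # img: 28x28 array; shift by (dy, dx) pixels, filling the border with zeros
    out = np.roll(img, (dy, dx), axis=(0, 1))
    if dy > 0: out[:dy, :] = 0
    if dy < 0: out[dy:, :] = 0
    if dx > 0: out[:, :dx] = 0
    if dx < 0: out[:, dx:] = 0
    return out

def brighten(img, factor=1.1):
    # change the brightness a little, keeping pixel values in [0, 1]
    return np.clip(img * factor, 0.0, 1.0)
```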


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 58 / 98


Motivation

Problems of interest (e.g., image, speech and text related) often have
spatial/temporal and hierarchical structure.

For that, consider our MNIST input to be x ∈ R^{28×28}. Similarly, a color
picture can be represented as x ∈ R^{28×28×3} (RGB: red, green, blue). In that
case we call each of the three color coordinates a channel.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 59 / 98


Convolutional layer

Let x ∈ RK ×K ×C . So x is a 2D object with C channels.

Let k be a small integer. For example, we will take k = 5 for MNIST.


Then, let w ∈ R^{k×k×C} and b ∈ R. We call (w, b) a kernel or a filter, and
k × k are the kernel dimensions.
Then, we define the kernel applied to the position (i, j), where 1 ≤ i, j ≤ K:
$$z_{i,j} = \sum_{\alpha=0}^{k-1} \sum_{\beta=0}^{k-1} \sum_{\gamma=1}^{C} w_{\alpha,\beta,\gamma} \, x_{i+\alpha,\, j+\beta,\, \gamma} + b .$$

Then, the activation is defined as usual ai,j = σ(zi,j ) for an activation


function σ : R → R.

The operation of applying a filter to a position is called convolution.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 60 / 98
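A direct NumPy sketch of this definition (0-based indices, and only the positions where the filter fits, so the output has shape (K − k + 1) × (K − k + 1); padding is discussed on the next slide):

```python
import numpy as np

def convolve(x, w, b):
    # x: (K, K, C) input, w: (k, k, C) filter, b: scalar bias
    K, k = x.shape[0], w.shape[0]
    out = np.empty((K - k + 1, K - k + 1))
    for i in range(K - k + 1):
        for j in range(K - k + 1):
            # z_{i,j} = sum over alpha, beta, gamma of w * x[i+alpha, j+beta, gamma], plus b
            out[i, j] = np.sum(w * x[i:i+k, j:j+k, :]) + b
    return out
```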




Weight sharing

One kernel is applied to all input coordinates with the same weights. This
is called weight sharing.

The coordinates close to the borders require some attention. We can


either:
Imagine that the coordinates outside the input range are zeros. In
that case we can place one neuron for every input neuron, and the
filter output has shape K × K . This is called padding.
Or allow only the coordinates where the filter is well-defined. Then
the filter output shape is (K − k + 1) × (K − k + 1).

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 61 / 98




Putting things together: Multiple filters

It is not enough to apply just one filter (one set of weights). Instead,
many filters are applied to create many output channels: One output
channel for one filter.

So, if in one convolutional layer with padding, the input has shape
K × K × C , and we apply D filters with dimensions k × k, we have:
K²C neurons (activations) in the input to the layer.
K²D neurons in the output of the layer.
k²CD weights.
D biases.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 62 / 98




MNIST example: Counting the parameters

Let us consider an example convolutional layer for MNIST. In the input we


have 28 · 28 = 784 input neurons and just one channel (K = 28 and
C = 1).

Then, we create 20 filters of dimension 5 · 5 (k = 5 and D = 20). Let us


do it without padding. That means that every filter gives 24 × 24 = 576
neurons in the output layer (since K − k + 1 = 24).

All in all, we have 784 input neurons and 576 × 20 = 11520 (hidden)
output neurons. But, just 25 · 1 · 20 = 500 weights and 20 biases.

Compare this with 784 · 30 = 23520 weights for a fully connected layer with
30 hidden neurons, and 78400 weights with 100 hidden neurons.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 63 / 98




Pooling

A pooling layer summarizes the information extracted from a larger


number of neighboring neurons.

The most typical variants are max pooling and average pooling.

In pooling we take k × k (for some small k) neighboring activations and


create just one neuron that computes:
The maximum in case of max pooling.
The average in case of average pooling.

So, for input of shape K × K × C pooling with size k × k creates output


of shape K /k × K /k × C . Note that pooling is applied on each channel
separately!

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 64 / 98




Pooling example: MNIST

For our MNIST example, after applying the convolution layer, we have
activations in the shape 24 × 24 × 20.

After applying 2 × 2 max pooling, we remain with 12 × 12 × 20 = 2880
neurons.

neurons.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 65 / 98


Strides

Pooling and especially convolution can also be done using strides. Using
stride s means that the “convolution window” is not applied at every
position, but only every s steps.

For example, convolution applied with padding to input shape K × K × C


gives one filter with shape K × K . If we apply stride s, then the output
filter has shape K /s × K /s.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 66 / 98




Deep learning: Overall structure

For problems that require deep structure, many layers of convolution and
pooling are applied in sequence. Details are always different.

Most often, the first (close to the input) layers are convolutional and
pooling (these can be very deep), and the last (usually just one or two
layers) are fully connected.

As always, the SGD formulas have to be modified for the new architecture.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 67 / 98




Motivation

Convolutional layers are supposed to detect local spatial/temporal


structure.
Deep convolutional layers reflect hierarchical structure of problems,
where more advanced features are constructed from simpler features.
Pooling attempts to summarize and simplify information.
The fully connected layers on top are supposed to give flexibility at
the highest levels of abstraction.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 68 / 98


MNIST example: architecture 1

For MNIST, we will apply the following architecture:


1 Input shape 28 × 28 × 1.
2 2D Convolution without padding, ReLU activation, 20 filters, filter
size 5 × 5. The first hidden layer has shape 24 × 24 × 20.
3 2D max 2 × 2 pooling. The second hidden layer has shape
12 × 12 × 20.
4 Fully connected layer with 100 neurons, ReLU activation.
5 Output layer with 10 neurons, softmax.
6 Log-likelihood loss.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 69 / 98
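A keras sketch of this architecture. The optimizer and learning rate are not specified on the slide, so plain SGD with an example learning rate is used here; with integer labels, the sparse_categorical_crossentropy loss corresponds to the log-likelihood loss above.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(20, (5, 5), activation="relu"),  # -> 24 x 24 x 20
    tf.keras.layers.MaxPooling2D((2, 2)),                   # -> 12 x 12 x 20
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),  # example choice
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```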


MNIST example: architecture 2

1 Input shape 28 × 28 × 1.
2 2D convolution without padding, ReLU activation, 20 filters, filter
size 5 × 5. The first hidden layer has shape 24 × 24 × 20.
3 2D max 2 × 2 pooling. The second hidden layer has shape
12 × 12 × 20.
4 2D convolution without padding, ReLU activation, 40 filters, filter
size 5 × 5. Hidden layer shape 8 × 8 × 40.
5 2D max 2 × 2 pooling. Hidden layer shape 4 × 4 × 40.
6 Fully connected layer with 100 neurons, ReLU activation.
7 Output layer with 10 neurons, softmax.
8 Log-likelihood loss.
We might also add dropout to the fully connected layer.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 70 / 98


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 71 / 98


RNNs

Recurrent neural networks (RNNs) are used for inputs which are sequences
of variable length x (1) , . . . , x (T ) (for example, words in a sentence).

The main ideas:


Weights are shared across time.
The network keeps updating a hidden state as we apply inputs in the
sequence.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 72 / 98


RNN with one hidden layer
Example RNN with one hidden layer:
Input x^{(t)} ∈ R^k, hidden layer activations h^{(t)} ∈ R^n.
Weights W_hh ∈ R^{n×n} and W_hx ∈ R^{n×k}, biases b ∈ R^n.
$$h^{(t+1)} = \sigma\big( W_{hh} h^{(t)} + W_{hx} x^{(t+1)} + b \big) .$$
h^{(0)} can be initialized to zero.

One input sequence is x^{(1)}, . . . , x^{(T)}. We have
$$h^{(1)} = \sigma\big( W_{hh} h^{(0)} + W_{hx} x^{(1)} + b \big)$$
$$\vdots$$
$$h^{(T)} = \sigma\big( W_{hh} h^{(T-1)} + W_{hx} x^{(T)} + b \big) .$$

Note that the weights are shared (the same) for all 1 ≤ t ≤ T.
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 73 / 98
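A minimal NumPy sketch of this recurrence; the weights are assumed to be given, and tanh is used for σ just as an example choice of activation.

```python
import numpy as np

def rnn_forward(xs, Whh, Whx, b, sigma=np.tanh):
    # xs: list of T input vectors x^(1), ..., x^(T); returns h^(1), ..., h^(T)
    h = np.zeros(Whh.shape[0])                 # h^(0) initialized to zero
    hs = []
    for x in xs:
        h = sigma(Whh @ h + Whx @ x + b)       # same (shared) weights at every step
        hs.append(h)
    return hs
```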
Illustration: feedforward vs recurrent
(Diagram.) Feedforward: x ↦ z = Wx + b ↦ h = σ(z) ↦ . . .
Recurrent: starting from h^(0), each step t computes z^(t) = W_hh h^(t−1) + W_hx x^(t) + b
and h^(t) = σ(z^(t)), which is fed into the next step.
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 74 / 98
Example: GPT-like text prediction

Consider a problem where inputs are sentences: Each training sample is a


sequence of T words w (1) , . . . , w (T ) .

The objective is to train a network that will be good at predicting w (t)


given w (1) , . . . , w (t−1) .

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 75 / 98


Text prediction: Input encoding

First, we need to encode word w as a vector in Rk .


Option 1: Let (w1 , . . . , wK ) be a dictionary, that is a list of all
possible words in the training data. Recall the standard basis vector ei
which has 1 in coordinate i and 0 everywhere else.

Then, just define onehot(wi ) ∈ RK to be onehot(wi ) := ei . This is


called one-hot encoding.
Option 2: Use pre-trained (existing) encoding. Usually this encoding
comes from another NN model and tries to capture some geometric
language structure. Popular example is called word2vec.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 76 / 98


One-hot encoding: Example

Let the dictionary be D = (are, hello, how , you). Then, the one-hot
encoding sets

onehot(are) = (1, 0, 0, 0) ,
onehot(hello) = (0, 1, 0, 0) ,
onehot(how) = (0, 0, 1, 0) ,
onehot(you) = (0, 0, 0, 1) .

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 77 / 98


Text prediction: Network output

Let Φ be one-layer RNN with n hidden neurons, weights Whh , Whx and
biases b. As before, let
$$h^{(t)} = \sigma\big( W_{hh} h^{(t-1)} + W_{hx} x^{(t)} + b \big) .$$

We define the output y^{(t+1)} ∈ R^K as y^{(t+1)} = softmax(V h^{(t)} + b′),
where V ∈ R^{K×n}, b′ ∈ R^K.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 78 / 98


Text prediction: SGD training
For word w^(t) let x^(t) = onehot(w^(t)) ∈ R^K be its one-hot encoding. Note
that x_i^(t) ≥ 0 and Σ_i x_i^(t) = 1.

On an input sequence x^(1), . . . , x^(T) the network outputs
y^(2), . . . , y^(T+1) ∈ R^K. We can use the loss function
$$L\big(x^{(1)}, \dots, x^{(T)}, y^{(2)}, \dots, y^{(T)}\big) = \sum_{i=2}^{T} H\big(y^{(i)}, x^{(i)}\big)$$
where H is the log-likelihood loss $H(y, x) = -\sum_{i=1}^{K} x_i \ln y_i$.

Note: Somehow we created an unsupervised learning algorithm.

This can be trained with SGD (training variables W_hh, W_hx, V, b, b′)
using the usual backpropagation rules (note that this time we are also
backpropagating through time).
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 79 / 98
Example

Take the text “Hello how are you”. Recall the encoding from the previous
slide. Then the first input is x^(1) = (0, 1, 0, 0).

Assume the network outputs y^(2) = (0.2, 0.3, 0.4, 0.1). Then, the first term
in the loss is H(y^(2), x^(2)) = −ln 0.4, where x^(2) = (0, 0, 1, 0) is the
encoding of “how”.

y^(2) can be interpreted as the estimate of the network: 20% chance the
next token is “are”, 30% it is “hello”, 40% it is “how” and 10% it is “you”.

We continue likewise: the network gets input x^(2), computes output y^(3),
the next loss term is H(y^(3), x^(3)), where x^(3) = (1, 0, 0, 0) is the
encoding of “are”.
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 80 / 98




Text prediction: Running the model
After training, given a prompt (user question) x^(1), . . . , x^(T):
First, the network runs sequentially on inputs x^(1), . . . , x^(T). The
outputs y^(2), . . . , y^(T) are discarded. After that we have hidden state
h^(T) and output y^(T+1).
Let t be any time with t > T and let ŷ^(t) ∈ {1, . . . , K} be the argmax of
y^(t). The network runs for more steps: at step t the hidden input is given
by h^(t−1) and the input is given by onehot(w_i), where i = ŷ^(t). This
generates a new hidden state h^(t) and output y^(t+1).
At some time T′ we decide to stop the process. We do not discuss
how to choose this time.
Finally, the output of the network (the answer to the user) are the words
defined by y^(T+1), . . . , y^(T′).

How (early versions of) text generation networks work is based on this
idea. Current versions of GPT are based on attention and transformer
architectures.
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 81 / 98
Text prediction: Example

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 82 / 98


Text prediction: Example

t = 1: input x^(1) = E(please), output y^(2) discarded.
t = 2: input x^(2) = E(solve), output y^(3) discarded.
t = 3: input x^(3) = E(assignment), output y^(4).

argmax(y^(4)) → x^(4) = E(of) [PRINT “of”]

t = 4: input x^(4), output y^(5).

argmax(y^(5)) → x^(5) = E(course) [PRINT “course”]

t = 5: input x^(5), output y^(6), . . .

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 83 / 98




RNN: Conclusion and implementation

Due to vanishing gradients, more complicated layer architectures are


used in deep RNNs. Currently a popular choice is called long
short-term memory layers (LSTMs). We will not discuss these.
Some keras classes: tf.keras.layers.LSTM,
tf.keras.layers.SimpleRNN.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 84 / 98


Outline

1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 85 / 98


Tricks round 2

Dropout.
Batch normalization.
SGD with momentum.
Skip connections.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 86 / 98


Dropout: Training

Dropout relies on using a randomly chosen subset of neurons during every


step of SGD.

Assume we have a fully connected layer with nin input neurons and nout
output neurons. Let 0 ≤ p ≤ 1 be a hyperparameter called the dropout
rate. For us p will be the probability of deleting an input neuron, but always
check as there are different conventions. Then, at every step (mini-batch)
of SGD:
1 Choose independently at random ≈ p · nin neurons which are deleted.
2 Temporarily delete those neurons (and their weights) from the
network.
3 Run one step of SGD update on the new, smaller network (using
gradient formulas for the small network).
4 Restore the deleted neurons.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 87 / 98


Dropout: Prediction

After the training is finished, all weights (not biases!) in the final network
should be multiplied by 1 − p. Then, prediction is run on the full network,
without deleting neurons.

The idea for dropout is to make neurons more independent (not rely on
the presence of other neurons in the network). It is considered a
regularization technique.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 88 / 98
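A NumPy sketch of the convention described on these slides (delete inputs with probability p during training, multiply the weights by 1 − p at prediction time). Note that many libraries, including tf.keras.layers.Dropout, instead rescale the kept activations during training, so always check which convention is used.

```python
import numpy as np

def dropout_mask(n_in, p, rng=np.random.default_rng()):
    # keep each input neuron with probability 1 - p (1 = kept, 0 = deleted)
    return (rng.random(n_in) >= p).astype(float)

def layer_train(x, W, b, mask):
    # forward pass through a fully connected layer with dropped inputs
    return W @ (x * mask) + b

def layer_predict(x, W, b, p):
    # at prediction time: no dropout, but weights multiplied by 1 - p
    return ((1 - p) * W) @ x + b
```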


Dropout: Technicalities

It is more typical to apply dropout in fully connected layers. This is


because weight sharing in convolutional layers is itself considered to have
regularization effect.

In keras: tf.keras.layers.Dropout.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 89 / 98


Batch normalization: training

Batch normalization is a recent popular regularization technique.

Consider a layer (it could be input layer) with output activations a ∈ Rn .


During a step of SGD with mini-batch of size k, in one feedforward pass
we compute activations a(1) , . . . , a(k) in this layer.

If a batch normalization is present, we then compute the empirical mean
$\mu = \frac{1}{k} \sum_{i=1}^{k} a^{(i)}$ and the empirical variance $\sigma^2 = \frac{1}{k} \sum_{i=1}^{k} (a^{(i)} - \mu)^2$ (where the
square function is applied on every coordinate independently).

Then, we let for a small hyperparameter ε > 0 (this is to avoid worrying
about σ² ≈ 0):
$$\hat{a}^{(i)} = \frac{a^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 90 / 98


Batch normalization: training

$$\hat{a}^{(i)} = \frac{a^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}} ,$$
which implies $\frac{1}{k} \sum_{i=1}^{k} \hat{a}^{(i)} = 0$ and $\frac{1}{k} \sum_{i=1}^{k} \big(\hat{a}^{(i)}\big)^2 \approx (1, \dots, 1)$ (if σ² ≫ ε).
Hence, the updated activations $\hat{a}^{(1)}, \dots, \hat{a}^{(k)}$ can be said to be normalized.

Finally, we let $y^{(i)} = \gamma \cdot \hat{a}^{(i)} + \beta$ for some β, γ ∈ R which are new
parameters trained by SGD. y^(1), . . . , y^(k) become inputs to the next layer.

Note that with batch normalization the activations (and gradients) depend
on all samples in the mini-batch. So, the feedforward and backpropagation
must run in parallel for the whole mini-batch.

Other than that, the updated gradient formulas can be derived as usual.
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 91 / 98
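A NumPy sketch of the training-time computation for one mini-batch (A holds the k activation vectors as rows; γ and β are scalars here, as on the slide):

```python
import numpy as np

def batch_norm_train(A, gamma, beta, eps=1e-5):
    # A: (k, n) activations of one mini-batch, one row per sample
    mu = A.mean(axis=0)                     # empirical mean over the mini-batch
    var = A.var(axis=0)                     # empirical variance
    A_hat = (A - mu) / np.sqrt(var + eps)   # normalized activations a_hat^(i)
    return gamma * A_hat + beta             # y^(i) = gamma * a_hat^(i) + beta
```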
Batch normalization: Prediction

After training is concluded, we let µ and σ² be the mean and variance
of the activations over the whole dataset.

New samples use those fixed values of µ and σ 2 . As a result, a network


trained with batch normalization might not work well for unseen samples
with significantly different statistics.

For keras, see tf.keras.layers.BatchNormalization.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 92 / 98




SGD with momentum

In state-of-the-art optimizers, the SGD formulas are often supplemented with
momentum. Recall that the SGD formula is
$$W^{(t+1)} = W^{(t)} - \eta \cdot \nabla_W L(x^{(t)}, y^{(t)}) .$$

SGD with momentum with hyperparameter 0 ≤ β ≤ 1 introduces the
momentum M^{(t)} (the same shape as the weights and biases) and uses the
formulas
$$M^{(t+1)} = \beta \cdot M^{(t)} + (1 - \beta) \cdot \nabla_W L(x^{(t)}, y^{(t)})$$
$$W^{(t+1)} = W^{(t)} - \eta \cdot M^{(t+1)} .$$

Note that β = 0 is the “standard” SGD.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 93 / 98
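The update in code (a sketch; grad stands for ∇_W L(x^(t), y^(t)) computed elsewhere, e.g. by backpropagation, and the arrays are NumPy arrays):

```python
def momentum_step(W, M, grad, eta, beta):
    # SGD with momentum; beta = 0 recovers the standard SGD update
    M_new = beta * M + (1 - beta) * grad
    W_new = W - eta * M_new
    return W_new, M_new
```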




Nesterov momentum
A fancy modification that has been popular recently is Nesterov
momentum:

$$M^{(t+1)} = \beta \cdot M^{(t)} + (1 - \beta) \cdot \nabla_W L(x^{(t)}, y^{(t)})\big|_{W = W^{(t)} - \beta M^{(t)}}$$
$$W^{(t+1)} = W^{(t)} - \eta \cdot M^{(t+1)} .$$

Source: http://cs231n.github.io/neural-networks-3/

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 94 / 98


Why is momentum useful?

Why do people think that momentum is useful?


Because it “smooths out” noise from the SGD.
Because it allows moving faster when there is a “steep cliff” (narrow
valley) in the gradient.

Source: https://www.willamette.edu/~gorr/classes/cs449/momrate.html

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 95 / 98


Momentum: technicalities

Various versions of momentum can be used with optimizers in Keras,


including tf.keras.optimizers.SGD and
tf.keras.optimizers.Adam.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 96 / 98


Skip connections
Normally neurons in the network are arranged in layers, and the
connections (weights) are placed only between neurons in neighboring
layers.

Skip connections or residual connections are additional connections


(weights) placed between layers which are not direct neighbors (thus
“skipping” over some layers).

Source: He, Zhang, Ren, Sun “Deep residual learning for image recognition”
Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 97 / 98
Skip connections: Technicalities

For deep networks, it might be difficult to learn dependencies across many


layers, due to vanishing gradients. Skip connections allow for more direct
learning from features that are many layers behind.

In fact, it is not easy to implement skip connections using


tf.keras.Sequential construction.

Instead, it is preferable to use “the functional API”. See


https://keras.io/guides/functional_api/ for details.

Jan Hązła Introduction AIMS Rwanda, Feb-Mar 2024 98 / 98


