Lecture 1 INL
Jan Hązła
Other references:
Google crash course: https://developers.google.com/machine-learning/crash-course
Goodfellow, Bengio, Courville, “Deep Learning”, https://www.deeplearningbook.org
MIT fast course: http://introtodeeplearning.com/2023/index.html
Hardt, Recht, “Patterns, Predictions and Actions”, https://mlstory.org
1 One neuron
2 Neural network architecture
3 Gradients and stochastic gradient descent
4 Backpropagation
5 Cross-entropy and softmax
6 More tricks
7 Convolutional neural networks
8 Recurrent neural networks
9 Even more tricks
Definition
Let w ∈ R^d, b ∈ R, σ : R → R. A neuron is a function
Φ_{w,b,σ}(x) = σ(b + ∑_{i=1}^{d} w_i x_i) .
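A minimal NumPy sketch of this definition, taking σ to be the sigmoid (as in the MNIST example later); the names and values are illustrative only:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def neuron(x, w, b, activation=sigmoid):
    # Phi_{w,b,sigma}(x) = sigma(b + sum_i w_i * x_i)
    return activation(b + np.dot(w, x))

# Example: a neuron with d = 3 inputs.
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 0.0, 1.0])
print(neuron(x, w, b))
```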
[Figure: a boolean circuit with inputs x1, x2, x3 built from NOT, OR, and AND gates.]
Theorem
For every function f : {0, 1}n → {0, 1}m there exists a boolean circuit that
computes it.
We will see later that OR and NOT gates can also be computed by neural networks (using more than one neuron).
Corollary
For every function f : {0, 1}^n → {0, 1}^m there exists a neural network Φ with step activations for all hidden neurons such that
sup_{x ∈ [0,1]^n} |Φ(x) − f(x)| ≤ ε .
2 Neural network architecture
[Figure: a small fully connected network with inputs x1, x2, hidden activations a1, a2, a3, and outputs y1, y2.]
Our data samples are grayscale images, each with 28x28=784 pixels. Pixel
intensity is a real value between 0 (white) and 1 (black).
We will make one hidden layer with n neurons. Each neuron has the sigmoid activation function.
The output layer will have 10 sigmoid neurons (one for every digit from 0 to 9). These have a similar formula. For weights V ∈ R^{10×n} and biases b′ ∈ R^{10}:
y_i = S(b′_i + ∑_{j=1}^{n} V_{ij} a_j) ,   i = 0, . . . , 9 .
The prediction of the network is the digit with the highest output activation value:
ŷ = argmax_{0 ≤ i ≤ 9} y_i .
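A minimal NumPy sketch of this forward pass and prediction. The weights are initialized randomly here purely for illustration, and n = 30 hidden neurons is an arbitrary choice:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

d, n = 784, 30                     # input pixels, hidden neurons
rng = np.random.default_rng(0)
W, b = rng.normal(size=(n, d)), np.zeros(n)        # hidden layer parameters
V, b2 = rng.normal(size=(10, n)), np.zeros(10)     # output layer; b2 plays the role of b'

def forward(x):
    a = sigmoid(W @ x + b)         # hidden activations a in [0, 1]^n
    y = sigmoid(V @ a + b2)        # output activations y in [0, 1]^10
    return y

x = rng.uniform(0.0, 1.0, size=d)  # a fake "image" with pixel intensities in [0, 1]
y = forward(x)
y_hat = int(np.argmax(y))          # predicted digit
print(y_hat)
```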
Main quantity of interest that we will minimize: For the (test) data set
3 Gradients and stochastic gradient descent
|f(x_1 + ε_1, . . . , x_d + ε_d) − f(x_1, . . . , x_d) − ∑_{i=1}^{d} ε_i (∂/∂x_i) f(x_1, . . . , x_d)| ≤ C ∑_{i=1}^{d} ε_i² .
For now, we will take the quadratic loss. Let x be the input and y ∈ {0, . . . , 9} the correct label. Let ỹ = e_y ∈ R^{10} be the vector which has a one in position y and 0 everywhere else.
Note that the NN output is y(x) ∈ [0, 1]^{10}. Then
C(x, y) = (1/2) ∥y(x) − ỹ∥² .
C(x, y) = (1/2) (0 + 0 + 0.04 + 0.81 + 0 + 0 + 0 + 0 + 0.01 + 0.04) = 0.45 .
This is quite a high loss, due to the incorrect large output activation 0.9 for label 3.
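As a check of this arithmetic, here is a hypothetical output vector that reproduces the squared differences above, assuming the correct label is 8 (so ỹ = e_8); both the output and the label are assumptions chosen to match the numbers on the slide:

```python
import numpy as np

y_out = np.array([0, 0, 0.2, 0.9, 0, 0, 0, 0, 0.9, 0.2])  # hypothetical network output
label = 8                                                  # assumed correct label
y_tilde = np.zeros(10)
y_tilde[label] = 1.0

C = 0.5 * np.sum((y_out - y_tilde) ** 2)
print(C)   # 0.45, matching the computation above
```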
where (x^{(1)}, y^{(1)}), . . . , (x^{(M)}, y^{(M)}) is the training dataset. (Note that C is also a function of W, b, V, b′.)
For the training set (x^{(1)}, y^{(1)}), . . . , (x^{(M)}, y^{(M)}) we want to minimize the average training loss. So, our gradient should be
(1/M) ∑_{i=1}^{M} ∇_W C(x^{(i)}, y^{(i)}) .
The gradient of a mini-batch is less precise (more noisy) than the gradient of the whole dataset. But, it is much faster to compute. For MNIST, M = 50000 and we will use k = 10. So, one SGD step will be 5000 times faster.
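A sketch of one epoch of mini-batch SGD under these conventions. The function grad_C, the parameter list, and the learning rate eta are placeholders; the update is the standard SGD step, parameter ← parameter − η · (averaged mini-batch gradient):

```python
import numpy as np

def sgd_epoch(params, grad_C, X, Y, k=10, eta=0.1, rng=np.random.default_rng()):
    """One pass over the training set using mini-batches of size k.

    params is a list of arrays (e.g. [W, b, V, b2]); grad_C(params, x, y) is
    assumed to return the gradient of the loss on one sample in the same format.
    """
    M = len(X)
    order = rng.permutation(M)                  # visit samples in random order
    for start in range(0, M, k):
        batch = order[start:start + k]
        grads = [np.zeros_like(p) for p in params]
        for i in batch:
            for g, gi in zip(grads, grad_C(params, X[i], Y[i])):
                g += gi                         # accumulate per-sample gradients
        for p, g in zip(params, grads):
            p -= eta * g / len(batch)           # step along the averaged gradient
    return params
```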
4 Backpropagation
We want to compute the gradients ∇_W C, ∇_b C, ∇_V C, ∇_{b′} C.
Recall the feedforward computation:
z = Wx + b ,   a = S(z) ,   z′ = Va + b′ ,   a′ = S(z′) .
Recall that C = (1/2) ∑_{i=0}^{9} (a′_i − ỹ_i)².
∂C/∂z′_i = (∂C/∂a′_i) (∂a′_i/∂z′_i) = (a′_i − ỹ_i) · S′(z′_i) .
Recall that C = (1/2) ∑_{i=0}^{9} (a′_i − ỹ_i)². Let δ′ = ∇_{z′} C.
∂C/∂b′_i = (∂C/∂z′_i) (∂z′_i/∂b′_i) = δ′_i · 1 ,
∂C/∂V_{ij} = (∂C/∂z′_i) (∂z′_i/∂V_{ij}) = δ′_i · a_j .
Recall that C = (1/2) ∑_{i=0}^{9} (a′_i − ỹ_i)². Let δ′ = ∇_{z′} C.
∂C/∂z_i = ∑_{j=0}^{9} (∂C/∂z′_j) (∂z′_j/∂z_i) = ∑_{j=0}^{9} δ′_j (∂/∂z_i) ∑_{k=1}^{n} V_{jk} S(z_k)
= ∑_{j=0}^{9} δ′_j · (∂/∂z_i) (V_{ji} S(z_i)) = ∑_{j=0}^{9} V_{ji} · δ′_j · S′(z_i) .
Recall that C = (1/2) ∑_{i=0}^{9} (a′_i − ỹ_i)². Let δ = ∇_z C.
∂C/∂b_i = (∂C/∂z_i) (∂z_i/∂b_i) = δ_i · 1 ,
∂C/∂W_{ij} = (∂C/∂z_i) (∂z_i/∂W_{ij}) = δ_i · x_j .
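Putting the four gradient formulas together, a sketch of backpropagation for this two-layer network in NumPy (using the standard fact S′(t) = S(t)(1 − S(t)) for the sigmoid; b2 plays the role of b′):

```python
import numpy as np

def S(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop(W, b, V, b2, x, y_tilde):
    # Feedforward, storing intermediate values.
    z = W @ x + b
    a = S(z)
    z2 = V @ a + b2
    a2 = S(z2)
    # Output layer: delta'_i = dC/dz'_i = (a'_i - y~_i) * S'(z'_i).
    delta2 = (a2 - y_tilde) * a2 * (1 - a2)    # S'(z') = S(z')(1 - S(z'))
    grad_b2 = delta2
    grad_V = np.outer(delta2, a)               # dC/dV_ij = delta'_i * a_j
    # Hidden layer: delta_i = sum_j V_ji * delta'_j * S'(z_i).
    delta = (V.T @ delta2) * a * (1 - a)
    grad_b = delta
    grad_W = np.outer(delta, x)                # dC/dW_ij = delta_i * x_j
    return grad_W, grad_b, grad_V, grad_b2
```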
5 Cross-entropy and softmax
As you will see yourself, various problems can occur during network training. Diagnosing and preventing them is one of the main practical challenges in the field.
Two particular problems with gradients can occur:
Exploding gradients (which are too large).
Vanishing gradients (too small).
In both cases it is unlikely the network will learn good weights. And both
problems are generally more likely to happen in deeper networks.
One possible remedy for exploding gradients is to rescale the gradient to a fixed norm M:
∇_W C ← (M / ∥∇_W C∥) ∇_W C .
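A sketch of this rescaling, often called gradient clipping. Here the gradient is rescaled only when its norm exceeds M, which is one common convention; the threshold value is illustrative:

```python
import numpy as np

def clip_gradient(grad, M=5.0):
    """Rescale grad to have norm M whenever its norm exceeds M."""
    norm = np.linalg.norm(grad)
    if norm > M:
        grad = (M / norm) * grad
    return grad
```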
Let z ∈ R^{10} be the input values for the final layer of our MNIST network and y = S(z) the NN output. With the square loss C_sq = (1/2) ∥y − ỹ∥² one can compute
∂C_sq/∂z_i = (y_i − ỹ_i) y_i (1 − y_i) .
For example, if it happens that ỹ_i = 0 and y_i ≈ 1, then the loss is large (and the NN is likely to output wrong predictions), but the gradient will be small due to 1 − y_i ≈ 0. This is an example of a vanishing gradient.
A popular choice for the loss function is the cross-entropy loss. In our case (MNIST with sigmoid output neurons) we define it as follows. For neural network output y ∈ R^{10} and the label vector ỹ ∈ R^{10} we have
C_cross = − ∑_{i=0}^{9} [ ỹ_i ln y_i + (1 − ỹ_i) ln(1 − y_i) ] .
Note that this loss will help with sigmoid vanishing gradients, since
∂C_cross/∂z_i = (y_i − ỹ_i) .
Another possibility for the output layer is the softmax function:
y_i = softmax(z)_i = exp(z_i) / ∑_{j=0}^{9} exp(z_j) .
Cross-entropy and softmax are often used when the output can be
interpreted as a probability distribution.
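A sketch implementing the softmax and cross-entropy formulas above. The eps clipping and the subtraction of max(z) are standard numerical-stability additions, not something from the slides:

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(y, y_tilde, eps=1e-12):
    # C_cross = - sum_i [ y~_i ln y_i + (1 - y~_i) ln(1 - y_i) ]
    y = np.clip(y, eps, 1 - eps)   # avoid log(0)
    return -np.sum(y_tilde * np.log(y) + (1 - y_tilde) * np.log(1 - y))

z = np.array([1.0, 2.0, 0.5, -1.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.2])
y = softmax(z)
y_tilde = np.zeros(10)
y_tilde[8] = 1.0
print(cross_entropy(y, y_tilde))
```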
6 More tricks
However, it might be more natural to choose W_{ij} ∼ N(0, 1/d), since in that case for x ∈ {−1, 1}^d we will have ∑_{j=1}^{d} W_{ij} x_j ∼ N(0, 1).
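A quick numerical check of this claim (illustrative only): with W_ij ∼ N(0, 1/d) the sum ∑_j W_ij x_j has standard deviation about 1, while with W_ij ∼ N(0, 1) it is about √d ≈ 28 for d = 784:

```python
import numpy as np

d = 784
rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=d)                              # a vector in {-1, 1}^d

W_wide = rng.normal(0.0, 1.0, size=(10_000, d))                  # W_ij ~ N(0, 1)
W_scaled = rng.normal(0.0, np.sqrt(1.0 / d), size=(10_000, d))   # W_ij ~ N(0, 1/d)

print(np.std(W_wide @ x))     # roughly sqrt(d) = 28
print(np.std(W_scaled @ x))   # roughly 1
```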
More good-quality data is always better. Certain datasets can be artificially expanded into larger ones, giving a more robust network and less overfitting.
7 Convolutional neural networks
Problems of interest (e.g., image, speech and text related) often have
spatial/temporal and hierarchical structure.
One kernel is applied to all input coordinates with the same weights. This
is called weight sharing.
It is not enough to apply just one filter (one set of weights). Instead, many filters are applied to create many output channels: one output channel per filter.
So, if in one convolutional layer with padding, the input has shape
K × K × C , and we apply D filters with dimensions k × k, we have:
K 2 C neurons (activations) in the input to the layer.
K 2 D neurons in the output of the layer.
k 2 CD weights.
D biases.
All in all, we have 784 input neurons and 576 × 20 = 11520 (hidden) output neurons. But, just 25 · 1 · 20 = 500 weights and 20 biases.
Compare this with 784 · 30 = 23520 weights for a fully connected layer with 30 hidden neurons, and 78400 weights with 100 hidden neurons.
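A small helper (hypothetical, not from the slides) that reproduces these counts; without padding the output spatial size is K − k + 1:

```python
def conv_layer_counts(K, C, D, k, padding=True):
    """Neuron and parameter counts for a K x K x C input and D filters of size k x k."""
    out = K if padding else K - k + 1          # output spatial size
    return {
        "input neurons": K * K * C,
        "output neurons": out * out * D,
        "weights": k * k * C * D,
        "biases": D,
    }

# The MNIST example: 28 x 28 x 1 input, 20 filters of size 5 x 5, no padding.
print(conv_layer_counts(K=28, C=1, D=20, k=5, padding=False))
# -> 784 input neurons, 576 * 20 = 11520 output neurons, 500 weights, 20 biases.
```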
After a convolutional layer, a pooling layer is often applied to reduce the spatial dimensions. The most typical variants are max pooling and average pooling.
For our MNIST example, after applying the convolution layer, we have
activations in the shape 24 × 24 × 20.
Pooling and especially convolution can also be done using strides. Using stride s means that the “convolution window” is not applied at every position, but only every s steps.
For problems that require deep structure, many layers of convolution and pooling are applied in sequence. The details differ from architecture to architecture.
Most often, the first layers (close to the input) are convolutional and pooling (these can be very deep), and the last (usually just one or two layers) are fully connected.
As always, the SGD formulas have to be modified for the new architecture.
1 Input shape 28 × 28 × 1.
2 2D convolution without padding, ReLU activation, 20 filters, filter
size 5 × 5. The first hidden layer has shape 24 × 24 × 20.
3 2D max 2 × 2 pooling. The second hidden layer has shape
12 × 12 × 20.
4 2D convolution without padding, ReLU activation, 40 filters, filter
size 5 × 5. Hidden layer shape 8 × 8 × 40.
5 2D max 2 × 2 pooling. Hidden layer shape 4 × 4 × 40.
6 Fully connected layer with 100 neurons, ReLU activation.
7 Output layer with 10 neurons, softmax.
8 Log-likelihood loss.
We might also add dropout to the fully connected layer.
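One possible reading of this architecture as a Keras model. The optimizer and the dropout rate are illustrative choices; with one-hot labels, categorical cross-entropy corresponds to the log-likelihood loss mentioned above:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(20, (5, 5), activation="relu"),   # -> 24 x 24 x 20
    tf.keras.layers.MaxPooling2D((2, 2)),                    # -> 12 x 12 x 20
    tf.keras.layers.Conv2D(40, (5, 5), activation="relu"),   # -> 8 x 8 x 40
    tf.keras.layers.MaxPooling2D((2, 2)),                    # -> 4 x 4 x 40
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dropout(0.5),                            # optional dropout on the fully connected layer
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```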
8 Recurrent neural networks
Recurrent neural networks (RNNs) are used for inputs which are sequences of variable length x^{(1)}, . . . , x^{(T)} (for example, words in a sentence).
RNN with one hidden layer
Example RNN with one hidden layer:
Input x^{(t)} ∈ R^k, hidden layer activations h^{(t)} ∈ R^n.
Weights W_{hh} ∈ R^{n×n} and W_{hx} ∈ R^{n×k}, biases b ∈ R^n.
h^{(t+1)} = σ(W_{hh} h^{(t)} + W_{hx} x^{(t+1)} + b) .
h^{(0)} can be initialized to zero.
Note that the weights are shared (the same) for all 1 ≤ t ≤ T.
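A minimal NumPy sketch of this recurrence; σ is taken to be tanh here purely as an example, and the sizes and random parameters are illustrative:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_hx, b, sigma=np.tanh):
    # h^(t) = sigma(W_hh h^(t-1) + W_hx x^(t) + b); the same weights at every t.
    return sigma(W_hh @ h_prev + W_hx @ x_t + b)

n, k = 16, 4                          # hidden size, input size
rng = np.random.default_rng(0)
W_hh = rng.normal(0, 0.1, (n, n))
W_hx = rng.normal(0, 0.1, (n, k))
b = np.zeros(n)

h = np.zeros(n)                       # h^(0) initialized to zero
for x_t in rng.normal(size=(5, k)):   # a length-5 input sequence
    h = rnn_step(h, x_t, W_hh, W_hx, b)
print(h.shape)
```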
Illustration: feedforward vs recurrent
[Figure: the feedforward computation x → z = Wx + b → h = σ(z) → . . . compared with the recurrent computation, where the hidden state h^{(0)}, h^{(1)}, . . . is fed back through W_{hh}.]
Let the dictionary be D = (are, hello, how, you). Then, the one-hot encoding sets
onehot(are) = (1, 0, 0, 0) ,
onehot(hello) = (0, 1, 0, 0) ,
onehot(how) = (0, 0, 1, 0) ,
onehot(you) = (0, 0, 0, 1) .
Let Φ be a one-layer RNN with n hidden neurons, weights W_{hh}, W_{hx} and biases b. As before, let
h^{(t)} = σ(W_{hh} h^{(t−1)} + W_{hx} x^{(t)} + b) .
We define the output y^{(t+1)} ∈ R^K as y^{(t+1)} = softmax(V h^{(t)} + b′), where V ∈ R^{K×n}, b′ ∈ R^K.
Take the text “Hello how are you”. Recall the encoding from the previous slide. Then, the first input is x^{(1)} = (0, 1, 0, 0).
Assume the network outputs y^{(2)} = (0.2, 0.3, 0.4, 0.1). Then, the first term in the loss is H(y^{(2)}, x^{(2)}) = − ln 0.4, where x^{(2)} = (0, 0, 1, 0) is the encoding of “how”.
y^{(2)} can be interpreted as the estimate of the network: 20% chance the next token is “are”, 30% it is “hello”, 40% it is “how” and 10% it is “you”.
This idea is the basis of how (early versions of) text generation networks work. Current versions of GPT are based on attention and the transformer architecture.
Text prediction: Running the model
After training, given a prompt (user question) x^{(1)}, . . . , x^{(T)}:
First, the network runs sequentially on inputs x^{(1)}, . . . , x^{(T)}. The outputs y^{(2)}, . . . , y^{(T)} are discarded. After that we have hidden state h^{(T)} and output y^{(T+1)}.
Let t be any time with t > T, and write (abusing notation slightly) y^{(t)} ∈ {1, . . . , K} for the argmax of the output vector y^{(t)}. The network runs for more steps: at step t the hidden input is given by h^{(t−1)} and the input is given by onehot(w_i), where i = y^{(t)}. This generates a new hidden state h^{(t)} and output y^{(t+1)}.
At some time T′ we decide to stop the process. We do not discuss how to choose this time.
Finally, the output of the network (the answer to the user) is the sequence of words defined by y^{(T+1)}, . . . , y^{(T′)}.
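A sketch of this generation procedure with greedy argmax decoding. The stopping rule (a fixed max_steps), the choice σ = tanh, and all parameter values are placeholders; indices are 0-based here:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def generate(prompt, words, params, sigma=np.tanh, max_steps=20):
    """Greedy text generation with a one-layer RNN.

    prompt: list of one-hot vectors x^(1), ..., x^(T)
    words:  the dictionary (w_1, ..., w_K)
    params: (W_hh, W_hx, b, V, b2)
    """
    W_hh, W_hx, b, V, b2 = params
    h = np.zeros(W_hh.shape[0])
    # Run through the prompt; the intermediate outputs are discarded.
    for x in prompt:
        h = sigma(W_hh @ h + W_hx @ x + b)
    out = []
    for _ in range(max_steps):                 # a crude stopping rule
        y = softmax(V @ h + b2)                # next output distribution
        i = int(np.argmax(y))                  # index of the predicted next word
        out.append(words[i])
        x = np.zeros(len(words))
        x[i] = 1.0                             # onehot(w_i) becomes the next input
        h = sigma(W_hh @ h + W_hx @ x + b)
    return out
```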
9 Even more tricks
Dropout.
Batch normalization.
SGD with momentum.
Skip connections.
Assume we have a fully connected layer with nin input neurons and nout
output neurons. Let 0 ≤ p ≤ 1 be a hyperparameter called the dropout
rate. For us, p will be the probability of deleting an input neuron, but always
check as there are different conventions. Then, at every step (mini-batch)
of SGD:
1 Choose independently at random ≈ p · nin neurons which are deleted.
2 Temporarily delete those neurons (and their weights) from the
network.
3 Run one step of SGD update on the new, smaller network (using
gradient formulas for the small network).
4 Restore the deleted neurons.
After the training is finished, all weights (not biases!) in the final network
should be multiplied by 1 − p. Then, prediction is run on the full network,
without deleting neurons.
The idea for dropout is to make neurons more independent (not rely on
the presence of other neurons in the network). It is considered a
regularization technique.
In keras: tf.keras.layers.Dropout.
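A sketch of the deletion step described above, following the convention on these slides (delete input neurons with probability p during training, multiply the weights by 1 − p for prediction). Note that Keras' Dropout layer uses the “inverted” convention instead, rescaling the kept activations by 1/(1 − p) during training:

```python
import numpy as np

def dropout_mask(n_in, p, rng=np.random.default_rng()):
    # Each input neuron is deleted independently with probability p,
    # so about p * n_in neurons are deleted per mini-batch.
    return (rng.uniform(size=n_in) >= p).astype(float)

def layer_forward_train(a_in, W, b, p=0.5, rng=np.random.default_rng()):
    mask = dropout_mask(len(a_in), p, rng)
    return W @ (a_in * mask) + b        # deleted neurons contribute nothing

def layer_forward_predict(a_in, W, b, p=0.5):
    # After training: use all neurons, but multiply the weights by (1 - p).
    return ((1 - p) * W) @ a_in + b
```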
Consider a mini-batch of activations a^{(1)}, . . . , a^{(k)} at some layer, with coordinate-wise mean µ and variance σ² over the mini-batch, and let ε > 0 be a small constant. Batch normalization replaces each activation with
â^{(i)} = (a^{(i)} − µ) / √(σ² + ε) ,
which implies (1/k) ∑_{i=1}^{k} â^{(i)} = 0 and (1/k) ∑_{i=1}^{k} (â^{(i)})² ≈ (1, . . . , 1) (if σ² ≫ ε).
Hence, the updated activations â^{(1)}, . . . , â^{(k)} can be said to be normalized.
Note that with batch normalization the activations (and gradients) depend
on all samples in the mini-batch. So, the feedforward and backpropagation
must run in parallel for the whole mini-batch.
Other than that, the updated gradient formulas can be derived as usual.
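A sketch of the training-time normalization for one mini-batch (in practice batch-norm layers also learn an additional per-coordinate scale and shift, which is omitted here):

```python
import numpy as np

def batch_norm_train(A, eps=1e-5):
    """Normalize a mini-batch of activations.

    A has shape (k, n): row i is the activation vector a^(i).
    Returns activations with per-coordinate mean 0 and variance ~1 over the batch.
    """
    mu = A.mean(axis=0)            # coordinate-wise mean over the mini-batch
    var = A.var(axis=0)            # coordinate-wise variance
    return (A - mu) / np.sqrt(var + eps)

A = np.random.default_rng(0).normal(5.0, 3.0, size=(10, 4))   # k = 10, n = 4
A_hat = batch_norm_train(A)
print(A_hat.mean(axis=0))          # ~0
print((A_hat ** 2).mean(axis=0))   # ~1
```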
In state-of-the-art optimizers, the SGD formulas are often supplemented with momentum. Recall that the SGD update is W ← W − η ∇_W C (for a learning rate η, and similarly for the other parameters).
Source: http://cs231n.github.io/neural-networks-3/
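A sketch of one common momentum formulation, v ← µ·v − η·∇_W C followed by W ← W + v, where v is a “velocity” kept between steps; the symbols µ, η and the default values below are illustrative, not from the slides:

```python
def sgd_momentum_step(W, v, grad_W, eta=0.1, mu=0.9):
    """One SGD-with-momentum update; v is the velocity, persisted across steps."""
    v = mu * v - eta * grad_W
    W = W + v
    return W, v
```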
Source: He, Zhang, Ren, Sun “Deep residual learning for image recognition”
Skip connections: Technicalities