Session 4


Prediction methods and Machine learning

Session IV

Pierre Michel
[email protected]

M2 EBDS

2021
“Your ML methods seem pretty good but...”

How do you create value with Machine Learning?


An example in financial econometrics?
Check this paper: Investing Through Economic Cycles with Ensemble Machine Learning Algorithms

1. Deep Learning


1.1 Introduction to artificial neural networks


Supervised learning problem: artificial neural networks

We consider a supervised learning problem (regression, classification).


The labeled training examples are denoted $(x^{(i)}, y^{(i)})$.
In neural networks, we consider a more complex and non-linear prediction function, denoted $h_{W,b}(x)$.
$W$ and $b$ are the parameters we want to fit to the training data.


Simple neural network


Simple neural network, with a single neuron:

[Figure 1: a single neuron taking the three inputs $x_1$, $x_2$, $x_3$ and producing the output $h_{W,b}(x)$.]



What is a neuron?

A neuron is a computational unit that takes inputs and returns outputs.


In the previous example, the neuron takes three inputs $x_1$, $x_2$ and $x_3$, and an intercept parameter.
The output of the neuron is $h_{W,b}(x) = f\big(W^\top x + b\big) = f\Big(\sum_{i=1}^{3} W_i x_i + b\Big)$, with $f : \mathbb{R} \to \mathbb{R}$ the activation function.
Note: if we choose $f$ to be the sigmoid function, the previous example corresponds to logistic regression.
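To make this concrete, here is a minimal NumPy sketch of such a single neuron with a sigmoid activation (the weights, intercept and inputs are arbitrary illustrative values, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example parameters (for illustration only)
W = np.array([0.5, -0.3, 0.8])   # one weight per input
b = 0.1                          # intercept parameter

x = np.array([1.0, 2.0, 3.0])    # the three inputs x1, x2, x3

# Output of the neuron: h_{W,b}(x) = f(W^T x + b)
h = sigmoid(W @ x + b)
print(h)   # a value in (0, 1), as in logistic regression
```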


Activation functions

Different activation functions can be chosen:

• the sigmoid function: $f(z) = \frac{1}{1 + \exp(-z)}$
• the hyperbolic tangent function: $f(z) = \tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}$
• the rectified linear function: $f(z) = \max(0, z)$

Note: a neuron using the rectified linear function is called a rectified linear
unit (ReLU).
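As an illustration, these three functions can be written directly with NumPy (a small sketch, not part of the original slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)   # equals (exp(z) - exp(-z)) / (exp(z) + exp(-z))

def relu(z):
    return np.maximum(0.0, z)   # rectified linear function

z = np.linspace(-2.0, 2.0, 5)
print(sigmoid(z), tanh(z), relu(z))
```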


Activation functions

[Figure: plot of the sigmoid, hyperbolic tangent and rectified linear activation functions $f(z)$, for $z \in [-2, 2]$.]



Derivatives of activation functions

• sigmoid function: $f'(z) = f(z)\,(1 - f(z))$
• hyperbolic tangent function: $f'(z) = 1 - (f(z))^2$
• rectified linear function: $f'(z) = 0$ if $z \le 0$, $1$ otherwise
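These derivatives can be checked numerically against a finite-difference approximation; a short sketch (the test points and tolerance are arbitrary choices, and $z = 0$ is avoided because the rectified linear function is not differentiable there):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    return np.where(z <= 0.0, 0.0, 1.0)

z = np.array([-1.5, -0.5, 0.5, 1.5])
eps = 1e-6
for f, df in [(sigmoid, d_sigmoid),
              (np.tanh, d_tanh),
              (lambda u: np.maximum(0.0, u), d_relu)]:
    approx = (f(z + eps) - f(z - eps)) / (2 * eps)   # central difference
    assert np.allclose(approx, df(z), atol=1e-5)
```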


1.2 Neural Network Model


Multi-Layer Neural Network


A neural network is constructed by nesting many neurons, in such a way
that the output of one neuron is the input of another. Here is an
example of a “small” neural network:

[Figure: a “small” neural network with three inputs $x_1, x_2, x_3$ (Layer 1), a hidden layer of three neurons $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ plus an intercept unit (Layer 2), and a single output $h_{W,b}(x)$ (Layer 3).]


Multiple layers of neurons

[Figure 3: the same 3-layer network with its layers labeled: input layer (Layer 1), hidden layer (Layer 2) and output layer (Layer 3).]


Notations

We introduce some notation:

• $n_l$: the number of layers in the network ($n_l = 3$ in the previous example).
• $l$ is used as the label of the $l$-th layer, denoted $L_l$ ($L_1$ is the input layer, $L_{n_l}$ is the output layer).
• $(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, \dots, W^{(n_l - 1)}, b^{(n_l - 1)})$ are the neural network parameters.
• $W_{ij}^{(l)}$ is the parameter associated with the connection between neuron $j$ in layer $l$ and neuron $i$ in layer $l + 1$.
• $b_i^{(l)}$ is the intercept parameter associated with neuron $i$ in layer $l + 1$.
• $s_l$ is the number of neurons in layer $l$.
• $a_i^{(l)}$ is the activation of neuron $i$ in layer $l$ (note that $a_i^{(1)} = x_i$).


Example: 3-layer neural network


In this example, $W^{(1)} \in \mathbb{R}^{3 \times 3}$ and $W^{(2)} \in \mathbb{R}^{1 \times 3}$:

[Figure 4: the same 3-layer network, illustrating the dimensions of $W^{(1)}$ and $W^{(2)}$.]



Example: 3-layer neural network

In this example, we have to compute:

$a_1^{(2)} = f\big(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}\big)$

$a_2^{(2)} = f\big(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}\big)$

$a_3^{(2)} = f\big(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}\big)$

$h_{W,b}(x) = a_1^{(3)} = f\big(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}\big)$


Weighted sum of inputs

Let $z_i^{(l)}$ be the weighted sum of inputs to neuron $i$ in layer $l$; for example:

$z_i^{(2)} = \sum_{j=1}^{3} W_{ij}^{(1)} x_j + b_i^{(1)}$

Moreover, we have:

$a_i^{(l)} = f\big(z_i^{(l)}\big)$


Example: 3-layer neural network

We can rewrite the computations as follows:

$z^{(2)} = W^{(1)} x + b^{(1)}$

$a^{(2)} = f\big(z^{(2)}\big)$

$z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}$

$h_{W,b}(x) = a^{(3)} = f\big(z^{(3)}\big)$


Forward propagation

Using the notation $a^{(1)} = x$, we can rewrite the computations once more, to compute the activations $a^{(l+1)}$ of layer $l + 1$ as follows:

$z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}$

$a^{(l+1)} = f\big(z^{(l+1)}\big)$

This can be implemented using code vectorization.
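For instance, a vectorized forward pass in NumPy could look like the following sketch (the layer sizes and random parameters are purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """Compute z(l+1) = W(l) a(l) + b(l) and a(l+1) = f(z(l+1)) for every layer."""
    activations = [x]        # a(1) = x
    pre_activations = []     # the z(l) values
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = f(z)
        pre_activations.append(z)
        activations.append(a)
    return pre_activations, activations

# Illustrative 3-layer network: 3 inputs, 3 hidden neurons, 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]

x = np.array([1.0, 2.0, 3.0])
_, activations = forward(x, weights, biases)
print(activations[-1])   # h_{W,b}(x)
```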


Different architectures

We have considered two simple architectures (a single-neuron network and a 3-layer network), but many other architectures can be chosen.
For example, multiple hidden layers: an $n_l$-layer network with input layer $L_1$ and output layer $L_{n_l}$, in which each layer $l$ is fully (or densely) connected to layer $l + 1$.
Computing the output of the network means computing the activations of each layer successively; this is a feedforward neural network: its computation graph is acyclic.
We talk about deep learning (and deep neural networks) when the number of hidden layers in the network is large (say $> 3$).


Multiple output neurons


Consider a network with two hidden layers and two outputs:

[Figure 5: a 4-layer network with three inputs $x_1, x_2, x_3$, two hidden layers (Layers 2 and 3, each with an intercept unit), and two output neurons $h_{W,b}(x)$ in Layer 4.]



4-layer network with 2 outputs

$(x^{(i)}, y^{(i)})$ are the training samples, with $y^{(i)} \in \mathbb{R}^2$.

This model is useful if you want to predict multiple outputs.
Example: with a softmax activation function (multi-class classification), $x^{(i)}$ could be the clinical characteristics of a patient, and $y^{(i)}$ could represent the presence/absence of two symptoms.
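For reference, the softmax function mentioned above turns the weighted sums of the output neurons into probabilities that sum to one; a minimal sketch (the values are illustrative only):

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z_output = np.array([2.0, -1.0, 0.5])   # weighted sums of the output neurons
print(softmax(z_output))                # class probabilities
```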


1.3 Training a neural network


How to train a neural network?


Suppose we have a training set $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$.
We generally use batch gradient descent to train a neural network.
For a single example $(x, y)$, we define the following cost function:

$J(W, b; x, y) = \frac{1}{2} \|h_{W,b}(x) - y\|^2$

We thus define, for the $m$ examples, the overall cost function:

$J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(W_{ji}^{(l)}\big)^2$

$\phantom{J(W, b)} = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \|h_{W,b}(x^{(i)}) - y^{(i)}\|^2 \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(W_{ji}^{(l)}\big)^2$


About the cost function...

$J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \|h_{W,b}(x^{(i)}) - y^{(i)}\|^2 \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(W_{ji}^{(l)}\big)^2$

The first term is the classical sum-of-squares error.

The second term is called the weight decay term; it is a regularization term that controls the magnitude of the parameters.
$\lambda$ is the decay parameter, controlling the relative importance of the two terms.
This cost function can be used for both regression and classification.
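A possible NumPy sketch of this cost function, assuming the predictions $h_{W,b}(x^{(i)})$ have already been computed by a forward pass (all values below are illustrative):

```python
import numpy as np

def cost(predictions, targets, weights, lam):
    """Mean squared error term plus the weight decay term (biases are not penalized)."""
    m = len(targets)
    error_term = sum(0.5 * np.sum((h - y) ** 2)
                     for h, y in zip(predictions, targets)) / m
    decay_term = (lam / 2.0) * sum(np.sum(W ** 2) for W in weights)
    return error_term + decay_term

# Tiny illustrative example: 2 training examples, 1-dimensional output
predictions = [np.array([0.8]), np.array([0.2])]
targets = [np.array([1.0]), np.array([0.0])]
weights = [np.ones((3, 3)), np.ones((1, 3))]   # arbitrary parameters
print(cost(predictions, targets, weights, lam=0.01))
```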


Minimizing the cost function

As usual, the goal is to minimize the cost function J(W, b).


We first initialize the parameters $W_{ij}^{(l)}$ and $b_i^{(l)}$ with different starting values, to avoid all neurons computing the same outputs.
This is known as symmetry breaking.
Use a pseudo-random number generator, for example with:

$W_{ij}^{(l)} \sim \mathcal{N}(0, \epsilon^2)$

$b_i^{(l)} \sim \mathcal{N}(0, \epsilon^2)$

for some small $\epsilon > 0$.
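A possible initialization sketch in NumPy (the layer sizes and the value of $\epsilon$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
layer_sizes = [3, 3, 1]    # s_1, s_2, s_3 for the 3-layer example
epsilon = 0.01             # small standard deviation

# W(l) has shape (s_{l+1}, s_l); b(l) has shape (s_{l+1},)
weights = [epsilon * rng.standard_normal((s_next, s))
           for s, s_next in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [epsilon * rng.standard_normal(s_next)
          for s_next in layer_sizes[1:]]
```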


Gradient descent
Here is one iteration of gradient descent:

$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b)$

$b_i^{(l)} = b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J(W, b)$

with

$\frac{\partial}{\partial W_{ij}^{(l)}} J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x^{(i)}, y^{(i)}) \right] + \lambda W_{ij}^{(l)}$

$\frac{\partial}{\partial b_i^{(l)}} J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial b_i^{(l)}} J(W, b; x^{(i)}, y^{(i)})$

Backpropagation algorithm: interpretation

Let $(x, y)$ be a training example.

• First, do a forward pass: compute all the activations in the hidden layers, and the output $h_{W,b}(x)$.
• Then, for each neuron $i$ in layer $l$, compute an error term $\delta_i^{(l)}$ measuring the amount of error in the output due to this neuron.

Note: for an output neuron, $\delta_i^{(n_l)}$ is simply the difference between the target value and the output.
For a hidden neuron, $\delta_i^{(l)}$ is computed as a weighted average of the error terms of the neurons that take $a_i^{(l)}$ as input.


Backpropagation algorithm
• Feedforward pass: compute all the activations in layers $L_2, \dots, L_{n_l}$.
• Output layer: for each neuron $i$ in layer $n_l$, set

$\delta_i^{(n_l)} = \frac{\partial}{\partial z_i^{(n_l)}} \frac{1}{2} \|y - h_{W,b}(x)\|^2 = -(y_i - a_i^{(n_l)}) \, f'(z_i^{(n_l)})$

• Hidden layers: for each hidden layer $l = n_l - 1, n_l - 2, \dots, 2$, and for each neuron $i$ in layer $l$, set

$\delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)} \right) f'(z_i^{(l)})$

• Partial derivatives: compute

$\frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x, y) = a_j^{(l)} \delta_i^{(l+1)} \quad \text{and} \quad \frac{\partial}{\partial b_i^{(l)}} J(W, b; x, y) = \delta_i^{(l+1)}$

Backpropagation algorithm: vectorized version

• Feedforward pass: compute all the activations in layers $L_2, \dots, L_{n_l}$.
• Output layer: set

$\delta^{(n_l)} = -(y - a^{(n_l)}) \odot f'(z^{(n_l)})$

• Hidden layers: for each hidden layer $l = n_l - 1, n_l - 2, \dots, 2$, set

$\delta^{(l)} = \left( (W^{(l)})^\top \delta^{(l+1)} \right) \odot f'(z^{(l)})$

• Partial derivatives: compute

$\nabla_{W^{(l)}} J(W, b; x, y) = \delta^{(l+1)} (a^{(l)})^\top$

$\nabla_{b^{(l)}} J(W, b; x, y) = \delta^{(l+1)}$

(Here $\odot$ denotes the elementwise product.)
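A NumPy sketch of these vectorized equations for a single example, using the sigmoid activation (the network sizes and parameters are illustrative, not the lecturer's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Gradients of J(W, b; x, y) with respect to each W(l) and b(l)."""
    # Feedforward pass: store z(l) and a(l) for every layer
    activations, zs, a = [x], [], x
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Output layer: delta(nl) = -(y - a(nl)) * f'(z(nl))
    delta = -(y - activations[-1]) * d_sigmoid(zs[-1])
    grads_W = [np.outer(delta, activations[-2])]
    grads_b = [delta]

    # Hidden layers: delta(l) = (W(l)^T delta(l+1)) * f'(z(l))
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * d_sigmoid(zs[l])
        grads_W.insert(0, np.outer(delta, activations[l]))
        grads_b.insert(0, delta)
    return grads_W, grads_b

# Illustrative 3-layer network with 3 inputs and 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]
gW, gb = backprop(np.array([1.0, 2.0, 3.0]), np.array([1.0]), weights, biases)
print([g.shape for g in gW])   # [(3, 3), (1, 3)]
```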


Batch gradient descent algorithm


Notation: $\Delta W^{(l)}$ is a matrix of the same dimension as $W^{(l)}$, and $\Delta b^{(l)}$ is a vector of the same dimension as $b^{(l)}$.

• $\Delta W^{(l)} := 0$, $\Delta b^{(l)} := 0$
• for $i = 1, \dots, m$:
  – compute $\nabla_{W^{(l)}} J(W, b; x^{(i)}, y^{(i)})$ and $\nabla_{b^{(l)}} J(W, b; x^{(i)}, y^{(i)})$ (backpropagation)
  – set $\Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W, b; x^{(i)}, y^{(i)})$
  – set $\Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W, b; x^{(i)}, y^{(i)})$
• update the parameters:

$W^{(l)} = W^{(l)} - \alpha \left[ \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \right]$

$b^{(l)} = b^{(l)} - \alpha \left[ \frac{1}{m} \Delta b^{(l)} \right]$

Loop over these steps a given number of times (each pass over the training set is called an epoch).
Note: by default, the batch size equals $m$; it can be reduced to some $q < m$, so that each update uses only a subset of the training examples (mini-batch gradient descent).
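The accumulate-then-update pattern can be sketched as follows; to keep the example short, the per-example gradient below is computed for a single sigmoid neuron rather than a full network (for a network, that function would be replaced by the backpropagation gradients above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def example_gradient(x, y, W, b):
    """Gradient of J(W, b; x, y) = 0.5 * (h - y)^2 for a single sigmoid neuron."""
    h = sigmoid(W @ x + b)
    delta = (h - y) * h * (1.0 - h)
    return delta * x, delta           # dJ/dW, dJ/db

def batch_gradient_descent(X, Y, W, b, alpha=0.5, lam=0.01, epochs=200):
    m = len(X)
    for _ in range(epochs):
        dW, db = np.zeros_like(W), 0.0          # accumulators Delta W, Delta b
        for x, y in zip(X, Y):
            gW, gb = example_gradient(x, y, W, b)
            dW += gW
            db += gb
        W = W - alpha * (dW / m + lam * W)      # weight decay on W only
        b = b - alpha * (db / m)
    return W, b

# Toy data (illustrative): 50 examples with 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = (X[:, 0] > 0).astype(float)
W, b = batch_gradient_descent(X, Y, W=np.zeros(3), b=0.0)
print(W, b)
```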

1.4 Deep learning in Python


Coding from scratch

• nothing particular to write about that...
• use numpy
• use the “die and retry” method
• good luck!
• ...
• anyway, what about coding decision trees from scratch?


A user-friendly framework for deep learning: Keras

• Keras is a Python API used for deep learning applications (recommendation: CPU-only version).
• TensorFlow needs to be installed.
• Several datasets are available (including MNIST).

Let’s try Keras!
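As a starting point, here is a minimal Keras/TensorFlow sketch on MNIST (the architecture and hyperparameters are arbitrary choices, not necessarily the ones used in class):

```python
from tensorflow import keras

# Load the MNIST dataset shipped with Keras
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixel values to [0, 1]

# A small fully connected (dense) network with one hidden layer
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.1)
print(model.evaluate(x_test, y_test))   # [test loss, test accuracy]
```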
