Session 4


Prediction methods and Machine learning

Session IV

Pierre Michel
[email protected]

M2 EBDS

2021
“Your ML methods seem pretty good but...”

How do you create value with Machine Learning?


An example in financial econometrics?
Check this paper: Investing Through Economic Cycles with Ensemble Machine Learning Algorithms

1. Deep Learning


1.1 Introduction to artificial neural networks


Supervised learning problem: artificial neural networks

We consider a supervised learning problem (regression, classification).


The labeled training examples are denoted $(x^{(i)}, y^{(i)})$.
In neural networks, we consider a more complex and non-linear prediction function, denoted $h_{W,b}(x)$.
$W$ and $b$ are the parameters we want to fit to the training data.


Simple neural network


Simple neural network, with a single neuron:

[Figure 1: a single neuron taking the three inputs $x_1$, $x_2$, $x_3$ and producing the output $h_{W,b}(x)$.]



What is a neuron?

A neuron is a computational unit that takes inputs and returns outputs.


In the previous example, the neuron takes three inputs $x_1$, $x_2$ and $x_3$, and an intercept parameter.
The output of the neuron is $h_{W,b}(x) = f\big(W^\top x + b\big) = f\Big(\sum_{i=1}^{3} W_i x_i + b\Big)$, with $f : \mathbb{R} \to \mathbb{R}$ the activation function.
Note: if we choose $f$ to be the sigmoid function, the previous example corresponds to logistic regression.
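To make this concrete, here is a minimal NumPy sketch of such a single neuron with a sigmoid activation (the weights, intercept and inputs are arbitrary illustrative values, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example parameters (for illustration only)
W = np.array([0.5, -0.3, 0.8])   # one weight per input
b = 0.1                          # intercept parameter

x = np.array([1.0, 2.0, 3.0])    # the three inputs x1, x2, x3

# Output of the neuron: h_{W,b}(x) = f(W^T x + b)
h = sigmoid(W @ x + b)
print(h)   # a value in (0, 1), as in logistic regression
```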


Activation functions

Different activation functions can be chosen:

• the sigmoid function: $f(z) = \frac{1}{1 + \exp(-z)}$
• the hyperbolic tangent function: $f(z) = \tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}$
• the rectified linear function: $f(z) = \max(0, z)$

Note: a neuron using the rectified linear function is called a rectified linear
unit (ReLU).
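As an illustration, these three functions can be written directly with NumPy (a small sketch, not part of the original slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)   # equals (exp(z) - exp(-z)) / (exp(z) + exp(-z))

def relu(z):
    return np.maximum(0.0, z)   # rectified linear function

z = np.linspace(-2.0, 2.0, 5)
print(sigmoid(z), tanh(z), relu(z))
```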


Activation functions

[Figure: plot of the sigmoid, hyperbolic tangent and rectified linear activation functions $f(z)$, for $z \in [-2, 2]$.]



Derivatives of activation functions

• sigmoid function: $f'(z) = f(z)\,(1 - f(z))$
• hyperbolic tangent function: $f'(z) = 1 - (f(z))^2$
• rectified linear function: $f'(z) = 0$ if $z \le 0$, $1$ otherwise
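These derivatives can be checked numerically against a finite-difference approximation; a short sketch (the test points and tolerance are arbitrary choices, and $z = 0$ is avoided because the rectified linear function is not differentiable there):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    return np.where(z <= 0.0, 0.0, 1.0)

z = np.array([-1.5, -0.5, 0.5, 1.5])
eps = 1e-6
for f, df in [(sigmoid, d_sigmoid),
              (np.tanh, d_tanh),
              (lambda u: np.maximum(0.0, u), d_relu)]:
    approx = (f(z + eps) - f(z - eps)) / (2 * eps)   # central difference
    assert np.allclose(approx, df(z), atol=1e-5)
```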


1.2 Neural Network Model


Multi-Layer Neural Network


A neural network is constructed by nesting many neurons, in such a way
that the output of one neuron is the input of another. Here is an
example of a “small” neural network:

[Figure: a “small” neural network with three inputs $x_1, x_2, x_3$ (Layer 1), a hidden layer of three neurons $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ plus an intercept unit (Layer 2), and a single output $h_{W,b}(x)$ (Layer 3).]


Multiple layers of neurons

[Figure 3: the same 3-layer network with its layers labeled: input layer (Layer 1), hidden layer (Layer 2) and output layer (Layer 3).]


Notations

We introduce some notation:

• $n_l$: the number of layers in the network ($n_l = 3$ in the previous example).
• $l$ is used as the label of the $l$-th layer, denoted $L_l$ ($L_1$ is the input layer, $L_{n_l}$ is the output layer).
• $(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, \dots, W^{(n_l - 1)}, b^{(n_l - 1)})$ are the neural network parameters.
• $W_{ij}^{(l)}$ is the parameter associated with the connection between neuron $j$ in layer $l$ and neuron $i$ in layer $l + 1$.
• $b_i^{(l)}$ is the intercept parameter associated with neuron $i$ in layer $l + 1$.
• $s_l$ is the number of neurons in layer $l$.
• $a_i^{(l)}$ is the activation of neuron $i$ in layer $l$ (note that $a_i^{(1)} = x_i$).


Example: 3-layer neural network


In this example, $W^{(1)} \in \mathbb{R}^{3 \times 3}$ and $W^{(2)} \in \mathbb{R}^{1 \times 3}$:

[Figure 4: the same 3-layer network, illustrating the dimensions of $W^{(1)}$ and $W^{(2)}$.]



Example: 3-layer neural network

In this example, we have to compute:

$a_1^{(2)} = f\big(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}\big)$

$a_2^{(2)} = f\big(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}\big)$

$a_3^{(2)} = f\big(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}\big)$

$h_{W,b}(x) = a_1^{(3)} = f\big(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}\big)$


Weighted sum of inputs

Let $z_i^{(l)}$ be the weighted sum of inputs to neuron $i$ in layer $l$; for example:

$z_i^{(2)} = \sum_{j=1}^{3} W_{ij}^{(1)} x_j + b_i^{(1)}$

Moreover, we have:

$a_i^{(l)} = f\big(z_i^{(l)}\big)$


Example: 3-layer neural network

We can rewrite the computations as follows:

$z^{(2)} = W^{(1)} x + b^{(1)}$

$a^{(2)} = f\big(z^{(2)}\big)$

$z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}$

$h_{W,b}(x) = a^{(3)} = f\big(z^{(3)}\big)$


Forward propagation

Using the notation $a^{(1)} = x$, we can rewrite the computations once more, to compute the activations $a^{(l+1)}$ of layer $l + 1$ as follows:

$z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}$

$a^{(l+1)} = f\big(z^{(l+1)}\big)$

This can be implemented using code vectorization.
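For instance, a vectorized forward pass in NumPy could look like the following sketch (the layer sizes and random parameters are purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """Compute z(l+1) = W(l) a(l) + b(l) and a(l+1) = f(z(l+1)) for every layer."""
    activations = [x]        # a(1) = x
    pre_activations = []     # the z(l) values
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = f(z)
        pre_activations.append(z)
        activations.append(a)
    return pre_activations, activations

# Illustrative 3-layer network: 3 inputs, 3 hidden neurons, 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]

x = np.array([1.0, 2.0, 3.0])
_, activations = forward(x, weights, biases)
print(activations[-1])   # h_{W,b}(x)
```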


Different architectures

We have considered two simple architectures (a single-neuron network and a 3-layer network), but many other architectures can be chosen.
For example, multiple hidden layers: an $n_l$-layer network with input layer $L_1$ and output layer $L_{n_l}$, in which each layer $l$ is fully (or densely) connected to layer $l + 1$.
Computing the output of the network means computing the activations of each layer successively; this is a feedforward neural network: its computation graph is acyclic.
We talk about deep learning (and deep neural networks) when the number of hidden layers in the network is large (say $> 3$).


Multiple output neurons


Consider a network with two hidden layers and two outputs:

[Figure 5: a 4-layer network with three inputs $x_1, x_2, x_3$, two hidden layers (Layers 2 and 3, each with an intercept unit), and two output neurons $h_{W,b}(x)$ in Layer 4.]



4-layer network with 2 outputs

$(x^{(i)}, y^{(i)})$ are the training samples, with $y^{(i)} \in \mathbb{R}^2$.

This model is useful if you want to predict multiple outputs.
Example: with a softmax activation function (multi-class classification), $x^{(i)}$ could be the clinical characteristics of a patient, and $y^{(i)}$ could represent the presence/absence of two symptoms.
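For reference, the softmax function mentioned above turns the weighted sums of the output neurons into probabilities that sum to one; a minimal sketch (the values are illustrative only):

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z_output = np.array([2.0, -1.0, 0.5])   # weighted sums of the output neurons
print(softmax(z_output))                # class probabilities
```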


1.3 Training a neural network


How to train a neural network?


Suppose we have a training set $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$.
We generally use batch gradient descent to train a neural network.
For a single example $(x, y)$, we define the following cost function:

$J(W, b; x, y) = \frac{1}{2} \|h_{W,b}(x) - y\|^2$

We thus define, for the $m$ examples, the overall cost function:

$J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(W_{ji}^{(l)}\big)^2$

$\phantom{J(W, b)} = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \|h_{W,b}(x^{(i)}) - y^{(i)}\|^2 \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(W_{ji}^{(l)}\big)^2$


About the cost function...

$J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \|h_{W,b}(x^{(i)}) - y^{(i)}\|^2 \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(W_{ji}^{(l)}\big)^2$

The first term is the classical sum-of-squares error.

The second term is called the weight decay term; it is a regularization term that controls the magnitude of the parameters.
$\lambda$ is the decay parameter, controlling the relative importance of the two terms.
This cost function can be used for both regression and classification.
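A possible NumPy sketch of this cost function, assuming the predictions $h_{W,b}(x^{(i)})$ have already been computed by a forward pass (all values below are illustrative):

```python
import numpy as np

def cost(predictions, targets, weights, lam):
    """Mean squared error term plus the weight decay term (biases are not penalized)."""
    m = len(targets)
    error_term = sum(0.5 * np.sum((h - y) ** 2)
                     for h, y in zip(predictions, targets)) / m
    decay_term = (lam / 2.0) * sum(np.sum(W ** 2) for W in weights)
    return error_term + decay_term

# Tiny illustrative example: 2 training examples, 1-dimensional output
predictions = [np.array([0.8]), np.array([0.2])]
targets = [np.array([1.0]), np.array([0.0])]
weights = [np.ones((3, 3)), np.ones((1, 3))]   # arbitrary parameters
print(cost(predictions, targets, weights, lam=0.01))
```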


Minimizing the cost function

As usual, the goal is to minimize the cost function J(W, b).


We first initialize the parameters $W_{ij}^{(l)}$ and $b_i^{(l)}$ with different starting values, to avoid all neurons computing the same outputs.
This is known as symmetry breaking.
Use a pseudo-random number generator, for example with:

$W_{ij}^{(l)} \sim \mathcal{N}(0, \epsilon^2)$

$b_i^{(l)} \sim \mathcal{N}(0, \epsilon^2)$

for some small $\epsilon > 0$.
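A possible initialization sketch in NumPy (the layer sizes and the value of $\epsilon$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
layer_sizes = [3, 3, 1]    # s_1, s_2, s_3 for the 3-layer example
epsilon = 0.01             # small standard deviation

# W(l) has shape (s_{l+1}, s_l); b(l) has shape (s_{l+1},)
weights = [epsilon * rng.standard_normal((s_next, s))
           for s, s_next in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [epsilon * rng.standard_normal(s_next)
          for s_next in layer_sizes[1:]]
```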


Gradient descent
Here is one iteration of gradient descent:

$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b)$

$b_i^{(l)} = b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J(W, b)$

with

$\frac{\partial}{\partial W_{ij}^{(l)}} J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x^{(i)}, y^{(i)}) \right] + \lambda W_{ij}^{(l)}$

$\frac{\partial}{\partial b_i^{(l)}} J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial b_i^{(l)}} J(W, b; x^{(i)}, y^{(i)})$

Backpropagation algorithm: interpretation

Let $(x, y)$ be a training example.

• First, do a forward pass: compute all the activations in the hidden layers, and the output $h_{W,b}(x)$.
• Then, for each neuron $i$ in layer $l$, compute an error term $\delta_i^{(l)}$ measuring the amount of error in the output due to this neuron.

Note: for an output neuron, $\delta_i^{(n_l)}$ is simply the difference between the target value and the output.
For a hidden neuron, $\delta_i^{(l)}$ is computed as a weighted average of the error terms of the neurons that take $a_i^{(l)}$ as input.


Backpropagation algorithm
• Feedforward pass: compute all the activations in layers $L_2, \dots, L_{n_l}$.
• Output layer: for each neuron $i$ in layer $n_l$, set

$\delta_i^{(n_l)} = \frac{\partial}{\partial z_i^{(n_l)}} \frac{1}{2} \|y - h_{W,b}(x)\|^2 = -(y_i - a_i^{(n_l)}) \, f'(z_i^{(n_l)})$

• Hidden layers: for each hidden layer $l = n_l - 1, n_l - 2, \dots, 2$, and for each neuron $i$ in layer $l$, set

$\delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)} \right) f'(z_i^{(l)})$

• Partial derivatives: compute

$\frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x, y) = a_j^{(l)} \delta_i^{(l+1)} \quad \text{and} \quad \frac{\partial}{\partial b_i^{(l)}} J(W, b; x, y) = \delta_i^{(l+1)}$

Backpropagation algorithm: vectorized version

• Feedforward pass: compute all the activations in layers $L_2, \dots, L_{n_l}$.
• Output layer: set

$\delta^{(n_l)} = -(y - a^{(n_l)}) \odot f'(z^{(n_l)})$

• Hidden layers: for each hidden layer $l = n_l - 1, n_l - 2, \dots, 2$, set

$\delta^{(l)} = \left( (W^{(l)})^\top \delta^{(l+1)} \right) \odot f'(z^{(l)})$

• Partial derivatives: compute

$\nabla_{W^{(l)}} J(W, b; x, y) = \delta^{(l+1)} (a^{(l)})^\top$

$\nabla_{b^{(l)}} J(W, b; x, y) = \delta^{(l+1)}$

(Here $\odot$ denotes the elementwise product.)
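A NumPy sketch of these vectorized equations for a single example, using the sigmoid activation (the network sizes and parameters are illustrative, not the lecturer's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Gradients of J(W, b; x, y) with respect to each W(l) and b(l)."""
    # Feedforward pass: store z(l) and a(l) for every layer
    activations, zs, a = [x], [], x
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Output layer: delta(nl) = -(y - a(nl)) * f'(z(nl))
    delta = -(y - activations[-1]) * d_sigmoid(zs[-1])
    grads_W = [np.outer(delta, activations[-2])]
    grads_b = [delta]

    # Hidden layers: delta(l) = (W(l)^T delta(l+1)) * f'(z(l))
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * d_sigmoid(zs[l])
        grads_W.insert(0, np.outer(delta, activations[l]))
        grads_b.insert(0, delta)
    return grads_W, grads_b

# Illustrative 3-layer network with 3 inputs and 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]
gW, gb = backprop(np.array([1.0, 2.0, 3.0]), np.array([1.0]), weights, biases)
print([g.shape for g in gW])   # [(3, 3), (1, 3)]
```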


Batch gradient descent algorithm


Notation: $\Delta W^{(l)}$ is a matrix of the same dimension as $W^{(l)}$, and $\Delta b^{(l)}$ is a vector of the same dimension as $b^{(l)}$.

• $\Delta W^{(l)} := 0$, $\Delta b^{(l)} := 0$
• for $i = 1, \dots, m$:
  – compute $\nabla_{W^{(l)}} J(W, b; x^{(i)}, y^{(i)})$ and $\nabla_{b^{(l)}} J(W, b; x^{(i)}, y^{(i)})$ (backpropagation)
  – set $\Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W, b; x^{(i)}, y^{(i)})$
  – set $\Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W, b; x^{(i)}, y^{(i)})$
• update the parameters:

$W^{(l)} = W^{(l)} - \alpha \left[ \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \right]$

$b^{(l)} = b^{(l)} - \alpha \left[ \frac{1}{m} \Delta b^{(l)} \right]$

Loop over these steps a given number of times (each pass over the training set is called an epoch).
Note: by default, the batch size equals $m$; it can be reduced to some $q < m$, so that each update uses only a subset of the training examples (mini-batch gradient descent).
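The accumulate-then-update pattern can be sketched as follows; to keep the example short, the per-example gradient below is computed for a single sigmoid neuron rather than a full network (for a network, that function would be replaced by the backpropagation gradients above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def example_gradient(x, y, W, b):
    """Gradient of J(W, b; x, y) = 0.5 * (h - y)^2 for a single sigmoid neuron."""
    h = sigmoid(W @ x + b)
    delta = (h - y) * h * (1.0 - h)
    return delta * x, delta           # dJ/dW, dJ/db

def batch_gradient_descent(X, Y, W, b, alpha=0.5, lam=0.01, epochs=200):
    m = len(X)
    for _ in range(epochs):
        dW, db = np.zeros_like(W), 0.0          # accumulators Delta W, Delta b
        for x, y in zip(X, Y):
            gW, gb = example_gradient(x, y, W, b)
            dW += gW
            db += gb
        W = W - alpha * (dW / m + lam * W)      # weight decay on W only
        b = b - alpha * (db / m)
    return W, b

# Toy data (illustrative): 50 examples with 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = (X[:, 0] > 0).astype(float)
W, b = batch_gradient_descent(X, Y, W=np.zeros(3), b=0.0)
print(W, b)
```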

1.4 Deep learning in Python


Coding from scratch

• nothing particular to write about that...
• use numpy
• use the “die and retry” method
• good luck!
• ...
• anyway, what about coding decision trees from scratch?


A user-friendly framework for deep learning: Keras

• Keras is a Python API used for deep learning applications (recommendation: CPU-only version).
• TensorFlow needs to be installed.
• Several datasets are available (including MNIST).

Let’s try Keras!
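As a starting point, here is a minimal Keras/TensorFlow sketch on MNIST (the architecture and hyperparameters are arbitrary choices, not necessarily the ones used in class):

```python
from tensorflow import keras

# Load the MNIST dataset shipped with Keras
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixel values to [0, 1]

# A small fully connected (dense) network with one hidden layer
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.1)
print(model.evaluate(x_test, y_test))   # [test loss, test accuracy]
```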
