
Introduction to Machine Learning

Neural Networks
Yen-Yu Lin (林彥宇), Professor
Department of Computer Science, National Yang Ming Chiao Tung University

Some slides are modified from S.-J. Wang, H.-T. Chen, V. Khalidov, and M. Hansard
Linear model for regression or classification

• A linear model for regression or classification takes the form

  y(\mathbf{x}, \mathbf{w}) = f\left( \sum_{j} w_j \phi_j(\mathbf{x}) \right)

➢ The decision is based on a linear combination of fixed nonlinear basis functions \phi_j(\mathbf{x})
➢ f is the identity function for regression
➢ f is a nonlinear activation function for classification
  ◆ Logistic sigmoid or softmax function
➢ A minimal code sketch of this model follows below

2
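To make the linear basis-function model concrete, here is a minimal NumPy sketch, assuming Gaussian basis functions with fixed, hand-chosen centers (the basis choice and all names are illustrative, not from the slides).

import numpy as np

def gaussian_basis(x, centers, width=1.0):
    # Fixed nonlinear basis functions phi_j(x); centers are chosen by hand.
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

def linear_model(x, w, centers, f=lambda a: a):
    # y(x, w) = f( sum_j w_j * phi_j(x) ); f is the identity for regression,
    # a sigmoid or softmax for classification.
    Phi = gaussian_basis(x, centers)   # shape (N, M)
    return f(Phi @ w)                  # shape (N,)

# Example: regression with 5 fixed basis functions on scalar inputs.
x = np.linspace(-3, 3, 10)
centers = np.linspace(-3, 3, 5)
w = np.random.randn(5)
print(linear_model(x, w, centers))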
Linear model and neural networks

• Our goal is to extend the linear model so that

➢ 1. The basis functions depend on parameters
  ◆ Parametric basis functions
➢ 2. These parameters are learnable during training

• This goal leads to the basic neural network model

https://www.houseofbots.com/news-detail/1442-1-what-is-deep-learning-and-neural-network 3
Activations

4
Activations

• Construct M linear combinations of the inputs x_1, …, x_D:

  a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}

➢ where a_j is the activation for j = 1, 2, …, M
➢ \{w_{ji}^{(1)}\}_{i=1}^{D} are the weights. The superscript (1) indicates that these parameters are in the first layer of the neural network
➢ w_{j0}^{(1)} is the bias
➢ Each activation is nonlinearly transformed by a differentiable, nonlinear activation function h, i.e., z_j = h(a_j)

• \{z_j = h(a_j)\}_{j=1}^{M} are called hidden units
5
Output unit activation

• The hidden units \{z_j = h(a_j)\}_{j=1}^{M} are linearly combined in the second layer of the neural network

• Suppose there are K outputs in the neural network. We have

  a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}

➢ where a_k is the output activation for k = 1, 2, …, K
➢ \{w_{kj}^{(2)}\}_{j=1}^{M} are the weights. The superscript (2) indicates that these parameters are in the second layer of the neural network
➢ w_{k0}^{(2)} is the bias

• a_k is further transformed by an output activation function

6
Neural networks for regression and classification

• The output activation a_k is further transformed by an output activation function to give y_k

• \{y_k\}_{k=1}^{K} are the final outputs of the neural network

• For regression, the output activation is the identity function: y_k = a_k

• For two-class classification, it is the logistic sigmoid function: y = 1 / (1 + e^{-a})

• For multi-class classification, it is the softmax function: y_k = e^{a_k} / \sum_j e^{a_j}

• These three output activations are sketched in code below

7
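A minimal NumPy sketch of the three output activation functions listed above; the function names are illustrative.

import numpy as np

def identity(a):
    # Regression: y_k = a_k
    return a

def logistic_sigmoid(a):
    # Two-class classification: y = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Multi-class classification: y_k = exp(a_k) / sum_j exp(a_j)
    # Subtracting the max is a standard numerical-stability trick.
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

a = np.array([2.0, -1.0, 0.5])
print(identity(a), logistic_sigmoid(a), softmax(a))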
Two-layer neural networks

• The two-layer neural network model

  y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)

➢ where \mathbf{w} is the set of all weight and bias parameters and \sigma is the output activation function

• The bias parameters can be absorbed into the weight parameters by using one additional input x_0 = 1 (see the sketch below)

8
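A minimal NumPy sketch of absorbing the bias into the weights by appending a constant input x_0 = 1; the array shapes and names are illustrative.

import numpy as np

D, M = 4, 3
x = np.random.randn(D)            # input vector
W1 = np.random.randn(M, D)        # first-layer weights w_ji^(1)
b1 = np.random.randn(M)           # first-layer biases  w_j0^(1)

# Explicit bias:
a = W1 @ x + b1

# Bias absorbed: prepend x_0 = 1 and fold b1 into an extended weight matrix.
x_tilde = np.concatenate(([1.0], x))                  # shape (D + 1,)
W1_tilde = np.concatenate((b1[:, None], W1), axis=1)  # shape (M, D + 1)
a_absorbed = W1_tilde @ x_tilde

print(np.allclose(a, a_absorbed))  # True: same activations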
Feed-forward neural networks

• Evaluating the two-layer network model above, from inputs to outputs, is called forward propagation

[Network diagram: nodes are the input, hidden, and output variables; links are the weights and biases; arrows show the direction of propagation]

• A NumPy sketch of forward propagation is given below

9
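A minimal NumPy sketch of forward propagation through a two-layer network, assuming a tanh hidden activation and an identity output (both choices are illustrative; the slides leave h and the output activation generic).

import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh, out=lambda a: a):
    # Forward propagation through a two-layer feed-forward network.
    a_hidden = W1 @ x + b1        # a_j = sum_i w_ji^(1) x_i + w_j0^(1)
    z = h(a_hidden)               # z_j = h(a_j), the hidden units
    a_out = W2 @ z + b2           # a_k = sum_j w_kj^(2) z_j + w_k0^(2)
    return out(a_out)             # y_k = output activation of a_k

D, M, K = 4, 5, 2                 # input, hidden, and output dimensions
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)
y = forward(rng.normal(size=D), W1, b1, W2, b2)
print(y)                          # the K network outputs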
Generalizations

• There may be more than one layer of hidden units


➢ Deep learning
• Individual units need not be fully connected to the next layer
➢ Convolutional neural networks
• Individual links may skip over one or more subsequent layers
➢ Skip connections

10
Neural networks as universal approximators

[Figure: a regression example illustrating that neural networks are universal approximators. Points: training data. Dashed curves: outputs of three hidden units. Solid curve: prediction by the neural network]

11
Neural networks for classification

• 3-class classification
• 2-layer neural networks with 64 hidden units

https://www.annytab.com/neural-network-classification-in-python/

12
Network training

• Given a set of training data \{\mathbf{x}_n\}, n = 1, 2, …, N, together with a corresponding set of target vectors \{\mathbf{t}_n\}, we can learn the neural network by minimizing an error function, e.g., the sum-of-squares error

  E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \| \mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n \|^2

• Let's consider how to train the network by giving a probabilistic interpretation to the network outputs

13
Neural networks for 1D regression

• We aim to minimize the error between the prediction y(\mathbf{x}_n, \mathbf{w}) and the target t_n

• We assume that the target t is a scalar that is normally distributed around the prediction

  p(t \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})

➢ where y(\mathbf{x}, \mathbf{w}) is the prediction by the neural network and \beta^{-1} is the variance

• Suppose the data are i.i.d. The likelihood is

  p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1})

➢ where \mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\} and \mathbf{t} = \{t_1, \ldots, t_N\}

14
ML solution for 1D regression

• Taking the negative logarithm, we get the negative log likelihood

  \frac{\beta}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi)

• The maximum likelihood solution for \mathbf{w} is therefore equivalent to minimizing the sum-of-squares error

  E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2

• Does setting the gradient of E(\mathbf{w}) to zero work?

➢ No, there is no closed-form solution

15
ML solution for 1D regression

• Optimize by using gradient descent, stochastic gradient descent, or the Newton-Raphson iterative optimization scheme

• The nonlinearity of y(\mathbf{x}, \mathbf{w}) makes E(\mathbf{w}) nonconvex

• In practice, a local minimum of the negative log likelihood may be found

• After having found \mathbf{w}_{\mathrm{ML}}, the value of \beta can be found by minimizing the negative log likelihood:

  \frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}_{\mathrm{ML}}) - t_n \}^2

  (a small numerical sketch follows below)

16
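A minimal NumPy sketch of estimating \beta_{\mathrm{ML}} from the residuals once \mathbf{w}_{\mathrm{ML}} (and hence the predictions) has been found; the arrays are illustrative placeholders.

import numpy as np

# Hypothetical predictions y(x_n, w_ML) and targets t_n after training.
y_pred = np.array([0.9, 2.1, 2.9, 4.2])
t = np.array([1.0, 2.0, 3.0, 4.0])

# 1 / beta_ML = (1/N) * sum_n ( y(x_n, w_ML) - t_n )^2
inv_beta_ml = np.mean((y_pred - t) ** 2)
beta_ml = 1.0 / inv_beta_ml
print(beta_ml)  # ML estimate of the noise precision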
ML solution for 1D regression

• After getting \mathbf{w}_{\mathrm{ML}} and \beta_{\mathrm{ML}}, we can predict the distribution of the target value t for a test input \mathbf{x} via

  p(t \mid \mathbf{x}, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}_{\mathrm{ML}}), \beta_{\mathrm{ML}}^{-1})

17
ML solution for 1D regression

• Two-layer neural networks for one-dimensional regression

[Network diagram: a two-layer network maps the input \mathbf{x}_n = (x_{n0}, x_{n1}, …, x_{nD}) through hidden units z_0, z_1, …, z_M (first-layer weights w_{ji}^{(1)}) to a single output y_n (second-layer weights w_{1j}^{(2)}), which is compared against the target t_n]

18
Neural networks for multi-dimensional regression

• Neural networks can be used for K-dimensional regression

• Construct a neural network with K outputs

• Make the following assumption:

  p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}(\mathbf{t} \mid \mathbf{y}(\mathbf{x}, \mathbf{w}), \beta^{-1} \mathbf{I})

• We can use the maximum likelihood solution, which is equivalent to minimizing the sum-of-squares error, to get \mathbf{w}_{\mathrm{ML}}

• Similarly, given \mathbf{w}_{\mathrm{ML}}, the optimal \beta_{\mathrm{ML}} is obtained

19
Neural networks for multi-dimensional regression

• Two-layer neural networks for 𝐾-dimensional regression

[Network diagram: a two-layer network maps the input \mathbf{x}_n = (x_{n0}, x_{n1}, …, x_{nD}) through hidden units z_0, z_1, …, z_M to K outputs \mathbf{y}_n = (y_{n1}, …, y_{nK}), compared against the target \mathbf{t}_n = (t_{n1}, …, t_{nK})]

  E_n(\mathbf{w}) = \frac{1}{2} \| \mathbf{y}_n - \mathbf{t}_n \|^2 = \frac{1}{2} \sum_{k=1}^{K} ( y_{nk} - t_{nk} )^2
20
Neural networks for binary classification

• Neural networks can be used for classification

• Given a set of training data \{\mathbf{x}_n\}, n = 1, 2, …, N, together with a corresponding set of target labels \{t_n\}, where t_n = 1 denotes class C_1 and t_n = 0 denotes class C_2

• Construct a (two-layer) neural network having a single output whose activation function is a logistic sigmoid

  y = \sigma(a) = \frac{1}{1 + \exp(-a)}

➢ where a is the output activation
➢ y(\mathbf{x}, \mathbf{w}) is interpreted as the conditional probability p(C_1 \mid \mathbf{x})
➢ The conditional probability of the other class is given by p(C_2 \mid \mathbf{x}) = 1 - y(\mathbf{x}, \mathbf{w})

21
ML solution for binary classification

• Regression: the target is a real value, which is normally distributed around the prediction

• Classification: the conditional distribution of a target given its input is a Bernoulli distribution of the form

  p(t \mid \mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^{t} \, \{ 1 - y(\mathbf{x}, \mathbf{w}) \}^{1 - t}

22
ML solution for binary classification

• When using ML optimization, we minimize the negative log likelihood, here called the cross-entropy error

  E(\mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}

➢ where y_n denotes y(\mathbf{x}_n, \mathbf{w})

• Optimize \mathbf{w} by using gradient descent or a variant of it

• After getting \mathbf{w}_{\mathrm{ML}}, binary classification is carried out by evaluating y(\mathbf{x}, \mathbf{w}_{\mathrm{ML}}) = p(C_1 \mid \mathbf{x}), e.g., assigning \mathbf{x} to C_1 when this probability exceeds 0.5 (a code sketch of the error follows below)

23
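A minimal NumPy sketch of the binary cross-entropy error above; the clipping constant is an illustrative numerical-safety detail, not from the slides.

import numpy as np

def binary_cross_entropy(y, t, eps=1e-12):
    # E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ]
    y = np.clip(y, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

y = np.array([0.9, 0.2, 0.7])        # network outputs y_n = p(C1 | x_n)
t = np.array([1.0, 0.0, 1.0])        # binary targets t_n
print(binary_cross_entropy(y, t))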
ML solution for binary classification

[Network diagram: a two-layer network maps the input \mathbf{x}_n = (x_{n0}, x_{n1}, …, x_{nD}) through hidden units z_0, z_1, …, z_M to a single sigmoid output y_n, compared against the binary target t_n]

24
Neural networks for multi-class classification

• Neural networks can be extended to K-class classification

• Given a set of training data \{\mathbf{x}_n\}, n = 1, 2, …, N, together with a corresponding set of target vectors \{\mathbf{t}_n\}, where \mathbf{t}_n is encoded by using the 1-of-K coding scheme

• Construct a (two-layer) neural network having K outputs and use the softmax as the output activation function

  y_k(\mathbf{x}, \mathbf{w}) = \frac{\exp(a_k(\mathbf{x}, \mathbf{w}))}{\sum_j \exp(a_j(\mathbf{x}, \mathbf{w}))}

➢ where 0 \le y_k \le 1 and \sum_k y_k = 1

25
ML solution for multi-class classification

• The negative log likelihood, i.e., the cross-entropy error, is

  E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_k(\mathbf{x}_n, \mathbf{w})

• Optimize \mathbf{w} by using gradient descent or a variant of it

• After getting \mathbf{w}_{\mathrm{ML}}, multi-class classification is carried out by using the softmax outputs, e.g., assigning \mathbf{x} to the class with the largest y_k(\mathbf{x}, \mathbf{w}_{\mathrm{ML}}) (a code sketch follows below)

26
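A minimal NumPy sketch of the multi-class cross-entropy error with 1-of-K targets; the arrays and names are illustrative.

import numpy as np

def multiclass_cross_entropy(Y, T, eps=1e-12):
    # E(w) = -sum_n sum_k t_nk ln y_k(x_n, w)
    return -np.sum(T * np.log(np.clip(Y, eps, 1.0)))

# Softmax outputs Y (rows sum to 1) and 1-of-K targets T for N = 2, K = 3.
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8]])
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
print(multiclass_cross_entropy(Y, T))
print(Y.argmax(axis=1))   # predicted classes: index of the largest output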
ML solution for multi-class classification

• Two-layer neural networks for 𝐾-class classification

[Network diagram: a two-layer network maps the input \mathbf{x}_n = (x_{n0}, x_{n1}, …, x_{nD}) through hidden units z_0, z_1, …, z_M to K softmax outputs \mathbf{y}_n = (y_{n1}, …, y_{nK}), compared against the 1-of-K target \mathbf{t}_n = (t_{n1}, …, t_{nK})]

27
Gradient descent

• The simplest approach is to update \mathbf{w} by a displacement in the negative gradient direction

  \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E(\mathbf{w}^{(\tau)})

➢ This is a steepest descent algorithm
➢ \eta > 0 is the learning rate
➢ This is a batch method, as evaluating \nabla E involves the entire data set
➢ A range of starting points \{\mathbf{w}^{(0)}\} may be needed in order to find a satisfactory minimum
➢ A minimal code sketch of this update is given below

28
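A minimal sketch of batch gradient descent for a sum-of-squares error on a linear model; the model choice is illustrative and only keeps the gradient simple (for a neural network, the gradient would come from backpropagation, described later).

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # design matrix, one row per data point
t = X @ np.array([1.0, -2.0, 0.5])       # targets from a known linear rule

def grad_E(w):
    # Gradient of E(w) = 0.5 * sum_n (y_n - t_n)^2 with y_n = w . x_n
    return X.T @ (X @ w - t)

eta = 0.005                              # learning rate
w = np.zeros(3)                          # starting point w^(0)
for _ in range(200):                     # w^(tau+1) = w^(tau) - eta * grad E(w^(tau))
    w = w - eta * grad_E(w)
print(w)                                 # approaches [1.0, -2.0, 0.5]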
Stochastic gradient descent

• Stochastic gradient descent (also called sequential gradient descent) has proved useful in practice when training neural networks on large data sets

• The error function needs to comprise a sum of terms, one for each data point, i.e.,

  E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w})

➢ Sum-of-squares error for regression
➢ Cross-entropy error for classification

29
Stochastic gradient descent

• Stochastic gradient descent makes an update to the weight vector based on one data point at a time

  \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n(\mathbf{w}^{(\tau)})

• A minimal code sketch of this update is given below

30
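A minimal sketch of the stochastic update, reusing the illustrative linear least-squares setup from the batch example: one data point n is visited at a time and contributes its own gradient \nabla E_n.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5])

eta = 0.02
w = np.zeros(3)
for epoch in range(20):
    for n in rng.permutation(len(X)):            # visit data points in random order
        grad_En = (X[n] @ w - t[n]) * X[n]       # gradient of E_n(w) = 0.5 * (y_n - t_n)^2
        w = w - eta * grad_En                    # w^(tau+1) = w^(tau) - eta * grad E_n
print(w)                                         # approaches [1.0, -2.0, 0.5]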
Geometric view of gradient descent

• The error function E(\mathbf{w}) is a surface sitting over the weight space

• [Figure: the error surface E(\mathbf{w}) over weight space] One marked weight vector is a local minimum

• Another marked weight vector is the global minimum

• At any point \mathbf{w}, the local gradient of the error surface is given by the vector \nabla E(\mathbf{w})

31
Error backpropagation

• The computational cost of gradient descent mainly lies in the evaluation of the gradient \nabla E at each iteration

➢ The dimension of the gradient equals the number of learnable parameters

• In feed-forward neural networks, the gradient of an error function E(\mathbf{w}) can be efficiently evaluated via an algorithm called error backpropagation

32
Feed-forward neural networks

• Two-layer feed-forward neural networks for regression


[Network diagram: input \mathbf{x} = (x_0, x_1, …, x_D) → first-layer weights \{w_{ji}^{(1)}\} → activations a_j → hidden units z_0, z_1, …, z_M → second-layer weights \{w_{kj}^{(2)}\} → activations a_k → outputs \mathbf{y} = (y_1, …, y_K)]

  a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}, \quad j = 1, \ldots, M

  z_j = h(a_j)

  a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}, \quad k = 1, \ldots, K

  y_k = a_k

33
Error backpropagation

• Variables/activations dependency:

  \{x_i\} \to \{w_{ji}^{(1)}\} \to \{a_j\} \to \{z_j\} \to \{w_{kj}^{(2)}\} \to \{a_k\} \to \{y_k\} \to E

• Our goal in gradient computation:

  \frac{\partial E}{\partial w_{kj}^{(2)}} \quad \text{and} \quad \frac{\partial E}{\partial w_{ji}^{(1)}}

• In backpropagation, we also need to compute

  \delta_k = \frac{\partial E}{\partial a_k} \quad \text{and} \quad \delta_j = \frac{\partial E}{\partial a_j}

[Network diagram: input \mathbf{x} = (x_0, x_1, …, x_D) → \{w_{ji}^{(1)}\} → a_j → hidden units z_0, z_1, …, z_M → \{w_{kj}^{(2)}\} → a_k → outputs \mathbf{y} = (y_1, …, y_K)]
34
Error backpropagation

• Setting: stochastic gradient descent, multi-dimensional regression

• Forward equations:

  a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}, \quad j = 1, \ldots, M
  z_j = h(a_j)
  a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}, \quad k = 1, \ldots, K
  y_k = a_k

• Error function: E(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{K} (y_k - t_k)^2

• Output layer: \delta_k = \frac{\partial E}{\partial a_k} = y_k - t_k

• Hidden layer: \delta_j = \frac{\partial E}{\partial a_j} = \sum_k \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial a_j} = h'(a_j) \sum_k w_{kj}^{(2)} \delta_k
35
Error backpropagation

• Variables/activations dependency:

  \{x_i\} \to \{w_{ji}^{(1)}\} \to \{a_j\} \to \{z_j\} \to \{w_{kj}^{(2)}\} \to \{a_k\} \to \{y_k\} \to E

• Forward equations and error (as before):

  a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}, \quad z_j = h(a_j), \quad a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}, \quad y_k = a_k, \quad E(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{K} (y_k - t_k)^2

• Output layer: \delta_k = y_k - t_k and

  \frac{\partial E}{\partial w_{kj}^{(2)}} = \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial w_{kj}^{(2)}} = \delta_k z_j

• Hidden layer: \delta_j = h'(a_j) \sum_k w_{kj}^{(2)} \delta_k

36
Error backpropagation

• Variables/activations dependency:

  \{x_i\} \to \{w_{ji}^{(1)}\} \to \{a_j\} \to \{z_j\} \to \{w_{kj}^{(2)}\} \to \{a_k\} \to \{y_k\} \to E

• Output layer: \delta_k = y_k - t_k and

  \frac{\partial E}{\partial w_{kj}^{(2)}} = \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial w_{kj}^{(2)}} = \delta_k z_j

• Hidden layer: \delta_j = h'(a_j) \sum_k w_{kj}^{(2)} \delta_k and

  \frac{\partial E}{\partial w_{ji}^{(1)}} = \frac{\partial E}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}^{(1)}} = \delta_j x_i

37
A review of error backpropagation

[Network diagram: input \mathbf{x} = (x_0, x_1, …, x_D) → \{w_{ji}^{(1)}\} → a_j → hidden units z_0, z_1, …, z_M → \{w_{kj}^{(2)}\} → a_k → outputs \mathbf{y} = (y_1, …, y_K)]

Step 1: \delta_k = y_k - t_k

Step 2: \frac{\partial E}{\partial w_{kj}^{(2)}} = \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial w_{kj}^{(2)}} = \delta_k z_j

Step 3: \delta_j = h'(a_j) \sum_k w_{kj}^{(2)} \delta_k

Step 4: \frac{\partial E}{\partial w_{ji}^{(1)}} = \frac{\partial E}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}^{(1)}} = \delta_j x_i

A code sketch of these four steps is given below.
38
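A minimal NumPy sketch of the four backpropagation steps above for a single data point, using a two-layer regression network with a tanh hidden layer (the tanh choice and random data are illustrative). A finite-difference check confirms the gradient; bias gradients are omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
x, t = rng.normal(size=D), rng.normal(size=K)
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)

# Forward propagation
a_j = W1 @ x + b1
z = np.tanh(a_j)                  # z_j = h(a_j); h'(a_j) = 1 - tanh(a_j)^2
a_k = W2 @ z + b2
y = a_k                           # identity output for regression

# Step 1: output deltas        delta_k = y_k - t_k
delta_k = y - t
# Step 2: second-layer grads   dE/dw_kj^(2) = delta_k * z_j
dW2 = np.outer(delta_k, z)
# Step 3: hidden deltas        delta_j = h'(a_j) * sum_k w_kj^(2) delta_k
delta_j = (1.0 - np.tanh(a_j) ** 2) * (W2.T @ delta_k)
# Step 4: first-layer grads    dE/dw_ji^(1) = delta_j * x_i
dW1 = np.outer(delta_j, x)

# Finite-difference check on one first-layer weight
def E(W1_):
    return 0.5 * np.sum((W2 @ np.tanh(W1_ @ x + b1) + b2 - t) ** 2)
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
print(dW1[0, 0], (E(W1p) - E(W1)) / eps)   # the two values should match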
Error backpropagation for other tasks

E E yk
• Step 1: k  =
ak yk ak

 1 K
 
2 k =1
( yk − t k ) 2 regression

E (w ) =  − {t ln y (x, w ) + (1 − t ) ln(1 − y (x, w ))} binary classification
 K

−  t k ln yk (x , w ) multi - calss classification


 k =1
𝑦𝑘 = 𝑎𝑘 regression
1
𝑦= binary classification
1 + 𝑒 −𝑎
𝑒 𝑎𝑘
𝑦𝑘 = multi−class classification
σ𝑗 𝑒 𝑎𝑗

• Steps 2 ~ 4 remain unchanged

39
Neural networks’ applications

• Face detection

Rowley et al.
40
Convolutional neural networks

[Figure: feature hierarchy — Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier]

41
Convolutional neural networks’ applications

[Figure: example applications — object recognition, object detection, object segmentation]

42
Recurrent neural networks

• Speech recognition

https://gab41.lab41.org/speech-recognition-you-down-with-ctc-8d3b558943f0

43
Generative adversarial networks

https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

44
Generative adversarial networks’ applications

Karras et al. Wang et al.

45
References

• Chapters 5.1, 5.2, and 5.3 in the PRML textbook

46
Thank You for Your Attention!

Yen-Yu Lin (林彥宇)


Email: [email protected]
URL: https://www.cs.nctu.edu.tw/members/detail/lin

47
