2024-Lecture11-MLAlgorithms

The document discusses machine learning algorithms, focusing on supervised learning and decision trees, particularly the ID3 algorithm. It explains the training process, hypothesis space, and how decision trees are constructed using attributes to predict outcomes, illustrated with a restaurant waiting example. The document also covers concepts like entropy and information gain in the context of decision tree learning.


MACHINE LEARNING ALGORITHMS

Nguyễn Ngọc Thảo – Nguyễn Hải Minh


{nnthao, nhminh}@fit.hcmus.edu.vn
Outline
• Supervised learning: Related concepts
• ID3 decision trees
• Artificial neural networks

Supervised learning: Training
• Consider a labeled training set of 𝑁 examples.
(𝑥1, 𝑦1), (𝑥2, 𝑦2), … , (𝑥𝑁 , 𝑦𝑁 )
• where each 𝑦𝑗 was generated by an unknown function 𝑦 = 𝑓(𝑥).
• The output 𝒚𝒋 is called ground truth, i.e., the true answer
that the model must predict.

• The training process finds a hypothesis ℎ such that 𝒉 ≈ 𝒇.

Supervised learning: Hypothesis space
• ℎ is drawn from a hypothesis space 𝐻 of possible functions.
• E.g., 𝐻 might be the set of polynomials of degree 3, or the set of 3-SAT Boolean logic formulas.
• Choose 𝐻 using prior knowledge about the process that generated the data, or by exploratory data analysis (EDA).
• EDA examines the data with statistical tests and visualizations to get
some insight into what hypothesis space might be appropriate.
• Or just try multiple hypothesis spaces and evaluate which
one works best.

Supervised learning: Hypothesis
• The hypothesis ℎ is consistent if it agrees with the true function 𝑓 on all training observations, i.e., ∀𝑖: ℎ(𝑥𝑖) = 𝑦𝑖.
• For continuous data, we instead look for a best-fit function for which each ℎ(𝑥𝑖) is close to 𝑦𝑖.
• Ockham’s razor: Select the simplest consistent hypothesis.

Supervised learning: Hypothesis

Finding hypotheses to fit data. Top row: four plots of best-fit functions from
four different hypothesis spaces trained on data set 1. Bottom row: the same
four functions, but trained on a slightly different data set (sampled from the
same 𝑓(𝑥) function).
Supervised learning: Testing
• The quality of the hypothesis ℎ depends on how accurately it
predicts the observations in the test set → generalization.
• The test set must use the same distribution over the example space as the training set.

A learning curve for the decision tree learning algorithm on 100 randomly generated examples in the restaurant domain. Each data point is the average of 20 trials.
ID3 Decision Tree
Example problem: Restaurant waiting

Predicting whether a certain person will wait to


have a seat in a restaurant.

1. Alternate: is there an alternative restaurant nearby?


2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Example problem: Restaurant waiting

A (true) decision tree for deciding whether to wait for a table.
Example problem: Restaurant waiting

Learning decision trees
• Divide and conquer: split the data into smaller and smaller subsets.
• Splits are usually on a single variable, e.g., first x1 > a?, then x2 > b? or x2 > g? depending on the branch.
• After each split, every outcome is a new decision-tree learning problem with fewer examples and one less attribute.
Learning decision trees

Splitting the examples by testing on attributes. At each node we show the positive
(light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type
brings us no nearer to distinguishing between positive and negative examples. (b)
Splitting on Patrons does a good job of separating positive and negative examples.
After splitting on Patrons, Hungry is a fairly good second test.
ID3 Decision tree: Pseudo-code

function LEARN-DECISION-TREE(examples, attributes, parent_examples) returns a tree
  if examples is empty                                   // case 3: no examples left
    then return PLURALITY-VALUE(parent_examples)
  else if all examples have the same classification      // case 2: remaining examples are all positive (or all negative)
    then return the classification
  else if attributes is empty                            // case 4: no attributes left but examples are still mixed
    then return PLURALITY-VALUE(examples)
  else …                                                 // case 1: continued on the next slide

The decision tree learning algorithm. The function PLURALITY-VALUE selects the most common output value among a set of examples, breaking ties randomly.
ID3 Decision tree: Pseudo-code
function LEARN-DECISION-TREE(examples, attributes, parent_examples) returns a tree
  …
  else                                                   // case 1: there are still attributes to split the examples
    A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
    tree ← a new decision tree with root test A
    for each value v of A do
      exs ← {e : e ∈ examples and e.A = v}
      subtree ← LEARN-DECISION-TREE(exs, attributes − A, examples)
      add a branch to tree with label (A = v) and subtree subtree
    return tree

The decision tree learning algorithm (continued). The function IMPORTANCE evaluates the importance of attributes (e.g., by information gain).
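The pseudo-code above can be sketched in Python. This is a minimal illustration, not the lecture's code: the dataset encoding (a list of (attribute-dict, label) pairs), the tuple-based tree representation, and the use of information gain as IMPORTANCE are my own assumptions.

```python
import math
from collections import Counter

def plurality_value(examples):
    # most common label, used for empty or attribute-exhausted branches
    return Counter(y for _, y in examples).most_common(1)[0][0]

def entropy(examples):
    total = len(examples)
    counts = Counter(y for _, y in examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def importance(a, examples):
    # information gain of splitting on attribute a
    remainder = 0.0
    for v in {e[a] for e, _ in examples}:
        sub = [(e, y) for e, y in examples if e[a] == v]
        remainder += len(sub) / len(examples) * entropy(sub)
    return entropy(examples) - remainder

def learn_decision_tree(examples, attributes, parent_examples=()):
    if not examples:                                   # case 3: no examples left
        return plurality_value(parent_examples)
    if len({y for _, y in examples}) == 1:             # case 2: all one class
        return examples[0][1]
    if not attributes:                                 # case 4: attributes exhausted
        return plurality_value(examples)
    A = max(attributes, key=lambda a: importance(a, examples))   # case 1
    branches = {}
    for v in {e[A] for e, _ in examples}:
        exs = [(e, y) for e, y in examples if e[A] == v]
        branches[v] = learn_decision_tree(exs, attributes - {A}, examples)
    return (A, branches)

def classify(tree, e):
    while isinstance(tree, tuple):
        A, branches = tree
        tree = branches[e[A]]
    return tree

# tiny demo: the logical AND of two binary attributes
DATA = [({'a': 1, 'b': 1}, '+'), ({'a': 1, 'b': 0}, '-'),
        ({'a': 0, 'b': 1}, '-'), ({'a': 0, 'b': 0}, '-')]
tree = learn_decision_tree(DATA, {'a', 'b'})
print(all(classify(tree, e) == y for e, y in DATA))  # True
```

Since the demo data are consistent, the learned tree classifies every training example correctly.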
ID3 Decision tree algorithm
1. There are some positive and some negative examples → choose the best attribute to split them.
2. The remaining examples are all positive (or all negative) → DONE; it is possible to answer Yes or No.
3. No examples left at a branch → return a default value.
   • No example has been observed for this combination of attribute values.
   • The default value is the plurality classification of all the examples that were used in constructing the node's parent.
4. No attributes left, but both positive and negative examples remain → return the plurality classification of the remaining examples.
   • Examples with the same description but different classifications.
   • Caused by an error or noise in the data, a nondeterministic domain, or no observation of an attribute that would distinguish the examples.
Example problem: Restaurant waiting

The decision tree induced from the 12-example training set.


17
Example problem: Restaurant waiting
• The induced decision tree can classify all the examples
without tests for Raining and Reservation.
• It can detect interesting and previously unsuspected patterns.
• E.g., the customers will wait for Thai food on weekends.
• It is also bound to make some mistakes for cases where it
has seen no examples.
• E.g., how about a situation in which the wait is 0–10 minutes, the
restaurant is full, yet the customer is not hungry?

Decision tree: Inductive learning
• Simplest: construct a decision tree with one leaf for every example
  → memory-based learning
  → worse generalization

• Advanced: split on each variable so that the purity of each split increases (i.e., toward branches that are only yes or only no).
A purity measure with entropy
• Entropy measures the uncertainty of a random variable 𝑉 whose values 𝑣𝑘 occur with probability 𝑃(𝑣𝑘):

$$H(V) = \sum_k P(v_k) \log_2 \frac{1}{P(v_k)} = -\sum_k P(v_k) \log_2 P(v_k)$$

• It is a fundamental quantity in information theory.

• The information gain (IG) for an attribute 𝐴 is the expected reduction in entropy from before to after splitting the data on 𝐴.
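The definition above is easy to check numerically; the sketch below is a hedged illustration in which `entropy` (my own naming) takes a list of the probabilities 𝑃(𝑣𝑘):

```python
import math

def entropy(probs):
    # H(V) = -sum_k P(v_k) log2 P(v_k), with 0 log 0 taken as 0
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0  (maximal: two equally likely values)
print(entropy([1.0]))        # 0.0  (a pure node)
print(entropy([0.25] * 4))   # 2.0
```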
A purity measure with entropy
• Entropy is maximal when all possibilities are equally likely.

• Entropy is zero in a pure "yes" (or pure "no") node.

• Decision tree learning therefore picks, at each node, the split that decreases entropy the most, i.e., maximizes the information gain.
Example problem: Restaurant waiting
Splitting on Alternate?: Yes → 3 Y, 3 N; No → 3 Y, 3 N

• Calculate the entropy of the whole dataset:
  $H(S) = -\frac{6}{12}\log_2\frac{6}{12} - \frac{6}{12}\log_2\frac{6}{12} = 1$
• Calculate the average entropy of attribute Alternate?:
  $AE_{Alternate?} = P(Alt{=}Y)\,H(Alt{=}Y) + P(Alt{=}N)\,H(Alt{=}N)$
  $= \frac{6}{12}\left(-\frac{3}{6}\log_2\frac{3}{6}-\frac{3}{6}\log_2\frac{3}{6}\right) + \frac{6}{12}\left(-\frac{3}{6}\log_2\frac{3}{6}-\frac{3}{6}\log_2\frac{3}{6}\right) = 1$
• Calculate the information gain of attribute Alternate?:
  $IG(Alternate?) = H(S) - AE_{Alternate?} = 1 - 1 = 0$
Example problem: Restaurant waiting
Splitting on Bar?: Yes → 3 Y, 3 N; No → 3 Y, 3 N

• Calculate the average entropy of attribute Bar?:
  $AE_{Bar?} = \frac{6}{12}\left(-\frac{3}{6}\log_2\frac{3}{6}-\frac{3}{6}\log_2\frac{3}{6}\right) + \frac{6}{12}\left(-\frac{3}{6}\log_2\frac{3}{6}-\frac{3}{6}\log_2\frac{3}{6}\right) = 1$
• Calculate the information gain of attribute Bar?:
  $IG(Bar?) = H(S) - AE_{Bar?} = 1 - 1 = 0$
Example problem: Restaurant waiting
Splitting on Sat/Fri?: Yes → 2 Y, 3 N; No → 4 Y, 3 N

• Calculate the average entropy of attribute Sat/Fri?:
  $AE_{Sat/Fri?} = \frac{5}{12}\left(-\frac{2}{5}\log_2\frac{2}{5}-\frac{3}{5}\log_2\frac{3}{5}\right) + \frac{7}{12}\left(-\frac{4}{7}\log_2\frac{4}{7}-\frac{3}{7}\log_2\frac{3}{7}\right) = 0.979$
• Calculate the information gain of attribute Sat/Fri?:
  $IG(Sat/Fri?) = H(S) - AE_{Sat/Fri?} = 1 - 0.979 = 0.021$
Example problem: Restaurant waiting
Splitting on Hungry?: Yes → 5 Y, 2 N; No → 1 Y, 4 N

• Calculate the average entropy of attribute Hungry?:
  $AE_{Hungry?} = \frac{7}{12}\left(-\frac{5}{7}\log_2\frac{5}{7}-\frac{2}{7}\log_2\frac{2}{7}\right) + \frac{5}{12}\left(-\frac{1}{5}\log_2\frac{1}{5}-\frac{4}{5}\log_2\frac{4}{5}\right) = 0.804$
• Calculate the information gain of attribute Hungry?:
  $IG(Hungry?) = H(S) - AE_{Hungry?} = 1 - 0.804 = 0.196$
Example problem: Restaurant waiting
Splitting on Raining?: Yes → 3 Y, 2 N; No → 3 Y, 4 N

• Calculate the average entropy of attribute Raining?:
  $AE_{Raining?} = \frac{5}{12}\left(-\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5}\right) + \frac{7}{12}\left(-\frac{3}{7}\log_2\frac{3}{7}-\frac{4}{7}\log_2\frac{4}{7}\right) = 0.979$
• Calculate the information gain of attribute Raining?:
  $IG(Raining?) = H(S) - AE_{Raining?} = 1 - 0.979 = 0.021$
Example problem: Restaurant waiting
Splitting on Reservation?: Yes → 3 Y, 2 N; No → 3 Y, 4 N

• Calculate the average entropy of attribute Reservation?:
  $AE_{Reservation?} = \frac{5}{12}\left(-\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5}\right) + \frac{7}{12}\left(-\frac{3}{7}\log_2\frac{3}{7}-\frac{4}{7}\log_2\frac{4}{7}\right) = 0.979$
• Calculate the information gain of attribute Reservation?:
  $IG(Reservation?) = H(S) - AE_{Reservation?} = 1 - 0.979 = 0.021$
Example problem: Restaurant waiting
Splitting on Type?: French → 1 Y, 1 N; Italian → 1 Y, 1 N; Thai → 2 Y, 2 N; Burger → 2 Y, 2 N

• Calculate the average entropy of attribute Type?:
  $AE_{Type?} = \frac{2}{12}\left(-\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2}\right) + \frac{2}{12}\left(-\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2}\right) + \frac{4}{12}\left(-\frac{2}{4}\log_2\frac{2}{4}-\frac{2}{4}\log_2\frac{2}{4}\right) + \frac{4}{12}\left(-\frac{2}{4}\log_2\frac{2}{4}-\frac{2}{4}\log_2\frac{2}{4}\right) = 1$
• Calculate the information gain of attribute Type?:
  $IG(Type?) = H(S) - AE_{Type?} = 1 - 1 = 0$
Example problem: Restaurant waiting
Splitting on WaitEstimate?: 0–10 → 4 Y, 2 N; 10–30 → 1 Y, 1 N; 30–60 → 1 Y, 1 N; >60 → 0 Y, 2 N

• Calculate the average entropy of attribute WaitEstimate? (taking $0\log_2 0 = 0$):
  $AE_{WaitEstimate?} = \frac{6}{12}\left(-\frac{4}{6}\log_2\frac{4}{6}-\frac{2}{6}\log_2\frac{2}{6}\right) + \frac{2}{12}\left(-\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2}\right) + \frac{2}{12}\left(-\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2}\right) + \frac{2}{12}\left(-\frac{0}{2}\log_2\frac{0}{2}-\frac{2}{2}\log_2\frac{2}{2}\right) = 0.792$
• Calculate the information gain of attribute WaitEstimate?:
  $IG(WaitEstimate?) = H(S) - AE_{WaitEstimate?} = 1 - 0.792 = 0.208$
Example problem: Restaurant waiting
• The largest information gain (0.541), equivalently the smallest average entropy (0.459), is achieved by splitting on Patrons.

Patrons?: None → 2 N; Some → 4 Y; Full → 2 Y, 4 N (still mixed, to be split further)

• Continue making new splits, always purifying the nodes.
Quiz 01: ID3 decision tree
• The data represent files on a computer system. Possible
values of the class variable are “infected”, which implies the
file has a virus infection, or “clean” if it doesn't.
• Derive a decision tree for virus identification.

No. | Writable | Updated | Size  | Class
----|----------|---------|-------|---------
 1  | Yes      | No      | Small | Infected
 2  | Yes      | Yes     | Large | Infected
 3  | No       | Yes     | Med   | Infected
 4  | No       | No      | Med   | Clean
 5  | Yes      | No      | Large | Clean
 6  | No       | No      | Large | Clean
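As a hedged starting point for the quiz, the sketch below computes the information gain of each attribute on this table (ID3 would make its first split on the winner); the encoding and function names are my own:

```python
import math

FILES = [
    ({'Writable': 'Yes', 'Updated': 'No',  'Size': 'Small'}, 'Infected'),
    ({'Writable': 'Yes', 'Updated': 'Yes', 'Size': 'Large'}, 'Infected'),
    ({'Writable': 'No',  'Updated': 'Yes', 'Size': 'Med'},   'Infected'),
    ({'Writable': 'No',  'Updated': 'No',  'Size': 'Med'},   'Clean'),
    ({'Writable': 'Yes', 'Updated': 'No',  'Size': 'Large'}, 'Clean'),
    ({'Writable': 'No',  'Updated': 'No',  'Size': 'Large'}, 'Clean'),
]

def entropy(examples):
    total = len(examples)
    h = 0.0
    for label in {y for _, y in examples}:
        p = sum(1 for _, y in examples if y == label) / total
        h -= p * math.log2(p)
    return h

def info_gain(attr):
    remainder = 0.0
    for v in {e[attr] for e, _ in FILES}:
        subset = [(e, y) for e, y in FILES if e[attr] == v]
        remainder += len(subset) / len(FILES) * entropy(subset)
    return entropy(FILES) - remainder

for a in ('Writable', 'Updated', 'Size'):
    print(a, round(info_gain(a), 3))
# Updated has the largest gain, so ID3 splits on it first
```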
Artificial neural network
What is a neural network?
• The biological neural network (NN) is a reasoning model
based on the human brain.
• The human brain contains approximately 86 billion neurons; estimates of the number of synaptic connections in an adult range from 100 to 500 trillion.
• It is a highly complex, nonlinear, and parallel information-processing system.
• Learning through experience is an essential characteristic.
• Plasticity: connections leading to the “right answer” are enhanced,
while those to the “wrong answer” are weakened.

Biological neural network
• There have been attempts to emulate biological neural networks in computers, resulting in artificial neural networks (ANNs).

• ANNs resemble only the learning mechanisms, not the architecture.
  • Megatron-Turing NLG: 530 billion parameters; GPT-3: 175 billion.
ANN: Network architecture
• An ANN has many neurons, arranged in a hierarchy of layers.
• Each neuron is an elementary information-processing unit.

• ANNs improve performance through experience and generalization.
ANN: Applications

ANN: Neurons and Signals
• Each neuron receives several input signals through its connections and produces at most a single output signal.
  • Each connection carries a weight expressing the strength of its input.

• The set of weights is the long-term memory of an ANN → the learning process iteratively adjusts the weights.
Biological neuron vs. artificial neuron: the analogy between the two.

Biological neuron | Artificial neuron
------------------|------------------
Soma              | Neuron
Dendrite          | Input
Axon              | Output
Synapse           | Weight
(Figure: a chart of neural network architectures. Source: The Asimov Institute)
How to build an ANN?
• The network architecture must be decided first.
• How many neurons are to be used?
• How the neurons are to be connected to form a network?
• Then determine which learning algorithm to use,
• Supervised /semi-supervised / unsupervised / reinforcement learning
• And finally train the neural network
• How to initialize the weights of the network?
• How to update them from a set of training examples.

Perceptron
Perceptron (Frank Rosenblatt, 1958)
• A perceptron has a single neuron with adjustable synaptic
weights and a hard limiter.

A single-layer two-input perceptron.
How does a perceptron work?
• The perceptron divides the 𝑛-dimensional input space into two decision regions by a hyperplane defined by

$$y = \sum_{i=1}^{n} x_i w_i - \theta$$
Perceptron learning rule
• Step 1 – Initialization: the initial weights 𝑤1, 𝑤2, …, 𝑤𝑛 and threshold 𝜃 are randomly assigned small numbers (usually in [−0.5, 0.5], but not restricted to it).

• Step 2 – Activation: at iteration 𝑝, apply the 𝑝-th example, which has inputs 𝑥1(𝑝), 𝑥2(𝑝), …, 𝑥𝑛(𝑝) and desired output 𝑌𝑑(𝑝), and calculate the actual output

$$Y(p) = \mathrm{step}\!\left(\sum_{i=1}^{n} x_i(p)\, w_i(p) - \theta\right), \qquad \mathrm{step}(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$

where 𝑛 is the number of perceptron inputs and step is the activation function.

• Step 3 – Weight training
  • Update the weights 𝑤𝑖: $w_i(p+1) = w_i(p) + \Delta w_i(p)$, where $\Delta w_i(p)$ is the weight correction at iteration 𝑝.
  • The delta rule determines how to adjust the weights: $\Delta w_i(p) = \eta \times x_i(p) \times e(p)$, where 𝜂 is the learning rate (0 < 𝜂 < 1) and $e(p) = Y_d(p) - Y(p)$.

• Step 4 – Iteration: increase iteration 𝑝 by one, go back to Step 2, and repeat the process until convergence.
Perceptron for the logical AND/OR
• A single-layer perceptron can learn the AND/OR operations.

The learning of logical AND converged after several iterations (threshold 𝜃 = 0.2, learning rate 𝜂 = 0.1).
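The four-step learning rule can be sketched as follows. The threshold θ = 0.2 and learning rate η = 0.1 follow the slide; the random seed and the encoding of the AND training set are illustrative assumptions:

```python
import random

def step(x):
    # hard limiter activation
    return 1 if x >= 0 else 0

def train_perceptron(examples, theta=0.2, eta=0.1, max_epochs=100):
    random.seed(0)
    w = [random.uniform(-0.5, 0.5) for _ in range(2)]   # Step 1: initialization
    for _ in range(max_epochs):
        converged = True
        for (x1, x2), y_d in examples:                  # Step 2: activation
            y = step(x1 * w[0] + x2 * w[1] - theta)
            e = y_d - y                                 # e(p) = Yd(p) - Y(p)
            if e != 0:
                converged = False
                w[0] += eta * x1 * e                    # Step 3: delta rule
                w[1] += eta * x2 * e
        if converged:                                   # Step 4: repeat until convergence
            break
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND)
print([step(x1 * w[0] + x2 * w[1] - 0.2) for (x1, x2), _ in AND])  # [0, 0, 0, 1]
```

With this seed the weights converge after a few epochs, as on the slide.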
Perceptron for the logical XOR
• It cannot be trained to perform the Exclusive-OR.

• Generally, a perceptron can classify only linearly separable patterns, regardless of the activation function used.
  • Research works: Shynk, 1990; Shynk and Bershad, 1992.
Perceptron: An example
Suppose there is a high-tech exhibition in the city, and you are thinking about whether to go there. Your decision relies on the factors below:
• Is the weather good?
• Does your friend want to accompany you?
• Is the exhibition near public transit? (You don't own a car.)

The inputs are: weather, friend wants to go, and near public transit.
• 𝑤1 = 6, 𝑤2 = 2, 𝑤3 = 2 → the weather matters to you much more than whether your friend joins you or the nearness of public transit.
• 𝜃 = 5 → decisions are made based on the weather only.
• 𝜃 = 3 → you go whenever the weather is good, or when both the exhibition is near public transit and your friend wants to join you.
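The example can be checked with a one-line decision neuron (a sketch; the function name and binary input encoding are my own):

```python
# Decision neuron from the exhibition example: binary inputs
# (weather good, friend joins, near transit), weights (6, 2, 2).
def decide(x, w=(6, 2, 2), theta=3):
    return int(sum(xi * wi for xi, wi in zip(x, w)) >= theta)

print(decide((1, 0, 0)))            # 1: good weather alone is enough
print(decide((0, 1, 1)))            # 1: friend + transit together suffice
print(decide((0, 1, 0)))            # 0: friend alone does not
print(decide((0, 1, 1), theta=5))   # 0: with theta = 5, only the weather matters
```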
Quiz 03: Perceptron
• Consider the following neural network which receives binary input
values, 𝑥1 and 𝑥2 , and produces a single binary value.

• For every combination (𝑥1 , 𝑥2 ), what are the output values at neurons, 𝐴,
𝐵 and 𝐶?
Multi-layer perceptron
Multi-layer perceptron (MLP)

(Figure: a fully connected network with input signals on the left, a first and a second hidden layer, and output signals on the right.)

• A multi-layer perceptron is a fully connected feedforward network with at least three layers.
• Idea: map certain input to a specified target value using a cascade of nonlinear transformations.
Learning algorithm: Back-propagation
• The input signals are propagated forward on a layer-by-layer basis.
• The error signals are propagated backward from the output layer to the input layer.
Back-propagation algorithm
• Consider an MLP with one hidden layer.
• Note the following notations
• 𝑎𝑖 : the output value of node 𝑖 in the input layer
• 𝑧𝑗 : the input value to node 𝑗 in the layer ℎ
• 𝑔𝑗 : the activation function for node 𝑗 in the layer ℎ (applied to 𝑧𝑗 )
• 𝑎𝑗 = 𝑔𝑗 𝑧𝑗 : the output value of node 𝑗 in the layer ℎ
• 𝑏𝑗 : the bias/offset for unit 𝑗 in the layer ℎ
• 𝑤𝑖𝑗 : weights connecting node 𝑖 in layer (ℎ − 1) to node 𝑗 in layer ℎ
• 𝑡𝑘 : target value for node 𝑘 in the output layer

Back-propagation algorithm
(Figure: input layer, hidden layer, and output layer. Bias units are not shown.)
BP algorithm: The error function
• Training a neural network entails finding parameters 𝜃 = (𝐖, 𝐛) that minimize the errors.
• The error function is usually the sum of the squared errors between the target values 𝑡𝑘 and the network outputs 𝑎𝑘:

$$E = \frac{1}{2} \sum_{k=1}^{l} (a_k - t_k)^2$$

where 𝑙 is the dimensionality of the target for a single observation.

• This parameter optimization problem can be solved using gradient descent, computing $\frac{\partial E}{\partial \theta}$ for all 𝜃.
BP algorithm: Output layer params
• Calculating the gradient of the error function with respect to the output-layer parameters is straightforward with the chain rule:

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}}\,\frac{1}{2}\sum_k (a_k - t_k)^2 = (a_k - t_k)\,\frac{\partial}{\partial w_{jk}}(a_k - t_k)$$

• Then,

$$\frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\,\frac{\partial a_k}{\partial w_{jk}} \quad\text{since } \tfrac{\partial t_k}{\partial w_{jk}} = 0$$
$$= (a_k - t_k)\,\frac{\partial}{\partial w_{jk}}\, g_k(z_k) \quad\text{since } a_k = g_k(z_k)$$
$$= (a_k - t_k)\, g_k'(z_k)\,\frac{\partial z_k}{\partial w_{jk}}$$
BP algorithm: Output layer params
• Recall that $z_k = b_k + \sum_j g_j(z_j)\, w_{jk}$, and hence $\frac{\partial z_k}{\partial w_{jk}} = g_j(z_j) = a_j$.

• Then,

$$\frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\, g_k'(z_k)\, a_j$$

where $(a_k - t_k)$ is the difference between the network output $a_k$ and the target value $t_k$, $g_k'(z_k)$ is the derivative of the activation function at $z_k$, and $a_j$ is the output of node $j$ from the hidden layer feeding into the output layer.

• The common activation function is the sigmoid function

$$g(z) = \frac{1}{1 + e^{-z}}$$

whose derivative is

$$g'(z) = g(z)\bigl(1 - g(z)\bigr)$$
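The derivative identity $g'(z) = g(z)(1 - g(z))$ can be verified against a central-difference approximation; the test point z = 0.7 below is an arbitrary choice:

```python
import math

def g(z):
    # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

z = 0.7                                        # arbitrary test point
analytic = g(z) * (1 - g(z))                   # g'(z) = g(z)(1 - g(z))
numeric = (g(z + 1e-6) - g(z - 1e-6)) / 2e-6   # central difference
print(abs(analytic - numeric) < 1e-8)          # True
```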
BP algorithm: Output layer params
• Let $\delta_k = (a_k - t_k)\, g_k'(z_k)$ be the error signal after being back-propagated through the output activation function $g_k$.
• The delta form of the error-function gradient for the output-layer weights is

$$\frac{\partial E}{\partial w_{jk}} = \delta_k\, a_j$$

• The gradient-descent update rule for the output-layer weights is ($\eta$ is the learning rate):

$$w_{jk} \leftarrow w_{jk} - \eta\,\frac{\partial E}{\partial w_{jk}} = w_{jk} - \eta\,\delta_k\, a_j = w_{jk} - \eta\,(a_k - t_k)\, g_k(z_k)\bigl(1 - g_k(z_k)\bigr)\, a_j$$

• Apply similar update rules for the remaining output-layer weights $w_{jk}$.
BP algorithm: Output layer biases
• The gradient for the biases is simply the back-propagated error signal $\delta_k$:

$$\frac{\partial E}{\partial b_k} = (a_k - t_k)\, g_k'(z_k)\,(1) = \delta_k$$

• Each bias is updated as $b_k \leftarrow b_k - \eta\,\delta_k$.

• Note that $\frac{\partial z_k}{\partial b_k} = \frac{\partial}{\partial b_k}\Bigl(b_k + \sum_j g_j(z_j)\, w_{jk}\Bigr) = 1$.
  • The biases are weights on activations that are always equal to one, regardless of the feed-forward signal.
  • Thus, the bias gradients aren't affected by the feed-forward signal, only by the error.
BP algorithm: Hidden layer params
• The process starts just the same as for the output layer:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\,\frac{1}{2}\sum_k (a_k - t_k)^2 = \sum_k (a_k - t_k)\,\frac{\partial a_k}{\partial w_{ij}}$$

• Applying the chain rule again, we obtain:

$$\frac{\partial E}{\partial w_{ij}} = \sum_k (a_k - t_k)\,\frac{\partial}{\partial w_{ij}}\, g_k(z_k) \quad\text{since } a_k = g_k(z_k)$$
$$= \sum_k (a_k - t_k)\, g_k'(z_k)\,\frac{\partial z_k}{\partial w_{ij}}$$
BP algorithm: Hidden layer params
• The term $z_k$ can be expanded as follows:

$$z_k = b_k + \sum_j a_j w_{jk} = b_k + \sum_j g_j(z_j)\, w_{jk} \quad\text{since } a_j = g_j(z_j)$$
$$= b_k + \sum_j g_j\Bigl(b_j + \sum_i a_i w_{ij}\Bigr) w_{jk} \quad\text{since } z_j = b_j + \sum_i a_i w_{ij}$$

• Again, use the chain rule to calculate $\frac{\partial z_k}{\partial w_{ij}}$:

$$\frac{\partial z_k}{\partial w_{ij}} = \frac{\partial z_k}{\partial a_j}\,\frac{\partial a_j}{\partial w_{ij}} = w_{jk}\,\frac{\partial g_j(z_j)}{\partial w_{ij}} = w_{jk}\, g_j'(z_j)\,\frac{\partial z_j}{\partial w_{ij}} = w_{jk}\, g_j'(z_j)\,\frac{\partial}{\partial w_{ij}}\Bigl(b_j + \sum_i a_i w_{ij}\Bigr) = w_{jk}\, g_j'(z_j)\, a_i$$
BP algorithm: Hidden layer params
• Thus,

$$\frac{\partial E}{\partial w_{ij}} = \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\; g_j'(z_j)\, a_i = \sum_k \delta_k w_{jk}\; g_j'(z_j)\, a_i$$

where $\sum_k \delta_k w_{jk}$ collects the error terms, $g_j'(z_j)$ is the derivative of the activation function at $z_j$, and $a_i$ is the output activation signal from the layer below.

• Let $\delta_j = g_j'(z_j) \sum_k \delta_k w_{jk}$ denote the resulting error signal back-propagated to layer $j$.
• The error-function gradient for the hidden-layer weights is

$$\frac{\partial E}{\partial w_{ij}} = \delta_j\, a_i$$

To calculate the weight gradients at any layer $l$, we calculate the back-propagated error signal $\delta_l$ that reaches that layer from the layers after it, and weight it by the feed-forward signal at layer $l-1$ feeding into that layer.
BP algorithm: Hidden layer params
• The gradient-descent update rule for the hidden-layer weights is

$$w_{ij} \leftarrow w_{ij} - \eta\,\frac{\partial E}{\partial w_{ij}} = w_{ij} - \eta\,\delta_j\, a_i = w_{ij} - \eta\, g_j'(z_j)\Bigl(\sum_k \delta_k w_{jk}\Bigr) a_i$$
$$= w_{ij} - \eta \sum_k (a_k - t_k)\, g_k(z_k)\bigl(1 - g_k(z_k)\bigr)\, w_{jk}\; g_j(z_j)\bigl(1 - g_j(z_j)\bigr)\, a_i$$

• Apply similar update rules for the remaining hidden-layer weights $w_{ij}$.
BP algorithm: Hidden layer biases
• Calculating the error gradients with respect to the hidden-layer biases $b_j$ follows a very similar procedure to that for the hidden-layer weights:

$$\frac{\partial E}{\partial b_j} = \sum_k (a_k - t_k)\,\frac{\partial}{\partial b_j}\, g_k(z_k) = \sum_k (a_k - t_k)\, g_k'(z_k)\,\frac{\partial z_k}{\partial b_j}$$

• Apply the chain rule to solve $\frac{\partial z_k}{\partial b_j} = w_{jk}\, g_j'(z_j)\,(1)$.

• The gradient for the biases is the back-propagated error signal $\delta_j$:

$$\frac{\partial E}{\partial b_j} = \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j) = g_j'(z_j)\sum_k \delta_k w_{jk} = \delta_j$$

• Each bias is updated as $b_j \leftarrow b_j - \eta\,\delta_j$.
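A useful sanity check on the whole derivation is to compare the analytic gradient $\delta_j a_i$ with a numerical finite-difference gradient on a tiny network. All the weights, inputs, and targets below are made-up illustrative numbers, and the 2-2-2 architecture is an assumption:

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid

x  = [0.5, -0.3]                        # inputs a_i
W  = [[0.1, -0.2], [0.4, 0.3]]          # w_ij: input i -> hidden j
bh = [0.05, -0.05]                      # hidden biases b_j
V  = [[0.2, 0.7], [-0.6, 0.1]]          # w_jk: hidden j -> output k
bo = [0.0, 0.1]                         # output biases b_k
t  = [1.0, 0.0]                         # targets t_k

def error(W):
    # forward pass and squared error E = 1/2 sum_k (a_k - t_k)^2
    a = [g(bh[j] + sum(x[i] * W[i][j] for i in range(2))) for j in range(2)]
    y = [g(bo[k] + sum(a[j] * V[j][k] for j in range(2))) for k in range(2)]
    return 0.5 * sum((y[k] - t[k]) ** 2 for k in range(2))

# analytic gradient dE/dw_{i=0,j=1} via delta_j = g_j'(z_j) sum_k delta_k w_jk
zh = [bh[j] + sum(x[i] * W[i][j] for i in range(2)) for j in range(2)]
a  = [g(z) for z in zh]
zo = [bo[k] + sum(a[j] * V[j][k] for j in range(2)) for k in range(2)]
y  = [g(z) for z in zo]
dk = [(y[k] - t[k]) * y[k] * (1 - y[k]) for k in range(2)]
dj = [a[j] * (1 - a[j]) * sum(dk[k] * V[j][k] for k in range(2)) for j in range(2)]
analytic = dj[1] * x[0]                 # dE/dw_ij = delta_j * a_i

# numerical gradient by central difference on the same weight
eps = 1e-6
Wp = [row[:] for row in W]; Wp[0][1] += eps
Wm = [row[:] for row in W]; Wm[0][1] -= eps
numerical = (error(Wp) - error(Wm)) / (2 * eps)

print(abs(analytic - numerical) < 1e-8)  # True
```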
Back-propagation network for XOR
• The logical XOR problem took
224 epochs or 896 iterations
for network training.

Sum of the squared errors (SSE)
• When the SSE over an entire pass through all training examples is sufficiently small, the network is deemed to have converged.

Learning curve for the logical operation XOR.
Sigmoid neuron vs. Perceptron
• Sigmoid neuron better reflects the fact that small changes in
weights and bias cause only a small change in output.

A sigmoidal function is a smoothed-out version of a step function.
About back-propagation learning
• Do randomly initialized weights and thresholds lead to different solutions?
  • Starting from different initial conditions yields different weights and threshold values; the problem is still solved, but within different numbers of iterations.
• Back-propagation learning cannot be viewed as an emulation of brain-like learning.
  • Biological neurons do not work backward to adjust the strengths of their interconnections (synapses).
• Training is slow due to extensive calculations.
  • Improvements: Caudill, 1991; Jacobs, 1988; Stubbs, 1990.
Gradient descent method
• Consider two parameters, 𝑤1 and 𝑤2, in a network, and the error surface of the function 𝑓 over them (the colors represent the value of 𝑓).
• Randomly pick a starting point 𝜃⁰.
• Compute the negative gradient at 𝜃⁰: $-\nabla f(\theta^0)$, where $\nabla f(\theta^0) = \bigl(\partial f(\theta^0)/\partial w_1,\; \partial f(\theta^0)/\partial w_2\bigr)^T$.
• Scale it by the learning rate 𝜂: $-\eta\,\nabla f(\theta^0)$, and move to the next point.
Gradient descent method
• Repeating the update, $\theta^1 = \theta^0 - \eta\,\nabla f(\theta^0)$, $\theta^2 = \theta^1 - \eta\,\nabla f(\theta^1)$, …, we would eventually reach a minimum $\theta^*$.
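The update loop above can be sketched on a toy error surface; the quadratic f(w1, w2) = (w1 − 3)² + (w2 + 1)², learning rate, and step count are illustrative assumptions:

```python
def grad_f(w):
    # gradient of f(w1, w2) = (w1 - 3)^2 + (w2 + 1)^2, minimum at (3, -1)
    return (2 * (w[0] - 3), 2 * (w[1] + 1))

def gradient_descent(theta0, eta=0.1, steps=200):
    theta = theta0
    for _ in range(steps):
        gr = grad_f(theta)
        # theta <- theta - eta * grad f(theta)
        theta = (theta[0] - eta * gr[0], theta[1] - eta * gr[1])
    return theta

w = gradient_descent((0.0, 0.0))
print(round(w[0], 4), round(w[1], 4))  # 3.0 -1.0
```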
Gradient descent method
• Gradient descent never guarantees reaching the global minimum.
• Different initial points 𝜃⁰ reach different minima, and thus different results.
Gradient descent method
• It also has issues at plateaus and saddle points.
  • Very slow at a plateau ($\nabla f(\theta) \approx 0$)
  • Stuck at a saddle point ($\nabla f(\theta) = 0$)
  • Stuck at a local minimum ($\nabla f(\theta) = 0$)
Accelerated learning in ANNs
• Use tanh instead of sigmoid: represent the sigmoidal function by a hyperbolic tangent

$$Y^{\tanh} = \frac{2a}{1 + e^{-bX}} - a$$

where 𝑎 = 1.716 and 𝑏 = 0.667 (Guyon, 1991).
Accelerated learning in ANNs
• Generalized delta rule: a momentum term is included in the delta rule (Rumelhart et al., 1986):

$$\Delta w_{jk}(p) = \beta \times \Delta w_{jk}(p-1) + \eta \times y_j(p) \times \delta_k(p)$$

where 𝛽 = 0.95 is the momentum constant (0 ≤ 𝛽 ≤ 1).

How about putting the momentum of the physical world into gradient descent?
Accelerated learning in ANNs
• Momentum still does not guarantee reaching the global minimum, but it gives some hope.
• Movement = negative of gradient + momentum: even at a point where the gradient = 0, the accumulated momentum keeps the parameters moving.
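A hedged sketch of momentum-augmented gradient descent on a toy one-parameter cost; the cost function, β, η, and step count are my own choices, not from the lecture:

```python
def grad(w):
    # gradient of the toy cost f(w) = (w - 3)^2, minimum at w = 3
    return 2 * (w - 3)

def momentum_descent(w0=0.0, eta=0.05, beta=0.9, steps=300):
    w, delta = w0, 0.0
    for _ in range(steps):
        # movement = momentum + negative of gradient
        delta = beta * delta - eta * grad(w)
        w += delta
    return w

print(round(momentum_descent(), 4))  # 3.0
```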
Accelerated learning in ANNs
• Adaptive learning rate: adjust the learning rate parameter 𝜂 during training.
  • Small 𝜂 → small weight changes through iterations → smooth learning curve.
  • Large 𝜂 → speeds up the training process with larger weight changes → possible instability and oscillation.
• Heuristic-like approaches for adjusting 𝜂:
  1. If the algebraic sign of the SSE change remains the same for several consecutive epochs → increase 𝜂.
  2. If the algebraic sign of the SSE change alternates for several consecutive epochs → decrease 𝜂.
• One of the most effective means of acceleration.
Learning with momentum only
Learning with momentum for the logical operation XOR.
Learning with adaptive 𝜂 only

Learning with adaptive learning rate for the logical operation XOR.
Learning with adaptive 𝜂 and momentum
Quiz 04: Multi-layer perceptron
• Consider the below feedforward network with one hidden layer of units.

• If the network is tested with an input vector 𝑥 = 1.0, 2.0, 3.0 then what
are the activation 𝐻1 of the first hidden neuron and the activation 𝐼3 of the
third output neuron?
Quiz 04: Forward the input signals
• The input vector to the network is 𝑥 = (𝑥1, 𝑥2, 𝑥3)ᵀ.
• The vector of hidden-layer outputs is 𝑦 = (𝑦1, 𝑦2)ᵀ.
• The vector of actual outputs is 𝑧 = (𝑧1, 𝑧2, 𝑧3)ᵀ.
• The vector of desired outputs is 𝑡 = (𝑡1, 𝑡2, 𝑡3)ᵀ.
• The network has the following weight vectors:
  𝑣1 = (−2.0, 2.0, −2.0)ᵀ, 𝑣2 = (1.0, 1.0, −1.0)ᵀ
  𝑤1 = (1.0, −3.5)ᵀ, 𝑤2 = (0.5, −1.2)ᵀ, 𝑤3 = (0.3, 0.6)ᵀ
• Assume that all units use the sigmoid activation function and zero biases.
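Under the assumption that the weight vectors are grouped as reconstructed above (the 𝑣's map input → hidden, the 𝑤's map hidden → output), the quiz's forward pass can be sketched as:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 2.0, 3.0]                             # test input from the quiz
v = [[-2.0, 2.0, -2.0], [1.0, 1.0, -1.0]]       # input -> hidden (assumed layout)
w = [[1.0, -3.5], [0.5, -1.2], [0.3, 0.6]]      # hidden -> output (assumed layout)

# forward pass with sigmoid activations and zero biases
y = [sigmoid(sum(vi * xi for vi, xi in zip(vj, x))) for vj in v]
z = [sigmoid(sum(wi * yi for wi, yi in zip(wk, y))) for wk in w]

print(round(y[0], 4))   # H1 = sigmoid(-4) ~ 0.018
print(round(z[2], 4))   # I3 ~ 0.5758
```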
Quiz 05: Backpropagation error signals
• The figure shows part of the
network described in the
previous slide.
• Use the same weights,
activation functions and bias
values as described.

• A new input pattern is presented to the network and training proceeds as follows. The actual outputs are given by 𝑧 = (0.15, 0.36, 0.57)ᵀ and the corresponding target outputs are given by 𝑡 = (1.0, 1.0, 1.0)ᵀ.
• The weights 𝑤12, 𝑤22 and 𝑤32 are also shown.
• What is the error for each of the output units?
Acknowledgements
• Some parts of the slide are adapted from
• Derivation: Error Backpropagation & Gradient Descent for Neural
Networks (github.io link)
• Negnevitsky, Michael. Artificial intelligence: A guide to intelligent
systems. Pearson, 2005. Chapter 6.
