2024-Lecture11-MLAlgorithms
ALGORITHMS
Supervised learning: Training
• Consider a labeled training set of N examples:
  (x_1, y_1), (x_2, y_2), …, (x_N, y_N)
• where each y_j was generated by an unknown function y = f(x).
• The output y_j is called the ground truth, i.e., the true answer
  that the model must predict.
Supervised learning: Hypothesis space
• h is drawn from a hypothesis space H of possible functions.
  • E.g., H might be the set of polynomials of degree 3, or the set of 3-SAT Boolean logic formulas.
• Choose H using prior knowledge about the process that generated the data, or by exploratory data analysis (EDA).
  • EDA examines the data with statistical tests and visualizations to gain insight into which hypothesis space might be appropriate.
• Or simply try multiple hypothesis spaces and evaluate which one works best.
Supervised learning: Hypothesis
• The hypothesis h is consistent if it agrees with the true function f on all training observations, i.e., ∀i h(x_i) = y_i.
  • For continuous data, we instead look for a best-fit function for which each h(x_i) is close to y_i.
• Ockham's razor: select the simplest consistent hypothesis.
Supervised learning: Hypothesis
Finding hypotheses to fit data. Top row: four plots of best-fit functions from
four different hypothesis spaces trained on data set 1. Bottom row: the same
four functions, but trained on a slightly different data set (sampled from the
same 𝑓(𝑥) function).
Supervised learning: Testing
• The quality of the hypothesis h depends on how accurately it predicts the observations in the test set → generalization.
• The test set must use the same distribution over the example space as the training set.
ID3
Decision Tree
Example problem: Restaurant waiting
Learning decision trees
• Divide and conquer: split the data into smaller and smaller subsets.
• Splits are usually on a single variable.
  (Figure: a tree that first tests x1 > a, then tests x2 > b on one branch and x2 > g on the other.)
Learning decision trees
Splitting the examples by testing on attributes. At each node we show the positive
(light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type
brings us no nearer to distinguishing between positive and negative examples. (b)
Splitting on Patrons does a good job of separating positive and negative examples.
After splitting on Patrons, Hungry is a fairly good second test.
ID3 Decision tree: Pseudo-code
The decision tree learning algorithm. The function PLURALITY-VALUE selects the
most common output value among a set of examples, breaking ties randomly.
ID3 Decision tree: Pseudo-code
function LEARN-DECISION-TREE(examples, attributes, parent_examples) returns a tree
  if examples is empty then return PLURALITY-VALUE(parent_examples)
  else if all examples have the same classification then return the classification
  else if attributes is empty then return PLURALITY-VALUE(examples)
  else                          // there are still attributes to split the examples
    A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
    tree ← a new decision tree with root test A
    for each value v of A do
      exs ← { e : e ∈ examples and e.A = v }
      subtree ← LEARN-DECISION-TREE(exs, attributes − A, examples)
      add a branch to tree with label (A = v) and subtree subtree
    return tree

The decision tree learning algorithm. The function IMPORTANCE evaluates the usefulness of attributes.
ID3 Decision tree algorithm
1. There are some positive and some negative examples → choose the best attribute to split them.
2. The remaining examples are all positive (or all negative) → DONE; it is possible to answer Yes or No.
3. No examples are left at a branch → return a default value.
   • No example has been observed for this combination of attribute values.
   • The default value is calculated from the plurality classification of all the examples that were used in constructing the node's parent.
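The three cases above can be turned into a compact recursive implementation. A minimal Python sketch, assuming each example is a dict with a "label" key and the attributes are passed as a set; all names here are illustrative, not from the slides:

```python
from collections import Counter
import math

def plurality_value(examples):
    """Most common label among the examples (PLURALITY-VALUE in the pseudo-code)."""
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def set_entropy(examples):
    """Entropy (bits) of the label distribution of a set of examples."""
    counts = Counter(e["label"] for e in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def importance(attr, examples):
    """Information gain of splitting the examples on attr."""
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += len(subset) / len(examples) * set_entropy(subset)
    return set_entropy(examples) - remainder

def learn_decision_tree(examples, attributes, parent_examples):
    if not examples:                        # case 3: no examples left at this branch
        return plurality_value(parent_examples)
    labels = {e["label"] for e in examples}
    if len(labels) == 1:                    # case 2: all positive or all negative
        return labels.pop()
    if not attributes:                      # no attributes left: return the plurality
        return plurality_value(examples)
    best = max(attributes, key=lambda a: importance(a, examples))  # case 1
    return {best: {v: learn_decision_tree([e for e in examples if e[best] == v],
                                          attributes - {best}, examples)
                   for v in {e[best] for e in examples}}}

# A tiny toy data set: the label depends only on "windy".
examples = [
    {"windy": "no",  "temp": "hot",  "label": "yes"},
    {"windy": "no",  "temp": "cold", "label": "yes"},
    {"windy": "yes", "temp": "hot",  "label": "no"},
    {"windy": "yes", "temp": "cold", "label": "no"},
]
tree = learn_decision_tree(examples, {"windy", "temp"}, examples)
```

On this data the learner splits once on windy and stops, since both branches are then pure.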
Decision tree: Inductive learning
• Simplest approach: construct a decision tree with one leaf for every example
  → memory-based learning
  → worse generalization
A purity measure with entropy
• The entropy of a random variable V with values v_k, each having probability P(v_k), measures the uncertainty of V and is defined as
  H(V) = Σ_k P(v_k) log2( 1 / P(v_k) ) = − Σ_k P(v_k) log2 P(v_k)
• It is a fundamental quantity in information theory.
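The definition can be sanity-checked in a few lines of Python (the function name `entropy` is ours):

```python
import math

def entropy(probs):
    """Shannon entropy H(V) = -sum_k P(v_k) * log2(P(v_k)), in bits.
    Terms with P(v_k) = 0 contribute nothing, by convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                 # a fair coin: 1.0 bit, the maximum for 2 values
print(entropy([0.25, 0.25, 0.25, 0.25]))   # a fair 4-sided die: 2.0 bits
```

A certain outcome, entropy([1.0]), gives 0 bits: no uncertainty, no information.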
A purity measure with entropy
• Entropy is maximal when all possibilities are equally likely.
  (Figure: a split whose branches each still hold 3 Y and 3 N examples, i.e., maximal entropy on both sides.)
Example problem: Restaurant waiting
Candidate splits and the examples remaining on each branch (Y = positive, N = negative; 6 Y and 6 N at the root):
• Sat/Fri? Yes: 2 Y, 3 N; No: 4 Y, 3 N
• Hungry? Yes: 5 Y, 2 N; No: 1 Y, 4 N
• Raining? Yes: 3 Y, 2 N; No: 3 Y, 4 N
• Reservation? Yes: 3 Y, 2 N; No: 3 Y, 4 N
• Type? French: 1 Y, 1 N; Italian: 1 Y, 1 N; Thai: 2 Y, 2 N; Burger: 2 Y, 2 N
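Using the branch counts above, the information gain of each candidate split (entropy of the parent minus the expected entropy of its children) can be computed directly; the helper names are ours. Splitting on Type yields zero gain, which is why it brings us no nearer to separating positive from negative examples:

```python
import math

def entropy2(p, n):
    """Entropy (bits) of a node holding p positive and n negative examples."""
    h, total = 0.0, p + n
    for c in (p, n):
        if c:
            h -= c / total * math.log2(c / total)
    return h

def information_gain(parent, children):
    """H(parent) minus the expected entropy of the child nodes; nodes are (pos, neg)."""
    total = sum(p + n for p, n in children)
    remainder = sum((p + n) / total * entropy2(p, n) for p, n in children)
    return entropy2(*parent) - remainder

# Branch counts from the slides; the root holds 6 positive and 6 negative examples.
splits = {
    "Sat/Fri?":     [(2, 3), (4, 3)],
    "Hungry?":      [(5, 2), (1, 4)],
    "Raining?":     [(3, 2), (3, 4)],
    "Reservation?": [(3, 2), (3, 4)],
    "Type?":        [(1, 1), (1, 1), (2, 2), (2, 2)],
}
for name, children in splits.items():
    print(f"{name:13s} gain = {information_gain((6, 6), children):.4f}")
```

Hungry comes out clearly ahead of Raining, Reservation, Sat/Fri and Type for this root.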
Quiz 01: ID3 decision tree
• The data represent files on a computer system. Possible values of the class variable are "infected", which means the file has a virus infection, or "clean" if it does not.
• Derive a decision tree for virus identification.
Artificial neural network
What is a neural network?
• The biological neural network (NN) is a reasoning model based on the human brain.
  • An adult human brain has approximately 86 billion neurons and, by varying estimates, 100 to 500 trillion synaptic connections.
• It is a highly complex, nonlinear and parallel information-processing system.
• Learning through experience is an essential characteristic.
  • Plasticity: connections leading to the "right answer" are enhanced, while those leading to the "wrong answer" are weakened.
Biological neural network
• There have been attempts to emulate the biological neural network in computers, resulting in artificial neural networks (ANNs).
ANN: Network architecture
• An ANN has many neurons, arranged in a hierarchy of layers.
• Each neuron is an elementary information-processing unit.
ANN: Neurons and Signals
• Each neuron receives several input signals through its connections and produces at most a single output signal.
  (Figure: a biological neuron; the weight of a connection expresses the strength of the input.)
Perceptron
Perceptron (Frank Rosenblatt, 1958)
• A perceptron has a single neuron with adjustable synaptic weights and a hard limiter.
  y = step( Σ_{i=1}^{n} x_i w_i − θ )
Perceptron learning rule
• Step 1 – Initialization: Initial weights w_1, w_2, …, w_n and threshold θ are randomly assigned small values (usually in [−0.5, 0.5], but not restricted to that range).
• Step 2 – Activation: At iteration p, apply the p-th example, which has inputs x_1(p), x_2(p), …, x_n(p) and desired output Y_d(p), and calculate the actual output
  Y(p) = step[ Σ_{i=1}^{n} x_i(p) w_i(p) − θ ],   step(x) = 1 if x ≥ 0, 0 if x < 0
  where n is the number of perceptron inputs and step is the activation function.
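Steps 1 and 2 above, together with the standard perceptron weight-update step w_i ← w_i + η x_i (Y_d − Y) and θ ← θ − η (Y_d − Y) (the textbook update rule, not shown on this slide), can be sketched as follows; all names are our own:

```python
import random

def step(x):
    """Hard limiter: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def train_perceptron(data, eta=0.1, epochs=100, seed=0):
    """data: list of (inputs, desired_output) pairs. Returns (weights, theta)."""
    rng = random.Random(seed)
    n = len(data[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]   # Step 1: small random weights
    theta = rng.uniform(-0.5, 0.5)                    # and threshold
    for _ in range(epochs):
        for x, yd in data:
            y = step(sum(xi * wi for xi, wi in zip(x, w)) - theta)  # Step 2
            e = yd - y                                # error
            w = [wi + eta * xi * e for wi, xi in zip(w, x)]   # weight update
            theta -= eta * e          # threshold behaves as a weight on input -1
    return w, theta

# Learn the logical AND operation (linearly separable, so training converges).
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, theta = train_perceptron(AND)
```

After training, step(Σ x_i w_i − θ) reproduces the AND truth table for all four inputs.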
Perceptron for the logical AND/OR
• A single-layer perceptron can learn the AND/OR operations.
  • Example inputs: x_1 = the weather is good, x_2 = a friend wants to go.
• For every combination (x_1, x_2), what are the output values at the neurons A, B and C?
Multi-layer perceptron
Multi-layer perceptron (MLP)
(Figure: input signals enter on the left, pass through the first and second hidden layers, and emerge as output signals.)
• It is a fully connected feedforward network with at least three layers.
• Idea: map certain inputs to specified target values using a cascade of nonlinear transformations.
Learning algorithm: Back-propagation
The input signals are propagated forward on a layer-by-layer basis.

Back-propagation algorithm
(Figure: a network with an input layer, a hidden layer and an output layer.)
BP algorithm: Output layer params
• Calculating the gradient of the error function with respect to those parameters is straightforward with the chain rule.
  ∂E/∂w_jk = ∂/∂w_jk [ (1/2) Σ_k (a_k − t_k)^2 ]
           = (a_k − t_k) ∂/∂w_jk (a_k − t_k)
• Then,
  ∂E/∂w_jk = (a_k − t_k) ∂a_k/∂w_jk                 since ∂t_k/∂w_jk = 0
           = (a_k − t_k) ∂/∂w_jk g_k(z_k)           since a_k = g_k(z_k)
           = (a_k − t_k) g′_k(z_k) ∂z_k/∂w_jk
BP algorithm: Output layer params
• Recall that z_k = b_k + Σ_j g_j(z_j) w_jk, and hence ∂z_k/∂w_jk = g_j(z_j) = a_j.
• Then,
  ∂E/∂w_jk = (a_k − t_k) g′_k(z_k) a_j
  where (a_k − t_k) is the difference between the network output a_k and the target value t_k; g′_k(z_k) is the derivative of the activation function at z_k; and a_j is the output of node j from the hidden layer feeding into the output layer.
BP algorithm: Output layer params
• Let δ_k = (a_k − t_k) g′_k(z_k) be the error signal after being back-propagated through the output activation function g_k.
• The delta form of the error function gradient for the output layer weights is
  ∂E/∂w_jk = δ_k a_j
• The gradient descent update rule for the output layer weights is
  w_jk ← w_jk − η ∂E/∂w_jk          (η is the learning rate)
       ← w_jk − η δ_k a_j
       ← w_jk − η (a_k − t_k) g_k(z_k) (1 − g_k(z_k)) a_j    (for a sigmoid, g′_k = g_k(1 − g_k))
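The update rule uses g′(z) = g(z)(1 − g(z)) for a sigmoid unit, so one update step can be traced numerically; every value below is made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

a_j, w_jk, b_k = 0.8, 0.5, 0.1       # hidden output, weight, bias (made-up numbers)
t_k, eta = 1.0, 0.5                  # target and learning rate

z_k = w_jk * a_j + b_k               # pre-activation of the output unit
a_k = sigmoid(z_k)                   # network output
delta_k = (a_k - t_k) * a_k * (1 - a_k)   # error signal, with g'(z) = g(z)(1 - g(z))
w_jk_new = w_jk - eta * delta_k * a_j     # gradient descent step for the weight
b_k_new = b_k - eta * delta_k             # the bias gradient is delta_k itself
```

Since the output a_k is below the target t_k here, delta_k is negative and both the weight and the bias increase, pushing the output toward the target.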
BP algorithm: Output layer biases
• The gradient for the biases is simply the back-propagated error signal δ_k, since ∂z_k/∂b_k = 1.
  ∂E/∂b_k = (a_k − t_k) g′_k(z_k) (1) = δ_k
BP algorithm: Hidden layer params
• The process starts just the same as for the output layer.
  ∂E/∂w_ij = ∂/∂w_ij [ (1/2) Σ_k (a_k − t_k)^2 ]
           = Σ_k (a_k − t_k) ∂a_k/∂w_ij
• Note that the sum over k remains: a hidden layer weight w_ij affects every output unit.
BP algorithm: Hidden layer params
• Again, use the chain rule to calculate ∂z_k/∂w_ij, expanding the term z_k as z_k = b_k + Σ_j g_j(z_j) w_jk.
BP algorithm: Hidden layer biases
• Calculating the error gradients with respect to the hidden layer biases b_j follows a very similar procedure to that for the hidden layer weights.
  ∂E/∂b_j = Σ_k (a_k − t_k) ∂/∂b_j g_k(z_k) = Σ_k (a_k − t_k) g′_k(z_k) ∂z_k/∂b_j
• Apply the chain rule to solve ∂z_k/∂b_j = w_jk g′_j(z_j) (1). Then
  ∂E/∂b_j = Σ_k (a_k − t_k) g′_k(z_k) w_jk g′_j(z_j) = g′_j(z_j) Σ_k δ_k w_jk = δ_j
Back-propagation network for XOR
• Training the network on the logical XOR problem took 224 epochs, i.e., 896 weight-update iterations (four training examples per epoch).
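A training run like the one above can be sketched as a short script: a 2-2-1 sigmoid network trained by back-propagation (here with a momentum term, covered later in this lecture). All hyperparameters and names below are our choices, and a few random restarts are kept because different starting points can land in different minima:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_xor(seed, eta=0.1, alpha=0.9, epochs=5000):
    """Train a 2-2-1 sigmoid network on XOR with back-propagation plus momentum.
    Returns (sse, predict): final sum of squared errors and a prediction function."""
    rng = random.Random(seed)
    w = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # input i -> hidden j
    bh = [rng.uniform(-1, 1) for _ in range(2)]                     # hidden biases
    v = [rng.uniform(-1, 1) for _ in range(2)]                      # hidden j -> output
    bo = rng.uniform(-1, 1)                                         # output bias
    mw = [[0.0, 0.0], [0.0, 0.0]]                                   # momentum buffers
    mbh, mv, mbo = [0.0, 0.0], [0.0, 0.0], 0.0
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    def forward(x):
        h = [sigmoid(x[0] * w[0][j] + x[1] * w[1][j] + bh[j]) for j in range(2)]
        return h, sigmoid(h[0] * v[0] + h[1] * v[1] + bo)

    for _ in range(epochs):              # one epoch = one pass over the 4 examples
        for x, t in data:
            h, y = forward(x)
            d_o = (y - t) * y * (1 - y)                                # output delta
            d_h = [d_o * v[j] * h[j] * (1 - h[j]) for j in range(2)]   # hidden deltas
            for j in range(2):
                mv[j] = alpha * mv[j] - eta * d_o * h[j]
                v[j] += mv[j]
                mbh[j] = alpha * mbh[j] - eta * d_h[j]
                bh[j] += mbh[j]
                for i in range(2):
                    mw[i][j] = alpha * mw[i][j] - eta * d_h[j] * x[i]
                    w[i][j] += mw[i][j]
            mbo = alpha * mbo - eta * d_o
            bo += mbo
    sse = sum((forward(x)[1] - t) ** 2 for x, t in data)
    return sse, lambda x: forward(x)[1]

# Different random starts land in different basins; keep the best of a few runs.
sse, predict = min((train_xor(s) for s in range(5)), key=lambda r: r[0])
```

The exact epoch count at convergence depends on the initial weights, which is the point the next slides make about back-propagation learning.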
Sum of the squared errors (SSE)
• When the SSE over an entire pass through all training examples is sufficiently small, the network is deemed to have converged.
Sigmoid neuron vs. Perceptron
• A sigmoid neuron better reflects the fact that small changes in the weights and bias cause only a small change in the output.
About back-propagation learning
• Do randomly initialized weights and thresholds lead to different solutions?
  • Starting from different initial conditions yields different final weights and threshold values, and the problem is solved within a different number of iterations each time.
• Back-propagation learning cannot be viewed as an emulation of brain-like learning.
  • Biological neurons do not work backward to adjust the strengths of their interconnections (synapses).
• Training is slow due to extensive calculations.
  • Improvements: Caudill, 1991; Jacobs, 1988; Stubbs, 1990
Gradient descent method
• Consider two parameters, w1 and w2, in a network. The colors of the error surface represent the value of the function f.
• Randomly pick a starting point θ^0.
• Compute the negative gradient at θ^0: −∇f(θ^0), where
  ∇f(θ^0) = [ ∂f(θ^0)/∂w1, ∂f(θ^0)/∂w2 ]^T
• Multiply it by the learning rate η and move by −η∇f(θ^0) toward the minimum θ*.
Gradient descent method
• Repeating these steps from the random starting point θ^0, we would eventually reach a minimum.
Gradient descent method
• Gradient descent never guarantees reaching the global minimum: a different initial point θ^0 may lead to a different (local) minimum of f.
Gradient descent method
• It also has issues at plateaus and saddle points: learning is very slow on a plateau, where ∇f(θ) ≈ 0, and stalls wherever ∇f(θ) = 0, as at saddle points and local minima.
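The loop sketched above (pick θ^0, step by −η∇f(θ^0), repeat) is a few lines of code; the toy surface and all values below are our own illustration:

```python
def gradient_descent(grad, theta0, eta=0.1, steps=100):
    """Plain gradient descent: repeatedly move against the gradient."""
    theta = list(theta0)
    for _ in range(steps):
        g = grad(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

# Toy error surface f(w1, w2) = (w1 - 3)^2 + (w2 + 1)^2, minimum at theta* = (3, -1).
grad_f = lambda th: [2 * (th[0] - 3), 2 * (th[1] + 1)]
theta_star = gradient_descent(grad_f, [0.0, 0.0])
```

This surface is convex, so any starting point converges to the same θ*; the local-minimum and plateau problems above only appear on non-convex surfaces.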
Accelerated learning in ANNs
• Use tanh instead of sigmoid: represent the sigmoidal function by a hyperbolic tangent
  Y^tanh = 2a / (1 + e^{−bX}) − a,   where a = 1.716 and b = 0.667 (Guyon, 1991)
Accelerated learning in ANNs
• Generalized delta rule: a momentum term is included in the delta rule (Rumelhart et al., 1986)
  Δw_jk(p) = β × Δw_jk(p − 1) + η × y_j(p) × δ_k(p)
  where β is the momentum constant (0 ≤ β ≤ 1, typically β = 0.95)
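The generalized delta rule can be traced in a few lines (the function name and numbers are ours). With a constant gradient term, successive steps grow, which is the acceleration effect:

```python
def momentum_update(w, dw_prev, y_j, delta_k, eta=0.1, beta=0.95):
    """Generalized delta rule: blend the previous weight change with the
    current delta-rule term eta * y_j * delta_k."""
    dw = beta * dw_prev + eta * y_j * delta_k
    return w + dw, dw

# Three updates with the same gradient term: the steps keep growing.
w, dw = 0.0, 0.0
for _ in range(3):
    w, dw = momentum_update(w, dw, y_j=1.0, delta_k=0.5)
print(round(dw, 6), round(w, 6))   # 0.142625 0.290125
```

The first step is 0.05; by the third it has almost tripled, because each step carries 95% of the previous one on top of the fresh gradient term.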
Accelerated learning in ANNs
Momentum still does not guarantee reaching the global minimum, but it gives some hope: where the plain negative gradient is zero, the accumulated momentum can keep the real movement going.
Accelerated learning in ANNs
• Adaptive learning rate: adjust the learning rate parameter η during training.
  • Small η → small weight changes through the iterations → smooth learning curve
  • Large η → speeds up the training process with larger weight changes → possible instability and oscillation
• Heuristic-like approaches for adjusting η:
  1. If the algebraic sign of the SSE change remains the same for several consecutive epochs → increase η.
  2. If the algebraic sign of the SSE change alternates for several consecutive epochs → decrease η.
• This is one of the most effective means of acceleration.
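The two heuristics can be sketched as a small helper; the growth and shrink factors are illustrative choices, not values from the slide:

```python
def adapt_learning_rate(eta, sse_history, grow=1.05, shrink=0.7):
    """Sign-based heuristic: grow eta while the SSE keeps changing in the same
    direction; shrink it when the SSE change starts alternating in sign."""
    if len(sse_history) < 3:
        return eta                       # not enough epochs yet to judge
    d_new = sse_history[-1] - sse_history[-2]
    d_old = sse_history[-2] - sse_history[-3]
    if d_new * d_old > 0:                # same sign for consecutive epochs
        return eta * grow
    if d_new * d_old < 0:                # alternating sign: oscillation
        return eta * shrink
    return eta

print(adapt_learning_rate(0.1, [1.0, 0.9, 0.8]))   # steadily falling SSE -> larger eta
print(adapt_learning_rate(0.1, [0.8, 0.9, 0.8]))   # oscillating SSE -> smaller eta
```

In practice a trainer would call this once per epoch, appending the latest SSE to the history first.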
Learning with momentum only
Learning with momentum for the logical operation XOR.
Learning with adaptive learning rate only
Learning with adaptive learning rate and momentum
Quiz 04: Multi-layer perceptron
• Consider the feedforward network below with one hidden layer of units.
• If the network is tested with the input vector x = (1.0, 2.0, 3.0), what are the activation H1 of the first hidden neuron and the activation I3 of the third output neuron?
Quiz 04: Forward the input signals
• The input vector to the network is x = (x_1, x_2, x_3)^T
• The vector of hidden layer outputs is y = (y_1, y_2)^T
• The vector of actual outputs is z = (z_1, z_2, z_3)^T
Quiz 05: Backpropagation error signals
• The figure shows part of the network described in the previous slide.
• Use the same weights, activation functions and bias values as described.