2024-Lecture11-MLAlgorithms

The document discusses machine learning algorithms, focusing on supervised learning and decision trees, particularly the ID3 algorithm. It explains the training process, hypothesis space, and how decision trees are constructed using attributes to predict outcomes, illustrated with a restaurant waiting example. The document also covers concepts like entropy and information gain in the context of decision tree learning.


MACHINE LEARNING ALGORITHMS

Nguyễn Ngọc Thảo – Nguyễn Hải Minh


{nnthao, nhminh}@fit.hcmus.edu.vn
Outline
• Supervised learning: Related concepts
• ID3 decision trees
• Artificial neural networks

Supervised learning: Training
• Consider a labeled training set of 𝑁 examples.
(𝑥1, 𝑦1), (𝑥2, 𝑦2), … , (𝑥𝑁 , 𝑦𝑁 )
• where each 𝑦𝑗 was generated by an unknown function 𝑦 = 𝑓(𝑥).
• The output 𝒚𝒋 is called ground truth, i.e., the true answer
that the model must predict.

• The training process finds a hypothesis ℎ such that 𝒉 ≈ 𝒇.

Supervised learning: Hypothesis space
• ℎ is drawn from a hypothesis space 𝐻 of possible functions.
• E.g., 𝐻 might be the set of polynomials of degree 3, or the set of 3-SAT Boolean logic formulas.
• Choose 𝐻 using prior knowledge about the process that generated the data, or by exploratory data analysis (EDA).
• EDA examines the data with statistical tests and visualizations to get
some insight into what hypothesis space might be appropriate.
• Or just try multiple hypothesis spaces and evaluate which
one works best.

Supervised learning: Hypothesis
• The hypothesis ℎ is consistent if it agrees with the true function 𝑓 on all training observations, i.e., ∀𝑖: ℎ(𝑥𝑖) = 𝑦𝑖.
• For continuous data, we instead look for a best-fit function for which each ℎ(𝑥𝑖) is close to 𝑦𝑖.
• Ockham’s razor: Select the simplest consistent hypothesis.

Supervised learning: Hypothesis

Finding hypotheses to fit data. Top row: four plots of best-fit functions from
four different hypothesis spaces trained on data set 1. Bottom row: the same
four functions, but trained on a slightly different data set (sampled from the
same 𝑓(𝑥) function).
Supervised learning: Testing
• The quality of the hypothesis ℎ depends on how accurately it
predicts the observations in the test set → generalization.
• The test set must use the same distribution over the example space as the training set.

A learning curve for the decision tree learning algorithm on 100 randomly generated examples in the restaurant domain. Each data point is the average of 20 trials.
ID3 Decision Tree
Example problem: Restaurant waiting

Predicting whether a certain person will wait to


have a seat in a restaurant.

1. Alternate: is there an alternative restaurant nearby?


2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Example problem: Restaurant waiting

A (true) decision tree for deciding whether to wait for a table.
Example problem: Restaurant waiting

Learning decision trees
• Divide and conquer: split the data into smaller and smaller subsets.
• Splits are usually on a single variable, e.g., first x1 > a?, then x2 > b? or x2 > g? depending on the branch.
• After each split, every outcome is a new decision-tree learning problem with fewer examples and one less attribute.
Learning decision trees

Splitting the examples by testing on attributes. At each node we show the positive
(light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type
brings us no nearer to distinguishing between positive and negative examples. (b)
Splitting on Patrons does a good job of separating positive and negative examples.
After splitting on Patrons, Hungry is a fairly good second test.
ID3 Decision tree: Pseudo-code

function LEARN-DECISION-TREE(examples, attributes, parent_examples) returns a tree
  if examples is empty                                   // case 3: no examples left
    then return PLURALITY-VALUE(parent_examples)
  else if all examples have the same classification      // case 2: remaining examples are all positive (or all negative)
    then return the classification
  else if attributes is empty                            // case 4: no attributes left but examples are still mixed
    then return PLURALITY-VALUE(examples)
  else …                                                 // case 1: continued on the next slide

The decision tree learning algorithm. The function PLURALITY-VALUE selects the most common output value among a set of examples, breaking ties randomly.
ID3 Decision tree: Pseudo-code
function LEARN-DECISION-TREE(examples, attributes, parent_examples) returns a tree
  …
  else                                                   // case 1: there are still attributes to split the examples
    A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
    tree ← a new decision tree with root test A
    for each value v of A do
      exs ← {e : e ∈ examples and e.A = v}
      subtree ← LEARN-DECISION-TREE(exs, attributes − A, examples)
      add a branch to tree with label (A = v) and subtree subtree
    return tree

The decision tree learning algorithm (continued). The function IMPORTANCE evaluates the importance of attributes (e.g., by information gain).
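The pseudo-code above can be sketched in Python. This is a minimal illustration, not the lecture's code: the dataset encoding (a list of (attribute-dict, label) pairs), the tuple-based tree representation, and the use of information gain as IMPORTANCE are my own assumptions.

```python
import math
from collections import Counter

def plurality_value(examples):
    # most common label, used for empty or attribute-exhausted branches
    return Counter(y for _, y in examples).most_common(1)[0][0]

def entropy(examples):
    total = len(examples)
    counts = Counter(y for _, y in examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def importance(a, examples):
    # information gain of splitting on attribute a
    remainder = 0.0
    for v in {e[a] for e, _ in examples}:
        sub = [(e, y) for e, y in examples if e[a] == v]
        remainder += len(sub) / len(examples) * entropy(sub)
    return entropy(examples) - remainder

def learn_decision_tree(examples, attributes, parent_examples=()):
    if not examples:                                   # case 3: no examples left
        return plurality_value(parent_examples)
    if len({y for _, y in examples}) == 1:             # case 2: all one class
        return examples[0][1]
    if not attributes:                                 # case 4: attributes exhausted
        return plurality_value(examples)
    A = max(attributes, key=lambda a: importance(a, examples))   # case 1
    branches = {}
    for v in {e[A] for e, _ in examples}:
        exs = [(e, y) for e, y in examples if e[A] == v]
        branches[v] = learn_decision_tree(exs, attributes - {A}, examples)
    return (A, branches)

def classify(tree, e):
    while isinstance(tree, tuple):
        A, branches = tree
        tree = branches[e[A]]
    return tree

# tiny demo: the logical AND of two binary attributes
DATA = [({'a': 1, 'b': 1}, '+'), ({'a': 1, 'b': 0}, '-'),
        ({'a': 0, 'b': 1}, '-'), ({'a': 0, 'b': 0}, '-')]
tree = learn_decision_tree(DATA, {'a', 'b'})
print(all(classify(tree, e) == y for e, y in DATA))  # True
```

Since the demo data are consistent, the learned tree classifies every training example correctly.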
ID3 Decision tree algorithm
1. There are some positive and some negative examples → choose the best attribute to split them.
2. The remaining examples are all positive (or all negative) → DONE; it is possible to answer Yes or No.
3. No examples left at a branch → return a default value.
   • No example has been observed for this combination of attribute values.
   • The default value is the plurality classification of all the examples that were used in constructing the node's parent.
4. No attributes left, but both positive and negative examples remain → return the plurality classification of the remaining examples.
   • Examples with the same description but different classifications.
   • Caused by an error or noise in the data, a nondeterministic domain, or no observation of an attribute that would distinguish the examples.
Example problem: Restaurant waiting

The decision tree induced from the 12-example training set.


17
Example problem: Restaurant waiting
• The induced decision tree can classify all the examples
without tests for Raining and Reservation.
• It can detect interesting and previously unsuspected patterns.
• E.g., the customers will wait for Thai food on weekends.
• It is also bound to make some mistakes for cases where it
has seen no examples.
• E.g., how about a situation in which the wait is 0–10 minutes, the
restaurant is full, yet the customer is not hungry?

Decision tree: Inductive learning
• Simplest: construct a decision tree with one leaf for every example
  → memory-based learning
  → worse generalization

• Advanced: split on each variable so that the purity of each split increases (i.e., toward branches that are only yes or only no).
A purity measure with entropy
• Entropy measures the uncertainty of a random variable 𝑉 whose values 𝑣𝑘 occur with probability 𝑃(𝑣𝑘):

$$H(V) = \sum_k P(v_k) \log_2 \frac{1}{P(v_k)} = -\sum_k P(v_k) \log_2 P(v_k)$$

• It is a fundamental quantity in information theory.

• The information gain (IG) for an attribute 𝐴 is the expected reduction in entropy from before to after splitting the data on 𝐴.
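The definition above is easy to check numerically; the sketch below is a hedged illustration in which `entropy` (my own naming) takes a list of the probabilities 𝑃(𝑣𝑘):

```python
import math

def entropy(probs):
    # H(V) = -sum_k P(v_k) log2 P(v_k), with 0 log 0 taken as 0
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0  (maximal: two equally likely values)
print(entropy([1.0]))        # 0.0  (a pure node)
print(entropy([0.25] * 4))   # 2.0
```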
A purity measure with entropy
• Entropy is maximal when all possibilities are equally likely.

• Entropy is zero in a pure "yes" (or pure "no") node.

• Decision tree learning therefore picks, at each node, the split that decreases entropy the most, i.e., maximizes the information gain.
Example problem: Restaurant waiting
Splitting on Alternate?: Yes → 3 Y, 3 N; No → 3 Y, 3 N

• Calculate the entropy of the whole dataset:
  $H(S) = -\frac{6}{12}\log_2\frac{6}{12} - \frac{6}{12}\log_2\frac{6}{12} = 1$
• Calculate the average entropy of attribute Alternate?:
  $AE_{Alternate?} = P(Alt{=}Y)\,H(Alt{=}Y) + P(Alt{=}N)\,H(Alt{=}N)$
  $= \frac{6}{12}\left(-\frac{3}{6}\log_2\frac{3}{6}-\frac{3}{6}\log_2\frac{3}{6}\right) + \frac{6}{12}\left(-\frac{3}{6}\log_2\frac{3}{6}-\frac{3}{6}\log_2\frac{3}{6}\right) = 1$
• Calculate the information gain of attribute Alternate?:
  $IG(Alternate?) = H(S) - AE_{Alternate?} = 1 - 1 = 0$
Example problem: Restaurant waiting
Splitting on Bar?: Yes → 3 Y, 3 N; No → 3 Y, 3 N

• Calculate the average entropy of attribute Bar?:
  $AE_{Bar?} = \frac{6}{12}\left(-\frac{3}{6}\log_2\frac{3}{6}-\frac{3}{6}\log_2\frac{3}{6}\right) + \frac{6}{12}\left(-\frac{3}{6}\log_2\frac{3}{6}-\frac{3}{6}\log_2\frac{3}{6}\right) = 1$
• Calculate the information gain of attribute Bar?:
  $IG(Bar?) = H(S) - AE_{Bar?} = 1 - 1 = 0$
Example problem: Restaurant waiting
Splitting on Sat/Fri?: Yes → 2 Y, 3 N; No → 4 Y, 3 N

• Calculate the average entropy of attribute Sat/Fri?:
  $AE_{Sat/Fri?} = \frac{5}{12}\left(-\frac{2}{5}\log_2\frac{2}{5}-\frac{3}{5}\log_2\frac{3}{5}\right) + \frac{7}{12}\left(-\frac{4}{7}\log_2\frac{4}{7}-\frac{3}{7}\log_2\frac{3}{7}\right) = 0.979$
• Calculate the information gain of attribute Sat/Fri?:
  $IG(Sat/Fri?) = H(S) - AE_{Sat/Fri?} = 1 - 0.979 = 0.021$
Example problem: Restaurant waiting
Splitting on Hungry?: Yes → 5 Y, 2 N; No → 1 Y, 4 N

• Calculate the average entropy of attribute Hungry?:
  $AE_{Hungry?} = \frac{7}{12}\left(-\frac{5}{7}\log_2\frac{5}{7}-\frac{2}{7}\log_2\frac{2}{7}\right) + \frac{5}{12}\left(-\frac{1}{5}\log_2\frac{1}{5}-\frac{4}{5}\log_2\frac{4}{5}\right) = 0.804$
• Calculate the information gain of attribute Hungry?:
  $IG(Hungry?) = H(S) - AE_{Hungry?} = 1 - 0.804 = 0.196$
Example problem: Restaurant waiting
Splitting on Raining?: Yes → 3 Y, 2 N; No → 3 Y, 4 N

• Calculate the average entropy of attribute Raining?:
  $AE_{Raining?} = \frac{5}{12}\left(-\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5}\right) + \frac{7}{12}\left(-\frac{3}{7}\log_2\frac{3}{7}-\frac{4}{7}\log_2\frac{4}{7}\right) = 0.979$
• Calculate the information gain of attribute Raining?:
  $IG(Raining?) = H(S) - AE_{Raining?} = 1 - 0.979 = 0.021$
Example problem: Restaurant waiting
Splitting on Reservation?: Yes → 3 Y, 2 N; No → 3 Y, 4 N

• Calculate the average entropy of attribute Reservation?:
  $AE_{Reservation?} = \frac{5}{12}\left(-\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5}\right) + \frac{7}{12}\left(-\frac{3}{7}\log_2\frac{3}{7}-\frac{4}{7}\log_2\frac{4}{7}\right) = 0.979$
• Calculate the information gain of attribute Reservation?:
  $IG(Reservation?) = H(S) - AE_{Reservation?} = 1 - 0.979 = 0.021$
Example problem: Restaurant waiting
Splitting on Type?: French → 1 Y, 1 N; Italian → 1 Y, 1 N; Thai → 2 Y, 2 N; Burger → 2 Y, 2 N

• Calculate the average entropy of attribute Type?:
  $AE_{Type?} = \frac{2}{12}\left(-\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2}\right) + \frac{2}{12}\left(-\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2}\right) + \frac{4}{12}\left(-\frac{2}{4}\log_2\frac{2}{4}-\frac{2}{4}\log_2\frac{2}{4}\right) + \frac{4}{12}\left(-\frac{2}{4}\log_2\frac{2}{4}-\frac{2}{4}\log_2\frac{2}{4}\right) = 1$
• Calculate the information gain of attribute Type?:
  $IG(Type?) = H(S) - AE_{Type?} = 1 - 1 = 0$
Example problem: Restaurant waiting
Splitting on WaitEstimate?: 0–10 → 4 Y, 2 N; 10–30 → 1 Y, 1 N; 30–60 → 1 Y, 1 N; >60 → 0 Y, 2 N

• Calculate the average entropy of attribute WaitEstimate? (taking $0\log_2 0 = 0$):
  $AE_{WaitEstimate?} = \frac{6}{12}\left(-\frac{4}{6}\log_2\frac{4}{6}-\frac{2}{6}\log_2\frac{2}{6}\right) + \frac{2}{12}\left(-\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2}\right) + \frac{2}{12}\left(-\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2}\right) + \frac{2}{12}\left(-\frac{0}{2}\log_2\frac{0}{2}-\frac{2}{2}\log_2\frac{2}{2}\right) = 0.792$
• Calculate the information gain of attribute WaitEstimate?:
  $IG(WaitEstimate?) = H(S) - AE_{WaitEstimate?} = 1 - 0.792 = 0.208$
Example problem: Restaurant waiting
• The largest information gain (0.541), equivalently the smallest average entropy (0.459), is achieved by splitting on Patrons.

Patrons?: None → 2 N; Some → 4 Y; Full → 2 Y, 4 N (still mixed, to be split further)

• Continue making new splits, always purifying the nodes.
Quiz 01: ID3 decision tree
• The data represent files on a computer system. Possible
values of the class variable are “infected”, which implies the
file has a virus infection, or “clean” if it doesn't.
• Derive a decision tree for virus identification.

No. | Writable | Updated | Size  | Class
----|----------|---------|-------|---------
 1  | Yes      | No      | Small | Infected
 2  | Yes      | Yes     | Large | Infected
 3  | No       | Yes     | Med   | Infected
 4  | No       | No      | Med   | Clean
 5  | Yes      | No      | Large | Clean
 6  | No       | No      | Large | Clean
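As a hedged starting point for the quiz, the sketch below computes the information gain of each attribute on this table (ID3 would make its first split on the winner); the encoding and function names are my own:

```python
import math

FILES = [
    ({'Writable': 'Yes', 'Updated': 'No',  'Size': 'Small'}, 'Infected'),
    ({'Writable': 'Yes', 'Updated': 'Yes', 'Size': 'Large'}, 'Infected'),
    ({'Writable': 'No',  'Updated': 'Yes', 'Size': 'Med'},   'Infected'),
    ({'Writable': 'No',  'Updated': 'No',  'Size': 'Med'},   'Clean'),
    ({'Writable': 'Yes', 'Updated': 'No',  'Size': 'Large'}, 'Clean'),
    ({'Writable': 'No',  'Updated': 'No',  'Size': 'Large'}, 'Clean'),
]

def entropy(examples):
    total = len(examples)
    h = 0.0
    for label in {y for _, y in examples}:
        p = sum(1 for _, y in examples if y == label) / total
        h -= p * math.log2(p)
    return h

def info_gain(attr):
    remainder = 0.0
    for v in {e[attr] for e, _ in FILES}:
        subset = [(e, y) for e, y in FILES if e[attr] == v]
        remainder += len(subset) / len(FILES) * entropy(subset)
    return entropy(FILES) - remainder

for a in ('Writable', 'Updated', 'Size'):
    print(a, round(info_gain(a), 3))
# Updated has the largest gain, so ID3 splits on it first
```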
Artificial neural network
What is a neural network?
• The biological neural network (NN) is a reasoning model
based on the human brain.
• The human brain contains approximately 86 billion neurons; estimates of the number of synaptic connections in an adult range from 100 to 500 trillion.
• It is a highly complex, nonlinear, and parallel information-processing system.
• Learning through experience is an essential characteristic.
• Plasticity: connections leading to the “right answer” are enhanced,
while those to the “wrong answer” are weakened.

Biological neural network
• There have been attempts to emulate biological neural networks in computers, resulting in artificial neural networks (ANNs).

• ANNs resemble only the learning mechanisms, not the architecture.
  • Megatron-Turing NLG: 530 billion parameters; GPT-3: 175 billion.
ANN: Network architecture
• An ANN has many neurons, arranged in a hierarchy of layers.
• Each neuron is an elementary information-processing unit.

• ANNs improve performance through experience and generalization.
ANN: Applications

ANN: Neurons and Signals
• Each neuron receives several input signals through its connections and produces at most a single output signal.
  • Each connection carries a weight expressing the strength of its input.

• The set of weights is the long-term memory of an ANN → the learning process iteratively adjusts the weights.
Biological neuron vs. artificial neuron: the analogy between the two.

Biological neuron | Artificial neuron
------------------|------------------
Soma              | Neuron
Dendrite          | Input
Axon              | Output
Synapse           | Weight
(Figure: a chart of neural network architectures. Source: The Asimov Institute)
How to build an ANN?
• The network architecture must be decided first.
• How many neurons are to be used?
• How the neurons are to be connected to form a network?
• Then determine which learning algorithm to use,
• Supervised /semi-supervised / unsupervised / reinforcement learning
• And finally train the neural network
• How to initialize the weights of the network?
• How to update them from a set of training examples.

Perceptron
Perceptron (Frank Rosenblatt, 1958)
• A perceptron has a single neuron with adjustable synaptic
weights and a hard limiter.

A single-layer two-input perceptron.
How does a perceptron work?
• The perceptron divides the 𝑛-dimensional input space into two decision regions by a hyperplane defined by

$$y = \sum_{i=1}^{n} x_i w_i - \theta$$
Perceptron learning rule
• Step 1 – Initialization: the initial weights 𝑤1, 𝑤2, …, 𝑤𝑛 and threshold 𝜃 are randomly assigned small numbers (usually in [−0.5, 0.5], but not restricted to it).

• Step 2 – Activation: at iteration 𝑝, apply the 𝑝-th example, which has inputs 𝑥1(𝑝), 𝑥2(𝑝), …, 𝑥𝑛(𝑝) and desired output 𝑌𝑑(𝑝), and calculate the actual output

$$Y(p) = \mathrm{step}\!\left(\sum_{i=1}^{n} x_i(p)\, w_i(p) - \theta\right), \qquad \mathrm{step}(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$

where 𝑛 is the number of perceptron inputs and step is the activation function.

• Step 3 – Weight training
  • Update the weights 𝑤𝑖: $w_i(p+1) = w_i(p) + \Delta w_i(p)$, where $\Delta w_i(p)$ is the weight correction at iteration 𝑝.
  • The delta rule determines how to adjust the weights: $\Delta w_i(p) = \eta \times x_i(p) \times e(p)$, where 𝜂 is the learning rate (0 < 𝜂 < 1) and $e(p) = Y_d(p) - Y(p)$.

• Step 4 – Iteration: increase iteration 𝑝 by one, go back to Step 2, and repeat the process until convergence.
Perceptron for the logical AND/OR
• A single-layer perceptron can learn the AND/OR operations.

The learning of logical AND converged after several iterations (threshold 𝜃 = 0.2, learning rate 𝜂 = 0.1).
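The four-step learning rule can be sketched as follows. The threshold θ = 0.2 and learning rate η = 0.1 follow the slide; the random seed and the encoding of the AND training set are illustrative assumptions:

```python
import random

def step(x):
    # hard limiter activation
    return 1 if x >= 0 else 0

def train_perceptron(examples, theta=0.2, eta=0.1, max_epochs=100):
    random.seed(0)
    w = [random.uniform(-0.5, 0.5) for _ in range(2)]   # Step 1: initialization
    for _ in range(max_epochs):
        converged = True
        for (x1, x2), y_d in examples:                  # Step 2: activation
            y = step(x1 * w[0] + x2 * w[1] - theta)
            e = y_d - y                                 # e(p) = Yd(p) - Y(p)
            if e != 0:
                converged = False
                w[0] += eta * x1 * e                    # Step 3: delta rule
                w[1] += eta * x2 * e
        if converged:                                   # Step 4: repeat until convergence
            break
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND)
print([step(x1 * w[0] + x2 * w[1] - 0.2) for (x1, x2), _ in AND])  # [0, 0, 0, 1]
```

With this seed the weights converge after a few epochs, as on the slide.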
Perceptron for the logical XOR
• It cannot be trained to perform the Exclusive-OR.

• Generally, a perceptron can classify only linearly separable patterns, regardless of the activation function used.
  • Research works: Shynk, 1990; Shynk and Bershad, 1992.
Perceptron: An example
Suppose there is a high-tech exhibition in the city, and you are thinking about whether to go there. Your decision relies on the factors below:
• Is the weather good?
• Does your friend want to accompany you?
• Is the exhibition near public transit? (You don't own a car.)

The inputs are: weather, friend wants to go, and near public transit.
• 𝑤1 = 6, 𝑤2 = 2, 𝑤3 = 2 → the weather matters to you much more than whether your friend joins you or the nearness of public transit.
• 𝜃 = 5 → decisions are made based on the weather only.
• 𝜃 = 3 → you go whenever the weather is good, or when both the exhibition is near public transit and your friend wants to join you.
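The example can be checked with a one-line decision neuron (a sketch; the function name and binary input encoding are my own):

```python
# Decision neuron from the exhibition example: binary inputs
# (weather good, friend joins, near transit), weights (6, 2, 2).
def decide(x, w=(6, 2, 2), theta=3):
    return int(sum(xi * wi for xi, wi in zip(x, w)) >= theta)

print(decide((1, 0, 0)))            # 1: good weather alone is enough
print(decide((0, 1, 1)))            # 1: friend + transit together suffice
print(decide((0, 1, 0)))            # 0: friend alone does not
print(decide((0, 1, 1), theta=5))   # 0: with theta = 5, only the weather matters
```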
Quiz 03: Perceptron
• Consider the following neural network which receives binary input
values, 𝑥1 and 𝑥2 , and produces a single binary value.

• For every combination (𝑥1 , 𝑥2 ), what are the output values at neurons, 𝐴,
𝐵 and 𝐶?
Multi-layer perceptron
Multi-layer perceptron (MLP)

(Figure: a fully connected network with input signals on the left, a first and a second hidden layer, and output signals on the right.)

• A multi-layer perceptron is a fully connected feedforward network with at least three layers.
• Idea: map certain input to a specified target value using a cascade of nonlinear transformations.
Learning algorithm: Back-propagation
• The input signals are propagated forward on a layer-by-layer basis.
• The error signals are propagated backward from the output layer to the input layer.
Back-propagation algorithm
• Consider an MLP with one hidden layer.
• Note the following notations
• 𝑎𝑖 : the output value of node 𝑖 in the input layer
• 𝑧𝑗 : the input value to node 𝑗 in the layer ℎ
• 𝑔𝑗 : the activation function for node 𝑗 in the layer ℎ (applied to 𝑧𝑗 )
• 𝑎𝑗 = 𝑔𝑗 𝑧𝑗 : the output value of node 𝑗 in the layer ℎ
• 𝑏𝑗 : the bias/offset for unit 𝑗 in the layer ℎ
• 𝑤𝑖𝑗 : weights connecting node 𝑖 in layer (ℎ − 1) to node 𝑗 in layer ℎ
• 𝑡𝑘 : target value for node 𝑘 in the output layer

Back-propagation algorithm
(Figure: input layer, hidden layer, and output layer. Bias units are not shown.)
BP algorithm: The error function
• Training a neural network entails finding parameters 𝜃 = (𝐖, 𝐛) that minimize the errors.
• The error function is usually the sum of the squared errors between the target values 𝑡𝑘 and the network outputs 𝑎𝑘:

$$E = \frac{1}{2} \sum_{k=1}^{l} (a_k - t_k)^2$$

where 𝑙 is the dimensionality of the target for a single observation.

• This parameter optimization problem can be solved using gradient descent, computing $\frac{\partial E}{\partial \theta}$ for all 𝜃.
BP algorithm: Output layer params
• Calculating the gradient of the error function with respect to the output-layer parameters is straightforward with the chain rule:

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}}\,\frac{1}{2}\sum_k (a_k - t_k)^2 = (a_k - t_k)\,\frac{\partial}{\partial w_{jk}}(a_k - t_k)$$

• Then,

$$\frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\,\frac{\partial a_k}{\partial w_{jk}} \quad\text{since } \tfrac{\partial t_k}{\partial w_{jk}} = 0$$
$$= (a_k - t_k)\,\frac{\partial}{\partial w_{jk}}\, g_k(z_k) \quad\text{since } a_k = g_k(z_k)$$
$$= (a_k - t_k)\, g_k'(z_k)\,\frac{\partial z_k}{\partial w_{jk}}$$
BP algorithm: Output layer params
• Recall that $z_k = b_k + \sum_j g_j(z_j)\, w_{jk}$, and hence $\frac{\partial z_k}{\partial w_{jk}} = g_j(z_j) = a_j$.

• Then,

$$\frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\, g_k'(z_k)\, a_j$$

where $(a_k - t_k)$ is the difference between the network output $a_k$ and the target value $t_k$, $g_k'(z_k)$ is the derivative of the activation function at $z_k$, and $a_j$ is the output of node $j$ from the hidden layer feeding into the output layer.

• The common activation function is the sigmoid function

$$g(z) = \frac{1}{1 + e^{-z}}$$

whose derivative is

$$g'(z) = g(z)\bigl(1 - g(z)\bigr)$$
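The derivative identity $g'(z) = g(z)(1 - g(z))$ can be verified against a central-difference approximation; the test point z = 0.7 below is an arbitrary choice:

```python
import math

def g(z):
    # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

z = 0.7                                        # arbitrary test point
analytic = g(z) * (1 - g(z))                   # g'(z) = g(z)(1 - g(z))
numeric = (g(z + 1e-6) - g(z - 1e-6)) / 2e-6   # central difference
print(abs(analytic - numeric) < 1e-8)          # True
```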
BP algorithm: Output layer params
• Let $\delta_k = (a_k - t_k)\, g_k'(z_k)$ be the error signal after being back-propagated through the output activation function $g_k$.
• The delta form of the error-function gradient for the output-layer weights is

$$\frac{\partial E}{\partial w_{jk}} = \delta_k\, a_j$$

• The gradient-descent update rule for the output-layer weights is ($\eta$ is the learning rate):

$$w_{jk} \leftarrow w_{jk} - \eta\,\frac{\partial E}{\partial w_{jk}} = w_{jk} - \eta\,\delta_k\, a_j = w_{jk} - \eta\,(a_k - t_k)\, g_k(z_k)\bigl(1 - g_k(z_k)\bigr)\, a_j$$

• Apply similar update rules for the remaining output-layer weights $w_{jk}$.
BP algorithm: Output layer biases
• The gradient for the biases is simply the back-propagated error signal $\delta_k$:

$$\frac{\partial E}{\partial b_k} = (a_k - t_k)\, g_k'(z_k)\,(1) = \delta_k$$

• Each bias is updated as $b_k \leftarrow b_k - \eta\,\delta_k$.

• Note that $\frac{\partial z_k}{\partial b_k} = \frac{\partial}{\partial b_k}\Bigl(b_k + \sum_j g_j(z_j)\, w_{jk}\Bigr) = 1$.
  • The biases are weights on activations that are always equal to one, regardless of the feed-forward signal.
  • Thus, the bias gradients aren't affected by the feed-forward signal, only by the error.
BP algorithm: Hidden layer params
• The process starts just the same as for the output layer:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\,\frac{1}{2}\sum_k (a_k - t_k)^2 = \sum_k (a_k - t_k)\,\frac{\partial a_k}{\partial w_{ij}}$$

• Applying the chain rule again, we obtain:

$$\frac{\partial E}{\partial w_{ij}} = \sum_k (a_k - t_k)\,\frac{\partial}{\partial w_{ij}}\, g_k(z_k) \quad\text{since } a_k = g_k(z_k)$$
$$= \sum_k (a_k - t_k)\, g_k'(z_k)\,\frac{\partial z_k}{\partial w_{ij}}$$
BP algorithm: Hidden layer params
• The term $z_k$ can be expanded as follows:

$$z_k = b_k + \sum_j a_j w_{jk} = b_k + \sum_j g_j(z_j)\, w_{jk} \quad\text{since } a_j = g_j(z_j)$$
$$= b_k + \sum_j g_j\Bigl(b_j + \sum_i a_i w_{ij}\Bigr) w_{jk} \quad\text{since } z_j = b_j + \sum_i a_i w_{ij}$$

• Again, use the chain rule to calculate $\frac{\partial z_k}{\partial w_{ij}}$:

$$\frac{\partial z_k}{\partial w_{ij}} = \frac{\partial z_k}{\partial a_j}\,\frac{\partial a_j}{\partial w_{ij}} = w_{jk}\,\frac{\partial g_j(z_j)}{\partial w_{ij}} = w_{jk}\, g_j'(z_j)\,\frac{\partial z_j}{\partial w_{ij}} = w_{jk}\, g_j'(z_j)\,\frac{\partial}{\partial w_{ij}}\Bigl(b_j + \sum_i a_i w_{ij}\Bigr) = w_{jk}\, g_j'(z_j)\, a_i$$
BP algorithm: Hidden layer params
• Thus,

$$\frac{\partial E}{\partial w_{ij}} = \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\; g_j'(z_j)\, a_i = \sum_k \delta_k w_{jk}\; g_j'(z_j)\, a_i$$

where $\sum_k \delta_k w_{jk}$ collects the error terms, $g_j'(z_j)$ is the derivative of the activation function at $z_j$, and $a_i$ is the output activation signal from the layer below.

• Let $\delta_j = g_j'(z_j) \sum_k \delta_k w_{jk}$ denote the resulting error signal back-propagated to layer $j$.
• The error-function gradient for the hidden-layer weights is

$$\frac{\partial E}{\partial w_{ij}} = \delta_j\, a_i$$

To calculate the weight gradients at any layer $l$, we calculate the back-propagated error signal $\delta_l$ that reaches that layer from the layers after it, and weight it by the feed-forward signal at layer $l-1$ feeding into that layer.
BP algorithm: Hidden layer params
• The gradient-descent update rule for the hidden-layer weights is

$$w_{ij} \leftarrow w_{ij} - \eta\,\frac{\partial E}{\partial w_{ij}} = w_{ij} - \eta\,\delta_j\, a_i = w_{ij} - \eta\, g_j'(z_j)\Bigl(\sum_k \delta_k w_{jk}\Bigr) a_i$$
$$= w_{ij} - \eta \sum_k (a_k - t_k)\, g_k(z_k)\bigl(1 - g_k(z_k)\bigr)\, w_{jk}\; g_j(z_j)\bigl(1 - g_j(z_j)\bigr)\, a_i$$

• Apply similar update rules for the remaining hidden-layer weights $w_{ij}$.
BP algorithm: Hidden layer biases
• Calculating the error gradients with respect to the hidden-layer biases $b_j$ follows a very similar procedure to that for the hidden-layer weights:

$$\frac{\partial E}{\partial b_j} = \sum_k (a_k - t_k)\,\frac{\partial}{\partial b_j}\, g_k(z_k) = \sum_k (a_k - t_k)\, g_k'(z_k)\,\frac{\partial z_k}{\partial b_j}$$

• Apply the chain rule to solve $\frac{\partial z_k}{\partial b_j} = w_{jk}\, g_j'(z_j)\,(1)$.

• The gradient for the biases is the back-propagated error signal $\delta_j$:

$$\frac{\partial E}{\partial b_j} = \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j) = g_j'(z_j)\sum_k \delta_k w_{jk} = \delta_j$$

• Each bias is updated as $b_j \leftarrow b_j - \eta\,\delta_j$.
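A useful sanity check on the whole derivation is to compare the analytic gradient $\delta_j a_i$ with a numerical finite-difference gradient on a tiny network. All the weights, inputs, and targets below are made-up illustrative numbers, and the 2-2-2 architecture is an assumption:

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid

x  = [0.5, -0.3]                        # inputs a_i
W  = [[0.1, -0.2], [0.4, 0.3]]          # w_ij: input i -> hidden j
bh = [0.05, -0.05]                      # hidden biases b_j
V  = [[0.2, 0.7], [-0.6, 0.1]]          # w_jk: hidden j -> output k
bo = [0.0, 0.1]                         # output biases b_k
t  = [1.0, 0.0]                         # targets t_k

def error(W):
    # forward pass and squared error E = 1/2 sum_k (a_k - t_k)^2
    a = [g(bh[j] + sum(x[i] * W[i][j] for i in range(2))) for j in range(2)]
    y = [g(bo[k] + sum(a[j] * V[j][k] for j in range(2))) for k in range(2)]
    return 0.5 * sum((y[k] - t[k]) ** 2 for k in range(2))

# analytic gradient dE/dw_{i=0,j=1} via delta_j = g_j'(z_j) sum_k delta_k w_jk
zh = [bh[j] + sum(x[i] * W[i][j] for i in range(2)) for j in range(2)]
a  = [g(z) for z in zh]
zo = [bo[k] + sum(a[j] * V[j][k] for j in range(2)) for k in range(2)]
y  = [g(z) for z in zo]
dk = [(y[k] - t[k]) * y[k] * (1 - y[k]) for k in range(2)]
dj = [a[j] * (1 - a[j]) * sum(dk[k] * V[j][k] for k in range(2)) for j in range(2)]
analytic = dj[1] * x[0]                 # dE/dw_ij = delta_j * a_i

# numerical gradient by central difference on the same weight
eps = 1e-6
Wp = [row[:] for row in W]; Wp[0][1] += eps
Wm = [row[:] for row in W]; Wm[0][1] -= eps
numerical = (error(Wp) - error(Wm)) / (2 * eps)

print(abs(analytic - numerical) < 1e-8)  # True
```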
Back-propagation network for XOR
• The logical XOR problem took
224 epochs or 896 iterations
for network training.

Sum of the squared errors (SSE)
• When the SSE over an entire pass through all training examples is sufficiently small, the network is deemed to have converged.

Learning curve for the logical operation XOR.
Sigmoid neuron vs. Perceptron
• Sigmoid neuron better reflects the fact that small changes in
weights and bias cause only a small change in output.

A sigmoidal function is a smoothed-out version of a step function.
About back-propagation learning
• Do randomly initialized weights and thresholds lead to different solutions?
  • Starting from different initial conditions yields different weights and threshold values; the problem is still solved, but within different numbers of iterations.
• Back-propagation learning cannot be viewed as an emulation of brain-like learning.
  • Biological neurons do not work backward to adjust the strengths of their interconnections (synapses).
• Training is slow due to extensive calculations.
  • Improvements: Caudill, 1991; Jacobs, 1988; Stubbs, 1990.
Gradient descent method
• Consider two parameters, 𝑤1 and 𝑤2, in a network, and the error surface of the function 𝑓 over them (the colors represent the value of 𝑓).
• Randomly pick a starting point 𝜃⁰.
• Compute the negative gradient at 𝜃⁰: $-\nabla f(\theta^0)$, where $\nabla f(\theta^0) = \bigl(\partial f(\theta^0)/\partial w_1,\; \partial f(\theta^0)/\partial w_2\bigr)^T$.
• Scale it by the learning rate 𝜂: $-\eta\,\nabla f(\theta^0)$, and move to the next point.
Gradient descent method
• Repeating the update, $\theta^1 = \theta^0 - \eta\,\nabla f(\theta^0)$, $\theta^2 = \theta^1 - \eta\,\nabla f(\theta^1)$, …, we would eventually reach a minimum $\theta^*$.
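The update loop above can be sketched on a toy error surface; the quadratic f(w1, w2) = (w1 − 3)² + (w2 + 1)², learning rate, and step count are illustrative assumptions:

```python
def grad_f(w):
    # gradient of f(w1, w2) = (w1 - 3)^2 + (w2 + 1)^2, minimum at (3, -1)
    return (2 * (w[0] - 3), 2 * (w[1] + 1))

def gradient_descent(theta0, eta=0.1, steps=200):
    theta = theta0
    for _ in range(steps):
        gr = grad_f(theta)
        # theta <- theta - eta * grad f(theta)
        theta = (theta[0] - eta * gr[0], theta[1] - eta * gr[1])
    return theta

w = gradient_descent((0.0, 0.0))
print(round(w[0], 4), round(w[1], 4))  # 3.0 -1.0
```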
Gradient descent method
• Gradient descent never guarantees reaching the global minimum.
• Different initial points 𝜃⁰ reach different minima, and thus different results.
Gradient descent method
• It also has issues at plateaus and saddle points.
  • Very slow at a plateau ($\nabla f(\theta) \approx 0$)
  • Stuck at a saddle point ($\nabla f(\theta) = 0$)
  • Stuck at a local minimum ($\nabla f(\theta) = 0$)
Accelerated learning in ANNs
• Use tanh instead of sigmoid: represent the sigmoidal function by a hyperbolic tangent

$$Y^{\tanh} = \frac{2a}{1 + e^{-bX}} - a$$

where 𝑎 = 1.716 and 𝑏 = 0.667 (Guyon, 1991).
Accelerated learning in ANNs
• Generalized delta rule: a momentum term is included in the delta rule (Rumelhart et al., 1986):

$$\Delta w_{jk}(p) = \beta \times \Delta w_{jk}(p-1) + \eta \times y_j(p) \times \delta_k(p)$$

where 𝛽 = 0.95 is the momentum constant (0 ≤ 𝛽 ≤ 1).

How about putting the momentum of the physical world into gradient descent?
Accelerated learning in ANNs
• Momentum still does not guarantee reaching the global minimum, but it gives some hope.
• Movement = negative of gradient + momentum: even at a point where the gradient = 0, the accumulated momentum keeps the parameters moving.
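A hedged sketch of momentum-augmented gradient descent on a toy one-parameter cost; the cost function, β, η, and step count are my own choices, not from the lecture:

```python
def grad(w):
    # gradient of the toy cost f(w) = (w - 3)^2, minimum at w = 3
    return 2 * (w - 3)

def momentum_descent(w0=0.0, eta=0.05, beta=0.9, steps=300):
    w, delta = w0, 0.0
    for _ in range(steps):
        # movement = momentum + negative of gradient
        delta = beta * delta - eta * grad(w)
        w += delta
    return w

print(round(momentum_descent(), 4))  # 3.0
```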
Accelerated learning in ANNs
• Adaptive learning rate: adjust the learning rate parameter 𝜂 during training.
  • Small 𝜂 → small weight changes through iterations → smooth learning curve.
  • Large 𝜂 → speeds up the training process with larger weight changes → possible instability and oscillation.
• Heuristic-like approaches for adjusting 𝜂:
  1. If the algebraic sign of the SSE change remains the same for several consecutive epochs → increase 𝜂.
  2. If the algebraic sign of the SSE change alternates for several consecutive epochs → decrease 𝜂.
• One of the most effective means of acceleration.
Learning with momentum only
Learning with momentum for the logical operation XOR.
Learning with adaptive 𝜂 only

Learning with adaptive learning rate for the logical operation XOR.
Learning with adaptive 𝜂 and momentum
Quiz 04: Multi-layer perceptron
• Consider the below feedforward network with one hidden layer of units.

• If the network is tested with an input vector 𝑥 = 1.0, 2.0, 3.0 then what
are the activation 𝐻1 of the first hidden neuron and the activation 𝐼3 of the
third output neuron?
Quiz 04: Forward the input signals
• The input vector to the network is 𝑥 = (𝑥1, 𝑥2, 𝑥3)ᵀ.
• The vector of hidden-layer outputs is 𝑦 = (𝑦1, 𝑦2)ᵀ.
• The vector of actual outputs is 𝑧 = (𝑧1, 𝑧2, 𝑧3)ᵀ.
• The vector of desired outputs is 𝑡 = (𝑡1, 𝑡2, 𝑡3)ᵀ.
• The network has the following weight vectors:
  𝑣1 = (−2.0, 2.0, −2.0)ᵀ, 𝑣2 = (1.0, 1.0, −1.0)ᵀ
  𝑤1 = (1.0, −3.5)ᵀ, 𝑤2 = (0.5, −1.2)ᵀ, 𝑤3 = (0.3, 0.6)ᵀ
• Assume that all units use the sigmoid activation function and zero biases.
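Under the assumption that the weight vectors are grouped as reconstructed above (the 𝑣's map input → hidden, the 𝑤's map hidden → output), the quiz's forward pass can be sketched as:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 2.0, 3.0]                             # test input from the quiz
v = [[-2.0, 2.0, -2.0], [1.0, 1.0, -1.0]]       # input -> hidden (assumed layout)
w = [[1.0, -3.5], [0.5, -1.2], [0.3, 0.6]]      # hidden -> output (assumed layout)

# forward pass with sigmoid activations and zero biases
y = [sigmoid(sum(vi * xi for vi, xi in zip(vj, x))) for vj in v]
z = [sigmoid(sum(wi * yi for wi, yi in zip(wk, y))) for wk in w]

print(round(y[0], 4))   # H1 = sigmoid(-4) ~ 0.018
print(round(z[2], 4))   # I3 ~ 0.5758
```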
Quiz 05: Backpropagation error signals
• The figure shows part of the
network described in the
previous slide.
• Use the same weights,
activation functions and bias
values as described.

• A new input pattern is presented to the network and training proceeds as follows. The actual outputs are given by 𝑧 = (0.15, 0.36, 0.57)ᵀ and the corresponding target outputs are given by 𝑡 = (1.0, 1.0, 1.0)ᵀ.
• The weights 𝑤12, 𝑤22 and 𝑤32 are also shown.
• What is the error for each of the output units?
Acknowledgements
• Some parts of the slide are adapted from
• Derivation: Error Backpropagation & Gradient Descent for Neural
Networks (github.io link)
• Negnevitsky, Michael. Artificial intelligence: A guide to intelligent
systems. Pearson, 2005. Chapter 6.
