Module 3 - Decision Trees and Artificial Neural Networks
Decision Trees
Dr. Mahesh G Huddar
Dept. of Computer Science and Engineering
Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong → No
Decision trees expressivity
• Decision trees represent a disjunction of conjunctions on
constraints on the value of attributes:
(Outlook = Sunny Humidity = Normal) ar
d d
(Outlook = Overcast)
Hu
(Outlook = Rain Wind = Weak) esh
a h
M
• Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute.
• An instance is classified by starting at the root node of the tree, testing the
attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute in the given example.
• This process is then repeated for the subtree rooted at the new node.
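The classification walk described above can be written directly as code. A minimal sketch (the nested-dict tree representation and the function name are illustrative assumptions, not from the slides), using the PlayTennis tree of this module:

```python
# A minimal sketch of classifying an instance by walking a decision tree.
# The tree is stored as nested dicts: {attribute: {value: subtree-or-leaf}}.

tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    """Start at the root, test one attribute per node, follow the branch for
    the instance's value, and repeat until a leaf label is reached."""
    while isinstance(node, dict):
        attribute = next(iter(node))                  # attribute tested at this node
        node = node[attribute][instance[attribute]]   # branch matching the instance's value
    return node

print(classify(tree, {"Outlook": "Sunny", "Temp": "Hot",
                      "Humidity": "High", "Wind": "Strong"}))  # -> No
```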
Decision tree representation (PlayTennis)
• In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances.
• Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions.
4. The training data may contain errors. Decision tree learning methods are robust to errors, both errors in classifications of the training examples and errors in the attribute values that describe these examples.
5. The training data may contain missing attribute values. Decision tree methods can be used even when some training examples have unknown values (e.g., if the Humidity of the day is known for only some of the training examples).
• The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree.
• We would like to select the attribute that is most useful for classifying examples.
• What is a good quantitative measure of the worth of an attribute? We will define a statistical property, called information gain, that measures how well a given attribute separates the training examples according to their target classification.
• ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree.
CONSTRUCTING DECISION TREE – ID3 ALGORITHM
ENTROPY MEASURES HOMOGENEITY OF EXAMPLES
• The entropy of a collection S containing positive and negative examples is

Entropy(S) = -p+ · log2(p+) - p- · log2(p-)

where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.
• The information gain of an attribute A, relative to a collection of examples S, is

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

• where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v (i.e., S_v = {s ∈ S | A(s) = v}).
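Both measures can be written in a few lines of code. A minimal sketch (the data layout — a list of dicts with a named target attribute — is an illustrative assumption, not from the slides):

```python
# Entropy(S) and Gain(S, A) for a list of examples stored as dicts.
from collections import Counter
from math import log2

def entropy(examples, target):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions in S."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder
```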
CONSTRUCTING DECISION TREE – ID3 ALGORITHM
• For example, suppose S is a collection of training-example days described by
attributes including Wind, which can have the values Weak or Strong.
Values(Wind) = {Weak, Strong}
S = [9+, 5-]
S_Weak ← [6+, 2-]
S_Strong ← [3+, 3-]
Gain(S, Wind) = Entropy(S) - (8/14)·Entropy(S_Weak) - (6/14)·Entropy(S_Strong)
             = 0.940 - (8/14)·0.811 - (6/14)·1.00 = 0.048
Training examples (PlayTennis):

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No

Attribute: Outlook
Values(Outlook) = {Sunny, Overcast, Rain}
S = [9+, 5-], Entropy(S) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.94
S_Sunny ← [2+, 3-], Entropy(S_Sunny) = -(2/5)·log2(2/5) - (3/5)·log2(3/5) = 0.971
S_Overcast ← [4+, 0-], Entropy(S_Overcast) = 0.0
S_Rain ← [3+, 2-], Entropy(S_Rain) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) = 0.971

Gain(S, Outlook) = Entropy(S) - Σ_{v ∈ {Sunny, Overcast, Rain}} (|S_v|/|S|)·Entropy(S_v)
                 = Entropy(S) - (5/14)·Entropy(S_Sunny) - (4/14)·Entropy(S_Overcast) - (5/14)·Entropy(S_Rain)
Gain(S, Outlook) = 0.94 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971 = 0.2464
Attribute: Temp
Values(Temp) = {Hot, Mild, Cool}
S_Hot ← [2+, 2-], Entropy(S_Hot) = 1.0
S_Mild ← [4+, 2-], Entropy(S_Mild) = -(4/6)·log2(4/6) - (2/6)·log2(2/6) = 0.9183
S_Cool ← [3+, 1-], Entropy(S_Cool) = -(3/4)·log2(3/4) - (1/4)·log2(1/4) = 0.8113

Gain(S, Temp) = Entropy(S) - Σ_{v ∈ {Hot, Mild, Cool}} (|S_v|/|S|)·Entropy(S_v)
              = Entropy(S) - (4/14)·Entropy(S_Hot) - (6/14)·Entropy(S_Mild) - (4/14)·Entropy(S_Cool)
Gain(S, Temp) = 0.94 - (4/14)·1.0 - (6/14)·0.9183 - (4/14)·0.8113 = 0.0289
Attribute: Humidity
Values(Humidity) = {High, Normal}
S_High ← [3+, 4-], Entropy(S_High) = -(3/7)·log2(3/7) - (4/7)·log2(4/7) = 0.9852
S_Normal ← [6+, 1-], Entropy(S_Normal) = -(6/7)·log2(6/7) - (1/7)·log2(1/7) = 0.5916

Gain(S, Humidity) = Entropy(S) - Σ_{v ∈ {High, Normal}} (|S_v|/|S|)·Entropy(S_v)
                  = Entropy(S) - (7/14)·Entropy(S_High) - (7/14)·Entropy(S_Normal)
Gain(S, Humidity) = 0.94 - (7/14)·0.9852 - (7/14)·0.5916 = 0.1516
Attribute: Wind
Values(Wind) = {Strong, Weak}
S_Strong ← [3+, 3-], Entropy(S_Strong) = 1.0
S_Weak ← [6+, 2-], Entropy(S_Weak) = -(6/8)·log2(6/8) - (2/8)·log2(2/8) = 0.8113

Gain(S, Wind) = Entropy(S) - Σ_{v ∈ {Strong, Weak}} (|S_v|/|S|)·Entropy(S_v)
              = Entropy(S) - (6/14)·Entropy(S_Strong) - (8/14)·Entropy(S_Weak)
Gain(S, Wind) = 0.94 - (6/14)·1.0 - (8/14)·0.8113 = 0.0478
Gain(S, Outlook) = 0.2464
Gain(S, Temp) = 0.0289
Gain(S, Humidity) = 0.1516
Gain(S, Wind) = 0.0478
Outlook has the highest information gain, so it becomes the root attribute. The Overcast branch contains only positive examples ([4+, 0-]) and becomes a leaf labelled Yes; the Sunny and Rain branches are expanded next.
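The root-level attribute choice can be checked mechanically. A small sketch (function and variable names are illustrative) that recomputes the four gains on the PlayTennis table above:

```python
# Recompute the root-level information gains for the PlayTennis table.
from collections import Counter
from math import log2

DATA = [  # (Outlook, Temp, Humidity, Wind, PlayTennis), rows D1-D14
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]
ATTRS = ["Outlook", "Temp", "Humidity", "Wind"]

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, i):
    rem = 0.0
    for v in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == v]
        rem += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - rem

for i, name in enumerate(ATTRS):
    print(f"Gain(S, {name}) = {gain(DATA, i):.4f}")
# Prints approximately 0.2467, 0.0292, 0.1518, 0.0481; the slides round
# Entropy(S) to 0.94, giving 0.2464, 0.0289, 0.1516, 0.0478.
```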
Training examples with Outlook = Sunny:

Day  Temp  Humidity  Wind    PlayTennis
D1   Hot   High      Weak    No
D2   Hot   High      Strong  No
D8   Mild  High      Weak    No
D9   Cool  Normal    Weak    Yes
D11  Mild  Normal    Strong  Yes

Attribute: Temp
Values(Temp) = {Hot, Mild, Cool}
S_Sunny = [2+, 3-], Entropy(S_Sunny) = -(2/5)·log2(2/5) - (3/5)·log2(3/5) = 0.97
S_Hot ← [0+, 2-], Entropy(S_Hot) = 0.0
S_Mild ← [1+, 1-], Entropy(S_Mild) = 1.0
S_Cool ← [1+, 0-], Entropy(S_Cool) = 0.0

Gain(S_Sunny, Temp) = Entropy(S_Sunny) - Σ_{v ∈ {Hot, Mild, Cool}} (|S_v|/|S|)·Entropy(S_v)
                    = Entropy(S_Sunny) - (2/5)·Entropy(S_Hot) - (2/5)·Entropy(S_Mild) - (1/5)·Entropy(S_Cool)
Gain(S_Sunny, Temp) = 0.97 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Attribute: Humidity
Values(Humidity) = {High, Normal}
S_High ← [0+, 3-], Entropy(S_High) = 0.0
S_Normal ← [2+, 0-], Entropy(S_Normal) = 0.0

Gain(S_Sunny, Humidity) = Entropy(S_Sunny) - Σ_{v ∈ {High, Normal}} (|S_v|/|S|)·Entropy(S_v)
                        = Entropy(S_Sunny) - (3/5)·Entropy(S_High) - (2/5)·Entropy(S_Normal)
Gain(S_Sunny, Humidity) = 0.97 - (3/5)·0.0 - (2/5)·0.0 = 0.97
Attribute: Wind
Values(Wind) = {Strong, Weak}
S_Strong ← [1+, 1-], Entropy(S_Strong) = 1.0
S_Weak ← [1+, 2-], Entropy(S_Weak) = -(1/3)·log2(1/3) - (2/3)·log2(2/3) = 0.9183

Gain(S_Sunny, Wind) = Entropy(S_Sunny) - Σ_{v ∈ {Strong, Weak}} (|S_v|/|S|)·Entropy(S_v)
                    = Entropy(S_Sunny) - (2/5)·Entropy(S_Strong) - (3/5)·Entropy(S_Weak)
Gain(S_Sunny, Wind) = 0.97 - (2/5)·1.0 - (3/5)·0.918 = 0.0192

Humidity has the highest gain for the Sunny branch, so Humidity is tested below Outlook = Sunny (High → No, Normal → Yes).
Training examples with Outlook = Rain:

Day  Temp  Humidity  Wind    PlayTennis
D4   Mild  High      Weak    Yes
D5   Cool  Normal    Weak    Yes
D6   Cool  Normal    Strong  No
D10  Mild  Normal    Weak    Yes
D14  Mild  High      Strong  No

Attribute: Temp
Values(Temp) = {Hot, Mild, Cool}
S_Rain = [3+, 2-], Entropy(S_Rain) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) = 0.97
S_Hot ← [0+, 0-], Entropy(S_Hot) = 0.0
S_Mild ← [2+, 1-], Entropy(S_Mild) = -(2/3)·log2(2/3) - (1/3)·log2(1/3) = 0.9183
S_Cool ← [1+, 1-], Entropy(S_Cool) = 1.0

Gain(S_Rain, Temp) = Entropy(S_Rain) - Σ_{v ∈ {Hot, Mild, Cool}} (|S_v|/|S|)·Entropy(S_v)
                   = Entropy(S_Rain) - (0/5)·Entropy(S_Hot) - (3/5)·Entropy(S_Mild) - (2/5)·Entropy(S_Cool)
Gain(S_Rain, Temp) = 0.97 - (0/5)·0.0 - (3/5)·0.918 - (2/5)·1.0 = 0.0192
Attribute: Humidity
Values(Humidity) = {High, Normal}
S_High ← [1+, 1-], Entropy(S_High) = 1.0
S_Normal ← [2+, 1-], Entropy(S_Normal) = -(2/3)·log2(2/3) - (1/3)·log2(1/3) = 0.9183

Gain(S_Rain, Humidity) = Entropy(S_Rain) - Σ_{v ∈ {High, Normal}} (|S_v|/|S|)·Entropy(S_v)
                       = Entropy(S_Rain) - (2/5)·Entropy(S_High) - (3/5)·Entropy(S_Normal)
Gain(S_Rain, Humidity) = 0.97 - (2/5)·1.0 - (3/5)·0.918 = 0.0192
Attribute: Wind
Values(Wind) = {Strong, Weak}
S_Strong ← [0+, 2-], Entropy(S_Strong) = 0.0
S_Weak ← [3+, 0-], Entropy(S_Weak) = 0.0

Gain(S_Rain, Wind) = Entropy(S_Rain) - Σ_{v ∈ {Strong, Weak}} (|S_v|/|S|)·Entropy(S_v)
                   = Entropy(S_Rain) - (2/5)·Entropy(S_Strong) - (3/5)·Entropy(S_Weak)
Gain(S_Rain, Wind) = 0.97 - (2/5)·0.0 - (3/5)·0.0 = 0.97

Wind has the highest gain for the Rain branch, so Wind is tested below Outlook = Rain (Strong → No, Weak → Yes). The final decision tree is:
Outlook = Sunny → Humidity: High → No, Normal → Yes
Outlook = Overcast → Yes
Outlook = Rain → Wind: Strong → No, Weak → Yes
Example - 2: Decision Tree Algorithm – ID3 Solved Example

Instance  Classification  a1  a2
1         +               T   T
2         +               T   T
3         -               T   F
4         +               F   F
5         -               F   T
6         -               F   T

Attribute: a1
S = [3+, 3-], Entropy(S) = 1.0
S_T ← [2+, 1-], Entropy(S_T) = -(2/3)·log2(2/3) - (1/3)·log2(1/3) = 0.9183
S_F ← [1+, 2-], Entropy(S_F) = -(1/3)·log2(1/3) - (2/3)·log2(2/3) = 0.9183
Gain(S, a1) = Entropy(S) - Σ_{v ∈ {T, F}} (|S_v|/|S|)·Entropy(S_v)
Gain(S, a1) = 1.0 - (3/6)·0.9183 - (3/6)·0.9183 = 0.0817

Attribute: a2
S_T ← [2+, 2-], Entropy(S_T) = 1.0
S_F ← [1+, 1-], Entropy(S_F) = 1.0
Gain(S, a2) = Entropy(S) - Σ_{v ∈ {T, F}} (|S_v|/|S|)·Entropy(S_v)
            = Entropy(S) - (4/6)·Entropy(S_T) - (2/6)·Entropy(S_F)
Gain(S, a2) = 1.0 - (4/6)·1.0 - (2/6)·1.0 = 0.0

a1 has the higher gain and becomes the root. Each branch is then split on a2:
a1 = T, a2 = T → {1, 2} : +
a1 = T, a2 = F → {3} : -
a1 = F, a2 = T → {5, 6} : -
a1 = F, a2 = F → {4} : +
Example - 3: Decision Tree Algorithm – ID3 Solved Example

Instance  a1     a2    a3      Classification
1         True   Hot   High    No
2         True   Hot   High    No
3         False  Hot   High    Yes
4         False  Cool  Normal  Yes
5         False  Cool  Normal  Yes
6         True   Cool  High    No
7         True   Hot   High    No
8         True   Hot   Normal  Yes
9         False  Cool  Normal  Yes
10        False  Cool  High    Yes

Attribute: a1
Values(a1) = {True, False}
S = [6+, 4-], Entropy(S) = -(6/10)·log2(6/10) - (4/10)·log2(4/10) = 0.9709
S_True ← [1+, 4-], Entropy(S_True) = -(1/5)·log2(1/5) - (4/5)·log2(4/5) = 0.7219
S_False ← [5+, 0-], Entropy(S_False) = 0.0
Gain(S, a1) = Entropy(S) - Σ_{v ∈ {True, False}} (|S_v|/|S|)·Entropy(S_v)
            = Entropy(S) - (5/10)·Entropy(S_True) - (5/10)·Entropy(S_False)
Gain(S, a1) = 0.9709 - (5/10)·0.7219 - (5/10)·0.0 = 0.6099
Attribute: a2
Values(a2) = {Hot, Cool}
S_Hot ← [2+, 3-], Entropy(S_Hot) = -(2/5)·log2(2/5) - (3/5)·log2(3/5) = 0.9709
S_Cool ← [4+, 1-], Entropy(S_Cool) = -(4/5)·log2(4/5) - (1/5)·log2(1/5) = 0.7219
Gain(S, a2) = Entropy(S) - Σ_{v ∈ {Hot, Cool}} (|S_v|/|S|)·Entropy(S_v)
Gain(S, a2) = 0.9709 - (5/10)·0.9709 - (5/10)·0.7219 = 0.1245
Attribute: a3
Values(a3) = {High, Normal}
S = [6+, 4-], Entropy(S) = -(6/10)·log2(6/10) - (4/10)·log2(4/10) = 0.9709
S_High ← [2+, 4-], Entropy(S_High) = -(2/6)·log2(2/6) - (4/6)·log2(4/6) = 0.9183
S_Normal ← [4+, 0-], Entropy(S_Normal) = 0.0
Gain(S, a3) = Entropy(S) - Σ_{v ∈ {High, Normal}} (|S_v|/|S|)·Entropy(S_v)
            = Entropy(S) - (6/10)·Entropy(S_High) - (4/10)·Entropy(S_Normal)
Gain(S, a3) = 0.9709 - (6/10)·0.9183 - (4/10)·0.0 = 0.4199
Gain(S, a1) = 0.6099, Gain(S, a2) = 0.1245, Gain(S, a3) = 0.4199, so a1 is selected as the root attribute.
a1 = True → instances {1, 2, 6, 7, 8} (expanded further)
a1 = False → instances {3, 4, 5, 9, 10} → all Yes (leaf)
For the a1 = True branch, S_True = {1, 2, 6, 7, 8} = [1+, 4-], Entropy(S_True) = 0.7219.

Attribute: a2
S_Hot ← [1+, 3-], Entropy(S_Hot) = -(1/4)·log2(1/4) - (3/4)·log2(3/4) = 0.8113
S_Cool ← [0+, 1-], Entropy(S_Cool) = 0.0
Gain(S_True, a2) = Entropy(S_True) - Σ_{v ∈ {Hot, Cool}} (|S_v|/|S|)·Entropy(S_v)
                 = Entropy(S_True) - (4/5)·Entropy(S_Hot) - (1/5)·Entropy(S_Cool)
Gain(S_True, a2) = 0.7219 - (4/5)·0.8113 - (1/5)·0.0 = 0.0729
Attribute: a3
S_High ← [0+, 4-], Entropy(S_High) = 0.0
S_Normal ← [1+, 0-], Entropy(S_Normal) = 0.0
Gain(S_True, a3) = Entropy(S_True) - Σ_{v ∈ {High, Normal}} (|S_v|/|S|)·Entropy(S_v)
                 = Entropy(S_True) - (4/5)·Entropy(S_High) - (1/5)·Entropy(S_Normal)
Gain(S_True, a3) = 0.7219 - (4/5)·0.0 - (1/5)·0.0 = 0.7219

a3 has the higher gain, so it is tested below a1 = True. The final tree:
a1 = True → a3: High → {1, 2, 6, 7} : No; Normal → {8} : Yes
a1 = False → {3, 4, 5, 9, 10} : Yes
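A compact recursive ID3 sketch can reproduce this tree from the Example - 3 table; the nested-dict representation and the function names are illustrative, not part of the slides.

```python
# Minimal recursive ID3, run on the Example - 3 table above.
from collections import Counter
from math import log2

ROWS = [
    {"a1": "True",  "a2": "Hot",  "a3": "High",   "class": "No"},
    {"a1": "True",  "a2": "Hot",  "a3": "High",   "class": "No"},
    {"a1": "False", "a2": "Hot",  "a3": "High",   "class": "Yes"},
    {"a1": "False", "a2": "Cool", "a3": "Normal", "class": "Yes"},
    {"a1": "False", "a2": "Cool", "a3": "Normal", "class": "Yes"},
    {"a1": "True",  "a2": "Cool", "a3": "High",   "class": "No"},
    {"a1": "True",  "a2": "Hot",  "a3": "High",   "class": "No"},
    {"a1": "True",  "a2": "Hot",  "a3": "Normal", "class": "Yes"},
    {"a1": "False", "a2": "Cool", "a3": "Normal", "class": "Yes"},
    {"a1": "False", "a2": "Cool", "a3": "High",   "class": "Yes"},
]

def entropy(rows):
    counts = Counter(r["class"] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, attr):
    rem = sum(len(sub) / len(rows) * entropy(sub)
              for v in {r[attr] for r in rows}
              for sub in [[r for r in rows if r[attr] == v]])
    return entropy(rows) - rem

def id3(rows, attrs):
    classes = {r["class"] for r in rows}
    if len(classes) == 1:                              # pure node -> leaf
        return classes.pop()
    if not attrs:                                      # no attributes left -> majority class
        return Counter(r["class"] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))     # ID3's central choice
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attrs if a != best])
                   for v in {r[best] for r in rows}}}

print(id3(ROWS, ["a1", "a2", "a3"]))
# e.g. {'a1': {'True': {'a3': {'High': 'No', 'Normal': 'Yes'}}, 'False': 'Yes'}}
# (dictionary key order may vary)
```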
▪ The strategy: use the other examples to guess the missing attribute value
1. Assign the value that is most common among the training examples at the node
▪ Missing values in new instances to be classified are treated accordingly, and the most probable classification is chosen (C4.5)
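A minimal sketch of strategy 1 above, assuming examples are stored as dicts and None marks a missing value (these names are illustrative):

```python
# Fill a missing attribute value with the most common value among the
# training examples that reach the same node.
from collections import Counter

def fill_missing(examples, attribute):
    """Return a copy of `examples` with None values of `attribute` replaced
    by the most common observed value at this node."""
    observed = [e[attribute] for e in examples if e[attribute] is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [{**e, attribute: most_common} if e[attribute] is None else e
            for e in examples]

node_examples = [{"Humidity": "High"}, {"Humidity": None}, {"Humidity": "High"}]
print(fill_missing(node_examples, "Humidity"))  # the missing value becomes "High"
```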
Handling attributes with different costs
• Instance attributes may have an associated cost: we would prefer decision trees that use low-cost attributes.
• ID3 can be modified to take costs into account by replacing the gain measure:
1. Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
2. Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the relative importance of cost.
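A small sketch of the two cost-sensitive selection measures, assuming the plain gain value and an attribute cost are already available (the function names are illustrative):

```python
# Cost-sensitive attribute selection measures: both replace plain information
# gain when ranking candidate attributes at a node.

def tan_schlimmer(gain_value, cost):
    """Gain^2(S, A) / Cost(A)."""
    return gain_value ** 2 / cost

def nunez(gain_value, cost, w=0.5):
    """(2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1] weighting the cost."""
    return (2 ** gain_value - 1) / (cost + 1) ** w

# Example: an attribute with gain 0.15 and cost 2 vs. one with gain 0.10 and cost 1.
print(tan_schlimmer(0.15, 2), tan_schlimmer(0.10, 1))
print(nunez(0.15, 2), nunez(0.10, 1))
```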
• Describing the inductive bias of ID3 therefore consists of describing the basis by which it chooses one of these consistent hypotheses over the others.
• Which of these decision trees does ID3 choose?
• It chooses the first acceptable tree it encounters in its simple-to-complex, hill-climbing search through the space of possible trees.
• A closer approximation to the inductive bias of ID3:
– Shorter trees are preferred over longer trees.
– Trees that place high information gain attributes close to the root are preferred over those that do not.
• Arguments in favor (Occam's razor):
– a principle usually attributed to the 14th-century English logician and Franciscan friar, William of Ockham.
– The term razor refers to the act of shaving away unnecessary assumptions to get to the simplest explanation.
• Most appropriate for problems where
– Instances have many attribute-value pairs
– Target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes
– Training examples may contain errors
– Long training times are acceptable
– Fast evaluation of the learned target function may be required
– The ability for humans to understand the learned target function is not important
• Input values can be any real values.
• The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes. For example, in the ALVINN system the output is a vector of 30 attributes, each corresponding to a recommendation regarding the steering direction. The value of each output is some real number between 0 and 1, which in this case corresponds to the confidence in predicting the corresponding steering direction. We can also train a single network to output both the steering command and suggested acceleration, simply by concatenating the vectors that encode these two output predictions.
Appropriate Problems – for ANN
The training examples may contain errors. ANN learning methods are quite robust to noise in the training data.

Long training times are acceptable. Network training algorithms typically require longer training times than, say, decision tree learning algorithms. Training times can range from a few seconds to many hours, depending on factors such as the number of weights in the network, the number of training examples considered, and the settings of various learning algorithm parameters.
Recap – version spaces (candidate elimination):
S1: ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩
S2 = S3: ⟨Sunny, Warm, ?, Strong, Warm, Same⟩
S4: ⟨Sunny, Warm, ?, Strong, ?, ?⟩
G4: ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩
G0 = G1 = G2: ⟨?, ?, ?, ?, ?, ?⟩
Decision Trees
• Decision trees represent a disjunction of conjunctions of constraints on the values of attributes:
(Outlook = Sunny ∧ Humidity = Normal) => Yes
(Outlook = Overcast) => Yes
(Outlook = Rain ∧ Wind = Weak) => Yes
M a
dar
u d
h H
hes
M a
dar
u d
h H
hes
M a
dar
u d
h H
hes
M a
dar
u d
h H
hes
Ma
dar
u d
h H
hes
M a
dar
u d
h H
• Learning a perceptron involves choosing values for the weights w0, . . . , wn.
• Therefore, the space H of candidate hypotheses considered in perceptron learning is the set of all possible real-valued weight vectors.
• The perceptron training rule revises the weight w_i associated with input x_i according to
  w_i ← w_i + Δw_i, where Δw_i = η (t − o) x_i
• Here t is the target output for the current training example, o is the output generated by the perceptron, and η is a positive constant called the learning rate.
• The role of the learning rate is to moderate the degree to which weights are changed at each step.
• It is usually set to some small value (e.g., 0.1) and is sometimes made to decay as the number of weight-tuning iterations increases.
The Perceptron Training Rule
A single perceptron can be used to represent many Boolean functions.
AND function, with weights 0.6 and 0.6 (threshold = 1):
If A=0 & B=0 → 0*0.6 + 0*0.6 = 0. This is not greater than the threshold of 1, so the output = 0.
If A=0 & B=1 → 0*0.6 + 1*0.6 = 0.6. This is not greater than the threshold, so the output = 0.
If A=1 & B=0 → 1*0.6 + 0*0.6 = 0.6. This is not greater than the threshold, so the output = 0.
If A=1 & B=1 → 1*0.6 + 1*0.6 = 1.2. This exceeds the threshold, so the output = 1.
The Perceptron Training Rule
A single perceptron can be used to represent many Boolean functions. Now suppose the weights are 1.2 and 0.6.
AND function:
If A=0 & B=0 → 0*1.2 + 0*0.6 = 0. This is not greater than the threshold of 1, so the output = 0.
If A=0 & B=1 → 0*1.2 + 1*0.6 = 0.6. This is not greater than the threshold, so the output = 0.
If A=1 & B=0 → 1*1.2 + 0*0.6 = 1.2. This is greater than the threshold, so the output = 1.
But the expected output is 0, so the perceptron training rule must adjust the weights. Applying Δw_i = η (t − o) x_i with η = 0.5 gives Δw1 = 0.5·(0 − 1)·1 = −0.5 and Δw2 = 0.5·(0 − 1)·0 = 0, so the updated weights are 0.7 and 0.6.
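A minimal sketch of the perceptron training rule applied to the AND function; the threshold is folded into a bias weight, and the learning rate, initial weights and epoch count are illustrative choices.

```python
# Perceptron training rule on the Boolean AND function.
# The threshold is handled as a bias weight w0 with a constant input of 1.
import random

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # (inputs, target)

def output(weights, x):
    """Threshold unit: 1 if w0 + w1*x1 + w2*x2 > 0, else 0."""
    s = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if s > 0 else 0

def train(eta=0.1, epochs=50, seed=0):
    random.seed(seed)
    w = [random.uniform(-0.5, 0.5) for _ in range(3)]
    for _ in range(epochs):
        for x, t in data:
            o = output(w, x)
            # Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i
            w[0] += eta * (t - o) * 1
            w[1] += eta * (t - o) * x[0]
            w[2] += eta * (t - o) * x[1]
    return w

w = train()
print([output(w, x) for x, _ in data])  # -> [0, 0, 0, 1]
```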
• Thus, a linear unit corresponds to the first stage of a perceptron, without the threshold.
• In order to derive a weight learning rule for linear units, let us begin by specifying a measure for the training error of a hypothesis (weight vector), relative to the training examples.
• Although there are many ways to define this error, one common measure is
  E(w) = (1/2) Σ_{d ∈ D} (t_d − o_d)²
• where D is the set of training examples, t_d is the target output for training example d, and o_d is the output of the linear unit for training example d.
Derivation of Gradient Descent Rule
• The gradient descent training rule is
  Δw = −η ∇E(w)
• Here η is a positive constant called the learning rate, which determines the step size in the gradient descent search. The negative sign is present because we want to move the weight vector in the direction that decreases E.
• This training rule can also be written in its component form
  Δw_i = −η ∂E/∂w_i
• Gradient descent can be applied whenever
1. the hypothesis space contains continuously parameterized hypotheses (e.g., the weights in a linear unit), and
2. the error can be differentiated with respect to these hypothesis parameters.
• The key practical difficulties in applying gradient descent are
1. converging to a local minimum can sometimes be quite slow (i.e., it can require many thousands of gradient descent steps), and
2. if there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global minimum.
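A minimal sketch of batch gradient descent for a single linear unit, using the error E(w) = (1/2) Σ_d (t_d − o_d)² defined above; the sample data and learning rate are illustrative.

```python
# Batch gradient descent for a linear unit o = w0 + w1*x1 + ... + wn*xn.
# All training examples are visited before the weights are updated once.

def train_linear_unit(examples, eta=0.02, epochs=1000):
    """examples: list of (inputs, target) pairs; returns the learned weights."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                      # w[0] is the bias weight
    for _ in range(epochs):
        delta = [0.0] * (n + 1)              # gradient step accumulated over all examples
        for x, t in examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            err = t - o
            delta[0] += eta * err            # Delta w_i = eta * sum_d (t_d - o_d) * x_id
            for i, xi in enumerate(x, start=1):
                delta[i] += eta * err * xi
        w = [wi + dwi for wi, dwi in zip(w, delta)]
    return w

# Samples of t = 2*x + 1; the learned weights approach [1.0, 2.0].
data = [((x,), 2 * x + 1) for x in range(6)]
print(train_linear_unit(data))
```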
STOCHASTIC APPROXIMATION TO GRADIENT DESCENT
Standard (batch) gradient descent vs. stochastic gradient descent:
• In standard gradient descent the error is summed over all examples before the weights are updated; in stochastic gradient descent the weights are updated upon examining each training example.
• Summing over multiple examples requires more computation per weight-update step; the stochastic version needs less computation per step because the weights are updated after each individual example.
• When there are multiple local minima in the error surface, standard gradient descent can have difficulty escaping them; the stochastic version uses the various ∇E_d(w) rather than ∇E(w), and hence can sometimes avoid falling into a local minimum.
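For contrast with the batch version sketched earlier, a stochastic (incremental) delta-rule update, where the weights change after every single example (the data and learning rate are illustrative):

```python
# Stochastic gradient descent for a linear unit: one weight update per example,
# based on the per-example error E_d(w) = 1/2 * (t_d - o_d)^2.

def sgd_step(w, x, t, eta=0.05):
    """Update the weight list `w` in place for a single example (x, t)."""
    o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    err = t - o
    w[0] += eta * err                      # delta rule: w_i <- w_i + eta*(t - o)*x_i
    for i, xi in enumerate(x, start=1):
        w[i] += eta * err * xi
    return w

w = [0.0, 0.0]
for _ in range(200):                       # repeated passes over samples of t = 2*x + 1
    for x in range(4):
        sgd_step(w, (x,), 2 * x + 1)
print(w)                                   # approaches [1.0, 2.0]
```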
Example network (used for the backpropagation discussion):
• Units 1, 2, 3 are input units receiving X1, X2, X3; units 4 and 5 are hidden units; units 6 and 7 are output units.
• Unit 4 receives inputs X41, X42, X43 (copies of X1, X2, X3) over weights w41, w42, w43; unit 5 likewise receives X51, X52, X53 over weights w51, w52, w53.
• O4 = σ(X41·w41 + X42·w42 + X43·w43)
• O5 = σ(X51·w51 + X52·w52 + X53·w53)
• The hidden outputs feed the output units: X64 = X74 = O4 and X65 = X75 = O5.
• O6 = σ(X64·w64 + X65·w65)
• O7 = σ(X74·w74 + X75·w75)
• For each (x, t) in the training examples, Do
• Propagate the input forward through the network:
1. Input the instance x to the network and compute the output o_u of every unit u in the network.
• Propagate the errors backward through the network:
2. For each output unit k, calculate its error term δ_k ← o_k (1 − o_k)(t_k − o_k)
3. For each hidden unit h, calculate its error term δ_h ← o_h (1 − o_h) Σ_{k ∈ outputs} w_kh δ_k
4. Update each network weight w_ji: w_ji ← w_ji + η δ_j x_ji
• where E_d is the error on training example d, that is, half the squared difference between the target output and the actual output over all output units in the network:
  E_d(w) = (1/2) Σ_{k ∈ outputs} (t_k − o_k)²
• Here outputs is the set of output units in the network, t_k is the target value of unit k for training example d, and o_k is the output of unit k given training example d.
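A minimal sketch of one stochastic backpropagation step for the 3-input, 2-hidden, 2-output network above; the weights, input, targets and learning rate are illustrative, and numpy is assumed to be available.

```python
# One forward pass + one backpropagation weight update for the small
# 3-2-2 network in the figures (units 4, 5 hidden; units 6, 7 output).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_hidden = rng.uniform(-0.5, 0.5, size=(2, 3))   # rows: units 4, 5; cols: inputs 1-3
W_output = rng.uniform(-0.5, 0.5, size=(2, 2))   # rows: units 6, 7; cols: units 4, 5
eta = 0.1

x = np.array([1.0, 0.0, 1.0])    # X1, X2, X3 (illustrative)
t = np.array([1.0, 0.0])         # targets for O6, O7 (illustrative)

# 1. Propagate the input forward.
o_hidden = sigmoid(W_hidden @ x)          # O4, O5
o_output = sigmoid(W_output @ o_hidden)   # O6, O7

# 2. Error terms for the output units: delta_k = o_k(1 - o_k)(t_k - o_k).
delta_out = o_output * (1 - o_output) * (t - o_output)

# 3. Error terms for the hidden units: delta_h = o_h(1 - o_h) * sum_k w_kh * delta_k.
delta_hidden = o_hidden * (1 - o_hidden) * (W_output.T @ delta_out)

# 4. Update every weight: w_ji <- w_ji + eta * delta_j * x_ji.
W_output += eta * np.outer(delta_out, o_hidden)
W_hidden += eta * np.outer(delta_hidden, x)

print(o_output)
```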
Derivation of Back Propagation Algorithm
• To begin, notice that weight w_ji can influence the rest of the network only through net_j. Therefore, we can use the chain rule to write
  ∂E_d/∂w_ji = (∂E_d/∂net_j) · (∂net_j/∂w_ji)
  net_j = Σ_i w_ji · x_ji
  ∂net_j/∂w_ji = x_ji
We consider two cases in turn:
• Case 1, where unit j is an output unit for the network, and
• Case 2, where unit j is an internal (hidden) unit of the network.
∂o_j/∂net_j = ∂σ(net_j)/∂net_j = σ(net_j)·(1 − σ(net_j)) = o_j·(1 − o_j)
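A quick numeric check of this derivative identity by finite differences (the sample point is illustrative):

```python
# Verify d sigma / d net = sigma(net) * (1 - sigma(net)) at a sample point.
from math import exp

sigmoid = lambda z: 1.0 / (1.0 + exp(-z))
net, h = 0.7, 1e-6
numeric = (sigmoid(net + h) - sigmoid(net - h)) / (2 * h)
analytic = sigmoid(net) * (1 - sigmoid(net))
print(numeric, analytic)   # both ~0.2218
```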
Derivation of Back Propagation Algorithm
Case 2: Training Rule for Hidden Unit Weights
For a hidden unit j, net_j can influence the network outputs only through the units in Downstream(j):
∂E_d/∂net_j = Σ_{k ∈ Downstream(j)} (∂E_d/∂net_k) · (∂net_k/∂o_j) · (∂o_j/∂net_j) = Σ_{k ∈ Downstream(j)} −δ_k · w_kj · o_j(1 − o_j)
since
∂net_k/∂o_j = ∂(x_kj · w_kj)/∂o_j = w_kj
∂o_j/∂net_j = ∂σ(net_j)/∂net_j = σ(net_j)·(1 − σ(net_j)) = o_j·(1 − o_j)
so that δ_j = o_j (1 − o_j) Σ_{k ∈ Downstream(j)} δ_k · w_kj.
• As can be seen, there are four units that receive inputs directly from all of the 30 x 32 pixels in the image. These are called "hidden" units because their output is available only within the network and is not available as part of the global network output.
• Each of these four hidden units computes a single real-valued output based on a weighted combination of its 960 inputs.
• These hidden unit outputs are then used as inputs to a second layer of 30 "output" units.
• Each output unit corresponds to a particular steering direction, and the output values of these units determine which steering direction is recommended most strongly.
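A minimal numpy sketch of a forward pass through a network with this shape (960 pixel inputs → 4 hidden units → 30 output units); the weights here are random placeholders, not ALVINN's learned weights.

```python
# Forward pass through an ALVINN-shaped network: 30x32 = 960 inputs,
# 4 hidden sigmoid units, 30 output units (one per steering direction).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1 = rng.normal(0, 0.1, size=(4, 960))    # hidden-layer weights (random placeholders)
W2 = rng.normal(0, 0.1, size=(30, 4))     # output-layer weights (random placeholders)

image = rng.random(30 * 32)               # a stand-in for the 30x32 camera image
hidden = sigmoid(W1 @ image)              # 4 hidden-unit outputs
outputs = sigmoid(W2 @ hidden)            # 30 steering-direction confidences in (0, 1)
print(int(np.argmax(outputs)))            # index of the most strongly recommended direction
```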