AI Chapter 3 Part 2
Institute of Technology
University of Gondar
Biomedical Engineering Department
Outline:
» Supervised learning
» Classification
o Decision tree, Bayes classification, Bayesian belief networks, SVM, KNN, ANN
» Regression
Supervised learning
» The goal is to infer a function or mapping from labeled training data.
» The training data consist of an input vector X and an output vector Y of labels or tags.
» Based on the training set, the algorithm generalizes to respond correctly to all possible inputs; this is called learning from examples.
Supervised Learning
Dataset Example:
» Weather information for the last 14 days, together with whether a match was played or not on each particular day.
» Using a supervised learning model, predict whether the game will be played if the weather condition is (Outlook = Rain, Humidity = High, Wind = Weak).
Supervised Learning
» Generalization: the algorithm should produce sensible output for inputs that were not encountered during learning.
Supervised learning is categorized into two tasks:
o Classification: data is classified into one of two or more discrete classes.
o Regression: the task of predicting a continuous quantity.
Classification
Classification:
o It is a systematic approach to building classification models from an input data set.
o It is the task of assigning a new object to one of several predefined categories.
» Each technique employs a learning algorithm to identify a model that best fits the
relationship between the attribute set and class label of the input data.
Classification
» A learned model should accurately predict the class labels of previously unseen records.
» Example: assigning a given email to the spam or non-spam category.
» Classification is a two-step process: a learning step (where a model is constructed from training data) and a classification step (where the model is used to predict class labels for new data).
Decision Tree(DT)
» A decision tree (DT) is a statistical model that builds classification models in the form of a tree structure.
» The model classifies data in a dataset by flowing through a query structure from the root until it reaches a leaf, which represents one class.
» The root represents the attribute that plays the main role in classification, and each leaf represents a class.
o Given an input, a test is applied at each node and one of the branches is taken depending on the outcome.
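A minimal sketch (my own illustration, not from the slides; the Node class and the hard-coded example tree are hypothetical) of this root-to-leaf flow in Python: internal nodes test an attribute and route the input down a branch, and leaves hold a class label.

class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # attribute tested at this node (internal nodes)
        self.branches = branches or {}  # attribute value -> child Node
        self.label = label              # class label (leaf nodes only)

def classify(node, example):
    """Flow from the root to a leaf by following the branch that matches each test."""
    while node.label is None:
        node = node.branches[example[node.attribute]]
    return node.label

# Hypothetical tree: the root tests 'outlook'; the 'Rain' branch tests 'wind'.
tree = Node(attribute="outlook", branches={
    "Sunny": Node(label="no"),
    "Overcast": Node(label="yes"),
    "Rain": Node(attribute="wind", branches={
        "Weak": Node(label="yes"),
        "Strong": Node(label="no"),
    }),
})

print(classify(tree, {"outlook": "Rain", "wind": "Weak"}))  # -> yes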
How does a decision tree work?
✓ ATM machine
✓ Calling 994
How does a decision tree work?
» The key to building a decision tree is deciding which attribute to choose at each node in order to branch.
» The best attribute is selected for splitting the training examples using a goodness function.
o The best attribute:
✓ separates the classes of the training examples fastest, and
✓ yields the smallest tree.
Decision Tree…
Entropy:
» This measure is based on Claude Shannon's work on information theory, which studied the value or "information content" of messages.
» Shannon entropy quantifies this uncertainty as the expected value of the information present in the message.
o $Entropy(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
where $p_i$ is the nonzero probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$, and m is the number of class labels.
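A short sketch (my own illustration, not from the slides) of this entropy formula in Python, with the probabilities estimated from class counts:

from math import log2
from collections import Counter

def entropy(labels):
    """Entropy(D) = -sum_i p_i * log2(p_i), with p_i estimated from class counts in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Two binary cases matching the properties listed on the next slide:
print(entropy(["+"] * 5))              # -0.0 -> D(n, 0) = 0: all examples in one class
print(entropy(["+"] * 3 + ["-"] * 3))  # 1.0  -> D(m, m) = 1: half in each class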
Decision Tree
» Suppose a set D contains a total of n examples, of which $n_+$ are positive and $n_-$ are negative. The entropy is given by:
$Entropy(D) = D(n_+, n_-) = -\frac{n_+}{n}\log_2\frac{n_+}{n} - \frac{n_-}{n}\log_2\frac{n_-}{n}$
» Some useful properties of the entropy:
o $D(n_+, n_-) = D(n_-, n_+)$
o $D(n, 0) = D(0, n) = 0$
✓ $D(S) = 0$ means that all the examples in S are in the same class.
o $D(m, m) = 1$ means that half of the examples in S are of one class and half are of the other class.
Decision tree…
Information Gain:
» It is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on attribute A):
$Gain(A) = Entropy(D) - Entropy_A(D)$
» Select the attribute with the highest information gain, i.e., the one that creates the smallest average disorder:
✓ First, compute the disorder of the set using entropy;
✓ then calculate by how much the disorder would be reduced by knowing the value of a particular attribute.
Decision Tree …
» Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.
» Training data (buys_computer):

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

» Information gains computed on this data:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
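A small Python check (my own sketch, not part of the slides; the helper names are mine) that computes these information gains from the table above:

from math import log2
from collections import Counter

# buys_computer training data from the table above:
# (age, income, student, credit_rating, buys_computer)
rows = [
    ("<=30", "high", "no", "fair", "no"),           ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),         (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),           (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),          (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),  ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),        (">40", "medium", "no", "excellent", "no"),
]
attributes = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_index):
    """Gain(A) = Entropy(D) - sum_v (|D_v|/|D|) * Entropy(D_v)."""
    labels = [r[-1] for r in rows]
    parts = {}
    for r in rows:
        parts.setdefault(r[attr_index], []).append(r[-1])
    expected = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return entropy(labels) - expected

for i, name in enumerate(attributes):
    print(name, round(info_gain(i), 3))
# Prints approximately: age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (small differences from the slide's 0.151 come from rounding intermediate entropies).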
Computing Information-Gain for Continuous-Valued Attributes
» Sort the values of A; the midpoint between each pair of adjacent values is considered as a possible split-point, and the point with the minimum expected information requirement is selected.
» Split:
o D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point.
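A brief sketch (my own illustration; the numeric values below are made up) of choosing a split-point for a continuous attribute by trying the midpoints between adjacent sorted values:

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return the split-point of a continuous attribute A with the minimum
    expected information, trying the midpoint of each pair of adjacent values."""
    pairs = sorted(zip(values, labels))
    best_expected, best_split = float("inf"), None
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                   # no midpoint between equal values
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        d1 = [lab for v, lab in pairs if v <= split]   # D1: tuples with A <= split-point
        d2 = [lab for v, lab in pairs if v > split]    # D2: tuples with A > split-point
        expected = (len(d1) * entropy(d1) + len(d2) * entropy(d2)) / len(pairs)
        if expected < best_expected:
            best_expected, best_split = expected, split
    return best_split, best_expected

# Hypothetical continuous attribute (age) with a yes/no class label:
ages = [22, 25, 31, 35, 40, 46, 51]
buys = ["no", "no", "yes", "yes", "yes", "no", "no"]
print(best_split_point(ages, buys))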
Gain Ratio for Attribute Selection (C4.5)
» Information gain measure is biased towards attributes with a large number of values.
» C4.5 (a successor of ID3) uses gain ratio to overcome the problem of information gain.
$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)$
✓ GainRatio(A) = Gain(A) / SplitInfo_A(D)
» Ex.
✓ gain_ratio(income) = 0.029/1.557 = 0.019
» The attribute with the maximum gain ratio is selected as the splitting attribute
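A tiny sketch (my own illustration) of the gain-ratio idea: SplitInfo penalizes attributes that split the data into many partitions.

from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum_j (|Dj|/|D|) * log2(|Dj|/|D|)."""
    n = sum(partition_sizes)
    return -sum((s / n) * log2(s / n) for s in partition_sizes if s > 0)

# income splits the 14 tuples into partitions of size 4 (low), 6 (medium) and 4 (high):
print(round(split_info([4, 6, 4]), 3))          # ≈ 1.557, as used on the slide
print(round(0.029 / split_info([4, 6, 4]), 3))  # gain_ratio(income) ≈ 0.019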
Gini Index (CART, IBM IntelligentMiner)
» If a data set D contains examples from n classes, the gini index gini(D) is defined as
$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$
where $p_j$ is the relative frequency of class j in D.
» If a data set D is split on attribute A into two subsets D1 and D2, the gini index gini_A(D) is defined as
$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$
» The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
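A short sketch (my own illustration) of the Gini index and a binary split, following the formulas above:

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in D."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(d1, d2):
    """gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# buys_computer data: D has 9 "yes" and 5 "no" tuples.
d = ["yes"] * 9 + ["no"] * 5
print(round(gini(d), 3))  # 1 - (9/14)^2 - (5/14)^2 ≈ 0.459
# A hypothetical binary split of D into subsets of 10 and 4 tuples:
print(round(gini_split(["yes"] * 7 + ["no"] * 3, ["yes"] * 2 + ["no"] * 2), 3))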
Example: Decision tree
The problem of "Sunburn": you want to predict whether a person is likely to get sunburned when they go back to the beach. How can you do this? Predict based on the observed properties of the people in the data collected below.
Name Hair Height Weight Lotion Result
Sarah Blonde Average Light No Sunburned
Dana Blonde Tall Average Yes None
Alex Brown Short Average Yes None
Annie Blonde Short Average No Sunburned
Emily Red Average Heavy No Sunburned
Pete Brown Tall Heavy No None
John Brown Average Heavy No None
Kate Blonde Short Light Yes None
Decision tree
» To find the splitting criterion for these tuples, we must compute the information gain of each attribute.
» The expected information needed to classify a tuple in D, if the tuples are partitioned according to hair, is 0.50.
Decision tree
» The calculation for the height, weight and lotion attributes is performed in the same way as for the hair attribute.

Attribute  Expected information after the split
Hair       0.50
Height     0.69
Weight     0.94
Lotion     0.61
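A quick check (my own sketch) that reproduces this table from the sunburn data by computing the expected information (weighted entropy) left after splitting on each attribute:

from math import log2
from collections import Counter

# Sunburn data: (Hair, Height, Weight, Lotion) -> Result
people = [
    ("Blonde", "Average", "Light", "No", "Sunburned"),
    ("Blonde", "Tall", "Average", "Yes", "None"),
    ("Brown", "Short", "Average", "Yes", "None"),
    ("Blonde", "Short", "Average", "No", "Sunburned"),
    ("Red", "Average", "Heavy", "No", "Sunburned"),
    ("Brown", "Tall", "Heavy", "No", "None"),
    ("Brown", "Average", "Heavy", "No", "None"),
    ("Blonde", "Short", "Light", "Yes", "None"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def expected_info(attr_index):
    """Weighted entropy of the class label after partitioning on one attribute."""
    parts = {}
    for row in people:
        parts.setdefault(row[attr_index], []).append(row[-1])
    return sum(len(p) / len(people) * entropy(p) for p in parts.values())

for i, name in enumerate(["Hair", "Height", "Weight", "Lotion"]):
    print(name, round(expected_info(i), 2))
# Prints approximately: Hair 0.5, Height 0.69, Weight 0.94, Lotion 0.61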
The best decision tree?
is_sunburned (partial tree after splitting on hair colour):
  Hair colour = red    → Sunburned
  Hair colour = brown  → None
  Hair colour = blonde → ?
» Once we have finished with hair colour, we then need to calculate the remaining branches of the decision tree.
» This split is the simplest and optimal one possible, and it makes a lot of sense: it classifies 4 of the people on hair colour alone.
is_sunburned (final tree):
  Hair colour = red    → Sunburned
  Hair colour = brown  → None
  Hair colour = blonde → Lotion used?
      Lotion used = no  → Sunburned
      Lotion used = yes → None
Decision Tree: Rule Extraction from Trees
If (hair-colour = "red") then
    return (sunburned = yes)
else if (hair-colour = "blonde" and lotion-used = "no") then
    return (sunburned = yes)
else
    return (sunburned = no)
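The same rules written as a runnable Python function (a sketch that mirrors the rules above):

def is_sunburned(hair_colour, lotion_used):
    """Rules extracted from the decision tree above."""
    if hair_colour == "red":
        return True                                        # sunburned = yes
    elif hair_colour == "blonde" and lotion_used == "no":
        return True                                        # sunburned = yes
    else:
        return False                                       # sunburned = no

print(is_sunburned("blonde", "no"))  # True  (e.g. Sarah, Annie)
print(is_sunburned("brown", "no"))   # False (e.g. Pete, John)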
Avoid overfitting in classification
o Symptoms: the tree is too deep with too many branches; some branches may reflect anomalies due to noise or outliers.
o Prepruning: halt tree construction early. It is difficult to decide when to stop, because we do not know what may happen subsequently if we keep growing the tree.
o Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset of tuples, or the probability distribution of those tuples.
Avoid overfitting in classification
» Postpruning: a sub-tree at a given node is pruned by removing its branches and replacing it with a leaf.
» The leaf is labeled with the most frequent class among the sub-tree being replaced.
✓ C4.5 uses a statistical method to estimate the errors at each node for pruning.
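A hedged sketch of both ideas using scikit-learn (assuming a recent scikit-learn is available; the slides do not prescribe a library, and sklearn's cost-complexity pruning is a CART-style analogue of C4.5's error-based pruning, not the same method): max_depth and min_samples_leaf act as prepruning, while ccp_alpha postprunes a fully grown tree.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Prepruning: halt growth early by limiting depth and leaf size.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Postpruning: grow the tree, then remove branches via cost-complexity pruning.
post = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(pre.get_depth(), pre.get_n_leaves())
print(post.get_depth(), post.get_n_leaves())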
Decision Tree
o It is easy to comprehend.
o The learning and classification steps of a decision tree are simple and
fast.
Exercises
Assignment 2
Bayes Classification
» Bayesian classifiers are statistical classifiers that combine prior knowledge of the classes with new evidence gathered from the data.
» For each new sample they provide a probability that the sample belongs to a class (for all classes).
o The Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with a strong (naïve) independence assumption between the features.
Bayes’ Theorem: Basics
» For example, suppose our world of data tuples is confined to customers described by the attributes age and income, and that X is a 35-year-old customer with an income of $40,000.
o Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income.
» P(H) is the prior probability of H: the probability that any given customer will buy a computer, regardless of age or income.
» P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.
» P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.
» Bayes' theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(X), P(H), and P(X|H).
o $P(H \mid X) = \dfrac{P(X \mid H)\,P(H)}{P(X)}$
o posterior = (likelihood × prior) / evidence
Bayes Classification
Example :
» A doctor knows that meningitis causes stiff neck 50% of the time
» If a patient has stiff neck, what’s the probability he/she has meningitis?
Bayes Classification
Example : A medical cancer diagnosis problem. There are 2 possible outcomes of a diagnosis: +ve, -ve. We
know 0.8% of world population has cancer. Test gives correct +ve result 98% of the time and gives correct –ve
result 97% of the time. If a patient’s test returns +ve, should we diagnose the patient as having cancer?
Given:
P(cancer) = 0.008          P(no-cancer) = 0.992
P(+ve|cancer) = 0.98       P(-ve|cancer) = 0.02
P(+ve|no-cancer) = 0.03    P(-ve|no-cancer) = 0.97
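A short numeric check (my own working, not on the slide) that applies Bayes' theorem to these numbers:

# P(cancer | +ve) = P(+ve | cancer) * P(cancer) / P(+ve)
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_no_cancer = 0.98, 0.03

p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * p_no_cancer
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(round(p_cancer_given_pos, 2))  # ≈ 0.21, so a +ve test alone does not make cancer the more likely diagnosis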
Naïve Bayes Classifier: Training Dataset
» Training data: the 14-tuple buys_computer table shown in the information-gain example above.
» P(Ci):
P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357
» Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.600
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.400
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.200
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.400
» X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007
» Therefore, X belongs to the class ("buys_computer = yes").
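A few lines (my own check) that reproduce the products above:

# Naive Bayes products for X = (age <= 30, income = medium, student = yes, credit_rating = fair)
p_x_yes = 0.222 * 0.444 * 0.667 * 0.667   # P(X | buys_computer = yes) ≈ 0.044
p_x_no = 0.600 * 0.400 * 0.200 * 0.400    # P(X | buys_computer = no)  ≈ 0.019
print(round(p_x_yes * 9 / 14, 3))         # ≈ 0.028
print(round(p_x_no * 5 / 14, 3))          # ≈ 0.007
# The larger product decides the class: buys_computer = yes.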
Example: 'Play Tennis' data
» Suppose that you have a free afternoon and you are thinking whether or not to go and play tennis. How do you do that?
✓ Based on the following training data, predict whether this player will play tennis.

Day    Outlook   Temperature  Humidity  Wind    Play Tennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No
Naive Bayes Classifier
P(P) = 9/14
P(N) = 5/14

Outlook      P    N        Humidity  P    N
sunny        2/9  3/5      high      3/9  4/5
overcast     4/9  0        normal    6/9  1/5
rain         3/9  2/5

Temperature  P    N        Windy     P    N
hot          2/9  2/5      strong    3/9  3/5
mild         4/9  2/5      weak      6/9  2/5
cool         3/9  1/5
Play-tennis example
» Based on the model created, predict Play Tennis or Not for the following unseen sample
(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
$C_{NB} = \arg\max_{C \in \{yes,\, no\}} P(C) \prod_t P(a_t \mid C) = \arg\max_{C \in \{yes,\, no\}} P(C)\, P(Outlook = sunny \mid C)\, P(Temp = cool \mid C)\, P(Hum = high \mid C)\, P(Wind = strong \mid C)$
» Working:
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206
Answer: PlayTennis = no
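A self-contained sketch (my own illustration, no library assumed; the function and variable names are mine) of a Naïve Bayes classifier learned from the Play Tennis table, which reproduces the working above:

from collections import Counter, defaultdict

# Play Tennis training data: (Outlook, Temperature, Humidity, Wind) -> PlayTennis
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]

# Class priors and per-attribute conditional frequency counts.
class_counts = Counter(row[-1] for row in data)
cond_counts = defaultdict(Counter)          # (attribute index, class) -> Counter of values
for row in data:
    for i, value in enumerate(row[:-1]):
        cond_counts[(i, row[-1])][value] += 1

def predict(x):
    """Return argmax_C P(C) * prod_t P(a_t | C), plus the unnormalized scores."""
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / len(data)                          # prior P(C)
        for i, value in enumerate(x):
            p *= cond_counts[(i, c)][value] / n_c    # likelihood P(a_t | C)
        scores[c] = p
    return max(scores, key=scores.get), scores

label, scores = predict(("Sunny", "Cool", "High", "Strong"))
print(label)                                         # No
print({c: round(p, 4) for c, p in scores.items()})   # {'No': 0.0206, 'Yes': 0.0053}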
» Advantages
o Easy to implement
» Disadvantages
o Assumption: class conditional independence, therefore loss of accuracy
1. Write a Python implementation of the Decision Tree and Naïve Bayes classifier machine learning methods.