
Artificial Intelligence

Institute of Technology
University of Gondar
Biomedical Engineering Department

By Ewunate Assaye (MSc.)


Chapter Three
Machine Learning

Outline:
» Supervised learning
» Classification
  o Decision tree, Bayes classification, Bayesian belief networks, SVM, KNN, ANN
» Regression
» Time-series data prediction
Supervised learning

» A supervised scenario is characterized by the concept of a teacher or supervisor, whose main task is to provide the agent with a precise measure of its error (directly comparable with the output values).

» The goal is to infer a function or mapping from training data that is labeled.

» The training data consist of an input vector X and an output vector Y of labels or tags.

» Based on the training set, the algorithm generalizes so that it responds correctly to all possible inputs; this is called learning from examples.
Supervised Learning

Dataset Example:
» Weather information for the last 14 days.
» Whether the match was played or not on each particular day.
» Then predict whether the game will be played if the weather condition is (Outlook = Rain, Humidity = High, Wind = Weak), using a supervised learning model.
Supervised Learning

» A data set is denoted in the form {(x_i, y_i)}:
  o where the inputs are x_i, the outputs are y_i, and i = 1, ..., N, with N the number of observations.

» Generalization: the algorithm should produce sensible output for inputs that were not encountered during learning.

Supervised learning is categorized into two tasks:
  o Classification: data is classified into one of two or more classes.
  o Regression: the task of predicting a continuous quantity.
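As a concrete illustration of this learning-from-examples setup, here is a minimal sketch only, assuming scikit-learn is available; the four rows and the integer encoding of the weather features are hypothetical values invented for the example, not data from the slides.

from sklearn.tree import DecisionTreeClassifier   # any supervised classifier would do here

# Labeled training data: each row of X is an input vector x_i, each entry of y its label y_i.
# Hypothetical encoding: outlook (0=sunny, 1=overcast, 2=rain), humidity (0=normal, 1=high),
# wind (0=weak, 1=strong).
X = [[0, 1, 0],
     [2, 1, 1],
     [1, 0, 0],
     [2, 0, 0]]
y = ["no", "no", "yes", "yes"]          # whether the match was played that day

model = DecisionTreeClassifier()
model.fit(X, y)                          # learning step: infer the mapping from X to Y
print(model.predict([[2, 1, 0]]))        # classification step: outlook=rain, humidity=high, wind=weak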
Classification

Classification:
o It is a systematic approach to building classification models from an input data set.
o It is the task of assigning a new object to one of several predefined categories.

» Examples of classification algorithms: decision tree classifiers, rule-based classifiers, neural networks, support vector machines, naïve Bayes classifiers, etc.

» Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label of the input data.
Classification

» A learned model should accurately predict the class labels of previously unseen records.

» Example:
  o assigning a given email to the spam or non-spam category
  o classifying bank loan applicants as safe or risky for the bank

» Classification is a two-step process consisting of:
  o a learning step (where a classification model is constructed using a training set), and
  o a classification step (where the model is used to predict class labels for given data)
Decision Tree(DT)

» Decision tree (DT) is a statistical model that builds classification models in the form of a tree structure.

» This model classifies data in a dataset by flowing through a query structure from
the root until it reaches the leaf, which represents one class.

» The root represents the attribute that plays a main role in classification, and the
leaf represents the class.
o Given an input, at each node, a test is applied and one of the branches is taken
depending on the outcome.
How decision tree works?

» The DT algorithm adopts a greedy divide-and-conquer approach:
  o Attributes can be categorical (or continuous) values
  o The tree is constructed in a top-down recursive manner (no backtracking)
  o At the start, all the training examples are at the root
  o Examples are partitioned recursively based on selected attributes
  o Attributes are selected on the basis of an impurity function (e.g., information gain)

» Conditions for stopping partitioning:
  o All examples for a given node belong to the same class
  o There are no remaining attributes for further partitioning – the leaf is labeled with the majority class (majority voting)
  o There are no examples left to partition
Example

✓ATM machine

✓Calling to 994
How decision tree works? (Slides 11-13: figures illustrating tree construction step by step; not reproduced in this text.)
How Decision Tree works?

GrowTree(Training Data D):
    Partition(D)

Partition(Data D):
    if all points in D belong to the same class then
        return
    for each attribute A do
        evaluate splits on attribute A
    use the best split found to partition D into D1 and D2
    Partition(D1)
    Partition(D2)
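A minimal Python sketch of this recursive partitioning, assuming categorical attributes stored in dictionaries and using entropy/information gain (defined formally on the following slides) as the goodness function:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def grow_tree(rows, labels, attributes):
    # Stopping conditions: pure node, or no attributes left (fall back to majority voting).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                       # information gain of splitting on attribute a
        n = len(labels)
        after = 0.0
        for v in set(row[a] for row in rows):
            sub = [lab for row, lab in zip(rows, labels) if row[a] == v]
            after += len(sub) / n * entropy(sub)
        return entropy(labels) - after
    best = max(attributes, key=gain)   # best split = highest information gain
    branches = {}
    for v in set(row[best] for row in rows):
        sub_rows = [row for row in rows if row[best] == v]
        sub_labels = [lab for row, lab in zip(rows, labels) if row[best] == v]
        branches[v] = grow_tree(sub_rows, sub_labels, [a for a in attributes if a != best])
    return (best, branches)            # internal node: (test attribute, subtree per value)

# Tiny illustration: a 4-person subset (Sarah, Emily, Alex, Dana) of the sunburn example used later.
rows = [{"hair": "blonde", "lotion": "no"},  {"hair": "red", "lotion": "no"},
        {"hair": "brown", "lotion": "yes"},  {"hair": "blonde", "lotion": "yes"}]
labels = ["sunburned", "sunburned", "none", "none"]
print(grow_tree(rows, labels, ["hair", "lotion"]))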
Choose an attribute to partition data

» The key to building a decision tree - which attribute to choose in order to branch.

» The objective is to reduce impurity or uncertainty in data as much as possible.


o A subset of data is pure if all instances belong to the same class.

» The best attribute is selected for splitting the training examples using a goodness function.
  o The best attribute:
    ✓ separates the classes of the training examples fastest, and
    ✓ yields the smallest tree.
Decision Tree…

Entropy:

» It is a degree of randomness of the elements, i.e. a measure of impurity or uncertainty.

» This measure is based on Claude Shannon's work in information theory, which studied the value or "information content" of messages.

» Shannon entropy quantifies this uncertainty in terms of the expected value of the information present in the message:

  o Entropy(D) = − Σ_{i=1}^{m} p_i log2(p_i)

  where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|, and m is the number of class labels.
Decision Tree

» Suppose a set D contains a total of N examples, of which n+ are positive and n− are negative outcomes. The entropy is given by:

  Entropy(D) = D(n+, n−) = −(n+/N) log2(n+/N) − (n−/N) log2(n−/N)

» Some useful properties of the entropy:
  o D(n+, n−) = D(n−, n+)
  o D(n, 0) = D(0, n) = 0
    ✓ D(S) = 0 means that all the examples in S are in the same class
  o D(m, m) = 1 means that half the examples in S are of one class and half are of the opposite class.
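A quick numeric check of these properties (a minimal Python sketch; the 9-positive/5-negative split is the buys_computer example used on the next slides):

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))   # 0.94 -> the Info(D) value used in the next slides
print(round(entropy([7, 7]), 3))   # 1.0  -> D(m, m) = 1: a 50/50 split has maximum uncertainty
# A pure set, e.g. entropy([14, 0]), evaluates to 0: D(n, 0) = 0.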
Decision tree…

Information Gain:

» It is defined as the difference between the original information requirement (i.e. based on
just the proportion of classes) and the new requirement (i.e. obtained after partitioning on
attribute A)

» Select the attribute with the highest information gain, i.e. the one that creates the smallest average disorder:
  ✓ First, compute the disorder using entropy:
    • the expected information needed to classify objects into classes
  ✓ Second, measure the information gain:
    • calculate how much the disorder of a set would be reduced by knowing the value of a particular attribute.
Decision Tree …

» Select the attribute with the highest information gain

» Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated
by |Ci, D|/|D|

» Expected information (entropy) needed to classify a tuple in D:

  Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)

» Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

» Information gained by branching on attribute A:

  Gain(A) = Info(D) − Info_A(D)
Attribute Selection: Information Gain
» Class P: buys_computer = "yes"   (9 tuples)
» Class N: buys_computer = "no"    (5 tuples)

  Info(D) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

  Info_age(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

    age       p_i   n_i   I(p_i, n_i)
    <=30       2     3    0.971
    31...40    4     0    0
    >40        3     2    0.971

  (5/14) I(2, 3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

  Gain(age) = Info(D) − Info_age(D) = 0.246

  Similarly,
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

  Training data (buys_computer):

    age       income   student   credit_rating   buys_computer
    <=30      high     no        fair            no
    <=30      high     no        excellent       no
    31...40   high     no        fair            yes
    >40       medium   no        fair            yes
    >40       low      yes       fair            yes
    >40       low      yes       excellent       no
    31...40   low      yes       excellent       yes
    <=30      medium   no        fair            no
    <=30      low      yes       fair            yes
    >40       medium   yes       fair            yes
    <=30      medium   yes       excellent       yes
    31...40   medium   no        excellent       yes
    31...40   high     yes       fair            yes
    >40       medium   no        excellent       no
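A quick Python check of the Gain(age) calculation; the helper I(p, n) is just the two-class entropy defined on the previous slides:

from math import log2

def I(p, n):   # two-class entropy of a node with p positive and n negative tuples
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)

info_D = I(9, 5)
info_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
print(round(info_D, 3))              # 0.94
print(round(info_age, 3))            # 0.694
print(round(info_D - info_age, 3))   # 0.247 (the slide's 0.246 rounds the intermediate values first)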
Computing Information-Gain for Continuous-Valued Attributes

» Let attribute A be a continuous-valued attribute

» Must determine the best split point for A

  o Sort the values of A in increasing order
  o Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    ✓ (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}
  o The point with the minimum expected information requirement for A is selected as the split-point for A

» Split:
  o D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point (see the sketch below)
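A minimal Python sketch of this split-point search, assuming a single numeric attribute with class labels; the ages and labels at the bottom are illustrative values, not data from the slides:

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    best, best_info = None, float("inf")
    for (a_i, _), (a_next, _) in zip(pairs, pairs[1:]):
        if a_i == a_next:
            continue
        split = (a_i + a_next) / 2                       # midpoint (a_i + a_{i+1}) / 2
        left = [lab for v, lab in pairs if v <= split]   # D1: A <= split-point
        right = [lab for v, lab in pairs if v > split]   # D2: A >  split-point
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if info < best_info:                             # minimum expected information requirement
            best, best_info = split, info
    return best

# Illustrative ages paired with buys_computer-style labels:
print(best_split_point([22, 25, 31, 35, 40, 45], ["no", "no", "yes", "yes", "yes", "no"]))  # 28.0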
Gain Ratio for Attribute Selection (C4.5)

» Information gain measure is biased towards attributes with a large number of values.

» C4.5 (a successor of ID3) uses gain ratio to overcome the problem of information gain.
  SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

  ✓ GainRatio(A) = Gain(A) / SplitInfo_A(D)

» Ex.
✓ gain_ratio(income) = 0.029/1.557 = 0.019

» The attribute with the maximum gain ratio is selected as the splitting attribute
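A quick check of this example; the partition sizes 4, 6 and 4 are the counts of income = high, medium and low in the 14-tuple buys_computer table above:

from math import log2

def split_info(sizes):
    n = sum(sizes)
    return -sum((s / n) * log2(s / n) for s in sizes)

split_info_income = split_info([4, 6, 4])
print(round(split_info_income, 3))           # 1.557
print(round(0.029 / split_info_income, 3))   # GainRatio(income) ~= 0.019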
Gini Index (CART, IBM IntelligentMiner)

» If a data set D contains examples from n classes, the Gini index gini(D) is defined as

  gini(D) = 1 − Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in D.

» If a data set D is split on A into two subsets D1 and D2, the Gini index gini_A(D) is defined as

  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

» Reduction in impurity:  Δgini(A) = gini(D) − gini_A(D)

» The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (this requires enumerating all the possible splitting points for each attribute).
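A minimal Python sketch of the Gini computation; the 9/5 class counts are the buys_computer data, and the 10-vs-4 binary split shown corresponds to income in {low, medium} vs income in {high}, with class counts taken from that table:

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(round(gini([9, 5]), 3))                 # gini(D) = 0.459

# D1: income in {low, medium} -> 10 tuples (7 yes, 3 no); D2: income = high -> 4 tuples (2 yes, 2 no)
gini_split = 10/14 * gini([7, 3]) + 4/14 * gini([2, 2])
print(round(gini_split, 3))                   # gini_income(D) ~= 0.443
print(round(gini([9, 5]) - gini_split, 3))    # reduction in impurity ~= 0.016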
Example-Decision tree

The problem of "Sunburn": you want to predict whether a person is likely to get sunburned when they go back to the beach. How can you do this? Predict based on the observed properties of the people in the data collected below:
Name Hair Height Weight Lotion Result
Sarah Blonde Average Light No Sunburned
Dana Blonde Tall Average Yes None
Alex Brown Short Average Yes None
Annie Blonde Short Average No Sunburned
Emily Red Average Heavy No Sunburned
Pete Brown Tall Heavy No None
John Brown Average Heavy No None
Kate Blonde Short Light Yes None
Decision tree

» To find the splitting criterion for these tuples, we must compute the information gain of each attribute.

» Compute the expected information needed to classify a tuple in D:

  Info(D) = I(3, 5) = −(3/8) log2(3/8) − (5/8) log2(5/8) = 0.954

» The expected information needed to classify a tuple in D if the tuples are partitioned according to hair is:

  Info_hair(D) = (4/8) I(2, 2) + (1/8) I(1, 0) + (3/8) I(0, 3) = 0.50
Decision tree

» The gain in information from such a partitioning would be:

  Gain(hair) = Info(D) − Info_hair(D) = 0.954 − 0.50 = 0.454

» The calculation for height, weight and lotion is performed in the same way as for the hair attribute:

  Attribute   Expected information Info_A(D)
  Hair        0.50
  Height      0.69
  Weight      0.94
  Lotion      0.61
Decision tree

» Information gain of each attribute

✓ Gain(hair) = 0.954 - 0.50 = 0.454

✓ Gain(height) = 0.954 - 0.69 = 0.264

✓ Gain(weight) = 0.954 - 0.94 = 0.014

✓ Gain (lotion) =0.954 - 0.61 = 0.344

» Which decision variable maximises the Info Gain?

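Hair colour maximises the information gain. A short Python check that recomputes all four gains directly from the 8-person table:

from math import log2

data = [  # (hair, height, weight, lotion, result)
    ("blonde", "average", "light",   "no",  "sunburned"),
    ("blonde", "tall",    "average", "yes", "none"),
    ("brown",  "short",   "average", "yes", "none"),
    ("blonde", "short",   "average", "no",  "sunburned"),
    ("red",    "average", "heavy",   "no",  "sunburned"),
    ("brown",  "tall",    "heavy",   "no",  "none"),
    ("brown",  "average", "heavy",   "no",  "none"),
    ("blonde", "short",   "light",   "yes", "none"),
]

def info(rows):
    n = len(rows)
    counts = [sum(r[-1] == c for r in rows) for c in set(r[-1] for r in rows)]
    return -sum(k / n * log2(k / n) for k in counts if k)

def gain(idx):
    info_a = sum(
        len(part) / len(data) * info(part)
        for v in set(r[idx] for r in data)
        for part in [[r for r in data if r[idx] == v]]
    )
    return info(data) - info_a

for idx, name in enumerate(["hair", "height", "weight", "lotion"]):
    print(name, round(gain(idx), 3))
# prints: hair 0.454, height 0.266, weight 0.016, lotion 0.348
# (height/weight/lotion differ from the slide's 0.264/0.014/0.344 only because
#  the slide rounds the intermediate Info_A values to two decimals)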
The best decision tree?

is_sunburned (split on hair colour):
  Hair colour = red    -> Sunburned
  Hair colour = brown  -> None
  Hair colour = blonde -> ?   (Sunburned = Sarah, Annie;  None = Dana, Kate)

» Once we have finished with hair colour, we then need to calculate the remaining branches of the decision tree.

» Which attribute is better for classifying the remaining examples?
The best Decision Tree

» This is the simplest and optimal tree possible, and it makes a lot of sense.
» It classifies 4 of the people on hair colour alone.

is_sunburned:
  Hair colour = red    -> Sunburned
  Hair colour = brown  -> None
  Hair colour = blonde -> Lotion used?
      Lotion used = no   -> Sunburned
      Lotion used = yes  -> None
Decision Tree: Rule Extraction from Trees

» A decision tree does its own feature extraction

» You can view the decision tree as an IF-THEN-ELSE rule which tells us whether someone will suffer from sunburn:

def is_sunburned(hair_colour, lotion_used):
    if hair_colour == "red":
        return True                  # sunburned = yes
    elif hair_colour == "blonde" and lotion_used == "no":
        return True                  # sunburned = yes
    else:
        return False
Avoid overfitting in classification

Overfitting: a tree may overfit the training data
  o Good accuracy on training data but poor accuracy on test data
  o Symptoms: the tree is too deep and has too many branches, some of which may reflect anomalies due to noise or outliers

Two approaches to avoid overfitting:

» Pre-pruning: halt tree construction early
  o A tree is "pruned" by halting its construction early (e.g., by deciding not to further split or partition the subset of training tuples at a given node).
  o It is difficult to decide when to halt, because we do not know what may happen subsequently if we keep growing the tree.
  o Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset of tuples, or the probability distribution of those tuples.
Avoid overfitting in classification

» Training-set and test-set errors for decision trees (figure not reproduced).
Avoid overfitting in classification

» Post-pruning: remove branches or sub-trees from a "fully grown" tree.

» A sub-tree at a given node is pruned by removing its branches and replacing it with a leaf.

» The leaf is labeled with the most frequent class among the tuples of the sub-tree being replaced.
  ✓ C4.5 uses a statistical method to estimate the errors at each node for pruning.
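As an illustration only (not the slides' C4.5 procedure), here is a hedged scikit-learn sketch contrasting pre-pruning (halting growth early via max_depth / min_samples_leaf) with post-pruning (cost-complexity pruning via ccp_alpha); the built-in breast-cancer dataset is just a convenient example choice:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier().fit(X_tr, y_tr)                              # fully grown tree
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_tr, y_tr)   # pre-pruned
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_tr, y_tr)                   # post-pruned

for name, m in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    # Overfitting symptom: high training accuracy but noticeably lower test accuracy.
    print(name, "train:", round(m.score(X_tr, y_tr), 3), "test:", round(m.score(X_te, y_te), 3))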
Avoid overfitting in classification

» Figure (not reproduced): an unpruned decision tree and a pruned version of it.
Decision Tree

» The benefits of having a decision tree are as follows:

  o It does not require any domain knowledge.

  o It is easy to comprehend.

  o The learning and classification steps of a decision tree are simple and fast.
Exercises

» Suppose that you have a free afternoon and you are thinking about whether or not to go and play tennis. How do you do that?
  o The goal is to predict when this player will play tennis.
  o The following training data examples are prepared for the classifier (see the 'Play Tennis' data set later in this chapter).
Assignment 2

1. Write a Python program implementing a Decision Tree classifier.

Submit via [email protected] before July 12 2014


Bayesian Classification

» Bayesian classifiers are statistical classifiers which combine prior knowledge of the classes with new evidence gathered from data.

» They can predict class membership probabilities:
  o the probability that a given tuple belongs to a particular class.

» For each new sample they provide a probability that the sample belongs to a class (for all classes).

» Bayesian classification is based on Bayes' theorem, which provides practical learning algorithms:
  o Probabilistic learning: calculate explicit probabilities for a hypothesis, e.g. naïve Bayes.
  o The naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with a strong (naïve) independence assumption between the features.
Bayes’ Theorem: Basics

» Let X be a data tuple; it is considered the "evidence".

» It is described by measurements made on a set of n attributes.

» Let H be a hypothesis that X belongs to class C.

» P(H|X) is the posterior probability of H conditioned on X.

» For example, suppose our world of data tuples is confined to customers described by the attributes age and income, and that X is a 35-year-old customer with an income of $40,000.
  o Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income.

» P(H) is the prior probability, or a priori probability, of H.

Bayes’ Theorem: Basics
» P(H) is the probability that any given customer will buy a computer, regardless of age, income or any other information, for that matter.

» P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.

» P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer X is 35 years old and earns $40,000, given that we know the customer will buy a computer.

» How are these probabilities estimated?

» Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(X), P(H) and P(X|H):

  o P(H|X) = P(X|H) P(H) / P(X)

  o posterior = (likelihood × prior) / evidence
Bayes Classification

Example :

» A doctor knows that meningitis causes stiff neck 50% of the time

» Prior probability of any patient having meningitis is 1 / 50,000

» Prior probability of any patient having stiff neck is 1 / 20

» If a patient has stiff neck, what’s the probability he/she has meningitis?

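A quick numeric answer using Bayes' theorem, P(meningitis | stiff neck) = P(stiff neck | meningitis) × P(meningitis) / P(stiff neck):

p_s_given_m = 0.5          # P(stiff neck | meningitis)
p_m = 1 / 50_000           # prior P(meningitis)
p_s = 1 / 20               # prior P(stiff neck)
print(round(p_s_given_m * p_m / p_s, 6))   # 0.0002 -> meningitis is still very unlikely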
Bayes Classification
Example : A medical cancer diagnosis problem. There are 2 possible outcomes of a diagnosis: +ve, -ve. We
know 0.8% of world population has cancer. Test gives correct +ve result 98% of the time and gives correct –ve
result 97% of the time. If a patient’s test returns +ve, should we diagnose the patient as having cancer?
Given:
  P(cancer) = 0.008            P(no-cancer) = 0.992
  P(+ve | cancer) = 0.98       P(-ve | cancer) = 0.02
  P(+ve | no-cancer) = 0.03    P(-ve | no-cancer) = 0.97

Using Bayes' formula:

  o P(cancer | +ve) = P(+ve | cancer) × P(cancer) / P(+ve)
                    = 0.98 × 0.008 / P(+ve) = 0.0078 / P(+ve)

  o P(no-cancer | +ve) = P(+ve | no-cancer) × P(no-cancer) / P(+ve)
                       = 0.03 × 0.992 / P(+ve) = 0.0298 / P(+ve)

Since 0.0298 > 0.0078, the patient most likely does not have cancer.
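Normalizing by P(+ve) = 0.0078 + 0.0298 gives the actual posterior probabilities; a quick Python check:

p_pos_cancer, p_cancer = 0.98, 0.008
p_pos_nocancer, p_nocancer = 0.03, 0.992

num_cancer = p_pos_cancer * p_cancer          # 0.00784
num_nocancer = p_pos_nocancer * p_nocancer    # 0.02976
p_pos = num_cancer + num_nocancer             # P(+ve) by the law of total probability

print(round(num_cancer / p_pos, 3))           # P(cancer | +ve)    ~= 0.209
print(round(num_nocancer / p_pos, 3))         # P(no-cancer | +ve) ~= 0.791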


General Bayes Theorem
» Consider each attribute & class label as random variables

» Given a record with attributes (A1, A2,…,An)


o Goal is to predict class C
o we want to find the value of C that maximizes P(C| A1, A2,…,An )

» Can we estimate P(C| A1, A2,…,An ) directly from data?


  o Approach: compute the posterior probability P(C | A1, A2, ..., An) for all values of C using Bayes' theorem:

      P(C | A1, A2, ..., An) = P(A1, A2, ..., An | C) P(C) / P(A1, A2, ..., An)

  o Choose the value of C that maximizes P(C | A1, A2, ..., An)
  o Equivalent to choosing the value of C that maximizes P(A1, A2, ..., An | C) P(C)

» How to estimate P(A1, A2, …, An | C )?


Naïve Bayes Classifier

» Assume independence among the attributes Ai when the class is given:

  o P(A1, A2, ..., An | Cj) = P(A1 | Cj) P(A2 | Cj) ... P(An | Cj)

  o We can estimate P(Ai | Cj) for all Ai and Cj from the training data.

  o A new point is classified to class Cj if P(Cj) Π_i P(Ai | Cj) is maximal:

      C_NaiveBayes = argmax_j  P(Cj) Π_i P(Ai | Cj)
Naïve Bayes Classifier: Training Dataset

Classes:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'

Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Training data:

  age       income   student   credit_rating   buys_computer
  <=30      high     no        fair            no
  <=30      high     no        excellent       no
  31...40   high     no        fair            yes
  >40       medium   no        fair            yes
  >40       low      yes       fair            yes
  >40       low      yes       excellent       no
  31...40   low      yes       excellent       yes
  <=30      medium   no        fair            no
  <=30      low      yes       fair            yes
  >40       medium   yes       fair            yes
  <=30      medium   yes       excellent       yes
  31...40   medium   no        excellent       yes
  31...40   high     yes       fair            yes
  >40       medium   no        excellent       no
Naïve Bayes Classifier: An Example

» P(Ci):  P(buys_computer = "yes") = 9/14 = 0.643;  P(buys_computer = "no") = 5/14 = 0.357

» Compute P(X|Ci) for each class:
  P(age = "<=30"           | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30"           | buys_computer = "no")  = 3/5 = 0.6
  P(income = "medium"      | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium"      | buys_computer = "no")  = 2/5 = 0.4
  P(student = "yes"        | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes"        | buys_computer = "no")  = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no")  = 2/5 = 0.4

» X = (age <= 30, income = medium, student = yes, credit_rating = fair)

  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no")  = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.044 × 0.643 = 0.028
  P(X | buys_computer = "no")  × P(buys_computer = "no")  = 0.019 × 0.357 = 0.007

  Therefore, X belongs to the class ("buys_computer = yes").
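A minimal Python sketch (not from the slides) that reproduces the calculation above, assuming the 14-tuple training table is encoded as a list of tuples:

from collections import Counter

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]

priors = Counter(r[-1] for r in rows)              # class counts: yes = 9, no = 5

def likelihood(attr_idx, value, cls):
    in_class = [r for r in rows if r[-1] == cls]
    return sum(r[attr_idx] == value for r in in_class) / len(in_class)

x = ("<=30", "medium", "yes", "fair")              # tuple to classify
for cls in priors:
    score = priors[cls] / len(rows)                # P(Ci)
    for i, v in enumerate(x):
        score *= likelihood(i, v, cls)             # multiply by P(Ai | Ci)
    print(cls, round(score, 3))                    # no ~0.007, yes ~0.028 -> buys_computer = yes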
Example: 'Play Tennis' data

» Suppose that you have a free afternoon and you are thinking about whether or not to go and play tennis. How do you do that?
  ✓ Based on the following training data, predict when this player will play tennis.

  Day     Outlook    Temperature   Humidity   Wind     Play Tennis
  Day1    Sunny      Hot           High       Weak     No
  Day2    Sunny      Hot           High       Strong   No
  Day3    Overcast   Hot           High       Weak     Yes
  Day4    Rain       Mild          High       Weak     Yes
  Day5    Rain       Cool          Normal     Weak     Yes
  Day6    Rain       Cool          Normal     Strong   No
  Day7    Overcast   Cool          Normal     Strong   Yes
  Day8    Sunny      Mild          High       Weak     No
  Day9    Sunny      Cool          Normal     Weak     Yes
  Day10   Rain       Mild          Normal     Weak     Yes
  Day11   Sunny      Mild          Normal     Strong   Yes
  Day12   Overcast   Mild          High       Strong   Yes
  Day13   Overcast   Hot           Normal     Weak     Yes
  Day14   Rain       Mild          High       Strong   No
Naïve Bayes Classifier

» Given the training set, we can compute the probabilities:

  P(P) = P(Play = Yes) = 9/14
  P(N) = P(Play = No)  = 5/14

  Outlook      P     N          Humidity    P     N
  sunny       2/9   3/5         high       3/9   4/5
  overcast    4/9    0          normal     6/9   1/5
  rain        3/9   2/5

  Temperature  P     N          Wind        P     N
  hot         2/9   2/5         strong     3/9   3/5
  mild        4/9   2/5         weak       6/9   2/5
  cool        3/9   1/5
Play-tennis example

» Based on the model created, predict Play Tennis or Not for the following unseen sample:
  (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)

  C_NB = argmax_{C in {yes, no}}  P(C) Π_t P(a_t | C)
       = argmax_{C in {yes, no}}  P(C) P(Outlook = sunny | C) P(Temp = cool | C) P(Hum = high | C) P(Wind = strong | C)

» Working:
  P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = (9/14)(2/9)(3/9)(3/9)(3/9) = 0.0053
  P(no)  P(sunny | no)  P(cool | no)  P(high | no)  P(strong | no)  = (5/14)(3/5)(1/5)(4/5)(3/5) = 0.0206

  => answer: PlayTennis = no

  o More example: what if the following test data is given?
    X = (Outlook = rain, Temperature = hot, Humidity = high, Wind = weak)
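For this extra test case, the same score can be read off the conditional-probability tables above; a short Python check:

# Conditional probabilities taken from the tables above: (P(value | yes), P(value | no)).
cond = {
    ("outlook", "rain"):       (3/9, 2/5),
    ("temperature", "hot"):    (2/9, 2/5),
    ("humidity", "high"):      (3/9, 4/5),
    ("wind", "weak"):          (6/9, 2/5),
}
p_yes, p_no = 9/14, 5/14
score_yes, score_no = p_yes, p_no
for (p_y, p_n) in cond.values():
    score_yes *= p_y
    score_no *= p_n
print(round(score_yes, 4), round(score_no, 4))   # ~0.0106 vs ~0.0183
print("PlayTennis =", "yes" if score_yes > score_no else "no")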
Naïve Bayes Classifier: Comments

» Advantages
  o Easy to implement
  o Good results obtained in most cases

» Disadvantages
  o Assumption of class-conditional independence, therefore loss of accuracy
  o In practice, dependencies exist among variables
    ✓ E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    ✓ Dependencies among these cannot be modeled by a naïve Bayes classifier

» How to deal with these dependencies? Bayesian belief networks.
Assignment 2

1. Write Python programs implementing the Decision Tree and Naïve Bayes classifier machine learning methods.

Submit via [email protected] before July 12 2022
