Simple Learning Algorithms: Jiming Peng, AdvOL, CAS, McMaster
1R learning;
Bayes Model;
Decision Tree;
Covering algorithm;
Mining for Association Rules
Linear models for numeric prediction;
Instance-based learning.
Reading Materials: Chapter 4 of the textbook
by Witten et al.; Sections 6.1, 6.2, 7.1-7.4, and 7.8
of the textbook by Han.
The weather data:

Outlook    Temper.  Humidity  Windy  Play
Sunny      Hot      High      False  No
Sunny      Hot      High      True   No
Overcast   Hot      High      False  Yes
Rainy      Mild     High      False  Yes
Rainy      Cool     Normal    False  Yes
Rainy      Cool     Normal    True   No
Overcast   Cool     Normal    True   Yes
Sunny      Mild     High      False  No
Sunny      Cool     Normal    False  Yes
Rainy      Mild     Normal    False  Yes
Sunny      Mild     Normal    True   Yes
Overcast   Mild     High      True   Yes
Overcast   Hot      Normal    False  Yes
Rainy      Mild     High      True   No
In total, there are 6 instances whose temperature is mild; four of them have final decision
Yes and two have No. The corresponding rule is
If Temper. = mild then Play = Yes
Error rate: 2/6.
1R algorithm

Attribute  Rules              Errors  Total err.
Outlook    sunny -> no        2/5     4/14
           overcast -> yes    0/4
           rainy -> yes       2/5
Temper.    hot -> no          2/4     5/14
           mild -> yes        2/6
           cool -> yes        1/4
Humidity   high -> no         3/7     4/14
           normal -> yes      1/7
Windy      false -> yes       2/8     5/14
           true -> no         3/6
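A minimal Python sketch of the 1R procedure on the weather data above (the dataset literal, function and variable names are mine, not code from the textbook):

from collections import Counter, defaultdict

weather = [  # (outlook, temper., humidity, windy, play), as in the table above
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
attributes = ["outlook", "temper.", "humidity", "windy"]

def one_r(instances, n_attrs):
    """Return (attribute index, rules, total errors) of the best single-attribute rule set."""
    best = None
    for a in range(n_attrs):
        counts = defaultdict(Counter)          # attribute value -> class frequencies
        for row in instances:
            counts[row[a]][row[-1]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}   # majority class per value
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best

a, rules, errors = one_r(weather, len(attributes))
print(attributes[a], rules, "errors:", errors, "/", len(weather))
# outlook {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'} errors: 4 / 14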
Discretization

Sort the values of temperature and write the corresponding Play class underneath. Placing a breakpoint wherever the class changes gives the partition

64 | 65 | 68 69 70 | 71 72 | 72 75 75 | 80 | 81 83 | 85
Y  | No | Y  Y  Y  | No No | ?Y Y  Y  | No | Y  Y  | No

(the "?" marks the value 72, which occurs once with Yes and once with No).

How about the whole sorted sequence?

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y  No Y  Y  Y  No No Y  Y  Y  No Y  Y  No
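The breakpoint idea can be sketched in a few lines of Python (the helper name is mine; the closing remark about a minimum bucket size follows the 1R discretization described in the textbook):

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
plays = ["Y", "No", "Y", "Y", "Y", "No", "No", "Y", "Y", "Y", "No", "Y", "Y", "No"]

def naive_breakpoints(values, classes):
    """Split points halfway between adjacent values whose classes differ."""
    points = []
    for i in range(1, len(values)):
        # A class change between two equal values (the two 72s) cannot be used
        # as a split point, which is why that 72 is marked with "?" above.
        if classes[i] != classes[i - 1] and values[i] != values[i - 1]:
            points.append((values[i - 1] + values[i]) / 2)
    return points

print(naive_breakpoints(temps, plays))   # [64.5, 66.5, 70.5, 77.5, 80.5, 84.0]
# Many small intervals result; to avoid overfitting, 1R additionally requires a
# minimum number of majority-class instances per interval before accepting a split.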
Statistical Modelling
Basic assumptions: attributes are equally
important and statistically independent.
These idealized assumptions are never met in practice, but the scheme works well!
The weather data with probabilities

             Counts           Fractions
             yes   no         yes    no
Outlook
  Sunny       2     3         2/9    3/5
  Overcast    4     0         4/9    0/5
  Rainy       3     2         3/9    2/5
Temper.
  Hot         2     2         2/9    2/5
  Mild        4     2         4/9    2/5
  Cool        3     1         3/9    1/5
Humidity
  High        3     4         3/9    4/5
  Normal      6     1         6/9    1/5
Windy
  False       6     2         6/9    2/5
  True        3     3         3/9    3/5
Play          9     5         9/14   5/14
Consider the weather problem with the new instance (sunny, cool, high, true, ?).

Pr[yes|E] Pr[E] = Pr[sunny|yes] Pr[cool|yes] Pr[high|yes] Pr[true|yes] Pr[yes]
               = (2/9)(3/9)(3/9)(3/9)(9/14) = 0.0053.

Similarly, Pr[no|E] Pr[E] = (3/5)(1/5)(4/5)(3/5)(5/14) = 0.0206, so after normalization
P(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5% and P(no) = 79.5%.

Note that Pr[overcast|no] = 0/5 in the table, so any instance with outlook = overcast would get
likelihood zero for class no. Does it make sense to claim that the likelihood is zero? If not,
how should we deal with this issue?
Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator).
In some cases adding a constant different from 1 might be more appropriate. For the attribute
outlook and class yes, the fractions 2/9, 4/9, 3/9 become

(2 + a)/(9 + a + b + g),  (4 + b)/(9 + a + b + g),  (3 + g)/(9 + a + b + g),

with weights satisfying a + b + g = 1, a > 0, b > 0, g > 0.
Extra merit: missing values are simply not counted,
both in training and in prediction!
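A minimal sketch of naive Bayes on the nominal weather data with an add-one (Laplace) estimator; the function and variable names are mine:

from collections import Counter

data = [  # (outlook, temper., humidity, windy, play), the weather data above
    ("sunny","hot","high","false","no"), ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"), ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"), ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"), ("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"), ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"), ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"), ("rainy","mild","high","true","no"),
]

class_counts = Counter(row[-1] for row in data)
# value_counts[(attribute index, value, class)] = count
value_counts = Counter((a, row[a], row[-1]) for row in data for a in range(4))
# number of distinct values per attribute (needed for the Laplace denominator)
n_values = [len({row[a] for row in data}) for a in range(4)]

def likelihood(instance, cls, laplace=1.0):
    """Pr[class] * product over attributes of Pr[value | class], with add-one smoothing."""
    p = class_counts[cls] / len(data)
    for a, v in enumerate(instance):
        p *= (value_counts[(a, v, cls)] + laplace) / (class_counts[cls] + laplace * n_values[a])
    return p

instance = ("sunny", "cool", "high", "true")
scores = {c: likelihood(instance, c) for c in class_counts}
total = sum(scores.values())
for c, s in scores.items():
    print(c, round(s / total, 3))
# yes ~ 0.28, no ~ 0.72 with smoothing (cf. 20.5% / 79.5% without it)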
Numeric attributes are handled by assuming a normal (Gaussian) distribution and estimating
its mean and standard deviation from the training data:

mu = (1/n) Σ_{i=1}^{n} x_i,        sigma = sqrt( Σ_{i=1}^{n} (x_i − mu)² / (n − 1) ).

For the weather problem, if the attribute temperature has a mean of 73 and a standard
deviation of 6.2 for class yes, then the density function gives

f(temperature = 66 | yes) = 1 / (sqrt(2π) · 6.2) · exp( −(66 − 73)² / (2 · 6.2²) ) = 0.0340,
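A quick check of this value, assuming the normal density with the quoted mean and standard deviation:

import math

def normal_density(x, mu, sigma):
    """Gaussian probability density at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(round(normal_density(66, 73, 6.2), 4))   # 0.034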
The weather data with numeric temperature and humidity:

            Outlook        Temper.        Humidity            Windy         Play
            yes   no       yes    no      yes    no           yes  no       yes   no
Sunny        2     3        83     85      86     85      F    6    2        9     5
Overcast     4     0        70     80      96     90      T    3    3
Rainy        3     2        68     65      80     70
                            ...    ...     ...    ...
mean                        73     74.6    79.1   86.2
std. dev.                   6.2    7.9     10.2   9.7

Sunny       2/9   3/5                                     F   6/9  2/5      9/14  5/14
Overcast    4/9   0/5                                     T   3/9  3/5
Rainy       3/9   2/5
Probability densities
Relationship between probability and density:
Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt.
Decision trees
Normal procedure: top down in recursive
divide-and-conquer fashion
1. Attribute is selected for root node and
branch is created for each possible attribute value
2. The instances are split into subsets (one
for each branch extending from the node)
3. Procedure is repeated recursively for
each branch, using only instances that
reach the branch
4. The process stops when all instances in a subset have the same class.
Issue: how to select the splitting attribute?
Criterion: the best attribute is the one leading to the smallest tree.
Trick: choose the attribute that produces the purest nodes.
How do we measure purity? With information gain!
Computing information
Information gain increases with the average
purity of the subsets that an attribute produces
Strategy: choose the attribute that results
in the greatest information gain.
Information is measured in bits:
1. Given a probability distribution, the information required to predict an event is the
distribution's entropy.
2. Entropy gives the required information in bits (it can involve fractions of a bit!).
Formula for computing the entropy:
entropy(p_1, p_2, ..., p_n) = − Σ_{i=1}^{n} p_i log2(p_i).
More Examples
Outlook = Overcast:
info([4, 0]) = entropy(1, 0) = −1 · log2(1) = 0 bits;
Outlook = Rainy:
info([3, 2]) = entropy(3/5, 2/5) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971 bits;
(Outlook = Sunny gives info([2, 3]) = 0.971 bits as well.)
Expected information for the attribute:
info([3, 2], [4, 0], [3, 2]) = (10/14) · 0.971 + (4/14) · 0 = 0.693 bits.
Information gain: the gap between the information before and after the split.
Information gain for the weather problem:
gain(outlook) = info([9, 5]) − info([3, 2], [4, 0], [3, 2]) = 0.940 − 0.693 = 0.247 bits;
gain(temper.) = 0.029 bits;
gain(humidity) = 0.152 bits;
gain(windy) = 0.048 bits.
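A short sketch (helper names are mine) that reproduces these numbers:

import math

def entropy(*probs):
    """Entropy in bits of a probability distribution (0 log 0 is taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info(*class_counts):
    """Expected information: subset-size-weighted average entropy of the subsets."""
    total = sum(sum(c) for c in class_counts)
    return sum(sum(c) / total * entropy(*(x / sum(c) for x in c)) for c in class_counts)

before = info([9, 5])                  # information of the full data set
after  = info([2, 3], [4, 0], [3, 2])  # after splitting on outlook (sunny/overcast/rainy)
print(round(before, 3), round(after, 3), round(before - after, 3))
# 0.94 0.694 0.247  (the slides round the middle value to 0.693)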
Avoiding overfitting
Trouble: attributes with a large number of values (extreme case: an ID code) produce subsets
that are more likely to be pure. Information gain is therefore biased towards choosing
attributes with a large number of values, which leads to so-called overfitting (selection
of an attribute that is useless for prediction).
Remedy: use the gain ratio,
gain ratio(a) = gain(a) / split info(a),
where split info(a) is the information of the split itself, e.g. info([5, 4, 5]) for outlook.
Attribute   Info.   Info Gain   Split Info        Gain Ratio
outlook     0.693   0.247       info([5,4,5])     0.156
temper.     0.911   0.029       1.362             0.021
humidity    0.788   0.152       info([7,7]) = 1   0.152
windy       0.892   0.048       0.985             0.049
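A quick check of the split info and gain ratio for outlook (illustrative code, not from the textbook):

import math

sizes = [5, 4, 5]          # sizes of the sunny / overcast / rainy subsets
total = sum(sizes)
split_info = -sum(s / total * math.log2(s / total) for s in sizes)   # info([5,4,5])
print(round(split_info, 3), round(0.247 / split_info, 3))
# 1.577 0.157  (the table shows 0.156)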
Covering algorithm
A decision tree can be converted into a rule set by a straightforward conversion, but this
usually leads to a very complex rule set; efficient conversions are useful but not easy to find.
The covering approach generates a rule set directly, excluding instances in other classes.
Key idea: find a rule set that covers all the instances in one class.
Consider the problem of classifying a set of points in the plane belonging to two classes,
denoted by circles and boxes. We can start with
If ? then the point belongs to the circle class.
It covers all the instances in the circle class, but it is too general.
Adding a pre-condition (x > 1), we get:
If x > 1, then class is circle.
The rule now covers only some of the circle instances (and possibly some box instances), so we
need further conditions and more rules to cover the remaining circle instances and exclude the
box instances.
Figure: Covering. The plane is split by the lines x = 1 and y = 1.5.
Age             Spect. prescr.   Astig.  Tear prod.  Rec. lenses
Pre-presbyopic  Hypermetrope     Yes     Reduced     None
Pre-presbyopic  Hypermetrope     Yes     Normal      None
Presbyopic      Myope            Yes     Reduced     None
Presbyopic      Myope            Yes     Normal      Hard
Presbyopic      Hypermetrope     Yes     Reduced     None
Presbyopic      Hypermetrope     Yes     Normal      None
Pre-presbyopic  Myope            Yes     Normal      Hard
Pre-presbyopic  Myope            Yes     Reduced     None
Young           Hypermetrope     Yes     Normal      Hard
Young           Hypermetrope     Yes     Reduced     None
Young           Myope            Yes     Normal      Hard
Young           Myope            Yes     Reduced     None
Further refinement
Rule we seek: If astigmatism = yes and ? then recommendation = hard
Possible tests:
Age = Young                              2/4
Age = Pre-presbyopic                     1/4
Age = Presbyopic                         1/4
Spectacle prescription = Myope           3/6
Spectacle prescription = Hypermetrope    1/6
Tear production rate = Reduced           0/6
Tear production rate = Normal            4/6
Age             Spect. prescr.   Astig.  Tear prod.  Rec. lenses
Pre-presbyopic  Hypermetrope     Yes     Normal      None
Presbyopic      Myope            Yes     Normal      Hard
Presbyopic      Hypermetrope     Yes     Normal      None
Pre-presbyopic  Myope            Yes     Normal      Hard
Young           Hypermetrope     Yes     Normal      Hard
Young           Myope            Yes     Normal      Hard
Further refinement
Rule we seek: If astigmatism = yes and tear production rate = normal and ? then recommendation = hard
Possible tests:
Age = Young                              2/2
Age = Pre-presbyopic                     1/2
Age = Presbyopic                         1/2
Spectacle prescription = Myope           3/3
Spectacle prescription = Hypermetrope    1/1
The first and the fourth test are both 100% accurate; between them we select the one with the
larger coverage, Spectacle prescription = Myope.
Resulting Table
Age             Spect. prescr.   Astig.  Tear prod.  Rec. lenses
Presbyopic      Myope            Yes     Normal      Hard
Pre-presbyopic  Myope            Yes     Normal      Hard
Young           Myope            Yes     Normal      Hard
The Final rule: If astigmatism = yes and tear production rate = normal and spectacle prescription =
myope then recommendation = hard
Second rule for recommending hard lenses: If age =
young and astigmatism = yes and tear production rate
= normal then recommendation = hard
PRISM
Pseudo-code for PRISM:
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    1. Create a rule R with an empty left-hand side that predicts class C
    2. Until R is perfect (or there are no more attributes to use) do
         For each attribute A not mentioned in R, and each value v,
           consider adding the condition A = v to the left-hand side of R
         Select A and v to maximize the accuracy p/t
           (break ties by choosing the condition with the largest p)
         Add A = v to R
    Remove the instances covered by R from E
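A minimal Python rendering of this pseudo-code; instances are assumed to be dicts whose "class" key holds the class label, and all names are mine:

def prism(instances, attributes, target_class):
    """Return a list of rules (attribute -> value dicts) that together cover target_class."""
    rules, E = [], list(instances)
    while any(x["class"] == target_class for x in E):
        rule, covered = {}, list(E)
        # Grow the rule until it is perfect or no attributes are left.
        while any(x["class"] != target_class for x in covered) and len(rule) < len(attributes):
            best = None
            for a in [att for att in attributes if att not in rule]:
                for v in {x[a] for x in covered}:
                    t = [x for x in covered if x[a] == v]
                    p = sum(x["class"] == target_class for x in t)
                    key = (p / len(t), p)          # accuracy p/t, ties broken by larger p
                    if best is None or key > best[0]:
                        best = (key, a, v)
            _, a, v = best
            rule[a] = v
            covered = [x for x in covered if x[a] == v]
        rules.append(rule)
        # Remove the instances covered by the finished rule from E.
        E = [x for x in E if not all(x[a] == v for a, v in rule.items())]
    return rules

Applied to the contact-lens data, this grows rules like the one derived above; note that ties in p/t (e.g. astigmatism = yes versus tear production rate = normal in the very first step) are broken by whichever condition happens to be examined first.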
Item sets
Examples of item sets for the weather data (with counts):
One-item sets:   outlook=sunny (5);  temperature=cool (4)
Two-item sets:   outlook=sunny temperature=mild (2);  outlook=sunny humidity=high (3)
Three-item sets: outlook=sunny temperature=hot humidity=high (2);  outlook=sunny humidity=high windy=false (2)
Association rules
Rules for the weather data with support > 1:
In total there are 3 rules with support four, 5 with support three, and 50 with support two.
Example rules from the same item set
Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2):
Temperature = Cool, Windy = False  ==>  Humidity = Normal, Play = Yes
Temperature = Cool, Windy = False, Humidity = Normal  ==>  Play = Yes
Temperature = Cool, Windy = False, Play = Yes  ==>  Humidity = Normal
Apriori Algorithm
We are looking for all high-confidence rules:
The support of the antecedent is obtained from the hash table.
(c+1)-consequent rules are built from c-consequent ones.
Observation: a (c+1)-consequent rule can only hold if all corresponding c-consequent rules also hold.
This works just like the procedure for large item sets.
Key steps from k-item sets to (k+1)-item sets (see the sketch after the tables below):
Create a table of potential candidate (k+1)-item sets from the hash table of k-item sets by
joining pairs of k-item sets, using the Apriori property of frequent item sets and the order
in the hash table to improve efficiency.
Remove non-promising candidates from the table by consulting the hash table of k-item sets.
Scan the whole data set to remove the candidates that do not satisfy the minimum support
requirement and obtain the frequent (k+1)-item sets.
TID   Items
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

C: candidate set;  L: frequent item set (minimum support 2).

C1 = L1:
Itemset   Support
I1        6
I2        7
I3        6
I4        2
I5        2

C2 (candidate 2-item sets):
Itemset    Support
{I1,I2}    4
{I1,I3}    4
{I1,I4}    1
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2
{I3,I4}    0
{I3,I5}    1
{I4,I5}    0

L2 (frequent 2-item sets):
Itemset    Support
{I1,I2}    4
{I1,I3}    4
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2

C3 = L3:
Itemset       Support
{I1,I2,I3}    2
{I1,I2,I5}    2
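A compact sketch of the level-wise join / prune / scan procedure on the transactions T1-T9 above, with minimum support 2 (variable names are mine):

from itertools import combinations

transactions = [
    {"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
    {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"},
]
min_support = 2

def support(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-item sets
L = [frozenset([i]) for i in sorted(set().union(*transactions))
     if support(frozenset([i])) >= min_support]
k = 1
while L:
    print(k, {tuple(sorted(s)): support(s) for s in L})   # Lk with supports
    # Join step: combine frequent k-item sets into (k+1)-item candidates.
    candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
    # Prune step (Apriori property): every k-subset of a candidate must be frequent.
    candidates = [c for c in candidates
                  if all(frozenset(s) in set(L) for s in combinations(c, k))]
    # Scan the data to keep only candidates that meet the minimum support.
    L = [c for c in candidates if support(c) >= min_support]
    k += 1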
Linear models
Work naturally with numeric attributes
Standard technique for numeric prediction: linear regression
Output is a linear combination of attributes
y = a_0 + a_1 x_1 + a_2 x_2 + ... + a_k x_k.
Weights are calculated from the training
data
Predicted value for the first training instance x^(1):

a_0 x_0^(1) + a_1 x_1^(1) + a_2 x_2^(1) + ... + a_k x_k^(1),   with x_0^(1) = 1.

All k + 1 coefficients are chosen so that the squared error on the training data,

Σ_{i=1}^{n} ( y^(i) − Σ_{j=0}^{k} a_j x_j^(i) )²,

is minimized. Setting the derivatives with respect to the coefficients to zero yields a system
of linear equations (the normal equations) that determines a_0, ..., a_k.
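A minimal numpy sketch of choosing the coefficients by least squares; the small data set here is made up purely for illustration:

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])   # training instances (x1, x2)
y = np.array([3.1, 2.9, 4.8, 7.2])                               # numeric targets

X1 = np.hstack([np.ones((len(X), 1)), X])        # prepend x0 = 1 for the intercept a0
a, *_ = np.linalg.lstsq(X1, y, rcond=None)       # minimises the sum of squared errors
print(a)                                         # coefficients a0, a1, a2
print(X1 @ a)                                    # predicted values on the training data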
Multiple-class classification
For a three-class problem with two-dimensional instances (x, y) belonging to the sets S1, S2, S3,
encode the classes by the indicator vectors (1, 0, 0)^T, (0, 1, 0)^T, (0, 0, 1)^T and choose the
linear functions a_i x + b_i y + c_i to minimize

f =   Σ_{(x,y)∈S1} ‖ (1, 0, 0)^T − (a_1 x + b_1 y + c_1,  a_2 x + b_2 y + c_2,  a_3 x + b_3 y + c_3)^T ‖²
    + Σ_{(x,y)∈S2} ‖ (0, 1, 0)^T − (a_1 x + b_1 y + c_1,  a_2 x + b_2 y + c_2,  a_3 x + b_3 y + c_3)^T ‖²
    + Σ_{(x,y)∈S3} ‖ (0, 0, 1)^T − (a_1 x + b_1 y + c_1,  a_2 x + b_2 y + c_2,  a_3 x + b_3 y + c_3)^T ‖².
Two-class classification
For a two-class classification problem, we change the label of each class to (1, 0)^T and
(0, 1)^T, respectively.
Another way to handle binary classification is to perform a regression for each class first
(membership value 1 for instances of that class, 0 for the others), giving two linear functions

f_1(a) = x_0^1 + a_1 x_1^1 + a_2 x_2^1 + ... + a_k x_k^1,
f_2(a) = x_0^2 + a_1 x_1^2 + a_2 x_2^2 + ... + a_k x_k^2,

where a = (a_1, ..., a_k) is the instance and x^1, x^2 are the coefficient vectors from the two
regressions; a new instance is assigned to the class whose function gives the larger value.
Figure: the two regression lines (e.g. y = a_2 x + b_2).
Pairwise regression
Another regression model for classification:
A regression is performed for each pair of classes, using only the instances from these two classes.
An output of +1 is assigned to one member of the pair and −1 to the other.
The class receiving the most votes is predicted.
What do we do if there is no agreement?
Pairwise regression may be more accurate, but it is expensive.
Logistic regression:
Designed for classification problems.
It tries to estimate the class probabilities directly, using the following linear model for the log-odds:

log( Pr(G = i | X = x) / Pr(G = K | X = x) ) = β_{i0} + β_i^T x,   i = 1, 2, ..., K − 1.

Equivalently,

Pr(G = i | X = x) = exp(β_{i0} + β_i^T x) / ( 1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x) ),   i = 1, 2, ..., K − 1,
Pr(G = K | X = x) = 1 / ( 1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x) ).

The parameters β are estimated by maximizing the conditional log-likelihood Σ_{i=1}^{N} log Pr(G = g_i | X = x_i).
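A small sketch that evaluates these class probabilities for given coefficients (the beta values below are made up; fitting them by maximizing the log-likelihood is not shown):

import numpy as np

def class_probs(x, betas):
    """betas: list of K-1 pairs (intercept, weight vector); returns Pr(G=i|X=x) for i = 1..K."""
    scores = np.array([b0 + w @ x for b0, w in betas])          # beta_i0 + beta_i^T x
    denom = 1.0 + np.exp(scores).sum()
    probs = list(np.exp(scores) / denom)                        # classes 1 .. K-1
    probs.append(1.0 / denom)                                   # reference class K
    return probs

x = np.array([0.5, -1.2])
betas = [(0.3, np.array([1.0, -0.5])), (-0.2, np.array([0.4, 0.8]))]   # illustrative values
p = class_probs(x, betas)
print(p, sum(p))   # probabilities for the K = 3 classes; they sum to 1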
Instance-based learning
The distance function defines what is learned.
The most popular distance function is the Euclidean distance between two instances a^(1) and a^(2):

d(a^(1), a^(2)) = sqrt( (a_1^(1) − a_1^(2))² + (a_2^(1) − a_2^(2))² + ... + (a_n^(1) − a_n^(2))² ).
Outlook    Temper.  Humidity  Windy  Play
Sunny      Hot      High      False  No
Sunny      Hot      High      True   Yes
Overcast   Hot      High      False  Yes
Rainy      Mild     High      False  Yes
Rainy      Cool     Normal    False  Yes
Rainy      Cool     Normal    True   No
Overcast   Cool     Normal    True   Yes
Sunny      Mild     High      False  No
Sunny      Cool     Normal    False  Yes
Rainy      Mild     Normal    False  No
Sunny      Mild     Normal    True   Yes
Overcast   Mild     High      True   Yes
New instances:
(sunny, hot, high, true): easy, it matches a training instance exactly!
(sunny, cool, high, false): no agreement among the nearest neighbours! What to do? Go with the
majority: no.
(rainy, hot, normal, false): no agreement! There is a tie between (rainy, mild, normal, false)
and (rainy, cool, normal, false). Maybe go from 1-NN to 2-NN, ..., k-NN.
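A minimal k-NN sketch on the table above, using the number of mismatching attribute values as the distance (a common choice for nominal attributes; the Euclidean distance given earlier applies to numeric ones). All names are mine:

from collections import Counter

train = [  # (outlook, temper., humidity, windy) -> play, exactly as in the table above
    (("sunny","hot","high","false"), "no"),      (("sunny","hot","high","true"), "yes"),
    (("overcast","hot","high","false"), "yes"),  (("rainy","mild","high","false"), "yes"),
    (("rainy","cool","normal","false"), "yes"),  (("rainy","cool","normal","true"), "no"),
    (("overcast","cool","normal","true"), "yes"),(("sunny","mild","high","false"), "no"),
    (("sunny","cool","normal","false"), "yes"),  (("rainy","mild","normal","false"), "no"),
    (("sunny","mild","normal","true"), "yes"),   (("overcast","mild","high","true"), "yes"),
]

def mismatches(a, b):
    """Number of attribute positions where the two instances differ."""
    return sum(x != y for x, y in zip(a, b))

def knn(instance, k=1):
    neighbours = sorted(train, key=lambda xy: mismatches(instance, xy[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn(("sunny", "hot", "high", "true")))       # "yes": exact match in the table
print(knn(("sunny", "cool", "high", "false"), 3))  # "no": majority among three neighbours at distance 1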
Figure: Classification by a linear model.
Figure: Decision boundary by k-NN.