
Data Mining

Practical Machine Learning Tools and Techniques


Slides for Chapter 6 of Data Mining by I. H. Witten, E. Frank and
M. A. Hall
2 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Implementation:
Real machine learning schemes

Decision trees

From ID3 to C4.5 (pruning, numeric attributes, ...)

Classification rules

From PRISM to RIPPER and PART (pruning, numeric data, ...)

Association Rules

Frequent-pattern trees

Extending linear models

Support vector machines and neural networks

Instance-based learning

Pruning examples, generalized exemplars, distance functions


3 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Implementation:
Real machine learning schemes

Numeric prediction

Regression/model trees, locally weighted regression

Bayesian networks

Learning and prediction, fast data structures for learning

Clustering: hierarchical, incremental, probabilistic

Hierarchical, incremental, probabilistic, Bayesian

Semisupervised learning

Clustering for classification, co-training

Multi-instance learning

Converting to single-instance, upgrading learning algorithms,


dedicated multi-instance methods
4 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Industrial-strength algorithms

For an algorithm to be useful in a wide


range of real-world applications it must:

Permit numeric attributes

Allow missing values

Be robust in the presence of noise

Be able to approximate arbitrary concept


descriptions (at least in principle)

Basic schemes need to be extended to


fulfill these requirements
5 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Decision trees

Extending ID3:

to permit numeric attributes: straightforward

to deal sensibly with missing values: trickier

stability for noisy data:


requires pruning mechanism

End result: C4.5 (Quinlan)

Best-known and (probably) most widely-used


learning algorithm

Commercial successor: C5.0


6 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Numeric attributes

Standard method: binary splits

E.g. temp < 45

Unlike nominal attributes,


every attribute has many possible split points

Solution is straightforward extension:

Evaluate info gain (or other measure)


for every possible split point of attribute

Choose best split point

Info gain for best split point is info gain for attribute

Computationally more demanding


7 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Weather data (again!)
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

Outlook    Temperature  Humidity  Windy  Play
Sunny      Hot          High      False  No
Sunny      Hot          High      True   No
Overcast   Hot          High      False  Yes
Rainy      Mild         High      False  Yes
Rainy      Cool         Normal    False  Yes
Rainy      Cool         Normal    True   No
...        ...          ...       ...    ...
8 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Weather data (again!)
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = no
If none of the above then play = yes

Nominal version (partial):

Outlook    Temperature  Humidity  Windy  Play
Sunny      Hot          High      False  No
Sunny      Hot          High      True   No
Overcast   Hot          High      False  Yes
Rainy      Mild         High      False  Yes
...        ...          ...       ...    ...

Numeric version (partial):

Outlook    Temperature  Humidity  Windy  Play
Sunny      85           85        False  No
Sunny      80           90        True   No
Overcast   83           86        False  Yes
Rainy      70           96        False  Yes
Rainy      68           80        False  Yes
Rainy      65           70        True   No
...        ...          ...       ...    ...
9 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Example

Split on temperature attribute:

E.g. temperature < 71.5: yes/4, no/2
temperature ≥ 71.5: yes/5, no/3

Info([4,2],[5,3])
= 6/14 info([4,2]) + 8/14 info([5,3])
= 0.939 bits

Place split points halfway between values

Can evaluate all split points in one pass!


64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
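To make the one-pass evaluation concrete, here is a minimal Python sketch (an illustration, not the book's WEKA code): it sorts the (value, class) pairs once, sweeps the boundary from left to right while updating two class-count tables, and scores each candidate midpoint by the expected information. The temperature and play values below are the ones listed on this slide.

from collections import Counter
from math import log2

def entropy(counts):
    # Entropy (in bits) of a class-count distribution
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values() if c)

def best_split_point(values, classes):
    # One pass over the sorted values: move instances from the "right" counts
    # to the "left" counts and score each midpoint by expected information
    pairs = sorted(zip(values, classes))
    total = len(pairs)
    left, right = Counter(), Counter(c for _, c in pairs)
    best = None
    for i in range(total - 1):
        v, c = pairs[i]
        left[c] += 1
        right[c] -= 1
        if pairs[i + 1][0] == v:
            continue  # no split point between equal values
        split = (v + pairs[i + 1][0]) / 2
        info = (i + 1) / total * entropy(left) + (total - i - 1) / total * entropy(right)
        if best is None or info < best[1]:
            best = (split, info)
    return best  # (split point, expected information in bits)

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "yes", "no", "yes", "yes", "no"]
print(best_split_point(temps, play))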
10 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Can avoid repeated sorting

Sort instances by the values of the numeric attribute

Time complexity for sorting: O (n log n)

Does this have to be repeated at each node of the


tree?

No! Sort order for children can be derived from sort


order for parent

Time complexity of derivation: O (n)

Drawback: need to create and store an array of sorted


indices for each numeric attribute
11 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Binary vs multiway splits

Splitting (multi-way) on a nominal attribute exhausts all


information in that attribute

Nominal attribute is tested (at most) once on any path in the


tree

Not so for binary splits on numeric attributes!

Numeric attribute may be tested several times along a path in


the tree

Disadvantage: tree is hard to read

Remedy:

pre-discretize numeric attributes, or

use multi-way splits instead of binary ones


12 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Computing multi-way splits

Simple and efficient way of generating multi-way


splits: greedy algorithm

Dynamic programming can find the optimum multi-way split in O(n²) time

imp(k, i, j) is the impurity of the best split of values x_i ... x_j into k sub-intervals

imp(k, 1, i) = min_{0 < j < i} [ imp(k−1, 1, j) + imp(1, j+1, i) ]

imp(k, 1, N) gives us the best k-way split

In practice, greedy algorithm works as well
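A straightforward Python sketch of the dynamic program above (an illustration, not the optimized incremental version): single[a][b] plays the role of imp(1, a+1, b) and best[k][i] the role of imp(k, 1, i). The function name and the impurity callback are assumptions made for this example.

def optimal_multiway_split(values, classes, k, impurity):
    # Recurrence from the slide: imp(k, 1, i) = min_{0 < j < i} imp(k-1, 1, j) + imp(1, j+1, i).
    # `impurity` scores the class labels of one sub-interval (e.g. entropy or variance).
    order = sorted(range(len(values)), key=lambda i: values[i])
    cs = [classes[i] for i in order]
    n = len(cs)
    # impurity of the single interval covering positions a..b-1 (0-based, half-open)
    single = [[impurity(cs[a:b]) if b > a else 0.0 for b in range(n + 1)]
              for a in range(n + 1)]
    INF = float("inf")
    best = [[INF] * (n + 1) for _ in range(k + 1)]
    for i in range(1, n + 1):
        best[1][i] = single[0][i]
    for parts in range(2, k + 1):
        for i in range(parts, n + 1):
            best[parts][i] = min(best[parts - 1][j] + single[j][i]
                                 for j in range(parts - 1, i))
    return best[k][n]  # minimal total impurity of a k-way split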


13 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Missing values

Split instances with missing values into pieces

A piece going down a branch receives a weight


proportional to the popularity of the branch

weights sum to 1

Info gain works with fractional instances

use sums of weights instead of counts

During classification, split the instance into pieces


in the same way

Merge probability distribution using weights


14 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Pruning

Prevent overfitting to noise in the data

Prune the decision tree

Two strategies:

Postpruning
take a fully-grown decision tree and discard
unreliable parts

Prepruning
stop growing a branch when information becomes
unreliable

Postpruning preferred in practice


prepruning can stop early
15 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Prepruning

Based on statistical significance test

Stop growing the tree when there is no statistically


significant association between any attribute and the class
at a particular node

Most popular test: chi-squared test

ID3 used chi-squared test in addition to


information gain

Only statistically significant attributes were allowed to be


selected by information gain procedure
16 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Early stopping

Pre-pruning may stop the growth


process prematurely: early stopping

Classic example: XOR/Parity-problem

No individual attribute exhibits any significant


association to the class

Structure is only visible in fully expanded tree

Prepruning won't expand the root node

But: XOR-type problems rare in practice

And: prepruning faster than postpruning


    a  b  class
1   0  0  0
2   0  1  1
3   1  0  1
4   1  1  0
17 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Postpruning

First, build full tree

Then, prune it

Fully-grown tree shows all attribute interactions

Problem: some subtrees might be due to chance


effects

Two pruning operations:

Subtree replacement

Subtree raising

Possible strategies:

error estimation

significance testing

MDL principle
18 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Subtree replacement

Bottom-up

Consider replacing a tree only


after considering all its subtrees
19 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Subtree raising

Delete node

Redistribute instances

Slower than subtree


replacement
(Worthwhile?)
20 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Estimating error rates

Prune only if it does not increase the estimated error

Error on the training data is NOT a useful estimator


(would result in almost no pruning)

Use hold-out set for pruning


(reduced-error pruning)

C4.5's method

Derive confidence interval from training data

Use a heuristic limit, derived from this, for pruning

Standard Bernoulli-process-based method

Shaky statistical assumptions (based on training data)


21 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
C4.5's method

Error estimate for subtree is weighted sum of error


estimates for all its leaves

Error estimate for a node:

If c = 25% then z = 0.69 (from normal distribution)

f is the error on the training data

N is the number of instances covered by the leaf


e = [ f + z²/(2N) + z·√( f/N − f²/N + z²/(4N²) ) ] / [ 1 + z²/N ]
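The formula translates directly into code. A minimal Python sketch (an illustration, not Quinlan's implementation); with the default c = 25%, z = 0.69 it reproduces the leaf estimates on the next slide, e.g. pessimistic_error(0.33, 6) gives roughly 0.47.

from math import sqrt

def pessimistic_error(f, N, z=0.69):
    # Upper confidence limit on the error rate (c = 25% gives z = 0.69);
    # f = observed error rate at the leaf, N = number of instances it covers
    return (f + z * z / (2 * N)
            + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))) / (1 + z * z / N)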
22 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Example
Leaf 1: f = 0.33, e = 0.47
Leaf 2: f = 0.5, e = 0.72
Leaf 3: f = 0.33, e = 0.47
Combined using ratios 6:2:6 gives 0.51
Parent node: f = 5/14, e = 0.46
e = 0.46 < 0.51, so prune!
23 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Complexity of tree induction

Assume

m attributes

n training instances

tree depth O (log n)

Building a tree: O(m n log n)

Subtree replacement: O(n)

Subtree raising: O(n (log n)²)

Every instance may have to be redistributed at every node between its leaf and the root

Cost for redistribution (on average): O(log n)

Total cost: O(m n log n) + O(n (log n)²)
24 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
From trees to rules

Simple way: one rule for each leaf

C4.5rules: greedily prune conditions from each rule if this


reduces its estimated error

Can produce duplicate rules

Check for this at the end

Then

look at each class in turn

consider the rules for that class

find a good subset (guided by MDL)

Then rank the subsets to avoid conflicts

Finally, remove rules (greedily) if this decreases error on


the training data
25 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
C4.5: choices and options

C4.5rules slow for large and noisy datasets

Commercial version C5.0rules uses a


different technique

Much faster and a bit more accurate

C4.5 has two parameters

Confidence value (default 25%):


lower values incur heavier pruning

Minimum number of instances in the two most


popular branches (default 2)
26 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Cost-complexity pruning

C4.5's postpruning often does not prune enough

Tree size continues to grow when more instances are


added even if performance on independent data does
not improve

Very fast and popular in practice

Can be worthwhile in some cases to strive for a more


compact tree

At the expense of more computational effort

Cost-complexity pruning method from the CART


(Classification and Regression Trees) learning system
27 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Cost-complexity pruning

Basic idea:

First prune subtrees that, relative to their size, lead to


the smallest increase in error on the training data

Increase in error (α): average error increase per leaf of subtree

Pruning generates a sequence of successively smaller trees

Each candidate tree in the sequence corresponds to one particular threshold value α_i

Which tree to choose as the final model?

Use either a hold-out set or cross-validation to estimate the


error of each
28 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Discussion

The most extensively studied method of machine


learning used in data mining

Different criteria for attribute/test selection rarely


make a large difference

Different pruning methods mainly change the


size of the resulting pruned tree

C4.5 builds univariate decision trees

Some TDIDT systems can build multivariate


trees (e.g. CART)
TDIDT: Top-Down Induction of Decision Trees
29 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Classification rules

Common procedure: separate-and-conquer

Differences:

Search method (e.g. greedy, beam search, ...)

Test selection criteria (e.g. accuracy, ...)

Pruning method (e.g. MDL, hold-out set, ...)

Stopping criterion (e.g. minimum accuracy)

Post-processing step

Also: decision list vs. one rule set for each class
30 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Test selection criteria

Basic covering algorithm:

keep adding conditions to a rule to improve its accuracy

Add the condition that improves accuracy the most

Measure 1: p/t

t: total instances covered by rule
p: number of these that are positive

Produce rules that don't cover negative instances, as quickly as possible

May produce rules with very small coverage: special cases or noise?

Measure 2: Information gain p × [log(p/t) − log(P/T)]

P and T: the positive and total numbers before the new condition was added

Information gain emphasizes positive rather than negative instances

These interact with the pruning mechanism used
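A small Python sketch of the two measures (illustrative only; the function names are not from the book):

from math import log2

def accuracy(p, t):
    # Measure 1: fraction of instances covered by the rule that are positive
    return p / t

def info_gain(p, t, P, T):
    # Measure 2: p * [log(p/t) - log(P/T)];
    # P and T are the positive and total counts before the condition was added
    return p * (log2(p / t) - log2(P / T))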


31 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Missing values, numeric attributes

Common treatment of missing values:


for any test, they fail

Algorithm must either

use other tests to separate out positive instances

leave them uncovered until later in the process

In some cases it's better to treat missing as a


separate value

Numeric attributes are treated just like they are in


decision trees
32 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Pruning rules

Two main strategies:

Incremental pruning

Global pruning

Other difference: pruning criterion

Error on hold-out set (reduced-error pruning)

Statistical significance

MDL principle

Also: post-pruning vs. pre-pruning


33 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Using a pruning set

For statistical validity, must evaluate measure on


data not used for training:

This requires a growing set and a pruning set

Reduced-error pruning :
build full rule set and then prune it

Incremental reduced-error pruning : simplify


each rule as soon as it is built

Can re-split data after rule has been pruned

Stratification advantageous
34 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Incremental reduced-error pruning
Initialize E to the instance set
Until E is empty do
Split E into Grow and Prune in the ratio 2:1
For each class C for which Grow contains an instance
Use basic covering algorithm to create best perfect rule
for C
Calculate w(R): worth of rule on Prune
and w(R-): worth of rule with final condition
omitted
If w(R-) > w(R), prune rule and repeat previous step
From the rules for the different classes, select the one
that's worth most (i.e. with largest w(R))
Print the rule
Remove the instances covered by rule from E
Continue
35 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Measures used in IREP

[p + (N − n)] / T

(N is total number of negatives)

Counterintuitive:

p = 2000 and n = 1000 vs. p = 1000 and n = 1

Success rate p / t

Problem: p = 1 and t = 1
vs. p = 1000 and t = 1001

(p − n) / t

Same effect as success rate because it equals 2p/t − 1

Seems hard to find a simple measure of a rules


worth that corresponds with intuition
36 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Variations

Generating rules for classes in order

Start with the smallest class

Leave the largest class covered by the default rule

Stopping criterion

Stop rule production if accuracy becomes too low

Rule learner RIPPER:

Uses MDL-based stopping criterion

Employs post-processing step to modify rules guided


by MDL criterion
37 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Using global optimization

RIPPER: Repeated Incremental Pruning to Produce Error Reduction


(does global optimization in an efficient way)

Classes are processed in order of increasing size

Initial rule set for each class is generated using IREP

An MDL-based stopping condition is used

DL: bits needed to send examples w.r.t. set of rules, bits needed to
send k tests, and bits for k

Once a rule set has been produced for each class, each rule is re-
considered and two variants are produced

One is an extended version, one is grown from scratch

Chooses among three candidates according to DL

Final clean-up step greedily deletes rules to minimize DL


38 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
PART

Avoids global optimization step used in C4.5rules


and RIPPER

Generates an unrestricted decision list using basic


separate-and-conquer procedure

Builds a partial decision tree to obtain a rule

A rule is only pruned if all its implications are known

Prevents hasty generalization

Uses C4.5's procedures to build a tree


39 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Building a partial tree
Expand-subset (S):
Choose test T and use it to split set of examples
into subsets
Sort subsets into increasing order of average
entropy
while there is a subset X that has not yet been expanded
  AND all subsets expanded so far are leaves
    expand-subset(X)
if all subsets expanded are leaves
  AND estimated error for subtree >= estimated error for node
    undo expansion into subsets and make node a leaf
40 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Example
41 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Notes on PART

Make leaf with maximum coverage into a rule

Treat missing values just as C4.5 does

I.e. split instance into pieces

Time taken to generate a rule:

Worst case: same as for building a pruned tree

Occurs when data is noisy

Best case: same as for building a single rule

Occurs when data is noise free


42 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Rules with exceptions
1. Given: a way of generating a single good rule
2. Then it's easy to generate rules with exceptions
3. Select default class for top-level rule
4. Generate a good rule for one of the remaining classes
5. Apply this method recursively to the two subsets produced by the rule
   (i.e. instances that are covered/not covered)
43 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Iris data example
Exceptions are represented as dotted paths, alternatives as solid ones.
44 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Association rules

Apriori algorithm finds frequent item sets via a


generate-and-test methodology

Successively longer item sets are formed from shorter


ones

Each different size of candidate item set requires a full


scan of the data

Combinatorial nature of generation process is costly


particularly if there are many item sets, or item sets are
large

Appropriate data structures can help

FP-growth employs an extended prefix tree (FP-tree)


45 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
FP-growth

FP-growth uses a Frequent Pattern Tree (FP-


tree) to store a compressed version of the data

Only two passes are required to map the data


into an FP-tree

The tree is then processed recursively to grow


large item sets directly

Avoids generating and testing candidate item sets


against the entire database
46 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Building a frequent pattern tree
1) First pass over the data count the number times
individual items occur
2) Second pass over the data before inserting each
instance into the FP-tree, sort its items in descending
order of their frequency of occurrence, as found in step 1
+
Individual items that do not meet the minimum support
are not inserted into the tree
+
Hopefully many instances will share items that occur
frequently individually, resulting in a high degree of
compression close to the root of the tree
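A minimal Python sketch of the two passes (illustrative, not the FP-growth implementation described in the book): pass one counts item frequencies, pass two inserts each transaction with its frequent items sorted by descending global count, so that common prefixes are shared near the root. The FPNode class and the header-table layout are assumptions made for this example.

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_support):
    # Pass 1: global item counts; drop items below minimum support
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}

    root = FPNode(None, None)
    header = defaultdict(list)  # item -> list of tree nodes holding that item

    # Pass 2: insert each transaction with its frequent items sorted by
    # descending global frequency (ties broken by item name)
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, freq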
47 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
An example using the weather data

Frequency of individual items (minimum


support = 6)
play = yes 9
windy = false 8
humidity = normal 7
humidity = high 7
windy = true 6
temperature = mild 6
play = no 5
outlook = sunny 5
outlook = rainy 5
temperature = hot 4
temperature = cool 4
outlook = overcast 4
48 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
An example using the weather data

Instances with items sorted


1 windy=false, humidity=high, play=no, outlook=sunny, temperature=hot
2 humidity=high, windy=true, play=no, outlook=sunny, temperature=hot
3 play=yes, windy=false, humidity=high, temperature=hot, outlook=overcast
4 play=yes, windy=false, humidity=high, temperature=mild, outlook=rainy
.
.
.
14 humidity=high, windy=true, temperature=mild, play=no, outlook=rainy

Final answer: six single-item sets (previous slide) plus two


multiple-item sets that meet minimum support
play=yes and windy=false 6
play=yes and humidity=normal 6
49 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Finding large item sets

FP-tree for the weather data (min support 6)

Process header table from bottom

Add temperature=mild to the list of large item sets

Are there any item sets containing temperature=mild that meet


min support?
50 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Finding large item sets cont.

FP-tree for the data conditioned on temperature=mild

Created by scanning the first (original) tree

Follow temperature=mild link from header table to find all instances that
contain temperature=mild

Project counts from original tree

Header table shows that temperature=mild can't be grown any longer


51 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Finding large item sets cont.

FP-tree for the data conditioned on humidity=normal

Created by scanning the first (original) tree

Follow humidity=normal link from header table to find all instances


that contain humidity=normal

Project counts from original tree

Header table shows that humidity=normal can be grown to include play=yes


52 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Finding large item sets cont.

All large item sets have now been found

However, in order to be sure it is necessary to


process the entire header link table from the
original tree

Association rules are formed from large item


sets in the same way as for Apriori

FP-growth can be up to an order of magnitude


faster than Apriori for finding large item sets
53 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Extending linear classification

Linear classifiers can't model nonlinear class


boundaries

Simple trick:

Map attributes into new space consisting of


combinations of attribute values

E.g.: all products of n factors that can be constructed


from the attributes

Example with two attributes and n = 3:


x = w1·a1³ + w2·a1²·a2 + w3·a1·a2² + w4·a2³
54 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Problems with this approach

1st problem: speed

10 attributes, and n = 5: more than 2000 coefficients

Use linear regression with attribute selection

Run time is cubic in number of attributes

2nd problem: overfitting

Number of coefficients is large relative to the


number of training instances

Curse of dimensionality kicks in


55 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Support vector machines

Support vector machines are algorithms for


learning linear classifiers

Resilient to overfitting because they learn a


particular linear decision boundary:

The maximum margin hyperplane

Fast in the nonlinear case

Use a mathematical trick to avoid creating pseudo-


attributes

The nonlinear space is created implicitly


56 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
The maximum margin hyperplane

The instances closest to the maximum margin


hyperplane are called support vectors
57 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Support vectors

The support vectors define the maximum margin hyperplane

All other instances can be deleted without changing its position and orientation

This means the hyperplane x = w0 + w1·a1 + w2·a2 can be written as

x = b + Σ_{i is supp. vector} α_i y_i a(i) · a
58 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Finding support vectors

Support vector: training instance for which α_i > 0

Determine α_i and b? A constrained quadratic optimization problem

Off-the-shelf tools for solving these problems

However, special-purpose algorithms are faster

Example: Platt's sequential minimal optimization algorithm (implemented in WEKA)

Note: all this assumes separable data!

x = b + Σ_{i is supp. vector} α_i y_i a(i) · a
59 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Nonlinear SVMs

Pseudo attributes represent attribute


combinations

Overfitting not a problem because the


maximum margin hyperplane is stable

There are usually few support vectors relative to the


size of the training set

Computation time still an issue

Each time the dot product is computed, all the


pseudo attributes must be included
60 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
A mathematical trick

Avoid computing the pseudo attributes

Compute the dot product before doing the nonlinear


mapping

Example:

Corresponds to a map into the instance space


spanned by all products of n attributes
x = b + Σ_{i is supp. vector} α_i y_i (a(i) · a)^n
61 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Other kernel functions

Mapping is called a "kernel function"

Polynomial kernel:
x = b + Σ_{i is supp. vector} α_i y_i (a(i) · a)^n

We can use others:
x = b + Σ_{i is supp. vector} α_i y_i K(a(i), a)

Only requirement:
K(x_i, x_j) = Φ(x_i) · Φ(x_j)

Examples:
K(x_i, x_j) = (x_i · x_j + 1)^d
K(x_i, x_j) = exp( −(x_i − x_j)² / (2σ²) )
K(x_i, x_j) = tanh( β x_i · x_j + b )   (*)
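For concreteness, the three example kernels written out as Python functions (a sketch; the parameter names d, sigma, beta and b follow the usual symbols and are not identifiers from the book):

import numpy as np

def polynomial_kernel(x, y, d=3):
    # K(x, y) = (x . y + 1)^d: implicit map to all products of up to d attributes
    return (np.dot(x, y) + 1) ** d

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, beta=1.0, b=0.0):
    # Sigmoid kernel: tanh(beta * x . y + b); only a valid kernel for some beta, b
    return np.tanh(beta * np.dot(x, y) + b)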
62 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Noise

Have assumed that the data is separable (in


original or transformed space)

Can apply SVMs to noisy data by introducing a


noise parameter C

C bounds the influence of any one training


instance on the decision boundary

Corresponding constraint: 0 ≤ α_i ≤ C

Still a quadratic optimization problem

Have to determine C by experimentation


63 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Sparse data

SVM algorithms speed up dramatically if the data is


sparse (i.e. many values are 0)

Why? Because they compute lots and lots of dot products

Sparse data
compute dot products very efficiently

Iterate only over non-zero values

SVMs can process sparse datasets with 10,000s of


attributes
64 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Applications

Machine vision: e.g. face identification

Outperforms alternative approaches (1.5% error)

Handwritten digit recognition: USPS data

Comparable to best alternative (0.8% error)

Bioinformatics: e.g. prediction of protein


secondary structure

Text classification

Can modify SVM technique for numeric prediction


problems
65 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Support vector regression

Maximum margin hyperplane only applies to


classification

However, idea of support vectors and kernel functions can


be used for regression

Basic method same as in linear regression: want to


minimize error

Difference A: ignore errors smaller than ε and use absolute error instead of squared error

Difference B: simultaneously aim to maximize flatness of function

User-specified parameter ε defines "tube"


66 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
More on SVM regression

If there are tubes that enclose all the training points, the
flattest of them is used

E.g.: mean is used if 2ε > range of target values

Model can be written as:

Support vectors: points on or outside tube

Dot product can be replaced by kernel function


Note: coefficients α_i may be negative

No tube that encloses all training points?

Requires trade-off between error and flatness

Controlled by upper limit C on absolute value of coefficients α_i

x = b + Σ_{i is supp. vector} α_i a(i) · a
67 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Examples
ε = 2, ε = 1, ε = 0.5
68 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Kernel Ridge Regression

For classic linear regression using squared loss,


only simple matrix operations are need to find
the model

Not the case for support vector regression with


user-specified loss

Combine the power of the kernel trick with


simplicity of standard least-squares regression?

Yes! Kernel ridge regression


69 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Kernel Ridge Regression

Like SVM, predicted class value for a test


instance a is expressed as a weighted sum over
the dot product of the test instance with training
instances

Unlike SVM, all training instances participate


not just support vectors

No sparseness in solution (no support vectors)

Does not ignore errors smaller than ε

Uses squared error instead of absolute error


70 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Kernel Ridge Regression

More computationally expensive than standard linear regression when #instances > #attributes

Standard regression: invert an m × m matrix, O(m³), m = #attributes

Kernel ridge regression: invert an n × n matrix, O(n³), n = #instances

Has an advantage if

A non-linear fit is desired

There are more attributes than training instances


71 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
The kernel perceptron

Can use kernel trick to make non-linear classifier using


perceptron rule

Observation: weight vector is modified by adding or


subtracting training instances

Can represent weight vector using all instances that have


been misclassified:

Can use Σ_i Σ_j y(j) a'(j)_i a_i instead of Σ_i w_i a_i (where y is either −1 or +1)

Now swap summation signs: Σ_j y(j) Σ_i a'(j)_i a_i

Can be expressed as: Σ_j y(j) a'(j) · a

Can replace dot product by kernel: Σ_j y(j) K(a'(j), a)
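A compact Python sketch of the resulting learner (illustrative only; treating sign(0) as a mistake is a choice made here, not something the slides specify):

import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    # Instead of a weight vector, keep the misclassified training instances
    # and predict with sum_j y_j * K(x_j, x)
    mistakes = []  # indices of misclassified instances
    for _ in range(epochs):
        for i in range(len(X)):
            score = sum(y[j] * kernel(X[j], X[i]) for j in mistakes)
            if np.sign(score) != y[i]:  # sign(0) = 0 also counts as a mistake
                mistakes.append(i)
    def predict(x):
        return np.sign(sum(y[j] * kernel(X[j], x) for j in mistakes))
    return predict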
72 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Comments on kernel perceptron

Finds separating hyperplane in space created by kernel function (if it


exists)

But: doesn't find maximum-margin hyperplane

Easy to implement, supports incremental learning

Linear and logistic regression can also be upgraded using the kernel
trick

But: solution is not sparse: every training instance contributes to


solution

Perceptron can be made more stable by using all weight vectors


encountered during learning, not just last one (voted perceptron)

Weight vectors vote on prediction (vote based on number of


successful classifications since inception)
73 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Multilayer perceptrons

Using kernels is only one way to build nonlinear classifier


based on perceptrons

Can create network of perceptrons to approximate arbitrary


target concepts

Multilayer perceptron is an example of an artificial neural


network

Consists of: input layer, hidden layer(s), and output


layer

Structure of MLP is usually found by experimentation

Parameters can be found using backpropagation


74 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Examples
75 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Backpropagation

How to learn weights given network structure?

Cannot simply use perceptron learning rule because we have


hidden layer(s)

Function we are trying to minimize: error

Can use a general function minimization technique called


gradient descent

Need differentiable activation function: use sigmoid function


instead of threshold function

Need differentiable error function: can't use zero-one loss, but can
use squared error
f(x) = 1 / (1 + exp(−x))

E = ½ (y − f(x))²
76 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
The two activation functions
77 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Gradient descent example

Function: x² + 1

Derivative: 2x

Learning rate: 0.1

Start value: 4
Can only find a local minimum!
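The slide's example as runnable Python (learning rate 0.1, start value 4, derivative 2x); the step count of 50 is an arbitrary choice for illustration:

def gradient_descent(start=4.0, learning_rate=0.1, steps=50):
    # Minimise f(x) = x^2 + 1 using its derivative f'(x) = 2x
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x  # x <- x - eta * f'(x)
    return x, x ** 2 + 1            # converges towards the minimum at x = 0

print(gradient_descent())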
78 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Minimizing the error I

Need to find partial derivative of error function


for each parameter (i.e. weight)
dE/dw_i = −(y − f(x)) · df(x)/dw_i

df(x)/dx = f(x) (1 − f(x))

x = Σ_i w_i f(x_i)

df(x)/dw_i = f'(x) f(x_i)

dE/dw_i = −(y − f(x)) f'(x) f(x_i)
79 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Minimizing the error II

What about the weights for the connections from the


input to the hidden layer?
dE/dw_ij = dE/dx · dx/dw_ij = −(y − f(x)) f'(x) · dx/dw_ij

x = Σ_i w_i f(x_i)

dx/dw_ij = w_i · df(x_i)/dw_ij

df(x_i)/dw_ij = f'(x_i) · dx_i/dw_ij = f'(x_i) a_i

dE/dw_ij = −(y − f(x)) f'(x) w_i f'(x_i) a_i
80 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Remarks

Same process works for multiple hidden layers and multiple


output units (eg. for multiple classes)

Can update weights after all training instances have been


processed or incrementally:

batch learning vs. stochastic backpropagation

Weights are initialized to small random values

How to avoid overfitting?

Early stopping: use validation set to check when to stop

Weight decay: add penalty term to error function

How to speed up learning?

Momentum: re-use proportion of old weight change

Use optimization method that employs 2nd derivative


81 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Radial basis function networks

Another type of feedforward network with two


layers (plus the input layer)

Hidden units represent points in instance space


and activation depends on distance

To this end, distance is converted into similarity:


Gaussian activation function

Width may be different for each hidden unit

Points of equal activation form hypersphere (or


hyperellipsoid) as opposed to hyperplane

Output layer same as in MLP


82 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Learning RBF networks

Parameters: centers and widths of the RBFs + weights in


output layer

Can learn two sets of parameters independently and still


get accurate models

Eg.: clusters from k-means can be used to form basis


functions

Linear model can be used based on fixed RBFs

Makes learning RBFs very efficient

Disadvantage: no built-in attribute weighting based on


relevance

RBF networks are related to RBF SVMs


83 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Stochastic gradient descent

Have seen gradient descent + stochastic backpropagation


for learning weights in a neural network

Gradient descent is a general-purpose optimization


technique

Can be applied whenever the objective function is


differentiable

Actually, can be used even when the objective function is


not completely differentiable!

Subgradients

One application: learn linear models e.g. linear SVMs or


logistic regression
84 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Stochastic gradient descent cont.

Learning linear models using gradient descent is


easier than optimizing non-linear NN

Objective function has global minimum rather than


many local minima

Stochastic gradient descent is fast, uses little


memory and is suitable for incremental online
learning
85 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Stochastic gradient descent cont.

For SVMs, the error function (to be minimized)


is called the hinge loss
86 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Stochastic gradient descent cont.

In the linearly separable case, the hinge loss is 0 for a


function that successfully separates the data

The maximum margin hyperplane is given by the


smallest weight vector that achieves 0 hinge loss

Hinge loss is not differentiable at z = 1; cant compute


gradient!

Subgradient something that resembles a gradient

Use 0 at z = 1

In fact, loss is 0 for z ≥ 1, so can focus on z < 1 and proceed as usual
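A minimal Python sketch of stochastic subgradient descent with the hinge loss (illustrative; the L2 penalty lam and the fixed learning rate eta are assumptions added here, not part of the slides):

import numpy as np

def sgd_hinge(X, y, epochs=100, eta=0.01, lam=0.001):
    # Linear SVM by stochastic (sub)gradient descent.
    # X: (n, m) array, y: labels in {-1, +1}.
    # Subgradient of the hinge loss: 0 where z = y*(w.x + b) >= 1, usual gradient where z < 1.
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            z = y[i] * (X[i] @ w + b)
            if z < 1:                    # only the z < 1 region contributes to the loss
                w += eta * (y[i] * X[i] - lam * w)
                b += eta * y[i]
            else:
                w -= eta * lam * w       # regularisation keeps the weight vector small
    return w, b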
87 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Instance-based learning

Practical problems of 1-NN scheme:

Slow (but: fast tree-based approaches exist)

Remedy: remove irrelevant data

Noise (but: k -NN copes quite well with noise)

Remedy: remove noisy instances

All attributes deemed equally important

Remedy: weight attributes (or simply select)

Doesn't perform explicit generalization

Remedy: rule-based NN approach


88 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Learning prototypes

Only those instances involved in a decision


need to be stored

Noisy instances should be filtered out

Idea: only use prototypical examples


89 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Speed up, combat noise

IB2: save memory, speed up classification

Work incrementally

Only incorporate misclassified instances

Problem: noisy data gets incorporated

IB3: deal with noise

Discard instances that don't perform well

Compute confidence intervals for

1. Each instance's success rate

2. Default accuracy of its class

Accept/reject instances

Accept if lower limit of 1 exceeds upper limit of 2

Reject if upper limit of 1 is below lower limit of 2


90 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Weight attributes

IB4: weight each attribute


(weights can be class-specific)

Weighted Euclidean distance:

Update weights based on nearest neighbor

Class correct: increase weight

Class incorrect: decrease weight

Amount of change for i-th attribute depends on |x_i − y_i|

Weighted Euclidean distance:
√( w_1² (x_1 − y_1)² + ... + w_n² (x_n − y_n)² )
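As a one-liner in Python (a sketch; the function name is made up for this example):

from math import sqrt

def weighted_euclidean(x, y, w):
    # Attribute-weighted Euclidean distance:
    # sqrt(w_1^2 (x_1-y_1)^2 + ... + w_n^2 (x_n-y_n)^2)
    return sqrt(sum((wi * (xi - yi)) ** 2 for wi, xi, yi in zip(w, x, y)))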
91 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Rectangular generalizations

Nearest-neighbor rule is used outside rectangles

Rectangles are rules! (But they can be more


conservative than normal rules.)

Nested rectangles are rules with exceptions


92 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Generalized exemplars

Generalize instances into hyperrectangles

Online: incrementally modify rectangles

Offline version: seek small set of rectangles that


cover the instances

Important design decisions:

Allow overlapping rectangles?

Requires conflict resolution

Allow nested rectangles?

Dealing with uncovered instances?


93 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Separating generalized exemplars
(Figure: exemplars of class 1 and class 2 with the separation line between them.)
94 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Generalized distance functions

Given: some transformation operations on attributes

K*: similarity = probability of transforming


instance A into B by chance

Average over all transformation paths

Weight paths according their probability


(need way of measuring this)

Uniform way of dealing with different attribute types

Easily generalized to give distance between sets of instances


95 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Numeric prediction

Counterparts exist for all schemes previously


discussed

Decision trees, rule learners, SVMs, etc.

(Almost) all classification schemes can be


applied to regression problems using
discretization

Discretize the class into intervals

Predict weighted average of interval midpoints

Weight according to class probabilities


96 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Regression trees

Like decision trees, but:

Splitting criterion: minimize intra-subset


variation

Termination criterion: std dev becomes small

Pruning criterion: based on numeric error


measure

Prediction: Leaf predicts average


class values of instances

Piecewise constant functions

Easy to interpret

More sophisticated version: model trees


97 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Model trees

Build a regression tree

Each leaf linear regression function

Smoothing: factor in ancestors' predictions

Smoothing formula:

Same effect can be achieved by incorporating ancestor


models into the leaves

Need linear regression function at each node

At each node, use only a subset of attributes

Those occurring in subtree

(+maybe those occurring in path to the root)

Fast: tree usually uses only a small subset of the


attributes
p' = (n p + k q) / (n + k)
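The smoothing formula as a tiny Python helper (illustrative; the argument names follow the symbols above):

def smooth(p, q, n, k):
    # p' = (n p + k q) / (n + k): p is the prediction passed up from below,
    # q the value predicted by the model at this node, n the number of
    # instances reaching the node below, k a smoothing constant
    return (n * p + k * q) / (n + k)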
98 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Building the tree

Splitting: standard deviation reduction

Termination:

Standard deviation < 5% of its value on full training set

Too few instances remain (e.g. < 4)


Pruning:

Heuristic estimate of absolute error of LR models:

Greedily remove terms from LR models to minimize estimated


error

Heavy pruning: single model may replace whole subtree

Proceed bottom up: compare error of LR model at internal node


to error of subtree
SDR = sd(T) − Σ_i |T_i| / |T| × sd(T_i)

Error estimate: (n + v) / (n − v) × average_absolute_error
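The SDR formula as a short Python sketch (illustrative only):

from statistics import pstdev

def sdr(parent_values, subsets):
    # Standard deviation reduction: sd(T) - sum_i |T_i|/|T| * sd(T_i)
    T = len(parent_values)
    return pstdev(parent_values) - sum(len(s) / T * pstdev(s) for s in subsets)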
99 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Nominal attributes

Convert nominal attributes to binary ones

Sort attribute by average class value

If attribute has k values, generate k − 1 binary attributes

The i-th is 0 if the value lies within the first i, otherwise 1

Treat binary attributes as numeric

Can prove: best split on one of the new


attributes is the best (binary) split on original
100 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Missing values

Modify splitting criterion:

To determine which subset an instance goes into,


use surrogate splitting

Split on the attribute whose correlation with original is


greatest

Problem: complex and time-consuming

Simple solution: always use the class

Test set: replace missing value with average


SDR = m / |T| × [ sd(T) − Σ_i |T_i| / |T| × sd(T_i) ]
(m: number of instances with known values for the attribute)
101 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Surrogate splitting based on class

Choose split point based on instances with known values

Split point divides instances into 2 subsets

L (smaller class average)

R (larger)

m is the average of the two averages

For an instance with a missing value:

Choose L if class value < m

Otherwise R

Once full tree is built, replace missing values with


averages of corresponding leaf nodes
102 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Pseudo-code for M5'

Four methods:

Main method: MakeModelTree

Method for splitting: split

Method for pruning: prune

Method that computes error: subtreeError

We'll briefly look at each method in turn

Assume that linear regression method performs attribute


subset selection based on error
103 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
MakeModelTree
MakeModelTree (instances)
{
SD = sd(instances)
for each k-valued nominal attribute
convert into k-1 synthetic binary attributes
root = newNode
root.instances = instances
split(root)
prune(root)
printTree(root)
}
104 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
split
split(node)
{
if sizeof(node.instances) < 4 or
sd(node.instances) < 0.05*SD
node.type = LEAF
else
node.type = INTERIOR
for each attribute
for all possible split positions of attribute
calculate the attribute's SDR
node.attribute = attribute with maximum SDR
split(node.left)
split(node.right)
}
105 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
prune
prune(node)
{
if node = INTERIOR then
prune(node.leftChild)
prune(node.rightChild)
node.model = linearRegression(node)
if subtreeError(node) > error(node) then
node.type = LEAF
}
106 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
subtreeError
subtreeError(node)
{
l = node.left; r = node.right
if node = INTERIOR then
return (sizeof(l.instances)*subtreeError(l)
+ sizeof(r.instances)*subtreeError(r))
/sizeof(node.instances)
else return error(node)
}
107 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Model tree for servo data
108 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Rules from model trees

PART algorithm generates classification rules by building


partial decision trees

Can use the same method to build rule sets for regression

Use model trees instead of decision trees

Use variance instead of entropy to choose node to


expand when building partial tree

Rules will have linear models on right-hand side

Caveat: using smoothed trees may not be appropriate due


to separate-and-conquer strategy
109 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Locally weighted regression

Numeric prediction that combines

instance-based learning

linear regression

Lazy:

computes regression function at prediction time

works incrementally

Weight training instances

according to distance to test instance

needs weighted version of linear regression

Advantage: nonlinear approximation

But: slow
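A compact Python sketch of the idea (an illustration, not the book's implementation): a Gaussian weighting function of Euclidean distance, with a user-chosen smoothing parameter tau, feeds a weighted least-squares fit that is solved at prediction time.

import numpy as np

def lwr_predict(X, y, query, tau=1.0):
    # Weight each training instance by a Gaussian kernel of its distance
    # to the query, then solve weighted linear regression for this query
    Xb = np.hstack([np.ones((len(X), 1)), X])   # add intercept column
    q = np.hstack([1.0, query])
    d2 = np.sum((X - query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * tau ** 2))            # Gaussian weighting function
    W = np.diag(w)
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ (Xb.T @ W @ y)
    return q @ theta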
110 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Design decisions

Weighting function:

Inverse Euclidean distance

Gaussian kernel applied to Euclidean distance

Triangular kernel used the same way

etc.

Smoothing parameter is used to scale the distance


function

Multiply distance by inverse of this parameter

Possible choice: distance of k th nearest training instance


(makes it data dependent)
111 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Discussion

Regression trees were introduced in CART

Quinlan proposed model tree method (M5)

M5': slightly improved, publicly available

Quinlan also investigated combining instance-based


learning with M5

CUBIST: Quinlan's commercial rule learner for


numeric prediction

Interesting comparison: neural nets vs. M5


112 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
From naïve Bayes to Bayesian Networks

Naïve Bayes assumes:

attributes conditionally independent given the class

Doesn't hold in practice but classification


accuracy often high

However: sometimes performance much worse


than e.g. decision tree

Can we eliminate the assumption?


113 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Enter Bayesian networks

Graphical models that can represent any


probability distribution

Graphical representation: directed acyclic graph,


one node for each attribute

Overall probability distribution factorized into


component distributions

Graph's nodes hold component distributions


(conditional distributions)
114 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Network for the weather data
115 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Network for the weather data
116 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Computing the class probabilities

Two steps: computing a product of probabilities for


each class and normalization

For each class value

Take all attribute values and class value

Look up corresponding entries in conditional


probability distribution tables

Take the product of all probabilities

Divide the product for each class by the sum of the


products (normalization)
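A minimal Python sketch of these two steps (illustrative; the dictionary layout for the conditional probability tables and the parents map is an assumption made for this example):

def class_probabilities(instance, class_values, cpts, parents):
    # For each candidate class value, multiply the conditional-probability-table
    # entries of every node, then normalise the products.
    # cpts[node] maps (node_value, tuple_of_parent_values) -> probability;
    # parents[node] lists the node's parents; "class" names the class node here.
    products = {}
    for c in class_values:
        values = dict(instance, **{"class": c})
        prod = 1.0
        for node, pars in parents.items():
            key = (values[node], tuple(values[p] for p in pars))
            prod *= cpts[node][key]
        products[c] = prod
    total = sum(products.values())
    return {c: p / total for c, p in products.items()}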
117 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Why can we do this? (Part I)

Single assumption: values of a node's parents completely determine probability distribution for current node

Means that node/attribute is conditionally independent of other ancestors given parents

Pr[node | ancestors] = Pr[node | parents]
118 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Why can we do this? (Part II)

Chain rule from probability theory:

Pr[a_1, a_2, ..., a_n] = Π_{i=1}^{n} Pr[a_i | a_{i−1}, ..., a_1]

Because of our assumption from the previous slide:

Pr[a_1, a_2, ..., a_n] = Π_{i=1}^{n} Pr[a_i | a_{i−1}, ..., a_1] = Π_{i=1}^{n} Pr[a_i | a_i's parents]
119 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Learning Bayes nets

Basic components of algorithms for learning


Bayes nets:

Method for evaluating the goodness of a given


network

Measure based on probability of training data given


the network (or the logarithm thereof)

Method for searching through space of possible


networks

Amounts to searching through sets of edges because


nodes are fixed
120 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Problem: overfitting

Can't just maximize probability of the training data

Because then it's always better to add more edges (fit the training data more closely)

Need to use cross-validation or some penalty for complexity of the network

AIC measure:  AIC score = −LL + K

MDL measure:  MDL score = −LL + (K/2) log N

LL: log-likelihood (log of probability of data), K: number of free parameters, N: #instances

Another possibility: Bayesian approach with prior distribution over networks
121 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Searching for a good structure

Task can be simplified: can optimize each


node separately

Because probability of an instance is product


of individual nodes' probabilities

Also works for AIC and MDL criterion


because penalties just add up

Can optimize node by adding or removing


edges from other nodes

Must not introduce cycles!


122 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
The K2 algorithm

Starts with given ordering of nodes


(attributes)

Processes each node in turn

Greedily tries adding edges from previous


nodes to current node

Moves to next node when current node can't


be optimized further

Result depends on initial order


123 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Some tricks

Sometimes it helps to start the search with a naïve Bayes


network

It can also help to ensure that every node is in Markov


blanket of class node

Markov blanket of a node includes all parents, children, and children's parents of that node

Given values for Markov blanket, node is conditionally


independent of nodes outside blanket

I.e. node is irrelevant to classification if not in Markov blanket


of class node
124 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Other algorithms

Extending K2 to consider greedily adding or deleting


edges between any pair of nodes

Further step: considering inverting the direction of edges

TAN (Tree Augmented Naïve Bayes):

Starts with naïve Bayes

Considers adding second parent to each node (apart from


class node)

Efficient algorithm exists


125 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Likelihood vs. conditional likelihood

In classification what we really want is to maximize


probability of class given other attributes
Not probability of the instances

But: no closed-form solution for probabilities in nodes' tables that maximize this

However: can easily compute conditional probability of


data based on given network

Seems to work well when used for network scoring


126 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Data structures for fast learning

Learning Bayes nets involves a lot of counting for


computing conditional probabilities

Naïve strategy for storing counts: hash table

Runs into memory problems very quickly

More sophisticated strategy: all-dimensions (AD) tree

Analogous to kD-tree for numeric data

Stores counts in a tree but in a clever way such that


redundancy is eliminated

Only makes sense to use it for large datasets


127 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
AD tree example
128 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Building an AD tree

Assume each attribute in the data has been


assigned an index

Then, expand node for attribute i with the values


of all attributes j > i

Two important restrictions:

Most populous expansion for each attribute is omitted


(breaking ties arbitrarily)

Expansions with counts that are zero are also omitted

The root node is given index zero


129 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Discussion

We have assumed: discrete data, no missing


values, no new nodes

Different method of using Bayes nets for


classification: Bayesian multinets

I.e. build one network for each class and make


prediction using Bayes rule

Different class of learning methods for Bayes


nets: testing conditional independence assertions

Can also build Bayes nets for regression tasks


130 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Clustering: how many clusters?

How to choose k in k-means? Possibilities:

Choose k that minimizes cross-validated squared


distance to cluster centers

Use penalized squared distance on the training data (eg.


using an MDL criterion)

Apply k-means recursively with k = 2 and use stopping


criterion (eg. based on MDL)

Seeds for subclusters can be chosen by seeding along


direction of greatest variance in cluster
(one standard deviation away in each direction from cluster
center of parent cluster)

Implemented in algorithm called X-means (using Bayesian


Information Criterion instead of MDL)
131 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Hierarchical clustering

Recursively splitting clusters produces a


hierarchy that can be represented as a
dendrogram

Could also be represented as a Venn diagram of


sets and subsets (without intersections)

Height of each node in the dendrogram can be made


proportional to the dissimilarity between its
children
132 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Agglomerative clustering

• Bottom-up approach
• Simple algorithm
• Requires a distance/similarity measure
• Start by considering each instance to be a cluster
• Find the two closest clusters and merge them
• Continue merging until only one cluster is left
• The record of mergings forms a hierarchical clustering structure: a binary dendrogram
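A minimal sketch of this merge loop, supporting single- or complete-linkage (defined on the next slides). The function name agglomerative is illustrative; it returns the dendrogram as nested tuples of instance indices and makes no attempt at efficiency (the naive loop is roughly cubic in the number of instances).

```python
def agglomerative(instances, distance, linkage="single"):
    """Bottom-up clustering: start with one cluster per instance and
    repeatedly merge the two closest clusters until one is left."""
    clusters = [[x] for x in instances]
    trees = list(range(len(instances)))          # dendrogram node per cluster

    def cluster_distance(a, b):
        d = [distance(x, y) for x in a for y in b]
        return min(d) if linkage == "single" else max(d)   # single vs. complete

    while len(clusters) > 1:
        # find the two closest clusters ...
        pairs = [(cluster_distance(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        # ... and merge them, recording the merge in the dendrogram
        clusters[i] += clusters[j]
        trees[i] = (trees[i], trees[j])
        del clusters[j], trees[j]
    return trees[0]
```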
Distance measures

• Single-linkage
  • Minimum distance between the two clusters
  • Distance between the clusters' two closest members
  • Can be sensitive to outliers
• Complete-linkage
  • Maximum distance between the two clusters
  • Two clusters are considered close only if all instances in their union are relatively similar
  • Also sensitive to outliers
  • Seeks compact clusters
Distance measures cont.

• Compromises between the extremes of minimum and maximum distance:
  • Represent clusters by their centroid, and use the distance between centroids: centroid linkage
    • Works well for instances in multidimensional Euclidean space
    • Not so good if all we have is pairwise similarity between instances
  • Calculate the average distance between each pair of members of the two clusters: average-linkage
• Technical deficiency of both: results depend on the numerical scale on which distances are measured
More distance measures

• Group-average clustering
  • Uses the average distance between all members of the merged cluster
  • Differs from average-linkage because it includes pairs from the same original cluster
• Ward's clustering method
  • Calculates the increase in the sum of squared distances of the instances from the centroid before and after fusing two clusters
  • Minimizes the increase in this squared distance at each clustering step
• All measures will produce the same result if the clusters are compact and well separated
Example hierarchical clustering

• 50 examples of different creatures from the zoo data
(figure: dendrogram and polar plot, complete-linkage)
Example hierarchical clustering 2

(figure: single-linkage)
Incremental clustering

• Heuristic approach (COBWEB/CLASSIT)
• Forms a hierarchy of clusters incrementally
• Start:
  • tree consists of an empty root node
• Then:
  • add instances one by one
  • update the tree appropriately at each stage
  • to update, find the right leaf for the instance
    • May involve restructuring the tree
• Base update decisions on category utility
Clustering weather data

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True

(figure: cluster tree at stages 1, 2 and 3)
Clustering weather data

(figure: cluster tree at stages 4 and 5, for the same weather data as above)

• Stage 4: merge the best host and the runner-up
• Stage 5: consider splitting the best host if merging doesn't help
Final hierarchy
Example: the iris data (subset)
Clustering with cutoff
Category utility

• Category utility: a quadratic loss function defined on conditional probabilities:

  CU(C_1, C_2, \dots, C_k) = \frac{\sum_l \Pr[C_l] \sum_i \sum_j \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)}{k}

• If every instance is in a different category, the numerator becomes its maximum,

  n - \sum_i \sum_j \Pr[a_i = v_{ij}]^2

  where n is the number of attributes
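A small sketch that computes category utility for nominal data directly from the formula above; the function name category_utility and the input convention (each instance a tuple of nominal values) are choices made for this example.

```python
from collections import Counter

def category_utility(clusters):
    """clusters: list of clusters; each cluster is a list of instances,
    each instance a tuple of nominal attribute values."""
    instances = [x for c in clusters for x in c]
    n, k = len(instances), len(clusters)
    n_attrs = len(instances[0])

    def sum_sq_probs(insts):
        # sum over attributes i and values v of Pr[a_i = v]^2 within insts
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(x[i] for x in insts)
            total += sum((c / len(insts)) ** 2 for c in counts.values())
        return total

    base = sum_sq_probs(instances)          # unconditional term
    cu = 0.0
    for cluster in clusters:
        pr_cluster = len(cluster) / n
        cu += pr_cluster * (sum_sq_probs(cluster) - base)
    return cu / k
```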
Numeric attributes

• Assume a normal distribution:

  f(a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(a - \mu)^2}{2\sigma^2} \right)

• Then the sum over values corresponds to an integral:

  \sum_j \Pr[a_i = v_{ij}]^2 \;\longleftrightarrow\; \int f(a_i)^2 \, da_i = \frac{1}{2\sqrt{\pi}\,\sigma_i}

• Thus the category utility becomes

  CU(C_1, C_2, \dots, C_k) = \frac{1}{k} \sum_l \Pr[C_l] \, \frac{1}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{il}} - \frac{1}{\sigma_i} \right)

• Prespecified minimum variance
  • the acuity parameter
Probability-based clustering

• Problems with the heuristic approach:
  • Division by k?
  • Order of examples?
  • Are the restructuring operations sufficient?
  • Is the result at least a local minimum of category utility?
• Probabilistic perspective:
  • seek the most likely clusters given the data
• Also: an instance belongs to a particular cluster with a certain probability
Finite mixtures

• Model the data using a mixture of distributions
• One cluster, one distribution
  • governs the probabilities of attribute values in that cluster
• Finite mixtures: finite number of clusters
• Individual distributions are normal (usually)
• Combine distributions using cluster weights
Two-class mixture model

data:
A 51  A 43  B 62  B 64  A 45  A 42  A 46  A 45  A 45  B 62
A 47  A 52  B 64  A 51  B 65  A 48  A 49  A 46  B 64  A 51
A 52  B 62  A 49  A 48  B 62  A 43  A 40  A 48  B 64  A 51
B 63  A 43  B 65  B 66  B 65  A 46  A 39  B 62  B 64  A 52
B 63  B 64  A 48  B 64  A 48  A 51  A 48  B 64  A 42  A 48
A 41

model:
μ_A = 50, σ_A = 5, p_A = 0.6
μ_B = 65, σ_B = 2, p_B = 0.4
Using the mixture model

• Probability that instance x belongs to cluster A:

  \Pr[A \mid x] = \frac{\Pr[x \mid A] \Pr[A]}{\Pr[x]} = \frac{f(x; \mu_A, \sigma_A)\, p_A}{\Pr[x]}

  with

  f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

• Probability of an instance given the clusters:

  \Pr[x \mid \text{the clusters}] = \sum_i \Pr[x \mid \text{cluster}_i] \, \Pr[\text{cluster}_i]
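A quick illustration of this calculation for a two-cluster model; the helper names are made up, and the example call uses the parameter values from the previous slide.

```python
import math

def normal_density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def prob_cluster_a(x, mu_a, sigma_a, p_a, mu_b, sigma_b, p_b):
    """Pr[A | x] for a two-cluster mixture, via Bayes' rule."""
    num_a = normal_density(x, mu_a, sigma_a) * p_a
    num_b = normal_density(x, mu_b, sigma_b) * p_b
    return num_a / (num_a + num_b)   # Pr[x] is the sum of the two numerators

# e.g. prob_cluster_a(48, mu_a=50, sigma_a=5, p_a=0.6, mu_b=65, sigma_b=2, p_b=0.4)
```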
Learning the clusters

• Assume:
  • we know there are k clusters
• Learn the clusters
  • determine their parameters
  • i.e. means and standard deviations
• Performance criterion:
  • probability of the training data given the clusters
• EM algorithm
  • finds a local maximum of the likelihood
EM algorithm

• EM = Expectation-Maximization
• Generalizes k-means to a probabilistic setting
• Iterative procedure:
  • E "expectation" step: calculate the cluster probability for each instance
  • M "maximization" step: estimate the distribution parameters from the cluster probabilities
• Store the cluster probabilities as instance weights
• Stop when improvement is negligible
More on EM

• Estimate parameters from weighted instances:

  \mu_A = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}

  \sigma_A^2 = \frac{w_1 (x_1 - \mu_A)^2 + w_2 (x_2 - \mu_A)^2 + \dots + w_n (x_n - \mu_A)^2}{w_1 + w_2 + \dots + w_n}

• Stop when the log-likelihood saturates
• Log-likelihood:

  \sum_i \log\left( p_A \Pr[x_i \mid A] + p_B \Pr[x_i \mid B] \right)
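A compact sketch of EM for a two-component, one-dimensional Gaussian mixture, using the weighted estimates and the log-likelihood stopping criterion from this slide. The function names and the crude initialization are choices for this example; a production version would guard against collapsing variances and use several random restarts.

```python
import math

def normal_density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_two_gaussians(data, max_iter=100, tol=1e-6):
    # crude initialization: put the two means at the extremes of the data
    mu_a, mu_b = min(data), max(data)
    sigma_a = sigma_b = (max(data) - min(data)) / 4 or 1.0
    p_a = 0.5
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E step: probability that each instance belongs to cluster A
        w = []
        for x in data:
            pa = p_a * normal_density(x, mu_a, sigma_a)
            pb = (1 - p_a) * normal_density(x, mu_b, sigma_b)
            w.append(pa / (pa + pb))
        # M step: re-estimate the parameters from the weighted instances
        sw_a = sum(w)
        sw_b = len(data) - sw_a
        mu_a = sum(wi * x for wi, x in zip(w, data)) / sw_a
        mu_b = sum((1 - wi) * x for wi, x in zip(w, data)) / sw_b
        sigma_a = math.sqrt(sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, data)) / sw_a)
        sigma_b = math.sqrt(sum((1 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, data)) / sw_b)
        p_a = sw_a / len(data)
        # log-likelihood: stop when the improvement is negligible
        ll = sum(math.log(p_a * normal_density(x, mu_a, sigma_a) +
                          (1 - p_a) * normal_density(x, mu_b, sigma_b)) for x in data)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return (mu_a, sigma_a, p_a), (mu_b, sigma_b, 1 - p_a)
```

Running it on the values from the two-class mixture slide yields estimates for (μ_A, σ_A, p_A) and (μ_B, σ_B, p_B).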
Extending the mixture model

• More than two distributions: easy
• Several attributes: easy, assuming independence!
• Correlated attributes: difficult
  • Joint model: bivariate normal distribution with a (symmetric) covariance matrix
  • n attributes: need to estimate n + n(n+1)/2 parameters
More mixture model extensions

• Nominal attributes: easy if independent
• Correlated nominal attributes: difficult
  • Two correlated attributes require v_1 v_2 parameters (v_1, v_2 = numbers of values)
• Missing values: easy
• Can use distributions other than the normal:
  • "log-normal" if a predetermined minimum is given
  • "log-odds" if bounded from above and below
  • Poisson for attributes that are integer counts
• Use cross-validation to estimate k!
Bayesian clustering

• Problem: many parameters ⇒ EM overfits
• Bayesian approach: give every parameter a prior probability distribution
  • Incorporate the prior into the overall likelihood figure
  • Penalizes the introduction of parameters
• E.g. the Laplace estimator for nominal attributes
• Can also have a prior on the number of clusters!
• Implementation: NASA's AUTOCLASS
Discussion

• Can interpret clusters by using supervised learning
  • post-processing step
• Decrease dependence between attributes?
  • pre-processing step
  • E.g. use principal component analysis
• Can be used to fill in missing values
• Key advantage of probabilistic clustering:
  • Can estimate the likelihood of the data
  • Use it to compare different models objectively
Semisupervised learning

• Semisupervised learning: attempts to use unlabeled data as well as labeled data
  • The aim is to improve classification performance
• Why try to do this? Unlabeled data is often plentiful and labeling data can be expensive
  • Web mining: classifying web pages
  • Text mining: identifying names in text
  • Video mining: classifying people in the news
• Leveraging the large pool of unlabeled examples would be very attractive
Clustering for classification

• Idea: use naïve Bayes on the labeled examples and then apply EM
  • First, build a naïve Bayes model on the labeled data
  • Second, label the unlabeled data based on the class probabilities (expectation step)
  • Third, train a new naïve Bayes model based on all the data (maximization step)
  • Fourth, repeat the 2nd and 3rd steps until convergence
• Essentially the same as EM for clustering, with fixed cluster membership probabilities for the labeled data and #clusters = #classes
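A sketch of these four steps using scikit-learn's MultinomialNB (assumed available; suited to word-count features as in document classification). The soft maximization step is implemented by adding each unlabeled instance once per class, weighted by its class probability; the function name is made up for this example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_semisupervised_nb(X_lab, y_lab, X_unlab, n_iter=10):
    classes = np.unique(y_lab)                       # sorted, matching predict_proba columns
    # Step 1: build a naive Bayes model on the labeled data only
    model = MultinomialNB().fit(X_lab, y_lab)
    for _ in range(n_iter):
        # Expectation: probabilistically label the unlabeled data
        probs = model.predict_proba(X_unlab)         # shape (n_unlab, n_classes)
        # Maximization: retrain on all data; each unlabeled instance appears
        # once per class, weighted by its class probability
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] + [np.full(len(X_unlab), c) for c in classes])
        w_all = np.concatenate([np.ones(len(X_lab))] +
                               [probs[:, i] for i in range(len(classes))])
        model = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return model
```

Reducing the weight of the unlabeled data (Refinement 1 on the next slide) amounts to multiplying the unlabeled part of w_all by a factor below one.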
Comments

• Has been applied successfully to document classification
  • Certain phrases are indicative of classes
  • Some of these phrases occur only in the unlabeled data, some in both sets
  • EM can generalize the model by taking advantage of co-occurrence of these phrases
• Refinement 1: reduce the weight of the unlabeled data
• Refinement 2: allow multiple clusters per class
160 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 7)
Co-training

Method for learning from multiple views (multiple sets of attributes),


eg:

First set of attributes describes content of web page

Second set of attributes describes links that link to the web page

Step 1: build model from each view

Step 2: use models to assign labels to unlabeled data

Step 3: select those unlabeled examples that were most confidently


predicted (ideally, preserving ratio of classes)

Step 4: add those examples to the training set

Step 5: go to Step 1 until data exhausted

Assumption: views are independent


161 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 7)
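A simplified sketch of this loop, assuming scikit-learn-style classifiers with fit/predict_proba. For brevity it selects examples by a combined confidence over the two views and skips class-ratio preservation, whereas classic co-training lets each classifier add its own most confident examples; the function name and parameters are made up.

```python
import numpy as np

def co_train(clf1, clf2, X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             n_per_round=5, n_rounds=10):
    """clf1/clf2: classifiers with fit/predict_proba; X1_*/X2_*: the two views."""
    X1_lab, X2_lab, y_lab = np.asarray(X1_lab), np.asarray(X2_lab), np.asarray(y_lab)
    X1_unlab, X2_unlab = np.asarray(X1_unlab), np.asarray(X2_unlab)
    pool = np.arange(len(X1_unlab))                  # indices still unlabeled
    for _ in range(n_rounds):
        if len(pool) == 0:
            break
        # Step 1: build one model per view from the current labeled set
        clf1.fit(X1_lab, y_lab)
        clf2.fit(X2_lab, y_lab)
        # Step 2: use the models to label the remaining unlabeled data
        p1 = clf1.predict_proba(X1_unlab[pool])
        p2 = clf2.predict_proba(X2_unlab[pool])
        # Step 3: pick the most confidently predicted examples
        conf = np.maximum(p1.max(axis=1), p2.max(axis=1))
        top = np.argsort(-conf)[:n_per_round]        # positions within the pool
        # assumes both classifiers list classes in the same (sorted) order
        labels = clf1.classes_[(p1[top] + p2[top]).argmax(axis=1)]
        # Step 4: add those examples (in both views) to the training set
        X1_lab = np.vstack([X1_lab, X1_unlab[pool[top]]])
        X2_lab = np.vstack([X2_lab, X2_unlab[pool[top]]])
        y_lab = np.concatenate([y_lab, labels])
        # Step 5: remove them from the unlabeled pool and repeat
        pool = np.delete(pool, top)
    return clf1, clf2
```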
EM and co-training

• Like EM for semisupervised learning, but the view is switched in each iteration of EM
  • Uses all the unlabeled data (probabilistically labeled) for training
• Has also been used successfully with support vector machines
  • Using logistic models fit to the output of the SVMs
• Co-training also seems to work when the views are chosen randomly!
  • Why? Possibly because the co-trained classifier is more robust
Multi-instance learning

• Converting to single-instance learning
  • We have already seen aggregation of the input or output
  • Simple, and often works well in practice
  • Will fail in some situations
  • Aggregating the input loses a lot of information because the attributes are condensed to summary statistics individually and independently
• Can a bag be converted to a single instance without discarding so much information?
Converting to single-instance

• Can convert to a single instance without losing so much information, but more attributes are needed in the condensed representation
• Basic idea: partition the instance space into regions
  • One attribute per region in the single-instance representation
• Simplest case: boolean attributes
  • The attribute corresponding to a region is set to true for a bag if the bag has at least one instance in that region
Converting to single-instance

• Could use numeric counts instead of boolean attributes to preserve more information
• Main problem: how to partition the instance space?
  • Simple approach: partition into equal-sized hypercubes
    • Only works for few dimensions
  • More practical: use unsupervised learning
    • Take all instances from all bags (minus class labels) and cluster them
    • Create one attribute per cluster (region)
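A minimal sketch of this clustering-based conversion, assuming scikit-learn's KMeans; the function name and the choice of count attributes are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def bags_to_single_instances(bags, n_regions=10):
    """bags: list of 2-D arrays, one array of instances per bag.
    Returns one condensed feature vector per bag: an instance count per region."""
    # take all instances from all bags (ignoring bag membership) and cluster them
    all_instances = np.vstack(bags)
    km = KMeans(n_clusters=n_regions, n_init=10).fit(all_instances)
    # one attribute per cluster (region): how many of the bag's instances fall in it
    features = np.zeros((len(bags), n_regions))
    for b, bag in enumerate(bags):
        for region in km.predict(bag):
            features[b, region] += 1
    return features, km
```

The boolean variant from the previous slide is simply features > 0; the bag-level class labels can then be attached to the condensed vectors and fed to any single-instance learner.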
Converting to single-instance

• Clustering ignores the class membership
• Use a decision tree to partition the space instead
  • Each leaf corresponds to one region of instance space
• How to learn a tree when the class labels apply to entire bags?
  • Aggregating the output can be used: take the bag's class label and attach it to each of its instances
  • Many of these labels will be incorrect; however, they are only used to obtain a partitioning of the space
Converting to single-instance

• Using decision trees yields hard partition boundaries
  • So does k-means clustering into regions, where the cluster centers (reference points) define the regions
• Can make region membership soft by using distance, transformed into similarity, to compute the attribute values in the condensed representation
  • Just need a way of aggregating the similarity scores between each bag and reference point into a single value, e.g. the maximum similarity between any instance in the bag and the reference point
Upgrading learning algorithms

• Converting to single-instance is appealing because many existing algorithms can then be applied without modification
  • May not be the most efficient approach
• Alternative: adapt a single-instance algorithm to the multi-instance setting
  • Can be achieved elegantly for distance/similarity-based methods (e.g. nearest neighbor or SVMs)
  • Compute the distance/similarity between two bags of instances
Upgrading learning algorithms

• Kernel-based methods
  • The similarity must be a proper kernel function that satisfies certain mathematical properties
  • One example: the so-called set kernel
    • Given a kernel function for pairs of instances, the set kernel sums it over all pairs of instances from the two bags being compared
    • Is generic and can be applied with any single-instance kernel function
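A tiny sketch of the set kernel; the instance-level RBF kernel and the function names are choices made for this illustration.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # any single-instance kernel can be plugged in; an RBF kernel is used here
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def set_kernel(bag1, bag2, instance_kernel=rbf_kernel):
    """Sum the instance-level kernel over all pairs of instances from the two bags."""
    return sum(instance_kernel(x, z) for x in bag1 for z in bag2)
```

In practice the sum is often normalized (e.g. by the bag sizes) so that large bags do not dominate.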
Upgrading learning algorithms

• Nearest-neighbor learning
  • Apply variants of the Hausdorff distance, which is defined for sets of points
  • Given two bags and a distance function between pairs of instances, the Hausdorff distance between the bags is the largest distance from any instance in one bag to its closest instance in the other bag
  • Can be made more robust to outliers by using the nth-largest distance
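A sketch of one such variant (published multi-instance k-NN methods differ in the exact definition); the function name and the use of Euclidean instance distance are assumptions for this example.

```python
import numpy as np

def hausdorff_distance(bag1, bag2, n=1):
    """Largest (or nth-largest, for robustness to outliers) distance from an
    instance in one bag to its closest instance in the other bag, both ways."""
    def directed(a, b):
        # for each instance in a: distance to its closest instance in b
        closest = [min(np.linalg.norm(np.asarray(x) - np.asarray(y)) for y in b) for x in a]
        return sorted(closest, reverse=True)[min(n, len(closest)) - 1]
    return max(directed(bag1, bag2), directed(bag2, bag1))
```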
Dedicated multi-instance methods

• Some methods are not based directly on single-instance algorithms
• One basic approach: find a single hyperrectangle that contains at least one instance from each positive bag and no instances from any negative bag
  • The rectangle encloses an area of the instance space where all positive bags overlap
  • Originally designed for the drug activity problem mentioned in Chapter 2
  • Can use other shapes, e.g. hyperspheres (balls)
  • Can also use boosting to build an ensemble of balls
Dedicated multi-instance methods

• The previously described methods have hard decision boundaries: an instance either falls inside or outside a hyperrectangle/ball
• Other methods use probabilistic, soft concept descriptions
• Diverse density
  • Learns a single reference point in instance space
  • The probability that an instance is positive decreases with increasing distance from the reference point
Dedicated multi-instance methods

• Diverse density
  • Combines the instance probabilities within a bag to obtain the probability that the bag is positive
    • noisy-OR (a probabilistic version of logical OR)
    • If all instance-level probabilities are 0, the noisy-OR value and the bag-level probability are 0
    • If at least one instance-level probability is 1, the value is 1
  • Diverse density, like the geometric methods, is maximized when the reference point is located in an area where positive bags overlap and no negative bags are present
  • Gradient descent can be used to maximize it
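A minimal sketch of the diverse-density objective with a Gaussian-like instance probability and noisy-OR bag probabilities. The function names and the single scale parameter are simplifications (the original formulation also learns per-attribute scaling factors).

```python
import numpy as np

def instance_prob(x, ref_point, scale=1.0):
    # probability that an instance is positive: decreases with distance from the reference point
    return np.exp(-scale * np.sum((np.asarray(x) - np.asarray(ref_point)) ** 2))

def bag_prob(bag, ref_point, scale=1.0):
    """noisy-OR combination: the bag is positive if at least one instance is."""
    probs = [instance_prob(x, ref_point, scale) for x in bag]
    return 1.0 - np.prod([1.0 - p for p in probs])

def diverse_density(ref_point, pos_bags, neg_bags, scale=1.0):
    """Objective to be maximized (e.g. by gradient descent over ref_point)."""
    dd = 1.0
    for bag in pos_bags:
        dd *= bag_prob(bag, ref_point, scale)
    for bag in neg_bags:
        dd *= 1.0 - bag_prob(bag, ref_point, scale)
    return dd
```

As described above, a bag whose instances all have probability 0 contributes 0, and a bag containing an instance with probability 1 contributes 1, so the objective peaks where positive bags overlap away from negative bags.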