Chapter 6
Decision trees
Classification rules
Association rules
Frequent-pattern trees
Instance-based learning
Numeric prediction
Bayesian networks
Semisupervised learning
Multi-instance learning
Extending ID3 to numeric attributes:
Evaluate the information gain for every possible split point; the info gain of the best split point becomes the info gain for the attribute
E.g. a binary split on temperature that produces the class distributions [4,2] and [5,3]:
Info([4,2],[5,3]) = 6/14 · info([4,2]) + 8/14 · info([5,3]) = 0.939 bits
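This number can be checked with a few lines of Python (a minimal sketch; info() computes the entropy of a list of class counts):

from math import log2

def info(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Weighted average of the entropies of the two subsets created by the split:
print(round(6/14 * info([4, 2]) + 8/14 * info([5, 3]), 3))   # 0.939 bits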
Remedy:
imp(k, 1, i) = min over 0 < j < i of [ imp(k−1, 1, j) + imp(1, j+1, i) ]
(imp(k, 1, i): impurity of the best split of the first i values into k intervals)
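A minimal sketch of this dynamic program, assuming a caller-supplied interval_impurity function; the function name and interface are illustrative, not from the book:

from functools import lru_cache

def best_partition_impurity(values, k, interval_impurity):
    """Impurity of the best split of `values` into k intervals, via
    imp(k, 1, i) = min over j of imp(k-1, 1, j) + imp(1, j+1, i).
    Assumes k <= len(values)."""
    @lru_cache(maxsize=None)
    def imp(k, i):                       # first i values split into k intervals
        if k == 1:
            return interval_impurity(values[:i])
        return min(imp(k - 1, j) + interval_impurity(values[j:i])
                   for j in range(k - 1, i))
    return imp(k, len(values))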
Missing values: split such instances into pieces whose weights sum to 1
Two strategies:
  Postpruning: take a fully grown decision tree and discard unreliable parts
  Prepruning: stop growing a branch when information becomes unreliable
Postpruning: first build the full tree, then prune it
Two pruning operations:
  Subtree replacement
  Subtree raising
Possible strategies:
  error estimation
  significance testing
  MDL principle
Subtree replacement: bottom-up
Subtree raising: delete a node and redistribute its instances
C4.5's method: derive a confidence limit for the error rate from the training data at each node
e = [ f + z²/2N + z·√( f/N − f²/N + z²/4N² ) ] / [ 1 + z²/N ]
(f: observed error rate on the training data, N: number of instances covered by the node, z: number of standard deviations for the chosen confidence level; C4.5 uses 25% confidence, i.e. z = 0.69)
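A small sketch of this estimate in Python (z = 0.69 corresponds to C4.5's default 25% confidence level):

from math import sqrt

def pessimistic_error(f, N, z=0.69):
    """Upper confidence limit on the error rate at a node: f is the observed
    error rate on the training data, N the number of instances."""
    return ((f + z*z / (2*N) + z * sqrt(f/N - f*f/N + z*z / (4*N*N)))
            / (1 + z*z / N))

print(round(pessimistic_error(2/6, 6), 2))   # 0.47  (leaf with f = 0.33, N = 6)
print(round(pessimistic_error(1/2, 2), 2))   # 0.72  (leaf with f = 0.5,  N = 2)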
Example (subtree with three leaves):
  Leaf 1: f = 0.33, e = 0.47
  Leaf 2: f = 0.5, e = 0.72
  Leaf 3: f = 0.33, e = 0.47
  Combining the leaf estimates using the ratios 6:2:6 gives 0.51
  Replacing the subtree with a single leaf: f = 5/14, e = 0.46
  e = 0.46 < 0.51, so prune!
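The pruning decision then reduces to a two-line comparison, using the estimates quoted above:

e_subtree = (6 * 0.47 + 2 * 0.72 + 6 * 0.47) / 14   # combined estimate, about 0.51
e_leaf = 0.46                                       # estimate for a single replacing leaf
print(e_leaf < e_subtree)                           # True: replace the subtree by a leaf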
Complexity of tree induction
Assume
m attributes
n training instances
Then (assuming the tree's depth is O(log n)):
  Cost of building the tree: O(m n log n)
  Subtree replacement: O(n)
  Subtree raising: O(n (log n)²)
  Total cost: O(m n log n) + O(n (log n)²)
Basic idea:
Differences:
Post-processing step
Measure 1: p/t (t: total instances covered by the rule, p: positive ones among them)
Measure 2: information gain p · [log(p/t) − log(P/T)],
  where P and T are the positive and total numbers before the new condition was added
Incremental pruning
Global pruning
Statistical significance
MDL principle
Reduced-error pruning: build the full rule set and then prune it
Stratification advantageous
Incremental reduced-error pruning
Initialize E to the instance set
Until E is empty do
  Split E into Grow and Prune in the ratio 2:1
  For each class C for which Grow contains an instance
    Use the basic covering algorithm to create the best perfect rule for C
    Calculate w(R), the worth of the rule on Prune,
      and w(R-), the worth of the rule with its final condition omitted
    If w(R-) > w(R), prune the rule and repeat the previous step
  From the rules for the different classes, select the one
    that is worth most (i.e. the one with the largest w(R))
  Print the rule
  Remove the instances covered by the rule from E
Continue
Measures used in IREP
[p + (N − n)] / T   (p, n: positive and negative instances covered by the rule; N, T: total number of negatives and total number of instances in the pruning data)
Counterintuitive:
Success rate p / t
Problem: p = 1 and t = 1
vs. p = 1000 and t = 1001
(p − n) / t
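A sketch of these worth measures (the function names are illustrative):

def worth_accuracy(p, n, N, T):
    """[p + (N - n)] / T: p, n = positives/negatives covered by the rule,
    N, T = total negatives / total instances in the pruning data."""
    return (p + (N - n)) / T

def success_rate(p, t):
    """p / t: positives covered divided by total instances covered."""
    return p / t

# p = 1, t = 1 scores higher than p = 1000, t = 1001,
# although the second rule is clearly the more useful one:
print(success_rate(1, 1), success_rate(1000, 1001))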
Stopping criterion
DL: bits needed to send the examples w.r.t. the set of rules, bits needed to send k tests, and bits for k
Once a rule set has been produced for each class, each rule is reconsidered and two variants are produced
Follow the temperature=mild link from the header table to find all instances that contain temperature=mild
Simple trick:
1st problem: speed
2nd problem: overfitting
All other instances can be deleted without changing the position and orientation of the maximum-margin hyperplane
x = w_0 + w_1·a_1 + w_2·a_2
x = b + Σ_{i is supp. vector} α_i y_i a(i)·a
  (a(i): the i-th support vector, y_i: its class value, a: the instance to classify, α_i and b: parameters determined by the learning algorithm)
Finding support vectors
Determine α_i and b?
A constrained quadratic optimization problem
x = b + Σ_{i is supp. vector} α_i y_i a(i)·a
Nonlinear SVMs
Example: x = b + Σ_{i is supp. vector} α_i y_i (a(i)·a)^n
Other kernel functions
Polynomial kernel:
  x = b + Σ_{i is supp. vector} α_i y_i (a(i)·a)^n
Using other kernel functions K:
  x = b + Σ_{i is supp. vector} α_i y_i K(a(i), a)
Only requirement: K(x_i, x_j) = Φ(x_i)·Φ(x_j) for some mapping Φ
Examples:
  K(x_i, x_j) = (x_i·x_j + 1)^d                (polynomial)
  K(x_i, x_j) = exp( −(x_i − x_j)² / (2σ²) )   (Gaussian / RBF)
  K(x_i, x_j) = tanh( β x_i·x_j + b )          (sigmoid; only a valid kernel for some β and b)
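A minimal sketch of two of these kernels and of the resulting prediction x = b + Σ α_i y_i K(a(i), a); the α_i and b would come from solving the optimization problem, and the parameter defaults are illustrative:

import numpy as np

def polynomial_kernel(xi, xj, d=3):
    return (np.dot(xi, xj) + 1) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    diff = np.asarray(xi) - np.asarray(xj)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

def svm_output(a, support_vectors, alphas, ys, b, kernel=rbf_kernel):
    """x = b + sum over the support vectors of alpha_i * y_i * K(a(i), a)."""
    return b + sum(alpha * y * kernel(sv, a)
                   for sv, alpha, y in zip(support_vectors, alphas, ys))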
Noise
Corresponding constraint: 0 ≤ α_i ≤ C
Sparse data
compute dot products very efficiently
Text classification
If there are tubes that enclose all the training points, the flattest of them is used
x = b + Σ_{i is supp. vector} α_i a(i)·a
Examples
(figures: SVM regression fits with ε = 2, ε = 1 and ε = 0.5)
Kernel Ridge Regression
Has an advantage if a nonlinear fit is desired, or if there are more attributes than training instances
The kernel perceptron
Can use the kernel trick here as well: express the weight vector as a sum over the misclassified training instances a'(j) and use that sum instead of explicit weights
(where y is either −1 or +1)
  Weighted sum: Σ_i w_i a_i, with w_i = Σ_j y(j) a'(j)_i
  Substituting: Σ_i Σ_j y(j) a'(j)_i a_i = Σ_j y(j) a'(j)·a
  Replacing the dot product with a kernel: Σ_j y(j) K(a'(j), a)
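A minimal sketch of the kernel perceptron along these lines; the stored misclassified instances play the role of the a'(j):

def train_kernel_perceptron(X, y, kernel, epochs=10):
    """Dual form of the perceptron: keep the misclassified instances instead
    of an explicit weight vector; y values must be -1 or +1."""
    mistakes = []                        # indices of the stored instances a'(j)
    for _ in range(epochs):
        for i, a in enumerate(X):
            out = sum(y[j] * kernel(X[j], a) for j in mistakes)
            if out * y[i] <= 0:          # wrong side of the boundary (or on it)
                mistakes.append(i)
    return mistakes

def kernel_perceptron_predict(a, X, y, mistakes, kernel):
    return 1 if sum(y[j] * kernel(X[j], a) for j in mistakes) > 0 else -1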
Comments on kernel perceptron
Linear and logistic regression can also be upgraded using the kernel trick
Need a differentiable error function: can't use zero-one loss, but can use the squared error
f(x) = 1 / (1 + exp(−x))
E = ½ (y − f(x))²
The two activation functions
Gradient descent example
Function: x² + 1
Derivative: 2x
Start value: 4
Can only find a local minimum!
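A few lines reproduce the example (the learning rate of 0.1 is an assumed value):

def gradient_descent(x, derivative, learning_rate=0.1, steps=50):
    """Repeatedly step in the direction of the negative gradient."""
    for _ in range(steps):
        x -= learning_rate * derivative(x)
    return x

# Minimize x^2 + 1 (derivative 2x) starting from x = 4:
print(gradient_descent(4.0, lambda x: 2 * x))   # close to 0, the minimum (global in this case)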
Minimizing the error I
Subgradients: the hinge loss is not differentiable at z = 1; use the subgradient 0 at z = 1
Work incrementally
Accept/reject instances
Easy to interpret
Smoothing formula: p' = (n·p + k·q) / (n + k)
  (p: prediction passed up from the node below, q: prediction of this node's model, n: number of training instances reaching the node below, k: smoothing constant)
Termination: stop when the standard deviation of the class values becomes a small fraction of that of the full training set, or when only a few instances remain
Splitting criterion: SDR = sd(T) − Σ_i ( |T_i| / |T| ) · sd(T_i)
Pruning heuristic: estimate the error of a linear model as (n + v)/(n − v) × average_absolute_error
  (v: number of parameters in the model)
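Both formulas are short in Python (a sketch; the default smoothing constant k = 15 is an assumption):

from statistics import pstdev

def sdr(T, subsets):
    """Standard deviation reduction sd(T) - sum_i |T_i|/|T| * sd(T_i);
    T and the T_i are lists of class values."""
    return pstdev(T) - sum(len(Ti) / len(T) * pstdev(Ti) for Ti in subsets)

def smooth(p, q, n, k=15):
    """Smoothed prediction p' = (n*p + k*q) / (n + k)."""
    return (n * p + k * q) / (n + k)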
Nominal attributes
Missing values: modified splitting criterion SDR = (m / |T|) · [ sd(T) − Σ_i ( |T_i| / |T| ) · sd(T_i) ], where m is the number of instances without missing values for the attribute
Surrogate splitting based on class
Split point divides the instances into L (smaller class average) and R (larger)
An instance with a missing value goes to L if its class value is below the average of the two subset averages, otherwise to R
Four methods:
Can use the same method to build rule sets for regression
instance-based learning
linear regression
Lazy:
works incrementally
But: slow
Design decisions
Weighting function:
etc.
Probability of an instance: Pr[a_1, a_2, ..., a_n] = ∏_{i=1}^{n} Pr[ a_i | a_i's parents ]
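A sketch of this product, assuming the network is stored as a mapping from each attribute to its parents and a conditional probability table (the encoding is illustrative):

def instance_probability(network, instance):
    """Pr[a_1, ..., a_n] = product over attributes of Pr[a_i | a_i's parents].
    network: attribute -> (list_of_parents, cpt), where cpt maps
    (value, tuple_of_parent_values) -> probability."""
    prob = 1.0
    for attribute, (parents, cpt) in network.items():
        parent_values = tuple(instance[p] for p in parents)
        prob *= cpt[(instance[attribute], parent_values)]
    return prob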
Learning Bayes nets
Can't simply maximize the likelihood of the training data, because then it's always better to add more edges (fit the training data more closely)
Bottom-up approach
Simple algorithm
Single-linkage
Complete-linkage
Group-average clustering
All measures will produce the same result if the clusters are compact and well separated
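The three cluster-distance measures differ only in how they aggregate pairwise distances (a sketch; dist is any instance-level distance function):

def single_linkage(A, B, dist):
    return min(dist(a, b) for a in A for b in B)

def complete_linkage(A, B, dist):
    return max(dist(a, b) for a in A for b in B)

def group_average(A, B, dist):
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))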
Example hierarchical clustering
(figures: the clusters at the start, then after successive merging steps)
CU(C_1, C_2, ..., C_k) = (1/k) · Σ_l Pr[C_l] · Σ_i Σ_j ( Pr[a_i = v_ij | C_l]² − Pr[a_i = v_ij]² )
If every instance is put into its own cluster, Pr[a_i = v_ij | C_l] is 0 or 1 and the numerator reaches its maximum value
  n − Σ_i Σ_j Pr[a_i = v_ij]²   (n: number of attributes)
Division by k penalizes this overfitting
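A direct translation of the nominal-attribute formula (a sketch; instances are dictionaries mapping attribute names to values):

from collections import Counter

def category_utility(clusters):
    """CU = (1/k) * sum_l Pr[C_l] * sum_i sum_j
    (Pr[a_i=v_ij | C_l]^2 - Pr[a_i=v_ij]^2)."""
    everything = [x for cluster in clusters for x in cluster]
    attributes = everything[0].keys()

    def sum_squared_probs(instances):
        total = 0.0
        for a in attributes:
            counts = Counter(x[a] for x in instances)
            total += sum((c / len(instances)) ** 2 for c in counts.values())
        return total

    base = sum_squared_probs(everything)
    k = len(clusters)
    return sum(len(C) / len(everything) * (sum_squared_probs(C) - base)
               for C in clusters) / k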
Numeric attributes
Assume each numeric attribute is normally distributed within a cluster:
  f(a) = ( 1 / (√(2π) σ) ) · exp( −(a − μ)² / (2σ²) )
Then the inner sum Σ_j Pr[a_i = v_ij]² becomes ∫ f(a_i)² da_i = 1 / (2√π σ_i)
Thus
  CU(C_1, C_2, ..., C_k) = (1/k) · Σ_l Pr[C_l] · Σ_i Σ_j ( Pr[a_i = v_ij | C_l]² − Pr[a_i = v_ij]² )
becomes
  CU(C_1, C_2, ..., C_k) = (1/k) · Σ_l Pr[C_l] · ( 1 / (2√π) ) · Σ_i ( 1/σ_il − 1/σ_i )
Prespecified minimum variance: the acuity parameter (prevents infinite 1/σ for single-instance clusters)
Probability-based clustering
Division by k?
Order of examples?
Probabilistic perspective
seek the most likely clusters given the data
Cluster A: μ_A = 50, σ_A = 5, p_A = 0.6
Cluster B: μ_B = 65, σ_B = 2, p_B = 0.4
Using the mixture model
Probability that instance x belongs to cluster A:
  Pr[A | x] = Pr[x | A] · Pr[A] / Pr[x] = f(x; μ_A, σ_A) · p_A / Pr[x]
  with f(x; μ, σ) = ( 1 / (√(2π) σ) ) · exp( −(x − μ)² / (2σ²) )
Likelihood of an instance given the clusters:
  Pr[x | the clusters] = Σ_i Pr[x | cluster_i] · Pr[cluster_i]
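With the example parameters above, the cluster membership of an instance follows directly from Bayes' rule (a minimal sketch):

from math import sqrt, pi, exp

def gaussian(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

clusters = {"A": (50, 5, 0.6), "B": (65, 2, 0.4)}   # (mu, sigma, p) from the example

def membership(x):
    joint = {c: p * gaussian(x, mu, sigma) for c, (mu, sigma, p) in clusters.items()}
    total = sum(joint.values())          # Pr[x | the clusters]
    return {c: v / total for c, v in joint.items()}

print(membership(55))   # almost entirely cluster A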
Learning the clusters
Assume: we know the number of clusters, k
Performance criterion: likelihood of the training data given the clusters
EM algorithm
EM = Expectation-Maximization
Iterative procedure:
E expectation step:
Calculate cluster probability for each instance
M maximization step:
Estimate distribution parameters from cluster probabilities
Log-likelihood:
  Σ_i log( p_A · Pr[x_i | A] + p_B · Pr[x_i | B] )   (iterate until this stops improving)
Parameter estimates from weighted instances (w_i: probability that instance x_i belongs to cluster A):
  μ_A = (w_1 x_1 + w_2 x_2 + ... + w_n x_n) / (w_1 + w_2 + ... + w_n)
  σ_A² = ( w_1 (x_1 − μ_A)² + w_2 (x_2 − μ_A)² + ... + w_n (x_n − μ_A)² ) / (w_1 + w_2 + ... + w_n)
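A compact sketch of the whole procedure for two one-dimensional Gaussian clusters; the initialisation and iteration count are arbitrary choices, and no minimum is imposed on the standard deviations, so degenerate clusters are possible:

from math import sqrt, pi, exp, log

def density(x, mu, sd):
    return exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sqrt(2 * pi) * sd)

def em_two_clusters(xs, iterations=50):
    mu_a, mu_b = min(xs), max(xs)                     # crude initialisation
    sd_a = sd_b = (max(xs) - min(xs)) / 4 or 1.0
    p_a = 0.5
    for _ in range(iterations):
        # E step: probability w_i that each instance belongs to cluster A
        w = [p_a * density(x, mu_a, sd_a) /
             (p_a * density(x, mu_a, sd_a) + (1 - p_a) * density(x, mu_b, sd_b))
             for x in xs]
        # M step: weighted estimates of the means, standard deviations and p_A
        sw = sum(w)
        mu_a = sum(wi * x for wi, x in zip(w, xs)) / sw
        mu_b = sum((1 - wi) * x for wi, x in zip(w, xs)) / (len(xs) - sw)
        sd_a = sqrt(sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, xs)) / sw)
        sd_b = sqrt(sum((1 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, xs)) / (len(xs) - sw))
        p_a = sw / len(xs)
    loglik = sum(log(p_a * density(x, mu_a, sd_a) + (1 - p_a) * density(x, mu_b, sd_b))
                 for x in xs)
    return (mu_a, sd_a), (mu_b, sd_b), p_a, loglik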
Extending the mixture model
post-processing step
pre-processing step
First, build a naïve Bayes model on the labeled data
Second, label the unlabeled data based on the class probabilities it produces (expectation step)
Third, train a new naïve Bayes model based on all the data (maximization step)
Fourth, repeat 2nd and 3rd step until convergence
First set of attributes describes the content of the web page; second set of attributes describes the links that point to the web page
Kernel-based methods
Diverse-density