Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 4 of Data Mining by I. H. Witten and E. Frank
Algorithms: The basic methods
●
Inferring rudimentary rules
●
Statistical modeling
●
Constructing decision trees
●
Constructing rules
●
Association rule learning
●
Linear models
●
Instance-based learning
●
Clustering
Simplicity first
●
Simple algorithms often work very well!
●
There are many kinds of simple structure, e.g.:
♦ One attribute does all the work
♦ All attributes contribute equally & independently
♦ A weighted linear combination might do
♦ Instance-based: use a few prototypes
♦ Use simple logical rules
●
Success of method depends on the domain
Inferring rudimentary rules
●
1R: learns a 1-level decision tree
♦ I.e., rules that all test one particular attribute
●
Basic version
♦ One branch for each value
♦ Each branch assigns most frequent class
♦ Error rate: proportion of instances that don’t
belong to the majority class of their
corresponding branch
♦ Choose attribute with lowest error rate
(assumes nominal attributes)
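A minimal sketch of this basic 1R procedure for nominal attributes (the dict-based data layout and the function names are illustrative assumptions, not the book's Weka implementation):

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_name="play"):
    """Basic 1R: for each attribute build a one-level rule set and
    keep the attribute whose rules make the fewest training errors."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)          # attribute value -> class counts
        for inst in instances:
            counts[inst[attr]][inst[class_name]] += 1
        # Each branch predicts its most frequent class.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors: instances outside the majority class of their branch.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    errors, attr, rules = best
    return attr, rules, errors

# Purely illustrative call (two instances of the weather data):
weather = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
]
print(one_r(weather, ["outlook", "windy"]))
```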
Evaluating the weather attributes
(Table omitted: the weather data with attributes Outlook, Temp, Humidity, Windy and class Play, together with each attribute's 1R rules and error counts.)
Dealing with numeric attributes
●
Discretize numeric attributes
●
Divide each attribute’s range into intervals
♦ Sort instances according to attribute’s values
♦ Place breakpoints where class changes (majority class)
♦ This minimizes the total error
●
Example: temperature from weather data
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
The problem of overfitting
●
This procedure is very sensitive to noise
♦ One instance with an incorrect class label will probably
produce a separate interval
●
Also: time stamp attribute will have zero errors
●
Simple solution:
enforce minimum number of instances in majority
class per interval
●
Example (with min = 3):
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
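A rough sketch of this discretization with a minimum majority count (the exact tie-breaking policy is an illustrative assumption); with min = 3 on the temperature data it reproduces the three intervals shown above, i.e. breakpoints at 70.5 and 77.5:

```python
def one_r_discretize(values, classes, min_bucket=3):
    """Place breakpoints where the class changes, but only close an
    interval once it contains at least `min_bucket` instances of its
    majority class (the simple overfitting guard described above)."""
    pairs = sorted(zip(values, classes))
    breakpoints, count = [], {}
    for i, (v, c) in enumerate(pairs):
        count[c] = count.get(c, 0) + 1
        majority_class, majority = max(count.items(), key=lambda kv: kv[1])
        if majority >= min_bucket and i + 1 < len(pairs):
            nxt = pairs[i + 1]
            # Never split between equal values or when the class stays the same.
            if nxt[0] != v and nxt[1] != majority_class:
                breakpoints.append((v + nxt[0]) / 2)
                count = {}
    return breakpoints

temperature = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes",
        "yes", "yes", "no", "yes", "yes", "no"]
print(one_r_discretize(temperature, play))   # [70.5, 77.5]
```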
Discussion of 1R
●
1R was described in a paper by Holte (1993)
♦ Contains an experimental evaluation on 16 datasets
(using cross-validation so that results were
representative of performance on future data)
♦ Minimum number of instances was set to 6 after
some experimentation
♦ 1R’s simple rules performed not much worse than
much more complex decision trees
●
Simplicity first pays off!
Very Simple Classification Rules Perform Well on Most
Commonly Used Datasets
Robert C. Holte, Computer Science Department, University of Ottawa
Statistical modeling
●
“Opposite” of 1R: use all the attributes
●
Two assumptions: Attributes are
♦ equally important
♦ statistically independent (given the class value)
● I.e., knowing the value of one attribute says nothing
about the value of another (if the class is known)
●
Independence assumption is never correct!
●
But … this scheme works well in practice
Probabilities for weather data
            Outlook              Temperature           Humidity              Windy               Play
            Yes   No             Yes   No              Yes   No              Yes   No            Yes    No
Sunny        2     3    Hot       2     2    High       3     4    False      6     2             9      5
Overcast     4     0    Mild      4     2    Normal     6     1    True       3     3
Rainy        3     2    Cool      3     1
Sunny       2/9   3/5   Hot      2/9   2/5   High      3/9   4/5   False     6/9   2/5           9/14   5/14
Overcast    4/9   0/5   Mild     4/9   2/5   Normal    6/9   1/5   True      3/9   3/5
Rainy       3/9   2/5   Cool     3/9   1/5
●
A new day: Outlook Temp. Humidity Windy Play
Sunny Cool High True ?
Bayes’s rule
● Probability of event H given evidence E:
Pr[H∣E] = Pr[E∣H] × Pr[H] / Pr[E]
●
A priori probability of H: Pr[H]
● Probability of event before evidence is seen
●
A posteriori probability of H: Pr[H∣E]
● Probability of event after evidence is seen
Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England
Naïve Bayes for classification
●
Classification learning: what’s the
probability of the class given an instance?
♦ Evidence E = instance
♦ Event H = class value for instance
●
Naïve assumption: evidence splits into parts
(i.e. attributes) that are independent given the class, so
Pr[H∣E] = Pr[E1∣H] × Pr[E2∣H] × … × Pr[En∣H] × Pr[H] / Pr[E]
Weather data example
Outlook Temp. Humidity Windy Play
Sunny Cool High True ?
(Evidence E)
Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
Likelihood of no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Pr[yes∣E] = 0.0053 / (0.0053 + 0.0206) = 20.5%,  Pr[no∣E] = 79.5%
●
What if an attribute value doesn't occur with every
class value?
(e.g. “Outlook = Overcast” for class “no”)
♦ Probability will be zero: Pr[Outlook=Overcast∣no] = 0
♦ A posteriori probability will also be zero for any instance
with Outlook = Overcast: Pr[no∣E] = 0
(No matter how likely the other values are!)
●
Remedy: add 1 to the count for every attribute
value-class combination (Laplace estimator)
●
Result: probabilities will never be zero!
(also: stabilizes probability estimates)
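A small sketch of Naïve Bayes with the Laplace estimator for nominal attributes (data layout, class names and method names are assumptions, not the book's implementation):

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Naive Bayes for nominal attributes with add-one (Laplace) smoothing."""

    def fit(self, instances, class_name):
        self.class_name = class_name
        self.class_counts = Counter(inst[class_name] for inst in instances)
        self.counts = defaultdict(lambda: defaultdict(Counter))  # attr -> class -> value counts
        self.values = defaultdict(set)                           # attr -> observed values
        for inst in instances:
            for attr, value in inst.items():
                if attr != class_name:
                    self.counts[attr][inst[class_name]][value] += 1
                    self.values[attr].add(value)
        return self

    def predict(self, instance):
        """instance: dict of attribute -> value (class attribute omitted)."""
        n = sum(self.class_counts.values())
        scores = {}
        for cls, cls_count in self.class_counts.items():
            score = cls_count / n                               # prior Pr[class]
            for attr, value in instance.items():
                count = self.counts[attr][cls][value] + 1       # Laplace: add 1 to every count
                total = cls_count + len(self.values[attr])      # denominator grows by #values
                score *= count / total                          # smoothed Pr[attr=value | class]
            scores[cls] = score
        total = sum(scores.values())
        return {cls: s / total for cls, s in scores.items()}    # normalised probabilities
```

Note that attributes missing from the test instance are simply left out of the product, which matches the treatment of missing values described two slides further on.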
Modified probability estimates
●
In some cases adding a constant different
from 1 might be more appropriate
●
Example: attribute outlook for class yes
Sunny: (2 + μ/3) / (9 + μ)    Overcast: (4 + μ/3) / (9 + μ)    Rainy: (3 + μ/3) / (9 + μ)
(μ is a constant; μ = 3 gives the Laplace estimator above)
Missing values
●
Training: instance is not included in frequency
count for attribute value-class combination
●
Classification: attribute will be omitted from
calculation
●
Example: Outlook Temp. Humidity Windy Play
? Cool High True ?
Numeric attributes
● Usual assumption: attributes have a
normal or Gaussian probability
distribution (given the class)
● The probability density function for the
normal distribution is defined by two
parameters:
●
Sample mean: μ = (1/n) Σ_{i=1..n} x_i
●
Standard deviation: σ = √( (1/(n−1)) Σ_{i=1..n} (x_i − μ)² )
● Then the density function f(x) is
f(x) = (1 / (√(2π) σ)) e^( −(x − μ)² / (2σ²) )
Statistics for weather data
●
Example density value, using mean 73 and standard deviation 6.2 for temperature when play = yes:
f(temperature = 66 ∣ yes) = (1 / (√(2π) × 6.2)) e^( −(66 − 73)² / (2 × 6.2²) ) = 0.0340
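A two-line check of this density value (a sketch; the mean and standard deviation are the ones quoted above):

```python
import math

def gaussian_density(x, mean, std):
    """Density f(x) of a normal distribution with the given mean and std. dev."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Temperature = 66 given play = yes (mean 73, standard deviation 6.2):
print(round(gaussian_density(66, 73, 6.2), 4))   # 0.034
```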
Classifying a new day
●
A new day: Outlook Temp. Humidity Windy Play
Sunny 66 90 true ?
●
Missing values during training are not
included in calculation of mean and
standard deviation
Probability densities
●
Relationship between probability and
density:
Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε × f(c)
●
But: this doesn’t change calculation of a
posteriori probabilities because ε cancels out
●
Exact relationship:
Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt
Naïve Bayes: discussion
●
Naïve Bayes works surprisingly well (even if
independence assumption is clearly violated)
●
Why? Because classification doesn’t require accurate
probability estimates as long as maximum probability
is assigned to correct class
●
However: adding too many redundant attributes will
cause problems (e.g. identical attributes)
●
Note also: many numeric attributes are not normally
distributed (→ kernel density estimators)
Constructing decision trees
●
Strategy: top down
Recursive divide-and-conquer fashion
♦ First: select attribute for root node
Create branch for each possible attribute value
♦ Then: split instances into subsets
One for each branch extending from the node
♦ Finally: repeat recursively for each branch, using
only instances that reach the branch
●
Stop if all instances have the same class
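A compact sketch of this recursive divide-and-conquer procedure, using the information-gain criterion introduced on the following slides; the data layout (a list of attribute-to-value dicts) and names are assumptions:

```python
import math
from collections import Counter

def info(class_counts):
    """Entropy (in bits) of a list of class counts."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c)

def build_tree(instances, attributes, class_name):
    """Recursive divide-and-conquer construction of a decision tree."""
    classes = Counter(inst[class_name] for inst in instances)
    # Stop if all instances have the same class or no attributes remain.
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]          # leaf: majority class
    # Select the attribute with the greatest information gain
    # (equivalently, the smallest expected information after the split).
    def expected_info(attr):
        subsets = Counter((inst[attr], inst[class_name]) for inst in instances)
        per_value = Counter(inst[attr] for inst in instances)
        return sum(per_value[v] / len(instances) *
                   info([subsets[(v, c)] for c in classes])
                   for v in per_value)
    best = min(attributes, key=expected_info)
    # One branch per value; recurse on the instances that reach it.
    tree = {}
    for value in set(inst[best] for inst in instances):
        subset = [inst for inst in instances if inst[best] == value]
        rest = [a for a in attributes if a != best]
        tree[value] = build_tree(subset, rest, class_name)
    return (best, tree)
```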
Which attribute to select?
(Figures: the tree stumps obtained by splitting the weather data on each of the four attributes)
Criterion for attribute selection
●
Which is the best attribute?
♦ Want to get the smallest tree
♦ Heuristic: choose the attribute that produces the
“purest” nodes
●
Popular impurity criterion: information gain
♦ Information gain increases with the average
purity of the subsets
●
Strategy: choose attribute that gives greatest
information gain
Computing information
●
Measure information in bits
♦ Given a probability distribution, the info
required to predict an event is the
distribution’s entropy
♦ Entropy gives the information required in bits
(can involve fractions of bits!)
●
Formula for computing the entropy:
entropy(p1, p2, …, pn) = −p1 log p1 − p2 log p2 − … − pn log pn
Example: attribute Outlook
●
Outlook = Sunny:
info([2,3]) = entropy(2/5, 3/5) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971 bits
●
Outlook = Overcast:
info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
(Note: 0 log(0) is normally undefined; it is treated as 0 here.)
●
Outlook = Rainy:
info([3,2]) = entropy(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits
●
Expected information for attribute:
info([2,3], [4,0], [3,2]) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.693 bits
Computing information gain
●
Information gain: information before splitting –
information after splitting
gain(Outlook ) = info([9,5]) – info([2,3],[4,0],[3,2])
= 0.940 – 0.693
= 0.247 bits
●
Information gain for attributes from weather data:
gain(Outlook ) = 0.247 bits
gain(Temperature ) = 0.029 bits
gain(Humidity ) = 0.152 bits
gain(Windy ) = 0.048 bits
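These gains can be checked directly from the class counts; a minimal sketch (the per-value [yes, no] counts are read off the frequency table earlier in this chapter):

```python
import math

def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain(before, split):
    """Information gain = info before splitting - weighted info after."""
    n = sum(before)
    return info(before) - sum(sum(s) / n * info(s) for s in split)

# Class counts [yes, no] in each subset of the weather data:
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # Outlook:     0.247
print(round(gain([9, 5], [[2, 2], [4, 2], [3, 1]]), 3))  # Temperature: 0.029
print(round(gain([9, 5], [[3, 4], [6, 1]]), 3))          # Humidity:    0.152
print(round(gain([9, 5], [[6, 2], [3, 3]]), 3))          # Windy:       0.048
```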
Continuing to split
(information gains for the subset of instances with Outlook = Sunny)
gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits
Final decision tree
●
Note: not all leaves need to be pure; sometimes
identical instances have different classes
⇒
Splitting stops when data can’t be split any further
Wishlist for a purity measure
●
Properties we require from a purity measure:
♦ When node is pure, measure should be zero
♦ When impurity is maximal (i.e. all classes equally
likely), measure should be maximal
♦ Measure should obey multistage property (i.e.
decisions can be made in several stages):
measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])
●
Entropy is the only function that satisfies all
three properties!
Properties of the entropy
●
The multistage property:
entropy(p, q, r) = entropy(p, q+r) + (q+r) × entropy(q/(q+r), r/(q+r))
●
Simplification of computation:
info([2,3,4]) = −2/9 × log(2/9) − 3/9 × log(3/9) − 4/9 × log(4/9)
= [−2 × log 2 − 3 × log 3 − 4 × log 4 + 9 × log 9] / 9
●
Note: instead of maximizing info gain we
could just minimize information
Discussion
●
Top-down induction of decision trees: ID3,
algorithm developed by Ross Quinlan
♦ Gain ratio just one modification of this basic
algorithm
♦ ⇒ C4.5: deals with numeric attributes, missing
values, noisy data
●
Similar approach: CART
●
There are many other attribute selection
criteria!
(But little difference in accuracy of result)
Classification rules
●
Popular alternative to decision trees
●
Antecedent (precondition): a series of tests (just like
the tests at the nodes of a decision tree)
●
Tests are usually logically ANDed together (but may
also be general logical expressions)
●
Consequent (conclusion): classes, set of classes, or
probability distribution assigned by rule
●
Individual rules are often logically ORed together
♦ Conflicts arise if different conclusions apply
From trees to rules
●
Easy: converting a tree into a set of rules
♦ One rule for each leaf:
● Antecedent contains a condition for every node on the path
from the root to the leaf
● Consequent is class assigned by the leaf
●
Produces rules that are unambiguous
♦ Doesn’t matter in which order they are executed
●
But: resulting rules are unnecessarily complex
♦ Pruning to remove redundant tests/rules
From rules to trees
● More difficult: transforming a rule set into a tree
● Tree cannot easily express disjunction between rules
● Example: rules which test different attributes
If a and b then x
If c and d then x
● Symmetry needs to be broken
● Corresponding tree contains identical subtrees
(⇒ “replicated subtree problem”)
A tree for a simple disjunction
(Figure: a decision tree encoding the two rules above; the tests on c and d must be replicated in more than one branch)
The exclusive-or problem
If x = 1 and y = 0
then class = a
If x = 0 and y = 1
then class = a
If x = 0 and y = 0
then class = b
If x = 1 and y = 1
then class = b
A tree with a replicated subtree
If x = 1 and y = 1
then class = a
If z = 1 and w = 1
then class = a
Otherwise class = b
“Nuggets” of knowledge
●
Are rules independent pieces of knowledge? (It
seems easy to add a rule to an existing rule base.)
●
Problem: ignores how rules are executed
●
Two ways of executing a rule set:
♦ Ordered set of rules (“decision list”)
● Order is important for interpretation
♦ Unordered set of rules
● Rules may overlap and lead to different conclusions for the
same instance
Interpreting rules
●
What if two or more rules conflict?
♦ Give no conclusion at all?
♦ Go with rule that is most popular on training data?
♦ …
●
What if no rule applies to a test instance?
♦ Give no conclusion at all?
♦ Go with class that is most frequent in training data?
♦ …
Covering algorithms
●
Convert decision tree into a rule set
♦ Straightforward, but rule set overly complex
♦ More effective conversions are not trivial
●
Instead, can generate rule set directly
♦ for each class in turn find rule set that covers
all instances in it
(excluding instances not in the class)
●
Called a covering approach:
♦ at each stage a rule is identified that “covers”
some of the instances
Example: generating a rule
●
Possible rule set for class “b”:
If x ≤ 1.2 then class = b
If x > 1.2 and y ≤ 2.6 then class = b
●
Could add more rules, get “perfect” rule set
Rules vs. trees
Corresponding decision tree:
(produces exactly the same
predictions)
●
But: rule sets can be more perspicuous when
decision trees suffer from replicated subtrees
●
Also: in multiclass situations, covering algorithm
concentrates on one class at a time whereas
decision tree learner takes all classes into account
Simple covering algorithm
●
Generates a rule by adding tests that maximize
rule’s accuracy
●
Similar to situation in decision trees: problem of
selecting an attribute to split on
♦ But: decision tree inducer maximizes overall purity
●
Each new test reduces
rule’s coverage:
Selecting a test
●
Goal: maximize accuracy
♦ t total number of instances covered by rule
♦ p positive examples of the class covered by rule
♦ t – p number of errors made by rule
⇒ Select test that maximizes the ratio p/t
●
We are finished when p/t = 1 or the set of
instances can’t be split any further
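A sketch of growing a single rule this way, greedily adding the attribute-value test with the best p/t ratio and breaking ties in favour of larger coverage (the dict-based data layout is an assumption):

```python
def grow_rule(instances, attributes, target_class, class_name):
    """Grow one rule for `target_class` by repeatedly adding the
    attribute-value test with the highest p/t ratio (ties broken by
    larger coverage p)."""
    conditions = {}                     # attribute -> required value
    covered = list(instances)
    while True:
        positives = [i for i in covered if i[class_name] == target_class]
        if not positives or len(positives) == len(covered):
            break                       # p/t = 1 or nothing left to gain
        best, best_key = None, (-1.0, -1)
        for attr in attributes:
            if attr in conditions:
                continue
            for value in set(i[attr] for i in covered):
                subset = [i for i in covered if i[attr] == value]
                p = sum(1 for i in subset if i[class_name] == target_class)
                t = len(subset)
                key = (p / t, p)        # accuracy first, then coverage
                if key > best_key:
                    best_key, best = key, (attr, value)
        if best is None:
            break                       # no attributes left to test
        attr, value = best
        conditions[attr] = value
        covered = [i for i in covered if i[attr] == value]
    return conditions
```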
Example: contact lens data
●
Rule we seek:
If ?
then recommendation = hard
●
Possible tests:
Modified rule and resulting data
●
Rule with best test added:
If astigmatism = yes
then recommendation = hard
●
Instances covered by modified rule:
Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lenses
Young           Myope                    Yes           Reduced                None
Young           Myope                    Yes           Normal                 Hard
Young           Hypermetrope             Yes           Reduced                None
Young           Hypermetrope             Yes           Normal                 Hard
Pre-presbyopic  Myope                    Yes           Reduced                None
Pre-presbyopic  Myope                    Yes           Normal                 Hard
Pre-presbyopic  Hypermetrope             Yes           Reduced                None
Pre-presbyopic  Hypermetrope             Yes           Normal                 None
Presbyopic      Myope                    Yes           Reduced                None
Presbyopic      Myope                    Yes           Normal                 Hard
Presbyopic      Hypermetrope             Yes           Reduced                None
Presbyopic      Hypermetrope             Yes           Normal                 None
Further refinement
●
Current state:
If astigmatism = yes
and ?
then recommendation = hard
●
Possible tests:
Modified rule and resulting data
●
Rule with best test added:
If astigmatism = yes
and tear production rate = normal
then recommendation = hard
●
Instances covered by modified rule:
Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lenses
Young           Myope                    Yes           Normal                 Hard
Young           Hypermetrope             Yes           Normal                 Hard
Pre-presbyopic  Myope                    Yes           Normal                 Hard
Pre-presbyopic  Hypermetrope             Yes           Normal                 None
Presbyopic      Myope                    Yes           Normal                 Hard
Presbyopic      Hypermetrope             Yes           Normal                 None
Further refinement
●
Current state:
If astigmatism = yes
and tear production rate = normal
and ?
then recommendation = hard
●
Possible tests:
Age = Young 2/2
Age = Pre-presbyopic 1/2
Age = Presbyopic 1/2
Spectacle prescription = Myope 3/3
Spectacle prescription = Hypermetrope 1/3
●
Tie between the first and the fourth test
♦ We choose the one with greater coverage
The result
●
Final rule: If astigmatism = yes
and tear production rate = normal
and spectacle prescription = myope
then recommendation = hard
●
Second rule for recommending “hard lenses”:
(built from instances not covered by first rule)
If age = young and astigmatism = yes
and tear production rate = normal
then recommendation = hard
●
These two rules cover all “hard lenses”:
♦ Process is repeated with other two classes
Linear models: linear regression
●
Work most naturally with numeric attributes
●
Standard technique for numeric prediction
♦ Outcome is linear combination of attributes
x = w0 + w1a1 + w2a2 + … + wkak
●
Weights are calculated from the training data
●
Predicted value for first training instance a(1)
w0 a0^(1) + w1 a1^(1) + w2 a2^(1) + … + wk ak^(1) = Σ_{j=0..k} wj aj^(1)
(assuming each instance is extended with a constant attribute with value 1)
Minimizing the squared error
● Choose k +1 coefficients to minimize the squared
error on the training data
● Squared error:
Σ_{i=1..n} ( x^(i) − Σ_{j=0..k} wj aj^(i) )²
● Derive coefficients using standard matrix
operations
● Can be done if there are more instances than
attributes (roughly speaking)
● Minimizing the absolute error is more difficult
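A minimal sketch using NumPy's least-squares solver, with the constant attribute added explicitly as assumed above (NumPy is used here for illustration only; it is not part of the book's toolkit):

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear regression: returns weights w0..wk, where
    w0 is the coefficient of the added constant attribute with value 1."""
    A = np.column_stack([np.ones(len(X)), X])     # prepend the bias column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)     # minimises the squared error
    return w

def predict(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w

# Illustrative data: y is roughly 1 + 2*a1 - 3*a2 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1] + 0.01 * rng.normal(size=50)
print(fit_linear(X, y))    # close to [1, 2, -3]
```

np.linalg.lstsq also copes with the rank-deficient case hinted at above (too few instances relative to attributes) by returning a minimum-norm solution.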
Classification
●
Any regression technique can be used for
classification
♦ Training: perform a regression for each class, setting
the output to 1 for training instances that belong to
class, and 0 for those that don’t
♦ Prediction: predict class corresponding to model
with largest output value (membership value)
●
For linear regression this is known as multi-response
linear regression
●
Problem: membership values are not in [0,1]
range, so aren't proper probability estimates
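A sketch of the multi-response scheme just described: one 0/1 membership regression per class, predicting the class with the largest output (names are illustrative):

```python
import numpy as np

def fit_multiresponse(X, labels):
    """Multi-response linear regression: one least-squares model per class,
    trained on a 0/1 membership target."""
    A = np.column_stack([np.ones(len(X)), X])
    W = {}
    for cls in sorted(set(labels)):
        target = np.array([1.0 if l == cls else 0.0 for l in labels])
        W[cls], *_ = np.linalg.lstsq(A, target, rcond=None)
    return W

def classify(W, x):
    """Predict the class whose model produces the largest output."""
    a = np.concatenate([[1.0], x])
    return max(W, key=lambda cls: a @ W[cls])
```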
Instance-based learning
● Distance function defines what’s learned
● Most instance-based schemes use
Euclidean distance:
√( (a1^(1) − a1^(2))² + (a2^(1) − a2^(2))² + … + (ak^(1) − ak^(2))² )
a(1) and a(2): two instances with k attributes
● Taking the square root is not required when
comparing distances
● Other popular metric: city-block metric
● Adds differences without squaring them
Normalization and other issues
●
Different attributes are measured on different
scales ⇒ need to be normalized:
ai = (vi − min vi) / (max vi − min vi)
where vi is the actual value of attribute i
●
Nominal attributes: distance either 0 or 1
●
Common policy for missing values: assumed to be
maximally distant (given normalized attributes)
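A small sketch combining min-max normalization, Euclidean distance and a k-nearest-neighbor vote (function names are illustrative; in practice the training set's minima and maxima should also be used to rescale test instances):

```python
import math

def normalize(dataset):
    """Rescale each numeric attribute to [0, 1] using its min and max."""
    lo = [min(col) for col in zip(*dataset)]
    hi = [max(col) for col in zip(*dataset)]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in dataset]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, x, k=1):
    """Predict the majority class among the k nearest training instances."""
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: euclidean(pair[0], x))[:k]
    votes = {}
    for _, cls in neighbours:
        votes[cls] = votes.get(cls, 0) + 1
    return max(votes, key=votes.get)
```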
Finding nearest neighbors efficiently
●
Simplest way of finding nearest neighbor: linear
scan of the data
♦ Classification takes time proportional to the product of
the number of instances in training and test sets
●
Nearest-neighbor search can be done more
efficiently using appropriate data structures
●
We will discuss two methods that represent training
data in a tree structure:
kD-trees and ball trees
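In practice a library tree structure can stand in for a hand-built kD-tree; a brief sketch using SciPy's KDTree (SciPy is an assumption here, not something the slides rely on):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(42)
points = rng.random((1000, 3))            # 1000 training instances, 3 attributes
tree = KDTree(points)                     # build the kD-tree once

query = rng.random(3)
distance, index = tree.query(query, k=1)  # nearest neighbour without a linear scan
print(index, distance)
```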
kD-tree example
Using kD-trees: example
(Figures: a kD-tree built from a small two-dimensional dataset, and the regions examined when it is used to locate a query point's nearest neighbor)
Discussion of nearest-neighbor learning
● Often very accurate
● Assumes all attributes are equally important
● Remedy: attribute selection or weights
● Possible remedies against noisy instances:
● Take a majority vote over the k nearest neighbors
● Removing noisy instances from dataset (difficult!)
● Statisticians have used k-NN since the early 1950s
● If n → ∞ and k/n → 0, the error approaches the minimum
● kD-trees become inefficient when the number of
attributes is too large (approximately > 10)
● Ball trees (which are instances of metric trees) work
well in higher-dimensional spaces
Clustering
●
Clustering techniques apply when there is no class to be
predicted
●
Aim: divide instances into “natural” groups
●
As we've seen clusters can be:
♦ disjoint vs. overlapping
♦ deterministic vs. probabilistic
♦ flat vs. hierarchical
●
We'll look at a classic clustering algorithm called
k-means
♦ k-means clusters are disjoint, deterministic, and flat
The k-means algorithm
To cluster data into k groups:
(k is predefined)
1. Choose k cluster centers
♦ e.g. at random
2. Assign instances to clusters
♦ based on distance to cluster centers
3. Compute centroids of clusters
4. Make the centroids the new cluster centers and go back to step 2
♦ until the cluster assignments no longer change (convergence)
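A minimal NumPy sketch of these steps (the random seeding and the empty-cluster guard are illustrative choices):

```python
import numpy as np

def kmeans(X, k, iterations=100, seed=0):
    """Plain k-means: random initial centers, then alternate between
    assigning instances to the nearest center and recomputing centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 1
    for _ in range(iterations):
        # Step 2: assign each instance to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Step 3: recompute each cluster's centroid.
        new_centers = np.array([X[assignment == j].mean(axis=0)
                                if np.any(assignment == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                     # converged
            break
        centers = new_centers
    return centers, assignment
```

Running it several times with different seeds and keeping the best clustering corresponds to the restart strategy mentioned in the discussion that follows.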
Discussion
●
Algorithm minimizes squared distance to cluster
centers
●
Result can vary significantly
♦ based on initial choice of seeds
●
Can get trapped in local minimum
♦ Example: (figure omitted; it shows a set of instances and a choice of initial cluster centres that leads to a poor local minimum)
●
To increase chance of finding global optimum: restart
with different random seeds
●
Can be applied recursively with k = 2