Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 4 of Data Mining by I. H. Witten and E. Frank
Algorithms: The basic methods
●
Inferring rudimentary rules
●
Statistical modeling
●
Constructing decision trees
●
Constructing rules
●
Association rule learning
●
Linear models
●
Instance-based learning
●
Clustering
Simplicity first
●
Simple algorithms often work very well!
●
There are many kinds of simple structure, e.g.:
♦ One attribute does all the work
♦ All attributes contribute equally & independently
♦ A weighted linear combination might do
♦ Instance-based: use a few prototypes
♦ Use simple logical rules
●
Success of method depends on the domain
Inferring rudimentary rules
●
1R: learns a 1-level decision tree
♦ I.e., rules that all test one particular attribute
●
Basic version
♦ One branch for each value
♦ Each branch assigns most frequent class
♦ Error rate: proportion of instances that don’t
belong to the majority class of their
corresponding branch
♦ Choose attribute with lowest error rate
(assumes nominal attributes)
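A minimal sketch of this basic 1R procedure for nominal attributes (the dict-based data layout and the function names are illustrative assumptions, not the book's Weka implementation):

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_name="play"):
    """Basic 1R: for each attribute build a one-level rule set and
    keep the attribute whose rules make the fewest training errors."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)          # attribute value -> class counts
        for inst in instances:
            counts[inst[attr]][inst[class_name]] += 1
        # Each branch predicts its most frequent class.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors: instances outside the majority class of their branch.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    errors, attr, rules = best
    return attr, rules, errors

# Purely illustrative call (two instances of the weather data):
weather = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
]
print(one_r(weather, ["outlook", "windy"]))
```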
Evaluating the weather attributes
(Table omitted: the weather data with attributes Outlook, Temp, Humidity, Windy and class Play, together with each attribute's 1R rules and error counts.)
Dealing with numeric attributes
●
Discretize numeric attributes
●
Divide each attribute’s range into intervals
♦ Sort instances according to attribute’s values
♦ Place breakpoints where class changes (majority class)
♦ This minimizes the total error
●
Example: temperature from weather data
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
The problem of overfitting
●
This procedure is very sensitive to noise
♦ One instance with an incorrect class label will probably
produce a separate interval
●
Also: time stamp attribute will have zero errors
●
Simple solution:
enforce minimum number of instances in majority
class per interval
●
Example (with min = 3):
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
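A rough sketch of this discretization with a minimum majority count (the exact tie-breaking policy is an illustrative assumption); with min = 3 on the temperature data it reproduces the three intervals shown above, i.e. breakpoints at 70.5 and 77.5:

```python
def one_r_discretize(values, classes, min_bucket=3):
    """Place breakpoints where the class changes, but only close an
    interval once it contains at least `min_bucket` instances of its
    majority class (the simple overfitting guard described above)."""
    pairs = sorted(zip(values, classes))
    breakpoints, count = [], {}
    for i, (v, c) in enumerate(pairs):
        count[c] = count.get(c, 0) + 1
        majority_class, majority = max(count.items(), key=lambda kv: kv[1])
        if majority >= min_bucket and i + 1 < len(pairs):
            nxt = pairs[i + 1]
            # Never split between equal values or when the class stays the same.
            if nxt[0] != v and nxt[1] != majority_class:
                breakpoints.append((v + nxt[0]) / 2)
                count = {}
    return breakpoints

temperature = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes",
        "yes", "yes", "no", "yes", "yes", "no"]
print(one_r_discretize(temperature, play))   # [70.5, 77.5]
```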
Discussion of 1R
●
1R was described in a paper by Holte (1993)
♦ Contains an experimental evaluation on 16 datasets
(using cross-validation so that results were
representative of performance on future data)
♦ Minimum number of instances was set to 6 after
some experimentation
♦ 1R’s simple rules performed not much worse than
much more complex decision trees
●
Simplicity first pays off!
Very Simple Classification Rules Perform Well on Most
Commonly Used Datasets
Robert C. Holte, Computer Science Department, University of Ottawa
Statistical modeling
●
“Opposite” of 1R: use all the attributes
●
Two assumptions: Attributes are
♦ equally important
♦ statistically independent (given the class value)
● I.e., knowing the value of one attribute says nothing
about the value of another (if the class is known)
●
Independence assumption is never correct!
●
But … this scheme works well in practice
Probabilities for weather data
            Outlook              Temperature           Humidity              Windy               Play
            Yes   No             Yes   No              Yes   No              Yes   No            Yes    No
Sunny        2     3    Hot       2     2    High       3     4    False      6     2             9      5
Overcast     4     0    Mild      4     2    Normal     6     1    True       3     3
Rainy        3     2    Cool      3     1
Sunny       2/9   3/5   Hot      2/9   2/5   High      3/9   4/5   False     6/9   2/5           9/14   5/14
Overcast    4/9   0/5   Mild     4/9   2/5   Normal    6/9   1/5   True      3/9   3/5
Rainy       3/9   2/5   Cool     3/9   1/5
●
A new day: Outlook Temp. Humidity Windy Play
Sunny Cool High True ?
Bayes’s rule
● Probability of event H given evidence E:
Pr[H∣E] = Pr[E∣H] × Pr[H] / Pr[E]
●
A priori probability of H: Pr[H]
● Probability of event before evidence is seen
●
A posteriori probability of H: Pr[H∣E]
● Probability of event after evidence is seen
Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England
Naïve Bayes for classification
●
Classification learning: what’s the
probability of the class given an instance?
♦ Evidence E = instance
♦ Event H = class value for instance
●
Naïve assumption: evidence splits into parts
(i.e. attributes) that are independent given the class, so
Pr[H∣E] = Pr[E1∣H] × Pr[E2∣H] × … × Pr[En∣H] × Pr[H] / Pr[E]
Weather data example
Outlook Temp. Humidity Windy Play
Sunny Cool High True ?
(Evidence E)
Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
Likelihood of no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Pr[yes∣E] = 0.0053 / (0.0053 + 0.0206) = 20.5%,  Pr[no∣E] = 79.5%
●
What if an attribute value doesn't occur with every
class value?
(e.g. “Outlook = Overcast” for class “no”)
♦ Probability will be zero: Pr[Outlook=Overcast∣no] = 0
♦ A posteriori probability will also be zero for any instance
with Outlook = Overcast: Pr[no∣E] = 0
(No matter how likely the other values are!)
●
Remedy: add 1 to the count for every attribute
value-class combination (Laplace estimator)
●
Result: probabilities will never be zero!
(also: stabilizes probability estimates)
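A small sketch of Naïve Bayes with the Laplace estimator for nominal attributes (data layout, class names and method names are assumptions, not the book's implementation):

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Naive Bayes for nominal attributes with add-one (Laplace) smoothing."""

    def fit(self, instances, class_name):
        self.class_name = class_name
        self.class_counts = Counter(inst[class_name] for inst in instances)
        self.counts = defaultdict(lambda: defaultdict(Counter))  # attr -> class -> value counts
        self.values = defaultdict(set)                           # attr -> observed values
        for inst in instances:
            for attr, value in inst.items():
                if attr != class_name:
                    self.counts[attr][inst[class_name]][value] += 1
                    self.values[attr].add(value)
        return self

    def predict(self, instance):
        """instance: dict of attribute -> value (class attribute omitted)."""
        n = sum(self.class_counts.values())
        scores = {}
        for cls, cls_count in self.class_counts.items():
            score = cls_count / n                               # prior Pr[class]
            for attr, value in instance.items():
                count = self.counts[attr][cls][value] + 1       # Laplace: add 1 to every count
                total = cls_count + len(self.values[attr])      # denominator grows by #values
                score *= count / total                          # smoothed Pr[attr=value | class]
            scores[cls] = score
        total = sum(scores.values())
        return {cls: s / total for cls, s in scores.items()}    # normalised probabilities
```

Note that attributes missing from the test instance are simply left out of the product, which matches the treatment of missing values described two slides further on.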
Modified probability estimates
●
In some cases adding a constant different
from 1 might be more appropriate
●
Example: attribute outlook for class yes
Sunny: (2 + μ/3) / (9 + μ)    Overcast: (4 + μ/3) / (9 + μ)    Rainy: (3 + μ/3) / (9 + μ)
(μ is a constant; μ = 3 gives the Laplace estimator above)
Missing values
●
Training: instance is not included in frequency
count for attribute value-class combination
●
Classification: attribute will be omitted from
calculation
●
Example: Outlook Temp. Humidity Windy Play
? Cool High True ?
Numeric attributes
● Usual assumption: attributes have a
normal or Gaussian probability
distribution (given the class)
● The probability density function for the
normal distribution is defined by two
parameters:
●
Sample mean: μ = (1/n) Σ_{i=1..n} x_i
●
Standard deviation: σ = √( (1/(n−1)) Σ_{i=1..n} (x_i − μ)² )
● Then the density function f(x) is
f(x) = (1 / (√(2π) σ)) e^( −(x − μ)² / (2σ²) )
Statistics for weather data
●
Example density value, using mean 73 and standard deviation 6.2 for temperature when play = yes:
f(temperature = 66 ∣ yes) = (1 / (√(2π) × 6.2)) e^( −(66 − 73)² / (2 × 6.2²) ) = 0.0340
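A two-line check of this density value (a sketch; the mean and standard deviation are the ones quoted above):

```python
import math

def gaussian_density(x, mean, std):
    """Density f(x) of a normal distribution with the given mean and std. dev."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Temperature = 66 given play = yes (mean 73, standard deviation 6.2):
print(round(gaussian_density(66, 73, 6.2), 4))   # 0.034
```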
Classifying a new day
●
A new day: Outlook Temp. Humidity Windy Play
Sunny 66 90 true ?
●
Missing values during training are not
included in calculation of mean and
standard deviation
Probability densities
●
Relationship between probability and
density:
Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε × f(c)
●
But: this doesn’t change calculation of a
posteriori probabilities because ε cancels out
●
Exact relationship:
Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt
Naïve Bayes: discussion
●
Naïve Bayes works surprisingly well (even if
independence assumption is clearly violated)
●
Why? Because classification doesn’t require accurate
probability estimates as long as maximum probability
is assigned to correct class
●
However: adding too many redundant attributes will
cause problems (e.g. identical attributes)
●
Note also: many numeric attributes are not normally
distributed (→ kernel density estimators)
Constructing decision trees
●
Strategy: top down
Recursive divide-and-conquer fashion
♦ First: select attribute for root node
Create branch for each possible attribute value
♦ Then: split instances into subsets
One for each branch extending from the node
♦ Finally: repeat recursively for each branch, using
only instances that reach the branch
●
Stop if all instances have the same class
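A compact sketch of this recursive divide-and-conquer procedure, using the information-gain criterion introduced on the following slides; the data layout (a list of attribute-to-value dicts) and names are assumptions:

```python
import math
from collections import Counter

def info(class_counts):
    """Entropy (in bits) of a list of class counts."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c)

def build_tree(instances, attributes, class_name):
    """Recursive divide-and-conquer construction of a decision tree."""
    classes = Counter(inst[class_name] for inst in instances)
    # Stop if all instances have the same class or no attributes remain.
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]          # leaf: majority class
    # Select the attribute with the greatest information gain
    # (equivalently, the smallest expected information after the split).
    def expected_info(attr):
        subsets = Counter((inst[attr], inst[class_name]) for inst in instances)
        per_value = Counter(inst[attr] for inst in instances)
        return sum(per_value[v] / len(instances) *
                   info([subsets[(v, c)] for c in classes])
                   for v in per_value)
    best = min(attributes, key=expected_info)
    # One branch per value; recurse on the instances that reach it.
    tree = {}
    for value in set(inst[best] for inst in instances):
        subset = [inst for inst in instances if inst[best] == value]
        rest = [a for a in attributes if a != best]
        tree[value] = build_tree(subset, rest, class_name)
    return (best, tree)
```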
Which attribute to select?
(Figures: the tree stumps obtained by splitting the weather data on each of the four attributes)
Criterion for attribute selection
●
Which is the best attribute?
♦ Want to get the smallest tree
♦ Heuristic: choose the attribute that produces the
“purest” nodes
●
Popular impurity criterion: information gain
♦ Information gain increases with the average
purity of the subsets
●
Strategy: choose attribute that gives greatest
information gain
Computing information
●
Measure information in bits
♦ Given a probability distribution, the info
required to predict an event is the
distribution’s entropy
♦ Entropy gives the information required in bits
(can involve fractions of bits!)
●
Formula for computing the entropy:
entropy(p1, p2, …, pn) = −p1 log p1 − p2 log p2 − … − pn log pn
Example: attribute Outlook
●
Outlook = Sunny:
info([2,3]) = entropy(2/5, 3/5) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971 bits
●
Outlook = Overcast:
info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
(Note: 0 log(0) is normally undefined; it is treated as 0 here.)
●
Outlook = Rainy:
info([3,2]) = entropy(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits
●
Expected information for attribute:
info([2,3], [4,0], [3,2]) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.693 bits
Computing information gain
●
Information gain: information before splitting –
information after splitting
gain(Outlook ) = info([9,5]) – info([2,3],[4,0],[3,2])
= 0.940 – 0.693
= 0.247 bits
●
Information gain for attributes from weather data:
gain(Outlook ) = 0.247 bits
gain(Temperature ) = 0.029 bits
gain(Humidity ) = 0.152 bits
gain(Windy ) = 0.048 bits
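These gains can be checked directly from the class counts; a minimal sketch (the per-value [yes, no] counts are read off the frequency table earlier in this chapter):

```python
import math

def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain(before, split):
    """Information gain = info before splitting - weighted info after."""
    n = sum(before)
    return info(before) - sum(sum(s) / n * info(s) for s in split)

# Class counts [yes, no] in each subset of the weather data:
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # Outlook:     0.247
print(round(gain([9, 5], [[2, 2], [4, 2], [3, 1]]), 3))  # Temperature: 0.029
print(round(gain([9, 5], [[3, 4], [6, 1]]), 3))          # Humidity:    0.152
print(round(gain([9, 5], [[6, 2], [3, 3]]), 3))          # Windy:       0.048
```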
Continuing to split
(information gains for the subset of instances with Outlook = Sunny)
gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits
Final decision tree
●
Note: not all leaves need to be pure; sometimes
identical instances have different classes
⇒
Splitting stops when data can’t be split any further
Wishlist for a purity measure
●
Properties we require from a purity measure:
♦ When node is pure, measure should be zero
♦ When impurity is maximal (i.e. all classes equally
likely), measure should be maximal
♦ Measure should obey multistage property (i.e.
decisions can be made in several stages):
measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])
●
Entropy is the only function that satisfies all
three properties!
Properties of the entropy
●
The multistage property:
entropy(p, q, r) = entropy(p, q+r) + (q+r) × entropy(q/(q+r), r/(q+r))
●
Simplification of computation:
info([2,3,4]) = −2/9 × log(2/9) − 3/9 × log(3/9) − 4/9 × log(4/9)
= [−2 × log 2 − 3 × log 3 − 4 × log 4 + 9 × log 9] / 9
●
Note: instead of maximizing info gain we
could just minimize information
Discussion
●
Top-down induction of decision trees: ID3,
algorithm developed by Ross Quinlan
♦ Gain ratio just one modification of this basic
algorithm
♦ ⇒ C4.5: deals with numeric attributes, missing
values, noisy data
●
Similar approach: CART
●
There are many other attribute selection
criteria!
(But little difference in accuracy of result)
Classification rules
●
Popular alternative to decision trees
●
Antecedent (precondition): a series of tests (just like
the tests at the nodes of a decision tree)
●
Tests are usually logically ANDed together (but may
also be general logical expressions)
●
Consequent (conclusion): classes, set of classes, or
probability distribution assigned by rule
●
Individual rules are often logically ORed together
♦ Conflicts arise if different conclusions apply
From trees to rules
●
Easy: converting a tree into a set of rules
♦ One rule for each leaf:
● Antecedent contains a condition for every node on the path
from the root to the leaf
● Consequent is class assigned by the leaf
●
Produces rules that are unambiguous
♦ Doesn’t matter in which order they are executed
●
But: resulting rules are unnecessarily complex
♦ Pruning to remove redundant tests/rules
From rules to trees
● More difficult: transforming a rule set into a tree
● Tree cannot easily express disjunction between rules
● Example: rules which test different attributes
If a and b then x
If c and d then x
● Symmetry needs to be broken
● Corresponding tree contains identical subtrees
(⇒ “replicated subtree problem”)
A tree for a simple disjunction
(Figure: a decision tree encoding the two rules above; the tests on c and d must be replicated in more than one branch)
The exclusive-or problem
If x = 1 and y = 0
then class = a
If x = 0 and y = 1
then class = a
If x = 0 and y = 0
then class = b
If x = 1 and y = 1
then class = b
A tree with a replicated subtree
If x = 1 and y = 1
then class = a
If z = 1 and w = 1
then class = a
Otherwise class = b
“Nuggets” of knowledge
●
Are rules independent pieces of knowledge? (It
seems easy to add a rule to an existing rule base.)
●
Problem: ignores how rules are executed
●
Two ways of executing a rule set:
♦ Ordered set of rules (“decision list”)
● Order is important for interpretation
♦ Unordered set of rules
● Rules may overlap and lead to different conclusions for the
same instance
Interpreting rules
●
What if two or more rules conflict?
♦ Give no conclusion at all?
♦ Go with rule that is most popular on training data?
♦ …
●
What if no rule applies to a test instance?
♦ Give no conclusion at all?
♦ Go with class that is most frequent in training data?
♦ …
Covering algorithms
●
Convert decision tree into a rule set
♦ Straightforward, but rule set overly complex
♦ More effective conversions are not trivial
●
Instead, can generate rule set directly
♦ for each class in turn find rule set that covers
all instances in it
(excluding instances not in the class)
●
Called a covering approach:
♦ at each stage a rule is identified that “covers”
some of the instances
Example: generating a rule
●
Possible rule set for class “b”:
If x ≤ 1.2 then class = b
If x > 1.2 and y ≤ 2.6 then class = b
●
Could add more rules, get “perfect” rule set
Rules vs. trees
Corresponding decision tree:
(produces exactly the same
predictions)
●
But: rule sets can be more perspicuous when
decision trees suffer from replicated subtrees
●
Also: in multiclass situations, covering algorithm
concentrates on one class at a time whereas
decision tree learner takes all classes into account
Simple covering algorithm
●
Generates a rule by adding tests that maximize
rule’s accuracy
●
Similar to situation in decision trees: problem of
selecting an attribute to split on
♦ But: decision tree inducer maximizes overall purity
●
Each new test reduces
rule’s coverage:
Selecting a test
●
Goal: maximize accuracy
♦ t total number of instances covered by rule
♦ p positive examples of the class covered by rule
♦ t – p number of errors made by rule
⇒ Select test that maximizes the ratio p/t
●
We are finished when p/t = 1 or the set of
instances can’t be split any further
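A sketch of growing a single rule this way, greedily adding the attribute-value test with the best p/t ratio and breaking ties in favour of larger coverage (the dict-based data layout is an assumption):

```python
def grow_rule(instances, attributes, target_class, class_name):
    """Grow one rule for `target_class` by repeatedly adding the
    attribute-value test with the highest p/t ratio (ties broken by
    larger coverage p)."""
    conditions = {}                     # attribute -> required value
    covered = list(instances)
    while True:
        positives = [i for i in covered if i[class_name] == target_class]
        if not positives or len(positives) == len(covered):
            break                       # p/t = 1 or nothing left to gain
        best, best_key = None, (-1.0, -1)
        for attr in attributes:
            if attr in conditions:
                continue
            for value in set(i[attr] for i in covered):
                subset = [i for i in covered if i[attr] == value]
                p = sum(1 for i in subset if i[class_name] == target_class)
                t = len(subset)
                key = (p / t, p)        # accuracy first, then coverage
                if key > best_key:
                    best_key, best = key, (attr, value)
        if best is None:
            break                       # no attributes left to test
        attr, value = best
        conditions[attr] = value
        covered = [i for i in covered if i[attr] == value]
    return conditions
```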
Example: contact lens data
●
Rule we seek:
If ?
then recommendation = hard
●
Possible tests:
Modified rule and resulting data
●
Rule with best test added:
If astigmatism = yes
then recommendation = hard
●
Instances covered by modified rule:
Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lenses
Young           Myope                    Yes           Reduced                None
Young           Myope                    Yes           Normal                 Hard
Young           Hypermetrope             Yes           Reduced                None
Young           Hypermetrope             Yes           Normal                 Hard
Pre-presbyopic  Myope                    Yes           Reduced                None
Pre-presbyopic  Myope                    Yes           Normal                 Hard
Pre-presbyopic  Hypermetrope             Yes           Reduced                None
Pre-presbyopic  Hypermetrope             Yes           Normal                 None
Presbyopic      Myope                    Yes           Reduced                None
Presbyopic      Myope                    Yes           Normal                 Hard
Presbyopic      Hypermetrope             Yes           Reduced                None
Presbyopic      Hypermetrope             Yes           Normal                 None
Further refinement
●
Current state:
If astigmatism = yes
and ?
then recommendation = hard
●
Possible tests:
Modified rule and resulting data
●
Rule with best test added:
If astigmatism = yes
and tear production rate = normal
then recommendation = hard
●
Instances covered by modified rule:
Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lenses
Young           Myope                    Yes           Normal                 Hard
Young           Hypermetrope             Yes           Normal                 Hard
Pre-presbyopic  Myope                    Yes           Normal                 Hard
Pre-presbyopic  Hypermetrope             Yes           Normal                 None
Presbyopic      Myope                    Yes           Normal                 Hard
Presbyopic      Hypermetrope             Yes           Normal                 None
Further refinement
●
Current state:
If astigmatism = yes
and tear production rate = normal
and ?
then recommendation = hard
●
Possible tests:
Age = Young 2/2
Age = Pre-presbyopic 1/2
Age = Presbyopic 1/2
Spectacle prescription = Myope 3/3
Spectacle prescription = Hypermetrope 1/3
●
Tie between the first and the fourth test
♦ We choose the one with greater coverage
The result
●
Final rule: If astigmatism = yes
and tear production rate = normal
and spectacle prescription = myope
then recommendation = hard
●
Second rule for recommending “hard lenses”:
(built from instances not covered by first rule)
If age = young and astigmatism = yes
and tear production rate = normal
then recommendation = hard
●
These two rules cover all “hard lenses”:
♦ Process is repeated with other two classes
Linear models: linear regression
●
Work most naturally with numeric attributes
●
Standard technique for numeric prediction
♦ Outcome is linear combination of attributes
x = w0 + w1a1 + w2a2 + … + wkak
●
Weights are calculated from the training data
●
Predicted value for first training instance a(1)
w0 a0^(1) + w1 a1^(1) + w2 a2^(1) + … + wk ak^(1) = Σ_{j=0..k} wj aj^(1)
(assuming each instance is extended with a constant attribute with value 1)
Minimizing the squared error
● Choose k +1 coefficients to minimize the squared
error on the training data
● Squared error:
Σ_{i=1..n} ( x^(i) − Σ_{j=0..k} wj aj^(i) )²
● Derive coefficients using standard matrix
operations
● Can be done if there are more instances than
attributes (roughly speaking)
● Minimizing the absolute error is more difficult
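A minimal sketch using NumPy's least-squares solver, with the constant attribute added explicitly as assumed above (NumPy is used here for illustration only; it is not part of the book's toolkit):

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear regression: returns weights w0..wk, where
    w0 is the coefficient of the added constant attribute with value 1."""
    A = np.column_stack([np.ones(len(X)), X])     # prepend the bias column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)     # minimises the squared error
    return w

def predict(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w

# Illustrative data: y is roughly 1 + 2*a1 - 3*a2 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1] + 0.01 * rng.normal(size=50)
print(fit_linear(X, y))    # close to [1, 2, -3]
```

np.linalg.lstsq also copes with the rank-deficient case hinted at above (too few instances relative to attributes) by returning a minimum-norm solution.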
Classification
●
Any regression technique can be used for
classification
♦ Training: perform a regression for each class, setting
the output to 1 for training instances that belong to
class, and 0 for those that don’t
♦ Prediction: predict class corresponding to model
with largest output value (membership value)
●
For linear regression this is known as multi-response
linear regression
●
Problem: membership values are not in [0,1]
range, so aren't proper probability estimates
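A sketch of the multi-response scheme just described: one 0/1 membership regression per class, predicting the class with the largest output (names are illustrative):

```python
import numpy as np

def fit_multiresponse(X, labels):
    """Multi-response linear regression: one least-squares model per class,
    trained on a 0/1 membership target."""
    A = np.column_stack([np.ones(len(X)), X])
    W = {}
    for cls in sorted(set(labels)):
        target = np.array([1.0 if l == cls else 0.0 for l in labels])
        W[cls], *_ = np.linalg.lstsq(A, target, rcond=None)
    return W

def classify(W, x):
    """Predict the class whose model produces the largest output."""
    a = np.concatenate([[1.0], x])
    return max(W, key=lambda cls: a @ W[cls])
```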
Instance-based learning
● Distance function defines what’s learned
● Most instance-based schemes use
Euclidean distance:
√( (a1^(1) − a1^(2))² + (a2^(1) − a2^(2))² + … + (ak^(1) − ak^(2))² )
a(1) and a(2): two instances with k attributes
● Taking the square root is not required when
comparing distances
● Other popular metric: city-block metric
● Adds differences without squaring them
Normalization and other issues
●
Different attributes are measured on different
scales ⇒ need to be normalized:
ai = (vi − min vi) / (max vi − min vi)
where vi is the actual value of attribute i
●
Nominal attributes: distance either 0 or 1
●
Common policy for missing values: assumed to be
maximally distant (given normalized attributes)
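A small sketch combining min-max normalization, Euclidean distance and a k-nearest-neighbor vote (function names are illustrative; in practice the training set's minima and maxima should also be used to rescale test instances):

```python
import math

def normalize(dataset):
    """Rescale each numeric attribute to [0, 1] using its min and max."""
    lo = [min(col) for col in zip(*dataset)]
    hi = [max(col) for col in zip(*dataset)]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in dataset]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, x, k=1):
    """Predict the majority class among the k nearest training instances."""
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: euclidean(pair[0], x))[:k]
    votes = {}
    for _, cls in neighbours:
        votes[cls] = votes.get(cls, 0) + 1
    return max(votes, key=votes.get)
```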
Finding nearest neighbors efficiently
●
Simplest way of finding nearest neighbor: linear
scan of the data
♦ Classification takes time proportional to the product of
the number of instances in training and test sets
●
Nearest-neighbor search can be done more
efficiently using appropriate data structures
●
We will discuss two methods that represent training
data in a tree structure:
kD-trees and ball trees
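In practice a library tree structure can stand in for a hand-built kD-tree; a brief sketch using SciPy's KDTree (SciPy is an assumption here, not something the slides rely on):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(42)
points = rng.random((1000, 3))            # 1000 training instances, 3 attributes
tree = KDTree(points)                     # build the kD-tree once

query = rng.random(3)
distance, index = tree.query(query, k=1)  # nearest neighbour without a linear scan
print(index, distance)
```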
kD-tree example
Using kD-trees: example
(Figures: a kD-tree built from a small two-dimensional dataset, and the regions examined when it is used to locate a query point's nearest neighbor)
Discussion of nearest-neighbor learning
● Often very accurate
● Assumes all attributes are equally important
● Remedy: attribute selection or weights
● Possible remedies against noisy instances:
● Take a majority vote over the k nearest neighbors
● Removing noisy instances from dataset (difficult!)
● Statisticians have used k-NN since the early 1950s
● If n → ∞ and k/n → 0, the error approaches the minimum
● kD-trees become inefficient when the number of
attributes is too large (approximately > 10)
● Ball trees (which are instances of metric trees) work
well in higher-dimensional spaces
Clustering
●
Clustering techniques apply when there is no class to be
predicted
●
Aim: divide instances into “natural” groups
●
As we've seen clusters can be:
♦ disjoint vs. overlapping
♦ deterministic vs. probabilistic
♦ flat vs. hierarchical
●
We'll look at a classic clustering algorithm called
k-means
♦ k-means clusters are disjoint, deterministic, and flat
The k-means algorithm
To cluster data into k groups:
(k is predefined)
1. Choose k cluster centers
♦ e.g. at random
2. Assign instances to clusters
♦ based on distance to cluster centers
3. Compute centroids of clusters
4. Make the centroids the new cluster centers and go back to step 2
♦ until the cluster assignments no longer change (convergence)
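A minimal NumPy sketch of these steps (the random seeding and the empty-cluster guard are illustrative choices):

```python
import numpy as np

def kmeans(X, k, iterations=100, seed=0):
    """Plain k-means: random initial centers, then alternate between
    assigning instances to the nearest center and recomputing centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 1
    for _ in range(iterations):
        # Step 2: assign each instance to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Step 3: recompute each cluster's centroid.
        new_centers = np.array([X[assignment == j].mean(axis=0)
                                if np.any(assignment == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                     # converged
            break
        centers = new_centers
    return centers, assignment
```

Running it several times with different seeds and keeping the best clustering corresponds to the restart strategy mentioned in the discussion that follows.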
Discussion
●
Algorithm minimizes squared distance to cluster
centers
●
Result can vary significantly
♦ based on initial choice of seeds
●
Can get trapped in local minimum
♦ Example: (figure omitted; it shows a set of instances and a choice of initial cluster centres that leads to a poor local minimum)
●
To increase chance of finding global optimum: restart
with different random seeds
●
Can be applied recursively with k = 2