Data Mining CS4168, Lecture 5: Basics of Classification
Most slides are based on the lecture slides accompanying "Data Mining: Practical Machine Learning Tools and Techniques" by Ian H. Witten and
Eibe Frank, https://www.cs.waikato.ac.nz/ml/weka/book.html.
Styles of Machine Learning
• Predictive/Supervised
– Classification techniques
• predicting a discrete attribute/class
– Numeric prediction techniques
• predicting a numeric quantity
• Descriptive/Unsupervised
– Association learning techniques
• detecting associations between features
– Clustering techniques
• grouping similar instances into clusters
CLASSIFICATION
Training Set (or Test Set)
• Each instance is described by predictor attributes a1, a2, …, ak and a class attribute x (also called the dependent variable or label)
• The set contains n instances; instance i supplies the predictor values a1(i), a2(i), …, ak(i) and the class value x(i):

  a1(1)   a2(1)   …   ak(1)   x(1)
  …
  a1(n)   a2(n)   …   ak(n)   x(n)
Simplicity First
DECISION TREE CLASSIFIER
Weather Dataset
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
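As a concrete illustration, here is a minimal sketch (Python is used for all code examples in this lecture purely for illustration; the names rows, attributes and weather are not part of the slides) of the dataset represented as predictor values plus a class label per instance, matching the a1, …, ak and x structure described earlier:

# Weather dataset: 14 instances; the last column ("Play") is the class attribute
rows = [
    ("Sunny",    "Hot",  "High",   False, "No"),
    ("Sunny",    "Hot",  "High",   True,  "No"),
    ("Overcast", "Hot",  "High",   False, "Yes"),
    ("Rainy",    "Mild", "High",   False, "Yes"),
    ("Rainy",    "Cool", "Normal", False, "Yes"),
    ("Rainy",    "Cool", "Normal", True,  "No"),
    ("Overcast", "Cool", "Normal", True,  "Yes"),
    ("Sunny",    "Mild", "High",   False, "No"),
    ("Sunny",    "Cool", "Normal", False, "Yes"),
    ("Rainy",    "Mild", "Normal", False, "Yes"),
    ("Sunny",    "Mild", "Normal", True,  "Yes"),
    ("Overcast", "Mild", "High",   True,  "Yes"),
    ("Overcast", "Hot",  "Normal", False, "Yes"),
    ("Rainy",    "Mild", "High",   True,  "No"),
]
attributes = ("Outlook", "Temp", "Humidity", "Windy")
# Each instance becomes (predictor values as a dict, class label)
weather = [(dict(zip(attributes, r[:-1])), r[-1]) for r in rows]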
Final decision tree
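The tree obtained for the Weather data tests Outlook at the root:
• Outlook = Sunny: split further on Humidity (High → No, Normal → Yes)
• Outlook = Overcast: Yes
• Outlook = Rainy: split further on Windy (True → No, False → Yes)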
Constructing decision trees
• Strategy: top down, in recursive divide-and-conquer fashion (a code sketch follows below)
  – First: select an attribute for the root node and create a branch for each possible attribute value
  – Then: split the instances into subsets, one for each branch extending from the node
  – Finally: repeat recursively for each branch, using only the instances that reach that branch
• Stop if all instances have the same class
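A minimal sketch of this procedure, assuming instances in the (attribute-value dict, class label) form shown earlier; select_attribute is passed in as a parameter because the selection criterion is only developed on the following slides, and the helper names are illustrative rather than taken from the slides:

from collections import Counter

def majority_class(instances):
    # Most frequent class label among (values, label) pairs
    return Counter(label for _, label in instances).most_common(1)[0][0]

def build_tree(instances, attributes, select_attribute):
    labels = {label for _, label in instances}
    if len(labels) == 1:            # stop: all instances have the same class
        return labels.pop()
    if not attributes:              # no attribute left to split on
        return majority_class(instances)
    best = select_attribute(instances, attributes)      # choose attribute for this node
    tree = {best: {}}
    for value in {values[best] for values, _ in instances}:   # one branch per attribute value
        subset = [(v, c) for v, c in instances if v[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, select_attribute)  # recurse on the branch
    return tree

# Usage, e.g. with the `weather` list sketched earlier:
#   tree = build_tree(weather, list(attributes), some_selection_rule)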
Which attribute to select?
Criterion for attribute selection
• Which is the best attribute?
  – We want the smallest tree
  – Heuristic: choose the attribute that produces the "purest" child nodes
• Measure of purity:
  – info(node): the information value of a node, measured in bits
  – i.e., the amount of further information needed to decide the class of an instance that reaches the node
• Strategy: choose the attribute whose child nodes have the lowest information value (averaged over the children, weighted by subset size)
How to measure information value?
• Properties we require from an information value
measure:
– When node is pure, measure should be zero
– When impurity is maximal (i.e., all classes equally
likely), measure should be maximal
• Use entropy to calculate the information value:
  entropy(p1, p2, …, pn) = −p1 log p1 − p2 log p2 − … − pn log pn
  where p1, p2, …, pn are the class proportions at the node (summing to 1) and the logarithms are base 2, so the value is measured in bits.
Example: attribute Outlook
• Child node: Outlook = Sunny (2 "Yes" and 3 "No" instances)
  info([2,3]) = entropy(2/5, 3/5) = −(2/5) log(2/5) − (3/5) log(3/5) = 0.971 bits
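These numbers can be checked with a short script (a sketch; the function name is mine, and log base 2 is used so the result comes out in bits, matching the definition above):

from math import log2

def entropy(*proportions):
    # entropy(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn), with 0*log(0) treated as 0
    return -sum(p * log2(p) for p in proportions if p > 0)

print(entropy(2/5, 3/5))   # Outlook = Sunny, classes [2,3]: ~0.971 bits

# Information value of Outlook's child nodes, weighted by subset size
# (counts from the Weather dataset: Sunny = [2,3], Overcast = [4,0], Rainy = [3,2])
info_outlook = (5/14) * entropy(2/5, 3/5) + (4/14) * entropy(4/4, 0/4) + (5/14) * entropy(3/5, 2/5)
print(info_outlook)        # ~0.693 bits, the value to compare against the other attributes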
Instance-Based Learning Algorithm
• Training instances (i.e., examples) are searched for the top k
instances that most closely resemble a new instance.
• A majority vote is taken from the top k instances to classify
the new instance.
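A minimal sketch of this scheme for numeric attributes (Python for illustration; the Euclidean distance discussed on the next slides is assumed as the distance function, and the toy data is made up):

from collections import Counter
from math import sqrt

def euclidean(a, b):
    # Distance between two instances given as tuples of numeric attribute values
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, new_instance, k=3):
    # training: list of (attribute values, class label) pairs
    # 1. Find the k training instances that most closely resemble the new instance
    neighbours = sorted(training, key=lambda pair: euclidean(pair[0], new_instance))[:k]
    # 2. Take a majority vote among those k instances
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

training = [((1.0, 2.0), "A"), ((1.5, 1.8), "A"), ((5.0, 8.0), "B"), ((6.0, 9.0), "B")]
print(knn_classify(training, (1.2, 1.9), k=3))   # -> "A"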
The Distance Function
• Simplest case: dataset with one numeric attribute
– Distance is the difference between the two attribute
values involved
• Dataset with several numeric attributes: normally,
Euclidean distance is used.
• Are all attributes equally important?
– Weighting the attributes might be necessary.
• Distance function defines what is learned
• Most instance-based schemes use Euclidean distance:
  d(instance 1, instance 2) = sqrt( (a1(1) − a1(2))² + (a2(1) − a2(2))² + … + (ak(1) − ak(2))² )
  where ai(1) and ai(2) are the values of attribute ai for the two instances.
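Building on this, a sketch of an attribute-weighted variant (the remedy mentioned above under "Are all attributes equally important?"; the weight values here are arbitrary illustrations, not taken from the slides):

from math import sqrt

def weighted_euclidean(a, b, weights):
    # Larger weight = the attribute contributes more to the distance
    return sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

a, b = (1.0, 5.0, 0.2), (2.0, 3.0, 0.1)
print(weighted_euclidean(a, b, (1.0, 1.0, 1.0)))   # plain Euclidean distance, ~2.24
print(weighted_euclidean(a, b, (1.0, 0.5, 2.0)))   # second attribute down-weighted, ~1.74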
Discussion about kNN
• Often very accurate … but slow:
– Simple version scans entire training data to derive a
prediction.
• Assumes all features are equally important,
– Remedy: feature selection or weights.
• Sensitive to noise for k=1.
• Statisticians have used kNN since the early 1950s.
– For a dataset with n instances, if n → ∞ and k/n → 0, the error approaches the minimum.