
Data Mining

Lecture 5: Basics of Classification

Most slides based on the lecture slides accompanying “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank, https://www.cs.waikato.ac.nz/ml/weka/book.html.
Styles of Machine Learning
• Predictive/Supervised
– Classification techniques
• predicting a discrete attribute/class
– Numeric prediction techniques
• predicting a numeric quantity

• Descriptive/Unsupervised
– Association learning techniques
• detecting associations between features
– Clustering techniques
• grouping similar instances into clusters

CLASSIFICATION
Training Set (or Test Set)

Each row is one instance: the predictors a1, a2, …, ak and the class attribute x (also called the label or the dependent variable).

a1      a2      …    ak      x
a1(1)   a2(1)   …    ak(1)   x(1)
a1(2)   a2(2)   …    ak(2)   x(2)
⁞       ⁞            ⁞       ⁞
a1(n)   a2(n)   …    ak(n)   x(n)
Simplicity First

• Simple algorithms often work very well!


• There are many kinds of simple structures, for example:
  – One attribute does all the work: the OneR (“one rule”) algorithm (a code sketch follows this slide)
  – All attributes contribute equally and independently: Naïve Bayes
  – A weighted linear combination of attributes might do: logistic regression
  – Etc.
• Success of a method depends on the domain.
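As a concrete illustration of “one attribute does all the work”, here is a minimal OneR-style sketch in Python. It uses the weather dataset that appears later in this lecture; the function and variable names are illustrative, not from the slides.

from collections import Counter, defaultdict

# Weather data (shown later in this lecture): (Outlook, Temp, Humidity, Windy) -> Play
rows = [
    (("Sunny", "Hot", "High", False), "No"),
    (("Sunny", "Hot", "High", True), "No"),
    (("Overcast", "Hot", "High", False), "Yes"),
    (("Rainy", "Mild", "High", False), "Yes"),
    (("Rainy", "Cool", "Normal", False), "Yes"),
    (("Rainy", "Cool", "Normal", True), "No"),
    (("Overcast", "Cool", "Normal", True), "Yes"),
    (("Sunny", "Mild", "High", False), "No"),
    (("Sunny", "Cool", "Normal", False), "Yes"),
    (("Rainy", "Mild", "Normal", False), "Yes"),
    (("Sunny", "Mild", "Normal", True), "Yes"),
    (("Overcast", "Mild", "High", True), "Yes"),
    (("Overcast", "Hot", "Normal", False), "Yes"),
    (("Rainy", "Mild", "High", True), "No"),
]
attr_names = ["Outlook", "Temp", "Humidity", "Windy"]

def one_r(rows, attr_index):
    """Build a one-attribute rule: map each attribute value to its majority class."""
    by_value = defaultdict(list)
    for attrs, label in rows:
        by_value[attrs[attr_index]].append(label)
    rule = {v: Counter(labels).most_common(1)[0][0] for v, labels in by_value.items()}
    # Count how many training instances the rule gets wrong.
    errors = sum(rule[attrs[attr_index]] != label for attrs, label in rows)
    return rule, errors

# Pick the attribute whose single rule makes the fewest errors on the training set.
best = min(range(len(attr_names)), key=lambda i: one_r(rows, i)[1])
rule, errors = one_r(rows, best)
print(attr_names[best], rule, f"{errors}/{len(rows)} errors")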

DECISION TREE CLASSIFIER
Weather Dataset
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No

Final decision tree
Constructing decision trees
• Strategy: top down, in recursive divide-and-conquer fashion (a code sketch follows this list)
  – First: select an attribute for the root node and create a branch for each possible attribute value
  – Then: split the instances into subsets, one for each branch extending from the node
  – Finally: repeat recursively for each branch, using only the instances that reach that branch
• Stop if all instances have the same class
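The strategy above can be written down as a short recursive skeleton. This is only a sketch, not the exact algorithm from the slides; select_best_attribute is a hypothetical helper (for example, one that maximizes the information gain introduced below).

from collections import Counter

def build_tree(instances, attributes, select_best_attribute):
    """Top-down, divide-and-conquer tree construction (skeleton).

    instances: list of (attribute_value_dict, class_label) pairs
    attributes: attribute names still available for splitting
    select_best_attribute: heuristic, e.g. highest information gain
    """
    labels = [label for _, label in instances]
    # Stop if all instances have the same class (or no attributes remain).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class

    best = select_best_attribute(instances, attributes)  # attribute for this node
    remaining = [a for a in attributes if a != best]
    branches = {}
    # One branch per value of the chosen attribute; recurse on the subset
    # of instances that reach that branch.
    for value in {inst[best] for inst, _ in instances}:
        subset = [(inst, label) for inst, label in instances if inst[best] == value]
        branches[value] = build_tree(subset, remaining, select_best_attribute)
    return (best, branches)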
Weather Dataset
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No

Which attribute to select?
Criterion for attribute selection
• Which is the best attribute?
– Want to get the smallest tree
– Heuristic: choose the attribute that produces the “purest”
nodes
• Measure of purity:
  – info[node]: the information value of a node, measured in bits
  – the amount of further information needed to decide the class of an instance at that node
• Strategy: choose the attribute whose child nodes have the lowest (weighted average) information value
How to measure information value?
• Properties we require from an information value
measure:
– When node is pure, measure should be zero
– When impurity is maximal (i.e., all classes equally
likely), measure should be maximal
• Use entropy to calculate the information value:

  entropy(p1, p2, …, pn) = -p1 log p1 - p2 log p2 - … - pn log pn

  where p1, …, pn are the class proportions at the node (they sum to 1) and logarithms are base 2, so the result is in bits.

Example: attribute Outlook
• Child node: Outlook = Sunny
  info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits
• Child node: Outlook = Overcast
  info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits
  (Note: log(0) is normally undefined; here 0 · log(0) is taken to be 0.)
• Child node: Outlook = Rainy
  info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits
Example: attribute Outlook
• Information value for node Outlook
  – Before the split:
    info([9,5]) = entropy(9/14, 5/14) = -9/14 log(9/14) - 5/14 log(5/14) = 0.940 bits
  – After the split:
    info([3,2], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
  – Information gain = 0.940 - 0.693 = 0.247 bits
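The numbers above can be reproduced with a few lines of Python; a minimal sketch, assuming base-2 logarithms and treating 0 · log 0 as 0.

from math import log2

def entropy(*proportions):
    """Entropy in bits; terms with p = 0 are skipped (0 * log 0 is treated as 0)."""
    return -sum(p * log2(p) for p in proportions if p > 0)

def info(counts):
    """Information value of a node with the given class counts."""
    total = sum(counts)
    return entropy(*(c / total for c in counts))

before = info([9, 5])                                                          # 0.940 bits
after = (5/14) * info([2, 3]) + (4/14) * info([4, 0]) + (5/14) * info([3, 2])  # 0.693 bits
print(f"gain(Outlook) = {before - after:.3f} bits")                            # 0.247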
Continuing to split

(Information gains for the remaining attributes, computed on the Outlook = Sunny subset:)

gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits
Final decision tree
Discussion
• Top-down induction of decision trees: ID3, an algorithm developed by Ross Quinlan.
• Pruning techniques to avoid overfitting:
  – Use a validation dataset.
• Similar approach: CART
• There are many other attribute selection criteria!
  (But they make little difference in the accuracy of the result.)
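For comparison, here is a minimal sketch of fitting a tree to the weather data, assuming pandas and scikit-learn are available. Note that scikit-learn’s DecisionTreeClassifier is a CART-style learner rather than ID3; criterion="entropy" makes it use the same information measure discussed above, and the nominal attributes are one-hot encoded because the library expects numeric inputs.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The weather dataset from the earlier slides.
data = pd.DataFrame({
    "Outlook":  ["Sunny","Sunny","Overcast","Rainy","Rainy","Rainy","Overcast",
                 "Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy"],
    "Temp":     ["Hot","Hot","Hot","Mild","Cool","Cool","Cool","Mild","Cool",
                 "Mild","Mild","Mild","Hot","Mild"],
    "Humidity": ["High","High","High","High","Normal","Normal","Normal","High",
                 "Normal","Normal","Normal","High","Normal","High"],
    "Windy":    [False,True,False,False,False,True,True,False,False,False,True,True,False,True],
    "Play":     ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"],
})

X = pd.get_dummies(data.drop(columns="Play"))   # one-hot encode the nominal attributes
y = data["Play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # text view of the learned tree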
Random Forest Algorithm
• “Random Forest is one of the most popular and most powerful machine
learning algorithms. It is a type of ensemble machine learning algorithm
called bootstrap aggregation or bagging.” Source

• “The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.” Source

• “In bagging (a.k.a. bootstrap aggregation), a random sample of data in a training set is selected with replacement — meaning that the individual data points can be chosen more than once.” Source
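A minimal scikit-learn sketch of the idea, on synthetic data used purely for illustration: each tree is grown on a bootstrap sample, and only a random subset of features is considered at each split.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data, for illustration only.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the ensemble
    bootstrap=True,        # each tree sees a bootstrap sample (bagging)
    max_features="sqrt",   # feature randomness at each split
    random_state=0,
).fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))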
Ensemble ML
• Ensemble learning refers to a group (or ensemble) of
base ML algorithms, which work collectively to
achieve better predictive performance.
• Bagging and boosting are two main types of
ensemble learning methods.
– Bagging: the base models are trained in parallel.
– Boosting: the base models are trained sequentially.
• Popular boosting algorithms: AdaBoost, XGBoost,
GradientBoost, BrownBoost.
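A hedged sketch contrasting the two styles with scikit-learn, again on synthetic data: BaggingClassifier trains its base trees independently (conceptually in parallel), while AdaBoostClassifier trains them sequentially, reweighting the instances that earlier trees got wrong.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # default base: decision stumps

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)            # 5-fold cross-validation accuracy
    print(name, round(scores.mean(), 3))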
KNN CLASSIFIER
(INSTANCE BASED LEARNING)

Instance-Based Learning Algorithm
• Training instances (i.e., examples) are searched for the top k
instances that most closely resemble a new instance.
• A majority vote is taken from the top k instances to classify
the new instance.

• Similarity (or distance) function defines what’s “learned”.

• Simplest form of classification, also called:
  – rote learning
  – lazy learning
  – k-nearest-neighbor (k-NN, kNN, KNN)
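A minimal from-scratch sketch of this procedure: rank the training instances by Euclidean distance to the new instance and take a majority vote among the k nearest. The toy data and the name knn_predict are illustrative.

from collections import Counter
from math import dist   # Euclidean distance between two points (Python 3.8+)

def knn_predict(train, new_instance, k=3):
    """train: list of (feature_vector, label) pairs; classify new_instance by majority vote."""
    # Keep the k training instances closest to the new instance.
    neighbors = sorted(train, key=lambda pair: dist(pair[0], new_instance))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy numeric data: two attributes, two classes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.2), "B"), ((3.1, 2.9), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))   # "A": two of its three nearest neighbors are A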

The Distance Function
• Simplest case: dataset with one numeric attribute
– Distance is the difference between the two attribute
values involved
• Dataset with several numeric attributes: normally,
Euclidean distance is used.
• Are all attributes equally important?
– Weighting the attributes might be necessary.

The Distance Function
• Distance function defines what is learned
• Most instance-based schemes use Euclidean distance:

  d(a(1), a(2)) = sqrt( (a1(1) - a1(2))² + (a2(1) - a2(2))² + … + (am(1) - am(2))² )

  where a(1) and a(2) are two instances with m attributes.


• Taking the square root is not required when
comparing distances.
• Another popular metric: the city-block (Manhattan) metric:
  – adds the absolute differences of the attribute values without squaring them.
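A small sketch comparing the two metrics; it also shows the squared Euclidean distance, which ranks neighbors the same way as the true Euclidean distance because the square root does not change the ordering.

import numpy as np

a1 = np.array([1.0, 4.0, 2.0])
a2 = np.array([2.0, 1.0, 2.5])

squared   = np.sum((a1 - a2) ** 2)      # squared Euclidean: enough for comparing distances
euclidean = np.sqrt(squared)            # Euclidean distance
cityblock = np.sum(np.abs(a1 - a2))     # city-block (Manhattan) distance

print(squared, euclidean, cityblock)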

Discussion about kNN
• Often very accurate … but slow:
– Simple version scans entire training data to derive a
prediction.
• Assumes all attributes are equally important.
  – Remedy: attribute (feature) selection or attribute weighting.
• Sensitive to noisy instances when k = 1.
• Statisticians have used kNN since the early 1950s.
  – For a dataset with n instances, if n → ∞ and k/n → 0, the error approaches the minimum possible error.

