
Data Mining

Lecture 5: Basics of Classification

Most slides based on the lecture slides accompanying “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank, https://www.cs.waikato.ac.nz/ml/weka/book.html.
Styles of Machine Learning
• Predictive/Supervised
– Classification techniques
• predicting a discrete attribute/class
– Numeric prediction techniques
• predicting a numeric quantity

• Descriptive/Unsupervised
– Association learning techniques
• detecting associations between features
– Clustering techniques
• grouping similar instances into clusters

CLASSIFICATION
Training Set (or Test Set)

Each row is one instance: the predictors a1, a2, …, ak and the class attribute x (also called the label or the dependent variable).

a1      a2      …    ak      x
a1(1)   a2(1)   …    ak(1)   x(1)
a1(2)   a2(2)   …    ak(2)   x(2)
⁞       ⁞            ⁞       ⁞
a1(n)   a2(n)   …    ak(n)   x(n)
Simplicity First

• Simple algorithms often work very well!


• There are many kinds of simple structures, for example:
  – One attribute does all the work: the OneR (“one rule”) algorithm (a code sketch follows this slide)
  – All attributes contribute equally and independently: Naïve Bayes
  – A weighted linear combination of attributes might do: logistic regression
  – Etc.
• Success of a method depends on the domain.
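As a concrete illustration of “one attribute does all the work”, here is a minimal OneR-style sketch in Python. It uses the weather dataset that appears later in this lecture; the function and variable names are illustrative, not from the slides.

from collections import Counter, defaultdict

# Weather data (shown later in this lecture): (Outlook, Temp, Humidity, Windy) -> Play
rows = [
    (("Sunny", "Hot", "High", False), "No"),
    (("Sunny", "Hot", "High", True), "No"),
    (("Overcast", "Hot", "High", False), "Yes"),
    (("Rainy", "Mild", "High", False), "Yes"),
    (("Rainy", "Cool", "Normal", False), "Yes"),
    (("Rainy", "Cool", "Normal", True), "No"),
    (("Overcast", "Cool", "Normal", True), "Yes"),
    (("Sunny", "Mild", "High", False), "No"),
    (("Sunny", "Cool", "Normal", False), "Yes"),
    (("Rainy", "Mild", "Normal", False), "Yes"),
    (("Sunny", "Mild", "Normal", True), "Yes"),
    (("Overcast", "Mild", "High", True), "Yes"),
    (("Overcast", "Hot", "Normal", False), "Yes"),
    (("Rainy", "Mild", "High", True), "No"),
]
attr_names = ["Outlook", "Temp", "Humidity", "Windy"]

def one_r(rows, attr_index):
    """Build a one-attribute rule: map each attribute value to its majority class."""
    by_value = defaultdict(list)
    for attrs, label in rows:
        by_value[attrs[attr_index]].append(label)
    rule = {v: Counter(labels).most_common(1)[0][0] for v, labels in by_value.items()}
    # Count how many training instances the rule gets wrong.
    errors = sum(rule[attrs[attr_index]] != label for attrs, label in rows)
    return rule, errors

# Pick the attribute whose single rule makes the fewest errors on the training set.
best = min(range(len(attr_names)), key=lambda i: one_r(rows, i)[1])
rule, errors = one_r(rows, best)
print(attr_names[best], rule, f"{errors}/{len(rows)} errors")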

DECISION TREE CLASSIFIER
Weather Dataset
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No

Final decision tree
Constructing decision trees
• Strategy: top down, in recursive divide-and-conquer fashion (a code sketch follows this list)
  – First: select an attribute for the root node and create a branch for each possible attribute value
  – Then: split the instances into subsets, one for each branch extending from the node
  – Finally: repeat recursively for each branch, using only the instances that reach that branch
• Stop if all instances have the same class
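The strategy above can be written down as a short recursive skeleton. This is only a sketch, not the exact algorithm from the slides; select_best_attribute is a hypothetical helper (for example, one that maximizes the information gain introduced below).

from collections import Counter

def build_tree(instances, attributes, select_best_attribute):
    """Top-down, divide-and-conquer tree construction (skeleton).

    instances: list of (attribute_value_dict, class_label) pairs
    attributes: attribute names still available for splitting
    select_best_attribute: heuristic, e.g. highest information gain
    """
    labels = [label for _, label in instances]
    # Stop if all instances have the same class (or no attributes remain).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class

    best = select_best_attribute(instances, attributes)  # attribute for this node
    remaining = [a for a in attributes if a != best]
    branches = {}
    # One branch per value of the chosen attribute; recurse on the subset
    # of instances that reach that branch.
    for value in {inst[best] for inst, _ in instances}:
        subset = [(inst, label) for inst, label in instances if inst[best] == value]
        branches[value] = build_tree(subset, remaining, select_best_attribute)
    return (best, branches)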
Weather Dataset
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No

Which attribute to select?
Criterion for attribute selection
• Which is the best attribute?
– Want to get the smallest tree
– Heuristic: choose the attribute that produces the “purest”
nodes
• Measure of purity:
  – info[node]: the information value of a node, measured in bits
  – the amount of further information needed to decide the class of an instance at that node
• Strategy: choose the attribute whose child nodes have the lowest (weighted average) information value
How to measure information value?
• Properties we require from an information value
measure:
– When node is pure, measure should be zero
– When impurity is maximal (i.e., all classes equally
likely), measure should be maximal
• Use entropy to calculate the information value:

  entropy(p1, p2, …, pn) = -p1 log p1 - p2 log p2 - … - pn log pn

  where p1, …, pn are the class proportions at the node (they sum to 1) and logarithms are base 2, so the result is in bits.

Example: attribute Outlook
• Child node: Outlook = Sunny
  info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits
• Child node: Outlook = Overcast
  info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits
  (Note: log(0) is normally undefined; here 0 · log(0) is taken to be 0.)
• Child node: Outlook = Rainy
  info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits
Example: attribute Outlook
• Information value for node Outlook
  – Before the split:
    info([9,5]) = entropy(9/14, 5/14) = -9/14 log(9/14) - 5/14 log(5/14) = 0.940 bits
  – After the split:
    info([3,2], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
  – Information gain = 0.940 - 0.693 = 0.247 bits
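The numbers above can be reproduced with a few lines of Python; a minimal sketch, assuming base-2 logarithms and treating 0 · log 0 as 0.

from math import log2

def entropy(*proportions):
    """Entropy in bits; terms with p = 0 are skipped (0 * log 0 is treated as 0)."""
    return -sum(p * log2(p) for p in proportions if p > 0)

def info(counts):
    """Information value of a node with the given class counts."""
    total = sum(counts)
    return entropy(*(c / total for c in counts))

before = info([9, 5])                                                          # 0.940 bits
after = (5/14) * info([2, 3]) + (4/14) * info([4, 0]) + (5/14) * info([3, 2])  # 0.693 bits
print(f"gain(Outlook) = {before - after:.3f} bits")                            # 0.247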
Continuing to split

(Information gains for the remaining attributes, computed on the Outlook = Sunny subset:)

gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits
Final decision tree
Discussion
• Top-down induction of decision trees: ID3, an algorithm developed by Ross Quinlan.
• Pruning techniques to avoid overfitting:
  – Use a validation dataset.
• Similar approach: CART
• There are many other attribute selection criteria!
  (But they make little difference in the accuracy of the result.)
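For comparison, here is a minimal sketch of fitting a tree to the weather data, assuming pandas and scikit-learn are available. Note that scikit-learn’s DecisionTreeClassifier is a CART-style learner rather than ID3; criterion="entropy" makes it use the same information measure discussed above, and the nominal attributes are one-hot encoded because the library expects numeric inputs.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The weather dataset from the earlier slides.
data = pd.DataFrame({
    "Outlook":  ["Sunny","Sunny","Overcast","Rainy","Rainy","Rainy","Overcast",
                 "Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy"],
    "Temp":     ["Hot","Hot","Hot","Mild","Cool","Cool","Cool","Mild","Cool",
                 "Mild","Mild","Mild","Hot","Mild"],
    "Humidity": ["High","High","High","High","Normal","Normal","Normal","High",
                 "Normal","Normal","Normal","High","Normal","High"],
    "Windy":    [False,True,False,False,False,True,True,False,False,False,True,True,False,True],
    "Play":     ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"],
})

X = pd.get_dummies(data.drop(columns="Play"))   # one-hot encode the nominal attributes
y = data["Play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # text view of the learned tree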
Random Forest Algorithm
• “Random Forest is one of the most popular and most powerful machine
learning algorithms. It is a type of ensemble machine learning algorithm
called bootstrap aggregation or bagging.” Source

• “The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.” Source

• “In bagging (a.k.a. bootstrap aggregation), a random sample of data in a training set is selected with replacement — meaning that the individual data points can be chosen more than once.” Source
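A minimal scikit-learn sketch of the idea, on synthetic data used purely for illustration: each tree is grown on a bootstrap sample, and only a random subset of features is considered at each split.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data, for illustration only.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the ensemble
    bootstrap=True,        # each tree sees a bootstrap sample (bagging)
    max_features="sqrt",   # feature randomness at each split
    random_state=0,
).fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))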
Ensemble ML
• Ensemble learning refers to a group (or ensemble) of
base ML algorithms, which work collectively to
achieve better predictive performance.
• Bagging and boosting are two main types of
ensemble learning methods.
– Bagging: the base models are trained in parallel.
– Boosting: the base models are trained sequentially.
• Popular boosting algorithms: AdaBoost, XGBoost,
GradientBoost, BrownBoost.
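A hedged sketch contrasting the two styles with scikit-learn, again on synthetic data: BaggingClassifier trains its base trees independently (conceptually in parallel), while AdaBoostClassifier trains them sequentially, reweighting the instances that earlier trees got wrong.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # default base: decision stumps

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)            # 5-fold cross-validation accuracy
    print(name, round(scores.mean(), 3))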
KNN CLASSIFIER
(INSTANCE BASED LEARNING)

Instance-Based Learning Algorithm
• Training instances (i.e., examples) are searched for the top k
instances that most closely resemble a new instance.
• A majority vote is taken from the top k instances to classify
the new instance.

• Similarity (or distance) function defines what’s “learned”.

• Simplest form of classification, also called:
  – rote learning
  – lazy learning
  – k-nearest-neighbor (k-NN, kNN, KNN)
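A minimal from-scratch sketch of this procedure: rank the training instances by Euclidean distance to the new instance and take a majority vote among the k nearest. The toy data and the name knn_predict are illustrative.

from collections import Counter
from math import dist   # Euclidean distance between two points (Python 3.8+)

def knn_predict(train, new_instance, k=3):
    """train: list of (feature_vector, label) pairs; classify new_instance by majority vote."""
    # Keep the k training instances closest to the new instance.
    neighbors = sorted(train, key=lambda pair: dist(pair[0], new_instance))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy numeric data: two attributes, two classes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.2), "B"), ((3.1, 2.9), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))   # "A": two of its three nearest neighbors are A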

The Distance Function
• Simplest case: dataset with one numeric attribute
– Distance is the difference between the two attribute
values involved
• Dataset with several numeric attributes: normally,
Euclidean distance is used.
• Are all attributes equally important?
– Weighting the attributes might be necessary.

The Distance Function
• Distance function defines what is learned
• Most instance-based schemes use Euclidean distance:

  d(a(1), a(2)) = sqrt( (a1(1) - a1(2))² + (a2(1) - a2(2))² + … + (am(1) - am(2))² )

  where a(1) and a(2) are two instances with m attributes.


• Taking the square root is not required when
comparing distances.
• Another popular metric: the city-block (Manhattan) metric:
  – adds the absolute differences of the attribute values without squaring them.
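A small sketch comparing the two metrics; it also shows the squared Euclidean distance, which ranks neighbors the same way as the true Euclidean distance because the square root does not change the ordering.

import numpy as np

a1 = np.array([1.0, 4.0, 2.0])
a2 = np.array([2.0, 1.0, 2.5])

squared   = np.sum((a1 - a2) ** 2)      # squared Euclidean: enough for comparing distances
euclidean = np.sqrt(squared)            # Euclidean distance
cityblock = np.sum(np.abs(a1 - a2))     # city-block (Manhattan) distance

print(squared, euclidean, cityblock)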

Discussion about kNN
• Often very accurate … but slow:
– Simple version scans entire training data to derive a
prediction.
• Assumes all attributes are equally important.
  – Remedy: attribute (feature) selection or attribute weighting.
• Sensitive to noisy instances when k = 1.
• Statisticians have used kNN since the early 1950s.
  – For a dataset with n instances, if n → ∞ and k/n → 0, the error approaches the minimum possible error.

