
eMBA933

Data Mining
Tools & Techniques
Lecture 12

Dr. Faiz Hamid


Associate Professor
Department of IME
IIT Kanpur
[email protected]
Classification: Basic Concepts
Classification
• Task of assigning objects to one of several predefined
categories or classes
• A form of data analysis that extracts model or classifier to
predict class labels
– find a model for class label attribute as a function of the values of other
attributes
– classifies data based on training set and values in a classifying attribute, and
uses it in classifying new data
– class labels are generally categorical (discrete or nominal)
– less effective for ordinal categories since implicit order among the categories
is not considered
• Numeric Prediction
– models continuous‐valued functions, i.e., predicts unknown or missing values
• Typical applications
– Credit/loan approval: loan application is “safe” or “risky”
– Medical diagnosis: tumor is “cancerous” or “benign”
– Fraud detection: transaction is “fraudulent”
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: Training data is
accompanied by labels indicating the
class of the observations
– New data is classified based on the
training set

• Unsupervised learning (clustering)


– Class labels of the training data are
unknown
– Given a set of observations, the aim is
to establish existence of classes or
clusters in the data
Classification— Two‐Step Process
• Model construction: Describe a set of predetermined classes
– Each tuple is assumed to belong to a predefined class, as determined by
the class label attribute
– The model is represented as classification rules, decision trees, or
mathematical formulae

• Model usage: Classify future or unknown objects


– Estimate accuracy of the model
• The known label of test sample is compared with the classified result
from the model
• Accuracy = percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify new data
Phase 1: Model Construction
Training Data → Classification Algorithms → Classifier (Model)

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Phase 2: Model Usage
Classifier (from Phase 1):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

Testing Data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
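As a rough sketch of these two phases in code (not from the lecture; the use of scikit-learn and the integer encoding of RANK are my own illustrative choices):

```python
# Minimal sketch of the two-phase process, assuming scikit-learn is available.
from sklearn.tree import DecisionTreeClassifier

rank_code = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}

# Phase 1: model construction from the training data
train = [("Mike", "Assistant Prof", 3, "no"),
         ("Mary", "Assistant Prof", 7, "yes"),
         ("Bill", "Professor", 2, "yes"),
         ("Jim", "Associate Prof", 7, "yes"),
         ("Dave", "Assistant Prof", 6, "no"),
         ("Anne", "Associate Prof", 3, "no")]
X_train = [[rank_code[rank], years] for _, rank, years, _ in train]
y_train = [label for *_, label in train]
model = DecisionTreeClassifier().fit(X_train, y_train)

# Phase 2: estimate accuracy on an independent test set ...
test = [("Tom", "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George", "Professor", 5, "yes"),
        ("Joseph", "Assistant Prof", 7, "yes")]
X_test = [[rank_code[rank], years] for _, rank, years, _ in test]
y_test = [label for *_, label in test]
print("accuracy:", model.score(X_test, y_test))

# ... and, if acceptable, classify unseen data: (Jeff, Professor, 4)
print("Jeff tenured?", model.predict([[rank_code["Professor"], 4]])[0])
```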
Classification
[Figure: two example data sets, one linearly separable and one not linearly separable]
Decision Trees
Decision Trees
• Divides the feature space by axis‐aligned decision boundaries
• Each rectangular region is labeled with one label/class
• Idea is to divide the entire X‐space into rectangles such that each rectangle is as homogeneous or “pure” as possible
• Pure = containing records that belong to just one class
• Recursive partitioning of the p‐dimensional space of the predictor variables into non‐overlapping multidimensional rectangles

[Figure: a not linearly separable data set partitioned into rectangular regions 1, 2, 3]


Decision Trees
Decision tree: a flowchart‐like tree structure
• Internal nodes: test on attributes
• Branches: outcome of the test
• Leaf nodes: class labels

[Figure: decision tree for the fruit example, shown next to the partitioned (not linearly separable) feature space]
Width > 6.5 cm?
  Yes → Height > 9.5 cm?    No → Height > 6.0 cm?
  (each Yes/No branch ends in a leaf node carrying a class label)


Decision Trees
• If‐Then Rules
– If Width > 6.5 cm AND Height > 9.5 cm THEN Lemon
– If Width > 6.5 cm AND Height ≤ 9.5 cm THEN Orange
– If Width ≤ 6.5 cm AND Height > 6.0 cm THEN Lemon
– If Width ≤ 6.5 cm AND Height ≤ 6.0 cm THEN Orange
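Read as code, the tree is just nested conditionals. A minimal Python sketch (the function name classify_fruit is my own, not from the slides):

```python
def classify_fruit(width_cm: float, height_cm: float) -> str:
    """Apply the decision tree above as nested if/else tests."""
    if width_cm > 6.5:                      # root test: Width > 6.5 cm?
        return "Lemon" if height_cm > 9.5 else "Orange"
    else:                                   # Width <= 6.5 cm branch
        return "Lemon" if height_cm > 6.0 else "Orange"

print(classify_fruit(7.0, 10.0))  # Lemon
print(classify_fruit(6.0, 5.0))   # Orange
```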
Example
• Whether a customer will wait for a table at a restaurant?
• Attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. Wait Estimate: estimated waiting time (0‐10 min, 10‐30, 30‐60, >60)
Example
[Figure: training examples for the restaurant domain; the last column is the class label attribute]
Which Tree is Better?
[Figure: two candidate trees to decide whether to wait (T) or not (F)]
What Makes a Good Tree?
• Not too big:
– computational efficiency (avoid redundant, spurious attributes)
– avoid overfitting training examples
– generalise well to new/unseen observations
– easy to understand and interpret
• Not too small:
– need to handle important but possibly subtle distinctions in data
• Occam's Razor: "the simplest explanation is most likely
the right one"
– find the smallest tree that fits the observations
Learning Decision Trees
• In principle there are exponentially many DTs that can be constructed from a
given set of attributes and that fit the same data
• Learning the simplest (smallest) decision tree is an NP‐complete
problem (Hyafil & Rivest, 1976)
• Resort to heuristics: efficient algorithms that induce reasonably
accurate, albeit suboptimal, DT in a reasonable amount of time
• Greedy strategy: series of locally optimal decisions
– Start from an empty decision tree
– Split on next best attribute
– Recurse
• What is best attribute?
• We use information theory to guide us
– ID3 (Iterative Dichotomiser) – Information Gain
– C4.5 – Gain Ratio
– Classification and Regression Trees (CART) – Gini index
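As a point of reference (an assumption on my part; the slides do not prescribe a library), scikit-learn's DecisionTreeClassifier exposes the choice of impurity measure through its criterion parameter:

```python
from sklearn.tree import DecisionTreeClassifier

# CART-style splitting with the Gini index (scikit-learn's default criterion)
cart_like = DecisionTreeClassifier(criterion="gini")

# entropy as the impurity measure, so splits are chosen by information gain
id3_like = DecisionTreeClassifier(criterion="entropy")
```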
Decision Tree Learning Algorithm
• Simple, greedy, recursive approach, builds up tree
node‐by‐node
1. pick an attribute to split at a non‐terminal node
2. split examples into groups based on attribute
value
3. for each group:
– if no examples ‐ return majority from parent
– else if all examples in same class ‐ return class
– else loop to Step 1
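A compact sketch of this recursion in Python (everything here, including the tiny data set and the placeholder attribute chooser, is illustrative; ID3 would plug in an information-gain-based chooser as described in the following slides):

```python
from collections import Counter

def majority_class(examples):
    """Most common class label in a list of (attributes_dict, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def build_tree(examples, attributes, choose, parent_examples=None):
    """Greedy recursive tree construction following steps 1-3 above."""
    if not examples:                                   # no examples: majority from parent
        return majority_class(parent_examples)
    labels = {label for _, label in examples}
    if len(labels) == 1:                               # all examples in the same class
        return labels.pop()
    if not attributes:                                 # nothing left to split on
        return majority_class(examples)
    attr = choose(examples, attributes)                # step 1: pick an attribute
    tree = {attr: {}}
    for value in {x[attr] for x, _ in examples}:       # step 2: split by attribute value
        subset = [(x, y) for x, y in examples if x[attr] == value]
        rest = [a for a in attributes if a != attr]
        tree[attr][value] = build_tree(subset, rest, choose, examples)  # step 3: recurse
    return tree

# Example usage with a placeholder selection rule (always take the first attribute):
data = [({"Patrons": "Some", "Type": "Thai"}, "wait"),
        ({"Patrons": "None", "Type": "Burger"}, "leave"),
        ({"Patrons": "Full", "Type": "Thai"}, "leave")]
tree = build_tree(data, ["Patrons", "Type"], choose=lambda ex, attrs: attrs[0])
print(tree)
```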
Choosing a Good Attribute
• Which attribute is better to split on, X1 or X2?

[Figure: the two candidate splits; one child group contains a single class and is a pure node]

Idea:
1. use counts at leaves to define probability distributions, so we
can measure uncertainty
2. a good attribute splits the examples into subsets that are
(ideally) pure
Choosing a Good Attribute
• Restaurant example
• Test Patrons or Type first?

• A good attribute splits samples into groups that are (ideally) all positive or all negative
• Patrons is a better attribute than Type
• Testing on good attributes early helps minimise the tree depth
Quantifying Uncertainty
• We Flip Two Different Coins
Quantifying Uncertainty
• Entropy H(X) of a random variable X:
  H(X) = - Σx p(x) log2 p(x)
• Measures the level of impurity in a group of examples


Quantifying Uncertainty
• Entropy
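A small sketch (not from the slides) that turns this definition into code and reproduces the impurity values used later for the restaurant example:

```python
from math import log2

def entropy(labels):
    """H(X) = -sum p(x) log2 p(x), estimated from a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy(["yes"] * 6 + ["no"] * 6))            # 1.0  (maximally impure group)
print(entropy(["yes"] * 4))                         # 0.0  (pure group)
print(round(entropy(["yes"] * 2 + ["no"] * 4), 3))  # 0.918, the value used in the Patrons split
```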
Attribute Selection Measures
• An attribute selection measure is a heuristic for selecting the
splitting criterion that “best” separates a given data partition

• Pure partition: A partition is pure if all the tuples in it belong to the same class
– split up the tuples according to mutually exclusive outcomes of the
splitting criterion

• Popular measures
– information gain, gain ratio, Gini index
Information Gain
• ID3 uses information gain as its attribute selection measure
• Node N represents tuples of partition D
• Attribute with the highest information gain is chosen as the
splitting attribute for node N
• Objective: to partition on an attribute that would do the “best
classification,” so that the amount of information still required
to finish classifying the tuples is minimal
• Minimize expected number of tests needed to classify a given
tuple and guarantee a simple tree is found
Notations
• D: data partition, a training set of class‐labeled tuples
• m: distinct values of class label attribute defining m
distinct classes, Ci (i = 1,…,m)
• Ci,D: set of tuples of class Ci in D
• |Dj|: number of tuples in Dj
• |Ci,D |: number of tuples in Ci,D
Information Gain
• Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci, D|/|D|
• Expected information (entropy) needed to classify a tuple in D:
  Info(D) = - Σi=1..m pi log2(pi)
• Measures the average amount of information needed to identify the class of a tuple in D
– Original information required based on just the proportion of classes
• How much more information would we still need (after the partitioning) to
arrive at an exact classification?
• Information needed (after using A to split D into v partitions) to classify D:
  InfoA(D) = Σj=1..v (|Dj| / |D|) × Info(Dj)
Information Gain
• InfoA(D) is the expected information required to classify a tuple from D
based on the partitioning by A
• Smaller the expected information (still) required, the greater the purity of
the partitions (reduces the entropy in the partitions)
• To determine how well a test condition performs, compare the degree of
impurity of the parent node (before splitting) and the child node (after
splitting)
– larger the difference, the better the test condition
• Information gained by branching on attribute A:
  Gain(A) = Info(D) - InfoA(D)
  – expected reduction in the information requirement caused by knowing the value of A
– determines the goodness of the split
Example
Example

• InfoType(D) = 2 × (2/12) × 1 + 2 × (4/12) × 1 = 1/3 + 2/3 = 1
  (every Type value splits into equal numbers of positive and negative examples, so each group has entropy 1)
• Gain(Type) = Info(D) - InfoType(D) = 1 - 1 = 0

• InfoPatrons(D) = (2/12) × 0 + (4/12) × 0 + (6/12) × 0.918 = 0.459
  (None and Some are pure groups; Full is mixed, with entropy 0.918)
• Gain(Patrons) = Info(D) - InfoPatrons(D) = 1 - 0.459 = 0.541

• Patrons is a better attribute than Type
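These gains can be reproduced with a few lines of Python; the per-value class compositions below are an assumption consistent with the slide's numbers (12 examples, 6 positive T and 6 negative F):

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def info_after_split(groups):
    """Weighted entropy of the child groups: Info_A(D)."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * entropy(g) for g in groups)

# 12 examples, 6 positive (T) and 6 negative (F), so Info(D) = 1
D = ["T"] * 6 + ["F"] * 6

# Split by Type: two values with 1 T / 1 F each, two values with 2 T / 2 F each
by_type = [["T", "F"], ["T", "F"], ["T", "T", "F", "F"], ["T", "T", "F", "F"]]
# Split by Patrons: None (all F), Some (all T), Full (mixed)
by_patrons = [["F", "F"], ["T"] * 4, ["T", "T", "F", "F", "F", "F"]]

print("Gain(Type)    =", round(entropy(D) - info_after_split(by_type), 3))     # 0.0
print("Gain(Patrons) =", round(entropy(D) - info_after_split(by_patrons), 3))  # 0.541
```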


Example
