Unit 3 Classification

Classification: Basic Concepts

Classification is a supervised learning task: the class labels of the training records are given, in contrast to unsupervised learning, where they are not.

General Approach to Solving a Classification Problem
• A classification technique (or classifier) is a
systematic approach to building classification models
from an input data set.
Classification Techniques
 Decision Tree based Methods
 Rule-based Methods
 Memory based reasoning
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines
• Each technique employs a learning algorithm
to identify a model that best fits the
relationship between the attribute set and
class label of the input data.
• A key objective of the learning algorithm is to
build models with good generalization
capability, i.e., models that accurately predict
the class labels of previously unknown records.
• First, a training set
consisting of records whose
class labels are known must
be provided.
• The training set is used to
build a classification model,
which is subsequently
applied to the test set,
which consists of records
with unknown class labels.
• Evaluation of the performance of a classification
model is based on the counts of test records
correctly and incorrectly predicted by the
model. These counts are tabulated in a table
known as a confusion matrix.
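As a hedged illustration of this general workflow (not part of the original slides), the sketch below uses scikit-learn; the built-in dataset, the choice of a decision tree classifier, and the 70/30 split are arbitrary assumptions for demonstration only.

```python
# Minimal sketch of the train/test workflow described above, assuming scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)            # records (attribute sets) and known class labels

# Training set: records whose class labels are known; test set: held-out records.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier()             # the classifier (learning algorithm)
model.fit(X_train, y_train)                  # build the classification model from the training set

y_pred = model.predict(X_test)               # apply the model to the test set
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```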
Issues regarding classification and prediction

• Data cleaning
– Preprocess data in order to reduce noise and handle
missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data

Evaluation of Classifiers
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability:
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree

Training Dataset

This follows an example from Quinlan's ID3.

  age     income  student  credit_rating  buys_computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31…40   high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31…40   low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31…40   medium  no       excellent      yes
  31…40   high    yes      fair           yes
  >40     medium  no       excellent      no
Output: A Decision Tree for “buys_computer”

age?
├── <=30   → student?
│            ├── no  → no
│            └── yes → yes
├── 31…40  → yes
└── >40    → credit_rating?
             ├── excellent → no
             └── fair      → yes
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in
advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left

Attribute Selection Measure

• Information gain (ID3/C4.5)
  – All attributes are assumed to be categorical
  – Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
  – All attributes are assumed continuous-valued
  – Assume there exist several possible split values for each attribute
  – May need other tools, such as clustering, to get the possible split values
  – Can be modified for categorical attributes
Information Gain (ID3/C4.5)

• Select the attribute with the highest information gain
• Assume there are two classes, P and N
  – Let the set of examples S contain p elements of class P and n elements of class N
  – The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
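A small sketch (my own, not from the slides) of this two-class information measure in Python; the function name info is an arbitrary choice. It reproduces the values I(9, 5) ≈ 0.940 and I(2, 3) ≈ 0.971 used in the worked example that follows.

```python
import math

def info(p, n):
    """Information needed to classify an example as P or N,
    given p examples of class P and n examples of class N."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:                      # 0 * log2(0) is taken to be 0
            frac = count / total
            result -= frac * math.log2(frac)
    return result

print(round(info(9, 5), 3))   # 0.94
print(round(info(2, 3), 3))   # 0.971
print(round(info(4, 0), 3))   # 0.0
```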
Information Gain in Decision Tree Induction

• Assume that using attribute A a set S will be partitioned into sets {S1, S2, …, Sv}
  – If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is

$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

• The encoding information that would be gained by branching on A is

$$Gain(A) = I(p, n) - E(A)$$
Attribute Selection by Information Gain Computation

• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age:

  age     pi  ni  I(pi, ni)
  <=30    2   3   0.971
  31…40   4   0   0
  >40     3   2   0.971

$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

  Hence

$$Gain(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246$$

• Similarly:
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
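To make the computation above concrete, here is a short sketch of my own that reproduces Gain(age) from the per-branch class counts listed in the slide's table; the helper name info mirrors the earlier sketch and is not part of the original material.

```python
import math

def info(p, n):
    """Two-class information measure I(p, n)."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c > 0)

# Per-branch (pi, ni) counts for the attribute "age", taken from the slide's table.
age_branches = {"<=30": (2, 3), "31…40": (4, 0), ">40": (3, 2)}

p, n = 9, 5                                         # overall class counts (buys_computer = yes/no)
E_age = sum((pi + ni) / (p + n) * info(pi, ni)      # expected information after splitting on age
            for pi, ni in age_branches.values())
gain_age = info(p, n) - E_age

print(f"E(age)    = {E_age:.3f}")      # ~0.694
print(f"Gain(age) = {gain_age:.3f}")   # ~0.246
```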
Gini Index (IBM IntelligentMiner)

• If a data set T contains examples from n classes, the gini index gini(T) is defined as

$$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$

  where pj is the relative frequency of class j in T.

• If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

$$gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)$$

• The attribute that provides the smallest ginisplit(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
Extracting Classification Rules from Trees

• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
  IF age = “<=30” AND student = “no” THEN buys_computer = “no”
  IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
  IF age = “31…40” THEN buys_computer = “yes”
  IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
  IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
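A toy sketch of my own, not from the slides: it walks a hand-coded decision tree, represented here as nested dicts (an arbitrary representation chosen for illustration), and prints one IF-THEN rule per root-to-leaf path. The tree mirrors the buys_computer tree shown earlier.

```python
# Hand-coded tree: internal nodes map an attribute to its branches; leaves hold a class label.
tree = {
    "age": {
        "<=30":  {"student": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40":   {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):                        # leaf: holds the class prediction
        antecedent = " AND ".join(conditions) or "TRUE"
        print(f'IF {antecedent} THEN buys_computer = "{node}"')
        return
    (attribute, branches), = node.items()                 # internal node: a test on one attribute
    for value, child in branches.items():                 # each branch adds one conjunct
        extract_rules(child, conditions + (f'{attribute} = "{value}"',))

extract_rules(tree)
```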
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree
—get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
Approaches to Determine the Final Tree Size

• Separate training (2/3) and testing (1/3) sets


• Use cross validation, e.g., 10-fold cross validation
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to estimate
whether expanding or pruning a node may improve the
entire distribution
• Use minimum description length (MDL) principle:
– halting growth of the tree when the encoding is
minimized
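As a hedged illustration of the cross-validation approach to choosing a tree size (my own sketch using scikit-learn; the dataset and the candidate depths are arbitrary assumptions), one can grow trees of different depths and keep the depth with the best 10-fold cross-validated accuracy.

```python
# Sketch: use 10-fold cross validation to compare tree sizes (controlled here via max_depth).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for depth in (1, 2, 3, 5, 8, None):                      # None = fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=10)          # 10-fold cross validation
    print(f"max_depth={depth}: mean accuracy = {scores.mean():.3f}")
```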
Tree Induction

 Greedy strategy.
– Split the records based on an attribute test that
optimizes certain criterion.

 Issues
– Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?

– Determine when to stop splitting



How to Specify Test Condition?

 Depends on attribute types


– Nominal
– Ordinal
– Continuous

 Depends on number of ways to split


– 2-way split
– Multi-way split
Splitting Based on Nominal Attributes

 Multi-way split: Use as many partitions as distinct values.

    CarType → Family | Sports | Luxury

 Binary split: Divides values into two subsets. Need to find optimal partitioning.

    CarType → {Sports, Luxury} | {Family}    OR    CarType → {Family, Luxury} | {Sports}
Splitting Based on Ordinal Attributes

 Multi-way split: Use as many partitions as distinct values.

    Size → Small | Medium | Large

 Binary split: Divides values into two subsets. Need to find optimal partitioning.

    Size → {Small, Medium} | {Large}    OR    Size → {Medium, Large} | {Small}

 What about this split?    Size → {Small, Large} | {Medium}
Splitting Based on Continuous Attributes

 Different ways of handling


– Discretization to form an ordinal categorical attribute
   Static – discretize once at the beginning
   Dynamic – ranges can be found by equal interval bucketing, equal frequency
    bucketing (percentiles), or clustering (see the binning sketch below)

– Binary Decision: (A < v) or (A ≥ v)
   consider all possible splits and find the best cut
   can be more compute intensive
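A brief sketch of my own (using NumPy) of the two static binning strategies mentioned above; the income values and the number of bins are arbitrary illustrative choices.

```python
import numpy as np

income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])  # a continuous attribute (in $K)

# Equal-interval (equal-width) bucketing: bins of equal width over the value range.
width_edges = np.linspace(income.min(), income.max(), num=5)      # 5 edges -> 4 bins
width_bins = np.digitize(income, width_edges[1:-1])               # bin id for each record

# Equal-frequency bucketing (percentiles): each bin holds roughly the same number of records.
freq_edges = np.percentile(income, [25, 50, 75])
freq_bins = np.digitize(income, freq_edges)

print("equal-width bin ids:    ", width_bins)
print("equal-frequency bin ids:", freq_bins)
```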
Splitting Based on Continuous Attributes

[Figure: (i) a binary split, “Taxable Income > 80K?” with Yes/No branches;
 (ii) a multi-way split on “Taxable Income?” with ranges < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K]

Tree Induction

 Greedy strategy.
– Split the records based on an attribute test that
optimizes certain criterion.

 Issues
– Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?

– Determine when to stop splitting


How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Candidate test conditions:

  Own Car?      Yes: C0: 6, C1: 4       No: C0: 4, C1: 6
  Car Type?     Family: C0: 1, C1: 3    Sports: C0: 8, C1: 0    Luxury: C0: 1, C1: 7
  Student ID?   c1 … c20: one record per ID (each child node is pure, e.g. C0: 1, C1: 0)

Which test condition is the best?
How to determine the Best Split

 Greedy approach:
  – Nodes with homogeneous class distribution are preferred
 Need a measure of node impurity:

  C0: 5, C1: 5   → Non-homogeneous, high degree of impurity
  C0: 9, C1: 1   → Homogeneous, low degree of impurity
Measures of Node Impurity

 Gini Index

 Entropy

 Misclassification error
How to Find the Best Split

[Figure: before splitting, the node has class counts (C0: N00, C1: N01) with impurity M0.
 Splitting on attribute A produces nodes N1 and N2 with impurities M1 and M2, whose weighted
 combination is M12; splitting on B produces nodes N3 and N4 with weighted impurity M34.]

Gain = M0 – M12  vs.  M0 – M34: choose the split with the larger gain (the lower weighted impurity).
Measure of Impurity: GINI

 Gini Index for a given node t:

$$GINI(t) = 1 - \sum_{j} [\, p(j \mid t)\,]^2$$

  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Maximum (1 – 1/nc) when records are equally distributed among all classes, implying least interesting information
  – Minimum (0.0) when all records belong to one class, implying most interesting information

  C1: 0, C2: 6 → Gini = 0.000      C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444      C1: 3, C2: 3 → Gini = 0.500
Examples for Computing GINI

$$GINI(t) = 1 - \sum_{j} [\, p(j \mid t)\,]^2$$

  C1: 0, C2: 6   P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0

  C1: 1, C2: 5   P(C1) = 1/6,  P(C2) = 5/6
                 Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278

  C1: 2, C2: 4   P(C1) = 2/6,  P(C2) = 4/6
                 Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444
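A tiny sketch of my own of the node-impurity computation; the function name gini is arbitrary. It reproduces the example values above.

```python
def gini(class_counts):
    """Gini index of a node, given the count of records in each class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, "->", round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5
```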
Splitting Based on GINI

 Used in CART, SLIQ, SPRINT.
 When a node p is split into k partitions (children), the quality of the split is computed as

$$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$$

  where ni = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing GINI Index

 Splits into two partitions
 Effect of weighing partitions:
  – Larger and purer partitions are sought for.

  Parent: C1: 6, C2: 6, Gini = 0.500

  Split on B?  →  Node N1 (Yes): C1: 5, C2: 2      Node N2 (No): C1: 1, C2: 4

  Gini(N1) = 1 – (5/7)^2 – (2/7)^2 = 0.408
  Gini(N2) = 1 – (1/5)^2 – (4/5)^2 = 0.320
  Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
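A short sketch of my own, reusing a gini helper like the one above, that reproduces the weighted Gini of this binary split from the child class counts.

```python
def gini(class_counts):
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(children):
    """Weighted Gini of a split, given per-child (C1, C2) count tuples."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

parent = (6, 6)                        # C1, C2 counts at the parent node
children = [(5, 2), (1, 4)]            # counts at N1 (B? = Yes) and N2 (B? = No)

print(round(gini(parent), 3))          # 0.5
print(round(gini_split(children), 3))  # ~0.371
```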
Categorical Attributes: Computing Gini Index

 For each distinct value, gather counts for each class in the dataset
 Use the count matrix to make decisions

  Multi-way split:
    CarType   Family  Sports  Luxury
    C1        1       2       1
    C2        4       1       1
    Gini = 0.393

  Two-way split (find best partition of values):
    CarType   {Sports, Luxury}  {Family}
    C1        3                 1
    C2        2                 4
    Gini = 0.400

    CarType   {Sports}  {Family, Luxury}
    C1        2         2
    C2        1         5
    Gini = 0.419
Continuous Attributes: Computing Gini Index

 Use Binary Decisions based on one value


 Several Choices for the splitting value
– Number of possible splitting values
= Number of distinct values
 Each splitting value has a count matrix
associated with it
– Class counts in each of the partitions,
A < v and A ≥ v
 Simple method to choose best v
– For each v, scan the database to
gather count matrix and compute its
Gini index
– Computationally inefficient! Repetition of work.

  [Figure: candidate binary split “Taxable Income > 80K?” with Yes/No branches and its count matrix]
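A sketch of my own of the simple method described above: for each candidate value v, split the records into A < v and A ≥ v, build the count matrix, and keep the v with the lowest weighted Gini. The toy taxable-income values and class labels are taken from the 10-record example table used later in the naïve Bayes section.

```python
from collections import Counter

def gini(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_split(values, labels):
    """Scan every distinct value v and return the v that minimizes the
    weighted Gini of the partition A < v vs. A >= v."""
    best_v, best_g = None, float("inf")
    for v in sorted(set(values)):
        left = Counter(l for a, l in zip(values, labels) if a < v)
        right = Counter(l for a, l in zip(values, labels) if a >= v)
        n = len(values)
        g = sum(left.values()) / n * gini(left) + sum(right.values()) / n * gini(right)
        if g < best_g:
            best_v, best_g = v, g
    return best_v, best_g

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # Taxable Income (in $K)
evade  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, evade))
```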
Bayes Classifier

• A probabilistic framework for solving classification problems
• Conditional probability:

$$P(C \mid A) = \frac{P(A, C)}{P(A)}, \qquad P(A \mid C) = \frac{P(A, C)}{P(C)}$$

• Bayes theorem:

$$P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}$$
Example of Bayes Theorem
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20

• If a patient has stiff neck, what’s the probability


he/she has meningitis?

$$P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$$
Bayesian Classifiers
• Consider each attribute and class label as random
variables

• Given a record with attributes (A1, A2,…,An)


– Goal is to predict class C
– Specifically, we want to find the value of C that maximizes
P(C| A1, A2,…,An )

• Can we estimate P(C| A1, A2,…,An ) directly from


data?
Bayesian Classifiers
• Approach:
– compute the posterior probability P(C | A1, A2, …, An) for all
values of C using the Bayes theorem
$$P(C \mid A_1 A_2 \cdots A_n) = \frac{P(A_1 A_2 \cdots A_n \mid C)\, P(C)}{P(A_1 A_2 \cdots A_n)}$$

– Choose value of C that maximizes


P(C | A1, A2, …, An)

– Equivalent to choosing value of C that maximizes


P(A1, A2, …, An|C) P(C)

• How to estimate P(A1, A2, …, An | C )?


Naïve Bayes Classifier
• Assume independence among attributes Ai when class is given:
  – P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)

  – Can estimate P(Ai | Cj) for all Ai and Cj.

  – New point is classified to Cj if P(Cj) ∏ P(Ai | Cj) is maximal.

How to Estimate Probabilities from Data?

• Class prior: P(C) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10

• For discrete attributes:
  P(Ai | Ck) = |Aik| / Nck
  – where |Aik| is the number of instances having attribute value Ai and belonging to class Ck,
    and Nck is the number of instances of class Ck
  – Examples:
    P(Status=Married | No) = 4/7
    P(Refund=Yes | Yes) = 0

  Tid  Refund  Marital Status  Taxable Income  Evade
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes
How to Estimate Probabilities from Data?

• For continuous attributes:


– Discretize the range into bins
• one ordinal attribute per bin
• violates independence assumption
– Two-way split: (A < v) or (A > v)
• choose only one of the two splits as new attribute
– Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once probability distribution is known, can use it to estimate
the conditional probability P(Ai|c)
Example of Naïve Bayes Classifier

Given a test record:  X = (Refund = No, Marital Status = Married, Income = 120K)

Naive Bayes classifier estimates:
  P(Refund=Yes | No) = 3/7            P(Refund=No | No) = 4/7
  P(Refund=Yes | Yes) = 0             P(Refund=No | Yes) = 1
  P(Marital Status=Single | No) = 2/7
  P(Marital Status=Divorced | No) = 1/7
  P(Marital Status=Married | No) = 4/7
  P(Marital Status=Single | Yes) = 2/7
  P(Marital Status=Divorced | Yes) = 1/7
  P(Marital Status=Married | Yes) = 0
  For Taxable Income:
    If class = No:  sample mean = 110, sample variance = 2975
    If class = Yes: sample mean = 90,  sample variance = 25

  P(X | Class=No)  = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No)
                   = 4/7 × 4/7 × 0.0072 = 0.0024
  P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes)
                   = 1 × 0 × 1.2 × 10^-9 = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X)  =>  Class = No
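A sketch of my own that reproduces this calculation, treating Taxable Income as normally distributed with the class-conditional sample mean and variance given above; the helper and variable names are arbitrary.

```python
import math

def normal_pdf(x, mean, variance):
    """Class-conditional density for a continuous attribute under a normal assumption."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Conditional probabilities estimated from the 10-record table above.
p_refund_no   = {"No": 4 / 7, "Yes": 1.0}   # P(Refund=No | class)
p_married     = {"No": 4 / 7, "Yes": 0.0}   # P(Marital Status=Married | class)
income_params = {"No": (110, 2975), "Yes": (90, 25)}   # (sample mean, sample variance)
prior         = {"No": 7 / 10, "Yes": 3 / 10}

scores = {}
for c in ("No", "Yes"):
    likelihood = (p_refund_no[c]
                  * p_married[c]
                  * normal_pdf(120, *income_params[c]))   # P(X | Class=c)
    scores[c] = likelihood * prior[c]                     # proportional to P(Class=c | X)

print(scores)                                             # {'No': ~0.0016, 'Yes': 0.0}
print("Predicted class:", max(scores, key=scores.get))    # No
```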
Naïve Bayes Classifier
• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Probability estimation:

  Original:    P(Ai | C) = Nic / Nc
  Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)
  m-estimate:  P(Ai | C) = (Nic + m p) / (Nc + m)

  where c: number of classes, p: prior probability, m: parameter
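A minimal sketch of my own of the Laplace and m-estimate corrections for a single conditional probability; the counts are illustrative and the parameter values are arbitrary.

```python
def laplace_estimate(n_ic, n_c, c):
    """Laplace-corrected P(Ai | C): add 1 to the count and c to the total
    (the slide defines c as the number of classes)."""
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, p, m):
    """m-estimate of P(Ai | C) with prior probability p and weight m."""
    return (n_ic + m * p) / (n_c + m)

# Example: P(Refund=Yes | Yes) has a raw count of 0 out of 3, so the original estimate is 0.
print(laplace_estimate(0, 3, c=2))     # 0.2 instead of 0
print(m_estimate(0, 3, p=0.5, m=4))    # ~0.286 instead of 0
```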
Example of Naïve Bayes Classifier

A: attributes, M: mammals, N: non-mammals

  Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
  human          yes         no       no             yes        mammals
  python         no          no       no             no         non-mammals
  salmon         no          no       yes            no         non-mammals
  whale          yes         no       yes            no         mammals
  frog           no          no       sometimes      yes        non-mammals
  komodo         no          no       no             yes        non-mammals
  bat            yes         yes      no             yes        mammals
  pigeon         no          yes      no             yes        non-mammals
  cat            yes         no       no             yes        mammals
  leopard shark  yes         no       yes            no         non-mammals
  turtle         no          no       sometimes      yes        non-mammals
  penguin        no          no       sometimes      yes        non-mammals
  porcupine      yes         no       no             yes        mammals
  eel            no          no       yes            no         non-mammals
  salamander     no          no       sometimes      yes        non-mammals
  gila monster   no          no       no             yes        non-mammals
  platypus       no          no       no             yes        mammals
  owl            no          yes      no             yes        non-mammals
  dolphin        yes         no       yes            no         mammals
  eagle          no          yes      no             yes        non-mammals

Test record:  Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no, Class = ?

  P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
  P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

  P(A | M) P(M) = 0.06 × 7/20 = 0.021
  P(A | N) P(N) = 0.0042 × 13/20 ≈ 0.0027

  P(A | M) P(M) > P(A | N) P(N)  =>  Mammals
Naïve Bayes (Summary)
• Robust to isolated noise points

• Handle missing values by ignoring the instance during


probability estimate calculations

• Robust to irrelevant attributes

• Independence assumption may not hold for some


attributes
– Use other techniques such as Bayesian Belief Networks
(BBN)
Bayesian Theorem

• Given training data D, the posterior probability of a hypothesis h, P(h | D), follows Bayes theorem:

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

• MAP (maximum a posteriori) hypothesis:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\, P(h)$$

• Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
The independence hypothesis…
• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes (variables)
are often correlated.
• Attempts to overcome this limitation:
– Bayesian networks, that combine Bayesian reasoning with
causal relationships between attributes
– Decision trees, that reason on one attribute at the time,
considering most important attributes first
Bayesian Belief Networks (I)

[Network structure: FamilyHistory and Smoker are parents of LungCancer and Emphysema;
 LungCancer is a parent of PositiveXRay and Dyspnea.]

The conditional probability table for the variable LungCancer:

         (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
  LC     0.8      0.5       0.7       0.1
  ~LC    0.2      0.5       0.3       0.9
Bayesian Belief Networks (II)
• Bayesian belief network allows a subset of the variables
conditionally independent
• A graphical model of causal relationships
• Several cases of learning Bayesian belief networks
– Given both network structure and all the variables: easy
– Given network structure but only some variables
– When the network structure is not known in advance

Nearest Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then it’s
probably a duck
[Figure: given a test record, compute its distance to all training records and choose the
 k “nearest” records.]
Nearest-Neighbor Classifiers

[Figure: an unknown record plotted among labeled training records]

 Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve

 To classify an unknown record:


– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a test point X]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

[Figure: the decision boundary of a 1-nearest-neighbor classifier forms a Voronoi diagram.]
Nearest Neighbor Classification
• Compute distance between two points:
  – Euclidean distance

$$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$$

• Determine the class from the nearest neighbor list
  – take the majority vote of class labels among the k-nearest neighbors
  – Weigh the vote according to distance
    • weight factor, w = 1/d^2
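A compact sketch of my own of a k-nearest-neighbor classifier with Euclidean distance and an optional 1/d^2 distance weighting; the toy training records are invented for illustration.

```python
import math
from collections import defaultdict

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, test_point, k=3, weighted=False):
    """train: list of (attribute_vector, class_label) pairs."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], test_point))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(point, test_point)
        votes[label] += 1.0 / (d ** 2 + 1e-9) if weighted else 1.0   # w = 1/d^2 (small epsilon added)
    return max(votes, key=votes.get)                                 # majority / weighted vote

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((3.8, 4.0), "B"), ((4.1, 3.9), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))                # A
print(knn_predict(train, (3.0, 3.0), k=3, weighted=True))
```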
Nearest Neighbor Classification…
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes

[Figure: a test point x whose k-neighborhood grows to include points from other classes as k increases]
Nearest Neighbor Classification…
• Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by one
of the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
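A quick sketch of my own of min-max scaling, one way to address this; the example values mirror the ranges listed above and are illustrative only.

```python
def min_max_scale(values):
    """Rescale a list of attribute values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [1.5, 1.6, 1.7, 1.8]                   # metres
incomes = [10_000, 50_000, 400_000, 1_000_000]   # dollars

# After scaling, neither attribute dominates a Euclidean distance computation.
print(min_max_scale(heights))
print(min_max_scale(incomes))
```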
Nearest Neighbor Classification…
• Problem with Euclidean measure:
– High dimensional data
• curse of dimensionality
– Can produce counter-intuitive results
  111111111110 vs 011111111111 : d = 1.4142
  100000000000 vs 000000000001 : d = 1.4142

 Solution: Normalize the vectors to unit length


Nearest Neighbor Classification…
• k-NN classifiers are lazy learners
  – They do not build models explicitly
  – Unlike eager learners such as decision tree induction and rule-based systems
  – Classifying unknown records is relatively expensive
