Class11-PatternClassification KNN
Pattern Classification
Classification
• Problem of identifying to which of a set of categories a
new observation belongs
• Predicts categorical labels
• Examples:
• Predicting whether a person is an adult or a child (2-class)
• Predicting whether a person will get a salary raise based on
years of experience and current salary (2-class)
• Identifying an email as spam or not (2-class)
• Predicting the presence or absence of a disease (2-class)
– Pima Indians Diabetes Database: predict whether a patient
has diabetes or not based on diagnostic measurements
• Categorizing a disease according to symptoms (multi-class)
• Categorizing the Iris flowers (multi-class)
Classification
• Classification is a two-step process
– Step1: Building a classifier (data modeling)
• Learning from data (training phase)
• Supervised learning: each example is a pair consisting of an
input example and a desired output value (class label)
• The training phase, or learning phase, is viewed as learning
a mapping or function that can predict the class label
associated with a given example
[Figure: A classifier maps a feature vector x = [x1 x2]^T
(x1 = height, x2 = weight) to class C1 (Adult) or C2 (Child);
a scatter plot of weight (x2) vs. height (x1) shows the two classes.]
Illustration of Training Set: Adult-Child
• Number of training examples (N) = 20
• Dimension of a training example = 2 (height, weight)
• The class label attribute is the 3rd dimension
• Class:
– Child (0)
– Adult (1)
[Figure: Scatter plot of the training set, weight in kg vs. height in cm.]
Step1: Building a Classification Model (Training Phase)
[Figure: Training phase. Feature extraction converts each raw sample
into a (height, weight) feature vector; the labelled feature vectors
below are used to train the classifier.]
Training Examples:
Height (cm)  Weight (kg)  Class
90           21.5         Child
100          32.45        Child
98           28.43        Child
183          90           Adult
163          67.45        Adult
Step2: Classification (Testing Phase)
[Figure: Testing phase. The classifier is trained on the same labelled
training examples as above; a test example then passes through feature
extraction and the classifier outputs its class label.]
Test example: height 150 cm, weight 50.6 kg → class label: Adult
Data Preparation for the Classification
Data Preparation for the Classification:
Approach 1
• Suppose that we are doing a 70-30 split
• Suppose the data set has 3000 samples
• Each sample belongs to one of 3 classes
• Suppose each class has 1000 samples
– Step1: From class1, 70% (700 samples) are taken as
training samples and the remaining 30% (300 samples) are
taken as test samples
– Step2: From class2, 70% (700 samples) are taken as
training samples and the remaining 30% (300 samples) are
taken as test samples
– Step3: From class3, 70% (700 samples) are taken as
training samples and the remaining 30% (300 samples) are
taken as test samples
– Step4: Combine the training examples from each class
• The training set now contains 700+700+700=2100 samples
– Step5: Combine the test examples from each class
• The test set now contains 300+300+300=900 samples
(a sketch of this split follows below)
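As a concrete illustration, the following is a minimal Python/NumPy sketch of this per-class 70-30 split; the function name, array layout, and use of NumPy are assumptions for illustration, not part of the original slides.

```python
import numpy as np

def split_per_class(X, y, train_frac=0.7, seed=0):
    # Approach 1 sketch (assumed names): take train_frac of every class
    # for training and the rest for testing, then combine across classes.
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)          # samples of class c
        rng.shuffle(idx)
        n_train = int(train_frac * len(idx))  # e.g. 700 of 1000
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```

With 3 balanced classes of 1000 samples each, this yields the 2100 training and 900 test samples computed above.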
Data Preparation for the Classification
• Divide the data into a training set and a test set
• Approach 1: When the numbers of samples from the classes
are almost equal (balanced data)
– Example:
• The training data contain 70% of the samples from each class
• The test data contain the remaining 30% of the samples from each class
• Approach 2: When the numbers of samples from the classes
are not equal (imbalanced data)
– One class may have a large number of samples and another a
small number of samples
– A plain 70%-30% division may cause the learned model to be
biased toward the class with the larger number of training samples
– Solution:
• Consider 70% or 80% of the samples from the class with the least
number of samples as the training data from that class
• Consider the same number of samples from each other class as training
examples
• Each class will then have the same number of training examples
Data Preparation for the Classification:
Approach 2
• Suppose the data set has 3000 samples
• Each sample belongs to one of 3 classes
• Suppose class1 has 700 samples, class2 has 300 samples
and class3 has 2000 samples
– Step1: From class2 (the smallest class), 70% (210 samples)
are taken as training samples and the remaining 30%
(90 samples) are taken as test samples
– Step2: From class1, 210 samples are taken as training
samples and the remaining 490 samples are taken as test
samples
– Step3: From class3, 210 samples are taken as training
samples and the remaining 1790 samples are taken as test
samples
– Step4: Combine the training examples from each class
• The training set now contains 210+210+210=630 samples
– Step5: Combine the test examples from each class
• The test set now contains 490+90+1790=2370 samples
(a sketch of this split follows below)
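Approach 2 can be sketched the same way; only the per-class cap changes. As before, the names are illustrative assumptions.

```python
import numpy as np

def split_imbalanced(X, y, train_frac=0.7, seed=0):
    # Approach 2 sketch (assumed names): cap the per-class training count
    # at train_frac of the smallest class, so every class contributes the
    # same number of training examples.
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    smallest = min(np.sum(y == c) for c in classes)
    n_train = int(train_frac * smallest)  # e.g. 210 = 70% of 300
    train_idx, test_idx = [], []
    for c in classes:
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```

With class sizes 700, 300 and 2000, this yields the 630 training and 2370 test samples computed above.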
Nearest-Neighbour Method
• Training data with N samples:
[Figure: The N training samples plotted in the (x1, x2) feature space.]
Nearest-Neighbour Method
• Training data:
[Figure: Adult-Child training data plotted as weight in kg vs. height in cm.]
Illustration of Nearest Neighbour Method: Adult(1)-Child(0) Classification
Test Example:
[Figure: The test example plotted among the training samples, weight in
kg vs. height in cm; its nearest training example is an Adult sample.]
• Step 3: Assign the class of the
training example with the
minimum distance to the test
example
– Class: Adult (1)
Nearest-Neighbour Method
• Training data:
[Figure: Training samples and the test example, weight in kg vs. height in cm.]
• Step 2: Sort the examples in the training set in ascending
order of their distance to the test example
Illustration of Nearest Neighbour Method: Adult(1)-Child(0) Classification
Test Example:
[Figure: The test example and its nearest training sample, weight in kg
vs. height in cm.]
• Step 3: Assign the class of the training example with the
minimum distance to the test example
– Class: Adult (1)? Relying on a single nearest neighbour can be
unreliable near the class boundary, which motivates considering
K nearest neighbours instead
K-Nearest Neighbours (K-NN) Method
• Consider the class labels of the K training examples
nearest to the test example
• Step 1: Compute the Euclidean distance between a test
example x and every training example x1, x2, …, xn,
…, xN (see the formula below)
[Figure: Training samples and the test example x in the (x1, x2) feature space.]
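For D-dimensional feature vectors, the Euclidean distance in Step 1 is the standard one below; the slides state this step in words only, so the notation here is added for clarity.

$$d(\mathbf{x}, \mathbf{x}_n) = \sqrt{\sum_{i=1}^{D} (x_i - x_{ni})^2}$$

where D is the number of attributes (D = 2 for the height-weight example).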
K-Nearest Neighbours (K-NN) Method
• Consider the class labels of the K training examples
nearest to the test example
• Step 1: Compute the Euclidean distance between a test
example x and every training example x1, x2, …, xn, …, xN
• Step 2: Sort the examples in the training set in
ascending order of their distance to x
• Step 3: Choose the first K examples in the sorted list
– K is the number of neighbours for the test example
• Step 4: The test example is assigned the most common
class among its K neighbours (a sketch of all four steps
follows below)
[Figure: Training samples and the test example x in the (x1, x2) feature space.]
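Putting the four steps together, a minimal NumPy sketch of the K-NN classifier might look as follows; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def knn_classify(X_train, y_train, x_test, K=5):
    # Step 1: Euclidean distance from x_test to every training example
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # Steps 2-3: indices of the K nearest training examples
    nearest = np.argsort(dists)[:K]
    # Step 4: majority vote among the K neighbours' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

Applied to the training table from the earlier slides (a tiny 5-sample set, so K=3 is used here rather than the K=5 of the 20-sample illustration):

```python
# Adult(1)-Child(0) data: height (cm), weight (kg)
X_train = np.array([[90, 21.5], [100, 32.45], [98, 28.43],
                    [183, 90.0], [163, 67.45]])
y_train = np.array([0, 0, 0, 1, 1])  # 0 = Child, 1 = Adult
print(knn_classify(X_train, y_train, np.array([150, 50.6]), K=3))  # -> 1 (Adult)
```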
Illustration of K-Nearest Neighbours Method: Adult(1)-Child(0) Classification
Test Example:
[Figure: The test example and its K=5 nearest training samples, weight
in kg vs. height in cm.]
• Consider K=5
• Step 3: Choose the first K=5
examples in the sorted list
Illustration of K-Nearest Neighbours Method: Adult(1)-Child(0) Classification
Test Example:
[Figure: The test example and its K=5 nearest training samples, weight
in kg vs. height in cm.]
• Consider K=5
• Step 4: Test example is assigned
the most common class among its
K neighbours
– Class: Adult
Determining K, Number of Neighbours
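A common way to determine K is to try several candidate values and keep the one with the best accuracy on a held-out validation set; the hedged sketch below illustrates this, reusing the knn_classify function defined earlier (the candidate list and the validation split are assumptions, not from the slides).

```python
import numpy as np

def choose_K(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 9)):
    # Pick the K with the highest accuracy on a validation set.
    # Odd candidates avoid ties in a 2-class majority vote.
    best_K, best_acc = candidates[0], -1.0
    for K in candidates:
        preds = [knn_classify(X_train, y_train, x, K) for x in X_val]
        acc = np.mean(np.array(preds) == y_val)
        if acc > best_acc:
            best_K, best_acc = K, acc
    return best_K
```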
Data Normalization
• Since a distance measure is used, the K-NN classifier
requires normalising the values of each attribute
• Normalising the training data:
– Compute the minimum and maximum values of each of
the attributes in the training data
– Store the minimum and maximum values of each of the
attributes
– Perform min-max normalization on the training data set
• Normalising the test data:
– Use the stored minimum and maximum values of each
of the attributes from the training set to normalise the test
examples
• NOTE: A test value may fall outside the stored [min, max]
range of the training data; ensure such out-of-bound values
do not cause errors (a sketch follows below)
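A minimal sketch of this fit-on-train, apply-to-test min-max normalization; clipping is one possible way to handle out-of-bound test values and is an assumption here, since the slide only says to guard against them.

```python
import numpy as np

def fit_min_max(X_train):
    # Compute and store per-attribute min and max from the training data only.
    return X_train.min(axis=0), X_train.max(axis=0)

def min_max_normalize(X, mins, maxs):
    # Scale each attribute to [0, 1] using the stored training min/max.
    # Test values outside the training range are clipped so that no
    # out-of-bound values reach the classifier (one possible policy).
    Xn = (X - mins) / (maxs - mins)
    return np.clip(Xn, 0.0, 1.0)
```

The same mins and maxs computed from the training set are reused for every test example.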
Lazy Learning: Learning from Neighbours
• The K-nearest-neighbour classifier is an example of a
lazy learner
• Lazy learning waits until the last minute, doing no
model construction until a test example must be classified
• When the training examples are given, a lazy learner
simply stores them and waits until it is given a test
example
• When it sees the test example, it classifies it based
on its similarity to the stored training examples
• Since lazy learners store the training examples, or
instances, they are also called instance-based learners
• Disadvantages:
– Making a classification or prediction is computationally
intensive
– Efficient storage and indexing techniques are required
when the number of training samples is huge