Class10 14 PatternClassification - 13 24sept2019

The document discusses data science and machine learning concepts. It explains that data science uses scientific methods to extract knowledge and insights from structured and unstructured data. Machine learning uses data to build models that can perform predictive tasks like classification. Classification involves building a model from labeled training data, then using the model to predict the class of new unlabeled data. The document provides examples of classification problems and illustrates training data with labeled examples.


Data Modeling

Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge

  [Pipeline diagram: Database → Data Collection → Data Preprocessing (Data Cleaning and Cleansing, Feature Representation) → Data Modeling (Machine Learning) → Inference]



Descriptive Data Analytics


• It helps us to study the general characteristics of data
and identify the presence of noise or outliers
• Data characteristics:
– Central tendency of data
• Centre of the data
• Measuring mean, median and mode
– Dispersion of data
• The degree to which numerical data tend to spread
• Measuring range, quartiles, interquartile range (IQR), the five-number summary and standard deviation
• Descriptive analytics are the backbone of reporting


Predictive Data Analytics


• It is used to identify the trends, correlations and
causation by learning the patterns from data
• Study and construction of algorithms that can learn
from data and make predictions on data
• It involves tasks like
– Classification:
• E.g.: predicting the presence or absence of disease or
• the classification of disease according to symptoms
– Regression: Numeric prediction
• E.g.: predicting the landslide or
• predicting the rainfall
– Clustering:
• E.g.: grouping the similar items to be sold or
• grouping the people from the same region
• Learning from data

Pattern Classification

Classification
• Problem of identifying to which of a set of categories a
new observation belongs
• Predicts categorical labels
• Example:
– Assigning a given email to the "spam" or "non-spam"
class
– Assigning a diagnosis (disease) to a given patient based
on observed characteristics of the patient
• Classification is a two-step process
– Step 1: Building a classifier (data modeling)
• Learning from data (training phase)
– Step 2: Using the classification model for classification
• Testing phase


Step 1: Building a Classification Model (Training Phase)
• A classifier is built describing a predetermined set of
data classes
• This is a learning step (or training phase)
• Training phase: A classification algorithm builds the
classifier by analysing or learning from a training data
set made up of tuples (samples) and their class labels
• In the context of machine learning, data tuples can be referred to as samples, examples, instances, data vectors, or data points

Step 1: Building a Classification Model (Training Phase)
• Suppose the training data consist of N tuples (or data vectors) described by d attributes (d dimensions)
  $D = \{\mathbf{x}_n\}_{n=1}^{N},\ \mathbf{x}_n \in \mathbb{R}^d$
• Each tuple (or data vector) is assumed to belong to a
predefined class
– Class is determined by another attribute ((d+1)th
attribute) called the class label attribute
– Class label attribute is discrete-valued and unordered
– It is categorical (nominal) in that each value serves as a category or class
• Individual tuples (or data vectors) making up the training set are referred to as training tuples, training samples, training examples or training data vectors


2-class Classification
• Example: Classifying a person as child or adult

  [Block diagram: Height ($x_1$) and Weight ($x_2$) are fed to an Adult/Child classifier, which outputs the class (Adult: Class C1, Child: Class C2)]
  [Scatter plot of Weight ($x_2$) vs Height ($x_1$) showing the Adult and Child regions; feature vector $\mathbf{x} = [x_1\ x_2]^T$]

Illustration of Training Set: Adult-Child


• Number of training examples (N) = 20
• Dimension of a training example = 2
• Class label attribute is 3rd dimension
• Class:
– Child (0)
– Adult (1)

  [Scatter plot of the 20 training examples: Weight in kg vs Height in cm]


Illustration of Training Set – Iris (Flower) Data
• Number of training
examples (N) = 20
• Dimension of a
training example =
4
• Class label attribute
is 5th dimension
• Class:
– Iris Setosa (1)
– Iris Versicolour (2)
– Iris Virginica (3)


Illustration of Training Set – Iris (Flower) Data
  [Scatter plots of the Iris training data; class labels: 1: Iris Setosa, 2: Iris Versicolour, 3: Iris Virginica]


Step 1: Building a Classification Model (Training Phase)
• Training phase or learning phase is viewed as the
learning of a mapping or function that can predict the
associated class label of a given training example
  $y_n = f(\mathbf{x}_n)$
– xn is the nth training example and yn is the associated
class label
• Supervised learning:
– Class label for each training example is provided
– In supervised learning, each example is a pair consisting
of an input example (typically a vector) and a desired
output value


Step 1: Building a Classification Model (Training Phase)
  [Diagram: feature extraction is applied to each input to form the training examples, which are fed to the classifier during the training phase]
  Training examples (height in cm, weight in kg, class label):
  90   21.5   Child
  100  32.45  Child
  98   28.43  Child
  183  90     Adult
  163  67.45  Adult


Step 2: Classification (Testing Phase)
• The trained model is used for classification
• Predictive accuracy of the classifier is estimated
• Accuracy of a classifier:
– Accuracy of a classifier on a test set is percentage of test
examples that are correctly classified by the classifier
– The associated class label of each test example (ground
truth) is compared with the learned classifier’s class
prediction for that example
• Generalization ability of trained model: Performance
of trained models on new (test) data
• Target of learning techniques: Good generalization
ability


Step 2: Classification (Testing Phase)
  [Diagram: the classifier built in the training phase from the training examples (90 21.5 Child; 100 32.45 Child; 98 28.43 Child; 183 90 Adult; 163 67.45 Adult) is applied, after feature extraction, to a new example]
  Testing phase: test example 150 50.6 → predicted class label: Adult


Pattern Classification Problems
  [Three scatter plots in the (x1, x2) plane: linearly separable classes, nonlinearly separable classes, and overlapping classes]

• 1, 2, 3, 4, 5, ?, …, 24, 25, 26, 27, ?


• 1, 3, 5, 7, 9, ?, …, 25, 27, 29, 31, ?
• 2, 3, 5, 7, 11, ?, …, 29, 31, 37, 41, ?
• 1, 4, 9, 16, 25, ?, …, 121, 144, 169, ?
• 1, 2, 4, 8, 16, 32, ?,…, 1024, 2048, 4096, ?
• 1, 1, 2, 3, 5, 8, ?, …, 55, 89, 144, 233, ?
• 1, 1, 2, 4, 7, 13, ?, 44, 81, 149, 274, 504, ?
• 3, 5, 12, 24, 41, ?, …., 201, 248, 300, 357, ?
• 1, 6, 19, 42, 59, ?, …, 95, 117, 156, 191, ?


• 1, 2, 3, 4, 5, 6, …, 24, 25, 26, 27, 28


• 1, 3, 5, 7, 9, 11, …, 25, 27, 29, 31, 33
• 2, 3, 5, 7, 11, 13, …, 29, 31, 37, 41, 43
• 1, 4, 9, 16, 25, 36, …, 121, 144, 169, 196
• 1, 2, 4, 8, 16, 32, 64,…, 1024, 2048, 4096, 8192
• 1, 1, 2, 3, 5, 8, 13, …, 55, 89, 144, 233, 377
• 1, 1, 2, 4, 7, 13, 24, 44, 81, 149, 274, 504, 927
• 3, 5, 12, 24, 41, 63, ….., 201, 248, 300, 357, 419
(2, 7, 12, 17, 22, 27, 32, 37, 42, 47, 52, 57, 62)
• 1, 6, 19, 42, 59, ?, …, 95, 117, 156, 191, ?

• Pattern: Any regularity or structure in data or in the source of data
• Pattern Analysis: Automatic discovery of patterns in data

Image Classification
  [Example images with class labels: Tiger, Giraffe, Horse, Bear]


Scene Image Classification
  [Example scene categories: Tall building, Inside city, Street, Highway, Coast, Open country, Mountain, Forest]

Nearest-Neighbour Method
• Training data with N samples: $D = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$, $\mathbf{x}_n \in \mathbb{R}^d$ and $y_n \in \{1, 2, \ldots, M\}$
– d: dimension of input example
– M: Number of classes
• Step 1: Compute the Euclidean distance of a test example x to every training example, x1, x2, …, xn, …, xN
  Euclidean distance $= \|\mathbf{x}_n - \mathbf{x}\| = \sqrt{(\mathbf{x}_n - \mathbf{x})^T(\mathbf{x}_n - \mathbf{x})} = \sqrt{\sum_{i=1}^{d}(x_{ni} - x_i)^2}$
  [Scatter plot in the (x1, x2) plane showing the test example $\mathbf{x} = [x_1\ x_2]^T$ among the training examples]


Nearest-Neighbour Method
• Training data: $D = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$, $\mathbf{x}_n \in \mathbb{R}^d$ and $y_n \in \{1, 2, \ldots, M\}$
– d: dimension of input example
– M: Number of classes
• Step 1: Compute the Euclidean distance of a test example x to every training example, x1, x2, …, xn, …, xN
• Step 2: Sort the examples in the training set in the ascending order of the distance to x
• Step 3: Assign the class of the training example with the minimum distance to the test example, x
  [Scatter plot in the (x1, x2) plane showing the test example $\mathbf{x} = [x_1\ x_2]^T$ among the training examples]
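A minimal sketch of this nearest-neighbour rule in Python/NumPy. The array names X_train and y_train and the helper below are assumptions made for illustration; the sample values follow the adult-child figures used in these slides.

import numpy as np

def nearest_neighbour_classify(x, X_train, y_train):
    # Assign x the class of the closest training example (Euclidean distance)
    dists = np.sum((X_train - x) ** 2, axis=1)   # squared distances have the same argmin
    return y_train[np.argmin(dists)]

# Hypothetical adult(1)/child(0) data: columns are height in cm and weight in kg
X_train = np.array([[90.0, 21.5], [100.0, 32.45], [98.0, 28.43],
                    [183.0, 90.0], [163.0, 67.45]])
y_train = np.array([0, 0, 0, 1, 1])
print(nearest_neighbour_classify(np.array([150.0, 50.6]), X_train, y_train))  # 1 (Adult)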

Illustration of Nearest Neighbour Method: Adult(1)-Child(0) Classification
  Test example: [Scatter plot: Weight in kg vs Height in cm with the test example marked among the training examples]
• Step 1: Compute the Euclidean distance (ED) of the test example to each training example


Illustration of Nearest Neighbour Method: Adult(1)-Child(0) Classification
  Test example: [Scatter plot: Weight in kg vs Height in cm with the test example marked]
• Step 2: Sort the examples in the training set in the ascending order of the distance to the test example

Illustration of Nearest Neighbour Method: Adult(1)-Child(0) Classification
  Test example: [Scatter plot: Weight in kg vs Height in cm with the test example marked]
• Step 3: Assign the class of the training example with the minimum distance to the test example
– Class: Adult




K-Nearest Neighbours (K-NN) Method
• Consider the class labels of the K training examples nearest to the test example
• Step 1: Compute the Euclidean distance of a test example x to every training example, x1, x2, …, xn, …, xN
  Euclidean distance $= \|\mathbf{x}_n - \mathbf{x}\| = \sqrt{(\mathbf{x}_n - \mathbf{x})^T(\mathbf{x}_n - \mathbf{x})} = \sqrt{\sum_{i=1}^{d}(x_{ni} - x_i)^2}$
  [Scatter plot in the (x1, x2) plane showing the test example $\mathbf{x} = [x_1\ x_2]^T$ among the training examples]

K-Nearest Neighbours (K-NN) Method
• Consider the class labels of the K training examples nearest to the test example
• Step 1: Compute the Euclidean distance of a test example x to every training example, x1, x2, …, xn, …, xN
• Step 2: Sort the examples in the training set in the ascending order of the distance to x
• Step 3: Choose the first K examples in the sorted list
– K is the number of neighbours for the test example
• Step 4: The test example is assigned the most common class among its K neighbours
  [Scatter plot in the (x1, x2) plane showing the test example $\mathbf{x} = [x_1\ x_2]^T$ among the training examples]
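The K-NN rule itself is a small extension of the sketch above; X_train and y_train are the same assumed arrays, and the class labels are assumed to be integers 0, …, M−1 so that np.bincount can tally the votes.

import numpy as np

def knn_classify(x, X_train, y_train, K=5):
    # K-nearest-neighbours rule: majority vote among the K closest training examples
    dists = np.linalg.norm(X_train - x, axis=1)   # Step 1: Euclidean distances
    nearest = np.argsort(dists)[:K]               # Steps 2-3: first K of the sorted list
    votes = np.bincount(y_train[nearest])         # Step 4: count the labels of the K neighbours
    return int(np.argmax(votes))                  # most common class (ties go to the smaller label)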


Illustration of Nearest Neighbour Method: Adult(1)-Child(0) Classification
  Test example: [Scatter plot: Weight in kg vs Height in cm with the test example and its neighbours marked]
• Consider K = 5
• Step 3: Choose the first K = 5 examples in the sorted list

Illustration of Nearest Neighbour Method: Adult(1)-Child(0) Classification
  Test example: [Scatter plot: Weight in kg vs Height in cm with the test example and its neighbours marked]
• Consider K = 5
• Step 4: The test example is assigned the most common class among its K neighbours
– Class: Adult


Determining K, Number of Neighbours
• This is determined experimentally
• Starting with K = 1, the test set is used to estimate the accuracy of the classifier
• This process is repeated, each time incrementing K to allow for more neighbours
• The K value that gives the maximum accuracy may be selected (see the sketch below)
• Preferably the value of K should be an odd number.
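A sketch of this experimental search, reusing the knn_classify sketch given earlier; the candidate odd values of K and the accuracy measure on a held-out test set are assumptions consistent with the procedure described above.

import numpy as np

def accuracy_for_K(K, X_train, y_train, X_test, y_test):
    # Percentage of test examples that are correctly classified with this K
    preds = np.array([knn_classify(x, X_train, y_train, K) for x in X_test])
    return 100.0 * np.mean(preds == y_test)

def choose_K(X_train, y_train, X_test, y_test, K_values=range(1, 16, 2)):
    # Try odd K values starting from K = 1 and keep the one with maximum accuracy
    accuracies = {K: accuracy_for_K(K, X_train, y_train, X_test, y_test) for K in K_values}
    return max(accuracies, key=accuracies.get)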

Data Normalization
• Since a distance measure is used, the K-NN classifier requires normalising the values of each attribute
• Normalising the training data:
– Compute the minimum and maximum values of each of the attributes in the training data
– Store the minimum and maximum values of each of the attributes
– Perform min-max normalization on the training data set
• Normalizing the test data:
– Use the stored minimum and maximum values of each of the attributes from the training set to normalise the test examples
• NOTE: Ensure that test examples do not cause out-of-bound errors (normalised values falling outside the training range)
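A minimal sketch of the min-max scheme described above, assuming the attributes are columns of a NumPy array; clipping the normalised test data to [0, 1] is just one simple way of guarding against out-of-bound values.

import numpy as np

def fit_min_max(X_train):
    # Compute and store the per-attribute minimum and maximum on the training data only
    return X_train.min(axis=0), X_train.max(axis=0)

def min_max_normalize(X, x_min, x_max):
    # Map each attribute to [0, 1] using the stored training-set statistics
    # (assumes every attribute has a non-zero range in the training data)
    X_norm = (X - x_min) / (x_max - x_min)
    return np.clip(X_norm, 0.0, 1.0)   # test values outside the training range are clipped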



Learning from Data




Machine Learning for Pattern Recognition


• Learning: Acquiring new knowledge or modifying the
existing knowledge
• Knowledge: Familiarity with information present in data
• Learning by machines for pattern analysis: Acquisition of
knowledge from data to discover patterns in data
• Data-driven techniques for learning by machines: Learning
from examples (Training of models)
• Generalization ability of learning machines: Performance of
trained models on new (test) data
• Target of learning techniques: Good generalization ability
• Learning techniques: Estimation of parameters of models


Lazy Learning: Learning from Neighbours
• The K-nearest neighbour classifier is an example of a lazy learner
• Lazy learning waits until the last minute before doing any model construction to classify a test example
• When the training examples are given, a lazy learner simply stores them and waits until it is given a test example
• When it sees the test example, it classifies it based on its similarity to the stored training examples
• Since a lazy learner stores the training examples or instances, it is also called an instance-based learner
• Disadvantages:
– Making a classification or prediction is computationally intensive
– Requires efficient storage techniques when the number of training samples is huge


Data Preparation for Classification
• Divide the data into a training set and a test set
– Example:
  • The training data contains 70% of the samples from each class
  • The test data contains the remaining 30% of the samples from each class

Data Preparation for Classification using the K-Nearest Neighbour Classifier
• Suppose the data set has 3000 samples
• Each sample belongs to one of 3 classes
• Suppose each class has 1000 samples
– Step 1: From class 1, 70% (i.e. 700 samples) are taken as training samples and the remaining 30% (i.e. 300 samples) as test samples
– Step 2: From class 2, 70% (i.e. 700 samples) are taken as training samples and the remaining 30% (i.e. 300 samples) as test samples
– Step 3: From class 3, 70% (i.e. 700 samples) are taken as training samples and the remaining 30% (i.e. 300 samples) as test samples
– Step 4: Combine the training examples from each class
  • The training set now contains 700 + 700 + 700 = 2100 samples
– Step 5: Combine the test examples from each class
  • The test set now contains 300 + 300 + 300 = 900 samples
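A sketch of this per-class 70/30 split, assuming the samples sit in an array X with labels y; the shuffling and fixed seed are assumptions added for reproducibility.

import numpy as np

def per_class_split(X, y, train_fraction=0.7, seed=0):
    # Split class by class: 70% of each class for training, the remaining 30% for testing
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        n_train = int(train_fraction * len(idx))   # e.g. 700 out of 1000 samples per class
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]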


Performance Evaluation for Classification

Confusion Matrix
                                    Actual Class
                                    Class1 (Positive)   Class2 (Negative)
Predicted Class  Class1 (Positive)  True Positive       False Positive
                 Class2 (Negative)  False Negative      True Negative

• True Positive: Number of test samples correctly predicted as positive class.
• True Negative: Number of test samples correctly predicted as negative class.
• False Positive: Number of test samples predicted as positive class but actually belonging to negative class.
• False Negative: Number of test samples predicted as negative class but actually belonging to positive class.
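A sketch of these four counts for a two-class problem, assuming y_true and y_pred are NumPy arrays and class 1 is treated as the positive class.

import numpy as np

def binary_confusion_counts(y_true, y_pred, positive=1):
    # Count TP, TN, FP and FN with `positive` as the Class1 (positive) label
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    # Accuracy (percentage of correctly classified test examples):
    # 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return tp, tn, fp, fn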


Confusion Matrix
                                    Actual Class
                                    Class1 (Positive)   Class2 (Negative)
Predicted Class  Class1 (Positive)  True Positive       False Positive
                 Class2 (Negative)  False Negative      True Negative

• True Positive + False Negative = total test samples in class1 (actual positives)
• False Positive + True Negative = total test samples in class2 (actual negatives)
• True Positive + False Positive = total test samples predicted as class1
• False Negative + True Negative = total test samples predicted as class2


Accuracy
                                    Actual Class
                                    Class1 (Positive)   Class2 (Negative)
Predicted Class  Class1 (Positive)  True Positive       False Positive
                 Class2 (Negative)  False Negative      True Negative

  Accuracy = (True Positive + True Negative) / total number of test samples

Confusion Matrix – Multiclass
                           Actual Class
                           Class1   Class2   Class3
Predicted Class   Class1     C11      C21      C31
                  Class2     C12      C22      C32
                  Class3     C13      C23      C33

• True Positive: Number of test samples correctly predicted as positive class (C11).
• True Negative: Number of test samples correctly predicted as negative class (C22 + C33).
• False Positive: Number of test samples predicted as positive class but actually belonging to negative class (C21 + C31).
• False Negative: Number of test samples predicted as negative class but actually belonging to positive class (C12 + C13).


Confusion Matrix – Multiclass
                           Actual Class
                           Class1   Class2   Class3
Predicted Class   Class1     C11      C21      C31
                  Class2     C12      C22      C32
                  Class3     C13      C23      C33

• Each row sum gives the total samples predicted as that class (class1, class2, class3)
• Each column sum gives the total samples actually in that class; the grand total is the total number of samples used for testing

Accuracy of Multiclass Classification
                           Actual Class
                           Class1   Class2   Class3
Predicted Class   Class1     C11      C21      C31
                  Class2     C12      C22      C32
                  Class3     C13      C23      C33

  Accuracy = (C11 + C22 + C33) / total number of test samples
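A sketch of the multiclass confusion matrix and the accuracy computed from it, assuming integer labels 0, …, M−1 and the layout used on these slides (rows = predicted class, columns = actual class); the correctly classified samples lie on the diagonal.

import numpy as np

def confusion_matrix(y_true, y_pred, M):
    # C[i, j] = number of test samples of actual class j that were predicted as class i
    C = np.zeros((M, M), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        C[predicted, actual] += 1
    return C

def multiclass_accuracy(C):
    # Correct predictions are on the diagonal (C11 + C22 + C33 in the 3-class case)
    return 100.0 * np.trace(C) / np.sum(C)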



Reference Templates Method
• Each class is represented by its reference template
– Mean of the data points of each class as the reference template
• The class of the nearest reference template (mean) is assigned to the test pattern
  Euclidean distance $= \|\mathbf{x} - \boldsymbol{\mu}_i\| = \sqrt{(\mathbf{x} - \boldsymbol{\mu}_i)^T(\mathbf{x} - \boldsymbol{\mu}_i)} = \sqrt{\sum_{j=1}^{d}(x_j - \mu_{ij})^2}$
  – $\boldsymbol{\mu}_i$: Mean vector of class i
• Learning: Estimating first order statistics (mean) from the data of each class
  [Scatter plot in the (x1, x2) plane: two classes with mean vectors $\boldsymbol{\mu}_1 = [\mu_{11}\ \mu_{12}]^T$ and $\boldsymbol{\mu}_2 = [\mu_{21}\ \mu_{22}]^T$, and the test example $\mathbf{x} = [x_1\ x_2]^T$]
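A minimal sketch of this nearest-mean (reference template) classifier; the function and array names are assumptions.

import numpy as np

def fit_class_means(X_train, y_train):
    # Learning: estimate the first order statistics (mean) of each class
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes, means

def nearest_mean_classify(x, classes, means):
    # Assign the class of the nearest reference template (Euclidean distance)
    dists = np.linalg.norm(means - x, axis=1)
    return classes[np.argmin(dists)]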


Modified Reference Templates Method
• Each class is represented by one or more reference templates
– Mean and variance of the data points of each class as the reference template
• The class of the nearest reference template is assigned to the test pattern
  Mahalanobis distance $= \sqrt{(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)}$
  – $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$: Mean vector and covariance matrix of class i
• Learning: Estimating
  – first order statistics (mean) and
  – second order statistics (variance and covariance)
  from the data of each class
  [Scatter plot in the (x1, x2) plane: two classes with mean vectors $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$, and the test example $\mathbf{x} = [x_1\ x_2]^T$]
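A sketch of the Mahalanobis distance to one class template, assuming its mean vector and covariance matrix have already been estimated; np.linalg.solve could replace the explicit inverse for better numerical behaviour.

import numpy as np

def mahalanobis_distance(x, mu, Sigma):
    # Distance of x from a class with mean mu and covariance matrix Sigma
    diff = x - mu
    return np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)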

Probability Distribution
• Data of a class is represented by a probability
distribution
• For a class whose data is considered to be forming a
single cluster, it can be represented by a normal or
Gaussian distribution
• Multivariate Gaussian distribution:
– Adult-Child class

  [Scatter plot of the Adult-Child data: Weight in kg vs Height in cm]


Probability Distribution
• Data of a class is represented by a probability
distribution
• For a class whose data is considered to be forming a
single cluster, it can be represented by a normal or
Gaussian distribution
• Multivariate Gaussian distribution:
– Adult-Child class
– Bivariate Gaussian distribution
– Each example is sampled from the Gaussian distribution
  [Surface plot of the bivariate Gaussian density p(x) over Height in cm and Weight in kg]

Multivariate Gaussian Distribution
• Data in d-dimensional space
  $p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$
  (the quadratic form in the exponent is the squared Mahalanobis distance)
– $\boldsymbol{\mu}$ is the mean vector
– $\boldsymbol{\Sigma}$ is the covariance matrix
• Bivariate Gaussian distribution: d = 2
  $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} = \begin{bmatrix} E[x_1] \\ E[x_2] \end{bmatrix}$
  $\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix} = \begin{bmatrix} E[(x_1 - \mu_1)^2] & E[(x_1 - \mu_1)(x_2 - \mu_2)] \\ E[(x_2 - \mu_2)(x_1 - \mu_1)] & E[(x_2 - \mu_2)^2] \end{bmatrix}$
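A direct transcription of this density into NumPy (scipy.stats.multivariate_normal.pdf would give the same value); the explicit inverse and determinant are kept only to mirror the formula above.

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # Multivariate Gaussian density N(x | mu, Sigma) for a d-dimensional x
    d = len(mu)
    diff = x - mu
    maha_sq = diff @ np.linalg.inv(Sigma) @ diff          # squared Mahalanobis distance
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * maha_sq) / norm_const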


Bayes Classifier: Multivariate Data
• Let C1, C2, …, Ci, …, CM be the M classes
– Each class Ci has Ni training examples
• Given: a test example x
• Bayes decision rule:
  $P(C_i \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_i)\, P(C_i)}{p(\mathbf{x})}$
  (posterior probability of a class = likelihood × prior / evidence)
– Prior: prior information of a class, $P(C_i) = \dfrac{N_i}{N}$
  • where N is the total number of training examples
– Evidence: probability that x exists, $p(\mathbf{x}) = \sum_{i=1}^{M} p(\mathbf{x} \mid C_i)\, P(C_i)$
  • Out of all the samples, the probability of the sample we are looking at
– Likelihood follows the distribution of the data of a class
  Class label for x $= \arg\max_{i} P(C_i \mid \mathbf{x}), \quad i = 1, 2, \ldots, M$

Probability Theory and Bayes Rule
• The sample space is partitioned into C1, C2, …, Ci, …, CM, where the partitions are disjoint
– Example:
  • The data space is the sample space
  • Each class is a partition
• Let x be an event defined in the sample space
– Example: a finite set of data points (training data) is the event x
  [Diagram: sample space partitioned into regions C1, C2, …, Ci, …, CM, with the event x overlapping several partitions]
• P(x): total probability, i.e. the joint probability of x and Ci, P(x, Ci), summed over all i
  $P(\mathbf{x}) = \sum_{i=1}^{M} p(\mathbf{x}, C_i) = \sum_{i=1}^{M} p(\mathbf{x} \mid C_i)\, P(C_i)$
• P(x) is the marginal probability – the probability of x is obtained by marginalising over the events Ci


Probability Theory and Bayes Rule
  [Same sample-space partition diagram as on the previous slide]
• Conditional probability:
  $p(\mathbf{x} \mid C_i) = \dfrac{p(\mathbf{x}, C_i)}{P(C_i)}$   (1)
  $p(C_i \mid \mathbf{x}) = \dfrac{p(\mathbf{x}, C_i)}{P(\mathbf{x})}$   (2)
• From (1) and (2):
  $p(\mathbf{x} \mid C_i)\, P(C_i) = p(C_i \mid \mathbf{x})\, P(\mathbf{x})$
• Bayes decision rule:
  $P(C_i \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_i)\, P(C_i)}{P(\mathbf{x})}$

Bayes Classifier: Multivariate Data
• Data of a class is represented by a probability distribution
• Given: a test example x
• Bayes decision rule:
  $P(C_i \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_i)\, P(C_i)}{P(\mathbf{x})}$
  (posterior probability of a class = likelihood × prior / evidence)
– The likelihood of a class follows the distribution of the data of that class
– Computation of the likelihood of a class depends on the distribution of the data and the parameters of that distribution
• The Bayes decision rule can then be written as $P(\boldsymbol{\theta}_i \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid \boldsymbol{\theta}_i)\, P(C_i)}{P(\mathbf{x})}$
– $\boldsymbol{\theta}_i$ is the parameter vector of the distribution of class Ci


Maximum Likelihood (ML) Method for Parameter Estimation
• Given: training data for a class Ci having Ni samples, $D_i = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n, \ldots, \mathbf{x}_{N_i}\}$, $\mathbf{x}_n \in \mathbb{R}^d$
• Data of a class is represented by the parameter vector of its distribution: $\boldsymbol{\theta}_i = [\theta_{i1}, \theta_{i2}, \ldots, \theta_{iK}]^T$
• Unknown: $\boldsymbol{\theta}_i$
• Likelihood of the training data (total data likelihood) for a given $\boldsymbol{\theta}_i$:
  $p(D_i \mid \boldsymbol{\theta}_i) = \prod_{n=1}^{N_i} p(\mathbf{x}_n \mid \boldsymbol{\theta}_i)$
  $\mathcal{L}(\boldsymbol{\theta}_i) = \ln p(D_i \mid \boldsymbol{\theta}_i) = \sum_{n=1}^{N_i} \ln p(\mathbf{x}_n \mid \boldsymbol{\theta}_i)$
• Choose the parameters for which the total data likelihood (log likelihood) is maximum:
  $\boldsymbol{\theta}_i^{ML} = \arg\max_{\boldsymbol{\theta}_i} \mathcal{L}(\boldsymbol{\theta}_i)$

ML Method for Parameter Estimation of Multivariate Gaussian Distribution
• Given: training data for a class Ci having Ni samples, $D_i = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n, \ldots, \mathbf{x}_{N_i}\}$, $\mathbf{x}_n \in \mathbb{R}^d$
• Data of a class is represented by the parameter vector $[\boldsymbol{\mu}_i\ \boldsymbol{\Sigma}_i]^T$ of the Gaussian distribution
• Unknown: $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$
• Likelihood of the training data (total data likelihood) for given $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$:
  $p(D_i \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \prod_{n=1}^{N_i} p(\mathbf{x}_n \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$
  $\mathcal{L}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \ln p(D_i \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \sum_{n=1}^{N_i} \ln p(\mathbf{x}_n \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$
• Choose the parameters for which the total data likelihood (log likelihood) is maximum:
  $\boldsymbol{\mu}_i^{ML}, \boldsymbol{\Sigma}_i^{ML} = \arg\max_{\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i} \mathcal{L}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$


Illustration of ML Method:
Training Set: Adult-Child
• Number of training examples (N) = 20
• Dimension of a training example = 2
• Class label attribute is 3rd dimension
• Class:
– Child (0)
– Adult (1)

  [Scatter plot of the 20 training examples: Weight in kg vs Height in cm]

Illustration of ML Method: Child class


• Number of training examples (N) = 20
• Dimension of a training example = 2
• Sample mean: [103.6 30.66]
• Sample covariance matrix:

109.3778 61.3500 
 61.3500 43.5415 

  [Scatter plot of the Child-class training examples: Weight in kg vs Height in cm]


Illustration of ML Method: Child class


• Covariance matrix value is fixed at :

109.3778 61.3500 
 61.3500 43.5415 

• Search the values for mean vector


µ=[μ1, μ2]T that maximizes the total
data likelihood
• Range of values for mean vectors to
search:
– 1000 equally sampled values from 53.6
to 153.6 for μ1
– 1000 equally sampled values from
-20.66 to 80.66 for μ2
• Compute the likelihood value for each of the 1,000,000
(1000 x 1000) values of the mean vector

Illustration of ML Method: Child class


• A maximum value for the likelihood is
obtained for the value
[103.65 30.71]
• This value is close to sample mean
vector: [103.6 30.66]
  [Surface plot of the total data likelihood p(Di | µi, Σi) as a function of the mean components µ1 and µ2]


ML Method for Parameter Estimation of Multivariate Gaussian Distribution
• Parameters of the Gaussian distribution of class Ci: $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$
• Likelihood for a single example $\mathbf{x}_n$:
  $p(\mathbf{x}_n \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_i|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}_n - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_i)\right)$
• Log likelihood for the total training data of class Ci, $D_i = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{N_i}\}$:
  $\mathcal{L}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \ln p(D_i \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \ln \prod_{n=1}^{N_i} p(\mathbf{x}_n \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \sum_{n=1}^{N_i} \ln p(\mathbf{x}_n \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$
  $= \sum_{n=1}^{N_i} \left[ -\frac{1}{2} \ln |\boldsymbol{\Sigma}_i| - \frac{d}{2} \ln 2\pi - \frac{1}{2}(\mathbf{x}_n - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_i) \right]$
• Setting the derivatives of $\mathcal{L}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ w.r.t. $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ to zero, we get:
  $\boldsymbol{\mu}_i^{ML} = \frac{1}{N_i} \sum_{n=1}^{N_i} \mathbf{x}_n \qquad \boldsymbol{\Sigma}_i^{ML} = \frac{1}{N_i} \sum_{n=1}^{N_i} (\mathbf{x}_n - \boldsymbol{\mu}_i^{ML})(\mathbf{x}_n - \boldsymbol{\mu}_i^{ML})^T$
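A sketch of these ML estimates for the data of one class; note the 1/N normalisation of the covariance required by the ML solution (np.cov would use 1/(N−1) by default).

import numpy as np

def ml_gaussian_estimates(X):
    # X holds the Ni training examples of one class, one row per example
    mu = X.mean(axis=0)                      # sample mean vector
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]       # sample covariance matrix with 1/N
    return mu, Sigma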

Bayes Classifier with Unimodal Gaussian Density – Training Process
• Let C1, C2, …, Ci, …, CM be the M classes
• Let D1, D2, …, Di, …, DM be the training data for M
classes
• Estimate the parameters
– θ1= [µ1 Σ1]T ,
– θ2= [µ2 Σ2]T,
– …,
– θi= [µi Σi]T,
– …,
– θM= [µM ΣM]T for each of the classes
• Number of parameters to be estimated for each class
is dependent on dimensionality of the data space d
– Number of parameters: d + (d(d+1))/2

Bayes Classifier with Unimodal Gaussian Density – Training Process
• Let C1, C2, …, Ci, …, CM be the M classes
• Let D1, D2, …, Di, …, DM be the training data for M
classes
• Compute sample mean vector and sample covariance
matrix from training data of class 1, θ1= [µ1 Σ1]T
• Compute sample mean vector and sample covariance
matrix from training data of class 2, θ2= [µ2 Σ2]T,
• …,
• Compute sample mean vector and sample covariance
matrix from training data of class M, θM= [µM ΣM]T


Bayes Classifier with Unimodal Gaussian Density: Classification
• For a test example x:
– The likelihood of x being generated from each of the classes, p(x | µi, Σi), is computed
– Assign the label of the class for which p(x | µi, Σi) is maximum
  [Diagram: the test example x is scored against each class model θi = [µi Σi] to obtain the likelihoods p(x | µ1, Σ1), …, p(x | µM, ΣM); the decision logic outputs the class label]
  Class label $= \arg\max_{i}\ p(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$
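A sketch of the whole train-and-classify procedure, reusing the ml_gaussian_estimates and gaussian_pdf sketches given earlier; equal class priors are assumed, since the decision rule shown here takes the argmax of the class likelihoods.

import numpy as np

def train_gaussian_bayes(X_train, y_train):
    # Estimate theta_i = [mu_i, Sigma_i] from the training data of each class
    classes = np.unique(y_train)
    params = {c: ml_gaussian_estimates(X_train[y_train == c]) for c in classes}
    return classes, params

def classify_gaussian_bayes(x, classes, params):
    # Assign the class whose Gaussian gives the maximum likelihood p(x | mu_i, Sigma_i)
    likelihoods = [gaussian_pdf(x, *params[c]) for c in classes]
    return classes[int(np.argmax(likelihoods))]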


Illustration of Bayes Classifier with Unimodal Gaussian Density: Adult(1)-Child(0) Classification
• Training Phase:
– Compute the sample mean vector and sample covariance matrix from the training data of class 1 (Child):
  µ1 = [103.6000 30.6600]
  Σ1 = [109.3778 61.3500; 61.3500 43.5415]
– Compute the sample mean vector and sample covariance matrix from the training data of class 2 (Adult):
  µ2 = [166.0000 67.1150]
  Σ2 = [110.6667 160.5278; 160.5278 255.4911]

Illustration of Bayes Classifier with Unimodal Gaussian Density: Adult(1)-Child(0) Classification
• Test phase: Classification
  Test example x: [Scatter plot: Weight in kg vs Height in cm with the test example marked]
• Class 1 (Child): µ1 = [103.6000 30.6600], Σ1 = [109.3778 61.3500; 61.3500 43.5415]
• Class 2 (Adult): µ2 = [166.0000 67.1150], Σ2 = [110.6667 160.5278; 160.5278 255.4911]
• Compute the likelihood of the test sample x with class 1 (Child): $p(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) = 3.5237 \times 10^{-8}$
• Compute the likelihood of the test sample x with class 2 (Adult): $p(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2) = 3.7177 \times 10^{-4}$
• Class label of x = Adult


Summary: Bayes Classifier with Unimodal Gaussian Density
• The relation between examples and class can be captured in a statistical model
– Bayes classifier
• Statistical model:
– Unimodal Gaussian density
  • Univariate
  • Multivariate
  [Surface plot of the unimodal Gaussian density p(x) over Height in cm and Weight in kg]

Summary: Bayes Classifier with Unimodal Gaussian Density
• The relation between examples and class can be captured in a statistical model
– Bayes classifier
• Statistical model:
– Unimodal Gaussian density
  • Univariate
  • Multivariate
  [Scatter plot of Weight in kg vs Height in cm with the two class means marked: [103.6 30.1] and [166.0 67.1]]
• The real-world data need not be unimodal
– The shape of the density can be arbitrary
– Bayes classifier?
  • Multimodal density function


Adult-Child Data
  [Two scatter plots of Weight in kg vs Height in cm; the marked points are [117.2 31.5] and [149.7 65.1]]

Multimodal Distribution: Adult-Child Data
• For a class whose data is considered to have multiple clusters, the probability distribution is multimodal
  [Two scatter plots of Weight in kg vs Height in cm with the cluster means [101.7 30.1], [129.9 32.6], [138.2 59.6] and [171.1 75.2] marked]


Multimodal Distribution: Adult-Child Data
• For a class whose data is considered to have multiple clusters, the probability distribution is multimodal
• Multimodal Gaussian: Child Data
– M1: Cluster 1 (mode 1)
– M2: Cluster 2 (mode 2)
  [Surface plot of the bimodal density p(x) of the Child data over Height in cm and Weight in kg, with modes M1 and M2]

Multimodal Gaussian Distribution: Gaussian Mixture Model
• Given: training data for a class Ci having Ni samples, $D_i = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n, \ldots, \mathbf{x}_{N_i}\}$, $\mathbf{x}_n \in \mathbb{R}^d$
• Gaussian mixture model (GMM): used to represent a multimodal distribution
• A GMM is a linear superposition of multiple (Q) Gaussian components:
  $p(\mathbf{x} \mid C_i) = \sum_{q=1}^{Q} w_q\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)$
– The mixture is the overall envelope of the component curves
  [Surface plot of the multimodal Gaussian density of the Child data over Height in cm and Weight in kg]


Gaussian Mixture Model (GMM)
• A GMM is a linear superposition of multiple Gaussians:
  $p(\mathbf{x} \mid C_i) = \sum_{q=1}^{Q} w_q\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)$
• For a d-dimensional feature vector representation of the data, the parameters of the GMM are
– Mixture coefficients $w_q$, q = 1, 2, …, Q
  • Mixture weight or strength of each cluster (or mixture or mode)
  • Property: $\sum_{q=1}^{Q} w_q = 1$
– d-dimensional mean vectors $\boldsymbol{\mu}_q$, q = 1, 2, …, Q
– d×d covariance matrices $\boldsymbol{\Sigma}_q$, q = 1, 2, …, Q
• Training process objective: to estimate the parameters of the GMM

Parameter Estimation of GMM: Incomplete Data Problem
• Given: training data for a class Ci having Ni samples, $D_i = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n, \ldots, \mathbf{x}_{N_i}\}$, $\mathbf{x}_n \in \mathbb{R}^d$
• Known: the training data is multimodal in nature
• Unknown: the identity of the cluster (or mixture) that generated each training data point
• Incomplete data problem:
– Only the data points are given, not their identity (i.e. which cluster each belongs to)
– Hidden (latent) information: identity of the data points to the clusters


Parameter Estimation of GMM: Incomplete Data Problem
• If the identity (latent information) were given, how would we estimate the parameters of the GMM?
• Apply the maximum likelihood method to estimate the parameters of each of the Q mixtures ($\boldsymbol{\mu}_q$ and $\boldsymbol{\Sigma}_q$)
• The mixture coefficient $w_q$ is computed as
  $w_q = \dfrac{N_{iq}}{N_i}$
  • Niq: number of data points in cluster q
  • Ni: number of data points in class Ci
• In practice, we do not have this information
• Goal of parameter estimation: to find the best possible values of the GMM parameters such that the total likelihood of the data is maximized
– Maximum likelihood method for training a GMM: Expectation-Maximization (EM) method

Expectation-Maximization (EM) for GMMs


• An elegant and powerful method for finding the
maximum likelihood solution for a model with latent
variables
• Given a Gaussian mixture model, the goal is to
maximize the likelihood function with respect to the
parameters
1. Initialize the means μq, covariances Σq and mixing
coefficients wq, and evaluate the initial value of the log
likelihood
2. E-step: Evaluate the responsibilities γq(x) using the
current parameter values


EM Method – Responsibility Term
• A quantity that plays an important role is the responsibility term, γq(x)
• It is given by
  $\gamma_q(\mathbf{x}) = \dfrac{w_q\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)}{\sum_{j=1}^{Q} w_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
• $w_q$: mixture coefficient or prior probability of component q
• $\gamma_q(\mathbf{x})$ gives the posterior probability of component q for the observation x
  [Illustration with four components: for a point xn close to component 1, γ1(xn) = 0.99, γ2(xn) = 0.01, γ3(xn) = 0.00, γ4(xn) = 0.00; for a point xm between components, γ1(xm) = 0.08, γ2(xm) = 0.42, γ3(xm) = 0.34, γ4(xm) = 0.16]
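A sketch of this E-step for a single observation, reusing the gaussian_pdf sketch from earlier; the lists of weights, means and covariances are the current GMM parameters.

import numpy as np

def responsibilities(x, weights, means, covariances):
    # gamma_q(x) = w_q N(x | mu_q, Sigma_q) / sum_j w_j N(x | mu_j, Sigma_j)
    numerators = np.array([w * gaussian_pdf(x, mu, Sigma)
                           for w, mu, Sigma in zip(weights, means, covariances)])
    return numerators / numerators.sum()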

Expectation-Maximization (EM) for GMMs
• Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters
1. Initialize the means μq, covariances Σq and mixing coefficients wq, and evaluate the initial value of the log likelihood
2. E-step: Evaluate the responsibilities γq(x) using the current parameter values
3. M-step: Re-estimate the parameters $\boldsymbol{\mu}_q^{new}$, $\boldsymbol{\Sigma}_q^{new}$ and $w_q^{new}$ using the current responsibilities
4. Evaluate the log likelihood and check for convergence of the log likelihood
  • If the convergence criterion is not satisfied, return to step 2


Expectation-Maximization (EM) for GMMs
• Convergence criterion: the difference between the log likelihoods of successive iterations falls below a threshold (e.g. 10⁻³)
  [Plot of the log likelihood $\mathcal{L}(\boldsymbol{\theta}_i) = \ln p(D_i \mid \boldsymbol{\theta}_i)$ over EM iterations 1–12]

Illustration of Parameter Estimation
  [Figure from C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006]


Bayes Classifier: Multimodal Data
• Let C1, C2, …, Ci, …, CM be the M classes
• Given: a test example x
• Bayes decision rule:
  $P(C_i \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_i)\, P(C_i)}{p(\mathbf{x})}$
  (posterior probability of a class = likelihood × prior / evidence)
  $p(\mathbf{x} \mid C_i) = \sum_{q=1}^{Q} w_q\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)$
  Class label for x $= \arg\max_{i} P(C_i \mid \mathbf{x})$

Bayes Classifier with Multimodal Gaussian Density (GMM) – Training Process
• Let C1, C2, …, Ci, …, CM be the M classes
• Let D1, D2, …, Di, …, DM be the training data for the M classes
• Build a GMM (λ) for each of the classes: GMM for class 1, λ1; GMM for class 2, λ2; …; GMM for class M, λM
  GMM for class i: $\lambda_i = \{w_q, \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q\}_{q=1}^{Q}$


Bayes Classifier with Multimodal Gaussian Density (GMM) – Classification
  [Diagram: the test example x is scored against each class GMM λ1, λ2, …, λM to obtain the likelihoods p(x | λ1), p(x | λ2), …, p(x | λM); the decision logic outputs the class label]
  Class label $= \arg\max_{i}\ p(\mathbf{x} \mid \lambda_i)$

Determining Q, Number of Gaussian Components
• This is determined experimentally
• Starting with Q = 1, the test set is used to estimate the accuracy of the Bayes classifier
• This process is repeated, each time incrementing Q to allow for more Gaussian components
• The GMM with the number of components Q that gives the maximum accuracy may be selected (see the sketch below)
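A sketch of a per-class GMM Bayes classifier together with the experimental search over Q, using scikit-learn's GaussianMixture (whose fit method runs EM internally) as a stand-in for the EM procedure described in these slides; the candidate range of Q values is an assumption.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_classifier(X_train, y_train, Q):
    # Fit one GMM with Q components to the training data of each class
    classes = np.unique(y_train)
    gmms = {c: GaussianMixture(n_components=Q, covariance_type='full',
                               random_state=0).fit(X_train[y_train == c])
            for c in classes}
    return classes, gmms

def predict_gmm_classifier(X, classes, gmms):
    # Score every example under every class GMM and take the argmax of the log-likelihood
    log_liks = np.column_stack([gmms[c].score_samples(X) for c in classes])
    return classes[np.argmax(log_liks, axis=1)]

def choose_Q(X_train, y_train, X_test, y_test, Q_values=range(1, 6)):
    # Keep the number of components that gives the maximum test-set accuracy
    best_Q, best_acc = None, -1.0
    for Q in Q_values:
        classes, gmms = train_gmm_classifier(X_train, y_train, Q)
        acc = np.mean(predict_gmm_classifier(X_test, classes, gmms) == y_test)
        if acc > best_acc:
            best_Q, best_acc = Q, acc
    return best_Q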


Bayes Classifier with Gaussian Mixture Models – Summary
• The multimodal probability distribution of each class is represented by a Gaussian mixture model
• A GMM is a powerful way of modeling data
• Using a GMM, data with an arbitrarily shaped distribution can be modeled
• In a GMM, the number of parameters to be estimated for each class depends on:
– the dimensionality of the data space d
– the number of Gaussian mixtures Q
  Number of parameters: Q·d + Q·(d(d+1))/2 + Q
• For large values of d and Q, the number of examples required to estimate the parameters properly will be large
• When the estimated class-conditional densities are the same as the true densities, the Bayes classifier gives the minimum classification error

Text Books
1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann Publishers, 2011.
2. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 2009.
3. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
