GNR401 Dr. A. Bhattacharya
INTRODUCTION TO IMAGE
CLASSIFICATION
Lecture 7
Concept of Image Classification
Image classification: assigning pixels in the image to categories or classes of interest
Examples: built-up areas, water bodies, green vegetation, bare soil, rocky areas, cloud, shadow
Concept of Image Classification
Image classification is a process of mapping numbers to
symbols
f(x): x → D;   x ∈ R^n,   D = {c1, c2, …, cL}
Number of bands = n;
Number of classes = L
f(.) is a function assigning a pixel vector x to
a single class in the set of classes D
Concept of Image Classification
In order to classify a set of data into different classes
or categories, the relationship between the data and
the classes into which they are classified must be well
understood
To achieve this with a computer, the computer must be trained
Training is key to the success of classification
Classification techniques were originally developed out of research in the field of Pattern Recognition
Concept of Image Classification
Computer classification of remotely sensed images
involves the process of the computer program
learning the relationship between the data and the
information classes
Important aspects of accurate classification
Learning techniques
Feature sets
Types of Learning
Supervised Learning
Learning process designed to form a mapping from one set
of variables (data) to another set of variables (information
classes)
A teacher is involved in the learning process
Unsupervised learning
Learning happens without a teacher
Exploration of the data space to discover the scientific laws underlying the data distribution
Features
Features are attributes of the data elements based
on which the elements are assigned to various
classes.
E.g., in satellite remote sensing, the features are
measurements made by sensors in different
wavelengths of the electromagnetic spectrum
e.g., visible / infrared / microwave bands, as well as texture features
Features
In medical diagnosis, the features may be the
temperature, blood pressure, lipid profile, blood
sugar, and a variety of other data collected through
pathological investigations
The features may be qualitative (high, moderate, low)
or quantitative.
The classification may be the presence of heart disease (positive) or its absence (negative)
Supervised Classification
The classifier has the advantage of the analyst's domain knowledge, which can be used to guide it in learning the relationship between the data and the classes.
The number of classes and the prototype pixels for each class can be identified using this prior knowledge
Partially Supervised Classification
When prior knowledge is available
for some classes and not for others, or
for some dates and not for others in a multitemporal dataset,
a combination of supervised and unsupervised methods can be employed for partially supervised classification of images
Unsupervised Classification
When access to domain knowledge or the
experience of an analyst is missing, the data can
still be analyzed by numerical exploration, whereby
the data are grouped into subsets or clusters based
on statistical similarity
Supervised vs. Unsupervised Classifiers
Supervised classification generally performs better
than unsupervised classification IF good quality
training data is available
Unsupervised classifiers are used to carry out
preliminary analysis of data prior to supervised
classification
Role of Image Classifier
The image classifier performs the role of a discriminant: it discriminates one class against the others
Discriminant value highest for one class, lower for other
classes (multiclass)
Discriminant value positive for one class, negative for
another class (two class)
Discriminant Function
g(ck, x) is the discriminant function relating feature vector x and class ck, k = 1, …, L
Denote g(ck, x) as gk(x) for simplicity
Multiclass Case
gk(x) > gl(x) for all l = 1, …, L, l ≠ k  ⇒  x ∈ ck
Two Class Case
g(x) > 0 ⇒ x ∈ c1;   g(x) < 0 ⇒ x ∈ c2
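Illustration (not from the original slides): a minimal NumPy sketch of the multiclass decision rule, where a pixel goes to the class with the largest discriminant value. The function names and the toy discriminants (negative distance to a class mean) are hypothetical choices.

```python
import numpy as np

def classify(x, discriminants):
    """Assign feature vector x to the class whose discriminant g_k(x) is largest."""
    scores = np.array([g(x) for g in discriminants])
    return int(np.argmax(scores))          # index of the winning class

# Toy example: two classes described only by their mean vectors;
# the negative Euclidean distance to the mean acts as a simple discriminant g_k(x).
means = [np.array([10.0, 20.0]), np.array([40.0, 35.0])]
discriminants = [lambda x, m=m: -np.linalg.norm(x - m) for m in means]

print(classify(np.array([12.0, 22.0]), discriminants))   # -> 0 (closer to the first mean)
```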
Example of Image Classification
Multiple Class Case
Recognition of characters or digits from bitmaps of
scanned text
Two Class Case
Distinguishing between text and graphics in a scanned document
Prototype / Training Data
Using domain knowledge (maps of the study area,
experienced interpreter), small sets of sample pixels
are selected for each class.
The size and spatial distribution of the samples are
important for proper representation of the total
pixel population in terms of the samples
Statistical Characterization of Classes
Each class has a conditional probability density
function (pdf) denoted by p(x | ck)
The distribution of feature vectors in each class ck is
indicated by p(x | ck)
We estimate P(ck | x), the conditional probability of class ck given that the pixel's feature vector is x
Supervised Classification Algorithms
There are many techniques for assigning pixels to
informational classes, e.g.:
Minimum Distance from Mean (MDM)
Parallelepiped
Maximum Likelihood (ML)
Support Vector Machines (SVM)
Artificial Neural Networks (ANN)
Supervised Classification Principles
The classifier learns the characteristics of different thematic classes: forest, marshy vegetation, agricultural land, turbid water, clear water, open soils, man-made objects, desert, etc.
This happens by means of analyzing the statistics of
small sets of pixels in each class that are reliably
selected by a human analyst through experience or
with the help of a map of the area
Supervised Classification Principles
Typical characteristics of classes
Mean vector
Covariance matrix
Minimum and maximum gray levels within each band
Class-conditional probability density function p(x|Ci), where Ci is the ith class and x is the feature vector
Number of classes L into which the image is to be
classified should be specified by the user
Prototype Pixels for Different Classes
The prototype pixels are samples of the population
of pixels belonging to each class
The size and distribution of samples are formally
governed by the mathematical theory of sampling
There are several criteria for choosing the samples
belonging to different classes
Parallelepiped Classifier - Example of a Supervised Classifier
Assign ranges of values for each class in each band
Really a feature space classifier
Training data provide bounds for each feature for each
class
Results in bounding boxes for each class
A pixel is assigned to a class only if its feature vector
falls within the corresponding box
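A minimal sketch (not from the slides) of the parallelepiped rule in NumPy: per-class, per-band minimum/maximum bounds are taken from training data, and a pixel is assigned only if its feature vector falls inside some class's box. The names and the tie-breaking choice (first matching box) are illustrative assumptions.

```python
import numpy as np

def train_boxes(samples_per_class):
    """samples_per_class: list of (Ni, n_bands) arrays of training pixels, one per class.
    Returns per-class (min, max) bounds for every band."""
    return [(s.min(axis=0), s.max(axis=0)) for s in samples_per_class]

def classify_parallelepiped(x, boxes, unclassified=-1):
    """Assign x to the first class whose bounding box contains it; else leave it unclassified."""
    for k, (lo, hi) in enumerate(boxes):
        if np.all(x >= lo) and np.all(x <= hi):
            return k
    return unclassified

# Toy 2-band example with two classes
rng = np.random.default_rng(0)
train = [rng.normal(30, 3, (50, 2)), rng.normal(70, 3, (50, 2))]
boxes = train_boxes(train)
print(classify_parallelepiped(np.array([31.0, 29.0]), boxes))   # likely 0
print(classify_parallelepiped(np.array([50.0, 50.0]), boxes))   # likely -1 (outside both boxes)
```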
Advantages/Disadvantages of the Parallelepiped Classifier
Does NOT assign every pixel to a class. Only the
pixels that fall within ranges.
Fastest method computationally
Good for helping decide if you need additional
classes (if there are many unclassified pixels)
Problems arise when class ranges overlap; rules must be developed to deal with the overlap areas.
Minimum Distance Classifier
Simplest kind of supervised classification
The method:
Calculate the mean vector for each class
Calculate the statistical (Euclidean) distance from each pixel to each class mean vector
Assign each pixel to the class it is closest to
Minimum Distance Classifier
Algorithm
Estimate the class mean vector and covariance matrix from the training samples:
mi = (1/Ni) Σ_{Xj ∈ Ci} Xj ;   Si = E{ (X − mi)(X − mi)^T | X ∈ Ci }
Compute the distance between X and each mi
X ∈ Ci if d(X, mi) ≤ d(X, mj) for all j
Optionally, compute P(Ck | X) and leave X unclassified if max_k P(Ck | X) < Tmin
Minimum Distance Classifier
Normally classifies every pixel no matter how far it is
from a class mean (still picks closest class) unless the
Tmin condition is applied
The distance between X and mi can be computed in different ways: Euclidean, Mahalanobis, city block, etc.
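A minimal NumPy sketch (illustrative, not from the slides) of the minimum-distance-to-mean rule: compute each class mean from labelled training pixels and assign every pixel to the class with the closest mean in Euclidean distance.

```python
import numpy as np

def class_means(train_pixels, train_labels, n_classes):
    """Mean vector m_i of each class from labelled training pixels (N, n_bands)."""
    return np.stack([train_pixels[train_labels == i].mean(axis=0)
                     for i in range(n_classes)])

def classify_min_distance(pixels, means):
    """Assign each pixel (rows of `pixels`) to the class with the nearest mean."""
    # d: (n_pixels, n_classes) matrix of Euclidean distances to each class mean
    d = np.linalg.norm(pixels[:, None, :] - means[None, :, :], axis=2)
    return d.argmin(axis=1)

# Toy 2-band example with two classes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(20, 2, (30, 2)), rng.normal(60, 2, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
m = class_means(X, y, 2)
print(classify_min_distance(np.array([[22.0, 19.0], [58.0, 61.0]]), m))   # -> [0 1]
```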
Maximum Likelihood Classifier
Calculates the likelihood of a pixel being in
different classes conditional on the available
features, and assigns the pixel to the class with the
highest likelihood
Likelihood Calculation
The likelihood of a feature vector x to be in class Ci is
taken as the conditional probability P(Ci|x).
We need to compute P(Ci|x), that is the conditional
probability of class Ci given the pixel vector x.
It is not possible to directly estimate the conditional
probability of a class given the feature vector.
Instead, it is computed indirectly in terms of the
conditional probability of feature vector x given that
it belongs to class Ci.
Likelihood Calculation
P(Ci|x) is computed using Bayes' Theorem in terms of P(x|Ci):
P(Ci|x) = P(x|Ci) P(Ci) / P(x)
x is assigned to class Cj such that
P(Cj|x) = max_i P(Ci|x), i = 1, …, K, where K is the number of classes.
P(Ci) is the prior probability of occurrence of class i in
the image
P(x) is the multivariate probability density function of
feature x.
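A small sketch (illustrative numbers only, not from the slides) of how the posterior P(Ci|x) follows from Bayes' theorem once the class-conditional likelihoods P(x|Ci) and the priors P(Ci) are known; P(x) is simply the normalizing sum.

```python
import numpy as np

def posteriors(likelihoods, priors):
    """Bayes' theorem: P(Ci|x) = P(x|Ci) P(Ci) / P(x), with P(x) = sum_i P(x|Ci) P(Ci)."""
    joint = np.asarray(likelihoods) * np.asarray(priors)   # P(x|Ci) P(Ci)
    return joint / joint.sum()                             # divide by P(x)

# Hypothetical values for a 3-class problem
p_x_given_c = [0.02, 0.10, 0.01]     # P(x|Ci) evaluated at one pixel's feature vector
p_c = [0.5, 0.3, 0.2]                # prior class proportions
post = posteriors(p_x_given_c, p_c)
print(post, "-> assign to class", post.argmax())
```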
Likelihood Calculation
P(x) can be ignored in the computation of max_i {P(Ci|x)}, since it does not depend on the class
If P(x|Cj) is not assumed to have a known distribution, then its
estimation is said to be non-parametric estimation.
If P(x|Cj) is assumed to have a known distribution, then its
estimation is said to be parametric.
The training data x, with the class already given, can be used to estimate the class-conditional density function P(x|Ci)
Likelihood Calculation
P(x|Ci) is assumed to be multivariate Gaussian
distributed in practical parametric classifiers.
Gaussian distribution is mathematically simple to
handle.
Each class-conditional density function P(x|Ci) is represented by its mean vector mi and covariance matrix Si:

P(x | Ci) = 1 / ( (2π)^(n/2) |Si|^(1/2) ) · exp( −½ (x − mi)^T Si⁻¹ (x − mi) )

where n is the number of bands (the dimensionality of x).
Assumption of Gaussian Distribution
Each class is assumed to be multivariate normally distributed
That implies each class has a mean vector mi at which the likelihood of occurrence is highest
The likelihood function decreases exponentially as the feature vector x deviates from the mean vector mi
The rate of decrease is governed by the class variance: the smaller the variance, the steeper the decrease; the larger the variance, the slower the decrease
Likelihood Calculation
Taking the logarithm of the Gaussian distribution, we get the discriminant

gi(x) = −½ (x − mi)^T Si⁻¹ (x − mi) − (n/2) ln 2π − ½ ln |Si| + ln P(ωi)

We assume that the covariance matrices for each class are different.
The term (x − mi)^T Si⁻¹ (x − mi) is known as the Mahalanobis distance between x and mi (after Prof. P.C. Mahalanobis, the famous Indian statistician and founder of the Indian Statistical Institute)
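A minimal Gaussian maximum-likelihood sketch (an illustration under the multivariate normal assumption, not the lecture's own code): class means and covariances are estimated from training pixels, and each pixel is assigned to the class with the largest log-discriminant gi(x). The constant −(n/2) ln 2π is the same for every class and is therefore omitted.

```python
import numpy as np

def fit_gaussian_classes(X, y, n_classes):
    """Estimate the mean vector and covariance matrix of each class from training data."""
    stats = []
    for i in range(n_classes):
        Xi = X[y == i]
        stats.append((Xi.mean(axis=0), np.cov(Xi, rowvar=False)))
    return stats

def ml_classify(pixels, stats, priors):
    """Assign each pixel to the class maximizing
    g_i(x) = -1/2 (x - m_i)^T S_i^-1 (x - m_i) - 1/2 ln|S_i| + ln P(i)."""
    scores = []
    for (m, S), p in zip(stats, priors):
        Sinv = np.linalg.inv(S)
        _, logdet = np.linalg.slogdet(S)
        d = pixels - m
        maha = np.einsum('ij,jk,ik->i', d, Sinv, d)   # Mahalanobis term per pixel
        scores.append(-0.5 * maha - 0.5 * logdet + np.log(p))
    return np.argmax(np.stack(scores, axis=1), axis=1)

# Toy example with two 2-band classes
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(20, 2, (100, 2)), rng.normal(35, 5, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
stats = fit_gaussian_classes(X, y, 2)
print(ml_classify(np.array([[21.0, 19.0], [40.0, 33.0]]), stats, priors=[0.5, 0.5]))  # -> [0 1]
```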
Interpretation of Mahalanobis Distance
The Mahalanobis distance between two multivariate quantities x and y is

dM(x, y) = √( (x − y)^T S⁻¹ (x − y) )

If the covariance matrix is k·I (where I is the identity matrix), then the Mahalanobis distance reduces to a scaled version of the Euclidean distance.
The Mahalanobis distance scales the Euclidean distance according to the extent of variation within the data, given by the covariance matrix S
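A short sketch (hypothetical helper, not from the slides) of the Mahalanobis distance, also checking that with S = k·I it collapses to the Euclidean distance divided by √k.

```python
import numpy as np

def mahalanobis(x, y, S):
    """d_M(x, y) = sqrt((x - y)^T S^-1 (x - y)) for covariance matrix S."""
    d = x - y
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))

x, y = np.array([3.0, 1.0]), np.array([0.0, -1.0])
k = 4.0
print(mahalanobis(x, y, k * np.eye(2)))        # scaled Euclidean distance
print(np.linalg.norm(x - y) / np.sqrt(k))      # same value
```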
Advantages/Disadvantages of the Maximum Likelihood Classifier
Normally classifies every pixel no matter how far it is from a
class mean
Slowest method: more computationally intensive
Normally distributed data assumption is not always true, in
which case the results are not likely to be very accurate
Thresholding condition can be introduced into the classification
rule to separately handle ambiguous feature vectors
Nearest-Neighbor Classifier
Non-parametric in nature
The algorithm is:
Find the distance of the given feature vector x from ALL the training samples
x is assigned to the class of the nearest training
sample (in the feature space)
This method does not depend on the class statistics
like mean and covariance.
K-NN Classifier
K-nearest neighbour classifier
Simple in concept, time consuming to implement
For a pixel to be classified, find the K closest training
samples (in terms of feature vector similarity or
smallest feature vector distance)
Among the K samples, find the most frequently
occurring class Cm
Assign the pixel to class Cm
K-NN Classifier
Let ki be the number of samples belonging to class Ci (out of the K closest samples), i = 1, 2, …, L (the number of classes)
Note that Σi ki = K
The discriminant for the K-NN classifier is
gi(x) = ki
The classification rule is
Assign x to class Cm if gm(x) > gi(x) for all i, i ≠ m
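A compact brute-force K-NN sketch (illustrative names, not from the slides) matching the rule above: find the K nearest training samples and take a majority vote; setting K = 1 recovers the nearest-neighbour rule described earlier.

```python
import numpy as np

def knn_classify(x, train_X, train_y, K=5):
    """Assign x to the class most frequent among its K nearest training samples."""
    dist = np.linalg.norm(train_X - x, axis=1)      # distance to ALL training samples
    nearest = np.argsort(dist)[:K]                  # indices of the K closest samples
    counts = np.bincount(train_y[nearest])          # k_i for each class among the K
    return counts.argmax()                          # class C_m with the largest k_m

# Toy 2-band example
rng = np.random.default_rng(3)
train_X = np.vstack([rng.normal(10, 2, (40, 2)), rng.normal(30, 2, (40, 2))])
train_y = np.array([0] * 40 + [1] * 40)
print(knn_classify(np.array([11.0, 12.0]), train_X, train_y, K=7))   # -> 0
```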
K-NN Classifier
It is possible to find more than one class whose training samples are closest to the feature vector of pixel x. Therefore the discriminant function is refined further as

gi(x) = [ Σ_{j=1}^{ki} 1 / d(x, xj^(i)) ] / [ Σ_{l=1}^{L} Σ_{j=1}^{kl} 1 / d(x, xj^(l)) ]

where xj^(i) denotes the j-th of the ki nearest samples belonging to class Ci. The distances of the nearest neighbours to the feature vector of the pixel to be classified are thus taken into account
K-NN Classifier
If the classes are in different proportions in the image, then the prior probabilities can be taken into account:

gi(x) = ki p(ωi) / Σ_{l=1}^{L} kl p(ωl)

For each pixel to be classified, the feature-space distances to all training pixels must be computed before the decision is made. This makes the procedure extremely computation intensive, so it is not used when the dimensionality (number of bands) of the feature space is large, e.g., with hyperspectral data.
Spectral Angle Mapper
Given a large-dimensional data set, computing the covariance matrix, its inverse, and the distance (X − m)^T S⁻¹ (X − m) for each pixel is highly time consuming; moreover, if the covariance matrix is close to singular, its inverse can be unstable, leading to erroneous results
In such cases, alternate methods can be applied, such as the Spectral Angle Mapper
S.A.M. Principle
If each class is represented by a vector vi, then the
angle between the class vector and the pixel feature
vector x is given by
cos θ = (vi · x) / ( |vi| |x| )
For small values of θ, the value of cos θ is large
The likelihood of x belonging to the different classes can be ranked according to the value of cos θ.
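A minimal Spectral Angle Mapper sketch (illustrative names): compute cos θ between the pixel spectrum and each class reference vector, and pick the class with the largest value, i.e. the smallest spectral angle.

```python
import numpy as np

def sam_classify(x, class_vectors):
    """Assign x to the class whose reference vector makes the smallest spectral angle with it."""
    cosines = [np.dot(v, x) / (np.linalg.norm(v) * np.linalg.norm(x))
               for v in class_vectors]
    return int(np.argmax(cosines))          # largest cos(theta) = smallest angle

# Toy 3-band reference spectra for two classes
refs = [np.array([0.9, 0.5, 0.1]), np.array([0.1, 0.4, 0.9])]
print(sam_classify(np.array([0.8, 0.45, 0.15]), refs))   # -> 0
```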
S.A.M. Advantage
The value of cos θ would not be greatly affected by minor changes in vi or x.
The computation is simpler compared to the
Mahalanobis distance computation involved in ML
method
Feature Data
Given a set of training samples, a set of test samples, and N-band data, the given image is classified after training the classifier using the training data
The classification result is verified using test data
If some bands are discarded, how is the result
affected?
Feature Selection/Reduction
Feature selection: selecting a subset of features based on some criteria
Feature reduction: discarding some features after performing some transformation on the input data, e.g., the Principal Component Transform (PCT)
A reduced number of bands also reduces the need for large training data
Mixed Pixels
When the spatial resolution is coarse, one pixel may contain parts of many land-use classes,
e.g., tree, bare soil, grass
Classification then becomes the estimation of the proportions of the different classes within a pixel
This problem is called mixture modeling
Mixed Pixels
Important when spectra of different classes are
compared as in hyperspectral remote sensing
Reference spectra are drawn from single classes
Most pixel spectra are mixtures of more than one
pure class
Mixture modeling estimates the relative proportion of
each class assuming a particular model for mixing
Motivation to Consider Feature Selection
For image classification, each element should have useful features
that discriminate between elements of different classes
The discrimination power of the features is therefore important
Class separability in feature space
Two generic cases
[Figure: two feature-space scatter plots (Band 1 vs. Band 2), one showing well separated classes and one showing poorly separated classes]
Low separability (with one feature)
[Figure: overlapping class-conditional densities p(X|Ci) plotted against the single feature, with close means mi and mj]
Each class is represented by a normal distribution, but with close means and high variances relative to the separation between the means
Important Features
Shape Features
Spectral Features
Texture Features
Transform Features
Unsupervised Classification
When access to domain knowledge or the
experience of an analyst is missing, the data can
still be analyzed by numerical exploration, whereby
the data are grouped into subsets or clusters based
on statistical similarity
Unsupervised Classification
Unsupervised classification is also known as learning without a teacher
In the absence of reliable training data it is possible to
understand the structure of the data using statistical methods
such as clustering algorithms
Popular clustering algorithms are k-means and ISODATA.
Clustering Algorithms
All feature vectors are points in an L-dimensional
space where L is the number of bands (The letter K is
reserved for the number of clusters!)
It is required to find which sets of feature vectors tend
to form clusters
Members of a cluster are more similar to each other than to members of another cluster. In other words, clusters possess low intra-cluster variability and high inter-cluster variability
K-Means
Iterative algorithm
The number of clusters K is specified by the user
Most popular clustering algorithm
Randomly initialize the K cluster mean vectors
Assign each pixel to one of the K clusters based on minimum feature distance
After all pixels are assigned to the K clusters, each cluster mean is recomputed
Iterate till the cluster mean vectors stabilize
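A compact K-Means sketch (illustrative: random initialization from the data, fixed iteration cap) following the loop above: assign pixels to the nearest current mean, recompute the means, and stop when they no longer change.

```python
import numpy as np

def kmeans(pixels, K, max_iter=100, seed=0):
    """Cluster pixel feature vectors (N, n_bands) into K clusters."""
    rng = np.random.default_rng(seed)
    means = pixels[rng.choice(len(pixels), K, replace=False)]   # random initial means
    for _ in range(max_iter):
        # assign each pixel to the nearest cluster mean
        d = np.linalg.norm(pixels[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each cluster mean (keep the old mean if a cluster is empty)
        new_means = np.stack([pixels[labels == k].mean(axis=0) if np.any(labels == k)
                              else means[k] for k in range(K)])
        if np.allclose(new_means, means):        # means have stabilized
            break
        means = new_means
    return labels, means

# Toy example: two obvious groups of 2-band "pixels"
rng = np.random.default_rng(4)
data = np.vstack([rng.normal(10, 1, (50, 2)), rng.normal(40, 1, (50, 2))])
labels, centres = kmeans(data, K=2)
print(centres.round(1))
```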
ISODATA ALGORITHM
Iterative Self-Organizing Data Analysis Technique (the last 'A' was added to make the acronym sound better)
Developed in biology in the 1960s by Ball and Hall
See Tou and Gonzalez's classic Pattern Recognition Principles for an excellent exposition of clustering algorithms
User-Specified Parameters for ISODATA
Generalization of the K-Means algorithm
It involves many user-specified parameters:
Minimum size of cluster
Maximum size of cluster
Maximum intra-cluster variance
Minimum separation between pairs of clusters
Maximum number of clusters
Minimum number of clusters
Maximum number of iterations