
BIM488 Introduction to Pattern Recognition

Introduction
Outline

• Pattern Recognition
• An Example
• Pattern Recognition Systems
• The Design Cycle

BIM488 Introduction to Pattern Recognition Introduction 2


Pattern Recognition

• A pattern, from the French patron, is a type of theme of recurring events or objects, sometimes referred to as elements of a set of objects.
• An arrangement of objects that has a mathematical, geometric, statistical, etc. relationship.
• A pattern is an abstract object, such as a set of measurements describing a physical object.

BIM488 Introduction to Pattern Recognition Introduction 3


Pattern Recognition

• Pattern recognition is the scientific discipline whose goal is the classification of objects into a number of categories or classes.

Class “A”

Class “B”

BIM488 Introduction to Pattern Recognition Introduction 4


Pattern Recognition

• Depending on the application, these objects can be
  – images
  – signal waveforms
  – text
  – or any other type of measurements
  that need to be classified.
• We will refer to these objects using the generic term patterns.
• The task of assigning an unknown pattern to the correct class is known as classification.

BIM488 Introduction to Pattern Recognition Introduction 5


Pattern Recognition

Pattern Class

• A collection of “similar” (not necessarily identical) objects


– Intra-class variability

The letter “T” in different typefaces

– Inter-class variability

Characters that look similar

BIM488 Introduction to Pattern Recognition Introduction 6


Pattern Recognition

• Pattern recognition systems are in many cases trained from labeled "training" data (supervised learning).
• However, when no labeled data are available, other algorithms can be used to discover previously unknown patterns (unsupervised learning).

BIM488 Introduction to Pattern Recognition Introduction 7


Pattern Recognition

• Pattern recognition applications

BIM488 Introduction to Pattern Recognition Introduction 8


Pattern Recognition
Main PR Methods:

• Statistical pattern recognition (we will focus on this)


– Focuses on the statistical properties of the patterns (i.e., probability
densities).
• Structural pattern recognition
– Describe complicated objects in terms of simple primitives and
structural relationships.
• Syntactic pattern recognition
– Decisions consist of logical rules or grammars.
• Template matching
– The pattern to be recognized is matched against a stored template
while taking into account all allowable pose (translation and rotation)
and scale changes.

BIM488 Introduction to Pattern Recognition Introduction 9


Pattern Recognition

Figure: Pattern Recognition shown as a subfield of Machine Learning, which in turn is shown as a subfield of Artificial Intelligence.

BIM488 Introduction to Pattern Recognition Introduction 10


Pattern Recognition
• Intelligence: The word intelligence derives from the Latin nouns
intelligentia or intellēctus, which in turn stem from the verb
intelligere, to ‘comprehend’ or ‘perceive’. Intelligence is the
capacity for logic, understanding, self-awareness, learning,
emotional knowledge, reasoning, planning, creativity, critical
thinking, and problem-solving.
• Artificial Intelligence (AI): is the intelligence demonstrated by
computers/machines, unlike the natural intelligence displayed by
human beings. It is the broad discipline of creating intelligent
machines.
• Machine Learning (ML): refers to the systems that can learn
from experience.
• Pattern Recognition (PR): is the classification of objects into a
number of categories or classes.

BIM488 Introduction to Pattern Recognition Introduction 11


An Example PR problem

• Problem: Sorting incoming fish on a conveyor belt according to species.
• Assume that we have only two kinds of fish:
  – sea bass
  – salmon

BIM488 Introduction to Pattern Recognition Introduction 12


An Example: Decision Process

• What kind of information can distinguish one species from the other?
  – Length
  – Width
  – Weight
  – Number and shape of fins
  – Tail shape
  – etc.
• These pieces of information are possible features.
• A feature is a lower-dimensional, discriminative piece of information extracted from the patterns.

BIM488 Introduction to Pattern Recognition Introduction 13


An Example: Selecting Features

• Assume a fisherman told us that a sea bass is generally longer than a salmon.
• We can use length as a feature and decide between sea bass and salmon according to a threshold on length.
• But how can we choose this threshold?

BIM488 Introduction to Pattern Recognition Introduction 14


An Example: Selecting Features

Figure: Histograms of the length feature for two types of fish in training
samples.
• How can we choose the threshold l* to make a reliable decision?

BIM488 Introduction to Pattern Recognition Introduction 15
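A minimal MATLAB sketch of one way to pick the threshold l*: with made-up training lengths for the two species (all numbers and variable names below are hypothetical), sweep candidate thresholds and keep the one with the fewest training errors.

bassLen   = [38 41 35 44 39 47 36 42];   % hypothetical sea bass lengths (cm)
salmonLen = [28 33 30 25 31 34 27 29];   % hypothetical salmon lengths (cm)
lengths = [bassLen salmonLen];
labels  = [ones(1,numel(bassLen)) 2*ones(1,numel(salmonLen))];  % 1 = sea bass, 2 = salmon

% Rule: decide "sea bass" if length > l, otherwise "salmon".
% Sweep candidate thresholds and count training errors for each.
candidates = min(lengths):0.5:max(lengths);
errs = zeros(size(candidates));
for i = 1:numel(candidates)
    pred = 1 + (lengths <= candidates(i));   % 1 above the threshold, 2 below or equal
    errs(i) = sum(pred ~= labels);
end
[minErr, idx] = min(errs);
lStar = candidates(idx);
fprintf('Chosen threshold l* = %.1f cm (%d training errors)\n', lStar, minErr);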


An Example: Selecting Features

• In statistics, a histogram is a graphical representation showing a visual impression of the distribution of data.
• It is an estimate of the probability distribution of a variable.

BIM488 Introduction to Pattern Recognition Introduction 16


An Example: Selecting Features

• Even though sea bass are longer than salmon on average, there are many examples of fish for which this observation does not hold.
• Try another feature: average lightness of the fish scales.

BIM488 Introduction to Pattern Recognition Introduction 17


An Example: Selecting Features

Figure: Histograms of the lightness feature for two types of fish in training
samples.

• How can we choose the threshold x* to make a reliable decision?

BIM488 Introduction to Pattern Recognition Introduction 18


An Example: Cost of Error

• We should also consider the costs of the different errors we make in our decisions.
• For example, suppose the fish packing company knows that:
  – Customers who buy salmon will be angry if they see cheap sea bass in their cans.
  – Customers who buy sea bass will not be unhappy if they occasionally see some expensive salmon in their cans.
• How does this knowledge affect our decision?

BIM488 Introduction to Pattern Recognition Introduction 19


An Example: Multiple Features

• Assume we also observed that sea bass are typically wider than salmon.
• We can use two features in our decision:
  – lightness: x1
  – width: x2
• Each fish image is now represented as a point (feature vector) in a two-dimensional feature space:

  x = [x1 x2]

BIM488 Introduction to Pattern Recognition Introduction 20


An Example: Multiple Features

• Figure: Scatter plot of lightness and width features for training samples.
We can draw a decision boundary to divide the feature space into two
regions. Does it look better than using only lightness?

BIM488 Introduction to Pattern Recognition Introduction 21
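A small MATLAB sketch of this two-feature view, with made-up lightness/width measurements and one hand-picked straight-line boundary (the data, the line, and all names are hypothetical):

bass   = [4.2 6.1; 5.0 6.5; 4.8 5.9; 5.5 6.8; 4.6 6.3];   % [lightness width] per sea bass
salmon = [2.1 4.0; 2.8 4.4; 3.0 3.9; 2.5 4.7; 3.3 4.2];   % [lightness width] per salmon

figure; hold on
plot(bass(:,1),   bass(:,2),   'bo')   % sea bass
plot(salmon(:,1), salmon(:,2), 'r+')   % salmon
xlabel('lightness x_1'); ylabel('width x_2')

% One possible linear decision boundary: decide sea bass if x1 + 0.5*x2 > 6.5
x1 = 2:0.1:5.5;
plot(x1, (6.5 - x1)/0.5, 'k--')        % the line x1 + 0.5*x2 = 6.5
legend('sea bass', 'salmon', 'decision boundary')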


An Example: Multiple Features

• Does adding more features always improve the results?


– Avoid unreliable features.
– Be careful about correlations with existing features.
– Be careful about measurement costs.
– Be careful about noise in the measurements.

• Is there some curse for working in very high dimensions?

BIM488 Introduction to Pattern Recognition Introduction 22


An Example: Decision Boundaries
• Can we do better with another decision rule?
• More complex models result in more complex boundaries.

• Figure: We may distinguish training samples perfectly, but how can we predict how well we can generalize to unknown samples?
BIM488 Introduction to Pattern Recognition Introduction 23
An Example: Generalization

• The ability of the classifier to produce correct results on novel patterns.
• How can we improve generalization performance?
  – More training examples (i.e., better pdf estimates).
  – Simpler models (i.e., simpler classification boundaries) usually yield better performance.

Simplify the decision boundary!

BIM488 Introduction to Pattern Recognition Introduction 24


Pattern Recognition Systems

BIM488 Introduction to Pattern Recognition Introduction 25


Pattern Recognition Systems

• Data acquisition and sensing:


– Measurements of physical variables
– Important issues: bandwidth, resolution, sensitivity, distortion, SNR,
latency, etc.
• Pre-processing:
– Removal of noise in data
– Isolation of patterns of interest from the background
• Feature extraction:
– Finding a new representation in terms of features

BIM488 Introduction to Pattern Recognition Introduction 26


Pattern Recognition Systems

• Model learning and estimation:


– Learning a mapping between features and pattern groups and
categories
• Classification:
– Using features and learned models to assign a pattern to a category
• Post-processing:
– Evaluation of confidence in decisions
– Exploitation of context to improve performance
– Combination of experts

BIM488 Introduction to Pattern Recognition Introduction 27


The Design Cycle

BIM488 Introduction to Pattern Recognition Introduction 28


The Design Cycle: Overview of Important Issues

• Noise
• Data Collection / Feature Extraction
• Pattern Representation / Invariance/Missing Features
• Model Selection / Overfitting
• Prior Knowledge / Context
• Classifier Combination
• Costs and Risks
• Computational Complexity

BIM488 Introduction to Pattern Recognition Introduction 29


The Design Cycle: Issue: Noise

• Various types of noise exist (e.g., shadows, the conveyor belt might shake, etc.).
• Noise can reduce the reliability of the measured feature values.
• Knowledge of the noise process can help to improve performance.

BIM488 Introduction to Pattern Recognition Introduction 30


The Design Cycle: Issue: Data Collection

• How do we know that we have collected an adequately large and representative set of examples for training/testing the system?

BIM488 Introduction to Pattern Recognition Introduction 31


The Design Cycle: Issue: Feature Extraction

• Feature extraction is a domain-specific problem that influences the classifier's performance.
• Which features are most promising?
• Are there ways to automatically learn which features are best?
• How many should we use?
• Choose features that are robust to noise.
• Favor features that lead to simpler decision regions.

BIM488 Introduction to Pattern Recognition Introduction 32


The Design Cycle: Issue: Pattern Representation

• Similar patterns should have similar representations.


• Patterns from different classes should have dissimilar
representations.
• Pattern representations should be invariant to
transformations such as:
– translations, rotations, size, reflections, non-rigid deformations
• Small intra-class variation, large inter-class variation.

BIM488 Introduction to Pattern Recognition Introduction 33


The Design Cycle: Issue: Missing Features

• Certain features might be missing (e.g., due to occlusion).


• How should the classifier make the best decision with
missing features ?
• How should we train the classifier with missing features ?

BIM488 Introduction to Pattern Recognition Introduction 34


The Design Cycle: Issue: Model Selection

• How do we know when to reject a class of models and try


another one ?
• Is the model selection process just a trial and error
process ?
• Can we automate this process ?

BIM488 Introduction to Pattern Recognition Introduction 35


The Design Cycle: Issue: Overfitting

• Models more complex than necessary lead to overfitting (i.e., good performance on the training data but poor performance on novel data).
• How can we adjust the complexity of the model? (neither too complex nor too simple)
• Are there principled methods for finding the best complexity?

BIM488 Introduction to Pattern Recognition Introduction 36


The Design Cycle: Issue: Context

How m ch
info mation are
y u mi sing?
BIM488 Introduction to Pattern Recognition Introduction 37
The Design Cycle: Issue: Classifier Combination

• Performance can be improved using a "pool" of classifiers.


• How should we combine multiple classifiers ?

BIM488 Introduction to Pattern Recognition Introduction 38


The Design Cycle: Issue: Costs and Risks

• Each classification is associated with a cost or risk (e.g.,


classification error).
• How can we incorporate knowledge about such risks ?
• Can we estimate the lowest possible risk of any classifier ?

BIM488 Introduction to Pattern Recognition Introduction 39


The Design Cycle: Issue: Computational Complexity

• How does an algorithm scale with
  – the number of feature dimensions
  – the number of patterns
  – the number of categories
• Brute-force approaches might lead to perfect classification results but usually have impractical time and memory requirements.
• What is the tradeoff between computational ease and performance?

BIM488 Introduction to Pattern Recognition Introduction 40


The Design Cycle: General Purpose PR Systems?

• Humans have the ability to switch rapidly and seamlessly between different pattern recognition tasks.
• It is very difficult to design a device that is capable of performing a variety of classification tasks.
  – Different decision tasks may require different features.
  – Different features might yield different solutions.
  – Different tradeoffs (e.g., classification error vs. processing time) exist for different tasks.

BIM488 Introduction to Pattern Recognition Introduction 41


A Design Example

• How can we design an attendance system for this course using a pattern recognition system?

BIM488 Introduction to Pattern Recognition Introduction 42


Summary

• Pattern Recognition
• An Example
• Pattern Recognition Systems
• The Design Cycle

BIM488 Introduction to Pattern Recognition Introduction 43


References

• S. Theodoridis and K. Koutroumbas, Pattern Recognition (4th Edition), Academic Press, 2009.

• R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification (2nd Edition), Wiley, 2001.

BIM488 Introduction to Pattern Recognition Introduction 44


BIM488 Introduction to Pattern Recognition

Review of Matrices and Vectors


Outline

• Definitions
• Basic Matrix Operations
• Vector and Vector Spaces
• Vector Norms
• Eigenvalues and Eigenvectors

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 2


Some Definitions

An m×n (read "m by n") matrix, denoted by A, is a rectangular array of entries or elements (numbers, or symbols representing numbers), typically enclosed by square brackets, where m is the number of rows and n the number of columns.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 3


Definitions (con’t)

• A is square if m = n.
• A is diagonal if all off-diagonal elements are 0, and not all diagonal elements are 0.
• A is the identity matrix ( I ) if it is diagonal and all diagonal elements are 1.
• A is the zero or null matrix ( 0 ) if all its elements are 0.
• The trace of A equals the sum of the elements along its main diagonal.
• Two matrices A and B are equal iff they have the same number of rows and columns, and aij = bij.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 4


Definitions (con’t)

• The transpose AT of an m×n matrix A is the n×m matrix obtained by interchanging the rows and columns of A.
• A square matrix for which AT = A is said to be symmetric.
• Any matrix X for which XA = I and AX = I is called the inverse of A.
• Let c be a real or complex number (called a scalar). The scalar multiple of c and matrix A, denoted cA, is obtained by multiplying every element of A by c. If c = −1, the scalar multiple is called the negative of A.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 5


Definitions (con’t)

A column vector is an m × 1 matrix:

A row vector is a 1 × n matrix:

A column vector can be expressed as a row vector by using


the transpose:

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 6


Some Basic Matrix Operations

• The sum of two matrices A and B (of equal dimension), denoted A + B, is the matrix with elements aij + bij.
• The difference of two matrices, A − B, has elements aij − bij.
• The product, AB, of the m×n matrix A and the p×q matrix B (defined only when p = n), is an m×q matrix C whose (i,j)-th element is formed by multiplying the entries across the ith row of A by the entries down the jth column of B; that is,

  cij = ai1 b1j + ai2 b2j + … + ain bnj

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 7


Some Basic Matrix Operations (con’t)

The inner product (also called dot product) of two m-dimensional column vectors a and b is defined as

  aTb = a1b1 + a2b2 + … + ambm

Note that the inner product is a scalar.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 8


Vectors and Vector Spaces

Example
The vector space with which we are most familiar is the two-dimensional real vector space ℝ², in which we make frequent use of graphical representations for operations such as vector addition, subtraction, and multiplication by a scalar. For instance, consider the two vectors

Using the rules of matrix addition and subtraction we have

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 9


Vectors and Vector Spaces (con’t)
Example (Con’t)
The following figure shows the familiar graphical representation of the
preceding vector operations, as well as multiplication of vector a by
scalar c = 0.5.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 10


Vectors and Vector Spaces (con’t)

Consider two real vector spaces V0 and V such that:

• Each element of V0 is also an element of V (i.e., V0 is a subset of V).
• Operations on elements of V0 are the same as on elements of V.

Under these conditions, V0 is said to be a subspace of V.

A linear combination of v1, v2, …, vn is an expression of the form α1v1 + α2v2 + … + αnvn, where the α's are scalars.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 11


Vectors and Vector Spaces (con’t)

A vector v is said to be linearly dependent on a set, S, of vectors v1, v2, …, vn if and only if v can be written as a linear combination of these vectors. Otherwise, v is linearly independent of the set of vectors v1, v2, …, vn.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 12


Vectors and Vector Spaces (con’t)

A set S of vectors v1, v2, …, vn in V is said to span some subspace V0 of V if and only if S is a subset of V0 and every vector v0 in V0 is linearly dependent on the vectors in S. The set S is said to be a spanning set for V0. A basis for a vector space V is a linearly independent spanning set for V. The number of vectors in the basis for a vector space is called the dimension of the vector space. If, for example, the number of vectors in the basis is n, we say that the vector space is n-dimensional.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 13


Vectors and Vector Spaces (con’t)

An important aspect of the concepts just discussed lies in the representation of any vector in ℝm as a linear combination of the basis vectors. For example, any vector

in ℝ3 can be represented as a linear combination of the basis vectors

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 14


Vector Norms

A vector norm on a vector space V is a function that assigns to each vector v in V a nonnegative real number, called the norm of v, denoted by ||v||. By definition, the norm satisfies the following conditions:

  1. ||v|| ≥ 0, and ||v|| = 0 if and only if v = 0
  2. ||cv|| = |c| ||v|| for any scalar c
  3. ||u + v|| ≤ ||u|| + ||v||

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 15


Vector Norms (con’t)

There are numerous norms that are used in practice. In our work, the norm most often used is the so-called 2-norm, which, for a vector x in real m-space ℝm, is defined as

  ||x|| = ( x1^2 + x2^2 + … + xm^2 )^(1/2)

which is recognized as the Euclidean distance from the origin to point x; this gives the expression the familiar name Euclidean norm. The expression is also recognized as the length of a vector x with origin at point 0. From earlier discussions, the norm can also be written as

  ||x|| = ( xTx )^(1/2)

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 16


Vector Norms (con’t)

The Cauchy-Schwarz inequality states that

  |xTy| ≤ ||x|| ||y||

Another well-known result used in the book is the expression

  cos θ = xTy / ( ||x|| ||y|| )

where θ is the angle between vectors x and y. From these expressions it follows that the inner product of two vectors can be written as

  xTy = ||x|| ||y|| cos θ

Thus, the inner product can be expressed as a function of the norms of the vectors and the angle between the vectors.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 17


Vector Norms (con’t)

From the preceding results, two vectors in ℝm are orthogonal if and only if their inner product is zero. Two vectors are orthonormal if, in addition to being orthogonal, the length of each vector is 1.

From the concepts just discussed, we see that an arbitrary vector a is turned into a vector an of unit length by performing the operation an = a/||a||. Clearly, then, ||an|| = 1.

A set of vectors is said to be an orthogonal set if every two vectors in the set are orthogonal. A set of vectors is orthonormal if every two vectors in the set are orthonormal.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 18


Eigenvalues & Eigenvectors

Definition: The eigenvalues of a real matrix M are the real numbers λ for which there is a nonzero vector e such that

  Me = λe

The eigenvectors of M are the nonzero vectors e for which there is a real number λ such that Me = λe.

Eigenvalues are obtained by solving the characteristic equation

  det(M − λI) = 0

For a real symmetric matrix M, the eigenvectors can be chosen to constitute an orthogonal (orthonormal) set.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 19
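A quick MATLAB illustration of the definitions above; the matrix is an arbitrary symmetric example, not the one in the slide's figure:

M = [4 1; 1 3];            % an arbitrary real symmetric matrix
[E, D] = eig(M);           % columns of E are eigenvectors, diag(D) the eigenvalues

lambda1 = D(1,1);  e1 = E(:,1);
disp(M*e1 - lambda1*e1)    % verifies Me = lambda*e (zero up to round-off)
disp(E'*E)                 % for symmetric M the eigenvectors are orthonormal: E'*E = I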


Eigenvalues & Eigenvectors (con’t)

Example: Consider the matrix

and

In other words, e1 is an eigenvector of M with associated eigenvalue λ1, and similarly for e2 and λ2.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 20


Eigenvalues & Eigenvectors (con’t)

Example 2: Consider the matrix

e1 = ? e2 = ?

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 21


Summary

• Definitions
• Basic Matrix Operations
• Vector and Vector Spaces
• Vector Norms
• Orthogonality
• Eigenvalues and Eigenvectors

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 22


References

• R. C. Gonzalez and R. E. Woods, Digital Image Processing (3rd Edition), Prentice Hall, 2008.

BIM488 Introduction to Pattern Recognition Review of Matrices and Vectors 23


BIM488 Introduction to Pattern Recognition

Review of Probability
Outline

• Sets and Set Operations


• Relative Frequency and Probability

BIM488 Introduction to Pattern Recognition Review of Probability 2


Sets and Set Operations

Probability events are modeled as sets, so it is customary to begin a study of probability by defining sets and some simple operations among sets.

A set is a collection of objects, with each object in a set often referred to as an element or member of the set. Familiar examples include the set of all image processing books in the world, the set of prime numbers, and the set of planets circling the sun. Typically, sets are represented by uppercase letters, such as A, B, and C, and members of sets by lowercase letters, such as a, b, and c.

BIM488 Introduction to Pattern Recognition Review of Probability 3


Sets and Set Operations (con’t)

We denote the fact that an element a belongs to set A by a ∈ A. If a is not an element of A, then we write a ∉ A.

A set can be specified by listing all of its elements, or by listing properties common to all elements. For example, suppose that I is the set of all integers. A set B consisting of the first five nonzero integers is specified using the notation B = {1, 2, 3, 4, 5}.

BIM488 Introduction to Pattern Recognition Review of Probability 4


Sets and Set Operations (con’t)

The set of all integers less than 10 is specified using the notation C = { c | c ∈ I, c < 10 }, which we read as "C is the set of integers such that each member of the set is less than 10." The "such that" condition is denoted by the symbol " | ". As shown in the previous two equations, the elements of the set are enclosed by curly brackets.

The set with no elements is called the empty or null set, denoted in this review by the symbol Ø.

BIM488 Introduction to Pattern Recognition Review of Probability 5


Sets and Set Operations (con’t)

Two sets A and B are said to be equal if and only if they contain the same elements. Set equality is denoted by A = B.

If the elements of two sets are not the same, we say that the sets are not equal, and denote this by A ≠ B.

If every element of B is also an element of A, we say that B is a subset of A: B ⊆ A.
BIM488 Introduction to Pattern Recognition Review of Probability 6


Sets and Set Operations (con’t)

Finally, we consider the concept of a universal set, which we denote by U and define to be the set containing all elements of interest in a given situation. For example, in an experiment of tossing a coin, there are two possible (realistic) outcomes: heads or tails. If we denote heads by H and tails by T, the universal set in this case is {H, T}. Similarly, the universal set for the experiment of throwing a single die has six possible outcomes, which normally are denoted by the face value of the die, so in this case U = {1, 2, 3, 4, 5, 6}. For obvious reasons, the universal set is frequently called the sample space, which we denote by S. It then follows that, for any set A, we assume that Ø ⊆ A ⊆ S, and for any element a, a ∈ S and a ∉ Ø.

BIM488 Introduction to Pattern Recognition Review of Probability 7


Some Basic Set Operations

The operations on sets associated with basic probability theory are straightforward. The union of two sets A and B, denoted by A ∪ B, is the set of elements that are either in A or in B, or in both. In other words,

  A ∪ B = { z | z ∈ A or z ∈ B }

Similarly, the intersection of sets A and B, denoted by A ∩ B, is the set of elements common to both A and B; that is,

  A ∩ B = { z | z ∈ A and z ∈ B }
BIM488 Introduction to Pattern Recognition Review of Probability 8


Set Operations (con’t)

Two sets having no elements in common are said to be disjoint or mutually exclusive, in which case

  A ∩ B = Ø

The complement of set A is defined as

  Ac = { z | z ∉ A }

Clearly, (Ac)c = A. Sometimes the complement of A is denoted as Ā.

The difference of two sets A and B, denoted A − B, is the set of elements that belong to A, but not to B. In other words,

  A − B = { z | z ∈ A, z ∉ B }
BIM488 Introduction to Pattern Recognition Review of Probability 9


Set Operations (con’t)

It is easily verified that

  A − B = A ∩ Bc

The union operation is applicable to multiple sets. For example, the union of sets A1, A2, …, An is the set of points that belong to at least one of these sets. Similar comments apply to the intersection of multiple sets.

The following table summarizes several important relationships between sets. Proofs for these relationships are found in most books dealing with elementary set theory.

BIM488 Introduction to Pattern Recognition Review of Probability 10


Set Operations (con’t)

BIM488 Introduction to Pattern Recognition Review of Probability 11


Set Operations (con’t)

It often is quite useful to represent sets and set operations in a so-called Venn diagram, in which S is represented as a rectangle, sets are represented as areas (typically circles), and points are associated with elements. The following example shows various uses of Venn diagrams.

Example: The following figure shows various examples of Venn diagrams. The shaded areas are the result (sets of points) of the operations indicated in the figure. The diagrams in the top row are self-explanatory. The diagrams in the bottom row are used to prove the validity of the expression

which is used in the proof of some probability relationships.

BIM488 Introduction to Pattern Recognition Review of Probability 12


Set Operations (con’t)

BIM488 Introduction to Pattern Recognition Review of Probability 13


Relative Frequency & Probability

A random experiment is an experiment in which it is not possible to predict the outcome. Perhaps the best known random experiment is the tossing of a coin. Assuming that the coin is not biased, we are used to the concept that, on average, half the tosses will produce heads (H) and the others will produce tails (T). This is intuitive and we do not question it. In fact, few of us have taken the time to verify that this is true. If we did, we would make use of the concept of relative frequency. Let n denote the total number of tosses, nH the number of heads that turn up, and nT the number of tails. Clearly,

  nH + nT = n
BIM488 Introduction to Pattern Recognition Review of Probability 14


Relative Frequency & Probability (con’t)

Dividing both sides by n gives

  nH/n + nT/n = 1

The term nH/n is called the relative frequency of the event we have denoted by H, and similarly for nT/n. If we performed the tossing experiment a large number of times, we would find that each of these relative frequencies tends toward a stable, limiting value. We call this value the probability of the event, and denote it by P(event).
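A quick MATLAB simulation of this idea: as the number of tosses grows, the relative frequency of heads settles near the limiting value 1/2 (the fair coin is an assumption built into the use of rand):

n = 10000;                           % number of tosses
tosses = rand(1, n) < 0.5;           % 1 = heads, 0 = tails (fair coin)
relFreq = cumsum(tosses) ./ (1:n);   % relative frequency of heads after each toss
fprintf('Relative frequency of heads after %d tosses: %.4f\n', n, relFreq(end));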

BIM488 Introduction to Pattern Recognition Review of Probability 15


Relative Frequency & Probability (con’t)

In the current discussion the probabilities of interest are P(H) and P(T). We know in this case that P(H) = P(T) = 1/2. Note that the event of an experiment need not signify a single outcome. For example, in the tossing experiment we could let D denote the event "heads or tails" (note that the event is now a set) and the event E, "neither heads nor tails." Then, P(D) = 1 and P(E) = 0.

The first important property of P is that, for an event A,

  0 ≤ P(A) ≤ 1

That is, the probability of an event is a positive number bounded by 0 and 1. For the certain event, S,

  P(S) = 1
BIM488 Introduction to Pattern Recognition Review of Probability 16


Relative Frequency & Probability (con’t)

Here the certain event means that the outcome is from the universal or sample set, S. Similarly, we have that for the impossible event, Sc,

  P(Sc) = 0

This is the probability of an event being outside the sample set. In the example given at the end of the previous paragraph, S = D and Sc = E.

BIM488 Introduction to Pattern Recognition Review of Probability 17


Relative Frequency & Probability (con’t)

The event that either event A or event B or both have occurred is simply the union of A and B (recall that events can be sets). Earlier, we denoted the union of two sets by A ∪ B. One often finds the equivalent notation A + B used interchangeably in discussions on probability. Similarly, the event that both A and B occurred is given by the intersection of A and B, which we denoted earlier by A ∩ B. The equivalent notation AB is used much more frequently to denote the occurrence of both events in an experiment.

BIM488 Introduction to Pattern Recognition Review of Probability 18


Relative Frequency & Probability (con’t)

Suppose that we conduct our experiment n times. Let n1 be the number of times that only event A occurs; n2 the number of times that only B occurs; n3 the number of times that AB occurs; and n4 the number of times that neither A nor B occurs. Clearly, n1 + n2 + n3 + n4 = n. Using these numbers we obtain the following relative frequencies:

BIM488 Introduction to Pattern Recognition Review of Probability 19


Relative Frequency & Probability (con’t)

and

Using the previous definition of probability based on relative frequencies we have the important result

  P(A ∪ B) = P(A) + P(B) − P(AB)

If A and B are mutually exclusive it follows that the set AB is empty and, consequently, P(AB) = 0.

BIM488 Introduction to Pattern Recognition Review of Probability 20


Relative Frequency & Probability (con’t)

The relative frequency of event A occurring, given that event B has occurred, is given by

This conditional probability is denoted by P(A/B), where we note the use of the symbol " / " to denote conditional occurrence. It is common terminology to refer to P(A/B) as the probability of A given B.

BIM488 Introduction to Pattern Recognition Review of Probability 21


Relative Frequency & Probability (con’t)

Similarly, the relative frequency of B occurring, given that A has occurred, is

We call this relative frequency the probability of B given A, and denote it by P(B/A).

BIM488 Introduction to Pattern Recognition Review of Probability 22


Relative Frequency & Probability (con’t)

A little manipulation of the preceding results yields the following important relationships

  P(AB) = P(A/B) P(B)

and

  P(AB) = P(B/A) P(A)

The second expression may be written as

  P(B/A) = P(A/B) P(B) / P(A)

which is known as Bayes' theorem, so named after the 18th century mathematician Thomas Bayes.

BIM488 Introduction to Pattern Recognition Review of Probability 23


Relative Frequency & Probability (con’t)

If A and B are statistically independent, then P(B/A) = P(B) and it follows that

  P(A/B) = P(A)   and   P(AB) = P(A) P(B)

It was stated earlier that if sets (events) A and B are mutually exclusive, then A ∩ B = Ø, from which it follows that P(AB) = P(A ∩ B) = 0. As was just shown, the two sets are statistically independent if P(AB) = P(A)P(B), which we assume to be nonzero in general. Thus, we conclude that for two events to be statistically independent, they cannot be mutually exclusive.

BIM488 Introduction to Pattern Recognition Review of Probability 24


Relative Frequency & Probability (con’t)

In general, for N events to be statistically independent, it must be true that, for all combinations 1 ≤ i < j < k < … ≤ N,

  P(AiAj) = P(Ai) P(Aj)
  P(AiAjAk) = P(Ai) P(Aj) P(Ak)
  …
  P(A1A2 … AN) = P(A1) P(A2) … P(AN)

BIM488 Introduction to Pattern Recognition Review of Probability 25


Relative Frequency & Probability (con’t)

Example: (a) An experiment consists of throwing a single die twice. The probability of any of the six faces, 1 through 6, coming up in either experiment is 1/6. Suppose that we want to find the probability that a 2 comes up, followed by a 4. These two events are statistically independent (the second event does not depend on the outcome of the first). Thus, letting A represent a 2 and B a 4,

  P(AB) = P(A) P(B) = (1/6)(1/6) = 1/36

We would have arrived at the same result by defining "2 followed by 4" to be a single event, say C. The sample set of all possible outcomes of two throws of a die contains 36 outcomes. Then, P(C) = 1/36.

BIM488 Introduction to Pattern Recognition Review of Probability 26


Relative Frequency & Probability (con’t)

Example (Con’t): (b) Consider now an experiment in which we draw one card from a standard card deck of 52 cards. Let A denote the event that a king is drawn, B denote the event that a queen or jack is drawn, and C the event that a diamond-face card is drawn. A brief review of the previous discussion on relative frequencies would show that

  P(A) = 4/52

and

  P(B) = 8/52,   P(C) = 13/52

BIM488 Introduction to Pattern Recognition Review of Probability 27


Relative Frequency & Probability (con’t)

Example (Con’t): Furthermore,

and

Events A and B are mutually exclusive (we are drawing only one card, so it would be impossible to draw a king and a queen or jack simultaneously). Thus, it follows from the preceding discussion that P(AB) = P(A ∩ B) = 0 [and also that P(AB) ≠ P(A)P(B)].

BIM488 Introduction to Pattern Recognition Review of Probability 28


Relative Frequency & Probability (con’t)

Example (Con’t): (c) As a final experiment, consider the deck of 52 cards again, and let A1, A2, A3, and A4 represent the events of drawing an ace in each of four successive draws. If we replace the card drawn before drawing the next card, then the events are statistically independent and it follows that

  P(A1A2A3A4) = P(A1) P(A2) P(A3) P(A4) = (4/52)^4 ≈ 3.5 × 10−5

BIM488 Introduction to Pattern Recognition Review of Probability 29


Relative Frequency & Probability (con’t)

Example (Con’t): Suppose now that we do not replace the cards that are drawn. The events then are no longer statistically independent. With reference to the results in the previous example, we write

  P(A1A2A3A4) = P(A1) P(A2/A1) P(A3/A1A2) P(A4/A1A2A3) = (4/52)(3/51)(2/50)(1/49) ≈ 3.7 × 10−6

Thus we see that not replacing the drawn card reduces our chances of drawing four successive aces by a factor of close to 10. This significant difference is perhaps larger than might be expected from intuition.

BIM488 Introduction to Pattern Recognition Review of Probability 30


Summary

• Sets and Set Operations


• Relative Frequency and Probability

BIM488 Introduction to Pattern Recognition Review of Probability 31


References

• R. C. Gonzalez and R. E. Woods, Digital Image Processing (3rd Edition), Prentice Hall, 2008.

BIM488 Introduction to Pattern Recognition Review of Probability 32


BIM488 Introduction to Pattern Recognition

Introduction to Matlab
Outline

• Basics of Matlab
• Control Structures
• Scripts and Functions
• Basic Plotting Functions
• Graphical User Interface
• Help

BIM488 Introduction to Pattern Recognition Introduction to Matlab 2


Basics of Matlab

• MATLAB stands for Matrix Laboratory.
• Matlab has many functions and toolboxes to help in various applications.
• It allows you to solve many technical computing problems, especially those with matrix and vector formulas, in a fraction of the time it would take to write a program in a scalar non-interactive language such as C.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 3


Basics of Matlab

• The Language
– The MATLAB language is a high-level matrix/array language with
control flow statements, functions, data structures, input/output, and
object-oriented programming features.
• Graphics
– MATLAB has extensive facilities for displaying vectors and matrices
as graphs, as well as editing and printing these graphs. It also
includes functions that allow you to customize the appearance of
graphics as well as build complete graphical user interfaces on your
MATLAB applications.
• External Interfaces
– The external interfaces library allows you to write C programs that
interact with MATLAB.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 4


Basics of Matlab

• Command-based environment
• A(i,j) denotes the element located at i’th row and j’th
column
• Matrices are defined using brackets ‘[’ and ‘]’.
• Rows are separated by semicolon ‘;’.
• Matlab has various toolboxes containing ready-to-use
functions for various tasks.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 5


Basics of Matlab

• Matlab application window (screenshot): it shows the files in the current directory, the command window, the workspace variables, the command history, and the content of the selected file.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 6


Basics of Matlab

• The prompt consists of two right arrows: >>


• Just type your command and press Enter.
• Matlab has all elementary functions.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 7


Basics of Matlab

• Create variables directly, and use them in other functions.


• All variables are created with double precision unless
specified.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 8


Basics of Matlab

Colon ‘:’ Operator

• MATLAB’s most powerful operator!


• 1:5 means 1 2 3 4 5
• 1:3:10 means 1 4 7 10
• 100:-10:50 means 100 90 80 70 60 50
• A(:, 3) returns the third column of A
• A(3, :) returns the third row of A
• A(1:2, 1:3) returns the top two rows and first three
columns

BIM488 Introduction to Pattern Recognition Introduction to Matlab 9


Basics of Matlab

Generating Matrices

• zeros(M,N)
• ones(M,N)
• eye(N)
• rand(M,N) [uniformly-distributed]
• randn(M,N) [normally-distributed]
• magic(N) [sums along rows, columns and
diagonals are the same]
• How can you generate a matrix of all 5’s?
• How can you generate a matrix whose elements are between 2
and 5?
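One possible answer to the two questions above (a sketch; any matrix size works):

A = 5*ones(3,4)          % a 3-by-4 matrix of all 5's
B = 2 + 3*rand(3,4)      % uniformly distributed values between 2 and 5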

BIM488 Introduction to Pattern Recognition Introduction to Matlab 10


Basics of Matlab

Matrix Concatenation

• A=[B C] concatenates B and C left-to-right


• A=[B;C] concatenates B and C top-to-bottom

BIM488 Introduction to Pattern Recognition Introduction to Matlab 11


Basics of Matlab

Deleting Rows or Columns

• A(2,:)=[] deletes the second row of A


• A(:,3)=[] deletes the third column of A

BIM488 Introduction to Pattern Recognition Introduction to Matlab 12


Basics of Matlab

Obtaining Matrix Properties

• min(A) finds the minimum of each column
• min(min(A)) finds the minimum element of A
• max(max(A)) finds the maximum element of A
• sum(sum(A)) returns the sum of all elements of A
• size(A) returns the row and column counts
• length(D) returns the length of the one-dimensional array D
• ndims(C) returns the number of dimensions of C
• whos B shows the properties of matrix B

BIM488 Introduction to Pattern Recognition Introduction to Matlab 13


Basics of Matlab

Arithmetic Operators

• +   addition                       • ^    matrix power
• -   subtraction                    • .^   element-wise power
• *   matrix multiplication          • .'   transpose
• .*  element-wise multiplication    • '    complex conjugate transpose
• ./  element-wise right division    • +    [unary plus] e.g. +A
• .\  element-wise left division     • -    [unary minus] e.g. -A
• /   matrix right division          • :    colon operator
• \   matrix left division

BIM488 Introduction to Pattern Recognition Introduction to Matlab 14


Basics of Matlab

Relational & Logical Operators

• <   less than                 • &   AND
• <=  less than or equal to     • |   OR
• >   greater than              • ~   NOT
• >=  greater than or equal to
• ==  equal to
• ~=  not equal to

BIM488 Introduction to Pattern Recognition Introduction to Matlab 15


Basics of Matlab

• Dealing with matrices:

“ : ” indicates block of data

BIM488 Introduction to Pattern Recognition Introduction to Matlab 16


Basics of Matlab

• Dealing with matrices:


• We can directly add, subtract, multiply, invert matrices.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 17


Control Structures

• Conditional Control
- if, else, elseif
- switch, case
• Loop Control
- for, while, continue, break
• Error Control
- try, catch
• Program Termination
- return

BIM488 Introduction to Pattern Recognition Introduction to Matlab 18


Control Structures

• If Statement Syntax:

if (Condition_1)
    Matlab Commands
elseif (Condition_2)
    Matlab Commands
elseif (Condition_3)
    Matlab Commands
else
    Matlab Commands
end

• Examples:

if ((a>3) & (b==5))
    Matlab Commands;
end

if (a<3)
    Matlab Commands;
elseif (b~=5)
    Matlab Commands;
end

if (a<3)
    Matlab Commands;
else
    Matlab Commands;
end

BIM488 Introduction to Pattern Recognition Introduction to Matlab 19


Control Structures

• For Loop Syntax:

for i=Index_Array
    Matlab Commands
end

• Examples:

for i=1:100
    Matlab Commands;
end

for j=1:3:200
    Matlab Commands;
end

for m=13:-0.2:-21
    Matlab Commands;
end

for k=[0.1 0.3 -13 12 7 -9.3]
    Matlab Commands;
end
BIM488 Introduction to Pattern Recognition Introduction to Matlab 20


Control Structures

• While Loop Syntax:

while (condition)
    Matlab Commands
end

• Example:

while ((a>3) & (b==5))
    Matlab Commands;
end

BIM488 Introduction to Pattern Recognition Introduction to Matlab 21


Scripts and Functions

• There are two kinds of M-files:

  - Scripts, which do not accept input arguments or return output arguments. They operate on data in the workspace. Any variables that they create remain in the workspace, to be used in subsequent computations.

  - Functions, which can accept input arguments and return output arguments. Internal variables are local to the function.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 22


Scripts and Functions

• Functions are m-files that can be executed by specifying some inputs and that supply some desired outputs.
• The line telling Matlab that an m-file is actually a function is:

function out1=functionname(in1)
function out1=functionname(in1,in2,in3)
function [out1,out2]=functionname(in1,in2)

• You should write this declaration at the beginning of the m-file, and you should save the m-file with a file name that is the same as the function name.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 23


Scripts and Functions

• Example
  – Write a function out = squarer(A, ind) that
    • takes the (matrix) square of the input matrix if the input indicator is equal to 1, and
    • takes the element-by-element square of the input matrix if the input indicator is equal to 2.
  – The m-file must be saved with the same name as the function (squarer.m); one possible implementation is sketched below.
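One possible implementation, saved as squarer.m so that the file name matches the function name (this is a sketch, not the slide's original code):

function out = squarer(A, ind)
% SQUARER  Matrix square (ind == 1) or element-by-element square (ind == 2) of A.
if ind == 1
    out = A^2;       % matrix product A*A (A must be square)
elseif ind == 2
    out = A.^2;      % element-by-element square
else
    error('ind must be 1 or 2');
end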

BIM488 Introduction to Pattern Recognition Introduction to Matlab 24


Scripts and Functions

• Another function takes an input array and returns the sum and product of its elements as outputs.
• The function sumprod(.) can be called from the command window or from an m-file, as sketched below.
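A sketch of such a function and a hypothetical call (the file would be saved as sumprod.m):

function [s, p] = sumprod(x)
% SUMPROD  Return the sum and the product of the elements of x.
s = sum(x);
p = prod(x);

% Example call from the command window or an m-file:
%   [s, p] = sumprod([1 2 3 4])    % s = 10, p = 24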

BIM488 Introduction to Pattern Recognition Introduction to Matlab 25


Scripts and Functions

Global Variables
• If you want more than one function to share a single
copy of a variable, simply declare the variable as global
in all the functions. The global declaration must occur
before the variable is actually used in a function.

Example: function h = falling(t)


global GRAVITY
h = 1/2*GRAVITY*t.^2;

BIM488 Introduction to Pattern Recognition Introduction to Matlab 26


Basic Plotting Functions

• MATLAB provides a variety of techniques to display data


graphically.
• Interactive tools enable you to manipulate graphs to
achieve results that reveal the most information about your
data.
• You can also edit and print graphs for presentations, or
export graphs to standard graphics formats for presentation
in Web browsers or other media.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 27


Basic Plotting Functions

• The plot function has different forms, depending on the input arguments.
• If y is a vector, plot(y) produces a piecewise-linear graph of the elements of y versus the index of the elements of y.
• If you specify two vectors as arguments, plot(x,y) produces a graph of y versus x.
• You can also label the axes and add a title, using the xlabel, ylabel, and title functions.

Example: xlabel('x = 0:2\pi')
         ylabel('Sine of x')
         title('Plot of the Sine Function','FontSize',12)

BIM488 Introduction to Pattern Recognition Introduction to Matlab 28


Basic Plotting Functions

BIM488 Introduction to Pattern Recognition Introduction to Matlab 29


Basic Plotting Functions

• Plotting Multiple Data Sets in One Graph


– Multiple x-y pair arguments create multiple graphs
with a single call to plot.
For example: x = 0:pi/100:2*pi;
y = sin(x);
y2 = sin(x-.25);
y3 = sin(x-.5);
plot(x,y,x,y2,x,y3)

BIM488 Introduction to Pattern Recognition Introduction to Matlab 30


Basic Plotting Functions

• Specifying Line Styles and Colors


It is possible to specify color, line styles, and
markers (such as plus signs or circles) when you
plot your data using the plot command:
plot(x,y,'color_style_marker')

For example: plot(x,y,'r:+')


plots a red-dotted line and places plus sign markers
at each data point.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 31


Basic Plotting Functions

• Graphing Imaginary and Complex Data


When the arguments to plot are complex, the imaginary part
is ignored except when you use a single complex argument.
For example: plot(Z)
which is equivalent to: plot(real(Z),imag(Z))

• Adding Plots to an Existing Graph


When you type: hold on

MATLAB does not replace the existing graph when you issue
another plotting command; it adds the new data to the current
graph, rescaling the axes if necessary.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 32


Basic Plotting Functions

• Figure Windows
Graphing functions automatically open a new figure
window if there are no figure windows already on the
screen.

• To make a figure window the current figure, type


figure(n)
where n is the number in the figure title bar. The results
of subsequent graphics commands are displayed in this
window.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 33


Basic Plotting Functions

• Displaying Multiple Plots in One Figure


subplot(m,n,p)
This splits the figure window into an m-by-n matrix of small subplots
and selects the pth subplot for the current plot.

• Example:
t = 0:pi/10:2*pi;
[X,Y,Z] = cylinder(4*cos(t));
subplot(2,2,1); mesh(X)
subplot(2,2,2); mesh(Y)
subplot(2,2,3); mesh(Z)
subplot(2,2,4); mesh(X,Y,Z)

BIM488 Introduction to Pattern Recognition Introduction to Matlab 34


Basic Plotting Functions

• Setting Axis Limits & Grids


The axis command lets you specify your own limits:
axis([xmin xmax ymin ymax])

You can use the axis command to make the axes visible
or invisible: axis on / axis off

The grid command toggles grid lines on and off:


grid on / grid off

BIM488 Introduction to Pattern Recognition Introduction to Matlab 35


Graphical User Interface

• GUIDE, the MATLAB Graphical User Interface


Development Environment, provides a set of tools for
creating graphical user interfaces (GUIs). These tools
greatly simplify the process of designing and building
GUIs.

BIM488 Introduction to Pattern Recognition Introduction to Matlab 36


Help

• “%” is the comment symbol in Matlab (the equivalent of “//” in C). Anything after it on the same line is ignored by the Matlab interpreter.
• Sometimes slowing down the execution is done deliberately for observation purposes. You can use the command “pause” for this purpose:

pause      % wait until any key is pressed
pause(3)   % wait 3 seconds

BIM488 Introduction to Pattern Recognition Introduction to Matlab 37


Help

• You can always use help of Matlab by typing

>> help
>> help command_name
>> help toolbox_name

BIM488 Introduction to Pattern Recognition Introduction to Matlab 38


References

• http://www.mathworks.com
• Lecture Notes by V. Adams and S.B. Ul Haq
• Lecture Notes by İ.Y. Özbek

BIM488 Introduction to Pattern Recognition Introduction to Matlab 39


BIM488 Introduction to Pattern Recognition

Classification Algorithms – Part I


Outline

• Introduction
• Bayes Decision Theory
• Bayesian Classifier
• Minimum Distance Classifiers
• Naive Bayes Classifier
• Nearest Neighbor (NN) Classifier

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 2


Introduction

Pattern Recognition System

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 3


Introduction

• There exist numerous classification algorithms.


• We are going to describe some of those classifiers in this
course.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 4


Bayes Decision Theory

• This chapter discusses classification techniques inspired by Bayes decision theory.
• In a classification task, we are given a pattern and the task is to classify it into one out of M classes.
• The number of classes is assumed to be known a priori.
• Each pattern is represented by a set of feature values, which make up the l-dimensional feature vector x:

  x = [x1, x2, …, xl]T

• Each pattern is represented uniquely by a single feature vector and can belong to only one class.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 5


Bayes Decision Theory

• Assign the pattern represented by feature vector x to the most probable of the available classes ω1, ω2, …, ωM. That is,

  x → ωi  if  P(ωi|x) is maximum

• P(ωi|x) is the probability that the unknown pattern belongs to class ωi, given that the corresponding feature vector takes the value x.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 6


Bayes Decision Theory

Recall the Bayes rule (2-class case):

  P(ωi|x) = p(x|ωi) P(ωi) / p(x)

where

  p(x) = p(x|ω1) P(ω1) + p(x|ω2) P(ω2)

– P(ωi|x): posterior probability of class ωi given x
– p(x|ωi): class-conditional pdf of x given ωi
– P(ωi): prior probability of class ωi
– p(x): pdf of x

Pdf: Probability Density Function


BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 7
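A small numeric illustration of the rule above in MATLAB, with made-up priors and one-dimensional Gaussian class-conditional pdfs (all values are hypothetical):

P1 = 0.6;  P2 = 0.4;                 % prior probabilities P(w1), P(w2)
x  = 1.2;                            % observed feature value

gauss = @(t, mu, sigma) exp(-(t - mu).^2/(2*sigma^2)) / (sqrt(2*pi)*sigma);
p_x_w1 = gauss(x, 0, 1);             % class-conditional pdf p(x|w1)
p_x_w2 = gauss(x, 2, 1);             % class-conditional pdf p(x|w2)

px    = p_x_w1*P1 + p_x_w2*P2;       % total pdf p(x)
post1 = p_x_w1*P1 / px;              % posterior P(w1|x)
post2 = p_x_w2*P2 / px;              % posterior P(w2|x)
fprintf('P(w1|x) = %.3f, P(w2|x) = %.3f\n', post1, post2);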
Bayes Decision Theory

• Probability P(·)
  – prior knowledge of how likely it is to get a pattern from a given class
• Probability density function p(x)
  – how frequently we will measure a pattern with feature value x

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 8


Bayes Decision Theory

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 9


Bayesian Classifier

• The Bayesian classification rule:
  – Given x, classify it to ωi if:

    P(ωi|x) > P(ωj|x)   for all j ≠ i

  – Since p(x) is the same for all classes, this is equivalent to:

    p(x|ωi) P(ωi) > p(x|ωj) P(ωj)   for all j ≠ i

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 10


Bayesian Classifier

• The Gaussian (normal) pdf is extensively used in pattern recognition.
• The notation N(µ, Σ) is used to describe a normal distribution.
• In the one-dimensional case (a single feature):
  µ = mean value
  Σ = σ2 = variance
• In the multidimensional case (a feature vector with several components):
  µ = mean vector
  Σ = covariance matrix

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 11


Bayesian Classifier

• The one-dimensional case:

  p(x) = ( 1 / (√(2π) σ) ) exp( −(x − µ)2 / (2σ2) )

• The multivariate (multidimensional) case, for an l-dimensional x:

  p(x) = ( 1 / ((2π)^(l/2) |Σ|^(1/2)) ) exp( −(1/2) (x − µ)T Σ−1 (x − µ) )

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 12
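A small MATLAB sketch that evaluates the multivariate Gaussian pdf directly from the formula above; the mean vector, covariance matrix, and test point are made-up examples:

mu    = [1; 2];                  % hypothetical mean vector
Sigma = [2 0.5; 0.5 1];          % hypothetical covariance matrix
x     = [1.5; 1.0];              % point at which to evaluate the pdf

l  = length(mu);
d  = x - mu;
px = exp(-0.5 * d' * (Sigma\d)) / ((2*pi)^(l/2) * sqrt(det(Sigma)));
disp(px)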


Bayesian Classifier

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 13


Bayesian Classifier

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 14


Minimum Distance Classifiers

1. The Euclidean Distance Classifier


2. The Mahalanobis Distance Classifier

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 15


Minimum Distance Classifiers

The optimal Bayesian classifier is significantly simplified under the following assumptions:

• The classes are equiprobable.
• The data in all classes follow Gaussian distributions.
• The covariance matrix is the same for all classes.
• The covariance matrix is diagonal and all elements across the diagonal are equal. That is, S = σ2I, where I is the identity matrix.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 16


Minimum Distance Classifiers

• Under these assumptions, it turns out that the optimal Bayesian classifier is equivalent to the minimum Euclidean distance classifier.
• That is, given an unknown x, assign it to class ωi if

  ||x − µi|| < ||x − µj||   for all j ≠ i

  i.e., if its Euclidean distance from the mean of class ωi is the smallest.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 17


Minimum Distance Classifiers

• If one relaxes the assumptions required by the Euclidean classifier and removes the last one (the one requiring the covariance matrix to be diagonal and with equal elements), the optimal Bayesian classifier becomes equivalent to the minimum Mahalanobis distance classifier.
• That is, given an unknown x, it is assigned to class ωi if

  sqrt( (x − µi)T S−1 (x − µi) ) < sqrt( (x − µj)T S−1 (x − µj) )   for all j ≠ i

  where S is the common covariance matrix.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 18
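A minimal MATLAB sketch of both minimum-distance rules for two classes; the class means, common covariance matrix, and test point are hypothetical:

mu1 = [1; 1];   mu2 = [4; 3];         % hypothetical class mean vectors
S   = [1.2 0.4; 0.4 1.8];             % hypothetical common covariance matrix
x   = [2.5; 2.0];                     % unknown pattern to classify

dE = [norm(x - mu1), norm(x - mu2)];                              % Euclidean distances
dM = [sqrt((x-mu1)'*(S\(x-mu1))), sqrt((x-mu2)'*(S\(x-mu2)))];    % Mahalanobis distances

[~, classE] = min(dE);
[~, classM] = min(dM);
fprintf('Euclidean rule -> class %d, Mahalanobis rule -> class %d\n', classE, classM);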


Minimum Distance Classifiers

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 19


Minimum Distance Classifiers

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 20


Minimum Distance Classifiers

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 21


Naive Bayes Classifier

• In the Naive Bayes classification scheme, the required estimate of the pdf at a point x is computed as the product of the one-dimensional feature pdfs:

  p(x|ωi) = p(x1|ωi) p(x2|ωi) … p(xl|ωi)

• That is, the components (features) of the feature vector x are assumed to be statistically independent.
• For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 22
Naive Bayes Classifier

• Decision rule (similar to the Bayesian classification rule): assign x to the class ωi that maximizes P(ωi) p(x1|ωi) p(x2|ωi) … p(xl|ωi).

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 23


Nearest Neighbor (NN) Classifier

• Nearest neighbor (NN) is one of the most popular classification rules.
• We are given c classes, ωi, i = 1, 2, …, c, and N training points, xi, i = 1, 2, …, N, in the l-dimensional space, together with their class labels.
• Given a point x whose class label is unknown, the task is to classify x into one of the c classes. The rule consists of the following steps (a MATLAB sketch follows the steps):

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 24


Nearest Neighbor (NN) Classifier

1. Among the N training points, search for the k neighbors


closest to x using a distance measure (e.g., Euclidean,
Mahalanobis). The parameter k is user-defined. Note that it
should not be a multiple of c. That is, for two classes k
should be an odd number.
2. Out of the k-closest neighbors, identify the number, ki, of
the points that belong to class ωi.
3. Assign x to class ωi for which ki > kj, for all j ≠ i. In other words,
x is assigned to the class to which the majority of the k-
closest neighbors belong.
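• A minimal MATLAB sketch of the three steps above (an added illustration with toy data; ties are ignored for simplicity):

% Illustrative k-NN classification
Xtr = [0 0; 0 1; 1 0; 4 4; 4 5; 5 4];     % N training points (rows)
y   = [1 1 1 2 2 2];                      % their class labels
x   = [0.8 0.6];                          % point to classify
k   = 3;

d          = sqrt(sum(bsxfun(@minus, Xtr, x).^2, 2));  % Euclidean distances to x
[~, idx]   = sort(d);                                  % neighbors ordered by distance
kLabels    = y(idx(1:k));                              % labels of the k closest points
counts     = accumarray(kLabels(:), 1);                % k_i: votes per class
[~, class] = max(counts);                              % majority rule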

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 25


Nearest Neighbor (NN) Classifier

Example: For k = 11 → 11-NN Classification

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 26


Summary

• Introduction
• Bayes Decision Theory
• Bayesian Classifier
• Minimum Distance Classifiers
• Naive Bayes Classifier
• Nearest Neighbor (NN) Classifier

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 27


References

• S. Theodoridis, A. Pikrakis, K. Koutroumbas, D. Cavouras, Introduction


to Pattern Recognition: A MATLAB Approach, Academic Press, 2010.

• S. Theodoridis and K. Koutroumbas, Pattern Recognition (4th Edition),


Academic Press, 2009.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part I 28


BIM488 Introduction to Pattern Recognition

Classification Algorithms - Part II


Outline

• Introduction
• Linear Discriminant Functions
• The Perceptron Algorithm

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 2


Introduction

• Previously, our major concern was to design classifiers


based on probability density functions.
• Now, we will focus on the design of linear classifiers,
regardless of the underlying distributions describing the
training data.
• The major advantage of linear classifiers is their simplicity
and computational attractiveness.
• Here, our assumption is that all feature vectors from the
available classes can be classified correctly using a linear
classifier, and we will develop techniques for the
computation of the corresponding linear functions.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 3


Introduction

The solid and empty dots can be correctly classified by any


number of linear classifiers. H1 (blue) classifies them correctly, as
does H2 (red). H2 could be considered "better" in the sense that
it is also furthest from both groups. H3 (green) fails to correctly
classify the dots.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 4


Introduction

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 5


Linear Discriminant Functions
• A classifier that uses discriminant functions assigns a
feature vector x to class ωi if
gi(x) > gj(x) for all j≠i

where gi(x), i = 1, . . . , c, are the discriminant functions for c


classes.
• A discriminant function that is a linear combination of the
components of x is called a linear discriminant function and
can be written as
g(x) = wTx + w0 = w1x1 + w2x2 + ... + wdxd + w0

where w is the weight vector and w0 is the bias (or


threshold weight).

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 6


Linear Discriminant Functions

• For the two-category case, the decision rule can be written


as

Decide : ω1 if g(x) > 0


ω2 otherwise

• The equation g(x) = 0 defines the decision boundary that


separates points assigned to ω1 from points assigned to ω2.
• When g(x) is linear, the decision surface is a hyperplane
whose orientation is determined by the normal vector w and
location is determined by the bias w0.
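• A minimal MATLAB sketch of this two-category rule (an added illustration; the values of w, w0 and x are example values):

% Illustrative weight vector, bias and test point
w  = [2; -1];         % normal vector of the hyperplane
w0 = 0.5;             % bias (threshold weight)
x  = [1.0; 0.2];

g = w' * x + w0;      % g(x) = w'x + w0
if g > 0
    class = 1;        % decide w1
else
    class = 2;        % decide w2
end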

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 7


Linear Discriminant Functions

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 8


Linear Discriminant Functions

Geometry of the decision line. On one side of the line g(x) > 0 (+),
and on the other g(x) < 0 (−).

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 9


Linear Discriminant Functions

Multicategory Case:

• There is more than one way to devise multicategory


classifiers with linear discriminant functions.
• One against all: we can pose the problem as c two-class
problems, where the i’th problem is solved by a linear
discriminant that separates points assigned to ωi from those
not assigned to ωi.
• One against one: Alternatively, we can use c(c-1)/2 linear
discriminants, one for every pair of classes.
• Also, we can use c linear discriminants, one for each class,
and assign x to ωi if gi(x) > gj(x) for all j≠i.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 10


Linear Discriminant Functions

Figure: Linear decision boundaries for a 4-class problem devised as


(a) four 2-class problems, (b) 6 pairwise problems. The pink regions have
ambiguous category assignments.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 11


Linear Discriminant Functions

• To avoid the problem of ambiguous regions:


– Define c linear discriminant functions
– Assign x to ωi if gi(x) > gj(x) for all j ≠ i.

• The resulting classifier is called a linear machine
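• A minimal MATLAB sketch of such a linear machine (an added illustration; the weight matrix W and the biases below are example values):

% Illustrative linear machine for c = 3 classes in d = 2 dimensions
W  = [ 1  0 -1;        % column i holds the weight vector w_i of g_i(x)
       0  1 -1];
w0 = [ 0  0  0.5];     % bias terms w_i0
x  = [0.3; 0.9];

g = W' * x + w0';      % g_i(x) = w_i' x + w_i0, for i = 1..c
[~, class] = max(g);   % assign x to the class with the largest g_i(x)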

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 12


Linear Discriminant Functions

Figure: Linear decision boundaries produced by using one linear


discriminant for each class.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 13


Linear Discriminant Functions

• The boundary between two regions Ri and Rj is a portion of


the hyperplane given by:

g_i(\mathbf{x}) = g_j(\mathbf{x}), \quad \text{or equivalently} \quad (\mathbf{w}_i - \mathbf{w}_j)^T \mathbf{x} + (w_{i0} - w_{j0}) = 0

• The decision regions for a linear machine are convex.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 14


The Perceptron Algorithm
• The perceptron algorithm is appropriate for the 2-class
problem and for classes that are linearly separable.
• The perceptron algorithm computes the values of the
weights w of a linear classifier, which separates the two
classes.
• The algorithm is iterative. It starts with an initial estimate in
the extended (d +1)-dimensional space and converges to a
solution in a finite number of iteration steps.
• The solution w correctly classifies all the training points
assuming linearly separable classes.
• Note that the perceptron algorithm converges to one out of
infinite possible solutions.
• Starting from different initial conditions, different
hyperplanes result.
BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 15
The Perceptron Algorithm
• The update at the t-th iteration step has the simple form

w(t+1) = w(t) - \rho_t \sum_{x \in Y} \delta_x x

• Y is the set of wrongly classified samples by the current estimate w(t),


• δx is −1 if x ∈ ω1, and +1 if x ∈ ω2,
• ρt is a user-defined parameter that controls the convergence speed and
must obey certain requirements to guarantee convergence (for
example, ρt can be chosen to be constant, ρt = ρ).
• The algorithm converges when Y becomes empty.
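• A minimal MATLAB sketch of this iteration (an added illustration on toy, linearly separable data in the extended space; the iteration cap is only a safeguard):

% Rows of X are training points written as [x' 1]; y is +1 for w1, -1 for w2.
X   = [0 0 1; 0 1 1; 1 0 1;  3 3 1; 3 4 1; 4 3 1];
y   = [ 1;    1;     1;      -1;    -1;    -1   ];
rho = 0.1;                 % constant learning parameter rho_t = rho
w   = zeros(3, 1);         % initial estimate in the extended (d+1) space

for t = 1:1000                               % safety cap on the number of iterations
    misclassified = (y .* (X * w) <= 0);     % the set Y at the current step
    if ~any(misclassified), break; end       % Y empty: the algorithm has converged
    % delta_x = -1 for class w1 and +1 for class w2, i.e. delta_x = -y
    w = w - rho * X(misclassified, :)' * (-y(misclassified));
end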

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 16


The Perceptron Algorithm

• Move the hyperplane so that training samples are on its


positive side.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 17


The Perceptron Algorithm

• Once the classifier has been computed, a point, x, is


classified to either of the two classes depending on the
outcome of the following operation:
f (wTx) = f (w1x(1) + w2x(2) + ··· + wdx(d) + w0)

• The function f (·) in its simplest form is the step or sign


function ( f (z) = 1 if z > 0; f (z) =−1 if z < 0).
• However, it may have other forms; for example, the output
may be either 1 or 0 for z > 0 and z < 0, respectively.
• In general, it is known as the activation function.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 18


The Perceptron Algorithm

• The basic network model, known as perceptron or neuron,


that implements the classification operation is shown
below:

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 19


The Perceptron Algorithm

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 20


The Perceptron Algorithm

Some important points related to perceptron:

• For a fixed learning parameter, the number of iterations (in


general) increases as the classes move closer to each
other (i.e., as the problem becomes more difficult).
• The algorithm fails to converge for a data set that is not
linearly separable. Then, what should we do?
• Different initial estimates for w may lead to different final
estimates for it (although all of them are optimal in the
sense that they separate the training data of the two
classes).

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 21


Summary

• Introduction
• Linear Discriminant Functions
• The Perceptron Algorithm

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 22


References

• S. Theodoridis, A. Pikrakis, K. Koutroumbas, D. Cavouras, Introduction


to Pattern Recognition: A MATLAB Approach, Academic Press, 2010.

• S. Theodoridis and K. Koutroumbas, Pattern Recognition (4th Edition),


Academic Press, 2009.

• R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification (2nd Edition),


Wiley, 2001.

BIM488 Introduction to Pattern Recognition Classification Algorithms - Part II 23


BIM488 Introduction to Pattern Recognition

Classification Algorithms – Part III

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 1


Outline

 Introduction
 Decision Trees

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 2


Introduction

 The XOR problem


x1 x2 XOR Class
0 0 0 B
0 1 1 A
1 0 1 A
1 1 0 B

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 3


 There is no single line (hyperplane) that separates
class A from class B. On the contrary, AND and OR
operations are linearly separable problems

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 4


 There exist many types of nonlinear classifiers

• Multi-layer neural networks


• Support vector machines (nonlinear case)
• Decision trees
• ...

 We will particularly focus on decision trees in this course


as a nonlinear classifier.

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 5


BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 6
 The figures below are such examples. This type of trees is
known as Ordinary Binary Classification Trees (OBCT). The
decision hyperplanes, splitting the space into regions, are
parallel to the axis of the spaces. Other types of partition are
also possible, yet less popular.

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 7


Elements of a decision tree:

 Root
 Nodes
 Leaves

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 8


 Design Elements that define a decision tree.
• Each node, t, is associated with a subset Xt ⊆ X, where X
is the training set. At each node, Xt is split into two (binary
splits) disjoint descendant subsets Xt,Y and Xt,N, where

Xt,Y ∩ Xt,N = Ø
Xt,Y ∪ Xt,N = Xt

Xt,Y is the subset of Xt for which the answer to the query at


node t is YES. Xt,N is the subset corresponding to NO. The
split is decided according to an adopted question (query).

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 9


• A splitting criterion must be adopted for the best split of Xt
into Xt,Y and Xt,N.

• A stop-splitting criterion must be adopted that controls the


growth of the tree and a node is declared as terminal
(leaf).

• A rule is required that assigns each (terminal) leaf to a


class.

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 10


BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 11
 Splitting Criterion: The main idea behind splitting at each
node is that the resulting descendant subsets Xt,Y and Xt,N should be
more class homogeneous compared to Xt. Thus the criterion
must be in harmony with such a goal. A commonly used
criterion is the node impurity:

I(t) = -\sum_{i=1}^{M} P(\omega_i \mid t) \log_2 P(\omega_i \mid t), \qquad P(\omega_i \mid t) = \frac{N_t^i}{N_t}

where N_t^i is the number of data points in Xt that belong to
class ω_i. The decrease in node impurity (the expected reduction
in entropy, called the information gain) is defined as:

\Delta I(t) = I(t) - \frac{N_{t,Y}}{N_t} I(t_Y) - \frac{N_{t,N}}{N_t} I(t_N)
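• A minimal MATLAB sketch of these two quantities (an added illustration; the class counts used below are example values only):

% Node impurity (entropy) from class probabilities P(w_i | t); 0*log2(0) -> 0
impurity = @(p) -sum(p(p > 0) .* log2(p(p > 0)));

% Example node: class counts [5 3 2]; candidate split into YES / NO children
Nt   = 10;                          % number of points reaching node t
I_t  = impurity([5 3 2] / Nt);      % impurity of node t
I_tY = impurity([4 1 0] / 5);       % impurity of the YES child (5 points)
I_tN = impurity([1 2 2] / 5);       % impurity of the NO child (5 points)
dI   = I_t - (5/Nt)*I_tY - (5/Nt)*I_tN;   % impurity decrease (information gain)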
BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 12
• The goal is to choose the parameters in each node
(feature and threshold) that result in a split with the
highest decrease in impurity.

• Why highest decrease?

• Observe that the highest value of I(t) is achieved if all
classes are equiprobable, i.e., Xt is the least homogeneous.
For two classes: I(t) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1.0

• Observe that the lowest value of I(t) is achieved if the data at
the node belong to only one class, i.e., Xt is the most
homogeneous: I(t) = -1 log2(1) - 0 log2(0) = 0.0
(using the convention 0 log2(0) = 0)

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 13


 Where should we stop splitting?

 Stop - splitting rule: Adopt a threshold T and stop splitting a


node (i.e., assign it as a leaf), if the impurity decrease is less
than T. That is, node t is “pure enough”.

 Class Assignment Rule: Assign a leaf to a class j , where:

j = \arg\max_i P(\omega_i \mid t)

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 14


 Summary of an OBCT algorithmic scheme:

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 15


Example:

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 16


Advantages

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 17


Disadvantages

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 18


Example:
Suppose we want to train a decision tree using the following
instances:
Weekend (Examples)  Weather  Parents  Money  Decision (Category)
W1 Sunny Yes Rich Cinema
W2 Sunny No Rich Tennis
W3 Windy Yes Rich Cinema
W4 Rainy Yes Poor Cinema
W5 Rainy No Rich Stay in
W6 Rainy Yes Poor Cinema
W7 Windy No Poor Cinema
W8 Windy No Rich Shopping
W9 Windy Yes Rich Cinema
W10 Sunny No Rich Tennis

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 19


 The first thing we need to do is work out which attribute will be put
into the node at the top of our tree: either weather, parents or
money.

 To do this, we need to calculate:

Entropy(S) = -p_cinema log2(p_cinema) - p_tennis log2(p_tennis) - p_shop log2(p_shop) - p_stay_in log2(p_stay_in)
= -(6/10) log2(6/10) - (2/10) log2(2/10) - (1/10) log2(1/10) - (1/10) log2(1/10)
= -(6/10)(-0.737) - (2/10)(-2.322) - (1/10)(-3.322) - (1/10)(-3.322)
= 0.4422 + 0.4644 + 0.3322 + 0.3322 = 1.571

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 20


 and we need to determine the best of:

Gain(S, weather) = 1.571 - (|Ssun|/10)*Entropy(Ssun) - (|Swind|/10)*Entropy(Swind) - (|Srain|/10)*Entropy(Srain)
= 1.571 - (0.3)*Entropy(Ssun) - (0.4)*Entropy(Swind) - (0.3)*Entropy(Srain)
= 1.571 - (0.3)*(0.918) - (0.4)*(0.81125) - (0.3)*(0.918) = 0.70

Gain(S, parents) = 1.571 - (|Syes|/10)*Entropy(Syes) - (|Sno|/10)*Entropy(Sno)
= 1.571 - (0.5)*0 - (0.5)*1.922 = 1.571 - 0.961 = 0.61

Gain(S, money) = 1.571 - (|Srich|/10)*Entropy(Srich) - (|Spoor|/10)*Entropy(Spoor)
= 1.571 - (0.7)*(1.842) - (0.3)*0 = 1.571 - 1.2894 = 0.2816
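• A minimal MATLAB sketch (added for illustration) that reproduces the entropy and gain values above from the class counts read off the table:

H = @(p) -sum(p(p > 0) .* log2(p(p > 0)));       % entropy, with 0*log2(0) = 0

ES    = H([6 2 1 1] / 10);                       % Entropy(S)      ~ 1.571
Esun  = H([1 2] / 3);                            % Entropy(Ssun)   ~ 0.918
Ewind = H([3 1] / 4);                            % Entropy(Swind)  ~ 0.811
Erain = H([2 1] / 3);                            % Entropy(Srain)  ~ 0.918
gainWeather = ES - 0.3*Esun - 0.4*Ewind - 0.3*Erain;   % ~ 0.70

Eyes = H([5] / 5);                               % Entropy(Syes)   = 0
Eno  = H([1 2 1 1] / 5);                         % Entropy(Sno)    ~ 1.922
gainParents = ES - 0.5*Eyes - 0.5*Eno;           % ~ 0.61

Erich = H([3 2 1 1] / 7);                        % Entropy(Srich)  ~ 1.842
Epoor = H([3] / 3);                              % Entropy(Spoor)  = 0
gainMoney = ES - 0.7*Erich - 0.3*Epoor;          % ~ 0.28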

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 21


 This means that the first node in the decision tree will be the
weather attribute. As an exercise, convince yourself why this scored
(slightly) higher than the parents attribute - remember what
entropy means and look at the way information gain is calculated.
 From the weather node, we draw a branch for the values that
weather can take: sunny, windy and rainy:

 Now we look at the first branch. Ssunny = {W1, W2, W10}. This is not
empty, so we do not put a default categorisation leaf node here. The
categorisations of W1, W2 and W10 are Cinema, Tennis and Tennis
respectively. As these are not all the same, we cannot put a
categorisation leaf node here. Hence we put an attribute node here,
which we will leave blank for the time being.

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 22


 Looking at the second branch, Swindy = {W3, W7, W8, W9}. Again, this is
not empty, and they do not all belong to the same class, so we put an
attribute node here, left blank for now. The same situation happens with
the third branch, hence our amended tree looks like this:

 Now we have to fill in the choice of attribute A, which we know cannot be


weather, because we've already removed that from the list of attributes to
use. So, we need to calculate the values for Gain(Ssunny, parents) and
Gain(Ssunny, money). Firstly, Entropy(Ssunny) = 0.918. Next, we set S to be
Ssunny = {W1,W2,W10} (and, for this part of the branch, we will ignore all the
other examples). In effect, we are interested only in this part of the table:

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 23


Weekend Decision
Weather Parents Money
(Example) (Category)
W1 Sunny Yes Rich Cinema
W2 Sunny No Rich Tennis
W10 Sunny No Rich Tennis

 Hence we can calculate:

Gain(Ssunny, parents) = 0.918 - (|Syes|/|S|)*Entropy(Syes) - (|Sno|/|S|)*Entropy(Sno)


= 0.918 - (1/3)*0 - (2/3)*0 = 0.918

Gain(Ssunny, money) = 0.918 - (|Srich|/|S|)*Entropy(Srich) - (|Spoor|/|S|)*Entropy(Spoor)


= 0.918 - (3/3)*0.918 - (0/3)*0 = 0.918 - 0.918 = 0

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 24


 Notice that Entropy(Syes) and Entropy(Sno) were both zero, because
Syes contains examples which are all in the same category (cinema),
and Sno similarly contains examples which are all in the same category
(tennis). This should make it more obvious why we use information
gain to choose attributes to put in nodes.

 Given our calculations, attribute A should be taken as parents. The


two values from parents are yes and no, and we will draw a branch
from the node for each of these. Remembering that we replaced the
set S by the set SSunny, looking at Syes, we see that the only example of
this is W1. Hence, the branch for yes stops at a categorisation leaf,
with the category being Cinema. Also, Sno contains W2 and W10, but
these are in the same category (Tennis). Hence the branch for no ends
here at a categorisation leaf. Hence our upgraded tree looks like this:

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 25


Finishing this tree off is left as an exercise !

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 26


Avoiding Overfitting

 As we discussed before, overfitting is a common problem in machine


learning. Decision trees suffer from this, because they are trained to stop
when they have perfectly classified all the training data, i.e., each branch is
extended just far enough to correctly categorise the examples relevant to
that branch. Many approaches to overcoming overfitting in decision trees
have been attempted. These attempts fit into two types:

• Stop growing the tree before it reaches perfection.


• Allow the tree to fully grow, and then post-prune some of the
branches from it.

 The second approach has been found to be more successful in practice.

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 27


Summary

 Introduction
 Decision Trees

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 28


References

 S. Theodoridis and K. Koutroumbas, Pattern Recognition (4th


Edition), Academic Press, 2009.

 Decision Tree Learning, Lecture Notes of Course V231,


Department of Computing, Imperial College, London.

BIM488 Introduction to Pattern Recognition Classification Algorithms – Part III 29


BIM488 Introduction to Pattern Recognition

Assessment of Classification Performance


Outline

• Accuracy vs. Error


• Training and Test Set
• Confusion Matrix
• Precision, Recall, F-score

BIM488 Introduction to Pattern Recognition Assessment of Classification Performance 2


Performance Assessment

• We can use accuracy or error rate to assess performance


of classifiers.
• Accuracy is the ratio of correct classifications.
• Error rate is the ratio of incorrect classifications.
• Accuracy = 1 - Error rate.
• Example:
10 patterns belonging to the same class
Number of correctly classified patterns= 8
Number of incorrectly classified patterns = 2
Accuracy = 8 / 10 = 0.8 = 80%
Error Rate = 2 / 10 = 0.2 = 20%

BIM488 Introduction to Pattern Recognition Assessment of Classification Performance 3


Performance Assessment

• Performance is evaluated on a testing set.


• Therefore, entire dataset should be divided into
– training set
– testing set
• Classification model is obtained using the training set.
• Classification performance is assessed using the testing
set.

BIM488 Introduction to Pattern Recognition Assessment of Classification Performance 4


Performance Assessment

• For objective evaluation, k-fold cross validation technique


is used. Why ?
• Example: k = 3
Fold 1 Fold 2 Fold 3

Training Training Testing

Training Testing Training

Testing Training Training

Accuracy1 Accuracy2 Accuracy3

Overall accuracy = (Accuracy1 + Accuracy2 + Accuracy3) / 3
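• A minimal MATLAB sketch of the scheme (an added illustration; the toy data and the nearest-class-mean base classifier are assumptions made only for this example):

% Illustrative 3-fold cross validation with a minimum-distance base classifier
X = [randn(15,2); randn(15,2) + 4];          % toy data: 15 points per class
y = [ones(15,1); 2*ones(15,1)];              % class labels 1 and 2

N      = size(X, 1);
folds  = 3;
foldId = mod(0:N-1, folds)' + 1;             % assign each sample to a fold
acc    = zeros(1, folds);

for f = 1:folds
    test  = (foldId == f);   train = ~test;
    m1 = mean(X(train & y == 1, :));         % class means from the training part
    m2 = mean(X(train & y == 2, :));
    d1 = sum((X(test,:) - m1).^2, 2);        % squared distances to each mean
    d2 = sum((X(test,:) - m2).^2, 2);
    yPred  = 1 + (d2 < d1);                  % 2 if closer to m2, otherwise 1
    acc(f) = mean(yPred == y(test));         % accuracy on the test fold
end
overallAccuracy = mean(acc);                 % average of the fold accuracies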

BIM488 Introduction to Pattern Recognition Assessment of Classification Performance 5


Performance Assessment

• We can also use a confusion matrix during assessment


• The example below shows predicted and true class labels
for a 10-class recognition problem.

BIM488 Introduction to Pattern Recognition Assessment of Classification Performance 6


Performance Assessment

• We can also use precision, recall and F-score for


performance assessment.
• For classification tasks, the terms
– true positives (TP)
– true negatives (TN)
– false positives (FP)
– false negatives (FN)
compare the results of the classifier under test with trusted
external judgments.
• The terms positive and negative refer to the classifier's
prediction (expectation), and the terms true and false refer
to whether that prediction corresponds to the external
judgment (observation).
BIM488 Introduction to Pattern Recognition Assessment of Classification Performance 7
Performance Assessment

• This can be illustrated by the table below:

Predicted Positive Predicted Negative

Actual Positive TP FN

Actual Negative FP TN

BIM488 Introduction to Pattern Recognition Assessment of Classification Performance 8


Performance Assessment

• Precision and recall are then defined as:


– Precision = TP / (TP + FP)
– Recall = TP / (TP + FN)

• Here, accuracy corresponds to


– Accuracy = (TP + TN) / (TP + FP + TN + FN)

• F-score is the harmonic mean of precision and recall:


– F-score = 2 · (precision · recall) / (precision + recall)
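• A minimal MATLAB sketch of these measures (an added illustration; the counts TP, FP, TN, FN below are example values):

TP = 40;  FP = 10;  TN = 45;  FN = 5;

precision = TP / (TP + FP);                       % 0.8
recall    = TP / (TP + FN);                       % ~0.889
accuracy  = (TP + TN) / (TP + FP + TN + FN);      % 0.85
Fscore    = 2 * precision * recall / (precision + recall);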

BIM488 Introduction to Pattern Recognition Assessment of Classification Performance 9


Summary

• Accuracy vs. Error


• Training and Test Set
• Confusion Matrix
• Precision, Recall, F-score

BIM488 Introduction to Pattern Recognition Assessment of Classification Performance 10


References

• S. Theodoridis, A. Pikrakis, K. Koutroumbas, D. Cavouras, Introduction


to Pattern Recognition: A MATLAB Approach, Academic Press, 2010.

• S. Theodoridis and K. Koutroumbas, Pattern Recognition (4th Edition),


Academic Press, 2009.

• R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification (2nd Edition),


Wiley, 2001.

BIM488 Introduction to Pattern Recognition Assessment of Classification Performance 11


BIM488 Introduction to Pattern Recognition

Feature Selection
Outline

• Introduction
• Feature Selection Methods
• Exhaustive Search
• SBS/SFS
• GSFS/GSBS
• PTA

BIM488 Introduction to Pattern Recognition Feature Selection 2


Introduction

• Feature selection is an essential topic in the field of pattern


recognition.
• The feature selection strategy has a direct influence on the
accuracy and processing time of pattern recognition
applications.

BIM488 Introduction to Pattern Recognition Feature Selection 3


Introduction

Prior to any feature selection, data preprocessing is a


necessary step.

• Data Preprocessing
– Outlier removal: An outlier is defined as a point that lies very
far from the mean of the corresponding random variable.
Such points result in large errors during training. If such
points are the result of erroneous measurements, they have
to be removed.
– Data normalization: Features with large values have large
influence compared to others with small values, although this
may not necessarily reflect a respective significance towards
the design of the classifier.

BIM488 Introduction to Pattern Recognition Feature Selection 4


Introduction

• A common technique is to normalize each feature via the
respective estimates of the mean and variance (i.e., zero
mean, unit variance). That is, for the k-th feature:

\bar{x}_k = \frac{1}{N} \sum_{i=1}^{N} x_{ik}, \quad k = 1, 2, \ldots, l

\sigma_k^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ik} - \bar{x}_k)^2

\hat{x}_{ik} = \frac{x_{ik} - \bar{x}_k}{\sigma_k}
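• A minimal MATLAB sketch of this normalization (an added illustration; X below is a toy data matrix whose rows are the N training vectors and whose columns are the l features):

X = [5.1 200; 4.9 180; 6.0 220; 5.5 260];

xbar  = mean(X, 1);                  % estimate of the mean of each feature
sigma = std(X, 0, 1);                % std estimate with the 1/(N-1) normalization
Xhat  = (X - xbar) ./ sigma;         % zero-mean, unit-variance features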

BIM488 Introduction to Pattern Recognition Feature Selection 5


Introduction

• The Peaking Phenomenon


If, in an ideal world, the class pdfs were known, then increasing the
number of features would be beneficial.
In practice, the general trend is that for a finite number of training
points, increasing the number of features initially improves the
generalization error rate, but after a certain value, the generalization
error rate increases.

BIM488 Introduction to Pattern Recognition Feature Selection 6


Introduction

• The main goals in feature selection:


– Select the “optimum” number l of features
– Select the “best” l features

• Large l has a three-fold disadvantage:


– High computational demands
– Low generalization performance
– Poor error estimates

BIM488 Introduction to Pattern Recognition Feature Selection 7


Introduction

BIM488 Introduction to Pattern Recognition Feature Selection 8


Feature Selection Methods

Widely used feature selection methods are:


• Filters: ranks features independently of the classifier.
• Wrapper: employs a classifier to assess feature subsets.

• Univariate approach: considers one feature at a time.


• Multivariate approach: considers subsets of features
together.

BIM488 Introduction to Pattern Recognition Feature Selection 9


Feature Selection Methods

BIM488 Introduction to Pattern Recognition Feature Selection 10


Feature Selection Methods

BIM488 Introduction to Pattern Recognition Feature Selection 11


Feature Selection Methods

• In this course, we are going to learn some of the well-known


wrapper methods including:

– Exhaustive Search
– SBS/SFS
– GSFS/GSBS
– PTA

BIM488 Introduction to Pattern Recognition Feature Selection 12


Exhaustive Search

• In this selection method, all C(N, d) possible feature
combinations are analyzed to obtain the optimal d-dimensional
feature subset out of the N-dimensional full feature set, based
on a criterion function (e.g., classification accuracy).
• Although this method guarantees to reach the optimal
solution, the required processing time is quite high even for a
moderate number of features.
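• A minimal MATLAB sketch of the search (an added illustration; critFun is a hypothetical stand-in for the adopted criterion function):

N = 6;  d = 3;
critFun = @(s) rand();               % stand-in criterion; in practice, e.g.,
                                     % cross-validated accuracy using features s
subsets   = nchoosek(1:N, d);        % all C(N, d) combinations, one per row
bestScore = -inf;  bestSubset = [];
for i = 1:size(subsets, 1)
    score = critFun(subsets(i, :));  % evaluate this candidate feature subset
    if score > bestScore
        bestScore  = score;  bestSubset = subsets(i, :);
    end
end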

BIM488 Introduction to Pattern Recognition Feature Selection 13


Exhaustive Search

• Example: 4-dimensional feature set, where each candidate subset is
encoded as a binary string (1: selected feature, 0: unselected feature);
the search ranges from the empty feature set to the full feature set.

• How many feature combinations for N-dimensional feature


set?

BIM488 Introduction to Pattern Recognition Feature Selection 14


Sequential forward selection (SFS)

• It operates in bottom-to-top manner.


• The selection procedure starts with an empty set initially.
• Then, at each step, the feature maximizing the criterion
function is added to the current set.
• This operation continues until the desired number of
features is selected.
• The nesting effect is present such that a feature added into
the set in a step can not be removed in the subsequent
steps.
• As a consequence, SFS method can offer only suboptimal
result.
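• A minimal MATLAB sketch of SFS (an added illustration; critFun is again a hypothetical stand-in for the criterion function):

N = 6;  d = 3;
critFun   = @(s) rand();             % stand-in; in practice, e.g., CV accuracy
selected  = [];
remaining = 1:N;
while numel(selected) < d
    scores = zeros(1, numel(remaining));
    for j = 1:numel(remaining)
        scores(j) = critFun([selected remaining(j)]);   % try adding feature j
    end
    [~, best] = max(scores);
    selected  = [selected remaining(best)];   % an added feature is never removed later
    remaining(best) = [];
end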

BIM488 Introduction to Pattern Recognition Feature Selection 15


Sequential forward selection (SFS)

1: Selected Feature
0: Unselected Feature

BIM488 Introduction to Pattern Recognition Feature Selection 16


Sequential backward selection (SBS)

• SBS works in a top-to-bottom manner.


• It is the reverse case of SFS method.
• Initially, complete feature set is considered. At each step,
single feature is removed from the current set so that the
criterion function is maximized for the remaining features
within the set.
• Removal operation continues until the desired number of
features is obtained.
• The nesting effect is present in this method as in SFS.
Once a feature is eliminated from the set, it can not enter
into the set in the subsequent steps.
• Thus, SBS offers suboptimal solution.

BIM488 Introduction to Pattern Recognition Feature Selection 17


Sequential backward selection (SBS)

1: Selected Feature
0: Unselected Feature

BIM488 Introduction to Pattern Recognition Feature Selection 18


Generalized SFS

• In generalized version of SFS, instead of single feature, n


features are added to the current feature set at each step.
• The nesting effect is still present.

BIM488 Introduction to Pattern Recognition Feature Selection 19


Generalized SBS

• In generalized form of SBS (GSBS), instead of single


feature, n features are removed from the current feature set
at each step.
• The nesting effect is present here, too.

BIM488 Introduction to Pattern Recognition Feature Selection 20


Plus-l takeaway-r (PTA)

• The nesting effect present in SFS and SBS can be partly


avoided by moving in the reverse direction of selection for
certain number of steps.
• With this purpose, at each step, l features are selected
using SFS and then r features are removed with SBS.
• This method is called PTA.
• Although the nesting effect is reduced with respect to SFS
and SBS, PTA still provides suboptimal results.

BIM488 Introduction to Pattern Recognition Feature Selection 21


Summary

• Introduction
• Feature Selection Methods
• Exhaustive Search
• SBS/SFS
• GSFS/GSBS
• PTA

BIM488 Introduction to Pattern Recognition Feature Selection 22


References

• S. Theodoridis and K. Koutroumbas, Pattern Recognition (4th Edition),


Academic Press, 2009.
• Saeys Y., Inza I., Larranaga P., "A review of feature selection
techniques in bioinformatics", Bioinformatics, 23(19), 2507-2517, 2007.

BIM488 Introduction to Pattern Recognition Feature Selection 23


BIM488 Introduction to Pattern Recognition

Text Classification

BIM488 Introduction to Pattern Recognition Text Classification 1


Outline
 Introduction
 Text representation
 Text classification
 Term Selection

BIM488 Introduction to Pattern Recognition Text Classification 2


Introduction
• Text classification/categorization is a
problem in information science.
• The task is to assign a document to one
or more categories, based on its
contents.
In other words:
• given a predefined set of categories and
a set of documents
• label each document with one or more
categories
BIM488 Introduction to Pattern Recognition Text Classification 3
Introduction
Applications of Text Classification:

• Topic classification
• Sentiment analysis
• Spam e-mail filtering
• Spam SMS filtering
• Author Identification
• etc.

BIM488 Introduction to Pattern Recognition Text Classification 4


Text representation

• selection of terms
• vector model
• weighting (TF-IDF)

BIM488 Introduction to Pattern Recognition Text Classification 5


Text representation

• text cannot be directly interpreted by many
document processing applications
• we need a compact representation of
the content
• which are the meaningful units of text?

BIM488 Introduction to Pattern Recognition Text Classification 6


Terms

• Words
– typical choice
– set of words, bag of words
• Phrases
– syntactical phrases (e.g. noun phrases)
– statistical phrases (e.g. frequent pairs of
words)
– usefulness not yet known?

BIM488 Introduction to Pattern Recognition Text Classification 7


Terms

• Stop-word removal: part of the text is


not considered as terms: these words
can be removed
– very common words (function words):
• articles (a, the) , prepositions (of, in),
conjunctions (and, or), adverbs (here, then)
– numerals (30.9.2002, 2547)
• other preprocessing steps
– Stemming (e.g., apples → apple)

BIM488 Introduction to Pattern Recognition Text Classification 8


Vector model

• a document is often represented as a


vector
• the vector has as many dimensions as
there are terms in the whole collection
of documents

BIM488 Introduction to Pattern Recognition Text Classification 9


Vector model

• Assume in a sample document


collection, there are 100 words (terms)
• In alphabetical order, the list of terms
starts with:
– absorption
– agriculture
– anaemia
– analyse
– application
– …

BIM488 Introduction to Pattern Recognition Text Classification 10


Vector model

• Each document can be represented by a


vector of 100 dimensions
• We can think a document vector as an
array of 100 elements, one for each
term, indexed, e.g. 0-99

BIM488 Introduction to Pattern Recognition Text Classification 11


Vector model

• let d1 be the vector for document 1


• record only which terms occur in
document:
– d1[0] = 0 -- absorption doesn’t occur
– d1[1] = 0 -- agriculture -”-
– d1[2] = 0 -- anaemia -”-
– d1[3] = 0 -- analyse -”-
– d1[4] = 1 -- application occurs
– ...
– d1[21] = 1 -- current occurs
– …
BIM488 Introduction to Pattern Recognition Text Classification 12
Weighting terms

• usually we want to say that some terms


are more important (for some
document) than the others ->
weighting
• weights usually range between 0 and 1
– 1 denotes presence, 0 absence of the term
in the document

BIM488 Introduction to Pattern Recognition Text Classification 13


Weighting terms

• if a word occurs many times in a


document, it may be more important
– but what about very frequent words?
• often the TF-IDF function is used
– higher weight, if the term occurs often in
the document
– lower weight, if the term occurs in many
documents

BIM488 Introduction to Pattern Recognition Text Classification 14


Weighting terms: TF-IDF
• TF-IDF = term frequency * inverse
document frequency
• weight of term tk in document dj:

tfidf(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#Tr(t_k)}

• where
– #(tk, dj): the number of times tk occurs in dj
– #Tr(tk): the number of documents in Tr in
which tk occurs
– Tr: the documents in the collection (|Tr| is their number)
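• A minimal MATLAB sketch of the weighting (an added illustration with a toy term-document count matrix; log base 10 is used, matching the worked example that follows):

% tf(k, j) holds #(t_k, d_j): the number of times term k occurs in document j
tf = [1 0 2;
      0 3 1;
      5 0 0;
      1 1 1];

nTr = size(tf, 2);                     % |Tr|: number of documents
df  = sum(tf > 0, 2);                  % #Tr(t_k): documents containing term k
w   = tf .* log10(nTr ./ df);          % tf-idf weight of term k in document j

% Optional length normalization so the weights of a document fall in [0, 1]
wNorm = w ./ sqrt(sum(w.^2, 1));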
BIM488 Introduction to Pattern Recognition Text Classification 15
Weighting terms: TF-IDF

• in document 1:
– term ’application’ occurs once, and in
the whole collection it occurs in 2
documents:
• tfidf (application, d1) = 1 * log(10/2) =
log 5 ~ 0.7
– term 'current' occurs once, in the
whole collection in 9 documents:
• tfidf(current, d1) = 1 * log(10/9) ~ 0.05

BIM488 Introduction to Pattern Recognition Text Classification 16


Weighting terms: TF-IDF

• if there were some word that occurs 7


times in doc 1 and only in doc 1, the
TF-IDF weight would be:
– tfidf(doc1word, d1) = 7 * log(10/1) = 7

BIM488 Introduction to Pattern Recognition Text Classification 17


Weighting terms: normalization

• in order for the weights to fall in the


[0,1] interval, the weights are often
normalized (T is the set of terms):

w_{kj} = \frac{tfidf(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|} \left( tfidf(t_s, d_j) \right)^2}}

BIM488 Introduction to Pattern Recognition Text Classification 18


Text categorization

• two major approaches:


– knowledge engineering -> end of 80’s
• manually defined set of rules encoding
expert knowledge on how to classify
documents under the given categories
– machine learning, 90’s ->
• an automatic text classifier is built by
learning, from a set of preclassified
documents, the characteristics of the
categories

BIM488 Introduction to Pattern Recognition Text Classification 19


Single-label, multi-label TC

• single-label text categorization


– exactly 1 category must be assigned to
each dj ∈ D
• multi-label text categorization
– any number of categories may be assigned
to the same dj ∈ D

BIM488 Introduction to Pattern Recognition Text Classification 20


Single-label, multi-label TC

• special case of single-label: binary


– each dj must be assigned either to category
ci or to its complement ¬ ci
• the binary case (and, hence, the single-label
case) is more general than the multi-label
– an algorithm for binary classification can
also be used for multi-label classification
– the converse is not true

BIM488 Introduction to Pattern Recognition Text Classification 21


Machine learning approach
• a general inductive process (learner)
automatically builds a classifier for a category
ci by observing the characteristics of a set of
documents manually classified under ci or ¬ci
by a domain expert
• from these characteristics the learner extracts
the characteristics that a new unseen
document should have in order to be classified
under ci
• use of classifier: the classifier observes the
characteristics of a new document and decides
whether it should be classified under ci or ¬ci

BIM488 Introduction to Pattern Recognition Text Classification 22


Classification process: classifier
construction

[Diagram: a Training set of labeled documents (Doc 1: yes, Doc 2: no, ..., Doc n: yes) is fed to the Learner, which produces the Classifier.]

BIM488 Introduction to Pattern Recognition Text Classification 23


Classification process: testing
the classifier

Test set Classifier

BIM488 Introduction to Pattern Recognition Text Classification 24


Classification process: use of the
classifier

New, unseen
document Classifier

Document Class

BIM488 Introduction to Pattern Recognition Text Classification 25


Strengths of machine learning
approach
• the learner is domain independent
– usually available ’off-the-shelf’
• the inductive process is easily repeated, if the
set of categories changes
– only the training set has to be replaced
• manually classified documents often already
available
– manual process may exist
– if not, it is still easier to manually classify a
set of documents than to build and tune a
set of rules

BIM488 Introduction to Pattern Recognition Text Classification 26


Examples of learners
• Rocchio method
• probabilistic classifiers (Naïve Bayes)
• decision tree classifiers
• decision rule classifiers
• regression methods
• on-line methods
• neural networks
• example-based classifiers (k-NN)
• boosting methods
• support vector machines

BIM488 Introduction to Pattern Recognition Text Classification 27


Term selection

• a large document collection may contain millions of


words -> document vectors would contain millions
of dimensions
– many algorithms cannot handle high
dimensionality of the term space (= large number
of terms)
– very specific terms may lead to overfitting: the
classifier can classify the documents in the
training data well but fails often with unseen
documents

BIM488 Introduction to Pattern Recognition Text Classification 28


Term selection

• usually only a part of terms is used


• how to select terms that are used?
– term selection (often called feature
selection or dimensionality reduction)
methods

BIM488 Introduction to Pattern Recognition Text Classification 29


Term selection

• goal: select terms that yield the highest


effectiveness in the given application
• wrapper approach
– the reduced set of terms is found iteratively
and tested with the application
• filtering approach
– keep the terms that receive the highest
score according to a function that measures
the ”importance” of the term for the task

BIM488 Introduction to Pattern Recognition Text Classification 30


Term selection

• many functions available


– document frequency: keep the high
frequency terms
• stopwords have been already removed
• 50% of the words occur only once in the
document collection
• e.g. remove all terms occurring in at
most 3 documents

BIM488 Introduction to Pattern Recognition Text Classification 31


Term selection functions:
document frequency
• document frequency is the number of
documents in which a term occurs
• in our sample, the ranking of terms:
– 9 current
– 7 project
– 4 environment
– 3 nuclear
– 2 application
– 2 area … 2 water
– 1 use …

BIM488 Introduction to Pattern Recognition Text Classification 32


Term selection functions:
document frequency
• we might now set the threshold to 2 and
remove all the words that occur only once
• result: 25 words of 100 words (~25%)
selected

BIM488 Introduction to Pattern Recognition Text Classification 33


Term selection: other functions

• Information-theoretic term selection functions,


e.g.
– chi-square
– information gain
– mutual information
– odds ratio
– relevancy score

BIM488 Introduction to Pattern Recognition Text Classification 34


Term selection: information gain

• Information gain: measures the (number of


bits of) information obtained for category
prediction by knowing the presence or
absence of a term in a document
• information gain is calculated for each term
and the best n terms are selected

BIM488 Introduction to Pattern Recognition Text Classification 35


Term selection: IG

• information gain for term t:


– m: the number of categories

G(t) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{m} P(c_i \mid t) \log P(c_i \mid t) + P(\neg t) \sum_{i=1}^{m} P(c_i \mid \neg t) \log P(c_i \mid \neg t)

BIM488 Introduction to Pattern Recognition Text Classification 36


Estimating probabilities

2 classes: c1 and c2

• (c1) Doc 1: cat cat cat


• (c1) Doc 2: cat cat cat dog
• (c2) Doc 3: cat dog mouse
• (c2) Doc 4: cat cat cat dog dog dog
• (c2) Doc 5: mouse

BIM488 Introduction to Pattern Recognition Text Classification 37


Term selection: estimating
probabilities
• P(t): probability of a term t
– P(cat) = 4/5, or
• ‘cat’ occurs in 4 docs of 5
– P(cat) = 10/17
• the proportion of the occurrences of 'cat' among all term occurrences

BIM488 Introduction to Pattern Recognition Text Classification 38


Term selection: estimating
probabilities
• P(¬t): probability of the absence of t
– P(¬cat) = 1/5, or
– P(¬cat) = 7/17

BIM488 Introduction to Pattern Recognition Text Classification 39


Term selection: estimating
probabilities
• P(ci): probability of category i
– P(c) = 2/5 (the proportion of
documents belonging to c in the
collection), or
– P(c) = 7/17 (7 of the 17 terms occur
in the documents belonging to c)

BIM488 Introduction to Pattern Recognition Text Classification 40


Term selection: estimating
probabilities
• P(ci | t): probability of category i if
t is in the document; i.e., which
proportion of the documents where
t occurs belong to the category i
– P(c1 | cat) = 2/4 (or 6/10)
– P(c2 | cat) = 2/4 (or 4/10)
– P(c1 | mouse) = 0
– P(c2 | mouse) = 1

BIM488 Introduction to Pattern Recognition Text Classification 41


Term selection: estimating
probabilities
• P(ci | ¬t): probability of category i if
t is not in the document; i.e., the
proportion of the documents where
t does not occur that belong to
category i
– P(c1 | ¬cat) = 0 (or 1/7)
– P(c1 | ¬dog) = 1/2 (or 6/12)
– P(c1 | ¬mouse) = 2/3 (or 7/15)

BIM488 Introduction to Pattern Recognition Text Classification 42


Term selection: estimating
probabilities

• In other words...
• Let
– term t occur in B documents, A of
which are in category c
– category c have D documents, out of
the N documents in the
collection

BIM488 Introduction to Pattern Recognition Text Classification 43


Term selection: estimating
probabilities
[Diagram (Venn): of the N documents in the collection, category c contains D documents and term t occurs in B documents; their overlap (documents of c containing t) has A documents.]

BIM488 Introduction to Pattern Recognition Text Classification 44


Term selection: estimating
probabilities

• For instance,
– P(t): B/N
– P(¬t): (N-B)/N
– P(c): D/N
– P(c|t): A/B
– P(c|¬t): (D-A)/(N-B)

BIM488 Introduction to Pattern Recognition Text Classification 45


Term selection: IG

• information gain for


a term t:

G(t) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{m} P(c_i \mid t) \log P(c_i \mid t) + P(\neg t) \sum_{i=1}^{m} P(c_i \mid \neg t) \log P(c_i \mid \neg t)

• G(cat) = 0.17
• G(dog) = 0.02
• G(mouse) = 0.42
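• A minimal MATLAB sketch (added for illustration) that reproduces these values for the toy collection above, assuming document-based probability estimates and log base 2, with the convention 0·log(0) = 0:

docs   = {{'cat','cat','cat'}, {'cat','cat','cat','dog'}, ...
          {'cat','dog','mouse'}, {'cat','cat','cat','dog','dog','dog'}, {'mouse'}};
labels = [1 1 2 2 2];                       % c1 = docs 1-2, c2 = docs 3-5
terms  = {'cat','dog','mouse'};

xlog2 = @(p) sum(p(p > 0) .* log2(p(p > 0)));   % sum of p*log2(p), skipping zeros
N = numel(docs);
G = zeros(1, numel(terms));
for k = 1:numel(terms)
    hasT = cellfun(@(d) any(strcmp(d, terms{k})), docs);          % docs containing term k
    Pc   = [mean(labels == 1), mean(labels == 2)];                % P(c_i)
    Pt   = mean(hasT);                                            % P(t)
    PcT  = [mean(labels(hasT) == 1), mean(labels(hasT) == 2)];    % P(c_i | t)
    PcNT = [mean(labels(~hasT) == 1), mean(labels(~hasT) == 2)];  % P(c_i | not t)
    G(k) = -xlog2(Pc) + Pt * xlog2(PcT) + (1 - Pt) * xlog2(PcNT);
end
% G is approximately [0.17, 0.02, 0.42] for cat, dog, mouse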

BIM488 Introduction to Pattern Recognition Text Classification 46


Summary
 Introduction
 Text representation
 Text classification
 Term Selection

BIM488 Introduction to Pattern Recognition Text Classification 47


References

 582410 Processing of large document collections,


Lecture Notes, University of Helsinki.

BIM488 Introduction to Pattern Recognition Text Classification 48


BIM488 Introduction to Pattern Recognition

Speech Recognition
Outline

• Automatic Speech Recognition (ASR)


• Applications
• Human vs. Computer
• Issues in Speech Recognition
• ASR Approaches
• ASR Example

BIM488 Introduction to Pattern Recognition Speech Recognition 2


What is the task?

• Getting a computer to understand spoken language:


Automatic Speech Recognition
• By “understand” we might mean
– React appropriately
– Convert the input speech into another medium, e.g. text
– etc.

BIM488 Introduction to Pattern Recognition Speech Recognition 3


Applications

• Voice dialing
• Voice operated telephony systems
• Voice controlled devices
• Speech-to-Text converters
• Speaker recognition
• etc.

BIM488 Introduction to Pattern Recognition Speech Recognition 4


Samples of Speech Signal

[Figure: time-domain waveforms (amplitude vs. sample index) of the spoken digits 'Two' and 'Seven'.]

MATLAB: >> load('sound.mat')


>> plot(wavedata)
>> soundsc(wavedata)

BIM488 Introduction to Pattern Recognition Speech Recognition 5


How do humans do it?

• Articulation produces sound


waves.
• The ear conveys sound
waves to the brain for
processing
BIM488 Introduction to Pattern Recognition Speech Recognition 6
How might computers do it?

[Figure: pipeline from acoustic waveform to acoustic signal to recognized speech]

• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation

BIM488 Introduction to Pattern Recognition Speech Recognition 7


Issues in Speech Recognition

• Digitization
– Converting analogue signal into digital representation
• Signal processing
– Separating speech from background noise
• Phonetics
– Variability in human speech
• Phonology
– Recognizing individual sound distinctions (similar phonemes)

BIM488 Introduction to Pattern Recognition Speech Recognition 8


Digitization

• Analogue to digital conversion


• Sampling and quantizing
• Use filters to measure energy levels for various
points on the frequency spectrum
• Knowing the relative importance of different
frequency bands (for speech) makes this process
more efficient
• e.g. high frequency sounds are less informative, so
can be sampled using a broader bandwidth (log
scale)
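
A minimal sketch of the two steps (my own illustration; the 440 Hz tone simply stands in for a speech signal):

import numpy as np

fs = 16000                                    # sampling rate (Hz), a common choice for speech
t = np.arange(0, 0.02, 1 / fs)                # 20 ms of "analogue" time axis
signal = 0.8 * np.sin(2 * np.pi * 440 * t)    # a 440 Hz tone standing in for speech

# Quantization: map the [-1, 1] amplitude range onto 16-bit signed integers
quantized = np.round(signal * 32767).astype(np.int16)
print(quantized[:10])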

BIM488 Introduction to Pattern Recognition Speech Recognition 9


Separating speech from background noise

• Noise cancelling microphones


– Two mics, one facing speaker, the other facing away
– Ambient noise is roughly same for both mics
• Knowing which bits of the signal relate to speech
– Spectrograph analysis

BIM488 Introduction to Pattern Recognition Speech Recognition 10


Variability in individuals’ speech

• Variation among speakers due to


– Vocal range (f0, and pitch range – see later)
– Voice quality (growl, whisper, physiological elements
such as nasality, adenoidality, etc)
– ACCENT !!! (especially vowel systems, but also
consonants, allophones, etc.)
• Variation within speakers due to
– Health, gender, emotional state
– Ambient conditions
• Speech style: formal read vs spontaneous

BIM488 Introduction to Pattern Recognition Speech Recognition 11


Speaker-(in)dependent systems

• Speaker-dependent systems
– Require “training” to “teach” the system your individual
idiosyncrasies
• The more the merrier, but typically nowadays 5 or 10 minutes is
enough
• User asked to pronounce some key words which allow computer to
infer details of the user’s accent and voice
• Fortunately, languages are generally systematic
– More robust
– But less convenient
– And obviously less portable
• Speaker-independent systems
– Language coverage is reduced to compensate for the need to be flexible
in phoneme identification
– Clever compromise is to learn on the fly

BIM488 Introduction to Pattern Recognition Speech Recognition 12


(Dis)continuous speech

• Discontinuous speech is much easier to recognize


– Single words tend to be pronounced more clearly
• Continuous speech involves contextual coarticulation
effects
– Weak forms
– Assimilation
– Contractions

BIM488 Introduction to Pattern Recognition Speech Recognition 13


Approaches to ASR

• Template matching
• Knowledge-based (or rule-based) approach
• Statistical approach (machine learning)

BIM488 Introduction to Pattern Recognition Speech Recognition 14


Template-based approach

• Store examples of units (words, phonemes), then find the


example that most closely fits the input
• Extract features from speech signal, then it’s “just” a
complex similarity matching problem, using solutions
developed for all sorts of applications
• OK for discrete utterances, and a single user

BIM488 Introduction to Pattern Recognition Speech Recognition 15


Template-based approach

• Hard to distinguish very similar templates


• And quickly degrades when input differs from templates
• Therefore needs techniques to mitigate this degradation:
– More subtle matching techniques
– Multiple templates which are aggregated
• Taken together, these suggested …

BIM488 Introduction to Pattern Recognition Speech Recognition 16


Rule-based approach

• Use knowledge of phonetics and linguistics to guide search


process
• Templates are replaced by rules expressing everything
(anything) that might help to decode:
– Phonetics, phonology, phonotactics
– Syntax
– Pragmatics

BIM488 Introduction to Pattern Recognition Speech Recognition 17


Rule-based approach

• Typical approach is based on “blackboard” architecture:


– At each decision point, lay out the possibilities
– Apply rules to determine which sequences are permitted
• Poor performance due to
– Difficulty of expressing the rules
– Difficulty of making the rules interact
– Difficulty of knowing how to improve the system

BIM488 Introduction to Pattern Recognition Speech Recognition 18


• Identify individual phonemes
• Identify words
• Identify sentence structure and/or meaning
• Interpret prosodic features (pitch, loudness, length)
BIM488 Introduction to Pattern Recognition Speech Recognition 19
Statistics-based approach

• Can be seen as extension of template-based approach,


using more powerful mathematical and statistical tools
• Sometimes seen as “anti-linguistic” approach
– Fred Jelinek (IBM, 1988): “Every time I fire a linguist my system
improves”

BIM488 Introduction to Pattern Recognition Speech Recognition 20


Statistics-based approach

• Collect a large corpus of transcribed speech recordings


• Train the computer to learn the correspondences (“machine
learning”)
• At run time, apply statistical processes to search through
the space of all possible solutions, and pick the statistically
most likely one

BIM488 Introduction to Pattern Recognition Speech Recognition 21


ASR Example: Isolated Word Recognition

• Here, as an example, we will focus on isolated word


recognition problem.
• We will talk about each step of the recognition process in
detail.

BIM488 Introduction to Pattern Recognition Speech Recognition 22


ASR Example

BIM488 Introduction to Pattern Recognition Speech Recognition 23


Sound Recorder

• Analog to Digital conversion is carried out.


• Speech signal is now digital.
• Digital speech signal can now be processed.

BIM488 Introduction to Pattern Recognition Speech Recognition 24


Voice Activity Detection

• Signal energy is compared with a pre-determined energy


threshold.
• Thus, silence is discarded.
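
A minimal energy-based silence removal sketch (my own; the frame length and threshold values are arbitrary assumptions):

import numpy as np

def remove_silence(signal, frame_len=400, threshold=1e-3):
    """Keep only frames whose mean energy exceeds a fixed threshold.

    frame_len: samples per frame (e.g. 25 ms at 16 kHz = 400 samples)
    threshold: energy level below which a frame is treated as silence
    """
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)     # per-frame energy
    voiced = frames[energy > threshold]       # discard silent frames
    return voiced.flatten()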

BIM488 Introduction to Pattern Recognition Speech Recognition 25


Segmentation (Windowing)

• Words are parameterised on a frame-by-frame basis
• Choose a frame length over which speech remains reasonably stationary
• Overlap the frames, e.g. 25 ms frames with a 10 ms frame shift

[Figure: overlapping 25 ms analysis frames shifted by 10 ms]

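A sketch of this framing step (my own, assuming a 16 kHz sampling rate):

import numpy as np

def frame_signal(signal, fs=16000, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping frames (25 ms long, shifted by 10 ms)."""
    frame_len = int(fs * frame_ms / 1000)     # 400 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)         # 160 samples at 16 kHz
    starts = range(0, len(signal) - frame_len + 1, shift)
    return np.stack([signal[s:s + frame_len] for s in starts])

frames = frame_signal(np.random.randn(16000))   # 1 s of dummy audio
print(frames.shape)                             # (98, 400)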
BIM488 Introduction to Pattern Recognition Speech Recognition 26


Computation of MFCC (Feature Extraction)

• Calculating Mel-frequency cepstral coefficients (MFCCs):

• MFCCs are coefficients that represent the short-term power spectrum of a
sound, obtained from a linear cosine transform (DCT) of the log power
spectrum on a nonlinear mel scale of frequency.

• MFCCs are one of the most


successful feature extraction
approaches for speech data

BIM488 Introduction to Pattern Recognition Speech Recognition 27


Computation of MFCC (Feature Extraction)

[Figure: MFCC computation for the word “seven” — waveform x(t) → Fourier transform → Mel-scaled filter bank → log energy → DCT → cepstral-domain coefficients (filter # vs. time)]

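In practice MFCCs are rarely coded from scratch; a sketch assuming the third-party librosa library is available (the file name is only a placeholder):

import librosa

# Load a recording (placeholder file name) and compute 13 MFCCs per frame,
# using 25 ms windows with a 10 ms hop at a 16 kHz sample rate.
signal, sr = librosa.load("seven.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
print(mfcc.shape)   # (13, number_of_frames)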
BIM488 Introduction to Pattern Recognition Speech Recognition 28


Classification

• Select a classification algorithm.


• Train your classifier using the training data.
• Then, start classifying unknown speech data.
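
A sketch of this step (my own, assuming scikit-learn and one fixed-length feature vector per utterance, e.g. the mean MFCC vector; the arrays below are placeholders for real training data):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder training data: one feature vector per utterance, with its word label.
X_train = np.random.randn(20, 13)                 # 20 utterances, 13 MFCC means each
y_train = ["two"] * 10 + ["seven"] * 10

clf = KNeighborsClassifier(n_neighbors=3)         # any classifier could be used here
clf.fit(X_train, y_train)

X_test = np.random.randn(1, 13)                   # an unknown utterance
print(clf.predict(X_test))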

BIM488 Introduction to Pattern Recognition Speech Recognition 29


Process Summary

BIM488 Introduction to Pattern Recognition Speech Recognition 30


Process Summary

BIM488 Introduction to Pattern Recognition Speech Recognition 31


Summary

• Automatic Speech Recognition (ASR)


• Applications
• Human vs. Computer
• Issues in Speech Recognition
• ASR Approaches
• ASR Example

BIM488 Introduction to Pattern Recognition Speech Recognition 32


References

• S. Theodoridis and K. Koutroumbas, Pattern Recognition (4th Edition),


Academic Press, 2009.
• Omid Talakoub, Astrid Yi, ‘‘Implementing a Speech Recognition System
on a GPU using CUDA’’.
• ‘‘Automatic Speech Recognition’’, Informatics, The University of
Manchester.

BIM488 Introduction to Pattern Recognition Speech Recognition 33


BIM488 Introduction to Pattern Recognition

Image Recognition
Outline

• Introduction
• Applications
• Facial features
• Face recognition approaches
• Eigenface method

BIM488 Introduction to Pattern Recognition Image Recognition 2


Introduction

• Image recognition, which is also a topic of computer


vision, aims to recognize images.
• Examples
– human face
– fingerprint
– handwritten characters
– satellite images
– medical images
– other images

BIM488 Introduction to Pattern Recognition Image Recognition 3


Introduction

• Face recognition is a specific field of image recognition.

• The task is to enable a machine to identify or verify a face


from a digital image or video.

BIM488 Introduction to Pattern Recognition Image Recognition 4


Applications

• Criminal identification
• Security systems
• Image and film processing
• Human-computer interaction
• etc.

BIM488 Introduction to Pattern Recognition Image Recognition 5


Facial features

• Every face has numerous, distinguishable landmarks, the


different peaks and valleys that make up facial features
such as:

– Distance between the eyes


– Width of the nose
– Depth of the eye sockets
– The shape of the cheekbones
– The length of the jaw line

BIM488 Introduction to Pattern Recognition Image Recognition 6


Face Recognition Approaches

• There are two fundamental approaches to the face


recognition problem:

1. Geometric (feature based), which looks at distinguishing features.


2. Photometric (view based), which is a statistical approach that distills
an image into values and compares those values with templates to
eliminate variance.

• As researcher interest in the subject continued, many


different algorithms were developed.
• In this course, we will focus on Eigenfaces, which is one of
the most popular methods in face recognition.

BIM488 Introduction to Pattern Recognition Image Recognition 7


Eigenfaces: the idea
• Think of a face as being a weighted combination of some
“component” or “basis” faces
• These basis faces are called eigenfaces

[Figure: a face built as a weighted combination of six basis faces, with weights −8029, 2900, 1751, 1445, 4238 and 6193]

BIM488 Introduction to Pattern Recognition Image Recognition 8


Eigenfaces: representing faces
• These basis faces can be differently weighted to represent any face

• So we can use different vectors of weights to represent different faces

[Figure: two faces represented by different weight vectors over the same six basis faces, e.g. (−8029, 2900, 1751, 1445, 4238, 6193) and (−1183, −2088, −4336, −669, −4221, 10549)]

BIM488 Introduction to Pattern Recognition Image Recognition 9


Learning Eigenfaces
Q: How do we pick the set of basis faces?

A: We take a set of real training faces


Then we find (learn) a set of basis faces which best represent the differences
between them

We’ll use a statistical criterion for measuring this notion of “best representation
of the differences between the training faces”

We can then store each face as a set of weights for those basis faces

BIM488 Introduction to Pattern Recognition Image Recognition 10


Using Eigenfaces: recognition & reconstruction

• We can use the eigenfaces in two ways


1. We can store and then reconstruct a face from a set of
weights

2. We can recognise a new picture of a familiar face

BIM488 Introduction to Pattern Recognition Image Recognition 11


Learning Eigenfaces

• How do we learn them?

• We use a method called Principal Component Analysis (PCA)

• To understand this we will need to understand


– What an eigenvector is
– What covariance is

• But first we will look at what is happening in PCA


qualitatively

BIM488 Introduction to Pattern Recognition Image Recognition 12


Subspaces
• Imagine that our face is simply a (high dimensional) vector of pixels

• We can think more easily about 2d vectors

• Here we have data in two dimensions

• But we only really need one dimension to represent it


BIM488 Introduction to Pattern Recognition Image Recognition 13
Finding Subspaces

• Suppose we take a line through the space

• And then take the projection of each point onto that line

• This could represent our data in “one” dimension

BIM488 Introduction to Pattern Recognition Image Recognition 14


Finding Subspaces

• Some lines will represent the data in this way well, some
badly

• This is because the projection onto some lines separates


the data well, and the projection onto some lines separates
it badly
BIM488 Introduction to Pattern Recognition Image Recognition 15
Finding Subspaces

• Rather than a line we can perform roughly the same trick


with a vector

$\mathbf{x}_i = \lambda_i \begin{pmatrix} 2 \\ 1 \end{pmatrix}$
• Now we have to scale the vector to obtain any point on the
line
BIM488 Introduction to Pattern Recognition Image Recognition 16
Eigenvectors

• An eigenvector is a vector v that obeys the following rule:


$A\mathbf{v} = \mu \mathbf{v}$

where A is a matrix and µ is a scalar (called the eigenvalue)

e.g. $A = \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}$; one eigenvector of A is $\mathbf{v} = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$, since

$A\mathbf{v} = \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}\begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4\begin{pmatrix} 3 \\ 2 \end{pmatrix}$

so for this eigenvector of this matrix the eigenvalue is 4
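
A quick numerical check of this example (my own sketch using NumPy):

import numpy as np

A = np.array([[2, 3],
              [2, 1]])
v = np.array([3, 2])

print(A @ v)                  # [12  8], i.e. 4 * [3 2]
print(np.linalg.eig(A)[0])    # eigenvalues of A: 4 and -1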

BIM488 Introduction to Pattern Recognition Image Recognition 17


Eigenvectors
• We can think of matrices as performing transformations on vectors (e.g
rotations, reflections)

• We can think of the eigenvectors of a matrix as being special vectors (for that
matrix) that are scaled by that matrix

• Different matrices have different eigenvectors

• Only square matrices have eigenvectors

• Not every square matrix has real eigenvectors (e.g. a 2-D rotation matrix)

• An n by n matrix has at most n distinct eigenvalues, and at most n linearly independent eigenvectors

• For a symmetric matrix (such as a covariance matrix), eigenvectors belonging to distinct eigenvalues are orthogonal (i.e. perpendicular)

BIM488 Introduction to Pattern Recognition Image Recognition 18


Covariance
• Which single vector can be used to separate these points as much as possible?

[Figure: scatter plot of correlated data plotted against axes x1 and x2]
• This vector turns out to be a vector expressing the direction of the correlation

• Here I have two variables x1 and x2

• They co-vary (x2 tends to change in roughly the same direction as x1)

BIM488 Introduction to Pattern Recognition Image Recognition 19


Covariance

• The covariances can be expressed as a matrix

$C = \begin{pmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{pmatrix}$  (rows and columns indexed by x1 and x2)
• The diagonal elements are the variances e.g. Var(x1)
• The covariance of two variables is:
$\mathrm{cov}(x_1, x_2) = \dfrac{\sum_{i=1}^{n} (x_1^i - \bar{x}_1)(x_2^i - \bar{x}_2)}{n-1}$

BIM488 Introduction to Pattern Recognition Image Recognition 20


Eigenvectors of the covariance matrix
• The covariance matrix has eigenvectors

covariance matrix $C = \begin{pmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{pmatrix}$

eigenvectors $\mathbf{v}_1 = \begin{pmatrix} -0.735 \\ 0.678 \end{pmatrix}$, $\mathbf{v}_2 = \begin{pmatrix} -0.678 \\ -0.735 \end{pmatrix}$

eigenvalues $\mu_1 = 0.049$, $\mu_2 = 1.284$

• Eigenvectors with larger eigenvalues correspond to
directions in which the data varies more

• Finding the eigenvectors and eigenvalues of the
covariance matrix for a set of data is termed
principal component analysis
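
The numbers on these slides can be reproduced with NumPy (my own sketch; eigh is used because a covariance matrix is symmetric, and it returns eigenvalues in ascending order):

import numpy as np

C = np.array([[0.617, 0.615],
              [0.615, 0.717]])

eigenvalues, eigenvectors = np.linalg.eigh(C)
print(eigenvalues)      # approx. [0.049, 1.284]
print(eigenvectors)     # columns are the (unit-length) eigenvectors

# The eigenvector with the largest eigenvalue is the first principal component,
# i.e. the direction in which the data varies most.
principal = eigenvectors[:, np.argmax(eigenvalues)]
print(principal)        # approx. [0.678, 0.735] (sign is arbitrary)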
BIM488 Introduction to Pattern Recognition Image Recognition 21
Expressing points using eigenvectors
• Suppose you think of your eigenvectors as specifying a new vector space

• i.e. I can reference any point in terms of those eigenvectors

• A point’s position in this new coordinate system is what we earlier referred to as


its “weight vector”

• For many data sets you can cope with fewer dimensions in the new space than
in the old space

BIM488 Introduction to Pattern Recognition Image Recognition 22


Eigenfaces
• All we are doing in the face case is treating the face as a
point in a high-dimensional space, and then treating the
training set of face pictures as our set of points
• To train:

– We calculate the covariance matrix of the faces, or perform singular


value decomposition (SVD).
– We then find the eigenvectors and eigenvalues of that covariance
matrix

• These eigenvectors are the eigenfaces or basis faces

• Eigenfaces with bigger eigenvalues will explain more of the


variation in the set of faces, i.e. will be more distinguishing
BIM488 Introduction to Pattern Recognition Image Recognition 23
Eigenfaces: image space to face space

• When we see an image of a face we can transform it to face space:

$w_k^i = \mathbf{x}^i \cdot \mathbf{v}_k$

• There are k = 1…n eigenfaces $\mathbf{v}_k$
• The i-th face in image space is a vector $\mathbf{x}^i$
• The corresponding weight is $w_k^i$
• We calculate the corresponding weight for every eigenface

BIM488 Introduction to Pattern Recognition Image Recognition 24


Recognition in face space
• Recognition is now simple. We find the Euclidean distance d
between our face and all the other stored faces in face space:

$d(\mathbf{w}^1, \mathbf{w}^2) = \sqrt{\sum_{i=1}^{n} \left( w_i^1 - w_i^2 \right)^2}$

• The closest face in face space is the chosen match

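Putting the pieces together, a compact sketch of the whole eigenface pipeline (my own illustration: random vectors stand in for flattened face images, six eigenfaces are kept, and the query is a noisy copy of a training face):

import numpy as np

# Toy data: 12 training "faces", each a flattened 32x32 image (one per row of X).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 32 * 32))

# 1. Centre the data and find the eigenfaces via SVD of the centred matrix.
mean_face = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean_face, full_matrices=False)
eigenfaces = Vt[:6]                          # keep the 6 most significant basis faces

# 2. Represent each training face as a weight vector (projection onto the eigenfaces).
train_weights = (X - mean_face) @ eigenfaces.T

# 3. Recognise a new face: project it, then find the nearest stored weight vector.
new_face = X[3] + 0.01 * rng.normal(size=32 * 32)       # a noisy copy of face 3
w = (new_face - mean_face) @ eigenfaces.T
distances = np.linalg.norm(train_weights - w, axis=1)   # Euclidean distance d
print("closest match:", np.argmin(distances))           # -> 3

# 4. Reconstruction from the weights (more eigenfaces -> better reconstruction).
reconstruction = mean_face + w @ eigenfaces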
BIM488 Introduction to Pattern Recognition Image Recognition 25


Test Procedure

BIM488 Introduction to Pattern Recognition Image Recognition 26


Reconstruction
• The more eigenfaces you have, the better the reconstruction, but you can have
high quality reconstruction even with a small number of eigenfaces

[Figure: reconstructions of a face using 82, 70, 50, 30, 20 and 10 eigenfaces]

BIM488 Introduction to Pattern Recognition Image Recognition 27


Summary

• Introduction
• Applications
• Facial features
• Face recognition approaches
• Eigenface method

BIM488 Introduction to Pattern Recognition Image Recognition 28


References

• J. Wyatt, ‘‘Face Recognition’’, School of Computer Science, University


of Birmingham.
• M. Turk and A. Pentland (1991). Eigenfaces for recognition, Journal of
Cognitive Neuroscience, 3(1): 71–86.
• S. Theodoridis and K. Koutroumbas, Pattern Recognition (4th Edition),
Academic Press, 2009.

BIM488 Introduction to Pattern Recognition Image Recognition 29
