Pattern Recognition 2nd Ed. (2009)
INSTRUCTOR VERSION
Contents
Pattern Recognition
1 Introduction
1.1 Decision Theory
1.2 Creating artificial and real datasets
1.3 Measuring objects within PRTools
3 Nonparametric Classification
3.1 Non-parametric density estimation
3.2 The scaling problem
4 Linear Classifiers
4.1 The perceptron
4.2 Least squares
4.3 Bias-variance dilemma
8 Classifier evaluation
8.1 Sources of Variation
8.2 Learning Curves
8.3 Cross-Validation
8.4 The Confusion Matrix
8.5 Reject Curves
8.6 Receiver Operating Characteristic
9 Clustering
9.1 Hierarchical clustering
9.2 K-means clustering
9.3 Clustering with a mixture-of-Gaussians
10 Cluster validation
10.1 Cluster validation
11 Feature Selection
11.1 Selection Experiments
11.2 Stuff on Scatter Matrices
11.3 More...
12 Feature Extraction
12.1 Principal Component Analysis / Karhunen-Loève Transform
12.2 Scatter Matrices, Repeating
12.3 Supervised Feature Extraction: the Fisher Mapping
12.4 Kernel PCA
13 Dissimilarity Representation
A Final assignment
A.1 Case
A.2 Input: the NIST dataset
A.3 Expected output: deliverables
A.4 The benchmark
A.5 Grading
A.6 Feedback
C Introduction to Matlab
C.1 Getting started with Matlab
C.2 Mathematics with vectors and matrices
C.3 Control flow
C.4 Script and function m-files
Pattern Recognition
The course covers:
- classification;
- representations;
- evaluation;
- clustering.
After the course, the student should:
- understand pattern recognition theory to such an extent that he/she is able to read recent literature on the topic in engineering-oriented journals (e.g. IEEE Tr. on PAMI);
- know which statistical methods to apply to which problems, on which assumptions they are based and how these methods interrelate;
- be able to construct a learning system to solve a given simple problem, using existing software.
Laboratory work
In addition to the Monday lectures, there is an obligatory practical component in which you practice the construction of a pattern recognition system. This runs in parallel with the lectures and is given on Wednesday or Thursday. The laboratory work starts with exercises related to the lecture of the week (as given in this manual).
Material
In addition to the document you are now reading, the book Pattern Recognition by Sergios Theodoridis and Konstantinos Koutroumbas will be used (4th edition, 2009, Academic Press, ISBN 978-1-59749-272-0).
Although not necessary, it will be useful for you to have access to a computer running Matlab outside the practical course. The toolbox and data sets used during the course can be downloaded from the course Blackboard site, so that you can experiment and work on your final exercise at home.
Prior knowledge
Basic working knowledge of multivariate statistics and linear algebra is required to follow the course. The basic elements of statistics will be reiterated in the second week. After the first week you are asked to do the self-test in Appendix B and submit your answers. If you feel you are deficient in either statistics or linear algebra, please have a good look at Appendices A and B of the book Pattern Recognition.
Examination
The examination of this course will consist of three parts:
1. A number of papers (6-7) will be discussed in the lectures. In preparation, these papers will have to be read, and five non-trivial questions about each paper should be handed in on a separate A4 (with your name and student number) at the beginning of the lecture. These questions are graded as good or insufficient (20% of the final grade).
2. Lab work is mandatory. As a final assignment, you should construct a working pattern
recognition system to solve a given problem (during the lab hours) and write a report
on it (40% of the grade, see below).
Final assignment
The final assignment for the practical course is based on a case, in which your group plays
the role of a pattern recognition consultancy company (see Appendix A). Your company is
faced with an assignment by a client company working on automatic bank cheque processing
applications. This client wants you to research the possibility of using pattern recognition
techniques to interpret bank account numbers and the monetary amount. To this end, they
have supplied you with a standard dataset of handwritten digits (0, . . . , 9) put together
by the US National Institute of Standards & Technology, NIST. The digits have already been
segmented (isolated) for you, and are stored as individual images. The problem you will work
on is how to recognize the digits, given these images. The deliverable consists of a report
detailing and motivating the design choices you made, as well as a test of your system on a
set of benchmark data withheld from you by the client.
To prepare for this final assignment and practise with the tools you will need to solve this
problem, the last few exercises for each of the first 9 weeks will consist of some small experi-
ments on the handwritten digit dataset. During the last few weeks you can then work on the
final assignment itself. In the exercises each week, some possible solutions will be presented.
These are definitely not the best solutions, so your final system cannot consist of just the
solutions to the exercises each week!
The basic software package used is PRTools. It is introduced in appendix D. It contains many routines to analyze patterns in object measurements given by a vector representation. Preprocessing is needed to convert raw data like images and time signals into the proper representation. PRTools includes a set of routines that enable such preprocessing for image data. A dedicated image analysis package like DipImage might be useful to facilitate these interface routines. The standard Matlab routines for image processing may, however, be used as well. In addition, private code may be used. It is important to store routines that automatically generate (intermediate) results, as this will enable you to modify and regenerate datasets later. It may also be useful to recompute results for larger sets of objects.
Depending on your programming skills, these last exercises each week may take some time. You are advised to take care that you do not fall more than one or two hours behind schedule.
This manual
This document consists of several parts:
In the first part of the manual, a number of exercises is given for each week. Please read the text carefully and do the exercises. If you have the idea things are still unclear, ask one of the assistants. They will be around to ask you questions on your results as well, so please write down the answers you've found.
Notation
Most weeks contain a few optional exercises, indicated like this:
optional
An exercise between these lines is optional. This means that you are not required to do it.
It can give you some extra background and tips. Only work on it if you think the subject is
interesting and if you have sufficient time left.
end optional
Some other notational conventions are:
Matlab variables and code will be indicated using the teletype font (for example:
x, mean(x)). For larger pieces of Matlab code, we will use the notation:
>> % A piece of test Matlab code
>> x = rand(10,2);
>> mean(x)
>> std(x)
Here, >> is the Matlab prompt. If pieces of code are not preceded by >>, it means it's wiser to write a script. You can learn more about Matlab scripts in appendix C.
An alert sign like the one in the margin here indicates it is essential you read the text
next to it carefully.
A book sign indicates where you can read more on the theory behind the subject discussed. Numbers indicate chapters and sections in Statistical Pattern Recognition. Slides means the theory is discussed in the slides, which can be downloaded in hand-out format from the Blackboard site.
It might happen that in an exercise you will encounter a Matlab-command you have never
seen before. If this happens, have a look at the help. In almost all cases there is an explanation
of the command, so you can get an idea of what the function does.
Week 1
Introduction
Objectives When you have done the exercises for this week, you
Exercise 1.1 (a) Assume that we managed to represent objects from a two-class classification problem by a single feature. We know that the objects from class $\omega_1$ have a Gaussian distribution with $\mu_1 = 0$ and $\sigma_1^2 = 1/2$, and the objects from class $\omega_2$ have a Gaussian distribution with $\mu_2 = 1$ and $\sigma_2^2 = 1/2$ (see section 2.2 of the book). Derive the position of the decision boundary when both class priors are equal.
(b) Again, assume we have a two-class classification problem in a 1D feature space, but now assume that objects from class $\omega_1$ have a uniform distribution between 0 and 1, and objects from class $\omega_2$ have a uniform distribution between 2 and 3. Where is the decision boundary now?
(c) And where is the decision boundary when the objects from class $\omega_2$ have a uniform distribution between 0.5 and 1.5? (The distribution of $\omega_1$ did not change, classes have equal prior.)
(d) And where is the decision boundary when the objects from class $\omega_2$ have a uniform distribution between 0.5 and 2.5? (The distribution of $\omega_1$ did not change, classes have equal prior.)
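Since all parameters in (a) are fully specified, the answer can be checked numerically. The following plain-Matlab sketch (ours, not part of the original exercise) plots both class-conditional densities and locates their crossing; with equal priors and equal variances the boundary should lie midway between the means, at x = 0.5.
x  = -4:0.001:5;                            % fine grid over the feature axis
p1 = exp(-(x-0).^2/(2*0.5))/sqrt(2*pi*0.5); % p(x|omega_1): mu=0, var=1/2
p2 = exp(-(x-1).^2/(2*0.5))/sqrt(2*pi*0.5); % p(x|omega_2): mu=1, var=1/2
plot(x,p1,'b--',x,p2,'k-')
i = find(diff(sign(p1-p2))~=0);             % equal priors: boundary where p1-p2 changes sign
disp(x(i))                                  % should print (approximately) 0.5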
Exercise 1.2 (a) Assume we represent the objects in a two-class classification problem by a single feature. We know that the objects from class $\omega_1$ have a Gaussian distribution with $\mu_1 = 0$ and $\sigma_1^2 = 1/2$, and the objects from class $\omega_2$ have a Gaussian distribution with $\mu_2 = 1$ and $\sigma_2^2 = 1/2$. Derive the position of the decision boundary when both class priors are equal, but we have a loss matrix of:
$$L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}. \qquad (1.1)$$
(b) Assume again we have a two-class classification problem in a 1D feature space, but now assume that objects from class $\omega_1$ have a uniform distribution between 0 and 1, and objects from class $\omega_2$ have a uniform distribution between 0.5 and 2.5. Given the loss matrix (1.1), where is the decision boundary now?
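As a reminder (this note is ours and uses the convention that $\lambda_{ij}$ is the loss of assigning an object from $\omega_i$ to $\omega_j$, with zero loss on the diagonal): minimizing the expected loss leads to the likelihood-ratio test
$$\text{decide } \omega_1 \quad\text{if}\quad \frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{\lambda_{21}\,P(\omega_2)}{\lambda_{12}\,P(\omega_1)},$$
so the off-diagonal entries of $L$ simply shift the threshold on the likelihood ratio.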
[Figure: two triangular densities $f(x)$ plotted for $x$ from $-2$ to $5$; vertical axis $f$ from 0 to 0.9.]
Figure 1.1: The class-conditional probabilities of two classes $p(x|\omega_1)$ (dashed blue line) and $p(x|\omega_2)$ (solid black line) in a 1-dimensional feature space.
Exercise 1.3 In figure 1.1 two triangular-shaped class conditional probability density functions are given. The first one, $p(x|\omega_1)$, is indicated by a dashed blue line and the second, $p(x|\omega_2)$, with a solid black line. The class priors are assumed equal here.
(a) Again, use Bayes' rule to derive the class posterior probabilities of the following objects: $x = 3$, $x = -0.5$, $x = +0.5$. To which class are the objects therefore assigned?
(b) Which object is on the decision boundary of the Bayes classifier?
Exercise 1.4 Now assume that class $\omega_1$ in figure 1.1 is twice as small as class $\omega_2$. That means $p(\omega_1) = 1/3$ and $p(\omega_2) = 2/3$.
(a) Compute the posterior probabilities for $x = 3$, $x = -0.5$, $x = +0.5$.
(b) Where is the decision boundary of the Bayes classifier now?
Exercise 1.5 Compute the Bayes error for the class distributions given in figure 1.1, where the classes have equal prior.
1.2 Creating artificial and real datasets
The general approach to attack a pattern recognition problem (or a data analysis problem) is presented in section 1.2 in the book. One key entity in data analysis is the idea of an object. We always assume we can represent any object by a set of values, often just measurements. Whatever object we are considering, in order to perform operations on an object by a computer, we have to encode this object by some numbers. An object in real life and in the computer are therefore two different things. Furthermore, in this course we will use the convention that each object is represented by a row vector of measurements (thus a 1 × d array).
When you want to conclude something from data, it is often not from one object, but from a set of objects. We assume we have a set of objects from which we want to obtain some knowledge. This set is called a dataset. A dataset is a set of n objects and is stored in an n × d array.
After you have specified the problem you want to solve, you have to collect examples and do
measurements on these examples. For that, you have to define what features are likely to be
informative for the problem you specified.
Exercise 1.6 (a) Make a fake dataset in Matlab containing 10 objects with 3 measure-
ments each. Invent the measurement values yourself. Below is an example piece of code
which fills the matrix x with 2 objects, each containing 3 measurements:
>> x = [ 0.7 0.3 0.2; 2.1 4.5 0];
Make your own matrix x.
(b) Compute the means (using mean) and standard deviations (std) of your 3 measurements of 10 objects. What is the difference between mean(x) and mean(x')?
Exercise 1.7 (a) When a dataset contains just two features per object, it can be visualized
in a scatterplot (we come back to this later). Make a scatterplot by:
plot(x(:,1),x(:,2),'b*');
Looking at the data matrix you created in Exercise 1.6, find out which object is plotted
where in the plot.
(b) When you look at the scatterplot, can you identify outlier objects, or structure in
the data?
When the dataset is used to train classifiers, it is also required that for each object a class
label is present. This indicates from which class the object originates, according to the expert
(i.e. you).
Exercise 1.8 (a) Invent labels for the objects that you defined in the previous question, and store them in a Matlab variable lab. The labels can be numbers, like 1 or 2, but you can also use character strings, like 'banana' or 'orange'. Create a PRTools dataset by
>> a = prdataset(x,lab)
Check if the resulting dataset has the correct number of objects, the correct number of
features and correct number of classes (correct here means: what you expect). You can
do that by just typing the variable on the Matlab command line:
>> a
1.2.1 Scatterplot
A scatterplot is the simplest plot you can make: it simply plots the first measurement against the second measurement. If you have three measurements, you can use 3D plots; if you have even more, you will have to select at most three of them by hand (although later we will discuss ways of visualising more measurements at once).
Exercise 1.9 Load the dataset boomerangs (use the function boomerangs and choose the
number of objects to generate).
(a) Use scatterd(a) to make a scatterplot of the first two features, and scatterd(a(:,[2 3])) for features 2 and 3.
(b) Use scatterd(a,3) for a 3D scatterplot. Find and use the rotate3d button to view all data.
(c) Use scatterd(a,'legend') to add a legend with the class labels to the scatterplot.
Up to now, each object was characterised by a vector of measurements. Sometimes the idea
of an object is somewhat vague and the user has to decide where an object starts and
stops. Imagine, for instance, that we want to build a speaker recogniser. This recogniser
should identify the speaker from a recorded speech signal.
Exercise 1.10 (a) Assume we have recorded several sentences of several speakers using a
microphone. Now we have a set of very long time signals, say five seconds, recorded at a
sampling rate of about 11 kHz. How many samples does the signal contain?
(b) What would you define as the objects, and what as the measurements by which
each object is characterised (this is probably not simple!)?
Pattern recognition studies automatic ways to find patterns in data. Very often these are
images, but other sensor signals like time series (e.g. speech, music, weather) or spectra are
equally possible. In addition also non-sensor data like text, web-pages, answers to question-
naires and medical tests are often analyzed by pattern recognition tools. In the first example
we will deal with simple black-and-white images (blobs or silhouettes) as given by the
so-called Kimia dataset (collected by Benjamin B. Kimia).
>> obj = kimia_images % load 12 objects for 18 classes
>> show(obj,12)
This displays 18 classes of objects, each represented by 12 examples. A single image, e.g.
object 100 can be displayed by
>> figure; show(obj(100,:))
The show command belongs to the PRTools toolbox. There are many more commands in Matlab to display a single image, e.g. imshow, image and imagesc. Sometimes they demand a change of the color table (colormap) or even a change of the intensity scale of the images.
In order to distinguish the classes in the Kimia database in an automatic way, a numeric
representation is needed. A first try is to measure the sizes and the perimeters of the blobs.
The following commands do this for the entire database using tools from the Matlab Image
Processing toolbox (in particular the function regionprops):
>> x = im_features(obj,obj,{'Area','Perimeter','Eccentricity'});
>> +x
The im_features command defines the measurements (it is intended to measure object properties in gray value images for which the shape and position are given by a binary (black-and-white) image; therefore the obj parameter is supplied twice, as we have just binary images). The function im_features accepts a prdatafile, and it outputs a prdataset. This dataset can be shown on the screen by +x. A better way to visualize 2-dimensional data (i.e. data describing objects by just two numbers) is a scatter plot.
Exercise 1.11 (a) Take the feature set x from the above piece of code and select just
two classes, e.g. the elephants (class 3) and the camels (class 16) by featset =
seldat(x,[3 16]). You may inspect the values by featset = double(featset).
(b) In order to understand these numbers you may select the original images by
>> a = kimia_images
>> b = seldat(a,[3 16]) % select classes 3 and 16
>> show(b)
A single image, e.g. the first camel, can be inspected by
>> c = data2im(b,13); % select image 13
>> figure; imagesc(c); % show it on the screen
>> colormap gray % apply gray colormap
In this figure the axes show the sizes in pixels. Can you understand the area and
perimeter values of featset?
(c) Inspect the data in a scatterplot by scatterd(x,'legend'). You will now see the first two features. Note the class names in the legend and the feature labels along the axes.
The PRTools package, which will be used extensively in this course, is based on vector representations of objects derived from object measurements. In order to have a single, unified toolbox, it has recently been extended (in version 4.1) with possibilities to perform such measurements on large sets of raw object measurements. This extension is the so-called prdatafile object of PRTools. An introduction to PRTools is presented in Appendix D. The prdatafile extension is described in Appendix E. In the following, some introductory examples are presented.
In order to work with PRTools it should be available in the Matlab search path. This
can be verified by help prtools. Various datasets can only be loaded if the prdatasets or
prdatafiles directories are in the path. Use help prdatasets and help prdatafiles to
inspect the list of available datasets.
This example is again based on the Kimia dataset. It consists of a set of 216 black-and-white images, silhouettes of toy objects and animals. There are 18 classes of such objects and for each class 12 images are given. In order to speed up processing, here we will restrict ourselves to just 6 classes.
The data referred to by the variables obj1, obj2, etcetera, are of the prdatafile data type. The contents can be inspected by
>> struct(obj1)
Objects of the type prdatafile do not store data directly, but just refer to files on disk and the way they should be processed. The field rootpath stores the name of the top directory which contains the data. The two cells in the field files contain the names of all sub-directories and of all files they contain. Such fields may be better inspected by, e.g., s = getfiles(obj2); s{1}, s{2}{:}. All preprocessing is stored in the preproc field, e.g. p = getpreproc(obj6), p(4), p(5). (The selection of classes made for obj2 is stored somewhere else. We will not discuss it here.)
It should be emphasized that commands like obj5 = im_resize(obj4,[20 20]) do not cause any processing of data. They just store the command in the description of the variable. Execution is postponed until data is needed (e.g. for display by the show command) or until a datafile is converted to a dataset. This will be described later.
All preprocessing can be combined in the following way:
>> preproc = im_box([],0,1)*im_rotate*im_resize([],[20 20])*im_box([],1,0);
>> obj5 = obj1 * preproc;
The *-symbol can be interpreted as the pipe command in Unix; preproc is thereby roughly a
small routine that consists of a sequence of operations. When it is applied to obj1, PRTools
substitutes the [] entries by the data before the *-symbol.
Exercise 1.13 We will now measure two properties of the set of objects, the x- and y-
coordinates of the centers of gravity:
>> obj = seldat(obj1,2) % Use just class 2
>> obj = obj*preproc % normalize
>> figure(1); show(obj); % show data
>> mean_set = im_mean(obj) % measure (x,y) of centers of gravity
>> mean_data = double(mean_set) % convert them (unpack)
(a) Do the results correspond to what you expect, given the object images?
The last command, double(mean_set), converts the datafile variable mean_set to plain numbers (doubles). So in mean_data we now have the measurements of the objects in class 2. These can be further analyzed, e.g. by constructing histograms and scatterplots.
(b) Plot the histograms of the (x, y) coordinates of the centers of gravity (means) of the objects in class 2.
(c) Construct a scatterplot of these two measurements.
(d) Also measure the means of the objects in class 13.
(e) Add these to the scatterplot with a different color.
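A minimal sketch for (b)-(e), reusing the variables defined above (the class-13 variable names are our own):
>> hist(mean_data(:,1))                          % histogram of the x-coordinates
>> figure; hist(mean_data(:,2))                  % histogram of the y-coordinates
>> figure; plot(mean_data(:,1),mean_data(:,2),'b*'); hold on  % scatterplot
>> obj13 = seldat(obj1,13)*preproc;              % same measurements for class 13
>> m13 = double(im_mean(obj13));                 % (x,y) centers of gravity, unpacked
>> plot(m13(:,1),m13(:,2),'r*')                  % add class 13 in a different color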
This week you saw the basis of pattern recognition: object definition, data collection, and
Bayes rule. Starting from good measurement data, the rest of the analysis (visualisation,
clustering, classification, regression) becomes much easier. Starting from poorly defined ob-
jects or poorly sampled datasets with insufficient example objects, your analysis becomes very
hard and no clear conclusions will be possible (except that more data is needed).
Take-home messages
If you collect or receive data, think well about what the objects and
measurements are
Week 2
Objectives When you have done the exercises and read the text for this week, you should
2.1.1 Shapes in 2D
optional
end optional
Exercise 2.4 Consider Equation (2.27) in the book (section 2.4.1), which provides the general expression for a normal distribution in an $l$-dimensional space. Here we assume that $l = 2$. Now assume, in addition, that the covariance matrix is given and equals $\begin{pmatrix} 4 & 1 \\ 1 & 1 \end{pmatrix}$. Derive the maximum likelihood estimate for the 2D mean vector $\mu$.
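For a Gaussian with known covariance, the ML estimate of the mean is the sample mean. A quick numerical check in plain Matlab (our own sketch; the true mean is chosen arbitrarily):
mu = [1 2];                                       % some true mean (our choice)
C  = [4 1; 1 1];                                  % the given covariance matrix
X  = randn(10000,2)*chol(C) + repmat(mu,10000,1); % samples with covariance C
mean(X)                                           % sample mean = ML estimate; close to [1 2]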
optional
end optional
Exercise 2.7 This may be a difficult exercise but it is a rather important one...
Again consider Equation (2.27) (section 2.4.1), which provides the general expression for a normal distribution in an $l$-dimensional space. Consider an $l \times 1$ vector $a$ and an $l \times l$ linear transformation matrix $L$. Assume we transform our random variables $X$, which are distributed according to Equation (2.27), into new random variables $Y$ by means of an affine transformation as follows:
$$Y = LX + a.$$
(a) Show that $Y$ is still normally distributed. Express the new mean and covariance matrix in terms of $\mu$, $\Sigma$, $L$, and $a$.
(b) Check your answer with the outcome in exercise 2.6 (d) and (e).
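The result claimed in (a) — $Y \sim N(L\mu + a,\ L\Sigma L^T)$ — can also be checked empirically; a plain-Matlab sketch with an arbitrary choice of $L$ and $a$ (ours), starting from $\mu = 0$ and $\Sigma = I$:
L = [2 1; 0 1]; a = [3; -1];       % arbitrary transformation (our choice)
X = randn(2,100000);               % standard normal, one sample per column
Y = L*X + repmat(a,1,100000);      % the affine transformation
mean(Y,2)                          % should approach L*mu + a = a here
cov(Y')                            % should approach L*Sigma*L' = L*L'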
Exercise 2.8 Generate 1000 random points from a 2D standard normal distribution using randn. (Note the discrepancy between the math, in which typically column vectors are employed, and computational software, in which feature vectors often are row vectors. This means, in this case, that the matrix with random numbers should be of size 1000 × 2.)
(a) Turn these random points into a prdataset a and scatterd the data.
The command w = gaussm(a) determines estimates for a normal distribution based on
the data set a, and stores the estimated parameters in the mapping w.
(b) Use plotm(w) to visualize the 2D density estimates on top of the scattered data.
(c) Multiply one of the features by 3 and, subsequently, rotate the data over an angle $\theta$ (say $\theta = \pi/3$) by means of a rotation matrix (you know... $\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$). Scatter the new data in a new figure and plot the new contours. Compare to the other figure and see where corresponding points end up.
(d) Investigate for which angle $\theta$ the correlation becomes maximum.
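A possible implementation of (c), assuming a is the prdataset from (a) (a sketch; $\theta = \pi/3$ here):
b = +a;                                  % unpack the dataset to plain numbers
b(:,1) = 3*b(:,1);                       % multiply the first feature by 3
t = pi/3; R = [cos(t) -sin(t); sin(t) cos(t)];
b = b*R';                                % rows are feature vectors, so post-multiply by R'
b = prdataset(b);
figure; scatterd(b); plotm(gaussm(b))    % new scatterplot plus density contours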
optional
Exercise 2.9 (a) Consider the same questions as those in the previous exercise but now
use uniformly distributed data, i.e., use rand to generate the initial data.
end optional
Exercise 2.10 The piece of code below generates 5 data points from a standard normal distribution, estimates its parameters, and displays the estimated densities through a contour plot. This is repeated a number of times and all contours are plotted in the same figure.
(a) Before executing the code: can you predict what the various contour plots will look like? Can you explain why this happens?
for i = 1:8
a = prdataset(randn(5,2));
scatterd([-3 -3;3 3],'.w') % trick to set the figure axes about right
plotm(gaussm(a))
hold on
pause
end
hold off
(b) Execute the code. Pressing the space bar will add another contour plot. Check with yourself whether the plot comes close to your expectations. (Don't fool yourself! Was this indeed what you expected?)
(c) What happens when the number of random feature vectors increases steadily?
What if we could perform this exercise with an infinite amount of random data points?
Exercise 2.13 Generate a 2D data set with 2 classes with 10 samples from a uniform distri-
bution (using rand) and 10 samples from a normal distribution (using randn), respec-
tively.
(a) Train an ldc and a qdc on this data, scatter the data, and plot the two decision boundaries (with plotc). Determine how many points are on the wrong side of the boundary.
(b) With w a trained classifier and a a data set, the labels that the classifier assigns
to the data points in a can be retrieved using labeld: a*w*labeld. Check your count
from the previous question using this command.
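One way to set up this experiment (a sketch; the distribution parameters and labels are our own choice):
x = [rand(10,2); randn(10,2)];       % class 1 uniform, class 2 normal
lab = [ones(10,1); 2*ones(10,1)];
a = prdataset(x,lab);
w1 = ldc(a); w2 = qdc(a);            % train both classifiers
figure; scatterd(a); plotc(w1); plotc(w2,'r')
sum(a*w1*labeld ~= lab)              % number of training errors of ldc, cf. (b)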
Exercise 2.14 Given two normally distributed classes in 1D: one with mean 0, variance 4, and class prior 9/10, and one with mean 1, variance 1, and class prior 1/10. Determine the decision boundary for this classification problem.
optional
Exercise 2.15 In case one is dealing with a classification problem in which the classes are multivariate Gaussian, the quadratic classifier is in principle the classifier that will always perform optimally, as long as there is enough data to accurately estimate the various parameters like means and covariance matrices. When the class distributions are not necessarily normal, this does not hold true, and there are cases in which the normality-based linear discriminant performs a better discrimination than the normality-based quadratic discriminant.
(a) Construct an artificial 2D problem for which the decision boundary of ldc(a) is
better than the decision boundary obtained by qdc(a).
(b) Take two uniform rectangular classes. One with bottom left coordinate (-1,-2) and
top right coordinate (0,3), the other with bottom left (0,0) and top right (1,1). This
should do the trick.
end optional
Exercise 2.16 (a) Show that the classification performance does not change for ldc and qdc when the data set is linearly (or even affinely) transformed. If this is too complicated, at least show that scaling the axes does not change the decision on the data.
(b) Construct a simple 1D or 2D example that shows the classifier might essentially
change when a nonlinear transformation of the space is performed.
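A sketch for (a): apply an invertible affine map of our own choosing and check that the training errors of ldc and qdc are unchanged (we assume getlabels extracts the label list, as in the PRTools documentation):
a = gendatb([50 50]);              % any labeled 2D dataset
T = [2 1; 0 3]; s = [5 -1];        % invertible linear map plus a shift (ours)
b = prdataset((+a)*T' + repmat(s,100,1), getlabels(a));
[a*ldc(a)*testc b*ldc(b)*testc]    % training errors: should be identical
[a*qdc(a)*testc b*qdc(b)*testc]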
Take-home messages
LDA and QDA are invariant to scaling or, generally, to affine transformations of the data.
The new means and (co)variances of some data after an affine transform can be determined using some simple rules.
Week 3
Nonparametric Classification
Objectives When you have done the exercises for this week, you
3.1.1 Histograms
The simplest way to estimate the density of one measurement is by plotting a histogram (section 2.5.6). It is easiest to do this for one measurement (Matlab has a command hist), but you can create a histogram plot of 2 measurements at the same time.
The main problem in using histograms is choosing the right bin size. Making the bins too broad will hide useful information within single bins; making them too small will result in bins containing too few samples to be meaningful.
Exercise 3.1 (a) Generate 10 1D Gaussian datasets and calculate the average histogram
and the standard deviation:
for i = 1:10
a = gauss(n,0);
h(i,:) = hist(+a,-5:5);
end;
errorbar (-5:5, mean(h), std(h));
with different numbers of samples n (e.g. 10, 100 and 1000).
(b) For what n do you think the histogram starts giving a good impression of the true
density?
(c) Repeat the above for data obtained from a Laplacian distribution (which is more
peaked than the Gaussian); use a = laplace(n,1) to generate n samples. Does the
same hold? Why?
Note how you used 2 dimensions to visualise the 1D histogram. Similarly, we can create 2D
histograms using 3 dimensions. In practice, visualising 3D data or a function of 2D data is
often as far as we can go with computers.
Exercise 3.2 Generate a 2D Gaussian dataset and use the hist2 function provided:
a = gauss(100,[0 0]);
hist2(a);
Note that hist2 is not a standard Matlab function, so it is a bit less flexible than the
standard hist function. To have a look around the histogram, try help rotate3d.
(a) For the standard number of bins (10 × 10), how many samples would you say are needed to get a good idea of the shape of the histogram? Try this by generating various numbers of samples and inspecting whether the histogram changes if you repeat the experiment.
(b) You can also play with the number of bins; see help hist2. For 5 × 5 bins, how many samples do you need? Is the representation still accurate? And for 20 × 20 bins?
Now we are going to estimate the density using a Parzen density estimator, called parzenm in PRTools (section 2.5.6).
When you want to evaluate a fit of a density model to some data, you have to define an error. One possibility is to use the log-likelihood (see the slides), defined as:
$$LL(\mathcal{X}) = \log \prod_i \hat{p}(x_i) = \sum_i \log\left(\hat{p}(x_i)\right) \qquad (3.1)$$
Note that when we fill in different values for the width parameter $h$ in the Parzen density estimation, we have different estimates $\hat{p}$. Using the log-likelihood as a criterion, we can optimize the value of this free parameter $h$ to maximise $LL$.
To get an honest estimate of the log-likelihood, we have to evaluate the log-likelihood (3.1) on a test set. That means that we have to make (or measure) new data from the same distribution as the one the training data came from. If we evaluated the log-likelihood on the data on which the probability density was fitted, we would get a too optimistic estimate of the error. We might conclude that we have fitted the data very well, while actually a new dataset from the same distribution does not fit the density at all! Therefore, if you want to evaluate the performance of an estimated $\hat{p}$, use an independent test set!
Exercise 3.4 Use the data from the same distribution as in the previous exercise to train a
Parzen density estimator for different values of h. Compute the log-likelihood of this
training set given the estimated densities (for different h):
a = gendats([20 20],1,8); % Generate data
hs = [0.01 0.05 0.1 0.25 0.5 1 1.5 2 3 4 5]; % Array of hs to try
for i = 1:length(hs) % For each h...
w = parzenm(+a,hs(i)); % estimate Parzen density
LL(i) = sum(log(+(a*w))); % calculate log-likelihood
end;
plot(hs,LL); % Plot log-likelihood as function of h
(since w is the estimated density mapping, the estimated density $\hat{p}$ for objects in a dataset a is given by +(a*w)).
(a) What is the optimal value for h, i.e. the value with maximal likelihood? Is this also the best density estimate for the dataset?
Exercise 3.5 (a) Use the same data as in the previous exercise, but now split the data into
a training and test set of equal size. Estimate a Parzen density on the training set and
compute the Parzen density for the test set. Compute the log-likelihood on both the
training and test sets for h = [0.1, 0.25, 0.5, 1, 1.5, 2, 3, 4, 5]. Plot these log-likelihood vs.
h curves:
a = gendats([20 20],1,8); % Generate data
[trn,tst] = gendat(a,0.5); % Split into trn and tst, both 50%
hs = [0.1 0.25 0.5 1 1.5 2 3 4 5]; % Array of hs to try
for i = 1:length(hs) % For each h...
w = parzenm(+trn,hs(i)); % estimate Parzen density on trn
Ltrn(i) = sum(log(+(trn*w))); % calculate trn log-likelihood
Ltst(i) = sum(log(+(tst*w))); % calculate tst log-likelihood
end;
plot(hs,Ltrn,'b-'); hold on; % Plot trn log-likelihood as function of h
plot(hs,Ltst,'r-'); % Plot tst log-likelihood as function of h
What is a good choice of h?
optional
Exercise 3.6 Use the same procedure as in Exercise 1. Change the number of training
objects from 20 per class to 100 per class. What is now the best value of h?
end optional
Estimation of the density by the nearest neighbour method is harder, but PRTools contains an implementation called knnm (section 3.3).
Exercise 3.7 (a) Again generate a dataset a = gendats([20 20],1,8) and apply the knnm density estimator for k = 1: w = knnm(a,k). Plot the density mapping using scatterd(a); plotm(w,1). What did you expect to see for k = 1? Why don't you see it?
(b) Increase the grid size, e.g. gridsize(500), and plot again.
(c) Try different values of k instead of 1. Judging visually, which do you prefer?
(d) How could you, in theory, optimise k?
Exercise 3.8 Which do you prefer: the Parzen density estimate or the nearest neighbour
density estimate?
Although PRTools contains a nearest neighbour density estimator, for illustration purposes
we can also make one ourselves. The simplest to construct is the 1-nearest neighbour.
Exercise 3.9 (a) Generate 1D data as you did in the previous exercise, but only 5 objects
per class: a = gendats([5 5],1,8). Also generate a grid of data, as follows:
>> gx = [-3:0.2:11];
(b) Compute the distances from all the grid points to the training set. You can use the distm function from PRTools:
D = sqrt(distm(gx',a));
(use the help to find out why you need to take a sqrt; note the transpose of gx, which turns each grid point into a row, i.e. an object). Check the size of matrix D. What do a row and a column in this dataset mean?
(c) To find the nearest object in the training set for each of the grid objects, you have to find the minimum over a row (or a column, depending on how you defined your matrix D). To find the minimum we first order the distances. To order them along the columns do sort(+D,1) and along the rows do sort(+D,2). You will get a sorted distance vector Dsort for each of the columns or rows.
(d) To compute the density for each of the grid objects, you have to use the general formula:
$$\hat{p}(x) = \frac{k}{n\,V(x)} \qquad (3.2)$$
where $V(x)$ is the volume of a hypersphere around $x$ with as radius the distance to the nearest neighbour ($k = 1$) and $n$ is the total number of objects in the dataset. In the 1D dataset, the volume of the hypersphere is just two times this distance, so:
phat = 1./(n*2*Dsort(:,1));
(e) Plot this density estimate (scatterd(+a); hold on; plot(gx,phat);) and save the figure, saveas(gcf,'figknn','m'). Are you satisfied with this plot? (Notice the scale of the y-axis! Restrict the axis with the axis command.)
Exercise 3.10 (a) How can you generalize the previous procedure to use the k-th nearest neighbour instead of the first nearest neighbour? Test it by inspecting the density plot; see the sketch below.
(b) How would you calculate an optimal value for k?
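For (a), only the indexing changes: the distance to the k-th nearest neighbour is the k-th column of the sorted distance matrix. A sketch, reusing the variables from Exercise 3.9 (the value of k is ours to pick):
k = 3; n = size(+a,1);           % k-th neighbour; n objects in the training set
D = sqrt(distm(gx',a));          % grid-to-training-set distances
Dsort = sort(+D,2);              % sort each row: neighbours in increasing distance
phat = k./(n*2*Dsort(:,k));      % eq. (3.2) with V(x) = 2 * (k-NN distance) in 1D
figure; scatterd(+a); hold on; plot(gx,phat)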
Exercise 3.11 How should you generalize the procedure in Exercise 3.9 to work in a p-
dimensional feature space?
In the sections above, two approaches to estimating a density are shown. The first uses the Parzen window approach, the second the nearest neighbor approach. When one of these approaches is used to estimate the class conditional probabilities, Bayes' rule can be used to construct a classifier. The table below lists the corresponding PRTools mapping names:
approach               density estimate    classifier
Parzen window          parzenm             parzenc
k-nearest neighbor     knnm                knnc
Exercise 3.12 Generate a dataset a using the function gendatb, and make a scatterplot
of the dataset. Next, train a Parzen classifier and a k-nearest neighbor classifier, and
plot the decision boundary using plotc (obviously, to train the classifiers, you have to
choose a value for h and for k):
>> w = parzenc(a,h);
>> v = knnc(a,k);
>> plotc(w); plotc(v,'r');
(a) Try different values for h and k. What are good values?
(b) How would you make an automatic procedure to optimize the values of h and k
for a classification problem?
In PRTools the naive Bayes classifier is called naivebc. In the implementation of PRTools,
for each of the features it estimates the class conditional probabilities using a histogram with
N bins.
Exercise 3.13 Compare the decision boundary of the naivebc with those of the Parzen classifier and the k-nearest neighbor classifier. Look, for instance, at the datasets gendats, gendatb and gendatd.
Exercise 3.14 Make exercises 2.39 and 2.41 from the book (problems of section 2.8).
In this last section we have a quick look at the scaling problem. Some classifiers are sensitive to the scaling of features. That means that when one of the features is rescaled to very small or very large values, the classifier will change dramatically. It can even mean that the classifier is not capable of finding a good solution. Here we will try to find out which classifiers are sensitive to scaling, and which are not.
Exercise 3.15 (a) Generate a simple 2D dataset (for instance, using gendatb) and plot the decision boundaries of the six classifiers listed in (b) below. Use k = 1 for the knnc.
(b) Make a new dataset in which the second feature is 10 times larger than in the dataset above. Do this by
newtrain = a;
newtrain(:,2) = 10*newtrain(:,2)
Train the six classifiers nmc, ldc, qdc, fisherc, parzenc, knnc and plot the decision boundaries; see the sketch below.
(c) Which classifiers are affected by this rescaling of the feature space? Why are they
affected, and others not?
(d) Is it an advantage or a disadvantage?
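A sketch of the full comparison in (a) and (b) (if your PRTools version does not accept a cell array of classifiers in plotc, plot them one by one):
a = gendatb([50 50]);                          % simple 2D training set
b = a; b(:,2) = 10*b(:,2);                     % second feature 10 times larger
W = b*{nmc,ldc,qdc,fisherc,parzenc,knnc([],1)};% train all six at once
figure; scatterd(b); plotc(W)                  % all decision boundaries in one plot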
Take-home messages
Week 4
Linear Classifiers
Objectives When you have done the exercises for this week, you
Exercise 4.3 Check that equation (3.28) from the book is equal to equation (3.26) (section 3.3).
Exercise 4.4 Implement the Least Squares solution, equation (3.45) (section 3.4), in Matlab. Test it on the same dataset that you have used for the perceptron (exercise 4.2), by plotting the decision boundary in a scatterplot.
Exercise 4.5 The Least Squares solution is implemented in PRTools as well, and it is
called the Fisher classifier, fisherc. Train a Fisher classifier on the same data as you
used before, and compare the two decision boundaries. Are they equal?
The bias-variance dilemma indicates that when a model is fitted on a finite (small) dataset, a trade-off must be made between the bias and the variance (section 3.5.3). The bias is the difference between the average model output and the true expected output. The variance is the average difference between the specific output of a model and the average model output. Or, loosely speaking, the bias measures whether the model is flexible enough to fit the truth, and the variance measures how stable the estimate is. In this section we try to identify this in some classifiers.
Exercise 4.7 Repeat the previous exercise, but instead of using a Fisher classifier, use a Parzen classifier. How do the variance and the bias change when you switch from the Fisher classifier to the Parzen classifier?
Take-home messages
When you have a limited dataset, you need to find a good trade-off
between the bias and the variance.
Week 5
Objectives When you have done the exercises and read the text for this week, you should
be able to derive the expression for the posteriors in the case of logistic discrimination,
have an idea of the basic shape of the logistic posterior,
be able to describe the underlying idea of the linear SVM.
Shift the first point above, $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$, to $\begin{pmatrix} 0 \\ -1 \end{pmatrix}$.
(b) How does the new maximum margin classifier look? What happened to the number of support vectors?
Exercise 5.3 (a) Demonstrate, possibly graphically/geometrically, that the support vector
classifier is sensitive to feature scaling. Hint: this can be done in 2D based on a training
set of size 3 (like in (a) and (b)) and a single test point.
Exercise 5.4 (a) Reconsider the 2D configurations from 5.2 (a)(b) above and compare the
solution of the Fisher classifier to those obtained by means of an SVM. In what cases
do they differ? Do you see the pattern?
optional
Exercise 5.6 (For this exercise you might want to set up a script. . . )
Generate a 30-dimensional training set trn with 50 samples using gendats (read help
gendats!). Generate, in addition, a test data set tst with 1000 objects.
(a) Train the following classifiers: nmc, ldc, fisherc, loglc, and svc, and test these
with the test set. Hint: testc can be used to test trained classifiers on a test set (again:
read the help!) and with braces one can train (and test) many classifiers in one go. For
instance, trn*{qdc,knnc([],1)} can be used to train a quadratic discriminant and a
1NN on trn simultaneously.
(b) Repeat the previous steps of training and testing a couple of times with newly
generated training sets. Overall, which seems the best performing classifier? Can you
explain this?
(c) Reduce the training set size to 30 objects and redo the experiment. Does your conclusion change? What happens to ldc when the number of samples is so small relative to the dimensionality?
(d) Redo (a) to (c) for samples from gendatd instead of gendats. Any remarkable changes? What is the main difference in the behavior of the NMC between the sample sizes of 50 and 30?
(e) Can you construct a classification problem in which a linear SVM would clearly outperform a linear discriminant classifier? How about the Fisher classifier vs. the SVM?
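A sketch of the basic experiment in (a); we assume gendats takes the class sizes, the dimensionality, and the class distance as arguments — check help gendats for the exact order:
trn = gendats([25 25],30,2);         % 50 training samples in 30 dimensions (assumed syntax)
tst = gendats([500 500],30,2);       % a large test set from the same problem
W = trn*{nmc,ldc,fisherc,loglc,svc}; % train all classifiers in one go
tst*W*testc                          % test error of each classifier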
Take-home messages
Week 6
Objectives When you have done the exercises and read the text for this week, you should
be able to give the link between a kernel function and its (implicit) feature mapping,
be able to kernelize simple, inherently linear classifiers,
have a basic understanding of the nonlinearities neural networks can create,
understand the separation potential of linear classifiers in high-dimensional spaces.
Exercise 6.1 Consider a 1D data set with two samples in class A ($x_1 = 0$, $x_2 = 3$) and one in class B ($x_3 = 2$).
(a) Construct a neural network with one hidden layer, using neurons that have a sigmoid activation function, that can perfectly separate classes A and B. How many hidden units does it need?
(b) What if one allows an arbitrary activation function for every neuron? With how few neurons can one solve this task consisting of three labeled samples?
Exercise 6.2 The function bpxnc implements a simple multilayer neural network, which can be trained by backpropagation. For instance, w = bpxnc(a,2,1000) trains on data set a a network with a single hidden layer with two nodes.
Investigate for very(!) small networks and very(!) small sample sizes from two classes (XOR?) how such networks behave. (You had better type prwaitbar off.) Repeat, for example, the same line of code that trains some neural network and plot the different classifiers obtained in the same scatter plot of the data; see the sketch below.
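A sketch of such an experiment on a tiny XOR-like problem (the class positions and the small offset used to get eight distinct points are our own choices):
x = [0 0; 1 1; 0 1; 1 0];                       % XOR configuration
a = prdataset([x; x+0.1],[1 1 2 2 1 1 2 2]');   % eight points, two classes
prwaitbar off
figure; scatterd(a); hold on
for i = 1:5
   w = bpxnc(a,2,1000);                         % retrain: solutions vary per run
   plotc(w)
end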
Exercise 6.3 Consider Figure 4.8 in the book. Consider a sample in all of the seven different regions and label these samples A or B according to the labels in the book, except for the sample in area 110, which should also be labeled A (instead of B). So, we have four samples from A and three from B.
(a) Can this problem be solved perfectly with a multilayer perceptron (i.e., using hard
thresholds as activation functions and not smooth sigmoids) with one (1) hidden layer
with three (3) neurons? How can you see this?
(b) How many regions can this network classify correctly at most?
(c) How many different optima does this multilayer perceptron have for which at most
one region is being misclassified? That is, in how many different ways can the network
reach the maximum number of correctly classified regions as found in b? Explain your
answer.
Exercise 6.4 Design a bpxnc that works well (and fairly stably?) on the data set a = gendatb.
Exercise 6.8 The function svc can be used to construct both linear and nonlinear support vector machines. Check the help to see how kernels different from the linear one can be employed. Type help proxm to get an overview of the possible kernels and their parameters.
(a) On a = gendatb([20 20]), train an svc with a radial basis kernel, i.e., the
Gaussian kernel, for kernel widths that vary from fairly small (0.1?) to fairly large
(10?). Check with a large (enough) independent banana test set how the performance
varies for the different choices of kernel widths.
(b) How does the kernel width of parzenc relate to the width of the radial basis
function?
(c) Why can the svc, potentially, perform much faster at test time than the Parzen
classifier?
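A sketch for (a); we assume the radial basis (Gaussian) kernel is specified through proxm with type 'r' — check help proxm and help svc for the exact syntax in your version:
a = gendatb([20 20]);               % training set
t = gendatb([500 500]);             % large independent banana test set
hs = [0.1 0.5 1 2 5 10];            % kernel widths to try
for i = 1:length(hs)
   w = svc(a,proxm([],'r',hs(i)));  % SVM with Gaussian kernel of width hs(i)
   e(i) = t*w*testc;                % test error for this width
end
plot(hs,e)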
optional
Exercise 6.9 Consider the linear least squares problem as in Subsection 3.4.1 of the book. We aim to kernelize its solution mapping $\hat{w}^T x$.
(a) Express the minimizing solution $\hat{w}$ explicitly in terms of the training samples.
(b) Using the well-known identity $(XX^T)^+ X = X(X^T X)^+$, where $^+$ indicates the Moore-Penrose pseudoinverse, rewrite $\hat{w}^T x$ in terms of inner products between training instances $x_i$ and the test point $x$.
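For reference, one route through this derivation (a sketch; $X$ denotes the $n \times l$ matrix with the training samples as rows and $y$ the vector of targets): the minimum-norm least squares solution is $\hat{w} = (X^TX)^+X^Ty$, and with the identity above,
$$\hat{w}^T x = y^T X (X^T X)^+ x = y^T (X X^T)^+ X x = y^T K^+ \kappa(x),$$
where $K = XX^T$ is the matrix of inner products $x_i^T x_j$ between training instances and $\kappa(x) = Xx$ is the vector of inner products $x_i^T x$ with the test point, so the mapping can be evaluated using inner products only.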
Exercise 6.10 The next experiment might take too much time or too much memory.
Construct a data set with a sharp corner in it. Take, for instance, X = rand(1000,2); and a = prdataset(X,X(:,1)>.5&X(:,2)>.5). Study how closely a support vector machine follows the border, and especially the corner, for different kernels and different settings of the penalty term C. Definitely try the Gaussian kernel with different sigma parameters in combination with various Cs. (If possible, you might want to consider using libsvc instead of svc.)
end optional
Take-home messages
Neural nets might get stuck in local optima or, generally, bad solutions.
Classifiers like NMC and the Fisher linear discriminant can be kernelized.
Week 7
Objectives When you have done the exercises and read the text for this week, you should
N.B.! The time has come that, during the computer exercises, you had better not only spend time on the exercises for the current week but also on the final assignment. You can find the latter in Appendix A.
Exercise 7.1 (a) Train a decision tree, using treec, with splitting criterion 'infcrit' on a gendats data set with 50 samples in every class. Vary the degree of pruning up to, say, 12 (help treec!) and check how the classification error varies with it on a large enough test data set (i.e., to get a good impression of the true performance of the classifier); see the sketch below. (In addition, you might have to average over different training sets to get a good impression of the expected variability of the error rate for different degrees of pruning.)
What can be observed?
(b) Can you explain the previous behavior? In what way do you think the classification boundaries differ when comparing a high degree of pruning with a low degree? You might want to check this using scatterd etc. (Note that you might have to set the gridsize to, say, 1000 or so, and redraw the plot.)
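A sketch of the pruning experiment in (a) (training/test set sizes are our choice; the averaging over different training sets is left out for brevity):
trn = gendats([50 50]);          % 50 samples per class
tst = gendats([1000 1000]);      % large test set for a reliable error estimate
for p = 0:12
   w = treec(trn,'infcrit',p);   % tree with pruning level p
   e(p+1) = tst*w*testc;
end
plot(0:12,e)                     % error as a function of the degree of pruning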
Exercise 7.2 (a) Redo the experiments from the previous exercise but now with gendatd.
What changes in results do you note? Can you explain these?
(b) What if the two class means move closer to each other in the first feature dimension?
What kind of change in behavior with varying degrees of pruning do you expect (apart
from an increase of the error rate)?
optional
Exercise 7.3 (a) This is an open problem, I think. . . Redo the experiment in Exercise 7.1
but now with a training set size of 10 per class (so 20 in total). Explain the error rate
behavior with varying pruning degrees.
Exercise 7.4 (a) (Another open problem?) Any idea what the splitting criterion maxcrit
does?
end optional
7.2 Combining
Exercise 7.5 We are going to combine classifiers trained on different representations of hand
written digits. We compare these to the individual classifiers and the classifier trained
on the data set having all features.
First we construct the various data sets:
a1 = mfeat_kar, [b1,c1,I,J] = gendat(a1,0.3);
a2 = mfeat_zer, b2 = a2(I,:); c2 = a2(J,:);
a3 = mfeat_mor, b3 = a3(I,:); c3 = a3(J,:);
The kar, zer, and mor indicate the type of features we are considering.
(a) Train three nearest mean classifiers on the different data sets:
w1 = nmc(b1)*classc; w2 = nmc(b2)*classc; w3 = nmc(b3)*classc;
(The classc is a trick to make sure that classifiers output posteriors.) Estimate the
three corresponding error rates on the corresponding c-data sets, e.g., for the first data
set: compute c1*w1*testc.
(b) Train and test the same classifier on the data sets b = [b1 b2 b3] and c = [c1
c2 c3], which contain all features from the three data sets above. Compare its error
rate to those from (a).
Given the trained individual classifiers w1, w2, and w3, a fixed combiner can be con-
structed as follows: v = [w1; w2; w3]*meanc. This is the mean combiner. Other
combiners that PRTools allows are minc, maxc, prodc, medianc, and votec of which it
should be clear what kind of combining rule these implement.
(c) Test some combiners, by means of [c1 c2 c3]*v*testc, and compare the results
to the four earlier error rates you estimated.
(d) Looking back at the exercise, do you understand why the special form of the
gendat command, in which the I and J occur, had to be used? What would we have
been doing if we would just have generated the b and c data sets by applying gendat
to the three a data sets individually? Obviously, you may want to check help gendat
for an initial clue and to see what I and J contain.
Exercise 7.6 We consider the classical use of bagging, for which we reconsider exercise 7.1 (a).
(a) Basically, redo the experiments from 7.1 (a) but just compare the unpruned tree with its bagged version. For the latter you need something like baggingc(trn,treec([],'infcrit',0)); check help baggingc for more on bagging.
(b) You may want to check that even when using gendatd instead of gendats some improvement can be obtained.
(c) What does the stabilization of the decision tree by means of bagging effectively do? You could compare a plot of the classification boundary of the single decision tree with that of a bagged version to get an idea.
Exercise 7.7 (a) Generate a data set by means of gendath and train two differently behaving linear classifiers on it. Take, for instance, w1 equal to nmc and w2 to ldc. Generate posterior probabilities from these by computing p1 = a*w1*classc and p2 = a*w2*classc and combine them in one data set by p = [p1 p2].
How many features does this new data set have and why is this so?
(b) Scatter plot the first and third dimension of p. What do these dimensions correspond to? Try to understand that the original classifiers correspond to horizontal and vertical lines at 0.5 in this same scatter plot.
(c) Straight lines not parallel to any of the two axes actually combine the posteriors from the two classifiers. Are there combiners possible that indeed seem to improve upon the two individual classifiers?
Exercise 7.8 Just as simply as fixed combiners, one can construct trained combiners. An example is v = [nmc*classc ldc*classc]*fisherc, which takes the two classifier outputs of nmc and ldc and combines them through training a fisherc on top of it. All is simultaneously trained by executing w3 = a*v on a data set a.
(a) Take the data set from 7.7 (gendath), construct a training and a test set, and compare nmc, ldc, and the trained combiner.
Exercise 7.9 The AdaBoost combining scheme, adaboostc in PRTools, enables one to experiment with different AdaBoost schemes based on different base classifiers. The classical one is the so-called decision stump (stumpc, check the help).
(a) Generate a standard banana data set a = gendatb and visualize the decision boundaries that are obtained by means of boosting when you combine 1, 10, 100, and 1000 base decision stumps, respectively. N.B. You might want to use the following form to suppress annoying messages while adaboostc is training: adaboostc(a,stumpc,N,[],0), with N the number of base classifiers. The final 0 silences the verbose output.
(b) When combining 1000 base classifiers, AdaBoost seems to suffer from what is called overtraining. How can you see this in the scatter plot? How could you see this from the error rates on an independent test set?
(c) Repeat the experiment but replace the base classifier with nmc and with fisherc. Are there noteworthy differences between the various solutions obtained? Also compare results between different base classifiers.
(d) What can we expect AdaBoost to do when we employ it in combination with, say, a 1NN or an unpruned decision tree? Could one expect any improvements from such a scheme?
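A sketch of (a), using the adaboostc call form given above:
a = gendatb;                            % standard banana set
figure; scatterd(a); hold on
for n = [1 10 100 1000]
   plotc(adaboostc(a,stumpc,n,[],0))    % n boosted decision stumps, silent mode
end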
Take-home messages
...
Don't forget to work on the final assignment: now and in the weeks to come.
Week 8
Classifier evaluation
Objectives When you have done the exercises for this week, you
Exercise 8.1 In this exercise we investigate the difference in behavior of the error on the
training and the test set. Generate a large test set and study the variations in the
classification error based on repeatedly generated training sets:
>> t = gendath([500 500]);
>> a = gendath([20 20]); t*ldc(a)*testc
Repeat the last line e.g. 30 times.
(a) What causes the variation in the error?
Now do the same for different test sets:
>> a = gendath([20 20]);
>> w = ldc(a);
>> t = gendath([500 500]); t*w*testc
Repeat the last line e.g. 30 times.
(b) Again explain what causes the variance observed in the results.
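The repetitions in (a) can be put in a small script, for example:
t = gendath([500 500]);      % one large, fixed test set
for i = 1:30
   a = gendath([20 20]);     % a new training set in every repetition
   e(i) = t*ldc(a)*testc;
end
[mean(e) std(e)]             % spread caused by training set variation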
The function cleval allows you to calculate so-called learning curves. These are curves that
plot classification errors against the number of points in the training data set. (Check help
cleval for the details.)
In this section we are going to study some learning curves for different data sets and different classifiers. Some of the plots and numbers that will be generated demonstrate certain salient and/or noteworthy behavior.
After reading a question, and before implementing anything, try to think about the outcome to be expected. Eventually, say at the end of this course, you should not be surprised by many of these things anymore!
Exercise 8.2 Generate Highleyman classes (gendath) with 1000 samples per class. Enlarge the feature dimensionality of this set by adding 60 dimensions of class-independent randomness, i.e., plain noise. Use cleval to generate learning curves for nmc, ldc, and qdc using 64, 128, 256, and 512 objects in the training set (make sure that you repeat often enough...); see the sketch below. Use plote to plot these curves in a single figure (check the help).
(a) Can you explain the overall behavior of these curves?
(b) Explain why the curves intersect. Which classifier performs best?
(c) What do you expect the limiting behavior of the learning curves to be, i.e., if we
were able to train on more and more data?
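For Exercise 8.2, one possible way to append the noise features and produce the curves
is as follows (a sketch; if your PRTools version does not accept a cell array of classifiers
in cleval, call it once per classifier):
>> a = gendath([1000 1000]);
>> a = prdataset([+a randn(2000,60)],getlab(a));   % append 60 noise features
>> e = cleval(a,{nmc,ldc,qdc},[64 128 256 512],5); % 5 repetitions per size
>> figure; plote(e)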
optional
Exercise 8.3 Redo the previous experiments but substitute gendath with gendats and
gendatd. Don't forget to add the 60 noise features.
(a) What kind of typical differences do you see when comparing behaviors for the
different data sets? Explain these differences.
end optional
Exercise 8.4 (a) Take your favorite data set and study a learning curve for the first nearest
neighbor classifier (1NN). What can be expected of the learning curve of the apparent
error?
8.2.2 Dipping
Exercise 8.5 Study the learning curve for nmc in combination with gendatc for very, very,
very small training set sizes.
(a) What seems to be happening? And is this expected?
optional
8.2.3 I Challenge You
Exercise 8.6 This is more a challenge than an exercise. You might consider skipping it and
moving on to the next exercise. Then again, you also might take on this challenge!
(Anyway, do keep track of time somewhat.)
(a) Create a two-class problem in which ldc, on average, outperforms qdc for large
sample sizes.
end optional
Like learning curves, which typically plot the classification error against the number of
training samples, feature curves can also be informative. The latter study how the
classification error varies with the feature dimensionality. We can use the
clevalf command to perform such a study.
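For instance (a sketch; the fourth argument of clevalf is the fraction, or number, of
objects used for training):
>> a = mfeat_kar;
>> e = clevalf(a,qdc,[4 8 16 32 64],0.5,5); % 50% for training, 5 repetitions
>> plote(e)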
Exercise 8.7 Load the data set mfeat_kar and make a feature curve for 4, 8, 16, 32, and 64
features using 50% of the data for training, based on qdc.
(a) When redoing this experiment, would you expect the same curve? Why, or why
not?
(b) What, generally, happens to the feature curve when 40% or 80% of the data is
used for training? Can you explain this?
8.3 Cross-Validation
The prcrossval function allows you to perform a cross-validation on a given data set using
a particular classifier (or even a whole bunch of them in one go).
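A minimal usage sketch, repeating the cross-validation by hand to obtain a standard
deviation:
>> a = gendatb([10 10]);
>> e = zeros(10,1);
>> for i = 1:10
     e(i) = prcrossval(a,knnc([],1),5); % 5-fold cross-validation
   end
>> [mean(e) std(e)]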
Exercise 8.8 Generate a small data set using gendatb, say, with 10 objects per class.
(a) Using n-fold cross-validation, make plots of the error rates for kNN and 1NN over
different values of n. Also calculate the standard deviation of the error estimate, e.g.,
by performing the cross-validation 10 times (check the help).
(b) What do you notice about the estimated error rates? What is the general
trend (maybe you should redo the data generation and the cross-validation a couple
of times)?
(c) What happens to the variance of the estimates for varying n? Again, we are inter-
ested in the general trend.
(d) How would the observations change if one were to repeat the experiments with a
much larger dataset? Would they change at all?
Exercise 8.9 (a) Let us look back at exercise 1 where a Parzen density is estimated on
some data, and use the same data in this exercise. Now let PRTools find the optimal
h using leave-one-out cross-validation. This can simply be performed by not specifying
h, i.e. by calling w = parzenm(+a). To find out what value of h the function actually
uses, you can call h = parzenml(a). Does the h found correspond to the one you
found to be optimal above?
A confusion matrix is of size C × C, where C is the number of classes. An element C(i, j)
encodes the confusion between the classes i and j. More precisely, C(i, j) gives the number
of objects that are from class i, but are labeled as class j. Nothing confusing about it.
Exercise 8.10 To create a confusion matrix you may use something like the following code.
>> lab = getlab(test_set);
>> w = fisherc(train_set);
>> lab2 = test_set*w*labeld;
>> confmat(lab,lab2)
An alternative is:
>> confmat(test_set*w)
(a) What is stored in lab and what does lab2 contain?
Exercise 8.11 Load the digits data sets mfeat_zer and mfeat_kar. Split the data up in a
training and a test set (using gendat).
(a) What are the error rates on the test sets?
(b) Use confmat to study the error distributions more closely. Where do the larger
parts of the errors stem from?
(c) Can you explain which of the two feature sets is insensitive to image rotations?
Exercise 8.12 Given a confusion matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$ for a two-class problem.
(a) Give an estimate of the expected cost (per object) when misclassifying class 1
as class 2 is twice as expensive as the other way around and assuming that a correct
classification incurs no cost.
The error rate can be improved simply by not classifying certain objects, typically the ones
that are near the decision boundary and in that sense difficult to classify reliably. Studying
this so-called reject option is facilitated by the functions reject and rejectc.
Given a classification result, e.g. d = tst*ldc(trn), the classification error is found by e =
testc(d). The number of columns in d equals the number of classes and an object is assigned
to the class for which the value in its corresponding row is the largest. Basically, the row
contains the posterior probabilities (or something equivalent) and therefore will typically sum
up to one.
To reject objects, a threshold is used to determine when this largest value is not suffi-
ciently large. The routine e = reject(d) determines the classification error and the reject
rate for a set of such threshold values. The errors and reject frequencies are stored in e.
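As an illustration, the reject curve itself can be produced and plotted like this (a sketch,
assuming a training set trn and a test set tst exist):
>> d = tst*ldc(trn); % classifier outputs on the test set
>> e = reject(d);    % error and reject rate for a range of thresholds
>> plote(e)          % plot the reject curve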
Exercise 8.13 Make two data sets both with 1000 samples using gendatc. One for training,
one for testing.
(a) Train three classifiers fisherc, qdc, and knnc([],1) on the training set and plot
the reject curve of the test set.
(b) Explain what you see. Explain the differences between the curves.
(c) What might we conclude from the shape of the curve belonging to fisherc?
Can you explain why this happens?
(d) Determine the reject rate for which qdc makes an error of (about) 5% and compute
with rejectc the rejecting classifier for this error rate (check the help :-)
(e) Plot the classifier and the rejecting classifier in a scatter-plot. Explain what you
see.
The receiver operating characteristic (ROC), which is too often wrongly referred to as the
receiver-operator curve or the like, is a tool to study the effect of changing decision boundaries
in the two-class case due to, for instance, changing misclassification costs or class priors.
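A sketch of its use (assuming that, as with the other evaluation curves in this chapter,
plote can plot the result):
>> a = gendath([50 50]); t = gendath([500 500]);
>> r = roc(t*ldc(a)); % ROC of ldc on an independent test set
>> plote(r)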
Exercise 8.14 Generate two moderately sized Highleyman data sets. Train an nmc, an ldc,
and a 3-nn on this data.
(a) Using roc, plot the three ROCs.
(b) If one prefers a very small error on class 1, which classifier would be the chosen
one?
Take-home messages
ROCs give yet another view of a problem at hand. They allow one
to study classification behavior under varying priors and costs.
Week 9
Clustering
This week, we will discuss the problem of clustering; this practical session is intended to
familiarise you with the different techniques that were discussed, more specifically hierarchical
clustering, the K-means algorithm and mixtures-of-Gaussians.
Earlier in this course, you learned how to select features, i.e. you learned which variables
belong together. This week we will shift the emphasis to determining which objects belong
together in groups or clusters. This clustering process enables us to extract structure from
the data, and to exploit this structure for various purposes such as building a classifier or
creating a taxonomy of objects.
The most difficult part of clustering is to determine whether there is truly any structure
present in the data and if so, what this structure is. To this end, next week we will employ
cluster validity measures to estimate the quality of the clustering we have obtained.
In the lectures we discussed hierarchical clustering at length. There are basically two choices
that need to be made in hierarchical clustering in order to construct a dendrogram: the
dissimilarity measure between samples and the type of linkage between clusters.
In this course, we will only employ the Euclidean distance between two samples as a measure
of dissimilarity. As you will recall from the lecture, there are three types of linkage: complete,
single and average linkage. Once the dendrogram has been constructed, we need to cut the
dendrogram at an appropriate level to obtain a clustering. Simple interactive routines are
available to create, visualise and cut the dendrograms.
Exercise 9.1 Start with the hall dataset, an artificial dataset with clear structure.
(a) Load the dataset (load hall). Use scatterd to visualise it. How many clusters
are visible in the plot?
(b) What is the most suitable clustering?
Exercise 9.2 Load dataset rnd. Make a scatterplot. This is a uniformly distributed dataset,
with no apparent cluster structure. We will hierarchically cluster this dataset to get an
idea of what a dendrogram looks like when there is no structure in the data.
(a) Perform hierarchical clustering with interactclust on the rnd dataset with com-
plete linkage: interactclust(+a,'c');
(b) Cut the dendrogram at arbitrary levels to see how the samples are clustered. Look
at the structure of the dendrogram. What is apparent?
(c) Repeat this for single and average linkage. Do you observe the same behavior as
with complete linkage?
Exercise 9.3 (a) Perform hierarchical clustering with interactclust on the hall dataset
with complete linkage: what do the lengths of the vertical stems in the dendrogram tell
us about the clustering?
(b) Cut the dendrogram at different levels, i.e. inspect different numbers of clusters
by left-mouse clicking at the desired level. Can you think of ways in which a good
clustering can be defined?
(c) Can you devise a simple rule-of-thumb (in terms of vertical stem lengths) for finding
a good clustering in the dendrogram?
Exercise 9.4 (a) Now perform single linkage hierarchical clustering
(interactclust(+a,'s');) on the hall dataset. Do you notice any significant
differences with the complete linkage dendrogram?
(b) Do you notice any differences with the average linkage dendrogram
(interactclust(+a,'a'))?
Exercise 9.5 (a) Load and plot the cigars dataset. What is an appropriate number of
clusters? Why?
(b) Perform single linkage hierarchical clustering on the cigars dataset. Can you
obtain the desired clustering with single linkage hierarchical clustering?
(c) Now perform complete linkage hierarchical clustering on the cigars data. Does
the optimal number of clusters obtained with single linkage hierarchical clustering also
produce acceptable results in complete linkage hierarchical clustering? If not, why
not?
Exercise 9.6 (a) Perform complete linkage hierarchical clustering on the messy dataset.
What is an appropriate number of clusters? Why?
(b) Now perform single linkage clustering on this dataset. Select the same number of
clusters that you determined to be optimal with complete linkage. Is it also optimal in
this case?
(c) What is an appropriate number of clusters for the single linkage case? Why?
optional
Exercise 9.7 Perform hierarchical clustering on a dataset of your choice. Bear in mind
that only the first two dimensions of higher dimensional datasets are displayed in the
scatterplot. To visualise other dimensions, reorganise the columns in your dataset so
that the appropriate variables are in columns one and two.
end optional
The operation of this algorithm was explained in the lecture. In this practical session, we
will familiarise ourselves with the K-means algorithm by applying it in various settings. A
function kmclust is provided to perform K-means clustering. Take a moment to familiarise
yourself with the input, output and operation of this function.
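If you want a quick, self-contained way to inspect a K-means solution, Matlab's own
kmeans (Statistics Toolbox) can serve as a stand-in for kmclust; a sketch:
>> X = +a;                        % raw data matrix from a PRTools dataset
>> [idx,prot,sumd] = kmeans(X,3); % labels, prototypes, within-scatter per cluster
>> scatterd(prdataset(X,idx))     % inspect the resulting clustering
>> sum(sumd)                      % total within-scatter of this solution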
Exercise 9.8 The initialisation of the K-means algorithm (positioning of the prototypes)
can be done in several ways. Two simple ways are:
1. generating a number of random positions in the feature space as prototypes;
2. selecting a number of the samples at random as prototypes.
You might recall that the K-means algorithm is a simple iterative way of attempting to find
those prototype positions that correspond to the global minimum of the total within-scatter.
However, there are two factors that could cause this minimisation procedure to get stuck
without reaching the global minimum. First, K-means only updates the positions of the
prototypes such that the within-scatter decreases. Secondly, the within-scatter is not a
monotonic function of the prototype positions. This implies that, from a specific set of
non-optimal prototype positions, the road to the optimum (global minimum) may contain
an initial increase in within-scatter before reaching the optimum. If K-means ends up in
such a position (a local minimum), it gets stuck. The following exercise illustrates this.
Exercise 9.9 (a) Load dataset triclust. Inspect it with the aid of a scatterplot. There
are clearly three equally spaced, spherical clusters.
(b) Run your K-means algorithm several times, for g = 3 clusters, until a solution is
found where there are two prototypes in a single cluster, and the third is positioned
between the remaining two clusters. Why is this a local minimum?
(c) Does hierarchical clustering also suffer from this problem?
(d) What can we do to overcome the local minimum problem?
During the lectures, the concept of clustering based on the quality of a mixture-of-Gaussians
density fit to the data was discussed. The operation of the Expectation-Maximisation (EM)
algorithm, which is employed to estimate the parameters of the mixture model, was also
discussed in detail. In the following set of exercises, mixture model clustering will be explored
with the aid of the function em.
Exercise 9.10 (a) Load the triclust dataset and play around with the function
em (em(data,nmodels,type,0,1,0);). Vary the number of Gaussians employed
(nmodels) in the mixture model, and also vary the type of Gaussian employed. Re-
late the type ('circular', 'aligned', 'gauss') of the Gaussian to its covariance ma-
trix.
(b) On the cigars dataset, fit an unconstrained Gaussian (type = 'gauss') mixture
model using the function em. For the number of clusters g, assume the following values:
g = 1, 2, 3, 4, 5. Which g do you expect to be the best?
(c) Repeat this 20 times (without plotting, i.e. setting the plot mix and plot resp
arguments to zero) and plot the average log-likelihood (first output argument of em)
and its standard deviation as a function of g (hint: use the errorbar function). What
characteristic of the log-likelihood function indicates the preferable number of models
(clusters)?
(d) Do the same for the rnd dataset (which has no structure) to confirm the relationship
between the log-likelihood curve and the number of clusters. What is the behavior of
the curve for the random dataset?
(e) Now try clustering the messy dataset. What is the best shape to employ for the
Gaussians? What is the optimal number of clusters?
Clustering may be applied for image segmentation, e.g. on the basis of colors.
Exercise 9.11 A full-color image may be segmented by clustering the color feature space.
For example, read the roadsign10 image:
>> im = imread('roadsign10.bmp');
>> show(im)
The three color bands may be used to segment the image based on pixel values only.
Convert the image to a dataset using the im2feat command. Use a small subset of
pixels for finding four clusters in the 3D color space:
>> a = im2feat(im); % create a dataset
>> trainingset = gendat(a,500) % create a small training subset
>> [d,w] = emclust(trainingset,nmc,4) % cluster the data
The retrieved classifier w may be used to classify all image pixels in the color space:
>> lab = classim(a,w);
>> figure
>> imagesc(lab) % view image labels
Finally, we will replace the color map used to render the label image with the mean vector
of each cluster:
>> aa = prdataset(a,lab(:)) % create labeled dataset
>> map = +meancov(aa) % compute class means
>> colormap(map) % set color map accordingly
(a) Find out the correspondences between the image parts (sky, road, etc.) and the
clusters by visualizing the outputs of the classifier w on the entire image.
Take-home messages
Week 10
Cluster validation
In this week, we will explore ways to guide our choice when we judge clustering results and
determine the number of clusters.
In the lectures this week, various clustering approaches were presented. Last week you applied
hierarchical clustering with different linkage types and the Euclidean dissimilarity measure,
as well as K-means clustering, to several datasets. One aspect of clustering that we avoided
last week was the determination of the optimal number of clusters in the dataset. More
specifically, given a particular hierarchical clustering, one needs to determine where to cut
the dendrogram to produce a clustering of the data, and one needs to know which value of
g to use as input to the K-means clustering algorithm. Last week you had a small taste of
cluster validation when you tried to determine the optimal number of clusters to employ in
the mixture-of-Gaussians clustering algorithm.
This practical session is intended to familiarise you with the different cluster validity measures
that were discussed in the lecture. We will achieve this by:
1. inspecting fusion graphs obtained with hierarchical clustering;
2. coding and applying the within-scatter in the K-means clustering algorithm to estimate
the optimal number of clusters;
3. coding and applying the Davies-Bouldin index to various algorithms and datasets.
10.1.1 Fusion graphs
The fusion graph plots the fusion level as a function of the number of clusters (g). For example,
the fusion level at g = 2 represents the (single, complete, average) link distance between the
clusters that are merged to create two clusters from three clusters. A simple heuristic to
determine the number of clusters in hierarchical clustering is to cut the dendrogram at the
point where we observe a large jump in the fusion graph.
The following three exercises focus on the estimation of the number of clusters based on the
fusion graph.
Exercise 10.2 (a) Load the triclust dataset. Perform single linkage hierarchical
clustering and display the fusion graph by setting the third optional argument:
interactclust(triclust,'s',1). Where do you observe the largest jump?
(b) Cut the dendrogram at this position, to check whether this is a reasonable number
of clusters. Now perform complete linkage hierarchical clustering. Does the fusion graph
give a clear indication of the number of clusters in the data? If not, why not?
Exercise 10.3 (a) Load the hall dataset. Perform single linkage hierarchical clustering
and display the fusion graph. What do you observe in the fusion graph?
Exercise 10.4 (a) Finally, load the messy dataset. Perform single linkage hierarchical clus-
tering. According to the fusion graph, where should the dendrogram be cut?
(b) Does a satisfactory clustering result from cutting the dendrogram at this point?
Motivate.
(c) Now perform complete linkage clustering. Is the clustering suggested by the fusion
graph better or worse than the clustering obtained with single linkage clustering?
optional
Exercise 10.6 Generate a dataset consisting of a single Gaussian, i.e. use d = 0. Plot
the same within-scatter curve.
(a) How does this curve differ from the other curves?
end optional
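For reference, the within-scatter curve used in the exercises above can be produced along
the following lines (a sketch, again using the Statistics Toolbox kmeans as a stand-in for
kmclust):
>> J = zeros(10,1);
>> for g = 1:10
     [idx,prot,sumd] = kmeans(+a,g,'Replicates',5); % best of 5 random starts
     J(g) = sum(sumd);                              % total within-scatter for g clusters
   end
>> plot(1:10,J)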
D.L. Davies and D.W. Bouldin¹ introduced a cluster separation measure which is based on
both the within-scatter of the clusters in a given clustering and the separation between the
clusters. Formally, this measure is known as the Davies-Bouldin index (DBI). It assumes that
clusters are spherical, and that a desirable clustering consists of compact clusters that are
well-separated.
Suppose we wish to compute the DBI for a clustering consisting of n objects assigned to g
clusters. We can compute a score for every possible pair of clusters in this clustering, which
is inversely proportional to the distance between the cluster means and directly proportional
to the sum of the within-scatters in the pair of clusters. This score is given by
$$R_{jk} = \frac{\sigma_j + \sigma_k}{\|\mu_j - \mu_k\|}, \qquad j,k = 1,2,\ldots,g;\; k \neq j. \tag{10.1}$$
Here $\mu_j$ is the mean of all the objects in cluster $j$ and $\sigma_j$ is the within-scatter of cluster $j$,
given by:
$$\sigma_j = \sqrt{\frac{1}{n_j} \sum_{x_i \in C_j} \|x_i - \mu_j\|^2}, \tag{10.2}$$
where $C_j$ is the set of objects associated with cluster $j$ and $n_j$ is the number of objects in
cluster $j$. The score $R_{jk}$ is small when the means of clusters $j$ and $k$ are far apart and the
sum of the within-scatters for these clusters is small. Since cluster $j$ can be paired with $g-1$
other clusters, resulting in $g-1$ pair-scores $R_{jk}$, $k \neq j$, a conservative estimate
of the cluster score for cluster $j$, when paired with all other clusters, is obtained by taking
the maximal pair-score involving cluster $j$:
$$R_j = \max_{k \neq j} R_{jk}. \tag{10.3}$$
The Davies-Bouldin index of the complete clustering is then determined by averaging these
maximal pair-scores for all clusters:
$$I_{DB} = \frac{1}{g} \sum_{j=1}^{g} R_j. \tag{10.4}$$
¹ D.L. Davies and D.W. Bouldin, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, pp. 224-227, 1979.
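Equations (10.1)-(10.4) can also be coded directly; a minimal sketch (the function and
variable names are ours, not part of PRTools):
function dbi = dbindex(x,lab)
% Davies-Bouldin index for data x (n-by-d) and cluster labels lab (values 1..g)
g = max(lab);
mu = zeros(g,size(x,2)); sigma = zeros(g,1);
for j = 1:g
  xj = x(lab==j,:);
  mu(j,:) = mean(xj,1);
  dif = xj - repmat(mu(j,:),size(xj,1),1);
  sigma(j) = sqrt(mean(sum(dif.^2,2)));               % eq. (10.2)
end
R = zeros(g,1);
for j = 1:g
  r = zeros(g,1);
  for k = [1:j-1, j+1:g]
    r(k) = (sigma(j)+sigma(k))/norm(mu(j,:)-mu(k,:)); % eq. (10.1)
  end
  R(j) = max(r);                                      % eq. (10.3)
end
dbi = mean(R);                                        % eq. (10.4)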
Exercise 10.7 We will employ the Davies-Bouldin index to evaluate the clustering produced
by hierarchical clustering. We will do so for a range of clusters in order to determine
the optimal number of clusters, i.e. the best level to cut the dendrogram at. To achieve
this we employ the function hdb().
(a) The function hdb() has as inputs a dataset, the linkage type, the distance measure
and the range of clusters (start,stop) for which you wish to evaluate the quality of
the clustering. It computes, for each clustering, the DBI. Familiarize yourself with the
operation of this function.
(b) Load the triclust dataset and make a scatterplot. What do you expect the DBI
curve as a function of the number of clusters to look like?
(c) Now apply hdb to this dataset with Euclidean distance, complete linkage clustering,
starting at 2 clusters and stopping at 10. What is evident from the DBI curve?
(d) Apply hdb to the hall dataset with Euclidean distance, complete linkage clustering,
starting at 2 clusters and stopping at 20. What do you observe in the obtained curve?
Can you explain your observation?
(e) Finally, apply hdb to the cigars dataset with Euclidean distance, complete linkage
clustering, starting at 2 clusters and stopping at 20. Inspect the DBI curve. Do you
detect the expected number of clusters? Now try single linkage clustering. Does this
help? Why (not)?
Last week, you already looked at how to use log-likelihood to judge the number of clusters in
exercise 9.10.
optional
Exercise 10.8 Repeat exercise 9.10, but generate plots of the Akaike Information Criterion
(AIC) and the Bayesian Information Criterion (BIC) instead.
end optional
Take-home messages
Week 11
Feature Selection
Objectives When you have done the exercises and read the text for this week, you should
Exercise 11.1 Given a data set with 100 features, say you want to find the 5 best features.
How many feature sets do you need to check for this if we have to rely on an
exhaustive search?
Exercise 11.3 Load the diabetes data set using a = diabetes. This data set is a real-
world data set (though an old one). The aim is, or rather, was to predict the presence
of diabetes mellitus in people based on the eight measurements made. Some of the
features included are diastolic blood pressure, triceps skin fold thickness, and age.
(a) Split up the data set in 50% for training and the rest for testing. Based on the
training set, what is the optimal 5-dimensional subspace for the fisher classifier? (Hint:
you can loop your way out of this, but it might be handier to use nchoosek; read the
help carefully.)
(b) What is the optimal 5-dimensional space for a 1NN classifier?
(c) What is the error attained on the test set for these classifiers with their respective
optimal feature spaces?
(d) Check some search strategies and selection criteria (help feateval) and see
whether they are able to pick out the same 5-dimensional subspace. If not, do at
least the error rates on the test set come close in case one uses the same classifier?
(e) Do another 50-50 split of the original data set. When redoing some of the feature
selections, do you find back the same feature sets?
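Following the hint in (a), an exhaustive search over all 5-feature subsets might be set up
as follows (a sketch; here subsets are scored by the apparent error on the training set,
adapt the criterion as you see fit):
>> [trn,tst] = gendat(a,0.5);
>> combs = nchoosek(1:size(a,2),5); % all 5-feature subsets
>> err = zeros(size(combs,1),1);
>> for i = 1:size(combs,1)
     w = fisherc(trn(:,combs(i,:)));
     err(i) = trn(:,combs(i,:))*w*testc;
   end
>> [emin,ibest] = min(err); combs(ibest,:) % the selected feature set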
Exercise 11.4 This code you get for free:
a = mfeat_zer
[trn,tst] = gendat(a,.04)
clsf = ldc
featnum = [1:5:47]
prwaitbar off
prwarning off
mf = max(featnum)
e0 = clevalf(trn,clsf,featnum,[],1,tst);
rp = randperm(size(a,2));
er = clevalf(trn(:,rp),clsf,featnum,[],1,tst(:,rp));
[w,r] = featseli(trn,'eucl-m',mf)
e1 = clevalf(trn*w,clsf,featnum,[],1,tst*w);
[w,r] = featself(trn,'eucl-m',mf)
e2 = clevalf(trn*w,clsf,featnum,[],1,tst*w);
[w,r] = featselb(trn,'eucl-m',mf)
e3 = clevalf(trn*w,clsf,featnum,[],1,tst*w);
figure
plote({e0,er,e1,e2,e3})
legend({'o','r','i','f','b'})
(a) Run it a couple of times and have a look at the output, both in the figure and in
the command window, and try to understand it. Also have a look at the code of course.
E.g. what does randperm do in the code and what is rp used for? What does featseli
do? And 'eucl-m'?
(b) Can you explain what the backward feature selection does if you want to extract mf
features from dataset a? Using this explanation, can you then explain why the feature
curve for the backward selection looks so poor?
(c) Can you find a selection strategy whose optimal performance is, by and large,
better than the best performing one in (a)? Recall that there are not only so-called
filter approaches.
Recall from the book (page 285) that if the feature vectors $\{x_i\}$ have covariance matrix $C$,
the covariance matrix of the transformed feature vectors $\{Ax_i\}$ equals $ACA^T$.
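A quick numerical check of this fact (the sizes are arbitrary):
>> X = randn(1000,3);          % rows are feature vectors
>> A = randn(2,3);             % a linear transformation
>> C = cov(X,1);
>> norm(A*C*A' - cov(X*A',1))  % should be numerically zero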
Exercise 11.6 Note: this might be a tiresome exercise, but if you have too much time on
your hands. . . One way or another, you should know that Sm = Sb + Sw for a fact!
optional
(a) Show that $\Sigma_i = E[(x - \mu_i)(x - \mu_i)^T] = \frac{1}{2} E[(x - x')(x - x')^T]$, in which $x$ and $x'$ are
identically distributed, i.e., according to the distribution of class $i$.
(b) Exercise 5.12 from the book. Show that the mixture scatter matrix is the sum of
the within-class and between-class scatter matrices. The fact proven in (a) can be very
helpful here.
end optional
Exercise 11.7 (a) Variation to Exercise 5.18 from the book (where the book is slightly
inaccurate). Convince yourself that if the number of classes equals M, the rank of
the between-scatter matrix Sb is at most M - 1. You can either try to prove this, which
might be a bit tough, or you can develop some insight by staring at the formula or
doing some experiments in, say, Matlab.
(b) In which (two) different cases can the rank of Sb be smaller than M - 1? Note
that this applies similarly to any estimate of a covariance matrix based on M samples.
optional
Exercise 11.8 Make Exercise 5.20 from the book. (Note that Sxw actually equals Sw.)
end optional
11.3 More. . .
Take-home messages
Procedures are approximate and will, typically, not find the optimal
features
Sm = Sb + Sw .
Week 12
Feature Extraction
Objectives When you have done the exercises and read the text for this week, you should
Exercise 12.1 Make sure that you have had enough practice with last week's exercises.
Exercise 12.2 Load the diabetes data set using a = diabetes. We revisit Exercise 11.3
from the previous week.
(a) Consider one of the better performing feature selection strategies and compare
its performance (e.g. using clevalf), for various feature dimensionalities with the
performance of PCA (pcam). Do this based on two quite differently behaving classifiers,
say, a complex and a simple one.
(b) Does a normalization of the individual features lead to an improved error rate
when using PCA?
Exercise 12.3 (a) Load the mfeat_pix data set and make a scatter plot of it. What objects
does this data set describe and what are the features that are used to describe them?
(b) Split up the set into a training set containing 50% of the data. The other 50%
goes into the test set. Calculate a mapping w on the training set that consists of the
first 2 PCs. Display, in two separate figures, the 2-dimensional training set and test set
obtained by means of PCA. What noticeable difference between the scatter plots, from
a classification point of view, can be observed?
(c) Now split up the set into a training set containing 5% of the data. Again the rest
goes into the test set. Determine a PCA on the training set. What is the maximum
number of features that PCA can extract? Why is this the case?
(d) Compare feature curves obtained with the original data sets and some obtained
with the PCA transformed data sets for a couple of classifiers (e.g. using something
like clevalf(trn,classifier,[5:5:90],[],1,tst)). Is there any benefit of the PCA
feature space over the pixel feature space? Do you observe peaking for any of the
classifiers?
Exercise 12.4 (a) Variation to Exercise 5.18 from the book (where the book is slightly
inaccurate). Convince yourself that if the number of classes equals M, the rank of
the between-scatter matrix Sb is at most M - 1. You can either try to prove this, which
might be a bit tough, or you can develop some insight by staring at the formula or
doing some experiments in, say, Matlab.
(b) In which (two) different cases can the rank of Sb be smaller than M - 1? Note
that this applies similarly to any estimate of a covariance matrix based on M samples.
Exercise 12.5 (a) Consider three arbitrary points in a 2-dimensional feature space. One is
from one class and the two others from the other. That is, we have a two-class problem.
What (obviously?) is the 1-dimensional subspace that is optimal according to the Fisher
mapping/LDA? What value would the Fisher criterion take on in this case? Explain
your answers.
(b) Consider an arbitrary two-class problem in three dimensions with three points in
one class and one in the other. How can one determine a 1-dimensional subspace in
this 3D space that optimizes the Fisher criterion? (Like in (a), a geometric view may
seem easiest here.)
(c) Again we take a two-class problem with four objects in 3D but now both classes
contain the same number of points. Again determine the 1D subspace that Fisher gives.
(d) When dealing with two points from two different classes in three dimensions, how
many essentially different Fisher optimal 1D subspaces are there?
Exercise 12.6 Take a = mfeat_pix, but now split up the set into a training set containing
10% of the data. Again the rest goes into the test set. Determine a Fisher mapping
(fisherm) on the training set.
(a) What is the maximum number of features that Fisher can extract? Why is this
the case?
(b) Calculate a mapping w on the training set that consists of the first 3 components
provided by fisherm. Display the three (3) different 2-dimensional scatters of the
training set after the Fisher mapping (use the gridded option in scatterd). What
happened here? Explain what you see (hint: reconsider the solutions you found in the
previous simple, low-dimensional examples).
(c) Use the mapping from (b) and apply it to the test set. Display, in a different figure,
the three (3) different 2-dimensional scatters. What noticeable difference between the
scatter plots can be observed?
(d) Would you say the Fisher mapping is overtrained?
(e) Does the mapping with PCA look better or worse than the mapping using
Fisher/LDA? Explain your answer.
(f) Can you come up with a combination of PCA and Fisher that performs better than
using PCA or Fisher separately? Does it depend on the choice of classifier?
Exercise 12.7 Assume that the overall mean of a data set is zero and that all feature vectors
are stored in the rows of the matrix X.
(a) Show that the eigenvector of $X^T X$ with the largest eigenvalue is equal to the
largest principal component.
(b) Given that v is a (column) eigenvector with eigenvalue $\lambda$ of $XX^T$, show that $X^T v$
is an eigenvector of $X^T X$. What is the eigenvalue associated to the eigenvector $X^T v$?
(c) Verify that, using $X^T v$, a (row) feature vector x from the original space can be
mapped onto the first PC by $xX^T v$.
(d) Describe now how one can kernelize PCA: describe the procedure to do the actual
calculation. All ingredients are there. (Hint: realize what information $XX^T$ and $xX^T$
contain.)
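A small numerical check of (a) and (b), under the stated zero-mean assumption:
>> X = randn(6,3); X = X - repmat(mean(X,1),6,1); % zero-mean data in rows
>> [V,D] = eig(X*X');
>> [lambda,imax] = max(diag(D)); v = V(:,imax);   % largest eigenpair of X*X'
>> u = X'*v;                                      % candidate eigenvector of X'*X
>> norm((X'*X)*u - lambda*u)                      % numerically zero: same eigenvalue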
Take-home messages
Procedures are approximate and will, typically, not find the optimal
features
Week 13
Dissimilarity Representation
Objectives When you have done the exercises and read the text for this week, you should
In most applications the dissimilarity representation is computed on raw data, like images,
and not on an already given feature representation. We will use distances resulting from a
study by Zongker and Jain on deformable templates between 10 classes of handwritten digits
(0-9). Each class is given by 200 examples, resulting in a 2000 x 2000 dissimilarity matrix.
We will, however, use just two of the classes. For handling dissimilarity matrices the following
commands from the DisTools toolbox are useful:
genddat: splits a dissimilarity matrix into a training set and a test set. Note that such
a split affects the numbers of rows as well as columns.
testkd: 1-NN classification error based on a given set of distances between a test set
and a training set.
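The exact argument conventions are documented in the DisTools help; a session for the
exercise below might start along these lines (a sketch with hypothetical argument values):
>> d = readzongker;             % 2000 x 2000 dissimilarity matrix
>> d = selcdat(d,[1 2]);        % keep classes 1 and 2 only
>> [dtr,dte] = genddat(d,0.5);  % split rows and columns into train/test
>> e = testkd(dte);             % 1-NN error from test-to-training distances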
Exercise 13.2 Load the Zongker dataset by readzongker and inspect its size. Select just
two classes (1 and 2) using selcdat. Use for the following experiments training
sets and test sets of 100 objects per class.
(a) Use the entire dissimilarity matrix of the two classes (size 400 x 400) for embedding
by psem into a 10-dimensional space. Project all dissimilarity data into this space.
Generate the training and test set and compute the errors of the 1-NN and qdc classifiers.
optional
(b) Select a random set of 10 objects for representation (5 from each class) and de-
termine the corresponding dissimilarity matrix and use it for embedding. Project all
dissimilarity data into this pseudo-Euclidean space. Generate the training and test set
and compute the errors of the 1-NN and qdc classifiers.
end optional
(c) Use the same representation set for constructing a dissimilarity space (just by
reducing the numbers of columns of the total dissimilarity matrix). Generate a training
set and a test set and compute in the dissimilarity space the errors of the 1-NN and qdc
classifiers.
(d) Compare the above results with 1-NN errors for the same test set using the repre-
sentation set as training set, as well as using the entire training set.
(e) What representation yields the best classifiers? Which representation is preferred
if for test objects just the computation of 10 dissimilarities can be afforded?
Take-home messages
Appendix A
Final assignment
A.1 Case
For the purpose of this final assignment, a case is introduced. Assume your group plays
the role of a pattern recognition consultancy company. Your company is faced with an
assignment by a client company working on automatic bank cheque processing applications.
This client wants you to research the possibility of using pattern recognition techniques to
classify individual digits in bank account numbers and the monetary amount.
They are interested in two scenarios:
1. the pattern recognition system is trained once, and then applied in the field;
2. the pattern recognition system is trained for each batch of cheques to be processed.
In pattern recognition terms, Scenario 1 means the amount of training data available is large
(in this case, at least 200 and at most 1000 objects per class), whereas for Scenario 2 it is
much smaller (at most 10 objects per class).
A.2 Input: the NIST dataset
To help you to construct your system, the client has supplied you with a standard dataset
of handwritten digits put together by the US National Institute of Standards & Technology,
NIST (see http://www.nist.gov/). This set consists of 2800 images of handwritten digits
for each of the 10 classes, 0, 1, . . . , 9. They were scanned from forms filled out by
volunteers, thresholded to black-and-white and automatically segmented into images of 128 x
128 pixels. Sometimes parts of other digits are visible in an image. In order to save disk space,
we determined for each digit a bounding box, thereby causing images to have different sizes.
The data can be loaded and displayed as:
>> a = prnist([0:9],[1:40:1000])
>> show(a)
This loads, for each of the 10 digits, 25 objects out of the available 1000 into a prdatafile
a. The result is shown in figure A.1.
Figure A.1: Subset of the NIST digit dataset
A.3 Expected output: deliverables
Your client expects the following deliverables:
a report detailing the design choices you made for both scenarios, including detailed
motivation. This report should discuss the following:
a test of your system on a set of benchmark data withheld from you by the client (see
below). For Scenario 1, your system should have lower than 5% test error; for Scenario
2, your target is 25% test error.
Note that each choice is expected to be backed up by either evidence or solid reasoning.
In the last exercise(s) for each week you have already experimented somewhat with the hand-
written digit dataset. However, it is important to question each step as you construct your
final digit recognition system, for two reasons:
for Scenario 1, you should use as much training data as possible for the construction of
your system (but at least 200 objects per class), rather than the smaller sets you used
earlier; this may invalidate some of your earlier conclusions;
you are supposed to optimise the overall performance of the system, rather than just
the individual steps.
A.4 The benchmark
It is common for clients of your company to withhold data from you, which they can then
use to test your system and verify whether the performance you claim to obtain is actually
valid. In this case, the client has supplied you with a function you can use to benchmark your
system yourself, without having access to the actual data used. It goes without saying that
you may not use the test error to optimise your system in any way.
The procedure agreed upon is as follows:
write a function a = my_rep(m) in which m is a NIST prdatafile set and a the resulting
dataset;
compute your classifier as PRTools classifier w in which dimension reduction and clas-
sification are incorporated, so a*w*labeld should generate the proper labels;
call e = nist_eval(filename,w,n) in which filename is the filename for the my_rep
routine, w is your classifier and n is the desired number of digits per class to be tested
(which you can leave empty if you want to test on all data). A typical call therefore
looks like: e = nist_eval('my_rep',w);. The maximum value of n is 100. Evaluation
may take a long time. The fraction of classification errors is returned in e.
In your report, give the performance obtained using the classifiers selected for both scenarios
and compare these to the performance predicted using the training data. Are there large
differences; if so, what could be their cause?
A.5 Grading
Your report will be graded by the supervisors, acting as the client company. If the minimum
performance requirements (see Section A.3) are not met, a passing grade will not be given. The
report is graded on the results obtained, the choices based on their results and the motivation
behind these choices. The maximum grade that can be obtained for the report is an 8.
The final two points can be earned as follows:
Include a section Live test in your report, in which you report on the performance
obtained on actual handwritten digits and compare it to the expected performance and
the benchmark performance. To this end, you should:
use a scanner to read in an image of a sheet of paper on which you have written a
test set of digits;
segment the individual digits out of the resulting image;
classify the digit images.
Include a section Recommendations in your report, in which you advise your client
in detail on possible steps to improve performance in light of the overall system they
are constructing. For example:
A.6 Feedback
While you work on the final assignment during the last weeks, the supervisors will of course
be available to answer questions and comment on your results.
Appendix B
In the course on Pattern Recognition, we assume a certain familiarity with probability theory
and linear algebra. You should be able to answer the following questions without too many
problems and, in principle, only using pen and paper. If you find you have difficulties in doing
so, please consult Theodoridis (Appendices A and B), or some other textbook of your liking,
to refresh your memory.
Exercise B.3 The normal distribution
Suppose that the IQ of students at Delft University is normally distributed with mean
125 and standard deviation 10, respectively. What is the probability that of 20 randomly
selected Delft students, five have an IQ of at least 135? Use the table at
https://www.stat.tamu.edu/~lzhou/stat302/standardnormaltable.pdf, in which the cell
in the row with label r and the column with label c gives the value of the standard normal
CDF Φ(x) for x = r + c.
Prove that if X1 and X2 are uncorrelated, they are also independent (note that this
only holds for the normal distribution!).
(b) Check that the vector (1, 2)^T is an eigenvector of A. What is its eigenvalue?
(c) Write out the eigenvalue decomposition for A.
Appendix C
Introduction to Matlab
Ela Pekalska and Marjolein van der Glas
Note that spaces are not important in Matlab.
The result of the last performed computation is ascribed to the variable ans, which is
an example of a Matlab built-in variable. It can be used on the next command line.
For instance:
>> 14/4
ans =
3.5000
>> ans^(-6)
ans =
5.4399e-04
5.4399e-04 is computer notation for 5.4399 · 10^-4. Note that ans is always overwritten
by the last command.
You can also define your own variables (more on variables in Section C.2.1), e.g. a and
b as:
>> a = 14/4
a =
3.5000
>> b = a^(-6)
b =
5.4399e-04
When the command is followed by a semicolon ;, the output is suppressed. Compare:
>> 3 + 7.5
>> 3 + 7.5;
It is possible to execute more than one command at the same time; the commands should
then be separated by commas (to display the output) or by semicolons (to suppress the
output display), e.g.:
>> sin(pi/4), cos(pi); sin(0)
ans =
0.7071
ans =
0
Note that the value of cos(pi) is not printed.
By default, Matlab displays only 5 digits. The command format long increases this
number to 15, format short reduces it to 5 again. For instance:
>> 312/56
ans =
5.5714
>> format long
>> 312/56
ans =
5.57142857142857
The output may contain some empty lines; this can be suppressed by the command
format compact. In contrast, the command format loose will insert extra empty
lines.
Mathematical notation      Matlab command
a + b                      a + b
a - b                      a - b
a · b                      a * b
a / b                      a / b or b \ a
x^b                        x^b
√x                         sqrt(x) or x^0.5
|x|                        abs(x)
π                          pi
4 · 10^3                   4e3 or 4*10^3
i                          i or j
3 - 4i                     3-4*i or 3-4*j
e, e^x                     exp(1), exp(x)
ln x, log x                log(x), log10(x)
sin x, arctan x, ...       sin(x), atan(x), ...
Matlab is case sensitive; for example, a variable a is written as a in Matlab: typing A
instead will result in an error.
All text after a percent sign % until the end of a line is treated as a comment. Enter
e.g. the following:
>> sin(3.14159) % this is an approximation of sin(pi)
You should skip comments while typing the examples.
Previous commands can be fetched back with the ↑-key. The command can also be
changed: the ← and → keys may be used to move around in a line and edit it.
C.1.2 help-facilities
Matlab provides assistance through extensive online help. The help command is the simplest
way to get help. It displays the list of all possible topics. To get a more general introduction
to help, try:
>> help help
If you already know the topic or command, you can ask for a more specified help. For instance:
>> help elfun
provides information on the elementary math functions of Matlab. The topic you want help
on must be exact. The lookfor command is useful if you do not know the exact name of the
command or topic, e.g.:
>> lookfor inverse
displays a list of all commands, with a short description, for which the word inverse is
included in its help-text. Besides the help and lookfor commands, there is also a separate
mouse driven help. The helpwin command opens a new window on screen which can be
browsed in an interactive way.
Sometimes you might spot an error in your command or program. Due to this error it can
happen that the command or program does not stop. Pressing Ctrl-C (or Ctrl-Break on
PC) forces Matlab to stop the process. After this, the Matlab prompt (>>) re-appears.
This may take a while, though.
If you work in the Command Window, Matlab memorizes all commands that you entered and
all variables that you created. These commands and variables reside in the Matlab workspace
and they can easily be recalled when needed, e.g. to recall previous commands, the ↑-key
is used. Variables can be verified with the commands who, which gives a list of variables
present, and whos, which also includes extra information. Assuming that you performed all
commands given in Section C.1, after typing who you would obtain:
>> who
Your variables are:
a ans b x
The command clear <name> deletes the variable <name> from the Matlab workspace, clear
or clear all removes all variables. This is useful when starting a new task. For example:
>> clear a x
>> who
Your variables are:
ans b
The easiest way to save or load Matlab variables is by using (clicking) the File menu-bar,
and then the Save Workspace as... and Load Workspace... items. Alternatively, the
Matlab commands save and load can be used. save allows for saving your workspace
variables either into a binary file or an ASCII file. Binary files automatically get the .mat
extension. For ASCII files, it is recommended to add a .txt or .dat extension. Study
the use of the save command:
>> s1 = sin(pi/4);
>> c1 = cos(pi/4); c2 = cos(pi/2);
>> str = 'hello world'; % a string
>> save data % saves all variables in binary format to data.mat
>> save numdata s1 c1 % saves numeric variables: s1 and c1 to numdata.mat
>> save allcos.dat c* -ascii % saves c1,c2 in 8-digit ascii format to allcos.dat
The load command allows for loading variables into the workspace. It uses the same syntax
as save. It might be useful to clear the workspace before each load and check which variables
are present in the workspace (use who) after loading. Assuming that you have created the
files above, the variables can be now loaded as:
>> load data % loads all variables from data.mat
>> load data s1 c1 % loads only specified variables from data.mat
It is also possible to read ASCII files that contain rows of space separated values. Such a file
may contain comments that begin with a percent character. The resulting data is placed into
a variable with the same name as the ASCII file (without the extension). Check, for example:
>> load allcos.dat % loads data from allcos.dat into variable allcos
>> who % lists variables present in the workspace now
The basic element of Matlab is a matrix (or an array). Special cases are:
a 1 x 1 matrix, which is a scalar, i.e. a single number;
matrices with one row or one column, which are vectors.
Note that Matlab may behave differently depending on the input, whether it is a number, a
vector or a 2D matrix.
Formally, there is no need to declare (i.e. define the name, size and the type of) a new
variable in Matlab. A variable is simply created by an assignment (e.g. a = 15/4). Each
newly created numerical variable is always of the double type. You can change this type by
converting it into e.g. the single type (a variable a is converted into a different type by
performing e.g. a = single(a)).
Bear in mind that undefined values cannot be assigned to a variable. So, the following is not
possible:
>> clear x; % to make sure that x does not exist
>> f = x^2 + 4 * sin (x)
It becomes possible by:
>> x = pi / 3; f = x^2 + 4 * sin (x)
Here are some examples of different types of Matlab variables (check their types by using
whos):
>> M = [1 2; 3 4; 5 6] % a matrix
>> 2t = 8 % what is the problem with this command?
>> c = 'E' % a character
>> str = 'Hello world' % a string
There is also a number of built-in variables, e.g. pi, eps or i, presented in Table C.2. In
addition to creating variables by assigning values to them, another possibility is to copy one
Variable name      Value/meaning
ans                the default variable name used for storing the last result
pi                 π = 3.14159...
eps                the smallest positive number that, added to 1, creates a result larger than 1
inf                representation for positive infinity, e.g. 1/0
nan or NaN         representation for not-a-number, e.g. 0/0
i or j             i = j = √-1
nargin/nargout     number of function input/output arguments used
realmin/realmax    the smallest/largest usable positive real number
variable, e.g. b into another, e.g. a. In this way, the variable a is automatically created (if a
already existed, its previous value is now lost):
>> b = 10.5;
>> a = b;
If min is the name of a function (see Section C.4.2), then a defined as:
>> b = 5; c = 7;
>> a = min (b,c);
will call that function, with the values b and c as parameters. The result of this function (its
return value) will be written (assigned) into a.
C.2.2 Vectors
Row vectors are lists of numbers separated either by commas or by spaces. They are examples
of simple arrays. First element has index 1. The number of entries is known as the length of
the vector (the command length exists as well). Their entities are referred to as elements or
components. The entries must be enclosed in [ ]:
>> v = [-1 sin(3) 7]
v =
-1.0000 0.1411 7.0000
already existing ones:
>> v = [-1 2 7]; w = [2 3 4];
>> z = v + w % an element-by-element sum
z =
1 5 11
>> t = [2*v, -w+2]
t =
-2 4 14 0 -1 -2
>> v(2) = -1 % change the 2nd element of v
v =
-1 -1 7
A colon notation is an important shortcut, used when producing row vectors (see Table C.3
and help colon):
>> 2:5
ans =
2 3 4 5
In general, first:step:last produces a vector of entities with the value first, incrementing
by the step until it reaches last:
>> -0.2:0.5:1.7
ans =
-0.2000 0.3000 0.8000 1.3000
>> 5.5:-2:-2.5 % a negative step is also possible
ans =
5.5000 3.5000 1.5000 -0.5000 -2.5000
To create column vectors, you should separate entries by a semicolon ; or by new lines.
The same operations as on row vectors can be done on column vectors. However, you cannot
for example add a column vector to a row vector. To do that, you need an operation called
transposing, which converts a column vector into a row vector and vice versa:
>> f = [-1; 3; 5] % a column vector
f =
-1
3
5
>> f' % f' is a row vector
ans =
-1 3 5
>> v = [-1 2 7]; % a row vector
>> f + v % you cannot add a column vector f to a row vector v
??? Error using ==> +
Matrix dimensions must agree.
>> f' + v % transposing f makes the dimensions agree
ans =
-2 5 12
You can now compute the inner product between two vectors x and y, $x^T y = \sum_i x_i y_i$, in a
simple way:
>> v = [-1; 2; 7]; % v is now a column vector; f is also a column vector
>> f' * v % this is the inner product! (f * v' is the outer product)
ans =
42
C.2.3 Matrices
Transposing a vector changes it from a row to a column or the other way around. This idea
can be extended to a matrix, where transposing interchanges rows with the corresponding
columns.
Command         Result
A(i,j)          A_ij
A(:,j)          the j-th column of A
A(i,:)          the i-th row of A
A(k:l,m:n)      the (l-k+1)-by-(n-m+1) matrix with elements A_ij with k <= i <= l, m <= j <= n
v(i:j)          the vector-part (v_i, v_{i+1}, ..., v_j) of vector v
There are also built-in matrices of size specified by the user (see Table C.4). A few examples
are given below:
>> I = eye(2) % the 2-by-2 identity matrix
I =
1 0
0 1
>> B = randn(1,2) % a vector of normally distributed random numbers
B =
-0.4326 -1.6656
>> r = [1 3 -2]; R = diag(r) % create a diagonal matrix with r on the diagonal
R =
1 0 0
0 3 0
0 0 -2
It is often needed to build a larger matrix from the smaller ones:
>> x = [4; -1]; y = [-1 3]; X = [x y]
X = % X consists of the columns: x and y
4 -1
-1 3
>> T = [-1 3 4; 4 5 6]; t = 1:3;
>> A = [[T; t] ones(3,2)] % add a row vector t, then add two columns of 1s
A =
-1 3 4 1 1
4 5 6 1 1
1 2 3 1 1
It is possible to manipulate (groups of) matrix elements (see Table C.3). A part can be
extracted from a matrix in a similar way as from a vector. Each element in the matrix is
indexed by a row and a column to which it belongs, e.g. A(i,j) denotes the element from
the i-th row and the j-th column of the matrix A.
>> A(3,4)
ans =
1
>> A(4,3) % A is a 3-by-5 matrix! A(4,3) does not exist
??? Index exceeds matrix dimensions.
It is easy to extend a matrix automatically. For the matrix A it can be done e.g. as follows:
>> A(4,3) = -2 % assign -2 to the position (4,3); the uninitialized
A = % elements become zeros
-1 3 4 1 1
4 5 6 1 1
1 2 3 1 1
0 0 -2 0 0
>> A(4,[1:2,4:5]) = 4:7; % assign A(4,1)=4, A(4,2)=5, A(4,4)=6 and A(4,5)=7
Table C.4 shows some frequently used matrix operations and functions. A few examples are
given below:
>> B = [1 4 -3; -7 8 4]; C = [1 2; 5 1; 5 6];
>> (B + C')./4 % add matrices (transposing C first) and divide all elements by 4
ans =
0.5000 2.2500 0.5000
-1.2500 2.2500 2.5000
>> D = [1 -1 4; 7 0 -1];
>> D = B * D'; D^3 % D^3 is equivalent to D * D * D
ans =
-4205 38390
3839 -150087
>> D.^3 - 2 % for all elements: raise to the power 3, subtract 2
ans =
-3377 998
-1 -148879
The concept of an empty matrix [] is also very useful in Matlab. For instance, a few
columns or rows can be removed from a matrix by assigning an empty matrix to it. Try for
example:
>> C = [1 2 3 4; 5 6 7 8; 1 1 1 1];
>> D = C; D(:,2) = [] % now a copy of C is in D; remove the 2nd column of D
>> C ([1,3],:) = [] % remove the 1st and 3rd rows from C
A control flow structure is a block of commands that allows conditional code execution and
making loops.
To use control flow commands, it is necessary to perform operations that result in logical
values: TRUE or FALSE, which in Matlab correspond to 1 or 0 respectively (see Table C.5
and help relop). The relational operators <, <=, >, >=, == and ~= can be used to compare
two arrays of the same size or an array to a scalar. The logical operators &, | and ~ allow
for the logical combination or negation of relational operators. The logical & and | have equal
precedence in Matlab, which means that those operators associate from left to right. In
addition, three functions are also available: xor, any and all (use help to find out more).
You can extract all elements from the vector or the matrix satisfying a given condition, e.g.
equal to 1 or larger than 5, by using logical addressing. The same result can be obtained via
the command find, which return the positions (indices) of such elements. For instance:
>> x = [1 1 3 4 1];
>> i = (x == 1)
i =
1 1 0 0 1
>> j = find(x == 1) % j holds indices of the elements satisfying x == 1
j =
1 2 5
>> y = x(i), z = x(j)
y =
1 1 1
z =
1 1 1
find operates in a similar way on matrices. It first reshapes the matrix A into a column
vector, i.e. all columns are concatenated one after another. Therefore, e.g. k = find (A >
2) is a list of indices of elements larger than 2 and A(k) gives the values. The row and column
indices can be returned by [I,J] = find (A > 2).
Selection control structures, if-blocks, are used to decide which instruction to execute next,
depending on whether an expression is TRUE or not. The general description is given below.
In the examples below the command disp is frequently used. This command displays the
text between the quotes on the screen.
if ... end
Syntax:
if logical_expression
statement1
statement2
....
end
Example:
if (a > 0)
b = a;
disp ('a is positive');
end
if ... else ... end
Syntax:
if logical_expression
block of statements
evaluated if TRUE
else
block of statements
evaluated if FALSE
end
if ... elseif ... else ... end
Syntax:
if logical_expression1
block of statements evaluated
if logical_expression1 is TRUE
elseif logical_expression2
block of statements evaluated
if logical_expression2 is TRUE
else
block of statements evaluated
if no other expression is TRUE
end
Example:
if (height > 190)
disp ('very tall');
elseif (height > 170)
disp ('tall');
elseif (height < 150)
disp ('small');
else
disp ('average');
end
Another selection structure is switch, which switches between several cases depending on an
expression, which is either a scalar or a string. Learn more by using help.
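A small sketch of the switch syntax (the variable color is assumed to hold a string):
switch color
case 'red'
    disp('stop');
case {'green','blue'}
    disp('go');
otherwise
    disp('unknown color');
end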
C.3.3 Loops
Iteration control structures, loops, are used to repeat a block of statements until some condi-
tion is met:
the for loop that repeats a group of statements a fixed number of times;
Syntax:
for index = first:step:last
block of statements
end
Example:
sumx = 0;
for i=1:length(x)
sumx = sumx + x(i);
end
You can specify any step, including a negative value. The index of the for-loop can
also be a vector. See some examples of possible variations:
Example 1:
for i=1:2:n
...
end
Example 2:
for i=n:-1:3
....
end
Example 3:
for x=0:0.5:4
disp(x^2);
end
Example 4:
for x=[25 9 81]
disp(sqrt(x));
end
the while loop that repeats a group of statements as long as a given expression is TRUE.
Syntax:
while expression
statement1
statement2
statement3
...
end
Example:
N = 100;
iter = 1;
msum = 0;
while iter <= N
msum = msum + iter;
iter = iter + 1;
end;
C.4 Script and function m-files
C.4.1 Script m-files
Script m-files are useful when the number of commands increases or when you want to change
values of some variables and re-evaluate them quickly. Formally, a script is an external file
that contains a sequence of Matlab commands. However, it is not a function, since there
are no input/output parameters and the script variables remain in the workspace. When
you run a script, the commands in it are executed as if they had been entered
through the keyboard. To create a script, open the Matlab editor (go to the File menu-
bar, choose the Open option and then m-file; the Matlab Editor Window will appear),
enter the lines listed below and save as sinplot.m:
x = 0:0.2:6; y = sin(x); plot(x,y);
title('Plot of y = sin(x)');
The script can be run by calling sinplot. Note that the script m-file must be saved in one of
the directories in Matlab's path. The sinplot script affects the workspace. Check:
>> clear % all variables are removed from the workspace
>> who % no variables present
>> sinplot % run the script
>> who
Your variables are:
x y
These generic names, x and y, may easily be used in further computations and this can cause
side effects. Side effects occur, in general, when a set of commands changes variables other
than the input arguments. Remember that the commands within a script have access to all
variables in the workspace and all variables created in this script become part of the workspace.
Therefore, it is better to use function m-files (see Section C.4.2) for a specific problem to be
solved.
C.4.2 Function m-files
Function m-files are true subprograms: they take input arguments and/or return output
parameters, and they can call other functions as well. Variables defined and used inside a
function, other than the input/output arguments, are invisible to other functions and to
the command environment. The general syntax is:
function [outputArgs] = function_name (inputArgs)
outputArgs are enclosed in [ ]:
a comma-separated list of variable names;
[ ] is optional when only one argument is present;
functions without outputArgs are legal (in other programming languages, functions
without output arguments are called procedures).
inputArgs are enclosed in ( ):
a comma-separated list of variable names;
functions without inputArgs are legal.
Matlab provides a structure for creating your own functions. The first line should be the
definition of a new function (called the header). After that, a contiguous sequence of comment
lines, explaining what the function does, should appear. Since these lines are displayed in
response to the help command, the expected input parameters, the returned output parameters
and a synopsis should also be described there. Finally, the remainder of the function is called
the body.
Function m-files terminate execution when they reach the end of the file or, alternatively,
when the command return is encountered. The name of the function and the name of the
file it is stored in should be identical. As an example, the function average, stored in a file
average.m, is defined as:
function avr = average (x)
%AVERAGE computes the average value of a vector x
% and returns it in avr

% Notes: an example of a function
n = length(x);
avr = sum(x)/n;
return;
The first line must be the function definition (header): avr is the output argument, average
the function name and x the input argument. The comment lines directly below the header
are displayed in response to help average. Because of the blank line within the comment,
the Notes information will NOT appear when you ask: help average.
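Calling the function from the command line (assuming average.m is saved on the path):
>> m = average([1 2 3 4])
m =
    2.5000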
The most important difference between a script and a function is that all script parameters
and variables are externally accessible (i.e. in the workspace), whereas function variables are
not. Therefore, a script is a good tool for documenting work, designing experiments and
testing. In general, create a function to solve a given problem for arbitrary parameters; use
a script to run functions for specific parameters required by the assignment.
Command Result
n = rank(A) n becomes the rank of matrix A
x = det(A) x becomes the determinant of matrix A
x = size(A) x becomes a row-vector with 2 elements: the number of rows and columns of A
x = trace(A) x becomes the trace (sum of diagonal elements) of matrix A
x = norm(v) x becomes the Euclidean length of vector v
C = A + B sum of two matrices
C = A - B subtraction of two matrices
C = A * B multiplication of two matrices
C = A .* B element-by-element multiplication (A and B are of equal size)
C = A ^ k power of a matrix (integer k; can also be used for the inverse A^(-1))
C = A .^ k element-by-element power of a matrix
C = A' the transpose of a matrix; A^T
C = A ./ B element-by-element division (A and B are of equal size)
X = A \ B finds the solution in the least squares sense to the system of equations AX = B
X = B / A finds the solution of XA = B, analogous to the previous command
C = inv(A) C becomes the inverse of A
C = null(A) C is an orthonormal basis for the null space of A obtained from the singular value decomposition
C = orth(A) C is an orthonormal basis for the range of A
C = rref(A) C is the reduced row echelon form of A
L = eig(A) L is a vector containing the (possibly complex) eigenvalues of a square matrix A
[X,D] = eig(A) produces a diagonal matrix D of eigenvalues and a full matrix X whose columns are the corresponding eigenvectors of A
S = svd(A) S is a vector containing the singular values of A
[U,S,V] = svd(A) S is a diagonal matrix with nonnegative diagonal elements in decreasing order; columns of U and V are the accompanying singular vectors
x = linspace(a,b,n) generates a vector x of n equally spaced points between a and b
x = logspace(a,b,n) generates a vector x of n values starting at 10^a and ending at 10^b
A = eye(n) A is an n × n identity matrix
A = zeros(n,m) A is an n × m matrix with zeros (default m = n)
A = ones(n,m) A is an n × m matrix with ones (default m = n)
A = diag(v) gives a diagonal matrix with the elements v(1), v(2), ..., v(n) on the diagonal
X = tril(A) X is the lower triangular part of A
X = triu(A) X is the upper triangular part of A
A = rand(n,m) A is an n × m matrix with elements uniformly distributed between 0 and 1
A = randn(n,m) ditto, with standard normally distributed elements
v = max(A) v is a vector with the maximum value of the elements in each column of A, or v is the maximum of all elements if A is a vector
v = min(A) ditto, with minimum
v = sum(A) ditto, with sum
Command Result
a = (b > c) a is 1 if b is larger than c. Similar are: <, >= and <=
a = (b == c) a is 1 if b is equal to c
a = (b ~= c) a is 1 if b is not equal to c
a = ~b logical complement: a is 1 if b is 0
a = (b & c) logical AND: a is 1 if b = TRUE AND c = TRUE
a = (b | c) logical OR: a is 1 if b = TRUE OR c = TRUE
Appendix D
Introduction to PRTools
David M.J. Tax and Robert P.W. Duin
D.1 Introduction
The more than 100 routines offered by PRTools in its present state represent a basic set
largely covering the area of statistical pattern recognition. In order to make the evaluation
and comparison of algorithms easier, a set of data generation routines is included, as well
as a small set of standard real-world datasets.
This manual treats just the basic objects and constructs in the PRTools toolbox. It does
not give an exhaustive explanation of all possibilities in the toolbox. To be more explicit, this
manual does not treat:
feature selection,
combining classifiers,
clustering,
applications on images.
Note: the manual tries to show the basic constructs and ideas behind the toolbox. In the
actual application of the toolbox in practice, the user should frequently use the help. For
each function defined in the PRTools toolbox an extensive help text is present, explaining
the syntax, the arguments, the function and the possible default values for arguments.
D.2 PRTools
PRTools deals with sets of labeled objects and offers routines for generalising such sets into
functions for data mapping and classification. An object is a k-dimensional vector of feature
values. It is assumed that for all objects in a problem all values of the same set of features
are given (and that there are no missing values). The space defined by the actual set of
features is called the feature space. Objects are represented as points or vectors in this space.
A classification function assigns labels to new objects in the feature space. Usually, this is
not done directly, but in a number of stages in which the initial feature space is successively
mapped into intermediate stages, finally followed by a classification. The concepts of mappings
between spaces and of datasets are thereby important and constitute the basis of many routines
in the toolbox.
PRTools makes use of the possibility offered by Matlab 5 (and later) to define classes and
objects. These programmatic concepts should not be confused with the classes and objects as
defined in pattern recognition. Two classes have been defined: prdataset and prmapping. A
large number of operators (like * and []) and Matlab commands have been overloaded and
thereby have a special meaning when applied to a prdataset and/or a prmapping.
In the coming sections it is explained how prdatasets and prmappings can be created, mod-
ified and applied.
D.3 Dataset
The central data structure of PRTools is the prdataset. It primarily consists of a set of
objects represented by a matrix of feature vectors. Attached to this matrix is a set of labels,
one for each object and a set of feature names. Moreover, a set of a priori probabilities, one
for each class, is stored. In most help files of PRTools, a dataset is denoted by a. In almost
any routine this is one of the inputs. Almost all routines can handle multi-class object sets.
prdataset
prdataset Define dataset from data matrix and labels
getdata Retrieve data from dataset
getlabels Retrieve object labels from dataset
getfeat Retrieve feature labels from dataset
Sets of objects may be given externally or may be generated by one of the data generation
routines of PRTools. Their labels may also be given externally or may be the result of a
cluster analysis.
A dataset containing 10 objects with 5 random measurements can be generated by:
>> data = rand(10,5);
>> a = prdataset(data)
10 by 5 dataset with 0 classes: []
In the previous example no labels were supplied, therefore no classes are detected. Labels can
be added to the dataset by:
>> data = rand(10,5);
>> labs = ['A'; 'A'; 'A'; 'B'; 'A'; 'A'; 'B'; 'B'; 'B'; 'B'];
>> a = prdataset(data,labs)
10 by 5 dataset with 2 classes: [5 5]
In most cases the objects in the dataset are ordered (such that the first n1 objects belong to
class A, the next n2 to class B, etc). In this case it is easier to use the genlab routine for the
generation of the labels:
>> data = rand(10,5);
>> labs = genlab([5 5],['A';'B']);
>> a = prdataset(data,labs)
10 by 5 dataset with 2 classes
Next to class labels, it is also possible to supply feature labels. To set this parameter, the
keyword 'featlab' should be given first:
>> data = rand(10,3);
>> labs = genlab([5 5],['A';'B']);
>> feats = ['Feat1';'Feat2';'Feat3'];
>> a = prdataset(data,labs,'featlab',feats)
10 by 3 dataset with 2 classes: [5 5]
There are numerous other parameters which can be defined inside the dataset definition. The
class prior probabilities can be defined, the dataset can be given a name, cost matrices can
be supplied, it can be indicated whether features are numerical or categorical, etc. See help
prdataset for more information about the possible parameters.
All these values can be set by their respective commands. For instance, for setting the prior:
>> a = setprior(a,[0.4 0.6]);
Analogously, you can use setcost, setname or setfeatdom.
Various items stored in a dataset can be retrieved by the get series of commands:
>> name = getname(a);
>> prob = getprior(a);
>> nlab = getnlab(a);
>> lablist = getlablist(a);
>> [m,k,c] = getsize(a);
in which nlab are numeric labels for the objects (1, 2, 3, ...) referring to the true labels
stored in the rows of lablist. The size of the dataset is m by k, c is the number of classes
(equal to max(nlab)).
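For instance, for the 10 by 3 dataset with two classes defined above, this gives:
>> [m,k,c] = getsize(a)
m =
    10
k =
     3
c =
     2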
The original data matrix can be retrieved by double(a) or by +a. The labels of the objects
in a can be retrieved by labels = getlab(a), which is equivalent to labels = lablist(nlab,:).
The feature labels can be retrieved by featlist = getfeat(a).
Note: In many respects a prdataset object can be compared with a normal Matlab matrix.
Normal matrix operations like a.^2, abs(a), a(:,3), a(1,:), size(a) etc. can directly
be applied to prdataset objects. In some cases Matlab routines cannot cope with the
prdataset object. In these cases the user has to use +a instead of a.
Another way to inspect a dataset a is to make a scatterplot of the objects in the dataset.
For this the function scatterd is supplied. This plots each object in the dataset a in a 2D
graph, using a coloured marker when class labels are supplied. When more than 2 features
are present in the dataset, the first two are used. When you want to make a scatterplot of
two other features, you have to explicitly extract them by scatterd(a(:,[2 5])).
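For instance, a small sketch using one of the data generators listed below:
>> a = gendatb([50 50]);   % banana-shaped data, 50 objects per class
>> scatterd(a)             % scatterplot of the first two features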
prdataset generation
gauss Generation of multivariate Gaussian distributed data
gendatb Generation of banana shaped classes
gendatc Generation of circular classes
gendatd Generation of two difficult classes
gendath Generation of Highleyman classes
gendatl Generation of Lithuanian classes
gendatm Generation of many Gaussian distributed classes
gendats Generation of two Gaussian distributed classes
Exercise D.1 Generate an artificial dataset (one from the prdataset generation table) con-
taining 3 objects per class.
Exercise D.2 Inspect the data matrix in the prdataset object by using the + operator and
make a scatterplot of the dataset.
Exercise D.3 Can you point out which object is plotted where in the scatterplot? Check
whether each object is coloured according to its class label.
Datasets can be combined by [a; b] if a and b have equal numbers of features and by [a b]
if they have equal numbers of objects.
Exercise D.4 Make a dataset containing 2 classes with 5 objects per class and call it a1 (for
instance from gendats). Make a second dataset containing 2 classes with 5 objects per
class and call it a2 (for instance from gendatb).
Combine the two datasets using
>> b = [a1; a2]
and make a scatterplot.
Next combine the two datasets using
>> c = [a1 a2]
and again make a scatterplot. How do the two datasets differ? Check your conclusions
by using size(b) and size(c).
Creating subsets of datasets can be done by a(I,J) in which I is a set of indices defining the
desired objects and J is a set of indices defining the desired features. In all these examples
the a priori probabilities set for a remain unchanged.
Exercise D.5 Use the dataset c from the previous exercise to extract features 2 and 4. Make
a scatterplot and compare it with the data matrix. Where is the first object in the data
matrix plotted in the scatterplot?
There is a large set of routines for the generation of arbitrary normally distributed classes
(gauss), and for various specific problems (gendatc, gendatd, gendath, gendatm and
gendats). There are two commands for enriching classes by noise injection (gendatk and
gendatp). These are used for the general test set generator gendatt. A given dataset can be
split randomly into a training set and a test set using gendat.
prdataset manipulation
gendat Extraction of subsets of a given dataset
gendatk Nearest neighbour data generation
gendatp Parzen density data generation
gendatt Extraction of a test set from a given dataset
prdata Read data from file and convert into a dataset
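For instance, a random 50%-50% split (a sketch; gendat also accepts numbers of objects per
class, see help gendat):
>> a = gendath([100 100]);               % 100 objects per class
>> [trainset,testset] = gendat(a,0.5);   % half for training, half for testing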
Exercise D.6 Generate a dataset using gendatb containing 5 objects per class.
Enlarge this dataset by making a new dataset using gendatk and gendatp and append-
ing this dataset to your original dataset.
Make a scatterplot of the original and the second dataset, to check that it worked.
Note: have a look at the help if it is not clear how you should use a function!
Exercise D.7 Make a dataset with 100 objects per class. Generate two new datasets
trainset and testset with 60 and 40 objects per class respectively.
Make two scatterplots of the datasets. Do they differ much?
D.4 Datafile
The above described prdataset class refers to data that is stored in the computer memory
and not on disk. It thereby has limitations w.r.t. the numbers of objects and features that
can be used. Moreover, it is assumed that objects are already represented by a fixed number
of vector elements (features, dissimilarities).
The recently introduced (PRTools version 4.1) class prdatafile offers several ways to start
with objects stored in files as raw images, time signals or mat-files. There is thereby no
limitation to the size. A prdatafile is a special type of prdataset by which many operations
defined on a prdataset are supported. In addition there are tools to define several types of
preprocessing filters and feature measurement commands. At some moment all objects in a
prdatafile are represented by the same features and it can be converted to a prdataset.
After that all procedures for dimension reduction, cluster analysis, classification and evaluation
can be used.
An important difference between a prdatafile and a prdataset is that operations on the
first are just stored in the prdatafile as a defined processing which is only executed at the
moment it is converted to a prdataset. Operations on a prdataset are directly executed.
D.5 Mapping
Data structures of the class mapping store trained classifiers, feature extraction results, data
scaling definitions, nonlinear projections, etc. They are usually denoted by w.
mapping
prmapping Define mapping
getlab Retrieve labels assigned by mapping
A mapping w is often created by training a classifier on some data. For instance, the nearest
mean classifier nmc is trained on some data a by:
>> a = gendatb(20);
>> w = nmc(a)
Nearest Mean, 2 to 2 trained classifier --> affine
w by itself, or display(w), lists the size and type of a classifier as well as the routine which is
used for computing the mapping a*w.
When a mapping is trained, it can be applied to a dataset, using the operator *:
>> b = a*w
Banana Set, 20 by 2 dataset with 2 classes: [7 13]
The result of the operation a*w is again a dataset. It is the classified, rescaled or mapped
result of applying the mapping definition stored in w to a.
All routines operate in multi-class problems. For mappings which change the labels of the
objects (so the mapping is actually a classifier) the routines labeld and testc are useful.
labeld and testc are the general classification and testing routines respectively. They can
handle any classifier from any routine.
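A minimal illustration, applied to the classified dataset b = a*w from above:
>> lab = labeld(b);    % or b*labeld: the labels assigned by the classifier
>> e = testc(b)        % or b*testc: the estimated classification error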
Nonlinear classifiers
knnc k-nearest neighbour classifier (find k, build classifier)
mapk k-nearest neighbour mapping routine
testk Error estimation for k-nearest neighbour rule
parzenc Parzen density based classifier
parzenml Optimization of smoothing parameter in Parzen density estimation.
parzen_map Parzen mapping routine
testp Error estimation for Parzen classifier
edicon Edit and condense training sets
treec Construct binary decision tree classifier
tree_map Classification with binary decision tree
bpxnc Train feed forward neural network classifier by backpropagation
lmnc Train feed forward neural network by Levenberg-Marquardt rule
rbnc Train radial basis neural network classifier
neurc Automatic neural network classifier
rnnc Random neural network classifier
svc Support vector classifier
There are many commands to train and use mappings between spaces of different (or equal)
dimensionality. Mappings can be linear (e.g. a rotation) as well as nonlinear (e.g. a neural
network). Typically they can be used as classifiers. In that case a k by n mapping maps a
k-feature data vector on the output space of an n-class classifier (exception: 2-class classifiers
like discriminant functions may be implemented by a mapping to a 1D space such as the
distance to the discriminant, n = 1).
Mappings are of the data type prmapping and have a size of [k,n] if they map from k to n
dimensions. Mappings can be instructed to assign labels to the output columns, e.g. the class
names. These labels can be retrieved by labels = getlab(w).
Mappings can be learned from examples, i.e. (labeled) objects stored in a dataset a, for
instance by training a classifier (a minimal sketch; v1 and w1 are used again below):
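>> w1 = ldc(a)     % train the normal densities based linear classifier on a
An untrained mapping can also be stored first and trained afterwards using the * operator:
>> v1 = ldc        % untrained linear classifier
>> w1 = a*v1       % training; equivalent to w1 = ldc(a)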
optional
The mapping of a test set b by b*w1 is now equivalent to b*(a*v1) or even, irregular but
very handy, to a*v1*b (or even a*ldc*b). Note that expressions are evaluated from left to
right, so b*a*v1 may result in an error as the multiplication of the two datasets (b*a) is
executed first.
end optional
D.7 Example
In this example a 2D Highleyman dataset A is generated, 100 objects for each class. Out of
each class 20 objects are used for training (C) and 80 for testing (D). Four classifiers are
computed: a linear one and a quadratic one, both assuming normal densities (which is correct
in this case), a Parzen classifier and a neural network. Note that the data generation uses the
random number generator; as a result the example only reproduces when the original seed is
used. After computing and displaying the classification results for the test set, a scatterplot
is made in which all classifiers are drawn.
%PREX_PLOTC PRTools example on the dataset scatter and classifier plot
help prex_plotc
echo on
% Generate Highleyman data
A = gendath([100 100]);
% Split the data into the training and test sets
[C,D] = gendat(A,[20 20]);
% Compute classifiers
w1 = ldc(C); % linear
w2 = qdc(C); % quadratic
w3 = parzenc(C); % Parzen
w4 = lmnc(C,3); % neural net
% Store the classifiers in a cell array
W = {w1,w2,w3,w4};
% Compute and display the test errors
disp(D*W*testc);
% Plot the data and classifiers
figure
% Make a scatter-plot
scatterd(A);
% Plot classifiers
plotc(w1,w2,w3,w4);
echo off
Appendix E
Introduction to datafiles
Robert P.W. Duin and David M.J. Tax
E.1 Introduction
Since the introduction of PRTools 4.1 the toolbox has been extended to avoid some of the
limitations that were imposed by the use of the prdataset object. The first limitation of the
dataset object is that all objects in a dataset should have the same (number of) features. In
particular when one starts with raw data, for instance images or time series of different sizes,
the processing has to be done outside PRTools. A second limitation is that datasets can
become very large; so large, in fact, that the dataset object cannot be stored in memory. To
remove these limitations the prdatafile object is introduced.
A datafile in PRTools is a generalization of the dataset concept and inherits most of the
properties of a dataset. The datafile allows the distribution of a large dataset over a set
of files stored on disk. Datafiles are mainly an administration of the files and directories
in which the objects are stored. A datafile is, like a dataset, a set consisting of M objects,
each described by k features. k may be unknown and may vary over different objects, in
which case it is set to zero.
Almost all operations defined for datasets are also defined for datafiles, with a few exceptions.
Operations on datafiles are possible as long as they can be stored (e.g. filtering of images
for raw datafiles, or object selection by gendat). Furthermore, commands that are able to
process objects sequentially, like nmc and testc can be executed on datafiles. The use of
untrained mappings in combination with datafiles is a problem, as they have to be adapted
to the sequential use of the objects. Mappings that can handle datafiles are indicated in the
Contents file.
Different types of datafiles are defined:
raw Every file is interpreted as a single object in the dataset. These files may, for instance,
be images of different size. It is assumed that objects from different classes are stored
each in a different directory. The directory name is used as the class label.
cell All files should be mat-files containing just a single variable being a cell array. Its
elements are interpreted as objects. The file names will be used as labels during con-
struction. This may be changed by the user afterwards.
pre-cooked In this case the user should supply a command that reads the file and converts
it to a dataset. More than one object may be stored in a single file. Furthermore, the
command should define the class labels for the objects stored in the dataset.
mature This is a datafile created by PRTools, using the savedatafile command after
execution of all preprocessings defined for the datafile.
The most natural way for a user to create a datafile is to use the option raw.
Whenever a raw datafile is sufficiently defined by pre- and postprocessing, it can be converted
into a dataset. If the result is still a large dataset, not suitable for the available memory, it
should be stored by the savedatafile command and is then ready for later use. If the dataset
is sufficiently small it can be directly converted into a dataset by prdataset.
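Schematically, the workflow may look as follows (the directory name and the preprocessing
step are illustrative; check help prdatafile for the exact arguments):
>> a = prdatafile('mydata');   % raw datafile: one subdirectory per class
>> b = a*im_gray;              % preprocessing is defined, not yet executed
>> c = prdataset(b);           % conversion: now all processing is performed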
mappings for image processing
im_gray Convert any image to a gray-value image
im_invert Invert an image by subtracting it from its maximum
im_gaussf Gaussian filter
im_minf Minimum filter
im_maxf Maximum filter
im_stretch Grey-value stretching
im_hist_equalize Histogram equalization
im_threshold Thresholding
im_berosion Binary erosion
im_bdilation Binary dilation
im_bpropagation Binary propagation
im_label Label binary image
im_select_blob Selection of largest blob in a binary image
im_box Bounding box for binary image
im_center Center binary image in center of gravity
im_resize Resize images
im_scale Scale binary image
im_rotate Rotate image
im_fft Image FFT
im_cols Conversion of full color images to 3 grey value images
A number of more specific routines for measuring features in images are available as well.
One of these, im_measure, computes a single set of features. If the image contains a set
of objects, it has first to be split by im2meas into images containing a single object, in order
to obtain a measurement set in which each object refers to a single physical object. The
im_measure routine makes use of the DipImage package. The user has to take care that this
package is loaded.