Pattern Recognition Systems
Laboratory Works
1st Edition
UTPRESS
Cluj-Napoca, 2023
ISBN 978-606-737-637-1
Editura UTPRESS
Str. Observatorului nr. 34
400775 Cluj-Napoca
Tel.: 0264-401.999
e-mail: [email protected]
www.utcluj.ro/editura
Table of contents
Preface
1 Least Mean Squares
2 RANSAC
3 Hough Transform
4 Distance Transform
5 Statistical Data Analysis
6 Principal Component Analysis
7 K-means Clustering
8 K-Nearest Neighbor Classifier
9 Naive Bayes Classifier
10 Perceptron Classifier
11 AdaBoost Method
12 Support Vector Machine
References
Preface
Introduction
Electronic support
To access the additional files required for the practical work, use the following link:
https://cv.utcluj.ro/index.php/teaching.html
The website contains the following: the starting project (for multiple
versions of Visual Studio); an introduction to the OpenCV library; and
the additional data files required for the programming assignments.
Required software
Prerequisites
1 Least Mean Squares
1.1 Objectives
In this assignment a line is fitted to a set of points using the Least Mean
Squares method (linear regression). Both the iterative solution
(gradient descent) and the closed form are presented. This laboratory
work also introduces the OpenCV-based framework used throughout
the course.
When trying to fit a model to data, the first step is to establish the form of the model. Linear regression adopts a model that is linear in terms of the parameters (including a constant term). In this first part, we will adopt a simple model that expresses y in terms of x:

f(x) = \theta_0 + \theta_1 x
The Least Squares approach for determining the parameters states that
the best fit to the model will be obtained when the following quadratic
cost function is at its minimum:
J(\boldsymbol{\theta}) = \frac{1}{2}\sum_{i=1}^{n}\big(f(x_i) - y_i\big)^2
The squared differences can be motivated by the assumption that the error in the data follows a normal distribution – see reference [1]. Note that this minimizes the error only along the y-axis and not the actual distances of the points from the line. In order to minimize the cost function, we take its partial derivatives with respect to each parameter:
\frac{\partial}{\partial\theta_0} J(\boldsymbol{\theta}) = \sum_{i=1}^{n}\big(f(x_i) - y_i\big)

\frac{\partial}{\partial\theta_1} J(\boldsymbol{\theta}) = \sum_{i=1}^{n}\big(f(x_i) - y_i\big)\, x_i
The cost function attains its minimum when the gradient becomes zero.
One general approach to find the minimum is to use gradient descent.
Since the gradient shows the direction in which the function increases
the most, if we take steps in the opposite direction we decrease the
value of the function. By controlling the size of the step we can arrive
at a local minimum of the function. Since the objective function in this
case is quadratic, the function has a single minimum and so gradient
descent will find it.
\nabla J(\boldsymbol{\theta}) = \left[\frac{\partial J(\boldsymbol{\theta})}{\partial\theta_0}, \frac{\partial J(\boldsymbol{\theta})}{\partial\theta_1}\right]^T

\boldsymbol{\theta}_{new} = \boldsymbol{\theta} - \alpha\,\nabla J(\boldsymbol{\theta})

where α is the learning rate, which controls the step size.
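As a minimal sketch of the iterative solution (assuming the data points are already loaded into a std::vector<cv::Point2f> called points; the container name and the step size below are illustrative choices, not part of the original framework), the gradient descent update for this model could look as follows:

#include <opencv2/opencv.hpp>
#include <vector>

// Fits f(x) = theta0 + theta1*x by gradient descent on the quadratic cost J.
void fitLineGradientDescent(const std::vector<cv::Point2f>& points,
                            float& theta0, float& theta1,
                            float alpha = 1e-7f, int maxIter = 10000)
{
    theta0 = 0.0f; theta1 = 0.0f;
    int n = (int)points.size();
    for (int iter = 0; iter < maxIter; iter++) {
        float g0 = 0.0f, g1 = 0.0f;              // partial derivatives of J
        for (int i = 0; i < n; i++) {
            float err = theta0 + theta1 * points[i].x - points[i].y;
            g0 += err;                           // dJ/dtheta0
            g1 += err * points[i].x;             // dJ/dtheta1
        }
        theta0 -= alpha * g0;                    // step against the gradient
        theta1 -= alpha * g1;
    }
}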
The minimum can also be found in closed form: setting the partial derivatives to zero yields the following system:

\begin{cases} \theta_0\, n + \theta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \\ \theta_0 \sum_{i=1}^{n} x_i + \theta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \end{cases}
which is a linear system with two equations and two unknowns that can be solved directly to obtain the values for θ:

\theta_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}, \qquad \theta_0 = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - \theta_1 \sum_{i=1}^{n} x_i\right)
A second model describes the line in normal form:

x\cos(\beta) + y\sin(\beta) = \rho

This is a line with unit normal vector [\cos(\beta), \sin(\beta)] which is at a distance ρ from the origin. The cost function we wish to minimize in this case is the sum of squared distances of each point from the line. This is given by:
J(\beta, \rho) = \frac{1}{2}\sum_{i=1}^{n}\big(x_i\cos(\beta) + y_i\sin(\beta) - \rho\big)^2
Note that this is the actual error term that we want to minimize; in the previous section we considered only the error along the y-axis, which is not the true point-to-line distance.
\frac{\partial J}{\partial \rho} = -\sum_{i=1}^{n}\big(x_i\cos(\beta) + y_i\sin(\beta) - \rho\big)
Setting this derivative to zero gives:

\rho = \frac{1}{n}\left(\cos(\beta)\sum_{i=1}^{n} x_i + \sin(\beta)\sum_{i=1}^{n} y_i\right)
A third model uses the general form of the line equation:

ax + by + c = 0
with the corresponding cost function:

J(a, b, c) = \frac{1}{2}\sum_{i=1}^{n}\big(ax_i + by_i + c\big)^2
The resulting homogeneous least-squares problem can be solved through the singular value decomposition of the data matrix (see [2]):

A = UDV^T
1.5 Example Results
Figure 1.1 – Example Results using model 2 on data from files points1 and points2
1.6 References
[1] Stanford Machine Learning - course notes 1
    http://cs229.stanford.edu/notes/cs229-notes1.pdf
[2] Tomas Svoboda - Least-squares solution of Homogeneous Equations
    http://cmp.felk.cvut.cz/cmp/courses/XE33PVR/WS20072008/Lectures/Supporting/constrained_lsq.pdf
2 RANSAC
2.1 Objectives
This laboratory work discusses the RANSAC method and applies it to
the problem of fitting a line to a set of 2D data points.
Figure 2.1.a – Line obtained via Least Mean Squares fit on the whole data
Random Sample Consensus (RANSAC) is a paradigm for fitting a
model to experimental data, introduced by Martin A. Fischler and
Robert C. Bolles in [1]. RANSAC addresses the previous issue and
automatically determines the set of inlier points and proceeds to fit a
model only on this subset. As stated by the authors: "The RANSAC
procedure is opposite to that of conventional smoothing techniques:
Rather than using as much of the data as possible to obtain an initial
solution and then attempting to eliminate the invalid data points,
RANSAC uses as small an initial data set as feasible and enlarges this
set with consistent data when possible".
The method works on the supposition that if one of the two points used to define the line is an outlier, then the line will not gain much support. Scoring a line by the size of its support set has the advantage of favoring better fits. For example, the line passing through points a and b in Figure 2.1.b has a support of 10, whereas the line passing through points c and d has a support of only 2. We can probably deduce from this that c or d is an outlier point.
• How to choose the distance threshold t? If we assume that the measurement error is Gaussian with zero mean and standard deviation σ, then a value for t may be set to 3σ.
• How large is an acceptable consensus set? A rule of thumb is to terminate if the size of the consensus set is similar to the number of inliers believed to be in the data set, given the assumed proportion of outliers, i.e. for n data points T = q·n.
The equation of a line through two distinct points (x_1, y_1) and (x_2, y_2) is given by:

(y_1 - y_2)x + (x_2 - x_1)y + x_1 y_2 - x_2 y_1 = 0

The distance from a point (x_0, y_0) to a line given by the equation ax + by + c = 0 is:

d = \frac{|ax_0 + by_0 + c|}{\sqrt{a^2 + b^2}}
Selecting a random point from an array of n points (requires stdlib.h):
Point p = points[rand()%n];
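A minimal sketch of the RANSAC loop built on these two formulas is shown below; the vector name points and the parameters N (number of iterations), t (distance threshold) and T (acceptable consensus size) are assumed to have been chosen as discussed above:

#include <opencv2/opencv.hpp>
#include <vector>
#include <cstdlib>
#include <cmath>

// Returns the parameters (a, b, c) of the line ax + by + c = 0 with the largest support.
cv::Vec3f ransacLine(const std::vector<cv::Point>& points, int N, float t, int T)
{
    int n = (int)points.size();
    cv::Vec3f bestLine(0, 0, 0);
    int bestSupport = 0;
    for (int iter = 0; iter < N; iter++) {
        // 1. sample two distinct points
        cv::Point p1 = points[rand() % n], p2 = points[rand() % n];
        if (p1 == p2) continue;
        // 2. line through the two points
        float a = (float)(p1.y - p2.y);
        float b = (float)(p2.x - p1.x);
        float c = (float)(p1.x * p2.y - p2.x * p1.y);
        // 3. count the consensus set (points closer than t to the line)
        int support = 0;
        for (int i = 0; i < n; i++) {
            float d = std::fabs(a * points[i].x + b * points[i].y + c) / std::sqrt(a * a + b * b);
            if (d < t) support++;
        }
        // 4. keep the best model; stop early if the consensus set is large enough
        if (support > bestSupport) { bestSupport = support; bestLine = cv::Vec3f(a, b, c); }
        if (bestSupport >= T) break;
    }
    return bestLine;
}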
2.5 References
[1] Fischler, Martin A., and Robert C. Bolles. "Random sample
consensus: a paradigm for model fitting with applications to image
analysis and automated cartography." Communications of the
ACM 24, no. 6 (1981): 381-395.
[2] Hartley, Richard, and Andrew Zisserman. Multiple view geometry
in computer vision. Cambridge university press, 2003.
3 Hough Transform
3.1 Objectives
The main objective of this laboratory session is to implement the
Hough Transform for line detection from edge images.
The Hough transform was first proposed and patented by Peter Hough
in [1]. It proposes to count how many points are placed on each
possible line in the image. The original method relies on the
representation of the lines in the slope-intercept form (y = ax + b),
and on the building of the line parameter space, also called Hough
accumulator. For each image interest point, all possible lines are
considered, and the corresponding elements in the line parameter space
are incremented. Relevant image lines are found at the locations of the
local maxima in the line parameter space.
The initial proposal was focused on the detection of lines from video
sequences, based on a slope and free-term line representation. This
representation is not optimal because it is not bounded: in order to
represent all the possible lines in an image, the slope and the intercept
terms should vary between -∞ and +∞. The work of Duda and Hart
from [2] made the Hough transform more popular in the computer
vision field. The main problem of the original Hough transform
(unbounded parameters) is solved by using the so-called normal
parameterization. The normal parameterization of a line consists of
representing the line by its normal vector and the distance from origin
to the line. The normal representation (1) is sometimes referred to as
the ρ–θ representation (Figure 3.1).
Figure 3.1 – Line represented by its normal vector, at angle θ, and its distance ρ from the origin
The equation satisfied by a point (x, y) on the line is then given by:

\rho = x\cos(\theta) + y\sin(\theta)   (1)
Algorithm Hough Transform
Initialize H with 0
For each edge point P(x, y)
    For each θ from 0 to θ_max (with a step of Δθ)
        Compute ρ = x·cos(θ) + y·sin(θ)
        If ρ is inside the accumulator range, increment H(ρ, θ)
Although the Hough transform is widely used for line detection, it can
also work with more complex curves, as long as an adequate
parameterization is available. Duda and Hart [2] also proposed the
detection of circles based on the Hough transform. In this case a 3D
parameter space is needed and each interest point is transformed into a
right circular cone in the parameter space (all the possible circles
containing the point). Later, Ballard generalized the Hough transform
to detect arbitrary non-analytical shapes in [3].
3.3 Practical Background
Use the simplest configuration for parameter quantization: 1 pixel for ρ and 1 degree for θ. Use the second option for the parameter range from formula (2). The Hough accumulator will have D + 1 rows and 360 columns, where D is the image diagonal rounded to the nearest integer.
Hough.setTo(0);                    // initialize the accumulator with zeros
Hough.at<int>(ro, theta)++;        // cast a vote for the line (ro, theta)
Mat houghImg;
Hough.convertTo(houghImg, CV_8UC1, 255.f/maxHough);   // scale to [0, 255] for display
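Putting these fragments together, a sketch of the accumulator computation could look like this (the binary edge image edges, with edge pixels equal to 255, is an assumed input):

#include <opencv2/opencv.hpp>
#include <cmath>
using namespace cv;

// Builds the Hough accumulator from a binary edge image (edge pixels = 255).
Mat computeHoughAccumulator(const Mat& edges)
{
    int D = cvRound(std::sqrt((double)(edges.rows * edges.rows + edges.cols * edges.cols)));
    Mat Hough(D + 1, 360, CV_32SC1, Scalar(0));          // rho in [0, D], theta in [0, 359] degrees
    for (int i = 0; i < edges.rows; i++)
        for (int j = 0; j < edges.cols; j++)
            if (edges.at<uchar>(i, j) == 255)             // edge point P(x = j, y = i)
                for (int theta = 0; theta < 360; theta++) {
                    double rad = theta * CV_PI / 180.0;
                    int ro = cvRound(j * cos(rad) + i * sin(rad));
                    if (ro >= 0 && ro <= D)
                        Hough.at<int>(ro, theta)++;       // vote for the line (ro, theta)
                }
    return Hough;
}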
In order to locate peaks in the accumulator, test whether each Hough element is a local maximum in a square window (n x n) centered on that element. Use the following custom structure to store and sort the local maxima. The < operator of the structure has been overridden to use the > operator between the peak values, so that std::sort from the <algorithm> header sorts the peaks in descending order.
struct peak{
int theta, ro, hval;
bool operator < (const peak& o) const {
return hval > o.hval;
}
};
3.4 Practical Work
1. Compute the Hough accumulator based on the edge image and
display it as a grayscale image.
2. Locate the k largest local maxima from the accumulator. Try using
different window sizes such as: 3x3, 7x7 or 11x11.
3. Draw lines that correspond to the peaks found. Use both the
original image and the edge image for visualization.
Figure 3.2 - a. An image containing a pattern with straight borders corrupted by
salt-and-pepper like noise, b. The edges detected with the Canny edge detector, c.
The most relevant image lines are displayed with green, associated to the most
relevant 8 peaks from the Hough accumulator, d. The Hough accumulator
displayed using an intensity encoding, e. The Hough accumulator displayed in 3D,
using color encoding.
3.6 References
[1] P. Hough, “Method and means for recognizing complex patterns”,
US patent 3,069,654, 1962.
[2] R. O. Duda and P. E. Hart, "Use of the Hough Transformation to
Detect Lines and Curves in Pictures," Comm. ACM, Vol. 15, pp.
11–15, 1972.
[3] D. H. Ballard, "Generalizing the Hough Transform to Detect
Arbitrary Shapes", Pattern Recognition, Vol.13, No.2, p.111-122,
1981.
4 Distance Transform
4.1 Objectives
In this laboratory session we will study an algorithm that calculates the
Distance Transform of a binary image (object and background). Our
goal is to evaluate the pattern matching score between a known
template object (e.g. a pedestrian contour) and an unknown object (e.g.
the contour of a different object) in order to decide if the unknown
object is similar or not to the template object. The lower the pattern matching score, the more similar the unknown object is to the template.
One way to think about the distance transform is to first imagine that
foreground regions in the input binary image are made of some
uniform slow burning flammable material. Then consider
simultaneously starting a fire at all points on the boundary of a
foreground region and letting the fire burn its way into the interior. If
we then label each point in the interior with the amount of time that the
fire took to first reach that point, then we have effectively computed
the distance transform of that region.
Figure 4.1 shows an example that uses the chessboard (checkerboard) metric, where each value encodes the minimum distance to an object pixel (boundary pixel):
Figure 4.1 – On the left: binary input image; on the right: distance transform of the
image, every position shows the checkerboard distance to the closest boundary
point (0 values in the input image) [1]
Usually the transform is qualified with the chosen metric. For example,
one may speak of Manhattan Distance Transform, if the underlying
metric is Manhattan distance. Common metrics are:
- Euclidean distance;
- Taxicab geometry, also known as City block distance or
Manhattan distance;
- Chessboard distance.
- The distance transform map has the same size as the input image and is initialized with zeroes and large values based on the positions of edge points:

DT(i, j) = \begin{cases} 0, & \text{if } I(i, j) \in Object \\ \infty, & \text{if } I(i, j) \notin Object \end{cases}
- A double scan of the image is required to update the minimum distance: first top-down, left-to-right, and then bottom-up, right-to-left, with the corresponding two halves of the weight mask (see the figure below). On the first traversal the central element is compared with the neighbors corresponding to the yellow elements, and on the second traversal with the green elements.
It is worth noting that the center element, with weight 0, belongs to both masks, so the minimum is always compared with the existing value from the DT map.
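A minimal sketch of the double-scan computation with chamfer weights (2 for direct neighbors, 3 for diagonal neighbors) is given below; it assumes the input image img marks object pixels with 0, and it skips the image border for simplicity:

#include <opencv2/opencv.hpp>
#include <algorithm>
using namespace cv;

// Chamfer distance transform: object pixels are assumed to have value 0 in 'img'.
Mat chamferDT(const Mat& img)
{
    const int LARGE = 100000;
    Mat DT(img.rows, img.cols, CV_32SC1);
    for (int i = 0; i < img.rows; i++)
        for (int j = 0; j < img.cols; j++)
            DT.at<int>(i, j) = (img.at<uchar>(i, j) == 0) ? 0 : LARGE;

    int di[9] = {-1,-1,-1, 0, 0, 0, 1, 1, 1};
    int dj[9] = {-1, 0, 1,-1, 0, 1,-1, 0, 1};
    int w[9]  = { 3, 2, 3, 2, 0, 2, 3, 2, 3};

    // first pass: top-down, left-to-right, first half of the mask (k = 0..4)
    for (int i = 1; i < img.rows; i++)
        for (int j = 1; j < img.cols - 1; j++)
            for (int k = 0; k <= 4; k++)
                DT.at<int>(i, j) = std::min(DT.at<int>(i, j),
                                            DT.at<int>(i + di[k], j + dj[k]) + w[k]);

    // second pass: bottom-up, right-to-left, second half of the mask (k = 4..8)
    for (int i = img.rows - 2; i >= 0; i--)
        for (int j = img.cols - 2; j >= 1; j--)
            for (int k = 4; k <= 8; k++)
                DT.at<int>(i, j) = std::min(DT.at<int>(i, j),
                                            DT.at<int>(i + di[k], j + dj[k]) + w[k]);
    return DT;
}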
Our goal in this part is to compute the pattern matching score between
a template, which represents a known object, and an unknown object.
The score can be used to quantify the similarity between the contour of the template and that of the unknown object. We consider that both the template and the unknown object images have the same dimensions.
The steps for computing the pattern matching score:
- compute the DT image of the template object;
- superimpose the unknown object image over the DT image of the
template;
- the pattern matching score is the average of all the values from
the DT image that lie under the unknown object contour.
Example: Consider that the template object is a leaf contour and the
unknown object is a pedestrian contour:
- Compute the DT image of the leaf:
- Superimpose the pedestrian image over the DT image of the leaf;
- Evaluate the matching score as the average of the values from the DT image at the positions indicated by the pedestrian contour.
Figure 4.2 – From left to right: contour of a leaf; contour of a pedestrian; the
pedestrian superimposed on the distance transform of the leaf
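A sketch of the score computation, assuming contour pixels have the value 0 in the unknown object image and that dtTemplate is the (integer) distance transform of the template, for example the output of the chamferDT sketch above:

#include <opencv2/opencv.hpp>
using namespace cv;

// Average of the template DT values located under the contour of the unknown object.
double matchingScore(const Mat& dtTemplate, const Mat& unknownObj)
{
    double sum = 0;
    int count = 0;
    for (int i = 0; i < unknownObj.rows; i++)
        for (int j = 0; j < unknownObj.cols; j++)
            if (unknownObj.at<uchar>(i, j) == 0) {   // contour pixel of the unknown object
                sum += dtTemplate.at<int>(i, j);
                count++;
            }
    return (count > 0) ? sum / count : 0.0;
}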
Convenient 8-neighborhood access at position (i, j):
int di[9] = {-1,-1,-1, 0, 0, 0, 1, 1, 1};
int dj[9] = {-1, 0, 1,-1, 0, 1,-1, 0, 1};
int weight[9] = {3, 2, 3, 2, 0, 2, 3, 2, 3};
for (int k = 0; k < 9; k++) {
    uchar pixel = img.at<uchar>(i + di[k], j + dj[k]);
    // use pixel and weight[k] here
}
4.5 Example Results
Examples of DT images obtained with the Chamfer method and the suggested weight matrix.
Figure 4.3 – Upper row: input binary images; lower row: corresponding DT images
4.6 References
[1] Wikipedia The Free Encyclopedia – Distance Transform,
    http://en.wikipedia.org/wiki/Distance_transform
[2] Compendium of Computer Vision – Distance Transform,
    http://homepages.inf.ed.ac.uk/rbf/HIPR2/distance.htm
5 Statistical Data Analysis
5.1 Objectives
The purpose of this laboratory session is to explore methods for
analyzing statistical data. We will study the mean, standard deviation,
covariance and the Gaussian probability density function. The
experiments will be performed on a set of images containing faces. The
covariance matrix will be used to establish the correlations between
different pixels.
5.2.2 Statistical Characterization of Random variables
Let I be the feature matrix which will hold all the intensity values from the image set. I is of dimension p × N, where p is the number of images and N is the number of pixels in each image. The k-th row contains all the pixel values from the k-th image in row-major order. The row-major order for a 3x3 matrix is:
\begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix} \rightarrow [A_{00}, A_{01}, A_{02}, A_{10}, A_{11}, A_{12}, A_{20}, A_{21}, A_{22}]
Each image in the set has a size of 19x19 pixels. The interpretation of the feature matrix I is that each row holds a sample of the N-dimensional random variable X, which is drawn from the distribution underlying the dataset.
Your task will be to compute the covariance matrix of the given set of
images and to study how different features vary with respect to each
other.
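A minimal sketch of this computation, assuming the feature matrix I has already been filled as a p × N matrix of type CV_64FC1 (the function and variable names are illustrative):

#include <opencv2/opencv.hpp>
using namespace cv;

// Computes the mean of each feature (pixel) and the N x N covariance matrix
// from the p x N feature matrix I. The 1/p normalization is one common choice;
// the unbiased 1/(p-1) variant can be used instead.
void computeStatistics(const Mat& I, Mat& mean, Mat& cov)
{
    int p = I.rows, N = I.cols;
    mean = Mat::zeros(1, N, CV_64FC1);
    for (int k = 0; k < p; k++)
        for (int i = 0; i < N; i++)
            mean.at<double>(0, i) += I.at<double>(k, i) / p;

    cov = Mat::zeros(N, N, CV_64FC1);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0;
            for (int k = 0; k < p; k++)
                s += (I.at<double>(k, i) - mean.at<double>(0, i)) *
                     (I.at<double>(k, j) - mean.at<double>(0, j));
            cov.at<double>(i, j) = s / p;
        }
}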
The two-dimensional Gaussian probability density for a pair of features (i, j) is:

p(x_i, x_j) = \frac{1}{2\pi\sqrt{\det(C_{ij})}} \exp\left(-\frac{1}{2}\,[x_i - \mu_i,\; x_j - \mu_j]\; C_{ij}^{-1}\; [x_i - \mu_i,\; x_j - \mu_j]^T\right)

where μ_i is the mean for feature i and C_{ij} is the covariance matrix between features i and j. Normalize the density values to fit inside the range 0:255.
Figure 5.3 – From left to right, Example Results for tasks 5 a, b and c
5.6 References
[1] MIT CBCL FACE dataset
    http://www.ai.mit.edu/courses/6.899/lectures/faces.tar.gz
6 Principal Component Analysis
6.1 Objectives
This laboratory work describes the Principal Component Analysis
method. It is applied as a means of dimensionality reduction,
compression and visualization. A library capable of providing the
eigenvalue decomposition of a matrix is required.
Consider now the two vectors u1 and u2. If we project the 2D points onto the vector u2 we obtain scalar values with a small spread (standard deviation). If instead we project them onto u1, the spread is much larger. If we had to choose a single vector we would prefer to project onto u1, since the points can still be discerned from each other.
Here we have projected x onto each vector and then added the corresponding terms. The dot product ⟨x, u_i⟩ gives the magnitude of the projection; it needs to be normalized by the length of the vector ‖u_i‖, and the two vectors give the directions. This is possible since u1 and u2 are perpendicular. If we impose that they also be unit vectors then the normalization term disappears. See [4] for a better visualization.
The idea behind reducing the dimensionality of the data is to use only the largest projections. Since the projections onto u2 will be smaller, we can approximate x with only the first term:

\mathbf{x} \approx \langle \mathbf{x}, \mathbf{u}_1 \rangle\, \mathbf{u}_1
The question arises of how to determine the basis vectors onto which
to perform the projections. Since we are interested in maximizing the
preserved variance the covariance matrix could offer useful
information. The covariance of two features is defined as:
C(i, j) = \frac{1}{n-1}\sum_{k=1}^{n}(x_{ki} - \mu_i)(x_{kj} - \mu_j)

where μ_i is the mean of feature i and x_{ki} is the i-th feature of the k-th point. The covariance matrix contains the covariance values for all pairs. It can be shown that it can be expressed as a simple matrix product:
C = \frac{1}{n-1}\,(X - \boldsymbol{\mu}\mathbf{1}_{1\times n})^T (X - \boldsymbol{\mu}\mathbf{1}_{1\times n})

where μ is a vector containing all mean values and \mathbf{1}_{1\times n} is a row vector containing only ones. If we extract the mean from the data as a preprocessing step the formula simplifies further:
C = \frac{1}{n-1}\, X^T X
The next step is to find the axes along which the covariance is maximal. Eigenvalue decomposition of a matrix offers such information. Intuitively, (almost) any matrix can be viewed as a rotation followed by a stretching along the axes and the inverse rotation. The eigenvalue decomposition returns such a decomposition for the matrix:

C = Q\Lambda Q^T = \sum_{i=1}^{d}\lambda_i\, q_i q_i^T \approx Q_{1:k}\Lambda_{1:k}Q_{1:k}^T

where Q_{1:k} is a d×k matrix with the first k eigenvectors and \Lambda_{1:k} is a k×k diagonal matrix with the first k eigenvalues. If k equals d we obtain the original matrix, and as we decrease k we get increasingly poorer approximations of C.
Thus we have found the axes along which the variance of the projections is maximized. Then, for a general vector, its approximation using k vectors can be evaluated from the PCA coefficients:

X_{coef} = X\,Q

PCA approximation can be performed on all the input vectors at once (if they are stored as rows in X) using the following formulas:

\hat{X}_k = \sum_{i=1}^{k} X\,Q_i\,Q_i^T = \sum_{i=1}^{k} X_{coef,i}\,Q_i^T = X\,Q_{1:k}\,Q_{1:k}^T = X_{coef,1:k}\,Q_{1:k}^T
where Q_{1:k} signifies the first k columns of the matrix Q. It is important to distinguish the approximation from the coefficients; the approximation is the sum of the coefficients multiplied by the principal components.
Calculate the covariance matrix after the means have been subtracted:
Mat C = X.t()*X/(n-1);
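Continuing the same sketch (X, C and n as above), the eigenvalue decomposition and the k-term approximation can be obtained with OpenCV's eigen function, which for a symmetric matrix returns the eigenvalues in descending order and the eigenvectors as rows; k = 2 below is only an illustrative choice:

Mat lambda, Qt;           // eigenvalues (column vector) and eigenvectors as rows
eigen(C, lambda, Qt);     // C is symmetric, so eigen() applies
Mat Q = Qt.t();           // eigenvectors as columns, ordered by decreasing eigenvalue

int k = 2;                                    // illustrative choice of k
Mat Xcoef = X * Q.colRange(0, k);             // PCA coefficients (n x k)
Mat Xk = Xcoef * Q.colRange(0, k).t();        // k-term approximation of X (n x d)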
6.4 Practical Work
1. Open the input file and read the list of data points. The first line
contains the number of points n and the dimensionality of the data
d. The following n lines each contain a single point with d
coordinates.
2. Calculate the mean vector and subtract it from the data points.
3. Calculate the covariance matrix as a matrix product.
4. Perform the eigenvalue decomposition on the covariance matrix.
5. Print the eigenvalues.
6. Calculate the PCA coefficients and the k-th approximation \hat{X}_k for the input data.
7. Evaluate the mean absolute difference between the original points
and their approximation using k principal components.
8. Find the minimum and maximum values along the columns of the
coefficient matrix.
9. For the input data from pca2d.txt select the first two columns from
Xcoef and plot the data as black 2D points on a white background.
To obtain positive coordinates subtract the minimum values.
10. For input data from pca3d.txt select the first three columns from
Xcoef and plot the data as a grayscale image. Use the first two
components as x and y coordinates and the third as intensity values.
To obtain positive coordinates subtract the minimum values from
the first two coordinates. Normalize the third component to the
interval 0:255
11. Automatically select the smallest k which retains a given percentage of the original variance. For example, find the k for which the k-th approximation retains 99% of the original variance. The percentage of variance retained is given by \sum_{i=1}^{k}\lambda_i \,/\, \sum_{i=1}^{d}\lambda_i.
Figure 6.2 – Visualization of 2D points resulting from applying PCA on data from
pca2d.txt
6.6 References
[1] Wikipedia article PCA -
    https://en.wikipedia.org/wiki/Principal_component_analysis
[2] Stanford Machine Learning course notes –
    http://cs229.stanford.edu/notes/cs229-notes10.pdf
[3] Lindsay Smith - PCA tutorial –
    http://faculty.iiit.ac.in/~mkrishna/PrincipalComponents.pdf
[4] PCA in R (animation of projection) -
    https://poissonisfish.wordpress.com/2017/01/23/principal-component-analysis-in-r/
7 K-means Clustering
7.1 Objectives
This laboratory session deals with the problem of clustering a set of
points. This is a machine learning task that is unsupervised, i.e. the
class labels of the points are not known and not required. Successful
methods will identify the underlying structure in the data and group
similar points together.
The input for the method is the set of data points X = \{x_i, i = 1:n\}. Each point is d-dimensional: x_i = (x_{i1}, x_{i2}, \dots, x_{id}). The goal of the k-means clustering method is to partition the points into K sets denoted by S = \{S_k \mid k = 1:K\}. The mean value of the points in each set is named m_k. The partitioning must minimize the following objective function:

J(S) = \sum_{k=1}^{K}\sum_{x_i \in S_k} dist(x_i, m_k)^2

where dist is the Euclidean distance:

dist(x, y) = \sqrt{\sum_{i=1}^{d}(x_i - y_i)^2}
K-means algorithm
Initialization – Choose K points at random from the input dataset as the initial cluster centers:
m_k = x_{r_k}, \text{ where } r_k \text{ is a random index}
Assignment – Assign each point from the input dataset to the closest cluster center found so far. The membership function will take the value of the index of the closest center:
c_i = \arg\min_k dist(x_i, m_k)
Update – Recompute each cluster center as the mean of the points currently assigned to it:
m_k = \frac{1}{|S_k|}\sum_{x_i \in S_k} x_i
Halting condition - If there is no change in the membership function
then the algorithm can be halted because further calculation will lead
to no changes in the mean values. A maximum number of iterations
can also be enforced. If none of the above conditions are met the
algorithm continues with the assignment step.
// random integer in [a, b] (requires the <random> header)
default_random_engine gen;
uniform_int_distribution<int> distribution(a, b);
int randint = distribution(gen);
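A minimal sketch of the algorithm for 2D points (d = 2) is given below; the container of points and the function name are illustrative, and the k-means++ initialization from [3] is not included:

#include <opencv2/opencv.hpp>
#include <vector>
#include <random>
using namespace cv;
using namespace std;

static double dist2(const Point2f& a, const Point2f& b)
{
    return (a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y);   // squared Euclidean distance
}

// Returns the cluster label of each point; 'centers' receives the K mean values.
vector<int> kmeans2D(const vector<Point2f>& points, int K, vector<Point2f>& centers, int maxIter = 100)
{
    int n = (int)points.size();
    default_random_engine gen;
    uniform_int_distribution<int> dist(0, n - 1);
    centers.resize(K);
    for (int k = 0; k < K; k++) centers[k] = points[dist(gen)];      // random initialization
    vector<int> label(n, -1);
    for (int iter = 0; iter < maxIter; iter++) {
        bool changed = false;
        for (int i = 0; i < n; i++) {                                // assignment step
            int best = 0;
            for (int k = 1; k < K; k++)
                if (dist2(points[i], centers[k]) < dist2(points[i], centers[best])) best = k;
            if (label[i] != best) { label[i] = best; changed = true; }
        }
        if (!changed) break;                                         // halting condition
        vector<Point2f> sum(K, Point2f(0, 0));                       // update step
        vector<int> count(K, 0);
        for (int i = 0; i < n; i++) { sum[label[i]] += points[i]; count[label[i]]++; }
        for (int k = 0; k < K; k++)
            if (count[k] > 0) centers[k] = sum[k] * (1.0f / count[k]);
    }
    return label;
}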
c. Draw the Voronoi tessellation corresponding to the obtained cluster centers. For this picture you must color each image position (including the background) according to the closest cluster center.
3. Apply K-means on a grayscale image. Use the intensity as the
single feature for the input points – in this case d=1.
4. Recolor the input image based on the mean intensity of each
cluster.
5. Apply K-means on a color image. Use the color components as the
features for the input points – in this case d=3.
6. Recolor the input image based on the mean color of each cluster.
7. Optionally, implement the k-means++ initialization technique
from [3].
Figure 7.1 – Input 2D points and clusterization results with different K values
Figure 7.2 – Input grayscale image and its segmentation using K=3 and K=10
centers
Figure 7.3 – Input color image and its segmentation using K=3 and K=10 centers
7.6 References
[1] Cluster analysis Wikipedia article -
    https://en.wikipedia.org/wiki/Cluster_analysis
[2] K-means Wikipedia article -
    https://en.wikipedia.org/wiki/K-means_clustering
[3] Arthur, David, and Sergei Vassilvitskii. "k-means++: The advantages of careful seeding." Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007.
[4] P. Arbelaez, M. Maire, C. Fowlkes and J. Malik. "Contour Detection and Hierarchical Image Segmentation", IEEE TPAMI, Vol. 33, No. 5, pp. 898-916, May 2011.
[5] Image segmentation dataset:
    https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html
8 K-Nearest Neighbor Classifier
8.1 Objectives
The purpose of this laboratory work is to introduce perhaps the
simplest classifier: the k-Nearest Neighbor classifier. The classifier is
applied on a small image dataset with multiple classes.
The k-NN classifier is a type of instance-based learning, in which the function is only approximated locally and all computation is deferred until classification.
The distances are sorted in ascending order and the closest K training instances are considered. Each of these instances casts a vote for its class, which is known from y. The test instance is classified as the class which has the most votes. A more formal description follows.
\mathbf{h} = \sum_{k=1}^{K}\mathbf{1}[y_{p_k}]

c = \arg\max_i \mathbf{h}_i

where p_k is the index of the k-th closest training instance and \mathbf{1}[y] denotes the indicator vector of class y.
The votes can also be weighted, giving smaller weights to more distant neighbors:

\mathbf{h} = \sum_{k=1}^{K}\frac{\mathbf{1}[y_{p_k}]}{1 + d_{p_k}}

where d_{p_k} is the distance to the k-th closest instance.
The image histogram can be viewed as a global feature vector for the
image. The basic definition for a histogram of a grayscale image is that
of a vector which counts the occurrences of each gray level intensity.
It is a vector of dimension 256. In general, the histogram can be a
vector of length m if we divide the [0,255] interval in m equal parts. In
this case each bin in the histogram vector counts the number of gray
level intensities falling in that particular bin. For example: if m=8, the
first bin would count all intensities between 0 and 256/m - 1=31; the
second bin between 32 and 63; and so on. The histogram for a color
image can be formed by concatenating the individual histograms of the separate channels. The size of the resulting histogram is 3 × m.
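As a minimal sketch, the m-bin histogram of a color image (used here as the feature vector of the image) can be computed as follows; the function name is illustrative and m is assumed to divide 256:

#include <opencv2/opencv.hpp>
#include <vector>
using namespace cv;
using namespace std;

// Concatenated m-bin histograms of the three color channels (size 3*m).
vector<int> colorHistogram(const Mat& img, int m)
{
    vector<int> hist(3 * m, 0);
    int binSize = 256 / m;                        // assumes m divides 256
    for (int i = 0; i < img.rows; i++)
        for (int j = 0; j < img.cols; j++) {
            Vec3b pixel = img.at<Vec3b>(i, j);
            for (int ch = 0; ch < 3; ch++)
                hist[ch * m + pixel[ch] / binSize]++;   // one vote per channel value
        }
    return hist;
}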
8.2.4 Evaluation of classifiers
The accuracy for the classifier on a labeled test set is defined as the
percentage of correctly classified instances. It is the complementary
metric to the error rate. It does not offer relevant information if the
class distribution is skewed. If the number of instances is unbalanced,
a classifier that always predicts the most prevalent class will have a
high accuracy. This is the typical situation, for example: pedestrian
classifiers deal with a highly skewed distribution of much more
background image samples than pedestrian samples. In this case, more
relevant metrics are precision and recall for each class.
Read all images from class c, calculate the histogram and insert the
values in X:
int c = 0, fileNr = 0, rowX = 0;
char fname[256];
while (1) {
    sprintf(fname, "train/%s/%06d.jpeg", classes[c], fileNr++);
    Mat img = imread(fname);
    if (img.cols == 0) break;
    // compute the histogram of img and copy it into row rowX of X, then increment rowX
}
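Once X and the label vector y are filled, a sketch of classifying the histogram of a test image could look like this; it assumes X is stored as CV_32FC1, y as CV_8UC1, and that K and the number of classes nrClasses are known (all of these are assumptions of the sketch):

#include <opencv2/opencv.hpp>
#include <vector>
#include <algorithm>
#include <utility>
#include <cmath>
using namespace cv;
using namespace std;

// Returns the predicted class of the test feature vector using K-NN voting.
int classifyKNN(const Mat& X, const Mat& y, const Mat& hTest, int K, int nrClasses)
{
    int n = X.rows, d = X.cols;
    vector<pair<double, int>> dist(n);                 // (distance, class label) pairs
    for (int i = 0; i < n; i++) {
        double s = 0;
        for (int j = 0; j < d; j++) {
            double diff = X.at<float>(i, j) - hTest.at<float>(0, j);
            s += diff * diff;
        }
        dist[i] = make_pair(sqrt(s), (int)y.at<uchar>(i, 0));
    }
    sort(dist.begin(), dist.end());                    // ascending by distance
    vector<int> votes(nrClasses, 0);
    for (int k = 0; k < K; k++) votes[dist[k].second]++;   // each neighbor votes for its class
    return (int)(max_element(votes.begin(), votes.end()) - votes.begin());
}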
8.5 References
[1] Wikipedia article - k-NN classifier
    https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
[2] Andrew Ng - Machine Learning: Nonparametric methods & Instance-based learning
    http://www.cs.cmu.edu/~epxing/Class/10701-08s/Lecture/lecture2.pdf
9 Naive Bayes Classifier
9.1 Objectives
In this laboratory session we will study the Naive Bayes Classifier and
we will apply it to a specific recognition problem: we will learn to
distinguish between images of handwritten digits.
Using Bayes' rule from above, we can implement a classifier that
predicts the class based on the input features x. This is achieved by
selecting the class c that achieves the highest posterior probability.
Let X denote the feature matrix for the training set, as usual. In this
case X contains on every row the binarized values of each training
image to either 0 or 255 based on a selected threshold. X has the
dimension n x d, where n is the number of training instances and d=28
x 28 is the number of features which is equal to the size of an image.
The class labels are stored in the vector y of dimension n.
The prior for class i is calculated as the fraction of instances from class i out of the total number of instances:

P(C = i) = \frac{n_i}{n}

where n_i is the number of training instances from class i.
The likelihood of having feature j equal to 255 given class i is given by the fraction of the training instances from class i which have feature j equal to 255:

P(x_j = 255 \mid C = i) = \frac{\text{count}(X_{kj} = 255 \text{ and } y_k = i)}{n_i}
Since the ordering of the posteriors does not change when the log
function is applied, the predicted class will be the one with the highest
log posterior probability value. The log of the total probability can be
ignored since it is a constant.
while(index<100){
sprintf(fname, "train/%d/%06d.png", c, index);
Mat img = imread(fname, 0);
if (img.cols==0) break;
//process img
index++;
}
The likelihood is a C × d matrix (we only store the likelihood for the value 255):
const int d = 28*28;
Mat likelihood(C, d, CV_64FC1);
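A sketch of the classification step using the estimated priors and likelihoods is shown below; the container names are illustrative, and the small constant added inside the logarithm is an assumption made here to avoid log(0) (the lab may instead require smoothing the likelihoods):

#include <opencv2/opencv.hpp>
#include <vector>
#include <cmath>
using namespace cv;

// Predicts the class of a binarized 28x28 test image (values 0/255), given the
// priors (one per class) and the C x d likelihood matrix P(x_j = 255 | C = i).
int classifyBayes(const Mat& img, const std::vector<double>& priors, const Mat& likelihood)
{
    int C = likelihood.rows, d = likelihood.cols;
    double best = -1e300;
    int bestClass = 0;
    for (int c = 0; c < C; c++) {
        double logPost = log(priors[c]);
        for (int j = 0; j < d; j++) {
            double p255 = likelihood.at<double>(c, j);
            double p = (img.at<uchar>(j / 28, j % 28) == 255) ? p255 : 1.0 - p255;
            logPost += log(p + 1e-10);        // small constant avoids log(0) -- an assumption
        }
        if (logPost > best) { best = logPost; bestClass = c; }
    }
    return bestClass;
}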
9.5 References
[1] Electronic Statistics Textbook –
    http://www.statsoft.com/textbook/stnaiveb.html
[2] LeCun, Yann, Corinna Cortes, and Christopher J.C. Burges. "The MNIST database." (1998)
    http://yann.lecun.com/exdb/mnist
10 Perceptron Classifier
10.1 Objectives
This laboratory session presents the perceptron learning algorithm for
the linear classifier. We will apply gradient descent and stochastic
gradient descent procedure to obtain the weight vector for a two-class
classification problem.
10.2.1 Definitions
Define a training set as the tuple (X, Y), where X ∈ M_{n×m}(R) and Y is a vector Y ∈ M_{n×1}(D), where D is the set of class labels. X represents the concatenation of the feature vectors for each sample from the training set, where each row is an m-dimensional vector representing a sample. Y is the vector of desired outputs for the classifier. A classifier is a map from the feature space to the class labels: f: R^m → D.
Thus a classifier partitions the feature space into |D| decision regions.
The surface separating the classes is called decision boundary. If we
have only two dimensional feature vectors the decision boundaries are
lines or curves. In the following we will discuss binary classifiers. In
this case the set of class labels contains exactly two elements. We will
denote the labels for classes as D={-1,1}.
Figure 10.1. Example of a linear classifier for a two-class classification problem.
Each sample is characterized by two features. The decision boundary is a line.
if w^T x > -w_0 decide that sample x belongs to class +1
if w^T x < -w_0 decide that sample x belongs to class -1
Figure 10.3 – Image for the 2D case depicting: decision regions (red and blue), linear decision boundary (dashed line), weight vector (w) and bias (w_0 = d·‖w‖).
If an instance i is classified correctly, no penalty is applied because the expression -y_i\,\hat{\mathbf{w}}^T\hat{\mathbf{x}}_i is negative. In the case of a misclassification, this expression will be positive and it will be added to the loss value. The objective now is to find the weights that minimize the loss function, which can be done with gradient descent:

\hat{\mathbf{w}}_{k+1} \leftarrow \hat{\mathbf{w}}_k - \eta\,\nabla L(\hat{\mathbf{w}}_k)
where \hat{\mathbf{w}}_k is the weight vector at time k, η is a parameter that controls the step size and is called the learning rate, and \nabla L(\hat{\mathbf{w}}_k) is the gradient vector of the loss function at point \hat{\mathbf{w}}_k. The gradient of the loss function is:

\nabla L(\hat{\mathbf{w}}) = \frac{1}{n}\sum_{i=1}^{n}\nabla L_i(\hat{\mathbf{w}})
\nabla L_i(\hat{\mathbf{w}}) = \begin{cases} 0, & \text{if } y_i\,\hat{\mathbf{w}}^T\hat{\mathbf{x}}_i > 0 \\ -y_i\,\hat{\mathbf{x}}_i, & \text{otherwise} \end{cases}

Stochastic (online) gradient descent updates the weights after each individual sample:

\hat{\mathbf{w}}_{k+1} \leftarrow \hat{\mathbf{w}}_k - \eta\,\nabla L_i(\hat{\mathbf{w}}_k)
Algorithm: Batch Perceptron
Algorithm: Online Perceptron
Each point is described by the color (that denotes the class label) and the two coordinates, x1 and x2. The augmented weight vector will have the form \hat{w} = [w_0\; w_1\; w_2]. The augmented feature vector will be \hat{x} = [1\; x_1\; x_2].
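A minimal sketch of the online perceptron training loop on the augmented vectors is given below; X is assumed to be an n × 3 CV_32FC1 matrix whose first column is 1 and Y an n × 1 CV_32FC1 vector of ±1 labels (these storage choices are assumptions of the sketch):

#include <opencv2/opencv.hpp>
using namespace cv;

// Online perceptron: returns the augmented weight vector [w0 w1 w2].
Mat trainPerceptronOnline(const Mat& X, const Mat& Y, float eta = 1e-4f, int maxIter = 100000)
{
    Mat w = (Mat_<float>(1, 3) << 0.1f, 0.1f, 0.1f);          // suggested initial weights
    for (int iter = 0; iter < maxIter; iter++) {
        bool allCorrect = true;
        for (int i = 0; i < X.rows; i++) {
            float z = 0;
            for (int j = 0; j < X.cols; j++)
                z += w.at<float>(0, j) * X.at<float>(i, j);   // z = w^T * x_i
            float yi = Y.at<float>(i, 0);
            if (z * yi <= 0) {                                // misclassified sample
                for (int j = 0; j < X.cols; j++)
                    w.at<float>(0, j) += eta * yi * X.at<float>(i, j);   // w <- w + eta*y_i*x_i
                allCorrect = false;
            }
        }
        if (allCorrect) break;                                // every sample classified correctly
    }
    return w;
}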
10.4 Practical Work
1. Read the points from a single file test0*.bmp and construct the training set
(X,Y). Assign the class label +1 to red points and -1 to blue points.
2. Implement and apply the online perceptron algorithm to find the linear
classifier that divides the points into two groups. Suggestion for
parameters:
η = 10^{-4}, w^{(0)} = [0.1, 0.1, 0.1], E_{limit} = 10^{-5}, max_iter = 10^5.
Note: for a faster convergence use a larger learning rate only for w_0.
3. Draw the final decision boundary based on the weight vector w.
4. Implement the batch perceptron algorithm and find suitable parameter values. Show the loss function at each step. It should decrease slowly.
5. Visualize the decision boundary at intermediate steps, while the learning algorithm is running.
6. Change the starting values for the weight vector w, the learning rate and the terminating conditions to observe what happens in each case. What does an oscillating cost function signal?
Iteration 0
i=0: w=[1.000000 1.000000 -1.000000] xi=[1 23 5] yi = 1 zi=19.000000
i=1: w=[1.000000 1.000000 -1.000000] xi=[1 15 11] yi = 1 zi=5.000000
i=2: w=[1.000000 1.000000 -1.000000] xi=[1 14 21] yi = -1 zi=-6.000000
i=3: w=[1.000000 1.000000 -1.000000] xi=[1 27 23] yi = -1 zi=5.000000
wrong
update w0 = w0 - 0.01, w1 = w1 - 27*0.01, w2 = w2 - 23*0.01
i=4: w=[0.990000 0.730000 -1.230000] xi=[1 20 27] yi = -1 zi=-17.620000
Iteration 1
i=0: w= [0.990000 0.730000 -1.230000] xi= [1 23 5] yi = 1 zi=11.630000
i=1: w= [0.990000 0.730000 -1.230000] xi= [1 15 11] yi = 1 zi=-1.590000
wrong
update w0 = w0 + 0.01, w1 = w1 + 15*0.01, w2 = w2 + 11*0.01
i=2: w= [1.000000 0.880000 -1.120000] xi= [1 14 21] yi = -1 zi=-10.200000
i=3: w= [1.000000 0.880000 -1.120000] xi= [1 27 23] yi = -1 zi=-1.000000
i=4: w= [1.000000 0.880000 -1.120000] xi= [1 20 27] yi = -1 zi=-11.640000
Iteration 2
i=0: w= [1.000000 0.880000 -1.120000] xi= [1 23 5] yi = 1 zi=15.640000
i=1: w= [1.000000 0.880000 -1.120000] xi= [1 15 11] yi = 1 zi=1.880000
i=2: w= [1.000000 0.880000 -1.120000] xi= [1 14 21] yi = -1 zi=-10.200000
i=3: w= [1.000000 0.880000 -1.120000] xi= [1 27 23] yi = -1 zi=-1.000000
i=4: w= [1.000000 0.880000 -1.120000] xi= [1 20 27] yi = -1 zi=-11.640000
All classified correctly
10.6 References
[1] Rosenblatt, Frank (1957), The Perceptron - a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory.
[2] Richard O. Duda, Peter E. Hart, David G. Stork: Pattern Classification, 2nd ed.
[3] Xiaoli Z. Fern, Machine Learning and Data Mining Course, Oregon State University -
    http://web.engr.oregonstate.edu/~xfern/classes/cs434/slides/perceptron-2.pdf
[4] Gradient Descent - http://en.wikipedia.org/wiki/Gradient_descent
[5] Avrim Blum, Machine Learning Theory, Carnegie Mellon University -
    https://www.cs.cmu.edu/~avrim/ML10/lect0125.pdf
11 AdaBoost Method
11.1 Objectives
In this laboratory session we will study a method for obtaining an ensemble
classifier called AdaBoost (Adaptive Boosting). We will apply it for a binary
classification problem on 2D points.
Figure 11.1 – Example of samples from two classes (red and blue points)
The strong classifier is a weighted combination of T weak learners:

H_T(\mathbf{x}) = \sum_{t=1}^{T}\alpha_t h_t(\mathbf{x})

Each weak learner h_t returns either +1 or -1 and is weighted by α_t. The final class is given by the sign of the strong classifier H_T(x). In this work, we will use decision stumps as weak learners. A decision stump classifies an instance by looking at a particular feature: if this feature is below a threshold, the instance is classified as class +1, and -1 otherwise.
We are given the training set in the following form: X is the feature matrix of dimension n × m and contains the n training samples, each row being an individual sample of dimension m. In our case, m = 2 and the features are the rows and columns at which the points are found in the input image. The class vector y of dimension n contains +1 for each red point and -1 for each blue point.
For this method we will associate a weight with each example. We will store the weights in the weight vector w of dimension n. Initially all samples have an equal weight of 1/n. The following algorithm describes the high-level AdaBoost procedure which finds the strong classifier H_T.
Algorithm AdaBoost
init w_i = 1/n
for t = 1:T
    // findWeakLearner also returns the weighted training error ε_t
    [h_t, ε_t] = findWeakLearner(X, y, w)
    α_t = 0.5 · ln((1 − ε_t) / ε_t)
    s = 0
    for i = 1:n
        // wrongly classified examples obey y_i·h_t(X_i) < 0,
        // so their weights will become larger
        w_i ← w_i · exp(−α_t · y_i · h_t(X_i))
        s += w_i
    endfor
    // normalize the weights
    for i = 1:n
        w_i ← w_i / s
    endfor
endfor
// returns all the alpha values and the weak learners
return [α, h]
[h_t, ε_t] = findWeakLearner(X, y, w)
    best_h = {}
    best_err = ∞
    for j = 1:X.cols
        for threshold = 0:max(img.cols, img.rows)
            for class_label = {-1, 1}
                e = 0
                for i = 1:X.rows
                    if X(i, j) < threshold
                        z_i = class_label
                    else
                        z_i = -class_label
                    endif
                    if z_i·y_i < 0
                        e += w_i
                    endif
                endfor
                if e < best_err
                    best_err = e
                    best_h = {j, threshold, class_label, e}
                endif
            endfor
        endfor
    endfor
    return [best_h, best_err]
The underlying idea behind this algorithm is to find the best simple (weak) classifier and then to modify the importance of the examples. Misclassified examples will get a higher weight and correctly classified examples will get a lower weight. An example is classified as the wrong class if the sign of the expression y_i h_t(X_i) is negative (the predicted and correct class labels have different signs).
In the following step, when we search for the next weak learner, it will be
more important to correctly classify the examples which have higher weights
since they contribute more to the weighted training error.
Each weak learner contributes to the final score of the classifier. The
contribution is weighted by how well the weak learner performed in terms of
the weighted training error.
11.3 Practical Background
Suggested structure for a single weak learner (assumes X stores floats):
struct weaklearner{
    int feature_i;
    int threshold;
    int class_label;
    float error;
    int classify(Mat X){
        if (X.at<float>(feature_i) < threshold)
            return class_label;
        else
            return -class_label;
    }
};
Header for function that finds the best weak learner – note that the
weaklearner structure stores the weighted error:
weaklearner findWeakLearner(Mat X, Mat y, Mat w)
Header for function which draws the decision boundary (keep the original
image unmodified):
void drawBoundary(Mat img, classifier clf)
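The classifier type used by drawBoundary is not defined in this excerpt; a minimal sketch of a strong classifier built from the selected weak learners and their alpha values might look like this (requires the <vector> header):

struct classifier {
    int T;                              // number of weak learners
    std::vector<float> alphas;          // weight of each weak learner
    std::vector<weaklearner> hs;        // the selected weak learners

    // sign of the weighted sum of weak learner outputs
    int classify(Mat X) {
        float s = 0;
        for (int t = 0; t < T; t++)
            s += alphas[t] * hs[t].classify(X);
        return (s >= 0) ? 1 : -1;
    }
};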
11.4 Practical Work
1. Read the training set from one of the input files (points*.bmp). Each row from the feature matrix X should contain the row and column of each colored point from the image. The class vector y contains +1 for red and -1 for blue points.
2. Implement the decision stump weak learner – the weaklearner
structure.
3. Implement the findWeakLearner function.
4. Implement the drawBoundary function which colors the input image
showing the decision boundary by changing the background color (white
pixels) based on the classification result. Use yellow for +1 background
and teal for -1 background pixels. Test the function with a strong classifier
formed by a single weak learner.
5. Implement the AdaBoost algorithm to find the strong classifier with T
weak learners. Visualize the decision boundary. For each input image find
the value of T which results in zero classification error. What are the
limitations of the presented method?
Figure 11.2 – Sample results on points1 with number of weak learners T=1 (left) and T=13
(right)
11.6 References
[1] Robert E. Schapire, The Boosting Approach to Machine Learning, An
Overview, 2001
[2] AdaBoost - https://en.wikipedia.org/wiki/AdaBoost
12 Support Vector Machine
12.1 Objectives
In this lab session we will implement the simple linear classifier described in
the previous lab and we will study the mechanisms of support vector
classification based on soft margin classifiers.
We will start the discussion from a simple problem of separating a set of points into two classes, as depicted in Figure 12.1:
The question here is how can we classify these points using a linear
discriminant function in order to minimize the training error rate? We have an
infinite number of answers, as shown in Figure 12.2:
Figure 12.2 – Linear classifiers that correctly discriminate between the two classes
From the multitude of solutions we need to find which one is the best. One possible answer is given by the linear discriminant function with the maximum margin. Informally, the margin is defined as the width by which the boundary can be increased before hitting a data point, see Figure 12.3.
Choosing two points from the positive and negative sides of the boundary we know that:

\mathbf{w}^T\mathbf{x}^+ + b = 1
\mathbf{w}^T\mathbf{x}^- + b = -1

Figure 12.4 – Positive and negative samples nearest to the separation boundary – support vectors

The margin is the projection of the difference of the two points onto the unit normal of the boundary:

M = (\mathbf{x}^+ - \mathbf{x}^-)\cdot\mathbf{n} = (\mathbf{x}^+ - \mathbf{x}^-)\cdot\frac{\mathbf{w}}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}

Maximizing the margin is therefore equivalent to the following optimization problem:

minimize \frac{1}{2}\|\mathbf{w}\|^2 such that:
y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1.
12.2.3 Soft-margin classifiers
If the data points are not linearly separable, a transformation can be applied to each sample x_i which performs a mapping into a higher dimensional space where they are linearly separable. Denoting this transformation by φ we can write the following optimization problem:

minimize \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i such that y_i(\mathbf{w}^T\phi(\mathbf{x}_i) + b) \geq 1 - \xi_i and \xi_i \geq 0.
Since the solution for the SVM requires only dot products between instances, the usage of the transformation φ can be avoided if we define the following kernel function:

K(\mathbf{u}, \mathbf{v}) = \langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle
The application allows several parameters, but we will use two of them:
• '-t kernel_type' specifies the type of kernel function (default 2); 'kernel_type' can be one of the following:
    0 – linear kernel: <u, v> = u^T v
    1 – polynomial kernel: (γ<u, v> + c)^d
    2 – radial basis function: exp(−γ|u − v|^2)
    3 – sigmoid: tanh(γ<u, v> + c)
• '-c cost' specifies the parameter C from the soft margin classification problem
• ‘SimpleClassifier’ button – implements the simple classifier.
2. For each image in svm_images.zip run the default SVM classifier
(with different kernels and costs)
3. Implement the ‘SimpleClassifier’ code and compare it to the SVM
classifier that uses a linear kernel.
Write the code in the file svm-toy.cpp for the case branch:
case ID_BUTTON_SIMPLE_CLASSIFIER:
{
/* ****************************************
TO DO:
WRITE YOUR CODE HERE FOR THE SIMPLE CLASSIFIER
**************************************** */
}
For implementing the simple classifier you should know that in the
svm_toy.cpp file the coordinates of the points are stored in the structure
list<point> point_list;
struct point {
double x, y;
signed char value;
};
Notice that the dimension of the classification space is XLEN x YLEN. Hence a normalized point (x, y) corresponds to the coordinates (x*XLEN, y*YLEN) in the classification (drawing) space.
In order to iterate over all the points and count how many points are in class '1' and in class '2' you should do the following:
//declare an iterator
list<point>::iterator p;
int nrSamples1 = 0;
int nrSamples2 = 0;
double xC1 = 0, xC2 = 0, yC1 = 0, yC2 = 0;
for (p = point_list.begin(); p != point_list.end(); p++)
{
    if ((*p).value == 1) //point from class '1'
    {
        nrSamples1++;
        xC1 += (*p).x; //accumulate the normalized x coordinate of the current point
        yC1 += (*p).y; //accumulate the normalized y coordinate of the current point
    }
    if ((*p).value == 2) //point from class '2'
    {
        nrSamples2++;
        xC2 += (*p).x; //accumulate the normalized x coordinate of the current point
        yC2 += (*p).y; //accumulate the normalized y coordinate of the current point
    }
}
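The loop above only accumulates the per-class counts and coordinate sums; assuming the simple classifier is built from the two class centroids (an assumption suggested by the accumulated values), the centroids can then be obtained as:

// class centroids in normalized coordinates (assumes both classes contain points)
double cx1 = xC1 / nrSamples1, cy1 = yC1 / nrSamples1;
double cx2 = xC2 / nrSamples2, cy2 = yC2 / nrSamples2;
// the same centroids expressed in the drawing (classification) space
double px1 = cx1 * XLEN, py1 = cy1 * YLEN;
double px2 = cx2 * XLEN, py2 = cy2 * YLEN;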
12.4 Sample result:
Details:
- 2D points to be classified
- 2 classes, 2 features (x1 and x2)
- Red line separation obtained by
implementing the ‘Simple
Classifier’ algorithm
- Cyan/Brown line separation
obtained by SVM linear kernel
(-t 0) and cost C=100 (-c 100)
Observe:
- The maximized margin obtained
with SVM
- The points incorrectly classified
by simple classifier
12.5 References
[1] J. Shawe-Taylor, N. Cristianini: Kernel Methods for Pattern Analysis.
Pattern Analysis (Chapter 1)
[2] B. Scholkopf, A. Smola: Learning with Kernels. A Tutorial Introduction
(Chapter 1), MIT University Press.
[3] LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
REFERENCES
Global references:
1. Richard O. Duda, Peter E. Hart, David G. Stork, "Pattern Classification", John Wiley and Sons, 2001.
3.3 D. H. Ballard, "Generalizing the Hough Transform to Detect Arbitrary Shapes", Pattern Recognition, Vol. 13, No. 2, pp. 111-122, 1981.
3.4 https://en.wikipedia.org/wiki/Hough_transform
7.5 Image segmentation dataset:
    https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html
12.2 B. Scholkopf, A. Smola: Learning with Kernels. A Tutorial Introduction (Chapter 1), MIT University Press.