Pattern Recognition
Jianxin Wu
LAMDA Group
National Key Lab for Novel Software Technology
Nanjing University, China
[email protected]
Contents
1 An example: autonomous driving
Exercises
This book will introduce several algorithms, methods, and practices in pattern
recognition, with a sizable portion of its contents devoted to introducing
various machine learning algorithms.
Technical details (such as algorithms and practices) are important. How-
ever, we encourage the readers to focus on more fundamental issues rather than
dwelling on technical details. For example, keeping the following questions in
mind will be useful.
• Given a specific pattern recognition task, what is the input and output of
this task? What aspects of the task make it difficult to solve?
• Among the many existing algorithms and methods, which one(s) should
be used (or must not be used) for our task at hand? Is there any good
reason to support your decision?
• There is no silver bullet that can solve all problems. For example, deep
learning has emerged as the best solution technique for many applications,
but there are many tasks that cannot be effectively attacked by deep
learning. If we have to develop a new method for one task, is there a
procedure that can help us?
The field of pattern recognition is concerned with the automatic
discovery of regularities in data through the use of computer algo-
rithms and with the use of these regularities to take actions such as
classifying the data into different categories.
T.1 The car will detect that a person is approaching, recognize whether that
person is its owner, and unlock the door for the owner.
T.2 The car will communicate with the user to learn about the destination of
a trip.
T.3 The car will find its way and drive to the destination on its own (i.e.,
autonomously), while the user may take a nap or enjoy a movie during
the trip.
T.4 The car will properly notify the user upon its arrival and shall park itself
after the user leaves the car.
For task T.1 above, a viable approach is to install several cameras on different
sides of the car, which can operate even when the engine is switched off. These
cameras will watch vigilantly and automatically detect when a person is
approaching the car. The next step is to identify the person: is he or she the
owner of this car? If the answer is yes and the owner is close enough to
any one of the car’s doors, the car shall unlock that door.
Because these steps are based on cameras, task T.1 can be viewed as a
computer vision (CV) task.
Computer vision methods and systems take as inputs the images or videos
captured by various imaging sensors (such as optic, ultrasound, or infrared
cameras). The goal of computer vision is to design hardware and software that
can work together and mimic or even surpass the functionality of the human
vision system—e.g., the detection and recognition of objects (such as identifying
a person) and the identification of anomalies in the environment (such as finding
that a person is approaching). In short, the inputs of computer vision methods
or systems are different types of images or videos, and the outputs are the
results of image or video understanding, which also appear in different forms
according to the task at hand.
Task T.1 can thus be decomposed into many subtasks, which correspond to widely
researched topics in computer vision, e.g., pedestrian detection and human
identity recognition (such as face recognition based on facial appearance, or
gait recognition based on a person’s walking style and habits).
Task T.2 involves different sensors and data acquisition methods. Although
we could ask the user to type in his or her destination with a keyboard, the
more natural way is to carry out this communication in natural language.
Hence, microphones and speakers should be installed around the car’s interior.
The user may just say “Intercontinental hotel” (in English) or “Zhou Ji Jiu
Dian” (in Chinese). It is the car’s job to detect that a command has been
issued and to acknowledge (and probably also confirm) the user’s command before
it starts driving. The acknowledgment or confirmation, of course, is given in
natural language, too.
Techniques from various areas are required to carry out these natural verbal
communications, such as speech recognition, natural language processing, and
speech synthesis. The car will capture the words, phrases, or sentences spoken
by the user through speech recognition, understand their meanings, and choose
appropriate answers through natural language processing; finally, it will speak
its reply through speech synthesis.
T.2 involves two closely related subjects: speech processing and natural lan-
guage processing. The input of T.2 is speech signals obtained via one or several
microphones, which will undergo several layers of processing: the microphones’
electronic signals are first converted into meaningful words, phrases, or sentences
by speech recognition; the natural language processing module will convert these
words into representations that a computer can understand; natural language
processing is also responsible for choosing an appropriate answer (e.g., a confir-
mation or a further clarification request) in the form of one or several sentences
in text; finally, the speech synthesis module must convert the text sentences
into audio signals, which will be spoken to the user via a loudspeaker and are
the final output of T.2. Note that the modules used in T.2 have different
intermediate inputs and outputs, with one module’s output often being the input
of the next processing module.
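To make this chain of modules concrete, here is a purely illustrative Octave/MATLAB sketch of the T.2 data flow. The function names (recognize_speech, choose_reply, synthesize_speech) are hypothetical stand-ins, defined as trivial stubs only so that the script runs; they are not real speech or language processing libraries.

% A purely illustrative sketch of the T.2 processing chain.
% The three "modules" below are hypothetical stubs, not real systems.
recognize_speech  = @(audio) 'Intercontinental hotel';  % speech recognition stub
choose_reply      = @(text)  ['Confirmed: ' text];      % natural language processing stub
synthesize_speech = @(text)  zeros(16000, 1);           % speech synthesis stub

audio_in  = zeros(16000, 1);              % one second of (silent) microphone input
text_in   = recognize_speech(audio_in);   % audio signal -> recognized text
reply_txt = choose_reply(text_in);        % recognized text -> reply text
audio_out = synthesize_speech(reply_txt); % reply text -> audio, the final output of T.2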
T.3 and T.4 can be analyzed in a similar fashion. However, we will leave
the analyses of T.3 and T.4 to the reader.
In this example, we have witnessed many sensors (e.g., camera, infrared
camera, microphone) and outputs of various modules (e.g., existence of a person,
human identity, human voice signal). Many more sensory inputs and outputs
are required for this application. For example, in T.3, highly accurate global
positioning sensors (such as GPS or BeiDou receiving sensors) are necessary in
order to know the car’s precise location. Radars, which could be millimeter-
wave radar or laser based (Lidar), are also critical to sense the environment for
driving safety. New parking lots may be equipped with RFID (radio-frequency
identification) tags and other auxiliary sensors to help the automatic parking
task.
Similarly, more modules are required to process these new sensory data and
produce good outputs. For example, one module may take both the camera
and radar inputs to determine whether it is safe to drive forward. It will detect
any obstacle that stands in front of the car and avoid a collision if obstacles do
exist.
Figure 1: A typical pattern recognition pipeline.
In the computer, a raw image is simply stored as a large collection of
integers between 0 and 255. This large quantity of integers is not very helpful
for finding useful patterns or regularities in the image.
Figure 2 shows a small gray-scale (single channel) face image (Figure 2a)
and its raw input format (Figure 2b) as a matrix of integers. As shown by
Figure 2a, although the resolution is small (23 × 18, which is typical for faces in
surveillance videos), our brain can still interpret it as a candidate for a human
face, but it is almost hopeless to guess the identity of this face. The computer,
however, sees a small 23 × 18 matrix of integers, as shown in Figure 2b. These
414 numbers are far away from our concept of a human face.
Hence, we need to extract or learn features—i.e., to turn these 414 numbers
into other numerical values which are useful for finding faces. For example,
because the eye regions are usually darker than other facial regions, we can
compute the sum of pixel intensity values in the top half of the image (12 × 18
pixels, denoting the sum as v1) and the sum of pixel intensity values in the
bottom half (12 × 18 pixels, denoting the sum as v2). Then, the value v1 − v2
can be treated as a feature value: if v1 − v2 < 0, the upper half is darker and
this small image is possibly
a face. In other words, the feature v1 − v2 is useful in determining whether this
image is a face or not.
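As a small illustration, the following Octave/MATLAB sketch computes this v1 − v2 feature for one gray-scale image. The image is filled with random values here, purely as a placeholder for real data.

% A minimal sketch of the v1 - v2 feature described above.
% img stands for a 23 x 18 gray-scale image; random values are used only
% as a placeholder for a real image.
img = randi([0 255], 23, 18);

v1 = sum(sum(img(1:12, :)));   % intensity sum of the top half (rows 1-12)
v2 = sum(sum(img(12:23, :)));  % intensity sum of the bottom half (rows 12-23;
                               % the middle row is shared so each half has 12 rows)
feature = v1 - v2;             % negative value: upper half darker, possibly a face

if feature < 0
    disp('possibly a face');
else
    disp('probably not a face');
end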
Of course, this single feature is very weak; we may extract many feature
values from an input image, and these feature values form a feature vector.
In the above example, the features are manually designed. Manually de-
signed features often follow advice from domain experts. Suppose the PR task
is to judge whether a patient has a certain type of bone injury, based on a
CT image. A domain expert (e.g., a doctor specialized in bone injuries) will
explain how he or she reaches a conclusion; a pattern recognition specialist will
try to capture the essence of the expert’s decision making process and turn this
knowledge into feature extraction guidelines.
Recently, especially after the popularization of deep learning methods, fea-
ture extraction has been replaced by feature learning in many applications.
Given enough raw input data (e.g., images as matrices) and their associated la-
bels (e.g., face or non-face), a learning algorithm can use the raw input data and
their associated labels to automatically learn good features using sophisticated
techniques.
After the feature extraction or feature learning step, we need to produce a
model, which takes the feature vectors as its input, and produces our appli-
cation’s desired output. The model is mostly obtained by applying machine
learning methods on the provided training feature vectors and labels.
For example, if an image is represented as a d-dimensional feature vector
x ∈ R^d (i.e., with d feature values), a linear model
    w^T x + b
can be used, in which w ∈ R^d and b ∈ R are the d + 1 parameters of the linear
machine learning model. Given any image with a feature vector x, the model
will predict the image as being a face image if w^T x + b ≥ 0, or non-face if
w^T x + b < 0.
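As a concrete illustration, the following Octave/MATLAB sketch applies such a linear model to a single feature vector. All numbers are made up; in practice w and b would be learned and x would come from feature extraction.

% Applying a linear model w^T x + b to one feature vector (illustrative values).
w = [0.5; -1.2; 0.3; 2.0];   % hypothetical learned weights, d = 4
b = -0.7;                    % hypothetical learned bias
x = [1.0; 0.2; -0.5; 0.4];   % a d-dimensional feature vector of one image

score = w' * x + b;          % the linear model's output
if score >= 0
    disp('predicted: face');
else
    disp('predicted: non-face');
end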
In this particular form of machine learning model (which is a parametric
model), to learn a model is to find its optimal parameter values. Given a set of
training examples with feature vectors and labels, machine learning techniques
learn the model based on these training examples—i.e., using past experience
(training instances and their labels) to learn a model that can predict for future
examples even if they are not observed during the learning process.
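To make the learning step concrete, here is one simple possibility sketched in Octave/MATLAB: estimating w and b by least squares from a few made-up training examples, then predicting for an example not seen during learning. This is only an illustration of the idea, not the particular learning algorithm used in this book.

% Learning (w, b) from labelled training data by least squares (illustrative).
X = [0.9 0.1;                % each row is one training feature vector
     0.8 0.3;
     0.2 0.7;
     0.1 0.9];
y = [1; 1; -1; -1];          % labels: +1 for face, -1 for non-face

A = [X, ones(size(X, 1), 1)];   % append a constant 1 so that b is learned with w
p = A \ y;                      % least-squares solution p = [w; b]
w = p(1:end-1);
b = p(end);

x_new = [0.85; 0.2];                 % an example not seen during learning
prediction = sign(w' * x_new + b)    % +1 predicts face, -1 predicts non-face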
Figure 2: (a) A small gray-scale face image; (b) its raw input format as a matrix of integers.
2.2 PR vs. ML
Let us now take a short detour to discuss the relationship between PR and
ML before we proceed to the next step in the PR pipeline.
It is quite easy to figure out that pattern recognition and machine learning
are two closely related subjects. An important step (i.e., model learning) in PR
is typically considered as an ML task, and feature learning (also called repre-
sentation learning) has attracted increasing attention in the ML community.
However, PR includes more than those components that are ML-related. As
discussed above, data acquisition, which is traditionally not related to machine
learning, is ultra-important for the success of a PR system. If a PR system
accepts low-quality input data, it is very difficult, if not impossible, for the
machine learning related PR components to recover from the loss of information
incurred by the low-quality input. As the example in Figure 2
illustrates, a low resolution face image makes face recognition almost impossible,
regardless of what advanced machine learning methods are employed to handle
these images.
Traditional machine learning algorithms often focus on the abstract model
learning part. A traditional machine learning algorithm usually uses pre-extracted
feature vectors as its input and rarely pays attention to data acquisition. Instead,
ML algorithms assume that the feature vectors satisfy some mathematical or
statistical properties and constraints, and learn models based on the feature
vectors and these assumptions.
An important portion of machine learning research is focused on the theo-
retical guarantees of machine learning algorithms. For example, under certain
assumptions on the feature vectors, what is the upper or lower bound of the
accuracy any machine learning algorithm can attain? Such theoretical studies
are sometimes not considered as a topic of PR research. Pattern recognition
research and practice often have a stronger system flavor than machine learning
research does.
We can find other differences between PR and ML. However, although we do not
agree with expressions like “PR is a branch of ML” or “ML is a branch of PR,”
we do not want to emphasize the differences either.
PR and ML are two closely related subjects, and the differences may gradu-
ally disappear. For example, the recent deep learning trend in machine learning
emphasizes end-to-end learning: the input of a deep learning method is the raw
input data (rather than feature vectors), and its output is the desired prediction.
Hence, instead of emphasizing the differences between PR and ML, it is
better to focus on the important task: let us just solve the problem or task that
is presented to us!
One more note: the patterns or regularities recognized by PR research and
practice involve various types of sensory input data, which means that PR is
also closely related to subjects such as computer vision, acoustics, and speech
processing.
Figure 3: A typical pattern recognition pipeline with feedback loop.
When the evaluation results are unsatisfactory, the system has to be revised
before its deployment.
Both data acquisition and manual feature extraction based on knowledge
from domain experts are application or domain specific, which will involve back-
ground from various subject areas. In this book, we will not be able to go into
details of these two steps.
The other steps are introduced in this book. For a detailed description of
the structure and contents of the rest of this book, please refer to the preface.
Exercises
1. Below is an interesting equation I saw in a short online entertainment
video clip:
\[
\sqrt[3]{a + \frac{a+1}{3}\sqrt{\frac{8a-1}{3}}} \; + \; \sqrt[3]{a - \frac{a+1}{3}\sqrt{\frac{8a-1}{3}}} \,. \qquad (1)
\]
What do you think this equation will evaluate to? Note that we only
consider real numbers (i.e., complex numbers shall not appear in this
problem).
One does not often see this kind of complicated equation in an online
entertainment video, and it almost surely has nothing to do with
pattern recognition or machine learning. However, as will be illustrated in
this problem, we can observe many useful thinking patterns in the solution
of this equation that are also critical in the study of machine learning and
pattern recognition. So, let us take a closer look at this equation.
(a) Requirements on the input. In a pattern recognition or machine
learning problem, we must enforce some constraints on the input data.
These constraints might be implemented by pre-processing techniques, by
requirements in the data acquisition process, or by other means.
The requirement for the above equation is specified in terms of a. What
shall we enforce on the variable a?
(b) Observing the data and the problem. The first step in solving a
PR or ML problem is often observing or visualizing your data—in other
words, to gain some intuitions about the problem at hand. While trying to
observe or visualize the data, two kinds of data are often popular choices:
those data that are representative (to observe some common properties),
and those that have peculiar properties (to observe some corner cases).
One example of peculiar data for Equation 1 is a = 1/8. This value for a is
peculiar because it greatly simplifies the equation. What is Equation 1’s
value under this assignment of a?
(c) Coming up with your idea. After observing the data, you may
come up with some intuitions or ideas on how to solve the problem. If
that idea is reasonable, it is worth pursuing.
Can you find another peculiar value for a? What is Equation 1’s value in
that case? Given these observations, what is your idea about Equation 1?
(d) Sanity check of your idea. How do you make sure your idea is
reasonable? One commonly used method is to test it on some simple
cases, or to write a simple prototype system to verify it.
For Equation 1, we can write a single-line Matlab/Octave command to
evaluate its value. For example, a=3/4 assigns a value to a; we can then
evaluate Equation 1 as
f = (a + (a+1)/3*sqrt((8*a-1)/3))^(1/3) + ...
    (a - (a+1)/3*sqrt((8*a-1)/3))^(1/3)
can be solved using expressions related to Equation 1. Read the in-
formation at https://en.wikipedia.org/wiki/Cubic_equation (espe-
cially the part that is related to Cardano’s method), and try to understand
this connection.