Deborah Thomas Dissertation
A Dissertation
Doctor of Philosophy
by
Deborah Thomas
© Copyright by
Deborah Thomas
2010
All Rights Reserved
Abstract
by
Deborah Thomas
In this dissertation, we develop techniques for face recognition from surveillance-quality video. We handle two specific problems that are characteristic of such
video, namely uncontrolled face pose changes and poor illumination. We conduct
a study that compares face recognition performance using two different types of
probe data and acquiring data in two different conditions. We describe approaches
to evaluate the face detections found in the video sequence to reduce the probe
images to those that contain true detections. We also augment the gallery set using synthetic poses generated using 3D morphable models. We show that we can
exploit temporal continuity of video data to improve the reliability of the matching
scores across probe frames. Reflected images are used to handle variable illumination conditions to improve recognition over the original images. While there
remains room for improvement in the area of face recognition from poor-quality
video, we have shown some techniques that help performance significantly.
CONTENTS

FIGURES

TABLES

CHAPTER 1: INTRODUCTION
1.1 Description of surveillance-quality video
1.2 Overview of our work
1.3 Organization of the dissertation

CHAPTER 8: CONCLUSIONS

APPENDIX A: GLOSSARY

BIBLIOGRAPHY
FIGURES

1.3 Example showing the low resolution of the face in the frame, when the subject is too far from the camera
4.8 Results: CMC curve comparing performance when using high definition and surveillance data (Indoor video)
4.9 Results: CMC curve comparing performance when using high definition and surveillance data (Outdoor video)
5.1 Frames showing the variable pose seen in a video clip (the black dots mark the detected eye locations)
6.3 Reflection algorithm
6.4 Example images: original image, reflected left and reflected right
6.6 Reflection algorithm
6.7 Example images: original image and averaged image
7.4 Results: Rank one recognition rates when using the entire dataset
7.5 Results: Rank one recognition rates when using the dataset after background subtraction
7.6 Results: Rank one recognition rates when using the dataset after background subtraction and gestalt clustering
B.1 CMC curves: Comparing fusion techniques using a single frame
B.3 CMC curves: Comparing approaches exploiting temporal continuity, using rank-based fusion
B.4 ROC curves: Comparing fusion techniques exploiting temporal continuity, using rank-based fusion
TABLES

2.1 PREVIOUS WORK
3.1 FEATURES OF CAMERAS USED
3.2 SUMMARY OF DATASETS
4.1 COMPARISON DATASET RESULTS: DETECTIONS IN VIDEO USING PITTPATT
4.2 COMPARISON DATASET RESULTS: COMPARISON OF RECOGNITION RESULTS ACROSS CAMERAS USING PITTPATT
5.1 RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION
7.1 COMPARING DATASET SIZE OF ALL IMAGES TO BACKGROUND SUBTRACTION APPROACH AND GESTALT CLUSTERING APPROACH
7.2 RESULTS: PERFORMANCE WHEN VARYING DISTANCE METRICS AND NUMBER OF EIGENVECTORS DROPPED
CHAPTER 1
INTRODUCTION
Firstly, such video is affected by variable illumination. Often, surveillance cameras are pointed toward doorways where the sun is streaming in, or the
camera may be in a poorly-lit location. This can change the intensity of the image,
even causing different parts of the image to be illuminated differently, which can
cause problems for the recognition system. In Figure 1.1, we show an example
frame of such video affected by variable illumination.
The second feature of surveillance video is the variable pose of the subject in
the video. The subject is often not looking at the camera and the camera may
be mounted to the ceiling. Therefore, the subject may not be in a frontal pose in
the video. While a lot of work has been done using images where the subject is
looking directly at the camera, there is a need to explore recognition when the
subject is not looking at the camera. In Figure 1.2, we show two such examples.
Another characteristic of surveillance video is the low resolution of the face. Such video usually covers a large scene at low resolution. Furthermore, the
Figure 1.2. Example showing the variable pose in two frames of a video
clip
camera may be located far from the subject. Hence, the subject's face may be small, causing the number of pixels on the subject's face to be low, making it
difficult for robust face recognition. In Figure 1.3, we show an image where the
subject is too far from the camera for reliable face recognition.
The last feature of surveillance-quality video is obstruction of the face. A perpetrator may be aware of the presence of a camera and try to cover their face to prevent it from being captured. Hats, glasses and makeup can
also be used to change the appearance of the face to cause problems for recognition
systems. Sometimes, the positioning of the camera may cause the face to be out
of view of the camera frame as seen in Figure 1.4.
Figure 1.3. Example showing the low resolution of the face in the frame,
when the subject is too far from the camera
Figure 1.4. Example showing the face to be out of view of the camera
continuity between the frames of the data. The identity of the subject will not
change in an instant, so the multiple frames available can be used for recognition. The matching scores between a pair of probe and gallery subjects can be
made more robust by using decisions about a previous frame for the current one.
First, we compare recognition performance when using surveillance video to performance when using high-resolution video in our probe dataset. We also devise a
technique to evaluate the face detections to prune the dataset to true detections,
to improve recognition performance. We use a multi-gallery approach to make the
recognition system more robust to variable pose in the data. We generate these
poses synthetically using 3D morphable models. We then create reflected images in order
to mitigate the effects of variable illumination.
By combining these techniques we show that we can handle some of the issues
of variable pose and illumination in surveillance data and improve recognition over
baseline performance.
CHAPTER 2
PREVIOUS WORK
In this chapter, we describe previous work that looks at face recognition from
unconstrained video. We first describe three studies that explore face recognition
from video. We look at two problems, namely uncontrolled pose and poor lighting
conditions. We then describe different approaches that have been used to handle
both of these problems.
and face recognition from video). The first experiment compares face recognition
performance when using still images in the probe set to using 100 frames from a
video sequence, while the subject is talking with varied expression. The video is
similar to that of a mugshot with the added component of change in expression.
The gallery is a set of still images. Among all the participants in FRVT 2002,
except for DreamMIRH and VisionSphere, recognition performance is better when
using a still image rather than when using a video sequence. They observe that
if the subject were walking toward the camera, there would be a change in size
and orientation of the face that would be a further challenge to the system. In
this work, we focus on uncontrolled video, where data is captured using a surveillance camera in uncontrolled lighting conditions; hence, performance is expected
to be poor. They also conclude that 3D morphable models provide only slight
improvement over 2D images.
In 2006, the FRVT 2006 Evaluation Report [34], compared face recognition
when using 2D and 3D data. It also explores face recognition when using controlled and uncontrolled lighting. When using 3D data, the algorithms were able
to meet the FRGC [33] goal of an improvement of an order of magnitude over
FRVT 2002. To test the effect of lighting, the gallery data was captured in a
controlled environment, whereas the probe data was captured in an uncontrolled
lighting environment (either indoors or outdoors). Cognitec, Neven Vision, SAIT
and Viisage outperformed the best FRGC results achieved, with SAIT having a
false reject rate between 0.103 and 0.130 at a false accept rate of 0.001. The performance of FRVT participants when using uncontrolled probe data matches that
of the FRVT participants of 2002 when using controlled data. However, they also
show that illumination conditions do have a large effect on performance.
The Foto-Fahndung report [9] evaluates performance of three recognition systems when the data comes from a surveillance system in a German railway station.
They report recognition performance in four distinct conditions based on lighting and movement of the subjects and show that while face recognition systems
can be used in search scenarios, environmental conditions such as lighting and
quick movements influence performance greatly. They conclude that it is possible
to recognize people from video, provided the external conditions are right, especially lighting. They also state that high recognition performance can be achieved
indoors, where the light does not change much. However, drastic changes in
lighting conditions affect performance greatly. They state that "High recognition performance can be expected in indoor areas which have non-varying light conditions. Varying light conditions (darkness, black light, direct sunlight) cause a sharp decrease in recognition performance. A successful utilization of biometric face recognition systems in outdoor areas does not seem to be very promising for search purposes at the moment" [9]. They suggest the use of 3D face recognition
technology as a way to improve performance.
is of poor quality and low image resolution and has large illumination and pose
variations. They believe that the posterior probability of the identity of a subject varies over time. They use a condensation algorithm that determines the
transformation of the kinematics in the sequence and the identity simultaneously,
incorporating two conditions into their model: (1) motion in a short time interval
depends on the previous interval, along with noise that is time-invariant and (2)
the identity of the subject in a sequence does not change over time. When they use
a gallery of 12 still images and 12 video sequences as probes, they achieve 100%
rank one recognition rate. However, the small size of the dataset may contribute
to the high accuracy.
In a later work, they extend this approach to apply to scenarios where the
illumination of probe videos is different from that of the gallery [21], which is also
made up of video clips. Each subject is represented as a set of exemplars from
a video sequence. They use a probabilistic approach to determine the set of the
images that minimizes the expected distance to a set of exemplar clusters and
assume that in a given clip, the identity of the subject does not change; Bayesian
probabilities are used over time to determine the identity of the faces in the frames.
A set of four clips of 24 subjects each walking on a treadmill is used for testing.
The background is plain and each clip is 300 frames long. They achieve 100%
rank one recognition rate on all four combinations of clips as probe and gallery.
Chellappa et al. build on this in [52]. They incorporate temporal information
in face recognition. They create a model that consists of a state equation, an
identity equation (containing information about the temporal change of the identity) and an observation equation. Using a set of four video clips with 25 subjects
walking on a treadmill, (from the MoBo [16] database), they train their model
on one or two clips per subject and use the remaining for testing. They are able
to achieve close to 100% rank one recognition rate overall. They expand on this
work [53] to incorporate both changes in pose within a video sequence and the
illumination change between a gallery and probe. They combine their likelihood
probability between frames over time which improves performance overall. In a
set of 30 subjects, where the gallery set consists of still images, they achieved 93%
rank one recognition rate.
Park and Jain [31] use a view synthesis strategy for face recognition from
surveillance video, where the poses are mainly non-frontal and the size of the faces
is small. They use frontal pose images for their gallery, whereas the probe data
contains variable pose. They propose a factorization method that develops 3D
face models from 2D images using Structure from Motion (SfM). They select a
video frame in which the pose of the face is the closest to a frontal pose, as a
texture model for the 3D face reconstruction. They then use a gradient descent
method to iteratively fit the 3D shape to the 72 feature points on the 2D image.
On a set of 197 subjects, they are able to demonstrate a 40% increase in rank one
recognition performance (from 30% to 70%).
Blanz and Vetter [10] describe a method to fit an image to a 3D morphable
model to handle pose changes for face recognition. Using a single image of a person, they automatically estimate 3D shape, texture and illumination. They use
intrinsic characteristics of the face that are independent of the external conditions
to represent each face. In order to create the 3D morphable model, they use a
database of 3D laser scans that contains 200 subjects from a range of demographics. They build a dense point-to-point correspondence between the face model and
a new face using optical flow. Each face is fit to the 3D shape using seven facial
feature points (tip of nose, corners of eyes, etc.). They try to minimize the sum of
squared differences over all color channels from all pixels in the test image to all
pixels in the synthetic reconstruction. On a set of 68 subjects of the PIE database
[40], they achieve 95% rank one recognition rate when using the side view gallery.
Using the FERET set, with 194 subjects, they achieve 96% rank one recognition
when using the frontal images as gallery and the remaining images as probes.
Huang et al. [48] use 3D morphable models to handle pose and illumination
changes in face video. They create 3D face models based on three training images
per subject and then render 2D synthetic images to be used for face recognition.
They apply a component-based approach for face detection that uses 14 independent component classifiers. The faces are rotated from 0° to 34° in increments of 2° using two different illuminations. At each instance, an image is saved. Out of
the 14 components detected, nine are used for face recognition. The recognition
system consists of second degree polynomial Support Vector Machine classifiers.
When they use 200 images of six different subjects, they get a true accept rate of
90% at a false accept rate of 10%.
Beymer [8] uses a template based approach to represent subjects in the gallery
when there are pose changes in the data. He first applies a pose estimator based
on the features of the face (eyes and mouth). Then, using the nose and the eyes,
the recognition system applies a transform to the input image to align the three
feature points with a training image. When using 930 images for training the
detector and 520 images for testing, the features are correctly detected 99.6% of
the time. For recognition, a feature-level set of systems is used for each eye, nose
and mouth. The probe images are compared only to those gallery images closest
to its pose. Then he uses a sum of correlations of the best matching eye, nose and
mouth templates to determine the best match. On the set of 62 subjects, when
using 10 images per subject with an inter-ocular distance of about 60 pixels in the
images, the rank one recognition rate is 98.39%. However, this is a relatively large
inter-ocular distance for good face recognition and is not typical of faces in surveillance-quality video.
Arandjelovic and Cipolla [3] deal with face movement and observe that most
strategies use the temporal information of the frames to determine identity. They
propose a strategy that uses Resistor Average Distance (RAD), which is a measure
of dissimilarity between two disjoint probabilities. They claim that PCA does not
capture true modes of variation well and hence a Kernel PCA is used to map the
data to a high-dimensional space. Then, PCA can be applied to find the true
variations in the data. For recognition, the RAD between the distributions of
sets of gallery and probe points is used as a measure of distance. They test their
approach on two databases. One database contains 35 subjects and the other
contains 60 subjects. In both datasets, the illumination conditions are the same
for training and testing. They achieve around 98% rank one recognition rate on
the larger dataset.
Thomas et al. [43] use synthetic poses and score-level fusion to improve recognition when there is variable pose in the data. They show that recognition can
be improved by exploiting temporal continuity. The gallery dataset consists of
one high-quality still image per subject. Using the approach in [10] to generate
synthetic poses, the gallery set is enhanced with multiple images per subject. A
dataset of 57 subjects is used, which contains subjects walking around a corner in
a hallway. When they use the original gallery images and treat each probe image
as a single frame with no temporal continuity, they achieve a rank one recognition rate of 6%. However, by adding synthetic poses and exploiting temporal
continuity, they improved rank one recognition performance to 21%.
Furthermore, one may not be able to fully capture all the possible variations in
the data. While it has been shown theoretically that a function invariant to illumination does not exist, there are representations that are more robust than
others [2]. The four representations they consider are (1) the original gray-level
image, (2) the edge map of the image, (3) the image filtered with 2D Gabor-like
filters and (4) the second-order derivative of the gray level image [2]. Some edges
of the image can be insensitive to illuminations whereas others are not. However,
an edge map is useful in that it is a compact representation of the original image. Derivatives of the gray level image are useful because while ambient light
will affect the gray level image, under certain conditions it does not affect the
derivatives. In order to make the images more robust, they divide the face into
two sub-parts by creating subregions of the eye area and the lower part of the
face. They show that in highly variable lighting, the error rate is 100% on raw
gray level images, where there are changes in illumination direction. Performance
improves when using the filtered images. They also show that even though the
filtered images do not resemble the original face, they encode information to improve recognition. However, they conclude that no one representation is sufficient
to overcome variations in illumination. While some are robust to changes along
the horizontal axis, others are more robust along the vertical axis. Hence, the
different approaches need to be combined to exploit the benefits of each of them.
Zhao and Chellappa [51] use 3D models to handle the problems of illumination in face recognition. They create synthesized images acquired under different
lighting and viewing conditions. They develop a 2D prototype image from a 2D
image acquired under variable lighting using a generic 3D model, rather than a
full 3D approach that uses accurate 3D information. For the generic 3D model,
a laser scanned range map is used. They use a Lambertian model to estimate the
albedo value, which they determine using a self-ratio image, which is the illumination ratio of two differently aligned images. Using a 3D generic model, they
bypass the 2D to 3D step, since the pose is fixed in their dataset. When they test
their approach using the Yale database, on a set of 15 subjects with 4 images each, they obtain a 100% rank one recognition rate, which is an improvement of about 25% over using the original images (about 75% rank one recognition
rate).
Wei and Lai [47] describe a robust technique for face recognition under varying lighting conditions. They use a relative image gradient feature to represent
the image, which is the image gradient function of the original intensity image,
where each pixel is scaled by the maximum intensity of its neighbors. They use
a normalized correlation of the gradient maps of the probe and gallery images to
determine how well the images match. On the CMU-PIE face database [40], which
contains 22 images under varying illuminations of 68 individuals, they obtain an
equal error rate of 1.47% and show that their approach outperforms recognition
when using the original intensity images.
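To make this representation concrete, a minimal sketch is given below: the gradient magnitude of the face image is divided by the largest value in a small neighborhood, and two such maps are compared by normalized correlation. The window size, the small constant in the denominator and the choice to take the local maximum over the gradient magnitude (rather than the raw intensities) are illustrative assumptions, not details taken from [47].

import numpy as np
from scipy import ndimage

def relative_gradient(image, window=5, eps=1e-3):
    # Gradient magnitude scaled by the local maximum in a small neighborhood.
    gy, gx = np.gradient(image.astype(np.float64))
    mag = np.hypot(gx, gy)
    local_max = ndimage.maximum_filter(mag, size=window)
    return mag / (local_max + eps)

def normalized_correlation(a, b):
    # Normalized correlation between two gradient maps of the same size.
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float((a * b).mean())

# Matching: score = normalized_correlation(relative_gradient(probe_face),
#                                           relative_gradient(gallery_face))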
Price and Gee [36] also propose a PCA-based approach to address three issues
that could cause problems in face recognition, namely illumination, expression and
decoration (specifically, glasses and facial hair). They use an LDA-based approach
to handle changes in illumination and expression. They note that subregions of
the face are less sensitive to expression and decoration than the full face. So they
break the face into modular subregions: the full face, the region of the eyes and
the nose and then just the eyes. For each region, they independently determine
the distance from that region to each of the corresponding images in the database.
Hence, they have a parallel system of observations, one for each region mentioned
above. They then use a combination of results as their matching score to determine
the best match. They use a database of 106 subjects with varied illumination,
expression and decoration, where 400 still images are used for training and 276
for testing. When they combine the results from the three observers, using PCA
and LDA, they achieve a rank one recognition rate of 94.2%.
Hiremath and Prabhakar [18] use interval-type discriminating features to generate illuminant invariant images. They create symbolic faces for each subject
in each illumination type based on the maximum and minimum value found at
each pixel for a given dataset. While this is an appearance-based approach, it
does not suffer the same drawbacks as other approaches because it uses interval
type features. Therefore, it is insensitive to the particular illumination conditions
in which the data is captured within the range of illuminations in the training
data. They then use Factorial Discriminant Analysis to find a suitable subspace
with optimal separation between the face classes. They test their approach using
the CMU PIE [40] database and get a 0% error rate. This approach is advantageous in that it does not require a probability distribution of the image gradient.
Furthermore, it does not use any complex modeling of reflection components or
assume a Lambertian model. However, it is limited by the range of illuminations
found in the training data. Therefore, it may not be applicable in cases where
there is a difference in the illuminations between the gallery and probe sets.
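A minimal sketch of the interval-type representation is shown below: each subject's symbolic face stores the per-pixel minimum and maximum over that subject's training images under different illuminations. The simple overlap score used for matching here is only a stand-in for the Factorial Discriminant Analysis classifier used in [18].

import numpy as np

def symbolic_face(images):
    # Interval-type representation: per-pixel [min, max] over a subject's
    # training images captured under different illuminations.
    stack = np.stack([im.astype(np.float64) for im in images])
    return stack.min(axis=0), stack.max(axis=0)

def interval_similarity(probe, symbolic):
    # Fraction of probe pixels that fall inside the subject's intensity interval
    # (a stand-in for the classification stage of [18]).
    lo, hi = symbolic
    inside = (probe >= lo) & (probe <= hi)
    return float(inside.mean())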
Belhumeur et al. [6] use LDA to produce well-separated classes for robustness
to lighting direction and facial expression, and compare their approach to using
eigenfaces (PCA) for recognition. They conclude that LDA performs the best
when there are variations in lighting or even simultaneous changes in lighting
and expression. They also state that "In the [PCA] method, removing the first three principal components results in better performance under variable lighting conditions" [6]. Their experiments use the Harvard database [17] to test variation
in lighting. The Harvard database contains 330 images from 5 subjects (66 images
each). The images are divided into five subsets based on the direction of the light
source (0, 30, 45, 60, 75 degrees). The Yale database consists of 16 subjects with
10 images each taken on the same day but with variation in expression, eyewear
and lighting. They use a nearest neighbor classifier for matching, though the
measure used to determine distance was not specified. The variation in expression
and lighting is tested using a leave-one-out error estimation strategy on all 16
subjects. They train the space on nine of the images and then test it using the
image left out and achieve a 0.6% recognition error rate using LDA and a 19.4%
recognition error rate using PCA, with the first three dimensions dropped. They
do mention that the databases are small and more experimentation using larger
databases is needed.
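As a rough illustration of dropping the leading principal components, the sketch below builds an eigenface-style space in which the first few components, which often capture lighting variation, are discarded before projection. The number of components kept and the Euclidean nearest-neighbor matcher are assumptions for illustration, since the distance measure used in [6] is not specified.

import numpy as np

def train_pca(train_vectors, n_components=50, drop_first=3):
    # Eigenface-style PCA in which the leading components are dropped.
    X = np.asarray(train_vectors, dtype=np.float64)
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[drop_first:drop_first + n_components]
    return mean, basis

def project(vec, mean, basis):
    return basis @ (np.asarray(vec, dtype=np.float64) - mean)

def rank_one_match(probe, gallery_feats, mean, basis):
    # Nearest neighbor in the reduced space (Euclidean distance assumed).
    p = project(probe, mean, basis)
    dists = [np.linalg.norm(p - g) for g in gallery_feats]
    return int(np.argmin(dists))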
Arandjelovic and Cipolla [5] handle variation in illumination and pose using
clustering and Gamma intensity correction. They create three clusters per subject corresponding to different poses and use locations of pupils and nostrils to
distinguish between the three clusters. Illumination is handled using Gamma intensity correction. Here, the pixels in each image are transformed so as to match
a canonically illuminated image. Pose and illumination are combined by performing PCA on variations of each person's images under different illuminations from a given person's mean image and using simple Euclidean distance as their
distance measure. In order to match subjects to a novel image, they use the ratio
of the probability that three clusters belong to the same subject over the probability that they belong to a different subject. Their dataset consists of 20 subjects
for training and 40 others for testing, where each subject has 20-100 images in
random motion. They achieve 95% rank one recognition rate using this approach.
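A minimal sketch of Gamma intensity correction in this spirit is given below: the image is raised to a power chosen so that it best matches a canonically illuminated image. The search grid over gamma and the sum-of-squared-differences fitting criterion are illustrative assumptions rather than the exact procedure of [5].

import numpy as np

def gamma_correct(image, gamma):
    # Apply gamma correction to an image with values in [0, 1].
    return np.clip(image, 0.0, 1.0) ** gamma

def match_to_canonical(image, canonical, gammas=np.linspace(0.2, 3.0, 57)):
    # Pick the gamma that makes the image best match the canonical image.
    best_gamma, best_err = 1.0, np.inf
    for g in gammas:
        err = float(np.sum((gamma_correct(image, g) - canonical) ** 2))
        if err < best_err:
            best_gamma, best_err = g, err
    return gamma_correct(image, best_gamma)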
Arandjelovic and Cipolla [4] evaluate strategies to achieve illumination invariance when there are large and unpredictable illumination changes. In these
situations, the difference between two images of the same subject under different
illuminations is larger than that of two images under the same illumination but
of different subjects. Hence, they focus on ways to represent the subject's face
and put more emphasis on the classification stage. They show that both the high
pass filter and the self quotient image operations on the original intensity image
show recognition improvement over the raw grayscale representation of the images,
when the imaging conditions between the gallery and probe set are very different.
However, they also note that while they improve recognition in the difficult cases,
they actually reduce performance in the easy cases. They conclude that Laplacian of Gaussian representation of the image as described in [2] and a quotient
image representation perform better than using the raw image. They demonstrate
a rank one recognition rate improvement from about 75% using the raw images,
to 85% using the Laplacian of Gaussian representation, to about 90%, using quotient images. Since we are dealing with conditions which change drastically and
where the conditions for gallery and probe data differ, we use these approaches to
improve recognition in this work.
Gross and Brajovic [15] use an illuminance-reflectance model to generate images that are robust to illumination changes. Their model makes two assumptions: that human vision is mostly sensitive to scene reflectance and mostly insensitive to illumination conditions, and that human vision responds to local changes in contrast rather than to global brightness levels [15]. Since they focus on preprocessing the images based on the intensity, there is no training required. They
test their approach using the Yale database, which contains 10 subjects acquired
under 576 lighting conditions. When using PCA for recognition, they improve
the rank one recognition rate from 60% to 93% when using reflectance images instead of the original intensity images.
Wang et al. [46] expand on the approach in [15] and use self-quotient images
to handle the illumination variation for face recognition. The Lambertian model
of an image can be separated into two parts, the intrinsic and extrinsic part. If
one can estimate the extrinsic part based on the lighting, it can be factored out
of the image to retain the intrinsic part for face recognition. The image is found
by smoothing the image and dividing the original pixel values by the smoothed result. Let F be the smoothing filter and I the original image; then the self-quotient image Q is defined as

    Q = I / F[I],

where F[I] denotes the smoothed image. They show improvement over using the intensity images for recognition, from about
50% to about 95% rank one recognition rate.
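A minimal sketch of the self-quotient computation is shown below. It uses an isotropic Gaussian as the smoothing filter F, whereas [46] uses a weighted smoothing kernel, so this is only an approximation of their operator; the kernel width and the small constant that guards against division by zero are assumptions.

import numpy as np
from scipy import ndimage

def self_quotient_image(image, sigma=3.0, eps=1e-3):
    # Q = I / F[I], with F approximated here by a Gaussian smoothing filter.
    I = image.astype(np.float64)
    smoothed = ndimage.gaussian_filter(I, sigma=sigma)
    return I / (smoothed + eps)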
Nishiyama et al. [25] show that self-quotient images [46] are insufficient to handle partial cast shadows or partial specular reflection. They handle this weakness
by using an appearance-based quotient image. They use photometric linearization
to transform the image into the diffuse reflection. A linearized image is defined as
a linear combination of three basis images. In order to generate the basis images
to find the diffuse image, different images from other subjects are used. They
acquired images under fixed pose with a moving light source. The reflectance
image is then factored out using the estimated diffuse image. They compare their
algorithm to the self-quotient image and the quotient image on the Yale B database and show that they achieve a rank one recognition rate of 96%, whereas self-quotient images achieve an 87% rank one recognition rate and Support Retinex images [37] achieve a rank one recognition rate of 93%.
from 16 x 16 pixels to 128 x 128 pixels and achieve a rank one recognition rate of
92% when using the lowest resolution images.
Lin et al. [23] describe an approach to handle face recognition from video of low
resolution like those found in surveillance. They use optical flow for registration
to handle issues of "non-planarity, non-rigidity, self-occlusion and illumination and reflectance variation" [23]. For each image in the sequence, they interpolate
between the rows and columns to obtain an image that is twice the size of the
original image. They then compute optical flow between the current frames and
the two previous and two next images and register the four adjacent images using displacements estimated by the optical flow. Then they compute the mean
using the registered images and the reference images. The final step is to apply a
deblurring Wiener deconvolution filter to the super resolved image. They tested
their approach on the CUAVE database, which contains 36 subjects. When they
reduce the images to 13x18 pixels, their approach (approximately 15% FRR at
1% FAR) performs slightly better than bilinear interpolated images and far outperforms nearest neighbor interpolation. They expand on this work in [24] and
compare their approach to a hallucination approach (assumes a frontal view of
face and works well when faces are aligned exactly). They conclude that while
there is some improvement gained over using the lower resolution images, a fully automated recognition system is currently impractical, given the performance. Hence, they relax their constraint to a rank ten match and achieve an 87.3% rank ten recognition rate on the XM2VTS dataset, which contains 295 subjects.
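For illustration, a rough sketch of the registration-and-averaging step described above is given below, using OpenCV and SciPy. The two-times interpolation and the use of the two previous and two next frames follow the description above, while the optical-flow parameters, the Wiener filter size and the assumption of grayscale input frames are illustrative choices, not values from [23].

import cv2
import numpy as np
from scipy.signal import wiener

def super_resolve(frames, idx):
    # frames: list of grayscale uint8 images (assumed); idx: frame of interest.
    def upsample(f):
        # Interpolate rows and columns to double the image size.
        return cv2.resize(f, None, fx=2, fy=2, interpolation=cv2.INTER_LINEAR)

    ref_u8 = upsample(frames[idx])
    ref = ref_u8.astype(np.float64)
    h, w = ref.shape
    grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                 np.arange(h, dtype=np.float32))
    registered = [ref]
    for j in (idx - 2, idx - 1, idx + 1, idx + 2):
        if j < 0 or j >= len(frames):
            continue
        neigh = upsample(frames[j])
        # Dense optical flow from the reference to the neighbor, so the
        # neighbor can be warped onto the reference frame.
        flow = cv2.calcOpticalFlowFarneback(ref_u8, neigh, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        map_x = grid_x + flow[..., 0]
        map_y = grid_y + flow[..., 1]
        warped = cv2.remap(neigh, map_x, map_y, cv2.INTER_LINEAR)
        registered.append(warped.astype(np.float64))
    mean_img = np.mean(registered, axis=0)
    # Simple Wiener-filter deblurring of the averaged image.
    return wiener(mean_img, mysize=5)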
In Table 2.1, we summarize the different approaches along with their assumptions, dataset size and performance. We divide up the works based on the problem
they are trying to solve: (1) variable pose, (2) variable illumination and (3) other
problems, such as low resolution on the face. Performance is reported in rank one
recognition rate, unless otherwise specified. Some of the results are reported in
terms of equal error rate (or EER). Also, the results must be viewed in light of
the difficulty of the dataset (data features) and dataset size.
TABLE 2.1

PREVIOUS WORK

(Columns: Authors | Basic idea | Data features | Dataset size | Performance. The rows summarize the works reviewed above, grouped by problem: variable pose, variable illumination and other problems such as low resolution; performance is rank one recognition rate unless reported as an equal error rate or a true/false accept rate.)
CHAPTER 3
EXPERIMENTAL SETUP
In this chapter, we describe the sensors, data sets and software used in our
experiments. We acquire three different datasets for our experiments. We label
them the NDSP, IPELA and Comparison dataset. The first dataset is used to
show baseline performance and is used in our pose and face detection experiments.
The IPELA dataset is used for our reflection experiments to handle pose and
illumination variation. Finally, the Comparison dataset is used to compare face
recognition performance when using high-quality data acquired on a camcorder
and when using data acquired on a surveillance camera.
The rest of the chapter is organized as follows: Section 3.1 describes the different sensors we use in our experiments. We then describe our datasets in Section
3.2. Finally, the software we use is described in Section 3.3.
3.1 Sensors
We capture data using four different sensors. The first camera is a Nikon D80,
used to acquire the gallery data used in the experiments. The second camera is
a PTZ camera installed by the Notre Dame Security Police. The third camera is
a Sony IPELA camera with PTZ capability. The fourth camera is a Sony HDR
camcorder used to capture data as a comparison to the surveillance-quality data.
We describe each in detail below.
interlaced mode.
TABLE 3.1

FEATURES OF CAMERAS USED

Feature      | Nikon D80  | NDSP camera   | IPELA          | Sony HD
Model        | Nikon D80  | Not available | Sony SNC RZ25N | Sony HDR-HC
Resolution   | 2592x3872  | 640x480       | 640x480        | 1920x1080
Image size   | 3,732 kb   | 40 kb         | 52 kb          | 466 kb
Interlaced   | No         | Yes           | Yes            | Yes
3.2 Dataset
We describe three datasets. They are named NDSP, IPELA and Comparison
dataset based on the camera used to acquire them and the experiments for which
they are used.
a glass door, walking around the corner till they are out of the camera view.
Each video sequence consists of between 50 and 150 frames. In Figure 3.6 we
show 10 frames acquired from this camera. We see that the illumination is highly
uneven due to the glare of the sun on the subject. The inter-ocular distance is
about 40 pixels on average. The pan, tilt and zoom are not changed during data
acquisition but could vary from day to day since the camera is part of a campus
security system. There are 57 subjects in this dataset. The time lapse between
the probe and gallery data varies from two weeks to about six months.
illumination is also uncontrolled. This dataset consists of 104 subjects. The time
lapse between the probe and gallery data is between two weeks and six months.
Splitting up the dataset: In order to test our approach when using the
surveillance data, we use four-fold cross validation. We split up the dataset into
four disjoint subsets, where each set contains 26 subjects. The sets are subject-disjoint. For our experiments, we train the space on three subsets and test on
the remaining subset. We use the average of the four scores as our measure of performance.
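A minimal sketch of such a subject-disjoint split is shown below; with the 104 subjects of this dataset it yields four folds of 26 subjects each. The shuffling and random seed are implementation choices for illustration.

import random

def subject_disjoint_folds(subject_ids, n_folds=4, seed=0):
    # Split subjects (not frames) into disjoint folds.
    ids = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

# Four-fold cross validation: train the space on three folds, test on the
# remaining one, and average the four resulting scores.
# folds = subject_disjoint_folds(all_subject_ids)
# for k in range(4):
#     test_subjects = set(folds[k])
#     train_subjects = {s for i, f in enumerate(folds) if i != k for s in f}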
Figure 3.8: Example frames from the IPELA camera for the Comparison dataset
This dataset contains 176 subjects. Out of the 176 subjects, 78 are acquired
indoors in a hallway on the first floor of Fitzpatrick Hall. One half of the face
is partially lit by the sun. The remaining 98 subjects are acquired outdoors
in uncontrolled lighting conditions. We separate out these datasets to compare
recognition performance when using data acquired indoors rather than outdoors.
The probe and gallery data in this dataset are acquired on the same day. This
dataset partly overlaps with the data of the Multi-Biometric Grand Challenge
(or MBGC) dataset [29], but also includes surveillance data that is not part of
the MBGC dataset.
In Table 3.2, we summarize the details of the datasets we use in this dissertation.
TABLE 3.2

SUMMARY OF DATASETS

Feature                      | NDSP Dataset                       | IPELA Dataset       | Comparison Dataset (surveillance) | Comparison Dataset (camcorder)
Gallery data source          | Nikon D80                          | Nikon D80           | Nikon D80                         | Nikon D80
Probe data source            | NDSP-installed surveillance camera | Sony IPELA camera   | Sony IPELA camera                 | Sony HD camcorder
Number of subjects           | 57                                 | 104                 | 176                               | 176
Frames per probe subject     | 50-150                             | 100                 | 100                               | 300
Acquisition environment      | Fitzpatrick hallway                | Fitzpatrick hallway | Indoor and outdoor                | Indoor and outdoor
Activity                     | Subject walks around a corner, down a hallway and out of view of the camera | Subject picks up an object and walks out of camera view | Subject picks up an object and walks out of camera view | Subject picks up an object and walks out of camera view
Time lapse, probe vs gallery | 2 weeks to 6 months                | 2 weeks to 6 months | Same day                          | Same day
3.3 Software
We use a variety of software for our work: FaceGen Modeller 3.2, Viisage
IdentityExplorer, Neurotechnologija, PittPatt and CSU's PCA code. They are
described in further detail below:
3.3.2 IdentityEXPLORER
Viisage manufactures an SDK for multi-biometric technology, called IdentityEXPLORER. It provides packages for both face and fingerprint recognition. It
is based on Viisage's "Flexible Template Matching technology and a new set of powerful multiple biometric recognition algorithms, incorporating a unique combination of biometric tools" [45]. We use it for detection and recognition:
1. Detection: It gives the centers of the eyes and the mouth, with an associated
confidence measure in the face localization, ranging from 0.0 to 100.0.
2. Recognition: It takes two images and gives a matching score between the
faces in the two images. The scores range from 0.0 to 100.0, where a higher
score implies a better match.
3.3.3 Neurotechnologija
Neurotechnology [27] manufactures an SDK for face and fingerprint biometrics.
The face recognition package is called Neurotechnologija Verilook. It includes face
detection and face recognition capability. The face detection gives the eye and
mouth locations. The software also includes recognition software, which gives the
matching score between two faces in two images.
3.3.4 PittPatt
PittPatt manufactures a face detection and recognition package [35] that we
use in our comparison experiments. The face detection component is robust to
illumination and pose changes in the data and to a variety of demographics. Along
with its detection capability it can determine the pose of the face. It is able to
capture small faces, such as faces with an inter-ocular distance of eight pixels. The
face recognition component is also robust to a variety of poses and expressions by
using statistical learning techniques. By combining face detection and tracking,
PittPatt can also be used to recognize humans across video sequences.
    R = 100 · (p / n)    (3.1)
rate at which the true accept rate equals the false accept rate is called the equal
error rate.
An ROC curve plots the change in false accept rate versus the true accept
rate. At each point on the graph, the threshold of acceptance as a true match is
varied. In Figure 3.12, we show an example of such a curve.
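As an illustration of these metrics, the sketch below computes the rank one recognition rate of Equation 3.1, assuming that p counts the probes whose top-scoring gallery entry is the correct subject and n is the total number of probes, and approximates the equal error rate from lists of genuine and impostor scores.

import numpy as np

def rank_one_rate(score_matrix, true_gallery_index):
    # R = 100 * p / n (Equation 3.1): score_matrix[i, j] is the match score of
    # probe i against gallery entry j (higher = better).
    best = np.argmax(score_matrix, axis=1)
    p = int(np.sum(best == np.asarray(true_gallery_index)))
    n = score_matrix.shape[0]
    return 100.0 * p / n

def equal_error_rate(genuine_scores, impostor_scores):
    # Sweep the acceptance threshold and return the point where the false
    # accept rate and false reject rate are (approximately) equal.
    genuine = np.asarray(genuine_scores, dtype=np.float64)
    impostor = np.asarray(impostor_scores, dtype=np.float64)
    best_gap, eer = np.inf, None
    for t in np.unique(np.concatenate([genuine, impostor])):
        far = float(np.mean(impostor >= t))
        frr = float(np.mean(genuine < t))
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer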
3.5 Conclusions
In this chapter, we discussed the sensors and datasets used in our experiments.
We also described the software we used to support our work. Finally, we closed
with a discussion about the metrics used to evaluate performance.
CHAPTER 4
A STUDY: COMPARING RECOGNITION PERFORMANCE WHEN USING
POOR QUALITY DATA
The sensor used to capture data used for face recognition can affect recognition
performance. Low quality cameras are often used for surveillance, which can
result in poor recognition because of the poor video quality and low resolution
on the face. In this chapter, we conduct two sets of experiments. The first
set of experiments demonstrates baseline performance using the NDSP dataset.
This dataset is captured indoors, where the sunlight streaming through the doors
affects the illumination of the scene. Then we show recognition experiments using
the Comparison dataset, where video data is acquired from two different sources:
a high-quality camcorder and a surveillance camera. We also capture data both
indoors and outdoors to compare performance when acquiring data in different
acquisition settings. We then compare recognition performance when using each
of these two sources of video data as our probe set and show that performance
falls drastically when we use poor quality video and when we move from indoor
to outdoor settings.
The rest of the chapter is organized as follows: First, we describe baseline
performance for the NDSP dataset in Section 4.1. Then, Section 4.2 describes
the experiments we run to compare performance and in Sections 4.2.2 and 4.3, we
describe our results and conclusions.
4.1.1 Experiments
For each subject in the NDSP probe set, we compare each frame of their
probe video clip to the set of gallery images of the same subject. We describe
how we generate the multiple gallery images per subject in Section 5.1. For each
subject, we predetermine the best single probe video frame to use for that person.
We do this by picking the frame that gives us the highest matching score to
this corresponding set of gallery images. This gives us a new image set of 57
images (one image per subject), where each image represents the highest possible
matching score of that subject to the gallery images of the same subject. We use
this oracle set of probe video frames as our probe set. This is an optimistic
baseline, in that a recognition system would not necessarily be able to find the best
frame in each probe video clip. We then run recognition using this set of images
as probes and report the rank one recognition rate as our baseline performance.
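A minimal sketch of how such an oracle probe set could be assembled is shown below. The matching function and the use of the maximum score over a subject's own gallery images are assumptions for illustration; any recognition engine that returns higher scores for better matches could be substituted.

def oracle_probe_set(probe_clips, gallery_images, match_score):
    # probe_clips: subject -> list of video frames; gallery_images: subject ->
    # list of gallery images; match_score(frame, image) -> similarity score.
    oracle = {}
    for subject, frames in probe_clips.items():
        best_frame, best_score = None, float("-inf")
        for frame in frames:
            # Score the frame against the subject's own gallery images and keep
            # the frame with the highest score (optimistic, as noted above).
            score = max(match_score(frame, g) for g in gallery_images[subject])
            if score > best_score:
                best_frame, best_score = frame, score
        oracle[subject] = best_frame
    return oracle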
4.1.2 Results
In Figure 4.1, we show the rank one recognition rates using this set of 57
images, when the images in the gallery set correspond to an off-angle degree from the frontal position of 0°, ±6°, ±12° or ±18° in the yaw angle. The face is also rotated to ±6° in pitch angle.
We see that performance steadily increases as we increase the range of poses
available in the gallery set. We determine the best frame per subject based on its
matching score to all 17 poses. This explains why performance peaks when we use
all 17 poses, since we use all 17 poses to pick the frames that make up the oracle
probe set. This shows that this is a challenging dataset, where performance is
poor even when we pick out the probe frame with the best matching score to its corresponding gallery images.
4.2.1 Experiments
For our experiments, we use PittPatt's detector and recognition system. Once
we have detected all the faces in the probe and gallery data, we create a single
gallery of all the gallery images, with minimal clustering to ensure that each image
is considered a unique subject. Then for each video sequence, we create a gallery
and cluster it so that they all correspond to the same subject. We then run
recognition of each set of videos against the gallery of high-quality still images.
We report results using rank one recognition rate and equal error rate.
Since we cluster the video frames to correspond to one subject, distances are
reported between one sequence and a gallery image. So results are reported per
video sequence, rather than per frame. Our experiments are grouped into four
categories, depending on the sensor used and the acquisition condition in which
the data is acquired.
4.2.2 Results
In this section, we describe the detection and recognition results when we run
the recognition experiments described in Section 4.2.
Detection results: In Table 4.1 we show the results of the face detection
and how many faces were detected in the video sequences. The number of faces
detected in the outdoor video is far fewer than that in the indoor video. We
also notice that the number of faces detected decreases as we move from high-quality video to outdoor video. With the high-quality video indoors, detection is
about 50% and falls to less than 5% when we move outdoors, using a surveillance
camera. So we see that the type of camera and the acquisition condition affects
face detection performance.
In Figures 4.2 through 4.5, we show an example frame from each acquisition
and camera. We also show some of the thumbnails created after we run detection
on the surveillance and high-definition video (both indoor and outdoor video). We
TABLE 4.1

COMPARISON DATASET RESULTS: DETECTIONS IN VIDEO USING PITTPATT

                             | Indoor video                     | Outdoor video
Performance metric           | High-resolution | Surveillance   | High-resolution | Surveillance
Number of subjects           | 78              | 78             | 98              | 98
Total number of frames       | 59,894          | 27,564         | 78,268          | 32,112
Number of faces detected     | 29,688          | 5,508          | 14,653          | 1,335
Percentage of faces detected | 49.57%          | 19.98%         | 18.72%          | 4.16%
TABLE 4.2

COMPARISON DATASET RESULTS: COMPARISON OF RECOGNITION RESULTS ACROSS CAMERAS USING PITTPATT

                          | Indoor video                     | Outdoor video
Performance metric        | High-resolution | Surveillance   | High-resolution | Surveillance
Rank one recognition rate | 82.05%          | 43.28%         | 68.24%          | 42.86%
Equal error rate          | 3.45%           | 14.48%         | 6.89%           | 16.82%
see that even with the poor quality video, PittPatt is able to successfully pick out
faces in the sequence.
Recognition results: In Table 4.2, we show the rank one recognition rate and
equal error rate. The column for surveillance video acquired indoors is the scenario on which we will focus for the experiments in the rest of the dissertation.
In Figures 4.6 through 4.9, we show ROC and CMC curves for each of the
experiments. For each graph, we show performance when keeping the acquisition
condition (indoor or outdoor) constant.
PittPatt's system handles the recognition and detection well when
using the high-quality data, indoors. However, the recognition rate decreases when
we move from indoors to outdoors (from about 82% to 68%). Also, performance
drops significantly when we move from the high-definition data to the surveillance
data. We also see that the drop between indoor and outdoor recognition is smaller when we use the surveillance video. However, recognition is already poor to begin with (43% indoors), so there is less room for a large decrease.
4.3 Conclusions
We have shown that even using an oracle dataset, where the images used in the
probe set have the highest matching score to their gallery image, performance is
still poor. We have also shown that there is a significant effect on recognition performance when we vary two factors in the acquisition setup. Firstly, the lighting variations between indoors and outdoors cause a drastic drop in recognition performance. Secondly, the camera used to acquire the data also affects performance. Using a higher-quality camera results in higher-quality video, which improves recognition performance. However, data
from surveillance cameras is becoming more common, so there is a need to improve
performance even when the quality of the data is poor. This drives the need to
develop strategies to handle poor quality video for face recognition.
In the rest of the dissertation, we focus on using data acquired by a surveillance
camera, indoors, where the illumination is uncontrolled. This reflects the scenario in most real-world situations where the camera used may be of poor
quality and may be pointed towards a door, where lighting can affect the images
acquired.
CHAPTER 5
HANDLING POSE VARIATION IN SURVEILLANCE DATA
In this chapter, we describe the problem of variable pose in the probe data, which can degrade the performance of the recognition system. Most of the work in face recognition focuses on frontal video; however, surveillance video contains many non-frontal faces. So there needs to be an emphasis on handling variable pose in
the probe data.
We use synthetic poses to enhance our gallery set. By using score-level fusion
across poses, we can improve recognition over using a single frame. We also
describe two fusion techniques to exploit temporal continuity in the probe data.
The rest of the chapter is organized as follows: Section 5.1 describes the techniques we use to handle variable pose. In Section 5.2 we describe the fusion
techniques used. Then we describe our experiments in Section 5.3 and results and
conclusions in Sections 5.4 and 5.5.
Figure 5.1. Frames showing the variable pose seen in a video clip (the
black dots mark the detected eye locations)
problems for recognition, because the initial alignment of the probe and gallery
images is a critical step in matching.
One strategy to address this is to add images of the gallery subjects taken in
a non-frontal pose. However, since we do not have such images, we create them
synthetically using 3D morphable models as in [10] and [48]. In this approach, a
PCA space is trained using a set of 3D models of faces from different demographics
(gender, ethnicity and age). Then a 2D image is aligned to the 3D face using a
set of points marked on the face. By adjusting the parameters of the PCA space,
a synthetic 3D face can be generated.
For each gallery image, we create a 3D model using the digital SLR image and
the FaceGen software described in Chapter 3. Once the model has been created,
it can be rotated to get different poses, which more accurately represent the poses
in the probe dataset. For each subject, we create a set of 17 poses using FaceGen.
We used 17 poses to represent the poses present in the probe dataset. Poses at a further angle from a frontal position could not be tolerated by Viisage; hence, we stopped at 17 poses. In Figure 5.2, we show the 17 gallery images
for one subject: eight with nonzero yaw angles and eight with nonzero yaw and nonzero pitch angles. These images cover a range of about ±24°. Once we have
these images, we can compare each probe to this enhanced gallery set to find the
pose that best matches the pose in the probe image.
for a particular subject. The matching score for that probe frame to that
particular gallery subject then is the average of all the matching scores of the
poses of that subject. If M (i, Pj (S)) is the matching score between probe
frame i and the j-th pose P of subject S, then the matching score between probe frame i and subject S, M(i, S), is defined as in Equation 5.1:

    M(i, S) = (1/N) · Σ_{j=1..N} M(i, P_j(S))    (5.1)

where N is the number of poses of subject S.
We did compare this fusion technique to using the highest of the matching
scores between a frame and the poses of a given subject. However, using the
average of matching scores was more robust to noise introduced by matching
two different poses than using the highest of the matching scores. We call
this approach Single frame.
2. Rank-based average over time: Each set of frames from one subject is called
a probe set, P. For each set P, the frames are numbered starting at zero.
Each probe set has a rank matrix R where each entry is initialized to zero.
For an image I in a probe set, each match score to a particular gallery
subject is weighted by a rank array based on frames already seen. A higher
rank implies a better match. Similarly, a higher match score implies a closer
match.
For all frames seen before the current frame, we first sort the matching scores
of that frame to each of the gallery subjects. For the xth subject in this order,
we assign it a rank of N − x, where N is the number of gallery subjects. So the ranks now range from 0 to N, where N corresponds to the highest-ranked subject and 0 to the lowest-ranked subject. For each gallery subject x,
Figure 5.3. Example of updating the rank matrix (rows = ranks 1-4, columns = gallery subjects 1-4):

Previous rank matrix:      New rank matrix:
rank 1:  A  E  I  M        rank 1:  A+1  E    I    M
rank 2:  B  F  J  N        rank 2:  B    F    J    N+1
rank 3:  C  G  K  O        rank 3:  C    G    K+1  O
rank 4:  D  H  L  P        rank 4:  D    H+1  L    P

Current rankings (best to worst): subjects 1, 4, 3, 2.
we increment position R(x, y), where x is the gallery subject position and
y is the rank assigned. Assume we have 4 subjects in our gallery set and
for the current frame the sorted scores of the frame to the gallery set are as
follows: (1,4,3,2). This means that subject 1 has the highest matching score
to the frame while subject 2 has the lowest score. In Figure 5.3, we show
the values of the rank matrix before and after we update it with these new
values.
The current set of scores for a probe image to a gallery subject is then
weighted by the number of times that gallery subject has had a particular
rank among frames already seen. These values are found using the rank
matrix. We also update the rank matrix based on the rank recognition of
the current frame and the set of gallery subjects, to be used for the next
incoming frame. In Algorithm 1, we show the details of the approach. We
call this approach Rank-based fusion.
score to determine the identity of the current frame. The details of the
algorithm are shown in Algorithm 2. We call this approach Score-based
fusion.
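To make the frame-level fusion concrete, two illustrative sketches follow. The first implements the Single frame rule of Equation 5.1, averaging a probe frame's matching scores over a subject's synthetic poses; the matcher is assumed to return higher scores for better matches. The second sketches Rank-based fusion: a rank matrix counts how often each gallery subject has received each rank in the frames already seen, and the current frame's scores are weighted by that history. Since Algorithm 1 is not reproduced here, the exact weighting below is an illustrative choice rather than the procedure used in our experiments.

import numpy as np

def single_frame_scores(match_score, probe_frame, gallery_poses):
    # "Single frame" fusion (Equation 5.1): the score between a probe frame and
    # a gallery subject is the average of its scores against that subject's
    # poses. gallery_poses maps subject -> list of pose images; match_score is
    # the underlying matcher (assumed: higher score = better match).
    return {
        subject: float(np.mean([match_score(probe_frame, pose) for pose in poses]))
        for subject, poses in gallery_poses.items()
    }

class RankBasedFusion:
    # Rank-based fusion: R counts, for each gallery subject, how often it has
    # received each rank in the frames already seen; the current frame's scores
    # are weighted by that history before deciding the identity.
    def __init__(self, num_subjects):
        self.n = num_subjects
        self.R = np.zeros((num_subjects, num_subjects + 1))  # subject x rank
        self.frames_seen = 0

    def update(self, scores):
        # scores[x]: current frame's matching score to gallery subject x.
        scores = np.asarray(scores, dtype=np.float64)
        if self.frames_seen > 0:
            # Weight each subject by the ranks it has accumulated so far
            # (illustrative choice; the exact weighting is given in Algorithm 1).
            history = self.R @ np.arange(self.n + 1)
            weights = 1.0 + history / (self.frames_seen * self.n)
        else:
            weights = np.ones(self.n)
        fused = scores * weights
        # Update the rank matrix with the current frame: the best-scoring
        # subject gets the highest rank N, the worst gets the lowest.
        order = np.argsort(-scores)
        for position, subject in enumerate(order):
            self.R[subject, self.n - position] += 1
        self.frames_seen += 1
        return fused

For each frame of a probe clip, the per-subject scores (for example those returned by single_frame_scores) would be passed to update, and the identity of the clip taken as the subject with the highest fused score over the sequence.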
5.3 Experiments
In order to measure improvement, we run recognition experiments using the
synthetic poses in our gallery and also exploiting temporal continuity. We demonstrate our approach on the NDSP dataset and the IPELA dataset.
We vary two aspects of the experiment to show performance:
1. Number of poses used to represent the subject in the gallery: We vary the
number of poses used to represent each subject. We start with the original
high-quality image (frontal pose) and add a pose to the right and then to
the left with each iteration.
2. Approach to combine scores across poses: (a) Average score: the average score
across poses; (b) Rank-based fusion: using ranks across frames to improve
recognition; and (c) Score-based fusion: using scores across frames to improve recognition.
5.4 Results
In this section, we show the results in face recognition performance when we
vary the number of poses used in the gallery set and when we exploit temporal
continuity. First, we discuss the results using the NDSP dataset and then the
results using the IPELA dataset.
Figure 5.4. Results: Comparing rank one recognition rates when adding
poses of increasing degrees of off-angle poses
Figure 5.5. Results: Comparing rank one recognition rates when using
frontal, +/-6 degree and +/-24 degree poses
by adding synthetic poses, the benefit from using such poses decreases as we
move further and further away from a frontal position. Furthermore, the highest
performance achieved is about 7.5%. However, as shown in Section 4.1, this is
a challenging dataset, with images acquired in uncontrolled lighting. Therefore,
recognition performance is expected to be poor. As we will discuss later, there are
other factors besides variable pose that affect recognition, since this is a dataset
acquired in uncontrolled conditions.
Exploiting temporal continuity: In Figure 5.6, we show the rank one
recognition rate when using between 1 and 17 gallery images per subject, with
all gallery images generated with morphable models. We also show performance
when using each of the three fusion techniques.
In Table 5.1, we compare the best performance achieved using each of the three
fusion techniques when using the baseline set and the test set for our experiments.
We see that we can improve performance by enhancing the gallery set with
artificial poses generated using a 3D morphable model, compared to using a single
high-quality image. Therefore, even though the morphed images are not of the same
quality as the original image (as seen in Figure 5.2), they help improve performance.
Another aspect that we exploit is the temporal continuity between frames in a video
sequence. By weighting the scores based on the
scores of frames seen already, we improve recognition performance over treating
each frame as a single image. This is an advantage of using a video sequence
rather than a set of still images.
Even though we cannot pick out the best frame (the baseline) without using
prior knowledge of the dataset, we improve performance over it: by incorporating
temporal continuity, we can outperform the baseline performance.
Figure 5.6. Results: Comparing rank one recognition rate when using
fusion techniques to improve recognition
TABLE 5.1

RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION

Dataset         Size of dataset   Fusion technique   Rank one recognition rate   Number of poses
Baseline        57                Average            15.79%                      17
Test dataset    3233              Average            2.62%                       17
Test dataset    3233              Rank-based         16.49%                      17
Test dataset    3233              Score-based        21.06%                      17
Reasons for poor recognition: Overall, performance is poor (best recognition rate is about 21%). However, based on baseline performance, we see that this
is a very challenging dataset. One of the main reasons for poor performance is
illumination. We can see this in Figure 5.7 (a), where the subject's face is washed
out due to the glare of the sun. Therefore, even though the eyes are picked out
correctly, there is insufficient variation on the face for good recognition. Another
reason is the actual location of the eyes as seen in (b) of Figure 5.7. Even though
the locations are on the face of the subject, they are not precise and hence the
localization of the image is poor. A third reason for this is seen in (c) of Figure 5.7,
where the subject's face is angled down so much that it is hard for the recognition
system to match it to the right gallery image. Finally, the inter-ocular distance
is about 40 pixels, whereas Viisage requires about 60 pixels between the eyes for
robust detection and recognition. So the conditions are not ideal for recognition
using Viisage. Therefore, even when we do select the eyes correctly, performance
is poor.
TABLE 5.2

RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION ON THE IPELA DATASET

Fusion technique     Number of poses in gallery   Rank one recognition rate   Equal error rate
Single frame         1                            27.99%                      26.10%
Single frame         14                           33.88%                      26.63%
Rank-based fusion    1                            30.49%                      29.47%
Rank-based fusion    14                           35.91%                      29.72%
Score-based fusion   1                            31.11%                      20.91%
Score-based fusion   14                           41.74%                      22.82%
5.5 Conclusions
We have shown that adding synthetic poses can be beneficial for face recognition. Even when they do not retain all the details of the face (such as skin
texture), they can be used to augment the gallery set. While it would be more
beneficial to have actual images of the subjects in a range of poses, even in instances
where those images are not available we can create synthetic images to accomplish
the same purpose. We have also shown that while adding images helps recognition
performance up to a point, it can also hinder performance as the poses move
further and further away from a frontal pose. So ideally, using between 2
and 5 images (within a small range of variation from a frontal pose) is useful for
recognition.
We have also shown that temporal continuity can be exploited in multiple ways,
using either rank-based fusion or score-based fusion. Again, some approaches are
more useful than others (in our experiments, we showed that score-based fusion
performs the best overall). Score-based fusion is shown to outperform rank-based
fusion in other works such as [1] and [14]. When using video data as input, the
temporal continuity between frames should be exploited to improve the reliability
of the matching scores for improved face recognition.
CHAPTER 6
HANDLING VARIABLE ILLUMINATION IN SURVEILLANCE DATA
In Section 6.4, we describe the dataset used in our experiments. Finally, Sections
6.5 and 6.6 describe our results and our conclusions.
then replace every pixel on the right half of the image by a pixel from the left
half, so that the right half is a mirror image of the left half of the image. Let W
be the width of the image. If p(i, j) is the intensity value of the pixel in row i
and column j, and r(i, j) is the corresponding pixel in the Reflected left image,
then r(i, j) is determined as shown in Equation 6.1.
r(i, j) = p(i, j) if j <= W/2, and r(i, j) = p(i, W - j) if j > W/2    (6.1)

For the Reflected right image, the corresponding definition is given in Equation 6.2.

r(i, j) = p(i, j) if j > W/2, and r(i, j) = p(i, W - j) if j <= W/2    (6.2)
size (130 × 150 pixels) and then normalized based on gray-value intensity to
have a dynamic range of 0 to 255.
2. Reflect left dataset: For the left reflected images, the cropped and aligned
images are reflected left over right. We retain the left side of the image as is
and reflect those pixels across the vertical mid-axis so that the right half of
the image is a mirror image of the left half, as defined by Equation 6.1. The
reflected left images are used both for training and recognition for this set.
3. Reflect right dataset: In this set, the cropped and aligned images are
reflected right over left. Similar to the Reflect left dataset, we retain the
right half of the image as is and mirror it across the vertical mid-axis onto
the left half, using Equation 6.2. In the Reflect right dataset, the images
reflected right over left are used both to train the face space and for recognition.
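As an illustration, the two reflections can be implemented in a few lines of NumPy; this sketch assumes an image stored as a 2-D array indexed as (row, column) and is not taken from the original code.

```python
import numpy as np

def reflect_left(img):
    # Equation 6.1: keep the left half and mirror it across the vertical
    # mid-axis so the right half becomes its mirror image.
    out = img.copy()
    w = img.shape[1]
    out[:, w - w // 2:] = img[:, :w // 2][:, ::-1]
    return out

def reflect_right(img):
    # Equation 6.2: keep the right half and mirror it onto the left half.
    out = img.copy()
    w = img.shape[1]
    out[:, :w // 2] = img[:, w - w // 2:][:, ::-1]
    return out
```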
In Figure 6.4, we show the resulting images. The first column is the original
image. The middle column is the corresponding image from the Reflect left
dataset and the third column contains the corresponding image from the Reflect
right dataset. We also inverted the images so that it is easier to see the lighting
changes. The gallery images are illuminated with a frontal light source, so the
left and right halves of the face are illuminated equally. However, in the original
probe images, the left side of the face is illuminated differently from the right
side. This illumination is evened out when the images are reflected left over right
or when the images are reflected right over left. However, when we compare the
third column of images in Figure 6.4 to the image in Row 1 of Figure 6.4, the
light and dark parts of the image are inverted, since the light source is not frontal
but rather coming in from the side of the face. This can cause problems for a
recognition system. We also notice a bright line down the midline (corresponding
to a dark region in the original image). This is because of the shadow formed down the
middle of the face due to the direction of the lighting.
We further illustrate this effect by looking at the average illumination of each
column in the images. We take the average of each column for each image and
then take the average over all the images. We separate out the probe and the
gallery images. We display these values in a graph in Figure 6.5, and see that the
gallery images are more evenly lit from left to right (though the two halves are not
identical). The average intensity of each column of the gallery image differs from
the probe images. When we reflect the images left over right, the average intensity
per column of the probe images more closely resembles that of the gallery images, because
the light source appears to be coming from the same direction with respect to the
left half of the image. However, when the images are reflected right over left, the
average intensities of the probe and gallery images are vastly different. This is
because the gallery images are illuminated from the front, while the probe images
are illuminated from the right side, so the light source appears to be coming from
different directions.
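A short sketch of this measurement, assuming the images are NumPy arrays of identical size, is:

```python
import numpy as np

def mean_column_profile(images):
    # Average intensity of each column, averaged over a set of images of the
    # same size; returns one value per image column.
    stack = np.stack([np.asarray(img, dtype=float) for img in images])
    return stack.mean(axis=(0, 1))
```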
Figure 6.4. Example images: original image, reflected left and reflected
right
r(i, j) = ( p(i, j) + p(i, W - j) ) / 2
r(i, W - j) = ( p(i, j) + p(i, W - j) ) / 2    (6.3)
We show a figure of this preprocessing in Figure 6.6. In these images, the left
half of the image is a mirror image of the right half, reflected across the vertical
midline of the image.
In Figure 6.7, we show the resulting images when we create new images by
taking the average of the left and right pixels.
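A minimal sketch of this averaging, again assuming a NumPy image array, is:

```python
import numpy as np

def average_halves(img):
    # Equation 6.3: every pixel is averaged with its mirror across the vertical
    # midline, so the two halves of the result are identical.
    img = np.asarray(img, dtype=float)
    return (img + img[:, ::-1]) / 2.0
```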
Wang et al. [46] used self-quotient images to handle the illumination variation
for face recognition. The Lambertian model of an image can be separated into
two parts, the intrinsic part and the extrinsic part. If we can estimate the extrinsic
factor based on the lighting, it can be factored out of the image, and the intrinsic
part can be retained for face recognition. The self-quotient image is found by
smoothing the image with a kernel and dividing the original pixels by the smoothed
values. Let F be the smoothing filter and I the original image; then the self-quotient
image Q is defined as

Q = I / F[I].
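For illustration, a simple version of this preprocessing can be written as follows; we use a plain Gaussian as the smoothing filter here, whereas [46] uses a weighted smoothing kernel, so this is only an approximation of their method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def self_quotient_image(img, sigma=3.0, eps=1e-6):
    # Q = I / F[I]: divide the image by a smoothed version of itself.
    img = np.asarray(img, dtype=float)
    smoothed = gaussian_filter(img, sigma=sigma)
    return img / (smoothed + eps)   # eps avoids division by zero in flat dark regions
```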
use them as input for PCA to create a face space based on these images and then
recognize the probe images. In Figure 6.8, we show the images from Figure 6.4 after
they are processed to retrieve the quotient image and the self-quotient image.
6.4 Experiments
In this section, we describe the dataset used for our experiments and also the
experiments we conduct.
6.4.3 Experiments
In order to test our approach when using the surveillance data, we use four-fold
cross validation. We split up the probe dataset into four subject-disjoint subsets,
where each set contains 26 subjects.
We use CSU's PCA code [13] for our matching experiments. We train the face
space on three subsets of probe-type data and all the gallery images and use
the remaining subset for testing. Three eigenvectors corresponding to the three
largest eigenvalues are dropped. We use covariance distance as a matching score
between two images. Finally, since we may not know in advance which side of the
face is better illuminated, we also use the average of the Reflect left and Reflect
right scores for recognition. This gives us three recognition scores:
1. Reflect left results: Performance when using the images reflected left over
right.
2. Reflect right results: Performance when using the images reflected right
over left.
3. Average score results: Performance when using score-level fusion to combine
scores for Reflect left and Reflect right scores for each probe frame.
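The score-level fusion in the third case is a simple average of the two reflected scores for each probe frame, as sketched below (the function name is ours):

```python
def recognition_scores(score_left, score_right):
    # The three scores reported for one probe frame: Reflect left,
    # Reflect right, and their score-level fusion (simple average).
    return {
        "reflect_left": score_left,
        "reflect_right": score_right,
        "average": 0.5 * (score_left + score_right),
    }
```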
We then use the average of the four scores (one for each testing set) as our
measure of performance for the different approaches. We also tried using the minimum
of the matching scores; however, we achieve higher performance when we use the
average of the matching scores. We report rank one recognition rate and equal error
rate.
6.5 Results
In Table 6.1, we show the rank one recognition rate and equal error rate of
recognition when using each of the three sets of images, as described in Section
6.2. We also compare our performance to using the approaches in [15] and [46].
Finally, we compare recognition performance when treating each frame separately
and when we exploit temporal continuity between the frames.
In the single-frame approach, we see that the highest rank one recognition rate
(49.88%) is achieved when using the images reflected left over right. We also see
that the lowest rank one recognition rate (22.48% recognition rate) is achieved
when using the images reflected right over left. When we look at equal error rate,
using score fusion to combine the Reflect left and Reflect right scores has the
best rate at 18.80%. We also see that with image-level averaging, performance is
similar to that with the original images. Thus, score-level fusion performs better than
the image-averaging approach.
The poor performance of the Reflect right images can be explained by the
fact that the illumination on the face is not similar to the illumination seen on
the gallery images. This shows that uneven illumination on the face does affect
recognition performance. But if we can preprocess the probe images to have
a similar illumination as those in the gallery set, we can improve recognition.
TABLE 6.1

RESULTS: COMPARING RESULTS FOR DIFFERENT ILLUMINATION TECHNIQUES

                                   Single frame                                    Exploiting temporal continuity
Approach                           Rank one recognition rate   Equal error rate    Rank one recognition rate   Equal error rate
Original images                    38.62%                      19.45%              39.67%                      15.66%
Quotient images                    32.51%                      27.11%              40.82%                      18.32%
Self-quotient images               21.90%                      32.75%              27.27%                      24.61%
Reflect left images                49.88%                      18.27%              54.03%                      14.13%
Reflect right images               22.48%                      30.65%              21.65%                      28.60%
Average images                     38.63%                      22.52%              41.74%                      16.38%
Average of left and right scores   44.24%                      18.80%              49.91%                      13.83%
Furthermore, we can use score-level fusion of the left and right reflected images to
improve recognition over using the original images. While this fusion may not always
give the highest performance overall, it consistently improves over the original images.
Using the quotient images performs better than self-quotient images, but it
still does not improve over using the original images. These techniques focus on
single light sources and not on instances of uncontrolled lighting as seen in this
video. Thus they fail to correctly model the lighting effect on these images.
Furthermore, we improve recognition when we exploit temporal continuity. We
improve rank one recognition rates to about 54% when using the Reflected left
images and to about 50% when we use the average of the left and right scores.
Also, the equal error rate is improved to 13.83% when using the average of the left
and right scores. The improvement when using quotient images and moving from
single frame to score-level fusion across frames is large. When exploiting temporal
continuity, using quotient images shows slight rank one recognition improvement
over using the original images.
6.6 Conclusions
We have shown both visually and through experimentation, that the change
in illumination results in poor recognition. By creating reflected images, where
the illumination of the probe images resembles that of the gallery images, we
can improve recognition performance. In cases where the direction of the light
source is unknown, we can reflect images both ways and use score-level fusion to
improve recognition over using the original images. We have also shown that in
such a dataset, techniques such as the self-quotient and quotient image may hurt
recognition performance rather than help it.
CHAPTER 7
OTHER EXPERIMENTS
In this chapter, we discuss some other issues that we explored for this dissertation.
First, we describe a technique that uses background subtraction and gestalt clustering
to evaluate the detections found by the face detector. Then we include a discussion
of how the distance metric used and the number of eigenvectors
dropped can affect recognition performance.
weathered images, i.e., images taken in various conditions like precipitation, snow,
and fog, and show that poor eye localization clearly degrades face recognition
performance. This is a significant source of error in surveillance video face detection
and recognition. Hence, we further reduce the images used as probes by evaluating
the accuracy of the face detections. We describe a two-level approach to prune the
probe image set to those images in which the face is correctly picked out. The two
approaches are described in detail below.
F(I) = (Lx, Ly, Rx, Ry, Mx, My)    (7.1)
eye locations, separate from the incorrect eye locations (enclosed in the oval). A
technique such as a minimum spanning tree (MST), as described above, will choose
the continuous line, since the inter-node distance between a node and its nearest
neighbor decides whether or not two points are in the same cluster.
In our approach, we use the coordinates of the detected eye locations and
sequence number of the respective frame within a subject clip as the features of
our dataset. If Lx , Ly , Rx and Ry are the x and y coordinates of the left and
right eye locations, respectively and t is the time stamp for that image within the
video clip of one subject, then the feature vector V for a frame I is described in
Equation 7.2.
V(I) = (Lx, Ly, Rx, Ry, t)    (7.2)
We treat each subject probe set separately. A subject probe set is one clip that
contains the same subject in all the frames. We create a graph from each probe set.
Each frame's feature vector corresponds to one node in the graph. The distance
between every pair of nodes (eye locations and time stamp) is the Euclidean
distance between them in five-dimensional space, where the five dimensions are
defined in Equation 7.2.
We use an implementation [39] of Prim's algorithm [20] to find the MST in
the graph. The running time of Prim's algorithm is O(|V|^2 + |E|), where |V| is
the number of nodes in the graph and |E| is the number of edges. We modify
the code slightly to work with our application.
We run this algorithm on each of the 57 subject probe sets. We set a threshold
of 40 pixels, which determines whether or not two nodes are in the same cluster.
Our initial node is the rightmost point picked out to be the eyes from a set of
frames for one subject. At this point in the video clip, the subject is closest to
the frontal position, hence the likelihood of the face detector picking out the face
correctly is high. After this step, we reduce the set of 4373 probe images to a set
of 3233 images. This corresponds to removing images such as (c) in Figure 7.1.
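A hedged sketch of this pruning step, using SciPy's minimum spanning tree and connected components rather than the modified implementation from [39], is shown below; the array layout and function name are ours.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def prune_probe_set(features, start_index, threshold=40.0):
    # features: (n, 5) array of (Lx, Ly, Rx, Ry, t) vectors (Equation 7.2)
    # for one subject's probe set; start_index is the initial node.
    dist = squareform(pdist(np.asarray(features, dtype=float)))  # 5-D Euclidean distances
    mst = minimum_spanning_tree(dist).toarray()
    mst[mst > threshold] = 0.0                     # cut MST edges longer than the threshold
    adjacency = (mst + mst.T) > 0                  # symmetric adjacency of the kept edges
    _, labels = connected_components(adjacency, directed=False)
    return np.flatnonzero(labels == labels[start_index])  # indices of retained frames
```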
TABLE 7.1

COMPARING DATASET SIZE OF ALL IMAGES TO BACKGROUND SUBTRACTION APPROACH AND GESTALT CLUSTERING APPROACH

Dataset                  Size of dataset   Fusion technique   Rank one recognition rate   Number of poses
Baseline                 57                Average            15.79%                      17
Original dataset         25962             Average            2.62%                       17
Original dataset         25962             Rank-based         3.07%                       17
Original dataset         25962             Score-based        2.3%                        17
Background subtraction   4373              Average            6.63%                       17
Background subtraction   4373              Rank-based         10.63%                      17
Background subtraction   4373              Score-based        12.99%                      17
Gestalt clustering       3233              Average            7.70%                       17
Gestalt clustering       3233              Rank-based         16.49%                      17
Gestalt clustering       3233              Score-based        21.06%                      17
Figure 7.4. Results: Rank one recognition rates when using the entire
dataset
Figure 7.5. Results: Rank one recognition rates when using the dataset
after background subtraction
Figure 7.6. Results: Rank one recognition rates when using the dataset
after background subtraction and gestalt clustering
eigenvalues, which is a good range to show how the number of vectors dropped
affects performance. After dropping more than five vectors, performance begins
to decrease.
7.2.1 Experiments
We conduct a variety of experiments. We use four-fold cross validation to test
each combination. We report results when using the MahCosine and covariance
distance metrics, and when we vary the number of eigenvectors dropped from the
front of the eigenvector list. We also show results when we
exploit temporal continuity using this dataset.
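A rough sketch of the training side of these experiments is given below. The eigenvector-dropping step is generic PCA; the similarity function is only an assumed, normalized inner-product form, since the exact MahCosine and covariance measures are defined by the CSU code [13] and are not reproduced here, and the number of retained components is an arbitrary choice for illustration.

```python
import numpy as np

def train_face_space(train_images, n_drop=3, n_keep=60):
    # Build an eigenface basis and drop the n_drop eigenvectors with the largest
    # eigenvalues, which tend to capture illumination rather than identity.
    X = np.asarray(train_images, dtype=float)        # (n_images, n_pixels), one row per image
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = vt[n_drop:n_drop + n_keep]               # remaining leading components
    return mean, basis

def project(img, mean, basis):
    # Project a flattened image into the trained face space.
    return basis @ (np.asarray(img, dtype=float).ravel() - mean)

def covariance_similarity(a, b):
    # Assumed form (cosine of the angle between projections); the exact
    # definition used by the CSU distance measures may differ.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```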
7.2.2 Results
In Table 7.2, we show performance when we vary the distance metric used and
the number of eigenvectors dropped.
We see that in this dataset using the covariance distance metric performs
better than using the MahCosine distance. We also see that as we drop eigenvectors
corresponding to the highest eigenvalues, performance increases. Hence, in our
experiments we use covariance distance and drop three eigenvectors when training
our recognition model.
TABLE 7.2

RESULTS: PERFORMANCE WHEN VARYING DISTANCE METRICS AND NUMBER OF EIGENVECTORS DROPPED

                                      Single frame                                   Exploiting temporal continuity
Distance metric   Eigenvectors        Rank one recognition   Equal error rate (%)    Rank one recognition   Equal error rate (%)
                  dropped             rate (%)                                       rate (%)
MahCosine         0 vectors           16.98                  34.95                   28.29                  25.88
distance          1 vector            17.12                  34.95                   28.17                  25.77
                  3 vectors           16.98                  35.3                    27.15                  26.44
                  5 vectors           16.46                  35.31                   27.08                  26.63
Covariance        0 vectors           17.41                  30.18                   21.24                  26.36
distance          1 vector            26.6                   26.7                    33.05                  21.83
                  3 vectors           27.99                  26.1                    31.11                  20.91
                  5 vectors           28.07                  25.49                   33.84                  19.34
CHAPTER 8
CONCLUSIONS
We have shown in this dissertation that even though face recognition performance
on surveillance-quality data is poor, there are strategies we can use to improve it.
We list our contributions below:
1. Comparison: We compare recognition performance when using two different
sensors, a high-definition camcorder and a surveillance camera. We also
compare data acquired in two different environments. We show that using
surveillance data for recognition causes recognition performance to drop.
Also, when we move from indoors to outdoors, recognition performance is
hurt.
2. Handling pose variation in surveillance data: We use synthetic poses to
enhance our gallery set and then use score-level fusion to combine scores
across poses to improve recognition over a single-frame approach. While the
multiple poses appear to account for only a small improvement in recognition in
this case, we have to consider the other problems in the datasets, such as
illumination changes and low resolution. We also exploit temporal continuity
of the video data to further improve recognition.
3. Handling illumination variation in surveillance data: We create reflected images and use score-level fusion to make face recognition performance more robust to varying illumination. We also compare our approach to self-quotient
images and quotient images and show that we can outperform these approaches for this dataset.
4. Face detection evaluation: We devise a strategy using background subtraction and gestalt clustering to prune a set of face detections to the set of true
detections for improved recognition.
There is still considerable room for improvement in the area of face recognition
from surveillance-quality video. We need more systems that are robust to lighting
variation, which has a large effect on face recognition. There also needs to be more
exploration of other areas of real-world face recognition, such as techniques to deal
with obstructions and with low resolution on the face.
APPENDIX A
GLOSSARY
Definitions of terms:
Cumulative Match Characteristic curve: A graph that shows the correct recognition rate as a function of rank
False Accept Rate: The ratio of the false positives (see below) to the total
number of negative examples
False Negative (fn): The number of outcomes incorrectly predicted to be negative among all positive outcomes in the set
False Positive (fp): The number of outcomes incorrectly predicted to be positive
among all negative outcomes in the set
False Reject Rate: The ratio of the false negatives to the total number of
positive examples
Identification: One-to-many comparative process of a biometric sample or a
code derived from it against all of the known biometric reference templates
on file
Receiver Operating Characteristic curve: A graph that shows how the
verification rate (true positive rate) changes as the false accept rate
increases
Verification: The process of comparing a submitted biometric sample against
a single biometric reference of a single enrollee whose identity or role is being
claimed
Signature: An image that represents a subject
APPENDIX B
POSE RESULTS
(Figures: pose results for Set 1, Set 2, Set 3, and Set 4.)
APPENDIX C
ILLUMINATION RESULTS
(Figures: illumination results for Set 1, Set 2, Set 3, and Set 4.)
BIBLIOGRAPHY
URL http://www.
Project.
31. U. Park and A. K. Jain. 3D model-based face recognition in video. In International Conference on Biometrics, pages 1085–1094, 2007.
32. J. Phillips, P. Grother, D. Blackburn, E. Tabassi, and M. Bone. FRVT 2002
Evaluation Report. http://www.frvt.org/FRVT2002/documents.htm, 2002.
33. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman,
J. Marques, J. Min, and W. Worek. The 2005 IEEE workshop on face recognition
grand challenge experiments. In Proceedings of the 2005 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, page 45,
Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2372-2. doi: http://dx.doi.org/10.1109/CVPR.2005.585.
34. J. Phillips, T. Scruggs, A. J. O'Toole, P. J. Flynn, K. W. Bowyer, C. L. Schott, and M. Sharpe. FRVT 2006 Evaluation Report. http://face.nist.gov/frvt/frvt2006/frvt2006.htm, 2006.
35. Pittpatt. Face detection and recognition, 2004. URL http://www.pittpatt.com.
36. J. Price and T. Gee. Towards robust face recognition from video. In Proceedings on Applied Image Pattern Recognition, pages 94–102, 2001.
37. Z. Rahman and G. Woodell. A multiscale retinex for bridging the gap between
color images and the human observation of scenes. In IEEE Transactions on
Image Processing, volume VI, pages 965–976, 1997.
38. T. Riopka and T. Boult. The eyes have it. In Proceedings of the 2003 ACM
SIGMM Workshop on Biometrics Methods and Applications, pages 9–16, 2003.
39. A. Scvortov. Dzone snippets. URL http://snippets.dzone.com/user/scvalex/tag/minimum+spanning+tree.
40. T. Sim, S. Baker, and M. Bsat. The CMU Pose, Illumination, and Expression
(PIE) Database. In Proceedings IEEE International Conference on Automatic
Face and Gesture Recognition, page 53, 2002.
41. Sony. Sony SNCRZ50N. URL http://pro.sony.com/bbsc/ssr/cat-securitycameras/cat-ip/product-SNCRZ50N/.
42. Sony. HDR-HC7 high definition handycam. URL http://www.sonystyle.com/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10551&storeId=10151&productId=11039061&langId=-1.
43. D. Thomas, K. W. Bowyer, and P. J. Flynn. Multi-frame approaches to
improve face recognition. In Proceedings of the IEEE Workshop on Motion
and Video Computing, page 19, 2007.
44. M. Turk and A. Pentland. Face recognition using eigenfaces. In Computer
Vision and Pattern Recognition, pages 586–590, 1991.
45. Viisage. IdentityEXPLORER SDK, 2006. URL http://www.viisage.com.
46. H. Wang, S. Z. Li, Y. Wang, and J. Zhang. Self quotient image for face
recognition. In ICIP, pages 1397–1400, 2004.
47. S. D. Wei and S. H. Lai. Robust face recognition under lighting variations. In
International Conference on Pattern Recognition, pages 354–357, 2004.
48. B. Weyrauch, B. Heisele, J. Huang, and V. Blanz. Component-based face
recognition with 3D morphable models. In Computer Vision and Pattern
Recognition Workshop, volume V, page 85, 2004.
49. C. T. Zahn. Graph-theoretical methods for detecting and describing gestalt
clusters. In IEEE Transactions on Computers, volume XX:1, pages 68–86,
1971.
This document was prepared and typeset with pdfLaTeX, and formatted with the nddiss2e class file (v3.0 [2005/07/27]) provided by Sameer Vijay.