Human and Machine Recognition of Faces: A Survey

RAMA CHELLAPPA, FELLOW, IEEE, CHARLES L. WILSON, SENIOR MEMBER, IEEE, AND SAAD SIROHEY, MEMBER, IEEE
inverted faces; existence of a “grandmother” neuron for face recognition; role of the right hemisphere of the brain in face perception; and inability to recognize faces due to conditions such as prosopagnosia. Some of the theories put forward to explain the observed experimental results are contradictory. Many of the hypotheses and theories put forward by researchers in these disciplines have been based on rather small sets of images. Nevertheless, several of the findings have important consequences for engineers who design algorithms and systems for machine recognition of human faces.

Barring a few exceptions [21], [24], [116], research on machine recognition of faces has developed independently of studies in psychophysics and neurophysiology. During the early and mid-1970’s, typical pattern classification techniques, which use measured attributes between features in faces or face profiles, were used. During the 1980’s, work on face recognition remained largely dormant. Since the early 1990’s, research interest in FRT has grown very significantly. One can attribute this to several reasons: An increase in emphasis on civil/commercial research projects; the reemergence of neural network classifiers with emphasis on real-time computation and adaptation; the availability of real-time hardware; and the increasing need for surveillance-related applications due to drug trafficking, terrorist activities, etc.

Over the last five years, increased activity has been seen in tackling problems such as segmentation and location of a face in a given image, and extraction of features such as eyes, mouth, etc. Also, numerous advances have been made in the design of statistical and neural network classifiers for face recognition. Classical concepts such as Karhunen-Loeve transform based methods [11], [82], [104], [124], [133], singular value decomposition [69] and more recently neural networks [21], [51] have been used. Barring a few exceptions [104], many of the existing approaches have been tested on relatively small datasets, typically less than 100 images.

In addition to recognition using full face images, techniques that use only profiles constructed from a side view are also available. These methods typically use distances between the “fiducial” points in the profile (points such as the nose tip, etc.) as features. Modifications of Fourier descriptors have also been used for characterizing the profiles. Profile based methods are potentially useful for the mug shot problem, due to the availability of side views of the face.

All of the discussion thus far has focused on recognizing faces from still images. The still image problem has several inherent advantages and disadvantages. For applications such as mug shots matching, due to the controlled nature of the image acquisition process, the segmentation problem is rather easy. On the other hand, if only a static picture of an airport scene is available, automatic location and segmentation of a face could pose serious challenges to any segmentation algorithm. However, if a video sequence acquired from a surveillance camera is available, segmentation of a person in motion can be more easily accomplished using motion as a cue. Only a handful of papers on face recognition [117], [121], [133] have addressed the issue of segmenting a face image from the background. However, there is a significant amount of work reported in the image understanding (IU) literature [1], [2] on segmenting a moving object from the background using a sequence. Also, there is a significant amount of work on the analysis of nonrigid moving objects, including faces, in the IU [1], [2] as well as the image compression literature [4]. We briefly discuss those techniques that have potential applications to recovery and reconstruction (in 3D) of faces from a video sequence. The reconstructed image will be useful for recognition tasks when disguises and aging are present.

In addition to the separation of images into static and real-time image sequences, several other parameters are important in critically evaluating existing methods. In any pattern recognition problem the accuracy of the solution will be strongly affected by the limitations placed on the problem. To restrict the problem to practical proportions, both the image input and the size of the search space must have some limits. The limits on the image might for example include controlled format, backgrounds which simplify segmentation, and controls on image quality. The limits on the database size might include geographic limits and descriptor based limits. Critical issues involving data collection, evaluation and benchmarking of existing algorithms and systems also need to be addressed.

An excellent survey of face recognition research prior to 1991 is in [114]. Still we decided to prepare our survey paper due to the following reasons: The face recognition area has become very active since 1990. Approaches based on Karhunen-Loeve expansion, neural networks and feature matching have all been initiated since the survey paper [114] appeared. Also, [114] did not cover discussions on face recognition from a video, profile, or range imagery nor any aspects of performance evaluation.

The organization of the paper is as follows: In Section II we describe several applications of FRT in still and video images and point out the specific constraints that each set of applications poses. Section III provides a brief summary of issues that are relevant from the psychophysics point of view. In Section IV a detailed review of face recognition techniques, involving still intensity and range images, in the engineering literature is given. Techniques for segmentation of faces from clutter, feature extraction and recognition are detailed. Face recognition using profile images (which has not been pursued with much vigor in recent years, but nevertheless is useful in the mug shots matching problem) is discussed in Section V. Section VI presents a discussion on face recognition from video images with special emphasis on how IU techniques could be useful. Some specific examples of face recognition and recall work in law enforcement domains, and commercial applications are briefly discussed in Section VII. Data collection and performance evaluation of face recognition algorithms and architectures are addressed in Section VIII. Finally, summary and conclusions are in Section IX.
disguises must be accounted for in feature extraction and matching. In applications 1 and 2, the matching criterion can be quantified; also, the top few choices can be rank ordered.

Applications 4-7 involve finding or creating a face image which is similar to the human recollection of a face. In application 4, an expert confirms that the face in the given image corresponds to the person in question. It is possible that the face in the image could be disguised or occluded. Typically, in this application a list of similar looking faces is generated using a face identification algorithm; the expert then performs a careful analysis of the listed faces. In application 5 the witness is asked to compose a picture of a culprit using a library of features such as noses, eyes, lips, etc. For example the library may have examples of noses that are long, short, curved, flat, etc., from which the one that is closest to the witness’s recollection is chosen. In application 6, electronic browsing of a photo collection is attempted. Application 7 involves a witness identifying a face from a set of face images which include some false candidates. Typically, in these applications the image quality tends to be low; in addition to matching, it is required to find faces that are similar to a recalled face. The similarity measure is difficult to quantify, as measures supposedly used by humans need to be defined. The problem is complicated further in that when humans search through a mug shots book, they tend to make more recognition errors as the number of mug shots presentations increases. It is difficult to completely quantify the degradation in machine implementation of algorithms developed for applications 4-6. Another issue is the incorporation of the mechanisms for recalling faces that humans use in the algorithms. Applications 4-7 need a strong interaction between algorithms and known results in psychophysics and neuroscience studies.

Fig. 3. An approximate illustration of an uncontrolled environment for face images corresponding to application 2.

Applications 8 and 9 involve transformations of images from current data to what they could have been (application 8) or to what they will be (application 9). These are even more difficult than applications 4-6, since “smoothing” or “predictive” mechanisms need to be incorporated into the algorithms.

B. Dynamic Matching

We group application 3, and cases of application 2 where a video sequence is available, as dynamic. The images available through a video camera tend to be of low quality. Also, in crowd surveillance applications the background is very cluttered, making the problem of segmenting a face in the crowd difficult. However, since a video sequence is available, one could use motion as a strong cue for segmenting faces of moving persons. One may also be able to do partial reconstruction of the face image using existing models [10], [23], [87] and be able to account for disguises, somewhat better than in static matching problems. One of the strong constraints of this application is the need for real-time recognition. It is expected that several of the existing methodologies in the IU literature [1]-[5] for image sequence based segmentation, structure estimation, nonrigid
different mechanisms being used for detection and for identification.

The role of spatial frequency analysis: Earlier studies [47], [64] concluded that information in low spatial frequency bands plays a dominant role in face recognition. Recent studies [119] show that, depending on the specific recognition task, the low, bandpass and high frequency components may play different roles. For example the sex judgment task is successfully accomplished using low frequency components only, while the identification task requires the use of high frequency components. The low frequency components contribute to the global description, while the high frequency components contribute to the finer details required in the identification task.
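To make the low/high frequency distinction concrete, the following sketch (ours, not from the studies cited above; it assumes only NumPy and SciPy, and the sigma value is arbitrary) splits a face image into a low-pass component carrying the global description and a residual high-pass component carrying the finer detail:

import numpy as np
from scipy.ndimage import gaussian_filter

def split_spatial_frequencies(face, sigma=4.0):
    """Split an image into low- and high-frequency components.

    The low-pass band (Gaussian blur) retains the coarse, global
    structure said to suffice for tasks such as sex judgment; the
    residual high-pass band holds the fine detail reported to be
    needed for identification.
    """
    face = face.astype(np.float64)
    low = gaussian_filter(face, sigma=sigma)   # global description
    high = face - low                          # finer details
    return low, high

# Example with a synthetic array; replace with a real face image.
image = np.random.rand(128, 128)
low_band, high_band = split_spatial_frequencies(image, sigma=4.0)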
The role of the brain: [40] The role of the right hemisphere in face perception has been supported by several researchers. In regard to prosopagnosia and the right hemisphere, a retrospective study seems to strongly indicate right hemisphere involvement in face recognition. In other brain damaged victims, those with right hemisphere disease have more impairment in facial recognition than those with left hemisphere disease. When shown the left half of one face and the right half of another face tachistoscopically, the overwhelming majority of commissurotomy patients selected the face shown to the left vision field (LVF), which arrives initially at the right hemisphere. In other tachistoscopic studies, the LVF has the advantage in both speed and accuracy of response and in long term memory response. Studies have also shown a right hemisphere advantage in reception and/or storage of faces. Some other studies argue against right hemisphere superiority in face perception. Postmortem studies of prosopagnosia victims with known lesions in the right hemisphere have found approximately symmetrical lesions in the left hemisphere. Other cases of bilateral brain damage have been seen or suspected in patients with prosopagnosia. The ways in which the two hemispheres operate may reflect variations in degrees of expertise. It appears that the right hemisphere does possess a slight advantage in aspects of face processing. It is also true that the two hemispheres may simultaneously handle different types of information. The dominance of the right hemisphere in facial processing may be the result of left hemisphere dominance in language. The right hemisphere is also involved in the interpretation of emotions, and this may underlie the slight asymmetry in perceiving and remembering faces.

Face recognition by children: [29], [30] It appears that children under ten years of age code unfamiliar faces using isolated features. Recognition of these faces is done using cues derived from paraphernalia, such as clothes, glasses, hair style, hats, etc. Ten-year-old children exhibit this behavior less frequently, while children older than 12 years rarely exhibit this behavior. It is postulated that around age ten, children seem to change their recognition mechanisms from one of isolated features and paraphernalia to one of holistic analysis. Curiously, when children as young as five years are asked to recognize familiar faces, they do pretty well in ignoring paraphernalia. Several other interesting studies related to how children perceive inverted faces are summarized in [29].

Facial expression: [19] Based on neurophysiological studies, it seems that analysis of facial expressions is accomplished in parallel to face recognition. Some prosopagnosic patients, who have difficulties in identifying familiar faces, nevertheless seem to recognize emotional expressions. Patients who suffer from “organic brain syndrome” suffer from poor expression analysis but perform face recognition quite well. Normal humans exhibit parallel capabilities for facial expression analysis and face recognition. Similarly, separation of face recognition and “focused visual processing” (look for someone with a thick mustache) tasks has been claimed.

Role of race/gender: Humans recognize people from their own race better than people from another race. This may be due to the fact that humans may be coding an “average” face with “average” attributes, the characteristics of which may be different for different races, making the recognition of faces from a different race harder. Goldstein [50] gives two possible reasons for the discrepancies: psychosocial, in which the poor identification results are from the effects of prejudice, unfamiliarity with the class of stimuli, or a variety of other interpersonal reasons; and psychophysical, dealing with loss of facial detail because of different amounts of reflectance from different skin colors, or race-related differences in the variability of facial features. Using tables showing the coefficients of variation for different facial features for different races, it has been concluded that poor identification of other races is not a psychophysical problem but more likely a psychosocial one. Using the same data collected in [50], some studies have been done to quantify the role of gender in face recognition. It has been found [49] that in a Japanese population, a majority of the women’s facial features are more heterogeneous than the men’s features. It has also been found that white women’s faces are slightly more variable than men’s, but that the overall variation is small.

Image quality: In [125] the relationship between image quality and recognition of a human face has been explored. The task required of observers is to identify one face from a gallery of 35 faces. The modulation transfer function area (MTFA) was used as a metric to predict an observer’s performance in a task requiring the extraction of detailed information from both static and dynamic displays. Performance for an observer is measured by two dependent variables: proportion of correct responses and response time. It was found that as the MTFA becomes moderately large, facial recognition performance reaches a ceiling which cannot be exceeded. The MTFA metric
Fig. 4. (a) Input image, (b) edge image, (c) linked segments, and (d) segmented image.
the expansion of the given image in terms of eigenpictures serve the role of features.

In a subsequent extension of their work, Kirby and Sirovich in [82] include the inherent symmetry of faces in the eigenpicture representation of faces, by using an extended ensemble of images consisting of original faces and their mirror images. Since the computations of eigenvalues and eigenvectors can be split into even and odd pictures, there is no overall increase in computational complexity compared to the case in which only the original set of pictures is used. Although the eigenrepresentation for the extended ensemble does not produce a dramatic reduction in the error in reconstruction when compared to the unextended ensemble, still the method that accounts for symmetry in the patterns is preferable.
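As a rough illustration of the eigenpicture idea (our sketch, not code from [82]; the function names and parameter values are our own choices), the basis can be computed with an SVD of the mean-centered ensemble, optionally augmented with mirrored faces:

import numpy as np

def eigenpictures(faces, num_components=20, add_mirrors=True):
    """Compute an eigenpicture (KL) basis from an image ensemble.

    faces: array of shape (n_images, height, width). Mirroring the
    ensemble, as in Kirby and Sirovich [82], builds the left-right
    symmetry of faces into the basis.
    """
    if add_mirrors:
        faces = np.concatenate([faces, faces[:, :, ::-1]], axis=0)
    flat = faces.reshape(len(faces), -1).astype(np.float64)
    mean = flat.mean(axis=0)
    centered = flat - mean
    # Rows of vt are orthonormal eigenpictures (principal directions).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:num_components]

def project(image, mean, basis):
    """Expansion coefficients of an image in the eigenpicture basis."""
    return basis @ (image.reshape(-1).astype(np.float64) - mean)

# Toy usage with random data standing in for registered face images.
ensemble = np.random.rand(50, 64, 64)
mean, basis = eigenpictures(ensemble, num_components=10)
weights = project(ensemble[0], mean, basis)  # features for recognition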
In [11], the KL is combined with two other operations to improve the performance of the extraction technique for the classification of front-view faces. The application of the KL expansion directly to a facial image without standardization does not achieve robustness against variations in image acquisition. [11] uses standardization of the position and size of the face. The center points are the regions corresponding to the eyes and mouth. Each target image is translated, scaled and rotated through an affine transformation so that the reference points of the eyes and mouth are in a specific spatial arrangement with a constant distance. An empirically defined standard window encloses the transformed image. The KL expansion applied to the standardized face images is known as the Karhunen-Loeve transform of intensity pattern in affine-transformed target (KL-IPAT) image. The KL-IPAT was extracted from 269 images with 100 eigenfaces. The second step is to apply the Fourier transform to the standardized image and use the resulting Fourier spectrum instead of the spatial data from the standardized image. The KL expansion applied to the Fourier spectrum is called the Karhunen-Loeve transform of Fourier spectrum in the affine-transformed target (KL-FSAT) image. The robustness of the KL-IPAT and KL-FSAT was checked against geometrical variations using the standard features for 269 face images.

In [69], the image features are divided into four groups: visual features, statistical pixel features, transform coefficient features, and algebraic features, with emphasis on the algebraic features, which represent the intrinsic attributes of an image. The singular value decomposition (SVD) of a matrix is used to extract the features from the pattern. SVD can be viewed as a deterministic counterpart of the KL transform. The singular values (SV’s) of an image are very stable and represent the algebraic attributes of the image, being intrinsic but not necessarily visible. [69] proves their stability and invariance to proportional variance of image intensity in the optimal discriminant vector space,
having two regions of uniform intensity. The first is the iris region and the other is the white region of the eye. The approach constructs an “archetypal” eye and models various distributions as variations of it. For the “ideal” eye a uniform intensity for both the iris and whites is chosen. In an actual eye certain discrepancies from the ideal are found which hamper the uniform intensity choice. These discrepancies can be modeled as “noise” components added to the ideal image. For instance, the white region might have speckled (spot) points depending on scale, lighting direction, etc. Likewise the iris can have within it some “white” spots. The author uses an alpha-trimmed distribution for both the iris and the white. A “blob” detection system is developed to locate the intensity valley caused by the iris enclosed by the white. Using alpha-trimmed means and variances and a parameter set for the template of the blob, a cost functional is determined for valley detection. A deformable human eye template is constructed around the valley detection scheme. The search for candidates uses a coarse to fine approach. Minimization is achieved using the steepest descent method. After locating the candidates a goodness of fit criterion is used for verification purposes. The inputs used in the experiments were frontal face intensity images. In all, three sets of data were used. One consisted of 25 images used as a testing set, another had 107 positive eyes, and the third consisted of images with most probably erroneous locations which could be chosen as candidate templates. For locating the valleys the author reports as many as 60 false alarms for the first data set, 30 for the second and 110 for the third. An increase in hit rate is reported when using the alpha-trimmed distribution. The overall best hit rate reported was 80%.

Reisfeld and Yeshurun in [112] use a generalized symmetry operator for the purpose of finding the eyes and mouth in a face. Their motivation stems from the almost symmetric nature of the face about a vertical line through the nose. Subsequent symmetries lie within features such as the eyes, nose and mouth. The symmetry operator locates points in the image corresponding to high values of a symmetry measure discussed in detail in [112]. They indicate their procedure’s superiority over other correlation based schemes like that of Baron [14] in the sense that their scheme is independent of scale or orientation. However, since no a priori knowledge of face location is used, the search for symmetry points is computationally intensive. The authors mention a success rate of 95% on their face image database, with the constraint that the face occupy between 15-40% of the image.

Manjunath et al. [88] present a method for the extraction of pertinent feature points from a face image. It employs Gabor wavelet decomposition and local scale interaction to extract features at points of curvature maxima in the image, corresponding to orientation and local neighborhood. These feature points are then stored in a database and subsequent target face images are matched using a graph matching technique. The 2D Gabor function used and its Fourier transform are

g(x, y; u0, v0) = exp{-[x^2/(2σx^2) + y^2/(2σy^2)] + 2πi(u0 x + v0 y)}  (2)

G(u, v) = exp{-2π^2 [σx^2 (u - u0)^2 + σy^2 (v - v0)^2]}  (3)

where σx and σy represent the spatial widths of the Gaussian and (u0, v0) is the frequency of the complex sinusoid. The Gabor functions form a complete though nonorthogonal basis set. Like the Fourier series, a function g(x, y) can easily be expanded using the Gabor function. Consider the following wavelet representation of the Gabor function:

Ψλ(x, y, θ) = exp{-λ^2 (x'^2 + y'^2) + iπx'}  (4)
x' = x cos θ + y sin θ  (5)
y' = -x sin θ + y cos θ  (6)

where θ is the preferred spatial orientation and λ is the aspect ratio of the Gaussian. For convenience the subscripts are dropped in further discussions. In the experiments, λ is set to 1, and θ is discretized into four orientations. The resulting family of wavelets is given by

{Ψ[a^j (x - x0), a^j (y - y0), θk]}, (x0, y0) ∈ R^2, j = 0, -1, -2, ...  (7)

where θk = kπ/N, N = 4, k = {0, 1, 2, 3}, and a^j, j ∈ Z.
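A small sketch of this filter family (ours, assuming only NumPy; the grid size and normalization are arbitrary) evaluates the wavelet of (4)-(6) for the four orientations θk:

import numpy as np

def gabor_wavelet(size, scale_j, theta, lam=1.0, a=2.0):
    """Sample the Gabor wavelet of (4)-(6) on a size x size grid.

    scale_j <= 0 selects the dilation a**j of (7); theta is the
    preferred orientation; lam is the aspect parameter (set to 1
    in the experiments of [88]).
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    s = a ** scale_j
    x, y = s * x, s * y
    xp = x * np.cos(theta) + y * np.sin(theta)    # eq. (5)
    yp = -x * np.sin(theta) + y * np.cos(theta)   # eq. (6)
    return np.exp(-lam**2 * (xp**2 + yp**2) + 1j * np.pi * xp)  # eq. (4)

# Four orientations theta_k = k*pi/4, k = 0..3, as in the experiments.
bank = [gabor_wavelet(32, scale_j=0, theta=k * np.pi / 4) for k in range(4)]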
Feature detection utilizes a simple mechanism to model the behavior of the end-inhibition. It uses interscale interaction to group the responses of cells from different frequency channels. This results in the generation of the end-stop regions. The orientation parameter θ determines the direction of the edges. Hypercomplex cells in animals are sensitive to oriented lines and step edges of short lengths, and their response decreases if the lengths are increased. In the interscale interaction (8), f represents the input image, g is a sigmoid nonlinearity, γ is a normalizing factor, and n > m. The final step is to actually localize these features, and this is done by looking at the local maximum of these feature responses. A feature point is selected by taking the maxima in a local neighborhood Nxy of the pixel location (x, y), as expressed in (9). The general idea is to use (9) to determine responses at two scales. These scales act as the hypercomplex cells in animals.
To determine a high spatial curvature point the response from a larger sized cell is subtracted from the smaller sized cell using (8). A smaller cell will have a higher response for a sharper curvature. This is determined to be a feature point in the image.

Some experimental results for this feature extraction method are shown in Fig. 6. Notice that the background of the image is uniform; this type of image can be seen as representative of passport, driver’s license or any identification-type photographs where control over background is easily enforced.

[33] describes a knowledge-based vision system for detection of human faces from hand drawn sketches. The system employs IF-THEN rules to process its tasks, e.g., “IF: upper mouth line is not found but lower mouth line is found, THEN: look for the upper mouth line in the image area directly above the lower mouth.” The template for the face consists of the eyes (both left and right), the nose and the mouth. The processing is done on four different abstraction levels of image information: Line Segment, Component Part, Component, and Face. The line segments are selected as candidates of component parts with probability values associated with them. A component will try to see if a particular area in the image has the necessary component parts (in correct orientations relative to each other) and determine the existence of the component. The Face level will try to determine which geometric layout of the components is best suited to describe a face from the image data. The structure of the system is based on a blackboard architecture; all the tasks have access to (and can write on) the blackboard. The author reports successful detection of the face using this method with two experiments. The modularity of the system makes it possible to expand it by adding other knowledge sources such as eyebrows, ears, forehead, etc. The usage of sketched images can be extended to the edge map of an intensity image with some processing to get labeled segments, as is done in [123].

C. Recognition

1) Earlier Approaches: One of the earliest works in computer recognition of faces is reported by Bledsoe [18]. In this system, a human operator located the feature points on the face and entered their positions into the computer. Given a set of feature point distances of an unknown person, nearest neighbor or other classification rules were used for identifying the label of the test image.
Since feature extraction is manually done, this system could accommodate wide variations in head rotation, tilt, image quality, and contrast.

A landmark work on face recognition is reported in the doctoral dissertation of M. D. Kelly [81]. Kelly’s work is similar in framework to that of Bledsoe, but is significantly different in that it does not involve any human intervention. Although we cite this work in connection with face recognition, Kelly’s dissertation has made several important contributions to goal directed (also known as top-down) and multiresolution image analysis.

Kelly uses the body and close up head images for recognition. Once the body and head have been outlined as described in Section IV-A, ten measurements are extracted. The body measurements include heights, widths of the head, neck, shoulders, and hips. Measurements from the face include width of the head and distances between eyes, top of head to eyes, between eyes and nose and the distance from eyes to mouth. The nearest neighbor rule was used for identifying the class label of the test image; the leave-one-out [45] strategy was used. The dataset consisted of a total of 72 images, comprised of 24 sets of three images of ten persons. Each set had three images per person: an image of the body, an image of the background corresponding to the body image, and a close-up of the head.

In [80], Kaya et al. report a basic study using information theoretic arguments in classifying human faces. They reason from the fact that to represent N different faces a total of log2 N bits are required (an upper bound on the entropy). They contend that since illumination and background are the same for all face images and the images taken are photographs of front views of human faces, with mouth closed, no beards, and no eyeglasses, the dimensionality of the parameter space can be reduced from the above upper bound. Sixty two photographs were taken with a special apparatus to ensure correct orientation and lighting conditions. An experiment was conducted using 100 human subjects to identify prominent geometric features from three different faces. The authors identify nine of these parameters to run statistical experiments on. These parameters form a parameter vector composed of internal biocular breadth, external biocular breadth, nose breadth, mouth breadth, bizygomatic breadth, bigonial breadth, distance between lower lip and chin, distance between upper lip and nose, and height of lips. They construct a classifier based on the parameter vector and its estimate, i.e., if X is the parameter vector then the estimate Y is given as Y = X + D, where D is the distortion vector. The distortion vector D has two components: Dm, the distortion due to data acquisition and sampling error, and Df, due to inherent variations in facial features. The authors discuss two cases, one in which Dm is negligible and the other where Dm is comparable to Df. For each parameter a threshold is determined from its statistical behavior. Classification is done using the absolute norm between a stored parameter set and the input image parameter values. It should be noted that the parameter values are determined manually. The authors then set a bound on the probability of finding a correct match, using some arbitrary constants, to be about 90% from 15 000 images. However, this is just an extrapolation of the results that they obtained from the sixty two images that were tested and not a result of actual experiments.

One method of characterizing the face is the use of geometrical parameterization, i.e., distances and angles between points such as eye corners, mouth extremities, nostrils, and chin top [78]. The data set used by Kanade consists of 17 male and three female faces without glasses, mustaches, or beards. Two pictures were taken of each individual, with the second picture being taken one month later in a different setting. The face-feature points are located in two stages. The coarse-grain stage simplified the succeeding differential operation and feature-finding algorithms. Once the eyes, nose and mouth are approximately located, more accurate information is extracted by confining the processing to four smaller regions, scanning at higher resolution, and using the “best beam intensity” for the region. The four regions are the left and right eye, nose, and mouth. The beam intensity is based on the local area histogram obtained in the coarse-grain stage. A set of 16 facial parameters which are ratios of distances, areas, and angles to compensate for the varying size of the pictures is extracted. To eliminate scale and dimension differences the components of the resulting vector are normalized. The entire data set of 40 images is processed and one picture of each individual is used in the training set. The remaining 20 pictures are used as a test set. A simple distance measure is used to check for similarity between an image of the test set and the image in the reference set. Matching accuracies range from 45% to 75% correct, depending on the parameters used. Better results are obtained when several of the ineffective parameters are not used [78].
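The recognition step shared by these geometric-feature methods reduces to nearest neighbor search over normalized feature vectors; the sketch below (ours, with made-up data) pairs it with the leave-one-out protocol [45] used by Kelly:

import numpy as np

def nearest_neighbor_label(test_vec, train_vecs, train_labels):
    """Return the label of the closest training vector (Euclidean)."""
    dists = np.linalg.norm(train_vecs - test_vec, axis=1)
    return train_labels[int(np.argmin(dists))]

def leave_one_out_accuracy(vectors, labels):
    """Leave-one-out evaluation [45]: each sample is classified
    against all of the others."""
    hits = 0
    for i in range(len(vectors)):
        rest = np.delete(vectors, i, axis=0)
        rest_labels = np.delete(labels, i)
        hits += nearest_neighbor_label(vectors[i], rest, rest_labels) == labels[i]
    return hits / len(vectors)

# Toy data: rows are normalized geometric features (distance ratios,
# angles) of the kind used by Kelly [81] and Kanade [78].
features = np.random.rand(40, 16)
identities = np.repeat(np.arange(20), 2)  # two pictures per person
print(leave_one_out_accuracy(features, identities))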
2) Statistical Approach: Turk and Pentland [133] used eigenpictures (also known as “eigenfaces”; see Fig. 7 in [133]) for face detection and identification. Given the eigenfaces, every face in the database can be represented as a vector of weights; the weights are obtained by projecting the image onto the eigenface components by a simple inner product operation. When a new test image whose identification is required is given, the new image is also represented by its vector of weights. The identification of the test image is done by locating the image in the database whose weights are the closest (in Euclidean distance) to the weights of the test image. By using the observation that the projections of a face image and a nonface image are quite different, a method for detecting the presence of a face in a given image is obtained. Turk and Pentland illustrate their method using a large database of 2500 face images of 16 subjects, digitized at all combinations of three head orientations, three head sizes and three lighting conditions. Several experiments were conducted to test the robustness of the approach to variations in lighting, size, head orientation, and the differences between the training and test conditions. The authors reported 96% correct classification over lighting variations, 85% over orientation variations and 64% over size variations.
It can be seen that the approach is fairly robust to changes in lighting conditions, but degrades quickly as the scale changes. One can explain this by the significant correlation present between images with changes in illumination conditions; the correlation between face images at different scales is rather low. Another way to interpret this is that the approach based on eigenfaces will work well as long as the test image is “similar” to the ensemble of images used in the calculation of eigenfaces. Turk and Pentland also extend their approach to real time recognition of a moving face image in a video sequence. A spatiotemporal filtering step followed by a nonlinear operation is used to identify a moving person. The head portion is then identified using a simple set of rules and handed over to the face recognition module.
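Given a mean and basis like those from the eigenpicture sketch above, identification and detection can be outlined as follows (our sketch; the thresholds and names are illustrative, not values from [133]):

import numpy as np

def identify(test_weights, gallery_weights, gallery_ids, max_dist=10.0):
    """Nearest gallery face in eigenface weight space, as in [133]."""
    dists = np.linalg.norm(gallery_weights - test_weights, axis=1)
    best = int(np.argmin(dists))
    return gallery_ids[best] if dists[best] < max_dist else None

def looks_like_a_face(image, mean, basis, max_residual=5.0):
    """Face detection via distance from 'face space': a face image is
    reconstructed well by the eigenfaces, a nonface image is not."""
    x = image.reshape(-1).astype(np.float64) - mean
    w = basis @ x
    residual = np.linalg.norm(x - basis.T @ w)
    return residual < max_residual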
In [104], Pentland et al. extend the capabilities of their earlier system [133] in several directions. They report extensive tests based on 7562 images of approximately 3000 people, the largest database on which any face recognition study has been reported to date. Twenty eigenvectors were computed using a randomly selected subset of 128 images. In addition to the eigenrepresentation, annotated information on sex, race, approximate age and facial expression was included. Unlike mug shots applications, where only one front and one side view of a person’s face is kept, in this database several persons have many images with different expressions, head wear, etc.

One of the applications the authors consider is interactive search through the database. When the system is asked to present face images of certain types of people (e.g., white females of age 30 years or younger), images that satisfy this query are presented in groups of 21. When the user chooses one of these images, the system presents faces from the database that look similar to the chosen face in the order of decreasing similarity. In a test involving 200 selected images, about 95% recognition accuracy was obtained, i.e., for 180 images the most similar face was of the same person. To evaluate the recognition accuracy as a function of race, images of white, black and Asian adult males were tested. For white and black males accuracies of 90% and 95% were reported, respectively, while only 80% accuracy was obtained for Asian males. The use of eigenfaces for personnel verification is also illustrated.

In mug shots applications, usually a frontal and a side view of a person are available. In some other applications, more than two views may be available.
One can take two approaches to handling images from multiple views. The first approach will pool all the images and construct a set of eigenfaces that represent all the images from all the views. The other approach is to use separate eigenspaces for different views, so that the collection of images taken from each view will have its own eigenspace. The second approach, known as the view-based eigenspace, seems to perform better. For mug shots applications, since two or at most three views are needed, the view-based approach produces two or three sets of eigenspaces.

The concept of eigenfaces can be extended to eigenfeatures, such as eigeneyes, eigenmouth, etc. Just as eigenfaces were used to detect the presence of a face in [133], eigenfeatures are used for the detection of features such as eyes, mouth, etc. Detection rates of 94%, 80%, and 56% are reported for the eyes, nose and mouth, respectively, on the large dataset with 7562 images.

Using a limited set of images (45 persons, two views per person, corresponding to different facial expressions such as neutral versus smiling), recognition experiments as a function of the number of eigenvectors for eigenfaces only and for the combined representation were performed. The eigenfeatures performed as well as eigenfaces; for lower order spaces, the eigenfeatures fared better; when the combined set was used, marginal improvement was obtained. As summarized in Section III, both holistic and feature-based mechanisms are employed by humans. The feature based mechanisms may be useful when gross variations are present in the input image; the authors’ experiments support this.

The effectiveness of standardized KL coefficients such as KL-IPAT and KL-FSAT has been illustrated in [11] using two experiments. In the first experiment, the training and testing samples were acquired under as similar conditions as possible. The test set consisted of five samples from 20 individuals. The KL-IPAT had an accuracy rate of 85% and the KL-FSAT had an accuracy rate of 91%. Both methods misidentified the one example where there is a difference in the wearing and not wearing of glasses between the testing set and the training set. The second experiment checks for feature robustness when there is a variation caused by an error in the positioning of the target window. This is an error usually made during image acquisition due to changing conditions. The test images are created by shifting the reference points in various directions by one pixel. The variances for 4 and 8 pixels are tested. The KL-IPAT had an error rate of 24% for the 4 pixel difference and 81% for the 8 pixel difference. The KL-FSAT had a 4% error rate for the 4 pixel difference and a 44% error rate for the 8 pixel difference. The improvement is due to the shift invariance property in the Fourier spectrum domain.
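The shift invariance that KL-FSAT exploits is simply that the magnitude of the Fourier spectrum does not change under (circular) translation; a quick numerical check (ours, NumPy only):

import numpy as np

# The Fourier magnitude of an image is unchanged by circular shifts,
# which is why KL-FSAT tolerates window-positioning errors better
# than KL-IPAT operating on raw intensities.
image = np.random.rand(64, 64)
shifted = np.roll(image, shift=(4, 8), axis=(0, 1))

mag = np.abs(np.fft.fft2(image))
mag_shifted = np.abs(np.fft.fft2(shifted))

print(np.allclose(mag, mag_shifted))  # True: spectra match
print(np.allclose(image, shifted))    # False: intensities do not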
The third experiment used variations in head positioning. The test samples were taken while the subject was nodding and shaking his head. The KL-FSAT showed high robustness over the KL-IPAT for the different orientations of the head. Good recognition performance was achieved by restricting the image acquisition parameters. Both the KL-IPAT and KL-FSAT have difficulties when the head orientation is varied [11].

The effectiveness of SVD for face recognition has been tested in [32], [69]. The optimal discriminant plane and quadratic classifier of the normal pattern is constructed for the 45 SV feature vector samples. The classifier is able to recognize the 45 training samples of the nine subjects. Testing was done using 13 photos which consisted of nine newly sampled photos of the original test subjects with two of one subject and three samples of the subject at different ages. There was a 42.67% error rate which Hong feels was due to the statistical limitations of the small number of training samples [69].

In [32] the SV vector is compressed into a low dimensional space by means of various transforms, the most popular being an optimal discriminant transform based on Fisher’s criterion. The Fisher optimal discriminant vector represents the projection of the set of samples on a direction φ, chosen so that the patterns have a minimal scatter within each class and a maximal scatter between classes in the 1D space. Three SV feature vectors are extracted from the training set in [32]. The optimal discriminant transform compresses the high-dimensional SV feature space to a new r-dimensional feature space. The new secondary features are algebraically independent and informational redundancy is reduced. This approach was tested on 64 facial images of eight people (the classes). The images were represented by Goshtasby’s shape matrices, which are invariant to translation, rotation, and scaling of the facial images and are obtained by polar quantization of the shape [54]. Three photographs from each class were used to provide a training set of 24 SV feature vectors. The SV feature vectors were treated with the optimal discriminant transform to obtain new feature vectors for the 24 training samples. The class center vectors were obtained using the second feature vectors. The experiment used six optimal discriminant vectors. The separability of training set samples was good, with 100% recognition. The remaining 40 facial images were used as the test set, five from each person. Changes were made in the camera position relative to the face, the camera’s focus, the camera’s aperture setting, the wearing or not wearing of glasses, and blurring. As with the training set, the SV feature vectors were extracted, and the optimal discriminant transform was applied to obtain the transformed feature vector. Again good separability was obtained with an accuracy rate of 100% [32].

Cheng et al. [31] develop an algebraic method for face recognition using SVD, thresholding the eigenvalues thus obtained and retaining only those greater than a set threshold value. They use a projective analysis with the training set of images serving as the projection space. A training set in their experiments consists of three instances of face images of the same person. If A ∈ R^{m×n} represents the image, and A_j^{(i)} represents the jth face image of person i, then the average image for person i is given by (1/N) Σ_j A_j^{(i)}. Eigenvalues and eigenvectors are determined for this average image using SVD. The eigenvalues are thresholded
to disregard the values close to zero. Average eigenvectors
(called feature vectors) for all the average face images are
calculated. A test image is then projected onto the space
spanned by the eigenvectors. The Frobenius norm is used as
a criterion to determine which person the test image belongs
to. The authors reported 100% accuracy when working with
a database of 64 face images of eight different persons.
Each person contributed eight images. Three images from
each person were used to determine the feature vector
for the face image in question. Eight such feature vectors
were determined. They state that the projective distance of
the testing image sample was markedly minimum for the
correct training set image.
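A compact reading of this procedure (our interpretation; the averaging, thresholding, and Frobenius-norm criterion follow the description above, while the exact projection used in [31] may differ) might look like:

import numpy as np

def person_feature_space(train_images, sv_floor=1e-3):
    """Average a person's training images, take the SVD, and keep the
    singular vectors whose singular values exceed the threshold."""
    avg = np.mean(np.stack(train_images, axis=0), axis=0)
    u, s, vt = np.linalg.svd(avg, full_matrices=False)
    keep = s > sv_floor          # disregard values close to zero
    return u[:, keep], vt[keep]

def projection_distance(image, u, vt):
    """Frobenius-norm residual after projecting the test image onto
    the space spanned by a person's retained eigenvectors."""
    proj = u @ (u.T @ image @ vt.T) @ vt
    return np.linalg.norm(image - proj, ord="fro")

def classify(image, spaces):
    """spaces: dict mapping person id -> (u, vt); pick the person
    whose space gives the markedly smallest projective distance."""
    return min(spaces, key=lambda pid: projection_distance(image, *spaces[pid]))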
The use of isodensity lines, i.e., curves of constant gray level, for face recognition has been investigated in [98]. Such lines, although they are not directly related to the 3D structure of a face, do provide a relief image of the face. Using images of faces taken with a black background, a Sobel operator and some post-processing steps are used to obtain the boundary of the face region. The gray level histogram (an 8-bin histogram) is then used to trace contour lines on isodensity levels. A template matching procedure is used for face recognition. The method has been illustrated using ten pairs of face images, with three pairs of pictures of men with spectacles, two pairs of pictures of men with thin beards, and two pairs of pictures of women. 100% recognition accuracy was reported on this small data set.

3) Neural Networks Approach: The use of neural networks (NN) in face recognition has addressed several problems: gender classification, face recognition, and classification of facial expressions. One of the earliest demonstrations of NN for face recall applications is reported in Kohonen’s associative map [84]. Using a small set of face images, accurate recall was reported even when the input image is very noisy or when portions of the images are missing. This capability was demonstrated using optical hardware by Psaltis’s group [6].

A single layer adaptive NN (one for each person in the database) for face recognition, expression analysis and face verification is reported in [128]. Named Wilkie, Aleksander, and Stonham’s recognition device (WISARD), the system typically needs 200-400 presentations for training each classifier; the training patterns included translation and variation in facial expressions. Sixteen classifiers were used for the dataset constructed using 16 persons. Classification is achieved by determining the classifier that gives the highest response for the given input image. Extensions to face verification and expression analysis are presented. The sample size is too small to draw any conclusions on the viability of this approach for large datasets involving a large number of persons.

In [51], Golomb, Lawrence, and Sejnowski present a cascade of two neural networks for gender classification. The first stage is an image compression NN whose hidden nodes serve as inputs to the second NN that performs gender classification. Both networks are fully connected, three-layer networks with two biases and are trained by a standard back-propagation algorithm. The images used for testing and training were acquired such that facial hair, jewelry and makeup were not present. They were then preprocessed so that the eyes are level and the eyes and mouth are positioned similarly. A 30 x 30 cropped block of pixels was extracted for training and testing. The dataset consisted of 45 males and 45 females; 80 were used for training, with 10 serving as testing examples. The compression network indirectly serves as a feature extractor, in that the activities of the 40 hidden nodes (in a 900 x 40 x 900 network) serve as features for the second network, which performs gender classification. The hope is that due to the nonlinearities in the network, the feature extraction step may be more efficient than the linear KL methods. The gender classification network is a 40 x n x 1 network, where the number n of hidden nodes has been 2, 5, 10, 20, or 40. Experiments with 80 training images and 10 testing images have shown the feasibility of this approach. This method has also been extended to classifying facial expressions into eight types.
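The two-stage cascade can be sketched as follows (our outline in NumPy; the layer shapes follow the 900 x 40 x 900 and 40 x n x 1 description above, while the weights are untrained placeholders and the back-propagation loop is omitted):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, w, b):
    """One fully connected layer with a bias unit."""
    return sigmoid(w @ x + b)

rng = np.random.default_rng(0)

# Stage 1: 900-40-900 compression net; after back-propagation training
# to reproduce its input, the 40 hidden activities act as features.
w_enc, b_enc = rng.normal(size=(40, 900)) * 0.01, np.zeros(40)
w_dec, b_dec = rng.normal(size=(900, 40)) * 0.01, np.zeros(900)

# Stage 2: 40-n-1 gender net (n was 2, 5, 10, 20, or 40 in [51]).
n = 10
w_h, b_h = rng.normal(size=(n, 40)) * 0.01, np.zeros(n)
w_o, b_o = rng.normal(size=(1, n)) * 0.01, np.zeros(1)

face = rng.random(900)                  # a 30 x 30 block, flattened
features = layer(face, w_enc, b_enc)    # compression-net hidden nodes
hidden = layer(features, w_h, b_h)
gender_score = layer(hidden, w_o, b_o)  # > 0.5 for one class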
Fig. 8. Radius vectors and other feature points [22].

Using a vector of 16 numerical attributes (Fig. 8) such as eyebrow thickness, widths of nose and mouth, six chin radii, etc., Brunelli and Poggio [21] also develop a NN approach for gender classification. They train two HyperBF networks [109], one for each gender. The input images are normalized with respect to scale and rotation by using the positions of the eyes, which are detected automatically. The 16D feature vector is also automatically extracted. The outputs of the two HyperBF networks are compared, the gender label for the test image being decided by the network with the greater output. In the actual classification experiments only a subset of the 16D feature vector is used. The database consists of 21 males and 21 females. The leave-one-out strategy [45] was employed for classification. When the feature vector from the training set was used as the test vector, 92.5% correct recognition accuracy was reported; for faces not in the training set, the accuracy dropped to 87.5%. Some validation of the automatic classification results has been reported using humans.

By using an expanded 35D feature vector, and one HyperBF per person, the gender classification approach has been extended to face recognition. The motivation for
networks. DLA’s use synaptic plasticity and are able to instantly form sets of neurons grouped into structured graphs and maintain the advantages of neural systems. A DLA permits pattern discrimination with the help of an object-independent standard set of feature detectors, automatic generalization over large groups of symmetry operations, and the acquisition of new objects by one-shot learning, reducing the time-consuming learning steps. Invariant object recognition is achieved with respect to background, translation, distortion and size by choosing a set of primitive features which is maximally robust with respect to such variations. Both [24] and [85] use Gabor based wavelets for the features. The wavelets are used as feature detectors, characterized by their frequency, position, and orientation. Two nonlinear transforms are used to help during the matching process. A minimum of two levels, the image domain and the model domain, are needed for a DLA. The image domain corresponds to primary visual cortical areas and the model domain to the inferotemporal cortex in biological vision. The image domain consists of a 2D array of nodes A = {(x, α), where α = 1, ..., F}. Each node at position x consists of F different feature detector neurons (x, α) that provide local descriptors of the image. The label α is used to distinguish different feature types. The amount of feature type excitation is determined for a given node by convolving the image with a subset of the wavelet functions for that location. Neighboring nodes are connected by links, encoding information about the local topology. Images are represented as attributed graphs. Attributes attached to the graph’s nodes are activity vectors of local feature detectors. An object in the image is represented by a subgraph of the image domain. The model domain is an assemblage of all the attributed graphs, being idealized copies of subgraphs in the image domain. Excitatory connections are between the two domains and are feature preserving. The connection between domains occurs if and only if the features belong to corresponding feature types. The DLA machinery is based on a data format which is able to encode information on attributes and links in the image domain and to transport that information to the model domain without sending the image domain position. The structure of the signal is determined by three factors: the input image, random spontaneous excitation of the neurons, and interaction with the cells of the same or neighboring nodes in the image domain. Binding between neurons is encoded in the form of temporal correlations and is induced by the excitatory connections within the image. Four types of bindings are relevant to object recognition and representation: binding together all the nodes and cells that belong to the same object, expressing neighborhood relationships within the image of the object, bundling individual feature cells between features present in different locations, and binding corresponding points in the image graph and model graph to each other. DLA’s basic mechanism, in addition to the connection parameter between two neurons, is a dynamic variable J between two neurons (i, j). J-variables play the role of synaptic weights for signal transmission. The connection parameters merely act to constrain the J-variables. The connection parameters may be changed slowly by long-term synaptic plasticity. The connection weights Jij are subject to a process of rapid modification. Jij weights are controlled by the signal correlations between neurons i and j. Negative signal correlations lead to a decrease and positive signal correlations lead to an increase in Jij. In the absence of any correlation, Jij slowly returns to a resting state.
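Schematically (our toy rendering, not the actual update rule from [24] or [85]), the fast J-dynamics could be written as:

import numpy as np

def update_links(j, correlations, j_rest, rate=0.1, decay=0.01):
    """Toy fast dynamics for dynamic link weights J_ij.

    Positive signal correlation between neurons i and j strengthens
    the link, negative correlation weakens it, and with no correlation
    the weight relaxes toward its resting value.
    """
    j = j + rate * correlations       # grow/shrink with correlation
    j = j + decay * (j_rest - j)      # slow return to resting state
    return np.clip(j, 0.0, 1.0)       # keep weights bounded

# 5 x 5 link matrix between a model graph and an image graph.
links = np.full((5, 5), 0.2)
corr = np.random.uniform(-1, 1, size=(5, 5))
links = update_links(links, corr, j_rest=0.2)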
Rapid network self-organization is crucial to the DLA. Each stored image is formed by picking a rectangular grid of points as graph nodes. A locally determined jet for each of these nodes is stored and used as the pattern class. New image recognition takes place by transforming the image into the grid of jets, and all stored model graphs are tentatively matched to the image. Conformation of the DLA is done by establishing and dynamically modifying links between vertices in the model domain. During the recognition process an object is selected from the model domain. A copy of the model graph is positioned in a central position in the image domain. Each vertex in the model graph is connected to the corresponding vertex in the image graph. The match quality is evaluated using a cost function. The image graph is scaled by a factor while keeping the center fixed. If the total cost is reduced the new value is accepted. This is repeated until the optimum cost is reached. The diffusion and size estimation are repeated for increasing resolution levels and more of the image structure is taken into account. Recognition takes place after the optimal total cost is determined for each object. The object with the best match to the image is determined. Identification is a process of elastic graph matching. In the case of faces, if one face model matches significantly better than all
The feature points are represented by nodes Vi, where i = {1, 2, 3, ...} is a consistent numbering. The information about a feature point is contained in {Si, qi}, where Si represents the spatial location and qi is the feature vector corresponding to the ith feature point. The vector qi is a set of spatial and angular distances from feature point i to its N nearest neighbors, denoted by Qi(x, y, θj), where j is the jth of the N neighbors. Ni represents the set of neighbors which are of consequence for the feature point in question. The neighbors satisfying both the maximum number N and minimum Euclidean distance dij between two points Vi and Vj are said to be of consequence for the ith feature point.
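A literal reading of this representation (ours; the source defines qi by an equation, and the distance/angle packing below is our guess at it), together with the centroid alignment of step 1 below:

import numpy as np

def feature_vector(points, i, n_neighbors=4):
    """Build q_i: spatial and angular distances from feature point i
    to its N nearest neighbors (our guess at the packing)."""
    p = points[i]
    d = np.linalg.norm(points - p, axis=1)
    order = np.argsort(d)[1:n_neighbors + 1]   # skip the point itself
    dists = d[order]
    angles = np.arctan2(points[order, 1] - p[1], points[order, 0] - p[0])
    return np.concatenate([dists, angles])

def align_centroids(input_pts, stored_pts):
    """Step 1 of the matching: translate so both centroids coincide."""
    return input_pts - input_pts.mean(axis=0) + stored_pts.mean(axis=0)

pts = np.random.rand(10, 2)     # detected feature point locations
q3 = feature_vector(pts, 3)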
To identify an input graph with a stored one which is different, either in total number of feature points or in the location of the respective faces, we proceed in a stepwise manner. If i, j refer to nodes in the input graph I and i', j' refer to nodes in the stored graph O, then the two graphs are matched as follows:

1) The centroids of the feature points of I and O are aligned.

The recognized face is the one that has the minimum of the combined cost value. An accuracy of 94% is reported. The method shows a dependency on the illumination direction and works on controlled background images like passport and driver’s license pictures. Fig. 10 shows a set of input and identified images for this method.

Seibert and Waxman [116] have proposed a system for recognizing faces from their parts using a neural network. The system is similar to a modular system they have developed for recognizing 3D objects [117] by combining 2D views from different vantage points; in the case of faces, arrangement of features such as eyes and nose plays the role of the 2D views. The processing steps involved are segmentation of a face region using interframe change detection techniques, extraction of features such as eyes, mouth, etc., using symmetry detection, grouping and log-polar mapping of the features and their attributes such as centroids, encoding of feature arrangements, clustering of feature vectors into view categories using ART 2, and integration of accumulated evidence using an aspect network.

In a subsequent paper Seibert and Waxman [118] exploit the role of caricatures and distinctiveness (summarized in Section III of the report) in human face recognition to
points (Fig. 12). Recognition involves the determination of relationships among these fiducial points.

Fig. 12. The nine fiducial points of interest for face recognition using profile images (similar to figure in [61]).

Kaufman and Breeding [79] developed a face recognition system using profile silhouettes. The image acquired by a black-and-white TV camera is thresholded to produce a binary image, with black corresponding to the face region. A preprocessing step then extracts the front portion of the silhouette that bounds the face image; this is done to exclude variations in the profile due to changes in hairline. A set of normalized autocorrelations expressed in polar coordinates is used as a feature vector; the normalization and polar representation steps ensure invariance to translation and rotation. A distance-weighted k-nearest neighbor rule is used for classification. Experiments were performed on a total of 120 profiles of ten persons, half of which were used for training. A set of 12 autocorrelation features was used as a feature vector. Three sets of experiments were done: in the first two, 60 randomly chosen training samples were used, while in the third experiment 90 samples were used in the training set. Experiments with varying dimensionality of the training samples are also reported. The best performance (90% accuracy) was achieved when 90 samples were stored in the training set and the dimensionality of the training feature vector was four. Comparisons with features derived from moment invariants [38] show that the circular autocorrelations performed better.

Harmon and Hunt [61] presented a semi-automatic recognition system for profile-posed face recognition by treating the problem as a "waveform" matching problem. The profile photos of 256 males were manually reduced to outline curves by an artist. From these curves, a set of nine fiducial marks (see Fig. 12), such as nose tip, chin, forehead, bridge, nose bottom, throat, upper lip, mouth, and lower lip, were automatically identified. The details of how each of these fiducial marks was identified are given in [61]. From these fiducial marks, a set of six feature characteristics was derived: protrusion of nose, area right of base line, base angle of profile triangle, wiggle, and distances and angles between fiducials. A total of eleven numerical features was extracted from the characteristics mentioned above. After aligning the profiles by using two selected fiducial marks, a Euclidean distance measure was used for measuring the similarity of the feature vectors derived from the outline profiles. A ranking of the most similar faces was obtained by ordering the Euclidean norms. In subsequent work, Harmon et al. [63] added images of female subjects and experimented with the same feature vector. By noting that the values of the features of a face do not change very much in different images, and that faces corresponding to feature vectors with a large Euclidean distance between them will be different, a partitioning step is included to improve computational efficiency.

Reference [63] used the feature extraction methods developed in [61] to create 11 feature vector components. The 11 features were reduced to 10, because nose protrusion is highly correlated with two other features. The 10D feature vector was found to provide a high rate of recognition. Classification was done based on both Euclidean distances and set partitioning. Set partitioning was used to reduce the number of candidates for inclusion in the Euclidean distance measure, and thus increase performance and diminish computation time.

Reference [62] is a continuation of the research done in [61] and [63]. The aim is a basic understanding of how to achieve automatic identification of human face profiles, to develop robust and economical procedures for use in real-time systems, and to provide the technological framework for further research. The work defines 17 fiducial points which appear to be the best combination for face recognition. The method uses the minimum Euclidean distance between the unknown and the reference file to determine the correct identification of a profile, and uses thresholding windows for population reduction during the search of the reference file. The thresholding window size is based on the average vector obtained from multiple samples of an individual's profile. In [62], the profiles are obtained from high-contrast photography, from which transparencies are made, scanned, and digitized. The test set consists of profiles of the same individuals taken at a different setting. The resulting 96% rate of correctness occurs both with and without population reduction [62].
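The distance ranking and window-based population reduction just described can be summarized in a short sketch. This is our own illustration under assumed array layouts, not code from [61]-[63]:

    import numpy as np

    def rank_profiles(query, gallery):
        # Rank stored profile feature vectors by Euclidean distance to
        # the query; smaller distance means a more similar profile.
        d = np.linalg.norm(gallery - query, axis=1)
        order = np.argsort(d)
        return order, d[order]

    def threshold_window(query, gallery, window):
        # Population reduction: keep only candidates whose every feature
        # falls inside a per-feature window around the query; the window
        # would be derived from multiple samples of each individual.
        keep = np.all(np.abs(gallery - query) <= window, axis=1)
        return np.flatnonzero(keep)

    # Typical use: reduce the population first, then rank the survivors.
    # gallery: (num_profiles, 10) feature array; query: (10,) vector.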
Wu and Huang [142] also report a profile-based recognition system, using an approach similar to that of Harmon and his group [61] but significantly different in detail. First of all, the profile outlines are obtained automatically. B-splines are used to extract six interest points: the nose peak, nose bottom, mouth point, chin point, forehead point, and eye point. A feature vector of dimension 24 is constructed by computing distances between two neighboring points, lengths and angles of the curvature segments joining two adjacent points, etc. Recognition is done by comparing the feature vector extracted from the test image with stored
flow fields, motion and structure parameters are computed under the assumptions that the field of view (FOV) is narrow, that the images of moving objects are small with respect to the FOV, and that the optical flow field is computed from monocular, noisy imagery. Thompson and Pong [131] present algorithms for the detection of moving objects from a moving platform. Under various assumptions about the camera motion (the complete camera motion is known, only the rotation or translation is known, etc.), several versions of motion detection algorithms are presented, with examples drawn from indoor scenes.
Analysis of optical flow for detecting motion boundaries, and subsequently for motion detection, requires the availability of accurate estimates of optical flow. But to obtain these accurate estimates, we need to account for or model the motion discontinuities in the flow field due to the presence of moving objects. Simultaneous computation of optical flow and modeling of discontinuities has been addressed by several research groups [67], [73], [94]. The central theme of this approach is to model the discontinuities using the "line processes" of Geman and Geman [46], pose the computation of optical flow in a Bayesian framework, and derive iterative techniques from the application of an optimization procedure such as simulated annealing [46], maximum posterior marginal [90], or iterated conditional modes [16]. Implementations of these algorithms using analog VLSI hardware are addressed in [73], [83]. A recent paper [83] presents a multiscale approach, with supporting physiological theory, for the computation of optical flow. Other significant papers that deal with segmentation and motion detection from optical flow and normal flow may be found in [26], [36], [102], [115], [126], [135], [141], [148].
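A toy one-dimensional version of this idea, ours and for illustration only (none of the cited formulations reduce to exactly this), alternates between updating the flow estimate and a binary line process by iterated conditional modes:

    import numpy as np

    def icm_flow_1d(data, lam=1.0, alpha=0.5, iters=50):
        # Jointly estimate a smooth 1D "flow" u and a binary line process
        # by iterated conditional modes on the energy
        #   sum_i (u_i - data_i)^2
        #     + lam * sum_i (1 - line_i) * (u_{i+1} - u_i)^2
        #     + alpha * sum_i line_i,
        # where line_i = 1 marks a discontinuity that disables smoothing.
        u = np.asarray(data, dtype=float).copy()
        n = len(u)
        line = np.zeros(n - 1)
        for _ in range(iters):
            # Turn a line element on where smoothing costs more than alpha.
            line = (lam * np.diff(u) ** 2 > alpha).astype(float)
            # Each u_i is set to the minimizer of its local quadratic energy.
            for i in range(n):
                num, den = data[i], 1.0
                if i > 0 and line[i - 1] == 0:
                    num += lam * u[i - 1]; den += lam
                if i < n - 1 and line[i] == 0:
                    num += lam * u[i + 1]; den += lam
                u[i] = num / den
        return u, line

The returned line process marks the recovered motion boundaries; in the cited work the same interplay is carried out over 2D flow fields with stochastic or deterministic relaxation.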
Two algorithms that use models of human motion for the purpose of segmentation are described in [101], [121]. An algorithm for the detection of moving persons from normal flows is also described in [99].

VII. APPLICATIONS

Current applications of FRT include both commercial and law enforcement agencies. Although we have not been able to find many publications detailing commercial applications, a brief description of possible application areas is nevertheless given. For the case of law enforcement agencies, the emphasis has been on face identification, witness recall, storage and retrieval of mug shots, and user
Second, since many early OCR studies were done on isolated characters or characters with moderate segmentation problems, the need for robust generalization was underestimated. This caused too much effort to be expended on systems that did not address the types of recognition problems that arise in real applications. Based on this experience, we feel that initial studies using realistic images from some specific commercial/law enforcement application should be carried out. An initial image set based on mug shots seems appropriate, since these images span a wide range of the possible applications shown in Table 1, will have realistic image segmentation problems, and have realistic image quality parameters. This should provide commercial/law enforcement agencies, such as credit card companies or the FBI, with a more realistic estimate of the utility of FRT than studies done on idealized datasets or datasets which are unrelated to specific applications.

3) Speed and Hardware Requirements: We recommend that, where possible, all algorithms under test be evaluated on several types of parallel computer hardware as well as on standard engineering workstations. In high-volume applications, speed will be an important factor in evaluating applicability. In many potential applications parallel computers may be too costly, but the development of effective high-speed methods on parallel computers should allow special-purpose hardware to be developed to reduce costs.

4) Human Interface: The utility of face recognition systems will be strongly affected by the type of human interface that is used in conjunction with this technology. The human factors which will affect this interface are dealt with in Section III. The literature on human perception and recognition of faces will be important in designing human interfaces which allow users to make efficient use of the results of machine-based face recognition.

B. Evaluation Methods

1) Database Size and Uniformity: For law enforcement applications, as an initial evaluation sample, a collection of a minimum of 5000 and a maximum of 50 000 mug shot images may be appropriate. A testing sample containing 500 to 5000 different mug shots of faces in the original training set, and 500 to 5000 different mug shots of faces not in the original training set, should be collected to allow testing of machine face matching. Similar samples for commercial applications are also suggested.¹ The minimum sample sizes for the test sets are based on the need to obtain accurate matching and false matching statistics. The 10:1 ratio of the evaluation set size to the testing set size is designed to minimize false match statistics due to random matches and to provide statistical accuracy in probability-of-match versus candidate-list-size statistics.

¹NIST has recently made available a mug shot identification database containing a total of 3248 images. For details the reader may contact: [email protected].

2) Sample Size Issues - Feasibility of Resampling: We suggest that images be collected at relatively high resolution, 512 x 512 pixels, using 8 b of gray or intensity. If color images are used, matching will initially use only intensity data. With this level of image resolution, down-sampling of the images and digital filtering to provide lower resolution and image quality can be done with a single set of master images. Images can also be cropped after segmentation to provide more usable image areas containing the face image. Resampling the image to provide a greater area of background and less active image area may also be possible, but may introduce artifacts that change the difficulty of the segmentation problem.

3) Test Methods for Algorithm Accuracy and Probability of Match: The scoring of face matching systems should be based on the probability of matching a candidate face in the first n faces on a candidate list. Two sets of probabilities of this type can be defined, one for faces in the database and one for faces not in the database. The first will generate true positives and the second will generate false positives. The comparison of true and false recognition probabilities assumes that each recognition produces a confidence number which is higher for faces with greater similarity. For each specified level of confidence, the number of faces matching true and false faces can be generated. The simplest accuracy measure of each type of recognition is the cumulative probability of a match for various values of n and, at the same confidence level, the probability of a false match. It seems likely that, in addition to the raw cumulative probability curve, some simple models of the shape of the curve, such as a linear model, may be of interest in comparing different algorithms. In many applications it will be as important that the face recognition system avoid false positives as that it produce good true-positive candidate lists.

Many of the face recognition systems discussed in this paper reduce the face to a set of features and measure the similarity of faces by comparing the distance between faces in this feature space. For all of the test faces, the distance between each test face and all other faces in the database is calculated. The probability, over the entire test sample, and the average confidence of the first n near neighbors are then calculated. A similar calculation is made using faces not in the database, and the average confidence of the first n candidates is evaluated. At each confidence level for these faces, a probability of finding a false match can be calculated as the ratio of false candidates to true candidates at comparable confidence. If the recognition process is to be successful, the probability of detection of a face in the database should always exceed the probability of false detection of a face not included in the database.
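The scoring just described can be phrased compactly. The sketch below is our own, with assumed array shapes, and uses the similarity measure 1/(1 + kd) discussed in the next subsection; it computes the cumulative probability of a match within the first n candidates:

    import numpy as np

    def similarity(d, k=1.0):
        # Similarity from feature-space distance: 1.0 for identical
        # features, approaching 0.0 as the distance grows without bound.
        return 1.0 / (1.0 + k * d)

    def cumulative_match(distances, probe_ids, gallery_ids, n_max=20):
        # distances: (num_probes, num_gallery) feature-space distances.
        # Returns P(correct face within first n candidates), n = 1..n_max.
        ranks = np.argsort(distances, axis=1)  # candidate list per probe
        hits = np.zeros(n_max)
        for p, true_id in enumerate(probe_ids):
            candidates = np.asarray(gallery_ids)[ranks[p]]
            where = np.nonzero(candidates == true_id)[0]
            if where.size and where[0] < n_max:
                hits[where[0]:] += 1  # a hit at rank r counts for all n > r
        return hits / len(probe_ids)

Running the same calculation with probes that have no match in the gallery gives the false-positive counterpart at each candidate-list size.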
4) Similarity Measures: The example calculation discussed above requires that the recognition system produce a measure of confidence of recognition and of similarity to other possible recognitions. Similarity differs from confidence in that similarity is measured between any two points in the feature space of the database, while confidence is a measure between a test image and a trial match. In the example, a reasonable measure might be 1/(1 + kd_ij). Using this measure, the similarity of two faces is 1.0 if their features are identical, and approaches 0.0 as the features are displaced to infinity. This type of similarity measure
[92] —, Motion Understanding: Robot and Human Vision. Boston, MA: Kluwer, 1988.
[93] D. Metaxas and D. Terzopoulos, "Recursive estimation of shape and nonrigid motion," in Proc. IEEE Workshop on Visual Motion, 1991, pp. 296-311.
[94] D. W. Murray and B. F. Buxton, "Scene segmentation from visual motion using global optimization," IEEE Trans. Patt. Anal. and Mach. Intell., vol. 9, pp. 220-228, 1987.
[95] H. H. Nagel, "Analysis techniques for image sequences," in Proc. Int. Conf. on Patt. Recog., 1978, pp. 186-211.
[96] —, "Overview on image sequence analysis," in Image Sequence Analysis, T. S. Huang, Ed. New York: Springer-Verlag, 1981, pp. 19-228.
[97] —, "Image sequences - ten (octal) years - from phenomenology toward a theoretical foundation," in Proc. Int. Conf. on Patt. Recog., 1986, pp. 1174-1185.
[98] O. Nakamura, S. Mathur, and T. Minami, "Identification of human faces based on isodensity maps," Patt. Recog., vol. 24, pp. 263-272, 1991.
[99] R. Nelson, "Qualitative detection of motion by a moving observer," in Proc. DARPA Image Understanding Workshop, 1990, pp. 329-338.
[100] M. Nixon, "Eye spacing measurement for facial recognition," in SPIE Proc., 1985, vol. 575, pp. 279-285.
[101] J. O'Rourke and N. L. Badler, "Model-based image analysis of human motion using constraint propagation," IEEE Trans. Patt. Anal. and Mach. Intell., vol. 2, pp. 522-536, 1980.
[102] S. Peleg and H. Rom, "Motion based segmentation," in Proc. Int. Conf. on Patt. Recog., 1990, pp. 109-113.
[103] A. Pentland and B. Horowitz, "Recovery of nonrigid motion and structure," IEEE Trans. Patt. Anal. and Mach. Intell., vol. 13, pp. 730-742, 1991.
[104] A. Pentland, B. Moghaddam, T. Starner, and M. Turk, "View-based and modular eigenspaces for face recognition," in Proc. IEEE Computer Soc. Conf. on Computer Vision and Patt. Recog., 1994, pp. 84-91.
[105] D. Perkins, "A definition of caricature and recognition," Studies in the Anthropology of Visual Commun., vol. 2, pp. 1-24, 1975.
[106] D. I. Perrett, A. J. Mistlin, and A. J. Chitty, "Visual neurons responsive to faces," Trends in Neuroscience, vol. 10, pp. 358-363, 1987.
[107] D. I. Perrett, A. J. Mistlin, A. J. Chitty, P. A. Smith, D. D. Potter, R. Broennimann, and M. H. Harries, "Specialized face processing and hemispheric asymmetry in man and monkey: Evidence from single unit and reaction time studies," Behav. Brain Res., vol. 29, pp. 245-258, 1988.
[108] D. I. Perrett, P. A. Smith, D. D. Potter, A. J. Mistlin, A. S. Head, A. D. Milner, and M. A. Jeeves, "Visual cells in temporal cortex sensitive to face view and gaze direction," in Proc. Royal Soc. of London, Series B, 1985, vol. 223, pp. 293-317.
[109] T. Poggio and F. Girosi, "Networks for approximation and learning," Proc. IEEE, vol. 78, pp. 1481-1497, 1990.
[110] T. Poggio and V. Torre, "Ill-posed problems and regularization analysis in early vision," MIT AI Lab, Tech. Rep. AI Memo 773, 1984.
[111] A. Rahardja, A. Sowmya, and W. Wilson, "A neural network approach to component versus holistic recognition of facial expressions in images," in SPIE Proc.: Intell. Robots and Computer Vision X: Algorithms and Techn., vol. 1607, 1991, pp. 62-70.
[112] D. Reisfeld and Y. Yeshurun, "Robust detection of facial features by generalized symmetry," in Proc. 11th Int. Conf. on Patt. Recog., 1992, pp. 117-120.
[113] T. Sakai, M. Nagao, and S. Fujibayashi, "Line extraction and pattern recognition in a photograph," Patt. Recog., vol. 1, pp. 233-248, 1969.
[114] A. Samal and P. Iyengar, "Automatic recognition and analysis of human faces and facial expressions: A survey," Patt. Recog., vol. 25, pp. 65-77, 1992.
[115] B. G. Schunck, "Image flow segmentation and estimation by constraint line clustering," IEEE Trans. Patt. Anal. and Mach. Intell., vol. 11, pp. 1010-1027, 1989.
[116] M. Seibert and A. Waxman, "Recognizing faces from their parts," in SPIE Proc.: Sensor Fusion IV: Control Paradigms and Data Structures, vol. 1611, 1991, pp. 129-140.
[117] —, "Combining evidence from multiple views of 3-D objects," in SPIE Proc.: Sensor Fusion IV: Control Paradigms and Data Structures, vol. 1611, 1991.
[118] —, "An approach to face recognition using saliency maps and caricatures," in Proc. World Cong. on Neural Networks, 1993, pp. 661-664.
[119] J. Sergent, "Microgenesis of face perception," in Aspects of Face Processing, H. D. Ellis, M. A. Jeeves, F. Newcombe, and A. Young, Eds. Dordrecht: Nijhoff, 1986.
[120] J. W. Shepherd, "An interactive computer system for retrieving faces," in Aspects of Face Processing, H. D. Ellis, M. A. Jeeves, F. Newcombe, and A. Young, Eds. Dordrecht: Nijhoff, 1985, pp. 398-409.
[121] A. Shio and J. Sklansky, "Segmentation of people in motion," in Proc. IEEE Workshop on Visual Motion, 1991, pp. 325-332.
[122] A. Singh, "An estimation-theoretic framework for image flow analysis," in Proc. Int. Conf. on Computer Vision, 1990, pp. 167-177.
[123] S. A. Sirohey, "Human face segmentation and identification," Tech. Rep. CAR-TR-695, Center for Autom. Res., Univ. of Maryland, College Park, MD, 1993.
[124] L. Sirovich and M. Kirby, "Low-dimensional procedure for the characterization of human faces," J. Opt. Soc. Amer., vol. 4, pp. 519-524, 1987.
[125] H. L. Snyder, "Image quality and face recognition on a television display," Human Factors, vol. 16, pp. 300-307, 1974.
[126] A. Spoerri and S. Ullman, "The early detection of motion boundaries," in Proc. Int. Conf. on Computer Vision, 1987, pp. 209-218.
[127] R. B. Starkey and I. Aleksander, "Facial recognition for police purposes using computer graphics and neural networks," in Proc. Colloquium on Electron. Images and Image Proc. in Security and Forensic Science, dig. no. 087, 1990, pp. 2/1-2/2.
[128] T. J. Stonham, "Practical face recognition and verification with WISARD," in Aspects of Face Processing, H. D. Ellis, M. A. Jeeves, F. Newcombe, and A. Young, Eds. Dordrecht: Nijhoff, 1984, pp. 426-441.
[129] M. Subbarao, "Interpretation of image flow: A spatio-temporal approach," IEEE Trans. Patt. Anal. and Mach. Intell., vol. 11, pp. 266-278, 1989.
[130] D. Terzopoulos, A. Witkin, and M. Kass, "Constraints on deformable models: Recovering 3D shape and nonrigid motion," Artif. Intell., vol. 36, pp. 91-123, 1988.
[131] W. B. Thompson and T. C. Pong, "Detecting moving objects," in Proc. 1st Int. Conf. on Computer Vision, 1987, pp. 201-208.
[132] A. N. Tikhonov and V. Y. Arsenin, Solutions of Ill-Posed Problems. Washington, DC: Winston and Wiley, 1977.
[133] M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," in Proc. Int. Conf. on Patt. Recog., 1991, pp. 586-591.
[134] S. Ullman, The Interpretation of Visual Motion. Cambridge, MA: MIT Press, 1979.
[135] J. Y. A. Wang and E. H. Adelson, "Layered representation for motion analysis," in Proc. IEEE Computer Soc. Conf. on Computer Vision and Patt. Recog., 1993, pp. 361-366.
[136] A. M. Waxman, B. Kamgar-Parsi, and M. Subbarao, "Closed-form solutions to image flow equations for 3-D structure and motion," Int. J. Computer Vision, vol. 1, pp. 239-258, 1987.
[137] H. Wechsler, Computational Vision. Boston, MA: Academic, 1990.
[138] J. Weng, T. S. Huang, and N. Ahuja, Motion and Structure from Image Sequences. New York: Springer-Verlag, 1993.
[139] —, "Learning recognition and segmentation of 3D objects from 2D images," in Proc. IEEE Int. Conf. on Computer Vision, 1993, pp. 121-128.
[140] P. A. Wintz, "Transform picture coding," Proc. IEEE, vol. 60, pp. 809-820, 1972.
[141] K. Wohn and A. M. Waxman, "The analytic structure of image flows: Deformation and segmentation," Computer Vision, Graphics and Image Process., vol. 49, pp. 127-151, 1990.
[142] C. Wu and J. Huang, "Human face profile recognition by computer," Patt. Recog., vol. 23, pp. 255-259, 1990.
[143] Y. Yacoob and L. S. Davis, "Computing spatio-temporal representations of human faces," in Proc. IEEE Computer Soc. Conf. on Computer Vision and Patt. Recog., 1994, pp. 70-75.
[144] G. Yang and T. S. Huang, "Human face detection in a scene," in Proc. IEEE Conf. on Computer Vision and Patt. Recog., 1993, pp. 453-458.
[145] A. Yuille, D. Cohen, and P. Hallinan, "Feature extraction from faces using deformable templates," in Proc. IEEE Computer Soc. Conf. on Computer Vision and Patt. Recog., 1989, pp. 104-109.
Rama Chellappa (Fellow, IEEE) is a Professor in the Department of Electrical Engineering at the University of Maryland, where he is also affiliated with the Institute for Advanced Computer Studies, the Center for Automation Research, and the Computer Science Department. He is an Editor of Collected Papers on Digital Image Processing (IEEE Computer Society Press, 1992). He coauthored Artificial Neural Networks for Computer Vision (Springer-Verlag, 1992) and coedited Markov Random Fields: Theory and Applications (Academic Press, 1993). He was an Associate Editor for IEEE Transactions on Acoustics, Speech, and Signal Processing and IEEE Transactions on Neural Networks. He is presently Coeditor-in-Chief of Computer Vision, Graphics, and Image Processing: Graphical Models and Image Processing and IEEE Transactions on Image Processing. He has authored 20 book chapters and over 150 peer-reviewed journal and conference papers. He was the General Chairman of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition and of the IEEE Computer Society Workshop on Artificial Intelligence for Computer Vision (1989). He was the Program Chairman of the IEEE Signal Processing Workshop on Neural Networks for Signal Processing, and is the Program Chairman for the 2nd International Conference on Image Processing.
Dr. Chellappa received the 1985 National Science Foundation Presidential Young Investigator Award and the 1985 IBM Faculty Development Award. In 1990 he received the Excellence in Teaching Award from the School of Engineering at the University of Southern California. He is a corecipient of four NASA certificates for his work on synthetic aperture radar image segmentation.

Charles L. Wilson (Senior Member, IEEE) has been with the National Institute of Standards and Technology, Gaithersburg, MD, for the past 15 years. He is presently Manager of the Visual Image Processing Group of the Advanced Systems Division. He was previously with Los Alamos National Laboratory and AT&T Bell Laboratories. His current research interests are in the application of statistical pattern recognition, neural network methods, and dynamic training methods to image recognition and image compression, and in standards used to evaluate recognition systems.
Dr. Wilson received a DOC Gold Medal in 1983 for his work in semiconductor device simulation.

Saad A. Sirohey (Member, IEEE) received the B.Sc. degree (highest honors) in electrical engineering from King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, and the M.S. degree in electrical engineering from the University of Maryland at College Park, in 1990 and 1993, respectively. He is working toward the Ph.D. degree in electrical engineering at the University of Maryland at College Park.
He is a Research Assistant at the Center for Automation Research at the University of Maryland. His current research interests include signal/image processing and computer vision, specifically automated face recognition.