0% found this document useful (0 votes)
65 views36 pages

Human and Machine Recognition of Faces: A Survey: Rama Charles L. Saad Sirohey, Wilson

This document summarizes research on human and machine recognition of faces. It discusses applications of face recognition in commercial and law enforcement sectors. It then provides an overview of over 20 years of research done in both the psychophysics community to understand human face perception and the engineering community on developing techniques for machine face recognition. Finally, it addresses issues regarding evaluating and benchmarking face recognition algorithms and systems.

Uploaded by

Enrique García
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
65 views36 pages

Human and Machine Recognition of Faces: A Survey: Rama Charles L. Saad Sirohey, Wilson

This document summarizes research on human and machine recognition of faces. It discusses applications of face recognition in commercial and law enforcement sectors. It then provides an overview of over 20 years of research done in both the psychophysics community to understand human face perception and the engineering community on developing techniques for machine face recognition. Finally, it addresses issues regarding evaluating and benchmarking face recognition algorithms and systems.

Uploaded by

Enrique García
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 36

Human and Machine Recognition

of Faces: A Survey
RAMA CHELLAPPA, FELLOW, IEEE, CHARLES L. WILSON, SENIOR MEMBER, IEEE,
ANDSAAD SIROHEY, MEMBER, IEEE

The goal of this paper is to present a critical survey of existing I. INTRODUCTION


literature on human and machine recognition of faces. Machine
recognition of faces has several applications, ranging from static Machine recognition of faces from still and video images
matching of controlled photographs as in mug shots matching is emerging as an active research area spanning several
and credit card verijication to surveillance video images. Such disciplines such as image processing, pattern recognition,
applications have different constraints in terms of complexity computer vision and neural networks. In addition, face
of processing requirements and thus present a wide range of
different technical challenges. Over the last 20 years researchers in recognition technology (FRT) has numerous commercial
psychophysics, neural sciences and engineering, image processing, and law enforcement applications. These applications range
analysis and computer vision have investigated a number of issues from static matching of controlled format photographs such
related to face recognition by humans and machines. Ongoing as passports, credit cards, photo ID’S, driver’s licenses, and
research activities have been given a renewed emphasis over the
last five years. Existing techniques and systems have been tested mug shots to real-time matching of surveillance video im-
on diferent sets of images of varying complexities. But very little ages presenting different constraints in terms of processing
synergism exists between studies in psychophysics and the engi- requirements. Although humans seem to recognize faces in
neering literature. Most importantly, there exist no evaluation or cluttered scenes with relative ease, machine recognition is a
benchmarking studies using large databases with the image quality much more daunting task. In this paper we address critical
that arises in commercial and law enforcement applications.
In this paper, we first present different applications of face issues involved in understanding how humans perceive
recognition in commercial and law enforcement sectors. This is faces and follow it with a detailed discussion of several
followed by a brief overview of the literature on face recognition in techniques and systems that have been considered in the
the psychophysics community. We then present a detailed overview engineering literature for nearly 25 years. Critical issues
of more than 20 years of research done in the engineering com-
such as data collection and performance evaluation are also
munity. Techniques for segmentatiodlocation of the face, feature
extraction and recognition are reviewed. Global transform and addressed.
feature based methods using statistical, structural and neural A general statement of the problem can be formulated
classifiers are summarized. A brief summary of recognition using as follows: Given still or video images of a scene, identify
face projiles and range image data is also given. Real-time face one or more persons in the scene using a stored database
recognition from video images acquired in a cluttered scene such
as an airport is probably the most challenging problem. We discuss
of faces. Available collateral information such as race, age
several existing technologies in the image understanding literature and gender may be used in narrowing the search. The
that could potentially impact this problem. solution of the problem involves segmentation of faces from
Given the numerous theories and techniques that are applicable cluttered scenes, extraction of features from the face region,
to face recognition, it is clear that evaluation and benchmarking identification, and matching. The generic face recognition
of these algorithms is crucial. We discuss relevant issues such as
data collection, performance metrics, and evaluation of systems task thus posed is a central issue in problems such as
and techniques. Finally, summary and conclusions are given. electronic line up and browsing through a database of
faces.
Over the past 20 years extensive research has been
Manuscript received July I , 1994; revised February 14, 1995. Chellappa
and Sirohey’s work was supported by the Advanced Research Projects conducted by psychophysicists, neuroscientists and engi-
Agency under Order Number 8459 and the US Army Topographic neers on various aspects of face recognition by humans
Engineering Center under contract DACA76-92-0009. Wilson’s work was and machines. Psychophysicists and neuroscientists have
supported by the FBI through an agreement with NIST.
R. Chellappa and S. Sirohey are with the Department of Electrical been concerned with issues such as: Uniqueness of faces;
Engineering and Center for Automation Research, University of Maryland, whether face recognition is done holistically or by local
College Park, MD 20742 USA. feature analysis; analysis and use of facial expressions
C. L. Wilson is with the National Institute of Standards and Technology,
Gaithersburg, MD 20899 USA. for recognition; how infants perceive faces; organization
IEEE Log Number 9410528. of memory for faces; inability to accurately recognize

00 18-92 19/95$04.00 0 1995 IEEE

PROCEEDINGS OF THE IEEE, VOL 83, NO 5 , MAY 1995 705

~~
__
inverted faces; existence of a “grandmother” neuron for using motion as a cue. Only a handful of papers on face
face recognition; role of the right hemisphere of the brain recognition [117], [121], [133] have addressed the issue of
in face perception; and inability to recognize faces due to segmenting a face image from the background. However,
conditions such as prosopagnosia. Some of the theories there is a significant amount of work reported in the
put forward to explain the observed experimental results image understanding (IU) literature [l], [2] on segmenting
are contradictory. Many of the hypotheses and theories put a moving object from the background using a sequence.
forward by researchers in these disciplines have been based Also, there is a significant amount of work on the analysis
on rather small sets of images. Nevertheless, several of the of nonrigid moving objects, including faces, in the IU
findings have important consequences for engineers who [l], 121 as well as the image compression literature [4].
design algorithms and systems for machine recognition of We briefly discuss those techniques that have potential
human faces. applications to recovery and reconstruction (in 3D) of faces
Barring a few exceptions [21], [24], [116], research on from a video sequence. The reconstructed image will be
machine recognition of faces has developed independent useful for recognition tasks when disguises and aging are
of studies in psychophysics and neurophysiology. During present.
the early and mid- 1970’s, typical pattem classification In addition to the separation of images into static and
techniques, which use measured attributes between features real-time image sequences several other parameters are
in faces or face profiles, were used. During the 1980’s, important in critically evaluating existing methods. In any
work on face recognition remained largely dormant. Since pattem recognition problem the accuracy of the solution
the early 1990’s, research interest in FRT has grown very will be strongly affected by the limitations placed on the
significantly. One can attribute this to several reasons: problem. To restrict the problem to practical proportions
An increase in emphasis on civilidcommercial research both the image input and the size of the search space
projects; the reemergence of neural network classifiers with must have some limits. The limits on the image might
emphasis on real-time computation and adaptation; the for example include controlled format, backgrounds which
availability of real time hardware; and the increasing need simplify segmentation, and controls on image quality. The
for surveillance-relatedapplications due to drug trafficking, limits on the database size might include geographic lim-
terrorist activities, etc. its and descriptor based limits. Critical issues involving
Over the last five years, increased activity has been seen data collection, evaluation and benchmarking of existing
in tackling problems such as segmentation and location algorithms and systems also need to be addressed.
of a face in a given image, and extraction of features An excellent survey of face recognition research prior to
such as eyes, mouth, etc. Also, numerous advances have 1991 is in 11141. Still we decided to prepare our survey
been made in the design of statistical and neural network paper due to the following reasons: The face recognition
classifiers for face recognition. Classical concepts such area has become very active since 1990. Approaches based
as Karhunen-Loeve transform based methods [ 111, [82], on Karhunen-Loeve expansion, neural networks and feature
[ 1041, [ 1241, [ 1331, singular value decomposition [69] and matching have all been initiated since the survey paper
more recently neural networks [21], [51], have been used. [ 1141 appeared. Also, [114] did not cover discussions on
Barring a few exceptions [104], many of the existing face recognition from a video, profile, or range imagery
approaches have been tested on relatively small datasets, nor any aspects of performance evaluation.
typically less than 100 images. The organization of the paper is as follows: In Section
In addition to recognition using full face images, tech- I1 we describe several applications of FRT in still and
niques that use only profiles constructed from a side view video images and point out the specific constraints that
are also available. These methods typically use distances each set of applications pose. Section I11 provides a brief
between the “fiducial” points in the profile (points such summary of issues that are relevant from the psychophysics
as the nose tip, etc.) as features. Modifications of Fourier point of view. In Section IV a detailed review of face
descriptors have also been used for characterizing the recognition techniques, involving still intensity and range
profiles. Profile based methods are potentially useful for images, in the engineering literature is given. Techniques
the mug shot problem, due to the availability of side views for segmentation of faces from clutter, feature extraction
of the face. and recognition are detailed. Face recognition using profile
All of the discussion thus far has focused on recognizing images (which has not been pursued with much vigor
faces from still images. The still image problem has several in recent years, but nevertheless is useful in mug shots
inherent advantages and disadvantages. For applications matching problem) is discussed in Section V. Section VI
such as mug shots matching, due to the controlled nature presents a discussion on face recognition from video images
of the image acquisition process, the segmentation problem with special emphasis on how IU techniques could be
is rather easy. On the other hand, if only a static picture useful. Some specific examples of face recognition and
of an airport scene is available, automatic location and recall work in law enforcement domains, and commercial
segmentation of a face could pose serious challenges to applications are briefly discussed in Section VII. Data
any segmentation algorithm. However, if a video sequence collection and performance evaluation of face recognition
acquired from a surveillance camera is available, segmenta- algorithms and architectures are addressed in Section VIII.
tion of a person in motion can be more easily accomplished Finally, summary and conclusions are in Section ZX.

706 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5, MAY 1995


Table 1 Applications of Face Recognition Technology
~ ~~~~

Applications Advantages Disadvantages


la. Credit Card, Driver’s License, Passport, Controlled image
and Personal Identification Controlled segmentation No existing database
Good quality images Large potential database
1b. Mug shots Matching Mixed image quality Rare search type
More than one image available
2. BanWStore Security High value Uncontrolled segmentation
Geographically localized search Low image quantity
3. Crowd Surveillance High value Uncontrolled segmentation
Small file size Low image quality
Availability of video images Real-time
4. Expert Identification High value Low image quality
Enhancement possible Legal certainty required
5. Witness Face Reconstruction Witness search limits Unknown similarity
6. Electronic Mug Shots Book Descriptor search limits Viewer fatigue
7. Electronic Lineup Descriptor search limits Viewer fatigue
8. Reconstruction of Face from Remains High value Requires physiological input
9. Computerized Aging High value Requires example input

11. APPLICATIONS A. Static Matching


Commercial and law enforcement applications of FRT Mug shots matching is the most common application
listed in Table 1 range from static, controlled format pho- in this group. Typically, in mug shots photographs, the
tographs to uncontrolled video images, posing a wide illumination is reasonably controlled, and one frontal and
range of different technical challenges and requiring an one or more side views of a person’s face are taken.
equally wide range of techniques from image processing, Although more control can be exercised in image ac-
analysis, understanding and pattern recognition. One can quisition, no uniform standards exist for use by booking
broadly classify the challenges and techniques into two stations across the country. These standards could in-
groups: static (no video) and dynamic (video) matching. volve the type of background, illumination, resolution of
Even among these groups, significant differences exist, the camera, and the distance between the camera and
depending on the specific application. The differences are the person being photographed. By enforcing such simple
in terms of image quality, amount of background clut- controls over the image acquisition process, one can po-
ter (posing challenges to segmentation algorithms), the tentially simplify segmentation and matching algorithms.
availability of a well defined matching criterion, and the Two examples of typical mug shots images are given in
nature, type and amount of input from a human (as in Fig. 1.
applications 4 and 5). In some applications, such as com- Simple versions of the mug shots matching problem
puterized aging, one is only concerned with defining a set are recognition of faces in driver’s licenses, credit cards,
of transformations so that the new images created by the personal ID cards, and passports. Typical examples of face
system are similar to what humans expect based on their images in drivers licenses or personal ID cards are shown in
recollections. Fig. 2. The images in these documents are usually acquired
Three different kinds of problems arise in applications with more control than in mug shots.
listed in Table 1; these are matching, similarity detection, Typically, images in mug shots applications are of good
and transformation. Applications 1, 2, and 3 involve match- quality, consistent with existing law enforcement standards.
ing one face image to another face image. Applications 4-7 Given the reasonably controlled imaging conditions, seg-
involve finding or creating a face image which is similar mentatiodlocation of a face is relatively easy. Potential
to the human recollection of a face. Finally, applications 8 challenges are in searching through a large dataset and also
and 9 involve generating an image of a face from input in matching; though the imaging conditions are controlled,
data that is useful in other applications by using other variations in the face due to aging, hair loss, hair growth,
information to perform modifications of a face image. etc., have to be accounted for in feature extraction and
Each of these applications imposes different requirements matching.
on the recognition process. Matching requires that the Application 2 is more complicated than application 1,
candidate matching face image be in some set of face largely due to the uncontrolled nature of the image acquisi-
images selected by the system. Similarity detection requires, tion conditions. As considerable background clutter may be
in addition to matching, that images of faces be found present, segmentation gets harder. Also, the quality of the
which are similar to a recalled face; this requires that image tends to be low. An approximate rendition of such
the similarity measure used by the recognition system an image is shown in Fig. 3. It should be pointed out that
closely match the similarity measures used by humans. application 2 falls between static and dynamic matching.
Transformation applications require that new images cre- Some of the images that arise in this application are on
ated by the system be similar to human recollections of a film while some are acquired from a video camera. As in
face. application 1, variations in face images due to aging and

CHELLAPPA er al. HUMAN AND MACHINE RECOGNITION OF FACES: A SURVEY 707


- ~~ -___ -
(C)

Fig. 1. Frontal and profile mug shots images.

disguises must be accounted for in feature extraction and closest to witness’s recollection is chosen. In application 6,
matching. In applications 1 and 2, the matching criterion electronic browsing of photo collection is attempted. Appli-
can be quantified; also, the top few choices can be rank cation 7 involves a witness identifying a face from a set of
ordered. face images which include some false candidates. Typically,
Applications 4-7 involve finding or creating a face image in these applications the image quality tends to be low; in
which is similar to the human recollection of a face. In addition to matching, it is required to find faces that are
application 4, an expert confirms that the face in the given similar to a recalled face. The similarity measure is difficult
image corresponds to a person in question. It is possible to quantify, as measures supposedly used by humans need
that the face in the image could be disguised, or occluded. to be defined. The problem is complicated further in that
Typically, in this application a list of similar looking faces when humans search through a mug shots book, they tend
is generated using a face identification algorithm, the expert to make more recognition errors as the number of mug
then performs a careful analysis of the listed faces. In shots presentations increases. It is difficult to completely
application 5 the witness is asked to compose a picture of a quantify the degradation in machine implementation of
culprit using a library of features such as noses, eyes, lips, algorithms developed for applications 4-6. Another issue
etc. For example the library may have examples of noses is the incorporation of mechanisms for recalling faces that
that are long, short, curved, flat, etc., from which one that is humans use in the algorithms. Applications 4-7 need a

PROCEEDINGS OF THE IEEE, VOL 83, NO 5, MAY 1995


___
(a) (b)
Fig. 2. Face images in a controlled background, as in passport or identificaticin documents.

(b)
Fig. 3. An approximate illustration of an uncontrolled environment for face im(ages corresponding
to application 2.

strong interaction between algorithms and known results available through a video camera tend to be of low quality.
in psychophysics and neuroscience studies. Also, in crowd surveillance applications the background is
Applications 8 and 9 involve transformations of images very cluttered, making the problem of segmenting a face
from current data to what they could have been (application in the crowd difficult. However, since a video sequence
8) or to what they will be (application 9). These are even is available, one could use motion as a strong cue for
more difficult than applications 4-6, since “smoothing” or segmenting faces of moving persons. One may also be able
“predictive” mechanisms need to be incorporated into the to do partial reconstruction of the face image using existing
algorithms. models [lo], [23], [87] and be able to account for disguises,
somewhat better than in static matching problems. One of
the strong constraints of this application is the need for
B. Dynamic Matching real-time recognition. It is expected that several of the
We group application 3, and cases of application 2 where existing methodologies in the IU literature [ 11-[5] for image
a video sequence is available, as dynamic. The images sequence based segmentation, structure estimation, nonrigid

CHELLAPPA et al.: HUMAN AND MACHINE RECOGNITlON OF FACES: A SURVEiY 709


object motion and recognition will be useful for solving the features for the purpose of identification. It should be
requirements of application 3. noted that prosopagnosia patients recognize whether
It should be remarked that the widely varying constraints the given object is a face or not, but then have difficulty
of the different applications necessitate different methods in identifying the face. C) It is argued that infants
and scores for evaluating and benchmarking existing algo- come into the world pre-wired to be attracted by faces.
rithms and systems. Neonates seem to prefer to look at moving stimuli
that have face-like patterns in comparison to those
containing no pattern or with jumbled features.
Is face perception the result of holistic or feature
111. PSYCHOPHYSICS
AND NEUROPHYSIOLOGICAL
analysis? Both holistic and feature information are
ISSUESRELEVANTTO FACERECOGNITION
crucial for the perception and recognition of faces.
In general, the human recognition system utilizes a broad Studies suggest the possibility of global descriptions
spectrum of stimuli, obtained from many, if not all, of the serving as a front end for finer, feature-based percep-
senses (visual, auditory, olfactory, tactile, etc.). These stim- tion. If dominant features are present, holistic descrip-
uli are used in either an individual or collective manner for tions may not be used. For example, in face recall
both the storing and retrieval of face images for the purpose studies, humans quickly focus on odd features such as
of recognition. There are many instances when contextual big ears, a crooked nose, a staring eye, etc.
knowledge is also applied, i.e., the surroundings play an Ranking of significance of facial features: Hair, face
important role-recognizing faces in relation to where they outline, eyes and mouth (not necessarily in this order)
are supposed to be located. It is futile (impossible with have been determined to be important for perceiving
the present technology) to attempt to develop a system and remembering faces. Several studies have shown
which will mimic all these remarkable traits of humans. that the nose plays an insignificant role; this may be
However, the human brain has its shortcomings in the due to the fact that almost all of these studies have
total number of persons that it can accurately “remember.” been done using frontal images. In face recognition
The benefit of a computer system would be its capacity using profiles (which may be important in mug shots
to handle large datasets of face images. In most of the matching applications, where profiles can be extracted
applications the images are present in single or multiple from side views), several fiducial points (“features”)
views of 2D intensity data, which forces the inputs to a are around the nose region (see Section V). Another
computer algorithm to be visual only. It is for this reason outcome of some of the studies is that both extemal
that the literature reviewed in this section is related to and internal features are important in the recogni-
aspects of human visual perception. tion of previously presented but otherwise unfamiliar
During the course of our literature survey, we have come faces, and internal features are more dominant in
across several hundred papers that address problems and the recognition of familiar faces. It has also been
issues related to human recognition of faces. Many of these found that the upper part of the face is more useful
studies and their findings have direct relevance to engineers for face recognition than the lower part. The role
interested in designing algorithms or systems for machine of aesthetic attributes such as beauty, attractiveness
recognition of faces. A detailed review of relevant studies and/or pleasantness has also been studied, with the
in psychophysics and neuroscience is beyond the scope of conclusion that the more attractive the faces are, the
this paper. We only summarize findings that are potentially better is their recognition rate; the least attractive faces
relevant to the design of face recognition systems. For come next, followed by the mid-range faces, in terms
details the reader is referred to the papers cited below and of ease of being recognized.
to citations in the supplemental bibliography. In writing Caricatures: [20] Perkins [ 1051 formally defines a
this section, we have largely benefited from books [19], “caricature as a symbol that exaggerates measurements
[37], [39] and survey papers [14], [20], [41], [65]. Also relative to any measure which varies from one person
literature on how animals such as monkeys, dogs and cats to another.” Thus the length of a nose is a measure that
recognize faces is not included in our survey. Notable works varies from person to person, and could be useful as a
on experiments with monkeys are [ 1061-[ 1081. symbol in caricaturing someone, but not the number of
The issues that are of potential interest to designers are: ears. Caricatures do not contain as much information as
Is face recognition a dedicated process? [41]: Evi- photographs, but they manage to capture the important
dence for the existence of a dedicated face processing characteristics of a face; experiments comparing the
system comes from three sources. A) Faces are more usefulness of caricatures and line drawings decidedly
easily remembered by humans than other objects when favor the former.
presented in an upright orientation. B) Prosopagnosia Distinctiveness: Studies show that distinctive faces are
patients are unable to recognize previously familiar better retained in recognition memory and are recog-
faces, but usually have no other profound agnosia. nized better and faster than typical faces. However, if
They recognize people by their voices, hair color, a decision has to be made as to whether an object is
dress, etc. Although they can perceive eyes, nose, a face or not, it takes longer to recognize an atypical
mouth, hair, etc., they are unable to put together these face than a typical face. This may be explained by

710 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5, MAY 1995

---- - ~ -~
different mechanisms being used for detection and for of isolated features and paraphernalia to one of holistic
identification. analysis. Curiously, when children as young as five
The role of spatial frequency analysis: Earlier studies years are asked to recognize familiar faces, they do
[47], [64]concluded that information in low spatial pretty well in ignoring paraphernalia.
frequency bands play a dominant role in face recog- Several other interesting studies related to how chil-
nition. Recent studies [ 1191 show that, depending on dren perceive inverted faces are summarized in [29].
the specific recognition task, the low, bandpass and Facial expression: [191 Based on neurophysiological
high frequency components may play different roles. studies, it seems that analysis of facial expressions
For example the sex judgment task is successfully is accomplished in parallel to face recognition. Some
accomplished using low frequency components only, prosopagnosic patients, who have difficulties in iden-
while the identification task requires the use of high tifying familiar faces, nevertheless seem to recognize
frequency components. The low frequency components emotional expressions. Patients who suffer from “or-
contribute to the global description, while the high ganic brain syndrome” suffer from poor expression
frequency components contribute to the finer details analysis but perform face recognition quite well. Nor-
required in the identification task. mal humans exhibit parallel capabilities for facial
The role of the brain: [40] The role of the right hemi- expression analysis and face recognition. Similarly,
sphere in face perception has been supported by several separation of face recognition and “focused visual
researchers. In regard to prosopagnosia and the right processing” (look for someone with a thick mustache)
hemisphere, a retrospective study seems to strongly tasks have been claimed.
indicate right hemisphere involvement in face recog- Role of racdgender: Humans recognize people from
nition. In other brain damaged victims, those with right their own race better than people from another race.
hemisphere disease have more impairment in facial This may be due to the fact that humans may be
recognition then left hemisphere disease. When shown coding an “average” face with “average” attributes, the
the left half of one face and the right half of another characteristic of which may be different for different
face tachistoscopically, the overwhelming majority of races, making the recognition of faces from a different
commissurotomy patients selected the face shown to race harder. Goldstein [50] gives two possible reasons
the left vision field (LVF), which arrives initially at for the discrepancies: Psychosocial, in which the poor
the right hemisphere. In other tachistoscopic studies, identification results are from the effects of prejudice,
the LVF has the advantage in both speed and accuracy unfamiliarity with the class of stimuli, or a variety
of response and in long term memory response. Studies of other interpersonal reasons; and psychophysical,
have also shown a right hemisphere advantage in dealing with loss of facial detail because of different
reception and/or storage of faces. Some other studies amounts of reflectance from different skin colors, or
argue against right hemisphere superiority in face race-related differences in the variability of facial fea-
perception. Postmortem studies of prosopagnosia vic- tures. Using tables showing the coefficientsof variation
tims with known lesions in the right hemisphere have for different facial features for different races, it has
found approximately symmetrical lesions in the left been concluded that poor identification of other races
hemisphere. Other cases of bilateral brain damage have is not a psychophysical problem but more likely a
been seen or suspected in patients with prosopagnosia. psychosocial one. Using the same data collected in
The ways in which the two hemispheres operate may [50], some studies have been done to quantify the
reflect variations in degrees of expertise. It appears that role of gender in face recognition. It has been found
the right hemisphere does possess a slight advantage [49] that in a Japanese population, a majority of the
in aspects of face processing. It is also true that the women’s facial features are more heterogeneous than
two hemispheres may simultaneously handle different the men’s features. It has also been found that white
types of information. The dominance of the right women’s faces are slightly more variable than men’s,
hemisphere in facial processing may be the result but that the overall variation is small.
of left hemisphere dominance in language. The right Image quality: In [125] the relationship between
hemisphere is also involved in the interpretation of image quality and recognition of a human face has
emotions, and this may underlie the slight asymmetry been explored. The task required of observers is to
in perceiving and remembering faces. identify one face from a gallery of 35 faces. The
Face recognition by children. 1291, [30] It appears modulation transfer function area (MTFA) was used
that children under ten years of age code unfamiliar as a metric to predict an observers performance in a
faces using isolated features. Recognition of these task requiring the extraction of detailed information
faces is done using cues derived from paraphernalia, from both static and dynamic displays. Performance
such as clothes, glasses, hair style, hats, etc. Ten- for an observer is measured by two dependent vari-
year-old children exhibit this behavior less frequently, ables-proportion of correct responses and response
while children older than 12 years rarely exhibit this time. It was found that as the MTFA becomes mod-
behavior. It is postulated that around age ten, children erately large, facial recognition performance reaches a
seem to change their recognition mechanisms from one ceiling which cannot be exceeded. The MTFA metric

CHELLAPPA et al. HUMAN AND MACHINE RECOGNITION OF FACES A SURVEY 711


- _ _ ~ __
indicates the extent to which a system’s response presented as statistical, neural and feature based. Finally a
exceeds the minimum contrast requirements, averaged summary section is included.
across all spatial frequencies of interest [ 1251.
A. Segmentation
A. Summary One of the earliest papers that reported the presence or
absence of a face in an image is [ 1131. An edge map
For engineers interested in designing algorithms and
extracted from the input image is matched to a large oval
systems for face recognition, numerous studies in psy-
template with possible variations in the position and size
chophysics and neurophysiologicalliterature serve as useful
of the template. At positions where potential matches are
guides. As an example, designers should include both global
reported, the head hypothesis is confirmed by inspecting the
and local features for representing and recognizing faces.
edges produced at expected positions of eyes, mouth, etc.
Among the features, some (hairline, eyes, mouth) are more
The technique was dependent on the illumination direction.
significant or useful than others (nose). This observation
Kelly [SI] introduced a top-down image analysis ap-
is true for frontal images of faces, while for side views
proach known as PLANNING for automatically extracting
and profiles, the nose is an important feature. Studies
the head and body outlines from an image and subsequently
on distinctiveness and caricatures can help add special
the locations of eyes, nose, mouth. As an example, the head
features of the face that can be utilized for perceiving and
extraction algorithm works as follows: Smoothed versions
recognizing faces. The role of spatial frequency analysis
of original images (obtained by local averaging) are first
suggests multiresolution/multiscale algorithms for different searched for edges that may form the outline of a head;
problems related to face perception. Issues such as how
extracted edge locations are then projected back to the
humans recognize people from their own race better than
original image, and a fine search is locally performed for
people from another race, and how infants recognize faces,
edges that form the head outline. Several heuristics are used
are very important in the design of systems for expert
to connect the edges. Once the head outline is obtained, the
identification, witness face reconstruction, electronic mug
expected locations for eyes, nose and mouth are searched
shots books and lineups. Interpreting face recognition using
for locating these features. Several heuristics are again
Marr’s computational vision paradigm may point to new
employed in the search process.
algorithms and systems; see Chapter 6 of [19]. Other issues,
The algorithm for extracting the body of a person,
such as organization of face memory, are very pertinent for
subtracts the image of the background without the person
the design of large databases such as mug shots albums.
from the image that has the person. This difference image
Usefulness of facial expressions on face recognition needs
is reduced in size by averaging and then thresholded. After
to be evaluated.
applying a connected component algorithm, the extremes of
Historically, there has been great interest among com-
the regions obtained define the region in which the body is
puter vision algorithm developers and system designers
located. Details on the feature measurements, dataset etc.,
in learning how our visual system works and in translat-
used in [Sl] are given in Section IV-C.
ing these mechanisms into real systems. Marr’s paradigm Govindaraju et al. [55] consider a computational model
for computational vision [89] is a pioneering example of
for locating the face in a cluttered image. Their technique
such an effort. Designers of face recognition algorithms
utilizes a deformable template which is slightly different
and systems should be aware of relevant psychophysics
than that of Yuille et al. [ 1451. Working on the edge image
and neurophysiological studies but should be prudent in
they base their template on the outline of the head. The
using only those that are applicable or relevant from a
template is composed of three segments that are obtained
practicaUimplementation point of view.
from the curvature discontinuities of the head outline. These
three segments form the right side-line, the left side-line and
the hairline of the head. Each one of these curves is assigned
Iv. FACERECOGNITION FROM STILL
a four-tuple consisting of the length of the curve, the chord
INTENSITY AND RANGEIMAGES
in vector form, the area enclosed between the curve and
In this section we survey the state of the art in face the chord, and the centroid of this area. To determine the
recognition in the engineering literature. We have divided presence of the head, all three of these segments should
the face recognition papers into three groups. Methods for be present in particular orientations. The center of these
segmentation of faces from a given image are discussed three segments gives the location of the center of the face.
in Section IV-A. Techniques for extraction of statistical The templates are allowed to translate, scale and rotate
features such as Karhunen-Loeve transform and singular according to certain spring-based models. They construct
value decomposition coefficients, structural features like a cost function to determine hypothesized candidates. They
eyes, nose, lips, and points of high curvature are summa- have experimented on about ten images, and though they
rized in Section IV-B. Most papers included in Sections claim to have never failed to miss a face, they do get false
IV-A and IV-B do not report any recognition or matching alarms.
experiments. Recognition and identification papers that use Craw et al. in [34] describe a method for extracting the
features described in Section IV-B or other features are head area from the image. They use a hierarchical image
surveyed in Section IV-C. The recognition techniques are scale and a template scale. Constraints are imposed on the

112 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5, MAY 1995


- ~ ~~~
location of the head in the image. Resolutions of 8 x 8, 16 edge segments at intersection points. The human face is
x 16, 32 x 32,64 x 64 and full scale 128 x 128 are used approximated using the ellipse as the analytical tool. Pairs
in their multiresolution scheme. At the lowest resolution a of labeled edge segments Li, Lj are fitted to a linearized
template is constructed of the head outline. Edge magnitude equation of the ellipse (1). This linearization is possible
and direction are calculated from the gray level image using under the condition that the semi-major axis a andor the
a Sobel mask. A line follower is used to connect the outline semi-minor axis b of the ellipse are not 0, which is true for
of the head. After the head outline has been located a search all cases considered.
for lower level features such as eyes, eyebrows, and lips is
conducted, guided by the location of the head outline, using
a similar line following method. The algorithm for detecting
the head outline, performed better than the one searching where
for the eyes.
Another method of finding the face in an image was
defined by Burt [25]. It utilized a coarse to fine approach
a2 = -yo, u 3 = . ; + - - ua2
. 2
with a template based match criteria to locate the head. Burt b2 b2
illustrates the usefulness of such techniques by describing a
“smart transmission” system. This system could locate and The resulting parameter set ZO, yo, a, b is checked against
track the human head and then send the information of the the aspect ratio of the face, and if it is satisfied, is included
head location to an object based compression algorithm. in the class of parameter sets for final selection. The
In [35] Craw, Tock, and Bennet describe a system to parameter sets in the class of parameters are reverse fitted
recognize and measure facial features. Their work was mo- with the labeled segments. The parameter set with the most
tivated in part by automated indexing of police mug shots. segments (compensated for size) is selected to represent
They endeavor to locate 40 feature points from a grey- the segmented face. Fig. 4 shows the segmentation process
scale image; these feature points were chosen according going through its different phases from the input image to
to Shepherd [120], which was also used as a criterion its edge representation, then the final grouping of the likely
of judgment. The system uses a hierarchical coarse-to- edge segments corresponding to the outline of the face,
fine search. The template drew upon the principle of and finally the output image without background clutter.
polygonal random transformation in Grenander et al. [56]. An accuracy of above 80% was reported when the process
The approximate location, scale and orientation of the head was applied to a data set of 48 cluttered images. Fig. 5
is obtained by iterative deformation of the whole template shows some of the results of the segmentation algorithm.
by random scaling, translation and rotation. A feasibility The image size was 128 x 128 pixels.
constraint is imposed so that these transformations do not The presence or absence of a face using the eigenfaces
lead to results that have no resemblance to the human expansion is reported in [133]. Details on eigenfaces are in
head. Optimization is achieved by simulated annealing [46]. Sections IV-B and IV-C.
After a rough idea of the location of the head is obtained,
refinement is done by transforming individual vectors of the
polygon. The authors claim successful segmentation of the B. Feature Extraction
head in all 50 images that were tested. In 43 of these images Recently, the use of the Karhunen-Loeve (KL) expansion
a complete outline of the head was distinguishable; in the for the representation [82], [124] and recognition [104],
remaining ones there was failure in finding the chin. The [133] of faces has generated renewed interest. The KL
detailed template of the face included eyes, nose, mouth, expansion has been studied for image compression for more
etc.; in all, 1462 possible feature points were searched for. than 30 years [74], [140]; its use in pattern recognition
The authors claim to be able to identify 1292 of these applications has also been documented for quite some
feature points. The only missing feature was the eyebrow, as time [45]. One of the reasons why KL methods, although
they did not have a feature expert for that. They attribute the optimal, did not find favor with image compression re-
6% incorrect identification to be due to presence of beards searchers is their computational complexity. As a result, fast
and mustaches in their database, which caused mistakes in transforms such as the discrete sine and cosine transform
locating the chin and the mouth of the subject. It should have been used [74]. In [ 1241, Sirovich and Kirby revisit the
be noted that due to its use of optimization and random problem of KL representation of images (cropped faces).
transformation, the system is inherently computationally Once the eigenvectors (referred to as “eigenpictures”) are
intensive. obtained, any image in the ensemble can be approximately
In [123] the face is segmented from a moderately clut- reconstructed using a weighted combination of eigenpic-
tered background using an approach that involves working tures. By using an increasing number of eigenpictures, one
with both the intensity image of the face as well as the gets an improved approximation to the given image. The
edge image found using the Canny’s edge finder [281. authors also give examples of approximating an arbitrary
Preprocessing tasks include locating the intersection points image (not included in the calculation of eigenvectors) by
of edges (occlusion of objects), assigning labels to con- the eigenpictures. The emphasis in this paper is on the rep-
tiguous edge segments and linking of most likely similar resentation of human faces. The weights that characterize

CHELLAPPA et al. HUMAN AND MACHINE RECOGNITION OF FACES A SURVEY 713


- __ __
(C)

Fig. 4. (a) Input image, (b) edge image, (c) linked segments, and (d) segmented image.

the expansion of the given image in terms of eigenpictures distance. An empirically defined standard window encloses
serve the role of features. the transformed image. The KL expansion applied to the
In a subsequent extension of their work, Kirby and standardized face images is known as the Karhunen-Loeve
Sirovich in [82] include the inherent symmetry of faces transform of intensity pattern in affine-transformed target
in the eigenpicture representation of faces, by using an (KL-IPAT) image. The KL-IPAT was extracted from 269
extended ensemble of images consisting of original faces images with 100 eigenfaces. The second step is to apply
and their mirror images. Since the computations of eigen- the Fourier Transform to the standardized image and use
values and eigenvectors can be split into even and odd the resulting Fourier spectrum instead of the spatial data
pictures, there is no overall increase in computational from the standardized image. The KL expansion applied
complexity compared to the case in which only the original to the Fourier spectrum is called the Karhunen-Loeve
set of pictures in used. Although the eigenrepresentation transform of Fourier spectrum in the affine-transformed
for the extended ensemble does not produce dramatic target (KL-FSAT) image. The robustness of the KL-IPAT
reduction in the error in reconstruction when compared to and KL-FSAT was checked against geometrical variations
the unextended ensemble, still the method that accounts for using the standard features for 269 face images.
symmetry in the patterns is preferable. In [69], the image features are divided into four groups:
In [ll], the KL is combined with two other operations visual features, statistical pixel features, transform coeffi-
to improve the performance of the extraction technique cient features, and algebraic features, with emphasis on the
for the classification of front-view faces. The application algebraic features, which represent the intrinsic attributes
of the KL expansion directly to a facial image without of an image. The singular value decomposition (SVD) of
standardization does not achieve robustness against vari- a matrix is used to extract the features from the pattern.
ations in image acquisition. [ 111 uses standardization of SVD can be viewed as a deterministic counterpart of the
the position and size of the face. The center points are KL transform. The singular values (SV’s) of an image
the regions corresponding to the eyes and mouth. Each are very stable and represent the algebraic attributes of
target image is translated, scaled and rotated through affine the image, being intrinsic but not necessarily visible. [69]
transformation so that the reference points of the eyes and proves their stability and invariance to proportional variance
mouth are in a specific spatial arrangement with a constant of image intensity in the optimal discriminant vector space,

714 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5, MAY 1995


-- ~ __
of the eye's sclera. The analytic shape representing the
iris is a circle with expected gradient directions in each
quadrant, given the lighter background of the sclera. An
ellipse appears to be the most suitable shape approximating
the perimeter of the sclera, but it is unsatisfactory for
those parts of the eye furthest from the center of the
face. The ellipse is tailored for each eye's face center by
using an exponential function. The gradient magnitudes,
obtained using a Sobel operator, are thresholded using four
brightness levels to represent the direction of the gradient
at that point. The directional information is incorporated
into the Hough transform technique. The deviation of the
position of the iris center from the estimated value has a
mean value of 0.33 pixels. The application of the Hough
transform to detect the perimeter of the shape of the region
below the eyebrows appears on average to yield a spacing
20% larger than the spacing between the irises. Using the
Hough transform to find the sclera shows that the spacing
differed on average by minus 1.33 pixels. The results show
that it is possible to derive a measurement of the spacing
by detecting of the position of both the irises, and the
shape describing both the perimeter of the sclera and the
eyebrows. The measurement by detection of the position of
the iris is most accurate. Detection of the perimeter of the
sclera is the most sensitive of the methods.
Yuille, Cohen, and Hallinen in [145] extract facial fea-
tures using deformable templates. These templates are
allowed to translate, rotate and deform to fit the best repre-
sentation of their shape present in the image. Preprocessing
is done to the initial intensity image to get representations of
peaks and valleys from the intensity image. Morphological
filters are used to determine these representations. Their
template for the eye has eleven parameters consisting of
the upper and lower arcs of the eye; the circle for the iris;
the center points; and the angle of inclination of the eye.
This template is fit to the image in an energy minimization
sense. Energy functions of valley potential, edge potential,
image potential, peak potential, and internal potential are
(a) (b) determined. Coefficients are selected for each potential and
Fig. 5. Results of segmentation. (a) Input image, (b) Extracted
an update rule is employed to determine the best parameter
image. set. In their experiments they found that the starting location
of the template is critical for determining the exact location
of the eye. When the template was started above the
to transposition, rotation, translation, and reflection which eyebrow, the algorithm failed to distinguish between the
are important properties of the SV feature vector. The eye and the eyebrow. Another drawback to this approach is
Foley-Sammon transform is used to obtain the optimal set its computational complexity. Generally speaking, template
of discriminant vectors spanning the Sammon discriminant based approaches to feature extraction are a more logical
plane. For a small set of 45 images of nine persons two approach to take. The problem lies in the description of
of the vectors seem to be adequate for recognition; more these templates. Whenever analytical approximations are
discriminant vectors will be needed for recognition with made to the image, the system has to be tolerant to certain
more images. The SVD operation is applied to each image discrepancies between the template and the actual image.
matrix for extracting SV features and the SV vector. This tolerance tends to average out the differences that
In [loo] Nixon uses the Hough transform for feature make individual faces unique.
extraction. The transform locates analytically described A statistically motivated approach to detecting and rec-
shapes by using the magnitude of the gradient and the ognizing the human eye in an intensity image with the
directional information provided by the gradient operator to constraint that the face is in a frontal posture is described
aid in the recognition process. Two parts of the eye are at- in [60]. Hallinan [60] uses a template based approach for
tractive for recognition of the eye, the iris, and the perimeter detecting the eye in an image. The template is depicted as

CHELLAPPA et a1 HUMAN AND MACHINE RECOGNITION OF FACES. A SURVEY 715

__ ~ -_
having two regions of uniform intensity. The first is the transform are:
iris region and the other is the white region of the eye.
The approach constructs an “archetypal” eye and models g(x,y : uo, WO) = exp {-[x2/2a2 y2/203 +
various distributions as variations of it. For the “ideal” eye +
f 2?ri[uoa: way]} (2)
a uniform intensity for both the iris and whites is chosen. G(u,w) = exp {-27r 2 [a,(u
2
- uo)2
In an actual eye certain discrepancies from the ideal are
found which hamper the uniform intensity choice. These
+ a;(w - w0)21) (3)
discrepanciescan be modeled as “noise” components added where gz and ay represent the spatial widths of the Gauss-
to the ideal image. For instance, the white region might ian and ( U O ,W O ) is the frequency of the complex sinusoid.
have speckled (spot) points depending on scale, lighting The Gabor functions form a complete though nonorthog-
direction, etc. Likewise the iris can have within it some onal basis set. Like the Fourier series, a function g(x,y)
“white” spots. The author uses an a-trimmed distribution can easily be expanded using the Gabor function. Consider
for both the iris and the white. A “blob” detection system the following wavelet representation of the Gabor function:
is developed to locate the intensity valley caused by the
iris enclosed by the white. Using a-trimmed means and @x(a:, y, e) = exp {[-X2(x’2 + y‘2)] + ira’} (4)
variances and a parameter set for the template of the blob, x’ = x cos 6 + y sin 0 (5)
a cost functional is determined for valley detection. A y’ = -x sin 6’ + y cos 13 (6)
deformable human eye template is constructed around the
valley detection scheme. The search for candidates uses a where 0 is the preferred spatial orientation and X is the
coarse to fine approach. Minimization is achieved using aspect ratio of the Gaussian. For convenience the subscripts
the steepest descent method. After locating the candidate a are dropped in further discussions. In the experiments, X is
goodness of fit criteria is used for verification purposes. The set to 1, and B is discretized into four orientations. The
inputs used in the experiments were frontal face intensity resulting family of wavelets is given by
images. In all three sets of data were used. One consisted
of 25 images used as a testing set, another had 107 positive {@[ai(. - xO), a’(y - YO), ek])r E R,
eyes, and the third consisted of images with most probably j = ( 0 , -1, - 2 , . . .} (7)
erroneous locations which could be chosen as candidate
templates. For locating the valleys the author reports as where e k = k7r/N, N = 4, k = (0, 1 , 2 , 3 } and a j , j E z.
many as 60 false alarms for the first data set, 30 for Feature detection utilizes a simple mechanism to model
the second and 110 for the third. An increase in hit rate the behavior of the end-inhibition. It uses interscale in-
is reported when using the a-trimmed distribution. The teraction to group the responses of cells from different
overall best hit rate reported was 80%. frequency channels. This results in the generation of the
Reisfeld and Yeshurun in [ 1121 use a generalized sym- end-stop regions. The orientation parameter 8 determines
metry operator for the purpose of finding the eyes and the direction of the edges. Hypercomplex cells in animals
mouth in a face. Their motivation stems from the almost are sensitive to oriented lines and step edges of short
lengths, and their response decreases if the lengths are
symmetric nature of the face about a vertical line through
increased.
the nose. Subsequent symmetries lie within features such
as the eyes, nose and mouth. The symmetry operator
locates points in the image corresponding to high values
of a symmetry measure discussed in detail in [112]. They
indicate their procedure’s superiority over other correlation
based schemes like that of Baron [ 141 in the sense that their
scheme is independent of scale or orientation. However,
since no a priori knowledge of face location is used, the
search for symmetry points is computationally intensive. where f represents the input image, g is a sigmoid nonlin-
The authors mention a success rate of 95% on their face earity, y is a normalizing factor, and n > m. The final step
image database, with the constraint that the face occupy is to actually localize these features, and this is done by
between 1540% of the image. looking at the local maximum of these feature responses.
Manjunath et al. [88] present a method for the extraction A feature point is selected by taking the maxima in a
of pertinent feature points from a face image. It employs local neighborhood of the pixel location ( 2 , ~ )Let. the
Gabor wavelet decomposition and local scale interaction to neighborhood be Nzy :
extract features at points of curvature maxima in the image,
corresponding to orientation and local neighborhood. These
feature points are then stored in a data base and subsequent
target face images are matched using a graph matching The general idea is to use (9) to determine responses at
technique. The 2D Gabor function used and its Fourier two scales. These scales act as the hypercomplex cells in

716 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5 , MAY 1995


(a)
Fig. 6. (a) Input image and (b) feature points extracted.

animals. To determine a high spatial curvature point the try to see if a particular area in the image has the necessary
response from a larger sized cell is subtracted from the component parts (in correct orientations relative to each
smaller sized cell using (8). A smaller cell will have a other) and determine the existence of the component. The
higher response for a sharper curvature. This is determined Face level will try to determine which geometric layout
to be a feature point in the image. of the components is best suited to describe a face from
Some experimental results for this feature extraction the image data. The structure of the system is based on
method are shown in Fig. 6. Notice that the background a blackboard architecture; all the tasks have access to
on the image is uniform; this type of image can be (and can write on) to the blackboard. The author reports
seen as representative of passport, driver’s license or successful detection of the face using this method with
any identification-type photographs where control over two experiments. The modularity of the system makes it
background is easily enforced. possible to expand it by adding other knowledge sources
[33] describes a knowledge-based vision system for de- such as eyebrows, ears, forehead, etc. The usage of sketched
tection of human faces from hand drawn sketches. The images can be extended to the edge map of an intensity
system employs an IF-THEN rule to process its tasks, image with some processing to get labeled segments, as is
i.e., “IF: upper mouth line is not found but lower mouth done in [ 1231.
line is found, THEN: look for the upper mouth line in
the image area directly above the lower mouth.” The C. Recognition
template for the face consists of the eyes (both left and I ) Earlier Approaches: One of the earliest works in com-
right), the nose and the mouth. The processing is done on puter recognition of faces is reported by Bledsoe [18]. In
four different abstraction levels of image information; Line this system, a human operator located the feature points
Segment, Component Part, Component, and Face. The line on the face and entered their positions into the computer.
segments are selected as candidates of component parts with Given a set of feature point distances of an unknown
probability values associated with them. A component will person, nearest neighbor or other classification rules were

CHELLAPPA et a1 HUMAN AND MACHINE RECOGNITION OF FACES A SURVEY 7 11

-. - _ _
used for identifying the label of the test image. Since The authors then set a bound on the probability of finding
feature extraction is manually done, this system could a correct match, using some arbitrary constants, to be
accommodate wide variations in head rotation, tilt, image about 90% from 15 000 images. However, this is just an
quality, and contrast. extrapolation of the results that they obtained from the sixty
A landmark work on face recognition is reported in two images that were tested and not a result of actual
the doctoral dissertation of M. D. Kelly [81]. Kelly’s experiments.
work is similar in framework to that of Bledsoe, but is One method of characterizing the face is the use of
significantly different in that it does not involve any human geometrical parameterization, i.e., distances and angles
intervention. Although we cite this work in connection between points such as eye comers, mouth extremities,
with face recognition, Kelly’s dissertation has made several nostrils, and chin top [78]. The data set used by Kanade
important contributions to goal directed (also known as consists of 17 male and three female faces without glasses,
top-down) and multiresolution image analysis. mustaches, or beards. Two pictures were taken of each
Kelly uses the body and close up head images for individual, with the second picture being taken one month
recognition. Once the body and head have been outlined as later in a different setting. The face-feature points are
described in Section IV-A, ten measurements are extracted. located in two stages. The coarse-grain stage simplified
The body measurements include heights, widths of the head, the succeeding differential operation and feature-finding
neck, shoulders, and hips. Measurements from the face algorithms. Once the eyes, nose and mouth are approxi-
include width of the head and distances between eyes, top mately located, more accurate information is extracted by
of head to eyes, between eyes and nose and the distance confining the processing to four smaller regions, scanning
from eyes to mouth. The nearest neighbor rule was used at higher resolution, and using the “best beam intensity” for
for identifying the class label of the test image; the leave- the region. The four regions are the left and right eye, nose,
one-out [45] strategy was used. The dataset consisted of a and mouth. The beam intensity is based on the local area
total of 72 images, comprised of 24 sets of three images of histogram obtained in the coarse-grain stage. A set of 16
ten persons. Each set had three images per person; image facial parameters which are ratios of distances, areas, and
of the body, image of the background corresponding to the angles to compensate for the varying size of the pictures
body image and a close-up of the head. is extracted. To eliminate scale and dimension differences
In [go], Kaya et al. report a basic study using infor- the components of the resulting vector are normalized. The
mation theoretic arguments in classifying human faces. entire data set of 40 images is processed and one picture of
They reason from the fact that to represent N different each individual is used in the training set. The remaining 20
faces a total of log2 N bits are required (upper bound pictures are used as a test set. A simple distance measure
on the entropy). They contend that since illumination and is used to check for similarity between an image of the
background are the same for all face images and the images test set and the image in the reference set. Matching
taken are photographs of front views of human faces, with accuracies range from 45% to 75% correct, depending
mouth closed, no beards, and no eyeglasses, therefore the on the parameters used. Better results are obtained when
dimensionality of the parameter space can be reduced from several of the ineffective parameters are not used [78].
the above upper bound. Sixty two photographs were taken 2) Statistical Approach: Turk and Pentland [ 1331 used
with a special apparatus to ensure correct orientation and eigenpictures (also known as “eigenfaces” (see Fig. 7)
lighting conditions. An experiment was conducted using in [133]) for face detection and identification. Given the
1W O human subjects to identify prominent geometric eigenfaces, every face in the database can be represented
features from three different faces. The authors identify nine as a vector of weights; the weights are obtained by pro-
of these parameters to run statistical experiments on. These jecting the image into eigenface components by a simple
parameters form a parameter vector composed of internal inner product operation. When a new test image whose
biocular breadth, extemal biocular breadth, nose breadth, identification is required is given, the new image is also
mouth breadth, bizygomatic breadth, bigonial breadth, dis- represented by its vector of weights. The identification of
tance between lower lip and chin, distance between upper the test image is done by locating the image in the database
lip and nose and height of lips. They construct a classifier whose weights are the closest (in Euclidean distance) to
based on the parameter vector and its estimate, i.e., if X the weights of the test image. By using the observation
is the parameter vector then the estimate Y is given as that the projection of a face image and a nonface image
Y = X + D where D is the distortion vector. The distortion are quite different, a method for detecting the presence of
vector D has two components Dm, the distortion due to a face in a given image is obtained. Turk and Pentland
data acquisition and sampling error and Dj due to inherent illustrate their method using a large database of 2500 face
variations in facial features. The authors discuss two cases, images of 16 subjects, digitized at all combinations of
one in which D, is negligible and the other where D, three head orientations, three head sizes and three lighting
is comparable to Dj. For each parameter a threshold is conditions. Several experiments were conducted to test the
determined from its statistical behavior. Classification is robustness of the approach to variations in lighting, size,
done using the absolute norm between a stored parameter head orientation, and the differences between the training
set and the input image parameter values. It should be and test conditions. The authors reported 96% correct
noted that the parameter values are determined manually. classification over lighting variations, 85% over orientation

718 PROCEEDINGS OF THE IEEE. VOL. 83, NO. 5, MAY 1995


Fig. 7. Eigenfaces [133].

variations and 64% over size variations. It can be seen on sex, race, approximate age and facial expression was
that the approach is fairly robust to changes in lighting included. Unlike mug shots applications, where only one
conditions, but degrades quickly as the scale changes. front and one side view of a person’s face is kept, in this
One can explain this by the significant correlation present database several persons have many images with different
between images with changes in illumination conditions; expressions, head wear, etc.
the correlation between face images at different scales is One of the applications the authors consider is interactive
rather low. Another way to interpret this is that the approach search through the database. When the system is asked to
based on eigenfaces will work well as long as the test present face images of certain types of people (e.g., white
image is “similar” to the ensemble of images used in the females of age 30 years or younger), images that satisfy
calculation of eigenfaces. Turk and Pentland also extend this query are presented in groups of 21. When the user
their approach to real time recognition of a moving face chooses one of these images, the system presents faces
image in a video sequence. A spatiotemporal filtering step from the database that look similar to the chosen face
followed by a nonlinear operation is used to identify a in the order of decreasing similarity. In a test involving
moving person. The head portion is then identified using a 200 selected images, about 95% recognition accuracy was
simple set of rules and handed over to the face recognition obtained-i.e., for 180 images the most similar face was of
module. the same person. To evaluate the recognition accuracy as
In [104], Pentland et al. extend the capabilities of their a function of race, images of white, black and Asian adult
earlier system [133] in several directions. They report ex- males were tested. For white and black males accuracies
tensive tests based on 7562 images of approximately 3000 of 90% and 95% were reported, respectively, while only
people, the largest database on which any face recognition 80% accuracy was obtained for Asian males. The use of
study has been reported to date. Twenty eigenvectors were eigenfaces for personnel verification is also illustrated.
computed using a randomly selected subset of 128 images. In mug shots applications, usually a frontal and a side
In addition to eigenrepresentation, annotated information view of a person are available. In some other applications,

CHELLAPPA et a1 HUMAN AND MACHINE RECOGNITION OF FACES A SURVEY 719

.___ ~ ~ ~ _ _ __
~
more than two views may be available. One can take two Both the KL-IPAT and KL-FSAT have difficulties when
approaches to handling images from multiple views. The the head orientation is varied [l 11.
first approach will pool all the images and construct a set The effectiveness of SVD for face recognition has been
of eigenfaces that represent all the images from all the tested in [32], [69]. The optimal discriminant plane and
views. The other approach is to use separate eigenspaces quadratic classifier of the normal pattern is constructed for
for different views, so that the collection of images taken the 45 SV feature vector samples. The classifier is able
from each view will have its own eigenspace. The second to recognize the 45 training samples of the nine subjects.
approach, known as the view-based eigenspace, seems to Testing was done using 13 photos which consisted of nine
perform better. For mug shots applications, since two or newly sampled photos of the original test subjects with two
at most three views are needed, the view-based approach of one subject and three samples of the subject at different
produces two or three sets of eigenspaces. ages. There was a 42.67% error rate which Hong feels was
The concept of eigenfaces can be extended to eigenfea- due to the statistical limitations of the small number of
tures, such as eigeneyes, eigenmouth, etc. Just as eigenfaces training samples [69].
were used to detect the presence of a face in [133], In [32] the SV vector is compressed into a low dimen-
eigenfeatures are used for the detection of features such sional space by means of various transforms, the most
as eyes, mouth etc. Detection rates of 94%, 80%, and 56% popular being an optimal discriminant transform based on
are reported for the eyes, nose and mouth, respectively, on Fisher's criterion. The Fisher optimal discriminant vector
the large dataset with 7562 images. represents the projection of the set of samples on a direction
Using a limited set of images (45 persons, two views 9,chosen so that the patterns have a minimal scatter within
per person, corresponding to different facial expressions each class and a maximal scatter between classes in the
such as neutral versus smiling), recognition experiments as 1D space. Three SV feature vectors are extracted from
a function of number of eigenvectors for eigenfaces only the training set in [32]. The optimal discriminant transform
and for the combined representation were performed. The compresses the high-dimensional SV feature space to a new
eigenfeatures performed as well as eigenfaces; for lower r-dimensional feature space. The new secondary features
order spaces, the eigenfeatures fared better; when the com- are algebraically independent and informational redundancy
bined set was used, marginal improvement was obtained. is reduced. This approach was tested on 64 facial images
As summarized in Section 111, both holistic and feature- of eight people (the classes). The images were represented
based mechanisms are employed by humans. The feature by Goshtasby's shape matrices, which are invariant to
based mechanisms may be useful when gross variations translation, rotation, and scaling of the facial images and
are present in the input image; the authors' experiments are obtained by polar quantization of the shape [54]. Three
support this. photographs from each class were used to provide a training
The effectiveness of standardized KL coefficients such as set of 24 SV feature vectors. The SV feature vectors
KL-IPAT and KL-FSAT has been illustrated in [ 111 using were treated with the optimal discriminant transform to
two experiments. In the first experiment, the training and
obtain new feature vectors for the 24 training samples. The
testing samples were acquired under as similar conditions class center vectors were obtained using the second feature
as possible. The test set consisted of five samples from 20 vectors. The experiment used six optimal discriminant
individuals. The KL-IPAT had an accuracy rate of 85% and
vectors. The separability of training set samples was good
the KL-FSAT had an accuracy rate of 91%. Both methods
with 100% recognition. The remaining 40 facial images
misidentified the one example where there is a difference
were used as the test set, five from each person. Changes
in the wearing and not wearing of glasses between the
were made in the camera position relative to the face, the
testing set and the training set. The second experiment
camera's focus, the camera's aperture setting, the wearing
checks for feature robustness when there is a variation
or not wearing of glasses, and blurring. As with the
caused by an error in the positioning of the target window.
training set, the SV feature vectors were extracted, and the
This is an error usually made during image acquisition
optimal discriminant transform was applied to obtain the
due to changing conditions. The test images are created
transformed feature vector. Again good separability was
by shifting the reference points in various directions by
obtained with an accuracy rate of 100% [32].
one pixel. The variances for 4 and 8 pixels are tested.
Cheng et al. [31] develop an algebraic method for face
The KL-IPAT having an error rate of 24% for the 4 pixel
difference and 81% for the 8 pixel difference. The KL- recognition using SVD and thresholding the eigenvalues
thus obtained to some value greater than a set threshold
FSAT had an 4% error rate for the 4 pixel difference and a
44% error rate for the 8 pixel difference. The improvement value. They use a projective analysis with the training set
is due to the shift invariance property in the Fourier of images serving as the projection space. A training set in
spectrum domain. The third experiment used the variations their experiments consists of three instances of face images
in head positioning. The test samples were taken while the of the same person. If A E 72"'" represents the image,
subject was nodding and shaking his head. The KL-FSAT and A!' represents the jth face image of person i, then the
showed high robustness over the KL-IPAT for the different average image for person i is given by ( l / N ) A!).
orientations of the head. Good recognition performance was Eigenvalues and eigenvectors are determined for this av-
achieved by restricting the image acquisition parameters. erage image using SVD. The eigenvalues are thresholded

720 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5 , MAY 1995

.
to disregard the values close to zero. Average eigenvectors
(called feature vectors) for all the average face images are
calculated. A test image is then projected onto the space
spanned by the eigenvectors. The Frobenius norm is used as
a criterion to determine which person the test image belongs
to. The authors reported 100% accuracy when working with
a database of 64 face images of eight different persons.
Each person contributed eight images. Three images from
each person were used to determine the feature vector
for the face image in question. Eight such feature vectors
were determined. They state that the projective distance of
the testing image sample was markedly minimum for the
correct training set image.
The use of isodensity lines, i.e., curves of constant gray (a)
level, for face recognition has been investigated in [98]. Fig. 8. Radius vectors and other feature points’[22].
Such lines, although they are not directly related to the
3D structure of a face, do provide a relief image of the
face. Using images of faces taken with a black background, for testing and training were acquired such that facial
a Sobel operator and some post-processing steps are used hair, jewelry and makeup were not present. They were
to obtain the boundary of the face region. The gray level then preprocessed so that the eyes are level and the eyes
histogram (an 8-bin histogram) is then used to trace contour and mouth are positioned similarly. A 30 x 30 cropped
lines on isodensity levels. A template matching procedure is block of pixels was extracted for training and testing. The
used for face recognition. The method has been illustrated dataset consisted of 45 males and 45 females; 80 were
using ten pairs of face images, with three pairs of pictures used for training, with 10 serving as testing examples.
of men with spectacles, two pairs of pictures of men with The compression network indirectly serves as a feature
thin beards, and two pairs of pictures of women. 100% extractor; in that the activities of 40 hidden nodes (in a
recognition accuracy was reported on this small data set. 900 x 40 x 900 network) serve as features for the second
3 ) Neural Networks Approach: The use of neural net- network, that performs gender classification. The hope is
works (NN) in face recognition has addressed several that due to the nonlinearities in the network, the feature
problems: gender classification, face recognition, and clas- extraction step may be more efficient than the linear KL
sification of facial expressions. One of the earliest demon- methods. The gender classification network is a 40 x n x 1
strations of NN for face recall applications is reported in network, where the number n of hidden nodes has been 2,5,
Kohonen’s associative map [84].Using a small set of face 10, 20, or 40. Experiments with 80 training images and 10
images, accurate recall was reported even when the input testing images have shown the feasibility of this approach.
image is very noisy or when portions of the images are This method has also been extended to classifying facial
missing. This capability was demonstrated using optical expressions into eight types.
hardware by Psaltis’s group [6]. Using a vector of 16 numerical attributes (Fig. 8) such as
A single layer adaptive NN (one for each person in the eyebrow thickness, widths of nose and mouth, six chin radii,
database) for face recognition, expression analysis and face etc., Brunelli and Poggio [21] also develop a NN approach
verification is reported in [1281. Named Wilkie, Aleksander, for gender classification. They train two HyperBF networks
and Stonham’s recognition device (WISARD), the system [ 1091, one for each gender. The input images are normalized
needs typically 200-400 presentations for training each with respect to scale and rotation by using the positions of
classifier, the training patterns included translation and the eyes which are detected automatically. The 16D feature
variation in facial expressions. Sixteen classifiers were used vector is also automatically extracted. The outputs of the
for the dataset constructed using 16 persons. Classification two HyperBF networks are compared, the gender label for
is achieved by determining the classifier that gives the the test image being decided by the network with greater
highest response for the given input image. Extensions output. In the actual classification experiments only a subset
to face verification and expression analysis are presented. of the 16D feature vector is used. The database consists of
The sample size is small to make any conclusions on the 21 males and 21 females. The leave-one-out strategy [45]
viability of this approach for large datasets involving a large was employed for classification. When the feature vector
number of persons. from the training set was used as the test vector, 92.5%
In [51], Golomb, Lawrence, and Sejnowski present a correct recognition accuracy was reported; for faces not in
cascade of two neural networks for gender classification. the training set, the accuracy further dropped to 87.5%.
The first stage is an image compression NN whose hidden Some validation of the automatic classification results has
nodes serve as inputs to the second NN that performs been reported using humans.
gender classification. Both networks are fully connected, By using an expanded 35D feature vector, and one
three-layer networks with two biases and are trained by HyperBF per person, the gender classification approach
a standard back-propagation algorithm. The images used has been extended to face recognition. The motivation for

CHELLAPPA et al.: HUMAN AND MACHINE RECOGNITION OF FACES: A SURVEY 721


the underlying structure is the concept of a grandmother simulate the local receptive fields of the human visual
neuron: a single neuron (the Gaussian function in the system. Each block consists of the set of block patterns
HyperBF network) for each person. As there were rel- partitioned in the same positions over the image pattern
atively few training images per person, a synthetic data set. The data set for training consists of six hand drawn
base was generated by perturbing around the average of faces with six different expressions: happiness, surprise,
the feature vectors of available persons and the available sadness, anger, fear, and normal. The outer features of
persons were used as testing samples. For different sets each face are its shape and the ears. The inner features
of tuning parameters (coefficients, centers and metrics of are the eyebrows, the eyes, the nose, and the mouth.
the HyperBF‘s) classification results are reported. Some Each face is drawn to be as dissimilar as possible from
corroboration of the caricatural behavior of the HyperBF the others. The testing set consists of the six training
networks, by psychophysical studies, is also presented. faces and the images from the training set masked with a
In [22], Brunelli and Poggio compare the merits of horizontal bar across the upper, middle, and lower portions
both feature based and template based approaches. Their of the face covering approximately 20% of the total image.
feature based approach is motivated by [78] and [80]. The horizontal bar is used to demonstrate the network’s
They determine 35 features which are also used in [lo91 associative memory capability. The network has four levels.
[see also Fig. 8(a)]. They mention the use of various Levels 1-3 consist of 25 input units, six hidden units, and
classifiers to accomplish the task of matching these features, 25 output units. The fourth level has 18 input units, eight
namely Bayes, nearest neighbor, or the HyperBF. For the hidden units, and 25 output units. The network training
template based approach they have selected various regions process at each level results in a different representation of
of the face as templates and used a correlation based the original image data. The last level of the pyramid has the
matching technique [see Fig. 8(b)]. From their experiments leanest and most abstract representation. The representation
they concluded that the template based approach, though is viewed as a unique identification of the face and the
computationally complex, was superior on their database information it conveys. The network is able to successfully
over the feature based approach. An accuracy of 100% recognize the members of the training set when tested on
for the template based approach compared to 90% for the them. The network poorly recognizes (50%) the various
feature based one. masked, blurred, or distorted facial expressions. It is unable
The use of HyperBF networks for face recognition is to recognize the various masks of the happy face. The
also reported in (211.To remove variations due to changing error rate is the result of obtaining a totally different
viewpoint, the images are first transformed using 2D affine abstract representation which the network has not leamed.
transforms. The transformation parameters are obtained by On analysis of the hidden units, patches are found. The
using the detected positions of the eyes and mouth in the patches block off some of the features of the faces and
given image and the desired positions of these features. appear unimportant to the hidden node. The hidden units’
The transformed image is then subjected to a directional internal representations show that many of them are in
derivative operator to reduce the effects of illumination. The the form of eigenfeatures where the features of the faces
resulting image is multiplied by a Gaussian function and are combined in an overlaying manner on top of each
integrated over the receptive field to achieve dimensionality other. The eigenfeatures are only a portion of all the
reduction. The MIT Media Lab database of 27 images, of features. In the happy face the blocked patches of the
each of 16 different persons was used, with the images hidden units are mainly outside of the face while the others
of 17 persons being used for training, while the rest are inside the face. This may be explained by the fact
were used as testing samples. A HyperBF was trained for that the happy face does not have many facial features
each person. An average accuracy of 79% was reported in common with the other faces in the training set. It
compared with 90% accuracy when tested with human appears that the network developed a holistic representation
subjects. By feeding the outputs of 16 HyperBF’s to another of the happy face so that it could be recognized. The leaner
HyperBF, significant reductions in error rates were reported. representations of the face are automatically generated and
[ 1 1 I] presents the results of work using a connectionist are a unique identification of the learned object. The unique
model of facial expression. The model uses the pyramid representation may be associated with the original object in
structure to represent image data. Each level of the pyramid the form of one-to-many. The model is able to successfully
is represented by a network consisting of one input, one identify the same face but not the masked faces of the same
hidden, and one output layer. The input layers of the type. The masking of areas shows where the network’s
middle levels of the pyramid are the outputs of the previous learning is focused. It appears that the middle portion of
level’s hidden units when training is complete. Network the face image is not as important as the upper and lower
training at the lowest level is carried out conventionally. portions and may be used to develop a focus of attention
Each network is trained using a fast variation of the back [lil].
propagation learning algorithm. The training pattern set The systems presented in [24] and [85] are based on
for the subsequent levels is obtained by combining and the dynamic link architecture (DLA). DLA’s attempt to
partitioning the hidden units’ outputs of the preceding level. solve some of the conceptual problems of conventional
The original images of the training set are partitioned artificial neural networks, the most prominent problem
into blocks of overlapping squares. The overlapping blocks being the expression of syntactical relationships in neural

122 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5, MAY 1995


- ~
~-~~~~
Fig. 9. System for DLA.

networks. DLA’s use synaptic plasticity and are able to Binding between neurons is encoded in the form of tempo-
instantly form sets of neurons grouped into structured ral correlations and is induced by the excitatory connections
graphs and maintain the advantages of neural systems. within the image. Four types of bindings are relevant to
A DLA permits pattem discrimination with the help of object recognition and representation: Binding all the nodes
an object-independent standard set of feature detectors, and cells together that belong to the same object, expressing
automatic generalization over large groups of symmetry neighborhood relationships with the image of the object,
operations, and the acquisition of new objects by one- bundling individual feature cells between features present
shot learning, reducing the time-consuming learning steps. in different locations, and binding corresponding points in
Invariant object recognition is achieved with respect to the image graph and model graph to each other. DLA’s
background, translation, distortion and size by choosing a basic mechanism, in addition to the connection parameter
set of primitive features which is maximally robust with between two neurons, is a dynamic variable ( J ) between
respect to such variations. Both [24] and [85] use Gabor two neurons (i, j ) . J-variables play the role of synaptic
based wavelets for the features. The wavelets are used as weights for signal transmission. The connection parameters
feature detectors, characterizedby their frequency, position, merely act to constrain the J-variables. The connection
and orientation. Two nonlinear transforms are used to help parameters may be changed slowly by long-term synaptic
during the matching process. A minimum of two levels, plasticity. The connection weights Jij are subject to a
the image domain and the model domain, are needed for process of rapid modification. Jij weights are controlled by
a DLA. The image domain corresponds to primary visual the signal correlations between neurons i and j . Negative
cortical areas and the model domain to the intertemporal signal correlations lead to a decrease and positive signal
cortex in biological vision. The image domain consists of a correlations lead to an increase in Jij. In the absence
2D array of nodes A: = [ (z,a), where (Y = 1,.. . , F } . of any correlation, Jij slowly retums to a resting state.
Each node at position z consists of F different feature Rapid network self-organizationis crucial to the DLA. Each
detector neurons (z,a ) that provide local descriptors of the stored image is formed by picking a rectangular grid of
image. The label a is used to distinguish different feature points as graph nodes. A locally determined jet for each
types. The amount of feature type excitation is determined of these nodes is stored and used as the pattem class.
for a given node by convolving the image with a subset New image recognition takes place by transforming the
of the wavelet functions for that location. Neighboring image into the grid of jets, and all stored model graphs
nodes are connected by links, encoding information about are tentatively matched to the image. Conformation of the
the local topology. Images are represented as attributed DLA is done by establishing and dynamically modifying
graphs. Attributes attached to the graph’s nodes are activity links between vertices in the model domain. During the
vectors of local feature detectors. An object in the image recognition process an object is selected from the model
is represented by a subgraph of the image domain. The domain. A copy of the model graph is positioned in a central
model domain is an assemblage of all the attributed graphs, position in the image domain. Each vertex in the model
being idealized copies of subgraphs in the image domain. graph is connected to the corresponding vertex in the image
Excitatory connections are between the two domains and graph. The match quality is evaluated using a cost function.
are feature preserving. The connection between domains The image graph is scaled by a factor while keeping the
occurs if and only if the features belong to corresponding center fixed. If the total cost is reduced the new value is
feature types. The DLA machinery is based on a data accepted. This is repeated until the optimum cost is reached.
format which is able to encode information on attributes The diffusion and size estimation are repeated for increasing
and links in the image domain and to transport that in- resolution levels and more of the image structure is taken
formation to the model domain without sending the image into account. Recognition takes place after the optimal total
domain position. The structure of the signal is determined cost is determined for each object. The object with the
by three factors: the input image, random spontaneous best match to the image is determined. Identification is a
excitation of the neurons, and interaction with the cells process of elastic graph matching. In the case of faces,
of the same or neighboring nodes in the image domain. if one face model matches significantly better than all

CHELLAPPA et al.: HUMAN AND MACHINE RECOGNITTON OF FACES: A SURVEY 723


competitor models, the face in the image is considered 2) Let N; be the ith feature point { K} of 2. Search for
as recognized. The system identifies a person’s face by the best feature point {&} in 0 using the criterion
comparing an extracted graph with a set of stored graphs. In
[24] the experiment consists of a gallery of over 40 different
face images. With little effort to standardize the images,
the system’s recognition success is remarkably consistent.
3) After matching, the total cost is computed taking into
The system shows that a neural system gains power when
account the topology of the graphs. Let nodes i and j
provided with a mechanism for grouping. The system used
of the input graph match nodes z’ and j’ of the stored
in [85] has a larger gallery of faces and recognizes them
graph and let j E Ni (i.e., Vj is a neighbor of E).
under different types of distortion and rotation in depth,
Let p;;tjjt = min { d ; j / d ; , j , , d ; , j , / c L ; j } . The topology
achieving less than 5% false assignments. Lades et al.
cost is given by
state that when a clear criterion for the significance for the
recognition process is determined, all false assignments are
rejected and no image is accepted if its corresponding model
is temporarily removed from the gallery. This means that 4) The total cost is computed as
the capacity of the gallery to store distinguishable objects
is certainly larger than its present size. No limits to this
capacity other than a linear increase in computation time
have been encountered so far. Most of the time is spent on
image transformation and on optimizing the map between where At is a scaling parameter assigning relative
the image and individual stored models [24], [85]. importance to the two cost functions.
4 ) Feature Matching Approach: Manjunath et al. [88] 5) The total cost is scaled appropriately to reflect the
store feature points detected using the Gabor wavelet possible difference in the total number of the feature
decomposition into data files for each image. This greatly points between the input and stored graph. If nz, no
reduces the storage requirements for the database. Typically are the numbers of feature points in the input and
35-45 points per face image are generated and stored. The stored graph respectively, then the scaling factor
identification process utilizes the information present in sf = max{nz/no,no/nz}and the scaled cost is
a topological graphic representation of the feature points. C(Z,O) = S f C $ y q .
After compensating for differing centroid locations, two 6) The best candidate is the one with the least cost, i.e.,
cost values are evaluated. One is the topological cost and it satisfies
the other a similarity cost.
The identification process utilizes the information present C(Z, 0 * )= min C(Z, 0’). (15)
in a topological graphic representation of the feature points. 0‘

The feature points are represented by nodes V , where The recognized face is the one that has the minimum of
i = { 1, 2, 3,. . .}, a consistent numbering technique. The the combined cost value. An accuracy of 94% is reported.
information about a feature point is contained in {S,q}, The method shows a dependency on the illumination di-
where S represents the spatial location and q is the feature
rection and works on controlled background images like
vector defined by passport and drivers license pictures. Fig. 10 shows a set
of input and identified images for this method.
Seibert and Waxman [ 1161 have proposed a system for
corresponding to the ith feature point. The vector qi is a recognizing faces from their parts using a neural network.
set of spatial and angular distances from feature point i to The system is similar to a modular system they have
its N nearest neighbors denoted by Q;(z, y, O j ) , where j is developed for recognizing 3D objects [ 1171 by combining
the jth of the N neighbors. N; represents a set of neighbors 2D views from different vantage points; in the case of
which are of consequence for the feature point in question. faces, arrangement of features such as eyes and nose play
The neighbors satisfying both maximum number N and the role of the 2D views. The processing steps involved
minimum Euclidean distance d;j between two points V , and are segmentation of a face region using interframe change
V, are said to be of consequence for the ith feature point. detection techniques, extraction of features such as eyes,
To identify an input graph with a stored one which is mouth, etc., using symmetry detection, grouping and log-
different, either in total number of feature points or in the polar mapping of the features and their attributes such
location of the respective faces, we proceed in a stepwise as centroids, encoding of feature arrangements, cluster-
manner. If i , j refer to nodes in the input graph 2 and ing of feature vectors into view categories using ART 2,
x’,y’, m’,n’ refer to nodes in the stored graph 0 then the and integration of accumulated evidence using an aspect
two graphs are matched as follows: network.
In a subsequent paper Seibert and Waxman [118] exploit
1) The centroids of the feature points of 2 and 0 are the role of caricatures and distinctiveness (summarized in
aligned. Section 111 of the report) in human face recognition to

124 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5 , MAY 1995


Fig. 10. Some results of the recognition system in Manjunath er al.

enhance the capabilities of their previously reported face D. Range Images


recognition system’ Both Of their papers are preliminary The discussion so far has revolved around face recogni-
reports and lack experimental validation using a large
tion methods and systems which use data obtained from
dataset of faces.
a 2D intensity image. Another topic being studied by
Huang et al. have been working on a system for detection
and recognition of human faces in still monochromatic researchers is face recognition from range image data. A
images. A rule-based algorithm is first used to locate faces range image contains the depth structure Of the Object
in the image [1441. nen each face is recognized by a in question. Although such data is not available in most
neural network like structure called CrescepWon [139]. The applications, it is still important to note the benefit of
Cresceptron has a multi-resolution pyramid structure. It is the added infOrmation Present in range data in terms of
on the surface similar to Fukushima’s Neocognitron; how- accuracy of the face recognition System. For further study,
ever, the learning in cresceptron is completely automatic, the reader is encouraged to read some of the selected
and incremental. In a small-scale experiment involving 50 papers presented in the bibliography at the end of this
persons, the Cresceptron performs well. report.

CHELLAPPA et al.: HUMAN AND MACHINE RECOGNITION OF FACES: A SURVEY 125


~~ ~ ___~___~
Fig. 11. (a) Depth of face parameterized as f ( 6 , y ) (Leonard Nimoy as Spock). (b) rendered
polygonal model of face composed from coarse sampling of depth data [52].

Gordon in [52] describes a template based recognition The nose is located.


system involving descriptors determined from curvature Locating the nose facilitates the search for the eyes
calculations of range image data. The data is obtained from and mouth.
a rotating laser scanner system with resolution of better then Other features such as forehead, neck, cheeks, etc., are
0.4 mm. Segmentation is done by classifying surfaces into determined by their surface smoothness (unlike hair or
planar regions, spherical regions and surfaces of revolution. eye regions).
The image data is stored in a cylindrical coordinate system This information is then used for depth template com-
as f(0, y). An example of such data is shown in Fig. 11. At parison. Using the location of the eyes, nose and mouth
each point on the surface of the curve the magnitude and the faces are normalized into a standard position.
direction of the minimum and maximum normal curvatures This standard position is reinterpolated to a regular
are calculated. Since the calculations involve second order cylindrical grid and the volume of space between
derivatives, smoothing is required to remove the affect two normalized surfaces is used as the criterion for
of noise in the image. This smoothing is achieved by a a match.
Gaussian smoothing filter. The system was tested on a dataset of 24 images of eight
The segmentation produces four surface regions: one persons with three views of each. The data represented four
convex, one concave and two saddle regions. Ridges and male and four female faces. Sufficient feature detection was
valley lines are determined by obtaining the maxima and achieved for 100% of these faces. For recognition 97%
minima of the curvatures. These as a whole represent the accuracy is reported for individual features rather than the
information pertinent to feature location for an individual. whole face (which yielded 100% accuracy). In a related
Next comes a comparison strategy as applied to face work [53],the process of finding the features is formalized
recognition. for recognition purposes.

126 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5, MAY 1995


E. Summary that do face reconstruction for witness recall utilize such
Significant progress has been achieved in segmentation, subjective descriptions.
feature extraction and recognition of faces in intensity In the face recognition and identification area the eigen-
images. As range images or stereo pairs may not be faces and eigenfeatures approach of Pentland and his stu-
available in most of the commercial and law enforcement dents seems to be the most tested system, using several
applications, the face recognition problem, by and large, thousands of images. Their eigenfunction approach has
may be viewed as a 2D image matching and recognition been shown to be useful in personal identification search
problem with provisions for two or three views of a through a database, recognition from multiple views, etc.
person’s face. Given that in mug shots, drivers’ licenses, The interesting aspect of this approach is that one can
develop multiple eigenrepresentationscorresponding to not
and personal ID cards, the backgrounds are relatively
only different views but also corresponding to different
uncluttered, and only the face is present, the segmentation
races, age groups, gender, etc. It is almost always true that
of the face image, for subsequent processing, could be
in mug shot and other similar applications such information
reasonably handled by any one of the methods in Section
is available. This enables efficient representation of a po-
IV-A. Segmentation becomes more difficult when beards,
tentially very large number of people. It is our view that the
baldness are present; also if a face has to be segmented
eigenface and feature point based approaches are the most
in a cluttered scene where other objects are present, the
developed and tested ones yet and deserve very serious
techniques presented in [55] and [123] may be applicable.
consideration for evaluation in real applications involving
It should be pointed out that a strict bottom up procedure
hundreds of thousands of examples.
may use segmentation as the first step. Techniques based on
Techniques based on NN also deserve more study. It is
KL transforms often sidestep this issue, treating the entire our view that NN-based methods can potentially incorporate
image including the background as a pattem. This strategy both numerical and structural information relevant to face
may be appropriate when similar homogenous backgrounds recognition; all of the existing work on NN approaches
are present, but if widely varying backgrounds are present, to FRT have demonstrated this on limited sets of images.
more KL features will be required. In addition, the ability to generalize and recognize using
As discussed in Section 111, both holistic and independent incomplete information gives NN classifiers significant
features contribute to face recognition. The notion of eigen- advantages over the simple minimum distance classifiers
faces, and eigenfeatures such as eigenmouths, eigeneyes, used by Pentland’s group. By appropriately combining the
etc., captures both holistic and independent features in a eigenfaces and eigenfeatures with NN classifiers, it will be
unified way. Methods based on deformable templates, and possible to improve the performance of Pentland’s system.
their variants for the extraction of eyes and mouth, suffer In any case, the usefulness of NN classifiers needs to be
from dependence on a large number of parameters and also evaluated on significantly larger datasets than reported in
depend on good initial placement of the different templates. the literature.
One of the more serious concems with the deformable In the final stages of the mug shot matching problem,
templates approach is its computational complexity. The finer attributes of facial features are usually matched for
eigenfeature approach looks more promising as one can identification purposes. The point features used in [88]
roughly construct a region around an eye or both eyes and which correspond to points of high curvature, may serve the
mouth and perform eigenanalysis. From an aesthetic point role of “minutiae” in the fingerprint matching problem. The
of view, the eigenapproach is not appealing as structure usefulness of point features and appropriate recognizers and
information is coded purely in terms of numbers, but identifiers that use them should be studied and evaluated
has advantages of being rather straightforward from a on large datasets.
computational point of view. The features extracted using Owing to the cost of the equipment and the need for
Gabor wavelets capture some types of holistic attributes in easy maintenance, intensity based systems are preferred in
that they represent the face as a cluster of points; location law enforcement applications. Although range information
of these points in and around eyes, mouth, etc., could is richer than the 2D intensity array, we feel that cost
be quite accidental, although one can use masks around considerations will make range image based techniques less
these regions to highlight point features. The numerical attractive for field use.
features extracted from eyes, nose, mouth, and chin regions
concentrate more on the lower parts of the face region,
which may be adequate for gender classification. It appears v. FACERECOGNITION FROM PROFILES
that in face recognition, the upper parts of the face play a Another area for recognition of faces involving intensity
more important role. images is that of profile images. Research in this area is
There is not much of a consensus as to what should basically motivated from requirements of law enforcement
be coded to represent a face. The studies alluded to in agencies with their mug shot databases. Profile images
Section I11 point out the significance of different intemal provide a detailed structure of the face that is not seen
features, but do not say much about how these features in frontal images. In particular, the size and orientation
are coded numerically or in subjective terms such as thick of the nose is delineated. Face recognition from profiles
eyebrow, wide nose, narrow mouth, etc. Many systems concentrates on locating points of interest, called fiducial

CHELLAPPA et a1 HUMAN AND MACHINE RECOGNITION OF FACES A SURVEY 121

- -
lower lip were automatically identified. The details of how
each of these fiducial marks was identified are given in [61].
From these fiducial marks, a set of six feature characteristics
were derived. These were protrusion of nose, area right of
base line, base angle of profile triangle, wiggle, distances
and angles between fiducials. A total of eleven numerical
features were extracted from the characteristics mentioned
above. After aligning the profiles by using two selected
fiducial marks, an Euclidean distance measure was used for
measuring the similarity of the feature vectors derived from
3 the outline profiles. A ranking of most similar faces was
obtained by ordering the Euclidean norms. In subsequent
work, Harmon et al. [63] added images of female subjects
and experimented with the same feature vector. By noting
that the values of the features of a face do not change very
much in different images and that the faces corresponding
to feature vectors with a large Euclidean distance between
them will be different, a partitioning step is included to
improve computational efficiency.
[63] used the feature extraction methods developed in
[61] to create 11 feature vector components. The 11 features
Fig. 12. The nine fiducial points of interest for face recognition were reduced to 10, because nose protrusion is highly cor-
using profile images (similar to figure in [61]). related with two other features. The 10D feature vector was
found to provide a high rate of recognition. Classification
was done based on both Euclidean distances and set parti-
points (Fig. 12). Recognition involves the determination of tioning. Set partitioning was used to reduce the number of
relationships among these fiducial points. candidates for inclusion in the Euclidean distance measure
Kaufman and Breeding [79] developed a face recognition and thus increase performance and diminish computation
system using profile silhouettes. The image acquired by a time. Reference [62] is a continuation of the research done
black and white TV camera is thresholded to produce a in [61] and [63]. The aim is basic understanding of how to
binary, black and white image, the black corresponding to achieve automatic identification of human face profiles, to
the face region. A preprocessing step then extracts the front develop robust and economical procedures to use in real-
portion of the silhouette that bounds the face image. This is time systems, and to provide the technological framework
to ensure variations in the profile due to changes in hairline. for further research. The work defines 17 fiducial points
A set of normalized autocorrelations expressed in polar which appear to the best combination for face recognition.
coordinates is used as a feature vector. Normalization and The method uses the minimum Euclidean distance between
polar representation steps insure invariance to translation the unknown and the reference file to determine the correct
and rotation. A distance weighted k-nearest neighbor rule identification of a profile, and uses thresholding windows
is used for classification. Experiments were performed on a for population reduction during the search for the reference
total of 120 profiles of ten persons, half of which were used file. The thresholding window size is based on the average
for training. A set of 12 autocorrelation features was used vector obtained from multiple samples of an individual’s
as a feature vector. Three sets of experiments were done. profile. In [62], the profiles are obtained from high contrast
In the first two, 60 randomly chosen training samples were photography from which transparencies are made, scanned,
used, while in the third experiment 90 samples were used in and digitized. The test set consists of profiles of the same
the training set. Experiments with varying dimensionality individuals taken at a different setting. The resulting 96%
of the training samples are also reported. The best per- rate of correctness occurs both with and without population
formance (90% accuracy) was achieved when 90 samples reduction [62].
were stored in the training set and the dimensionality of the Wu and Huang [ 1421 also report a profile-based recogni-
training feature vector was four. Comparisons with features tion system using an approach similar to that of Harmon and
derived from moment invariants [38] show that the circular his group [61], but significantly different in detail. First of
autocorrelations performed better. all, the profile outlines are obtained automatically. B-splines
Harmon and Hunt [61] presented a semi-automatic recog- are used to extract six interest points. These are the nose
nition system for profile-posed face recognition by treating peak, nose bottom, mouth point, chin point, forehead point,
the problem as a “waveform” matching problem. The and eye point. A feature vector of dimension 24 is con-
profile photos of 256 males were manually reduced to structed by computing distances between two neighboring
outline curves by an artist. From these curves, a set of points, length, angle between curvature segments joining
nine fiducial marks (see Fig. 12) such as nose tip, chin, two adjacent points, etc. Recognition is done by comparing
forehead, bridge, nose bottom, throat, upper lip, mouth and the feature vector extracted from the test image with stored

728 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5 , MAY 1995


vectors using a sequential search method and an absolute Current attempts [ 1171, [ 1331 at segmenting moving faces
norm. The stored features are obtained from three instances from an image sequence have used pixel based, simple
of a person’s profile; in all, 18 persons were used for the change detection procedures based on difference images.
training phase. The testing dataset was generated from the These techniques may run into difficulties when multiple
same set of persons used in training, but from different moving objects and occlusion are present. Flow field based
images. In the first attempt 17 of the 18 test images were methods for segmenting humans in motion is reported in
correctly recognized. The face image corresponding to the [121]. If there are situations where both camera and objects
failed case was relearned (by including the failed image are moving, more sophisticated segmentation procedures
feature vector in the training set). Another instance of this may be required.
person was then correctly recognized using the expanded There is a large body of existing literature in image
dataset. sequence analysis on segmenting/detecting moving objects
Traditional approaches such as Fourier descriptors (FD) from a stationary or moving platform. Methods based on
have been used [146] for recognizing closed contours. analysis of difference images, discontinuities in flow fields
Using p-type FD’s that can describe open as well as using clustering, line processes or Mxrkov random field
closed curves, Aibara et al. [9] describe a technique for models are available. Some of these techniques have been
profile-based face recognition. The p-type FD’s are derived extended to the case when both the camera and objects are
by discrete Fourier transforming normalized line segments moving.
from profiles, are invariant to parallel translation or enlarge- The problem of structure from motion is to estimate 3D
mentheduction, and satisfy a simple relation between the depth of points on objects from a sequence of images.
original and rotated curves. The training set was generated Since in most cases involving surveillance applications, it
from three sittings of 90 persons (all males) with the is nearly impossible to move the camera along a known
fourth sitting used as the testing data. The p-type FD’s baseline, thus techniques such as motion stereo may not
(ten coefficients)obtained from three sittings were averaged be useful. The structure from motion problem has been
and used as prototypes. Using four coefficients, 65 persons approached in two ways. In the differential method, one
were recognized perfectly. For the full set of 90 test computes some form of a flow field (optical, image or nor-
samples, close to 98% accuracy was obtained using ten mal) and from the computed flow field estimates structure or
coefficients. depth of visible points on the object. The bottleneck in this
approach is the reliable computation of the flow field. In the
A. Summary discrete approach, a set of features such as points, edges,
Profile based recognition has not been pursed with as comers, lines and contours are tracked over a sequence of
much vigor as frontal face recognition. Given that mug frames, and the structure of these features is computed.
shots have at least one side view, one could pursue a The bottleneck here is the correspondence problem-the
combination of the eigenapproach for side views of the task of extracting and matching features over a sequence
face, as done in Pentland et al. [ 1041 and a similar approach of frames. It should be pointed out that in both differential
for the profiles. The p-type FD’s, used in [9] for profiles, and discrete approaches, the parameters that characterize
belong to a class of methods similar to eigenanalysis the motion of the object jointly appear along with the
of waveforms. The profile-based approaches, reported in structure parameters of the object. In FRT, the motion
the literature, have not been tested extensively on large parameters may be useful in predicting where the object will
datasets. An evaluation of the eigenapproach for sideview appear in subsequent frames, making the segmentation task
and profiles deserves serious investigation on large mug somewhat easier. The usefulness of structure information
shots datasets. is in building 3D models for faces and possibly using the
models for face recognition in the presence of occlusion,
disguises and face reconstruction. It should be pointed
VI. FACERECOGNITION FROM AN IMAGE SEQUENCE out that if only a monocular image sequence is available,
In surveillance applications, face recognition and identi- the depth information is available only up to a scaling
fication from a video sequence is an important problem. Al- constant; if binocular image sequences are available, one
though over 20 years of active research on image sequence can get absolute depth values using stereo triangulation.
analysis is available [l], [2], [4], [71], [134], very little Given that laser range finders may not be practical, for
of that research has been applied to the face recognition surveillance applications, binocular image sequences may
problem. We have identified at least four important areas be the best way to get depth information. Another point
relevant to FRT where techniques from image sequence worth observing is that when discrete approaches are used,
analysis are useful: the depth values are available only at sparse points requiring
interpolation techniques; when flow based methods are
1) Segmentation of moving objects (humans) from a used, dense depth maps can be constructed.
video sequence Over the last 20 years, hundreds of papers dealing with
2) Structure estimation structure from motion have appeared in the literature. It
3) 3D models for faces is beyond the scope of this paper to even include a brief
4) Nonrigid motion analysis. summary of major techniques. We simply list books [68],

CHELLAPPA et al.: HUMAN AND MACHINE RECOGNITION OF FACES: A SURVEY 129


[711, [921, [1341, [138], [1471 and review papers [8], [91], error with standard deviation in horizontal direction of 4.1
[95]-[97] here; papers that describe major approaches are pixels and in the vertical direction of 2.0 pixels. For the
listed in the additional bibliography. orientational accuracy on a test set of 189 images with nine
3D models of faces have been employed [lo], [23], views of 21 people evenly spaced from -90' to 90' along
[87] in the model based compression literature by several the horizontal plane, they reported an error with the average
research groups. Such models are very useful for applica- standard deviation of 15'.
tions such as witness face reconstruction, reconstruction of As noted earlier, thresholding of frame differences is
faces from partial information and computerized aging. 3D one of the simplest methods for detecting moving objects.
models of faces could also be useful for face recognition Several of the earlier papers [75]-[77], [96] have analyzed
in the presence of disguises. difference images to draw inferences as to whether an
Another area of relevance to FRT is the motion anal- object is approaching, receding, translating, etc. An exam-
ysis of nonrigid objects [72], [93], [103], [130]. There is ple of face location from a video sequence using simple
emerging work in image sequence analysis dealing with change detection algorithms based on difference images is
nonrigid objects with emphasis on medical applications. shown in Fig. 13. Analysis of difference images becomes
Some of the ongoing work on nonrigid motion analysis will complicated when occlusion and illumination changes are
be useful in face recognition. An application of nonrigid present or when the camera is moving. More sophisticated
motion to face recognition is reported in Yacoob and techniques for the segmentation of moving objects rely
Davis's [ 1431 approach for recognizing facial expressions on analyzing optical flow field or its variations. Optical
and actions from image sequences. Their work focuses flow is the distribution of apparent velocities of brightness
on six universal emotion expressions (i.e., anger, disgust, pattems in an image [70] and may arise from relative
fear, happiness, sadness, and surprise), and detection of eye motion of objects and the viewer. Analysis of optical flow
blinking. The approach consists of spatial tracking of face field is useful for estimating the egomotion of the observer,
features, optical flow computation of these features, and segmentatioddetection of moving objects, image stabiliza-
psychologicallymotivated analysis of these spatio-temporal tion and estimation of depth for scene reconstruction and
results. The system has been successfully employed to collision avoidance. Although computation of optical flow
classify the expressions of 30 subjects with a total of 120 has been studied in the image sequence coding literature
instances of the above six emotions. for nearly 25 years, it has received significant attention
In sum, we feel that segmentation of moving persons in the computer vision literature only over the last 15
from a video is the most important area in image se- years. Since computation of the optical flow field involves
quence analysis with direct applications to face recognition. two unknowns (velocity components along the 2 and y
Structure from motion, 3D modeling of faces and nonrigid directions) but only one measurement (intensity) at each
motion analysis potentially offer new solutions to various pixel, additional constraints such as smoothness of flow are
aspects of the face recognition and reconstruction problem. enforced to find a solution to what is essentially an ill-
In the next subsection, we will briefly summarize the posed problem. Such ill-posed problems are handled using
relevant literature on the problem of segmenting moving the regularization approach [ 1lo], [ 1321. Hom and Schunck
objects from an image sequence. [70] developed an iterative method for computing optical
flow field using the regularization approach. Subsequently
A. Segmentation Anandan [ 131 has presented hierarchical approaches to the
In [17], a template based approach is used to locate and computation of the optical flow field. The literature on
track a moving head in a video sequence. This approach computation of optical flow is quite extensive. We refer
derives its motivation from the view based approach of the reader to other significant papers by Enklemann [42],
[ 1041. It utilizes the minimum number of different templates Glazer [48], Heeger [66], Hildreth [68], and Fleet and
of a face determined by analysis of geometrical transfor- Jepson [44]. Accurate computation of optical flow is still an
mations of a face over parameters such as translation, large unsolved problem leading researchers into computation of
rotation, scaling etc. The minimum set is called the face other flow fields such as image flow [122], [129], [136] and
set topology. Restrictions on the transformation parameters normal flow [12]. Detailed discussion of the relative merits
lead to monotone regions within which variations of the of the computation and interpretation of different types of
parameter values monotonically ascend to the exact match. flow field pattems is beyond the scope of this report. A
A coarse to fine approach is used to zero in on the most systematic evaluation of methods that compute optical flow
likely match viz-a-viz the parameter in question. On the may be found in [15].
coarsest scale a rough estimate of the probable parameter Given that an accurate flow field is available, several
values is determined through a correlation match. This techniques are available for detecting motion boundaries
rough estimate is fine tuned in successive finer scales by or clustering of flow fields. Adiv [7] first partitions the
changing the acceptance threshold. They authors describe computed flow field into connected segments of flow vec-
a system which locates and then tracks a moving head in tors, where each segment is generated by rigid motion of
a video sequence using the above approach. It was used to a planar surface. Subsequently, segments coming from a
test both the translation as well as the rotational accuracy of rigid object are grouped. Grouping is done by using the
the algorithm. In the case of translation the authors reported motion coherency of the planar surfaces. From the grouped

730 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5, MAY 1995


Fig. 13. Locating the head from a video sequence applying the method of Pentland et al. [133].

flow fields, motion and structure parameters are computed as simulated annealing [46], maximum posterior marginal
under the assumptions the field of view (FOV) is narrow [90], and iterated conditional mode [ 161. Implementation of
and the images of moving objects are small with respect these algorithms using analog VLSI hardware is addressed
to the FOV and that the optical flow field is computed in [73], [83]. A recent paper [83] presents a multiscale
from monocular, noisy imagery. Thompson and Pong [ 13 13 approach with supporting physiological theory for the com-
present algorithms for the detection of moving objects from putation of optical flow. Other significant papers that deal
a moving platform. Under various assumptions about the with segmentation, motion detection from optical flow and
camera motion (the complete camera motion is known, only normal flow may be found in [26], [36], [102], [115], [126],
the rotation or translation is known, etc.), several versions [1351, [1411, [148].
of motion detection algorithms are presented with examples Two algorithms that use models of human motion for the
drawn from indoor scenes. purpose of segmentation are described in [loll, [121]. An
Analysis of optical flow for detecting motion bound- algorithm for the detection of moving persons from normal
aries and subsequently for motion detection requires the flows is also described in [99].
availability of accurate estimates of optical flow. But to
obtain these accurate estimates, we need to account for
or model the motion discontinuities in the flow field due
VII. APPLICATIONS
to the presence of moving objects. Simultaneous com-
putation of optical flow and modeling of discontinuities Current applications of FRT include both commercial
has been addressed by several research groups [67], [73], and law enforcement agencies. Although we have not
[94]. The central theme of this approach is to model been able to find many publications detailing commercial
the discontinuities using the “line processes” of Geman applications nevertheless a brief description of possible
and Geman [46], pose the computation of optical flow application areas is given. For the case of law enforcement
in a Bayesian framework, and derive iterative techniques agencies the emphasis has been on face identification,
from the application of an optimization procedure such witness recall, storage and retrieval of mug shots and user

CHELLAPPA et a1 ’ HUMAN AND MACHINE RECOGNITION OF FACES A SURVEY 731


~ __ __ -
interactive systems. Papers describing these applications are a matching algorithm to eliminate the mismatches, and a
summarized. sequencing algorithm on the remaining subset achieved
“nearly perfect” identification for a population of over 100
A. Commercial Applications faces.
In [86] Laughery et al. deal with the storage and retrieval
Commercial application of face recognition technology of mug shots. Three distinctions are made in the prototype
may soon become more important economically than law development efforts, the nature of the target image that
enforcement applications. This has the potential for dras- serves as input to the search, the method of coding the
tically reducing the cost of face recognition in law en- faces, and the pose position of the faces in the mug shots
forcement. Bank cards, both credit cards and ATM cards, album. Three prototype systems characterized by varying
need a better means of user identification. Insurance for methods of interaction and input are reviewed in this
ATM transactions costs about 1.5 cents per transaction; paper. Laughery et al. also review the different interactive
this amounts to several billion dollars per year. Encoding
and automatic measurement algorithms for locating and
each card with facial features could provide an identification
measuring facial features.
method which would be very effective in improving bank
The initial forensic evidence is often a witness’ recall
card fraud statistics.
of the culprit’s appearance. Verbal descriptions of people’s
The advantages of facial identification over altemative
faces very often lack detail. In [ 1201 two factors that might
methods, such as fingerprint identification, are based pri-
affect retrieval-distinctiveness of target face and position
marily on users convenience and cost. Facial recognition
in the album-were independently varied. Distinctive faces
can be corrected in uncertain cases by humans without
are easier to remember than nondistinctive faces. Target
extensive training. The development of multi-media com-
faces occurring later in the album are believed to be
puting promises to make low cost video input devices
more difficult to detect than those encountered earlier
available for PC’s. This should allow a facial image to
in the search. The faces were rated on a set of five-
be acquired by any PC based cash register systems. For
point descriptive scales. The scales were derived from the
systems of this kind to be widely accepted, standard PC
analysis of free descriptions of a different set of faces.
compatible methods for facial recognition must be avail-
The physical measurements corresponding to the features
able. If a low cost method for acquiring and encoding
which had been rated were taken from the faces, and these
facial images is developed, then this technology can also be
were converted to values on five-point scales using linear
applied to provide a low cost booking station technology.
regression techniques. The complete record for each face
comprises 47 face parameters plus the age. Thirty-eight of
B. Law Enforcement the parameters are five-point scale parameters (breadth of
The basic approach to the mug shots problem is for the face, length of hair, eye color), and nine are dichotomous
system to compare features from the target with those stored parameters which code for the presence or absence of facial
in the database. The nature of the target image relative to hair, peculiarities, and accessories. Age is coded on five-
the images in the database is crucial and determines the point scale. The database consists of 1000 photographs of
difficulty of the overall procedure. The target may be a males between 18-70 taken under controlled conditions.
mug-shot or from another photographic source and may Three photographs were taken, frontal view, profile, and $
need to be rotated before the features can be extracted view. Four nondistinctive and four distinctive faces were
and compared to the mug-file images. In [go] ten feature chosen from the set. Eight paid subjects were shown one
distances are measured to code nine features with each of the view test photographs for 10 s, provided a detailed
feature being a distance divided by the nose length. It was description of the photograph, and were assigned randomly
found that 92% of the variation in the normalized data could to either the computer or album search group. There were
be explained by five components. Similarity measures are four albums in which there were four photographs per page
used in sequencing algorithms with geometric coding of and 250 pages. Each photograph appeared four times within
features. The objective of this approach is to sequence the an album. The computer search was performed using the
photographs in the mug shots album on the basis of similar- subjects’ descriptions and ratings and could be changed and
ity to the target. The algorithm’s design must consider the repeated up to four times. The computer search retrieval rate
precision and accuracy of the measurements and develop for the distinctive faces was 75% and for the nondistinctive
appropriate scales for the components of the feature vector. faces 69%. The rate for the album searches was 78% for the
Algorithm design is concerned with the sequence in which distinctive faces and 44% for the nondistinctive faces [120].
selections are made and how to minimize errors in the face In [59] Haig presents a system that can insert new targets
description. The matching algorithm uses a window for into storage, control and change the intensities both locally
each feature and selects all the images that fall within the and globally, move the targets around, change their size
window. A larger window permits more of the database’s and orientation, present them for a wide range of fixed time
population to fit, while a smaller window increases the intervals, run experiments automatically, collect data, and
probability that an error in coding a feature will cause the analyze the results. Haig’s database consists of 100 target
correct image to be missed during the search. Harmon er faces, taken under reasonably standardized conditions from
al. [62], using geometric feature codes of images, present the direct frontal aspect. The pictures are registered such

732 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5, MAY 1995


that the inter-pupilary distance is 30 pixels. The goal of that result from errors in recognition. System reject-error
face distortion experiments is to measure the sensitivity of behavior should be studied, not just forced recognition.
adult observers to slight positional changes of prominent In planning these evaluations, it is also important to keep
facial features. Each target face is subjected to the same in mind that pattern recognition is not governed by a formal
operations, in which certain features are moved by defined theory, as physics or mathematics is, which allows clearly
amounts. Greatest sensitivity is to the movement of the applicable governing principles to determine the extent to
mouth upward by 1.2 pixels, close to the visual acuity which results derived form one set of applications can be
limit. In other experiments, features are interchanged among applied to other applications. The operation of these sys-
four different faces. The head outline is the major focus of tems is statistical, with measurable distributions of success
attention when four features are interchanged. A changed and failure. The specific values in these distributions are
head outline, while maintaining the inner features, very very application-dependent and no existing theory seems to
strongly influences the observer’s responses. Fixing both allow their prediction for a new application. This strongly
the head and the eyes shows a dominance in the mouth over suggests that the most useful form of evaluation is one
the nose. These experiments establish a clear hierarchy of based as closely as possible on a specific application.
feature saliency in which the head’s outline plays a major
role. In the distributed aperture experiments, Haig attempts A. Evaluation Requirements
to find what constitutes a facial recognition feature. The 1)Accuracy Requirements: The face recognition applica-
technique used implies that all parts of the face are equally tions to be evaluated here will usually take the form of
likely to be masked or unmasked in any combination. The a computer search of a large set of face images which
program selects one of the four target faces at random generates a list of possible match candidates which are
and presents the target to a random number of apertures evaluated by the users of the system. In this kind of
with their actual addresses selected at random from the application accuracy requirements for the face recognition
38 possible addresses. An analysis of results revealed a system are bounded by two factors: 1) the acceptable
very high proportion of correct responses across the eyes- probability of missed recognition (the computer never finds
eyebrows and across the hairline at the forehead. Few the right face) and 2) the ability of humans to distin-
correct responses may be seen around the side of the guish similar faces from a candidate list generated by
temples and at the mouth, and the lower chin area is clearly the recognition system without unacceptable confusion or
not a strong recognition feature. fatigue (the human can’t find the face in the set generated
Starkey [127] presents an overview of work done using by the computer). This tradeoff is substantially different
96 police photographs. The images of the faces are nor- from the trade-off inherent in the reject-error characteristics
malized by fixing the pupil distance at 80 pixels. A target found in OCR and fingerprint classification. In these two
face is chosen at random and a neural network is trained applications, if an image is rejected human correction can
to recognize it. The correct face is identified from among always lower the classification error to practical levels
the 96 photos 100% of the time. With the addition of noise since humans can do the job without computer assistance.
levels of 5% and 10% to the target image, the correct face is With face recognition, some faces are known to look alike
found 62.5% and 36.5% of the time. In an experiment using and when humans are presented with large sets of face
100 faces, 43 noses are used to train the neural network to images their ability to distinguish similar faces drops due
find features. The net is able to find the nose feature within to confusion and fatigue. As the candidate list of similar
2 pixels using a Euclidean metric. In profile analysis, a set faces produced by the recognition system increases, the
of 36 profiles is prepared using Fourier descriptors. Cluster probability that a matching face exits to the candidate face
analysis is used to group them by similarity. The descriptors increases. At the same time, as the candidate list increases,
from a profile can be displayed with other data such as the probability of confusion in human selection causes the
height, age, sex, etc. in the form of a histogram or bar-code probability of selecting the matching face to decrease. At
and may be used to increase search accuracy. some point the highest probability of a correct match will
be achieved. This will not be a serious problem if the
face of interest is in the first few faces; the probability
VIII. EVALUATION
OF A FACERECOGNITION SYSTEM
of finding the correct match then increases sharply as
In addition to the material contained in this paper, pre- candidate faces are added. If these lists contain hundreds of
vious work on the evaluation of OCR and fingerprint faces to obtain adequate probability of selection, then the
classification systems [27], [58] provides insights into how limiting performance factor will be human ability to select
the evaluation of recognition algorithms and systems is faces from the candidate list.
most efficiently performed. One of the most important 2 ) Constraints on Samples: In both OCR and fingerprint
facts learned in previous evaluations is that large sets of classification work, use of images with atypical image
test images are essential for adequate evaluation. It is quality or overly simplified segmentation characteristics has
also extremely important that the sample be statistically as led to misleading conclusions about system requirements.
similar as possible to the images that arise in the application In OCR two factors were found to be very important. First,
being considered. Scoring should be done in a way which algorithms should be tested using images from sources
reflects the costs or other system requirement changes of comparable image complexity to the target application.

CHELLAPPA et a1 HUMAN AND MACHINE RECOGNITION OF FACES A SURVEY 133

~ ~- ~~ --~ __
Second, since many early OCR studies were done on data. With this level of image resolution, down-sampling of
isolated characters or characters with moderate segmen- the images and digital filtering to provide lower resolution
tation problems, the need for robust generalization was and image quality can be done with a single set of master
underestimated. This caused too much effort to be expended images. Images can also be cropped after segmentation
on systems that did not address the types of recognition to provide more usable image areas containing the face
problems that arise in real Based on this experience, we image. Resampling the image to provide a greater area of
feel that initial studies using realistic images from some background and less active image area may also be possible,
specific commerciaVlaw enforcement application should be but may introduce artifacts that change the difficulty of the
carried out. An initial image set based on mug shots seems segmentation problem.
appropriate since these images span a wide range of the 3 ) Test Methods for Algorithm Accuracy and Probability of
possible applications shown in Table 1, will have realistic Match: The scoring of face matching systems should be
image segmentation problems, and have realistic image based on the probability of matching a candidate face in the
quality parameters. This should provide commerciaVlaw first n faces on a candidate list. Two sets of probabilities of
enforcement agencies such as credit card companies or the this type can be defined, one for faces in the database and
FBI with a more realistic estimate of the utility of FRT one for faces not in the database. The first will generate
than studies done on idealized datasets or datasets which true positives and the second will generate false positives.
are unrelated to specific applications. The comparison of true and false recognition probabili-
3 ) Speed and Hardware Requirements: We recommend ties assumes that each recognition produces a confidence
that where possible all algorithms under test be evaluated number which is higher for faces with greater similarity.
on several types of parallel computer hardware as well For each specified level of confidence, the number of
as standard engineering workstations. In high volume faces matching true and false faces can be generated. The
applications speed will be an important factor in evaluating simplest accuracy measure of each type of recognition is the
applicability. In many potential applications, parallel cumulative probability of a match for various values of n,
computers may be too costly but developments of effective and at the same confidence level, the probability of a false
high speed methods on parallel computers should allow match. It seems likely that in addition to the raw cumulative
special purpose hardware to be developed to reduce costs. probability curve, some simple models of the shape of
4 ) Human Inlegace: The utility of face recognition sys- the curve, such as a linear model, may be of interest in
tems will be strongly affected by the type of human comparing different algorithms. In many applications it will
interface that is used in conjunction with this technology. be as important that the face recognition system avoid false
The human factors which will affect this interface are dealt positives as that it produce good true positive candidate
with in Section 111. The literature on human perception and lists.
recognition of faces will be important in designing human Many of the face recognition systems discussed in this
interfaces which allow users to make efficient use of the paper reduce the face to a set of features and measure the
results of machine-based face recognition. similarity of faces by comparing the distance between faces
in this feature space. For all of the test faces the distance
B. Evaluation Methods between each test face and all other faces in the database
1) Database Size and Uniformity: For law enforcement is calculated. The probability, over the entire test sample,
applications, as an initial evaluation sample, a collection of and the average confidence of the first n near neighbors
a minimum of 5000 and a maximum of 50 000 mug shot is then calculated. A similar calculation is made using
images may be appropriate. A testing sample containing faces not in the database and the average confidence of
500 to 5000 different mug shots of faces in the original the first n candidates evaluated. At each confidence level
training set and 500 to 5000 different mug shots of faces for these faces a probability of finding a false match can be
not in the original training set should be collected to allow calculated as the ratio of false candidates to true candidates
testing of machine face matching. Similar samples for at comparable confidence. If the recognition process is
commercial applications are also suggested.' The minimum to be successful the probability of detection of a face in
sample sizes for the test sets is based on the need to obtain the database should always exceed the probability of false
accurate matching and false matching statistics. The 10:1 detection of a face not included in the database.
ratio of the evaluation set size to the testing set size is 4 ) Similarity Measures: The example calculation dis-
designed to minimize false match statistics due to random cussed above requires that the recognition system produce
matches and provide statistical accuracy in probability of a measure of confidence of recognition and of similarity
match versus candidate list size statistics. to other possible recognitions. Similarity differs from
2 ) Sample Size Issues-Feasibility of Resampling: We sug- confidence in that similarity is measured between any two
gest that images be collected at relatively high resolution, points in the feature space of the database while confidence
512 x 512, and using 8 b of gray or intensity. If color is a measure between a test image and a trial match. In
images are used, matching will initially use only intensity the example a reasonable measure might be l/(l+kd:j).
NIST has recently made available a mug shot identification database
Using this measure the similarity of two faces is 1.0 if their
containing a total of 3248 images. For details the reader may contact: features are identical and approaches 0.0 as the features
[email protected]. are displaced to infinity. This type of similarity measure

734 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5 , MAY 1995


only works if all of the feature are normalized to a similar in addition to fingerprint recognition, will remain a
scale; otherwise, a single large feature can control the critical high-technology strategic research area with
entire process. significant potential impact on reducing crime.
5 ) Rank Statistic Comparison: The ability to address ap- Over 30 years of research in psychophysics and neuro-
plications where lists of candidate faces are to be used sciences on human recognition of faces is documented
for human ranking requires that a method for comparison in the literature. Although we do not feel that machine
of human ranking of similar faces and machine ranking recognition of faces should strictly follow what is
of faces be developed. This problem can be addressed known about human recognition of faces, it is benefi-
by treating each ranked list of faces as a symbol string cial for engineers designing face recognition systems
and using the string comparison methods which have been to be aware of the relevant findings. This is especially
applied to OCR. These lists will contain insertions, dele- crucial for applications where human interaction is
tions, and substitutions just as symbol strings would. The involved, such as in expert identification, electronic
comparison of strings can then be effected using Levenstein mug shots books, and lineups. Even for applications
distance as a measure of cost of sequence alignment. The 1-3, what is known about human recognition of faces
problem differs from the OCR problem in that in OCR the obviously impact feature selection and recognition
use of confidence measures allows unknown symbols to strategies.
be included in sequences. In the face recognition problem Segmentation of a face region from a still image or
these sequences are completely determined by similarity a video is one of the key first steps in face recog-
measures and will only decrease in similarity with sequence nition. Surprisingly, this problem has not received
length. much attention. Although the segmentation problem
6) Summary: All of the evaluation methods discussed is easier for mug shots, drivers’ licenses, personal
here are directed toward the evaluation of machine base ID’S, and passport pictures, as in applications 2 and 3,
similarity metrics for specific applications. The two which segmentation in general is a nontrivial task. More effort
should be addressed first are: Can the methods find similar needs to be directed in addressing the segmentation
faces in a large database with an acceptable false detection problem.
rate? and: Will machine base rank ordering of similarity Both global and local feature descriptions are use-
be sufficiently similar to human ranking to provide useful ful. Popular global descriptions are based on the KL
input for human list correction? expansion. The local descriptors are derived from
regions that contain the eyes, mouth, nose, etc., using
approaches such as deformable templates or eigen-
Ix. SUMMARY AND CONCLUSIONS expansion of regions surrounding the eyes, mouth,
etc. Minutiae-type point features have also been ex-
In this paper, we have presented an extensive survey of tracted. It appears that feature extraction is better
human and machine recognition of faces with applications understood and developed in connection with recog-
to law enforcement and other commercial sectors. We have nition. It may be worthwhile to investigate the use of
focused on segmentation, feature extraction and recognition other possible global and local transforms, and better
aspects of the face recognition problem, using informa- methods for detection, localization and description of
tion drawn from intensity and range images of faces and features.
profiles. A brief summary of relevant psychophysics and A multitude of techniques are available in IU literature
neurosciences literature has also been included. [l], [3]-[5], [431, [57], 11371 for recognition. The
The survey presented here is relevant to applications 1-7. eigenapproach of Pentland’s group has been tested
While many of the methods described here are of interest on a large number of images of 3000 persons. Other
in applications 8 and 9, a detailed discussion of them is promising approaches based on neural networks and
beyond the scope of this paper. graph matching have not been tested on such large
We give below a concise summary followed by con- datasets. All of these approaches should be tested on
clusions in the same order as the topics appear in the the same dataset derived from a practical application.
paper. Although a complete algorithm that can solve even
Face recognition technology is an essential tool for the simplest of the applications in Table 1 is not yet
law enforcement agencies’ efforts tc combat crime. available, one can begin the task of evaluating the
Given that crime is seen as the most important problem existing methods on a dataset that truly represents the
facing the country, even ahead of job security, health data available in real applications. Methods that may
care, and the economy, the use of high technology to not scale with the size of the dataset can be identified
effectively fight crime will receive support from the in this way and discouraged from further development.
people and from their elected representatives. Further- For commerciaVlaw enforcement applications, the use
more, commercial applications of this technology has of range data is not feasible due to the cost of the
received a growing interest, most importantly in the data acquisition process. Consequently, only intensity-
case of credit card companies and their desire to reduce based approaches may be pursued. Profile-basedrecog-
fraudulent usage of credit cards. Face recognition, nition schemes should be evaluated on a large dataset

CHELLAPPA et al.: HUMAN AND MACHINE RECOGNITION OF FACES: A SURVEY 735


and methods for integrating profile and face based Commun. and Image Process., 1991, pp. 198-203.
methods should be developed. [lo] K. Aizawa et al., “Human facial motion analysis and synthesis
with application to model-based coding,” in Motion Analy-
Face recognition from a video sequence is probably sis and Image Sequence Processing, M. I. Sezan and R. L.
the most challenging problem in face recognition. Up Lagendijk, MS.Boston, M A Kluwer, 1993, pp. 317-348.
to now, fairly simple thresholding of difference images [ l l ] S. Akamatsu, T. Sasaki, H. Fukamachi, and Y. Suenaga, “A
robust face identification scheme-KL expansion of an invari-
has been used for locating a moving person’s face, and ant feature space,” in SPIE Proc.: Intell. Robots and Computer
has been followed by a 2D recognition algorithm. In Vision X: Algorithms and Techn., vol. 1607, 1991, pp. 71-84.
our opinion, recognition from an image sequence offers [12] Y. Aloimonos et al., “Behavioral visual motion analysis,”
in Proc., DARPA Image Understanding Workshop, 1992, pp.
excellent opportunities for applying several concepts 52 1-54 1.
from the IU literature; specifically, the usefulness of [13] P. Anandan, “A computational framework and an algorithm for
flow fields for the segmentation of the face region, the measurement of visual motion,” Internutiowl Journul of
Computer Vision, vol. 2, pp. 283-310, 1989.
and the reconstruction and refinement of 3D structure [ 141 R. Baron, “Mechanisms of human facial recognition,” Int. J.
from 2D frames, must be investigated. Man-Machine Studies, vol. 15, pp. 137-178, 1981.
[15] J. L. Barron, D. J. Fleet, and S. S. Beauchermin, “Performance
The most important step in face recognition is the of optical flow techniques,” In?. J. on Computer Vision, vol. 12,
ability to evaluate existing methods and provide new pp. 43-77, 1994.
directions on the basis of these evaluations. The im- [16] J. Besag, “On the statistical analysis of dirty pictures,” J. Royal
Statistical Soc., vol. 48, pp. 259-302, 1986.
ages used in the evaluation should be derived from [ 171 M. Bichsel and A. Pentland, “Human face recognition and the
operational situations, similar to those in which the face image set’s topology,” Computer Vision, Graphics, and
recognition system is expected to be installed. An Image Process.: Image Understanding, vol. 59, pp. 254-261,
1994.
important subproblem is the definition of a similar- [18] W. W. Bledsoe, “The model method in facial recognition,”
ity measure that can be used in matching two face Panoramic Research Inc., Tech. Rep. PRI:15, Palo Alto, CA,
images. In witness and electronic mug shots matching 1964.
[19] V. Bruce, Recognizing Faces. London: Erlbaum, 1988.
problems, a face recognition system is expected to rank [20] -, “Perceiving and recognizing faces,” in Mind and Lan-
the chosen images in the same way that humans do, guage, 1990, pp. 342-364.
in terms of how similar the test and stored images [21] R. Brunelli and T. Poggio, “HyperBF networks for gender clas-
sification,” in Proc., DARPA Image Understanding Workshop,
are. The similarity measure used in a face recognition 1992, pp. 311-314.
system should be designed so that humans’ ability to [22] -, “Face recognition: Features versus templates,” IEEE
perform face recognition and recall are imitated as Trans. Pan. Anal. and Mach. Intell., vol. 15, pp. 1042-1052,
1993.
closely as possible by the machine. As of this writing, [23] M. Buck and N. Diehl, “Model-based image sequence coding,”
no such evaluation of a face recognition system has in Motion Analysis and Image Sequence Processing, M. 1. Sezan
been reported in the literature. and R. L. Lagendijk, Eds. Boston, MA: Kluwer, 1993, pp.
285-3 15.
[24] J. Buhmann, M. Lades, and C. v. d. Malsburg, “Size and
ACKNOWLEDGMENT distortion invariant object recognition by hierarchical graph
matching,” in Proc., In?. Joint Con$ on Neural Networks, pp.
The authors are grateful to Prof. A. Pentland for provid- 411-416, 1990.
ing Figs. 3, 5(a), 7, and 13, Prof. T. Poggio for providing [25] P. J. Burt, “Multiresolution techniques for image representa-
Fig. 8, Prof. C. V. d. Malsburg for providing Figs. 2, 6(a), tion, analysis, and ‘smart’ transmission,” in SPIE Proc.: Visual
Commun. and Image Process. IV, vol. 1199, pp. 2-15, 1989.
9, and 10, Dr. G. Gordon for providing Fig. 11, Prof. [26] P. J. Burt, R. Hingorani, and R. J. Kolozunski, “Mechanisms
B. S. Manjunath for providing the feature extraction and for isolating component pattems in the sequential analysis of
face identification algorithms (results of which are shown multiple motion,” in Proceedings, IEEE Workshop on Visual
Motion, pp. 187-193, 1991.
in Fig. 6(b) and Fig. lo), and FBI for providing Fig. 1. [27] G. T. Candela and R. Chellappa, “Comparative performance
The authors would like to thank Ms. C. S. Barnes for of classification methods for fingerprints,” National Institute
helpful discussions in the preparation of this manuscript. of Standards and Technology, Gaithersburg, MD, Tech. Rep.
NISTIR-5163, 1993.
Finally, the authors are grateful to Prof. A. Rosenfeld [28] J. Canny, “A computational approach to edge detection,” IEEE
for carefully reading the manuscript and suggesting useful Trans. Patt. Anal. and Mach. Intell., vol. 8, pp. 679-689, 1986.
1291 S. Carey, “A case study: Face recognition,” in Explorations in
improvements. the Biological Language, E. Walker, Ed. New York Bradford,
1987, pp. 175-201.
REFERENCES [30] S. Carey, R. Diamond, and B. Woods, “The development
[I] Proc. DARPA Image Understanding Workshop, 1984, -. of face recognition-A maturational component?’ Develop.
[2] Proc. IEEE Workshops on Motion, 1986, 1989, 1991. Psych., vol. 16, pp. 257-269, 1980.
[3] Computer Vision, Graphics and Image Process., 1972, -. [31] Y. Cheng, K. Liu, J. Yang, and H. Wang, “A robust algebraic
[4] IEEE Trans. Patt. Anal. and Mach. Intell., 1979, -. method for human face recognition,” in Proc. 11th Int. Con$
[5] Int. J. Compu. Vision, 1988, -. on Putt. Recog., 1992, pp. 221-224.
[6] Y. S. Abu-Mostafa and D. Psaltis, “Optical neural computers,” 1321 Y. Cheng, K. Liu, J. Yang, Y. Zhuang, and N. Gu, “Human
Scientijic American, vol. 256, pp. 88-95, 1987. face recognition method based on the statistical model of small
[7] G. Adiv, “Determining three-dimensional motion and structure sample size,” in SPIE Proc.: Intell. Robots and Compu. Vision
from optical flow generated by several moving objects,” IEEE X : Algorithms and Techn., vol. 1607, 1991, pp. 85-95.
Trans. Putt. Anal. and Mach. Intell., vol. 7, pp. 525-542, 1985. [33] M. J. Conlin, “A ruled based high level vision system,” in SPIE
[8] J. K. Aggarwal and N. Nandhakumar, “On the computation of Proc.: Intell. Robots and Compu. Vision, vol. 726, 1986, pp.
motion from sequences of images,” Proc. IEEE, vol. 76, pp. 314-320.
917-935, 1988. [34] I. Craw, H. Ellis, and J. Lishman, “Automatic extraction of face
[9] T. Aibara, K. Ohue, and Y. Matsuoka, “Human face recognition features,” Part. Recog. Lett., vol. 5, pp. 183-187, 1987.
of P-type Fourier descriptors,” in SPIE Proc., Vol. 1606: Visual [35] I. Craw, D. Tock, and A. Bennett, “Finding face features,” in

PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5, MAY 1995


Proc. 2nd Europe. Con$ on Compu. Vision, 1992, pp. 92-96. [63] L. Harmon, S. Kuo, P. Ramig, and U. Raudkivi, “Identification
[36] T. Darrell and A. Pentland, “Robust estimation of a multi- of human face profiles by computer,” Pan. Recog., vol. 10, pp.
layered motion representation,” in Proc. IEEE Con$ on Visual 301-312, 1978.
Motion, 1991, pp. 173-178. [64] L. D. Harmon, “The recognition of faces,” Scientifrc American,
[37] G . Davies, H. Ellis, and E. J. Shepherd, Perceiving and Remem- vol. 229, pp. 71-82, 1973.
bering Faces. New York: Academic, 1981. [65] D. C. Hay and A. W. Young, “The human face,” in Nor-
[38] S. Dudani, K. Breeding, and R. McGhee, “Aircraft identification mality and Pathology in Cognitive Function, A. W. Ellis, Ed.
by moment invariants,” IEEE Trans. Computers, vol. 26, pp. London: Academic, 1982, pp. 173-202.
39-45, 1977. [66] D. Heeger, “Optical flow from spatio-temporal filters,” Int. J.
[39] H. Ellis, M. Jeeves, F. Newcombe, and A. Young, Aspects of Computer Vision, vol. 1, pp. 279-302, 1988.
Face Processing. Dordrecht: Nijhoff, 1986. [67] F. Heitz and P. Bouthemy, “Multimodal motion estimation and
[40] H. D. Ellis, “The role of the right hemisphere in face percep- segmentation using Markov random fields,” in Proc. Int. Conf.
tion,” in Function of the Right Cerebral Hemisphere, A. W. on Patt. Recog., 1990, pp. 378-383.
Young, Ed. London: Academic, 1983, pp. 33-64. [68] E. C. Hildreth, The Measurement of Visual Motion. Cam-
[41] -, “Introduction to aspects of face processing: Ten questions bridge, MA. MIT Press, 1984.
in need of answers,” in Aspects ofFace Processing, H. Ellis, M. [69] Z. Hong, “Algebraic feature extraction of image for recogni-
Jeeves, F. Newcombe, and A. Young, Eds. Dordrecht: Nijhoff, tion,’’ Patt. Recog., vol. 24, pp. 211-219, 1991.
1986, pp. 3-13. [70] B. K. P. Horn and B. G . Schunck, “Determining optical flow,”
[42] W. Enklemann, “Investigations of multigrid algorithms for the Art$ Intell., vol. 17, pp. 185-203, 1981.
estimation of optical flow fields in image sequences,” Computer [71] T. S. Huang, Image Sequence Analysis. New York: Springer-
Vision, Graphics and Image Process., vol. 43, pp. 150-177, Verlag, 1981.
1988. [72] -, “Modeling, analysis, and visualization of nonrigid object
[43] 0. Faugeras, Three-Dimensional Computer Vision, A Geometric motion,” in Proc. Int. Con$ Patt. Recog., pp. 361-364, 1990.
Viewpoint. Cambridge, MA: MIT Press, 1990. [73] J. Hutchinson, C. Koch, J. Luo, and C. Mead, “Computing
[44] D. J. Fleet and A. D. Jepson, “Stability of phase informa- motion using analog and binary resistive networks,” Computer,
tion,” IEEE Trans. Patt. Anal. and Mach. Intell., vol. 15, pp. vol. 21, pp. 52-63, 1988.
1253-1269, 1993. [74] A. K. Jain, Fundamentals of Digital Image Processing. Engle-
[45] K. Fukunaga, Statistical Pattem Recognition. New York: Aca- wood Cliffs, NJ: Prentice Hall, 1989.
demic, 1989. [75] R. Jain, “Extraction of motion information from peripheral
[46] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distri- processes,” IEEE Trans. Patt. Anal. and Mach. Intell., vol. 3,
bution and Bayesian restoration of images,” IEEE Trans. Patt. pp. 489-503, 1981.
Anal. and Mach. Intell., vol. 6, pp. 721-741, 1984. [76] R. Jain, W. N. Martin, and J. K. Aggarwal, “Segmentation
[47] A. P. Ginsburg, “Visual information processing based on spatial through the detection of changes due to motion,” Computer
filters constrained by biological data,” AMRL Tech. Rep., pp. Graphics and Image Process., vol. 11, pp. 13-34, 1979.
78-129, 1978. [77] R. Jain and H. H. Nagel, “On the analysis of accumulative
[48] F. Glazer, “Hierarchical motion detection,” Ph.D. dissertation, difference pictures from image sequence of real world scenes,”
Univ. Massachusetts, Amherst, MA, 1987. IEEE Trans. Patt. Anal. and Mach. Intell., vol. 1, pp. 206-214,
[49] A. G. Goldstein, “Facial feature variation: Anthropometric data 1979.
11,” Bull. Psychonomic Soc., vol. 13, pp. 191-193, 1979. [78] T. Kanade, Computer Recognition of Human Faces. Basel and
[50] -, “Race-related variation of facial features: Anthropometric Stuttgart: Birkhauser, 1977.
Data I,” Bull. Psychonomic Soc., vol. 13, pp. 187-190, 1979. [79] G . J. Kaufman, Jr., and K. J. Breeding, “The automatic recog-
[51] B. A. Golomb and T. J. Sejnowski, “SEXNET: A neural nition of human faces from profile silhouettes,” IEEE Trans.
network identifies sex from human faces,” in Advances in Syst., Man, and Cybem., vol. SMC-6, pp. 113-121, 1976.
Neural Information Processing Systems 3, D. S. Touretzky and [80] Y. Kaya and K. Kobayashi, “A basic study on human face
R. Lipmann, Eds. San Mateo, CA: Morgan Kaufmann, 1991, recognition,” in Frontiers of Pattern Recognition, S. Watanabe,
pp. 572-577. Ed. New York Academic, 1972, pp. 265-289.
[52] G . Gordon, “Face recognition based on depth maps and surface [81] M. D. Kelly, “Visual identification of people by computer,”
curvature,” in SPIE Proc.: Geometric Methods in Computer Tech. Rep. AI-130, Stanford AI Proj., Stanford, CA, 1970.
Vision, vol. 1570, 1991, pp. 234247. [82] M. Kirby and L. Sirovich, “Application of the Karhunen-Loeve
[53] G . G . Gordon and L. Vincent, “Application of morphology procedure for the characterization of human faces,” IEEE Trans.
to feature extraction for face recognition,” in SPIE Proc.: Patt. Anal. and Mach. Intell., vol. 12, pp. 103-108, 1990.
Nonlinear Image Process., vol. 1658, 1992. [83] C. Koch, H. T. Wang, R. Battiti, B. Mathur, and C. Ziomkowski,
[54] A. Goshtasby, “Description and discrimination of planar shapes “An adaptive multiscale approach for estimating optical flow:
using shape matrices,” IEEE Trans. Patt. Anal. and Mach. Computational theory and physiological implementation,” in
Intell., vol. 7, pp. 738-743, 1985. Proc. IEEE Workshop on Visual Motion, 1991, pp. 111-122.
[ 5 5 ] V. Govindaraju, S. N. Srihari, and D. B. Sher, “A computational [84] T. Kohonen, Self-organization and Associative Memory.
model for face location,” in Proc. 3rd Int. Con$ on Computer Berlin: Springer, 1988.
Vision, 1990, pp. 718-721. [85] M. Lades, J. Vorbruggen, J. Buhmann, J. Lange, C. v. d. Mals-
[56] U. Grenander, Y. Chow, and D. Keenan, Hands: A Pattern burg, and R. Wurtz, “Distortion invariant object recognition in
Theoretic Study of Biological Shapes. New York Springer- the dynamic link architecture,” IEEE Trans. Computers, vol.
Verlag, 1991. 42, pp. 300-311, 1993.
[57] W. E. L. Grimson, Object Recognition by Computer: The Role [86] K. R. Laughery, J. B. T. Rhodes, and J. G . W. Batten,
of Geometric Constraints. Cambridge, MA: MIT Press, 1990. “Computer-guided recognition and retrieval of facial images,”
[58] P. J. Grother and G . T. Candela, “Comparison of handprinted in Perceiving and Remembering Faces, G. Davis, H. Ellis, and
digit classifiers,” Tech. Rep. NISTIR, National Institute of J. Shepherd, Eds. New York Academic, 1981, pp. 250-269.
Standards and Technology, Gaithersburg, MD, 1993. [87] H. Li, P. Roivainen, and R. Forchheimer, “3-D motion estima-
[59] N. D. Haig, “Investigating face recognition with an image pro- tion in model-based facial image coding,” IEEE Trans. Patt.
cessing computer,” in Aspects of Face Processing, H. D. Ellis, Anal. and Mach. Intell., vol. 15, pp. 545-555, 1993.
M. Jeeves, F. Newcombe, and A. Young, Eds. Dordrecht: [88] B. S. Manjunath, R. Chellappa, and C. v. d. Malsburg, “A
Nijhoff, 1985, pp. 410-425. feature based approach to face recognition,” in Proc. IEEE
[60] P. W. Hallinan, “Recognizing human eyes,” in SPIE Proc.: Computer Soc. Conf. on Computer Vision and Patt. Recog.,
Geometric Methods in Compu. Vision, vol. 1570, 1991, pp. 1992, pp. 373-378.
214226. [89] D. Marr, Vision. San Francisco, CA: Freeman, 1982.
[61] L. Harmon and W. Hunt, “Automatic recognition of human face [90]J. Marroquin, S. Mitter, and T. Poggio, “Probabilistic solu-
profiles,” Computer Graphic and Image Process., vol. 6, pp. tions for ill-posed problems in computational vision,” J . Amer.
135-156, 1977. Statistical Assoc., vol. 82, pp. 7 6 8 9 , 1987.
[62] L. Harmon, M. Khan, R. Lasch, and P. Ramig, “Machine iden- [91] W. N. Martin and J. K. Aggarwal, “Dynamic scene analysis: A
tification of human faces,” Patt. Recog., vol. 13, pp. 97-110, survey,” Computer Vision, Graphics and Image Process., vol.
1981. 7, pp. 356-374, 1978.

CHELLAPPA et a1 : HUMAN AND MACHINE RECOGNITION OF FACES A SURVEY 737

-- - -.__ __ __
[92] -, Motion Understanding, Robot and Human Vision. 181 -, “An approach to face recognition using saliency maps
Boston, MA: Kluwer, 1988. and caricatures,” in Proc. World Cong. on Neural Networks,
[93] D. Metaxas and D. Terzopoulus, “Recursive estimation of 1993, pp. 661-664.
shape and nonrigid motion,” in Proc. IEEE Workshop on Visual 191 J. Sergent, “Microgenesis of face perception,” in Aspects of
Motion, 1991, pp. 296-311. Face Processing, H. D. Ellis, M. A. Jeeves, F. Newcombe, and
[94] D. W. Murray and B. F. Buxton, “Scene segmentation from A. Young, Eds. Dordrecht: Nijhoff, 1986.
visual motion using global optimization,” IEEE Trans. Part. 201 J. W. Shepherd, “An interactive computer system for retrieving
Anal. and Mach. Intell., vol. 9, pp. 220-228, 1987. faces,” in Aspects ofFace Processing, H. D. Ellis, M. A. Jeeves,
[95] H. H. Nagel, “Analysis techniques for image sequences,” in F. Newcombe, and A. Young, Eds. Dordrecht: Nijhoff, 1985,
Proc. Int. Con$ on Pan. Recog., 1978, pp. 186-21 1. pp. 398409.
[96] -, “Overview on image sequence analysis,” in Image 211 A. Shio and J. Sklansky, “Segmentation of people in motion,”
Sequence Analysis, T. S. Huang, Ed. New York: Springer- in Proc. IEEE WorkshoD on Visual Motion. 1991. DD. 325-332.
Verlag, 1981, pp. 19-228. [122] A. Singh, “An estimatidn-Theoretic framework f&image flow
[97] __ ,“Image sequences-ten (octal) years-from phenomenol- analysis,” in Proc. Int. Con$ on Computer Vision, 1990, pp.
ogy toward a theoretical foundation,” in Proc. Int. Con$ on Part. 167-177.
Recog., 1986, pp. 1174-1185. [ 1231 S. A. Sirohey, “Human face segmentation and identification,”
[98] 0. Nakamura, S. Mathur, and T. Minami, “Identification of Tech. Rep. CAR-TR-695, Center for Autom. Res., Univ. Mary-
human faces based on isodensity maps,” Patt. Recog., vol. 24, land, College Park, MD, 1993.
pp. 263-272, 1991. [124] L. Sirovich and M. Kirby, “Low-dimensionalprocedure for the
[99] R. Nelson, “Qualitative detection of motion by a moving characterization of human face,” J. Opt. Soc. Amer., vol. 4, pp.
observer,” in Proc. DARPA Image Understanding Workshop, 519-524, 1987.
1990, pp. 329-338. [ 1251 H. L. Snyder, “Image quality and face recognition on a televi-
[lo01 M. Nixon, “Eye spacing measurement for facial recognition,” sion display,” Human Factors, vol. 16, pp. 300-307, 1974.
in SPIE Proc., 1985, vol. 575, pp. 279-285. [126] A. Spoerri and S. Ullman, “The early detection of motion
[loll O’Rourke and N. L. Badler, “Model-based image analysis of boundaries,” in Proc. Int. Conf on Computer Vision, 1987, pp.
human motion using constraint propagation,”IEEE Trans. Pan. 209-2 18.
Anal. and Mach. Intell., vol. 2, pp. 522-536, 1980. [127] R. B. Starkey and I. Aleksander, “Facial recognition for police
[lo21 S. Peleg and H. Rom, “Motion based segmentation,” in Proc., purpose using computer graphics and neural networks,” in Pmc.
Int. Con$ on Patt. Recog., 1990, pp. 109-113. Colloquium on Electron. Images and Image Proc. in Securiry
[lo31 A. Pentland and B. Horowitz, “Recovery of nonrigid motion and Forensic Science, dig. no. 0871990, pp. 21-2.
and structure,” IEEE Trans. Pan. Anal. and Mach. Intell., vol. [128] T. J. Stonham, “Practical face recognition and verification with
13, pp. 730-742, 1991. WISARD,” in Aspects of Face Processing, H. D. Ellis, M. A.
11041 A. Pentland, B. Moghaddam, T. Stamer, and M. Turk, “View- Jeeves, F. Newcombe, and A. Young, Eds. Dordrecht: Nijhoff,
based and modular eigenspaces for face recognition,” in Proc. 1984, pp. 4 2 W 1 .
IEEE Computer Soc. Con$ on Computer Vision and Patt. 291 M. Subbarao, “Interpretation of image flow: A spatio-temporal
Recog., 1994, pp. 84-91. approach,” IEEE Trans. Patt. Anal. and Mach. Intell., vol. 11,
[ 1051 D. Perkins, “A definition of caricature and recognition,” Studies pp. 266-278, 1989.
in the Anthropology of Visual Commun., vol. 2, pp. 1-24, 1975. 301 D. Terzopoulus, A. Watkin, and M. Kass, “Constraints on
[lo61 D. I. Perrett, A. J. Mistlin, and A. J. Chitty, “Visual neurons deformable models: 3-D shape on nonrigid motion,” Art$
responsive to faces,” Trends in Neuroscience, vol. 10, pp. Intell.. vol. 36. OD. 91-123. 1988.
358-363, 1987. 311 W. B.’ ThompsoALandT. C.’Pong, “Detecting moving objects,”
[I071 D. I. Perret, A. J. Mistlin, A. J. Chitty, P. A. Smith, D. D. in Proc. 1st Int. Con$ on Computer Vision, 1987, pp. 201-208.
Potter, R. Broennimann, and M. H. Harries, “Specialized face [I321 A. N. Tikhonov and V. Y. Arsenin, Solution of Ill-posed
processing and hemispheric asymmetry in man and monkey: Problems. Washington, DC: Winston and Wiley, 1977.
Evidence from single unit and reaction time studies,’’ Behav. [133] M. A. Turk and A. P. Pentland, “Face recognition using
eigenfaces,” in Proc. Int. Con$ on Patt. Recog., 1991, pp.
Brain Res., vol. 29, pp. 245-258, 1988.
[lo81 D. I. Perret, P. A. Smith, D. D. Potter, A. J. Mistlin, A. S. Head, 586-591.
[134] S. Ullman, The Interpretation of Visual Motion. Cambridge,
A. D. Milner, and M. A. Jeeves, “Visual cells in temporal cortex MA: MIT Press, 1979.
sensitive to face view and gaze direction,” in h o c . Royal Soc. [I351 J. Y. A. Wang and E. H. Adelson, “Layered representation for
of London, Series B, 1985, vol. 223, pp. 293-317. motion analysis,” in Proc. IEEE Computer Soc. Con$ Computer
[109] T. Poggio and F. Girosi, “Networks for approximation and Vision and Patt. Recog., 1993, pp. 361-366.
learning,” Proc. IEEE, vol. 78, pp. 1481-1497, 1990. [136] A. M. Waxman, B. Kamgar-Parsi, and M. Subbarao, “Closed-
[110] T. Poggio and V. Torre, “Ill-posed problems and regularization form solutions to image flow equations for 3-D structure and
analysis in early vision,” MIT AI Lab, Tech. Rep. AI Memo motion,” Int. J. Computer Vision, vol. 1, pp. 239-258, 1987.
773, 1984. [ 1371 H. Wechsler, Computational Vision. Boston: Academic, 1990.
[ l l l ] A. Rahardja, A. Sowmya, and W. Wilson, “A neural network [138] J. Weng, T. S. Huang, and N. Ahuja, Motion and Structure
approach to component versus holistic recognition of facial for Image Sequences, T. S. Huang, Ed. New York Springer-
expressions in images,” in SPIE Proc.: Intell. Robots and Verlag, 1993.
Computer Vision X : Algorithms and Techn., vol. 1607, 1991, [139] -, “Learning recognition and segmentation of 3D objects
pp. 62-70. from 2D images,’’ in Proc. IEEE Int. Con$ on Computer Vision,
[112] D. Reisfeld and Y. Yeshuran, “Robust detection of facial 1993, pp. 121-128.
features by generalized symmetry,” in Proc. 11th Int. Conk on [140] P. A. Wintz, “Transform picture coding,” Proc. IEEE, vol. 60,
Part. Recog., 1992, pp. 117-120. pp. 809-820, 1972.
[113] T. Sakai, M. Nagao, and S. Fujibayashi, “Line extraction and [141] K. Wohn and A. M. Waxman, “The analytic structure of
pattem recognition in a photograph,” Pan. Recog., vol. 1, pp. image flows: Deformation and segmentation,”Computer Vision,
233-248, 1969. Graphics and Image Process., vol. 49, pp. 127-151, 1990.
[ 1141 A. Samal and P. Iyengar, “Automatic recognition and analysis [142] C. Wu and J. Huang, “Human face profile recognition by
of human faces and facial expressions: A survey,” Patt. Recog., computer,” Patt. Recog., vol. 23, pp. 255-259, 1990.
vol. 25, pp. 65-77, 1992. [143] Y. Yacoob and L. S. Davis, “Computing spatio-temporalrepre-
[115] B. G. Schunck, “Image flow segmentation and estimation by sentations of human faces,” in Proc, IEEE Computer Soc. Con5
constraint line clustering,” IEEE Trans. Patt. Anal. and Mach. on Computer Vision and Part. Recog., 1994, pp. 70-75.
Intell., vol. 11, pp. 1010-1027, 1989. [ 1441 G. Yang and T. S. Huang, “Human face detection in a scene,” in
[116] M. Seibert and A. Waxman, “Recognizing faces from their Proc. IEEE Con$ on Computer Vision and Patt. Recog., 1993,
parts,” in SPIE Proc.: Sensor Fusion N: Control Paradigms pp. 453-458.
and Data Structures, vol. 1611, 1991, pp. 129-140. [145] A. Yuille, D. Cohen, and P. Hallinan, “Feature extraction from
11171 -, “Combining evidence from multiple views of 3-D ob- faces using deformable templates,” in Proc. IEEE Computer
jects,” in SPIE Proc.: Sensor Fusion N: Control Paradigms Soc. Conf on Computer Vision and Pan. Recog., 1989, pp.
and Data Structures, vol. 1611, 1991. 104-109.

738 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5, MAY 1995


-
[146] C. T. Zahn and R. S. Roskies, “Fourier descriptors for plane [22] S. Lele and J. T. Richtsmeier, “Euclidean distance matrix
closed curves,” IEEE Trans. Computers, vol. COM-21, pp. analysis: A coordinate-free approach for comparing biological
269-281, 1972. shapes using landmark data,” Amer. . I. Anthrop., vol. 86,
Phys.
[147] Z. Zhang and 0. Faugeras, “3D dynamic scene analysis,’’ in pp. 415-427, 1991.
Springer Series in Information Sciences, Vol. 27, T. S. Huang, [23] G. Rhodes, “Lateralized process in face recognition,” British J.
Ed. New York: Springer-Verlag, 1992. Psych., vol. 76, pp. 249-271, 1985.
[148] Y. T. Zhou and R. Chellappa, “A network for motion percep- [24] H. D. Ellis, “Processes underlying face recognition: Introduc-
tion,” in Proc. Int. Con$ on Neural Networks, pp. 875-884, tion,” in The Neurophysiology of Face Perception and Facial
1990. Expression, R. Bruyer, Ed. Hillsdale, NJ: Erlbaum, 1986, pp.
1-27.
ADDITIONALREFERENCES [25] J. Hochberg and R. E. Galper, “Recognition of faces: I. An
exploratory study,” Psychonomic Science, vol. 9, pp. 619-620,
[l] M. K. Fleming and G . W. Cottrell, “Categorization of faces 1967.
using unsupervised feature extraction,” in Proc. In?. Joint Con$ [26] D. C. Hay and A. W. Young, “The human face,” in Nor-
on Neural Networks, pp. 65-70, 1990. mality and Pathology in Cognitive Function, A. W. Ellis, Ed.
[2] N. M. Marinovic and G. Eichmann, “Feature extraction and London: Academic, 1982, pp. 173-202.
pattem classification in space-Spatial frequency domain,” in [27] G . Davies, H. Ellis, and J. Shepherd, “Perceiving and remem-
SPIE Proc.: Intell. Robots and Computer Vision, vol. 579, 1985, bering faces,” Perceptual and Motor Skills, 1965.
pp. 19-26. [28] S. Deutsch, “Conjectures on mammalian neuron networks for
[3] L. de Floriani. “Feature extraction from boundary models of visual pattem recognition,” IEEE Trans. Syst. Sci. and Cybern.,
three-dimensionalobjects,” IEEE Trans. Patt. Anal. and Mach. vol. 2, pp. 81-85, 1966.
Intell., vol. 11, pp. 785-798, 1989. [29] P. D. McCormack and S. P. Colletta, “Recognition memory for
[4] T. Abe, “Automatic identification of human faces by 3-D shape items from unilingual and bilingual lists,” Bull. Psychonomic
of surface-Using vertices of B-spline surface,” in Systems and SOC.,vol. 6, pp. 149-151, 1975.
Computers in Japan, 1991. [30] H. Ellis, J. Sheppard, and G . Davies, “An investigation into
[5] E. R. Brocklehurst, “Computer methods of signature verifi- the use of the photofit technique for recalling faces,” British .I.
cation,” IEE Colloq. Dig. (80): Colloq. on MMI in Computer Psych., vol. 66, pp. 29-37, 1975.
Security, 1986, pp. 311-5. [31] H. Ellis, “Recognizing faces,” British .I Psych.,
. vol. 66, pp.
[6] L. D. Harmon and K. C. Knowlton, “Picture processing by 409-426, 1975.
computer,” Science, vol. 164, pp. 19-29, 1969. [32] M. S. Wogalter and D. B. Marwitz, “Face composite construc-
[7] P. L. Hawkes, “The commercializationof biometric techniques tion: In-view and from memory quality and improvement with
for automatic personal identification,” IEE Colloq. Dig. (80): practice,” Ergonomics, 1991.
Colloq. on MMI in Computer Security, 1986, pp. 111-2. [33] J. F. Fagan, 111, “Infants’ recognition of invariant features of
[8] H. Mannaert and A. Oosterlinck, “Self-organizing system for faces,” Child Develop., vol. 47, pp. 627-638, 1976.
analysis and identification of human face,” in SPIE Proc. Appl. [34] M. P. Young and S. Yamane, “Sparse population coding of faces
of Digital Image Process. X l l l , vol. 1349, 1990, pp. 227-232. in the inferotemporal cortex,” Sci., vol. 256, pp. 1327-1331,
[9] K. Flaton, “2D object recognition by adaptive feature extraction 1992.
and dynamical link graph matching,” Tech. Rep., Univ. S. [35] R. K. Yin, “Looking at upside-down faces,” J. Experimental
Calif., 1992. Psych., vol. 81, pp. 141-145, 1969.
[IO] Midorikawa, “Face pattem identification by back propagation [36] J. Y. Cartoux, J. T. Lapreste, and M. Richetin, “Face authentifi-
leaming procedure,” Neural Networks, vol. 1, p. 515, 1988. cation or recognition by profile extraction from range images,”
[ l l ] M. Nixon, “Automated facial recognition and its potential for in Proc. Workshop on Interp. of 3 0 Scenes, 1990, pp. 194199.
security,” in IEE Colloq. Dig. (SO):Colloq. on MMI in Computer [37] P. Maragos, “Tutorial on advances in morphological image
Security, 1986, pp. 5 / 1 4 , processing and analysis,” Opt. Eng., vol. 26, pp. 623-632, 1987.
[12] A. J. O’Toole, R. B. Millward, and J. A. Anderson, “A [38] J. C. Lee and E. Milios, “Matching range images of human
physical system approach to recognition memory for spatially faces,” in Proc. 3rd Int. Con$ on Computer Vision, 1990, pp.
transformedfaces,” Neural Networks, vol. 1, pp. 179-199, 1988. 722-726.
[13] R. Brunelli and T. Poggio, “Face recognition through geometri- [39] Y. Yacoob and L. S. Davis, “Labeling of human face compo-
cal features,” in Proc. Europe. Con$ on Computer Vision, 1992, nents from range data,” in Proc. IEEE Computer Soc. Con$ on
pp. 792-800. Computer Vision and Patt. Recog., 1993, pp. 592-593.
[14] D. Forsyth and A. Zisserman, “Invariant descriptors for 3-D [40] T. Minami, “Present state of the art on identification of human
object recognition and pose,” IEEE Trans. Patt. Anal. and Mach. face,” J. Soc. Instrum. and Control Eng. (Japan), vol. 25, pp.
Intell., vol. 13, pp. 971-991, 1991. 707-713, 1986.
[15] T. Sakaguchi, 0. Nakamura, and T. Minami, “Personal identi- [41] M. A. Fischler and R. A. Elschlager, “The representation and
fication through facial images using isodensity lines,” in SPIE matching of pictorial structures,” IEEE Trans. Computers, vol.
Proc.: Visual Commun. and Image Process. IV, vol. 1199, 1989, COM-22, pp. 67-92, 1973.
pp. 643-654. [42] S. Carey and R. Diamond, “From piecemeal to configurational
[I61 I. Aleksander, W. Thomas, and P. Bowden, “A step forward in representation of faces,” Sci., vol. 195, pp. 312-314, 1977.
image processing,” Sensor Rev., pp. 120-124, 1984. [43] S. S . Culbert, “Object recognition as a function of number of
[I71 S. Sclaroff and A. Pentland, “Closed-form solutions for phys- different views during training,” Perceptual and Motor Skills,
ically based shape modeling and recognition,” in Proc. IEEE 1965.
Computer Soc. Conf. on Computer Vision and Patt. Recog., [44] M. A. Jeeves, “Plenary Session. An overview. Complemen-
1991, pp. 238-243. tary approaches to common problems in face recognition,”
[18] A. G. Goldstein et al., “Recognition of human faces from in Aspects of Face Processing, H. D. Ellis, M. A. Jeeves, F.
isolated facial features: A developmental study,” Psychonomic Newcombe and A. Young, Eds. Dordrecht: Nijhoff, 1986, pp.
Sci., vol. 6, pp. 149-150, 1966. 445-452.
[19] K. Takahashi, T. Sakaguchi, T. Minami, and 0. Nakamura, [45] J. Carter and M. Nixon, “An integrated biometric database,” in
“Description and matching of density variation for per- Colloq. on Electron. Images and Image Process. in Security and
sonal identification through facial images,” in SPIE Proc.: Forensic Sci., dig. no. 87, 1986, pp. 8 / 1 4
Visual Commun. and Image Process., vol. 1360, 1990, pp. [46] T. Y. Tu, C. Ma, and Y. L. Ma, “Image recognition by the
16941704. Kolmogorov complexity program,” in SPIE Proc.: In?. Symp. on
[20] K. H. Wong, H. H. M. Law, and P. W. M. Tsang, “A system Patt. Recog. and Acoust. Imaging, vol. 768, 1987, pp. 343-350.
for recognizing human faces,” in Proc. Int. Conf. on Acoust., [47] P. OHiggins and N. W. Williams, “An investigation into the
Speech, and Signal Process., 1989, pp. 1638-1642. use of Fourier coefficients in characterizing cranial shapes in
[21] S. Akamatsu, T. Sasaki, N. Masui, H. Fukamachi, and Y. primates,” J. Zoology, London, vol. 21 1, pp. 409430, 1987.
Suenage, “A new method for designing face image classifiers [48] Z. De-Chen and Shen Xuan-Jing, “Fast texture image segmen-
using 3-D CG models,” in SPIE Proc.: Visual Commun. and tation,’’ in SPIE Proc.: Appli. of Digital Image Process. XI&
Image Process., vol. 1606, 1991, pp. 204216. vol. 1349, 1990, pp. 277-282.

CHELLAPPA er a1 HUMAN AND MACHINE RECOGNITION OF FACES A SURVEY 739

-_ ~
__ ~ __
[49] C. A. Rothwell, A. Zisserman, C. I. Marinos, D. A. Forsyth, and [72] H. Chen and T. Huang, “Matching 3-D line segments with
J. L. Mundy, “Relative motion and pose from arbitrary plane applications to multipie-object motion estimation,” IEEE Trans.
curves,” Image and Vision Computing, vol. 10, pp. 250-262, Putt. Anal. and Mach. Intell., vol. 12, pp. 1002-1008, 1990.
1992. [73] I. Sethi and R. Jain, “Finding trajectories of feature points in a
[50] R. M. Haralick, “A facet model for image data,” Computer monocular image sequence,” IEEE Trans. [email protected]. and Mach.
Graphics and Image Process., vol. 15, pp. 113-129, 1981. Intell., vol. 9, pp. 56-73, 1987.
[51] A. Pentland and B. Horowitz, “Recovery of nonrigid motion
and structure,” IEEE Trans. Pan. Anal. and Mach. Intell., vol.
13, pp. 730-742, 1991.
[52] M. Musen and J. Van“Der Lei, “Of brittleness and bottlenecks:
Challenges in the creation of pattem recognition and expert Rama Chellappa (Fellow, IEEE) is a Profes-
system models,’’ in Panem Recognition and Arf$cial Intelli- sor in the Department of Electrical Engineering
gence, E. S. Gelsema and L. N. Kanal, a s . Amsterdam: North at the University of Maryland, where he is
Holland, 1988. also affiliated with the Institute for Advanced
[53] R. M. Haralick, “Ridges and valleys on digital images,” Com- Computer Studies, the Center for Automation
puter Graphics and Image Process., vol. 22, pp. 28-38, 1983. Research and the Computer Science Depart-
[54] Y. Q. Cheng, Y. M. Zhuang, and J. Y. Yang, “Optimal Fisher ment. He is an Editor of Collected Papers on
discriminant analysis using the rank decomposition,” Putt. Digital Image Processing (IEEE Computer Soci-
Recog., vol. 25, pp. 101-111, 1992. ety Press, 1992). He coauthoredArtificial Neural
[55] B. P. Yuchas, Jr., M. H. Goldstein, T. J. Sejnowski, and R. Networks for Computer Vision (Springer Ver-
E. Jenkins, “Neural network models of sensory integration lag, 1992) and coedited Markov Random Fields:
for improved vowel recognition,” Proc. IEEE, vol. 78, pp. Theory and Applications (Academic Press, 1993). He was an Associate
1658-1668, 1990. Editor for IEEE Transactionson Acoustics, Speech, and Signal Processing
[56] J. Q. Fang and T. S. Huang, “Some experiments on estimating and IEEE Transactions on Neural Networks. He is presently Coeditor-
the 3-D motion parameters of a rigid body from two consecutive in-Chief of Computer Vision, Graphics, and Image Processing: Graphic
image frames,” IEEE Trans. Putt. Anal. and Mach. Intell., vol. Models and Image Processing and IEEE Transactions on Image Pro-
6, pp. 547-554, 1984. cessing. He has authored 20 book chapters and over 150 peer-reviewed
[57] J. W. Roach and J. K. Aggarwal, ‘‘Determining the movements joumal and conference papers. He was the General Chairman of the
of objects from a sequence of images,” IEEE Trans. Putt. Anal. IEEE Computer Society Conference on Computer Vision and Pattem
and Mach. Intell., vol. 2, pp. 554-562, 1980. Recognition and of the IEEE Computer Society Workshop on Artificial
[58] R. Y. Tsai and T. S. Huang, “Uniqueness and estimation of 3-D Intelligence for Computer Vision (1989). He was the Program Chairman
motion parameters of rigid bodies with curved surfaces,” IEEE of the IEEE Signal Processing Workshop on Neural Networks for Signal
Trans. Putt. Anal. and Mach. Intell., vol. 6, pp. 13-27, 1984. Processing, and is the Program Chairman for the 2nd Intemational
[59] J. Weng, T. S. Huang, and N. Ahuja, “Motion and structure Conference on Image Processing.
from two prospective views: Algorithms, error analysis and Dr. Chellappa received the 1985 National Science Foundation Presiden-
error estimation,” IEEE Trans. Putt. Anal. and Mach. Intell., tial Young Investigator Award and the 1985 IBM Faculty Development
vol. 11, pp. 451-476, 1989. Award. In 1990 he received the Excellence in Teaching Award from the
[60] T. J. Broida and R. Chellappa, “Estimation of object motion School of Engineering at the University of Southem California. He is a
parameters from noisy images,” IEEE Trans. Putt. Anal. and corecipient of four NASA certificates for his work on synthetic aperture
Mach. Intell., vol. 8, pp. 90-99, 1986. radar image segmentation.
[61] J. Aisbett, “An iterated estimation of the motion parameters of a
rigid body from noisy displacement vectors,” IEEE Trans. Putt.
Anal. and Mach. Intell., vol. 12, pp. 1092-1098, 1990.
[62] H. C. Longuet-Higgins, “A computer program for reconstruct-
ing a scene from two projections,” Nature, vol. 293, pp. Charles L. Wilson (Senior Member, IEEE) has
133-135, 1981. been with the National Institute of Standards
[63] Y. Yasumoto and G. Medioni, “Robust estimation of three- and Technology,Gaithersburg,MD, for the past
dimensional motion parameters from sequence of image frames 15 years. He is presently Manager of the Vi-
using regularization,” IEEE Trans. Putt. Anal. and Mach. Intell., sual Image Processing Group of the Advanced
vol. 8, pp. 464-471, 1986. Systems Division. His was with Los Alamos
[64]C. L. Fennema and W. R. Thompson, “Velocity determination in National Laboratory and AT&T Bell Laborato-
scenes containing several moving objects,” Computer Graphics ries. His current research interests are in appli-
and Image Process., vol. 9, pp. 301-315, 1979. cation of statistical pattem recognition, neural
[65] X. Zhuang, T. S. Huang, and R. M. Haralick, “Two-view motion network methods, and dynamic training methods
analysis: A unified algorithm,” J. Opt. Soc. Amer., vol. 3, pp. for image recognition, image compression, and
1492-1500, 1986. in standards used to evaluate recognition systems.
[66] G. S. Young and R. Chellappa, “3-D motion estimation using Dr. Wilson received a DOC Gold Medal in 1983 for his work in
a sequence of noisy stereo images: Models, estimation, and semiconductor device simulation.
uniqueness results,” IEEE Trans. Putt. Anal. and Mach. Intell.,
vol. 12, pp. 735-759, 1990.
[67] J. A. Webb and J. K. Aggarwal, “Structure from motion of rigid
and jointed objects,” art^ Intell,, vol. 19, pp. 107-130, 1982.
[68] M. Spetasakis and Y. Aloimonus, “A multi-frame approach to Saad A. Sirohey (Member, IEEE) received the
visual motion perception,” Int. J. Computer Vision, vol. 6, pp. B.Sc. (highest honors) in electrical enginering
245-255, 1991. from King Fahd University of Petroleum and
[69] Y. Kim and J. Aggarwal, “Determining object motion in a Minerals, Dhahran, Saudi Arabia, and the M.S.
sequence of stereo images,” IEEE J. Robotics and Autom., vol. degree in electrical engineering from the Uni-
RA-3, pp. 599-614, 1987. versity of Maryland at College Park, in 1990
[70] S. Liou and R. Jain, “Motion detection in spatio-temporal and 1993,respectively.He is working toward the
space,” Computer Vision, Graphics and Image Process., vol. Ph.D. in electrical engineering at the University
45, pp. 227-250, 1989. of Maryland at College Park.
[71] Z. Zhang, 0. Faugeras, and N. Ayache, “Analysis of a sequence He is a Research Assistant at the Center
of stereo scenes containing multiple moving objects using for Automation Research at the University of
rigidity constraints,” in Proc. Int. Conf. on Computer Vision, Maryland. His current research interests include signdimage processing
1988, pp. 177-186. and computer vision, specifically automated face recognition.

740 PROCEEDINGS OF THE IEEE, VOL. 83, NO. 5 , MAY 1995


-___ ~ _ . -

You might also like