
MVA2002 IAPR Workshop on Machine Vision Applications, Dec. 11-13, 2002, Nara-ken New Public Hall, Nara, Japan

8-19 EXTRACTION OF NON MANUAL FEATURES
FOR VIDEOBASED SIGN LANGUAGE RECOGNITION

Ulrich Canzler¹                               Thomas Dziurzyk²
Chair for Technical Computer Science          Chair for Technical Computer Science
Aachen University of Technology               Aachen University of Technology

Abstract

Videobased sign language recognition is barely investigated in the field of image processing. General conditions like realtime ability and user and environment independence require compromise solutions. This paper presents a system for the automatic analysis of facial actions. For this, point distribution models and active shape models are brought into action. Additionally, a comparison is made between different approaches for the shape initialization.

1 Introduction

Sign language is the natural language of deaf people. It is a non-verbal and visual language, different in form from spoken language, but serving the same function.
It is characterized by manual parameters (hand shape, hand orientation, location, motion) and non-manual parameters (gaze, facial expression, mouth movements, position, motion of the trunk and head).
Especially the non-manual parameters can be crucial; for example, the mouth movements code several functions: they specify meanings of a sign (meat / hamburger), emphasize details, disambiguate signs that use the same manual parameters (brother / sister), and support the recognition of a sign at all by giving redundant information.
Existing sign language recognition systems rely exclusively on the manual parameters [1][2][3], although image processing offers the possibility to consider non-manual parameters, too.
This work presents a coarse overview of a system in development for the automated analysis of human facial expression. The extraction of eye and lip features using a biomechanical model is described in detail.

2 Methodology

The system consists of a face finder and tracker module that combines several probability maps: it analyzes motion by temporal templates and skin color [4] by an RGB histogram, and classifies many geometric features [5] (height/width ratio, orientation, roundness, invariant moments etc.) and abstract features described by Viola and Jones [6] with an Adaboost classifier. The overlaying of the four maps results in a very accurate bounding box around the face. The scene clipping of the camera is optimized for the following processing levels by tilting and panning the camera and finally adjusting the zoom lens.
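The paper gives no implementation details for this fusion step; the following is a minimal sketch, assuming the four cues are already available as per-pixel probability maps normalized to [0, 1], and using a plain average as a placeholder fusion operator (the paper does not name one):

```python
import numpy as np

def face_bounding_box(skin, motion, geometric, adaboost, thresh=0.5):
    """Fuse four per-pixel probability maps (each HxW, values in [0, 1])
    and return a face bounding box (top, left, bottom, right).
    The fusion operator and the threshold are assumptions."""
    fused = (skin + motion + geometric + adaboost) / 4.0
    ys, xs = np.nonzero(fused > thresh)   # pixels with strong combined evidence
    if ys.size == 0:
        return None                       # no face found in this frame
    return ys.min(), xs.min(), ys.max(), xs.max()
```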

fig 1: System overview (source image → cropping → initialization maps: log. hue, I3, Canny edges, lip probability → combined mask → corrected mask → ASM adaption → result image)
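The mask stage of fig 1 can be pictured roughly as follows; a hedged sketch using OpenCV, in which all thresholds and the majority-vote combination are illustrative assumptions, not values from the paper, and the log-hue map is omitted for brevity (the I3 channel is defined by the transformation given later in section 2):

```python
import cv2
import numpy as np

def initialization_mask(bgr_mouth, lip_prob):
    """Sketch of the mask stage from fig 1: threshold the modified I3
    channel, add Canny edges and a lip-probability map, then clean the
    combined mask morphologically.  Thresholds are illustrative only."""
    b, g, r = cv2.split(bgr_mouth.astype(np.float32))
    i3 = (2 * g - r - b) / 4.0                        # modified I3 channel (section 2)
    i3_mask = (i3 > i3.mean()).astype(np.uint8)       # assumed threshold: channel mean
    gray = cv2.cvtColor(bgr_mouth, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(cv2.GaussianBlur(gray, (5, 5), 0), 50, 150) // 255
    lip_mask = (lip_prob > 0.5).astype(np.uint8)
    combined = ((i3_mask + edges + lip_mask) >= 2).astype(np.uint8)  # majority vote
    kernel = np.ones((3, 3), np.uint8)
    return cv2.morphologyEx(combined, cv2.MORPH_CLOSE, kernel)  # dilation, then erosion
```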

¹ Address: 52062 Aachen, Germany. E-mail: canzler@techinfo.rwth-aachen.de
² Address: 52062 Aachen, Germany. E-mail: dziurzyk@techinfo.rwth-aachen.de
The next task is the finding and extraction of non-manual features. Each of these features must be investigated in a different way. E.g. the analysis of the gaze needs simple template matching techniques on the gradient of the eye region. For this, first an eye-color histogram was created from 520 manually segmented images. By using the integral projection and this adaptive eye-color histogram we receive very accurate results for eye position and tracking.
Applying this method for finding the nostrils, we get further characteristic points that we map onto a biomechanical model. This 3D model of the human face allows predicting the muscles' tensions by simulating the skin and the muscles with springs, so it serves for additional verification. E.g. a smile tends to return into a baseline expression.
Lip movements are more complicated to parse. Lip corners are permanently visible features in frontal views, so it is easy to find them by performing a horizontal and vertical integral projection inside a cropped area. Additionally we use the result of four feature maps, described below. This information combined with a SUSAN edge detector yields excellent initial positions for a Point Distribution Model (PDM) that represents the shape of the lip outline and its possible deformations. Active Shape Models (ASM) align the PDM shapes and correct invalid shape matchings, so they are well suited for representing an object like human lips as a set of N labeled landmark points in a vector. The ASM module is described later in this paper; first we address the initialization problem.
To become independent of lighting conditions, we correct the appearance of the mouth region by shifting the HSI mean to an optimal mean value. The color channel can additionally be used to neglect color discrepancies between the images. In our database one can notice significant differences regarding the lip and skin color, caused by illumination and person dependency. This negatively affects the creation of the maps, which are based on the red and yellow channels of the RGB color space. For this reason the mouth region in each image gets calibrated to previously trained mean values: all occurrences of the colors are counted and the average value on the limited range is determined. Next, for all pictures a common average value was selected, which should represent the natural skin color. Now the average value can be computed for each picture on its cutout. If it deviates from the selected common average value, the color values of all pixels of the picture are shifted accordingly.
In order to initialize the Active Shape Model as well as possible, for the initialization mask we tried to approximate the lip outlines and the lip corners on the basis of different characteristics. For this we use four maps. Three of these maps use color separation characteristics of lips and face.
The first map uses the Bayes theorem over a histogram, which was trained for lip-similar colors. Thus each pixel's probability of affiliation can be determined, and by a threshold value the pixels are divided into two groups, i.e. lips and face:

$$ P(\text{lip} \mid rgb) = \frac{P(rgb \mid \text{lip})\,P(\text{lip})}{P(rgb \mid \text{lip})\,P(\text{lip}) + P(rgb \mid \neg\text{lip})\,P(\neg\text{lip})} $$

The second map is based on the special condition of the lip color, which in the first instance contains many red and green hues. The combination of these channels and the subsequent thresholding supplies a map which emphasizes pixels of the lip colors.
The third map is won from a combination of the 3 RGB channels, whereby a special weighting is selected here. The HSV space is characterized by complex transformations on the one hand and by singularities and discontinuities on the other hand. Therefore we search for a color space which fulfills the following criteria:
- no discontinuities or singularities, which would make segmenting the data more difficult
- good separation of brightness and color information
- simplification of segmentation by separating matching color regions as well as possible
- simple and fast conversion from and to RGB space
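Returning to the first map: a minimal sketch of the histogram-based Bayes rule above, assuming lip and non-lip RGB histograms have already been trained on quantized colors (the bin count and the prior are placeholders, not values from the paper):

```python
import numpy as np

def lip_probability_map(rgb, lip_hist, nonlip_hist, p_lip=0.3, bins=32):
    """Per-pixel P(lip | rgb) via Bayes' theorem over trained color
    histograms.  lip_hist and nonlip_hist are normalized
    (bins x bins x bins) histograms of quantized RGB values."""
    q = (rgb // (256 // bins)).reshape(-1, 3)           # quantize each pixel's color
    p_rgb_lip = lip_hist[q[:, 0], q[:, 1], q[:, 2]]     # likelihood under lip model
    p_rgb_non = nonlip_hist[q[:, 0], q[:, 1], q[:, 2]]  # likelihood under face model
    num = p_rgb_lip * p_lip
    den = num + p_rgb_non * (1.0 - p_lip) + 1e-12       # avoid division by zero
    return (num / den).reshape(rgb.shape[:2])

# Thresholding the map then splits the pixels into lips and face:
# lips = lip_probability_map(img, lip_hist, nonlip_hist) > 0.5
```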
fig 2: Interworking of ASMs, PDMs and the Biomechanical Model (the ASM adapts the PDM shapes, whose parameters b = Pᵀ(x − x̄) are verified against the biomechanical model before the coordinates are combined)
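The PDM machinery behind fig 2 is the standard shape statistics of Cootes et al.: a landmark vector x is approximated by the mean shape plus a linear combination of the main modes of variation, x ≈ x̄ + Pb, and fitting inverts this to b = Pᵀ(x − x̄). A minimal sketch, assuming the mean shape, eigenvector matrix and eigenvalues come from a PCA over the hand-labeled training shapes:

```python
import numpy as np

def fit_pdm(x, mean_shape, P, eigvals, limit=3.0):
    """Project an observed landmark vector onto the PDM and clamp the
    shape parameters to +/- limit * sqrt(eigenvalue), which keeps the
    adapted shape inside the space of plausible lip deformations."""
    b = P.T @ (x - mean_shape)        # b = P^T (x - x_mean)
    bound = limit * np.sqrt(eigvals)
    b = np.clip(b, -bound, bound)     # reject implausible deformations
    return mean_shape + P @ b         # nearest plausible shape
```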


For the lip segmentation particularly the modified I3 channel was found to be useful. Since the blue channel plays a subordinate role for the lip color, it turned out that restricting it leads to better segmentation results. The following transformation for the I3 channel was selected:

$$ I_3 = \frac{2G - R - B}{4} $$

For the fourth map a gradient picture won by the Canny operator was used, whereby a Gaussian filter smoothed the edges.
The four characteristic maps are combined by a Bayesian Belief Network into a result map and corrected afterwards into one initialization mask; disturbances are settled and closed with the morphological operators dilation and erosion. Finally the algorithm supplies an object which matches the contour in good approximation to the actual outlines of the lips and serves for the ASM initialization as well as for verifying the lip corner positions retrieved by the integral projection described above.
The last module of the system deals with the problem of feature analysis, which has the objective of classifying the described features into so-called Action Units (AUs). These AUs are defined as basic, anatomically deduced minimal movement units of the face. We use a system derived from the Facial Action Coding System (FACS) [9] to describe these movements. Each AU can be performed with three different intensity levels, and simultaneous occurrences of several AUs are permitted. The AUs themselves are won from the above described features by using a classifier based on Hidden Markov Models.
To improve the classification, additional rules are applied by Fuzzy Sets for effects concerning dominance, substitution and exchangeability. This way additional contextual knowledge is taken into account. Finally the system yields the facial expressions coded by the AUs. These high-level features are afterwards combined with the manual parameters extracted from a separate system developed at our department.

3 Results

For testing the system, we created a database with 720 images (24 persons, 30 images each under different lighting conditions and mouth openings). We defined two quality levels for the recognition rates. The first one defined a maximum distance of 3 pixels between hand-segmented and automatically explored points on the lip contour; the second one was more tolerant with 6 pixels displacement.
As a result it can be stated that within the 3-pixel barrier the points around the upper and/or lower lip were usually correctly detected, which also confirmed the visual observation. Here the detection rate was about 80%, whereas around the left and right lip corner ca. 60% were correctly recognized with the asymmetrical PDM and ca. 53% with the symmetrical PDM.
Regarded in total, it can be stated that in most cases the errors took place within the lips and only rarely outside; the latter occurred mostly with bearded persons, where the strong gradient drew the points in the wrong direction. Thus in approx. 15% of the pictures the upper and/or lower lip outline was detected below and/or above the actual upper/lower lip outline. The lip outline within the range of the right and left lip corners was determined in approximately 30% of the images to the left and/or right of the actual lip outline. Raising the error barrier to six pixels led, for both the symmetrical and the asymmetrical PDM, to a correct detection of at least 94% in each of the four lip ranges.
Exact results are shown in table 1; it differentiates between the used PDMs (symmetrical or asymmetrical) and the maximal tolerated distance (3 or 6 pixels) with respect to the respective mouth ranges and shows the direction of the shifted points.
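The evaluation criterion described above (a per-point pixel tolerance between hand-labeled and automatically placed contour points) can be pictured as in the following sketch; the data layout and the range labels are assumptions for illustration:

```python
import numpy as np

def detection_rates(detected, ground_truth, areas, tol=3.0):
    """Fraction of correctly placed landmarks per lip range.

    detected, ground_truth: (N, 2) arrays of landmark coordinates.
    areas: length-N array of labels ('up', 'right', 'down', 'left')
    assigning each landmark to one of the four mouth ranges.
    A point counts as correct if it lies within tol pixels of its
    hand-segmented counterpart."""
    dist = np.linalg.norm(detected - ground_truth, axis=1)
    return {a: float(np.mean(dist[areas == a] <= tol))
            for a in ("up", "right", "down", "left")}
```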

fig 3: Deformations of the lip PDM (left: asymmetrical shapes, right: symmetrical shapes)
Asymmetrical PDM:
Deviation | <= 3 pixels                   | <= 6 pixels
Area      | Up    | Right | Down  | Left  | Up    | Right | Down  | Left
correct   | 76.25 | 43.75 | 83.75 | 54.86 | 97.78 | 94.86 | 99.58 | 95.28

Symmetrical PDM:
Deviation | <= 3 pixels                   | <= 6 pixels
Area      | Up    | Right | Down  | Left  | Up    | Right | Down  | Left
correct   | 78.89 | 58.06 | 84.72 | 61.11 | 97.78 | 95.14 | 99.58 | 95.00

tab 1: Recognition rates in % (left: asymmetrical shapes, right: symmetrical shapes)

4 Conclusion

In this paper we presented a system for extracting facial expression features for sign language recognition. Several modules are necessary for detecting, tracking and analyzing the face and finally creating a feature vector. We decided to work out an approach with Active Shape Models for modeling the lip movement. Symmetrical and asymmetrical shapes delivered nearly the same recognition rates.

5 Acknowledgement

This research was supervised by Prof. Karl-Friedrich Kraiss, Department for Technical Computer Science at the RWTH Aachen, and supported by the DFG project SIGMA.

References

[1] H. Hienz, K.-F. Kraiss, B. Bauer: Continuous Sign Language Recognition using Hidden Markov Models. Proceedings of the Second International Conference on Multimodal Interfaces, Hong Kong (China), 1999.
[2] T. Starner, A. Pentland: Real-time American Sign Language recognition from video using hidden Markov models. Perceptual Computing Section Technical Report No. 375, MIT Media Lab, Cambridge, MA, 1996.
[3] C. Vogler, D. Metaxas: ASL recognition based on a coupling between HMMs and 3D motion analysis. CIS Technical Report, Department of Computer and Information Science, University of Pennsylvania, 1997.
[4] M. Jones, J. Rehg: Statistical color models with application to skin detection. Proceedings Computer Vision and Pattern Recognition, 1999, pp. 274-280.
[5] S. Gong, S. McKenna, A. Psarrou: Dynamic Vision. Imperial College Press, London, 2000.
[6] P. Viola, M. Jones: Robust Real-time Object Detection. Technical Report Series, Cambridge Research Laboratory, Feb. 2001.
[7] P. De Smet, R. Pries: Implementation and analysis of an optimized rainfalling watershed algorithm. IS&T/SPIE's 12th Annual Symposium Electronic Imaging 2000: Image and Video Communications and Processing, San Jose, California, USA, January 2000, pp. 759-766.
[8] D. Comaniciu, P. Meer: Robust analysis of feature spaces: Color image segmentation. Proc. IEEE Conference on Computer Vision and Pattern Recognition, Puerto Rico, 1997, pp. 750-755.
[9] P. Ekman, W. Friesen: Facial Action Coding System. Consulting Psychologists Press Inc., California, 1978.
