Extraction of Non Manual Features For Videobased Sign Language Recognition
Video-based sign language recognition is barely investigated in the field of image processing. General conditions like real-time ability and user and environment independence require compromise solutions. This paper presents a system for the automatic analysis of facial actions. For this purpose, point distribution models and active shape models are employed. Additionally, different approaches for initializing the shapes are compared.
1 Introduction
Sign language is the natural language of deaf people. It is a non-verbal and visual language, different in form from spoken language, but serving the same function.
It is characterized by manual parameters (hand shape, hand orientation, location, motion) and non-manual parameters (gaze, facial expression, mouth movements, position, motion of the trunk and head).
The non-manual parameters in particular can be crucial. For example, mouth movements encode several functions: they specify the meaning of a sign (meat / hamburger), emphasize details, disambiguate signs that use the same manual parameters (brother / sister), and support the recognition of a sign in general by providing redundant information.
Existing sign language recognition systems rely exclusively on the manual parameters [1] [2] [3], although image processing offers the possibility to consider non-manual parameters, too.
This work presents a coarse overview of a system in development for the automated analysis of human facial expressions. The extraction of eye and lip features using a biomechanical model is described in detail.
2 Methodology
The system consists of a face finder and tracker module that combines several probability maps. It analyzes motion by temporal templates and skin color [4] by an RGB histogram, and classifies many geometric features [5] (height/width ratio, orientation, roundness, invariant moments, etc.) and abstract features described by Jones [6] with an AdaBoost classifier. Overlaying the four maps results in a very accurate bounding box around the face. The camera's view of the scene is then optimized for the following processing levels by tilting and panning the camera and finally adjusting the zoom lens.
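The overlaying of several probability maps into a face bounding box can be illustrated with a minimal sketch. The text does not state how the maps are combined, so the pixel-wise averaging, the threshold of 0.5, the function names `fuse_maps` and `bounding_box`, and the toy 4x5 maps are all assumptions made for illustration only.

```python
def fuse_maps(maps):
    """Average several equally sized probability maps pixel by pixel."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
            for r in range(rows)]


def bounding_box(fused, threshold=0.5):
    """Return (top, left, bottom, right) of all pixels >= threshold, or None."""
    hits = [(r, c) for r, row in enumerate(fused)
            for c, p in enumerate(row) if p >= threshold]
    if not hits:
        return None
    rs = [r for r, _ in hits]
    cs = [c for _, c in hits]
    return min(rs), min(cs), max(rs), max(cs)


# Toy maps standing in for the motion, skin color, geometry and AdaBoost cues
# (assumed values, not data from the paper).
motion = [[0, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 0, 0]]
skin = [[0, 0, 0, 0, 0], [0, 1, 1, 1, 0], [0, 1, 1, 1, 0], [0, 0, 0, 0, 0]]
geometry = [[0, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 1, 1, 0, 1], [0, 0, 0, 0, 0]]
boosted = [[0, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 0, 0]]

fused = fuse_maps([motion, skin, geometry, boosted])
print(bounding_box(fused))
```

In this toy case, pixels supported by only one cue receive a fused value of 0.25 and fall below the threshold, so isolated false positives of a single map do not enlarge the box.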
The next task is to find and extract the non-manual features. Each of these features must be investigated in a different way. For example, the analysis of the gaze needs only simple template matching techniques on the gradient of the eye region. For this, an eye-color histogram was first created from 520 manually segmented images. By using the integral projection and this adaptive eye-color histogram, we obtain very accurate results for eye position and tracking.
Applying this method to find the nostrils yields further characteristic points, which we map onto a biomechanical model. This 3D model of the human face allows us to predict the muscle tensions by simulating the skin and the muscles with springs, so it serves as an additional verification. For example, a smile tends to return to a baseline expression.
Lip movements are more complicated to parse. Lip corners are permanently visible features in frontal views, so it is easy to find them by performing a horizontal and vertical integral projection inside a cropped area. Additionally, we use the result of four feature maps, described below. This information, combined with a SUSAN edge detector, yields excellent initial positions for a Point Distribution Model (PDM), which represents the shape of the lip outline and its possible deformations. Active Shape Models (ASM) align this PDM to real shapes and correct invalid shape matching. They are therefore well suited for representing an object like the human lips as a set of N labeled landmark points in a vector. The ASM module is described later in this paper; first we address the initialization problem.
To become independent of lighting conditions, we correct the appearance of the mouth region by shifting its HSI values to an optimal mean value. The color channel can additionally be used to compensate for color discrepancies between the images. In our database there are significant differences in lip and skin color, caused by illumination and person dependency. This negatively affects the creation of the maps, which are based on the red and green channels of the RGB color space. For this reason, the mouth region in each image is calibrated to previously trained mean values. All occurrences of the colors are counted and the average value over the limited range is determined. Next, a common average value is selected for all pictures, which should represent the natural skin color. Then the average value is computed for each picture on its cutout. If it deviates from the selected average value, the color values of all pixels of the picture are shifted accordingly.
In order to initialize the Active Shape Model as well as possible, we tried to approximate the lip outlines and the lip corners for the initialization mask on the basis of different characteristics. For this we use four maps. Three of these maps use color characteristics that separate lips and face.
The first map applies Bayes' theorem to a histogram that was built for lip-like colors. Thus the probability that a pixel belongs to the lips can be determined. A threshold then divides the pixels into two groups, i.e. lips and face.
$$p(\mathrm{lip} \mid rgb) = \frac{P(rgb \mid \mathrm{lip})\, p(\mathrm{lip})}{P(rgb \mid \mathrm{lip})\, p(\mathrm{lip}) + P(rgb \mid \mathrm{face})\, p(\mathrm{face})}$$
The second map is based on a special property of the lip color, which consists primarily of red and green components. Combining these channels and thresholding the result yields a map that emphasizes pixels with lip colors.
The third map is obtained by combining the three RGB channels with a specially selected weighting. The HSV space is characterized by complex transformations on the one hand and by singularities and discontinuities on the other. Therefore we search for a color space that fulfills the following criteria:
- No discontinuities or singularities, which make segmenting the data more difficult.
- Good separation of brightness and color information.
- Simplified segmentation through the best possible separability of matching color regions.
- Simple and fast conversion from and/or to RGB space.
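The first map, which applies Bayes' theorem to a lip-color histogram and then thresholds the result, might be sketched as follows. The coarse 4-bin quantization, the equal prior of 0.5, the threshold of 0.5, the function names and the training pixels are hypothetical choices for illustration; the paper does not specify them.

```python
from collections import Counter


def quantize(rgb, bins=4):
    """Map an 8-bit (r, g, b) triple to a coarse histogram bin."""
    step = 256 // bins
    return tuple(v // step for v in rgb)


def likelihood(samples, bins=4):
    """Normalized color histogram, i.e. an estimate of P(rgb | class)."""
    counts = Counter(quantize(s, bins) for s in samples)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}


def lip_posterior(rgb, p_rgb_lip, p_rgb_face, p_lip=0.5, bins=4):
    """Bayes' theorem for the two classes lips and face."""
    key = quantize(rgb, bins)
    num = p_rgb_lip.get(key, 0.0) * p_lip
    den = num + p_rgb_face.get(key, 0.0) * (1.0 - p_lip)
    return num / den if den > 0 else 0.0  # unseen colors count as non-lip


# Hypothetical training pixels; real ones would come from segmented images.
lip_px = [(180, 60, 70), (170, 50, 80), (190, 70, 90), (200, 140, 120)]
face_px = [(210, 150, 130), (200, 140, 120), (205, 145, 125), (230, 170, 150)]
p_rgb_lip = likelihood(lip_px)
p_rgb_face = likelihood(face_px)

# Tiny 2x2 test image; True marks pixels classified as lips.
image = [[(185, 65, 75), (215, 155, 135)],
         [(205, 145, 125), (10, 10, 10)]]
lip_map = [[lip_posterior(px, p_rgb_lip, p_rgb_face) >= 0.5 for px in row]
           for row in image]
print(lip_map)
```

A color that occurs in both training sets, such as the skin-like tone in the lower left pixel, receives a posterior weighted by how often each class produced it, which is why the threshold and the prior matter for ambiguous colors.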
fig 3: Deformations of the lip PDM (left: asymmetrical shapes, right: symmetrical shapes)
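How a PDM produces such deformations can be sketched under the usual linear formulation, in which a shape is the mean shape plus a weighted sum of deformation modes. The limit of three standard deviations per mode is the common convention for active shape models and is an assumption here, as are the four-landmark outline, the two toy modes (not unit-normalized as PCA eigenvectors would be) and the eigenvalues.

```python
import math


def clamp_params(b, eigenvalues, k=3.0):
    """Limit each mode weight to +/- k standard deviations of its mode."""
    limits = [k * math.sqrt(lam) for lam in eigenvalues]
    return [max(-lim, min(lim, bi)) for bi, lim in zip(b, limits)]


def generate_shape(mean_shape, modes, b):
    """x = mean + sum_i b_i * mode_i, shapes stored as [x1, y1, x2, y2, ...]."""
    shape = list(mean_shape)
    for bi, mode in zip(b, modes):
        shape = [s + bi * m for s, m in zip(shape, mode)]
    return shape


# Hypothetical 4-landmark lip outline: left corner, top, right corner, bottom.
mean_shape = [0.0, 0.0, 5.0, -2.0, 10.0, 0.0, 5.0, 2.0]
modes = [
    [0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 1.0],  # symmetric: mouth opens
    [0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0],  # asymmetric: right corner moves
]
eigenvalues = [4.0, 1.0]  # assumed variances of the two modes

b = clamp_params([10.0, -1.5], eigenvalues)  # first weight exceeds its limit
print(b)
print(generate_shape(mean_shape, modes, b))
```

Clamping the weights is one way an ASM can reject implausible outlines: a fit that would require a weight far outside the trained range is pulled back to the nearest allowed shape.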
Correct [%]    Deviation <= 3 pixels            Deviation <= 6 pixels
Area           Up     Right  Down   Left        Up     Right  Down   Left
Asymmetrical   76.25  43.75  83.75  54.86       97.78  94.86  99.58  95.28
Symmetrical    78.89  58.06  84.72  61.11       97.78  95.14  99.58  95.00
tab 1: Recognition rates (upper row: asymmetrical shapes, lower row: symmetrical shapes)
4 Conclusion