


Skin Detection: A Bayesian Network Approach
 
Nicu Sebe, Ira Cohen, Thomas S. Huang, Theo Gevers
Faculty of Science, University of Amsterdam, The Netherlands
HP Research Labs, USA
Beckman Institute, University of Illinois at Urbana-Champaign, USA

Abstract

The automated detection and tracking of humans in computer vision necessitates improved modeling of the human skin appearance. In this paper we propose a Bayesian network approach for skin detection. We test several classifiers and propose a methodology for incorporating unlabeled data. We apply the semi-supervised approach to skin detection and show that learning the structure of Bayesian network classifiers enables learning good classifiers with a small labeled set and a large unlabeled set.

1. Introduction

Skin is arguably the most widely used primitive in human image processing research, with applications ranging from face detection [16] and person tracking [14] to pornography filtering [2, 6]. We are especially interested in skin detection as a cue for detecting people in real-world photographs. The main challenge is to make skin detection robust to the large variations in appearance that can occur. Skin appearance changes in color and shape and is often affected by occlusion (clothing, hair, eyeglasses, etc.). Moreover, changes in the intensity, color, and location of light sources affect skin appearance, and other objects within the scene may cast shadows or reflect additional light. Finally, there are many objects that are easily confused with skin: certain types of wood, copper, and sand, as well as clothes, often have skin-like colors.

Research has been performed on the detection of skin pixels in color images and on the discrimination between skin and non-skin pixels using various statistical color models [9]. Saxe and Foulds [13] proposed an iterative skin identification method that uses histogram intersection in HSV color space. In contrast to such nonparametric methods, Gaussian density functions [15] and mixtures of Gaussians [11] are often used to model skin color. The motivation for using a mixture of Gaussians is the observation that the color histogram for the skin of people with different ethnic backgrounds does not form a unimodal distribution, but rather a multimodal one. Recently, Jones and Rehg [10] conducted a large-scale experiment in which nearly 1 billion labeled skin tone pixels were collected (in normalized RGB color space). Comparing the performance of histogram and mixture models for skin detection, they found histogram models to be superior in both accuracy and computational cost.

In this paper we are interested in two aspects. First, the research efforts mentioned above tried to classify each individual pixel as being skin or non-skin. We take a different approach: can we learn the dependencies (the structure) between the pixels within a skin patch, and can we then use this structure for classification? To achieve this we use Bayesian networks (Section 2). Bayesian networks can represent joint distributions in an intuitive and efficient way; as such, they are naturally suited for classification. Second, we are interested in a framework that allows the use of both labeled and unlabeled data. This is an important aspect because one of the challenges facing researchers is the relatively small amount of available labeled data: constructing and labeling a good ground truth requires time and effort, whereas collecting unlabeled data is not as difficult. It is therefore beneficial to use classifiers that are learned from a combination of some labeled data and a large amount of unlabeled data. Bayesian networks are well suited for this task: they can be learned with labeled and unlabeled data using maximum likelihood estimation. A discussion of how to incorporate unlabeled data when learning Bayesian networks, and of the corresponding effect on classification results, is presented in Section 3. The experimental analysis on skin detection is presented in Section 4.
2. Bayesian Network (BN) Classifiers

The goal is to label an incoming vector of observables $X$. Each instantiation of $X$ is a sample. We assume that there is a class variable $C$; the values of $C$ are the classes (labels). Note that in our skin detection application, $X$ stands for the pixels of an image patch (3x3 in our case) used as features for the classifier, and $C$ takes two values: skin and non-skin.

We want to build classifiers that receive a sample $\mathbf{x}$ and output one of the values of $C$. We assume 0-1 loss, and consequently our objective is to minimize the probability of classification error. If we knew the joint distribution $P(C, X)$ exactly, the optimal rule would be to choose the class value with the maximum a-posteriori probability, $P(C \mid X)$. This classification rule attains the minimum possible classification error, called the Bayes error.

We consider probabilistic classifiers that represent the a-posteriori probability of the class given the features, $P(C \mid X)$, using Bayesian networks. A Bayesian network is composed of a directed acyclic graph in which every node is associated with a variable $X_i$ and with a conditional distribution $P(X_i \mid \mathrm{Pa}_i)$, where $\mathrm{Pa}_i$ denotes the parents of $X_i$ in the graph. The directed acyclic graph is the structure, and the distributions $P(X_i \mid \mathrm{Pa}_i)$ represent the parameters of the network. We say that the assumed structure for a network, $S$, is correct when it is possible to find a distribution $P(C, X \mid S)$ that matches the distribution generating the data; otherwise, the structure is incorrect. We use maximum likelihood estimation to learn the parameters of the network.

Given a Bayesian network classifier with parameter set $\Theta$, the optimal classification rule under the maximum likelihood (ML) framework for classifying an observed feature vector of $n$ dimensions, $\mathbf{x} = (x_1, \ldots, x_n)$, into one of the $|C|$ class labels is:

$\hat{c} = \arg\max_{c} P(x_1, \ldots, x_n \mid c; \Theta)$   (1)
There are two design decisions when building Bayesian network classifiers. The first is to choose the structure of the network, which determines the dependencies among the variables in the graph. The second is to determine the distribution of the features. The features can be discrete, in which case the distributions are probability mass functions; they can also be continuous, in which case one typically has to choose a distribution, the most common being the Gaussian. Both design decisions determine the parameter set $\Theta$ which defines the distribution needed to compute the decision function in Eq. (1).

Two examples of popular Bayesian network classifiers are the Naive Bayes (NB) classifier, in which the features are assumed independent given the class, and the Tree-Augmented Naive Bayes (TAN) classifier.

The NB classifier assumes that all features are conditionally independent given the class label. Although this assumption is typically violated in practice, NB has been used successfully in many classification applications. One reason for NB's success is the small number of parameters that need to be learned. If the features in $X$ are assumed to be independent of each other conditioned upon the class label (the Naive Bayes framework), Eq. (1) reduces to:

$\hat{c} = \arg\max_{c} \prod_{i=1}^{n} P(x_i \mid c; \Theta)$   (2)

The problem is then how to model $P(x_i \mid c)$, the probability of feature $x_i$ given the class label. In practice, the common assumption is a Gaussian distribution, and ML estimation is used to obtain its parameters (mean and variance).
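To make this concrete, here is a minimal sketch (ours, not the authors' implementation) of a Gaussian Naive Bayes classifier implementing the ML decision rule of Eqs. (1)-(2): per-class, per-feature means and variances are fitted by maximum likelihood, and a sample is assigned to the class with the highest class-conditional likelihood. The synthetic data below stands in for flattened 3x3 patch features.

```python
import numpy as np

class GaussianNaiveBayes:
    """Sketch of NB with Gaussian features, fitted by maximum likelihood."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        # ML estimates: per-class, per-feature mean and variance (Eq. 2)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) for c in self.classes]) + 1e-6
        return self

    def log_likelihood(self, X):
        # log P(x_1, ..., x_n | c) = sum_i log N(x_i; mu_ic, var_ic)
        cols = [-0.5 * (np.log(2 * np.pi * v) + (X - m) ** 2 / v).sum(axis=1)
                for m, v in zip(self.mu, self.var)]
        return np.column_stack(cols)

    def predict(self, X):
        # ML rule of Eq. (1): pick the class maximizing P(x | c; Theta)
        return self.classes[np.argmax(self.log_likelihood(X), axis=1)]

# Toy usage with synthetic stand-ins for flattened 3x3 patches:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.6, 0.1, (500, 9)), rng.normal(0.3, 0.1, (500, 9))])
y = np.array([1] * 500 + [0] * 500)  # 1 = skin, 0 = non-skin
print(GaussianNaiveBayes().fit(X, y).predict(X[:3]))
```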
Friedman et al. [7] proposed the Tree-Augmented Naive Bayes model as a classifier that enhances performance over the simple Naive Bayes classifier. In the structure of the TAN classifier, the class variable is the parent of all the features, and each feature has at most one other feature as a parent, such that the resultant graph over the features forms a tree. When learning the TAN classifier we do not fix the structure of the Bayesian network; instead, we search for the TAN structure that maximizes the likelihood function given the training data, out of all possible TAN structures. In general, searching for the best structure has no efficient solution; searching for the best TAN structure, however, does. The method uses the modified Chow-Liu algorithm for constructing tree-augmented Bayesian networks [7]. The algorithm finds the tree structure among the features that maximizes the likelihood of the data by computing the pairwise class-conditional mutual information among the features and building a maximum weighted spanning tree using the pairwise mutual information values as the weights of the arcs in the tree. The problem of finding a maximum weighted spanning tree is to find the set of arcs connecting the features such that the resultant graph is a tree and the sum of the arc weights is maximized. Several algorithms have been proposed for building a maximum weighted spanning tree [5]; in our implementation we use Kruskal's algorithm.
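The following is a rough sketch of this construction for discrete features, with function names of our own choosing (this is not code from [7]): estimate the pairwise class-conditional mutual information from counts, then grow a maximum weighted spanning tree over the features with a Kruskal-style union-find.

```python
import numpy as np
from itertools import combinations

def cond_mutual_info(xi, xj, c):
    """Empirical class-conditional mutual information I(Xi; Xj | C)."""
    mi = 0.0
    for cv in np.unique(c):
        a, b = xi[c == cv], xj[c == cv]
        pc = np.mean(c == cv)
        for av in np.unique(a):
            for bv in np.unique(b):
                pab = np.mean((a == av) & (b == bv))   # P(Xi=av, Xj=bv | C=cv)
                pa, pb = np.mean(a == av), np.mean(b == bv)
                if pab > 0:
                    mi += pc * pab * np.log(pab / (pa * pb))
    return mi

def max_weight_spanning_tree(n, weighted_edges):
    """Kruskal: take arcs in decreasing weight order, skip any that close a cycle."""
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u
    tree = []
    for w, i, j in sorted(weighted_edges, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

def tan_feature_tree(X, c):
    """Undirected feature tree of a TAN structure (before orienting the arcs)."""
    n = X.shape[1]
    edges = [(cond_mutual_info(X[:, i], X[:, j], c), i, j)
             for i, j in combinations(range(n), 2)]
    return max_weight_spanning_tree(n, edges)

# Toy usage: 9 discretized patch features, binary skin / non-skin class
rng = np.random.default_rng(1)
X, c = rng.integers(0, 4, (1000, 9)), rng.integers(0, 2, 1000)
print(tan_feature_tree(X, c))
```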
3. Semi-supervised Learning of BN

Is there value to unlabeled data in supervised learning of classifiers? This fundamental question has been increasingly discussed in recent years, with a generally optimistic view that unlabeled data hold great value. Driven by an increasing number of applications and algorithms that successfully use unlabeled data [1, 8], and magnified by theoretical results on the value of unlabeled data in certain cases [3], semi-supervised learning is seen optimistically as a learning paradigm that can relieve the practitioner of the need to collect much expensive labeled training data.

Consider the following scenario. A sample $(c, \mathbf{x})$ is generated from $P(C, X)$. The value $c$ is then either revealed, and the sample is a labeled one, or hidden, and the sample is an unlabeled one. The probability that any sample is labeled, denoted by $\lambda$, is fixed, known, and independent of the samples. Thus the same underlying distribution $P(C, X)$ models both labeled and unlabeled data. Given a set of $N_l$ labeled samples and $N_u$ unlabeled samples, we use maximum likelihood for estimating the parameter set $\theta$. The assumed distribution $P(C, X \mid \theta)$ can be decomposed either as $P(C \mid X; \theta) P(X \mid \theta)$ or as $P(X \mid C; \theta) P(C \mid \theta)$. A parametric model where both factors depend explicitly on $\theta$ is referred to as a generative model. The log-likelihood function of this model for a dataset with labeled and unlabeled data is:

$L(\theta) = L_l(\theta) + L_u(\theta) + \big( N_l \log \lambda + N_u \log (1 - \lambda) \big)$   (3)

with

$L_l(\theta) = \sum_{i=1}^{N_l} \sum_{c} I(c_i = c) \log \big( p(c \mid \theta)\, p(\mathbf{x}_i \mid c; \theta) \big), \quad L_u(\theta) = \sum_{j=1}^{N_u} \log \sum_{c} p(c \mid \theta)\, p(\mathbf{x}_j \mid c; \theta)$

where $I(\cdot)$ is the indicator function (1 if its argument holds, 0 otherwise) and the $p(c \mid \theta)$ are the mixing coefficients. $L_l(\theta)$ and $L_u(\theta)$ are the likelihoods of the labeled and unlabeled data, respectively.

When unlabeled data are available, the parameters of the Naive Bayes classifier can be estimated using the EM algorithm. For the TAN classifier, we learn the structure and the parameters using the EM-TAN algorithm, derived from [12].
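As an illustration of how the unlabeled term of Eq. (3) can be maximized, here is a hedged EM sketch for the Gaussian Naive Bayes case (EM-TAN additionally re-learns the tree structure inside each iteration; that step is omitted here). The E-step fills in posterior class probabilities for the unlabeled samples; the M-step re-estimates mixing coefficients, means, and variances from the weighted data. This is a generic EM outline under our own naming, not the authors' code.

```python
import numpy as np

def em_gaussian_nb(Xl, yl, Xu, n_classes=2, n_iter=50):
    """EM for semi-supervised Gaussian NB (sketch; yl holds ints in 0..n_classes-1)."""
    R_l = np.eye(n_classes)[yl]                           # hard responsibilities (labeled)
    R_u = np.full((len(Xu), n_classes), 1.0 / n_classes)  # soft responsibilities (unlabeled)
    X = np.vstack([Xl, Xu])
    for _ in range(n_iter):
        R = np.vstack([R_l, R_u])
        # M-step: weighted ML estimates over labeled + unlabeled data
        Nk = R.sum(axis=0)
        pi = Nk / Nk.sum()                                # mixing coefficients p(c)
        mu = (R.T @ X) / Nk[:, None]
        var = np.stack([(R[:, [k]] * (X - mu[k]) ** 2).sum(0) / Nk[k]
                        for k in range(n_classes)]) + 1e-6
        # E-step: posteriors P(c | x) for the unlabeled samples only
        log_p = np.stack(
            [np.log(pi[k]) - 0.5 * (np.log(2 * np.pi * var[k])
                                    + (Xu - mu[k]) ** 2 / var[k]).sum(axis=1)
             for k in range(n_classes)], axis=1)
        log_p -= log_p.max(axis=1, keepdims=True)         # numerical stability
        R_u = np.exp(log_p)
        R_u /= R_u.sum(axis=1, keepdims=True)
    return pi, mu, var
```

Labeled rows keep hard 0/1 responsibilities throughout, which is exactly how the labeled likelihood $L_l$ enters the objective.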
Despite the optimistic view mentioned above, several disparate pieces of empirical evidence in the literature suggest that there are situations in which the addition of unlabeled data to a pool of labeled data degrades the classifier's performance [1], in contrast to the improvement obtained when more labeled data are added. In [4] we present an extensive analysis demonstrating that, counter to statistical intuition, when the assumed model of the classifier does not match the true data-generating distribution, classification performance can degrade as more and more unlabeled data are added to the training set. Motivated by this, we consider a classification-driven stochastic structure search (SSS) algorithm for learning the structure of Bayesian network classifiers that minimizes the probability of classification error.

First we define a measure over the space of structures which we want to maximize:

Definition 1. The inverse error measure for structure $S'$ is

$\mathrm{inv}_e(S') = \dfrac{1 / p_e(S')}{\sum_{S} 1 / p_e(S)}$   (4)

where the summation is over the space of possible structures and $p_e(S)$ is the probability of error of the best classifier learned with structure $S$.

We use Metropolis-Hastings sampling to generate samples from the inverse error measure, without ever having to compute it for all possible structures. To construct the Metropolis-Hastings sampler, we define the neighborhood of a structure as the set of directed acyclic graphs to which we can transit in the next step. Transitions are made using a predefined set of possible changes to the structure; at each transition a change consists of a single edge addition, removal, or reversal. We define the acceptance probability of a candidate structure $S'$ to replace a previous structure $S^t$ as follows:

$\min\left\{1, \left(\dfrac{\mathrm{inv}_e(S')}{\mathrm{inv}_e(S^t)}\right)^{1/T} \dfrac{q(S^t \mid S')}{q(S' \mid S^t)}\right\} = \min\left\{1, \left(\dfrac{p_e(S^t)}{p_e(S')}\right)^{1/T} \dfrac{N_t}{N'}\right\}$   (5)

where $q(S' \mid S^t)$ is the transition probability from $S^t$ to $S'$, $T$ is a temperature factor, and $N_t$ and $N'$ are the sizes of the neighborhoods of $S^t$ and $S'$, respectively. This choice corresponds to an equal probability of transition to each member of a structure's neighborhood, and it yields a Markov chain that is aperiodic and irreducible, thus satisfying the Markov chain Monte Carlo (MCMC) conditions.

Roughly speaking, $T$ close to 1 allows acceptance of more structures with a higher probability of error than the previous structure, while $T$ close to 0 mostly allows acceptance of structures that improve the probability of error. A fixed $T$ amounts to changing the distribution being sampled by the MCMC, while a decreasing $T$ is a simulated annealing run, aimed at finding the maximum of the inverse error distribution. The rate of decrease of the temperature determines the rate of convergence. Asymptotically in the number of data, a logarithmic decrease of $T$ guarantees convergence to a global maximum with probability tending to one.

The advantages of the SSS algorithm are that it usually converges to better classifiers than other methods do, and it can be shown to converge asymptotically to the classifier with minimum error. Its biggest disadvantage is the added complexity: for every structure being tested, the parameters must be estimated, followed by error estimation.
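A sketch of the accept/reject step of Eq. (5). Here `estimate_error` (train parameters for a structure, then estimate its classification error) and `neighborhood` (enumerate the valid single-edge additions, removals, and reversals) are placeholders the caller must supply, and the geometric cooling schedule is a simplification of the logarithmic schedule discussed above.

```python
import random

def sss_step(S, p_err_S, estimate_error, neighborhood, T):
    """One Metropolis-Hastings move of the stochastic structure search (Eq. 5)."""
    nbrs = neighborhood(S)
    S_new = random.choice(nbrs)                 # uniform proposal in the neighborhood
    p_err_new = estimate_error(S_new)
    # (p_e(S^t) / p_e(S'))^(1/T) * |N(S^t)| / |N(S')|
    ratio = (p_err_S / max(p_err_new, 1e-12)) ** (1.0 / T)
    ratio *= len(nbrs) / max(len(neighborhood(S_new)), 1)
    if random.random() < min(1.0, ratio):
        return S_new, p_err_new                 # accept the candidate structure
    return S, p_err_S                           # reject: keep the current one

def sss_search(S0, estimate_error, neighborhood, T0=1.0, decay=0.95, steps=200):
    """Annealed run: as T decreases, error-increasing moves are accepted less often."""
    S, p_err = S0, estimate_error(S0)
    best, best_err, T = S, p_err, T0
    for _ in range(steps):
        S, p_err = sss_step(S, p_err, estimate_error, neighborhood, T)
        if p_err < best_err:
            best, best_err = S, p_err
        T *= decay                              # geometric cooling (simplification)
    return best, best_err
```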
4. Skin Detection Experiments

In our experiments we use image patches of 9 pixels (a 3x3 patch) as the features in the Bayesian network. The patches are represented in a chromaticity space, the most popular choice for skin color modeling [16].

Figure 1. An example of skin detection.

We use the database of Jones and Rehg [10], which consists of 3,475 images containing skin and 8,796 non-skin images. Each image was manually segmented such that the skin pixels are labeled. Examples of detected skin patches are presented in Figure 1. In the experiments we randomly selected 3x3 skin and non-skin patches (100,000 in total). We leave out 40,000 patches for testing and train the Bayesian network classifiers on the remaining 60,000. To compare the results of the classifiers, we use receiver operating characteristic (ROC) curves. The ROC curves show, under different classification thresholds ranging from 0 to 1, the probability of detecting a skin patch in a skin image, $P(\hat{C} = \text{skin} \mid C = \text{skin})$, against the probability of falsely detecting a skin patch in a non-skin image, $P(\hat{C} = \text{skin} \mid C = \text{non-skin})$.
We first learn with all of the training data labeled (that is, 60,000 labeled patches). Figure 2 (left) shows the resulting ROC curves. The classifier learned with the SSS algorithm outperforms both the TAN and NB classifiers, and all perform quite well, achieving high detection rates with a low rate of false alarm. Next we remove the labels of some of the training data and train the classifiers. Figure 2 (right) shows the case where the labels of 90% of the training data (leaving only 600 labeled patches) were removed. We see that the NB classifier using both labeled and unlabeled data (NB-LUL) performs very poorly. The TAN trained only on the 600 labeled patches (TAN-L) and the TAN trained on the labeled and unlabeled data (TAN-LUL) are close in performance; thus there was no significant degradation of performance when the unlabeled data were added.

Figure 2. ROC curves showing skin detection rates against false detection rates with all data labeled (left) and with 90% of the labels removed (right): SSS, NB learned with labeled data only (NB-L) and with labeled and unlabeled data (NB-LUL), and TAN learned with labeled data only (TAN-L) and with labeled and unlabeled data (TAN-LUL).

In Table 1 we summarize the results obtained with the different algorithms in the presence of an increasing amount of unlabeled data. We fixed the false alarm rate at 1%, 5%, and 10% and computed the corresponding detection rates. Note that the detection rates for NB are lower than those obtained with the other detectors. Overall, the results obtained with SSS are the best. We see that even in the most difficult cases there was a sufficient amount of unlabeled data to achieve almost the same performance as with a large labeled dataset.

Table 1. Detection rates (%) at fixed false detection rates.

Detector  Training data                    1%      5%      10%
NB        60,000 labeled                   64.74   85.26   90.64
NB        600 labeled                      61.34   81.27   84.85
NB        600 labeled + 54,000 unlabeled   60.05   80.77   83.98
TAN       60,000 labeled                   86.85   96.22   99.00
TAN       600 labeled                      84.50   88.84   93.63
TAN       600 labeled + 54,000 unlabeled   84.66   88.82   93.01
SSS       60,000 labeled                   88.25   97.61   99.40
SSS       600 labeled                      85.23   92.51   96.15
SSS       600 labeled + 54,000 unlabeled   87.66   95.82   98.32
5. Summary and Discussion

In this work we presented a Bayesian network approach for skin detection. We considered several instances of Bayesian networks and suggested a methodology for performing skin detection using both labeled and unlabeled data. In a nutshell, when faced with the option of learning with labeled and unlabeled data for skin detection using Bayesian networks, our discussion suggests the following path. Start with Naive Bayes and TAN classifiers, learn only with the available labeled data, and test whether the model is correct by learning with the unlabeled data. If the result is not satisfactory, SSS can be used to attempt to further improve performance. If none of the methods using the unlabeled data improves performance over the supervised TAN (or Naive Bayes), the practitioner is faced with two options: discard the unlabeled data, or label some of the unlabeled data using the active learning methodology. Of course, active learning can be used as long as there are resources to label some samples.

Structure learning of Bayesian networks is not a topic motivated solely by the use of unlabeled data, and skin detection could be solved using classifiers other than Bayesian networks. However, this work should be viewed as a combination of three components: (1) the theory showing the limitations of unlabeled data is used to motivate (2) the design of algorithms that search for better-performing structures of Bayesian networks, leading to (3) the successful application to skin detection by learning with labeled and unlabeled data.

References

[1] S. Baluja. Probabilistic modeling for face orientation discrimination: Learning from labeled and unlabeled data. In NIPS, pages 854–860, 1998.
[2] A. Bosson, G. Cawley, Y. Chan, and R. Harvey. Non-retrieval: Blocking pornographic images. In CIVR, pages 50–60, 2002.
[3] V. Castelli. The relative value of labeled and unlabeled samples in pattern recognition. PhD thesis, Stanford University, 1994.
[4] I. Cohen, F. Cozman, N. Sebe, M. Cirello, and T. Huang. Semi-supervised learning of classifiers: Theory, algorithms, and applications to human-computer interaction. IEEE Trans. on PAMI, to appear, 2004.
[5] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
[6] M. Fleck, D. Forsyth, and C. Bregler. Finding naked people. In ECCV, volume 2, pages 593–602, 1996.
[7] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2):131–163, 1997.
[8] R. Ghani. Combining labeled and unlabeled data for multiclass text categorization. In ICML, pages 187–194, 2002.
[9] B. Jedynak, H. Zheng, and M. Daoudi. Statistical models for skin detection. In CVPR Workshop on Statistical Analysis in Computer Vision, 2003.
[10] M. Jones and J. Rehg. Statistical color models with application to skin detection. IJCV, 46(1):81–96, 2002.
[11] S. McKenna, S. Gong, and Y. Raja. Modelling facial colour and identity with Gaussian mixtures. Pattern Recognition, 31:1883–1892, 1998.
[12] M. Meila. Learning with mixtures of trees. PhD thesis, MIT, 1999.
[13] D. Saxe and R. Foulds. Toward robust skin identification in video images. In Automatic Face and Gesture Recognition, pages 379–384, 1996.
[14] K. Schwerdt and J. Crowley. Robust face tracking using color. In Automatic Face and Gesture Recognition, pages 90–95, 2000.
[15] M.-H. Yang and N. Ahuja. Detecting human faces in color images. In ICIP, pages 127–130, 1998.
[16] M.-H. Yang, D. Kriegman, and N. Ahuja. Detecting faces in images: A survey. PAMI, 24(1):34–58, 2002.