
Structured Multimedia Document Classification

Ludovic Denoyer, Jean-Noël Vittaut, Patrick Gallinari
LIP6 – University of Paris 6, Paris, France
{denoyer,vittaut,gallinari}@ia.lip6.fr

Sylvie Brunessaux, Stephan Brunessaux
EADS S&DE, Val de Reuil, France
[email protected], [email protected]

ABSTRACT

We propose a new statistical model for the classification of structured documents and consider its use for multimedia document classification. Its main originality is its ability to simultaneously take into account the structural and the content information present in a structured document, and to cope with different types of content (text, image, etc.). We present experiments on the classification of multilingual pornographic HTML pages using text and image data. The system accurately classifies pornographic sites in 8 European languages. The corpus was developed by the EADS company in the context of a large Web site filtering application.

Keywords
Categorization, Structured Document, Multimedia Document, Bayesian Networks, Generative Model, Statistical Machine Learning, Web Page Filtering

Categories and Subject Descriptors
I.2 [Artificial Intelligence]: Learning; I.7 [Document and Text Processing]: Miscellaneous

General Terms
Algorithms

1. INTRODUCTION

The development of the Web and the growing number of documents available electronically have been paralleled by the emergence of semi-structured data models for representing textual or multimedia documents. These models make it possible to encode the document content and its logical structure, i.e. the relations between document elements – denoted doxels in the following – and to enrich the document description with different types of meta-data. These representations are also useful for efficiently storing and accessing this type of data. Description languages for structured documents such as HTML or XML have gained popularity and are now widely used. Given the growing number of structured document collections, it is important to develop tools able to take into account the increased complexity of these representations and the diversity of doxel types inside a document, and to address document parts and the relations between doxels. Up to now, Information Retrieval (IR) has mainly been developed for handling flat documents, and IR methods should now adapt to these new types of documents. We focus here on the particular task of document classification. This is a generic problem with many different applications, such as document indexing, e-mail or spam filtering, document ranking and document categorization. Although classification has been considered in IR for a long time, it is mainly since the nineties that it has gained popularity and developed as a sub-branch of the IR domain. Much progress in this area has been obtained through recent machine learning classification techniques. Most classification models for text or image were developed before the emergence of structured documents and are devoted only to flat representations. Recently, some attempts have been made to adapt these techniques to the classification of complex documents, e.g. XML textual documents or multimedia documents. This is usually done in a crude way, by combining basic classifiers trained independently on different components of a document.

The work described here is an attempt to develop a more principled approach to the problem of structured multimedia document classification. We propose a new model which simultaneously takes into account the structure and the content information of electronic documents. This model offers a natural framework for the integration of different information sources. It is based on a statistical framework: Bayesian networks are used to model the documents and to combine the information present in the doxels. We present tests on the problem of Web page filtering using a large database gathered in the context of the European project "NetProtect".

The paper is organized as follows: we first review in part 2 existing work on the classification of structured documents. We then describe our model for classifying multimedia documents in part 3. Finally, we describe the NetProtect II corpus used for our experiments and present a series of experiments.
2. PREVIOUS WORK

A large amount of work has been devoted over the last few years to flat document categorization, either text or image. In the information retrieval community, it is often considered that structure should play an important role: for example, a word can have different meanings according to its location in the document, and a document can be considered relevant for a class if only one of its parts is relevant. The rapid growth of structured electronic documents has recently motivated the emergence of a new line of research for accessing these documents. We briefly review below recent work on structured and multimedia document classification.

2.1 Structured document classification

We consider stochastic classifier models which score any document class according to the value of the posterior probability P(class|document). Generally speaking, classifiers fall into two categories: generative models, which estimate class conditional densities P(document|class), and discriminant models, which directly estimate the posterior probabilities P(class|document). For instance, the Naive Bayes model [13] is a popular generative categorization model, while Support Vector Machines [10] are a discriminant model. See [16] for an exhaustive review of categorization models for flat documents. In machine learning, most classifiers have been designed to cope with vector or sequence representations; only very few models can consider content and structure information simultaneously. The growing need for handling structured objects – documents, biological structures, chemical compounds, etc. – has recently motivated some interest in this area, but the field is still new and widely open.

Some years ago, in the IR community, the development of the Web created a need for classifying HTML pages, see e.g. the last two TREC competitions [17]. In an HTML page, the different parts of the text do not play the same role and do not have the same importance; e.g. titles, links and body text can be considered as different sources of information. Most of the techniques proposed for HTML classification use prior knowledge about the meaning of HTML tags, either to encode the page structure using very simple schemes or to combine basic flat classifiers [9, 18]. These first attempts show that combining the different types of information present in a web page may sometimes increase page categorization scores. This is not systematic, and many of the early trials did not show any improvement at all. These ideas do not naturally extend to more general document representations. More recently, different techniques have been developed for general structured document classification. These models are not HTML-specific and can be used in particular for XML documents. For example, [19] presents an extension of the Naive Bayes model to semi-structured documents where, essentially, global word frequency estimators are replaced with local estimators computed for each path element. A drawback of this technique is the dramatic index growth, which leads to poor estimation of the probabilities. The Hidden Tree Markov Model (HTMM) proposed by [8] extends classical HMMs to semi-structured representations. Documents are represented by a tree and, for each node, words are generated by a specific HMM. This model has been used for HTML classification. As for discriminative models, [15] proposed a model based on Bayesian networks which directly computes the posterior probability corresponding to document relevance for each class.

2.2 Multimedia documents

Many different methods have been proposed for the classification of multimedia documents. Most work in the multimedia area makes use of text information, such as keywords, as an auxiliary source of information for enhancing the performance of existing image, video or song classifiers. For example, [4] propose a system that combines textual and visual statistics into a single index vector. Textual statistics are captured using latent semantic indexing (LSI), and visual ones are color and orientation histograms. This approach improves the performance of content-based search. [2] present a generative hierarchical model where the data is modeled as being generated by a fixed hierarchy of nodes; each node in the tree has some probability of generating each word or image feature. This model could be useful for IR tasks such as database browsing and searching for images based on text and image features. In [14], a document is represented as a collection of objects which are themselves represented as collections of features (words for text, color and texture features for images); several similarity measures over text and images are then combined. More recently, a method has been proposed for using images for word sense disambiguation, which suggests that combining image features and text can outperform simple text classifiers [3]. In the field of Web filtering, [11] combine a naked-people photo detector with a standard text classifier using an "OR" operator. [5] present an algorithm to identify images that contain large areas of skin; for identifying pornographic pages, they combine this skin detector with a text classifier via a weighting scheme. They also claim that text-based approaches generally give poor results; in contrast, our structured model performs well even on text-only documents. Most of these attempts thus rely on the combination of basic classifiers trained independently on the different information sources (e.g. text and images). They do not consider the global context of the document nor its logical organization, i.e. they ignore the relations between the different document parts.

The model we propose provides a general framework for the classification of structured multimedia documents. It can be used with any structural representation or language (e.g. HTML or XML). This model is an extension to multimedia data of the model proposed in [6] and [7], which operates only on textual structured documents. Previous work has demonstrated the efficiency of the approach on large XML textual corpora. We show here how to combine different information sources (text, image, sound, etc.) in a natural way. The model was developed for large databases, and we will see that its complexity is linear in the size of the documents.

3. MULTIMEDIA GENERATIVE MODEL

We present here the multimedia generative model for structured documents. We first explain the context of the classification task. We then give the different hypotheses used to model structured documents and show how this global model can be seen as a weighted mixture of local generative models. Finally, we describe the training algorithm.
[Figure 1: An example of multimedia structured document]

3.1 Context

Let D be the set of all documents and C = {c_1, ..., c_{|C|}} be the set of all classes, where |C| is the number of classes. We consider that a document can be either Relevant or Irrelevant for a specific class c. We then transform our problem with |C| classes into |C| two-class problems (Relevant R or Irrelevant I), so that in the following we discuss only the two-class problem. We adopt a machine learning approach: the model parameters are learned from a labeled training set of representative documents from each class.

Since we adopt a stochastic approach to the classification problem, a structured document d in D is the realization of a random variable D. We use the generative approach to classification: the model computes the probability of generating a specific document, P(D = d|θ), where θ corresponds to the parameters of the generative model. P(d|θ) will be used as a shorthand for P(D = d|θ).

In order to use our generative model for classification, we learn one model θ_R for the relevant class and one model θ_I for the irrelevant class. The score of a document is then computed using Bayes' rule:

    P(R|d) = P(R)P(d|R) / (P(d|R)P(R) + P(d|I)P(I))
           = P(R)P(d|θ_R) / (P(d|θ_R)P(R) + P(d|θ_I)P(I))    (1)

In the following, we denote the model parameters by θ, corresponding either to θ_R or to θ_I.

3.2 Description

A generative stochastic model for a document corresponds to specific hypotheses about the physical generation of this document. Different hypotheses should be considered, and the choice of a particular model most often corresponds to a compromise between an accurate representation of the document generation process and practical constraints depending on the task the model will be used for, the difficulty of accurately estimating model parameters from data collections, the availability of labeled corpora, etc. Different hypotheses for structured textual document representations are discussed in [7], where it is shown that the best performances for text classification are obtained with rather simple models of the dependencies between doxels. As is often observed in document classification, more sophisticated models do not lead to increased performance. We adapt here one of these simple models to multimedia documents. The corresponding generative process is as follows: an author who wants to build a document about a specific topic (class) will first imagine the global logical structure of the document; once this structure is built, he will then fill in the content of each structural element. This content will depend on the type of the structural element: the process of writing a title differs from that of writing a paragraph, and writing text is different from inserting an image or a piece of music. This process is a simplified view of reality and, as will be seen below, additional simplifying assumptions will be introduced in order to meet practical constraints. It nevertheless embodies some crucial facts about structured documents: doxels differ depending on their type and on the logical element they belong to; both the logical structure and the plain content of a document are essential for its description; and the topic of a document may influence both its logical structure and its content. The latter point implies that the logical structure of a document may sometimes contain important information for characterizing the document class.

Let us now formally define the model. We consider D as a random variable D = (S, T), where S is the variable corresponding to the structure of a document and T corresponds to the content information. Let d = (s_d, t_d) be a realization of D, where s_d represents the structure and t_d the content information of document d. We have:

    P(d|θ) = P(s_d|θ) P(t_d|s_d, θ)    (2)

In this equation, P(s_d|θ) is the probability of generating the structural information and P(t_d|s_d, θ) is the probability of the content information.
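As an illustration of the decision rule, here is a minimal log-space Python sketch (our own code, not from the paper): the two trained generative models supply log P(d|θ_R) and log P(d|θ_I), and equation (1) turns them into a posterior relevance score.

```python
import math

def posterior_relevance(log_p_d_given_R: float, log_p_d_given_I: float,
                        prior_R: float = 0.5) -> float:
    """Bayes' rule of equation (1), computed in log space for stability.

    The two arguments are log P(d|theta_R) and log P(d|theta_I), as returned
    by the relevant-class and irrelevant-class generative models."""
    prior_I = 1.0 - prior_R
    log_num = math.log(prior_R) + log_p_d_given_R   # log P(R) P(d|theta_R)
    log_alt = math.log(prior_I) + log_p_d_given_I   # log P(I) P(d|theta_I)
    # P(R|d) = exp(log_num) / (exp(log_num) + exp(log_alt)), via log-sum-exp
    # to avoid underflow on long documents.
    m = max(log_num, log_alt)
    log_den = m + math.log(math.exp(log_num - m) + math.exp(log_alt - m))
    return math.exp(log_num - log_den)
```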
The structure of d consists of a set of nodes and their dependence relations. The set of nodes is denoted:

    s_d = (s_d^1, ..., s_d^{|d|})    (3)

where s_d^i is the i-th node of document d and |d| is the number of nodes of the document.

We consider only tree-like documents; this is a reasonable simplification of real structured documents. Each node corresponds to a structural entity of the document (e.g. paragraph, section).

Let pa(s_d^i) denote the parent of s_d^i. The structure of the document is described by the set {s_d^i, pa(s_d^i)}. Nodes take their values in Λ (i.e. s_d^i ∈ Λ), which is the set of all possible node labels. Typically, for an XML document, this set is defined by the DTD and is the set of all possible tags. Figure (1) represents a structured multimedia document, while figure (2) shows the associated tree structure. For this example, we have Λ = (title, paragraph, image, section).

[Figure 2: The structure graph corresponding to the previous example – the root Document node has children Title, Paragraph and Section; the Section node has children Title, Paragraph and Image]

Document content is denoted:

    t_d = (t_d^1, ..., t_d^{|d|})    (4)

where t_d^i represents the content information of the i-th node of the document.

We make the hypothesis that each tag label contains only one type of information (text, image, sound, ...). This hypothesis is not restrictive, and it is often true in XML documents.

With these notations, (2) writes:

    P(d|θ) = P(s_d^1, ..., s_d^{|d|}|θ) P(t_d^1, ..., t_d^{|d|}|s_d^1, ..., s_d^{|d|}, θ)    (5)

We now detail the content and structural parts of the document model.

3.2.1 Content probability

We make the following hypotheses in order to compute the content probability P(t_d|s_d, θ):

Hypothesis 1. Conditional independence: given the structural organization of the document, the content information of the different nodes is independent.

Hypothesis 2. First-order dependency: the content information depends only on the structural node containing it and not on other structural nodes.

These hypotheses are simplifications of the real dependencies between the different parts of a document. They are needed to keep the complexity of the model reasonably low and to allow its use on very large corpora. Returning to our generation process, this means that once the document organization has been decided, each content node is filled independently of the others, by considering only the type of the structural element it belongs to. All content elements with the same type (paragraph, etc.) share the same generative process. It could seem more natural to consider that content elements are filled in sequence, but early tests with such a model did not lead to improved results, at the price of an increased complexity, and this was then left out. Note that such simplifications are frequent in stochastic modeling and have led to very efficient models in different application areas.

Using hypotheses 1 and 2, we can rewrite the content probability as:

    P(t_d|θ) = ∏_{i=1}^{|d|} P(t_d^i|s_d^i, θ)    (6)

According to hypothesis 2, each node type in the structure has its own generative content model. Let θ_{s_d^i} be the parameters of the generative model associated with label s_d^i; we have:

    P(t_d|θ) = ∏_{i=1}^{|d|} P(t_d^i|s_d^i, θ_{s_d^i})    (7)

The models with parameters θ_l, l ∈ Λ, are the local generative models associated with nodes of label l. As an example, for modeling the document in figures (1) and (2), we use 3 local generative models:

• a textual generative model with parameters θ_title for the text contained in title tags
• a textual generative model with parameters θ_paragraph for paragraph tags
• an image generative model with parameters θ_image for the images contained in image tags

The content probability of the whole document is then computed using a mixture of these local generative models (see (7)).

3.2.2 Structural probability

In our document model, each node s_d^i has only one parent pa(s_d^i). The structural probability is computed as:

    P(s_d|θ) = ∏_{i=1}^{|d|} P(s_d^i|pa(s_d^i), θ)    (8)

The hypothesis here is that a structural doxel depends only on its parent. For our generating process, this means that, starting at the root, the first level of structural elements is built; after that, the descendants of a structural node are built independently of the descendants of the node's brothers. Here again, this can be viewed as a simplified process for defining the logical organization of a document.
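The following minimal Python sketch (our own illustration with hypothetical names, not code from the paper) shows a tree representation of the example document of figure (2) and the structural probability of equation (8): each node contributes one transition probability conditioned on its parent's label, with a virtual ROOT label assumed for the root. The transition estimates correspond to the parameters θ^s introduced just below.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Doxel:
    label: str                    # element of Lambda, e.g. "title", "image"
    content: object = None        # text string, pixel histogram, ...
    children: list = field(default_factory=list)

def log_structural_prob(node: Doxel, trans: dict, parent_label: str = "ROOT") -> float:
    """log P(s_d | theta) of equation (8): a product (here, a sum of logs) of
    P(label | parent label) terms over all nodes. `trans` maps a
    (label, parent_label) pair to its estimated probability."""
    logp = math.log(trans[(node.label, parent_label)])
    for child in node.children:
        logp += log_structural_prob(child, trans, node.label)
    return logp

# The example document of figures (1) and (2):
doc = Doxel("document", children=[
    Doxel("title", "A title"),
    Doxel("paragraph", "some text ..."),
    Doxel("section", children=[
        Doxel("title", "A section title"),
        Doxel("paragraph", "more text ..."),
        Doxel("image", [0] * 216),
    ]),
])
```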
Let θ^s_{s_d^i, pa(s_d^i)} be an estimation of P(s_d^i|pa(s_d^i)); we can rewrite equation (8) as:

    P(s_d|θ) = ∏_{i=1}^{|d|} θ^s_{s_d^i, pa(s_d^i)}    (9)

3.2.3 Final probability

Using equations (7) and (9), we obtain the final probability:

    P(d|θ) = ( ∏_{i=1}^{|d|} θ^s_{s_d^i, pa(s_d^i)} ) ( ∏_{i=1}^{|d|} P(t_d^i|s_d^i, θ_{s_d^i}) )    (10)

The parameter set of our generative model is the vector θ where

    θ = θ^s ∪ ( ∪_{l∈Λ} θ_l )

with θ^s the set of structural parameters (the P(s_d^i|pa(s_d^i))) and θ_l the parameters of the local generative model for the nodes with label l.

Equation (10) then writes:

    P(d|θ) = ∏_{i=1}^{|d|} θ^s_{s_d^i, pa(s_d^i)} P(t_d^i|s_d^i, θ_{s_d^i})    (11)

From equation (11), it can be seen that our global generative model corresponds to a mixture of local generative models of the document content, weighted by transition probability models depending on the document structure.

3.3 Learning

The model parameters are learned by maximizing the data likelihood: the model for class c is trained on the data of class c. The log-likelihood of our training set D_TRAIN is:

    L = ∑_{d∈D_TRAIN} [ ∑_{i=1}^{|s_d|} log P(s_d^i|pa(s_d^i), θ^s) + ∑_{i=1}^{|s_d|} log P(t_d^i|θ_{s_d^i}) ]
      = ∑_{d∈D_TRAIN} ∑_{i=1}^{|s_d|} log P(s_d^i|pa(s_d^i), θ^s) + ∑_{d∈D_TRAIN} ∑_{i=1}^{|s_d|} log P(t_d^i|θ_{s_d^i})    (12)
      = L_structure + L_content

The maximization of L amounts to two separate maximizations over L_structure and L_content. This is a classical optimization problem, and the solution for the structural and content parameters is described in the appendix.

[Figure 3: The belief network corresponding to the previous example]

4. EXPERIMENTS

In the following, we demonstrate the potential of the model for classifying structured documents with image and text content. Tests are performed on the particular task of HTML page filtering. We first describe the corpus, then detail the local generative models used in this application for modeling the text and image information, and finally detail the performances obtained on this corpus.

4.1 The NetProtect II Corpus

We used for the experiments a database of HTML documents provided by the EADS company. These documents were collected on the web in the context of the European project NetProtect [1]. The project is aimed at developing a library of software tools for Internet access filtering. There are 4 categories to be filtered – pornography, violence, bomb-making and drugs – and 8 European languages – Dutch, English, French, German, Greek, Italian, Portuguese, Spanish. The database has been manually labeled.

In the experiments below, we consider all 8 languages, but only the pornography category, for which most web pages have both image and text content. Previous experiments with text-only content have shown similar performances across the different categories. The dataset consists of 6613 pornographic pages, 2886 non-pornographic pages with ambiguous content or dealing with sexuality, and 10153 general non-pornographic Web pages collected using the Google search engine. Figure (4) presents the corpus size for the different languages and for each group (porn, ambiguous, general).

    Language    | porno | general not porno | ambiguous | total not porno
    ------------|-------|-------------------|-----------|----------------
    French      |   830 |              2042 |       420 |            2462
    English     |  3808 |              1827 |       640 |            2467
    German      |   357 |              1428 |       290 |            1718
    Dutch       |   349 |              1200 |       220 |            1420
    Portuguese  |    63 |               200 |        93 |             293
    Spanish     |   530 |              1448 |       641 |            2089
    Greek       |   309 |               870 |       359 |            1229
    Italian     |   368 |              1138 |       223 |            1361
    total       |  6614 |             10153 |      2886 |           13039

    Figure 4: The size of each class for the 8 languages

In our experiments, we used half of each group for training and the remaining half for testing.

4.2 Content classifiers: text and image

The model described in section 3.2 makes use of a different classifier for each structural element type. For all textual doxels we used a Naive Bayes model (NB) [12], and for all image doxels a histogram generative model. For the text, we use one specific NB model for each HTML tag, i.e. all text doxels under the same tag share the same model. All image doxels are considered to be of the same type and share the same generative model.

For classifying a document, the complexity of our model is O(|s_d| + |t_d|), where |s_d| is the number of structural nodes and |t_d| is the number of words and image components. This complexity is dominated by |t_d| and is of the same order as Naive Bayes.
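To make the linear-complexity claim concrete, here is a hedged sketch of the full scoring rule of equation (11), reusing the hypothetical Doxel tree from the earlier sketch: a single traversal adds one structural term and one local content term per doxel, hence O(|s_d| + |t_d|) work overall.

```python
import math

def log_p_document(node, trans, local_models, parent_label="ROOT"):
    """log P(d|theta) of equation (11): one transition term theta^s and one
    local content term per doxel, accumulated in a single pass over the tree.

    `trans` maps (label, parent_label) -> theta^s; `local_models` maps a label
    to an object exposing log_prob(content), e.g. a per-tag Naive Bayes model
    for text labels and a histogram model for the image label."""
    logp = math.log(trans[(node.label, parent_label)])
    if node.content is not None:
        logp += local_models[node.label].log_prob(node.content)
    for child in node.children:
        logp += log_p_document(child, trans, local_models, node.label)
    return logp
```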

4.2.1 Text model

The text model used here is the Naive Bayes model. This is a reference model which has already been used by many different authors ([12] for example) in different contexts. It is known to be very robust when the data belong to very high-dimensional spaces. Since we simultaneously consider 8 languages with different alphabets, the dictionary is rather large, about 20 000 terms over all languages (see part 4.2.3 for more details).

Let t_d^i = (w_{d,1}^i, ..., w_{d,|t_d^i|}^i) be the textual content of node i in d, where w_{d,k}^i represents the k-th word of node i in d and |t_d^i| is the number of words in node i.

NB computes the probability P(t_d^i|θ_{s_d^i}) as:

    P(t_d^i|s_d^i, θ_{s_d^i}) = ∏_{k=1}^{|t_d^i|} P(w_{d,k}^i|θ_{s_d^i})    (13)

Our model uses one Naive Bayes model for each label l ∈ Λ. The model with parameters θ_l is learned using the flat textual representation of all nodes with label l in the training set.

4.2.2 Image model

Before deciding on the image modeling, we made extensive preliminary experiments on the classification of pornographic images. As in [5], the conclusion was that the best workspace for detecting pornographic images is the RGB color space, and that additional components like texture or shape did not improve performance. We therefore decided to represent images by a color histogram in a normalized space.

Let t_d^i be an image; its histogram representation is:

    t_d^i = (p_{d,1}^i, ..., p_{d,N_c}^i)    (14)

where p_{d,k}^i is the number of pixels in the image with color k and N_c is the number of colors in the histogram. In order to keep image scores comparable, image size is normalized to N_p pixels before computing the histogram. Under the independence hypothesis, we have:

    P(t_d^i|θ_{s_d^i}) = ∏_{k=1}^{N_c} P(P_k = p_{d,k}^i|θ_{s_d^i})    (15)

where P(P_k = p_{d,k}^i|θ_{s_d^i}) is the probability that there are p_{d,k}^i pixels with color k in image t_d^i.

This model is learned using a simple pixel count over all the images in the training set.

4.2.3 Preprocessing

The textual parts of the HTML documents were cleaned by deleting digits, words shorter than three letters and all non-alphabetical symbols. We kept all accents and did not stem any word. In order to reduce the size of the vocabulary, we suppressed the words appearing in fewer than 20 documents. The final vocabulary was composed of 20834 terms. These choices satisfy different constraints. First, the model has to be simple enough to be used, for example, as an add-on to a browser. An additional constraint is that new languages should be easy to add for filtering. Within this context, it is unfeasible to use more sophisticated preprocessing, since it may differ considerably from one language to another and the corresponding system would be too slow.

Images were converted into features using color histograms of size 100. All images were projected into a 216-color space. The resulting set of parameters for the image generative model has a size of 21600 features (216 ∗ 100). Each doxel – text or image – is thus represented in a high-dimensional space of about 20 K dimensions.
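Below is a hedged Python sketch of the two local content models of sections 4.2.1 and 4.2.2; the class names, the add-one smoothing and the count-quantization details are our own assumptions, not specified in the paper. Both models return log-probabilities so that they plug directly into the document scorer sketched earlier.

```python
import math
from collections import Counter

class NaiveBayesTextModel:
    """One per label l: P(t|theta_l) = prod_k P(w_k|theta_l), equation (13)."""
    def __init__(self, vocabulary):
        self.vocab = set(vocabulary)
        self.counts = Counter()
        self.total = 0

    def fit(self, texts):
        # `texts`: flat textual content of all training nodes with label l.
        for text in texts:
            for w in text.split():
                if w in self.vocab:
                    self.counts[w] += 1
                    self.total += 1

    def log_prob(self, text):
        v = len(self.vocab)
        # Add-one smoothing (our addition) so unseen vocabulary words
        # do not zero out the whole product.
        return sum(math.log((self.counts[w] + 1) / (self.total + v))
                   for w in text.split() if w in self.vocab)

class HistogramImageModel:
    """P(t|theta_image) = prod_k P(P_k = p_k), equation (15): one distribution
    over quantized pixel counts per color, learned by counting."""
    def __init__(self, n_colors=216, n_levels=100):
        self.n_colors, self.n_levels = n_colors, n_levels
        # level_counts[k][b]: how many training images had quantized count b
        # for color k.
        self.level_counts = [Counter() for _ in range(n_colors)]
        self.n_images = 0

    def fit(self, histograms):
        # Each histogram: n_colors quantized pixel counts (levels 0..n_levels-1).
        for h in histograms:
            for k, level in enumerate(h):
                self.level_counts[k][level] += 1
            self.n_images += 1

    def log_prob(self, histogram):
        # Smoothed count model; the smoothing is our assumption.
        return sum(math.log((self.level_counts[k][level] + 1)
                            / (self.n_images + self.n_levels))
                   for k, level in enumerate(histogram))
```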
    model      | porno | not porno | micro average | macro average
    -----------|-------|-----------|---------------|--------------
    NB         |  92.4 |      87.3 |          88.4 |          89.9
    WPT        |  91.8 |      93.1 |          92.9 |          92.5
    IMAGE      |  88.4 |      77.6 |          82.7 |          83.0
    WPT-IMAGE  |  91.6 |      95.4 |          94.7 |          93.6

    Figure 5: The recall values for the 4 models

[Figure 6: The error rate for the 4 models – bar chart, y-axis: percentage of error, x-axis: IMAGE, NB, WPT, WPT-IMAGE]

4.3 Evaluation

We present the results obtained on the NetProtect corpus. We used 4 document models for this comparison:

• The Naive Bayes (NB) model is the reference baseline model on the textual information.
• The IMAGE model is our structured model with only images and no text.
• The WordPerTag (WPT) model is our structured model with only textual data.
• The WPT-IMAGE model is the multimedia model with text and image.

In order to evaluate our model, we use the recall obtained on the test corpus for each class (porn, non porn), i.e. the percentage of correctly classified documents of each class in the test set. A document is considered pornographic if:

    P(d|θ_pornographic) > P(d|θ_not pornographic)    (16)

In order to give a synthetic recall value for each classifier, we also compute their micro-average and macro-average recall. Macro-average recall is obtained by averaging the recall values of the porn and non-porn classes. Micro-average recall is obtained by weighting the average by the relative size of each class. These results are presented in figure (5). Figure (6) presents the error rate (100 − micro-average) for the 4 models.
microaverage) for the 4 models. Tree-Markov Models. In 6th International Conference
The baseline Naive-Bayes model yields reasonably high on Document Analysis and Recognition, Seattle, WA,
micro-average and macro-average recall (88.4 % and 89.9 USA, Aug. 2001.
%). The IMAGE model is lower and achieves only 82.7 % on [9] S. T. Dumais and H. Chen. Hierarchical classification
micro-average and 83% an macro-average. Our structured of Web content. In N. J. Belkin, P. Ingwersen, and

159
6. REFERENCES

[1] NetProtect project page, 2001. Available at http://www.netprotect.org/.
[2] K. Barnard and D. Forsyth. Learning the semantics of words and pictures. In Proc. 8th Int. Conference on Computer Vision, volume 2, pages 408–415, 2001.
[3] K. Barnard, M. Johnson, and D. Forsyth. Word sense disambiguation with pictures. In Workshop on Learning Word Meaning from Non-Linguistic Data, 2003.
[4] M. L. Cascia, S. Sethi, and S. Sclaroff. Combining textual and visual cues for content-based image retrieval on the world wide web. In Proc. IEEE Workshop on Content-Based Access of Image and Video Libraries, June 1998.
[5] Y. Chan, R. Harvey, and D. Smith. Building systems to block pornography. In Challenge of Image Retrieval, 1999.
[6] L. Denoyer and P. Gallinari. A belief networks-based generative model for structured documents. An application to the XML categorization. In MLDM 2003, 2003.
[7] L. Denoyer and P. Gallinari. Using Belief Networks and Fisher Kernels for structured document classification. In PKDD 2003, 2003.
[8] M. Diligenti, M. Gori, M. Maggini, and F. Scarselli. Classification of HTML documents by Hidden Tree-Markov Models. In 6th International Conference on Document Analysis and Recognition, Seattle, WA, USA, Aug. 2001.
[9] S. T. Dumais and H. Chen. Hierarchical classification of Web content. In N. J. Belkin, P. Ingwersen, and M.-K. Leong, editors, Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval, pages 256–263, Athens, GR, 2000. ACM Press, New York, US.
[10] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 137–142, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.
[11] M. J. Jones and J. M. Rehg. Detecting adult images. Technical report, 2002.
[12] D. D. Lewis. Representation and learning in information retrieval. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst, US, 1992.
[13] D. D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In C. Nédellec and C. Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 4–15, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.
[14] M. Ortega, K. Porkaew, and S. Mehrotra. Information retrieval over multimedia documents. In SIGIR Post-Conference Workshop on Multimedia Indexing and Retrieval (ACM SIGIR), 1999.
[15] B. Piwowarski, L. Denoyer, and P. Gallinari. Un modèle pour la recherche d'information sur des documents structurés. In 6èmes Journées internationales d'Analyse statistique des Données Textuelles (JADT 2002), Saint-Malo, France, Mar. 2002.
[16] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 2002.
[17] TREC. Text REtrieval Conference (TREC 2001), National Institute of Standards and Technology (NIST).
[18] Y. Yang, S. Slattery, and R. Ghani. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2-3):219–241, 2002.
[19] J. Yi and N. Sundaresan. A classifier for semi-structured documents. In Proc. Conf. Knowledge Discovery in Data, pages 190–197, 2000.

APPENDIX

A. MAXIMIZATION OF L_structure

We want to maximize:

    L_structure = ∑_{d∈D_TRAIN} ∑_{i=1}^{|s_d|} log P(s_d^i|pa(s_d^i), θ^s)
                = ∑_{d∈D_TRAIN} ∑_{i=1}^{|s_d|} log θ^s_{s_d^i, pa(s_d^i)}    (17)

under the constraints ∀m ∈ Λ, ∑_{l∈Λ} θ^s_{l,m} = 1.

Using Lagrange multipliers, for each (n, m) ∈ Λ × Λ we have:

    ∂( L_structure − λ_m ( ∑_n θ^s_{n,m} − 1 ) ) / ∂θ^s_{n,m} = 0    (18)

Let N^d_{n,m} be the number of times a node of label n has a parent of label m in document d. We solve:

    ( ∑_{d∈D_TRAIN} N^d_{n,m} ) / θ^s_{n,m} = λ_m    (19)

The solution is:

    θ^s_{n,m} = ( ∑_{d∈D_TRAIN} N^d_{n,m} ) / ( ∑_i ∑_{d∈D_TRAIN} N^d_{i,m} )    (20)
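Equation (20) amounts to counting parent–child label pairs over the training trees and normalizing per parent label. A minimal sketch (ours, over the hypothetical Doxel trees used in the earlier examples):

```python
from collections import Counter

def estimate_transitions(documents):
    """Maximum-likelihood estimate of theta^s_{n,m}, equation (20)."""
    pair_counts = Counter()     # N_{n,m} summed over the training set
    parent_counts = Counter()   # sum_i N_{i,m}

    def walk(node, parent_label="ROOT"):
        pair_counts[(node.label, parent_label)] += 1
        parent_counts[parent_label] += 1
        for child in node.children:
            walk(child, node.label)

    for doc in documents:
        walk(doc)
    return {(n, m): c / parent_counts[m] for (n, m), c in pair_counts.items()}
```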
B. MAXIMIZATION OF L_content

We want to maximize:

    L_content = ∑_{d∈D_TRAIN} ∑_{i=1}^{|s_d|} log P(t_d^i|θ_{s_d^i})
              = ∑_{l∈Λ} ∑_{d∈D_TRAIN} ∑_{i: s_d^i = l} log P(t_d^i|θ_l)    (21)
              = ∑_{l∈Λ} L^l_content

This maximization is performed by learning each local generative model on its own data.