Automatic Web Page Classification: Abstract
Jiří Materna
1 Introduction
1.1 Motivation
At the present time the World Wide Web is the largest repository of hypertext
documents and it is still growing rapidly. The Web comprises billions of
documents, authored by millions of diverse people and edited by no one
in particular. When we are looking for some information on the Web, going
through all documents is impossible, so we have to use tools which provide us
with relevant information only. A widely used method is to search for information
with full-text search engines such as Google (http://www.google.com) or
Seznam (http://search.seznam.cz). These systems process a list of keywords
entered by the user and look for the most relevant indexed web pages using
several ranking methods. Another way of accessing web pages is through
catalogs such as Dmoz (http://www.dmoz.org) or Seznam (http://www.seznam.cz).
These catalogs consist of thousands of web pages arranged by their semantic
content. The classification is usually done manually or is partly supported by
computers. It is evident that building large catalogs requires a lot of human
effort, so fully automated classification systems are needed.
Although several systems for documents written in English have been developed
(e.g. [1,2,3,4,5]), these approaches place emphasis neither on short
documents nor on the Czech language.
1.2 Objective
Classical methods of text document classification are not appropriate for web
document classification: many documents on the Web are too short or suffer
from a lack of linguistic data. This work addresses the problem with two novel
approaches.
2 Preprocessing
the title tag, which can hold important information about the domain). All
n-grams (n > 10) in which the portion of non-alphanumeric characters was
greater than 50 % were also marked as unwanted data.
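As an illustration, a minimal sketch of this filtering rule might look as follows. The n-grams are interpreted here as token windows, which is our assumption; the n > 10 limit and the 50 % threshold follow the text:

```python
def nonalnum_ratio(text: str) -> float:
    """Portion of non-alphanumeric (and non-space) characters in the text."""
    stripped = text.replace(" ", "")
    if not stripped:
        return 1.0
    return sum(not ch.isalnum() for ch in stripped) / len(stripped)

def mark_unwanted_ngrams(tokens, n=10, threshold=0.5):
    """Return the indices of tokens covered by some n-gram (more than n tokens)
    whose non-alphanumeric character ratio exceeds the threshold (50 % here)."""
    unwanted = set()
    for start in range(len(tokens) - n):
        window = tokens[start:start + n + 1]   # an n-gram with n > 10 tokens
        if nonalnum_ratio(" ".join(window)) > threshold:
            unwanted.update(range(start, start + n + 1))
    return unwanted
```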
A very important issue in document preprocessing is charset encoding detection.
Although the charset is usually declared in the header of the document, this is
not a rule. We have used a method of automatic charset detection based on the
byte distribution in the text [6]. This method works with a precision of about
99 %.
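In practice such byte-distribution-based detection can be performed, for example, with the chardet library, a Python port of the Mozilla detector described in [6]; the library choice is our illustration, not part of the original system:

```python
import chardet

def detect_charset(raw_bytes: bytes) -> str:
    """Guess the charset of a raw document from its byte distribution."""
    result = chardet.detect(raw_bytes)   # {'encoding': ..., 'confidence': ...}
    return result["encoding"] or "utf-8"

with open("page.html", "rb") as fh:
    data = fh.read()
text = data.decode(detect_charset(data), errors="replace")
```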
A lot of web sites allow users to choose a language, and some web pages
on the Czech internet are written primarily in a foreign language (typically
Slovak). With respect to the linguistic techniques used, we have to remove
such documents from the corpus. The detection of foreign languages is similar
to charset encoding detection and is based on the typical distribution of
character 3-grams. A training set of documents written in Czech has been built
and the typical distribution computed. The similarity of the training data with
an investigated document is evaluated using the cosine measure.
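A minimal sketch of this kind of language detection, assuming a reference 3-gram distribution has been precomputed from the Czech training set (the threshold value is an illustrative assumption):

```python
from collections import Counter
from math import sqrt

def char_trigram_distribution(text: str) -> Counter:
    """Relative frequencies of character 3-grams in the text."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values()) or 1
    return Counter({g: c / total for g, c in counts.items()})

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse distributions."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_czech(document: str, czech_profile: Counter, threshold: float = 0.6) -> bool:
    """Keep the document only if it is close enough to the Czech profile."""
    return cosine(char_trigram_distribution(document), czech_profile) >= threshold

# czech_profile = char_trigram_distribution(" ".join(czech_training_documents))
```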
3 Document Model
In order to use these data in machine learning algorithms, we need to convert
them into an appropriate document model. The most common approach is the
vector document model, where each dimension of the vector represents one
word (or token in the corpus). There are several methods of representing the
words.
Let m be the number of documents in the training data set, f_d(t) the frequency
of term t in document d for d ∈ {1, 2, ..., m}, and Terms the set of terms
{t_1, t_2, ..., t_n}.
$$v_i = \frac{f_d(t_i)}{m}$$
A disadvantage of the previous two methods may be that they treat all terms
in the same way – the terms are not weighted. This problem can be solved by
the IDF coefficient, which is defined for all t_i ∈ Terms as:
$$IDF(t_i) = \log_2\left(\frac{m}{|\{j : f_j(t_i) > 0\}|}\right)$$
For the TF and TF-IDF methods it is convenient to discretize their real values.
The MDL algorithm [11], based on information entropy minimization, has been
used.
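A sketch of computing TF and TF-IDF document vectors with the IDF coefficient defined above; raw term counts are used for TF here, which is a simplification, and the variable names are ours:

```python
from math import log2

def build_vectors(documents, terms):
    """documents: list of token lists; terms: ordered list of terms t_1 .. t_n.
    Returns TF and TF-IDF vectors for every document."""
    m = len(documents)
    # document frequency: in how many documents each term occurs
    df = {t: sum(1 for doc in documents if t in doc) for t in terms}
    idf = {t: log2(m / df[t]) if df[t] else 0.0 for t in terms}

    tf_vectors, tfidf_vectors = [], []
    for doc in documents:
        freq = {t: doc.count(t) for t in terms}        # f_d(t_i)
        tf_vectors.append([freq[t] for t in terms])
        tfidf_vectors.append([freq[t] * idf[t] for t in terms])
    return tf_vectors, tfidf_vectors
```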
4 Term Clustering
The table shows that the words included in the cluster are indeed semantically
similar. However, there are some problems with homonyms and tagging errors
(in this case the term aut). The characteristic set is defined so as to eliminate
words which occur in the corpus more frequently in senses other than the one
currently treated.
Let CHL(l) = [w_1, w_2, ..., w_k] be the characteristic list of the lemma l,
S(l) = {w_1, w_2, ..., w_k} and S_p(l) = {w_i | i ≤ k/p}, where p ∈ R+ is a constant
coefficient. The characteristic set is defined as

CH(l) = {w_i : q · |S(w_i) ∩ S_p(l)| ≥ |S_p(l)|}

where q ∈ R+ is an appropriate constant. The experiments have shown that the
best values seem to be p = 2 and q = 2.
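A direct sketch of this definition, assuming the characteristic lists CHL(l) are already available (e.g. from a distributional thesaurus) as a dictionary mapping each lemma to its ordered list:

```python
def characteristic_set(lemma, chl, p=2.0, q=2.0):
    """Compute CH(l) from the characteristic lists.

    chl: dict mapping a lemma l to its ordered characteristic list CHL(l).
    S_p(l) keeps only the top k/p items of CHL(l); a word w_i is kept in CH(l)
    when q * |S(w_i) & S_p(l)| >= |S_p(l)|.
    """
    chl_l = chl[lemma]                      # CHL(l) = [w_1, ..., w_k]
    k = len(chl_l)
    s_p = set(chl_l[: int(k / p)])          # S_p(l) = {w_i | i <= k/p}

    result = set()
    for w in chl_l:
        s_w = set(chl.get(w, []))           # S(w_i)
        if q * len(s_w & s_p) >= len(s_p):
            result.add(w)
    return result
```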
5 Attribute Selection
Even after application of the dictionary function there are too many different
terms in the corpus to use machine learning algorithms directly, and it is
necessary to select the most suitable ones. Statistics provides standard tools
for testing whether the class label and a single term are significantly correlated
with each other. For simplicity, let us consider a binary representation of the
model. Fix a term t and let
– k_{i,0} = the number of documents in class i not containing term t
– k_{i,1} = the number of documents in class i containing term t
This gives us the contingency matrix

    I_t \ C      1         2       ...     11
      0        k_{1,0}   k_{2,0}   ...   k_{11,0}
      1        k_{1,1}   k_{2,1}   ...   k_{11,1}

where C and I_t denote Boolean random variables and k_{l,m} denotes the number
of observations where C = l and I_t = m.
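A sketch of building this contingency matrix from labeled documents and scoring a term, here using scipy's chi-square test of independence as one possible implementation; the class count of 11 follows the paper, while the helper names are ours:

```python
import numpy as np
from scipy.stats import chi2_contingency

def term_contingency(documents, labels, term, num_classes=11):
    """Build the 2 x num_classes matrix of counts k_{i,0} (row 0) and k_{i,1} (row 1).

    documents: list of token sets, labels: class indices 1..num_classes.
    """
    k = np.zeros((2, num_classes), dtype=int)
    for tokens, label in zip(documents, labels):
        row = 1 if term in tokens else 0
        k[row, label - 1] += 1
    return k

def chi2_score(documents, labels, term):
    """Chi-square statistic measuring dependence between the class and the term."""
    table = term_contingency(documents, labels, term)
    stat, p_value, dof, expected = chi2_contingency(table)
    return stat
```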
5.1 χ² test
This measure is a classical statistical approach. We would like to test whether
the random variables C and I_t are independent. The difference between the
observed and expected values is defined as:
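In its standard (Pearson) form, which the definition above presumably follows, this statistic sums the squared differences between the observed counts k_{l,m} and the counts E_{l,m} expected under independence:

$$\chi^2 = \sum_{l=1}^{11}\sum_{m\in\{0,1\}} \frac{(k_{l,m} - E_{l,m})^2}{E_{l,m}},\qquad E_{l,m} = \frac{\bigl(\sum_{m'} k_{l,m'}\bigr)\bigl(\sum_{l'} k_{l',m}\bigr)}{\sum_{l',m'} k_{l',m'}}$$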
based on selected lemmas. In the third case, only nouns, adjectives, verbs and
adverbs have been selected. You can see that the overall accuracy grows in all
cases until about 12,000 attributes. Beyond this threshold the overall accuracy
does not vary significantly. The best result (83.4 %) was obtained using
clustering based on the same lemmas.
Finally, Figure 3 shows the results of the experiments with extended documents
and clustering based on the same lemmas and on both lemmas and the
dictionary. Compared to the previous experiment, the overall accuracy grows
by about 5.9 % for lemma-based clustering and by 8.2 % for dictionary-based
clustering.
7 Conclusion
We have presented a method of automatic web page classification into 11 given
semantic classes. Special attention has been paid to the treatment of short
documents, which often occur on the internet. Two approaches have been
introduced which enable classification with an overall accuracy of about 91 %.
Several machine learning algorithms and preprocessing methods have been
tested. The best result was obtained using Support Vector Machines with a
linear kernel function (followed by the k-nearest neighbors method) and the
term frequency document model with attribute selection by mutual information
score.
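As an illustration of this best-performing configuration, the following is a hedged sketch of such a pipeline using scikit-learn; the library, the parameter values and the feature count are our assumptions, not the original implementation, which used LIBSVM [16]:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Term-frequency document model, attribute selection by mutual information,
# and a linear-kernel SVM classifier, mirroring the setup described above.
pipeline = Pipeline([
    ("tf", CountVectorizer(lowercase=True)),
    ("select", SelectKBest(mutual_info_classif, k=12000)),
    ("svm", LinearSVC()),
])

# train_texts: list of preprocessed page texts, train_labels: class indices 1..11
# pipeline.fit(train_texts, train_labels)
# predictions = pipeline.predict(test_texts)
```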
References
1. Asirvatham, A.P., Ravi, K.K.: Web page categorization based on document structure
(2008) http://citeseer.ist.psu.edu/710946.html.
2. Santini, M.: Some issues in automatic genre classification of web pages. In: JADT
2006 – 8èmes Journées internationales d'analyse statistique des données textuelles,
University of Brighton (2006).
3. Mladenic, D.: Turning Yahoo into an automatic web-page classifier. In: European
Conference on Artificial Intelligence. (1998) 473–474.
4. Pierre, J.M.: On automated classification of web sites. 6 (2001)
http://www.ep.liu.se/ea/cis/2001/000/.
5. Tsukada, M., Washio, T., Motoda, H.: Automatic web-page classification by using
machine learning methods. In: Web intelligence: research and development,
Maebashi City, Japan (2001).
6. Li, S., Momoi, K.: A composite approach to language/encoding detection. 9th
International Unicode Conference (San Jose, California, 2001).
7. Sedláček, R.: Morphemic Analyser for Czech. Ph.D. thesis, Faculty of Informatics,
Masaryk University, Brno (2005).
8. Šmerk, P.: Towards Morphological Disambiguation of Czech. Ph.D. thesis proposal,
Faculty of Informatics, Masaryk University, Brno (2007).
9. Rychlý, P.: Korpusové manažery a jejich efektivní implementace [Corpus managers
and their effective implementation] (in Czech). Ph.D. thesis, Faculty of Informatics,
Masaryk University, Brno (2000).
10. Kilgarriff, A., Rychlý, P., Smrž, P., Tugwell, D.: The Sketch engine in practical
lexicography: A reader. (2008) 297–306.
11. Fayyad, U.M., Irani, K.B.: On the handling of continuous-valued attributes in
decision tree generation. Machine Learning 8 (1992) 87–102.
12. Kilgarriff, A.: Thesauruses for natural language processing. Proc NLP-KE (2003).
13. Berka, P.: Dobývání znalostí z databází [Knowledge discovery in databases] (in Czech). Academia (2003).
14. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and
model selection. In: IJCAI. (1995) 1137–1145.
15. Witten, I.H., Frank, E.: Data mining: Practical machine learning tools and tech-
niques. Morgan Kaufmann, San Francisco (2005).
16. Chang, C.C., Lin, C.J.: LIBSVM: a Library for Support Vector Machines. Technical
report, Department of Computer Science, National Taiwan University, Taipei 106,
Taiwan (2007).