Hierarchical Web Page Classification Based on a Topic Model and Neighboring Pages Integration

Sriurai, Wongkot; Meesad, Phayung; Haruechaiyasak, Choochart


arXiv:1003.1510 (cs)

[Submitted on 7 Mar 2010]

Abstract: Most Web page classification models use the bag-of-words (BOW) model to represent the feature space. The original BOW representation, however, cannot capture semantic relationships between terms. One possible solution is a topic model based on the Latent Dirichlet Allocation (LDA) algorithm, which clusters term features into a set of latent topics; terms assigned to the same topic are semantically related. In this paper, we propose a novel hierarchical classification method that uses a topic model and integrates additional term features from neighboring pages. The method consists of two phases: (1) feature representation using a topic model and neighboring-page integration, and (2) a hierarchical Support Vector Machines (SVM) classification model constructed from a confusion matrix. In our experiments, the proposed hierarchical SVM model, which integrates the current page with neighboring pages via the topic model, yielded the best performance, with an accuracy of 90.33% and an F1 measure of 90.14%, an improvement of 5.12% and 5.13%, respectively, over the original SVM model.
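The abstract does not say how the hierarchy is derived from the confusion matrix, so the following is only a minimal sketch of one plausible reading: classes that a flat classifier frequently confuses with each other are merged into the same group, and each group could then get its own second-level classifier. The function name `group_confused_classes`, the symmetric confusion rate, the threshold, and the example labels and counts are all assumptions made for illustration; none of them come from the paper.

```python
from itertools import combinations


def group_confused_classes(confusion, labels, threshold):
    """Group classes whose mutual confusion rate exceeds `threshold`.

    confusion[i][j] is the number of class-i pages predicted as class j.
    This is an illustrative grouping rule, not the authors' procedure.
    """
    n = len(labels)
    parent = list(range(n))  # union-find forest over class indices

    def find(i):
        # Follow parent links to the group representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        total = sum(confusion[i]) + sum(confusion[j])
        if total == 0:
            continue  # no pages of either class, nothing to measure
        # Share of pages from the two classes that were swapped between them.
        rate = (confusion[i][j] + confusion[j][i]) / total
        if rate > threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(labels[i])
    return sorted(groups.values())


# Hypothetical labels and counts, made up for this example only.
labels = ["arts", "music", "sports", "football"]
confusion = [
    [80, 15, 3, 2],
    [18, 78, 2, 2],
    [2, 1, 70, 27],
    [1, 2, 25, 72],
]
groups = group_confused_classes(confusion, labels, threshold=0.1)
print(groups)
```

With these made-up counts, "arts" and "music" end up in one group and "sports" and "football" in another, because only those pairs exceed the 0.1 threshold. A top-level classifier would then choose between groups, and a per-group classifier would separate the classes inside each one.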

Comments:	Pages IEEE format, International Journal of Computer Science and Information Security, IJCSIS, Vol. 7 No. 2, February 2010, USA. ISSN 1947-5500, this http URL
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:1003.1510 [cs.LG]
	(or arXiv:1003.1510v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1003.1510

Submission history

From: Rdv Ijcsis
[v1] Sun, 7 Mar 2010 18:32:47 UTC (1,180 KB)
