URL Tree: Efficient unsupervised content extraction from streams of web documents

B Sluban, M Grčar. Proceedings of the 22nd ACM international conference on Information …, 2013. dl.acm.org
The Web is the largest, and a steadily growing, source of information. Extracting meaningful content from Web pages is a challenging problem that has already been extensively addressed in the offline setting. In this work, we focus on content extraction from streams of HTML documents. We present an infrastructure that converts continuously acquired HTML documents into a stream of plain text documents. The presented pipeline consists of RSS readers for data acquisition from different Web sites, a duplicate removal component, and a novel content extraction algorithm that is efficient, unsupervised, and language-independent. Our content extraction approach is based on the observation that HTML documents from the same source normally share a common template. The core of the proposed content extraction algorithm is a simple data structure called URL Tree. The performance of the algorithm was evaluated in a stream setting on a time-stamped, semi-automatically annotated dataset, which was made publicly available. We compared the performance of URL Tree with that of several open source content extraction algorithms. The evaluation results show that our stream-based algorithm starts to outperform the other algorithms after only 10 to 100 documents from a specific domain.
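The abstract does not describe the internals of URL Tree, so the following Python sketch is only an illustration of the general idea it states: documents from the same source tend to share a template, so text blocks that recur across many documents under the same URL prefix are likely boilerplate. Everything here is an assumption made for illustration, not the authors' algorithm: the class names `UrlTreeNode` and `UrlTreeSketch`, the thresholds `min_docs` and `max_share`, and the choice to key the tree on the host plus leading path segments.

```python
from collections import defaultdict
from urllib.parse import urlparse


class UrlTreeNode:
    """One node per URL prefix (hypothetical layout, not from the paper)."""

    def __init__(self):
        self.children = {}
        self.doc_count = 0                    # documents seen under this prefix
        self.block_counts = defaultdict(int)  # text block -> documents containing it


class UrlTreeSketch:
    """Illustrative tree keyed on host and leading path segments."""

    def __init__(self, min_docs=3, max_share=0.5):
        # Both thresholds are assumed values chosen for this example.
        self.root = UrlTreeNode()
        self.min_docs = min_docs
        self.max_share = max_share

    @staticmethod
    def _keys(url):
        parsed = urlparse(url)
        segments = [s for s in parsed.path.split("/") if s]
        # Drop the last segment, which usually identifies a single document.
        return [parsed.netloc] + segments[:-1]

    def _walk(self, url, create):
        node, visited = self.root, []
        for key in self._keys(url):
            if key not in node.children:
                if not create:
                    break
                node.children[key] = UrlTreeNode()
            node = node.children[key]
            visited.append(node)
        return visited

    def add(self, url, blocks):
        """Record a document's text blocks at every prefix node of its URL."""
        unique = set(blocks)
        for node in self._walk(url, create=True):
            node.doc_count += 1
            for block in unique:
                node.block_counts[block] += 1

    def extract(self, url, blocks):
        """Keep blocks that are rare at the deepest prefix with enough documents."""
        for node in reversed(self._walk(url, create=False)):
            if node.doc_count >= self.min_docs:
                return [
                    b for b in blocks
                    if node.block_counts.get(b, 0) / node.doc_count <= self.max_share
                ]
        return list(blocks)  # too little evidence yet: keep everything
```

In a stream setting, each incoming document would be passed to `add` and then to `extract`, so the filter sharpens as more documents from the same source arrive. This is loosely in the spirit of the reported result that performance improves after some tens of documents per domain, although the sketch makes no claim to reproduce it.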