IRFC2014 Draft For ResearchGate
All content following this page was uploaded by Yaakov HaCohen-Kerner on 29 October 2014.
1 Introduction
The number of news articles published on websites in general, and on news websites in particular, has increased dramatically over recent years. These articles contain multimodal information, including textual and visual (image and video) descriptions. An example of such an article is illustrated in Fig. 1. Nowadays, both journalists and media monitoring companies face the problem of mastering large amounts of articles in order to identify important topics and events around the world. There is therefore a pressing need for accurate and rapid clustering and classification of news articles into a set of categories, in order to support journalism and media monitoring tasks. Despite the multimodal nature of the articles posted on the web nowadays, most approaches consider only textual data for classification (e.g. [1, 2]). It is therefore an interesting challenge to investigate whether using visual features in addition to textual features improves classification accuracy.
Fig. 1. Web-based news article from BBC entitled: 2013: The year we all went “mobile”1
News article classification is considered a document classification (DC) problem. DC means labeling a document with predefined categories, which can be achieved as the supervised learning task of assigning documents to one or more predefined categories [3]. Using machine learning (ML), the goal is to learn classifiers from examples that perform the category assignments automatically. DC is applied in many tasks, such as clustering, document indexing, document filtering, information retrieval, information extraction and word sense disambiguation. Current-day DC for news articles poses several research challenges, due to the large number of multimodal features present in the document set and their dependencies.
In this research, we investigate the task of category-based classification of news articles using a combination of visual features and textual N-gram features. The textual features are extracted from the textual part of the news article, while the visual features are generated from the biggest image of the article. Specifically, we learn two Random Forest classifiers, one with textual and one with visual features, and fuse their results using a late fusion strategy. The main contribution of this work is the use of visual features in news article classification in order to leverage the text-based results, as well as a late fusion strategy that exploits Random Forests' operational capabilities (i.e. the out-of-bag (OOB) error estimate and proximity ratios).
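Since the feature extraction procedure is detailed later (Section 3), the following is only a minimal sketch of how word N-gram count features can be extracted from article text, assuming scikit-learn's CountVectorizer; the exact N-gram sizes and preprocessing used in this work may differ:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy article texts; the real input would be the textual part of each news article.
articles = [
    "the match ended with a late goal at the stadium",
    "markets rallied as the central bank cut interest rates",
]

# Count unigrams and bigrams (an assumed configuration, for illustration only).
vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
X = vectorizer.fit_transform(articles)

# One row per article, one column per distinct N-gram in the vocabulary.
print(X.shape)
```

The resulting sparse matrix is the kind of textual representation a classifier such as a Random Forest can be trained on.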
The rest of this paper is organized as follows: Section 2 presents the theoretical background and related work. Section 3 describes the textual and visual feature extraction procedure, while Section 4 introduces the proposed classification framework. Section 5 presents the results of the experiments and Section 6 concludes the paper.
1 http://www.bbc.com/news/business-25445906
2 Related work and theoretical background
Since in this study we are dealing with supervised machine learning for document classification (DC) in general, and with news article classification in particular, we report previous work related to these two fields. Furthermore, since Random Forests (RF) is the machine learning method we use in our proposed classification framework, we provide the theoretical background and related work for this method.
5 Experimental Results
5.1 Dataset Description
The experiments are conducted on a dataset that contains web pages from three well-known news web sites, namely BBC, The Guardian and Reuters. Overall, 651, 556 and 360 web pages were retrieved from these sites, respectively. It should be noted that manual annotation of the web pages was necessary even though the three news web sites provide descriptions of the topic of each web page, since in many cases these descriptions are inconsistent with the content of the web pages. The manual annotation was realized for a subset of the topics recognized by the IPTC news codes taxonomy2; the IPTC is the global standards body of the news media. Specifically, we selected the most important topics with the guidance of media monitoring experts and journalists. Table 1 contains a detailed description of the final dataset3 and the topics considered.
Table 1. Details of the dataset

News Site            Business-   Lifestyle-   Science-     Sports   Documents
                     Finance     Leisure      Technology            per site
BBC                  102         68           75           202      447
The Guardian         67          59           116          96       338
Reuters              165         7            29           57       258
Documents per topic  334         134          220          355      1043
2 http://www.iptc.org/site/Home/
3 The dataset is publicly available at: http://mklab.iti.gr/files/ArticlesNewsSitesData.7z
5.2 Experimental Setup
We randomly split our dataset into training and test sets in order to conduct the experiments. Approximately 2/3 of the cases are kept for training, while the remaining 1/3 are used as a test set to estimate the classification scheme's performance.
As for the RF parameters used in the experiments, we applied the following settings. The number of trees for the construction of each RF was chosen based on the OOB error estimate: after several experiments with different numbers of trees, we noticed that the OOB error estimate stabilized at 1000 trees and no longer improved. Hence, the number of trees is set to N = 1000. For each node split during the growing of a tree, the number of candidate variables used to determine the best split is set to m = √k (according to [14]), where k is the total number of features of the dataset.
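The setup above can be sketched with scikit-learn on synthetic stand-in data (the real feature matrices are described in Section 3; the data here is random and for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Synthetic stand-in for the article feature matrix (k = 25 features)
# and four topic labels.
X = rng.rand(300, 25)
y = rng.randint(0, 4, size=300)

# Roughly 2/3 of the cases for training, 1/3 as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# N = 1000 trees; max_features="sqrt" picks m = sqrt(k) candidate
# variables at each node split, following Breiman [14].
rf = RandomForestClassifier(
    n_estimators=1000, max_features="sqrt", oob_score=True, random_state=0)
rf.fit(X_train, y_train)

# OOB error estimate = 1 - OOB accuracy; in the paper this guided
# the choice of the number of trees.
print("OOB error:", 1.0 - rf.oob_score_)
```

On the real dataset one would monitor this OOB error while increasing `n_estimators` until it stabilizes, as described above.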
Finally, to evaluate the performance of the proposed methodology, we compute the precision, recall and F-score measures for each category, along with their corresponding macro-averaged values, as well as the accuracy on the entire test set (all categories included).
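These measures can be computed directly with scikit-learn; the labels below are toy values chosen only to show the calls (four topic categories, coded 0-3):

```python
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Toy gold labels and predictions over four topic categories.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 0, 3, 3]

# Macro-averaging computes precision/recall/F-score per category,
# then takes the unweighted mean over the categories.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
acc = accuracy_score(y_true, y_pred)
print(f"macro P={prec:.3f} R={rec:.3f} F={f1:.3f} acc={acc:.3f}")
```

Passing `average=None` instead would return the per-category values reported in Tables 2 and 3.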
5.3 Results
The test set results from the application of RF to each modality separately are summarized in Table 2. We are mainly interested in the F-score values, since the F-score considers both precision and recall. We notice that the textual modality outperforms the visual one in all measures, both per topic and in the macro-averaged scores. This indicates that textual data is a more reliable and solid source of information than visual data. More specifically:
- The RF trained with the textual data achieves a macro-averaged F-score of 83.2%, compared to 45.5% for the visual modality.
- The accuracy of the textual modality RF is 84.4%, while the visual modality RF achieves only 53%.
- The worst results for the visual data RF are attained for the topics “Lifestyle-Leisure” (recall 12% and F-score 20.7%) and “Science-Technology” (precision 45.3%, recall 38.7% and F-score 41.7%). However, the results for the topic “Sports” are considered satisfactory. A possible explanation is that the images from the “Lifestyle-Leisure” web pages depict diverse topics and therefore their visual appearance varies strongly, whereas the images for the topic “Sports” contain rather specific information, such as football stadiums (a characteristic example is depicted in Fig. 3).
Table 2. Test set results from the application of RF to each modality

                     Textual                       Visual
Topic                Prec.   Recall  F-score      Prec.   Recall  F-score
Business-Finance     80.0%   87.3%   83.5%        56.3%   57.3%   56.8%
Lifestyle-Leisure    86.7%   78.0%   82.1%        75.0%   12.0%   20.7%
Science-Technology   79.1%   70.7%   74.6%        45.3%   38.7%   41.7%
Sports               91.3%   93.8%   92.5%        52.8%   76.8%   62.6%
Macro-average        84.3%   82.5%   83.2%        57.4%   46.2%   45.5%
Accuracy                             84.4%                        53.0%
Fig. 3. Characteristic image from a “Sports” web page (left)4, along with an image from a web page of the “Lifestyle-Leisure” topic (right)5
In Table 3 we provide the test set results from the application of the late fusion strategy to RF, using the two different weighting methods described in Section 4 (OOB error / proximity ratio). The weighting method based on the proximity ratio yields better performance than the one based on the OOB error. More specifically:
- The accuracy of Textual + Visual (Proximity ratio) is slightly better than that of Textual + Visual (OOB error) (86.2% compared to 85.9%).
- The two weighted RFs achieve almost equal macro-averaged precision values (86.8% for Proximity ratio and 86.9% for OOB error), while for the macro-averaged recall and F-score, Textual + Visual (Proximity ratio) is better (84.2% vs. 82.9% macro-averaged recall and 85.3% vs. 84.3% macro-averaged F-score).
4 http://www.bbc.com/sport/0/football/27897075-"_75602744_ochoa.jpg"
5 http://www.bbc.com/travel/feature/20140710-living-in-istanbul-“p022ktsw.jpg”
For comparison purposes, we also constructed a fused RF model in which equal weights were assigned to the two modalities. We notice that with this weighting approach (i.e. equal weights) the performance of RF deteriorates in all measures. The superiority of the weighting strategy based on the per-topic proximity ratio is also evident in Fig. 4, where the macro-averaged F-score values of all five RF models constructed in this study are sorted in ascending order. We observe that Textual + Visual (Proximity ratio) is the best performing model overall.
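The late fusion step can be sketched as a weighted sum of the two classifiers' class-probability outputs; the per-topic weights below are hypothetical placeholders for the OOB-error- or proximity-ratio-based weights of Section 4 (the exact weighting formulas are not reproduced here):

```python
import numpy as np

# Class-probability outputs of the textual and visual RFs for three
# test articles over four topics (each row sums to 1).
p_text = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.5, 0.2, 0.1],
                   [0.1, 0.1, 0.2, 0.6]])
p_vis  = np.array([[0.4, 0.3, 0.2, 0.1],
                   [0.1, 0.2, 0.4, 0.3],
                   [0.3, 0.1, 0.1, 0.5]])

# Per-topic weights for the textual modality (hypothetical values; in the
# paper they derive from the OOB error estimate or the proximity ratio).
w_text = np.array([0.8, 0.9, 0.7, 0.6])
w_vis = 1.0 - w_text

# Late fusion: weight each modality's probabilities per topic, sum,
# and predict the topic with the highest fused score.
p_fused = w_text * p_text + w_vis * p_vis
predictions = p_fused.argmax(axis=1)
print(predictions)
```

An equal-weights baseline, as in the comparison above, simply corresponds to setting every entry of `w_text` to 0.5.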
Table 3. Test set results after the late fusion of RF regarding three different weighting schemes

                     Textual + Visual            Textual + Visual            Textual + Visual
                     (OOB error)                 (Proximity ratio)           (Equal weights)
Topic                Prec.   Recall  F-score     Prec.   Recall  F-score     Prec.   Recall  F-score
Business-Finance     80.3%   92.7%   86.1%       82.4%   89.1%   85.6%       71.1%   91.8%   80.1%
Lifestyle-Leisure    92.5%   74.0%   82.2%       92.9%   78.0%   84.8%       91.4%   64.0%   75.3%
Science-Technology   83.9%   69.3%   75.9%       81.4%   76.0%   78.6%       83.1%   65.3%   73.1%
Sports               90.7%   95.5%   93.0%       90.5%   93.8%   92.1%       87.4%   86.6%   87.0%
Macro-average        86.9%   82.9%   84.3%       86.8%   84.2%   85.3%       83.3%   76.9%   78.9%
Accuracy                             85.9%                       86.2%                       80.4%
Fig. 4. Macro-averaged F-score values for all RF models sorted in ascending order
6 Summary, Conclusions and Future Work
In this research, we investigated the use of N-gram textual features and visual features for the classification of news articles that fall into four categories (Business-Finance, Lifestyle-Leisure, Science-Technology and Sports), downloaded from three news web sites (BBC, Reuters and The Guardian).
Using the N-gram textual features alone led to much better accuracy (84.4%) than using the visual features alone (53%). However, using both the N-gram textual features and the visual features (with weighting based on the proximity ratio per category) led to slightly better accuracy (86.2%).
Future directions for research include: (1) defining and applying additional types of features, such as function words, key-phrases, morphological features (e.g. nouns, verbs and adjectives), quantitative features (various averages, such as the average number of letters per word and the average number of words per sentence) and syntactic features (frequencies and distribution of part-of-speech tags, such as noun, verb, adjective and adverb); (2) applying various kinds of classification models based on textual and visual features to a larger number of documents that belong to more than four categories, in the news article domain as well as in other areas, applications and languages; (3) selecting a representation for images based on visual concepts. In the latter case, the approach could consider more than one image per article: visual concepts could be extracted from each image and the average score of each visual concept could be calculated, in order to represent the article based on multiple images.
References
1. Schneider, K. M.: Techniques for improving the performance of naive Bayes for text classification. In: Computational Linguistics and Intelligent Text Processing, pp. 682-693. Springer Berlin Heidelberg (2005)
2. Zeng, A., Huang, Y.: A text classification algorithm based on Rocchio and hierarchical clustering. In: Advanced Intelligent Computing, pp. 432-439. Springer Berlin Heidelberg (2012)
3. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1), pp. 1-47 (2002)
4. Toutanova, K.: Competitive generative models with structure learning for NLP classification tasks. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 576-584 (2006)
5. Ho, A. K. N., Ragot, N., Ramel, J. Y., Eglin, V., Sidere, N.: Document classification in a non-stationary environment: a one-class SVM approach. In: Proceedings of the 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 616-620 (2013)
6. Klassen, M., Paturi, N.: Web document classification by keywords using random forests. In: Networked Digital Technologies, pp. 256-261. Springer Berlin Heidelberg (2010)
7. Caropreso, M. F., Matwin, S., Sebastiani, F.: Statistical phrases in automated text categorization. Centre National de la Recherche Scientifique, Paris, France (2000)
8. Braga, I., Monard, M., Matsubara, E.: Combining unigrams and bigrams in semi-supervised text classification. In: Progress in Artificial Intelligence, 14th Portuguese Conference on Artificial Intelligence (EPIA 2009), Aveiro, pp. 489-500 (2009)
9. Selamat, A., Omatu, S.: Web page feature selection and classification using neural networks. Information Sciences 158, pp. 69-88 (2004)
10. Aung, W. T., Hla, K. H. M. S.: Random forest classifier for multi-category classification of web pages. In: IEEE Asia-Pacific Services Computing Conference (APSCC 2009), pp. 372-376. IEEE (2009)
11. Gamon, M., Basu, S., Belenko, D., Fisher, D., Hurst, M., König, A. C.: BLEWS: Using blogs to provide context for news articles. In: ICWSM (2008)
12. Bandari, R., Asur, S., Huberman, B. A.: The pulse of news in social media: forecasting popularity. In: ICWSM (2012)
13. Swezey, R. M., Sano, H., Shiramatsu, S., Ozono, T., Shintani, T.: Automatic detection of news articles of interest to regional communities. IJCSNS 12(6), 100 (2012)
14. Breiman, L.: Random forests. Machine Learning 45(1), pp. 5-32 (2001)
15. Xu, B., Ye, Y., Nie, L.: An improved random forest classifier for image classification. In: 2012 International Conference on Information and Automation (ICIA), pp. 795-800. IEEE (2012)
16. Li, W., Meng, Y.: Improving the performance of neural networks with random forest in detecting network intrusions. In: Advances in Neural Networks (ISNN 2013), pp. 622-629. Springer Berlin Heidelberg (2013)
17. Gray, K. R., Aljabar, P., Heckemann, R. A., Hammers, A., Rueckert, D.: Random forest-based similarity measures for multi-modal classification of Alzheimer's disease. NeuroImage 65, pp. 167-175 (2013)
18. HaCohen-Kerner, Y., Mughaz, D., Beck, H., Yehudai, E.: Words as classifiers of documents according to their historical period and the ethnic origin of their authors. Cybernetics and Systems: An International Journal 39(3), pp. 213-228 (2008)
19. Fox, C.: A stop list for general text. ACM SIGIR Forum 24(1-2). ACM (1989)
20. Sikora, T.: The MPEG-7 visual standard for content description: an overview. IEEE Transactions on Circuits and Systems for Video Technology 11(6), pp. 696-702 (2001)
21. Zhou, Q., Hong, W., Luo, L., Yang, F.: Gene selection using random forest and proximity differences criterion on DNA microarray data. Journal of Convergence Information Technology 5(6), pp. 161-170 (2010)