Improved Document Categorization Through Feature-Rich Combinations
Abstract. Several comparative studies report new findings relevant to the Text Categorization (TC) task, and all provide valuable observations. However, many of them address Western languages, especially English. This paper takes a step toward filling that gap by focusing on a less commonly investigated language, Arabic, to provide a more balanced perspective. In that respect, it presents a deeper investigation of the performance of well-known machine learning methods successfully applied to automatic TC: Naïve Bayes, Support Vector Machines, and Decision Tree. The investigation also covers pre-processing techniques and feature selection methods that deal with the data's high dimensionality. Specifically, stop-word elimination, stemming, and lemmatization are the pre-processing techniques included, along with TF-IDF and Chi-square as the feature selection methods. Moreover, all possible combinations are considered. To make this study accurate and comprehensive, we trained and evaluated the selected classifiers and pre-processing techniques on common ground. To this end, we used an in-house balanced and large corpus of 300,000 news articles equally distributed over six categories. The findings prove the effectiveness of combining pre-processing techniques, feature selection methods, and classifiers.
1 Introduction
As an old Semitic language, Arabic has characteristics that differ from those of Western languages. These characteristics have always intrigued, and still intrigue, scholars in several research fields such as Text Mining and Arabic Natural Language Processing, and they make text pre-processing techniques particularly advisable. This section introduces the Arabic language characteristics that are most challenging for the categorization task, primarily those related to data dimensionality.
Arabic is one of the most highly inflected languages. For instance, an Arabic word can represent a whole sentence through sequential concatenation; e.g., the word "ﺃﻓﺎﺳﺘﺴﻘﻴﻨﺎﻛﻤﻮﻫﺎ", which contains 15 letters and ten diacritics, means in English "Did we then ask you to give it to us to drink?". A previous study [3] that investigated the average length of Arabic words in news articles, using a corpus of one billion words, reports that 75% of the words are probably inflected since they are longer than six letters; non-inflected Arabic words are generally shorter than six letters. Therefore, processing such inflected words is required to maximize the reduction in data dimensionality.
Another factor that influences the accuracy of text classification is dealing with synonyms. The Arabic language has a very rich lexicon of synonyms; for example, the word "ﺃﺳﺪ" (en. "lion") has between 350 and 500 synonyms1. Therefore, it is recommended to involve a text processing technique that deals with this language characteristic.
A stop word is a term that appears frequently in a text but does not bear significant information about the subject of the processed text. In the text classification context, stop words are not only particles; some nouns and verbs are also considered stop words. Stop-word elimination impacts several text processing applications such as information retrieval [4], text summarization [5], and automatic translation [6]. Regarding Arabic TC, a recent study [7] reports that stop words represent 35% to 43% of news article content. Thus, removing these stop words leads to a large reduction in data dimensionality. However, indiscriminate elimination of stop words may significantly deteriorate text classification performance [8].
One of the most appropriate morphological processes for TC is stemming. It is normally used to reduce the size of the feature set while keeping the meaning of the text content well represented. Arabic stemming algorithms fall into two types. The first is light stemming, which aims to remove clitics without finding roots [9]. The second is root-based stemming, which reduces inflected words to their roots [10]. According to a survey [11], including Arabic light stemming improved classification performance in nine different experiments, whereas no improvement was observed when root-based stemming was performed [12]. The reason is that light stemming regroups inflected words that are grammatically related and, to some extent, semantically related. Root-based stemming, on the other hand, regroups inflected words that are morphologically related, i.e., derived from the same root. As a result, it is quite likely to regroup inflected and derived words that have different meanings. For example, the words "ﻋﻴﻦ" (i.e., eye), "ﻣﻌﺎِﻧﻲ" (i.e., meanings), "ﺃِﻋﻴﻦ" (i.e., I help), and "ﻋﻴﻮﻥ" (i.e., fountains) are all derived from the same root "ﻋﻴﻦ".
Unlike stemming, lemmatization regroups semantically related words, even when they are grammatically different from each other, under a specific word called a lemma (i.e., a dictionary lookup form). For instance, the lemma "ِﻛﺘﺎﺏ" (i.e., book) regroups many inflected words such as "ﻛﺘﺐ" (i.e., books), "ﻛﺘﻴﺒﺎﺕ" (i.e., manuals), and "ِﻛﺘﺎﺑﺎﻥ" (i.e., two books). However, lemmatization is a more complex level of text processing than stemming, which may be why lemmatization is still rarely involved as a pre-processing task in Arabic TC. Nevertheless, recently conducted experiments report that lemmatization reduces data dimensionality more than stemming while still enhancing text classification performance [7, 13].
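To make the contrast concrete, here is a minimal Python sketch. Light stemming uses NLTK's ARLSTem (the stemmer adopted later in this paper; see footnote 4), while lemmatization is mocked as a dictionary lookup: the tiny LEMMA_DICT below is a hypothetical stand-in for a real lemma database such as the one described in [13], populated here with the paper's own "كتاب" example.

```python
from nltk.stem.arlstem import ARLSTem  # Arabic light stemmer shipped with NLTK

stemmer = ARLSTem()

# Hypothetical lookup table standing in for a real lemma database [13];
# the mappings reproduce the paper's example of the lemma "كتاب" (book).
LEMMA_DICT = {
    "كتب": "كتاب",     # "books"     -> "book"
    "كتيبات": "كتاب",  # "manuals"   -> "book"
    "كتابان": "كتاب",  # "two books" -> "book"
}

def light_stem(token):
    """Strip affixes/clitics without reducing the word to its root."""
    return stemmer.stem(token)

def lemmatize(token):
    """Dictionary lookup; fall back to the surface form when unknown."""
    return LEMMA_DICT.get(token, token)

for word in ["كتب", "كتيبات", "كتابان"]:
    print(word, "| stem:", light_stem(word), "| lemma:", lemmatize(word))
```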
1 https://ar.wikipedia.org/wiki/ﻗﺎﺋﻤﺔ_ﺃﺳﻤﺎﺀ_ﺍﻷﺳﺪ_ﻓﻲ_ﺍﻟﻠﻐﺔ_ﺍﻟﻌﺮﺑﻴﺔ.
3 Datasets and Tools
This section presents the datasets and tools used in this work. It introduces the features of each dataset and tool and discusses the reasons for adopting it.
3.1 Datasets
Compiling datasets for TC, as for many other research fields based on data-driven approaches, is becoming more manageable thanks to the tremendous growth of World Wide Web content. Besides, the availability of free web crawlers makes web scraping easier and accessible to everyone. Another advantage for those interested in Arabic TC is that Arabic is currently the fourth most used language on the web2. All these facts have led to a growing number of Arabic web-based corpora, primarily those comprising news articles.
In our case, we crawled different Arabic news websites using the HTTRACK3 web crawler, then performed cleaning and normalization. Since each website classifies its news articles differently from the others, we selected the same number of articles from only the common categories: politics, culture, economy, sport, health, and technology. As a result, the compiled corpus contains 50,000 articles per category, totaling 300,000 articles. It consists of over 153 million words, with an average of 512 words per article (287 at minimum and 737 at maximum).
To build a well-structured stop-word list, we did not limit our list to the most frequent words in the compiled corpus; we also included words from previously published lists. We then reviewed and filtered the combined lists, producing a new list of roughly 1,000 basic stop words. Finally, we generated each stop word's inflected forms following a proposed technique that involves 123 Arabic clitics [14]. The final list thus comprises 11,403 stop words.
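A hedged sketch of this expansion step follows, assuming the clitic inventory of [14] (not reproduced here, so only a tiny illustrative subset is used); the hypothetical expand_stop_word helper simply concatenates clitics, whereas the real technique also filters invalid forms.

```python
from itertools import product

# Tiny illustrative subset; the technique in [14] involves 123 Arabic clitics.
PROCLITICS = ["", "و", "ف", "ب", "ال"]  # conjunctions, prepositions, article
ENCLITICS = ["", "ه", "ها", "هم"]       # pronominal suffixes

def expand_stop_word(base):
    """Generate candidate inflected forms of a base stop word by attaching
    proclitics and enclitics; a real implementation would also filter out
    morphologically invalid combinations."""
    return {pre + base + post for pre, post in product(PROCLITICS, ENCLITICS)}

basic_stop_words = ["قال"]  # sample base entry ("he said")
full_list = set()
for w in basic_stop_words:
    full_list |= expand_stop_word(w)
print(len(full_list))  # 20 candidate forms for this one base word
```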
3.2 Tools
Here we introduce the various tools used for text stemming, lemmatization, and
classification.
According to three different comparative studies [15–17], which investigated the impact of 10 Arabic stemmers on Arabic TC performance, the best improvement in classification was achieved using ARLSTem v1.04 [18]; therefore, it is the one implemented in our study. The overall ranking of the remaining stemmers is as follows: ARLSTem v1.1 [17], Tashaphyne light stemmer5, Farasa [19], Khoja stemmer [10], Light10 [9], AlKhalil Morpho Sys [20], Assem's stemmer6, Soori's stemmer [21], and finally the ISRI stemmer [22]. The algorithm of ARLSTem v1.0 consists of the
2 https://www.internetworldstats.com/stats7.htm.
3 http://www.httrack.com/.
4 https://www.nltk.org/_modules/nltk/stem/arlstem.html [last accessed: January 24, 2021].
5 https://pypi.org/project/Tashaphyne/ [last accessed: January 24, 2021].
6 https://arabicstemmer.com/ [last accessed: January 24, 2021].
4 Evaluating the Pre-processing Techniques
The purpose of this section is to compare the impact of each pre-processing technique on the whole classification task. Therefore, after cleaning and preparing the corpus, the three pre-processing techniques, Stop Words (SW) removal, stemming, and lemmatization, were performed, along with all their possible combinations. The effectiveness of these techniques is compared with the baseline case, i.e., when none of them is performed. No feature selection method is involved in this phase. The three algorithms (NB, SVM, and DT J48) are used for classification, and 10-fold cross-validation is used to evaluate performance accuracy.
Table 1 presents the results obtained.
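For illustration, here is a minimal sketch of this evaluation protocol in scikit-learn. The paper's DT J48 is Weka's C4.5 implementation [24], so DecisionTreeClassifier (CART-based) is only an approximation, and the corpus loaders are hypothetical placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical loaders: pre-processed articles (after SW removal, stemming,
# and/or lemmatization) and their six-category labels from the in-house corpus.
docs = load_corpus_texts()
labels = load_corpus_labels()

classifiers = {
    "NB": MultinomialNB(),
    "SVM": LinearSVC(),
    "DT J48": DecisionTreeClassifier(),  # CART stand-in for Weka's C4.5
}
for name, clf in classifiers.items():
    pipe = make_pipeline(CountVectorizer(), clf)
    scores = cross_val_score(pipe, docs, labels, cv=10)  # 10-fold CV
    print(f"{name}: {100 * scores.mean():.2f}%")
```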
5 Evaluating the Feature Selection Methods
This section compares the effects of using the feature selection methods on the whole classification task. It is worth mentioning that this evaluation is conducted after the pre-processing task has been done using the three techniques: SW removal, stemming, and lemmatization. First, the feature selection methods TF-IDF and Chi-square were performed individually; then, their combination.
TF-IDF associates each word in a document with a number representing how relevant that word is in the document. Each document thus carries information on both its more important words and its less important ones. Consequently, documents with similar relevant words will have similar vectors.
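As a sketch, TF-IDF vectorization with scikit-learn (the paper does not state which weighting variant or toolkit it used, so TfidfVectorizer's defaults are an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["...", "..."]  # placeholder pre-processed documents
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # rows: documents, columns: vocabulary terms
# X[i, j] is high when term j is frequent in document i but rare in the
# rest of the corpus, so documents sharing relevant terms get similar vectors.
```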
The Chi-square test checks the independence between the occurrence of a specific word and the occurrence of a specific category. If no relationship exists between the word and the category, they are independent; this is the null hypothesis of the Chi-square test. We ranked the words by their Chi-square scores, and only the top-ranked words are then selected to serve as inputs for the classifier.
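A sketch of Chi-square ranking with scikit-learn; the cutoff k below is a hypothetical value, since the paper does not state how many top-ranked words were kept at this stage:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["...", "..."]          # placeholder pre-processed documents
labels = ["sport", "economy"]  # their categories

X = CountVectorizer().fit_transform(docs)
# chi2 scores each term against the labels; a high score argues against the
# null hypothesis that term occurrence and category are independent.
k = min(1000, X.shape[1])      # hypothetical cutoff
X_reduced = SelectKBest(chi2, k=k).fit_transform(X, labels)
```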
The combination of TF-IDF and Chi-square is implemented as described in [25]. First, the TF-IDF values of each word in each document are calculated; then, we calculate the sum of each word's TF-IDF values over the documents of the same class. These sums are then normalized. Next, a new weight for each word is computed using Chi-square. Finally, only the top two-thirds of the words are selected. Table 2 exhibits the enhancement achieved by applying these feature selection methods and their combination. The results are compared to the case where only the pre-processing techniques are performed.
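A sketch of these steps follows, assuming one plausible reading of [25]: the exact fusion of the normalized class sums with the Chi-square scores is specified there, not here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

def select_top_two_thirds(docs, labels):
    # 1. TF-IDF value of each word in each document.
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    y = np.asarray(labels)

    # 2-3. Sum each word's TF-IDF values over the documents of each class,
    # then normalize the sums within each class.
    class_sums = np.vstack([np.asarray(X[y == c].sum(axis=0)).ravel()
                            for c in np.unique(y)])
    class_sums /= class_sums.sum(axis=1, keepdims=True)

    # 4. Re-weight each word with its Chi-square score against the labels;
    # this product is one plausible fusion, not the exact formula of [25].
    chi_scores, _ = chi2(X, y)
    weights = chi_scores * class_sums.max(axis=0)

    # 5. Keep only the top two-thirds of the words.
    k = (2 * weights.size) // 3
    keep = np.sort(np.argsort(weights)[-k:])
    return X[:, keep], vec.get_feature_names_out()[keep]
```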
Table 2. 10-fold cross-validation scores for evaluating feature selection effectiveness on text classification.

Feature selection method      NB     SVM    DT J48
Without feature selection     85.99  87.64  83.63
TF-IDF                        91.16  92.63  86.88
Chi-square                    89.24  92.48  87.32
TF-IDF & Chi-square           93.47  94.91  90.81

(All values are 10-fold cross-validation scores in %.)
This experiment shows that the average improvement achieved across all classifiers after applying the feature selection methods individually is +4.47% for TF-IDF and +3.93% for Chi-square; however, their combination recorded the highest improvement (+7.31%). Comparing the two methods, TF-IDF performed better with the NB and SVM algorithms, whereas Chi-square showed superior performance with DT (J48).
6 Combining the Classifiers
The main purpose of this part is to combine the three algorithms NB, SVM, and DT (J48) and investigate the results obtained. The combination determines the most appropriate category for a given document in three steps (see the sketch after this list):
1. Perform document classification using all three algorithms;
2. Select for each document the most voted category;
3. If the categories given by the combined algorithms all differ, select the category proposed by the most accurate algorithm.
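A minimal sketch of this voting rule; the accuracies passed in are those each classifier obtained individually (Table 2), used only to break full disagreements:

```python
from collections import Counter

def vote(predictions, accuracy):
    """predictions: classifier name -> predicted category for one document.
    accuracy: classifier name -> its individual accuracy (tie-breaker)."""
    counts = Counter(predictions.values())
    category, votes = counts.most_common(1)[0]
    if votes > 1:  # step 2: at least two classifiers agree
        return category
    # Step 3: all classifiers disagree -> trust the most accurate one.
    best = max(accuracy, key=accuracy.get)
    return predictions[best]

# Example with the individual scores from Table 2:
preds = {"NB": "sport", "SVM": "economy", "DT J48": "politics"}
accs = {"NB": 0.9347, "SVM": 0.9491, "DT J48": 0.9081}
print(vote(preds, accs))  # -> "economy" (SVM is the most accurate)
```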
This voting method is the last step in the whole proposed system, which is based on combining several techniques in each phase. Figure 1 illustrates the whole idea of this work, exhibiting the different methods implemented in text pre-processing, feature selection, and classification.
After combining the outputs of two and three algorithms, the results displayed in Table 3 can be explained as follows:
• The combination of NB & SVM achieved an accuracy rate (94.97%) higher than the rates recorded by both algorithms when performed individually.
• The combination of SVM & DT showed inferior performance (94.48%) compared to the rate obtained by SVM alone (94.91%).
• The combination of NB & DT achieved an accuracy rate (91.36%) lower than the rate (93.47%) obtained by NB alone.
• The best result (95.78%) is achieved by the combination involving all three algorithms.
To sum up, combining only two algorithms can reduce accuracy compared to the most accurate algorithm involved in the combination. However, the combination of all three algorithms performed better than any of them applied individually. Finally, further improvement may still be possible if the number of involved algorithms is increased.
7 Conclusion
References
1. Zeroual, I., Lakhouaja, A.: Arabic corpus linguistics: major progress, but still a long way to
go. In: Intelligent Natural Language Processing: Trends and Applications, pp. 613–636.
Springer, Cham (2018)
2. Guellil, I., Saâdane, H., Azouaou, F., Gueni, B., Nouvel, D.: Arabic natural language
processing: an overview. J. King Saud Univ. – Comput. Inf. Sci. (2019). In Press
3. Zeroual, I., Goldhahn, D., Eckart, T., Lakhouaja, A.: OSIAN: Open source international
arabic news corpus - preparation and integration into the CLARIN-infrastructure. In:
Proceedings of the Fourth Arabic Natural Language Processing Workshop, pp. 175–182.
Association for Computational Linguistics, Florence (2019)
4. El-Khair, I.A.: Effects of stop words elimination for Arabic information retrieval: a
comparative study. arXiv preprint arXiv:1702.01925 (2017)
5. Al-Abdallah, R.Z., Al-Taani, A.T.: Arabic single-document text summarization using
particle swarm optimization algorithm. Proc. Comput. Sci. 117, 30–37 (2017)
6. Arora, K.K., Agrawal, S.S.: Pre-processing of English-Hindi corpus for statistical machine
translation. Comput. Sist. 21, 725–737 (2017)
7. El Kah, A., Zeroual, I.: The effects of pre-processing techniques on Arabic text classification.
IJATCSE 10, 41–48 (2021)
8. Jianqiang, Z., Xiaolin, G.: Comparison research on text pre-processing methods on Twitter
sentiment analysis. IEEE Access 5, 2870–2879 (2017)
9. Larkey, L.S., Ballesteros, L., Connell, M.E.: Light stemming for Arabic information
retrieval. In: Arabic Computational Morphology, pp. 221–243. Springer (2007)
10. Khoja, S., Garside, R.: Stemming Arabic text. Computing Department, Lancaster University,
Lancaster (1999)
11. Al-Anzi, F.S., AbuZeina, D.: Stemming impact on Arabic text categorization performance:
A survey. In: 2015 5th International Conference on Information Communication Technology
and Accessibility (ICTA), pp. 1–7 (2015)
12. Wahbeh, A., Al-Kabi, M., Al-Radaideh, Q., Al-Shawakfa, E., Alsmadi, I.: The effect of
stemming on Arabic text classification: an empirical study. Int. J. Inf. Retrieval Res. (IJIRR).
1, 54–70 (2011)
13. Namly, D., Bouzoubaa, K., El Jihad, A., Aouragh, S.L.: Improving Arabic lemmatization
through a lemmas database and a machine-learning technique. In: Recent Advances in NLP:
The Case of Arabic Language, pp. 81–100. Springer (2020)
14. Zeroual, I., Boudchiche, M., Mazroui, A., Lakhouaja, A.: Developing and performance
evaluation of a new Arabic heavy/light stemmer. In: Proceedings of the 2Nd International
Conference on Big Data, Cloud and Applications, pp. 17:1–17:6. ACM, Tetouan (2017)
15. Naili, M., Chaibi, A.H., Ghezala, H.H.B.: Comparative study of Arabic stemming algorithms
for topic identification. Proc. Comput. Sci. 159, 794–802 (2019)
16. Alhaj, Y.A., Xiang, J., Zhao, D., Al-Qaness, M.A., Abd Elaziz, M., Dahou, A.: A study of
the effects of stemming strategies on Arabic document classification. IEEE Access 7, 32664–
32671 (2019)
17. Abainia, K., Rebbani, H.: Comparing the effectiveness of the improved ARLSTem algorithm
with existing Arabic light stemmers. In: 2019 International Conference on Theoretical and
Applicative Aspects of Computer Science (ICTAACS), pp. 1–8. IEEE (2019)
18. Abainia, K., Ouamour, S., Sayoud, H.: A novel robust Arabic light stemmer. J. Exp. Theor.
Artif. Intell. 29(3), 557–573 (2016)
19. Abdelali, A., Darwish, K., Durrani, N., Mubarak, H.: Farasa: a fast and furious segmenter for
Arabic. In: Proceedings of the 2016 Conference of the North American Chapter of the
Association for Computational Linguistics: Demonstrations, pp. 11–16 (2016)
20. Boudchiche, M., Mazroui, A., Bebah, M.O.A.O., Lakhouaja, A., Boudlal, A.: AlKhalil
Morpho Sys 2: a robust Arabic morpho-syntactic analyzer. J. King Saud Univ.-Comput. Inf.
Sci. 29, 141–146 (2017)
21. Soori, H., Platoš, J., Snášel, V.: Simple stemming rules for Arabic language. In: Proceedings
of the Third International Conference on Intelligent Human Computer Interaction (IHCI
2011), Prague, Czech Republic, August 2011, pp. 99–108. Springer (2013)
22. Taghva, K., Elkhoury, R., Coombs, J.: Arabic stemming without a root dictionary. In: International Conference on Information Technology: Coding and Computing (ITCC 2005), pp. 152–157. IEEE (2005)
23. Pasha, A., Al-Badrashiny, M., Diab, M.T., El Kholy, A., Eskander, R., Habash, N.,
Pooleery, M., Rambow, O., Roth, R.: MADAMIRA: a fast, comprehensive tool for
morphological analysis and disambiguation of Arabic. In: LREC, pp. 1094–1101 (2014)
24. Garner, S.R.: Weka: the waikato environment for knowledge analysis. In: Proceedings of the
New Zealand Computer Science Research Students Conference, pp. 57–64 (1995)
25. Tang, H., Zhou, L., Chengjie, X., Zhu, Q.: A method of text dimension reduction based on
CHI and TF-IDF. In: 2015 4th International Conference on Mechatronics, Materials,
Chemistry and Computer Engineering. Atlantis Press (2015)