Improved Document Categorization Through Feature-Rich Combinations
Abstract. Several comparative studies report new findings relevant to the Text Categorization (TC) task, and all provide valuable observations. However, many of them address Western languages, especially English. This paper takes a step toward filling that gap by focusing on a less commonly investigated language, Arabic, to provide a more balanced perspective. In that respect, it presents a deeper investigation of the performance of well-known machine learning methods successfully applied to automatic TC: Naïve Bayes, Support Vector Machines, and Decision Tree. The investigation also covers pre-processing techniques and feature selection methods that deal with the data's high dimensionality. Specifically, stop-word elimination, stemming, and lemmatization are the pre-processing techniques included, along with TF-IDF and Chi-square as the feature selection methods. Moreover, all possible combinations are considered. To make this study accurate and comprehensive, we trained and evaluated the selected classifiers and pre-processing techniques on common ground. To this end, we used an in-house balanced and large corpus of 300,000 news articles equally distributed over six categories. The findings prove the effectiveness of combining pre-processing techniques, feature selection methods, and classifiers.
1 Introduction
As an old Semitic language, Arabic has characteristics that differ from those of Western languages. These characteristics have always intrigued, and still intrigue, scholars in several research fields such as Text Mining and Arabic Natural Language Processing, and they make text pre-processing techniques particularly advisable. This section introduces the Arabic language characteristics that are most challenging for the categorization task, primarily those related to data dimensionality.
Arabic is one of the most highly inflected languages. For instance, an Arabic word can represent a whole sentence through sequential concatenation; e.g., the word "ﺃﻓﺎﺳﺘﺴﻘﻴﻨﺎﻛﻤﻮﻫﺎ", which contains 15 letters and ten diacritics, means in English "Did we then ask you to give it to us to drink?". A previous study [3] that investigated the average length of Arabic words in news articles, using a corpus of one billion words, reports that 75% of the words are probably inflected since they are longer than six letters; non-inflected Arabic words are generally shorter than six letters. Therefore, processing such inflected words is required to maximize the reduction in data dimensionality.
Another factor that influences the accuracy of text classification is dealing with synonyms. The Arabic language has a very rich lexicon of synonyms; for example, the word "ﺃﺳﺪ" (en. "lion") has between 350 and 500 synonyms1. Therefore, it is recommended to involve a text processing technique that deals with this language characteristic.
A stop word is a term that appears frequently in a text but does not bear significant information about the subject of the processed text. In the text classification context, stop words are not only particles; some nouns and verbs are also considered stop words. Stop-word elimination impacts several text processing applications such as information retrieval [4], text summarization [5], and automatic translation [6]. Regarding Arabic TC, a recent study [7] reports that stop words represent 35% to 43% of news article content. Thus, removing these stop words leads to a large reduction in data dimensionality. However, indiscriminate elimination of stop words may significantly deteriorate text classification performance [8].
One of the most appropriate morphological processes for TC is stemming. It is normally used to reduce the size of the feature set while keeping the meaning of the text content well represented. Arabic stemming algorithms fall into two types. The first is light stemming, which aims to remove clitics without finding roots [9]. The second is root-based stemming, which reduces inflected words to their roots [10]. According to a survey [11], including Arabic light stemming improved classification performance in nine different experiments, whereas no improvement was observed when root-based stemming was performed [12]. The reason is that light stemming regroups inflected words that are grammatically related and, to some extent, semantically related. Root-based stemming, on the other hand, regroups inflected words that are morphologically related, i.e., derived from the same root. As a result, it is quite likely to regroup inflected and derived words that have different meanings. For example, the words "ﻋﻴﻦ" (i.e., eye), "ﻣﻌﺎِﻧﻲ" (i.e., meanings), "ﺃِﻋﻴﻦ" (i.e., I help), and "ﻋﻴﻮﻥ" (i.e., fountains) are all derived from the same root "ﻋﻴﻦ".
Unlike stemming, lemmatization regroups semantically related words, even when they are grammatically different from each other, under a specific word called a lemma (i.e., a dictionary lookup form). For instance, the lemma "ِﻛﺘﺎﺏ" (i.e., book) regroups many inflected words such as "ﻛﺘﺐ" (i.e., books), "ﻛﺘﻴﺒﺎﺕ" (i.e., manuals), and "ِﻛﺘﺎﺑﺎﻥ" (i.e., two books). However, lemmatization is a more complex level of text processing than stemming, which may be why lemmatization is still rarely involved as a pre-processing task in Arabic TC. Nevertheless, recently conducted experiments report that lemmatization reduces data dimensionality more than stemming while still enhancing text classification performance [7, 13].
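To make the contrast concrete, here is a minimal Python sketch. Light stemming uses NLTK's ARLSTem (the stemmer adopted later in this paper; see footnote 4), while lemmatization is mocked as a dictionary lookup: the tiny LEMMA_DICT below is a hypothetical stand-in for a real lemma database such as the one described in [13], populated here with the paper's own "كتاب" example.

```python
from nltk.stem.arlstem import ARLSTem  # Arabic light stemmer shipped with NLTK

stemmer = ARLSTem()

# Hypothetical lookup table standing in for a real lemma database [13];
# the mappings reproduce the paper's example of the lemma "كتاب" (book).
LEMMA_DICT = {
    "كتب": "كتاب",     # "books"     -> "book"
    "كتيبات": "كتاب",  # "manuals"   -> "book"
    "كتابان": "كتاب",  # "two books" -> "book"
}

def light_stem(token):
    """Strip affixes/clitics without reducing the word to its root."""
    return stemmer.stem(token)

def lemmatize(token):
    """Dictionary lookup; fall back to the surface form when unknown."""
    return LEMMA_DICT.get(token, token)

for word in ["كتب", "كتيبات", "كتابان"]:
    print(word, "| stem:", light_stem(word), "| lemma:", lemmatize(word))
```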
1 https://ar.wikipedia.org/wiki/ﻗﺎﺋﻤﺔ_ﺃﺳﻤﺎﺀ_ﺍﻷﺳﺪ_ﻓﻲ_ﺍﻟﻠﻐﺔ_ﺍﻟﻌﺮﺑﻴﺔ.
3 Datasets and Tools
This section presents the datasets and tools used in this work. It introduces the features of each dataset and tool and discusses the reasons for adopting it.
3.1 Datasets
Compiling datasets for TC, as for many other research fields based on data-driven approaches, is becoming more manageable thanks to the tremendous growth of World Wide Web content. Besides, the availability of free web crawlers makes web scraping easier and accessible to everyone. Another advantage for those interested in Arabic TC is that Arabic is currently the fourth most used language on the web2. All these facts have led to a growing number of Arabic web-based corpora, primarily those comprising news articles.
In our case, we crawled different Arabic news websites using the HTTRACK3 web crawler, then performed cleaning and normalization. Since each website classifies its news articles differently from the others, we selected the same number of articles from only the common categories: politics, culture, economy, sport, health, and technology. As a result, the compiled corpus contains 50,000 articles per category, totaling 300,000 articles. It consists of over 153 million words, with an average of 512 words per article (287 at minimum and 737 at maximum).
To build a well-structured stop-word list, we did not limit our list to the most frequent words in the compiled corpus; we also included words from previously published lists. We then reviewed and filtered the combined lists, producing a new list of roughly 1,000 basic stop words. Finally, we generated each stop word's inflected forms following a proposed technique that involves 123 Arabic clitics [14]. The final list thus comprises 11,403 stop words.
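A hedged sketch of this expansion step follows, assuming the clitic inventory of [14] (not reproduced here, so only a tiny illustrative subset is used); the hypothetical expand_stop_word helper simply concatenates clitics, whereas the real technique also filters invalid forms.

```python
from itertools import product

# Tiny illustrative subset; the technique in [14] involves 123 Arabic clitics.
PROCLITICS = ["", "و", "ف", "ب", "ال"]  # conjunctions, prepositions, article
ENCLITICS = ["", "ه", "ها", "هم"]       # pronominal suffixes

def expand_stop_word(base):
    """Generate candidate inflected forms of a base stop word by attaching
    proclitics and enclitics; a real implementation would also filter out
    morphologically invalid combinations."""
    return {pre + base + post for pre, post in product(PROCLITICS, ENCLITICS)}

basic_stop_words = ["قال"]  # sample base entry ("he said")
full_list = set()
for w in basic_stop_words:
    full_list |= expand_stop_word(w)
print(len(full_list))  # 20 candidate forms for this one base word
```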
3.2 Tools
Here we introduce the various tools used for text stemming, lemmatization, and
classification.
According to three different comparative studies [15–17], which investigated the impact of 10 Arabic stemmers on Arabic TC performance, the best improvement in classification was achieved using ARLSTem v1.04 [18]; therefore, it is the one implemented in our study. The overall ranking of the remaining stemmers is as follows: ARLSTem v1.1 [17], Tashaphyne light stemmer5, Farasa [19], Khoja stemmer [10], Light10 [9], AlKhalil Morpho Sys [20], Assem's stemmer6, Soori's stemmer [21], and finally the ISRI stemmer [22]. The algorithm of ARLSTem v1.0 consists of the
2 https://www.internetworldstats.com/stats7.htm.
3 http://www.httrack.com/.
4 https://www.nltk.org/_modules/nltk/stem/arlstem.html [last accessed: January 24, 2021].
5 https://pypi.org/project/Tashaphyne/ [last accessed: January 24, 2021].
6 https://arabicstemmer.com/ [last accessed: January 24, 2021].
4 Evaluating the Pre-processing Techniques
The purpose of this section is to compare the impact of each pre-processing technique on the whole classification task. Therefore, after cleaning and preparing the corpus, the three pre-processing techniques, Stop Words (SW) removal, stemming, and lemmatization, were performed, along with all their possible combinations. The effectiveness of these techniques is compared with the baseline case, i.e., when none of them is performed. No feature selection method is involved in this phase. The three algorithms (NB, SVM, and DT J48) are used for classification, and 10-fold cross-validation is used to evaluate performance accuracy.
Table 1 presents the results obtained.
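For illustration, here is a minimal sketch of this evaluation protocol in scikit-learn. The paper's DT J48 is Weka's C4.5 implementation [24], so DecisionTreeClassifier (CART-based) is only an approximation, and the corpus loaders are hypothetical placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical loaders: pre-processed articles (after SW removal, stemming,
# and/or lemmatization) and their six-category labels from the in-house corpus.
docs = load_corpus_texts()
labels = load_corpus_labels()

classifiers = {
    "NB": MultinomialNB(),
    "SVM": LinearSVC(),
    "DT J48": DecisionTreeClassifier(),  # CART stand-in for Weka's C4.5
}
for name, clf in classifiers.items():
    pipe = make_pipeline(CountVectorizer(), clf)
    scores = cross_val_score(pipe, docs, labels, cv=10)  # 10-fold CV
    print(f"{name}: {100 * scores.mean():.2f}%")
```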
5 Evaluating the Feature Selection Methods
This section compares the effects of using the feature selection methods on the whole classification task. It is worth mentioning that this evaluation is conducted after the pre-processing task has been done using the three techniques: SW removal, stemming, and lemmatization. First, the feature selection methods TF-IDF and Chi-square were performed individually; then, their combination.
TF-IDF associates each word in a document with a number representing how relevant that word is in the document. Each document thus carries information on both its more important words and its less important ones. Consequently, documents with similar relevant words will have similar vectors.
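As a sketch, TF-IDF vectorization with scikit-learn (the paper does not state which weighting variant or toolkit it used, so TfidfVectorizer's defaults are an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["...", "..."]  # placeholder pre-processed documents
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # rows: documents, columns: vocabulary terms
# X[i, j] is high when term j is frequent in document i but rare in the
# rest of the corpus, so documents sharing relevant terms get similar vectors.
```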
The Chi-square test checks the independence between the occurrence of a specific word and the occurrence of a specific category. If no relationship exists between the word and the category, they are independent; this is the null hypothesis of the Chi-square test. We ranked the words by their Chi-square scores, and only the top-ranked words are then selected to serve as inputs for the classifier.
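A sketch of Chi-square ranking with scikit-learn; the cutoff k below is a hypothetical value, since the paper does not state how many top-ranked words were kept at this stage:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["...", "..."]          # placeholder pre-processed documents
labels = ["sport", "economy"]  # their categories

X = CountVectorizer().fit_transform(docs)
# chi2 scores each term against the labels; a high score argues against the
# null hypothesis that term occurrence and category are independent.
k = min(1000, X.shape[1])      # hypothetical cutoff
X_reduced = SelectKBest(chi2, k=k).fit_transform(X, labels)
```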
The combination of TF-IDF and Chi-square is implemented as described in [25]. First, the TF-IDF values of each word in each document are calculated; then, we calculate the sum of each word's TF-IDF values over the documents of the same class. These sums are then normalized. Next, a new weight for each word is computed using Chi-square. Finally, only the top two-thirds of the words are selected. Table 2 exhibits the enhancement achieved by applying these feature selection methods and their combination. The results are compared to the case where only the pre-processing techniques are performed.
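A sketch of these steps follows, assuming one plausible reading of [25]: the exact fusion of the normalized class sums with the Chi-square scores is specified there, not here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

def select_top_two_thirds(docs, labels):
    # 1. TF-IDF value of each word in each document.
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    y = np.asarray(labels)

    # 2-3. Sum each word's TF-IDF values over the documents of each class,
    # then normalize the sums within each class.
    class_sums = np.vstack([np.asarray(X[y == c].sum(axis=0)).ravel()
                            for c in np.unique(y)])
    class_sums /= class_sums.sum(axis=1, keepdims=True)

    # 4. Re-weight each word with its Chi-square score against the labels;
    # this product is one plausible fusion, not the exact formula of [25].
    chi_scores, _ = chi2(X, y)
    weights = chi_scores * class_sums.max(axis=0)

    # 5. Keep only the top two-thirds of the words.
    k = (2 * weights.size) // 3
    keep = np.sort(np.argsort(weights)[-k:])
    return X[:, keep], vec.get_feature_names_out()[keep]
```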
Table 2. 10-fold cross-validation scores for evaluating feature selection effectiveness on text classification.

Feature selection method      NB     SVM    DT J48
Without feature selection     85.99  87.64  83.63
TF-IDF                        91.16  92.63  86.88
Chi-square                    89.24  92.48  87.32
TF-IDF & Chi-square           93.47  94.91  90.81

(All values are 10-fold cross-validation scores in %.)
This experiment shows that the average improvement achieved across all classifiers after applying the feature selection methods individually is +4.47% for TF-IDF and +3.93% for Chi-square; however, their combination recorded the highest improvement (+7.31%). Comparing the two methods, TF-IDF performed better with the NB and SVM algorithms, whereas Chi-square showed superior performance with DT (J48).
6 Combining the Classifiers
The main purpose of this part is to combine the three algorithms NB, SVM, and DT (J48) and investigate the results obtained. The combination determines the most appropriate category for a given document in three steps (see the sketch after this list):
1. Perform document classification using all three algorithms;
2. Select for each document the most voted category;
3. If the categories given by the combined algorithms all differ, select the category proposed by the most accurate algorithm.
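A minimal sketch of this voting rule; the accuracies passed in are those each classifier obtained individually (Table 2), used only to break full disagreements:

```python
from collections import Counter

def vote(predictions, accuracy):
    """predictions: classifier name -> predicted category for one document.
    accuracy: classifier name -> its individual accuracy (tie-breaker)."""
    counts = Counter(predictions.values())
    category, votes = counts.most_common(1)[0]
    if votes > 1:  # step 2: at least two classifiers agree
        return category
    # Step 3: all classifiers disagree -> trust the most accurate one.
    best = max(accuracy, key=accuracy.get)
    return predictions[best]

# Example with the individual scores from Table 2:
preds = {"NB": "sport", "SVM": "economy", "DT J48": "politics"}
accs = {"NB": 0.9347, "SVM": 0.9491, "DT J48": 0.9081}
print(vote(preds, accs))  # -> "economy" (SVM is the most accurate)
```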
This voting method is the last step in the whole proposed system, which is based on combining several techniques in each phase. Figure 1 illustrates the whole idea of this work, exhibiting the different methods implemented in text pre-processing, feature selection, and classification.
After combining the outputs of two and three algorithms, the results displayed in Table 3 can be explained as follows:
• The combination of NB & SVM achieved an accuracy rate (94.97%) higher than the rates recorded by both algorithms when performed individually.
• The combination of SVM & DT showed inferior performance (94.48%) compared to the rate obtained by SVM alone (94.91%).
• The combination of NB & DT achieved an accuracy rate (91.36%) lower than the rate (93.47%) obtained by NB alone.
• The best result (95.78%) is achieved by the combination involving all three algorithms.
To sum up, combining only two algorithms can reduce accuracy compared to the most accurate algorithm involved in the combination. However, the combination of all three algorithms performed better than any of them applied individually. Finally, further improvement may still be possible if the number of involved algorithms is increased.
7 Conclusion
References
1. Zeroual, I., Lakhouaja, A.: Arabic corpus linguistics: major progress, but still a long way to
go. In: Intelligent Natural Language Processing: Trends and Applications, pp. 613–636.
Springer, Cham (2018)
2. Guellil, I., Saâdane, H., Azouaou, F., Gueni, B., Nouvel, D.: Arabic natural language
processing: an overview. J. King Saud Univ. – Comput. Inf. Sci. (2019). In Press
3. Zeroual, I., Goldhahn, D., Eckart, T., Lakhouaja, A.: OSIAN: Open source international
arabic news corpus - preparation and integration into the CLARIN-infrastructure. In:
Proceedings of the Fourth Arabic Natural Language Processing Workshop, pp. 175–182.
Association for Computational Linguistics, Florence (2019)
4. El-Khair, I.A.: Effects of stop words elimination for Arabic information retrieval: a
comparative study. arXiv preprint arXiv:1702.01925 (2017)
5. Al-Abdallah, R.Z., Al-Taani, A.T.: Arabic single-document text summarization using
particle swarm optimization algorithm. Proc. Comput. Sci. 117, 30–37 (2017)
6. Arora, K.K., Agrawal, S.S.: Pre-processing of English-Hindi corpus for statistical machine
translation. Comput. Sist. 21, 725–737 (2017)
7. El Kah, A., Zeroual, I.: The effects of pre-processing techniques on Arabic text classification.
IJATCSE 10, 41–48 (2021)
8. Jianqiang, Z., Xiaolin, G.: Comparison research on text pre-processing methods on Twitter
sentiment analysis. IEEE Access 5, 2870–2879 (2017)
9. Larkey, L.S., Ballesteros, L., Connell, M.E.: Light stemming for Arabic information
retrieval. In: Arabic Computational Morphology, pp. 221–243. Springer (2007)
10. Khoja, S., Garside, R.: Stemming Arabic text. Computing Department, Lancaster University,
Lancaster (1999)
11. Al-Anzi, F.S., AbuZeina, D.: Stemming impact on Arabic text categorization performance:
A survey. In: 2015 5th International Conference on Information Communication Technology
and Accessibility (ICTA), pp. 1–7 (2015)
12. Wahbeh, A., Al-Kabi, M., Al-Radaideh, Q., Al-Shawakfa, E., Alsmadi, I.: The effect of
stemming on Arabic text classification: an empirical study. Int. J. Inf. Retrieval Res. (IJIRR).
1, 54–70 (2011)
13. Namly, D., Bouzoubaa, K., El Jihad, A., Aouragh, S.L.: Improving Arabic lemmatization
through a lemmas database and a machine-learning technique. In: Recent Advances in NLP:
The Case of Arabic Language, pp. 81–100. Springer (2020)
14. Zeroual, I., Boudchiche, M., Mazroui, A., Lakhouaja, A.: Developing and performance
evaluation of a new Arabic heavy/light stemmer. In: Proceedings of the 2Nd International
Conference on Big Data, Cloud and Applications, pp. 17:1–17:6. ACM, Tetouan (2017)
15. Naili, M., Chaibi, A.H., Ghezala, H.H.B.: Comparative study of Arabic stemming algorithms
for topic identification. Proc. Comput. Sci. 159, 794–802 (2019)
16. Alhaj, Y.A., Xiang, J., Zhao, D., Al-Qaness, M.A., Abd Elaziz, M., Dahou, A.: A study of
the effects of stemming strategies on Arabic document classification. IEEE Access 7, 32664–
32671 (2019)
17. Abainia, K., Rebbani, H.: Comparing the effectiveness of the improved ARLSTem algorithm
with existing Arabic light stemmers. In: 2019 International Conference on Theoretical and
Applicative Aspects of Computer Science (ICTAACS), pp. 1–8. IEEE (2019)
18. Abainia, K., Ouamour, S., Sayoud, H.: A novel robust Arabic light stemmer. J. Exp. Theor.
Artif. Intell. 29(3), 557–573 (2016)
19. Abdelali, A., Darwish, K., Durrani, N., Mubarak, H.: Farasa: a fast and furious segmenter for
Arabic. In: Proceedings of the 2016 Conference of the North American Chapter of the
Association for Computational Linguistics: Demonstrations, pp. 11–16 (2016)
20. Boudchiche, M., Mazroui, A., Bebah, M.O.A.O., Lakhouaja, A., Boudlal, A.: AlKhalil
Morpho Sys 2: a robust Arabic morpho-syntactic analyzer. J. King Saud Univ.-Comput. Inf.
Sci. 29, 141–146 (2017)
21. Soori, H., Platoš, J., Snášel, V.: Simple stemming rules for Arabic language. In: Proceedings
of the Third International Conference on Intelligent Human Computer Interaction (IHCI
2011), Prague, Czech Republic, August 2011, pp. 99–108. Springer (2013)
22. Taghva, K., Elkhoury, R., Coombs, J.: Arabic stemming without a root dictionary. In: International Conference on Information Technology: Coding and Computing (ITCC 2005), pp. 152–157. IEEE (2005)
23. Pasha, A., Al-Badrashiny, M., Diab, M.T., El Kholy, A., Eskander, R., Habash, N.,
Pooleery, M., Rambow, O., Roth, R.: MADAMIRA: a fast, comprehensive tool for
morphological analysis and disambiguation of Arabic. In: LREC, pp. 1094–1101 (2014)
24. Garner, S.R.: Weka: the waikato environment for knowledge analysis. In: Proceedings of the
New Zealand Computer Science Research Students Conference, pp. 57–64 (1995)
25. Tang, H., Zhou, L., Chengjie, X., Zhu, Q.: A method of text dimension reduction based on
CHI and TF-IDF. In: 2015 4th International Conference on Mechatronics, Materials,
Chemistry and Computer Engineering. Atlantis Press (2015)