Impact of Stemming and Word Embedding On Deep Learning-Based Arabic Text Categorization

Received July 4, 2020, accepted July 9, 2020, date of publication July 14, 2020, date of current version July
23, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3009217
HUDA ABDULRAHMAN ALMUZAINI AND AQIL M. AZMI
Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
Corresponding author: Aqil M. Azmi ([email protected])
This work was supported by the Deanship of Scientific Research at King Saud University through the initiative of DSR Graduate Students
Research Support.
ABSTRACT Document classification is a classical problem in information retrieval, and plays an important
role in a variety of applications. Automatic document classification can be defined as content-based
assignment of one or more predefined categories to documents. Many algorithms have been proposed
and implemented to solve this problem in general; however, the classification of Arabic documents lags
behind similar work in other languages. In this paper, we present seven deep learning-based algorithms
to classify the Arabic documents. These are: Convolutional Neural Network (CNN), CNN-LSTM (LSTM =
Long Short-Term Memory), CNN-GRU (GRU = Gated Recurrent Units), BiLSTM (Bidirectional LSTM),
BiGRU, Att-LSTM (Attention-based LSTM), and Att-GRU. For word representation, we applied the
word embedding technique Word2Vec. We tested our approach on two large datasets, with six and eight
categories, using ten-fold cross-validation. Our objective was to study how the classification is affected
by the stemming strategies and word embedding. First, we looked into the effects of different stemming
algorithms on the document classification with different deep learning models. We experimented with
eleven different stemming strategies, broadly falling into three groups: root-based, stem-based, and no stemming.
We performed an ANOVA test on the classification results obtained with the different stemmers to check whether
the differences are statistically significant. The results of our study indicate that stem-based algorithms perform slightly better
compared to root-based algorithms. Among the deep learning models, the Attention mechanism and the
Bidirectional learning gave outstanding performance with Arabic text categorization. Our best performance
is F-score = 97.96%, achieved using the Att-GRU model with a stem-based algorithm. Next, we looked
into different controlling parameters for word embedding. For Word2Vec, both skip-gram and continuous bag-of-words
(CBOW) perform well with either stemming strategy. However, when using a stem-based algorithm,
skip-gram achieves good results with a vector of smaller dimension, while CBOW requires a larger dimension
vector to achieve a similar performance.
INDEX TERMS Arabic document classification, deep learning, stemming strategies, word embedding,
statistical significance.
The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Zakarya.

I. INTRODUCTION

The Internet is full of information in many different forms, including millions of textual documents. This large volume of data poses a challenge, even for simple tasks such as information retrieval (IR). A possible solution is to organize the textual data into different categories. Manual classification is out of the question, and the alternative is to automate the task. Automatic document classification is used to discover the basic information of documents automatically, thus saving human time and effort. Automatic text classification (TC), or document classification (we use the two terms interchangeably), is the assignment of a document to a pre-determined set of categories based on its contents.

With a population of around 445 million, Arabic language users constitute the fastest-growing language group with regard to the number of Internet users. According to a study in (www.internetworldstats.com/stats7.htm), during the last two decades, Arabic language Internet users have grown by 9348%. In the same statistics, the next two language
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020 127913
H. A. Almuzaini, A. M. Azmi: Impact of Stemming and Word Embedding on Deep Learning-Based Arabic Text Categorization
user groups experienced a growth of 3653.4% and 3356% (for Russian and Indonesian languages, respectively) in the same period. While most of the research to date has tackled the problem for the English language, the work on Arabic document classification is lagging behind. There are many reasons for this, as we will see in Section II-B. Needless to say, many authors have concluded that constructing automatic Arabic TC is a challenge [1]–[3].

Developing a classification system for the Arabic language involves understanding the syntactic structure of the words, so that we can manipulate and represent the words in a way that makes their classification more precise. The pre-processing step (e.g., removing stop-words, or stemming) plays a critical role in many Arabic natural language processing (NLP) applications, including classification. Moreover, it helps reduce the dimensionality and thus the classification time.

Deep learning (DL) is a sub-area of machine learning that uses multi-layered artificial neural networks to deliver high accuracy in diverse tasks such as object detection, speech recognition, and natural language processing. Word2Vec is a word embedding technique that uses a shallow neural network [4]. Word2Vec maps words into continuous vectors, and as it turns out, it is one of the most popular models for word similarity tasks and text classification. It has been successfully used for raw text classification in English [5]–[7], and other languages, such as Chinese [8], [9].

Given the importance of stemming and word embedding to TC, we would like to see their impact on textual classification using DL algorithms. This study involves seven different DL models: Convolutional Neural Network (CNN), CNN-Long Short Term Memory (CNN-LSTM), CNN-Gated Recurrent Units (CNN-GRU), Bidirectional LSTM (BiLSTM), Bidirectional GRU (BiGRU), Attention-based LSTM (Att-LSTM), and Attention-based GRU (Att-GRU); and eleven different stemming strategies broadly falling into root-based, stem-based, or no stemming. We are not aware of any comparable study that combined so many DL models along with so many stemming algorithms to assess a single Arabic NLP application. Summarizing our contributions:
• We explore how TC of Arabic documents using the different DL models is affected by the different stemming strategies.
• We investigate which of the studied DL models is best suited for the task of Arabic TC.
• We study the impact of word embedding, using Word2Vec, and how this improves TC.
• We use the analysis of variance (ANOVA) test to confirm the significance of the TC results.

More specifically, we conducted two groups of experiments, one for stemming strategies, and the other to identify appropriate parameters for Word2Vec. When conducting an experiment on, for example, stemming strategies, we fixed the parameters of word embedding. The same goes for the other experiment, where we fixed the stemming algorithm to one from each stemming strategy. As we will discuss later (see Section III), most of the studies limit their comparison by using a single candidate algorithm from each stemming approach. On the other hand, in this study, we compared the performance of TC using ten different stemming algorithms, five of which are root-based [10]–[14] and five of which are light stemmers [15]–[19], as well as no stemming. As we stated earlier, stemming impacts the performance of NLP applications, TC included. We experimented using two large corpora, and among the different DL models, only CNN-LSTM and CNN-GRU were affected by the stemming algorithms. So, we have an interesting conclusion: stemming had little or no impact on the performance of classifying Arabic documents when using some of the deep learning models. We achieved our best performance for Arabic TC using Att-GRU, BiGRU, and BiLSTM, in that order.

The rest of this paper is organized as follows: Section II presents the background on text classification and stemming algorithms. In Section III, we review related studies. In Section IV, we discuss our proposed system. Section V is a discussion of the experiments and results. Finally, Section VI concludes the paper, and outlines possible future work.

II. BACKGROUND

In this Section we briefly outline textual classification, the challenges facing classification when using the Arabic language, and stemming algorithms.

A. TEXTUAL CLASSIFICATION

Text classification (TC), also called text tagging or text categorization, is the process of classifying text into organized groups. It uses NLP to automatically analyze text, and then assigns a set of pre-defined tags or categories based on the content. Some of the areas that may benefit from TC are sentiment analysis, topic detection, and language identification.

Classifying a text involves three stages. The first is the pre-processing step. This step usually requires cleaning the text (e.g., removing punctuation marks, stop words, and numerals). Stemming is applied at this step. The second step involves representing the document in vector form to extract its features. Different techniques have been proposed, such as Latent Semantic Analysis [20], Bag-of-Words [21], and Word2Vec [4], or at character level using n-gram [22]. In the third step, we train and test the classifier. There are two approaches for training the classifier: the traditional method and the deep learning method. In the traditional method, following the conversion of the documents to feature vectors, we use a typical classifier (e.g., Support Vector Machine, K Nearest Neighbors, Naïve Bayes). More recently, deep learning (DL) methods have received considerable attention for the classification task. In a DL architecture, a multi-layer neural network is used, where the input is the document feature vector. One drawback is that DL requires large training data to give satisfactory results. Examples of DL models that achieved remarkable classification results
are: the CNN [5], character level CNN [23], and Deep Belief Networks (DBN) [24].

A subproblem of TC is hierarchical textual classification, where the document is classified into predefined multi-level categories. Hierarchical TC aims at organizing the mass of information as a tree structure in which a document that belongs to a topic at a certain level also belongs to all of its parent topics, ancestors, etc. [25]. For example, under a 3-level categorization, a document may be classified as "health → diseases → cancer", while another document is classified as "health → diseases → heart", etc. There are advantages to such hierarchical classification, such as improving retrieval time. Hierarchical TC is outside the scope of our work.

B. CHALLENGES TO CLASSIFYING DOCUMENTS IN ARABIC

Developing an accurate system for categorizing text from the large number of Arabic documents accessible on the Internet is very challenging. These challenges arise from the ambiguity due to the lack of diacritical markings in Modern Standard Arabic (MSA), the Arabic language's rich and complex morphology, the widespread use of synonyms, the nature of the language itself (Arabic is a highly inflectional and derivational language), etc.

The Arabic orthographic system uses small diacritical markings to represent different short vowels. There are a total of thirteen different diacritics, and these are used to clarify the sense and meaning of the word. In MSA, the written text is devoid of these markings, as it is assumed the reader will disambiguate the meaning. However, this is not true for machines [26]. Just to give an idea, consider the undiacritized word . It has more than one meaning depending on the diacritics: "necklace", "knots", "contract", "decade", "pact", and "complicated". Some words share the exact same diacritical marking, but have different meanings which can only be realized through context. For example, the word means "year", but it may also mean "public".

In addition, the plural, dual, and singular forms in Arabic vary according to gender. In Arabic, there are linguistic rules for each type, and some words have irregular plural forms. Moreover, the letter waw at the beginning of an Arabic word poses a challenge, since it may be the conjunction "and" or an original letter of the word. For example, the letter waw is a conjunction in "and sat", while it is part of the original lexeme in "stood up".

C. STEMMING ALGORITHMS

Stemming algorithms are commonly referred to as stemmers. Stemming is a computational procedure which reduces all words with the same root (or the same stem, in case prefixes are left untouched) to a common form, usually by stripping each word of its derivational and inflectional suffixes. A stemming algorithm reduces the words "chocolates", "chocolatey", or "choco" to the root word "chocolate"; and the words "retrieval", "retrieved", "retrieves" are reduced to the stem "retrieve". Here, we want to reduce different forms of a word to a core root or stem. This provides more convenience when handling words that share the same core meaning, thus playing an important role in the field of information retrieval (IR). In IR, grouping words with the same root (or stem) increases the success with which documents can be matched against a query [27].

For the English language, a simple stemming that involves the stripping of suffixes is sufficient for the purpose of IR. For Arabic, however, the stripping of suffixes alone would not be sufficient [29]. In Arabic, there are four kinds of affixes that can be attached to words: antefixes, prefixes, suffixes, and postfixes [28]. Table 1 provides an example of a complex Arabic word with all affixes. There are two main stemming approaches in Arabic: root-based stemming, and stem-based (or light) stemming.

TABLE 1. Example of an Arabic word which has different affixes attached to a root word, "negotiate". The full meaning is "to negotiate with them". Source: [28].

In the root-based stemming technique we perform heuristic and linguistic morphological analysis to extract the root of a word. This technique can be further divided into three categories: dictionary-based, nondictionary-based, and hybrid (see Figure 1). An example of the dictionary-based category is the Khoja stemmer [11], which uses, as the name implies, a dictionary file of Arabic roots. The nondictionary-based algorithms are further classified into three different approaches: pattern-based, statistical-based, and rule-based. The pattern-based algorithm uses the Arabic pattern (or a template) to match a word, then extract the root. For instance, the word "school" matches the pattern , resulting in the triliteral root : d r s. Some of the algorithms that fall under this approach are [12], [30], [31]. Ref [32] developed a statistical-based stemmer. The algorithm uses the idea of assigning weights to the letters in order to extract the root without consulting lists of prefixes, suffixes, patterns, or roots. Then, we have algorithms that utilize linguistic rules to extract the root, such as those in [10], [14], [33], [34]. The last approach under the root-based stemmer
is the hybrid method. In hybrid methods, we extract the root using a combination of rules, patterns, and/or a lookup dictionary of roots, e.g., [13], [35], [36].

FIGURE 1. Stemming approaches.

In light stemming we reduce the word by removing prefixes and suffixes. This method does not deal with patterns and infixes (letters added within the root). Occasionally, the resultant word may not be a valid one, but it is sufficient for the objective of IR. For example, "studies" whose stem is , is not a valid Arabic word. Works that tackle light stemming include [15]–[19], [37], [38]. The author in [17] uses Snowball, a programming language dedicated to stemming in different human languages. Ref [18], on the other hand, used lexicon resources to improve the stemming.

III. RELATED WORK

For the Arabic language, most of the stemming algorithms have been tested on IR, but few works have looked at their impact on automatic TC. Arabic is a morphologically rich language, and from a single root we can derive many different words. For example, from the root letters : d r s we can derive the words "study", "school", "teacher" (masc.), "teachers" (fem.), "teachers" (masc.), etc. An analysis of Arabic text in a newspaper indicates that there are more words occurring once and there are more distinct words than those found in English text of identical size, when no stemming is involved [39]. Given that, extracting the roots of the words found in a document will reduce the dimensionality, and may improve the accuracy of IR. However, many word variants do not share the same semantic meaning even though they may share the root. Thus, in the pre-processing stage for IR, the root extraction methods may increase a word's ambiguity, in which case the light stemming methods may be a better choice. Ref [15] investigated the effectiveness of two different stemming techniques of Arabic texts on IR. The first technique was root-based, and the second technique was light stemming. For the first technique, they used a modified Khoja's algorithm [11] to extract the roots; and for the second technique the authors used their own light stemmer. The latter technique relied on removing the most frequently occurring suffixes and prefixes, and on normalization. The authors evaluated the performance of no stemming, light stemming, and root-based stemming algorithms. For an Arabic query system, they reported a degraded performance when no stemming is involved, while light stemming significantly boosted the system performance compared to one using the root-based stemmer.

In [40], the authors investigated the impact of the root-based stemmer vs light stemmer for text mining tasks. In the pre-processing step, the authors applied the stemming algorithms of [11] and [37], respectively. They also used the Latent Semantic Analysis (LSA) model for measuring the semantic similarity between Arabic words. The experiments demonstrated that using light stemming improved the performance compared to using the root-based stemming algorithm. The authors attributed this to the occasional loss of sense when words were reduced to their root form.

Ref [1] studied the effectiveness of light stemming vs heavy stemming (root-based) for Arabic text categorization. For the experiment, the authors used a dataset of 15,000 documents, classifying them into three categories. Using the K-Nearest Neighbors (KNN) classifier, the authors experimented with both the heavy stemmer [32] and the light stemmer [15]. They concluded that better accuracy was achieved when light stemming was used as opposed to heavy stemming.

Reference [24] proposed a three-stage technique for classifying Arabic documents into multiple categories. A root extraction algorithm was applied in the pre-processing stage. Then, a combination of Markov and fuzzy C-means was used for clustering. In the third stage, the DBN was used to build the Arabic classification model for each resulting cluster. The experiment was conducted on 12,000 randomly selected documents from two different datasets. The authors reported an F-score of 91.02%.

Reference [41] studied the effect of stemming techniques on Arabic document classification. For the pre-processing step, the authors picked three stemmers, one root-based [30], and two stem-based [19], [42]. They used traditional classification algorithms, namely, Naïve Bayesian (NB), Support Vector Machines (SVM), and KNN. The experiments were performed on open source Arabic corpora (OSAC) [43]. The corpus consists of 5,070 documents divided into six categories. The best performance of micro-F1 = 94.64% was reported using SVM with the stemmer in [19].

To encode documents for classification, [44] utilized the Restricted Boltzmann Machine (RBM). For the pre-processing step, the authors applied Khoja's stemmer [11] and the light stemmer of [45]. After training the document representations using RBM, they classified the documents using Decision Tree (DT), NB, and SVM. Using the OSAC corpus [43], the authors reported their best accuracy of 75.1% with the light stemmer.

Learning effective document representation can enhance document classification. In [46], the authors proposed a technique that combines document embedding representation with Arabic WordNet to learn the word sense disambiguation. For the pre-processing step, the root of the words was extracted using the Khoja stemmer [11]. After learning
the document representations, the documents were classified using the multi-layer perceptron classifier. This proposed method yielded an F-score of 90%.

The proposed technique in [47] focused on feature selection for Arabic TC. They applied the modified Khoja's stemmer, followed by four different feature selection metrics (Chi-square, information gain, mutual information, and improved Chi-square) to select the best features. The number of features ranged from 20 to 1400. For classification they used DT and SVM. The authors tested their scheme on the OSAC corpus [43], and reported their best performance using improved Chi-square feature selection, achieving an F-score of 90.50%. This was achieved when using 900 features. When fewer or more features are selected, the F-score drops. For instance, the F-score = 83.8% and 88% for 100 and 1400 features respectively, using improved Chi-square.

Some recent studies explored the effects of different stemming approaches on TC and text mining tasks. The authors of [48] compared three different stemmers [11], [16], [49] on the TC task. For classification the authors used SVM and NB. The system was tested on 2,000 documents collected from royanews.com. They reported the best result of F-score = 92% using the stemmer [49].

The study in [50] used the stemmer [51] to assess the Arabic TC task. They used TF-IDF to extract the features from the documents. For classification, the authors used Logistic Regression, SVM, and CNN. The system was tested on 111,728 documents collected from three different online newspapers. They reported their highest accuracy score of 92% using the CNN model. Similarly, [52] compared two stemmers [49], [53] to see which one is better suited for Arabic TC. Using two different classifiers, SVM and NB, and a dataset of 1000 news articles from alghad.com, they reported their best performance of F-score = 90% when used with the stemmer [49].

The authors of [54] devised a TC model to detect violence in Arabic tweets using different feature reduction methods. For classification they used KNN, Bayesian boosting, and bagging SVM. In addition, the authors used two different stemmers (a heavy and a light stemmer), and n-gram words without stemming. They collected a total of 12,500 tweets covering four different regions of Saudi Arabia. Experimentally, the authors reported their highest accuracy of 86.61% using SVM bagging with tri-gram. A further boosting of accuracy to 90.59% was achieved using information gain and some reduction features.

Reference [55] studied the impact of stemming on sentiment analysis using SVM, NB, and Maximum Entropy. For the sake of comparison, the authors used different stemmers [11], [16], [30], [42], [56]. Their highest reported precision was 90%.

Some of the research studies did not use stemming at all, such as [57], who instead utilized the Part of Speech (POS) feature. After several experiments, they concluded that a higher number of features allowed them to reach a higher classification score. The POS method achieved a classification accuracy of 91% when the number of features was 2,000.

Reference [25] used Markov chains to solve the problem of hierarchical Arabic TC into three-level deep categories (see Section II-A). The top level had eight categories. The authors used a corpus containing 11,191 documents compiled from Alqabas newspaper. All the documents were at least 800 characters long. The corpus was split into 9711 documents (≈ 86.8%) for training, and 1480 documents for testing. The authors reported an accuracy of 90.29% for the first level categorization, with subsequent levels having accuracy of 77.09% and 63.33% (respectively).

Table 2 summarizes all the reviewed works. The above studies confirm that the preprocessing step is a challenging and crucial stage when dealing with Arabic documents. Stemming is likely to impact the quality of classification. Though there are so many stemmers for the Arabic language, with new ones consistently being devised, in the end most of the published works that studied the effect of stemming picked a single candidate algorithm from each approach: typically Khoja [11] for root-based stemming and, mostly, Light10 [16] for light stemming.

IV. PROPOSED SYSTEM

One of our objectives was to automatically classify the Arabic documents into a set of pre-defined categories. For the pre-processing step, we experimented with eleven different stemming strategies: five root-based stemmers, five stem-based (light) stemmers, and no stemming. Three of the five root-based stemmers were recommended by [58] as they are known to outperform others. These three are [10]–[12]. The other two, namely those by [13], [14], were proven to perform well in different NLP tasks. For light stemming we picked those algorithms that were developed recently, such as [15]–[19]. For our experiment we contacted the authors of all the aforementioned stemmers to share the source code. Only three agreed and shared the source, [11], [17], [18], for which we are grateful. The source of ARLSTem [19] is also freely available, but as it is in Python, we re-implemented it in Java. As for the other six stemming algorithms, we implemented the stemmers based on the description found in their respective published works, namely [10], [12]–[16].

Algorithm 1 Framework of Our Proposed System
Input: Raw document D
begin
    Clean document D (remove special symbols, stop-words, numerals, etc.)
    Apply stemming algorithm on D
    Create a list of distinct words in document D
    Use word embedding by transforming D into feature vector using Word2Vec model
    Classify D using a deep learning model
end
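
As a rough illustration of the cleaning, stemming, and distinct-word steps of Algorithm 1, the following Python sketch may help. It is not the authors' implementation: the stop-word list, the affix lists, and the function names (`clean`, `toy_light_stem`, `build_vocabulary`, `encode`, `preprocess`) are hypothetical, and the toy stemmer is not one of the stemmers evaluated in this study. The Word2Vec and classification steps are left out.

```python
import re

# All constants below are illustrative assumptions, not taken from the paper.
STOP_WORDS = {"\u0641\u064a", "\u0645\u0646", "\u0639\u0644\u0649"}  # fi "in", min "from", ala "on"
PREFIXES = ("\u0648\u0627\u0644", "\u0628\u0627\u0644", "\u0627\u0644", "\u0648")  # wal-, bal-, al-, wa-
SUFFIXES = ("\u0627\u062a", "\u0648\u0646", "\u064a\u0646", "\u0647\u0627", "\u0629")  # -at, -un, -in, -ha, ta marbuta

# Runs of Arabic letters; digits, punctuation, and non-Arabic words are skipped.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def clean(text):
    """Keep Arabic-letter tokens; drop stop-words and single-letter words."""
    tokens = ARABIC_WORD.findall(text)
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


def toy_light_stem(word):
    """Placeholder stemmer: strip at most one prefix and one suffix,
    always leaving at least three letters."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word


def preprocess(text):
    """Cleaning followed by stemming."""
    return [toy_light_stem(t) for t in clean(text)]


def build_vocabulary(docs):
    """Map each distinct stem to an integer id (0 is left free for padding)."""
    vocab = {}
    for doc in docs:
        for stem in doc:
            vocab.setdefault(stem, len(vocab) + 1)
    return vocab


def encode(doc, vocab):
    """Turn a stemmed document into the id sequence an embedding layer would consume."""
    return [vocab[s] for s in doc if s in vocab]
```

Swapping `toy_light_stem` for a root-based stemmer would typically shrink the vocabulary returned by `build_vocabulary`, which is the effect on dictionary size discussed in the preprocessing step.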
TABLE 2. Summary of related works along with the list of stemmers used therein.
For the Arabic document classification, we implemented seven different deep-learning models. For a thorough comparison of stemmers, we evaluate each model's classification performance using the different stemmers. Algorithm 1 summarizes our proposed system. There are three main steps: pre-processing, word embedding that transforms the document into a feature vector using the Word2Vec model, and finally a deep learning-based classifier. We go over each step in more detail below.

A. PREPROCESSING

In this step, we cleaned the text by removing punctuation marks, stop-words, numerals, non-Arabic words, and single-letter words. Then, we applied the stemmer. Each word in the document was reduced to its root (or stem) form based on the stemming algorithm. This was followed by creating a dictionary of all distinct words in the document. The dictionary was used to encode the documents in the embedding model. Compared to light stemming, using the root-based stemmer results in a lower number of words in the dictionary. More words in the dictionary means a larger document representation. Table 6 shows the number of words in the dictionary using different stemming algorithms.

B. WORD EMBEDDING

Word embedding is a technique used to represent the words of the document, where each word in the vocabulary is represented by a real-valued vector. In this representation, words with similar meaning will have a similar representation. For word embedding, we used Word2Vec [4]. It is an efficient algorithm in terms of space and time. Word2Vec is a two-layer neural network, where the input is the document and the output is a continuous feature vector of a pre-specified dimension. There are two main learning algorithms in Word2Vec: continuous skip-gram and continuous bag-of-words (CBOW). In the skip-gram method, we predicted the surrounding context words given the center word,
FIGURE 2. Neural network with three convolutional layers for document classification. Conv1D is the convolutional layer, and for pooling layer we
use max pooling.
FIGURE 3. The CNN-LSTM (long short-term memory) model.

while in CBOW we predicted the current word using a window of surrounding words. For instance, in the CBOW model we maximized the probability of a word being in a specific context,

$$\Pr(w_i \mid w_{i-d}, w_{i-d+1}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+d-1}, w_{i+d}) \tag{1}$$
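
To make the conditioning in Eq. (1) concrete, the Python sketch below enumerates the training examples that a window of size d induces for CBOW (context words to target word) and for skip-gram (center word to each context word). The function names are hypothetical, and truncating the window at the document boundaries is an assumption of this sketch rather than something stated in the paper.

```python
def context_window(tokens, i, d):
    """Words within d positions of index i, excluding tokens[i] itself.
    The window is simply truncated at the start and end of the document."""
    left = tokens[max(0, i - d):i]
    right = tokens[i + 1:i + 1 + d]
    return left + right


def cbow_examples(tokens, d):
    """CBOW: one (context words, target word) example per position."""
    return [(context_window(tokens, i, d), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, d):
    """Skip-gram: one (center word, context word) pair per context word."""
    return [
        (tokens[i], c)
        for i in range(len(tokens))
        for c in context_window(tokens, i, d)
    ]
```

For a four-token document and d = 1, CBOW yields four examples while skip-gram yields six pairs, since each context word becomes its own training pair.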
In Eq. (1), $w_i$ is the word at position $i$ and $d$ is the size of the window. Thus, it yields a model that is contingent on the distributional similarity of words. The dimension of word embedding can vary based on different applications, and it usually ranges from 50 to 300.

C. DEEP-LEARNING MODELS

In this Section we take a brief look at the different deep learning models that we used for Arabic TC. Altogether, there are seven different models. In all of the models, the first layer is the word embedding layer (see Section IV-B).

Convolutional Neural Network (CNN). Figure 2 shows our proposed model, consisting of three layers of CNN, and max pooling. Typically, a CNN is made up of a sequence of layers, where the output of a layer feeds into the next layer. The layers are: convolutional layer, pooling layer, and the fully connected layer [59]. The embedding layer is followed by the CNN layer with three filters (each with a different window size). Then, we have two CNN layers with a single filter, which are passed to a dense (fully connected) layer with soft-max activation whose output is the probability distribution over labels.

CNN-Long Short-Term Memory (CNN-LSTM). It combines CNN and forward LSTM. See Figure 3 for our proposed model. LSTM is a kind of Recurrent Neural Network that can capture sequential information, such as the context of a sentence [60]. The first layer following the embedding layer is a single CNN layer which is followed by the pooling layer. The CNN is used for feature extraction of the input text. Then, the resulting features are passed to the single LSTM layer with 100 memory units and tanh activation to support sequence prediction. We perform 20% dropout that is followed by a dense layer with soft-max activation.

CNN-Gated Recurrent Units (CNN-GRU). The GRU is quite similar to the LSTM, but with fewer gates, making it a little faster to train [61]. As mentioned for the previous model, we utilize CNN as a feature extraction layer with pooling. It is followed by the GRU layer with 60 memory cells and tanh activation, and finally, a dense layer with soft-max activation. There is no dropout in our implementation, since it works better without the dropout layer. Figure 4 illustrates our proposed model.

FIGURE 4. The CNN-GRU (Gated Recurrent Units) model.

Bidirectional LSTM (Bi-LSTM). The LSTM in CNN-LSTM is a forward LSTM, which can predict the class label based on the past (previous tokens). However, a word in the sentence is related to previous and next tokens. This makes it useful to learn the full context from both directions. The Bi-LSTM consists of two LSTMs, one to pass the text from left to right (forward LSTM), and another to pass the text from right to left (backward LSTM) [62]. This way, the model learns from past and future information. We implemented the Bi-LSTM using 100 memory units (forward and backward LSTMs) after encoding the text with the embedding layer. The output of those two LSTMs is fully connected to a dense layer with soft-max activation. For our proposed model, see Figure 5.

Bidirectional GRU (Bi-GRU). It is the same as Bi-LSTM but with the GRU layer instead of LSTM. Figure 6 shows our proposed model. After transforming the text to embedding representation, it is fed to the Bi-GRU
VOLUME 8, 2020 127919

model with 100 memory cells connected to a dense layer with soft-max activation.
FIGURE 5. The bidirectional LSTM model. All inputs Token1, Token2, ... to the model are stemmed.
FIGURE 6. The bidirectional GRU model. All inputs are stemmed.
Attention-based LSTM (Att-LSTM). Attention is a mechanism for concentrating on the information in the input data that is useful for a specific task, such as translation, visual identification of objects, text classification, etc. The attention model takes n arguments (which represent the hidden states of the LSTM, $\bar{h}_s$) and a context vector $c_t$ to generate the attention vector $a_t$. The attention vector $a_t$, which contains the relevant part of the text, is given by Eq. (2).
$$a_t = f(c_t, h_t) = \tanh(W_c [c_t; h_t]),$$
$$c_t = \sum_{s} \alpha_{ts} \bar{h}_s,$$
$$\alpha_{ts} = \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'=1}^{S} \exp(\mathrm{score}(h_t, \bar{h}_{s'}))}, \qquad (2)$$
where $h_t$ represents the last hidden state. The attention weights $\alpha_{ts}$ are calculated according to Luong’s score [63] but with many-to-one attention (sequence to label instead of sequence to sequence as in the case of translation). Figure 7 is our implementation of the attention model described above. After the embedding layer, there is a single LSTM layer with 100 memory units and tanh activation. The resulting hidden states are connected to the attention model, and at the end there is a dense layer with soft-max activation.
FIGURE 7. The attention-based LSTM model. All inputs are stemmed.
Attention-based GRU (Att-GRU). It uses the same attention mechanism described for Att-LSTM. The GRU layer has 64 memory cells connected to the attention model. The output of the attention model is connected to a dense layer with soft-max activation (Figure 8).
FIGURE 8. The attention-based GRU model. All inputs are stemmed.
V. EVALUATION AND RESULTS
For convenience we break this Section down into subsections covering: (i) our data sets, (ii) the setup for the experiments and the metrics used for the evaluation, (iii) experimenting with the different stemming algorithms, and (iv) experimenting with word embedding parameters.
A. THE DATA SETS
For the experiments we picked two different datasets. The first dataset is the ANT v1.1 (Arabic News Texts) Corpus [64],1 and the second dataset is the Saudi Press Agency (SPA) corpus.2 The ANT Corpus contains 10,161 documents with a total of 1.474 million words,
1 Compiled from Jawhara FM radio station in Tunisia. Free download from https://github.com/antcorpus/antcorpus.data/releases/tag/v1.1.
2 The official Saudi Press Agency: http://www.spa.gov.sa. We downloaded the documents covering the four-month period starting in September 2018.

and is divided into nine categories: culture, diverse, economy, international news, local news, politics, society, sports, and technology. The SPA corpus covers six categories: general, culture, sports, economic, social, and politics. Table 3 summarizes the datasets used in this work. Though both corpora share some categories, each has categories of its own. For example, the ANT corpus has the ‘‘technology’’ category, which is missing in the SPA corpus, while the latter has the ‘‘general’’ category, which is not in ANT. Since the ANT corpus has nine categories versus six in SPA, we decided to drop one of the categories from ANT, and used only eight. In addition, ‘‘local news’’ is the largest category in the ANT corpus, containing 4832 documents. We decided to use one-third of the documents in this category, in order to be on par with the ‘‘international news’’ category in the same corpus.
TABLE 3. Details of the ANT [64] and SPA corpora.
B. EXPERIMENTAL SETUP AND EVALUATION METRICS
We performed two sets of experiments: (a) evaluating the different stemmers, and (b) evaluating the word embedding. Further details follow.
(a) The first set of experiments looks at the effect of different stemming algorithms with the different deep learning models on the document classification task. The DL models are: CNN, CNN-LSTM, CNN-GRU, BiLSTM, BiGRU, Att-LSTM, and Att-GRU. For each DL model we conduct a hyper parameter tuning experiment to find the best parameters to train the individual model. We then fix these parameters for that DL model and use them in all experiments involving the evaluation of the different stemming algorithms on that model. Table 4 summarizes the fixed parameters of each model. For learning the word vector representation, we use a single setting for the Word2Vec model. The same setting is used for all DL models, namely: the skip-gram method with a window of size 5, and a feature vector of dimension 60.
(b) The second set of experiments looks at word embedding. In particular, we focus on the different parameters and study how they affect the classification. This experiment was applied on the CNN model only.
TABLE 4. The hyper parameter values used for each of the deep learning (DL) models.
All the experiments were performed using 10-fold cross-validation. Cross-validation (CV) is a statistical method used to assess machine learning models. Generally, in k-fold CV, the dataset is randomly divided into k groups, or folds, of (more or less) equal size. One of the folds is used as a validation set, and the other k − 1 folds are used for training. The process is repeated k times, each time picking a different fold for validation [66, p. 181]. In practice, one performs k-fold CV using k = 5 or k = 10, as these values have been shown to yield test error rate estimates that balance between high bias and high variance [66, p. 184].
We have different DL models and different stemming algorithms. We need to answer the question: does stemming impact the classification results of a DL model? The performance of some models will be indifferent to the stemming, i.e. whether we use stemming or not, the model’s behavior remains the same, while other models will be affected by the stemming. For this, we will use ANOVA (analysis of variance), a statistical method which compares the samples on the basis of their means [67]. ANOVA uses F tests to statistically determine if the means are significantly different from each other (we have eleven groups corresponding to the classification results of the eleven different stemming algorithms). Our null hypothesis will be ‘‘the sample means are equal’’. We then use the F statistic when deciding to support or reject the null hypothesis. In ANOVA, if the F value > F critical (for a specific α) then we reject the null hypothesis. For the one-way ANOVA, the F value is given by Eq. (3),
$$F\ \text{value} = \frac{\text{between-group variability}}{\text{within-group variability}}. \qquad (3)$$
For all the experiments we fixed α = 0.05. To calculate the F critical we use an F distribution table.3
We report the performance of document classification using precision, recall, and F-score. These measures are defined using the confusion matrix. A confusion matrix is a table that is used to describe the performance of a binary classification model on a set of test data for which the true values are known. For classification tasks, the terms true
3 See for example, https://www.stat.purdue.edu/~jtroisi/STAT350Spring2015/tables/FTable.pdf.
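The one-way ANOVA F value of Eq. (3) can be sketched in a few lines of plain Python. In this illustration each group would hold the ten per-fold weighted-F1 scores obtained with one stemming strategy; the function name and the toy numbers below are hypothetical and are not results from our experiments. The sketch uses the standard definitions, in which the between-group and within-group variabilities are the corresponding mean squares.

```python
def one_way_anova_f(groups):
    """Return (F value, df_between, df_within) for a list of score lists."""
    k = len(groups)                           # number of groups
    n_total = sum(len(g) for g in groups)     # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: spread of the group means.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-group sum of squares: spread inside each group.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Made-up toy data: three groups of three scores each.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, df_b, df_w = one_way_anova_f(toy_groups)
```

With eleven groups of ten folds each, the degrees of freedom are 11 − 1 = 10 and 110 − 11 = 99, so the F critical for α = 0.05 is read from the F table at (10, 99); the null hypothesis is rejected when the computed F value exceeds it.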

positive (TP), true negative (TN), false positive (FP), and false negative (FN) compare the results of the classifier with known judgments. The terms true and false refer to whether the prediction corresponds to the actual judgment, and the terms positive and negative refer to the classifier’s prediction. The four outcomes can be formulated in a 2 × 2 confusion matrix (see Table 5).
TABLE 5. Confusion matrix.
The precision (P) measures the exactness of a classifier, while recall (R) measures the completeness of a classifier. We can combine P and R to produce a single metric called F1 (which has been referred to earlier as the F-score), which is the harmonic mean of the two measures. The three measures are given by Eq. (4),
$$P = \frac{TP}{TP + FP},$$
$$R = \frac{TP}{TP + FN},$$
$$F_1 = \frac{2PR}{P + R}. \qquad (4)$$
However, when we have multiple class labels, as in our case, we need to redefine the measures in Eq. (4). In this case, averaging the measures can give a better view of the general results. For instance, we have the micro-, the macro- and the weighted-averaged measures. The micro-average aggregates the contributions of all the categories to compute the average metric, whereas the macro-average computes the metric independently for each category and then takes the average (hence treating all categories equally). We decided to use the weighted-average for all the measures, since it accounts for class label imbalance, as in our corpus. It is calculated by computing the metric independently for each category and then taking their average weighted by the number of true instances for each category. This may result in an F-score that is not between precision and recall. In the subsequent discussion, weighted-F1 or just plain F-score will refer to the weighted average F-score.
C. EXPERIMENTING WITH DIFFERENT STEMMING ALGORITHMS
We conducted eleven experiments using each of the seven DL models on two different datasets, a total of 154 experiments. For each experiment, we used a different stemming algorithm (we had ten stemmers altogether), and no stemming for the last experiment.
In Section IV-A we mentioned that the number of words in the vocabulary file (dictionary) is smaller when a root-based stemmer is used as opposed to a light stemmer. Table 6 shows the number of tokens in the vocabulary file for each stemmer when applied to the SPA corpus. Some observations: although [10] is a root-based stemmer, it yields a larger vocabulary than other root-based stemmers. We attribute this to the fact that, by design, no root finding is attempted for non-Arabic words, e.g. ‘‘electronic’’, which is left untouched. Furthermore, in the case of [18], a light stemmer, the number of tokens exceeds the case when no stemming is applied. This is because the stemmer often produces more than one stem for a single word. As we alluded to earlier, more words in the vocabulary means a larger document representation when using word embedding.
TABLE 6. Resultant number of tokens in the vocabulary file for the SPA corpus when using different stemmers.
Tables 7-8 summarize the performance of classifying the documents in the ANT and SPA corpora (respectively). All the values reported in the Tables are average weighted F-scores. The performance of the document classifier depends on the stemming algorithm. The extraction of a wrong root or stem can result in the appearance of a wrong word in a wrong context, which may affect the representation of the word, and consequently the accuracy of the classification. Across all the deep learning models, however, the light stemming algorithms generally yield better results than the root-based stemming algorithms in the context of document classification. In general, the best performance was achieved by the BiGRU model for the ANT corpus with stem-based algorithms, while for the SPA corpus the Att-GRU model yielded the best result irrespective of the stemming approach. Overall, the best classification result for the SPA corpus was weighted-F1 = 97.96% (Att-GRU model with light stemmer [15]), while for the ANT corpus it was weighted-F1 = 83.63% (BiGRU model with stemmer [16]). Given that SPA has twice as many documents as ANT, this is consistent with the observation that deep learning algorithms scale well with the amount of data fed.
Performance-wise, the weakest models were CNN-LSTM and CNN-GRU on both corpora. This shows that using CNN for feature extraction may not be as effective for Arabic TC. On the other hand, LSTM or GRU models performed well

either with bidirectional or with attention mechanism. What surprised us was the degraded performance of CNN-LSTM and CNN-GRU on the SPA corpus when used with the light stemmer [18]. For CNN-LSTM, the performance dropped from the mid-90s (using any of the other stemming algorithms) to 75.4%, and for CNN-GRU it was even worse, dropping to ≈ 52%. We attribute this to the stemming algorithm in [18], which often produced more than one stem for a word,4 and this affected the word embedding learning and consequently the CNN feature extraction process. The same stemming algorithm performed well with other models, e.g. BiGRU and Att-GRU. This suggests that learning the full context from both directions, or using the attention mechanism to concentrate on the useful information, does improve the result.
TABLE 7. Classification results expressed using weighted-F1 for classifying ANT corpus using various DL models with different stemmers.
TABLE 8. Classification results for the SPA corpus.
We now answer one of the questions raised in the paper. Does stemming impact the performance of Arabic TC when using a deep learning model? We use the ANOVA test on the 10-fold F-score results for all eleven different stemming algorithms. Table 9 lists the F value and F critical of the ANOVA test, recalling that we set α = 0.05 (Section V-B). For the ANT corpus, the F value is greater than F critical for four models: CNN-LSTM, CNN-GRU, BiLSTM, and BiGRU. We therefore reject the null hypothesis and claim that at least one of the stemming algorithms significantly affected the classification result. However, for the other three models (e.g., CNN), the stemming algorithms were of no use and their impact on the classification was insignificant. For the SPA corpus, there is clear evidence that the stemming algorithms had an impact on the classification. This is true for all the models except for BiLSTM and BiGRU.
There is an interesting observation here. From Tables 7-8, we note that the range of F-score is narrow indeed, and this is true for both corpora and for all but two models (i.e. CNN-LSTM and CNN-GRU). In the case of the ANT corpus, the F-score ranges between 79.01% and 83.63%, while for the SPA corpus it ranges between 94.61% and 97.96%. What is more interesting is that the performance of no-stemming is somewhere in between. This goes counter to the general belief that stemming is a necessary step in any Arabic NLP application. We believe that when using deep learning algorithms and word embedding, the impact of stemming is minuscule, and we may still get a great performance without it. For instance, we get F-score = 97.82% using Att-GRU without resorting to any stemming. However, if we are concerned about the training time then it is better to do stemming as it reduces the size of the vocabulary. Note that the fastest training models were CNN-LSTM and CNN-GRU,
4 There is a version that produces a single stem per word, but we were unsuccessful in contacting the authors.
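Since every score compared in this Section is a weighted average, it may help to see the averaging of Section V-B spelled out. The following minimal sketch computes per-class precision, recall, and F1 as in Eq. (4) and then weights them by the number of true instances of each class. The function name and the toy labels are hypothetical; this is an illustration of the definition, not the evaluation code used for the reported results.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Weighted-average precision, recall, and F1 over all classes in y_true."""
    support = Counter(y_true)      # number of true instances per class
    n = len(y_true)
    w_p = w_r = w_f = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Per-class measures of Eq. (4); zero when a denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = support[label] / n
        w_p += weight * precision
        w_r += weight * recall
        w_f += weight * f1
    return w_p, w_r, w_f


# Made-up toy labels: three documents of class "a" and one of class "b".
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
w_p, w_r, w_f = weighted_prf(y_true, y_pred)
```

Because each class contributes in proportion to its support, a large class such as ‘‘local news’’ in ANT influences the weighted score more than a small one such as ‘‘culture’’.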

although they resulted in the worst performance among all the models. In contrast, the BiLSTM and BiGRU models had the longest training time. The CNN model had a reasonable training time with fairly good performance, followed by Att-GRU and Att-LSTM.
TABLE 9. ANOVA test (α = 0.05) for both corpora for the 10-fold result among all ten stemming algorithms.
FIGURE 9. The F1 score of classifying documents of each label using different stemmers on the two corpora. Baseline stands for ‘‘no stemming’’.
Tables 10-11 show the classification results for each label for the ANT and SPA corpora (respectively), using the best performing model for each dataset. For each category, the results are reported in terms of F1 score (see Eq. (4)). For the ANT corpus (Table 10), for example, using the stemmer [13] for the category ‘‘culture’’ the F1 = 77%, while for the category ‘‘international news’’ the score is F1 = 91%. The number of documents in category ‘‘culture’’ is 124 (see Table 3). The weighted-F1 score for this stemmer is 82.5%.
Figure 9 plots the results presented in Tables 10 and 11, grouping them by category (label). Within each label, we list the F1 score of the document classification into that particular label using different stemmers: no stemming (baseline), [10], [12] etc. For the SPA corpus (Figure 9b), the results are very much consistent, and the classifier has done an equally good job in classifying documents into different categories. What is more interesting is that the use of

the stemmer has little impact on the performance. However, for the ANT corpus (Figure 9a), there is a difference in performance between labels. The best performance was for the ‘‘sports’’ and ‘‘international news’’ categories, while the worst was for the ‘‘local news’’ and ‘‘technology’’ categories. Even within a category, e.g. ‘‘culture’’, the performance varies between stemmers, ranging between 56% (no stemming) and 80% (e.g., [12]).
TABLE 10. Classification results of individual categories for the ANT corpus on BiGRU model. Values under each category are expressed using the F1 score.
TABLE 11. The classification results of individual categories for the SPA corpus using Att-GRU model.
Finally, we wanted to compare the performance with Jindal [24], another deep-learning based classifier for Arabic documents. Table 12 summarizes the performance of both systems. For stemming, the author used a root-based stemmer, but did not provide any further detail. So, to be fair, we report the performance of our system using the root-based stemmer [14], though it is not our best reported performance. Our system tops the performance of the other system by approximately 7%.
TABLE 12. Comparison of our system’s performance vs [24] (uses deep belief network). The two systems use different root-based stemmers and different datasets. The performance measure of the other system is as reported by the respective author.
D. EXPERIMENTING WITH WORD EMBEDDING PARAMETERS
This is the second set of experiments, where we looked into appropriate parameters for the Word2Vec word embedding algorithm. We confined this experiment to the CNN model only. The parameters that can be controlled are: the dimension of the vector, the learning method (skip-gram or continuous bag-of-words), the size of the window (number of neighboring words), and the cut-off occurrences for words in the vocabulary (less frequently occurring words are ignored). We conducted two experiments to identify which word embedding parameters affect the classification process. In the first experiment, we explored the two learning methods for Word2Vec, skip-gram and continuous bag-of-words (CBOW), with different vector dimensions. In the second experiment, we explored the size of the window. For both experiments, we picked two stemming algorithms, one root-based [10] and the other stem-based [16], both yield

good performance with the CNN model. The experiments were performed using 10-fold cross-validation on the SPA corpus. All the reported performances are measured in terms of weighted-F1.
TABLE 13. Effect of vector dimensions on the F-score of document classification using the two learning methods for Word2Vec.
TABLE 14. Effect of window sizes on the F-score of document classification.
The objective of the first experiment was to look at which of the two learning methods is more appropriate while, at the same time, trying different dimensions for the vector. Table 13 shows the results of the first experiment when using vectors of dimension 50, 60, 100, and 300, expressed in terms of weighted-F1 score. For this experiment, we fixed the other two parameters as follows: the window size was set to 5 (a typical value) and the cut-off frequency to one (no word was left out). From Table 13, we can observe that both methods (skip-gram and CBOW) yielded a comparable and good performance for Arabic document classification. With the stem-based approach in the pre-processing step, the skip-gram method works well with a smaller vector dimension, while CBOW requires a larger vector to achieve a high success rate. In the case of the root-based method, we achieved the best values when the dimension of the vector was 100, and this was true for both methods. Although the small vector dimension in the skip-gram method works well with stem-based, it leads to ambiguity when using the root-based stemmer for both methods.
In the second experiment, we looked at the effect of window size on the classification process. Table 14 shows the effect of different window sizes in each method on the classification process, where we fixed the dimension of the vector to 60. We tested windows of sizes 2, 5, and 8. As we mentioned earlier, windows of size 5 are typical. We note that using larger windows did not improve the classification. For stem-based, a smaller window is a good choice, but not so for the root-based stemmer. Overall, the best choice is a window of size 5. It works well with both methods using either stemming approach.
VI. CONCLUSION
In this work, we studied the effect of stemming strategies and word embedding on the performance of Arabic document classification using various deep learning models. The following models were explored: CNN, CNN-LSTM, CNN-GRU, Bidirectional LSTM, Bidirectional GRU, Attention-based LSTM, and Attention-based GRU. For the stemming strategies we investigated three of them: no stemming, root-based (five different algorithms), and stem-based (five different algorithms). For word embedding, we used Word2Vec. For testing, we used two large datasets, one with six predefined categories and the other with eight, and 10-fold cross-validation. For the first group of experiments, we conducted a total of 154 experiments (seven models × eleven stemming algorithms × two corpora). For these experiments we fixed the word embedding parameters. In the second group of experiments, we used the CNN model and picked two stemming algorithms that performed well with the CNN model (one root-based and the other stem-based); we then looked into the effect of different parameter settings controlling the Word2Vec used in classifying the documents. Summarizing the challenges and lessons learned (not in any particular order):
• There is a lack of research into Arabic NLP, and the absence of cooperation among researchers in the field aggravates the problem. In this work, we had to implement many of the stemming algorithms ourselves following the description in their respective papers.
• Deep learning models have a steep learning curve, and the size of the vocabulary affects the learning time proportionally. We performed all the experiments in this work using 10-fold cross-validation, which added to the time needed to finish the experiments. For instance, the BiGRU model with stemmer [18] took over 75 hours on a system with an Nvidia Tesla K80 GPU having 12GB memory.
• Label or class imbalance in the training set is a major issue in the text classification task. This was the case for the publicly available ANT corpus. We kept this in mind when compiling our data set from the official SPA (Saudi Press Agency), where we tried to retain proportionate classes.
• Experimental results show that stem-based algorithms (or light stemmers) generally yield a slightly better performance compared to the root-based algorithms (or heavy stemmers).
• Looking at the small differences in the performance between the three stemming strategies (no stemming, root-based, or light stemming), we can safely claim that the stemming step is optional for Arabic text

classification task, as not much was gained from using a stemming phase. One thing to keep in mind: stemming helps reduce the training time as the system has to learn less vocabulary.
• We believe that a simple model is more appropriate for a morphologically rich language such as Arabic. During the construction of the different deep learning models, we found that adding more layers did not improve the results. This was especially true for the models based on LSTM and GRU.
• Using bidirectional learning or the attention mechanism with LSTM and GRU significantly improves the classification result, whereas using the CNN model as a feature extraction layer before LSTM and GRU is ineffective. Nevertheless, using the standalone CNN model can achieve satisfactory classification results.
• For the embedding layer (Word2Vec), we found that it was better to use skip-gram with the stem-based algorithm, as we can get good results using a vector of dimension 50 ∼ 60. To achieve a comparable performance using CBOW (continuous bag-of-words) and the root-based algorithm, we needed a vector of dimension 100. In addition, for Word2Vec, the best window size is 5.
For future work, we intend to experiment with the Arabic multi-label classification problem using the models proposed in this work. We also intend to look into Graph Convolutional Networks (GCN), yet another successful deep learning architecture for NLP tasks, for the Arabic document classification problem.
REFERENCES
[1] R. Duwairi, M. Al-Refai, and N. Khasawneh, ‘‘Stemming versus light stemming as feature selection techniques for Arabic text categorization,’’ in Proc. Innov. Inf. Technol. (IIT), Nov. 2007, pp. 446–450.
[2] A. M. El-Halees, ‘‘Arabic text classification using maximum entropy,’’ Islamic Univ. J., vol. 15, no. 1, pp. 1–11, Dec. 2007.
[3] F. Harrag and E. El-Qawasmah, ‘‘Neural network for Arabic text classification,’’ in Proc. 2nd Int. Conf. Appl. Digit. Inf. Web Technol., Aug. 2009, pp. 778–783.
[4] T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘‘Efficient estimation of word representations in vector space,’’ 2013, arXiv:1301.3781. [Online]. Available: http://arxiv.org/abs/1301.3781
[5] Y. Kim, ‘‘Convolutional neural networks for sentence classification,’’ in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Aug. 2014, pp. 1746–1751.
[6] J. Lilleberg, Y. Zhu, and Y. Zhang, ‘‘Support vector machines and Word2vec for text classification with semantic features,’’ in Proc. IEEE 14th Int. Conf. Cognit. Informat. Cognit. Comput. (ICCI*CC), Jul. 2015, pp. 136–140.
[7] M. Hughes, I. Li, S. Kotoulas, and T. Suzumura, ‘‘Medical text classification using convolutional neural networks,’’ Stud Health Technol. Inf., vol. 235, pp. 246–250, May 2017.
[8] Z.-T. Yang and J. Zheng, ‘‘Research on Chinese text classification based on Word2vec,’’ in Proc. 2nd IEEE Int. Conf. Comput. Commun. (ICCC), Oct. 2016, pp. 1166–1170.
[9] D. Zhang, H. Xu, Z. Su, and Y. Xu, ‘‘Chinese comments sentiment classification based on word2vec and SVMperf,’’ Expert Syst. Appl., vol. 42, no. 4, pp. 1857–1863, Mar. 2015.
[10] S. Ghwanmeh, G. Kanaan, R. Al-Shalabi, and S. Rabab’ah, ‘‘Enhanced algorithm for extracting the root of Arabic words,’’ in Proc. 6th Int. Conf. Comput. Graph., Imag. Visualizat., Aug. 2009, pp. 388–391.
[11] S. Khoja and R. Garside, ‘‘Stemming Arabic text,’’ Dept. Comput., Lancaster Univ., Lancaster, U.K., Tech. Rep., 1999.
[12] R. Alshalabi, ‘‘Pattern-based stemmer for finding Arabic roots,’’ Inf. Technol. J., vol. 4, no. 1, pp. 38–43, Jan. 2005.
[13] M. Hadni, A. Lachkar, and S. A. Ouatik, ‘‘A new and efficient stemming technique for Arabic text categorization,’’ in Proc. Int. Conf. Multimedia Comput. Syst., May 2012, pp. 791–796.
[14] A. Al-Rjoub, ‘‘A new approach for Arabic root extraction,’’ M.S. thesis, Dept. MSc Comput. Sci., Jordan Univ. Sci. Technol., Irbid, Jordan, 2007.
[15] M. Aljlayl and O. Frieder, ‘‘On Arabic search: Improving the retrieval effectiveness via a light stemming approach,’’ in Proc. 11th Int. Conf. Inf. Knowl. Manage. CIKM, 2002, pp. 340–347.
[16] L. S. Larkey, L. Ballesteros, and M. E. Connell, ‘‘Light stemming for Arabic information retrieval,’’ in Arabic Computational Morphology. Springer, 2007, pp. 221–243.
[17] A. Chelli. (2016). Arabic Stemmer. [Online]. Available: https://www.arabicstemmer.com/
[18] Y. Jaafar, D. Namly, K. Bouzoubaa, and A. Yousfi, ‘‘Enhancing Arabic stemming process using resources and benchmarking tools,’’ J. King Saud Univ. Comput. Inf. Sci., vol. 29, no. 2, pp. 164–170, Apr. 2017.
[19] K. Abainia, S. Ouamour, and H. Sayoud, ‘‘A novel robust Arabic light stemmer,’’ J. Experim. Theor. Artif. Intell., vol. 29, no. 3, pp. 557–573, May 2017.
[20] D. Laham, ‘‘Latent semantic analysis approaches to categorization,’’ in Proc. 19th Annu. Conf. Cognit. Sci. Soc., 1997, p. 979.
[21] M. Davenport, ‘‘Introduction to modern information retrieval,’’ J. Med. Library Assoc. (JMLA), vol. 100, no. 1, p. 75, 2012.
[22] W. B. Cavnar and J. M. Trenkle, ‘‘N-gram-based text categorization,’’ in Proc. 3rd Annu. Symp. Document Anal. Inf. Retr. (SDAIR-94), 1994, pp. 161–175.
[23] X. Zhang, J. Zhao, and Y. LeCun, ‘‘Character-level convolutional networks for text classification,’’ in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 649–657.
[24] V. Jindal, ‘‘A personalized Markov clustering and deep learning approach for Arabic text categorization,’’ in Proc. ACL Student Res. Workshop, 2016, pp. 145–151.
[25] F. S. Al-Anzi and D. AbuZeina, ‘‘Beyond vector space model for hierarchical Arabic text classification: A Markov chain approach,’’ Inf. Process. Manage., vol. 54, no. 1, pp. 105–115, Jan. 2018.
[26] A. M. Azmi and E. A. Aljafari, ‘‘Universal Web accessibility and the challenge to integrate informal Arabic users: A case study,’’ Universal Access Inf. Soc., vol. 17, no. 1, pp. 131–145, Mar. 2018.
[27] D. Harman, ‘‘How effective is suffixing?’’ J. Amer. Soc. Inf. Sci., vol. 42, no. 1, pp. 7–15, Jan. 1991.
[28] Y. Kadri and J.-Y. Nie, ‘‘Effective stemming for Arabic information retrieval,’’ in Proc. Challenge Arabic NLP/MT Conf., Londres, U.K., Oct. 2006, pp. 68–74.
[29] I. A. Al-Kharashi and M. W. Evens, ‘‘Comparing words, stems, and roots as index terms in an Arabic information retrieval system,’’ J. Amer. Soc. Inf. Sci., vol. 45, no. 8, pp. 548–560, Sep. 1994.
[30] K. Taghva, R. Elkhoury, and J. Coombs, ‘‘Arabic stemming without a root dictionary,’’ in Proc. Int. Conf. Inf. Technol., Coding Comput. (ITCC), vol. 2, Apr. 2005, pp. 152–157.
[31] A. Nehar, D. Ziadi, H. Cherroun, and Y. Guellouma, ‘‘An efficient stemming for Arabic text classification,’’ in Proc. Int. Conf. Innov. Inf. Technol. (IIT), Mar. 2012, pp. 328–332.
[32] H. M. Al-Serhan, R. Al Shalabi, and G. Kannan, ‘‘New approach for extracting Arabic roots,’’ in Proc. Arab Conf. Inf. Technol. (ACIT), 2003, pp. 42–59.
[33] H. M. Harmanani, W. T. Keirouz, and S. Raheel, ‘‘A rule-based extensible stemmer for information retrieval with application to Arabic,’’ The Int. Arab J. Inf. Technol., vol. 3, no. 3, pp. 265–272, 2006.
[34] M. Momani and J. Faraj, ‘‘A novel algorithm to extract tri-literal Arabic roots,’’ in Proc. IEEE/ACS Int. Conf. Comput. Syst. Appl., May 2007, pp. 309–315.
[35] Q. Yaseen and I. Hmeidi, ‘‘Extracting the roots of Arabic words without removing affixes,’’ J. Inf. Sci., vol. 40, no. 3, pp. 376–385, Jun. 2014.
[36] M. N. Al-Kabi, S. A. Kazakzeh, B. M. Abu Ata, S. A. Al-Rababah, and I. M. Alsmadi, ‘‘A novel root based Arabic stemmer,’’ J. King Saud Univ. Comput. Inf. Sci., vol. 27, no. 2, pp. 94–103, Apr. 2015.
[37] L. S. Larkey, L. Ballesteros, and M. E. Connell, ‘‘Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis,’’ in Proc. 25th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR ’02), 2002, pp. 275–282.
[38] A. Chen and F. Gey, ‘‘Building an Arabic stemmer for information retrieval,’’ in Proc. 11th Text Retr. Conf. (TREC), 2002, pp. 631–639.
[39] A. Goweder and A. D. Roeck, “Assessment of a significant Arabic corpus,” in Proc. Arabic NLP Workshop ACL/EACL, 2001, pp. 73–79.
[40] H. Froud, “A comparative study of root-based and stem-based approaches for measuring the similarity between Arabic words for Arabic text mining applications,” Adv. Comput., Int. J., vol. 3, no. 6, pp. 55–67, Nov. 2012.
[41] Y. A. Alhaj, J. Xiang, D. Zhao, M. A. A. Al-Qaness, M. Abd Elaziz, and A. Dahou, “A study of the effects of stemming strategies on Arabic document classification,” IEEE Access, vol. 7, pp. 32664–32671, 2019.
[42] T. Zerrouki. (2017). Tashaphyne Arabic Light Stemmer. Accessed: Aug. 25, 2019. https://pypi.org/project/Tashaphyne/
[43] M. K. Saad and W. M. Ashour, “OSAC: Open source Arabic corpora,” in Proc. 6th Int. Symp. Elect. Electron. Eng. Comput. Sci. (EEECS), 2010, pp. 1–6.
[44] F.-Z. El-Alami and S. O. El Alaoui, “An efficient method based on deep learning approach for Arabic text categorization,” in Proc. Int. Arab Conf. Inf. Technol. (ACIT), 2016, pp. 1–7.
[45] M. Sawalha and E. Atwell, “Comparative evaluation of Arabic language morphological analysers and stemmers,” in Proc. Int. Conf. Comput. Linguistics (COLING), 2008, pp. 107–110.
[46] F.-Z. El-Alami and S. O. El Alaoui, “Word sense representation based-method for Arabic text categorization,” in Proc. 9th Int. Symp. Signal, Image, Video Commun. (ISIVC), Nov. 2018, pp. 141–146.
[47] S. Bahassine, A. Madani, M. Al-Sarem, and M. Kissi, “Feature selection using an improved chi-square for Arabic text classification,” J. King Saud Univ. Comput. Inf. Sci., vol. 32, no. 2, pp. 225–231, Feb. 2020.
[48] T. Kanan, O. Sadaqa, A. Almhirat, and E. Kanan, “Arabic light stemming: A comparative study between P-stemmer, khoja stemmer, and light10 stemmer,” in Proc. 6th Int. Conf. Social Netw. Anal., Manage. Secur. (SNAMS), Oct. 2019, pp. 511–515.
[49] T. Kanan, R. Kanaan, O. Al-Dabbas, G. Kanaan, A. Al-Dahoud, and E. Fox, “Extracting named entities using named entity recognizer for Arabic news articles,” Int. J. Adv. Stud. Comput., Sci. Eng., vol. 5, no. 11, pp. 78–84, 2016.
[50] S. Boukil, M. Biniz, F. E. Adnani, L. Cherrat, and A. E. E. Moutaouakkil, “Arabic text classification using deep learning technics,” Int. J. Grid Distrib. Comput., vol. 11, no. 9, pp. 103–114, Sep. 2018.
[51] S. Boukil, F. El Adnani, A. E. El Moutaouakkil, L. Cherrat, and M. Ezziyyani, “Arabic stemming techniques as feature extraction applied in Arabic text classification,” in Proc. Int. Conf. Adv. Inf. Technol., Services Syst. Springer, 2017, pp. 349–361.
[52] M. Elbes, A. Aldajah, and O. Sadaqa, “P-stemmer or NLTK stemmer for Arabic text classification?” in Proc. 6th Int. Conf. Social Netw. Anal., Manage. Secur. (SNAMS), Oct. 2019, pp. 516–520.
[53] S. Bird, E. Klein, and E. Loper, Natural Language Processing With Python: Analyzing Text With the Natural Language Toolkit. Newton, MA, USA: O’Reilly Media, 2009.
[54] H. ALSaif and T. Alotaibi, “Arabic text classification using feature-reduction techniques for detecting violence on social media,” Int. J. Adv. Comput. Sci. Appl., vol. 10, no. 4, pp. 77–87, 2019.
[55] A. Oussous, A. A. Lahcen, and S. Belfkih, “Impact of text pre-processing and ensemble learning on Arabic sentiment analysis,” in Proc. 2nd Int. Conf. Netw., Inf. Syst. Secur. NISS, 2019, pp. 1–9.
[56] M. K. Saad and W. M. Ashour, “Arabic morphological tools for text mining,” in Proc. 6th Int. Conf. Electr. Comput. Syst. (EECS), 2010.
[57] A. Al-Thubaity, A. Alqarni, and A. Alnafessah, “Do words with certain part of speech tags improve the performance of Arabic text classification?” in Proc. 2nd Int. Conf. Inf. Syst. Data Mining ICISDM, 2018, pp. 155–161.
[58] E. Al-Shawakfa, A. Al-Badarneh, S. Shatnawi, K. Al-Rabab’ah, and B. Bani-Ismail, “A comparison study of some Arabic root finding algorithms,” J. Amer. Soc. Inf. Sci. Technol., vol. 61, no. 5, pp. 1015–1024, May 2010.
[59] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[60] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[61] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1724–1734.
[62] A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., May 2013, pp. 6645–6649.
[63] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2015, pp. 1412–1421.
[64] A. Chouigui, O. B. Khiroun, and B. Elayeb, “ANT corpus: An Arabic news text collection for textual classification,” in Proc. IEEE/ACS 14th Int. Conf. Comput. Syst. Appl. (AICCSA), Oct. 2017, pp. 135–142.
[65] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[66] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning With Applications in R, vol. 112. New York, NY, USA: Springer-Verlag, 2013.
[67] R. A. Fisher, “XV.—The correlation between relatives on the supposition of Mendelian inheritance,” Trans. Roy. Soc. Edinburgh, vol. 52, no. 2, pp. 399–433, 1919.

HUDA ABDULRAHMAN ALMUZAINI received the B.Sc. degree in computer science from Qassim University, and the M.Sc. degree in computer science from King Saud University, Riyadh, Saudi Arabia, where she is currently pursuing the Ph.D. degree. She is also a Lecturer with the Department of Computer Science, Imam Mohammad ibn Saud Islamic University, Riyadh. Her research interests include natural language processing and deep learning.

AQIL M. AZMI received the B.S.E. degree in electrical & computer engineering (ECE) from the University of Michigan, Ann Arbor, MI, USA, and the M.Sc. degree in electrical engineering and the Ph.D. degree in computer science from the University of Colorado, Boulder, CO, USA. He is currently a Professor with the Department of Computer Science, King Saud University, Saudi Arabia. His current research interests include natural language processing, computational biology, bioinformatics, image understanding and processing, and digital humanities, specifically the critical analysis of historical and religious texts.