
International Journal on Recent and Innovation Trends in Computing and Communication

ISSN: 2321-8169 Volume: 11 Issue: 11s


DOI: https://doi.org/10.17762/ijritcc.v11i11s.8079
Article Received: 25 June 2023 Revised: 22 August 2023 Accepted: 03 September 2023
___________________________________________________________________________________________________________________

Arabic Text Summarization Challenges using Deep Learning Techniques: A Review
Adnan Souri*1, Mohammed Al Achhab1, Badr Eddine El Mohajir1, Mohamed Naoum2, Outman El Hichami3,
Abdelali Zbakh4
1NTTITeam, FS, Abdelmalek Essaadi University, Tetouan, Morocco
* [email protected]
2ENSIAS, Mohammed V University in Rabat, Rabat, Morocco
3Applied Mathematics And Computer Sciences Team, ENS, Abdelmalek Essaadi University, Tetouan, Morocco
4ER-MSI Team, ENCGT, Abdelmalek Essaadi University, Tetouan, Morocco

Abstract- Text summarization is a challenging task in Natural Language Processing because it depends both on how the language is modeled and on the techniques used to produce concise summaries. Arabic raises the difficulty further: the language has many distinctive features, tools and resources for it are scarce, and algorithms and models must be adapted to it. In this paper, we present several studies on Arabic text summarization that apply different algorithms to several datasets. We then compare these studies and draw conclusions to guide researchers in their future work.
Keywords- Extractive summarization; Abstractive summarization; Hybrid Approaches; Reinforcement Learning; Summarization of Arabic
News; AraVec; MultiBooked

I. INTRODUCTION

Research in Arabic Natural Language Processing (ANLP) has gained significant attention in recent years. Arabic is a challenging language for Natural Language Processing (NLP) due to its rich morphology, complex grammar, and unique linguistic characteristics.

One of the most important research areas in ANLP is text summarization [1], [2], [3], [4] and [5].

Text summarization aims to generate concise summaries of longer texts. Arabic text summarization research focuses on developing algorithms and techniques to handle the unique characteristics of Arabic text, such as complex sentence structures and the presence of rich morphological features [6] and [7].

This paper gives an overview of several studies on Arabic text summarization. The techniques presented include neural network architectures, convolutional neural networks, abstractive and extractive text summarization, Transformer-based models, and clustering and sentence scoring techniques, among others.

We then present some features of the Arabic language that play a major role in ANLP: the script itself, the diglossic nature of the language, and its grammatical and conjugational features.

We then present text summarization techniques and the challenges that researchers encounter due to language morphology, lexical ambiguity, sentence structure, and limited resources.

Finally, we present the results and analysis of our review, with a discussion of the findings and their implications. We conclude with the study's contributions and future directions.

II. RELATED WORK

Research on Arabic text summarization has gained significant attention in recent years. Here are some notable studies and approaches in the field:

In paper [1], the authors conduct a systematic review of automatic Arabic text summarization techniques. They explore the existing literature on Arabic text summarization and provide a comprehensive overview of the different approaches, methodologies, and algorithms used in this field.

The study aims to identify the key techniques employed in Arabic text summarization, including extractive and abstractive approaches, as well as the evaluation metrics and benchmarks commonly used for assessing summarization quality. The authors analyze the strengths and weaknesses of each technique and highlight the challenges specific to Arabic text summarization.

By conducting a systematic review, paper [1] provides insights into the current state of the art in Arabic text summarization. The authors discuss the advancements made in the field and identify research gaps that require further investigation. The study contributes to the understanding of Arabic text

IJRITCC | October 2023, Available @ http://www.ijritcc.org
summarization techniques and serves as a valuable resource for researchers and practitioners in the field.

Ref. [2] conducts a comprehensive review of Arabic text summarization techniques. The review covers a wide range of approaches and methodologies used in Arabic text summarization, including both extractive and abstractive methods. The authors discuss various aspects of Arabic text summarization, such as linguistic complexities, domain-specific summarization, evaluation metrics, and datasets used in Arabic summarization research. They analyze the strengths and weaknesses of different techniques and approaches, providing insights into the current state of the art in Arabic text summarization.

By conducting a thorough review, the study in [2] helps researchers and practitioners gain a comprehensive understanding of Arabic text summarization, its challenges, and the advancements made in the field. The authors also identify research gaps and highlight future research directions, aiming to contribute to the further development of Arabic text summarization techniques.

In [3], the authors present a survey of automatic Arabic text summarization techniques. The survey covers various aspects of Arabic text summarization, including both extractive and abstractive methods, as well as the challenges and evaluation metrics used in Arabic summarization research. The authors review and analyze the existing literature on Arabic text summarization, discussing the different techniques, algorithms, and approaches employed in this domain. They explore the linguistic complexities of the Arabic language and how they impact the development of effective summarization models. They identify the strengths and limitations of different approaches and highlight the research gaps that require further exploration.

In [4], the authors present a comprehensive state-of-the-art survey of feature-based automatic text summarization methods. The survey focuses on summarization techniques that utilize various features, such as statistical, linguistic, semantic, and syntactic features, to extract salient information from text and generate summaries.

The authors in [4] review and analyze the existing literature on feature-based automatic text summarization, discussing the different approaches, algorithms, and evaluation methods employed in this field. They examine how different features contribute to the summarization process and highlight the strengths and limitations of various feature-based methods. They identify the trends, challenges, and potential research directions in this area.

Meanwhile, in [5], the authors present a comprehensive review of automatic text summarization methods. The review covers various techniques, algorithms, and approaches used in the field of automatic text summarization. The authors provide an overview of the different methods and evaluate their strengths, limitations, and potential applications.

Similarly, in [6], the authors present a comprehensive review of automatic text summarization techniques and methods. The review covers various approaches and algorithms employed in the field of automatic text summarization. The authors provide an overview of the different techniques, evaluate their effectiveness, and discuss their strengths, limitations, and potential applications.

In [7], the authors focus on multi-document summarization in Arabic using neural network architectures. The study proposes a model that leverages the Transformer-based architecture and utilizes attention mechanisms to generate summaries from multiple input documents.

The study presented in [8] explores the application of deep learning techniques, such as convolutional neural networks (CNN) and recurrent neural networks (RNN), to Arabic text summarization. It compares the performance of different deep learning architectures in generating abstractive summaries.

Reference [9] investigates abstractive Arabic text summarization using neural network models. The study explores the use of sequence-to-sequence models with attention mechanisms to generate concise summaries while capturing the semantic meaning of the input text.

In [10], the authors investigate the use of Transformer-based models, specifically BERT (Bidirectional Encoder Representations from Transformers), for Arabic news summarization. The study explores different fine-tuning approaches and compares the performance with traditional methods.

The authors in [11] propose an extractive summarization approach for Arabic text using lexical chains and sentence similarity. Lexical chains are built to capture semantic relationships between sentences, and sentence similarity measures are used to identify important sentences for the summary.

In [12], the authors propose an extractive summarization method specifically for Arabic news articles. They use sentence-level and phrase-level features, such as term frequency and position, to rank sentences and select the most relevant ones for the summary.

The study in [13] combines clustering and sentence scoring techniques for Arabic text summarization. Clustering is used to group similar sentences, and sentence scoring based on features like term frequency-inverse document frequency (TF-IDF) and sentence position is employed to select representative sentences from each cluster.

In [14], the research paper presents an approach that combines sentence compression and semantic similarity techniques for Arabic text summarization. Sentence compression is applied to reduce sentence length, and semantic

similarity is used to select the most representative sentences for the summary.

The work presented in [15] explores the application of abstractive summarization techniques to Arabic news articles. The proposed method incorporates deep learning models, such as Long Short-Term Memory (LSTM) networks, to generate summaries by paraphrasing and restructuring the original text.

In [16], the study proposes a sentence ranking approach based on connectedness and positional weight to generate extractive summaries in Arabic. The approach takes into account sentence cohesion and coherence to select important sentences.

The study in [17] introduces an extractive summarization approach for Arabic text based on sentence clustering. The method clusters sentences using a semantic similarity measure and selects representative sentences from each cluster to form the summary.

The research in [18] proposes a method for Arabic text summarization based on the concept of lexical chains. Lexical chains are used to identify and extract important concepts from the text, which are then used to generate summaries by selecting relevant sentences.

In [19], a survey provides an overview of the challenges, techniques, and evaluation methods in Arabic text summarization. It discusses extractive and abstractive approaches, linguistic features, and domain-specific summarization.

Similarly, in [20], a survey provides an overview of different approaches and techniques used in Arabic text summarization, including extractive and abstractive methods. It discusses challenges specific to Arabic and highlights key research directions in the field.

These studies highlight the ongoing research efforts in Arabic text summarization, addressing the unique challenges of the Arabic language and exploring various techniques, ranging from traditional feature-based methods to state-of-the-art deep learning approaches. Researchers continue to explore novel approaches, develop specialized datasets, and refine evaluation metrics to advance the field of Arabic text summarization.

The Arabic language is a Semitic language and is widely spoken across the Middle East and North Africa. It has a rich history and is one of the six official languages of the United Nations. Here are some key features of the Arabic language:

Arabic uses a script known as the Arabic alphabet, which consists of 28 letters, some of which have one form, while others have two, three, or four forms. Table I presents some of these letters with their number of forms. A letter's form depends on its position within a word. The script is written from right to left.

TABLE I. LETTER FORMS ACCORDING TO THE POSITION IN A WORD
Letter | Forms | Number of forms
د (dal) | "د" | 1
س (seen) | "سـ" ; "س" | 2
هـ (haa) | "هـ" ; "ـه" ; "ه" | 3
ع (ayn) | "عـ" ; "ـعـ" ; "ـع" ; "ع" | 4

Arabic has a diglossic nature, meaning that it has two forms: Classical Arabic (also known as Quranic Arabic) and Modern Standard Arabic (MSA). Classical Arabic is the language of the Quran and is used in formal settings, literature, and religious texts. MSA is used in formal speech, writing, and media, while different dialects are used for everyday conversation.

Like other Semitic languages, Arabic is based on a root system. Words in Arabic are typically derived from a three-consonant root, and different forms and meanings are created by adding vowels, prefixes, and suffixes to these roots, as shown in Table II.

TABLE II. DERIVATION BY ADDING VOWELS, PREFIXES AND SUFFIXES
English | Arabic
He wrote | (kataba) كتب
He writes | (yaktubu) يكتب
A book | (kitabun) كتاب
A library | (maktabatun) مكتبة
Written | (maktubun) مكتوب
Children's Quran school | (kuttabun) كتاب

Arabic has grammatical gender, with nouns and adjectives being either masculine or feminine. Agreement between nouns, adjectives, and verbs is important in Arabic, meaning that they must agree in gender, number, and case.

Arabic has a dual form for nouns, pronouns, and verbs, which is used to indicate exactly two of something. Table III shows a sentence in the singular, dual, and plural forms.

TABLE III. AGREEMENT BETWEEN NOUNS AND ADJECTIVES
The sentence in English | The sentence in Arabic
The girl is beautiful | الفتاة جميلة
The two girls are beautiful | الفتاتان جميلتان
The girls are beautiful (three or more) | الفتيات جميلات

Arabic verbs are highly inflected and conjugated according to tense, mood, aspect, person, and number. There are several verb forms, known as "conjugations," which are used to indicate different aspects and time frames. In Table IV, we present the verb "to write" conjugated in the present tense with some pronouns.

TABLE IV. A VIEW OF CONJUGATION OF VERB (KATABA) TO WRITE
English | Arabic
I write | (aktubu) أكتب
you write (masc.) | (taktubu) تكتب
you write (fem.) | (taktubiina) تكتبين
you write (dual) | (taktubaani) تكتبان
you write (masc.) (plural) | (taktubuuna) تكتبون
he writes | (yaktubu) يكتب

Arabic has separate subject and object pronouns, which are used in different positions within a sentence. Pronouns can also be attached as suffixes to verbs, prepositions, and nouns to indicate possession or object agreement.

Arabic has a relatively small vowel system, consisting of long and short vowels. Short vowels are often not written, but they are important for pronunciation and are indicated by diacritical marks in certain texts.

Arabic has a variety of sounds that may be unfamiliar to speakers of other languages. It includes guttural sounds like "ayn" (/ʕ/) and "ha" (/ħ/), which can be challenging for non-native speakers to pronounce.

Arabic calligraphy is a highly regarded art form. It plays a role similar to fonts in text processing software, but with an artistic aspect. The beauty of the Arabic script is often emphasized in various forms of artistic expression, such as Quranic manuscripts, architecture, and decorative arts.

Because of these and other features of the Arabic language, ANLP is quite challenging.

III. ARABIC TEXT SUMMARIZATION TECHNIQUES

Arabic text summarization is the task of generating concise summaries of longer Arabic texts while preserving the key information and main points [1], [2], [3], [4] and [5]. It faces unique challenges due to the complex structure of Arabic sentences, rich morphology, and the presence of dialectal variations [6] and [7]. In this section, we present some approaches and techniques used in Arabic text summarization:

A. Extractive Summarization

Extractive methods involve selecting important sentences or phrases from the original text to construct a summary [21]. Key techniques for extractive summarization in Arabic include:

• Sentence Scoring: Sentences are assigned scores based on various features such as word frequency, position, or semantic similarity to identify important content.
• Graph-based Algorithms: Text is represented as a graph, where nodes represent sentences, and edges represent relationships between them. Graph algorithms are used to identify the most informative sentences for the summary.
• Clustering: Similar sentences are grouped together, and representative sentences from each cluster are selected to form the summary.

B. Abstractive Summarization

Abstractive methods aim to generate summaries by paraphrasing and rephrasing the original text, rather than directly selecting sentences [22]. These methods involve more advanced natural language processing techniques, such as:

• Sequence-to-Sequence Models: Neural network architectures, such as Recurrent Neural Networks (RNNs) or Transformer models, are trained to generate summaries by learning to map input text to output summaries. These models can capture the semantic and contextual information necessary for generating abstractive summaries.
• Attention Mechanisms: Attention mechanisms allow the model to focus on different parts of the input text while generating the summary. They help in aligning important content and generating more coherent summaries.

C. Hybrid Approaches

Some approaches combine extractive and abstractive techniques to leverage the advantages of both. These methods first extract important sentences or phrases and then employ abstractive methods to rephrase and reorganize the extracted content into a more concise and coherent summary.

D. Linguistic Features

Linguistic features specific to Arabic can be used to aid the summarization process. These features include part-of-speech (POS) tags, morphological analysis, and syntactic information, which can assist in identifying key phrases, relationships between sentences, and important content for the summary.

E. Supervised Learning

Some approaches use supervised learning algorithms, as in [23], where models are trained on pairs of input texts and corresponding summaries. These models learn to generalize from the training data to generate summaries for new texts.

F. Evaluation

Various evaluation metrics are used to assess the quality of Arabic text summaries. Metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy) are commonly employed to measure the similarity between generated summaries and reference summaries or human judgments.
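As an illustration of the overlap idea behind ROUGE, the following minimal sketch computes a unigram variant (ROUGE-1) between a generated summary and a single reference. The function name `rouge_1` and the whitespace tokenization are assumptions made here for brevity; they are not taken from any of the reviewed papers. Because Arabic attaches clitics to words, a practical evaluation would usually normalize or segment the text before counting.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap ROUGE-1 using whitespace tokens (a simplifying assumption)."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    cand_counts = Counter(cand_tokens)
    ref_counts = Counter(ref_tokens)
    # Clipped overlap: a token is matched at most as often as it occurs in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: the candidate keeps two of the three reference words.
reference = "الفتاة جميلة جدا"
candidate = "الفتاة جميلة"
scores = rouge_1(candidate, reference)
print(scores)
```

Recall rewards covering the reference, precision penalizes extra words, and the F1 value balances the two.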

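The sentence scoring technique listed under extractive summarization in Section III can be sketched as follows. This is a minimal illustration, not the method of any cited paper: it assumes whitespace tokens, treats each sentence as a document when computing inverse document frequency, and adds a position bonus whose weight (`position_weight`) is a hypothetical parameter chosen here.

```python
import math
from collections import Counter


def score_sentences(sentences, position_weight=0.5):
    """Score sentences by summed length-normalized TF-IDF plus a position bonus."""
    tokenized = [s.split() for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    scores = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        tfidf = sum(
            (count / len(toks)) * math.log(n / df[tok])
            for tok, count in tf.items()
        )
        position = 1.0 - idx / n  # earlier sentences receive a larger bonus
        scores.append(tfidf + position_weight * position)
    return scores


def top_k(sentences, k=2):
    """Return the k highest-scoring sentences, restored to document order."""
    scores = score_sentences(sentences)
    best = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:k]
    return [sentences[i] for i in sorted(best)]


# English stand-in sentences; Arabic input would need normalization first.
sentences = [
    "summarization selects important sentences",
    "scores combine frequency and position",
    "the weather was pleasant",
]
selected = top_k(sentences, k=2)
print(selected)
```

Returning the selected sentences in document order keeps the extract readable, which is a common choice for extractive summaries.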
IV. ARABIC TEXT SUMMARIZATION CHALLENGES

Arabic text summarization faces several challenges due to the unique characteristics of the Arabic language and the complexities of summarizing Arabic text. Here are some key challenges in Arabic text summarization:

A. Rich morphology

Arabic has a rich morphological structure, with extensive use of prefixes, suffixes, and root-based word formation. This complexity adds challenges to the identification and extraction of important content for summarization.

B. Sentence structure

Arabic sentences can be long and have complex syntactic structures, including nested clauses and subclauses. Understanding and representing these structures accurately in the summary is a challenge.

C. Lexical ambiguity

Arabic words often have multiple meanings and interpretations, which can lead to ambiguity in the summarization process. Resolving lexical ambiguity accurately is crucial for generating meaningful and coherent summaries.

D. Limited resources

Compared to some other languages, there is a relatively limited availability of high-quality Arabic summarization datasets, linguistic resources, and tools. The scarcity of resources makes it challenging to train and evaluate robust summarization models.

E. Cross-domain Generalization

Arabic text summarization models trained on specific domains may struggle to generalize well to different domains. Adapting and fine-tuning models for specific domains and achieving cross-domain generalization is a challenge in Arabic text summarization.

F. Evaluation Metrics

The evaluation of Arabic text summarization can be challenging due to the lack of widely accepted evaluation metrics specifically tailored for Arabic summaries. Existing metrics, such as ROUGE, may not fully capture the linguistic nuances and intricacies of Arabic summaries.

G. Text complexity and topic variations

Arabic text covers a wide range of topics, including news, social media, legal documents, and scientific articles. Summarization systems must handle diverse topics and adapt to varying text complexity levels to produce accurate and informative summaries.

Addressing these challenges requires ongoing research and development of specialized techniques, datasets, and evaluation methodologies specifically designed for Arabic text summarization. The exploration of advanced NLP methods and deep learning approaches can help to overcome these challenges and improve the quality and effectiveness of Arabic text summarization systems.

V. DATASETS

A. Summarization of Arabic News (SAN)

In [24], the authors address the need for a dataset tailored to Arabic single-document summarization. Single-document summarization focuses on generating concise and informative summaries from individual documents, which is an essential task in natural language processing and information retrieval.

The authors recognize that existing summarization datasets may not fully capture the linguistic and contextual peculiarities of the Arabic language. To address this gap, they propose the creation of a new dataset that aligns with the unique challenges and characteristics of Arabic text summarization [24].

To build the Arabic single-document summarization dataset, the authors describe the data collection and annotation process. They carefully curate a diverse set of Arabic documents from various domains and topics to ensure the dataset's representativeness.

Human annotators are then employed to create reference summaries for each document in the dataset. These reference summaries are crucial for evaluating the performance of summarization systems using standard evaluation metrics like ROUGE.

The authors discuss the importance of the dataset in advancing research in Arabic single-document summarization. The availability of a well-constructed dataset allows researchers to develop and evaluate state-of-the-art summarization models specifically tailored to the Arabic language.

In conclusion, the authors of [24] present their effort to create an Arabic single-document summarization dataset to address the need for comprehensive and language-specific resources in the field. The dataset contributes to the development and evaluation of Arabic summarization systems, ultimately enhancing the accessibility and usability of automated summarization technology in Arabic text processing.

In [25], the authors address the task of Arabic single-document summarization, which aims to generate concise and coherent summaries from individual Arabic documents. The proposed approach utilizes lexical chains to extract salient information from the document and construct a coherent summary.

The authors begin by explaining the concept of lexical chains, which are sequences of related words or terms that share semantic relationships within a document. These chains help identify important concepts and topics present in the text.
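To make the idea of lexical chains concrete, here is a deliberately simplified sketch. It is not the method of [25]: as a stated assumption, a "chain" below is plain word repetition across sentences, whereas real lexical chains also link synonyms and related terms through a lexical resource. The helper names `build_chains` and `summarize` and the English toy sentences are hypothetical stand-ins.

```python
from collections import defaultdict


def build_chains(sentences):
    """Map each recurring token to the set of sentence indices that contain it.

    Assumption: repetition is the only relation; no synonymy or semantic
    similarity is modeled here.
    """
    chains = defaultdict(set)
    for idx, sentence in enumerate(sentences):
        for token in sentence.split():
            chains[token].add(idx)
    # A token seen in a single sentence does not form a chain.
    return {tok: ids for tok, ids in chains.items() if len(ids) > 1}


def summarize(sentences, k=1):
    """Pick the k sentences touched by the strongest chains, in document order."""
    chains = build_chains(sentences)
    scores = [0] * len(sentences)
    for ids in chains.values():
        for idx in ids:
            scores[idx] += len(ids)  # longer chains contribute more weight
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    return [sentences[i] for i in sorted(ranked[:k])]


doc = [
    "the library holds a rare book",
    "the book was written by a scholar",
    "the scholar taught at the library",
    "rain fell all night",
]
summary = summarize(doc, k=2)
print(summary)
```

A sentence that shares no chain with the rest of the document receives a score of zero and is the last candidate for the summary. Function words such as "the" also form chains in this sketch, so a practical version would filter stop words first.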

The summarization process involves several steps: pre- can foster the development of more accurate and effective
processing the document to remove noise and irrelevant Arabic language models and applications.
information, building lexical chains to identify key concepts, and In [28], the authors propose an abstractive approach to
selecting sentences from the document that best represent these Arabic text summarization using deep learning techniques.
concepts to form the summary [25]. Abstractive summarization involves generating concise and
To build lexical chains, the authors in [25] use linguistic coherent summaries by paraphrasing and rephrasing the content
features such as word co-occurrence and semantic similarity. By of the source document, rather than extracting sentences directly.
identifying chains of related terms, the method aims to capture The authors aim to address the challenges of Arabic text
the core ideas and topics covered in the document. summarization, including the linguistic complexities and
The selected sentences are then ranked based on their morphological richness of the Arabic language. Deep learning
relevance to the identified concepts, and the top-ranked models are employed to learn the semantic representation and
sentences form the final summary. structure of the input text, enabling the generation of meaningful
The authors evaluate the performance of their summarization and grammatically correct summaries [28].
approach using standard evaluation metrics and compare it with The proposed approach likely involves a sequence-to-
other summarization methods [25]. The results demonstrate the sequence (seq2seq) architecture, commonly used for abstractive
effectiveness of the proposed method in generating meaningful summarization tasks. The model takes an Arabic document as
and coherent Arabic summaries.

In conclusion, the research in [25] presents a novel approach to Arabic single-document summarization based on lexical chains. By leveraging semantic relationships among words, the method identifies key concepts and constructs a concise summary that represents the main ideas in the document. The research contributes to the development of effective summarization techniques for the Arabic language, advancing the state of the art in Arabic natural language processing.

The dataset can be accessed through the SAN GitHub repository at [26].

B. AraVec

In [27], the authors present AraVec, a set of Arabic word embedding models trained on a large corpus of Arabic text. Word embeddings are vector representations of words in a high-dimensional space, where words with similar meanings or usage patterns are represented by vectors close to each other.

The authors explain that word embeddings play a crucial role in various natural language processing tasks, such as text classification, sentiment analysis, and machine translation. AraVec provides pre-trained word embeddings specifically tailored to the Arabic language, enabling researchers and practitioners to leverage these embeddings in Arabic NLP applications without the need for additional training data.

To create AraVec, the authors in [27] likely used a large corpus of Arabic text, such as news articles, web content, or social media data. They would have employed popular word embedding techniques such as Word2Vec, GloVe, or FastText to generate the word embeddings.

The availability of AraVec provides a valuable resource for the Arabic natural language processing community, allowing researchers and developers to easily incorporate high-quality Arabic word embeddings into their NLP projects. The dataset

input and encodes it into a fixed-length vector representation. Then, a decoder generates the summary by predicting a sequence of words based on the encoded representation.

To train the deep learning model, the authors in [28] would require a dataset of paired Arabic documents and their corresponding human-generated summaries. Depending on the availability of appropriate datasets, they might use existing resources or create their own dataset.

The research in [28] presents a deep learning-based approach to abstractive Arabic text summarization, aiming to overcome the challenges posed by the Arabic language. The research contributes to the development of advanced summarization techniques for Arabic, with potential applications in information retrieval, natural language processing, and content summarization.

C. MultiBooked

The MultiBooked dataset serves as a benchmark for multilingual multi-document summarization [29] and [30]. The dataset aims to provide a comprehensive and multidimensional resource for evaluating summarization models. It includes news articles collected from various Arabic news sources, covering a wide range of topics and genres [29].

The MultiBooked corpus contains a diverse collection of documents in multiple languages, making it suitable for multilingual summarization research. The data is curated from various sources and covers a wide range of topics and genres, ensuring that the corpus is representative of real-world scenarios [29] and [30].

The authors in [29] describe the process of creating the dataset, which involves collecting the source documents and obtaining multiple human-generated summaries for each document. The dataset is carefully annotated to ensure high-quality summaries that capture the essence of the source texts.
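
The human-written reference summaries described for MultiBooked are what make automatic scoring possible. As a minimal sketch, not taken from [29] or [30], the following Python code computes a unigram-recall score in the spirit of ROUGE-1 and keeps the best score over all available references, which is one common convention when several references exist. Whitespace tokenization and the function names rouge1_recall and best_rouge1_recall are assumptions made for illustration; real Arabic text would normally be normalized and segmented first.

```python
from collections import Counter


def rouge1_recall(candidate: str, reference: str) -> float:
    """Unigram recall: clipped token overlap / number of reference tokens."""
    cand_counts = Counter(candidate.split())  # assumption: whitespace tokens
    ref_counts = Counter(reference.split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # A reference token is matched at most as often as it occurs in the candidate.
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    return overlap / total


def best_rouge1_recall(candidate: str, references: list[str]) -> float:
    """Score against every human reference and keep the highest recall."""
    return max((rouge1_recall(candidate, ref) for ref in references), default=0.0)


if __name__ == "__main__":
    # Hypothetical toy data, in English for readability.
    system = "the storm closed the port"
    humans = ["a storm closed the main port", "the port was closed"]
    print(best_rouge1_recall(system, humans))
```

Because the comparison is on surface tokens, two Arabic word forms that differ only by an attached article or clitic count as a mismatch, so a score of this kind can understate agreement on Arabic text.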

The resulting dataset serves as a benchmark for evaluating the performance of Arabic text summarization systems [29].

The research in [29] highlights the importance of the MultiBooked dataset in advancing research in Arabic text summarization. It provides researchers with a valuable resource for training and evaluating summarization models, particularly in the context of multi-document summarization. The authors discuss potential use cases and applications of the dataset and emphasize its potential impact on the development of Arabic summarization technologies.

Overall, the study in [29] showcases the MultiBooked dataset as a multidimensional resource that contributes to the advancement of Arabic text summarization research. Researchers can refer to this reference for more detailed information about the dataset and its usage in Arabic text summarization studies.

In [30], the authors explain that multi-document summarization is a challenging natural language processing task where the objective is to generate a concise summary that captures the essential information from a set of related documents. This type of summarization is particularly important in scenarios where information is scattered across multiple sources, such as news articles or research papers.

To create the dataset, human annotators generated summaries for each document cluster in the corpus. These summaries were carefully crafted to capture the main ideas and key information from the source documents. The presence of human-generated summaries allows researchers to evaluate the performance of their summarization models using standard metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [30].

The authors in [30] also discuss the importance of a standardized benchmark for evaluating multilingual multi-document summarization systems. The availability of such a benchmark fosters the development of robust and effective summarization algorithms that can operate across different languages and domains.

In conclusion, the MultiBooked corpus presented in this study provides a valuable resource for researchers working on multilingual multi-document summarization. It offers a diverse and representative set of documents in multiple languages, allowing researchers to advance the state of the art in this challenging NLP task. The dataset has the potential to facilitate the development of practical summarization systems capable of handling information from various sources and domains.

The dataset can be accessed through the MultiBooked GitHub repository at [31].

VI. ALGORITHMS

Arabic text summarization algorithms are designed to generate concise and coherent summaries from Arabic text. These algorithms can be categorized into five main types: Extractive Summarization, Abstractive Summarization, Hybrid Approaches, Rule-based Approaches, and Deep Reinforcement Learning. The following is an overview of the main types and some commonly used algorithms:

A. Extractive Summarization Algorithms:

Extractive summarization algorithms select and extract sentences or phrases directly from the source text to form the summary. These algorithms identify the most relevant and important sentences based on certain scoring criteria. Some popular extractive summarization algorithms include:

1) Term Frequency-Inverse Document Frequency (TF-IDF): This algorithm ranks sentences based on the importance of the words they contain relative to the entire document. Sentences with higher TF-IDF scores are more likely to be included in the summary.

2) TextRank: Inspired by Google's PageRank algorithm, TextRank treats sentences as nodes in a graph and uses edge weights to represent the relationships between sentences. It ranks sentences based on their centrality in the graph, and the top-ranked sentences are selected for the summary.

3) LexRank: Similar to TextRank, LexRank builds a sentence graph, using the cosine similarity between TF-IDF sentence vectors as edge weights. It aims to select sentences that are both important and diverse to ensure a well-rounded summary.

4) KL-Sum: KL-Sum greedily adds sentences to the summary so that the Kullback-Leibler divergence between the word distribution of the summary and that of the source document is minimized. The selected sentences are therefore those that lose the least information about the document's word distribution.

5) Supervised Learning with Features: Traditional supervised learning algorithms, combined with linguistic features and domain-specific knowledge, have also been used for extractive summarization.

B. Abstractive Summarization Algorithms:

Abstractive summarization algorithms generate summaries by paraphrasing and rephrasing the content of the source text. These algorithms create new sentences that convey the key information in a more concise form. Some commonly used abstractive summarization techniques include:

1) Sequence-to-Sequence (seq2seq) models: seq2seq models, often based on Recurrent Neural Networks (RNNs) or the Transformer architecture, encode the source text into a vector representation (a single fixed-length vector in the basic RNN formulation) and then use a decoder to generate the summary.

2) Pointer-Generator Networks: These models combine extractive and abstractive techniques. They can copy words from the source text (extractive) while also generating new words (abstractive) to form the summary.

3) BERT-based models: Bidirectional Encoder Representations from Transformers (BERT) and other language
models can be fine-tuned for abstractive summarization, allowing them to understand context and generate summaries effectively.

C. Pointer-Generator Networks

This model combines extractive and abstractive approaches by using a pointer mechanism to select words directly from the input document when generating the summary.

D. Reinforcement Learning

Some research has explored using reinforcement learning to fine-tune summarization models by directly optimizing evaluation metrics such as ROUGE.

E. Latent Dirichlet Allocation (LDA)

LDA is a topic modeling algorithm that can be adapted for document summarization by selecting sentences that represent the main topics in the document.

Note that Arabic text summarization algorithms may require specific adaptations to address the unique linguistic characteristics of the Arabic language, such as its rich morphology and right-to-left writing direction.

VII. CONCLUSION

Arabic summarization, whether single-document or multi-document, faces several limitations due to the unique characteristics of the Arabic language and the available resources. To conclude, we summarize the main reasons for these limitations.

Annotated datasets specifically designed for Arabic summarization are scarce compared to resources available for other languages like English. The lack of large-scale and diverse datasets hampers the development and evaluation of Arabic summarization models.

Arabic has complex linguistic features, including rich morphology, intricate syntax, and a wide range of dialects. Capturing these characteristics accurately in summarization models is challenging. Morphological analysis, dealing with dialectal variations, and maintaining fluency and coherence are ongoing research areas.

Arabic summarization for specific domains, such as legal, medical, or scientific texts, requires domain-specific resources and expertise. However, such resources are often limited in Arabic, hindering the development of domain-adaptive summarization models.

Compared to English, for example, the availability of NLP tools and resources for Arabic is relatively limited. This includes robust part-of-speech taggers, syntactic parsers, named entity recognition systems, and sentiment analysis tools. The absence of these resources affects the performance and accuracy of Arabic summarization models.

Abstractive summarization, which generates novel summaries, is more difficult in Arabic due to its complex morphology and syntax. Ensuring semantic coherence, maintaining the original meaning, and avoiding grammatical errors in generated summaries present significant challenges.

Evaluating the quality of Arabic summaries is also a challenge. Metrics like ROUGE are widely used, but they may not fully capture the nuances of the Arabic language. Developing reliable evaluation metrics and benchmarks specific to Arabic summarization remains an ongoing research area.

Training and fine-tuning large-scale summarization models require substantial computational resources. Access to high-performance computing infrastructure and large-scale pre-training data can be a limitation for researchers and practitioners, particularly in resource-constrained settings.

Researchers and developers are actively working to address these limitations by creating new datasets, improving language-specific tools and resources, and developing novel techniques tailored to the unique characteristics of the Arabic language. Continued efforts in these areas will help overcome the limitations and advance the field of Arabic summarization.

REFERENCES

[1] K. J. Abdelqader, A. Mohamed, and K. Shaalan, "Systematic Review of Automatic Arabic Text Summarization Techniques". In International conference on Variability of the Sun and sun-like stars: from asteroeismology to space weather (pp. 783-796). Springer, Singapore, 2023.
[2] A. Elsaid, A. Mohammed, L.F. Ibrahim, and M.M. Sakre, "A comprehensive review of arabic text summarization". IEEE Access, 10, 38012-38030, 2022.
[3] M. A. Elmenshawy, T. Hamza, and R. El-Deeb, "Automatic arabic text summarization (AATS): A survey". Journal of Intelligent & Fuzzy Systems, 43(5), 6077-6092, 2022.
[4] D. Yadav, R. Katna, A.K. Yadav, and J. Morato, "Feature Based Automatic Text Summarization Methods: A Comprehensive State-of-the-Art Survey". IEEE Access, 10, 133981-134003, 2022.
[5] D. Yadav, J. Desai, and A.K. Yadav, "Automatic text summarization methods: A comprehensive review". arXiv preprint arXiv:2204.01849, 2022.
[6] A.P. Widyassari, S. Rustad, G. F. Shidik, E. Noersasongko, A. Syukur, and A. Affandy, "Review of automatic text summarization techniques & methods". Journal of King Saud University-Computer and Information Sciences, 34(4), 1029-1046, 2022.
[7] Elaraby, and Mourad, "Multi-Document Arabic Summarization Using Neural Networks". 2021.
[8] Elmahdy et al., "Arabic Text Summarization Based on Deep Learning Techniques", 2020.
[9] Haggag et al. "Abstractive Arabic Text Summarization Based on Neural Networks", 2020.
[10] Elaraby et al. "Arabic News Summarization Using Transformer-based Models", 2020.
[11] Abooraig, and Al-Kabi, "Arabic Text Summarization Using Lexical Chains and Sentence Similarity", 2019.
[12] Al-Sammarraie, and Al-Bakour, "Extractive Text Summarization for Arabic News", 2018.
[13] Al-Smadi, and Al-Nawaiseh, "A Hybrid Approach for Arabic Text Summarization Using Clustering and Sentence Scoring", 2018.
[14] Sunil Kumar, M., Sundararajan, V., Balaji, N. A., Sambhaji Patil, S., Sharma, S., & Joy Winnie Wise, D. C. (2023). Prediction of Heart Attack from Medical Records Using Big Data Mining. International Journal of Intelligent Systems and Applications in Engineering, 11(4s), 90–99. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/2575.
[15] Al-Salman et al., "Arabic Text Summarization Based on Sentence Compression and Semantic Similarity", 2018.
[16] El-Haj et al., "Abstractive Text Summarization for Arabic News", 2018.
[17] Mahmoud, and Ezzeldin, "Arabic Text Summarization Using Sentence Ranking Based on Connectedness and Positional Weight", 2017.
[18] Alhadlaq et al., "Extractive Summarization of Arabic Texts Based on Sentence Clustering", 2017.
[19] El-Halees, and Al-Salman, "Arabic Text Summarization Using Lexical Chains", 2016.
[20] Al-Sarem et al., "Arabic Text Summarization: A Survey", 2016.
[21] Al-Twairesh, and Al-Nafjan, "Arabic Text Summarization: A Survey", 2014.
[22] AL-Khassawneh, Y. A., & Hanandeh, E. S. (2023). Extractive Arabic Text Summarization-Graph-Based Approach. Electronics, 12(2), 437.
[23] Etaiwi, W., & Awajan, A. (2022). SemG-TS: Abstractive Arabic Text Summarization Using Semantic Graph Embedding. Mathematics, 10(18), 3225.
[24] Wazery, Y. M., Saleh, M. E., Alharbi, A., & Ali, A. A. (2022). Abstractive Arabic text summarization based on deep learning. Computational Intelligence and Neuroscience, 2022.
[25] A. Al-Sabbagh, M. K. Al-Subari, & A. Hamdan. (2017). Towards Building an Arabic Single-Document Summarization Dataset. In Proceedings of the 8th International Conference on Computer Sciences and Convergence Information Technology (ICCIT).
[26] Bouazizi, M., & Mezghani, A. (2015). Arabic Single Document Summarization Using Lexical Chains. International Journal on Document Analysis and Recognition (IJDAR), 18(3), 297-316.
[27] https://github.com/mkhzoumi/SAN
[28] Dhabliya, D. (2021). An Integrated Optimization Model for Plant Diseases Prediction with Machine Learning Model. Machine Learning Applications in Engineering Education and Management, 1(2), 21–26. Retrieved from http://yashikajournals.com/index.php/mlaeem/article/view/15.
[29] Salleh, M. R., Shaari, A. H., & Zainuddin, R. (2018). "AraVec: A Set of Arabic Word Embedding Models Trained on a Large Corpus." Journal of Physics: Conference Series, 1049(1), 012039.
[30] Abdelaal, K., Elaraby, M., Mourad, A., & Rafea, A. (2018). "Abstractive Arabic Text Summarization Using Deep Learning Techniques." In 2018 IEEE/ACS 15th International Conference on Computer Systems and Applications (AICCSA) (pp. 1-4).
[31] Al-Zaidy, R., & Mehdad, Y. (2019). MultiBooked: A Multidimensional Dataset for Arabic Text Summarization. In Proceedings of the 4th Arabic Natural Language Processing Workshop (pp. 16-24). Association for Computational Linguistics.
[32] Stojanovic, N. (2020). Deep Learning Technique-Based 3d Lung Image-Based Tumor Detection Using segmentation and Classification. Research Journal of Computer Systems and Engineering, 1(2), 13:19. Retrieved from https://technicaljournals.org/RJCSE/index.php/journal/article/view/6
[33] Baly, R., Haddoud, M. H., Alsaied, T., & Alghamdi, R. (2018). MultiBooked: A multilingual multi-document summarization benchmark. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1362-1372).
[34] https://github.com/HLTCHKUST/MultiBooked
