Machine Learning and Deep Learning Approaches for Arabic Sign Language Recognition: A Decade Systematic Literature Review
Abstract: Sign language (SL) is a means of communication that is used to bridge the gap between
the deaf, hearing-impaired, and others. For Arabic speakers who are hard of hearing or deaf, Arabic
Sign Language (ArSL) is a form of nonverbal communication. The development of effective Arabic
sign language recognition (ArSLR) tools helps facilitate this communication, especially for people
who are not familiar with ArSL. Although researchers have investigated various machine learning
(ML) and deep learning (DL) methods and techniques that affect the performance of ArSLR systems,
a systematic review of these methods is lacking. The objectives of this study are to present a com-
prehensive overview of research on ArSL recognition and present insights from previous research
papers. In this study, a systematic literature review of ArSLR based on ML/DL methods and tech-
niques published between 2014 and 2023 is conducted. Three online databases are used: Web of
Science (WoS), IEEE Xplore, and Scopus. Each study has undergone the proper screening processes,
which include inclusion and exclusion criteria. Throughout this systematic review, PRISMA guide-
lines have been appropriately followed and applied. The results of this screening are divided into
two parts: analysis of all the datasets utilized in the reviewed papers, underscoring their character-
istics and importance, and discussion of the ML/DL techniques’ potential and limitations. From the
56 articles included in this study, it was noticed that most of the research papers focus on fingerspelling
and isolated word recognition rather than continuous sentence recognition, and the vast
majority of them are vision-based approaches. The challenges remaining in the field and future
research directions in this area of study are also discussed.
Keywords: Arabic sign language (ArSL); Arabic sign language recognition (ArSLR); dataset;
machine learning; deep learning; hand gesture recognition
and facial expressions, in addition to manual signs (MS) that use static hand/arm gestures,
hand/arm motions, and fingerspelling [3].
Sign language recognition (SLR) is the process of recognizing and deciphering sign
language movements or gestures [4]. It usually involves complex algorithms and compu-
tational operations. Sign language recognition systems (SLRS) are one application within
the field of human-computer interaction (HCI) that interprets the sign language of people with
hearing impairments into the text or speech of an oral language [5]. SLRS can be categorized into three
main groups based on the main technique used for gathering data, namely sensor-based
or hardware-based, vision-based or image-based, and hybrid [6,7]. Sensor-based tech-
niques use data gloves that the signer wears to collect data about their actions from exter-
nal sensors. Most of the current research, however, has focused on vision-based ap-
proaches that use images, videos, and depth data to extract the semantic meaning of hand
signals due to practical concerns with sensor-based techniques [4,6]. Hybrid methods
have occasionally been employed to gather information on sign language recognition.
Compared to other approaches, hybrid methods perform similarly or even better, much as
hybrid approaches do in automatic speech and handwriting recognition. In hybrid techniques,
multi-mode information about the hand shapes and movement is obtained by combining
vision-based cameras with other types of sensors, like glove sensors.
As shown in Figure 1, the stages involved in sign language recognition can be broadly
divided into data acquisition, pre-processing, segmentation, feature extraction, and clas-
sification [8,9]. In data acquisition, single frames of images are the input for static sign
language recognition; video, or continuous frames of images, is the input for dynamic
signs. In order to enhance the system’s overall performance, the preprocessing stage mod-
ifies the image or video inputs. Segmentation, or the partitioning of an image or video into
several separate parts, is the third step. A good feature extraction arises from perfect seg-
mentation [10]. Feature extraction is performed to transform important parts of the input
data into sets of compact feature vectors. When it comes to the recognition of sign lan-
guage, the features that are extracted from the input hand gestures should contain perti-
nent information and be displayed in a condensed form that helps distinguish the sign
that needs to be categorized from other signals. The final step is classification. Machine
learning (ML) techniques for classification can be divided into two categories: supervised
and unsupervised [8]. Through the use of supervised machine learning, a system can be
trained to identify specific patterns in input data, which can subsequently be utilized to
predict future data. Using labeled training data and a set of known training examples,
supervised machine learning infers a function. Inferences are derived from datasets con-
taining input data without labeled responses through the application of unsupervised ma-
chine learning. There is no reward or penalty weightage indicating which classes the data
should belong to because the classifier receives no labeled responses. Deep learning (DL)
approaches have surpassed earlier cutting-edge machine learning techniques in a number
of domains in recent years, particularly in computer vision and natural language pro-
cessing [4]. Eliminating the need to construct or extract features is one of the primary ob-
jectives of deep models. Deep learning enables multi-layered computational models to
learn and represent data at various levels of abstraction, mimicking the workings of the
human brain and implicitly capturing the intricate structures of massive amounts of data.
Figure 2. Representation of the Arabic sign language for Arabic alphabets [14].
SLR research efforts are categorized according to a taxonomy that starts with creating
systems that can recognize small forms and segments, like alphabets, and progresses to
larger but still small forms, like isolated words, and ultimately to the largest and most
challenging forms, like complete sentences [4,9,15]. As the unit size to be recognized in-
creases, the ArSLR job becomes harder. Three main categories—fingerspelling (alphabet)
sign language recognition, isolated word sign language recognition, and continuous sen-
tences sign language recognition—can be used to classify ArSLR research [9,15].
In this study, a comprehensive review of the current landscape of ArSLR using ma-
chine learning and deep learning techniques is conducted. Specifically, the goal is to ex-
plore the application of ML and DL in the past decade, the period between 2014 and 2023,
to gain deeper insights into ArSLR. To the author’s knowledge, none of the previously
reviewed surveys have systematically reviewed the ArSLR studies published in the past
decade. Therefore, the main purpose of the current study is to thoroughly understand the
progress made in this field, discover valuable information about ArSLR, and shed light on
the current state of knowledge. Through an extensive systematic literature review, I have
collected and synthesized the most recent results regarding recognizing ArSL using both
ML and DL. The analysis involves a collection of techniques, exploring their effectiveness
and potential for improving the accuracy of ArSLR. Moreover, current trends and future
directions in the area of ArSLR are explored, highlighting important areas of interest and
innovation. By understanding the current landscape, the aim is to provide valuable in-
sights into the direction of research and development in this evolving field.
The rest of this paper is organized as follows: in Section 2, the methodology that will
help to achieve the goals of this systematic literature review is detailed. In Section 3, the
results of the research questions and sub-questions are analyzed, synthesized, and inter-
preted. In Section 3.4, the most important findings of this study are discussed in an orderly
and logical manner, the future perspectives are highlighted, and the limitations of this
study are presented. Finally, Section 4 draws the conclusions.
2. Research Methodology
The systematic literature review gathers and synthesizes the papers published in var-
ious scientific databases in an orderly, accurate, and analytical manner about an area of
interest. The goal of the systematic approach is to direct the review process on a specific
research topic to assess the advancement of research and identify potential new directions.
This study adopts the Preferred Reporting Items for Systematic Reviews and Meta-Anal-
yses (PRISMA) guidelines [16]. The research methodology encompasses three key stages:
planning, conducting, and reporting the review. Figure 3 provides a visual representation
of the methodology applied in this study.
Any review paper published before 2014 was discarded. After filtering the remaining
review papers based on their relevance to the topic of interest, a total of five literature
reviews were eligible for inclusion. Three of the review papers [9,15,17] were found in
both Web of Science and Scopus. Two papers were found in Scopus [18,19].
In 2014, Mohandes et al. [17] presented an overview of image-based ArSLR studies.
Systems and models in the three levels of image-based ArSLR, including alphabet recog-
nition, isolated word recognition, and continuous sign language recognition, are reviewed
along with their recognition rate. Mohandes et al. have extended their survey [17] to in-
clude not only systems and methods for image-based ArSLR but also sensor-based ap-
proaches [15]. This survey also shed light on the main challenges in the field and potential
research directions. The datasets used in some of the reviewed papers are briefly dis-
cussed [15,17].
A review of the ArSLR background, architecture, approaches, methods, and tech-
niques in the literature published between 2001 and 2017 was conducted by Al-Shamayleh
et al. [9]. Vision-based (image-based) and sensor-based approaches were covered. The
three levels of ArSLR for both approaches, including alphabet recognition, isolated word
recognition, and continuous sign language recognition, were examined with their corre-
sponding recognition rates. The study also identified future research gaps and provided
a road map for research in this field. Limited details of the utilized datasets in the reviewed
papers were mentioned, such as training and testing data sizes.
Mohamed-Saeed and Abdulbqi [18] presented an overview of sign language and its
history and main approaches, including hardware-based and software-based. The classi-
fication techniques and algorithms used in sign language research were briefly discussed,
along with their accuracy. Tamimi and Hashlmon [19] focused on Arabic sign language
datasets and presented an overview of different datasets and how to improve them; how-
ever, other areas, such as classification techniques, were not addressed.
In addition to Web of Science and Scopus, Google Scholar was used as a support da-
tabase to search for systematic literature review papers on ArSLR. Similar keywords were
used and yielded four more reviews [2,12,20,21] that are summarized below:
Alselwi and Taşci [10] gave an overview of the vision-based ArSL studies and the
challenges the researchers face in this field. They also outlined the future directions of
ArSL research. Another study was conducted by Wadhawan and Kumar [2] to systemati-
cally review previous studies on the recognition of different sign languages, including
Arabic. They provided reviews for the papers published between 2007 and 2017. The pa-
pers in each sign language were classified according to six dimensions: acquisition
method, signing mode, single/double-handed, static/dynamic signing, techniques used,
and recognition rate. Mohammed and Kadhem [20] offered an overview of the ArSL stud-
ies published in the period between 2010 and 2019, focusing on input devices utilized in
each study, feature extraction techniques, and classification algorithms. They also re-
viewed some foreign sign language studies. Al Moustafa et al. [21] provided a thorough
review of the ArSLR studies in three categories: alphabets, isolated words, and continuous
sentence recognition. Additionally, they outlined the public datasets that can be used in
the field of ArSLR. Despite mentioning “systematic review” in the paper’s title, they did
not follow a systematic approach in their review.
Valuable contributions are made by gathering papers on ArSLR and carrying out a
comprehensive literature evaluation. Of the surveys mentioned above, only one adopts a
systematic review process [2]; nevertheless, this survey does not analyze or focus on the
datasets included in the evaluated studies, nor does it cover the studies published after
2017. As a result, it was necessary to provide the community and interested researchers
with an analysis of the current status of ArSLR. The purpose of this research, therefore, is
to close or at least lessen that gap. All the review papers are listed and summarized in
Table 1 in chronological order from the oldest to the newest.
Table 4. Research questions related to the research limitations and future directions.
Search Strategy
To establish a systematic literature review, it is crucial to specify the search terms and
to identify the scientific databases where the search will be conducted. In 2019, a study
was carried out to evaluate the search quality of PubMed, Google Scholar, and 26 other
academic search databases [23]; the results revealed that Google Scholar is not suitable as
a primary search resource. Therefore, the most widely used scientific databases in the field
of research were selected for this systematic review, namely Web of Science, Scopus, and
IEEE Xplore Digital Library. Each database was last consulted on 2 January 2024.
A search string is an essential component of a systematic literature review for study
selection, as it restricts the scope and coverage of the search. A set of search strings was
created with the Boolean operator combining suitable synonyms and alternate terms:
AND restricts and limits the results, and OR expands and extends the search [24]. More-
over, double quotes have been used to search for exact phrases. At this initial stage, the
search was limited to the title, abstract, and keywords of the papers. Each database has
different reserved words to indicate these three elements, such as TS in Web of Science,
TITLE-ABS-KEY in Scopus, and All Metadata in IEEE Xplore. The specific search strings
used in each scientific database are listed below:
Web of Science:
TS = (Arabic sign language) AND recognition
Scopus:
TITLE-ABS-KEY (“Arabic sign language” AND recognition)
IEEE Xplore:
((“All Metadata”: “Arabic sign language”) AND (“All Metadata”: recognition))
EC3: Papers that are published before January 2014 or after December 2023.
EC4: Papers that are secondary studies (i.e., review, survey).
EC5: Papers that do not use Arabic sign language as a language for recognition.
EC6: Papers that do not use machine learning or deep learning methods as a solution
to the problem of ArSLR.
EC7: Papers that do not include details about the achieved results.
EC8: Papers that do not mention the datasets used in their experiment.
EC9: Papers with multiple versions are not included; only one version will be in-
cluded.
EC10: Duplicate papers found in more than one database (e.g., in both Scopus and
Web of Science) are not included; only one will be included.
EC11: The full text is not accessible.
Quality Assessment
It is deemed essential to assess the quality of the primary studies in addition to in-
clusion/exclusion criteria to offer more thorough criteria for inclusion and exclusion, to
direct the interpretation of results, and to direct suggestions for further research [16]. In-
depth assessments of quality are typically conducted using quality instruments, which are
checklists of variables that must be considered for each primary study. Numerical evalu-
ations of quality can be achieved if a checklist’s quality items are represented by numerical
scales. Typically, checklists are created by considering variables that might affect study
findings. A quality checklist aims to contribute to the selection of studies through a set of
questions that must be answered to guide the research. In the current study, a scoring
system based on the answers to six questions was applied to each study. These questions
are:
QA1: Is the experiment on ArSLR clearly explained?
QA2: Are the machine learning/deep learning algorithms used in the paper clearly
described?
QA3: Is the dataset and number of training and testing data identified clearly?
QA4: Are the performance metrics defined?
QA5: Are the performance results clearly shown and discussed?
QA6: Are the study’s drawbacks or limitations mentioned clearly?
Each question was scored on a three-point scale of 0, 0.5, or 1, denoting no,
partially, and yes, respectively. A study was finally taken into consideration if it received
a total score of 4 or more out of 6 on the previous questions.
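As a concrete illustration of this scoring scheme, the short sketch below (a hypothetical helper, not part of the review protocol itself) totals the six answers and applies the 4-out-of-6 inclusion threshold; the two example rows mirror Table 8.

```python
# Hypothetical scoring helper illustrating the quality-assessment procedure described above.
SCORES = {"no": 0.0, "partially": 0.5, "yes": 1.0}

def assess_study(answers, threshold=4.0):
    """answers maps QA1..QA6 to 'no', 'partially', or 'yes'; returns total and inclusion decision."""
    total = sum(SCORES[a] for a in answers.values())
    return total, total >= threshold

# Example mirroring Table 8: Podder et al. score 6/6 (included), Aldhahri et al. 3/6 (excluded).
podder = dict(QA1="yes", QA2="yes", QA3="yes", QA4="yes", QA5="yes", QA6="yes")
aldhahri = dict(QA1="yes", QA2="partially", QA3="yes", QA4="no", QA5="partially", QA6="no")
print(assess_study(podder))    # (6.0, True)
print(assess_study(aldhahri))  # (3.0, False)
```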
Table 5. The reasoning behind the exclusion of some papers in the title-level screening stage.
Table 6. The reasoning behind the exclusion of some papers in the abstract-level screening stage.
Table 7. The reasoning behind the exclusion of some papers in the full-text scanning stage.
Ref. | Paper Title | Reason for Exclusion
[31] | Intelligent real-time Arabic sign language classification using attention-based inception and BiLSTM | EC5—Paper does not use ArSL as a language for recognition. Saudi sign language recognition is used; Saudi is an Arabic dialect, but Saudi sign language is different from the unified Arabic sign language (ArSL).
[32] | Arabic sign language intelligent translator | EC9—Papers with multiple versions are not included; only one version will be included. The same methodology and results discussed in this paper are presented in another paper published in 2019 and written by the same authors [33].
[34] | Mobile camera-based sign language to speech by using Zernike moments and neural network | EC11—The full text is not accessible.
Table 8. The reasoning behind the exclusion or inclusion of some papers in the full-text article
screening stage.
Reference | QA1 | QA2 | QA3 | QA4 | QA5 | QA6 | Total Score | Included
Podder et al. [35] | 1 | 1 | 1 | 1 | 1 | 1 | 6 | Yes
Aldhahri et al. [36] | 1 | 0.5 | 1 | 0 | 0.5 | 0 | 3 | No
Out of 60 papers, 56 of them met the quality assessment criteria. The number of pa-
pers that failed and passed these criteria is illustrated in Figure 5.
One of the reasons why some papers fail to pass the quality assessment phase is that
some papers did not explain their ArSLR methodology clearly and thoroughly. The other
reasons are an insufficient description of the utilized ML or DL algorithms and/or a lack
of information related to the used datasets, such as the number of training and testing
datasets. Missing identification of performance metrics and poor analysis and discussion
of the results also contributed to removing some papers from the pool of candidates. Many
of the studies included a detailed explanation of their methodology; however, some ne-
glected to address the limitations and drawbacks of the techniques, which also affected
the assessment.
As seen in Figure 6, an increase has been observed in ArSLR research in recent years, from
2017 to 2023, which underscores the significance of this investigation.
Figure 7 reveals the number of publications for each category of ArSLR for the years
between 2014 and 2023. It can be seen that, over the years, fingerspelling recognition has
attracted more researchers, followed by isolated recognition and then continuous recog-
nition. It is worth noting that a few of the selected papers have worked on more than one
category of ArSLR, and they belong to the category of miscellaneous recognition [37–39].
Figure 7. Number of publications for each category of ArSLR for the years between 2014 and 2023.
3. Research Results
In this section, the answer to each research question is provided by summarizing and
discussing the results of the selected papers.
Table 9. Fingerspelling ArSL datasets. “PR” means Private dataset, “AUR” means Available Upon Request dataset, and “PA” means Publicly Available dataset.
Year | Ref. | With Other Datasets? | Availability | # Signers | Samples | # Signs | MS/NMS | Acquisition Device | Modalities | Images/Videos | Static/Dynamic | Collection Country | Comments
2021 | [71] | Yes | PR | 10 | 22,000 | 44 | MS | Camera | RGB | RGB images | S | Iraq | 32 letters, 11 numbers (0-10), and 1 sign for none
2021 | [72] | Yes | PR | 10 | 2800 | 28 | MS | Webcam and smart mobile cameras | RGB | RGB images | S | Egypt | Light background
2021 | [72] | Yes | PR | 10 | 2800 | 28 | MS | Webcam and smart mobile cameras | RGB | RGB images | S | Egypt | Dark background
2021 | [72] | Yes | PR | 10 | 1400 | 28 | MS | Webcam and smart mobile cameras | RGB | RGB images | S | Egypt | Images made with a right hand wearing gloves, white background
2021 | [72] | Yes | PR | 10 | 1400 | 28 | MS | Webcam and smart mobile cameras | RGB | RGB images | S | Egypt | Images made with a bare hand ("two hands") and wearing gloves, with different backgrounds
2021 | [73] | No | PR | - | 15,360 | 30 | MS | Mobile cameras | RGB | 720 × 960 × 3 RGB images | S | Saudi Arabia | -
2022 | [74] | No | PR | 20+ | 5400 | 30 | MS | Smartphone camera | RGB | RGB images | S | Saudi Arabia | -
2023 | [63] | Yes | PR | 30 | 840 | 28 | MS | Camera | RGB | 64 × 64 pixel grayscale images | S | Iraq | 50% left-handed, 50% right-handed
Table 10. ArSL Datasets for isolated words. “PR” means Private dataset, “AUR” means Available Upon Request dataset, and “PA” means Publicly Available dataset.
Year | Dataset/Ref. | With Other Datasets? | Availability | # Signers | Samples | # Signs | MS/NMS | Acquisition Device | Modalities | Images/Videos | Static/Dynamic | Collection Country | Comments
2022 | KArSL-502 [86,87], used by [90] | Yes | - | 3 | 75,300 | 502 | - | - | RGB, depth, skeleton joint points | - | S & D | Saudi Arabia | 30 digit signs, 39 letter signs, and 433 sign words
2022 | [91] | No | PR | - | 1100 | 11 | MS | - | RGB | RGB video format | D | Saudi Arabia | 11 words: Friend, Neighbor, Guest, Gift, Enemy, To Smell, To Help, Thank you, Come in, Shame, House
2022 | [92] | No | PR | 55 | 7350 | 21 | MS | Kinect camera V2 | RGB and depth video | RGB video format | D | Iraq | 21 words: Nothing, Cheek, Friend, Plate, Marriage, Moon, Break, Broom, You, Mirror, Table, Truth, Watch, Arch, Successful, Short, Smoking, I, Push, Stingy, and Long
2023 | [93] | No | PR | 1 | 500 | 5 | MS | - | - | - | S | Saudi Arabia | -
2023 | [94] | No | PR | 72 | 8467 | 20 | MS | Mobile camera | RGB | Videos | D | Egypt | -
Table 11. ArSL Datasets for continuous sentences. “PR” means Private dataset, “AUR” means Available Upon Request dataset, and “PA” means Publicly Available
dataset.
Year | Dataset/Ref. | Used by (With Other Datasets?) | Availability | # Signers | Samples | # Signs | MS/NMS | Acquisition Device | Modalities | Images/Videos | Static/Dynamic | Collection Country | Comments
2019 | Self-acquired sensor-based dataset 2 [96] | [96] 2019 (Yes) | AUR | 2 | 400 sentences, 800 words | 40 | MS | Polhemus G4 motion trackers | Raw data | Motion tracker readings | D | Egypt | -
2019 | Self-acquired vision-based dataset 3 [96] | [96] 2019 (Yes); [98] 2023 (No) | AUR | 1 | 400 sentences, 800 words | 40 | MS | Camera | RGB | Videos with a frame rate of 25 Hz and a spatial resolution of 720 × 528 | D | Egypt | -
Table 12. ArSL Miscellaneous Datasets. “PR” means Private dataset, “AUR” means Available Upon Request dataset, and “PA” means Publicly Available dataset.
Year | Ref. | With Other Datasets? | Availability | # Signers | Samples | # Signs | MS/NMS | Acquisition Device | Modalities | Images/Videos | Static/Dynamic | Collection Country | Comments
2017 | [37] | No | PR | 3 | 26,060 samples | 79 | MS | LMC | Hands skeleton joint points | Sequences of frames | S & D | Egypt | Alphabets: 28, numbers: 11 (0-10), common dentist words: 8, common verbs and nouns (single hand): 20, common verbs and nouns (two hands): 10
2021 | [39] | No | PR | 40 | 16,000 videos | 80 | MS & NMS | MS Kinect camera V2 | RGB, depth, skeleton data | 1280 × 920 × 3 RGB frames | S & D | Saudi Arabia | -
3.1.1. RQ1.1: How Many Datasets Are Used in Each of the Selected Papers?
According to the results in Figure 8, we can see that a high percentage of the reviewed
papers, 76.79%, representing 43 papers, use one dataset in their experiments. Seven pa-
pers, with a percentage of 12.5%, rely on two datasets to conduct their experiments. Three
datasets are utilized in four studies (7.14%), and four datasets are utilized in one study
(1.79%). Only one study uses five datasets, with a percentage of 1.79%.
Figure 8. Percentage of the reviewed studies based on the number of utilized datasets.
One of the following could be the rationale for using multiple datasets:
1. Incorporate different collections of sign language, for example, a dataset for letters
and a dataset for words.
2. Collect sign language datasets pertaining to many domains, such as health-related
words and greeting words.
3. Collect datasets representing different modalities, for example, one for JPG and the
other for depth and skeleton joints and/or different data acquisition devices.
4. Enhance the study by training and testing the model using more than one sign lan-
guage dataset and/or comparing the results.
5. Conduct user-independent sign language recognition, where the proposed model is
tested using signs represented by different signers from those in the training set.
Table 13 summarizes the reasons behind the use of more than one dataset in the re-
viewed studies.
Table 13. Reasons for utilizing more than one dataset in the reviewed studies.
3.1.2. RQ1.2: What Is the Availability Status of the Datasets in the Reviewed Papers?
We can divide the availability of the datasets in the reviewed papers into three main
groups: publicly available, available upon request, and self-acquired private datasets. Fig-
ure 9 shows that self-acquired private datasets constitute the bulk of the reviewed da-
tasets, followed by datasets that can be obtained upon request and those that are publicly
available. Due to the lack of available upon request and publicly available datasets for the
fingerspelling, isolated, or miscellaneous datasets, the researchers who worked on them
had to build their own private ArSL datasets. On the contrary, around three-quarters of
the continuous datasets are available upon request, while the remaining dataset was built
privately.
Public and available upon-request datasets are discussed according to the dataset
category. Examples of each of the datasets are also presented where available.
Figure 10. The ArSL2018 dataset, illustration of the ArSL for Arabic alphabets [48,49].
• Al-Jarrah Dataset
This dataset [40] is available upon request and contains gray-scale images. For every
gesture, 60 signers executed a total of 60 samples. Of the 60 samples available for each
gesture, 40 were used for training, and the remaining 20 were used for testing. The samples
were captured in various orientations and at varying distances from the camera. Samples
of the dataset are exhibited in Figure 11.
• ArSL Dataset
The 700 color images in this dataset [45], which is available upon request, show the
motions of 28 Arabic letters, with 25 images per letter. Various settings and lighting con-
ditions were used to take the images included in the dataset. Different signers wearing
dark-colored gloves and with varying hand sizes executed the actions, as shown in Figure
12.
Figure 12. Samples of the ArSL Alphabet in the ArSL dataset [43].
• KArSL-33
There are 33 dynamic sign words in the publicly available KArSL-33 dataset [86,87].
Each sign was executed by three experienced signers. By performing each sign 50 times
by each signer, a total of 4950 samples (33 × 3 × 50) were generated. There are three mo-
dalities for each sign: RGB, depth, and skeleton joint points.
• mArSL
Five distinct modalities—color, depth, joint points, face, and faceHD—are provided
in the multi-modality ArSL dataset [88,89]. It is composed of 6748 video samples captured
using Kinect V2 sensors, demonstrating fifty signs executed by four signers. Both manual
and non-manual signs are emphasized. An example of the five modalities offered for
every sign in mArSL is presented in Figure 13.
Figure 13. The mArSL dataset, an example of the five modalities provided for each sign sample [88].
• KArSL-100
There are 100 static and dynamic sign representations in the KArSL-100 dataset
[86,87]. A wide range of sign gestures were included in the dataset: 30 numerals, 39 letters,
and 31 sign words. For every sign, there were three experienced signers. Each signer re-
peated each sign 50 times, resulting in an aggregate of 15,000 samples of the whole dataset
(100 × 3 × 50). For every sign, there are three modalities available: skeleton joint points,
depth, and RGB.
• KArSL-190
There are 190 static and dynamic sign representations in the KArSL-190 dataset
[86,87]. The dataset featured a variety of sign gestures, including digits (30 signs), letters
(39 signs), and 121 sign words. Each sign was
executed by three skilled signers. Each signer repeated each sign 50 times, resulting in
28,500 samples of the dataset (190 × 3 × 50). For each sign, there are three modalities avail-
able: skeleton joint points, depth, and RGB.
• KArSL-502
Eleven chapters of the ArSL dictionary’s sign words, totaling 502 static and dynamic
sign words, are contained in the KArSL dataset [86,87]. Numerous sign gestures were in-
corporated in the dataset, including 433 sign words, 30 numerals, and 39 letters. For every
sign, there were three capable signers. Every signer repeated each sign 50 times, resulting
in 75,300 samples of the dataset (502 × 3 × 50). Each sign has three modalities: RGB, depth,
and skeletal joint points.
It is worth mentioning that public non-ArSL datasets were also utilized in a number
of reviewed studies along with ArSL datasets [56,71,78,82,90]. The purpose of these stud-
ies was to apply the proposed models to these publicly available datasets and compare
the results with published work on the same datasets.
3.1.3. RQ1.3 How Many Signers Were Employed to Represent the Signs in Each Dataset?
The number of signers is one of the factors that impact the diversity of the datasets.
In the reviewed studies, the number of signers varies from one dataset to another. The
results show that the minimum number of signers is one in three datasets, whereas the
highest number of signers is 72 in only one dataset, an isolated word dataset. Figure 14
shows that the majority of the ArSL datasets, with around 60%, recruit between one and
ten signers. More diversity is provided by ten ArSL datasets that are executed by 20 to 72
signers. The number of signers who executed the signs is not mentioned or specified (NM)
in any of the other nine ArSL fingerspelling and isolated datasets.
Figure 15. Number of samples in the datasets of reviewed papers for each dataset category.
44 representing letters, numbers, and none [71]. The signs for basic Arabic letters, 28 let-
ters, are represented by eight datasets. The remaining datasets in this category consist of
signs for 30, 31, and 38 basic and extra Arabic letters. Interestingly, the highest number of
signs, 502, is found in an isolated words dataset [86,87], while the lowest number of signs,
five, is also found in isolated words datasets [80,81,93].
Figure 16. Number of datasets for each number of signs in different categories.
3.1.6. RQ1.6 What Is the Number of Datasets That Include Manual Signs, Non-Manual
Signs, or Both Manual and Non-Manual Signs?
Tables 9–12 show the type of sign captured in each reviewed dataset. The type of sign
can be a manual sign (MS), represented by hand or arm gestures and motions and finger-
spelling; non-manual signs (NMS), such as body language, lip patterns, and facial expres-
sions; or both MS and NMS. Figure 17 illustrates that the majority of the signs in all cate-
gories—except miscellaneous—are manual signs. In fingerspelling and continuous sen-
tence datasets, all the signs are manual. Non-manual signs are represented by only two
isolated word datasets. Both sign types are represented by three isolated word datasets
and one miscellaneous dataset.
Figure 17. Number of datasets representing manual signs (MS), non-manual signs (NMS), and both
signs in each dataset category.
3.1.7. RQ1.7 What Are the Data Acquisition Devices That Were Used to Capture the Data
for ArSLR?
Datasets for sign language can be grouped as sensor-based or vision-based, depend-
ing on the equipment used for data acquisition. Sensor-based datasets are gathered by
means of sensors that signers might wear on their wrists or hands. Electronic gloves are
the most utilized sensors for this purpose. One of the primary problems with sensor-based
recognition methods was the need to wear these sensors while signing, which led re-
searchers to turn to vision-based methods. Acquisition devices with one or more cameras
are usually used to gather vision-based datasets. One piece of information about the signer
is provided by single-camera systems, such as a color video stream. Multiple cameras,
each providing distinct information about the signer, such as depth and color information,
are combined to produce a multi-camera device. One such device that can provide
different types of information, like color, depth, and joint point information, is the multi-
modal Kinect.
The majority of the reviewed ArSL datasets use cameras to capture the signs, fol-
lowed by Kinect and leap motion controllers (LMC), as shown in Figure 18. Wearable sen-
sor-based datasets [70,95,96] use devices like DG5-VHand data gloves, Polhemus G4 mo-
tion trackers, and 3-D IMU sensors to capture the signs. These three acquisition devices
are the least used among all the devices due to the recent tendency to experiment with
vision-based ArSLR systems. No devices were specified in two of the datasets.
Figure 18. Data acquisition devices used to capture the data in the ArSL dataset.
3.1.8. RQ1.8 What Are the Acquisition Modalities Used to Capture the Signs?
As depicted in Figure 19, the most popular acquisition modality for the datasets that
belong to fingerspelling and isolated categories is RGB. Raw sensor data (feature vectors)
is mostly used in continuous datasets, whereas RGB, depth, and skeleton joint points are
the most popular in the category of isolated word datasets. The least used acquisition mo-
dalities for fingerspelling datasets are depth, skeleton models, and raw sensor data. For
the other dataset categories, the least used modalities are distributed among different
types of acquisition modalities.
Figure 19. Acquisition modalities used to capture the signs in the ArSL dataset.
3.1.10. RQ1.10 What Is the Percentage of the Datasets That Represent Alphabets, Num-
bers, Words, Sentences, or a Combination of These?
ArSL datasets can be used to represent alphabets, numbers, words, sentences, or
combinations of them. As demonstrated in Figure 21, the highest percentage of the re-
viewed ArSL datasets, roughly 37.50%, constitute words. This is followed by 35.42% of
the datasets that represent alphabets. Combinations of alphabets, numbers, and words are
represented by 10.42%. A minimum number of ArSL datasets are used to represent
sentences, alphabets and numbers, words and sentences, and alphabets, numbers, words,
and sentences with percentages of 8.33%, 4.17%, 2.08%, and 2.08%, respectively.
Figure 21. Percentages of alphabets, numbers, words, sentence datasets, and combinations of them.
3.1.11. RQ1.11 What Is the Percentage of the Datasets That Have Isolated, Continuous,
Fingerspelling, or Miscellaneous Signing Modes?
With a percentage of 47.92%, isolated mode is regarded as the most prevalent mode,
followed by fingerspelling signing (39.58%) and then continuous signing mode (8.33%),
as shown in Figure 22. With 4.17%, the category involving miscellaneous signing modes
has the least amount of work accomplished in this area.
Figure 22. Percentages of the datasets based on the signing mode, fingerspelling, isolated, continu-
ous, and miscellaneous.
3.1.12. RQ1.12 What Is the Percentage of the Datasets Based on Their Data Collection
Location/Country?
Figure 23 illustrates the distribution of the ArSL dataset collection across different
countries. The figure shows that Egypt is the largest contributor to the ArSL dataset col-
lection, accounting for 43.75%. Saudi Arabia follows with a significant contribution of
37.50%. Together, these two countries make up the majority of the dataset collection, to-
taling 81.25%. Iraq and the UAE contribute moderately to the dataset, with 6.25% and
4.17%, respectively. Jordan, Morocco, Palestine, and Syria each contribute a small and
equal share of 2.08%, highlighting their relatively minor involvement. This distribution
suggests the need to expand dataset collection efforts to the underrepresented countries
to ensure a broader and more balanced representation of ArSL datasets. It might also re-
flect potential gaps in resources or interest in ArSL-related initiatives in these regions.
Figure 23. Distribution of the datasets based on their data collection location/country.
3.2. Machine Learning and Deep Learning Algorithms Used for Arabic Sign Language Recognition
To address the second research question, RQ2: “What were the existing methodolo-
gies and techniques used in ArSLR?” six sub-questions were explored. These questions
examine different phases of the ArSLR methods discussed in the reviewed papers, includ-
ing data preprocessing, segmentation, feature extraction, and the recognition and classifi-
cation of signs. The answers to each sub-question are analyzed across various ArSLR cat-
egories, such as fingerspelling, isolated words, continuous sentences, and miscellaneous
methods. Tables 14–17 provide a concise summary of the reviewed studies for each cate-
gory. Rows shaded in gray indicate methods that utilize wearable sensors.
for particular jobs. Color spaces provide distinct benefits; depending on the job require-
ments, conversion can enhance certain features or make analysis easier.
Converting RGB images to grayscale can simplify the image data and hence reduce
computational load. Most of the research that applied greyscale conversion is categorized
as fingerspelling ArSLR [41,46,47,49,53,55,68,72,84], followed by isolated words ArSLR
[82,94], and miscellaneous ArSLR [38]. Other researchers in the category of fingerspelling
ArSLR have converted from greyscale space to red, green, and blue (RGB) images [60] and
to hue, saturation, and value (HSV) color space [62] to ensure compatibility with the clas-
sifier algorithms used, which entails more accurate results. Transformation to YCbCr
space was only applied by two studies published in 2015, for fingerspelling ArSLR [42]
and isolated words ArSLR [79], for the purpose of taking advantage of the lower resolu-
tion for color with respect to luminosity, which means faster processing.
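As a minimal sketch of these conversions, the snippet below uses OpenCV to map a captured image to grayscale, HSV, and YCbCr; the file name is a placeholder rather than an image from any reviewed dataset.

```python
import cv2

bgr = cv2.imread("sign.jpg")                    # placeholder path; OpenCV loads images as BGR
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)    # grayscale: less data, lower computational load
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)      # HSV: separates hue from intensity
ycbcr = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)  # YCbCr (OpenCV's YCrCb): luma plus chroma channels
```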
Resizing and Cropping: Resizing images to a uniform size is essential to ensuring
optimal performance of ML/DL models. It helps avoid computational loads and simplifies
the process for the model to learn patterns uniformly across different samples. With crop-
ping, the focus is on a specific region of an image, and any irrelevant details are removed.
This maximizes the model’s capacity to recognize particular features, which is useful for
tasks where a certain object or object’s location is crucial. In the context of ArSLR, the
hands are considered important objects to crop, as implemented by a number of research-
ers [33,38,92]. In the reviewed papers, most of the researchers have resized their acquired
data, such as images and video frames, to a certain size and then trained their ML/DL
models based on that size [33,35,38,41,44,46,47,53,55,58,60,61,63,65,66,68,69,71–
74,84,92,94]. Other researchers have utilized pre-trained models that require that the data
be fed in a specific size. Therefore, the images were resized to fit the pre-trained model
input layer [52].
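A minimal sketch of cropping a hand region and resizing it to a fixed input size is shown below; the bounding-box coordinates and the 224 × 224 target are illustrative assumptions, not values taken from the reviewed papers.

```python
import cv2

frame = cv2.imread("frame.jpg")               # placeholder video frame
x, y, w, h = 120, 80, 200, 200                # assumed hand bounding box (e.g., from a detector)
hand = frame[y:y + h, x:x + w]                # cropping keeps only the region of interest
hand_resized = cv2.resize(hand, (224, 224))   # uniform size matching a typical pre-trained input layer
```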
Normalization: Normalization refers to all operations and processes meant to stand-
ardize the input according to a predetermined set of rules, with the ultimate goal being to
enhance the ML/DL model’s performance. It may involve several statistical procedures or
input processing operations. The ideal normalization process varies depending on differ-
ent factors, such as the ML/DL model, the degree of variability in the sample, and the
nature of the input, whether it is text, image, or video. Pixel values are usually normalized
to have a mean of 0 and a standard deviation of 1. This process improves the model’s
performance during training by maintaining values within a standardized range, which
helps with convergence. Normalization was mostly applied to the category of finger-
spelling ArSLR [42,44,46,47,53,55–57,61,65,71], followed by isolated words [71,79,82,92],
and continuous ArSLR [95,97]. Sensor readings from the DG5-VHand data gloves were
normalized using the z-score [95], and then the standard deviations and means of the
readings of the training set were saved and used again to normalize the testing set. The
normalization applied by Hisham and Hamouda [97] on the frames captured by Microsoft
Kinect solves two main issues related to the variation of the signers’ position and size. It
consequently yields more accurate feature extraction for the coordinates, regardless of the
signers’ location or size.
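The sketch below illustrates z-score normalization under the assumption of a generic feature matrix: the mean and standard deviation are computed on the training data only and then reused on the test data, as done for the glove readings in [95].

```python
import numpy as np

def fit_zscore(train):
    """Compute per-feature statistics on the training set only."""
    return train.mean(axis=0), train.std(axis=0) + 1e-8

def apply_zscore(data, mean, std):
    return (data - mean) / std

train_readings = np.random.rand(1000, 10)   # placeholder training feature matrix
test_readings = np.random.rand(200, 10)     # placeholder test feature matrix
mean, std = fit_zscore(train_readings)
train_norm = apply_zscore(train_readings, mean, std)
test_norm = apply_zscore(test_readings, mean, std)   # same statistics as the training set
```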
Data Augmentation: The technique of artificially creating new data from preexisting
data, known as data augmentation, is mostly used to train new ML/DL models. Large and
diverse datasets are necessary for the training of the models; however, finding adequately
varied real-world datasets can be difficult. Data augmentation involves making minor ad-
justments to the original data in order to artificially enlarge the dataset. Using data aug-
mentation during model training helps avoid overfitting, which usually occurs when a
model performs well on training data but poorly on unseen data. There are different tech-
niques to implement data augmentation, including rotation, flipping, shearing, shifting,
rescaling, translation, and zooming. Using any of these techniques depends mainly on the
type of input and the characteristics of the model. As illustrated in Tables 14, 15, and 17,
various studies from the field of ArSLR utilize data augmentation techniques for the pur-
pose of enhancing the quality of data and improving the model’s performance.
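A minimal sketch of these augmentation operations, using Keras' ImageDataGenerator with illustrative parameter values (not those of any reviewed study), is given below; note that horizontal flipping is omitted because mirroring can change the handedness of a sign.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,       # rotation
    shear_range=0.1,         # shearing
    width_shift_range=0.1,   # horizontal shifting
    height_shift_range=0.1,  # vertical shifting
    zoom_range=0.1,          # zooming
    rescale=1.0 / 255,       # rescaling pixel values
)
# During training, augmented batches could then be drawn from an image folder, e.g.:
# batches = augmenter.flow_from_directory("arsl_images/", target_size=(224, 224))
```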
Noise Reduction: This process is used to remove or reduce unwanted noise in the
data and irrelevant or redundant features to direct the model’s attention to the data’s most
informative elements. In the reviewed papers, various noise reduction techniques have
been used for this purpose, such as Gaussian filter [38,68,71,84], median filter [38,60,71,90],
averaging filter [71], and weighted average filter [91]. The Pulse Coupled Neural Network
(PCNN) signature was used to reduce the random noise and smooth images [80]. In a
sensor-based ArSLR study [70], a low-pass filter was used to remove dynamic noise from
vibrations and other external factors that affect acceleration readings, and a high-pass fil-
ter was used to remove low-frequency drift from gyroscope readings. Dataset cleaning
was implemented by removing the observations that had null values [48,81] and the rows
that had the same values [48].
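The sketch below illustrates the two families of noise reduction mentioned above: Gaussian and median filtering for images, and Butterworth low-/high-pass filtering for accelerometer and gyroscope streams; the cutoff frequencies and sampling rate are assumed values.

```python
import cv2
import numpy as np
from scipy.signal import butter, filtfilt

img = cv2.imread("sign.jpg", cv2.IMREAD_GRAYSCALE)    # placeholder image
smoothed = cv2.GaussianBlur(img, (5, 5), 0)           # suppress high-frequency pixel noise
despeckled = cv2.medianBlur(img, 5)                   # remove salt-and-pepper noise

fs = 100.0                                            # assumed sensor sampling rate (Hz)
b_lo, a_lo = butter(4, 5.0 / (fs / 2), btype="low")   # low-pass: remove vibration noise from acceleration
b_hi, a_hi = butter(4, 0.3 / (fs / 2), btype="high")  # high-pass: remove low-frequency gyroscope drift
accel, gyro = np.random.randn(1000), np.random.randn(1000)   # placeholder signals
accel_clean = filtfilt(b_lo, a_lo, accel)
gyro_clean = filtfilt(b_hi, a_hi, gyro)
```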
been obtained in hand segmentation and gesture recognition using this technique. Most
reviewed ArSLR studies published in 2020 onwards have utilized this technique to help
boost their results. CNNs are adopted in this field, wherein various CNN architectures are
trained from scratch [38,39,46,53,55–57,69,78,94].
Transfer learning in CNN involves utilizing the early and central layers while only
retraining the last layers on a different set of classes. The model leverages labeled data
from its original training task, hence reducing training time, improving neural network
performance, and functioning well with limited data. Examples of the pre-trained models
used in transfer learning include VGG, AlexNet, MobileNet, Inception, ResNet, DensNet,
SqueezeNet, EfficientNet, CapsNet, and others. Many researchers in the field of ArSLR
used this approach for segmentation [52,54,58–66,71,73,74,90–93,98]. Alharthi and Alzah-
rani utilized two pretrained vision transformers, ViT (ViT_b16, ViT_l32) and Swin
(SwinV2Tiny256) [64]. In ViT, an image is considered a series of patches [99]. The image
is then split into small patches, and a 1D vector is created from each patch. The trans-
former model receives these patch embeddings as input. The model can focus on various
patches while tracking the connections between them due to the self-attention mechanism.
It assists in the model’s understanding of the image’s context and interdependencies. Po-
sitional embeddings, which offer details about the spatial placement of each patch, are
also part of ViT. Built upon the Transformer architecture, Swin processes and compre-
hends sequential data using a multi-layered system of self-attention mechanisms [100].
Swin has created shifted windows that operate by partitioning the input image into
smaller patches or windows and shifting them throughout the self-attention process. By
using this method, Swin can handle huge images quickly and effectively without having
to rely on computationally demanding processes like convolutional operations or sliding
window mechanisms. Swin increases the receptive field to gather global dependencies
and improves the model’s comprehension of the visual context by shifting the patches.
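A minimal sketch of this transfer-learning pattern is shown below: a pre-trained backbone is frozen and only a new classification head is trained. MobileNetV2 and the 32-class output are illustrative assumptions, not the configuration of any particular reviewed study.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                        # reuse ImageNet features; early/central layers stay fixed

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(32, activation="softmax"),   # e.g., an assumed 32 ArSL alphabet classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```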
Aly and Aly addressed hand segmentation using the state-of-the-art semantic seg-
mentation DeepLabv3+ model, which is built on ResNet-50 as a backbone encoder net-
work with atrous spatial pyramid pooling [77]. Using the DeepLabv3+ mask image, hand
areas are cropped from each corresponding frame of the input video sequence, and the
resulting mask image is then normalized to a fixed size for scale invariance. Alawwad et
al. used fast R-CNN to enhance the efficiency and speed of the original model, R-CNN, by
integrating a Region Proposal Network (RPN) along with an ROI pooling layer, which
reduces the processing time and contributes to overall better performance [73].
nine features achieved from the autocorrelation and frequency. Another glove-based
study for the recognition of continuous ArSL by Tubaiz et al. [95] employed a feature ex-
traction technique that reflects the temporal dependence of the data. In this technique, a
sliding window is used to compute the mean and standard deviations of the sensor read-
ings. The accuracy of the classification is affected by the sliding window’s size. A small
window size is insufficient to fully convey the present feature vector’s context. The context
becomes increasingly noticeable as the size expands until it becomes saturated. Classifi-
cation accuracy suffers when window size is increased further. Hassan et al. [96] applied
a window-based approach to glove-based data and a 2D Discrete Cosine Transform (DCT)
to vision-based datasets. In 2D DCT, the feature extraction relies on two parameters to be
specified. The first parameter is cutoff, which is the number of DCT coefficients to keep in
a feature vector. The more coefficients there are, the higher the recognition rate. However,
recognition rates generally decline when the feature vector’s dimensionality rises above a
particular threshold; hence, there is typically a threshold beyond which any increase in
the DCT cutoff will result in a decline in recognition rates. The weighting parameter x is
the second parameter that needs to be specified empirically. When 100 DCT coefficients
and the value of x = 1 were used, the highest classification rate was obtained.
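As a sketch of the window-based features described above, the function below summarizes each window of consecutive sensor readings by its mean and standard deviation; the window size is the empirically tuned parameter discussed in the text, and the 22-channel stream is an assumed placeholder.

```python
import numpy as np

def window_features(readings, window=10):
    """readings: (time_steps, channels) sensor stream -> (n_windows, 2 * channels) feature matrix."""
    feats = []
    for start in range(0, len(readings) - window + 1, window):
        chunk = readings[start:start + window]
        feats.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
    return np.asarray(feats)

stream = np.random.rand(500, 22)                 # placeholder: 500 time steps from a 22-channel glove
print(window_features(stream, window=10).shape)  # (50, 44): mean and std per channel per window
```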
Elatawy et al. [68] developed an alphabet ArSLR system using the neutrosophic tech-
nique and fuzzy c-means. They proposed to use the Gray Level Co-occurrence Matrix
(GLCM) to extract features from neutrosophic images. Three matrices—object (T), edge
(I), and background (F)—are used to describe neutrophilic images. The GLCM works by
scanning the image and recording the gray levels of each pair of pixels that are spaced
apart by a set direction (0 and distant). Pixels and their neighbors are, hence, the basis for
GLCM’s feature extraction process for neutrosophic images. The contrast, homogeneity,
correlation, and energy are the calculated GLCM parameters. From each image, a total of
12 features, consisting of 4 GLCM parameters for each image component T, I, and F, are
extracted.
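A minimal sketch of computing the four GLCM parameters named above (contrast, homogeneity, correlation, and energy) with scikit-image is shown below; a random grayscale array stands in for one neutrosophic component (T, I, or F), so a full pipeline would repeat this for all three components to obtain the 12 features.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

component = (np.random.rand(64, 64) * 255).astype(np.uint8)   # placeholder image component
glcm = graycomatrix(component, distances=[1], angles=[0],
                    levels=256, symmetric=True, normed=True)

features = [graycoprops(glcm, prop)[0, 0]
            for prop in ("contrast", "homogeneity", "correlation", "energy")]
print(features)   # 4 values per component; T, I, and F together give the 12 features
```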
Hybrid feature extraction has been utilized in various studies to overcome the limi-
tations of single techniques and benefit from the advantages. Tharwat et al. [43] used SIFT
to extract the invariant and distinctive features and LDA to reduce dimensions and thus
increase the system’s performance in recognizing ArSL letters. A study conducted by Ah-
med et al. [33] to recognize the ArSL isolated dynamic gestures proposed a feature inte-
gration between intensity histogram features and GLCM. The former contains six features
that represent the first-order statistical information about the image, such as mean, vari-
ance, skewness, kurtosis, energy, and entropy, and the latter consists of 23 features to rep-
resent the second-order statistical information about the image features, like contrast, ho-
mogeneity, dissimilarity, angular second moment, energy, and entropy. The combined
integrated features vector comprises 26 features since three features are shared by both.
Other researchers have investigated the feature extraction techniques by conducting
a comparison between them to measure their performance. Sidig et al. [76] compared dif-
ferent feature extraction techniques: Modified Fourier Transform (MFT), Local Binary Pat-
tern (LBP), Histogram of Oriented Gradients (HOGs), and combination of Histogram of
Oriented Gradients and Histogram of Optical Flow (HOG-HOF) for isolated word ArSL
recognition with Hidden Markov Model (HMM) for classification. With MFT and HOG,
the best accuracy was achieved. In a previous study, Alzohairi et al. [49] conducted a com-
parison between five texture descriptors for the purpose of ArSL alphabet recognition.
The descriptors are HOG, Edge Histogram Descriptor (EHD), GLCM, Discrete Wavelet
Texture Descriptor (DWT), and LBP. These texture descriptors include details on the edges
of the image as well as region homogeneity. The comparison results reveal that the HOG
descriptor outperforms the other descriptors. Agab and Chelali [41] applied individual
descriptors, including DWT, the Dual Tree Complex Wavelet Transform (DT-CWT), HOG,
and two combined descriptors, namely DWT + HOG and DT-CWT + HOG, when com-
pared to the individual descriptors, the combined descriptors DT-CWT + HOG outper-
formed with respect to accuracy rate and execution times.
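For reference, the snippet below computes a HOG descriptor, the texture descriptor reported above as the strongest performer; the input image and parameter values are illustrative defaults rather than the settings of the cited studies.

```python
import numpy as np
from skimage.feature import hog

image = np.random.rand(128, 128)    # placeholder grayscale hand image
descriptor = hog(
    image,
    orientations=9,                 # number of gradient-orientation bins
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm="L2-Hys",
)
print(descriptor.shape)             # fixed-length feature vector to feed a classifier
```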
In addition to the feature extraction methods discussed above, other researchers em-
ployed widely used deep learning techniques, such as CNNs, to extract pertinent features.
These techniques extract features in the first layers and then feed them into the subsequent
layers. CNNs have been utilized to extract features from fingerspelling ArSL images
[46,47,53,55–57,69,94] and video frames [94]. CNNs have been used in combination with
Long Short-Term Memory (LSTM) in order to extract spatial and temporal data depend-
encies [78], where the CNN model was adopted to extract features from each video frame
separately and LSTM to learn the temporal features across video frames. Aly and Aly [77]
have utilized the Convolutional Self-Organizing Map (CSOM) to extract hand-shape fea-
tures from video frames and the Bi-directional Long Short-Term Memory (BiLSTM) to
model the temporal dependencies in the video sequences.
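A minimal sketch of the CNN+LSTM idea, assuming 30 frames of 64 × 64 RGB video and 40 output classes (illustrative values only), is shown below: a small CNN is applied to every frame via TimeDistributed layers, and an LSTM models the temporal dependencies across the resulting per-frame feature vectors.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(30, 64, 64, 3)),                        # 30 frames of 64x64 RGB
    layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Conv2D(64, 3, activation="relu")),
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),    # one spatial feature vector per frame
    layers.LSTM(128),                                           # temporal modelling across frames
    layers.Dense(40, activation="softmax"),                     # e.g., an assumed 40 isolated-word classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```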
In transfer learning, pre-trained models, such as VGG, ResNet, AlexNet, and Incep-
tion, have been trained on large-scale image datasets like ImageNet. These models func-
tion as effective feature extractors, converting raw images into forms that contain signifi-
cant details about the visual material. By extracting features from pre-trained neural mod-
els, the information gathered from massive datasets is used to improve the performance
of others’ work. Most of the studies conducted in 2019 onwards in the context of finger-
spelling ArSLR have relied on pre-trained models to extract the features. Most of these
studies have experimented with and compared various models [52,54,58,59,62–65,73]. Is-
mail et al. [71] have compared the performance of various single models and multi-models
in feature extraction and ArSL recognition.
A few researchers have implemented solo pre-trained models, for example, light-
weight EfficientNet with different settings [74] and EfficientNetB3 [66]. Islam et al. [66]
have leveraged EfficientNetB3 to extract initial features and adopted stacked autoencod-
ers to further refine these features. Alnuaim et al. [60] implemented two models, ResNet50
and MobileNetV2, together. In isolated and continuous ArSLR, various studies investi-
gated the implementation of different pre-trained models accompanied by Recurrent
Neural Networks (RNNs) or LSTM to extract spatial and temporal features accurately
[35,88,90,92,98]. A few researchers have relied on single models to be feature extractors to
generate a set of feature vectors, such as a capsule neural network (CapsNet) [91] and the
DenseNet169 model [93].
Machine Learning
Machine learning (ML) is a branch of artificial intelligence (AI) that enables machines
to learn from data and make decisions or predictions without being specifically pro-
grammed to do so. Fundamentally, machine learning is concerned with developing and
applying algorithms that help with these decisions and predictions. As they handle more
data, these algorithms are built to perform better over time, becoming more precise and
effective. In the following, ML algorithms employed in the reviewed ArSLR research pa-
pers are discussed.
HMM is a powerful statistical modeling method that can identify patterns in the com-
plicated relationships between actions in a continuum of time and space. HMM has been
applied to the field of ArSLR by different studies to classify static hand gestures of ArSL
alphabets [42], isolated words [76,79], and continuous sentences [96]. Abdo et al. [42] mod-
eled the information of each sign with a different HMM. The model with the highest like-
lihood was chosen as the best model, and the test sign was classified as the sign of that
model. The HMM was applied to the self-acquired dataset, ArASLRDB, with 29 Arabic
alphabet signs. The recognition system was tested when splitting the rectangle surround-
ing the hand shape into 4, 9, 16, and 25 zones. The optimal number of zones was
determined to be 16, with 19 states that recognize the Arabic alphabet of sign language.
The algorithm could reach a 100% recognition rate by increasing the zone number to 16
or more, but it would take more time. To recognize ArSL at the word level, Abdo et al.
[79] proposed to utilize the Enhancement of Motion Chain Code (EMCC) together with HMM.
The achieved recognition rate of 98.8% on a private, self-gathered dataset of 40 ArSL words
outperformed comparable systems. Hassan et al. [96] compared two classifi-
cation algorithms, HMM (RASR and GT2K toolkits) and Modified K-Nearest Neighbor
(MKNN), which are adequate for sequential data on sensor-based and vision-based da-
tasets. Despite the high recognition rates obtained by the RASR and GT2K HMM toolkits,
MKNN has the best sentence recognition rates, exceeding both HMM toolkits.
The MKNN algorithm was first proposed by Tubaiz et al. [95] in 2015 to classify se-
quential data for a glove-based ArSL. In this modification, the context prior to predicting
the label of each feature vector is considered. It relies on using the most prevalent label
within a surrounding window of labels to replace the predicted label. Once every label in
a given sentence has been predicted, the statistical mode of the labels that surround it is
used to replace each label. KNN is a non-parametric supervised machine learning algo-
rithm that has been adopted by a number of ArSLR researchers [37,43,48,72,85,97]. By cal-
culating the distances between unknown patterns and each sample, the KNN classifier
can recognize unknown patterns based on how similar they are to known samples. The
K-nearest samples are then chosen as the basis for classification, and the unknown pattern
is assigned to the class that holds the majority among those K samples.
It has been noticed that KNN outperformed the other compared algorithms in terms of
accuracy for miscellaneous ArSLR [37] and continuous sentence recognition [97].
Hisham and Hamouda [37] carried out a comparison between different algorithms, in-
cluding SVM, KNN, and ANN, for static and dynamic gestures depending on two differ-
ent feature sets: palm features set, and bone features set. KNN obtained the best accuracy
results for the static gestures, achieving 99% and 98% for the two sets, respectively. In
another study conducted by Hisham and Hamouda [97], the experimental results revealed
that the accuracy of KNN with majority voting outperformed the other algorithms in rec-
ognizing dynamic medical phrase signs. On the other hand, when KNN was compared
with other algorithms, including SVM and nearest neighbor (minimum distance) [43],
SVM and RF [48] for fingerspelling ArSLR, and different variations of SVM and RF for
isolated word recognition [85], the findings indicated that the best performance was
achieved by SVM. In the context of fingerspelling recognition, just one study [68] demon-
strates that the KNN algorithm outperformed other algorithms, including C4.5, Naïve
Bayes (NB), and Multilayer Perceptron (MLP), in terms of accuracy.
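As a rough illustration of the two ideas above, the sketch below combines frame-wise KNN classification with the MKNN-style post-processing described in [95], i.e., replacing each predicted label with the statistical mode of the labels in a surrounding window; the window size and the synthetic glove-style features are assumptions, not values from the cited studies.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

def knn_predict_sequence(train_X, train_y, seq_X, k=5):
    """Frame-wise KNN prediction for a sequence of feature vectors."""
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(train_X, train_y)
    return clf.predict(seq_X)

def mode_smooth(labels, window=5):
    """MKNN-style post-processing: replace each label with the statistical
    mode of the labels inside a surrounding window (illustrative window size)."""
    labels = np.asarray(labels)
    half = window // 2
    smoothed = labels.copy()
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        smoothed[i] = Counter(labels[lo:hi].tolist()).most_common(1)[0][0]
    return smoothed

# Example with synthetic glove-style feature vectors (purely illustrative data).
rng = np.random.default_rng(0)
train_X = rng.normal(size=(200, 12)); train_y = rng.integers(0, 4, size=200)
seq_X = rng.normal(size=(30, 12))
raw = knn_predict_sequence(train_X, train_y, seq_X)
print(mode_smooth(raw))
```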
SVM is a supervised machine learning approach that is mainly used for regression
and classification tasks. The SVM algorithm works by finding the optimal hyperplane that
maximizes the distance between each class in an N-dimensional space in order to classify
data. SVM is one of the popular algorithms that have been used for different categories in
the field of ArSLR, including fingerspelling recognition [41,43,48], isolated recognition
[85], continuous recognition [97], and miscellaneous ArSL recognition [37]. The perfor-
mance of the SVM algorithm proved to be outstanding when compared to other algo-
rithms in fingerspelling and isolated recognition. Tharwat et al. [43] carried out a number
of experiments to compare different ML algorithms, which showed that the performance
of the SVM was superior to that of the KNN and minimum distance classifiers in fingerspelling recog-
nition, with around 99% accuracy. While the experiments proved that the SVM algorithm
is robust against any rotation, achieving 99% accuracy, the performance needs to be im-
proved in the case of image occlusion. A different approach was proposed by Almasre and
Al-Nuaim [48], where two stages of classification were carried out using SVM, KNN, and
RF. Stage 1 involved training the three classifiers on the original dataset. Stage 2 entailed
training the classifiers on an ensemble dataset, where the output of each classifier was
coupled with an ensemble schema dataset to reclassify the classes. To see whether the number
of observations affected the performance of the classifiers, different numbers of observations for each letter were
evaluated. When used as a standalone classifier, SVM yielded a superior overall accuracy
of 96.119%, regardless of the number of observations. SVM would be a more efficient op-
tion because it requires less complexity while obtaining higher accuracy. Agab and Chelali
[41] proposed a static and dynamic hand gesture recognition system that adopts the com-
bined feature descriptors DT-CWT + HOG and compared the classification performance
of three Artificial Neural Network (ANN) variants, namely MLP, the Probabilistic Neural Network (PNN),
and the Radial Basis Neural Network (RBNN), together with SVM and RF. Four distinct datasets comprising
alphabet signs and dynamic gestures, including alphabet ArSL, were used for the experi-
mental evaluation. The SVM classifier performed better for the ArSL dataset with regard
to recognition rates and processing time. To recognize dynamic ArSL word gestures, Al-
masre and Al-Nuaim [85] proposed a dynamic prototype model (DPM) using Kinect as
an input device. A total of eleven predictive models based on three algorithms, namely
SVM, RF, and KNN, with varying parameter settings, were employed by the DPM. Ac-
cording to research findings, SVM models using a linear kernel and a cost parameter of
0.035 were able to attain the maximum accuracy for the dynamic words gestured. Alzohair
et al. [49] developed a model employing a one-versus-all SVM classifier for each gesture.
In their model, one class for each ArSL alphabet gesture was considered, and thirty classes
resulted from this. A model is learned for each gesture by training the classifier using one
particular class against all the others. The one-versus-all strategy looks for a hyperplane
that differentiates the classes by considering all classes and splitting them into two groups,
one for the points of the class under analysis and another for all other points.
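A minimal sketch of the one-versus-all SVM setup just described, using scikit-learn's OneVsRestClassifier so that one binary SVM is trained per gesture class; the 128-dimensional descriptors and the 30 classes are placeholders standing in for HOG-style features of the ArSL alphabet.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One binary SVM per gesture class: each model separates "this letter" vs. all others.
ovr_svm = OneVsRestClassifier(
    make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
)

# Placeholder data: rows are feature descriptors (e.g., HOG) for 30 alphabet gestures.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 128))
y = rng.integers(0, 30, size=300)

ovr_svm.fit(X, y)
print(ovr_svm.predict(X[:5]))   # predicted gesture classes for the first samples
```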
The RF algorithm is a popular tree-learning approach in machine learning. During
the training stage, it generates a collection of decision trees. Each tree is built on a random
subset of the data, and a random subset of features is considered at each split. Because the
randomization makes the individual trees more diverse, the risk of overfitting is reduced
and overall prediction performance is enhanced. In predictions, the
algorithm averages (for regression tasks) or votes (for classification tasks) the output of
each tree. The findings of this cooperative decision-making process, which is aided by the
insights of several trees, are consistent and accurate. Random forests are commonly uti-
lized for classification and regression tasks because of their reputation for managing com-
plex data, minimizing overfitting, and producing accurate predictions in a variety of set-
tings. A few studies have compared the RF performance to other ML algorithms for the
sake of fingerspelling recognition [41,48] and isolated recognition [85]. The comparative
results showed that while RF did not obtain the best recognition accuracy, it outperformed
all other classifiers in terms of recognition rates for non-ArSL datasets, such as the ASL,
Marcel, and Cambridge datasets [41]. Elpeltagy et al. [82] proposed to use the Canonical
Correlation Analysis (CCA) [101] and RF algorithms for isolated word recognition. The
proposed approach is based on hand shape and motion, where HOG-PCA is used for
hand shape description, CCA for hand shape matching, Cov3DJ+ for motion and feature
description, and RF for motion classification. The classification starts by applying the RF
to the Cov3DJ+ descriptor in order to determine the top T signs with the highest
probabilities. Subsequently, the CCA is used to determine which of these top signs is
right by applying it to the hand-shape descriptors that correspond to them. CCA enhances
classification performance by combining data from various performers and repetitions.
Deriche et al. [83] proposed the use of dual LMCs to capture the signer performing
isolated Arabic dynamic word signs. The Gaussian Mixture Model (GMM) and a Bayesian
classifier were used to examine the features that were extracted from the two LMCs. The
individual Bayesian classifier findings were aggregated through an evidence-based meth-
odology, specifically the Dempster-Shafer (DS) theory of evidence. A medium-sized vo-
cabulary (100 signs) comprising signs frequently used in social settings was utilized to
evaluate the suggested method. A simple LDA-based method to examine the system’s
performance across several classifiers was employed. The results demonstrate that the
combination strategy based on DS theory performs approximately 5% better than the
LDA-based approach. About 92% recognition accuracy was attained.
The Euclidean distance classifier has been utilized in two isolated word recognition
studies [5,33]. Euclidean distance measures the similarity between two feature vectors that
are built directly from geometric features of the manual signs [33] or both manual and
non-manual signs [5]. The experimental results show that the proposed system by Ahmed
et al. [33] recognizes signs with an accuracy of 95.8%. Better outcomes were achieved in
the Ibrahim et al. study, where the system demonstrated its resilience against various oc-
clusion scenarios and reached a recognition rate of 97% in signer-independent mode [5].
Similar images are grouped together in the image clustering phase. In clustering
problems, the fuzzy c-means approach is frequently employed. It is a clustering technique
that gathers every data pixel into two or more clusters. The membership of the data
changes to point in the direction of the designated cluster center as it moves closer to it.
The Euclidean distance can then be used to calculate the degree of fuzziness between the
cluster centers. This approach has been employed by Elatawy et al. [68] to recognize the
Arabic alphabet sign language after converting the images to the neutrosophic domain
and extracting their features. According to the experimental evaluation, the fuzzy c-means
approach resulted in a 91% recognition accuracy rate.
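For reference, a compact NumPy implementation of the standard fuzzy c-means updates (fuzzy-weighted centers and the membership equation mentioned above); the cluster count, fuzzifier m, and data are illustrative, and the neutrosophic-domain conversion used in [68] is not reproduced.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=2, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means; returns cluster centers and the membership matrix U."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # memberships of each sample sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]              # fuzzy-weighted means
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        ratio = dist[:, :, None] / dist[:, None, :]                 # d_nc / d_nj for all cluster pairs
        U = 1.0 / (ratio ** (2.0 / (m - 1))).sum(axis=2)            # membership update
    return centers, U

# Illustrative use on flattened image-feature vectors (synthetic data).
X = np.random.default_rng(1).normal(size=(100, 16))
centers, U = fuzzy_c_means(X, n_clusters=3)
hard_labels = U.argmax(axis=1)                   # defuzzify for a crisp assignment
```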
Dynamic Time Warping (DTW) is known as an optimal alignment algorithm between
two given sequences. DTW is used in many domains to quantify the similarity between
two sequences that are changing in speed or time. Because it can handle the speeds at
which signs are performed, the DTW is very appropriate for tasks involving sign recogni-
tion. This involves utilizing DTW to compare a set of frames from the training set with a
set of frames from the test set. Each set of frames is treated as a signal or pattern. To
measure the similarity between two sequences, they are warped non-linearly in the time
dimension so that non-linear variations in signing speed do not affect the comparison.
The training sequence with the shortest DTW distance to the test sequence is identified
as the most similar, and its class is assigned to the test sign. Hisham and Hamouda [37] used
LMC as an input device and employed this algorithm for dynamic gestures. The study
findings showed that DTW dominated other models, including KNN, SVM, and ANN, for
both the palm feature set and the bone feature set, with accuracy of 97.4% and 96.4%,
respectively. When DTW was used for continuous sign recognition captured by Kinect
[97], the performance was worse than that of the other models, KNN, SVM, and ANN, in terms of
accuracy and response time.
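The following is a small NumPy sketch of DTW-based nearest-neighbour sign classification as outlined above; it assumes each sign is represented as a sequence of frame feature vectors (for example, LMC palm or bone features) and uses Euclidean frame-to-frame distances.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic Time Warping distance between two sequences of frame feature vectors."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],              # insertion
                                 cost[i, j - 1],              # deletion
                                 cost[i - 1, j - 1])          # match
    return cost[n, m]

def classify_by_dtw(test_seq, train_seqs, train_labels):
    """Nearest-neighbour classification: assign the label of the training sequence
    with the smallest DTW distance to the test sequence."""
    dists = [dtw_distance(test_seq, s) for s in train_seqs]
    return train_labels[int(np.argmin(dists))]
```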
Deep Learning
The field of deep learning is concerned with learning data representations. However,
the intricacy of the models and the underlying features of the system’s input restrict the
ability of deep learning techniques to capture semantics embedded within data. The ad-
vances in deep learning have improved sign language recognition accuracy and effective-
ness, leveraging ANNs, CNNs, and RNNs. An ANN is made up of several perceptrons or
neurons at each layer. Because an ANN only processes inputs in a forward manner, it is
often referred to as a feed-forward neural network. One of the most basic varieties of neu-
ral networks is this kind of network. Information is passed through a number of input
nodes in a single direction until it reaches the output node. Hidden node layers may or
may not exist in the network; networks with fewer hidden layers are easier to interpret.
ANN did not show high performance when compared to other ML algorithms for static
gestures representing letters, numbers, and words [37]. A similar result was achieved
when applying ANN in the context of continuous ArSLR [97].
MLP is a feed-forward ANN with at least three layers of neurons: input, output, and
hidden. The MLP’s neurons usually employ fully connected neurons’ nonlinear activation
functions, which enable the network to recognize input with complicated patterns. Similar
to MLP, RBNN is a feed-forward neural network that has a single hidden layer made of
nonlinear radial basis functions (RBF), like a Gaussian function. Each neuron calculates
the distance between the input data and the function’s center; at shorter distances, the
neuron’s output value increases. PNN is another feedforward neural network that is fre-
quently utilized to address pattern recognition and classification issues [102]. A PNN clas-
sifier is an application of Bayesian networks and kernel discriminant analysis that develops
a family of probability density function estimators. Agab and Chelali [41] conducted a
comparison between five different classifiers, including three variants of the ANN, which
are MLP, PNN, and RBNN, as well as SVM and RF, for the purpose of fingerspelling
recognition. The results demonstrate lower performance of the ANN’s variants compared
to the SVM and RF algorithms. A similar low performance for MLP was obtained by an-
other fingerspelling recognition study [72] when comparing MLP with C4.5, NB, and
KNN.
Two studies were carried out using MLP for the purpose of isolated word recognition
[80,93]. ElBadawy et al. [80] proposed using MLP to classify manual sign input and a PCA
network followed by an MLP network for facial expressions and body movements. Due
to the integrated modules for body movement and facial expression recognition, the sys-
tem achieved an accuracy of 95% for a dataset with 20 dynamic signs. Al-Onazi et al. [93]
utilized the MLP classifier for sign recognition and classification in order to identify and
categorize the presence of sign language gestures for five words. The Deer Hunting Opti-
mization (DHO) algorithm is then applied to optimize the MLP model’s parameters. With
an accuracy of 92.88%, the comparison analysis demonstrated that the proposed method
produced better results for gesture classification than other methods.
A Deep Belief Network (DBN) is a class of deep neural networks used for unsuper-
vised learning activities like generative modeling, feature learning, and dimensionality
reduction. It is made up of several layers of hidden units that are trained to represent data
in a structured way. Among the reviewed studies, only one paper was found to utilize the
DBN approach paired with the direct use of tiny images to recognize and categorize Ara-
bic letter signs [44]. By identifying the most significant features from sparsely represented
data, deep learning was able to significantly reduce the complexity of the recognition
problem. After scaling and normalization, a total of about 6000 samples of the 28 Arabic
alphabetic signs were employed to extract features. A softmax regression was used to eval-
uate the classification process, and the results showed an overall accuracy of 83.32%,
demonstrating the great reliability of the Arabic alphabetical letter recognition model
based on DBN.
CNNs are the extended version of ANNs and one of the most widely utilized models
in use nowadays. This neural network computational model comprises one or more
convolutional layers, typically followed by pooling and fully connected layers, and is based on a variant of
the multilayer perceptron. CNN's ability to autonomously recognize key features without
human oversight is by far its greatest advantage over its forerunners. Additionally, CNN
delivers remarkable accuracy and processing efficiency. Numerous studies have used
CNNs to recognize fingerspelling ArSL [46,47,53,55–57,69]. Althagafi et al. [53] developed
a system that automatically recognizes 28 letters in Arabic Sign Language using a CNN
model with a grayscale image as input [50,51]. Using 54,049 sign images [50,51], Latif et
al. [55] offered various CNN architectures. Their results show how the size of the dataset
has a significant impact on the proposed model’s accuracy. As the dataset size is increased
from 8302 samples to 27,985 samples, the testing accuracy of the suggested model rises
from 80.3% to 93.9%. When the dataset size is raised from 33,406 samples to 50,000 sam-
ples, the testing accuracy of the suggested model improves even further, rising from 94.1%
to 95.9%. Alshomrani et al. [56] utilized CNNs to categorize the images into signs. CNN
has been experimented with various settings for datasets containing Arabic and American
signs. With an accuracy of 96.4%, CNN-2—which comprises two hidden layers—pro-
duced the best results for the Arabic sign language dataset [50,51]. Kamruzzaman [69]
developed a vision-based method that uses CNN to recognize Arabic hand sign-based
letters and transform them into Arabic speech with a 90% recognition accuracy. Utilizing
the ArSL2018 dataset and a special ArSL-CNN architecture, Alani and Cosma [57] built a
system for recognizing Arabic signs. During training, the suggested ArSL-CNN model's
accuracy was 98.80%, while the initial testing accuracy was 96.59%. In order to reduce the
impact that unbalanced data has on the model's accuracy, they applied a range of resampling
techniques to the dataset. The results show that the synthetic minority oversampling
technique (SMOTE) improved the overall testing accuracy from 96.59% to 97.29%. Abdelghfar et al. [47]
proposed a new convolutional neural network-based model for Qur'anic sign language
recognition, QSLRS-CNN. A subset of the larger Arabic sign language collection, ArSL2018
[50,51], comprising just 24,137 images, was used for the tests. This subset represents the
14 dashed letters in the Holy Qur’an. The experiments were carried out on this portion of
the dataset. The QSLRS-CNN model obtained 98.05% training accuracy and 97.13% test-
ing accuracy for 100 epochs. In order to address class imbalance, the model was then
trained and tested using several resampling techniques. Based on the findings, the testing
accuracy increased from 97.13% to 97.67% overall when SMOT is used. The same meth-
odology was adopted by Abdelghfar et al. [46] for 100 and 200 epochs. The SMOT method
shows slightly better performance using 200 learning epochs but takes more time.
Many researchers have utilized CNN-based transfer learning for ArSL recognition.
Conducting experiments to compare various pre-trained models was one of the method-
ologies proposed by several studies [52,54,58,59,62,63,65,66,71,73]. A deep transfer learn-
ing-based recognition method for ArSL was proposed by Shahin and Almotairi [52]. They
employed several transfer learning techniques, including AlexNet, SqueezeNet, VGG-
Net16, VGGNet19, GoogleNet, DenseNet, MobileNet, ResNet18, ResNet50, ResNet101,
and InceptionV3, based on data augmentation and fine-tuning to lessen overfitting and
enhance performance. The experiment results on the ArSL2018 dataset [50,51] show that
ResNet101, the suggested residual network system, obtained a maximum accuracy of
99.52%. Alsaadi et al. [58] trained and evaluated four cutting-edge models: AlexNet,
VGG16, GoogleNet, and ResNet, using ArSL2018 [50,51], in order to determine which
CNN model would be best for classifying sign language. With 94.81% accuracy, AlexNet
was found to have the highest outcomes. Next, an AlexNet-based real-time recognition
system was built. A comparison analysis based on three popular deep pre-trained mod-
els—AlexNet, VGGNet, and GoogleNet/Inception—was conducted [59] using the
ArSL2018 dataset [50,51]. Test accuracy varied throughout the models, with VGGNet
achieving the best score of 97%. Experiments have been performed by Islam et al. [61] on
the ArSL2018 dataset using a variety of pre-trained models, including Xception, VGG16,
ResNet50, InceptionV3, MobileNet, and EfficientNetB4. Although EfficientNetB4 is a relatively
heavy-weight and intricate architecture, the findings reveal that it attained the best accuracy
among the compared models, with a training accuracy of 98% and a testing accuracy of 95%.
Using ArSL2018, Baker et al. [65] comprehensively assessed and compared the per-
formance of six different pre-trained models: Xception, ResNet50V2, InceptionV3, VGG16,
MobileNetV2, and ResNet152. Early stopping and data augmentation strategies were
used experimentally to improve the pre-trained models’ robustness and efficacy. The re-
sults demonstrated the greater accuracy attained by InceptionV3 and ResNet50V2, both
of which reached 100% accuracy—the best accuracy ever attained. Two pre-trained mod-
els with intermediate layers, VGG16 and VGG19, were examined by Nahar et al. [63] to
recognize sign language for the Arabic alphabet. Following testing on several datasets and
dataset-specific adjustments, both models were trained using various methods. Analyzing
the data revealed that the best accuracy results were obtained by fine-tuning these two
models’ fifth and fourth blocks. With regard to VGG16, the testing accuracy was
specifically 96.51% for the fourth block and 96.50% for the fifth block. Similar findings
were observed in the testing accuracy of the second model. Saleh and Issa [54] exploited
transfer learning and deep CNN fine-tuning to increase the accuracy of 32-hand gesture
recognition by using the ArSL2018 dataset [50,51]. In order to address the imbalance re-
sulting from the difference in class sizes, the dataset was randomly undersampled. There
were 25,600 images instead of 54,049 in total. The best accuracy for VGG16 was 99.4%, and
for ResNet152, it was 99.57%. Alharthi and Alzhrani [64] carried out a study that is made
up of two parts. The first part involves the transfer learning approach by using a variety
of pre-trained models, including MobileNet, Xception, Inception, InceptionResNet,
DenseNet, and BiT, as well as two vision transformers, ViT and Swin. In the second part,
a number of CNN architectures were trained from scratch and compared with the transfer
learning approach. The transfer learning method beat the CNNs trained from scratch and
achieved stable, high performance on the ArSL2018 dataset [50,51]. A comparably high
performance of 98% was achieved by ResNet and InceptionResNet.
Ismail et al. [71] suggested a different approach by comparing the performance of
single pre-trained models: DenseNet121, VGG16, ResNet50, MobileNetV2, Xception, Effi-
cientB0, NASNetMobile, and InceptionV3, and multi-models: DenseNet121-VGG16
model, ResNet50-MobileNetV2, Xception-EfficientB0, NASNetMobile-InceptionV3,
DenseNet121-MobileNetV2, and DenseNet121-ResNet50. It was found that DenseNet121
is the best CNN model for extracting features and classifying Arabic sign language. For
multi-models, the DenseNet121-VGG16 multi-model CNN shows the highest accuracy.
The study findings reveal that when it comes to ArSL feature extraction and classification,
multi-models outperform single models. Faster R-CNN based on the pre-trained models
VGG16 and ResNet18 was proposed by Alawwad et al. [73]. On a self-collected ArSL image
dataset, coupling the Faster R-CNN architecture with the ResNet18 and VGG16 backbones
obtained 93% accuracy.
A few studies have focused their experiments on specific pre-trained models such as
EfficientNet [66,74]. Islam et al. [66] presented an innovative approach to Arabic SL recog-
nition that builds feature extraction on a modified version of the EfficientNetB3 model.
Using stacked autoencoders, the approach ensures the best possible mapping of input im-
ages through powerful feature selection. This approach shows enhanced performance for
Arabic sign language after a thorough testing process involving several CNN models. Ar-
abic SL gesture detection becomes simpler and more precise with the addition of densely
coupled coding layers, which improves the model’s performance even further.
AlKhuraym et al. [74] proposed utilizing a CNN-based lightweight EfficientNet to recog-
nize Arabic sign language (ArSL). A dataset with hand gestures for thirty distinct Arabic
alphabets was gathered by numerous signers. Then, the classification outcomes obtained
by different versions of lightweight EfficientNet were assessed. With 94% accuracy, the
EfficientNet-Lite 0 architecture showed the best results and demonstrated its effectiveness
against background variations.
By combining several models rather than relying on just one, ensemble methods seek to
increase the accuracy of the resulting predictions. Two recent studies were found to exploit ensemble methods for
alphabet ArSLR [60,62]. Alnuaim et al. [60] proposed a framework that consists of two
CNN models, each trained on the ArSL 2018 dataset [50,51]. The two models, ResNet50
and MobileNetV2, were used in conjunction with each other. After using a variety of pre-
processing methods, several hyperparameters for each model, and data augmentation
strategies, the results reached an accuracy of almost 97% for the entire set of data. The
proposed solution by Nahar et al. [62] involves retraining 12 models, namely VGG16,
VGG19, ResNet50, InceptionV3, Xception, InceptionResNetV2, MobileNet, DenseNet121,
DenseNet169, DenseNet201, NASNetLarge, and NASNetMobile. Once the 12 predictions
are obtained, the majority of the predictions will be used by the classification module to
increase accuracy. Simple majority voting is a collective method that leverages the
majority of the classifiers to determine the prediction, increasing the output’s accuracy.
The findings demonstrate that, with a 93.7% accuracy in Arabic language sign classifica-
tion, the suggested approach outperforms conventional models in terms of speed and ac-
curacy.
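Simple majority voting of the kind used in [62] can be expressed in a few lines; the sketch below assumes each model's class predictions are available as integer arrays, with ties resolved toward the smaller class index.

```python
import numpy as np

def majority_vote(prediction_sets):
    """Combine per-model class predictions by simple majority voting.

    prediction_sets: list of 1-D integer arrays, one per model, all the same length.
    Ties are resolved in favour of the smallest class index (np.bincount/argmax).
    """
    stacked = np.vstack(prediction_sets)                 # (n_models, n_samples)
    n_classes = int(stacked.max()) + 1
    return np.array([np.bincount(stacked[:, i], minlength=n_classes).argmax()
                     for i in range(stacked.shape[1])])

# Illustrative: three models' predictions on five test images.
preds_a = np.array([0, 2, 2, 1, 3])
preds_b = np.array([0, 2, 1, 1, 3])
preds_c = np.array([1, 2, 2, 0, 3])
print(majority_vote([preds_a, preds_b, preds_c]))        # -> [0 2 2 1 3]
```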
RNNs are a type of artificial neural network that is well-suited to capture temporal
dependencies and sequential patterns in data. In contrast to feedforward neural networks,
which process data in a single pass, RNNs handle data throughout many time steps. This
makes RNNs ideal for processing and modeling time series, speech, and text. The most
popular RNN architecture is LSTM. LSTM can efficiently capture long-term dependencies
in sequential data using the memory cell, which is managed by the input, forget, and out-
put gates. These gates determine what data should be input into, taken out of, and output
from the memory cell. Alnahhas et al. [84] proposed an innovative approach that uses
LMC and deep learning to recognize dynamic hand gestures that indicate expression in
Arabic sign language. In order to process dynamic gestures, the sensory data is first rep-
resented as a series of frames, where each frame is made up of values that indicate the
hand posture features in that frame. The LSTM model is then used to process the series of
frames, which can be used to categorize the series into classes that correspond to different
sign language expressions. Using the proposed solution, a system that can identify sign
language expressions that can be executed with one or two hands was developed. Accord-
ing to the experiment’s findings, the greatest accuracy was 89% for gestures made with
one hand and 96% for gestures made with two hands.
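A minimal sketch of the LSTM-over-frames pattern described above: each gesture is a sequence of per-frame hand-posture feature vectors, and the final hidden state is used for classification. The feature dimension, hidden size, and class count are illustrative, not taken from [84].

```python
import torch
import torch.nn as nn

class GestureLSTM(nn.Module):
    """Sequence classifier for dynamic gestures: each input is a series of frames,
    each frame a vector of hand-posture features (e.g., from an LMC). Dimensions
    are illustrative, not those of the cited study."""
    def __init__(self, feat_dim=30, hidden=128, num_classes=20):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, frames):                 # frames: (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(frames)        # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])              # classify from the last hidden state

model = GestureLSTM()
logits = model(torch.randn(8, 40, 30))         # 8 gestures, 40 frames each
```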
Another type of RNN is BiLSTM, which consists of two LSTM networks, one for for-
ward processing of the input sequence and another for backward processing. The final
outcome is then generated by combining the outputs of the two LSTM networks. Some
researchers have adopted BiLSTM in ArSLR research. Aly and Aly [77] proposed a meth-
odology to extract the features using a single-layer CSOM rather than depending on the
transfer learning of pre-trained deep CNNs. After that, deep BiLSTM—which consists of
three BiLSTM layers—was used to recognize the extracted feature vector sequence.
BiLSTM includes one fully connected layer and two softmax layers. The proposed ap-
proach’s effectiveness was assessed using the Shanableh dataset [75], which comprises 23
distinct terms that were recorded by three separate users. In the signer-independent
mode, the evaluation of the proposed solution yielded a high accuracy of 89.5%. Another
study carried out by Shanableh [98] has aimed at employing a camera in user-dependent
mode for continuous Arabic sign language recognition. The proposed solution is a two-
step process wherein the first stage uses deep learning to predict the number of words in
a sentence. Next comes a second stage, where a novel method based on motion images
and biLSTM layers is used to recognize words in a sentence. The experiments were con-
ducted using one LSTM layer, one biLSTM layer, two biLSTM layers, and three biLSTM
layers. According to experimental findings, the suggested method performed exception-
ally well when used on a dataset with 40 sentences [96]. BiLSTM with two layers produced
the best outcomes, with a word recognition rate of 97.3% and a sentence recognition rate
of 92.6%.
Some studies have integrated RNNs with other deep learning architectures, such as
CNNs, to benefit from their individual advantages. These hybrid models aim to gather
both spatial and temporal information from sign language data, seeking enhanced perfor-
mance in ArSLR. The baseline study conducted by Luqman and El-Alfy [88] aimed to as-
sess and compare six models based on cutting-edge deep-learning techniques for the spa-
tial and temporal processing of sign videos for ArSL words. These models are CNN-
LSTM, Inception-LSTM, Xception-LSTM, ResNet50-LSTM, VGG-16-LSTM, and Mo-
bileNet-LSTM. Using both manual and non-manual features, two scenarios are examined
for the signer-dependent and signer-independent modes. Color and depth images were
used directly in the first scenario, whereas in the second scenario, optical flow was utilized
to extract more distinct features from the signs themselves instead of the signers. Using
MobileNet-LSTM yielded the best results, with 99.7% and 72.4% for signer-dependent and
signer-independent modes, respectively. Ismail et al. [92] proposed fusing different mod-
els in order to precisely capture the spatiotemporal change of dynamic word sign lan-
guage movements and to efficiently gather significant shape information. The models in-
clude RNN models, LSTM and gated recurrent unit (GRU) for sequence classification, and
deep neural network models that use 2D and 3D CNNs to cover all feature extraction
approaches. ResNet50-LSTM was the best multi-model using the same fusion technique
among the pre-trained models, including DenseNet121-LSTM, ResNet50-GRU, Mo-
bileNet-LSTM, and VGG16-LSTM. Luqman and El-Alfy [78] suggested utilizing three dis-
tinct models for dynamic sign language recognition based on the combination of two ar-
chitectures exploiting layers from CNN and LSTM. These models are CNN-LSTM, CNN
with two stacked LSTM layers (CNN-SLSTM), and CNN followed by stacked LSTM layers
and a fully connected (FC) layer (CNN-SLSTM-FC). After evaluating the models, the re-
sults show that CNN-SLSTM outperformed other models in terms of accuracy and train-
ing time. Luqman and El-Alfy [90] presented a novel approach that includes three deep
learning models for isolated sign language recognition: the Dynamic Motion Network
(DMN), the Accumulative Motion Network (AMN), and the Sign Recognition Network
(SRN). In DMN, different combinations of LSTM and CNN-based models were used to
train and extract the spatial and temporal information from the key frame of the sign ges-
ture, and MobileNet-LSTM outperformed all other combinations. The sign motion was
encoded into a single image using the Accumulative Video Motion (AVM) technique.
AMN was fed this image as its input. Finally, the SRN stream used the fused features from
the DMN and AMN streams as input for learning and classifying signs. In 2023, Podder
et al. [35] proposed a novel CNN-LSTM-SelfMLP architecture that can recognize Arabic
Sign Language words from recorded RGB videos. This study’s dataset comprises both
manual and non-manual sign gestures [88,89]. Six distinct CNN-LSTM-SelfMLP architec-
ture models were built using three SelfMLPs and MobileNetV2 and ResNet18 CNN-based
backbones, with the purpose of comparing performance in ArSLR. MobileNetV2-LSTM-
SelfMLP obtained the highest accuracy of 87.69% for the signer-independent mode.
Balaha et al. [94] presented their approach for integrating the CNN and RNN models to
recognize isolated Arabic sign language words. Two CNNs were combined, and the out-
put was fed to five cascaded layers of 512 BiLSTM units. To prevent network overfitting,
a dropout layer comes after each of these layers. An FC layer with a SoftMax activation
function that predicts the output comes after these layers. Using a self-acquired sign lan-
guage dataset for 20 words, the proposed architecture demonstrated a testing accuracy of
92%.
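The hybrid CNN-RNN pattern discussed in this subsection can be sketched generically as a frozen per-frame CNN feeding an LSTM; the configuration below (MobileNetV2 backbone, single LSTM layer, 50 classes) is an assumption for illustration and does not reproduce any specific model from the cited studies.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNLSTM(nn.Module):
    """Generic CNN-LSTM video classifier: a frozen MobileNetV2 backbone extracts
    per-frame spatial features and an LSTM models the temporal dependencies.
    Hyperparameters are illustrative, not those of the cited studies."""
    def __init__(self, num_classes=50, hidden=256):
        super().__init__()
        backbone = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
        self.cnn = nn.Sequential(backbone.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        for p in self.cnn.parameters():
            p.requires_grad = False            # keep the spatial extractor frozen
        self.lstm = nn.LSTM(1280, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                  # clips: (batch, time, 3, 224, 224)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.reshape(b * t, *clips.shape[2:]))   # (b*t, 1280)
        _, (h_n, _) = self.lstm(feats.reshape(b, t, -1))
        return self.head(h_n[-1])

model = CNNLSTM()
logits = model(torch.randn(2, 16, 3, 224, 224))   # 2 clips of 16 frames
```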
3.2.5. RQ2.5: Which Evaluation Metrics Were Used to Measure the Performance of ArSLR
Algorithms?
Evaluation metrics offer a quantified representation of the performance of the trained
model or algorithm. By employing metrics, researchers can evaluate various models and
choose which is most effective for their requirements. The evaluation metrics selected are
determined by the particular problem domain, the type of data, and the intended out-
come. In this section, the most and least used evaluation metrics are presented in all cate-
gories of the reviewed ArSLR studies. As illustrated in Tables 14–17, the most common
fundamental metric to evaluate the effectiveness of the proposed ArSL recognition is ac-
curacy, which determines the model’s capability to differentiate ArSL signs correctly and
presents the ratio of correctly classified samples across the whole dataset. In continuous
ArSLR, researchers report the accuracy in terms of word recognition rate and sentence
recognition rate [95,96,98]. The ratio of correctly identified sentences to the total number
of sentences in the collection of test sentences is referred to as the sentence recognition
rate. When all the words that form a sentence are correctly identified in their original or-
der, the sentence is correctly classified.
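A small sketch of how these two rates can be computed; the word recognition rate here uses a simplified position-wise match that assumes equal-length reference and hypothesis sentences, whereas the reviewed studies may use alignment-based definitions.

```python
def word_recognition_rate(ref_sentences, hyp_sentences):
    """Fraction of words recognized correctly at their original positions
    (simplified position-wise definition; alignment-based variants also exist)."""
    correct = total = 0
    for ref, hyp in zip(ref_sentences, hyp_sentences):
        total += len(ref)
        correct += sum(r == h for r, h in zip(ref, hyp))
    return correct / total

def sentence_recognition_rate(ref_sentences, hyp_sentences):
    """A sentence counts as correct only if every word matches in its original order."""
    correct = sum(ref == hyp for ref, hyp in zip(ref_sentences, hyp_sentences))
    return correct / len(ref_sentences)

# Toy example with made-up gloss sentences.
refs = [["I", "go", "school"], ["he", "like", "tea"]]
hyps = [["I", "go", "school"], ["he", "like", "coffee"]]
print(word_recognition_rate(refs, hyps))        # 5/6 ~= 0.83
print(sentence_recognition_rate(refs, hyps))    # 1/2 = 0.5
```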
Precision, recall, and F1 scores were employed by around half of the fingerspelling
ArSLR, as shown in Table 14. The precision of a model is defined as the proportion of true
positive predictions among all positive predictions produced by the model. It shows how
well the model recognizes positive samples correctly. The proportion of true positive pre-
dictions among all actual positive samples is called recall, which is often referred to as
sensitivity or true positive rate. It demonstrates how accurately the model represents the
positive samples. Precision and recall’s harmonic mean define the F1 score. By taking pre-
cision and recall into account, it offers a balanced measure of the model’s performance.
The F1 score has a range of 0 to 1, where 0 denotes low performance, and 1 denotes ex-
ceptional precision and recall. In contrast, only one study for miscellaneous ArSL
recognition and a few studies for isolated word recognition utilized precision, recall, and
the F1 score, as shown in Tables 15 and 17.
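These metrics are typically computed with standard tooling; the sketch below uses scikit-learn on toy labels and macro averaging, one common choice for multi-class ArSLR, though the reviewed papers may average per class or weight by support.

```python
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             confusion_matrix)

# Illustrative ground-truth and predicted labels for a three-class sign problem.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision, "recall:", recall, "F1:", f1)
print(confusion_matrix(y_true, y_pred))   # rows: actual classes, columns: predicted
```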
More insight into the model’s performance across several classes is provided by the
confusion matrix, which visualizes the results from classifier algorithms. A confusion ma-
trix is a table that compares the number of actual ground truth values of a given class to
the number of predicted class values. Less than half of the studies in different categories
of ArSLR—except continuous ArSLR—benefited from using this visual representation.
The loss metric is a measure of the model’s performance in terms of its capability to
provide accurate predictions. It shows how the actual outcome, or target value, differs
from the model’s predicted outcome. The loss metric is widely used to express the cost or
error incurred in making the model’s predictions. The aim is to lower this error by modi-
fying the model’s parameters throughout training. Around one-third of the fingerspelling
recognition studies and one-fourth of the isolated word recognition studies used loss as a
measure of the model’s effectiveness. Among these studies, the commonly used classifi-
cation loss was categorical cross-entropy [52,55,58,63,69], whereas LogLoss was used by
only one study [85].
Training time was a concern in a few ArSLR studies, with three fingerspelling recog-
nition studies [46,47,55] and one isolated recognition [78]. Along with the training time,
testing time was considered in two continuous recognition studies [96,98] and one finger-
spelling recognition study [41]. In only one study [97], the response time was measured
by calculating the time required to capture and classify the sign in real time for continuous
sentence recognition.
Receiver Operating Characteristic (ROC) curve analysis is a chart that presents a false
positive rate (1-specificity) on the X-axis against a true positive rate (sensitivity) on the Y-
axis. This metric was utilized by three isolated recognition studies [35,91,93]. The specific-
ity metric is the capacity of the algorithm or model to predict true negatives for each class.
A few researchers considered this metric for either fingerspelling recognition [44,52] or
isolated recognition [35,95]. The Area Under the Curve (AUC) is a metric obtained by
calculating the area under the ROC curve; a useful classifier typically yields an AUC between
0.5 (random guessing) and 1 (perfect separation). It was used by two fingerspelling recognition studies [48,64] and one study for isolated
recognition [85]. The Kappa statistic was utilized by two fingerspelling recognition studies
[52,72] to assess the degree of agreement between two sets of multiclass labels. Various
evaluation metrics were rarely utilized in the reviewed papers, including Matthews’s Cor-
relation Coefficient (MCC) [52], Root Mean Squared Error (RMSE) [72] and top-5 accuracy
[63] for fingerspelling recognition and overlap ratio [76], Jaccard index [91], standard error
[35], confidence interval [35], top-1 accuracy [94], G-measure [93], and precision-recall
curve analysis [93] for isolated recognition.
3.2.6. RQ2.6: What Are the Performance Results in Terms of Recognition Accuracy?
The majority of research papers emphasize accurately recognizing sign language
content, and the main metrics employed in these studies aim to gauge this capacity. Al-
most all reviewed research papers contain a quantitative assessment of the proposed sign
recognition method. The scope and intricacy of testing vary widely, and specific tests are
established based on the goals of the research study. Generally speaking, the tests were
created to gauge the algorithm’s capability to recognize sign language sentences, words,
or alphabets frequently by comparing it with a number of benchmarking techniques.
Comparing the performance of various ML/DL models in ArSLR research can be challeng-
ing due to the diverse nature of the tests, differences in datasets, evaluation metrics, and
experimental setups. Overall, many approaches did rather well and identified over 90%
of the signs that were presented, as shown in Figure 24 and Tables 14–17. Although this
typically involved less difficult tasks and was often unsustainable across several datasets,
there have been instances where the stated effectiveness was above 97%. One of the most
crucial aspects of ArSLR research that can positively affect the proposed solutions’ effec-
tiveness is the optimization of training parameters. The accuracy of 55.57% for a signer-
independent isolated recognition [82] was the lowest recognition accuracy achieved
among all the reviewed ArSLR studies. Recognition rates of more than 80% are regarded
as very strong for continuous ArSLR applications, especially when they are maintained
across different datasets; however, the reviewed continuous recognition papers obtained
superior results, with all of them above 90%.
One of the important factors that affect the performance results of the proposed
ArSLR models is sign dependency. Signer-independent recognition systems are usually
tested on different signers than those used for system training; augmenting the signer
population benefits these systems. The signer-independent option is more challenging to
use than the signer-dependent one, as shown by Tables 14–17, where, for the identical
experimental setting, the performance accuracy of the signer-independent case is consist-
ently lower than that of the signer-dependent case. The reason for the significant decline
in recognition accuracy in the signer-independent mode may be traced back to the models
that began to overfit the signers during the system learning phase. This has been seen
clearly in studies that apply both signer-dependent and signer-independent modes
[39,88,90]. The exception was when the input acquisition occurred through wearable sen-
sor devices like gloves [70], where there was no significant drop in the user-independent
case. Despite the discomfort and impractical need to wear the glove sensors during the
signing, ArSLR systems that utilize this approach achieve a high-performance accuracy of
96% and above [70,95,96].
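One common way to implement the signer-independent protocol described above is grouped cross-validation keyed on signer identity, so that test signers never appear in training; the scikit-learn sketch below uses toy data and hypothetical signer IDs.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy setup: 12 samples performed by 4 signers (group = signer ID).
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
signer_id = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

# Each fold keeps all samples of the held-out signer(s) out of the training set,
# so the model is always evaluated on signers it has never seen.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=signer_id):
    print("train signers:", set(signer_id[train_idx]),
          "test signers:", set(signer_id[test_idx]))
```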
Year | Ref. | Preprocessing Methods | Segmentation Methods | Feature Extraction Methods | ArSLR Algorithm | Evaluation Metrics | Performance Results | Testing and Training Methodology | Signing Mode
2015 | [42] | Transform to YCbCr; image normalization | Skin detection, background removal techniques | Feature extraction through observation detection and then creation of the observation vector | HMM; the algorithm divides the rectangle surrounding the hand shape into zones | Accuracy | 100% recognition accuracy for 16 zones | Training: 70.87% (253 images), Testing: 29.13% (104 images) | Signer-dependent
2015 | [43] | - | - | SIFT to extract the features and LDA to reduce dimensions | SVM, KNN, nearest-neighbor (minimum distance) | Accuracy | SVM shows the best accuracy, around 98.9% | Different partitioning methods were experimented with | Signer-dependent
2017 | [48] | Remove observations with rows that had the same values and rows that had multiple missing or null values | - | Two feature types: Type 1: three angles for each hand bone (angles between the bone and the three axes of the coordinate system); Type 2: one angle between each two bones | SVM, KNN, and RF | Accuracy, AUC | SVM produced a higher overall accuracy = 96.119% | Training: 75% (1047 observations), Testing: 25% (351 observations) | Signer-dependent
2018 | [49] | Transform the color images into gray-level images | Skin detection to extract the hand region from the background | HOG, EHD, GLCM, DWT, and LBP | One-versus-all SVM | Precision, recall, accuracy | Best accuracy was 63.5% for one-versus-all SVM using HOG | Not mentioned | Signer-dependent
2019 | [52] | Resize images to fit each pre-trained CNN's image input layer; image data augmentation | - | AlexNet, SqueezeNet, VGGNet16, VGGNet19, GoogleNet, DenseNet, MobileNet, ResNet18, ResNet50, ResNet101, InceptionV3 | AlexNet, SqueezeNet, VGGNet16, VGGNet19, GoogleNet, DenseNet, MobileNet, ResNet18, ResNet50, ResNet101, InceptionV3 | Accuracy, error rate, sensitivity, specificity, precision, F1 score, MCC, Kappa, confusion matrix | ResNet18 achieved the highest accuracy with 99.52% | Training: 90% (48,644 images), Testing: 10% (5405 images) | Signer-dependent
2020 | [53] | Remove noise; grayscale conversion; resize each image to 64 × 64 pixels; normalization | - | CNN | CNN | Accuracy, loss | Best accuracy obtained was 92.9% | Training: 80% (43,239 images), Testing: 20% (10,810 images) | Signer-dependent
2020 | [68] | Resize images to 128 × 128 pixels; conversion of images into a neutrosophic image; Gaussian filter to deal with the noise | Global thresholding technique | Feature extraction for neutrosophic images by GLCM, based on pixels and their neighbors | Clustering by the fuzzy c-means algorithm | Accuracy | Best accuracy obtained was 91% | Training: 80% (20,480 images), Testing: 20% (5120 images) | Signer-dependent
2020 | [54] | Random undersampling to reduce the dataset imbalance; data augmentation | - | Fine-tuned VGG16 and ResNet152 | Fine-tuned VGG16 and ResNet152 | Accuracy | Best accuracy for VGG16 was 99.26% and for ResNet152 was 99.57% | After resampling: Training: 80% (20,480 images), Testing: 20% (5120 images) | Signer-dependent
2020 | [69] | Resize images to 128 × 128 RGB images | - | CNN | CNN | Accuracy, loss (categorical cross-entropy) | Best accuracy obtained was 90% | Training: 80% (3100 images) | Signer-dependent
… | [71] | Augmentation of data for the dynamic signs (salt-and-pepper noise; blurring images with Gaussian, median, and averaging filters; morphological erosion and dilation of the dataset) | - | Multi-models: DenseNet121 and VGG16, ResNet50 and MobileNetV2, Xception and EfficientB0, NASNetMobile and InceptionV3, DenseNet121 and MobileNetV2, and DenseNet121 and ResNet50 | Multi-models: DenseNet121 and VGG16, ResNet50 and MobileNetV2, Xception and EfficientB0, NASNetMobile and InceptionV3, DenseNet121 and MobileNetV2, and DenseNet121 and ResNet50 | … | The DenseNet121-VGG16 multi-model CNN is the best, with accuracy = 100% | … Testing: 10% (22,000 images); ASL standard dataset: Training: 80% (69,600 images), Validation: 10% (8700 images), Testing: 10% (8700 images) | …
2021 | [73] | Resize images to 224 × 224 | Enhancing ROI pooling layer performance in VGG16-based Faster R-CNN and ResNet18-based Faster R-CNN | VGG16 based on Faster R-CNN; ResNet18 based on Faster R-CNN | VGG16-Faster Region-based CNN (R-CNN); ResNet18-Faster Region-based CNN (R-CNN) | Accuracy, precision, recall, F1-measure, confusion matrix | ResNet18-Faster Region-based CNN (R-CNN) obtained higher accuracy with 93.4% | Training: 60% (12,240 images), Validation: 20% (3060 images), Testing: 20% (3060 images) | Signer-dependent
2021 | [57] | All images are normalized and then standardized | - | ArSL-CNN, ArSL-CNN + SMOTE, ArSL-CNN + RMU, and ArSL-CNN + RMO | ArSL-CNN, ArSL-CNN + SMOTE, ArSL-CNN + RMU, and ArSL-CNN + RMO | Accuracy, confusion matrix | ArSL-CNN + SMOTE obtained higher accuracy with 97.29% | Training: 60%, Validation: 20%, Testing: 20% | Signer-dependent
2021 | [56] | Data augmentation; normalization | - | CNN-2 (two hidden layers) and CNN-3 (three hidden layers) | CNN-2 (two hidden layers) and CNN-3 (three hidden layers) | Accuracy, precision, recall, F1-measure, confusion matrix | CNN-2 produced the best results (accuracy of 96.4%) for the ArSL dataset; CNN-3 achieved an accuracy of … | ArSL2018: Training: 49% (20,227 images), Validation: 21% (8669 images), Testing: 30% (12,480 images) | Signer-dependent
2023 | [46] | Data augmentation; data normalization; convert images to 64 × 64 grayscale images; standardize images | Hand gesture detection using the QSLRS-CNN model | QSLRS-CNN model | 100 epochs: QSLRS-CNN, QSLRS-CNN with RMU, QSLRS-CNN with RMO, QSLRS-CNN with SMOTE | Accuracy, recall, precision, F-score, confusion matrix, training time | Best result achieved by QSLRS-CNN with SMOTE (97.67%) | ArSL2018: Training: 80%, Testing: 20%; ArSL dataset: Testing: 100% | Signer-dependent
2023 | [64] | - | - | VGG, ResNet, MobileNet, Xception, Inception, DenseNet, InceptionResNet, BiT, and vision transformers (ViT and Swin); CNNs | Transfer learning with pre-trained models (VGG, ResNet, MobileNet, Xception, Inception, DenseNet, InceptionResNet, BiT, and the vision transformers ViT and Swin); deep learning using CNNs | Accuracy, AUC, precision, recall, F1-score, loss | ResNet and InceptionResNet obtained a comparably high accuracy of 98% | Not mentioned | Signer-dependent
2023 | [65] | Resize images to 64 × 64 pixels; image normalization; data augmentation | - | Six pre-trained fine-tuned models: MobileNetV2, VGG16, InceptionV3, Xception, ResNet50V2, ResNet152 | Six distinct pre-trained fine-tuned models: MobileNetV2, VGG16, InceptionV3, Xception, ResNet50V2, ResNet152 | Accuracy, loss | InceptionV3 and ResNet50V2 achieved 100% accuracy | Training: 70%, Testing: 30% | Signer-dependent
2023 | [66] | Resize images to 64 × 64; rescale images to 32 × 32; data augmentation | - | CNN-based EfficientNetB3 with encoder and decoder network | CNN-based EfficientNetB3 with encoder and decoder network | Accuracy, recall, precision, F1-score, confusion matrix, loss (cross-entropy) | Accuracy with encoder and decoder = 99.26% (best accuracy) | Training: 70%, Validation: 20%, Testing: 10% | Signer-dependent
2023 | [41] | Images are converted to 128 × 128 grayscale images | - | Individual descriptors: DWT, DT-CWT, HOG; combined descriptors: DWT + HOG, DT-CWT + HOG | Three variants of the ANN (PNN, RBNN, and MLP), SVM, and RF | Accuracy, processing time | Best accuracy was for DT-CWT + HOG + SVM (one-against-all) with 94.89% | Training: 50%, Testing: 50% | Signer-dependent
2023 | [47] | Convert images to 64 × 64 grayscale images; standardize images using 0–1 pixel values | Hand gesture detection using the QSLRS-CNN model | QSLRS-CNN model | 100 epochs: QSLRS-CNN, QSLRS-CNN-RMU, QSLRS-CNN-RMO, QSLRS-CNN-SMOTE; 200 epochs: QSLRS-CNN, QSLRS-CNN-RMU, QSLRS-CNN-RMO, QSLRS-CNN-SMOTE | Accuracy, precision, recall, F-score, confusion matrix, training time | 100 epochs: QSLRS-CNN-SMOTE: 97.67%; 200 epochs: QSLRS-CNN-SMOTE: 97.79% | ArSL2018: Training: 80%, Testing: 20%; ArSL dataset: Testing: 100% | Signer-dependent
2023 | [62] | Convert images into HSV color space | Hand edge detection to recognize hand shapes based on detecting human skin colors and mathematical morphology techniques | VGG16, VGG19, ResNet50, InceptionV3, Xception, InceptionResNetV2, MobileNet, DenseNet121, DenseNet169, DenseNet201, NASNetLarge, NASNetMobile | Transfer learning based on majority voting over the 12 models' predictions: VGG16, VGG19, ResNet50, InceptionV3, Xception, InceptionResNetV2, MobileNet, DenseNet121, DenseNet169, DenseNet201, NASNetLarge, NASNetMobile | Accuracy, recall, precision, F-score | Best accuracy was for the transfer-learning CNN with majority voting = 93.7% | ArSL2018: Training: 90%, Validation: 10%; ASL-Digits dataset: Training: 90%, Validation: 10% | Signer-dependent
2023 | [63] | Resize images to 64 × 64; data augmentation | - | VGG16 and VGG19 with fine-tuning of block4 and block5 | VGG16 and VGG19 with fine-tuning of block4 and block5 | Accuracy, top-5 accuracy, loss | VGG16 with fine-tuned block4 obtained the best accuracy = 96.51% | ArSL2018: Training: 70%, Testing: 30%; self-built dataset: Testing: 100%; Ibn Zohr University dataset: Testing: 100% | Both
Testing and
Preprocessing Segmentation Feature Extraction Evaluation Met- Performance Re-
Year Ref. ArSLR Algorithm Training Meth- Signing Mode
Methods Methods Methods rics sults
odology
Face and Hand Isola-
Training: 81.13%
Transform to tion and feature ex-
Skin detection, (1045 videos),
YCbCr. traction through ob- Best accuracy = Signer-depend-
2015 [79] background re- EMCC and HMM. Accuracy and Testing:
Image normaliza- servation detection 98.8% ent
moval techniques. 18.87% (243 vid-
tion. and calculation using
eos).
proposed EMCC.
Leap motion se-
quences for hand ANN with MLP for hand Facial expressions:
PCNN signature to
signs. ANN is used signs recognition. ANN 90%, body move-
decrease the random Signer-depend-
2015 [80] - with PCA to extract with PCA and MLP for Accuracy ment: 86%, hand -
noise and add image ent
features from facial facial expressions and sign: 90%, integrated
signature.
expressions and body movement. sign testing: 95%
body movement.
SVM with default param-
Two histograms to eters and linear kernel
Remove all the null show two types of (SVMLD), SVM with
SVMLD obtained Training: 75%
values and any fea- features: tuned parameters and lin- Signer-depend-
2017 [81] - Accuracy best accuracy with (109), Testing:
tures with zero vari- Type1: contains 3-an- ear kernel (SVMLT), SVM ent
97.059%. 25% (34)
ance. gles for each hand with default parameters
joint. and radial kernel
(SVMRD), and SVM with
Sensors 2024, 24, 7798 60 of 91
2019 [76]: Preprocessing: … (SBD) algorithms; video key frames: RoI. Segmentation: segmenting videos to key frames (VidSeg), and KeyFeat algorithm to detect hands (optical flow and thresholding). Feature extraction: MFT, LBP, HOGs, and HOG-HOF; compare different techniques on Database-01. Algorithm: HMM using the GRT toolkit; k-means clustering algorithm to quantize the feature vectors. Metrics: accuracy and overlap ratio. Results: the best accuracy is achieved by HMM with MFT and HOG features, with 99.11% and 99.33%. Training/testing: Database-01: not stated; Database-02: testing 100%. Signing mode: signer-independent.

2020 [84]: Preprocessing: -. Segmentation: -. Feature extraction: feature vectors extracted by LMC. Algorithm: LSTM. Metrics: accuracy, precision. Results: one hand: 89%, two hands: 96%. Training/testing: training 80% (352: 232 for one hand and 120 for two hands), testing 20% (88: 58 for one hand, 30 for two hands). Signing mode: signer-dependent.

2020 [85]: Preprocessing: data interpolation to equalize the number of captured frames for the same gesture word. Segmentation: -. Feature extraction: data interpolation in feature extraction (bone directions and joint angles were the main features). Algorithm: different settings of 3 algorithms (RF, SVM, KNN): KNND, KNNT, RFD, RFDS, RFT, RFTS, SVMLD, SVMLT, SVMRD, SVMRT, SVMRTS. Metrics: LogLoss, AUC, accuracy. Results: the best accuracy was achieved by SVMRTS with 83%. Training/testing: not mentioned. Signing mode: signer-dependent.

2020 [77]: Preprocessing: -. Segmentation: DeepLabv3. Feature extraction: hand shape features: CSOM; the sequence of feature vectors: BiLSTM. Algorithm: BiLSTM. Metrics: accuracy, confusion matrix. Results: accuracy: 89.5%. Training/testing: training 70%, testing 30%. Signing mode: signer-independent.

2021 [70]: Preprocessing: low-pass filter to eliminate dynamic noise, … Segmentation: an adaptive segmentation method to measure the energy of the signal and then … Feature extraction: 8 TD features, and 9 features obtained from autocorrelation and FD, normalized arrays for … Algorithm: SVM, feature-based fusion; SVM and NB. Metrics: accuracy. Results: user-dependent: 98.6%, user-independent: 96%. Training/testing: training 75% (3675), testing 25% (1225). Signing mode: both.
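As an illustration of the sensor-based rows above, such as the LSTM trained on Leap Motion feature vectors, the following minimal Keras sketch classifies fixed-length sequences of LMC hand features into isolated word signs; the sequence length, feature dimension, and class count are assumptions, not values from the reviewed study.

```python
# Minimal sketch (assumptions, not the reviewed implementation) of an LSTM classifier
# over fixed-length sequences of Leap Motion Controller (LMC) hand-feature vectors.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, N_FEATURES, N_CLASSES = 60, 63, 20   # e.g., 60 frames of 21 joints x (x, y, z)

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, N_FEATURES)),
    layers.Masking(mask_value=0.0),            # tolerate zero-padded shorter gestures
    layers.LSTM(128),
    layers.Dropout(0.3),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Placeholder arrays standing in for real LMC sequences and integer word labels.
x = np.zeros((8, SEQ_LEN, N_FEATURES), dtype="float32")
y = np.zeros(8, dtype="int32")
model.fit(x, y, epochs=1, verbose=0)
```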
(Columns: Year, Ref., Preprocessing Methods, Segmentation Methods, Feature Extraction Methods, ArSLR Algorithm, Evaluation Metrics, Performance Results, Testing and Training Methodology, Signing Mode.)

2015 [95]: Preprocessing: resampling techniques to reduce data size; normalize and standardize readings (Z-score). Segmentation: manual video segmentation and labelling. Feature extraction: window-based approach. Algorithm: MKNN. Metrics: word recognition rate, sentence recognition rate. Results: sentence recognition rate = 98.9%. Training/testing: training 70% (280), testing 30% (120). Signing mode: signer-dependent.

2018 [97]: Preprocessing: data normalization (position and user size). Segmentation: automatic segmentation to separate between consequent signs. Feature extraction: automated feature selection for the joints (hands, shoulder, elbow, wrist, spine mid, and head center). Algorithm: (KNN, SVM, ANN) with and without majority voting, and DTW. Metrics: accuracy, response time. Results: the best accuracy is 89% for the KNN classifier with majority voting, and the segmentation accuracy reached 91%. Training/testing: training 66.67% (840 samples), testing 33.33% (420 samples). Signing mode: signer-independent.

2019 [96]: Preprocessing: -. Segmentation: manual labeling and segmentation in vision-based SLR; in sensor-based SLR, synchronize the camera with glove and tracker recordings to detect boundaries. Feature extraction: vision-based dataset: 2D DCT, zonal coding; sensor-based datasets: sliding window-based statistical feature extraction techniques. Algorithm: MKNN and HMM. Metrics: word recognition rate, sentence recognition rate, computation time (train time and classification time). Results: sentence recognition rates: MKNN achieved the best results for all datasets (97.78% for the gloves dataset); word recognition: HMM was the best with 99.20% for the gloves dataset. Training/testing: DB1 (gloves): training 70% (280), testing 30% (120); DB2 (tracker): signer-dependent: training 70%, testing 30%; signer-independent: training and testing 50%; DB3 (vision-based): not stated. Signing mode: signer-dependent for all datasets except the vision-based dataset; dataset 2: signer-dependent and signer-independent.

2023 [98]: Preprocessing: divide videos into motion images equal to the number of sentence words. Segmentation: -. Feature extraction: pretrained CNN Inception-v3 network. Algorithm: LSTM, biLSTM, biLSTM×2, biLSTM×3. Metrics: word recognition rate, sentence recognition rate, training time and testing time. Results: biLSTM×2 achieved the best results, with word recognition rate = 97.3 and sentence recognition rate = 92.6. Training/testing: -. Signing mode: signer-dependent.
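The last row above describes a pipeline in which a pretrained Inception-v3 extracts one feature vector per motion image and a stacked BiLSTM classifies the resulting sequence. The sketch below is a hedged reconstruction of that idea in Keras; the number of motion images, the number of sentence classes, and the layer sizes are illustrative assumptions.

```python
# Hedged reconstruction of the motion-image pipeline: a frozen Inception-v3 extracts one
# 2048-d feature vector per motion image, and two stacked BiLSTM layers classify the
# sequence into sentence classes. Counts and layer sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

N_IMAGES, N_SENTENCES = 10, 40     # motion images per video, sentence classes (assumed)

cnn = InceptionV3(weights="imagenet", include_top=False, pooling="avg",
                  input_shape=(299, 299, 3))
cnn.trainable = False              # used purely as a feature extractor

video_in = layers.Input(shape=(N_IMAGES, 299, 299, 3))
feats = layers.TimeDistributed(cnn)(video_in)                  # (batch, N_IMAGES, 2048)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(feats)
x = layers.Bidirectional(layers.LSTM(256))(x)                  # the "biLSTM x 2" variant
out = layers.Dense(N_SENTENCES, activation="softmax")(x)

model = models.Model(video_in, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```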
3.3.1. RQ3.1: Has the Number of Research Papers Regarding ArSLR Been Increasing in
the Past Decade?
Figure 6 illustrates the distribution of the 56 papers gathered by the selection process
described above, according to the years of publication. This pool of papers shows an overall pattern of increasing publications over the last ten years, indicating growing research activity in the field.
3.3.2. RQ3.2: What Are the Limitations and/or Challenges Faced by Researchers in the
Field of ArSLR?
The majority of the assessed ArSLR publications omitted information about the con-
straints or the difficulties they faced in conducting their studies. Approximately 26.79% of
the ArSLR studies acknowledged the challenges and limitations that they encountered
during their research process. Different aspects were discussed as limitations, including
dataset, signers, model performance, training time, and suitability for real-world applica-
tions. With regard to the dataset, the small size was considered a limitation by Latif et al.
[55] for alphabet recognition and Almasre and Al-Nuaim [81] for isolated word recogni-
tion. Abdelghfar et al. [46,47] utilized a limited set of images of static gestures that show
the discontinuous letters at the beginning of the Qur’anic Surahs. Alharthi and Alzahrani
[64] pointed out that the dataset was not representative enough for alphabet recognition.
Fadhil et al. [63] used a dataset of similar fingerspelling sign language images. Extra data
and low-quality blurred frames in the captured signs stored in the dataset were reported
by Bencherif et al. [39]. The enormous number of produced frames in the dataset, partic-
ularly when sign gestures are captured at high frame rates, was a challenge pointed out
by Luqman [90]. Limitations in the process of video capturing were discussed by Bench-
erif et al. [39], including relying on the original suboptimal factory parameters for the two
Kinect cameras, capturing at a low frame rate, and including a large field of view of each
camera. Signers were another aspect mentioned in a number of studies; for example, Latif
et al. [55] reported the limitation in the number of signers who volunteered to perform
alphabet signs. Luqman [90] stated that word signs in the dataset were performed by non-
expert signers who incorporated non-sign language gestures into the sign language.
Moreover, the various signers who perform the same signs show differences in their ges-
tures. A few signers were recruited to perform the word signs stored in the dataset used
by Podder et al. [35]. Bencherif et al. [39] reported that some signers had trouble coordi-
nating their hands and making the same gestures. A few researchers highlighted the lim-
itation in real-time recognition of alphabet signs [41,55,64] and continuous sentences [98]
for real-world applications due to the required computation time. Agab and Chelali [41]
pointed out that accurate segmentation is necessary when using their model in practical,
real-life applications. Model training time was another limitation that impacted finger-
spelling sign recognition, as mentioned by Shahin and Almotairi [52] and Fadhil et al. [63].
The limitation in system hardware, including processing and memory requirements for
alphabet sign language recognition, was discussed by Latif et al. [55]. Some researchers
admit the limitations in their proposed models; for example, the model proposed by Al-
saadi et al. [58] is limited to detecting only one object (a hand) without taking the back-
ground into consideration, which would affect the performance. The detection process in
their proposed model is highly sensitive to variations in the hand’s pose. In addition to
the limitation in model robustness to illumination reported by Agab and Chelali [41], the
HOG descriptor efficiently captures the hand structure only if there is no background
clutter. The semantic-oriented post-processing module suggested by Badawy et al. [80] to
detect and correct any translation errors would perform well in a particular predefined
field.
3.3.3. RQ3.3: What Are the Future Directions for ArSLR Research?
Many of the reviewed publications (78.57%) presented future work for their proposed solutions. This percentage is distributed among the different cat-
egories of ArSLR studies, as illustrated in Figure 25. Different dimensions for future work
have been discussed, including datasets, data acquisition devices, data preprocessing and
segmentation, feature extraction, recognition models, expanding to other ArSLR catego-
ries, and developing practical real-time systems.
Figure 25. Percentages of papers that discuss future work in each category.
making processes of the transfer learning model’s layers, such as ResNet. It was also sug-
gested that future research would extend the current proposed approaches to be capable
of recognizing words and sentences [48,55,64,71,74].
Other studies pointed out the need to enhance the proposed systems for real-time
acquisition and recognition [41,43,68]. One potential avenue for future improvement is the
development of real-time mobile applications for ArSLR [55,56,58,71]. Tharwat et al. [74]
recommended creating educational materials for deaf and dumb children, while Shahin
and Almotairi [52] proposed creating an entirely automated ArSLR system. Using deep
learning models, Abdelghfar et al. [46,47] and Tharwat et al. [72] suggested translating the
meanings of the Holy Qur’an into sign language.
In the category of isolated word recognition, some researchers suggested expanding
the dataset by increasing the number of signs [33,70,91,94], observations [85], and signers
[94]. Exploring how the proposed methods can be applied to different datasets was in the
plans for a number of studies [78,88,91,94]. Podder et al. [35] pointed out that the work
would be improved by building a larger sign dataset of alphabets, numbers, words, and
sentences from various signers, as well as variations in background, lighting, and camera
angles. With regard to data acquisition devices, Almasre and Al-Nuaim [81] emphasized
the need for employing faster Kinect and LMC to collect and recognize gestures more
accurately and instantly. Deriche et al. [83] suggested examining various scenarios in
which LMCs could be combined with other sensors to improve the overall performance.
Glove design was a concern for Qaroush et al. [70], where Bluetooth IMU sensors can be
utilized, together with 3D printed rings that allow the sensors to be positioned on the
fingers.
In the preprocessing stage, Qaroush et al. [70] suggested employing more sophisti-
cated methods such as sensor fusion (e.g., Kalman filter). In terms of feature extraction,
Ahmed et al. [33] planned to increase the number of non-manual features for full ArSL
recognition, whereas Hisham and Hamouda [37] suggested incorporating more feature
engineering. Qaroush et al. [70] mentioned merging magnetometer data with ACC and
GYRO features to detect magnetic north.
Many of the studies have paid attention to recognition models and improving their
accuracy. Some of the researchers suggested incorporating efficient techniques to decrease
the processing complexity and increase accuracy rates, including deep learning algo-
rithms [78,83,85,88,93,94]. Qaroush et al. [70] suggested utilizing sequence-based classifi-
cation techniques such as HMM and RNN. To improve the SL recognition performance, a
combination of DL models can be developed [33,91]. Luqman [90] mentioned that trans-
formers and the attention mechanism are two other models that can be utilized to recog-
nize sign language. Podder et al. [35] suggested creating a sign language transformer that
uses MediaPipe Holistic’s landmark data rather than videos as input, incorporating more
cutting-edge 1D and 2D CNNs into a real-time Arabic Sign Language Transformer and
including the attention mechanism in the model for Arabic sign video classification so that
the model can be trained on bigger datasets and be capable of recognizing sign videos in
real-life scenarios.
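A minimal sketch of that direction, landmark sequences from MediaPipe Holistic fed into a small attention-based classifier, is given below; the landmark selection, sequence length, and layer sizes are assumptions for illustration and do not reproduce any published Arabic Sign Language Transformer.

```python
# Hypothetical sketch of the suggested direction: per-frame MediaPipe Holistic landmarks
# (pose + both hands) flattened into vectors and classified with a small self-attention
# encoder. Landmark selection, sequence length, and layer sizes are assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def frame_landmarks(results):
    """Flatten pose and both hands into one vector; zeros when a part is not detected."""
    parts = [(results.pose_landmarks, 33),
             (results.left_hand_landmarks, 21),
             (results.right_hand_landmarks, 21)]
    vec = []
    for lm, n in parts:
        if lm is None:
            vec.extend([0.0] * (n * 3))
        else:
            vec.extend(c for p in lm.landmark for c in (p.x, p.y, p.z))
    return np.array(vec, dtype="float32")      # 75 landmarks x 3 = 225 values

# Per-frame usage with MediaPipe (assumes `pip install mediapipe`):
# import mediapipe as mp
# with mp.solutions.holistic.Holistic() as holistic:
#     vec = frame_landmarks(holistic.process(rgb_frame))

SEQ_LEN, D_IN, N_CLASSES = 48, 225, 50
inp = layers.Input(shape=(SEQ_LEN, D_IN))
x = layers.Dense(128)(inp)                                         # project landmarks
x = layers.LayerNormalization()(
    x + layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x))  # self-attention block
x = layers.GlobalAveragePooling1D()(x)
out = layers.Dense(N_CLASSES, activation="softmax")(x)
model = models.Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```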
A number of studies showed that further work is needed to extend the proposed
approaches to incorporate continuous sign language recognition [33,78,83,88,94] and to
provide a mechanism that can transform sign language movements into complete sen-
tences while recognizing overlapped gestures [84]. Almasre and Al-Nuaim [81] high-
lighted their intention to make use of depth sensors and supervised machine learning to
identify ArSL phrases while considering the LMC’s constrained workspace and the user’s
motion. Marzouk et al. [91] underlined the importance of expanding the proposed model
to recognize sign boards in real-time applications. Developing an upgrade plan for the
system to enable mobile platform deployment was pointed out by Deriche et al. [83].
Badawy et al. [80] stated that the accuracy of the translation can be improved by including
a semantic-oriented post-processing module to identify and fix any translation errors.
In the category of continuous sentence recognition, Hisham and Hamouda [97] rec-
ommended conducting further testing of more words from various domains. They also
suggested improving the segmentation method and utilizing alternative techniques in the
recognition phase, such as DTW, for the purpose of increasing accuracy and improving
the proposed system. Shanableh [98] emphasized the importance of making the proposed
solution appropriate for real-time sign language recognition and minimizing the compu-
tation time.
In the category of miscellaneous recognition, Hisham and Hamouda [37] suggested
incorporating more feature engineering, working on large samples of complete sentences,
and utilizing deep learning to enhance recognition accuracy. Bencherif et al. [39] pointed
out the need to add a sign boundary detector or a network to the proposed solution in
order to determine whether a shape is a sign or a transitory gesture. They stressed that
more research should be done to find the smallest group of unique frames and/or key
points that show a sign by putting the produced frame key points into meaningful re-
duced clusters. This would make the deep learning model smaller and lighter for mobile
devices. They also highlighted the significance of minimizing network size and maximiz-
ing classification performance by improving delay removal across the pipeline using con-
volution suppression and optimum data propagation. Additional enhancements, includ-
ing the ability to zoom in on signers and the addition of an automated method for detect-
ing palm positions, were also suggested by Bencherif et al. [39]. Elsayed and Fathy [38]
underlined the necessity of improving deep learning with ontology to address dynamic
real video in real-time applications and transforming the system into a mobile application.
Table 18. Limitations and future work for fingerspelling ArSLR studies.
(Columns: Year, Ref., Limitations, Future Work.)

Limitations: limitation in the available system hardware, as image processing and deep learning algorithms require high processing and memory requirements; limitation of achieving acceptable recognition within a reasonable time and accuracy. Future work: increase the dataset size; develop a real-time mobile application for Arabic sign language translation.

2021 [72]: Limitations: not mentioned. Future work: develop educational tools for deaf and dumb children using the proposed AArSLRS; provide translation systems for the meanings of the Holy Quran.

2021 [71]: Limitations: not mentioned. Future work: implement a mobile-based application to recognize Arabic sign language in real time; use dynamic gesture recognition for Arabic sign language; build a video-based dataset.

2021 [73]: Limitations: not mentioned. Future work: study the performance of the YOLO algorithm instead of Faster R-CNN for ArSL letter recognition.

2021 [57]: Limitations: not mentioned. Future work: consider testing the ArSL-CNN on different datasets; study the effectiveness of RNN for the application; utilize transfer learning for the ArSLR model.

2021 [56]: Limitations: not mentioned. Future work: study the CNN-2 and CNN-3 architectures on larger datasets; consider time and space complexity optimization to enable these architectures to be used on mobile phones.

2022 [58]: Limitations: the proposed model is limited to detecting only one object (a hand) without taking the background into consideration, which would affect the performance; the detection process in the proposed model is highly sensitive to variations in the hand's pose. Future work: build a mobile application based on the proposed model; exploit transfer learning; other sign language datasets, such as the American Sign Language dataset MS-ASL, can be used to pre-train a model to utilize transfer learning; implement data augmentation to produce training samples.

2022 [74]: Limitations: not mentioned. Future work: investigate utilizing transformers; extend the proposed work to recognize Arabic sign language words or common expressions.

2022 [59]: Limitations: not mentioned. Future work: generate real-time sentences and videos using sign language based on CNN models.

2022 [60]: Limitations: limitations in real-time recognition. Future work: not mentioned.

2022 [61]: Limitations: not mentioned. Future work: combine different transfer learning models for single-hand gesture recognition, such as the MobileNet and ResNet50 architectures; apply these models to recognize two-hand gestures.

2023 [46]: Limitations: the proposed model is limited to images of static gestures that show the discontinuous letters at the beginning of the Qur'anic Surahs. Future work: test the proposed model, QSLRS-CNN, on various datasets; evaluate RNN and LSTM; improve the proposed model by adopting transfer learning; develop a deep learning model to translate the Holy Qur'an meanings into sign language.

2023 [64]: Limitations: the dataset was not representative enough; real-world applications may require solutions to practical implementation issues such as computational resources, model deployment, real-time performance, and user usability. Future work: consider addressing the class imbalance found in ArSL datasets to guarantee an equal representation and enhance the accuracy of minority classes; expand the study to include video-based ArSL recognition; investigate hybrid models that combine vision transformers and pretrained models to enhance accuracy; examine further the optimization methods and fine-tuning approaches for transfer learning using pretrained models and vision transformers; examine methods for data augmentation that are especially designed for ArSL recognition; look into cross-language transfer learning; carry out benchmarking and comparison of various pretrained models, architectures, and vision transformers on ArSL recognition tasks and evaluate how well they perform on bigger and more varied datasets.

2023 [65]: Limitations: not mentioned. Future work: not mentioned.

2023 [66]: Limitations: not mentioned. Future work: lower the learning parameters and model size using quantization or model-pruning techniques to boost the model's efficiency.

2023 [41]: Limitations: the system is not robust to variations in illumination; the HOG descriptor efficiently captures the hand structure only when there is no background clutter, so for practical real-life applications an accurate segmentation is necessary; real-time applications will consider the 1.2 s characterization time to be slow; the Random Forest classifier takes a lot of trees to perform well, which makes the model slower. Future work: enhance the proposed system by including segmentation and hand-tracking phases for real-time acquisition and recognition; the system needs to be enhanced in order to attain high accuracy, particularly when dealing with a complicated background.

2023 [47]: Limitations: the proposed model exhibits the discontinuous letters at the beginning of Qur'anic surahs using just static gestures. Future work: examine RNN and LSTM; use many datasets when testing the QSLRS-CNN; apply transfer learning to create a better deep learning model for ArSL that is compatible with ArSL variations; develop a deep learning model to translate the meanings of the Holy Qur'an into sign language.

2023 [62]: Limitations: not mentioned. Future work: investigate more cutting-edge deep learning techniques to enhance the model's practicality and accuracy; evaluate the model's resilience and scalability for additional sign languages.

2023 [63]: Limitations: because training was done on a dataset of remarkably similar images, testing on the other two datasets revealed low recognition accuracy; the training time was impacted by the Internet speed. Future work: continue the research to look into further deep learning models, such as ResNet; further understanding of the decision-making processes of the model's layers is necessary for the recognition process.
Table 19. Limitations and future work for isolated ArSLR studies.
Table 20. Limitations and future work for continuous ArSLR studies.
Table 21. Limitations and future work for miscellaneous ArSLR studies.
Size: The size of the dataset affects model performance. Therefore, larger datasets are
generally better for improving model accuracy.
Sign representation: Different representations of signs are available in the dataset in
the form of different modalities, including RGB, depth, skeleton joint points, or others.
Data quality: Performance may suffer from a large amount of poor-quality data.
Therefore, data should be high-resolution and free of watermarks.
Annotation quality: Any dataset should include accurate, comprehensive, and con-
sistent annotations for key points and sign detection.
The fact that the ArSLR datasets available today only partially meet these require-
ments is disappointing and could negatively impact the performance of the model. This
systematic review shows that a few datasets are publicly available. However, publicly
available datasets are important in creating benchmark datasets to compare the perfor-
mance of different algorithms proposed in previous studies on ArSLR. The lack of availa-
ble datasets is one of the challenges impeding research and improvement in Arabic sign
language recognition. This is mainly caused by a shortage of experienced ArSL specialists
as well as the time and cost involved in gathering sign language data. Furthermore, re-
searchers might have trouble acquiring reliable ArSL datasets since Arabic is a compli-
cated language by nature. It could be challenging to directly compare the recognition ac-
curacy of the various approaches because some studies created their own data, which is
typically private or unavailable to other researchers.
One of the variables that affects the diversity of the ArSL dataset is the total number
of signers. This aspect is crucial for assessing the generalization of the recognition systems.
The majority of the reviewed ArSL datasets were built with a relatively small number of
signers and only a few classes, which calls into doubt their representative value. Signer-
independent recognition systems that are tested on signers other than those who partici-
pated in the system training benefit from having more signers. Thus, when faced with
slightly varying presentations of sign language gestures, the performance of any ML/DL
model that relies on those datasets may be jeopardized.
An additional consideration in evaluating a sign language dataset is the number of
samples. Training ML/DL models requires numerous samples per sign with certain vari-
ances per sample. Data on sign representation is very crucial for assessing datasets. Most
of the reviewed Arabic sign language datasets are available in RGB format. However, a
number of the datasets capture the signs using multimodality devices such as Microsoft
Kinect and LMC, which provide additional representations of the sign sample, like joint
points and depth. A few datasets rely on wearable devices, such as DG5-VHand data
gloves and 3-D IMU sensors, to record the required features of the hand gestures. Multi-
modal datasets are becoming more prevalent, especially in the categories of isolated
words and continuous sentence recognition. This is a positive indicator showing the next
phase of ArSLR research and providing more opportunities for creative thinking.
One of the most important factors that might influence the rate of advancement in
any branch of AI research is the availability of high-quality datasets for model training
and testing. As a relatively new field of interest, ArSLR research experienced this issue at
first, but the reviewed papers show that situations are beginning to improve in this matter.
Owing to these issues with current datasets, some researchers tend to integrate two
or more datasets while training their models. The goal of combining datasets is to over-
come the shortcomings of each one individually. Despite the features of these datasets,
data augmentation remains required to enhance data diversity. Therefore, for ArSLR
tasks, an exhaustive dataset with a variety of data that tackles the issue of occlusions and
enables accurate labeling is still essential.
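For illustration, the following sketch shows the kind of label-preserving augmentation pipeline such work typically relies on; the specific transforms and their ranges are assumptions, not recommendations from the reviewed studies.

```python
# Illustrative augmentation pipeline for ArSL images; transforms and ranges are assumed.
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomRotation(0.05),        # small rotations; large ones can distort a sign
    layers.RandomTranslation(0.1, 0.1),
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.2),         # simulates illumination changes
])
# Horizontal flips are deliberately omitted: mirroring swaps the signing hand and can
# change the meaning of some gestures.

# Applied on the fly during training, e.g.:
# train_ds = train_ds.map(lambda img, lbl: (augment(img, training=True), lbl))
```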
ArSLR research shows considerable promise for enhancing accessibility for the deaf
and hard-of-hearing groups; however, it raises a number of significant ethical and privacy
issues. These issues mostly focus on informed consent, data privacy, and security. The
ethical and privacy considerations that researchers must carefully consider when gather-
ing and utilizing data for ArSLR are discussed below.
A key ethical principle in research is informed consent, which guarantees that study
participants are aware of the goals, methods, risks, and intended use of their data. Partic-
ipants must be clearly informed about the following when conducting ArSLR research:
• How their sign language gestures will be captured and examined.
• What data will be gathered, such as motion data or video footage.
• The data’s intended use, such as sharing it with third parties or using it to train mod-
els, and who will have access to it.
• The possible risks include invasions of privacy or improper use of personal infor-
mation.
Data anonymization is essential for protecting the privacy and identity of research
participants. Researchers must ensure that identifiable features like faces, clothing, or lo-
cations are eliminated or obscured when using motion data or video footage to train
ArSLR systems. They also need to ensure that data linkage is avoided, which means that
individual participants cannot be identified through combined data from multiple sources
or over time. In video-based data collection, this becomes especially difficult because iden-
tifying people by their body language may unintentionally lead to identification. In addi-
tion to ensuring that all identifying features of the data are anonymized before usage,
researchers should investigate methods like face-blurring.
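A hedged example of such a face-blurring step, using a stock OpenCV face detector, is shown below; the detector choice and blur strength are assumptions, and any production pipeline would need to be validated against the applicable data protection requirements.

```python
# Hedged example of a face-blurring anonymization step using a stock OpenCV face
# detector. This is a generic sketch, not a procedure prescribed by the reviewed
# studies; detector choice and blur strength are assumptions.
import cv2

_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def blur_faces(frame_bgr):
    """Return a copy of the frame with every detected face region Gaussian-blurred."""
    out = frame_bgr.copy()
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in _face_detector.detectMultiScale(gray, 1.1, 5):
        out[y:y + h, x:x + w] = cv2.GaussianBlur(out[y:y + h, x:x + w], (51, 51), 0)
    return out

# Applied per frame before videos are stored or shared, the hand regions that carry
# the sign remain intact while the signer's identity is obscured.
```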
The storage and protection of data must also be carefully considered in ethical re-
search. Researchers must ensure that data is stored securely, with only authorized indi-
viduals having access; data is stored in accordance with relevant data protection laws,
ethical standards, and regulations; and participants’ rights to request removal of data and
withdrawal of consent are upheld.
clustering. The findings reveal that since 2020, segmentation has shifted towards CNN-based algorithms and transfer learning, including VGG, AlexNet, MobileNet, Inception, ResNet, DenseNet, SqueezeNet, EfficientNet, CapsNet, and others. A few researchers adopted pretrained vision transformers, such as ViT and Swin, and the semantic segmentation model DeepLabv3, which proved to enhance the accuracy of the model.
Feature extraction is a crucial step in the ArSLR process. Thus, the feature vectors
obtained from this process serve as the classifier’s intake. The feature extraction approach
should identify structures robustly and consistently, irrespective of changes in the bright-
ness, location, size, and orientation of the item in an image or video. The findings show
that some ArSLR studies tend to adopt hybrid feature extraction techniques to address the
shortcomings in any individual technique and take advantage of their benefits. Other re-
searchers have recently used popular deep learning methods, such as CNNs, to extract
relevant features. These methods take features from the first layers and input them into
the ones that come after. CNNs and LSTM have been used together by some researchers
to extract temporal and spatial data, which are useful in isolated words and continuous
sentence recognition. Pretrained models have been used to extract features in the majority
of studies on fingerspelling ArSLR that have been conducted since 2019.
Section 3.2 reveals that deep learning algorithms, such as RNNs and CNNs, have
emerged as powerful tools in ArSLR research and have seen widespread use since 2019.
However, despite their advances, they also encounter a number of challenges that should
be overcome in order to fully exploit their effect and applicability in ArSLR research. The
scarcity of longitudinal datasets presents one of the problems RNNs encounter. RNNs
excel at modeling temporal dependencies and capturing sequential patterns, making them
ideal for recognizing continuous sentences. However, collecting diverse large-scale longi-
tudinal datasets is the key to training resilient RNN models. Furthermore, RNNs are chal-
lenged by the wide range of sign language data. CNNs have proven to be remarkably
effective at recognizing ArSL. They do, however, have challenges in the field of ArSLR
research. The requirement for diverse and large-volume datasets for the effective training
of CNN models is one of these challenges. ArSLR data may demonstrate class imbalances
and tend to be small in size; therefore, careful data augmentation techniques are needed
to overcome these problems. Moreover, CNNs struggle with generalizing across different
signers and acquisition modalities. Developing robust techniques to handle these chal-
lenges and ensure model generalization is a key area of research. For this reason, a method
called transfer learning—in which the model is trained on a large training set—has been
suggested as a remedy. The results of this training are then treated as the starting point
for the target task. Transfer learning using pretrained models has been successful in the
field of ArSLR. Frequently used pretrained deep learning models include common models
such as AlexNet, SqueezeNet, VGGNet16, VGGNet19, GoogleNet, DenseNet, MobileNet,
Xception, Inception, and ResNet, which are all typically utilized for image classification.
In an attempt to improve model accuracy, ensemble approaches—which incorporate dif-
ferent models—have also been used recently. The integrated models significantly improve
the results’ accuracy. Additionally, some researchers have integrated RNNs with other
deep learning architectures, such as CNNs for isolated words and continuous sentence
recognition. These hybrid models aim to gather both spatial and temporal information
from sign language data, seeking enhanced performance in ArSLR. A few more recent
studies have boosted the performance by adopting vision transformers and attention
mechanisms.
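A minimal sketch of the ensemble idea mentioned above, soft voting over the softmax outputs of several independently trained classifiers, is shown below; the member models and file names are placeholders.

```python
# Minimal sketch of soft-voting ensembling: average the per-model class probabilities
# of several independently trained classifiers (e.g., fine-tuned pretrained CNNs) and
# take the arg-max. Model files and member choice are placeholders.
import numpy as np
import tensorflow as tf

def ensemble_predict(models, images):
    """Soft voting: mean of per-model class probabilities, then arg-max."""
    probs = np.mean([m.predict(images, verbose=0) for m in models], axis=0)
    return probs.argmax(axis=1)

# members = [tf.keras.models.load_model(p) for p in
#            ("vgg16_arsl.keras", "mobilenet_arsl.keras", "resnet50_arsl.keras")]
# y_pred = ensemble_predict(members, test_images)
```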
The capacity of the proposed models to accurately accomplish the main task—that
is, to recognize or translate sign language—is how their performance is often evaluated.
The primary metric to evaluate the effectiveness of the model is the average accuracy over
the whole dataset; a greater percentage denotes a more accurate approach. It can be diffi-
cult to compare the effectiveness of different ML/DL models in ArSLR research because of
the variety of tests, differences in datasets, evaluation metrics, and experimental setups.
Overall, many approaches did rather well and identified over 90% of the Arabic signs that
were displayed. The fact that many of the studies adopt signer-dependent mode testing
contributes to achieving such high accuracy. Although the ML/DL model’s capacity is typ-
ically limited to the signs learned from the training set, it is possible to accomplish some
generalization with regard to other individuals exhibiting the same sign. Thus, one of the
most crucial aspects of SLR research is the optimization of training parameters, which can
significantly affect the effectiveness of the proposed solutions. More sophisticated systems
seek to comprehend increasingly complicated continuous sign language speech segments
and to enhance their real-time recognition capabilities. These applications are far more
complicated than simple word or letter recognition and often require combining the anal-
ysis of various signs to decipher a particular sequence’s meaning. In order to capture se-
mantic nuances and prevent comparable signs from being confused, researchers have to
use hybrid architectures and advanced sequence-to-sequence models.
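For clarity, the following small example computes the metrics most frequently reported in the reviewed papers, overall accuracy and the word and sentence recognition rates used in continuous recognition; the definitions follow common usage and may differ in detail from individual studies.

```python
# Worked sketch of common evaluation quantities: overall accuracy for isolated signs,
# and word/sentence recognition rates for continuous recognition. Definitions here are
# the usual ones and may differ in detail from individual studies.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def sentence_recognition_rate(ref_sentences, hyp_sentences):
    """Fraction of sentences recognized exactly (every word correct)."""
    return accuracy(ref_sentences, hyp_sentences)

def word_recognition_rate(ref_sentences, hyp_sentences):
    """Fraction of words recognized correctly, position-wise, over all sentences."""
    correct = total = 0
    for ref, hyp in zip(ref_sentences, hyp_sentences):
        total += len(ref)
        correct += sum(r == h for r, h in zip(ref, hyp))
    return correct / total

refs = [["I", "go", "school"], ["father", "read", "book"]]
hyps = [["I", "go", "school"], ["father", "read", "pen"]]
print(word_recognition_rate(refs, hyps))      # 5/6 ≈ 0.833
print(sentence_recognition_rate(refs, hyps))  # 1/2 = 0.5
```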
The relationship between computational resources and model complexity is crucial
in ArSLR, particularly as the field moves toward the use of deeper and more complex neu-
ral networks. Earlier research on ArSL recognition may have relied on less complex mod-
els like shallow neural networks, decision trees, or support vector machines. The accuracy
of these models is often lower, particularly for more complicated sign sequences or gestures, and they may not capture the subtleties or contextual dynamics of sign language. More recently, deeper neural networks are being investigated to enhance performance, particularly for dynamic and continuous sign gesture recognition, for instance, RNNs or transformers for sequential sign interpretation and CNNs for spatial feature extraction from images or videos. These models demand considerably more memory, processing
power, and training time than other models. ArSLR systems frequently need to process a
variety of input data types, such as images, video, depth sensors, or motion capture data.
This boosts model complexity and calls for more advanced networks.
The complexity of the model has a substantial impact on the computational resources
required for both training and deployment. For efficient training and inference, deeper
networks with more parameters require more processing power, such as high-perfor-
mance GPUs or TPUs. In order to maintain large datasets and model weights, complex
models demand a substantial amount of memory and storage capacity. Deeper network
training also necessitates processing massive amounts of video data, frequently in real-
time, which can be computationally costly. Longer training times for more complicated
models could result in increased expenses and consumption of power. This is especially
problematic when scaling up to large datasets for ArSLR or in circumstances with re-
stricted resources.
There is a trade-off between the model’s performance on ArSLR tasks and its com-
plexity. Deeper models typically offer higher accuracy, but they also come with a higher
processing cost. Model performance and efficiency must be balanced, particularly in real-
time applications where speed is critical. Researchers are investigating methods like trans-
fer learning, quantization, and model pruning to lessen the computational load without
appreciably compromising performance.
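The sketch below illustrates both techniques on a toy classifier using PyTorch utilities; the sparsity level, the layers selected for quantization, and the model itself are assumptions, and a real ArSLR model would be re-evaluated after each compression step.

```python
# Hedged illustration of pruning and post-training quantization on a toy classifier,
# using PyTorch utilities. Sparsity level and quantized layer types are assumptions.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(225, 128), nn.ReLU(), nn.Linear(128, 32))

# 1) Unstructured magnitude pruning: zero out 40% of the smallest weights per layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.4)
        prune.remove(module, "weight")        # make the pruning permanent

# 2) Post-training dynamic quantization: store Linear weights as int8.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 225)
print(quantized(x).shape)   # torch.Size([1, 32])
```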
Both the Arabic language and sign languages in general have inherent complexities
that contribute to the difficulty of handling contextual interactions and advanced semantic
analysis in ArSLR. Like other sign languages, ArSL mostly depends on the surrounding
signs and contextual cues, such as body posture, facial expressions, and hand shape, to
accurately express meaning. A sign’s meaning might change depending on the context.
For instance, depending on its position, speed, and the facial expressions that accompany
it, a single sign may have several meanings. Several signs in continuous sign language
have similar hand shapes or movements but differ slightly (e.g., by speed, direction, or
facial expressions), and gestures also frequently overlap. It might be challenging to recog-
nize these subtle variations. Parsing the semantic structure of ArSL, or comprehending the
meaning behind a sequence of signs, remains a challenging task. This includes contextual
factors such as the link between signs, pronominal reference, negations, and non-manual
markers. The lack of comprehensive ArSL datasets restricts the ability to train robust mod-
els, particularly for continuous signing.
A comparison of the most popular ML/DL techniques in ArSLR research, such as
SVM, KNN, HMM, RF, MKNN, NB, CNNs, RNNs, hybrid CNN-RNN, transfer learning,
and transformer-based techniques, is provided in Table 22 in terms of recognition accu-
racy, efficiency, and robustness. This comparison has been derived from typical patterns
observed in the reviewed ArSLR papers. The performance can differ considerably de-
pending on the dataset, preprocessing and segmentation methods, feature extraction tech-
niques, and the specific recognition task (static vs. dynamic signs) used.
Table 22. Comparison of the most popular ML/DL techniques in ArSLR research.
nuances is necessary. This is certainly one of the hottest topics in ArSLR research, and
it will keep being investigated in a variety of ways in an effort to find a configuration
that may solve the issues impeding the development of extremely powerful tools.
Considering current findings, I anticipate that future research in this particular area
will focus on context-aware models that use sequence-based analysis, such as trans-
formers or RNNs, to capture the sequential nature of signs and their contextual rela-
tionships. The immediate context (prior or subsequent signs) and the global context
(the overall sentence structure, non-manual signs like facial expressions, etc.) should
both be considered by these models. More accurate interpretations of ambiguous
signs could be made possible by using attention mechanisms to emphasize the most
contextually relevant portions of a sign sequence. ArSLR models can also be im-
proved by integrating multimodal data, such as body posture, facial expression, and
hand gesture recognition. By processing multiple inputs concurrently, multimodal
deep learning frameworks—like multimodal transformers—may enhance the
model’s comprehension of the semantic information that is communicated by com-
bining these inputs.
• Deep learning models for ArSLR can evolve by integrating advanced frameworks
from state-of-the-art research, as demonstrated in datasets like AUTSL [103]. These
frameworks emphasize the fusion of spatial and temporal features, multimodal data
processing, and cutting-edge model architectures. Here is how these approaches can
be adapted for ArSLR:
• Frameworks like STF + LSTM [108] and Feature Engineering with LSTM (FE +
LSTM) [117] demonstrate how the system can model sequential dependencies
by combining spatial information, such as hand shapes and locations, with tem-
poral dynamics, like movement trajectories, using LSTM networks. Continuous
signing may be processed efficiently by such models, which can also handle var-
iations in the speed and execution of gestures and extract distinctive spatial fea-
tures from ArSL gestures (e.g., finger positions for specific letters).
• 3D Convolutional Models for Video Input including 3D-DCNN [109] and MViT-
SLR [118]: These networks are highly suited to sign language video data since
they can capture motion and depth. Hierarchical feature scaling transformers
can also be applied for temporal and spatial learning using Multiscale Vision
Transformers for Sign Language Recognition (MViT-SLR). Using 3D convolu-
tions to learn hand gestures and combining them with MViTs for hierarchical
temporal modeling could be one way to adapt to ArSLR.
• Graph-Based Models for Skeleton Dynamics (e.g., ST-MGCN [109], HW-GAT
[119]): These models use graph convolutional networks (GCNs) to model rela-
tionships between skeletal keypoints and temporal dynamics. ST-MGCNs are
used to model complex joint movements over time, and Hand-Weighted Graph
Attention Networks (HW-GAT) are used to assign higher weights to critical
joints like fingers in hand-dominated gestures. Adaptability to ArSLR comprises
applying ST-MGCNs or HW-GAT for fine-grained recognition of Arabic sign
trajectories and leveraging skeletal data, such as OpenPose or MediaPipe, to
track hand and body keypoints.
• Transformer-Based Architectures, such as Video Transformer Networks with
Progressive Filtering (VTN-PF) [120], can be applied for global temporal model-
ing, progressively refining key gesture features. These architectures can be
adapted to ArSLR to process Arabic sign videos with high variability in signer
style and environmental conditions. Moreover, progressive filtering can be em-
ployed to emphasize critical frames, such as key transitions in signs.
• Multimodal integration: Advanced frameworks can integrate video, depth, and
skeletal data inputs for a richer feature representation. Sign Attention Module
(SAM-SLR) [121] integrates spatial, temporal, and modality-specific features
4. Conclusions
Intelligent solutions for Arabic sign language recognition are still gaining interest
from academic scholars thanks to recent developments in machine learning and deep
learning techniques. This study presents a systematic review of ML/DL techniques uti-
lized in ArSLR-relevant studies in the period between 2014 and 2023. Using data from 56
full-text research publications that were obtained from the Scopus, WoS, and IEEE Xplore
online databases, an overview of the current trends in intelligent-based ArSL recognition
is provided.
Thorough analysis of the dataset characteristics utilized in the reviewed papers was
conducted. The datasets were grouped according to the recognition category they repre-
sent, whether it is fingerspelling, isolated words, continuous sentences, or a combination
of them. The findings reveal that the most widely used dataset was ArSL2018, where it
has been adopted by many fingerspelling recognition researchers since 2019. The analysis
of the datasets shows that the area of ArSLR lacks high-quality, large-scale, publicly avail-
able datasets, particularly for isolated words and continuous recognition. Availability of
such datasets would play a significant role in advancing this field and enable researchers
to focus on improving recognition algorithms in order to boost performance and achieve
high accuracy outcomes. The adoption of deep learning models—which are still being
refined and will only gain more traction in the upcoming years—has been a major driving
force behind recent advancements in this field. The past decade has seen the development
of numerous unique and extremely creative ideas for ArSLR systems, such as feature ex-
traction from sensor data or videos and passing them into neural classifiers.
In this study, I reviewed the state-of-the-art techniques for ArSLR tasks based on
ML/DL algorithms, which have been developed over the last 10 years, and categorized
them into groups according to the type of recognition: fingerspelling, isolated words, con-
tinuous sentences, and miscellaneous recognition. Due to their superior qualities, CNN-
based algorithms are used in the most popular method to extract discriminative features
from unprocessed input. Several different types of networks were frequently combined to
increase overall performance. These models can handle data from a variety of sources and
formats; they have been successfully applied to static images, depth, skeleton, and sequen-
tial data. Many researchers have shifted toward employing CNN-based transfer learning
for ArSL recognition. When compared to conventional CNN-based deep learning models,
the reviewed studies show that the transfer learning approach—which makes use of both
pretrained models and vision transformers—achieved a greater accuracy. Even though
the pretrained models outperformed the vision transformers in terms of accuracy, vision
transformers demonstrate more consistent learning. Recent fingerspelling recognition
studies were found to exploit ensemble methods, where several models are combined,
seeking to increase the overall performance.
Ultimately, regardless of the advancements in research on ArSL recognition, there is
still an apparent lack of practical applications and software for performing these tasks. In
order to narrow the gap between research and practical implementation, it is necessary
that accessible and user-friendly software and applications for ArSL recognition be devel-
oped. The development of trustworthy, usable, high-performance software solutions will
help those who are hard of hearing or deaf, and it may enhance their daily interactions
and communication in general.
References
1. World Health Organization. World Report on Hearing. Available online: https://www.who.int/teams/noncommunicable-dis-
eases/sensory-functions-disability-and-rehabilitation/highlighting-priorities-for-ear-and-hearing-care (accessed on 8 December
2023).
2. Wadhawan, A.; Kumar, P. Sign Language Recognition Systems: A Decade Systematic Literature Review. Arch. Comput. Methods
Eng. 2019, 28, 785–813. https://doi.org/10.1007/s11831-019-09384-2.
3. Amrutha, C.U.; Davis, N.; Samrutha, K.S.; Shilpa, N.S.; Chunkath, J. Improving Language Acquisition in Sensory Deficit Indi-
viduals with Mobile Application. Procedia Technol. 2016, 24, 1068–1073. https://doi.org/10.1016/j.protcy.2016.05.237.
4. Rastgoo, R.; Kiani, K.; Escalera, S. Sign Language Recognition: A Deep Survey. Expert Syst. Appl. 2021, 164, 113794.
https://doi.org/10.1016/j.eswa.2020.113794.
5. Ibrahim, N.B.; Selim, M.M.; Zayed, H.H. An Automatic Arabic Sign Language Recognition System (ArSLRS). J. King Saud Univ.
Comput. Inf. Sci. 2018, 30, 470–477. https://doi.org/10.1016/j.jksuci.2017.09.007.
6. Al-Qurishi, M.; Khalid, T.; Souissi, R. Deep Learning for Sign Language Recognition: Current Techniques, Benchmarks, and
Open Issues. IEEE Access 2021, 9, 126917–126951. https://doi.org/10.1109/ACCESS.2021.3110912.
7. Adeyanju, I.A.; Bello, O.O.; Adegboye, M.A. Machine Learning Methods for Sign Language Recognition: A Critical Review and
Analysis. Intell. Syst. Appl. 2021, 12, 56. https://doi.org/10.1016/j.iswa.2021.20.
8. Cheok, M.J.; Omar, Z.; Jaward, M.H. A review of hand gesture and sign language recognition techniques. Int. J. Mach. Learn.
Cybern. 2019, 10, 131–153. https://doi.org/10.1007/s13042-017-0705-5.
9. Al-Shamayleh, A.S.; Ahmad, R.; Jomhari, N.; Abushariah, M.A.M. Automatic Arabic Sign Language Recognition: A Review,
Taxonomy, Open Challenges, Research Roadmap and Future directions. Malays. J. Comput. Sci. 2020, 33, 306–343.
https://doi.org/10.22452/mjcs.vol33no4.5.
10. Oudah, M.; Al-Naji, A.; Chahl, J. Hand Gesture Recognition Based on Computer Vision: A Review of Techniques. J. Imaging
2020, 6, 73. https://doi.org/10.3390/JIMAGING6080073.
11. Liu, Y.; Nand, P.; Hossain, M.A.; Nguyen, M.; Yan, W.Q. Sign Language Recognition from Digital Videos using Feature Pyramid
Network with Detection Transformer. Multimed. Tools Appl. 2023, 82, 21673–21685. https://doi.org/10.1007/s11042-023-14646-0.
12. Alselwi, G.A.A.A.; Taşci, T. Arabic Sign Language Recognition: A Review. Bayburt Univ. J. Sci. 2021, 4, 73–79.
13. Zahid, H.; Rashid, M.; Hussain, S.; Azim, F.; Syed, S.A.; Saad, A. Recognition of Urdu Sign Language: A Systematic Review of
the Machine Learning Classification. PeerJ Comput. Sci. 2022, 8, e883. https://doi.org/10.7717/PEERJ-CS.883.
14. Representatives of 18 Arabic Countries. The Arabic Dictionary of Gestures for the Deaf. Available online:
http://www.menasy.com/arab Dictionary for the deaf 2.pdf (accessed on 18 December 2023).
15. Mohandes, M.; Deriche, M.; Liu, J. Image-based and sensor-based approaches to arabic sign language recognition. IEEE Trans.
Hum. Mach. Syst. 2014, 44, 551–557. https://doi.org/10.1109/THMS.2014.2318280.
16. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.;
Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, 71.
https://doi.org/10.1136/bmj.n71.
17. Mohandes, M.; Liu, J.; Deriche, M. A Survey of Image-Based Arabic Sign Language Recognition. In Proceedings of the IEEE
11th International Multi-Conference on Systems, Signals & Devices (SSD14), Barcelona, Spain, 11–14 February 2014; pp. 1–4.
18. Mohamed-Saeed, H.A.; Abdulbqi, H.A. Investigation: Arabic and other sign languages based on artificial techniques. AIP Conf.
Proc. 2023, 2834, 050024. https://doi.org/10.1063/5.0161719.
19. Tamimi, A.J.; Hashlamon, I. Arabic Sign Language Datasets: Review and Improvements. In Proceedings of the 2023 Interna-
tional Symposium on Networks, Computers and Communications (ISNCC), Doha, Qatar, 23–26 October 2023; pp. 1–6.
https://doi.org/10.1109/ISNCC58260.2023.10323881.
20. Mohammed, R.M.; Kadhem, S.M. A Review on Arabic Sign Language Translator Systems. J. Phys. Conf. Ser. 2021, 1818, 012033.
https://doi.org/10.1088/1742-6596/1818/1/012033.
21. AL Moustafa, A.M.J.; Rahim, M.S.M.; Khattab, M.M.; Zeki, A.M.; Matter, S.S.; Soliman, A.M.; Ahmed, A.M. Arabic Sign Lan-
guage Recognition Systems: A Systematic Review. Indian J. Comput. Sci. Eng. 2024, 15, 1–18.
https://doi.org/10.21817/indjcse/2024/v15i1/241501008.
22. Petticrew, M.; Roberts, H. Systematic Reviews in the Social Sciences: A Practical Guide; Blackwell-Wiley Publishing: Hoboken, NJ,
USA, 2008.
23. Gusenbauer, M.; Haddaway, N.R. Which academic search systems are suitable for systematic reviews or meta-analyses? Eval-
uating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Res. Synth. Methods 2020, 11, 181–217.
https://doi.org/10.1002/jrsm.1378.
24. Shaffril, H.A.M.; Samsuddin, S.F.; Samah, A.A. The ABC of systematic literature review: The basic methodological guidance for
beginners. Qual. Quant. 2021, 55, 1319–1346. https://doi.org/10.1007/s11135-020-01059-6.
25. Ghorai, A.; Nandi, U.; Changdar, C.; Si, T.; Singh, M.M.; Mondal, J.K. Indian sign language recognition system using network
deconvolution and spatial transformer network. Neural Comput. Appl. 2023, 35, 20889–20907. https://doi.org/10.1007/s00521-023-
08860-y.
26. Shohieb, S.M.; Elminir, H.K.; Riad, A.M. Signs World Atlas; a benchmark Arabic Sign Language database. J. King Saud Univ.—
Comput. Inf. Sci. 2015, 27, 68–76. https://doi.org/10.1016/j.jksuci.2014.03.011.
27. Luqman, H.; Mahmoud, S.A. A machine translation system from Arabic sign language to Arabic. Univers. Access Inf. Soc. 2020,
19, 891–904. https://doi.org/10.1007/s10209-019-00695-6.
28. Boukdir, A.; Benaddy, M.; Ellahyani, A.; El Meslouhi, O.; Kardouchi, M. Isolated Video-Based Arabic Sign Language Recogni-
tion Using Convolutional and Recursive Neural Networks. Arab. J. Sci. Eng. 2022, 47, 2187–2199. https://doi.org/10.1007/s13369-
021-06167-5.
29. Hisham, B.; Hamouda, A. Supervised learning classifiers for Arabic gestures recognition using Kinect V2. SN Appl Sci. 2019, 1,
768. https://doi.org/10.1007/s42452-019-0771-2.
30. Hisham, B.; Hamouda, A. Arabic Dynamic Gesture Recognition Using Classifier Fusion. J. Adv. Inf. Fusion 2019, 14, 66–85.
31. Abdul, W.; Alsulaiman, M.; Amin, S.U.; Faisal, M.; Muhammad, G.; Albogamy, F.R.; Bencherif, M.A.; Ghaleb, H. Intelligent
real-time Arabic sign language classification using attention-based inception and BiLSTM. Comput. Electr. Eng. 2021, 95, 107395.
https://doi.org/10.1016/j.compeleceng.2021.107395.
32. Ahmed, A.M.; Alez, R.A.; Tharwat, G.; Taha, M.; Belgacem, B.; Al Moustafa, A.M.J. Arabic sign language intelligent translator.
Imaging Sci. J. 2020, 68, 11–23. https://doi.org/10.1080/13682199.2020.1724438.
33. Ahmed, A.M.; Alez, R.A.; Tharwat, G.; Taha, M.; Belgacem, B.; Al Moustafa, A.M.; Ghribi, W. Arabic Sign Language Translator.
J. Comput. Sci. 2019, 15, 1522–1537. https://doi.org/10.3844/jcssp.2019.1522.1537.
34. Husam, W.; Juliet, V. Mobile camera based sign language to speech by using zernike moments and neural network. Int. J. Appl.
Eng. Res. 2015, 10, 18698–18702.
35. Podder, K.K.; Ezeddin, M.; Chowdhury, M.E.H.; Sumon, S.I.; Tahir, A.M.; Ayari, M.A.; Dutta, P.; Khandakar, A.; Bin Mahbub,
Z.; Kadir, M.A. Signer-Independent Arabic Sign Language Recognition System Using Deep Learning Model. Sensors 2023, 23,
7156. https://doi.org/10.3390/s23167156.
36. Aldhahri, E.; Aljuhani, R.; Alfaidi, A.; Alshehri, B.; Alwadei, H.; Aljojo, N.; Alshutayri, A.; Almazroi, A. Arabic Sign Language
Recognition Using Convolutional Neural Network and MobileNet. Arab. J. Sci. Eng. 2023, 48, 2147–2154.
https://doi.org/10.1007/s13369-022-07144-2.
37. Hisham, B.; Hamouda, A. Arabic static and dynamic gestures recognition using leap motion. J. Comput. Sci. 2017, 13, 337–354.
https://doi.org/10.3844/jcssp.2017.337.354.
38. Elsayed, E.K.; Fathy, R.D. Sign Language Semantic Translation System using Ontology and Deep Learning. Int. J. Adv. Comput.
Sci. Appl. 2020, 11, 141–147. https://doi.org/10.14569/ijacsa.2020.0110118.
39. Bencherif, M.A.; Algabri, M.; Mekhtiche, M.A.; Faisal, M.; Alsulaiman, M.; Mathkour, H.; Al-Hammadi, M.; Ghaleb, H. Arabic
Sign Language Recognition System Using 2D Hands and Body Skeleton Data. IEEE Access 2021, 9, 59612–59627.
https://doi.org/10.1109/ACCESS.2021.3069714.
40. Al-Jarrah, O.; Halawani, A. Recognition of gestures in Arabic sign language using neuro-fuzzy systems. Artif. Intell. 2001, 133,
117–138.
41. Agab, S.E.; Chelali, F.Z. New combined DT-CWT and HOG descriptor for static and dynamic hand gesture recognition. Mul-
timed. Tools Appl. 2023, 82, 26379–26409. https://doi.org/10.1007/s11042-023-14433-x.
42. Abdo, M.Z.; Hamdy, A.M.; Abd, S.; Salem, E.-R.; Saad, E.M. Arabic Alphabet and Numbers Sign Language Recognition. 2015.
Available online: https://thesai.org/Publications/ViewPaper?Volume=6&Issue=11&Code=IJACSA&SerialNo=27 (accessed on 20
January 2024).
43. Tharwat, A.; Gaber, T.; Hassanien, A.E.; Shahin, M.K.; Refaat, B. Sift-based arabic sign language recognition system. Adv. Intell.
Syst. Comput. 2015, 334, 359–370. https://doi.org/10.1007/978-3-319-13572-4_30.
44. Hasasneh, A. Arabic sign language characters recognition based on a deep learning approach and a simple linear classifier.
Jordan. J. Comput. Inf. Technol. 2020, 6, 281–290. https://doi.org/10.5455/jjcit.71-1587943974.
45. Ahmed, A.M.; Alez, R.A.; Taha, M.; Tharwat, G. Automatic Translation of Arabic Sign to Arabic Text (ATASAT) System. Acad-
emy and Industry Research Collaboration Center (AIRCC). J. Comput. Sci. Inf. Technol. 2016, 6, 109–122.
https://doi.org/10.5121/csit.2016.60511.
46. AbdElghfar, H.A.; Ahmed, A.M.; Alani, A.A.; AbdElaal, H.M.; Bouallegue, B.; Khattab, M.M.; Youness, H.A. QSLRS-CNN:
Qur’anic sign language recognition system based on convolutional neural networks. Imaging Sci. J. 2023, 72, 254–266.
https://doi.org/10.1080/13682199.2023.2202576.
47. AbdElghfar, H.A.; Ahmed, A.M.; Alani, A.A.; AbdElaal, H.M.; Bouallegue, B.; Khattab, M.M.; Tharwat, G.; Youness, H.A. A
Model for Qur’anic Sign Language Recognition Based on Deep Learning Algorithms. J. Sens. 2023, 2023, 9926245.
https://doi.org/10.1155/2023/9926245.
48. Almasre, M.A.; Al-Nuaim, H. The Performance of Individual and Ensemble Classifiers for an Arabic Sign Language Recognition
System. 2017. Available online: https://thesai.org/Publications/ViewPaper?Volume=8&Issue=5&Code=IJACSA&SerialNo=38
(accessed on 20 January 2024).
49. Alzohairi, R.; Alghonaim, R.; Alshehri, W.; Aloqeely, S. Image based Arabic Sign Language Recognition System. Int. J. Adv.
Comput. Sci. Appl. 2018, 9, 185–194. https://doi.org/10.14569/ijacsa.2018.090327.
50. Latif, G.; Mohammad, N.; Alghazo, J.; AlKhalaf, R.; AlKhalaf, R. ArASL: Arabic Alphabets Sign Language Dataset. Data Brief
2019, 23, 103777. https://doi.org/10.1016/j.dib.2019.103777.
51. Latif, G.; Alghazo, J.; Mohammad, N.; AlKhalaf, R.; AlKhalaf, R. Arabic Alphabets Sign Language Dataset (ArASL). Available
online: https://data.mendeley.com/datasets/y7pckrw6z2/1 (accessed on 29 December 2023).
52. Shahin, A.I.; Almotairi, S. Automated Arabic Sign Language Recognition System Based on Deep Transfer Learning. IJCSNS Int.
J. Comput. Sci. Netw. Secur. 2019, 19, 144–152.
53. Althagafi, A.; Althobaiti, G.; Alsubait, T.; Alqurashi, T. ASLR: Arabic Sign Language Recognition Using Convolutional Neural
Networks. IJCSNS Int. J. Comput. Sci. Netw. Secur. 2020, 20, 124–129.
54. Saleh, Y.; Issa, G.F. Arabic sign language recognition through deep neural networks fine-tuning. Int. J. Online Biomed. Eng. 2020,
16, 71–83. https://doi.org/10.3991/IJOE.V16I05.13087.
55. Latif, G.; Mohammad, N.; AlKhalaf, R.; AlKhalaf, R.; Alghazo, J.; Khan, M. An Automatic Arabic Sign Language Recognition
System based on Deep CNN: An Assistive System for the Deaf and Hard of Hearing. Int. J. Comput. Digit. Syst. 2020, 9, 715–724.
https://doi.org/10.12785/ijcds/090418.
56. Alshomrani, S.; Aljoudi, L.; Arif, M. Arabic and American Sign Languages Alphabet Recognition by Convolutional Neural
Network. Adv. Sci. Technol. Res. J. 2021, 15, 136–148. https://doi.org/10.12913/22998624/142012.
57. Alani, A.A.; Cosma, G. ArSL-CNN: A convolutional neural network for arabic sign language gesture recognition. Indones. J.
Electr. Eng. Comput. Sci. 2021, 22, 1096–1107. https://doi.org/10.11591/ijeecs.v22i2.pp1096-1107.
58. Alsaadi, Z.; Alshamani, E.; Alrehaili, M.; Alrashdi, A.A.D.; Albelwi, S.; Elfaki, A.O. A Real Time Arabic Sign Language Alphabets (ArSLA) Recognition Model Using Deep Learning Architecture. Computers 2022, 11, 78. https://doi.org/10.3390/computers11050078.
59. Duwairi, R.M.; Halloush, Z.A. Automatic recognition of Arabic alphabets sign language using deep learning. Int. J. Electr. Comput. Eng. 2022, 12, 2996–3004. https://doi.org/10.11591/ijece.v12i3.pp2996-3004.
60. Alnuaim, A.; Zakariah, M.; Hatamleh, W.A.; Tarazi, H.; Tripathi, V.; Amoatey, E.T. Human-Computer Interaction with Hand
Gesture Recognition Using ResNet and MobileNet. Comput. Intell. Neurosci. 2022, 2022, 8777355.
https://doi.org/10.1155/2022/8777355.
61. Zakariah, M.; Alotaibi, Y.A.; Koundal, D.; Guo, Y.; Elahi, M.M. Sign Language Recognition for Arabic Alphabets Using Transfer Learning Technique. Comput. Intell. Neurosci. 2022, 2022, 4567989. https://doi.org/10.1155/2022/4567989.
62. Nahar, K.M.O.; Almomani, A.; Shatnawi, N.; Alauthman, M. A Robust Model for Translating Arabic Sign Language into Spoken
Arabic Using Deep Learning. Intell. Autom. Soft Comput. 2023, 37, 2037–2057. https://doi.org/10.32604/iasc.2023.038235.
63. Fadhil, O.Y.; Mahdi, B.S.; Abbas, A.R. Using VGG Models with Intermediate Layer Feature Maps for Static Hand Gesture Recognition. Baghdad Sci. J. 2023, 20, 1808–1816. https://doi.org/10.21123/bsj.2023.7364.
64. Alharthi, N.M.; Alzahrani, S.M. Vision Transformers and Transfer Learning Approaches for Arabic Sign Language Recognition.
Appl. Sci. 2023, 13, 11625. https://doi.org/10.3390/app132111625.
65. Baker, Q.B.; Alqudah, N.; Alsmadi, T.; Awawdeh, R. Image-Based Arabic Sign Language Recognition System Using Transfer
Deep Learning Models. Appl. Comput. Intell. Soft Comput. 2023, 2023, 5195007. https://doi.org/10.1155/2023/5195007.
66. Islam, M.; Aloraini, M.; Aladhadh, S.; Habib, S.; Khan, A.; Alabdulatif, A.; Alanazi, T.M. Toward a Vision-Based Intelligent
System: A Stacked Encoded Deep Learning Framework for Sign Language Recognition. Sensors 2023, 23, 9068.
https://doi.org/10.3390/s23229068.
67. Hayani, S.; Benaddy, M.; El Meslouhi, O.; Kardouchi, M. Arab Sign language Recognition with Convolutional Neural Networks.
In Proceedings of the 2019 International Conference of Computer Science and Renewable Energies (ICCSRE), Agadir, Morocco,
22–24 July 2019; pp. 1–4. https://doi.org/10.1109/ICCSRE.2019.8807586.
68. Elatawy, S.M.; Hawa, D.M.; Ewees, A.A.; Saad, A.M. Recognition system for alphabet Arabic sign language using neutrosophic
and fuzzy c-means. Educ. Inf. Technol. 2020, 25, 5601–5616. https://doi.org/10.1007/s10639-020-10184-6.
69. Kamruzzaman, M.M. Arabic Sign Language Recognition and Generating Arabic Speech Using Convolutional Neural Network.
Wirel. Commun. Mob. Comput. 2020, 2020, 3685614. https://doi.org/10.1155/2020/3685614.
70. Qaroush, A.; Yassin, S.; Al-Nubani, A.; Alqam, A. Smart, comfortable wearable system for recognizing Arabic Sign Language
in real-time using IMUs and features-based fusion. Expert Syst. Appl. 2021, 184, 115448.
https://doi.org/10.1016/j.eswa.2021.115448.
71. Ismail, M.H.; Dawwd, S.A.; Ali, F.H. Static hand gesture recognition of Arabic sign language by using deep CNNs. Indones. J.
Electr. Eng. Comput. Sci. 2021, 24, 178–188. https://doi.org/10.11591/ijeecs.v24.i1.pp178-188.
72. Tharwat, G.; Ahmed, A.M.; Bouallegue, B. Arabic Sign Language Recognition System for Alphabets Using Machine Learning
Techniques. J. Electr. Comput. Eng. 2021, 2021, 2995851. https://doi.org/10.1155/2021/2995851.
73. Alawwad, R.A.; Bchir, O.; Ismail, M.M.B. Arabic Sign Language Recognition using Faster R-CNN. 2021. Available online: https://thesai.org/Publications/ViewPaper?Volume=12&Issue=3&Code=IJACSA&SerialNo=80 (accessed on 20 January 2024).
74. AlKhuraym, B.Y.; Ismail, M.M.B.; Bchir, O. Arabic Sign Language Recognition using Lightweight CNN-Based Architecture.
2022. Available online: https://thesai.org/Publications/ViewPaper?Volume=13&Issue=4&Code=IJACSA&SerialNo=38 (accessed
on 20 January 2024).
75. Shanableh, T.; Assaleh, K.; Al-Rousan, M. Spatio-temporal feature-extraction techniques for isolated gesture recognition in Arabic Sign Language. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2007, 37, 641–650. https://doi.org/10.1109/TSMCB.2006.889630.
76. Sidig, A.A.I.; Luqman, H.; Mahmoud, S.A. Arabic sign language recognition using vision and hand tracking features with
HMM. Int. J. Intell. Syst. Technol. Appl. 2019, 18, 430. https://doi.org/10.1504/IJISTA.2019.101951.
77. Aly, S.; Aly, W. DeepArSLR: A Novel Signer-Independent Deep Learning Framework for Isolated Arabic Sign Language Gestures Recognition. IEEE Access 2020, 8, 83199–83212. https://doi.org/10.1109/ACCESS.2020.2990699.
78. Luqman, H.; Elalfy, E. Utilizing motion and spatial features for sign language gesture recognition using cascaded CNN and
LSTM models. Turk. J. Electr. Eng. Comput. Sci. 2022, 30, 2508–2525. https://doi.org/10.55730/1300-0632.3952.
79. Abdo, M.Z.; Hamdy, A.M.; Salem, S.A.E.-R.; Saad, E.M. EMCC: Enhancement of Motion Chain Code for Arabic Sign Language Recognition. 2015. Available online: https://thesai.org/Publications/ViewPaper?Volume=6&Issue=12&Code=ijacsa&SerialNo=15 (accessed on 20 January 2024).
80. El Badawy, M.; Elons, A.S.; Sheded, H.; Tolba, M.F. A proposed hybrid sensor architecture for Arabic sign language recognition.
Adv. Intell. Syst. Comput. 2015, 323, 721–730. https://doi.org/10.1007/978-3-319-11310-4_63.
81. Almasre, M.A.; Al-Nuaim, H. Comparison of Four SVM Classifiers Used with Depth Sensors to Recognize Arabic Sign Language Words. Computers 2017, 6, 20. https://doi.org/10.3390/computers6020020.
82. Elpeltagy, M.; Abdelwahab, M.; Hussein, M.E.; Shoukry, A.; Shoala, A.; Galal, M. Multi-modality-based Arabic sign language
recognition. IET Comput. Vis. 2018, 12, 1031–1039. https://doi.org/10.1049/iet-cvi.2017.0598.
83. Deriche, M.; Aliyu, S.; Mohandes, M. An Intelligent Arabic Sign Language Recognition System using a Pair of LMCs with GMM
Based Classification. IEEE Sens. J. 2019, 19, 8067–8078. https://doi.org/10.1109/JSEN.2019.2917525.
84. Alnahhas, A.; Alkhatib, B.; Al-Boukaee, N.; Alhakim, N.; Alzabibi, O.; Ajalyakeen, N. Enhancing the Recognition of Arabic Sign
Language by Using Deep Learning and Leap Motion Controller. Int. J. Sci. Technol. Res. 2020, 9, 1865–1870.
85. Almasre, M.A.; Al-Nuaim, H. A comparison of Arabic sign language dynamic gesture recognition models. Heliyon 2020, 6,
e03554. https://doi.org/10.1016/j.heliyon.2020.e03554.
86. Sidig, A.A.I.; Luqman, H.; Mahmoud, S.; Mohandes, M. KArSL Dataset. Available online: https://hamzah-luqman.github.io/KArSL/ (accessed on 27 December 2023).
87. Sidig, A.A.I.; Luqman, H.; Mahmoud, S.; Mohandes, M. KArSL: Arabic Sign Language Database. ACM Trans. Asian Low-Resour.
Lang. Inf. Process. 2021, 20, 1–19. https://doi.org/10.1145/3423420.
88. Luqman, H.; El-Alfy, E.-S.M. Towards Hybrid Multimodal Manual and Non-Manual Arabic Sign Language Recognition:
mArSL Database and Pilot Study. Electronics 2021, 10, 1739. https://doi.org/10.3390/electronics10141739.
89. Luqman, H. Hamzah Luqman-mArSL Dataset. Available online: https://faculty.kfupm.edu.sa/ICS/hluqman/mArSL.html (accessed on 27 May 2024).
90. Luqman, H. An Efficient Two-Stream Network for Isolated Sign Language Recognition Using Accumulative Video Motion.
IEEE Access 2022, 10, 93785–93798. https://doi.org/10.1109/ACCESS.2022.3204110.
91. Marzouk, R.; Alrowais, F.; Al-Wesabi, F.N.; Hilal, A.M. Atom Search Optimization with Deep Learning Enabled Arabic Sign
Language Recognition for Speaking and Hearing Disability Persons. Healthcare 2022, 10, 1606.
https://doi.org/10.3390/healthcare10091606.
92. Ismail, M.H.; Dawwd, S.A.; Ali, F.H. Dynamic hand gesture recognition of Arabic sign language by using deep convolutional
neural networks. Indones. J. Electr. Eng. Comput. Sci. 2022, 25, 952–962. https://doi.org/10.11591/ijeecs.v25.i2.pp952-962.
93. Al-Onazi, B.B.; Nour, M.K.; Alshahran, H.; Elfaki, M.A.; Alnfiai, M.M.; Marzouk, R.; Othman, M.; Sharif, M.M.; Motwakel, A.
Arabic Sign Language Gesture Classification Using Deer Hunting Optimization with Machine Learning Model. Comput. Mater.
Contin. 2023, 75, 3413–3429. https://doi.org/10.32604/cmc.2023.035303.
94. Balaha, M.M.; El-Kady, S.; Balaha, H.M.; Salama, M.; Emad, E.; Hassan, M.; Saafan, M.M. A vision-based deep learning approach
for independent-users Arabic sign language interpretation. Multimed. Tools Appl. 2023, 82, 6807–6826.
https://doi.org/10.1007/s11042-022-13423-9.
95. Tubaiz, N.; Shanableh, T.; Assaleh, K. Glove-Based Continuous Arabic Sign Language Recognition in User-Dependent Mode.
IEEE Trans. Hum. Mach. Syst. 2015, 45, 526–533. https://doi.org/10.1109/THMS.2015.2406692.
96. Hassan, M.; Assaleh, K.; Shanableh, T. Multiple Proposals for Continuous Arabic Sign Language Recognition. Sens. Imaging
2019, 20, 4. https://doi.org/10.1007/s11220-019-0225-3.
97. Hisham, B.; Hamouda, A. Arabic Dynamic Gestures Recognition Using Microsoft Kinect. Sci. Vis. 2018, 10, 140–159.
https://doi.org/10.26583/sv.10.5.09.
98. Shanableh, T. Two-Stage Deep Learning Solution for Continuous Arabic Sign Language Recognition Using Word Count Prediction and Motion Images. IEEE Access 2023, 11, 126823–126833. https://doi.org/10.1109/ACCESS.2023.3332250.
99. Dosovitskiy, A. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
100. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
101. Nicolaou, M.A.; Panagakis, Y.; Zafeiriou, S.; Pantic, M. Robust Canonical Correlation Analysis: Audio-Visual Fusion for Learning Continuous Interest. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 1522–1526. https://doi.org/10.1109/ICASSP.2014.6853852.
102. Mohebali, B.; Tahmassebi, A.; Meyer-Baese, A.; Gandomi, A.H. Probabilistic Neural Networks. In Handbook of Probabilistic Models; Elsevier: Amsterdam, The Netherlands, 2020; pp. 347–367. https://doi.org/10.1016/B978-0-12-816514-0.00014-X.
103. AUTSL Benchmark (Sign Language Recognition)|Papers with Code. Available online: https://paperswithcode.com/sota/sign-
language-recognition-on-autsl (accessed on 25 November 2024).
104. WLASL: A Large-Scale Dataset for Word-Level American Sign Language. Available online: https://github.com/dxli94/WLASL
(accessed on 26 November 2024).
105. BOBSL BBC-Oxford British Sign Language Dataset. Available online: https://www.robots.ox.ac.uk/~vgg/data/bobsl/ (accessed
on 26 November 2024).
106. PHOENIX-Weather-2014T. Available online: https://www-i6.informatik.rwth-aachen.de/~koller/RWTH-PHOENIX-2014-T/ (accessed on 26 November 2024).
107. MS-ASL Dataset. Available online: https://www.microsoft.com/en-us/research/project/ms-asl/ (accessed on 26 November 2024).
108. Ryumin, D.; Ivanko, D.; Ryumina, E. Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices. Sensors 2023,
23, 2284. https://doi.org/10.3390/s23042284.
109. Papadimitriou, K.; Potamianos, G. Sign Language Recognition via Deformable 3D Convolutions and Modulated Graph Convolutional Networks. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. https://doi.org/10.1109/ICASSP49357.2023.10096714.
110. Maruyama, M.; Singh, S.; Inoue, K.; Roy, P.P.; Iwamura, M.; Yoshioka, M. Word-Level Sign Language Recognition with Multi-
Stream Neural Networks Focusing on Local Regions and Skeletal Information. IEEE Access 2024, 12, 167333–167346.
https://doi.org/10.1109/ACCESS.2024.3494878.
111. Boháček, M.; Hrúz, M. Sign Pose-based Transformer for Word-level Sign Language Recognition. In Proceedings of the 2022
IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 4–8 January
2022; pp. 182–191. https://doi.org/10.1109/WACVW54805.2022.00024.
112. Camgoz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Sign Language Transformers: Joint End-to-End Sign Language Recognition
and Translation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle,
WA, USA, 13–19 June 2020; pp. 10020–10030. https://doi.org/10.1109/CVPR42600.2020.01004.
113. Camgoz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Multi-channel Transformers for Multi-articulatory Sign Language Translation. In Proceedings of the Computer Vision—ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; pp. 301–319. https://doi.org/10.1007/978-3-030-66823-5_18.
114. Shen, X.; Zheng, Z.; Yang, Y. StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition. ACM
Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–19. https://doi.org/10.1145/3656046.
115. Vaezi Joze, H.R.; Koller, O. MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language. Available online: https://arxiv.org/pdf/1812.01053 (accessed on 25 November 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.