Keywords: Arabic sign language recognition; Deep learning; YOLOv8; Real-time communication; Deaf-mute communication
Abstract
Deaf-mute individuals encounter substantial difficulties in their daily lives due to communication impediments. These individuals may encounter difficulties in social contact, communication, and capacity to acquire knowledge and engage in employment. Recent studies have contributed to decreasing the gap in communication between deaf-mute people and normal people by studying sign language interpretation. In this paper, a real-time Arabic avatar system is created to help deaf-mute people communicate with other people. The system translates text or spoken input into Arabic Sign Language (ArSL) movements that the avatar makes using deep-learning-based translation. The dynamic generation of the avatar movements allows smooth and organic real-time communication. In order to improve the precision and effectiveness of ArSL translation, this study depends on a state-of-the-art deep learning model, which makes use of YOLOv8, to recognize and interpret sign language gestures in real time. The avatar is trained on three diverse datasets of Arabic sign language images, namely Sign-language-detection Image (SLDI), Arabic Sign Language (ArSL), and RGB Arabic Alphabet Sign Language (AASL), enabling it to accurately capture the nuances and variations of hand movements. The best recognition accuracy of the suggested approach was 99.4% on the AASL dataset. The experimental results of the suggested approach demonstrate that deaf-mute people will be able to communicate with others in Arabic-speaking communities more effectively and easily.
1. Introduction
The World Health Organization (WHO) declared that over 5% of people on the planet are deaf, and they face significant challenges
in communicating with normal people. Additionally, the WHO predicts that by 2050, 1 in every 10 people will have a disabling hearing loss [1]. Deaf people use hand gestures and movements that correspond to alphabets and words to communicate with other people, which is known as sign language [2]. The World Federation of the Deaf (WFD) states that there are 200+ sign languages
worldwide [3]. Deaf people face real challenges in expressing their thoughts and needs to normal people without an interpreter for their signs. This has a serious impact on their social and daily life experiences. Additional challenges may arise if (i) a signer communicates with someone who does not understand signs, or (ii) a signer communicates with a user of a different sign language (among the existing 200+ sign languages).
Recently, various studies have contributed to helping deaf people to communicate with normal people by providing automatic sign
language translators (SLT), which aim to produce spoken words for the provided sign language, or vice versa [4,5]. Automatic SLT
provides a unified tool that allows communication between signers and non-signers by translating the signs into either a different, mutually understood sign language or spoken/written words, and vice versa [6,7].
Recent advancements in modern technologies can facilitate the process of automatic SLT. For example, convolutional neural
networks (CNNs) have proven particularly effective in this domain, as they can learn the complex spatial and temporal patterns
inherent in sign language gestures [8,9]. In addition, recent developments in computer vision, machine learning, and natural language
processing have paved the way for innovative solutions in automatic SLT [10]. Moreover, wearable sensors, such as data gloves and
motion capture systems, can provide precise hand and finger tracking, enabling the capture of fine-grained movement data [11].
Computer vision algorithms can then be employed to extract features from the sensor data and translate them into corresponding sign
language gestures [12,13]. The combination of deep learning, sensor-based devices, and computer vision techniques holds great
promise for the development of robust and accurate automatic SLT systems. These systems have the potential to break down
communication barriers between deaf and hearing individuals, empowering deaf people to participate more fully in society.
Different challenges can be faced when building an automatic SLT, ranging from collecting the required data for training the model to deploying the trained model [14–16].
Researchers have explored various approaches to build robust and accurate SLR systems. Two prominent sub-categories have
emerged: pixel-based and skeleton-based approaches. The pixel-based approaches directly analyze the entire image or video frame
containing the signer’s hand(s) and body posture. CNNs are frequently employed for feature extraction and classification. These
approaches can capture intricate details of signs, including hand orientation, finger configuration, and facial expressions which
contribute to meaning in some sign languages [17]. On the other hand, the skeleton-based approaches focus on extracting key points
(joints) from the signer’s hand and body, forming a skeletal representation [18]. Techniques like CNNs and Support Vector Machines
(SVMs) are often used for recognition based on the movement trajectories of these key points [19,20]. These approaches are more
robust to background variations and lighting conditions [21].
In this paper, a deep learning-based Arabic Sign Language (ArSL) translator is presented. The suggested model depends on the
YOLOv8 framework to recognize and interpret sign language gestures in real time, which helps deaf and mute people to communicate
with each other in the Arabic language. The contributions of the current work can be summarized as follows:
• Reviewing the literature on various constructed systems and frameworks in the context of automatic SLT.
• Proposing a deep-learning-based framework using YOLOv8 architecture for interpreting Arabic sign language. This helps deaf and
mute Muslims to understand the meanings and interpretations of the Holy Qur’an and perform Islamic rituals.
• Overcoming the problem of available dataset limitations and user-dependent datasets by adopting diverse datasets and data
augmentation techniques to achieve system generalization.
• Developing a real-time translation method that may be deployed as a mobile application.
• Conducting various experiments and comparing the results with those of other state-of-the-art works.
The rest of the paper is organized as follows. Section 2 presents a discussion of some related works and studies. The proposed ArSL
translation framework is presented in detail in Section 3. The experiments, their results, and a comparative analysis are presented in
Section 4. The paper conclusion is given in Section 5.
2. Literature review
In this section, recent related works are investigated and analyzed to discover and highlight research gaps. Here, various
recent works related to automatic sign language and Arabic sign language interpretation using advancements in artificial intelligence
(AI), transfer learning, and computer vision techniques are discussed.
CNNs have demonstrated significant advancements in various fields, including visual object recognition [22,23], natural language processing [24], and medical image processing [25–28], among others. However, there is a lack of research on the utilization of CNNs for video classification, owing to the difficulty of adapting CNNs to incorporate both spatial and temporal data [29].
Computer vision is a subfield of artificial intelligence that aims to extract meaning from images and videos, similar to how the human visual system perceives and comprehends its surroundings [30]. Computer vision intersects computer science, machine learning, and image processing, and its applications span image classification [31], object detection [32], image segmentation [30], medical image analysis [33,34], and robotics [35].
Recently, CNNs and computer vision approaches have been used as a standard practice to improve the accuracy of SLR models. For
example, Sreemathy et al. [36] proposed an image-based SLR model to interpret 80 static signs. They adopted both YOLOv4 and
Mediapipe with SVM classifiers to recognize the hand gestures of the 80 words. Their method achieved an accuracy of 98.8% for the
YOLOv4 classifier and 98.62% for the Mediapipe with an SVM classifier.
Attia et al. [37] developed a method for automatic SLR based on YOLOv5 architecture. They presented three models based on
YOLOv5 structure with attention techniques to recognize alphabets and numbers from hand gestures. They identified that the attention
mechanism helps in improving the detection performance through observing important areas and distinguishing objects from their
surroundings, efficiently, even with occlusions or cluttered backgrounds. The authors validated their methods on two datasets: MU
HandImages ASL and OkkhorNama, and some augmentation techniques were applied to the images. The models achieved a 98% F1
score. The authors of [38] created an automated Bangla Sign Language (BSL) detection system using machine vision technology and an
embedded device. Detectron2, EfficientDet, and YOLOv7 with Jetson Nano are among the deep learning methods that have been
trained on the Okkhornama open-source dataset. The Jetson Nano, however, might cost more than other available edge computing
products, which would make it unaffordable for certain clients or projects with tight budgets. Al Ahmadi et al. [39] presented a
YOLOv8-based model for ArSL recognition. They relied on both channel attention and spatial attention modules with a cross-convolution module to extract and process features from input images. The validation of the model revealed that it achieved a classification accuracy of 99% on the ArSL21L dataset.
Miah et al. [19] developed an automatic sign recognition method based on hand skeleton joint information to overcome partial
occlusion and redundant background problems that may exist in images. Their method involves capturing the positions of joints in the
skeleton and the movement of these joints to analyze the whole body movement of a person during sign language gestures. This information is then fed into two separate streams. The first stream processes joint key features through a combination of a Separable
Temporal Convolutional Network (Sept-TCN), a Graph Convolutional Network (GCN), and an attention module. The second stream
processes the joint motion. Features extracted from the two streams are fused to extract spatial-temporal features from sign language
videos.
The authors of [4] developed an application named Sign4PSL that translates sentences to Pakistan Sign Language (PSL) for deaf
people with visual representation using virtual signing characters. The application is compatible with several platforms, such as web
and mobile. This system accepts English language text as input, translates it into sign language, and visually represents the movements
through a virtual character. The system attained an accuracy of 100% when processing alphabets, numerals, words, and phrases. Nevertheless, when it comes to sentences, the architecture attained an accuracy of 80%. In [40], the authors developed a computer
vision system that depends on deep convolutional neural networks to translate Amharic alphabets into Ethiopian Sign Language. The
suggested model achieved a training accuracy of 98.5%, a validation accuracy of 95.59%, and a testing accuracy of 98.3% in signal
identification. Furthermore, this technology has the capability to transform Amharic sign language visuals into written text.
In [41], the authors proposed a multistream deep learning model for recognizing signs of Brazilian, Indian, and Korean sign
languages. They incorporated 3D CNN networks and Generative Adversarial Networks (GAN) to generate depth maps of sign articulation. The method receives information from facial expressions, hands, and distances between joints and provides a visual explanation to identify which regions are important for model decision making. The method achieved an accuracy of 91% and an F1-score of
90% on a Brazilian sign language dataset. The authors of [42] introduced a framework called SignGraph, a pose-based SLR model using
a graph convolution network (GCN) and a residual neural network (RNN). They relied on Mediapipe to extract spatiotemporal features
from 65 joints from the face, hands, and pose from video sequences, which are given as input to the model. They used a ResNet-based
architecture for recognition. The authors reported that their method is computationally efficient in terms of the number of learnable
parameters, the computational cost, the number of flops, and the inference time.
Lee et al. [12] developed a wearable hand device to interpret sign language from hand gestures. They attached flex sensors,
pressure sensors, and IMU sensors on a signer’s hand to distinguish language characters. Their system involves three modules: a sensor
module, a processing module, and a display module on a mobile application. First, sensor data are analyzed using an embedded SVM
classifier to recognize characters. Second, the recognized character is transmitted to a mobile device through BLE communication. The
mobile application converts the received text into voice output. The system achieved an average accuracy of 65.7% without pressure
sensor and 98.2% with pressure sensor.
Sharma et al. [43] proposed a transfer-learning-based methodology for continuous sign language translation. Data was collected
using multiple IMUs on both hands. Their architecture incorporated CNN, Bi-LSTM, and connectionist temporal classification layers.
First, they trained the model on static isolated signs which enhanced the classification of continuous sentences of sign data. A limited
amount of data is used to train the lower layers of the pre-trained network. The classification accuracy reached 88.5%. Barbhuiya et al.
proposed a transfer-learning-based SLR method for human-computer interaction (HCI) applications. The authors utilized pre-trained AlexNet and VGG16 architectures for feature extraction along with an SVM classifier. They validated their method using a
dataset of 36 characters for 5 persons. They reported an accuracy of 99.82%.
Naz et al. [18] presented a pose-based approach for SLR. This method involves three steps: pose extraction, handcrafted feature
generation, and feature space mapping and recognition. The pose-based features include joints, bone lengths, and bone angles. A
lightweight residual graph convolutional network (ResGCN) and a new part attention method incorporate body spatial and temporal
information in a compact feature space and recognize signs. The presented technique achieved an accuracy of 83.33%.
AbdElghfar et al. [44] proposed an Arabic SLR model to help deaf and dumb Muslims recite the Holy Qur’an. They used a
CNN-based deep learning model to extract the features and recognize hand motions referring to the fourteen dashed Qur'anic letters. They
used 24,137 images from the ArSL2018 Arabic sign language dataset. The testing accuracy of the model with SMOTE technique was
97.67%.
Alsaadi et al. [45] proposed an Arabic alphabet sign language recognition model using deep learning and transfer learning techniques. They utilized the ArSLA dataset, consisting of 54,049 images representing 32 Arabic letters. The authors employed a number of CNN architectures, including VGG16, ResNet50, EfficientNet, and AlexNet. The results indicated that the AlexNet architecture achieves the highest accuracy of 94.81%.
Islam et al. [46] presented a novel deep learning approach that leverages stacked autoencoders and the Internet of Things (IoT)
infrastructure to refine feature extraction and classification for accurate ArSL recognition. Saleem et al. [47] proposed a machine
learning-based system for two-way communication between deaf and mute (DnM) and non-deaf and mute (NDnM) individuals. Their
system depends on deep learning for hand gesture recognition and supervised machine learning for multi-language support. This
approach highlights the potential of machine learning in bridging communication gaps but may require user training on the specific
system interface.
Balaha et al. [14] proposed a deep-learning-based approach for Arabic sign language recognition. They created an Arabic sign
language dataset with 8467 videos of 20 Arabic words, using a mobile phone. They developed a method combining CNN and RNN to
interpret sign language based on the features extracted from video frames. They adopted two CNN networks for feature extraction, and
the features extracted from the two streams are fused to produce a feature sequence. The RNN network is used as a final classifier to
produce the final prediction. Their approach achieved a testing accuracy of 98%.
Kamruzzaman [48] proposed a CNN-based Arabic sign language recognition model that can translate the recognized words into
audible speech. The author also created an image dataset with 125 images for each of 31 letters of Arabic sign language. These images
are subjected to a number of preprocessing techniques, such as resizing, color conversion, and augmentation. The recognized Arabic
words are converted into speech using Google Text to Speech API. The achieved accuracy was 90%. The author reported that the
system can be further improved by employing an Xbox Kinect device.
3. The proposed ArSL translation framework
The proposed ArSLGen algorithm is intended to dynamically translate Arabic Sign Language (ArSL) from spoken or written Arabic
using deep learning, with a focus on real-time object detection. The algorithm consists of four basic phases, each of which contributes
to the overall efficacy and efficiency of the Arabic Sign Language translation process. The architecture of the proposed framework
consists of four main phases as illustrated in Fig. 1, which are: (i) Data Collection and Preprocessing, (ii) Model Training with YOLOv8,
(iii) Performance Evaluation and Metrics, and (iv) Real-time Translation and Validation.
3.1. Data collection and preprocessing
The main objective of this phase is to compile an extensive dataset of ArSL gestures and preprocess it in order to get it ready for model training. The collection ensures a diverse representation of ArSL motions and expressions by drawing from a variety of sources, such as videos, photographs, and motion capture data.
An important stage is data annotation, where each gesture is labeled with its meaning or translation in ArSL. This annotated dataset forms the basis for supervised learning, helping the model learn how to translate ArSL gestures into their corresponding verbal expressions. To guarantee uniformity in the input data, the photos or videos are resized to a standard resolution during preprocessing. Normalizing pixel values to a consistent scale makes training of the model easier. Rotation, flipping, and scaling are examples of data augmentation techniques that can be used to improve the model's capacity to generalize to new data and to diversify the dataset. Ultimately, the preprocessed dataset is divided into training, validation, and test sets to ensure accurate assessment of the model's effectiveness. The overall steps of the Data Collection and Preprocessing phase are shown in Algorithm 1.
Algorithm 1
Data collection and preprocessing algorithm.
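For illustration, the following is a minimal Python sketch of the preprocessing steps described above (resizing to a uniform resolution, normalizing pixel values, and splitting into training, validation, and test sets). It is not the authors' exact Algorithm 1; the directory layout, image size, and split ratios are assumptions.

```python
import glob
import random
import cv2
import numpy as np

def preprocess_images(image_dir, img_size=640):
    """Resize and normalize all JPEG images found under image_dir."""
    samples = []
    for path in glob.glob(f"{image_dir}/*.jpg"):
        img = cv2.imread(path)                        # BGR image as a NumPy array
        if img is None:
            continue                                  # skip unreadable files
        img = cv2.resize(img, (img_size, img_size))   # uniform input resolution
        img = img.astype(np.float32) / 255.0          # scale pixel values to [0, 1]
        samples.append((path, img))
    return samples

def split_dataset(samples, train=0.8, val=0.1, seed=42):
    """Shuffle and split samples into train/validation/test subsets."""
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(n * train), int(n * val)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# Example usage with a hypothetical dataset folder:
# train_set, val_set, test_set = split_dataset(preprocess_images("AASL/images"))
```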
The three primary subphases of the Data Collection and Preprocessing procedure are (i) Gathering Datasets, (ii) Manual Annotation, and (iii) Data Augmentation.
i. Gathering Datasets
Several datasets containing Arabic sign language motions are collected in this sub-phase. In this paper, Sign-language-detection Image (SLDI), Arabic Sign Language (ArSL), and RGB Arabic Alphabet Sign Language (AASL) are the curated datasets. The
goal of this phase is to give a complete representation of Arabic sign language motions, encompassing a range of facial expressions and
movements.
ii. Manual Annotation
The manual annotation sub-phase involves labeling images after data gathering. Every image in the datasets has been annotated
with Arabic text transcripts and bounding boxes surrounding hands. This generates a cleanly labeled dataset as illustrated in Fig. 2 that
serves as the basis for training and assessing the model in the future.
iii. Data Augmentation
Strategies for data augmentation are employed to broaden the diversity of the datasets. In this study, various techniques are used with
images for this purpose, including rotation, flipping, zooming, stretching, and adjusting brightness. The goal of this sub-phase is to
increase the model resilience by subjecting it to a greater variety of sign language gestures and environmental factors. It allows the
model to efficiently generalize to different hand morphologies and activities.
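As a sketch of these augmentation operations (rotation, flipping, zooming/scaling, and brightness adjustment), the snippet below uses the albumentations library; the specific limits and probabilities are illustrative assumptions rather than the values used in this work.

```python
import cv2
import albumentations as A

# Illustrative augmentation pipeline covering the operations named above;
# parameter values are assumed, not taken from the paper.
augment = A.Compose([
    A.Rotate(limit=20, p=0.5),                                 # random rotation within +/-20 degrees
    A.HorizontalFlip(p=0.5),                                   # random horizontal flip
    A.RandomScale(scale_limit=0.2, p=0.5),                     # random zoom in/out (scaling/stretching)
    A.RandomBrightnessContrast(brightness_limit=0.2, p=0.5),   # brightness adjustment
])

image = cv2.imread("sample_sign.jpg")            # hypothetical input image
augmented = augment(image=image)["image"]        # augmented copy of the image
cv2.imwrite("sample_sign_aug.jpg", augmented)
```

When bounding-box annotations accompany the images, the same pipeline can be wrapped with A.BboxParams so that the boxes are transformed consistently with the pixels.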
The data augmentation processes utilized in the proposed method with the speech signals are shown in Fig. 3. With this step, the
model is exposed to more environmental factors to enhance its robustness. In this step, random noise with different types and SNRs is
added to the speech signals. The inclusion of noise enhances the model ability to withstand disturbances and uncertainties in the input
data, hence optimizing its application in noisy or less controlled environments.
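The additive-noise step can be sketched as follows; this is a minimal example assuming white Gaussian noise and a single-channel speech array, not necessarily the exact noise model used in this study.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add white Gaussian noise to a speech signal at a target SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(speech ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10.0))   # SNR = P_signal / P_noise
    noise = rng.normal(0.0, np.sqrt(noise_power), size=speech.shape)
    return speech + noise

# Example: corrupt a (hypothetical) clean waveform at several SNR levels.
# for snr in (20, 10, 5):
#     noisy = add_noise_at_snr(clean_speech, snr)
```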
3.2. Model training with YOLOv8
The deep learning model is trained to recognize and detect ArSL motions in input images or videos during the model training phase using YOLOv8. YOLOv8 is a cutting-edge object detection system known for its precision and high speed [49]. In order to precisely predict bounding boxes and class labels for ArSL gestures in real-time scenarios, the model is trained by adjusting its parameters.
Throughout the training phase, the dataset is iterated over, batches of images are fed through the model, the loss is computed, and an
optimizer is used to update the model weights. At that point, real-time ArSL translation can be performed using the learned YOLOv8
model. The general procedure for training the model with YOLOv8 is described in Algorithm 2.
3.3. Performance evaluation and metrics
The proposed algorithm is evaluated in this phase using a range of measures to identify the efficacy of the model in translating ArSL
motions into spoken or written Arabic. In order to assess the model predictions against the ground truth translations, three datasets,
including SLDI, ArSL, and AASL are employed. To measure the performance of the model, metrics including accuracy, precision, recall,
and F1-score are calculated. These metrics give a quantitative assessment of the model efficacy by indicating how well it can translate
ArSL motions. In addition, researchers can find cases in which the model might be deficient and change it to increase overall performance by examining the performance indicators.
In order to make sure that the model can translate ArSL gestures in practical settings, the Performance Evaluation and Metrics phase
is essential for the development of the ArSLGen algorithm. The overall steps of the Performance Evaluation and Metrics phase are shown in
Algorithm 3.
The trained YOLOv8 model and the test dataset are the inputs for this algorithm, and the computed performance metrics are the
output. The test dataset is loaded for assessment by the algorithm after the evaluation metrics have been defined. Evaluation metrics
are computed by comparing the model prediction with the ground truth for every sample in the test dataset. The average performance
metrics across all samples are then calculated by the algorithm and saved for later examination.
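A minimal sketch of this evaluation step using the ultralytics API is shown below. The weight path, dataset YAML, and attribute names follow current ultralytics conventions and are assumptions; the sketch is illustrative and does not reproduce the authors' exact Algorithm 3.

```python
from ultralytics import YOLO

# Load the trained detector (hypothetical weights file) and evaluate it on the
# held-out split defined in the dataset YAML.
model = YOLO("runs/detect/train/weights/best.pt")
metrics = model.val(data="arsl_dataset.yaml", split="test")

# The returned object exposes aggregate detection metrics across all classes.
print("precision:     ", metrics.box.mp)      # mean precision
print("recall:        ", metrics.box.mr)      # mean recall
print("mAP@0.5:       ", metrics.box.map50)   # mean average precision at IoU 0.5
print("mAP@0.5:0.95:  ", metrics.box.map)
```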
Fig. 3. Data augmentation steps.
Algorithm 2
Model training with YOLOv8 algorithm.
Algorithm 3
Performance evaluation and metrics algorithm.
3.4. Real-time translation and validation
In this phase of the proposed ArSLGen algorithm, the trained model is put into practice in a real-time environment to translate ArSL gestures into spoken or written Arabic. In addition, the model performance is validated in real-time scenarios to make sure it can reliably and quickly translate ArSL gestures as they are made. The objective is to give deaf-mute people in Arabic-speaking communities a smooth and useful communication tool. The overall steps of the Real-time Translation and Validation phase are shown in Algorithm 4.
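Such a real-time loop can be sketched as follows, assuming a webcam feed processed with OpenCV and a trained YOLOv8 weight file; the paths, class-name mapping, and display details are illustrative, and the sketch is not the authors' Algorithm 4.

```python
import cv2
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # hypothetical trained weights
cap = cv2.VideoCapture(0)                          # default camera

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Run detection on the current frame; results[0] holds boxes and class ids.
    results = model(frame, verbose=False)
    for box in results[0].boxes:
        cls_id = int(box.cls[0])
        label = model.names[cls_id]                # recognized ArSL sign label
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, label, (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imshow("ArSL translation", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```

The recognized labels could then be passed to a text-to-speech or avatar component to complete the two-way communication pipeline described above.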
The ArSLGen method is designed to dynamically transform ArSL movements into text or spoken Arabic in real time, leveraging
deep learning techniques. The algorithm is divided into four main stages: Data Collection and Preprocessing, which gathers and
prepares a diverse dataset of ArSL gestures for training; Model Training with YOLOv8, which depends on the YOLOv8 architecture to
train a deep learning model to detect and translate ArSL gestures; Performance Evaluation and Metrics, which is performed to assess
the performance of the trained model using different metrics; and Real-time Translation and Validation, which depends on the trained
model to translate ArSL gestures in a real-time setting. The ArSLGen algorithm seeks to improve the communication and social interaction skills of deaf-mute people in Arabic-speaking communities by offering a smooth and efficient communication tool.
4. Results
This section provides a description of the used datasets, performance measurements, and a comparison with the most cutting-edge
techniques currently in use.
Algorithm 4
Real-time translation and validation algorithm.
4.1. Dataset
Three comprehensive collections of images were created to train and validate the proposed approach to recognize gestures in ArSL.
The first dataset is the RGB Arabic Alphabet Sign Language (AASL) dataset [50]. It includes thirty-one classes, such as Ain, Al, Alef,
Beh, Dal, Feh, Ghain, Hah, Heh, Jeem, Kaf, Khah, Laa, Lam, Meem, Noon, Qaf, Reh, Seen, Sheen, Tah, Teh, Teh_Marbuta, Thal, Theh,
Waw, Yeh, Zah, and Zain. The dataset constitutes a total number of 7534 images; 6027 of them were used for training and 1507 for
testing. Each image in the dataset involves a manual annotation with bounding boxes around the hands and the accompanying Arabic
text transcription. Samples of the images from this dataset are shown in Fig. 4.
The second adopted dataset is the Sign-Language-Detection Image (SLDI) dataset [51]. Thirty classes covering all Arabic letters are included. The dataset contains 7494 images, which are partitioned into 5234 training images, 1129 testing images, and 1131 validation images. Some of the images from this dataset are displayed in Fig. 5.
The third dataset is the Arabic sign language (ArSL) dataset [52]. There are 5832 images in this dataset, each with 416 × 416 pixels.
The testing set consists of 290 images, the training set of 4651 images, and the validation set of 891 images. These images were
captured with a cell phone camera in a variety of environments, with different backgrounds and hand orientations. Samples of this
dataset are displayed in Fig. 6.
The following metrics are used to compare the state-of-the-art methods with the proposed model for Arabic speech/text translation into Arabic Sign Language using YOLOv8: (i) Precision: the portion of predicted positive outcomes that are actually positive, out of all predicted positive outcomes. (ii) Recall: the portion of actual positive outcomes that are correctly predicted as positive. (iii) F1-score: the harmonic mean of recall and precision. (iv) mAP: the mean of the average precision over all classes. (v) Matthews Correlation Coefficient (MCC): a metric used to evaluate the performance of classification models. Unlike accuracy, which can be misleading on imbalanced datasets, MCC takes into account true positives, true negatives, false positives, and false negatives, providing a more robust measure of model performance. MCC values are interpreted as follows: 1 indicates perfect prediction, values between 0 and 1 indicate good prediction, 0 indicates random prediction (no better than chance), and values between −1 and 0 indicate poor prediction. We also discuss the quantitative analysis that was conducted to evaluate the relative effectiveness of various strategies. Prominent methods that use YOLO architectures and deep learning approaches to identify Arabic sign language are compared with the proposed model for translating Arabic speech/text into Arabic Sign Language.
\text{Recall} = \frac{TP}{TP + FN} \tag{1}

\text{Precision} = \frac{TP}{TP + FP} \tag{2}

\text{F1-score} = 2 \times \frac{\text{recall} \times \text{precision}}{\text{recall} + \text{precision}} \tag{3}

\text{mAP} = \frac{1}{n} \sum_{k=1}^{n} AP_k \tag{4}

\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{5}
Table 1
Configuration parameters for the YOLOv8 model.
Parameters Values
Epochs 100
Learning rate 0.01
Image size 640
Batch size 8
Number of images 20,860
Number of training images 15,912
Number of testing images 4948
Layers 225
Fig. 7. Overall performance of the suggested model for translating Arabic speech/text into Arabic Sign Language using YOLOv8 with the
three datasets.
Fig. 8. The precision-recall and F1-confidence curves of the suggested model for translating Arabic speech/text into Arabic Sign Language using
YOLOv8 with the three datasets.
Fig. 9. The labeled validation results of the suggested model for translating Arabic speech/text into Arabic Sign Language using YOLOv8 with the three datasets.
where TP (True Positive) denotes the number of cases correctly classified as positive, FP (False Positive) denotes the number of cases incorrectly classified as positive, and TN (True Negative) and FN (False Negative) denote the numbers of cases correctly and incorrectly classified as negative, respectively. In addition, APk is the average precision for class k, and n is the number of classes.
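For completeness, Eqs. (1)–(5) can be computed directly from these confusion-matrix counts. The sketch below is a straightforward implementation of the formulas (with illustrative counts in the usage comment) and does not reproduce the authors' evaluation code.

```python
import math

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute recall, precision, F1-score, and MCC from confusion-matrix counts (Eqs. 1-3, 5)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"recall": recall, "precision": precision, "f1": f1, "mcc": mcc}

def mean_average_precision(per_class_ap: list[float]) -> float:
    """Eq. (4): mAP is the mean of the per-class average precision values."""
    return sum(per_class_ap) / len(per_class_ap)

# Example with illustrative counts (not taken from the paper):
# print(classification_metrics(tp=95, fp=2, tn=97, fn=5))
```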
We employed stratified 10-fold cross-validation to evaluate the model performance on the unseen evaluation data. This technique partitions the data into 10 folds while preserving class proportions, where 9 folds are used for training and the remaining fold for testing. This process is
repeated 10 times, ensuring all data points are used for both training and testing. The final reported performance metrics (MCC,
precision, recall, F1-score) represent the average values obtained across all 10 folds.
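A minimal sketch of this cross-validation protocol with scikit-learn is given below; the feature/label arrays and the fit_predict callable are placeholders, since in this work the folds are run over the detection datasets rather than a generic feature matrix.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import matthews_corrcoef, f1_score

def stratified_cv_scores(X, y, fit_predict, n_splits=10, seed=42):
    """Average MCC and macro F1 over stratified folds.

    fit_predict(X_train, y_train, X_test) -> predicted labels for X_test.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mccs, f1s = [], []
    for train_idx, test_idx in skf.split(X, y):
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        mccs.append(matthews_corrcoef(y[test_idx], y_pred))
        f1s.append(f1_score(y[test_idx], y_pred, average="macro"))
    return float(np.mean(mccs)), float(np.mean(f1s))
```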
The proposed YOLOv8-based model is trained using the three preprocessed datasets. The architecture of YOLOv8 consists of multi-layered CNNs that enable real-time object detection. After training, the model was able to identify and locate Arabic sign language gestures in real time. The training procedure used 100 epochs, a batch size of 8, and a learning rate of 0.01 with an Adam optimizer. To speed up convergence, we employed pre-trained weights from a general-purpose object
detection model. The model’s parameters were modified during the training phase by iterating through the dataset for a predefined
number of epochs using gradient descent and backpropagation. A detailed summary of the YOLOv8 configuration parameters is
displayed in Table 1.
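Using the ultralytics API, the configuration in Table 1 corresponds to a training call like the one below. The dataset YAML path is an assumption, and the sketch is illustrative rather than the authors' exact training script.

```python
from ultralytics import YOLO

# Start from COCO-pretrained weights to speed up convergence, then fine-tune
# on the ArSL data described by a (hypothetical) dataset YAML file.
model = YOLO("yolov8n.pt")
results = model.train(
    data="arsl_dataset.yaml",  # paths to the train/val images and class names
    epochs=100,                # as listed in Table 1
    imgsz=640,                 # image size
    batch=8,                   # batch size
    lr0=0.01,                  # initial learning rate
    optimizer="Adam",          # Adam optimizer, as stated in the text
)
```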
The proposed model for translating Arabic speech/text into Arabic Sign Language using YOLOv8 was developed, trained, and
verified on the GPU platform on Kaggle. The results of the experiments demonstrate that our model can identify Arabic Sign Language
with mean average precision (mAP) values of 99.5%, 99.2%, and 98.7%, respectively, across all classes for the AASL, SLDI, and ArSL
datasets.
Fig. 7 displays the results of recall, accuracy, and loss of the proposed model across the three datasets. For each of the three datasets,
the F1-score and precision-recall curves are displayed in Fig. 8. Moreover, Fig. 9 displays the validation results of the proposed model
for the three datasets.
Table 2 and Fig. 10 display the detailed results of the proposed model. For the AASL dataset, the achieved results for precision,
Table 2
Performance comparison of the proposed model for the three adopted datasets.
Dataset | Precision (%) | Recall (%) | mAP (%)
AASL | 99.4 | 99.5 | 99.5
SLDI | 99.2 | 99.3 | 99.2
ArSL | 98.2 | 99.8 | 98.7
Table 3
A comparison of the proposed YOLOv8-based model's performance with different techniques.
Reference | Model | Recall% (↑) | mAP% (↑) | Precision% (↑) | MCC (↑) | Dataset
Alsaadi et al. [45] | AlexNet | – | – | 94.81 (accuracy) | 0.83 | 54,049 grayscale JPEG pictures, each with a 64 × 64 resolution, corresponding to the 32 Arabic characters
Attia et al. [37] | YOLOv5x | 96.6 | 98.5 | 97.6 | 0.96 | 3 datasets
Kamruzzaman [48] | EfficientNetB4 | 96.2 | – | 95.6 | 0.95 | ArSL2018 dataset with 54,049 images
Balaha et al. [14] | CNN+RNN | – | – | 98 (accuracy) | 0.87 | 8467 videos of 20 signs
Al-Barham et al. [53] | Modified ResNet-18 model | – | – | 99.47 (accuracy) | 0.88 | ArSL2018 dataset with 54,049 images
AbdElghfar et al. [54] | QSLRS-CNN | – | – | 97.13 (accuracy) | 0.86 | 24,137 images dataset
Dabwan et al. [55] | CNN model with EfficientNetB1 scaling | – | – | 97.9 (accuracy) | 0.86 | 16,192 images dataset
Sreemathy et al. [36] | YOLOv4 | 99.17 | 98.17 | 97.2 | 0.98 | 676 images
Al-Barham et al. [56] | ResNet-18 | – | – | 96.36 (accuracy) | 0.84 | 54,000 images
Al Ahmadi et al. [39] | YOLOv8 | – | – | 99 (accuracy) | – | ArSL21L dataset with 14,202 images
Proposed technique | YOLOv8 | 99.5 | 99.5 | 99.4 | 0.99 | AASL (7534 images)
Proposed technique | YOLOv8 | 99.3 | 99.2 | 99.2 | 0.99 | SLDI (7494 images)
Proposed technique | YOLOv8 | 99.8 | 98.7 | 98.2 | 0.99 | ArSL (5832 images)
recall, and mAP are 99.4%, 99.5%, and 99.5%, respectively. Additionally, for the SLDI dataset, the suggested model achieved results
for precision, recall, and mAP of 99.2%, 99.3%, and 99.2%, respectively. For the ArSL dataset, the suggested model obtained 98.2%,
99.8%, and 98.7% for precision, recall, and mAP, respectively.
Table 3 compares the performance of the proposed model with various approaches in the literature. It is clear that the proposed
model outperforms existing methods in terms of precision, recall, and mAP, indicating its efficiency in translating Arabic speech or text
into Arabic sign language using YOLOv8.
Furthermore, Table 3 illustrates the performance comparison of deep learning architectures using MCC. As Table 3 reveals, the
proposed YOLOv8 achieves the highest MCC score (0.99), indicating exceptional performance in recall, mean Average Precision
(mAP), and precision. This surpasses other architectures like AlexNet (0.83 MCC) and ResNet-18 (0.84 MCC). Notably, YOLOv4 also
exhibits strong results with an MCC of 0.98. These findings provide a robust evaluation of the proposed model, highlighting its superiority over existing architectures.
Moreover, compared to previous versions of YOLO, the proposed YOLOv8-based model outperforms the YOLOv5x-based technique [37] and the YOLOv4-based technique [36] for the same task.
The results of the comparison demonstrate how effectively YOLOv8 converts Arabic voice or text to Arabic sign language. The
model can recognize and translate sign language motions with high accuracy and performs well in real-time. The authors hope that
these comparisons will demonstrate the superiority of the proposed model and illustrate its practical application in enhancing
communication between the Arabic-speaking community and individuals with hearing impairments.
Training Set Performance:
To evaluate the model’s learning behavior during training, we also measured performance metrics on the training dataset. The
model achieved an MCC score of 0.99 on the training data, indicating that it learned effectively from the training data; together with the comparable score on the test data, this suggests that the model generalizes well to unseen data.
5. Conclusion
This paper proposes a methodology for automatic Arabic sign language recognition. This helps deaf and mute Muslims to understand the meanings and interpretations of the Holy Qur'an and perform Islamic rituals. The proposed model is based on YOLOv8
architecture, which translates Arabic sign language and Arabic speech into text. The model can effectively recognize and translate
Arabic sign language movements, as demonstrated by the high results of precision, recall, and mAP attained across all three datasets
(AASL, SLDI, and ArSL). For the AASL dataset, the achieved results for precision, recall, and mAP are 99.4%, 99.5%, and 99.5%,
respectively. Additionally, for the SLDI dataset, the suggested model achieved results for precision, recall, and mAP of 99.2%, 99.3%,
and 99.2%, respectively. For the ArSL dataset, the suggested model obtained 98.2%, 99.8%, and 98.7% for precision, recall, and mAP,
respectively. The enhanced efficacy of the suggested model compared to current techniques, such as those predicated on YOLOv4 and
YOLOv5x, highlights the progress achieved in the identification and interpretation of Arabic sign language. The model’s practical
application in improving communication for the Arabic-speaking community and people with hearing impairments is demonstrated by
its high accuracy and real-time performance.
In this study, the application of deep learning techniques, specifically the YOLOv8 architecture, has proven successful,
demonstrating the potential of such methods in the development of assistive technology for sign language translation. In order to
increase translation efficiency and accuracy, future research could concentrate on refining the model’s performance even more, adding
more diverse sign language gestures to the dataset, and investigating other deep learning architectures.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.
Data availability
Acknowledgement
The authors extend their appreciation to the King Salman Center for Disability Research for funding this work through Research
Group no. KSRG-2023-183.
References
[1] World Health Organization. Deafness and hearing loss. Feb-2024. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss [Accessed: 16-Feb-2024].
[2] Johnston T, Schembri A. Australian sign language (Auslan): an introduction to sign language linguistics. Cambridge University Press; 2007.
[3] World Federation of the Deaf. [Online]. Available: https://wfdeaf.org/our-work/ [Accessed: 18-Feb-2024].
[4] Sanaullah M, et al. A real-time automatic translation of text to sign language. Comput Mater Continua 2022;70(2):2471–88. https://doi.org/10.32604/
cmc.2022.019420.
[5] Núñez-Marcos A, Perez-de-Viñaspre O, Labaka G. A survey on sign language machine translation. Expert Syst Appl 2023;213:118993. https://doi.org/10.1016/
j.eswa.2022.118993. Mar.
[6] Rastgoo R, Kiani K, Escalera S, Athitsos V, Sabokrou M. A survey on recent advances in sign language production. Expert Syst Appl 2024;243:122846. https://
doi.org/10.1016/j.eswa.2023.122846. Jun.
[7] Dhanjal AS, Singh W. An automatic machine translation system for multi-lingual speech to Indian sign language. Multimedia Tools Appl 2022;81(3):4283–321.
https://doi.org/10.1007/s11042-021-11706-1. Jan.
[8] Barbhuiya AA, Karsh RK, Jain R. CNN based feature extraction and classification for sign language. Multimedia Tools Appl 2021;80(2):3051–69. https://doi.
org/10.1007/s11042-020-09829-y. Jan.
[9] Hao W, Hou C, Zhang Z, Zhai X, Wang L, Lv G. A sensing data and deep learning-based sign language recognition approach. Comput Electr Eng 2024;118:
109339. https://doi.org/10.1016/j.compeleceng.2024.109339. Aug.
[10] Siam AI, Soliman NF, Algarni AD, Abd El-Samie FE, Sedik A. Deploying machine learning techniques for human emotion detection. Comput Intell Neurosci
2022;2022:1–16. https://doi.org/10.1155/2022/8032673. Feb.
[11] Gupta R, Kumar A. Indian sign language recognition using wearable sensors and multi-label classification. Comput Electr Eng 2021;90:106898. https://doi.org/
10.1016/j.compeleceng.2020.106898. Mar.
[12] Lee BG, Lee SM. Smart wearable hand device for sign language interpretation system with sensors fusion. IEEE Sens J 2018;18(3):1224–32. https://doi.org/
10.1109/JSEN.2017.2779466. Feb.
[13] Qahtan S, Alsattar HA, Zaidan AA, Deveci M, Pamucar D, Martinez L. A comparative study of evaluating and benchmarking sign language recognition system-
based wearable sensory devices using a single fuzzy set. Knowl Based Syst 2023;269:110519. https://doi.org/10.1016/j.knosys.2023.110519. Jun.
[14] Balaha MM, et al. A vision-based deep learning approach for independent-users Arabic sign language interpretation. Multimedia Tools Appl 2023;82(5):
6807–26. https://doi.org/10.1007/s11042-022-13423-9. Feb.
[15] Rastgoo R, Kiani K, Escalera S. Sign language recognition: a deep survey. Expert Syst Appl 2021;164:113794. https://doi.org/10.1016/j.eswa.2020.113794.
Feb.
[16] Er-Rady A, Faizi R, Thami ROH, Housni H. Automatic sign language recognition: a survey. In: 2017 International conference on advanced technologies for signal
and image processing (ATSIP); 2017. p. 1–7. https://doi.org/10.1109/ATSIP.2017.8075561.
[17] Pathan RK, Biswas M, Yasmin S, Khandaker MU, Salman M, Youssef AAF. Sign language recognition using the fusion of image and hand landmarks through
multi-headed convolutional neural network. Sci Rep 2023;13(1):16975. https://doi.org/10.1038/s41598-023-43852-x. Oct.
[18] Naz N, Sajid H, Ali S, Hasan O, Ehsan MK. MIPA-ResGCN: a multi-input part attention enhanced residual graph convolutional framework for sign language
recognition. Comput Electr Eng 2023;112:109009. https://doi.org/10.1016/j.compeleceng.2023.109009. Dec.
[19] Miah ASM, Hasan MAM, Nishimura S, Shin J. Sign language recognition using graph and general deep neural network based on large scale dataset. IEEE Access
2024;12:34553–69. https://doi.org/10.1109/ACCESS.2024.3372425.
[20] Abdul W, et al. Intelligent real-time Arabic sign language classification using attention-based inception and BiLSTM. Comput Electr Eng 2021;95:107395.
https://doi.org/10.1016/j.compeleceng.2021.107395. Oct.
[21] Varshini CS, Hruday G, Mysakshi Chandu GS, Sharif SK. Sign language recognition. Int J Eng Res Technol 2020;V9(05). https://doi.org/10.17577/
IJERTV9IS050781. Jun.
[22] Siam AI, El-Bahnasawy NA, El Banby GM, Abou Elazm A, Abd El-Samie FE. Efficient video-based breathing pattern and respiration rate monitoring for remote
health monitoring. J Opt Soc Am 2020;37(11):C118. https://doi.org/10.1364/JOSAA.399284. Nov.
[23] Islam MM, Nooruddin S, Karray F, Muhammad G. Human activity recognition using tools of convolutional neural networks: a state of the art review, data sets,
challenges, and future prospects. Comput Biol Med 2022;149:106060. https://doi.org/10.1016/j.compbiomed.2022.106060. Oct.
[24] Khurana D, Koli A, Khatter K, Singh S. Natural language processing: state of the art, current trends and challenges. Multimedia Tools Appl 2023;82(3):3713–44.
https://doi.org/10.1007/s11042-022-13428-4. Jan.
[25] Siam AI, et al. Biosignal classification for human identification based on convolutional neural networks. Int J Commun Syst 2021;34(7). https://doi.org/
10.1002/dac.4685. May.
[26] Alnaggar M, Siam AI, Handosa M, Medhat T, Rashad MZ. Video-based real-time monitoring for heart rate and respiration rate. Expert Syst Appl 2023;225:
120135. https://doi.org/10.1016/j.eswa.2023.120135. Sep.
[27] Alharbey R, Dessouky MM, Sedik A, Siam AI, Elaskily MA. Fatigue State detection for tired persons in presence of driving periods. IEEE Access 2022. https://doi.
org/10.1109/ACCESS.2022.3185251.
[28] El-Rashidy N, Sedik A, Siam AI, Ali ZH. An efficient edge/cloud medical system for rapid detection of level of consciousness in emergency medicine based on
explainable machine learning models. Neural Comput Appl 2023. https://doi.org/10.1007/s00521-023-08258-w. Mar.
[29] Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L. Large-scale video classification with convolutional neural networks. In: Proceedings of the
IEEE conference on computer vision and pattern recognition; 2014. p. 1725–32.
[30] Szeliski R. Computer vision: algorithms and applications. Springer Nature; 2022.
[31] Chai J, Zeng H, Li A, Ngai EWT. Deep learning in computer vision: a critical review of emerging techniques and application scenarios. Mach Learn Appl 2021;6:
100134. https://doi.org/10.1016/j.mlwa.2021.100134. Dec.
[32] Cazzato D, Cimarelli C, Sanchez-Lopez JL, Voos H, Leo M. A survey of computer vision methods for 2D object detection from unmanned aerial vehicles.
J Imaging 2020;6(8):78. https://doi.org/10.3390/jimaging6080078. Aug.
[33] Olveres J, et al. What is new in computer vision and artificial intelligence in medical image analysis applications. Quant Imaging Med Surg 2021;11(8):3830–53.
https://doi.org/10.21037/qims-20-1151. Aug.
[34] Elyan E, et al. Computer vision and machine learning for medical image analysis: recent advances, challenges, and way forward. Artif Intell Surg 2022. https://
doi.org/10.20517/ais.2021.15.
[35] Abaspur Kazerouni I, Fitzgerald L, Dooly G, Toal D. A survey of state-of-the-art on visual SLAM. Expert Syst Appl 2022;205:117734. https://doi.org/10.1016/j.
eswa.2022.117734. Nov.
[36] Sreemathy R, Turuk M, Chaudhary S, Lavate K, Ushire A, Khurana S. Continuous word level sign language recognition using an expert system based on machine
learning. Int J Cognitive Comput Eng 2023;4:170–8. https://doi.org/10.1016/j.ijcce.2023.04.002. Jun.
[37] Attia NF, Ahmed MTFS, Alshewimy MAM. Efficient deep learning models based on tension techniques for sign language recognition. Intell Syst Appl 2023;20:
200284. https://doi.org/10.1016/j.iswa.2023.200284. Nov.
[38] Siddique S, Islam S, Neon EE, Sabbir T, Naheen IT, Khan R. Deep learning-based bangla sign language detection with an edge device. Intell Syst Appl 2023;18:
200224. https://doi.org/10.1016/j.iswa.2023.200224. May.
[39] Al Ahmadi S, Mohammad F, Al Dawsari H. Efficient YOLO-based deep learning model for arabic sign language recognition. J Disabil Res 2024;3(4). https://doi.
org/10.57197/JDR-2024-0051. May.
[40] Abeje BT, Salau AO, Mengistu AD, Tamiru NK. Ethiopian sign language recognition using deep convolutional neural network. Multimedia Tools Appl 2022;81
(20):29027–43. https://doi.org/10.1007/s11042-022-12768-5. Aug.
[41] de Castro GZ, Guerra RR, Guimarães FG. Automatic translation of sign language with multi-stream 3D CNN and generation of artificial depth maps. Expert Syst
Appl 2023;215:119394. https://doi.org/10.1016/j.eswa.2022.119394. Apr.
[42] Naz N, Sajid H, Ali S, Hasan O, Ehsan MK. Signgraph: an efficient and accurate pose-based graph convolution approach toward sign language recognition. IEEE
Access 2023;11:19135–47. https://doi.org/10.1109/ACCESS.2023.3247761.
[43] Sharma S, Gupta R, Kumar A. Continuous sign language recognition using isolated signs data and deep transfer learning. J Ambient Intell Human Comput 2023;
14(3):1531–42. https://doi.org/10.1007/s12652-021-03418-z. Mar.
[44] AbdElghfar HA, et al. QSLRS-CNN: qur’anic sign language recognition system based on convolutional neural networks. Imaging Sci J 2024;72(2):254–66.
https://doi.org/10.1080/13682199.2023.2202576. Feb.
[45] Alsaadi Z, Alshamani E, Alrehaili M, Alrashdi AAD, Albelwi S, Elfaki AO. A real time Arabic sign language alphabets (ArSLA) recognition model using deep
learning architecture. Computers 2022;11(5):78. https://doi.org/10.3390/computers11050078. May.
[46] Islam M, et al. Toward a vision-based intelligent system: a stacked encoded deep learning framework for sign language recognition. Sensors 2023;23(22):9068.
https://doi.org/10.3390/s23229068. Nov.
[47] Saleem MI, Siddiqui A, Noor S, Luque-Nieto M-A, Otero P. A novel machine learning based two-way communication system for deaf and mute. Appl Sci 2022;13
(1):453. https://doi.org/10.3390/app13010453. Dec.
[48] Kamruzzaman MM. Arabic sign language recognition and generating arabic speech using convolutional neural network. Wireless Commun Mobile Comput
2020;2020:1–9. https://doi.org/10.1155/2020/3685614. May.
[49] Ultralytics. [Online]. Available: https://github.com/ultralytics/ultralytics [Accessed: 20-Apr-2024].
[50] Al-Barham M, et al. RGB Arabic alphabets sign language dataset. arXiv preprint arXiv:2301.11932; 2023. https://doi.org/10.48550/arXiv.2301.11932.
[51] C. Project, “sign-language-detection dataset,” Apr-2023. [Online]. Available: https://universe.roboflow.com/capston-project/sign-language-detection-qztxk.
[Accessed: 15-Feb-2024].
[52] Belmadoui S. Arabic sign language unaugmented dataset. [Online]. Available: https://www.kaggle.com/datasets/sabribelmadoui/arabic-sign-language-unaugmented-dataset [Accessed: 15-Feb-2024].
[53] Al-Barham M, Sa’Aleek AA, Al-Odat M, Hamad G, Al-Yaman M, Elnagar A. Arabic sign language recognition using deep learning models. In: 2022 13th
International conference on information and communication systems (ICICS); 2022. p. 226–31. https://doi.org/10.1109/ICICS55353.2022.9811162.
[54] AbdElghfar HA, et al. A model for Qur’anic sign language recognition based on deep learning algorithms. J Sens 2023;2023:1–13. https://doi.org/10.1155/
2023/9926245. Jun.
[55] Dabwan BA, Jadhav ME, Ali YA, Olayah FA. Arabic sign language recognition using efficientnetB1 and transfer learning technique. In: 2023 International
conference on IT innovation and knowledge discovery (ITIKD); 2023. p. 1–5. https://doi.org/10.1109/ITIKD56332.2023.10099710.
[56] Al-Barham M, Jamal A, Al-Yaman M. Design of Arabic sign language recognition model. arXiv preprint arXiv:2301.02693; 2023. https://doi.org/10.48550/arXiv.2301.02693.