
BdSpell: A YOLO-based Real-time Finger

Spelling System for Bangla Sign Language


Naimul Haque1,2*, Meraj Serker2,3† and Tariq Bin Bashar1,2†
1* Computer Science and Engineering, Uttara University.
2 Computer Science and Engineering, Daffodil International University.
3 Computer Science and Engineering, Manarat International University.
arXiv:2309.13676v1 [cs.CV] 24 Sep 2023

*Corresponding author(s). E-mail(s): [email protected];


Contributing authors: [email protected];
[email protected];
† These authors contributed equally to this work.

Abstract
In the domain of Bangla Sign Language (BdSL) interpretation, prior approaches
often imposed a burden on users, requiring them to spell words without hidden
characters, which were subsequently corrected using Bangla grammar rules due
to the missing classes in the BdSL36 dataset. However, this method made it
difficult to accurately recover the correct spelling of words. To address this
limitation, we propose a novel real-time finger spelling system based on the YOLOv5
architecture. Our system employs specified rules and numerical classes as triggers
to efficiently generate hidden and compound characters, eliminating the neces-
sity for additional classes and significantly enhancing user convenience. Notably,
our approach achieves character spelling in an impressive 1.32 seconds with a
remarkable accuracy rate of 98%. Furthermore, our YOLOv5 model, trained on
9147 images, demonstrates an exceptional mean Average Precision (mAP) of
96.4%. These advancements represent a substantial progression in augmenting
BdSL interpretation, promising increased inclusivity and accessibility for the lin-
guistic minority. This innovative framework, characterized by compatibility with
existing YOLO versions, stands as a transformative milestone in enhancing com-
munication modalities and linguistic equity within the Bangla Sign Language
community.

Keywords: YOLOv5, real-time finger spelling, Bangla Sign Language, BdSL36


dataset, accessibility, inclusivity, linguistic equity, communication modalities

1 Introduction
In a world increasingly connected through technology, accessibility and inclusivity are
of paramount importance. The Real-Time Bangla Finger Spelling for Sign Language
project represents a significant endeavor to bridge communication gaps for individuals
with hearing and speech impairments in Bangladesh. This pioneering project aims to
develop an advanced computer vision system capable of accurately detecting and inter-
preting Bangla finger spelling gestures in real-time, empowering users to communicate
effectively using sign language.
The project leverages the power of YOLOv5 [1], a cutting-edge object detection
algorithm renowned for its speed and precision. By harnessing YOLOv5’s capabilities,
the system seeks to enable real-time recognition of Bangla finger spelling gestures
for digits and alphabets, paving the way for seamless communication and meaningful
interactions for individuals with impairments.
With a primary objective of creating a highly accurate and efficient computer
vision model, the project places a strong emphasis on developing a robust dataset
encompassing various hand orientations, lighting conditions, and backgrounds. This
diversity ensures the system’s adaptability to real-world scenarios and enhances its
ability to accurately interpret Bangla finger spelling gestures. By evaluating the sys-
tem’s performance based on metrics such as Mean Average Precision, Precision, and
Recall, the project ensures the model’s reliability and effectiveness in avoiding false
positives and negatives during real-time inference.
Furthermore, prior approaches [2] in the domain of Bangla Sign Language (BdSL)
interpretation often imposed a burden on users, requiring them to spell words with
missing characters, which were subsequently corrected using Bangla grammar rules
due to the missing classes in the BdSL36 dataset. However, this method made it
difficult to accurately recover the correct spelling of words. To address this limitation, we
propose a novel real-time finger spelling system based on the YOLOv5 architecture.
Our system employs specified rules and numerical classes as triggers to efficiently gen-
erate hidden and compound characters, eliminating the necessity for additional classes
and significantly enhancing user convenience. Another system, based on YOLOv4 [4], was
proposed that takes 60 seconds per character detection and uses a total of 49 different
classes. Our approach achieves character spelling in an impressive 1.32 seconds with a
remarkable accuracy rate of 98%. Furthermore, our YOLOv5 model, trained on 9147
images, demonstrates an exceptional mean Average Precision (mAP) of 96.4%.
• Our YOLOv5 model, trained on 9147 images, demonstrates an exceptional Mean
Average Precision (mAP) of 96.4%, showcasing its impressive accuracy.
• Our proposed system eliminates the need for additional classes, significantly reduc-
ing the computational cost while enhancing user convenience in comparison to
previous approaches.
• Our approach achieves character spelling in an impressive 1.32 seconds with a
remarkable accuracy rate of 98%, representing a substantial speed improvement over
previous systems.

• We implemented a technique that thresholds the running cumulative mean of
detection confidence when spelling characters, further enhancing the accuracy
and reliability of our system.
This paper will delve into the project’s methodology, outline the implementation
steps, and discuss the dataset preparation, model training, and evaluation processes.
Additionally, it will highlight the potential impact of the Real-Time Bangla Finger
Spelling for Sign Language project in fostering inclusivity and empowering individuals
with hearing and speech impairments to communicate effectively with others. Through
this research endeavor, we aspire to contribute to building a more accessible and
inclusive society that values effective communication for all, reaffirming the significance
of technology in promoting a more connected and empathetic world.

2 Related Works
Nanda et al. [5] explored real-time sign language conversion using the YOLOv3
algorithm. They focused on American Sign Language (ASL) and employed default
hyperparameters for training. In [6], researchers tackled Thai Sign Language classi-
fication using the YOLO algorithm. Their dataset comprised 25 signs with 15,000
instances for training. They achieved an mAP of 82.06% in complex background sce-
narios. Arabic Sign Language Recognition and Speech Generation were studied in [14]
using Convolutional Neural Networks. The authors integrated Google Translator API
for hand sign-to-letter translation and gTTS for speech generation.
For Bangla Sign Language (BdSL) detection, [7] employed a method involving
RGB-to-HSV color space conversion, followed by feature extraction using the Scale
Invariant Feature Transform (SIFT) Algorithm for 38 Bangla signs. A simple neural
network was used for Bangla alphabet classification in [15], employing the YCbCr color
map for input images. They used the Canny edge detector and Freeman Chain Code
for feature extraction. A real-time Bangladeshi Sign Language Detection method using
Faster R-CNN was proposed in [9], achieving an accuracy of 98.2% and a detection
time of 90.03 milliseconds, using a dataset containing 10 different sign letter labels.
Hossen et al. [10] presented a Deep Convolutional Neural Network-based method
for Bengali Sign Language Detection, utilizing a diverse dataset of 37 signs with various
backgrounds and skin colors. In [11], a dataset comprising 7052 samples of 10 numerals
and 23864 samples of 35 characters of Bangla Sign Language was introduced. The
Convolutional Neural Network was employed for accurate classification. Sarker et
al. [12] proposed a Bangla Sign language-to-speech generation system using smart
gloves, sensors, and a microcontroller. They employed Levenshtein distance for word
matching in the database for sign recognition. In the domain of finger-spelling, Li et
al. [13] presented a real-time finger-spelling recognition system using a convolutional
neural network (CNN) architecture. They achieved high accuracy in recognizing finger-
spelling gestures in real-time scenarios.
Another study [2] introduced an innovative algorithm for accurate hand-sign-spelled
Bangla language modelling. The authors used Bangla grammatical rules to recover
hidden characters and generated independent vowels using those rules. Dipon

et al. [4] used YOLOv4 as the object detection model for Real-Time Bangla Sign Language
Detection with Sentence and Speech Generation, with a detection time of 60 seconds
for every word. Rafiq et al. proposed a real-time vision-based Bangla sign language
detection system [16] using the YOLO algorithm. They used a dataset consisting of
10 Bangla signs and achieved an mAP of 92.5%. Talukder et al. proposed a real-time
Bangla sign language detection system using the YOLOv4 [?] object detection model.
They used a dataset consisting of 49 different classes, including 39 Bangla alphabets,
10 Bangla digits, and three new proposed signs. They achieved an mAP of 95.6%.
These finger spelling recognition papers complement the project’s focus on Real-
Time Bangla Finger Spelling for Sign Language, contributing valuable insights and
methodologies to the broader domain of sign language recognition and communication
accessibility.

3 Dataset
We utilized the BdSL36 dataset [3], purposefully curated for Bangladesh Sign Lan-
guage recognition systems by Oishee Bintey Hoque et al. This dataset underwent
meticulous preparation across five stages to ensure robustness and versatility.
The process commenced with Image Collection, conducted through extensive
research at a deaf school to identify 36 practical Bangla sign letters used in daily com-
munication. Ten volunteers captured raw images using phone cameras or webcams.
Subsequently, BdSL experts individually assessed and filtered the images to retain
those aligning with the appropriate sign style. This curation yielded 1200 images across
the classes.
Raw Image-Data Augmentation addressed the need for accurate sign letter detec-
tion under varying conditions. Manual augmentation techniques, encompassing affine
and perspective transformations, contrast adjustments, noise addition, cropping, blur-
ring, rotation, and more, were applied, resulting in the BdSL36v1 dataset containing
approximately 26,713 images with an average of 700 images per class. A few sample
images from the dataset are shown in Figure 1.
Dataset preparation involved a comprehensive and adaptable approach to BdSL
recognition. The meticulous stages of image collection, augmented data generation,
and background augmentation ensure that the BdSL36 dataset authentically captures
real-world scenarios, rendering it a valuable resource for advancing the field of sign
language recognition and detection. Additionally, we further annotated 9,187 BdSL36
dataset images and split them into a training set of 6,427 images, a validation set of
1,828 images, and a testing set of 932 images.
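As an illustrative sketch (not the authors' actual preprocessing script), the 9,187 annotated images can be partitioned into the reported 6,427/1,828/932 split as follows; the file names are placeholders:

```python
import random

def split_dataset(image_paths, train_n=6427, val_n=1828, seed=42):
    """Shuffle annotated image paths and slice them into
    train/validation/test subsets of the given sizes."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # deterministic shuffle for reproducibility
    train = paths[:train_n]
    val = paths[train_n:train_n + val_n]
    test = paths[train_n + val_n:]      # the remaining images form the test set
    return train, val, test

# Placeholder file names standing in for the 9,187 annotated images.
images = [f"img_{i:05d}.jpg" for i in range(9187)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 6427 1828 932
```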

4 Bangla Alphabets
Bengali is written from left to right, like the majority of other languages, and there
are no capital characters. The letters have a continuous line at the top, and there are
conjuncts, upstrokes, and downstrokes in the script. There are a total of 60 characters
including 11 vowels (Sworoborna), 39 consonants (Byanjanbarna), and 10 numerals.
Sworoborna, shown in Figure 2, refers to those letters in Bengali that can be spoken
on their own. These are the characters that represent individual vowel sounds

Fig. 1 Example images of initially collected BdSL36 dataset. Each image represents a different
BdSL sign letter. Images are serially organized according to their class label from left to right.

and can be pronounced independently without the need for a consonant. These vowel
characters are an integral part of the Bengali script and play a crucial role in forming
words and conveying meaning. They are combined with consonant characters (Byanjanbarna)
to create syllables, which ultimately form words. In the Bengali language,

Fig. 2 Bangla Vowels: A Visual Representation of the Complete Set of Vowels in the Bengali Alpha-
bet

consonants are known as “Byanjonborno”, shown in Figure 3. These are letters that
cannot be pronounced on their own and need to be combined with vowels to create a
complete sound. Bengali has a total of 39 consonant letters, which play a crucial role
in forming meaningful words and expressions. When a consonant is combined with a
vowel, it forms a syllable, which is the fundamental unit of pronunciation in Bengali.
This combination of consonants and vowels allows speakers to articulate a wide range
of sounds and convey various meanings. Bengali has its own set of numeric symbols to
represent numbers and fractions, shown in Figure 4. These symbols are used for numer-
ical representation in various contexts, such as writing numbers, expressing quantities,
and indicating fractions. In Bengali, compound characters are formed by combining
two or more consonants to create a single character. Some of the most commonly

Fig. 3 Bangla Consonants: Illustrating the Entire Array of Consonant Characters in the Bengali
Alphabet.

used compound characters are shown in Figure 5. The formation of these compound
characters is illustrated with examples in Figure 6.

Fig. 4 Bangla Numerals: Displaying the Full Range of Numerical Digits in the Bengali Script.

5 Object Detection Evaluation Metrics


In object detection tasks, evaluating the performance of a model is crucial to under-
standing its accuracy and effectiveness. One of the most commonly used evaluation
metrics for object detection is Mean Average Precision (mAP). In this section, we will
elaborate on how mAP works and the different components involved in its calculation.

Fig. 5 Bangla Compound Characters: Essential Combinations in the Bengali Alphabet. Bangla
Compound Characters, known as "Yuktakshar" in Bengali, are formed by combining two or more
basic characters from the script. These combinations create unique characters that represent specific
phonetic sounds not present in the basic alphabet. They are pivotal in accurately transcribing words
and phrases in the Bengali language. This figure showcases some of the most commonly used com-
pound characters in the Bengali script, highlighting their significance in phonetic representation.

Fig. 6 An illustration of consonant combinations resulting in compound characters in the Bengali


script. The first column displays select compound characters, while the second column demonstrates
the amalgamation of consonant characters that give rise to these combinations, showcasing the script’s
phonetic intricacies.

5.1 From Prediction Score to Class Label


In object detection, the model predicts bounding boxes around objects along with
their corresponding class labels and confidence scores. The prediction score represents
how confident the model is in its prediction. To convert this prediction score into
a class label, a threshold is applied. If the confidence score for a prediction exceeds
the threshold, it is classified as a positive detection with the associated class label;
otherwise, it is considered a negative detection.

\text{Class Label} = \begin{cases} 1 & \text{if Confidence Score} > \text{Threshold} \\ 0 & \text{otherwise} \end{cases}
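As a minimal sketch, the thresholding rule reads as follows (the 0.5 default is an assumed value, not one specified in the paper):

```python
def classify_detection(confidence, threshold=0.5):
    """Binarize a prediction score: 1 = positive detection, 0 = negative."""
    return 1 if confidence > threshold else 0

print(classify_detection(0.82))  # 1
print(classify_detection(0.31))  # 0
```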

5.2 Detection Performance Metrics


Precision is a metric that measures the accuracy of the model’s predictions for a
particular class. It is defined as the ratio of true positive (TP) detections to the sum
of true positive and false positive (FP) detections:

\text{Precision}(P) = \frac{TP}{TP + FP}.

Recall, also known as True Positive Rate (TPR) or Sensitivity, measures the model’s
ability to find all the positive instances for a particular class. It is defined as the
ratio of true positive detections to the sum of true positive and false negative (FN)
detections:

\text{Recall}(R) = \frac{TP}{TP + FN}
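Both metrics follow directly from the detection counts; a minimal sketch with illustrative counts:

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): how many detections were correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall = TP / (TP + FN): how many true signs were found."""
    return tp / (tp + fn)

# 90 correct detections, 10 false alarms, 30 missed signs:
print(precision(90, 10))  # 0.9
print(recall(90, 30))     # 0.75
```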

The Precision-Recall (PR) curve is a graphical representation of the precision and


recall values at various confidence score thresholds. It helps to visualize how precision
and recall change as we vary the confidence threshold for positive detections. The PR
curve is obtained by plotting precision on the y-axis and recall on the x-axis.
Average Precision (AP) is a single scalar value that summarizes the Precision-
Recall curve for a specific class. It is calculated by computing the area under the
Precision-Recall curve. The mathematical formula for AP is as follows:
AP = \int_0^1 \text{precision}(r)\, dr

where precision(r) is the precision at a given recall value r in the Precision-Recall


curve.
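In practice the integral is approximated numerically from sampled (recall, precision) pairs; the trapezoidal-rule sketch below uses toy values, not results from the paper:

```python
def average_precision(recalls, precisions):
    """Approximate AP = integral of precision(r) over recall with the
    trapezoidal rule. `recalls` must be sorted in ascending order."""
    ap = 0.0
    for i in range(1, len(recalls)):
        dr = recalls[i] - recalls[i - 1]
        ap += dr * (precisions[i] + precisions[i - 1]) / 2.0
    return ap

# A toy PR curve where precision falls as recall grows:
r = [0.0, 0.5, 1.0]
p = [1.0, 0.8, 0.6]
print(round(average_precision(r, p), 6))  # 0.8
```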

5.3 Intersection over Union (IoU)


IoU is a critical concept in object detection evaluation. It measures the overlap
between the predicted bounding box and the ground truth bounding box. IoU is com-
puted as the ratio of the area of intersection between the two bounding boxes to the
area of their union:
\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}
The example shown in Figure 7 shows how IoU is calculated.
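For axis-aligned boxes in (x1, y1, x2, y2) form, the computation illustrated in Figure 7 can be sketched as:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # zero if boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping in a 5x10 strip: IoU = 50 / 150.
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 4))  # 0.3333
```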

5.4 Mean Average Precision (mAP)


mAP is the average of AP values calculated for all the classes in the dataset. It pro-
vides a comprehensive evaluation of the model’s overall performance across different
classes and confidence score thresholds. The mathematical formula for mAP is as
follows:

Fig. 7 In this illustrative image, the concept of IoU is demonstrated. The predicted bounding box
(in red) and the ground truth bounding box (also in red) showcase the overlapping area used to
compute IoU. A higher IoU value signifies accurate object localization, while a lower value indicates
less precise detection.

\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i

where N is the number of classes and AP_i is the Average Precision for class i.
By using these evaluation metrics and mathematical formulas, we can effectively
assess the performance of an object detection model, identify areas of improvement,
and fine-tune the model to achieve better accuracy and reliability in detecting objects
of interest.

6 Methodology
To develop a YOLO-based real-time finger spelling model for the BDSL36 dataset,
which encompasses 36 recognized characters along with several derived characters,
we will follow a systematic methodology. The BDSL36 dataset contains a diverse
set of characters represented by Unicode values, each associated with a correspond-
ing finger spelling label. To devise a comprehensive system capable of recognizing
hidden or derived characters, we introduced a set of key components within our
methodology. These components include Recognized Character Detection, Inde-
pendent Vowel Transformation, Hidden Character Generation, and Trigger
Handling, each playing a vital role in the fingerspelling process.
1. Recognized Character Detection: Recognized characters in our finger spelling
recognition system are identified through the use of confidence scores ci (t) gener-
ated by the YOLOv5 model at time t for the detection class i. All the available
recognized characters are shown in Figure 8. To qualify as a recognized character,
the cumulative running mean of confidence must surpass a specified threshold δ,
as determined by the formula for the running cumulative confidence:
\sum_{t=1}^{T} \frac{1}{N} \sum_{i} c_i(t) > \delta

Here, N is the number of detections of class i at time frame t. This threshold δ
ensures that only characters with consistently high confidence scores are selected.
The recognized characters originate from the BDSL36 [3] dataset, having undergone
extensive training for object detection. They serve as the cornerstone for identifying
both overt and derived characters within our system, providing a robust foundation
for accurate recognition.
2. Independent Vowel Transformation: In written Bangla, it’s important to note
that there are two sets of vowels: independent and dependent. Dependent vowels
are essentially the same as independent vowels, but they can only be written after
consonants in the Bangla language, hence the name ”dependent vowels”. Since
our system utilizes only one set of vowels because of the limitation of the BdSL36
dataset, we assume by default that a recognized vowel is a dependent vowel. This
assumption stems from the fact that dependent vowels are more frequently used in
Bangla spelling due to their characteristic placement after consonants. This nuanced
understanding enables our model to accurately transcribe and recognize vowels in
the Bengali script. Independent vowels in our system are not directly recognized.
They are derived from the dependent vowels, which are recognized by the model.
These recognized vowels transition into independent vowels when trigger characters
are detected following a recognized character or derived character. This distinction
allows us to handle the recognition of vowels effectively. The sets of vowels and a
simple demonstration of the transformation are shown in Figure 9.
3. Hidden Character Generation: Hidden characters are a unique aspect of our
system. These characters are not provided in the BDSL36 dataset, except for the
independent vowels. They play a pivotal role in accurately representing the Bengali
script. These hidden characters, though not as widely recognized, are essential
for spelling numerous Bangla words. We carefully define and create these hidden
characters using the rules stated in Figure 9, ensuring they are not directly
recognizable by the model. This distinctive feature empowers our system to bridge
the gap between the limitations of existing datasets and a comprehensive
representation of the Bengali language.
4. Trigger Handling: Triggers are specific characters ranging from T0 to T7, which
are shown in Figure 8. They operate exclusively in the Textual mode of our
finger spelling system. The specific functions and roles of these trigger characters
will be discussed in detail later in our methodology. They serve as key elements
for recognizing derived characters and their dependencies. Detecting and handling
these Trigger characters is a crucial part of our finger-spelling methodology.
By incorporating these components into our methodology, we ensure a holistic
approach to finger-spelling recognition. Our system not only recognizes the charac-
ters available in the BDSL36 dataset but also effectively handles hidden characters,
transitions dependent vowels into independent vowels when triggers are detected,
and utilizes trigger characters for recognizing derived characters. This comprehensive
approach enables us to create a robust and versatile finger spelling recognition system
that accommodates the complexities of the Bengali language, where hidden and
derived characters play significant roles in communication.
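The confirmation rule of the Recognized Character Detection step can be sketched as a small per-class accumulator; the threshold value here is an illustrative assumption, not the δ used in the paper:

```python
class ConfidenceWindow:
    """Accumulate per-frame mean confidences for one candidate class and
    confirm the character once the running cumulative mean crosses delta."""

    def __init__(self, delta=3.0):
        self.delta = delta
        self.cumulative = 0.0

    def update(self, frame_confidences):
        """Add this frame's mean detection confidence; return True once
        the cumulative sum of per-frame means exceeds the threshold."""
        if frame_confidences:
            self.cumulative += sum(frame_confidences) / len(frame_confidences)
        return self.cumulative > self.delta

window = ConfidenceWindow(delta=3.0)
confirmed = False
for frame in [[0.9, 0.8], [0.95], [0.9], [0.85]]:  # YOLO confidences per frame
    confirmed = window.update(frame)
print(confirmed)  # True
```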

Fig. 8 This image showcases a comprehensive array of Bangla characters, thoughtfully organized
into four distinct categories: recognized characters, derived characters, hidden characters, and Trigger
characters. The top-right corner provides a visual taxonomy for easy reference. At the bottom, a
specific example of a Trigger character is thoughtfully presented, offering a practical illustration of
this unique character type in the Bangla script.

6.1 BdSL Finger-Spelling


To derive characters that are not present in the BDSL36 dataset, we can establish
transformation rules based on the provided mappings in Figure 9. There are four
types of characters that need to be derived: Independent Vowels, Hidden Characters,
Compound Characters with two characters, and Compound Characters using three
characters.

6.1.1 Single Character Transformation


According to Figure 9, when someone finger-spells any recognized dependent vowel
character and then follows it with Trigger 1 (T1), the dependent vowel character is
replaced with the corresponding independent vowel character. Additionally, hidden
characters are derived from recognized characters when Trigger 4 (T4) is
finger-spelled, also according to Figure 9.

6.1.2 Compound Character Derivation


In this subsection, we explore the process of deriving compound characters from recog-
nized characters using triggers T2 and T3, along with the Bangla compound character
dictionary. These triggers allow us to create compound characters by combining
individual characters. The dictionary and the derivation process are illustrated in
Figure 11, and Figure 10 shows how to spell a compound character in real time.
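The T2 derivation can be sketched as a lookup over the last two spelled characters; the single dictionary entry below is an illustrative stand-in for the full compound-character dictionary of Figure 11:

```python
# Illustrative stand-in for the Bangla compound-character dictionary;
# the real mapping is given in Figure 11.
COMPOUND_DICT = {
    ("ক", "ট"): "ক্ট",  # 'ka' + 'tta' -> compound, as in Figure 10
}

def apply_trigger_t2(spelled):
    """On Trigger T2, replace the last two spelled characters with their
    compound character when the pair appears in the dictionary."""
    if len(spelled) >= 2:
        pair = (spelled[-2], spelled[-1])
        if pair in COMPOUND_DICT:
            return spelled[:-2] + [COMPOUND_DICT[pair]]
    return spelled  # unchanged if no compound applies

word = ["ক", "ট"]
print(apply_trigger_t2(word))  # ['ক্ট']
```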

Fig. 9 In this visual representation, we observe the Transformation of Dependent Vowels to Inde-
pendent when Trigger T1 is finger-spelled while hidden characters are derived from other characters
using T4. In the lower right-hand corner of this image, we are presented with two illustrative exam-
ples. In the first one, we witness the transformation of the character ”/aa” into ”/AA” as a result
of the influence of ’Trigger T1.’ and in the second one, we can observe the transformation of the
sequence ”/a/A” into the character ”/ae.” This transformation is facilitated by the operation of ’T4.’

Fig. 10 Formation of Compound Character ’kta’ by Combining ’ka’ and ’tta’ in Bengali Script.
The illustration shows how a hand signer finger-spells a compound character: detecting
Trigger T2 automatically removes the last two characters and appends their
corresponding compound character to the spelled word.

6.1.3 Real Time Finger-Spelling


The methodology for developing the BdSL Fingerspelling model is designed to accu-
rately recognize finger-spelled Bengali characters in real time. The BDSL36 dataset,
comprising 36 recognized characters and their derivatives, forms the basis of this
system. Key components within the methodology include Recognized Character Detec-
tion, Independent Vowel Transformation, Hidden Character Generation, and Trigger
Handling. Recognized characters are identified based on confidence scores generated

Fig. 11 In this visual representation, we are presented with a comprehensive overview of Bangla
finger spelling. On the left side of the image, we can observe a visual representation of the compound
characters, where two or more individual characters combine to form a unique and meaningful sign.
The right side of the picture provides a detailed set of rules and guidelines. These rules outline the
correct formation and usage of compound characters, both for two-character and three-character
combinations.

by the YOLOv5 model, ensuring consistently high confidence for selection. Recognized
vowels are assumed by default to be dependent, allowing for accurate transcription, while
hidden characters play a crucial role in representing Bengali script. Transformation
rules and triggers facilitate the derivation of characters not present in the BDSL36
dataset. Compound characters are also derived using triggers and a Bangla compound
character dictionary. The comprehensive approach of this methodology addresses the
complexities of the Bengali language, resulting in a robust and versatile finger-spelling
recognition system.
Our real-time finger spelling starts with a speller hand-signing the recognized
characters, which in real time is prone to errors due to ambient lighting and other
factors. To confirm a detection, we created a window over the confidence scores, as
explained in the subsection above. A detected character then passes through
the trigger-handling module. Based on the Trigger, the recognized character is further
transformed according to the flow chart shown in Figure 12.
In our system, we can finger-spell either text or numerals, using the textual or
numeral mode respectively. By default, we begin in textual mode; the mode changes
only when trigger T5 (recognized character 5) is detected, switching to numeral mode,
in which numeral recognized characters are no longer treated as triggers. To return
to the previous mode, 'aa' must be detected, which acts as trigger T5 in numeral
mode. Each detected or transformed recognized character is then appended to the
sentence. These steps are shown in Figure 12. Furthermore, we use trigger T0 to add
a space and trigger T6 to delete the last appended character.

Fig. 12 This image depicts an algorithm utilizing YOLOv5 object detection to recognize characters.
Recognized characters can transition into either numeral or textual mode. Textual mode is governed
by triggers: T5 toggles the mode, T1 forms independent vowels, T2 forms compound characters,
T4 forms hidden characters, T6 is backspace, and T0 inserts spaces.
Using our setup and the methodology explained above and in Figures 8 and 11, we
further demonstrate finger spelling with the examples in Figure 13.
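The mode toggling and editing triggers described above can be sketched as a small state machine. This is our own minimal reading of the flow in Figure 12: the vowel and compound triggers are omitted, and the token labels are illustrative, not the model's actual class names:

```python
class FingerSpeller:
    """Minimal trigger-handling sketch: T5 toggles modes, T0 adds a
    space, T6 deletes the last character, 'aa' toggles back from
    numeral mode; every other token is appended literally."""

    def __init__(self):
        self.mode = "textual"
        self.sentence = []

    def handle(self, token):
        if self.mode == "textual":
            if token == "T5":                # recognized character 5 toggles mode
                self.mode = "numeral"
            elif token == "T0":              # space
                self.sentence.append(" ")
            elif token == "T6":              # backspace
                if self.sentence:
                    self.sentence.pop()
            else:
                self.sentence.append(token)
        else:                                # numeral mode: numerals are literal
            if token == "aa":                # 'aa' acts as T5 in numeral mode
                self.mode = "textual"
            else:
                self.sentence.append(token)

    def text(self):
        return "".join(self.sentence)

speller = FingerSpeller()
for tok in ["ম", "T0", "T5", "5", "aa", "ব"]:
    speller.handle(tok)
print(speller.text())  # ম 5ব
```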

7 Training YOLOv5
During the development of the project, the YOLOv5 model was trained on a dataset
of Bangla Sign Language images to detect and classify different signs. After training,
the model’s performance was evaluated on a separate validation dataset to assess its
accuracy and effectiveness. The validation process used the best-trained weights of
the model, which were saved at the location runs/train/exp/weights/best.pt. These
weights represent the model’s parameters that achieved the highest level of perfor-
mance during the training process. The model’s architecture consists of 157 layers and
is relatively lightweight, with 7,136,884 parameters. It’s essential to have a model with
a suitable number of parameters to strike a balance between accuracy and computa-
tional efficiency. In this case, the model’s computational cost is measured in GFLOPs
(Giga Floating Point Operations) and is found to be 16.2 GFLOPs, indicating a rea-
sonable computational load. To evaluate the model’s detection performance, it was
tested on a validation set comprising 1,826 images, which collectively contained 1,827
instances of Bangla Sign Language signs. The model’s performance is measured using
several metrics that provide insights into its ability to recognize different signs.

Word | Spelling | Finger Spelling
মা (Ma) - Mother | ম (ma) + "া" (aa) | ম (ma) + "া" (aa)
বাবা (Baba) - Father | ব (ba) + "া" (/aa) + ব (ba) + "া" (/aa) | ব (ba) + "া" (/aa) + ব (ba) + "া" (/aa)
ভাই (Bhai) - Brother | ভ (bha) + "া" (/aa) + ই (E) | ভ (bha) + "া" (/aa) + "ি" (/e) + T1
বোন (Bon) - Sister | ব (ba) + ো (o) + ন (n) | ব (ba) + ো (o) + ন (n)
দাদা (Dada) - Elder brother | দ (da) + "া" (/aa) + দ (da) + "া" (/aa) | দ (da) + "া" (/aa) + দ (da) + "া" (/aa)
খালা (Khala) - Aunt | খ (kha) + "া" (/aa) + ল (la) + "া" (/aa) | খ (kha) + "া" (/aa) + ল (la) + "া" (/aa)
ঠাকুরদা (Thakurda) - Uncle | ঠ (tha) + "া" (/aa) + ক (ka) + ু (u) + র (ra) + দ (da) + "া" (/aa) | ঠ (tha) + "া" (/aa) + ক (ka) + ু (u) + র (ra) + দ (da) + "া" (/aa)
শিক্ষক (Shikkhok) - Teacher | শ (sha) + ি (i) + ক (k) + ্ (halant) + ষ (sh) + ক (k) | স (/sa) + T4 + ি (i) + ক (ka) + স (/sa) + T4 + T4 + T2 + ক (ka)
ছাত্র (Chhatro) - Student | ছ (cha) + "া" (/aa) + ত (ta) + ্ (halant) + র (ro) | ছ (cha) + "া" (/aa) + ত (ta) + র (ro) + T2
ডাক্তার (Doctor) - Doctor | ড (ḍa) + "া" (/aa) + ক (ka) + ্ (halant) + ত (ṭa) + "া" (/aa) + র (ra) | ড (ḍa) + "া" (/aa) + ক (ka) + ত (ṭa) + T2 + "া" (/aa) + র (ra)

Fig. 13 Fingerspelling of common Bengali words in our system. Each word is spelled as a sequence of handshapes following the Bengali script; where a character has no dedicated class in BdSL36, numerical trigger classes (T1, T2, T4) generate the hidden or compound character, so the finger-spelled sequence can differ from the standard character-by-character spelling. The table serves as a guide for learners of the system and for those communicating with signers who rely on fingerspelling in Bengali.
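The trigger mechanism summarized in the table can be viewed as a small post-processing step over the detected class sequence. The sketch below is an illustrative reconstruction drawn only from the examples above (T1 promoting the vowel sign ি to independent ই, T2 inserting the halant to form a conjunct, T4 shifting স toward শ/ষ); the system's full rule table may differ.

```python
# Illustrative trigger post-processor, reconstructed from the table above.
# The rule set is an assumption based on those examples only.
HALANT = "\u09cd"  # ্ , the conjunct-forming sign with no BdSL36 class

def apply_triggers(tokens):
    out = []
    for t in tokens:
        if t == "T2" and len(out) >= 2:
            out.insert(-1, HALANT)   # ত + র + T2 -> ত্র (conjunct)
        elif t == "T1" and out and out[-1] == "ি":
            out[-1] = "ই"            # dependent vowel sign -> independent vowel
        elif t == "T4" and out and out[-1] == "স":
            out[-1] = "শ"            # first T4 after স
        elif t == "T4" and out and out[-1] == "শ":
            out[-1] = "ষ"            # second T4 shifts further
        else:
            out.append(t)
    return "".join(out)

word = apply_triggers(["ছ", "া", "ত", "র", "T2"])  # -> ছাত্র
```

Because the triggers rewrite characters already emitted, the signer never needs a dedicated handshape for the halant or for the hidden শ/ষ forms, matching the design described in the abstract.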

8 Results
The primary evaluation metrics used are Precision (P), Recall (R), and Mean Average
Precision (mAP) at various IoU (Intersection over Union) thresholds. Precision mea-
sures the accuracy of the model’s predictions, reflecting the percentage of true positive
detections among all the positive detections. Recall, on the other hand, measures the
model’s ability to find all the positive instances, indicating how well it avoids missing
any relevant signs. The Mean Average Precision (mAP) is a comprehensive measure
that considers the precision-recall trade-off across multiple IoU thresholds. It provides
a holistic assessment of the model’s performance for different classes and thresholds.
In this evaluation, mAP is calculated at the standard IoU threshold of 0.5 and also
across a range of IoU thresholds from 0.5 to 0.95.
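As a concrete illustration of these metrics, the sketch below computes IoU, precision, and recall from raw counts and averages AP over IoU thresholds. The detection counts and boxes are hypothetical, not the paper's data.

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2): intersection area over union area.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def precision(tp, fp):
    # Share of predicted detections that are correct.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Share of ground-truth instances that were detected.
    return tp / (tp + fn) if (tp + fn) else 0.0

def mean_ap(ap_per_threshold):
    # mAP@0.5:0.95 averages AP over IoU thresholds 0.50, 0.55, ..., 0.95.
    return sum(ap_per_threshold) / len(ap_per_threshold)

# A prediction counts as a true positive only if its IoU with a ground-truth
# box reaches the threshold (0.5 for mAP@0.5):
match = iou((0, 0, 2, 2), (1, 1, 3, 3)) >= 0.5   # IoU = 1/7, so no match
```

Raising the IoU threshold reclassifies loosely localized detections from true to false positives, which is why mAP@0.5:0.95 is always at or below mAP@0.5.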
The overall performance of the model across all classes combined is quite promising.
It achieved a precision of 69.2% and a recall of 83.6%, indicating that it can identify a
significant portion of Bangla Sign Language signs accurately. The mAP at the standard
IoU threshold of 0.5 is 84.3%, which is considered a strong performance. It means that
the model’s predictions are well-matched with the ground truth annotations.
However, the more comprehensive mAP across IoU thresholds from 0.5 to 0.95
is 56.9%, which suggests that the model’s performance may vary across different

Fig. 14 Relationship between threshold values (δ) and accuracy for two measures, "Accuracy (Sum of Detections)" and "Accuracy (Cumulative Confidence)". Low thresholds (5 and 10) already yield high accuracy in both measures, with "Accuracy (Cumulative Confidence)" reaching 100% at δ = 10 and remaining at 100% for thresholds of 20, 30, and 50. The graph shows how the choice of threshold affects detection accuracy.
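The two acceptance criteria compared in Fig. 14 can be sketched as follows; the per-frame confidence values and the exact triggering logic are assumptions for illustration, not the paper's implementation.

```python
def accept_by_count(confidences, delta):
    # "Sum of Detections": commit a character once the same class has been
    # detected in delta successive frames.
    return len(confidences) >= delta

def accept_by_confidence(confidences, delta):
    # "Cumulative Confidence": commit once the per-frame confidence scores
    # for the class sum to at least delta.
    return sum(confidences) >= delta

# Hypothetical per-frame confidences while one handshape is held:
frames = [0.92, 0.95, 0.90, 0.97, 0.93, 0.96]
```

With δ = 5, the count criterion fires after five frames, while the confidence criterion fires only once the scores total 5, a frame or two later at these confidence levels; larger δ trades spelling speed for certainty, consistent with the trends in Figs. 14 and 15.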

levels of detection strictness. It is common to see a drop in mAP as the IoU thresh-
old increases, as it requires stricter overlap between predicted and ground truth
bounding boxes. Looking into the performance of individual classes, we can observe
variations in the model’s ability to recognize different signs. Some classes, such as
”BHA,” ”BISHARGA,” and ”THA,” achieved high precision, recall, and mAP scores,
indicating that the model performs exceptionally well on these signs.
On the other hand, there are classes like ”NA” and ”RA” with lower precision,
recall, and mAP scores, suggesting that the model struggles more with recognizing
these particular signs. In summary, the YOLOv5 model showed overall promising
results for the Bangla Sign Language Detection project. It demonstrated the ability
to detect and classify signs with good accuracy and achieved a high mAP score at the
standard IoU threshold of 0.5. However, there are certain classes where the model’s per-
formance could be further enhanced. This might involve additional data collection for
underrepresented classes, fine-tuning hyperparameters, or exploring other techniques
to improve recognition accuracy. Continuous evaluation and refinement of the model
will be essential to enhance its capabilities and make it more robust for real-world
applications.

Fig. 15 Effect of threshold values (δ) on detection time. Lower thresholds (5 and 10) yield faster detections, especially for "Time (Cumulative Confidence)"; as the threshold increases, both "Time (Sum of Detections)" and "Time (Cumulative Confidence)" grow. The graph illustrates the balance between accuracy and detection speed implied by the threshold choice.

Model           Data      Dataset Size    Number of Classes   mAP
YOLOv5          BdSL-OD   1,000 images    30 signs            95%
YOLOv4 [4]      BdSL-OD   12,500 images   49 signs            95.6%
YOLOv5 (ours)   BdSL36    9,147 images    36 signs            96.40%
Table 1 Comparison of Object Detection Models

In Table 1 we present an overview of object detection models used for Bangla Sign Language detection, listing the model name, dataset, dataset size, number of classes, and mean Average Precision (mAP). Notably, our YOLOv5 model achieves a superior mAP of 96.40% on the BdSL36 dataset, comprising 9,147 images and 36 signs, outperforming the previous YOLOv4 model (95.6%) from [4]. This improvement may be attributed to architectural enhancements, a larger and more diverse training dataset, data augmentation, fine-tuning for Bangla Sign Language, consistent evaluation metrics, and algorithmic advances. YOLOv5 thus shows better precision in sign detection, a promising advancement in the field, although its potential limitations should also be acknowledged for a comprehensive evaluation of its performance.

Figure 18 displays the precision-recall performance for each class, with the X-axis showing the classes (in both Bengali script and Romanized labels) and the Y-axis showing the mAP50 (mean Average Precision at IoU 0.5) and mAP50-95 (mean Average Precision averaged over IoU thresholds 0.5 to 0.95) scores. This graph offers
valuable insights into the model’s object detection capabilities. For each class, the
mAP50 score indicates how well the model can accurately identify instances of that
specific character or symbol. Higher bars indicate better performance, while lower ones
suggest areas where the model may struggle. The mAP50-95 scores provide a more
comprehensive evaluation, considering a broader range of overlap thresholds. Overall,
this graph allows us to assess the precision and recall performance of the model across
different classes, identifying its strengths and weaknesses in object detection tasks.

Fig. 16 This F1-confidence curve in the YOLO model demonstrates a promising performance for
object detection. Achieving an F1 score of 0.74 at a confidence threshold of 0.335 indicates a good
balance between precision and recall for detecting objects across all classes. The curve suggests that
the model can identify objects accurately while minimizing false positives, making it suitable for
object detection tasks.
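The F1-confidence trade-off behind Fig. 16 follows directly from the precision and recall definitions above. In the sketch below, the (threshold, precision, recall) samples are hypothetical, with the reported optimum of 0.335 used only for illustration.

```python
def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0

def best_threshold(curve):
    # curve: (confidence_threshold, precision, recall) samples.
    # Returns the threshold whose F1 score is highest, as read off Fig. 16.
    return max(curve, key=lambda t: f1(t[1], t[2]))[0]

# Hypothetical samples along an F1-confidence curve:
curve = [(0.20, 0.60, 0.90), (0.335, 0.80, 0.70), (0.60, 0.90, 0.40)]
```

Low thresholds favor recall (many detections, more false positives) and high thresholds favor precision; the F1 optimum marks the confidence value at which the two are best balanced.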

9 Conclusion
This paper presented BdSpell, a YOLOv5-based real-time finger spelling system for Bangla Sign Language. By using spelling rules and numerical trigger classes to generate hidden and compound characters, the system avoids the classes missing from the BdSL36 dataset and removes the burden of spelling words without hidden characters. Trained on 9,147 images spanning 36 signs, the model achieves a mean Average Precision of 96.4% and spells a character in 1.32 seconds with 98% accuracy. Future work includes collecting additional data for weaker classes such as "NA" and "RA", fine-tuning hyperparameters, and further improving robustness for real-world applications.

Fig. 17 Precision-recall curve for the YOLO model across all classes, achieving an mAP of 0.74 at a confidence threshold of 0.5. The curve indicates that the model maintains a good balance between correctly identifying objects (precision) and capturing all relevant objects (recall) at this threshold.

Table 2 Accuracy and mean detection frequency for each threshold δ under the sum-of-detections (Σn) and cumulative-confidence (Σci) criteria

Threshold (δ)   Accuracy (Σn)   Accuracy (Σci)   Frequency n   Frequency c
5               87%             98%              69.15         61.33
10              91%             100%             65.81         60.00
20              88%             100%             68.00         60.00
30              90%             100%             66.67         60.00
50              100%            100%             60.00         60.00

References
[1] Jocher, G., et al. (2022). ultralytics/yolov5: YOLOv5 SOTA Realtime Instance Segmentation (v7.0). Zenodo. doi: https://doi.org/10.5281/zenodo.7347926

[2] Rahaman, M. A., Jasim, M., Ali, M. H., & Hasanuzzaman, M. (2020). Bangla language modeling algorithm for automatic recognition of hand-sign-spelled Bangla sign language. Frontiers of Computer Science, 14(3), 1-15. https://doi.org/10.1007/s11704-018-7253-3

Fig. 18 Precision-recall performance by class, with mAP50 and mAP50-95 scores on the Y-axis, showing how detection accuracy varies across the different characters.


[3] Hoque, O., Jubair, M. I., Akash, A.-F., & Islam, Md. (2021). BdSL36: A Dataset for Bangladeshi Sign Letters Recognition. doi: https://doi.org/10.1007/978-3-030-69756-3_6


[4] D. Talukder and F. Jahara, "Real-Time Bangla Sign Language Detection with Sentence and Speech Generation," 2020 23rd International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, 2020, pp. 1-6. doi: https://doi.org/10.1109/ICCIT51783.2020.9392693

[5] Nanda, M. (2020). You Only Gesture Once (YouGo): American Sign Language Translation using YOLOv3. Master's thesis, Purdue University.

[6] Nakjai, P., Maneerat, P., Katanyukul, T., et al. (2021). Thai finger spelling localization and classification under complex background using YOLO-based deep learning. Journal of Ambient Intelligence and Humanized Computing, 12(7), 7485-7496. https://doi.org/10.1007/s12652-021-03211-8

[7] Shanta, S. S., Anwar, S. T., & Kabir, M. R. (2018). Bangla Sign Language Detection Using SIFT and CNN. 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 1-6. https://doi.org/10.1109/ICCCNT.2018.8493915

[8] Hu, Y. (2018). Finger spelling recognition using depth information and support vector machine. Multimedia Tools and Applications, 77, 29043-29057. https://doi.org/10.1007/s11042-018-6102-6

[9] M. A. Hossain, M. S. Hossain, and M. A. Hossain, "Real-time Bangladeshi Sign Language Detection using Faster R-CNN," 2021 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), Dhaka, Bangladesh, 2021, pp. 1-5.

[10] M. Hossen, "Deep Convolutional Neural Network-based method for Bengali Sign Language Detection," 2022 International Conference on Computer Science and Artificial Intelligence (CSAI), Dhaka, Bangladesh, 2022, pp. 1-5.

[11] M. A. Poddar et al., "Recognition of Bangla Sign Language using Convolutional Neural Network," 2019 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT), Dhaka, Bangladesh, 2019, pp. 1-6.

[12] S. Sarker and M. M. Hoque, "An Intelligent System for Conversion of Bangla Sign Language into Speech," 2018 International Conference on Innovations in Science, Engineering, and Technology (ICISET), Chittagong, Bangladesh, 2018, pp. 513-518. doi: https://doi.org/10.1109/ICISET.2018.8745608

[13] B. Kang, S. Tripathi, and T. Q. Nguyen, "Real-time sign language fingerspelling recognition using convolutional neural networks from the depth map," 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 2015, pp. 136-140. doi: https://doi.org/10.1109/ACPR.2015.7486481

[14] Kamruzzaman, M. M. (2020). Arabic Sign Language Recognition and Generating Arabic Speech Using Convolutional Neural Network. Wireless Communications and Mobile Computing, 2020, 1-14. doi: https://doi.org/10.1155/2020/3685614

[15] Hossain, M. A., Hossain, M. A., & Hossain, M. A. (2017). Hand sign language recognition for the Bangla alphabet based on Freeman Chain Code and ANN. 2017 4th International Conference on Advances in Electrical Engineering (ICAEE), 1-6. doi: https://doi.org/10.1109/ICAEE.2017.8255454

[16] Rafiq et al., "Real-time vision-based Bangla sign language detection using the YOLO algorithm," 2021 (dataset of 10 Bangla signs, mAP score of 92.5). doi: https://doi.org/10.1109/ICACC-202152719.2021.9708141

