
IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, VOL. 10, NO. 4, AUGUST 2023

Recognizing British Sign Language Using Deep Learning: A Contactless and Privacy-Preserving Approach

Hira Hameed, Student Member, IEEE, Muhammad Usman, Senior Member, IEEE, Ahsen Tahir, Member, IEEE, Kashif Ahmad, Senior Member, IEEE, Amir Hussain, Senior Member, IEEE, Muhammad Ali Imran, Senior Member, IEEE, and Qammer H. Abbasi, Senior Member, IEEE

Abstract—Sign language is utilized by deaf-mute people to communicate through hand movements, body postures, and facial emotions. The motions in sign language comprise a range of distinct hand and finger articulations that are occasionally synchronized with the head, face, and body. Automatic sign language recognition (SLR) is a highly challenging area and still remains in its infancy compared with speech recognition after almost three decades of research. Current wearable and vision-based systems for SLR are intrusive and suffer from the limitations of ambient lighting and privacy concerns. To the best of our knowledge, our work proposes the first contactless British sign language (BSL) recognition system using radar and deep learning (DL) algorithms. Our proposed system extracts 2-D spatiotemporal features from the radar data and applies state-of-the-art DL models to classify the spatiotemporal features of BSL signs into different verbs and emotions, such as Help, Drink, Eat, Happy, Hate, and Sad. We collected and annotated a large-scale benchmark BSL dataset covering 15 different types of BSL signs. Our proposed system demonstrates the highest classification performance, with a multiclass accuracy of up to 90.07% at a distance of 141 cm from the subject using the VGGNet model.

Index Terms—British sign language (BSL), contactless monitoring, deep learning (DL), micro-Doppler signatures, radio frequency (RF) sensing.

Manuscript received 21 June 2022; revised 11 September 2022; accepted 18 September 2022. Date of publication 13 October 2022; date of current version 2 August 2023. This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/T021020/1 and Grant EP/T021063/1. (Corresponding author: Muhammad Usman.) This work involved human subjects or animals in its research. Approval of all ethical and experimental procedures and protocols was granted by the University of Glasgow's Research Ethics Committee under Approval No. 300200232 and No. 300190109.

Hira Hameed, Muhammad Ali Imran, and Qammer H. Abbasi are with the James Watt School of Engineering, University of Glasgow, G12 8QQ Glasgow, U.K. Muhammad Usman is with the James Watt School of Engineering, University of Glasgow, G12 8QQ Glasgow, U.K., and also with the School of Computing, Engineering and Built Environment, Glasgow Caledonian University, G4 0BA Glasgow, U.K. Ahsen Tahir is with the James Watt School of Engineering, University of Glasgow, G12 8QQ Glasgow, U.K., and also with the Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan. Kashif Ahmad is with the Department of Computer Science, Munster Technological University, T12 P928 Cork, Ireland. Amir Hussain is with the School of Computing, Edinburgh Napier University, EH10 5DT Edinburgh, U.K.

Digital Object Identifier 10.1109/TCSS.2022.3210288

I. INTRODUCTION

OVER 430 million people worldwide live with some form of hearing impairment, representing a significant 5% of the world's population, and hearing loss is expected to afflict roughly 700 million people by 2050 [1]. People who have trouble hearing rely on sign language to communicate, depending on their level of disability. Millions of people throughout the world use sign language to communicate, and it is critical for their social inclusion. Similar to spoken languages, sign language is used in different variants in different parts of the world. The importance of sign language, together with its origins, is discussed in [6] and [7], with examples from American, Japanese, Chinese, Arabic, and British sign languages (BSL).

Automatic sign language recognition (SLR) is a highly challenging field of research, and the current progress still remains in its infancy compared with speech recognition. Many efforts to automate the translation of sign language into speech/text have been made in the literature using sensor- or vision-based approaches.

For sensor-based approaches, more than one sensor is generally adopted by researchers. Preetham et al. [22] utilized ten flex sensors to develop a single-hand data glove for hand gesture recognition. These flex sensors are attached to the two joints of each finger. Patil et al. [23] proposed a data glove that utilized five flex sensors, one for each finger, to reduce the bulkiness of the wearable glove. However, flex sensors can only measure the flexion of the fingers and cannot capture finger and hand movements or orientation. Wang et al. [24] proposed a gesture recognition system using a data glove that utilized five flex sensors along with a three-axis accelerometer. The accelerometer was placed in the center of the back of the hand palm. The system recognizes 50 Chinese sign language gestures in real time with an accuracy of around 91%. However, wearable systems are intrusive, and devices such as hand gloves are cumbersome to wear and limit human–computer interaction for subjects who cannot carry or wear the sensors for continuous interaction.

Vision-based systems, such as Kinect sensor systems, may perform fusion of depth data and color information for the recognition of static signs of sign language [8]. Colored gloves are used in [9] along with hidden Markov models for the automatic recognition of sign language.

Jitcharoenpory et al. [10] presented a system for the recognition of Thai sign language using a glove-based device equipped with flex sensors and gyroscopes. Furthermore, the authors classified the data using the K-nearest neighbor algorithm. Camera-based systems are widely used to recognize sign language because of the wide availability of cameras and of tools to comprehend video and images. For this reason, the majority of the existing works that focus on designing SLR systems generally utilize widely available 2-D video cameras [11]. One such work is presented by Pigou et al. [12], where they propose a deep neural network with bidirectional recurrences and temporal convolutions. This article aims at identifying sign language gestures with a focus on enhancing the framewise accuracy of gesture recognition. Interestingly, the work in [13] combines the data from a gray-scale video with depth information and the hand's skeletal joints. A convolutional neural network (CNN)-based framework is designed that integrates the features coming from all three data sources. Raj and Jasuja [18] utilized an artificial neural network (ANN) to classify BSL signs on image data coming from a standard 2-D camera, where the images were examined using the histogram of oriented gradients (HOG). However, all camera-based systems have certain fundamental flaws: the video may be recorded, and the video/images raise privacy concerns in real-world applications. Furthermore, ambient lighting is an important requirement and a limitation of vision-based systems for SLR.

Contactless radio frequency (RF) sensing has recently been studied in the healthcare sector for its applications in activity monitoring and assisted living. The "contactless" feature of RF sensing obliterates the requirement of wearing and carrying devices. Furthermore, an added advantage of RF sensing is its ability to leverage the existing communication infrastructure, such as Wi-Fi routers. Moreover, radar-based sensing systems can alleviate the limitations of camera-based systems by protecting users' privacy and providing immunity to variations in ambient lighting. From a privacy perspective, people are wary of disclosing their private data, and the information they convey through sign language is confidential between the people communicating.

These radar systems exploit the Doppler signatures of the reflected wave, which capture unique hand movements. McCleary et al. [19] used a frequency-modulated continuous-wave (FMCW) radar to classify four BSL gestures with the help of deep CNN models. Similarly, an FMCW radar has been used in [20] to classify different words in American sign language with the help of generative adversarial networks (GANs). The feature-level fusion of RF sensor network data has been used for machine learning (ML) classification of American sign language [21]. Diverse hand and arm motions create distinct multipath distortions in Wi-Fi signals, resulting in distinctive patterns in the time series of channel state information (CSI), which have been utilized for the classification of sign language using a kernel-based SVM [15]. MacLaughlin et al. [17] demonstrated that multiple radar antenna signals received at different azimuth angles can be utilized to differentiate between different subjects for SLR with a fusion-based CNN. The author of the study in [28] uses a deep learning (DL) architecture to solve the challenge of BSL fingerspelling alphabet recognition. The suggested work performs better than the existing works in terms of precision (6%), recall (4%), and F-measure (5%), and it reported better outcomes with 98.0% accuracy on the webcam and BSL corpus datasets [28]. To overcome the communication gap between speech-impaired and nonspeech-impaired community members, the authors of [29] develop a DL model for predicting BSL, in which a CNN and a long short-term memory (LSTM) network employ vision-based data. The CNN model performed the best, achieving training and testing accuracies of 98.8% and 97.4%, whereas the LSTM model performed poorly, achieving training and testing accuracies of 49.4% and 48.8%, respectively [29]. The study in [30] presents the use of a four-beam patch antenna as a sensor node to evaluate the pill-rolling effect in Parkinson's disease using the S-band sensing technique at 2.4 GHz. The proposed system uses amplitude and phase information to efficiently distinguish between finger tremors and nontremors, and support vector machines are determined to have an accuracy of more than 90% on the collected dataset [30].

Although RF sensing has been partly discussed in the literature for recognizing sign language, a diverse dataset that includes samples from a wide range of subjects (diverse age and sex) and covers a diverse number of classes is missing in the literature. To bridge this gap, this work focuses on recognizing different gestures in BSL using micro-Doppler signatures of the data collected using a radar sensor. Fifteen different types of Doppler signatures are considered, which include verbs (Drink, Eat, Help, Stop, and Walk), emotions (Sad, Happy, Hate, Depressed, and Confused), and the family group (Family, Brother, Father, Mother, and Sister). These categories of BSL signs include dynamic gestures, wherein mobility or movements of the hands are used to represent various signs. An ultrawideband (UWB) radar, namely, the XeThru X4M03, was utilized for recording the dataset. We note that the dataset is recorded at two different distances and angles. These characteristics make the dataset a better choice for the training and evaluation of ML algorithms for BSL sign recognition and translation. The recorded data are represented in the form of spectrograms, and spatiotemporal features are further extracted using the GoogLeNet, SqueezeNet, and VGGNet CNNs.

The main contributions of the work are summarized as follows.

1) We propose a contactless BSL recognition system that automatically recognizes and translates BSL signs into verbs and emotions.
2) We propose a contactless BSL recognition system that automatically recognizes and translates BSL signs.
3) We also collect a large-scale benchmark dataset containing a total of 1950 samples from 15 different types of BSL signs captured at the distances of 141 and 154 cm. Moreover, the data samples are captured from two different angles. To ensure diversity, the data were collected from four deaf participants (one male and three females) aged between 16 and 82 years.
4) We report the experimental results of several state-of-the-art DL models on the dataset, which will provide a baseline for future research in the domain.


5) Using our contactless data, we achieve 90.07% accuracy across all 15 classes using DL models.

The remaining sections of this article are organized as follows. Section II discusses the adopted methodology, including the data collection and annotation, the preprocessing, and the DL algorithms. The dataset, evaluation metrics, experimental setup, results, and discussion are presented in Section III. The conclusions and future insights are presented in Section IV.

II. METHODOLOGY

Fig. 1 illustrates the methodology used in this study as a block diagram. The proposed framework is divided into three phases. In the first phase, we collected and annotated a large collection of BSL signs. In the second phase, we used signal processing techniques to extract the spectrograms of the various signs captured in the first phase. Finally, several DL models were employed to classify the signs. In Sections II-A–II-C, we provide a detailed description of each of the phases of the proposed methodology.

Fig. 1. High-level signal flow diagram of the proposed framework highlighting the UWB radar-based system, data collection, and the DL models for the classification of BSL signs.

A. Data Collection

In this phase, we collected BSL sign data through UWB radar. Fig. 2 provides an overview of the hardware setup of the radar-based BSL data collection system. To this aim, a UWB radar sensor, namely, the Xethru X4M03, is used. The Xethru X4M03 is a UWB radar sensor with built-in transmitter (Tx) and receiver (Rx) antennas, providing a maximum detection range of 9.6 m.

Fig. 2. Experimental setup of data collection using the Xethru UWB radar sensor. (a) Measurable experimental setup. (b) Real experimental setup.

As shown in Fig. 2, the radar sensor is placed on top of the screen of the laptop. The key parameter settings of the radar are indicated in Table I. In order to encompass different complexity levels in the dataset, the radar sensor was placed at two different distances from the subject (i.e., the participant), namely, 141 and 154 cm. Moreover, the sign gestures of the subject were recorded at two different angles.


One radar position was directly in front of the subject at a distance of 141 cm, while the second was at an angle of 24° from the subject at a distance of 154 cm (see Fig. 2). The variations in distance and viewpoint are expected to help in training distance- and viewpoint-invariant DL models that are able to recognize the gestures of the subjects from different distances and angles. During the data collection, the body of the subject was in a normal position with head and hand movements only. Moreover, the duration of each activity was set to 5 s, involving the data collection of a single gesture from a single subject. Fig. 3 provides a visual illustration of the pronounced BSL signs.¹

Moreover, four deaf subjects/participants, one male and three females, participated in the data collection process. The reason to include more participants was to make the dataset more realistic and diverse. A total of 1950 data samples were collected during the experiment for 15 different categories at the distances of 141 and 154 cm. The details of the collected dataset are highlighted in Table II. In each experiment, a total of 975 data samples were collected from the four deaf participants, where 15 samples were collected in each class. In particular, each participant repeated the speaking activity of each gesture 15 times with the radar. In this way, each participant contributed 225 or 300 data samples in total for the 15 classes. In each case, a total of 975 spectrograms were categorized into the 15 BSL gestures, where 750 were utilized for training and 225 for testing.

TABLE I. Parameters configuration of the radar software and hardware.

TABLE II. Overview of the data collected, number of subjects, and the activities performed.

¹All the data and the photographs for this research are being published with the consent of the participants.

B. Preprocessing

This section describes the preprocessing steps carried out to extract the required spectrograms from the radar's data.

In the beginning, the radar chip was configured via the XEP interface with x4driver. Data were recorded from the module at 500 frames/s (FPS) in the form of float message data, where each value is a 32-bit floating point number. A loop was used to read the data file and save the data into a DataStream variable, which was mapped into a complex range-time intensity (RTI) matrix. Thereafter, a moving target indication (MTI) filter was applied to obtain the range-Doppler map. Afterward, a second MTI stage, implemented as a fourth-order Butterworth filter, was used to generate the spectrograms using the following parameters: window length, overlap percentage, and fast Fourier transform (FFT) padding factor. In particular, a window length of 128 samples and a padding factor of 16 were used. In addition, a range profile was created by first converting each chirp to an FFT. Afterward, given a range bin, a second FFT was performed on a selected number of consecutive chirps [31]. Thereafter, spectrograms were generated leveraging a short-time Fourier transform (STFT). The reason for exploiting the STFT is its ability to offer both temporal and frequency information, unlike the FT, which provides frequency information only. This is achieved through data segmentation, wherein an FT is performed on each data segment. When the window length is changed, the temporal and frequency resolutions are altered inversely: increasing one decreases the other. Moreover, the level of Doppler detail is mainly dependent on the sampling capability of the radar's hardware. The greatest unambiguous Doppler frequency of the radar is $F_{d,\max} = 1/(2t_r)$, where $t_r$ represents the chirp time. The transmitted signal, $T_s$, can be represented by the following expression:

$$T_s(t) = E \cos(2\pi f t) \tag{1}$$

where $E$ is the reflection coefficient. At the Rx, the received signal, $R_s(t)$, can be represented by the following expression:

$$R_s(t) = \acute{E} \cos\!\left(2\pi f \left(t - \frac{2D(t)}{c}\right)\right) \tag{2}$$

where $c$ is the speed of light and $D(t)$ is the distance of the radar from the target location while the hand movements are performed over time. Now, (2) can be rewritten as the reflected signal, which can be expressed as

$$R_s(t) = \acute{E} \cos\!\left(2\pi f \left(1 + \frac{2v(t)}{c}\right)t - \frac{4\pi D(\theta)}{c}\right) \tag{3}$$

where $\theta$ represents the angle of the signal reflected off the target points toward the direction of the radar, and $v(t)$ represents the velocity of the target movement in front of the radar. The Doppler shift, $f_d$, can be expressed as follows:

$$f_d = \frac{2v(t)}{c}\, f. \tag{4}$$

Fig. 3. Visual illustration of the pronounced BSL signs. (a) Brother. (b) Sister. (c) Mother. (d) Father. (e) Family. (f) Confuse. (g) Depress. (h) Happy. (i) Hate. (j) Sad. (k) Walk. (l) Eat. (m) Help. (n) Drink. (o) Stop.
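To make the preprocessing chain of Section II-B concrete, the following is a minimal Python sketch of the MTI filtering and STFT steps, assuming the raw radar frames have already been saved to disk as a complex range-time matrix; the function name, file path, cutoff frequency, and the use of NumPy/SciPy are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np
from scipy import signal

def rti_to_spectrogram(rti, fps=500, range_bin=10,
                       window_len=128, overlap=0.95, pad_factor=16):
    """Turn one range bin of a complex range-time intensity (RTI) matrix
    into a micro-Doppler spectrogram, loosely following Section II-B:
    a fourth-order Butterworth high-pass as the MTI stage, then an STFT
    with a 128-sample window and a zero-padding factor of 16."""
    slow_time = rti[:, range_bin]          # complex samples over slow time

    # MTI stage: suppress static clutter with a 4th-order Butterworth
    # high-pass. The 0.01 normalized cutoff is an assumed value.
    b, a = signal.butter(4, 0.01, btype="highpass")
    filtered = signal.lfilter(b, a, slow_time)

    # STFT: window length 128, heavy overlap, FFT length padded by 16x.
    f, t, stft = signal.stft(filtered, fs=fps,
                             window="hann",
                             nperseg=window_len,
                             noverlap=int(window_len * overlap),
                             nfft=window_len * pad_factor,
                             return_onesided=False)
    spectrogram_db = 20 * np.log10(np.abs(np.fft.fftshift(stft, axes=0)) + 1e-12)
    return np.fft.fftshift(f), t, spectrogram_db

# Example usage with a hypothetical recording saved as a .npy file:
# rti = np.load("bsl_sample_drink.npy")   # shape: (slow_time, range_bins)
# freqs, times, spec = rti_to_spectrogram(rti)
```

The resulting time-frequency maps are what the pretrained CNNs in Section II-C consume as images.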

The reflected signal received at the Rx is a composite of many components reflected from different moving parts of the body, such as the head and hands. Each reflected component has unique characteristics of speed and acceleration. Let us assume that there are N unique reflected components; then, (3) can be rewritten as the following expression:

$$R_s(t) = \sum_{k=1}^{N} E_k \cos\!\left(2\pi f \left(1 + \frac{2v_k(t)}{c}\right)t - \frac{4\pi D_k(0)}{c}\right). \tag{5}$$

Accordingly, the combined Doppler shift comprises an interaction of the various individual Doppler shifts caused by the unique movements of the hands and head. Therefore, the successful classification of sign language gestures depends on the unique characteristics of the Doppler signatures produced by the different gestures. After obtaining the spectrograms of the various signs from the participants, a dataset was constructed. As indicated in the high-level signal flow diagram in Fig. 1, the dataset is used in two key modules: 1) system training and 2) system testing. The proposed pretrained DL classification algorithms were applied to the spectrograms to recognize the BSL dataset.

C. Classification via Deep Models

For classification, the spectrograms generated in the previous step are fed into DL models. Three different pretrained models are considered for this purpose: GoogLeNet, SqueezeNet, and VGGNet. Our classification framework to differentiate between BSL signs/activities is mainly based on fine-tuning pretrained models, where multiple state-of-the-art CNN architectures pretrained on ImageNet [5] are fine-tuned on the spectrogram images generated from the radar data. In fine-tuning the pretrained models, we modify the top layers of the models to classify the collected data into the 15 considered classes. The CNN architectures used in this work are described in detail in Sections II-C1–II-C3.

1) GoogLeNet Model: GoogLeNet [4] is one of the state-of-the-art and commonly used CNN architectures for different image classification tasks [3]. The architecture is made up of 22 layers, including convolutional, pooling, inception, and fully connected layers.


Six convolutional layers plus a pooling layer comprise the inception module. The module is made up of patches or filters of sizes 1 × 1, 3 × 3, and 5 × 5. These different-sized filters help in obtaining diverse patterns from the input image. At the output of each module, the feature maps produced by the various filters are concatenated. Furthermore, 1 × 1 convolutions are performed prior to the larger filter convolutions. The use of a 1 × 1 convolution filter reduces the number of parameters that GoogLeNet requires. The hyperparameter settings of GoogLeNet are shown in Table III.

2) SqueezeNet Model: Our second pretrained model is based on the SqueezeNet architecture [2], which is composed of 18 layers. This architecture has shown comparable outcomes with 50 times fewer parameters, making it a better choice for applications with limited data and processing resources. SqueezeNet adopts three major strategies. First, it replaces 3 × 3 filters with 1 × 1 filters in the squeeze layer. Second, it uses an expand layer in which the 1 × 1 and 3 × 3 filters are fed with fewer input parameters. Third, it downsamples late in the network (with smaller stride values), resulting in larger activation maps in the final layers, which improves accuracy. The parameter settings of SqueezeNet are shown in Table III.

3) VGGNet Model: Another pretrained model is based on the VGG16 architecture [27], which is composed of 16 layers. This architecture contains 138 million parameters and uses a 3 × 3 filter size with a stride of 1. Furthermore, the padding and the max-pooling layers use a 2 × 2 filter with a stride of 2. The layers in this design are organized as follows: convolutional layers, rectified linear unit (ReLU) layers, and max-pooling layers. ReLU provides efficient computing with faster learning. At the end, the network has three fully connected layers and a softmax output. The parameter settings of VGG16 are shown in Table III.

TABLE III. Parameter settings for the selected models.

III. EXPERIMENTS AND RESULTS

This section highlights the dataset description along with the system evaluation using the discussed pretrained DL models.

A. Dataset

The data collection and preprocessing phases described earlier resulted in a collection of spectrograms. In total, the dataset is composed of 1950 samples from 15 different categories/classes. These classes can be subgrouped into three groups, namely: 1) verbs; 2) emotions; and 3) family. The verbs group includes five classes, namely, Drink, Eat, Help, Stop, and Walk. The emotions include Confused, Depressed, Happy, Hate, and Sad, while the final group is made of the Family, Brother, Father, Mother, and Sister classes. Each of these 15 classes contains an equal number of samples. Fig. 4 provides some sample images/spectrograms from the dataset. The dataset has been divided into two subsets, namely, the training set and the test set. The training set is composed of 1560 samples, while the test set provides a total of 390 samples. The training and test sets have equal representation from all the classes and subjects.

B. Evaluation Metrics for Classification Model

In this work, the performance of the DL models in the classification of BSL signs is evaluated in terms of weighted average accuracy, precision, recall, and F1 Score. The F1 Score is one of the most commonly used metrics in the literature for classification and is calculated using (6). It is a combination of precision and recall, which are calculated using (7) and (8), respectively,

$$F_1 = 2\,\frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$

$$\text{Precision} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}} \tag{7}$$

$$\text{Recall} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}}. \tag{8}$$

C. Results and Discussion

The objectives of the experimentation in this work are twofold. On one side, we want to analyze the performance of different existing pretrained models on the newly collected BSL dataset. On the other hand, we want to analyze the impact of variations in viewpoint and distance from the subject on the performance of BSL frameworks. Therefore, we conducted two different experiments by evaluating the performances of the models on spectrograms captured at the distances of 141 and 154 cm.

The hyperparameter settings for all the models are provided in Table III. All the models are fine-tuned on the dataset for a maximum of 100 epochs. Moreover, in all the experiments, fixed training and test sets are used. Our training and test sets are composed of 80% and 20% of the total data, respectively.
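As an illustration of the fine-tuning strategy described in Section II-C, the following PyTorch-style sketch replaces the top layer of an ImageNet-pretrained VGG16 with a 15-class head and trains it on spectrogram images arranged in class folders; the directory layout, the optimizer and batch size, and the choice of PyTorch/torchvision are assumptions for illustration (only the 15 classes and the 100-epoch cap come from the paper).

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_CLASSES = 15          # Drink, Eat, Help, ..., Sister
EPOCHS = 100              # the models are fine-tuned for at most 100 epochs

# Spectrogram images resized to the 224x224 input expected by VGG16.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Assumed layout: spectrograms/train/<class_name>/*.png and spectrograms/test/...
train_set = datasets.ImageFolder("spectrograms/train", transform=preprocess)
test_set = datasets.ImageFolder("spectrograms/test", transform=preprocess)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=32)

# Start from ImageNet weights and swap the final fully connected layer
# for a 15-way classifier, as described in Section II-C.
model = models.vgg16(pretrained=True)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

for epoch in range(EPOCHS):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

In torchvision's implementations, GoogLeNet and SqueezeNet can be fine-tuned the same way by replacing model.fc and model.classifier[1], respectively, with a 15-output layer.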


Fig. 4. Sample spectrograms obtained for the 15 BSL signs. (a) Brother. (b) Sister. (c) Mother. (d) Father. (e) Family. (f) Confuse. (g) Depress. (h) Happy. (i) Hate. (j) Sad. (k) Walk. (l) Eat. (m) Help. (n) Drink. (o) Stop.

TABLE IV. Evaluation of the pretrained models in terms of macro-recall, macro-precision, macro-F1 score, accuracy, and 95% confidence interval on the datasets captured at the distances of 141 and 154 cm.

Table IV provides the results of the experiments conducted at the distances of 141 and 154 cm in terms of precision, recall, and F1 Score. As expected, overall better results, for all the models, are obtained when the radar sensor is placed at a distance of 141 cm compared with 154 cm. This is because the 141-cm radar is placed exactly in front of the subject, and the micro-Doppler signature at this viewpoint is more sensitive to hand movements compared with what is received at the 154-cm radar for the same movements. As a result, the ML models better classify the hand movements at this viewpoint. As far as the performances of the pretrained models are concerned, overall better results are obtained with VGGNet, which achieves an F1 Score of 0.87.

In order to better analyze the performances of the models, we also provide confusion matrices for each model at each distance in Fig. 5. Fig. 5(a) illustrates the confusion matrix of the GoogleNet model at the 141-cm distance. It is worth noting that most of the signs are correctly identified, with most of them having close to 100% accuracy. The lowest accuracy is 0.26% on the Brother class, which is mostly confused with the class Happy. Similarly, the confusion matrix of GoogleNet at a distance of 154 cm is presented in Fig. 5(b), where the classification accuracy is nearly 100% for all classes except the classes Brother, Mother, and Sister. The samples from the class Brother are mostly (around 0.13%) confused with the class Happy. The samples from the class Mother are confused with the samples from the classes Father and Family. This is due to the fact that, in both classes, both the right- and left-hand fingers move in the same manner. Similarly, the class Sister has a resemblance with the class Drink, as the participants used their right-hand index finger and thumb to touch the nose.

Similarly, the confusion matrix of SqueezeNet at a distance of 141 cm is presented in Fig. 5(c). Here, again, most of the test samples from all the BSL signs are correctly identified, with the exception of Brother, which shows similarities with the Confused class. This may be because, in both classes, the subjects rub their right hand and left hand against each other near the head. Furthermore, the confusion matrix of SqueezeNet at a distance of 154 cm is presented in Fig. 5(d). In this case, most of the signs are correctly identified, with the exception of the classes Brother and Mother, which show similarities with the classes Happy and Father. Moreover, Sad gives a classification accuracy of 80%, with only 20% matching/confusion with the class Depressed.
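For reproducibility, per-class confusion matrices and the macro-averaged metrics reported in Table IV can be computed from the test-set predictions with a few lines of scikit-learn; the variable names (y_true, y_pred) and the use of scikit-learn are illustrative assumptions rather than the authors' actual evaluation script.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

CLASS_NAMES = ["Drink", "Eat", "Help", "Stop", "Walk",
               "Confused", "Depressed", "Happy", "Hate", "Sad",
               "Family", "Brother", "Father", "Mother", "Sister"]

def evaluate(y_true, y_pred):
    """Report accuracy and macro precision/recall/F1, and return a
    row-normalized confusion matrix like those shown in Fig. 5."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)

    cm = confusion_matrix(y_true, y_pred, labels=range(len(CLASS_NAMES)))
    cm_normalized = cm / cm.sum(axis=1, keepdims=True)  # per-class rates

    print(f"accuracy={accuracy:.4f}  macro-P={precision:.4f}  "
          f"macro-R={recall:.4f}  macro-F1={f1:.4f}")
    return cm_normalized

# Example usage with integer class labels gathered from the test loader:
# cm = evaluate(y_true, y_pred)
```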


Fig. 5. Confusion matrices of all the models at the two different distances. (a) Confusion matrix of GoogleNet at 141 cm. (b) Confusion matrix of GoogleNet at 154 cm. (c) Confusion matrix of SqueezeNet at 141 cm. (d) Confusion matrix of SqueezeNet at 154 cm. (e) Confusion matrix of VGGNet at 141 cm. (f) Confusion matrix of VGGNet at 154 cm.


Finally, the confusion matrix of VGG16 at a distance of 141 cm is presented in Fig. 5(e). Here, again, most of the BSL signs are correctly recognized, with the exception of Depressed, which is similar to the class Sad. This is because the way these two signs are performed is similar: in both classes, the subject's face makes a downward movement. On the other side, Help is similar to Family, because both signs have upturned hands while performing the activity. Furthermore, the confusion matrix of VGG16 at a distance of 154 cm is presented in Fig. 5(f). In this case, VGG16 mostly classifies the considered classes correctly, with few exceptions. The class Confused shows a resemblance with two classes, namely, Family and Stop, due to related movements. Nevertheless, the Confused class shows only 20% similarity with Family and 10% with the Stop class, producing 70% correct classification.

IV. CONCLUSION AND FUTURE WORK

This article presented a BSL recognition framework that operates in a contactless and privacy-preserving manner. Off-the-shelf XeThru X4M03 UWB radar sensors were used in combination with DL models. Fifteen of the most common gestures in BSL were studied, including verbs (Drink, Eat, Help, Stop, and Walk), emotions (Confused, Depressed, Happy, Hate, and Sad), and the family group (Family, Brother, Father, Mother, and Sister). The experiment included four participants, three of whom were deaf, ranging in age from 16 to 82 years. Unique micro-Doppler features were retained in the form of spectrograms for each class, and three DL models, GoogleNet, SqueezeNet, and VGGNet, were trained using these. Most of the gestures were correctly identified with 100% classification accuracy. The VGGNet model surpassed the others, yielding an overall accuracy of 90.07% across all 15 classes.

This preliminary work produced a dataset of the 15 most common BSL gestures, which will be expanded in future research with additional vocabulary or sentences. We intend to strengthen our model by adding more users. The long-term goal is to create a real-time, intuitive version of BSL recognition that can be scaled to other sign languages and personalized for use by a range of end users, such as deaf and blind children.

REFERENCES

[1] WHO. (2021). Deafness and Hearing Loss. Accessed: Apr. 1, 2021. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss
[2] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size," 2016, arXiv:1602.07360.
[3] K. Ahmad and N. Conci, "How deep features have improved event recognition in multimedia: A survey," ACM Trans. Multimedia Comput., Commun., Appl., vol. 15, no. 2, pp. 1–27, Jun. 2019.
[4] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[6] G. Simons and C. Fennig, "Ethnologue: Languages of Honduras," SIL Int., Dallas, TX, USA, Tech. Rep., 2017.
[7] J. Fenlon and E. Wilkinson, "Sign languages in the world," in Sociolinguistics and Deaf Communities. Cambridge, U.K.: Cambridge Univ. Press, 2015, pp. 5–28.
[8] S. Jadooki, D. Mohamad, T. Saba, A. S. Almazyad, and A. Rehman, "Fused features mining for depth-based hand gesture recognition to classify blind human communication," Neural Comput. Appl., vol. 28, no. 11, pp. 3285–3294, Nov. 2017.
[9] B. Bauer and H. Hienz, "Relevant features for video-based continuous sign language recognition," in Proc. 4th IEEE Int. Conf. Autom. Face Gesture Recognit., Mar. 2000, pp. 440–445.
[10] R. Jitcharoenpory, P. Senechakr, M. Dahlan, A. Suchato, E. Chuangsuwanich, and P. Punyabukkana, "Recognizing words in Thai sign language using flex sensors and gyroscopes," in Proc. i-CREATe, vol. 4, 2017, pp. 1–4.
[11] M. Mohandes, M. Deriche, and J. Liu, "Image-based and sensor-based approaches to Arabic sign language recognition," IEEE Trans. Human-Mach. Syst., vol. 44, no. 4, pp. 551–557, Aug. 2014.
[12] L. Pigou, A. van den Oord, S. Dieleman, M. Van Herreweghe, and J. Dambre, "Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video," Int. J. Comput. Vis., vol. 126, nos. 2–4, pp. 430–439, Apr. 2018.
[13] N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout, "Multi-scale deep learning for gesture detection and localization," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 474–490.
[14] B. Li, J. Yang, Y. Yang, C. Li, and Y. Zhang, "Sign language/gesture recognition based on cumulative distribution density features using UWB radar," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–13, 2021.
[15] J. Shang and J. Wu, "A robust sign language recognition system with multiple Wi-Fi devices," in Proc. Workshop Mobility Evolving Internet Archit., Aug. 2017, pp. 19–24.
[16] T. W. Chong and B. J. Kim, "American sign language recognition system using wearable sensors with deep learning approach," J. Korea Inst. Electron. Commun. Sci., vol. 15, no. 2, pp. 291–298, 2020.
[17] G. MacLaughlin, J. Malcolm, and S. A. Hamza, "Multi antenna radar system for American sign language (ASL) recognition using deep learning," 2022, arXiv:2203.16624.
[18] R. D. Raj and A. Jasuja, "British sign language recognition using HOG," in Proc. IEEE Int. Students' Conf. Electr., Electron. Comput. Sci. (SCEECS), Feb. 2018, pp. 1–4.
[19] J. McCleary, L. P. Garcia, C. Ilioudis, and C. Clemente, "Sign language recognition using micro-Doppler and explainable deep learning," in Proc. IEEE Radar Conf. (RadarConf), May 2021, pp. 1–6.
[20] M. M. Rahman et al., "Word-level sign language recognition using linguistic adaptation of 77 GHz FMCW radar data," in Proc. IEEE Radar Conf. (RadarConf), May 2021, pp. 1–6.
[21] S. Gurbuz et al., "American sign language recognition using RF sensing," IEEE Sensors J., vol. 21, no. 3, pp. 3763–3775, Feb. 2021.
[22] C. Preetham, G. Ramakrishnan, S. Kumar, A. Tamse, and N. Krishnapura, "Hand talk-implementation of a gesture recognizing glove," in Proc. Texas Instrum. India Educators' Conf., Apr. 2013, pp. 328–331.
[23] K. Patil, G. Pendharkar, and G. N. Gaikwad, "American sign language detection," Int. J. Sci. Res. Publications, vol. 4, no. 11, pp. 1–6, 2014.
[24] X. Wang, M. Xia, H. Cai, Y. Gao, and C. Cattani, "Hidden-Markov-models-based dynamic hand gesture recognition," Math. Problems Eng., vol. 2012, pp. 1–11, Feb. 2012.
[25] B. G. Lee and S. M. Lee, "Smart wearable hand device for sign language interpretation system with sensors fusion," IEEE Sensors J., vol. 18, no. 3, pp. 1224–1232, Feb. 2018.
[26] D. Fairchild, R. Narayanan, E. Beckel, W. Luk, and G. Gaeta, Through-the-Wall Micro-Doppler Signatures, V. C. Chen, D. Tahmoush, and W. J. Miceli, Eds. U.K.: IET, 2014.
[27] A. K. Rangarajan and R. Purushothaman, "Disease classification in eggplant using pre-trained VGG16 and MSVM," Sci. Rep., vol. 10, no. 1, pp. 1–11, Dec. 2020.
[28] K. Kumar, "DEAF-BSL: Deep learning framework for British sign language recognition," ACM Trans. Asian Low-Resour. Lang. Inf. Process., vol. 21, no. 5, pp. 1–14, Sep. 2022.
[29] S. Dhulipala, F. F. Adedoyin, and A. Bruno, "Sign and human action detection using deep learning," J. Imag., vol. 8, no. 7, p. 192, Jul. 2022.
[30] S. A. Shah, X. Yang, and Q. H. Abbasi, "Cognitive health care system and its application in pill-rolling assessment," Int. J. Numer. Model., Electron. Netw., Devices Fields, vol. 32, no. 6, Nov. 2019, Art. no. e2632.
[31] C. Clemente, A. Balleri, K. Woodbridge, and J. J. Soraghan, "Developments in target micro-Doppler signatures analysis: Radar imaging, ultrasound and through-the-wall radar," EURASIP J. Adv. Signal Process., vol. 2013, no. 1, pp. 1–18, Dec. 2013.

