Abstract— This work presents a novel Computer Vision approach in the development of a real-time, web-camera based, British Sign Language recognition system. A literature review focused on (1) the current state of sign language recognition systems and (2) the techniques they use is conducted. This review is used as a foundation on which a Convolutional Neural Network (CNN) based system is designed and then implemented. A bespoke British Sign Language dataset, containing 11,875 images, is then used to train and test the CNN, which performs the classification of human hand gestures. The resulting CNN architecture recognises 19 static British Sign Language gestures, incorporating both single- and double-handed gestures. During testing, the system achieved an average recognition accuracy of 89%.

Keywords— sign language recognition, AI, CNN, human-machine interaction

I. INTRODUCTION

Research in the area of systems that are capable of recognising sign language has received substantial attention over the past few decades, fuelled in particular by the rapid evolution of artificial intelligence techniques [1-3]. In turn, this has led to the development of many Sign Language Recognition systems, which shall be referred to as SLR systems throughout the remainder of this chapter. These systems, though varying in sign language dialect, share the common goal of correctly recognising hand gestures performed by a signer. However, the varying proposed approaches to achieving this goal have produced a diverse area of research and development, encompassing areas of computer science such as Computer Vision (CV), Sensor Processing, Human-Computer Interaction, and Pattern Recognition [1-4]. Among SLR systems there are two main types of design and implementation: those that use wearable sensors, and those that use video footage and images. Both shall be discussed below.

SLR systems that utilise sensors worn on the body to capture sign language gestures usually comprise sensor-embedded gloves worn on the hands. These types of SLR systems are one of the two main approaches to capturing gestures to be classified [1]. Many sensor-based SLR systems have been developed, and of these, many rely on sensor fusion to achieve an accurate recognition rate, such as the system proposed by Kim et al. [5], in which 'bi-channel sensor fusion' is used to combine data from an accelerometer- and electromyogram-embedded glove that covers the hand and upper wrist to recognise German Sign Language gestures [6, 7].

SLR systems designed to use video footage or image data to capture a gesture performed by a user can be further categorised into two methods. The first method involves capturing gesture data using a 3D camera, and the second method involves capturing gesture data using a 2D camera. Therefore, the literature reviewed in this sub-chapter is presented in two further sub-chapters, discussing SLR systems which use 3D cameras and SLR systems which use 2D cameras respectively.

An advantage of using a 3D camera over a 2D camera is that the depth-capturing capabilities of a 3D camera allow for easier image pre-processing, as everything registered as greater in depth than the signer can quickly be removed, solving the issues of complex environmental backgrounds and lighting seen with 2D cameras [8].

Other SLR systems are only capable of recognising sign language gestures in uniform background and lighting conditions, such as the work of Tolentino et al. [9], which achieved a recognition accuracy of 93.67% in recognising American Sign Language gestures under uniform lighting and background conditions when tested with 30 individuals. The system operated in real time and used a CNN to classify the gestures performed. However, the system modified some of the gestures, because their similarity to other gestures would have caused misrecognition and affected the accuracy of the system. The SLR system proposed by Sawant and Kumbhar [10] also avoided the issue of complex backgrounds by capturing all testing footage on a white background, which not only limited background interference but also limited non-uniform lighting. This SLR system was able to recognise 26 Indian Sign Language gestures and used Principal Component Analysis (PCA) during the classification stage. A paper authored by Berru-Novoa et al. [11] tested a host of classification methods for an SLR system using a uniform background and lighting approach. This study found that, amongst a Support Vector Machine (SVM), a K-Nearest Neighbours (KNN) classifier, and an Artificial Neural Network (ANN), the SVM performed best, achieving a recognition accuracy of 89.17%; however, the KNN and ANN were less than 2% behind. This system used a Histogram of Oriented Gradients (HOG) in the feature extraction stage.
II. MATERIALS AND METHODS

The design of the sign language recognition system itself is a multi-step process. All of the steps involved are detailed below.

A. The Complexity of Sign Language

All sign languages are gesturally complex, with some signs being performed with one hand and others being performed with two hands. The position and rotation of the hands are important in the gestures, as is the position of the fingers.
Fig. 1. The British Sign Language gestures to be used in the system.

Of the signed gestures seen in Fig. 1, a diverse range can be identified. For example, signs such as 0 and C are performed one-handed, whereas signs such as Q and W are performed double-handed. Furthermore, the importance of finger position is evident with signs such as A and L, in which the splayed fingers and the position of the fingers on the secondary hand of A make the difference between gesturing an A and gesturing an L. However, there are also clear similarities between some gestures, such as 5 and E, in which the only defining factor of the E gesture is the secondary index finger pointing to the index finger of the main hand; without this, the E gesture would become a 5 gesture. These will be challenging cases for the system to deal with, and they have been incorporated to test the robustness of the system.
B. A web-camera based system

The system will make use of a web-camera for capturing the gesture performed by the user. This is because, in the real world, web-cameras are common, portable, and cheap, making them accessible to users of a sign language recognition system. The web-camera based approach was discussed in the literature review chapter, and the many SLR systems developed using this approach prove the viability of web-camera based systems. The web-camera used in this implementation is the Advent AWC72015, which contains a 12-megapixel camera with 720p resolution, capturing at 30 frames per second [12]. However, the system is not web-camera model specific, and can therefore be used with any web-camera.
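As an illustration of this capture step, the following is a minimal OpenCV sketch, an assumption about the implementation rather than the authors' code, that reads frames from the default web-camera in real time:

```python
import cv2

# Open the default web-camera (index 0); any model will do,
# as the system is not web-camera specific.
capture = cv2.VideoCapture(0)

while True:
    ok, frame = capture.read()   # grab one BGR frame
    if not ok:
        break
    cv2.imshow("Sign language input", frame)
    # Press 'q' to stop capturing.
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

capture.release()
cv2.destroyAllWindows()
```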
C. System Architecture Design

The system will be designed around two main components. The first of these components is the sign language recognition system itself. This component will capture frames from the web-camera, pre-process these frames, and send them to the classifier for recognition. This component will also handle the graphical user interface and user input. The other component will house the classifier, and will consist of two smaller parts: the first is the creator of the classifier, which will construct the model and perform its training and testing; the second is the compiled, trained, and saved model itself. The saved model will be used with the sign language recognition component.
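A minimal sketch of this two-part split is given below, assuming a Keras model (the paper does not specify the framework, and the file name is hypothetical): the creator part trains and saves the model once, and the recognition component later loads the saved copy to classify pre-processed frames.

```python
import numpy as np
from tensorflow import keras

MODEL_PATH = "bsl_classifier.h5"   # hypothetical file name

def create_and_save(model, train_ds, test_ds):
    # Creator part: train, evaluate, and persist the classifier.
    model.fit(train_ds, epochs=10, validation_data=test_ds)
    model.save(MODEL_PATH)

def load_for_recognition():
    # Recognition component: reuse the saved model without retraining.
    return keras.models.load_model(MODEL_PATH)

def classify(model, frame):
    # `frame` is one pre-processed image scaled to the training size.
    probabilities = model.predict(frame[np.newaxis, ...])
    return int(np.argmax(probabilities))
```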
D. CNN Architecture

The system will make use of a CNN for the feature extraction and classification process. This is because of the abilities of CNNs in performing feature extraction in CV-based systems, as discussed in [13]. A bespoke CNN will be created for the system, which means that it will not have been pre-trained on any previous data. This allows the CNN to be specifically tailored to the purpose of providing accurate gesture recognition in this system. The architecture of the CNN can be seen in Fig. 3.

Fig. 3. The CNN set up.
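Since Fig. 3 is not reproduced here, the following Keras sketch shows a representative small CNN for this task. The layer counts and filter sizes are assumptions, not the architecture of Fig. 3; the 288 x 312 input size and the 19 output classes come from the dataset described below.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input assumes 288-pixel-wide by 312-pixel-high RGB images;
# the layer sizes are illustrative, not those of Fig. 3.
model = keras.Sequential([
    layers.Input(shape=(312, 288, 3)),
    layers.Rescaling(1.0 / 255),                # normalise pixel values
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(19, activation="softmax"),     # one unit per gesture
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```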
E. Dataset set up

The design of the dataset involves two steps, of which the first is the structure of the dataset, and the second is the contents of the dataset. These steps are detailed below.

1. Dataset Structure

The dataset will be housed in a structure that is conveniently accessible for the CNN to perform training and testing with. Two main folders shall be used: one folder will house the training data, and the other shall hold the testing data. Both of these folders will contain an identical set of 19 sub-folders representing each gesture recognisable by the system. Inside these sub-folders the image data shall be stored in .jpg format. All images will have dimensions of 288 x 312 pixels.
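Under this folder layout, the two datasets could be loaded directly from disk. The sketch below uses a Keras utility with assumed folder names (`dataset/train`, `dataset/test`), since the actual paths are not given in the paper:

```python
from tensorflow import keras

# Assumed layout: dataset/train/<gesture>/*.jpg and
# dataset/test/<gesture>/*.jpg, with 19 gesture sub-folders each.
# image_size is (height, width), assuming 288-wide by 312-high images.
train_ds = keras.utils.image_dataset_from_directory(
    "dataset/train", image_size=(312, 288), batch_size=32)
test_ds = keras.utils.image_dataset_from_directory(
    "dataset/test", image_size=(312, 288), batch_size=32)

print(train_ds.class_names)   # the 19 gesture labels
```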
2. Gesture position and distance

The 19 gestures to be recognisable by the system have been displayed in Fig. 1; however, in that example all signs were presented at a close distance to the web-camera and in a neutral position. It is important to build a dataset containing data depicting gestures being performed at varying angles, positions, and distances to the web-camera. Including this data will result in a more robust classification model, which is able to deal with fringe events. This process of collecting data will also make the system more versatile and able to deal more efficiently with other users, who may perform gestures slightly differently [14].
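The paper collects this variation directly in the dataset. A related technique, shown below as an assumption rather than the authors' method, is to approximate varying position, angle, and distance with random augmentation at training time, reusing `train_ds` from the earlier sketch:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Random shifts, rotations, and zooms loosely mimic gestures
# performed at different positions, angles, and distances.
augmentation = keras.Sequential([
    layers.RandomTranslation(0.1, 0.1),
    layers.RandomRotation(0.05),
    layers.RandomZoom(0.2),
])

augmented_train_ds = train_ds.map(
    lambda images, labels: (augmentation(images, training=True), labels))
```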
The saved model is stored in a directory, along with the training and testing datasets. As stated in the design, both of these datasets contain a sub-folder for each of the 19 recognisable gestures.

III. RESULTS

Image data is captured by the web-camera and fed to the system in real time. No pre-processing steps are taken until the user presses the 'b' key on their keyboard, which runs a background subtraction task. This background subtraction process is the first of multiple steps of image pre-processing used by the system. Fig. 5 displays the process of background subtraction.
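The paper does not list the exact operations, so the following OpenCV sketch is an assumption of how such a key-triggered step could look: the frame captured when 'b' is pressed is treated as the background, and later frames are differenced against it.

```python
import cv2

capture = cv2.VideoCapture(0)
background = None

while True:
    ok, frame = capture.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)   # suppress sensor noise

    if background is not None:
        # Pixels that differ strongly from the stored background
        # frame are kept as foreground (the signer's hands).
        diff = cv2.absdiff(background, gray)
        _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        frame = cv2.bitwise_and(frame, frame, mask=mask)

    cv2.imshow("Background subtraction", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("b"):
        background = gray    # treat the current frame as background
    elif key == ord("q"):
        break

capture.release()
cv2.destroyAllWindows()
```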