An Empirical Analysis of CNN for American Sign Language Recognition

Abstract: American Sign Language recognition is a crucial technology that aims to improve accessibility and communication for the deaf and hard-of-hearing community. This research aims to convert ASL gestures into text or speech by analyzing and interpreting them using gesture recognition. This study compares different deep learning algorithms for sign language identification to overcome the communication difficulties that deaf and hard-of-hearing people face. Three models—VGG16, ResNet, and AlexNet—were created and trained using a dataset of hand gestures. The findings show that all three models achieve excellent accuracy, with AlexNet performing best at 99.87%, ResNet second at 98.9%, and VGG16 third at 98.2%.

Index Terms: American Sign Language, VGG16, ResNet, AlexNet, Deep Learning.

1. Introduction

Sign language gives the deaf and hard-of-hearing community a way of communicating with the outside world. Every letter of the alphabet has a different hand gesture used to express it. When verbal communication is not possible, sign language is a method of communication that uses body motions, hands, and arms, as shown in [Figure 1]. It is a valuable form of communication for those who struggle with hearing. The world's sign languages range in number from 200 to almost 300. The sign language dialects practiced worldwide include American Sign Language, British Sign Language, Auslan, New Zealand Sign Language, Irish Sign Language, French Sign Language, Chinese Sign Language, and others. Most of Anglophone Canada and the United States use the American Sign Language system [1]. These sign languages have given the deaf and hard-of-hearing population access to modern technology and social media platforms.

The American Sign Language (ASL) system employs finger spelling, in which a particular hand gesture represents a specific letter of the English alphabet. The deaf and hard-of-hearing population has recently benefited from the efficient performance of deep learning algorithms in sign language recognition [2]. For many years, machine learning and computer vision researchers have been actively working on sign language identification, and by now the reliability and accuracy of systems that recognize sign language have significantly increased.
An analysis of several deep learning algorithms, specifically VGG16, AlexNet, and ResNet, was conducted to compare them, considering the significance of having a sign language recognition system that produces the best results. These three well-known deep learning models were the subject of the proposed research work, which compared their effectiveness. VGG16 is mainly used due to its straightforward architecture, with all convolutional layers using a 3x3 filter size and max-pooling layers in between. The uniform structure makes it easy to understand and implement, and this CNN model has achieved excellent performance on various image classification tasks, also owing to the pre-trained weights that are available.

AlexNet introduced the concept of Local Response Normalization (LRN), which enhances the discriminative power of the model by normalizing activations within a local neighborhood. AlexNet also utilizes two GPUs, which allows parallel processing and faster training; this enables effective training of deep neural networks.

The ResNet model introduced the residual connection, which addresses vanishing gradients; this allows networks with hundreds of layers to be trained without degradation in performance. The residual connection enables learning a residual mapping, focusing on the difference between the input and the desired output.
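To make this idea concrete, the minimal Keras sketch below adds a block's input back onto the output of two stacked convolutional layers, so the layers only have to learn the difference between the input and the desired output. The filter count and layer ordering are illustrative assumptions, not the paper's exact configuration.

```python
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Minimal residual block: output = F(x) + x, so the two stacked
    conv layers only learn the difference between input and output."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])  # the residual (skip) connection
    return layers.ReLU()(y)
```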
The existing studies did not consider the potential influence of changes in background, lighting, and hand form on the perception of gestures. The analysis of sign language from video sequences turned out to give reduced accuracy, as the face changes from sign to sign, which leads to inaccurate features being learned from the videos. Other architectures, such as ResNet, which has been shown to be quicker and more effective because of its deeper layers, were not considered in most studies.

Our model aims at improving accessibility and communication for persons who are deaf or hard of hearing by using deep learning algorithms such as VGG16, ResNet, and AlexNet. Our findings demonstrate that AlexNet outperformed VGG16 and ResNet, indicating that AlexNet is the top model for detecting American Sign Language; it may be utilized as a foundation for creating sign language identification systems that are more precise and dependable.
2. Literature Survey

In the study [6], Premkumar, Aswathi, et al. noted that deep learning models require significant computing power, and the model's scalability is not addressed. Other architectures, such as ResNet, which has been shown to be quicker and more effective because of its deeper layers, were not considered in the article, since it was primarily focused on the VGG16 model.

The study [7] by Bekir Aksoy, Osamah Khaled Musleh Salman, and Özge Ekrem focused on a small selection of TSL gestures, which might not fully encompass the TSL lexicon. The study did not consider the potential influence of changes in background, lighting, and hand form on the perception of TSL gestures.

The study [8] by Alsaadi et al. mainly concentrates on a small selection of ArSLA, which cannot be directly applied to other signs or gestures in the Arabic sign language.

3. Dataset

The dataset consists of ASL hand movements represented as grayscale images with dimensions of 200 × 200 pixels. The images are divided into 27 classes: 26 alphabet classes and a space class with the designation "0." Of these, 12,845 images are used for training and the remaining 4,268 for validation. Each class has a specific number of images; the smallest class has 600 images, while the largest has 780 images.
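A data pipeline for this dataset could be set up as sketched below. The directory names and layout (one sub-folder per class) are assumptions, since the paper does not describe how the files are organized.

```python
import tensorflow as tf

# Assumed layout: one sub-folder per class ("0" for space, "A".."Z"),
# holding the 200x200 grayscale images described above.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "asl_dataset/train",       # hypothetical path
    image_size=(200, 200),
    color_mode="grayscale",
    label_mode="categorical",  # one-hot labels over the 27 classes
    batch_size=32,
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "asl_dataset/val",         # hypothetical path
    image_size=(200, 200),
    color_mode="grayscale",
    label_mode="categorical",
    batch_size=32,
)

# Scale pixel values from [0, 255] to [0, 1].
rescale = tf.keras.layers.Rescaling(1.0 / 255)
train_ds = train_ds.map(lambda x, y: (rescale(x), y))
val_ds = val_ds.map(lambda x, y: (rescale(x), y))
```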
The images in the collection were acquired by capturing various people signing the 26 alphabet classes and the space class in video sequences. Frames showing the American Sign Language hand motions were then independently chosen from the video sequences, enhanced, and noise-removed.
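The frame selection and clean-up step might look like the OpenCV sketch below; the sampling rate and the specific denoising call are illustrative choices, as the paper does not name the exact enhancement operations.

```python
import cv2

def frames_from_video(path, step=10, size=(200, 200)):
    """Sample every `step`-th frame of a signing video, convert it to
    grayscale, resize it, and apply simple noise removal."""
    frames = []
    cap = cv2.VideoCapture(path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            gray = cv2.resize(gray, size)
            gray = cv2.fastNlMeansDenoising(gray)  # noise removal
            frames.append(gray)
        index += 1
    cap.release()
    return frames
```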
4. Proposed System Design

The main objective of this effort is to convert ASL gestures into text or speech by analyzing and interpreting them using gesture recognition. The suggested system architecture for ASL recognition is covered in this section. Several phases make up the system architecture, which is shown in [Figure 2].

The architecture entails gathering and pre-processing a dataset of hand gestures, choosing and training a deep learning model, and assessing the model's performance on a validation set. By offering a reliable and precise ASL identification system, the system is anticipated to increase accessibility and communication for people who are deaf or hard of hearing.
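In code, this pipeline reduces to a short skeleton like the one below, where build_model is a placeholder for any of the three architectures described next; the optimizer and loss are assumptions, as the paper does not state them.

```python
def run_pipeline(build_model, train_ds, val_ds, epochs=30):
    """Train one of the candidate architectures on the pre-processed
    data and report its accuracy on the validation set."""
    model = build_model()
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=epochs)
    _, accuracy = model.evaluate(val_ds)
    return accuracy
```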
A. VGG16

2. Five sets of convolutional layers, followed by max-pooling layers that reduce the image size, make up the network design.
3. The sets' convolutional layers have 64, 128, 256, 512, and 512 filters, respectively.
4. The final layer is a dense layer with 26 units, one for each output class, that maps the extracted features to the output classes.

The network can learn complicated features and achieve high accuracy in image classification tasks because of its use of many convolutional and pooling layers.
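A Keras sketch of this VGG16-style design is shown below. The 200x200 grayscale input shape follows the dataset section, and the classifier head is simplified to a single 4096-unit dense layer before the 26-way output; both are assumptions rather than the paper's exact settings.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_vgg16_style(input_shape=(200, 200, 1), num_classes=26):
    """Five convolutional sets (64, 128, 256, 512, 512 filters, all 3x3)
    with max pooling in between, then a dense classifier."""
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for filters, reps in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
        for _ in range(reps):
            x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D()(x)  # halves the image size
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```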
B. ResNet

ResNet, which stands for Residual Network, is a kind of deep neural network architecture that makes it possible to train networks with more than 100 layers. ResNet's key concept is the usage of residual connections, which skip over certain layers rather than adhering to a rigid hierarchy of stacked layers. In a conventional deep neural network, the input data is transferred sequentially through each layer; each layer is intended to extract the appropriate characteristics from the input data before passing its output on to the next layer. The early layers have difficulty learning useful features because of the vanishing gradient issue, which happens when the gradients transmitted across the layers become incredibly tiny. ResNet presents residual connections as a solution to this issue, passing the input data straight to a subsequent layer so that the network may learn the residual function that translates the input to the desired output. The architecture consists of:

1. A first convolutional layer with 64 filters of size 7x7 and stride 2, followed by batch normalization and ReLU activation.
2. A max-pooling layer with stride 2.
3. A series of residual blocks, each having two convolutional layers with 64 3x3 filters, batch normalization, ReLU activation, and a skip connection that adds the input to the output of the second convolutional layer.
4. Additional residual blocks, each consisting of two convolutional layers with 256 3x3 filters, batch normalization, ReLU activation, and a skip connection that adds the input to the output of the second convolutional layer.
5. A global average pooling layer that reduces the output's spatial dimensions.
6. A fully connected layer of 26 units with softmax activation that creates the output.
7. A flatten layer with three dense layers, the first two having 1000 units each and the third having 27 units.
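The steps above can be sketched in Keras as follows. The number of residual blocks per stage is an illustrative choice, and a 1x1 projection is assumed on the skip path when the channel count changes, since the list does not say how that mismatch is handled; step 7's extra dense layers are omitted, as step 6 already produces the output.

```python
from tensorflow import keras
from tensorflow.keras import layers

def res_block(x, filters):
    """Two 3x3 convolutions with batch norm and ReLU, plus a skip
    connection; a 1x1 projection handles any channel mismatch."""
    shortcut = x
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def build_resnet_style(input_shape=(200, 200, 1), num_classes=26):
    inputs = keras.Input(shape=input_shape)
    x = layers.Conv2D(64, 7, strides=2, padding="same")(inputs)   # step 1
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)      # step 2
    for _ in range(2):                                            # step 3
        x = res_block(x, 64)
    for _ in range(2):                                            # step 4
        x = res_block(x, 256)
    x = layers.GlobalAveragePooling2D()(x)                        # step 5
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # step 6
    return keras.Model(inputs, outputs)
```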
C. AlexNet

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton created the convolutional neural network AlexNet. It won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 by a wide margin and represented a substantial advancement in image categorization tasks. Three fully connected layers follow five convolutional layers to create the eight layers that comprise AlexNet's architecture:

1. A max-pooling layer with a size of 3x3 and a stride of 2 is added after an 11x11 convolutional layer.
6. This is followed by a max-pooling layer with a size of 3x3 and a stride of 2, and then a convolutional layer with 256 filters of size 3x3 is applied.
7. The image is shrunk to 6x6 as a result.
8. Three fully connected layers with 4096 units each, the first two of which use a dropout of 0.5.
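Since steps 2-5 are not listed above, the sketch below fills the middle of the network with the canonical AlexNet convolutional layers (5x5 and 3x3 filters with 256 and 384 channels); the canonical 227x227 RGB input shape is likewise an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_alexnet_style(input_shape=(227, 227, 3), num_classes=26):
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(96, 11, strides=4, activation="relu"),       # step 1
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(256, 5, padding="same", activation="relu"),  # canonical middle
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(384, 3, padding="same", activation="relu"),
        layers.Conv2D(384, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(3, strides=2),                         # step 6
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.Flatten(),          # 6x6 feature maps at this point (step 7)
        layers.Dense(4096, activation="relu"),                     # step 8
        layers.Dropout(0.5),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
```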
5. Results

A. VGG16

The model was trained using 30 epochs with 300 steps each, and the accuracy was recorded across the different epochs. An examination of these produced an average score of 99.2%.

B. ResNet

The model was trained on the training data, and a 98.9% accuracy rate was achieved. The range of accuracy over the different epochs was 64.56% to 99.2%.

C. AlexNet

The suggested model was trained using 30 epochs with 100 steps each. As a consequence, an accuracy of 99.87% was achieved. Across the different epochs, the accuracy ranged from 94.65% to 99.97%.
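A training run matching these reports might look like the sketch below, reusing the dataset objects and the AlexNet builder sketched earlier; the input shape follows the 200x200 grayscale pipeline rather than the canonical AlexNet input, and the 27-way output matches the dataset section.

```python
model = build_alexnet_style(input_shape=(200, 200, 1), num_classes=27)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(train_ds.repeat(), validation_data=val_ds,
                    epochs=30, steps_per_epoch=100)

# Per-epoch validation accuracies, from which the quoted ranges
# and averages can be computed.
accs = history.history["val_accuracy"]
print(f"min {min(accs):.4f}  max {max(accs):.4f}  "
      f"mean {sum(accs) / len(accs):.4f}")
```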
6. Conclusion
This study compared CNN architectures for American Sign Language recognition, using cutting-edge computer vision methods to improve accessibility and communication for persons who are deaf or hard of hearing. The study shows that deep learning algorithms like VGG16, ResNet, and AlexNet, with AlexNet being the most efficient, can be used to construct reliable and accurate models for sign language recognition.
References

[9] P. Rathi, R. Kuwar Gupta, S. Agarwal, and A. Shukla, "Sign Language Recognition Using ResNet50 Deep Neural Network Architecture," February 27, 2020.