
Proceedings of the 5th International Conference on Inventive Research in Computing Applications (ICIRCA 2023)
IEEE Xplore Part Number: CFP23N67-ART; ISBN: 979-8-3503-2142-5
DOI: 10.1109/ICIRCA57980.2023.10220822

An Empirical Analysis of CNN for American Sign Language Recognition

Lakshmi VB, Sivachandra KB, Parthasaradhi H, S Abhishek, Anjali T
Department of Computer Science and Engineering, Amrita School of Computing,
Amrita Vishwa Vidyapeetham, Amritapuri
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract: American Sign Language recognition is a crucial technology that aims to improve accessibility and communication for the deaf and hard-of-hearing community. This research aims to convert ASL gestures into text or speech by analyzing and interpreting them using gesture recognition in sign language. This study compares different deep learning algorithms for sign language identification to overcome communication difficulties that deaf or hard-of-hearing people face. Three models—VGG16, ResNet, and AlexNet—were created and trained using a dataset of hand motions. The findings show that all three models have excellent accuracy, with AlexNet performing the best at 99.87%, ResNet coming in second at 98.9%, and VGG16 third at 98.2%.

Index Terms: American Sign Language, VGG16, ResNet, AlexNet, Deep Learning, American Language.

1. Introduction

The community of the deaf and dumb has a way of communicating with the outside world thanks to sign language. Every letter of the alphabet has a different hand gesture used to express it. When verbal communication is not possible, sign language is a method of communication which uses body motions, hands, and arms, as shown in [Figure 1]. It is a valuable form of communication for those who struggle with hearing. The world's sign languages range in number from 200 to almost 300. The many sign language dialects practiced worldwide include American Sign Language, British Sign Language, Auslan and New Zealand Sign Language, Irish Sign Language, French Sign Language, Chinese Sign Language, etc. Most of Anglophone Canada and the United States utilize the American Sign Language system [1]. The deaf and dumb population now has access to modern technology and social media platforms because of these sign languages.

The American Sign Language (ASL) system employs finger spelling, in which a particular gesture represents a specific word. Each symbol defines an English alphabet letter. The population of the deaf and dumb has recently benefited from the efficient performance of deep learning algorithms in sign language recognition [2]. For many years, machine learning and computer vision researchers have been actively working on identifying sign language. By now, the reliability and accuracy of systems that recognize sign language have significantly increased.


Figure 1 - Hand Gesture

An analysis of several deep learning algorithms was conducted to compare them, considering the significance of having a sign language recognition system that produces the best results: specifically, VGG-16, AlexNet, and ResNet. These three well-known deep learning models were the subject of the proposed research work, which compared their effectiveness. VGG16 is mainly used due to its straightforward architecture, with all the convolutional layers using a 3x3 filter size and max-pooling layers in between. A uniform structure makes it easier to understand and implement. This CNN model has achieved excellent performance on various image classification tasks, also due to the pre-trained weights available.

AlexNet introduced the concept of Local Response Normalization (LRN), which enhances the discriminative power of the model by normalizing activations within a local neighborhood. AlexNet utilizes two GPUs, which helps in parallel processing and faster training. This feature enables effective training of deep neural networks.

The ResNet model introduced residual connections, which address vanishing gradients. This feature allows the training of networks with hundreds of layers without degradation in performance. The residual connection enables learning a residual mapping, focusing on the difference between the input and desired output.

The existing studies did not consider the potential influence of changes in background, lighting, and hand form on the perception of gestures. The analysis of sign language from video sequences turned out to give reduced accuracy, as faces change from sign to sign, which leads to inaccurate features being trained from videos. Other architectures, such as ResNet, which has been shown to be quicker and more effective because of its deeper layers, were not considered in most studies.

Our model aims at improving accessibility and communication for persons who are deaf or hard of hearing by using deep learning algorithms like VGG16, ResNet, and AlexNet, with AlexNet being the most efficient; these can be used to construct reliable and accurate models for sign language recognition.


Our findings demonstrated that AlexNet outperformed VGG-16 and ResNet. Our results indicate that AlexNet is the top model for detecting American Sign Language, and this model may be utilized as a foundation for creating systems for sign language identification that are more precise and dependable.

2. Related Work

The research [3] by Simonyan, Karen, and Andrew Zisserman primarily focuses on a ground-breaking method for classifying images using deep convolutional neural networks. The ImageNet dataset was only partially evaluated in this study; performance on other datasets and the dataset's lack of variety were not considered.

The study [4] by Aloysius, N., and Geetha, M. mainly surveys vision-based continuous sign language recognition systems. The proposed model relies primarily on hand detection and tracking stages, which will not perform the same in all environments. The KNN classifier used may not be the most efficient and accurate.

The study [5] by Wadhawan, Ankita, and Parteek Kumar uses only stationary signs, not dynamic signs; the dataset used is relatively small, and the model uses 10-fold cross-validation. The study didn't compare the proposed model with other state-of-the-art models, making it difficult to assess the performance.

In the study [6], Premkumar, Aswathi, et al. noted that deep learning models require significant computing power, and the model's scalability is not addressed. Other architectures, such as ResNet, which has been shown to be quicker and more effective because of its deeper layers, were not considered in the article, since it was primarily focused on the VGG16 model.

The study [7] by Bekir Aksoy, Osamah Khaled Musleh Salman, and Özge Ekrem focused on a small selection of TSL gestures, which might not fully encompass the TSL lexicon. The study did not consider the potential influence of changes in background, lighting, and hand form on the perception of TSL gestures.

The study [8] by Alsaadi et al. mainly concentrates on a small selection of ArSLA signs, which cannot be directly applied to other signs or gestures in Arabic sign language, affecting the generalizability of the results. The model is limited to only one object and does not consider the background, which can affect the performance.

The study [9] by Rathi, Pulkit, Kuwar Gupta, Raj, Agarwal, Soumya, and Shukla, Anupam mainly concentrated on the ResNet50 deep neural network architecture and achieved an accuracy of 99.03%. The model just concentrates on identifying the characters from the given images. The study doesn't provide a detailed analysis of the accuracy of the model.

The research [10] by K. Bantupalli and Y. Xie primarily uses a CNN and long short-term memory (LSTM) to bring out time-based information from ASL video sequences. Faces were included in the model; however, this reduced its accuracy, since faces change from sign to sign, leading to inaccurate features being trained from videos.

3. Dataset

The dataset for American Sign Language (ASL) utilized in this research consists of 17,113 images of hand movements, represented as grayscale images with dimensions of 200 × 200 pixels. The images are divided into 27 classes, including 26 alphabet classes and a space class with the designation "0."

In the dataset, 12,845 images are used for training and the remaining 4,268 images are utilized for validation. Each class has a specific number of images; the smallest class has 600 images, while the largest class has 780 images.
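To make the split concrete, here is a minimal data-loading sketch in Keras. It assumes the images are arranged in one sub-folder per class; the directory name, seed, and batch size are illustrative rather than taken from the paper, and a validation split of 0.25 reproduces the roughly 12,845/4,268 train/validation division described above.

```python
# Minimal sketch: load the 200x200 grayscale ASL images from a
# hypothetical class-per-folder layout (asl_dataset/A ... asl_dataset/0).
import tensorflow as tf

common = dict(
    validation_split=0.25,     # ~12,845 training / ~4,268 validation images
    seed=42,                   # same seed so the two subsets do not overlap
    color_mode="grayscale",    # the dataset images are grayscale
    image_size=(200, 200),     # the dataset images are 200 x 200 pixels
    batch_size=32,             # batch size is an assumption
)
train_ds = tf.keras.utils.image_dataset_from_directory(
    "asl_dataset", subset="training", **common)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "asl_dataset", subset="validation", **common)
print(train_ds.class_names)    # 27 classes: 26 letters + the space class "0"
```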


The images in the collection were acquired by capturing various people signing the 26 alphabet letters and the space class in video sequences. Frames showing American Sign Language hand motions were independently chosen from the video sequences, enhanced, and noise-removed.

4. Proposed System Design

The main objective of this effort is to convert ASL gestures into text or speech by analyzing and interpreting them using gesture recognition in sign language. The suggested system architecture for ASL recognition will be covered in this section. Several phases make up the system architecture, which is shown in [Figure 2].

The suggested system architecture for ASL recognition entails gathering and pre-processing a dataset of hand gestures, choosing and training a deep learning model, and assessing the model's performance on a validation dataset. By offering a reliable and precise ASL identification system, the system is anticipated to increase accessibility and communication for people who are deaf or hard of hearing.

Figure 2 - Proposed System Architecture
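As a rough illustration of the frame-preparation stage described above, the sketch below samples frames from a signing video, converts them to grayscale, and applies enhancement and denoising. The paper does not name the exact operators, so histogram equalization and Gaussian blur, as well as the sampling step, are assumptions.

```python
# Sketch of the assumed pre-processing: sample frames, enhance, denoise.
import cv2

def extract_frames(video_path, step=10, size=(200, 200)):
    """Keep every `step`-th frame as a cleaned 200x200 grayscale image."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            gray = cv2.equalizeHist(gray)               # enhancement (assumed)
            gray = cv2.GaussianBlur(gray, (3, 3), 0)    # noise removal (assumed)
            frames.append(cv2.resize(gray, size))
        index += 1
    cap.release()
    return frames
```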

A. VGG16

The Oxford University Visual Geometry Group (VGG) initially presented the VGG16 convolutional neural network design in 2014. There are 16 layers, 13 of which are convolutional and three of which are fully connected. Its consistent architecture is one of the critical characteristics of VGG16: its max-pooling layers have a window size of 2x2 and a stride of 2, whereas all of its convolutional layers have a filter size of 3x3 and a stride of 1. A more straightforward and effective implementation is made possible by this homogeneity. VGG16's depth is an additional noteworthy trait. VGG16 can extract more intricate and abstract characteristics from images by stacking several convolutional layers one on top of another. However, this depth increases the network's computational cost and susceptibility to overfitting.

The VGG16 network used here consists of 13 convolutional layers and three dense layers:

1. The layers are stacked sequentially, extracting features from 224 × 224 × 3 images.
2. Five groups of convolutional layers, each followed by a max-pooling layer that reduces the image size, make up the network design.
3. The convolutional layers in each group have 64, 128, 256, 512, and 512 filters, respectively.
4. The final layer is a dense layer with 26 units, one for each output class, that maps the features to the output classes.

The network can learn complicated features and achieve high accuracy in image classification tasks because of the usage of many convolutional layers and pooling layers.
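A compact Keras sketch of such a classifier is given below. Whether the authors trained from scratch or fine-tuned ImageNet weights is not stated, so this version freezes a pre-trained backbone and adds a new head; the 256-unit hidden layer and the optimizer are assumptions.

```python
# Sketch of a VGG16-based ASL classifier (pre-trained backbone assumed).
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG16(
    include_top=False,           # drop the original 1000-class head
    weights="imagenet",
    input_shape=(224, 224, 3),   # VGG16 expects 224 x 224 x 3 inputs
)
base.trainable = False           # freeze the convolutional feature extractor

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),    # hidden width is an assumption
    layers.Dense(26, activation="softmax"),  # one unit per output class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Since the dataset images are 200 × 200 grayscale, they would also need to be resized and repeated across three channels before being fed to this backbone.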
layers, batch normalization, ReLU activation, and a
B. ResNet

ResNet, which stands for Residual Network, is a kind of deep neural network architecture that makes it possible to train networks with more than 100 layers. ResNet's key concept is the usage of residual connections, which skip over certain layers rather than adhering to a rigid hierarchy of stacked layers. In a conventional deep neural network, the input data is sequentially transferred through each layer: each layer is intended to extract the appropriate characteristics from the input data before passing on its output to the next layer.

The early layers have difficulty learning useful features because of the vanishing gradient issue, which happens when the gradients transmitted across the layers are incredibly tiny. ResNet presents the idea of residual connections as a solution to this issue, passing the input data straight to a subsequent layer so that the network may learn the residual function that translates the input to the output.

Each residual block in ResNet's architecture consists of several convolutional layers, batch normalization, and ReLU activations. The residual connections add the output of the preceding block to the output of the current block, which is then sent to the next block. The particular layers are as follows (a sketch of one such block follows this list):

1. The first convolutional layer, which has 64 filters with a size of 7x7 and stride 2, followed by batch normalization and ReLU activation.
2. A max-pooling layer measuring 3x3 with stride 2.
3. A series of residual blocks, each having two convolutional layers with 64 3x3 filters, batch normalization, ReLU activation, and a skip connection that adds the input to the output of the second convolutional layer.
4. Additional residual blocks, each consisting of two convolutional layers with 256 3x3 filters, batch normalization, ReLU activation, and a skip connection that adds the input to the output of the second convolutional layer.
5. A global average pooling layer that collapses the output's spatial dimensions.
6. A fully connected layer of twenty-six units with softmax activation that creates the output.
7. A flatten layer followed by three dense layers, the first two having 1000 units each and the third having 27 units.


C. AlexNet

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton created the convolutional neural network AlexNet. It won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 by a wide margin and represented a substantial advancement in image categorization tasks. Three fully connected layers follow five convolutional layers to create the eight layers that comprise AlexNet's architecture:

1. An 11x11 convolutional layer with 96 filters, followed by a max-pooling layer with a size of 3x3 and a stride of 2.
2. The supplied image shrinks from 227x227 to 27x27 as a result.
3. A 5x5 convolutional layer with 256 filters, followed by a max-pooling layer of 3x3 with a stride of 2.
4. The image is shrunk to 13x13 thanks to this. A convolutional layer with 384 filters and a 3x3 size.
5. A convolutional layer with 384 filters and a 3x3 size.
6. A convolutional layer with 256 filters and a size of 3x3, followed by a max-pooling layer with a size of 3x3 and a stride of 2.
7. The image is shrunk to 6x6 due to this.
8. Three fully connected layers with 4096 units each, the first two with a dropout of 0.5.

The 1000 units of the final fully connected layer correspond to the 1000 classes in the ImageNet dataset, with a softmax activation function. Rectified Linear Units (ReLU), which helped fix the problem of vanishing gradients in deep networks, were used by AlexNet as the activation function, making it significant. Local response normalization (LRN) was employed to lessen overfitting and boost generalization.
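LRN is available directly in TensorFlow, so a one-liner suffices to reproduce this step; the hyperparameters shown are AlexNet's published values, not settings taken from this paper.

```python
# Local Response Normalization as used between AlexNet's early conv layers.
import tensorflow as tf

def lrn(x):
    # depth_radius/bias/alpha/beta are AlexNet's published values (assumed here)
    return tf.nn.local_response_normalization(
        x, depth_radius=5, bias=2.0, alpha=1e-4, beta=0.75)
```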

5. Results & Discussions

The three models' performance was tuned using an appropriate number of epochs and steps per epoch, since a model's accuracy increases with increasing epoch count [16]. Thirty epochs with 200 steps each were employed for VGG16. AlexNet used 30 epochs with 100 steps each, and ResNet was implemented using 30 epochs with 300 steps each. A sketch of this training configuration follows.
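Below is a minimal version of that configuration in Keras. Here `model`, `train_ds`, and `val_ds` refer to the earlier sketches, assuming the dataset has been adapted to the model's expected input shape as noted in Section A.

```python
# Train with the epoch/step budget reported for VGG16 (200 steps per epoch;
# AlexNet used 100 and ResNet 300 under the same 30-epoch budget).
history = model.fit(
    train_ds.repeat(),        # repeat so steps_per_epoch can exceed one pass
    validation_data=val_ds,
    epochs=30,
    steps_per_epoch=200,
)
best = max(history.history["val_accuracy"])
print(f"best validation accuracy: {best:.4f}")
```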
range of accuracy for different epochs was 64.56 to
A. VGG16

There are 4,268 images in total in the test set. Thirty epochs with 200 steps each were used to train the suggested model, as shown in [Figure 3] and [Figure 4]. According to the tests, the validation set's accuracy was 98.2%; across epochs, accuracy ranged from 54.02% to 98.45%, and an examination of these produced an average score of 99.2%.

Figure 3 - VGG-16 Accuracy

Figure 4 - VGG-16 Loss

B. ResNet

Thirty epochs with 300 steps each served as the model's training data, achieving an accuracy rate of 98.9%. The range of accuracy across epochs was 64.56% to 99.2%.

Plots and evaluations of the training and validation accuracy were performed. The plots shown in [Figure 5] and [Figure 6] demonstrate that as the number of epochs rose, the accuracy and loss first climbed, then declined in the middle, eventually delivering an accuracy of 99.2%.
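Curves like those in Figures 3 through 6 can be reproduced from the `history` object of the training sketch above; the matplotlib usage here is an assumption.

```python
# Plot training vs. validation accuracy per epoch (as in Figures 3 and 5).
import matplotlib.pyplot as plt

plt.plot(history.history["accuracy"], label="training")
plt.plot(history.history["val_accuracy"], label="validation")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```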


Figure 5 - ResNet Accuracy

Figure 6 - ResNet Loss

C. AlexNet

The suggested model was trained using 30 epochs with 100 steps each. As a consequence, an accuracy of 99.87% was achieved. Across epochs, the accuracy ranged from 94.65% to 99.97%.

Figure 7 - AlexNet Accuracy

The training and testing accuracy shown in [Figure 7] rose with the number of epochs, reaching a maximum of 99.87%. The training and testing validation loss shown in [Figure 8] decreased significantly after each epoch, demonstrating that it is inversely proportional to the number of epochs.

Figure 8 - AlexNet Loss

All three models provide outcomes that strike a balance between accuracy and complexity: AlexNet, ResNet, and VGG-16 achieved accuracies of 99.87%, 98.9%, and 98.2%, respectively.

6. Conclusion

This work's comparative examination of deep learning algorithms for sign language identification provides crucial insight into the feasibility and effectiveness of using cutting-edge computer vision methods to improve accessibility and communication for persons who are deaf or hard of hearing. The study shows that deep learning algorithms like VGG16, ResNet, and AlexNet, with AlexNet being the most efficient, can be used to construct reliable and accurate models for sign language recognition.

Future research on sign language recognition using deep learning algorithms has several potential directions to build on this analysis's findings. The accuracy and efficiency of sign language recognition may be improved by using more advanced deep learning architectures and approaches, such as attention-based models or recurrent neural networks (RNNs). Future research should also focus on creating real-time sign language recognition systems that can function in changing contexts and account for different lighting, backgrounds, and other elements.


References

[1] Alsaadi, Zaran, Easa Alshamani, Mohammed Alrehaili, Abdulmajeed Ayesh D. Alrashdi, Saleh Albelwi, and Abdelrahman Osman Elfaki. "A Real Time Arabic Sign Language Alphabets (ArSLA) Recognition Model Using Deep Learning Architecture." Computers 11, no. 5 (2022): 78.

[2] Grandhi, Chandhini, Sean Liu, and Divyank Rahoria. "American Sign Language Recognition using Deep Learning."

[3] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

[4] Aloysius, N., and Geetha, M. "Understanding vision-based continuous sign language recognition." Multimedia Tools and Applications 79 (2020): 22177-22209.

[5] Wadhawan, Ankita, and Parteek Kumar. "Deep learning-based sign language recognition system for static signs." Neural Computing and Applications 32 (2020): 7957-7968.

[6] Premkumar, Aswathi, R. Hridya Krishna, Nikita Chanalya, C. Meghadev, Utkrist Arvind Varma, T. Anjali, and S. Siji Rani. "Sign language recognition: a comparative analysis of deep learning models."

[7] Aksoy, Bekir, Osamah Khaled Musleh Salman, and Özge Ekrem. "Detection of Turkish Sign Language Using Deep Learning and Image Processing Methods." Applied Artificial Intelligence 35, no. 12 (2021): 952-981.

[8] Alsaadi, Zaran, Easa Alshamani, Mohammed Alrehaili, Abdulmajeed Ayesh D. Alrashdi, Saleh Albelwi, and Abdelrahman Osman Elfaki. "A Real Time Arabic Sign Language Alphabets (ArSLA) Recognition Model Using Deep Learning Architecture." Computers 11, no. 5 (2022): 78.

[9] Rathi, Pulkit, Raj Kuwar Gupta, Soumya Agarwal, and Anupam Shukla. "Sign Language Recognition Using ResNet50 Deep Neural Network Architecture" (February 27, 2020).

[10] Bantupalli, K., and Y. Xie. "American Sign Language Recognition using Deep Learning and Computer Vision."

[11] A. K. A, A. H, N. P. Nair, V. A, and A. T. "Interview Performance Analysis using Emotion Detection."

[12] Gopakumar, Amritha, Aathira Shine, and T. Anjali. "Analysis of Alcoholic EEG Signal using Semantic Technologies."

[13] Pradeesh, N., Abhishek Lal, Gautham Padmanabhan, R. Gopikrishnan, T. Anjali, Shivsubramani Krishnamoorthy, and Kamal Bijlani. "Fast and reliable group attendance marking system using face recognition in classrooms."

[14] Mannava, Mahima Chowdary, Bhavana Tadigadapa, Devitha Anil, and A. T. "CNN Comparative Analysis for Skin Cancer Classification." In 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT), pp. 1-6. IEEE, 2022.

[15] Rangasrinivasan, Sahana, Sri Lohitha Bhagam, Nair K. Athira, Kondapi Niharika, Anjuna D. Raj, and T. Anjali. "CoViMask: A Novel Face Mask Type Detector Using Convolutional Neural Networks."

[16] Abhishek, S., Mahima Chowdary Mannava, A. J. Ananthapadmanabhan, and T. Anjali. "Towards Accurate Auscultation Sound Classification with Convolutional Neural Network." pp. 254-260. IEEE, 2023.