Sign Language Recognition With Convolutional Neural Networks
2.2. Real-time Robust ASL Recognition

While classifying and translating static images of sign language letters and words is an essential task, there is also a need for robust and scalable real-time ASL recognition models. A noted weakness of existing training datasets is that they often do not contain a variety of skin tones, making models trained on them prone to failure at inference time when presented with a hand from an ethnicity not seen during training. One strategy to mitigate this is presented in [21], which uses a skin detection algorithm to create a mask for the input image. The mask works by looking at the colors of the image in terms of luminance (Y) and chrominance (Cb and Cr), a color space commonly used in video compression. This mask was used to remove all parts of the image that were not identified as skin, and their model was able to achieve 94.7% test accuracy with a downstream classification model based on AlexNet.
Similarly, [12] proposes a multiple-stage pipeline to improve model robustness with reduced inference latency, by integrating MediaPipe landmarks into a standard CNN architecture (similar to [15]). This work validates the findings of multiple other researchers ([5], [3]), who achieved greater classification accuracy when integrating MediaPipe hand landmarks (with both static 2D images and 3D depth images) into their data preprocessing pipeline. [12] compared their model to ones based on pre-trained Inception CNNs, and to non-convolutional models such as random forests and SVMs, and found that an integrated MediaPipe landmark/raw image model was able to outperform them in terms of accuracy (roughly 90% for the Inception and SVM baselines, compared to over 99% accuracy for their model) and, in some cases, in inference time (particularly the SVM-based image model used in [18]).

One notable approach to reducing gesture recognition inference time is presented in [9]. This paper is distinct in that it trains on a variety of international sign languages, including Indian Sign Language (ISL) gestures. ISL is distinct from ASL in that ISL fingerspelling gestures typically use two hands instead of one, making the MediaPipe model (which is designed to be adaptable enough to generate landmarks for multiple hands in a single image) an appropriate modeling choice. Their model achieves low latency by not using CNNs at all, instead using a lightweight SVM as the classifier, after reducing the size of the input search area with a two-stage hand-detection/landmark-extraction pipeline. [9] achieve accuracies of over 98% on Italian, Indian, and American sign language datasets, indicating that this approach is also effective at performing real-time gesture recognition.

2.3. 3D CNNs and Hand Modeling

3D CNNs are another prominent architecture in sign language recognition. One instance of this is [11], which uses Microsoft Kinect, a motion sensor that provides a color and depth stream and can track body movement, as an input device. With Kinect, the CNN has 5 inputs, including depth and a body skeleton, and achieves 94.2% accuracy, which is higher than baseline methods.

Another instance of this is [7], which uses a multi-stream architecture that comprises CNNs and GANs to generate depth and joint information from RGB channels. Then, manual and non-manual features are processed in a 3D CNN. This multi-stream model receives the frames in RGB, segmented hands and faces, distance and speed maps, and the artificial depth maps generated by the GANs, and achieves 91% accuracy.

One more example of 3D CNNs is [1], which uses a fusion of parallel 3D CNN structures, where linear sampling is applied to select frames and a 3D CNN learns the spatiotemporal features at certain times in the video sequence. The 3D CNN extracts features from one of the clips, and various methods for feature fusion are then considered, including MLPs, LSTMs, and stacked autoencoders. After considering all combinations of scenarios, including signer-dependent and signer-independent (where the signers in the test data are not included in the training data), the signer-dependent model using MLP fusion achieved 98.12% accuracy. In particular, [7] employed state-of-the-art techniques by combining many architectures and using datasets from multiple sign languages. Also, [1] excels in testing many different model combinations and techniques, even applying PCA and t-SNE for data reduction. In contrast, [11] makes no mention of data augmentation, which is important to consider to prevent overfitting.

3. Methods

3.1. Baselines

Our project is modeled as an ablation study: we built upon our model iteratively to test and evaluate each extension. We first started with a simple CNN, made up of 2 convolutional layers (with 32 and 64 channels respectively), each of which is followed by ReLU and 2 × 2 max pooling. The data is then flattened and fed through two fully-connected layers, with 128 nodes and 24 nodes respectively, where 24 is the number of classes. Using a simple CNN as our initial model allowed us to develop a robust image preprocessing pipeline and an initial understanding of how well suited CNNs were as a general approach to this task. This CNN was implemented using the standard deep learning framework PyTorch [14], and image preprocessing was performed using the Python Imaging Library (Pillow) [4].
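As a concrete sketch, this simple CNN can be written in PyTorch roughly as follows. The 3 × 3 kernels with padding 1 and the ReLU between the fully-connected layers are our assumptions (the description above fixes only the channel counts, the pooling, and the layer widths), and the flattened size of 64 × 12 × 12 assumes the 50 × 50 grayscale inputs described later.

    import torch.nn as nn

    simple_cnn = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * 12 * 12, 128),  # 50x50 input -> 25x25 -> 12x12 after two pools
        nn.ReLU(),
        nn.Linear(128, 24),            # 24 static letter classes (A-Y, excluding J and Z)
    )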
To add model complexity and additional pathways for feature extraction, we then chose to replicate the architecture in [15], using solely the input channel of grayscale images. This model will be referred to as our baseline model, and it provided us with a more comprehensive baseline for the classification task. For this CNN, we used the following architecture: 5 convolutional layers with a filter size of 3, with 32, 64, 128, and 512 filters, and ReLU activation. Each is followed by a dropout layer and batch normalization, as a form of model regularization. Each convolutional block after the first is followed by a max pooling layer of size 2 × 2. We used a dropout probability of 0.3 and a learning rate of 0.001 as initial hyperparameter values, before tuning. We modified the paper's architecture slightly to maintain output dimensions. The model ends with a fully connected layer, a dropout layer, and a final fully connected layer as a classification head, which classifies each image into one of the 24 classes. A complete diagram of this model's architecture can be seen in Figure 1.

Figure 1. Baseline CNN Architecture, from [15]
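One of this model's convolutional blocks can be sketched as below. The exact layer ordering and padding follow our reading of the description above rather than the reference implementation of [15], and the hidden width and activation of the classification head are assumptions.

    import torch.nn as nn

    def conv_block(in_ch: int, out_ch: int, pool: bool = True, p_drop: float = 0.3) -> nn.Sequential:
        # 3x3 convolution -> ReLU -> dropout -> batch norm, with 2x2 max pooling
        # on every block after the first (pool=False for the first block).
        layers = [
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Dropout2d(p_drop),
            nn.BatchNorm2d(out_ch),
        ]
        if pool:
            layers.append(nn.MaxPool2d(2))
        return nn.Sequential(*layers)

    # Classification head: fully connected -> dropout -> fully connected.
    head = nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU(),
                         nn.Dropout(0.3), nn.Linear(256, 24))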
A deeper CNN architecture such as this one offers the benefit of being able to extract more complex spatial features before classification. The model's loss was evaluated using cross-entropy loss, which can be seen in Equation 1:

$L_{CE}(y, \hat{y}) = -\sum_{i=1}^{N} y_i \log(\hat{y}_i)$,   (1)

where $y$ is the one-hot true label, $\hat{y}$ is the vector of predicted class probabilities, and $N$ is the number of classes.
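In PyTorch [14] this criterion is available directly; the sketch below checks that the library call matches Equation 1 (averaged over a batch) when the logits are first passed through a log-softmax. The batch size and labels here are arbitrary examples.

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 24)               # a batch of 4 predictions over 24 classes
    targets = torch.randint(0, 24, (4,))      # ground-truth class indices

    loss = F.cross_entropy(logits, targets)   # applies log-softmax internally

    # Manual version of Equation 1 with one-hot targets, averaged over the batch.
    log_probs = F.log_softmax(logits, dim=1)
    manual = -log_probs[torch.arange(4), targets].mean()
    assert torch.allclose(loss, manual)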
The second input channel of our complete model takes the MediaPipe hand landmarks. This is followed by a shallow CNN consisting of 2 convolutional layers with 50 and 25 channels respectively, each with ReLU activation, batch normalization, and 2 × 2 max pooling. The MediaPipe landmark model itself is made up of two models: a palm detector that provides a bounding box for hands, and a hand landmark model that provides a hand skeleton in the form of 21 hand-knuckle coordinates within the image.

The palm detector allows the model to localize the hand to a particular area of the image. The detector uses an encoder-decoder feature extractor, built on the idea of a Feature Pyramid Network (FPN) [13]. FPNs are a type of CNN architecture that was specifically designed to enhance the ability of CNNs to detect objects at multiple scales. This helps models develop scale invariance, which is useful for our proposed task in that it reduces the need for scale-based data augmentation. FPNs are typically computed on top of backbone CNNs such as ResNet, and work by successively extracting feature maps at different stages of the backbone network's architecture (for example, in the ResNet case, these feature maps are taken from different residual blocks). Feature maps from different stages are then combined, often using upsampling and element-wise addition, to produce a 'pyramid' of features at different scales. In the case of palm detection, standard RoI pooling methods such as those used in Fast R-CNN [6] produce RoIs from the feature maps at different levels of the pyramid. An example of this type of architecture can be seen in Figure 2.
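A single FPN merge step can be sketched as follows; the channel counts and spatial sizes are illustrative, not those of the MediaPipe palm detector.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    lateral = nn.Conv2d(512, 256, kernel_size=1)             # 1x1 lateral projection of a backbone stage
    smooth = nn.Conv2d(256, 256, kernel_size=3, padding=1)   # smooths the merged map

    c4 = torch.randn(1, 512, 14, 14)   # finer backbone feature map (illustrative shape)
    p5 = torch.randn(1, 256, 7, 7)     # coarser pyramid level, already at FPN width

    # Upsample the coarse level, add it to the projected finer level, then smooth.
    p4 = smooth(lateral(c4) + F.interpolate(p5, size=c4.shape[-2:], mode="nearest"))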
When the detection confidence for a hand was too low, we chose to fill in the output tensor with zeros, to maintain shape consistency for the remainder of the network. The topology of the landmark coordinates can be seen in Figure 3.
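A sketch of this landmark channel using the MediaPipe Hands API [2, 22] is shown below; the confidence threshold of 0.5 is illustrative, and the flat zero array stands in for the zero-filled output described above.

    import mediapipe as mp
    import numpy as np
    from PIL import Image

    hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1,
                                     min_detection_confidence=0.5)

    def landmark_features(img: Image.Image) -> np.ndarray:
        # Returns the 21 (x, y, z) hand-knuckle coordinates, or zeros when no
        # hand is detected with sufficient confidence.
        result = hands.process(np.asarray(img.convert("RGB")))
        if not result.multi_hand_landmarks:
            return np.zeros((21, 3), dtype=np.float32)
        lm = result.multi_hand_landmarks[0].landmark
        return np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32)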
We used data augmentation to make the dataset more diverse, and the classification task more difficult, as mentioned in the related work section. We applied salt and pepper noise (0.05), random rotation within 10 degrees, random zoom (10%), random shift (0.1), random horizontal flip (0.5), and random crop (50 × 50 pixels) (Figure 4). We implemented the salt and pepper noise ourselves, using NumPy [10], and the remaining augmentation methods were implemented using existing functions in Pillow [4]. After this augmentation, images were converted to grayscale, resized to 50 × 50 pixels, and normalized before being fed into the model's first input channel.
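A sketch of this augmentation pipeline is below. For brevity the random zoom, shift, and normalization are omitted, and the exact sampling choices (for example, uniform rotation angles) are our assumptions.

    import random
    import numpy as np
    from PIL import Image, ImageOps

    def salt_and_pepper(img: Image.Image, amount: float = 0.05) -> Image.Image:
        # NumPy implementation: flip a fraction `amount` of pixels to black or white.
        arr = np.asarray(img).copy()
        mask = np.random.rand(arr.shape[0], arr.shape[1])
        arr[mask < amount / 2] = 0
        arr[mask > 1 - amount / 2] = 255
        return Image.fromarray(arr)

    def augment(img: Image.Image) -> Image.Image:
        img = salt_and_pepper(img, 0.05)
        img = img.rotate(random.uniform(-10, 10))          # random rotation within 10 degrees
        if random.random() < 0.5:                          # random horizontal flip
            img = ImageOps.mirror(img)
        left = random.randint(0, max(img.width - 50, 0))   # random 50x50 crop
        top = random.randint(0, max(img.height - 50, 0))
        img = img.crop((left, top, left + 50, top + 50))
        return img.resize((50, 50)).convert("L")           # grayscale 50x50 for the first input channel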
For each class $i$, the F1 score is the harmonic mean of precision and recall:

$\text{F1 Score}_i = 2 \times \dfrac{\text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$
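The per-class precision, recall, F1 score, and support reported in Tables 2 and 4 can be produced with scikit-learn; the labels below are dummy values standing in for the class indices collected over the validation set.

    from sklearn.metrics import classification_report

    y_true = [0, 1, 2, 2, 1]   # ground-truth class indices (dummy values)
    y_pred = [0, 1, 2, 1, 1]   # model predictions (dummy values)
    print(classification_report(y_true, y_pred, digits=2))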
5.1. Smaller Dataset

For our baseline models, the overall accuracy on a randomly selected validation set made up of 15% of the data found in [16] (the "smaller dataset") is found in Table 1. This subsection contains an explanation of each of these models, and our decision-making process when considering how to extend the baselines.

Model                                                              Validation Accuracy
Results from [15], single input channel, with data augmentation   96.29%
Results from [15], two input channels, with data augmentation     98.42%
Simple CNN                                                         98.14%
Baseline Model                                                     99.34%
Baseline Model with Data Augmentation                              96.40%
Complete Model                                                     96.5%

Table 1. All Model Results, Smaller Dataset

Our initial point of comparison, the top row of Table 1, is the validation accuracy reported in [15] on a model with only the 50 × 50 grayscale input channel, and data augmentation. Our second point of comparison is the validation accuracy reported in [15] on a dual-input model (grayscale images and landmarks), with data augmentation.

The first model we tested, as a proof of concept, was a relatively simple CNN with two convolutional layers and a single grayscale image input head. We were surprised to find that this model was able to achieve extremely high accuracy on the smaller dataset (98.14%), even though the data had not been augmented. Our initial thought was that we were overfitting to the data. However, plotting our training and validation accuracies and losses as a function of epoch (Appendix 12) indicated that we were not overfitting to the training data, but that the model was actually extremely effective at classifying this dataset. One important thing to note is that the baseline model's accuracies and losses are plotted in Appendix 13, and its per-class results are reported in Table 2.

Class   Precision   Recall   F1 Score   Support
A       1.00        0.99     0.99       412
B       0.99        1.00     0.99       430
C       1.00        0.99     0.99       431
D       0.99        0.99     0.99       403
E       1.00        1.00     1.00       400
F       1.00        0.99     0.99       388
G       1.00        0.99     0.99       376
H       0.99        1.00     0.99       409
I       0.99        0.99     0.99       395
K       1.00        0.99     0.99       435
L       0.99        0.99     0.99       396
M       0.98        0.99     0.99       410
N       0.98        0.98     0.98       418
O       0.99        0.99     0.99       387
P       0.99        0.98     0.98       434
Q       0.98        0.98     0.98       378
R       0.97        1.00     0.98       442
S       0.99        0.99     0.99       415
T       0.99        0.99     0.99       430
U       0.99        0.99     0.99       407
V       0.97        0.99     0.98       418
W       0.99        0.98     0.99       457
X       0.99        0.97     0.98       389
Y       0.99        0.99     0.99       406
Total   0.99        0.99     0.99       9866

Table 2. Baseline Model, Smaller Dataset Classification Report

At this point in our experimentation process, we chose to conduct hyperparameter tuning on the Adam learning rate and the dropout probability of each layer. Our results indicated that the best combination of parameters was a dropout probability of 0.2 and a learning rate of 0.0005, and these parameters were used for all subsequent models. The classification accuracies for the various combinations of parameters can be found in Table 3.

Table 3. Classification accuracy for each combination of learning rate and dropout rate (0.1, 0.2, 0.3, 0.4)
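The search itself was a small grid over the two hyperparameters, as sketched below. Here train_and_evaluate is a hypothetical stand-in for our training loop, and the learning-rate grid shown is illustrative (only the initial value of 0.001 and the selected value of 0.0005 are stated above).

    def train_and_evaluate(lr: float, dropout: float) -> float:
        # Placeholder for the full training loop; returns validation accuracy.
        return 0.0

    results = {(lr, p): train_and_evaluate(lr, p)
               for lr in (0.0005, 0.001)            # illustrative learning-rate grid
               for p in (0.1, 0.2, 0.3, 0.4)}       # dropout rates from Table 3
    best_lr, best_dropout = max(results, key=results.get)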
With data augmentation, the baseline model achieved a 96.40% validation accuracy. As expected, due to the simplicity of the initial data, our final performance was worse than the baseline model. Similarly to the baseline model, validation accuracy was higher than training accuracy, likely due to the fact that our model uses dropout layers (Appendix 14). As seen in the confusion matrix, the characters most confused were classes 14 and 15 (P and Q), 19 and 20 (U and V), 20 and 21 (V and W), and 0 and 18 (A and T) (Figure 5). Given the signs for these characters, this misclassification is unsurprising, as the signs for P and Q look very similar, and the signs for U, V, and W, and for A and T, all have a similar hand position (Figure 6) [16]. Also, since this dataset consists of signs from non-native signers, slight errors and variations in the signs may contribute to this misclassification.
For the complete model on the smaller dataset, Q and T had the lowest F1 scores (Table 4).

Class   Precision   Recall   F1 Score   Support
A       0.98        0.98     0.98       403
B       0.97        1.00     0.98       392
C       0.99        0.99     0.99       444
D       0.99        0.97     0.98       421
E       0.96        0.99     0.97       374
F       0.98        0.97     0.97       406
G       0.92        0.96     0.94       413
H       0.98        0.94     0.96       442
I       0.98        0.96     0.97       417
K       0.96        0.97     0.96       452
L       0.98        0.99     0.99       412
M       0.97        0.97     0.97       412
N       0.97        0.90     0.94       378
O       0.96        0.97     0.96       371
P       0.93        0.95     0.94       401
Q       0.95        0.92     0.93       382
R       0.97        0.93     0.95       454
S       0.95        0.98     0.97       410
T       0.92        0.94     0.93       405
U       0.96        0.97     0.97       393
V       0.97        0.98     0.97       405
W       0.98        0.98     0.98       465
X       0.96        0.96     0.96       398
Y       0.99        0.97     0.98       416
Total   0.96        0.96     0.96       9866

Table 4. Complete Model, Smaller Dataset Classification Report

Figure 8. Confusion Matrix for Complete Model, Smaller Dataset

Figure 9. Data Augmentation and Hand Landmark Detection Examples

5.2. Larger Dataset

We then repeated the same steps after replacing our data with a larger and more complex dataset. A summary of our results is found in Table 5.

For the simple CNN, the model did not perform as well as the same model did with the simpler data, which is expected (Appendix 15).

Next, after running the baseline model on the new data, our validation accuracy significantly improved, to 98.25% (Appendix 16).

With data augmentation, we achieved a higher validation accuracy of 98.93%; this indicates that data augmentation helped make our model more robust to the complex data (Appendix 17).

Then, adding hand landmark detection, we had a validation accuracy of 96.6% (Figure 10). We also had a test accuracy of 96.42%. Our test accuracy is again lower than the results obtained by [15], but this is likely due to slight changes in architecture, as mentioned above.

Figure 10. Losses and Accuracies for Complete Model, Larger Dataset
After examining the confusion matrix, it is clear that the model confuses classes 19, 20, and 21 (U, V, and W), 0 and 18 (A and T), 16 and 19 (R and U), and classes 11 and 12 (M and N) (Figure 11). Similarly to the full model on the smaller data, the model still confuses signs that have similar hand positions, even with hand landmark detection. However, this model slightly outperformed the full model on the smaller data, which is important to note considering that this data is more complex.

In summary, we evaluated CNN models with data augmentation and hand landmark detection on two different datasets, one more complex than the other. After running various models, we found that our model using the more complex data with data augmentation had a 98.93% validation accuracy. However, our full model, which includes hand landmark detection, achieved a test accuracy of 96.42%. Based on our error analysis, it makes sense that adding data augmentation on the more complex data improved accuracy, while doing so on the simpler data worsened accuracy. While we expected hand landmark detection to improve accuracy on the complex data, it is likely that further work in hand detection and position prediction is needed to improve results. Some future work may include running the model for a greater number of epochs, further tuning of the minimum detection confidence parameter for hand landmark detection, tuning the parameters for data augmentation, and implementing additional hand detection or coordinate techniques.

7. Contributions & Acknowledgements

All coding and report writing was split equally across the milestone and the final report. Specifically, Arnav implemented the baseline models and the second input channel, Anusha did data pre-processing, error analysis, and augmentation, and Malavi adapted the models to the new dataset, ran the full models, and collected the error analysis plots.
8. Appendices

Figure 13. Losses and Accuracies for Baseline Model, Smaller Dataset

Figure 14. Losses and Accuracies for Baseline Model, Data Augmentation, Smaller Dataset

Figure 15. Losses and Accuracies for Simple CNN, Larger Dataset

References

[1] M. Al-Hammadi, G. Muhammad, W. Abdul, M. Alsulaiman, M. A. Bencherif, and M. A. Mekhtiche. Hand gesture recognition for sign language using 3DCNN. IEEE Access, 8:79491–79509, 2020.

[2] V. Bazarevsky, F. Zhang, A. Vakunov, C.-L. Chang, and M. Grundmann. MediaPipe Hands: On-device real-time hand tracking. https://github.com/google-ai-edge/mediapipe/blob/master/docs/solutions/hands.md, 2019. Accessed: 2024-06-05.

[3] J. Bora, S. Dehingia, A. Boruah, A. A. Chetia, and D. Gogoi. Real-time Assamese sign language recognition using MediaPipe and deep learning. Procedia Computer Science, 218:1384–1393, 2023.

[4] A. Clark and contributors. Pillow - the friendly PIL fork, 2024. Version 10.3.0.

[5] A. Deep, A. Litoriya, A. Ingole, V. Asare, S. M. Bhole, and S. Pathak. Realtime sign language detection and recognition. In 2022 2nd Asian Conference on Innovation in Technology (ASIANCON), pages 1–4. IEEE, 2022.

[6] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.

[7] R. R. G. Giulia Zanon de Castro and F. G. Guimarães. Automatic translation of sign language with multi-stream 3D CNN and generation of artificial depth maps. Expert Systems with Applications, 215(119394), 2023.

[8] H. Guo, G. Wang, X. Chen, C. Zhang, F. Qiao, and H. Yang. Region ensemble network: Improving convolutional network for hand pose estimation. 2017 IEEE International Conference on Image Processing (ICIP), Sept. 2017.

[9] A. Halder and A. Tayade. Real-time vernacular sign language recognition using MediaPipe and machine learning. Journal homepage: www.ijrpr.com, ISSN 2582-7421, 2021.
[10] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array programming with NumPy, 2020. Version 1.26.4.

[11] J. Huang, W. Zhou, H. Li, and W. Li. Sign language recognition using 3D convolutional neural networks. 2015 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, 2015.

[12] R. Kumar, A. Bajpai, and A. Sinha. MediaPipe and CNNs for real-time ASL gesture recognition. arXiv preprint arXiv:2305.05296, 2023.

[13] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[14] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library, 2019.

[15] R. Pathan, M. Biswas, S. Yasmin, M. Khandaker, M. Salman, and A. Youssef. Sign language recognition using the fusion of image and hand landmarks through multi-headed convolutional neural network. Nature, 13(16975), 2023.

[16] N. Pugeault and R. Bowden. Spelling it out: Real-time ASL fingerspelling recognition. 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 1114–1119, 2011.

[17] K. K. R. Rastgoo and S. Escalera. Multi-modal deep hand sign language recognition in still images using restricted Boltzmann machine. Entropy, 20(11), 2011.

[18] J. Rekha, J. Bhattacharya, and S. Majumder. Shape, texture and local movement hand gesture features for Indian sign language recognition. In 3rd International Conference on Trendz in Information Sciences & Computing (TISC2011), pages 30–35. IEEE, 2011.

[19] S. T. A. S. S. Shanta and M. R. Kabir. Bangla sign language detection using SIFT and CNN. 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pages 1–6, 2018.

[20] D. Sau. ASL (American Sign Language) alphabet dataset, 2022.

[21] S. Shahriar, A. Siddiquee, T. Islam, A. Ghosh, R. Chakraborty, A. I. Khan, C. Shahnaz, and S. A. Fattah. Real-time American sign language recognition using skin segmentation and image category classification with convolutional neural network and deep learning. In TENCON 2018 - 2018 IEEE Region 10 Conference, pages 1168–1171. IEEE, 2018.

[22] F. Zhang, V. Bazarevsky, A. Vakunov, A. Tkachenka, G. Sung, C.-L. Chang, and M. Grundmann. MediaPipe Hands: On-device real-time hand tracking. arXiv preprint arXiv:2006.10214, 2020.