
Sign Language Recognition with Convolutional Neural Networks

Arnav Gangal Anusha Kuppahally Malavi Ravindran



Abstract

Our paper presents a two-pronged ablation study for sign language recognition for American Sign Language (ASL) characters on two datasets. Experimentation revealed that hyperparameter tuning, data augmentation, and hand landmark detection can help improve accuracy. The final model achieved a test accuracy of 96.42%. Future work includes running the model for a greater number of epochs, tuning the minimum detection confidence parameter in hand landmark detection, further hyperparameter tuning for data augmentation, and additional hand detection bounding box or coordinate methods.

1. Introduction

Effective sign language recognition is an active area of research that intersects both computer vision and natural language processing, with a variety of methods aiming to facilitate communication among the deaf and hard-of-hearing community. This research area can help resolve a communication gap between those who use sign language, and those who do not. The existence of this gap leads to significant barriers in everyday interactions, and reducing it through effective translation models can create more inclusive and equitable spaces for deaf and hard-of-hearing people, as well as improve their quality-of-life.

To explore this application of computer vision, our project focuses on American Sign Language (ASL) character detection using CNNs. The input to our algorithm is static images of ASL character signs, with variation in the dataset coming from the image angle, the image lighting, and the specific subject performing the sign. Then, we use a dual-input CNN to identify key points on subjects' hands ('landmarks'), and output a character prediction based on a combination of these landmarks and the image itself. Our project focuses on the ASL fingerspelling alphabet, including all characters except J and Z, as signing these characters requires motion. We chose this project because it is a current area of research that has many different methods and architectures, which gave us many opportunities to test a range of different models, and compare their performance in the classification task. Additionally, this topic is an interesting example of the intersection between artificial intelligence and social good, and in particular how computer vision models can be deployed to improve the quality-of-life of often marginalized groups.

2. Related Work

2.1. Hand Detection

Within this research area, hand isolation and detection is an important component of sign language recognition. One example of this is [15], which uses Google's hand landmark model [22] to identify hand landmark coordinates, which serve as a second input channel to a CNN.

Another example of this is [19], which uses skin masking, which crops the region of interest (RoI) that only contains the hand, the Canny Edge Detection algorithm to detect the edges of the hand, and extracts features with Scale-Invariant Feature Transform (SIFT) to account for factors like rotation, scaling, etc.

One more instance of hand detection is done by [17], which uses a finetuned CNN model based on the Faster Region-based Convolutional Neural Network (RCNN), which uses a region proposal network (RPN) to predict the bounds where the hand is located, achieving 99.31% accuracy.

Additionally, [8] uses a tree-structured regional ensemble network (REN), which partitions convolution outputs into different regions, concatenates results, and regresses 3D joint coordinates in depth images with end-to-end optimization. All of these papers employ different techniques to detect hands in images, and also use data augmentation.

In particular, [17] excels at these techniques by using the detected hand and then applying various data augmentation techniques, such as 5 crops and adding noise. Adding crops in this way increased the amount of data and made the model more robust. On the other hand, one thing to note about [15] is that this paper did not sufficiently compare results between a single input channel and a multi-headed CNN after adding the hand landmark coordinates. While other papers were formatted as ablation studies, this paper lacked an explanation of how the model was built upon.
2.2. Real-time Robust ASL Recognition

While classifying and translating static images of sign language letters and words is an essential task, there is also a need for robust and scalable real-time ASL recognition models. A noted weakness of existing training datasets is that they often do not contain a variety of skin-tones, making models trained on them prone to failure at inference time when presented with a hand from an ethnicity not seen during training. One strategy to mitigate this is presented in [21], which uses a skin detection algorithm to create a mask for the input image that works by looking at the colors of the image in terms of luminance (Y) and chrominance (Cb and Cr), a common color space in video compression. This mask was used to remove all parts of the image that were not identified as skin, and their model was able to achieve 94.7% test accuracy with a downstream classification model based on AlexNet.

Similarly, [12] proposes a multiple-stage pipeline to improve model robustness with reduced inference latency, by integrating MediaPipe landmarks into a standard CNN architecture (similar to [15]). This work validates the findings of multiple other researchers ([5], [3]), who achieved greater classification accuracy when integrating MediaPipe hand landmarks (with both static 2D images and 3D depth images) into their data preprocessing pipeline. [12] compared their model to ones based on pre-trained Inception CNNs, and non-convolutional models such as random forests and SVMs, and found that an integrated MediaPipe landmark/raw image model was able to outperform them in terms of accuracy (~90% for Inception and SVM baselines, compared to over 99% accuracy for their model) and in some cases, in inference time (particularly the SVM-based image model used in [18]).

One noticeable vector to reduce gesture recognition inference time is presented in [9]. This paper is distinct in that it trains on a variety of international sign languages, including Indian Sign Language (ISL) gestures. ISL is distinct from ASL in that ISL fingerspelling gestures typically use two hands instead of one, making the MediaPipe model (which is designed to be adaptable to generate landmarks for multiple hands in a single image) an appropriate modeling choice. Their model achieves low latency by not using CNNs at all, instead using a lightweight SVM as the classifier, after reducing the size of the input search area using a two-stage hand-detection/landmark extraction pipeline. [9] achieve accuracies of > 98% on Italian, Indian, and American sign language datasets, indicating that this approach is also effective at performing real-time gesture recognition.

2.3. 3D CNNs and Hand Modeling

3D CNNs are another prominent architecture in sign language recognition. One instance of this is [11], which uses Microsoft Kinect, a motion sensor that provides a color and depth stream and can track body movement, as an input device. With Kinect, the CNN has 5 inputs, including depth and a body skeleton, and achieves 94.2% accuracy, which is higher than baseline methods.

Another instance of this is [7], which uses a multi-stream architecture that comprises CNNs and GANs to generate depth and joint information from RGB channels. Then, manual and non-manual features are processed in a 3D CNN. This multi-stream model receives the frames in RGB, segmented hands and faces, distance and speed maps, and the artificial depth maps generated by the GANs, and achieves 91% accuracy.

One more example of 3D CNNs is [1], which uses a fusion of parallel 3D CNN structures, where linear sampling is applied to select frames, and a 3D CNN learns the spatiotemporal features at certain times in the video sequence. Then, the 3D CNN extracts features from one of the clips, and then various methods for feature fusion are considered, including MLP, LSTM, and stacked autoencoders. After considering scenarios of all combinations, including signer-dependent and signer-independent (where the signers in the test data aren't included in the training data), the signer-dependent model using MLP fusion achieved 98.12% accuracy. In particular, [7] employed state-of-the-art techniques by combining many architectures and using datasets in multiple languages of sign language. Also, [1] excels in testing many different model combinations and techniques, even applying PCA and t-SNE for data reduction. In contrast, [11] has no mention of data augmentation, which is important to consider to prevent overfitting.

3. Methods

3.1. Baselines

Our project is modeled as an ablation study—we built upon our model iteratively to test and evaluate each extension. We first started with a simple CNN, made up of 2 convolutional layers (with 32 and 64 channels respectively), each of which is followed by ReLU and 2 × 2 max pooling. The data is then flattened, and fed through two fully-connected layers, with 128 nodes and 24 nodes respectively, where 24 was the number of classes. Using a simple CNN as our initial model allowed us to develop a robust image preprocessing pipeline, and develop an initial understanding of how suited CNNs were as a general approach to this task. This CNN was implemented using the standard deep learning framework PyTorch [14], and image preprocessing was performed using the Python Imaging Library (Pillow) [4].
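To make this baseline concrete, the block below is a minimal PyTorch sketch of the simple CNN described above (two convolutional blocks with 32 and 64 channels, each followed by ReLU and 2 × 2 max pooling, then fully-connected layers of 128 and 24 units), assuming the 50 × 50 grayscale inputs described in Section 4. The kernel size, padding, and layer names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal sketch of the simple baseline: 2 conv blocks, then 2 FC layers."""

    def __init__(self, num_classes: int = 24):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # grayscale input, 32 channels
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 50x50 -> 25x25
            nn.Conv2d(32, 64, kernel_size=3, padding=1),   # 64 channels
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 25x25 -> 12x12
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 128),                  # 128-node hidden layer
            nn.ReLU(),
            nn.Linear(128, num_classes),                   # 24 output classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example: a batch of 8 grayscale 50x50 images.
model = SimpleCNN()
logits = model(torch.randn(8, 1, 50, 50))
print(logits.shape)  # torch.Size([8, 24])
```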
To add model complexity and additional pathways for feature extraction, we then chose to replicate the architecture in [15], using solely the input channel of grayscale images. This model will be referred to as our baseline model, and provided us with a more comprehensive baseline for the classification task.
For this CNN, we used the following architecture: 5 convolutional layers with a filter size of 3 and 32, 64, 128, and 512 filters and ReLU activation. Each is followed by a dropout layer and batch normalization, as a form of model regularization. Each convolutional block after the first is followed by a max pooling layer of size 2 × 2. We used a dropout probability of 0.3 and learning rate of 0.001 as initial hyperparameter values, before tuning. We modified the paper's architecture slightly to maintain output dimensions. The model ends with a fully connected-dropout-fully connected block as a classification head, to classify each image into one of the 24 classes. A complete diagram of this model's architecture can be seen in Figure 1.

Figure 1. Baseline CNN Architecture, from [15]

A deeper CNN architecture such as this one offers the benefits of being able to extract more complex spatial features before classification. The model's loss was evaluated using cross-entropy loss, which can be seen in Equation 1.

    L_{CE}(y, \hat{y}) = -\sum_{i=1}^{N} y_i \log(\hat{y}_i)    (1)
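As a companion to Figure 1, the block below is a hedged PyTorch sketch of this style of baseline: each block is convolution (filter size 3) with ReLU, dropout, and batch normalization, 2 × 2 max pooling follows every block except the first, and a fully connected-dropout-fully connected head is trained with cross-entropy (Equation 1). The text lists the filter counts 32, 64, 128, and 512, so the sketch builds one block per listed width; the padding, the adaptive pooling, and the 256-unit hidden layer in the head are illustrative assumptions, not the exact replicated model.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, dropout: float, pool: bool) -> nn.Sequential:
    """One block: conv(3x3) -> ReLU -> dropout -> batch norm (-> optional 2x2 max pool)."""
    layers = [
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.BatchNorm2d(out_ch),
    ]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class BaselineCNN(nn.Module):
    def __init__(self, widths=(32, 64, 128, 512), dropout: float = 0.3, num_classes: int = 24):
        super().__init__()
        blocks, in_ch = [], 1  # grayscale input
        for i, w in enumerate(widths):
            blocks.append(conv_block(in_ch, w, dropout, pool=(i > 0)))  # no pooling after the first block
            in_ch = w
        self.features = nn.Sequential(*blocks)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),      # illustrative stand-in to keep the head input size fixed
            nn.Flatten(),
            nn.Linear(in_ch, 256),
            nn.Dropout(dropout),
            nn.Linear(256, num_classes),  # fully connected -> dropout -> fully connected head
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# Cross-entropy loss (Equation 1) on a dummy batch of 50x50 grayscale images.
model = BaselineCNN()
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(4, 1, 50, 50), torch.randint(0, 24, (4,))
loss = criterion(model(x), y)
```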

3.2. Tuning and Augmentation

To maximize this model's ability to accurately classify letters, we conducted hyperparameter tuning on dropout probability and learning rate, with the intention of applying the most successful values to subsequent models. Since our baseline dataset was relatively small, we did not perform cross-fold validation to tune these parameters, but rather used the complete dataset. Full details of the result of this tuning can be found in Table 3. To improve this input channel's robustness to potential data sources in-the-wild, we also applied a variety of image data augmentation techniques, including salt and pepper noise, random rotation, random zoom, random shift, random horizontal flip, and random crop.

3.3. Hand Landmarks

To further compare our approach to models in the literature ([15, 5]), we integrated a second input channel with our CNN. This channel takes in color images of signed gestures, and uses the MediaPipe hand landmark extraction model to obtain a set of 21 hand landmark coordinates [22]. This is followed by a shallow CNN consisting of 2 convolutional layers with 50 and 25 channels respectively, each with ReLU activation, batch normalization, and 2 × 2 max pooling.
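A minimal sketch of this landmark branch is shown below. The text does not specify the exact tensor layout for the 21 landmarks, so the sketch treats them as a 1D sequence of 21 points with the (x, y, z) coordinates as input channels, purely as an illustrative adaptation of the two-layer shallow CNN described above; the kernel size and padding are also assumptions.

```python
import torch
import torch.nn as nn

class LandmarkBranch(nn.Module):
    """Sketch of the shallow landmark CNN: two conv layers (50 and 25 channels),
    each with ReLU, batch norm, and max pooling, applied over the 21 landmarks."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(3, 50, kernel_size=3, padding=1),  # (x, y, z) treated as input channels
            nn.ReLU(),
            nn.BatchNorm1d(50),
            nn.MaxPool1d(2),                             # 21 points -> 10
            nn.Conv1d(50, 25, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm1d(25),
            nn.MaxPool1d(2),                             # 10 -> 5
            nn.Flatten(),                                # 25 * 5 = 125 features for fusion
        )

    def forward(self, landmarks: torch.Tensor) -> torch.Tensor:
        # landmarks: (batch, 3, 21) -- 21 hand-knuckle points with x, y, z coordinates
        return self.net(landmarks)

features = LandmarkBranch()(torch.randn(8, 3, 21))
print(features.shape)  # torch.Size([8, 125])
```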
The MediaPipe landmark model itself is made up of two models: a palm detector to provide a bounding box for hands, and a hand landmark model that provides a hand skeleton in the form of 21 hand-knuckle coordinates within the image.

The palm detector allows the model to localize the hand to a particular area of the image. The detector uses an encoder-decoder feature extractor, built on the idea of a Feature Pyramid Network (FPN) [13]. FPNs are a type of CNN architecture that were specifically designed to enhance the ability of CNNs to detect objects at multiple scales. This helps models develop scale-invariance, which is useful for our proposed task in that it reduces the need for scale-based data augmentation. FPNs are typically computed on top of backbone CNNs such as ResNet, and work by successively extracting feature maps at different stages of the backbone network's architecture (for example, in the ResNet case, these feature maps are taken from different residual blocks). Feature maps from different stages are then combined, often using upsampling and element-wise addition, to produce a 'pyramid' of features at different scales. In the case of palm detection, standard RoI pooling methods such as Fast R-CNN [6] are used to produce RoIs from the feature maps at different levels of the pyramid. An example of this type of architecture can be seen in Figure 2.

Figure 2. FPN region proposal, from [13]
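To make the "combine via upsampling and element-wise addition" step concrete, here is a small self-contained sketch of an FPN-style top-down pathway. It is not MediaPipe's or [13]'s actual code; the channel sizes and the use of nearest-neighbor upsampling are assumptions chosen only to illustrate the idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Illustrative top-down FPN pathway: lateral 1x1 convs + upsample + element-wise add."""

    def __init__(self, in_channels=(64, 128, 256), out_channels: int = 64):
        super().__init__()
        # 1x1 "lateral" convs project each backbone stage to a common width.
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convs smooth each merged map.
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):
        # feats: backbone feature maps ordered fine -> coarse, e.g. strides 8, 16, 32.
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        merged = [laterals[-1]]  # start from the coarsest map
        for lateral in reversed(laterals[:-1]):
            up = F.interpolate(merged[-1], size=lateral.shape[-2:], mode="nearest")
            merged.append(lateral + up)  # element-wise addition with the upsampled coarser map
        merged.reverse()  # back to fine -> coarse order
        return [s(m) for s, m in zip(self.smooth, merged)]

# Fake backbone outputs for a 64x64 input at strides 8, 16, and 32.
c3, c4, c5 = torch.randn(1, 64, 8, 8), torch.randn(1, 128, 4, 4), torch.randn(1, 256, 2, 2)
pyramid = TinyFPN()([c3, c4, c5])
print([p.shape for p in pyramid])  # three 64-channel maps at 8x8, 4x4, and 2x2
```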
Once bounding boxes around hands have been detected, the hand landmark model performs landmark localization as a regression task to find key coordinates on the hands. This model takes a proposed region of the input image from the palm detection model as input, and outputs 21 2.5D hand landmark coordinates (x, y, and z relative to the wrist landmark), using some type of feature extractor (details of this extractor are not provided in [22]). The model also outputs a confidence score, indicating how likely it thinks that the region contains a hand. In the case that the confidence score was too low, we chose to fill in the output tensor with zeros, to maintain shape consistency for the remainder of the network.
The topology of the landmark coordinates can be seen in Figure 3.

Figure 3. Hand landmark topology, from [2]

In our project, we did not implement or train this model ourselves. Rather, we used the MediaPipe Python package [22] to instantiate a pre-trained hand landmark model, and passed the outputs of that model to the shallow CNN that we did instantiate and train. The output of this shallow CNN was flattened and concatenated with the output of the grayscale channel, and the combined tensors were passed through a fully connected-dropout-fully connected classification head for final classification.
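The snippet below is a hedged sketch of how this second channel can be wired up with the MediaPipe Python package: landmarks are extracted with the minimum detection confidence of 0.5 mentioned above, a zero tensor is substituted when no hand is found, and the landmark-branch features (such as those from the LandmarkBranch sketched in Section 3.3) are concatenated with the grayscale-branch features before a fully connected-dropout-fully connected head. The helper name, variable names, and head sizes are our own illustrative choices, not the exact implementation.

```python
import numpy as np
import torch
import torch.nn as nn
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1,
                                 min_detection_confidence=0.5)

def extract_landmarks(rgb_image: np.ndarray) -> torch.Tensor:
    """Return a (3, 21) tensor of hand landmarks, or zeros if no hand is detected."""
    result = hands.process(rgb_image)  # expects an RGB uint8 array, e.g. 224x224x3
    if not result.multi_hand_landmarks:
        return torch.zeros(3, 21)      # zero-fill keeps the tensor shape consistent
    points = result.multi_hand_landmarks[0].landmark
    coords = np.array([[p.x, p.y, p.z] for p in points], dtype=np.float32)  # (21, 3)
    return torch.from_numpy(coords.T)  # (3, 21)

class FusionHead(nn.Module):
    """Concatenate image-branch and landmark-branch features, then FC -> dropout -> FC."""

    def __init__(self, image_dim: int, landmark_dim: int, num_classes: int = 24):
        super().__init__()
        self.fc1 = nn.Linear(image_dim + landmark_dim, 128)  # 128 hidden units is an assumption
        self.dropout = nn.Dropout(0.2)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, image_feats: torch.Tensor, landmark_feats: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([image_feats, landmark_feats], dim=1)
        return self.fc2(self.dropout(torch.relu(self.fc1(fused))))

# Example: a blank image yields no detection, so the landmark input falls back to zeros.
dummy = np.zeros((224, 224, 3), dtype=np.uint8)
landmark_input = extract_landmarks(dummy).unsqueeze(0)          # (1, 3, 21)
logits = FusionHead(image_dim=512, landmark_dim=125)(torch.randn(1, 512), torch.randn(1, 125))
```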
3.4. Retesting on different data

Because our initial dataset was too simple (see Section 4), we chose to replace our initial dataset with a larger ASL dataset that contained more complex images (different backgrounds, more occlusion, more potential sources of distraction like faces) and repeated the same steps mentioned above.

4. Dataset and Features

For our primary dataset, we used the same dataset used by [15], which contains 24 letters (excluding J and Z) [16]. The dataset contains images from 5 non-native signers, with over 500 images for each sign per signer. The dataset contains 65,748 images total. We split the data as follows: 70% train, 15% validation, and 15% test.
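As an illustration of the 70/15/15 split described above, the following is a small sketch using a shuffled index permutation; the random seed and function name are arbitrary, and this is not necessarily how our split was produced.

```python
import numpy as np

def train_val_test_split(n_examples: int, val_frac: float = 0.15, test_frac: float = 0.15, seed: int = 0):
    """Shuffle example indices and carve out 70% train, 15% validation, 15% test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_examples)
    n_val = int(n_examples * val_frac)
    n_test = int(n_examples * test_frac)
    val_idx = order[:n_val]
    test_idx = order[n_val:n_val + n_test]
    train_idx = order[n_val + n_test:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(65_748)
print(len(train_idx), len(val_idx), len(test_idx))  # roughly 70% / 15% / 15% of 65,748
```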
For the first input mode to our model, we applied the following pre-processing steps to replicate [15]: converting the images to grayscale, sharpening using the same sharpening filter as in [15], resizing to 50 × 50, and normalizing the grayscale values. An example of the results of this process can be seen in Figure 4. For the second input channel, the only pre-processing step was resizing the images to 224 × 224 so that they could be passed to the MediaPipe detector.

However, after initial results when using the data from [16], we discovered that our images were too simple, leading to our model performing extremely well with little tuning or extensions (see Section 5.1). This motivated the application of noisy augmentation methods, to make our dataset more diverse, and the classification task more difficult, as mentioned in the related work section. We applied salt and pepper noise (0.05), random rotation within 10 degrees, random zoom (10%), random shift (0.1), random horizontal flip (0.5), and random crop (50 × 50 pixels) (Figure 4). We implemented the salt and pepper noise ourselves using NumPy [10], and the remaining augmentation methods were implemented using existing functions in Pillow [4]. After this augmentation, images were converted to grayscale and resized to 50 × 50 pixels, and normalized before being fed into the model's first input channel.
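The block below sketches this augmentation and pre-processing chain: salt-and-pepper noise written with NumPy, and rotation, horizontal flip, cropping, grayscale conversion, resizing, and normalization done with Pillow. The noise amount, rotation range, flip probability, and crop size follow the values quoted above, but function names and the exact order of operations are illustrative assumptions, and random zoom and shift are omitted for brevity.

```python
import random
import numpy as np
from PIL import Image, ImageOps

def salt_and_pepper(img: Image.Image, amount: float = 0.05) -> Image.Image:
    """Flip a fraction of pixels to pure black or pure white (NumPy implementation)."""
    arr = np.array(img)
    mask = np.random.rand(*arr.shape[:2])
    arr[mask < amount / 2] = 0          # pepper
    arr[mask > 1 - amount / 2] = 255    # salt
    return Image.fromarray(arr)

def augment_and_preprocess(img: Image.Image) -> np.ndarray:
    """Augment a training image and prepare it for the 50x50 grayscale input channel."""
    img = salt_and_pepper(img, 0.05)
    img = img.rotate(random.uniform(-10, 10))             # random rotation within 10 degrees
    if random.random() < 0.5:                             # random horizontal flip
        img = ImageOps.mirror(img)
    # Random 50x50 crop (assumes the source image is at least 50x50).
    left = random.randint(0, max(img.width - 50, 0))
    top = random.randint(0, max(img.height - 50, 0))
    img = img.crop((left, top, left + 50, top + 50))
    img = ImageOps.grayscale(img).resize((50, 50))        # grayscale + resize to 50x50
    return np.array(img, dtype=np.float32) / 255.0        # normalize to [0, 1]

# Example: augment a synthetic RGB image.
dummy = Image.fromarray(np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8))
x = augment_and_preprocess(dummy)
print(x.shape, x.min(), x.max())  # (50, 50), values in [0, 1]
```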
Figure 4. Image Processing and Data Augmentation Example

To further evaluate how well our model architecture could perform on more difficult data, we replaced our existing dataset with more complex data, from [20]. This new dataset consists of 233,104 images for 29 ASL classes, including all 26 characters of the alphabet as well as signs for "delete", "nothing", and "space". For consistency and comparability against results using our first dataset, we excluded the signs for J and Z, and also excluded the signs for delete, nothing, and space. We applied the same pre-processing and data augmentation to these images, again resulting in images of 50 × 50 pixels for the first input channel. We split the data as before, into 70% train, 15% validation, and 15% test. This dataset is significantly more complex than the first one, as it contains more diverse and larger backgrounds, along with the faces of signers. For example, some images are taken in rooms where there are multiple different objects in the background. We hoped that by using this data, data augmentation and hand detection would result in improved performance.

5. Experiments, Results, and Discussion

For quantitative evaluation, our primary metric was overall accuracy on the validation set, i.e. the percentage of validation examples which were correctly classified. On particular models, we also calculated the precision, recall, and F1 score for each individual class for a more detailed breakdown. These metrics are calculated as follows:

    \text{Precision}_i = \frac{\text{True Positives}_i}{\text{True Positives}_i + \text{False Positives}_i}

    \text{Recall}_i = \frac{\text{True Positives}_i}{\text{True Positives}_i + \text{False Negatives}_i}
    \text{F1 Score}_i = 2 \times \frac{\text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}
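For reference, these per-class metrics can be computed directly from predicted and true labels; the short sketch below does this with NumPy and is an illustration rather than the evaluation code we used.

```python
import numpy as np

def per_class_metrics(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    """Return (precision, recall, f1) arrays, one entry per class."""
    precision = np.zeros(num_classes)
    recall = np.zeros(num_classes)
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision[c] = tp / (tp + fp) if tp + fp else 0.0
        recall[c] = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = np.where(denom > 0, 2 * precision * recall / np.maximum(denom, 1e-12), 0.0)
    return precision, recall, f1

# Example with 24 classes and random labels.
y_true = np.random.randint(0, 24, size=1000)
y_pred = np.random.randint(0, 24, size=1000)
p, r, f1 = per_class_metrics(y_true, y_pred, 24)
```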
5.1. Smaller Dataset

For our baseline models, the model's overall accuracy on a randomly selected validation set made up of 15% of the data found in [16] ("smaller dataset") is found in Table 1. This subsection contains an explanation of each of these models, and our decision-making process when considering how to extend the baselines.

Model                                                              Validation Accuracy
Results from [15], single input channel, with data augmentation   96.29%
Results from [15], two input channels, with data augmentation     98.42%
Simple CNN                                                         98.14%
Baseline Model                                                     99.34%
Baseline Model with Data Augmentation                              96.40%
Complete Model                                                     96.5%

Table 1. All Model Results, Smaller Dataset

Our initial point of comparison, the top row of Table 1, is the validation accuracy reported in [15] on a model with only the 50 × 50 grayscale input channel, and data augmentation. Our second point of comparison is the validation accuracy reported in [15] on a dual-input model (grayscale images and landmarks), with data augmentation.

The first model we tested, as a proof-of-concept, was a relatively simple CNN with two convolutional layers and a single grayscale image input head. We were surprised to find that this model was able to achieve extremely high accuracy on the smaller dataset (98.14%), when the data had not been augmented. Our initial thoughts were that we were overfitting to the data. However, plotting our training and validation accuracies and losses as a function of epoch (Appendix 12) indicated that we were not overfitting to the training data, but that the model was actually extremely effective at classifying this dataset. One important thing to note about these plots is that the validation loss and accuracy outperform training loss and accuracy due to the fact that the model uses dropout, and so the model's full classification capability is only seen at test time.

We then replicated the deeper CNN architecture in [15] ("baseline model"), but again only initially worked with a single input data source. As expected, this model outperformed the Simple CNN, as the increased depth likely allowed for it to extract richer semantic features from the input images. A detailed breakdown of our precision, recall, and F1 scores for this model is provided in Table 2, and the accuracies and losses are plotted in Appendix 13.

Class    Precision    Recall    F1 Score    Support
A        1.00         0.99      0.99        412
B        0.99         1.00      0.99        430
C        1.00         0.99      0.99        431
D        0.99         0.99      0.99        403
E        1.00         1.00      1.00        400
F        1.00         0.99      0.99        388
G        1.00         0.99      0.99        376
H        0.99         1.00      0.99        409
I        0.99         0.99      0.99        395
K        1.00         0.99      0.99        435
L        0.99         0.99      0.99        396
M        0.98         0.99      0.99        410
N        0.98         0.98      0.98        418
O        0.99         0.99      0.99        387
P        0.99         0.98      0.98        434
Q        0.98         0.98      0.98        378
R        0.97         1.00      0.98        442
S        0.99         0.99      0.99        415
T        0.99         0.99      0.99        430
U        0.99         0.99      0.99        407
V        0.97         0.99      0.98        418
W        0.99         0.98      0.99        457
X        0.99         0.97      0.98        389
Y        0.99         0.99      0.99        406
Total    0.99         0.99      0.99        9866

Table 2. Baseline Model, Smaller Dataset Classification Report

At this point in our experimentation process, we chose to conduct hyperparameter tuning on the Adam learning rate and dropout probability of each layer. Our results indicated that the best combination of parameters was a dropout probability of 0.2 and a learning rate of 0.0005, and these parameters were used for all subsequent models. The classification accuracies for the various combinations of parameters can be found in Table 3.

                       Dropout Rate
Learning Rate    0.1        0.2        0.3        0.4
1e-4             99.44%     99.40%     98.84%     97.40%
5e-4             99.49%     99.62%     99.46%     99.14%
1e-3             99.41%     99.61%     99.49%     98.81%
5e-3             99.25%     99.17%     99.02%     98.23%
0.01             98.82%     98.55%     97.92%     95.84%

Table 3. Hyperparameter Tuning Classification Accuracy
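The sweep behind Table 3 can be reproduced with a simple nested loop over the two hyperparameters. The sketch below uses hypothetical train_model and evaluate stand-ins (not part of the paper) and is meant only to illustrate the grid over learning rate and dropout probability.

```python
import itertools
import random

def train_model(lr: float, dropout: float, epochs: int = 10) -> dict:
    """Hypothetical stand-in for the real training loop (returns a trained-model handle)."""
    return {"lr": lr, "dropout": dropout}

def evaluate(model: dict, split: str = "val") -> float:
    """Hypothetical stand-in for validation-accuracy evaluation."""
    return random.random()  # replace with real accuracy

learning_rates = [1e-4, 5e-4, 1e-3, 5e-3, 1e-2]
dropout_rates = [0.1, 0.2, 0.3, 0.4]

results = {}
for lr, dropout in itertools.product(learning_rates, dropout_rates):
    model = train_model(lr=lr, dropout=dropout)
    results[(lr, dropout)] = evaluate(model, split="val")

best_lr, best_dropout = max(results, key=results.get)
print(f"best: lr={best_lr}, dropout={best_dropout}, val acc={results[(best_lr, best_dropout)]:.2%}")
```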
To further bring our experiments more closely in line with those in the literature, we then applied data augmentation to the grayscale images (as detailed in Section 4). A baseline model with only the grayscale input head trained on these augmented images was able to achieve 96.4% validation accuracy.
As expected, due to the simplicity of the initial data, our final performance was worse than the baseline model. Similarly to the baseline model, validation accuracy was higher than training accuracy, likely due to the fact that our model uses dropout layers (Appendix 14). As seen in the confusion matrix, the characters most confused were classes 14 and 15 (P and Q), 19 and 20 (U and V), 20 and 21 (V and W), and 0 and 18 (A and T) (Figure 5). Given the signs for these characters, this misclassification is likely, as the signs for P and Q look very similar, and the signs for U, V, and W, and A and T, all have a similar hand position (Figure 6) [16]. Also, since this dataset consists of signs from non-native signers, slight errors and variations in signs may contribute to this misclassification.
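The most-confused class pairs quoted here can be read off the validation confusion matrix; the snippet below shows one way to do that with scikit-learn and NumPy, as an illustration rather than our exact analysis script.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def most_confused_pairs(y_true, y_pred, class_names, top_k: int = 5):
    """Rank off-diagonal confusion-matrix entries to find the most confused class pairs."""
    cm = confusion_matrix(y_true, y_pred)
    np.fill_diagonal(cm, 0)                      # ignore correct predictions
    flat = np.argsort(cm, axis=None)[::-1][:top_k]
    pairs = []
    for idx in flat:
        true_c, pred_c = np.unravel_index(idx, cm.shape)
        pairs.append((class_names[true_c], class_names[pred_c], int(cm[true_c, pred_c])))
    return pairs

# Example with the 24 letter classes (J and Z excluded).
letters = list("ABCDEFGHIKLMNOPQRSTUVWXY")
y_true = np.random.randint(0, 24, size=2000)
y_pred = np.random.randint(0, 24, size=2000)
print(most_confused_pairs(y_true, y_pred, letters))
```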

Figure 5. Confusion Matrix for Baseline Model, Smaller Dataset, Data Augmentation

Figure 6. ASL alphabet

After adding hand landmark detection using a minimum detection confidence of 0.5, we achieved a slightly higher validation accuracy of 96.5% on our complete model (Figure 7), and reduced the misclassification among classes 19, 20, and 21 (U, V, and W), and classes 0 and 18 (A and T) (Figure 8). For test accuracy on our complete model, we achieved 96.8% accuracy. Our test accuracy is lower than the results achieved by [15], but there are a few differences to note—our model ran on 10 epochs and used a fixed learning rate of 0.0005 due to hardware restrictions, while [15] ran their model on 50 epochs with a dynamic learning rate.

Figure 7. Losses and Accuracies for Complete Model, Smaller Dataset

Despite improved accuracy overall when using hand landmark detection, we noticed that the number of misclassified images among classes 14 and 15 (P and Q) remains the same, and surprisingly, the misclassification between classes 6 and 7 (G and H) and 12 and 18 (N and T) increased. As seen in Figure 6, the signs for these characters are quite similar. Even with hand landmark detection, it is possible that the model has confused these two signs due to their similarity.

There are a few reasons why this could have happened. First, there are some images where the hand landmark model is not able to detect a hand in the image. In this case, the prediction is solely based on the image, so for commonly confused signs, the coordinates are not able to assist in preventing misclassification. It is also possible that the coordinates are incorrect based on the image, which may worsen the prediction. Examples of both the missing landmarks case and the incorrect landmarks case can be seen in Figure 9. With a minimum detection confidence of 0.5, 23.27% of the images had landmark coordinates. We tried using a lower minimum detection confidence, but noticed that images tended to have incorrect coordinates—we decided to err towards no coordinates rather than incorrect coordinates as we believed this was more likely to produce a correct prediction.

Then, we examined the precision, recall, and F1 score. The letters G and T had the lowest precision, while the letters N and Q had the lowest recall. Overall, classes N, Q, and T had the lowest F1 scores (Table 4).
Figure 8. Confusion Matrix for Complete Model, Smaller Dataset

Figure 9. Data Augmentation and Hand Landmark Detection Examples

Class    Precision    Recall    F1 Score    Support
A        0.98         0.98      0.98        403
B        0.97         1.00      0.98        392
C        0.99         0.99      0.99        444
D        0.99         0.97      0.98        421
E        0.96         0.99      0.97        374
F        0.98         0.97      0.97        406
G        0.92         0.96      0.94        413
H        0.98         0.94      0.96        442
I        0.98         0.96      0.97        417
K        0.96         0.97      0.96        452
L        0.98         0.99      0.99        412
M        0.97         0.97      0.97        412
N        0.97         0.90      0.94        378
O        0.96         0.97      0.96        371
P        0.93         0.95      0.94        401
Q        0.95         0.92      0.93        382
R        0.97         0.93      0.95        454
S        0.95         0.98      0.97        410
T        0.92         0.94      0.93        405
U        0.96         0.97      0.97        393
V        0.97         0.98      0.97        405
W        0.98         0.98      0.98        465
X        0.96         0.96      0.96        398
Y        0.99         0.97      0.98        416
Total    0.96         0.96      0.96        9866

Table 4. Complete Model, Smaller Dataset Classification Report

5.2. Larger Dataset

We then repeated the following steps after replacing our data with a larger and more complex dataset. A summary of our results is found in Table 5.

Model                                    Validation Accuracy
Simple CNN                               57.45%
Baseline Model                           98.25%
Baseline Model with Data Augmentation    98.93%
Complete Model                           96.6%

Table 5. All Model Results, Larger Dataset

For the simple CNN, the model did not perform as well as the same model did with the simpler data, which is expected (Appendix 15).

Next, after running the baseline model on the new data, our validation accuracy significantly improved to 98.25% (Appendix 16).

With data augmentation, we achieved a higher validation accuracy of 98.93%; this indicates that data augmentation helped make our model more robust to the complex data (Appendix 17).

Then adding hand landmark detection, we had a validation accuracy of 96.6% (Figure 10). We also had a test accuracy of 96.42%. Our test accuracy is again lower than the results obtained by [15], but this is likely due to slight changes in architecture as mentioned above.
Figure 10. Losses and Accuracies for Complete Model, Larger Dataset

After examining the confusion matrix, it is clear that the model confuses classes 19, 20, and 21 (U, V, and W), 0 and 18 (A and T), 16 and 19 (R and U), and classes 11 and 12 (M and N) (Figure 11). Similarly to the full model on the smaller data, the model still confuses signs that have similar hand positions even with hand landmark detection. However, this model slightly outperformed the full model on the smaller data, which is important to note considering this data is more complex.

Figure 11. Confusion Matrix for Complete Model, Larger Dataset

Using the new data, 66.52% of images had landmark coordinates with the same minimum detection confidence of 0.5, which is a significant improvement. We were surprised to see that adding hand landmark detection slightly worsened results. This may be because of the added complexity of the data—it may be that additional hand detection methods may be needed to make the landmark detection more effective, like RPN or cropping the RoI.

6. Conclusion and Future Work

Overall, our project was aimed at ASL character recognition with CNNs, and hoped to improve performance by implementing hyperparameter tuning, data augmentation, and hand landmark detection on two different datasets, one more complex than the other. After running various models, we found that our model using the more complex data with data augmentation had a 98.93% validation accuracy. However, our full model, which includes hand landmark detection, achieved a test accuracy of 96.42%. Based on our error analysis, it makes sense that adding data augmentation on the more complex data improved accuracy, while doing so on the simpler data worsened accuracy. While we expected hand landmark detection to improve accuracy on the complex data, it is likely that further work in hand detection and position prediction is needed to improve results. Some future work may include running the model on a greater number of epochs, further tuning of the minimum detection confidence parameter for hand landmark detection, tuning the parameters for data augmentation, and implementing additional hand detection or coordinate techniques.

7. Contributions & Acknowledgements

All coding and report writing was split equally across the milestone and the final report. Specifically, Arnav implemented the baseline models and second input channel, Anusha did data pre-processing, error analysis, and augmentation, and Malavi adapted the models to the new dataset, ran the full models, and collected error analysis plots.
8. Appendices

Figure 12. Losses and Accuracies for Simple CNN, Smaller Dataset

Figure 13. Losses and Accuracies for Baseline Model, Smaller Dataset

Figure 14. Losses and Accuracies for Baseline Model, Data Augmentation, Smaller Dataset

Figure 15. Losses and Accuracies for Simple CNN, Larger Dataset

Figure 16. Losses and Accuracies for Baseline Model, Larger Dataset

Figure 17. Losses and Accuracies for Complete Model, Larger Dataset

References

[1] M. Al-Hammadi, G. Muhammad, W. Abdul, M. Alsulaiman, M. A. Bencherif, and M. A. Mekhtiche. Hand gesture recognition for sign language using 3DCNN. IEEE Access, 8:79491-79509, 2020.
[2] V. Bazarevsky, F. Zhang, A. Vakunov, C.-L. Chang, and M. Grundmann. MediaPipe Hands: On-device real-time hand tracking. https://github.com/google-ai-edge/mediapipe/blob/master/docs/solutions/hands.md, 2019. Accessed: 2024-06-05.
[3] J. Bora, S. Dehingia, A. Boruah, A. A. Chetia, and D. Gogoi. Real-time Assamese sign language recognition using MediaPipe and deep learning. Procedia Computer Science, 218:1384-1393, 2023.
[4] A. Clark and Contributors. Pillow - the friendly PIL fork, 2024. Version 10.3.0.
[5] A. Deep, A. Litoriya, A. Ingole, V. Asare, S. M. Bhole, and S. Pathak. Realtime sign language detection and recognition. In 2022 2nd Asian Conference on Innovation in Technology (ASIANCON), pages 1-4. IEEE, 2022.
[6] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[7] R. R. G. Giulia Zanon de Castro and F. G. Guimarães. Automatic translation of sign language with multi-stream 3D CNN and generation of artificial depth maps. Expert Systems with Applications, 215(119394), 2023.
[8] H. Guo, G. Wang, X. Chen, C. Zhang, F. Qiao, and H. Yang. Region ensemble network: Improving convolutional network for hand pose estimation. 2017 IEEE International Conference on Image Processing (ICIP), Sept. 2017.
[9] A. Halder and A. Tayade. Real-time vernacular sign language recognition using MediaPipe and machine learning. Journal homepage: www.ijrpr.com, ISSN 2582-7421, 2021.
[10] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array programming with NumPy, 2020. Version 1.26.4.
[11] J. Huang, W. Zhou, H. Li, and W. Li. Sign language recognition using 3D convolutional neural networks. 2015 IEEE International Conference on Multimedia and Expo (ICME), pages 1-6, 2015.
[12] R. Kumar, A. Bajpai, and A. Sinha. MediaPipe and CNNs for real-time ASL gesture recognition. arXiv preprint arXiv:2305.05296, 2023.
[13] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117-2125, 2017.
[14] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library, 2019.
[15] R. Pathan, M. Biswas, S. Yasmin, M. Khandaker, M. Salman, and A. Youssef. Sign language recognition using the fusion of image and hand landmarks through multi-headed convolutional neural network. Nature, 13(16975), 2023.
[16] N. Pugeault and R. Bowden. Spelling it out: Real-time ASL fingerspelling recognition. 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 1114-1119, 2011.
[17] K. K. R. Rastgoo and S. Escalera. Multi-modal deep hand sign language recognition in still images using restricted Boltzmann machine. Entropy, 20(11), 2011.
[18] J. Rekha, J. Bhattacharya, and S. Majumder. Shape, texture and local movement hand gesture features for Indian sign language recognition. In 3rd International Conference on Trendz in Information Sciences & Computing (TISC2011), pages 30-35. IEEE, 2011.
[19] S. T. A. S. S. Shanta and M. R. Kabir. Bangla sign language detection using SIFT and CNN. 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pages 1-6, 2018.
[20] D. Sau. ASL (American Sign Language) alphabet dataset, 2022.
[21] S. Shahriar, A. Siddiquee, T. Islam, A. Ghosh, R. Chakraborty, A. I. Khan, C. Shahnaz, and S. A. Fattah. Real-time American Sign Language recognition using skin segmentation and image category classification with convolutional neural network and deep learning. In TENCON 2018 - 2018 IEEE Region 10 Conference, pages 1168-1171. IEEE, 2018.
[22] F. Zhang, V. Bazarevsky, A. Vakunov, A. Tkachenka, G. Sung, C.-L. Chang, and M. Grundmann. MediaPipe Hands: On-device real-time hand tracking. arXiv preprint arXiv:2006.10214, 2020.