Major Report
CHAPTER I
INTRODUCTION
1.1. General
Like spoken languages, sign language has a long and complex history. Hand gestures were
used as a means of communication in Greece as early as the fifth century B.C. However, the
first recorded accounts of sign language as a mode of communication in Western countries
date back to the 17th century.
Monastic sign languages were used by a number of religious orders in Europe from the tenth
century onward. These were elaborate gestural communication systems rather than true sign
languages. Long before 1492, deaf individuals in Native American communities made
extensive use of Plains Indian Sign Language for commerce, ceremonies, storytelling, and
everyday communication.
Our system must perform three critical tasks in real time:
1. Capturing video of the user as they sign
2. Assigning an ISL sign to each frame of the video
3. Combining the per-frame classification scores into the most probable word and displaying
it as the output.
The main challenges in developing a computer-vision-based solution to this problem are:
• Environmental factors (such as illumination, background, and camera placement)
• Occlusion, i.e. complete or partial obstruction of the hands or fingers
• Detection of sign boundaries (determining where a sign begins and ends)
• Co-articulation, i.e. the influence of preceding and succeeding signs on the current sign.
Although previous research has used neural networks to recognise ISL signs with accuracies
above 90 percent, most of these approaches require additional hardware such as
motion-tracking gloves or 3D cameras. These requirements significantly limit the feasibility
and scalability of such systems.
Our system's pipeline processes a video of a user signing a word, captured through a web
application. Individual frames are extracted from the video, and a convolutional neural
network (CNN) computes sign probabilities for each frame over the full Indian Sign Language
(ISL) sign repertoire. A set of heuristics then groups the frames according to the sign each
frame is hypothesised to belong to. Finally, a language model is used to present the user with
the most likely word, enabling near-instantaneous, uninterrupted communication for the deaf
community in India. A minimal sketch of this pipeline is given below.
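For illustration only, a skeleton of such a pipeline might look as follows; extract_frames,
classify_frame, and the language_model object are hypothetical placeholders for the
project's own components, not the exact implementation used in this work.

# Hypothetical end-to-end sketch of the recognition pipeline described above.
import cv2
import numpy as np

def extract_frames(video_path, step=5):
    # read every `step`-th frame from the input video
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def classify_frame(model, frame, size=(224, 224)):
    # return per-class sign probabilities for one frame
    x = cv2.resize(frame, size).astype(np.float32)[None, ...]
    return model.predict(x)[0]

def recognise_word(model, video_path, labels, language_model):
    probs = [classify_frame(model, f) for f in extract_frames(video_path)]
    letters = [labels[int(np.argmax(p))] for p in probs]   # per-frame best sign
    return language_model.most_likely_word(letters)        # grouping heuristics + LM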
CHAPTER II
LITERATURE REVIEW
The following research papers were studied to understand how real-time sign language
recognition systems work and how the existing technologies could be improved.
1. In ASL recognition systems, the three predominant classifier types are neural networks,
Bayesian networks, and linear classifiers.
- Linear classifiers are straightforward to configure, but they work best when provided with
carefully engineered features.
- Singha and Das achieved an accuracy of 96% on one-handed gestures using the
Karhunen-Loeve Transform.
2. Real-Time ASL Recognition Using Neural Networks, by Sigberto Alarcon Viesca, Barbara
Garcia, and Theodore "Brandon" Garcia.
- The Karhunen-Loeve Transform translates and rotates the axes to establish a new coordinate
system according to the variance in the data.
- Linear classifiers have been applied to the identification of simple hand gestures, such as
raising the forefinger or pointing at an object.
- Sharma et al. achieved an accuracy of 62.3% using piece-wise classifiers (SVM and k-NN)
after eliminating noise and background.
3. Bayesian networks such as Hidden Markov Models (HMMs) require precisely defined
models but are effective at capturing temporal patterns.
- Using a Hidden Markov Model together with a three-dimensional glove, Starner and
Pentland achieved a success rate of 99.2%.
4. Dynamic Bayesian Networks: using a DBN model, Suk et al. recognised hand gestures in
live video streams.
- Although not restricted to American Sign Language, their hand-gesture classification
achieved a 99% success rate.
5. Neural networks for ASL: neural networks learn the categorization features relevant to
ASL translation.
- Mekala et al. translated American Sign Language footage into text with a three-layer neural
network, considering the position and movement of the hands.
- A more compact representation of the hand position can be obtained through Fourier
transforms.
a. Transfer Learning
Building a classifier requires a model that can assign input data to distinct classes or
categories according to specified traits or features. The goal is to accurately predict the
category of previously unseen data using patterns learnt from a training dataset. A wide
variety of algorithms can be used for this purpose, including Bayesian networks, neural
networks, support vector machines, and decision trees.
Equation (1) gives the full softmax loss as the mean of the per-example loss over the training
examples x_i:

L = (1/N) * sum_i [ -log( e^{f_{y_i}} / sum_j e^{f_j} ) ]     (1)

where f_j is the score the model assigns to class j and y_i is the correct class of example x_i.
Transfer learning is a machine learning method in which a model trained on one task is
reused on a related task, so that the knowledge gained from the first task improves learning
and performance on the second. This strategy is particularly helpful when only a limited
amount of data is available for the new task, or when training a model from scratch would
demand substantial time or resources.
Transferring the knowledge of a pre-trained model to a new task can speed up training,
improve generalisation, and raise the overall performance of the classifier on the new task.
An illustrative sketch of this idea is given below.
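As a minimal sketch of this idea in Keras (the framework used in the Appendix), assuming a
pre-trained VGG16 backbone as a stand-in for GoogleNet, which Keras does not bundle, and
24 static ISL classes:

# Transfer-learning sketch (illustrative only): freeze a pre-trained ImageNet
# backbone and train a new classification head for 24 ISL signs.
from keras.applications import VGG16
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False          # keep the pre-trained features fixed

x = GlobalAveragePooling2D()(base.output)
x = Dense(128, activation="relu")(x)
out = Dense(24, activation="softmax")(x)  # one unit per static ISL sign

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])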
Caffe is a deep learning framework developed by the Berkeley Vision and Learning Centre
(BVLC). It is used extensively for a wide range of machine learning tasks, particularly in
computer vision. Caffe is well known for its speed and efficiency in training deep neural
networks and for its expressive architecture, which makes it simple to experiment with a
variety of network configurations.
GoogleNet, also known as Inception-v1, is a deep convolutional neural network architecture
developed by Google. It is designed to be computationally efficient while still achieving high
accuracy on image classification tasks. GoogleNet introduced the notion of inception
modules: building blocks that process several filter sizes in parallel within the same layer,
which results in improved performance.
Using Caffe with GoogleNet allows researchers and practitioners to combine the strengths of
the framework and the model architecture: GoogleNet's design is optimised for image
recognition, while Caffe provides efficient training and deployment. Caffe's adaptability and
effectiveness make it a popular choice for implementing and training complex networks such
as GoogleNet, and together the two make it possible to build state-of-the-art models that
achieve high accuracy on difficult image classification, object recognition, and other
computer vision tasks.
Augmenting the training data with different orientations is a technique used in machine
learning to improve the robustness and generalisation of models, particularly for image
recognition and object detection. By supplementing the training data with variations in
orientation, such as rotations, flips, and translations, the model learns features that are
invariant to these transformations, becomes more resistant to differences in how objects are
depicted in images, is less prone to overfitting, and performs better on data it has not
encountered before.
d. Fine-tuning the data
Fine-tuning the data by adding orientations is especially effective when the training data is
limited or lacks diversity in object orientations. Including diverse orientations during training
allows the model to recognise objects from a variety of angles, which leads to improved
performance on real-world data.
Data augmentation techniques such as adding orientations can be applied with tools and
frameworks like TensorFlow, PyTorch, and Keras, which provide functions for rotating,
flipping, and translating images, as shown in the sketch below.
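For illustration, a minimal Keras sketch of this kind of orientation-based augmentation (the
parameter values are arbitrary examples, not the settings used in this project):

# Orientation-based augmentation sketch: random rotations, shifts, and
# horizontal flips applied on the fly while training.
from keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,        # rotate up to +/- 15 degrees
    width_shift_range=0.1,    # translate horizontally by up to 10%
    height_shift_range=0.1,   # translate vertically by up to 10%
    horizontal_flip=True,     # signs may be made with either hand
    rescale=1.0 / 255,
)

# Example usage with an image folder laid out as one sub-folder per class:
# train_flow = augmenter.flow_from_directory("gestures/", target_size=(50, 50),
#                                            color_mode="grayscale", batch_size=32)
# model.fit_generator(train_flow, epochs=20)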
The Indian Sign Language (ISL) finger-spelling dataset, compiled by the Centre for Vision,
Speech, and Signal Processing at the University of Surrey, includes colour images (A) and
depth images (B). Because the system is intended to be used through a web application and a
laptop camera, we use only the colour images. These images are close-ups of hands that
occupy most of the frame. The dataset consists of twenty-four static ISL signs, collected from
five users in separate sessions under similarly controlled lighting and backgrounds, and
contains over 65,000 colour images in total. The height-to-width ratios of the images vary,
but their average size is roughly 150x150 pixels.
The heights and widths of the images in both collections are not uniform. To conform to the
input expected by GoogleNet, we therefore resize them to 256x256 pixels and take random
crops of 224x224 pixels. We also zero-centre the data by subtracting the mean image from
ILSVRC 2012. Normalising the image tensors is unnecessary because their values are already
limited to the range 0-255. In addition, we flip the images horizontally, since signs can be
produced with either the left or the right hand and our datasets contain examples of both.
Because the differences between any two classes in our datasets are small compared with the
differences between ILSVRC classes, we padded the images with black pixels so that they
keep their aspect ratio when resized. This padding also means that fewer relevant pixels are
discarded when random crops are taken. A sketch of this preprocessing is given below.
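A minimal sketch of the preprocessing just described, using OpenCV and NumPy;
mean_image stands in for the ILSVRC 2012 mean image, which would be loaded from the
pre-trained model's files:

# Preprocessing sketch: pad to a square, resize to 256x256, random-crop 224x224,
# subtract the dataset mean, and optionally mirror the image.
import cv2
import numpy as np
import random

def preprocess(img, mean_image, crop=224, size=256):
    h, w = img.shape[:2]
    # pad with black pixels so the aspect ratio survives the resize
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    img = cv2.copyMakeBorder(img, top, side - h - top, left, side - w - left,
                             cv2.BORDER_CONSTANT, value=(0, 0, 0))
    img = cv2.resize(img, (size, size)).astype(np.float32)
    # take a random 224x224 crop
    x = random.randint(0, size - crop)
    y = random.randint(0, size - crop)
    img = img[y:y + crop, x:x + crop]
    # zero-centre with the mean image and randomly flip horizontally
    img -= cv2.resize(mean_image, (crop, crop)).astype(np.float32)
    if random.random() < 0.5:
        img = cv2.flip(img, 1)
    return img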
3.1.3 Experiments, Results and Analysis
We use two measures to compare our findings with those of other studies. The first, and the
one used most frequently in the literature, is top-1 accuracy on the validation set, i.e. the
percentage of examples that are classified correctly. The second is top-5 accuracy, which
measures the percentage of classifications for which the correct label appears among the five
highest-scoring classes.
In addition, we use a confusion matrix, a table layout that gives a visual depiction of the
classification model's performance for each class. Analysing the letters that were incorrectly
classified provides valuable insights for improving future performance. A small sketch of
how these metrics can be computed is given below.
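For illustration, top-1 accuracy, top-5 accuracy, and a confusion matrix can be computed from
per-class scores as follows (a sketch assuming scores is an N x C array of class scores and
y_true holds the integer labels):

# Metrics sketch: top-1 accuracy, top-5 accuracy, and a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

def top_k_accuracy(scores, y_true, k=5):
    # fraction of examples whose true label is among the k highest-scoring classes
    top_k = np.argsort(scores, axis=1)[:, -k:]
    return float(np.mean([y in row for y, row in zip(y_true, top_k)]))

def evaluate(scores, y_true):
    y_pred = np.argmax(scores, axis=1)
    top1 = float(np.mean(y_pred == y_true))
    top5 = top_k_accuracy(scores, y_true, k=5)
    cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
    return top1, top5, cm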
For each of the experiments listed below, our model was trained on letters a-y (excluding j).
Preliminary testing showed that a base learning rate of 1e-6 fitted the training data well:
accuracy improved consistently and the optimisation appeared to converge. Once the loss
stopped improving, we halted training manually and then lowered the learning rate to further
optimise the loss, decreasing it by factors ranging from 2 to 100.
We also used the training routine that gave the most favourable results in tests with actual
users on our web application ('2_init'). In addition, we trained models to classify letters a-k
(excluding j) and a-e, to assess whether reducing the number of classes improved the results.
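In Keras terms (the experiments themselves were run in Caffe), the same "lower the learning
rate once the loss stops improving" behaviour can be expressed as a callback, for example:

# Learning-rate decay sketch: reduce the rate when the loss plateaus.
from keras.callbacks import ReduceLROnPlateau

lr_decay = ReduceLROnPlateau(monitor="loss", factor=0.5, patience=3,
                             min_lr=1e-8, verbose=1)
# model.fit(..., callbacks=[lr_decay])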
Table 1. Optimal accuracy ranges for all models trained on each letter subset.
Fig. 2: Epochs vs. validation accuracy for all models trained on letters a-y
(excluding j)
Fig. 3: Epochs vs. training loss for all models trained on letters a-y (excluding j)
Fig. 4: Epochs vs. validation accuracy for the 2_init models trained on each
letter subset (excluding j)
Fig. 5: Epochs vs. training loss for the 2_init models trained on each letter
subset (excluding j)
Fig. 6: Confusion matrix for the 2_init model trained on letters a-y (excluding j)
Fig. 7: Confusion matrix for the 2_init model trained on letters a-k (excluding j)
CHAPTER IV
RESULTS
4.1. SYSTEM ARCHITECTURE
The architecture of a real-time sign language recognition system typically consists of several
critical components that work together to process video input, extract features, classify
gestures, and provide real-time feedback. A high-level overview of a representative
architecture is presented below:
1. Video Input: The system receives live video from a camera or webcam, capturing the
user's hand gestures and motions as they sign.
2. Data Preprocessing and Transformation: The captured frames are processed to improve
image quality and prepare them for feature extraction. This may involve techniques such as
scaling, normalization, and noise reduction.
3. Feature Extraction: The preprocessed video frames are fed into a feature extraction
module, which extracts the relevant features representing the hand shapes and movements.
Techniques such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks
(RNNs) can be used to extract spatial and temporal information from the frames.
4. Model Prediction: The extracted features are passed to a gesture classification model,
which predicts the sign language gesture that corresponds to them. This is typically a deep
learning model trained on a large dataset of sign language gestures so that it can identify and
recognize the various signs.
5. Post-processing: Once the gesture has been classified, post-processing techniques can be
applied to refine the output and improve accuracy. This may involve smoothing the predicted
gestures over time, integrating contextual information, or applying language models.
6. Translation and Display: The recognized sign language gestures are shown on a user
interface, typically a graphical interface that displays them in real time. Giving the user
feedback about the recognized signs improves communication and interaction. A sketch of
this loop appears after the list.
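As an illustration of how these components fit together, a minimal webcam loop might look
as follows; preprocess and the label list are placeholders for the project's own preprocessing
and class names, and the model file name is taken from the Appendix listing:

# Real-time loop sketch: capture a frame, preprocess it, classify it,
# and overlay the predicted sign on the video feed.
import cv2
import numpy as np
from keras.models import load_model

model = load_model("cnn_model_keras2.h5")                  # trained gesture classifier
labels = [chr(c) for c in range(ord("a"), ord("z") + 1)]   # placeholder label list

def preprocess(frame, size=(50, 50)):
    # placeholder preprocessing: grayscale, resize, scale to [0, 1]
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, size).astype(np.float32) / 255.0
    return gray.reshape(1, size[0], size[1], 1)

cam = cv2.VideoCapture(0)
while True:
    ok, frame = cam.read()
    if not ok:
        break
    probs = model.predict(preprocess(frame))[0]
    text = "%s (%.2f)" % (labels[int(np.argmax(probs))], float(np.max(probs)))
    cv2.putText(frame, text, (30, 60), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow("Sign recognition", frame)
    if cv2.waitKey(1) == ord("q"):
        break
cam.release()
cv2.destroyAllWindows()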
4.2. USE CASE DIAGRAM
A use case diagram is a visual representation of the interactions between actors (users or
external systems) and a system. It shows the different ways the system can be used to achieve
particular goals. Use case diagrams are a common tool in software development for capturing
and conveying the functional requirements of a system, and they form part of the Unified
Modelling Language (UML).
The main components of a use case diagram are actors, use cases, and the relationships
between them. Actors are entities that interact with the system, while use cases are specific
functions or tasks that the system can carry out. The relationships between actors and use
cases illustrate how the actors use the system to accomplish their goals.
Use case diagrams are advantageous for several reasons. They offer a high-level overview of
how the system works, help identify the external entities that interact with it, and serve as a
basis for subsequent analysis and design. Because they present the behaviour of the system
from the user's perspective, they also make it easier for stakeholders such as developers,
designers, and customers to communicate with one another.
In a use case diagram, use cases are drawn as ovals and actors as stick figures, with lines
connecting actors to the use cases they participate in. Several types of relationships can be
shown, including associations, generalisations, and include/extend links, each of which serves
a specific purpose in defining the behaviour and requirements of the system.
Use case diagrams are particularly useful in the early stages of system development, when
they help define the essential features and interactions of the system. They are also useful for
verifying requirements and can serve as a reference for testing and validation activities later
in the development process.
4.3. Loss and Accuracy
Both the '1_init' and '2_init' models produced very noisy losses, as seen in Figure 3. Because
we were limited in both space and time, we initially had to use a batch size of 4, which was
less than optimal and led to the noisy loss. After observing these results, we trained the
network using a Lightning Memory-Mapped Database (LMDB) and were able to increase the
batch size to twenty. This let us lower the loss in a smoother, more monotonic manner and
converge more quickly to a good validation accuracy.
The 'full_train' model, by contrast, used the same learning rate across all layers of the
network, which allowed it to learn more rapidly, probably because the pre-trained GoogleNet
weights can then be adjusted more freely to account for the major disparities between the
datasets. Although this fits the training data more closely, it did not appear to hurt validation
accuracy relative to the other models. After analysing a large number of the images in our
dataset, we concluded that they were most likely captured as video frames of individuals
making the signs in the same room and in a single sitting. This lack of variation in the data is
why the 'full_train' model achieves a validation accuracy comparable to that of the other
models.
It is interesting to note that modifying our re-initialisation strategy and learning rates had only
a minimal impact on the final top-1 and top-5 accuracies: the largest difference between
models was less than 7% in top-1 accuracy and just over 1% in top-5 accuracy. We did
observe, however, that the models which re-initialised only the classification layer performed
better than the model with two re-initialised layers. Given the quality of the features extracted
by GoogleNet, this is not surprising, even though the separation between classes in our
dataset is very different from that in ILSVRC 2012; the pre-training was performed on far
more images than our dataset contains.
Compared side by side, the '1_init' and '2_init' models differed very little. Since GoogleNet
has 22 layers, intuition suggests that re-initialising two layers (rather than re-initialising one
layer and raising the learning-rate multiplier on another) would not be especially useful for
fine-tuning the model to our validation set, and our experiments confirmed this.
Using our web application, we ran qualitative tests of the four models with actual users (see
below). Because we expected a smaller number of classes to be easier to distinguish, we took
the '2_init' model and trained classifiers for the letters a-e and a-k (excluding j). Figure 4
shows a clear inverse relationship between the validation accuracy of the '2_init' model and
the number of letters being classified, which is not surprising: with five letters we reached a
validation accuracy of approximately 98%, whereas with ten letters we achieved only 74%.
4.4. Confusion Matrix
The confusion matrices suggest that the primary cause of our accuracy problems is the
misclassification of particular characters (for example, the letters k and d in Figure 7). In
many cases the classifier cannot differentiate between two or three very similar letters, or it
strongly prefers one letter of a pair (for example, g/h in Figure 7 and m/n/s/t in Figure 6).
Examining the confusion matrix for the ten-letter model, we found that it performed very
satisfactorily except for correctly identifying the letter k. We attribute this to two main
causes. First, the dataset contains k signed from a number of different viewpoints, ranging
from the front to the back of the hand facing the camera, as well as rotations in which the
fingers point up or to the sides; the only element consistent across all of these images is the
centre of mass of the hand. Second, the letter k shares characteristics with the letters d, g, h,
and i within the a-k range, and can essentially be built by combining components of those
letters. Consequently, if the classifier relied too heavily on any one of these features alone, it
could easily misclassify k as one of those letters, which also makes the latter harder to learn.
Real-time sign language recognition uses sophisticated computer vision techniques to
perform its translation task. These algorithms analyse and interpret the gestures and motions
made by the user, identify a wide range of signs, and translate them into written or spoken
language in real time. As a result, sign language users and people who are not familiar with
sign language can communicate with one another seamlessly and quickly.
The translation function can have a particularly large impact for Indian languages, since
India is home to a great many spoken languages. By enabling translation into a number of
different Indian languages, this technology can help bridge communication barriers not only
between people who use sign language and those who do not, but also between speakers of
different regional languages in India. Better accessibility and participation opportunities
allow individuals to communicate more effectively and engage fully in many aspects of
society.
Real-time translation of sign language into English and Indian languages also has a wide
range of potential applications. In educational settings, it could make it easier for students
who are deaf or hard of hearing to communicate with their instructors, leading to better
comprehension and greater participation in the classroom. In healthcare settings, it could
improve communication between medical workers and patients who use sign language,
ensuring that patients receive care that is both correct and effective.
The translation function of real-time sign language recognition is not limited to individual
interactions; it can also be incorporated into a wide range of technologies and platforms. For
example, it could be combined with video-conferencing tools to make it easier for sign
language users and hearing people to communicate during virtual meetings, or included in
mobile applications to provide translation assistance for everyday conversations on the move.
Overall, the translation capability of real-time sign language recognition holds great promise
for improving accessibility, inclusivity, and effective communication for people who use sign
language. By removing communication barriers and making real-time translation into
English and Indian languages possible, this technology could transform the way we interact
and communicate within a society that is both diverse and welcoming to people of all
backgrounds.
CHAPTER V
Challenges and Future Scope
Developing a real-time sign language recognition system with OpenCV and TensorFlow
involves a number of issues that must be solved to ensure successful implementation.
A major obstacle is the intricacy and diversity of sign language gestures. Sign language
involves a wide range of hand movements, facial expressions, and body postures, all of which
can vary significantly from one sign language to another and even from one signer to
another. Capturing and interpreting these nuances in real time requires powerful computer
vision and machine learning algorithms that can handle this diversity.
Another obstacle is the requirement for real-time processing and inference. To support
effective communication, a sign language recognition system must process video data
quickly and deliver correct predictions immediately. Achieving minimal latency while
maintaining high accuracy is critical to the system's usability and practicality.
It is also essential that the system be reliable and generalise well. The model is expected to
recognise signs consistently across a variety of lighting conditions, backgrounds, and camera
angles. For real-world applications, where environmental variables cannot always be
controlled, the system must be robust to noise, occlusions, and variations in hand shape and
movement.
Data collection and annotation present a further hurdle. Constructing a comprehensive
training dataset of sign language gestures that covers a wide variety of variations, and
ensuring accurate labelling, can be time-consuming and resource-intensive. Annotating sign
language data also requires expertise in sign language interpretation to guarantee the accuracy
of the labels, adding another layer of complexity to data preparation.
Deployment and integration with real-world applications pose additional challenges.
Optimising the model for deployment on resource-constrained devices while retaining
real-time performance requires attention to model size, inference speed, and memory usage.
Integration with existing systems or applications may require further development work to
guarantee smooth operation and a good user experience.
Future Scope
Incorporating additional technologies, approaches, or datasets into the real-time sign
language recognition system built with OpenCV and TensorFlow could dramatically improve
its performance and broaden its applicability.
1. Transfer learning with large-scale sign language datasets: Transfer learning on large-scale
sign language datasets, such as American Sign Language (ASL) or other sign languages,
could improve the model's performance and generalisation. Pre-training the model on a
diverse, comprehensive dataset would help it learn the complicated patterns and variations in
sign language gestures.
2. Data synthesis and generation: Generating synthetic data with techniques such as
Generative Adversarial Networks (GANs) could increase the size of the training dataset and
introduce variations in hand positions, backdrops, and lighting conditions. Synthetic data can
improve the model's generalisation and its robustness to a variety of new scenarios.
Alongside our core focus on optimising the GoogleNet model for image classification, it
would be worthwhile to investigate other neural network architectures that have shown
effectiveness in this domain, such as VGG or ResNet. Experimenting with a variety of
models would give us insight into their relative performance and possibly reveal new tactics
for improving our sign language recognition system.
We are also aware of the potential impact of more comprehensive image preprocessing on the
classification process. This could include enhancing contrast, removing the background, and
cropping the image. A more advanced strategy would be to use an additional convolutional
neural network (CNN) to localise and crop the hand region in the images, which should
further increase the accuracy of sign identification.
Incorporating bigram and trigram language models into our language model would
considerably increase the system's capacity to handle whole sentences rather than individual
words. This advancement would require improvements in letter segmentation and a more
effective technique for capturing images from users at a higher frequency. With these
enhancements, our objective is a more streamlined, comprehensive system for translating
sign language into written or spoken language. A small sketch of the idea is given below.
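Purely as an illustration of the idea, a letter-level bigram model can be used to pick the
dictionary word that best explains a noisy sequence of per-frame letter predictions; the tiny
vocabulary and transition probabilities below are made up for the example:

# Bigram language-model sketch: score candidate words by how well they match
# the per-frame letter predictions and how plausible their letter transitions are.
import math

bigram = {("h", "e"): 0.4, ("e", "l"): 0.3, ("l", "l"): 0.2, ("l", "o"): 0.3}  # toy probabilities
vocabulary = ["hello", "help", "hold"]                                          # toy lexicon

def word_score(word, frame_probs):
    # frame_probs: list of dicts mapping letters to per-frame probabilities
    score = 0.0
    for i, letter in enumerate(word):
        if i < len(frame_probs):
            score += math.log(frame_probs[i].get(letter, 1e-6))          # observation term
        if i > 0:
            score += math.log(bigram.get((word[i - 1], letter), 1e-6))   # transition term
    return score

def most_likely_word(frame_probs):
    return max(vocabulary, key=lambda w: word_score(w, frame_probs))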
Furthermore, one of our goals is to continuously improve the capabilities and performance of
our sign language recognition system. In order to achieve this goal, we are studying
alternative neural network topologies and using complex image preprocessing techniques.
We hope that by taking into consideration these tactics, we will be able to improve the
accuracy, efficiency, and overall user experience of our technology, which will ultimately
lead to improved communication and inclusion for people who use sign language.
CHAPTER VI
References
[1] Mitchell, Ross; Young, Travas; Bachleda, Bellamie; Karchmer, Michael (2006). "How
Many People Use ASL in the United States?: Why Estimates Need Updating" (PDF). Sign
Language Studies (Gallaudet University Press.) 6 (3). ISSN 0302-1475. Retrieved November
27, 2012.
[2] Singha, J. and Das, K. “Hand Gesture Recognition Based on Karhunen-Loeve
Transform”, Mobile and Embedded 232 Technology International Conference (MECON),
January 17-18, 2013, India. 365-371.
[3] D. Aryanie, Y. Heryadi. American Sign Language-Based Finger-spelling Recognition
using k-Nearest Neighbors Classifier. 3rd International Conference on Information and
Communication Technology (2015) 533-536.
[4] R. Sharma et al. Recognition of Single Handed Sign Language Gestures using Contour
Tracing descriptor. Proceedings of the World Congress on Engineering 2013 Vol. II, WCE
2013, July 3 - 5, 2013, London, U.K.
[5] T. Starner and A. Pentland. Real-Time American Sign Language Recognition from Video
Using Hidden Markov Models. Computational Imaging and Vision, 9(1); 227-243, 1997.
[6] M. Jeballi et al. Extension of Hidden Markov Model for Recognizing Large Vocabulary
of Sign Language. International Journal of Artificial Intelligence & Applications 4(2); 35-42,
2013
[7] H. Suk et al. Hand gesture recognition based on dynamic Bayesian network framework.
Pattern Recognition 43 (9); 3059-3072, 2010.
[8] P. Mekala et al. Real-time Sign Language Recognition based on Neural Network
Architecture. System Theory (SSST), 2011 IEEE 43rd Southeastern Symposium 14-16
March 2011.
[9] Y.F. Admasu, and K. Raimond, Ethiopian Sign Language Recognition Using Artificial
Neural Network. 10th International Conference on Intelligent Systems Design and
Applications, 2010. 995-1000.
[10] J. Atwood, M. Eicholtz, and J. Farrell. American Sign Language Recognition System.
Artificial Intelligence and Machine Learning for Engineering Design. Dept. of Mechanical
Engineering, Carnegie Mellon University, 2012.
[11] L. Pigou et al. Sign Language Recognition Using Convolutional Neural Networks.
European Conference on Computer Vision 6-12 September 2014
[12] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding.
http://caffe.berkeleyvision.org/, 2014.
[13] Lifeprint.com. American Sign Language (ASL) Manual Alphabet (fingerspelling) 2007.
CHAPTER VII
Appendix
Code:
Module: cnn_model-train.py
import numpy as np
import pickle
import cv2, os
from glob import glob
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint
from keras import backend as K
K.set_image_dim_ordering('tf')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
def get_image_size():
    # infer the input image size from one of the stored gesture images
    img = cv2.imread('gestures/1/100.jpg', 0)
    return img.shape

def get_num_of_classes():
    return len(glob('gestures/*'))

image_x, image_y = get_image_size()

def cnn_model():
    num_of_classes = get_num_of_classes()
    model = Sequential()
    model.add(Conv2D(16, (2,2), input_shape=(image_x, image_y, 1), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same'))
    model.add(Conv2D(32, (3,3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(3, 3), strides=(3, 3), padding='same'))
    model.add(Conv2D(64, (5,5), activation='relu'))
    model.add(MaxPooling2D(pool_size=(5, 5), strides=(5, 5), padding='same'))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(num_of_classes, activation='softmax'))
    sgd = optimizers.SGD(lr=1e-2)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    filepath = "cnn_model_keras2.h5"
    # keep the weights with the best validation accuracy seen so far
    checkpoint1 = ModelCheckpoint(filepath, monitor='val_acc', verbose=1,
                                  save_best_only=True, mode='max')
    callbacks_list = [checkpoint1]
    #from keras.utils import plot_model
    #plot_model(model, to_file='model.png', show_shapes=True)
    return model, callbacks_list

def train():
    # load the pickled splits produced by load_images.py
    with open("train_images", "rb") as f:
        train_images = np.array(pickle.load(f))
    with open("train_labels", "rb") as f:
        train_labels = np.array(pickle.load(f), dtype=np.int32)
    with open("val_images", "rb") as f:
        val_images = np.array(pickle.load(f))
    with open("val_labels", "rb") as f:
        val_labels = np.array(pickle.load(f), dtype=np.int32)
    print(val_labels.shape)
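    # --- minimal completion (the listing stops after loading the data): reshape to
    # (N, H, W, 1), one-hot encode the labels, then build and fit the CNN so the
    # ModelCheckpoint callback keeps the best weights on disk ---
    train_images = np.reshape(train_images, (train_images.shape[0], image_x, image_y, 1))
    val_images = np.reshape(val_images, (val_images.shape[0], image_x, image_y, 1))
    train_labels = np_utils.to_categorical(train_labels)
    val_labels = np_utils.to_categorical(val_labels)
    model, callbacks_list = cnn_model()
    model.fit(train_images, train_labels, validation_data=(val_images, val_labels),
              epochs=20, batch_size=64, callbacks=callbacks_list)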
train()
K.clear_session()
Module: create_gestures.py
import cv2
import numpy as np
import pickle, os, sqlite3, random
def init_create_folder_database():
    # create the image folder and the sqlite database if they do not exist
    if not os.path.exists("gestures"):
        os.mkdir("gestures")
    if not os.path.exists("gesture_db.db"):
        conn = sqlite3.connect("gesture_db.db")
        create_table_cmd = ("CREATE TABLE gesture ( g_id INTEGER NOT NULL "
                            "PRIMARY KEY AUTOINCREMENT UNIQUE, g_name TEXT NOT NULL )")
        conn.execute(create_table_cmd)
        conn.commit()

def create_folder(folder_name):
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)
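
# --- minimal helpers used by store_images() below but not shown in the excerpt;
# they are assumed reconstructions, not the project's exact code ---
image_x, image_y = 50, 50  # assumed output size for the saved gesture crops

def get_hand_hist():
    # load the skin-colour histogram pickled by Set_hand_histogram.py
    with open("hist", "rb") as f:
        hist = pickle.load(f)
    return hist

def store_in_db(g_id, g_name):
    # record the gesture id/name pair in the sqlite database created above
    conn = sqlite3.connect("gesture_db.db")
    conn.execute("INSERT INTO gesture (g_id, g_name) VALUES (?, ?)", (g_id, g_name))
    conn.commit()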
def store_images(g_id):
    total_pics = 1200
    hist = get_hand_hist()
    cam = cv2.VideoCapture(1)
    if cam.read()[0] == False:
        cam = cv2.VideoCapture(0)
    x, y, w, h = 300, 100, 300, 300
    create_folder("gestures/" + str(g_id))
    pic_no = 0
    flag_start_capturing = False
    frames = 0
    while True:
        img = cam.read()[1]
        img = cv2.flip(img, 1)
        imgHSV = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        # back-project the hand histogram to segment skin-coloured pixels
        dst = cv2.calcBackProject([imgHSV], [0, 1], hist, [0, 180, 0, 256], 1)
        disc = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (10, 10))
        cv2.filter2D(dst, -1, disc, dst)
        blur = cv2.GaussianBlur(dst, (11, 11), 0)
        blur = cv2.medianBlur(blur, 15)
        thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
        thresh = cv2.merge((thresh, thresh, thresh))
        thresh = cv2.cvtColor(thresh, cv2.COLOR_BGR2GRAY)
        thresh = thresh[y:y+h, x:x+w]
        # index [1] assumes the OpenCV 3.x return signature of findContours
        contours = cv2.findContours(thresh.copy(), cv2.RETR_TREE,
                                    cv2.CHAIN_APPROX_NONE)[1]
        if len(contours) > 0:
            contour = max(contours, key=cv2.contourArea)
            if cv2.contourArea(contour) > 10000 and frames > 50:
                x1, y1, w1, h1 = cv2.boundingRect(contour)
                pic_no += 1
                save_img = thresh[y1:y1+h1, x1:x1+w1]
                # pad the crop to a square before resizing
                if w1 > h1:
                    save_img = cv2.copyMakeBorder(save_img, int((w1-h1)/2), int((w1-h1)/2),
                                                  0, 0, cv2.BORDER_CONSTANT, (0, 0, 0))
                elif h1 > w1:
                    save_img = cv2.copyMakeBorder(save_img, 0, 0, int((h1-w1)/2),
                                                  int((h1-w1)/2), cv2.BORDER_CONSTANT, (0, 0, 0))
                save_img = cv2.resize(save_img, (image_x, image_y))
                rand = random.randint(0, 10)
                if rand % 2 == 0:
                    save_img = cv2.flip(save_img, 1)
                cv2.putText(img, "Capturing...", (30, 60), cv2.FONT_HERSHEY_TRIPLEX, 2,
                            (127, 255, 255))
                cv2.imwrite("gestures/" + str(g_id) + "/" + str(pic_no) + ".jpg", save_img)
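        # --- minimal loop-closing lines (the excerpt cuts off above): show the live
        # feed and the thresholded hand, toggle capture with 'c', and stop once the
        # requested number of images has been saved ---
        cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 2)
        cv2.imshow("Capturing gesture", img)
        cv2.imshow("thresh", thresh)
        keypress = cv2.waitKey(1)
        if keypress == ord('c'):
            flag_start_capturing = not flag_start_capturing
            frames = 0
        if flag_start_capturing:
            frames += 1
        if pic_no == total_pics:
            break
    cam.release()
    cv2.destroyAllWindows()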
init_create_folder_database()
g_id = input("Enter gesture no.: ")
g_name = input("Enter gesture name/text: ")
store_in_db(g_id, g_name)
store_images(g_id)
Module: display_gestures.py
import cv2, os, random
import numpy as np
def get_image_size():
    img = cv2.imread('gestures/0/100.jpg', 0)
    return img.shape

gestures = os.listdir('gestures/')
gestures.sort(key=int)
begin_index = 0
end_index = 5
image_x, image_y = get_image_size()

# arrange one random thumbnail per gesture in a grid of 5 columns
if len(gestures) % 5 != 0:
    rows = int(len(gestures)/5)+1
else:
    rows = int(len(gestures)/5)

full_img = None
for i in range(rows):
    col_img = None
    for j in range(begin_index, end_index):
        img_path = "gestures/%s/%d.jpg" % (j, random.randint(1, 1200))
        img = cv2.imread(img_path, 0)
        if img is None:
            img = np.zeros((image_y, image_x), dtype=np.uint8)
        if col_img is None:
            col_img = img
        else:
            col_img = np.hstack((col_img, img))
    begin_index += 5
    end_index += 5
    if full_img is None:
        full_img = col_img
    else:
        full_img = np.vstack((full_img, col_img))
cv2.imshow("gestures", full_img)
cv2.imwrite('full_img.jpg', full_img)
cv2.waitKey(0)
Module: load_images.py
import cv2
from glob import glob
import numpy as np
import random
from sklearn.utils import shuffle
import pickle
import os
def pickle_images_labels():
    images_labels = []
    images = glob("gestures/*/*.jpg")
    images.sort()
    for image in images:
        print(image)
        # the folder name between the path separators is the class label
        label = image[image.find(os.sep)+1: image.rfind(os.sep)]
        img = cv2.imread(image, 0)
        images_labels.append((np.array(img, dtype=np.uint8), int(label)))
    return images_labels

images_labels = pickle_images_labels()
images_labels = shuffle(shuffle(shuffle(shuffle(images_labels))))
images, labels = zip(*images_labels)
print("Length of images_labels", len(images_labels))

# 5/6 of the data for training, 1/12 for testing, 1/12 for validation
train_images = images[:int(5/6*len(images))]
print("Length of train_images", len(train_images))
with open("train_images", "wb") as f:
    pickle.dump(train_images, f)
del train_images

train_labels = labels[:int(5/6*len(labels))]
print("Length of train_labels", len(train_labels))
with open("train_labels", "wb") as f:
    pickle.dump(train_labels, f)
del train_labels

test_images = images[int(5/6*len(images)):int(11/12*len(images))]
print("Length of test_images", len(test_images))
with open("test_images", "wb") as f:
    pickle.dump(test_images, f)
del test_images

test_labels = labels[int(5/6*len(labels)):int(11/12*len(labels))]
print("Length of test_labels", len(test_labels))
with open("test_labels", "wb") as f:
    pickle.dump(test_labels, f)
del test_labels

val_images = images[int(11/12*len(images)):]
print("Length of val_images", len(val_images))
with open("val_images", "wb") as f:
    pickle.dump(val_images, f)
del val_images

val_labels = labels[int(11/12*len(labels)):]
print("Length of val_labels", len(val_labels))
with open("val_labels", "wb") as f:
    pickle.dump(val_labels, f)
del val_labels
Module: Rotate_imag.py
import cv2, os
def flip_images():
    # create a horizontally flipped copy of every captured gesture image
    gest_folder = "gestures"
    for g_id in os.listdir(gest_folder):
        for i in range(1200):
            path = gest_folder + "/" + g_id + "/" + str(i+1) + ".jpg"
            new_path = gest_folder + "/" + g_id + "/" + str(i+1+1200) + ".jpg"
            print(path)
            img = cv2.imread(path, 0)
            img = cv2.flip(img, 1)
            cv2.imwrite(new_path, img)
flip_images()
Module: Set_hand_histogram.py
import cv2
import numpy as np
import pickle
def build_squares(img):
    # draw a grid of small squares and return the stacked pixels inside them;
    # the user places their hand over the grid so a skin-colour sample can be taken
    x, y, w, h = 420, 140, 10, 10
    d = 10
    imgCrop = None
    crop = None
    for i in range(10):
        for j in range(5):
            if imgCrop is None:
                imgCrop = img[y:y+h, x:x+w]
            else:
                imgCrop = np.hstack((imgCrop, img[y:y+h, x:x+w]))
            #print(imgCrop.shape)
            cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 1)
            x += w + d
        if crop is None:
            crop = imgCrop
        else:
            crop = np.vstack((crop, imgCrop))
        imgCrop = None
        x = 420
        y += h + d
    return crop

def get_hand_hist():
    cam = cv2.VideoCapture(1)
    if cam.read()[0] == False:
        cam = cv2.VideoCapture(0)
    x, y, w, h = 300, 100, 300, 300
    flagPressedC, flagPressedS = False, False
    imgCrop = None
    while True:
        img = cam.read()[1]
        img = cv2.flip(img, 1)
        img = cv2.resize(img, (640, 480))
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        keypress = cv2.waitKey(1)
        if keypress == ord('c'):
            # 'c': compute the hue/saturation histogram from the sampled squares
            hsvCrop = cv2.cvtColor(imgCrop, cv2.COLOR_BGR2HSV)
            flagPressedC = True
            hist = cv2.calcHist([hsvCrop], [0, 1], None, [180, 256], [0, 180, 0, 256])
            cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
        elif keypress == ord('s'):
            # 's': save the histogram and stop
            flagPressedS = True
            break
        if flagPressedC:
            # back-project the histogram to preview the thresholded hand region
            dst = cv2.calcBackProject([hsv], [0, 1], hist, [0, 180, 0, 256], 1)
            dst1 = dst.copy()
            disc = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (10, 10))
            cv2.filter2D(dst, -1, disc, dst)
            blur = cv2.GaussianBlur(dst, (11, 11), 0)
            blur = cv2.medianBlur(blur, 15)
            ret, thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
            thresh = cv2.merge((thresh, thresh, thresh))
            #cv2.imshow("res", res)
            cv2.imshow("Thresh", thresh)
        if not flagPressedS:
            imgCrop = build_squares(img)
        #cv2.rectangle(img, (x,y), (x+w, y+h), (0,255,0), 2)
        cv2.imshow("Set hand histogram", img)
    cam.release()
    cv2.destroyAllWindows()
    with open("hist", "wb") as f:
        pickle.dump(hist, f)
get_hand_hist()