Final Review Report
On
“Gesture Recognition Using ML”
Submitted By
S.No Admission No. Student Name Degree/Branch Semester
CANDIDATE’S DECLARATION
We hereby certify that the work presented in this project, entitled “Gesture Recognition
Using ML”, in partial fulfillment of the requirements for the award of the B.Tech-CSE, submitted in the
School of Computing Science and Engineering of Galgotias University, Greater Noida, is an original
work carried out over a period of 6 months in 2022, under the supervision of Mr. Gautam Kumar (Assistant
Professor), Department of Computer Science and Engineering / Computer Applications and Information
Science, School of Computing Science and Engineering, Galgotias University, Greater Noida.
The matter presented in this project has not been submitted by us for the award of
any other degree of this or any other institution.
Priyabrath Tripathi (20SCSE1010267)
Prakhar Tripathi (20SCSE1010045)
This is to certify that the above statement made by the candidates is correct to the
best of my knowledge.
Mr. Gautam Kumar
Assistant Professor
CERTIFICATE
Date: December 2023
List of Tables
List of Figures
Table of Contents
Abstract
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Problem Statement
1.2 Tool and Technology Used
1.2.1 Data Collection
1.2.2 Image Processing
1.2.3 Pattern recognition
1.2.4 Tools Used
1.3 Challenges in Gesture Recognition
1.4 Types of Approaches
1.5 Hand gesture recognition application domain
Chapter 2 Literature Survey/Project Design
2.1 Existing Literature
2.2 Project Requirements
2.2.1 Domain Analysis
2.3 System Design
2.3.1 UML diagram
2.3.2 DFD Diagram
2.3.3 Flowchart Diagram
2.3.4 ER-Diagram
Chapter 3 Module Description
3.1 Data Collection
3.2 Data Processing
CHAPTER-1
INTRODUCTION
Regrettably, in today's rapidly changing culture, people with speech and hearing
impairments are frequently neglected and excluded. Normal people face difficulty
in understanding their language. Hence, there is a need for a system that
recognizes the different signs and gestures and conveys the information to normal
people. It bridges the gap between physically challenged people and normal
people.
It will be a fantastic tool for persons with hearing impairments to convey their
thoughts, as well as a great way for non-sign-language users to grasp what they are
saying. Many countries have their own sets of sign motions and interpretations.
An alphabet in Korean sign language, for example, will not be the same as an
alphabet in Indian sign language. While this emphasizes the diversity of sign
languages, it also emphasizes their complexity. A deep learning model must be
trained on a wide variety of gestures in order to achieve a reasonable level of accuracy.
Hence, a system that recognizes gestures and conveys their meaning to normal
people is required. It connects persons who are physically handicapped with others
who are not.
• Output: the result can be an altered image or a report based on image
analysis.
The prerequisite software and libraries for the sign language project are:
Python
IDE (Visual Studio Code)
NumPy
OpenCV (cv2)
Keras
TensorFlow
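A quick, minimal sketch for checking that the required libraries are available in the environment (the import and package names below are the standard ones and are an assumption about the exact setup):

import numpy
import cv2
import keras
import tensorflow

# If any import fails, the package can be installed with pip,
# e.g. pip install numpy opencv-python keras tensorflow
print(numpy.__version__, cv2.__version__, tensorflow.__version__)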
Hardware tools required:
Monitor
Keyboard and Mouse
Digital Camera (Webcam)
2. Robotics and tele-robotics—Actuators and the motions of robotic arms, legs, and other parts
can be controlled by simulating a human’s actions.
3. Games and virtual reality—Virtual reality enables realistic interaction between the user and the
virtual environment. It simulates the user's movements and translates them into the 3D world.
CHAPTER-2
LITERATURE SURVEY / PROJECT DESIGN
A survey of the literature for our proposed system reveals that many
attempts have been made to solve sign identification in videos and photos using
various methodologies and algorithms.
In [5], Pigou used CLAP14 as his dataset [6]. It consists of 20 Italian sign
gestures. After preprocessing the images, he used a convolutional neural network
model with 6 layers for training. It is to be noted that his model is not a 3D CNN;
all the kernels are 2D. He used Rectified Linear Units (ReLU) as activation
functions. Feature extraction is performed by the CNN, while classification uses
an ANN (fully connected) layer. His work achieved an accuracy of 91.70% with
an error rate of 8.30%.
2.2 Project Requirements
2.2.1 Domain Analysis
The domain analysis that we did for the project mainly involved
understanding convolutional neural networks (CNNs).
The first (or bottom) layer of the CNN usually detects basic features such as
horizontal, vertical, and diagonal edges. The output of the first layer is fed as input
to the next layer, which extracts more complex features, such as corners and
combinations of edges. As you move deeper into the convolutional neural network,
the layers start detecting higher-level features such as objects, faces, and more.
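As an illustration of what such a first-layer filter does, the following minimal sketch applies a hand-crafted vertical-edge kernel to an image with OpenCV; the input file name is hypothetical, and a trained CNN would learn comparable kernels automatically:

import cv2
import numpy as np

# A hand-crafted vertical-edge kernel, similar in spirit to the filters
# a first convolutional layer learns during training.
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=np.float32)

img = cv2.imread("gesture.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
edges = cv2.filter2D(img, cv2.CV_32F, kernel)            # convolve the kernel over the image
cv2.imwrite("edges.jpg", cv2.convertScaleAbs(edges))     # visualize the detected edges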
Convolutional Layer- The convolutional layer is the core building block of a
CNN, and it is where the majority of computation occurs. It requires a few
components, which are input data, a filter, and a feature map. Let’s assume that
the input will be a color image, which is made up of a matrix of pixels in 3D. This
means that the input will have three dimensions—a height, width, and depth—
which correspond to RGB in an image. We also have a feature detector, also known
as a kernel or a filter, which will move across the receptive fields of the image,
checking if the feature is present. This process is known as a convolution.
Fully-Connected Layer- This layer performs the task of classification based on the features extracted
through the previous layers and their different filters. While convolutional and pooling
layers tend to use ReLU functions, FC layers usually leverage a softmax activation
function to classify inputs appropriately, producing a probability from 0 to 1.
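A minimal Keras sketch of a small CNN in the spirit described here and in Chapter 3 (the layer sizes, the 64x64 single-channel input, and the number of gesture classes are illustrative assumptions, not the exact trained model):

import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 6  # assumption: number of gesture classes in the dataset

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu",
                  input_shape=(64, 64, 1)),          # 64x64 binary gesture image
    layers.MaxPooling2D((2, 2)),                     # pooling
    layers.Conv2D(64, (3, 3), activation="relu"),    # convolution + ReLU
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax"), # probabilities from 0 to 1
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()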
2.3 System Design
Use Case Diagram - Use case diagrams are used during requirement elicitation and analysis to
represent the functionality of the system. A use case describes a function provided by
the system that yields a visible result for an actor. The identification of
actors and use cases results in the definition of the boundary of the system,
i.e., differentiating the tasks accomplished by the system from the tasks
accomplished by its environment. The actors are outside the boundary of the
system, whereas the use cases are inside the boundary of the system. A use
case describes the behavior of the system as seen from the actor’s point of
view. It describes the function provided by the system as a set of events that
yield a visible result for the actor.
Sequence Diagram - A sequence diagram displays the time sequence of the
objects participating in the interaction. It consists of a vertical
dimension (time) and a horizontal dimension (the different objects).
Objects: An object can be viewed as an entity at a particular point in time with
a specific value and as a holder of identity. A sequence diagram shows object
interactions arranged in time sequence. It depicts the objects and classes
involved in the scenario and the sequence of messages exchanged between the
objects needed to carry out the functionality of the scenario. Sequence
diagrams are typically associated with use case realizations in the Logical
View of the system under development. Sequence diagrams are sometimes
called event diagrams or event scenarios.
2.3.2 DATA FLOW DIAGRAM (DFD)
The DFD is also known as a bubble chart. It is a simple graphical formalism
that can be used to represent a system in terms of the input data to the system, the various
processing carried out on these data, and the output data generated by the system.
It maps out the flow of information for any process or system: how data is
processed in terms of inputs and outputs. It uses defined symbols like rectangles,
circles, and arrows to show data inputs, outputs, storage points, and the routes
between each destination. DFDs can be used to analyze an existing system or to
model a new one.
A DFD can often visually “say” things that would be hard to explain in words,
and it works for both technical and non-technical audiences. There are four
components in a DFD:
1. External Entity
2. Process
3. Data Flow
4. Data Store
2.3.3 Flowchart
A flowchart is a sort of diagram that depicts an algorithm or process by
showing the stages as various types of boxes and linking them with arrows to
illustrate their sequence. The boxes represent the individual operations, and the arrows
indicate the order in which they are performed. Flowcharts are used in a variety
of areas to analyze, develop, record, and manage a process or programme. In a
flowchart, the two most frequent sorts of boxes are processing steps and decisions.
Fig-5 Flowchart
2.3.4 Entity Relationship Diagram
An Entity Relationship Diagram (ERD) is a diagram that shows the
relationships between the entity sets recorded in a database. In other words, ER
diagrams help explain the logical structure of databases. Entities, attributes,
and relationships are the three core ideas that ER diagrams are built on.
An ER diagram appears quite similar to a flowchart at first glance. The ER
diagram, however, has numerous specific symbols, and the meanings of
these symbols distinguish this model. The ER diagram represents the entity
framework architecture.
Fig-6 ER-Diagram
CHAPTER-3
MODULE DESCRIPTION
Data Collection: Images are captured with a web camera (a minimal capture sketch follows this list).
Image Processing: Backgrounds are detected and eliminated with HSV, then
morphological operations are performed and masks are applied.
The hand gesture is segmented, after which the image is resized.
Feature Extraction: Binary pixels.
Classification: Using a CNN with 3 layers.
Evaluation: The precision, recall, and F-measure for each class are
determined.
Prediction: The system predicts the input gesture of the user and displays the result.
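A minimal sketch of the data-collection step, capturing frames from the web camera with OpenCV (the device index 0 and the quit key are assumptions):

import cv2

cap = cv2.VideoCapture(0)                    # open the default webcam
while cap.isOpened():
    ret, frame = cap.read()                  # grab one frame
    if not ret:
        break
    cv2.imshow("Capture", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):    # press 'q' to stop capturing
        break
cap.release()
cv2.destroyAllWindows()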
Because the photos are in the RGB color space, segmenting the hand motion only on
the basis of skin color becomes more challenging. As a result, we convert the
photos to the HSV color space. It is a model that divides an image's color into three
components: hue, saturation, and value. HSV is a useful technique for improving
image stability by separating brightness from chromaticity. Because the hue
component is largely unaffected by light, shadows, or shading, it may be used to
remove backgrounds. To detect the hand motion and set the backdrop to black, a
track-bar with H values ranging from 0 to 179, S values ranging from 0 to 255, and V
values ranging from 0 to 255 is utilized. With an elliptical kernel, dilation and erosion
operations are performed on the hand gesture region.
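A minimal OpenCV sketch of this masking step, assuming an already captured frame and hypothetical HSV bounds (in the project the bounds are tuned interactively with track-bars):

import cv2
import numpy as np

frame = cv2.imread("frame.jpg")                    # a frame captured from the webcam (hypothetical file)
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)       # separate chromaticity from brightness

lower = np.array([0, 40, 60])                      # hypothetical lower H, S, V bounds
upper = np.array([20, 255, 255])                   # hypothetical upper H, S, V bounds
mask = cv2.inRange(hsv, lower, upper)              # skin-colored pixels -> white, rest -> black

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))   # elliptical kernel
mask = cv2.dilate(mask, kernel, iterations=2)
mask = cv2.erode(mask, kernel, iterations=2)

hand = cv2.bitwise_and(frame, frame, mask=mask)    # backdrop set to black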
Fig-8 (a) Image captured from web-camera. (b) Image after background is set to black using HSV (first image).
B. Segmentation
After that, the first image is converted to grayscale. While this technique may
result in a loss of color in the skin gesture region, it also improves our system's
resilience to changes in lighting or illumination. The converted image's non-black
pixels are binarized, while the others are left intact, resulting in a black background.
The hand gesture is segmented in two steps: first, by finding all of the image's
connected components, and then by keeping only the component that actually
corresponds to the hand gesture. The frame is then resized to 64 by 64 pixels. After
the segmentation process, binary pictures of 64 by 64 pixels are created, with the
white region representing the hand gesture and the black area representing the
remainder.
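A minimal sketch of this segmentation step, continuing from the masked image in the previous sketch (keeping the largest connected component is an assumption about how the relevant region is selected):

import cv2
import numpy as np

gray = cv2.cvtColor(hand, cv2.COLOR_BGR2GRAY)               # 'hand' from the masking sketch
_, binary = cv2.threshold(gray, 1, 255, cv2.THRESH_BINARY)  # non-black pixels -> white

num, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
if num > 1:
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])    # skip label 0 (background)
    binary = np.where(labels == largest, 255, 0).astype(np.uint8)

segmented = cv2.resize(binary, (64, 64))                    # final 64x64 binary gesture image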
Fig-9 (a) Image after binarization. (b) Image after segmentation and resizing.
C. Feature Extraction
The ability to identify and extract relevant elements from a picture is one of the
most significant aspects of image processing. Images, when collected and saved as
a dataset, typically take up a lot of space since they contain a lot of data. Feature
extraction helps us solve this challenge by automatically reducing the data
once the key characteristics have been extracted. It also helps to preserve the
classifier's accuracy while simultaneously reducing its complexity. The binary pixels
of the photographs were determined to be the critical features in our scenario. We were
able to gather enough characteristics by scaling the photos to 64 by 64 pixels to properly
categorize the sign language motions.
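A minimal sketch of collecting these binary-pixel features into training arrays (the directory layout and label names are assumptions made purely for illustration):

import os
import cv2
import numpy as np

def load_dataset(root="dataset"):
    # Assumed layout: dataset/<gesture_label>/<image>.png
    features, labels = [], []
    for label in sorted(os.listdir(root)):
        for name in os.listdir(os.path.join(root, label)):
            img = cv2.imread(os.path.join(root, label, name), cv2.IMREAD_GRAYSCALE)
            img = cv2.resize(img, (64, 64))           # ensure a 64x64 frame
            img = (img > 0).astype(np.float32)        # binary pixels as features
            features.append(img.reshape(64, 64, 1))
            labels.append(label)
    return np.array(features), np.array(labels)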
3.3 Classification
Machine learning algorithms for classification can be categorized as
supervised or unsupervised. Supervised machine learning is a method of teaching a
computer to detect patterns in input data that can subsequently be used to predict
future data. Supervised machine learning uses a collection of labelled training data
to infer a function that maps inputs to outputs. Unsupervised machine
learning is used to draw conclusions from datasets that have no labelled responses.
There is no reward or penalty weighting for the classes the data is intended
to belong to, because no labelled response is supplied to the classifier.
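A minimal sketch of the supervised training and evaluation steps (precision, recall, and F-measure per class), reusing the load_dataset helper and the CNN model sketched earlier; scikit-learn is used here for the split and the report, which is an assumption since it is not listed among the prerequisites, and the split ratio and epoch count are illustrative:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tensorflow.keras.utils import to_categorical

X, y = load_dataset()                                   # 64x64 binary images and string labels
classes = sorted(set(y))
y_idx = np.array([classes.index(c) for c in y])

X_train, X_test, y_train, y_test = train_test_split(
    X, y_idx, test_size=0.2, stratify=y_idx)

# Supervised learning: the labelled gestures drive the weight updates.
model.fit(X_train, to_categorical(y_train, len(classes)), epochs=10, batch_size=32)

pred = np.argmax(model.predict(X_test), axis=1)
print(classification_report(y_test, pred, target_names=classes))   # precision, recall, F-measure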
CHAPTER-4
CODE
1. Generated Gesture
# Utility functions for converting Pascal-VOC .xml gesture annotations into a
# Pandas DataFrame and TensorFlow examples.
import os
import glob
import io
import argparse                       # used by the command-line entry point, not shown here
import xml.etree.ElementTree as ET
from collections import namedtuple

import pandas as pd

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'   # Suppress TensorFlow logging
import tensorflow.compat.v1 as tf
from PIL import Image


def xml_to_csv(path):
    """Collect the .xml annotations found in `path` into a single DataFrame.

    path : str
        The path containing the .xml files.
    Returns
    -------
    Pandas DataFrame
        The produced dataframe.
    """
    xml_list = []
    for xml_file in glob.glob(path + '/*.xml'):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        for member in root.findall('object'):
            value = (root.find('filename').text,
                     int(root.find('size')[0].text),    # width
                     int(root.find('size')[1].text),    # height
                     member[0].text,                     # class label
                     int(member[4][0].text),             # xmin
                     int(member[4][1].text),             # ymin
                     int(member[4][2].text),             # xmax
                     int(member[4][3].text))             # ymax
            xml_list.append(value)
    column_name = ['filename', 'width', 'height',
                   'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df


def class_text_to_int(row_label):
    # Map a gesture label to its integer id (label-map details omitted in the report).
    return None


def split(df, group):
    data = namedtuple('data', ['filename', 'object'])
    gb = df.groupby(group)
    return [data(filename, gb.get_group(x))
            for filename, x in zip(gb.groups.keys(), gb.groups)]


def create_tf_example(group, path):
    with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size
    filename = group.filename.encode('utf8')

    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    # ... (the remainder of this function, which assembles the tf.train.Example,
    # is not reproduced in the report)
Many breakthroughs have been made in the fields of artificial intelligence, machine
learning, and computer vision. They have immensely contributed to how we
perceive things around us and improved the way in which we apply their techniques
in our everyday lives. Much research has been conducted on sign gesture
recognition using different techniques like ANN, LSTM, and 3D CNN. However,
most of them require extra computing power. On the other hand, our approach
requires low computing power and gives a remarkable accuracy of above 90%. In
our research, we proposed to normalize and rescale our images to 64 pixels in order
to extract features (binary pixels) and make the system more robust.
The movements, body language, and facial expressions used in sign languages
vary greatly from nation to nation. The syntax and structure of a sentence might
also differ significantly. Learning and recording motions was a difficult task in our
study, since hand movements had to be accurate and on point. Certain movements
are difficult to duplicate, and keeping our hands in the same position while
compiling our dataset was difficult.
We hope to expand our datasets with other alphabets and refine the model so that it
can recognize more alphabetical characteristics while maintaining high accuracy.
We would also like to improve the system further by including voice recognition so that
blind individuals may benefit as well.
Models trained in this way can respond to human gestures accurately. In conclusion, this
technology has demonstrated significant advancements in real-time gesture analysis,
enhancing human-computer interaction and fostering accessibility in numerous fields.
One of the key strengths of machine learning-based gesture recognition lies in its
ability to continuously improve accuracy and efficiency with more extensive
datasets. Further enhancements could involve refining existing models through
increased data diversity, encompassing various gestures across different
demographics and cultural backgrounds. Additionally, integrating multimodal
approaches by combining gesture recognition with other sensory inputs like voice
or facial expressions could enhance the overall system performance.
REFERENCES
[1] https://peda.net/id/08f8c4a8511
[2] K. Bantupalli and Y. Xie, "American Sign Language Recognition using Deep Learning
and Computer Vision," 2018 IEEE International Conference on Big Data (Big Data),
Seattle, WA, USA, 2018, pp. 4896-4899, doi:10.1109/BigData.2018.8622141.
[3] M. Geetha and U. C. Manjusha, “A Vision Based Recognition of Indian Sign Language
Alphabets and Numerals Using B-Spline Approximation”, International Journal on Computer
Science and Engineering (IJCSE), vol. 4, no. 3, pp. 406-415, 2012.
[4] S. He, "Research of a Sign Language Translation System Based on
Deep Learning," 2019, pp. 392-396, doi:10.1109/AIAM48774.2019.00083.
[5] Pigou L., Dieleman S., Kindermans PJ., Schrauwen B. (2015) Sign Language
Recognition Using Convolutional Neural Networks. In: Agapito L., Bronstein
M., Rother C. (eds) Computer Vision - ECCV 2014 Workshops. ECCV 2014. Lecture Notes in
Computer Science, vol 8925. Springer, Cham.
[6] Escalera, S., Baró, X., Gonzàlez, J., Bautista, M., Madadi, M., Reyes, M., . . . Guyon,
I. (2014). ChaLearn Looking at People Challenge 2014: Dataset and
Results. Workshop at the European Conference on Computer Vision (pp. 459-473). Springer,
Cham.