
A Project Final Report

On

Gesture Recognition Using ML


Submitted in partial fulfillment of the requirement
for the award of the degree of
B.TECH (CSE)

Under The Supervision of


Mr. Gautam Kumar
(Assistant Professor)

Submitted By
S.No  Admission No.    Student Name         Degree/Branch  Semester
1.    20SCSE1010267    Priyabrath Tripathi  B.TECH (CSE)   VII
2.    20SCSE1010082    Prakhar Tripathi     B.TECH (CSE)   VII

SCHOOL OF COMPUTING SCIENCE AND ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
GALGOTIAS UNIVERSITY, GREATER NOIDA, INDIA
December 2023
SCHOOL OF COMPUTING SCIENCE AND
ENGINEERING
GALGOTIAS UNIVERSITY, GREATER NOIDA

CANDIDATE’S DECLARATION
We hereby certify that the work presented in the project entitled "Gesture Recognition
Using ML", submitted in partial fulfillment of the requirements for the award of the degree of
B.Tech (CSE) in the School of Computing Science and Engineering of Galgotias University, Greater Noida,
is original work carried out over a period of six months under the supervision of Mr. Gautam Kumar
(Assistant Professor), Department of Computer Science and Engineering / Computer Applications and
Information Science, School of Computing Science and Engineering, Galgotias University, Greater Noida.

The matter presented in this project has not been submitted by us for the award of
any other degree at this or any other institution.
Priyabrath Tripathi (20SCSE1010267)
Prakhar Tripathi (20SCSE1010082)

This is to certify that the above statement made by the candidates is correct to the
best of my knowledge.
Mr.Gautam Kumar

Assistant Professor
CERTIFICATE

The Final Thesis/Project/Dissertation Viva-Voce examination of Priyabrath Tripathi
(20SCSE1010267) and Prakhar Tripathi (20SCSE1010082) has been held on 30/11/2023, and
their work is recommended for the award of the degree of B.Tech.

Signature of Examiner(s) Signature of Supervisor(s)

Signature of Project Coordinator Signature of Dean

Date: December, 2023

Place: Greater Noida


Abstract
In this project, an Indian Sign Language (ISL) recognition system has been developed using
Python. This work was taken up keeping in mind the difficulties faced by differently abled
people, such as those who cannot speak or those who cannot hear. The code has been written
in Python and trained using various modules such as TensorFlow, Keras, the operating
system module (os), OpenCV (cv2), NumPy, and various preprocessors. The training has been
carried out using a self-made database of Indian Sign Language signs, such as the digits
0-9, as well as an online dataset from GitHub to further improve accuracy. The results
obtained were compiled in Anaconda 3.0 and then finally tested. This system can help
differently abled people communicate better with those around them. It can be very useful
for speech- and hearing-impaired people in communicating with others, since sign language
is not known to everyone; in addition, it can be extended to building automatic editors,
where a person can easily write using only hand gestures.

List of Tables

• Table of Students' Data

S. No.  Name                 Admission Number  Contact Number  Email ID
1.      Prakhar Tripathi     20SCSE1010082     8887879104      [email protected]
2.      Priyabrath Tripathi  20SCSE1010267     9598922535      prakhar.[email protected]

• Table of Faculty Data

S. No.  Name              Contact Number  Email ID
1.      Mr. Gautam Kumar  8126707228      [email protected]

List of Figures

Figure No.  Figure Name                                                               Page No.
1.          Introduction (Convolutional Neural Network CNN)                           1-8
2.          UML: Use Case Diagram                                                     9
3.          UML: Sequence Diagram                                                     10
4.          DFD                                                                       11
5.          Flowchart                                                                 12
6.          ER-Diagram                                                                13
7.          Module Description                                                        14
8.          (a) Image captured from web-camera. (b) Image after background            15
            is set to black using HSV (first image).
9.          (a) Image after binarization. (b) Image after segmentation and resizing   16
10.         Code (1. Generated Gesture, 2. Jupyter Notes)                             18-21
11.         Conclusion                                                                22-23

Table of Contents

Title                                                  Page No.
Abstract I
List of Tables II
List of Figures III
Chapter 1 Introduction 1
1.1 Problem Statement
1.2 Tool and Technology Used
1.2.1 Data Collection
1.2.2 Image Processing
1.2.3 Pattern recognition
1.2.4 Tools Used
1.3 Challenges in Gesture Recognition
1.4 Types of Approaches
1.5 Hand gesture recognition application domain
Chapter 2 Literature Survey/Project Design
2.1 Existing Literature
2.2 Project Requirements
2.2.1 Domain Analysis
2.3 System Design
2.3.1 UML diagram
2.3.2 DFD Diagram
2.3.3 Flowchart Diagram
2.3.4 ER-Diagram
Chapter 3 Module Description
3.1 Data Collection
3.2 Data Processing
3.3 Classification
Chapter 4 Code
Chapter 5 Conclusion and Future Enhancement
References
CHAPTER-1

INTRODUCTION

"Talk to a man in a language he knows, that goes to his head," Nelson


Mandela said. Talk to him in his own language; it will reach his heart." Language
is undeniably important in human contact, and it has existed since the dawn of
civilization. It is a medium through which individuals communicate in order to
express themselves and comprehend real-world concepts. No books, no cell
phones, and certainly no word I'm writing would be meaningful without it. It's firmly
ingrained. We often take it for granted in our daily lives and fail to recognize its
significance.

Regrettably, in today's rapidly changing society, people with speech and hearing
impairments are frequently neglected and excluded. People without such impairments often
find it difficult to understand sign language. Hence, there is a need for a system that
recognizes the different signs and gestures and conveys the information to them. Such a
system bridges the gap between physically challenged people and others.

It would be a fantastic tool for persons with hearing impairments to convey their
thoughts, as well as a great way for non-signers to grasp what they are saying. Many
countries have their own sets of sign gestures and interpretations. An alphabet in Korean
sign language, for example, will not be the same as an alphabet in Indian sign language.
While this emphasizes the diversity of sign languages, it also emphasizes their
complexity. A deep learning model must therefore be trained on a wide variety of gestures
in order to achieve a reasonable level of accuracy.

1.1 Problem Statement


Hand signals and gestures are used by those who are unable to speak.
Ordinary people have trouble understanding this language. As a result, a
system that identifies the various signs and gestures and relays the information to
ordinary people is required. It connects persons who are physically challenged with others
who are not.

1.2 Tools & Technology Used

1.2.1 Data Collection


The various gestures and hand movements will be collected using a digital camera.

1.2.2 Image Processing


Image processing is a method of performing operations on an image in order to obtain an
enhanced image or to extract useful information from it. It is a type of signal processing
in which the input is an image and the output may be an image or the
characteristics/features associated with that image. Nowadays, image processing is among
the most rapidly growing technologies, and it forms a core research area within the
engineering and computer science disciplines.
Image processing basically includes the following three steps (a minimal code sketch follows the list):
• Importing the image via image acquisition tools.

• Analyzing and manipulating the image.

• Output, in which the result can be an altered image or a report based on the image
analysis.
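
A minimal OpenCV sketch of these three steps is shown below; the file names are hypothetical and serve only to illustrate the flow.

import cv2

# 1. Import the image via an image acquisition tool (here, read from disk)
image = cv2.imread('sample_gesture.jpg')          # hypothetical file name

# 2. Analyze and manipulate the image (convert to grayscale and blur)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)

# 3. Output: an altered image (saved to disk) or a report based on the analysis
cv2.imwrite('processed_gesture.jpg', blurred)
print('Mean intensity of processed image:', blurred.mean())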

1.2.3 Pattern recognition:


Building on image processing, pattern recognition is then used to separate objects
from the images, and to identify and classify those objects using techniques from
statistical decision theory.

1.2.4 Tools used

The prerequisite software and libraries for the sign language project are:

Python
IDE (Visual Studio Code)
NumPy
OpenCV (cv2)
Keras
TensorFlow

The hardware tools required are:

Monitor
Keyboard and Mouse
Digital Camera (Webcam)

1.3 Challenges in gesture recognition


Motion modelling, motion analysis, pattern recognition, and machine learning are all
used in gesture recognition, which includes both manual and non-manual parameter approaches.
The ability to predict gestures is influenced by the structure of the environment, such as
background lighting and movement speed. In 2D space, the same gesture appears different when
viewed from different angles. In certain studies, the signer wears a wrist band or a colored
glove to help with the hand segmentation procedure; wearing colored gloves, for example,
reduces the complexity of the segmentation process. Temporal variation, spatial complexity,
movement epenthesis, repeatability, and connection, as well as properties such as change of
orientation and the region in which the gesture is carried out, are all expected issues in
dynamic gesture identification. A gesture recognition system's performance in overcoming
these difficulties can be measured using a variety of metrics, commonly categorized as
scalability, robustness, real-time performance, and user independence.
1.4 Type of approaches
Recognition of hand gestures can be achieved using either vision-based or sensor-based
approaches.

Vision-based approaches require the acquisition of images or video of the hand gestures
through a camera:
1. Single camera—webcam, video camera or smartphone camera.
2. Stereo camera—multiple monocular cameras used to provide depth information.
3. Active techniques—use the projection of structured light; such devices include the Kinect and the Leap Motion Controller (LMC).
4. Invasive techniques—body markers such as colored gloves, wrist bands, and LED lights.

Sensor-based approaches require the use of sensors or instruments to capture the motion,
position, and velocity of the hand:
1. Inertial measurement unit (IMU)—measures the acceleration, position, and degrees of freedom of the fingers; this includes the use of gyroscopes and accelerometers.
2. Electromyography (EMG)—measures the electrical pulses of human muscles and harnesses the bio-signal to detect finger movements.
3. Wi-Fi and radar—use radio waves, broad-beam radar or spectrograms to detect in-air signal strength changes.
4. Others—utilize flex sensors and ultrasonic, mechanical, electromagnetic and haptic technologies.
1.5 Hand gesture recognition application domain
The ability of a computer or machine to understand hand gestures is the key to unlocking
numerous potential applications. Potential application domains of a gesture recognition system
are as follows:

1. Sign language recognition—Communication medium for the deaf. It consists of several


categories namely fingerspelling, isolated words, lexicon of words, and continuous signs.

2. Robotics and Tele-robotic—Actuators and motions of the robotic arms, legs and other parts
can be moved by simulating a human’s action.

3. Games and virtual reality—Virtual reality enables realistic interaction between the user and the
virtual environment. It captures the movements of users and translates them into the 3D world.

4. Human–computer interaction (HCI)—Includes application of gesture control in military,


medical field, manipulating graphics, design tools, annotating or editing documents.

CHAPTER-2
Literature survey/ Project Design

2.1 Existing Literature

A survey of the literature for our proposed system reveals that many
attempts have been made to solve sign identification in videos and photos using
various methodologies and algorithms.

Kshitij Bantupalli et al. proposed a system to help communication between non-signers
and sign language users by creating a vision-based application that converts sign
language into text, using a CNN (Convolutional Neural Network) to extract spatial
features and an RNN (Recurrent Neural Network) to train on the dataset. The drawback
faced by this model is that accuracy dropped when different skin tones were used to
build and evaluate the data model. The model also performed poorly when there was
variation in clothing.

A similar model was proposed by Siming He, using a 40-word dataset and
10,000 sign language images. A Faster R-CNN with an integrated RPN module is
utilized to locate the hand regions in the video frames, which improves performance in
terms of accuracy. Compared to single-stage target detection algorithms
like YOLO, detection and template classification can be done at a faster rate.
Compared to Fast R-CNN, the detection accuracy of Faster R-CNN improves from
89.0 percent to 91.7 percent in the paper. For the sign language image sequences, a 3D
CNN is employed for feature extraction, and a sign language recognition
framework comprising long short-term memory (LSTM) encoding and
decoding networks is created.

The paper by M. Geetha and U. C. Manjusha makes use of 50 specimens of
every alphabet and digit in a vision-based recognition of Indian Sign Language
characters and numerals using B-Spline approximation. The region of interest of
the sign gesture is analyzed and the boundary is extracted. The boundary obtained is
further transformed into a B-spline curve by using the Maximum Curvature Points
(MCPs) as the control points. The B-spline curve undergoes a series of smoothing
processes so that features can be extracted. A support vector machine is used to classify the
images, and the accuracy is 90.00%.

In [5], Pigou used CLAP14 as his dataset [6]. It consists of 20 Italian sign
gestures. After preprocessing the images, he used a Convolutional Neural Network
model with 6 layers for training. It is to be noted that his model is not a 3D CNN
and all the kernels are 2D. He used Rectified Linear Units (ReLU) as activation
functions. Feature extraction is performed by the CNN, while classification uses an
ANN or fully connected layer. His work achieved an accuracy of 91.70% with
an error rate of 8.30%.

2.2 Project Requirements
2.2.1 Domain Analysis
The domain analysis that we carried out for the project mainly involved
understanding convolutional neural networks (CNNs).

Convolutional neural networks are composed of multiple layers of artificial
neurons. Artificial neurons, a rough imitation of their biological counterparts, are
mathematical functions that calculate the weighted sum of multiple inputs and
output an activation value.

The first (or bottom) layer of the CNN usually detects basic features such as
horizontal, vertical, and diagonal edges. The output of the first layer is fed as input
of the next layer, which extracts more complex features, such as corners and
combinations of edges. As you move deeper into the convolutional neural network,
the layers start detecting higher-level features such as objects, faces, and more.

Convolutional neural networks are distinguished from other neural networks


by their superior performance with image, speech, or audio signal inputs. They
have three main types of layers, which are:
Convolutional layer
Pooling layer
Fully-connected (FC) layer

Fig 1 Convolutional Neural Network CNN

Convolutional Layer- The convolutional layer is the core building block of a
CNN, and it is where the majority of computation occurs. It requires a few
components, which are input data, a filter, and a feature map. Let’s assume that
the input will be a color image, which is made up of a matrix of pixels in 3D. This
means that the input will have three dimensions—a height, width, and depth—
which correspond to RGB in an image. We also have a feature detector, also known
as a kernel or a filter, which will move across the receptive fields of the image,
checking if the feature is present. This process is known as a convolution.

The feature detector is a two-dimensional (2-D) array of weights, which represents


part of the image. While they can vary in size, the filter size is typically a 3x3
matrix; this also determines the size of the receptive field. The filter is then
applied to an area of the image, and a dot product is calculated between the input
pixels and the filter. This dot product is then fed into an output array. Afterwards,
the filter shifts by a stride, repeating the process until the kernel has swept across
the entire image. The final output from the series of dot products from the input
and the filter is known as a feature map, activation map, or a convolved feature.

Pooling Layer- Pooling layers, also known as downsampling, conduct
dimensionality reduction, reducing the number of parameters in the input. Similar
to the convolutional layer, the pooling operation sweeps a filter across the entire
input, but the difference is that this filter does not have any weights. Instead, the
kernel applies an aggregation function to the values within the receptive field,
populating the output array. There are two main types of pooling: max pooling and
average pooling.

Fully-Connected Layer- The name of the fully-connected layer aptly describes
itself. As mentioned earlier, the pixel values of the input image are not directly
connected to the output layer in partially connected layers. However, in the fully-
connected layer, each node in the output layer connects directly to a node in the
previous layer.

This layer performs the task of classification based on the features extracted through
the previous layers and their different filters. While convolutional and pooling
layers tend to use ReLU functions, FC layers usually leverage a softmax activation
function to classify inputs appropriately, producing a probability from 0 to 1.
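
Below is a minimal Keras sketch of such a network, assuming 64x64 single-channel (binary) gesture images and ten output classes as used elsewhere in this report; the number of filters and units is purely illustrative.

from tensorflow.keras import layers, models

# Illustrative CNN: convolution + pooling layers for feature extraction,
# followed by fully-connected layers (softmax output) for classification.
model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),               # 64x64 binary gesture image
    layers.Conv2D(32, (3, 3), activation='relu'),  # convolutional layer (3x3 filters)
    layers.MaxPooling2D((2, 2)),                   # pooling layer (max pooling)
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),          # fully-connected layer
    layers.Dense(10, activation='softmax')         # one probability per gesture class
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()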

2.3 System Design

2.3.1 UML Diagram


The following are the different UML diagrams used for the system.

Use Case Diagram - Use case diagrams are used during requirement elicitation and analysis to
represent the functionality of the system. A use case describes a function provided by
the system that yields a visible result for an actor. The identification of
actors and use cases results in the definition of the boundary of the system,
i.e., differentiating the tasks accomplished by the system from the tasks
accomplished by its environment. The actors are outside the boundary of the
system, whereas the use cases are inside it. A use case describes the behavior of
the system as seen from the actor's point of view: the function provided by the
system as a set of events that yield a visible result for the actor.

Fig 2 Use Case Diagram

Sequence Diagram - A sequence diagram displays the time sequence of the
objects participating in an interaction. It consists of a vertical
dimension (time) and a horizontal dimension (the different objects).
Objects: An object can be viewed as an entity at a particular point in time with
a specific value and as a holder of identity. A sequence diagram shows object
interactions arranged in time sequence. It depicts the objects and classes
involved in the scenario and the sequence of messages exchanged between the
objects needed to carry out the functionality of the scenario. Sequence
diagrams are typically associated with use case realizations in the Logical
View of the system under development. Sequence diagrams are sometimes
called event diagrams or event scenarios.

Fig-3 Sequence Diagram

2.3.2 DATA FLOW DIAGRAM (DFD)
The DFD is also known as a bubble chart. It is a simple graphical formalism
that can be used to represent a system in terms of the input data to the system, the various
processing carried out on these data, and the output data generated by the system.
It maps out the flow of information for any process or system, i.e., how data is
processed in terms of inputs and outputs. It uses defined symbols such as rectangles,
circles and arrows to show data inputs, outputs, storage points and the routes
between each destination. DFDs can be used to analyze an existing system or to
model a new one. A DFD can often visually "say" things that would be hard to explain
in words, and it works for both technical and non-technical audiences. There are four
components in a DFD:
1. External Entity
2. Process
3. Data Flow
4. Data Store

Fig-4 DFD Diagram

2.3.3 Flowchart
A flowchart is a type of diagram that depicts an algorithm or process by
showing the stages as various kinds of boxes and linking them with arrows to
illustrate their sequence. The boxes represent the steps of the process, and the arrows
indicate the order in which they are carried out. Flowcharts are used in a variety
of areas to analyze, develop, document, and manage a process or programme. In a
flowchart, the two most frequent kinds of boxes are:

A processing step, indicated by a rectangular box and typically referred to as an
activity.
A decision, generally represented by a diamond.

Fig-5 Flowchart

2.3.4 Entity Relationship Diagram
An Entity Relationship Diagram (ERD) is a diagram that shows the
relationships between entity sets stored in a database. In other words, ER
diagrams help explain the logical structure of a database. Entities, attributes,
and relationships are the three core ideas that ER diagrams are built on.
An ER diagram appears quite similar to a flowchart at first glance. The ER
diagram, however, uses a number of specialized symbols, and the meanings of
these symbols are what distinguish this model. The ER diagram represents the
entity framework of the system.

Fig-6 ER-Diagram

CHAPTER-3
MODULE DESCRIPTION

The proposed system's first phase is to collect data. To capture hand
movements, many researchers have employed sensors or cameras; in our system, the hand
motions are captured using the web camera. The photographs go
through a series of steps in which the backgrounds are recognized and removed
using the HSV (Hue, Saturation, Value) color extraction technique. Following
that, segmentation is used to identify the skin-tone region. A mask is applied to the
images using morphological processes, and a series of dilations and erosions with
an elliptical kernel is performed. The photographs obtained with OpenCV are
resized to the same size, so that there is no size difference between images of
different gestures. The model is then evaluated, and the system is able
to predict the alphabets.

Fig-7 Module Description

Data Collection: Images are captured with the web camera.
Image Processing: Backgrounds are detected and eliminated with HSV, then
morphological operations are performed and masks are applied.
The hand gesture is segmented, after which the image is resized.
Feature Extraction: Binary pixels.
Classification: Using a CNN of 3 layers.
Evaluation: The precision, recall and F-measure for each class are
determined.
Prediction: The system predicts the input gesture of the user and displays the result.

3.1 Data Collection

The data obtained in vision-based gesture recognition is a frame of pictures.

Image-capturing equipment such as a normal video camera, webcam, stereo camera
or thermal camera, or more modern active devices such as the Kinect and the LMC, is
used to collect input for such systems. Stereo cameras, the Kinect, and the LMC are
three-dimensional cameras that can capture depth data. Sensor-based recognition is
defined in this study as any data collection approach that does not employ cameras.

3.2 Data Processing

A. HSV color space and background elimination

Because the photos are in the RGB color space, segmenting the hand gesture only on
the basis of skin color becomes challenging. As a result, we convert the
photos to the HSV color space. It is a model that divides an image's color into three
components: hue, saturation, and value. HSV is a useful technique for improving
image stability by separating brightness from chromaticity. Because the hue
component is largely unaffected by lighting, shadows, or shading, it can be used to
remove backgrounds. To detect the hand gesture and set the backdrop to black, a
track-bar with H values ranging from 0 to 179, S values ranging from 0 to 255, and V
values ranging from 0 to 255 is used. With an elliptical kernel, dilation and erosion
operations are then performed on the hand gesture region.
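
A possible OpenCV sketch of this step is given below; the HSV bounds are placeholders for the values tuned on the track-bar, and frame stands for an image captured from the webcam.

import cv2
import numpy as np

def remove_background(frame, lower=(0, 48, 80), upper=(33, 255, 255)):
    """Set everything outside the chosen HSV range to black.
    The lower/upper bounds are illustrative; in practice they are tuned
    interactively with a track-bar (H: 0-179, S: 0-255, V: 0-255)."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lower, dtype=np.uint8),
                       np.array(upper, dtype=np.uint8))

    # Dilation and erosion with an elliptical kernel to clean up the mask
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.dilate(mask, kernel, iterations=2)
    mask = cv2.erode(mask, kernel, iterations=2)

    return cv2.bitwise_and(frame, frame, mask=mask)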

(a) (b)
Fig-8 (a) Image captured from web-camera. (b) Image after background is set to black
using HSV (first image).

B. Segmentation
After that, the first image is converted to grayscale. While this technique may
result in a loss of color information in the gesture region, it also improves our system's
resilience to changes in lighting or illumination. The converted image's non-black
pixels are binarized (set to white), while the rest remain black. The hand
gesture is then isolated in two steps: first, the image's connected components are
extracted, and then only the component that actually corresponds to the hand
gesture is retained. The frame is reduced to 64 by 64 pixels. After the segmentation
process, binary pictures of 64 by 64 pixels are created, with the white region
representing the hand gesture and the black area representing the
remainder.
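
One possible implementation of this segmentation step is sketched below, assuming masked is the black-background image from the previous step; keeping the largest connected component is used here as a stand-in for retaining only the hand region.

import cv2
import numpy as np

def segment_hand(masked, size=64):
    """Convert to grayscale, binarize, keep the largest connected
    component (assumed to be the hand) and resize to 64x64."""
    gray = cv2.cvtColor(masked, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 1, 255, cv2.THRESH_BINARY)  # non-black pixels -> white

    # Keep only the largest connected component
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    if num > 1:
        largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
        binary = np.where(labels == largest, 255, 0).astype(np.uint8)

    return cv2.resize(binary, (size, size))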

(a) (b)
Fig-9 (a) Image after binarization. (b) Image after segmentation and resizing.
C. Feature Extraction
The ability to identify and extract relevant elements from a picture is one of the
most significant aspects of image processing. Images, when collected and saved as
a dataset, typically take up a lot of space since they include a lot of data. Feature
extraction assists us in solving this challenge by automatically decreasing the data
after the key characteristics have been extracted. It also helps to preserve the
classifier's accuracy while simultaneously reducing its complexity. In our scenario, the
binary pixels of the photographs were determined to be the critical features. We were able
to gather enough characteristics by scaling the photos to 64 by 64 pixels to properly
categorize the sign language gestures.
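
For illustration, the 64x64 binary images can be stacked into the arrays fed to the classifier roughly as follows (a sketch; images and labels are assumed to be Python lists built during data collection):

import numpy as np

# Stack the 64x64 binary images and normalize the pixel values to 0/1,
# giving a (num_samples, 64, 64, 1) feature array for the CNN.
X = np.array(images, dtype='float32').reshape(-1, 64, 64, 1) / 255.0
y = np.array(labels)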

3.3 Classification
Machine learning algorithms for classification can be categorized as
supervised or unsupervised. Supervised machine learning is a method of teaching a
computer to detect patterns in input data that can subsequently be used to predict
future data; it uses a collection of known, labelled training data to infer a function.
Unsupervised machine learning is used to draw conclusions from datasets that have no
labelled responses. There is no reward or penalty weighting for the classes the data is
intended to belong to, because no labelled response is supplied to the classifier.
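
As a brief illustration of the supervised setting described above, the sketch below reuses the hypothetical model, X and y from the earlier snippets and assumes scikit-learn is available for the metrics; the number of output units of the model is assumed to match the number of gesture classes.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

# Supervised learning: the classifier is trained on labelled data
y_enc = LabelEncoder().fit_transform(y)   # map gesture names to integer class ids
X_train, X_test, y_train, y_test = train_test_split(X, y_enc, test_size=0.2, random_state=42)
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))

# Evaluation: precision, recall and F-measure for each gesture class
y_pred = np.argmax(model.predict(X_test), axis=1)
print(classification_report(y_test, y_pred))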

CHAPTER-4
Code
1. Generated Gesture

import os
import glob
import pandas as pd
import io
import xml.etree.ElementTree as ET
import argparse

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # Suppress TensorFlow logging (1)
import tensorflow.compat.v1 as tf
from PIL import Image
from object_detection.utils import dataset_util, label_map_util
from collections import namedtuple

# Initiate argument parser
parser = argparse.ArgumentParser(
    description="Sample TensorFlow XML-to-TFRecord converter")
parser.add_argument("-x", "--xml_dir",
                    help="Path to the folder where the input .xml files are stored.",
                    type=str)
parser.add_argument("-l", "--labels_path",
                    help="Path to the labels (.pbtxt) file.",
                    type=str)
parser.add_argument("-o", "--output_path",
                    help="Path of output TFRecord (.record) file.",
                    type=str)
parser.add_argument("-i", "--image_dir",
                    help="Path to the folder where the input image files are stored. "
                         "Defaults to the same directory as XML_DIR.",
                    type=str, default=None)
parser.add_argument("-c", "--csv_path",
                    help="Path of output .csv file. If none provided, then no file will be written.",
                    type=str, default=None)

args = parser.parse_args()

if args.image_dir is None:
    args.image_dir = args.xml_dir

label_map = label_map_util.load_labelmap(args.labels_path)
label_map_dict = label_map_util.get_label_map_dict(label_map)


def xml_to_csv(path):
    """Iterates through all .xml files (generated by labelImg) in a given directory
    and combines them in a single Pandas dataframe.

    Parameters:
    ----------
    path : str
        The path containing the .xml files
    Returns
    -------
    Pandas DataFrame
        The produced dataframe
    """
    xml_list = []
    for xml_file in glob.glob(path + '/*.xml'):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        for member in root.findall('object'):
            value = (root.find('filename').text,
                     int(root.find('size')[0].text),
                     int(root.find('size')[1].text),
                     member[0].text,
                     int(member[4][0].text),
                     int(member[4][1].text),
                     int(member[4][2].text),
                     int(member[4][3].text)
                     )
            xml_list.append(value)
    column_name = ['filename', 'width', 'height',
                   'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df


def class_text_to_int(row_label):
    return label_map_dict[row_label]


def split(df, group):
    data = namedtuple('data', ['filename', 'object'])
    gb = df.groupby(group)
    return [data(filename, gb.get_group(x))
            for filename, x in zip(gb.groups.keys(), gb.groups)]


def create_tf_example(group, path):
    with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size

    filename = group.filename.encode('utf8')
    image_format = b'jpg'
    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    classes_text = []
    classes = []

    for index, row in group.object.iterrows():
        xmins.append(row['xmin'] / width)
        xmaxs.append(row['xmax'] / width)
        ymins.append(row['ymin'] / height)
        ymaxs.append(row['ymax'] / height)
        classes_text.append(row['class'].encode('utf8'))
        classes.append(class_text_to_int(row['class']))

    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example


def main(_):
    writer = tf.python_io.TFRecordWriter(args.output_path)
    path = os.path.join(args.image_dir)
    examples = xml_to_csv(args.xml_dir)
    grouped = split(examples, 'filename')
    for group in grouped:
        tf_example = create_tf_example(group, path)
        writer.write(tf_example.SerializeToString())
    writer.close()
    print('Successfully created the TFRecord file: {}'.format(args.output_path))
    if args.csv_path is not None:
        examples.to_csv(args.csv_path, index=None)


if __name__ == '__main__':
    tf.app.run()
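
Assuming the script above is saved as generate_tfrecord.py and a workspace layout like that of the TensorFlow Object Detection tutorial is used (both assumptions; the paths below are illustrative), it can be run once per data split, for example:

python generate_tfrecord.py -x Tensorflow/workspace/images/train -l Tensorflow/workspace/annotations/label_map.pbtxt -o Tensorflow/workspace/annotations/train.record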
2. Jupyter Notes:

import cv2
import os
import time
import uuid

IMAGES_PATH = 'Tensorflow/workspace/images/collectedimages'
labels = ['hello', 'thanks', 'yes', 'no', 'good']
number_imgs = 15

for label in labels:
    # Create a folder for each gesture label
    os.makedirs(os.path.join(IMAGES_PATH, label), exist_ok=True)
    cap = cv2.VideoCapture(0)
    print('Collecting images for {}'.format(label))
    time.sleep(5)
    for imgnum in range(number_imgs):
        ret, frame = cap.read()
        imgname = os.path.join(IMAGES_PATH, label,
                               label + '.{}.jpg'.format(str(uuid.uuid1())))
        cv2.imwrite(imgname, frame)
    cap.release()
CHAPTER-5
CONCLUSION AND FUTURE ENHANCEMENT

Gesture recognition is a field of study that has a wide range of applications,


including sign language recognition, remote control robotics, and virtual reality
human–computer interfaces. Nonetheless, the occlusion of the hand, the existence of
affine transformations, database scalability, varied background illumination, and high
processing cost remain challenges to establishing an accurate and resilient system.

Many breakthroughs have been made in the fields of artificial intelligence, machine
learning and computer vision. They have contributed immensely to how we
perceive things around us and have improved the way we apply their techniques
in our everyday lives. Much research has been conducted on sign gesture
recognition using different techniques like ANNs, LSTMs and 3D CNNs. However,
most of them require extra computing power. Our approach, on the other hand,
requires low computing power and gives a remarkable accuracy of above 90%. In
our work, we normalize and rescale our images to 64 pixels in order
to extract features (binary pixels) and make the system more robust.

The movements, body language, and facial expressions used in sign languages
vary greatly from country to country, and the syntax and structure of a sentence might
also differ significantly. Learning and recording gestures was a difficult task in our
study, since hand movements had to be accurate and on point. Certain movements
are difficult to duplicate, and keeping our hands in the same place while
compiling our dataset was difficult.

We hope to expand our datasets with other alphabets and refine the model so that it
can recognize more alphabetical characteristics while maintaining high accuracy.
We would also like to improve the system further by including voice recognition, so that
blind individuals may benefit as well.

Gesture recognition using machine learning has proven to be a transformative


technology with diverse applications across various industries. Through the
utilization of algorithms and neural networks, it enables systems to interpret and

respond to human gestures accurately. In conclusion, this technology has
demonstrated significant advancements in real-time gesture analysis, enhancing
human-computer interaction and fostering accessibility in numerous fields.
One of the key strengths of machine learning-based gesture recognition lies in its
ability to continuously improve accuracy and efficiency with more extensive
datasets. Further enhancements could involve refining existing models through
increased data diversity, encompassing various gestures across different
demographics and cultural backgrounds. Additionally, integrating multimodal
approaches by combining gesture recognition with other sensory inputs like voice
or facial expressions could enhance the overall system performance.

Moreover, optimizing computational efficiency remains crucial for deploying


gesture recognition in resource-constrained environments. Exploring lightweight
architectures and algorithms that balance accuracy and computational cost would
make these systems more accessible across a wider range of devices.

Furthermore, the incorporation of interpretability and transparency into machine


learning models for gesture recognition is essential. Developing methods to
understand and explain the decision-making process of these models could
increase their trustworthiness, particularly in critical applications like healthcare or
autonomous systems.

In conclusion, while machine learning-based gesture recognition has shown
immense potential, continuous research and development are essential to further
improve its accuracy, efficiency, interpretability, and deployment across diverse
applications.

REFERENCES

[1] https://peda.net/id/08f8c4a8511
[2] K. Bantupalli and Y. Xie, "American Sign Language Recognition using Deep Learning
and Computer Vision," 2018 IEEE International Conference on Big Data (Big Data),
Seattle, WA, USA, 2018, pp. 4896-4899, doi:10.1109/BigData.2018.8622141.

[3] M. Geetha and U. C. Manjusha, "A Vision Based Recognition of Indian Sign Language
Alphabets and Numerals Using B-Spline Approximation", International Journal on Computer
Science and Engineering (IJCSE), vol. 4, no. 3, pp. 406-415, 2012.

[4] He, Siming (2019). Research of a Sign Language Translation System Based on
Deep Learning, pp. 392-396. doi: 10.1109/AIAM48774.2019.00083.

[5] Pigou L., Dieleman S., Kindermans PJ., Schrauwen B. (2015) Sign Language
Recognition Using Convolutional Neural Networks. In: Agapito L., Bronstein
M., Rother C. (eds) Computer Vision - ECCV 2014 Workshops. ECCV 2014. Lecture Notes in
Computer Science, vol 8925. Springer, Cham.

[6] Escalera, S., Baró, X., Gonzàlez, J., Bautista, M., Madadi, M., Reyes, M., . . . Guyon,
I. (2014). ChaLearn Looking at People Challenge 2014: Dataset and Results.
Workshop at the European Conference on Computer Vision (pp. 459-473). Springer, Cham.
