1922 B.SC Cs Batchno 21
CURSOR MOVEMENT ON OBJECT MOTION
By
NIRANCHANA.K (Reg. No. 39290072)
JEEVITHA.S (Reg. No. 39290040)
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
MARCH - 2022
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
(Established under Section 3 of UGC Act, 1956)
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI – 600119
www.sathyabama.ac.in
This is to certify that this Project Report is the bonafide work of NIRANCHANA.K
(Reg. No. 39290072) and JEEVITHA.S (Reg. No. 39290040) who carried out
the project entitled “CURSOR MOVEMENT ON OBJECT MOTION”
under my supervision from to
Internal Guide
Dr. M. SELVI, M.E., Ph.D.,
I, NIRANCHANA.K (Reg. No. 39290072) hereby declare that the Project Report
entitled "CURSOR MOVEMENT ON OBJECT MOTION” done by me under the
guidance of Dr. M. SELVI, M.E, Ph.D., is submitted in partial fulfillment of the
requirements for the award of Bachelor of Science degree in Computer Science.
DATE:
I would like to express my sincere and deep sense of gratitude to my Project Guide
Dr. M. SELVI, M.E., Ph.D., for her valuable guidance, suggestions, and constant
encouragement that paved the way for the successful completion of my project work.
I wish to express my thanks to all teaching and non-teaching staff members of the
Department of Computer Science and Engineering who were helpful in many ways
for the completion of the project.
ABSTRACT
In this project, an AI virtual mouse system that controls the cursor through hand
gestures, using computer vision for human-computer interaction, is presented. Cross
comparison of the testing of the AI virtual mouse system is difficult because only a
limited number of datasets are available. The hand gestures and fingertip detection
have been tested under various illumination conditions, and also at different
distances from the webcam, for tracking of the hand gestures and hand tip detection.
TABLE OF CONTENTS
CHAPTER NO.    TITLE    PAGE NO.
ABSTRACT v
1 INTRODUCTION 1
2 LITERATURE SURVEY 2
3.5.1 ANACONDA 9
3.5.3 VS CODE 12
4 EXPERIMENTAL OR MATERIAL METHODS 14
6.1 CONCLUSION 29
REFERENCES 30
APPENDIX 31
A. SOURCE CODE 31
LIST OF FIGURES
identify hand 20
Hand Gesture 28
CHAPTER 1
1. INTRODUCTION
1.1 OVERVIEW OF PROJECT
A virtual mouse is software that allows users to give mouse inputs to a system without
using an actual mouse. At the extreme, it can even be regarded as replacing hardware,
because it uses an ordinary web camera in place of a physical mouse. A virtual mouse
can usually be operated with multiple input devices, which may include an actual mouse
or a computer keyboard. A virtual mouse that uses a web camera works with the help of
different image processing techniques.
In this system, the hand movements of a user are mapped to mouse inputs. A web camera
is set up to capture images continuously. Most laptops today are equipped with webcams,
which have recently been used in security applications utilizing face recognition. To
harness the full potential of a webcam, it can be used for vision-based cursor control (CC),
which would effectively eliminate the need for a computer mouse or mouse pad. The
usefulness of a webcam can also be extended to other HCI applications such as a sign
language database or a motion controller. Over the past decades there have been significant
advancements in HCI technologies for gaming, such as the Microsoft Kinect and the
Nintendo Wii. These gaming technologies provide a more natural and interactive means of
playing video games. Motion control is widely seen as the future of gaming, and it has
tremendously boosted the sales of video games; the Nintendo Wii, for example, sold tens
of millions of consoles. HCI using hand gestures is very intuitive and effective for
one-to-one interaction with computers, and it provides a Natural User Interface (NUI).
There has been extensive research into novel devices and techniques for cursor control
using hand gestures. Besides HCI, hand gesture recognition is also used in sign language
recognition, which makes hand gesture recognition even more significant.
CHAPTER 2
2. LITERATURE SURVEY
3. A Survey of Glove-Based Input
Authors: D. J. Sturman and D. Zeltzer, IEEE Computer Graphics and
Applications, 14: 30-39, 1994.
The primary objective is to introduce sensor gloves to the non-specialist
readers interested in selecting one of these devices for their particular application.
In Design and Manufacturing, glove-based systems are used to interact with
computer-generated (typically virtual reality) environments. Measurements taken
with sensor gloves can be complemented with other types of measurements.
Clumsy intermediary devices constrain our interaction with computers and
their applications. Glove-based input devices let us apply our manual dexterity to
the task. We provide a basis for understanding the field by describing key hand-
tracking technologies and applications using glove-based input. The bulk of
development in glove-based input has taken place very recently, and not all of it is
easily accessible in the literature. We present a cross-section of the field to date.
Hand-tracking devices may use the following technologies: position tracking,
optical tracking, marker systems, silhouette analysis, magnetic tracking or
acoustic tracking. Actual glove technologies on the market include: Sayre glove,
MIT LED glove, Digital Data Entry Glove, Data Glove, Dexterous HandMaster,
Power Glove, CyberGlove and Space Glove. Various applications of glove
technologies include projects in the pursuit of natural interfaces, systems for
understanding signed languages, teleoperation and robotic control, computer-
based puppetry, and musical performance.
CHAPTER 3
Hand gesture recognition and hand tracking are important tasks with many real-world
applications. For the detection of hand gestures and hand tracking, the MediaPipe
framework is used, and the OpenCV library is used for computer vision. The algorithm
makes use of machine learning concepts to track and recognize the hand gestures and
the hand tip.
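As a brief illustration (a minimal sketch, not taken from the project code; the parameter values are assumptions), a hand detector based on the MediaPipe Hands solution and OpenCV can be initialized as follows:

import cv2
import mediapipe as mp

# MediaPipe Hands detects and tracks 21 hand landmarks per hand.
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(max_num_hands=1,
                       min_detection_confidence=0.7,
                       min_tracking_confidence=0.7)

# OpenCV supplies the webcam frames that MediaPipe processes.
cap = cv2.VideoCapture(0)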
There are generally two approaches to hand gesture recognition: hardware based
(Quam 1990; Zhu et al. 2006), where the user must wear a device, and vision based
(Shrivastava 2013; Wang and Popović 2009), which uses image processing techniques
with input from a camera. The proposed system is vision based, using image processing
techniques and input from a computer webcam. Vision-based gesture recognition systems
are generally broken down into four stages: skin detection, hand contour extraction,
hand tracking, and gesture recognition. The input frame is captured from the webcam
and the skin region is detected using skin detection. The hand contour is then found
and used for hand tracking and gesture recognition. Hand tracking is used to navigate
the computer cursor, and hand gestures are used to perform mouse functions such as
right click, left click, scroll up, and scroll down. The scope of the project is
therefore to design a vision-based CC system that can perform the mouse functions
stated above.
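The first two stages of such a pipeline (skin detection and hand contour extraction) can be sketched with OpenCV as follows; this is only an illustration, and the HSV threshold values are assumptions that would need tuning for real lighting conditions:

import cv2
import numpy as np

def extract_hand_contour(frame):
    # Stage 1: skin detection using an HSV colour threshold (illustrative values).
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    lower_skin = np.array([0, 30, 60], dtype=np.uint8)
    upper_skin = np.array([20, 150, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower_skin, upper_skin)

    # Stage 2: hand contour extraction - keep the largest skin-coloured region.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    return max(contours, key=cv2.contourArea)

The returned contour can then be fed to the hand tracking and gesture recognition stages.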
3.3 SYSTEM REQUIREMENTS
1. Processor : Pentium IV
2. RAM : 8 GB
3. Processor Speed : 2.4 GHz
4. Main Memory : 8 GB RAM
5. Hard Disk Drive : 1 TB
6. Web Camera
3.4 SOFTWARE USED: Python
• It provides rich data types and an easier-to-read syntax than many other programming
languages
• It is a platform-independent scripting language with full access to operating system
APIs
• Compared to other programming languages, it allows more run-time flexibility
• It includes the basic text manipulation facilities of Perl and Awk
• A module in Python may have one or more classes and free functions
• Libraries in Python are cross-platform compatible with Linux, Macintosh, and
Windows
• For building large applications, Python can be compiled to byte-code
• Python supports functional and structured programming as well as OOP
• It supports an interactive mode that allows interactive testing and debugging of
snippets of code
• In Python, since there is no compilation step, editing, debugging, and testing are fast.
You can create scalable Web Apps using frameworks and CMS (Content
Management Systems) that are built on Python. Some of the popular platforms for
creating Web Apps are Django, Flask, Pyramid, Plone, and Django CMS. Sites like
Mozilla, Reddit, Instagram, and PBS are written in Python.
There are numerous libraries available in Python for scientific and numeric
computing. Libraries like SciPy and NumPy are used in general-purpose computing,
and there are domain-specific libraries such as EarthPy for earth science and
AstroPy for astronomy. The language is also heavily used in machine learning,
data mining, and deep learning.
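As a small, generic illustration of this kind of numeric computing with NumPy (not taken from the project code):

import numpy as np

samples = np.array([0.12, 0.34, 0.29, 0.41])     # e.g. measured fingertip distances
print(samples.mean(), samples.std())             # vectorised statistics in one call
print(np.interp(0.5, [0.0, 1.0], [0, 100]))      # linear interpolation between ranges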
Python can be extended with C/C++, which allows us to write computationally
intensive code in C/C++ and create a Python wrapper for it so that we can use these
wrappers as Python modules. This gives us two advantages: first, our code is as fast
as the original C/C++ code (since it is the actual C++ code working in the background),
and second, it is very easy to code in Python. This is how OpenCV-Python works: it is
a Python wrapper around the original C++ implementation.
The support of NumPy makes the task easier. NumPy is a highly optimized library
for numerical operations with a MATLAB-style syntax. All the OpenCV array
structures are converted to and from NumPy arrays, so whatever operations you
can do in NumPy, you can combine with OpenCV, which increases the number of
weapons in your arsenal. Besides that, several other libraries that support NumPy,
such as SciPy and Matplotlib, can be used alongside it.
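For instance, every image that OpenCV returns is already a NumPy array, so NumPy operations can be mixed freely with OpenCV calls (a small sketch; the file name is hypothetical):

import cv2
import numpy as np

img = cv2.imread("hand.jpg")                 # hypothetical file; loaded as a NumPy array (H x W x 3)
print(type(img), img.shape, img.dtype)

# Plain NumPy arithmetic on the image, then straight back into an OpenCV call.
bright = np.clip(img.astype(np.int16) + 40, 0, 255).astype(np.uint8)
gray = cv2.cvtColor(bright, cv2.COLOR_BGR2GRAY)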
3.5.1 Anaconda:
Fig 3.1: Anaconda Distribution
Fig 3.2: Anaconda Navigator Home Page
1. JupyterLab
2. Jupyter Notebook
3. Qt Console
4. Spyder
5. Glueviz
6. Orange3
7. RStudio
8. Visual Studio Code
3.5.3 VS Code:
VS Code is free for both private and commercial use, runs on Windows,
macOS, and Linux, and includes support for linting, debugging, task running,
version control and Git integration, IntelliSense code completion, and conda
environments. VS Code is openly extensible and many extensions are available.
• Better Reliability:
The reliability of Anaconda has been improved in the latest release
by capturing and storing the package metadata for installed packages.
• Work in Progress:
There is a casting bug in NumPy with Python 3.7, but the team is
currently working on patching it until NumPy is updated.
CHAPTER 4
4. EXPERIMENTAL OR MATERIAL METHODS
The proposed AI virtual mouse system is based on the frames captured by the
webcam of a laptop or PC. Using the Python computer vision library OpenCV, the
video capture object is created and the web camera starts capturing video. The
web camera captures the frames and passes them to the AI virtual mouse system.
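A minimal sketch of this capture step with OpenCV (variable names are illustrative):

import cv2

cap = cv2.VideoCapture(0)        # create the video capture object; 0 selects the default webcam
if not cap.isOpened():
    raise RuntimeError("Web camera could not be opened")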
Module 2: Capturing the Video and Processing
The AI virtual mouse system uses the webcam, and each frame is captured until the
termination of the program. The video frames are converted from the BGR to the RGB
color space in order to find the hands in the video frame by frame, as shown in the
following code:
def findHands(self, img, draw=True):
    # Convert the captured BGR frame to RGB and run MediaPipe hand detection.
    imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    self.results = self.hands.process(imgRGB)
In this stage, we detect which fingers are up using the tip Id of each finger, found
using MediaPipe, together with the respective coordinates of the fingers that are up;
according to that, the particular mouse function is performed.
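A common way to check which fingers are up is to compare each fingertip landmark with the joint below it. The sketch below assumes a list lm_list of (x, y) pixel positions for the 21 MediaPipe landmarks (all names are illustrative); the "tip Id" values used in the text correspond to positions in such a list of fingertip ids:

# Fingertip landmark ids in MediaPipe Hands: thumb, index, middle, ring, pinky.
tip_ids = [4, 8, 12, 16, 20]

def fingers_up(lm_list):
    fingers = []
    # Thumb: compare the x of its tip (4) with the joint next to it (3);
    # this assumes a right hand facing the camera.
    fingers.append(1 if lm_list[tip_ids[0]][0] > lm_list[tip_ids[0] - 1][0] else 0)
    # Other fingers: a finger is "up" when its tip lies above its pip joint
    # (smaller y in image coordinates).
    for tip in tip_ids[1:]:
        fingers.append(1 if lm_list[tip][1] < lm_list[tip - 2][1] else 0)
    return fingers    # e.g. [0, 1, 0, 0, 0] means only the index finger is up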
Module 5: Mouse Functions Depending on the Hand Gestures and Hand Tip
Detection Using Computer Vision For the Mouse Cursor Moving around the
Computer Window
If the index finger is up with tip Id = 1, or if both the index finger with tip Id = 1
and the middle finger with tip Id = 2 are up, the mouse cursor is made to move
around the window of the computer using the AutoPy package of Python.
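A hedged sketch of this cursor-moving step with the AutoPy package is given below; the frame size is an assumption, and numpy.interp is used to map fingertip coordinates in the camera frame onto the screen:

import autopy
import numpy as np

screen_w, screen_h = autopy.screen.size()
frame_w, frame_h = 640, 480                     # assumed webcam frame size

def move_cursor(index_x, index_y):
    # Map the fingertip position in the camera frame onto the full screen.
    x = np.interp(index_x, (0, frame_w), (0, screen_w - 1))
    y = np.interp(index_y, (0, frame_h), (0, screen_h - 1))
    autopy.mouse.move(screen_w - 1 - x, y)      # mirror x so the cursor follows the hand naturally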
Module 6: Model Creation:
Algorithms Used:
• Mediapipe Framework
• OpenCV Library
A single-shot detector (SSD) model is used for detecting and recognizing a hand or
palm in real time; this single-shot detector model is used by MediaPipe. In the hand
detection module, a palm detection model is trained first because palms are easier to
train on. Furthermore, non-maximum suppression works significantly better on small
objects such as palms or fists. The hand landmark model then locates the joint, or
knuckle, coordinates within the hand region.
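Once the palm detector and the hand landmark model have run, the 21 joint coordinates can be read from the MediaPipe results and drawn on the frame, roughly as follows (a minimal sketch; the function and variable names are illustrative):

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

def draw_hand_landmarks(frame, hands):
    # 'hands' is an initialised mp_hands.Hands object; 'frame' is a BGR image from OpenCV.
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            h, w, _ = frame.shape
            tip = hand_landmarks.landmark[8]                     # landmark 8 = index fingertip
            cv2.circle(frame, (int(tip.x * w), int(tip.y * h)), 8, (0, 255, 0), -1)
            mp_drawing.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)
    return frame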
Fig 4.1: MediaPipe Framework
The basics:
Packet
The basic data flow unit. A packet consists of a numeric timestamp and a shared
pointer to an immutable payload. The payload can be of any C++ type, and the
payload’s type is also referred to as the type of the packet. Packets are value
classes and can be copied cheaply. Each copy shares ownership of the payload,
with reference-counting semantics. Each copy has its own timestamp. See
also Packet.
Graph
MediaPipe processing takes place inside a graph, which defines packet flow paths
between nodes. A graph can have any number of inputs and outputs, and data
flow can branch and merge. Generally, data flows forward, but backward loops are
possible.
Nodes
Nodes produce and/or consume packets, and they are where the bulk of the
graph’s work takes place. They are also known as “calculators”, for historical
reasons. Each node’s interface defines a number of input and output ports,
identified by a tag and/or an index.
Streams
A stream is a connection between two nodes that carries a sequence of packets,
whose timestamps must be monotonically increasing.
Side packets
A side packet connection between nodes carries a single packet (with unspecified
timestamp). It can be used to provide some data that will remain constant,
whereas a stream represents a flow of data that changes over time.
Packet Ports
A port has an associated type; packets transiting through the port must be of that
type. An output stream port can be connected to any number of input stream ports
of the same type; each consumer receives a separate copy of the output packets,
and has its own queue, so it can consume them at its own pace. Similarly, a side
packet output port can be connected to as many side packet input ports as
desired.
Input and output
Data can enter a graph through source nodes or graph input streams. Similarly, there
are sink nodes that receive data and write it to various destinations (e.g. a file, a
memory buffer, etc.), and an application can also receive output from the graph using
callbacks.
Runtime behavior:
Graph lifetime
Once a graph has been initialized, it can be started to begin processing data, and
can process a stream of packets until each stream is closed or the graph
is canceled. Then the graph can be destroyed or started again.
Node lifetime
There are three main lifetime methods the framework will call on a node:
• Open: called once, before the other methods. When it is called, all input
side packets required by the node will be available.
• Process: called multiple times, when a new set of inputs is available,
according to the node’s input policy.
• Close: called once, at the end.
In addition, each calculator can define constructor and destructor, which are useful
for creating and deallocating resources that are independent of the processed
data.
Input policies
The default input policy is deterministic collation of packets by timestamp. A node
receives all inputs for the same timestamp at the same time, in an invocation of its
Process method; and successive input sets are received in their timestamp order.
This can require delaying the processing of some packets until a packet with the
same timestamp is received on all input streams, or until it can be guaranteed that
a packet with that timestamp will not be arriving on the streams that have not
received it.
Other policies are also available, implemented using a separate kind of component
known as an InputStreamHandler.
Real-time streams
MediaPipe calculator graphs are often used to process streams of video or audio
frames for interactive applications. Normally, each Calculator runs as soon as all of
its input packets for a given timestamp become available. Calculators used in real-
time graphs need to define output timestamp bounds based on input timestamp
bounds in order to allow downstream calculators to be scheduled promptly.
4.2.2 OpenCV Library:
Step 1:
To build this Hand Gesture Recognition project, we’ll need four packages. So,
first, import these.
# import necessary packages for hand gesture recognition project using Python OpenCV
import cv2
import numpy as np
import mediapipe as mp
import tensorflow as tf
from tensorflow.keras.models import load_model
Step 2:
Initialize models
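A plausible initialization for this step, using the packages imported in Step 1 and assuming a pre-trained Keras gesture classifier saved as 'mp_hand_gesture' with its class names listed in 'gesture.names' (both file names are assumptions), could look like this:

# Initialize MediaPipe Hands and the drawing utility.
mpHands = mp.solutions.hands
hands = mpHands.Hands(max_num_hands=1, min_detection_confidence=0.7)
mpDraw = mp.solutions.drawing_utils

# Load the pre-trained gesture recognizer model and its class names (assumed file names).
model = load_model('mp_hand_gesture')
classNames = open('gesture.names').read().split('\n')
print(classNames)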
Step 3:
Read frames from a webcam
3.1 − We create a VideoCapture object and pass the argument '0', which is the
camera ID of the system. In this case, we have one webcam connected to the
system. If you have multiple webcams, change the argument according to your
camera ID; otherwise, leave it at the default.
3.2 − The cap.read() function reads each frame from the webcam.
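A short sketch of this read loop, using the cv2 import from Step 1 (the flip mirrors the frame so the preview feels natural):

cap = cv2.VideoCapture(0)            # 0 = camera ID of the default webcam

while True:
    success, frame = cap.read()      # read one frame from the webcam
    if not success:
        continue
    frame = cv2.flip(frame, 1)       # mirror the frame horizontally
    cv2.imshow("Hand Gesture Recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):   # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()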
4.3.1 Real-Time Video From Web Camera:
The proposed AI virtual mouse system is based on the frames captured by the webcam
of a laptop or PC. Using the Python computer vision library OpenCV, the video capture
object is created and the web camera starts capturing video. The web camera captures
the frames and passes them to the AI virtual mouse system.
The AI virtual mouse system uses the webcam, and each frame is captured until the
termination of the program. The video frames are converted from the BGR to the RGB
color space in order to find the hands in the video frame by frame, as shown in the
following code:
def findHands(self, img, draw=True):
    # Convert the captured BGR frame to RGB and run MediaPipe hand detection.
    imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    self.results = self.hands.process(imgRGB)
In this stage, we detect which fingers are up using the tip Id of each finger, found
using MediaPipe, together with the respective coordinates of the fingers that are up;
according to that, the particular mouse function is performed.
CHAPTER 5
5. RESULTS AND PERFORMANCE ANALYSIS
5.2. Left Click Using Hand gesture:
5.3. Right Click Using Hand gesture:
• This is a hand gesture that performs the action of a Right click mouse
movement.
5.4. Double Click Using Hand gesture:
• This is a hand gesture that performs the action of a double click mouse
movement.
5.5. Brightness Control, Volume Control, and Scroll Function Using Hand gesture:
• This hand gesture is common for all three functions that are Brightness
Control, Volume Control, and Scroll Function.
5.6. No Action Performed Using Hand gesture:
CHAPTER 6
6. CONCLUSION AND FUTURE ENHANCEMENT
6.1 CONCLUSION:
There are several features and improvements needed in order for the program to
be more user-friendly, accurate, and flexible in various environments. The
following describes the required improvements and features:
a) Smart Movement: Because the current recognition process is limited to a radius
of about 25 cm, an adaptive zoom-in/out function is required to improve the covered
distance, so that the system can automatically adjust the focus based on the distance
between the user and the webcam.
b) Better Accuracy & Performance: The response time relies heavily on the hardware
of the machine, including the processing speed of the processor, the size of the
available RAM, and the capabilities of the webcam. Therefore, the program may
perform better when running on a decent machine with a webcam that performs well
under different types of lighting.
c) Mobile Application: In the future, this application could also be used on Android
devices, where the touchscreen concept is replaced by hand gestures.
REFERENCES
APPENDIX
A. SOURCE CODE
# Imports
import cv2
import mediapipe as mp
import pyautogui
import math
from enum import IntEnum
from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume
from google.protobuf.json_format import MessageToDict
import screen_brightness_control as sbcontrol
pyautogui.FAILSAFE = False
mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands
# Gesture Encodings
class Gest(IntEnum):
# Binary Encoded
FIST = 0
PINKY = 1
RING = 2
MID = 4
LAST3 = 7
INDEX = 8
FIRST2 = 12
LAST4 = 15
THUMB = 16
PALM = 31
# Extra Mappings
V_GEST = 33
TWO_FINGER_CLOSED = 34
PINCH_MAJOR = 35
PINCH_MINOR = 36
# Multi-handedness Labels
class HLabel(IntEnum):
MINOR = 0
MAJOR = 1
# Excerpt: tail of the HandRecog signed-distance helper (the method header and the
# computation of 'sign' are omitted here).
dist = (self.hand_result.landmark[point[0]].x - self.hand_result.landmark[point[1]].x)**2
dist += (self.hand_result.landmark[point[0]].y - self.hand_result.landmark[point[1]].y)**2
dist = math.sqrt(dist)
return dist*sign
def get_dz(self, point):
    # Absolute depth (z) difference between two landmarks.
    return abs(self.hand_result.landmark[point[0]].z - self.hand_result.landmark[point[1]].z)
# Excerpt from the finger-state logic (the enclosing method is omitted here):
# signed distances between landmark pairs along each finger.
dist = self.get_signed_dist(point[:2])
dist2 = self.get_signed_dist(point[1:])
try:
    ratio = round(dist/dist2, 1)
except ZeroDivisionError:
    ratio = round(dist/0.01, 1)    # fixed: 'dist1' in the original was undefined
# Excerpt from the gesture-classification logic: default to PALM, then detect a pinch
# when the index fingertip (landmark 8) and thumb tip (landmark 4) are close together.
current_gesture = Gest.PALM
if self.finger in [Gest.LAST3, Gest.LAST4] and self.get_dist([8,4]) < 0.05:
    if self.hand_label == HLabel.MINOR:
        current_gesture = Gest.PINCH_MINOR
    else:
        current_gesture = Gest.PINCH_MAJOR
# Executes commands according to detected gestures
class Controller:
tx_old = 0
ty_old = 0
trial = True
flag = False
grabflag = False
pinchmajorflag = False
pinchminorflag = False
pinchstartxcoord = None
pinchstartycoord = None
pinchdirectionflag = None
prevpinchlv = 0
pinchlv = 0
framecount = 0
prev_hand = None
pinch_threshold = 0.3
def getpinchylv(hand_result):
dist = round((Controller.pinchstartycoord - hand_result.landmark[8].y)*10,1)
return dist
def getpinchxlv(hand_result):
dist = round((hand_result.landmark[8].x - Controller.pinchstartxcoord)*10,1)
return dist
def changesystembrightness():
currentBrightnessLv = sbcontrol.get_brightness()/100.0
currentBrightnessLv += Controller.pinchlv/50.0
if currentBrightnessLv > 1.0:
currentBrightnessLv = 1.0
elif currentBrightnessLv < 0.0:
currentBrightnessLv = 0.0
sbcontrol.fade_brightness(int(100*currentBrightnessLv), start=sbcontrol.get_brightness())
def changesystemvolume():
devices = AudioUtilities.GetSpeakers()
interface = devices.Activate(IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
volume = cast(interface, POINTER(IAudioEndpointVolume))
currentVolumeLv = volume.GetMasterVolumeLevelScalar()
currentVolumeLv += Controller.pinchlv/50.0
if currentVolumeLv > 1.0:
currentVolumeLv = 1.0
elif currentVolumeLv < 0.0:
currentVolumeLv = 0.0
volume.SetMasterVolumeLevelScalar(currentVolumeLv, None)
def scrollVertical():
pyautogui.scroll(120 if Controller.pinchlv>0.0 else -120)
def scrollHorizontal():
pyautogui.keyDown('shift')
pyautogui.keyDown('ctrl')
pyautogui.scroll(-120 if Controller.pinchlv>0.0 else 120)
pyautogui.keyUp('ctrl')
pyautogui.keyUp('shift')
# Excerpt from the cursor-positioning logic (the enclosing method is omitted here):
# the squared hand displacement is used to damp jitter before moving the cursor.
distsq = delta_x**2 + delta_y**2
ratio = 1
Controller.prev_hand = [x,y]
def pinch_control_init(hand_result):
Controller.pinchstartxcoord = hand_result.landmark[8].x
Controller.pinchstartycoord = hand_result.landmark[8].y
Controller.pinchlv = 0
Controller.prevpinchlv = 0
Controller.framecount = 0
# Excerpt from the pinch_control logic (the enclosing function is omitted here);
# 'controlHorizontal' is the horizontal-pinch callback passed into it.
if Controller.pinchdirectionflag == True:
    controlHorizontal() #x

lvx = Controller.getpinchxlv(hand_result)
lvy = Controller.getpinchylv(hand_result)
# flag reset
if gesture != Gest.FIST and Controller.grabflag:
Controller.grabflag = False
pyautogui.mouseUp(button = "left")
if gesture != Gest.PINCH_MINOR and Controller.pinchminorflag:
Controller.pinchminorflag = False
# implementation
if gesture == Gest.V_GEST:
Controller.flag = True
pyautogui.moveTo(x, y, duration = 0.1)
elif gesture == Gest.PINCH_MAJOR:
if Controller.pinchmajorflag == False:
Controller.pinch_control_init(hand_result)
Controller.pinchmajorflag = True
Controller.pinch_control(hand_result, Controller.changesystembrightness, Controller.changesystemvolume)
'''
---------------------------------------- Main Class ----------------------------------------
Entry point of Gesture Controller
'''
class GestureController:
gc_mode = 0
cap = None
CAM_HEIGHT = None
CAM_WIDTH = None
hr_major = None # Right Hand by default
hr_minor = None # Left hand by default
dom_hand = True
def __init__(self):
GestureController.gc_mode = 1
GestureController.cap = cv2.VideoCapture(0)
GestureController.CAM_HEIGHT = GestureController.cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
GestureController.CAM_WIDTH = GestureController.cap.get(cv2.CAP_PROP_FRAME_WIDTH)
def classify_hands(results):
left , right = None,None
try:
handedness_dict = MessageToDict(results.multi_handedness[0])
if handedness_dict['classification'][0]['label'] == 'Right':
right = results.multi_hand_landmarks[0]
else :
left = results.multi_hand_landmarks[0]
except:
pass
try:
handedness_dict = MessageToDict(results.multi_handedness[1])
if handedness_dict['classification'][0]['label'] == 'Right':
right = results.multi_hand_landmarks[1]
else :
left = results.multi_hand_landmarks[1]
except:
pass
if GestureController.dom_hand == True:
GestureController.hr_major = right
GestureController.hr_minor = left
else :
GestureController.hr_major = left
GestureController.hr_minor = right
def start(self):
handmajor = HandRecog(HLabel.MAJOR)
handminor = HandRecog(HLabel.MINOR)
# (Excerpt: the capture loop that reads frames from GestureController.cap, flips them,
# converts them to RGB, and runs MediaPipe Hands to obtain 'results' and 'image' is omitted here.)
image.flags.writeable = True
image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
if results.multi_hand_landmarks:
GestureController.classify_hands(results)
handmajor.update_hand_result(GestureController.hr_major)
handminor.update_hand_result(GestureController.hr_minor)
handmajor.set_finger_state()
handminor.set_finger_state()
gest_name = handminor.get_gesture()
if gest_name == Gest.PINCH_MINOR:
Controller.handle_controls(gest_name, handminor.hand_result)
else:
gest_name = handmajor.get_gesture()
Controller.handle_controls(gest_name, handmajor.hand_result)