
Gestop: Customizable Gesture Control of Computer Systems

Sriram S K
[email protected]
PES University
Bengaluru, India

Nishant Sinha
[email protected]
OffNote Labs
Bengaluru, India

ABSTRACT
The established way of interfacing with most computer systems is a mouse and keyboard. Hand gestures are an intuitive and effective touchless way to interact with computer systems. However, hand gesture based systems have seen low adoption among end users, primarily due to numerous technical hurdles in detecting in-air gestures accurately. This paper presents Gestop, a framework developed to bridge this gap. The framework learns to detect gestures from demonstrations, is customizable by end users, and enables users to interact in real time, using gestures, with computers that have only RGB cameras.

CCS CONCEPTS
• Human-centered computing → Gestural input; • Computing methodologies → Tracking.

KEYWORDS
hand gesture, MediaPipe, neural networks, pytorch

ACM Reference Format:
Sriram S K and Nishant Sinha. 2021. Gestop: Customizable Gesture Control of Computer Systems. In 8th ACM IKDD CODS and 26th COMAD (CODS COMAD 2021), January 2–4, 2021, Bangalore, India. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3430984.3430993

1 INTRODUCTION
Hand detection and gesture recognition have a broad range of potential applications, including in-car gestures, sign language recognition, virtual reality and so on. Through gestures, users can control or interact with devices without touching them. Although numerous gesture recognition prototypes and tutorials are available across the web, they handle only a restricted set of gestures and lack proper architecture design and description, making it hard for end users to use and build upon them.

In this paper, we present the architecture and implementation of our end-to-end, extensible system, which allows users to control the desktop in real time using hand gestures. Our tool works across different hardware (CPU/GPU) and operating systems and relies only on a medium-resolution camera to detect gestures. The tool controls the desktop through hand gestures alone, replacing all mouse actions, as well as many keyboard shortcuts, with gestures. Furthermore, the design is modular and customizable: we provide an easy-to-use configuration for remapping gestures and actions, and for adding new custom actions as well as new gestures.

We make use of two kinds of gestures in our application: static and dynamic. Static gestures are gestures where a single hand pose provides enough information to classify the gesture, such as the "Peace" sign. On the other hand, dynamic gestures cannot be detected from a single pose alone and require a sequence of poses to be understood and classified. Examples include gestures that maintain the pose of the hand while moving it ("Swipe Up") and gestures that involve changing the hand posture continuously ("Pinch"). By combining hand motion with continuous pose change, we can create a large number of dynamic gestures.

The architecture of the application follows a modular design. It is separated into logical components, each performing a single task. The Gesture Receiver receives keypoints from the image and passes them on to the Gesture Recognizer, which uses neural networks to classify both static and dynamic gestures, and finally to the Gesture Executor, which executes an action based on the detected gesture.

A key distinguishing aspect of our application is that it is customizable by the end user. Beyond the inbuilt mouse and keyboard functions, it is possible to map gestures to arbitrary desktop actions, including shell scripts. This allows for massive flexibility in how the application can be used: gestures can be mapped to launch other applications, set up environments and so on. In addition, we have provided a way to add new gestures as well, allowing the user to extend the tool as much as required.

2 RELATED WORK
Hand Gesture Recognition from Video. The literature on gesture recognition from static images or video is vast. The solutions vary based on whether (a) the cameras or sensors (single or multiple instances) provide RGB-only images vs depth data (RGB-D), (b) we detect hand keypoints (palm and finger joints) as an intermediate step or perform end-to-end detection directly from video, (c) detecting keypoints is the end goal as opposed to 3D reconstruction of hands, (d) gestures are pre-segmented or must be segmented in real time. For more details, please refer to recent overview articles by Lepetit [8] and Ren et al. [13]. In our approach, we use a monocular RGB video stream, which is fed into a two-phase neural network architecture. The first phase detects hand keypoints from individual video frames (using the off-the-shelf MediaPipe [3] tool) and generates a sequence of hand keypoints, which is used by the second phase to detect both static and dynamic gestures (using our neural models). GestARLite [7] is a light, on-device framework for detecting gestures based on pointing fingers. Our solution is targeted towards desktop computers, can recognize a much larger set of complex dynamic gestures, and can be customized easily.
Gesture Recognition Platforms. Gesture recognition is useful for several applications: controlling virtual interfaces, gaming, embodied AR/VR environments, automotive human-machine interfaces [4], home automation [14], education [15], retail business environments, consumer electronics control and more. Many applications use hand gestures because they enable highly expressive interaction. Although the gesture recognition market is growing rapidly, there are hardly any end-to-end open source gesture recognition platforms. Exceptions include GRT [2], an open-source C++ machine learning library designed for real-time gesture recognition, which provides building blocks for creating custom recognizers. In contrast, we use neural layers as our building blocks and develop models using PyTorch [12].

Programming imperative multi-touch gesture recognizers involves dealing with a low-level, event-driven programming model. Gesture Coder [9] learns from multi-touch gesture examples provided by users, generates imperative recognition code automatically and invokes the corresponding application actions. Oney et al. [11] investigate declarative abstractions for simplifying the programming of multi-touch gestures. In contrast, we recognize gestures end-to-end using deep learning from video examples.

3 DESIGN ARCHITECTURE
We use an open source framework (MediaPipe) [10] to detect the hand keypoints in the images captured by the camera. The MediaPipe module reads data from the camera, processes it and generates keypoints, which are then sent to the Gesture Receiver using ZeroMQ, a messaging queue. On receiving the keypoints, the Gesture Receiver passes them to the Gesture Recognizer, which processes the keypoints into the encoded features and feeds them into the network that detects the output gesture. Finally, the Gesture Receiver sends the detected gesture to the Gesture Executor, which executes an action.

Figure 1: Overview of design architecture
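To make the hand-off between these components concrete, below is a minimal sketch of the keypoint message flow over ZeroMQ in Python. The socket types (PUSH/PULL), endpoint, JSON serialization and the module interfaces (recognizer.classify, executor.execute, mouse_tracker.update) are illustrative assumptions, not Gestop's exact implementation.

import zmq

ENDPOINT = "tcp://127.0.0.1:5556"  # assumed local endpoint

def make_keypoint_sender():
    """MediaPipe side: returns a function that ships one frame of landmarks."""
    sock = zmq.Context.instance().socket(zmq.PUSH)
    sock.connect(ENDPOINT)
    def send(keypoints, handedness):
        # keypoints: list of 21 (x, y, z) landmarks from MediaPipe Hand Tracking
        sock.send_json({"keypoints": keypoints, "handedness": handedness})
    return send

def gesture_receiver_loop(recognizer, mouse_tracker, executor):
    """Gesture Receiver side: controller loop dispatching to the other modules."""
    sock = zmq.Context.instance().socket(zmq.PULL)
    sock.bind(ENDPOINT)
    while True:
        msg = sock.recv_json()                  # one frame of keypoints
        mouse_tracker.update(msg["keypoints"])  # cursor follows the index fingertip
        gesture = recognizer.classify(msg)      # name of the detected gesture, or None
        if gesture is not None:
            executor.execute(gesture)           # run the action mapped to the gesture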
MediaPipe. The first component, which tracks the palms of the user and generates the hand landmarks or keypoints, is built using MediaPipe, a cross-platform framework providing a variety of ML solutions. We utilize MediaPipe's Hand Tracking [16], a high-fidelity hand and finger tracking solution which can infer 21 3D landmarks of a hand from a single frame. The tracking is smooth and handles cases of self-occlusion (the hand covering itself) as well.

Gesture Receiver. The Gesture Receiver is the heart of the application and acts as a controller for the other modules. It receives the keypoints from the MediaPipe module and passes them to the Gesture Recognizer and the Mouse Tracker. The output received (the name of a gesture) is then passed to the Gesture Executor.

Mouse Tracker. The Mouse Tracker tracks the cursor on the screen as the hand moves. By convention, we use the tip of the index finger as the keypoint with which to track the mouse. As the index finger moves, the motion is projected onto the screen and the cursor moves accordingly.

Gesture Recognizer. The Gesture Recognizer module classifies gestures given keypoints. We utilize two neural networks for this, one to detect static gestures and the other to detect dynamic gestures. The details of their structure and training are elaborated upon in the subsequent sections.

Gesture Executor. The input to this module is the name of the gesture which has been recognized by the Gesture Recognizer. This module finds the action mapped to this gesture and then executes it. We include a small set of predefined gestures and actions to cover common use cases.

4 GESTURE RECOGNIZER
The inputs to the Gesture Recognizer are the 21 3D keypoints generated by MediaPipe, each corresponding to a point on the hand. Each keypoint consists of three coordinates (x, y, z). Thus the Gesture Recognizer receives a 63-D input vector. These input vectors are then transformed into the features expected by the neural networks, as described below.

4.1 Static Gestures
Static gestures are gestures which can be described by a single hand pose. Some of the gestures detected by Gestop are shown in Fig. 2, along with the names used to refer to them. The set of static gestures included in the tool can virtually replace the mouse for all uses.

Figure 2: Sample static gestures included with the application (Eight, Four, Hitchhike, Seven, Spiderman)

Feature Computation. The keypoints generated by MediaPipe are transformed and fed into the network. The feature vector is computed by calculating the relative hand vectors: the vector differences between the input keypoints. These relative vectors encode hand pose information in a position-invariant manner, i.e. the same gesture is detected regardless of where the hand is in the webcam's field of vision. For example, the first relative hand vector (from the base of the palm to the first joint of the thumb) can be computed as:

V01_x = V1_x - V0_x
V01_y = V1_y - V0_y
V01_z = V1_z - V0_z

where V0 and V1 represent the 3D coordinates of the points labeled 0 and 1 in Fig. 3, and V01 represents the relative hand vector between them. In summary, we have 16 relative hand vectors (4 for the thumb, 3 for each of the other fingers), each consisting of (x, y, z) coordinates, giving a total of 48 coordinates. Finally, the handedness, i.e. the hand with which the gesture is performed, is appended, and this 49-D vector is fed into the network (Sec. 6).
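A minimal sketch of this feature computation in Python is given below. The exact keypoint pairing (which 16 joint pairs form the relative vectors) and the handedness encoding are our assumptions based on the counts stated above; Gestop's implementation may differ.

import numpy as np

# Assumed joint pairs for the 16 relative vectors: 4 bones for the thumb and
# 3 for each remaining finger, following MediaPipe's 21-landmark hand model.
BONES = [
    (0, 1), (1, 2), (2, 3), (3, 4),       # thumb
    (5, 6), (6, 7), (7, 8),               # index finger
    (9, 10), (10, 11), (11, 12),          # middle finger
    (13, 14), (14, 15), (15, 16),         # ring finger
    (17, 18), (18, 19), (19, 20),         # little finger
]

def static_features(keypoints, handedness):
    """keypoints: (21, 3) array of (x, y, z); handedness: 0.0 (left) or 1.0 (right)."""
    kp = np.asarray(keypoints, dtype=np.float32).reshape(21, 3)
    rel = np.stack([kp[j] - kp[i] for i, j in BONES])    # (16, 3) relative vectors
    return np.concatenate([rel.ravel(), [handedness]])   # 48 + 1 = 49-D feature vector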
Figure 3: The labeled keypoints generated by MediaPipe

Dataset. To train the network, we recorded our own data and created a small dataset. The data was collected in the following manner: the name of the gesture is specified, the MediaPipe component is executed and keypoints are captured. These keypoints are written to a CSV file, along with the given gesture name, for later use. For each gesture, around 2000 samples were taken, which took only a couple of minutes.

While static gestures are simple to perform, they are limited by virtue of the fact that there are only so many distinct poses one can perform with a hand. Hence, dynamic gestures are required.

4.2 Dynamic Gestures
Dynamic gestures are an extension of static gestures and consist of a sequence of poses. These include gestures commonly used on touchscreen devices, such as "Swipe Up", "Pinch" etc.

Dataset. For training dynamic gestures, we make use of the SHREC [5] dataset. This dataset consists of 2800 sequences across 14 gestures, including common ones like "Swipe Up" and "Tap" as well as more complex gestures like "Swipe +". It consists of variable-length sequences performed by multiple people in two ways: either using the whole hand, or just the fingers. To capture more data, Gestop also includes a script to record dynamic gestures.

Feature Computation. To compute the feature vector, each frame of the input sequence is transformed into a vector consisting of the following (a short sketch follows the list):
• The absolute coordinates of the base of the palm, i.e. V0_x and V0_y. These are used because gestures like "Swipe Up", "Swipe Right" etc. involve moving the hand.
• The timediff coordinates of the base of the palm, i.e. the change in position of that coordinate with respect to the previous timestep. This was empirically found to improve the performance of the network.
• Finally, similar to the static case, the coordinates of the relative hand vectors, to capture the pose of the hand.
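The per-frame feature described above can be sketched as follows; the concatenation order and the reuse of the 48 relative-vector coordinates from the static case are assumptions on our part.

import numpy as np

def dynamic_features(frames, relative_vectors):
    """frames: (T, 21, 3) keypoint sequence; relative_vectors: a function mapping
    one (21, 3) frame to its 48 relative-vector coordinates, as in the static case."""
    frames = np.asarray(frames, dtype=np.float32)
    feats = []
    for t in range(1, len(frames)):
        palm = frames[t, 0, :2]                 # absolute (x, y) of the palm base
        timediff = palm - frames[t - 1, 0, :2]  # change w.r.t. the previous timestep
        rel = relative_vectors(frames[t])       # hand pose, as in the static case
        feats.append(np.concatenate([palm, timediff, rel]))
    return np.stack(feats)                      # (T - 1, 2 + 2 + 48)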
5 GESTURE EXECUTOR
The Gesture Executor is the user-facing module. Its responsibility is to take in the recognized gesture, map it to the specified action and then execute it. The mapping of gestures to actions is stored in a human-readable JSON file for easy modification. The format of the file is:

{'gesture-name': ['type', 'func-name']}

Here, gesture-name is the name of a gesture; type is either sh (shell) or py (python), denoting the type of the action to be executed; and func-name is the name of the shell script/command to be executed if the type is sh, or the name of a user-defined function if the type is py. To remap functionality, for example if the user wishes to take a screenshot with Swipe + instead of Grab, the configuration would change from:

"Grab" : ["py", "take_screenshot"],
"Swipe +" : ["py", "no_func"],

to:

"Grab" : ["py", "no_func"],
"Swipe +" : ["py", "take_screenshot"],

Custom Actions. To suit an end user's specific workflow, Gestop allows defining custom actions, e.g. a python function or a shell script, which are executed when the corresponding gesture is detected. For example, to execute a shell script script.sh on the "Tap" gesture, the user may change the mapping in the configuration file to:

{'Tap': ['sh', './script.sh']}
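A minimal sketch of how such a mapping can be loaded and dispatched is shown below. The configuration file name and the way python functions are looked up are hypothetical; Gestop's own executor may resolve actions differently.

import json
import subprocess

def no_func():
    pass                                        # placeholder for an unmapped gesture

def take_screenshot():
    print("screenshot action would run here")   # stand-in for the real action

PY_ACTIONS = {"no_func": no_func, "take_screenshot": take_screenshot}

def load_mappings(path="gesture_mapping.json"):  # assumed file name
    with open(path) as f:
        return json.load(f)                      # {'gesture-name': ['type', 'func-name']}

def execute(gesture, mappings):
    """Dispatch a recognized gesture to its configured action."""
    if gesture not in mappings:
        return
    action_type, func_name = mappings[gesture]
    if action_type == "sh":
        subprocess.run(func_name, shell=True)    # run the shell script/command
    elif action_type == "py":
        PY_ACTIONS[func_name]()                  # call the named python function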
5.1 New Gestures
In our experience, the end user may want to add both new actions and new gestures. Hence, we have provided a method to add data for new gestures (static or dynamic). For static gestures, the same script that was used to collect the initial training data is reused to add more gestures: the new gesture name is provided, and data is recorded and written to disk. For recording dynamic data, we provide a script: for each run, a gesture name is provided and the corresponding gesture is performed repeatedly; after the run, multiple sequences with gesture labels are stored on disk. Data for a new gesture, 'Circle', was collected using this process over the course of 15-20 minutes, demonstrating its feasibility. After adding new gesture data, the network is retrained, and the application is then able to detect the user's custom gestures as well.

6 IMPLEMENTATION AND RESULTS
We use the pytorch-lightning [6] framework to build our neural network classifiers. The implementation is open-sourced [1] with detailed documentation for installation and usage, along with a demo video showcasing Gestop's capabilities.

Static Gestures. To detect static gestures, we utilize a feed-forward neural network classifier with 2 linear layers, which takes in a feature vector and classifies it as one of the available gestures. The network was trained for around 50 epochs, and the confusion matrix of the trained network can be seen in Fig. 4.
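A sketch of such a classifier using pytorch-lightning is shown below. The hidden width, optimizer, learning rate and number of classes are illustrative assumptions; only the two-linear-layer structure and the 49-D input follow from the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class StaticGestureNet(pl.LightningModule):
    """Feed-forward classifier over the 49-D static feature vector."""
    def __init__(self, input_dim=49, hidden_dim=64, num_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))     # raw class scores (logits)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Illustrative usage, assuming a DataLoader of (feature, label) pairs:
# pl.Trainer(max_epochs=50).fit(StaticGestureNet(), train_loader)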
The set of static gestures relevant to an application is much smaller than the set of all possible static gestures. Moreover, it is infeasible to record all unwanted gestures to help our classifier discriminate accurately. We handle this data imbalance as follows. Besides the set of relevant gestures, we introduce a none gesture, which is selected if no relevant gesture is detected. To train our classifier, we capture a variety of unrelated static gestures and label them as none. While this improves classifier performance, we still see many false positives. To solve this problem, we manually calibrate the softmax output of the classifier by scaling the score of the none gesture by a constant k (k = 2 worked well in our experiments).
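The calibration step amounts to up-weighting the none score before taking the argmax, roughly as sketched below; the position of the none class in the output layer is an assumption.

import torch.nn.functional as F

NONE_IDX = 0   # assumed position of the 'none' class in the output layer
K = 2.0        # calibration constant that worked well in our experiments

def calibrated_prediction(logits):
    """Pick a gesture only if its score beats the up-weighted 'none' score."""
    probs = F.softmax(logits, dim=-1).clone()
    probs[..., NONE_IDX] *= K                  # bias the decision towards 'none'
    return probs.argmax(dim=-1)                # index of the predicted gesture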
These optimizations allow our static classifier to achieve high detection accuracy for multiple users and different lighting conditions. We achieve good performance (Fig. 4), with a validation accuracy of 99.12%, and this translates to test time as well. Hand gestures are detected with no noticeable latency.

Figure 4: Confusion Matrix for static gestures

Dynamic Gestures. To detect dynamic gestures, we use a recurrent neural network: a linear layer encodes the incoming features and is connected to a bidirectional GRU. A key issue when detecting dynamic gestures is computing the start and end of a gesture precisely. We circumvent this issue by using a signal key to mark the start and end of the gesture; we utilize the Ctrl key in Gestop. This enables handling variable-length gestures and reduces the number of misclassifications.
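A sketch of this recurrent model is given below. The per-frame feature size (2 + 2 + 48), hidden width and class count (the 14 SHREC gestures plus 'Circle') are illustrative assumptions, and the training hooks would mirror the static classifier sketch above.

import torch
import torch.nn as nn
import pytorch_lightning as pl

class DynamicGestureNet(pl.LightningModule):
    """Linear encoder followed by a bidirectional GRU over a gesture sequence."""
    def __init__(self, input_dim=52, hidden_dim=64, num_classes=15):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_dim); each sequence is delimited by the Ctrl key
        h = torch.relu(self.encoder(x))
        _, last = self.gru(h)                   # final hidden state of each direction
        return self.head(torch.cat([last[0], last[1]], dim=-1))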
In addition to the gestures provided by SHREC, an additional gesture, 'Circle', was added using the aforementioned methods. Despite being a complex gesture, the network was able to detect it accurately during testing, leading us to believe that the network can successfully generalize to other gestures as well.

The confusion matrix for the various gestures is shown in Fig. 5. We observe that dynamic gestures have lower performance than static gestures, with an average accuracy of around 85%. Dynamic gestures are inherently concerned with two factors: the pose of the hand and the displacement of the hand over time. As can be seen from the confusion matrix, gestures which involve displacement of the hand (i.e. the "Swipes") are detected well, whereas those concerned with the orientation of the hand, such as "Tap", have relatively lower accuracy. When testing our models trained on the SHREC dataset, we also observed a domain mismatch problem: the SHREC data was recorded using an Intel RealSense depth camera, whereas the incoming stream during testing is from an RGB camera, causing a loss in accuracy at test time. In our ongoing work, we plan to address both these issues with improved feature computation and training on larger, more diverse datasets.

Figure 5: Confusion Matrix for dynamic gestures

Compared to existing systems like GestARLite [7], Gestop does not require a headset or additional hardware to operate. GRT [2], a gesture recognition toolkit in C++, provides building blocks for users to build a gesture recognition pipeline, whereas Gestop provides a complete pipeline and a simple interface for end users to customize.

7 CONCLUSION
In this paper, we present Gestop, a novel framework for controlling the desktop through hand gestures, which may be customized to the preferences of the end user. In addition to providing a fully functional replacement for the mouse, our framework is easy to extend by adding new custom gestures and actions, allowing the user to use gestures for many more desktop use cases. We aim to improve Gestop further by improving detection accuracy, making incremental training for new gestures efficient, detecting gesture start/end and other subtle user intents, and conducting user studies to measure the usability of the tool.

ACKNOWLEDGMENTS
We would like to thank Vikram Gupta for useful discussions.

REFERENCES
[1] 2020. Gestop. https://github.com/sriramsk1999/gestop.
[2] 2020. Gesture Recognition Toolkit. https://github.com/nickgillian/grt.
[3] 2020. MediaPipe: Cross-platform ML solutions made simple. https://google.github.io/mediapipe/.
[4] Hassene Ben Amara. 2019. End-to-End Multiview Gesture Recognition for Autonomous Car Parking System.
[5] Quentin De Smedt, Hazem Wannous, Jean-Philippe Vandeborre, Joris Guerry, Bertrand Le Saux, and David Filliat. 2017. SHREC'17 track: 3D hand gesture recognition using a depth and skeletal dataset.
[6] WEA Falcon et al. 2019. PyTorch Lightning. GitHub. https://github.com/williamFalcon/pytorch-lightning.
[7] Varun Jain, Gaurav Garg, Ramakrishna Perla, and Ramya Hebbalaguppe. 2019. GestARLite: An On-Device Pointing Finger Based Gestural Interface for Smartphones and Video See-Through Head-Mounts. arXiv:1904.09843.
[8] Vincent Lepetit. 2020. Recent Advances in 3D Object and Hand Pose Estimation. arXiv:2006.05927 [cs.CV].
[9] Hao Lü and Yang Li. Gesture Coder: A Tool for Programming Multi-Touch Gestures by Demonstration. In CHI '12.
[10] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. 2019. MediaPipe: A framework for building perception pipelines. arXiv:1906.08172.
[11] Steve Oney, Rebecca Krosnick, Joel Brandt, and Brad Myers. Implementing Multi-Touch Gestures with Touch Groups and Cross Events. In CHI '19.
[12] Adam Paszke et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS '19.
[13] Bin Ren, Mengyuan Liu, Runwei Ding, and Hong Liu. 2020. A Survey on 3D Skeleton-Based Action Recognition Using Learning Method. arXiv:2002.05907 [cs.CV].
[14] Heinrich Ruser, Susan Vorwerg, and Cornelia Eicher. Making the Home Accessible: Experiments with an Infrared Handheld Gesture-Based Remote Control. In HCI International 2020 - Posters.
[15] Lora Streeter and John Gauch. Detecting Gestures Through a Gesture-Based Interface to Teach Introductory Programming Concepts. In HCI '20, Masaaki Kurosu (Ed.).
[16] Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. 2020. MediaPipe Hands: On-device Real-time Hand Tracking. arXiv:2006.10214.
