Gestop: Customizable Gesture Control of Computer Systems
Sriram S K and Nishant Sinha
ABSTRACT
Gesture-based systems have seen low adoption among end-users primarily due to numerous technical hurdles in detecting in-air gestures accurately. This paper presents Gestop, a framework developed to bridge this gap. The framework learns to detect gestures from demonstrations, is customizable by end-users and enables users to interact in real-time, using gestures, with computers having only RGB cameras.

CCS CONCEPTS
• Human-centered computing → Gestural input; • Computing methodologies → Tracking.

KEYWORDS
hand gesture, MediaPipe, neural networks, pytorch

ACM Reference Format:
Sriram S K and Nishant Sinha. 2021. Gestop: Customizable Gesture Control of Computer Systems. In 8th ACM IKDD CODS and 26th COMAD (CODS COMAD 2021), January 2–4, 2021, Bangalore, India. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3430984.3430993
1 INTRODUCTION
Hand detection and gesture recognition have a broad range of potential applications, including in-car gestures, sign language recognition, virtual reality and so on. Through gestures, users can control or interact with devices without touching them. Although numerous gesture recognition prototypes and tutorials are available across the web, they handle only a restricted set of gestures and lack proper architecture design and description, making it hard for end-users to use and build upon them.

In this paper, we present the architecture and implementation of our end-to-end, extensible system which allows users to control the desktop in real-time using hand gestures. Our tool works across different hardware (CPU/GPU) and operating systems and relies on a medium-resolution camera to detect gestures. The tool controls the desktop through hand gestures alone, replacing all mouse actions with gestures, as well as many keyboard shortcuts. Furthermore, the gestures it detects fall into two categories, static and dynamic. Static gestures are gestures where a single hand pose provides enough information to classify the gesture, such as the "Peace" sign. On the other hand, dynamic gestures cannot be detected from a single pose alone, and require a sequence of poses to be understood and classified. Examples include gestures maintaining the pose of the hand while moving it ("Swipe Up"), or gestures which involve changing hand posture continuously ("Pinch"). By combining hand motion with continuous pose change, we can create a large number of dynamic gestures.
The architecture of the application follows a modular design. It is separated into logical components, each performing a single task. The Gesture Receiver receives keypoints from the image and passes them on to the Gesture Recognizer, which uses neural networks to classify both static and dynamic gestures, and finally the Gesture Executor executes an action based on the detected gesture.

A key distinguishing aspect of our application is that it is customizable by the end-user. Other than the inbuilt mouse and keyboard functions, it is possible to map gestures to arbitrary desktop actions, including shell scripts. This allows for massive flexibility in how the application can be used. Gestures can be mapped to launch other applications, set up environments and so on. In addition, we have provided a way to add new gestures as well, allowing the user to extend the tool as much as required.
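For illustration, such a user-editable mapping could be expressed as simply as the Python dictionary below; the gesture names, built-in action identifiers and script path are hypothetical and do not reflect Gestop's actual configuration format.

```python
# Hypothetical gesture-to-action mapping. Built-in identifiers (e.g.
# "left_click") would be handled internally; any other string is treated
# as a shell command, so gestures can launch applications or scripts.
GESTURE_ACTIONS = {
    "Peace":    "left_click",                    # built-in mouse action
    "Swipe Up": "scroll_up",                     # built-in scroll action
    "Pinch":    "firefox https://example.org",   # launch an application
    "Circle":   "sh ~/scripts/setup_env.sh",     # run a user shell script
}
```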
2 RELATED WORK
Hand Gesture Recognition from Video. The literature on gesture recognition from static images or video is vast. The solutions vary based on whether (a) the cameras or sensors (single or multiple instances) provide RGB-only images vs depth data (RGB-D), (b) we detect hand keypoints (palm and finger joints) as an intermediate step or perform end-to-end detection directly from video, (c) detecting keypoints is the end-goal as opposed to 3D reconstruction of hands, (d) gestures are pre-segmented or must be segmented in real-time. For more details, please refer to recent overview articles by Lepetit [8] and Ren et al. [13]. In our approach, we use a monocular RGB video stream, which is fed into a two-phase neural network architecture. The first phase detects hand keypoints from individual video frames (using the off-the-shelf MediaPipe [3] tool) and generates a sequence of hand keypoints, which is used by the second phase to detect both static and dynamic gestures (using our neural models). GestARLite [7] is a light, on-device framework for detecting gestures based on pointing fingers. Our solution is targeted towards desktop computers, can recognize a much larger set of complex dynamic gestures and can be customized easily.
Gesture Recognition Platforms. Gesture recognition is useful for several applications: controlling virtual interfaces, gaming, embodied AR/VR environments, automotive human-machine interfaces [4], home automation [14], education [15], retail business environments, consumer electronics control and more. Many applications use hand gestures because they enable highly expressive interaction. Although the gesture recognition market is growing rapidly, there are hardly any end-to-end open source gesture recognition platforms. Exceptions include GRT [2], an open-source C++ machine learning library designed for real-time gesture recognition, which provides building blocks for creating custom recognizers. In contrast, we use neural layers as our building blocks and develop models using PyTorch [12].

Programming imperative multi-touch gesture recognizers involves dealing with a low-level, event-driven programming model. Gesture Coder [9] learns from multi-touch gesture examples provided by users, generates imperative recognition code automatically and invokes corresponding application actions. Oney et al. [11] investigate declarative abstractions for simplifying the programming of multi-touch gestures. In contrast, we recognize gestures end-to-end using deep learning from video examples.
3 DESIGN ARCHITECTURE
We use an open source framework (MediaPipe) [10] to detect the hand keypoints in the images captured by the camera. The MediaPipe module reads data from the camera, processes it and generates keypoints, which are then sent to the Gesture Receiver using ZeroMQ, a messaging queue. On receiving the keypoints, the Gesture Receiver passes them to the Gesture Recognizer, which processes the keypoints into encoded features and feeds them into the network that detects the output gesture. Finally, the Gesture Receiver sends the detected gesture to the Gesture Executor, which executes an action.

Gesture Receiver. The Gesture Receiver receives the keypoints from the MediaPipe module, and then passes them to the Gesture Recognizer and the Mouse Tracker. The output received (the name of a gesture) is then passed to the Gesture Executor.
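For concreteness, the following minimal sketch shows how such a keypoint stream could be consumed over ZeroMQ in Python; the PULL socket, port number and JSON serialization are assumptions for illustration and may differ from Gestop's actual wire format.

```python
import json
import zmq

def receive_keypoints(port=5556):
    """Minimal sketch of the Gesture Receiver's input loop. Assumes the
    MediaPipe process pushes JSON-serialized keypoints over a local PULL
    socket; the socket type, port and wire format in Gestop may differ."""
    context = zmq.Context()
    socket = context.socket(zmq.PULL)
    socket.bind(f"tcp://127.0.0.1:{port}")
    while True:
        msg = socket.recv()              # blocks until a frame arrives
        keypoints = json.loads(msg)      # 21 landmarks, each (x, y, z)
        yield keypoints

# Each yielded frame would then be handed to the Mouse Tracker and the
# Gesture Recognizer, and the recognized gesture name to the Gesture Executor.
```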
Mouse Tracker. The Mouse Tracker tracks the cursor on the screen as the hand moves. By convention, we use the tip of the index finger as the keypoint with which to track the mouse. As the index finger moves, the motion is projected onto the screen and the cursor moves accordingly.
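A minimal sketch of this projection is given below, assuming MediaPipe's normalized image coordinates and the pyautogui library for cursor control; the smoothing factor is an illustrative choice rather than Gestop's exact implementation.

```python
import pyautogui

INDEX_TIP = 8  # MediaPipe's landmark index for the tip of the index finger

class MouseTracker:
    """Sketch of the Mouse Tracker: project the normalized index-fingertip
    position onto the screen and move the cursor."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha                         # exponential-smoothing factor
        self.screen_w, self.screen_h = pyautogui.size()
        self.x, self.y = 0.5, 0.5                  # start at the screen centre

    def update(self, keypoints):
        x, y, _z = keypoints[INDEX_TIP]            # normalized image coordinates
        self.x = self.alpha * x + (1 - self.alpha) * self.x
        self.y = self.alpha * y + (1 - self.alpha) * self.y
        pyautogui.moveTo(int(self.x * self.screen_w), int(self.y * self.screen_h))
```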
Gesture Recognizer. The Gesture Recognizer module classifies gestures given keypoints. We utilize two neural networks for this, one to detect static gestures and the other to detect dynamic gestures. The details of their structure and training are elaborated upon in the subsequent sections.

Gesture Executor. The input to this module is the name of the gesture recognized by the Gesture Recognizer. The module finds the action mapped to this gesture and then executes it. We include a small set of predefined gestures and actions to cover common use cases.
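The sketch below illustrates this dispatch logic, assuming a gesture-to-action mapping like the one shown in the Introduction; the built-in action names and the use of pyautogui and subprocess are illustrative assumptions, not Gestop's exact implementation.

```python
import subprocess
import pyautogui

def execute_gesture(gesture_name, gesture_map):
    """Sketch of the Gesture Executor: look up the recognized gesture in a
    user-editable mapping and run the associated action."""
    action = gesture_map.get(gesture_name)
    if action is None:
        return                              # unmapped or 'none' gesture
    if action == "left_click":              # hypothetical built-in mouse action
        pyautogui.click()
    elif action == "scroll_up":             # hypothetical built-in scroll action
        pyautogui.scroll(5)
    else:                                   # anything else: a shell command
        subprocess.Popen(action, shell=True)
```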
4 GESTURE RECOGNIZER
The inputs to the Gesture Recognizer are the 21 3D keypoints generated by MediaPipe, each corresponding to a point on the hand. Each keypoint consists of three coordinates (𝑥, 𝑦, 𝑧). Thus the Gesture Recognizer receives a 63-D input vector. These input vectors are then transformed into the features expected by the neural networks, as described below.
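As a minimal sketch, the flattening step could look as follows; any further feature encoding used by Gestop (e.g. normalization relative to a reference landmark) is omitted here.

```python
import numpy as np

def keypoints_to_vector(keypoints):
    """Flatten MediaPipe's 21 (x, y, z) hand landmarks into the 63-D input
    vector described above."""
    arr = np.asarray(keypoints, dtype=np.float32)   # shape (21, 3)
    assert arr.shape == (21, 3)
    return arr.reshape(-1)                          # shape (63,)
```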
4.1 Static Gestures
Static gestures are gestures which can be described by a single hand pose. Some of the gestures detected by Gestop are shown in Fig. 2, along with the names used to refer to them. The set of static gestures included in the tool can virtually replace the mouse for all uses.
Figure 4: Confusion Matrix for static gestures

Figure 5: Confusion Matrix for dynamic gestures
We handle this data imbalance as follows. Besides the set of relevant gestures, we introduce a none gesture, which is selected if no relevant gesture is detected. To train our classifier, we capture a variety of unrelated static gestures and label them as none. While this improves classifier performance, we still see many false positives. To solve this problem, we manually calibrate the softmax output of the classifier by scaling the score of the none gesture by a constant 𝑘 (𝑘 = 2 worked well for our experiments).

These optimizations allow our static classifier to achieve high detection accuracy for multiple users and different lighting conditions. We achieve good performance (Fig. 4), with a validation accuracy of 99.12%, and this translates to test time as well. Hand gestures are detected with no noticeable latency.
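A minimal sketch of this calibration step, applied to the classifier's output logits, is shown below; the index of the none class is assumed to be known.

```python
import torch
import torch.nn.functional as F

def calibrated_prediction(logits, none_idx, k=2.0):
    """Scale the softmax score of the 'none' class by a constant k before
    taking the argmax, as described above (k = 2 worked well for the authors).
    `logits` has shape (num_classes,)."""
    probs = F.softmax(logits, dim=-1).clone()
    probs[none_idx] *= k            # boost 'none' to suppress false positives
    return torch.argmax(probs).item()

# Usage: class 0 is 'none'; even though the raw classifier slightly prefers
# class 1, the calibration tips the decision back to 'none'.
logits = torch.tensor([1.2, 1.4, 0.3])
print(calibrated_prediction(logits, none_idx=0))  # prints 0
```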
Dynamic Gestures. To detect dynamic gestures, we use a recurrent neural network, which consists of a linear layer to encode the incoming features, connected to a bidirectional GRU. A key issue when detecting dynamic gestures is computing the start and end of a gesture precisely. We circumvent this issue by using a signal key to mark the start and end of the gesture; we utilize the Ctrl key in Gestop. This enables handling variable-length gestures as well as reducing the number of misclassifications. In addition to the gestures provided by SHREC [5], an additional gesture, 'Circle', was added using the aforementioned methods. Despite being a complex gesture, the network was able to detect it accurately during testing, leading us to believe that the network can generalize to other gestures as well. The confusion matrix for the various gestures is shown in Fig. 5. We observe that dynamic gestures have lower performance than static gestures, with an average accuracy of around 85%. Dynamic gestures are inherently concerned with two factors: the pose of the hand and the displacement of the hand over time. As can be seen from the confusion matrix, gestures which involve displacement of the hand (i.e. the "Swipes") are detected well, whereas those concerned with the orientation of the hand, such as "Tap", have relatively lower accuracy. When testing our models trained on the SHREC dataset, we also observed a domain mismatch problem. The data from SHREC was recorded using an Intel RealSense depth camera, whereas the incoming stream during testing is from an RGB camera, causing a loss in accuracy during testing. In our ongoing work, we plan to address both these issues with improved feature computation and training on larger, diverse datasets.
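A sketch of such a model in PyTorch is shown below; the hidden sizes, activation and classification head are illustrative choices rather than the exact configuration used in Gestop.

```python
import torch
import torch.nn as nn

class DynamicGestureNet(nn.Module):
    """Sketch of the dynamic-gesture classifier: a linear encoder feeding a
    bidirectional GRU, with a linear head over the final time step."""

    def __init__(self, input_dim=63, enc_dim=128, hidden_dim=128, num_classes=15):
        super().__init__()
        self.encoder = nn.Linear(input_dim, enc_dim)   # encode per-frame features
        self.gru = nn.GRU(enc_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_dim), a variable-length keypoint sequence
        # delimited in Gestop by pressing and releasing the Ctrl key.
        h = torch.relu(self.encoder(x))
        out, _ = self.gru(h)                 # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(out[:, -1])   # logits from the final time step

# Example: classify a 30-frame sequence of 63-D keypoint vectors.
model = DynamicGestureNet()
logits = model(torch.randn(1, 30, 63))
print(logits.shape)  # torch.Size([1, 15])
```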
Compared to existing systems like GestARLite [7], Gestop doesn't require a headset or additional hardware to operate. GRT [2], a gesture recognition toolkit in C++, provides building blocks for users to build a gesture recognition pipeline, whereas Gestop provides a complete pipeline and a simple interface for end users to customize.

7 CONCLUSION
In this paper, we present Gestop, a novel framework for controlling the desktop through hand gestures, which may be customized to the preferences of the end-user. In addition to providing a fully functional replacement for the mouse, our framework is easy to extend by adding new custom gestures and actions, allowing the user to use gestures for many more desktop use cases. We aim to improve Gestop further by improving detection accuracy, making incremental training for new gestures efficient, detecting gesture start/end and other subtle user intents, and conducting user studies to measure the usability of the tool.

ACKNOWLEDGMENTS
We would like to thank Vikram Gupta for useful discussions.
REFERENCES
[1] 2020. Gestop. https://github.com/sriramsk1999/gestop.
[2] 2020. Gesture Recognition Toolkit. https://github.com/nickgillian/grt.
[3] 2020. MediaPipe: Cross-platform ML solutions made simple. https://google.github.io/mediapipe/.
[4] Hassene Ben Amara. 2019. End-to-End Multiview Gesture Recognition for Autonomous Car Parking System.
[5] Quentin De Smedt, Hazem Wannous, Jean-Philippe Vandeborre, Joris Guerry, Bertrand Le Saux, and David Filliat. 2017. SHREC'17 Track: 3D Hand Gesture Recognition Using a Depth and Skeletal Dataset.
[6] William Falcon et al. 2019. PyTorch Lightning. https://github.com/williamFalcon/pytorch-lightning.
[7] Varun Jain, Gaurav Garg, Ramakrishna Perla, and Ramya Hebbalaguppe. 2019. GestARLite: An On-Device Pointing Finger Based Gestural Interface for Smartphones and Video See-Through Head-Mounts. arXiv abs/1904.09843 (2019).
[8] Vincent Lepetit. 2020. Recent Advances in 3D Object and Hand Pose Estimation. arXiv:2006.05927 [cs.CV]
[9] Hao Lü and Yang Li. 2012. Gesture Coder: A Tool for Programming Multi-Touch Gestures by Demonstration. In CHI '12.
[10] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. 2019. MediaPipe: A Framework for Building Perception Pipelines. arXiv preprint arXiv:1906.08172 (2019).
[11] Steve Oney, Rebecca Krosnick, Joel Brandt, and Brad Myers. 2019. Implementing Multi-Touch Gestures with Touch Groups and Cross Events. In CHI '19.
[12] Adam Paszke et al. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS '19.
[13] Bin Ren, Mengyuan Liu, Runwei Ding, and Hong Liu. 2020. A Survey on 3D Skeleton-Based Action Recognition Using Learning Method. arXiv:2002.05907 [cs.CV]
[14] Heinrich Ruser, Susan Vorwerg, and Cornelia Eicher. 2020. Making the Home Accessible - Experiments with an Infrared Handheld Gesture-Based Remote Control. In HCI International 2020 - Posters.
[15] Lora Streeter and John Gauch. 2020. Detecting Gestures Through a Gesture-Based Interface to Teach Introductory Programming Concepts. In HCI '20, Masaaki Kurosu (Ed.).
[16] Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. 2020. MediaPipe Hands: On-device Real-time Hand Tracking. arXiv preprint arXiv:2006.10214 (2020).