
2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR)

Air-Writing Recognition using Deep Convolutional and Recurrent Neural Network Architectures

Grigoris Bastas, Kosmas Kritsis and Vassilis Katsouros
Institute for Language and Speech Processing (ILSP)
Athena R.C.
Athens, Greece
{g.bastas, kosmas.kritsis, vsk}@athenarc.gr

Abstract—In this paper, we explore deep learning architectures applied to the air-writing recognition problem, where a person writes text freely in three-dimensional space. We focus on handwritten digits, namely from 0 to 9, which are structured as multidimensional time-series acquired from a Leap Motion Controller (LMC) sensor. We examine both dynamic and static approaches to model the motion trajectory. We train and compare several state-of-the-art convolutional and recurrent architectures. Specifically, we employed a Long Short-Term Memory (LSTM) network and its bidirectional counterpart (BLSTM) in order to map the input sequence to a vector of fixed dimensionality, which is subsequently passed to a dense layer for classification among the targeted air-handwritten classes. In the second architecture we adopt 1D Convolutional Neural Networks (CNNs) to encode the input features before feeding them to an LSTM neural network (CNN-LSTM). The third architecture is a Temporal Convolutional Network (TCN) that uses dilated causal convolutions. Finally, a deep CNN architecture for automating the feature learning and classification from raw input data is presented. The performance evaluation has been carried out on a dataset of 10 participants, who wrote each digit at least 10 times, resulting in almost 1200 examples.

Index Terms—air-writing, gesture recognition, deep learning, LSTM, CNN, TCN

I. INTRODUCTION

Automatic gesture recognition has gained the interest of various research projects due to its convenient and natural characteristics in conveying communication information in Human-Computer Interaction (HCI) related applications such as music interaction [1], computer gaming [2], robotics [3], sign-language interpretation [4] and automotive systems [5], amongst others. Different from traditional HCI input means, such as moving the mouse, typing on the keyboard, and touching a screen, modern gesture recognition systems usually receive specific body or hand movements in order to control their output by utilizing different types of sensors, including gyroscopes, optical sensors, electromyography sensors, etc.

In this regard, the term "air-writing" refers to the process of writing linguistic characters or words in mid-air or free 3D space by hand or finger gestures. Air-writing differs from conventional handwriting since the latter contains discrete strokes with pen-up and pen-down motions, while the former lacks such a delimited sequence of writing events. Furthermore, the absence of visual and tactile feedback, as well as the high interdependence between the target gestures, makes the character recognition task even more challenging.

From the user's point of view, there are three different ways to accomplish the air-writing process, namely the "connected", "isolated" and "overlapped" methods [6]. Similar to conventional handwriting, with the "connected" method the user may write a sequence of multiple characters in free space, for example from left to right. The second approach allows the user to perform the writing gesture of each character separately, within the boundaries of an imaginary box in 3D space. Finally, the most complicated and uncommon technique is the "overlapped" one, since it requires writing adjacent characters stacked one over another in the same imaginary box. In general, the size and shape of "connected" characters vary heavily when writing in an unconstrained area, which affects the recognition accuracy. It is difficult even for the same user to repeat the same motion trajectory in the same position. On the contrary, limiting the writing space within an imaginary box facilitates character segmentation, although this type of writing is not natural [7].

The recent advancements in Motion Capture (MoCap) systems, in addition to the computational power of modern computers, motivated the HCI research community to examine the application of air-writing in various interfaces. Some systems employ handheld markers [8] or custom-made gloves [9] for capturing the motion trajectory. However, low-cost depth sensors such as the Microsoft Kinect [10], the Intel RealSense [11] and the Leap Motion Controller (LMC) [12] are commonly employed in air-writing proposals due to the absence of intrusiveness while tracking human motion trajectories in real-time. Kinect-based air-writing interfaces [13] are designed to detect long-range 3D body postures, while the LMC and RealSense sensors provide millimeter-level accuracy and focus on tracking hand and finger motion trajectories [14]–[16]. Even though such sensors facilitate the motion tracking problem, it is still hard to distinguish the actual writing activity from other arbitrary hand movements in the continuous input stream of motion tracking data, considering the absence of a

We acknowledge support of this work by the project "Computational Sciences and Technologies for Data, Content and Interaction" (MIS 5002437), sub-project "Technologies for Content Analysis in Culture", implemented under the Action "Reinforcement of the Research and Innovation Infrastructure", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the EU (ERDF).

978-1-7281-9966-5/20/$31.00 ©2020 IEEE
DOI 10.1109/ICFHR2020.2020.00013
Authorized licensed use limited to: G Narayanamma Institute of Technology & Science. Downloaded on August 22, 2023 at 08:29:08 UTC from IEEE Xplore. Restrictions apply.
physical delimitation mechanism. This is a substantial problem, different from the character or word recognition task. Furthermore, the majority of existing air-writing systems focus on letter and digit character recognition due to the lack of available word-level training examples [17].

In this sense, our proposal has been designed to handle both detection and recognition of air-written characters mixed with other movements, without hindering the user experience. Specifically, we developed a web interface based on the LMC sensor for collecting 1200 air-written digits (i.e. 0-9). Our method provides visual feedback to the user and uses an efficient fingertip tracking approach for simulating pen-up and pen-down switching. We model the air-writing recognition process as a classification problem. Both dynamic 3D trajectories (time-series) and their corresponding static 2D stroke projections (images) were used to train and evaluate various deep learning architectures based on convolutional and recurrent operations. The evaluation results present almost perfect recognition rates within only a few milliseconds, indicating that our system can be deployed in a real-time setting.

The rest of the paper is organized as follows. In the next section, we briefly review relevant prior work in air-writing. Section III describes the overall methodology, including the LMC web interface, the dataset acquisition procedure and the considered deep learning architectures. In Section IV, we specify the experimental setup and present the collected results. Finally, the conclusions and future directions are discussed in Section V.

II. RELATED WORK

During the last decade, various interfaces that employ air-based writing have been proposed. Typically, the main interest is the spatiotemporal trajectory information of the hand motion while writing in the air. Hence, the trajectory information of an air-written character can be segregated into two core dimensional representations. In other words, the air-writing recognition problem can be examined either by focusing on the temporal interdependence between the time-steps of the trajectory sequence, or by considering the spatial projections of the motion trajectory on vertical and perpendicular imaginary 3D planes. We call these two approaches "dynamic" and "static", respectively.

By following the static modeling approach, the air-writing trajectory can be represented as a 2D image, resembling traditional handwriting, thus allowing us to leverage techniques that have been applied successfully to the Optical Character Recognition (OCR) problem [18]. On the other hand, the temporal dimension of the motion trajectory is represented by consecutive frames of signal measurements provided by the underlying motion tracking sensor, organized in time-series with a given sampling frequency. Usually, each frame contains features such as the hand's instant velocity and its coordinates relative to the reference system of the sensor, in addition to other application-defined measurements, for instance the angles between consecutive points of the trajectory and inertial angular velocity [6], also known as "hand-crafted" features. In this regard, dynamic gesture recognition approaches [1], [19] can be adopted for the air-writing recognition problem [5].

The most prominent methods for analyzing static trajectory images employ Convolutional Neural Networks (CNNs). For instance, in [8] the authors present a marker-based air-writing framework that uses a generic video camera and applies color-based segmentation for tracking the writing trajectory. A pre-trained CNN model is then employed for classifying the static trajectory images. However, illumination fluctuations affect the color-based segmentation and the overall system performance. A marker-less approach was presented in [20], where the 3D fingertip coordinates provided by an LMC sensor were mapped frame by frame to consecutive trajectory images; after a pre-processing phase, the normalized trajectory images were used to train a CNN model, reporting an average recognition rate of 98.8%. However, deep CNN architectures contain many redundant filter parameters that lead to huge computational cost, limiting their deployability. In this regard, the authors in [21] propose a unified algorithm to effectively prune deep CNN models for in-air handwritten Chinese character recognition with only a 0.17% accuracy drop, compared to the baseline VGG-like [22] CNN that achieves the state-of-the-art accuracy of 95.33%.

Common algorithms for modeling the dynamic aspect of air-writing gestures are Hidden Markov Models (HMM) [6], [9], [15], Dynamic Time Warping (DTW) [13], [14], and Support Vector Machines (SVM) [9]. However, these methods work well only with simple gestures that have low inter-class similarities, thus reporting low accuracy in the air-writing recognition task. To overcome this limitation, Recurrent Neural Networks (RNNs) based on bi-directional Long Short-Term Memory units (BLSTM) have presented significant improvement on the recognition task [17], [23]. Only in recent years have a few fusion-based techniques been proposed, combining a CNN with either LSTM [16] or BLSTM classifiers [7], [24]. These networks outperform all of the previous works since they can combine both spatial and temporal feature learning.

III. METHODOLOGY

A. Data collection

A simple interactive web interface based on the LMC JavaScript version 2 API was designed for collecting air-written digits (0-9), requiring minimal instructions. As depicted in Fig. 1, the LMC is placed in a tabletop orientation in front of a screen that is used to provide optical feedback to the user. Based on a calibration process, an effective air-writing spotting algorithm is applied to identify the start/end points of the dynamic gesture, allowing the user to use the screen as a physical boundary. Furthermore, a blue box on the screen specifies the writing area, which can be adjusted by the user without affecting the representation of the captured frame data. For the dataset collection we considered 10 participants (8 males and 2 females) between 23 and 50 years old, without applying any unistroke or other writing constraints. The participants wrote each

digit at least 10 times, resulting in almost 1200 examples. Every recording is a time-series containing raw tracking information provided by the LMC API with a sampling frequency of 60 Hz. The web interface and the collected data are hosted online and can be accessed in [25].

Fig. 1: An example of the LMC orientation and the web interface.

B. Architectures

A variety of methods were used in order to tackle the problem of digit recognition. There were two main approaches, the dynamic and the static. Several experiments were run in order to compare or even combine the two strategies.

1) Dynamic Approach: In this approach, the input is represented as a sequence of points in the 2-dimensional plane. It can be expected that by following the trail of the user's fingertip frame-by-frame, a sequential model can successfully manage the information and make correct predictions. Naturally, since there is more than one way to draw any digit, robustness is an important factor when selecting the models to be tried out. The sequential architectures that were tested are: TCN, LSTM, BLSTM and CNN-LSTM. A "many-to-one" configuration was used in all cases. This means that only the last output was taken into consideration when computing the cross-entropy training loss and, in the case of the LSTM's bidirectional configuration, the last outputs of both directions were combined by simply concatenating the two vectors, as depicted in Fig. 2.

Fig. 2: Bidirectional LSTM (BLSTM) trained on a zero-padded sequence of 2-dimensional digit trail points.

LSTM: This architecture belongs to the family of RNNs. LSTMs were introduced by Hochreiter & Schmidhuber [26] and comprise a very common architecture for sequential modeling. They are mainly characterized by their capability to grasp long-term dependencies better than other RNNs. A one-layer LSTM and a bidirectional version (BLSTM) of it were used, with 300 hidden units in each block. A linear fully connected layer with log-softmax regression was used to classify each input to one among 10 classes, one for each digit.

CNN-LSTM: In the architecture described right above, an extra layer that performs 1-dimensional convolutions was added right after the input. For the convolutions, 300 filters were used, with kernel size 2 and stride 1, as illustrated in Fig. 3.

Fig. 3: 1D convolutional filtering followed by an LSTM (CNN-LSTM) trained on an oversampled sequence of 2-dimensional digit trail points.

TCN-Dynamic: The family of Temporal Convolutional Network (TCN) architectures employs causal convolutions with dilations and handles sequences of data by mapping the extracted features to a sequence of predictions of the same length as the input. In this work, among the various types of TCNs, the one proposed in [27] was selected, considering its simplicity and the fact that its performance was already tested on a similar task, namely the Sequential MNIST dataset. In the current configuration, as we move into deeper layers, the dilation factor increases for each layer l by 2^l. We use a kernel of size 7, 25 hidden units per layer and a stack of 6 layers, thus ensuring a receptive field of 64 frames in the past.

2) Static Approach: As for the static approach, a representation of the drawn digits as images was required. To achieve this, each point was mapped to a rectangular area so that binary images of size 28x28 could be created by filling in the right pixels. Min-max normalization was applied on each image, in order to ensure small variance in terms of size and positioning of the depicted digits.
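The point-to-pixel mapping just described — min-max normalizing each trail and filling the corresponding pixels of a 28x28 binary image — can be sketched with NumPy. This is an illustration, not the authors' code; the rounding scheme and the y-axis orientation are our assumptions:

```python
import numpy as np

def trail_to_image(points, size=28):
    """Rasterize an (N, 2) array of (x, y) trail points into a binary
    size x size image, after min-max normalizing the trail so that digit
    scale and position variance are reduced."""
    pts = np.asarray(points, dtype=float)
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard against flat strokes
    norm = (pts - mins) / span                      # each coordinate in [0, 1]
    cols = np.clip(np.round(norm[:, 0] * (size - 1)).astype(int), 0, size - 1)
    rows = np.clip(np.round((1.0 - norm[:, 1]) * (size - 1)).astype(int), 0, size - 1)
    img = np.zeros((size, size), dtype=np.uint8)
    img[rows, cols] = 1                             # mark visited pixels
    return img
```

In practice one would rasterize every frame of the 60 Hz trail, so consecutive points land on adjacent pixels and the stroke appears connected.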

Authorized licensed use limited to: G Narayanamma Institute of Technology & Science. Downloaded on August 22,2023 at 08:29:08 UTC from IEEE Xplore. Restrictions apply.
Fig. 4: 2D CNN fed with image representations (28x28) of the drawn digits.
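A PyTorch sketch of the 2D CNN depicted in Fig. 4 follows: two convolutional layers with 16 and 32 filters of kernel size 5, each followed by 2D batch normalization, ReLU and 2x2 max pooling, then a 2-layer fully-connected classifier. The padding and the classifier's hidden width are our assumptions, as the paper does not state them:

```python
import torch
import torch.nn as nn

class StaticCNN(nn.Module):
    """Sketch of the static 2D CNN over 28x28 binary trail images."""

    def __init__(self, n_classes: int = 10, hidden: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),
            nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),  # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),  # raw logits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 28, 28) binary trail images
        return self.classifier(self.features(x))
```

With cross-entropy training the softmax is folded into `nn.CrossEntropyLoss`, so the model emits raw logits.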

Two architectures were tried in this context:

CNN: A model with two 2D convolutional layers, each followed by 2D batch normalization, a ReLU activation function and 2D max pooling. The first layer consisted of 16 filters and the second of 32. The kernel size for each convolutional layer was 5 and the kernel size of the max pooling was 2. The images were given directly as input to this model. The encoded input was then fed to a 2-layer fully-connected network followed by a softmax function.

TCN-Static: A TCN with 8 layers was used, following Bai et al. [27] with respect to the way they applied it to the Sequential MNIST dataset. The array representing the images was unfolded such that each row is appended to the previous one. This sequence of zeroes and ones (i.e. the values of the image pixels) was the actual input to the TCN.

3) Combining the Static and the Dynamic Approach: Employing pairs of static and dynamic architectures was selected as a tactic to test whether such a combination could overcome obstacles that each approach might encounter. A fuzzy model having an LSTM and a CNN as its building blocks was created. The outputs of the two models (i.e. 10-dimensional vectors) were concatenated and fed to a 2-layer linear fully-connected network with a final softmax activation layer.

IV. EVALUATION AND RESULTS

A. Experimental setup

Among the collected data, there were some faulty recordings (i.e. the right hand of the participant was wrongly mapped as the left hand). There were 35 such cases among all the samples. After discarding these samples, the dataset was randomly split into training, validation and test sets. There were 200 samples kept for the test set and 100 for the validation set.

The sequence lengths vary strongly among the recordings. For example, the digit "1" usually requires far fewer frames for its recording than the other digits. Zero-padding was employed in order to create batches of the same length (i.e. the maximum sequence length in each batch). This choice implies that, while training, one needs to keep track of the sequence-length values in order to take only the last timestep's output into account.

Each network was developed and trained with PyTorch for 1000 epochs with a batch size of 64 samples. At each epoch the training data were shuffled anew. The validation loss and accuracy indicated the final model's parameters to be stored and tested. The experiments were conducted on a computer with 16 GB of RAM, an Intel i7-8750H CPU, and a GeForce GTX 1060 6 GB GPU. The training code can be found in [25].

B. Results

After having stored the best models for each architecture, they were all evaluated on the test set. The accuracy of each model is presented in Table I. Both the static and the dynamic approaches lead to remarkable model performance. The LSTM gives the best results among all architectures. The BLSTM follows with slightly lower performance. The CNN comes third and outperforms the fuzzy model LSTM&CNN. Then follow the CNN-LSTM and TCN-Dynamic models, which exhibit the same performance. The TCN-Static architecture comes last. Something that should also be noted is the approximately 100% better runtime performance of the two TCN models compared with the rest. This goes in tandem with their relatively small number of trainable parameters and the use of dilations.

The extracted confusion matrices indicate that the sources of error of the two best performing models, the LSTM (Fig. 5a) and the CNN (Fig. 5b), each being representative of one of the two distinct strategies, were all different. One should notice that the LSTM produced only one mismatched example, where a "9" was classified as a "2". This can be explained by the similarity of the trail (when seen sequentially) of this instance of "9" with the trails of many instances of the digit "2". As for the CNN, the results present three mismatched samples. In the case of the fuzzy model (LSTM&CNN), a slight decrease in accuracy was observed.
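The length bookkeeping that zero-padding imposes — reading only the last valid timestep of each padded sequence — can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' training code; the paper specifies 300 hidden units and a log-softmax classification layer, while the single unidirectional layer and the gather-based read-out are our choices:

```python
import torch
import torch.nn as nn

class TrailLSTM(nn.Module):
    """Many-to-one LSTM over zero-padded 2D trail sequences."""

    def __init__(self, n_classes: int = 10, hidden: int = 300):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, n_classes),
                                  nn.LogSoftmax(dim=-1))

    def forward(self, x: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, 2) zero-padded trails; lengths: true lengths, (batch,)
        out, _ = self.lstm(x)  # (batch, T, hidden)
        # gather each sequence's output at its last *valid* timestep,
        # ignoring the zero-padding tail
        idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, out.size(-1))
        last = out.gather(1, idx).squeeze(1)  # (batch, hidden)
        return self.head(last)
```

`nn.utils.rnn.pack_padded_sequence` is an equivalent way to skip the padding; for the BLSTM variant the paper concatenates the last outputs of both directions before the linear layer.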

Fig. 5: Confusion matrices during testing: (a) LSTM, (b) CNN, (c) LSTM&CNN.

TABLE I: Test Accuracy and Runtime

Models        Accuracy   Runtime   Trainable Parameters
CNN           98.5%      0.72 s    94,344
CNN-LSTM      97.5%      0.94 s    726,310
LSTM          99.5%      0.82 s    367,810
BLSTM         99%        0.85 s    367,810
TCN-Dynamic   97.5%      0.07 s    49,410
TCN-Static    97%        0.09 s    66,910
LSTM&CNN      98%        1.21 s    827,374

Interestingly, one can notice in the corresponding confusion matrix (see Fig. 5c) that although the mismatch of the LSTM was resolved, all of the CNN's mismatches are present, with an additional fourth one.

In the case of the CNN, increasing the number of convolutional layers led to inferior performance. This is due to the rather small image representation of the trails (28x28) and the simplicity of the input values (i.e. 0 or 1). Employing more pixels for the image representation was found to be redundant. Similarly, in the case of trail data, increasing the number of LSTM layers or applying a bidirectional configuration only made the training phase slower without improving the overall performance.

In order to examine the dependence of each model on user-specific features, we use a participant cross-validation protocol: a setup where one participant is left out for testing while the rest are used for training. For each of the 10 participants, we keep 100 samples, 10 from each class. The results presented in Table II reveal a significant decrease in the performance of all models. The best performing ones are those that employ static representations. The CNN outperforms all models. The dynamic recognition strategy proves more prone to errors in this setup. The variability of the recognition rates among different folds is also notable, especially in the case of the dynamic architectures. This behavior leads us to the conclusion that the number of users and samples in our dataset does not suffice to produce generalized, user-independent models.

However, fine-tuning our models by injecting samples of the tested user into the training set (and removing them from the test set) is a strategy that can be used to improve performance. We followed this method for the CNN and LSTM models on participants #5 and #8. Results are depicted in Table III. In all cases, we notice a monotonic increase of the recognition rates.

V. CONCLUSIONS AND FUTURE WORK

The success of our relatively small models in predicting digit labels owes a lot to the strategy followed for the data collection. The fact that so few collected data were enough for achieving such good performance supports this exact point. The use of a vertical plane (e.g. a computer screen), in combination with the distance threshold used to indicate where the drawing is interrupted, should be considered an important contribution of this work.

Furthermore, the experiments that were conducted revealed both certain capabilities and limitations of the two different approaches and of the neural architectures that were tested. It is noteworthy that we were able to point out the potential not only of the commonly employed static approach for handwriting recognition, but also that of a sequential handling of the problem.

As future work, one idea is to enrich our Leap Motion air-writing dataset with letter characters. We can also compare the performance of the different architectures used herein on other publicly available datasets, such as the 6DMG motion and gesture dataset [28]. It would be interesting to experiment with new ways of combining the dynamic and the static approaches, seeking to benefit from the advantages of both. Finally, real-time implementations of the recognition system should be examined, for instance by employing sequential architectures to predict digit labels at faster time intervals, even before the drawing is completed.

ACKNOWLEDGMENT

We would like to thank all the colleagues from the Athena Research Center for participating in the data collection process.

TABLE II: Test Accuracy for Participant Cross-Validation

Fold No   CNN      LSTM&CNN   TCN-Static   LSTM     CNN-LSTM   TCN-Dynamic
0         93%      86%        84%          92%      89%        82%
1         88%      84%        84%          43%      50%        46%
2         85%      83%        79%          58%      67%        41%
3         98%      97%        89%          77%      80%        77%
4         97%      98%        90%          93%      98%        96%
5         72%      66%        65%          74%      75%        54%
6         90%      94%        74%          91%      74%        71%
7         87%      82%        78%          77%      68%        77%
8         97%      96%        94%          91%      96%        86%
9         95%      90%        83%          88%      90%        85%
Average   90.20%   87.60%     82.00%       78.40%   78.70%     71.50%
STD       7.87%    9.71%      8.46%        16.67%   14.97%     18.41%
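The leave-one-participant-out protocol behind Table II can be sketched as a small fold generator; the function and argument names here are illustrative, not the authors':

```python
import numpy as np

def loso_folds(samples, labels, participant_ids):
    """Yield (train, test) folds for leave-one-participant-out evaluation:
    each participant's samples become the test set exactly once, with
    everyone else's samples forming the training set."""
    samples = np.asarray(samples)
    labels = np.asarray(labels)
    pids = np.asarray(participant_ids)
    for pid in np.unique(pids):
        test = pids == pid
        yield ((samples[~test], labels[~test]),
               (samples[test], labels[test]))
```

The fine-tuning experiment of Table III then amounts to moving a few of the held-out participant's samples from the test split back into the training split before fitting.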

TABLE III: Recognition rates of subjects with ids #5 and #8 using CNN and LSTM for different number of samples injected into the training set

Models   Number of samples (per class)   Participant #5   Participant #8
CNN      0                               72.00%           97.00%
CNN      2                               80.00%           98.80%
CNN      5                               86.00%           100.00%
LSTM     0                               74.00%           91.00%
LSTM     2                               87.50%           96.20%
LSTM     5                               88.00%           98.00%

REFERENCES

[1] K. Kritsis, M. Kaliakatsos-Papakostas, V. Katsouros, and A. Pikrakis, "Deep convolutional and lstm neural network architectures on leap motion hand tracking data sequences," in 2019 27th European Signal Processing Conference (EUSIPCO), Sep. 2019, pp. 1–5.
[2] J. Pirker, M. Pojer, A. Holzinger, and C. Gütl, "Gesture-based interactions in video games with the leap motion controller," in Human-Computer Interaction. User Interface Design, Development and Multimodality, M. Kurosu, Ed. Springer, 2017, pp. 620–633.
[3] A. Zlatintsi, I. Rodomagoulakis, P. Koutras, A. C. Dometios, V. Pitsikalis, C. S. Tzafestas, and P. Maragos, "Multimodal signal processing and learning aspects of human-robot interaction for an assistive bathing robot," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 3171–3175.
[4] M. Simos and N. Nikolaidis, "Greek sign language alphabet recognition using the leap motion device," in Proceedings of the 9th Hellenic Conference on Artificial Intelligence. New York, NY, USA: Association for Computing Machinery, 2016.
[5] H. Wu, Y. Wang, J. Liu, J. Qiu, and X. L. Zhang, "User-defined gesture interaction for in-vehicle information systems," Multimedia Tools and Applications, vol. 79, no. 1, pp. 263–288, 2020.
[6] M. Chen, G. AlRegib, and B. Juang, "Air-writing recognition—part i: Modeling and recognition of characters, words, and connecting motions," IEEE Transactions on Human-Machine Systems, vol. 46, no. 3, pp. 403–413, June 2016.
[7] B. Yana and T. Onoye, "Fusion networks for air-writing recognition," in MultiMedia Modeling, K. Schoeffmann, T. H. Chalidabhongse, C. W. Ngo, S. Aramvith, N. E. O'Connor, Y.-S. Ho, M. Gabbouj, and A. Elgammal, Eds. Cham: Springer, 2018, pp. 142–152.
[8] P. Roy, S. Ghosh, and U. Pal, "A cnn based framework for unistroke numeral recognition in air-writing," in 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Aug 2018, pp. 404–409.
[9] C. Amma, M. Georgi, and T. Schultz, "Airwriting: a wearable handwriting recognition system," Personal and Ubiquitous Computing, vol. 18, no. 1, pp. 191–203, Feb. 2013.
[10] Kinect for windows. Accessed April 2020. [Online]. Available: https://developer.microsoft.com/en-us/windows/kinect/
[11] Intel realsense depth and tracking cameras. Accessed April 2020. [Online]. Available: https://www.intelrealsense.com/
[12] Ultraleap: Digital worlds that feel human. Accessed April 2020. [Online]. Available: https://www.ultraleap.com/
[13] S. Poularakis and I. Katsavounidis, "Low-complexity hand gesture recognition system for continuous streams of digits and letters," IEEE Transactions on Cybernetics, vol. 46, no. 9, pp. 2094–2108, Sep. 2016.
[14] R. Aggarwal, S. Swetha, A. M. Namboodiri, J. Sivaswamy, and C. V. Jawahar, "Online handwriting recognition using depth sensors," in 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Aug 2015, pp. 1061–1065.
[15] M. Chen, G. AlRegib, and B. Juang, "Air-writing recognition—part ii: Detection and recognition of writing activity in continuous stream of motion data," IEEE Transactions on Human-Machine Systems, vol. 46, no. 3, pp. 436–444, June 2016.
[16] M. S. Alam, K.-C. Kwon, M. A. Alam, M. Y. Abbass, S. M. Imtiaz, and N. Kim, "Trajectory-based air-writing recognition using deep neural network and depth sensor," Sensors, vol. 20, no. 2, p. 376, Jan 2020.
[17] J. Gan and W. Wang, "In-air handwritten english word recognition using attention recurrent translator," Neural Computing and Applications, vol. 31, no. 7, pp. 3155–3172, Nov. 2017.
[18] Y. LeCun, "Neural networks and gradient-based learning in ocr," in Neural Networks for Signal Processing VII. Proceedings of the 1997 IEEE Signal Processing Society Workshop, Sep. 1997, pp. 255–255.
[19] M. Maghoumi and J. J. LaViola, "Deepgru: Deep gesture recognition utility," in Advances in Visual Computing, G. Bebis, R. Boyle, B. Parvin, D. Koracin, D. Ushizima, S. Chai, S. Sueda, X. Lin, A. Lu, D. Thalmann, C. Wang, and P. Xu, Eds. Springer International Publishing, 2019, pp. 16–31.
[20] J. Hu, C. Fan, and Y. Ming, "Trajectory image based dynamic gesture recognition with convolutional neural networks," in 2015 15th International Conference on Control, Automation and Systems (ICCAS), Oct 2015, pp. 1885–1889.
[21] J. Gan, W. Wang, and K. Lu, "Compressing the cnn architecture for in-air handwritten chinese character recognition," Pattern Recognition Letters, vol. 129, pp. 190–197, 2020.
[22] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014.
[23] V. Frinken and S. Uchida, "Deep blstm neural networks for unconstrained continuous handwritten text recognition," in 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Aug 2015, pp. 911–915.
[24] M. Arsalan and A. Santra, "Character recognition in air-writing based on network of radars for human-machine interface," IEEE Sensors Journal, vol. 19, no. 19, pp. 8855–8864, Oct 2019.
[25] Leap motion air-writing web interface, dataset and training code. [Online]. Available: https://github.com/kosmasK/air-writing-recognition
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[27] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," arXiv preprint arXiv:1803.01271, 2018.
[28] M. Chen, G. AlRegib, and B.-H. Juang, "6dmg: A new 6d motion gesture database," in Proceedings of the 3rd Multimedia Systems Conference. New York, NY, USA: Association for Computing Machinery, 2012, pp. 83–88.
