Air-Writing Recognition Using Deep Convolutional and Recurrent Neural Network Architectures
Abstract—In this paper, we explore deep learning architectures applied to the air-writing recognition problem, where a person writes text freely in three-dimensional space. We focus on handwritten digits, namely from 0 to 9, which are structured as multidimensional time-series acquired from a Leap Motion Controller (LMC) sensor. We examine both dynamic and static approaches to modeling the motion trajectory, and we train and compare several state-of-the-art convolutional and recurrent architectures. Specifically, we employ a Long Short-Term Memory (LSTM) network and its bidirectional counterpart (BLSTM) to map the input sequence to a vector of fixed dimensionality, which is subsequently passed to a dense layer for classification among the targeted air-handwritten classes. In the second architecture, we adopt 1D Convolutional Neural Networks (CNNs) to encode the input features before feeding them to an LSTM (CNN-LSTM). The third architecture is a Temporal Convolutional Network (TCN) that uses dilated causal convolutions. Finally, a deep CNN architecture that automates feature learning and classification from raw input data is presented. The performance evaluation has been carried out on a dataset of 10 participants, who wrote each digit at least 10 times, resulting in almost 1200 examples.

Index Terms—air-writing, gesture recognition, deep learning, LSTM, CNN, TCN

We acknowledge support of this work by the project “Computational Sciences and Technologies for Data, Content and Interaction” (MIS 5002437), sub-project “Technologies for Content Analysis in Culture”, implemented under the Action “Reinforcement of the Research and Innovation Infrastructure”, funded by the Operational Programme “Competitiveness, Entrepreneurship and Innovation” (NSRF 2014-2020) and co-financed by Greece and the EU (ERDF).

I. INTRODUCTION

Automatic gesture recognition has gained the interest of various research projects due to its convenient and natural characteristics in conveying communication information in Human-Computer Interaction (HCI) related applications such as music interaction [1], computer gaming [2], robotics [3], sign-language interpretation [4] and automotive systems [5], amongst others. Unlike traditional HCI input means, such as moving the mouse, typing on the keyboard, and touching a screen, modern gesture recognition systems typically receive specific body or hand movements in order to control their output, utilizing different types of sensors including gyroscopes, optical sensors, electromyography sensors, etc.

In this regard, the term “air-writing” refers to the process of writing linguistic characters or words in mid-air or free 3D space by hand or finger gestures. Air-writing differs from conventional handwriting, since the latter consists of discrete strokes with pen-up and pen-down motions, while the former lacks such a delimited sequence of writing events. Furthermore, the absence of visual and tactile feedback, as well as the high interdependence between the target gestures, makes the character recognition task even more challenging.

From the user’s point of view, there are three different ways to accomplish the air-writing process, namely the “connected”, “isolated” and “overlapped” methods [6]. Similar to conventional handwriting, with the “connected” method the user may write a sequence of multiple characters in free space, for example from left to right. The second approach has the user perform the writing gesture of each character separately, within the boundaries of an imaginary box in 3D space. Finally, the most complicated and uncommon technique is the “overlapped” one, since it requires writing adjacent characters stacked one over another in the same imaginary box. In general, the size and shape of “connected” characters vary heavily when writing in an unconstrained area, which affects the recognition accuracy; it is difficult even for the same user to repeat the same motion trajectory in the same position. On the contrary, limiting the writing space within an imaginary box facilitates character segmentation, although this type of writing is not natural [7].

The recent advancements in Motion Capture (MoCap) systems, together with the computational power of modern computers, have motivated the HCI research community to examine the application of air-writing in various interfaces. Some systems employ handheld markers [8] or custom-made gloves [9] for capturing the motion trajectory. However, low-cost depth sensors such as the Microsoft Kinect [10], the Intel RealSense [11] and the Leap Motion Controller (LMC) [12] are commonly employed in air-writing proposals, owing to their non-intrusive tracking of human motion trajectories in real time. Kinect-based air-writing interfaces [13] are designed to detect long-range 3D body postures, while the LMC and RealSense sensors provide millimeter-level accuracy and focus on tracking hand and finger motion trajectories [14]–[16]. Even though such sensors facilitate the motion tracking problem, it is still hard to distinguish the actual writing activity from other arbitrary hand movements in the continuous input stream of motion tracking data, considering the absence of a
Authorized licensed use limited to: G Narayanamma Institute of Technology & Science. Downloaded on August 22,2023 at 08:29:08 UTC from IEEE Xplore. Restrictions apply.
digit at least 10 times, resulting in almost 1200 examples. Every recording is a time-series containing the raw tracking information provided by the LMC API at a sampling frequency of 60 Hz. The web interface and the collected data are hosted online and can be accessed at [25].

Fig. 1: An example of the LMC orientation and the web interface.

B. Architectures

A variety of methods were used in order to tackle the problem of digit recognition, following two main approaches: the dynamic and the static. Several experiments were run in order to compare, or even combine, the two strategies.

1) Dynamic Approach: In this approach, the input is represented as a sequence of points in the 2-dimensional plane. It can be expected that by following the trail of the user’s fingertip frame-by-frame, a sequential model can successfully manage the information and make correct predictions. Naturally, since there is more than one way to draw any digit, robustness is an important factor when selecting the models to be tried out. The sequential architectures that were tested are: TCN, LSTM, BLSTM and CNN-LSTM. A “many-to-one” configuration was used in all cases. This means that only the last output was taken into consideration when computing the cross-entropy training loss and, in the case of the LSTM’s bidirectional configuration, the last outputs of both directions were combined by simply concatenating the two vectors, as depicted in Fig. 2.

Fig. 2: Bidirectional LSTM (BLSTM) trained on a zero-padded sequence of 2-dimensional digit trail points.

LSTM: This architecture belongs to the family of RNNs. LSTMs were introduced by Hochreiter & Schmidhuber [26] and constitute a very common architecture for sequence modeling, mainly characterized by their capability to grasp long-term dependencies better than other RNNs. A one-layer LSTM and a bidirectional version of it (BLSTM) were used, with 300 hidden units in each block. A linear fully connected layer with log-softmax regression was used to classify each input into one of 10 classes, one per digit.
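A minimal PyTorch sketch of this many-to-one (B)LSTM read-out, assuming 2-dimensional (x, y) trail points as input; the class name and the use of batch-first tensors are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class AirDigitBLSTM(nn.Module):
    """Bidirectional LSTM over (x, y) trail points with a 'many-to-one'
    read-out: only the final output of each direction is used."""
    def __init__(self, input_size=2, hidden_size=300, num_classes=10):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size,
                            batch_first=True, bidirectional=True)
        # Concatenated forward + backward last outputs -> 2 * hidden_size.
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, x):                    # x: (batch, seq_len, 2)
        out, _ = self.lstm(x)                # (batch, seq_len, 2 * hidden_size)
        h = self.hidden_size
        fwd_last = out[:, -1, :h]            # forward direction, final timestep
        bwd_last = out[:, 0, h:]             # backward direction, final timestep
        feats = torch.cat([fwd_last, bwd_last], dim=1)
        return torch.log_softmax(self.fc(feats), dim=1)
```

Note that in PyTorch's bidirectional output layout the backward direction's final state (having consumed the whole sequence) sits at timestep 0, which is why the two "last" outputs come from opposite ends of the tensor.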
Fig. 3: 1D Convolutional Filtering followed by an LSTM (CNN-LSTM), trained on an oversampled sequence of 2-dimensional digit trail points.

CNN-LSTM: In the architecture described right above, an extra layer that performs 1-dimensional convolutions was added right after the input. For the convolutions, 300 filters were used, with kernel size 2 and stride 1, as illustrated in Fig. 3.

TCN-Dynamic: The family of Temporal Convolutional Network (TCN) architectures employs causal convolutions with dilations and handles sequences of data by mapping the extracted features to a sequence of predictions of the same length as the input. In this work, among the various types of TCNs, the one proposed in [27] was selected, considering its simplicity and the fact that its performance had already been tested on a similar task, namely the Sequential MNIST dataset. In the current configuration, as we move into deeper layers, the dilation factor increases as 2^l for each layer l. We use a kernel of size 7, 25 hidden units per layer and a stack of 6 layers, thus ensuring a receptive field of 64 frames in the past.
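The dilated causal filtering at the heart of this model can be sketched as follows. This is a bare-bones illustration of the dilation schedule (2^l at layer l, kernel size 7, 25 hidden units, 6 layers) and deliberately omits the residual connections and weight normalization of the full architecture in [27]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """A dilated causal convolution: the input is padded on the left only,
    so output[t] depends solely on inputs at times <= t."""
    def __init__(self, in_ch, out_ch, kernel_size=7, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                    # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))     # pad the past, never the future
        return torch.relu(self.conv(x))

def make_tcn(in_ch=2, hidden=25, num_layers=6, kernel_size=7):
    """Stack of causal convolutions whose dilation doubles per layer
    (1, 2, 4, ...), so the temporal receptive field grows exponentially
    with depth while the output length matches the input length."""
    layers, ch = [], in_ch
    for l in range(num_layers):
        layers.append(CausalConv1d(ch, hidden, kernel_size, dilation=2 ** l))
        ch = hidden
    return nn.Sequential(*layers)
```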
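Since the sequences are zero-padded to the batch maximum (see Sec. IV-A), the "many-to-one" read-out must pick each sequence's output at its true final frame rather than at the padded end. One way to express this gather, with illustrative names:

```python
import torch

def last_valid_output(outputs, lengths):
    """Select each sequence's output at its last real (unpadded) frame.

    outputs: (batch, max_len, features) model outputs over padded sequences
    lengths: (batch,) original sequence lengths before zero-padding
    returns: (batch, features)
    """
    idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, outputs.size(2))
    return outputs.gather(1, idx).squeeze(1)
```

The gathered vectors are then the ones passed to the classification layer and the cross-entropy loss.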
Fig. 4: 2D CNN fed with image representations (28x28) of the drawn digits.
2) Static Approach: As for the static approach, a representation of the drawn digits as images was required. To achieve this, each point was mapped to a rectangular area so that binary images of size 28x28 could be created by filling in the corresponding pixels. Min-max normalization was applied to each image, in order to ensure small variance in terms of size and positioning of the depicted digits. Two architectures were tried in this context:

CNN: A model with two 2D convolutional layers, each followed by 2D batch normalization, a ReLU activation function and 2D max pooling. The first layer consists of 16 filters and the second of 32. The kernel size of each convolutional layer is 5 and that of the max pooling is 2. The images were given directly as input to this model. The encoded input was then fed to a 2-layer fully-connected network followed by a softmax function.
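A sketch of this static pipeline; the `rasterize` helper is hypothetical (the text does not specify the exact point-to-pixel mapping), and the zero padding and the 128-unit hidden layer of the classifier are assumptions:

```python
import torch
import torch.nn as nn

def rasterize(points, size=28):
    """Bin a trail of (x, y) points, assumed normalized to [0, 1],
    into a size x size binary image (hypothetical helper)."""
    img = torch.zeros(size, size)
    for x, y in points:
        col = min(int(x * size), size - 1)
        row = min(int((1.0 - y) * size), size - 1)  # y grows upward on screen
        img[row, col] = 1.0
    return img

class DigitCNN(nn.Module):
    """Two conv blocks (16 then 32 filters, 5x5 kernels, batch norm, ReLU,
    2x2 max pooling) and a 2-layer classifier; padding=0 and the hidden
    width of 128 are assumptions not stated in the text."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                    # 28x28 input -> 32 * 4 * 4 features
            nn.Linear(32 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                    # x: (batch, 1, 28, 28)
        return torch.softmax(self.classifier(self.features(x)), dim=1)
```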
TCN-Static: A TCN with 8 layers was used, following Shaojie Bai et al. [27] with respect to the way they applied it to the Sequential MNIST dataset. The array representing each image was unfolded in such a manner that each row is appended to the previous one. This sequence of zeroes and ones (i.e., the values of the image pixels) was the actual input to the TCN.

3) Combining the Static and the Dynamic Approach: Employing pairs of static and dynamic architectures was selected as a tactic in order to test whether such a combination could help surpass obstacles that each approach might encounter. A fuzzy model having an LSTM and a CNN as its building blocks was created. The outputs of the two models (i.e., two 10-dimensional vectors) were concatenated and fed to a 2-layer linear fully-connected network with a final softmax activation layer.
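The fusion step can be sketched as follows, treating the two trained models' 10-dimensional outputs as given inputs; the hidden width of the fully-connected head is an assumption:

```python
import torch
import torch.nn as nn

class FuzzyFusion(nn.Module):
    """Concatenates the 10-dimensional outputs of the dynamic (LSTM) and
    static (CNN) models and classifies the fused 20-dimensional vector
    with a 2-layer fully-connected head."""
    def __init__(self, num_classes=10, hidden=32):  # hidden width: assumption
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, lstm_out, cnn_out):    # each: (batch, 10)
        fused = torch.cat([lstm_out, cnn_out], dim=1)
        return torch.softmax(self.head(fused), dim=1)
```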
IV. EVALUATION AND RESULTS

A. Experimental setup

Among the collected data there were some faulty recordings (i.e., the right hand of the participant was wrongly mapped as the left hand); there were 35 such cases among all the samples. After discarding these samples, the dataset was randomly split into training, validation and test sets, with 200 samples kept for the test set and 100 for the validation set.

The sequence lengths vary strongly among the recordings. For example, the digit “1” usually requires far fewer frames than the other digits. Zero-padding was employed in order to create batches of the same length (i.e., the maximum sequence length in each batch). This choice implies that, while training, one needs to keep track of the sequence lengths in order to take only the last timestep’s output into account.

Each network was developed and trained with PyTorch for 1000 epochs with a batch size of 64 samples. At each epoch the training data were shuffled anew. The validation loss and accuracy indicated the final model parameters to be stored and tested. The experiments were conducted on a computer with 16 GB RAM, an Intel i7-8750H CPU, and a GeForce GTX 1060 6 GB GPU. The training code can be found in [25].

B. Results

After the best model for each architecture had been stored, they were all evaluated on the test set. The accuracy of each model is presented in Table I. Both the static and the dynamic approaches lead to remarkable model performance. The LSTM gives the best results among all architectures, with the BLSTM following at a slightly lower performance. The CNN comes third and outperforms the fuzzy LSTM&CNN model. Then follow the CNN-LSTM and the TCN-Dynamic models, which exhibit the same performance, while the TCN-Static architecture comes last. It should also be noted that the two TCN models exhibit approximately 100% better runtime performance compared with the rest, which goes in tandem with their relatively small number of trainable parameters and the use of dilations.

The extracted confusion matrices indicate that the sources of error of the two best performing models, the LSTM (Fig. 5a) and the CNN (Fig. 5b), each representative of one of the two distinct strategies, were entirely different. One should notice that the LSTM produced only one mismatched example, where a “9” was classified as a “2”. This can be explained by the similarity of the trail (when seen sequentially) of this instance of “9” with the trails of many instances of the digit “2”. As for the CNN, the results present three mismatched samples. In the case of the fuzzy model (LSTM&CNN), a slight decrease
Fig. 5: Confusion matrices during testing: (a) LSTM, (b) CNN, (c) LSTM&CNN.
TABLE II: Test Accuracy for Participant Cross-Validation