
JOURNAL OF ELECTROMAGNETIC ENGINEERING AND SCIENCE, VOL. 22, NO. 3, 335-343, MAY 2022
https://doi.org/10.26866/jees.2022.3.r.95
ISSN 2671-7263 (Online) ∙ ISSN 2671-7255 (Print)

Dop-DenseNet: Densely Convolutional Neural Network-Based Gesture Recognition Using a Micro-Doppler Radar

Hai Le1 ∙ Van-Phuc Hoang1,* ∙ Van Sang Doan2 ∙ Dai Phong Le1

Abstract

Hand gesture recognition is an efficient and practical solution for non-contact human-machine interaction in smart devices. To date, vision-based methods are widely used in this research area, but they are susceptible to light conditions. To address this issue, radar-based gesture recognition using micro-Doppler signatures can be applied as an alternative. Accordingly, the use of a novel densely convolutional neural network model, Dop-DenseNet, is proposed in this paper for improving hand gesture recognition in terms of classification accuracy and latency. The model was designed with cross or skip connections in a dense architecture so that the former features, which can be lost in the forward-propagation process, can be reused. We evaluated our model with different numbers of filter channels and experimented with it using the Dop-Net dataset, with different time lengths of input data. As a result, it was found that the model with 64 3 × 3 filters and 200 time bins of micro-Doppler spectrogram data could achieve the best performance trade-off, with 99.87% classification accuracy and 3.1 ms latency. In comparison, our model remarkably outperformed the selected state-of-the-art neural networks (GoogLeNet, ResNet-50, NasNet-Mobile, and MobileNet-V2) using the same Dop-Net dataset.

Key Words: Convolutional Neural Network, Hand Gesture Recognition, Micro-Doppler Radar.

I. INTRODUCTION

Gesture recognition is a hot topic in the field of computer vision and language technology, with the goal of interpreting human gestures using mathematical algorithms [1]. By understanding human actions, a device equipped with sensors can react as expected, such as by controlling a smart TV, a gaming box, a robot, or a computer. Nowadays, gesture recognition is popularly used for emotion recognition using vision-based methods [2, 3]. Despite achieving remarkable performance in terms of recognition accuracy, vision-based approaches are susceptible to failure under weak light conditions. To address this issue, Tan and Triggs [4] combined the strengths of robust illumination normalization, local texture-based face representations, distance-transform-based matching, kernel-based feature extraction, and feature fusion. As a result, the method achieved an 88.1% face verification rate and a 0.1% false accept rate on the challenging FRGC-204 dataset. However, this accuracy level is still low compared to those of other approaches.
Manuscript received July 14, 2021; Revised September 6, 2021; Accepted September 22, 2021. (ID No. 20210714-079J)
1 Institute of System Integration, Le Quy Don Technical University, Hanoi, Vietnam.
2 Vietnam Naval Academy, Nha Trang, Vietnam.
* Corresponding Author: Van-Phuc Hoang (e-mail: [email protected])

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
ⓒ Copyright The Korean Institute of Electromagnetic Engineering and Science.


An alternative to vision-based methods is the use of the frequency-modulated continuous-wave (FMCW) radar, which can provide micro-Doppler features for gesture recognition [5]. Accordingly, Malysa et al. [6] used the hidden Markov model (HMM) for gesture recognition in their study, based on a 77-GHz FMCW radar system. By using micro-Doppler spectrogram images, the HMM can classify four hand gestures with up to 82.3% overall accuracy within 30 frames. However, the study was performed with a single target and small gesture classes, which are insufficient challenges for gesture recognition tasks. In a scenario involving multiple moving subjects, Peng et al. [7] investigated the effectiveness of a 5.8-GHz FMCW radar with range-Doppler processing in recognizing human gestures. They demonstrated that portable FMCW radars could recognize human gestures in the presence of multiple moving people. Nevertheless, no gesture identification model was presented in this study. With a similar processing method of range-Doppler maps from raw signals of an FMCW radar, Ryu et al. [8] introduced a feature-based gesture recognition method using a quantum-inspired evolutionary algorithm (QEA) to enhance gesture recognition accuracy. On a dataset of seven hand motion classes, gesture recognition was performed using the features selected by QEA, and 85.81% classification accuracy was obtained, higher than those of the k-nearest neighbors (81.43%) and random forest (RF, 83.33%) classifiers. As expected, the handcrafted feature extraction approaches revealed a disadvantage: their classifiers work only with small datasets and cannot cover all realistic conditions. In contrast, deep neural networks (DNNs) can automatically extract useful features to improve classification accuracy. In particular, Vandersmissen et al. [9] proposed a robust feature learning approach based on deep convolutional neural networks (DCNNs) for use in identifying persons on the basis of their gait characteristics through the micro-Doppler signatures of a low-power FMCW radar device. In the study, the DCNN model significantly outperformed the support vector machine (SVM) and RF approaches by a margin of 17%. Moreover, in the experiment with larger time windows, the DCNN model was able to further lower the error rate to 0% for windows above 25 seconds. In another study, a multidimensional parameter dataset was used for hand gesture recognition [10]. Specifically, Range-Frame-Map, Doppler-Frame-Map, and Angle-Frame-Map were included in the datasets for a meta-learning-based multi-branch network. As a result, the network obtained 97.3% recognition accuracy with seven gestures. However, the multidimensional parameters lead to high computational complexity in the preprocessing stage and lengthen the model execution time. To improve the efficiency of hand gesture recognition systems based on the 60-GHz FMCW radar, Lee et al. [11] proposed a three-dimensional (3D) CNN with an Inception structure to process the range-Doppler matrix sequence, which yielded 98.8% average accuracy. Although the 3D CNN model provided high precision, it suffered from a computational burden. A human gesture recognition task was also built on an edge-computing platform combined with an FMCW radar sensor [12]. The system, including the radar and an NVIDIA Jetson Nano, was embedded. The CNN model achieved 98.47% and 93.11% average accuracy using gestures from taught and untaught subjects, respectively. In addition, a combination of long short-term memory and CNN has also been employed to facilitate gesture recognition [13].

Despite remarkable achievements in gesture recognition, the aforementioned approaches still have one or more of the following drawbacks: a simple experiment context, high computational cost, or slow execution time. Thus, there is still room for researching and developing a novel DCNN model that can improve human gesture recognition performance in terms of recognition accuracy and computational complexity. Motivated by the Dop-Net dataset used in the study by Ritchie et al. [14], we propose the use of a dense CNN model, the so-called Dop-DenseNet, to improve hand gesture recognition accuracy and reduce time-execution latency. Dop-DenseNet was designed with cross and skip connections to leverage the combination of the extracted features with the former ones for achieving high accuracy during the training process. The proposed model was evaluated using different numbers of filter channels and different time lengths of input data. Afterward, the Dop-DenseNet model with 64 3×3 filters was chosen to compete with several state-of-the-art networks: GoogLeNet [15], ResNet-50 [16], NasNet-Mobile [17], and MobileNet-V2 [18]. Consequently, our model remarkably outperforms the other considered models in terms of accuracy and execution time.

Thus, the following are the main contributions of our study:
ㆍWe propose the Dop-DenseNet model and evaluate it using different configurations, such as changing the number of filter channels and the time lengths of the input data.
ㆍWe compare our model with the other considered models in terms of recognition accuracy, execution time, and memory cost on the Dop-Net dataset.

The remainder of this paper is organized as follows. Section II describes our basis for using the FMCW radar for hand gesture recognition. Section III presents the proposed Dop-DenseNet and its evaluation results. Section IV compares the performances of several state-of-the-art CNN models with that of our Dop-DenseNet model for hand gesture recognition. Finally, Section V concludes the paper.

II. FREQUENCY-MODULATED CONTINUOUS-WAVE RADAR FOR HAND GESTURE RECOGNITION


In this section, we present the operation principle of the FMCW radar that can be used for hand gesture recognition based on micro-Doppler signatures.

1. Principles of the FMCW Radar

FMCW radars are widely used for tasks involving human gesture recognition. They transmit electromagnetic FMCW power through antennas into a space for measuring objects. By processing the reflected signal and comparing it with the duplicate of the transmitted signal, the radar can determine the target position. An example of an FMCW radar scheme for this application is shown in Fig. 1.

Fig. 1. Schematic of the FMCW radar for hand gesture recognition.

The signal transmitted by the FMCW radar can be expressed as follows:

$$s_T(t) = A_T \cos\left(2\pi f_c t + 2\pi \int_0^{t} f_T(\tau)\, d\tau\right), \qquad (1)$$

where $f_T(\tau) = (B/T)\tau$ is the transmitter sweep frequency, $f_c$ is the center carrier frequency, $B$ is the bandwidth, $T$ is the signal period, and $A_T$ is the transmitted signal amplitude.

Due to the motion of the hand, the frequency of the received signal is modulated by the Doppler shift, $f_D = 2 f_c v / c$. Thus, the receiving frequency is defined as follows:

$$f_R(t) = f_T(t - \Delta\tau) + f_D, \qquad (2)$$

where $\Delta\tau = 2(R_0 + vt)/c$ is the time delay of the received signal, $R_0$ is the range of the hand from the radar antenna, $v$ is the speed of hand motion, and $c$ is the speed of light. Thus, the signal received by the radar antenna is presented as follows:

$$s_R(t) = A_R \cos\left(2\pi f_c (t - \Delta\tau) + 2\pi \int_0^{t - \Delta\tau} f_T(\tau)\, d\tau\right) + n(t), \qquad (3)$$

where $A_R$ is the received signal amplitude and $n(t)$ is the Gaussian noise. The output of the mixer is an intermediate frequency (IF) signal generated by mixing the received signal with the duplicate of the transmitted one. Afterward, the IF signal goes to a low-pass filter and an IF amplifier. If there is no reflected signal (meaning there is no detected target), the IF signal is only noise. Then, the IF signal is converted to digital form by an analog-to-digital converter for further processing in the digital signal processor. As a result, the IF signal can be written as follows:

$$s_{IF}(t) = \begin{cases} A_{IF}\cos\!\left(2\pi\left(\dfrac{2 B R_0}{T c} + f_D\right) t\right) + n(t), & \text{for target}, \\ n(t), & \text{for no target}, \end{cases} \qquad (4)$$

where $R_0$ is the range of the target at $t = 0$, and $n(t)$ is the Gaussian noise. It can be seen in Eq. (4) that the existence of the target is detected by the strength of the IF signal, which is represented by the signal-to-noise ratio. The moving target is determined by analyzing the micro-Doppler spectrogram of the IF signal.

2. Micro-Doppler Effect

The micro-Doppler effect is a phenomenon of secondary modulation in a reflected signal caused by the movements of parts of a target relative to the target center. The micro-Doppler effect facilitates determining the kinetic characteristics of the target; therefore, it can be used to identify hand gestures. In almost all cases, the Doppler frequency is produced by a combination of many complex and different micromovements, such as translation, rotation, and vibration. The oscillations of these micromotions produce frequency modulation phenomena on the reflected signal. They cause additional changes around the constant Doppler frequency shift of the entire translation motion of the target. For a target with only a translation motion at a constant velocity, the Doppler shift produced by this translation motion is a time-invariant function. In contrast, if a target has a vibrating or rotating motion, the Doppler frequency produced by this rotation or vibration is a function of time, which expresses a time-varying modulation of the carrier frequency of the reflected signal. The general technique for micro-Doppler analysis is to represent the time-frequency spectrum. Nowadays, conventional spectrograms defined by the short-time Fourier transform (STFT) are widely used for micro-Doppler analysis, given as follows:

$$X(\tau, \omega) = \mathrm{STFT}\{x(t)\} = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-j\omega t}\, dt, \qquad (5)$$

where $x(t)$ is the input signal of the transformation, and $w(t - \tau)$ is the kernel (so-called window) function. The resolution of the STFT spectrogram is determined by the window function and the overlapping rate. It is obvious that the micro-Doppler spectrum has two dimensions: time and frequency. The value of $X$ is a complex number; therefore, $X$ is designated as the neural network model input, which has a size of M × N × 2, where M and N are the sizes of the frequency and time points, respectively, and 2 represents the real and imaginary parts of the complex values of $X$.
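To make the mapping from Eq. (5) to the network input concrete, the following minimal sketch computes a complex STFT spectrogram of an IF recording and stacks its real and imaginary parts into an M × N × 2 array. The sampling rate, window length, and overlap used here are illustrative assumptions, not the exact processing parameters of the paper.

```python
import numpy as np
from scipy.signal import stft

def micro_doppler_input(x, fs, nperseg=256, overlap=0.75):
    """Compute an STFT spectrogram (Eq. 5) of the IF signal x and stack
    its real and imaginary parts into an M x N x 2 array for the CNN."""
    noverlap = int(nperseg * overlap)
    # scipy returns a complex array X of shape (frequency bins, time bins)
    _, _, X = stft(x, fs=fs, window="hann", nperseg=nperseg, noverlap=noverlap)
    return np.stack([X.real, X.imag], axis=-1)

# Example with a stand-in IF recording (placeholder sampling rate)
fs = 2000
x = np.random.randn(30 * fs)     # 30 seconds of simulated IF samples
inp = micro_doppler_input(x, fs)
print(inp.shape)                 # (M, N, 2)
```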

III. CONVOLUTIONAL NEURAL NETWORK-BASED HAND GESTURE RECOGNITION

In this section, we first introduce a dataset, Dop-Net, which we used to assess CNN models. We then propose a CNN model whose performance was evaluated using the Dop-Net dataset by changing the number of filter channels of the model and the time length of input data.


1. Dop-Net Dataset

Dop-Net is an FMCW radar dataset in which each *.mat file represents the data of one person, divided into the different gestures recorded from that person. The FMCW radar constantly transmits a chirp signal, which increases (up-chirp) and decreases (down-chirp) its frequency linearly over time. The signal reflected from a hand is then mixed with the duplicate copy of the transmitted signal to downconvert the received signal to an intermediate frequency. The data for this classification problem are generated using a 24-GHz FMCW radar with a 750-MHz bandwidth. The other parameters of the radar are presented in Table 1.

Table 1. Main parameters of the FMCW radar for dataset generation
Parameter               Value
Frequency (GHz)         24
Bandwidth (MHz)         750
ADC resolution (bits)   12
Transmit power (dBm)    13

Each gesture is made directly in front of the radar at a distance of ≈30 cm, at the same height as the antenna. The FMCW radar captures 30 seconds of the reflected signal for each gesture. The dataset provides four separate hand gestures that can be applied in human-machine interface applications. By using the STFT-based micro-Doppler analysis method and automatic segmentation with 75% overlapping, we generated 9,410 time-frequency spectral images in total, each with a size of 200 × 200. The distribution of gestures is presented in Fig. 2, with the numbers of images for the click, pinch, swipe, and wave actions being 2,063, 2,026, 2,821, and 2,500, respectively.

Fig. 2. Distribution of hand gestures in the Dop-Net dataset.
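A rough sketch of the automatic segmentation step described above: a long micro-Doppler spectrogram is sliced into 200-time-bin windows with 75% overlap (a hop of 50 bins). The function name and the assumption that the spectrogram has already been limited to 200 frequency bins are illustrative.

```python
import numpy as np

def segment_spectrogram(spec, win=200, overlap=0.75):
    """Slice a (freq x time) spectrogram into windows of `win` time bins
    with the given overlap (hop = 50 bins for 75% overlap)."""
    hop = int(win * (1 - overlap))
    segments = [spec[:, s:s + win]
                for s in range(0, spec.shape[1] - win + 1, hop)]
    return np.stack(segments)

# Example: a 200 x 4000 spectrogram yields (4000 - 200) / 50 + 1 = 77 windows
spec = np.random.randn(200, 4000)
print(segment_spectrogram(spec).shape)   # (77, 200, 200)
```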
Evidently, different hand gestures pose different micro-Doppler signatures in the STFT images (Fig. 3), which facilitates machine learning and deep learning algorithms in discriminating individual hand actions. Accordingly, Ritchie et al. [14] compared different machine-learning techniques in terms of classification accuracy. As a result, a quadratic SVM classifier obtained the best accuracy (74.2%).

Fig. 3. Micro-Doppler spectra of different hand gestures: (a) wave, (b) pinch, (c) click, and (d) swipe.

2. Dop-DenseNet Model

For use in recognizing hand gestures, we propose a lightweight CNN model based on a densely connected architecture: Dop-DenseNet. The overall scheme of our network is detailed in Fig. 4, where cross-skip connections and depth-wise concatenations are employed for leveraging the useful former feature maps, which can be lost when going through the model's feature-extraction backbone flow. As shown in Fig. 4, our design reuses the former feature maps twice. Indeed, the output features from the first normalization (Norm) layer are concatenated with the backbone features at the first depth-concatenation (Concat) layer, and then they are reused once more at the second Concat layer. This process is continuously repeated up to the last Concat layer. The reuse of former features helps enhance the representative information of each output class and prevents gradient vanishing and overfitting problems during the training process. Therefore, this design can improve the classification accuracy of the model. In addition, maxpool layers with (2,2) strides were employed to reduce the number of learnable parameters in the model. All the convolutional (Conv) layers in the backbone flow were designed with 64 filters with a 3×3 size and a (1,1) stride. The Conv layers in the skip connection flows have 64 filters with a 1×1 size and a (2,2) stride.


Fig. 4. Overall architecture of Dop-DenseNet.

Maxpool layers were allocated for selecting the strong features and downsampling the feature maps. The last Conv layer is followed by an average pool layer instead of a maxpool layer. Then, two fully connected (FC) layers were deployed as classifiers. The FC 1 layer was set with 512 neurons, whereas the FC 2 layer was assigned four output neurons corresponding to the four classes of hand gestures (click, pinch, swipe, and wave).
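To illustrate the connection pattern described above, the sketch below expresses it with the Keras functional API: 3 × 3, stride-(1,1) Conv layers in the backbone, 1 × 1, stride-(2,2) Conv layers on the skip paths, depth-wise concatenation of the earlier feature maps, (2,2) maxpooling, a final average pool, and FC layers of 512 and 4 neurons. Because Fig. 4 is not reproduced here, the number of blocks and the placement of the normalization layer are assumptions, so the parameter count will not match the published model exactly.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def dop_densenet_sketch(input_shape=(200, 200, 2), num_blocks=3, filters=64):
    """Illustrative sketch of the dense/skip connection pattern described
    in the text; block count and Norm placement are assumptions."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu")(inp)
    skip = layers.BatchNormalization()(x)          # earlier features to be reused
    for _ in range(num_blocks):
        # Backbone: 3x3 Conv (stride 1) followed by a (2,2) maxpool
        x = layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
        # Skip path: 1x1 Conv with (2,2) stride to match the downsampled size
        skip = layers.Conv2D(filters, 1, strides=2, padding="same")(skip)
        # Depth-wise concatenation reuses the earlier feature maps
        x = layers.Concatenate(axis=-1)([x, skip])
        skip = x
    x = layers.AveragePooling2D(pool_size=2)(x)    # average pool before the classifier
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)    # FC 1
    out = layers.Dense(4, activation="softmax")(x) # FC 2: click, pinch, swipe, wave
    return Model(inp, out)

model = dop_densenet_sketch()
model.summary()
```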
very high rate of correct hand gesture classification. Specifically,
3. Experimental Results

The dataset of micro-Doppler images was randomly divided into a training set (80%) and a validation/test set (20%). The Dop-DenseNet model was experimented with on the dataset with the same training and test configuration. Specifically, the training options were set up as presented in Table 2. The training and testing processes were executed on a laptop with an Intel Core i5-9300H CPU, 16 GB of RAM, and an NVIDIA GeForce GTX 1660 Ti GPU with 6 GB of memory.

Table 2. Training options used for the Dop-DenseNet model
Parameter               Value
MiniBatchSize           64
MaxEpochs               30
InitialLearnRate        0.01
LearnRateDropFactor     0.5
LearnRateDropPeriod     2
Solving method          SGDM
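The option names in Table 2 follow MATLAB-style training options; a roughly equivalent Keras setup is sketched below, reusing the dop_densenet_sketch() model from the previous listing. The momentum value (0.9), the synthetic placeholder data, and the use of validation_split for the 80%/20% division are assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf

# Placeholder data standing in for the 9,410 spectrogram inputs (200 x 200 x 2)
x_data = np.random.randn(128, 200, 200, 2).astype("float32")
y_data = np.random.randint(0, 4, size=128)

initial_lr = 0.01
drop_factor, drop_period = 0.5, 2            # LearnRateDropFactor / LearnRateDropPeriod

def step_decay(epoch, lr):
    # Piecewise-constant schedule: InitialLearnRate * factor ** (epoch // period)
    return initial_lr * drop_factor ** (epoch // drop_period)

# SGDM solver; the momentum value (0.9) is an assumption
optimizer = tf.keras.optimizers.SGD(learning_rate=initial_lr, momentum=0.9)
model = dop_densenet_sketch()                # model sketch from the previous listing
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_data, y_data,
          batch_size=64,                     # MiniBatchSize
          epochs=30,                         # MaxEpochs
          validation_split=0.2,              # 80%/20% train-validation split
          callbacks=[tf.keras.callbacks.LearningRateScheduler(step_decay)])
```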

The learning/convergence progress is plotted in Fig. 5, where it can be observed that the cross-validation accuracies are appropriate for the training ones and approach ≈100% after 2,350 iterations.

Fig. 5. Training and validation accuracies of Dop-DenseNet.

The confusion matrix that presents the classification accuracy of our proposed model for the four hand gestures is shown in Fig. 6, where it can be seen that the Dop-DenseNet model gained a very high rate of correct hand gesture classification. Specifically, the model predicted the swipe and wave gestures with 100% accuracy. The two other gestures were also classified with high accuracy, at 99.7% and 99.9% for the click and pinch gestures, respectively.

Fig. 6. Confusion matrix of hand gesture recognition using the Dop-DenseNet model.

Next, we analyzed the accuracy and prediction time of the Dop-DenseNet model by changing the number of filter channels in the Conv layers to 8, 16, 32, and 64 channels. The comparison of results for the different numbers of filter channels in Fig. 7 shows that the Dop-DenseNet model can improve its classification accuracy by increasing the number of filters, but this makes the model bigger and slower. Specifically, the model's performance improved by about 1.1% when the number of filter channels increased from 8 to 16.


Nevertheless, it enlarged the model capacity to about 50,300 learnable parameters and raised the prediction time delay from 2.1 ms to 2.4 ms. When we changed the number of channels in the model from 16 to 64, the classification accuracy improved by only about 0.5%. However, the model structure significantly increased by about 772,000 learnable parameters, and its processing rate was reduced to 3.1 ms.

Fig. 7. Performance comparison of Dop-DenseNet when configured with different numbers of filter channels.

To determine the impact of time length on the gesture recognition performance of Dop-DenseNet, we assessed the model with different time lengths (100, 200, 300, and 400 time bins). The results in Fig. 8 show that a longer time length makes the model processing slower despite improving the gesture recognition accuracy. Specifically, it can be seen in Fig. 8 that the model's accuracy was significantly improved when the time length was changed from 100 bins to 200 bins. However, when the time length was changed from 200 bins to 300 and 400 bins, the model obtained only a small improvement in accuracy. In addition, the execution time of the model was slower with time-length increments.

Fig. 8. Performance comparison of Dop-DenseNet with different time lengths of input data.

IV. COMPARISON OF IMPLEMENTATION RESULTS

In this section, we compare the performance of the Dop-DenseNet model with those of several state-of-the-art CNN models (GoogLeNet, ResNet-50, NasNet-Mobile, and MobileNet-V2) on the Dop-Net dataset. In the following subsections, we briefly describe the structures of the CNN models. Afterwards, we present the results of each model's training with 80% of the dataset, and of the model's validation and testing with the remaining 20% of the dataset. Finally, we discuss the results of the performance comparison.

1. Brief Description of CNN Models

1.1 GoogLeNet
GoogLeNet [15] is a CNN with 22 layers. It was built to classify images into 1,000 object categories, such as keyboard, mouse, pencil, and many different animals. The GoogLeNet model is based on the Inception architecture. It uses Inception modules, which allow the network to choose between multiple convolutional filter sizes in each block. An Inception network stacks these modules on top of each other, with occasional maxpool layers with a stride of 2 to halve the resolution of the grid. The network has learned different feature representations for a wide range of images with a 224 × 224 size. However, on the Dop-Net dataset, the GoogLeNet input size was reassigned to the 200 × 200 size, with "zero center" normalization. The output classes of the fully connected layer were changed from 1,000 to 4, which is appropriate for the number of gestures in the Dop-Net dataset.
1.2 ResNet-50
ResNet-50 [16] is a variant of the ResNet model, which has 48 convolution layers, along with 1 maxpool layer and 1 average pool layer. Therefore, it is 50 layers deep. As a result, it has 3.8 × 10⁹ floating point operations. This architecture is used for computer vision tasks such as image classification, object detection, and localization. It can also be applied to other tasks to give them the benefit of depth and to reduce the computational expenses. The original ResNet-50 was trained on more than a million images from the ImageNet database to classify images into 1,000 object categories, similar to GoogLeNet. As a result, the network has learned rich feature representations for a wide range of images. The network has an input size of 224 × 224; however, in this work, we redesigned it with a 200 × 200 input size and an output size of four classes.
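As an illustration of this adaptation step, the snippet below rebuilds a Keras-provided ResNet50 backbone with a 200 × 200 input and a new four-class head. Training from random weights and using a two-channel spectrogram input are assumptions made here so that the shapes are accepted; the authors performed the corresponding redesign on the pretrained networks in their own framework.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Rebuild a standard backbone for 200 x 200 inputs and 4 gesture classes.
# weights=None allows the 2-channel spectrogram input; this is an
# illustrative assumption, not the authors' exact procedure.
backbone = tf.keras.applications.ResNet50(include_top=False,
                                           weights=None,
                                           input_shape=(200, 200, 2),
                                           pooling="avg")
out = layers.Dense(4, activation="softmax")(backbone.output)  # click/pinch/swipe/wave
resnet_gesture = Model(backbone.input, out)
resnet_gesture.compile(optimizer="sgd",
                       loss="sparse_categorical_crossentropy",
                       metrics=["accuracy"])
```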

1.3 NasNet-Mobile [17]
NasNet stands for Neural Architecture Search Network, which was developed by the Google Brain team. The NasNet model employs the two main functionalities of normal and reduction cells. The idea of NasNet is to search for the best combination of parameters of the given search space: filter sizes, output channels, strides, number of layers, and others.


As a result, NasNet achieved state-of-the-art results in the ImageNet competition. However, it requires considerable computation power. In this work, we applied NasNet-Mobile, a small version of NasNet, to classify hand gestures on the basis of the micro-Doppler feature maps analyzed by the FMCW radar. Similar to the other networks, we redesigned the input and output sizes of NasNet-Mobile to make it appropriate for the Dop-Net dataset.

1.4 MobileNet-V2
MobileNet-V2 [18] is the second version of the MobileNet-V1 model developed by Google. The MobileNet-V2 model is an improved module because an inverted residual structure, which plays the role of a backbone for feature extraction, was introduced to it. Hence, MobileNet-V2 achieves state-of-the-art performance in object detection and semantic segmentation. In MobileNet-V2, two types of blocks are used: a residual block with a stride of 1 and another one with a stride of 2 for downsizing. Each block has three layers: the first layer is a 1 × 1 convolution with a ReLU activation function; the second layer is a depth-wise convolution; and the third layer is another 1 × 1 convolution without any nonlinearity.
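A minimal sketch of the inverted residual block just described: a 1 × 1 expansion convolution with ReLU, a depth-wise convolution, and a linear 1 × 1 projection, with an identity shortcut only when the stride is 1 and the channel count is unchanged. The expansion factor of 6, the ReLU6 activation, and the extra activation after the depth-wise convolution follow the common MobileNet-V2 convention and are assumptions here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, out_channels, stride=1, expansion=6):
    """MobileNet-V2-style block: 1x1 expand (ReLU6) -> depth-wise conv
    -> linear 1x1 projection; shortcut only when stride == 1 and the
    channel count is unchanged."""
    in_channels = x.shape[-1]
    y = layers.Conv2D(expansion * in_channels, 1, padding="same")(x)
    y = layers.ReLU(max_value=6.0)(y)                       # first layer: 1x1 conv + ReLU
    y = layers.DepthwiseConv2D(3, strides=stride, padding="same")(y)
    y = layers.ReLU(max_value=6.0)(y)                       # second layer: depth-wise conv
    y = layers.Conv2D(out_channels, 1, padding="same")(y)   # third layer: linear 1x1 conv
    if stride == 1 and in_channels == out_channels:
        y = layers.Add()([x, y])                            # residual shortcut
    return y

# Example: one stride-1 block and one stride-2 (downsizing) block
inp = layers.Input(shape=(200, 200, 32))
h = inverted_residual(inp, 32, stride=1)
h = inverted_residual(h, 64, stride=2)
```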

2. Comparison of Results

The performances of the Dop-DenseNet model (with a configuration of 64 3 × 3 filters and an input data time length of 200 time bins) and of the four other CNN models are compared in Table 3, where it can be seen that our proposed model has the least number of learnable parameters (≈0.85 million). It remarkably outperforms the other models in terms of accuracy (overall, 99.87%) and execution time (≈3.1 ms). MobileNet-V2 achieved the second-highest performance, with 99.38% classification accuracy, 4.0 ms prediction time, and ≈2.22 million learnable parameters. The NasNet-Mobile model, with ≈4.3 million learnable parameters, showed the worst average accuracy (98.77%) and prediction time (5.4 ms). Interestingly, ResNet-50 has a large architecture of ≈23.68 million parameters, but its average classification accuracy (99.18%) is lower than that of MobileNet-V2. The comparison of all five CNN models with the machine learning techniques in [14] showed that the CNN models remarkably outperform the machine learning techniques.

Table 3. Performance comparison of Dop-DenseNet with different state-of-the-art CNN models
Model            Avg. acc. (%)   Click (%)   Pinch (%)   Swipe (%)   Wave (%)   Time (ms)   Learnable parameters
GoogLeNet        98.97           99.37       99.28       97.92       98.92      3.7         5,971,380
ResNet-50        99.18           98.73       98.56       100         100        6.6         23,684,100
NasNet-Mobile    98.77           98.10       99.28       97.92       100        5.4         4,271,232
MobileNet-V2     99.38           99.37       99.28       98.96       100        4.0         2,215,044
Dop-DenseNet     99.87           99.70       99.90       100         100        3.1         852,996

V. CONCLUSION

This paper briefly presents the principles of the FMCW radar with the use of micro-Doppler signatures for hand gesture recognition. For use in recognizing hand gestures, we propose Dop-DenseNet, a lightweight CNN model with cross-dense connections and skip connections. Our model achieved the highest overall accuracy (99.87%) and the fastest execution time (3.1 ms) when trained and tested on the Dop-Net dataset with a time length of 200 bins. We then compared the performance of the Dop-DenseNet model with 64 3 × 3 filters with those of four well-known CNN models (GoogLeNet, ResNet-50, NasNet-Mobile, and MobileNet-V2) in terms of classification rate, prediction time, and number of learnable parameters. All five models were trained and tested on the Dop-Net dataset with the same training and testing option configuration. The Dop-DenseNet model remarkably outperformed the four other models in terms of classification accuracy, prediction time, and structural size.

REFERENCES

[1] J. Kobylarz, J. J. Bird, D. R. Faria, E. P. Ribeiro, and A. Ekart, "Thumbs up, thumbs down: non-verbal human-robot interaction through real-time EMG classification via inductive and supervised transductive transfer learning," Journal of Ambient Intelligence and Humanized Computing, vol. 11, no. 12, pp. 6021-6031, 2020.
[2] K. Zhang, Y. Huang, Y. Du, and L. Wang, "Facial expression recognition based on deep evolutional spatial-temporal networks," IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4193-4203, 2017.
[3] Q. Gao, Y. Chen, Z. Ju, and Y. Liang, "Dynamic hand gesture recognition based on 3D hand pose estimation for human-robot interaction," IEEE Sensors Journal, 2018. https://doi.org/10.1109/JSEN.2021.3059685


[4] X. Tan and B. Triggs, "Enhanced local texture feature sets for face recognition under difficult lighting conditions," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1635-1650, 2010.
[5] S. Ahmed, K. D. Kallu, S. Ahmed, and S. H. Cho, "Hand gestures recognition using radar sensors for human-computer-interaction: a review," Remote Sensing, vol. 13, no. 3, article no. 527, 2021. https://doi.org/10.3390/rs13030527
[6] G. Malysa, D. Wang, L. Netsch, and M. Ali, "Hidden Markov model-based gesture recognition with FMCW radar," in Proceedings of 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Washington, DC, 2016, pp. 1017-1021.
[7] Z. Peng, C. Li, J. Munoz-Ferreras, and R. Gomez-Garcia, "An FMCW radar sensor for human gesture recognition in the presence of multiple targets," in Proceedings of 2017 First IEEE MTT-S International Microwave Bio Conference (IMBIOC), Gothenburg, Sweden, 2017, pp. 1-3.
[8] S. J. Ryu, J. S. Suh, S. H. Baek, S. Hong, and J. H. Kim, "Feature-based hand gesture recognition using an FMCW radar and its temporal feature analysis," IEEE Sensors Journal, vol. 18, no. 18, pp. 7593-7602, 2018.
[9] B. Vandersmissen, N. Knudde, A. Jalalvand, I. Couckuyt, A. Bourdoux, W. De Neve, and T. Dhaene, "Indoor person identification using a low-power FMCW radar," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 7, pp. 3941-3952, 2018.
[10] Z. Fan, H. Zheng, and X. Feng, "A meta-learning-based approach for hand gesture recognition using FMCW radar," in Proceedings of 2020 International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, 2020, pp. 522-527.
[11] H. R. Lee, J. Park, and Y. J. Suh, "Improving classification accuracy of hand gesture recognition based on 60 GHz FMCW radar with deep learning domain adaptation," Electronics, vol. 9, no. 12, article no. 2140, 2020. https://doi.org/10.3390/electronics9122140
[12] Y. Sun, T. Fei, X. Li, A. Warnecke, E. Warsitz, and N. Pohl, "Real-time radar-based gesture detection and recognition built in an edge-computing platform," IEEE Sensors Journal, vol. 20, no. 18, pp. 10706-10716, 2020.
[13] M. Chmurski and M. Zubert, "Novel radar-based gesture recognition system using optimized CNN-LSTM deep neural network for low-power microcomputer platform," in Proceedings of the 13th International Conference on Agents and Artificial Intelligence (ICAART), Virtual Event, 2021, pp. 882-890.
[14] M. Ritchie, R. Capraru, and F. Fioranelli, "Dop-NET: a micro-Doppler radar data challenge," Electronics Letters, vol. 56, no. 11, pp. 568-570, 2020.
[15] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," 2014. [Online]. Available: https://doi.org/10.48550/arXiv.1409.4842
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778.
[17] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 8697-8710.
[18] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. C. Chen, "MobileNetV2: inverted residuals and linear bottlenecks," in Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 4510-4520.

Hai Le received his M.Sc. degree in electronic engineering from Le Quy Don Technical University in 2012. He is currently pursuing his Ph.D. in electronic engineering at the Institute of System Integration, Le Quy Don Technical University, Hanoi, Vietnam. His research interests include radar systems, signal processing, and deep learning techniques.

Van-Phuc Hoang received his Ph.D. in electronic engineering from the University of Electro-Communications (UEC), Tokyo, Japan, in 2012. He was a postdoc researcher and visiting scholar at UEC, Telecom Paris, France, and the University of Strathclyde, Glasgow, UK, in 2012–2018. He is currently an associate professor and director at the Institute of System Integration, Le Quy Don Technical University, Hanoi, Vietnam. His research interests include digital circuits and systems, hardware security, embedded systems for the Internet of Things, and intelligent system integration with deep learning techniques.


Van Sang Doan received his M.Sc. and Ph.D. degrees in electronic systems and devices from the Faculty of Military Technology, University of Defence, Brno, Czech Republic, in 2013 and 2016, respectively. He was awarded an Honors degree by the Faculty of Military Technology of the University of Defence three times, in 2011, 2013, and 2016. From 2019 to 2020, he was a postdoctoral research fellow at the ICT Convergence Research Center, Kumoh National Institute of Technology, South Korea. He is currently a lecturer at the Faculty of Communication and Radar, Naval Academy in Nha Trang City, Vietnam. His current research interests include radar and sonar systems, signal processing, and deep learning.

Dai Phong Le received his Ph.D. in radar and navigation from Saint Petersburg Electrotechnical University, Russia, in 2012. He is currently the head of the Research Group on Radar and Microelectronics at the Institute of System Integration, Le Quy Don Technical University, Hanoi, Vietnam. His research interests include radar and sonar systems, signal processing, and RF integrated-circuit design.
