mm-Pose: Real-Time Human Skeletal Posture Estimation Using mmWave Radars and CNNs
[…] and open-source data-sets has led to OpenPose being used as the most popular benchmark for generating highly accurate ground truth data-sets for training.

While the aforementioned approaches paved the way towards human pose and skeleton tracking, they were limited to 2-D estimation on account of the images/videos being collected from monocular cameras. While monocular cameras provide high-resolution information on the azimuth and elevation of objects, extracting depth using monocular vision sensors is extremely challenging and non-trivial. To model a 3-D representation of the skeletal joints, the HumanEva dataset was created by researchers at the University of Toronto [23]. The dataset was created using 7 synchronized video cameras (3 RGB + 4 grayscale) in a circular array to capture the entire scene in its field-of-view. The human subject was made to perform 5 different motions, reflective markers were placed on specific joint locations to track the motion, and a ViconPeak commercial motion capture system was used to obtain the 3-D ground truth pose of the body. Another approach to extract 3-D skeletal joint information is by using the Microsoft Kinect [24]. The Kinect consists of an RGB and an infra-red (IR) camera, which allow it to capture the scene in 3-D space. It used a per-pixel classification approach to first identify the human body parts, followed by joint estimation by finding the global centroid of the probability mass for each identified part. However, the downsides of vision-based sensors for skeletal tracking are that their performance is extensively hindered by poor lighting and occlusion. Moreover, as previously introduced, privacy concerns restrict the use of vision-based sensors for several applications.

Studies have previously made use of micro-Doppler signatures to determine human behavior using RF signals; however, these did not provide spatial information of the subjects' locations [25], [26], as the signatures solely represented the temporal velocity profiles of the reflection points. Skeleton tracking using RF signals is a new and emerging area of research. RF-based devices can be further classified into two categories, wearable and non-wearable. Wearable wireless sensors use Wi-Fi signals to track the location and velocity of the device, which indirectly represents the human. However, […] truth data, a circular array of 12 2-D vision sensors was used to capture the scene, and OpenPose was used to generate 2-D skeletons from each camera node output, which were then associated and triangulated to obtain 3-D skeletons [29].

In this paper, we propose mm-Pose, a novel approach that uses 77 GHz mmWave radars for human skeletal tracking. mmWave radars offer a greater bandwidth (≈3 GHz), which in turn provides a more precise range resolution. Furthermore, operating at 77 GHz allows them to capture even small abnormalities on the reflection surface, thus adding more granularity in terms of identifying more key-points. Unlike the aforementioned approaches, mmWave radars are low-power, low-cost and compact, making them extremely practical for deployment. We make use of a forked-CNN architecture to predict >15 key-points and construct the skeleton in real-time. To obtain ground truth data, we collect the keypoint locations in parallel using a Microsoft Kinect via its MATLAB API.

III. BACKGROUND THEORY

A. Radar Signal Processing

The mmWave radar transmits a frequency-modulated continuous-wave (FMCW) chirp signal and utilizes stretch processing [30] to get the beat frequency, which corresponds to the target's range. Doppler processing across multiple chirps during one coherent processing interval (CPI) determines the Doppler frequency, which is related to the target's velocity. Mathematically, the n-th chirp during one CPI in complex form is given by:

$$x_n(t) = e^{j2\pi\left[f_0 t + \frac{BW}{2T}t^2\right]}, \quad nT \le t < (n+1)T, \quad \forall n \in [0, 1, \ldots, N-1], \tag{1}$$

where $f_0$ is the chirp starting frequency, $BW$ is the sweeping bandwidth, $T$ is the duration of one chirp, and $N$ is the number of chirps during one CPI. $\frac{BW}{T}$ is referred to as the chirp rate. The echo from a target is a time-delayed version of the transmitted chirp. After stretch processing, the resulting baseband signal is given as:

$$A_r \times e^{j2\pi f_0 \tau_n} \times e^{j2\pi\frac{BW}{2T}\left(2\tau_n t - \tau_n^2\right)} \tag{2}$$
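To make the stretch-processing step concrete, the following is a minimal NumPy sketch (our illustration, not from the paper) that generates the beat term of Eq. (2) for a single target and recovers the range from the beat frequency. The chirp parameters are loosely modeled on the values reported later in the paper, but the sampling rate and target range are arbitrary assumptions.

```python
import numpy as np

# Illustrative FMCW parameters (assumptions, not the exact radar settings)
f0 = 77e9          # chirp start frequency (Hz)
BW = 3.072e9       # sweep bandwidth (Hz)
T = 92e-6          # chirp duration (s)
fs = 20e6          # baseband sampling rate (Hz)
c = 3e8            # speed of light (m/s)

R_true = 4.0                 # assumed target range (m)
tau = 2 * R_true / c         # round-trip delay (s)

t = np.arange(0, T, 1 / fs)

# Eq. (2): mixing the echo with the transmit chirp of Eq. (1) leaves the
# phase 2*pi*[f0*tau + (BW/2T)*(2*tau*t - tau^2)]
beat = np.exp(1j * 2 * np.pi * (f0 * tau + (BW / (2 * T)) * (2 * tau * t - tau**2)))

# The beat frequency f_b = (BW/T)*tau maps back to range R = c*T*f_b/(2*BW)
spectrum = np.abs(np.fft.fft(beat))
freqs = np.fft.fftfreq(len(beat), 1 / fs)
f_beat = abs(freqs[np.argmax(spectrum)])
R_est = c * T * f_beat / (2 * BW)
print(f"estimated range: {R_est:.2f} m (true: {R_true} m)")
```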
Convolutional neural networks (CNNs) empirically perform better for tasks involving images. Analogous to hidden layers, each CNN layer can represent multiple higher-dimensional representations of the input images, based on the specified depth of the layer. For instance, a CNN layer with depth 32 would generate 32 unique transformed representations of the input. The transformation is carried out via an m×m weight kernel. Similar to traditional nodes, the kernel would first take the weighted inputs (pixels), sum them, and then apply a non-linear activation function to yield a single scalar as an output. This process is repeated as the kernel mask traverses the entirety of the image with a user-defined stride length. For instance, if an N × N × 3 image is subjected to a CNN layer with depth D and a kernel size k × k × 3 (k < N), we would obtain an N × N × D volume tensor as output, with D distinct k × k × 3 weight sets to be trained. The training process is similar to MLPs, i.e., gradient descent using back-propagation is used.

Unlike the hidden-layer outputs of an MLP, the CNN "filters" can have a visual representation. In previous studies, with a given visual input, say of a dog, the resulting filters have been shown to detect the outline, edges, eyes, noses etc. in the activation map, due to the fact that CNNs are inherently spatial filters [32]. Another added advantage that CNNs offer over MLPs is the significant reduction in computational complexity, owing to the kernel weights being reused for generating a single transformation. As an example, if an N × N × 3 image is subjected to a CNN layer with depth 1 (for simplicity) and a kernel size k × k × 3, the total number of trainable parameters would be 3k², as opposed to 3N² with an MLP. With an increase in the number of representations D, while the additional number of parameters in CNNs would increase linearly (D × 3k²), the number of parameters in a fully-connected MLP would increase exponentially ((3N²)^D).
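As a quick sanity check on the parameter counts above, the short PyTorch sketch below (our illustration, not from the paper) counts the trainable weights of a bias-free convolutional layer versus a fully-connected layer over the same N × N × 3 input; the values N = 16, k = 3, and D = 32 are arbitrary assumptions.

```python
import torch.nn as nn

N, k, D = 16, 3, 32  # assumed image size, kernel size, and layer depth

# Convolutional layer: D kernels of size k x k x 3 -> D * 3k^2 weights
conv = nn.Conv2d(in_channels=3, out_channels=D, kernel_size=k,
                 padding=k // 2, bias=False)
conv_params = sum(p.numel() for p in conv.parameters())
print(conv_params, D * 3 * k**2)   # 864 864

# Fully-connected layer producing one output per pixel: every output is
# wired to all 3N^2 inputs, so the weight count grows with N, not k
fc = nn.Linear(3 * N * N, N * N, bias=False)
fc_params = sum(p.numel() for p in fc.parameters())
print(fc_params)                   # 196608 = (3N^2) * N^2
```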
IV. PROPOSED APPROACH

A. Radar-To-Image Data Representation

As introduced in the previous sections, radars are essentially time-of-flight sensors that illuminate the scene with their own RF signals and use the phase information of the reflected signals to resolve the time-delay and estimate the range of the points of reflection. As the name suggests, mmWave radar signal wavelengths are on the order of mm, which enables them to detect even minute abnormalities of a target. Furthermore, with bandwidths in the range of 3-4 GHz, mmWave radars can also provide a high-resolution mapping of the scene in range.

The radar reflections over a coherent processing interval (CPI) result in a radar data cube with 3 dimensions, viz. fast-time, slow-time and channel. By using the radar signal processing chain, as described in Section III-A, we obtain the range, velocity and angle information of the reflection points, also referred to as the range-Doppler map. By using basic trigonometric relations, we can obtain the real-world position (x, y, z) of the reflection points with respect to the radar (at the origin), where x, y, z represent the depth, azimuth and elevation coordinates, respectively. However, using a 3-D range map coupled with an additional Doppler dimension further adds to the number of pixels that need to be processed by the neural network, which would make the representation unsuitable for real-time applications. Alternately, we perform constant false alarm rate (CFAR) processing on the previously obtained range-Doppler map to alleviate background noise and clutter, and then perform the range to 3-D position conversion to obtain a point-cloud representation of the scene, as our desired application requires us to map the skeletal joint-indices in real-world 3-D coordinate space.

Generally, the number of radar reflection points from a moving human target is random and varies from frame to frame, which makes it difficult to track and associate at an individual point-level. It is therefore extremely challenging to directly determine a closed-form mapping between the reflection point-cloud and the desired skeletal key-points. As it is non-trivial to map the random radar reflections to the skeletal key-points for pose estimation, in this study we instead aimed to use supervised learning to estimate the skeleton of a human with the aid of CNNs.

There are multiple approaches to represent the radar reflection data. The simplest approach is a point-cloud representation of the reflection points in 3-D XYZ space, as shown in Fig. 3(a). The term point-cloud had initially been used in the Lidar community, as reflected signals were used to map the scene in a "cluster/cloud of points" [33]. This terminology has also been borrowed in radar-based perception applications, especially now that improved-resolution radars can represent individual objects as a "point-cloud" (multiple reflection points), rather than just a single rigid source of reflection. However, such a representation does not provide an indication of the size of the reflecting surface. By introducing the reflection power-levels as an additional feature I, based on the relationship in Eqn. 4, we can assign an RGB-weighted pixel value to the points (Fig. 3(b)), resulting in a 3-D heat-map, which may serve as an input to the CNN. Representing intensity levels in an 8-bit RGB color-map allows for finer resolution along the intensity dynamic range, with the lowest I corresponding to absolute red (255,0,0), the maximum I corresponding to absolute blue (0,0,255), and all the intermediate intensity levels mapped appropriately between them. Alternately, if such precise resolution in I is not required for an application, a gray-scale representation could also be used, with the intensity values mapped between 0 and 255 (in an 8-bit representation). Considering the maximum unambiguous depth (X_ua), azimuth (Y_ua) and elevation (Z_ua) offered by the radar, with resolutions Δx, Δy and Δz, respectively, the resulting input data dimension can be represented as:

$$Dimension = \frac{X_{ua}}{\Delta x} \times \frac{Y_{ua}}{\Delta y} \times \frac{Z_{ua}}{\Delta z} \times Channels, \tag{14}$$

where Channels is 3 for an RGB representation and 1 for a gray-scale representation. For instance, consider a radar that can detect up to 256 reflection points in a CPI. To represent the reflection data in a 5 m × 5 m × 5 m scene, with achievable resolutions of 5 cm in all three dimensions, the input dimensions would be 100 × 100 × 100 pixels, each with 3 channels (RGB) corresponding to the reflection power intensity. There are a couple of challenges with this approach.
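Although the excerpt breaks off here, the pipeline ultimately feeds the network 2-D plane projections of the point-cloud (see Fig. 4). The following NumPy sketch (our illustration) shows one way such an encoding could be rasterized: each reflection point is binned into an n × n pixel grid on the XY and XZ planes and colored by its normalized intensity on the red-to-blue ramp described above. The grid size and the linear color ramp are assumptions.

```python
import numpy as np

def pointcloud_to_rgb(points, intensity, n=16):
    """Encode a radar point-cloud as n x n x 3 images on the XY and XZ planes.

    points: (P, 3) array of (x, y, z) already normalized to [0, 1]
    intensity: (P,) reflection power, normalized to [0, 1]
    Returns two uint8 images of shape (n, n, 3). Lowest intensity maps to
    red (255, 0, 0), highest to blue (0, 0, 255); ramp is an assumption.
    """
    img_xy = np.zeros((n, n, 3), dtype=np.uint8)
    img_xz = np.zeros((n, n, 3), dtype=np.uint8)
    cells = np.minimum((points * n).astype(int), n - 1)  # bin into pixels
    for (cx, cy, cz), i in zip(cells, intensity):
        color = (int(255 * (1 - i)), 0, int(255 * i))    # red -> blue ramp
        img_xy[cy, cx] = color        # depth-azimuth (XY) projection
        img_xz[cz, cx] = color        # depth-elevation (XZ) projection
    return img_xy, img_xz

# Example: 256 random reflection points in a normalized scene
pts = np.random.rand(256, 3)
inten = np.random.rand(256)
xy, xz = pointcloud_to_rgb(pts, inten)
print(xy.shape, xz.shape)  # (16, 16, 3) (16, 16, 3)
```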
Fig. 4. The N × N × 3 image data generated from radar projections on the XY and XZ planes are subjected to a 3-layer forked-CNN architecture, and the outputs are concatenated and flattened. A 3-layer MLP is further used to finally obtain the X, Y, and Z positions of the 25 skeletal joints at the output layer.
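The excerpt does not spell out the layer widths, kernel sizes, or activations of the forked network, so the following PyTorch sketch should be read as one plausible instantiation of the Fig. 4 description (two 3-layer CNN branches for the XY and XZ images, concatenate, flatten, 3-layer MLP regressing 25 × 3 joint coordinates), not as the authors' published configuration; all channel counts and hidden widths are assumptions.

```python
import torch
import torch.nn as nn

class ForkedCNN(nn.Module):
    """Two CNN branches (XY and XZ projections) -> concat -> 3-layer MLP."""

    def __init__(self, n=16, n_joints=25):
        super().__init__()
        self.n_joints = n_joints
        def branch():  # assumed 3-layer branch; widths are illustrative
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            )
        self.branch_xy = branch()
        self.branch_xz = branch()
        feat = 2 * 32 * n * n  # concatenated, flattened branch outputs
        self.mlp = nn.Sequential(
            nn.Linear(feat, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_joints * 3),  # X, Y, Z per joint
        )

    def forward(self, img_xy, img_xz):
        f = torch.cat([self.branch_xy(img_xy).flatten(1),
                       self.branch_xz(img_xz).flatten(1)], dim=1)
        return self.mlp(f).view(-1, self.n_joints, 3)

model = ForkedCNN()
out = model(torch.rand(8, 3, 16, 16), torch.rand(8, 3, 16, 16))
print(out.shape)  # torch.Size([8, 25, 3])
```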
[…] from a larger RCS of the body (the torso, say) from a smaller RCS (the elbow, say).

V. EXPERIMENTS AND RESULTS

A. Experimental Setup and Frames Association

In this study we used Texas Instruments AWR1642 BOOST mmWave radar transceivers, which have two transmit and four receive channels on a linear axis and, in their traditional orientation, resolve the radar reflection points in range (depth) and azimuth only. We used two of these, R-1 and R-2 (say), with R-2 rotated 90° counter-clockwise with respect to R-1, so that its azimuth now corresponds to the elevation of the reflection points. Both radars transmitted a 3.072 GHz-wide chirp, centered at 79 GHz, every 92 μs. A dual-slot 3-D frame was developed and used to mount both radars, to ensure stability and consistent data collection. The processed radar point-cloud data from both radars was captured via USB cables on a robot operating system (ROS) interface running on a Linux computer. Each radar would return up to 256 detected points in a coherent processing interval, including their position (depth, elevation/azimuth), velocity and intensity, at 20 frames-per-second (fps). Every return also carried a header with the UTC time-stamp and the radar module index. To capture the ground truth data, we used a Microsoft Kinect connected to a Windows computer, using a MATLAB API. The infra-red (IR) sensor data, coupled with the MathWorks-developed skeletal tracking algorithm, provided us with the depth, azimuth and elevation information of 25 joint positions, as well as the UTC time-stamp of each frame. A common time server was used to synchronize the clocks on both the computers capturing data (radar and Kinect), with the clock slew on the order of one millisecond, which was tolerable. The UTC time-stamps from the Kinect and radar frames were used for frame identification and association.

The experiment was set up in an open space in the Electrical and Computer Engineering department at the University of Arizona. Two human subjects of varying sizes were used, one at a time, to collect the data. The subjects performed four different actions in contiguous sets, viz. (i) Walking, (ii) Left-Arm Swing, (iii) Right-Arm Swing, and (iv) Both-Arms Swing. We acquired ≈32000 samples of training data and ≈6000 samples of validation/development data, to be used for training the model. ≈1700 samples of test data were also collected with the human subject performing the four actions in no ordered fashion, for added robustness.

The acquired data from both radars was first separated using the module information from the frame headers and then associated frame-by-frame with the corresponding Kinect return using the UTC time-stamps, as sketched below. The radar returns in each frame were normalized in range, azimuth (R-1), elevation (R-2) and intensity corresponding to the dimensions of the experiment space. The normalized data was then used to generate two RGB images per frame, corresponding to R-1 and R-2 respectively, based on the approach described in Section IV. Note that we did not need the plane-projection stage, as R-1 and R-2 already provided returns in the XY and XZ planes, respectively. The ground truth skeletal joint positions obtained using the Kinect were also normalized to a [0,1] range. The normalization parameters were stored to rescale the predictions from the model and obtain the real-world joint locations.
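A minimal sketch of the timestamp-based association and normalization described above (our illustration; the tolerance value and function names are assumptions): each radar frame is matched to the Kinect frame with the nearest UTC time-stamp, and pairs whose offset exceeds a tolerance are dropped.

```python
import numpy as np

def associate_frames(radar_ts, kinect_ts, tol=0.025):
    """Match each radar frame to the nearest Kinect frame by UTC time-stamp.

    radar_ts, kinect_ts: 1-D arrays of UTC time-stamps in seconds.
    tol: maximum allowed offset (s); 25 ms is an assumption for 20 fps streams.
    Returns a list of (radar_idx, kinect_idx) pairs.
    """
    kinect_ts = np.asarray(kinect_ts)
    pairs = []
    for i, t in enumerate(radar_ts):
        j = int(np.argmin(np.abs(kinect_ts - t)))  # nearest Kinect frame
        if abs(kinect_ts[j] - t) <= tol:
            pairs.append((i, j))
    return pairs

def normalize(values, lo, hi):
    """Map raw coordinates/intensities into [0, 1] given the scene bounds."""
    return (np.asarray(values) - lo) / (hi - lo)
```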
B. Training the Architecture

A forked-CNN architecture, as described in Section IV, was used as our learning algorithm. The primary reason to use CNNs in this study, as opposed to a complete multi-layer-perceptron (MLP) deep network, was to reduce the computational complexity of the network and achieve a real-time implementation. Note that unlike traditional CNN layers in classification problems, which are aimed at learning and generating spatial filters (edge/corner detection) in higher-dimensional space, we have only used them to map our RGB-encoded […]
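The training details are cut off in this excerpt. Purely as an illustration of how such a regression network is typically fit to the normalized Kinect keypoints, a minimal loop might look as follows; the optimizer, learning rate, epoch count, and MSE loss are our assumptions, not the paper's reported settings.

```python
import torch
import torch.nn as nn

# model: ForkedCNN from the earlier sketch; loader yields batches of
# (img_xy, img_xz, joints) with joints normalized to [0, 1]
def train(model, loader, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # regression on the 25 x 3 joint coordinates
    for epoch in range(epochs):
        for img_xy, img_xz, joints in loader:
            opt.zero_grad()
            pred = model(img_xy, img_xz)
            loss = loss_fn(pred, joints)
            loss.backward()
            opt.step()
```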
Fig. 7. The MAE of the 25 mm-Pose-predicted joint locations (in all three dimensions) over all the frames in the test data set. Note that the 6 outlier joint indices that offer the highest MAEs have been highlighted in green.
Fig. 8. Visual representation of the 17-point mm-Pose output vs. ground truth on the testing data, with two frames shown for (a) Walking, (b) Both-arms swing, (c) Right-arm swing and (d) Left-arm swing. The axes show elevation (gray), azimuth (red) and depth (blue) in meters.
Fig. 9. Comparing the localization error (in meters) of mm-Pose (blue) vs. the baseline model (red) for the 17 joints across 1696 frames of the testing data-set. The bottom-right figure depicts the localization error variance offered by the baseline model and the proposed mm-Pose.
[…] would always produce the average location of each of the joints based on the training data.

1) Localization Accuracy: The MAE of all 25 joint locations is shown in Fig. 7. From our results, we observe that a few joint indices are outliers in the training process and offer the highest MAE. We also observed that the outliers offer a consistently high error across all the frames. The outlier joints correspond to the (i) Wrist, (ii) Palm, (iii) Hand Tip and (iv) Thumb, of both the left and right hands. While the ground truth data using the Kinect could resolve these joints on account of high-resolution spatial imagery, we acknowledge the challenges of representing such extremely granular, small-RCS joints using mmWave radar returns alone.

As the general skeletal representation of human pose could still be constructed with the remaining 17 points, with negligible effect on its visual interpretability, the 8 outliers, as listed above, were excluded from further analysis. The mm-Pose predicted skeleton for two frames of each of the four postures from the test data, along with the ground truth for comparison, is shown in Fig. 8. Also note that the outlier joint positions have been removed from the ground truth data as well, for consistent representation and comparison. The MAE of the mm-Pose predictions from ground truth for all 17 joints across all the frames in the test data is also presented in Fig. 9, to demonstrate the accuracy in representing all the necessary key-points required to construct the skeleton. The proposed mm-Pose architecture offered average localization errors of 3.2 cm in depth (X), 2.7 cm in elevation (Z) and 7.5 cm in azimuth (Y), respectively. The results show that our model (17 joints) offers better localization in the X and Z axes than MIT's RF-Pose3D (8 key-points), however at a greater localization error in azimuth due to a higher variance in the […]
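For reference, the per-joint MAE reported in Fig. 7 and the outlier screening described above can be computed along these lines (a sketch under the assumption that predictions and ground truth are time-aligned arrays; the outlier threshold is illustrative, not the paper's criterion).

```python
import numpy as np

def per_joint_mae(pred, truth):
    """MAE per joint and axis.

    pred, truth: (frames, joints, 3) arrays of XYZ positions in meters.
    Returns a (joints, 3) array of mean absolute errors.
    """
    return np.mean(np.abs(pred - truth), axis=0)

def flag_outlier_joints(mae, factor=2.0):
    """Flag joints whose overall MAE is far above the median joint MAE."""
    overall = mae.mean(axis=1)            # (joints,)
    return np.where(overall > factor * np.median(overall))[0]
```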
Fig. 10. Comparing the cumulative probability distributions of the localization errors from mm-Pose (blue) vs. the baseline model (red) for the 17 joints across 1696 frames of the testing data-set. mm-Pose converges faster than the baseline architecture, demonstrating a high probability of a lower localization error.
[…] lower than the baseline model. A cumulative probability distribution of the localization error for each of the joints, as shown in Fig. 10, further elucidates mm-Pose's consistently lower errors, with a steep convergence to the maximum, unlike the baseline architecture.

D. Practical Implementation

The trained mm-Pose model was then implemented for real-time human-skeleton-based pose estimation using mmWave radars. The entire system was achieved on a ROS interface by instantiating four sequential nodes, outlined in the sketch below. Node 1 was the radar node that published the point-cloud information from the reflection signals following the signal processing stages. Node 2 was subscribed to the published point-cloud data, separated the R-1 and R-2 frames, and published the corresponding 16 × 16 × 3 radar-data-encoded RGB images. Node 3 was the mm-Pose node that used the RGB images from Node 2 to predict the normalized locations of all the joints. The final node (Node 4) un-normalized the predicted joint locations and converted them to a real-world coordinate system before mapping them onto a display for monitoring. The system was tested with different human subjects and the real-time pose estimation system was successfully verified.
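As an illustration of the four-node pipeline, a skeleton of Node 3 in rospy might look like the following. The topic names, the message layout assumed in decode_pair, and the loading of the trained model are all assumptions for the sketch, not the authors' actual interface.

```python
import rospy
import numpy as np
import torch
from sensor_msgs.msg import Image
from std_msgs.msg import Float32MultiArray

model = ForkedCNN()  # trained network from the earlier sketch (weights assumed loaded)

def decode_pair(msg):
    """Split one message into the R-1 (XY) and R-2 (XZ) image tensors.
    Hypothetical layout: two 16x16x3 uint8 images stacked in msg.data."""
    arr = np.frombuffer(msg.data, dtype=np.uint8).reshape(2, 16, 16, 3)
    t = torch.from_numpy(arr.copy()).permute(0, 3, 1, 2).float() / 255.0
    return t[0:1], t[1:2]

def on_images(msg):
    img_xy, img_xz = decode_pair(msg)
    with torch.no_grad():
        joints = model(img_xy, img_xz)   # normalized 25 x 3 predictions
    pub.publish(Float32MultiArray(data=joints.flatten().tolist()))

rospy.init_node("mmpose_node")           # Node 3 of the four-node pipeline
pub = rospy.Publisher("/mmpose/joints_norm", Float32MultiArray, queue_size=1)
rospy.Subscriber("/mmpose/radar_images", Image, on_images)
rospy.spin()                             # Node 4 would un-normalize and display
```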
E. Limitations

With the unavailability of mmWave radar-skeletal databases, the data acquisition stage was the most expensive process in this study. As mm-Pose was developed with training data obtained while performing the four different movements listed above, the output might not be reliable if the subject performs a completely different spatial motion (crouching, bending over, etc.). However, with additional training encompassing more behavioral data, future variants of mm-Pose could be made more robust, including an extension to simultaneous multiple-human skeletal pose tracking.

VI. CONCLUSION

In this paper, mm-Pose, a novel real-time skeletal pose estimation approach using mmWave radars, is proposed. The 3-D XYZ radar point-cloud data (up to N² points per CPI) is first projected onto the XY and XZ planes, followed by an N×N×3 RGB image representation, with the RGB channels corresponding to the 2-D position and intensity information of each reflection point. This data representation was aimed at eliminating a voxel-based learning approach and reducing the sparsity of the input data. A forked-CNN based deep-learning architecture was trained to estimate the X, Y, and Z locations of 25 joints and construct a skeletal representation. 8 outlier joints were identified that did not aid the learning process and were subsequently removed from our system and further analysis, as we were able to reasonably reconstruct the skeletal pose using the remaining 17 joints. The proposed architecture offered a significant reduction in computational complexity compared to traditional MLP networks, and offered a much lower localization error and variance when compared to the baseline architectures. The average localization errors of 3.2 cm in depth (X) and 2.7 cm in elevation (Z) outperform MIT's RF-Pose3D by ≈24% and ≈32%, respectively. However, the localization error of 7.5 cm in azimuth (Y) was found to be greater than the 4.9 cm offered by RF-Pose3D. The end-to-end system was successfully verified for real-time estimation, using mmWave radars and the proposed mm-Pose architecture on ROS. The current implementation of mm-Pose was developed with data obtained from four different motions; however, more motions could be added through the rather expensive process of data collection and labeling for a wide range of spatial motions, for added robustness. Finally, mm-Pose could be used for a wide range of applications including (but not limited to) pedestrian tracking, real-time patient monitoring systems, and through-the-wall pose estimation for military applications.

REFERENCES

[1] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Upper Saddle River, NJ, USA: Prentice-Hall, 2002.
[2] R. Szeliski, Computer Vision: Algorithms and Applications. London, U.K.: Springer-Verlag, 2010, doi: 10.1007/978-1-84882-935-0.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[4] S. Messelodi, C. M. Modena, and M. Zanin, "A computer vision system for the detection and classification of vehicles at urban road intersections," Pattern Anal. Appl., vol. 8, nos. 1–2, pp. 17–31, Sep. 2005.
[5] A. Petrovskaya and S. Thrun, "Model based vehicle detection and tracking for autonomous urban driving," Auto. Robots, vol. 26, nos. 2–3, pp. 123–139, Apr. 2009.
[6] V. N. Dobrokhodov, I. I. Kaminer, K. D. Jones, and R. Ghabcheloo, "Vision-based tracking and motion estimation for moving targets using small UAVs," in Proc. Amer. Control Conf., Jun. 2006, pp. 1428–1433.
[7] R. Reulke, S. Bauer, T. Doring, and F. Meysel, "Traffic surveillance using multi-camera detection and multi-target tracking," in Proc. Image Vis. Comput. New Zealand, 2007, pp. 175–180.
[8] J. A. Oulton, "The global nursing shortage: An overview of issues and actions," Policy, Politics, Nursing Pract., vol. 7, no. 3, pp. 34S–39S, Aug. 2006.
[9] S. R. E. Datondji, Y. Dupuis, P. Subirats, and P. Vasseur, "A survey of vision-based traffic monitoring of road intersections," IEEE Trans. Intell. Transp. Syst., vol. 17, no. 10, pp. 2681–2698, Oct. 2016.
[10] The Tesla Team. (Jun. 2016). A Tragic Loss. Accessed: Jun. 30, 2016. [Online]. Available: https://www.tesla.com/blog/tragic-loss
[11] The New York Times, "Self-driving Uber car kills pedestrian in Arizona, where robots roam," Mar. 2018. [Online]. Available: https://www.nytimes.com/2018/03/19/technology/uber-driverless-fatality.html
[12] I. D. Robertson and S. Lucyszyn, RFIC and MMIC Design and Technology, no. 13. Edison, NJ, USA: IET, 2001.
[13] J. Hasch, E. Topak, R. Schnabel, T. Zwick, R. Weigel, and C. Waldschmidt, "Millimeter-wave technology for automotive radar sensors in the 77 GHz frequency band," IEEE Trans. Microw. Theory Techn., vol. 60, no. 3, pp. 845–860, Mar. 2012.
[14] R. A. Alhalabi and G. M. Rebeiz, "Design of high-efficiency millimeter-wave microstrip antennas for silicon RFIC applications," in Proc. IEEE Int. Symp. Antennas Propag. (APSURSI), Jul. 2011, pp. 2055–2058.
[15] D. Ramanan, D. A. Forsyth, and A. Zisserman, "Strike a pose: Tracking people by finding stylized poses," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2005, pp. 271–278.
[16] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, "Using k-poselets for detecting people and localizing their keypoints," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 3582–3589.
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2961–2969.
[18] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, "DeeperCut: A deeper, stronger, and faster multi-person pose estimation model," in Proc. Eur. Conf. Comput. Vis. Springer, 2016, pp. 34–50, doi: 10.1007/978-3-319-46466-4_3.
[19] L. Pishchulin et al., "DeepCut: Joint subset partition and labeling for multi person pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4929–4937.
[20] G. Papandreou et al., "Towards accurate multi-person pose estimation in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4903–4911.
[21] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7291–7299.
[22] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 740–755, doi: 10.1007/978-3-319-10602-1_48.
[23] L. Sigal, A. O. Balan, and M. J. Black, "HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion," Int. J. Comput. Vis., vol. 87, nos. 1–2, pp. 4–27, Mar. 2010.
[24] Z. Zhang, "Microsoft Kinect sensor and its effect," IEEE Multimedia Mag., vol. 19, no. 2, pp. 4–10, Feb. 2012.
[25] F. Jin et al., "Multiple patients behavior detection in real-time using mmWave radar and deep CNNs," in Proc. IEEE Radar Conf. (RadarConf), Apr. 2019, pp. 1–6.
[26] R. Zhang and S. Cao, "Real-time human motion behavior detection via CNN using mmWave radar," IEEE Sensors Lett., vol. 3, no. 2, pp. 1–4, Feb. 2019.
[27] F. Adib, C.-Y. Hsu, H. Mao, D. Katabi, and F. Durand, "Capturing the human figure through a wall," ACM Trans. Graph., vol. 34, no. 6, pp. 1–13, Nov. 2015.
[28] M. Zhao et al., "Through-wall human pose estimation using radio signals," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7356–7365.
[29] M. Zhao et al., "RF-based 3D skeletons," in Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2018, pp. 267–281.
[30] M. Richards, Fundamentals of Radar Signal Processing, 2nd ed. New York, NY, USA: McGraw-Hill, 2014, ch. 4.6.
[31] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," 2013, arXiv:1312.6034. [Online]. Available: http://arxiv.org/abs/1312.6034
[32] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 818–833, doi: 10.1007/978-3-319-10590-1_53.
[33] G. Vosselman, "Fusion of laser scanning data, maps, and aerial photographs for building reconstruction," in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jun. 2002, pp. 85–88.
[34] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th Int. Conf. Mach. Learn. (ICML), 2010, pp. 807–814.

Arindam Sengupta (Student Member, IEEE) received the B.E. degree in ECE from the Birla Institute of Technology in India, in 2012, and the M.S. degree in ECE from the University of Akron in 2014, where his primary research was on multidimensional digital signal processing architectures. He is currently pursuing the Ph.D. degree with The University of Arizona, under the supervision of Dr. Siyang Cao. He is currently researching methods to improve existing target tracking and localization methods using mmWave radars and applied machine learning. His research interests include signal processing, radars, machine learning, and autonomous vehicles.

Feng Jin (Student Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from Beihang University, Beijing, China, in 2011 and 2014, respectively. He is currently pursuing the Ph.D. degree with The University of Arizona, Tucson, AZ, USA. His research interests include automotive radar signal processing and machine learning on mmWave radar sensors for object classification.

Renyuan Zhang (Student Member, IEEE) received the B.S. degree from Chongqing University, Chongqing, China, in 2009, and the M.S. degree from The University of Arizona, Tucson, AZ, USA, in 2015, where he is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering. His research interests include mmWave radar signal processing, micro-Doppler signatures, sensor fusion, non-coherent integration, and SAR.

Siyang Cao (Member, IEEE) received the B.S. degree in electronic and information engineering from Xidian University, Shanxi, China, in 2007, the M.S. degree in circuits and systems from the South China University of Technology, Guangdong, China, in 2010, and the Ph.D. degree in electrical engineering from The Ohio State University, Columbus, OH, USA, in 2014. Since August 2015, he has been an Assistant Professor with the Electrical and Computer Engineering Department, The University of Arizona, Tucson, AZ, USA. His research interests include radar waveform design, synthetic aperture radar, commercial radar, and signal processing with emphasis on radar signals.