
Accepted for publication in IEEE Transactions on Multimedia. DOI: 10.1109/TMM.2022.3141888

Fast Human Pose Estimation in Compressed Videos


Huan Liu, Graduate Student Member, IEEE, Wentao Liu, Zhixiang Chi, Yang Wang, Yuanhao Yu, Jun Chen, Senior Member, IEEE, and Jin Tang

Huan Liu and Jun Chen are with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON, L8S 2K1, Canada. Wentao Liu, Zhixiang Chi, Yuanhao Yu and Jin Tang are with the Noah's Ark Laboratory, Huawei Technologies Canada. Yang Wang is with the Department of Computer Science, University of Manitoba, and with Huawei Technologies Canada. Manuscript received July 20, 2021.

Abstract—Current approaches for human pose estimation in videos can be categorized into per-frame and warping-based methods. Both approaches have their pros and cons. For example, per-frame methods are generally more accurate, but they are often slow. Warping-based approaches are more efficient, but their accuracy is usually lower. To bridge the gap, in this paper we propose a novel fast framework for human pose estimation that meets real-time inference requirements with controllable accuracy degradation in the compressed video domain. Our approach takes advantage of the motion representation (called the "motion vector") that is readily available in a compressed video. Pose joints in a frame are obtained by directly warping the pose joints from the previous frame using the motion vectors. We also propose modules to correct possible errors introduced by the pose warping when needed. Extensive experimental results demonstrate the effectiveness of our proposed framework for accelerating top-down human pose estimation in videos.

Index Terms—Human pose estimation, compressed video, deep neural network.

I. INTRODUCTION

Human pose estimation in videos is a cornerstone for many computer vision applications, such as smart video surveillance, human-computer interaction and virtual reality. It aims to locate human body joints (e.g. head, elbow) in video sequences. Current real-time solutions to this problem can be categorized into per-frame methods [48], [50], [35], [10], [47], [40], [22], [34], [32], [7], [53], [46], [13], [19], [11], [56], [24], [30], [33] and warping-based methods [38], [43], [14], [5].

Due to their simplicity, per-frame methods are widely deployed in real-world applications. In general, per-frame methods can be categorized into top-down methods [32], [53], [46] and bottom-up methods [7], [13]. While bottom-up methods localize the joints of all persons in a frame directly, top-down methods decompose multi-person pose estimation into the simpler task of single-person pose estimation by first detecting each person in a frame and then applying a single-person pose estimator to each detected person. Although the two pipelines have their distinctive properties, both are usually designed to meet real-time demands by searching for compact neural network models or reducing the input image size. However, per-frame methods do not consider the temporal continuity between frames. As a result, they involve a lot of redundant computations.

To exploit temporal continuity in videos, warping-based methods aim to discover temporal relations (e.g. optical flow [38], [43], pose flow [55]) and quickly propagate human poses from one frame to another. However, computing optical flow is often time-consuming, so warping-based methods are rarely used in real-world applications.

In this paper, we introduce an alternative way of exploiting the temporal continuity in videos for human pose estimation. The core idea of our approach is to take advantage of the motion information that is already available in compressed videos, generated when they are encoded by standard video codecs. Compressed video streams retain only very few frames as RGB images, but contain massive motion information (i.e. motion vectors and residual errors) for frame reconstruction. These motion vectors and residual errors are readily available in compressed videos and do not require any computation to obtain. Recent years have witnessed many successes in handling computer vision tasks in the compressed video domain. Some early work focuses on classification tasks such as action recognition [52] and video classification [8], [9]. These tasks usually do not require precise motion cues at the pixel level, so motion vectors in compressed videos can be easily applied. There are also works on semantic segmentation [29], [16] in compressed videos. Although semantic segmentation is a pixel-labeling task, its performance is largely influenced by predictions in the interior of object instances rather than at instance boundaries. As a result, this task does not require very much motion information either. In comparison, human pose estimation in compressed videos is much more challenging, since it requires accurate joint predictions.

To this end, we propose a novel framework for human pose estimation in the compressed video domain. The framework consists of four components: a human pose estimator, a fast pose warping module (FPW), a pose recall module (PR) and a transition re-initialization module (TR). Specifically, the human pose estimator is a top-down pose estimation network working on RGB images. To reduce temporal redundancy, the fast pose warping module uses motion vectors for rapid pose propagation across consecutive frames. However, since motion vectors are noisy and not always associated with the motion of the body parts, we design a pose recall module to adaptively find "hard-to-warp" human instances and perform full pose estimation instead of warping, by jointly considering the motion intensity and the confidence on body joints. Moreover, video transitions can result in significant motion cues which are irrelevant to body motion.


To address this issue, the transition re-initialization module is introduced to terminate the warping process at video transitions and switch to RGB-based pose estimation.

The main contributions of this work can be summarized as follows. First, this paper represents the first work on real-time human pose estimation in the compressed domain. Second, we propose a human pose estimation framework for the compressed domain using three well-designed modules. Finally, we demonstrate through extensive experimental results that our framework can speed up existing per-frame and warping-based methods by 2-5 times on the PoseTrack datasets, while achieving comparable accuracy.

II. RELATED WORKS

In this section, we briefly review several lines of research related to our work.

Per-frame Human Pose Estimation: Traditional human pose estimation methods [2], [3], [23], [39] usually adopt the pictorial structures model with hand-crafted features. These methods often fail when some body parts are occluded. In recent years, with the emergence of deep convolutional neural networks, most image-based human pose estimation methods [48], [50], [35], [10], [47], [40], [22], [34], [32], [7], [53], [46], [13], [19], [11], [56], [58], [25], [24], [30], [33] learn to predict human poses on large-scale datasets with intensive human joint annotations. Instead of mapping images directly to human joint coordinates, most of these methods, except for [48], choose to predict heatmaps for easier regression and optimization.

In the era of deep learning, image-based methods can be categorized as top-down methods and bottom-up methods. Top-down methods [32], [46], [53], [35], [19], [11] usually rely on a human detector that localizes human instances in an image. These methods then decompose the multi-person human pose estimation task into single-person pose estimation problems. On the contrary, bottom-up methods [22], [7], [13] first detect all the body joints in an image, and then assign the detected joints to each person.

These works mainly focus on exploring novel models to achieve state-of-the-art human pose estimation accuracy, but their processing speed is often slow.

Fast Human Pose Estimation: Although efficient estimation of human pose is quite important, very few works aim for this goal. Rafi et al. [41] introduce a compact neural network that can be trained efficiently on a mid-range GPU. Bulat et al. [6] binarize heavy CNN architectures for model compression and specifically design a parallel, multi-scale architecture for the binary case. Zhang et al. [56] successfully employ a well-trained large network to boost the performance of a small network with knowledge distillation [20]. However, the above-listed methods only focus on designing a small network that is cost-effective to deploy in practice. In this paper, we instead investigate the possibility of accelerating inference in the compressed video domain.

Video Based Human Pose Estimation: Temporal dependency among video frames is the most crucial factor that distinguishes an image task from a video task. Exploiting the temporal correlation wisely can significantly improve the performance on a video task. However, due to the scarcity of large-scale video-based benchmarks, video-based human pose estimation has drawn very little attention in recent years. Some methods [38], [43] use dense optical flow as a temporal representation to capture relationships across multiple frames. In contrast, Doering et al. [14] compute a task-specific motion representation only on human joints to reduce the redundancy of dense optical flow. Bertasius et al. [5] introduce a novel CNN architecture for pose estimation in sparsely labeled videos. This method uses a neural network to directly learn offsets between consecutive frames. Although most of these video-based methods show great improvements in estimation accuracy, they still ignore the problem of how to efficiently estimate human poses in videos.

Video Analysis in Compressed Domain: Video analysis in the compressed domain is also understudied. There are few works that try to leverage compressed-domain knowledge to assist specific video analysis tasks. The current methods in the compressed domain can be categorized as traditional methods and deep learning based methods. Among traditional methods, Chen et al. [12] propose to use global motion estimation and Markov random fields to extract moving regions in the compressed domain. Some works [28], [37] introduce fast scene change detection algorithms using features from compressed videos. These two methods mainly focus on how to precisely detect wipe transitions. Despite the effectiveness of traditional methods, they usually adopt compressed-domain knowledge for transition and motion detection rather than for high-level video analysis. To further exploit the valuable information in the compressed domain, some recent work proposes to use deep learning techniques for video analysis in the compressed video domain. There is some work [8], [9] on 3D convolutional neural networks for video classification utilizing compressed-domain knowledge. Wu et al. [52] accelerate action recognition directly on compressed videos. The success of extracting high-level representations from the compressed domain implies the potential of compressed-domain information for other computer vision tasks. Recently, Li et al. [29] adopt a convolutional LSTM to propagate semantic maps to consecutive frames using motion vectors and residuals. Feng et al. [16] propose a novel real-time framework for semantic segmentation using compressed-domain knowledge. Due to the nature of semantic segmentation, where most of the pixels are inside objects, the noise in motion vectors can be largely tolerated. On the contrary, accurately propagating human joints with noisy motion vectors is a more challenging task. In this paper, we are inspired by Feng et al. [16] to propose a method for fast human pose estimation in the compressed domain. To the best of our knowledge, this paper is the first to address this problem.

Video Analysis Beyond RGB Frames: Our work is loosely related to other video analysis tasks that use information beyond the RGB frames in a video. For example, there has been a lot of work (e.g. [42], [26], [57]) on using the depth information in RGBD videos for object recognition, pose estimation, etc. However, these works only apply to video data collected by RGBD cameras, since depth information is not available in regular videos. In contrast, our work is more widely applicable, since motion vector information is readily available in any compressed video.


Fig. 1: Illustration of decoding a compressed video. Each I-frame is encoded as a regular image. Each P-frame is stored as a motion vector map and a residual map that represent the correlation between the current P-frame and the previous frame (columns: RGB image, motion vector, residual; W denotes the warping operation and + the summation).

III. BACKGROUND: COMPRESSED VIDEO

Due to the enormous data volume, digital videos are typically encoded into video streams for efficient storage and transmission. Commonly used modern video codecs include MPEG-4 Part 2 [27], H.264/AVC [51], HEVC [44], VP9 [31], etc. A video stream compressed by these codecs has a very different structure from the sequence of stand-alone images seen in an uncompressed video. In this section, we take the MPEG-4 Part 2 (Simple Profile) codec [27] as an example to analyze the type of data that is available in a video stream. Nevertheless, most popular video codecs share a similar predictive coding strategy and generate compressed streams with a similar structure, so our analysis of this particular codec generalizes to other codecs.

The basic unit in a compressed video is called a group of pictures (GOP). The encoding and decoding processes of one GOP are independent of any other GOP. A compressed video is composed of a sequence of such GOPs. In the default mode of the MPEG-4 Part 2 codec, a GOP consists of 12 frames, with the first being an I-frame (intra-coded frame) and the rest being P-frames (predictive frames). Video codecs treat the two types of frames differently. The I-frame is encoded as a regular image, so decoding it does not depend on any other frame in the GOP. However, the encoding of each P-frame depends on the data of its previous frame, which ultimately relies on the data of the first I-frame of the GOP. Specifically, for each 16×16 block in a P-frame I_t at time t, the codec first tries to find a best-matched block in the previous frame I_{t-1} by a block-matching method [4]. It then represents the correspondence between the two blocks by a vector pointing from the reference block to the target. Such a vector is known as the motion vector (MV) in the context of video compression. After the block matching, the residual between the target and reference blocks is also computed and encoded into the video stream. As such, the P-frame I_t is compactly represented by an MV map M_t and a residual map R_t, and can be reconstructed by reusing the data of the previous frame I_{t-1}:

I_t(x, y) = I_{t-1}\big((x, y) - M_t(x, y)\big) + R_t(x, y),    (1)

where (x, y) indicates any pixel position in the frame. Figure 1 illustrates the representation and the reconstruction process of P-frames in a GOP. Some other codecs may generate another type of frame, the B-frame (bi-directional frame), which is encoded in a similar manner to P-frames except that the motion vectors are estimated from both previous and future frames.
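To make the reconstruction in Eq. (1) concrete, below is a minimal NumPy sketch under two assumptions of ours (not stated in the paper): the block motion vectors have been expanded to a dense integer-valued per-pixel map, and coordinates falling outside the frame are clamped. All names are placeholders.

import numpy as np

def reconstruct_p_frame(prev_frame, mv, residual):
    # Eq. (1): I_t(x, y) = I_{t-1}((x, y) - M_t(x, y)) + R_t(x, y)
    # prev_frame: (H, W, 3) uint8 decoded frame I_{t-1}
    # mv:         (H, W, 2) integer motion vectors (dx, dy) per pixel
    # residual:   (H, W, 3) signed residual map R_t
    h, w = prev_frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    ref_x = np.clip(xs - mv[..., 0], 0, w - 1)  # reference column per pixel
    ref_y = np.clip(ys - mv[..., 1], 0, h - 1)  # reference row per pixel
    warped = prev_frame[ref_y, ref_x].astype(np.int16)
    return np.clip(warped + residual, 0, 255).astype(np.uint8)

For inspecting this compressed-domain data in practice, FFmpeg exposes it directly: `ffprobe -select_streams v -show_frames input.mp4` reports each frame's pict_type (I/P/B), and `ffmpeg -flags2 +export_mvs -i input.mp4 -vf codecview=mv=pf output.mp4` overlays the forward-predicted motion vectors for visualization.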
IV. OUR APPROACH

Top-down pose estimation is often performed in a two-stage manner. First, a human detector scans the whole image to crop out each person instance in a bounding box. Then, pose estimation is performed in each of the bounding boxes to localize each joint of the person using a heatmap. This process is well established for pose estimation on a still image, but it still has room for improvement when processing a video. As analyzed in Section III, neighboring frames are highly correlated with each other, so it is intuitively possible to reuse the estimation results from the previous frame in the current frame. Consider an extreme case where a person is doing yoga and holding a posture for a few seconds. The motion vectors in the video will indicate that there is no motion between adjacent frames. We can then perform pose estimation only on the first frame and copy the results to the remaining frames. Intuitively, this approach can save several folds of inference time while achieving a similar level of accuracy.

Our proposed approach is inspired by, and is a natural extension of, this intuition. By exploiting the inter-frame relationship readily available in a compressed video stream, we design a system that can accelerate any per-frame pose estimation method while maintaining relatively high prediction accuracy. As shown in Fig. 2, our proposed approach contains several components: a human instance detector, a single-person pose estimator, a fast pose warping (FPW) module, a pose recall (PR) module and a transition re-initialization (TR) module. In particular, the first two components form the baseline image-based pose estimator that is used to initialize the human poses in I-frames. The last three modules are designed for accelerating and correcting pose estimation in P-frames. First, we design the FPW module to propagate the joints of each person based on the results in the previous frame. By reusing the inference results from the previous frame, this module can significantly speed up the pose estimation in P-frames. Although direct warping is fast, the warping error can accumulate over time, and the tracked points gradually drift off the human body. In order to control the error propagation, we further design a pose recall module to correct the pose estimation results when the motion is too complex to follow.



Fig. 2: An overview of our proposed method for fast human pose estimation in compressed videos. I-frames are directly sent to a human detector to detect each person. Then a human pose estimator is applied to each person instance to produce a corresponding human pose. For each P-frame, we first apply the transition re-initialization module (TR). If a scene transition is detected, the frame is treated the same as an I-frame by reconstructing its RGB image. Otherwise, each person instance in the P-frame is passed to the pose recall module (PR) to decide whether we need to re-initialize the pose estimation for this person. If a person instance passes the TR and PR modules, we directly obtain its pose in the current P-frame by warping the pose joints from the previous frame using our fast pose warping (FPW) module.

Another challenge to the fast warping approach is the occurrence of scene transitions, which break the relationship between consecutive frames. To address this challenge, we design a transition re-initialization module to detect such scene transitions so that the pose estimation can be re-initialized on the first frame of the new scene. Note that the PR and TR modules depend only on compressed-domain features and thus introduce minimal overhead into the whole pipeline.

Fig. 2 presents the complete data flow of the proposed framework when processing a compressed video. After decoding each GOP, the leading I-frame is first sent to the human instance detector and the pose estimator to obtain the locations of body joints. The results for P-frames are then efficiently predicted by the FPW module, unless the PR and TR modules are triggered to re-initialize the pose estimation results of several human instances or of the whole image.

The proposed framework exhibits three major advantages over the traditional per-frame framework. First, it does not need to perform image-based pose estimation on most P-frames, resulting in a significant speedup on highly compressed videos. Second, all the additional modules rely only on features that are readily available in a compressed video stream, so they introduce minimal overhead into the pipeline. Third, the framework is compatible with a wide range of image-based pose estimation methods and consistently achieves a 2 to 5× speedup while maintaining comparable accuracy.

We discuss the details of each component below.

Human Instance Detector & Human Pose Estimator: We start by introducing the image-based pose estimation pipeline for the I-frames. Since an I-frame is represented as a standard RGB image in a compressed video, we can choose any image-based human instance detector, denoted by ϕ_d, and any pose estimator, denoted by ϕ_p, to initialize the human pose estimation in a GOP. In this paper, we adopt HRNet [46], which uses an adapted Faster-RCNN for human detection and a specifically designed CNN for the subsequent pose estimation. However, we emphasize that any pose estimation method sharing a similar pipeline can easily be plugged into our proposed framework. We also conduct a study to illustrate the influence of different image-based pose estimators in Section V-C. In addition to operating on I-frames, ϕ_d and ϕ_p are also used to re-initialize on a P-frame, by reconstructing its RGB image, if the TR or PR module is triggered on this P-frame.

Fast Pose Warping: The FPW module is performed on each human instance to localize the joints of this person. Specifically, the module warps the human joints J_{t-1}^i of the i-th person in frame I_{t-1} with the motion vectors M_t at time t. It then generates a new set of joint locations J_t^i for the same person by solving the following equations:

J_t^i(n) - M_t\big(J_t^i(n)\big) = J_{t-1}^i(n), \quad n = 1, \ldots, N    (2)

where J_t^i(n) denotes the coordinates of the n-th joint of the i-th person in frame I_t, and N indicates the total number of joints of this person.
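Note that Eq. (2) is implicit in J_t^i: the motion vector must be read at the yet-unknown warped location. One simple way to solve it is by a few fixed-point iterations starting from the previous joint position. The sketch below is our own minimal NumPy illustration of this, not the authors' released code; the iteration count is an arbitrary choice.

import numpy as np

def warp_joints(joints_prev, mv, n_iters=3):
    # Solve Eq. (2): J_t(n) - M_t(J_t(n)) = J_{t-1}(n), i.e. the fixed
    # point J_t = J_{t-1} + M_t(J_t), starting the iteration at J_{t-1}.
    # joints_prev: (N, 2) array of (x, y) joint coordinates in frame t-1
    # mv:          (H, W, 2) motion vector map of frame t
    h, w = mv.shape[:2]
    joints = joints_prev.astype(np.float64)
    for _ in range(n_iters):
        xi = np.clip(np.round(joints[:, 0]).astype(int), 0, w - 1)
        yi = np.clip(np.round(joints[:, 1]).astype(int), 0, h - 1)
        joints = joints_prev + mv[yi, xi]
    return joints

The same update, applied to the center coordinates of a bounding box, is what the pose recall module described below uses to fast-propagate boxes across frames.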


Fig. 3: a) Successful warping and b) failed warping, both using only FPW to propagate poses from left to right; c) the corresponding results after pose recall, corrected by the PR module. Each column illustrates four consecutive frames in the same GOP.

Since the block-matching approximation is usually adopted for calculating motion vectors, a motion vector often fails to associate the human parts of successive frames when there is severe motion (see Fig. 3(b)), which is reflected in large residual magnitudes.

Pose Recall Module: To solve the loss of motion relations introduced by extensive and severe pose variation, we design the pose recall module. Before the pose recall, we first fast-propagate the human bounding box with Eq. (2) applied to the center coordinates of the box. Then, for a given P-frame, the goal of this module is to decide whether the pose estimation results obtained from the pose warping are likely to be unreliable. If so, it runs the image-based pose estimator on a few specific human instances.

We design this module by considering the residual in each human instance and the motion information on human body joints, in order to adaptively select the persons with fast motion. Specifically, the module is based on two measures, called the motion intensity and the residual intensity, defined below.

The motion intensity is defined as the average motion on the body joints. For the i-th person in the current frame, we define the motion intensity (MI) of this person as the average motion magnitude over the joints:

MI_i = \frac{1}{2N} \sum_{n=1}^{N} \Big( \big|M_t^i\big(J_t^i(n), 0\big)\big| + \big|M_t^i\big(J_t^i(n), 1\big)\big| \Big)    (3)

Note that |·| is the absolute value operator and all operations are element-wise. Here, 0 and 1 index the two channels of the motion vector.

The residual map measures the error remaining after warping the pixels of a P-frame with the motion vectors. The absolute values in the residual map can be regarded as a confidence map for the motion vectors: larger values tend to correspond to areas where the motion vectors are not reliable. We define the residual intensity (RI) of each human instance i as the average magnitude of the residual:

RI_i = \frac{\sum_{(x,y) \in (H_i, W_i)} |R_i(x, y)|}{H_i \times W_i}    (4)

where H_i and W_i denote the height and width of the human bounding box.

We then select each person in the frame whose motion intensity or residual intensity is above a certain threshold. Each selected person instance is sent to the image-based pose estimator for re-initialization. Fig. 3 illustrates the benefit of adopting the pose recall module.
We design this module by considering the residual in Transition Re-initialization: Some videos used in our dataset
each human instance and the motion information on human contain scene transitions due to camera switching. For the
body joints to adaptively select the person with fast motion. frame at the camera transition, the human pose in the current
Specifically, this module is based on two measures called the frame is often uncorrelated to the previous frame. As a result,
motion intensity and the residual intensity defined below. the motion vector map does not provide any information for
The motion intensity is defined as the average motion on valid pose warping (see Fig. 4). This is especially problematic
each body joint. It is computed as follows. For the i-th person if the transition is at the beginning of a GOP. In this case, the
in the current frame, we define the motion intensity (M I) of unmatched human pose would be propagated to the remaining
this person as the average motion magnitude on the joints: frames in this GOP. To address this issue, we propose the
N transition re-initialization module to specifically handle the
1 X
M Ii = (|Mti (Jti (n), 0)| + |Mti (Jti (n), 1)|) (3) camera transition.
2N n=1
Our key observation is that when the camera transition
Noted that |.| is the absolute value operator and all the happens, the residual map of the corresponding frame tends
operations are element-wise. Here, 0 and 1 are the channels to have enormous values. This is due to the fact that the
of the motion vector. two frames at the transition correspond to completely different
The residual map measures the error after warping the pixels scenes. Unlike the pose recall module that operates on each
in a P-frame using a motion vector. The absolute values in the person in a frame, this module operates on the global infor-
residual map can be regarded as the confidence map of motion mation of the motion vector map. Once the residual intensity


The overall algorithm of our framework is shown in Alg. 1. Note that the three proposed modules, i.e. FPW, PR and TR, are not built with neural networks, so they add no model parameters.

Algorithm 1: Overall Inference Algorithm
Require: pose estimator ϕ_p, human detector ϕ_d, compressed video stream V, fast pose warping operation ω.
Output: J_t
for t = 1 to |V| do
    if frame t is an I-frame then
        decode the I-frame to I_t
        B_t = ϕ_d(I_t)    # detect each person
        J_t = ϕ_p(B_t)    # estimate human poses
    else
        decode the P-frame to M_t, R_t and I_t
        if RI_T > THR_trans then
            B_t = ϕ_d(I_t)
            J_t = ϕ_p(B_t)
        else
            for i = 1 to |J_{t-1}| do
                if MI_i^t ≤ THR_motion and RI_i^t ≤ THR_res then
                    J_t^i = ω(J_{t-1}^i)
                else
                    J_t^i = ϕ_p(B_t^i)

Fig. 4: Illustration of the motion vector and residual at a transition (rows show consecutive frames t = 0 to t = 4; columns show the RGB image, residual and motion vector). The residual is the better indicator of a camera transition.
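As a code-level summary, a minimal Python sketch of this inference loop follows. The decoder interface (a stream yielding frames with is_iframe, rgb, mv and residual fields) and the detector/estimator callables are our own placeholders; `warp_joints`, `motion_intensity` and `residual_intensity` refer to the sketches above, and bounding box propagation by Eq. (2) is omitted for brevity.

import numpy as np

def estimate_video(stream, detect, estimate,
                   thr_trans=3.0, thr_motion=50.0, thr_res=5.0):
    # Sketch of Algorithm 1: `detect` and `estimate` play the roles of
    # the image-based detector phi_d and pose estimator phi_p.
    boxes, joints = [], []
    for frame in stream:
        scene_cut = (not frame.is_iframe
                     and np.abs(frame.residual).mean() > thr_trans)  # TR: RI_T check
        if frame.is_iframe or scene_cut:
            boxes = detect(frame.rgb)                        # re-initialize everyone
            joints = [estimate(frame.rgb, b) for b in boxes]
        else:
            joints = [
                warp_joints(j, frame.mv)                     # FPW: cheap warp
                if not needs_recall(frame.mv, frame.residual, j, b)
                else estimate(frame.rgb, b)                  # PR: per-person recall
                for b, j in zip(boxes, joints)
            ]
        yield joints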
V. EXPERIMENTS

In this section, we first describe the datasets and the implementation details. We then present ablation studies on various aspects of the proposed framework and compare it with other methods.

A. Dataset

PoseTrack [1] is a commonly used video-based benchmark for multi-person pose estimation and tracking. The videos in this dataset contain various challenging scenarios. For example, many videos include severe body motion, body pose variations, video transitions, highly occluded human instances and crowded scenes with dynamic human movements. These difficulties make it hard to achieve high accuracy on this dataset. PoseTrack has two releases, called PoseTrack17 and PoseTrack18. PoseTrack17 contains 514 video sequences in total, of which 250, 50 and 214 clips are used as train, validation and test data, respectively. PoseTrack18 is significantly larger than PoseTrack17: the newer release contains 593 train, 170 validation and 375 test clips. However, both datasets only annotate 30 frames around the center of each training clip. The annotations include head bounding boxes and 15 human key joints, with indications of whether each joint is visible. The details of the two datasets are shown in Table I.

TABLE I: The details of the datasets used in this paper. We list the number of video clips in the train, validation and test splits; annotations/clip denotes the number of annotated frames per video clip.

Data Split  | Train | Validation | Test | annotations/clip
PoseTrack17 | 250   | 50         | 214  | 30
PoseTrack18 | 593   | 170        | 375  | 30

In this paper, we conduct ablation studies and experiments on both the PoseTrack17 and PoseTrack18 datasets using the official train, validation and test splits. The human pose estimator is fine-tuned on the training set. We then evaluate our proposed framework on the validation and test sets. The evaluation metric used in this work is the mean average precision (mAP), as in [1], [40].


B. Implementation Details

We choose HRNet-W48 [46] as our pose estimator. It is pretrained on the COCO dataset and fine-tuned on the PoseTrack18 training set. The fine-tuning process starts with an initial learning rate of 10^{-4} for 10 epochs. We then reduce the learning rate by a factor of 10 until the end of 20 epochs. For data augmentation, we draw random samples uniformly distributed over [-45°, 45°] and [0.65, 1.35] for random rotation and random scaling, respectively. Flipping and half-body data augmentation [49] are also used. We adopt the detector in [17] for human bounding box detection. The three thresholds THR_trans, THR_motion and THR_res in Algorithm 1 are set to 3, 50 and 5, respectively. We use MPEG-4 Part 2 (Simple Profile) [27] as our codec to compress the PoseTrack videos with the default GOP size of 12. We do not use any data augmentation during testing. Our proposed method is implemented in PyTorch [36] and all evaluations are conducted on the same Nvidia P100 GPU.
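For reference, the stated fine-tuning schedule and augmentation ranges translate to the following PyTorch sketch. The optimizer type and the stand-in model are our assumptions; the paper specifies only HRNet-W48, the learning-rate schedule and the sampling ranges.

import torch

model = torch.nn.Conv2d(3, 15, kernel_size=1)  # stand-in for the pose network (15 joint heatmaps)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer type is assumed
# LR 1e-4 for 10 epochs, then reduced by a factor of 10 until epoch 20.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)

def sample_augmentation():
    # Rotation in [-45, 45] degrees, scale in [0.65, 1.35], random flip,
    # as described above (half-body augmentation [49] omitted here).
    angle = float(torch.empty(1).uniform_(-45.0, 45.0))
    scale = float(torch.empty(1).uniform_(0.65, 1.35))
    flip = bool(torch.rand(1) < 0.5)
    return angle, scale, flip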
C. Ablation Studies

In this section, we perform extensive ablation studies on various aspects of the proposed framework. All ablation studies are conducted on the PoseTrack18 validation set.

Effects of Individual Modules: We perform ablation studies to demonstrate the effectiveness of each module in our proposed framework by removing one or more modules. The results are shown in Table II, from which we can make several observations: 1) the fast pose warping module can efficiently accelerate the human pose estimation of the off-the-shelf pose estimator; 2) the pose recall module can effectively identify significant motion in videos and re-initialize the pose estimation of an individual person; 3) the transition re-initialization module can detect "hard-to-warp" frames and video transitions, which avoids error propagation along the time sequence; 4) the entire framework with all modules achieves the best overall balance between accuracy and speed.

TABLE II: Ablation studies on the effect of each individual module on both accuracy and speed.

FPW | PR | TR | mAP  | FPS
-   | -  | -  | 79.2 | 5.8
✓   | -  | -  | 54.1 | 49.2
✓   | ✓  | -  | 72.6 | 22.5
✓   | -  | ✓  | 71.8 | 36.1
-   | ✓  | ✓  | 77.3 | 5.4
✓   | ✓  | ✓  | 77.2 | 19.2

Figure 5 shows some qualitative examples of the different methods. Unsurprisingly, the method with only fast pose warping has the best efficiency. However, if we only use FPW, the accuracy degrades dramatically. Figure 5(b) illustrates that the error accumulates from the beginning of the GOP to the end. Especially at camera transitions, the FPW module still propagates the unreliable pose to the next frame, resulting in inaccurate estimation. In general, human motion is exceptionally complicated, and directly warping poses with a motion vector can significantly jeopardize the performance. The method with pose recall and FPW solves this problem to some extent: during inference, a person with significant pose variance is re-initialized, which stops the error from propagating to the next frame. From Fig. 5(c), we can observe that the PR module can avoid inaccurate pose warping before a camera transition. However, after the camera transition, the PR module cannot apply human pose estimation with the now-inaccurate bounding boxes; thus the pose of the person in the white t-shirt is missing. We then show the performance of using both the FPW and TR modules. It can be seen from Fig. 5(d) that transition re-initialization greatly helps boost the performance of FPW. Our method (Fig. 5(e)), using all modules, gives the best qualitative results.
Warping by Optical Flow: We conduct a comparison with pose warping using optical flow instead of motion vectors, to show the effectiveness of our proposed framework in accelerating inference. We experimented with PWCNet [45] and FlowNet2 [21] for optical flow estimation. Table III shows the performance of our method using FPW and of the optical flow-based alternatives. Surprisingly, we observe similar accuracy between using motion vectors and using optical flow. This indicates that the more accurate motion modeling provided by optical flow offers no additional benefit for propagating human poses across video frames compared with the motion vectors that are already available in compressed videos. Another observation from Table III is that our fast pose warping is about 3-8 times faster than optical flow estimation. The gain in inference speed mainly comes from the fact that we take the existing motion representations from the compressed video domain instead of relying on expensive optical flow estimation.

TABLE III: Comparison with pose warping using optical flow. We experiment with several different optical flow algorithms. T_flow represents the time for estimating optical flow, and T_warp denotes the time for warping. Our method is more efficient since it does not require computing optical flow, while its accuracy is comparable to that of optical flow-based pose warping.

Method           | T_flow | T_warp | mAP
HRNet+FlowNet2   | 56 ms  | 7.6 ms | 55.9
HRNet+FlowNet2s  | 18 ms  | 7.6 ms | 53.9
HRNet+PWCNet     | 14 ms  | 7.6 ms | 54.8
Ours (FPW only)  | -      | 7.6 ms | 54.1


(a) Res & MV (b) Only FPW (c) FPW + PR (d) FPW + TR (e) Ours

Fig. 5: Qualitative results of the ablation studies on the PoseTrack18 validation set. We use red arrows to point out the estimation
error, orange boxes to indicate the camera transition and red boxes to illustrate the human instance being recalled by our PR
module. From top to bottom, the decreased number of red arrows indicates the effectiveness of our modules.

Effect of GOP Size: Table IV illustrates the performance of our method under different values of the GOP size. In order to demonstrate the significance of our method in balancing inference speed against estimation accuracy, we also show the results of the baseline that only uses fast pose warping (FPW). When the GOP size is set to 1, the video pose estimation task degrades to per-frame pose estimation; in this case, the accuracy and inference speed of the two methods are the same. As the GOP size increases, we generally see a decreasing trend in accuracy and an increasing trend in inference speed. However, benefiting from our PR and TR modules, our method is much less sensitive to the GOP size: its accuracy only decreases from 79.2 to 77.2 with a nearly 4× speed-up. In contrast, adopting only the FPW module for fast warping causes the performance to drop significantly, from 79.2 to 54.1. This ablation study further demonstrates the robustness of our method.

TABLE IV: The inference speed and estimation accuracy with different GOP sizes. The frame size is set to 384×288.

Method       | GOP size | mAP  | FPS
FPW only     | 1        | 79.2 | 5.8
Ours (HRNet) | 1        | 79.2 | 5.8
FPW only     | 4        | 70.1 | 16.5
Ours (HRNet) | 4        | 78.4 | 9.2
FPW only     | 8        | 61.2 | 33.1
Ours (HRNet) | 8        | 77.8 | 14.7
FPW only     | 12       | 54.1 | 49.2
Ours (HRNet) | 12       | 77.2 | 19.2

Influences of Crop Size: The input image size also influences the inference speed: intuitively, inference can be accelerated as the input image becomes smaller. We conduct experiments to show the influence of the crop size of each human instance on our proposed method. We crop human instances with two commonly used bounding box sizes (384×288 and 256×192). The performance of our method with the two sizes is shown in Table VI. The inference speed increases when the input size of a human instance is decreased from 384×288 to 256×192. The gain in inference speed is mainly due to the fact that the human pose estimation model runs faster on a smaller input image. Note that our overall framework can work with any input size; for example, our fast pose warping module only takes the joint coordinates from the previous frame and warps poses regardless of the size of each person.

Effects of Image-based Pose Estimator: Our overall framework does not depend on the particular choice of the image-based pose estimator. In this experiment, we show the performance of our framework when adopting three different state-of-the-art pose estimation models, i.e. SimpleBaseline [53], HRNet [46] and the 8-stage Hourglass [32]. The performance of the three baselines is shown in Table VI. We can observe that these three pose estimation methods can all be significantly accelerated once used within our framework. It is worth mentioning that we achieve a 5× speedup on the 8-stage Hourglass. This empirical analysis further illustrates the advantage of our method for speeding up human pose estimation.

TABLE VI: The inference speed and estimation accuracy with different input image sizes, for different human pose estimators (HRNet, SimpleBaseline and 8-stage Hourglass) plugged into our framework. Our framework consistently and significantly accelerates inference without much accuracy drop.

Method                | Input size | mAP  | FPS
8-stage Hourglass     | 256×192    | 59.8 | 3.2
8-stage Hourglass     | 384×288    | 62.3 | 2.1
Ours (Hourglass)      | 256×192    | 58.1 | 13.6
Ours (Hourglass)      | 384×288    | 60.8 | 10.2
SimpleBaseline        | 256×192    | 75.6 | 11.2
SimpleBaseline        | 384×288    | 77.9 | 6.7
Ours (SimpleBaseline) | 256×192    | 73.2 | 25.7
Ours (SimpleBaseline) | 384×288    | 75.8 | 20.8
HRNet-W48             | 256×192    | 77.4 | 9.7
HRNet-W48             | 384×288    | 79.2 | 5.8
Ours (HRNet)          | 256×192    | 75.4 | 24.3
Ours (HRNet)          | 384×288    | 77.2 | 19.2
accelerated as the input image becomes smaller. We conduct three baselines is shown in Table VI. We can observe that these


Fig. 6: Qualitative results on the PoseTrack18 validation set. The first column corresponds to I-frames, while the other columns correspond to P-frames in a GOP.

Overall Inference Speed: Finally, we investigate the overall inference speed of the per-frame methods and of our proposed one. We consider the time measured after the video is decompressed. Specifically, we add the inference time of the human detector to the pose estimation process. The comparison of overall inference speed is shown in Figure 7. Our framework is 3-5 times faster than the per-frame methods. One reason is that per-frame methods require human bounding box detection on every frame, while our framework only needs such detection on I-frames. In other words, with the PR module, our method allows human detection on a subset of frames (e.g. I-frames) and quickly propagates bounding boxes from the current frame to the other frames in a GOP. The results also provide solid evidence that our method is more efficient when deployed in practice.

Fig. 7: Illustration of the overall running time on the PoseTrack18 validation set. The overall running time consists of the bounding box proposal time and the human pose estimation time. (The figure plots mAP against FPS for HRNet, SimpleBaseline, the 8-stage Hourglass, AlphaPose and PoseWarper, together with our corresponding variants, with the real-time regime marked.)

D. Main Results

Table V shows the quantitative comparison of our approach with several state-of-the-art human pose estimation methods in terms of accuracy and speed. The comparison is conducted on both the PoseTrack17 and PoseTrack18 datasets. We can generally observe a trade-off between inference speed and accuracy in Table V. For example, although PoseWarper is the top-performing method for all three datasets, its inference speed is the slowest. PoseFlow and AlphaPose can run at over ten frames per second; however, their accuracy is 10 mAP lower than that of the top-performing methods. Somewhat surprisingly, our method is the only one that can estimate human poses in real time, at over 19 FPS, while achieving accuracy comparable to the top-performing method. It is worth mentioning that our performance is better than that of the original HRNet. We show some qualitative examples of our method in Fig. 6.


TABLE V: Quantitative comparison on the PoseTrack benchmarks. The performance numbers of the compared methods are collected either from the PoseTrack leaderboard or from the respective papers. We additionally report the FPS of the methods that are open-sourced.

PoseTrack17 Val Set:
Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean | FPS
PoseFlow [54] | 66.7 | 73.3 | 68.3 | 61.1 | 67.5 | 67.0 | 61.3 | 66.5 | 10.2
JointFlow [14] | - | - | - | - | - | - | - | 69.3 | -
FastPose [56] | 80.0 | 80.3 | 69.5 | 59.1 | 71.4 | 67.5 | 59.4 | 70.3 | 6.4
SimpleBaseline [53] | 81.7 | 83.4 | 80.0 | 72.4 | 75.3 | 74.8 | 67.1 | 76.7 | 6.2
HRNet [46] | 82.1 | 83.6 | 80.4 | 73.3 | 75.5 | 75.3 | 68.5 | 77.3 | 5.3
PoseWarper [5] | 81.4 | 88.3 | 83.9 | 78.0 | 82.4 | 80.5 | 73.6 | 81.2 | 1.9
Ours | 79.9 | 87.6 | 82.8 | 76.7 | 80.7 | 79.4 | 72.8 | 80.0 | 19.0

PoseTrack17 Test Set:
Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean | FPS
PoseFlow [54] | 64.9 | 67.5 | 65.0 | 59.0 | 62.5 | 62.8 | 57.9 | 63.0 | 9.7
JointFlow [14] | - | - | - | 53.1 | - | - | 50.4 | 63.4 | -
SimpleBaseline [53] | 80.1 | 80.2 | 76.9 | 71.5 | 72.5 | 72.4 | 65.7 | 74.6 | 5.9
HRNet [46] | 80.1 | 80.2 | 76.9 | 72.0 | 73.4 | 72.5 | 67.0 | 74.9 | 5.8
PoseWarper [5] | 79.5 | 84.3 | 80.1 | 75.8 | 77.6 | 76.8 | 70.8 | 77.9 | -
Ours | 78.4 | 83.8 | 79.3 | 74.3 | 75.4 | 75.4 | 69.6 | 76.7 | 18.6

PoseTrack18 Val Set:
Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean | FPS
AlphaPose [15] | 63.9 | 78.7 | 77.4 | 71.0 | 73.7 | 73.0 | 69.7 | 71.9 | 14.8
MDPN [18] | 75.3 | 81.2 | 79.0 | 74.1 | 72.4 | 73.0 | 69.9 | 75.0 | -
PoseWarper [5] | 79.9 | 86.3 | 82.4 | 77.5 | 79.8 | 78.8 | 73.2 | 79.7 | 2.3
Ours | 78.8 | 84.8 | 79.8 | 73.2 | 76.2 | 75.6 | 69.9 | 77.2 | 19.2

PoseTrack18 Test Set:
Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean | FPS
AlphaPose++ [18], [15] | - | - | - | 66.2 | - | - | 65.0 | 67.6 | -
MDPN [18] | - | - | - | 74.5 | - | - | 69.0 | 76.4 | -
PoseWarper [5] | 78.9 | 84.4 | 80.9 | 76.8 | 75.6 | 77.5 | 71.8 | 78.0 | 1.7
Ours | 76.8 | 82.4 | 78.2 | 73.0 | 71.5 | 74.6 | 69.0 | 75.2 | 18.8

E. Limitation and Future Works

Since our proposed approach is introduced to accelerate current per-frame human pose estimation methods, it may inherit the limitations of per-frame methods. Fig. 8 shows some failure cases of our method. It can be observed that our method fails to predict accurate human poses when a human joint is occluded or the image is blurry. These two problems are also the main challenges in per-frame human pose estimation [59]. We leave addressing them to future work.

Fig. 8: Failure cases, indicated by red arrows. The poses of the elbow and head cannot be detected because of occlusion and image blur.

VI. CONCLUSION

In this paper, we have introduced the task of human pose estimation in the compressed video domain. The goal is to take advantage of the motion representation (i.e. motion vectors) that is already encoded in a video stream to accelerate pose estimation. The proposed framework uses motion vectors to propagate the estimated pose joints from the I-frame to the other P-frames. We also introduce additional modules to re-initialize the pose estimation when the pose propagation is unreliable due to large motions or scene transitions. Overall, our proposed framework achieves a good balance between accuracy and inference speed.

REFERENCES


[1] Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele. PoseTrack: A benchmark for human pose estimation and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5167-5176, 2018.
[2] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1014-1021. IEEE, 2009.
[3] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Monocular 3D pose estimation and tracking by detection. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 623-630. IEEE, 2010.
[4] Aroh Barjatya. Block matching algorithms for motion estimation. IEEE Transactions Evolution Computation, 8(3):225-239, 2004.
[5] Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, and Lorenzo Torresani. Learning temporal pose estimation from sparsely-labeled videos. arXiv preprint arXiv:1906.04016, 2019.
[6] Adrian Bulat and Georgios Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In Proceedings of the IEEE International Conference on Computer Vision, pages 3706-3714, 2017.
[7] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7291-7299, 2017.
[8] Aaron Chadha, Alhabib Abbas, and Yiannis Andreopoulos. Compressed-domain video classification with deep neural networks: "There's way too much information to decode the matrix". In 2017 IEEE International Conference on Image Processing (ICIP), pages 1832-1836. IEEE, 2017.
[9] Aaron Chadha, Alhabib Abbas, and Yiannis Andreopoulos. Video classification with CNNs: Using the codec as a spatio-temporal activity sensor. IEEE Transactions on Circuits and Systems for Video Technology, 29(2):475-485, 2017.
[10] Xianjie Chen and Alan L. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems, pages 1736-1744, 2014.
[11] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7103-7112, 2018.
[12] Yue-Meng Chen, Ivan V. Bajić, and Parvaneh Saeedi. Moving region segmentation from compressed video using global motion estimation and Markov random fields. IEEE Transactions on Multimedia, 13(3):421-431, 2011.
[13] Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S. Huang, and Lei Zhang. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. arXiv preprint arXiv:1908.10357, 2019.
[14] Andreas Doering, Umar Iqbal, and Juergen Gall. JointFlow: Temporal flow fields for multi person tracking. arXiv preprint arXiv:1805.04596, 2018.
[15] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2334-2343, 2017.
[16] Junyi Feng, Songyuan Li, Xi Li, Fei Wu, Qi Tian, Ming-Hsuan Yang, and Haibin Ling. TapLab: A fast framework for semantic video segmentation tapping into compressed-domain knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[17] Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran. Detect-and-track: Efficient pose estimation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 350-359, 2018.
[18] Hengkai Guo, Tang Tang, Guozhong Luo, Riwei Chen, Yongchen Lu, and Linfu Wen. Multi-domain pose network for multi-person pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961-2969, 2017.
[20] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[21] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2462-2470, 2017.
[22] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pages 34-50. Springer, 2016.
[23] Sam Johnson and Mark Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, volume 2, page 5. Citeseer, 2010.
[24] Aouaidjia Kamel, Bin Sheng, Ping Li, Jinman Kim, and David Dagan Feng. Hybrid refinement-correction heatmaps for human pose estimation. IEEE Transactions on Multimedia, 23:1330-1342, 2020.
[25] Seung-Taek Kim and Hyo Jong Lee. Lightweight stacked hourglass network for human pose estimation. Applied Sciences, 10(18):6497, 2020.
[26] Alexander Krull, Eric Brachmann, Frank Michel, Michael Ying Yang, Stefan Gumhold, and Carsten Rother. Learning analysis-by-synthesis for 6D pose estimation in RGB-D images. In Proceedings of the IEEE International Conference on Computer Vision, pages 954-962, 2015.
[27] Didier Le Gall. MPEG: A video compression standard for multimedia applications. Communications of the ACM, 34(4):46-58, 1991.
[28] Seong-Whan Lee, Young-Min Kim, and Sung Woo Choi. Fast scene change detection using direct feature extraction from MPEG compressed videos. IEEE Transactions on Multimedia, 2(4):240-254, 2000.
[29] Ang Li, Yiwei Lu, and Yang Wang. Semantic segmentation in compressed videos. In 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), pages 1-5. IEEE, 2019.
[30] Miaopeng Li, Zimeng Zhou, and Xinguo Liu. Multi-person pose estimation using bounding box constraint and LSTM. IEEE Transactions on Multimedia, 21(10):2653-2663, 2019.
[31] Debargha Mukherjee, Jingning Han, Jim Bankoski, Ronald Bultje, Adrian Grange, John Koleszar, Paul Wilkins, and Yaowu Xu. A technical overview of VP9: the latest open-source video codec. SMPTE Motion Imaging Journal, 124(1):44-54, 2015.
[32] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483-499. Springer, 2016.
[33] Guanghan Ning, Zhi Zhang, and Zhiquan He. Knowledge-guided deep fractal neural networks for human pose estimation. IEEE Transactions on Multimedia, 20(5):1246-1259, 2017.
[34] Wanli Ouyang, Xiao Chu, and Xiaogang Wang. Multi-source deep learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2329-2336, 2014.
[35] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4903-4911, 2017.
[36] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[37] Soo-Chang Pei and Yu-Zuong Chou. Effective wipe detection in MPEG compressed video using macro block type information. IEEE Transactions on Multimedia, 4(3):309-319, 2002.
[38] Tomas Pfister, James Charles, and Andrew Zisserman. Flowing ConvNets for human pose estimation in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 1913-1921, 2015.
[39] Leonid Pishchulin, Mykhaylo Andriluka, Peter Gehler, and Bernt Schiele. Strong appearance and expressive spatial models for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3487-3494, 2013.
[40] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V. Gehler, and Bernt Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4929-4937, 2016.
[41] Umer Rafi, Bastian Leibe, Juergen Gall, and Ilya Kostrikov. An efficient convolutional network for human pose estimation. In BMVC, volume 1, page 2, 2016.
[42] Max Schwarz, Hannes Schulz, and Sven Behnke. RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1329-1335. IEEE, 2015.
[21] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey [43] Jie Song, Limin Wang, Luc Van Gool, and Otmar Hilliges. Thin-slicing
Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow network: A deep structured model for pose estimation in videos. In
estimation with deep networks. In Proceedings of the IEEE conference Proceedings of the IEEE conference on computer vision and pattern
on computer vision and pattern recognition, pages 2462–2470, 2017. recognition, pages 4220–4229, 2017.
[22] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo An- [44] Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand.
driluka, and Bernt Schiele. Deepercut: A deeper, stronger, and faster Overview of the high efficiency video coding (hevc) standard. IEEE

1520-9210 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Pimpri Chinchwad College Of Engineering. Downloaded on September 18,2022 at 09:10:41 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMM.2022.3141888, IEEE
Transactions on Multimedia
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 12

Transactions on circuits and systems for video technology, 22(12):1649–


1668, 2012.
[45] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
[46] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5693–5703, 2019.
[47] Jonathan J. Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, pages 1799–1807, 2014.
[48] Alexander Toshev and Christian Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
[49] Zhicheng Wang, Wenbo Li, Binyi Yin, Qixiang Peng, Tianzi Xiao, Yuming Du, Zeming Li, Xiangyu Zhang, Gang Yu, and Jian Sun. MSCOCO keypoints challenge 2018. In Joint Recognition Challenge Workshop at ECCV 2018, volume 5, 2018.
[50] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
[51] Thomas Wiegand, Gary J. Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.
[52] Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R. Manmatha, Alexander J. Smola, and Philipp Krähenbühl. Compressed video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6026–6035, 2018.
[53] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 466–481, 2018.
[54] Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, and Cewu Lu. Pose Flow: Efficient online pose tracking. arXiv preprint arXiv:1802.00977, 2018.
[55] Dingwen Zhang, Guangyu Guo, Dong Huang, and Junwei Han. PoseFlow: A deep motion representation for understanding human behaviors in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6762–6770, 2018.
[56] Feng Zhang, Xiatian Zhu, and Mao Ye. Fast human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3517–3526, 2019.
[57] Guyue Zhang, Jun Liu, Hengduo Li, Yan Qiu Chen, and Larry S. Davis. Joint human detection and head pose estimation via multistream networks for RGB-D videos. IEEE Signal Processing Letters, 24(11):1666–1670, 2017.
[58] Zhe Zhang, Jie Tang, and Gangshan Wu. Simple and lightweight human pose estimation. arXiv preprint arXiv:1911.10346, 2019.
[59] Ce Zheng, Wenhan Wu, Taojiannan Yang, Sijie Zhu, Chen Chen, Ruixu Liu, Ju Shen, Nasser Kehtarnavaz, and Mubarak Shah. Deep learning-based human pose estimation: A survey. arXiv preprint arXiv:2012.13392, 2020.
