0% found this document useful (0 votes)
13 views14 pages

【2022】IEEE CSVT a Progressive Quadric Graph Convolutional Network for 3D Human Mesh Recovery2

Uploaded by

lei wang
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views14 pages

【2022】IEEE CSVT a Progressive Quadric Graph Convolutional Network for 3D Human Mesh Recovery2

Uploaded by

lei wang
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

104 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 33, NO.

1, JANUARY 2023

A Progressive Quadric Graph Convolutional


Network for 3D Human Mesh Recovery
Lei Wang , Member, IEEE, Xunyu Liu, Xiaoliang Ma , Member, IEEE, Jiaji Wu , Member, IEEE,
Jun Cheng , Member, IEEE, and Mengchu Zhou , Fellow, IEEE

Abstract— Human mesh recovery from one single image has by adding a decoder head. Consequently, the computational
achieved rapid progress recently, but many methods suffer complexity can be reduced greatly.
from the image appearance overfitting since the training data
are collected along with accurate 3D annotations in controlled Index Terms— 3D human mesh, graph convolutional network,
settings of monotonous backgrounds or simple clothes. Some deep learning, VR.
methods regress human mesh vertices from poses to tackle the
above problem. However the mesh topologies have not been well I. I NTRODUCTION
exploited, and artifacts are often generated. In this paper, we aim
to find an efficient low-cost solution to human mesh reconstruc-
tion. To this end, we propose a Progressive Quadric Graph
U NDERSTANDING humans and their behavior are of
prime importance in computer vision, while 3D human
pose [1], [2] and shape estimation [3], [4] provide 3D repre-
Convolutional Network (PQ-GCN), and design a simple and fast
method for 3D human mesh recovery from a single image in the sentation, as shown in Fig. 1. This can be used in potential
wild. Specifically, we apply quadric-based surface simplification applications, such as virtual reality (VR), sports motion analy-
to human meshes and design a progressive graph convolution sis, robotics [5], [6], [7]. 3D recovery from monocular image
network, accompanied by mesh feature up-sampling, to deal with or video is more convenient and lower-cost for practical appli-
the mesh topologies. We carry out a series of studies to validate
our method. The results prove that our method achieves superior cations than the methods based on multi-view cameras, depth
performance on a challenging in-the-wild dataset, while using cameras, or Inertial Measurement Unit (IMU) [8]. However,
66% fewer parameters than the existing method, Pose2Mesh. this task is difficult because of the complex human articulation
Artifacts have also been eliminated and better visual quality has and 2D-to-3D ambiguity.
been obtained without any further post-processing and model Recent deep neural networks (DNNs) in this field have
fitting. Besides, the recovery can be stopped at an earlier stage
obtained rapid progress, and these methods are model-based or
model-free. The former regress the pose and shape parameters
Manuscript received 17 March 2022; revised 20 June 2022 and 30 July of one human mesh model, such as the popular Skinned
2022; accepted 11 August 2022. Date of publication 17 August 2022; date Multi-Person Linear (SMPL) [3]. In these methods, registered
of current version 6 January 2023. This work was supported in part by
the Guangdong Basic and Applied Basic Research Foundation under Grant datasets are needed, which are generated by fitting model
2021A1515012637; in part by the National Natural Science Foundation of parameters [9], [10], [11]. But limited exemplars constrain the
China under Grant U21A20487 and Grant 61976143; in part by the Chinese pose and shape spaces [12]. On the contrary, the latter estimate
Academy of Sciences (CAS) Key Technology Talent Program, Colleges and
Universities Key Laboratory of Intelligent Integrated Automation under Grant mesh vertex coordinates directly, in which case Convolutional
201502; in part by the CAS Key Laboratory of Human-Machine Intelligence- Neural Network (CNN) has limited applications.
Synergy Systems, Shenzhen Institutes of Advanced Technology, CAS, under As to the above situation, some methods exploit Graph Con-
Grant 2014DP173025; in part by the Guangdong-Hong Kong-Macao Joint
Laboratory of Human-Machine Intelligence-Synergy Systems under Grant volutional Network (GCN) [13], [14], [15] for 3D human mesh
2019B121205007; in part by the Shenzhen Engineering Laboratory for 3D recovery [16], [17]. Kolotouros et al. [16] propose a GCN
Content Generating Technologies under Grant [2017] 476; and in part by the architecture to regress 3D vertex coordinates from one single
Shenzhen Technology Project under Grant JCYJ20180507182610734. This
article was recommended by Associate Editor Y. Wu. (Corresponding author: image. Since the directly generated meshes have artifacts on
Xiaoliang Ma.) the surface, a Multi-layer Perceptron (MLP) is needed to
Lei Wang and Jun Cheng are with the CAS Key Laboratory of Human- predict SMPL parameters as post-processing. Choi et al. [17]
Machine Intelligence-Synergy Systems, Guangdong-Hong Kong-Macao Joint
Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Insti- propose Pose2Mesh to recover a 3D human mesh from 2D
tute of Advanced Technology, Chinese Academy of Sciences, Shenzhen human pose using GCN. They apply the graph coarsening
518055, China (e-mail: [email protected]; [email protected]). technique [18] to generate eight graphs of different resolutions,
Xunyu Liu and Xiaoliang Ma are with the College of Computer Science and
Software Engineering, Shenzhen University, Shenzhen 518060, China (e-mail: with the number of vertices ranging from 96 to 12288. This
[email protected]; [email protected]). technique can generate meshes with topologies of higher
Jiaji Wu is with the School of Electronic Engineering, Xidian University, resolution than the target which helps to learn more details,
Xi’an 710071, China (e-mail: [email protected]).
Mengchu Zhou is with the Helen and John C. Hartmann Department of but leads to more graph convolutions and excessive compu-
Electrical and Computer Engineering, New Jersey Institute of Technology, tational resources. From coarse to fine, they use a nearest
Newark, NJ 07102 USA (e-mail: [email protected]). up-sampling algorithm to achieve mesh feature up-sampling
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/TCSVT.2022.3199201. with a scale factor of 2, by copying features of each vertex in
Digital Object Identifier 10.1109/TCSVT.2022.3199201 a low-resolution graph to two corresponding vertices in a high-
1051-8215 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on January 06,2023 at 09:01:16 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: PQ-GCN FOR 3D HUMAN MESH RECOVERY 105

A. Optimization-Based 3D Human Body Recovery


Optimization-based methods try to fit a 3D body model to
2D keypoints of an input image to obtain a 3D human body
estimation [4], [23], [30]. Kolotouros et al. [21] propose SPIN
(SMPL oPtimization IN the loop) based on a collaboration of
regression and optimization in an iterative routine. Regression
Fig. 1. An efficient low-cost progressive graph convolutional network is results are used as initial values to their optimization process,
designed for 3D human mesh recovery from one single image.
and optimization provides supervision for regression. The
resolution graph. The nearest feature up-sampling may cause optimization-based methods have high complexity and need
no discrimination between the features of adjacent vertices. long time for training. In this paper, our aim is to develop a
In order to get a final human mesh, pre-defined index mapping light-weight low-complexity solution.
is also needed to convert the highest resolution mesh to an
SMPL topology. These questions together result in artifacts in B. Parameters Regression
the final mesh.
To resolve the above questions, we propose a Progressive CNNs have been used for the regression of a 3D body
Quadric Graph Convolutional Network (PQ-GCN) for 3D model’s parameters due to their powerful representation abil-
Human Mesh Recovery. First, we use a quadric surface simpli- ity. Different architectures have been designed to extract
fication to generate a set of coarse-to-fine human meshes from image features and leverage the information of heatmaps and
SMPL as template topologies. This guarantees that human silhouette [24], reprojection loss of keypoints and a discrimina-
meshes have equal or lower resolution than the target. Second, tor [20], pose calibration with a non-rigid transformation [31],
we use a barycentric-based mesh feature up-sampling progres- a mixture-of-experts rotation prior [22], etc. A pyramidal
sively to generate detailed results from the sparse topology. mesh alignment feedback (PyMAF) loop is designed in [32],
A set of progressive graph convolutions are designed with and a pixel-wise supervision has also been incorporated to
the upsampling to learn topology features. Compared with leverage the alignment information further. One question of
Pose2Mesh [17], our model has faster inference speed, fewer these methods is that image appearance overfitting always
model parameters, and less GPU memory consumption. Exper- exists due to the appearance domain’s difference between the
imental results show that our PQ-GCN has achieved state-of- training and test data. When the training data are collected in
the-art performance on in-the-wild dataset, 3DPW [19], with laboratory with accurate 3D annotations in controlled settings
better visual quality than previous GCN-based methods. This of monotonous background or simple clothes, the methods’
work aims to make the following contributions: performance tend to degrade when applied to in-the-wild
datasets. In this paper, we focus on human mesh recovery
1) It proposes PQ-GCN for 3D human body mesh recovery
from in-the-wild images.
from human pose. Such recovery can be stopped at an
earlier stage by adding a decoder head, thereby reducing
the computational complexity greatly; and C. Non-Parameters Regression
2) It designs a PQ-GCN-based human body mesh recovery Some methods do not use model-parameter regression to
method with 50% faster inference speed (62 fps), 66% reconstruct a 3D human body. Instead, they regress mesh
fewer parameters, and less GPU memory consumption vertices from image features [16], [33] or UV space (i.e. a 2D
than Pose2Mesh [17]. It has achieved the state-of-the- space used for texture mapping of 3D mesh) [25], [34], [35],
art performance while eliminating the artifacts existing [36]. Kolotouros et al. [16] propose to use GCN to encode the
in graph convolution-based methods, e.g., Pose2Mesh. topology of SMPL model (GraphCMR). Li et al. [33] propose
a multi-scale graph transformation network for human recon-
II. R ELATED W ORK struction and encode a 3D mesh in a compact latent space.
In this section, we introduce related work, including 3D Another related work is the Convolutional Mesh Autoencoder
human body recovery methods based on optimization and (CoMA) [37], which is designed for 3D face deformation
regression, as well as 3D recovery methods for occlusions, and generation. A volumetric space is exploited for human
videos, expression, or the clothed body. The regression-based body shape inference [38], [39]. 3D space coordinates of the
methods can be categorized into parametric (regression to mesh vertices can be represented as pixel values in the UV
parameterized human body model [20], [21], [22], [23], [24], map. A straightforward encoder-decoder network has been
[25], [26], [27]) and non-parametric (e.g. regression to vertex proposed in [35]. Zeng et al. [34] propose to exploit the dense
coordinates [16], [17], [28], [29]). There are many methods correspondence between the mesh and local image features in
for 3D human pose estimation, such as those in [1] and [2]. UV space (DecoMR). By converting 3D human shape estima-
CNN and Long short-term memory (LSTM) have been used tion to an image inpainting problem from a partial UV map,
for 3D pose prediction in [1]. The work of [2] formulates the the question of object-occlusion is tackled in human mesh
estimation of torso and limb pose as a Perspective-N-Point recovery [36]. GCN-based methods tend to generate meshes
(PNP) and optimization problem, respectively. The 3D pose with unsmoothed surfaces. Hence, this work is motivated to
estimation methods are not reviewed in this paper since we get rid of their drawback and develop a novel method that can
focus on 3D human mesh recovery. eliminate the artifacts and is independent of UV mapping.

Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on January 06,2023 at 09:01:16 UTC from IEEE Xplore. Restrictions apply.
106 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 33, NO. 1, JANUARY 2023

Fig. 2. Framework of the proposed method. Taking one single image as input, 2D pose is detected first and translated into 3D space when needed. PQ-GCN
learns to map the 2D/3D pose to 3D mesh based on Cheybyshev GCN and mesh feature upsampling on the template topologies progressively. The recovery
can be stopped at an early stage by a decoding head.

Transformer [40] has been widely used in natural language different method by designing a light-weight framework based
processing and recently applied in computer vision tasks [41]. on a quadric graph convolution network.
Lin et al. [12] propose a mesh transformer to regress coor-
dinates of 3D human joints and vertices. They design a III. P ROPOSED M ETHOD
progressive dimensionality reduction architecture in multi- We propose a progressive method for 3D human mesh
transformer encoder. In our network, we have not used the recovery. 2D pose is detected first from an input image and
self-attention mechanism or the transformer encoder, since then optionally translated to 3D space. A progressive graph
seventy times more parameters and much more training time convolution network is designed to transform a 2D/3D human
are needed, while we aim to realize a low-cost lightweight pose to a 3D human body mesh. We first introduce our model’s
solution. architecture, then present the mesh generation method. Since
the graph convolutions are operated on meshes with a 2D/3D
pose as initialization which only has sparse points (i.e. human
D. 3D Recovery Methods for Occlusions, Videos, joints), we generate a set of coarse-to-fine template topologies
Expressions, or the Clothed and progressively achieve the ultimate target. The recovery
can also be stopped at an earlier stage to generate the target
To tackle with partial occlusion or truncation problem,
by previously adding a decoder-head, and the computational
some approaches [42], [43], [44], [45], [46], [47] have
complexity will be greatly reduced.
been proposed. A part attention regressor is proposed in [44]
based on the visibility of individual body parts. To regress
multiple people in one-stage, a collision-aware body-center- A. Model Architecture
guided representation is proposed in [45] with robustness to The overall framework is illustrated in Fig. 2, which consists
person-person occlusions. There are also some methods that of two sub-tasks, pose generation and 3D mesh generation.
have been introduced based on video frames [48], [49], [50], For the first task, taking a single image as input, a 2D pose is
[51], [52] or SMPL-X model [30], [53]. The clothed human detected and optionally translated to corresponding root joint-
shape recovery has also been studied. For example, implicit relative 3D pose, to be briefly introduced later. In this paper,
functions are learned to predict the occupancy field [54], we focus on 3D mesh generation by our proposed PQ-GCN,
and Pixel-aligned Implicit Function (PIFu) is leveraged for which takes 2D pose or concatenates 2D and 3D poses as
3D textured human reconstruction [55]. A topologically-aware input and predicts the 3D mesh vertices’ coordinates.
generative model, SMPLicit, is proposed in [56], and an 1) 2D/3D Pose Generation: We first detect a 2D pose from
opacity-aware differentiable rendering is introduced for this one input image as P2D ∈ R J ×2 , where J is the number
task [57]. The method in [8] takes a video as input, and uses of human joints. If needed, P2D is fed into a 3D pose
SFM for calibration, point cloud reinforcement for shaky body generation network [17], which contains two fully-connected
parts, as well as mesh deformation for surface details. These layers and two residual blocks. A residual block consists of
issues are not the focus of this work, since we aim to recover 1D batch normalization, ReLU activation, a dropout layer,
a body mesh from one single image. Instead, we propose a and a fully-connected layer. The first fully-connected layer

Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on January 06,2023 at 09:01:16 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: PQ-GCN FOR 3D HUMAN MESH RECOVERY 107

transforms the input 2D pose into a 4096-dimensional feature vertices of M, where |F| is the dimension of the feature vector
vector. Then, the vector is fed into residual blocks and the at each vertex. Since we exploit graph convolution to learn
output dimension of the feature vector is 4096. The last topology features of a mesh, we view a mesh as a graph in
fully-connected layer converts the output from the residual the following discussions.
block into a 3 J -dimensional vector which is transferred to the 1) Progressive Mesh Generation: We introduce the pro-
root joint-relative 3D pose P3D ∈ R J ×3 . gressive human mesh generation for a set of coarse-to-fine
2) Progressive GCN: Our target is to learn the 3D coordi- topologies using a down-sampling, the purpose of which is to
nates of human mesh vertices from P2D or P3D . The human generate a group of mesh templates with special graph struc-
mesh M ∈ R N×3 has N vertices. Since we use the SMPL tures (6890, 1723, 431, 108, 27). It is finished before network
human template mesh topology, N = 6890. Based on graph training, and it’s only used to generate target templates. Then
convolution and mesh feature up-sampling, we construct a pro- we use graph convolutions and up-sampling to progressively
gressive mesh processing mechanism. The graph convolution generate target meshes (27, 108, 431, 1723, 6890) from 2D/3D
unit is designed as Chebyshev graph convolution. As shown in pose. According to the weighted graph cuts demonstration in
Fig. 3, our PQ-GCN consists of Bottleneck Layer, Progressive graph clustering [18], the coarsening phase of the multilevel
Layer, and Output Layer. Graph clustering (Graclus) algorithm is efficient, and it has
Bottleneck Layer is composed of three graph convolu- been used in graph convolution [58]. Pose2Mesh also adopts
tion units, two reshaping layers, and a fully-connected layer. the coarsening method of Graclus for graphs generation [17].
We construct a graph G P = (V P , A P , FP ) from the human However, we have found that this is not the most suitable for
skeleton, where V P denotes a group of J human joints, and human mesh recovery. Based on a surface simplification [59],
FP = P is the initialization feature map of graph G P . A P ∈ in our method we simplify a mesh by contracting vertex pairs
{0, 1} J ×J is an adjacency matrix that defines the connectivity iteratively, and uses quadratic matrices to calculate contraction
of those joints. As shown in the left part of Fig. 3, three graph cost.
units sequentially perform the Chebyshev graph convolution A pair contraction can be defined as (vi , v j ) → v̄. The cost
on G P , and transform the feature map FP from R J ×2 or R J ×5 of contracting a pair is defined as
to R J ×64. Then, the bottleneck layer up-samples the feature
map FP to mesh feature F4 of the lowest resolution human (v̄) = v̄T (Qi + Q j )v̄, (1)
body mesh M4 by reshaping and a fully-connected layer.
Progressive Layer realizes the mesh feature progressive where v̄ = [v¯x , v¯y , v¯z , 1]T represents a vertex’s coordinates.
up-sampling which receives bottleneck layer’s output to ini- Qi and Q j are 4 × 4 symmetric matrices corresponding to
tialize the mesh feature F4 . Then, five graph convolution vertices vi and v j , respectively. A down-sampling matrix S ∈
blocks with interleaved up-sampling layers generate the 3D {0, 1}n×m can be obtained based on the mesh down-sampling
mesh feature F0 of human body mesh M0 in R6890×128. Each algorithm, where m and n respectively represent the number
graph convolution block consists of two graph convolution of vertices before and after mesh down-sampling.
units and a residual connection, while each up-sampling layer We apply mesh down-sampling to an SMPL model to gen-
corresponds to an up-sampling matrix. The set of matrices erate a set of coarse-to-fine meshes, {Mc = (Vc , Ac , Fc )}C c=0 ,
are {Uc ∈ R|Vc |×|Vc+1 | }C−1 and the corresponding down-sampling matrix is {Sc ∈
c=0 where C = 4. The up-sampling
process is defined as Fc = Uc Fc+1 , where Fc is the first R|Vc+1 |×|Vc | }C−1
c=0 , where C is the number of down-sampling
feature map of Mc and Fc+1 is the last feature map of Mc+1 . steps. A human mesh down-sampling can be defined as
Five graph convolution blocks sequentially operate on the
Vc+1 = Sc Vc , c = [0, . . . , C − 1]. (2)
progressive mesh group {Mc = (Vc , Ac , Fc )}4c=0 constituting a
coarse-to-fine human mesh processing mechanism. A decoding We set the down-sampling ratio to be 4 in our experiment,
head is optionally added at each block, which consists of two i.e.,
layers of Chebyshev graph convolution and one Multi-layer
Perceptron (MLP), so that the 3D recovery can be stopped at |Vc |
|Vc+1 | =  + 0.5, (3)
an earlier stage when needed. 4
Output Layer is composed of two graph convolution units. where   is a floor function and |V∗ | denote the number of
It receives the progressive layer’s output, and the vertices’ vertices in V∗ . The SMPL model has 6890 vertices, and the
feature dimension is reduced from 128 to 3 to generate a 3D number of vertices from coarse to fine can be
human mesh in R6890×3 .
|Vc | = 27, 108, 431, 1723, 6890. (4)
B. Progressive Quadric Graph Construction M0 represents the vanilla SMPL human mesh topology. Fig. 4
A 3D human mesh M can be represented as a set of vertices, shows the human body mesh down-sampling process.
edges and vertex features, M = (V, A, F), with |V | = n 2) GCN on the Progressive Meshes: Since 3D body meshes
vertices, V ∈ Rn×3 . A ∈ {0, 1}n×n is an adjacency matrix to {Mc = (Vc , Ac , Fc )}Cc=0 can be represented by undirected
define the connectivity of vertices in mesh M. Ai j = 1 when graphs, we can exploit the topology features using GCNs [58],
vertices i and j are the same or connected, and otherwise [60], [61]. In our experiment, we use Chebyshev graph con-
Ai j = 0. The mesh feature F ∈ Rn×|F | is attached to the volution [58], [62] to reduce the computational complexity.

Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on January 06,2023 at 09:01:16 UTC from IEEE Xplore. Restrictions apply.
108 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 33, NO. 1, JANUARY 2023

Fig. 3. Our progressive graph convolutional network. It consists of three parts, i.e. the Bottleneck Layer, Progressive Layer, and Output Layer.

Fig. 4. Quadric graph down-sampling for the human body model in our method. A set of template topologies will be generated.

As for 3D mesh Mc , its normalized Laplacian matrix is The task of GCN is to train the parameters of the K -th
defined as Chebyshev coefficient matrix k ∈ R|Fin |×|Fout | in the graph
−1 −1 convolution unit.
L c = I − Dc 2 A c Dc 2 , (5)
3) Mesh-Feature Upsampling: To recover the final human
where I is the identity matrix, and 
Dc is the Degree matrix mesh, we need to up-sample the coarse meshes accompanied
for each vertex in V as (Dc )i j = j (A c )i j . Then we can with graph convolutions progressively. The mesh-feature up-
calculate the scaled Laplacian as sampling (MF-Upsampling) defines the feature transformation
Lc = 2Lc /λ̂ − I, (6) relationship between meshes of adjacent-resolutions in a set
of progressive meshes {Mc = (Vc , Ac , Fc )}C c=0 . We use the
where λ̂ is the largest eigenvalue. The Chebyshev polynomial barycentric-based method [37] for our predefined human body
Tk (x) of order k can be computed as: meshes.
Tk (x) = 2x Tk−1 (x) − Tk−2 (x), (7) In {Mc = (Vc , Ac , Fc )}C c=0 , the resolution of Mc and
M0 respectively are the lowest and highest, and the mesh
where T0 (x) = 1, T1 (x) = x. The graph convolution unit resolution gradually increases from Mc to M0 . Generally,
performing the spectral graph convolution on the mesh Mc we use Fc+1 ∈ R|Vc+1 |×|F | and Fc ∈ R|Vc |×|F | to represent
can be defined as the features of a pair of adjacent resolution human meshes,
 −1
K
and c = [C − 1, . . . , 0]. The target of mesh feature upsam-
Fout = Tk (
Lc )Fin k , (8) pling is to project Fc+1 to Fc . We define Uc as the mesh
k=0 feature up-sampling matrix, and the transformation can be
with input feature map Fin ∈ R N×|Fin | and output feature map defined as
Fout ∈ R N×|Fout | , where N = |V | is the number of vertices
of the mesh Mc . Fc = Uc Fc+1 , c = [C − 1, . . . ., 0]. (9)

Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on January 06,2023 at 09:01:16 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: PQ-GCN FOR 3D HUMAN MESH RECOVERY 109

where Uc is built according to Sc in Section III-B.1. The


generation rules of the up-sampling matrix are as follows.
• The features of vertices retained during down-sampling
are directly copied back during up-sampling. That is
Uc (q, p) = 1 if Sc ( p, q) = 1.
• Vertices v q ∈ Vc discarded during down-sampling where
S( p, q) = 0, ∀ p, are projected into the closest triangle
(i, j, k) in the down-sampled mesh Mc+1 , denoted by ṽ p ,
using barycentric coordinates which is calculated as ṽ p =
wi v i + w j v j + wk v k , where wi + w j + wk = 1 and
v i , v j , v k ∈ Vc+1 . The feature vector attaches to ṽ q is
calculated as
Fig. 5. Visualization of different formats of joint sets and corresponding
f˜q = wi f i + w j f j + wk f k , (10) graph structures.

where f i , f j , f k ∈ Fc+1 . The feature vector of ṽ q will


be copied to vertices v q in mesh Mc . Corresponding 3DPW, respectively. The graph structure of the human joint
elements in Uc are updated by wi , w j , wk . sets defines the connection and symmetrical relationships of
Based on the above rules, we can get a set of mesh feature the joints.
up-sampling matrices, {Uc ∈ R|Vc |×|Vc+1 | }C−1
c=0 . We employ standard normalization to a 2D human pose
before input to the network. Specifically, we first subtract
C. Loss Function the average value from the input 2D pose and divide it by
We use the following loss functions to train our network. its standard value. Additionally, we do not use the ground
To train a 3D pose generation network, we use 3D pose loss truth 2D pose of datasets to train our network directly.
L pose . To train the PQ-GCN model, we use L mesh consisting We add realistic errors to 2D input pose during the training
of vertex coordinate loss L vert ex , joint coordinate loss L j oint , stage.
surface normal loss L normal , and surface edge loss L edge , and
IV. E XPERIMENTAL R ESULTS
it is defined as
A. Datasets
L mesh = λv L V + λ j L J + λn L S N + λe L S E , (11)
We employ four datasets for training, i.e. Human3.6M [5],
where each component is calculated as COCO [63], MuCo-3DHP [64] and AMASS [10], and eval-
uate the trained model via datasets of Human3.6M and
L V = M − MGT , (12) 3DPW [19].
L J = J M − P3D
GT , (13) Human3.6M includes 3.6 million video frames and their
  mi − m j corresponding 3D human poses of 5 female and 6 male sub-
LSN = | , n GT |, (14)
mi − m j 2 T f ace jects [5]. There are 17 action scenes, such as sports, greetings,
T f ace {i, j }⊂T f ace
  and other actions, but all in the laboratory. Since the ground-
LSE = |mi − m j 2 −miGT −mGT
j 2 |, (15) truth 3D meshes are not available, we use pseudo-ground-truth
T f ace {i, j }⊂T f ace 3D meshes obtained by using SMPLify-X [30] to fit SMPL
where the superscript or subscript GT denotes the ground- parameters to the 3D ground truth poses. Following [17], [20],
truth. J stands for a joint regression matrix defined in the [65], we use 5 subjects (S1, S5, S6, S7, S8) for training, and
SMPL model. T f ace represents a triangle face in human mesh. 2 subjects (S9, S11) for evaluation.
n GT COCO is a large dataset with 2D annotations [63]. We use
T f ace represents a ground-truth unit normal vector of T f ace .
the pseudo 3D mesh labels by [21] for training.
mi and m j represent the i t h and j t h vertices in T f ace .
MuCo-3DHP [64] is a synthesized dataset based on MPI-
We use a two-step training strategy to train our model. First,
INF-3DHP [66] by the combination of the training data with
we use L pose to pre-train our 3D pose generation network, and
various real background images, and used for training in our
then use the above five losses to train the whole network in
experiment.
an end-to-end manner.
AMASS is a large-scale 3D motion-capture dataset [10].
It generates SMPL parameters from mocap data by Mosh++.
D. Human Joint Sets Definition and Pre-Processing We use camera parameters from Human3.6M to project a 3D
Our method generates a human mesh from 2D/3D human pose obtained from a mesh to the image plane to synthetic 2D
pose based on the proposed PQ-GCN. So we must pre-define pose-3D mesh data. We only use the part of CMU [67] from
the human joint sets and the corresponding graph structure. the dataset in the training process.
We use different pre-defined human joint sets and correspond- 3DPW is an outdoor-image dataset with 3D body pose and
ing graph structures for different evaluation datasets. The mesh annotations [19]. We only use the test set of this dataset
human joint sets are shown in Fig. 5. Specifically, we use for evaluation. We use the method of [68] and [69] to predict
Human3.6M and COCO body joints for Human3.6M and 2D poses for Human3.6M and 3DPW, respectively.

Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on January 06,2023 at 09:01:16 UTC from IEEE Xplore. Restrictions apply.
110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 33, NO. 1, JANUARY 2023

TABLE I
A CCURACY C OMPARISON OF T WO D IFFERENT T RAINING
M ETHODS ON H UMAN 3.6M

TABLE II
C OMPLEXITY AND S PEED C OMPARISON B ETWEEN P OSE 2M ESH [17] AND
O URS ON THE S AME GPU. O UR M ETHOD H AS R EDUCED A BOUT 66%
PARAMETERS , AND RUNS A BOUT 50% FASTER T HAN P OSE 2M ESH

B. Evaluation Metrics
We use two metrics, MPJPE and PA-MPJPE, for 3D pose
evaluation, and one metric, MPVE, for 3D mesh evaluation in
millimeters (mm). Mean-Per-Joint-Position-Error (MPJPE) [5]
assesses the Euclidean distance between the ground-truth
and the predicted joints. Procrustes Analysis MPJPE
(PA-MPJPE) or Reconstruction Error [70] computes MPJPE
after performing a 3D alignment on 3D pose using Procrustes Fig. 6. Qualitative results on in-the-wild datasets, COCO (rows 1-4) and
Analysis (PA). Mean-Per-Vertex-Error (MPVE) [24] measures 3DPW (rows 5-7).
the Euclidean distance between the predicted and ground truth
mesh vertices.

C. Training and Inference


In our experiments, we have tried two ways to train our
network. First, we pre-train the 3D pose generation network
and then integrate it to our proposed PQ-GCN to train the
whole network in an end-to-end manner. The second is that we
train the whole network directly. The learning rate is set to be
10−3 initially, and 10−4 after the 12t h epoch, and λe = 0 when
the number of training epochs is less than 8. And the weights
are updated by the RMSprop optimization with a mini-batch
size of 64. We train our PQ-GCN with one RTX3090 GPU,
and PyTorch is used for implementation. Table I shows the
performance of two ways on Human3.6M that is also the Fig. 7. Visual quality comparison between our method and Pose2Mesh.
training dataset. For inference, we get the 2D input pose Indices Mapping in Pose2Mesh may cause artifacts especially on areas of
estimated by one integral regression [26] for Human3.6M body dense vertices, such as the face, hand, foot, while our method can mitigate
this problem.
joints and HRNet [69] for COCO body joints.

D. Comparison With Pose2Mesh compare the performance of our method and Pose2Mesh on
Since our work is based on Pose2Mesh [17], we make com- Human3.6M and 3DPW.
parison with it about resource consumption, model parameters, First, we list a comparison on Human3.6M in Table III.
inferencing speed, as well as reconstruction errors. The datasets on the top row of the table are used for training.
We first report the complexity comparison between our Table III shows that our method outperforms Pose2Mesh when
method and Pose2Mesh in Table II. When the batch size the training dataset is Human3.6M or Human3.6M+COCO.
is the same (i.e. 64), our model has reduced 46% GPU When we use MuCo-3DHP as an additional training dataset,
memory consumption of Pose2Mesh [17], and the parameters the performance of our method is slightly worse than that of
of our PQ-GCN is 34% of Pose2Mesh. Only one GPU is Pose2Mesh.
needed for training. Our method runs at 62 fps, 50% faster Second, we compare our method with Pose2Mesh on 3DPW
than Pose2Mesh (41 fps) on the same RTX3090. Then we in Table IV. The results demonstrate that our method obviously

Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on January 06,2023 at 09:01:16 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: PQ-GCN FOR 3D HUMAN MESH RECOVERY 111

TABLE III
A CCURACY C OMPARISON B ETWEEN O UR PQ-GCN AND P OSE 2M ESH ON H UMAN 3.6M. THE DATASET ( S ) ON T OP IS (A RE ) U SED FOR T RAINING

TABLE IV
A CCURACY C OMPARISON B ETWEEN O UR PQ-GCN AND P OSE 2M ESH ON 3DPW. T HE D ATASET ( S ) ON T OP I S (A RE ) U SED FOR T RAINING

TABLE V
C OMPARISON W ITH S TATE - OF - THE -A RT M ETHODS
ON THE 3DPW D ATASET

Fig. 8. Visual quality comparison between our method, Pose2Mesh, and


GraphCMR. All results are without model fitting.

TABLE VI
3D R ECONSTRUCTION R ESULTS W ITH T RUE 2D P OSE AS I NPUT

with ‘’ have been trained with same datasets including


Human3.6M, COCO, and MuCO-3DHP. Those highlighted
with ‘†’ have been trained with the datasets of Human3.6M,
MPI-INF-3DHP, UP, COCO, MPII, LSP, AICH, MuCo-
3DHP,and OH, while ours (Decode-27/108/431/1723/Full)
highlighted with ‘♦’ are trained with only Human3.6M and
COCO. The first three methods highlighted with asterisk,
METRO [12], PARE [44], Mesh Graphormer [71], are trained
with datasets including 3DPW, and then evaluated on its
test set. These methods have used image features, and their
performance is much better than others, while our method
outperforms Pose2Mesh in most cases with the same training only utilizes 2D (3D) pose without any other features (e.g.
datasets. color or texture). We have evaluated our method with true 2D
pose as input, and the results are given in Table VI.
E. Comparison With State-of-the-Art Methods Dynamics [49] and VIBE [50] belong to temporal methods
We compare our method with state-of-the-art methods on in- trained on video data. VIBE [50] performs slightly better
the-wild dataset, 3DPW, in Table V. The methods highlighted than our method on the metric of PA-MPJPE, but ours is

Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on January 06,2023 at 09:01:16 UTC from IEEE Xplore. Restrictions apply.
112 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 33, NO. 1, JANUARY 2023

Fig. 9. Visual comparison between state-of-the-art methods and ours. From left to right: Input image, HMR [20], GraphCMR [16], SPIN [21], I2L-MeshNet
[29], and ours.

much better on MPJPE and MPVE. Our method outperforms following reasons. First, avoiding image appearance fitting
others including some most recent methods. Hence, this proves is important for the model’s generalization ability for in-the-
the efficiency of our proposed PQ-GCN attributed to the wild images. Second, we can benefit from accurate 2D human

Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on January 06,2023 at 09:01:16 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: PQ-GCN FOR 3D HUMAN MESH RECOVERY 113

TABLE VII
A CCURACY C OMPARISON B ETWEEN O UR M ETHOD AND
S TATE - OF - THE -A RT ON H UMAN 3.6M AND 3DPW

Fig. 11. Visualization of reconstruction error map. From left to right: input
image, ground truth, our result, error map.

TABLE VIII
C OMPLEXITY C OMPARISON W ITH S TATE - OF - THE -A RT M ETHODS

Fig. 10. Some inaccurate results of our method. From left to right: Input
image, 2D pose, recovered mesh. Since our method recovers the mesh from
the human pose, inaccurate results will be generated when the 2D human pose
prediction is unreasonable.

pose estimation, such as [68], [69]. Third, our designed mesh


topologies and graph neural networks are suitable for human TABLE IX
body simulation. T HE N UMBER OF PARAMETERS , AND THE GPU M EMORY U SAGE
Some other state-of-the-art methods [17], [20], [21], [29] C OMPARISON B ETWEEN D IFFERENT PQ-GCN A RCHITECTURES
and ours are compared in Table VII when trained on the same
dataset, Human3.6M. The methods are tested on 3DPW and
Human3.6M. As shown in Table VII, our method achieves
state-of-the-art performance on both 3DPW and Human3.6M.
Note that we have not listed the result of I2L-MeshNet [29]
on 3DPW, since we found that this method cannot work on
3DPW when trained with Human3.6M only. The reason is that
the image-to-lixel prediction network strongly relies on image
F. Ablation Study
appearance [29].
The complexity comparison between SOTA methods and We demonstrate the advantages of the designed graph
ours has been listed in Table VIII, including parameters convolution network via ablation experiments, in which the
and FLOPs which is also a metric to measure one model’s Human3.6M is used as the training dataset.
complexity. It can be seen that our progressive recovery 1) Mesh Down-Sampling: We have studied the effect of dif-
with a decoder-head can lead to a substantial reduction of ferent mesh down-sampling ratios, which will lead to different
computation complexity. The complexity of graph convolution coarse-to-fine meshes. Different PQ-GCN networks need to
is O(n 2 ), where n is the number of vertex, so progressive be designed. Table IX shows the GPU memory usage and the
decoding can avoid graph convolution on meshes of high- number of parameters in different settings. Table X shows the
resolution. MPJPE and PA-MPJPE results on Human3.6M and 3DPW.
We present qualitative results on in-the-wild datasets in As shown in Tables IX-X, the best compromise has been
Fig. 6, a visual quality comparison between ours and achieved when we design a PQ-GCN model based on the
Pose2Mesh in Fig. 7, and comparison with GraphCMR in mesh down-sampling ratio of 4 on SMPL template to generate
Fig. 8. Our method can eliminate the artifacts without post- the coarse-to-fine meshes. Note that the number of vertices of
processing (e.g. model fitting). Fig. 9 shows the visual com- the lowest resolution human mesh topology is more than the
parison between our method and others, including HMR [20], number of human joints (i.e., 17 or 19) when using all the
GraphCMR [16], SPIN [21], and I2L-MeshNet [29]. Fig. 10 settings to build our model.
shows some inaccurate results of our method. Most of these 2) Loss Functions: We have also studied the effect of
failures occur when dealing with self-occlusion or multi- loss functions as shown in Tables XI-XII. Different settings
person cases, which will be tackled as our future work. Fig. 11 of the total loss without L j oint , L edge and L normal have
shows visualization of some reconstruction error maps. been evaluated. Although L j oint benefits the performance on

Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on January 06,2023 at 09:01:16 UTC from IEEE Xplore. Restrictions apply.
114 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 33, NO. 1, JANUARY 2023

TABLE X TABLE XIV


A CCURACY OF PQ-GCN ON H UMAN 3.6M AND 3DPW T HE P ERFORMANCE OF O UR M ETHOD W ITH D IFFERENT A RCHITECTURES
W ITH D IFFERENT S AMPLING R ATIOS ON 3DPW. T HE T RAINING D ATASET I S H UMAN 3.6M

TABLE XV
TABLE XI C OMPARISON B ETWEEN D IRECT-GCN AND PQ-GCN. (E RROR IN mm)
T HE J OINT E RRORS ON H UMAN 3.6M W HEN THE N ETWORK I S
T RAINED W ITH VARIOUS C OMBINATIONS OF L OSSES

TABLE XVI
C OMPARISON OF P ROGRESSIVE D ECODING F ROM D IFFERENT
R ESOLUTIONS ON 3DPW. H UMAN 3.6M AND
COCO A RE U SED FOR T RAINING
TABLE XII
L OSSES ’ E FFECT ON H UMAN 3.6M AND
3DPW. L sur f ace = L edge &L normal

TABLE XIII 5) Quadratic Graph Convolution: There are two reasons


A CCURACY C OMPARISON OF D IFFERENT VALUE OF K ON H UMAN 3.6M. for the choice of quadratic graph convolution. First, it can
THE T RAINING S ET I S H UMAN 3.6M
guarantee the human body mesh’s original topology (i.e.
SMPL), so the last layer of our progressive graph convolution
will operate on our selected model, which will help to improve
the reconstruction quality. By analyzing Pose2mesh and Graph
CMR, their generated mesh from the last layer both need
one sampling operation to the SMPL model, which will lead
to extra quality degradation and generate unsmooth surface.
Second, the quadratic graph convolution generates a group
of mesh topologies with different resolutions, which reserve
the human body’s basic structure and help to reconstruct the
Human3.6M, it has marginal effect on 3DPW on which the final human mesh from 2D pose by our progressive graph
surface losses L edge and L normal are more important. convolution. This also reduced the learning complexity of
3) K-Order Chebyshev GCN: To reduce computational graph convolution parameters. An additional ablation study is
complexity, we use the K -order Chebyshev polynomial to added about quadratic graph convolution. We have removed
approximate the graph convolution kernel. The K -order poly- all up-sampling layers in the Progressive Layer, while the
nomial in Laplacian indicates that a maximum of K -hop FC layer in the Bottleneck Layer is revised to generate
neighbor nodes are affected by every node. So we study the 6890 vertexes. This is referred as Direct-GCN, the complexity
effect of K in Chebyshev graph convolution as defined in (8), and performance of which is compared with PQ-GCN in
and report the joint errors of different settings on Human3.6M Table XV, from which we can see that PQ-GCN possesses
in Table XIII. As shown in Table XIII, our method has almost the same performance as the Direct-GCN, while the
achieved its best performance when K = 3. complexity is reduced greatly.
4) Different Poses Into PQ-GCN: We study the 3D human
mesh regression ability of our PQ-GCN with different pose
information as input, i.e., PQ-GCN recovers 3D mesh from the G. Progressive Decoding
2D pose, 3D pose, and 2D+3D pose, respectively. The results We have trained the progressive decoding framework on
are shown in Table XIV. As expected, PQ-GCN can regress Human3.6M and COCO datasets, and evaluated on 3DPW.
more accurate results when using the 2D pose concatenated Experimental results and complexity of progressive recovery
with 3D pose as input. structures are compared in Tables XVI and XVII, respectively,

Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on January 06,2023 at 09:01:16 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: PQ-GCN FOR 3D HUMAN MESH RECOVERY 115

TABLE XVII
C OMPLEXITY C OMPARISON OF P ROGRESSIVE D ECODING S TRUCTURES

R EFERENCES
[1] G. Wei, C. Lan, W. Zeng, and Z. Chen, “View invariant 3D human pose
estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 12,
pp. 4601–4610, Dec. 2020.
[2] R. Gu, G. Wang, Z. Jiang, and J.-N. Hwang, “Multi-person hierarchical
3D pose estimation in natural videos,” IEEE Trans. Circuits Syst. Video
Technol., vol. 30, no. 11, pp. 4245–4257, Nov. 2020.
[3] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black,
“SMPL: A skinned multi-person linear model,” ACM Trans. Graph.,
vol. 34, no. 6, pp. 1–16, Oct. 2015.
[4] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and
M. J. Black, “Keep it SMPL: Automatic estimation of 3D human pose
and shape from a single image,” in Proc. Eur. Conf. Comput. Vis.
(ECCV), 2016, pp. 561–578.
[5] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6M:
Large scale datasets and predictive methods for 3D human sensing in
Fig. 12. Visual comparison of decoded results from different resolutions. natural environments,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36,
From left to right: Input image, decoded from vertices of 27, 108, 431, 1723, no. 7, pp. 1325–1339, Jul. 2014.
6890. [6] C. Zhu, J. Yang, Z. Shao, and C. Liu, “Vision based hand gesture
recognition using 3D shape context,” IEEE/CAA J. Autom. Sinica, vol. 8,
where Decode-X represents a decoder-head is added under X no. 9, pp. 1600–1613, Sep. 2021.
[7] M. Zhao, G. Xiong, M. Zhou, Z. Shen, and F.-Y. Wang, “3D-
resolution. From Table XVI, we can see that the performance RVP: A method for 3D object reconstruction from a single depth
get better when progressively decoding with higher resolution. view using voxel and point,” Neurocomputing, vol. 430, pp. 94–103,
At the same time, the recovery can be stopped at an earlier Mar. 2021.
[8] H. Zhu, Y. Liu, J. Fan, Q. Dai, and X. Cao, “Video-based outdoor human
stage when needed. By previously adding the decoder head, reconstruction,” IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 4,
the GPU memory and inference time can be greatly reduced pp. 760–770, Apr. 2017.
as shown in Table XVII. Visual comparison of decoded results [9] M. Loper, N. Mahmood, and M. J. Black, “MoSh: Motion and shape
capture from sparse markers,” ACM Trans. Graph., vol. 33, no. 6,
from different resolutions has been shown in Fig. 12. pp. 1–13, Nov. 2014.
[10] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. Black,
V. C ONCLUSION “AMASS: Archive of motion capture as surface shapes,” in Proc.
This paper presents a progressive 3D human mesh recovery IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5441–5450.
method for a single image in the wild. We propose PQ-GCN [11] B. L. Bhatnagar, C. Sminchisescu, C. Theobalt, and G. Pons-Moll,
“LoopReg: Self-supervised learning of implicit surface correspondences,
to learn the mapping from a 2D/3D pose to a 3D mesh. pose and shape for 3D human mesh registration,” in Proc. Adv. Neural
Based on the elaborated template meshes, we construct the Inf. Process. Syst. (NeurIPS), vol. 33, 2020, pp. 12909–12922.
corresponding progressive graph convolution. Our method has [12] K. Lin, L. Wang, and Z. Liu, “End-to-end human pose and mesh
reconstruction with transformers,” in Proc. IEEE/CVF Conf. Comput.
much fewer parameters (66% down), a faster inference speed Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 1954–1963.
(50% up) and less GPU memory consumption. Experimental [13] X. Hong, T. Zhang, Z. Cui, and J. Yang, “Variational gridded graph
results show that it has eliminated the artifacts in previous convolution network for node classification,” IEEE/CAA J. Autom.
Sinica, vol. 8, no. 10, pp. 1697–1708, Oct. 2021.
graph-based methods, such as Pose2Mesh and GraphCMR, [14] X. Liu, M. Yan, L. Deng, G. Li, X. Ye, and D. Fan, “Sampling
and outperforms state-of-the-art methods, especially on an in- methods for efficient training of graph convolutional networks: A
the-wild dataset named 3DPW. Besides, by adding a decoding survey,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 2, pp. 205–234,
Feb. 2022.
head in the progressive layer, the recovery can be stopped at [15] Z. Zhuo, X. Luo, and M. Zhou, “An auxiliary learning task-enhanced
an earlier stage, thus effectively decreasing the computational graph convolutional network model for highly-accurate node classifica-
burden. tion on weakly supervised graphs,” in Proc. IEEE Int. Conf. Smart Data
Services (SMDS), Sep. 2021, pp. 192–197.
This paper focuses on the mapping from a 2D/3D pose to a [16] N. Kolotouros, G. Pavlakos, and K. Daniilidis, “Convolutional mesh
3D mesh, without well exploiting the shape estimation, which regression for single-image human shape reconstruction,” in Proc.
needs other information such as silhouette. 3D human data IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019,
pp. 4496–4505.
with a real ground-truth mesh tend to be deficient. The long [17] H. Choi, G. Moon, and K. M. Lee, “Pose2Mesh: Graph convolutional
range interactions in the mesh topology have not been well network for 3D human pose and mesh recovery from a 2D human pose,”
investigated. Our future work should study these questions, in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 769–787.
[18] I. S. Dhillon, Y. Guan, and B. Kulis, “Weighted graph cuts without
build a new dataset based on our structured light-based 3D eigenvectors a multilevel approach,” IEEE Trans. Pattern Anal. Mach.
range sensing system, and explore its new applications. Intell., vol. 29, no. 11, pp. 1944–1957, Nov. 2007.

Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on January 06,2023 at 09:01:16 UTC from IEEE Xplore. Restrictions apply.
116 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 33, NO. 1, JANUARY 2023

[19] T. Marcard, R. Henschel, M. Black, B. Rosenhahn, and G. Pons-Moll, [43] R. Chris and F. F. David, “Full-body awareness from partial observa-
“Recovering accurate 3D human pose in the wild using IMUs and tions,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 522–539.
a moving camera,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, [44] M. Kocabas, C.-H.-P. Huang, O. Hilliges, and M. J. Black, “PARE: Part
pp. 601–617. attention regressor for 3D human body estimation,” in Proc. IEEE/CVF
[20] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, “End-to-end Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 11127–11137.
recovery of human shape and pose,” in Proc. IEEE/CVF Conf. Comput. [45] Y. Sun, Q. Bao, W. Liu, Y. Fu, M. J. Black, and T. Mei, “Monocular,
Vis. Pattern Recognit., Jun. 2018, pp. 7122–7131. one-stage, regression of multiple 3D people,” in Proc. IEEE/CVF Int.
[21] N. Kolotouros, G. Pavlakos, M. Black, and K. Daniilidis, “Learning Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 11179–11188.
to reconstruct 3D human pose and shape via model-fitting in the [46] H. Choi, G. Moon, J. Park, and K. M. Lee, “Learning to estimate
loop,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, robust 3D human mesh from in-the-wild crowded scenes,” 2021,
pp. 2252–2261. arXiv:2104.07300.
[22] R. A. Guler and I. Kokkinos, “HoloPose: Holistic 3D human recon- [47] K. Yang, R. Gu, M. Wang, M. Toyoura, and G. Xu, “LASOR: Learning
struction in-the-wild,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern accurate 3D human pose and shape via synthetic occlusion-aware data
Recognit. (CVPR), Jun. 2019, pp. 10876–10886. and neural mesh rendering,” IEEE Trans. Image Process., vol. 31,
[23] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele, “Neural pp. 1938–1948, 2022.
body fitting: Unifying deep learning and model based human pose [48] J. Zhang, P. Felsen, A. Kanazawa, and J. Malik, “Predicting 3D human
and shape estimation,” in Proc. Int. Conf. 3D Vis. (DV), Sep. 2018, dynamics from video,” in Proc. IEEE/CVF Int. Conf. Comput. Vis.
pp. 484–494. (ICCV), Oct. 2019, pp. 7113–7122.
[24] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis, “Learning to esti- [49] A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik, “Learning 3D human
mate 3D human pose and shape from a single color image,” in dynamics from video,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, Recognit. (CVPR), Jun. 2019, pp. 5607–5616.
pp. 459–468. [50] M. Kocabas, N. Athanasiou, and M. J. Black, “VIBE: Video inference
[25] Y. Xu, S.-C. Zhu, and T. Tung, “DenseRaC: Joint 3D pose and shape estimation by dense render-and-compare,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7759–7769.
[26] Y. Sun, Y. Ye, W. Liu, W. Gao, Y. Fu, and T. Mei, “Human mesh recovery from monocular images via a skeleton-disentangled representation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5348–5357.
[27] Y. Rong, Z. Liu, C. Li, K. Cao, and C. C. Loy, “Delving deep into hybrid annotations for 3D human recovery in the wild,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5339–5347.
[28] A. S. Jackson, C. Manafas, and G. Tzimiropoulos, “3D human body reconstruction from a single image via volumetric regression,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 64–77.
[29] G. Moon and K. M. Lee, “I2L-MeshNet: Image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 752–768.
[30] G. Pavlakos et al., “Expressive body capture: 3D hands, face, and body from a single image,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 10967–10977.
[31] T. Luan, Y. Wang, J. Zhang, Z. Wang, Z. Zhou, and Y. Qiao, “PC-HMR: Pose calibration for 3D human mesh recovery from 2D images/videos,” in Proc. 35th AAAI Conf. Artif. Intell. (AAAI), 2021, pp. 2269–2276.
[32] H. Zhang et al., “PyMAF: 3D human pose and shape regression with pyramidal mesh alignment feedback loop,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 11446–11456.
[33] K. Li et al., “Image-guided human reconstruction via multi-scale graph transformation networks,” IEEE Trans. Image Process., vol. 30, pp. 5239–5251, 2021.
[34] W. Zeng, W. Ouyang, P. Luo, W. Liu, and X. Wang, “3D human mesh regression with dense correspondence,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 7052–7061.
[35] P. Yao, Z. Fang, F. Wu, Y. Feng, and J. Li, “DenseBody: Directly regressing dense 3D human pose and shape from a single color image,” arXiv:1903.10153, Mar. 2019.
[36] T. Zhang, B. Huang, and Y. Wang, “Object-occluded human shape and pose estimation from a single color image,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 7374–7383.
[37] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black, “Generating 3D faces using convolutional mesh autoencoders,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 725–741.
[38] G. Varol et al., “BodyNet: Volumetric inference of 3D human body shapes,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 20–38.
[39] Z. Zheng, T. Yu, Y. Wei, Q. Dai, and Y. Liu, “DeepHuman: 3D human reconstruction from a single image,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7738–7748.
[40] A. Vaswani et al., “Attention is all you need,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst., Long Beach, CA, USA, 2017, pp. 6000–6010.
[41] A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2021.
[42] M. Wang, F. Qiu, W. Liu, C. Qian, X. Zhou, and L. Ma, “Monocular human pose and shape reconstruction using part differentiable rendering,” Comput. Graph. Forum, vol. 39, no. 7, pp. 351–362, Oct. 2020.
[50] M. Kocabas, N. Athanasiou, and M. J. Black, “VIBE: Video inference for human body pose and shape estimation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 5252–5262.
[51] Z. Cao, M. Wang, S. Guan, W. Liu, C. Qian, and L. Ma, “PNO: Personalized network optimization for human pose and shape reconstruction,” in Proc. Artif. Neural Netw. Mach. Learn. (ICANN), 2021, pp. 356–367.
[52] S. Zou et al., “EventHPE: Event-based 3D human pose and shape estimation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10976–10985.
[53] V. Choutas, G. Pavlakos, T. Bolkart, D. Tzionas, and M. J. Black, “Monocular expressive body regression through body-driven attention,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 20–40.
[54] S. Saito, Z. Huang, R. Natsume, S. Morishima, H. Li, and A. Kanazawa, “PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 2304–2314.
[55] R. Li, Y. Xiu, S. Saito, Z. Huang, K. Olszewski, and H. Li, “Monocular real-time volumetric performance capture,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 49–67.
[56] E. Corona, A. Pumarola, G. Alenyà, G. Pons-Moll, and F. Moreno-Noguer, “SMPLicit: Topology-aware generative model for clothed people,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 11870–11880.
[57] Z. Huang, Y. Xu, C. Lassner, H. Li, and T. Tung, “ARCH: Animatable reconstruction of clothed humans,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 3090–3099.
[58] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Proc. 30th Int. Conf. Neural Inf. Process. Syst., vol. 29, 2016, pp. 3844–3852.
[59] M. Garland and P. S. Heckbert, “Surface simplification using quadric error metrics,” in Proc. 24th Annu. Conf. Comput. Graph. Interact. Techn. (SIGGRAPH), 1997, pp. 209–216.
[60] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2014.
[61] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in Proc. 36th Int. Conf. Mach. Learn., vol. 97, Jun. 2019, pp. 6861–6871.
[62] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2017.
[63] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 740–755.
[64] D. Mehta et al., “Single-shot multi-person 3D pose estimation from monocular RGB,” in Proc. Int. Conf. 3D Vis. (3DV), Sep. 2018, pp. 120–130.
[65] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis, “Coarse-to-fine volumetric prediction for single-image 3D human pose,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1263–1272.
[66] D. Mehta et al., “Monocular 3D human pose estimation in the wild using improved CNN supervision,” in Proc. Int. Conf. 3D Vis. (3DV), Oct. 2017, pp. 506–516.
[67] CMU Graphics Lab. (2020). CMU Graphics Lab Motion Capture Database. [Online]. Available: http://mocap.cs.cmu.edu/
[68] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, “Integral human pose regression,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 536–553.
[69] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5686–5696.
[70] X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpanis, and K. Daniilidis, “MonoCap: Monocular human motion capture using a CNN coupled with a geometric prior,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 4, pp. 901–914, Apr. 2019.
[71] K. Lin, L. Wang, and Z. Liu, “Mesh graphormer,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 12919–12928.
[72] A. A. M. Muzahid, W. Wan, F. Sohel, L. Wu, and L. Hou, “CurveNet: Curvature-based multitask learning deep networks for 3D object recognition,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 6, pp. 1177–1187, Jun. 2021.
[73] H. Xia, M. A. Khan, Z. Li, and M. Zhou, “Wearable robots for human underwater movement ability enhancement: A survey,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 6, pp. 967–977, Jun. 2022.

Lei Wang (Member, IEEE) received the Ph.D. degree in electrical engineering from Xidian University, China, in 2010. From 2011 to 2012, he worked with Huawei Technologies Company Ltd. From 2014 to 2015, he was with the Department of Embedded Systems Engineering, Incheon National University, as a Post-Doctoral Fellow. He is currently an Associate Professor with the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (CAS). He has authored or coauthored over 50 papers in conferences and journals. His research interests include image processing, transforms, machine learning, computer vision, visual semantic understanding, video analysis, 3D reconstruction, and robotics.

Xunyu Liu is currently pursuing the M.S. degree with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China. His research interests include computer vision, deep learning, and 3D reconstruction.

Xiaoliang Ma (Member, IEEE) received the Ph.D. degree from the School of Computing, Xidian University, Xi’an, China, in 2014. He is currently an Assistant Professor with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China. His research interests include evolutionary computation, multiobjective optimization, and cooperative coevolution.

Jiaji Wu (Member, IEEE) received the B.S. degree in electrical engineering from Xidian University, Xi’an, China, in 1996, the M.S. degree from the National Time Service Center (NTSC), Chinese Academy of Sciences, in 2002, and the Ph.D. degree in electrical engineering from Xidian University in 2005. He is currently a Professor at Xidian University. His current research interests include still image coding, hyperspectral/multispectral image processing, communication, big data, the IoT, and high-performance computing.

Jun Cheng (Member, IEEE) received the B.E. and M.E. degrees from the University of Science and Technology of China, Hefei, China, in 1999 and 2002, respectively, and the Ph.D. degree from The Chinese University of Hong Kong, Hong Kong, in 2006. He is currently with the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, as a Professor, and the Director of the Laboratory for Human Machine Control. His current research interests include computer vision, robotics, and machine intelligence and control.

Mengchu Zhou (Fellow, IEEE) received the B.S. degree in control engineering from the Nanjing University of Science and Technology, Nanjing, China, in 1983, the M.S. degree in automatic control from the Beijing Institute of Technology, Beijing, China, in 1986, and the Ph.D. degree in computer and systems engineering from the Rensselaer Polytechnic Institute, Troy, NY, USA, in 1990. He joined the New Jersey Institute of Technology, where he is currently a Distinguished Professor. He has over 1000 publications including 12 books, 700 journal articles (more than 600 in IEEE TRANSACTIONS), 29 patents, and 30 book chapters. His research interests include Petri nets, automation, the Internet of Things, and big data. He is a Life Member of the Chinese Association for Science and Technology, USA, and served as its President, in 1999. He is a fellow of the International Federation of Automatic Control (IFAC), the American Association for the Advancement of Science (AAAS), the Chinese Association of Automation (CAA), and the National Academy of Inventors (NAI). He was a recipient of the Excellence in Research Prize and Medal from NJIT, the Humboldt Research Award for U.S. Senior Scientists from Alexander von Humboldt Foundation, the Franklin V. Taylor Memorial Award and the Norbert Wiener Award from IEEE SMC Society, the Computer-Integrated Manufacturing University-Lead Award from the Society of Manufacturing Engineers, the Distinguished Service Award from the IEEE Robotics and Automation Society, and the Edison Patent Award from the Research and Development Council of New Jersey.