Meta Agent Teaming Active Learning for Pose Estimation (CVPR 2022)
Jia Gong¹  Zhipeng Fan²  Qiuhong Ke³  Hossein Rahmani⁴  Jun Liu¹*
¹Singapore University of Technology and Design, Singapore; ²New York University, United States
³The University of Melbourne, Australia; ⁴Lancaster University, United Kingdom
To address the aforementioned issues in a single end-to-end learning framework, we propose a novel Meta Agent Teaming Active Learning (MATAL) model for human hand (or body) pose estimation, which leverages an agent team to learn a teaming sampling policy from data. Our main insight is that selecting a batch of informative yet diverse images for annotation can be viewed as teamwork among a set of agents, where each agent in the team selects one image collaboratively based on the other agents' decisions. This active learning procedure can then be formulated as a Markov Decision Process (MDP) [45], which can be solved with Reinforcement Learning (RL). The agent team receives a state signal characterizing the distribution of the images in the dataset and cooperatively generates a batch of actions to decide which images should be labeled. To help the agent team identify informative samples for annotation, we introduce a novel state-action representation that leverages the Kinetic Chain Space (KCS) to encode the topological information of the hand (or body) pose. Finally, as the labeled dataset expands with the newly annotated data, we train our model via meta-learning to facilitate fast adaptation to the iteratively enlarged labeled dataset.

In summary, our main contributions are: 1) We formulate the pose estimation active learning procedure as a Markov Decision Process (MDP) and develop a Reinforcement Learning (RL) based framework for effective sample selection. 2) To help the learning of the agents, we propose a state-action representation that characterizes the informativeness and representativeness of the samples. 3) We validate the efficacy of the proposed MATAL framework on both human hand and body pose benchmarks.

2. Related Work

Pose Estimation. Below we briefly review recent pose estimation methods; more works can be found in [7, 18]. Several approaches [55, 61, 32, 41, 30, 42, 20, 62, 8] have investigated the use of deep learning to predict hand poses from depth or RGB-D images. These methods employed heatmaps [52], pose structure information [44] or the hand's shape information [26] to improve the performance. More recent works [63, 29, 17, 64] derived the hand joints' poses from RGB inputs. Similarly, recent human body pose estimation approaches [49, 61, 27, 43, 31, 10] focused more on deriving body joints' poses from RGB images. The state-of-the-art Stacked Hourglass [56] employed an encoder-decoder structure to predict joints' locations as heatmaps, while HRNet [43] maintained high-resolution representations throughout the process to better localize the joints. Our framework does not assume a specific architecture for the pose estimator and could be used with various existing models to improve their annotation efficiency.

To reduce the need for labeled data, learning methods with less supervision signal, such as weakly-supervised learning [28, 23, 8], semi-supervised learning [39, 3, 54] and self-supervised learning [9, 51], have attracted much attention recently. These methods utilize unlabeled data to improve the performance. However, most of them still rely on labeled data to distill useful information from the unlabeled images, which means the quality and informativeness of the labeled data remain crucial. Our active learning approach is parallel to these methods and could be integrated into the labeled data collection process to significantly reduce the annotation cost.

Active Learning for Pose Estimation. Active learning is an important machine learning problem that has received a lot of attention [35, 58, 6, 22]. In recent years, several works explored applications of active learning to pose estimation. Liu et al. [22] introduced an uncertainty-based estimator, utilizing the entropy of the predicted heatmaps to select informative images. Yoo et al. [58] proposed a loss prediction module, which is learned together with the target model to predict the losses of unlabeled samples; a subset of unlabeled samples with high predicted loss values is then selected for annotation. Shukla et al. [38] extended [58] to improve the correlation between the predicted and true loss values. The work in [4] used Bayesian uncertainty to estimate the confidence of the pose estimator's prediction and combined this with Core-set sampling [35] to perform selection. Caramalau et al. [5] employed Graph Convolutional Networks (GCN) to model the relation between labeled and unlabeled data, and then proposed two GCN-based sampling approaches based on uncertainty and distribution, respectively. Though these methods have achieved increasingly accurate measurements of the uncertainty or distribution of the images, their sampling policies are not directly related to the performance of the pose estimator, leading to limited performance improvement. We address this by learning a sampling policy driven by a reward that directly relates to the performance of the pose estimator. To the best of our knowledge, ours is the first Active Learning-based multi-agent framework to learn a batch sampling policy that promotes the learning of the pose estimator.

Reinforcement Learning in Pose Estimation. Reinforcement learning (RL), a learning paradigm for solving MDP problems, aims to learn a policy that takes actions to maximize the accumulated reward in an MDP [25, 45, 57, 50]. Recently, several works [33, 13] explored different applications of RL to pose estimation tasks. Shao et al. [36] used RL to learn to manipulate a 3D object to match the ground truth mask. Another work [14] considered the multi-camera setting in human body pose estimation and leveraged an RL model to select appropriate viewpoints (or cameras) to improve the performance of the pose estimator. Both of these works involved RL in the pose estimation procedure, but with completely different formulations from ours. Instead of employing RL to directly solve pose/camera parameters, we address the task of actively selecting informative samples for annotation under a specific annotation budget, and design a state-action representation with a novel meta agent teaming framework to enable effective batch sampling.
Figure 1. Overview of our MATAL framework for hand pose estimation (MATAL for human body pose estimation shares a similar structure). The solid lines describe the data flow at the t-th active learning iteration and the dotted lines that of the (t+1)-th iteration. Given a labeled sample pool D^L_t and an unlabeled sample pool D^U_t, our active learning framework works as follows: 1) We first project both D^U_t and D^L_t to the feature spaces with the pose estimator g_t, then construct the state s_t and the action space A_t from the feature spaces. The state s_t records the differences between D^U_t and D^L_t in the feature spaces, as well as the consumption of the annotation budget. The action space A_t contains the projection of D^U_t in the feature spaces. Each action a_t ∈ A_t corresponds to a unique image in D^U_t and describes the novelty, representativeness and appearance of the image. 2) The agent team follows the Q-learning [45] framework and evaluates the state-action pairs (s_t, a_t) to determine a set of actions {a^m_t}^N_{m=1}, raising the corresponding images for annotation. 3) We then update D^U_{t+1} and D^L_{t+1} by moving the newly annotated images from D^U_t to D^L_t. The pose estimator is retrained on D^L_{t+1} to obtain g_{t+1}. 4) The reward r_{t+1}, which measures the improvement of the pose estimator's prediction accuracy on D^re as Acc_{t+1} − Acc_t, is used to optimize the agent team.
3. Method

Given an unlabeled human hand (or body) dataset with a limited annotation budget, the goal of active learning (AL) is to annotate the most informative images iteratively to maximize the performance of the target pose estimator. We introduce a novel AL framework for human hand (or body) pose estimation, which leverages an agent team to raise a batch of informative images at each active learning iteration, as shown in Fig. 1.

In this section, we first show how AL for pose estimation can be formulated as a Markov Decision Process (MDP) (Sec. 3.1). Then we present our cooperative multi-agent framework to perform effective batch selection and introduce a compact representation to facilitate the cooperation between agents (Sec. 3.2). Finally, we introduce the training and deployment pipelines as well as a meta-optimization algorithm, which facilitates the agents' quick adaptation to the enlarged labeled set in AL procedures during deployment (Sec. 3.3).

3.1. Active Pose Estimation as MDP

Existing AL algorithms [38, 58, 4, 5, 22] fall into the paradigm of iteratively selecting a batch of images to label until the annotation budget B runs out. In the t-th iteration, given an unlabeled set D^U_t, a labeled set D^L_t and a pose estimator g_t, these AL algorithms take the following steps: (1) evaluate the informativeness of each image in D^U_t; (2) select a batch of informative images to query annotation; (3) move the selected images from D^U_t to D^L_t, then retrain the pose estimator g_t on the updated labeled dataset D^L_{t+1} to obtain g_{t+1}.

In this paper, we aim at learning an optimal sampling strategy that directly maximizes the performance of the target pose estimator under a fixed annotation budget, driven by maximizing the designed reward. To ease the understanding, we assume in this section that a single agent proposes a single image for annotation. In Sec. 3.2, we further discuss image batch selection by multiple agents.

We formulate the AL steps as an MDP (s_t, a_t, r_{t+1}, s_{t+1}) and convert the key AL steps as follows: (1) Estimate the state s_t, which characterizes the distribution difference between the unlabeled set D^U_t and the labeled set D^L_t at the t-th iteration. (2) Evaluate each state-action pair (s_t, a_t) to determine an image to be annotated. (3) Update D^L_t, D^U_t to D^L_{t+1}, D^U_{t+1} by moving the newly annotated image from D^U_t to D^L_t. Retrain g_t on the updated D^L_{t+1} to obtain g_{t+1} and update the state to s_{t+1} based on D^L_{t+1} and D^U_{t+1}. (4) Compute the reward r_{t+1} based on g_{t+1} and g_t evaluated on a separately reserved reward set D^re, and use it to update the agent.

We adopt the Q-learning algorithm [45] to solve this MDP problem, in which the agent scores each state-action representation pair (s_t, a_t) and takes the action a_t with the highest score (i.e., the Q-value). By deriving the reward directly from the improvement of the pose estimator, we can optimize the agent to learn a policy that maximizes the reward as well as the performance of the pose estimator.
Below we elaborate on the detailed definitions of the state s_t, the action a_t, and the reward r_t.

State. Intuitively, the state s_t should capture the distribution gap between the labeled dataset D^L_t and the unlabeled dataset D^U_t, which helps the agent pick out the most informative image that could compensate for the distribution shift between D^L_t and D^U_t. With an unbiased training set distribution, the pose estimator is more likely to generalize well to unseen cases. Specifically, in pose estimation, we consider two key attributes to characterize the distribution drifts: appearance variation and pose topological variation, which are also key considerations when collecting pose estimation datasets [60].

Based on these intuitions, we propose to collect two kinds of cues, the appearance information and the topological information, to characterize the distribution difference between D^L_t and D^U_t. Note that this difference is dynamic, as it depends on the pose estimator g_t. The design of the state helps the agent select appropriate samples for the pose estimator g_t during the active learning process.

For the appearance information f_A of the sample x, we collect the average-pooled feature from an intermediate layer of the pose estimator g_t, as shown in Fig. 1. This feature depicts the general look of the image sample x.

For the topological information, we encode topological features such as the bone lengths and bone rotations via the Kinetic Chain Space (KCS) [53, 16]. More precisely, we derive M bone vectors from the estimated pose ŷ = g_t(x) and concatenate them to form an M × n matrix, where n is the dimension of the joint coordinates. The KCS is then computed as the inner product of this matrix and its transpose. We denote the KCS over all bone vectors of the whole hand (or whole body) as the global topological feature f_{P0}.

Moreover, the performance of the pose estimator varies with each joint [56], leading to different pose estimation qualities over the various local joints of the hand (or body). To help the pose estimator achieve good performance on each joint, we additionally track the properties of the local parts of the hand (or body). We decompose the whole hand (or whole body) into six local parts, namely the palm and the five fingers (torso, head, left/right arm, and left/right leg for the body). We then compute the KCS for these parts as the local topological features {f_{P1}, f_{P2}, ..., f_{P6}} of the image x.

In this way, we extract the appearance feature f_A and the topological features {f_{P0}, f_{P1}, ..., f_{P6}} for the image x. The appearance features of all data in the labeled and unlabeled datasets then form the appearance feature space F_A. Similarly, we can build the topological feature spaces {F_{P0}, F_{P1}, ..., F_{P6}}. To model the distribution drifts between the labeled dataset D^L_t and the unlabeled dataset D^U_t, we regard the labeled and unlabeled datasets as two domains and measure the domain gap between them. Specifically, we adopt the Maximum Mean Discrepancy (MMD) [47] and compute the gap for each feature space S, where S ∈ {F_A, F_{P0}, F_{P1}, ..., F_{P6}}, via MMD as:

K_S = \mathrm{MMD}(S^L, S^U) = \sum_{i=1}^{n_L} \sum_{j=1}^{n_L} \frac{k(p_i, p_j)}{n_L^2} + \sum_{i=1}^{n_U} \sum_{j=1}^{n_U} \frac{k(q_i, q_j)}{n_U^2} - \sum_{i=1}^{n_L} \sum_{j=1}^{n_U} \frac{2\, k(p_i, q_j)}{n_L n_U},   (1)

where S^L and S^U are the distributions of S on D^L_t and D^U_t respectively, and K_S is a scalar representing the distribution difference between S^L and S^U. We denote the samples in S^L and S^U as p and q; n_L and n_U are the numbers of samples in D^L_t and D^U_t, and k(·,·) corresponds to the radial kernel [47] measuring the distance between two samples.

Moreover, the available budget is another piece of important information for the agent to perform an effective selection. Here, we use the budget consumption ratio b to represent this status. Finally, the state s_t is defined as {K_{F_A}, K_{F_{P0}}, K_{F_{P1}}, ..., K_{F_{P6}}, b}, which encodes the distribution drifts between the labeled and unlabeled sets as well as the available budget. It guides the agent to determine which kind of images could benefit the pose estimator most.
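As an illustration of the two state ingredients above, here is a small NumPy sketch of the KCS of an estimated pose and the MMD of Eq. (1) with an RBF kernel; the kernel bandwidth sigma and the bone indexing are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def kcs(pose, bone_pairs):
    """Kinetic Chain Space feature: the Gram matrix of the bone vectors.

    pose: (J, n) joint coordinates; bone_pairs: (parent, child) joint index
    pairs defining the M bones (the real kinematic tree is dataset-specific).
    """
    bones = np.stack([pose[c] - pose[p] for p, c in bone_pairs])  # (M, n)
    return bones @ bones.T        # (M, M): encodes bone lengths and angles

def mmd(p, q, sigma=1.0):
    """Eq. (1): MMD between feature sets p (n_L, d) and q (n_U, d) under the
    RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    n_l, n_u = len(p), len(q)
    return (k(p, p).sum() / n_l ** 2 + k(q, q).sum() / n_u ** 2
            - 2.0 * k(p, q).sum() / (n_l * n_u))

# The state s_t then stacks one MMD value per feature space plus the budget
# consumption ratio b, e.g. s_t = [K_FA, K_FP0, ..., K_FP6, b].
```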
Action. The action should ideally capture the potential contribution of a specific unlabeled sample when adding it to the labeled set D^L_t. Intuitively, by combining the state and action representations, the agent should have enough information to score each unlabeled sample and select an informative image from the unlabeled set D^U_t to query annotation. To this end, we associate each action a_t in the action space A_t with a unique image x in the unlabeled pool D^U_t.

To assist the selection of informative samples, we compute three kinds of features from each unlabeled image x: 1) the novelty of the pose in the image x; 2) the representativeness of the image for the unlabeled pool; 3) the general appearance information of the image. Intuitively, these three features characterize the informativeness and the representativeness of the pose as well as the appearance of the unlabeled image x. We detail each representation below.

The novelty of the image helps estimate the potential performance gain brought by adding an accurate annotation. However, it is hard to measure without the actual ground truth pose. Therefore, we propose to approximately evaluate it by utilizing the topological features from the labeled set D^L_t. Intuitively, the closeness of the global/local topological information indicates the similarity between the whole/local part of the estimated pose and the ground truth pose. A novel pose will likely have low similarity to every pose in the labeled set D^L_t. Therefore, we compute the maximum cosine similarity between the unlabeled image x and the labeled set D^L_t individually on each topological feature space {F_{P0}, F_{P1}, ..., F_{P6}} as {s_0, s_1, ..., s_6}, and consider it a proxy for the pose novelty.

We then introduce our parameterization for the representativeness of the sample. The labeled set D^L_t and the unlabeled set D^U_t jointly describe the distribution of the data. Therefore, it is also important to sample representative images w.r.t. the unlabeled set D^U_t, which can be characterized by the distribution of the similarity scores. We introduce a histogram-based representation d to record the cosine similarity distribution between x and D^U_t on each topological feature space as {d_0, d_1, ..., d_6}. Combined with the parameters {s_0, s_1, ..., s_6} representing the similarity of x to D^L_t, the agent can avoid repeatedly sampling representative images that our pose estimator has already learned from, leading to improved sampling efficiency.

Finally, we extract the image appearance feature f_A of the unlabeled image x as its appearance property (e.g., clothes texture, skin color, background, etc.). The final action representation a_t corresponding to the unlabeled image x is the combination of these features: a_t = {s_0, s_1, ..., s_6, d_0, d_1, ..., d_6, f_A}, enabling the agent to effectively identify the informativeness of the unlabeled image x and perform selection.

Reward. The reward is a metric that evaluates how much the selected unlabeled image can benefit the target pose model g_t. We reserve a specific subset D^re for accurate reward estimation before starting the active learning procedure. Then, we measure the accuracy of the pose estimator on this reward set D^re, and the reward r_{t+1} is defined as the difference in accuracy between g_{t+1} and g_t, as shown in Fig. 1. Note that D^re is only used for evaluation and is not used in any training process of the pose estimator. With the reward r_{t+1}, we can optimize the agent to select the most informative samples.

Figure 2. The architecture of the m-th agent in the team. Note that each agent shares a similar model architecture but with its own parameters. a_t and h^m_t are first fed into a linear layer with ReLU activation to generate the feature z^m_t ∈ R^16. z^m_t, a_t and s_t are then concatenated and passed through three linear layers with ReLU activations in between to output the Q-value.

3.2. Image Batch Selection by Agent Teaming

To select a batch of N images per iteration, we employ a team of N agents (Fig. 2). We denote the m-th agent's policy network as q^m, and the action performed by it as a^m_t. Then, to model the sequential cooperation between agents, we can additionally provide the m-th agent with the actions {a^i_t}^{m−1}_{i=1} of the previous m − 1 agents. However, this would require an increasingly deep and wide neural network to process the information of {a^i_t}^{m−1}_{i=1} for large m, leading to undesirably high computational complexity. To address this, we use the expectation of {a^i_t}^{m−1}_{i=1}, a fixed-length compact representation of the previous agents' actions, as an extra state for the m-th agent. Mathematically, the expectation of the previous agents' actions h^m_t is computed as:

h^m_t = \frac{1}{m-1} \sum_{i=1}^{m-1} a^i_t,   (2)

and the action made by the m-th agent then becomes:

a^m_t = \operatorname*{argmax}_{a_t \in A_t} q^m(s_t, a_t, h^m_t; \theta_m),   (3)

where θ_m denotes the parameters of the m-th agent's network.
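Putting the action vector of Sec. 3.1 and Eqs. (2)-(3) together, below is a PyTorch-style sketch of the m-th agent's Q-network (layer widths follow Fig. 2) and the sequential team selection. The zero initialization of h_t^1 for the first agent and the masking of already-chosen images are our own assumptions.

```python
import torch
import torch.nn as nn

class Agent(nn.Module):
    """The m-th agent's Q-network (Fig. 2): (a_t, h_t^m) is embedded into
    z_t^m in R^16, then [z_t^m, a_t, s_t] is mapped to a scalar Q-value."""

    def __init__(self, state_dim, action_dim, z_dim=16, hidden=128):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(2 * action_dim, z_dim), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(z_dim + action_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s, a, h):
        z = self.embed(torch.cat([a, h], dim=-1))
        return self.head(torch.cat([z, a, s], dim=-1)).squeeze(-1)

def team_select(agents, s, actions):
    """Sequential batch selection, Eqs. (2)-(3): agent m conditions on the
    running mean h_t^m of the action vectors chosen by agents 1..m-1.

    s: (state_dim,); actions: (num_images, action_dim). Returns image indices.
    """
    n = actions.shape[0]
    chosen, chosen_vecs = [], []
    for agent in agents:
        h = (torch.stack(chosen_vecs).mean(dim=0) if chosen_vecs
             else torch.zeros(actions.shape[-1]))   # h_t^1 := 0 (assumption)
        q = agent(s.unsqueeze(0).expand(n, -1), actions,
                  h.unsqueeze(0).expand(n, -1))
        if chosen:                                  # forbid duplicate picks
            q[chosen] = float("-inf")
        idx = int(q.argmax())
        chosen.append(idx)
        chosen_vecs.append(actions[idx])
    return chosen
```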
Algorithm 1: Teaming Sampling Policy Learning
Input: agent team {q^m}^N_{m=1}; an initial pose estimator g_init; an initial set D_init with annotations; image batch size N
1: D^L_init, D^U_init, D^re ← RandomPartition(D_init)
2: while not done do   // episode training
3:     D^L_0 ← D^L_init; D^U_0 ← D^U_init; g_0 ← UPDATE(g_init, D^L_0)
4:     for t = 0 to T − 1 do   // AL procedure
5:         Build the state s_t and the action space A_t (Sec. 3.1)
6:         Use the agent team {q^m}^N_{m=1} to select images following Eq. 3: {x_m}^N_{m=1} ⇐ {a^m_t}^N_{m=1}
7:         Annotate the data: {(x_m, y_m)}^N_{m=1} ← {x_m}^N_{m=1}
8:         Update D^U_t, D^L_t and g_t: D^L_{t+1} ← D^L_t ∪ {(x_m, y_m)}^N_{m=1}; D^U_{t+1} ← D^U_t \ {x_m}^N_{m=1}; g_{t+1} ← UPDATE(g_t, D^L_{t+1})
9:         Compute the reward on D^re: r_{t+1} = Acc(g_{t+1}) − Acc(g_t)
10:    end for
11:    Update {q^m}^N_{m=1} following Eq. 4
12: end while

3.3. Model Training with Meta Optimization

With the RL-for-AL formulation introduced in Sec. 3.1 and the agent teaming framework in Sec. 3.2, we introduce the training and deployment pipelines in this section.

Given an unlabeled dataset D_full and an annotation budget B, our MATAL pipeline works as follows. We first randomly sample an initial subset D_init to request annotations. With the labeled initial subset D_init, we further partition it to simulate the AL procedure and train our agent team {q^m}^N_{m=1}. Specifically, we partition the labeled initial set D_init into the labeled set D^L_init, the unlabeled set D^U_init, and the reward set D^re, and then have our agent team play the active batch image selection game following Sec. 3.1 and Sec. 3.2. The detailed process is illustrated in Alg. 1. We denote this phase of training the agent team on the initial labeled set as the Training Phase.

Furthermore, once our agent team is trained on D_init, it can be deployed to execute the real active learning procedure on the rest of the unlabeled pool D^U = D_full \ D_init, until the budget B runs out. We denote this phase as the Deployment Phase, in which the agent team proposes batch samples {x_m}^N_{m=1} for annotation from D^U at each iteration and expands the labeled pool D^L = D^L ∪ {x_m}^N_{m=1} to update the pose estimator g. We set D^L = D_init at the start of this phase and expand it during the Deployment Phase.

With the enlarged labeled set D^L, we can then retrain our agent team on it to improve the performance of the RL agent team, again following Alg. 1. Note that we set D_init in Alg. 1 to the most up-to-date D^L each time we perform retraining in this Deployment Phase.

However, training the agent team {q^m}^N_{m=1} on the expanded labeled set D^L could be time-consuming due to the growing size of D^L. To reduce the time complexity, we further propose a Meta-Learning-based extension of Alg. 1. Inspired by MAML [12], we consider each retraining process as a task and leverage Meta-Learning [12] to learn a good initialization of the policy network parameters that can quickly adapt to the new tasks of retraining on the enlarged dataset. We adopt this Meta-Learning-based extension in the Training Phase and empirically show that we can reduce the multi-agent team update cost by half without sacrificing performance, as shown in Sec. 4.3.

4. Experiments

We conduct extensive experiments on both human hand and body pose datasets to evaluate the effectiveness of our proposed MATAL framework.

For human hand pose estimation, we follow the experimental settings of [5] and evaluate the performance of MATAL on three widely used datasets: ICVL [46], NYU [48] and BigHand2.2M [60]. ICVL is a depth-based hand image dataset and NYU is a larger RGB-D dataset collected by multiple cameras. Furthermore, to evaluate the efficacy of our method on large-scale datasets, we set up experiments on BigHand2.2M [60], which contains around 2.2 million images collected from ten different subjects. For human body pose estimation, we use MPII [1], which is an RGB dataset widely used in recent works.

4.1. MATAL on Human Hand Pose Estimation

Baseline. We compare the performance of our MATAL on the hand pose estimation task with random sampling as well as existing state-of-the-art methods, including Coreset [35], MCD CKE [4], UncertainGCN [5] and CoreGCN [5], based on their reported results on each dataset.

Implementation Details. Following [5], we use DeepPrior [32] as the backbone of our pose estimator. We extract the feature map from the last convolutional layer of DeepPrior and perform average pooling with a 5 × 5 kernel with stride 3, followed by flattening, to generate a 128-D appearance feature vector. We use the 21 joints estimated by DeepPrior and compute a 275-D topological feature vector. We use 40 agents to build the agent team for image batch selection on NYU and BigHand2.2M, and 4 agents for ICVL as it is much smaller than the other datasets.

For each dataset, we first randomly sample a small number of images from its training set to build the initial set D_init, and the remaining images form the unlabeled set D^U. The sizes of D_init for the ICVL, NYU, and BigHand2.2M datasets are 80, 800, and 800, respectively. We then train our MATAL on D_init via Alg. 1, in which D_init is split into three disjoint sets D^re, D^U_init and D^L_init with a ratio of 3:6:1. Later, we deploy the trained MATAL to sample images from D^U and initialize D^L as D_init. In the Deployment Phase, the agent team is frozen to sample informative image batches iteratively, while the pose estimator is updated every time a newly annotated batch arrives.
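Returning to the meta-optimization of Sec. 3.3: since the exact update of Eq. (4) is not reproduced above, the following first-order, MAML-style sketch reflects only our reading of it. Each simulated retraining round is treated as a task, and the team's shared initialization is updated from the post-adaptation gradients; the task callables and hyperparameters are illustrative assumptions.

```python
import copy
import torch

def meta_update(agent, tasks, inner_lr=1e-4, meta_lr=1e-4, inner_steps=1):
    """First-order MAML-style update of the agent's initialization (sketch).

    tasks: callables with task(model) -> scalar Q-learning loss computed on
    one simulated retraining round (e.g., a TD error over stored transitions).
    """
    meta_opt = torch.optim.Adam(agent.parameters(), lr=meta_lr)
    meta_opt.zero_grad()
    for task in tasks:
        fast = copy.deepcopy(agent)          # clone the shared initialization
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):         # inner-loop adaptation on the task
            inner_opt.zero_grad()
            task(fast).backward()
            inner_opt.step()
        inner_opt.zero_grad()
        task(fast).backward()                # gradient after adaptation
        # First-order approximation: accumulate the adapted model's gradients
        # onto the shared initialization.
        for p, fp in zip(agent.parameters(), fast.parameters()):
            p.grad = fp.grad.clone() if p.grad is None else p.grad + fp.grad
    meta_opt.step()                          # move the initialization
```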
(a) ICVL [46]  (b) NYU [48]  (c) BigHand [60]  (d) MPII [1]  (e) NYU [48]  (f) MPII [1]
Figure 3. (a)-(d): Active learning results of pose estimation on four datasets. The results on (a) ICVL, (b) NYU and (c) BigHand are for hand pose estimation, and the curves in these figures show the average mean-square error of the joints' poses (lower is better) over different numbers of annotated frames. The result of human body pose estimation on the MPII dataset is presented in sub-figure (d), where the metric is [email protected] (higher is better). (e)-(f): Ablation study of the agent team on the human hand and body benchmarks.
Each time the size of the labeled dataset D^L doubles compared to the previous time the agent team was trained, we go back to the Training Phase to retrain our agent team module via efficient meta optimization with Alg. 1, in which we set D_init to the most up-to-date labeled set D^L. With the updated agent team, we resume the AL procedure on D^U. These steps are repeated until the annotation budget B is exhausted.

We set the learning rate of our policy network to 1e-4 and the discount factor γ in Eq. 4 to 0.9. We use the average joint error to measure the performance of the pose estimator on the test set of each dataset. To show the robustness of our method, we run our experiments 5 times and report the mean performance and its deviation.

Result on ICVL. Fig. 3 (a) shows the performance of our proposed MATAL on the ICVL dataset. Our method consistently outperforms state-of-the-art methods at each active learning iteration by a clear margin. UncertainGCN outperforms the other existing methods at the beginning stage, but later CoreGCN achieves better performance, possibly because fixed criteria based on uncertainty or representativeness cannot consistently identify informative samples during the entire AL procedure. Instead, our MATAL selects the images that can most benefit the pose estimator with the proposed learning framework, which adapts to the needs of the pose estimator at different stages. As shown in Fig. 3 (a), our MATAL needs just 600 labeled images to reduce the average joint error to less than 12.5 mm, while UncertainGCN [5] and MCD CKE [4] need more than 900 labeled images. At the end of the AL procedure with 1000 labeled images, the average joint error of our model is reduced to 11.89 mm, which is much lower than the minimum value obtained by the other methods.

Result on NYU. This dataset was collected by multiple cameras, leading to several images sharing nearly the same topological information. Although these images have different appearance features, the redundant topological information significantly decreases the learning efficacy of the pose estimator. As shown in Fig. 3 (b), the performance of Coreset [35] is close to the error of random sampling, as Coreset mainly relies on the appearance feature information but disregards the topological information. MCD CKE [4] obtains better performance by utilizing the pose estimator's uncertainty. Our method, benefiting from learning the sampling policy directly from data, significantly outperforms the MCD CKE baseline. On this dataset, our method requires only 5K images to achieve nearly the same performance (23.5 mm) obtained by other approaches that require around 10K labeled images.

Result on BigHand2.2M. We use the large-scale BigHand2.2M [60] dataset to show the scalability of our method. It contains around 2.2 million images of subjects with different hand shapes and covers schemed, random, and egocentric poses. Thus, this dataset is much more diverse and challenging. Figure 3 (c) shows the performance of the different AL algorithms. Our method still outperforms the other methods, demonstrating that MATAL can learn to select informative images even on this diverse dataset.

4.2. MATAL on Human Body Pose Estimation

Baseline. We benchmark our MATAL framework against SOTA active learning frameworks for human body pose estimation, including Coreset [35], LearningLoss [58], LearningLoss++ [38] and EGL++ [37].

Implementation details. Following the previous works [38, 37], we use Stacked Hourglass [31] as the backbone of our pose estimator. We collect the feature map from the bottleneck CNN layer of the last Hourglass block and perform global average pooling on it to build the image appearance feature, and we use the predicted 16 joints to build the topological features. A team of 40 agents is set up for batch selection, and 800 images are randomly sampled to build the initial dataset D_init. Moreover, we follow the previous works [38, 37] and use [email protected] [31] to measure the performance. The other settings follow the hand pose estimation experiments.

Result on MPII. Figure 3 (d) demonstrates the performance of MATAL on the body pose estimation task. All existing methods achieve better results than random sampling, but their [email protected] scores are close to each other.
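For reference, [email protected] [31] counts a predicted joint as correct when its distance to the ground truth is within half of the per-image head segment length; a minimal sketch follows, with the array shapes being our own convention:

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """[email protected]: fraction of joints whose prediction error is below
    alpha times the head segment length of the corresponding image.

    pred, gt: (num_images, num_joints, 2) joint coordinates in pixels;
    head_sizes: (num_images,) head segment lengths provided by the dataset.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)        # (num_images, num_joints)
    return float((dist <= alpha * head_sizes[:, None]).mean())
```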
Table 1. Ablation study on the design of the state and the action representation. We ablate the state/action representations by comparing the accuracy of the model with each individual component removed from the state s_t or the action a_t.

Method                          | MSE (mm) with labeled samples
                                | 2000   4000   6000   8000
State w/o K_{F_P0}              | 28.44  25.21  23.47  23.01
State w/o {K_{F_Pi}}^6_{i=1}    | 28.30  25.24  23.66  23.23
State w/o K_{F_A}               | 29.00  25.65  24.49  23.55
State w/o b                     | 28.62  25.22  24.10  23.85
Action w/o {s_i}^6_{i=0}        | 30.47  26.77  25.36  25.13
Action w/o {d_i}^6_{i=0}        | 27.60  24.82  24.51  24.12
Action w/o f_A                  | 29.18  25.84  24.44  23.69
MATAL                           | 26.08  24.11  22.97  22.53

Table 2. Ablation study for the meta optimization. We compare MATAL with and without the Meta-Optimization and show that Meta-Optimization significantly accelerates the retraining process.

Method          | MSE (mm) | Time cost (h)
MATAL w/o meta  | 23.68    | 4.5
MATAL w/ meta   | 23.74    | 2

EGL++ [37] tends to slightly outperform the other existing approaches and has a narrow deviation. Our MATAL achieves significantly higher accuracy by learning a sampling policy that directly maximizes the performance of the pose estimator. The proposed MATAL uses around 25% of the labels to obtain a [email protected] of 85.1%, while using the fully annotated data yields a [email protected] of 90.5%. Moreover, the proposed MATAL requires only 4K images to achieve performance similar to others that require 6K images, saving the labeling effort of around 2K images.

4.3. Ablation Study

Effect of the state and action representations. We perform an ablation study on the NYU dataset to evaluate the contribution of each component in our proposed state and action representations. As the agent team relies on the state to decide the sampling policy, we first investigate the influence of the state information by removing its components individually from the complete model. Similarly, we also examine the effect of the information in the action vector. As shown in Table 1, the complete MATAL gives the lowest average joint error in all active learning iterations. Removing either the global or the local topological information from the state degrades the performance of our method. The largest increase in average joint error occurs when the maximum-similarity scores {s_i}^6_{i=0} are removed from the action representation. This further verifies the effectiveness of using the differences in the global and local topological features to estimate the novelty of the recovered poses.

Effect of agent team policy learning. We further validate the performance of the proposed multi-agent sampling policy on the NYU and MPII datasets. We first consider using only one agent to select a single image in each active learning iteration. Then we construct a second baseline where one agent selects a batch of images in one shot; here, the images with the N highest Q-values are sampled. Finally, we present the performance of using N agents to select N images in two different settings: with or without teamwork. Fig. 3 (e) and (f) report the performance of these sampling strategies. As shown in Fig. 3 (e) and (f), selecting multiple images by either a single agent or noncooperative multiple agents gives the worst results. We argue that this is because these methods tend to select similar images whose Q-values are high yet close to each other, leading to several inefficiencies in the batch image selection setting. Introducing cooperation among the separate agents helps address this problem, as the proposed expectation of previous actions provides valuable information about the other agents' decisions, and each agent can learn to sample with a better coverage of the underlying distribution. Using one agent to select one image at each iteration also provides competitive performance, but still tends to be inferior to our agent teaming method. The main reason is that the minor improvement of the pose estimator leads to small and noisy rewards, making it difficult for the agent to learn a good sampling policy. Furthermore, the time cost of the method that uses one agent to select only one image at a time is much higher than that of our agent teaming method.

Effect of Meta-Learning. We use meta-optimization to update the agent team module more effectively and efficiently. In this experiment, we compare in Table 2 the time cost of collecting 5K informative images by our model with and without meta-learning on the NYU dataset. Note that the time cost of sampling is almost the same for both models; it is the time consumed by retraining the agent team that really makes the difference. As shown in Table 2, with our meta-optimization scheme, our model obtains competitive performance while reducing the time consumption by more than half.

5. Conclusion

In this paper, we proposed an RL-based batch-selection active learning framework for pose estimation named MATAL. MATAL directly learns a cooperative sampling policy for a team of agents to achieve effective image batch selection. Moreover, a Meta-Optimization scheme was introduced to significantly accelerate the retraining of our agent team during the Deployment Phase of the active learning procedure. We conducted extensive ablation studies to verify the design of our framework. Furthermore, we compared the performance of our model with existing SOTA works on four widely used datasets and obtained better accuracy in all experiments.

Acknowledgments. The project is supported by AI Singapore under grant number AISG-100E-2020-065, the National Research Foundation Singapore, and a SUTD Startup Research Grant. This work is also partially supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No 952215.
References

[1] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
[2] Adnane Boukhayma, Rodrigo de Bem, and Philip HS Torr. 3d hand shape and pose from images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10843–10852, 2019.
[3] Yujun Cai, Liuhao Ge, Jianfei Cai, and Junsong Yuan. Weakly-supervised 3d hand pose estimation from monocular rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 666–682, 2018.
[4] Razvan Caramalau, Binod Bhattarai, and Tae-Kyun Kim. Active learning for bayesian 3d hand pose estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3419–3428, 2021.
[5] Razvan Caramalau, Binod Bhattarai, and Tae-Kyun Kim. Sequential graph convolutional network for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9583–9592, 2021.
[6] Arantxa Casanova, Pedro O Pinheiro, Negar Rostamzadeh, and Christopher J Pal. Reinforced active learning for image segmentation. arXiv preprint arXiv:2002.06583, 2020.
[7] Yucheng Chen, Yingli Tian, and Mingyi He. Monocular human pose estimation: A survey of deep learning-based methods. Computer Vision and Image Understanding, 192:102897, 2020.
[8] Yujin Chen, Zhigang Tu, Liuhao Ge, Dejun Zhang, Ruizhi Chen, and Junsong Yuan. So-handnet: Self-organizing network for 3d hand pose estimation with semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6961–6970, 2019.
[9] Yujin Chen, Zhigang Tu, Di Kang, Linchao Bao, Ying Zhang, Xuefei Zhe, Ruizhi Chen, and Junsong Yuan. Model-based 3d hand reconstruction via self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10451–10460, 2021.
[10] Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S Huang, and Lei Zhang. Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5386–5395, 2020.
[11] Manuela Chessa, Guido Maiello, Lina K Klein, Vivian C Paulun, and Fabio Solari. Grasping objects in immersive virtual reality. In 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 1749–1754. IEEE, 2019.
[12] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
[13] Guillermo Garcia-Hernando, Edward Johns, and Tae-Kyun Kim. Physics-based dexterous manipulations with estimated hand poses and residual reinforcement learning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9561–9568. IEEE, 2020.
[14] Erik Gärtner, Aleksis Pirinen, and Cristian Sminchisescu. Deep reinforcement learning for active human pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10835–10844, 2020.
[15] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, and Junsong Yuan. 3d hand shape and pose estimation from a single rgb image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10833–10842, 2019.
[16] Kehong Gong, Jianfeng Zhang, and Jiashi Feng. Poseaug: A differentiable pose augmentation framework for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8575–8584, 2021.
[17] Umar Iqbal, Pavlo Molchanov, Thomas Breuel, Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5d heatmap regression. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[18] Rui Li, Zhenyu Liu, and Jianrong Tan. A survey on 3d hand pose estimation: Cameras, methods, and datasets. Pattern Recognition, 93:251–272, 2019.
[19] Shile Li and Dongheui Lee. Point-to-pose voting based hand pose estimation using residual permutation equivariant layer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11927–11936, 2019.
[20] Tianjiao Li, Jun Liu, Wei Zhang, Yun Ni, Wenqian Wang, and Zhiheng Li. Uav-human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16266–16275, 2021.
[21] Xing Liang, Anastassia Angelopoulou, Epaminondas Kapetanios, Bencie Woll, Reda Al Batat, and Tyron Woolfe. A multi-modal machine learning approach and toolkit to automate recognition of early stages of dementia among british sign language users. In European Conference on Computer Vision, pages 278–293. Springer, 2020.
[22] Buyu Liu and Vittorio Ferrari. Active learning for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4363–4372, 2017.
[23] Shaowei Liu, Hanwen Jiang, Jiarui Xu, Sifei Liu, and Xiaolong Wang. Semi-supervised 3d hand-object poses estimation with interactions in time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14687–14697, 2021.
[24] Zhuoming Liu, Hao Ding, Huaping Zhong, Weijia Li, Jifeng Dai, and Conghui He. Influence selection for active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9274–9283, 2021.
[25] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2208–2217. PMLR, 06–11 Aug 2017.
[26] Jameel Malik, Ahmed Elhayek, Fabrizio Nunnari, Kiran Varanasi, Kiarash Tamaddon, Alexis Heloir, and Didier Stricker. Deephps: End-to-end estimation of 3d hand pose and shape by learning from synthetic depth. In 2018 International Conference on 3D Vision (3DV), pages 110–119. IEEE, 2018.
[27] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. Vnect: Real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics (TOG), 36(4):1–14, 2017.
[28] Rahul Mitra, Nitesh B Gundavarapu, Abhishek Sharma, and Arjun Jain. Multiview-consistent semi-supervised learning for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6907–6916, 2020.
[29] Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. Interhand2.6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, pages 548–564. Springer, 2020.
[30] Franziska Mueller, Dushyant Mehta, Oleksandr Sotnychenko, Srinath Sridhar, Dan Casas, and Christian Theobalt. Real-time hand tracking under occlusion from an egocentric rgb-d sensor. In Proceedings of the IEEE International Conference on Computer Vision, pages 1154–1163, 2017.
[31] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
[32] Markus Oberweger, Paul Wohlhart, and Vincent Lepetit. Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807, 2015.
[33] Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, and Sergey Levine. Sfv: Reinforcement learning of physical skills from videos. ACM Transactions on Graphics (TOG), 37(6):1–14, 2018.
[34] Viraj Prabhu, Arjun Chandrasekaran, Kate Saenko, and Judy Hoffman. Active domain adaptation via clustering uncertainty-weighted embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8505–8514, 2021.
[35] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
[36] Jianzhun Shao, Yuhang Jiang, Gu Wang, Zhigang Li, and Xiangyang Ji. Pfrl: Pose-free reinforcement learning for 6d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11454–11463, 2020.
[37] Megh Shukla. Egl++: Extending expected gradient length to active learning for human pose estimation. arXiv preprint arXiv:2104.09493, 2021.
[38] Megh Shukla and Shuaib Ahmed. A mathematical analysis of learning loss for active learning in regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3320–3328, 2021.
[39] Adrian Spurr, Umar Iqbal, Pavlo Molchanov, Otmar Hilliges, and Jan Kautz. Weakly supervised 3d hand pose estimation via biomechanical constraints. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 211–228. Springer, 2020.
[40] Srinath Sridhar, Anna Maria Feit, Christian Theobalt, and Antti Oulasvirta. Investigating the dexterity of multi-finger input for mid-air text entry. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3643–3652, 2015.
[41] Srinath Sridhar, Franziska Mueller, Antti Oulasvirta, and Christian Theobalt. Fast and robust hand tracking using detection-guided optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3221, 2015.
[42] Srinath Sridhar, Franziska Mueller, Michael Zollhöfer, Dan Casas, Antti Oulasvirta, and Christian Theobalt. Real-time joint tracking of a hand manipulating an object from rgb-d input. In European Conference on Computer Vision, pages 294–310. Springer, 2016.
[43] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5693–5703, 2019.
[44] Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 2602–2611, 2017.
[45] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[46] Danhang Tang, Hyung Jin Chang, Alykhan Tejani, and Tae-Kyun Kim. Latent regression forest: Structured estimation of 3d articulated hand posture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3786–3793, 2014.
[47] Ilya O Tolstikhin, Bharath K Sriperumbudur, and Bernhard Schölkopf. Minimax estimation of maximum mean discrepancy with radial kernels. Advances in Neural Information Processing Systems, 29:1930–1938, 2016.
[48] Jonathan Tompson, Murphy Stein, Yann Lecun, and Ken Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (ToG), 33(5):1–10, 2014.
[49] Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[50] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[51] Chengde Wan, Thomas Probst, Luc Van Gool, and Angela Yao. Self-supervised 3d hand pose estimation through training by fitting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10853–10862, 2019.
[52] Chengde Wan, Thomas Probst, Luc Van Gool, and Angela Yao. Dense 3d regression for hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5147–5156, 2018.
[53] Bastian Wandt, Hanno Ackermann, and Bodo Rosenhahn. A kinematic chain space for monocular motion capture. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[54] Rongchang Xie, Chunyu Wang, Wenjun Zeng, and Yizhou Wang. An empirical study of the collapsing problem in semi-supervised 2d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11240–11249, October 2021.
[55] Fu Xiong, Boshen Zhang, Yang Xiao, Zhiguo Cao, Taidong Yu, Joey Tianyi Zhou, and Junsong Yuan. A2j: Anchor-to-joint regression network for 3d articulated pose estimation from a single depth image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 793–802, 2019.
[56] Tianhan Xu and Wataru Takano. Graph stacked hourglass networks for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16105–16114, 2021.
[57] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5571–5580. PMLR, 2018.
[58] Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 93–102, 2019.
[59] Shanxin Yuan, Guillermo Garcia-Hernando, Björn Stenger, Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee, Pavlo Molchanov, Jan Kautz, Sina Honari, Liuhao Ge, et al. Depth-based 3d hand pose estimation: From current achievements to future goals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2636–2645, 2018.
[60] Shanxin Yuan, Qi Ye, Bjorn Stenger, Siddhant Jain, and Tae-Kyun Kim. Bighand2.2m benchmark: Hand pose dataset and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4866–4874, 2017.
[61] Ho Yub Jung, Soochahn Lee, Yong Seok Heo, and Il Dong Yun. Random tree walk toward instantaneous 3d human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2467–2474, 2015.
[62] Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. arXiv preprint arXiv:2203.00859, 2022.
[63] Yuxiao Zhou, Marc Habermann, Weipeng Xu, Ikhsanul Habibie, Christian Theobalt, and Feng Xu. Monocular real-time hand shape and motion capture using multi-modal data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5346–5355, 2020.
[64] Christian Zimmermann and Thomas Brox. Learning to estimate 3d hand pose from single rgb images. In Proceedings of the IEEE International Conference on Computer Vision, pages 4903–4911, 2017.