Meta Agent Teaming Active Learning for Pose Estimation (CVPR 2022)
Jia Gong¹  Zhipeng Fan²  Qiuhong Ke³  Hossein Rahmani⁴  Jun Liu¹*
¹Singapore University of Technology and Design, Singapore; ²New York University, United States
³The University of Melbourne, Australia; ⁴Lancaster University, United Kingdom
To address the aforementioned issues in a single end-to-end learning framework, we propose a novel Meta Agent Teaming Active Learning (MATAL) model for human hand (or body) pose estimation, which leverages an agent team to learn a teaming sampling policy from data. Our main insight is that selecting a batch of informative yet diverse images for annotation can be viewed as teamwork among a set of agents, where each agent in the team selects one image collaboratively based on the other agents' decisions. This active learning procedure can then be formulated as a Markov Decision Process (MDP) [45], which can be solved with Reinforcement Learning (RL). The agent team receives a state signal characterizing the distribution of the images in the dataset and cooperatively generates a batch of actions to decide which images should be labeled. To help the agent team identify informative samples for annotation, we introduce a novel state-action representation that leverages the Kinetic Chain Space (KCS) to encode the topological information of the hand (or body) pose. Finally, as the labeled dataset expands with the newly annotated data, we train our model via meta-learning to facilitate fast adaptation to the iteratively enlarged labeled dataset.

In summary, our main contributions are: 1) We formulate the pose estimation active learning procedure as a Markov Decision Process (MDP) and develop a Reinforcement Learning (RL) based framework for effective sample selection. 2) To help the learning of the agents, we propose a state-action representation that characterizes the informativeness and representativeness of the samples. 3) We validate the efficacy of the proposed MATAL framework on both human hand and body pose benchmarks.

2. Related Work

Pose Estimation. Below we briefly review recent pose estimation methods; more works can be found in [7, 18]. Several approaches [55, 61, 32, 41, 30, 42, 20, 62, 8] have investigated the use of deep learning to predict hand poses from depth or RGB-D images. These methods employed heatmaps [52], pose structure information [44] or the hand's shape information [26] to improve the performance. More recent works [63, 29, 17, 64] derived the hand joints' poses from RGB inputs. Similarly, recent human body pose estimation approaches [49, 61, 27, 43, 31, 10] focused more on deriving body joints' poses from RGB images. The state-of-the-art Stacked Hourglass [56] employed an encoder-decoder structure to predict joints' locations as heatmaps, while HRNet [43] maintained high-resolution representations throughout the process to better localize the joints. Our framework does not assume a specific architecture for the pose estimator and could be used with various existing models to improve their annotation efficiency.

To reduce the need for labeled data, learning methods with less supervision signal, such as weakly-supervised learning [28, 23, 8], semi-supervised learning [39, 3, 54] and self-supervised learning [9, 51], have attracted much attention recently. These methods utilize unlabeled data to improve the performance. However, most of them still rely on labeled data to distill useful information from the unlabeled images, which means the quality and informativeness of the labeled data remain crucial. Our active learning approach is parallel to these methods and could be integrated into the labeled data collection process to significantly reduce the annotation cost.

Active Learning for Pose Estimation. Active learning is an important machine learning problem that has received a lot of attention [35, 58, 6, 22]. In recent years, several works explored applications of active learning to pose estimation. Liu et al. [22] introduced an uncertainty-based estimator, utilizing the entropy of the predicted heatmaps to select informative images. Yoo et al. [58] proposed a loss prediction module, which is learned together with the target model to predict the losses of unlabeled samples; a subset of unlabeled samples with high predicted loss values is then selected for annotation. Shukla et al. [38] extended [58] to improve the correlation between the predicted and true loss values. The work in [4] used Bayesian uncertainty to estimate the confidence of the pose estimator's prediction and combined this with Core-set sampling [35] to perform selection. Caramalau et al. [5] employed Graph Convolutional Networks (GCN) to model the relation between labeled and unlabeled data, and then proposed two GCN-based sampling approaches based on uncertainty and distribution, respectively. Though these methods have achieved increasingly accurate measurements of the uncertainty or distribution of the images, their sampling policies are not directly related to the performance of the pose estimator, leading to limited performance improvement. We address this by learning a sampling policy driven by a reward that directly relates to the performance of the pose estimator. To the best of our knowledge, ours is the first Active Learning-based multi-agent framework to learn a batch sampling policy that promotes the learning of the pose estimator.

Reinforcement Learning in Pose Estimation. Reinforcement learning (RL), a learning paradigm for solving MDP problems, aims to learn a policy that takes actions to maximize the accumulated reward in an MDP [25, 45, 57, 50]. Recently, several works [33, 13] explored different applications of RL to pose estimation tasks. Shao et al. [36] used RL to learn to manipulate a 3D object to match the ground truth mask. Another work [14] considered the multi-camera setting in human body pose estimation and leveraged an RL model to select appropriate viewpoints (or cameras) to improve the performance of the pose estimator. Both of these works involved RL in the pose estimation procedure, but with completely different formulations from ours. Instead of employing RL to directly solve pose/camera parameters, we address the task of actively selecting informative samples for annotation under a specific annotation budget, and design a state-action representation with a novel meta agent teaming framework to enable effective batch sampling.
Figure 1. Overview of our MATAL framework for hand pose estimation (MATAL for human body pose estimation shares a similar structure). The solid lines describe the data flow at the t-th active learning iteration and the dotted lines that of the (t+1)-th iteration. Given a labeled sample pool D^L_t and an unlabeled sample pool D^U_t, our active learning framework works as follows: 1) We first project both D^U_t and D^L_t to the feature spaces with the pose estimator g_t, then construct the state s_t and the action space A_t from the feature spaces. The state s_t records the differences between D^U_t and D^L_t in the feature spaces, as well as the consumption of the annotation budget. The action space A_t contains the projection of D^U_t in the feature spaces. Each action a_t ∈ A_t corresponds to a unique image in D^U_t and describes the novelty, representativeness and appearance of the image. 2) The agent team follows the Q-learning [45] framework and evaluates the state-action pairs (s_t, a_t) to determine a set of actions {a^m_t}^N_{m=1}, raising the corresponding images for annotation. 3) We then update D^U_{t+1} and D^L_{t+1} by moving the newly annotated images from D^U_t to D^L_t. The pose estimator is retrained on D^L_{t+1} to obtain g_{t+1}. 4) The reward r_{t+1}, which measures the improvement of the pose estimator's prediction accuracy on D^re as Acc_{t+1} − Acc_t, is used to optimize the agent team.
3. Method

Given an unlabeled human hand (or body) dataset with a limited annotation budget, the goal of active learning (AL) is to annotate the most informative images iteratively to maximize the performance of the target pose estimator. We introduce a novel AL framework for human hand (or body) pose estimation, which leverages an agent team to raise a batch of informative images at each active learning iteration, as shown in Fig. 1.

In this section, we first show how AL for pose estimation can be formulated as a Markov Decision Process (MDP) (Sec. 3.1). Then we present our cooperative multi-agent framework to perform effective batch selection and introduce a compact representation to facilitate the cooperation between agents (Sec. 3.2). Finally, we introduce the training and deployment pipelines as well as a meta-optimization algorithm, which facilitates the agents' quick adaptation to the enlarged labeled set in AL procedures during deployment (Sec. 3.3).

3.1. Active Pose Estimation as MDP

Existing AL algorithms [38, 58, 4, 5, 22] fall into the paradigm of iteratively selecting a batch of images to label until the annotation budget B runs out. In the t-th iteration, given an unlabeled set D^U_t, a labeled set D^L_t and a pose estimator g_t, these AL algorithms take the following steps: (1) evaluate the informativeness of each image in D^U_t; (2) select a batch of informative images to query annotation; (3) move the selected images from D^U_t to D^L_t, then retrain the pose estimator g_t on the updated labeled dataset D^L_{t+1} to obtain g_{t+1}.

In this paper, we aim at learning an optimal sampling strategy that directly maximizes the performance of the target pose estimator under a fixed annotation budget, driven by maximizing the designed reward. To ease the understanding, we assume in this section that a single agent proposes a single image for annotation. In Sec. 3.2, we further discuss image batch selection by multiple agents.

We formulate the AL steps as an MDP (s_t, a_t, r_{t+1}, s_{t+1}) and convert the key AL steps as follows: (1) Estimate the state s_t, which characterizes the distribution difference between the unlabeled set D^U_t and the labeled set D^L_t at the t-th iteration. (2) Evaluate each state-action pair (s_t, a_t) to determine an image to be annotated. (3) Update D^L_t, D^U_t to D^L_{t+1}, D^U_{t+1} by moving the newly annotated image from D^U_t to D^L_t. Retrain g_t on the updated D^L_{t+1} to obtain g_{t+1} and update the state to s_{t+1} based on D^L_{t+1} and D^U_{t+1}. (4) Compute the reward r_{t+1} based on g_{t+1} and g_t evaluated on a separately reserved reward set D^re, and use it to update the agent.

We adopt the Q-learning algorithm [45] to solve this MDP problem, in which the agent scores each state-action representation pair (s_t, a_t) and takes the action a_t with the highest score (i.e., the Q-value). By deriving the reward directly from the improvement of the pose estimator, we can optimize the agent to learn a policy that maximizes the reward as well as the performance of the pose estimator.
Below we elaborate on the detailed definitions of the state s_t, the action a_t, and the reward r_t.

State. Intuitively, the state s_t should capture the distribution gap between the labeled dataset D^L_t and the unlabeled dataset D^U_t, which helps the agent pick out the most informative image that could compensate for the distribution shift between D^L_t and D^U_t. With an unbiased training set distribution, the pose estimator is more likely to generalize well to unseen cases. Specifically, in pose estimation, we consider two key attributes to characterize the distribution drifts: appearance variation and pose topological variation, which are also key considerations when collecting pose estimation datasets [60].

Based on these intuitions, we propose to collect two kinds of cues, the appearance information and the topological information, to characterize the distribution difference between D^L_t and D^U_t. Note that this difference is dynamic, as it depends on the pose estimator g_t. The design of the state helps the agent select appropriate samples for the pose estimator g_t during the active learning process.

For the appearance information f_A of the sample x, we collect the average-pooled feature from an intermediate layer of the pose estimator g_t, as shown in Fig. 1. This feature depicts the general look of the image sample x.

For the topological information, we encode topological features such as the bone lengths and bone rotations via the Kinetic Chain Space (KCS) [53, 16]. More precisely, we derive M bone vectors from the estimated pose ŷ = g_t(x) and concatenate them to form an M × n matrix, where n is the dimension of the joint coordinates. The KCS is then computed as the inner product of this matrix and its transpose. We denote the KCS over all bone vectors of the whole hand (or whole body) as the global topological feature f_{P0}.

Moreover, the performance of the pose estimator varies with each joint [56], leading to different pose estimation qualities over the various local joints of the hand (or body). To help the pose estimator achieve good performance on each joint, we additionally track the properties of the local parts of the hand (or body). We decompose the whole hand (or whole body) into six local parts, namely the palm and the five fingers (torso, head, left/right arm, and left/right leg for the body). We then compute the KCS for these parts as the local topological features {f_{P1}, f_{P2}, ..., f_{P6}} of the image x.

In this way, we extract the appearance feature f_A and the topological features {f_{P0}, f_{P1}, ..., f_{P6}} for the image x. The appearance features of all data in the labeled and unlabeled datasets then form the appearance feature space F_A. Similarly, we can build the topological feature spaces {F_{P0}, F_{P1}, ..., F_{P6}}. To model the distribution drifts between the labeled dataset D^L_t and the unlabeled dataset D^U_t, we regard the labeled and unlabeled datasets as two domains and measure the domain gap between them. Specifically, we adopt the Maximum Mean Discrepancy (MMD) [47] and compute the gap for each feature space S, where S ∈ {F_A, F_{P0}, F_{P1}, ..., F_{P6}}, via MMD as:

K_S = \mathrm{MMD}(S^L, S^U) = \sum_{i=1}^{n_L} \sum_{j=1}^{n_L} \frac{k(p_i, p_j)}{n_L^2} + \sum_{i=1}^{n_U} \sum_{j=1}^{n_U} \frac{k(q_i, q_j)}{n_U^2} - \sum_{i=1}^{n_L} \sum_{j=1}^{n_U} \frac{2\, k(p_i, q_j)}{n_L n_U},   (1)

where S^L and S^U are the distributions of S on D^L_t and D^U_t respectively, and K_S is a scalar representing the distribution difference between S^L and S^U. We denote the samples in S^L and S^U as p and q; n_L and n_U are the numbers of samples in D^L_t and D^U_t, and k(·,·) corresponds to the radial kernel [47] measuring the distance between two samples.

Moreover, the available budget is another piece of important information for the agent to perform an effective selection. Here, we use the budget consumption ratio b to represent this status. Finally, the state s_t is defined as {K_{F_A}, K_{F_{P0}}, K_{F_{P1}}, ..., K_{F_{P6}}, b}, which encodes the distribution drifts between the labeled and unlabeled sets as well as the available budget. It guides the agent to determine which kind of images could benefit the pose estimator most.
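As an illustration of the two state ingredients above, here is a small NumPy sketch of the KCS of an estimated pose and the MMD of Eq. (1) with an RBF kernel; the kernel bandwidth sigma and the bone indexing are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def kcs(pose, bone_pairs):
    """Kinetic Chain Space feature: the Gram matrix of the bone vectors.

    pose: (J, n) joint coordinates; bone_pairs: (parent, child) joint index
    pairs defining the M bones (the real kinematic tree is dataset-specific).
    """
    bones = np.stack([pose[c] - pose[p] for p, c in bone_pairs])  # (M, n)
    return bones @ bones.T        # (M, M): encodes bone lengths and angles

def mmd(p, q, sigma=1.0):
    """Eq. (1): MMD between feature sets p (n_L, d) and q (n_U, d) under the
    RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    n_l, n_u = len(p), len(q)
    return (k(p, p).sum() / n_l ** 2 + k(q, q).sum() / n_u ** 2
            - 2.0 * k(p, q).sum() / (n_l * n_u))

# The state s_t then stacks one MMD value per feature space plus the budget
# consumption ratio b, e.g. s_t = [K_FA, K_FP0, ..., K_FP6, b].
```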
Action. The action should ideally capture the potential contribution of a specific unlabeled sample when adding it to the labeled set D^L_t. Intuitively, by combining the state and action representations, the agent should have enough information to score each unlabeled sample and select an informative image from the unlabeled set D^U_t to query annotation. To this end, we associate each action a_t in the action space A_t with a unique image x in the unlabeled pool D^U_t.

To assist the selection of informative samples, we compute three kinds of features from each unlabeled image x: 1) the novelty of the pose in the image x; 2) the representativeness of the image for the unlabeled pool; 3) the general appearance information of the image. Intuitively, these three features characterize the informativeness and the representativeness of the pose as well as the appearance of the unlabeled image x. We detail each representation below.

The novelty of the image helps estimate the potential performance gain brought by adding an accurate annotation. However, it is hard to measure without the actual ground truth pose. Therefore, we propose to approximately evaluate it by utilizing the topological features from the labeled set D^L_t. Intuitively, the closeness of the global/local topological information indicates the similarity between the whole/local part of the estimated pose and the ground truth pose. A novel pose will likely have low similarity to every pose in the labeled set D^L_t. Therefore, we compute the maximum cosine similarity between the unlabeled image x and the labeled set D^L_t individually on each topological feature space {F_{P0}, F_{P1}, ..., F_{P6}} as {s_0, s_1, ..., s_6}, and consider it a proxy for the pose novelty.

We then introduce our parameterization for the representativeness of the sample. The labeled set D^L_t and the unlabeled set D^U_t jointly describe the distribution of the data. Therefore, it is also important to sample representative images w.r.t. the unlabeled set D^U_t, which can be characterized by the distribution of the similarity scores. We introduce a histogram-based representation d to record the cosine similarity distribution between x and D^U_t on each topological feature space as {d_0, d_1, ..., d_6}. Combined with the parameters {s_0, s_1, ..., s_6} representing the similarity of x to D^L_t, the agent can avoid repeatedly sampling representative images that our pose estimator has already learned from, leading to improved sampling efficiency.

Finally, we extract the image appearance feature f_A of the unlabeled image x as its appearance property (e.g., clothes texture, skin color, background, etc.). The final action representation a_t corresponding to the unlabeled image x is the combination of these features: a_t = {s_0, s_1, ..., s_6, d_0, d_1, ..., d_6, f_A}, enabling the agent to effectively identify the informativeness of the unlabeled image x and perform selection.

Reward. The reward is a metric that evaluates how much the selected unlabeled image can benefit the target pose model g_t. We reserve a specific subset D^re for accurate reward estimation before starting the active learning procedure. Then, we measure the accuracy of the pose estimator on this reward set D^re, and the reward r_{t+1} is defined as the difference in accuracy between g_{t+1} and g_t, as shown in Fig. 1. Note that D^re is only used for evaluation and is not used in any training process of the pose estimator. With the reward r_{t+1}, we can optimize the agent to select the most informative samples.

Figure 2. The architecture of the m-th agent in the team. Note that each agent shares a similar model architecture but with its own parameters. a_t and h^m_t are first fed into a linear layer with ReLU activation to generate the feature z^m_t ∈ R^16. z^m_t, a_t and s_t are then concatenated and passed through three linear layers with ReLU activations in between to output the Q-value.

3.2. Image Batch Selection by Agent Teaming

To select a batch of N images per iteration, we employ a team of N agents (Fig. 2). We denote the m-th agent's policy network as q^m, and the action performed by it as a^m_t. Then, to model the sequential cooperation between agents, we can additionally provide the m-th agent with the actions {a^i_t}^{m−1}_{i=1} of the previous m − 1 agents. However, this would require an increasingly deep and wide neural network to process the information of {a^i_t}^{m−1}_{i=1} for large m, leading to undesirably high computational complexity. To address this, we use the expectation of {a^i_t}^{m−1}_{i=1}, a fixed-length compact representation of the previous agents' actions, as an extra state for the m-th agent. Mathematically, the expectation of the previous agents' actions h^m_t is computed as:

h^m_t = \frac{1}{m-1} \sum_{i=1}^{m-1} a^i_t,   (2)

and the action made by the m-th agent then becomes:

a^m_t = \operatorname*{argmax}_{a_t \in A_t} q^m(s_t, a_t, h^m_t; \theta_m),   (3)

where θ_m denotes the parameters of the m-th agent's network.
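Putting the action vector of Sec. 3.1 and Eqs. (2)-(3) together, below is a PyTorch-style sketch of the m-th agent's Q-network (layer widths follow Fig. 2) and the sequential team selection. The zero initialization of h_t^1 for the first agent and the masking of already-chosen images are our own assumptions.

```python
import torch
import torch.nn as nn

class Agent(nn.Module):
    """The m-th agent's Q-network (Fig. 2): (a_t, h_t^m) is embedded into
    z_t^m in R^16, then [z_t^m, a_t, s_t] is mapped to a scalar Q-value."""

    def __init__(self, state_dim, action_dim, z_dim=16, hidden=128):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(2 * action_dim, z_dim), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(z_dim + action_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s, a, h):
        z = self.embed(torch.cat([a, h], dim=-1))
        return self.head(torch.cat([z, a, s], dim=-1)).squeeze(-1)

def team_select(agents, s, actions):
    """Sequential batch selection, Eqs. (2)-(3): agent m conditions on the
    running mean h_t^m of the action vectors chosen by agents 1..m-1.

    s: (state_dim,); actions: (num_images, action_dim). Returns image indices.
    """
    n = actions.shape[0]
    chosen, chosen_vecs = [], []
    for agent in agents:
        h = (torch.stack(chosen_vecs).mean(dim=0) if chosen_vecs
             else torch.zeros(actions.shape[-1]))   # h_t^1 := 0 (assumption)
        q = agent(s.unsqueeze(0).expand(n, -1), actions,
                  h.unsqueeze(0).expand(n, -1))
        if chosen:                                  # forbid duplicate picks
            q[chosen] = float("-inf")
        idx = int(q.argmax())
        chosen.append(idx)
        chosen_vecs.append(actions[idx])
    return chosen
```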
Algorithm 1: Teaming Sampling Policy Learning
Input: agent team {q^m}^N_{m=1}; an initial pose estimator g_init; an initial set D_init with annotations; image batch size N
1: D^L_init, D^U_init, D^re ← RandomPartition(D_init)
2: while not done do   // episode training
3:     D^L_0 ← D^L_init; D^U_0 ← D^U_init; g_0 ← UPDATE(g_init, D^L_0)
4:     for t = 0 to T − 1 do   // AL procedure
5:         Build the state s_t and the action space A_t (Sec. 3.1)
6:         Use the agent team {q^m}^N_{m=1} to select images following Eq. 3: {x_m}^N_{m=1} ⇐ {a^m_t}^N_{m=1}
7:         Annotate the data: {(x_m, y_m)}^N_{m=1} ← {x_m}^N_{m=1}
8:         Update D^U_t, D^L_t and g_t: D^L_{t+1} ← D^L_t ∪ {(x_m, y_m)}^N_{m=1}; D^U_{t+1} ← D^U_t \ {x_m}^N_{m=1}; g_{t+1} ← UPDATE(g_t, D^L_{t+1})
9:         Compute the reward on D^re: r_{t+1} = Acc(g_{t+1}) − Acc(g_t)
10:    end for
11:    Update {q^m}^N_{m=1} following Eq. 4
12: end while

3.3. Model Training with Meta Optimization

With the RL-for-AL formulation introduced in Sec. 3.1 and the agent teaming framework in Sec. 3.2, we introduce the training and deployment pipelines in this section.

Given an unlabeled dataset D_full and an annotation budget B, our MATAL pipeline works as follows. We first randomly sample an initial subset D_init to request annotations. With the labeled initial subset D_init, we further partition it to simulate the AL procedure and train our agent team {q^m}^N_{m=1}. Specifically, we partition the labeled initial set D_init into the labeled set D^L_init, the unlabeled set D^U_init, and the reward set D^re, and then have our agent team play the active batch image selection game following Sec. 3.1 and Sec. 3.2. The detailed process is illustrated in Alg. 1. We denote this phase of training the agent team on the initial labeled set as the Training Phase.

Furthermore, once our agent team is trained on D_init, it can be deployed to execute the real active learning procedure on the rest of the unlabeled pool D^U = D_full \ D_init, until the budget B runs out. We denote this phase as the Deployment Phase, in which the agent team proposes batch samples {x_m}^N_{m=1} for annotation from D^U at each iteration and expands the labeled pool D^L = D^L ∪ {x_m}^N_{m=1} to update the pose estimator g. We set D^L = D_init at the start of this phase and expand it during the Deployment Phase.

With the enlarged labeled set D^L, we can then retrain our agent team on it to improve the performance of the RL agent team, again following Alg. 1. Note that we set D_init in Alg. 1 to the most up-to-date D^L each time we perform retraining in this Deployment Phase.

However, training the agent team {q^m}^N_{m=1} on the expanded labeled set D^L could be time-consuming due to the growing size of D^L. To reduce the time complexity, we further propose a Meta-Learning-based extension of Alg. 1. Inspired by MAML [12], we consider each retraining process as a task and leverage Meta-Learning [12] to learn a good initialization of the policy network parameters that can quickly adapt to the new tasks of retraining on the enlarged dataset. We adopt this Meta-Learning-based extension in the Training Phase and empirically show that we can reduce the multi-agent team update cost by half without sacrificing performance, as shown in Sec. 4.3.

4. Experiments

We conduct extensive experiments on both human hand and body pose datasets to evaluate the effectiveness of our proposed MATAL framework.

For human hand pose estimation, we follow the experimental settings of [5] and evaluate the performance of MATAL on three widely used datasets: ICVL [46], NYU [48] and BigHand2.2M [60]. ICVL is a depth-based hand image dataset and NYU is a larger RGB-D dataset collected by multiple cameras. Furthermore, to evaluate the efficacy of our method on large-scale datasets, we set up experiments on BigHand2.2M [60], which contains around 2.2 million images collected from ten different subjects. For human body pose estimation, we use MPII [1], which is an RGB dataset widely used in recent works.

4.1. MATAL on Human Hand Pose Estimation

Baseline. We compare the performance of our MATAL on the hand pose estimation task with random sampling as well as existing state-of-the-art methods, including Coreset [35], MCD CKE [4], UncertainGCN [5] and CoreGCN [5], based on their reported results on each dataset.

Implementation Details. Following [5], we use DeepPrior [32] as the backbone of our pose estimator. We extract the feature map from the last convolutional layer of DeepPrior and perform average pooling with a 5 × 5 kernel with stride 3, followed by flattening, to generate a 128-D appearance feature vector. We use the 21 joints estimated by DeepPrior and compute a 275-D topological feature vector. We use 40 agents to build the agent team for image batch selection on NYU and BigHand2.2M, and 4 agents for ICVL as it is much smaller than the other datasets.

For each dataset, we first randomly sample a small number of images from its training set to build the initial set D_init, and the remaining images form the unlabeled set D^U. The sizes of D_init for the ICVL, NYU, and BigHand2.2M datasets are 80, 800, and 800, respectively. We then train our MATAL on D_init via Alg. 1, in which D_init is split into three disjoint sets D^re, D^U_init and D^L_init with a ratio of 3:6:1. Later, we deploy the trained MATAL to sample images from D^U and initialize D^L as D_init. In the Deployment Phase, the agent team is frozen to sample informative image batches iteratively, while the pose estimator is updated every time a newly annotated batch arrives.
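Returning to the meta-optimization of Sec. 3.3: since the exact update of Eq. (4) is not reproduced above, the following first-order, MAML-style sketch reflects only our reading of it. Each simulated retraining round is treated as a task, and the team's shared initialization is updated from the post-adaptation gradients; the task callables and hyperparameters are illustrative assumptions.

```python
import copy
import torch

def meta_update(agent, tasks, inner_lr=1e-4, meta_lr=1e-4, inner_steps=1):
    """First-order MAML-style update of the agent's initialization (sketch).

    tasks: callables with task(model) -> scalar Q-learning loss computed on
    one simulated retraining round (e.g., a TD error over stored transitions).
    """
    meta_opt = torch.optim.Adam(agent.parameters(), lr=meta_lr)
    meta_opt.zero_grad()
    for task in tasks:
        fast = copy.deepcopy(agent)          # clone the shared initialization
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):         # inner-loop adaptation on the task
            inner_opt.zero_grad()
            task(fast).backward()
            inner_opt.step()
        inner_opt.zero_grad()
        task(fast).backward()                # gradient after adaptation
        # First-order approximation: accumulate the adapted model's gradients
        # onto the shared initialization.
        for p, fp in zip(agent.parameters(), fast.parameters()):
            p.grad = fp.grad.clone() if p.grad is None else p.grad + fp.grad
    meta_opt.step()                          # move the initialization
```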
(a) ICVL [46]  (b) NYU [48]  (c) BigHand [60]  (d) MPII [1]  (e) NYU [48]  (f) MPII [1]
Figure 3. (a)-(d): Active learning results of pose estimation on four datasets. The results on (a) ICVL, (b) NYU and (c) BigHand are for hand pose estimation, and the curves in these figures show the average mean-square error of the joints' poses (lower is better) over different numbers of annotated frames. The result of human body pose estimation on the MPII dataset is presented in sub-figure (d), where the metric is [email protected] (higher is better). (e)-(f): Ablation study of the agent team on the human hand and body benchmarks.
Each time the size of the labeled dataset D^L doubles compared to the previous time the agent team was trained, we go back to the Training Phase to retrain our agent team module via efficient meta optimization with Alg. 1, in which we set D_init to the most up-to-date labeled set D^L. With the updated agent team, we resume the AL procedure on D^U. These steps are repeated until the annotation budget B is exhausted.

We set the learning rate of our policy network to 1e-4 and the discount factor γ in Eq. 4 to 0.9. We use the average joint error to measure the performance of the pose estimator on the test set of each dataset. To show the robustness of our method, we run our experiments 5 times and report the mean performance and its deviation.

Result on ICVL. Fig. 3 (a) shows the performance of our proposed MATAL on the ICVL dataset. Our method consistently outperforms state-of-the-art methods at each active learning iteration by a clear margin. UncertainGCN outperforms the other existing methods at the beginning stage, but later CoreGCN achieves better performance, possibly because fixed criteria based on uncertainty or representativeness cannot consistently identify informative samples during the entire AL procedure. Instead, our MATAL selects the images that can most benefit the pose estimator with the proposed learning framework, which adapts to the needs of the pose estimator at different stages. As shown in Fig. 3 (a), our MATAL needs just 600 labeled images to reduce the average joint error to less than 12.5 mm, while UncertainGCN [5] and MCD CKE [4] need more than 900 labeled images. At the end of the AL procedure with 1000 labeled images, the average joint error of our model is reduced to 11.89 mm, which is much lower than the minimum value obtained by the other methods.

Result on NYU. This dataset was collected by multiple cameras, leading to several images sharing nearly the same topological information. Although these images have different appearance features, the redundant topological information significantly decreases the learning efficacy of the pose estimator. As shown in Fig. 3 (b), the performance of Coreset [35] is close to the error of random sampling, as Coreset mainly relies on the appearance feature information but disregards the topological information. MCD CKE [4] obtains better performance by utilizing the pose estimator's uncertainty. Our method, benefiting from learning the sampling policy directly from data, significantly outperforms the MCD CKE baseline. On this dataset, our method requires only 5K images to achieve nearly the same performance (23.5 mm) obtained by other approaches that require around 10K labeled images.

Result on BigHand2.2M. We use the large-scale BigHand2.2M [60] dataset to show the scalability of our method. It contains around 2.2 million images of subjects with different hand shapes and covers schemed, random, and egocentric poses. Thus, this dataset is much more diverse and challenging. Figure 3 (c) shows the performance of the different AL algorithms. Our method still outperforms the other methods, demonstrating that MATAL can learn to select informative images even on this diverse dataset.

4.2. MATAL on Human Body Pose Estimation

Baseline. We benchmark our MATAL framework against SOTA active learning frameworks for human body pose estimation, including Coreset [35], LearningLoss [58], LearningLoss++ [38] and EGL++ [37].

Implementation details. Following the previous works [38, 37], we use Stacked Hourglass [31] as the backbone of our pose estimator. We collect the feature map from the bottleneck CNN layer of the last Hourglass block and perform global average pooling on it to build the image appearance feature, and we use the predicted 16 joints to build the topological features. A team of 40 agents is set up for batch selection, and 800 images are randomly sampled to build the initial dataset D_init. Moreover, we follow the previous works [38, 37] and use [email protected] [31] to measure the performance. The other settings follow the hand pose estimation experiments.

Result on MPII. Figure 3 (d) demonstrates the performance of MATAL on the body pose estimation task. All existing methods achieve better results than random sampling, but their [email protected] scores are close to each other.
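For reference, [email protected] [31] counts a predicted joint as correct when its distance to the ground truth is within half of the per-image head segment length; a minimal sketch follows, with the array shapes being our own convention:

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """[email protected]: fraction of joints whose prediction error is below
    alpha times the head segment length of the corresponding image.

    pred, gt: (num_images, num_joints, 2) joint coordinates in pixels;
    head_sizes: (num_images,) head segment lengths provided by the dataset.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)        # (num_images, num_joints)
    return float((dist <= alpha * head_sizes[:, None]).mean())
```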
Table 1. Ablation study on the design of the state and the action representation. We ablate the state/action representations by comparing the accuracy of the model with each individual component removed from the state s_t or the action a_t.

Method                          | MSE (mm) with labeled samples
                                | 2000   4000   6000   8000
State w/o K_{F_P0}              | 28.44  25.21  23.47  23.01
State w/o {K_{F_Pi}}^6_{i=1}    | 28.30  25.24  23.66  23.23
State w/o K_{F_A}               | 29.00  25.65  24.49  23.55
State w/o b                     | 28.62  25.22  24.10  23.85
Action w/o {s_i}^6_{i=0}        | 30.47  26.77  25.36  25.13
Action w/o {d_i}^6_{i=0}        | 27.60  24.82  24.51  24.12
Action w/o f_A                  | 29.18  25.84  24.44  23.69
MATAL                           | 26.08  24.11  22.97  22.53

Table 2. Ablation study for the meta optimization. We compare MATAL with and without the Meta-Optimization and show that Meta-Optimization significantly accelerates the retraining process.

Method          | MSE (mm) | Time cost (h)
MATAL w/o meta  | 23.68    | 4.5
MATAL w/ meta   | 23.74    | 2

EGL++ [37] tends to slightly outperform the other existing approaches and has a narrow deviation. Our MATAL achieves significantly higher accuracy by learning a sampling policy that directly maximizes the performance of the pose estimator. The proposed MATAL uses around 25% of the labels to obtain a [email protected] of 85.1%, while using the fully annotated data yields a [email protected] of 90.5%. Moreover, the proposed MATAL requires only 4K images to achieve performance similar to others that require 6K images, saving the labeling effort of around 2K images.

4.3. Ablation Study

Effect of the state and action representations. We perform an ablation study on the NYU dataset to evaluate the contribution of each component in our proposed state and action representations. As the agent team relies on the state to decide the sampling policy, we first investigate the influence of the state information by removing its components individually from the complete model. Similarly, we also examine the effect of the information in the action vector. As shown in Table 1, the complete MATAL gives the lowest average joint error in all active learning iterations. Removing either the global or the local topological information from the state degrades the performance of our method. The largest increase in average joint error occurs when the maximum-similarity scores {s_i}^6_{i=0} are removed from the action representation. This further verifies the effectiveness of using the differences in the global and local topological features to estimate the novelty of the recovered poses.

Effect of agent team policy learning. We further validate the performance of the proposed multi-agent sampling policy on the NYU and MPII datasets. We first consider using only one agent to select a single image in each active learning iteration. Then we construct a second baseline where one agent selects a batch of images in one shot; here, the images with the N highest Q-values are sampled. Finally, we present the performance of using N agents to select N images in two different settings: with or without teamwork. Fig. 3 (e) and (f) report the performance of these sampling strategies. As shown in Fig. 3 (e) and (f), selecting multiple images by either a single agent or noncooperative multiple agents gives the worst results. We argue that this is because these methods tend to select similar images whose Q-values are high yet close to each other, leading to several inefficiencies in the batch image selection setting. Introducing cooperation among the separate agents helps address this problem, as the proposed expectation of previous actions provides valuable information about the other agents' decisions, and each agent can learn to sample with a better coverage of the underlying distribution. Using one agent to select one image at each iteration also provides competitive performance, but still tends to be inferior to our agent teaming method. The main reason is that the minor improvement of the pose estimator leads to small and noisy rewards, making it difficult for the agent to learn a good sampling policy. Furthermore, the time cost of the method that uses one agent to select only one image at a time is much higher than that of our agent teaming method.

Effect of Meta-Learning. We use meta-optimization to update the agent team module more effectively and efficiently. In this experiment, we compare in Table 2 the time cost of collecting 5K informative images by our model with and without meta-learning on the NYU dataset. Note that the time cost of sampling is almost the same for both models; it is the time consumed by retraining the agent team that really makes the difference. As shown in Table 2, with our meta-optimization scheme, our model obtains competitive performance while reducing the time consumption by more than half.

5. Conclusion

In this paper, we proposed an RL-based batch-selection active learning framework for pose estimation named MATAL. MATAL directly learns a cooperative sampling policy for a team of agents to achieve effective image batch selection. Moreover, a Meta-Optimization scheme was introduced to significantly accelerate the retraining of our agent team during the Deployment Phase of the active learning procedure. We conducted extensive ablation studies to verify the design of our framework. Furthermore, we compared the performance of our model with existing SOTA works on four widely used datasets and obtained better accuracy in all experiments.

Acknowledgments. The project is supported by AI Singapore under grant number AISG-100E-2020-065, the National Research Foundation Singapore, and a SUTD Startup Research Grant. This work is also partially supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No 952215.
References

[1] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
[2] Adnane Boukhayma, Rodrigo de Bem, and Philip HS Torr. 3d hand shape and pose from images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10843–10852, 2019.
[3] Yujun Cai, Liuhao Ge, Jianfei Cai, and Junsong Yuan. Weakly-supervised 3d hand pose estimation from monocular rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 666–682, 2018.
[4] Razvan Caramalau, Binod Bhattarai, and Tae-Kyun Kim. Active learning for bayesian 3d hand pose estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3419–3428, 2021.
[5] Razvan Caramalau, Binod Bhattarai, and Tae-Kyun Kim. Sequential graph convolutional network for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9583–9592, 2021.
[6] Arantxa Casanova, Pedro O Pinheiro, Negar Rostamzadeh, and Christopher J Pal. Reinforced active learning for image segmentation. arXiv preprint arXiv:2002.06583, 2020.
[7] Yucheng Chen, Yingli Tian, and Mingyi He. Monocular human pose estimation: A survey of deep learning-based methods. Computer Vision and Image Understanding, 192:102897, 2020.
[8] Yujin Chen, Zhigang Tu, Liuhao Ge, Dejun Zhang, Ruizhi Chen, and Junsong Yuan. So-handnet: Self-organizing network for 3d hand pose estimation with semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6961–6970, 2019.
[9] Yujin Chen, Zhigang Tu, Di Kang, Linchao Bao, Ying Zhang, Xuefei Zhe, Ruizhi Chen, and Junsong Yuan. Model-based 3d hand reconstruction via self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10451–10460, 2021.
[10] Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S Huang, and Lei Zhang. Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5386–5395, 2020.
[11] Manuela Chessa, Guido Maiello, Lina K Klein, Vivian C Paulun, and Fabio Solari. Grasping objects in immersive virtual reality. In 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 1749–1754. IEEE, 2019.
[12] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
[13] Guillermo Garcia-Hernando, Edward Johns, and Tae-Kyun Kim. Physics-based dexterous manipulations with estimated hand poses and residual reinforcement learning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9561–9568. IEEE, 2020.
[14] Erik Gärtner, Aleksis Pirinen, and Cristian Sminchisescu. Deep reinforcement learning for active human pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10835–10844, 2020.
[15] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, and Junsong Yuan. 3d hand shape and pose estimation from a single rgb image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10833–10842, 2019.
[16] Kehong Gong, Jianfeng Zhang, and Jiashi Feng. Poseaug: A differentiable pose augmentation framework for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8575–8584, 2021.
[17] Umar Iqbal, Pavlo Molchanov, Thomas Breuel, Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5d heatmap regression. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[18] Rui Li, Zhenyu Liu, and Jianrong Tan. A survey on 3d hand pose estimation: Cameras, methods, and datasets. Pattern Recognition, 93:251–272, 2019.
[19] Shile Li and Dongheui Lee. Point-to-pose voting based hand pose estimation using residual permutation equivariant layer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11927–11936, 2019.
[20] Tianjiao Li, Jun Liu, Wei Zhang, Yun Ni, Wenqian Wang, and Zhiheng Li. Uav-human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16266–16275, 2021.
[21] Xing Liang, Anastassia Angelopoulou, Epaminondas Kapetanios, Bencie Woll, Reda Al Batat, and Tyron Woolfe. A multi-modal machine learning approach and toolkit to automate recognition of early stages of dementia among british sign language users. In European Conference on Computer Vision, pages 278–293. Springer, 2020.
[22] Buyu Liu and Vittorio Ferrari. Active learning for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4363–4372, 2017.
[23] Shaowei Liu, Hanwen Jiang, Jiarui Xu, Sifei Liu, and Xiaolong Wang. Semi-supervised 3d hand-object poses estimation with interactions in time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14687–14697, 2021.
[24] Zhuoming Liu, Hao Ding, Huaping Zhong, Weijia Li, Jifeng Dai, and Conghui He. Influence selection for active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9274–9283, 2021.
[25] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2208–2217. PMLR, 06–11 Aug 2017.
[26] Jameel Malik, Ahmed Elhayek, Fabrizio Nunnari, Kiran Varanasi, Kiarash Tamaddon, Alexis Heloir, and Didier Stricker. Deephps: End-to-end estimation of 3d hand pose and shape by learning from synthetic depth. In 2018 International Conference on 3D Vision (3DV), pages 110–119. IEEE, 2018.
[27] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. Vnect: Real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics (TOG), 36(4):1–14, 2017.
[28] Rahul Mitra, Nitesh B Gundavarapu, Abhishek Sharma, and Arjun Jain. Multiview-consistent semi-supervised learning for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6907–6916, 2020.
[29] Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. Interhand2.6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, pages 548–564. Springer, 2020.
[30] Franziska Mueller, Dushyant Mehta, Oleksandr Sotnychenko, Srinath Sridhar, Dan Casas, and Christian Theobalt. Real-time hand tracking under occlusion from an egocentric rgb-d sensor. In Proceedings of the IEEE International Conference on Computer Vision, pages 1154–1163, 2017.
[31] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
[32] Markus Oberweger, Paul Wohlhart, and Vincent Lepetit. Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807, 2015.
[33] Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, and Sergey Levine. Sfv: Reinforcement learning of physical skills from videos. ACM Transactions on Graphics (TOG), 37(6):1–14, 2018.
[34] Viraj Prabhu, Arjun Chandrasekaran, Kate Saenko, and Judy Hoffman. Active domain adaptation via clustering uncertainty-weighted embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8505–8514, 2021.
[35] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
[36] Jianzhun Shao, Yuhang Jiang, Gu Wang, Zhigang Li, and Xiangyang Ji. Pfrl: Pose-free reinforcement learning for 6d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11454–11463, 2020.
[37] Megh Shukla. Egl++: Extending expected gradient length to active learning for human pose estimation. arXiv preprint arXiv:2104.09493, 2021.
[38] Megh Shukla and Shuaib Ahmed. A mathematical analysis of learning loss for active learning in regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3320–3328, 2021.
[39] Adrian Spurr, Umar Iqbal, Pavlo Molchanov, Otmar Hilliges, and Jan Kautz. Weakly supervised 3d hand pose estimation via biomechanical constraints. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 211–228. Springer, 2020.
[40] Srinath Sridhar, Anna Maria Feit, Christian Theobalt, and Antti Oulasvirta. Investigating the dexterity of multi-finger input for mid-air text entry. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3643–3652, 2015.
[41] Srinath Sridhar, Franziska Mueller, Antti Oulasvirta, and Christian Theobalt. Fast and robust hand tracking using detection-guided optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3221, 2015.
[42] Srinath Sridhar, Franziska Mueller, Michael Zollhöfer, Dan Casas, Antti Oulasvirta, and Christian Theobalt. Real-time joint tracking of a hand manipulating an object from rgb-d input. In European Conference on Computer Vision, pages 294–310. Springer, 2016.
[43] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5693–5703, 2019.
[44] Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 2602–2611, 2017.
[45] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[46] Danhang Tang, Hyung Jin Chang, Alykhan Tejani, and Tae-Kyun Kim. Latent regression forest: Structured estimation of 3d articulated hand posture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3786–3793, 2014.
[47] Ilya O Tolstikhin, Bharath K Sriperumbudur, and Bernhard Schölkopf. Minimax estimation of maximum mean discrepancy with radial kernels. Advances in Neural Information Processing Systems, 29:1930–1938, 2016.
[48] Jonathan Tompson, Murphy Stein, Yann Lecun, and Ken Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (ToG), 33(5):1–10, 2014.
[49] Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[50] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[51] Chengde Wan, Thomas Probst, Luc Van Gool, and Angela Yao. Self-supervised 3d hand pose estimation through training by fitting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10853–10862, 2019.
[52] Chengde Wan, Thomas Probst, Luc Van Gool, and Angela Yao. Dense 3d regression for hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5147–5156, 2018.
[53] Bastian Wandt, Hanno Ackermann, and Bodo Rosenhahn. A kinematic chain space for monocular motion capture. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[54] Rongchang Xie, Chunyu Wang, Wenjun Zeng, and Yizhou Wang. An empirical study of the collapsing problem in semi-supervised 2d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11240–11249, October 2021.
[55] Fu Xiong, Boshen Zhang, Yang Xiao, Zhiguo Cao, Taidong Yu, Joey Tianyi Zhou, and Junsong Yuan. A2j: Anchor-to-joint regression network for 3d articulated pose estimation from a single depth image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 793–802, 2019.
[56] Tianhan Xu and Wataru Takano. Graph stacked hourglass networks for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16105–16114, 2021.
[57] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5571–5580. PMLR, 2018.
[58] Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 93–102, 2019.
[59] Shanxin Yuan, Guillermo Garcia-Hernando, Björn Stenger, Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee, Pavlo Molchanov, Jan Kautz, Sina Honari, Liuhao Ge, et al. Depth-based 3d hand pose estimation: From current achievements to future goals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2636–2645, 2018.
[60] Shanxin Yuan, Qi Ye, Bjorn Stenger, Siddhant Jain, and Tae-Kyun Kim. Bighand2.2m benchmark: Hand pose dataset and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4866–4874, 2017.
[61] Ho Yub Jung, Soochahn Lee, Yong Seok Heo, and Il Dong Yun. Random tree walk toward instantaneous 3d human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2467–2474, 2015.
[62] Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. arXiv preprint arXiv:2203.00859, 2022.
[63] Yuxiao Zhou, Marc Habermann, Weipeng Xu, Ikhsanul Habibie, Christian Theobalt, and Feng Xu. Monocular real-time hand shape and motion capture using multi-modal data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5346–5355, 2020.
[64] Christian Zimmermann and Thomas Brox. Learning to estimate 3d hand pose from single rgb images. In Proceedings of the IEEE International Conference on Computer Vision, pages 4903–4911, 2017.