Deep Learning-Based Object Pose Estimation
Abstract—Object pose estimation is a fundamental computer vision problem with broad applications in augmented reality and robotics.
Over the past decade, deep learning models, due to their superior accuracy and robustness, have increasingly supplanted conventional
algorithms reliant on engineered point pair features. Nevertheless, several challenges persist in contemporary methods, including their
dependency on labeled training data, model compactness, robustness under challenging conditions, and their ability to generalize to
novel unseen objects. A recent survey discussing the progress made on different aspects of this area, outstanding challenges, and
promising future directions, is missing. To fill this gap, we discuss the recent advances in deep learning-based object pose estimation,
covering all three formulations of the problem, i.e., instance-level, category-level, and unseen object pose estimation. Our survey also
covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks, providing the
readers with a holistic understanding of this field. Additionally, it discusses training paradigms of different domains, inference modes,
application areas, evaluation metrics, and benchmark datasets, as well as reports the performance of current state-of-the-art methods
on these benchmarks, thereby facilitating the readers in selecting the most suitable method for their application. Finally, the survey
identifies key challenges, reviews the prevailing trends along with their pros and cons, and identifies promising directions for future
research. We also continually track the latest works at Awesome-Object-Pose-Estimation.
Index Terms—Object pose estimation, deep learning, comprehensive survey, 3D computer vision.
1 INTRODUCTION
Fig. 2. A taxonomy of this survey. Firstly, we review the datasets and evaluation metrics used to evaluate object pose estimation. Next, we review the deep learning-based methods by dividing them into three categories: instance-level, category-level, and unseen methods. Instance-level methods can be further classified into correspondence-based, template-based, voting-based, and regression-based methods. Category-level methods can be further divided into shape prior-based and shape prior-free methods. Unseen methods can be further classified into CAD model-based and manual reference view-based methods.
intra-class unseen objects without necessitating retraining and employing CAD models during inference. Subsequent category-level methods [23], [24], [25], [26], [27] can be divided into shape prior-based and shape prior-free methods. While improving the generalization ability within a category, these category-level methods still need to collect and label extensive training data for each object category. Moreover, these methods cannot generalize to unseen object categories. To this end, some unseen object pose estimation methods have been recently proposed [1], [3], [28], [29], [30], which can be further classified into CAD model-based and manual reference view-based methods. These methods further enhance the generalization of object pose estimation, i.e., they can be generalized to unseen objects without retraining. Nevertheless, they still need to obtain the object CAD model or annotate a few reference images of the object.

Although significant progress has been made in the area of object pose estimation, several challenges persist in current methods, such as the reliance on labeled training data, difficulty in generalizing to novel unseen objects, model compactness, and robustness in challenging scenarios. To enable readers to swiftly grasp the current state-of-the-art (SOTA) in object pose estimation and facilitate further research in this direction, it is crucial to provide a thorough review of all the relevant problem formulations. A close examination of the existing academic literature reveals a significant gap when reviewing the various problem formulations in object pose estimation. Current prevailing reviews [31], [32], [33], [34], [35] tend to exhibit a narrow focus, either confined to particular input modalities [32], [33] or tethered to specific application domains [34], [35]. Furthermore, these reviews predominantly scrutinize instance-level and category-level methods, thus neglecting the most practical problem formulation in the domain, which is unseen object pose estimation. This hinders readers from gaining a comprehensive understanding of the area. For instance, Fan et al. [33] provided valuable insights into RGB image-based object pose estimation. However, their focus is limited to a single modality, hindering readers from comprehensively understanding methods across various input modalities. Conversely, Du et al. [34] exclusively examined object pose estimation within the context of the robotic grasping task, which limits the readers to understanding object pose estimation only from the perspective of a single specific application.

To address the above problems, we present here a comprehensive survey of recent advancements in deep learning-based methods for object pose estimation. Our survey encompasses all problem formulations, including instance-level, category-level, and unseen object pose estimation, aiming to provide readers with a holistic understanding of this field. Additionally, we discuss different domain training paradigms, application areas, evaluation metrics, and benchmark datasets, as well as report the performance of state-of-the-art methods on these benchmarks, aiding readers in selecting suitable methods for their applications. Furthermore, we also highlight prevailing trends and discuss their strengths and weaknesses, as well as identify key challenges and promising avenues for future research. The taxonomy of this survey is shown in Fig. 2.

Our main contributions and highlights are as follows:
• We present a comprehensive survey of deep learning-based object pose estimation methods. This is the first survey that covers all three problem formulations in the domain, including instance-level, category-level, and unseen object pose estimation.
• Our survey covers popular input data modalities (RGB images, depth images, RGBD images), the different degrees of freedom (3DoF, 6DoF, 9DoF) in output poses, and object properties (rigid, articulated) for the tasks of pose estimation as well as tracking. It is crucial to cover all these aspects in a single survey to give a complete picture to readers, an aspect overlooked by existing surveys which only cover a few of these aspects.
• We discuss different domain training paradigms, inference modes, application areas, evaluation metrics, and benchmark datasets, as well as report the performance of existing SOTA methods on these benchmarks to help readers choose the most appropriate ones for deployment in their application.
• We highlight popular trends in the evolution of object pose estimation techniques over the past decade and discuss their strengths and weaknesses. We also identify key challenges that are still outstanding in object pose estimation along with promising research directions to guide future efforts.

The rest of this article is organized as follows. Sec. 2 reviews the datasets and metrics used to evaluate the three categories of object pose estimation methods. We then review instance-level methods in Sec. 3, category-level methods in Sec. 4, and unseen object pose estimation methods in Sec. 5. In the aforementioned three sections, we also discuss the training paradigms, inference modes, challenges, and
popular trends associated with representative methods in the particular category. Next, Sec. 6 reviews the common applications of object pose estimation. Finally, Sec. 7 summarizes this article and provides an outlook on future research directions based on the challenges in the field.

2 DATASETS AND METRICS

The advancement of deep learning-based object pose estimation is closely linked to the creation and utilization of challenging and trustworthy large-scale datasets. This section introduces commonly used mainstream object pose estimation datasets, categorized into instance-level, category-level, and unseen object pose estimation methods based on problem formulation. The chronological overview is shown in Fig. 3. In addition, we also conduct an overview of the related evaluation metrics.

2.1 Datasets for Instance-Level Methods

Since the BOP Challenge datasets [36] are currently the most popular datasets for the evaluation of instance-level methods, we divide the instance-level datasets into BOP Challenge and other datasets for overview.

2.1.1 BOP Challenge Datasets

Linemod Dataset (LM) [37] comprises 15 RGBD sequences containing annotated RGBD images with ground-truth 6DoF object poses, object CAD models, 2D bounding boxes, and binary masks. Typically, following Brachmann et al. [38], approximately 15% of images from each sequence are allocated for training, with the remaining 85% reserved for testing. These sequences present challenging scenarios with cluttered scenes, texture-less objects, and varying lighting conditions, making accurate object pose estimation difficult.

Linemod Occlusion Dataset (LM-O) [39] is an extension of the LM dataset [37] specifically designed to evaluate the performance in occlusion scenarios. This dataset consists of 1214 RGBD images from the basic sequence in the LM dataset for 8 heavily occluded objects. It is critical for evaluating and improving pose estimation algorithms in complex environments characterized by occlusion.

IC-MI [40] / IC-BIN Dataset [41] contribute to texture-less object pose estimation. IC-MI comprises six objects: 2 texture-less and 4 textured household item models. The IC-BIN dataset is specifically designed to address challenges posed by clutter and occlusion in robot garbage bin picking scenarios. IC-BIN includes 2 objects from IC-MI.

RU-APC Dataset [42] aims to tackle challenges in warehouse picking tasks and provides rich data for evaluating and improving the perception capabilities of robots in a warehouse automation context. The dataset comprises 10,368 registered depth and RGB images, covering 24 types of objects, which are placed in various poses within different boxes on warehouse shelves to simulate diverse experimental conditions.

YCB-Video Dataset (YCB-V) [15] comprises 21 objects distributed across 92 RGBD videos, each video containing 3 to 9 objects from the YCB object dataset [43] (totaling 50 objects). It includes 133,827 frames with a resolution of 640×480, making it well-suited for both object pose estimation and tracking tasks.

T-LESS Dataset [44] is an RGBD dataset designed for texture-less objects commonly found in industrial settings. It includes 30 electrical objects with no obvious texture or distinguishable color properties. In addition, it includes images of varying resolutions. In the training set, images predominantly feature black backgrounds, while the test set showcases diverse backgrounds with varying lighting conditions and occlusions. T-LESS is challenging because of the absence of texture on objects and the intricate environmental settings.

ITODD Dataset [45] includes 28 real-world industrial objects distributed across over 800 scenes with around 3,500 images. This dataset leverages two industrial 3D sensors and three high-resolution grayscale cameras to enable multi-angle observation of the scenes, providing comprehensive and detailed data for industrial object analysis and evaluation.

TYO-L / TUD-L Dataset [36] focuses on different lighting conditions. Specifically, TYO-L provides observations of 3 objects under 8 lighting conditions. These scenes are designed to evaluate the robustness of pose estimation algorithms to lighting variations. Unlike TYO-L, the data collection method of TUD-L involves fixing the camera and manually moving the object, providing a more realistic representation of the object's physical movement.

HB Dataset [46] covers various scenes with changes in occlusion and lighting conditions. It comprises 33 objects, including 17 toys, 8 household items, and 8 industry-related objects, distributed across 13 diverse scenes.

HOPE Dataset [47] is specifically designed for household objects, containing 28 toy grocery objects. The HOPE-Image dataset includes objects from 50 scenes across 10 home/office environments. Each scene includes up to 5 lighting variations, such as backlit and obliquely directed lighting, with shadow-casting effects. Additionally, the HOPE-Video dataset comprises 10 video sequences totaling 2,038 frames, with each scene showcasing between 5 to 20 objects.

2.1.2 Other Datasets

YCBInEOAT Dataset [48] is designed for RGBD-based object pose tracking in robotic manipulation. It contains egocentric RGBD videos of a dual-arm robot manipulating the YCB objects [43]. There are 3 types of manipulation: single-arm pick-and-place, within-arm manipulation, and pick-and-place between arms. This dataset comprises annotations of ground-truth poses across 7,449 frames, encompassing 5 distinct objects depicted in 9 videos.

ClearPose Dataset [49] is designed for transparent objects, which are widely prevalent in daily life and present significant challenges to visual perception and sensing systems due to their indistinct texture features and unreliable depth information. It encompasses over 350K real-world RGBD images and 5M instance annotations across 63 household objects.

MP6D Dataset [50] is an RGBD dataset designed for object pose estimation of metal parts, featuring 20 texture-less metal components. It consists of 20,100 real-world images with object pose labels collected from various scenarios as well as 50K synthetic images, encompassing cluttered and occluded scenes.

2.2 Datasets for Category-Level Methods

In this part, we divide the category-level datasets into rigid and articulated object datasets for elaboration.

2.2.1 Rigid Objects Datasets

CAMERA25 Dataset [22] incorporates 1085 instances across 6 object categories: bowl, bottle, can, camera, mug, and laptop. Notably, the object CAD models in CAMERA25 are sourced from the synthetic ShapeNet dataset [51]. Each image within this dataset contains multiple instances, accompanied by segmentation masks and 9DoF pose labels.

REAL275 Dataset [22] is a real-world dataset comprising 18 videos and approximately 8K RGBD images. The dataset is divided into three subsets: a training set (7 videos), a validation set (5 videos), and a testing set (6 videos). It includes 42 object instances across 6 categories, consistent with those in the CAMERA25 dataset. REAL275 is a prominent real-world dataset extensively used for category-level object pose estimation in academic research.
Fig. 3. Chronological overview of the datasets for object pose estimation evaluation (the timeline spans 2011 to 2024 and groups the datasets into instance-level, category-level, unseen object, and instance-level & unseen object datasets). Notably, the pink arrows represent the BOP Challenge datasets, which can be used to evaluate both instance-level and unseen object methods. The red references represent the datasets of articulated objects. From this, we can also see the development trend in the field of object pose estimation, i.e., from instance-level methods to category-level and unseen methods.
kPAM Dataset [52] is tailored specifically for robotic applications, emphasizing the use of keypoints. Notably, it adopts a methodology involving 3D reconstruction followed by manual keypoint annotation on these reconstructions. With a total of 117 training sequences and 245 testing sequences, kPAM offers a substantial collection of data for training and evaluating algorithms related to robotic perception and manipulation.

TOD Dataset [53] consists of 15 transparent objects categorized into 6 classes, each annotated with pertinent 3D keypoints. It encompasses a vast collection of 48K stereo and RGBD images capturing both transparent and opaque depth variations. The primary focus of the TOD dataset is on transparent 3D object applications, providing essential resources for tasks such as object detection and pose estimation in challenging scenarios involving transparency.

Objectron Dataset [54] contains 15K annotated video clips with over 4M labeled images belonging to the categories of bottles, books, bikes, cameras, chairs, cereal boxes, cups, laptops, and shoes. This dataset is sourced from 10 countries spanning 5 continents, ensuring diverse geographic representation. Due to its extensive content, it is highly advantageous for evaluating RGB-based category-level object pose estimation and tracking methods.

Wild6D Dataset [55] is a substantial real-world dataset used to assess self-supervised category-level object pose estimation methods. It offers annotations exclusively for 486 test videos with diverse backgrounds, showcasing 162 objects across 5 categories (excluding the "can" category found in CAMERA25 and REAL275).

PhoCaL Dataset [56] incorporates both RGBD and RGB-P (Polarisation) modalities. It consists of 60 meticulously crafted 3D models representing household objects, including symmetric, transparent, and reflective items. PhoCaL focuses on 8 specific object categories across 24 sequences, deliberately introducing challenges such as occlusion and clutter.

HouseCat6D Dataset [57] is a comprehensive dataset designed for multi-modal category-level object pose estimation and grasping tasks. The dataset encompasses a wide range of household object categories, featuring 194 high-quality 3D models. It includes objects of varying photometric complexity, such as transparent and reflective items, and spans 41 scenes with diverse viewpoints. The dataset is specifically curated to address challenges in object pose estimation, including occlusions and the absence of markers, making it suitable for evaluating algorithms under real-world conditions.

2.2.2 Articulated Objects Datasets

BMVC Dataset [58] includes 4 articulated objects: laptop, cabinet, cupboard, and toy train. Each object is modeled as a motion chain comprising components and interconnected heads. Joints are constrained to one rotational and one translational DoF. This dataset provides CAD models and accompanying text files detailing the topology of the underlying motion chain structure for each object.

RBO Dataset [59] contains 14 commonly found articulated objects in human environments, with 358 interaction sequences, resulting in a total of 67 minutes of manual manipulation under different experimental conditions, including changes in interaction type, lighting, viewpoint, and background settings.

HOI4D Dataset [60] is pivotal for advancing research in category-level human-object interactions. It comprises 2.4M egocentric RGBD video frames depicting interactions between over 9 participants and 800 object instances. These instances are divided into 16 categories, including 7 rigid and 9 articulated objects.

ReArtMix / ReArtVal Datasets [61] are formulated to tackle the challenge of partial-level multiple articulated object pose estimation featuring unknown kinematic structures. The ReArtMix dataset encompasses over 100,000 RGBD images rendered against diverse background scenes. The ReArtVal dataset consists of 6 real-world desktop scenes comprising over 6,000 RGBD frames.

ContactArt Dataset [62] is generated using a remote operating system [63] to manipulate articulated objects in a simulation environment. This system utilizes smartphones and laptops to precisely annotate poses and contact information. This dataset contains 5 prevalent categories of articulated objects: laptops, drawers, safes, microwaves, and trash cans, for a total of 80 instances. All object models are sourced from the PartNet dataset [64], thus promoting scalability.

2.3 Datasets for Unseen Methods

The current mainstream datasets for evaluating unseen methods are the BOP Challenge datasets, as discussed in Sec. 2.1.1. Besides these BOP Challenge datasets, there are also some datasets designed for evaluating manual reference view-based methods as follows.

MOPED Dataset [65] is a model-free object pose estimation dataset featuring 11 household objects. It includes reference and test images that encompass all views of the objects. Each object in the test sequences is depicted in five distinct environments, with approximately 300 test images per object.

GenMOP Dataset [1] includes 10 objects ranging from flat objects to thin-structure objects. For each object, there are two video sequences collected from various backgrounds and lighting situations. Each video sequence consists of approximately 200 images.

OnePose Dataset [66] comprises over 450 real-world video sequences of 150 objects. These sequences are collected in a variety of background conditions and capture all angles of the objects. Each environment has an average duration of 30 seconds. The dataset is randomly partitioned into training and validation sets.

OnePose-LowTexture Dataset [2] is introduced as a complement to the testing set of the existing OnePose dataset
[66], which predominantly features textured objects. This dataset comprises 40 household objects with low texture. For each object, there are two video sequences: one serving as the reference video and the other for testing. Each video is captured at a resolution of 1920×1440, 30 Frames Per Second (FPS), and approximately 30 seconds in duration.

2.4 Metrics

In this part, we divide the metrics into 3DoF, 6DoF, 9DoF, and other evaluation metrics for a comprehensive overview.

2.4.1 3DoF Evaluation Metrics

The geodesic distance [67] between the ground-truth and predicted 3D rotations is a commonly used 3DoF pose estimation metric. Calculating the angle error between two rotation matrices can visually evaluate their relative deviation. It can be formulated as follows:

$$ d(R_{gt}, R) = \arccos\!\left(\frac{\mathrm{tr}(R_{gt}^{\top} R) - 1}{2}\right) / \pi, \quad (1) $$

where $R_{gt}$ and $R$ denote the ground-truth and predicted 3D rotations, respectively. $\top$ represents matrix transpose. $\mathrm{tr}$ denotes the trace of a matrix, which refers to the sum of the elements on the main diagonal. Typically, 3D rotation estimation accuracy is defined as the percentage of objects whose angle error is below a specific threshold and whose predicted class is correct. It can be expressed as follows:

$$ \mathrm{Acc.} = \begin{cases} 1, & \text{if } d(R_{gt}, R) < \lambda \text{ and } c = c_{gt} \\ 0, & \text{otherwise}, \end{cases} \quad (2) $$

where $c$, $c_{gt}$, and $\lambda$ denote the predicted class, ground-truth class, and predefined threshold, respectively.
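For concreteness, the following minimal NumPy sketch (with illustrative function names, not taken from any particular toolkit) implements the geodesic error of Eq. (1) and the thresholded accuracy check of Eq. (2):

```python
import numpy as np

def geodesic_error(R_gt, R_pred):
    """Normalized geodesic distance between two 3D rotations, Eq. (1)."""
    cos_val = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    cos_val = np.clip(cos_val, -1.0, 1.0)  # guard against numerical round-off
    return np.arccos(cos_val) / np.pi      # in [0, 1]; multiply by 180 for degrees

def rotation_accuracy(R_gt, R_pred, c_gt, c_pred, threshold):
    """Per-object indicator of Eq. (2): correct class and angle error below threshold."""
    return float(c_pred == c_gt and geodesic_error(R_gt, R_pred) < threshold)
```

Averaging this indicator over all test objects yields the reported 3DoF accuracy.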
2.4.2 6DoF Evaluation Metrics

Currently, the BOP metric (BOP-M) [36] is the most popular metric, which is the Average Recall (AR) of the Visible Surface Discrepancy (VSD), Maximum Symmetry-Aware Surface Distance (MSSD), and Maximum Symmetry-Aware Projection Distance (MSPD) metrics. Specifically, the VSD [36] metric treats poses that are indistinguishable in shape as equivalent by only measuring the misalignment of the visible object surface. It can be expressed as follows:

$$ e_{\mathrm{VSD}}\big(\hat{D}, \bar{D}, \hat{V}, \bar{V}, \tau\big) = \mathop{\mathrm{avg}}_{p \in \hat{V} \cup \bar{V}} \begin{cases} 0, & \text{if } p \in \hat{V} \cap \bar{V} \wedge |\hat{D}(p) - \bar{D}(p)| < \tau \\ 1, & \text{otherwise}, \end{cases} \quad (3) $$

where the symbols $\hat{D}$ and $\bar{D}$ represent distance maps generated by rendering the object model $M$ in two different poses: $\hat{P}$ (an estimated pose) and $\bar{P}$ (the ground-truth pose), respectively. In these maps, each pixel $p$ stores the distance from the camera center to a 3D point $x_p$ that projects onto $p$. These distance values are derived from depth maps, which are typical outputs of sensors like Kinect, containing the Z coordinate of $x_p$. These distance maps are compared with the distance map $D_I$ of the test image $I$ to derive visibility masks $\hat{V}$ and $\bar{V}$. These masks identify pixels where the model $M$ is visible in the image $I$. The parameter $\tau$ represents the tolerance for misalignment. In addition, the MSSD [36] metric is a suitable factor for determining the likelihood of successful robotic manipulation and is not significantly affected by object geometry or surface sampling density. It can be formulated as follows:

$$ e_{\mathrm{MSSD}}\big(\hat{P}, \bar{P}, S_M, V_M\big) = \min_{S \in S_M} \max_{x \in V_M} \big\| \hat{P}x - \bar{P}Sx \big\|_2, \quad (4) $$

where the set $S_M$ comprises global symmetry transformations for the object model $M$, while $V_M$ represents the vertices of the model. Furthermore, the MSPD [36] metric is ideal for evaluating RGB-only methods in augmented reality, focusing on perceivable discrepancies and excluding alignment along the optical (Z) axis, which can be represented as follows:

$$ e_{\mathrm{MSPD}}\big(\hat{P}, \bar{P}, S_M, V_M\big) = \min_{S \in S_M} \max_{x \in V_M} \big\| \mathrm{proj}(\hat{P}x) - \mathrm{proj}(\bar{P}Sx) \big\|_2, \quad (5) $$

where the function $\mathrm{proj}(\cdot)$ represents the 2D projection (pixel-level), and the other symbols have the same meanings as in MSSD.
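The MSSD and MSPD of Eqs. (4) and (5) reduce to a min-max over the model vertices once the symmetry set $S_M$ is known; below is a minimal sketch assuming 4×4 pose matrices, an (N, 3) vertex array, and a pinhole intrinsic matrix K (all variable and function names are illustrative):

```python
import numpy as np

def apply_pose(P, verts):
    """Transform (N, 3) model vertices by a 4x4 pose matrix P."""
    return verts @ P[:3, :3].T + P[:3, 3]

def project(K, pts_cam):
    """Pinhole projection of camera-frame points to pixel coordinates."""
    uvw = pts_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def mssd(P_est, P_gt, symmetries, verts):
    """Maximum Symmetry-Aware Surface Distance, Eq. (4).
    symmetries: list of 4x4 symmetry transformations (including the identity)."""
    return min(np.linalg.norm(apply_pose(P_est, verts) - apply_pose(P_gt @ S, verts),
                              axis=1).max() for S in symmetries)

def mspd(P_est, P_gt, symmetries, verts, K):
    """Maximum Symmetry-Aware Projection Distance, Eq. (5)."""
    return min(np.linalg.norm(project(K, apply_pose(P_est, verts)) -
                              project(K, apply_pose(P_gt @ S, verts)), axis=1).max()
               for S in symmetries)
```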
Besides the BOP-M, the average point distance (ADD) [37] and average closest point distance (ADD-S) [37] are also commonly leveraged to evaluate the performance of 6DoF object pose estimation. They can intuitively quantify the geometric error between the estimated and the ground-truth poses by computing the average distance between corresponding points on the object CAD model. Specifically, the ADD metric is designed for asymmetric objects, while the ADD-S metric is designed explicitly for symmetric objects. Given the ground-truth rotation $R_{gt}$ and translation $t_{gt}$, as well as the estimated rotation $R$ and translation $t$, ADD calculates the average pairwise distance between the 3D model points $x \in O$ corresponding to the transformation between the ground truth and estimated pose:

$$ \mathrm{ADD} = \mathop{\mathrm{avg}}_{x \in O} \big\| (R_{gt}x + t_{gt}) - (Rx + t) \big\|. \quad (6) $$

For symmetric objects, the matching between points in certain views is inherently ambiguous. Therefore, the average distance is calculated using the nearest point distance as follows:

$$ \mathrm{ADD\text{-}S} = \mathop{\mathrm{avg}}_{x_1 \in O} \min_{x_2 \in O} \big\| (R_{gt}x_1 + t_{gt}) - (Rx_2 + t) \big\|. \quad (7) $$

Meanwhile, the area under the ADD and the ADD-S curve (AUC) is often leveraged for evaluation. Specifically, if the ADD and ADD-S are smaller than a given threshold, the predicted pose will be considered correct. Moreover, there are many methods [68], [69], [70] that evaluate asymmetric and symmetric objects using ADD and ADD-S, respectively. This metric is termed ADD(S).
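The following is a minimal sketch of Eqs. (6) and (7) and of the ADD(S)-0.1d decision reported in Table 1, with the nearest-neighbor search of ADD-S done via a KD-tree (function names are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def add(model_pts, R_gt, t_gt, R, t):
    """Average distance of corresponding model points, Eq. (6)."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R.T + t
    return np.linalg.norm(gt - pred, axis=1).mean()

def add_s(model_pts, R_gt, t_gt, R, t):
    """Average closest-point distance for symmetric objects, Eq. (7)."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R.T + t
    nn_dist, _ = cKDTree(pred).query(gt, k=1)
    return nn_dist.mean()

def adds_correct(model_pts, diameter, R_gt, t_gt, R, t, symmetric, rel_thresh=0.1):
    """ADD(S) decision: ADD-S for symmetric objects, ADD otherwise, at 10% of the diameter."""
    err = (add_s if symmetric else add)(model_pts, R_gt, t_gt, R, t)
    return err < rel_thresh * diameter
```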
In addition, n° m cm [71] is also currently a prevalent evaluation metric (especially in category-level object pose estimation). It directly quantifies the errors in the predicted 3D rotation and 3D translation. An object pose prediction is deemed correct if its rotation error is below the threshold $n^\circ$ and its translation error is below the threshold $m\,\mathrm{cm}$. It can be defined as an indicator function as follows:

$$ I_{n^\circ\, m\,\mathrm{cm}}(e_R, e_t) = \begin{cases} 1, & \text{if } e_R < n^\circ \text{ and } e_t < m\,\mathrm{cm} \\ 0, & \text{otherwise}, \end{cases} \quad (8) $$

where $e_R$ and $e_t$ represent the rotation and translation errors between the estimated and ground-truth values, respectively.

Furthermore, compared to directly comparing 6DoF poses in 3D space, the simplicity and practicality of the 2D Projection metric [38] make it suitable for evaluation as well. It quantifies the average distance between CAD model points when projected under the estimated object pose and the ground-truth pose. A pose is considered correct if the projected distances are less than 5 pixels.

2.4.3 9DoF Evaluation Metric

IoU3D denotes the Intersection-over-Union (IoU) [22] percentage between the ground-truth and predicted 3D bounding boxes, which can evaluate the 6DoF pose estimation as well as the 3DoF size estimation. It can be expressed as:

$$ \mathrm{IoU_{3D}} = \frac{P_B \cap G_B}{P_B \cup G_B}, \quad (9) $$

where $G_B$ and $P_B$ represent the ground-truth and the predicted 3D bounding boxes, respectively. The symbols $\cap$ and $\cup$ represent the intersection and union, respectively. The correctness of the predicted object pose is determined based on whether the IoU3D value exceeds a predefined threshold.
2.4.4 Other Metric

Since some Normalized Object Coordinate Space (NOCS) shape alignment-based category-level methods reconstruct the 3D object shape before estimating the object pose, the Chamfer Distance (CD) metric [72], which not only captures the global shape deviation but is also sensitive to local shape differences, is commonly leveraged to evaluate the NOCS shape reconstruction accuracy of these methods as follows:

$$ D_{cd} = \sum_{x \in N} \min_{y \in M_{gt}} \| x - y \|_2^2 + \sum_{y \in M_{gt}} \min_{x \in N} \| x - y \|_2^2, \quad (10) $$

where $N$ and $M_{gt}$ represent the reconstructed and ground-truth NOCS shapes, respectively.
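A minimal sketch of the bidirectional Chamfer Distance in Eq. (10), computed with nearest-neighbor queries in both directions (illustrative, not tied to any specific evaluation toolkit):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(recon_pts, gt_pts):
    """Bidirectional Chamfer Distance of Eq. (10) between a reconstructed
    NOCS shape (recon_pts) and the ground-truth shape (gt_pts), both (N, 3)."""
    d_recon_to_gt, _ = cKDTree(gt_pts).query(recon_pts, k=1)
    d_gt_to_recon, _ = cKDTree(recon_pts).query(gt_pts, k=1)
    return np.sum(d_recon_to_gt ** 2) + np.sum(d_gt_to_recon ** 2)
```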
3 INSTANCE-LEVEL OBJECT POSE ESTIMATION

Instance-level object pose estimation describes the task of estimating the pose of the objects that have been seen during the training of the model. We classify existing instance-level methods into four categories: correspondence-based (Sec. 3.1), template-based (Sec. 3.2), voting-based (Sec. 3.3), and regression-based (Sec. 3.4) methods.

3.1 Correspondence-Based Methods

Correspondence-based object pose estimation refers to techniques that involve identifying correspondences between the input data and the given complete object CAD model. Correspondence-based methods can be divided into sparse and dense correspondences. Sparse correspondence-based methods (Sec. 3.1.1) involve detecting object keypoints in the input image or point cloud to establish 2D-3D or 3D-3D correspondences between the input data and the object CAD model, followed by the utilization of the Perspective-n-Point (PnP) algorithm [73] or a least squares method to determine the object pose. Dense correspondence-based methods (Sec. 3.1.2) aim to establish dense 2D-3D or 3D-3D correspondences, ultimately leading to more accurate object pose estimation. For the RGB image, they leverage every pixel or multiple patches to generate pixel-wise correspondences, while for the point cloud, they use the entire point cloud to find point-wise correspondences. The illustration of these two types of methods is shown in Fig. 4. The attributes and performance of some representative methods are shown in Table 1.

3.1.1 Sparse Correspondence Methods

As a representative method, Rad et al. [74] first used a segmentation method to detect the object of interest in an RGB image. Then, they predicted the 2D projections of the object's 3D bounding box corners. Finally, they used the PnP algorithm [73] to estimate the object pose. Additionally, they employed a classifier to determine the pose range in real-time, addressing the issue of ambiguity in symmetric objects. Tekin et al. [75] proposed a CNN network inspired by YOLO [76] to integrate object detection and pose estimation, directly predicting the locations of the projected vertices of the 3D object bounding box. Unlike [74] and [75], Pavlakos et al. [77] predicted the 2D projections of predefined semantic keypoints. Doosti et al. [78] introduced a compact model comprising two adaptive graph convolutional neural networks (GCNNs) [79], collaborating to estimate object and hand poses. To further enhance the robustness of object pose estimation, Song et al. [80] employed a hybrid intermediate representation to convey geometric details in the input image, encompassing keypoints, edge vectors, and symmetry correspondences. Liu et al. [81] proposed a multi-directional feature pyramid network along with a method that calculates object pose estimation confidence by incorporating spatial and plane information. Hu et al. [82] introduced a single-stage hierarchical end-to-end trainable network to address pose estimation challenges associated with scale variations in aerospace objects. In a recent development, Lian et al. [83] increased the number of predefined 3D keypoints to enhance the establishment of correspondences. Moreover, they devised a hierarchical binary encoding approach for localizing keypoints, enabling gradual refinement of correspondences and transforming correspondence regression into a more efficient classification task. To estimate transparent object pose, Chang et al. [84] used a 3D bounding box prediction network and multi-view geometry techniques. The method first detects 2D projections of 3D bounding box vertices, and then reconstructs 3D points based on the multi-view detected 2D projections incorporating camera motion data. Additionally, they introduced a generalized pose definition to address pose ambiguity for symmetric objects. To enhance the efficiency of pose estimation networks, Guo et al. [85] integrated knowledge distillation into object pose estimation by distilling the teacher's distribution of local predictions into the student network. Liu et al. [86] argued that differentiable PnP strategies conflict with the averaging nature of the PnP problem, resulting in gradients that may encourage the network to degrade the accuracy of individual correspondences. To mitigate this, they introduced a linear covariance loss, which can be used for both sparse and dense correspondence-based methods.

To mitigate vulnerability caused by large occlusions, Crivellaro et al. [87] used several control points to represent each object part. Then, they predicted the 2D projections of these control points to calculate the object pose. Some researchers solved the occlusion problem by predicting keypoints using small patches. Oberweger et al. [88] processed each patch separately to generate heatmaps and then aggregated the results to achieve precise and reliable predictions. Additionally, they offered a straightforward but efficient strategy to resolve ambiguities between patches and heatmaps during training. Hu et al. [89] unveiled a segmentation-driven pose estimation framework in which every visible object part offers a local pose prediction through 2D keypoint locations. Furthermore, Huang et al. [90] conceptualized 2D keypoint locations as probabilistic distributions within the loss function and designed a confidence-based network.

Reducing the reliance on annotated real-world data is also an important task. Some methods exploit geometric consistency as additional information to alleviate the need for annotation. Zhao et al. [91] employed image pairs with object annotations and relative transformations between viewpoints to automatically identify objects' 3D keypoints that are geometrically and visually consistent. In addition, Yang et al. [92] used a keypoint consistency regularization for dual-scale images with a labeled 2D bounding box. Using semi-supervised learning, Liu et al. [93] developed a unified framework for estimating 3D hand and object poses. They constructed a joint learning framework that conducts explicit contextual reasoning between hand and object representations. To generate pseudo labels in semi-supervised learning, they utilized the spatial-temporal consistency found in large-scale hand-object videos as a constraint. Synthetic data is also a way to solve the annotation problem. Georgakis et al. [94] reduced the need for expensive 3DoF pose annotations by selecting keypoints and maintaining viewpoint and modality invariance in RGB images and CAD model renderings. Sock et al. [95] utilized self-supervision to minimize the gap between synthetic and real data and enforced photometric consistency across different object views to fine-tune the model. Further, Zhang et al. [96] utilized the invariance of geometry relations between keypoints across real and synthetic
Fig. 4. Illustration of the correspondence-based (Sec. 3.1), template-based (Sec. 3.2), voting-based (Sec. 3.3), and regression-based (Sec. 3.4)
instance-level methods. Correspondence-based methods (Sec. 3.1) involve establishing correspondences between input data and a provided object
CAD model. Template-based methods (Sec. 3.2) involve identifying the most similar template from a set of templates labeled with ground-truth object
poses. Voting-based methods (Sec. 3.3) determine object pose through a pixel-level or point-level voting scheme. Regression-based methods (Sec.
3.4) aim to obtain the object pose directly from the learned features.
domains to accomplish domain adaptation. Thalhammer et al. [97] introduced a specialized feature pyramid network to compute multi-scale features, enabling the simultaneous generation of pose hypotheses across various feature map resolutions.

Overall, the sparse correspondence-based methods can estimate object pose efficiently. However, relying on only a few control points can lead to sub-optimal accuracy.
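To make the sparse-correspondence recipe concrete, the sketch below assumes a (hypothetical) network has already predicted the 2D projections of predefined 3D keypoints for a detected object and recovers the 6DoF pose with OpenCV's PnP-RANSAC solver; it is a simplified illustration of the general pipeline, not of any specific method discussed above:

```python
import numpy as np
import cv2

def pose_from_sparse_correspondences(kpts_3d, kpts_2d, K):
    """Estimate a 6DoF pose from sparse 2D-3D keypoint correspondences.

    kpts_3d: (N, 3) keypoints defined on the object CAD model.
    kpts_2d: (N, 2) their predicted projections in the image.
    K:       (3, 3) camera intrinsic matrix.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        kpts_3d.astype(np.float32), kpts_2d.astype(np.float32), K, None,
        flags=cv2.SOLVEPNP_EPNP, reprojectionError=3.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # axis-angle to 3x3 rotation matrix
    return R, tvec.reshape(3)
```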
3.1.2 Dense Correspondence Methods

Dense correspondence-based methods utilize a significantly larger number of correspondences compared to sparse correspondence-based methods. This enables them to achieve higher accuracy and handle occlusions more effectively. Li et al. [19] argued for the differentiation between rotation and translation, proposing the coordinates-based disentangled pose network. This network separates pose estimation into distinct predictions for rotation and translation. Zakharov et al. [20] introduced the dense multi-class 2D-3D correspondence-based object pose detector and a tailored deep learning-based refinement process. In addition, Cai et al. [131] proposed a technique to automatically identify and match image landmarks consistently across different views, aiming to enhance the process of learning 2D-3D mapping. Wang et al. [132] developed a pose estimation pipeline guided by reconstruction, capitalizing on geometric consistency. Further, Shugurov et al. [99] built upon Zakharov et al. [20] by developing a unified deep network capable of accommodating multiple image modalities (such as RGB and Depth) and integrating a differentiable rendering-based pose refinement method. Su et al. [133] introduced a discrete descriptor realized by hierarchical binary grouping, capable of densely representing the object surface. As a result, this method can predict fine-grained correspondences. Chen et al. [100] introduced a probabilistic PnP [73] layer designed for general end-to-end pose estimation. This layer generates a pose distribution on the SE(3) manifold. On the other hand, Xu et al. [134] argued that encoding pose-sensitive local features and modeling the statistical distribution of inlier poses are crucial for accurate and robust 6DoF pose estimation. Inspired by PPF [11], they exploited pose-sensitive information carried by each pair of oriented points and an ensemble of redundant pose predictions to achieve robust
TABLE 1
Representative instance-level methods. For each method, we report its 10 properties: published year, training input, inference input, pose DoF (3DoF, 6DoF, and 9DoF), object property (rigid, articulated), task (estimation, tracking, and refinement), domain training paradigm (source domain, domain adaptation, and domain generalization), inference mode, application area, and its performance of key metrics on key datasets. Notably, for the input of training and inference, we only focus on the input of the pose estimation model, not the input of the front-end segmentation method (because it can be obtained through RGB as well as through depth or RGBD). D, S, C, T, V, P, and R denote object detection, instance segmentation, correspondence prediction, template matching, voting, pose solution/regression, and pose refinement, respectively. We report the average recall of ADD(S) within 10% of the object diameter (termed ADD(S)-0.1d) on the LM-O and LM datasets, and the AUC of ADD-S (<0.1m) on the YCB-V dataset (Sec. 2).

Columns: Method | Year | Training Input | Inference Input | DoF | Object Property | Task | Training Paradigm | Inference Mode | Application Area | LM-O ADD(S)-0.1d | LM ADD(S)-0.1d | YCB-V AUC of ADD-S

Correspondence-Based Methods (sparse correspondence):
Rad et al. [74] | 2017 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | source | four-stage, S+C+P+R | symmetrical objects | - | 62.7 | -
Tekin et al. [75] | 2018 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | source | two-stage, C+P | general | - | 56.0 | -
Hu et al. [89] | 2019 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | source | two-stage, C+P | occlusion | 27.0 | - | -
Song et al. [80] | 2020 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | source | three-stage, C+P+R | symmetrical objects, occlusion | 47.5 | 91.3 | -
Hu et al. [82] | 2021 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | source | two-stage, C+P | large scale variations | 48.6 | - | -
Chang et al. [84] | 2021 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | source | two-stage, C+P | transparent, symmetrical objects | - | - | -
Guo et al. [85] | 2023 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | source | three-stage, D+C+P | general | 44.5 | - | -

Correspondence-Based Methods (dense correspondence):
Li et al. [19] | 2019 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | source | three-stage, D+C+P | general | - | 89.9 | -
Hodan et al. [98] | 2020 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | source | two-stage, C+P | symmetrical objects | - | - | -
Shugurov et al. [99] | 2021 | RGB/RGBD, CAD Model | RGB/RGBD, CAD Model | 6DoF | rigid | estimation | source | four-stage, D+C+P+R | general | - | 99.9 | -
Chen et al. [100] | 2022 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | source | three-stage, D+C+P | general | - | 95.8 | -
Haugaard et al. [101] | 2022 | RGB/RGBD, CAD Model | RGB/RGBD, CAD Model | 6DoF | rigid | estimation | generalization | four-stage, D+C+P+R | general | - | - | -
Li et al. [102] | 2023 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | source | three-stage, D+C+P | general | 51.4 | 97.8 | -
Xu et al. [103] | 2024 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | refinement | source | two-stage, P+R | occlusion | 60.7 | 97.4 | 85.7

Template-Based Methods (RGB-based):
Sundermeyer et al. [104] | 2018 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | generalization | three-stage, D+T+P | general | - | 31.4 | -

Template-Based Methods (point cloud-based):
Li et al. [70] | 2022 | Depth, CAD Model | Depth, CAD Model | 6DoF | rigid | estimation | source | S+P+R | general | 70.6 | 99.5 | 96.6
Jiang et al. [108] | 2023 | Depth, CAD Model | Depth, CAD Model | 6DoF | rigid | estimation | source | two-stage, S+P | general | - | - | -
Dang et al. [109] | 2024 | Depth, CAD Model | Depth, CAD Model | 6DoF | rigid | estimation | source | two-stage, S+P | general | 52.0 | 69.0 | -

Voting-Based Methods (indirect voting):
Peng et al. [17] | 2019 | RGB | RGB | 6DoF | rigid | estimation | source | two-stage, V+P | occlusion | 40.8 | 86.3 | -
He et al. [18] | 2020 | RGBD, CAD Model | RGBD | 6DoF | rigid | estimation | source | three-stage | general | - | 99.4 | 95.5
Wu et al. [111] | 2022 | RGBD, CAD Model | RGBD, CAD Model | 6DoF | rigid | estimation | source | three-stage, S+V+P | general | 70.2 | 99.4 | 96.6
Zhou et al. [112] | 2023 | RGBD, CAD Model | RGBD, CAD Model | 6DoF | rigid | estimation | source | three-stage, S+V+P | general | 77.7 | 99.8 | 96.7

Voting-Based Methods (direct voting):
Wang et al. [16] | 2019 | RGBD | RGBD | 6DoF | rigid | estimation | source | two-stage, P+R | general | - | 94.3 | 93.1
Tian et al. [113] | 2020 | RGBD, CAD Model | RGBD | 6DoF | rigid | estimation | source | three-stage, S+V+P | general | - | 92.9 | 91.8
Zhou et al. [114] | 2021 | RGBD, CAD Model | RGBD | 6DoF | rigid | estimation | source | two-stage, S+P | general | 65.0 | 99.6 | 95.8
Mo et al. [115] | 2022 | RGBD | RGBD | 6DoF | rigid | estimation | source | two-stage, S+P | general | - | - | 93.6
Hong et al. [116] | 2024 | RGBD | RGBD | 6DoF | rigid | estimation | source | two-stage, S+P | general | 71.1 | 96.7 | 92.7

Regression-Based Methods (geometry-guided):
Chen et al. [117] | 2020 | RGBD | RGBD | 6DoF | rigid | estimation | source | end to end | general | - | 98.7 | 92.4
Hu et al. [118] | 2020 | RGBD, CAD Model | RGB | 6DoF | rigid | estimation | source | end to end | general | 43.3 | - | -
Labbé et al. [119] | 2020 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | estimation | source | three-stage, D+P+R | general | - | - | 93.4
Wang et al. [120] | 2021 | RGBD, CAD Model | RGB | 6DoF | rigid | estimation | source | two-stage, D+P | general | 62.2 | - | 91.6
Di et al. [121] | 2021 | RGBD, CAD Model | RGB | 6DoF | rigid | estimation | source | two-stage, D+P | occlusion | 62.32 | 96.0 | 90.9
Wang et al. [122] | 2021 | RGBD, CAD Model | RGB | 6DoF | rigid | estimation | adaptation | two-stage, D+P | occlusion | 59.8 | 85.6 | 90.5

Regression-Based Methods (direct regression):
Xiang et al. [15] | 2017 | RGB | RGB | 6DoF | rigid | estimation | source | three-stage, S+V+P | cluttered | 24.9 | - | 75.9
Li et al. [123] | 2018 | RGBD | RGBD | 6DoF | rigid | estimation | source | two-stage, P+R | general | - | - | 94.3
Li et al. [124] | 2018 | RGB, CAD Model | RGB, CAD Model | 6DoF | rigid | refinement, tracking | source | end to end | general | 55.5 | 88.6 | 81.9
performance on severe inter-object occlusion and systematic noises in scene point clouds.

Some methods recover object poses by establishing 3D-3D correspondences. Huang et al. [135] used an RGB image to predict 3D object coordinates in the camera frustum, thus establishing 3D-3D correspondences. Further, Jiang et al. [136] introduced a center-based decoupled framework, leveraging bird's eye and front views for object center voting. They utilized feature similarity between the center-aligned object and the object CAD model to establish correspondences for Singular Value Decomposition (SVD)-based [137] rotation estimation. More recently, Lin et al. [138] utilized an RGBD image as input and employed point-to-surface matching to estimate the object surface correspondence. They establish
3D-3D correspondences by iteratively constricting the surface, transitioning it into a correspondence point while progressively eliminating outliers.
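Once 3D-3D correspondences are available, the rigid pose is commonly recovered in closed form with the SVD-based least-squares alignment mentioned above [137]; the following is a minimal, illustrative sketch without outlier handling:

```python
import numpy as np

def svd_rigid_alignment(src, dst):
    """Least-squares rigid alignment of matched 3D points (Kabsch/Umeyama style).

    src: (N, 3) points on the object CAD model.
    dst: (N, 3) corresponding observed points in the camera frame.
    Returns R, t such that dst ~ src @ R.T + t.
    """
    src_centered = src - src.mean(axis=0)
    dst_centered = dst - dst.mean(axis=0)
    U, _, Vt = np.linalg.svd(src_centered.T @ dst_centered)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # prevent reflections
    R = Vt.T @ D @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t
```

In practice, such a closed-form solver is usually wrapped in RANSAC or combined with learned confidence weights to suppress outlier correspondences.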
Some methods put more effort into handling challenging cases, such as symmetric objects [98], [139], [140] and texture-less objects [141]. Park et al. [139] utilized generative adversarial training to reconstruct occluded parts to alleviate the impact of occlusion. They handle symmetric objects by guiding predictions towards the nearest symmetric pose. In addition, Hodan et al. [98] modeled an object using compact surface fragments to handle symmetries in object modeling effectively. For each pixel, the network predicts the likelihood of each object's presence, the probability of the fragments conditional on the object's presence, and the exact 3D translation of each fragment. Finally, the object pose is determined using a robust and efficient version of the PnP-RANSAC algorithm [73]. Further, Wu et al. [140] employed a geometric-aware dense matching network to acquire visible dense correspondences. Additionally, they utilized the distance consistency of these correspondences to mitigate ambiguity in symmetrical objects. For texture-less objects, Wu et al. [141] leveraged information from the object CAD model and established 2D-3D correspondences using a pseudo-Siamese neural network.

With research development, domain adaptation, weak supervision, and self-supervision techniques have been introduced into pose estimation. Li et al. [142] noticed that images with varying levels of realism and semantics exhibit different transferability between synthetic and real domains. Consequently, they decomposed the input image into multi-level semantic representations and merged the strengths of these representations to mitigate the domain gap. Further, Hu et al. [143] introduced a method exclusively trained on synthetic images, which infers the necessary pose correction for refining rough poses. Haugaard et al. [101] utilized learned distributions to sample, score, and refine pose hypotheses. Correspondence distributions are learned using a contrastive loss. This method is unsupervised regarding visual ambiguities. More recently, Li et al. [102] introduced a weakly-supervised reconstruction-based pipeline. Initially, they reconstructed the objects from various viewpoints using an implicit neural representation. Subsequently, they trained a network to predict pixel-wise 2D-3D correspondences. Hai et al. [144] proposed a refinement strategy that uses the geometry constraint in synthetic-to-real image pairs captured from multiple viewpoints.

There are also methods that focus on pose refinement. Lipson et al. [145] iteratively refined pose and correspondences in a tightly coupled manner. They incorporated a differentiable layer to refine the pose by solving the bidirectional depth-augmented PnP problem. In addition, Xu et al. [103] formulated object pose refinement as a non-linear least squares problem using the estimated correspondence field, i.e., the correspondence between the RGB image and the rendered image using the initial pose. The non-linear least squares problem is then solved by a differentiable Levenberg-Marquardt algorithm [146], enabling end-to-end training.

In general, the aforementioned correspondence-based methods exhibit robustness to occlusion since they can utilize local correspondences to predict object pose. However, these methods may encounter challenges when handling objects that lack salient shape features or texture.

3.2 Template-Based Methods

By leveraging global information from the image, template-based methods can effectively address the challenges posed by texture-less objects. Template-based methods involve identifying the most similar template from a set of templates labeled with ground-truth object poses. They can be categorized into RGB-based template (Sec. 3.2.1) and point cloud-based template (Sec. 3.2.2) methods. These two methods are illustrated in Fig. 4. When the input is an RGB image, the templates comprise 2D projections extracted from object CAD models, with annotations of ground-truth poses. This process transforms object pose estimation into image retrieval. Conversely, when dealing with a point cloud, the template comprises the object CAD model with the canonical pose. Notably, we classify the methods that directly regress the relative pose between the object CAD model and the observed point cloud as template-based methods. This is because these methods can be interpreted as seeking the optimal relative pose that aligns the observed point cloud with the template. Consequently, the determined relative pose serves as the object pose. The characteristics and performance of some representative methods are shown in Table 1.

3.2.1 RGB-Based Template Methods

As a seminal contribution, Sundermeyer et al. [104] achieved 3D rotation estimation through a variant of the denoising autoencoder, which learns an implicit representation of object rotation. If depth is available, it can be used for pose refinement. Liu et al. [147] developed a CNN akin to an autoencoder to reconstruct arbitrary scenes featuring the target object and extract the object area. In addition, Zhang et al. [148] utilized an object detector and a keypoint extractor to simplify the template search process. Papaioannidis et al. [105] suggested that estimating object poses in synthetic images is more straightforward. Therefore, they employed a generative adversarial network to convert real images into synthetic ones while preserving the object pose. Li et al. [106] utilized a new pose representation (i.e., 3D location field) to guide an auto-encoder to distill pose-related features, thereby enhancing the handling of pose ambiguity. Stevšič et al. [149] proposed a spatial attention mechanism to identify and utilize spatial details for pose refinement. Different from the above methods, Deng et al. [107] addressed the 6DoF object pose tracking problem within the Rao-Blackwellized particle filtering [150] framework. They finely discretized the rotation space and trained an autoencoder network to build a codebook of feature embeddings for these discretized rotations. This method efficiently estimates the 3D translation along with the full distribution over the 3D rotation.

RGB cameras are widely used as visual sensors, yet they struggle to capture sufficient information under poor lighting conditions. This results in poor pose estimation performance.
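The codebook-style template lookup underlying such RGB-based template methods can be sketched as follows; encode() stands for a hypothetical image encoder (e.g., the bottleneck of an autoencoder), and the templates are renderings of the CAD model with known rotations:

```python
import numpy as np

def build_codebook(encode, templates):
    """templates: iterable of (rendered_image, rotation_matrix) pairs."""
    embs = np.stack([encode(img) for img, _ in templates])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    rots = np.stack([R for _, R in templates])
    return embs, rots

def retrieve_rotation(encode, crop, embs, rots):
    """Return the rotation of the most similar template under cosine similarity."""
    q = encode(crop)
    q = q / np.linalg.norm(q)
    return rots[int(np.argmax(embs @ q))]
```

The retrieved rotation is then typically combined with a separately estimated translation and, when depth is available, refined against the observed geometry.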
3.2.2 Point Cloud-Based Template Methods

With the popularity of consumer-grade 3D cameras, point cloud-based methods take full advantage of their ability to adapt to poor illumination and capture geometric information. Li et al. [70] adopted a feature disentanglement and alignment module to establish part-to-part correspondences between the partial point cloud and the object CAD model, enhancing geometric constraints. Jiang et al. [108] proposed a point cloud registration framework based on the SE(3) diffusion model, which gradually perturbs the optimal rigid transformation of a pair of point clouds by continuously injecting perturbation transformations through the SE(3) forward diffusion process. Then, the SE(3) reverse denoising process is used to gradually denoise, making it closer to the optimal transformation for accurate pose estimation. Dang et al. [109] proposed two key contributions to enhance pose estimation performance on real-world data. First, they introduced a directly supervised loss function that bypasses the
SVD [137] operation, mitigating the sensitivity of SVD-based loss functions to the rotation range between the input partial point cloud and the object CAD model. Second, they devised a match normalization strategy to address disparities in feature distributions between the partial point cloud and the CAD model.

In general, template-based methods leverage global information from the image, enabling them to effectively handle texture-less objects. However, achieving high pose estimation accuracy may lead to increased memory usage by the templates and a rapid rise in computational complexity. Additionally, they may also exhibit poor performance when confronted with occluded objects.

3.3 Voting-Based Methods

Voting-based methods determine object pose through a pixel-level or point-level voting scheme, which can be categorized into two main types: indirect voting and direct voting. Indirect voting methods (Sec. 3.3.1) estimate a set of pre-defined 2D keypoints from the RGB image through pixel-level voting, or a set of pre-defined 3D keypoints from the point cloud via point-level voting. Subsequently, the object pose is determined through 2D-3D or 3D-3D keypoint correspondences between the input image and the CAD model. Direct voting methods (Sec. 3.3.2) directly predict the pose and confidence at the pixel-level or point-level, then select the pose with the highest confidence as the object pose. The illustration of these types of methods is shown in Fig. 4. The attributes and performance of some representative methods are shown in Table 1.
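To illustrate the pixel-level voting idea in its simplest 2D form (loosely in the spirit of pixel-wise vector fields, though not replicating any specific method), the sketch below recovers one 2D keypoint from per-pixel unit vectors by RANSAC-style hypothesis generation and inlier counting:

```python
import numpy as np

def vote_keypoint(pixels, directions, n_hyps=128, inlier_cos=0.99, seed=0):
    """pixels: (N, 2) pixel coordinates inside the object mask.
    directions: (N, 2) unit vectors predicted at each pixel, pointing to the keypoint."""
    rng = np.random.default_rng(seed)
    best_pt, best_votes = None, -1
    for _ in range(n_hyps):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        A = np.stack([directions[i], -directions[j]], axis=1)   # columns d_i, -d_j
        if abs(np.linalg.det(A)) < 1e-6:
            continue                                            # nearly parallel rays
        s, _ = np.linalg.solve(A, pixels[j] - pixels[i])        # intersect the two rays
        hyp = pixels[i] + s * directions[i]
        to_hyp = hyp - pixels
        to_hyp /= np.linalg.norm(to_hyp, axis=1, keepdims=True) + 1e-9
        votes = int(((to_hyp * directions).sum(axis=1) > inlier_cos).sum())
        if votes > best_votes:
            best_pt, best_votes = hyp, votes
    return best_pt
```

Repeating this for every predefined keypoint yields the 2D-3D correspondences that are subsequently passed to a PnP solver.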
3.3.1 Indirect Voting Methods

Some researchers predicted 2D keypoints and then derived the object pose through 2D-3D keypoint correspondences. Liu et al. [151] introduced a continuous representation method called keypoint distance field (KDF), which extracts 2D keypoints by voting on each KDF. Meanwhile, Cao et al. [110] proposed a method called dynamic graph PnP [73] to learn the object pose from 2D-3D correspondence, enabling end-to-end training. Moreover, Liu et al. [152] introduced a bidirectional depth residual fusion network to fuse RGBD information, thereby estimating 2D keypoints precisely. Inspired by the diffusion model, Xu et al. [153] proposed a diffusion-based framework to formulate 2D keypoint detection as a denoising process to establish more accurate 2D-3D correspondences.

Unlike the aforementioned methods that predict 2D keypoints, He et al. [18] proposed a depth Hough voting network to predict 3D keypoints. Subsequently, they estimated the object pose through the Levenberg-Marquardt algorithm [146]. Furthermore, He et al. [21] introduced a bidirectional fusion network to complement RGB and depth heterogeneous data, thereby better predicting the 3D keypoints. To better capture features among object points in 3D space, Mei et al. [154] utilized graph convolutional networks to facilitate feature exchange among points in 3D space, aiming to improve the accuracy of predicting 3D keypoints. Wu et al. [111] proposed a 3D keypoint voting scheme based on cross-spherical surfaces, allowing for generating smaller and more dispersed 3D keypoint sets, thus improving estimation efficiency. To obtain more accurate 3D keypoints, Wang et al. [155] presented an iterative 3D keypoint voting network to refine the initial localization of 3D keypoints. Most recently, Zhou et al. [112] introduced a novel weighted vector 3D keypoints

In response to challenging scenarios such as clutter or occlusion, Peng et al. [17] introduced a pixel-wise voting network to regress pixel-level vectors pointing to 3D keypoints. These vectors create a flexible representation for locating occluded or truncated 3D keypoints. Since most industrial parts are parameterized, Zeng et al. [156] defined 3D keypoints linked to parameters through driven parameters and symmetries. This approach effectively addresses the pose estimation of objects in stacking scenes.

Rather than utilizing a single-view RGBD image as input, Duffhauss et al. [157] employed multi-view RGBD images as input. They extracted visual features from each RGB image, while geometric features were extracted from the object point cloud (generated by fusing all depth images). This multi-view RGBD feature fusion-based method can accurately predict object pose in cluttered scenes.

Some researchers have proposed new training strategies to improve pose estimation performance. Yu et al. [158] developed a differentiable proxy voting loss that simulates hypothesis selection during the voting process, enabling end-to-end training. In addition, Lin et al. [159] proposed a novel learning framework, which utilizes the accurate result of the RGBD-based pose refinement method to supervise the RGB-based pose estimator. To bridge the domain gap between synthetic and real data, Ikeda et al. [160] introduced a method to transfer object style from synthetic to realistic images without manual intervention.

Overall, indirect voting-based methods provide an excellent solution for instance-level object pose estimation. However, the accuracy of pose estimation heavily relies on the quality of the keypoints, which can result in lower robustness.

3.3.2 Direct Voting Methods

The performance of the indirect voting methods heavily depends on the selection of keypoints. Consequently, direct voting methods have been proposed as an alternative solution. Tian et al. [113] uniformly sampled rotation anchors in SO(3). Subsequently, they predicted constraint deviations for each anchor towards the target, using the uncertainty score to select the best prediction. Then, they detected the 3D translation by aggregating point-to-center vectors towards the object center to recover the 6DoF pose. Wang et al. [16] fused RGB and depth features on a per-pixel basis and utilized a pose predictor to generate 6DoF pose and confidence for each pixel. Subsequently, they selected the pose of the pixel with the highest confidence as the final pose. Zhou et al. [161] employed CNNs [162] to extract RGB features, which are then integrated into the point cloud to obtain fused features. Unlike [16], the fused features take the form of point sets rather than feature mappings.
Furthermore, He et al. [21] introduced a bidirectional fusion form of point sets rather than feature mappings.
network to complement RGB and depth heterogeneous data, However, the aforementioned RGBD fusion methods
thereby better predicting the 3D keypoints. To better capture merely concatenate RGB and depth features without delving
features among object points in 3D space, Mei et al. [154] into their intrinsic relationship. Therefore, Zhou et al. [114]
utilized graph convolutional networks to facilitate feature proposed a new multi-modal fusion graph convolutional
exchange among points in 3D space, aiming to improve the network to enhance the fusion of RGB and depth images,
accuracy of predicting 3D keypoints. Wu et al. [111] proposed capturing the inter-modality correlations through local infor-
a 3D keypoint voting scheme based on cross-spherical sur- mation propagation. Liu et al. [163] decoupled scale-related
faces, allowing for generating smaller and more dispersed and scale-invariant information in the depth image to guide
3D keypoint sets, thus improving estimation efficiency. To the network in perceiving the scene’s 3D structure and
obtain more accurate 3D keypoints, Wang et al. [155] pre- provide scene texture for the RGB image feature extraction.
sented an iterative 3D keypoint voting network to refine the Unlike the aforementioned approaches that use still images,
initial localization of 3D keypoints. Most recently, Zhou et Mu et al. [164] proposed a time fusion model integrating
al. [112] introduced a novel weighted vector 3D keypoints temporal motion information from RGBD images for 6DoF
voting algorithm, which adopts a non-iterative global opti- object pose estimation. This method effectively captures ob-
mization strategy to precisely localize 3D keypoints, while ject motion and changes, thereby enhancing pose estimation
also achieving near real-time inference speed. accuracy and stability.
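To make the direct voting scheme above concrete, the following minimal NumPy sketch shows only the final selection step used by dense per-pixel (or per-point) methods of the kind described for Wang et al. [16]: every pixel contributes a pose hypothesis and a confidence, and the most confident hypothesis is kept. All array names, shapes, and the random stand-in predictions are illustrative assumptions; in a real pipeline the hypotheses come from the learned fusion network and the selected pose is usually refined afterwards.

```python
import numpy as np

def select_pose_by_confidence(rotations, translations, confidences):
    """Pick the per-point pose hypothesis with the highest confidence.

    rotations:    (N, 4) unit quaternions, one hypothesis per point/pixel
    translations: (N, 3) translation hypotheses in meters
    confidences:  (N,)   self-reported confidence scores
    Returns the selected (quaternion, translation) pair.
    """
    best = int(np.argmax(confidences))
    return rotations[best], translations[best]

# Toy usage with random stand-in predictions (a real network would supply these).
rng = np.random.default_rng(0)
N = 500
q = rng.normal(size=(N, 4))
q /= np.linalg.norm(q, axis=1, keepdims=True)   # normalize to unit quaternions
t = rng.normal(scale=0.1, size=(N, 3))
c = rng.random(N)
q_best, t_best = select_pose_by_confidence(q, t, c)
print("selected rotation (quaternion):", q_best)
print("selected translation (m):", t_best)
```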
Symmetric objects may have multiple true poses, leading to ambiguity in pose estimation. To address this issue, Mo et al. [115] designed a symmetry-invariant pose distance metric, which enables the network to estimate symmetric objects accurately. Cai et al. [165] introduced a 3D rotation representation to learn the implicit symmetry of objects, eliminating the need for additional prior knowledge about object symmetry.

To reduce dependency on annotated real data, Zeng et al. [166] trained their model solely on a synthetic dataset. Then, they utilized a sim-to-real learning network to improve its generalization ability. During pose estimation, they transformed scene points into a centroid space and obtained the object pose through clustering and voting.

Overall, voting-based methods have demonstrated superior performance in pose estimation tasks. However, the voting process is time-consuming and increases computational complexity [167].

3.4 Regression-Based Methods

Regression-based methods aim to directly obtain the object pose from the learned features. They can be divided into two main types: geometry-guided regression and direct regression. Geometry-guided regression methods (Sec. 3.4.1) leverage geometric information from RGBD images (such as object 3D structural features or 2D-3D geometric constraints) to assist in object pose estimation. Direct regression methods (Sec. 3.4.2) directly regress the object pose, utilizing RGBD image information. The illustration of these two types of methods is shown in Fig. 4. The attributes and performance of some representative methods are shown in Table 1.

3.4.1 Geometry-Guided Regression Methods

Gao et al. [168] employed decoupled networks for rotation and translation regression from the object point cloud. Meanwhile, Chen et al. [117] introduced a rotation residual estimator to estimate the residual between the predicted rotation and the ground truth, enhancing the accuracy of rotation prediction. Lin et al. [169] used a network to extract the geometric features of the object point cloud. Then, they enhanced the pairwise consistency of geometric features by applying spectral convolution on pairwise compatibility graphs. Additionally, Shi et al. [170] learned geometric and contextual features within point cloud blocks. Then, they trained a sub-block network to predict the pose of each point cloud block. Finally, the most reliable block pose is selected as the object pose. To address the challenge of point cloud-based object pose tracking, Liu et al. [171] proposed a shifted point convolution operation between the point clouds of adjacent frames to facilitate local context interaction.

Approaches solely relying on the object point cloud often overlook the object texture details. Therefore, Wen et al. [172] and An et al. [173] leveraged the complementary nature of RGB and depth information. They improved cross-modal fusion strategies by employing attention mechanisms to effectively align and integrate these two heterogeneous data sources, resulting in enhanced performance.

In contrast to the aforementioned methods that directly derive geometric information from the depth image or the object CAD model, many researchers focused more on generating geometric constraints from the RGB image. Hu et al. [118] learned the 2D offset from the CAD model center to the 3D bounding box corners from the RGB image, and then directly regressed the object pose from the 2D-3D correspondence. Di et al. [121] used a shared encoder and two independent decoders to generate 2D-3D correspondence and self-occlusion information, improving the robustness of object pose estimation under occlusion. Further, Wang et al. [120] proposed a Geometry-Guided Direct Regression Network (GDR-Net) to learn the object pose from dense 2D-3D correspondence in an end-to-end manner. Wang et al. [122] introduced noise-augmented student training and differentiable rendering based on GDR-Net [120], enabling robustness to occlusion scenes through self-supervised learning with multiple geometric constraints. Zhang et al. [174] proposed a transformer-based pose estimation approach that consists of a patch-aware feature fusion module and a transformer-based pose refinement module to address the limitation of CNN-based networks in capturing global dependencies. Most recently, Feng et al. [175] decoupled rotation into two sets of corresponding 3D normals. This decoupling strategy significantly improves the rotation accuracy.

Given the labor-intensive nature of real-world data annotation, some methods leverage synthetic data training to generalize to the real world. Gao et al. [176] constructed a lightweight synthetic point cloud generation pipeline and leveraged an enhanced point cloud-based autoencoder to learn latent object pose information to regress the object pose. To improve generalization to real-world scenes, Zhou et al. [177] utilized annotated synthetic data to supervise the network convergence. They proposed a self-supervised pipeline for unannotated real data by minimizing the distance between the CAD model transformed by the predicted pose and the input point cloud. Tan et al. [178] proposed a self-supervised monocular object pose estimation network consisting of teacher and student modules. The teacher module is trained on synthetic data for initial object pose estimation, and the student module predicts the camera pose from the unannotated real image. The student module acquires knowledge of object pose estimation from the teacher module by imposing geometric constraints derived from the camera pose.

Geometry-guided regression methods typically require additional processing steps to extract and handle geometric information, which increases computational costs and complexity.

3.4.2 Direct Regression Methods

Direct regression methods aim to directly recover the object pose from the RGBD image without additional transformation steps, thus reducing complexity. These methods encompass various strategies, including coupled pose output, decoupled pose output, and 3D rotation (3DoF pose) output. Coupled pose involves predicting object rotation and translation together, while decoupled pose involves predicting them separately. Moreover, the 3DoF pose output focuses solely on predicting object rotation without considering translation. These strategies are discussed in detail below.

Coupled Pose: To overcome lighting variations in the environment, Rambach et al. [179] used a pencil filter to normalize the input image into light-invariant representations, and then directly regressed the coupled object pose using a CNN network. Additionally, Kleeberger et al. [180] introduced a robust framework to handle occlusions between objects and estimate the poses of multiple objects in the image. This framework is capable of running in real time at 65 FPS. Sarode et al. [181] introduced a PointNet-based [182] framework to align point clouds for pose estimation, aiming to reduce sensitivity to pose misalignment. Estimating the object pose from a single RGB image introduces an inherent ambiguity problem. Manhardt et al. [126] suggested explicitly addressing these ambiguities. They predicted multiple 6DoF poses for each object to estimate specific pose distributions caused by symmetry and repetitive textures. Inspired by the visible surface difference metric, Bengtson et al. [183] relied on a differentiable renderer and the CAD model to generate multiple weighted poses, avoiding falling into local minima. Moreover, Park et al. [184] proposed a method for
pose estimation based on the local grid in object space. The method locates the grid region of interest on a ray in camera space and transforms the grid into object space via the estimated pose. The transformed grid is a new standard for sampling the mesh and estimating the pose.

For object pose tracking, Garon et al. [185] proposed a real-time tracking method that learns transformation relationships from consecutive frames during training, and used an FCN [186] to obtain the relative pose between two frames for training and inference.

Coupled pose may lead to information coupling between rotation and translation, making it difficult to distinguish their relationship during the optimization process, thus affecting estimation accuracy.

Decoupled Pose: Decoupling the 6DoF object pose enables explicit modeling of the dependencies and independencies between object rotation and translation [15].

In object pose estimation, Xiang et al. [15] estimated the 3D translation by locating the object center in the image and predicting the distance from the object center to the camera. They further estimated the 3D rotation by regressing to a quaternion representation, and introduced a novel loss function to handle symmetric objects better. Meanwhile, Kehl et al. [187] extended the SSD framework [188] to generate 2D bounding boxes, as well as confidence scores for each viewpoint and in-plane rotation. Then, they chose the 2D bounding box through non-maximum suppression, along with the highest-confidence viewpoint and in-plane rotation, to infer the 3D translation, resulting in the full 6DoF object pose. Wu et al. [189], Do et al. [190], and Bukschat et al. [167] used two parallel FCN [186] branches to regress the object rotation and translation independently. To eliminate the dependence on annotations of real data, Wang et al. [68] used synthetic RGB data for fully supervised training, and then leveraged neural rendering for self-supervised learning on unannotated real RGBD data. Moreover, Jiang et al. [69] fused RGBD, built-in 2D pixel coordinate encoding, and depth normal vector features to better estimate the object rotation and translation. Single-view methods suffer from ambiguity; therefore, Li et al. [123] proposed a multi-view fusion framework to reduce the ambiguity inherent in single-view frameworks. Further, Labbé et al. [119] proposed a unified approach for multi-view, multi-object pose estimation. Initially, they utilized a single-view, single-object pose estimation technique to derive pose hypotheses for individual objects. Then, they aligned these object pose hypotheses across multiple input images to collectively infer both the camera viewpoints and the object poses within a unified scene. Most recently, Hsiao et al. [191] introduced a score-based diffusion method to solve the pose ambiguity problem in RGB-based object pose estimation.

For object pose tracking, Wen et al. [48] proposed a data-driven optimization strategy to stabilize 6DoF object pose tracking. Specifically, they predicted the 6DoF pose by predicting the relative pose between adjacent frames. Liu et al. [192] proposed a new subtraction feature fusion module based on [48] to establish sufficient spatiotemporal information interaction between adjacent frames, improving the robustness of object pose tracking in complex scenes. Different from the methods based on RGBD input, Ge et al. [193] designed a novel deep neural network architecture that integrates visual and inertial features to predict the relative object pose between consecutive image frames.

In object pose refinement, Li et al. [124] iteratively refined the pose by aligning the RGB image with the rendered image of the object CAD model. Additionally, they predicted optical flow and foreground masks to stabilize the training procedure. Manhardt et al. [125] refined the 6DoF pose by aligning the object contour between the RGB image and the rendered contour. The rendered contour is obtained from the object CAD model using the initial pose. Hai et al. [129] proposed a shape-constraint recursive matching framework to refine the initial pose. They first computed a pose-induced flow based on the initial and currently estimated poses, and then directly decoupled the 6DoF pose from the pose-induced flow. To address the low running efficiency of pose refinement methods, Iwase et al. [194] introduced a deep texture rendering-based pose refinement method for fast feature extraction using an object CAD model with a learnable texture. Most recently, Li et al. [130] proposed a two-stage method. The first stage performs pose classification and renders the object CAD model in the classified poses. The second stage performs regression to predict fine-grained residuals in the classified poses. This method improves robustness by guiding residual pose regression through pose classification.

3DoF Pose: Some researchers pursue more efficient and practical pose estimation by solely regressing the 3D rotation. Papaioannidis et al. [127] proposed a novel quaternion-based multi-objective loss function, which integrates manifold learning and regression for learning 3DoF pose descriptors. They obtained the 3DoF pose through the regression of the learned descriptors. Liu et al. [128] trained a triplet network based on convolutional neural networks to extract discriminative features from binary images. They incorporated pose-guided methods and regression constraints into the constructed triplet network to adapt the features for the regression task, enhancing robustness. In addition, Josifovski et al. [195] estimated the camera viewpoint related to the object coordinate system by constructing a viewpoint estimation model, thereby obtaining the 3DoF pose appearing in the bounding box.

Overall, direct regression methods simplify the object pose estimation process and further enhance the performance of instance-level methods. However, instance-level methods can only estimate the specific object instances present in the training data, limiting their generalization to unseen objects. Additionally, most instance-level methods require accurate object CAD models, which is a challenge, especially for objects with complex shapes and textures.

4 CATEGORY-LEVEL OBJECT POSE ESTIMATION

Research on category-level methods has garnered significant attention due to their potential for generalizing to unseen objects within established categories [196]. In this section, we review category-level methods by dividing them into shape prior-based (Sec. 4.1) and shape prior-free (Sec. 4.2) methods. The illustration of these two categories is shown in Fig. 5. The characteristics and performance of some representative SOTA methods are shown in Table 2.

4.1 Shape Prior-Based Methods

Shape prior-based methods first learn a neural network using CAD models of intra-class seen objects in an offline mode to derive shape priors, and then utilize them as 3D geometry prior information to guide intra-class unseen object pose estimation. In this part, we divide shape prior-based methods into two categories based on their approach to addressing object pose estimation. The first category is Normalized Object Coordinate Space (NOCS) shape alignment methods (Sec. 4.1.1). They first predict the NOCS shape/map, and then use an offline pose solution method (such as the Umeyama algorithm [197]) to align the object point cloud with the predicted NOCS shape/map to obtain the object pose. The other category is pose regression methods (Sec. 4.1.2). They directly regress the object pose from the feature level, making the pose acquisition process
Fig. 5. Illustration of the shape prior-based (Sec. 4.1) and shape prior-free (Sec. 4.2) category-level methods. The dashed arrows indicate offline
training, which means that we need to train a model offline using the category-level model library to obtain shape priors. (Sec. 4.1): Taking RGBD
input as an example, NOCS shape alignment methods (Sec. 4.1.1) first learn a model to predict the NOCS shape/map of the object, and then align
the object point cloud with the NOCS shape/map through a non-differentiable pose solution method such as the Umeyama algorithm [197] to solve
the object pose. In contrast, direct regress pose methods (Sec. 4.1.2) directly regress the object pose from the extracted input features. On the other
hand, the shape prior-free methods (Sec. 4.2) do not have the process of shape priors regression: Depth-guided geometry-aware methods (Sec.
4.2.1) focus on perceiving the global and local geometric information of the object and leverage these 3D geometric features to estimate the object
pose. Conversely, RGBD-guided semantic and geometry fusion methods (Sec. 4.2.2) regress the object pose by fusing the 2D semantic and 3D
geometric information of the object.
differentiable. The illustration of these two categories is shown in Fig. 5.

4.1.1 NOCS Shape Alignment Methods

As a pioneering work, Tian et al. [72] first extracted the shape prior in an offline mode, which is used to represent the mean shape of a category of objects. For example, mugs are composed of a cylindrical cup body and an arc-shaped handle. Next, they introduced a shape prior deformation network for the intra-class unseen object to reconstruct its NOCS shape. Finally, the Umeyama algorithm [197] is employed to solve the object pose by aligning the NOCS shape and the object point cloud. Following Tian et al. [72], some methods aim to reconstruct the NOCS shape more accurately. Specifically, Wang et al. [198] designed a recurrent reconstruction network to iteratively refine the reconstructed NOCS shape. Further, Chen et al. [23] adjusted the shape prior dynamically by using the structural similarity between the RGBD image and the shape prior. Zou et al. [199] proposed two multi-scale transformer-based networks (Pixelformer and Pointformer) for extracting RGB and point cloud features, and subsequently merging them for shape prior deformation. Different from the previous methods, Fan et al. [223] introduced an adversarial canonical representation reconstruction framework, which includes a reconstructor and a discriminator of the NOCS representation. Specifically, the reconstructor mainly consists of a pose-irrelevant module and a relational reconstruction module to reduce the sensitivity to rotation and translation, as well as to generate high-quality features, respectively. Then, the discriminator is used to guide the reconstructor to generate realistic NOCS representations. Nie et al. [224] improved the accuracy of pose estimation via geometry-informed instance-specific priors and multi-stage shape reconstruction. More recently, Zhou et al. [225] designed a two-stage pipeline consisting of deformation and registration to improve accuracy. Zou et al. [226] introduced a graph-guided point transformer consisting of a graph-guided attention encoder and an iterative non-parametric decoder to further extract the point cloud features. In addition, Li et al. [227] leveraged discrepancies in instance-category structures alongside potential geometric-semantic associations to better investigate intra-class shape information. Yu et al. [228] further divided the NOCS shape reconstruction process into three parts: coarse deformation, fine deformation, and recurrent refinement, to enhance the accuracy of NOCS shape reconstruction.

Given that the annotation of the ground-truth object pose is time-consuming, He et al. [229] explored a self-supervised method via enforcing the geometric consistency between the point cloud and the category prior mesh, avoiding the use of real-world pose annotations. Further, Li et al. [230] first extracted semantic primitives via a part segmentation network, and leveraged the semantic primitives to compute a SIM(3)-invariant shape descriptor to generate the optimized shape. Then, the Umeyama algorithm [197] is utilized to recover the object pose. Through this approach, they achieved domain generalization, bridging the gap between synthesis and real-world application.

Depth images may be unavailable in some challenging scenes (e.g., under strong or low light conditions). Therefore, achieving monocular category-level object pose estimation is of great significance across various applications. Fan et al. [200] directly predicted the object-level depth and NOCS shape from a monocular RGB image by deforming the shape prior, and subsequently leveraged the Umeyama algorithm [197] to solve the object pose. Unlike [200], Wei et al. [201] estimated the 2.5D sketch and separated scale recovery using the shape prior. They then reconstructed the NOCS shape, employing the RANSAC [73] algorithm to remove outliers, before utilizing the PnP algorithm to recover the object pose. For transparent objects, Chen et al. [231] proposed a new solution based on stereo vision, which defines a back-view NOCS map to tackle the problem of image content aliasing.

In general, although these NOCS shape alignment methods can recover the object pose, the alignment process is non-differentiable and is not integrated into the learning process. Thus, errors in predicting the NOCS shape/map have a significant impact on the accuracy of pose estimation.
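Because the Umeyama alignment step recurs throughout this subsection, a minimal NumPy sketch of it is given below: given correspondences between the predicted NOCS shape/map and the observed object point cloud, it returns the similarity transform (scale, rotation, translation) that constitutes the 9DoF pose. The synthetic correspondences at the end are placeholders; actual pipelines feed in the predicted NOCS coordinates and the back-projected depth points, typically wrapped in RANSAC to reject outliers.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Closed-form similarity transform (s, R, t) aligning src -> dst.

    src: (N, 3) points in canonical/NOCS coordinates
    dst: (N, 3) corresponding points in the camera frame
    Minimizes sum ||dst_i - (s * R @ src_i + t)||^2 (Umeyama, 1991).
    """
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / src.shape[0]          # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Sanity check on synthetic correspondences (stand-ins for a predicted NOCS
# shape/map and the back-projected object point cloud).
rng = np.random.default_rng(1)
nocs = rng.uniform(-0.5, 0.5, size=(200, 3))
angle = np.deg2rad(30.0)
R_gt = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                 [np.sin(angle),  np.cos(angle), 0.0],
                 [0.0, 0.0, 1.0]])
s_gt, t_gt = 0.2, np.array([0.1, -0.05, 0.8])
observed = s_gt * nocs @ R_gt.T + t_gt
s_est, R_est, t_est = umeyama_alignment(nocs, observed)
print("scale error:", abs(s_est - s_gt))
print("translation error:", np.linalg.norm(t_est - t_gt))
```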
TABLE 2
Representative category-level methods. For each method, we report its 10 properties, which have the same meanings as described in Table 1. D, S, N, K, and P denote object detection, instance segmentation, NOCS shape/map regression, keypoint detection, and pose solution/regression, respectively. Moreover, we report the 5°5cm metric on the CAMERA25 and REAL275 datasets (Sec. 2).

Method | Published Year | Training Input | Inference Input | Pose DoF | Object Property | Task | Domain Training Paradigm | Inference Mode | Application Area | CAMERA25 5°5cm (mAP) | REAL275 5°5cm (mAP)

Shape Prior-Based Methods — NOCS shape alignment:
Tian et al. [72] | 2020 | RGBD, CAD Model | RGBD | 9DoF | rigid | estimation | source | three-stage, S+N+P | general | 59.0 | 21.4
Wang et al. [198] | 2021 | RGBD, CAD Model | RGBD | 9DoF | rigid | estimation | source | three-stage, S+N+P | general | 76.4 | 34.3
Chen et al. [23] | 2021 | RGBD, CAD Model | RGBD | 9DoF | rigid | estimation | source | three-stage, S+N+P | general | 74.5 | 39.6
Zou et al. [199] | 2022 | RGBD, CAD Model | RGBD | 9DoF | rigid | estimation | source | three-stage, S+N+P | general | 76.7 | 41.9
Fan et al. [200] | 2022 | RGB, CAD Model | RGB | 9DoF | rigid | estimation | source | three-stage, S+N+P | general | - | -
Wei et al. [201] | 2023 | RGB, CAD Model | RGB | 9DoF | rigid | estimation | source | three-stage, S+N+P | general | - | -

Shape Prior-Based Methods — Direct regress pose:
Irshad et al. [202] | 2022 | RGBD, CAD Model | RGBD | 9DoF | rigid | estimation | source | end-to-end | general | 66.2 | 29.1
Lin et al. [203] | 2022 | Depth | Depth | 9DoF | rigid | estimation | generalization | two-stage, S+P | general | 70.9 | 42.3
Zhang et al. [204] | 2022 | Depth, CAD Model | Depth | 9DoF | rigid | estimation | source | two-stage, S+P | general | 75.5 | 44.6
Zhang et al. [205] | 2022 | Depth, CAD Model | Depth | 9DoF | rigid | estimation | source | two-stage, S+P | general | 79.6 | 48.1
Liu et al. [206] | 2022 | Depth | Depth | 9DoF | rigid | refinement, tracking | source | two-stage, S+P | general | 80.3 | 54.4
Lin et al. [24] | 2022 | RGBD, CAD Model | RGBD | 9DoF | rigid | estimation | generalization | two-stage, S+P | general | - | 45.0
Ze et al. [55] | 2022 | RGBD, CAD Model | RGBD | 9DoF | rigid | estimation | adaptation | two-stage, S+P | general | - | 33.9
Liu et al. [207] | 2024 | RGBD, CAD Model | RGBD | 9DoF | rigid | estimation | generalization | two-stage, S+P | general | - | 50.1

Shape Prior-Free Methods — Depth-guided geometry-aware:
Li et al. [208] | 2020 | Depth | Depth | 9DoF | articulated | estimation | generalization | two-stage, S+P | general | - | -
Chen et al. [209] | 2021 | Depth | Depth | 9DoF | rigid | estimation | source | two-stage, D+P | general | - | 28.2
Weng et al. [210] | 2021 | Depth | Depth | 9DoF | rigid, articulated | tracking | source | two-stage, N+P | general | - | 62.2
Di et al. [25] | 2022 | Depth | Depth | 9DoF | rigid | estimation | source | two-stage, S+P | general | 79.1 | 42.9
You et al. [211] | 2022 | Depth | Depth | 9DoF | rigid | estimation | generalization | two-stage, S+P | general | - | 16.9
Zheng et al. [26] | 2023 | Depth | Depth | 9DoF | rigid | estimation | source | two-stage, S+P | general | 80.5 | 55.2
Zhang et al. [212] | 2023 | Depth | Depth | 6DoF | rigid | estimation, tracking | source | two-stage, S+P | general | - | 60.9

Shape Prior-Free Methods — RGBD-guided semantic & geometry fusion:
Wang et al. [22] | 2019 | RGBD | RGBD | 9DoF | rigid | estimation | source | two-stage, (S+N)+P | general | 40.9 | 10.0
Wang et al. [213] | 2020 | RGBD | RGBD | 6DoF | rigid | tracking | source | two-stage, K+P | general | - | 33.3
Lin et al. [214] | 2021 | RGBD | RGBD | 9DoF | rigid | estimation | source | two-stage, S+P | general | 70.7 | 35.9
Wen et al. [215] | 2021 | RGBD | RGBD | 9DoF | rigid | tracking | source | three-stage, S+K+P | general | - | 87.4
Peng et al. [216] | 2022 | RGBD, CAD Model | RGBD | 9DoF | rigid | estimation | adaptation | two-stage, S+P | general | - | 33.4
Lee et al. [217] | 2022 | RGBD, CAD Model | RGBD | 9DoF | rigid | estimation | adaptation | three-stage, S+N+P | general | - | 34.8
Lee et al. [218] | 2023 | RGBD, CAD Model | RGBD | 9DoF | rigid | estimation | generalization | three-stage, S+N+P | general | - | 35.9
Liu et al. [27] | 2023 | RGBD | RGBD | 9DoF | rigid | estimation | source | two-stage, S+P | general | 79.9 | 53.4
Lin et al. [219] | 2023 | RGBD or Depth | RGBD or Depth | 9DoF | rigid | estimation | source | two-stage, S+P | general | 81.4 | 57.6
Chen et al. [220] | 2024 | RGBD | RGBD | 9DoF | rigid | estimation | source | two-stage, S+P | general | - | 63.6

Shape Prior-Free Methods — Others:
Lee et al. [221] | 2021 | RGB, CAD Model | RGB | 9DoF | rigid | estimation | generalization | three-stage, S+N+P | general | - | -
Lin et al. [222] | 2024 | RGBD, Text | RGBD, Text | 9DoF | rigid | estimation | source | two-stage, S+P | general | 82.2 | 58.3
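For reference, the 5°5cm entries above count a prediction as correct when its rotation error is below 5 degrees and its translation error is below 5 cm; the sketch below shows that per-instance check. It is a simplification under stated assumptions: the reported numbers are mean average precision aggregated over instances and categories, and symmetric categories are normally evaluated with a symmetry-aware rotation error, which is omitted here.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between two rotation matrices."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def within_5deg_5cm(R_pred, t_pred, R_gt, t_gt):
    """True if the pose error is below 5 degrees and 5 cm (translations in meters)."""
    return (rotation_error_deg(R_pred, R_gt) <= 5.0 and
            np.linalg.norm(t_pred - t_gt) <= 0.05)

# Toy example: 3 degrees about z and 2 cm of translation error -> counted as correct.
a = np.deg2rad(3.0)
R_gt = np.eye(3)
R_pred = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
print(within_5deg_5cm(R_pred, np.array([0.02, 0.0, 0.0]), R_gt, np.zeros(3)))
```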
4.1.2 Direct Regress Pose Methods

Due to the non-differentiable nature of the NOCS shape alignment process, several direct regression-based pose methods have been proposed recently to enable end-to-end training. Irshad et al. [202] treated object instances as spatial centers and proposed an end-to-end method that combines object detection, reconstruction, and pose estimation. Wang et al. [232] developed a deformable template field to decouple shape and pose deformation, improving the accuracy of shape reconstruction and pose estimation. On the other hand, Zhang et al. [204] proposed a symmetry-aware shape prior deformation method, which integrates the shape prior into a direct pose estimation network. Further, Zhang et al. [205] introduced a geometry-guided residual object bounding box projection framework to address the challenge of insufficient pose-sensitive feature extraction. In order to obtain a more precise object pose, Liu et al. [206] designed CATRE, a pose refinement method based on the alignment of the shape prior and the object point cloud to refine the object pose estimated by the above methods. Zheng et al. [233] extended CATRE [206] to address the geometric variation problem by integrating hybrid scope layers and learnable affine transformations.

Due to the extensive manual effort required for annotating real-world training data, Lin et al. [203] explored the shape alignment of each intra-class unseen instance against its corresponding category-level shape prior, implicitly representing its 3D rotation. This approach facilitates domain generalization from synthetic to real-world scenarios. Further, Ze et al. [55] proposed a novel framework based on pose and shape differentiable rendering to achieve domain-adaptive object pose estimation. In addition, they collected the large Wild6D dataset for category-level object pose estimation in the wild. Following Ze et al. [55], Zhang et al. [234] introduced 2D-3D and 3D-2D geometry correspondences to enhance the ability of domain adaptation. Different from the previous approaches, Remus et al. [235] leveraged instance-level methods for domain-generalized category-level object pose estimation via a single RGB image. Lin et al. [24] proposed a deep prior deformation-based network and leveraged a parallel learning scheme to achieve domain generalization. More recently, Liu et al. [207] designed a multi-hypothesis consistency learning framework. This framework addresses the uncertainty problem and reduces the domain gap between synthetic and real-world datasets by employing multiple feature extraction and fusion techniques.

Overall, while the shape prior-based methods mentioned above significantly improve pose estimation performance,
obtaining the shape priors requires constructing category-level CAD model libraries and subsequently training a network, which is both cumbersome and time-consuming.

4.2 Shape Prior-Free Methods

Shape prior-free methods do not rely on shape priors and thus have better generalization capabilities. These methods can be divided into three main categories: depth-guided geometry-aware (Sec. 4.2.1), RGBD-guided semantic and geometry fusion (Sec. 4.2.2), and other (Sec. 4.2.3) methods. The illustration of the first two categories is shown in Fig. 5.

4.2.1 Depth-Guided Geometry-Aware Methods

Thanks to the rapid development of 3D Graph Convolution (3DGC), Chen et al. [209] leveraged 3DGC and introduced a fast shape-based method, which consists of an RGB-based network to achieve 2D object detection, a shape-based network for 3D segmentation and rotation regression, and a residual-based network for translation and size regression. Inspired by Chen et al. [209], Liu et al. [236] improved the network with a structure encoder and reasoning attention. Further, Di et al. [25] proposed a geometry-guided point-wise voting method that exploits geometric insights to enhance the learning of pose-sensitive features. Specifically, they designed a symmetry-aware point cloud reconstruction network and introduced a point-wise bounding box voting mechanism during training to add additional geometric guidance. Due to the translation- and scale-invariant properties of the 3DGC, these methods are limited in perceiving object translation and size information. Based on this, Zheng et al. [26] further designed a hybrid scope feature extraction layer, which can simultaneously perceive global and local geometric structures and encode size and translation information.

Besides the above 3DGC-based methods, Deng et al. [237] combined a category-level auto-encoder with a particle filter framework to achieve object pose estimation and tracking. Wang et al. [238] leveraged learnable sparse queries as an implicit prior to perform deformation and matching for pose estimation. In addition, Wan et al. [239] developed a semantically-aware object coordinate space to address the semantically incoherent problem of NOCS [22]. More recently, Zhang et al. [212] proposed a score-based diffusion model to address the multi-hypothesis problem in symmetric objects and partial point clouds. They first leveraged the score-based diffusion model to generate multiple pose candidates, and then utilized an energy-based diffusion model to remove abnormal poses. On the other hand, Lin et al. [240] first introduced an instance-adaptive keypoint detection method and then designed a geometry-aware global and local feature aggregation network based on the detected keypoints for pose and size estimation. Li et al. [241] leveraged a category-level method to determine part object poses for assembling multi-part, multi-joint 3D shapes.

To perform pose estimation on articulated objects, Li et al. [208], inspired by Wang et al. [22], introduced a standard representation for different articulated objects within a category by designing an articulation-aware normalized coordinate space hierarchy, which simultaneously constructs a canonical object space and a set of canonical part spaces. Weng et al. [210] further proposed CAPTRA, a unified framework that enables 9DoF pose tracking of rigid and articulated objects simultaneously. Due to the nearly unlimited freedom of garments and extreme self-occlusion, Chi et al. [242] introduced GarmentNets, which conceptualizes deformable object pose estimation as a shape completion problem within a canonical space. More recently, Liu et al. [243] developed a reinforcement learning-based pipeline to predict the 9DoF articulated object pose via fitting joint states through reinforced agent training. Further, Liu et al. [244] learned part-level SE(3)-equivariant features via a pose-aware equivariant point convolution operator to address the issue of self-supervised articulated object pose estimation.

To avoid using extensive real-world labeled data for training, Li et al. [245] used SE(3)-equivariant point cloud networks for self-supervised object pose estimation. You et al. [211] introduced a category-level point pair feature voting method to reduce the impact of the synthetic-to-real domain gap, achieving generalizable object pose estimation in the wild.

In general, these methods fully extract the pose-related geometric features. However, the absence of semantic information limits their performance. Appropriate fusion of semantic and geometric information can significantly improve the robustness of pose estimation.

4.2.2 RGBD-Guided Semantic and Geometry Fusion Methods

As a groundbreaking work, Wang et al. [22] designed a normalized object coordinate space to provide a canonical representation for a category of objects. They first predicted the class label, mask, and NOCS map of the intra-class unseen object. Then, they utilized the Umeyama algorithm [197] to solve the object pose by aligning the NOCS map with the object point cloud. To handle the various shape changes of intra-class objects, Chen et al. [246] learned a canonical shape space as a unified representation. On the other hand, Lin et al. [247] explored the applicability of sparse steerable convolution (SSC) to object pose estimation and proposed an SSC-based pipeline. Further, Lin et al. [214] proposed a dual pose network, which consists of a shared pose encoder and two parallel explicit and implicit pose decoders. The implicit decoder can enforce predicted pose consistency when there are no CAD models during inference. Wang et al. [248] designed an attention-guided network with relation-aware and structure-aware modules for RGB image and point cloud feature fusion. Very recently, Liu et al. [27] explored the necessity of shape priors for shape reconstruction of intra-class unseen objects. They demonstrated that the deformation process is more important than the shape prior and proposed a prior-free implicit space transformation network. Lin et al. [219] addressed the poor rotation estimation accuracy by decoupling the rotation estimation into viewpoint and in-plane rotation. In addition, they also proposed a spherical feature pyramid network based on spatial spherical convolution to process spherical signals. With the rapid development of the Large Vision Model (LVM), Chen et al. [220] further leveraged the LVM DINOv2 [249] to extract SE(3)-consistent semantic features and fused them with object-specific hierarchical geometric features to encapsulate category-level information for rotation estimation.

Since the above methods still require a large amount of real-world annotated training data, their applicability in real-world scenes is limited. To this end, Peng et al. [216] proposed a real-world self-supervised training framework based on a deep implicit shape representation. They leveraged the deep signed distance function [250] as a 3D representation to achieve domain adaptation from synthesis to the real world. In addition, Lee et al. [217] introduced a teacher-student self-supervised learning mechanism. They used supervised training in the source domain and self-supervised training in the target domain, effectively achieving domain adaptation. Recently, Lee et al. [218] further proposed a test-time adaptation framework for domain-generalized category-level object pose estimation. Specifically, they first trained the model using labeled synthetic data and then
leveraged the pre-trained model for test-time adaptation in the real world during inference.

To improve the running speed of object pose estimation methods, once the object pose of the first frame is acquired, continuous spatio-temporal information can be utilized to track the object pose. Wang et al. [213] proposed an anchor-based object pose tracking method. They first detected the anchors of each frame as keypoints, and then solved the relative object pose through the keypoint correspondences. Wen et al. [215] first obtained continuous-frame RGBD masks through the video segmentation network transductive-VOS [251], and then leveraged LF-Net [252] for generalized keypoint detection. Next, they matched keypoints between consecutive frames and performed coarse registration to estimate the initial relative pose. Finally, a memory-augmented pose graph optimization method is proposed for continuous pose tracking.

Overall, these RGBD-guided semantic and geometry fusion methods achieve superior performance. However, if the input depth image contains errors, the accuracy of pose estimation can significantly decrease. Hence, ensuring robustness in pose estimation when dealing with erroneous or missing depth images is crucial.

4.2.3 Others

Since most mobile devices are not equipped with depth cameras, Chen et al. [253] incorporated a neural synthesis module with a gradient-based fitting procedure to simultaneously predict object shape and pose, achieving monocular object pose estimation. Lee et al. [221] estimated the NOCS shape and the metric-scale shape of the object, and performed a similarity transformation between them to solve the object pose and size. Further, Yen-Chen et al. [254] inverted neural radiance fields for monocular category-level pose estimation. Different from the previous methods, Lin et al. [255] proposed a keypoint-based single-stage pipeline via a single RGB image. Guo et al. [256] redefined the monocular category-level object pose estimation problem from a long-horizon visual navigation perspective. On the other hand, Ma et al. [257] enhanced the robustness of the monocular method in occlusion scenes through coarse-to-fine rendering of neural features. Given that transparent instances lack both color and depth information, Zhang et al. [258] proposed to utilize depth completion and surface normal estimation to achieve category-level pose estimation for transparent instances.

In order to improve the running efficiency of the monocular method, Lin et al. [259] developed a keypoint-based monocular object pose tracking approach. This approach demonstrates the significance of integrating uncertainty estimation using a tracklet-conditioned deep network and probabilistic filtering. Following Lin et al. [259], Yu et al. [260] further improved the pose tracking accuracy through a network that combines convolutions and transformers.

To further improve the generalization of category-level methods, Goodwin et al. [261] introduced a reference image-based zero-shot approach, which first extracts spatial feature descriptors and builds cyclical descriptor distances. Then, they established the top-k semantic correspondences for pose estimation. Zaccaria et al. [262] proposed a self-supervised framework via optical flow consistency. Very recently, Cai et al. [263] developed an open-vocabulary framework that aims to generalize to unseen categories using textual prompts in unseen scene images. Felice et al. [264] explored zero-shot novel view synthesis based on a diffusion model for 3D object reconstruction, and recovered the object pose through correspondences. Lin et al. [222] used a pre-trained vision-language model to make full use of rich semantic knowledge and align the representations of the three modalities (image, point cloud, and text) in the feature space through multi-modal contrastive learning.

On the whole, these shape prior-free methods circumvent the reliance on shape priors and further improve the generalization ability of category-level object pose estimation methods. Nevertheless, these methods are limited to generalizing within intra-class unseen objects. For objects of different categories, the training data need to be collected and the models need to be retrained, which remains a significant limitation.

5 UNSEEN OBJECT POSE ESTIMATION

Unseen object pose estimation methods can generalize to unseen objects without the need for retraining. Point Pair Features (PPF) [11] is a classical method for unseen object pose estimation that utilizes oriented point pair features to build a global model description and a fast voting scheme to match locally. The final pose is solved by pose clustering and iterative closest point [137] refinement. However, PPF suffers from low accuracy and slow runtime, limiting its applicability. In contrast, deep learning-based methods leverage neural networks to learn more complex features from data without specifically designed feature engineering, thus enhancing accuracy and efficiency. In this section, we review deep learning-based unseen object pose estimation methods and classify them into CAD model-based (Sec. 5.1) and manual reference view-based (Sec. 5.2) methods. The illustration of these two categories of methods is shown in Fig. 6.

5.1 CAD Model-Based Methods

CAD model-based methods involve utilizing the object CAD model as prior knowledge during the process of estimating the pose of an unseen object. These methods can be further categorized into feature matching-based and template matching-based methods. Feature matching-based methods (Sec. 5.1.1) focus on designing a network to match features between the CAD model and the query image, establishing 2D-3D or 3D-3D correspondences, and solving the pose with the PnP algorithm or the least squares method. Template matching-based methods (Sec. 5.1.2) utilize rendered templates from the CAD model for retrieval. The initial pose is acquired based on the most similar template, and further refinement is necessary using a refiner to obtain a more accurate pose. The illustration of these two categories of methods is shown in Fig. 6. The characteristics and performance of some representative methods are shown in Table 3.

5.1.1 Feature Matching-Based Methods

As an early exploratory work, Pitteri et al. [265] proposed a 3DoF pose estimation approach that approximates the object's geometry using only the corner points of the CAD model. Nonetheless, it only works effectively on objects having specific corners. Hence, Pitteri et al. [266] further introduced an embedding that captures the local geometry of 3D points on the object surface. Matching these embeddings can create 2D-3D correspondences, and the pose is then determined using the PnP+RANSAC [73] algorithm. However, these methods only estimate the 3DoF pose.

Gou et al. [267] defined the challenge of estimating the 6DoF pose of unseen objects, offering a baseline solution through the identification of 3D correspondences between object and scene point clouds. Similarly, Hagelskjær et al. [268] trained a network to match keypoints from the CAD model to the object point cloud. Yet, it focuses on bin picking with homogeneous bins, which only demonstrates that generalized pose estimation can achieve outstanding performance in restricted scenarios.
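As a minimal illustration of the correspondence-based pose solution mentioned above (the PnP+RANSAC step applied after 2D-3D matching), the sketch below uses OpenCV's generic solver on synthetic correspondences. The matching network that would produce the correspondences is not shown, and the intrinsics, thresholds, and point counts are illustrative assumptions rather than values from any particular method.

```python
import cv2
import numpy as np

def pose_from_2d3d(model_points, image_points, K):
    """Recover a 6DoF pose from 2D-3D correspondences with PnP + RANSAC.

    model_points: (N, 3) 3D points on the CAD model (object frame, meters)
    image_points: (N, 2) matched pixel coordinates in the query image
    K:            (3, 3) camera intrinsic matrix
    Returns (R, t) mapping object coordinates to the camera frame.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        model_points.astype(np.float64),
        image_points.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        reprojectionError=3.0,   # pixel threshold for inliers (illustrative)
        iterationsCount=200,
    )
    if not ok:
        raise RuntimeError("PnP failed to find a consistent pose")
    R, _ = cv2.Rodrigues(rvec)   # axis-angle -> rotation matrix
    return R, tvec.reshape(3)

# Toy usage: project synthetic model points with a known pose, then recover it.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
rng = np.random.default_rng(2)
pts3d = rng.uniform(-0.05, 0.05, size=(50, 3))
R_gt, _ = cv2.Rodrigues(np.array([0.2, -0.1, 0.3]))
t_gt = np.array([0.0, 0.0, 0.6])
cam = pts3d @ R_gt.T + t_gt
uv = cam @ K.T
uv = uv[:, :2] / uv[:, 2:3]
R, t = pose_from_2d3d(pts3d, uv, K)
print("translation error (m):", np.linalg.norm(t - t_gt))
```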
Fig. 6. Illustration of the CAD model-based (Sec. 5.1) and manual reference view-based (Sec. 5.2) methods for unseen object pose estimation. (Sec.
5.1): The feature matching-based methods (Sec. 5.1.1) focus on designing a network to match features between the CAD model and the query image,
establishing correspondences (2D-3D or 3D-3D), and solving the pose using the PnP algorithm or least squares method. The template matching-
based methods (Sec. 5.1.2) utilize rendered templates from the CAD model for retrieval. The initial pose is acquired based on the most similar
template, and further refinement is necessary using a refiner to obtain a more accurate pose. (Sec. 5.2): There are two types of feature matching-
based methods (Sec. 5.2.1). One involves extracting features from reference views and the query image, obtaining 3D-3D correspondences through
a feature matching network. The other initially reconstructs the 3D object representation using the reference views and establishes the 2D-3D
correspondences between the query image and the 3D representation. The object pose is solved using correspondence-based algorithms, like PnP
or the least squares method. Template matching-based methods (Sec. 5.2.2) also have two types. One reconstructs the 3D object representation
using the reference views and then renders multiple templates. The initial pose is acquired by retrieving the most similar template and then refining it
to get the final pose. The other directly uses the reference views as the templates for template matching.
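The sketch below illustrates only the retrieval step shared by the template matching-based pipelines in Fig. 6: descriptors of the query crop and of the pose-annotated templates are compared, and the pose of the most similar template serves as the coarse estimate handed to a refiner. The descriptor dimensionality, template count, and all names are illustrative assumptions; real systems extract the descriptors with a pre-trained backbone and often match patch-level rather than global features.

```python
import numpy as np

def retrieve_template(query_desc, template_descs):
    """Return the index and score of the template most similar to the query.

    query_desc:     (D,)   global descriptor of the query crop
    template_descs: (M, D) descriptors of M templates with known poses
    Uses cosine similarity as the matching score.
    """
    q = query_desc / (np.linalg.norm(query_desc) + 1e-12)
    T = template_descs / (np.linalg.norm(template_descs, axis=1, keepdims=True) + 1e-12)
    scores = T @ q
    best = int(np.argmax(scores))
    return best, float(scores[best])

# Toy usage: 720 hypothetical templates (e.g., rendered viewpoints) with random
# stand-in descriptors; the retrieved template's stored pose would serve as the
# coarse pose passed to a refiner.
rng = np.random.default_rng(3)
template_descs = rng.normal(size=(720, 256))
template_poses = [f"pose_{i}" for i in range(720)]               # placeholder pose records
query_desc = template_descs[42] + 0.05 * rng.normal(size=256)    # query close to template 42
idx, score = retrieve_template(query_desc, template_descs)
print("retrieved:", template_poses[idx], "similarity:", round(score, 3))
```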
Inspired by point cloud registration methods on unseen objects, Zhao et al. [269] proposed a geometry correspondence-based method using generic and object-agnostic geometry features to establish unambiguous and robust 3D-3D correspondences. Nevertheless, it still needs to obtain the class label and segmentation mask of unseen objects through other methods such as Mask-RCNN [270]. To this end, Chen et al. [271] explored a framework named ZeroPose, which realizes joint instance segmentation and pose estimation of unseen objects. Specifically, they utilized the foundation model SAM [272] to generate possible object proposals and adopted a template matching method to accomplish instance segmentation. After that, they developed a hierarchical geometric feature matching network based on GeoTransformer [273] to establish correspondences. Following ZeroPose, Lin et al. [30] devised a novel matching score in terms of semantics, appearance, and geometry to obtain better segmentation. As for pose estimation, they proposed a two-stage partial-to-partial point matching model to construct dense 3D-3D correspondences effectively.

Besides these methods employing geometry features, Caraffa et al. [274] devised a method that fuses visual and geometric features extracted from different pre-trained models to enhance pose prediction stability and accuracy. It is the first technique to estimate the unseen object pose by utilizing the synergy between geometric and vision foundation models. Additionally, Huang et al. [275] proposed a method for object pose prediction from RGBD images by combining 2D texture and 3D geometric cues.

To sum up, feature matching-based methods aim to extract generic object-agnostic features and achieve strong correspondences by matching these features. However, these methods require not only robust feature matching models but also tailored designs to enhance the representation of object features, presenting a significant challenge.

5.1.2 Template Matching-Based Methods

Template matching has been widely used in computer vision and stands as an effective solution for tackling the pose estimation challenges posed by unseen objects. Wohlhart et al. [67] and Balntas et al. [284] were pioneers in using deep pose descriptors for object matching and pose retrieval. However, their descriptors are tailored to specific orientations and categories, limiting their utility to objects with similar appearances. In contrast, Sundermeyer et al. [280] proposed a single-encoder-multi-decoder network for jointly estimating the 3D rotation of multiple objects. This approach eliminates the need to segregate views of different objects in the latent space and enables the sharing of common features in the encoder. Yet, it still requires training multiple decoders. Wen et al. [285] addressed this problem by decoupling object shape and pose in the latent representation, enabling auto-encoding without the necessity of multi-path decoders for different objects, thus enhancing scalability.

Instead of training a network to learn features across objects, Okorn et al. [278] first generated candidate poses by PPF [11] and projected each into the scene. Later, they designed a scoring network to evaluate each hypothesis by comparing color and geometry differences between the projected object point cloud and the RGBD image. Busam et al. [286] reformulated 6DoF pose retrieval as an action decision process and determined the final pose by iteratively estimating probable movements. Cai et al. [287] retrieved various candidate viewpoints from a target object viewpoint codebook, and then conducted in-plane 2D rotational regression on each retrieved viewpoint to obtain a set of 3D rotation estimates. These estimates were evaluated using a consistency score to generate the final rotation prediction. Meanwhile, Shugurov et al. [277] matched the detected objects with a rendering database for initial viewpoint estimation. Then, they predicted dense 2D-2D correspondences between the template and the image via feature matching. Pose estimation was eventually performed by using PnP+RANSAC [73] or Kabsch [288]+RANSAC.

Since estimating the full 6DoF pose of an unseen object is extremely challenging, some works focus on estimating the 3D rotation to simplify it. Different from the previous works [67], [104], [284], [289] that exploited a global image representation to measure image similarity, Nguyen et al.
TABLE 3
Representative CAD model-based methods. Since the domain training paradigm of most unseen methods is domain generalization, we report 9 properties for each method, which have the same meanings as described in the caption of Table 1. D, S, F, T, P, R, and V denote object detection, instance segmentation, feature matching to build correspondences, template matching to retrieve the pose, pose solution/regression, pose refinement, and pose voting, respectively. We report the BOP-M across the LM-O and YCB-V datasets (Sec. 2) for various methods. Notably, these methods use Mask-RCNN [270] (normal font), CNOS [276] or their own proposed methods [271] [30] [277] (bold), and a combination of PPF and SIFT [278] (italics) for unseen object localization, respectively. Moreover, Örnek et al. [279] and Caraffa et al. [274] do not require any task-specific training; we use "×" to denote this.

Method | Published Year | Training Input | Inference Input | Pose DoF | Object Property | Task | Inference Mode | Application Area | LM-O BOP-M | YCB-V BOP-M

Feature matching:
Pitteri et al. [266] | 2020 | RGB, CAD Model | RGB, CAD Model | 3DoF | rigid | estimation | three-stage, S+F+P | general | - | -
Zhao et al. [269] | 2023 | Depth, … | Depth, … | 6DoF | rigid | estimation | three-stage, … | general | 65.2 | -
[281] used CNN-extracted local features to compare the similarity between the input image and templates, showing stronger occlusion robustness than global representations. Another noteworthy approach is an image retrieval framework based on multi-scale local similarities developed by Zhao et al. [290]. They extracted feature maps of various sizes from the input image and devised a similarity fusion module to robustly predict image similarity scores from multi-scale pairwise feature maps. Further, Thalhammer et al. [291] and Ausserlechner et al. [292] extended the scheme of Nguyen et al. [281] and demonstrated that a pre-trained Vision Transformer (ViT) [293] outperforms a task-specific fine-tuned CNN [162] for template matching. However, these methods still show a noticeable performance gap between seen and unseen objects. To this end, Wang et al. [283] introduced diffusion features, which show great potential in modeling unseen objects. Furthermore, they designed three aggregation networks to efficiently capture and aggregate diffusion features at different granularities, thus improving generalizability.

In order to further improve the generalization and robustness of 6DoF pose estimation, Labbé et al. [28] used a render-and-compare approach and a coarse-to-fine strategy. Notably, they leveraged a large-scale 3D model dataset to generate a synthetic dataset containing 2 million images and over 20,000 models, and achieved strong generalization by training the network on this dataset. Compared to the non-differentiable rendering pipeline of Labbé et al. [28], Tremblay et al. [294] utilized recent advancements in differentiable rendering to design a flexible refiner, allowing the setup to be fine-tuned without retraining. On the other hand, Moon et al. [282] presented a shape-constraint recurrent flow framework, which predicts the optical flow between the template and query image and refines the pose iteratively, exploiting shape information directly to improve accuracy and scalability. Recently, Wen et al. [3] inherited the idea of Labbé et al. [28] and developed a novel synthetic data generation pipeline using emerging large-scale 3D model databases, Large Language Models (LLMs), and diffusion models. This greatly expanded the amount and diversity of training data and ultimately achieved results comparable to instance-level methods in a render-and-compare manner.

It is well known that template matching methods are sensitive to occlusions and require considerable time to match numerous templates. Therefore, Nguyen et al. [29] achieved rapid and robust pose estimation by finding a suitable trade-off between template matching and patch correspondences. In particular, the features of the query image and templates are extracted using a ViT [293], followed by fast template matching with a sub-linear nearest neighbor search. The most similar template provides two DoFs for azimuth and elevation, while the remaining four DoFs are obtained by constructing correspondences between the query image and this template. Örnek et al. [279] utilized DINOv2 [249] to extract descriptors for the query image and templates. Moreover, they introduced a fast template retrieval method based on visual words constructed from DINOv2 patch descriptors, thereby decreasing the reliance on extensive data and enhancing matching speed compared to Labbé et al. [28].

In summary, template matching-based methods make full use of the advantages provided by a multitude of templates, enabling high accuracy and strong generalization. Nonetheless, they have limitations in terms of time consumption, sensitivity to occlusions, and the challenges posed by complex backgrounds and lighting variations.
Whether feature matching-based or template matching-based, the above methods all require a CAD model of the target object to provide prior information. In practice, accurate CAD models often require specialized hardware to build, which limits the practical applicability of these methods to a certain extent.
TABLE 4
Representative manual reference view-based methods. For each method, we report its 9 properties, which have the same meanings as described in the caption of Table 1. D, S, F, T, P, R, and V have the same meanings as in Table 3. We report the average recall of ADD(S) within 10% of the object diameter, termed ADD(S)-0.1d (Sec. 2). Notably, "YOLOv5" and "GT" denote the use of YOLOv5 [295] and of the ground-truth bounding box/segmentation mask for object localization, respectively. For a fair comparison, we also report the number of reference views used; "Full" represents all views.
Manual Reference View-Based Methods

Methods | Published Year | Training Input | Inference Input | Pose DoF | Object Property | Task | Inference Mode | Application Area | LM ADD(S)-0.1d

Feature matching:
He et al. [296] | 2022 | RGBD | RGBD | 6DoF | rigid | estimation | three-stage, D+F+P | general | 83.4 (GT+16)
Sun et al. [66] | 2022 | RGB | RGB | 6DoF | rigid | estimation | three-stage, D+F+P | general | 63.6 (YOLOv5+Full)
He et al. [2] | 2022 | RGB | RGB | 6DoF | rigid | estimation | three-stage, D+F+P | general | 76.9 (YOLOv5+Full)
Castro et al. [297] | 2023 | RGB | RGB | 6DoF | rigid | estimation | three-stage, D+F+P | general | 87.5 (GT+Full)
Lee et al. [298] | 2024 | RGB | RGB | 6DoF | rigid | estimation | three-stage, D+F+P | general | 78.4 (YOLOv5+64)

Template matching:
Park et al. [65] | 2020 | RGBD | RGBD | 6DoF | rigid | estimation | three-stage, S+T+R | general | 87.1 (GT+16)
Nguyen et al. [299] | … | three-stage, …
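Table 4 reports ADD(S)-0.1d, i.e., the recall of poses whose ADD (or ADD-S for symmetric objects) error falls below 10% of the object diameter (Sec. 2). A minimal sketch of the per-instance computation, assuming the ground-truth and estimated poses are given as rotation matrices and translation vectors and the object model is available as a sampled point cloud; the dense pairwise ADD-S variant shown here is O(N^2), and in practice a KD-tree is typically used instead:

```python
import numpy as np

def add_metric(R_gt, t_gt, R_est, t_est, model_pts, diameter, symmetric=False):
    """ADD(-S) error and the ADD(S)-0.1d success test for one object instance.

    model_pts : (N, 3) points sampled on the object model.
    diameter  : object diameter; the pose counts as correct if the mean
                distance is below 10% of it (the '0.1d' threshold).
    """
    pts_gt = model_pts @ R_gt.T + t_gt     # model points under the ground-truth pose
    pts_est = model_pts @ R_est.T + t_est  # model points under the estimated pose
    if symmetric:
        # ADD-S: for each ground-truth-posed point, distance to the closest estimated-posed point
        dists = np.linalg.norm(pts_gt[:, None, :] - pts_est[None, :, :], axis=-1)
        err = dists.min(axis=1).mean()
    else:
        # ADD: distance between corresponding points
        err = np.linalg.norm(pts_gt - pts_est, axis=-1).mean()
    return err, err < 0.1 * diameter
```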
5.2 Manual Reference View-Based Methods

Aside from these CAD model-based approaches, there are manual reference view-based methods that do not require the CAD model of the unseen object as a prior condition, but instead require a set of manually labeled reference views of the target object. Similar to CAD model-based methods, these methods can also be categorized into two types: feature matching-based (Sec. 5.2.1) and template matching-based (Sec. 5.2.2) methods. These two categories are illustrated in Fig. 6. The attributes and performance of some representative methods are shown in Table 4.

5.2.1 Feature Matching-Based Methods

Different from CAD model-based feature matching methods, manual reference view-based feature matching methods primarily establish 3D-3D correspondences between the RGBD query image and the RGBD reference images, or 2D-3D correspondences between the query image and a sparse point cloud reconstructed from the reference views. Subsequently, the object pose is solved from these correspondences. He et al. [296] proposed the first few-shot 6DoF object pose estimation method, which can estimate the pose of an unseen object from a few support views without extra training. Specifically, they designed a dense RGBD prototype matching framework based on transformers to fully explore the semantic and geometric relationship between the query image and the reference views. Corsetti et al. [302] used a textual prompt for object segmentation and reformulated the problem as relative pose estimation between two scenes, with the relative pose obtained via point cloud registration.

Some methods took an alternative route, matching after reconstruction. Wu et al. [303] developed a global registration-based method that used the reference and query images to reconstruct full-view and single-view models, and then searched for point matches between the two models. Sun et al. [66] drew inspiration from visual localization and revised that pipeline for pose estimation. More precisely, they reconstructed a Structure from Motion (SfM) model of the unseen object using RGB sequences from all reference viewpoints, and then matched 2D keypoints in the query image with the 3D points of the SfM model using a graph attention network. Nevertheless, it performed poorly on low-textured objects because of its reliance on repeatably detected keypoints. To deal with this problem, He et al. [2] designed a new keypoint-free SfM method that reconstructs semi-dense point cloud models of low-textured objects based on the detector-free feature matching method LoFTR [304]. Castro et al. [297] pointed out that these pre-trained feature matching models [304], [305] fail to capture optimal descriptions for pose estimation; based on this, they redesigned the training pipeline around a three-view system for one-shot object-to-image matching.

The aforementioned works still require dense support views (i.e., ≥ 32 views). To address this problem, Fan et al. [306] turned the 6DoF object pose estimation task into relative pose estimation between the retrieved object in the target view and the reference view. Given only one reference view, they achieved this by using the DINOv2 model [249] for global matching and the LoFTR model [304] for local matching. Note that this method cannot estimate absolute translation (or object scale), as this is an ill-posed problem when only two views are considered. Beyond that, Lee et al. [298] applied a powerful pre-training technique tailored for 3D vision [307] and demonstrated that geometry-oriented visual pre-training yields better generalization with fewer reference views.

Generally, due to the lack of prior geometric information from CAD models, manual reference view-based feature matching methods often require special designs to extract the geometric features of unseen objects. The number of required reference views also constrains the practical application of such approaches to a certain extent.
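Whichever way the correspondences are obtained, the pose-solving step of the pipelines in Sec. 5.2.1 is standard: PnP (typically with RANSAC) for 2D-3D matches, and a rigid Kabsch/Umeyama alignment for 3D-3D matches. A minimal sketch under the assumption that the correspondences and the camera intrinsics K are already available; OpenCV is used only as one convenient PnP implementation, not as the solver of any particular cited method:

```python
import cv2
import numpy as np

def pose_from_2d3d(pts_3d, pts_2d, K):
    """6DoF pose from 2D-3D correspondences via PnP + RANSAC (query keypoints vs. SfM points)."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float64), pts_2d.astype(np.float64), K, None)
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return R, tvec.reshape(3), inliers

def pose_from_3d3d(src_pts, dst_pts):
    """Rigid transform aligning src to dst from 3D-3D correspondences (Kabsch/Umeyama, no scale)."""
    src_c, dst_c = src_pts.mean(0), dst_pts.mean(0)
    H = (src_pts - src_c).T @ (dst_pts - dst_c)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t
```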
5.2.2 Template Matching-Based Methods

Template matching-based methods mainly adopt a retrieve-then-refine strategy. There are two types: the first reconstructs a 3D object representation from the reference views, renders multiple templates from this 3D representation, and employs a similarity network to compare the query image with each template to obtain an initial pose, which a refiner then improves; the second directly uses the reference views as templates, requiring plenty of views for retrieval and relying more heavily on a refiner for accuracy. Park et al. [65] introduced a novel framework for pose estimation of unseen objects without a CAD model. They reconstructed 3D object representations from a few reference views and estimated translation using mask bounding boxes and the corresponding depth values. The initial rotation was determined by sampling angles and refined with gradient updates via a render-and-compare approach. By training the network to render and reconstruct diverse 3D shapes, it achieved excellent generalization performance on unseen objects.
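The render-and-compare step used by Park et al. [65] (and, with different renderers and comparison networks, by Labbé et al. [28] and Wen et al. [3]) amounts to scoring rendered pose hypotheses against the observed crop and keeping the best one for refinement. A hedged sketch of the coarse scoring loop, where render and score are placeholders for a renderer and a learned similarity network rather than components of any specific cited system:

```python
import numpy as np

def coarse_pose_by_render_and_compare(obs_crop, rotations, t_init, render, score):
    """Pick the rotation hypothesis whose rendering best matches the observed crop.

    rotations : iterable of 3x3 rotation matrices sampled over SO(3) (e.g., a viewpoint grid)
    t_init    : rough translation, e.g., from the mask bounding box and median depth
    render    : render(R, t) -> image of the object under pose (R, t)   [placeholder]
    score     : score(rendered, observed) -> similarity scalar           [placeholder]
    """
    best_R, best_s = None, -np.inf
    for R in rotations:
        s = score(render(R, t_init), obs_crop)
        if s > best_s:
            best_R, best_s = R, s
    return best_R, t_init  # passed on to an iterative refiner in the cited methods
```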
Unlike Park et al. [65], which used the strategy of render-and-compare after reconstruction, Liu et al. [1] designed a pipeline of detection, retrieval, and refinement. They first designed a detector to identify object bounding boxes in the target view. Next, they compared the query and reference images at the pixel level to acquire the initial pose based on the similarity score. The pose was then refined using a feature volume and multiple 3D convolution layers. However, object-centered reference images from cluttered scenes are constrained by actual segmentation or bounding box cropping, limiting real-world applicability. To overcome this limitation, Gao et al. [300] proposed adaptive segmentation modules to learn distinguishable representations of unseen objects, and Zhao et al. [308] leveraged distributed reference kernels and a translation estimator to achieve multi-scale correlation computation and object translation parameter prediction, thus robustly learning the translation prior of unseen objects.

To further enhance the robustness of translation estimation for object detection, Pan et al. [309] modified the framework of Liu et al. [1]. Precisely, they utilized a pre-trained ViT [293] to learn robust feature representations and adopted a top-K pose proposal scheme for pose initialization. Additionally, they applied a coarse-to-fine cascaded refinement process incorporating feature pyramids and adaptive discrete pose hypotheses. Besides [309], Cai et al. [301] revisited the pipeline of Liu et al. [1]. They proposed a generic joint segmentation method and an efficient 3D Gaussian Splatting-based refiner, improving the performance and robustness of object localization and pose estimation.

In unseen object tracking, Nguyen et al. [299] proposed the first method that extends to unseen categories without requiring 3D information or extra reference images, given the ground-truth object pose in the first frame. Their transformer-based architecture outputs continuous relative object poses between consecutive frames, which are combined with the initial object pose to provide the object pose for each frame. Wen et al. [310] used a collaborative design of concurrent tracking and neural object fields to perform 6DoF tracking from RGBD sequences. Its key aspects include online pose graph optimization, concurrent neural object fields for 3D shape and appearance reconstruction, and a memory pool facilitating communication between the two processes.
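Tracking formulations such as those of Nguyen et al. [299] and Wen et al. [310] ultimately chain predicted inter-frame relative poses onto the pose of the first frame. A minimal sketch of that composition, assuming poses are 4x4 homogeneous transforms expressed in the camera frame and predict_relative_pose is a placeholder for the learned inter-frame regressor:

```python
import numpy as np

def track_object(frames, T_init, predict_relative_pose):
    """Chain per-frame relative poses onto the initial object pose.

    frames                : sequence of images (or RGBD frames)
    T_init                : 4x4 object pose in the first frame (e.g., ground truth)
    predict_relative_pose : callable(prev_frame, cur_frame) -> 4x4 relative transform
                            (placeholder for the learned inter-frame regressor)
    """
    poses = [T_init]
    for prev_frame, cur_frame in zip(frames[:-1], frames[1:]):
        delta = predict_relative_pose(prev_frame, cur_frame)
        # Composition order depends on the chosen convention; here the relative
        # transform is assumed to map the previous camera-frame pose to the current one.
        poses.append(delta @ poses[-1])
    return poses
```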
More recently, Nguyen et al. [311] reconsidered template matching from the perspective of generating new views. Given a single reference view, they trained a model to directly predict discriminative embeddings of novel viewpoints of the object. In contrast, Wen et al. [3] applied an object-centric neural field representation for object modeling and RGBD rendering.

In summary, similar to template matching-based methods using CAD models, manual reference view-based methods also rely on massive numbers of templates. Moreover, due to the limited reference views, these methods need to generate new templates or employ additional strategies to optimize the initial pose obtained through template matching.

6 APPLICATIONS

With the advancement of object pose estimation technology, several applications leveraging this progress have been deployed. In this section, we elaborate on the development trends of these applications. Specifically, they include robotic manipulation (Sec. 6.1), Augmented Reality (AR)/Virtual Reality (VR) (Sec. 6.2), aerospace (Sec. 6.3), hand-object interaction (Sec. 6.4), and autonomous driving (Sec. 6.5). A chronological overview is shown in Fig. 7.

6.1 Robotic Manipulation

We categorize robotic manipulation applications into instance-level, category-level, and unseen objects. This classification helps in better understanding the challenges and requirements across the different levels.

6.1.1 Instance-Level Manipulation

To tackle the challenge of annotating real data during training, many works utilize synthetic data for training, as it is easy to acquire and annotate [312], [313]. At the same time, synthetic data can simulate various scenes and environmental changes, thus helping to improve the adaptability of robotic manipulation. Li et al. [314] used a large-scale synthetic dataset and a small-scale weakly labeled real-world dataset to reduce the difficulty of system deployment. Additionally, Chen et al. [315] proposed an iterative self-training framework, using a teacher network trained on synthetic data to generate pseudo-labels for real data. Meanwhile, Fu et al. [316] trained only on synthetic images generated by physically based rendering. One of the critical challenges of synthetic data is bridging the gap with reality, which Tremblay et al. [317] addressed by combining domain randomization with real data.

Handling stacked occlusion scenes is another significant challenge, especially in industrial automation and logistics. In these scenarios, robots must accurately identify and localize objects stacked on each other, which requires effective processing of occluded objects and accurate pose estimation. Dong et al. [318] argued that the regressed poses of points from the same object should reside tightly in pose space; therefore, these points can be clustered into different instances and their corresponding object poses estimated simultaneously. This method can handle severe object occlusion. Moreover, Zhuang et al. [319] established an end-to-end pipeline to synchronously regress all potential object poses from an unsegmented point cloud. Most recently, Wada et al. [320] proposed a system that fully utilizes identified accurate object CAD models and non-parametric reconstruction of unrecognized structures to estimate the poses of occluded objects in real time.

Low-textured objects lack surface texture information, making robotic manipulation challenging. Therefore, Zhang et al. [321] proposed a pose estimation method for texture-less industrial parts; since poor surface texture and brightness make it challenging to compute discriminative local appearance descriptors, this method achieves more accurate results by optimizing the pose in the edge image. In addition, Chang et al. [322] carried out transparent object grasping by estimating the object pose with a proposed model-free method that relies on multi-view geometry. In agricultural scenes, Kim et al. [323] constructed an automated data collection scheme based on a 3D simulator environment to achieve three-level ripeness classification and pose estimation of target fruits.

6.1.2 Category-Level Manipulation

To investigate the application of category-level object pose estimation to robotic manipulation, Liu et al. [324] introduced a fine segmentation-guided category-level method with difference-aware shape deformation for robotic grasping. Yu et al. [325] proposed a shape prior-based approach and explored its application to robotic grasping. Further, Liu et al. [4] developed a robotic continuous grasping system with a pre-defined vector orientation-based grasping strategy, based on shape transformer-guided object pose estimation. To improve efficiency and enable pose estimation to be applied to tasks with higher real-time requirements, Sun et al. [326] utilized inter-frame consistent keypoints to perform object pose tracking for aerial manipulation.
[Figure 7 timeline entries: Efficient Track (Pandey et al.), PPR-Net (Dong et al.), SpacePose (F. Proença et al.), CA-SpaceNet (Wang et al.), Sim-to-Real Pose (Chen et al.), Gen6D (Liu et al.), ADPose (Hoque et al.), Seg6D (Liu et al.), CatDeform (Yu et al.), DGPF6D (Liu et al.), DMAR (Su et al.), MoreFusion (Wada et al.), Ghostpose (Chang et al.), Zephyr (Okorn et al.), Ick-Track (Sun et al.), MegaPose (Labbé et al.), AttentionVote (Zhuang et al.), HFL-Net (Lin et al.), FoundationPose (Wen et al.)]
Fig. 7. Chronological overview of some representative applications of object pose estimation methods. The black, red, and orange references represent applications of instance-level, category-level, and unseen methods, respectively. From this, we can also see the development trend, i.e., from instance-level methods to category-level and unseen methods.
To further avoid manual data annotation in real-world scenes, Yu et al. [327] built a robotic grasping platform and designed a self-supervised method for category-level robotic grasping. More recently, Liu et al. [5] explored a contrastive learning-guided, prior-free object pose estimation method for domain-generalized robotic picking.

6.1.3 Unseen Object Manipulation

Since unseen object pose estimation is an emerging research area, there is currently a lack of designs specialized for robotics. Here, we report several methods that validate the effectiveness of unseen object pose estimation through robotic manipulation. Okorn et al. [278] introduced a method for zero-shot object pose estimation in clutter. By scoring pose hypotheses and choosing the highest-scoring pose, they successfully grasped a novel drill object with a robotic arm. Labbé et al. [28] and Wen et al. [3] adopted the render-and-compare strategy and trained their networks on a large-scale synthetic dataset, resulting in outstanding generalization. They further verified the effectiveness of their methods through robotic grasping experiments.

6.2 Augmented Reality/Virtual Reality

Object pose estimation has various specific applications in AR and VR. In AR, accurate pose estimation allows for a precise overlay of virtual objects onto the real world. The key to VR technology lies in tracking the pose of the head-mounted display and controllers in 3D space.

Su et al. [328] combined two CNN architectures [162] into a network consisting of a state estimation branch and a pose estimation branch, explicitly trained on synthetic images, to achieve AR assembly applications. Pandey et al. [329] introduced a method for automatically annotating the pose of handheld objects in camera space, addressing the efficient 6DoF pose tracking problem for handheld controllers from the perspective of egocentric cameras. Liu et al. [1] presented a generalizable model-free 6DoF object pose estimator that realizes the complete object detection and pose estimation process. By simply capturing reference images of an unseen object and retrieving the poses of the reference images, this method can predict the object pose in arbitrary query images and can easily be applied to daily objects for AR/VR applications. He et al. [2] adopted a matching-after-reconstruction strategy, which establishes correspondences between the query image and the point cloud reconstructed from the reference views. This method does not rely on keypoint matching and allows for AR applications even on low-textured objects. Wen et al. [3] achieved strong generality by employing large-scale comprehensive training and an innovative transformer-based architecture. This method has been successfully applied in various domains, including AR and robotic manipulation.

6.3 Aerospace

Estimating object pose in space presents unique challenges not commonly encountered in terrestrial environments. One of the most significant differences is the lack of atmospheric scattering, which complicates lighting conditions and makes objects invisible over long distances. In-orbit proximity operations for space rendezvous, docking, and debris removal require precise pose estimation under diverse lighting conditions and against high-texture backgrounds. Proença et al. [330] proposed URSO, a simulator developed on Unreal Engine 4 [331] for generating annotated images of spacecraft orbiting Earth, which can serve as valuable data for aerospace applications. Hu et al. [82] proposed an encoder-decoder architecture that reliably handles large scale changes under challenging conditions, enhancing robustness. Wang et al. [332] introduced a counterfactual analysis framework to achieve robust pose estimation of spaceborne targets in complex backgrounds. Ulmer et al. [333] generated multiple pose hypotheses for objects and introduced a pixel-level posterior formulation to estimate the probability of each hypothesis. This approach can handle extreme visual conditions, including overexposure, high contrast, and low signal-to-noise ratio.

6.4 Hand-Object Interaction

When humans or robots interact with the physical world, they primarily do so through their hands. Therefore, accurately understanding how hands interact with objects is crucial. Hand-object interaction methods often rely on the object CAD model, yet obtaining such a model in daily-life scenes is challenging. Patten et al. [334] reconstructed high-quality object CAD models to mitigate this reliance in hand-object interaction. To further enhance hand-object interaction, Lin et al. [6] utilized an effective attention model to improve the representation capability of hand and object features, thereby improving the accuracy of hand and object pose estimation. However, this method makes limited use of the underlying geometric structures, leading to an increased reliance on visual features; performance may degrade when objects lack visual features or when these features are occluded. Therefore, Rezazadeh et al. [7] introduced a hierarchical graph neural network architecture combined with multimodal (visual and tactile) data to compensate for visual deficiencies and improve robustness. Moreover, Qi et al. [335] introduced a hand-object pose estimation network guided by Signed Distance Fields (SDF),
which jointly leverages the SDFs of both the hand and the object to provide a complete global implicit representation. This method helps guide the pose estimation of hands and objects in occlusion scenarios.

6.5 Autonomous Driving

Object pose estimation can be used to perceive surrounding objects such as vehicles, pedestrians, and obstacles, aiding the autonomous driving system in making timely decisions. To address the pose estimation problem in autonomous driving, Hoque et al. [336] proposed a 6DoF pose hypothesis approach based on a deep hybrid structure composed of CNNs [162] and RNNs [337]. More recently, Sun et al. [338] designed an effective keypoint selection algorithm that takes into account the shape information of panel objects in robot cabin inspection scenes, addressing the challenge of 6DoF pose estimation for highly variable panel objects.

7 CONCLUSION AND FUTURE DIRECTION

In this survey, we have provided a systematic overview of the latest deep learning-based object pose estimation methods, covering a comprehensive classification, a comparison of their strengths and weaknesses, and an exploration of their applications. Despite the great success, many challenges still exist, as discussed in Sec. 3, Sec. 4, and Sec. 5. Based on these challenges, we further point out some promising future directions aimed at advancing research in object pose estimation.

From the perspective of label-efficient learning, prevailing methodologies predominantly rely on real-world labeled datasets for training. Nevertheless, the labor-intensive nature of manually collecting and annotating training data is widely acknowledged. Hence, we advocate for the exploration of label-efficient learning techniques for object pose estimation, which can be pursued through the following avenues: 1) LLMs/LVMs-guided weak/self-supervised learning methods. With the rapid advancements in pre-trained LLMs/LVMs, their versatile application to various scenarios in an unsupervised manner has become feasible. Leveraging LLMs/LVMs as prior knowledge holds promise for exploring weakly or self-supervised learning techniques in object pose estimation. 2) Synthetic-to-real domain adaptation and generalization methods. Due to the high costs associated with acquiring real-world training data through manual effort, synthetic data generation offers a cost-effective alternative. We believe that by exploring domain adaptation and generalization techniques from synthetic to real-world domains, we can mitigate domain gaps and generalize models trained on synthetic data to real-world applications.

In terms of applications, facilitating the deployment of object pose estimation methods on mobile devices and robots is crucial. We argue that the deployability of existing methods can be enhanced through the following approaches: 1) End-to-end methods integrating detection or segmentation. Current SOTA approaches typically require initial object detection or segmentation with a pre-trained model before the image is fed into a pose estimation model (indirect pose estimation models even need non-differentiable PnP or Umeyama algorithms to solve the pose), which complicates deployment. Future research can enhance deployability on mobile devices and robots by exploring end-to-end object pose estimation methods that seamlessly integrate detection or segmentation. 2) Single RGB image-based methods. Given that most mobile devices (such as smartphones and tablets) lack depth cameras, achieving high-precision estimation of unseen object poses from a single RGB image is crucial. Due to the inherent geometric limitations of 2D images, future research can explore LVM-based monocular depth estimation methods to enhance the accuracy of monocular object pose estimation by incorporating scene-level depth information. 3) Model lightweighting. Existing SOTA models often have large parameter sizes and inefficient runtime performance, which presents challenges for deployment on mobile devices and robots with limited computational resources. Future work can explore effective lightweighting methods, such as teacher-student models, to reduce the model parameter count (GPU memory) and improve running efficiency.

Existing methods are predominantly designed for common objects and scenes, rendering them ineffective for challenging objects and scenes. We believe that their applicability can be enhanced through the following avenues: 1) Articulated object pose estimation. Articulated objects (such as clothing and drawers) exhibit multiple DoF and significant self-occlusion compared to rigid objects, making pose estimation challenging. Achieving high-precision pose estimation for articulated objects is an important research problem that remains to be addressed. 2) Transparent object pose estimation. The simultaneous absence of texture, color, and depth information poses a significant challenge for estimating the pose of transparent objects. Future research could focus on enhancing the geometric information of transparent objects through depth augmentation or completion techniques, thereby improving the accuracy of pose estimation. 3) Robust methods for handling occlusion. Occlusion is the most common challenge. Currently, no object pose estimation method can effectively handle severe occlusion, which leads to an incomplete representation of texture and geometric features and introduces uncertainty into the pose estimation model. Hence, improving the model's ability to cope with severe occlusion is crucial for enhancing robustness.

From the aspect of problem formulation, recent instance-level methods have achieved high precision but exhibit poor generalization. Category-level methods demonstrate good generalization to intra-class unseen objects but fail to generalize to unseen object categories. Unseen object pose estimation methods have the potential to generalize to any unseen object, yet they still rely on object CAD models or reference views. The following paths can be explored from the problem-formulation perspective to further enhance the generalization of object pose estimation: 1) Few-shot learning-based category-level methods for unseen categories. Since category-level methods need to re-obtain a large amount of annotated training data for unseen object categories, their generalization is severely limited. Therefore, future research could focus on leveraging few-shot learning to enable the rapid generalization of category-level methods to unseen object categories. 2) CAD model-free and sparse manual reference view-based unseen object pose estimation. While current unseen object pose estimation methods do not require retraining for unseen objects, they still rely on either CAD models or extensive annotated reference views of unseen objects, both of which require manual acquisition. To this end, exploring CAD model-free and sparse manual reference view-based unseen object pose estimation methods is crucial. 3) Open-vocabulary strong generalization methods. Given the broad applicability of object pose estimation in human-machine interaction scenarios, future research could leverage open vocabulary provided by humans as prompts to enhance generalization to unseen objects and scenes.
[74] M. Rad and V. Lepetit, “Bb8: A scalable, accurate, robust to partial [116] J.-X. Hong and H.-B. Zhang, “A transformer-based multi-modal
occlusion method for predicting the 3d poses of challenging objects fusion network for 6d pose estimation,” Information Fusion, 2024.
without using depth,” in ICCV, 2017. [117] W. Chen and X. Jia, “G2l-net: Global to local network for real-time
[75] B. Tekin and S. N. Sinha, “Real-time seamless single shot 6d object 6d pose estimation with embedding vector features,” in CVPR,
pose prediction,” in CVPR, 2018. 2020.
[76] J. Redmon and S. Divvala, “You only look once: Unified, real-time [118] Y. Hu and P. Fua, “Single-stage 6d object pose estimation,” in
object detection,” in CVPR, 2016. CVPR, 2020.
[77] G. Pavlakos and X. Zhou, “6-dof object pose from semantic key- [119] Y. Labbé and J. Carpentier, “Cosypose: Consistent multi-view
points,” in ICRA, 2017. multi-object 6d pose estimation,” in ECCV, 2020.
[78] B. Doosti and S. Naha, “Hope-net: A graph-based model for hand- [120] G. Wang and F. Manhardt, “Gdr-net: Geometry-guided direct re-
object pose estimation,” in CVPR, 2020. gression network for monocular 6d object pose estimation,” in
[79] T. N. Kipf and M. Welling, “Semi-supervised classification with CVPR, 2021.
graph convolutional networks,” arXiv preprint arXiv:1609.02907, [121] Y. Di and F. Manhardt, “So-pose: Exploiting self-occlusion for di-
2016. rect 6d pose estimation,” in ICCV, 2021.
[80] C. Song and J. Song, “Hybridpose: 6d object pose estimation under [122] G. Wang and F. Manhardt, “Occlusion-aware self-supervised
hybrid representations,” in CVPR, 2020. monocular 6d object pose estimation,” IEEE TPAMI, 2021.
[81] P. Liu and Q. Zhang, “Mfpn-6d : Real-time one-stage pose estima- [123] C. Li and J. Bai, “A unified framework for multi-view multi-class
tion of objects on rgb images,” in ICRA, 2021. object pose estimation,” in ECCV, 2018.
[82] Y. Hu and S. Speierer, “Wide-depth-range 6d object pose estima- [124] Y. Li and G. Wang, “Deepim: Deep iterative matching for 6d pose
tion in space,” in CVPR, 2021. estimation,” in ECCV, 2018.
[83] R. Lian and H. Ling, “Checkerpose: Progressive dense keypoint [125] F. Manhardt and W. Kehl, “Deep model-based 6d pose refinement
localization for object pose estimation with graph neural network,” in rgb,” in ECCV, 2018.
in ICCV, 2023. [126] F. Manhardt and D. M. Arroyo, “Explaining the ambiguity of object
[84] J. Chang and M. Kim, “Ghostpose: Multi-view pose estimation of detection and 6d pose from visual data,” in ICCV, 2019.
transparent objects for robot hand grasping,” in IROS, 2021. [127] C. Papaioannidis and I. Pitas, “3d object pose estimation using
[85] S. Guo and Y. Hu, “Knowledge distillation for 6d pose estimation multi-objective quaternion learning,” IEEE TCSVT, 2019.
by aligning distributions of local predictions,” in CVPR, 2023. [128] Y. Liu and L. Zhou, “Regression-based three-dimensional pose
[86] F. Liu and Y. Hu, “Linear-covariance loss for end-to-end learning estimation for texture-less objects,” IEEE TMM, 2019.
of 6d pose estimation,” in ICCV, 2023. [129] Y. Hai and R. Song, “Shape-constraint recurrent flow for 6d object
[87] A. Crivellaro and M. Rad, “Robust 3d object tracking from monoc- pose estimation,” in CVPR, 2023.
ular images using stable parts,” IEEE TPAMI, 2017. [130] Y. Li and Y. Mao, “Mrc-net: 6-dof pose estimation with multiscale
[88] M. Oberweger and M. Rad, “Making deep heatmaps robust to residual correlation,” in CVPR, 2024.
partial occlusions for 3d object pose estimation,” in ECCV, 2018. [131] M. Cai and I. Reid, “Reconstruct locally, localize globally: A model
[89] Y. Hu and J. Hugonot, “Segmentation-driven 6d object pose esti- free method for object pose estimation,” in CVPR, 2020.
mation,” in CVPR, 2019. [132] D. Wang and G. Zhou, “Geopose: Dense reconstruction guided 6d
[90] W.-L. Huang and C.-Y. Hung, “Confidence-based 6d object pose object pose estimation with geometric consistency,” IEEE TMM,
estimation,” IEEE TMM, 2021. 2021.
[91] W. Zhao and S. Zhang, “Learning deep network for detecting 3d [133] Y. Su and M. Saleh, “Zebrapose: Coarse to fine surface encoding
object keypoints and 6d poses,” in CVPR, 2020. for 6dof object pose estimation,” in CVPR, 2022.
[92] Z. Yang and X. Yu, “Dsc-posenet: Learning 6dof object pose esti- [134] Z. Xu and Y. Zhang, “Bico-net: Regress globally, match locally for
mation via dual-scale consistency,” in CVPR, 2021. robust 6d pose estimation,” in IJCAI, 2022.
[93] S. Liu and H. Jiang, “Semi-supervised 3d hand-object poses esti- [135] L. Huang and T. Hodan, “Neural correspondence field for object
mation with interactions in time,” in CVPR, 2021. pose estimation,” in ECCV, 2022.
[94] G. Georgakis and S. Karanam, “Learning local rgb-to-cad corre- [136] H. Jiang and Z. Dang, “Center-based decoupled point cloud regis-
spondences for object pose estimation,” in ICCV, 2019. tration for 6d object pose estimation,” in ICCV, 2023.
[95] J. Sock and G. Garcia-Hernando, “Introducing pose consistency [137] P. Besl and N. D. McKay, “A method for registration of 3-d shapes,”
and warp-alignment for self-supervised 6d object pose estimation IEEE TPAMI, 1992.
in color images,” in 3DV, 2020. [138] Y. Lin and Y. Su, “Hipose: Hierarchical binary surface encoding
[96] S. Zhang and W. Zhao, “Keypoint-graph-driven learning frame- and correspondence pruning for rgb-d 6dof object pose estima-
work for object pose estimation,” in CVPR, 2021. tion,” in CVPR, 2024.
[97] S. Thalhammer and M. Leitner, “Pyrapose: Feature pyramids for [139] K. Park and T. Patten, “Pix2pose: Pixel-wise coordinate regression
fast and accurate object pose estimation under domain shift,” in of objects for 6d pose estimation,” in ICCV, 2019.
ICRA, 2021. [140] C. Wu and L. Chen, “Geometric-aware dense matching network
[98] T. Hodan and D. Barath, “Epos: Estimating 6d pose of objects with for 6d pose estimation of objects from rgb-d images,” PR, 2023.
symmetries,” in CVPR, 2020. [141] C. Wu and L. Chen, “Pseudo-siamese graph matching network for
[99] I. Shugurov and S. Zakharov, “Dpodv2: Dense correspondence- textureless objects’ 6-d pose estimation,” IEEE TIE, 2021.
based 6 dof pose estimation,” IEEE TPAMI, 2021. [142] Z. Li and Y. Hu, “Sd-pose: Semantic decomposition for cross-
[100] H. Chen and P. Wang, “Epro-pnp: Generalized end-to-end prob- domain 6d object pose estimation,” in AAAI, 2021.
abilistic perspective-n-points for monocular object pose estima- [143] Y. Hu and P. Fua, “Perspective flow aggregation for data-limited
tion,” in CVPR, 2022. 6d object pose estimation,” in ECCV, 2022.
[101] R. L. Haugaard and A. G. Buch, “Surfemb: Dense and continuous [144] Y. Hai and R. Song, “Pseudo flow consistency for self-supervised
correspondence distributions for object pose estimation with learnt 6d object pose estimation,” in ICCV, 2023.
surface embeddings,” in CVPR, 2022. [145] L. Lipson and Z. Teed, “Coupled iterative refinement for 6d multi-
[102] F. Li and S. R. Vutukur, “Nerf-pose: A first-reconstruct-then-regress object pose estimation,” in CVPR, 2022.
approach for weakly-supervised 6d object pose estimation,” in [146] J. J. Moré, “The levenberg-marquardt algorithm: Implementation
ICCV, 2023. and theory,” in Numerical Analysis, 1978.
[103] Y. Xu and K.-Y. Lin, “Rnnpose: 6-dof object pose estimation via [147] X. Liu and J. Zhang, “6dof pose estimation with object cutout based
recurrent correspondence field estimation and pose optimization,” on a deep autoencoder,” in ISMAR-Adjunct, 2019.
IEEE TPAMI, 2024. [148] Y. Zhang and C. Zhang, “6d object pose estimation algorithm
[104] M. Sundermeyer and Z.-C. Marton, “Implicit 3d orientation learn- using preprocessing of segmentation and keypoint extraction,” in
ing for 6d object detection from rgb images,” in ECCV, 2018. I2MTC, 2020.
[105] C. Papaioannidis and V. Mygdalis, “Domain-translated 3d object [149] S. Stevšič and O. Hilliges, “Spatial attention improves iterative 6d
pose estimation,” IEEE TIP, 2020. object pose estimation,” in 3DV, 2020.
[106] Z. Li and X. Ji, “Pose-guided auto-encoder and feature-based re- [150] K. Murphy and S. Russell, Sequential Monte Carlo Methods in Prac-
finement for 6-dof object pose regression,” in ICRA, 2020. tice, ch. Rao-blackwellised particle filtering for dynamic bayesian
[107] X. Deng and A. Mousavian, “Poserbpf: A rao–blackwellized parti- networks. Springer, 2001.
cle filter for 6-d object pose tracking,” IEEE TRO, 2021. [151] X. Liu and S. Iwase, “Kdfnet: Learning keypoint distance field for
[108] H. Jiang and M. Salzmann, “Se(3) diffusion model-based point 6d object pose estimation,” in IROS, 2021.
cloud registration for robust 6d object pose estimation,” in [152] P. Liu and Q. Zhang, “Bdr6d: Bidirectional deep residual fusion
NeurIPS, 2023. network for 6d pose estimation,” IEEE TASE, 2023.
[109] Z. Dang and L. Wang, “Match normalization: Learning-based [153] L. Xu and H. Qu, “6d-diff: A keypoint diffusion framework for 6d
point cloud registration for 6d object pose estimation in the real object pose estimation,” in CVPR, 2024.
world,” IEEE TPAMI, 2024. [154] J. Mei and X. Jiang, “Spatial feature mapping for 6dof object pose
[110] T. Cao and F. Luo, “Dgecn: A depth-guided edge convolutional estimation,” PR, 2022.
network for end-to-end 6d pose estimation,” in CVPR, 2022. [155] F. Wang and X. Zhang, “Kvnet: An iterative 3d keypoints voting
[111] Y. Wu and M. Zand, “Vote from the center: 6 dof pose estimation network for real-time 6-dof object pose estimation,” Neurocomput-
in rgb-d images by radial keypoint voting,” in ECCV, 2022. ing, 2023.
[112] J. Zhou and K. Chen, “Deep fusion transformer network with [156] L. Zeng and W. J. Lv, “Parametricnet: 6dof pose estimation network
weighted vector-wise keypoints voting for robust 6d object pose for parametric shapes in stacked scenarios,” in ICRA, 2021.
estimation,” in ICCV, 2023. [157] F. Duffhauss and T. Demmler, “Mv6d: Multi-view 6d pose estima-
[113] M. Tian and L. Pan, “Robust 6d object pose estimation by learning tion on rgb-d frames using a deep point-wise voting network,” in
rgb-d features,” in ICRA, 2020. IROS, 2022.
[114] G. Zhou and H. Wang, “Pr-gcn: A deep graph convolutional net- [158] X. Yu and Z. Zhuang, “6dof object pose estimation via differen-
work with point refinement for 6d pose estimation,” in ICCV, 2021. tiable proxy voting loss,” arXiv preprint arXiv:2002.03923, 2020.
[115] N. Mo and W. Gan, “Es6d: A computation efficient and symmetry- [159] H. Lin and S. Peng, “Learning to estimate object poses without real
aware 6d pose regression framework,” in CVPR, 2022. image annotations.,” in IJCAI, 2022.
[160] T. Ikeda and S. Tanishige, “Sim2real instance-level style transfer for [203] H. Lin and Z. Liu, “Sar-net: Shape alignment and recovery network
6d pose estimation,” in IROS, 2022. for category-level 6d object pose and size estimation,” in CVPR,
[161] G. Zhou and Y. Yan, “A novel depth and color feature fusion 2022.
framework for 6d object pose estimation,” IEEE TMM, 2020. [204] R. Zhang and Y. Di, “Ssp-pose: Symmetry-aware shape prior defor-
[162] Y. LeCun and B. Boser, “Backpropagation applied to handwritten mation for direct category-level object pose estimation,” in IROS,
zip code recognition,” Neural Comput, 1989. 2022.
[163] X. Liu and X. Yuan, “A depth adaptive feature extraction and dense [205] R. Zhang and Y. Di, “Rbp-pose: Residual bounding box projection
prediction network for 6-d pose estimation in robotic grasping,” for category-level pose estimation,” in ECCV, 2022.
IEEE TII, 2023. [206] X. Liu and G. Wang, “Catre: Iterative point clouds alignment for
[164] F. Mu and R. Huang, “Temporalfusion: Temporal motion reasoning category-level object pose refinement,” in ECCV, 2022.
with multi-frame fusion for 6d object pose estimation,” in IROS, [207] J. Liu and W. Sun, “Mh6d: Multi-hypothesis consistency learning
2021. for category-level 6-d object pose estimation,” IEEE TNNLS, 2024.
[165] D. Cai and J. Heikkilä, “Sc6d: Symmetry-agnostic and [208] X. Li and H. Wang, “Category-level articulated object pose estima-
correspondence-free 6d object pose estimation,” in 3DV, 2022. tion,” in CVPR, 2020.
[166] L. Zeng and W. J. Lv, “Ppr-net++: Accurate 6-d pose estimation in [209] W. Chen and X. Jia, “Fs-net: Fast shape-based network for
stacked scenarios,” IEEE TASE, 2021. category-level 6d object pose estimation with decoupled rotation
[167] Y. Bukschat and M. Vetter, “Efficientpose: An efficient, accu- mechanism,” in CVPR, 2021.
rate and scalable end-to-end 6d multi object pose estimation ap- [210] Y. Weng and H. Wang, “Captra: Category-level pose tracking for
proach,” arXiv preprint arXiv:2011.04307, 2020. rigid and articulated objects from point clouds,” in ICCV, 2021.
[168] G. Gao and M. Lauri, “6d object pose regression via supervised [211] Y. You and R. Shi, “Cppf: Towards robust category-level 9d pose
learning on point clouds,” in ICRA, 2020. estimation in the wild,” in CVPR, 2022.
[169] M. Lin and V. Murali, “6d object pose estimation with pairwise [212] J. Zhang and M. Wu, “Generative category-level object pose esti-
compatible geometric features,” in ICRA, 2021. mation via diffusion models,” in NeurIPS, 2023.
[170] Y. Shi and J. Huang, “Stablepose: Learning 6d object poses from [213] C. Wang and Martı́n-Martı́n, “6-pack: Category-level 6d pose
geometrically stable patches,” in CVPR, 2021. tracker with anchor-based keypoints,” in ICRA, 2020.
[171] Z. Liu and Q. Wang, “Pa-pose: Partial point cloud fusion based on [214] J. Lin and Z. Wei, “Dualposenet: Category-level 6d object pose and
reliable alignment for 6d pose tracking,” PR, 2024. size estimation using dual pose network with refined learning of
[172] Y. Wen and Y. Fang, “Gccn: Geometric constraint co-attention net- pose consistency,” in ICCV, 2021.
work for 6d object pose estimation,” in ACM MM, 2021. [215] B. Wen and K. Bekris, “Bundletrack: 6d pose tracking for novel
[173] Y. An and D. Yang, “Hft6d: Multimodal 6d object pose estimation objects without instance or category-level 3d models,” in IROS,
based on hierarchical feature transformer,” Measurement, 2024. 2021.
[174] Z. Zhang and W. Chen, “Trans6d: Transformer-based 6d object [216] W. Peng and J. Yan, “Self-supervised category-level 6d object pose
pose estimation and refinement,” in ECCVW, 2022. estimation with deep implicit shape representation,” in AAAI,
[175] G. Feng and T.-B. Xu, “Nvr-net: Normal vector guided regression 2022.
network for disentangled 6d pose estimation,” IEEE TCSVT, 2023. [217] T. Lee and B.-U. Lee, “Uda-cope: Unsupervised domain adaptation
[176] G. Gao and M. Lauri, “Cloudaae: Learning 6d object pose regres- for category-level object pose estimation,” in CVPR, 2022.
sion with on-line data synthesis on point clouds,” in ICRA, 2021. [218] T. Lee and J. Tremblay, “Tta-cope: Test-time adaptation for
[177] G. Zhou and D. Wang, “Semi-supervised 6d object pose estimation category-level object pose estimation,” in CVPR, 2023.
without using real annotations,” IEEE TCSVT, 2021. [219] J. Lin and Z. Wei, “Vi-net: Boosting category-level 6d object pose
[178] T. Tan and Q. Dong, “Smoc-net: Leveraging camera pose for self- estimation via learning decoupled rotations on the spherical repre-
supervised monocular object pose estimation,” in CVPR, 2023. sentations,” in ICCV, 2023.
[179] J. Rambach and C. Deng, “Learning 6dof object poses from syn- [220] Y. Chen and Y. Di, “Secondpose: Se (3)-consistent dual-stream
thetic single channel images,” in ISMAR-Adjunct, 2018. feature fusion for category-level pose estimation,” in CVPR, 2024.
[180] K. Kleeberger and M. F. Huber, “Single shot 6d object pose estima- [221] T. Lee and B.-U. Lee, “Category-level metric scale object shape and
tion,” in ICRA, 2020. pose estimation,” IEEE RAL, 2021.
[181] V. Sarode and X. Li, “Pcrnet: Point cloud registration network [222] X. Lin and M. Zhu, “Clipose: Category-level object pose estima-
using pointnet encoding,” arXiv preprint arXiv:1908.07906, 2019. tion with pre-trained vision-language knowledge,” arXiv preprint
[182] C. R. Qi and H. Su, “Pointnet: Deep learning on point sets for 3d arXiv:2402.15726, 2024.
classification and segmentation,” in CVPR, 2017. [223] Z. Fan and Z. Song, “Acr-pose: Adversarial canonical represen-
[183] S. H. Bengtson and H. Åström, “Pose estimation from rgb images tation reconstruction network for category level 6d object pose
of highly symmetric objects using a novel multi-pose loss and estimation,” arXiv preprint arXiv:2111.10524, 2021.
differential rendering,” in IROS, 2021. [224] T. Nie and J. Ma, “Category-level 6d pose estimation using
[184] J. Park and N. Cho, “Dprost: Dynamic projective spatial trans- geometry-guided instance-aware prior and multi-stage reconstruc-
former network for 6d pose estimation,” in ECCV, 2022. tion,” IEEE RAL, 2023.
[185] M. Garon and J.-F. Lalonde, “Deep 6-dof tracking,” IEEE TVCG, [225] L. Zhou and Z. Liu, “Dr-pose: A two-stage deformation-and-
2017. registration pipeline for category-level 6d object pose estimation,”
[186] J. Long and E. Shelhamer, “Fully convolutional networks for se- in IROS, 2023.
mantic segmentation,” in CVPR, 2015. [226] L. Zou and Z. Huang, “Gpt-cope: A graph-guided point trans-
[187] W. Kehl and F. Manhardt, “Ssd-6d: Making rgb-based 3d detection former for category-level object pose estimation,” IEEE TCSVT,
and 6d pose estimation great again,” in ICCV, 2017. 2023.
[188] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and [227] G. Li and D. Zhu, “Sd-pose: Structural discrepancy aware
A. C. Berg, “Ssd: Single shot multibox detector,” in ECCV, 2016. category-level 6d object pose estimation,” in WACV, 2023.
[189] J. Wu and B. Zhou, “Real-time object pose estimation with pose [228] S. Yu and D.-H. Zhai, “Catformer: Category-level 6d object pose
interpreter networks,” in IROS, 2018. estimation with transformer,” in AAAI, 2024.
[190] T.-T. Do and M. Cai, “Deep-6dpose: Recovering 6d object pose [229] Y. He and H. Fan, “Towards self-supervised category-level object
from a single rgb image,” arXiv preprint arXiv:1802.10367, 2018. pose and size estimation,” arXiv preprint arXiv:2203.02884, 2022.
[191] T.-C. Hsiao and H.-W. Chen, “Confronting ambiguity in 6d object [230] G. Li and Y. Li, “Generative category-level shape and pose estima-
pose estimation via score-based diffusion on se(3),” in CVPR, 2024. tion with semantic primitives,” in CoRL, 2023.
[192] J. Liu and W. Sun, “Hff6d: Hierarchical feature fusion network for [231] K. Chen and S. James, “Stereopose: Category-level 6d transparent
robust 6d object pose tracking,” IEEE TCSVT, 2022. object pose estimation from stereo images via back-view nocs,” in
[193] R. Ge and G. Loianno, “Vipose: Real-time visual-inertial 6d object ICRA, 2023.
pose tracking,” in IROS, 2021. [232] H. Wang and Z. Fan, “Dtf-net: Category-level pose estimation and
[194] S. Iwase and X. Liu, “Repose: Fast 6d object pose refinement via shape reconstruction via deformable template field,” in ACM MM,
deep texture rendering,” in ICCV, 2021. 2023.
[195] J. Josifovski and M. Kerzel, “Object detection and pose estimation [233] L. Zheng and T. H. E. Tse, “Georef: Geometric alignment across
based on convolutional neural networks trained with synthetic shape variation for category-level object pose refinement,” in
data,” in IROS, 2018. CVPR, 2024.
[196] C. Sahin and T.-K. Kim, “Category-level 6d object pose recovery in [234] K. Zhang and Y. Fu, “Self-supervised geometric correspondence
depth images,” in ECCVW, 2018. for category-level 6d object pose estimation in the wild,” in ICLR,
[197] S. Umeyama, “Least-squares estimation of transformation param- 2023.
eters between two point patterns,” IEEE TPAMI, 1991. [235] A. Remus and S. D’Avella, “I2c-net: Using instance-level neural
[198] J. Wang and K. Chen, “Category-level 6d object pose estimation networks for monocular category-level 6d pose estimation,” IEEE
via cascaded relation and recurrent reconstruction networks,” in RAL, 2023.
IROS, 2021. [236] J. Liu and Z. Cao, “Category-level 6d object pose estimation with
[199] L. Zou and Z. Huang, “6d-vit: Category-level 6d object pose es- structure encoder and reasoning attention,” IEEE TCSVT, 2022.
timation via transformer-based instance representation learning,” [237] X. Deng and J. Geng, “icaps: Iterative category-level object pose
IEEE TIP, 2022. and shape estimation,” IEEE RAL, 2022.
[200] Z. Fan and Z. Song, “Object level depth reconstruction for category [238] R. Wang and X. Wang, “Query6dof: Learning sparse queries as
level 6d object pose estimation from monocular rgb image,” in implicit shape prior for category-level 6dof pose estimation,” in
ECCV, 2022. ICCV, 2023.
[201] J. Wei and X. Song, “Rgb-based category-level object pose es- [239] B. Wan and Y. Shi, “Socs: Semantically-aware object coordinate
timation via decoupled metric scale recovery,” arXiv preprint space for category-level 6d object pose estimation under large
arXiv:2309.10255, 2023. shape variations,” in ICCV, 2023.
[202] M. Z. Irshad and T. Kollar, “Centersnap: Single-shot multi-object [240] X. Lin and W. Yang, “Instance-adaptive and geometric-aware key-
3d shape reconstruction and categorical 6d pose and size estima- point learning for category-level 6d object pose estimation,” in
tion,” in ICRA, 2022. CVPR, 2024.
[241] Y. Li and K. Mo, “Category-level multi-part multi-joint 3d shape [283] T. Wang and G. Hu, “Object pose estimation via the aggregation of
assembly,” in CVPR, 2024. diffusion features,” in CVPR, 2024.
[242] C. Chi and S. Song, “Garmentnets: Category-level pose estimation [284] V. Balntas and A. Doumanoglou, “Pose guided rgbd feature learn-
for garments via canonical space shape completion,” in ICCV, ing for 3d object pose estimation,” in ICCV, 2017.
2021. [285] Y. Wen and X. Li, “Disp6d: Disentangled implicit shape and pose
[243] L. Liu and J. Du, “Category-level articulated object 9d pose estima- learning for scalable 6d pose estimation,” in ECCV, 2022.
tion via reinforcement learning,” in ACM MM, 2023. [286] B. Busam and H. J. Jung, “I like to move it: 6d pose estimation as
[244] X. Liu and J. Zhang, “Self-supervised category-level articulated an action decision process,” arXiv preprint arXiv:2009.12678, 2020.
object pose estimation with part-level se (3) equivariance,” ICLR, [287] D. Cai and J. Heikkilä, “Ove6d: Object viewpoint encoding for
2023. depth-based 6d object pose estimation,” in CVPR, 2022.
[245] X. Li and Y. Weng, “Leveraging se (3) equivariance for self- [288] W. Kabsch, “A solution for the best rotation to relate two sets of
supervised category-level object pose estimation from point vectors,” Acta Crystallographica Section A: Crystal Physics, Diffrac-
clouds,” in NeurIPS, 2021. tion, Theoretical and General Crystallography, 1976.
[246] D. Chen and J. Li, “Learning canonical shape space for category- [289] E. Corona and K. Kundu, “Pose estimation for objects with rota-
level 6d object pose and size estimation,” in CVPR, 2020. tional symmetry,” in IROS, 2018.
[247] J. Lin and H. Li, “Sparse steerable convolutions: An efficient learn- [290] C. Zhao and Y. Hu, “Fusing local similarities for retrieval-based 3d
ing of se (3)-equivariant features for estimation and tracking of orientation estimation of unseen objects,” in ECCV, 2022.
object poses in 3d space,” in NeurIPS, 2021. [291] S. Thalhammer and J.-B. Weibel, “Self-supervised vision trans-
[248] H. Wang and W. Li, “Attention-guided rgb-d fusion network for formers for 3d pose estimation of novel objects,” Image Vision
category-level 6d object pose estimation,” in IROS, 2022. Comput, 2023.
[249] M. Oquab and T. Darcet, “Dinov2: Learning robust visual features [292] P. Ausserlechner and D. Haberger, “Zs6d: Zero-shot 6d ob-
without supervision,” arXiv preprint arXiv:2304.07193, 2023. ject pose estimation using vision transformers,” arXiv preprint
[250] J. J. Park and P. Florence, “Deepsdf: Learning continuous signed arXiv:2309.11986, 2023.
distance functions for shape representation,” in CVPR, 2019. [293] A. Dosovitskiy and L. Beyer, “An image is worth 16x16 words:
[251] Y. Zhang and Z. Wu, “A transductive approach for video object Transformers for image recognition at scale,” in CoLR, 2020.
segmentation,” in CVPR, 2020. [294] J. Tremblay and B. Wen, “Diff-dope: Differentiable deep object pose
[252] Y. Ono and E. Trulls, “Lf-net: Learning local features from images,” estimation,” arXiv preprint arXiv:2310.00463, 2023.
in NeurIPS, 2018. [295] Ultralytics, “GitHub - ultralytics/yolov5.” https://github.com/
ultralytics/yolov5, 2024.
[253] X. Chen and Z. Dong, “Category level object pose estimation via neural analysis-by-synthesis,” in ECCV, 2020.
[254] L. Yen-Chen and P. Florence, “inerf: Inverting neural radiance fields for pose estimation,” in IROS, 2021.
[255] Y. Lin and J. Tremblay, “Single-stage keypoint-based category-level object pose estimation from an rgb image,” in ICRA, 2022.
[256] J. Guo and F. Zhong, “A visual navigation perspective for category-level object pose estimation,” in ECCV, 2022.
[257] W. Ma and A. Wang, “Robust category-level 6d pose estimation with coarse-to-fine rendering of neural features,” in ECCV, 2022.
[258] H. Zhang and A. Opipari, “Transnet: Category-level transparent object pose estimation,” in ECCV, 2022.
[259] Y. Lin and J. Tremblay, “Keypoint-based category-level object pose tracking from an rgb sequence with uncertainty estimation,” in ICRA, 2022.
[260] S. Yu and D.-H. Zhai, “Cattrack: Single-stage category-level 6d object pose tracking via convolution and vision transformer,” IEEE TMM, 2023.
[261] W. Goodwin and S. Vaze, “Zero-shot category-level object pose estimation,” in ECCV, 2022.
[262] M. Zaccaria and F. Manhardt, “Self-supervised category-level 6d object pose estimation with optical flow consistency,” IEEE RAL, 2023.
[263] J. Cai and Y. He, “Ov9d: Open-vocabulary category-level 9d object pose and size estimation,” arXiv preprint arXiv:2403.12396, 2024.
[264] F. Di Felice and A. Remus, “Zero123-6d: Zero-shot novel view synthesis for rgb category-level 6d pose estimation,” arXiv preprint arXiv:2403.14279, 2024.
[265] G. Pitteri and S. Ilic, “Cornet: Generic 3d corners for 6d pose estimation of new objects without retraining,” in ICCVW, 2019.
[266] G. Pitteri and A. Bugeau, “3d object detection and pose estimation of unseen objects in color images with local surface embeddings,” in ACCV, 2020.
[267] M. Gou and H. Pan, “Unseen object 6d pose estimation: A benchmark and baselines,” arXiv preprint arXiv:2206.11808, 2022.
[268] F. Hagelskjær and R. L. Haugaard, “Keymatchnet: Zero-shot pose estimation in 3d point clouds by generalized keypoint matching,” arXiv preprint arXiv:2303.16102, 2023.
[269] H. Zhao and S. Wei, “Learning symmetry-aware geometry correspondences for 6d object pose estimation,” in ICCV, 2023.
[270] K. He and G. Gkioxari, “Mask r-cnn,” in ICCV, 2017.
[271] J. Chen and M. Sun, “Zeropose: Cad-model-based zero-shot pose estimation,” arXiv preprint arXiv:2305.17934, 2023.
[272] A. Kirillov and E. Mintun, “Segment anything,” in ICCV, 2023.
[273] Z. Qin and H. Yu, “Geometric transformer for fast and robust point cloud registration,” in CVPR, 2022.
[274] A. Caraffa and D. Boscaini, “Freeze: Training-free zero-shot 6d pose estimation with geometric and vision foundation models,” arXiv preprint arXiv:2312.00947, 2024.
[275] J. Huang and H. Yu, “Matchu: Matching unseen objects for 6d pose estimation from rgb-d images,” in CVPR, 2024.
[276] V. N. Nguyen and T. Groueix, “Cnos: A strong baseline for cad-based novel object segmentation,” in ICCV, 2023.
[277] I. Shugurov and F. Li, “Osop: A multi-stage one shot object pose estimation framework,” in CVPR, 2022.
[278] B. Okorn and Q. Gu, “Zephyr: Zero-shot pose hypothesis rating,” in ICRA, 2021.
[279] E. P. Örnek and Y. Labbé, “Foundpose: Unseen object pose estimation with foundation features,” arXiv preprint arXiv:2311.18809, 2023.
[280] M. Sundermeyer and M. Durner, “Multi-path learning for object pose estimation across domains,” in CVPR, 2020.
[281] V. N. Nguyen and Y. Hu, “Templates for 3d object pose estimation revisited: Generalization to new objects and robustness to occlusions,” in CVPR, 2022.
[282] S. Moon and H. Son, “Genflow: Generalizable recurrent flow for 6d pose refinement of novel objects,” in CVPR, 2024.
[296] Y. He and Y. Wang, “Fs6d: Few-shot 6d pose estimation of novel objects,” in CVPR, 2022.
[297] P. Castro and T.-K. Kim, “Posematcher: One-shot 6d object pose estimation by deep feature matching,” in ICCVW, 2023.
[298] J. Lee and Y. Cabon, “Mfos: Model-free & one-shot object pose estimation,” in AAAI, 2024.
[299] Y. Du and Y. Xiao, “Pizza: A powerful image-only zero-shot zero-cad approach to 6 dof tracking,” in 3DV, 2022.
[300] N. Gao and V. A. Ngo, “Sa6d: Self-adaptive few-shot 6d pose estimator for novel and occluded objects,” in CoRL, 2023.
[301] D. Cai and J. Heikkilä, “Gs-pose: Cascaded framework for generalizable segmentation-based 6d object pose estimation,” arXiv preprint arXiv:2403.10683, 2024.
[302] J. Corsetti and D. Boscaini, “Open-vocabulary object 6d pose estimation,” in CVPR, 2024.
[303] J. Wu and Y. Wang, “Unseen object pose estimation via registration,” in RCAR, 2021.
[304] J. Sun and Z. Shen, “Loftr: Detector-free local feature matching with transformers,” in CVPR, 2021.
[305] P.-E. Sarlin and D. DeTone, “Superglue: Learning feature matching with graph neural networks,” in CVPR, 2020.
[306] Z. Fan and P. Pan, “Pope: 6-dof promptable pose estimation of any object, in any scene, with one reference,” arXiv preprint arXiv:2305.15727, 2023.
[307] P. Weinzaepfel and V. Leroy, “Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion,” in NeurIPS, 2022.
[308] C. Zhao and Y. Hu, “Locposenet: Robust location prior for unseen object pose estimation,” in 3DV, 2024.
[309] P. Pan and Z. Fan, “Learning to estimate 6dof pose from limited data: A few-shot, generalizable approach using rgb images,” in 3DV, 2024.
[310] B. Wen and J. Tremblay, “Bundlesdf: Neural 6-dof tracking and 3d reconstruction of unknown objects,” in CVPR, 2023.
[311] V. N. Nguyen and T. Groueix, “Nope: Novel object pose estimation from a single image,” in CVPR, 2024.
[312] K. Park and T. Patten, “Neural object learning for 6d pose estimation using a few cluttered images,” in ECCV, 2020.
[313] H. Chen and F. Manhardt, “Texpose: Neural texture learning for self-supervised 6d object pose estimation,” in CVPR, 2023.
[314] Y. Li and J. Sun, “Weakly supervised 6d pose estimation for robotic grasping,” in SIGGRAPH, 2018.
[315] K. Chen and R. Cao, “Sim-to-real 6d object pose estimation via iterative self-training for robotic bin picking,” in ECCV, 2022.
[316] B. Fu and S. K. Leong, “6d robotic assembly based on rgb-only object pose estimation,” in IROS, 2022.
[317] J. Tremblay and T. To, “Deep object pose estimation for semantic robotic grasping of household objects,” arXiv preprint arXiv:1809.10790, 2018.
[318] Z. Dong and S. Liu, “Ppr-net: Point-wise pose regression network for instance segmentation and 6d pose estimation in bin-picking scenarios,” in IROS, 2019.
[319] C. Zhuang and H. Wang, “Attentionvote: A coarse-to-fine voting network of anchor-free 6d pose estimation on point cloud for robotic bin-picking application,” Robot Cim-int Manuf, 2024.
[320] K. Wada and E. Sucar, “Morefusion: Multi-object reasoning for 6d pose estimation from volumetric fusion,” in CVPR, 2020.
[321] H. Zhang and Q. Cao, “Detect in rgb, optimize in edge: Accurate 6d pose estimation for texture-less industrial parts,” in ICRA, 2019.
[322] J. Chang and M. Kim, “Ghostpose: Multi-view pose estimation of transparent objects for robot hand grasping,” in IROS, 2021.
[323] J. Kim and H. Pyo, “Tomato harvesting robotic system based on deep-tomatos: Deep learning network using transformation loss for 6d pose estimation of maturity classified tomatoes with side-stem,” Comput Electron Agric, 2022.
[324] C. Liu and W. Sun, “Fine segmentation and difference-aware shape adjustment for category-level 6dof object pose estimation,” Appl Intell, 2023.
[325] S. Yu and D.-H. Zhai, “Category-level 6-d object pose estimation with shape deformation for robotic grasp detection,” IEEE TNNLS, 2023.
[326] J. Sun and Y. Wang, “Ick-track: A category-level 6-dof pose tracker using inter-frame consistent keypoints for aerial manipulation,” in IROS, 2022.
[327] S. Yu and D.-H. Zhai, “Robotic grasp detection based on category-level object pose estimation with self-supervised learning,” IEEE TMEC, 2023.
[328] Y. Su and J. Rambach, “Deep multi-state object pose estimation for augmented reality assembly,” in ISMAR-Adjunct, 2019.
[329] R. Pandey and P. Pidlypenskyi, “Efficient 6-dof tracking of handheld objects from an egocentric viewpoint,” in ECCV, 2018.
[330] P. F. Proença and Y. Gao, “Deep learning for spacecraft pose estimation from photorealistic rendering,” in ICRA, 2020.
[331] Epic Games, “Unreal Engine 4,” https://www.unrealengine.com.
[332] S. Wang and S. Wang, “Ca-spacenet: Counterfactual analysis for 6d pose estimation in space,” in IROS, 2022.
[333] M. Ulmer and M. Durner, “6d object pose estimation from approximate 3d models for orbital robotics,” in IROS, 2023.
[334] T. Patten and K. Park, “Object learning for 6d pose estimation and grasping from rgb-d videos of in-hand manipulation,” in IROS, 2021.
[335] H. Qi and C. Zhao, “Hoisdf: Constraining 3d hand-object pose estimation with global signed distance fields,” in CVPR, 2024.
[336] S. Hoque and S. Xu, “Deep learning for 6d pose estimation of objects - a case study for autonomous driving,” Expert Syst Appl, 2023.
[337] J. L. Elman, “Finding structure in time,” Cogn Sci, 1990.
[338] H. Sun and P. Ni, “Panelpose: A 6d pose estimation of highly-variable panel object for robotic robust cockpit panel inspection,” in IROS, 2023.