Survey and Systematization of 3D Object Detection Models and Methods

ABSTRACT
This paper offers a comprehensive survey of recent developments in 3D object detection, covering the full pipeline from input data through data representation and feature extraction to the actual detection modules. We introduce fundamental concepts, focus the survey on a broad spectrum of approaches that have emerged over the last ten years, and propose a systematization which offers a practical framework for comparing those approaches at the methods level.
triangulation and epipolar geometry theory to create range information (Giancola et al., 2018). The acquired depth map is normally appended to an RGB image as a fourth channel, together called an RGB-D image. This sensor variant yields a dense depth map, though its quality is heavily dependent on the depth estimation, which is also computationally expensive (Du et al., 2018).

Time-of-Flight cameras. Instead of deriving depth from different perspectives, the Time-of-Flight (TOF) principle can directly estimate the device-to-target distance. TOF systems are based on the LiDAR principle, sending light signals onto the scene and measuring the time until they are received back. The difference between LiDAR and camera-based TOF is that LiDAR builds point clouds using a pulsed laser, whereas TOF cameras capture depth maps through an RGB-like camera. The data captured by stereo and TOF cameras can either be transformed into a 2.5D representation like RGB-D or into a 3D representation by generating a point cloud (e.g., Song and Xiao, 2014; Qi et al., 2019; Sun et al., 2018; Ren and Sudderth, 2020).
In general, camera sensors such as stereo and TOF possess the advantage of low integration costs and relatively low arithmetic complexity. However, all these methods experience considerable quality volatility through environmental conditions like light and weather.

2.3.2. LiDAR Sensors
LiDAR sensors emit laser pulses onto the scene and measure the time from the emission of the beam to the reception of its reflection. In combination with the constant speed of light, the measured time reveals the distance to the target. By assembling the 3D spatial information from the reflected laser over a 360° angle, the sensor constructs a full three-dimensional map of the environment. This map is a set of 3D points, also called a point cloud. The respective reflectance values represent the strength of the received pulses; RGB information is thereby not considered by LiDAR (Lefsky et al., 2002). The HDL-64E, a common LiDAR system, outputs 120,000 points per frame, which adds up to a huge amount of data, namely 1,200,000 points per second at a 10 Hz frame rate (Arnold et al., 2019).
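The relationship between measured time and distance, as well as the resulting data volume, can be illustrated with a minimal Python sketch; the round-trip time below is a made-up example value, while the point and frame rates are the figures cited above.

SPEED_OF_LIGHT = 299_792_458  # m/s

def tof_distance(round_trip_time_s: float) -> float:
    """Distance = (speed of light * round-trip time) / 2,
    since the pulse travels to the target and back."""
    return SPEED_OF_LIGHT * round_trip_time_s / 2

# Example: a pulse returning after 0.5 microseconds corresponds to roughly 75 m.
print(tof_distance(0.5e-6))              # ~74.9 m

# Data volume of the LiDAR example above: 120,000 points per frame at 10 Hz.
points_per_frame, frame_rate_hz = 120_000, 10
print(points_per_frame * frame_rate_hz)  # 1,200,000 points per second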
The advantages of LiDAR sensors are long-range detection abilities, high resolution compared to other 3D sensors and independence of lighting conditions, which are counterbalanced by high costs and bulky devices (Arnold et al., 2019; Fernandes et al., 2021).

2.4. Domains
Looking at the extensive research on 3DOD in the past ten years, the literature can be roughly summarized into two main areas: indoor applications and autonomous vehicle applications. These two domains are the main drivers of the field, even though 3DOD is not strictly limited to these two specific areas, as other applications are conceivable and already in use, for example in retail, agriculture, and fitness.
Both domains face individual challenges and opportunities which led to this differentiation. However, it should also be noted that research in both areas is not mutually exclusive, as some 3DOD models offer solutions that are sufficiently generic and therefore do not focus on a particular domain (e.g., Qi et al., 2018; Xu et al., 2018; Tang and Lee, 2019; Wang and Jia, 2019).
A fundamental difference between indoor and autonomous vehicle applications is that objects in indoor environments are often positioned one above the other. From this fact, possibilities arise to learn inter-object relations between the target and the base/carrier object. In research, this is referred to as a holistic understanding of the indoor scene, which ultimately enables better communication between service robots and people (Ren and Sudderth, 2020; Huang et al., 2018). Challenges for indoor applications are that the scenes are often cluttered and that many objects occlude one another (Ren and Sudderth, 2020).
Autonomous vehicle applications are characterized by long distances to potential objects and difficult weather conditions such as snow, rain and fog, which make the detection process more difficult (Arnold et al., 2019). Objects also occlude one another, but since objects such as cars, pedestrians and traffic lights are unlikely to be positioned one above the other, techniques like the bird's-eye view projection can efficiently compensate for this disadvantage (e.g., Beltrán et al., 2018; Wang et al., 2018).

2.5. Datasets
As seen in 2DOD, a crucial prerequisite for continual development and fast progress of algorithms is the availability of publicly accessible datasets. They are needed to provide extensive data to train models. Further, they are used as benchmarks to compare one's own results with those of others. For instance, the availability of the ImageNet dataset (Deng et al., 2009) remarkably accelerated the development of 2D image classification and 2DOD models. The same phenomenon is observable in 3DOD and other tasks based on 3D data: more available data results in a greater coverage of possible scenarios.
Similar to the domain focus, the most commonly used datasets for 3DOD development can be roughly divided into two groups, distinguishing between autonomous driving scenarios and indoor scenes. In the following, we describe some of the major datasets that are publicly available.

2.5.1. Autonomous Driving Datasets
KITTI. The most popular dataset for autonomous driving applications is KITTI (Geiger et al., 2012). It consists of stereo images, LiDAR point clouds and GPS coordinates, all synchronized in time. Recorded scenes range from highways over complex urban areas to narrow country roads. The dataset can be used for various tasks such as stereo matching, visual odometry, 3D tracking and 3D object detection. For object detection, KITTI provides 7,481 training and 7,518 test frames including sensor calibration information and annotated 3D bounding boxes around the objects of interest in 22 video scenes. The annotations are categorized into easy, moderate and hard cases, depending on object size, occlusion and truncation levels.
Drawbacks of the dataset are the limited sensor configurations and light conditions: all recordings have been made during daytime and mostly under sunny conditions. Moreover, the class frequencies are quite unbalanced: 75% of the annotations belong to the class car, 15% to the class pedestrian and 4% to the class cyclist. The missing variety of natural scenarios challenges the evaluation of the latest methods.

nuScenes. The nuScenes dataset comprises 1,000 video scenes of 20 seconds length in an autonomous driving context. Each scene is presented through six different camera views, LiDAR and radar data with a full 360° field of view (Caesar et al., 2020). It is significantly larger than the pioneering KITTI dataset, with over seven times as many annotations and 100 times as many images. Further, the nuScenes dataset also provides night and bad weather scenarios, which are missing in the KITTI dataset. On the downside, the dataset has limited LiDAR sensor quality with 34,000 points per frame, as well as limited geographical diversity compared to the Waymo Open dataset, covering an effective area of only five square kilometers.

Waymo Open. The Waymo Open dataset focuses on providing a diverse and large-scale dataset. It consists of 1,150 videos that are exhaustively annotated with 2D and 3D bounding boxes in images and LiDAR point clouds, respectively. The data collection was conducted with five cameras presenting front and side views of the recording vehicle and one LiDAR sensor for a 360° view. Further, the data is recorded in three different cities with various light and weather conditions, offering a diverse scenery (Sun et al., 2020).

2.5.2. Indoor Datasets
NYUv2 & SUN RGB-D. NYUv2 (Silberman et al., 2012) and its successor SUN RGB-D (Song et al., 2015) are datasets commonly used for indoor applications. Their goal is to encourage methods focused on total scene understanding. The datasets were recorded using four different RGB-D sensors to ensure the generalizability of applied methods across different sensors. Even though SUN RGB-D inherited the 1,449 labeled RGB-D frames from the NYUv2 dataset, NYUv2 is still occasionally used by current methods. SUN RGB-D consists of 10,335 RGB-D images that are labelled with about 146,000 2D polygons and around 64,500 3D bounding boxes with accurate object orientation measures. Additionally, a room layout and scene category is provided for every image. To improve the image quality, short videos of every scene have been recorded; multiple frames of these videos were then used to create a refined depth map.

Objectron. Recently, Google released the Objectron dataset (Ahmadyan et al., 2021), which is composed of object-centric video clips capturing nine different object categories in indoor and outdoor scenarios. The dataset consists of 14,819 annotated video clips containing over four million annotated images. Each video is accompanied by a sparse point cloud representation.

3. Related Reviews
As of today, there are, to the best of our knowledge, only a limited number of reviews aiming to arrange and classify the most important methods and pipelines for 3DOD.
Arnold et al. (2019) were among the first to propose a classification for 3DOD approaches with a particular focus on autonomous driving applications. Based on the input data that is passed into the detection model, they divide the approaches into (i) monocular image-based methods, (ii) point cloud methods and (iii) fusion-based methods. Furthermore, they break down the point cloud category into three subcategories of data representation: (ii-a) projection-based, (ii-b) volumetric representations, and (ii-c) PointNets. The data representation states which kind of input the model consumes and which information this input contains, allowing the subsequent stage to process it more conveniently according to the design choice.
While covering various applications, such as 3D object classification, semantic segmentation and 3DOD, Liu et al. (2019b) focus on feature extraction methods, which constitute the properties and characteristics that the model derives from the passed data. They classify deep learning models on point clouds into (i) point-based methods and (ii) tree-based methods. The former directly use the raw point cloud, whereas the latter first employ a k-dimensional tree to preprocess the corresponding data representation.
Griffiths and Boehm (2019) consider object detection as a special type of classification and thus provide relevant information for 3DOD in their review on deep learning techniques for 3D sensed data classification. They differentiate the approaches on the basis of the data representation into (i) RGB-D methods, (ii) volumetric approaches, (iii) multi-view CNNs, (iv) unordered point set processing methods, and (v) ordered point set processing techniques.
Huang and Chen (2020) touch lightly upon 3DOD in their review paper about autonomous driving technologies using deep learning methods. They suggest a similar classification of methods as Arnold et al. (2019) by distinguishing between (i) camera-based methods, (ii) LiDAR-based methods, (iii) sensor-fusion methods, and additionally (iv) radar-based methods. While giving a coarse structure for 3DOD, the conference paper does not provide an explanation for this classification.
Bello et al. (2020) consider the field from a broader perspective by providing a survey of deep learning methods on 3D point clouds. The authors organize and compare different methods based on a structure that is task-independent. Subsequently, they discuss the application of exemplary approaches for different 3D vision tasks, including classification, segmentation, and object detection.
Addressing likewise the higher-level topic of deep learning for 3D point clouds, Guo et al. (2021) give a more detailed look into 3DOD. They structure the approaches for handling point clouds into (i) region proposal-based methods, (ii) single shot methods, and (iii) other methods, categorizing them on account of their model design choice. Additionally, the region proposal-based methods are split along their data representation into (i-a) multi-view, (i-b) segmentation, (i-c) frustum-based and again (i-d) other methods.
Likewise, the single shot category inherits the subcategories (ii-a) bird's-eye view, (ii-b) discretization, and (ii-c) point-based approaches.
Most recently, Fernandes et al. (2021) presented a comprehensive survey which might be the most similar to this work. They developed a detailed taxonomy for point cloud-based 3DOD. In general, they divide the detection models along their pipeline into three stages, namely data representation, feature extraction and detection network modules.
The authors note that, in terms of data representation, existing literature either transforms the point cloud data into voxels, pillars, frustums, or 2D projections, or directly consumes the raw point cloud.
Feature extraction is emphasized as the most crucial part of the 3DOD pipeline. Suitable features are essential for optimal feature learning, which in turn has a great impact on appropriate object localization and classification in later steps. The authors classify the extraction methods into pointwise, segmentwise, objectwise and CNN-based methods, the latter being further divided into 2D-CNN and 3D-CNN backbones.
The detection network module consists of the multiple output tasks of object localization and classification, as well as the regression of the 3D bounding box and its orientation. As in 2DOD, these modules are categorized into the architectural design principles of single-stage and dual-stage detectors.
Although all preceding reviews provide some systematization for 3DOD, they move – with the exception of Fernandes et al. (2021) – on a high level of abstraction. They tend to lose some of the information which is crucial to fully map relevant trends in this vivid research field.
Moreover, as mentioned above, all surveys are limited to either domain-specific aspects (e.g., autonomous driving applications) or focus on a subset of methods (e.g., point cloud-based approaches). Monocular-based methods, for example, are neglected in almost all existing review papers.

4. 3D Object Detection Pipeline
Intending to structure the research field of 3DOD from a broad perspective, we propose a systematization that enables classifying current 3DOD approaches at an appropriate abstraction level, neither losing relevant information through too high a level of abstraction nor becoming too specific and complex through a too fine-granular perspective. Likewise, our systematization aims at being sufficiently robust to allow a classification of all existing 3DOD pipelines and methods as well as of future works without the need for major adjustments to the general framework.
Figure 1 provides an overview of our systematization. It is structured along the general stages of an object detection pipeline, with several design choices at each stage. It starts with the choice of input data (Section 5), followed by the selection of a suitable data representation (Section 6) and corresponding approaches for feature extraction (Section 7). For the latter steps, it is possible to apply fusion approaches (Section 8) to combine different data inputs and take advantage of multiple feature representations. Finally, the object detection module is defined (Section 9).

Fig. 1. Systematization of a 3D object detection pipeline with its individual fine-branched design choices

The structuring along the pipeline enables us to order and understand the underlying principles of this field. Furthermore, we can compare the different approaches and are able to outline research trends in different stages of the pipeline. To this end, we carry out a qualitative literature analysis of proposed 3DOD approaches in the following sections along the pipeline to examine specific design options, benefits, limitations and trends within each stage.

5. Input Data
In the first stage of the pipeline, a model consumes the input data, which already restricts the further processing. Common inputs for 3DOD pipelines are (i) RGB images (Section 5.1), (ii) RGB-D images (Section 5.2), and (iii) point clouds (Section 5.3). 3DOD models using RGB-D images are often referred to as 2.5D approaches (e.g., Deng and Latecki, 2017; Sun et al., 2018; Maisano et al., 2018), whereas 3DOD models using point clouds are regarded as true 3D approaches.

5.1. RGB Images
Monocular or RGB images provide a dense pixel representation in the form of texture and shape information (Liang et al., 2018; Giancola et al., 2018; Arnold et al., 2019). A 2D image can be seen as a matrix containing the dimensions of height and width with the corresponding color values.
Especially for subtasks of 3DOD applications such as lane line detection, traffic light recognition or object classification, monocular-based approaches enjoy the advantage of real-time processing by 2DOD models. Likely the most severe disadvantage of monocular images is the lack of depth information. 3DOD benchmarks have shown that depth data is essential to achieve precise 3D localization (KITTI, 2021). Additionally, by only presenting a single view perspective, monocular images face the problem of object occlusion.

5.2. RGB-D Images
RGB-D images can be produced through stereo or TOF cameras, providing depth information alongside the color information, as described in Section 2.3. RGB-D images consist of an RGB image with an additional depth map (Du et al., 2018). The depth map is comparable to a grayscale image, except that each pixel represents the actual distance between the sensor and the surface of the scene object. RGB image and depth image ideally hold a one-to-one correspondence between pixels (Wang and Ye, 2020).
Also called range images, RGB-D data is convenient to use with the majority of 2DOD methods, as the depth information can be treated like the three channels of RGB (Giancola et al., 2018). However, same as monocular images, RGB-D faces the problem of occlusion since the scene is only presented from a single perspective. In addition, objects are presented at different scales depending on their position in space.
5.3. Point Cloud
The data acquired by 3D sensors can be converted to a more generic structure, the point cloud. It is a three-dimensional set of points that has an unorganized spatial structure (Otepka et al., 2013). The point cloud is defined by its points, which enclose the spatial coordinates of an object's sampled surface. However, other geometrical and visual attributes can be added to each point (Giancola et al., 2018).
As described in Section 2.3, point clouds can be obtained from LiDAR sensors or transformed RGB-D images. Yet, point clouds obtained from RGB-D images are typically noisier and sparser due to low resolution and perspective occlusion in comparison to LiDAR-generated point clouds (Luo et al., 2020).
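The transformation from an RGB-D image to a point cloud can be sketched with a standard pinhole back-projection; the camera intrinsics used below (fx, fy, cx, cy) are hypothetical placeholders, not values from a particular sensor or dataset.

import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, metric depth per pixel) into an
    N x 3 point cloud using pinhole intrinsics (fx, fy, cx, cy)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]   # drop pixels without a valid depth

# Toy example with hypothetical intrinsics for a 480 x 640 depth map.
depth = np.random.uniform(0.5, 5.0, size=(480, 640))
cloud = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(cloud.shape)   # (307200, 3)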
The point cloud offers a fully three-dimensional reconstruction of the scene, providing rich geometric, shape and scale information. This enables the extraction of meaningful features, boosting the detection performance. Nevertheless, point clouds face severe challenges which are rooted in their nature and processability. Common deep learning operations, which have proven to be the most effective techniques for object detection, require data to be organized in a tensor with a dense structure (e.g., images, videos), which is not fulfilled by point clouds (Zhou and Tuzel, 2018). In particular, point clouds exhibit irregular, unstructured and unordered data characteristics (Bello et al., 2020).

• Irregular means that the points of a point cloud are not evenly sampled across the scene. Hence, some regions have a denser point distribution than others. Especially faraway objects are usually represented sparsely by very few points because of the limited range recording ability of current sensors.

• Unstructured means that points are not on a regular grid. Accordingly, the distances between neighboring points can vary. In contrast, pixels in an image always have a fixed position relative to their neighbors, evenly spaced throughout the image.

• Unordered means that the point cloud is just a set of points that is invariant to permutations of its members. Particularly, the order in which the points are stored does not change the scene that it represents. In other formats, an image for instance, data usually gets stored as a list (Qi et al., 2017a; Bello et al., 2020). Permutation invariance, however, means that a point cloud of N points has N! permutations, and the subsequent data processing must be invariant to each of these different representations (a minimal sketch of this property follows below).
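The unordered property can be illustrated with a small, self-contained sketch (not taken from the surveyed methods): an order-sensitive comparison treats two permutations of the same cloud as different inputs, whereas a symmetric aggregation such as a per-dimension maximum yields identical results for all N! orderings.

import numpy as np

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))             # one point cloud with N = 1024 points
shuffled = cloud[rng.permutation(len(cloud))]  # same scene, different storage order

# Order-dependent processing (e.g., treating the list like an image tensor)
# sees two different inputs:
print(np.array_equal(cloud, shuffled))                       # False (in general)

# A symmetric function such as max-pooling over the point dimension is
# invariant to the N! possible orderings and sees the same scene:
print(np.allclose(cloud.max(axis=0), shuffled.max(axis=0)))  # True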
Figure 2 provides an illustrative overview of three challenging characteristics of point cloud data.

6. Data Representation
In order to ensure a correct processing of the input data by 3DOD models, it must be available in a suitable representation. Due to the different data formats, 3DOD data representations can be generally classified into 2D and 3D representations. Beyond that, we assign 2.5D representations either to 2D if they come in an image format, regardless of the number of channels, or to 3D if the data is described in a spatial structure. 2D representations generally cover (i) monocular representations (Section 6.1) and (ii) RGB-D front views (Section 6.2). 3D representations, on the other hand, cover (iii) grid cells (Section 6.3) and (iv) pointwise representations (Section 6.4).
Fig. 2. Characteristics of point cloud data: (a) Irregular collection with sparse and dense regions, (b) Unstructured cloud of independent points without a
fixed grid, (c) Unordered set of points that are invariant to permutation (Bello et al., 2020)
6.1. Monocular Representation
Despite lacking the availability of range information, monocular representation enjoys a certain popularity among 3DOD methods due to its efficient computation in addition to being affordable and simple to set up with a single camera. Hence, monocular representations are attractive for applications where resources are limited (Ku et al., 2019; Jörgensen et al., 2019).
The vast majority of monocular representations use the well-known frontal view, which is limited by the viewing angle of the camera. Other than that, Payen de La Garanderie et al. (2018) tackle monocular 360° panoramic imagery using equirectangular projections instead of the rectilinear projection of conventional camera images. To access true 360° processing, they fold the panorama imagery into a 360° ring by stitching left and right edges together with a 2D convolutional padding operation.
Only a small proportion of 3DOD monocular approaches exclusively use the image representation (e.g., Jörgensen et al., 2019) for 3D spatial estimations. Most models leverage additional data and information to substitute the missing depth information (more details in Section 7.2.2). Additionally, representation fusion techniques are quite popular to compensate for the disadvantages. For instance, 2D candidates are initially detected from monocular images before a 3D bounding box for the spatial object is predicted on the basis of these proposals. In general, the latter step processes an extruded 3D subspace derived from the 2D bounding box. In the case of representation fusion, the monocular representation is usually not used for full 3DOD but rather as a support for increasing efficiency by limiting the search space for heavy three-dimensional computations or by delivering additional features such as texture and color. These methods are described in depth in Section 8 (Fusion Approaches).

6.2. RGB-D Front View
RGB-D data can either be transformed to a point cloud or kept in its natural form of four channels. Therefore, we can distinguish between an RGB-D (3D) representation (e.g., Chen et al., 2018; Tang and Lee, 2019; Ferguson and Law, 2019), which exploits the depth information in its spatial form of a point cloud, and an RGB-D (2D) representation (e.g., Chen et al., 2015; He et al., 2017; Li et al., 2019c; Rahman et al., 2019; Luo et al., 2020), which holds an additional 2D depth map in an image format.
Thus, as mentioned in Section 5.2, RGB-D (2D) images represent monocular images with an appended fourth channel of the depth map. The data is compacted along the z-axis, generating a dense projection in the frontal view. The 2D depth image can be processed similarly to the RGB channels by highly performant 2D-CNN models.
The RGB-D (2D) representation is often referred to as front view (FV) in 3DOD research. However, front and range view (RV) are occasionally equated with each other in current research. To clarify, this work considers the FV as an RGB-D image generated by a TOF, stereo or similar camera, whereas the RV is defined as the natural frontal projection of a point cloud (see also Section 6.3.2).

6.3. Grid Cells
The challenges of processing a point cloud (cf. Section 5.3) are of particular difficulty for CNNs, since convolutional operations require a structured grid, which is lacking in point cloud data (Bello et al., 2020). Thus, to resort to advanced deep learning methods and leverage highly informative point clouds, they must first be transformed into a suitable representation.
Current research presents two ways to handle point clouds. The first and more natural solution is to fit a regular grid onto the point cloud, producing a grid cell representation. Many approaches do so by either quantizing point clouds into 3D volumetric grids (Section 6.3.1) (e.g., Song and Xiao, 2014; Zhou and Tuzel, 2018; Shi et al., 2020b) or by discretizing them into (multi-view) projections (Section 6.3.2) (e.g., Li et al., 2016; Chen et al., 2017; Beltrán et al., 2018; Zheng et al., 2020).
The second and more abstract way to solve the point cloud representation problem is to process the point cloud directly by grouping points into point sets. This approach does not require convolutions and thus enables processing the point cloud without transformation in a pointwise representation (Section 6.4) (e.g., Qi et al., 2018; Shi et al., 2019; Huang et al., 2020).
Along these directions, several state-of-the-art methods have been proposed for 3DOD pipelines, which we describe exemplarily in the following.

6.3.1. Volumetric Grids
The main idea behind the volumetric representation is to subdivide the point cloud into equally distributed grid cells, called voxels, enabling further processing in a structured form. Therefore, the point cloud is converted into a 3D fixed-size voxel structure of dimension (x, y, z). The resulting voxels either contain raw points or already encode the occupied points into a feature representation such as point density or intensity per voxel (Bello et al., 2020). Figure 3 illustrates the transformation of a point cloud into a voxel-based representation.
Usually, the voxels are of cuboid shape (e.g., Li, 2017; Zhou and Tuzel, 2018; Ren and Sudderth, 2020). However, there are also approaches applying other forms such as pillars (e.g., Lang et al., 2019; Lehner et al., 2019).
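As a minimal illustration of such a voxelization, the sketch below assigns each point of a cloud to an integer (x, y, z) cell index and encodes occupied voxels with a simple point-density feature; the scene extent and voxel size are illustrative assumptions rather than settings of a specific model.

import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.2), origin=(0.0, -40.0, -3.0)):
    """Quantize an N x 3 point cloud into a voxel grid and return a
    dictionary mapping voxel index -> point density (number of points)."""
    idx = np.floor((points - np.asarray(origin)) / np.asarray(voxel_size)).astype(int)
    grid = {}
    for voxel in map(tuple, idx):
        grid[voxel] = grid.get(voxel, 0) + 1
    return grid

points = np.random.uniform([0, -40, -3], [70, 40, 1], size=(120_000, 3))
grid = voxelize(points)
print(len(grid), "occupied voxels in an otherwise sparse 3D grid")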
et al., 2019). Furthermore, the assumption that all objects lie on one mutual ground plane often turns out to be infeasible in reality, especially in indoor scenarios (Zhou et al., 2019). Also, the often coarse voxelization of BEV may remove fine-granular information, leading to inferior detection of small objects (Meyer et al., 2019a).
Exemplary models using the BEV representation can be found in the works of Wang et al. (2018), Beltrán et al. (2018), Liang et al. (2018), Yang et al. (2018b), Simon et al. (2019b), Zeng et al. (2018), Li et al. (2019b), Ali et al. (2019), He et al. (2020), Zheng et al. (2020), Liang et al. (2020), and Wang et al. (2020).

Multi View Representation. Often BEV and RV are not used as a single data representation but in a multi-view approach, meaning that RV and BEV, but also images of monocular nature, are combined to represent the spatial information of a point cloud. Chen et al. (2017) were the first to integrate this concept into a 3DOD pipeline, followed by many other models adapting it to fuse 2D representations from different perspectives (e.g., Ku et al., 2018; Li et al., 2019b; Wang et al., 2019, 2020; Liang et al., 2020).
Although the representations of BEV and RV are compact and efficient, they are always limited by the loss of information originating from the discretization to a fixed number of grid cells and the respective feature encoding of the cells' points.

6.4. Pointwise Representation
Either way, discretizing the point cloud into a projection or volumetric representation inevitably leads to information loss. Against this backdrop, Qi et al. (2017a) introduced PointNet, and thus a new way to consume the raw points in their unstructured nature, having access to all the recorded information.
In pointwise representations, points are available isolated and sparsely distributed in a spatial structure representing the visible surface, while preserving precise localization information. PointNet handles this representation by aggregating neighboring points and extracting a compressed feature from the low-dimensional point features of each set, making a raw point-based representation possible for 3DOD. A more detailed description of PointNet and its successor PointNet++ is given in Sections 7.3.1 and 7.3.2, respectively.
However, PointNet was developed and tested on point clouds containing 1,024 points (Qi et al., 2017a), whereas realistic point clouds captured by a standard LiDAR sensor such as Velodyne's HDL-64E usually consist of 120,000 points per frame. Thus, applying PointNet to the whole point cloud is a time- and memory-consuming operation. De facto, point clouds are rarely consumed in total. As a consequence, further techniques are required to improve efficiency, such as cascading fusion approaches (see Section 8.1) that crop point clouds to regions of interest, only handing subsets of the point cloud to the pointwise feature extraction stage.
In general, it can be stated that point-based representations retain more information than voxel- or projection-based methods. On the downside, point-based methods are inefficient when the number of points is large. Yet, a reduction of the point cloud, as in cascading fusion approaches, always comes with a decrease in information. In summary, it can be said that maintaining both efficiency and performance is not achievable for any of the representations to date.

7. Feature Extraction
Feature extraction is emphasized as the most crucial part of the 3DOD pipeline and the one research is focusing on (Fernandes et al., 2021). It follows the paradigm of reducing the dimensionality of the data representation with the intention of representing the scene by a robust set of features. Features generally depict the unique characteristics of the data and are used to bridge the semantic gap, which denotes the difference between the human comprehension of the scene and the model's prediction. Suitable features are essential for optimal feature learning, which in turn has a great impact on the detection in later steps. Hence, the goal of feature extraction is to provide a robust semantic representation of the visual scene that ultimately leads to the recognition and detection of different objects (Zhao et al., 2019).
As with 2DOD, feature extraction approaches can be roughly divided into (i) handcrafted feature extraction (Section 7.1) and (ii) feature learning via deep learning methods. Regarding the latter, we can distinguish the broad body of 3DOD research depending on the respective data representation. Hence, feature learning can be performed either in a (ii-a) monocular (Section 7.2), (ii-b) pointwise (Section 7.3), (ii-c) segmentwise (Section 7.4), or (ii-d) fusion-based approach (Section 8).

7.1. Handcrafted Feature Extraction
While the vast majority has shifted to hierarchical deep learning, which is able to produce more complex and robust features, there are still cases where features are manually crafted.
Handcrafted feature extraction distinguishes itself from feature learning in that the features are individually selected and usually directly serve for the final determination of the scene. Features like edges or corners are tailored by hand and are the ultimate characteristics for object detection. There is no algorithm that independently learns how these features are built up or how they can be combined, as is the case for CNNs. The feature initialization already constitutes the feature extraction step. Often these handcrafted features are then scored by SVMs or random forest classifiers that are exhaustively deployed over the entire image or scene, respectively.
A few exemplary 3DOD models using handcrafted feature extraction shall be introduced in the following. For instance, Song and Xiao (2014) use four types of 3D features which they exhaustively extract from each voxel cell, namely point density, a 3D shape feature, a 3D normal feature and a truncated signed distance function feature, to handle problems of self-occlusion (see Figure 5).
Wang and Posner (2015) also use a fixed-dimensional feature vector containing the mean and variance of the reflectance values of all points that lie within a voxel and an additional binary occupancy feature. The features are not further processed and are directly used for detection purposes in a voting scheme.
Ren and Sudderth (2016) introduce their discriminative cloud of oriented gradients (COG) descriptor, which they further develop in their subsequent work when proposing LSS (latent support surfaces) (Ren and Sudderth, 2018) and COG 2.0 (Ren and Sudderth, 2020).
Ma et al. (2019) focus on taking the generated depth as the core feature and make explicit use of its spatial information. Similarly, Weng and Kitani (2019) adapt the deep ordinal regression network (DORN) by Fu et al. (2018) for their pseudo-LiDAR point cloud generation and then exploit a Frustum-PointNet-like model (Qi et al., 2018) for the object detection task. Further information on the Frustum PointNet approach is given in Section 8.1.
Instead of converting the whole scene into a point-based representation, Ku et al. (2019) first reduce the space by lightweight predictions and then only transform the candidate boxes into point clouds, preventing redundant computation. To do so, they exploit instance segmentation and the LiDAR data available for training to reconstruct a point cloud in a canonical object coordinate system. A similar approach for instance depth estimation is pursued by Qin et al. (2019a).

Geometric Constraints. Depth estimation networks have the advantage of closing the gap of missing depth in a direct way. Yet, errors and noise occur during depth estimation, which may lead to biased overall results and contribute to a limited upper bound of performance (Brazil and Liu, 2019; Barabanau et al., 2020). Hence, various methods try to skip the naturally ill-posed depth estimation and tackle monocular 3DOD as a geometrical problem of mapping 2D into 3D space.
Especially in autonomous driving applications, the 3D box proposals are often constrained by a flat ground assumption, namely the street. It is assumed that all possible targets are located on this plane, since automotive vehicles do not fly. Therefore, these approaches force the bounding boxes to lie along the ground plane. In indoor scenarios, on the other hand, objects are located on various height levels. Hence, the ground plane constraint does not get the same attention as in autonomous driving application models. Nevertheless, plane fitting is frequently applied in indoor scenarios to determine the room orientation.
Zia et al. (2015) were among the first to assume a common ground plane in their approach, helping them to extensively reconstruct the scene. Further, a ground plane drastically reduces the search space by only leaving two degrees of freedom for translation and one for rotation. Other representative examples for implementing ground plane assumptions in monocular representations are given by Chen et al. (2016), Du et al. (2018) and Gupta et al. (2019). All of them leverage the random sample consensus (RANSAC) approach by Fischler and Bolles (1981), a popular technique that is applied for ground plane estimation.
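The basic RANSAC loop for ground plane estimation, in the spirit of Fischler and Bolles (1981), can be sketched as follows; the iteration count and inlier threshold are illustrative assumptions.

import numpy as np

def ransac_ground_plane(points, iterations=200, threshold=0.2, rng=None):
    """Estimate a plane (normal n, offset d with n·p + d = 0) from an N x 3
    point cloud by random sampling; returns the model with the most inliers."""
    rng = rng or np.random.default_rng(0)
    best_model, best_inliers = None, 0
    for _ in range(iterations):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p2 - p1, p3 - p1)
        norm = np.linalg.norm(normal)
        if norm < 1e-8:               # degenerate (collinear) sample, skip it
            continue
        normal /= norm
        d = -normal.dot(p1)
        inliers = np.sum(np.abs(points @ normal + d) < threshold)
        if inliers > best_inliers:
            best_model, best_inliers = (normal, d), inliers
    return best_model, best_inliers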
A different geometrical approach used to recover the under-constrained monocular 3DOD problem is to establish consistency between the 2D and the 3D scene. Mousavian et al. (2017), Li et al. (2019a), Liu et al. (2019a) and Naiden et al. (2019) do so by projecting the 3D bounding box into a previously ascertained 2D bounding box. The core notion is that the 3D bounding box should fit tightly to at least one side of its corresponding 2D box detection. Naiden et al. (2019) use, for instance, a least-squares method for the fitting task.
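The underlying consistency check can be sketched generically, assuming a pinhole camera with a 3 x 4 projection matrix P: the eight corners of a 3D box hypothesis are projected into the image, and the enclosing rectangle is compared against the detected 2D box. This is a simplified illustration of the tight-fit idea, not the exact formulation of the cited works.

import numpy as np

def project_box_corners(corners_3d, P):
    """Project 8 x 3 box corners (camera coordinates) with a 3 x 4 matrix P
    and return the enclosing 2D box (xmin, ymin, xmax, ymax)."""
    homog = np.hstack([corners_3d, np.ones((8, 1))])       # 8 x 4 homogeneous points
    proj = (P @ homog.T).T                                  # 8 x 3
    uv = proj[:, :2] / proj[:, 2:3]                         # perspective division
    return np.array([uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()])

def fit_error(corners_3d, P, box_2d):
    """L2 distance between the projected enclosing box and the detected 2D box;
    minimizing this over the 3D box parameters enforces the tight-fit constraint."""
    return np.linalg.norm(project_box_corners(corners_3d, P) - np.asarray(box_2d))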
Other methods deploy 2D-3D consistency by incorporating geometric constraints such as room layout and camera pose estimations through an entangled 2D-3D loss function (e.g., Huang et al., 2018; Simonelli et al., 2019; Brazil and Liu, 2019). For example, Huang et al. (2018) define the 3D object center through the corresponding 2D center and camera parameters. Then a physical intersection between 3D objects and the 3D room layout is penalized. Simonelli et al. (2019) first disentangle the 2D and 3D detection losses to optimize each loss individually; then they also leverage the correlation between 2D and 3D in a combined multi-task loss.
Another approach is presented by Qin et al. (2019b), who exploit triangulation, which is well known for estimating 3D geometry in stereo images. They use 2D detections of the same object in a left and a right monocular image for a newly introduced anchor triangulation, where they directly localize 3D anchors based on the 2D region proposals.

3D Template Matching. An additional way of handling monocular representation in 3DOD is matching the images with 3D object templates. The idea is to have a database of object images from different viewpoints and their underlying 3D depth features. For instance, synthetic images could be rendered from computer-aided design (CAD) models, producing images from all sides of the object. The monocular input image is then searched and matched using this template database. Hence, the object pose and location can be concluded.
Fidler et al. (2012) address the task of monocular object detection with the representation of an object as a deformable 3D cuboid. The 3D cuboid consists of faces and parts, which are allowed to deform in accordance with their anchors of the 3D box. Each of these faces is modelled by a 2D template that matches the object appearance from an orthogonal point of view. It is assumed that the 3D cuboid can be rotated, so that the image view can be projected from a defined set of angles onto the respective cuboid's face and subsequently scored by a latent SVM.
Chabot et al. (2017) initially use a network to generate 2D detection results, vehicle part coordinates and a 3D box dimension proposal. Thereafter, they match the dimensions to a 3D CAD dataset composed of a fixed number of object models to associate corresponding 3D shapes in the form of manually annotated vertices. Those 3D shapes are then used for performing 2D-to-3D pose matching in order to recover 3D orientation and location. Barabanau et al. (2020) base their approach on sparse but salient features, namely 2D key points. They match CAD templates on account of 14 key points. After assigning one of five distinct geometric classes, they execute an instance depth estimation via a vertical plane passing through two of the visible key points to lift the predictions to 3D space.
3D templates provide a potentially powerful source of information, yet a sufficient quantity of models is not always available for each object class. Thus, these methods tend to focus on a small number of classes. The limitation on shapes covered by the selection of 3D templates makes the 3D template matching approach hard to generalize and extend to classes where models are unavailable (Ku et al., 2019).
Summarizing, monocular 3DOD achieves promising results. Nevertheless, the missing depth data prevents monocular 3DOD from reaching state-of-the-art results.
Further, the depth substitution techniques may limit the detection performance because errors in depth estimation, geometrical assumptions or template matching are propagated to the final 3D box prediction.

7.3. Pointwise Feature Learning
As described above, the application of deep learning techniques to point cloud data is not straightforward due to the irregularity of the point cloud (cf. Section 5.3). Many existing methods try to leverage the expertise of convolutional feature extraction by projecting point clouds into 2D image views or by converting them into regular grids of voxels. However, the projection to a specific viewpoint discards valuable information, which is particularly important in crowded scenes, while the voxelization of the point cloud leads to heavy computation on account of the sparse nature of point clouds and to supplementary information loss in point-crowded cells. Either way, manipulating the original data may have a negative effect.
To overcome the problem of irregularity in point clouds with an alternative approach, Qi et al. (2017a) propose PointNet, which is able to learn pointwise features directly from the raw point cloud. It is based on the assumption that points which lie close to each other can be grouped together and compressed into a single point. Shortly after, Qi et al. (2017b) introduced its successor PointNet++, adding the ability to capture local structures in the point cloud. Both networks are originally designed for classification of the whole point cloud, next to being able to predict semantic classes for each point of the point cloud. Thereafter, Qi et al. (2018) introduced a way to implement PointNet into 3DOD by proposing Frustum-PointNet. By now, many of the state-of-the-art 3DOD methods are based on the general PointNet architectures. Therefore, it is crucial to understand the underlying architecture and how it is used in 3DOD methods.

7.3.1. PointNet
PointNet is built out of three key modules: (i) a max-pooling layer serving as a symmetric function, (ii) a local and global information combination structure in the form of a multi-layer perceptron (MLP), and (iii) two joint alignment networks for the alignment of input points and point features, respectively (Qi et al., 2017a).
To face the permutation invariance of point clouds, PointNet is made up of symmetric functions in the form of max-pooling operations. Symmetric functions have the same output regardless of the input order. The max-pooling operation results in a global feature vector that aggregates information from all points of the point cloud. Since the max-pooling function operates on a "winner takes it all" paradigm, it does not consider local structures, which is the main limitation of PointNet.
Subsequently, PointNet accommodates an MLP that uses the global feature vector for the classification task. Other than that, the global features can also be used in combination with local point features for segmentation purposes.
The joint alignment networks ensure that a single point cloud is invariant to geometric transformations (e.g., rotation and translation). PointNet uses this natural solution to align the input points of a set of point clouds to a canonical space by pose normalization through spatial transformers, called T-Net. The same operation is deployed again by a separate network for feature alignment of all point clouds in a feature space. Both operations are crucial for the network's predictions to be invariant to the input point cloud.
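The core of this design can be condensed into a few lines. The following NumPy sketch (untrained random weights, T-Net alignment modules omitted) applies a shared per-point MLP and aggregates with max-pooling, i.e., the symmetric function described above; it illustrates the principle and is not the original implementation.

import numpy as np

def shared_mlp(points, weights, biases):
    """Apply the same small MLP (ReLU activations) to every point independently."""
    x = points
    for W, b in zip(weights, biases):
        x = np.maximum(x @ W + b, 0.0)
    return x                                    # N x C per-point features

def pointnet_global_feature(points, weights, biases):
    """Minimal PointNet core: shared per-point MLP + max-pooling (symmetric),
    yielding one global feature vector that is invariant to point order."""
    per_point = shared_mlp(points, weights, biases)
    return per_point.max(axis=0)                # C-dimensional global feature

# Toy forward pass with random (untrained) weights: 3 -> 64 -> 1024 channels.
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (3, 64)), rng.normal(0, 0.1, (64, 1024))]
biases = [np.zeros(64), np.zeros(1024)]
cloud = rng.normal(size=(1024, 3))
print(pointnet_global_feature(cloud, weights, biases).shape)  # (1024,) global feature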
7.3.2. PointNet++
PointNet++ is built in a hierarchical manner on several set abstraction layers to address the original PointNet's missing ability to consider local structures. At each level, a set of points is further abstracted to produce a new set with fewer elements, in effect summarizing the local context. Each set abstraction layer is in turn composed of three key layers: (i) a sampling layer, (ii) a grouping layer, and (iii) a PointNet layer (Qi et al., 2017b). Figure 6 provides an overview of the architecture.
The sampling layer is employed to reduce the resolution of the points. PointNet++ uses farthest point sampling (FPS), which only samples the points that are most distant from the rest of the sampled points (Bello et al., 2020). Thereby, FPS identifies and retains the centroids of the local regions for a set of points.
Subsequently, the grouping layer is used to group the representative points, which are obtained from the sampling operation, into local patches. It constructs local region sets by finding neighboring points around the sampled centroids. These are further exploited to compute the local feature representation of the neighborhood. PointNet++ adopts ball query, which searches a fixed sphere around the centroid point and then groups all points lying within it.
The sampling and grouping layers represent preprocessing tasks to capture local structure before passing the abstracted points to the PointNet layer. This layer consists of the original PointNet architecture and is applied to generate a feature vector of the local region pattern. The input for the PointNet layer is the abstraction of the local regions, meaning the centroids and the local features that encode the centroids' neighborhood.
The process of sampling, grouping and applying PointNet is repeated in a hierarchical manner, which down-samples the points further and further until the last layer produces one final global feature vector (Bello et al., 2020). This allows PointNet++ to work with the same input data at different scales, producing higher-level features at each set abstraction layer and thereby capturing local structures.
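The two preprocessing steps of a set abstraction layer can be sketched in plain NumPy as follows; the number of centroids, search radius and group size are illustrative parameters, and the subsequent PointNet layer (cf. Section 7.3.1) would then be applied to each returned group.

import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively pick the point farthest from all previously picked points."""
    selected = [0]                                      # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_samples - 1):
        selected.append(int(dist.argmax()))
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[-1]], axis=1))
    return np.array(selected)

def ball_query(points, centroids, radius, max_points=32):
    """Group the indices of all points within a fixed sphere around each centroid."""
    groups = []
    for c in centroids:
        idx = np.where(np.linalg.norm(points - points[c], axis=1) < radius)[0]
        groups.append(idx[:max_points])
    return groups

cloud = np.random.default_rng(0).normal(size=(1024, 3))
centroids = farthest_point_sampling(cloud, n_samples=128)
groups = ball_query(cloud, centroids, radius=0.4)
print(len(groups), "local regions to be encoded by the following PointNet layer")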
7.3.3. PointNet-based Feature Extraction
In the following, no explicit distinction is made between PointNet and PointNet++. Instead, we summarize both under the term PointNet-based approaches, considering that we primarily want to imply that pointwise methods are used for feature extraction or classification purposes.
Many state-of-the-art models use PointNet-like networks in their pipeline for feature extraction. An exemplary selection of seminal works includes the proposals from Qi et al. (2018), Yang et al. (2018c), Zhou and Tuzel (2018), Xu et al. (2018), Shi et al. (2019), Shin et al. (2019), Pamplona et al. (2019), Lang et al. (2019), Wang and Jia (2019), Yang et al. (2020), Li et al. (2020), Yoo et al. (2020), Zhou et al. (2020), and Huang et al. (2020).
Fig. 6. Architecture of PointNet++ (Qi et al., 2017b): hierarchical point set feature learning through repeated sampling & grouping and PointNet layers, followed by a segmentation branch (interpolation, unit PointNets and skip link concatenation, yielding per-point scores) and a classification branch (yielding class scores)
Very little is changed in the way PointNet is used by the models adopting it for feature extraction, indicating that it is already a well-designed and mature technique. Yet, Yang et al. (2020) observe in their work that the up-sampling operation in the feature propagation layers and the refinement modules consume about half of the inference time of existing PointNet approaches. Therefore, they abandon both processes to drastically reduce inference time. However, predicting only on the surviving representative points of the last set abstraction layer leads to huge performance drops. Therefore, they propose a novel sampling strategy based on feature distance and fuse this criterion with the common Euclidean distance sampling to retain meaningful features.
PointNet-based 3DOD models generally show superior performance in classification compared to models using other feature extraction methods. The set abstraction operation brings the crucial advantage of flexible receptive fields for feature learning by setting different search radii within the grouping layer (Shi et al., 2019). Flexible receptive fields can better capture the relevant content and features as they adapt to the input. Fixed receptive fields such as convolutional kernels are always limited to their dimensionality, which may result in features and objects of different sizes not being captured as well. However, PointNet operations, especially set abstractions, are computationally expensive, which manifests in long inference times compared to convolutions or fully connected layers (Yang et al., 2019, 2020).

7.4. Segmentwise Feature Learning
Segmentwise feature learning follows the idea of a regularized 3D data representation. In comparison to pointwise feature extraction, it does not take the points of the whole point cloud into account as in the previous section, but processes an aggregated set of grid-like representations of the 3D scene (cf. Section 6.3), which are therefore considered as segments.
In the following, we describe central aspects and operations of segmentwise feature learning, including (i) feature initialization (Section 7.4.1), (ii) 2D and 3D convolutions (Section 7.4.2), (iii) sparse convolutions (Section 7.4.3), and (iv) voting schemes (Section 7.4.4).

7.4.1. Feature Initialization
Volumetric approaches discretize the point cloud into a specific volumetric grid during preprocessing. Projection models, on the other hand, typically lay a relatively fine-grained grid on the 2D mapping of the point cloud scene. In either case, the representation is transformed into segments. In other words, segments can either be volumetric grids like voxels and pillars (cf. Section 6.3.1) or discretized projections of the point cloud like RV and BEV projections (cf. Section 6.3.2).
These generated segments enclose a certain set of points that does not yet possess any processable state. Therefore, an encoding is applied to the individual segments, aggregating the points enclosed by them.
The intention of the encoding is to fill the formulated segments or grid cells with discriminative features giving information about the set of points that lie in each individual cell. This process is called feature initialization. Through the grid, these features are now available in a regular and structured format. Other than in handcrafted feature extraction methods (cf. Section 7.1), these grids are not yet used for detection but are made accessible to CNNs or other extraction mechanisms to further condense the features.
For volumetric approaches, current research can be divided into two streams of feature initialization. The first and probably more intuitive approach is to encode the voxels manually. However, the handcrafted feature encoding introduces a bottleneck that discards spatial information and may prevent these approaches from effectively taking advantage of 3D shape information. This led to the latter stream, where models apply a lightweight PointNet on each voxel to learn pointwise features and assign them in an aggregated form as voxel features, called voxel feature encoding (VFE) (Zhou and Tuzel, 2018).
Volumetric Feature Initialization. Traditionally, voxels are encoded into manually chosen features, so that each voxel contains one or more values consisting of statistics computed from the points within that voxel cell. The features are chosen in advance such that they aggregate the points within a voxel and describe the corresponding points in a sufficiently discriminative way.
Early approaches like Sliding Shapes (Song and Xiao, 2014) use a combination of four types of 3D features to encode the cells, namely point density, a 3D shape feature, surface normal features and a specifically developed feature to handle the problem of self-occlusion, called the truncated signed distance function. Others, however, rely only on statistical encodings, like Wang and Posner (2015) as well as their adaption by Engelcke et al. (2017), who propose three shape factors, the mean and variance of the reflectance values of the points, and a binary occupancy feature.
Much simpler approaches are pursued by Li (2017) and Li et al. (2019e), who use a binary encoding to express whether a voxel contains points or not. In order to prevent too much loss of information through such a rudimentary binary encoding, the voxel size is usually set comparably small to generate high-resolution 3D grids (Li et al., 2019e).
More recently, voxel feature initialization has shifted more and more towards deep learning approaches, for similar reasons as seen for feature extraction in other computer vision tasks, where manually chosen features are not as performant as learned ones.
While segmentwise approaches turn out to be comparably efficient 3D feature extraction methods, pointwise models demonstrate impressive results in detection accuracy, since they have recourse to the full information of each point of the point cloud. By encoding the voxel into standard features by hand, normally a lot of information gets lost.
As a response, Zhou and Tuzel (2018) introduced the seminal idea of VFE and a corresponding deep neural network, which moves the feature initialization from hand-crafted voxel feature encoding to deep-learning-based encoding. More specifically, they proposed VoxelNet, which is able to extract pointwise features from each segment of a voxelized point cloud through a lightweight PointNet-like network. Subsequently, the individual point features are stacked together with a locally aggregated feature at voxel level. Finally, this volumetric representation is fed into 3D convolutional layers for further feature aggregation.
The use of PointNet allows VoxelNet to capture inter-point relations within a voxel and therefore to hold a more discriminative feature than a normally encoded voxel consisting of statistical values. A schematic overview of VoxelNet's architecture is given in Figure 7.

Fig. 7. Architecture of VFE module in VoxelNet (Zhou and Tuzel, 2018)
The seminal idea of VFE, is – in modified versions – used extract features by means of deep learning approaches and then
in many subsequent approaches. Several works focus on im- project these features into BEV (e.g., Wang et al., 2020; Zheng
proving performance and efficiency of VoxelNet. Performance- et al., 2020; Liang et al., 2020).
wise, Kuang et al. (2020) developed a novel feature pyramid ex- For example, Zheng et al. (2020) first initialize features of a
traction paradigm. For speeding up the model, Yan et al. (2018) voxelized point cloud by calculating the mean coordinates and
and Shi et al. (2020b) combined VFE with more efficient sparse intensities of points in each voxel. Then they apply a sparse
convolutions. Other than that, Sun et al. (2018) as well as Chen convolution and transform the representation to a dense fea-
et al. (2019b) reduced the original architecture of VFE for faster ture map before condensing it on the ground plane to produce
inference times. a BEV feature map. Analogous to their RV approach, Wang
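To make the stack-and-aggregate principle of VFE more concrete, the following minimal NumPy sketch processes the points of a single voxel with a pointwise linear layer and concatenates each point feature with the voxel's max-pooled aggregate. The weights are random stand-ins for learned parameters, and details of the original VoxelNet layer (such as augmenting each point with offsets to the voxel centroid and stacking several VFE layers) are omitted.

```python
import numpy as np

def vfe_layer(voxel_points, weight, bias):
    """Minimal sketch of a single VFE layer for one voxel.

    voxel_points: (N, C) array of point features inside one voxel
                  (e.g., x, y, z, intensity).
    weight, bias: parameters of the pointwise linear layer (stand-ins
                  for learned weights).
    Returns (N, 2 * C_out) pointwise features, each concatenated with
    the locally aggregated (max-pooled) voxel feature.
    """
    # 1. Pointwise "PointNet-like" transform (linear + ReLU).
    pointwise = np.maximum(voxel_points @ weight + bias, 0.0)   # (N, C_out)
    # 2. Locally aggregated feature: elementwise max over the points.
    aggregated = pointwise.max(axis=0, keepdims=True)           # (1, C_out)
    # 3. Stack each point feature with the voxel-level aggregate.
    return np.concatenate(
        [pointwise, np.repeat(aggregated, len(voxel_points), axis=0)], axis=1
    )

# Toy usage: 5 points with (x, y, z, intensity), projected to 8 channels.
rng = np.random.default_rng(0)
points = rng.normal(size=(5, 4))
w, b = rng.normal(size=(4, 8)), np.zeros(8)
features = vfe_layer(points, w, b)
print(features.shape)  # (5, 16); a final max-pool would yield the voxel feature
```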
Projection-based Feature Initialization. For projection approaches, the feature initialization of the cells is mostly hand-crafted. Both RV and BEV utilize fine-grained grids that primarily get filled with statistical quantities of the points lying within them. Only more recently have models started to encode the features by deep learning methods (e.g., Lehner et al., 2019; Wang et al., 2020; Zheng et al., 2020; Liang et al., 2020).

For RV, it is most popular to encode the projection map into three-channel features, namely height, distance, and intensity (Chen et al., 2017; Zhou et al., 2019; Liang et al., 2020). Instead, Meyer et al. (2019b) and Meyer et al. (2019a) build a five-channel image with range, height, azimuth angle, intensity, and a flag indicating whether a cell contains a point. In contrast to manual encoding, Wang et al. (2020) use a point-based fully connected layer to learn high-dimensional point features of the LiDAR point cloud and then apply a max-pooling operation along the z-axis to obtain the cells' features in RV.

Similar to RV, Chen et al. (2017) encode each cell of a BEV representation with height, intensity, and density. To increase the significance of this representation, the point cloud gets divided into M slices along the y-axis, which results in a BEV map with M + 2 channel features.
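The hand-crafted BEV encoding just described can be sketched as follows: each cell receives a per-slice maximum height, an intensity channel, and a density channel, giving M + 2 channels. The grid ranges, cell size, and the log-based density normalization are illustrative assumptions, not the exact values used by Chen et al. (2017).

```python
import numpy as np

def bev_encode(points, m_slices=4, x_range=(0, 40), y_range=(-2, 1),
               z_range=(-20, 20), cell=0.2):
    """Hand-crafted BEV encoding sketch: M height slices + intensity + density.

    points: (N, 4) array with columns (x, y, z, intensity); y is treated as the
            height axis, matching the slicing described in the text.
    Returns an (H, W, m_slices + 2) feature map.
    """
    h = int((x_range[1] - x_range[0]) / cell)
    w = int((z_range[1] - z_range[0]) / cell)
    bev = np.zeros((h, w, m_slices + 2), dtype=np.float32)
    slice_h = (y_range[1] - y_range[0]) / m_slices

    for x, y, z, inten in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]
                and z_range[0] <= z < z_range[1]):
            continue
        r = int((x - x_range[0]) / cell)
        c = int((z - z_range[0]) / cell)
        s = int((y - y_range[0]) / slice_h)                     # height slice index
        bev[r, c, s] = max(bev[r, c, s], y - y_range[0])        # per-slice max height
        bev[r, c, m_slices] = max(bev[r, c, m_slices], inten)   # intensity channel
        bev[r, c, m_slices + 1] += 1.0                          # raw point count

    # Normalize the density channel as log(count + 1) (one common choice).
    bev[..., m_slices + 1] = np.log1p(bev[..., m_slices + 1])
    return bev

demo = bev_encode(np.array([[10.0, 0.5, 3.0, 0.7], [10.1, -1.0, 3.0, 0.2]]))
print(demo.shape)  # (200, 200, 6) for M = 4
```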
After the introduction to 3DOD by Chen et al. (2017), the BEV representation became quite popular and many other approaches followed their proposal of feature initialization (e.g., Beltrán et al., 2018; Liang et al., 2018; Wang et al., 2018; Simon et al., 2019b; Li et al., 2019b). Yet, others choose a simpler setup, such as only encoding maximum height and density without slicing the point cloud (Ali et al., 2019), or even use a binary occupancy encoding (Yang et al., 2018b; He et al., 2020).

To avoid information loss, more recent approaches initially extract features by means of deep learning and then project these features into BEV (e.g., Wang et al., 2020; Zheng et al., 2020; Liang et al., 2020). For example, Zheng et al. (2020) first initialize the features of a voxelized point cloud by calculating the mean coordinates and intensities of the points in each voxel. Then they apply a sparse convolution and transform the representation into a dense feature map before condensing it on the ground plane to produce a BEV feature map. Analogous to their RV approach, Wang et al. (2020) again use the learned point-wise features of the point cloud, but now apply the max-pooling operation along the y-axis for feature aggregation in BEV. Liang et al. (2020), on the other hand, first extract features in RV and then transform these to a BEV representation, adding more high-level information in comparison to directly projecting the point cloud to BEV.

After the feature initialization of either volumetric grids or projected cells, segmentwise solutions usually utilize 2D/3D convolutions to extract features of the global scene.

7.4.2. 2D and 3D Convolutions

Preprocessed 2D representations, such as feature-initialized projections, monocular images, and RGB-D (2D) images, have the advantage that they can all leverage mature 2D convolutional techniques to extract features.

Volumetric voxelwise representations, on the other hand, present the spatial space in a regular format that is accessed by 3D convolutions. However, directly applying convolutions in 3D space is a very inefficient procedure due to the additional spatial dimension, which multiplies the number of locations to process.

Early approaches extending traditional 2D convolutions to 3D were presented by Song and Xiao (2016) as well as Li (2017), who place 3D convolutional filters in 3D space and perform feature extraction in an exhaustive operation. Since the search space increases dramatically from 2D to 3D, this procedure is associated with immense computation costs.

Further examples using conventional 3D-CNNs can be found in the models of Chen et al. (2015), Sun et al. (2018), Zhou and Tuzel (2018), Sindagi et al. (2019), and Kuang et al. (2020).

Despite delivering state-of-the-art results, the conventional 3D-CNN lacks efficiency. Given that the sparsity of point clouds leads to many empty and non-discriminative voxels in a volumetric representation, the exhaustive 3D-CNN operations perform a huge amount of redundant computations. This issue can be addressed by sparse convolution.

7.4.3. Sparse Convolution

Sparse convolution was proposed by Graham (2014) and Graham (2015). At first, a ground state is defined for the input data. The ground state expresses whether a spatial location is active or not. A spatial location in the input representation is active if it has a non-zero value, which in the case of a regularized point cloud is a voxel that encloses at least a certain threshold of points. Additionally, a site in the following layers is active if any of the spatial locations of the foregoing layer from which it receives its input is active. Therefore, sparse convolution must only process the sites which differ from the ground state of the preceding convolution layer, focusing computation power on the meaningful and new information present.

To lower resource cost and speed up feature extraction, the processing of point clouds needs to skip irrelevant regions. Yan et al. (2018) were the first to apply sparse convolutions in 3DOD, which do not suffer from but take advantage of sparsity.

Yet, the principle of sparse convolution has the disadvantage that it continuously dilates the set of active sites, since every output site whose receptive field contains an active input becomes active itself. The deeper a network gets, the more the sparsity is reduced and the data is dilated. For that reason, Graham et al. (2018) introduced submanifold sparse convolution, where the input first gets padded so that the output retains the same dimensions. Further, an output site is only active if the central site of its receptive field is active, keeping the efficiency of sparse convolution while simultaneously preserving the sparsity.

Relevant works using sparse 3D convolutions are proposed by Yan et al. (2018), Chen et al. (2019b), Shi et al. (2020b), Pang et al. (2020), Yoo et al. (2020), He et al. (2020), Zheng et al. (2020), and Deng et al. (2021).
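The difference between the two active-site rules can be illustrated on a small occupancy grid. The sketch below marks which output sites a layer would have to compute: the regular sparse rule lets the active set grow with depth, whereas the submanifold rule keeps it fixed.

```python
import numpy as np

def active_sites(occupancy, kernel=3, submanifold=False):
    """Sketch of the active-site rules described above on a 2D grid.

    occupancy: boolean (H, W) array marking non-empty (active) input sites.
    Regular sparse convolution: an output site is active if ANY input site in
    its receptive field is active (the active set dilates layer by layer).
    Submanifold sparse convolution: an output site is active only if its
    CENTRAL input site is active (sparsity is preserved).
    """
    if submanifold:
        return occupancy.copy()
    h, w = occupancy.shape
    r = kernel // 2
    padded = np.pad(occupancy, r)
    out = np.zeros_like(occupancy)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + kernel, j:j + kernel].any()
    return out

grid = np.zeros((7, 7), dtype=bool)
grid[3, 3] = True                      # a single occupied cell
layer = grid
for _ in range(2):                     # two "layers" of regular sparse conv
    layer = active_sites(layer)
print(layer.sum())                                  # 25 active sites: the data has dilated
print(active_sites(grid, submanifold=True).sum())   # 1: sparsity preserved
```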
7.4.4. Voting Scheme

Another approach to exploit the sparsity of point clouds is to apply a voting scheme, as exemplified by Wang and Posner (2015), Engelcke et al. (2017) and Qi et al. (2019). The idea behind voting is to let each non-zero site in the input cast a set of votes to its surrounding cells in the output layer. The voting weights are calculated by flipping the convolutional weights of the filter along its diagonal (Fernandes et al., 2021). Equivalent to sparse convolution, processing only needs to be conducted on non-zero sites. Hence, the computational cost is proportional to the number of occupied sites rather than the dimension of the scene. Facing the same problem of dilation as sparse convolution, Engelcke et al. (2017) argue for selecting suitable non-linear activation functions. In this case, rectified linear units help to maintain sparsity, since only features with values higher than zero, and not just non-zero values, are allowed to cast votes.

Mathematically, the feature-centric voting operation is equivalent to the submanifold sparse convolution, as Wang and Posner (2015) prove in their work.
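The following one-dimensional sketch illustrates the voting operation: each non-zero site scatters votes weighted by the flipped filter into its neighborhood, and the result matches a dense sliding-window cross-correlation while only ever visiting occupied sites. It is a simplified 1D analogue of the 3D case.

```python
import numpy as np

def vote(x, w):
    """Feature-centric voting on a 1D sparse signal (sketch).

    Every non-zero input site scatters weighted votes into its neighbourhood in
    the output; the vote weights are the flipped filter weights, as described
    in the text. Only occupied sites are visited.
    """
    r = len(w) // 2
    w_flipped = w[::-1]
    y = np.zeros_like(x, dtype=float)
    for s in np.flatnonzero(x):                 # visit non-zero sites only
        for k, wk in enumerate(w_flipped):
            t = s + k - r
            if 0 <= t < len(x):
                y[t] += x[s] * wk
    return y

def dense_correlation(x, w):
    """Dense reference: sliding-window cross-correlation over every site."""
    r = len(w) // 2
    y = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        for k, wk in enumerate(w):
            s = i + k - r
            if 0 <= s < len(x):
                y[i] += wk * x[s]
    return y

x = np.array([0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 0.0])   # mostly empty grid
w = np.array([1.0, 3.0, -2.0])
print(np.allclose(vote(x, w), dense_correlation(x, w)))   # True
```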
8. Fusion Approaches

Single-modality pipelines for 3DOD have developed well over the last years and have shown remarkable results. Yet, unimodal models still reveal shortcomings that prevent them from reaching full maturity and human-like performance. For instance, camera images lack depth information and suffer from truncation and occlusion, while point clouds cannot offer any texture information and become increasingly sparse at longer range. To overcome these problems, the attention of more recent research is increasingly directed towards fusion models that try to leverage the combination of information from different modalities.

The main challenges of fusion approaches are the synchronization of the different representations and the preservation of the relevant information during the fusion process. Further, keeping the additional complexity at a computationally reasonable level must also be taken into account.

Fusion methods can be divided into two classes according to the orchestration of the modality integration, namely (i) cascaded fusion (Section 8.1) and (ii) feature fusion (Section 8.2). The former combines different sensor data and their individual features or predictions across different stages, whereas the latter jointly reasons about multi-representation inputs.

8.1. Cascaded Fusion

Cascaded fusion methods use consecutive single-modality detectors to restrict second-stage detection by the results of the first detector. Usually, monocular-based object detectors are leveraged in the first stage to define a restricted subset of the point cloud that only includes 3D points which are likely to belong to an object. Hence, in the second stage, 3D detectors only need to reason over a delimited 3D search space.

Two seminal works in this regard are the fusion frameworks proposed by Lahoud and Ghanem (2017) and Qi et al. (2018). Both approaches use the detection results from the 2D image to extrude a corresponding frustum into 3D space; for each 2D proposal, a frustum is generated. The popular Frustum-PointNets approach by Qi et al. (2018) then processes the frustum with PointNet for instance segmentation. Finally, the amodal 3D box gets predicted on the basis of the frustum and the extracted foreground points (see Figure 8). Lahoud and Ghanem (2017), on the other hand, first estimate the orientation of each object within the frustum. In the last step, they apply an MLP regressor for the 3D boundaries of the object.
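A minimal sketch of the frustum extrusion underlying these cascaded approaches: given a 2D detection box and (assumed) camera intrinsics, the LiDAR points whose image projections fall inside the box are selected as the restricted 3D search space. The coordinate convention and the intrinsic matrix below are illustrative assumptions.

```python
import numpy as np

def frustum_points(points, box2d, K):
    """Select the LiDAR points whose image projection falls inside a 2D box.

    points: (N, 3) points already expressed in the camera coordinate frame
            (x right, y down, z forward).
    box2d:  (x_min, y_min, x_max, y_max) detection box in pixels.
    K:      3x3 pinhole camera intrinsic matrix (assumed known).
    Returns the subset of points inside the extruded frustum.
    """
    in_front = points[:, 2] > 0.0                    # keep points ahead of the camera
    pts = points[in_front]
    uvw = pts @ K.T                                  # project with the pinhole model
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    x0, y0, x1, y1 = box2d
    inside = (u >= x0) & (u <= x1) & (v >= y0) & (v <= y1)
    return pts[inside]

# Toy example with a hypothetical intrinsic matrix.
K = np.array([[700.0, 0.0, 620.0],
              [0.0, 700.0, 190.0],
              [0.0, 0.0, 1.0]])
cloud = np.array([[0.5, 0.1, 10.0],    # projects near the image centre
                  [5.0, 0.1, 10.0],    # far to the right
                  [0.0, 0.0, -5.0]])   # behind the camera
print(frustum_points(cloud, (600, 150, 700, 250), K))  # only the first point remains
```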
Several approaches follow this principal idea in a similar way. For example, Yang et al. (2018c) and Ferguson and Law (2019) both make use of 2D semantic segmentation and project the foreground pixels of the image into the point cloud. The selected points are subsequently exploited for proposal generation through PointNet or convolution operations. Du et al. (2018) leverage the restricted 3D space by applying a model matching algorithm for detection purposes. In contrast, Shin et al. (2019) try to improve the 3D subset generation by creating point cloud region proposals in the shape of standing cylinders instead of frustums, which is more robust to sensor synchronization.

While the models described above mainly focus on the frustum creation process, Wang and Jia (2019), Zhang et al. (2020) as well as Shen and Stamos (2020) seek to advance the processing of the frustums.

Due to its modular nature, Frustum-PointNet is not able to provide an end-to-end prediction. To overcome that limitation, Wang and Jia (2019) subdivide the frustums to eventually make use of a fully convolutional network allowing a continuous estimation of oriented boxes in 3D space. They generate a sequence of frustums by sliding along the frustum axis and then aggregate the grouped points of each respective section into local pointwise features. These frustum-level features are arrayed as a 2D feature map, enabling the use of a subsequent fully convolutional network.

Other than that, Shen and Stamos (2020) aim to integrate the advancements of voxelization into frustum approaches by transforming regions of interest (ROIs) within the point frustums into 3D volumetric grids. Thus, they only voxelize relevant regions, allowing a high resolution that improves the representation while still being efficient. In this case, the voxels are then fed to a 3D fully convolutional network.

More recently, Zhang et al. (2020) observed that point-based 3DOD does not perform well at longer range because of the increasing sparsity of point clouds. Therefore, they take advantage of RGB images, which contain enough information to recognize faraway objects with mature 2D detectors. While following the idea of frustum generation, they first identify which estimated object locations are considered to be faraway. Taking into account that the very few points defining these objects in a point cloud do not possess sufficient discriminative information for neural networks, 2D detectors are applied on the corresponding images. For near objects, in contrast, Zhang et al. (2020) use conventional neural networks to process the frustum.

8.2. Feature Fusion

Since the performance of cascaded fusion models is always limited by the accuracy of the detector at each stage, some researchers try to increase performance by arguing that the models should infer more jointly over the different modalities.
To this end, feature fusion methods first concatenate the information of different modalities before reasoning over the combined features, trying to exploit the diverse information in its combination. Within feature fusion, it can be further distinguished between (i) early, (ii) late, and (iii) deep fusion approaches (Chen et al., 2017), designating at which stage of the 3DOD pipeline the fusion occurs. Figure 9 provides an illustrative overview of the different fusion schemes.

Early fusion merges multi-view features in the input stage, before any feature transformation takes place, and proceeds with a single network to predict the results. Late fusion, in contrast, uses multiple subnetworks that process the individual inputs separately up until the last stage of the pipeline, where they get concatenated in the prediction stage. Beyond that, deep fusion allows an interaction of different input modalities at several stages in the architecture and alternately performs feature transformation and feature fusion.

Fig. 9. Early, late and deep feature fusion scheme (Chen et al., 2017)

Although the following approaches can all be categorized as feature fusion methods, the classification into the various subclasses of early, late, and deep fusion is not trivial and can be fluid. Nevertheless, the concepts help to convey a better understanding of feature fusion processes.
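The three fusion schemes can be contrasted with a small sketch in which a random projection stands in for a learned feature extractor; only the position of the combination step differs.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_net(x, out_dim=16):
    """Stand-in for a learned feature extractor: a random projection + ReLU."""
    w = rng.normal(size=(x.shape[-1], out_dim))
    return np.maximum(x @ w, 0.0)

def early_fusion(image_feat, bev_feat):
    # Merge the raw per-cell inputs first, then run a single network.
    return feature_net(np.concatenate([image_feat, bev_feat], axis=-1))

def late_fusion(image_feat, bev_feat):
    # Process each modality separately and combine only at the prediction stage.
    return np.concatenate([feature_net(image_feat), feature_net(bev_feat)], axis=-1)

def deep_fusion(image_feat, bev_feat, stages=3):
    # Alternate feature transformation and elementwise-mean fusion across stages.
    a, b = feature_net(image_feat), feature_net(bev_feat)
    for _ in range(stages):
        fused = 0.5 * (a + b)
        a, b = feature_net(fused), feature_net(fused)
    return 0.5 * (a + b)

img = rng.normal(size=(5, 8))   # 5 cells with 8-dim image features
bev = rng.normal(size=(5, 6))   # the same 5 cells with 6-dim BEV features
print(early_fusion(img, bev).shape, late_fusion(img, bev).shape,
      deep_fusion(img, bev).shape)   # (5, 16) (5, 32) (5, 16)
```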
The pioneers among 3DOD fusion approaches are Chen et al. (2017), who introduced the multi-view approach. They take multiple perspectives, specifically RV, BEV and FV, as input representations. The BEV representation is utilized to generate 3D candidates, followed by region-wise feature extraction obtained by projecting these 3D proposals onto the respective feature maps of the individual views. A deep fusion scheme is then used to combine the information elementwise over several intermediate stages.

Ku et al. (2018) use 3D anchors which are mapped onto both the BEV representation and the 2D image. Subsequently, a crop-and-resize operation is applied to every projected anchor, and the feature crops from both views are fused via an elementwise mean operation at an intermediate convolutional layer. Unlike Chen et al. (2017), Ku et al. (2018) not only merge features in the late refinement stage, but already in the region proposal phase to generate positive proposals. They specifically use the full-resolution feature map to improve prediction quality, particularly for small objects. Approaches similar to Chen et al. (2017) and Ku et al. (2018) are presented by Chen et al. (2018), Li et al. (2019b) and Wang et al. (2019), who also fuse the regions of interest of the input data representations elementwise. Yet, Chen et al. (2018) use segmentwise 3D detection for box proposals in the first stage.

In contrast, Rahman et al. (2019) not only fuse the regions of interest, but combine the whole feature maps of processed monocular images and FV representations already within the region proposal stage.

Further, Ren et al. (2018) want to leverage not only object detection information but also context information. Therefore, they simply concatenate the processed features of 2D scene classification, 2D object detection and 3D object detection of the voxelized scene before feeding them to a conditional random field model for joint optimization.

Instead of only combining regular representations through multi-view fusion models, some approaches also aim at merging raw point features with other representations. For example, Xu et al. (2018) process the raw point cloud with PointNet. Thereafter, they concatenate each pointwise feature with a global scene feature and the corresponding image feature. Each point is then used as a spatial anchor to predict the offset to the 3D bounding box. Likewise, Yang et al. (2019) use the features of a PointNet++ backbone to extract semantic context features for each point. The sparse features subsequently get condensed by a point pooling layer to take advantage of a voxel-wise representation applying VFE.

Shi et al. (2020a) first use 3D convolution on a voxel representation to summarize the scene into a small set of key points. In the next step, these voxel-feature key points are fused with the grid-based proposals for refinement purposes.

The previously presented models all perform late or deep fusion procedures. Instead of fusing multi-sensor features per object after the proposal stage, Wang et al. (2018) follow the idea of early fusing BEV and image views by sparse non-homogeneous pooling layers over the full resolution. Similarly, Meyer et al. (2019a) also deploy an early fusion, but they use RV images for the point-cloud-related representation. The approach associates the LiDAR point cloud with camera pixels by projecting the 3D points onto the 2D image. The image features are then concatenated and further processed by a fully connected convolutional network.

Furthermore, several works apply a hybrid approach of early and late fusion schemes. For example, Sindagi et al. (2019) first project LiDAR to an RV representation and concatenate image features with the corresponding points in an early fusion fashion. Then they apply a VFE layer to the voxelized point cloud and append the corresponding image features for every non-empty voxel. While the early concatenation already fuses features, the late fusion aggregates image information for the volume-based representation, where voxels may contain low-quality information due to low point cloud resolution or distant objects.

Similarly, Liang et al. (2019) initially conduct feature fusion on RGB images and BEV feature maps. Thereby they incorporate multi-scale image features to augment the BEV representation. In the refinement stage, the model fuses image and augmented BEV again, but other than in the first stage, the fusion occurs elementwise for the regions of interest. They further add ground estimation and depth estimation to the fusion framework to advance the fusion process.

Simon et al. (2019a) extend their earlier approach Complex YOLO (Simon et al., 2019b) with Complexer You Only Look Once (YOLO) by exchanging the BEV map input for a voxel representation. To leverage all inputs, they first create a semantic segmentation of the RGB image and then fuse this picture pointwise onto the LiDAR frame to generate a semantic voxel grid.

Continuous Fusion. LiDAR points are continuous and sparse, whereas cameras capture dense features in a discrete state. Fusing these modalities is not a trivial task due to the one-to-many projection. In other words, not every image pixel has an observable corresponding LiDAR point in any projection, and vice versa.

To overcome this discontinuous mapping of images into point-cloud-based representations such as BEV, Liang et al. (2018) propose a novel continuous convolution that is applied to create a dense feature map through interpolation. The approach projects the image feature map onto the BEV space and then fuses it with the original LiDAR BEV map in a deep fusion manner through continuous convolution over multiple resolutions. The fused feature map is then further processed by a 2D-CNN, resolving the discrepancy between image and projection representations.
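A rough sketch of the interpolation idea behind continuous fusion: each (possibly empty) BEV cell gathers image features from its k nearest LiDAR points. The learned MLP over features and geometric offsets is replaced here by simple inverse-distance weighting, so this is only an approximation of the published method.

```python
import numpy as np

def continuous_fuse(bev_cells, lidar_xy, lidar_img_feat, k=3, eps=1e-6):
    """Sketch of dense feature interpolation for continuous fusion.

    bev_cells:      (M, 2) x/y centres of the target BEV grid cells.
    lidar_xy:       (N, 2) BEV locations of the LiDAR points.
    lidar_img_feat: (N, C) image features gathered at the points' projections.
    For every target cell, the k nearest LiDAR points are found and their image
    features are aggregated; a learned MLP over (feature, offset) is replaced
    here by inverse-distance weighting.
    """
    fused = np.zeros((len(bev_cells), lidar_img_feat.shape[1]))
    for i, cell in enumerate(bev_cells):
        d = np.linalg.norm(lidar_xy - cell, axis=1)
        nn = np.argsort(d)[:k]                       # k nearest LiDAR points
        w = 1.0 / (d[nn] + eps)
        fused[i] = (w[:, None] * lidar_img_feat[nn]).sum(0) / w.sum()
    return fused

rng = np.random.default_rng(1)
cells = np.array([[0.0, 0.0], [5.0, 5.0]])           # two empty BEV cells
pts = rng.uniform(-1, 6, size=(50, 2))               # sparse LiDAR hits in BEV
img_feats = rng.normal(size=(50, 8))                 # their retrieved image features
print(continuous_fuse(cells, pts, img_feats).shape)  # (2, 8): dense cell features
```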
Attention Mechanism. A common challenge among fusion approaches is the occurrence of noise and the propagation of irrelevant features. Previous approaches fuse multiple features simply by concatenation or by elementwise summation and/or mean operations. Thereby, noise caused, for example, by truncation and occlusion is inherited by the resulting feature maps, and inferior point features are obtained through fusion. Attention mechanisms can cope with these difficulties by determining the relevance of each feature in order to only fuse features that improve the representation.
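As a simple illustration of attention-guided fusion, the sketch below derives per-channel relevance weights from globally pooled features of both modalities and blends the two maps accordingly. The gating weights are random stand-ins for learned parameters, and the design is a generic example rather than the mechanism of any specific model discussed here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attentive_fusion(feat_a, feat_b, w_gate):
    """Channel-attention fusion sketch for two aligned feature maps.

    feat_a, feat_b: (H, W, C) feature maps from two modalities.
    w_gate:         (2 * C, C) stand-in for a learned gating layer.
    A global average pool summarizes both inputs, a sigmoid gate predicts a
    per-channel relevance weight, and the maps are blended accordingly instead
    of being summed blindly.
    """
    pooled = np.concatenate([feat_a.mean(axis=(0, 1)),
                             feat_b.mean(axis=(0, 1))])        # (2C,)
    alpha = sigmoid(pooled @ w_gate)                           # (C,) channel weights
    return alpha * feat_a + (1.0 - alpha) * feat_b             # weighted blend

rng = np.random.default_rng(0)
bev, img = rng.normal(size=(32, 32, 16)), rng.normal(size=(32, 32, 16))
gate = rng.normal(size=(32, 16)) * 0.1
print(attentive_fusion(bev, img, gate).shape)                  # (32, 32, 16)
```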
Lu et al. (2019) use a deep fusion approach for BEV and RGB images in an elementwise fashion, but additionally incorporate attention modules over both modalities to leverage the most relevant features. Spatial attention adapts pooling operations over different feature map scales, while the channel-wise fusion applies global pooling. Both create an attention map that expresses the importance of each feature. Those attention maps are then multiplied with the feature maps and finally fused.

Analogous to this, Wang et al. (2020) use an attentive pointwise fusion module to estimate the channel-wise importance of BEV, RV, and image features. In contrast, they deploy the attention mechanism after the concatenation of the multi-view feature maps to consider the mutual interference and the importance of the respective features. They specifically address the issue of ill-posed information introduced by the front view of images and RV. To compensate for the inevitable loss of geometric information through the projection of LiDAR points, the authors finally enrich the fused point features with raw point features through an MLP network.

Consecutive to the attention fusion, Yoo et al. (2020) use first-stage proposals from the attentive camera-LiDAR feature map to extract the single-modality LiDAR and camera features of the proposal regions. Using a PointNet encoding, these are subsequently fused elementwise with the joint feature map for refinement.

Other than that, Huang et al. (2020) operate directly on the LiDAR point cloud, introducing a pointwise fusion. In their deep fusion approach, they process a PointNet-like geometric stream as well as a convolution-based image stream in parallel. Between each abstraction stage, the point features get fused with the semantic image features of the corresponding convolutional layer by applying an attention mechanism.

Moreover, Pang et al. (2020) observed that elementwise fusion usually takes place after non-maximum suppression (NMS), which can have the consequence of mistakenly suppressing useful candidates of the individual modalities. NMS is used to suppress duplicate candidates after the proposal and prediction stages, respectively. Therefore, Pang et al. (2020) use a much-reduced threshold for proposal generation for each sensor and combine detection candidates before NMS. The final prediction is based on a consistency operation between 2D and 3D proposals in a late fusion fashion.
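For reference, the classical greedy IoU-based NMS that this discussion refers to can be sketched as follows (2D axis-aligned boxes for simplicity).

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union between one box and an array of boxes (x0, y0, x1, y1)."""
    x0 = np.maximum(box[0], boxes[:, 0]); y0 = np.maximum(box[1], boxes[:, 1])
    x1 = np.minimum(box[2], boxes[:, 2]); y1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping duplicates, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the near-duplicate second box is suppressed
```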
Other representative examples exploiting attention mechanisms for an effective fusion of features are proposed by Chen et al. (2019b) and Li et al. (2020).

A totally different approach to combining different inputs is presented by Chen et al. (2019a). For the special case of autonomous vehicles, they propose to connect surrounding vehicles with each other and combine their sensor measurements. More specifically, LiDAR data collected from the different positions and angles of the connected vehicles gets fused together to provide the vehicles with a collective perception of the scene.

In summary, it is notable that the fusion of different modalities is a vivid research area within 3DOD. With continuous convolutions and attention mechanisms, potential solutions for common issues, such as image-to-point-cloud discrepancies and/or noisy data representations, have already been introduced. Nevertheless, fusion approaches are still exposed to several unsolved challenges. For example, 2D-driven fusion approaches like cascaded methods are always constrained by the quality of the 2D detection in the first stage. Therefore, they might fail in cases that can only be observed properly from the 3D space. Feature fusion approaches, on the other hand, generally face the difficulty of fusing different data type structures. Take the example of fusing images and LiDAR data: while images provide a dense, high-resolution structure, LiDAR point clouds show a sparse structure with a comparably low resolution. The workaround of transforming point clouds into another representation inevitably leads to a loss of information. Another challenge for fusion approaches is that crop and resize operations used to fuse proposals of different modalities may break the feature structure derived from each sensor. Thus, a forced concatenation to a fixed feature vector size could result in imprecise correspondence between the different modalities.

9. Detection Module

The detection module depicts the last stage of the pipeline. It uses the extracted features to perform the multi-task consisting of classification, localization along with bounding box regression, and object orientation determination.

Early 3DOD approaches either (i) relied on template and keypoint matching algorithms, such as matching 3D CAD models to the scene (Section 9.1), or (ii) suggested handcrafted SVM classifiers using sliding window approaches (Section 9.2).

More recent research mainly focuses on detection frameworks based on deep learning due to their flexibility and superior performance (Section 9.3). Detection techniques of this era can be further classified into (i) anchor-based detection (Section 9.4), (ii) anchorless detection (Section 9.5), and (iii) hybrid detection (Section 9.6).

9.1. Template and Keypoint Matching Algorithms

A natural approach to classify objects is to compare and match them to a template database. These approaches typically leverage 3D CAD models to synthesize object templates that guide the geometric reasoning during inference. The applied matching algorithms use parts or whole CAD models of the objects to classify the candidates.

Teng and Xiao (2014) follow a surface identification approach. Therefore, they accumulate a surface object database of RGB-D images from different viewpoints. Then they match the specific 3D surface segment obtained by segmentation of the current scene to the surface segments of the database. For the pose estimation, they further match key points between the matched surface and the observed surface.

Crivellaro et al. (2015) initially perform a part detection. For each part, they project seven so-called 3D control points which represent the pose of the object. Finally, a bounding box is estimated out of a small set of learned objects, matching the part and key point constraints.

Kehl et al. (2016) create a codebook of local RGB-D patches from synthetic CAD models. Consisting of patches from a variety of different views, those patches get matched against the feature descriptors of the scene to classify the object.

Another matching approach is designed by He et al. (2017), extending LINE-MOD (Hinterstoisser et al., 2011), which combines surface normal orientations from depth images and silhouette gradient orientations from RGB images to represent object templates. LINE-MOD is first used to produce initial detection results based on lookup tables for similarity matching. To exclude the many false positive and duplicate detections, He et al. (2017) cluster templates that matched at a similar spatial location and only then score the matchings.

Further, Yamazaki et al. (2018) applied template matching in point cloud projections. The key novelty is to use constraints imposed by the spatial relationship among image projection directions which are linked through the shared point cloud. This allows achieving consistency of the object throughout the multi-viewpoint images, even in cluttered scenes.
Another approach is proposed by Barabanau et al. (2020). The authors introduce a compound solution of key point and template matching. They observe that depth estimation in monocular 3DOD is naturally ill-posed. For that reason, they propose to make use of sparse but salient key point features. They initially regress 2D key points and then match these with 3D CAD models for object dimension and orientation prediction.

9.2. Sliding Window Approaches

The sliding window technique was largely adopted from 2DOD to 3DOD. In essence, an object detector slides in the form of a specified window over the feature map and directly classifies each window position. For 3DOD pipelines, this idea is extended by replacing the 2D window with a spatial rectangular box that moves through a discretized 3D space. However, tested solutions have revealed that running a window over the entire 3D room is a very exhaustive task, leading to heavy computation and large inference times.

Among the popular pioneers in this area were Song and Xiao (2014) with their Sliding Shapes approach. They use previously trained SVM classifiers that run exhaustively over a voxelized 3D space.

Similarly, Ren and Sudderth (2016, 2018, 2020) use in all of their works a sliding window approach while extensively leveraging pre-trained SVMs. In COG 1.0, for example, they use SVMs along with a cascaded classification framework to learn contextual relationships among objects in the scene. Therefore, they train SVMs for each object category with hand-crafted features such as surface orientation. Furthermore, they integrate a Manhattan room layout, which assumes an orthogonal room structure, to estimate walls, ceilings and floors for a more holistic understanding of the 3D scene and to restrict the detection. Finally, the contextual information is used in a Markov random field representation problem to consider object relationships in detection (Ren and Sudderth, 2016).

In the successor model, namely LSS, Ren and Sudderth (2018) observe that the height of the support surface is the primary cause of style variation for many object categories. Therefore, they further add support surfaces as a latent part for each object, which they use in combination with an SVM and additional constraints from the predecessor.

Even in their latest work, COG 2.0, Ren and Sudderth (2020) still apply an exhaustive sliding window search for 3DOD, laying their focus on robust feature extraction rather than detection techniques.

Alike, Liu et al. (2018a) also use SVMs learned for each object class based on the feature selection proposed by Ren and Sudderth (2016). Through a pruning of candidates, comparing the cuboid size of the bounding boxes with the distribution of the physical sizes of the objects, they further reduce the inference time of detection.

However, an exhaustive sliding window approach tends to be computationally expensive, since the third dimension increases the search space significantly. Therefore, Wang and Posner (2015) leverage the sparsity of 3D representations by adding a voting scheme that only activates on occupied cells, reducing the computation while maintaining mathematical equivalence. Whereas the sliding window approach of Song and Xiao (2014) operates linearly in the total number of cells of the 3D grid, voting by Wang and Posner (2015) reduces the operations exclusively to the occupied cells. The voting scheme is explained in further detail in Section 7.4.4.

Engelcke et al. (2017) tie in with the success of Wang and Posner (2015) and propose to exploit feature-centric voting to detect objects in point clouds with even deeper networks to boost performance.
9.3. Detection Frameworks based on Deep Learning

While presenting solid solutions within their specific use cases, all of the above detection techniques rely on manually designed features and are difficult to transfer. Thus, to exploit more robust features and improve detection performance, most modern detection approaches are based on deep learning models.

As with 2DOD, detection networks for 3DOD relying on deep learning can basically be grouped into two meta frameworks: (i) two-stage detection frameworks (Section 9.3.1) and (ii) single-stage detection frameworks (Section 9.3.2). To provide a basic understanding of these two concepts, we briefly revisit the major developments for 2D detection frameworks in the following subsections.

9.3.1. Two-Stage Detection Frameworks

As the name indicates, two-stage frameworks perform the object detection task in two stages. In the first stage, spatial sub-regions of the input image that contain object candidates are identified, commonly known as region proposals. The proposed regions are coarse predictions that are scored based on their "objectness". Regions with a high probability of containing an object achieve a high score and get passed as input to the second stage. These unrefined predictions often lack localization precision. Therefore, the second stage mainly improves the spatial estimation of the object through a more fine-grained feature extraction. The following multi-task head then outputs the final bounding box estimation and classification score.

A seminal work following this central idea is that of Girshick et al. (2014), who introduced the region-based CNN (R-CNN). Instead of working with a huge amount of region proposals from an exhaustive sliding window procedure, R-CNN integrates the selective search algorithm (Uijlings et al., 2013) to extract just about 2,000 category-independent candidates. More specifically, selective search is based on a hierarchical segmentation approach which recursively combines smaller regions into larger ones based on color, texture, size and fill similarity. Subsequently, the 2,000 generated region proposals are cropped and warped into fixed-size images in order to be fed into a pre-trained and fine-tuned CNN. The CNN acts as a feature extractor producing a feature vector of fixed length, which is then consumed by binary SVM classifiers that are trained for each object class independently. At the same time, the CNN features are used for the class-specific bounding box regression.

The original R-CNN framework turned out to be expensive in both time and disk space due to a lack of shared computations between the individual training steps (i.e., CNN, SVM classifiers, bounding box regressors). To this end, Girshick (2015) developed an extension, called Fast R-CNN, in which the individual computations were integrated into a jointly trained framework. Instead of feeding the region proposals generated by selective search to the CNN, both operations are swapped, so that the entire input image is now processed by the CNN to produce a shared convolutional feature map. The region proposals are then projected onto the shared feature map, and a feature vector of fixed length is extracted from each region proposal using a region-of-interest pooling layer. Subsequently, the extracted features are consumed by a sequence of fully connected layers to predict the results for the final object classes and the bounding box offset values for refinement purposes (Girshick, 2015). This approach saves memory and improves both the accuracy and the efficiency of object detection models (Zhao et al., 2019; Liu et al., 2020).

Both R-CNN and Fast R-CNN have the drawback that they rely on external region proposals generated by selective search, which is a time-consuming process. Against this backdrop, Ren et al. (2017) introduced another extension, called Faster R-CNN. As an innovative enrichment, the detection framework consists of a region proposal network (RPN) as a sub-network for nominating regions of interest. The RPN is a CNN by itself and replaces the functionality of the selective search algorithm. To classify objects, the RPN is connected to a Fast R-CNN model, with which it shares convolutional layers and the resulting feature maps.

The RPN initializes multiple reference boxes, called anchors, with different sizes and aspect ratios at each possible feature map position. These anchors are then mapped to a lower-dimensional vector, which is used for "objectness" classification and bounding box regression via fully connected layers. The proposals are in turn passed to the Fast R-CNN for bounding box classification and fine-tuning. Due to the convolutional layers used simultaneously by the RPN and the Fast R-CNN, the architecture provides an extremely efficient solution for the region proposals (Ren et al., 2017). Furthermore, since Faster R-CNN is one continuous CNN, the network can be trained end-to-end using backpropagation, and hand-crafted features are no longer necessary (Zhao et al., 2019; Liu et al., 2020).

9.3.2. Single-Stage Detection Frameworks

Single-stage detectors present a simpler network design by transforming the input into a structured data representation and employing a CNN to directly estimate bounding box parameters and class scores in a fully convolutional manner.

Object detectors based on region proposals are computationally intensive and have high inference times, especially on mobile devices with limited memory and computational capacities (Redmon et al., 2016; Liu et al., 2020). Therefore, single-stage frameworks with significant time advantages have been designed, accepting certain drawbacks in performance in comparison to the heavyweight two-stage region proposal detectors of the R-CNN family. The speed improvement results from the elimination of bounding box proposals as well as of the feature resampling phase (Liu et al., 2016). Two popular approaches which launched this development are YOLO (you only look once) (Redmon et al., 2016) and SSD (single shot multibox detector) (Liu et al., 2016).

The basic idea of YOLO is to divide the original input image into an S × S grid. Each grid cell is responsible both for classifying the objects within it and for predicting the bounding boxes and their confidence values. However, YOLO uses features of the entire input image and not only those of proposed local regions. The use of only a single neural network and the omission of the RPN allows the YOLO successor Fast YOLO to run in real time at up to 155 frames per second. However, YOLO exhibits disadvantages in the form of a comparably lower quality of results, such as more frequent localization errors, especially for smaller objects (Redmon et al., 2016).

The SSD framework also has real-time capability while not suffering performance losses as severe as YOLO. Similarly, the model consists of a single continuous CNN but uses the idea of anchors from the RPN. Instead of fixed grids as in YOLO, anchor boxes of various sizes are used to determine the bounding boxes. In order to detect objects of different sizes, the predictions of several generated feature maps of descending resolution are combined. In this process, the front layers of the SSD network are increasingly used for classification due to their size, and the back layers are used for detection (Liu et al., 2016).
Thanks to regressing bounding boxes and class scores in one stage, single-stage networks are faster than two-stage ones. However, features are not learned from predicted bounding-box proposals, but from predefined anchors. Hence, the resulting predictions are typically not as accurate as those of two-stage frameworks.

Compared to single-stage approaches, proposal-based models can leverage finer spatial information in the second stage, focusing only on the narrowed-down regions of interest predicted by the first stage. Features get re-extracted for each proposal, which achieves more accurate localization and classification, but in turn increases the computational costs.

The single- and two-stage paradigm can be transferred from 2DOD to 3DOD. Beyond that, we want to further distinguish between the detection techniques described in the following.

9.4. Anchor-based Detection

Many modern object detectors make use of anchor boxes, which serve as the initial guess of the bounding box prediction. The main idea behind anchor boxes is to define a certain set of boxes with different scales and ratios that are mapped densely across the image. This exhaustive selection should be able to capture all relevant objects. The boxes containing and fitting the objects best are finally retained.

Anchor boxes are boxes of predefined width and length. Both depict important hyperparameters to choose, as they need to match those of the objects in the dataset. To consider all variations of ratios and scales, it is common to choose a collection of several differently sized anchor boxes (see Figure 10).
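A minimal sketch of dense anchor generation as described above: for every feature-map position, boxes of several scales and aspect ratios are tiled onto the input image. The scales, ratios, and stride below are illustrative choices, not values prescribed by any particular model.

```python
import numpy as np

def generate_anchors(fmap_h, fmap_w, stride, scales=(32, 64, 128),
                     ratios=(0.5, 1.0, 2.0)):
    """Densely tile anchor boxes over a feature map (sketch).

    For every feature-map position, one anchor is created per (scale, ratio)
    combination and centred at the corresponding input-image location.
    Returns an (fmap_h * fmap_w * len(scales) * len(ratios), 4) array of boxes
    in (x0, y0, x1, y1) image coordinates.
    """
    anchors = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the box area at scale**2 while varying the aspect ratio.
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

boxes = generate_anchors(fmap_h=4, fmap_w=4, stride=16)
print(boxes.shape)   # (144, 4): 4 x 4 positions x 3 scales x 3 ratios
```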
9.4.1. Two-Stage Detection: 2D Anchors

For 2D representations of the spatial space (e.g., BEV, RV, and RGB-D), the previously described R-CNN frameworks can be directly implemented and de facto are frequently used. Most models first predict 2D candidates from monocular representations. These predictions are then given to a subsequent network which transforms the proposals into 3D bounding boxes by applying various constraints such as depth estimation networks, geometrical constraints or 3D template matching (cf. Section 7.2.2).

As a representative example of a 2D anchor-based detection approach in 3DOD, Chen et al. (2017) use a 2D RPN on considerably sparse and low-resolution input data such as FV or BEV projections. As this data may not contain enough information for proposal generation, Chen et al. (2017) assign four 2D anchors, per frame and class, to every pixel in the BEV, RV and image feature maps, and combine these crops in a deep fusion scheme. The 2D anchors are derived from representative 3D boxes which were obtained by clustering the ground truth objects in the training set by size and restricting the orientation to 0° and 90°. Leveraging sparsity, they only compute non-empty anchors of the last convolutional feature map.

For the purpose of 3DOD, the 2D anchors are then reprojected to their original spatial dimensionality, which was derived from the ground truth boxes. Subsequently, these 3D proposals serve the final refinement regression of the bounding boxes.

Further examples are proposed by Deng and Latecki (2017), Zeng et al. (2018), Maisano et al. (2018) and Beltrán et al. (2018).

9.4.2. Two-Stage Detection: 3D Anchors

In these approaches, the points of the point cloud can serve directly as dense spatial anchors, and a prediction is performed on each of the points with two connected MLPs. Similarly, Yang et al. (2018c) define two anchors to propose a total of six 3D candidates on each point of the point cloud. To reduce the number of proposals, they apply a 2D semantic segmentation network which is mapped into the 3D space and eliminates all proposals made on background points. Subsequently, the proposals are refined and scored by a lightweight PointNet prediction network.

9.4.3. Single-Stage Detection: 2D Anchors

Similar to two-stage architectures, the 2D anchor-based single-stage detector framework can be applied straightforwardly to 2D-based representations. Exemplary representatives can be found in the works by Liang et al. (2018), Meyer et al. (2019a), Ali et al. (2019), and He et al. (2020).

Ali et al. (2019), for instance, use the average box dimensions for each object class from the ground truth dataset as 3D reference boxes and derive 2D anchors from them. Then a single-stage YOLO framework is adapted to a BEV representation, and two regression branches are added to produce the z-coordinate of the center of the proposal as well as the height of the box.

To enhance prediction quality, He et al. (2020) perform the auxiliary detection task of pointwise foreground segmentation prior to exploiting anchor-based detection. Subsequently, they estimate the object center through a 3D-CNN and only then reshape the feature maps to BEV and employ anchor-based 2D detection.

For even further improvement, Gustafsson et al. (2021) design a differentiable pooling operator for 3D to be able to extend the SA-SSD approach of He et al. (2020) with a conditional energy-based regression approach instead of the commonly used Gaussian model.

9.4.4. Single-Stage Detection: 3D Anchors

Single-stage detection networks using anchors based on 3D representations are especially committed to extracting meaningful and rich features in the first place, since the 3D detection methods are not as mature as their 2D equivalents. Likewise, the missing performance boost due to the lack of an additional refinement stage has to be compensated.

As an exemplary approach, Zhou and Tuzel (2018) introduce the seminal VFE as discriminative feature extraction (cf. Section 7.4.1). As of today, it is the state-of-the-art encoding for voxel-wise detection. Having access to these meaningful features, they only use a simple convolutional middle layer in combination with a slightly modified RPN for single-stage detection.

Observing the problem of high inference times of 3D-CNNs on volumetric representations, Sun et al. (2018) introduce a single-stage 3D-CNN, treating detection and recognition as one regression problem in a direct manner. Therefore, they develop a deep hierarchical fusion network capturing rich contextual information.

Further exemplary representatives for single-stage 3D anchor detection are proposed by Yan et al. (2018), Li et al. (2019e), and Kuang et al. (2020), which mainly build upon the success of VoxelNet.
9.5. Anchorless Detection

Anchorless detection methods are commonly based on point- or segmentwise detection estimates. Instead of generating candidates, the whole scene is densely classified, and the individual objects and their respective positions are derived in a direct fashion. Besides a bigger group of approaches using fully convolutional networks (FCNs) (Section 9.5.1), there exist several other individual solutions (Section 9.5.2) proposing anchor-free detection models.

9.5.1. Approaches Based on Fully Convolutional Networks

Rather than exploiting anchor-based region proposal networks, Li et al. (2016) were pioneers in extending the idea of fully convolutional networks (FCNs) (Long et al., 2015) to 3DOD. The proposed 3D FCN does not need candidate regions for detection but implicitly predicts objectness over the entire image. Instead of generating multiple anchors over the feature map, the bounding box is directly determined from the objectness regions. Li (2017) further extends this approach in a successor model by going from depth map data to a spatial volumetric representation derived from a LiDAR point cloud.

Kim and Kang (2017) use a two-stage approach, initially predicting candidates in a projection representation based on edge filtering. Leveraging the edge detection, objects get segmented and unique box proposals are generated based on the edge boundaries. In the second stage, the authors then apply a region-based FCN to the regions of interest.

Meyer et al. (2019a,b) both employ mean shift clustering for detection. They use an FCN to predict a distribution over 3D boxes for each point of the feature map independently. Consequently, points on the same object should predict a similar distribution. To eliminate the natural noise of the prediction, they combine the per-point predictions through mean shift clustering. Since all distributions are class-dependent and multimodal, the mean shift has to be performed for each class and modality separately. For efficiency reasons, mean shift clustering is carried out over box centers instead of box corners, reducing the dimensionality.

Further representatives using FCNs are Yang et al. (2018a,b), who use hierarchical multi-scale feature maps, and Wang and Jia (2019), who apply an FCN in a sliding-window-wise fashion. These networks output pixelwise predictions at a single stage, with each prediction corresponding to a 3D object estimate.

9.5.2. Other Approaches

Since point-based representations do not admit convolutional networks, models processing the raw point cloud need to find other solutions to apply detection mechanisms. In the following, we summarize some of the innovative developments.

A seminal work that offers a pipeline for directly working on raw point clouds was proposed by Qi et al. (2019) when introducing VoteNet. The approach integrates the synergies of 3D deep learning models for feature learning, namely PointNet++ (Qi et al., 2017b), and Hough voting (Leibe et al., 2008). Since the centroid of a 3D bounding box is most likely far from any surface point, regressing bounding box parameters based solely on point clouds is a difficult task. By considering a voting mechanism, the authors generate new points that are located close to object centroids, which are used to produce spatial location proposals for the corresponding bounding box. They argue that a voting-based detection is more compatible with sparse point sets as compared to RPNs, since RPNs have to carry out extra computations to adjust the bounding box without having an explicit object center. Furthermore, the center is likely to lie in an empty region of the point cloud.

Figure 11 illustrates the approach. First, a backbone network based on PointNet++ is used to learn features on the points and derive a subset of points (seeds). Each seed proposes a vote for the centroid by using Hough voting (votes). The votes are then grouped and processed by a proposal module to provide refined proposals (vote clusters). Eventually, the vote clusters are classified and bounding boxes are regressed.

Fig. 11. Illustration of the architecture and the steps performed by VoteNet (Qi et al., 2019)
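The vote-and-group step can be sketched as follows: seeds shifted by center offsets become votes, which are grouped by a simple radius clustering into proposal centers. The learned vote prediction and proposal network of VoteNet are intentionally not modeled; the offsets are provided as inputs and the clustering is a plain greedy stand-in.

```python
import numpy as np

def vote_and_cluster(seeds, offsets, radius=0.5):
    """Sketch of the vote-and-group step described for VoteNet.

    seeds:   (N, 3) seed points sampled from the point cloud.
    offsets: (N, 3) per-seed offsets towards the object centre (in VoteNet these
             are predicted by a network; here they are given as inputs).
    Votes are formed by shifting each seed by its offset and are then grouped
    with a greedy radius clustering; each cluster centre is a proposal.
    """
    votes = seeds + offsets
    unassigned = np.ones(len(votes), dtype=bool)
    proposals = []
    while unassigned.any():
        idx = np.flatnonzero(unassigned)[0]
        members = np.linalg.norm(votes - votes[idx], axis=1) < radius
        members &= unassigned
        proposals.append(votes[members].mean(axis=0))   # aggregate the cluster
        unassigned &= ~members
    return np.array(proposals)

# Two objects: seeds on their surfaces vote towards the (noisy) centres.
seeds = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                  [5.0, 5.0, 0.0], [5.0, 6.0, 0.0]])
offsets = np.array([[-0.9, 0.1, 0.0], [0.1, -0.9, 0.0],
                    [0.1, 0.6, 0.0], [0.1, -0.4, 0.0]])
print(vote_and_cluster(seeds, offsets))   # ~[0.1, 0.1, 0] and ~[5.1, 5.6, 0]
```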
VoteNet provides accurate detection results even though it relies solely on geometric information. To further enhance this approach, Qi et al. (2020) propose ImVoteNet. The successor complements VoteNet by utilizing the high resolution and rich texture of images in order to fuse 3D votes in point clouds with 2D votes in images.

Another voting approach is proposed by Yang et al. (2020). The authors first use a voting scheme similar to Qi et al. (2019) to generate candidate points as representatives. The candidate points are then further treated as object centers, and the surrounding points are gathered to feed an anchor-free regression head for bounding box prediction.

Pamplona et al. (2019) propose an on-road object detection method in which they eliminate the ground plane points and then establish an occupation grid representation. Bounding boxes are then extracted for occupation regions containing a threshold number of points. For classification purposes, PointNet is applied.

Shi et al. (2019) use PointNet++ for a foreground point segmentation. This contextual dense representation is further used for a bin-based 3D box regression and then refined through point cloud region pooling and a canonical transformation.

Besides their anchor-based solution, Shi et al. (2020b) also conduct experiments on an anchor-free solution, reusing the detection head of PointRCNN (Shi et al., 2019). They find that, while the anchorless variant is more memory efficient, the anchor-based strategy results in a higher object recall.

Similar to Shi et al. (2019), Li et al. (2020) also exploit a foreground segmentation of the point cloud. For each foreground point, an intersection-over-union (IoU)-sensitive proposal is produced, which leverages the attention mechanism. This is done by only taking the most relevant features into account, as well as further geometrical information of the surroundings. For the final prediction, they add a supplementary IoU-perception branch to the commonly used classification and bounding box regression branches for a more accurate instance localization.

Other than that, Zhou et al. (2020) introduce spatial-embedding-based object proposals. A pointwise semantic segmentation of the scene is used in combination with a spatial embedding for instance segmentation. The spatial embedding consists of assembling all foreground points into their corresponding object centers. After clustering, a mean bounding box is derived for each instance, which is again further refined by a network based on PointNet++.

9.6. Hybrid Detection

Next to representation and feature extraction fusion (cf. Section 8), there are also approaches that fuse detection modules. Two-stage detection frameworks, especially representation-fusion-driven models, generally prefer to exploit 3D detection methods such as anchor-based 3D CNNs, 3D FCNs, or PointNet-like architectures for the refinement stage, after an originally lightweight 2D-based estimation was performed in the first place. This offers the advantage of a precise prediction in the spatial space by 3D detection frameworks, which are otherwise too time-consuming to be applied to the whole scene. In the following, we classify models that use multiple detection techniques as hybrid detection modules.

Exemplarily, Wang and Jia (2019) apply multiple methods to finally give a prediction. First, they use an anchor-based 2D detection for proposal generation, and the proposals are then extruded as frustums into 3D space. Subsequently, an anchorless FCN detection technique is applied to classify the frustum in a sliding window fashion along the frustum axis to output the final prediction.

A frequently exercised approach is to extrude 2D proposals into the spatial space of point clouds. The pioneer of this technique is Frustum-PointNet (Qi et al., 2018), enabling PointNet for the task of object detection. Since PointNet is able to effectively segment the scene, but not to produce location estimates, the authors use preceding 2D anchor-based proposals, which are then classified by PointNet.

Likewise, Ferguson and Law (2019) as well as Shen and Stamos (2020) propose a quite similar idea. They first reduce the search space by extruding a frustum from the 2D region proposals into 3D space and then use a 3D CNN for the detection task.

Apart from extruding frustums from the initial 2D candidates, proposals in projection representations are often converted into fully specified 3D proposals in spatial space. This is possible because projections retain depth information, enabling the mapping of the 2D representation to 3D.

For example, Zhou et al. (2019), Shi et al. (2020a), and Deng et al. (2021) transfer the proposals of the 2D representation into a spatial representation not by extruding an unrestricted search space along the z-axis, but as fully defined anchor-based 3D proposals obtained by a 2D-to-3D conversion of the projections. 3D-compatible detection methods such as anchorless and anchor-based 3D-CNNs or PointNets can then be deployed for the refinement.

Chen et al. (2019b) first generate voxelized candidate boxes that are further processed in a pointwise representation during a second stage. 3D- and 2D-CNNs are stacked upon a VFE-encoded representation for proposals in the first stage, and only then is a PointNet-based network applied for the refinement of the proposals.

Ku et al. (2019) use an anchor-based monocular 2D detection to estimate the spatial centroid of the object. An object instance is then reconstructed and laid onto the proposal in the point cloud, helping to regress the final 3D bounding box.
is then reconstructed and laid on the proposal in the point cloud, parison to full 3D region proposal networks, since the possible
helping to regress the final 3D bounding box. uncertainties of a 2D detection are inherited to the hierarchical
In contrast to anchor-based detection, Gupta et al. (2019) do next step.
not execute a dense pixelwise regression of the bounding box,
but initially estimates key points in form of the bottom center
of the object. Only those key points and their nearest neighbors 10. Classification of 3D Object Detection Pipelines
are then used to produce a comparable low number of positive
anchors, accelerating the detection process.
Similar to other techniques, matching algorithms are likewise In the previous sections, we gave a comprehensive review
fused in hybrid frameworks. For example, Chabot et al. (2017) of different models and methods along the 3DOD pipeline and
use a network to output 2D bounding boxes, vehicle part coordi- emphasized representative examples for every stage with their
nates, and 3D box dimensions. Then they match the dimensions corresponding design options. In the following, we use our pro-
and parts derived in the first step with CAD templates for final posed pipeline framework from Figure 1 (Section 4) to classify
pose estimation. Further, Du et al. (2018) score the 3D frustum each 3DOD approach of our literature corpus to derive a thor-
region proposals by matching them with a predefined selection ough systematization.
consisting of three car model templates. Alike, Wang and Jia For better comparability, we distinguish all 3DOD models
(2019) use a combination of PointNet and 2D-3D consistency according to their data representation. That is, we provide sep-
constraints within the frustum to locate and classify the objects. arate classification schemes for (i) monocular models (Table 2),
Instead of commonly fusing detection techniques in a hier- (ii) RGB-D front-view-based models (Table 3), (iii) projection-
archical way, Pang et al. (2020) use 2D and 3D anchor-based based models (Table 4), (iv) volumetric grid-based models (Ta-
detection in a parallel fashion to fuse the candidates in a IoU- ble 5), (v) point-based models (Table 6), and (vi) fusion-based
sensitive way. There is no NMS performed before fusing the models (Table 7).
proposals because the 2D-3D consistency constraint between For each 3DOD approach, we provide information on the au-
the proposals eliminates most elements. thors, the year and the name, classify the underlying domain
In summary, hybrid detection approaches try to compensate for inferiorities of a single detection framework. While some of these models reach remarkable results, the harmonization of two different systems represents a major challenge. Next to the advantages, also the disadvantages of the specific techniques need to be handled. In the case of a compound solution between a 2D- and a PointNet-like technique, for example, the result offers an obvious improvement in inference speed, as the initial prediction is usually performed in a lightweight 2D detection framework limiting the search space for the PointNet. Yet, the accuracy and precision of detection is less favorable in comparison to full 3D region proposal networks, since the possible uncertainties of a 2D detection are inherited by the hierarchical next step.

10. Classification of 3D Object Detection Pipelines

In the previous sections, we gave a comprehensive review of different models and methods along the 3DOD pipeline and emphasized representative examples for every stage with their corresponding design options. In the following, we use our proposed pipeline framework from Figure 1 (Section 4) to classify each 3DOD approach of our literature corpus to derive a thorough systematization.
For better comparability, we distinguish all 3DOD models according to their data representation. That is, we provide separate classification schemes for (i) monocular models (Table 2), (ii) RGB-D front-view-based models (Table 3), (iii) projection-based models (Table 4), (iv) volumetric grid-based models (Table 5), (v) point-based models (Table 6), and (vi) fusion-based models (Table 7).
For each 3DOD approach, we provide information on the authors, the year and the name, classify the underlying domain and benchmark dataset(s), and categorize the specified design choices along the 3DOD pipeline.
The resulting classification can help researchers and practitioners alike to get a quick overview of the field, spot developments and trends over time, and identify comparable approaches for further development and benchmark purposes. As such, our classification delivers an overview of different design options, provides structured access to knowledge in terms of a 3DOD pipeline catalog, and offers a setting to position individual configurations of novel solutions on a more comparable basis.
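To illustrate what a machine-readable entry of such a pipeline catalog could look like (our sketch, not an artifact of this survey), the following Python dataclass mirrors the classification dimensions used in Tables 2–7; the example values for VoteNet are illustrative and should be checked against the corresponding table row.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PipelineCatalogEntry:
    """One row of the 3DOD classification along the pipeline stages."""
    authors: str
    year: int
    model: str
    domain: str                   # "indoor" or "outdoor"
    benchmarks: List[str] = field(default_factory=list)
    sensor: str = ""              # e.g. "mono", "stereo", "RGB-D", "LiDAR"
    representation: str = ""      # e.g. "point-based", "voxel", "BEV projection"
    feature_extraction: str = ""  # e.g. "PointNet++", "2D CNN", "3D sparse CNN"
    detection_model: str = ""     # e.g. "anchor-based (3D)", "voting", "matching"
    architecture: str = ""        # "single-stage" or "two-stage"

# Illustrative entry (values to be verified against the corresponding table):
votenet = PipelineCatalogEntry(
    authors="Qi et al.", year=2019, model="VoteNet",
    domain="indoor", benchmarks=["SUN RGB-D"],
    sensor="RGB-D", representation="point-based",
    feature_extraction="PointNet++", detection_model="voting",
    architecture="single-stage",
)
```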
[Table 2 appears here. Its columns classify each monocular model by author, year, model, domain, benchmark, sensor, representation, depth substitution, feature extraction, detection model, and architecture; the individual cell marks are not reproduced in this text version. Covered models: Fidler et al. (2012) 3D DPM; Crivellaro et al. (2015) Cluttered3DPose; Chen et al. (2016) Mono3D; Chabot et al. (2017) Deep MANTA; Mousavian et al. (2017) Deep3DBox; Payen de La Garanderie et al. (2018) 360Panoramic; Huang et al. (2018) GGN-LON; Gupta et al. (2019) KeypointCBF; Jörgensen et al. (2019) SS3D; Simonelli et al. (2019) MonoDIS; Li et al. (2019a) GS3D; Qin et al. (2019a) MonoGRNet; Naiden et al. (2019) Shift R-CNN; Liu et al. (2019a) FQNet; Brazil and Liu (2019) M3D-RPN; Barabanau et al. (2020) Keypoint3D; Qin et al. (2019b) TLNet; Ku et al. (2019) MonoPSR.]
Table 2. Classification of monocular 3DOD models (360°: 360°-monocular image, PC: point cloud, 2D-3D: 2D-3D consistency, GP: groundplane)
[Column headers of Table 3 (RGB-D front-view-based models) and Table 4 (projection-based models) appear here; the corresponding table bodies and captions are not reproduced in this text version.]
[Table 5 appears here. Its per-model classification grid (domain, benchmark, sensor, representation, initialization, feature extraction, detection model, architecture) is not reproduced in this text version. Covered models: Song and Xiao (2014) Sliding Shapes; Chen et al. (2015) 3DOP; Wang and Posner (2015) Vote3D; Ren and Sudderth (2016) COG 1.0; Engelcke et al. (2017) Vote3Deep; Li (2017) 3D-FCN; Yan et al. (2018) SECOND; Zhou and Tuzel (2018) VoxelNet; Liu et al. (2018a) 3D SelSearch; Ren and Sudderth (2018) LSS; Sun et al. (2018) 3D CNN; Li et al. (2019e) 3DBN; Lang et al. (2019) PointPillars; Ren and Sudderth (2020) COG 2.0; Shi et al. (2020b) Part-A^2 Net; Kuang et al. (2020) Voxel-FPN.]
Table 5. Classification of volumetric grid-based 3DOD models (SVM: support vector machine, Vote: voting scheme, FCN: fully convolutional network)
[Column headers of Table 6 (point-based models) appear here; the table body and caption are not reproduced in this text version.]
[Column headers of Table 7 (fusion-based models) appear here; the table body is not reproduced in this text version.]
Table 7. Classification of fusion-based 3DOD models (BEV: bird’s eye view, RV: range view, V: voxel, PC: point cloud, 2D-3D: 2D-3D consistency (geometrical constraint), H: handcrafted, VFE: voxel feature encoding, DL: deep learning, P: PointNet, P++: PointNet++, 3D S.-CNN: 3D sparse CNN, Vote: voting scheme)
11. Concluding Remarks and Outlook

3D object detection is a vivid research field with a great variety of different approaches. The additional third dimension compared to 2D vision forces researchers to explore completely new methods, while mature 2DOD solutions can only be adopted to a limited extent. Hence, new ideas and forms of data usage are emerging to handle the advanced problem of 3DOD, resulting in a fast-growing research field that is finely branched in its trends and approaches.
From a broader perspective, we could observe several global trends within the field. For instance, a general objective of current research is clearly to optimize the increased computation and memory requirements caused by the extra dimension of 3DOD, with the ultimate goal of finally reaching real-time detection.
Additionally, it can be seen that more recent approaches increasingly focus on fully leveraging pointwise representations, since they promise the best conception of spatial structure. Within the literature body of this work, PointNet-based approaches remain the only method so far that can directly process the raw point representation.
Furthermore, we observe that the fusion of feature extraction and detection techniques, as well as of data representations, is among the most popular approaches to challenge common problems of object detection, such as amodal perception, instance variety, and noisy data. For feature fusion approaches, the development of attention mechanisms to efficiently fuse features based on their relevance is a major trend. Additionally, the introduction of continuous convolutions facilitates the complex modality mapping. In general, hybrid detection models enjoy popularity for exploiting lightweight proposals to restrict the search space for more performant but heavy refinement techniques.
Summarizing, this work aimed to complement previous surveys, such as those from Arnold et al. (2019), Guo et al. (2021) and Fernandes et al. (2021), by closing the gap left by surveys that focus only on a single domain and/or on specific methods of 3D object detection. Therefore, our search was narrowed only to the extent that the relevant literature should provide a design of an entire pipeline for 3D object detection. We purposely included all available approaches independent from the varieties of data inputs, data representations, feature extraction approaches, and detection methods. Therefore, we reviewed an exhaustively searched literature corpus published between 2012 and 2021, including more than 100 approaches from both indoor applications as well as autonomous driving applications. Since these two application areas cover the vast majority of existing literature, our survey should not be subject to the risk of missing major developments and trends.
Furthermore, a particular goal of this survey was to give an overview over all aspects of the 3DOD research field. Therefore, we provided a systematization of 3DOD methods along the model pipeline with a proposed abstraction level that is meant to be neither too coarse nor too specific. As a result, it was possible to classify all models within our literature corpus in order to structure the field, highlight emerging trends, and guide future research.
At the same time, however, it should be noted that the several stages of the 3DOD pipeline can be designed with much broader variety and that each stage therefore deserves a much closer investigation in subsequent studies. Fernandes et al. (2021), for instance, go into much further detail for the feature extraction stage by aiming to organize the entanglement of different extraction paradigms. Yet, we believe that a full conception and systematization of the entire field has not been reached.
In addition, we would like to acknowledge that, in individual cases, it might be difficult to draw a strict boundary for the classification of models regarding design choices and stages. 3D object detection is a highly complex and multi-faceted field of research, and knowledge from 2D and 3D computer vision as well as continuous progress in artificial intelligence and machine learning are getting fused.
Thus, our elaboration of the specific stages marks a broad orientation within the configuration of a 3DOD model and should rather be seen as a collection of possibilities than an ultimate and isolated choice of design options. Especially modern models often jump between these stages and do not follow a linear way along the pipeline, making a strict classification challenging.
For future research, we suggest looking into combinations of methods along all stages of the 3DOD pipeline. We recommend examining these aspects independent of the pipeline, since fusion of techniques often occurs in a non-linear way.
Finally, this work could support the practical creation of individual modules or even a whole new 3DOD model, since the systematization along the pipeline can serve as an orientation for design choices within the specific stages.
Abbreviations

2DOD 2D object detection
3DOD 3D object detection
6DoF Six degrees of freedom
BEV Bird’s eye view
CAD Computer-aided design
CNN Convolutional neural networks
COG Cloud of oriented gradients
DORN Deep ordinal regression network
FCN Fully convolutional networks
FPS Farthest point sampling
FV Front view
HOG Histogram of oriented gradients
IoU Intersection-over-union
LiDAR Light Detection and Ranging
LSS Latent support surfaces
MLP Multi-layer perceptron
NMS Non-maximum suppression
R-CNN Region-based CNN
RGB-D RGB-Depth
ROI Regions of interest
RPN Region proposal network
RV Range view
SIFT Scale-invariant feature transform
SSD Single shot detection
SVM Support vector machine
TOF Time-of-Flight
VFE Voxel feature encoding
YOLO You Only Look Once

References

Ahmadyan, A., Zhang, L., Ablavatski, A., Wei, J., Grundmann, M., 2021. Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Nashville, TN, USA. pp. 7818–7827. doi:10.1109/CVPR46437.2021.00773.
Ali, W., Abdelkarim, S., Zidan, M., Zahran, M., Sallab, A.E., 2019. YOLO3D: End-to-End Real-Time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud, in: Leal-Taixé, L., Roth, S. (Eds.), Computer Vision – ECCV 2018 Workshops, Springer International Publishing. pp. 716–728. doi:10.1007/978-3-030-11015-4_54.
Arnold, E., Al-Jarrah, O.Y., Dianati, M., Fallah, S., Oxtoby, D., Mouzakitis, A., 2019. A survey on 3d object detection methods for autonomous driving applications. IEEE Transactions on Intelligent Transportation Systems 20, 3782–3795. doi:10.1109/TITS.2019.2892405.
Barabanau, I., Artemov, A., Burnaev, E., Murashkin, V., 2020. Monocular 3D Object Detection via Geometric Reasoning on Keypoints, in: Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, SCITEPRESS - Science and Technology Publications, Valletta, Malta. pp. 652–659. doi:10.5220/0009102506520659.
Bello, S.A., Yu, S., Wang, C., Adam, J.M., Li, J., 2020. Review: Deep Learning on 3D Point Clouds. Remote Sensing 12, 1729. doi:10.3390/rs12111729.
Beltrán, J., Guindel, C., Moreno, F.M., Cruzado, D., García, F., De La Escalera, A., 2018. BirdNet: A 3D Object Detection Framework from LiDAR Information, in: 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3517–3523. doi:10.1109/ITSC.2018.8569311.
Bishop, C.M., 2006. Pattern recognition and machine learning. Information science and statistics, Springer.
Brazil, G., Liu, X., 2019. M3D-RPN: Monocular 3D Region Proposal Network for Object Detection, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9286–9295. doi:10.1109/ICCV.2019.00938.
Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O., 2020. nuScenes: A Multimodal Dataset for Autonomous Driving, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Seattle, WA, USA. pp. 11618–11628. doi:10.1109/CVPR42600.2020.01164.
Chabot, F., Chaouch, M., Rabarisoa, J., Teuliere, C., Chateau, T., 2017. Deep MANTA: A Coarse-to-Fine Many-Task Network for Joint 2D and 3D Vehicle Analysis from Monocular Image, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Honolulu, HI. pp. 1827–1836. doi:10.1109/CVPR.2017.198.
Chen, Q., Tang, S., Yang, Q., Fu, S., 2019a. Cooper: Cooperative Perception for Connected Autonomous Vehicles Based on 3D Point Clouds, in: 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 514–524. doi:10.1109/ICDCS.2019.00058.
Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., Urtasun, R., 2016. Monocular 3D Object Detection for Autonomous Driving, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2147–2156. doi:10.1109/CVPR.2016.236.
Chen, X., Kundu, K., Zhu, Y., Berneshawi, A.G., Ma, H., Fidler, S., Urtasun, R., 2015. 3D Object Proposals for Accurate Object Class Detection, in: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 28. Curran Associates, Inc., pp. 424–432.
Chen, X., Kundu, K., Zhu, Y., Ma, H., Fidler, S., Urtasun, R., 2018. 3D Object Proposals Using Stereo Imagery for Accurate Object Class Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 1259–1272. doi:10.1109/TPAMI.2017.2706685.
Chen, X., Ma, H., Wan, J., Li, B., Xia, T., 2017. Multi-view 3D Object Detection Network for Autonomous Driving, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6526–6534. doi:10.1109/CVPR.2017.691.
Chen, Y., Liu, S., Shen, X., Jia, J., 2019b. Fast point r-CNN, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE. pp. 9774–9783. doi:10.1109/ICCV.2019.00987.
Crivellaro, A., Rad, M., Verdie, Y., Yi, K.M., Fua, P., Lepetit, V., 2015. A Novel Representation of Parts for Accurate 3D Object Detection and Tracking in Monocular Images, in: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4391–4399. doi:10.1109/ICCV.2015.499.
Dalal, N., Triggs, B., 2005. Histograms of Oriented Gradients for Human Detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), IEEE. pp. 886–893. doi:10.1109/CVPR.2005.177.
Davies, E.R., 2012. Computer and machine vision: theory, algorithms, practicalities. 4th ed., Elsevier.
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., Li, H., 2021. Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection. Proceedings of the AAAI Conference on Artificial Intelligence 35, 1201–1209.
Deng, Z., Latecki, J.L., 2017. Amodal Detection of 3D Objects: Inferring 3D Bounding Boxes from 2D Ones in RGB-Depth Images, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 398–406. doi:10.1109/CVPR.2017.50.
Du, X., Ang, M.H., Karaman, S., Rus, D., 2018. A general pipeline for 3d detection of vehicles, in: 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE. pp. 3194–3200. doi:10.1109/ICRA.2018.8461232.
Engelcke, M., Rao, D., Wang, D.Z., Tong, C.H., Posner, I., 2017. Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks, in: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1355–1361. doi:10.1109/ICRA.2017.7989161.
Ferguson, M., Law, K., 2019. A 2D-3D Object Detection System for Updating Building Information Models with Mobile Robots, in: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1357–1365. doi:10.1109/WACV.2019.00149.
Fernandes, D., Silva, A., Névoa, R., Simões, C., Gonzalez, D., Guevara, M., Novais, P., Monteiro, J., Melo-Pinto, P., 2021. Point-cloud based 3d object
detection and classification methods for self-driving applications: A survey Lepetit, V., 2011. Multimodal templates for real-time detection of texture-
and taxonomy. Information Fusion 68, 161–191. doi:10.1016/j.inffus. less objects in heavily cluttered scenes, in: 2011 International Confer-
2020.11.002. ence on Computer Vision, IEEE. pp. 858–865. doi:10.1109/ICCV.2011.
Fidler, S., Dickinson, S., Urtasun, R., 2012. 3D Object Detection and Viewpoint 6126326.
Estimation with a Deformable 3D Cuboid Model, in: Proceedings of the Huang, S., Qi, S., Xiao, Y., Zhu, Y., Wu, Y.N., Zhu, S.C., 2018. Cooperative
25th International Conference on Neural Information Processing Systems - Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera
Volume 1, Curran Associates Inc., USA. pp. 611–619. Pose Estimation, in: Proceedings of the 32nd International Conference on
Fischler, M.A., Bolles, R.C., 1981. Random sample consensus: a paradigm for Neural Information Processing Systems, Curran Associates Inc., USA. pp.
model fitting with applications to image analysis and automated cartogra- 206–217.
phy. Communications of the ACM 24, 381–395. doi:10.1145/358669. Huang, T., Liu, Z., Chen, X., Bai, X., 2020. EPNet: Enhancing Point Features
358692. with Image Semantics for 3D Object Detection, in: Vedaldi, A., Bischof,
Friederich, J., Zschech, P., 2020. Review and systematization of solutions for 3d H., Brox, T., Frahm, J.M. (Eds.), Computer Vision – ECCV 2020. Springer
object detection, in: 2020 15th International Conference on Wirtschaftsin- International Publishing, Cham. volume 12360, pp. 35–52. doi:10.1007/
formatik (WI), pp. 1699–1711. doi:10.30844/wi_2020_r2-friedrich. 978-3-030-58555-6_3.
Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D., 2018. Deep Ordinal Huang, Y., Chen, Y., 2020. Survey of state-of-art autonomous driving tech-
Regression Network for Monocular Depth Estimation, in: 2018 IEEE/CVF nologies with deep learning, in: 2020 IEEE 20th International Conference
Conference on Computer Vision and Pattern Recognition, pp. 2002–2011. on Software Quality, Reliability and Security Companion (QRS-C), IEEE.
doi:10.1109/CVPR.2018.00214. pp. 221–228. doi:10.1109/QRS-C51114.2020.00045.
Geiger, A., Lenz, P., Urtasun, R., 2012. Are we ready for autonomous driving? Janiesch, C., Zschech, P., Heinrich, K., 2021. Machine learning and
The KITTI vision benchmark suite, in: 2012 IEEE Conference on Com- deep learning. Electronic Markets 31, 685–695. doi:10.1007/
puter Vision and Pattern Recognition, pp. 3354–3361. doi:10.1109/CVPR. s12525-021-00475-2.
2012.6248074. Jörgensen, E., Zach, C., Kahl, F., 2019. Monocular 3D Object Detection
Giancola, S., Valenti, M., Sala, R., 2018. A Survey on 3D Cameras: Metrologi- and Box Fitting Trained End-to-End Using Intersection-over-Union Loss.
cal Comparison of Time-of-Flight, Structured-Light and Active Stereoscopy arXiv:1906.08070 [cs] , 1–10.
Technologies. SpringerBriefs in Computer Science, Springer International Kehl, W., Milletari, F., Tombari, F., Ilic, S., Navab, N., 2016. Deep Learning
Publishing. doi:10.1007/978-3-319-91761-0. of Local RGB-D Patches for 3D Object Detection and 6D Pose Estimation,
Girshick, R., 2015. Fast R-CNN, in: 2015 IEEE International Conference on in: Leibe, B., Matas, J., Sebe, N., Welling, M. (Eds.), Computer Vision
Computer Vision (ICCV), pp. 1440–1448. doi:10.1109/ICCV.2015.169. – ECCV 2016, Springer International Publishing. pp. 205–220. doi:10.
Girshick, R., Donahue, J., Darrell, T., Malik, J., 2014. Rich feature hierarchies 1007/978-3-319-46487-9_13.
for accurate object detection and semantic segmentation, in: 2014 IEEE Kim, J.U., Kang, H., 2017. LiDAR Based 3D Object Detection Using CCD
Conference on Computer Vision and Pattern Recognition, IEEE. pp. 580– Information, in: 2017 IEEE Third International Conference on Multimedia
587. doi:10.1109/CVPR.2014.81. Big Data (BigMM), pp. 303–309. doi:10.1109/BigMM.2017.59.
Godard, C., Aodha, O.M., Brostow, G.J., 2017. Unsupervised Monocular KITTI, 2021. Kitti 3dod benchmark. URL: http://www.cvlibs.net/
Depth Estimation with Left-Right Consistency, in: 2017 IEEE Confer- datasets/kitti/eval_object.php?obj_benchmark=3d.
ence on Computer Vision and Pattern Recognition (CVPR), pp. 6602–6611. Ku, J., Mozifian, M., Lee, J., Harakeh, A., Waslander, S.L., 2018. Joint
doi:10.1109/CVPR.2017.699. 3D Proposal Generation and Object Detection from View Aggregation, in:
Graham, B., 2014. Spatially-sparse convolutional neural networks. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems
arXiv:1409.6070 [cs] . (IROS), pp. 1–8. doi:10.1109/IROS.2018.8594049.
Graham, B., 2015. Sparse 3D convolutional neural networks, in: Procedings of Ku, J., Pon, A.D., Waslander, S.L., 2019. Monocular 3D Object De-
the British Machine Vision Conference 2015, British Machine Vision Asso- tection Leveraging Accurate Proposals and Shape Reconstruction, in:
ciation, Swansea. pp. 150.1–150.9. doi:10.5244/C.29.150. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Graham, B., Engelcke, M., Maaten, L.v.d., 2018. 3D Semantic Segmenta- (CVPR), pp. 11859–11868. doi:10.1109/CVPR.2019.01214.
tion with Submanifold Sparse Convolutional Networks, in: 2018 IEEE/CVF Kuang, H., Wang, B., An, J., Zhang, M., Zhang, Z., 2020. Voxel-FPN: Multi-
Conference on Computer Vision and Pattern Recognition, pp. 9224–9232. scale voxel feature aggregation for 3d object detection from LIDAR point
doi:10.1109/CVPR.2018.00961. clouds. Sensors 20, 704. doi:10.3390/s20030704.
Griffiths, D., Boehm, J., 2019. A Review on Deep Learning Techniques for Payen de La Garanderie, G., Atapour Abarghouei, A., Breckon, T.P., 2018.
3D Sensed Data Classification. Remote Sensing 11, 1499. doi:10.3390/ Eliminating the Blind Spot: Adapting 3D Object Detection and Monocu-
rs11121499. lar Depth Estimation to 360° Panoramic Imagery, in: Ferrari, V., Hebert,
Guo, Y., Wang, H., Hu, Q., Liu, H., Liu, L., Bennamoun, M., 2021. Deep M., Sminchisescu, C., Weiss, Y. (Eds.), Computer Vision – ECCV 2018,
Learning for 3D Point Clouds: A Survey. IEEE Transactions on Pattern Springer International Publishing, Cham. pp. 812–830. doi:10.1007/
Analysis and Machine Intelligence 43, 4338–4364. doi:10.1109/TPAMI. 978-3-030-01261-8_48.
2020.3005434. Lahoud, J., Ghanem, B., 2017. 2D-Driven 3D Object Detection in RGB-D Im-
Gupta, I., Rangesh, A., Trivedi, M., 2019. 3D Bounding Boxes for Road ages, in: 2017 IEEE International Conference on Computer Vision (ICCV),
Vehicles: A One-Stage, Localization Prioritized Approach Using Single pp. 4632–4640. doi:10.1109/ICCV.2017.495.
Monocular Images, in: Leal-Taixé, L., Roth, S. (Eds.), Computer Vision – Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O., 2019.
ECCV 2018 Workshops. Springer International Publishing, Cham. volume PointPillars: Fast encoders for object detection from point clouds, in:
11133 of Lecture Notes in Computer Science, pp. 626–641. doi:10.1007/ 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
978-3-030-11021-5_39. (CVPR), IEEE. pp. 12689–12697. doi:10.1109/CVPR.2019.01298.
Gustafsson, F.K., Danelljan, M., Schon, T.B., 2021. Accurate 3D Object De- LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.
tection using Energy-Based Models, in: 2021 IEEE/CVF Conference on doi:10.1038/nature14539.
Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, Lefsky, M.A., Cohen, W.B., Parker, G.G., Harding, D.J., 2002. Lidar re-
Nashville, TN, USA. pp. 2849–2858. doi:10.1109/CVPRW53098.2021. mote sensing for ecosystem studies. BioScience 52, 19. doi:10.1641/
00320. 0006-3568(2002)052[0019:LRSFES]2.0.CO;2.
He, C., Zeng, H., Huang, J., Hua, X.S., Zhang, L., 2020. Structure aware single- Lehner, J., Mitterecker, A., Adler, T., Hofmarcher, M., Nessler, B., Hochre-
stage 3d object detection from point cloud, in: 2020 IEEE/CVF Conference iter, S., 2019. Patch refinement – localized 3d object detection.
on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 11870– arXiv:1910.04093 [cs] .
11879. doi:10.1109/CVPR42600.2020.01189. Leibe, B., Leonardis, A., Schiele, B., 2008. Robust object detection with inter-
He, R., Rojas, J., Guan, Y., 2017. A 3D object detection and pose estima- leaved categorization and segmentation. International Journal of Computer
tion pipeline using RGB-D images, in: 2017 IEEE International Conference Vision 77, 259–289. doi:10.1007/s11263-007-0095-3.
on Robotics and Biomimetics (ROBIO), pp. 1527–1532. doi:10.1109/ Li, B., 2017. 3d fully convolutional network for vehicle detection in point
ROBIO.2017.8324634. cloud, in: 2017 IEEE/RSJ International Conference on Intelligent Robots
Hinterstoisser, S., Holzer, S., Cagniart, C., Ilic, S., Konolige, K., Navab, N., and Systems (IROS), IEEE. pp. 1513–1518. doi:10.1109/IROS.2017.
8205955. hierarchical features from RGB-D images for amodal 3D object detection.
Li, B., Ouyang, W., Sheng, L., Zeng, X., Wang, X., 2019a. GS3D: An Neurocomputing 378, 364–374. doi:10.1016/j.neucom.2019.10.025.
Efficient 3D Object Detection Framework for Autonomous Driving, in: Ma, X., Wang, Z., Li, H., Zhang, P., Ouyang, W., Fan, X., 2019. Accu-
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition rate Monocular 3D Object Detection via Color-Embedded 3D Reconstruc-
(CVPR), pp. 1019–1028. doi:10.1109/CVPR.2019.00111. tion for Autonomous Driving, in: 2019 IEEE/CVF International Conference
Li, B., Zhang, T., Xia, T., 2016. Vehicle detection from 3d lidar using fully on Computer Vision (ICCV), pp. 6850–6859. doi:10.1109/ICCV.2019.
convolutional network, in: Robotics: Science and Systems XII, Robotics: 00695.
Science and Systems Foundation. pp. 1–8. doi:10.15607/RSS.2016.XII. Maisano, R., Tomaselli, V., Capra, A., Longo, F., Puliafito, A., 2018. Reducing
042. Complexity of 3D Indoor Object Detection, in: 2018 IEEE 4th International
Li, J., Luo, S., Zhu, Z., Dai, H., Krylov, A.S., Ding, Y., Shao, L., 2020. 3d Forum on Research and Technology for Society and Industry (RTSI), pp.
IoU-net: IoU guided 3d object detector for point clouds. arXiv:2004.04962 1–6. doi:10.1109/RTSI.2018.8548514.
[cs] . Meyer, G.P., Charland, J., Hegde, D., Laddha, A., Vallespi-Gonzalez, C.,
Li, M., Hu, Y., Zhao, N., Qian, Q., 2019b. One-stage multi-sensor data fusion 2019a. Sensor Fusion for Joint 3D Object Detection and Semantic Seg-
convolutional neural network for 3d object detection. Sensors 19, 1434. mentation, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern
doi:10.3390/s19061434. Recognition Workshops (CVPRW), pp. 1230–1237. doi:10.1109/CVPRW.
Li, P., Chen, X., Shen, S., 2019c. Stereo R-CNN Based 3D Object Detection 2019.00162.
for Autonomous Driving, in: 2019 IEEE/CVF Conference on Computer Vi- Meyer, G.P., Laddha, A., Kee, E., Vallespi-Gonzalez, C., Wellington, C.K.,
sion and Pattern Recognition (CVPR), pp. 7636–7644. doi:10.1109/CVPR. 2019b. LaserNet: An Efficient Probabilistic 3D Object Detector for Au-
2019.00783. tonomous Driving, in: 2019 IEEE/CVF Conference on Computer Vision
Li, S., Yang, L., Huang, J., Hua, X.S., Zhang, L., 2019d. Dynamic Anchor and Pattern Recognition (CVPR), pp. 12669–12678. doi:10.1109/CVPR.
Feature Selection for Single-Shot Object Detection, in: 2019 IEEE/CVF 2019.01296.
International Conference on Computer Vision (ICCV), pp. 6608–6617. Mousavian, A., Anguelov, D., Flynn, J., Košecká, J., 2017. 3D Bounding Box
doi:10.1109/ICCV.2019.00671. Estimation Using Deep Learning and Geometry, in: 2017 IEEE Confer-
Li, X., Guivant, J.E., Kwok, N., Xu, Y., 2019e. 3D Backbone Network for 3D ence on Computer Vision and Pattern Recognition (CVPR), pp. 5632–5640.
Object Detection. arXiv:1901.08373 [cs] . doi:10.1109/CVPR.2017.597.
Liang, M., Yang, B., Chen, Y., Hu, R., Urtasun, R., 2019. Multi-task multi- Naiden, A., Paunescu, V., Kim, G., Jeon, B., Leordeanu, M., 2019. Shift R-
sensor fusion for 3d object detection, in: 2019 IEEE/CVF Conference on CNN: Deep Monocular 3D Object Detection With Closed-Form Geometric
Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 7337–7345. Constraints, in: 2019 IEEE International Conference on Image Processing
doi:10.1109/CVPR.2019.00752. (ICIP), pp. 61–65. doi:10.1109/ICIP.2019.8803397.
Liang, M., Yang, B., Wang, S., Urtasun, R., 2018. Deep Continuous Fusion for Otepka, J., Ghuffar, S., Waldhauser, C., Hochreiter, R., Pfeifer, N., 2013.
Multi-sensor 3D Object Detection, in: Ferrari, V., Hebert, M., Sminchisescu, Georeferenced point clouds: A survey of features and point cloud man-
C., Weiss, Y. (Eds.), Computer Vision – ECCV 2018, Springer International agement. ISPRS International Journal of Geo-Information 2, 1038–1065.
Publishing. pp. 663–678. doi:10.1007/978-3-030-01270-0_39. doi:10.3390/ijgi2041038.
Liang, Z., Zhang, M., Zhang, Z., Zhao, X., Pu, S., 2020. RangeRCNN: To- Pamplona, J., Madrigal, C., de la Escalera, A., 2019. PointNet Evaluation for
wards fast and accurate 3d object detection with range image representation. On-Road Object Detection Using a Multi-resolution Conditioning, in: Vera-
arXiv:2009.00206 [cs] . Rodriguez, R., Fierrez, J., Morales, A. (Eds.), Progress in Pattern Recogni-
Liu, J., Chen, H., Li, J., 2018a. Faster 3D Object Detection in RGB-D Image tion, Image Analysis, Computer Vision, and Applications, Springer Interna-
Using 3D Selective Search and Object Pruning, in: 2018 Chinese Control tional Publishing. pp. 513–520. doi:10.1007/978-3-030-13469-3_60.
And Decision Conference (CCDC), pp. 4862–4866. doi:10.1109/CCDC. Pang, S., Morris, D., Radha, H., 2020. CLOCs: Camera-LiDAR Object Can-
2018.8407973. didates Fusion for 3D Object Detection, in: 2020 IEEE/RSJ International
Liu, L., Lu, J., Xu, C., Tian, Q., Zhou, J., 2019a. Deep Fitting Degree Scoring Conference on Intelligent Robots and Systems (IROS), IEEE, Las Vegas,
Network for Monocular 3D Object Detection, in: 2019 IEEE/CVF Confer- NV, USA. pp. 10386–10393. doi:10.1109/IROS45743.2020.9341791.
ence on Computer Vision and Pattern Recognition (CVPR), pp. 1057–1066. Qi, C.R., Chen, X., Litany, O., Guibas, L.J., 2020. ImVoteNet: Boosting 3D
doi:10.1109/CVPR.2019.00115. Object Detection in Point Clouds With Image Votes, in: 2020 IEEE/CVF
Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., Pietikäinen, Conference on Computer Vision and Pattern Recognition (CVPR), IEEE,
M., 2020. Deep learning for generic object detection: A survey. In- Seattle, WA, USA. pp. 4403–4412. doi:10.1109/CVPR42600.2020.
ternational Journal of Computer Vision 128, 261–318. doi:10.1007/ 00446.
s11263-019-01247-4. Qi, C.R., Hao, S., Mo, K., Leonidas, J.G., 2017a. PointNet: Deep Learn-
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C., ing on Point Sets for 3D Classification and Segmentation, in: 2017 IEEE
2016. SSD: Single Shot MultiBox Detector, in: Leibe, B., Matas, J., Sebe, Conference on Computer Vision and Pattern Recognition (CVPR), IEEE,
N., Welling, M. (Eds.), Computer Vision – ECCV 2016, Springer Interna- Honolulu, HI. pp. 77–85. doi:10.1109/CVPR.2017.16.
tional Publishing, Cham. pp. 21–37. doi:10.1007/978-3-319-46448-0_ Qi, C.R., Litany, O., He, K., Guibas, L., 2019. Deep Hough Voting for 3D
2. Object Detection in Point Clouds, in: 2019 IEEE/CVF International Con-
Liu, W., Sun, J., Li, W., Hu, T., Wang, P., 2019b. Deep learning on point clouds ference on Computer Vision (ICCV), pp. 9276–9285. doi:10.1109/ICCV.
and its application: A survey. Sensors 19, 4188. doi:10.3390/s19194188. 2019.00937.
Liu, Y., Xu, Y., Li, S.b., 2018b. 2-D Human Pose Estimation from Images Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J., 2018. Frustum PointNets for
Based on Deep Learning: A Review, in: 2018 2nd IEEE Advanced Informa- 3D Object Detection from RGB-D Data, in: 2018 IEEE/CVF Conference
tion Management,Communicates,Electronic and Automation Control Con- on Computer Vision and Pattern Recognition, pp. 918–927. doi:10.1109/
ference (IMCEC), IEEE, Xi’an. pp. 462–465. doi:10.1109/IMCEC.2018. CVPR.2018.00102.
8469573. Qi, C.R., Yi, L., Su, H., Guibas, L.J., 2017b. PointNet++: Deep Hierarchical
Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for Feature Learning on Point Sets in a Metric Space, in: Advances in Neural
semantic segmentation, in: 2015 IEEE Conference on Computer Vision and Information Processing Systems, Curran Associates, Inc.. pp. 1–10.
Pattern Recognition (CVPR), pp. 3431–3440. doi:10.1109/CVPR.2015. Qin, Z., Wang, J., Lu, Y., 2019a. MonoGRNet: A geometric reasoning network
7298965. for monocular 3d object localization. Proceedings of the AAAI Conference
Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. on Artificial Intelligence 33, 8851–8858. doi:10.1609/aaai.v33i01.
International Journal of Computer Vision 60, 91–110. doi:10.1023/B: 33018851.
VISI.0000029664.99615.94. Qin, Z., Wang, J., Lu, Y., 2019b. Triangulation Learning Network: From
Lu, H., Chen, X., Zhang, G., Zhou, Q., Ma, Y., Zhao, Y., 2019. Scanet: Spatial- Monocular to Stereo 3D Object Detection, in: 2019 IEEE/CVF Confer-
channel Attention Network for 3D Object Detection, in: ICASSP 2019 - ence on Computer Vision and Pattern Recognition (CVPR), pp. 7607–7615.
2019 IEEE International Conference on Acoustics, Speech and Signal Pro- doi:10.1109/CVPR.2019.00780.
cessing (ICASSP), pp. 1992–1996. doi:10.1109/ICASSP.2019.8682746. Rahman, M.M., Tan, Y., Xue, J., Shao, L., Lu, K., 2019. 3d object detection:
Luo, Q., Ma, H., Tang, L., Wang, Y., Xiong, R., 2020. 3D-SSD: Learning Learning 3d bounding boxes from scaled down 2d bounding boxes in RGB-d
images. Information Sciences 476, 147–158. doi:10.1016/j.ins.2018. Automation (ICRA), pp. 7276–7282. doi:10.1109/ICRA.2019.8794195.
09.040. Song, S., Lichtenberg, S.P., Xiao, J., 2015. SUN RGB-D: A RGB-D scene
Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You only look once: understanding benchmark suite, in: 2015 IEEE Conference on Computer
Unified, real-time object detection, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 567–576. doi:10.1109/CVPR.
Vision and Pattern Recognition (CVPR), IEEE. pp. 779–788. doi:10.1109/ 2015.7298655.
CVPR.2016.91. Song, S., Xiao, J., 2014. Sliding Shapes for 3D Object Detection in Depth
Ren, S., He, K., Girshick, R., Sun, J., 2017. Faster R-CNN: Towards Real-Time Images, in: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (Eds.), Com-
Object Detection with Region Proposal Networks. IEEE Transactions on puter Vision – ECCV 2014, Springer International Publishing. pp. 634–651.
Pattern Analysis and Machine Intelligence 39, 1137–1149. doi:10.1109/ doi:10.1007/978-3-319-10599-4_41.
TPAMI.2016.2577031. Song, S., Xiao, J., 2016. Deep Sliding Shapes for Amodal 3D Object Detec-
Ren, Y., Chen, C., Li, S., Kuo, C.C.J., 2018. Context-Assisted 3D (C3D) Object tion in RGB-D Images, in: 2016 IEEE Conference on Computer Vision and
Detection from RGB-D Images. Journal of Visual Communication and Im- Pattern Recognition (CVPR), pp. 808–816. doi:10.1109/CVPR.2016.94.
age Representation 55, 131–141. doi:10.1016/j.jvcir.2018.05.019. Srivastava, S., Jurie, F., Sharma, G., 2019. Learning 2D to 3D Lifting for
Ren, Z., Sudderth, E.B., 2016. Three-Dimensional Object Detection and Lay- Object Detection in 3D for Autonomous Vehicles, in: 2019 IEEE/RSJ Inter-
out Prediction Using Clouds of Oriented Gradients, in: 2016 IEEE Confer- national Conference on Intelligent Robots and Systems (IROS), pp. 4504–
ence on Computer Vision and Pattern Recognition (CVPR), pp. 1525–1533. 4511. doi:10.1109/IROS40897.2019.8967624.
doi:10.1109/CVPR.2016.169. Sun, H., Meng, Z., Du, X., Ang, M.H., 2018. A 3D Convolutional Neural Net-
Ren, Z., Sudderth, E.B., 2018. 3D Object Detection with Latent Support Sur- work Towards Real-Time Amodal 3D Object Detection, in: 2018 IEEE/RSJ
faces, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern International Conference on Intelligent Robots and Systems (IROS), pp.
Recognition, pp. 937–946. doi:10.1109/CVPR.2018.00104. 8331–8338. doi:10.1109/IROS.2018.8593837.
Ren, Z., Sudderth, E.B., 2020. Clouds of Oriented Gradients for 3D Detec- Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo,
tion of Objects, Surfaces, and Indoor Scene Layouts. IEEE Transactions on J., Zhou, Y., Chai, Y., Caine, B., Vasudevan, V., Han, W., Ngiam, J., Zhao,
Pattern Analysis and Machine Intelligence 42, 2670–2683. doi:10.1109/ H., Timofeev, A., Ettinger, S., Krivokon, M., Gao, A., Joshi, A., Zhang,
TPAMI.2019.2923201. Y., Shlens, J., Chen, Z., Anguelov, D., 2020. Scalability in Perception for
Roddick, T., Kendall, A., Cipolla, R., 2018. Orthographic Feature Transform Autonomous Driving: Waymo Open Dataset, in: 2020 IEEE/CVF Confer-
for Monocular 3D Object Detection. arXiv:1811.08188 [cs] . ence on Computer Vision and Pattern Recognition (CVPR), pp. 2443–2451.
Sager, C., Janiesch, C., Zschech, P., 2021. A survey of image labelling for doi:10.1109/CVPR42600.2020.00252.
computer vision applications. Journal of Business Analytics 4, 91–110. Tang, Y.S., Lee, G.H., 2019. Transferable Semi-Supervised 3D Object De-
doi:10.1080/2573234X.2021.1908861. tection From RGB-D Data, in: 2019 IEEE/CVF International Conference
Shen, X., Stamos, I., 2020. Frustum VoxNet for 3d object detection from on Computer Vision (ICCV), pp. 1931–1940. doi:10.1109/ICCV.2019.
RGB-d or depth images, in: 2020 IEEE Winter Conference on Applica- 00202.
tions of Computer Vision (WACV), IEEE. pp. 1687–1695. doi:10.1109/ Teng, Z., Xiao, J., 2014. Surface-based general 3D object detection and pose
WACV45572.2020.9093276. estimation, in: 2014 IEEE International Conference on Robotics and Au-
Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H., 2020a. PV- tomation (ICRA), pp. 5473–5479. doi:10.1109/ICRA.2014.6907664.
RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection, in: Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M., 2013.
2020 IEEE/CVF Conference on Computer Vision and Pattern Recogni- Selective search for object recognition. International Journal of Computer
tion (CVPR), IEEE, Seattle, WA, USA. pp. 10526–10535. doi:10.1109/ Vision 104, 154–171. doi:10.1007/s11263-013-0620-5.
CVPR42600.2020.01054. Viola, P., Jones, M.J., 2004. Robust Real-Time Face Detection. Interna-
Shi, S., Wang, X., Li, H., 2019. PointRCNN: 3D Object Proposal Gen- tional Journal of Computer Vision 57, 137–154. doi:10.1023/B:VISI.
eration and Detection From Point Cloud, in: 2019 IEEE/CVF Confer- 0000013087.49260.fb.
ence on Computer Vision and Pattern Recognition (CVPR), pp. 770–779. Wang, D.Z., Posner, I., 2015. Voting for voting in online point cloud object
doi:10.1109/CVPR.2019.00086. detection, in: Robotics: Science and Systems XI, Robotics: Science and
Shi, S., Wang, Z., Shi, J., Wang, X., Li, H., 2020b. From Points to Parts: 3D Systems Foundation. pp. 1–9. doi:10.15607/RSS.2015.XI.035.
Object Detection from Point Cloud with Part-aware and Part-aggregation Wang, G., Tian, B., Zhang, Y., Chen, L., Cao, D., Wu, J., 2020. Multi-view
Network. IEEE Transactions on Pattern Analysis and Machine Intelligence adaptive fusion network for 3d object detection. arXiv:2011.00652 [cs] .
, 1–1doi:10.1109/TPAMI.2020.2977026. Wang, L., Li, R., Shi, H., Sun, J., Zhao, L., Seah, H.S., Quah, C.K., Tandianus,
Shin, K., Kwon, Y.P., Tomizuka, M., 2019. RoarNet: A Robust 3D Object De- B., 2019. Multi-Channel Convolutional Neural Network Based 3D Object
tection based on RegiOn Approximation Refinement, in: 2019 IEEE Intelli- Detection for Indoor Robot Environmental Perception. Sensors 19, 1–14.
gent Vehicles Symposium (IV), pp. 2510–2515. doi:10.1109/IVS.2019. doi:10.3390/s19040893.
8813895. Wang, Y., Ye, J., 2020. An overview of 3d object detection. arXiv:2010.15614
Silberman, N., Hoiem, D., Kohli, P., Fergus, R., 2012. Indoor Segmen- [cs] .
tation and Support Inference from RGBD Images, in: Fitzgibbon, A., Wang, Z., Jia, K., 2019. Frustum ConvNet: Sliding Frustums to Aggregate
Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (Eds.), Computer Vision – Local Point-Wise Features for Amodal, in: 2019 IEEE/RSJ International
ECCV 2012, Springer, Berlin, Heidelberg. pp. 746–760. doi:10.1007/ Conference on Intelligent Robots and Systems (IROS), pp. 1742–1749.
978-3-642-33715-4_54. doi:10.1109/IROS40897.2019.8968513.
Simon, M., Amende, K., Kraus, A., Honer, J., Samann, T., Kaulbersch, H., Wang, Z., Zhan, W., Tomizuka, M., 2018. Fusing Bird’s Eye View LIDAR
Milz, S., Gross, H.M., 2019a. Complexer-YOLO: Real-Time 3D Object De- Point Cloud and Front View Camera Image for 3D Object Detection, in:
tection and Tracking on Semantic Point Clouds, in: 2019 IEEE/CVF Con- 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1–6. doi:10.1109/
ference on Computer Vision and Pattern Recognition Workshops (CVPRW), IVS.2018.8500387.
IEEE, Long Beach, CA, USA. pp. 1190–1199. doi:10.1109/CVPRW.2019. Weng, X., Kitani, K., 2019. Monocular 3D Object Detection with Pseudo-
00158. LiDAR Point Cloud, in: 2019 IEEE/CVF International Conference on Com-
Simon, M., Milz, S., Amende, K., Gross, H.M., 2019b. Complex-YOLO: puter Vision Workshop (ICCVW), pp. 857–866. doi:10.1109/ICCVW.
An Euler-Region-Proposal for Real-Time 3D Object Detection on Point 2019.00114.
Clouds, in: Leal-Taixé, L., Roth, S. (Eds.), Computer Vision – ECCV 2018 Xu, B., Chen, Z., 2018. Multi-level Fusion Based 3D Object Detection
Workshops, Springer International Publishing. pp. 197–209. doi:10.1007/ from Monocular Images, in: 2018 IEEE/CVF Conference on Computer Vi-
978-3-030-11009-3_11. sion and Pattern Recognition, pp. 2345–2353. doi:10.1109/CVPR.2018.
Simonelli, A., Bulò, S.R., Porzi, L., Lopez-Antequera, M., Kontschieder, P., 00249.
2019. Disentangling Monocular 3D Object Detection, in: 2019 IEEE/CVF Xu, D., Anguelov, D., Jain, A., 2018. PointFusion: Deep Sensor Fusion for 3D
International Conference on Computer Vision (ICCV), pp. 1991–1999. Bounding Box Estimation, in: 2018 IEEE/CVF Conference on Computer
doi:10.1109/ICCV.2019.00208. Vision and Pattern Recognition, pp. 244–253. doi:10.1109/CVPR.2018.
Sindagi, V.A., Zhou, Y., Tuzel, O., 2019. MVX-Net: Multimodal VoxelNet 00033.
for 3D Object Detection, in: 2019 International Conference on Robotics and Yamazaki, T., Sugimura, D., Hamamoto, T., 2018. Discovering Correspon-
dence Among Image Sets with Projection View Preservation For 3D Ob-
ject Detection in Point Clouds, in: 2018 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 3111–3115.
doi:10.1109/ICASSP.2018.8461677.
Yan, Y., Mao, Y., Li, B., 2018. SECOND: Sparsely Embedded Convolutional
Detection. Sensors 18, 1–17. doi:10.3390/s18103337.
Yang, B., Liang, M., Urtasun, R., 2018a. HDNET: Exploiting HD Maps for 3D
Object Detection, in: Proceedings of The 2nd Conference on Robot Learn-
ing, PMLR. pp. 146–155.
Yang, B., Luo, W., Urtasun, R., 2018b. PIXOR: Real-time 3D Object Detection
from Point Clouds, in: 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 7652–7660. doi:10.1109/CVPR.2018.00798.
Yang, Z., Sun, Y., Liu, S., Jia, J., 2020. 3DSSD: Point-Based 3D Single Stage
Object Detector, in: 2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), IEEE, Seattle, WA, USA. pp. 11037–11045.
doi:10.1109/CVPR42600.2020.01105.
Yang, Z., Sun, Y., Liu, S., Shen, X., Jia, J., 2018c. IPOD: Intensive Point-based
Object Detector for Point Cloud. arXiv:1812.05276 [cs] .
Yang, Z., Sun, Y., Liu, S., Shen, X., Jia, J., 2019. STD: Sparse-to-dense 3d ob-
ject detector for point cloud, in: 2019 IEEE/CVF International Conference
on Computer Vision (ICCV), IEEE. pp. 1951–1960. doi:10.1109/ICCV.
2019.00204.
Yoo, J.H., Kim, Y., Kim, J., Choi, J.W., 2020. 3D-CVF: Generating Joint
Camera and LiDAR Features Using Cross-view Spatial Feature Fusion for
3D Object Detection, in: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M.
(Eds.), Computer Vision – ECCV 2020. Springer International Publishing,
Cham. volume 12372, pp. 720–736. doi:10.1007/978-3-030-58583-9_
43.
Zeng, Y., Hu, Y., Liu, S., Ye, J., Han, Y., Li, X., Sun, N., 2018. RT3D: Real-
Time 3-D Vehicle Detection in LiDAR Point Cloud for Autonomous Driv-
ing. IEEE Robotics and Automation Letters 3, 3434–3440. doi:10.1109/
LRA.2018.2852843.
Zhang, H., Yang, D., Yurtsever, E., Redmill, K.A., Özgüner, U., 2020. Faraway-
frustum: Dealing with lidar sparsity for 3d object detection using fusion.
arXiv:2011.01404 [cs] .
Zhao, Z.Q., Zheng, P., Xu, S.T., Wu, X., 2019. Object detection with deep
learning: A review. IEEE Transactions on Neural Networks and Learning
Systems 30, 3212–3232. doi:10.1109/TNNLS.2018.2876865.
Zheng, W., Tang, W., Chen, S., Jiang, L., Fu, C.W., 2020. CIA-SSD: Confident
IoU-aware single-stage object detector from point cloud. arXiv:2012.03015
[cs] .
Zhong, Y., Wang, J., Peng, J., Zhang, L., 2020. Anchor Box Optimization
for Object Detection, in: 2020 IEEE Winter Conference on Applications of
Computer Vision (WACV), IEEE, Snowmass Village, CO, USA. pp. 1275–
1283. doi:10.1109/WACV45572.2020.9093498.
Zhou, D., Fang, J., Song, X., Liu, L., Yin, J., Dai, Y., Li, H., Yang, R., 2020.
Joint 3d instance segmentation and object detection for autonomous driving,
in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), IEEE. pp. 1836–1846. doi:10.1109/CVPR42600.2020.
00191.
Zhou, J., Tan, X., Shao, Z., Ma, L., 2019. FVNet: 3D Front-View Pro-
posal Generation for Real-Time Object Detection from Point Clouds, in:
2019 12th International Congress on Image and Signal Processing, BioMed-
ical Engineering and Informatics (CISP-BMEI), pp. 1–8. doi:10.1109/
CISP-BMEI48845.2019.8965844.
Zhou, Y., Tuzel, O., 2018. VoxelNet: End-to-End Learning for Point Cloud
Based 3D Object Detection, in: 2018 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pp. 4490–4499. doi:10.1109/CVPR.
2018.00472.
Zia, M.Z., Stark, M., Schindler, K., 2015. Towards Scene Understanding with
Detailed 3D Object Representations. International Journal of Computer Vi-
sion 112, 188–203. doi:10.1007/s11263-014-0780-y.