Monocular 3D Vehicle Detection Using Uncalibrated Traffic Cameras through Homography
Minghan Zhu1 , Songan Zhang1 , Yuanxin Zhong1 , Pingping Lu1 ,
Huei Peng1 and John Lenneman2
Fig. 1: The 3D vehicle detection problem is transformed to a 2D detection problem in warped bird's eye view (BEV) images. The orange lines attached to each orange box are tails, defined in Sec. III-C.1 and Fig. 4, which are regressed by the network to better handle distortions in BEV images.

*This work was supported by the Collaborative Safety Research Center at the Toyota Motor North America Research & Development.
1 M. Zhu, S. Zhang, Y. Zhong, P. Lu, and H. Peng are with the University of Michigan, Ann Arbor, MI 48109, USA. {minghanz, songanz, zyxin, pingpinl, hpeng}@umich.edu
2 J. Lenneman is with the Collaborative Safety Research Center at the Toyota Motor North America Research & Development, Ann Arbor, MI 48105, USA. [email protected]

I. INTRODUCTION

Traffic cameras are widely deployed today to monitor traffic conditions, especially around intersections. Camera vision algorithms are developed to automate various tasks including vehicle and pedestrian detection, tracking [1], and re-identification [2]. However, most of them work in the 2D image space. In this paper, we consider the task of monocular 3D detection, which is to detect the targets and to estimate their positions and poses in the 3D world space from a single image. It could enable us to better understand the behaviors of the targets observed by a traffic camera.

Monocular camera 3D object detection is a non-trivial task since images lack depth information. A general strategy is to leverage priors on the sizes of the objects of interest and the consistency between 3D and 2D detections established through extrinsic and intrinsic parameters. To help improve the performance, a series of datasets have been published with 3D object annotations associated with images. [3]–[5] are datasets in driving scenarios, and [6] contains object-centric video clips in general daily scenarios.

However, these research efforts mostly cannot be directly applied to traffic cameras for two reasons. First, the intrinsic/extrinsic calibration information of many cameras is not available to users. Second, 3D annotations of images from these traffic cameras are lacking, while there are some with 2D annotations [7]–[9]. Some previous work tried to solve the 3D object detection problem, but posed strong assumptions such as known intrinsic/extrinsic calibration [10] or fixed orientation of the objects [11]. We extend 3D detection to a more general setup without these assumptions.

We leverage the homography between the road plane and the image plane as the only connection between the 3D world and the 2D images. The homography can be estimated conveniently using satellite images from public map services. As opposed to 3D bounding box detection, which requires full calibration, we formulate the 3D object detection problem as the detection of rotated bounding boxes in bird's eye view (BEV) images generated using the homography, see Fig. 1. The homography also enables us to synthesize images from the perspective of a traffic camera even if it is not calibrated, which in turn benefits the training of the detection network. To address the problem of shape distortion introduced by the inverse perspective mapping (IPM), we designed an innovative regression target called tailed r-box as an extension to conventional rotated bounding boxes, and introduced a dual-view network architecture.
The main contributions of this paper include:
1) We propose a method to estimate the pose and position of vehicles in the 3D world using images from a monocular uncalibrated traffic camera.
2) We propose two strategies to improve the accuracy of object detection using IPM images: (a) tailed r-box regression, (b) a dual-view network architecture.
3) We propose a data synthesis method to generate data that are visually similar to images from an uncalibrated traffic camera.
4) Our work is open-sourced and software is available for download at https://github.com/minghanz/trafcam_3d.

The remainder of this paper is organized as follows. The literature review is given in Sec. II. The proposed method for 3D detection is introduced in Sec. III. The dataset used for training and the data synthesis method are introduced in Sec. IV. The experimental setup and results are presented in Sec. V. Section VI concludes the paper and discusses future research ideas.

II. RELATED WORK

A. Monocular 3D vehicle detection

A lot of work has been done in monocular 3D vehicle detection. Our primary application is vehicle detection. Although the problem is theoretically ill-posed, most vehicles have similar shapes and sizes, allowing the network to leverage such priors jointly with the 3D-2D consistency determined by the camera intrinsics. For example, [12] employed CAD models of vehicles as priors. [13] estimated the depth from the consistency of the 2D bounding boxes and the estimated 3D box dimensions. The object depth can also be estimated using a monocular depth network module [14]. Some work proposed to transform to a different space to deal with 3D detection better. For example, [15] back-projected 2D images to 3D space using estimated depth and detected 3D bounding boxes in the 3D space directly. [16] transformed original images to the bird's eye view (BEV), where vehicles can be localized with 2D coordinates, which is similar to our work, but they did not address the challenges caused by distortion in the perspective transform. In this paper, we identify these challenges and propose new solutions.

B. Calibration and 3D vehicle detection for traffic cameras

Some previous work aimed at solving the 3D detection problem from traffic cameras, but with different setups and assumptions. The detection approach is closely coupled with the underlying calibration method, as the latter determines how to establish the 3D-2D relation. Therefore we review the detection and the calibration methods together. A common type of calibration method is based on vanishing point detection. Methods are proposed to detect vanishing points from the major direction of vehicle movement and edge-shaped landmarks in the scene [17], [18], from which the rotational part of the extrinsic matrix can be solved. The intrinsic matrix (mainly the focal length) is estimated from the average size of vehicles and that in images. With the calibration, 3D bounding boxes can be constructed from 2D bounding boxes or segmentation following the direction of vanishing points. There are two limitations of this approach. First, the calibration requires a lot of parallel landmarks and/or traffic flow in one or two dominant directions, which may not be the case in real traffic, e.g., roundabouts. Second, the construction of 3D bounding boxes assumes that all vehicles are largely aligned in the direction of the vanishing lines, which is not always true, including at curved lanes, intersections with turning vehicles, and roundabouts.

[10] avoided the limitations mentioned above by calibrating the 2D landmarks in images to 3D landmarks in Lidar scans, obtaining full calibration of the camera. It is apparently non-trivial to obtain Lidar scans for already-installed traffic cameras, which limits the practicality of applying this approach. The authors synthesized images of vehicles from CAD models on random background images as the training data. We adopt a similar approach, but we render the vehicles on the scene of the traffic cameras directly, which reduces the domain gap, while not requiring the intrinsic/extrinsic calibration of the cameras.

C. Rotated bounding box detection

Rotated bounding box detection is useful in aerial image processing, where objects generally are not aligned to a dominant direction as in our daily images, which are mostly axis-aligned to the gravity direction. Some representative works include [19], [20]. Although the regression target in this paper is very similar, the challenges are very different in that we deal with BEV images warped from the perspective of traffic cameras through IPM, which introduces severe distortion compared with original traffic camera images and more occlusions compared with native bird's eye view images (e.g., aerial images).

D. Perception networks with perspective transform

Several previous works also employed the idea of using a perspective transform to conduct perception in BEV images. [16] conducted object detection in warped BEV images, but it did not address the distortion effect in IPM, as mentioned at the end of Sec. II-A. [21], [22] addressed the distortion effect, but the discussions there are in the context of segmentation. [21], [23] studied lane detection and segmentation, respectively. They employed the perspective transform inside the network, an approach we adopt. The difference is that they mainly used it to transform results to BEV, while in this work the warping is used to fuse features from the original view images and the BEV images.

Fig. 2: Overview of the 3D vehicle detection framework.
III. PROPOSED METHOD

Our strategy is to transform the 3D vehicle detection problem to the 2D rotated bounding box detection problem in the bird's eye view. We are mainly concerned with the planar position of vehicles, and the vertical coordinate in the height direction is of little importance since we assume a vehicle is always on the (flat) ground. Under the moderate assumptions that the ground is flat and that the nonlinear distortion effect in the camera is negligible, the pixel coordinates in the bird's eye view images are simply a scaling of the real-world planar coordinates.
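To make this scaling concrete, the following minimal sketch (our own illustration, not code from the paper) converts an r-box center detected in BEV pixel coordinates back to road-plane coordinates in meters, assuming the user-defined BEV mapping H^{bev}_{world} (see Sec. III-A) is a pure scale and translation; the resolution and offset values are hypothetical.

```python
import numpy as np

# Hypothetical user-defined BEV mapping: 10 px per meter, origin offset (U0, V0).
PX_PER_M = 10.0
U0, V0 = 200.0, 300.0  # BEV pixel of the chosen world origin (assumed)

H_bev_world = np.array([[PX_PER_M, 0.0, U0],
                        [0.0, PX_PER_M, V0],
                        [0.0, 0.0, 1.0]])

def bev_px_to_world(u, v):
    """Map a BEV pixel (u, v) to road-plane coordinates (x, y) in meters."""
    p_world = np.linalg.inv(H_bev_world) @ np.array([u, v, 1.0])
    return p_world[:2] / p_world[2]

# Example: an r-box center detected at BEV pixel (450, 120).
x, y = bev_px_to_world(450.0, 120.0)
print(f"vehicle center at ({x:.1f} m, {y:.1f} m) in the chosen local frame")
```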
There are several other merits to working on bird's eye view images. First, the rotated bounding boxes of objects at different distances have a consistent scale in the bird's eye view images, making it easier to detect remote objects. Second, the rotated bounding boxes in the bird's eye view do not overlap with one another, as opposed to 2D bounding boxes in the original view. Nevertheless, working on bird's eye view images also requires us to address the challenges of distortion and occlusion, as mentioned above, and they will be discussed in more detail below in this section.

An overview of the proposed method is shown in Fig. 2. It has three parts: homography calibration, vehicle detection in warped BEV images, and data synthesis. The data synthesis part is for network training and is not directly related to the detection methodology, and is therefore introduced later in Sec. IV. In this section, the first two parts are introduced.

A. Calibration of homography

A planar homography is a mapping between two planes which preserves collinearity, represented by a 3×3 matrix. We model the homography between the original image and the bird's eye view image as a composition of two homographies:

H^{bev}_{ori} = H^{bev}_{world} H^{world}_{ori}    (1)

where s p_a = H^{a}_{b} p_b, denoting that H^{a}_{b} maps coordinates in frame b to coordinates in frame a up to a scale factor s, and p = [x, y, 1]^T is the homogeneous coordinate of a point in a plane. bev denotes the BEV image plane, world denotes the road plane in the real world, and ori denotes the original image plane. H^{bev}_{world} can be freely defined by users as long as it is a similarity transform, preserving the angles between the real-world road plane and the bird's eye view image plane. Calibration is needed for H^{world}_{ori}, denoting the homography between the original image plane and the road plane in the real world. If the intrinsic and extrinsic parameters of a camera are known or can be calibrated using existing methods, the homography can be obtained following Eq. 5. Under circumstances where the full calibration is unavailable, the homography can be estimated if corresponding points in the two planes are known. We find the corresponding points by annotating the same set of landmarks in the traffic camera image and in the map (e.g. Google Maps). Using the satellite images on the map, we can retrieve the real-world coordinates of the landmarks given a chosen local frame. With the set of corresponding points {(p^{world}_i, p^{ori}_i)}, the homography H^{world}_{ori} can be solved by Direct Linear Transformation (DLT). Given H^{bev}_{ori}, the original traffic camera images can be warped to BEV images.
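A minimal sketch of this calibration step, assuming OpenCV is available; the landmark coordinates and file name are hypothetical. The annotated world/image correspondences give H^{world}_{ori} by DLT via cv2.findHomography, the user-defined similarity H^{bev}_{world} fixes the BEV resolution, and their composition (Eq. 1) warps the camera image to BEV.

```python
import cv2
import numpy as np

# Hypothetical annotated correspondences: the same landmarks in the road plane
# (meters, read off the satellite map) and in the traffic-camera image (pixels).
pts_world = np.array([[0.0, 0.0], [20.0, 0.0], [20.0, 15.0], [0.0, 15.0], [10.0, 7.5]])
pts_ori   = np.array([[412., 530.], [980., 545.], [1105., 260.], [300., 255.], [690., 380.]])

# H_world_ori: original image plane -> road plane, solved by DLT (least squares).
H_world_ori, _ = cv2.findHomography(pts_ori, pts_world, method=0)

# H_bev_world: user-defined similarity, here 10 px per meter with a small offset.
px_per_m = 10.0
H_bev_world = np.array([[px_per_m, 0., 50.],
                        [0., px_per_m, 50.],
                        [0., 0., 1.]])

# Eq. 1: compose the two homographies.
H_bev_ori = H_bev_world @ H_world_ori

img = cv2.imread("traffic_cam_frame.jpg")              # hypothetical input frame
bev = cv2.warpPerspective(img, H_bev_ori, (800, 600))  # warped BEV image
cv2.imwrite("bev.jpg", bev)
```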
B. Rotated bounding box detection in warped bird's eye view (BEV) images

The rotated bounding box detection network in this paper is developed based on YOLOv3 [24], by extending it to support rotation prediction. We will abbreviate "rotated bounding box" as "r-box" in the following. We choose YOLOv3, which is a one-stage detector, over two-stage detectors (e.g. [25]) for the following two reasons. First, two-stage detectors have an advantage in detecting small objects and overlapping objects in crowded scenes, while in bird's eye view images the size of objects does not vary too much, and the r-boxes do not overlap. Second, one-stage detectors are faster. More recent network architectures like [26] should also work.

The network is extended to predict rotations by introducing the yaw (r) dimension in both anchors and predictions. The anchors are now of the form (l, w, r), where r ∈ [0, π], implying that we are not distinguishing the front end and rear end of vehicles in the network. Although the dimension of the anchors increased by one, we do not increase the total number of anchors, due to the fact that object size does not vary too much in our bird's eye view images. There are 9 anchors per YOLO prediction layer, and there are in total 3 YOLO layers in the network, the same as in YOLOv3. The rotation angles of the 9 anchors in a YOLO prediction layer are evenly distributed over the [0, π] interval.

The network predicts the rotational angle offsets to the anchors. Denote the angle of an anchor as r_0; only anchors with |r_0 − r_gt| < π/4 are considered positive, and for a positive anchor the rotation angle is predicted following Eq. 2.

r_pred = (σ(x) − 0.5) π/2 + r_0    (2)

where x is the output of a convolution layer, and σ(·) is the sigmoid function. It follows that |r_pred − r_0| < π/4.

The loss function for angle prediction is in Eq. 3. Note that the angular residual r_res = r_pred − r_gt ∈ (−π/2, π/2) falls in a converging basin of the sin²(·) function.

L_rotation = sin²(r_res) = sin²(r_pred − r_gt)    (3)
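Eqs. 2 and 3 can be written compactly in code; the PyTorch sketch below is our own illustration, with assumed tensor shapes and variable names rather than the released implementation.

```python
import math
import torch

def decode_rotation(x_raw, r_anchor):
    """Eq. 2: map the raw network output to an angle within pi/4 of the anchor."""
    return (torch.sigmoid(x_raw) - 0.5) * (math.pi / 2) + r_anchor

def rotation_loss(r_pred, r_gt):
    """Eq. 3: sin^2 of the angular residual; the residual of a positive anchor
    stays in (-pi/2, pi/2), a converging basin of sin^2."""
    return torch.sin(r_pred - r_gt).pow(2).mean()

# Toy example with one positive anchor at 30 degrees.
r_anchor = torch.tensor([math.pi / 6])
x_raw    = torch.randn(1, requires_grad=True)   # raw convolution output
r_pred   = decode_rotation(x_raw, r_anchor)
loss     = rotation_loss(r_pred, torch.tensor([math.pi / 4]))
loss.backward()
```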
harder to determine in BEV, and creates unnecessary burden
C. Special designs for detection in warped BEV images for the network.
With the above setup, the network is able to fulfill the 2) Dual-view network architecture: The distortion in IPM
proposed task, but the distortion introduced in the inverse makes remote objects larger than they really are, posing
perspective mapping poses some challenges to the network, difficulty for learning. To alleviate the problem caused by
which harm the performance. First, in bird’s eye view large receptive field requirements, we propose to use a dual-
images, a large portion of the pixels of vehicles are outside view network structure.
of the r-boxes. What makes it worse, when the vehicles are In the dual-view network, there are two feature extractors
crowded, the r-box area could be completely occluded and with identical structures and non-shared parameters, taking
the visible pixels of the vehicle are disjoint from the r-box BEV images and corresponding original view images as
(see Fig. 3), which makes it difficult for the network to infer. input respectively. The feature maps of original images are
Secondly, the IPM "stretches" the remote pixels, extending then transformed to BEV through IPM and concatenated with
the remote vehicles to a long shape. It requires the network to the feature maps of the BEV images. The IPM of feature
have large receptive field for each pixel to handle very large maps is similar to the IPM of raw images, with different
objects. Our proposed designs solve these two problems. homography matrices. The homography between the feature
1) Tailed r-box regression: We propose a new regression maps of original view and BEV can be calculated using
target called tailed r-box to address the problem that r-boxes Eq. 4.
could be disjoint from the visible pixels of objects. It is bev_f bev_f bev ori
Hori_f = Hbev Hori Hori_f (4)
constructed from the 3D bounding boxes in the original
bev_f ori
view. The tail is defined as the line connecting the center of where Hbev and Hori_f denotes the homography between
the bottom rectangle to that of the top rectangle of the 3D the input image coordinates and the feature map coordinates,
bounding box. After warping to BEV, the tail extends from which are mainly determined by the pooling layers and
the r-box center through the stretched body of the vehicle, convolution layers with strides. The network structure is
as shown in Fig. 4. Note that while the definition of tails shown in Fig. 5.
is in the original view images, the learning and inference With the dual-view architecture, pixels of a vehicle are
of tails can be done in the BEV images. In BEV images, spatially closer in the original view images than in the BEV
predicting tailed r-boxes corresponds to augmenting the images, making it easier to propagate information among the
prediction vector with two elements: utail , vtail , representing pixels. Then the intermediate feature warping stretches the
the offset from the center of r-box to the end of tail in BEV. information with IPM, propagating the consensus of nearby
Anchors are not parameterized with tails. pixels of an object in the original view to pixels of further
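To make the regression target concrete, the sketch below (our own illustration, assuming the 3D annotations provide the bottom-face and top-face centers in original-image pixels) constructs the tail offset (u_tail, v_tail) by projecting both centers into BEV with H^{bev}_{ori} and taking the vector from the r-box center to the projected top center.

```python
import numpy as np

def project(H, pt):
    """Apply a 3x3 homography to a 2D point (with homogeneous normalization)."""
    q = H @ np.array([pt[0], pt[1], 1.0])
    return q[:2] / q[2]

def tail_target(bottom_center_ori, top_center_ori, H_bev_ori):
    """Return the r-box center and tail offset (u_tail, v_tail) in BEV pixels.

    bottom_center_ori / top_center_ori: centers of the bottom and top faces of
    the 3D bounding box, given in original-image pixel coordinates (assumed to
    come from the 3D annotations used for synthesis).
    """
    c_bev   = project(H_bev_ori, bottom_center_ori)   # r-box center in BEV
    top_bev = project(H_bev_ori, top_center_ori)      # stretched end of the tail
    u_tail, v_tail = top_bev - c_bev
    return c_bev, (u_tail, v_tail)
```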
2) Dual-view network architecture: The distortion in IPM makes remote objects larger than they really are, posing difficulty for learning. To alleviate the problem caused by the large receptive field requirement, we propose to use a dual-view network structure.

In the dual-view network, there are two feature extractors with identical structures and non-shared parameters, taking BEV images and corresponding original view images as input, respectively. The feature maps of the original images are then transformed to BEV through IPM and concatenated with the feature maps of the BEV images. The IPM of feature maps is similar to the IPM of raw images, with different homography matrices. The homography between the feature maps of the original view and BEV can be calculated using Eq. 4.

H^{bev_f}_{ori_f} = H^{bev_f}_{bev} H^{bev}_{ori} H^{ori}_{ori_f}    (4)

where H^{bev_f}_{bev} and H^{ori}_{ori_f} denote the homographies between the input image coordinates and the feature map coordinates, which are mainly determined by the pooling layers and convolution layers with strides. The network structure is shown in Fig. 5.

Fig. 5: Dual-view network architecture. Both the original view and BEV images are taken as input. The original view feature maps are warped to BEV and concatenated with BEV feature maps. The warping (IPM) stretches the vehicles to be very long in the BEV images, posing difficulty to detection due to limited receptive field. The dual-view structure enables feature learning before warping, where the object shapes are regular and the knowledge propagation is easier.

With the dual-view architecture, pixels of a vehicle are spatially closer in the original view images than in the BEV images, making it easier to propagate information among the pixels. Then the intermediate feature warping stretches the information with IPM, propagating the consensus of nearby pixels of an object in the original view to pixels at further distances in BEV. In the experiments we show that the dual-view architecture improves the detection performance.
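As an illustration of how Eq. 4 is applied inside the dual-view fusion, here is a minimal PyTorch sketch; it assumes the kornia library for differentiable warping, a single backbone stride of 8 for both branches, and H^{bev}_{ori} given as a 3×3 float tensor. The function names are ours, not from the released code.

```python
import torch
import kornia

def feature_homography(H_bev_ori, stride_ori, stride_bev):
    """Eq. 4: compose the image/feature-map scalings with the image-level homography.

    H_ori_orif maps original feature-map coords to original image coords
    (multiply by the backbone stride); H_bevf_bev maps BEV image coords to
    BEV feature-map coords (divide by the stride).
    """
    H_ori_orif = torch.diag(torch.tensor([stride_ori, stride_ori, 1.0]))
    H_bevf_bev = torch.diag(torch.tensor([1.0 / stride_bev, 1.0 / stride_bev, 1.0]))
    return H_bevf_bev @ H_bev_ori @ H_ori_orif

def fuse_dual_view(feat_ori, feat_bev, H_bev_ori, stride=8):
    """Warp original-view features to BEV (IPM on feature maps) and concatenate."""
    B, _, Hf, Wf = feat_bev.shape
    H_f = feature_homography(H_bev_ori, stride, stride).unsqueeze(0).repeat(B, 1, 1)
    feat_ori_bev = kornia.geometry.transform.warp_perspective(feat_ori, H_f, dsize=(Hf, Wf))
    return torch.cat([feat_bev, feat_ori_bev], dim=1)
```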
IV. DATA SYNTHESIS

The lack of training data for 3D vehicle detection in traffic camera images poses difficulty in learning a high-performance detector. In this work, we adopt two approaches to synthesize training data.

1) CARLA-synthetic: Our first approach is to generate synthetic data using the simulation platform CARLA [27]. CARLA is capable of producing photo-realistic images from cameras with user-specified parameters. It is able to simulate different lighting conditions and weather conditions. It also supports camera post-processing effects, e.g., bloom and lens flares. We selected several positions in the pre-built maps and collected images from the perspective of traffic cameras.

2) Blender-synthetic: The second approach is to synthesize images by composing real traffic scene background images with rendered vehicle foregrounds from CAD models. The background images are pictures of empty roads taken by traffic cameras. The rendering and composition are done using the 3D graphics software Blender. While the CARLA images present large variety, which benefits generalization, the discrepancy between synthesized images and real images is still easily perceivable to human eyes. Composing real background images with synthesized vehicle foregrounds could be a step forward in minimizing the domain gap. The key challenge is: how to set up the camera in foreground rendering, such that when compositing the foreground and the background images together, the output looks like the foreground vehicles are lying on the ground, instead of floating in the air, despite that the camera parameters of the background images are unknown?

Fig. 6: Synthesizing images with real background captured by traffic cameras and rendered vehicles using CAD models. The intrinsic/extrinsic parameters corresponding to the background images are unknown, but we can still render visually realistic images by sampling camera parameters that keep the homography H invariant.

Our observation is that the plausibility mainly depends on the homography. In other words, if we can maintain the same homography from the road plane to the image plane in both foreground and background images, the composite images will look like the vehicles are on the ground, as seen in Fig. 6. The relation between the homography and the camera intrinsics/extrinsics is shown in Eq. 5.

s K [r_1 r_2 t] = H    (5)

where K is the intrinsic matrix, T = [r_1 r_2 r_3 t] is the extrinsic matrix of the camera, r_i is the i-th column of the T matrix, and s is a scaling factor. Given H, there are an infinite number of combinations of K and T such that the equality holds. One of them corresponds to the actual K and T of the traffic camera of the background images. However, we do not attempt to find the actual K and T. Instead, we randomly sample (K, T) tuples to render the foreground images, as long as the equality holds. In practice, we assume K = [f 0 c_x; 0 f c_y; 0 0 1] (pixels are square), and each sample of K determines a (K, T) tuple.

Notice that while this strategy lays synthetic vehicles on the ground, the perspective between the foreground and the background may be inconsistent, but this is not essential to the task of r-box detection, and experiments show that the network generalizes well to real data.
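A sketch of this sampling step as we understand it, using the standard plane-pose decomposition; the focal-length range and image size in the usage comment are hypothetical, and the helper name is ours.

```python
import numpy as np

def sample_extrinsic_from_H(H_ori_world, f, cx, cy):
    """Recover one (K, T) pair satisfying s*K*[r1 r2 t] = H for a sampled focal length.

    H_ori_world maps road-plane coordinates (meters) to image pixels.
    """
    K = np.array([[f, 0., cx], [0., f, cy], [0., 0., 1.]])
    M = np.linalg.inv(K) @ H_ori_world          # proportional to [r1 r2 t]
    lam = 1.0 / np.linalg.norm(M[:, 0])         # fix the scale s with ||r1|| = 1
    if lam * M[2, 2] < 0:                       # keep the road plane in front of the camera
        lam = -lam
    r1, r2, t = lam * M[:, 0], lam * M[:, 1], lam * M[:, 2]
    r3 = np.cross(r1, r2)
    R = np.stack([r1, r2, r3], axis=1)
    U, _, Vt = np.linalg.svd(R)                 # project onto the nearest rotation matrix
    R = U @ Vt
    return K, np.hstack([R, t[:, None]])        # T = [R | t] for the renderer

# Usage (hypothetical values): H_ori_world is the inverse of the H_world_ori
# calibrated in Sec. III-A.
# K, T = sample_extrinsic_from_H(H_ori_world, f=np.random.uniform(800, 2000),
#                                cx=960., cy=540.)
```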
V. EXPERIMENTS

The overall setup of the experiment is training on synthetic data generated following Sec. IV, and testing on real data. The training dataset contains 40k synthetic images consisting of two parts: CARLA-synthetic and Blender-synthetic. See Fig. 7 for some examples. The CARLA-synthetic set contains 15k images collected from 5 locations in 2 maps pre-built by CARLA, covering 2 four-way intersections, 1 five-way intersection, 1 three-way intersection, and 1 roundabout. The weather and lighting conditions are dynamically changed during the data collection, improving the robustness for

TABLE II: Ablation study, evaluated on the Ko-PER dataset. IoU is defined for r-boxes. d is the distance between the centers of the predicted and ground-truth r-boxes. l is the length of the ground-truth r-box. d ≤ 0.5l only evaluates the position prediction.

Network settings                  | AP (%), IoU ≥ 0.5 | AP (%), d ≤ 0.5l
r-box (similar to [16])           | 65.67             | 71.96
dual-view                         | 75.78             | 83.25
tailed r-box                      | 78.27             | 85.55
tailed r-box + dual-view (ours)   | 82.44             | 91.20
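For reference, the two matching criteria used in the table can be computed as in the sketch below (our own helpers, assuming the shapely library and a (cx, cy, l, w, r) box parameterization).

```python
import math
from shapely.geometry import Polygon

def rbox_polygon(cx, cy, l, w, r):
    """Corners of a rotated box (center cx, cy; size l, w; yaw r) as a shapely polygon."""
    c, s = math.cos(r), math.sin(r)
    pts = [(dx * c - dy * s + cx, dx * s + dy * c + cy)
           for dx, dy in [(l/2, w/2), (l/2, -w/2), (-l/2, -w/2), (-l/2, w/2)]]
    return Polygon(pts)

def rbox_iou(a, b):
    """IoU between two r-boxes, used for the IoU >= 0.5 criterion."""
    pa, pb = rbox_polygon(*a), rbox_polygon(*b)
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter)

def center_within_half_length(pred, gt):
    """The d <= 0.5*l criterion, which only evaluates the position prediction."""
    d = math.hypot(pred[0] - gt[0], pred[1] - gt[1])
    return d <= 0.5 * gt[2]
```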