
Computer Vision and Image Understanding


journal homepage: www.elsevier.com

MC-Calib: a generic and robust calibration toolbox for multi-camera systems

Francois Rameau (a), Jinsun Park (b), Oleksandr Bailo (c), In So Kweon (a,**)

(a) KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, Republic of Korea
(b) Pusan National University (PNU), Busan 46241, Republic of Korea
(c) Independent researcher

ABSTRACT

In this paper, we present MC-Calib, a novel and robust toolbox dedicated to the calibration of complex synchronized multi-camera systems using an arbitrary number of fiducial marker-based patterns. Calibration results are obtained via successive stages of refinement to reliably estimate both the poses of the calibration boards and cameras in the system. Our method is not constrained by the number of cameras, their overlapping field-of-view (FoV), or the number of calibration patterns used. Moreover, neither prior information about the camera system nor the positions of the checkerboards is required. As a result, minimal user interaction is needed to achieve an accurate and robust calibration, which makes this toolbox accessible even with limited computer vision expertise. In this work, we put a strong emphasis on the versatility and the robustness of our technique. Specifically, the hierarchical nature of our strategy allows reliably calibrating complex vision systems even in the presence of noisy measurements. Additionally, we propose a new strategy for best-suited image selection and initial parameter estimation dedicated to non-overlapping FoV cameras. Finally, our calibration toolbox is compatible with both perspective and fisheye cameras. Our solution has been validated on a large number of real and synthetic sequences including monocular, stereo, multiple overlapping cameras, non-overlapping cameras, and converging camera systems. Project page: https://github.com/rameau-fr/MC-Calib
© 2022 Elsevier Ltd. All rights reserved.

1. Introduction

The recent years have seen a rapid increase in the demand for polydioptric (multi-camera setup) vision systems in multiple fields such as autonomous vehicle navigation (Heng et al., 2019), human reconstruction (Alexiadis et al., 2016), indoor robotics applications (Urban et al., 2016; Kuo et al., 2020), and video surveillance (Rameau et al., 2014). These systems are particularly desirable since they allow covering a large field-of-view (FoV) and computing metric-scale 3D information from a scene. Despite these advantages, such systems remain complex to deploy in practice due to their tedious calibration and the absence of efficient and versatile publicly available calibration toolboxes.

While significant efforts have been invested towards robust and effective software dedicated to Structure from Motion (SfM) (Schönberger and Frahm, 2016; Moulon et al., 2016; Wu et al., 2011) and Simultaneous Localization and Mapping (SLAM) (Mur-Artal and Tardós, 2017; Rosinol et al., 2020; Qin et al., 2018), the development of novel calibration toolboxes for complex vision systems has attracted significantly less attention. As a result, most existing software for camera calibration focuses on monocular and stereo systems (Bouguet, 2004; Mei and Rives, 2007; Scaramuzza et al., 2006) but does not consider the problem of a multi-camera rig.

The relevance of our work can be understood in the context where existing calibration frameworks dedicated to polydioptric systems are often designed to deal with specific and restricted setups. For instance, a given and limited number of cameras (Bouguet, 2004), an overlapping FoV (Rehder et al., 2016), prior knowledge of the intrinsic parameters (Lébraly et al., 2010), external vision systems (Zhao et al., 2018), mirrors (Kumar et al., 2008; Lébraly et al., 2010), limited motion (Liu et al., 2016), or a pre-computed reconstruction of the environment (Lin et al., 2020; Ataer-Cansizoglu et al., 2014) are often required.

** Corresponding author: Tel.: +82-42-350-5465; e-mail: [email protected] (In So Kweon)

Fig. 1. Representative calibration result obtained with our calibration pipeline. (left) Camera rig composed of 3 Intel RealSense (Keselman et al., 2017)
cameras where the 6 infrared cameras are being calibrated, (middle) calibration result, (right) image samples from each camera.

Moreover, most of these available toolboxes are not currently compatible with converging camera systems (Yu et al., 2020), which are needed for full 3D object or human body reconstruction (Alexiadis et al., 2016). In addition, these calibration techniques have often been designed either for perspective or fisheye cameras, but rarely for both.

In this work, we propose a versatile and user-friendly toolbox compatible with any – perspective, fisheye, and hybrid – multi-camera configuration: overlapping, non-overlapping, and converging multi-camera systems. Our solution relies on fiducial markers (Munoz-Salinas, 2012) to jointly calibrate the intrinsic and extrinsic parameters of the cameras from a sequence of synchronized images without any manual interaction. The proposed approach is neither restricted by the number of cameras in the system nor by their overlapping FoV. Moreover, an arbitrary number of checkerboards and 3D calibration objects (composed of a set of planar calibration targets) can be utilized to perform the estimation of the cameras' parameters. Unlike existing approaches (Bouguet, 2004; Forbes et al., 2002), no prior information regarding these 3D objects is needed beforehand since the geometry of the objects is computed automatically. An example of a calibrated system is shown in Fig. 1.

For this complex task, we develop new techniques to drastically improve the versatility, robustness, and effectiveness of the multi-camera calibration. In particular, we design novel solutions to improve the stability and accuracy of non-overlapping camera calibration via best image selection, bootstrapped initialization, and successive non-linear refinements.

To sum up our calibration process: first, all the observed checkerboards are used to calibrate the intrinsic parameters of each camera. Then, if multiple calibration boards are visible in a single image, their relative poses are computed and stored in a graph. Under the common assumption that the boards are static (or rigidly attached), this graph is used to combine all the boards sharing covisibility (i.e., visible simultaneously in an image) to form 3D calibration objects. Following the initial estimation of the 3D geometry of these objects, their 3D structures (poses between boards) are refined via a bundle adjustment strategy.

After estimating all the 3D calibration objects in the scene, the relative poses of each camera w.r.t. the 3D objects are computed. This initial estimation is then used to compute the extrinsic parameters between each camera pair sharing an overlapping FoV in the vision system. To ensure an accurate estimation, these inter-camera poses are refined via a non-linear refinement process. From these camera pairs, we can form a graph representing all the combinations of camera pairs. The cameras in the graph forming a single connected component are then merged into groups and their relative pose w.r.t. a reference camera is computed. In this paper, we simply call these groups of cameras camera groups.

If multiple camera groups are available, the estimation of the poses between them is performed. This situation typically occurs when no overlapping FoV exists between the camera groups. For this, we employ a well-known linear hand-eye non-overlapping calibration technique (Tsai et al., 1989) between each pair of camera groups. The resulting initial poses between camera groups are used to estimate the relative pose between all the cameras w.r.t. an arbitrarily selected reference camera. The final stage of our calibration is the joint non-linear refinement of all the parameters (i.e., inter-camera poses, inter-board poses, and intrinsic parameters) via a bundle adjustment. This entire calibration process is summarized in Fig. 2.

Fig. 2. MC-Calib pipeline: 1. checkerboard detection; 2. board poses & intrinsic calibration; 3. boards grouping into objects; 4. grouping cameras as camera groups; 5. non-overlapping camera groups; 6. final merging & bundle adjustment.

2. Background

This section briefly explains cameras' intrinsic and extrinsic parameters and introduces the notations used in the paper.

2.1. Camera parameters

The main goal of our calibration pipeline is to accurately compute both the intrinsic and extrinsic parameters of a set of
cameras rigidly attached together. The intrinsic parameters of a camera refer to the set of parameters mapping the projection of the 3D world onto the image plane. Assuming no geometric distortion induced by the lens, the geometry of the sensor can be approximated by the pinhole model. With this model, the perspective projection is parametrized by the camera matrix

K = \begin{bmatrix} f & s & u_0 \\ 0 & \lambda f & v_0 \\ 0 & 0 & 1 \end{bmatrix},    (1)

encapsulating five parameters modeling the projection, namely, the focal length f, the pixels' aspect ratio \lambda, the skew parameter s (representing the pixel non-orthogonality), and the position of the principal point in the image pp = (u_0, v_0)^T (orthogonal projection of the camera center onto the image plane). Note that we assume a zero skew factor in our calibration process. A 3D point P = (X, Y, Z)^T, expressed in the camera referential, can be projected onto the image at a pixel location p = (x, y, 1)^T (homogeneous notation) as follows: p ~ KP.

In practice, the absence of geometric distortions is rarely verified. Thus, to deal with the geometric aberrations inherent to the optical design of the lenses, many distortion models have been proposed. A commonly used representation is the Brown-Conrady model (Duane, 1971) (called the Brown model in this paper), which maps the lens distortion via a polynomial function applied in the camera coordinate system as follows:

p_d = \begin{bmatrix} (1 + k_1 r^2 + k_2 r^4 + k_5 r^6) x_u + (2 k_3 x_u y_u + k_4 (r^2 + 2 x_u^2)) \\ (1 + k_1 r^2 + k_2 r^4 + k_5 r^6) y_u + (k_3 (r^2 + 2 y_u^2) + 2 k_4 x_u y_u) \end{bmatrix},    (2)

where the distorted point at the location (x_d, y_d) – on the normalized camera coordinate system – is estimated from its undistorted counterpart p_u via the distortion parameters k = (k_1, k_2, k_3, k_4, k_5) and the distance to the distortion center expressed as r = \sqrt{x_u^2 + y_u^2}. The parameters (k_1, k_2, k_5) map the radial distortion while (k_3, k_4) model the tangential distortion.

This model can accurately approximate a moderate amount of radial and tangential distortion. However, the Brown model remains relatively ineffective at representing wide field-of-view cameras, which tend to exhibit larger radial distortions (Sturm and Ramalingam, 2011). To tackle this issue, the Kannala-Brandt model (Kannala and Brandt, 2006) proposes a different polynomial form:

p_d = \begin{bmatrix} (\theta_d / r) x_u \\ (\theta_d / r) y_u \end{bmatrix},    (3)

where \theta_d = \theta (1 + k_1 \theta^2 + k_2 \theta^4 + k_3 \theta^6 + k_4 \theta^8), in which \theta = \arctan(r). In this work, we take advantage of the Brown model for perspective cameras and the Kannala-Brandt model for large field-of-view cameras (such as fisheye).

When a 3D point P is not expressed in the camera's referential, a prior rigid transformation of the point has to be applied. This rigid transformation is composed of a 3x1 translation vector t = [t_x, t_y, t_z]^T and a 3x3 orthogonal rotation matrix R. For convenience, we also employ the homogeneous transformation as a 4x4 matrix M = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}. Thus, the projection of a 3D point P_w in the world referential can be expressed as follows:

p = [x, y]^T = \mathcal{P}(R P_w + t, K, k) = \mathcal{P}(M P_w, K, k),    (4)

where \mathcal{P} is a 3D-to-2D projection operator mapping homogeneous 3D coordinates to image coordinates given the intrinsic and extrinsic parameters. The rotation matrix can also be expressed in a compact vectorial manner such as a quaternion or a Rodrigues vector. In this work, we rely on the latter, expressing the rotation matrix R via its angle-axis representation r.

Fig. 3. Different multi-camera configurations: overlapping FoV, non-overlapping FoV, converging FoV, non-globally overlapping FoV, and combinations thereof.

2.2. Diverse multi-camera configurations

This subsection briefly describes different possible arrangements of multi-camera systems. In total, we can identify four main categories of multi-camera rigs (visualized in Fig. 3): the globally or non-globally overlapping field-of-view, the non-overlapping field-of-view, and the converging field-of-view configurations. A complex system can also be composed of a combination of different camera configurations; for instance, two individual groups of cameras can be used. While theoretical and practical tools have been proposed individually for the calibration of each configuration, no unified, generic, and robust approach has yet been introduced. In this paper, we develop an efficient and generic approach to calibrate any complex multi-camera system.

2.3. Notations

An overview of our notations is presented in Fig. 4. The final goal of our calibration pipeline is to estimate the intrinsic and extrinsic parameters of N_c cameras given the observations of M_b checkerboards. The camera matrix and distortion parameters of the ith camera are noted K_{c_i} and k_{c_i} respectively. The extrinsic parameters of the ith camera are expressed with respect to a reference camera c_ref in the rig as [R^{c_i}_{c_ref} | t^{c_i}_{c_ref}], or as a homogeneous transformation M^{c_i}_{c_ref}. Similarly, the pose linking the ith camera and the jth board observed at the tth frame is written ^t M^{c_i}_{b_j}. Multiple boards can be combined together as a single object, for instance in the example of Fig. 4, o_0 = {b_0, b_1}. Similarly to the boards, the kth object can be expressed as o_k and its pose with respect to the camera observing it at the tth frame can be written ^t M^{o_k}_{c_i}. During our calibration process, we merge cameras sharing a common field-of-view into a single camera group denoted g. For instance, the group g_0 in Fig. 4 is formed by two cameras, g_0 = {c_0, c_1}. The transformations between the groups of cameras without overlap have to be estimated in the last part of our process. The transformation between two camera groups g_0 and g_1 is denoted M^{g_1}_{g_0}. Finally, the sth 3D point on the jth board is expressed as P^s_{b_j} and its observation from the ith camera at the tth frame is written ^t_{c_i} p^s_{b_j}.
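As an editorial illustration of the projection model of Eqs. (1)-(4), the following NumPy sketch projects a 3D world point into a distorted pixel location using the Brown model. It is not taken from the MC-Calib code base; the function name and the numerical values are placeholders. The Kannala-Brandt projection of Eq. (3) would replace the radial/tangential terms with the \theta_d polynomial.

    import numpy as np

    def project_brown(P_w, R, t, K, k):
        """Project a 3D world point with the pinhole + Brown model (Eqs. 1-4).
        k = (k1, k2, k3, k4, k5): radial (k1, k2, k5) and tangential (k3, k4) terms."""
        k1, k2, k3, k4, k5 = k
        X = R @ P_w + t                      # rigid transform into the camera referential
        xu, yu = X[0] / X[2], X[1] / X[2]    # normalized (undistorted) coordinates
        r2 = xu**2 + yu**2
        radial = 1 + k1 * r2 + k2 * r2**2 + k5 * r2**3
        xd = radial * xu + 2 * k3 * xu * yu + k4 * (r2 + 2 * xu**2)   # Eq. (2)
        yd = radial * yu + k3 * (r2 + 2 * yu**2) + 2 * k4 * xu * yu
        p = K @ np.array([xd, yd, 1.0])      # apply the camera matrix of Eq. (1)
        return p[:2] / p[2]

    # toy example with hypothetical parameters
    K = np.array([[700., 0., 640.], [0., 700., 360.], [0., 0., 1.]])
    k = (-0.1, 0.01, 0.0, 0.0, 0.0)
    p = project_brown(np.array([0.1, 0.05, 2.0]), np.eye(3), np.zeros(3), K, k)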
Fig. 4. Overview of a multi-camera system to be calibrated. In this figure, the inter-board, the inter-camera, inter-object, and inter-group transformations
are depicted in red, blue, purple, and orange respectively. (a) A multi-camera system composed of three cameras {c0 , c1 , c2 } observing three boards
{b0 , b1 , b2 } at a time t. Notice that the two cameras c0 and c1 have an overlapping field of view such that they form a camera group g0 while the camera c2
shares no FoV and forms a group g1 alone. (b) The two graphs used in our pipeline. Note that the objects o0 and o1 as well as the camera groups g0 and g1
are similar to the configuration of the left figure. A third object and camera group has been included to show the expandability of the approach.
[Table 1. Summary of existing camera calibration toolboxes. The compared toolboxes are Scaramuzza et al. (2006), Caron and Eynard (2011), Kalibr (Rehder et al., 2016), Mei and Rives (2007), Bouguet (2004), OpenCV (Itseez, 2015), Lin et al. (2020), Li et al. (2013), Heng et al. (2013), Liu et al. (2016), and ours; the columns indicate the supported camera models (perspective, fisheye, hybrid) and rig configurations (stereo, multi-camera, non-overlapping FoV, converging).]

3. Bibliography

Camera calibration is the initial stage of most 3D reconstruction techniques; thus, it has been an important research topic since the very beginning of photogrammetry. One of the first practical camera calibration pipelines was proposed by Tsai (1987). This approach requires a single image of the calibration pattern, assuming its coplanarity with the camera plane. While this assumption is difficult to ensure in practice, the method proposed by Zhang (2000) only needs multiple observations of a checkerboard without any restriction regarding its position. Zhang's calibration pipeline is currently the most commonly used strategy to estimate the intrinsic parameters of a perspective camera. For instance, it is implemented in the Bouguet toolbox (Bouguet, 2004), OpenCV (Itseez, 2015), and Matlab. To deal with cameras with large radial distortions, multiple ad hoc solutions have also been proposed (Scaramuzza et al., 2006; Mei and Rives, 2007).

The extension of single-camera calibration techniques to a stereo-vision system with a large overlapping field-of-view is trivial and has been implemented in most camera calibration toolboxes (Bouguet, 2004; Mei and Rives, 2007). However, these toolboxes do not extend to the calibration of more than two cameras. This limitation can be partly explained by the type of checkerboard utilized. Indeed, to calibrate a multi-camera system, indexed observations of the 3D points on the board should be visualized simultaneously by different cameras. Using a traditional checkerboard, the entire board needs to be visible (to estimate the indexing of the points), which is hardly applicable when a large number of cameras is utilized and/or if the baseline between the cameras is large. To cope with this limitation, Li et al. (2013) propose to use a randomly textured calibration pattern, on which unique keypoints can be detected, to perform the calibration of a vision system even when only a limited overlap between the fields of view is available.

More recently, the development of effective fiducial marker systems allows estimating the index of the observed corners without the need to visualize the entire board.
Among well-known Augmented Reality (AR) markers, we can mention ChArUco (Itseez, 2015) and AprilTag (Olson, 2011; Wang and Olson, 2016), which have been widely used for multi-camera calibration systems. Such specific calibration markers have drastically eased the calibration of complex vision systems (Xing et al., 2017; Rehder et al., 2016; Strauß et al., 2014).

The previously mentioned calibration approaches assume that at least a partially overlapping field of view between the cameras is available (Rehder et al., 2016) or that the checkerboards can be observed together such that they can be merged into a single 3D calibration object (Strauß et al., 2014). However, they do not allow generic non-overlapping and converging system calibration without specific priors. To conduct the calibration of non-overlapping cameras, many approaches following the hand-eye estimation strategy have been proposed (Tsai et al., 1989). In the literature, this type of calibration is often achieved under certain assumptions (Lébraly et al., 2010; Im et al., 2016): known intrinsic parameters; one board per camera is used (and this board remains the same for the entire sequence); the motion used for calibration should not be degenerate (translation in each direction).

More complex calibration strategies, involving additional hardware, have also been developed for the calibration of non-overlapping systems. For instance, in (Kumar et al., 2008), a mirror is utilized to virtually obtain a shared view of the calibration board. A more flexible technique has also been proposed by Zhao et al. (2018), where an external camera is used to compute the displacement of the multi-camera rig to be calibrated. Despite their complexity and lack of scalability, these approaches have the advantage of being more robust against degenerate motions than their hand-eye-based counterparts.

However, hand-eye-based approaches tend to be more generic and do not require any specific and cumbersome setup. A good illustration of the versatility of hand-eye-based approaches is the toolbox "Caliber" (Liu et al., 2016), which shares similarities with our technique. This toolbox has also been designed to be theoretically compatible with any configuration of cameras. However, in practice, this technique suffers from many technical shortcomings. For instance, it is not compatible with fisheye or hybrid (combination of perspective and fisheye cameras) vision systems. Moreover, no fiducial markers are used, which implies that many manual manipulations of the data are required. On the contrary, our proposed toolbox can deal with various camera distortion models and is fully automatic. Furthermore, neither the motion of the camera rig nor the number of boards is limited by our strategy.

In this work, we also rely on a hand-eye calibration strategy to estimate the pose between the non-overlapping camera groups. However, we have extended it with multiple strategies to improve the stability and accuracy of the proposed approach. Another notable difference with (Liu et al., 2016) is the overall structure of our method. In our case, we design a hierarchical strategy where problems are solved gradually with a systematic non-linear refinement, leading to good convergence. Finally, we make our method particularly robust by including a RANSAC process (preventing wrong marker detections) and robust non-linear optimizations allowing us to deal with outliers.

On the application side, the calibration of non-overlapping cameras is particularly critical for automotive vision systems to provide an all-around view of the scene. For the sake of practicability, numerous strategies have been proposed to simplify the calibration of such systems without the need for calibration boards. A seminal work has been proposed in (Heng et al., 2013), where the displacement of each camera is estimated (using visual odometry) to calibrate the system via a hand-eye calibration technique. While this approach does not require setting up multiple calibration boards, additional information to estimate the scale of the displacement (i.e. a wheel encoder or a stereo vision system) is needed. Moreover, the accuracy obtained with such a strategy is scene-dependent, and the approach is relatively complex to deploy due to the large number of stages involved. To simplify this calibration process, Ataer-Cansizoglu et al. (2014) take advantage of a prior reconstruction of an arbitrary calibration scene (from a single RGB-D camera) to calibrate a set of non-overlapping cameras. This approach, simple and effective, provides metric-scale 3D calibration estimation, but it requires a prior reconstruction of the scene and cannot guarantee repeatable calibration results.

The previously described techniques assume pre-calibrated intrinsic camera parameters. To ease the automatic calibration process further, Lin et al. (2020) propose to use a radial projection model to estimate the poses of the cameras in a pre-reconstructed 3D scene. The advantage of this strategy is that no prior intrinsic parameter is needed to estimate the extrinsic and intrinsic parameters of the cameras. While this approach is practical and versatile, a scene reconstruction at a metric scale remains a complex task that can be affected by drift and artifacts. Moreover, the approach developed by Lin et al. (2020) suffers from structural limitations related to the pose estimation via the radial projection model. For instance, the vision system cannot be calibrated unless at least two cameras have non-parallel principal axes. Admittedly, checkerboard-free strategies are very desirable for practical tasks but remain complex to deploy in practice due to their lack of accuracy, use-case limitations (i.e. they cannot be utilized for converging field-of-view camera calibration), and repeatability issues.

While this literature covers the problem of multiple camera systems used in robotics, another type of multi-camera system, which we call a converging camera system, is often needed for single-shot 3D scanners (Pesce et al., 2015) (see Fig. 3). This kind of camera system can hardly be calibrated using traditional planar patterns and often requires 3D calibration patterns. Only a few approaches address this particular problem. A representative work is (Forbes et al., 2002), where the geometry of a 3D cube is refined using multiple observations to estimate the relative poses between the cameras composing the system. The major limitation of this work is the need for prior information regarding the position of the 3D points on the calibration object. In our work, this object structure is automatically estimated without any prior 3D information provided by the user. It is worth noting that such vision systems can also be calibrated with hand-eye calibration techniques. However, such strategies require each camera to observe a unique board during the
calibration process, which drastically restricts the variety of possible motions, leading to biased calibration results. On the contrary, our technique can take advantage of the entire 3D object, allowing a wider range of motions. Therefore, our system simplifies the calibration process and avoids degenerate configurations.

Our solution is a fully functional calibration toolbox called "MC-Calib". To underline the relevance of this software, we provide a quick overview of the existing multi-camera calibration toolboxes with their inherent limitations in Table 1.

Fig. 5. Board pose estimation at a time t. Here, a camera c0 can see two boards b1 and b3 in a single frame and estimate their inter-board pose.

4. Methodology

In this section, we describe the technical details of the proposed multi-camera calibration pipeline. Before digging into the detailed description of each stage composing our strategy, we propose an overview of the entire calibration framework.

First of all, a ChArUco board detection (Sec. 4.1) is performed for all the images acquired by the cameras and the detected 2D locations are stored. These 2D observations and their respective 3D locations (expressed in their board referential) are used to initialize the intrinsic parameters (Sec. 4.2) for every camera. After estimating the internal parameters of the cameras, the pose of each camera with respect to the observed boards is estimated (Sec. 4.3) via a perspective-n-point technique.

The inter-board transformations between all boards visible in a single image are computed (Sec. 4.4) to merge the boards sharing co-visibility into 3D objects (a 3D object is a set of 3D boards). After the refinement of the 3D object structures, we group the cameras (Sec. 4.5) which have seen similar 3D objects synchronously via a graph-based strategy. At this stage, if all the cameras in the system share a globally or non-globally overlapping field of view, the calibration is finalized via a final bundle adjustment (Sec. 4.7). If multiple camera groups remain, a non-overlapping camera group calibration (Sec. 4.6) is performed between each pair of groups. This estimation is used to merge all the camera groups and the 3D objects before a final refinement yields the complete set of parameters of the camera system.

4.1. Checkerboard detection and keypoint extraction

The initial stage of our calibration process is the detection of the checkerboards in the images and the precise localization of their 2D corners (see Fig. 1). To deal with any complex setup, we propose to utilize fiducial checkerboard markers. Specifically, we take advantage of the CharucoBoard (Itseez, 2015), mixing a standard planar checkerboard pattern with ArUco fiducial markers (Garrido-Jurado et al., 2014).

During this stage, all the images from all the cameras are processed to store their 2D keypoint locations and corresponding 3D points in their board's referential. This detection is critical since the entire calibration of the system strongly depends on the robustness and accuracy of these 2D keypoints. Thus, to improve the accuracy of the calibration, we apply an effective corner refinement process (Ha et al., 2017). To avoid degenerate configurations, we additionally apply a collinearity check. Moreover, to improve the overall robustness, the boards with less than a certain percentage of visible keypoints are discarded from further consideration; we typically set this threshold to 40%. Note that this threshold can be adjusted by the user in the case of camera systems with small overlapping FoV.
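A minimal sketch of this detection stage, assuming the classic cv2.aruco interface from opencv-contrib-python (the function names changed in OpenCV >= 4.7); the dictionary, the board dimensions, and the 40% threshold below are configuration choices for illustration, not values mandated by the toolbox.

    import cv2
    import cv2.aruco as aruco

    # Hypothetical board layout; the dictionary and dimensions are configuration choices.
    dictionary = aruco.getPredefinedDictionary(aruco.DICT_6X6_250)
    board = aruco.CharucoBoard_create(5, 5, 0.04, 0.03, dictionary)  # 5x5 squares, sizes in meters
    N_CORNERS = (5 - 1) * (5 - 1)                                    # chessboard corners on this board

    def detect_charuco(gray):
        """Detect the ArUco markers, then interpolate the indexed chessboard corners."""
        corners, ids, _ = aruco.detectMarkers(gray, dictionary)
        if ids is None or len(ids) == 0:
            return None, None
        n, ch_corners, ch_ids = aruco.interpolateCornersCharuco(corners, ids, gray, board)
        if n is None or n < 0.4 * N_CORNERS:     # discard boards with too few visible corners
            return None, None
        return ch_corners, ch_ids                # 2D keypoints and their corner indices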
4.2. Intrinsic parameters initialization

For each ith camera c_i, we collect the 3D<->2D correspondence pairs from all the images containing checkerboards. These matches are used to initialize the intrinsic parameters K_{c_i} and distortion coefficients k_{c_i}. For perspective cameras, we adopt the well-known calibration technique of Zhang (2000) (Brown distortion model), while the calibration of fisheye cameras is accomplished with the implementation available in OpenCV (Itseez, 2015) (Kannala-Brandt distortion model). This initialization is relatively slow if a large number of images is used; therefore, we subsample the images by randomly selecting a subset of 50 board observations per camera. If fewer than 50 board observations are available, all images are utilized. Notice that the intrinsic parameters are refined using all the images in the next stage.

For certain complex scenarios involving a large number of non-overlapping cameras, it is sometimes tedious to acquire enough diversified viewpoints to reach an accurate intrinsic calibration of the individual cameras. Therefore, our toolbox can also use pre-computed intrinsic parameters provided by the user. As another functionality, it is compatible with hybrid systems mixing fisheye and perspective cameras, under the condition that the user specifies the type of each camera in the system.
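This initialization can be approximated with OpenCV's built-in routines, which implement Zhang's method (Brown model) and the Kannala-Brandt fisheye model. The sketch below is a simplified stand-in for MC-Calib's C++ implementation; array shapes and flags may need adjustment depending on the OpenCV version.

    import random
    import numpy as np
    import cv2

    def init_intrinsics(obj_pts, img_pts, image_size, fisheye=False, max_views=50):
        """Initialize K and the distortion coefficients from per-view 3D<->2D matches.
        obj_pts / img_pts: lists of (N_i, 3) and (N_i, 2) arrays, one entry per board view."""
        if len(obj_pts) > max_views:                       # random subsampling (Sec. 4.2)
            keep = random.sample(range(len(obj_pts)), max_views)
            obj_pts = [obj_pts[i] for i in keep]
            img_pts = [img_pts[i] for i in keep]
        if fisheye:                                        # Kannala-Brandt model
            obj = [o.reshape(1, -1, 3).astype(np.float64) for o in obj_pts]
            img = [p.reshape(1, -1, 2).astype(np.float64) for p in img_pts]
            K, D = np.zeros((3, 3)), np.zeros((4, 1))
            _, K, D, _, _ = cv2.fisheye.calibrate(obj, img, image_size, K, D,
                                                  flags=cv2.fisheye.CALIB_RECOMPUTE_EXTRINSIC)
        else:                                              # Brown model, Zhang (2000)
            obj = [o.astype(np.float32) for o in obj_pts]
            img = [p.astype(np.float32) for p in img_pts]
            _, K, D, _, _ = cv2.calibrateCamera(obj, img, image_size, None, None)
        return K, D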
4.3. Board pose estimation and intrinsic refinement

Given the initial intrinsic parameters computed in the previous stage (Sec. 4.2), we estimate the relative pose of all the cameras for each observed board. An illustration of this process for an arbitrary frame t is visible in Fig. 5. Notice that a single image can contain multiple boards; for instance, the camera c0 at frame t sees two boards b1 and b3 simultaneously, thus both transformations ^t M^{c_0}_{b_1} and ^t M^{c_0}_{b_3} are computed and stored. The estimation of these poses is achieved with a PnP algorithm (Gao et al., 2003) wrapped in a RANSAC robust estimation process (Fischler and Bolles, 1981). This RANSAC stage is intended to remove very large outliers only (e.g. reprojection errors greater than 10 pixels), improving the overall robustness of our pipeline. The inlier points are then used to refine the pose estimation of the camera w.r.t. the board via a Levenberg-Marquardt non-linear refinement.
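A compact approximation of this initialization step with OpenCV: solvePnPRansac rejects gross outliers and solvePnPRefineLM performs the Levenberg-Marquardt refinement on the inliers. The paper cites the P3P solver of Gao et al. (2003); the default iterative solver is used here for brevity, and the function/threshold names are placeholders.

    import cv2
    import numpy as np

    def estimate_board_pose(obj_pts, img_pts, K, dist, thresh_px=10.0):
        """Initial board-to-camera pose (Sec. 4.3): PnP inside a RANSAC loop.
        obj_pts: (N, 3) board points, img_pts: (N, 2) detected corners."""
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            obj_pts.astype(np.float32), img_pts.astype(np.float32), K, dist,
            reprojectionError=thresh_px, flags=cv2.SOLVEPNP_ITERATIVE)
        if not ok or inliers is None:
            return None
        # Non-linear (Levenberg-Marquardt) refinement on the inliers only.
        rvec, tvec = cv2.solvePnPRefineLM(
            obj_pts[inliers[:, 0]].astype(np.float32),
            img_pts[inliers[:, 0]].astype(np.float32), K, dist, rvec, tvec)
        return rvec, tvec, inliers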
Considering a camera c_i at the frame t, its pose w.r.t. the board b_j is refined by minimizing the following reprojection error function:

\min_{{}^t r^{c_i}_{b_j},\, {}^t t^{c_i}_{b_j}} \sum_{s=1}^{S} \left\| {}^t_{c_i} p^s_{b_j} - \mathcal{P}({}^t M^{c_i}_{b_j} P^s_{b_j}, K_{c_i}, k_{c_i}) \right\|^2,    (5)

where S is the number of corners visible on the board. Given these initial guesses for the extrinsic and intrinsic parameters, we refine them together for all the images acquired by each camera individually via the following cost function:

\min_{{}^t r^{c_i}_{b_j},\, {}^t t^{c_i}_{b_j},\, K_{c_i},\, k_{c_i}} \sum_{t=1}^{T} \sum_{j=1}^{M_b} \sum_{s=1}^{S} \left\| {}^t_{c_i} p^s_{b_j} - \mathcal{P}({}^t M^{c_i}_{b_j} P^s_{b_j}, K_{c_i}, k_{c_i}) \right\|_H \quad \forall i,    (6)

where T and M_b are respectively the number of frames and the number of boards observed by the ith camera, and \|\cdot\|_H is the Huber loss function used to ensure robustness against outliers.
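In the toolbox itself this refinement is carried out with a C++ Levenberg-Marquardt solver; the following SciPy sketch mimics Eq. (6) for a single camera with a Huber-type loss, assuming the per-view detections from the previous stages. Note that OpenCV orders the Brown coefficients as (k1, k2, p1, p2, k3), which differs from the (k1, ..., k5) ordering used in Eq. (2).

    import numpy as np
    import cv2
    from scipy.optimize import least_squares

    def refine_camera(obj_pts, img_pts, K0, d0, rvecs0, tvecs0):
        """Joint refinement of one camera's intrinsics and its per-view board poses
        (a small analogue of Eq. (6)), using a robust Huber loss."""
        n = len(obj_pts)
        x0 = np.hstack([[K0[0, 0], K0[1, 1], K0[0, 2], K0[1, 2]], d0.ravel()[:5],
                        np.concatenate([np.hstack([r.ravel(), t.ravel()])
                                        for r, t in zip(rvecs0, tvecs0)])])

        def residuals(x):
            fx, fy, cx, cy = x[:4]
            K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
            dist = x[4:9]                          # OpenCV ordering (k1, k2, p1, p2, k3)
            res = []
            for i in range(n):
                rvec = x[9 + 6 * i: 12 + 6 * i]
                tvec = x[12 + 6 * i: 15 + 6 * i]
                proj, _ = cv2.projectPoints(obj_pts[i], rvec, tvec, K, dist)
                res.append((proj.reshape(-1, 2) - img_pts[i]).ravel())
            return np.concatenate(res)

        sol = least_squares(residuals, x0, loss="huber", f_scale=1.0, method="trf")
        return sol.x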
Fig. 6. Illustration of multiple cameras observing a single 3D object at a time frame t. Since these 3 cameras share an overlapping field of view, they can be clustered together to become a single camera group (all 3 grey cameras are automatically merged into the same group by our strategy).

4.4. Boards grouping into objects

In the previous stage, we estimated the relative pose of all the cameras for every single board observed. We now attempt to find the relative poses between the boards to gather them into 3D objects. A 3D object is formed of multiple planar calibration boards; for instance, in the illustration of Fig. 4, the object o0 is formed by two planar targets b0 and b1. If two or more boards are visible in a single image, their pair-wise relative poses can be estimated. For instance, in Fig. 5 the relative pose between the two boards can be computed by the matrix multiplication {}^t M^{b_1}_{b_3} = ({}^t M^{c_0}_{b_3})^{-1} \, {}^t M^{c_0}_{b_1}.

For the sake of robustness, we gather the measurements from all the images where pairs of boards are visible together in the same image and compute their average inter-board rotation and translation. For instance, if the boards b0 and b1 have been observed together over 10 images – by one or more cameras in the rig – then the inter-board relationship will be the robust average of these 10 measurements. This strategy brings significant robustness and allows the system to be calibrated despite potential outliers.

This averaging provides a solid prior estimation of the inter-board poses. To construct the 3D objects from these inter-board poses, they are stored in a directed weighted graph as shown in Fig. 4. Connected components of this graph constitute the 3D objects. For each object, a reference board is empirically selected as the board with the lowest index. For instance, in Fig. 4 the reference boards of o0 and o2 are b0 and b3 respectively. Thus, the pose of the jth board is expressed w.r.t. its object's reference board as M^{b_ref}_{b_j}. When a 3D object is constituted of more than two boards, a Dijkstra shortest-path algorithm (Dijkstra, 1959) is used to determine the best transformation composition (to express each board in the object reference board coordinate system), assuming the edges of the graph contain the inverse of the number of observations (1/N_observations) where both boards have been seen together. This strategy favors robust paths supported by many observations.

Finally, the 3D structure of each 3D object is refined in a non-linear fashion by minimizing the following reprojection error:

\min_{r^{b_ref}_{b_j},\, t^{b_ref}_{b_j}} \sum_{i=1}^{N_c} \sum_{j=1}^{M_b} \sum_{t=1}^{T} \sum_{s=1}^{S} \left\| {}^t_{c_i} p^s_{b_j} - \mathcal{P}({}^t M^{c_i}_{b_ref} M^{b_ref}_{b_j} P^s_{b_j}, K_{c_i}, k_{c_i}) \right\|_H.    (7)
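The graph machinery of Sec. 4.4 can be sketched with networkx (an assumption of this illustration; MC-Calib uses its own C++ structures): connected components give the 3D objects, and Dijkstra's algorithm with weights 1/N_observations selects the most observed chain of transformations to the reference board.

    import networkx as nx
    import numpy as np

    def build_objects(pairwise_poses):
        """pairwise_poses: dict {(bi, bj): (M, n_obs)} with M the averaged 4x4 transform
        mapping board bj coordinates into board bi coordinates, observed n_obs times."""
        g = nx.DiGraph()
        for (bi, bj), (M, n) in pairwise_poses.items():
            g.add_edge(bi, bj, M=M, weight=1.0 / n)              # robust paths = many observations
            g.add_edge(bj, bi, M=np.linalg.inv(M), weight=1.0 / n)
        objects = []
        for comp in nx.weakly_connected_components(g):
            ref = min(comp)                                      # reference board: lowest index
            poses = {ref: np.eye(4)}
            for b in comp - {ref}:
                path = nx.dijkstra_path(g, ref, b, weight="weight")
                M = np.eye(4)
                for u, v in zip(path[:-1], path[1:]):            # compose transforms along the path
                    M = M @ g[u][v]["M"]
                poses[b] = M                                     # board b expressed in the reference board
            objects.append(poses)
        return objects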
4.5. Grouping cameras as camera groups

After the creation of the 3D objects resulting from the merging of the boards, the pose of the cameras w.r.t. the objects for all frames is estimated with a PnP algorithm, similarly to Sec. 4.3. For instance, in Fig. 6 the pose of the camera c2 w.r.t. the object o0 at a frame t is expressed as {}^t M^{c_2}_{o_0}. Since these three cameras can see a single object simultaneously, they can be merged into a single camera group g0 = {c0, c1, c2}. Following the strategy explained in Sec. 4.4, the inter-camera pose estimations from multiple frames are averaged and the cameras observing an object simultaneously are grouped as camera groups (see Fig. 6). This grouping is performed via the camera graph, depicted in Fig. 4, where each connected component forms a camera group. Once again, we wish to express the camera poses in the referential of the reference camera of the group (the camera with the lowest index value). To initialize the cameras' poses w.r.t. this reference camera, the Dijkstra (1959) algorithm is used to determine the path (maximizing the number of observations) in the graph. Finally, all the camera groups are refined individually using a Levenberg-Marquardt method by minimizing the following cost function:

\min_{r^{c_i}_{c_ref},\, t^{c_i}_{c_ref}} \sum_{i=1}^{N_g} \sum_{k=1}^{M_o} \sum_{t=1}^{T} \sum_{s=1}^{S_o} \left\| {}^t_{c_i} p^s_{o_k} - \mathcal{P}(M^{c_i}_{c_ref} \, {}^t M^{c_ref}_{o_k} P^s_{o_k}, K_{c_i}, k_{c_i}) \right\|_H,    (8)

where N_g is the number of cameras in the camera group, while M_o and S_o are the number of objects and the number of points in the object respectively.
4.6. Non-overlapping camera groups estimation

If a single camera group remains, the calibration can be finalized as described in Sec. 4.7. Otherwise, the existence of multiple camera groups implies that they do not share a common field of view. For each pair of remaining camera groups, their inter-pose is estimated via a hand-eye calibration approach (Tsai et al., 1989). Thus, these pairs of camera groups can be merged into a single final camera group. In the remainder of this section, we describe this process in detail, including the hand-eye pose estimation from a pair of non-overlapping camera groups and our robust bootstrapped initialization technique.

4.6.1. Hand-eye calibration for non-overlapping cameras

Considering two camera groups g0 and g1 (as depicted in Fig. 7), their inter-pose M^{g_1}_{g_0} can be calculated using a hand-eye calibration strategy. Assuming each group can visualize one object across multiple frames, the camera groups' displacements can be estimated individually. For instance, if the groups g0 and g1 capture objects o0 and o1 respectively across two frames t0 and t1, then their displacements can be estimated as follows:

{}^{t_1}_{t_0}M_{g_0} = {}^{t_1}M^{g_0}_{o_0} \; {}^{t_0}M^{o_0}_{g_0},    (9)
{}^{t_1}_{t_0}M_{g_1} = {}^{t_1}M^{g_1}_{o_1} \; {}^{t_0}M^{o_1}_{g_1}.    (10)

As a result, the relationship linking the camera groups can be written {}^{t_1}_{t_0}M_{g_1} M^{g_1}_{g_0} = M^{g_1}_{g_0} \, {}^{t_1}_{t_0}M_{g_0}; without loss of generality, this relation can be generalized to all the frames:

\forall t_i \in [1 \cdots T], \forall t_j \in [1 \cdots T], \quad {}^{t_j}_{t_i}M_{g_1} M^{g_1}_{g_0} = M^{g_1}_{g_0} \, {}^{t_j}_{t_i}M_{g_0}.    (11)

This problem takes the form of a system AX = XB, which can be resolved using a hand-eye calibration technique. In this work, we employ the approach proposed by Tsai et al. (1989), which consists in a hierarchical resolution of the problem: 1) rotation estimation first and 2) translation computation. To perform the rotation estimation, the authors propose to utilize a variation of the Rodrigues angle-axis representation such that the rotation vector r^{g_1}_{g_0} can be linearly resolved as follows:

\left[ {}^{t_j}_{t_i}r_{g_1} + {}^{t_j}_{t_i}r_{g_0} \right]_{\times} r^{g_1 \prime}_{g_0} = {}^{t_j}_{t_i}r_{g_0} - {}^{t_j}_{t_i}r_{g_1},    (12)

where the operator [\cdot]_{\times} stands for the transformation of a 3D vector into a skew-symmetric matrix and r^{g_1}_{g_0} = (2 r^{g_1 \prime}_{g_0}) / (1 + |r^{g_1 \prime}_{g_0}|^2). After the rotation is solved, the translation t^{g_1}_{g_0} can be computed via the following set of linear equations:

\left( {}^{t_j}_{t_i}R_{g_1} - I \right) t^{g_1}_{g_0} = R^{g_1}_{g_0} \, {}^{t_j}_{t_i}t_{g_0} - {}^{t_j}_{t_i}t_{g_1},    (13)

where I is the identity matrix. Note that at least two pairs of motions are needed to perform this hand-eye calibration. For further details, we invite the reader to refer to Tsai et al. (1989). Note that this initial calibration is achieved using only the two objects that have been seen the largest number of times simultaneously by both camera groups.

Fig. 7. Representation of two camera groups (g0 in blue and g1 in yellow) with non-overlapping fields-of-view observing one object each (for the sake of clarity and simplicity, each camera group contains a single camera and each group observes a single board in this example).
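The following sketch solves the AX = XB problem of Eq. (11) in the same two-step spirit as Tsai et al. (1989): the rotation is estimated first from the paired motion axes (here via an orthogonal Procrustes fit rather than the exact modified-Rodrigues system of Eq. (12)), and the translation is then obtained from the linear system of Eq. (13). At least two motion pairs with non-parallel, non-negligible rotations are required; the data layout is an assumption of this illustration.

    import numpy as np
    from scipy.spatial.transform import Rotation as Rot

    def hand_eye(motions_g0, motions_g1):
        """Solve A X = X B for X = M^{g1}_{g0} from paired 4x4 group displacements
        (A_i = motion of g1, B_i = motion of g0, as in Eqs. 9-11)."""
        # 1) rotation: the rotation axes of paired motions are related by R_X
        a_axes = np.array([Rot.from_matrix(A[:3, :3]).as_rotvec() for A in motions_g1])
        b_axes = np.array([Rot.from_matrix(B[:3, :3]).as_rotvec() for B in motions_g0])
        R_X, _ = Rot.align_vectors(a_axes, b_axes)       # orthogonal Procrustes fit
        R_X = R_X.as_matrix()
        # 2) translation: stack (R_Ai - I) t_X = R_X t_Bi - t_Ai and solve by least squares (Eq. 13)
        M, v = [], []
        for A, B in zip(motions_g1, motions_g0):
            M.append(A[:3, :3] - np.eye(3))
            v.append(R_X @ B[:3, 3] - A[:3, 3])
        t_X, *_ = np.linalg.lstsq(np.vstack(M), np.hstack(v), rcond=None)
        X = np.eye(4)
        X[:3, :3], X[:3, 3] = R_X, t_X
        return X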
4.6.2. Best view selection and bootstrapped initialization

The hand-eye calibration strategy can be applied to all possible combinations of frames, but the complexity of the problem grows quadratically with the number of frames, which is problematic (computation-time-wise) for large video sequences. Moreover, successive frames exhibit similar poses, which do not contribute much to the final solution. Furthermore, using all the frames at once can be problematic if the set of frames contains outliers. Therefore, we propose an effective manner to select the best views to calibrate each pair of non-overlapping camera groups in the system in a fast and robust manner.

Our best view selection is designed to maximize the diversity of the views used for the calibration, to avoid degenerate configurations, and to improve the robustness against outliers. For this purpose, for each frame, we concatenate the translational components of both camera groups to cluster the frames, as depicted in Fig. 8. Therefore, the most similar poses are assigned to the same cluster. In this situation, the rotational component does not need to be considered to ensure the diversity of the poses across clusters, since a rotation of one of the groups will inevitably lead to a translation of the second group. In our framework, a k-means clustering (Lloyd, 1982) technique is utilized with the number of clusters fixed to 20.

After this initial clustering, we initiate our bootstrapping strategy, which consists in the successive estimation of the inter-group pose via the selection of multiple mini-batches of frames. Specifically, for each iteration of our bootstrapping algorithm, 6 clusters are randomly sampled (from the initial 20 clusters), and among each of these 6 clusters one pose is chosen randomly. The resulting set of 6 pairs of poses is used to perform the hand-eye calibration (as described in Sec. 4.6) such that the inter-group pose can be computed. The validity of the set is then evaluated by estimating the consistency of the rotational solution provided by the hand-eye calibration algorithm. If the maximum rotational error in the set is superior to 5 degrees, the solution is rejected; if the set is consistent, the result is stored. This mini-batch hand-eye estimation is repeated 200 times, leading to a set of plausible inter-group poses. Finally, the estimation of the inter-group pose is obtained by computing the median value of each translation and rotation (Rodrigues representation) element that passed the rotational test. This initial solution is thereafter refined in a non-linear manner. Our bootstrapped initialization procedure has proved to be very effective against outliers and noisy measurements.
Fig. 8. Example of the resulting clustering (3 clusters: blue, yellow, and red) for a single camera orbiting around a calibration board. We can see that nearby frames belong to the same cluster. In practice, a concatenation of two non-overlapping cameras or camera groups is used for this clustering.

4.7. Merging camera groups and bundle adjustment

After the initial poses between all the non-overlapping pairs of camera groups are estimated, the camera groups and the objects are merged using a graph strategy similar to the one described in Section 4.4. Finally, the entire system (relative positions between all boards, camera poses, and the intrinsic parameters) can be refined to minimize the reprojection error over all frames:

\min_{r^{b_ref}_{b_j},\, t^{b_ref}_{b_j},\, r^{c_i}_{c_ref},\, t^{c_i}_{c_ref},\, K_{c_i},\, k_{c_i}} \sum_{i=1}^{N_c} \sum_{j=1}^{M_b} \sum_{t=1}^{T} \sum_{s=1}^{S} \left\| {}^t_{c_i} p^s_{b_j} - \mathcal{P}(M^{c_i}_{c_ref} \, {}^t M^{c_ref}_{b_ref} \, M^{b_ref}_{b_j} P^s_{b_j}, K_{c_i}, k_{c_i}) \right\|_H.    (14)

Our hierarchical calibration strategy provides an initialization that is in the vicinity of convergence, leading to very stable and accurate results.

5. Experiments

This section contains a large number of assessments on real and synthetic data. Various use cases are proposed to reflect a broad spectrum of scenarios commonly faced in practice. Multiple metrics are used to evaluate the quality of the retrieved parameters. The rotational error is calculated as follows:

\epsilon_R = \arccos\!\left( \tfrac{1}{2} \left( \mathrm{tr}(R_{est}^T R_{GT}) - 1 \right) \right),    (15)

where R_{est} is the estimated rotation matrix and R_{GT} the ground-truth rotation. Regarding the translational and internal parameters' (principal point pp and focal length) errors, we report the Euclidean distance between the ground truth and the estimated values. The reported reprojection error is the mean Euclidean distance between the detected and reprojected points for all the corners observed by all cameras. Note that for the sake of conciseness we do not provide comparative results for the aspect ratio \lambda, while the skew factor is assumed to be zero.
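The rotational error of Eq. (15) can be computed directly, e.g.:

    import numpy as np

    def rotation_error_deg(R_est, R_gt):
        """Angular error of Eq. (15), returned in degrees."""
        cos_angle = np.clip((np.trace(R_est.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
        return np.degrees(np.arccos(cos_angle))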
5.1. Experiments on rendered images

To provide a quantitative estimation of the accuracy and versatility of the proposed technique, we use Blender (Community, 2018) to generate a synthetic dataset composed of 5 different calibration scenarios (see Fig. 9): 1) Stereo system (2 cameras); 2) Non-globally overlapping vision system (3 cameras); 3) Non-overlapping system (4 cameras); 4) Unbalanced non-overlapping vision system (3 overlapping cameras and one non-overlapping camera); 5) Converging vision system (4 cameras). The dataset, the code, and the Blender 3D models used for its creation are available to the public via the following link [1]. For each sequence, 100 synchronized and distortion-free frames per camera have been captured at a resolution of 1824 x 1376 px. A set of representative synthetically generated images is shown in Fig. 9. For this experiment, the field of view of the synthetic cameras is fixed at 65 degrees of horizontal FoV. The mean calibration errors against the ground truth (over all the cameras in the rigs) are reported in Table 2.

[1] Link to the rendered image dataset: https://bosch.frameau.xyz/index.php/s/pLc2T9bApbeLmSz

Table 2. Average intrinsic and extrinsic errors over all the cameras on synthetic data generated by image rendering.

    Error             | Seq01  | Seq02 | Seq03  | Seq04  | Seq05
    focal (px)        | 27.601 | 2.229 | 27.611 | 27.648 | 0.124
    pp (px)           | 0.396  | 2.060 | 0.514  | 0.464  | 0.718
    Rotation (deg)    | 0.002  | 0.056 | 0.002  | 0.005  | 0.046
    Translation (m)   | 0.000  | 0.006 | 0.000  | 0.000  | 0.002
    Reprojection (px) | 0.022  | 0.023 | 0.014  | 0.017  | 0.090

Seq01: Stereo. While the proposed technique is designed to calibrate complex vision systems, it can also be employed for the calibration of rather simple and common vision rigs such as monocular or stereo vision systems. Such a calibration can be achieved with our toolbox using a single calibration board. To challenge our technique, we utilize 3 individual boards. In this scenario, our toolbox outputs highly accurate results with a sub-millimetric translational error and a very low reprojection error.

Seq02: Non-globally overlapping camera system. For this experiment, we simulate an omnidirectional vision system that shares similarities with (Schroers et al., 2018). This system is composed of 5 cameras arranged in a semi-circle (see Fig. 9(b)). A single calibration board is used for the entire calibration of the vision rig; since each camera shares a partial FoV with its neighbors, the relative poses of the cameras can be obtained by chaining all the transformations. This process is automatically achieved in our calibration pipeline. The difficulty of such a calibration scenario is the possibility of accumulating drift between the reference camera and the other cameras in the system. Despite this challenging scenario, our calibration technique is able to reliably estimate the camera poses in the system. We can notice a higher mean translational error in this sequence, which mostly comes from one camera located on the extreme left of the system with partial visibility of the boards.
Fig. 9. Calibration results from synthetically rendered images: (a) Seq01: Stereo, (b) Seq02: Non-globally overlapping, (c) Seq03: Non-overlapping, (d)
Seq04: Unbalanced, (e) Seq05: Converging vision system. (f-j) Sample of rendered images from each sequence.

Seq03: Non-overlapping camera system. Our calibration toolbox can be used to calibrate any multi-camera vision system. In particular, we propose a robust strategy for the calibration of non-overlapping camera systems (see Fig. 9(c)). In this experiment, we propose one of the most common multi-camera systems used to obtain an all-around view (as depicted in Fig. 9(c)). This multi-camera vision system is surrounded by a set of eight calibration boards placed in a circular manner such that the boards can be visible by multiple cameras simultaneously. Despite the limited amplitude of the motions used for this calibration, the obtained results are very close to the ground truth with nearly no translational or rotational error (see Table 2).

Seq04: Unbalanced non-overlapping vision system. To evaluate the applicability of our calibration strategy on an unbalanced non-overlapping vision system, we simulate 3 overlapping cameras on one side and a single camera pointing in the opposite direction (see Fig. 9(d)). This scenario is complex since a wrongly initialized calibration may lead to a wrong convergence due to the unbalanced reprojection error on both sides of the system. Our technique is both robust and effective even for such a specific scenario. This satisfying performance can be explained by the design of our method. Specifically, our method is built to solve the calibration problem in a progressive step-by-step manner where each step is designed to provide an accurate initialization to the next one.

Seq05: Converging vision system. Most existing calibration toolboxes are incompatible with converging vision systems. Our toolbox can deal with such scenarios by estimating the structure of the 3D calibration object. While our method can function with any calibration object composed of multiple planar boards, in this experiment we propose to simulate the most common 3D calibration object: a cube composed of 6 planar boards. A group of 4 converging cameras orbit randomly around the cube such that every face of the cube can be observed. The results presented in Table 2 demonstrate sub-pixel reprojection error and highly accurate camera pose estimation.

5.2. Real vision system experiments

To demonstrate the effectiveness of the calibration approach under realistic scenarios, we propose to calibrate diverse vision systems ranging from stereo to multiple groups of non-overlapping cameras. Moreover, the proposed scenarios involve different numbers of calibration boards and 3D objects. All the sequences used in this paper can be downloaded freely [2].

[2] Link to the real image dataset: https://bosch.frameau.xyz/index.php/s/fqtFij4PNc9mp2a

The stereo vision system calibrated in this experiment is a ZED camera capturing synchronized pairs of 1280 x 720 px resolution images. For the multi-camera system configurations, we use up to 4 synchronized Intel RealSense D415i RGB-D cameras. Specifically, we utilize the 2 infrared sensors of each camera (spatial resolution of 1280 x 720 px). While the RGB sensor could be used together with them for the calibration, the large motion blur and the rolling shutter of this color camera disqualified it for this experiment. Due to hardware limitations, the mis-synchronization of the RGB-D cameras can reach up to 5 ms. This delay does not seem to cause a significant problem during the calibration process since a low reprojection error has been reported. Finally, the hybrid stereo-vision system is composed of two PointGrey Flea3 FL3-U3-13E4C-C cameras (1280 x 512 px spatial resolution). The left one is equipped with a lens (LM3NCM) providing a 90-degree horizontal field of view with low radial distortion. On the right camera, we install a fisheye lens (Fujinon FE185C046HA-1) yielding a 182-degree horizontal field of view and very large radial distortion. These two cameras are spaced by a baseline of 20 cm. All the experiments presented in this paper have been conducted on a desktop computer equipped with 32 GB of RAM and a CPU i7-6800K.

Stereo vision system. To validate our method, we calibrate a stereo camera to allow comparison against the widely used strategy proposed in (Bouguet, 2004). The stereo ZED camera is calibrated with a single board in both cases.

Fig. 10. Experimental setup for our non-overlapping vision system. (a)
Camera rig, (b) obtained calibration result, (c) boards used for calibration.

While we utilize a video sequence consisting of 900 frames per camera for our method, only 50 frames are employed for the calibration with Bouguet (2004) since the detection of the board is manual. Our calibration leads to a mean reprojection error of 0.39 pixels while (Bouguet, 2004) suffers from a mean error of 0.44 pixels. This metric by itself is not sufficient to be conclusive regarding the accuracy of the method; therefore, we provide a comparison of the estimated parameters in Table 3. Both toolboxes result in similar parameters, with a translational difference of 0.5 mm and an insignificant rotational difference. This experiment suggests the ability of our strategy to calibrate stereo vision systems as reliably as tools dedicated specifically to this task.

Non-overlapping vision system. To demonstrate the ability of our approach to calibrate complex vision systems without overlap, we rigidly fix four RealSense D415i cameras on a bar such that each camera looks in a different direction without any overlap between the stereo views, as depicted in Fig. 10(a). Since each RGB-D vision system is composed of two NIR cameras, a total of 8 cameras are calibrated. To achieve this calibration, we use 4 boards, as shown in Fig. 10(c). The calibration result is available in Fig. 10(b); the mean reprojection error for the entire sequence of 1200 frames (per camera) is 0.19 pixel, suggesting a very accurate estimation of the parameters of this vision rig. We additionally provide an all-around 3D reconstruction obtained from this calibrated system in Fig. 11.

Fig. 11. All-around reconstruction from 4 non-overlapping RGB-D cameras.

Overlapping multi-camera system. To allow comparison against another multi-camera calibration toolbox (Rehder et al., 2016), we acquire a calibration sequence (~900 frames) from 6 cameras sharing a significant overlapping FoV (see Fig. 12). In Fig. 13 we propose a comparison of the estimated parameters between our approach and Kalibr (Rehder et al., 2016). We can notice that both toolboxes lead to relatively similar solutions, with a maximum rotational difference under 0.2 degrees and less than 3 mm difference in the translation estimation. We can also notice that the difference in the estimation increases with the camera index, which might be related to the behavior of Kalibr, which computes the camera poses pairwise, leading to a potential drift when many cameras are being calibrated. Also, it should be noted that the calibration using Kalibr (Rehder et al., 2016) took more than 1 hour while our calibration technique achieves the parameter estimation in less than 4 minutes. Using our estimated parameters, we have aligned the point clouds obtained by the 3D sensors to observe the quality of the overlap between the views (see Fig. 12(b)). We can notice that the alignment of the 3D reconstructions is accurate, which also validates our methodology for such types of multi-camera systems.

Hybrid stereo-vision system. While the previous experiments exclusively focused on homogeneous vision systems (composed of similar types of cameras), MC-Calib can also deal with hybrid vision systems composed of perspective and fisheye cameras. Therefore, we propose a setup composed of one fisheye and one perspective camera, as depicted in Fig. 5.2(a). To calibrate this system we used a single calibration board. The resulting calibration leads to a mean reprojection error of 0.15 px, and the estimated baseline is 19.8 cm, which is consistent with the organization of the cameras in the rig. We additionally propose a stereo rectification result (see Fig. 5.2(b)) which also confirms the quality of our calibration.
12

Table 3. Stereo calibration parameters comparison between our approach and Bouguet (2004). fL, fR, ppL and ppR are the focal lengths and principal points of the left and right cameras respectively.

Methods        | fL (px) | fR (px) | ppL (px)         | ppR (px)         | XYZ Rotation Euler (◦)     | XYZ Translation (cm)
Bouguet (2004) | 703.81  | 707.12  | (631.27, 372.15) | (650.37, 386.23) | (−0.05, 0.61, −0.47)       | (−11.98, 0.02, −0.07)
Ours           | 701.52  | 704.33  | (629.06, 375.08) | (651.27, 387.25) | (−0.0214, 0.3731, −0.5487) | (−12.0231, 0.0250, −0.1151)
Difference     | 2.29    | 2.79    | 3.67             | 1.35             | 0.25                       | 0.05
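For reference, the rotational and translational differences reported in Table 3 (and per camera in Fig. 13) can be computed directly from two estimates of the same extrinsics; the rotational difference is the geodesic angle of the relative rotation. A minimal sketch, assuming each pose is given as a 3×3 rotation matrix and a translation vector:

    import numpy as np

    def pose_difference(R_a, t_a, R_b, t_b):
        # Rotation difference (degrees) and translation difference (same unit as
        # the inputs) between two estimates of the same camera pose.
        R_rel = R_a.T @ R_b                                        # relative rotation
        cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
        rot_diff_deg = np.degrees(np.arccos(cos_angle))
        trans_diff = np.linalg.norm(np.asarray(t_a) - np.asarray(t_b))
        return rot_diff_deg, trans_diff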

Fig. 12. Calibration of 6 overlapping cameras. (a) Multi-camera rig, (b) calibration result, (c-e) three images used for this calibration.

Fig. 13. Translational and rotational difference between ours and Kalibr (Rehder et al., 2016) for the calibration of 6 overlapping cameras.

Fig. 14. 3D reconstruction from 3 calibrated RGB-D cameras. (a) Merged point cloud from the 3 RGB-D reconstructions, (b) overlay of the registered point clouds, where yellow, blue and red depict cameras 1, 2 and 3 respectively, (c-e) color image from each camera.

Converging vision system. The final real setup we propose to explore is a converging multi-camera system composed of 4 pairs of stereo NIR cameras, as depicted in Fig. 16. The system is calibrated with a cube of 30cm side on which each face is covered with a unique calibration pattern. The geometry of this cube is unknown for this calibration, and any 3D object composed of planes would also be applicable. The calibration outputs include the intrinsic/extrinsic parameters of the cameras and the geometry of the calibration object (see the top of Fig. 16). This calibration is performed with 1500 images for each of the 8 cameras in the system, and the mean reprojection error is 0.28px. Qualitatively, our calibration result is consistent with the experimental setup. However, since no ground truth is available, we propose to reconstruct a 3D object by combining the four 3D point clouds captured by the RGB-D cameras to better highlight the accuracy of our method. This reconstruction is visible in Fig. 17. We can notice that this reconstruction is consistent, suggesting an accurate calibration of our system.
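Merging the per-camera reconstructions, as done for Fig. 11, Fig. 14 and Fig. 17, only requires expressing each point cloud in a common reference frame with the estimated extrinsics. The sketch below is a simplified illustration in Python/NumPy; the 4×4 transforms mapping each camera to the reference camera are assumed to be available from the calibration output (their exact storage format is toolbox-specific).

    import numpy as np

    def merge_point_clouds(clouds, T_ref_cam):
        # clouds: list of (N_i, 3) point arrays, one per camera, in camera coordinates.
        # T_ref_cam: list of 4x4 transforms mapping camera i to the reference frame.
        merged = []
        for pts, T in zip(clouds, T_ref_cam):
            R, t = T[:3, :3], T[:3, 3]
            merged.append(pts @ R.T + t)        # rigid transform of every point
        return np.vstack(merged)                # single cloud in the reference frame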
5.3. Robustness against noise

In this test, we would like to assess the robustness of the proposed technique against noise on the corners' locations. For this purpose, we simulate (numerically in Matlab) three sets of ideal camera systems with known parameters, motions, and 3D board locations. The first setup consists of a stereo vision system calibrated with three calibration boards attached together. The second simulated use case is a small light-field setup composed of a 3 × 3 camera matrix looking at 9 calibration boards. The final setup is a non-overlapping camera system made of two back-to-back stereo vision systems, each looking at a grid of 9 checkerboards. In all of these experiments, the synthetic cameras are assumed to have a resolution of 1824 × 1376px and each board contains 6 × 6 corners. The rig's motions have been generated randomly in a given range of rotation and translation. To test the robustness of the calibration pipeline, Gaussian noise is added to the corners' positions. Different noise levels are used and 100 trials are performed for each noise level. The obtained results are visible in Fig. 18. Note that in this experiment, the RANSAC threshold has been intentionally set higher than the maximum noise standard deviation to demonstrate our method's robustness using a set of noisy points.

Our system demonstrates good robustness to noise and, while the overall accuracy is impacted by the noise, no failure cases are observed over thousands of runs. Moreover, despite
very high noise, the rotation and translation errors never exceed 2◦ and 0.08 meters regardless of the cameras' configuration. It is worth noting that, owing to the filtering and the corner refinement processes, such inaccurate corner localization is very unlikely in practice. Regarding the intrinsic parameters, the deviation from the ground truth remains reasonable, with a maximum of 70px for the principal point and 80px for the focal length. Noticeably, the light-field configuration has higher intrinsic parameter errors, which can be related to the limited variety of viewpoints in the sequence. Interestingly, the non-overlapping scenario, which is assumed to be more complex to resolve, seems to reach higher accuracy. This can be attributed to the robustness of the proposed bootstrapping strategy.
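For reproducibility, a single trial of this protocol essentially boils down to projecting the known corners, perturbing them, and re-calibrating. The sketch below is a strongly simplified single-camera version of the Matlab simulation described above, rewritten with OpenCV's standard calibration routine purely for illustration (board geometry, ground-truth poses and intrinsics are assumed to be given).

    import numpy as np
    import cv2

    def noise_trial(K, dist, rvecs, tvecs, board_pts, img_size, sigma):
        # One Monte-Carlo trial: project the 3D board corners with the ground-truth
        # parameters, add 2D Gaussian noise of std. dev. `sigma` (pixels),
        # re-calibrate, and return the absolute focal-length error in pixels.
        obj_pts, img_pts = [], []
        for rvec, tvec in zip(rvecs, tvecs):                       # one entry per view
            proj, _ = cv2.projectPoints(board_pts, rvec, tvec, K, dist)
            noisy = proj.reshape(-1, 2) + np.random.normal(0.0, sigma, (len(board_pts), 2))
            obj_pts.append(board_pts.astype(np.float32))
            img_pts.append(noisy.astype(np.float32).reshape(-1, 1, 2))
        _, K_est, _, _, _ = cv2.calibrateCamera(obj_pts, img_pts, img_size, None, None)
        return abs(K_est[0, 0] - K[0, 0])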
Fig. 15. Hybrid stereo-vision system calibration. (a) Picture of the system, (b, first row) perspective and fisheye images acquired by the cameras, (b, second row) rectified images with multiple epipolar lines displayed in color.

Fig. 16. Experimental setup and calibration results for a converging multi-camera system. (top-left) Calibration result (note that only the left camera of each RGB-D camera is displayed for clarity), (top-right) 3D calibration cube and its reconstruction, (bottom-left) experimental setup with a 3D object placed in the center of the camera rig, (bottom-right) images with detected corners from cameras 1 and 3 respectively.

Fig. 17. Reconstructed object using our calibration parameters. (a) Picture of the object, (b) aligned 3D point clouds from the 4 RGB-D cameras displayed in red, green, blue, and yellow respectively, (c) meshed result.

5.4. Robustness against outliers

This section confirms the stability of our approach against outliers. Following the same evaluation environments as in Sec. 5.3, we evaluate the robustness of our approach under the presence of hard outliers, which may occasionally occur during the detection of fiducial markers. In contrast to Sec. 5.3, for this assessment, no noise is added to the inlier points. We compute the success rate (mean reprojection error below 5px) over 100 trials for different levels of outlier contamination ranging from 0 to 70%. An outlier is a point with a deviation of more than 10px from its true pixel position (the outliers are generated randomly in the image with a uniform distribution). The same number of outliers is enforced for each image. The computed success rate is shown in Fig. 19(a): in most scenarios, our solution is robust up to 60% of outlier contamination, leaving only 14 points per board to perform the calibration of the system. This resilience can be mostly attributed to the RANSAC algorithm used to reject incorrect points. Since the boards contain a relatively low number of points, 1000 RANSAC iterations are usually enough to discover a set of uncorrupted points to perform the system calibration. At a level of 70% of outliers, the success rate falls to zero for all three studied scenarios. In practice, such an extreme presence of outliers is highly unlikely. Nevertheless, without our RANSAC filtering and robust optimization scheme, even very few outliers lead to a complete failure of the calibration.
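The principle of this RANSAC filtering can be illustrated with OpenCV's robust PnP solver, which estimates a board pose while discarding 2D-3D correspondences that do not fit the model. This is only a sketch of the idea (the actual filtering in MC-Calib is part of its C++ pipeline); the threshold and iteration count mirror the values discussed above.

    import numpy as np
    import cv2

    def board_pose_ransac(board_pts, img_pts, K, dist, thresh_px=10.0, iters=1000):
        # Estimate the pose of one calibration board from 2D-3D correspondences
        # while rejecting outlier corners with RANSAC.
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            board_pts.astype(np.float32),        # (N, 3) corner coordinates on the board
            img_pts.astype(np.float32),          # (N, 2) detected pixel positions
            K, dist,
            iterationsCount=iters,
            reprojectionError=thresh_px,         # inlier threshold in pixels
            flags=cv2.SOLVEPNP_ITERATIVE)
        if not ok:
            raise RuntimeError("no consistent set of corners found")
        return rvec, tvec, inliers.ravel()       # pose and indices of inlier corners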
5.5. Robustness evaluation of the hand-eye calibration

In this section, we highlight the relevance of our bootstrapped hand-eye calibration technique (covered in Sec. 4.6.2) for non-overlapping vision systems under the presence of wrongly estimated poses (outliers). Not only does our technique allow a fast and constant-time hand-eye calibration, it is also significantly more robust to outliers thanks to our minibatch estimation and our rotation consistency testing stage. To evaluate the level of robustness offered by our framework, we synthetically generate 100 pairs of poses from two non-overlapping cameras.
Fig. 18. Robustness against various quantities of noise, with 100 iterations per noise level; the thick lines represent the mean error value and the transparent envelopes depict the standard deviation. (first column) Translation and rotation error assessment for the three sequences: stereo, light-field, and non-overlapping, (second column) focal length and principal point error for the three sequences: stereo, light-field, and non-overlapping.

Fig. 19. Robustness against outliers. (a) Success rate of the entire calibration pipeline versus the percentage of outlier contamination (for the 3 scenarios depicted in red, green and blue), (b) success rate of our hand-eye calibration technique for various levels of outlier contamination (the red and green bars depict the standard (Tsai et al., 1989) and our hand-eye calibration procedures respectively).

Fig. 20. (a) Computational time vs number of cameras (8 cameras, 1200 images per camera and 4 calibration boards). (b) Computational time vs number of boards (2 cameras, 550 images, 6 calibration boards). The transparent red envelope depicts the standard deviation.

In Fig. 19(b), we provide the success rate for 500 trials at different levels of outlier corruption. In this experiment, we do not include any non-linear refinement of the pose. We consider the pose estimation successful if the errors in rotation and translation are lower than 5◦ and 2cm respectively. To better understand the importance of the proposed technique, we compare our solution with a standard hand-eye calibration solution that directly tries to resolve the problem using all the available poses (Tsai et al., 1989). These conventional hand-eye calibration techniques have not been designed to deal with outliers; thus, the presence of a single outlier leads to very large errors, as underlined in Fig. 19(b). On the contrary, our technique is specifically designed to deal with outliers and allows estimating the inter-camera pose even if multiple outliers contaminate the set (with 6% of outliers our algorithm can return a successful pose estimation 60% of the time).
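To give a concrete flavor of the minibatch idea, the sketch below runs a conventional hand-eye solver (Tsai et al., 1989), here OpenCV's implementation, on several small random subsets of pose pairs and keeps only the estimates whose rotations are mutually consistent. It is an illustrative approximation of our bootstrapped procedure (Sec. 4.6.2), not the toolbox's actual implementation; the batch size, tolerance, and final averaging step are arbitrary choices.

    import numpy as np
    import cv2

    def rot_angle_deg(Ra, Rb):
        # Geodesic angle (degrees) between two rotation matrices.
        c = np.clip((np.trace(Ra.T @ Rb) - 1.0) / 2.0, -1.0, 1.0)
        return np.degrees(np.arccos(c))

    def robust_hand_eye(R_a, t_a, R_b, t_b, batch=10, n_batches=20, tol_deg=2.0):
        # R_a/t_a and R_b/t_b: lists of per-motion rotations/translations of the two
        # rigidly coupled cameras. Solve on random minibatches, then keep only the
        # estimates that pass a rotation consistency test.
        n = len(R_a)
        cands = []
        for _ in range(n_batches):
            s = np.random.choice(n, size=min(batch, n), replace=False)
            R, t = cv2.calibrateHandEye([R_a[i] for i in s], [t_a[i] for i in s],
                                        [R_b[i] for i in s], [t_b[i] for i in s],
                                        method=cv2.CALIB_HAND_EYE_TSAI)
            cands.append((R, t.ravel()))
        # keep the largest set of mutually agreeing rotation estimates
        votes = [sum(rot_angle_deg(R, R2) < tol_deg for R2, _ in cands) for R, _ in cands]
        best = int(np.argmax(votes))
        kept = [(R, t) for R, t in cands if rot_angle_deg(R, cands[best][0]) < tol_deg]
        R_est = cands[best][0]                             # representative rotation
        t_est = np.mean([t for _, t in kept], axis=0)      # average consistent translations
        return R_est, t_est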
5.6. Computational time

While our strategy has not been deliberately designed to be computationally efficient – since the calibration stage is usually conducted offline – our C++ implementation allows a relatively quick calibration of any camera system. To give an overview of the method's speed, we have performed tests with various numbers of cameras (see Fig. 20(a)) and boards (see Fig. 20(b)). In these experiments, the calibration is repeated 25 times for each instance to analyze the mean and standard deviation of the elapsed time. To test with various numbers of cameras, we examined the non-overlapping system composed of 4 stereo cameras described in Sec. 5.2. In this experiment, 1200 images per camera are captured, raising the total number of images to be processed to 9600, while 4 boards are utilized. In this context, it can take up to 15 minutes to calibrate the entire system. However, the computational time decreases significantly when reducing the number of cameras utilized. To evaluate the computational time versus the number of employed boards, we use a stereo sequence of 550 images of a calibration cube composed of 6 boards. We decrease the number of boards and measure the computational time for each scenario ranging from 1 to 6 boards. Once again, we can notice (see Fig. 20(b)) that the calibration time decreases with fewer boards. The reason is that a larger number of boards to be detected also leads to more computation for their detection.

To better understand which part of the algorithm is time-consuming, we analyze the mean computational time per stage of the algorithm (see Table 4). This evaluation is performed with 4 non-overlapping stereo cameras (1200 frames per camera and 4 calibration boards). As can be seen, the most time-consuming part is the detection of the Charuco boards, followed by the initialization of the intrinsic parameters. The rest of the proposed calibration process is very light and takes under one minute to calibrate complex multi-camera systems.
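Since board detection dominates the runtime in Table 4, its cost can be measured in isolation by timing the ChArUco detection alone over a sequence. The sketch below assumes the legacy cv2.aruco interface from opencv-contrib-python; the board object, dictionary, and image list are placeholders to be adapted to the actual pattern.

    import time
    import cv2

    def time_charuco_detection(image_paths, board, dictionary):
        # Total time (seconds) spent detecting ChArUco corners over a list of images.
        # `board` and `dictionary` must match the printed pattern.
        total = 0.0
        for path in image_paths:
            gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            start = time.perf_counter()
            corners, ids, _ = cv2.aruco.detectMarkers(gray, dictionary)
            if ids is not None and len(ids) > 0:
                cv2.aruco.interpolateCornersCharuco(corners, ids, gray, board)
            total += time.perf_counter() - start
        return total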

Table 4. Computational time per stage for 4 non-overlapping stereo systems (8 cameras) with 1200 frames per camera and 4 boards.

Stage                | Time (s) | (%)
Boards detection     | 1049.4   | 90.8
Intrinsic estimation | 92.0     | 7.9
Objects merging      | 1.4      | 0.1
Camera merging       | 2.2      | 0.19
Non-overlap. calib.  | 6.6      | 0.6
Final Optimization   | 4.2      | 0.3
Total                | 1155.9   | 100

6. Conclusion

In this paper, we have presented one of the most flexible, robust, and user-friendly camera calibration toolboxes to date. It allows calibrating fisheye, perspective, and hybrid vision systems composed of an arbitrary number of cameras without any priors or restrictions on their location. Moreover, an arbitrary number of calibration boards can be used and placed without specific limitations. Regarding the stability of the technique, our hierarchical calibration strategy ensures a good convergence of the intrinsic and extrinsic parameters of the camera rig. This architecture combines robust estimation strategies (i.e. bootstrapped initialization of non-overlapping rigs, RANSAC, and robust non-linear optimization) to ensure a satisfying calibration. Through a large series of experiments, we have demonstrated the robustness, accuracy, and relevance of the approach for multiple use cases.

Our toolbox still has a few limitations. In its current form, it can only exploit Charuco markers, while the addition of more advanced AR markers, such as AprilTag (Olson, 2011), might be an interesting extension. Besides, additional camera models could be included, such as spherical camera models (Usenko et al., 2018; Barreto, 2006). Finally, MC-Calib is not designed for unsynchronized camera systems or for rolling shutter cameras. Aside from these restrictions, our technique does not suffer from limitations other than the usual corner detection-related problems (e.g. motion blur, out-of-focus blur, etc.). We believe this work can be useful for most applications requiring multi-camera systems, in particular in robotics and for autonomous cars where multiple fisheye cameras are often employed.

Acknowledgement

Francois Rameau was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (NRF-2020M3H8A1115028, FY2022). Jinsun Park was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1I1A1A01060267).

References

Alexiadis, D.S., Chatzitofis, A., Zioulis, N., Zoidi, O., Louizis, G., Zarpalas, D., Daras, P., 2016. An integrated platform for live 3d human reconstruction and motion capturing. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 27, 798–813.
Ataer-Cansizoglu, E., Taguchi, Y., Ramalingam, S., Miki, Y., 2014. Calibration of non-overlapping cameras using an external slam system, in: International Conference on 3D Vision.
Barreto, J.P., 2006. A unifying geometric representation for central projection systems. Computer Vision and Image Understanding (CVIU).
Bouguet, J.Y., 2004. Camera calibration toolbox for matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/index.html.
Caron, G., Eynard, D., 2011. Multiple camera types simultaneous stereo calibration, in: ICRA.
Community, B.O., 2018. Blender - a 3D modelling and rendering package. Blender Foundation. Stichting Blender Foundation, Amsterdam.
Dijkstra, E.W., 1959. A note on two problems in connexion with graphs. Numerische mathematik 1, 269–271.
Duane, C.B., 1971. Close-range camera calibration. Photogramm. Eng 37, 855–866.
Fischler, M.A., Bolles, R.C., 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24, 381–395.
Forbes, K., Voigt, A., Bodika, N., 2002. An inexpensive, automatic and accurate camera calibration method, in: South African Workshop on Pattern Recognition.
Gao, X.S., Hou, X.R., Tang, J., Cheng, H.F., 2003. Complete solution classification for the perspective-three-point problem. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 25, 930–943.
Garrido-Jurado, S., Muñoz-Salinas, R., Madrid-Cuevas, F.J., Marín-Jiménez, M.J., 2014. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition 47, 2280–2292.
Ha, H., Perdoch, M., Alismail, H., So Kweon, I., Sheikh, Y., 2017. Deltille grids for geometric camera calibration, in: ICCV.
Heng, L., Choi, B., Cui, Z., Geppert, M., Hu, S., Kuan, B., Liu, P., Nguyen, R., Yeo, Y.C., Geiger, A., et al., 2019. Project autovision: Localization and 3d scene perception for an autonomous vehicle with a multi-camera system, in: ICRA.
Heng, L., Li, B., Pollefeys, M., 2013. Camodocal: Automatic intrinsic and extrinsic calibration of a rig with multiple generic cameras and odometry, in: IROS.
Im, S., Ha, H., Rameau, F., Jeon, H.G., Choe, G., Kweon, I.S., 2016. All-around depth from small motion with a spherical panoramic camera, in: ECCV.
Itseez, 2015. Open source computer vision library.
Kannala, J., Brandt, S.S., 2006. A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. PAMI 28, 1335–1340.
Keselman, L., Iselin Woodfill, J., Grunnet-Jepsen, A., Bhowmik, A., 2017. Intel realsense stereoscopic depth cameras, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops.
Kumar, R.K., Ilie, A., Frahm, J.M., Pollefeys, M., 2008. Simple calibration of non-overlapping cameras with a mirror, in: CVPR.
Kuo, J., Muglikar, M., Zhang, Z., Scaramuzza, D., 2020. Redesigning slam for arbitrary multi-camera systems, in: ICRA.
Lébraly, P., Deymier, C., Ait-Aider, O., Royer, E., Dhome, M., 2010. Flexible extrinsic calibration of non-overlapping cameras using a planar mirror: Application to vision-based robotics, in: IROS.
Lébraly, P., Ait-Aider, O., Royer, E., Dhome, M., 2010. Calibration of non-overlapping cameras - application to vision-based robotics, in: BMVC.
Li, B., Heng, L., Koser, K., Pollefeys, M., 2013. A multiple-camera system calibration toolbox using a feature descriptor-based calibration pattern, in: IROS.
Lin, Y., Larsson, V., Geppert, M., Kukelova, Z., Pollefeys, M., Sattler, T., 2020. Infrastructure-based multi-camera calibration using radial projections, in: ECCV.
Liu, A., Marschner, S., Snavely, N., 2016. Caliber: Camera localization and calibration using rigidity constraints. International Journal of Computer Vision (IJCV) 118, 1–21.
Lloyd, S., 1982. Least squares quantization in pcm. IEEE transactions on information theory 28, 129–137.
Mei, C., Rives, P., 2007. Single view point omnidirectional camera calibration from planar grids, in: ICRA.
Moulon, P., Monasse, P., Perrot, R., Marlet, R., 2016. Openmvg: Open multiple view geometry, in: International Workshop on Reproducible Research in Pattern Recognition.
Munoz-Salinas, R., 2012. Aruco: a minimal library for augmented reality ap-
plications based on opencv. Universidad de Córdoba 386.
Mur-Artal, R., Tardós, J.D., 2017. Orb-slam2: An open-source slam system
for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics
(TRO) 33, 1255–1262.
Olson, E., 2011. Apriltag: A robust and flexible visual fiducial system, in:
ICRA.
Pesce, M., Galantucci, L., Percoco, G., Lavecchia, F., 2015. A low-cost multi
camera 3d scanning system for quality measurement of non-static subjects.
Procedia CIRP 28, 88–93.
Qin, T., Li, P., Shen, S., 2018. Vins-mono: A robust and versatile monocular
visual-inertial state estimator. IEEE Transactions on Robotics (TRO) 34,
1004–1020.
Rameau, F., Demonceaux, C., Sidibé, D., Fofi, D., 2014. Control of a ptz
camera in a hybrid vision system, in: VISAPP.
Rehder, J., Nikolic, J., Schneider, T., Hinzmann, T., Siegwart, R., 2016. Ex-
tending kalibr: Calibrating the extrinsics of multiple imus and of individual
axes, in: ICRA.
Rosinol, A., Abate, M., Chang, Y., Carlone, L., 2020. Kimera: an open-source
library for real-time metric-semantic localization and mapping, in: ICRA.
Scaramuzza, D., Martinelli, A., Siegwart, R., 2006. A toolbox for easily cali-
brating omnidirectional cameras, in: IROS.
Schönberger, J.L., Frahm, J.M., 2016. Structure-from-motion revisited, in:
CVPR.
Schroers, C., Bazin, J.C., Sorkine-Hornung, A., 2018. An omnistereoscopic
video pipeline for capture and display of real-world vr. ACM Transactions
on Graphics (TOG) 37, 1–13.
Strauß, T., Ziegler, J., Beck, J., 2014. Calibrating multiple cameras with non-
overlapping views using coded checkerboard targets, in: 17th international
IEEE conference on intelligent transportation systems (ITSC), IEEE. pp.
2623–2628.
Sturm, P., Ramalingam, S., 2011. Camera models and fundamental concepts
used in geometric computer vision. Now Publishers Inc.
Tsai, R., 1987. A versatile camera calibration technique for high-accuracy 3d
machine vision metrology using off-the-shelf tv cameras and lenses. IEEE
Journal on Robotics and Automation 3, 323–344.
Tsai, R.Y., Lenz, R.K., et al., 1989. A new technique for fully autonomous and
efficient 3 d robotics hand/eye calibration. IEEE Transactions on robotics
and automation 5, 345–358.
Urban, S., Wursthorn, S., Leitloff, J., Hinz, S., 2016. MultiCol Bundle
Adjustment: A Generic Method for Pose Estimation, Simultaneous Self-
Calibration and Reconstruction for Arbitrary Multi-Camera Systems. Inter-
national Journal of Computer Vision (IJCV) , 1–19.
Usenko, V., Demmel, N., Cremers, D., 2018. The double sphere camera model,
in: 3DV.
Wang, J., Olson, E., 2016. Apriltag 2: Efficient and robust fiducial detection,
in: IROS.
Wu, C., et al., 2011. Visualsfm: A visual structure from motion system.
Xing, Z., Yu, J., Ma, Y., 2017. A new calibration technique for multi-camera
systems of limited overlapping field-of-views, in: IROS.
Yu, Z., Yoon, J.S., Lee, I.K., Venkatesh, P., Park, J., Yu, J., Park, H.S., 2020.
Humbi: A large multiview dataset of human body expressions, in: CVPR.
Zhang, Z., 2000. A flexible new technique for camera calibration. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence (TPAMI) 22, 1330.
Zhao, F., Tamaki, T., Kurita, T., Raytchev, B., Kaneda, K., 2018. Marker-based
non-overlapping camera calibration methods with additional support camera
views. Image and Vision Computing 70, 46–54.
