Department of Computing
Paul Gribelyuk
August 2013
Abstract
Acknowledgements
I would like to thank my supervisor, Professor Paul Kelly, for his guidance and
time in completing this project. I would also like to thank Professor Andrew
Davidson and Renato Salas-Moreno, who helped me understand the finer points
of computer vision and steered me in the right direction.
Contents
1. Introduction
   1.1. Motivation
   1.2. Objectives
   1.3. Outline
2. Background
   2.1. Bundle Adjustment
        2.1.1. Preprocessing
        2.1.2. Modelling the Camera
        2.1.3. Modelling Motion
        2.1.4. Non-linear Least Squares
        2.1.5. Putting It All Together
   2.2. PyOP2
        2.2.1. Python as a Domain Specific Language
        2.2.2. Finite Element Methods and How they Relate to Bundle Adjustment
        2.2.3. Code Generation
3. Related Work
   3.1. Early Bundle Adjustment
   3.2. Modern Methods
4. Program Design
   4.1. Data
   4.2. Building the Error Function
   4.3. Building the Jacobian Blocks
   4.4. Building the Hessian
   4.5. Solvers
   4.6. Putting It All Together
5. Evaluation
   5.1. Analyzing Our Implementation
   5.2. Comparisons With Existing Software
   5.3. Remarks
6. Conclusion
   6.1. Results
   6.2. Future Work
A. Some Code
List of Tables
List of Figures
1. Introduction
We analyze the general setting of the bundle adjustment problem in the
computer vision context and investigate techniques for speeding up the steps
involved in the calculation, using automatic differentiation as well as
heterogeneous compilation tools. We use the Sympy Python package [24] for
symbolic representation and automatic differentiation of positional parameters,
vectors, motion functions, and error vectors, and PyOP2 [20] for executing the
generated kernels. We show that many of the steps can be executed in parallel,
and we outline a domain specific language which enhances programmability
of these problems. Lastly, we compare our approach against popular bundle
adjustment solvers such as ceres-solver from Google [1], the g2o package [13],
and the iSAM package [12].
1.1. Motivation
Robot scene recognition researchers have historically relied on filtering tech-
niques, such as the Extended Kalman Filter [19], Particle Filters [26], and
Rao-Blackwellised Filters [11], to solve the many steps of a larger process
known as Simultaneous Localization and Mapping (SLAM). In the SLAM
setting, a robot is placed in an unknown environment with sensors measur-
ing locations of surrounding landmarks as it navigates this new environment.
Prior to each time step, the robot stores a prior probability distribution for
each landmark as well as its own current and past locations. As it obtains
new measurements of the surrounding environment, it produces a poste-
rior distribution for each landmark. In the Kalman Filtering framework,
properties of Bayesian probability laws are used to make each update
with an implied Gaussian distribution for errors. Particle Filtering methods
also implement Bayes’ Law but instead use Monte Carlo simulation to
generate an estimate of the probability distribution.
In contrast to these methods, bundle adjustment emerged from photogrammetry
research in the 60s and 70s, mainly with military and geographical
applications. Computations on collected image data were done
graphical applications. Computations on collected image data were done
offline and were typically very time consuming. In the past 10 years, ad-
vances in computing hardware and novel architectures (e.g. multicore, si-
multaneous multithreading, vectorized instructions, general purpose GPU
computing, etc.), have narrowed the gap between filtering and bundle ad-
justment techniques allowing researchers to seriously consider bundle ad-
justment frameworks as an alternative to Kalman Filtering in online robot
vision problems. For example, Salas-Moreno et al. [21] use this approach
to optimize and reconstruct object scenes. Furthermore, algorithms to solve
sparse linear systems have also evolved, with tools such as METIS, PETSc,
and Eigen able to take advantage of sparsity to parallelize computations
efficiently and to make them easier to program. These
developments, when properly utilized, can help robotic systems solve very
large scene recognition problems in real-time. As of this writing, the only
notable attempt we have found at applying techniques from the high per-
formance computing toolbox towards speeding up bundle adjustment is the
paper on multicore bundle adjustment by Wu et al. [27]. We believe there
is more work to be done to expose the necessary functionality of a bundle
adjustment solver more easily, without burying the user under unnecessary
abstractions or hardware-specific implementation details.
For this purpose, we leverage PyOP2, a framework for performing finite el-
ement computations over unstructured meshes. The architecture of PyOP2
modularizes domains of expertise without sacrificing performance. We show
that it can be used to efficiently perform bundle adjustment computations.
We use Python as a staging language for our analysis.
1.2. Objectives
We aim to explore the computational challenges of performing bundle ad-
justment in scene recognition problems and propose a set of routines to
speed up these computations using available frameworks. Benchmarking on
available datasets, varying from 900 poses and 1900 landmarks to 500,000
poses and 2,100,000 landmarks, shows the efficiency of using our approach
in comparison to available packages. These benchmarks should also help
guide research into software optimization tools with a view towards
accelerating vision applications.
1.3. Outline
This report is organized as follows:
• Chapters 4 and 5 cover the PyOP2 kernels used, as well as the use
of Sympy for automatic differentiation when constructing the error
function and the Jacobian and Hessian matrices. Here, we also outline
the computational complexity in the problem and explain how our
approach overcomes these obstacles. We present results compared to
other bundle adjustment implementations.
2. Background
2.1. Bundle Adjustment
We aim to outline the mathematical underpinnings of bundle adjustment
methods in this section and to lay the groundwork for the choices for our
implementation. We will begin with the geometry of camera (more gen-
erally: sensor) measurements and a background on how incoming image
data is used to obtain information about specific landmarks in the scene.
After that, we will discuss the dynamics of robot motion and its relation-
ship to the expectation of landmark positions from different poses. We use
the output of this calculation to calculate the ‘error‘, given specific camera
measurements. Next, we review the theory of non-linear least squares, espe-
cially sparse matrix techniques, which underly the search for optimal bundle
parameters. This iterative approach is elucidated in the final subsection.
2.1.1. Preprocessing
We begin by considering the two-dimensional image which represents a
three-dimensional scene. To determine specific landmarks present in the
scene, the system first employs a feature recognition algorithm, for ex-
ample, a blob detector. Typically, an image is represented by a mapping
f : R × R → R. In the case of the ‘Laplacian of Gaussian‘ feature detector,
a convolution between the image and a Gaussian kernel is performed:
L(x, y; t) = g(x, y; t) ∗ f(x, y)

where g(x, y; t) = (1/(2πt)) exp{−(x² + y²)/(2t)} is the Gaussian kernel. Next, the
Laplace operator ∇² = ∂²/∂x² + ∂²/∂y² is applied to L(x, y; t), exposing extrema
for dark and light ‘blobs‘ present in the image. The variance parameter t
also acts as a scale factor, with smaller values picking up smaller features in
an image, while larger values of t only output larger features. Alternatively,
an edge detection algorithm, such as the Canny Edge Detector [7], uses a
multistep approach:

• A pair of Sobel kernels is convolved with the image to estimate the horizontal
  and vertical intensity gradients of the image:

  Gx = [ −1  0  1 ]        Gy = [  1   2   1 ]
       [ −2  0  2 ]             [  0   0   0 ]
       [ −1  0  1 ]             [ −1  −2  −1 ]

  The edge strength at each pixel is then √(Gx² + Gy²) and the edge
  direction is θ = tan⁻¹(Gy/Gx).

• A final ‘hysteresis‘ step is used to identify edges which might not meet
  threshold gradients, but do lie next to pixels which do.
These techniques typically parallelize well on GPUs, since there are minimal
read or write conflicts in the data. More advanced techniques, using scale
and rotation invariance [16] allow for matching under more general camera
and robot transformations. These involve robust algorithms to hash and
store feature descriptors for quick comparison and retrieval. As a practical
example, the OpenCV package [4] provides a function to calculate Canny
edges:
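A minimal sketch of such a call, using OpenCV’s Python bindings (the file name
and threshold values are illustrative, not taken from our experiments), is:

import cv2

# Load the image as greyscale (flag 0); 'scene.png' is an illustrative file name.
image = cv2.imread('scene.png', 0)
# The two threshold values drive the hysteresis step described above.
edges = cv2.Canny(image, 100, 200)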
This code snippet produces the results shown in figure 2.1. Once
the system has identified landmarks using one of the previously mentioned
techniques, determining which common landmarks are shared by multiple
images is known as the correspondence problem in computer vision. Re-
searchers in this field apply a variety of tools, such as normalized cross
correlation.
Figure 2.1.: Canny Edge Detection in Action
In two dimensions, the rotation matrix

R(θ) = [ cos θ  −sin θ ]
       [ sin θ   cos θ ]

uniquely describes the counterclockwise rotation of any vector x ∈ R² by θ
radians about the origin. However, in three dimensions, multiple represen-
tations are possible. For example, the Rodrigues rotation formula describes
3D rotations by an angle θ about a unit-length axis vector u ∈ R³:

R = I + (sin θ)[u]× + (1 − cos θ)[u]×²

where [u]× denotes the skew-symmetric cross-product matrix of u. A point in
the world frame is mapped into the camera frame by a rigid transformation of
the form Ry + t; writing the camera-frame coordinates of the point as (X, Y, Z),
the pinhole model of figure 2.2 projects it onto the image plane at

x = kf X/Z   and   y = lf Y/Z

where f is the focal length and k and l are the pixel scale factors along the
two image axes.
Figure 2.2.: Pinhole Camera Setup
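Translated directly into code, the projection above reads as the short sketch
below (the function and argument names are ours, not taken from our implementation):

import numpy as np

def project(point_world, R, t, f, k, l):
    """Pinhole projection of a 3D world point, following the equations above."""
    X, Y, Z = R.dot(point_world) + t   # rigid transform into the camera frame
    return np.array([k * f * X / Z, l * f * Y / Z])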
        new_t = self.t + self.r.dot(otherSE2.t)
        new_theta = normalize2(self.theta + otherSE2.theta)
        new_R = self.R(new_theta)
        return SE2(new_t[0], new_t[1], new_theta)

    def __repr__(self):
        return "<x=%f y=%f theta=%f>" % (self.t[0], self.t[1], self.theta)
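The surrounding class definition is not shown above; a minimal self-contained
sketch consistent with the fragment (the helper normalize2 and the attributes
t, r and theta follow the fragment, while everything else is assumed) is:

import numpy as np

def normalize2(theta):
    # Wrap an angle into [-pi, pi); assumed behaviour of the helper used above.
    return (theta + np.pi) % (2 * np.pi) - np.pi

class SE2(object):
    """A 2D pose: translation (x, y) and heading theta."""
    def __init__(self, x, y, theta):
        self.t = np.array([x, y])
        self.theta = normalize2(theta)
        self.r = self.R(self.theta)

    def R(self, theta):
        # 2x2 rotation matrix for the given heading.
        return np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])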
Thus, when given an input which represents motion between two poses, we
can estimate the new position by applying this motion operator to the old
position. Then, when we obtain a measurement of the new position, we can
begin to build the error function which will be important in solving bundle
adjustment. The error function can be defined in a variety of ways, but the
most common is:
e = ẑ − z
where ẑ is our estimate obtained by applying Ta→b to a, and z = b in this case.
Note that similar calculations are carried out for landmark locations, since a
landmark l = (xl, yl) observed in one pose would be expected to appear at
position Ta→b l when viewed in pose b. Next, to estimate the global error of
our estimate for every position and of every landmark, we take the sum of
squares (SSE):
SSE = Σᵢ eᵢᵀ eᵢ

This model can be further adapted by taking into consideration the known
measurement variance for each error term, Ωᵢ⁻¹. Now the total error becomes:

SSE = Σᵢ eᵢᵀ Ωᵢ eᵢ
The quadratic form is not chosen by accident. It can be shown that this form
is the best unbiased estimator available if we assume a Gaussian distribution
of measurement errors. Furthermore, minimizing it is equivalent to maximizing
the log-likelihood of the measurements. Our goal in the next section will be to consider methods
of solving this problem.
To recap, we have defined a model which allows us to produce param-
eter estimates (either landmark positions, or robot poses, or otherwise).
This framework can be generalized by defining an arbitrary estimation
function (similar to __rmul__ in the SE2 listing above) and forming the SSE as
before.
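As a concrete illustration, the weighted SSE above can be computed directly
from stacked error blocks and their information matrices; the following sketch
assumes the blocks are held in NumPy arrays (the array names are illustrative):

import numpy as np

def weighted_sse(errors, omegas):
    """errors: (n, d) array of error blocks e_i;
    omegas: (n, d, d) array of information matrices Omega_i."""
    total = 0.0
    for e, omega in zip(errors, omegas):
        total += e.dot(omega).dot(e)  # e_i^T Omega_i e_i
    return total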
This way, we obtain the next estimate from the previous one as x₁ = x₀ + ∆x∗.
This second-order method is commonly known as Gauss-Newton and
converges quickly when our initial guess x₀ is already close to the optimal
value. Note that we could also have applied a different technique, known as
gradient descent, which involves computing the Jacobian directly and
following the update rule

x₁ = x₀ − λ∇e(x₀)
This first-order update rule makes good progress even when the initial guess is
poor, but takes many small steps to converge near the optimum. Thus, the well-known
Levenberg-Marquardt algorithm merges these two approaches by modifying
the Gauss-Newton formulation as follows:
(H + λW)∆x∗ = −JᵀΩe
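A minimal dense sketch of one damped step (using NumPy, taking W to be the
identity; the variable names are ours) is:

import numpy as np

def lm_step(J, omega, e, lam):
    """One Levenberg-Marquardt step for the weighted problem above.
    J: (m, n) Jacobian, omega: (m, m) information matrix, e: (m,) error vector."""
    H = J.T.dot(omega).dot(J)        # Gauss-Newton approximation of the Hessian
    rhs = -J.T.dot(omega).dot(e)     # right-hand side, -J^T Omega e
    W = np.eye(H.shape[0])           # damping matrix; the identity is one common choice
    return np.linalg.solve(H + lam * W, rhs)

In practice the step is accepted and λ decreased when it reduces the SSE, and
rejected with λ increased otherwise.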
Figure 2.4.: Sparsity of Bundle Adjustment Matrices (non-zero values dis-
played): Jacobian (left), and Hamiltonian (right)
H∆x = JᵀΩe
Multiplying this system on the left by the block matrix

[ I   −W V∗⁻¹ ]
[ 0       I   ]

we obtain a reduced system, whereby we can first solve the reduced equation for
the pose increments a and then back-substitute to solve for the landmark
increments b.
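A dense sketch of this elimination, written for the usual two-block system
[[U, W], [Wᵀ, V]] acting on pose increments a and landmark increments b (the
block names follow the common bundle adjustment convention and are assumed
here, since the full derivation is abbreviated above), is:

import numpy as np

def schur_solve(U, W, V, eps_a, eps_b):
    """Solve the two-block normal equations by eliminating the landmark block first."""
    V_inv = np.linalg.inv(V)               # V is block-diagonal in practice, so this is cheap
    S = U - W.dot(V_inv).dot(W.T)          # reduced (Schur complement) system for the poses
    a = np.linalg.solve(S, eps_a - W.dot(V_inv).dot(eps_b))
    b = V_inv.dot(eps_b - W.T.dot(a))      # back-substitute for the landmarks
    return a, b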
In Cholesky decomposition, the goal is to decompose the matrix into the
form H = LLᵀ. Specifically, one frames the following equation:

A = [ a11  A21ᵀ ]  =  [ l11   0  ] [ l11  L21ᵀ ]
    [ A21  A22  ]     [ L21  L22 ] [  0   L22ᵀ ]

and then note that for this to hold, l11 = √a11 and L21 = (1/l11) A21, leaving
us to solve

A22 − L21L21ᵀ = L22L22ᵀ
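The recursion implied by the last equation can be written down directly; the
following dense NumPy sketch is illustrative only (a production solver would use
a sparse, fill-reducing factorization such as CHOLMOD’s):

import numpy as np

def cholesky_blocked(A):
    """Return lower-triangular L with A = L L^T, following the recursion above."""
    L = np.zeros_like(A, dtype=float)
    L[0, 0] = np.sqrt(A[0, 0])
    if A.shape[0] == 1:
        return L
    L[1:, 0] = A[1:, 0] / L[0, 0]          # L21 = A21 / l11
    L[1:, 1:] = cholesky_blocked(A[1:, 1:] - np.outer(L[1:, 0], L[1:, 0]))
    return L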
In an off-line setting, where our focus lies, we read in an entire graph
structure from an existing database and perform the relevant computations.
Problems of this sort arise when learning very large environments, and
computational techniques which take advantage of
distributed computing platforms play a leading role.
2.2. PyOP2
We aim to give a background for the PyOP2 framework and the types
of problems it specializes in solving. We will cover the use of Python as
a language for problem specification. A judicious use of operator over-
loading allows researchers to specify the problem domain concisely without
destroying the mathematical structure, which an optimizing compiler can
use to reduce computational load. PyOP2 is most commonly used for
unstructured mesh applications, which are typical in fluid dynamics and
mechanical engineering. However, we will use it to set up
and solve bundle adjustment computations, since both problems deal with
solving sparse systems generated by graph structures.
Figure 2.5.: Scientific Python Ecosystem
Consider, for example, the Poisson problem

∇²u = f     for x ∈ Ω
u = g(x)    for x ∈ ∂Ω

Discretizing its weak form over a finite element basis leads to a sparse linear
system of the form

Lu = Mb
At this stage, the similarities between the bundle adjustment matrix for-
mulation and finite element method problems still seem somewhat murky,
although they are both formulated as a sparse matrix problem. Unfortu-
nately, after further investigation into these similarities, we were not able
to extend the link. Specifically, the weak form is obtained under strong
assumptions about differentiability of the solution, something that is not
present in the bundle adjustment problem. However, finite elements are
also a form of a graph problem. To see this, note that the basis functions
vj (x) in 2D are continuous surfaces in some small sub-domain and 0 else-
where. Thus, they represent edges in the graph, while the graph vertices
can be thought of as the boundaries between adjacent surfaces. In finite ele-
ment parlance, the surfaces are called ‘elements‘ or ‘facets‘. A visualization
of the surfaces is shown in figure 2.6.
At this point, the adv variable is of type op2.Kernel and contains code
representing the discretized calculation for:
∫ p(x) q(x) dx
The kernel adv contains autogenerated C-code which is next passed to the
op2.par_loop() function (along with variables representing input values)
which distributes work while exploiting the sparsity structure. In our work,
we aim to produce a similar structure for bundle adjustment problems.
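For illustration, the shape of such an assembly in PyOP2 is sketched below; the
kernel body is a toy stand-in, and the set, map and matrix names (elements,
elem_node, coords, mat) are assumptions that mirror the bundle adjustment
declarations of chapter 4 rather than the original finite element example:

from pyop2 import op2

# Toy kernel: accumulate a constant local contribution per pair of basis
# functions; a real finite element code generates this body from the weak form.
mass_code = """
void mass(double A[1][1], double *coords[2], int i, int j)
{
    A[0][0] += 1.0;
}
"""
mass = op2.Kernel(mass_code, 'mass')

# elements (op2.Set), elem_node (op2.Map), coords (op2.Dat) and mat (sparse
# matrix) are assumed to have been declared beforehand.
op2.par_loop(mass, elements(3, 3),
             mat((elem_node[op2.i[0]], elem_node[op2.i[1]]), op2.INC),
             coords(elem_node, op2.READ))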
We have hopefully presented the prior knowledge necessary for the reader
to proceed with the rest of this report. We have presented the context in
which bundle adjustment is relevant for computer vision and have shown a
graph-theoretic formulation of the problem and the resulting sparse non-linear
least-squares problem. We have also outlined some sparse solving
techniques, which may be interesting when considering which solver to use
in a given setting. Next, we took a detour to consider finite element methods,
another well-studied variational problem class which shows mathematical
similarities to bundle adjustment, and explained our intuition behind choosing
PyOP2 as an execution platform for bundle
adjustment. In the remaining chapters, we will show what work has already
been done in this field, and which software is available.
3. Related Work
(the possibility that the robotic system re-visits a scene previously seen).
• SAM and iSAM - A C++ library developed by Kaess et al. [12]; They
use QR factorization of the sparse Jacobian matrix to minimize the
quantity ‖Jx − b‖²; Their incremental approach uses Givens rotations
to avoid re-solving the QR problem from scratch when new data points
arrive; This approach works well in an on-line setting but still relies on
back-solving a triangular system at each iteration.
• g2o - A C++ project to solve general hypergraph optimization problems
developed by Kummerle et al. [13]; The library follows an extensible
architecture, which allows the freedom to select different solvers and to
specify different measurement and error functions for the SLAM problem;
The authors look for performance gains by using vectorized instructions in
the underlying matrix libraries.
These software packages, all released within the past 10 years, underline
the increased interest seen in this field and its application to robot vision.
They all use differing data formats depending on what is most convenient
for the approach. Thus, testing and benchmarking comparisons between
different packages have been difficult to come by, as each researcher would
have to replicate a large chunk of an existing library (especially when the
data reading operations are rooted deeply within the object hierarchy, as in
g2o). We will look at two simulated 2D datasets, the Intel campus and the
Manhattan city grid, to evaluate performance and resource utilization. When
comparing iSAM (in bulk mode) and g2o on the Manhattan dataset (3500
vertices and 5598 edges), g2o performed approximately 1.5 times faster than
iSAM (0.25 seconds versus 0.4 seconds).
4. Program Design
• Compute the product JᵀΩe: this constitutes the right-hand side of
the linearized equation we are trying to solve.
• Compare the new error measure with the old error, adjust λ accord-
ingly, and check any stopping conditions to terminate the non-linear
least squares search.
4.1. Data
Bundle adjustment literature has done a thorough job of describing the
methodology used to solve for pose coordinates, landmark coordinates, and
even camera coordinates. There has, however, been less emphasis on a sys-
tematic standardization of data formats as each available package provides
data in non-compatible layouts. We have found the g2o data format to be
the most general formulation, allowing the user to specify a wide range of
graph types.
Generally, data arrives as plain text with rows for vertices and edges.
The vertex label (e.g. SE2, QUART, etc) determines the type and the data
items in that row. Specifically, an SE2 vertex row takes the form
VERTEX_SE2 id x y θ, i.e. the vertex index along with the estimated parameters
x, y, θ for that pose. A row containing an edge (measurement) also comes with
sition x, y, θ. A row containing an edge (measurement) also comes with
a label, allowing the program to correctly associate the appropriate error
function calculation at each edge. The other items in the row of edge data
provide indices of the vertices that edge connects as well as the measurement
calculations and other fixed parameters (e.g. the measurement variance ma-
trix for that edge Ω).
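For illustration, a pose vertex and an odometry edge in this format look as
follows (the numeric values are made up; the six trailing numbers of the edge
row are the upper triangle of its information matrix Ω):

VERTEX_SE2 0 0.000000 0.000000 0.000000
VERTEX_SE2 1 1.030390 0.011350 -0.012958
EDGE_SE2 0 1 1.030390 0.011350 -0.012958 44.72 0.0 0.0 44.72 0.0 44.72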
We used the pandas package [17], which stores data contiguously and
efficiently in numpy arrays. The advantage that pandas brings is quick
selection of named columns and rows (see listing A.3). Furthermore, it supports
streaming data from regular or HDF5 files via iterators; listing 4.1 shows a
simple visualization of the loaded vertices:
Listing 4.1: Visualizing The Intel Dataset

# quickload parses the g2o file into DataFrames; it is assumed to live
# alongside INTEL_G2O in the ba_data helpers.
from ba_data import INTEL_G2O, quickload
import matplotlib.pyplot as plt

vertices, edges = quickload(INTEL_G2O)
plt.plot(vertices['dim1'], vertices['dim2'])
plt.show()
4.2. Building the Error Function

The error function maps the stacked parameter vector (each vertex having a
few degrees of freedom, dᵢ) to errors:

e : R^(Σᵢ dᵢ) → R^(Σᵢ dᵢ)
Listing 4.2: Example PyOP2 Code for Building Bundle Adjustment Graph

import numpy as np
from pyop2 import op2

def identity(num, dim):
    return np.asarray([i / 2 for i in range(dim * num)], dtype=np.uint32)

NUM_POSES = 3
NUM_POSE_CONSTRAINTS = 2

poses = op2.Set(NUM_POSES, 'poses')
pose_constraints = op2.Set(NUM_POSE_CONSTRAINTS, 'pose_constraints')

# constraint 1: pose0 --> pose1; constraint 2: pose1 --> pose2
constraint_pose_data = np.asarray([0, 1, 1, 2], dtype=np.uint32)
# all constraints map to themselves
constraint_constraint_data = identity(NUM_POSE_CONSTRAINTS, 2)

constraints_to_poses = op2.Map(pose_constraints,
                               poses,
                               2,
                               constraint_pose_data,
                               'poses_constraints')

constraint_to_constraint = op2.Map(pose_constraints,
                                   pose_constraints,
                                   2,
                                   constraint_constraint_data,
                                   'constraint_to_constraint')
In mathematical terms, we have created maps which tell PyOP2 how to cre-
ate a correspondence to data when iterating over the set of pose constraints.
When the PyOP2 runtime is given a kernel to execute, it relies on these
maps to partition the iteration space into disjoint subspaces so as to mini-
mize data contention. Note that we build the constraint_to_constraint
mapping with dimensionality 2 because each constraint actually maps to 2
dimensions: x, y (in the case of Euclidean parameters). In the case of finite
elements, PyOP2 uses coloring on the elements, but other techniques have
been studied since at least 1970. In our production code, we can encapsulate
these procedures in a data structure and populate the data appropriately.
These structures are next passed to a PyOP2 C-style execution kernel, which
represents a computation to be done at each iteration. Since PyOP2 per-
forms the iteration space tiling to maximize parallelism on a pre-specified
backend, this approach provides portable performance without needing to
modify the code for different execution platforms.
Although the above example is simplistic, it illustrates the approach we
take in building the graph in PyOP2 for efficient execution. At each step,
we iterate over the measurements (or ‘constraints‘ or ‘edges‘ in graph the-
ory parlance) and produce calculations over poses or landmarks (‘vertices‘).
The op2.Map construct allows for this level of indirection since it maps a
specific measurement to a specific number of poses which that edge acts
upon. Because this is quite general, it is equally possible to define compu-
tations over hyper-edges and hyper-vertices (where edges connect more than
two vertices), although bundle adjustment problems do not usually warrant
the use of these structures.
The error function, as described earlier, is a high-dimensional vector,
which is better represented in chunks, e = (e₁, e₂, . . . , eₙ)ᵀ, with each
chunk eᵢ ∈ R^dᵢ, where dᵢ is the dimensionality of the iᵗʰ measurement.
If eᵢ measures distances on the SE2 manifold, dᵢ = 3. Our implementation
iterates over these edges, but requires data from the corresponding poses
(or landmarks). To do so, we first encapsulate the available measurement
data into an op2.Dat with a specified ‘constraint-to-pose‘ Map built-in. We
then follow a two-stage approach:

• First, iterate over the measurements, using the constraint-to-pose Map to
gather the connected pose (or landmark) parameters and compute the estimated
measurement ẑ for each edge

• Next, iterate once more over the measurements, this time performing
a trivial IdentityMap over them and applying the user-specified error
function to both the estimates and the available observation data
or, more generally:

SSE_δ = Σᵢ ρ_δ(eᵢᵀ Ωᵢ eᵢ)

where ρ_δ(x) = x² for |x| < δ, and ρ_δ(x) = 2δ|x| − δ² otherwise.
This robustified total error gives less weight to extreme outliers (those lying
further than δ away from the measurement) while maintaining convexity and
thus not decreasing the chance of converging
to the global minimum. Since the Ωᵢ are inputs, we can provide a reference to
them via PyOP2 Dat objects, defined over measurements (since each one
corresponds solely to the inverse-variance of measurement data and has no
direct connection to pose or landmark data). The point of data contention
is the writing of the SSE variable by different processes as they traverse
the measurement iteration space. PyOP2 provides a type of data carrier,
op2.Global, which is shared among all the processes. Thus, our kernel will
be specified as in listing 4.3:
Listing 4.3: Total Error Kernel

total_error_code = """
void total_error(double e[2], double omega_block[4], double *sse)
{
    *sse += (e[0]*omega_block[0] + e[1]*omega_block[2]) * e[0] +
            (e[0]*omega_block[1] + e[1]*omega_block[3]) * e[1];
}
"""

total_error = op2.Kernel(total_error_code, 'total_error')
op2.par_loop(total_error, pose_constraints,
             e(op2.IdentityMap, op2.READ),
             omegas(op2.IdentityMap, op2.READ),  # Dat holding the Omega blocks; name assumed
             F(op2.INC))
4.3. Building the Jacobian Blocks

The ẑ are constant data, so the Jacobian block for this measurement will
have the following structure:

Je = [ ∂ex/∂px  ∂ex/∂py  ∂ex/∂qx  ∂ex/∂qy ]  =  [ 1  0  −1   0 ]
     [ ∂ey/∂px  ∂ey/∂py  ∂ey/∂qx  ∂ey/∂qy ]     [ 0  1   0  −1 ]
The Theano package [3] lets the user build a symbolic representation of their
problem, and then performs optimizing graph transformations on the resultant
execution graph, followed by fast generated C code (for the CPU) or CUDA code
(for GPUs). Some of its functionality overlaps with PyOP2, though it provides
less support for more complicated parallelization techniques such as graph
partitioning. Since we are only concerned with differentiation of smooth
functions over a small set of variables, we selected the Sympy module for
this task. The overriding reason is the in-program generation of C code that
Sympy provides, which we can use directly to populate the PyOP2 kernel that
constructs the Jacobian blocks. For example, in SE2 the estimation function
uses the manifold difference

q − p = [  (qx − px) cos pθ + (qy − py) sin pθ ]
        [ −(qx − px) sin pθ + (qy − py) cos pθ ]

between poses and the usual Euclidean difference between a pose and a landmark.
Thus, the Jacobian block for a measurement between poses p and q will
look as follows:

Je = [ ∂ex/∂px  ∂ex/∂py  ∂ex/∂pθ  ∂ex/∂qx  ∂ex/∂qy  ∂ex/∂qθ ]
     [ ∂ey/∂px  ∂ey/∂py  ∂ey/∂pθ  ∂ey/∂qx  ∂ey/∂qy  ∂ey/∂qθ ]
     [ ∂eθ/∂px  ∂eθ/∂py  ∂eθ/∂pθ  ∂eθ/∂qx  ∂eθ/∂qy  ∂eθ/∂qθ ]

   = [  cos wθ   sin wθ    (qx − px) sin wθ − (qy − py) cos wθ   −cos wθ   −sin wθ    0 ]
     [ −sin wθ   cos wθ   −(qx − px) cos wθ + (qy − py) sin wθ    sin wθ   −cos wθ    0 ]
     [    0        0                       1                        0        0      −1 ]
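A small Sympy sketch of how such a block can be obtained symbolically and
emitted as C code follows; the symbols mirror the matrix above, but the error
sign convention and the helper calls are illustrative rather than a quotation of
our implementation (which appears in the appendix):

import sympy as sp

px, py, pt, qx, qy, qt = sp.symbols('p_x p_y p_theta q_x q_y q_theta')
zx, zy, zt = sp.symbols('z_x z_y z_theta')

# Estimated relative motion between poses p and q on SE2, minus the measurement z.
# Depending on the sign convention chosen for the error, the resulting block may
# differ from the matrix above by an overall sign.
est = sp.Matrix([
    (qx - px) * sp.cos(pt) + (qy - py) * sp.sin(pt),
    -(qx - px) * sp.sin(pt) + (qy - py) * sp.cos(pt),
    qt - pt,
])
e = est - sp.Matrix([zx, zy, zt])

J = e.jacobian([px, py, pt, qx, qy, qt])   # 3x6 Jacobian block for this edge
print sp.ccode(J[0, 2])                    # C expression, ready for a PyOP2 kernel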
4.4. Building the Hessian

Having to solve

(H + λW)∆x = JᵀΩe   where   H ≡ JᵀΩJ

for ∆x, the next step is to build the left-hand side of that equation. At this
stage, we have computed the Jacobian blocks and the inverse-covariance
blocks, Ωi (encapsulated in a reference via the PyOP2 Dat type), in the
previous step. Although the Hamiltonian matrix has dimensions solely de-
pendent on pose and landmark data, we chose the constraints Set as our
iteration space, allowing us to bring in pose data related to those constraints
via the constraints_to_poses Map defined earlier, and populate the matrix
that way. Specifically, the Hamiltonian construction kernel is displayed in
listing 4.4
op2.par_loop(poses_mat_hamiltonian, pose_constraints(2, 2),
             hamil_mat((constraints_to_poses[op2.i[0]],
                        constraints_to_poses[op2.i[1]]), op2.INC),
             jacobian_blocks(op2.IdentityMap, op2.READ))
A small but important consideration has to be made for the initial constraint,
the one that dictates the initial pose. Thus, we added extra code to
update the upper-left block of the Hamiltonian appropriately. Without this
extra condition, the Hamiltonian would be ill-conditioned. Similarly, we also
update the calculation for the right-hand side of the non-linear
least squares problem. However, if we assume the initial measurement error
is zero (after all, a famous physicist once said that everything is relative),
we can safely omit the extra code there.
4.5. Solvers
As we mentioned in the background review, a variety of sparse linear solvers
exist to tackle our linearized bundle adjustment problem. In PyOP2, we
invoke the built-in solvers with the following code:
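A sketch of such an invocation follows (the solver and preconditioner options
are illustrative and are forwarded to the PETSc backend; hamil_mat, dx and rhs
stand for the assembled matrix, the unknown increment and the right-hand side):

from pyop2 import op2

solver = op2.Solver(linear_solver='cg', preconditioner='jacobi')
solver.solve(hamil_mat, dx, rhs)   # dispatches to the PETSc backend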
Most modern bundle adjustment packages exploit the sparsity of the lin-
earized system at the expense of compromising code flexibility. PyOP2
gives one the freedom to experiment with different solvers, so long as the
back-end is implemented. Thus, the introduction of new solving techniques
doesn’t mean having to rearchitect the library. However, PyOP2, as of
this writing, does not support solving via the Schur complement, a trick
described in chapter 2 and successfully used to reduce the dimensionality of
bundle adjustment problems by first solving for poses and afterwards for
landmarks. Thus, we expect our implementation to suffer when there is a
roughly equal balance between the two types of vertices. However, this is
generally not the case. In closed environments, pose data tends to accu-
mulate as the autonomous vehicle measures many of the same landmarks
repeatedly, while in open environments, each new pose will likely corre-
spond to a (potentially large) handful of new landmarks. Current research
focuses on considering more and more landmarks from each pose, thus heav-
ily weighing the balance towards landmarks. Thus, we believe we are not
at a great disadvantage for not being able to use Schur.
4.6. Putting It All Together

The complete solver iterates through the following steps:

1. Load data from file or another source and set up the preliminary Sets, Maps,
and Dats, including the Ω blocks.

2. Evaluate the total error, SSE, and if it is sufficiently small, exit and
report the optimal parameter values; else proceed to [3].
3. Evaluate the estimate, error, and Jacobian-block kernels over the constraints.

4. Calculate the right-hand-side kernel, including the estimate and error
vectors.

5. Build the left-hand side of the linearized system (the Hamiltonian JᵀΩJ
with damping) from the Jacobian and Ω blocks.

6. Solve the linearized system resulting from steps [4] and [5].
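The following schematic shows how these steps fit into the outer
Levenberg-Marquardt loop; every helper name and constant is an illustrative
stand-in for the kernels and PyOP2 calls described in sections 4.2 to 4.5:

lam = 1e-4
sse = compute_sse()
for iteration in range(MAX_ITERATIONS):
    if sse < TOLERANCE:
        break
    compute_estimates_and_errors()
    compute_jacobian_blocks()
    rhs = build_rhs()                 # -J^T Omega e
    lhs = build_hamiltonian(lam)      # J^T Omega J plus damping
    dx = solve(lhs, rhs)
    apply_increment(dx)
    new_sse = compute_sse()
    if new_sse < sse:                 # accept the step and relax the damping
        sse, lam = new_sse, lam * 0.5
    else:                             # reject the step and increase the damping
        revert_increment(dx)
        lam *= 2.0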
5. Evaluation
5.1. Analyzing Our Implementation
Figure 5.1.: Intel Dataset Hamiltonian Sparsity (non-zero elements dis-
played)
Running our implementation on the Intel dataset produces the following
per-step timings (in seconds):

{’error’: 0.0017781257629394531,
’estimate’: 0.0017139911651611328,
’jacobian’: 0.0019482135772705077,
’lhs’: 0.0081790447235107425,
’per_iter_setup’: 0.00072622299194335938,
’rhs’: 0.0018266201019287109,
’solve’: 0.0023548126220703123,
’sse’: 0.0017397403717041016}
preprocess: 0.002824
generate kernels: 0.227442
setup data: 0.011783
total time per iteration: 0.020267
total time: 0.101334
We have included the individual steps taken (effectively the time to run
each op2.par_loop(...)), as a guide. The labels are defined as follows:
• lhs: this is the left-hand side of the least squares equation, which involves
repeatedly updating the Hamiltonian matrix with JᵀΩJ values

• per_iter_setup: this is the zeroing out of some arrays at each iteration

• solve: this step dispatches the lhs and rhs to be solved by the PETSc
solver
In this case, generating the kernels is relatively expensive, taking more than
twice the time the rest of the program uses. As we shall see with the next
dataset, this cost is constant. The biggest computational cost, as we expected,
is the calculation of the left-hand side, the Hamiltonian. This is expensive
because of the relatively large number of operations the kernel performs:
multiplying three 3×3 matrices takes 54 multiplication
and 54 addition operations, and since we iterate over the constraints
in both rows and columns, we are updating the Hamiltonian over 4 such
blocks each time.
We also show the results from running on a larger dataset, the Manhattan
dataset with 3500 poses and 5598 constraints:
{’error’: 0.0020958423614501954,
’estimate’: 0.0022215366363525389,
’jacobian’: 0.0029551506042480467,
’lhs’: 0.019606590270996094,
’per_iter_setup’: 0.0005340576171875,
’rhs’: 0.0020139694213867189,
’solve’: 0.0038761615753173826,
’sse’: 0.0018854141235351562}
preprocess: 0.006884
generate kernels: 0.232810
setup data: 0.031318
total time per iteration: 0.035189
total time: 0.175944
Lastly, we present the table of timings using cProfile for the Manhattan
dataset:
Running Time  Percentage  Symbol Name
0.129         38.51%      {select.select}
0.072         21.49%      {instant_module.wrap_lhs}
0.039         11.64%      {imp.load_module}
0.028          8.36%      {pyop2.op_lib_core.build_sparsity}
0.021          6.27%      {posix.read}
0.008          2.39%      {posix.fork}
0.007          2.09%      pyop2/petsc_base.py:195(solve)
0.006          1.79%      {instant_module.wrap_jacobian_block}
0.005          1.49%      sympy/core/cache.py:78(wrapper)
0.005          1.49%      {isinstance}
0.004          1.19%      {built-in __new__ of type object}
0.004          1.19%      {numpy.core.multiarray.array}
0.003          0.90%      sympy/core/basic.py:1772(_preorder_traversal)
0.002          0.60%      balib.py:102(identity_map)
0.002          0.60%      subprocess.py:650(__init__)
The predominant expenditure of time comes from the select.select call,
which wraps the Unix select system call used to communicate data via MPI.
Unfortunately, this masks a large part of the profile, as each op2.par_loop
is in essence making MPI calls. Instant and PyOP2 calls comprise the majority
of the remaining execution cost, which is reassuring: it means slower
Python-level calls are being avoided. The real power
behind PyOP2 is the ability to run over a variety of architectures and to
generate efficient code for those architectures at compile time. The examples
we have illustrated use a sequential backend, and spend some time in the
MPI communication phase. On larger datasets, and with other backends,
this cost can be partially offset.
5.2. Comparisons With Existing Software
codebase, the authors did indirectly make use of vectorized calculations by
bit-aligning the data structures where necessary. Similarly iSAM is also lim-
ited to running on a single core. However, both packages rely on either Eigen
(a templated C++ matrix library) or SuiteSparse (another popular sparse
matrix library), which both, in turn, call highly tuned BLAS functions. The
package runs on the Manhattan dataset in approximately 415ms (according
to the console output). We show a representative breakdown of where the
package spent execution time (using the Instruments profiler in Mac OS X):
Running Time  Percentage  Symbol Name
219.0ms       45.4%       non-virtual thunk to isam::Slam::jacobian()
130.0ms       26.9%       isam::CholeskyImpl::factorize()
 70.0ms       14.5%       cholmod_factorize
 48.0ms        9.9%       cholmod_analyze_p2
  5.0ms        1.0%       cholmod_solve
 57.0ms       11.8%       isam::SparseSystem::operator=()
 32.0ms        6.6%       isam::SparseMatrix::SparseMatrix()
 25.0ms        5.1%       non-virtual thunk to isam::Slam::weighted_errors()
The system spends about 50% of the time computing Jacobian blocks, and
another 27% on Cholesky factorization and solving the system. Our im-
plementation avoids some of the cost of calculating a numerical
Jacobian by obtaining the analytical derivatives using Sympy.
Next, we profile g2o. This package has been observed to run about as fast
as iSAM on the Intel dataset and 50% faster on the Manhattan
dataset. The Instruments profiler recorded the following timing costs for
this package:
Running Time  Percentage  Symbol Name
65.0ms 26.2% EdgeSE2::read(std::istream&)
17.0ms 6.8% VertexSE2::read(std::istream&)
11.0ms 4.4% readLine()
11.0ms 4.4% ParameterContainer::read()
8.0ms 3.2% OptimizableGraph::addEdge()
4.0ms 1.6% istream::operator>>()
53.0ms 21.3% BlockSolver<>::solve()
11.0ms 4.4% BlockSolver<>::buildSystem()
3.0ms 1.2% SparseOptimizer::computeActiveErrors()
3.0ms 1.2% BlockSolver<>::buildStructure()
4.0ms 1.6% EdgeSE2::computeError()
15.0ms 6.0% loadStandardSolver()
11.0ms 4.4% loadStandardTypes()
8.0ms 3.2% OptimizableGraph::OptimizableGraph()
5.0ms 2.0% SparseOptimizer::SparseOptimizer()
4.0ms 1.6% SparseOptimizer::initializeOptimization(int)
Of the total 250ms spent in running the g2o package on the Manhattan
dataset, it is interesting to note that almost 50% of the time is spent on
data IO operations. This is a symptom of the design choice made by the au-
thors to strongly couple the data loading and processing operations. Thus,
the user is forced to write a load() function for a new vertex or edge type
and to implement the calculations needed to update the measurement error
function, Jacobian, and Hamiltonian. We speculate, however, that this design
choice may have been made to garner a speed improvement.
5.3. Remarks
We have shown that PyOP2 can allow the structuring and solving of bun-
dle adjustment problems without the sacrifice in performance that usually
comes with Python. Our code uses Python as a staging language, Sympy for
representing the mathematical formulation of measurement and error func-
tions, and PyOP2 for elegant graph representation and fast execution. It is
important to note that our implementation solves 2D bundle adjustment,
while the packages we compared against also handle 3D problems and offer
iterative solvers. In practice, these features would not be difficult to implement
with our design, but care would have to be taken to create the appropri-
ate indirection between poses and landmarks, especially when building the
Hamiltonian, since constraints connecting different vertex types may have
different dimensions. In practice, this could mean a larger memory overhead
to accommodate the larger pose type.
Our implementation was faster than iSAM and marginally faster than g2o.
We consider this a mixed result. First, we note that transitioning to a larger
dataset showed a relative speed improvement compared to other packages,
suggesting that even larger problems would benefit even more. Second,
we did not test the ceres-solver package due to data and time limitations.
In the next section, we state closing remarks and suggest potential avenues
for further research.
6. Conclusion
In this thesis, our goal was to investigate the application of the PyOP2 frame-
work in other contexts, specifically, in robot vision. The field of robot vision
is broad and requires a wide variety of disciplines to work together. We fo-
cused on the subset of problems in robot vision which deal with optimizing
measurements obtained from autonomous exploration of an environment.
We presented a viable solution, although our implementation is far from complete;
realistic SLAM problems, for example, require camera calibration to be taken into account. In
the rest of this chapter, we state concluding remarks along with an outline
for future work.
6.1. Results
We have presented our findings on an intuitive approach to building and
solving bundle adjustment problems of arbitrary size. We borrowed tech-
niques for addressing large parallel finite element calculations, as both prob-
lems, although disparate, can be represented by (hyper)graphs and use
sparse linear algebra algorithms at their core. We showed that the PyOP2
framework developed by the Software Optimization group at Imperial Col-
lege has broader applicability, reaching into scene recognition. The larger
result here is performance portability: the ability to generate highly
performant bundle adjustment code seamlessly across current and future
computer architectures, without a rewrite and without sacrificing performance.
As we discuss in the next section, interesting graph problems are
also being tackled by computational neuroscientists and data scientists
analyzing neuronal and social graphs, respectively, where highly sparse
systems with a large degree of local connectivity, also known as small-world
networks, are common.
There are also downsides to this approach. For one, the implementation
phase can be protracted due to limited debugging visibility. First, symbolic
problem definition does not allow the user to inspect data values incremen-
tally as they are evaluated lazily. Second, code generation implies that the
implementer’s code is only a representation of the code being run and not
the code itself. As with all parallel programming tools, PyOP2 does not
allow for easy introspection of program execution, and allows for limited
modularity. Both of these points make unit tests difficult to construct.
Lastly, PyOP2 was made with finite elements in mind and some operations,
such as reading from sparse matrix structures to update other Dat variables,
are not supported. More fine-grained data manipulation support would also
make it easier to use for other applications.
6.2. Future Work
high dimensional solution space for optimal parameters and solve SLAM,
although benchmark comparisons with bundle adjustment were not done.
It may also be worth exploring whether genetic algorithms are an efficient
way of finding bundle adjustment solutions. Other techniques from machine
learning, such as support vector machines and graphical models, could also be
explored instead of relying solely on direct linear algebra methods.
Bibliography
[1] Agarwal, S., and Mierle, K. Ceres Solver: Tutorial & Reference.
Google Inc., 2011.
[3] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pas-
canu, R., Desjardins, G., Turian, J., Warde-farley, D., and
Bengio, Y. Theano: A CPU and GPU Math Compiler in Python.
In Proceedings of SciPy 2010 (2010), pp. 1–7.
[10] Golub, G. H., and Van Loan, C. F. Matrix Computations, 3rd ed.
Johns Hopkins Studies in Mathematical Sciences. The Johns Hopkins
University Press, 1996.
[13] Kummerle, R., Grisetti, G., Strasdat, H., Konolige, K., and
Burgard, W. g2o: A general framework for graph optimization. 2011.
[20] Rathgeber, F., Markall, G. R., Mitchell, L., Loriant, N., Ham, D. A.,
Bertolli, C., and Kelly, P. H. J. PyOP2: A High-Level Framework for
Performance-Portable Simulations on Unstructured Meshes. In SC Companion:
High Performance Computing, Networking Storage and Analysis (2012),
pp. 1116–1123.
[27] Wu, C., Agarwal, S., Curless, B., and Seitz, S. M. Multicore
bundle adjustment. In CVPR 2011 (June 2011), pp. 3057–3064.
A. Some Code
In this section, I provide some of the code used in the examples in this
thesis. Some pieces of code rely on other software libraries; the dependencies
should be obvious from the first few lines of each listing.
def unique_params(d):
    u = set(item for sublist in d.itervalues() for item in sublist)
    return sorted(u)

def offsets(d):
    off = {}
    key_iter = d.iterkeys()
    start_key = key_iter.next()
    off = {start_key: 0}
    cumulative = len(d[start_key])
    for k in key_iter:
        off[k] = cumulative
        cumulative += len(d[k])
    return off, cumulative

def showJacobian():
    N = unique_params(data)
    M = len(data.keys())
    off, C = offsets(data)
    key_idx = {k: v for (k, v) in zip(data.keys(), range(M))}
    val_idx = range(M, M + len(N), 1)

    Poses = scipy.sparse.dok_matrix((C, M), dtype=float)
    Params = scipy.sparse.dok_matrix((C, len(N)), dtype=float)
    for k, v in data.iteritems():
        row_mask = range(off[k], off[k] + len(v))
        col_idx = data.keys().index(k)
        Poses[row_mask, col_idx] = np.random.normal() + 5
        for i in range(len(v)):
            Params[off[k] + i, N.index(v[i])] = np.random.normal() + 5

    Cams = scipy.sparse.dok_matrix((C, len(cameras)), dtype=float)
    for k, v in data.iteritems():
        for i, cam in enumerate(cameras.values()):
            indexes = list(set(cam['images']) & set(v))
            print [v.index(x) + off[k] for x in indexes], indexes
            Cams[[v.index(x) + off[k] for x in indexes], i] = np.random.normal() + 5

    J = sp.hstack([sp.hstack([Poses, Params]), Cams])

    fig = figure()
    ax1 = fig.add_subplot(111)
    ax1.spy(J.todense(), markersize=5)
    plt.show()

def showHamiltonian(Jcsr):
    # Jcsr: the Jacobian in CSR form (e.g. J.tocsr() from showJacobian)
    Hcsr = Jcsr.transpose() * Jcsr
    fig = figure()
    ax1 = fig.add_subplot(111)
    ax1.spy(Hcsr.todense(), markersize=5)
    plt.show()
Listing A.1 shows an example of how we load data from delimited files
(as is typical of input data) and process the information for a 2D poses-only
bundle adjustment problem. Using the pandas.DataFrame data structure
to store edges and vertices separately provides an efficient way of handling
the information we process: for vertices, the parameters x, y, θ and for edges,
the parameters dx, dy, dθ as well as the structure of the inverse measurement
covariance matrix Ω, which has 6 degrees of freedom:
Ω = [ Ω00  Ω01  Ω02 ]
    [ Ω01  Ω11  Ω12 ]
    [ Ω02  Ω12  Ω22 ]
    vertices = []
    edges = []
    with open(filepath, 'r') as fd:
        reader = csv.reader(fd, delimiter=' ')
        for row in reader:
            if 'VERTEX' in row[0].upper():
                vertices.append(np.float64(drop_empty(row[1:])))
            elif 'EDGE' in row[0].upper():
                edges.append(np.float64(drop_empty(row[1:])))

    edges_dataframe = pd.DataFrame(edges, columns=EDGE_COLS)
    vertices_dataframe = pd.DataFrame(vertices, columns=VERTEX_COLS)

    return edges_dataframe, vertices_dataframe
                  i in range(3, len(edges.columns))}, inplace=True)

    return vertices, edges
    start = generated_code.find('return') + 7
    end = generated_code.find(';')
    jacobian_code[deriv] = generated_code[start:end]
    {
        double omega_times_err[%(c_dim)d];
        %(omega_err)s
        int i = 0;
        for ( ; i < %(poses_per_constraint)d; ++i) {
            %(code)s
        }
    }
    """ % {'name': name, 'poses_per_constraint': POSES_PER_CONSTRAINT,
           'j_dim': CONSTRAINT_DIM * POSES_DIM * POSES_PER_CONSTRAINT,
           'j_subblock_dim': CONSTRAINT_DIM * POSES_DIM,
           'o_dim': OMEGA_DOF, 'p_dim': POSES_DIM, 'c_dim': CONSTRAINT_DIM,
           'omega_err': '\n'.join(omega_err), 'code': '\n'.join(code)}

    if _PRINT_CODE:
        print rhs_code
    return op2.Kernel(rhs_code, name)
               CONSTRAINT_DIM, prefix2='%d * i + ' % (JBLOCK_SIZE))) for
              i in xrange(POSES_DIM) for j in xrange(POSES_DIM)]

    hamiltonian_code = """
    void %(name)s(double J[%(j_dim)d], double omega[%(o_dim)d],
                  double H[%(p_dim)d][%(p_dim)d], int i, int j)
    {
        double j_t_omega[%(j_t_omega_dim)d];
        %(jacT_times_omega)s
        %(update)s
    }
    """ % {'name': name, 'j_t_omega_dim': CONSTRAINT_DIM * POSES_DIM,
           'p_dim': POSES_DIM, 'c_dim': CONSTRAINT_DIM, 'o_dim': OMEGA_DOF,
           'poses_per_constraint': POSES_PER_CONSTRAINT,
           'jacT_times_omega': '\n'.join(block_code), 'update': '\n'.join(update),
           'j_dim': JBLOCK_SIZE * POSES_PER_CONSTRAINT}

    lm_kernel = generateHamiltonianDiagonalCode(name + '_lm', lm_param)
    if _PRINT_CODE:
        print hamiltonian_code
    return op2.Kernel(hamiltonian_code, name), lm_kernel