High-Speed Event Camera Tracking
Abstract
Event cameras are bioinspired sensors with reaction times in the order of microseconds. This property makes them appealing for use in highly dynamic computer vision applications. In this work, we explore the limits of this sensing technology and present an ultra-fast tracking algorithm able to estimate six-degree-of-freedom motion with dynamics over 25.8 g, at a throughput of 10 kHz, processing over a million events per second. Our method is capable of tracking either camera motion or the motion of an object in front of it, using an error-state Kalman filter formulated in a Lie-theoretic sense. The method includes a robust mechanism for the matching of events with projected line segments, with very fast outlier rejection. Meticulous treatment of sparse matrices is applied to achieve real-time performance. Different motion models of varying complexity are considered for the sake of comparison and performance analysis.
1 Introduction
Event cameras send independent pixel information as soon as the intensity change at a pixel exceeds an upper or lower threshold, generating "ON" or "OFF" events respectively (see Fig. 1). In contrast to conventional cameras, in which full images are delivered at a fixed frame rate, event cameras emit intensity-change messages asynchronously and per pixel, with microsecond resolution. Moreover, event cameras exhibit a high dynamic range in luminosity (e.g. 120 dB for the DAVIS 240C model [1] used in this work). These two assets make them suitable for applications at high speed and/or with challenging illumination conditions (low illumination levels or overexposure). Emerging examples of the use of these cameras in mobile robotics are event-based optical flow for micro-aerial robotics [18], obstacle avoidance [3, 15], simultaneous localization and mapping (SLAM) [24, 12], and object recognition [17], among others.
We are interested in accurately tracking high-speed 6DoF motion with an event camera. This type of sensor has been used in the past for motion tracking. For instance, 2D position estimates are tracked with the aid of a particle filter in [22]. The method was later extended into an SO(2) SLAM system in which a planar map of the ceiling was reconstructed [23]. Another SLAM system that tracks only camera rotations and builds a high-resolution spherical mosaic of the scene was presented in [9].
Figure 1: Working principle of event cameras (left) with distorted (center) and undistorted output (right).
Full 3D tracking is proposed in [10], where three interleaved probabilistic filters perform pose tracking, scene depth estimation, and log-intensity estimation as part of a SLAM system. These systems were not designed with high-speed motion estimation in mind.
More related to our approach is the full 3D tracking for high-speed maneuvers of a quadrotor with an event camera presented in [14], later extended to a continuous-time trajectory estimation solution [16]. The method is similar to ours in that it localizes the camera with respect to a known wire-frame model of the scene by minimizing point-to-line reprojection errors. In that work, the model being tracked is planar, whereas we are able to localize with respect to a 3D model. That system was later modified to work with previously built photometric depth maps [8]. Non-linear optimization was included in a more recent approach [2]; in this case, the tracking was performed on a sparse set of reference images, poses and depth maps, by having an a priori initial pose guess and taking into account the event generation model to reduce the number of outliers. This event generation model was initially stated in [7] for tracking position and velocity in textured known environments. In a more recent contribution, a parallel tracking and mapping system following a geometric, semi-dense approach was presented in [19]. Its pose tracker is based on edge-map alignment using the inverse compositional Lucas-Kanade method; additionally, the scene depth is estimated without intensity reconstruction. In that work, pose estimates are computed at a rate of 500 Hz.
In the long run, we are also interested in developing a full event-based SLAM system with parallel threads for tracking and mapping that is able to work in real time on a standard CPU. Since event cameras naturally respond to edges in the scene, the map in our case is made of a set of 3D segments sufficiently scattered and visible to be tracked. This work deals with the tracking part, and thus such a map is assumed given. With fast-motion applications in mind, our tracking thread is able to produce pose updates in the order of tens of kHz on a standard CPU, 20 times faster than [19], is able to process over a million events per second, and can track motion direction shifts above 15 Hz and accelerations above 25.8 g.
The main contribution of this paper is, first, a new event-driven Lie-EKF formulation to track the 6DoF pose of a camera in very high dynamic conditions that runs in real time (10 kHz throughput). The use of Lie theory in our EKF implementation allows elegant handling of derivatives and uncertainties on the SO(3) manifold when compared to the classical error-state EKF. Second, we propose a novel fast data association mechanism that robustly matches events to projected 3D line-based landmarks with fast outlier rejection. It reaches real-time performance for over a million events per second and hundreds of landmarks on
a standard CPU. Finally, we benchmark several filter formulations, including Lie versus classic EKF, three motion models and two projection models, adding up to a total of 12 filter variants.
2 Motion estimation
The lines in the map are parametrized by their endpoints $\mathbf{p}_{\{1,2\}} = (x, y, z)_{\{1,2\}}$, expressed in the object's reference frame. We assume the camera is calibrated, and the incoming events are immediately corrected for lens radial distortion using the exact formula in [5].
The state vector x represents either the camera state with respect to a static object, or the object state with respect to a static camera. This model duality is relevant for preserving camera integrity in the experimental validation, where tracking very high dynamics is done by moving the object and not the camera.
To bootstrap the filter's initial pose, we use the camera's grayscale images. FAST corners [6] are detected in the 2D image and matched to those in the predefined 3D map. The initial pose is then computed using the PnP algorithm [11]. After this initial bootstrapping, the grayscale images are no longer used.
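For illustration only, a minimal Python sketch of this bootstrapping step could look as follows, assuming OpenCV is available and that the 2D-3D association is provided by a hypothetical `match_to_map` routine (the actual matching strategy is not detailed here):

```python
import cv2
import numpy as np

def bootstrap_pose(gray, map_points_3d, K, match_to_map):
    """Initial pose from one grayscale frame: FAST corners + EPnP.

    gray          : HxW uint8 grayscale image from the camera's frame output
    map_points_3d : (N,3) array of 3D map points (e.g. segment endpoints)
    K             : (3,3) camera intrinsic matrix
    match_to_map  : callable returning (pts2d, pts3d) correspondences
                    -- hypothetical; matching is application-specific
    """
    # Detect FAST corners on the grayscale frame
    fast = cv2.FastFeatureDetector_create(threshold=20)
    keypoints = fast.detect(gray, None)
    pts2d_all = np.float32([kp.pt for kp in keypoints])

    # Associate detected corners with known 3D map points (not shown here)
    pts2d, pts3d = match_to_map(pts2d_all, map_points_3d)

    # EPnP gives the pose w.r.t. the map frame; we pass zero distortion
    # coefficients since the input is assumed already undistorted.
    ok, rvec, tvec = cv2.solvePnP(
        pts3d.astype(np.float32), pts2d.astype(np.float32),
        K, np.zeros(4), flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP bootstrap failed")
    R, _ = cv2.Rodrigues(rvec)   # rotation matrix from axis-angle
    return R, tvec.reshape(3)
```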
Table 1: State transition x_k = f(x_{k-1}, ...) for the CP, CV and CA motion models; the rightmost column gives the error-state partition δx_k.

                                     CP       CV             CA                            δx_k
R³    ∋  r_k = r_{k-1} + (·)         r_n      v_{k-1} Δt     v_{k-1} Δt + ½ a_{k-1} Δt²    δr_k ∈ R³
SO(3) ∋  R_k = R_{k-1} ⊕ (·)         θ_n      ω_{k-1} Δt     ω_{k-1} Δt + ½ α_{k-1} Δt²    δθ_k ∈ R³
R³    ∋  v_k = v_{k-1} + (·)         –        v_n            a_{k-1} Δt                    δv_k ∈ R³
so(3) ∋  ω_k = ω_{k-1} + (·)         –        ω_n            α_{k-1} Δt                    δω_k ∈ R³
R³    ∋  a_k = a_{k-1} + (·)         –        –              a_n                           δa_k ∈ R³
R³    ∋  α_k = α_{k-1} + (·)         –        –              α_n                           δα_k ∈ R³
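As a minimal sketch, not the paper's implementation, the CV column of Table 1 can be realized with a regular sum on the vector blocks and the right-plus on SO(3); the exponential map is written out explicitly with Rodrigues' formula:

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix [w]x such that [w]x v = w x v."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(theta):
    """Exp: R^3 -> SO(3) via Rodrigues' formula."""
    angle = np.linalg.norm(theta)
    W = skew(theta)
    if angle < 1e-9:                       # small-angle series
        return np.eye(3) + W + 0.5 * W @ W
    return (np.eye(3)
            + np.sin(angle) / angle * W
            + (1.0 - np.cos(angle)) / angle**2 * W @ W)

def predict_cv(r, R, v, w, dt):
    """Constant-velocity prediction (CV column of Table 1), noise omitted."""
    r_new = r + v * dt                     # position: regular sum
    R_new = R @ so3_exp(w * dt)            # orientation: right-plus on SO(3)
    return r_new, R_new, v, w              # v and w follow a random walk
```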
We follow [21] to compute all the non-trivial Jacobian blocks of F, which correspond to the SO(3) manifold. Using the notation $\mathbf{J}^a_b \triangleq \partial a/\partial b$, we have
$$
\mathbf{J}^{R}_{R} = \mathrm{Exp}(\boldsymbol{\omega}\Delta t)^{\top}, \qquad
\mathbf{J}^{R}_{\boldsymbol{\omega}} = \mathbf{J}_r(\boldsymbol{\omega}\Delta t)\,\Delta t
\qquad\text{and}\qquad
\mathbf{J}^{R}_{\boldsymbol{\alpha}} = \tfrac{1}{2}\,\mathbf{J}^{R}_{\boldsymbol{\omega}}\,\Delta t ,
\qquad (2)
$$
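For reference, the right Jacobian $\mathbf{J}_r$ used in (2) has the closed form given in [21]; a small numpy sketch of ours (reusing `skew` from the previous snippet, and not taken from the released code) is:

```python
import numpy as np

def so3_right_jacobian(theta):
    """Right Jacobian of SO(3), Jr(theta), closed form as in [21].

    Satisfies Exp(theta + dtheta) ~= Exp(theta) Exp(Jr(theta) dtheta).
    """
    angle = np.linalg.norm(theta)
    W = skew(theta)                         # skew() as defined in the sketch above
    if angle < 1e-9:                        # first-order series for small angles
        return np.eye(3) - 0.5 * W
    return (np.eye(3)
            - (1.0 - np.cos(angle)) / angle**2 * W
            + (angle - np.sin(angle)) / angle**3 * W @ W)
```

With this, $\mathbf{J}^{R}_{\boldsymbol{\omega}} = \mathbf{J}_r(\boldsymbol{\omega}\Delta t)\Delta t$ and $\mathbf{J}^{R}_{\boldsymbol{\alpha}} = \tfrac{1}{2}\mathbf{J}^{R}_{\boldsymbol{\omega}}\Delta t$ follow directly.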
where $\mathbf{u}_j = (u, v, w)^{\top}_j$ are the projections of the i-th segment's endpoints $\mathbf{p}_j \in \mathbb{R}^3$, $j \in \{1, 2\}$, in projective coordinates, and $\mathbf{K}$ is the camera intrinsic matrix. Jacobians are also computed,
$$
\mathbf{J}^{l}_{r} = \mathbf{J}^{l}_{u_i}\,\mathbf{J}^{u_i}_{r}
\qquad\text{and}\qquad
\mathbf{J}^{l}_{R} = \mathbf{J}^{l}_{u_i}\,\mathbf{J}^{u_i}_{R} \in \mathbb{R}^{3\times 3} ,
\qquad (5)
$$
having $\mathbf{J}^{l}_{u_1} = -[\mathbf{u}_2]_{\times}$, $\mathbf{J}^{l}_{u_2} = [\mathbf{u}_1]_{\times}$, $\mathbf{J}^{u_i}_{r} = -\mathbf{K}\mathbf{R}^{\top}$ for (3a), $\mathbf{J}^{u_i}_{r} = \mathbf{K}$ for (3b), and $\mathbf{J}^{u_i}_{R}$ the Jacobian of the rotation action computed in the Lie-theoretic sense [21], which for the two projection models becomes
Then, each undistorted event $\mathbf{e} = (u_e, v_e)^{\top}$ in the window is matched to a single projected segment $\mathbf{l}$. On success (see Sec. 2.3 below), we define the event's innovation as the Euclidean distance to the matched segment on the image plane, with a measurement noise $n_d \sim \mathcal{N}(0, \sigma_d^2)$,
$$
\text{distance innovation:}\quad z = d(\mathbf{e}, \mathbf{l}) = \frac{\mathbf{e}^{\top}\mathbf{l}}{\sqrt{a^2 + b^2}} \in \mathbb{R} ,
\qquad (7)
$$
where $\mathbf{e} = (u_e, v_e, 1)^{\top}$. The scalar innovation variance is given by $Z = \mathbf{J}^{z}_{x}\,\mathbf{P}\,\mathbf{J}^{z\,\top}_{x} + \sigma_d^2 \in \mathbb{R}$, where the Jacobian $\mathbf{J}^{z}_{x}$ of the innovation with respect to the state is a sparse row vector with zeros in the velocity and acceleration blocks for the larger CV and CA models,
where the state update c) is implemented by a regular sum for the state blocks $\{\mathbf{r}, \mathbf{v}, \boldsymbol{\omega}, \boldsymbol{\alpha}\}$ and by the right-plus $\mathbf{R} \leftarrow \mathbf{R}\,\mathrm{Exp}(\delta\boldsymbol{\theta})$ for $\mathbf{R} \in SO(3)$, as needed for the model in turn (CP, CV, or CA). We remark, for implementation purposes affecting execution speed, that the Kalman gain $\mathbf{k}$ is an m-vector, that to compute $Z$ and (9a) we again exploit the sparsity of $\mathbf{J}^{z}_{x}$, as we did in Sec. 2.1, and that $Z^{-1}$ is the inverse of a scalar.
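The following sketch, written under the notation above and with hypothetical state and Jacobian containers (it is not the authors' released code), illustrates one such per-event scalar update:

```python
import numpy as np

def ekf_point_to_line_update(P, e_px, line, J_line_x, sigma_d):
    """One per-event EKF correction with the scalar distance innovation (7).

    P        : (m,m) error-state covariance
    e_px     : undistorted event pixel (u_e, v_e)
    line     : homogeneous line l = (a, b, c) of the matched projected segment
    J_line_x : (3,m) Jacobian of l w.r.t. the error state, chained as in (5)
    sigma_d  : pixel noise standard deviation
    Returns the error-state correction dx and the updated covariance.
    """
    l = np.asarray(line, dtype=float)
    a, b, _ = l
    e_h = np.array([e_px[0], e_px[1], 1.0])
    norm = np.sqrt(a * a + b * b)

    z = e_h @ l / norm                       # signed distance innovation (7)
    # Chain rule: dz/dx = dz/dl * dl/dx; a row vector, sparse in the v, a blocks
    J_z_l = e_h / norm - (e_h @ l) * np.array([a, b, 0.0]) / norm**3
    J_z_x = J_z_l @ J_line_x                 # shape (m,)

    Z = J_z_x @ P @ J_z_x + sigma_d**2       # scalar innovation variance
    K = P @ J_z_x / Z                        # Kalman gain: an m-vector
    dx = -K * z                              # residual w.r.t. the expected value 0
    P = P - np.outer(K, J_z_x @ P)           # covariance update, (I - K H) P

    # dx is then folded into the nominal state: regular sum on {r, v, w, ...}
    # and right-plus R <- R Exp(d_theta) on the rotation block.
    return dx, P
```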
Figure 2: Data association process: (a) event window sample with projected lines, (b) cell identification for a single line based on the tessellation guidelines, (c) thresholding and ambiguity removal.
endpoint, $C_{u_0} = \lceil u_0\,m/w \rceil$ and $C_{v_0} = \lceil v_0\,n/h \rceil$, where $\lceil C \rceil \triangleq \mathrm{ceil}(C)$. Then we sequentially identify all horizontal and vertical intersections,
$$
\text{Horiz:}\;
\begin{cases}
C_{v_i} = i \\
C_{u_i} = \lceil (-b\,h\,i - c\,n)\,m/(a\,n\,w) \rceil
\end{cases}
\qquad
\text{Vert:}\;
\begin{cases}
C_{u_j} = j \\
C_{v_j} = \lceil (-a\,w\,j - c\,m)\,n/(b\,m\,h) \rceil ,
\end{cases}
\qquad (10)
$$
where the iterators i and j keep track of the horizontal and vertical intersections. Their values start from $C_{v_0}$ and $C_{u_0}$, respectively, and are increased or decreased by one in each iteration until reaching the opposite endpoint cell location. The sign of the increment depends on the difference between the first and last endpoint cell coordinates.
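A possible implementation of this traversal, following (10) but simplified and not claiming to reproduce the paper's exact bookkeeping, is sketched below; it returns the set of cells in which a projected segment should be registered:

```python
import math

def cells_crossed(l, p0, p1, w, h, m, n):
    """Cells of an m x n tessellation of a w x h image crossed by a segment.

    l      : homogeneous line (a, b, c) with a*u + b*v + c = 0
    p0, p1 : projected segment endpoints in pixels, (u, v)
    Returns a set of (Cu, Cv) cell indices, in the spirit of (10).
    Illustrative sketch only; boundary crossings register both adjacent cells.
    """
    a, b, c = l
    ceil = math.ceil

    def cell(u, v):
        return (ceil(u * m / w), ceil(v * n / h))

    cu0, cv0 = cell(*p0)
    cu1, cv1 = cell(*p1)
    cells = {(cu0, cv0), (cu1, cv1)}

    # Horizontal cell boundaries v = i*h/n crossed between the two endpoints
    if abs(a) > 1e-12:
        for i in range(min(cv0, cv1), max(cv0, cv1)):
            cu = ceil((-b * h * i - c * n) * m / (a * n * w))
            cells.add((cu, i))
            cells.add((cu, i + 1))          # the crossing touches both rows

    # Vertical cell boundaries u = j*w/m crossed between the two endpoints
    if abs(b) > 1e-12:
        for j in range(min(cu0, cu1), max(cu0, cu1)):
            cv = ceil((-a * w * j - c * m) * n / (b * m * h))
            cells.add((j, cv))
            cells.add((j + 1, cv))          # the crossing touches both columns

    return cells
```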
For each event in the temporal window we must check whether it has a corresponding line match in its tessellation cell. Although we are capable of processing events at a rate of millions per second, there might be cases in which this is not achievable due to a sudden surge of incoming events, depending on the motion model used, the scene complexity, or the motion dynamics. We might need to leave out up to 1/10th of the events on average in the most demanding conditions (see the last row of Tab. 3), and to do so in an unbiased way we keep track of execution time and skip an event if its timestamp lags more than 1 µs behind the current time.
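As a rough sketch of this load-shedding policy (our interpretation, with a hypothetical `handle_event` callback and a lag budget taken from the text above), the test can be written as:

```python
import time

LAG_BUDGET_S = 1e-6          # maximum tolerated processing lag (1 us, as above)

def process_stream(events, handle_event, now=time.monotonic):
    """Drop events unbiasedly when processing falls behind the sensor clock.

    events       : iterable of (timestamp_s, u, v, polarity), already undistorted
    handle_event : per-event matching + EKF update routine
    Illustrative only; the real system tracks its own clock offsets.
    """
    t_start_wall = now()
    t_start_evt = None
    for ts, u, v, pol in events:
        if t_start_evt is None:
            t_start_evt = ts
        lag = (now() - t_start_wall) - (ts - t_start_evt)
        if lag > LAG_BUDGET_S:
            continue                      # skip this event: we are running late
        handle_event(ts, u, v, pol)
```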
Each unskipped event inside each cell is compared only against the segments that are within that cell. This greatly reduces the combinatorial explosion of comparing N segments with a huge number M of events, from O(M × N) to the much smaller cost of updating the cells' segment lists, which is only O(N). The (very small) set of candidate segments for each event is sorted from minimum to maximum distance. To validate a match between an event and its closest segment, the following three conditions (evaluated in this order) must be met, see Fig. 2(c): a) the distance $d_1$ (7) to the closest segment is below a predefined threshold, $d_1 < \alpha$; b) the distance $d_2$ to the second closest segment is above another predefined threshold, $d_2 > \beta$; and c) the orthogonal projection of the event onto the segment falls between the two endpoints, $0 < \mathbf{v}_1^{\top}\mathbf{v}_2 / \mathbf{v}_1^{\top}\mathbf{v}_1 < 1$, where $\mathbf{v}_1 = \mathbf{u}_2 - \mathbf{u}_1$, $\mathbf{v}_2 = \mathbf{e} - \mathbf{u}_1$, and $\mathbf{u}_i$ are the endpoints in pixel coordinates. Events that pass all conditions are used for the EKF update as described in Sec. 2.2.
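A compact sketch of these three gating conditions, with hypothetical container types and not taken from the paper's code, is:

```python
import numpy as np

def validate_match(e, candidates, alpha, beta):
    """Gate an event against its cell's candidate segments (rules a-c above).

    e          : undistorted event pixel, np.array([u_e, v_e])
    candidates : list of (d, u1, u2, seg_id) with d the signed distance (7)
                 and u1, u2 the segment endpoints in pixel coordinates
    alpha, beta: thresholds for the closest / second-closest segment
    Returns seg_id of the accepted segment, or None.
    """
    if not candidates:
        return None
    candidates = sorted(candidates, key=lambda cand: abs(cand[0]))

    d1, u1, u2, seg_id = candidates[0]
    # a) closest segment must be near enough
    if abs(d1) >= alpha:
        return None
    # b) second-closest segment must be far enough (no ambiguity)
    if len(candidates) > 1 and abs(candidates[1][0]) <= beta:
        return None
    # c) orthogonal projection of the event must fall between the endpoints
    v1 = u2 - u1
    v2 = e - u1
    t = float(v1 @ v2) / float(v1 @ v1)
    if not (0.0 < t < 1.0):
        return None
    return seg_id
```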
Table 2: Perturbation and noise parameters.

    σ_r   0.03 m/s^{1/2}
    σ_θ   0.3 rad/s^{1/2}
    σ_v   3 m/s^{3/2}
    σ_ω   10 rad/s^{3/2}
    σ_a   80 m/s^{5/2}
    σ_α   300 rad/s^{5/2}
    σ_d   3.5 pixels

Table 3: RMSE mean values and timings. L: Lie parameterization, Cl: classic algebra.

    Metric          CP+L     CV+L     CA+L     CP+Cl    CV+Cl    CA+Cl
    x (m)           0.0149   0.0091   0.0095   0.0162   0.0093   0.0106
    y (m)           0.0125   0.0085   0.0081   0.0119   0.0086   0.0088
    z (m)           0.0167   0.0111   0.0012   0.0171   0.0121   0.0113
    φ (rad)         1.2205   0.7522   0.8333   1.2729   0.8613   0.9539
    θ (rad)         1.4569   0.9842   1.0209   1.2729   1.2366   1.2645
    ψ (rad)         1.2955   0.9252   0.8066   1.1549   1.1201   0.9902
    T_proc (µs)     0.32     0.46     0.72     0.29     0.42     0.64
    N_events (%)    97.73    90.96    85.51    98.06    92.68    89.09
Figure 3: RMS errors and 2-sigma bounds: (a-c) position, (d-f) orientation.
Figure 4: (a) Strong hand shake (∼6 Hz) sequence example (using CV+L), (b) with zoom in the high-speed zone, (c) event quantification and (d) visual output snapshots at a given time.
In this evaluation, we use the projection model (3a); i.e., the camera is moving in a static world. From the 10 runs, we measure the root mean square error (RMSE) of each component of the camera pose and plot it in Fig. 3. To analyze the consistency of the filter, the errors obtained are compared against their 2-sigma bounds as in [20]. An OptiTrack motion capture system calibrated with spherical reflective references provides the ground truth to analyze the event-based tracker performance.
For the sake of comparison, we also implemented the classic ES-EKF using quaternions, where Jacobians are obtained using first-order approximations. The error evaluation for the various filter variants tested is summarized in Tab. 3.
The overall results show a small but noticeable improvement in accuracy when the tracker is implemented with Lie groups, with the CV model giving the best response. Though the Lie approach is somewhat slower, this can be taken as the price to pay for improved accuracy. During the RMSE evaluation, CV and CA errors were mostly under the 2-sigma bound (see Fig. 3 (b,c,e,f)), indicating filter consistency. On the other hand, the error using the CP model is shown to exceed the 2-sigma bound repeatedly. This was also evident during the experiments, where the tracker was less resilient to high dynamics (see Fig. 3 (a,d)).
In all cases, the per-event total processing time T_proc falls well below one microsecond, where, on average, less than 0.1 µs of this time is spent performing line-event matching, the rest being spent in prediction and correction operations. With this, the tracker is capable of treating between 89.1% and 97.7% of the incoming data, depending on the motion model and state parameterization used, reaching real-time performance and producing pose updates at a rate of 10 kHz, limited only by the chosen size of the event time window of 100 µs.
A comparison of the tracker performance versus the OptiTrack ground truth is shown in Fig. 4 for the best-performing motion model and state parameterization combination: constant velocity with Lie groups. In this case, the camera is shaken by hand in front of the scene. The frequency of the motion signal increases from about 1 Hz to 6 Hz, the fastest achievable with a human hand shake of the camera. The camera pose is accurately tracked despite the sudden changes in motion direction, and the most significant errors, in the order of millimetres, are observed precisely in the zones where motion changes direction (see Fig. 4(a,b)).
Illumination changes were produced by turning the laboratory lights on and off, with no noticeable performance degradation in either the tracking or the event production (see the grey shaded sections in Figs. 4(a),(c)), which reached peaks of about one million events per second under the most aggressive motion dynamics (see the zoomed-in region in Fig. 4(b) and (c)). The green lines in the snapshots in Fig. 4(d) are the projected map segments using the estimated camera pose.
Figure 5: High-dynamics position and orientation evaluation using CV+L: (a,b) poses up to 950 rpm (15.8 Hz) were accurately estimated before the tracking disengaged, (c) Z-Y trajectory, (d) constrained four-bar motion mechanism, (e) mechanism dimensions, and (f-h) visual snapshots of the tracker for crank angular speeds of 300, 500 and 800 rpm respectively.
Acknowledgements
This work was partially supported by the EU H2020 project GAUSS (H2020-Galileo-2017-1-776293), by the Spanish State Research Agency through the project EB-SLAM (DPI2017-89564-P) and the María de Maeztu Seal of Excellence to IRI (MDM-2016-0656), and by a scholarship from SENESCYT, Republic of Ecuador, to William Chamorro.
References
[1] Christian Brandli, Raphael Berner, Minhao Yang, Shih-Chii Liu, and Tobi Delbruck. A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE J. Solid-State Circuits, 49(10):2333–2341, 2014.
[2] Samuel Bryner, Guillermo Gallego, Henri Rebecq, and Davide Scaramuzza. Event-based direct camera tracking from a photometric 3D map using nonlinear optimization. In IEEE Int. Conf. Robotics Autom., pages 325–331, 2019.
[3] Davide Falanga, Kevin Kleber, and Davide Scaramuzza. Dynamic obstacle avoidance for quadrotors with event cameras. Sci. Robotics, 5(40):eaaz9712, 2020.
[4] Jeremie Deray and Joan Solà. manif: a small C++ header-only library for Lie theory. https://github.com/artivis/manif, January 2019.
[5] Pierre Drap and Julien Lefèvre. An exact formula for calculating inverse radial lens distortions. Sensors, 16(6):807, 2016.
[6] Edward Rosten, Reid Porter, and Tom Drummond. Faster and better: A machine learning approach to corner detection. IEEE Trans. Pattern Anal. Mach. Intell., 32(1):105–119, 2010.
[7] Guillermo Gallego, Christian Forster, Elias Mueggler, and Davide Scaramuzza. Event-based camera pose tracking using a generative event model. arXiv:1510.01972, 2015.
[8] Guillermo Gallego, Jon E.A. Lund, Elias Mueggler, Henri Rebecq, Tobi Delbruck, and Davide Scaramuzza. Event-based 6-DOF camera tracking from photometric depth maps. IEEE Trans. Pattern Anal. Mach. Intell., 40(10):2402–2412, 2017.
[9] Hanme Kim, Ankur Handa, Ryad Benosman, Sio-Hoi Ieng, and Andrew J. Davison. Simultaneous mosaicing and tracking with an event camera. In Brit. Mach. Vis. Conf., 2014.
[10] Hanme Kim, Stefan Leutenegger, and Andrew J. Davison. Real-time 3D reconstruction and 6-DoF tracking with an event camera. In Eur. Conf. Comput. Vis., pages 349–364, 2016.
[11] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vision, 81:155–166, 2009.
[12] Michael Milford, Hanme Kim, Stefan Leutenegger, and Andrew Davison. Towards visual SLAM with event-based cameras. In RSS Workshop on the Problem of Mobile Sensors, 2015.
[13] Elias Mueggler. Event-based Vision for High-Speed Robotics. PhD thesis, University of Zurich, 2017.
[14] Elias Mueggler, Basil Huber, and Davide Scaramuzza. Event-based, 6-DOF pose tracking for high-speed maneuvers. In IEEE/RSJ Int. Conf. Intell. Robots Syst., pages 2761–2768, 2014.
[15] Elias Mueggler, Nathan Baumli, Flavio Fontana, and Davide Scaramuzza. Towards evasive maneuvers with quadrotors using dynamic vision sensors. In Eur. Conf. Mobile Robots, pages 1–8, 2015.
[16] Elias Mueggler, Guillermo Gallego, and Davide Scaramuzza. Continuous-time trajectory estimation for event-based vision sensors. In Robotics Sci. Syst. Conf., 2015.
[17] Garrick Orchard, Cedric Meyer, Ralph Etienne-Cummings, Christoph Posch, Nitish Thakor, and Ryad Benosman. HFirst: A temporal approach to object recognition. IEEE Trans. Pattern Anal. Mach. Intell., 37(10):2028–2040, 2015.
[18] Bas J. Pijnacker Hordijk, Kirk Y.W. Scheper, and Guido C.H.E. de Croon. Vertical landing for micro air vehicles using event-based optical flow. J. Field Robotics, 35(1):69–90, 2018.
[19] Henri Rebecq, Timo Horstschaefer, Guillermo Gallego, and Davide Scaramuzza. EVO: A geometric approach to event-based 6-DOF parallel tracking and mapping in real time. IEEE Robotics Autom. Lett., 2(2):593–600, 2017.
[20] Joan Solà, Teresa Vidal-Calleja, Javier Civera, and Jose Maria Martinez-Montiel. Impact of landmark parametrization on monocular EKF-SLAM with points and lines. Int. J. Comput. Vision, 97:339–368, 2011.
[21] Joan Solà, Jeremie Deray, and Dinesh Atchuthan. A micro Lie theory for state estimation in robotics. arXiv:1812.01537, 2018.
[22] David Weikersdorfer and Jörg Conradt. Event-based particle filtering for robot self-localization. In IEEE Int. Conf. Robotics Biomim., pages 866–870, 2012.
[23] David Weikersdorfer, Raoul Hoffmann, and Jörg Conradt. Simultaneous localization and mapping for event-based vision systems. In Int. Conf. Comput. Vis. Syst., pages 133–142, 2013.
[24] David Weikersdorfer, David Adrian, Daniel Cremers, and Jörg Conradt. Event-based 3D SLAM with a depth-augmented dynamic vision sensor. In IEEE Int. Conf. Robotics Autom., pages 359–364, 2014.