
HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification

Amos Sironi1∗, Manuele Brambilla1, Nicolas Bourdis1, Xavier Lagorce1, Ryad Benosman1,2,3
1 PROPHESEE, Paris, France   2 Institut de la Vision, UPMC, Paris, France   3 University of Pittsburgh Medical Center / Carnegie Mellon University
{asironi, mbrambilla, nbourdis, xlagorce}@prophesee.ai   [email protected]

Abstract

Event-based cameras have recently drawn the attention of the Computer Vision community thanks to their advantages in terms of high temporal resolution, low power consumption and high dynamic range, compared to traditional frame-based cameras. These properties make event-based cameras an ideal choice for autonomous vehicles, robot navigation or UAV vision, among others. However, the accuracy of event-based object classification algorithms, which is of crucial importance for any reliable system working in real-world conditions, is still far behind their frame-based counterparts. Two main reasons for this performance gap are: 1. the lack of effective low-level representations and architectures for event-based object classification and 2. the absence of large real-world event-based datasets. In this paper we address both problems. First, we introduce a novel event-based feature representation together with a new machine learning architecture. Compared to previous approaches, we use local memory units to efficiently leverage past temporal information and build a robust event-based representation. Second, we release the first large real-world event-based dataset for object classification. We compare our method to the state-of-the-art with extensive experiments, showing better classification performance and real-time computation.

Figure 1: Pixels of an event-based camera asynchronously generate events as soon as a contrast change is detected in their field of view. As a consequence, the output of an event-based camera can be extremely sparse, with a time resolution on the order of microseconds. Because of the asynchronous nature of the data and the high resolution of the temporal component of the events, compared to the spatial one, standard Computer Vision methods cannot be directly applied. Top: an event-based camera (left) recording a natural scene (right). Bottom: visualization of the event stream generated by a moving object, plotted over x (pixels), y (pixels) and t (ms); ON and OFF events (Sec. 2) are represented by yellow and cyan dots respectively. This figure, as most of the figures in this paper, is best seen in color.

1. Introduction
This paper focuses on the problem of object classification using the output of a neuromorphic asynchronous event-based camera [15, 14, 53]. Event-based cameras offer a novel path to Computer Vision by introducing a fundamentally new representation of visual scenes, with a drive towards real-time and low-power algorithms.

Contrary to standard frame-based cameras, which rely on a pre-defined acquisition rate, in event-based cameras individual pixels asynchronously emit events when they observe a sufficient change of the local illuminance intensity (Figure 1). This new principle leads to a significant reduction of memory usage and power consumption: the information contained in standard videos of hundreds of megabytes can be naturally compressed into an event stream of a few hundred kilobytes [36, 52, 63].

∗ This work was supported in part by the EU H2020 ULPEC project (grant agreement number 732642).
Additionally, the time resolution of event-based cameras is orders of magnitude higher than that of frame-based cameras, reaching up to hundreds of microseconds. Finally, thanks to their logarithmic sensitivity to illumination changes, event-based cameras also have a much larger dynamic range, exceeding 120 dB [52]. These characteristics make event-based cameras particularly interesting for applications with strong constraints on latency (e.g. autonomous navigation), power consumption (e.g. UAV vision and IoT), or bandwidth (e.g. tracking and surveillance).

However, due to the novelty of the field, the performance of event-based systems in real-world conditions is still inferior to that of their frame-based counterparts [28, 66]. We argue that two main limiting factors of event-based algorithms are: 1. the limited amount of work on low-level feature representations and architectures for event-based object classification; 2. the lack of large event-based datasets acquired in real-world conditions. In this work, we make important steps towards the solution of both problems.

We introduce a new event-based scalable machine learning architecture, relying on a low-level operator called Local Memory Time Surface. A time surface is a spatio-temporal representation of the activity around an event, relying on the arrival time of events from neighboring pixels [30]. However, the direct use of this information is sensitive to noise and to non-idealities of the sensors. By contrast, we emphasize the importance of using the information carried by past events to obtain a robust representation. Moreover, we show how to efficiently store and access this past information by defining a new architecture based on local memory units, where neighboring pixels share the same memory block. In this way, the Local Memory Time Surfaces can be efficiently combined into a higher-order representation, which we call Histograms of Averaged Time Surfaces.

This results in an event-based architecture which is significantly faster and more accurate than existing ones [30, 33, 46]. Driven by brain-like asynchronous event-based computations, this new architecture offers the perspective of a new class of machine learning algorithms that focus the computational effort only on the active parts of the network.

Finally, motivated by the importance of large-scale datasets for the recent progress of Computer Vision systems [16, 28, 37], we also present a new real-world event-based dataset dedicated to car classification. This dataset is composed of about 24k samples acquired from a car driving in urban and motorway environments. These samples were annotated using a semi-automatic protocol, which we describe below. To the best of our knowledge, this is the largest labeled event-based dataset acquired in real-world conditions.

We evaluate our method on our new event-based dataset and on four other challenging ones. We show that our method reaches higher classification rates and faster computation times than existing event-based algorithms.

2. Event-based camera

Conventional cameras encode the observed scene by producing dense information at a fixed frame-rate. As explained in Sec. 1, this is an inefficient way to encode natural scenes. Following this observation, a variety of event-based cameras [36, 52, 63] have been designed over the past few years, with the goal of encoding the observed scene adaptively, based on its content.

In this work, we consider the ATIS camera [52]. The ATIS camera contains an array of fully asynchronous pixels, each composed of an illuminance relative change detector and a conditional exposure measurement block. The relative change detector reacts to changes in the observed scene, producing information in the form of asynchronous address events [4], known henceforth as events. Whenever a pixel detects a change in illuminance intensity, it emits an event containing its x-y position in the pixel array, the microsecond timestamp of the observed change and its polarity, i.e. whether the illuminance intensity was increasing (ON events) or decreasing (OFF events). The conditional exposure measurement block measures the absolute luminous intensity observed by a pixel [49]. In the ATIS, the measurement itself is not triggered at a fixed frame-rate, but only when a change in the observed scene is detected by the relative change detector.

In this work, the luminous intensity measures from the ATIS camera were used only to generate ground-truth annotations for the dataset presented in Sec. 5. By contrast, the object classification pipeline was designed to operate on change events only, in order to support generic event-based cameras, whether or not they include the ATIS feature to generate grey levels. In this way, any event-based camera can be used to demonstrate the potential of our approach, while leaving the possibility for further improvement when grey-level information is available [39].
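For concreteness, the event encoding just described can be held in a simple array-of-records layout. The sketch below is illustrative only: the field names and the NumPy layout are our assumptions for this paper's exposition, not the output format of the ATIS driver or of any particular dataset.

# Minimal illustrative layout for a stream of change events: each event carries
# (x, y) pixel coordinates, a microsecond timestamp and a polarity in {-1, +1}.
# Field names and dtype are assumptions of this sketch, not a camera API.
import numpy as np

event_dtype = np.dtype([("x", np.uint16),   # column index in the pixel array
                        ("y", np.uint16),   # row index in the pixel array
                        ("t", np.int64),    # timestamp in microseconds
                        ("p", np.int8)])    # polarity: +1 = ON, -1 = OFF

# A toy stream of three events, already sorted by timestamp.
events = np.array([(12, 30, 105, +1),
                   (13, 30, 180, +1),
                   (12, 31, 410, -1)], dtype=event_dtype)

on_events = events[events["p"] == 1]        # e.g. select the ON events only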
3. Related work

In this section, we first briefly review frame-based object classification, then we describe previous work on event-based features and object classification. Finally, we discuss existing event-based datasets.

Frame-based Features and Object Classification
There is a vast literature on spatial [40, 13, 67, 57] and spatio-temporal [31, 73, 60] feature descriptors for frame-based Computer Vision. Early approaches mainly focus on hand-crafting feature representations for a given problem by using domain knowledge. Well-designed features combined with shallow classifiers have driven research in object recognition for many decades [72, 13, 18] and helped understanding and modeling important properties of the object classification problem, such as local geometric
invariants, color and light properties, etc. [74, 1].

In the last few years, the availability of large datasets [16, 37] and effective learning algorithms [32, 26, 68] shifted the research direction towards data-driven learning of feature representations [2, 22]. Typically this is done by optimizing the weights of several layers of elementary feature extraction operations, such as spatial convolutions, pixel-wise transformations, pooling, etc. This allowed an impressive improvement in the performance of image classification approaches and of many other Computer Vision problems [28, 66, 75]. Deep Learning models, although less easily interpretable, also allowed understanding higher-order geometrical properties of classical problems [8].

By contrast, the work on event-based Computer Vision is still in its early stages and it is unclear which feature representations and architectures are best suited for this problem. Finding adequate low-level feature operations is a fundamental topic, both for understanding the properties of event-based problems and for finding the best architectures and learning algorithms to solve them.

Event-based Features and Object Classification
Simultaneous Localization and Mapping use-cases [25, 54] drove the majority of prior work on event-based features [11, 43] for stable detection and tracking. Corner detectors have been defined in [11, 71, 43], while the works of [61, 7] focused on edge and line extraction.

Recently, [12] introduced a feature descriptor based on local distributions of optical flow and applied it to corner detection and gesture recognition. It is inspired by its frame-based counterpart [10], but in [12] the algorithm for computing the optical flow relies on the temporal information carried by the events. One limitation of [12] is that the quality of the descriptor strongly depends on the quality of the flow. As a consequence, it loses accuracy in the presence of noise or poorly contrasted edges.

Event-based classification algorithms can be divided into two categories: unsupervised learning methods and supervised ones. Most unsupervised approaches train artificial neural networks by reproducing or imitating the learning rules observed in biological neural networks [21, 42, 38, 3, 65, 41]. Supervised methods [47, 29, 34, 51], similar to what is done in frame-based Computer Vision, try to optimize the weights of artificial networks by minimizing a smooth error function.

The most commonly used architectures for event-based cameras are Spiking Neural Networks (SNN) [5, 59, 24, 17, 9, 76, 46]. SNNs are a promising research field; however, their performance is limited by the discrete nature of the events, which makes it difficult to properly train an SNN with gradient descent. To avoid this, some authors [50] use predefined Gabor filters as weights in the network. Others propose to first train a conventional Convolutional Neural Network (CNN) and then to convert the weights to an SNN [9, 58]. In both cases, the obtained solutions are suboptimal and typically the performance is lower than that of conventional CNNs on frames. Other methods consider a smoothed version of the transfer function of an SNN and directly optimize it [33, 45, 69]. The convergence of the corresponding optimization problem is still very difficult to obtain and typically only few layers and small networks can be trained.

Recently, [30] proposed an interesting alternative to SNNs by introducing a hierarchical representation based on the definition of Time Surface. In [30], learning is unsupervised and performed by clustering time surfaces at each layer, while the last layer sends its output to a classifier. The main limitations of this method are its high latency, due to the increasing time window needed to compute the time surfaces, and the high computational cost of the clustering algorithm.

We propose a much simpler yet effective feature representation. We generalize time surfaces by introducing a memory effect in the network, obtained by storing the information carried by past events. We then build our representation by applying a regularization scheme both in time and space, to obtain a compact and fast representation. Although the architecture is scalable, we show that once a memory process is introduced, a single layer is sufficient to outperform a multilayer approach directly relying on time surfaces. This reduces computation but, more importantly, adds more generalization and robustness to the network.

Event-based Datasets
An issue of previous work on event-based object classification is that the proposed solutions are tested either on very small datasets [64, 30], or on datasets generated by converting standard videos or images to an event-based representation [48, 23, 35]. In the first case, the small size of the test set prevents an accurate evaluation of the methods. In the second case, the dataset size is large enough to create a valid tool for testing new algorithms. However, since the datasets are generated from static images, the real dynamics of a scene and the temporal resolution of event-based cameras cannot be fully employed, and there is no guarantee that a method tested on this kind of artificial data will behave similarly in real-world conditions.

The authors of [44] released an event-based dataset adapted to test visual odometry algorithms. Unfortunately, this dataset does not contain labeled information for an object recognition task.

The need for large real-world datasets is a major slowing factor for event-based vision [70]. By releasing a new labeled real-world event-based dataset, and defining an efficient semi-automated protocol based on a single event-based camera, we intend to accelerate progress toward a robust and accurate event-based object classifier.
Figure 2: Time surface computation around an event e_i, in the presence of noise. Noisy events are represented as red crosses, non-noisy events as blue dots. For clarity of visualization, only the x-t component of the event stream and a single polarity are shown. (a) In [30] the time surface T̄_{e_i} (Eq. (2)) is computed by considering only the times t′(x_i + z, q) of the last events in a neighborhood of e_i (orange dashed line). As a consequence, noisy events can have a large weight in T̄_{e_i}, visible as spurious peaks in the surface. (b) By contrast, the definition of the Local Memory Time Surface T_{e_i} of Eq. (3) considers the contribution of all past events in a spatio-temporal window N_{(z,q)}(e_i). In this way, the ratio of noisy events considered to compute T is smaller and the result better describes the real dynamics of the underlying stream of events. (c) The time surface can be further regularized by spatially averaging the time surfaces of all the events in a neighborhood (Eq. (6)). Thanks to both the spatial and temporal regularization, the contribution of noise is almost completely suppressed.

4. Method

In this section, we formalize the event-based representation of visual scenes and describe our event-based architecture for object classification.

4.1. Time Surfaces

Given an event-based sensor with pixel grid size M × N, a stream of events is given by a sequence

E = \{e_i\}_{i=1}^{I}, \quad \text{with } e_i = (\mathbf{x}_i, t_i, p_i), \qquad (1)

where x_i = (x_i, y_i) ∈ [1, ..., M] × [1, ..., N] are the coordinates of the pixel generating the event, t_i ≥ 0 is the timestamp at which the event was generated, with t_i ≤ t_j for i < j, p_i ∈ {−1, 1} is the polarity of the event, with −1 and 1 meaning respectively OFF and ON events, and I is the number of events. From now on, we will refer to individual events by e_i and to a sequence of events by {e_i}.

In [30], the concept of time surface is introduced to describe local spatio-temporal patterns around an event. A time surface can be formalized as a local spatial operator acting on an event e_i by T̄_{e_i}(·, ·) : [−ρ, ρ]^2 × {−1, 1} → R, where ρ is the radius of the spatial neighborhood used to compute the time surface.

For an event e_i = (x_i, t_i, p_i) and (z, q) ∈ [−ρ, ρ]^2 × {−1, 1}, T̄_{e_i} is given by

\bar{T}_{e_i}(z, q) =
\begin{cases}
e^{-(t_i - t'(\mathbf{x}_i + z,\, q))/\tau} & \text{if } p_i = q \\
0 & \text{otherwise,}
\end{cases} \qquad (2)

where t′(x_i + z, q) is the time of the last event with polarity q received from pixel x_i + z (Fig. 2(a)), and τ is a decay factor giving less weight to events further in the past. Intuitively, a time surface encodes the dynamic context in a neighborhood of an event, hence providing both temporal and spatial information. Therefore, this compact representation of the content of the scene can be useful to classify different patterns.
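As a concrete illustration of Eq. (2), the following sketch keeps one last-timestamp map per polarity and reads an exponentially decayed patch around the incoming event. It is a minimal NumPy sketch under our own assumptions (structured events as in the Sec. 2 snippet, border handling omitted), not the authors' implementation.

# Minimal sketch of the time surface of Eq. (2). last_t[q] holds t'(., q), the
# timestamp of the last event of polarity q seen at every pixel; -inf marks
# pixels that never fired, so their contribution exp(-inf) is exactly 0.
# Events near the sensor border are not handled, to keep the sketch short.
import numpy as np

H, W = 100, 120                                            # sensor size (illustrative)
last_t = {q: np.full((H, W), -np.inf) for q in (-1, +1)}   # one map per polarity q

def plain_time_surface(e, last_t, rho, tau):
    x, y, t, p = int(e["x"]), int(e["y"]), float(e["t"]), int(e["p"])
    T = np.zeros((2 * rho + 1, 2 * rho + 1, 2))            # (z_y, z_x, polarity channel)
    for qi, q in enumerate((-1, +1)):
        if p != q:                                         # Eq. (2) is zero when p_i != q
            continue
        patch = last_t[q][y - rho:y + rho + 1, x - rho:x + rho + 1]
        T[:, :, qi] = np.exp(-(t - patch) / tau)           # exp(-(t_i - t') / tau)
    last_t[p][y, x] = t                                    # record the new event as last seen
    return T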
4.2. Local Memory Time Surfaces

To build the feature representation, we start by generalizing the time surface T̄_{e_i} of Eq. (2). As shown in Fig. 2(a), using only the time t′(x_i + z, q) of the last event received in the neighborhood of the time surface pixel x_i leads to a descriptor which is too sensitive to noise or to small variations in the event stream.

To avoid this problem, we compute the time surface by considering the history of the events in a temporal window of size ∆t. More precisely, we define the local memory time surface T_{e_i} as

T_{e_i}(z, q) =
\begin{cases}
\sum_{e_j \in N_{(z,q)}(e_i)} e^{-(t_i - t_j)/\tau} & \text{if } p_i = q \\
0 & \text{otherwise,}
\end{cases} \qquad (3)

where

N_{(z,q)}(e_i) = \{\, e_j : \mathbf{x}_j = \mathbf{x}_i + z,\; t_j \in [t_i - \Delta t,\, t_i),\; p_j = q \,\}. \qquad (4)

As shown in Fig. 2(b), this formulation describes the real dynamics of the scene more robustly, while resisting noise and small variations of the events. In the supplementary material we compare the results obtained by using Eq. (2) or Eq. (3) on an object classification task, showing the advantage of using the local memory formulation to achieve better accuracy.
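The same computation, written against the local memory formulation of Eqs. (3)-(4): instead of a single last timestamp per pixel, every past event inside the window [t_i − ∆t, t_i) contributes a decayed term. The sketch below is ours and assumes `memory` is a plain list of past events in the style of the Sec. 2 snippet; Sec. 4.4 explains how such a list is shared per cell.

# Sketch of the Local Memory Time Surface of Eqs. (3)-(4). `memory` is a list of
# past events (a memory unit, see Sec. 4.4); each matching event e_j adds
# exp(-(t_i - t_j) / tau) at its spatial offset z = x_j - x_i.
import numpy as np

def local_memory_time_surface(e, memory, rho, tau, dt):
    xi, yi, ti, pi = int(e["x"]), int(e["y"]), float(e["t"]), int(e["p"])
    T = np.zeros((2 * rho + 1, 2 * rho + 1, 2))        # (z_y, z_x, polarity channel)
    qi = 0 if pi == -1 else 1                          # only the q = p_i channel is non-zero
    for ej in memory:                                  # candidate events e_j for N_(z,q)(e_i)
        zx, zy = int(ej["x"]) - xi, int(ej["y"]) - yi
        in_window = (ti - dt) <= float(ej["t"]) < ti   # t_j in [t_i - dt, t_i)
        if abs(zx) <= rho and abs(zy) <= rho and in_window and int(ej["p"]) == pi:
            T[zy + rho, zx + rho, qi] += np.exp(-(ti - float(ej["t"])) / tau)
    return T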
Figure 3: Overview of the proposed architecture. (a) The pixel grid is divided into cells C of size K × K. When a change of light is detected by a pixel, an event e_i is generated. Then, the time surface T_{e_i} is computed and used to update the histogram h_C. The HATS representation is obtained by the concatenation of the histograms h_C. (b) Detail of the Local Memory Time Surface computation using local memory units. For each input event, the time surface of Eq. (3) is computed by using the past events e_j stored in the cell's local memory unit M_C (Sec. 4.4). After computation, T_{e_i} is used to update the histogram h_C of the corresponding cell, while the event e_i is added to the memory unit. For simplicity, the polarity of the events and the normalization of the histograms are not shown in the scheme.

The name Local Memory Time Surfaces comes from the fact that the past events {e_j} in N_{(z,q)}(e_i) need to be stored in memory units in order to prevent the algorithm from 'forgetting' past information. In Sec. 4.4, we will describe how memory units can be shared efficiently by neighboring pixels. In this way, we can compute a robust feature representation without a significant increase in memory requirements.

4.3. Histograms of Averaged Time Surfaces

The local memory time surface of Eq. (3) is the elementary spatio-temporal operator we use in our approach. In this section, we describe how this new type of time surface can be used to define a compact representation of an event stream useful for object classification.

Inspired by [13] in frame-based vision, we group adjacent pixels into cells {C_l}_{l=1}^{L} of size K × K. Then, for each cell C, we sum the components of the time surfaces computed on events from C into histograms. More precisely, for a cell C we have

\bar{h}_C(z, p) = \sum_{e_i \in C} T_{e_i}(z, p), \qquad (5)

where, with an abuse of notation, we write e_i ∈ C if and only if the pixel coordinates (x_i, y_i) of the event belong to C.

A characteristic of event-based sensors is that the amount of events generated by a moving object is proportional to its contrast: higher-contrast objects generate more events than low-contrast objects. To make the cell descriptor more invariant to contrast, we therefore normalize h̄_C by the number of events |C| contained in the spatio-temporal window used to compute it. This results in the averaged histogram

h_C(z, p) = \frac{1}{|C|}\,\bar{h}_C(z, p) = \frac{1}{|C|} \sum_{e_i \in C} T_{e_i}(z, p). \qquad (6)

Algorithm 1 HATS with shared memory units
1: Input: events E = {e_i}_{i=1}^{I}; parameters ρ, ∆t, τ, K
2: Output: HATS representation H({e_i})
3: Initialize: h_{C_l} = 0, |C_l| = 0, M_{C_l} = ∅, for all l
4: for i = 1, ..., I do
5:   C_l ← getCell(x_i, y_i)
6:   T_{e_i} ← computeTimeSurface(e_i, M_{C_l})
7:   h_{C_l} ← h_{C_l} + T_{e_i}
8:   M_{C_l} ← M_{C_l} ∪ {e_i}
9:   |C_l| ← |C_l| + 1
10: return H = [h_{C_1}/|C_1|, ..., h_{C_L}/|C_L|]^T

An example of a cell histogram h_C(z, p) is shown in Fig. 2(c). Given a stream of events, our final descriptor, which we call HATS for Histograms of Averaged Time Surfaces, is given by concatenating every h_C, for all positions z, polarities and cells 1, ..., L:

H(\{e_i\}) = [h_{C_1}, \ldots, h_{C_L}]^\top. \qquad (7)

Fig. 3(a) shows an overview of our method.

Similarly to standard Computer Vision methods, we can further group adjacent cells into blocks and perform a block-normalization scheme to obtain more invariance to velocity and contrast [13]. In Sec. 6, we show how this simple representation obtains higher accuracy for event-based object classification compared to previous approaches.
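As an aside on the block-normalization step just mentioned: the paper does not specify the normalization, so the sketch below simply applies an L2 norm per block of concatenated cell histograms, in the spirit of HOG-style schemes [13]; the grouping into blocks and the choice of norm are our assumptions.

# Illustrative block normalization: L2-normalize the concatenation of the cell
# histograms belonging to one block. The norm and the block grouping are
# assumptions of this sketch, not choices prescribed by the paper.
import numpy as np

def l2_block_normalize(block_histograms, eps=1e-8):
    """block_histograms: 1-D array, the concatenated h_C of the cells in one block."""
    v = np.asarray(block_histograms, dtype=np.float64)
    return v / (np.linalg.norm(v) + eps)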
4.4. Architecture with Locally Shared Memory Units

Irregular access in event-based cameras is a well-known limiting factor for designing efficient event-based algorithms. One of the main problems is that the use of standard hardware accelerations, such as GPUs, is not trivial due to the sparse and asynchronous nature of the events. For example, accessing spatial neighbors on contiguous memory blocks can impose significant overheads when processing event-based data.

The architecture computing the HATS representation allows us to overcome this memory access issue (Fig. 3). From Eq. (5) we notice that, for every incoming event e_i, we need to iterate over all events in a past spatio-temporal neighborhood. Since, for small values of ρ, most of the past events would not be in the neighborhood of e_i, looping through the entire temporally ordered event stream would be prohibitively expensive and inefficient. To avoid this, we notice that, for ρ ≈ K, the events falling in the same cell C will share most of the neighbors N_{(z,q)} used to compute Eq. (3). Following this observation, for every cell we define a shared memory unit M_C, where the past events relevant for C are stored. In this way, when a new event arrives in C, we update Eq. (5) by looping only through M_C, which contains only the past events relevant to compute the Local Memory Time Surface of Eq. (3) (Fig. 3(b)).

Algorithm 1 describes the computation of HATS with memory units. Although this was not the scope of this paper, we notice that Algorithm 1 can be easily parallelized and implemented in dedicated neuromorphic chips [62].
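To make the flow of Algorithm 1 concrete, here is a compact sketch assuming the structured event array and the local_memory_time_surface helper from the earlier snippets. The helper names, the cell indexing and the calling convention are ours; Sec. 6.1 only reports the parameter values K = 10, ρ = 3 and τ = 10^9 µs for N-CARS, and Δt is left as a free parameter here.

# Sketch of Algorithm 1: one histogram h_C, one event counter |C| and one shared
# memory unit M_C per K x K cell; every incoming event updates only its own cell.
# Reuses local_memory_time_surface() from the Sec. 4.2 sketch above.
import numpy as np

def hats(events, width, height, K, rho, tau, dt):
    n_cx, n_cy = width // K, height // K                        # cells per row / column
    n_cells = n_cx * n_cy
    hist = np.zeros((n_cells, 2 * rho + 1, 2 * rho + 1, 2))     # h_C for every cell
    counts = np.zeros(n_cells)                                  # |C| for every cell
    memory = [[] for _ in range(n_cells)]                       # shared memory units M_C

    def get_cell(x, y):                                         # map a pixel to its cell index
        return (y // K) * n_cx + (x // K)

    for e in events:                                            # events assumed in temporal order
        c = get_cell(int(e["x"]), int(e["y"]))
        hist[c] += local_memory_time_surface(e, memory[c], rho, tau, dt)
        memory[c].append(e)                                     # M_C <- M_C U {e_i}
        counts[c] += 1                                          # |C| <- |C| + 1

    counts[counts == 0] = 1                                     # leave empty cells at zero
    return (hist / counts[:, None, None, None]).ravel()         # H = [h_C1/|C1|, ..., h_CL/|CL|]

In a real implementation the memory units would also be pruned of events older than Δt; this is omitted here for brevity.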
5. Datasets

We validated our approach on five different datasets: four datasets generated by converting standard frame-based datasets to events (namely, the N-MNIST [48], N-Caltech101 [48], MNIST-DVS [63] and CIFAR10-DVS [35] datasets) and a novel dataset, recorded from real-world scenes and introduced for the first time in this paper, which we call N-CARS. We made the N-CARS dataset publicly available for download at http://www.prophesee.ai/dataset-n-cars/.

5.1. Datasets Converted from Frames

N-MNIST, N-Caltech101, MNIST-DVS and CIFAR10-DVS are four publicly available datasets created by converting the popular frame-based MNIST [32], Caltech101 [20] and CIFAR10 [27] datasets to an event-based representation.

N-MNIST and N-Caltech101 were obtained by displaying each sample image on an LCD monitor while an ATIS sensor (Section 2) was moving in front of it [48]. Similarly, the MNIST-DVS and CIFAR10-DVS datasets were created by displaying a moving image on a monitor and recording it with a fixed DVS sensor [63].

In both cases, the result is a conversion of the images of the original datasets into a stream of events suited for evaluating event-based object classification. Fig. 4(a,b) shows some representative examples of the datasets generated from frames, for N-MNIST and N-Caltech101.

5.2. Dataset Acquired Directly as Events: N-CARS

The datasets described in the previous section are good datasets for a first evaluation of event-based classifiers. However, since they were generated by displaying images on a monitor, they are not very representative of data from real-world situations. The main shortcoming results from the limited and predefined motion of the objects.

To overcome these limitations, we created a new dataset by directly recording objects in urban environments with an event-based sensor. The dataset was obtained with the following semi-automatic protocol. First, we captured approximately 80 minutes of video using an ATIS camera (Section 2) mounted behind the windshield of a car. The driving was conducted in a natural way, without particular regard for video quality or content. In a second stage, we converted the gray-scale measurements from the ATIS sensor to conventional gray-scale images. We then processed them with a state-of-the-art object detector [55, 56] to automatically extract bounding boxes around cars and background samples. Finally, the data was manually cleaned to ensure that the samples were correctly labeled.

Since the gray-scale measurements have the same time resolution as the change-detection events, the gray-level images can be easily synchronized with the change-detection events. Thus, the positions and timestamps of the bounding boxes can be directly used to extract the corresponding event-based samples from the full event stream. Thanks to our semi-automated protocol, we generated a two-class dataset composed of 12,336 car samples and 11,693 non-car samples (background). The dataset was split into 7,940 car and 7,482 background training samples, and 4,396 car and 4,211 background testing samples. Each example lasts 100 milliseconds. More details on the dataset can be found in the supplementary material.

We called this new dataset N-CARS. As shown in Fig. 4(c), N-CARS is a challenging dataset, containing cars at different poses, speeds and levels of occlusion, as well as a large variety of background scenarios.

6. Experiments

6.1. Event-based Object Classification

Once the features have been extracted from the event sequences of a database, the problem reduces to a conventional classification problem. To highlight the contribution of our feature representation to classification accuracy, we used a simple linear SVM classifier in all our experiments. A more complex classifier, such as a non-linear SVM or a Convolutional Neural Network, could be used to further improve the results.

The parameters of all methods were optimized by splitting the training set and using 20% of the data for validation. Once the best settings were found, the classifier was retrained on the whole training set.
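A minimal sketch of this evaluation protocol with scikit-learn, assuming X_train, y_train, X_test and y_test already hold HATS descriptors and labels; the regularization grid and the variable names are ours, the paper does not specify them.

# Illustrative protocol: hold out 20% of the training set to pick the SVM
# regularization C, then retrain a linear SVM on the full training set.
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train,
                                            test_size=0.2, random_state=0)

best_C, best_acc = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):                       # illustrative grid of values
    acc = LinearSVC(C=C).fit(X_tr, y_tr).score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc

clf = LinearSVC(C=best_C).fit(X_train, y_train)        # retrain on the whole training set
print("test accuracy:", clf.score(X_test, y_test))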
Figure 4: Sample snapshots from the datasets used for the experiments of Sec. 6. The snapshots are obtained by cumulating 100 ms of events. Black pixels represent OFF events, white pixels ON events. (a) N-MNIST dataset. (b) N-Caltech101 dataset. (c) N-CARS dataset. Left: positive samples; right: negative samples. Notice that the N-MNIST and N-Caltech101 datasets have been generated by moving an event-based camera in front of an LCD screen displaying static images. By contrast, our dataset has been acquired in real-world conditions and therefore fully exploits the temporal resolution of the camera by capturing the real dynamics of the objects.
We noticed little influence of the ρ and τ parameters on accuracy, while small values of K improved performance for low-resolution inputs. When the input duration is larger than the value of ∆t used to compute the time surfaces (Eq. 4), we compute the features every ∆t and then stack them together.

The baseline methods we consider are HOTS [30], H-First [50] and Spiking Neural Networks (SNN) [33, 46]. For H-First we used the code provided by the authors online. For SNN we report the previously published results, when available, while for HOTS we used our implementation of the method described in [30]. As with the HATS features, we used a linear SVM on the features extracted with HOTS. Notice that this is in favour of HOTS, since a linear SVM is a more powerful classifier than the one used by the authors [30].

Given that no code is available for SNN, we also compared our results with those of a 2-layer SNN architecture we implemented using predefined Gabor filters [6]. We then again train a linear SVM on the output of the network. We call this approach Gabor-SNN. This allowed us to obtain results for SNN when not readily available in the literature.

Results on the Datasets Converted from Frames
The results for the N-MNIST, N-Caltech101, MNIST-DVS and CIFAR10-DVS datasets are given in Tab. 1. As is usually done, we report the results in terms of classification accuracy. The complete set of parameters used for the methods is reported in the supplementary material.

Our method has the highest classification rate ever reported for an event-based classification method. The performance improvement is higher for the more challenging N-Caltech101 and CIFAR10-DVS datasets. HOTS and a predefined Gabor-SNN have similar performance, while the H-First learning mechanism is too simple to reach good performance.

Results on the N-CARS Dataset
For the N-CARS dataset, the HATS parameters used are K = 10, ρ = 3 and τ = 10^9 µs. In this case, block normalization was not applied because it did not improve results. Since the N-CARS dataset contains only two classes, cars and non-cars, we can consider it as a binary classification problem. Therefore, we also analyze the performance of the methods using ROC curve analysis [19]. The Area Under the Curve (AUC) and the accuracy (Acc.) for our method and the baselines are shown in Tab. 2, while the ROC curves are presented in the supplementary material.

From the results, we see that our method outperforms the baselines by a large margin. The variability contained in a real-world dataset, such as N-CARS, is too large for both the H-First and HOTS learning algorithms to converge to a good feature representation. A predefined Gabor-SNN architecture has better accuracy than H-First and HOTS, but is still 11% lower than our method. The spatio-temporal regularization implemented in our method is more robust to the noise and variability contained in the dataset.

6.2. Latency and Computational Time

Latency is a crucial characteristic for many applications requiring fast reaction times. In this section, we compare HATS, HOTS and Gabor-SNN in terms of their computational time and latency on the N-CARS dataset. All methods are implemented in C++ and run on a laptop equipped with an Intel i7 CPU (64 bits, 2.7 GHz) and 16 GB of RAM.

Tab. 3 compares the average computational times to process a sample. The average computational time per sample was computed by dividing the total time spent computing the features on the full training set by the number of training samples. As we can see, our method is more than 20x faster than HOTS and almost 40x faster than a 2-layer SNN. In particular, our method is 13 times faster than real time. We also report the average number of events processed per second, in kilo-events per second (Kev/s).
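A quick back-of-the-envelope check of these factors, using only the per-sample times reported in Tab. 3 and the 100 ms sample duration; the snippet restates numbers already given in the text and table.

# Real-time factor = sample duration / average processing time per sample.
sample_ms = 100.0
per_sample_ms = {"HATS": 7.28, "HOTS": 157.57, "Gabor-SNN": 285.95}  # from Tab. 3

for name, t in per_sample_ms.items():
    print(f"{name}: {sample_ms / t:.1f}x real time")
# HATS comes out roughly 13.7x faster than real time and roughly 21x / 39x faster
# than HOTS / Gabor-SNN, matching the factors quoted above; the two baselines are
# slower than real time, in line with the caption of Tab. 3.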
Table 1: Comparison of classification accuracy on the datasets converted from frames. Our method has the highest classification rate ever reported for an event-based classification method.

                      N-MNIST   N-Caltech101   MNIST-DVS   CIFAR10-DVS
H-First [50]          0.712     0.054          0.595       0.077
HOTS [30]             0.808     0.210          0.803       0.271
Gabor-SNN             0.837     0.196          0.824       0.245
HATS (this work)      0.991     0.642          0.984       0.524
Phased LSTM [46]      0.973     -              -           -
Deep SNN [33]         0.987     -              -           -

Table 2: Comparison of classification results on the N-CARS dataset. The table reports the global classification accuracy (Acc.) and the AUC score (the higher the better). Our method outperforms the baselines by a large margin.

N-CARS                Acc.     AUC
H-First [50]          0.561    0.408
HOTS [30]             0.624    0.568
Gabor-SNN             0.789    0.735
HATS (this work)      0.902    0.945

Table 3: Average computational time per sample (the lower the better) and average number of events processed per second, in kilo-events per second (Kev/s, the higher the better), on the N-CARS dataset. Since each sample is 100 ms long, our method is more than 13 times faster than real time, while HOTS and Gabor-SNN are respectively 1.5 and 2.8 times slower than real time.

N-CARS                Average Comp. Time per Sample (ms)   Kev/s
HOTS [30]             157.57                               25.68
Gabor-SNN             285.95                               14.15
HATS (this work)      7.28                                 555.74

Latency represents the time period used to accumulate evidence in order to reach a decision on the object class. In our case, this time period is given by the time window used to compute the features, as longer time windows result in higher latency. Notice that with this definition, the latency is independent of both the computational time and the classification accuracy.

There is a trade-off between latency and classification accuracy: on one side, longer time periods yield more information at the cost of higher latency; on the other side, they lead to the risk of mixing dynamics from separate objects, or even different dynamics of the same object. We study this trade-off by plotting the accuracy as a function of the latency for the different methods (Fig. 5). The results were averaged over 5 repetitions. By using only 10 ms of events, HATS has higher performance than the baselines applied to the full 100 ms event stream. The performance of HATS does not completely saturate, probably due to the presence of cars with very small apparent motion in the dataset. We also notice that the performance of Gabor-SNN is unstable, especially at low latency. This is due to the spiking architecture of Gabor-SNN, for which small variations in the input of a layer can cause large differences at its output.

Figure 5: Accuracy as a function of latency on the N-CARS dataset. Our method is consistently more accurate than the baselines and already reaches better performance by using only the events contained in the first 10 ms of the samples.

7. Conclusion and Future Work

In this work, we presented a new feature representation for event-based object recognition by introducing the notion of Histograms of Averaged Time Surfaces. It validates the idea that information is contained in the relative time between events, provided a regularization scheme is introduced to limit the effect of noise. The proposed architecture makes efficient use of past information by using local memory units shared by neighboring pixels, outperforming existing spike-based methods in both accuracy and efficiency.

In the future, we plan to extend our method by using a feature representation also for the memory units, instead of using raw events. This could be done, for example, by training a network to learn linear weights to apply to the incoming time surfaces.
References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. TPAMI, 2011.
[2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. TPAMI, 2013.
[3] O. Bichler, D. Querlioz, S. J. Thorpe, J.-P. Bourgoin, and C. Gamrat. Extraction of temporally correlated features from dynamic vision sensors with spike-timing-dependent plasticity. Neural Networks, 2012.
[4] K. A. Boahen. Point-to-point connectivity between neuromorphic chips using address-events. IEEE Trans. Circuits Syst. II, 2000.
[5] S. M. Bohte, J. N. Kok, and H. La Poutre. Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing, 2002.
[6] A. C. Bovik, M. Clark, and W. S. Geisler. Multichannel texture analysis using localized spatial filters. TPAMI, 1990.
[7] C. Brändli, J. Strubel, S. Keller, D. Scaramuzza, and T. Delbruck. ELiSeD - an event-based line segment detector. In Event-based Control, Communication, and Signal Processing (EBCCSP), International Conference on, 2016.
[8] J. Bruna and S. Mallat. Invariant scattering convolution networks. TPAMI, 2013.
[9] Y. Cao, Y. Chen, and D. Khosla. Spiking deep convolutional neural networks for energy-efficient object recognition. IJCV, 2015.
[10] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In CVPR, 2009.
[11] X. Clady, S.-H. Ieng, and R. Benosman. Asynchronous event-based corner detection and matching. Neural Networks, 2015.
[12] X. Clady, J.-M. Maro, S. Barré, and R. B. Benosman. A motion-based feature for event-based pattern recognition. Frontiers in Neuroscience, 2017.
[13] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[14] T. Delbrück, B. Linares-Barranco, E. Culurciello, and C. Posch. Activity-driven, event-based vision sensors. In Proc. IEEE International Symposium on Circuits and Systems, 2010.
[15] T. Delbrück and C. Mead. An electronic photoreceptor sensitive to small changes in intensity. In NIPS, 1989.
[16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[17] P. U. Diehl, D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In International Joint Conference on Neural Networks, 2015.
[18] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 2012.
[19] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 2006.
[20] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. TPAMI, 2006.
[21] R. Gütig and H. Sompolinsky. The tempotron: a neuron that learns spike timing-based decisions. Nature Neuroscience, 2006.
[22] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
[23] Y. Hu, H. Liu, M. Pfeiffer, and T. Delbruck. DVS benchmark datasets for object tracking, action recognition, and object recognition. Frontiers in Neuroscience, 2016.
[24] N. Kasabov, K. Dhoble, N. Nuntalid, and G. Indiveri. Dynamic evolving spiking neural networks for on-line spatio- and spectro-temporal pattern recognition. Neural Networks, 2013.
[25] H. Kim, S. Leutenegger, and A. J. Davison. Real-time 3D reconstruction and 6-DoF tracking with an event camera. In ECCV, 2016.
[26] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[27] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[28] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[29] X. Lagorce, S.-H. Ieng, X. Clady, M. Pfeiffer, and R. B. Benosman. Spatiotemporal features for asynchronous event-based data. Frontiers in Neuroscience, 2015.
[30] X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman. HOTS: a hierarchy of event-based time-surfaces for pattern recognition. TPAMI, 2017.
[31] I. Laptev. On space-time interest points. IJCV, 2005.
[32] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, 1989.
[33] J. H. Lee, T. Delbruck, and M. Pfeiffer. Training deep spiking neural networks using backpropagation. Frontiers in Neuroscience, 2016.
[34] H. Li, G. Li, and L. Shi. Classification of spatiotemporal events based on random forest. In Advances in Brain Inspired Cognitive Systems: International Conference, 2016.
[35] H. Li, H. Liu, X. Ji, G. Li, and L. Shi. CIFAR10-DVS: An event-stream dataset for object classification. Frontiers in Neuroscience, 11:309, 2017.
[36] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128x128 120 dB 15 us latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 2008.
[37] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[38] B. Linares-Barranco, T. Serrano-Gotarredona, L. A. Camuñas-Mesa, J. A. Perez-Carrasco, C. Zamarreño-Ramos, and T. Masquelier. On spike-timing-dependent-plasticity, memristive devices, and building a self-learning visual cortex. Frontiers in Neuroscience, 2011.
[39] H. Liu, D. P. Moeys, G. Das, D. Neil, S.-C. Liu, and T. Delbrück. Combined frame- and event-based detection and tracking. In Circuits and Systems, 2016 IEEE International Symposium on, 2016.
[40] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[41] D. Martí, M. Rigotti, M. Seok, and S. Fusi. Energy-efficient neuromorphic classifiers. Neural Computation, 2016.
[42] T. Masquelier and S. J. Thorpe. Unsupervised learning of visual features through spike timing dependent plasticity. PLoS Computational Biology, 2007.
[43] E. Mueggler, C. Bartolozzi, and D. Scaramuzza. Fast event-based corner detection. In BMVC, 2017.
[44] E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, and D. Scaramuzza. The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM. The International Journal of Robotics Research, 2017.
[45] D. Neil, M. Pfeiffer, and S.-C. Liu. Learning to be efficient: Algorithms for training low-latency, low-compute deep spiking neural networks. In Proceedings of the 31st Annual ACM Symposium on Applied Computing. ACM, 2016.
[46] D. Neil, M. Pfeiffer, and S.-C. Liu. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In NIPS, 2016.
[47] P. O'Connor, D. Neil, S.-C. Liu, T. Delbruck, and M. Pfeiffer. Real-time classification and sensor fusion with a spiking deep belief network. Frontiers in Neuroscience, 2013.
[48] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in Neuroscience, 2015.
[49] G. Orchard, D. Matolin, X. Lagorce, R. Benosman, and C. Posch. Accelerated frame-free time-encoded multi-step imaging. In Circuits and Systems, 2014 IEEE International Symposium on, 2014.
[50] G. Orchard, C. Meyer, R. Etienne-Cummings, C. Posch, N. Thakor, and R. Benosman. HFirst: A temporal approach to object recognition. TPAMI, 2015.
[51] X. Peng, B. Zhao, R. Yan, H. Tang, and Z. Yi. Bag of events: An efficient probability-based feature extraction method for AER image sensors. IEEE Transactions on Neural Networks and Learning Systems, 2017.
[52] C. Posch, D. Matolin, and R. Wohlgenannt. A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS. Solid-State Circuits, IEEE Journal of, 2011.
[53] C. Posch, T. Serrano-Gotarredona, B. Linares-Barranco, and T. Delbruck. Retinomorphic event-based vision sensors: Bioinspired cameras with spiking output. Proceedings of the IEEE, 2014.
[54] H. Rebecq, T. Horstschaefer, G. Gallego, and D. Scaramuzza. EVO: A geometric approach to event-based 6-DoF parallel tracking and mapping in real time. IEEE Robotics and Automation Letters, 2017.
[55] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. CoRR, 2016.
[56] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[57] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
[58] B. Rueckauer, I.-A. Lungu, Y. Hu, and M. Pfeiffer. Theory and tools for the conversion of analog to spiking convolutional neural networks. arXiv preprint arXiv:1612.04052, 2016.
[59] A. Russell, G. Orchard, Y. Dong, Ş. Mihalas, E. Niebur, J. Tapson, and R. Etienne-Cummings. Optimization methods for spiking neurons and networks. IEEE Transactions on Neural Networks, 2010.
[60] M. Scherer, M. Walter, and T. Schreck. Histograms of oriented gradients for 3D object retrieval. In WSCG, 2010.
[61] S. Seifozzakerini, W.-Y. Yau, B. Zhao, and K. Mao. Event-based Hough transform in a spiking neural network for multiple line detection and tracking using a dynamic vision sensor. In BMVC, 2016.
[62] R. Serrano-Gotarredona, T. Serrano-Gotarredona, A. Acosta-Jimenez, and B. Linares-Barranco. A neuromorphic cortical-layer microchip for spike-based event processing vision systems. IEEE Transactions on Circuits and Systems I: Regular Papers, 2006.
[63] T. Serrano-Gotarredona and B. Linares-Barranco. A 128 x 128 1.5% contrast sensitivity 0.9% FPN 3 µs latency 4 mW asynchronous frame-free dynamic vision sensor using transimpedance preamplifiers. Solid-State Circuits, IEEE Journal of, 2013.
[64] T. Serrano-Gotarredona and B. Linares-Barranco. Poker-DVS and MNIST-DVS. Their history, how they were made, and other details. Frontiers in Neuroscience, 2015.
[65] S. Sheik, M. Pfeiffer, F. Stefanini, and G. Indiveri. Spatio-temporal spike pattern classification in neuromorphic systems. In Biomimetic and Biohybrid Systems, 2013.
[66] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. TPAMI, 2017.
[67] J. Sivic and A. Zisserman. Efficient visual search of videos cast as text retrieval. TPAMI, 2009.
[68] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
[69] E. Stromatias, M. Soto, T. Serrano-Gotarredona, and B. Linares-Barranco. An event-driven classifier for spiking neural networks fed with synthetic or dynamic vision sensor data. Frontiers in Neuroscience, 2017.
[70] C. Tan, S. Lallee, and G. Orchard. Benchmarking neuromorphic vision: lessons learnt from computer vision. Frontiers in Neuroscience, 2015.
[71] V. Vasco, A. Glover, and C. Bartolozzi. Fast event-based Harris corner detection exploiting the advantages of event-driven cameras. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, 2016.
[72] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 2004.
[73] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[74] A. Witkin. Scale-space filtering: A new approach to multi-scale description. In Acoustics, Speech, and Signal Processing, IEEE International Conference on, 1984.
[75] S. Xie and Z. Tu. Holistically-nested edge detection. IJCV, 2017.
[76] B. Zhao, R. Ding, S. Chen, B. Linares-Barranco, and H. Tang. Feedforward categorization on AER motion events using cortex-like features in a spiking neural network. IEEE Transactions on Neural Networks and Learning Systems, 2015.
