
Asynchronous Convolutional Networks for Object Detection

in Neuromorphic Cameras

Marco Cannici Marco Ciccone Andrea Romanoni Matteo Matteucci


Politecnico di Milano, Italy
{marco.cannici,marco.ciccone,andrea.romanoni,matteo.matteucci}@polimi.it

Abstract

Event-based cameras, also known as neuromorphic cameras, are bio-inspired sensors able to perceive changes in the scene at high frequency with low power consumption. Having become available only very recently, a limited amount of work addresses object detection on these devices. In this paper we propose two neural network architectures for object detection: YOLE, which integrates the events into surfaces and uses a frame-based model to process them, and fcYOLE, an asynchronous event-based fully convolutional network which uses a novel and general formalization of the convolutional and max pooling layers to exploit the sparsity of camera events. We evaluate the algorithms on different extensions of publicly available datasets and on a novel synthetic dataset.

1. Introduction

Fundamental techniques underlying Computer Vision are based on the ability to extract meaningful features. To this extent, Convolutional Neural Networks (CNNs) rapidly became the first choice in many computer vision applications such as image classification [18, 45, 13, 48], object detection [42, 41, 25], and semantic scene labeling [49, 37, 26], and they have recently been extended to non-Euclidean domains such as manifolds and graphs [16, 31]. In most cases the input of these networks is images.

In the meanwhile, neuromorphic cameras [43, 36, 3] are becoming more and more widespread. These devices are bio-inspired vision sensors that attempt to emulate the functioning of biological retinas. As opposed to conventional cameras, which generate frames at a constant frame rate, these sensors output data only when a brightness change is detected in the field of view. Whenever this happens, an event e = ⟨x, y, ts, p⟩ is generated indicating the position (x, y), the instant ts at which the change has been detected, and its polarity p ∈ {1, −1}, i.e., whether the brightness change is positive or negative. The result is a sensor able to produce a stream of asynchronous events that sparsely encodes changes with microsecond resolution and with minimal requirements in terms of power consumption and bandwidth. The growing popularity of these sensors, and their advantages in terms of temporal resolution and reduced data redundancy, have led to the exploitation of event-based vision in a variety of applications, e.g., object tracking [39, 29, 11], visual odometry [32, 40], and optical flow estimation [2, 24, 47].

Spiking Neural Networks (SNNs) [27], a processing model aiming to improve the biological realism of artificial neural networks, are one of the most popular neural models able to directly handle events. Despite their advantages in terms of speed and power consumption, however, training deep SNN models on complex tasks is usually very difficult. To overcome the lack of scalable training procedures, recent works have focused on converting pre-trained deep networks to SNNs, achieving promising results even on complex tasks [14, 5, 8].

An alternative solution to deal with event-based cameras is to make use of frame integration procedures and conventional frame-based networks [35], which can instead rely on optimized training procedures. Recently, other alternatives to SNNs making use of hierarchical time surfaces [20] and memory cells [46] have also been introduced. Another solution, proposed in [33], suggests instead the use of LSTM cells to accumulate events and perform classification. An extension of this work making use of attention mechanisms has also been proposed in [4].

Although event cameras are becoming increasingly popular, only very few datasets for object detection in event streams are available, and a limited number of object detection algorithms has been proposed [23, 6, 38].

In this paper we introduce a novel hybrid approach to extract features for object detection problems using neuromorphic cameras. The proposed framework allows the design of object detection networks able to sparsely compute features while still preserving the advantages of conventional neural networks. More importantly, networks implemented using the proposed procedure are asynchronous, meaning that computation is only performed when a sequence of events arrives, and only where previous results need to be recomputed.
In Section 3 the convolution and max-pooling operations are reformulated by adding an internal state, i.e., a memory of the previous prediction, that allows us to sparsely recompute feature maps. An asynchronous fully convolutional network for event-based object detection which exploits this formulation is finally described in Section 3.4.

2. Background

Leaky Surface. The basic component of the proposed architectures is a procedure able to accumulate events. Sparse events generated by the neuromorphic camera are integrated into a leaky surface, a structure that takes inspiration from the functioning of Spiking Neural Networks (SNNs) to maintain memory of past events. A similar mechanism has already been proposed in [7]. Every time an event with coordinates (x_e, y_e) and timestamp ts_t is received, the corresponding pixel of the surface is incremented by a fixed amount ∆incr. At the same time, the whole surface is decremented by a quantity which depends on the time elapsed between the last received event and the previous one. The described procedure can be formalized by the following equations:

q^{t}_{x_s,y_s} = \max\left( p^{t-1}_{x_s,y_s} - \lambda \cdot \Delta ts,\; 0 \right)    (1)

p^{t}_{x_s,y_s} = \begin{cases} q^{t}_{x_s,y_s} + \Delta_{incr} & \text{if } (x_s, y_s)^t = (x_e, y_e)^t \\ q^{t}_{x_s,y_s} & \text{otherwise} \end{cases}    (2)

where p^t_{x_s,y_s} is the value of the surface pixel at position (x_s, y_s) of the leaky surface and ∆ts = ts_t − ts_{t−1}. To improve readability in the following equations, we name the quantity (ts_t − ts_{t−1}) · λ as ∆leak. Notice that the effects of λ and ∆incr are related: ∆incr determines how much information is contained in each single event, whereas λ defines the decay rate of activations. Given a certain choice of these parameters, similar results can be obtained by using, for instance, a higher increment ∆incr together with a higher λ. For this reason, we fix ∆incr = 1 and we vary only λ based on the dataset to be processed. Pixel values are prevented from becoming negative by means of the max operation.
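As an illustration, a minimal NumPy sketch of the update in Equations (1)-(2) is given below; the array shape, the value of λ and the event format are placeholders chosen for the example and are not prescribed by the paper.

```python
import numpy as np

class LeakySurface:
    """Integrates sparse events into a leaky surface (Eqs. (1)-(2))."""

    def __init__(self, height, width, lam, delta_incr=1.0):
        self.p = np.zeros((height, width), dtype=np.float32)  # surface state p^t
        self.lam = lam                                         # decay rate lambda
        self.delta_incr = delta_incr                           # per-event increment
        self.last_ts = None                                    # timestamp of last event

    def update(self, x, y, ts):
        # Eq. (1): decay the whole surface by lambda * (ts_t - ts_{t-1}), clamp at 0.
        dt = 0.0 if self.last_ts is None else ts - self.last_ts
        q = np.maximum(self.p - self.lam * dt, 0.0)
        # Eq. (2): increment only the pixel hit by the incoming event.
        q[y, x] += self.delta_incr
        self.p, self.last_ts = q, ts
        return self.p

# Example usage with made-up events (x, y, timestamp in microseconds).
surface = LeakySurface(128, 128, lam=1e-5)
for x, y, ts in [(10, 12, 0), (11, 12, 150), (40, 70, 400)]:
    surface.update(x, y, ts)
```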
Other frame integration procedures, such as the one in [35], divide time into predefined constant intervals. Frames are obtained by setting each pixel to a binary value (depending on the polarity) if at least one event has been received at that pixel within the integration interval. With this mechanism, however, time resolution is lost and the same importance is given to each event, even if it represents noise. The adopted method, instead, performs continuous and incremental integration and is able to better handle noise.

Similar procedures capable of maintaining time resolution have also been proposed, such as those that make use of exponential decays [7, 19] to update surfaces, and those relying on histograms of events [28]. Recently, the concept of time surface has also been introduced in [20], where surfaces are obtained by associating each event with temporal features computed by applying exponential kernels to the event neighborhood. Extensions of this procedure making use of memory cells [46] and event histograms [1] have also been proposed. Although these event representations better describe complex scene dynamics, we make use of a simpler formulation to derive a linear dependence between consecutive surfaces. This allows us to design the event-based framework discussed in Section 3, in which time decay is applied to every layer of the network.

Event-based Object Detection. We identified YOLO [41] as a good candidate model to tackle the object detection problem in event-based scenarios: it is fully differentiable and produces predictions with small input-output delays. By means of a standard CNN and with a single forward pass, YOLO is able to simultaneously predict not only the class, but also the position and dimension of every object in the scene. We used the YOLO loss and the previous leaky surface procedure to train a baseline model which we called YOLE: "You Only Look at Events". The architecture is depicted in Figure 1. We use this model as a reference to highlight the strengths and weaknesses of the framework described in Section 3, which is the main contribution of this work. YOLE processes 128 × 128 surfaces, it predicts B = 2 bounding boxes for each region, and it classifies objects into C different categories.

Note that in this context we use the term YOLO to refer only to the training procedure proposed by [41] and not to the specific network architecture; we indeed used a simpler structure for our models, as explained in Section 4. Nevertheless, YOLE, i.e., YOLO + leaky surface, does not exploit the sparse nature of events; to address this issue, in the next section we propose a fully event-based asynchronous framework for convolutional networks.

3. Event-based Fully Convolutional Networks

Conventional CNNs for video analysis treat every frame independently and recompute all the feature maps entirely, even if consecutive frames differ from each other only in small portions. Besides being a significant waste of power and computation, this approach does not match the nature of event-based cameras.

To exploit the event-based nature of neuromorphic vision, we propose a modification of the forward pass of fully convolutional architectures. In the following, the convolution and pooling operations are reformulated to produce the final prediction by recomputing only the features corresponding to regions affected by the events. Feature maps maintain their state over time and are updated only as a consequence of incoming events. At the same time, the leaking mechanism that allows past information to be forgotten acts independently on each layer of the CNN. This enables features computed in the past to fade away as their visual information starts to disappear in the input surface. The result is an asynchronous CNN able to perform computation only when requested and at different rates. The network can indeed be used to produce an output only when new events arrive, dynamically adapting to the timings of the input, or to produce results at regular rates by using the leaking mechanism to update layers in the absence of new events.
[Figure 1: architecture diagram. The integrator converts events ⟨x, y, ts, p⟩ into a 128 × 128 surface, processed by five 5 × 5 convolutional layers (16 and 32 filters), each followed by a 2 × 2 max-pooling layer, and by three fully connected layers; the output is a 4 × 4 grid of regions, each with 20 = C + 5B values (B = 2 bounding boxes per region, C = 10 classes).]
Figure 1. The YOLE detection network based on YOLO used to detect MNIST-DVS digits. The input surfaces are divided into a grid of 4 × 4 regions which predict 2 bounding boxes each.
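To make the grid parameterization concrete, the sketch below decodes such a 4 × 4 × (C + 5B) output tensor into bounding boxes. It assumes a standard YOLO-style per-box layout (x, y, w, h, confidence, relative to the region), which the paper does not spell out, so the exact ordering is an assumption made for illustration.

```python
import numpy as np

def decode_grid(output, img_size=128, S=4, B=2, C=10):
    """Turn an S x S x (C + 5B) YOLO-style grid into (class, score, box) tuples.

    Assumed per-region layout: C class scores followed by B blocks of
    (x, y, w, h, confidence); x, y are offsets inside the region and
    w, h are fractions of the image side. This layout is an assumption
    made for the example, not a detail stated in the paper.
    """
    cell = img_size / S
    detections = []
    for i in range(S):            # row of the region
        for j in range(S):        # column of the region
            cls_scores = output[i, j, :C]
            cls_id = int(np.argmax(cls_scores))
            for b in range(B):
                x, y, w, h, conf = output[i, j, C + 5 * b: C + 5 * (b + 1)]
                cx, cy = (j + x) * cell, (i + y) * cell   # box center in pixels
                bw, bh = w * img_size, h * img_size       # box size in pixels
                detections.append((cls_id, conf * cls_scores[cls_id],
                                   (cx - bw / 2, cy - bh / 2, bw, bh)))
    return detections

# Example: decode a random prediction tensor.
preds = decode_grid(np.random.rand(4, 4, 20))
```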

The proposed framework has been developed to extend the YOLE detection network presented in Section 2. Nevertheless, this method can be applied to any convolutional architecture to perform asynchronous computation. A CNN trained to process frames reconstructed from streams of events can indeed be easily converted into an event-based CNN without any modification of its layer composition, by using the same weights learned while observing frames and maintaining its output unchanged.

3.1. Leaky Surface Layer

The procedure used to compute the leaky surface described in Section 2 is embedded into an actual layer of the proposed framework. Furthermore, to allow subsequent layers to locate changes inside the surface, the following information is also forwarded to the next layer: (i) the list of incoming events; (ii) ∆leak, which is sent to all the subsequent layers to update feature maps not affected by the events; (iii) the list of surface pixels which have been reset to 0 by the max operator in Equation (1).

3.2. Event-based Convolutional Layer (e-conv)

The event-based convolutional (e-conv) layer we propose uses events to determine where the input feature map has changed with respect to the previous time step and, therefore, which parts of its internal state, i.e., the feature map computed at the previous time step, must be recomputed and which parts can be reused. We use a procedure similar to the one described in the previous section to let features decay over time. However, due to the transformations applied by previous layers and the composition of their activation functions, ∆leak may act differently in different parts of the feature map. For instance, the decrease of a pixel intensity value in the input surface may cause the value computed by a certain feature in a deeper layer to decrease, but it could also cause another feature of the same layer to increase. The update procedure, therefore, must also be able to accurately determine how a single bit of information is transformed by the network through all the previous layers, in any spatial location. We face this issue by storing an additional feature map, F(n), and by using a particular class of activation functions in the hidden layers of the network.

Let us consider the first layer of a CNN which processes surfaces obtained using the procedure described in the previous section and which computes the convolution of a set of filters W with bias b and activation function g(·). The computation performed on each receptive field is:

y^{t}_{ij(1)} = g\Big( \sum_{h}\sum_{k} x^{t}_{h+i,k+j} \, W_{hk(1)} + b_{(1)} \Big) = g\big( \tilde{y}^{t}_{ij(1)} \big),    (3)

where h, k select a pixel x^t_{h+i,k+j} in the receptive field of the output feature (i, j) and its corresponding value in the kernel W, whereas the subscript (1) indicates the hidden layer of the network (in this case the first after the leaky surface layer).

When a new event arrives, the leaky surface layer decreases all the pixels by ∆leak, i.e., a pixel not directly affected by the event becomes x^{t+1}_{hk} = x^{t}_{hk} − ∆^{t+1}_{leak}, with ∆^{t+1}_{leak} > 0. At time t + 1, Equation (3) becomes:

y^{t+1}_{ij(1)} = g\Big( \sum_{h}\sum_{k} x^{t+1}_{h+i,k+j} W_{hk(1)} + b_{(1)} \Big)
             = g\Big( \sum_{h}\sum_{k} \big( x^{t}_{h+i,k+j} - \Delta^{t+1}_{leak} \big) W_{hk(1)} + b_{(1)} \Big)
             = g\Big( \tilde{y}^{t}_{ij(1)} - \Delta^{t+1}_{leak} \sum_{h}\sum_{k} W_{hk(1)} \Big).    (4)
If g(·) is (i) a piecewise linear activation function g(x) = {α_i · x if x ∈ D_i}, such as ReLU or Leaky ReLU, and we assume that (ii) the updated value does not change which linear segment of the activation function the output falls onto and, in this first approximation, (iii) the leaky surface layer does not restrict pixels using max(·, 0), Equation (4) can be rewritten as follows:

y^{t+1}_{ij(1)} = y^{t}_{ij(1)} - \Delta^{t+1}_{leak} \, \alpha_{ij(1)} \sum_{h}\sum_{k} W_{hk(1)},    (5)

where α_{ij(1)} is the coefficient applied by the piecewise function g(·), which depends on the feature value at position (i, j). When the previous assumption is not satisfied, the feature is recomputed as if its receptive field was affected by an event (i.e., applying the filter W locally to x^{t+1}).

Consider now a second convolutional layer attached to the first one:

y^{t+1}_{ij(2)} = g\Big( \sum_{h,k} y^{t+1}_{i+h,j+k(1)} W_{hk(2)} + b_{(2)} \Big)
             = g\Big( \sum_{h,k} \big( y^{t}_{i+h,j+k(1)} - \Delta^{t+1}_{leak} \alpha_{i+h,j+k(1)} \sum_{h',k'} W_{h'k'(1)} \big) W_{hk(2)} + b_{(2)} \Big)
             = y^{t}_{ij(2)} - \Delta^{t+1}_{leak} \alpha_{ij(2)} \sum_{h,k} \Big( \alpha_{i+h,j+k(1)} \sum_{h',k'} W_{h'k'(1)} \Big) W_{hk(2)}
             = y^{t}_{ij(2)} - \Delta^{t+1}_{leak} \alpha_{ij(2)} \sum_{h,k} F^{t+1}_{h+i,k+j(1)} W_{hk(2)} = y^{t}_{ij(2)} - \Delta^{t+1}_{leak} F^{t+1}_{ij(2)}.    (6)

The previous equation can be extended by induction as follows:

y^{t+1}_{ij(n)} = y^{t}_{ij(n)} - \Delta^{t+1}_{leak} F^{t+1}_{ij(n)}, \quad \text{with } F^{t+1}_{ij(n)} = \alpha_{ij(n)} \sum_{h}\sum_{k} F^{t+1}_{i+h,j+k(n-1)} W_{hk(n)} \text{ if } n > 1,    (7)

where F^t_{ij(n)} expresses how visual inputs are transformed by the network in every receptive field (i, j), i.e., the composition of the previous layers' activation functions.

Given this formulation, the max operator applied by the leaky surface layer can be interpreted as a ReLU, and Equation (5) becomes:

y^{t+1}_{ij(1)} = y^{t}_{ij(1)} - \Delta^{t+1}_{leak} \, \alpha_{ij(1)} \sum_{h}\sum_{k} F_{i+h,j+k(0)} W_{hk(1)},    (8)

where the value F_{i+h,j+k(0)} is 0 if the pixel x_{i+h,j+k} ≤ 0 and 1 otherwise.

Notice that F_{ij(n)} needs to be updated only when the corresponding feature changes enough to make the activation function use a different coefficient α, e.g., from 0 to 1 in the case of ReLU. In this case F(n) is updated locally, in correspondence of the change, by using the update matrix of the previous layer and by applying Equation (7) only for the features whose activation function has changed. Events are used to communicate the change to subsequent layers so that their update matrices can also be updated accordingly.

The internal state of the e-conv layer, therefore, comprises the feature maps y^{t−1}_{(n)} and the update values F^{t−1}_{(n)} computed at the previous time step. The initial values of the internal state are computed by making full inference on a blank surface; this is the only time the network needs to be executed entirely. As a new sequence of events arrives, the following operations are performed (see Figure 3(a) and the sketch after this list):

i. Update F^{t−1}_{(n)} locally on the coordinates specified by the list of incoming events (Eq. (7)). Note that we do not distinguish between actual events and those generated by the use of a different slope in the linear activation function.

ii. Update the feature map y_{(n)} with Eq. (7) in the locations which are not affected by any event and generate an output event where the activation function coefficient has changed.

iii. Recompute y_{(n)} by applying W locally in correspondence of the incoming events and output which receptive fields have been affected.

iv. Forward the feature map and the events generated in the current step to the next layer.
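The following NumPy sketch illustrates steps i-iv for a single-channel e-conv layer with a ReLU activation. It is a simplified, dense-array rendition of the update rules (Eqs. (5)-(7)): padding, bias bookkeeping and multi-channel handling are reduced to the bare minimum, slope changes at unaffected locations are simply clamped and signalled rather than fully recomputed, and the class and helper names are illustrative rather than the authors' implementation.

```python
import numpy as np

def conv_at(x, W, b, i, j):
    """Pre-activation of one output feature (Eq. (3)) computed from scratch."""
    kh, kw = W.shape
    return float(np.sum(x[i:i + kh, j:j + kw] * W) + b)

class EConvLayer:
    """Single-channel e-conv layer with ReLU (slopes alpha in {0, 1})."""

    def __init__(self, x0, W, b=0.0):
        self.W, self.b = W, b
        kh, kw = W.shape
        H, Wd = x0.shape[0] - kh + 1, x0.shape[1] - kw + 1
        pre = np.array([[conv_at(x0, W, b, i, j) for j in range(Wd)] for i in range(H)])
        self.y = np.maximum(pre, 0.0)                 # feature map state y^{t-1}
        self.alpha = (pre > 0).astype(np.float64)     # current ReLU slope per feature
        self.F = np.zeros_like(self.y)                # update matrix F_(1) (Eq. (8))
        for i in range(H):
            for j in range(Wd):
                self._refresh_F(x0, i, j)

    def _refresh_F(self, x, i, j):
        kh, kw = self.W.shape
        f0 = (x[i:i + kh, j:j + kw] > 0).astype(np.float64)   # F_(0): 1 on active pixels
        self.F[i, j] = self.alpha[i, j] * np.sum(f0 * self.W)

    def step(self, x_new, events, delta_leak):
        kh, kw = self.W.shape
        H, Wd = self.y.shape
        affected = np.zeros((H, Wd), dtype=bool)
        for ex, ey in events:                          # mark receptive fields hit by events
            for i in range(max(0, ey - kh + 1), min(H, ey + 1)):
                for j in range(max(0, ex - kw + 1), min(Wd, ex + 1)):
                    affected[i, j] = True
        # ii. Unaffected features: linear update y^{t+1} = y^t - dleak * F (Eq. (7));
        #     a feature crossing zero changes its ReLU slope and emits an event
        #     (here simply clamped for brevity).
        decayed = self.y - delta_leak * self.F
        crossed = (~affected) & (self.alpha > 0) & (decayed <= 0)
        self.y = np.where(affected, self.y, np.maximum(decayed, 0.0))
        self.alpha[crossed], self.F[crossed] = 0.0, 0.0
        out_events = [(j, i) for i, j in zip(*np.nonzero(crossed))]
        # iii. Affected features: recompute locally with W and refresh alpha and F (i.).
        for i, j in zip(*np.nonzero(affected)):
            pre = conv_at(x_new, self.W, self.b, i, j)
            self.alpha[i, j] = float(pre > 0)
            self._refresh_F(x_new, i, j)
            self.y[i, j] = max(pre, 0.0)
            out_events.append((j, i))
        # iv. Forward the updated feature map and the generated events.
        return self.y, out_events
```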
3.3. Event-based Max Pooling Layer (e-max-pool)

The location of the maximum value in each receptive field of a max-pooling layer is likely to remain the same over time. An event-based pooling layer, hence, can exploit this property to avoid recomputing the position of the maximum values every time.

The internal state of an event-based max-pooling (e-max-pool) layer can be described by a positional matrix I(n), which has the shape of the output feature map produced by the layer and which stores, for every receptive field, the position of its maximum value. Every time a sequence of events arrives, the internal state I^t_(n) is sparsely updated by recomputing the position of the maximum values in every receptive field affected by an incoming event. The internal state is then used both to build the output feature map and to produce the update matrix F^t_(n) by fetching the previous layer at the locations provided by the indices I^t_{ij(n)}. For each e-max-pool layer, the indices of the receptive fields where the maximum value changes are communicated to the subsequent layers so that their internal states can be updated accordingly. This mechanism is depicted in Figure 3(b).

Notice that the leaking mechanism acts differently in distinct regions of the input space. Features inside the same receptive field can indeed decrease over time with different speeds, as their update values F^t_{ij(n)} can be different. Therefore, even if no event has been detected inside a region, the position of its maximum value might change.
[Figure 2: architecture diagram of fcYOLE — a leaky surface layer followed by e-conv and e-max-pool blocks producing features and F(n) matrices.]
Figure 2. fcYOLE: a fully-convolutional detection network based on YOLE. The last layer is used to map the feature vectors into a set of 20 values which define the parameters of the predicted bounding boxes.

Figure 3. The structure of the e-conv (a) and e-max-pooling (b) layers. The internal states and the update matrices are recomputed locally only where events are received (green cells), whereas the remaining regions (depicted in yellow) are obtained by reusing the previous state.

However, if an input feature M has the minimum update rate F_{M(n−1)} among the features in its receptive field R and it also corresponds to the maximum value in R, the corresponding output feature will decrease more slowly than all the others in R and its value will remain the maximum. In this case, its index I^t_{(n)R} does not need to be recomputed until a new event arrives in R. We check whether the maximum has to be recomputed for each receptive field affected by incoming events and also in all positions where the previous condition does not hold.
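A minimal sketch of this bookkeeping for a single-channel e-max-pool layer is given below; the 2 × 2 pooling window, the array handling and the method names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

class EMaxPool:
    """2x2 event-based max pooling that caches argmax positions per receptive field."""

    def __init__(self, x0, k=2):
        self.k = k
        H, W = x0.shape[0] // k, x0.shape[1] // k
        self.idx = np.zeros((H, W, 2), dtype=np.int64)   # positional matrix I_(n)
        for i in range(H):
            for j in range(W):
                self._refresh(x0, i, j)

    def _refresh(self, x, i, j):
        patch = x[i * self.k:(i + 1) * self.k, j * self.k:(j + 1) * self.k]
        di, dj = np.unravel_index(np.argmax(patch), patch.shape)
        self.idx[i, j] = (i * self.k + di, j * self.k + dj)

    def step(self, x, events):
        """Recompute argmax only in receptive fields hit by events; report changes."""
        out_events = []
        for (ex, ey) in events:
            i, j = ey // self.k, ex // self.k
            old = tuple(self.idx[i, j])
            self._refresh(x, i, j)
            if tuple(self.idx[i, j]) != old:             # max moved: notify next layer
                out_events.append((j, i))
        # Output map (and the update matrix F) is fetched from the previous layer at idx.
        y = x[self.idx[..., 0], self.idx[..., 1]]
        return y, out_events
```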
3.4. Event FCN for Object Detection (fcYOLE)

To fully exploit the event-based layers presented so far, the YOLE model described in Section 2 needs to be converted into a fully convolutional object detection network by replacing all its layers with their event-based versions (see Figure 3). Moreover, fully-connected layers are replaced with 1 × 1 e-conv layers which map the features extracted by the previous layers into a precise set of values defining the bounding box parameters predicted by the network. Training was first performed on a network composed of standard layers; the learned weights were then used with e-conv and e-max-pool layers during inference.

This architecture divides the 128 × 128 field of view into a grid of 4 × 4 regions, each of which predicts 2 bounding boxes and classifies the detected objects into C different classes. The last 1 × 1 e-conv layer is used to decrease the dimensionality of the feature vectors and to map them into the right set of parameters, regardless of their position in the field of view.

Moreover, this architecture can be used to process surfaces of different sizes without the need to re-train or re-design it. The subnetworks processing 32 × 32 regions, in fact, being defined by the same set of parameters, can be stacked together to process even larger surfaces.
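As a small illustration of why the 1 × 1 head makes the detector position-independent, the sketch below applies the same per-region weights at every grid location; the tensor shapes and weight names are made up for the example and do not correspond to the actual fcYOLE layers.

```python
import numpy as np

def conv1x1_head(features, W, b):
    """Map an (H, W, D) feature map to (H, W, C + 5B) predictions.

    The same (D, C + 5B) weight matrix is applied at every spatial location,
    so each region is decoded identically regardless of its position and the
    head can be slid over surfaces of any size.
    """
    return np.einsum('hwd,do->hwo', features, W) + b

# Hypothetical shapes: a 4 x 4 grid of 128-dimensional region descriptors
# mapped to 20 = C + 5B outputs per region (C = 10 classes, B = 2 boxes).
feats = np.random.rand(4, 4, 128)
W, b = np.random.rand(128, 20), np.zeros(20)
grid_preds = conv1x1_head(feats, W, b)                     # shape (4, 4, 20)

# The same head also works on a larger surface, e.g. an 8 x 6 grid of regions.
larger = conv1x1_head(np.random.rand(8, 6, 128), W, b)     # shape (8, 6, 20)
```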
4. Experiments

4.1. Datasets

Only a few event-based object recognition datasets are publicly available in the literature. The most popular ones are N-MNIST [34], MNIST-DVS [44], CIFAR10-DVS [22], N-Caltech101 [34], and POKER-DVS [44]. These datasets are obtained from the original MNIST [21], CIFAR-10 [17], and Caltech101 [10] datasets by recording the original images with an event camera while moving either the camera itself or the images of the datasets. We performed experiments on N-Caltech101 and on modified versions of N-MNIST and MNIST-DVS for object detection, i.e., Shifted N-MNIST and Shifted MNIST-DVS, and on an extended version of POKER-DVS, namely OD-Poker-DVS. Moreover, we also performed experiments on a synthetic dataset, named Blackboard MNIST, showing digits written on a blackboard. A detailed description of these datasets is provided in the supplementary materials.
Shifted N-MNIST. The N-MNIST [34] dataset is a conversion of the popular MNIST [21] image dataset for computer vision. We enhanced this collection by building a slightly more complex set of recordings. Each sample is indeed composed of two N-MNIST samples placed in two random non-overlapping locations of a bigger 124 × 124 field of view. Each digit was also preprocessed by extracting its bounding box, which was then moved, along with the events, to its new position in the bigger field of view. The final dataset is composed of 60,000 training and 10,000 testing samples.

Shifted MNIST-DVS. We used a similar procedure to obtain Shifted MNIST-DVS recordings. We first extracted bounding boxes with the same procedure used in Shifted N-MNIST and then placed them in a 128 × 128 field of view. We mixed MNIST-DVS scale4, scale8, and scale16 samples within the same recording, obtaining a dataset composed of 30,000 samples.

OD-Poker-DVS. The Poker-DVS dataset is a small collection of neuromorphic samples showing poker card pips, obtained by extracting 31 × 31 symbols from three deck recordings. We used the tracking algorithm provided with the dataset to track pips and enhance the original uncut deck recordings with their bounding boxes. We finally divided these recordings into a set of shorter examples, obtaining a collection composed of 218 training and 74 testing samples.

Blackboard MNIST. We used the DAVIS simulator released by [32] to build our own set of synthetic recordings. The resulting dataset consists of a number of samples showing digits written on a blackboard in random positions and with different scales. We preprocessed a subset of images from the original MNIST dataset by removing their background and by making them look as if they were written with chalk. Sets of digits were then placed on the image of a blackboard and the simulator was finally run to obtain event-based recordings and the bounding boxes of every digit visible within the camera field of view. The resulting dataset is the union of three simulations featuring increasing levels of variability in terms of camera trajectories and digit dimensions. The overall dataset is composed of 2750 training and 250 testing samples.

N-Caltech101. The N-Caltech101 [34] collection is the only publicly available event-based dataset providing bounding box annotations. We split the dataset into 80% training and 20% testing samples using a stratified split. Since no ground truth bounding boxes are available for the background class, we decided not to use this additional category in our experiments.

4.2. Experiments Setup

Event-based datasets, especially those based on MNIST, are generally simpler than the image-based ones used to train the original YOLO architecture. Therefore, we designed the MNIST object detection networks taking inspiration from the simpler LeNet [21] model, composed of 6 conv-pool layers for feature extraction. Both YOLE and fcYOLE share the same structure up to the last regression/classification layers.

For what concerns the N-Caltech101 dataset, we used a slightly different architecture inspired by the structure of the VGG16 model [45]. The network is composed of only one layer for each group of convolutional layers, as we noticed that a simpler architecture achieved better results. Moreover, the dimensions of the last fully-connected layers have been adjusted such that the surface is divided into a grid of 5 × 7 regions predicting B = 2 bounding boxes each. As in the original YOLO architecture, we used Leaky ReLU for the activation functions of the hidden layers and a linear activation for the last one.

In all the experiments, the first 4 convolutional layers have been initialized with kernels obtained from a recognition network pretrained to classify the target objects, while the remaining layers were initialized using the procedure proposed in [12]. All networks were trained by optimizing the multi-objective loss proposed by [41] using Adam [15], with learning rate 10^{-4}, β1 = 0.9, β2 = 0.999 and ε = 10^{-8}. The batch size was chosen depending on the dataset: 10 for Shifted N-MNIST, 40 for Shifted MNIST-DVS and N-Caltech101, 25 for Blackboard MNIST, and 35 for Poker-DVS, with the aim of filling the memory of the GPU optimally. Early stopping was applied to prevent overfitting, using validation sets with the same size as the test set.

4.3. Results and Discussion

Detection performance of YOLE. The YOLE network achieves good detection results in most of the datasets, both in terms of mean average precision (mAP) [9] and accuracy, which in this case is computed by matching every ground truth bounding box with the predicted one having the highest intersection over union (IOU). The results we obtained are summarized in Table 3.
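A minimal sketch of this matching-based accuracy metric is shown below, assuming axis-aligned boxes given as (x, y, w, h); the exact tie-breaking and thresholding rules are not specified in the paper, so this is only an illustration.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def detection_accuracy(gt, preds):
    """Match each ground-truth box with the highest-IoU prediction and count
    the fraction whose predicted class is correct. gt and preds are lists of
    (class_id, (x, y, w, h)) tuples."""
    if not gt:
        return 1.0
    correct = 0
    for gt_cls, gt_box in gt:
        best = max(preds, key=lambda p: iou(gt_box, p[1]), default=None)
        if best is not None and best[0] == gt_cls:
            correct += 1
    return correct / len(gt)
```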
We also used the Shifted N-MNIST dataset to analyze how detection performance changes when the network processes scenes composed of a variable number of objects, as reported in Table 4. We denote as v1 the results obtained using scenes composed of a single digit, and with v2 those obtained with scenes containing two digits in random locations of the field of view. We further tested the robustness of the proposed model by adding some challenging noise. We added non-target objects (v2fr) in the form of five 8 × 8 fragments taken from random N-MNIST digits, using a procedure similar to the one used to build the Cluttered Translated MNIST dataset [30], and 200 additional random events per frame (v2fr+ns).

In the case of multiple objects the algorithm is still able to detect all of them, while, as expected, performance drops both in terms of accuracy and mean average precision when dealing with noisy data.
Table 1. YOLE Top-20 average precisions on N-Caltech101. Full table provided in the supplemental material.
[Table 1 data. AP per column (left to right): 97.8, 95.8, 94.7, 88.3, 88.1, 86.5, 85.9, 84.2, 81.3, 81.3, 80.7, 75.1, 68.4, 68.1, 65.2, 64.5, 63.3, 62.9, 62.5, 62.3. Ntrain per column: 480, 480, 261, 20, 49, 32, 45, 145, 46, 61, 53, 19, 24, 27, 34, 31, 36, 120, 52, 22. Column headers (class names, order not recovered): Motorbikes, airplanes, Faces easy, Leopards, metronome, saxophone, umbrella, menorah, windsor chair, trilobite, minaret, garfield, rooster, stapler, laptop, watch, dollar bill, grand piano, inline skate, yin yang.]

Table 2. fcYOLE Top-20 average precisions on N-Caltech101. Full table provided in the supplemental material.
[Table 2 data. AP per column (left to right): 97.5, 96.8, 92.2, 75.7, 74.4, 70.3, 69.5, 67.7, 63.4, 61.0, 60.4, 59.7, 59.5, 57.3, 57.2, 55.6, 55.1, 52.3, 48.3, 46.5. Ntrain per column: 480, 480, 261, 145, 32, 75, 61, 53, 20, 45, 36, 24, 46, 40, 120, 42, 40, 34, 33, 51. Column headers (class names, order not recovered): Motorbikes, airplanes, Faces easy, Leopards, metronome, saxophone, accordion, dragonfly, umbrella, menorah, windsor chair, minaret, buddha, soccer ball, watch, dollar bill, grand piano, yin yang, stop sign, car side.]

Table 3. Performance comparison between YOLE and fcYOLE.

                 S-MNIST-DVS              Blackboard MNIST
              fcYOLE       YOLE         fcYOLE       YOLE
              acc   mAP    acc   mAP    acc   mAP    acc   mAP
              94.0  87.4   96.1  92.0   88.5  84.7   90.4  87.4

                 OD-Poker-DVS             N-Caltech101
              fcYOLE       YOLE         fcYOLE       YOLE
              acc    mAP    acc   mAP   acc   mAP    acc   mAP
              79.10  78.69  87.3  82.2  57.1  26.9   64.9  39.8

Table 4. YOLE performance on S-N-MNIST variants.

                        S-N-MNIST
              v1     v2     v2*    v2fr   v2fr+ns
accuracy      94.9   91.7   94.7   88.6   85.5
mAP           91.3   87.9   90.5   81.5   77.4

Nevertheless, we achieved very good detection performance on the Shifted MNIST-DVS, Blackboard MNIST, and Poker-DVS datasets, which represent a more realistic scenario in terms of noise. All of these experiments were performed using the set of hyperparameters suggested by the original work [41]. However, a different choice of these parameters, namely λcoord = 25.0 and λnoobj = 0.25, worked better for us, increasing both the accuracy and the mean average precision scores (v2*).

The dataset on which the proposed model did not achieve noticeable results is N-Caltech101. This is mainly explained by the increased difficulty of the task and by the fact that the number of samples in each class is not evenly balanced. The network, indeed, usually achieves good results when the number of training samples is high, such as with Airplanes, Motorbikes and Faces easy, and in cases in which samples are very similar, e.g., inline skate (see Table 1 and the supplementary material). As the number of training samples decreases and the sample variability within the class increases, however, the performance of the model becomes worse, a behavior which explains the poor aggregate scores we report in Table 3.

Detection performance of fcYOLE. With this fully convolutional variant of the network we registered a slight decrease in performance w.r.t. the results we obtained using YOLE, as reported in Table 3 and Table 2. This gap in performance is mainly due to the fact that each region in fcYOLE generates its predictions by only looking at the visual information contained in its portion of the field of view. Indeed, if an object is only partially contained inside a region, the network has to guess the object dimensions and class by only looking at a restricted region of the surface. It should be stressed, however, that the difference in performance between the two architectures does not come from the use of the proposed event layers, whose outputs are the same as the conventional ones, but rather from the reduced expressive power caused by the absence of fully-connected layers in fcYOLE. Indeed, not removing them would have allowed us to obtain the same performance as YOLE, but with the drawback of being able to exploit event-based layers only up to the first FC layer, which has not yet been formalized in an event-based form. Removing the last fully-connected layers allowed us to design a detection network made of only event-based layers, which also uses a significantly lower number of parameters. In the supplementary materials we provide a video showing a comparison between YOLE and fcYOLE predictions.

To identify the advantages and weaknesses of the proposed event-based framework in terms of inference time, we compared our detection networks on two datasets, Shifted N-MNIST and Blackboard MNIST. We group events into batches of 10 ms and average timings over 1000 runs. On the first dataset the event-based approach achieved a 2× speedup (22.6 ms per batch), whereas on the second one it performed slightly slower (43.2 ms per batch) w.r.t. a network making use of conventional layers (34.6 ms per batch). The second benchmark is indeed challenging for our framework since changes are not localized in restricted regions. Our current implementation is not optimized to handle noisy scenes efficiently. Indeed, additional experiments showed that asynchronous CNNs are able to provide a faster prediction only up to about 80% of event sparsity (where by sparsity we mean the percentage of changed pixels in the reconstructed image). Further investigations are out of the scope of this paper and will be addressed in future works.
Figure 4. Examples of YOLE predictions on Shifted N-MNIST, Shifted MNIST-DVS, OD-Poker-DVS, N-Caltech101, and Blackboard MNIST.

5. Conclusions

We proposed two different methods, based on the YOLO architecture, to accomplish object detection with event-based cameras. The first one, namely YOLE, integrates events into a unique leaky surface. Conversely, fcYOLE relies on a more general extension of the convolutional and max pooling layers to directly deal with events and exploit their sparsity, preventing the whole network from being reprocessed. The obtained asynchronous detector dynamically adapts to the timing of the event stream by producing results only as a consequence of incoming events and by maintaining its internal state, without performing any additional computation, when no events arrive. This novel event-based framework can be used in any fully convolutional architecture to make it usable with event cameras, even conventional CNNs for classification, although in this paper it has been applied to object detection networks.

We analyzed the timing performance of this formalization, obtaining promising results. We are planning to extend our framework to automatically detect at runtime when the use of event-based layers speeds up computation (i.e., changes affect few regions of the surface) or when a complete recomputation of the feature maps is more beneficial, in order to exploit the benefits of both approaches. Nevertheless, we believe that an ad-hoc hardware implementation would allow to better exploit the advantages of the proposed method, enabling a fair timing comparison with SNNs, which are usually implemented in hardware.

Acknowledgements. We would like to thank Prophesee for helpful discussions on YOLE. The research leading to these results has received funding from project TEINVEIN: TEcnologie INnovative per i VEicoli Intelligenti, CUP (Codice Unico Progetto - Unique Project Code): E96D17000110009 - Call "Accordi per la Ricerca e l'Innovazione", cofunded by POR FESR 2014-2020 (Programma Operativo Regionale, Fondo Europeo di Sviluppo Regionale - Regional Operational Programme, European Regional Development Fund).
References

[1] L. Y. Alex Zihao Zhu. Ev-flownet: Self-supervised optical flow estimation for event-based cameras. Robotics: Science and Systems, Jan 2018. 2
[2] P. Bardow, A. J. Davison, and S. Leutenegger. Simultaneous optical flow and intensity estimation from an event camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 884–892, 2016. 1
[3] R. Berner, C. Brandli, M. Yang, S.-C. Liu, and T. Delbruck. A 240 × 180 10 mW 12 µs latency sparse-output vision sensor for mobile applications. pages C186–C187, 01 2013. 1
[4] M. Cannici, M. Ciccone, A. Romanoni, and M. Matteucci. Attention mechanisms for object recognition with event-based cameras. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1127–1136, Jan 2019. 1
[5] Y. Cao, Y. Chen, and D. Khosla. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113(1):54–66, 2015. 1
[6] N. F. Y. Chen. Pseudo-labels for Supervised Learning on Dynamic Vision Sensor Data, Applied to Object Detection under Ego-motion. arXiv, Sep 2017. 1
[7] G. K. Cohen. Event-Based Feature Detection, Recognition and Classification. PhD thesis, Université Pierre et Marie Curie - Paris VI, Sep 2016. 2
[8] P. U. Diehl, D. Neil, J. Binas, M. Cook, S. Liu, and M. Pfeiffer. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8, July 2015. 1
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vision, 88(2):303–338, Jun 2010. 6
[10] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell., 28(4):594–611, Apr 2006. 5
[11] D. Gehrig, H. Rebecq, G. Gallego, and D. Scaramuzza. Asynchronous, photometric feature tracking using events and frames. In Eur. Conf. Comput. Vis. (ECCV), 2018. 1
[12] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. PMLR, pages 249–256, Mar 2010. 6
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 1
[14] S. Kim, S. Park, B. Na, and S. Yoon. Spiking-yolo: Spiking neural network for real-time object detection. arXiv preprint arXiv:1903.06530, 2019. 1
[15] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv, Dec 2014. 6
[16] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. 1
[17] A. Krizhevsky. Learning multiple layers of features from tiny images. 04 2009. 5
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. 1
[19] X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman. HOTS: A Hierarchy of Event-Based Time-Surfaces for Pattern Recognition. IEEE Trans. Pattern Anal. Mach. Intell., 39(7), Jul 2016. 2
[20] X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman. Hots: a hierarchy of event-based time-surfaces for pattern recognition. IEEE transactions on pattern analysis and machine intelligence, 39(7):1346–1359, 2017. 1, 2
[21] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, Nov 1998. 5, 6
[22] H. Li, H. Liu, X. Ji, G. Li, and L. Shi. CIFAR10-DVS: An Event-Stream Dataset for Object Classification. Front. Neurosci., 11:309, May 2017. 5
[23] J. Li, F. Shi, W. Liu, D. Zou, Q. Wang, H. Lee, P.-K. Park, and H. E. Ryu. Adaptive temporal pooling for object detection using dynamic vision sensor. British Machine Vision Conference (BMVC), 2017. 1
[24] M. Liu and T. Delbruck. Adaptive time-slice block-matching optical flow algorithm for dynamic vision sensors. Technical report, 2018. 1
[25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016. 1
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015. 1
[27] W. Maass. Networks of spiking neurons: the third generation of neural network models. Neural networks, 10(9):1659–1671, 1997. 1
[28] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza. Event-based vision meets deep learning on steering prediction for self-driving cars. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 2
[29] A. Mitrokhin, C. Fermuller, C. Parameshwara, and Y. Aloimonos. Event-based moving object detection and tracking. arXiv preprint arXiv:1803.04523, 2018. 1
[30] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014. 6
[31] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5115–5124, 2017. 1
[32] E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, and D. Scaramuzza. The Event-Camera Dataset and Simulator: Event-based Data for Pose Estimation, Visual Odometry, and SLAM. arXiv, Oct 2016. 1, 6
[33] D. Neil, M. Pfeiffer, and S.-C. Liu. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pages 3882–3890, 2016. 1
[34] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor. Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades. Front. Neurosci., 9, Nov 2015. 5, 6
[35] J. A. Pérez-Carrasco, B. Zhao, C. Serrano, B. Acha, T. Serrano-Gotarredona, S. Chen, and B. Linares-Barranco. Mapping from frame-driven to frame-free event-driven vision systems by low-rate rate coding and coincidence processing–application to feedforward ConvNets. IEEE Trans. Pattern Anal. Mach. Intell., 35(11):2706–2719, Nov 2013. 1, 2
[36] C. Posch, D. Matolin, and R. Wohlgenannt. A QVGA 143 dB Dynamic Range Frame-Free PWM Image Sensor With Lossless Pixel-Level Video Compression and Time-Domain CDS. IEEE J. Solid-State Circuits, 46(1):259–275, Jan 2011. 1
[37] A. Raj, D. Maturana, and S. Scherer. Multi-scale convolutional architecture for semantic segmentation. page 14, 01 2015. 1
[38] B. Ramesh, H. Yang, G. Orchard, N. A. L. Thi, and C. Xiang. DART: Distribution Aware Retinal Transform for Event-based Cameras. arXiv, Oct 2017. 1
[39] B. Ramesh, S. Zhang, Z. W. Lee, Z. Gao, G. Orchard, and C. Xiang. Long-term object tracking with a moving event camera. 2018. 1
[40] H. Rebecq, T. Horstschaefer, G. Gallego, and D. Scaramuzza. Evo: A geometric approach to event-based 6-dof parallel tracking and mapping in real time. IEEE Robotics and Automation Letters, 2(2):593–600, 2017. 1
[41] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016. 1, 2, 6, 7
[42] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015. 1
[43] T. Serrano-Gotarredona and B. Linares-Barranco. A 128 × 128 1.5% Contrast Sensitivity 0.9% FPN 3 µs Latency 4 mW Asynchronous Frame-Free Dynamic Vision Sensor Using Transimpedance Preamplifiers. IEEE J. Solid-State Circuits, 48(3):827–838, Mar 2013. 1
[44] T. Serrano-Gotarredona and B. Linares-Barranco. Poker-DVS and MNIST-DVS. Their History, How They Were Made, and Other Details. Front. Neurosci., 9, Dec 2015. 5
[45] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 1, 6
[46] A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce, and R. Benosman. Hats: Histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1731–1740, 2018. 1, 2
[47] T. Stoffregen and L. Kleeman. Simultaneous optical flow and segmentation (sofas) using dynamic vision sensor. arXiv preprint arXiv:1805.12326, 2018. 1
[48] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017. 1
[49] F. Yu and V. Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. In International Conference on Learning Representations (ICLR), 2016. 1
Asynchronous Convolutional Networks for Object Detection
in Neuromorphic Cameras
Supplementary material

Marco Cannici Marco Ciccone Andrea Romanoni Matteo Matteucci


Politecnico di Milano, Italy
{marco.cannici,marco.ciccone,andrea.romanoni,matteo.matteucci}@polimi.it

In this document we describe the novel event-based datasets adopted in the paper "Asynchronous Convolutional Networks for Object Detection in Neuromorphic Cameras".

1. Event-based object detection datasets

Due to the lack of object detection datasets for event cameras, we extended the publicly available N-MNIST, MNIST-DVS, and Poker-DVS, and we propose a novel dataset based on MNIST, i.e., Blackboard MNIST. They will be released soon; in Figure 1 we report some examples from the four datasets.

1.1. Shifted N-MNIST

The original N-MNIST [5] extends the well-known MNIST [3]: it provides an event-based representation of both the full training set (60,000 samples) and the full testing set (10,000 samples) to evaluate object classification algorithms. The dataset has been recorded by means of an event camera placed in front of an LCD screen and moved to detect static MNIST digits displayed on the monitor. For further details we refer the reader to [5].

Starting from the N-MNIST dataset, we built a more complex set of recordings that we used to train the object detection network to detect multiple objects in the same scene. We created two versions of the dataset, Shifted N-MNIST v1 and Shifted N-MNIST v2, which contain respectively one or two non-overlapping 34 × 34 N-MNIST digits per sample, randomly positioned on a bigger surface. We used different surface dimensions in our tests, which vary from double the original size, 68 × 68, up to 124 × 124. The dimension and structure of the resulting dataset are the same as those of the original N-MNIST collection.

To extend the dataset for object detection evaluation, ground truth bounding boxes are required. To estimate them, we first integrate events into a single frame as described in Section 2 of the original paper. We remove the noise by considering only non-zero pixels having at least ρ other non-zero pixels around them within a circle of radius R; all the other pixels are considered noise. Then, with a custom version of the DBSCAN [2] density-based clustering algorithm, we group pixels into a single cluster. A threshold minarea is used to filter out small bounding boxes extracted in correspondence of low event activity. This condition usually happens during the transition from one saccade to the next, as the camera remains still for a small fraction of time and no events are generated. We used ρ = 3, R = 2 and minarea = 10. The coordinates of these bounding boxes are then shifted based on the final position the digit has in the bigger field of view.
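A rough sketch of this bounding-box extraction step is shown below; it uses the standard scikit-learn DBSCAN in place of the custom variant mentioned above, so parameters such as eps and min_samples are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from scipy.ndimage import convolve
from sklearn.cluster import DBSCAN

def extract_boxes(surface, rho=3, radius=2, min_area=10):
    """Estimate digit bounding boxes from an integrated event frame."""
    active = surface > 0
    # Noise filter: keep active pixels with at least `rho` active neighbours
    # inside a circle of radius `radius`.
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    kernel = ((xx ** 2 + yy ** 2) <= radius ** 2).astype(np.float32)
    kernel[radius, radius] = 0.0
    neighbours = convolve(active.astype(np.float32), kernel, mode='constant')
    clean = active & (neighbours >= rho)
    pts = np.argwhere(clean)              # (row, col) coordinates of kept pixels
    if len(pts) == 0:
        return []
    # Cluster the remaining pixels; the paper uses a custom DBSCAN variant,
    # here the scikit-learn implementation stands in for it.
    labels = DBSCAN(eps=2.0, min_samples=3).fit_predict(pts)
    boxes = []
    for lbl in set(labels) - {-1}:        # -1 marks DBSCAN noise points
        cluster = pts[labels == lbl]
        (y0, x0), (y1, x1) = cluster.min(axis=0), cluster.max(axis=0)
        if (y1 - y0 + 1) * (x1 - x0 + 1) >= min_area:
            boxes.append((x0, y0, x1 - x0 + 1, y1 - y0 + 1))   # (x, y, w, h)
    return boxes
```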
For each N-MNIST sample, another digit was randomly selected in the same portion of the dataset (training, testing, or validation) to form a new example. The final dataset contains 60,000 training samples and 10,000 testing samples, as for the original N-MNIST dataset. In Figure 2 we illustrate one example for v1 and the three variants of v2 we adopted (and described) in the paper.

1.2. Shifted MNIST-DVS

The MNIST-DVS dataset [6] is another collection of event-based recordings that extends MNIST [3]. It consists of 30,000 samples recorded by displaying digits on a screen in front of an event camera; differently from N-MNIST, the digits are moved on the screen instead of the sensor, and they are shown at three different scales, i.e., scale4, scale8 and scale16. The resulting dataset is composed of 30,000 event-based recordings showing each one of the selected 10,000 MNIST digits at three different dimensions. Examples of these recordings are shown in Figure 3.

Figure 1: Examples of samples from the proposed datasets (Shifted N-MNIST, Shifted MNIST-DVS, OD-Poker-DVS, Blackboard MNIST).


Figure 2: Different versions of Shifted N-MNIST (v1, v2, v2fr, v2fr+ns).

Figure 3: Examples of the three different scales of MNIST-DVS digits. Two samples at scale4, two at scale8 and two at scale16.

We used MNIST-DVS recordings to build a detection dataset by means of a procedure similar to the one we used to create the Shifted N-MNIST dataset; however, in this case we mix together digits of multiple scales. All the MNIST-DVS samples, regardless of the actual dimensions of the digits being recorded, are contained within a fixed 128 × 128 field of view. Digits are placed centered inside the scene and occupy a limited portion of the frame, especially those belonging to the smallest and middle scales. In order to place multiple examples on the same scene, we first cropped the three scales of samples into smaller recordings occupying 35 × 35, 65 × 65 and 105 × 105 spatial regions, respectively. The bounding box annotations and the final examples were obtained by means of the same procedure we used to construct the Shifted N-MNIST dataset. These recordings were built by mixing digits of different dimensions in the same sample. Based on the original sample dimensions, we decided to use the following four configurations (which specify the number of samples of each category used to build a single Shifted MNIST-DVS example): (i) three scale4 digits, (ii) two scale8 digits, (iii) two scale4 digits mixed with one scale8 digit, (iv) one scale16 digit, placed in random locations of the field of view. The overall dataset is composed of 30,000 samples containing these four possible configurations.

1.3. OD-Poker-DVS

The original Poker-DVS [6] has been proposed to test object recognition algorithms; it is a small collection of neuromorphic recordings obtained by quickly browsing custom-made poker card decks in front of a DVS camera for 2-4 seconds. The dataset is composed of 131 samples containing centered pips of the four possible categories (spades, hearts, diamonds or clubs) extracted from three deck recordings. Single pips were extracted by means of an event-based tracking algorithm which was used to follow symbols inside the scene and to extract 31 × 31 pixel examples.

With OD-Poker-DVS we extend its scope to also test object detection. To do so, we used the event-based tracking algorithm provided with the original dataset to follow the movement of the 31 × 31 samples in the uncut recordings and extract their bounding boxes. The final dataset was obtained using a procedure similar to the one used in [7]. Indeed, we divided the sections of the three original deck recordings containing visible digits into a set of shorter examples, each of which is about 1.5 ms long. Examples were split in order to ensure approximately the same number of objects (i.e., ground truth bounding boxes) in each example. The final detection dataset is composed of 292 small examples which we divided into 218 training and 74 testing samples.

Even if composed of a limited amount of samples, this dataset represents an interesting real-world application that highlights the potential of event-based vision sensors. The nature of the data acquisition is indeed particularly well suited to neuromorphic cameras due to their very high temporal resolution. Symbols are clearly visible inside the recordings even if they move at very high speed. Each pip, indeed, takes from 10 to 30 ms to cross the screen, but it can be easily recognized within the first 1-2 ms.
be easily recognized within the first 1-2 ms. scene to generate neuromorphic recordings. Every time a
frame is rendered during the simulation, the bounding boxes
1.4. Blackboard MNIST of all the visible digits inside the frame are also extracted.
The two dataset based on MNIST presented in Section This operation is performed by computing the camera space
1.1 and 1.2 have the drawback of recording digits at prede- coordinates (or normalized device coordinates) of the top-
fined sizes. Therefore, in Blackboard MNIST we propose left and bottom-right vertex of all the images inside the field
a more challenging scenario that consists of a number of of view. Since images are slightly larger than the actual dig-
samples showing digits (from the original MNIST dataset) its they contain, we cropped every bounding box to better
written on a blackboard in random positions and with dif- enclose each digit and also to compensate the small offset
ferent scales. in the pixels position introduced by the camera motion and
by the linear interpolation mechanism. In addition, bound-
We used the DAVIS simulator released by [4] to build our own set of synthetic recordings. Given a three-dimensional virtual scene and the trajectory of a moving camera within it, the simulator is able to generate a stream of events describing the visual information captured by the virtual camera. The system uses Blender [1], an open-source 3D modeling tool, to generate thousands of rendered frames along a predefined camera trajectory, which are then used to reconstruct the corresponding neuromorphic recording. The intensity value of each single pixel inside the sequence of rendered frames, captured at a constant frame rate, is tracked. As Figure 4a shows, an event is generated whenever the log-intensity of a pixel crosses an intensity threshold, as in a real event-based camera. A piecewise linear time interpolation mechanism is used to determine brightness changes in the time between frames in order to simulate the microsecond timestamp resolution of a real sensor. We extended the simulator to output bounding box annotations associated to every visible digit.
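A minimal sketch of this event generation scheme, assuming grayscale frames given as NumPy arrays; the contrast threshold C, the function name and the handling of edge cases are illustrative simplifications and not the exact implementation of the simulator in [4].

import numpy as np

def events_from_frames(frames, timestamps, C=0.15, eps=1e-6):
    """Emit (x, y, t, polarity) events by tracking per-pixel log-intensity and
    linearly interpolating the threshold-crossing times between consecutive frames."""
    log_ref = np.log(frames[0].astype(np.float64) + eps)   # log-intensity at each pixel's last event
    prev_log = log_ref.copy()
    events = []
    for k in range(1, len(frames)):
        cur_log = np.log(frames[k].astype(np.float64) + eps)
        dt = timestamps[k] - timestamps[k - 1]
        delta = cur_log - log_ref
        ys, xs = np.nonzero(np.abs(delta) >= C)             # pixels that crossed at least one threshold
        for y, x in zip(ys, xs):
            pol = 1 if delta[y, x] > 0 else -1
            slope = (cur_log[y, x] - prev_log[y, x]) / dt
            n_cross = int(abs(delta[y, x]) // C)
            for i in range(1, n_cross + 1):
                level = log_ref[y, x] + pol * i * C          # threshold level being crossed
                t = timestamps[k] if slope == 0 else timestamps[k - 1] + (level - prev_log[y, x]) / slope
                t = float(np.clip(t, timestamps[k - 1], timestamps[k]))
                events.append((int(x), int(y), t, pol))
            log_ref[y, x] += pol * n_cross * C               # new reference after the emitted events
        prev_log = cur_log
    events.sort(key=lambda e: e[2])
    return events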
We used Blender APIs to place MNIST digits in random locations of a blackboard and to record their position with respect to the camera point of view. Original MNIST images depict black handwritten digits on a white background. To mimic chalk on the blackboard, we removed the background, we turned the digits white, and we roughened their contours to make them look as if they were written with chalk. An example is shown in Figure 4b.
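A sketch of this preprocessing step; the function name, the noise level and the use of an alpha channel to remove the background are illustrative assumptions rather than the exact pipeline used for the dataset.

import numpy as np

def chalkify(digit, noise=0.35, seed=None):
    """Convert a grayscale MNIST digit (dark ink on a light background) into a
    white, chalk-like glyph on a transparent background (RGBA uint8)."""
    rng = np.random.default_rng(seed)
    ink = 1.0 - digit.astype(np.float64) / 255.0             # 1 on the stroke, 0 on the background
    # Roughen the contour: randomly drop stroke pixels whose neighbourhood is only partially inked.
    pad = np.pad(ink, 1)
    neigh = sum(pad[1 + dy:pad.shape[0] - 1 + dy, 1 + dx:pad.shape[1] - 1 + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
    border = (ink > 0.1) & (neigh < 0.8)
    drop = border & (rng.random(ink.shape) < noise)
    ink[drop] = 0.0
    rgba = np.zeros(ink.shape + (4,), dtype=np.uint8)
    rgba[..., :3] = 255                                       # white chalk colour
    rgba[..., 3] = (255 * ink).astype(np.uint8)               # alpha encodes the stroke
    return rgba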
The scene contains the image of a blackboard on a vertical plane and a virtual camera with 128 × 128 resolution that moves horizontally on a predefined trajectory parallel to the blackboard plane (see Figure 5). The camera points at a hidden object that moves on the blackboard surface, synchronized with the camera, following a given trajectory. To introduce variability in the camera movement, and to allow all the digits' outlines to be seen (and possibly detected), we used different trajectories that vary from a straight path to a smooth or triangular undulating path that makes the camera tilt along the transverse axis while moving (Figure 5b).
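The sketch below illustrates the three kinds of focus trajectories as a transverse offset applied to the focus object while the camera advances; the amplitude, number of periods and function name are illustrative choices.

import math

def focus_offset(kind, s, amplitude=0.1, periods=2):
    """Offset of the hidden focus object, transverse to the camera path, as a
    function of the camera progress s in [0, 1]."""
    if kind == "straight":
        return 0.0
    if kind == "smooth":
        return amplitude * math.sin(2.0 * math.pi * periods * s)
    if kind == "triangular":
        phase = (periods * s) % 1.0                  # triangle wave between -1 and 1
        return amplitude * (4.0 * abs(phase - 0.5) - 1.0)
    raise ValueError("unknown trajectory kind: %s" % kind)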
Before starting the simulation, we randomly select a number of preprocessed MNIST digits and place them in a random portion of the blackboard. The camera moves so that all the digits will be framed during the camera movement. The simulation is then finally started on this modified scene to generate neuromorphic recordings. Every time a frame is rendered during the simulation, the bounding boxes of all the visible digits inside the frame are also extracted. This operation is performed by computing the camera space coordinates (or normalized device coordinates) of the top-left and bottom-right vertices of all the images inside the field of view. Since images are slightly larger than the actual digits they contain, we cropped every bounding box to better enclose each digit and also to compensate for the small offset in the pixel positions introduced by the camera motion and by the linear interpolation mechanism. In addition, bounding boxes corresponding to objects which are only partially visible are also filtered out. In order to build the final detection dataset, this generation process is executed multiple times, each time with different digits.
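A sketch of this bounding box extraction using Blender's Python API; the shrink factor used to crop the boxes and the visibility test are illustrative, and the paper does not state which API calls were actually used. The sketch relies on Blender's bpy_extras.object_utils.world_to_camera_view helper (in Blender 2.7x the matrix product is written with * instead of @).

from mathutils import Vector
from bpy_extras.object_utils import world_to_camera_view

def digit_bbox(scene, camera, digit_obj, resolution=128, shrink=0.1):
    """Project the corners of a digit plane into normalized camera coordinates and
    return a pixel bounding box, or None if the digit is only partially visible."""
    xs, ys = [], []
    for corner in digit_obj.bound_box:                     # eight corners in local coordinates
        world = digit_obj.matrix_world @ Vector(corner)
        ndc = world_to_camera_view(scene, camera, world)   # x, y in [0, 1] inside the camera view
        xs.append(ndc.x)
        ys.append(ndc.y)
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    if x0 < 0.0 or y0 < 0.0 or x1 > 1.0 or y1 > 1.0:       # filter partially visible digits
        return None
    dx, dy = (x1 - x0) * shrink / 2.0, (y1 - y0) * shrink / 2.0
    x0, x1, y0, y1 = x0 + dx, x1 - dx, y0 + dy, y1 - dy    # crop towards the digit itself
    # Blender's view coordinates start at the bottom-left corner, image pixels at the top-left.
    return (int(x0 * resolution), int((1.0 - y1) * resolution),
            int(x1 * resolution), int((1.0 - y0) * resolution))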
We built three sub-collections of recordings with increasing level of complexity, which we merged together to obtain our final dataset: Blackboard MNIST EASY, Blackboard MNIST MEDIUM, Blackboard MNIST HARD. In Blackboard MNIST EASY, we used digits of only one dimension (roughly corresponding to the middle scale of MNIST-DVS samples) and a single type of camera trajectory, which moves the camera from right to left with the focus object moving in a straight line. In addition, only three objects were placed on the blackboard, using only a fixed portion of its surface. We collected a total of 1,200 samples (1,000 training, 100 testing, 100 validation).

Blackboard MNIST MEDIUM features more variability in the number and dimensions of the digits and in the types of camera movements. Moreover, the portion of the blackboard on which digits were added varies and may cover any region of the blackboard, even those near its edges. The camera motions were also extended to the set of all possible trajectories that combine either left-to-right or right-to-left movements with variable paths of the focus object. We used three types of trajectories for this object: a straight line, a triangular path, or a smooth curved trajectory, all parallel to the camera trajectory and placed around the position of the digits on the blackboard. One of these paths was selected randomly for every generated sample. Triangular and curved trajectories were introduced as we noticed that sudden movements of the camera produce bursts of events that we wanted our detection system to be able to handle. The number and dimensions of the digits were chosen following three possible configurations, similarly to the Shifted MNIST-DVS dataset: either six small digits (with sizes comparable to scale4 MNIST-DVS digits), three intermediate-size digits (comparable to the MNIST-DVS scale8), or two big digits (comparable to the biggest scale of the MNIST-DVS dataset, scale16). A set of 1,200 recordings was generated using the same splits as the first variant and with an equal amount of samples in each one of the three configurations.
Figure 4: (a) The image shows in black the intensity, expressed as log I_u(t), of a single pixel u = (x, y). This curve is sampled at a constant rate when frames are generated by Blender, shown in the figure as vertical blue lines. The sampled values thus obtained (blue circles) are used to approximate the pixel intensity by means of a simple piecewise linear time interpolation (red line). Whenever this curve crosses one of the threshold values (horizontal dashed lines), a new event is generated with the corresponding predicted timestamp. (Figure from [4]) (b) A preprocessed MNIST digit on top of the blackboard's background.


Figure 5: (a) The 3D scene used to generate the Blackboard MNIST dataset. The camera moves in front of the blackboard
along a straight trajectory while following a focus object that moves on the blackboard’s surface, synchronized with the
camera. The camera and its trajectory are depicted in green, the focus object is represented as a red cross and, finally, its
trajectory is depicted as a yellow line. (b) The three types of focus trajectories.

Finally, Blackboard MNIST HARD contains digits recorded by using the second and third configurations of objects we described previously. However, in this case each image was resized to a variable size spanning from the original configuration size down to the previous scale. A total of 600 new samples (500 training, 50 testing, 50 validation) were generated, 300 of them containing three digits each and the remaining 300 consisting of two digits with variable size.

The three collections can be used individually or jointly; the whole Blackboard MNIST dataset contains 3,000 samples in total (2,500 training, 250 testing, 250 validation). Examples of different object configurations are shown in Figure 6. Samples were saved by means of the AEDAT v3.1 file format for event-based recordings.

2. Results

Table 1 provides a comparison between the average precision of YOLE and fcYOLE on N-Caltech101 classes. We also provide a qualitative comparison between the two models.
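For reference, the sketch below shows a generic per-class average precision computation of the kind summarized in Table 1 (all-point interpolation, PASCAL VOC style); the exact matching rule and interpolation used for the table are not specified in this excerpt.

import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """AP for one class from detection scores and a true/false positive flag per
    detection, integrating precision over recall with all-point interpolation."""
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    tp = np.asarray(is_true_positive, dtype=np.float64)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_ground_truth, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-12)
    for i in range(len(precision) - 2, -1, -1):            # make precision monotonically decreasing
        precision[i] = max(precision[i], precision[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)
        prev_recall = r
    return float(ap)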
Figure 6: Examples of the three types of object configurations used to generate the second collection of the Blackboard MNIST dataset.

Table 1: YOLE and fcYOLE average precisions on N-Caltech101

grand piano
Motorbikes

Faces easy

chandelier

helicopter
hawksbill
Leopards

crocodile
airplanes

flamingo

menorah
butterfly
car side

starfish
bonsai
watch

ketch

brain
chair

ant
AP fcYOLE 97.5 96.8 92.2 75.7 57.2 7.5 30.2 70.3 42.3 2.3 2.4 34.8 0.0 69.5 35.3 19.6 33.5 8.6 67.7 23.2
AP YOLE 97.8 95.8 94.7 84.2 62.9 17.3 59.3 61.7 52.9 10.0 25.8 55.7 1.6 81.3 53.3 29.1 46.3 14.9 80.7 32.7
Ntrain 480 480 261 145 120 109 78 75 70 68 66 65 61 61 60 60 55 54 53 52

electric guitar

cougar face

euphonium
dalmatian
sunflower

dragonfly
kangaroo

umbrella
scorpion

revolver
trilobite

crayfish
minaret
buddha

laptop

llama

ferry
ewer

crab
ibis

AP fcYOLE 2.5 3.4 41.2 29.1 46.5 35.3 20.3 40.0 1.4 1.5 59.5 61.0 5.0 23.2 21.8 55.6 7.3 24.9 29.7 43.5
AP YOLE 6.9 5.0 62.5 43.3 57.2 51.3 57.4 88.1 10.2 6.5 81.3 85.9 19.5 29.7 39.8 59.9 9.5 33.0 34.0 53.6
Ntrain 52 52 52 51 51 51 50 49 48 48 46 45 45 45 43 42 42 41 41 40

windsor chair
stegosaurus
joshua tree

soccer ball

wheelchair

cellphone

hedgehog
sea horse
stop sign

schooner

yin yang
elephant

pyramid

nautilus
dolphin

rhino
lamp
lotus

bass
cup

AP fcYOLE 18.1 55.1 30.2 57.3 11.4 37.3 6.5 17.7 34.5 5.8 25.8 46.5 60.4 5.6 1.9 41.1 52.3 13.3 4.2 13.0
AP YOLE 27.6 61.7 29.8 51.5 6.0 56.8 11.5 45.3 44.6 11.3 25.3 54.6 63.3 17.5 8.8 48.6 65.2 9.8 4.0 50.4
Ntrain 40 40 40 40 40 39 39 37 37 37 37 37 36 35 35 35 34 34 34 33
crocodile head

flamingo head

brontosaurus
cougar body
gramophone

ceiling fan
dollar bill
accordion

mandolin
camera

pagoda

cannon
rooster

pigeon
stapler
beaver
barrel
pizza

emu

tick

AP fcYOLE 17.4 5.6 48.3 74.4 28.2 9.8 30.6 26.6 15.1 34.0 0.0 30.7 40.8 0.5 0.0 6.6 2.7 0.2 46.0 6.2
AP YOLE 54.5 5.0 52.7 86.5 55.2 10.0 48.4 64.5 32.7 33.3 12.2 41.2 50.1 0.0 0.3 43.4 9.3 25.8 68.1 43.1
Ntrain 33 33 33 32 31 31 31 31 30 29 29 29 29 28 27 27 27 27 27 27
inline skate
metronome
headphone

saxophone

strawberry
water lilly

binocular
platypus
wild cat

gerenuk
scissors

octopus
garfield
wrench

snoopy
mayfly
anchor

lobster

panda
okapi

AP fcYOLE 10.7 14.6 28.2 14.7 10.0 0.0 4.7 59.7 0.5 3.1 23.4 0.0 4.8 13.1 0.4 14.0 0.5 8.3 63.4 37.2
AP YOLE 21.1 17.3 47.5 29.7 44.6 0.0 12.2 68.4 0.7 14.7 62.3 0.0 7.2 34.7 11.8 13.8 29.4 53.1 88.3 75.1
Ntrain 26 26 25 25 25 25 24 24 24 23 22 22 22 22 21 21 21 21 20 19

References

[1] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam, 2017. 4
[2] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pages 226–231. AAAI Press, 1996. 1
[3] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-
based learning applied to document recognition. Proc. IEEE,
86(11):2278–2324, Nov 1998. 1
[4] E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, and
D. Scaramuzza. The Event-Camera Dataset and Simulator:
Event-based Data for Pose Estimation, Visual Odometry, and
SLAM. arXiv, Oct 2016. 4, 5
[5] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor.
Converting Static Image Datasets to Spiking Neuromorphic
Datasets Using Saccades. Front. Neurosci., 9, Nov 2015. 1
[6] T. Serrano-Gotarredona and B. Linares-Barranco. Poker-DVS
and MNIST-DVS. Their History, How They Were Made, and
Other Details. Front. Neurosci., 9, Dec 2015. 1, 3
[7] E. Stromatias, M. Soto, T. Serrano-Gotarredona, and
B. Linares-Barranco. An Event-Driven Classifier for Spik-
ing Neural Networks Fed with Synthetic or Dynamic Vision
Sensor Data. Front. Neurosci., 11, Jun 2017. 3
