2019CVPRW Asynchronous Convolutional Networks For Object Detection in Neuromorphic Cameras
computation is only performed when a sequence of events arrives and only where previous results need to be recomputed.

In Section 3 the convolution and max-pooling operations are reformulated by adding an internal state, i.e., a memory of the previous prediction, that allows us to sparsely recompute feature maps. An asynchronous fully-convolutional network for event-based object detection which exploits this formulation is finally described in Section 3.4.

2. Background

Leaky Surface. The basic component of the proposed architectures is a procedure able to accumulate events. Sparse events generated by the neuromorphic camera are integrated into a leaky surface, a structure that takes inspiration from the functioning of Spiking Neural Networks (SNNs) to maintain memory of past events. A similar mechanism has already been proposed in [7]. Every time an event with coordinates (x_e, y_e) and timestamp ts_t is received, the corresponding pixel of the surface is incremented by a fixed amount ∆incr. At the same time, the whole surface is decremented by a quantity which depends on the time elapsed between the last received event and the previous one. The described procedure can be formalized by the following equations:

q^t_{x_s, y_s} = \max\left( p^{t-1}_{x_s, y_s} - \lambda \cdot \Delta t_s,\; 0 \right)    (1)

p^t_{x_s, y_s} = \begin{cases} q^t_{x_s, y_s} + \Delta_{incr} & \text{if } (x_s, y_s)^t = (x_e, y_e)^t \\ q^t_{x_s, y_s} & \text{otherwise,} \end{cases}    (2)

where p^t_{x_s, y_s} is the value of the surface pixel in position (x_s, y_s) of the leaky surface and ∆t_s = ts_t − ts_{t−1}. To improve readability in the following equations, we name the quantity (ts_t − ts_{t−1}) · λ as ∆leak. Notice that the effects of λ and ∆incr are related: ∆incr determines how much information is contained in each single event, whereas λ defines the decay rate of activations. Given a certain choice of these parameters, similar results can be obtained by using, for instance, a higher increment ∆incr together with a higher decay rate λ. For this reason, we fix ∆incr = 1 and vary only λ based on the dataset to be processed. Pixel values are prevented from becoming negative by means of the max operation.

Other frame-integration procedures, such as the one in [35], divide time into predefined constant intervals. Frames are obtained by setting each pixel to a binary value (depending on the polarity) if at least an event has been received at that pixel within the integration interval. With this mechanism, however, time resolution is lost and the same importance is given to each event, even if it represents noise. The adopted method, instead, performs continuous and incremental integration and is able to better handle noise.
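For reference, the surface update of Equations (1) and (2) can be sketched in a few lines of NumPy. This is only an illustrative implementation; the class and variable names (LeakySurface, lam, delta_incr) are ours and not part of the paper:

    import numpy as np

    class LeakySurface:
        """Accumulate events into a decaying surface (Eqs. (1) and (2))."""

        def __init__(self, height, width, lam, delta_incr=1.0):
            self.p = np.zeros((height, width), dtype=np.float32)  # surface state p
            self.lam = lam                # decay rate lambda
            self.delta_incr = delta_incr  # increment per event (fixed to 1 in the paper)
            self.last_ts = 0.0            # timestamp of the previously received event

        def update(self, x_e, y_e, ts):
            delta_leak = (ts - self.last_ts) * self.lam
            # Eq. (1): decay the whole surface, clamping negative values to zero.
            self.p = np.maximum(self.p - delta_leak, 0.0)
            # Eq. (2): increment only the pixel hit by the event.
            self.p[y_e, x_e] += self.delta_incr
            self.last_ts = ts
            return self.p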
Similar procedures capable of maintaining time resolution have also been proposed, such as those that make use of exponential decays [7, 19] to update surfaces, and those relying on histograms of events [28]. Recently, the concept of time surface has also been introduced in [20], where surfaces are obtained by associating each event with temporal features computed by applying exponential kernels to the event neighborhood. Extensions of this procedure making use of memory cells [46] and event histograms [1] have also been proposed. Although these event representations better describe complex scene dynamics, we make use of a simpler formulation to derive a linear dependence between consecutive surfaces. This allows us to design the event-based framework discussed in Section 3, in which time decay is applied to every layer of the network.

Event-based Object Detection. We identified YOLO [41] as a good candidate model to tackle the object detection problem in event-based scenarios: it is fully-differentiable and produces predictions with small input-output delays. By means of a standard CNN and with a single forward pass, YOLO is able to simultaneously predict not only the class, but also the position and dimension of every object in the scene. We used the YOLO loss and the previous leaky surface procedure to train a baseline model which we called YOLE: "You Only Look at Events". The architecture is depicted in Figure 1. We use this model as a reference to highlight the strengths and weaknesses of the framework described in Section 3, which is the main contribution of this work. YOLE processes 128 × 128 surfaces; it predicts B = 2 bounding boxes for each region and classifies objects into C different categories.

Note that in this context we use the term YOLO to refer only to the training procedure proposed by [41] and not to the specific network architecture. We indeed used a simpler structure for our models, as explained in Section 4. Nevertheless, YOLE, i.e., YOLO + leaky surface, does not exploit the sparse nature of events; to address this issue, in the next section, we propose a fully event-based asynchronous framework for convolutional networks.
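To make the output parameterization concrete, the following sketch decodes a 4 × 4 × (C + 5B) prediction grid into boxes. Only the 20 = C + 5B parameterization is stated in the paper; the per-cell ordering (B boxes of (x, y, w, h, confidence) followed by C class scores) and the coordinate conventions below are assumptions borrowed from YOLO:

    import numpy as np

    def decode_predictions(grid, img_size=128, S=4, B=2, C=10, conf_thresh=0.5):
        """grid: (S, S, 5*B + C) network output -> list of ((x, y, w, h), class_id, score)."""
        cell = img_size / S
        detections = []
        for i in range(S):           # grid row
            for j in range(S):       # grid column
                vals = grid[i, j]
                boxes, classes = vals[:5 * B].reshape(B, 5), vals[5 * B:]
                class_id = int(np.argmax(classes))
                for x, y, w, h, conf in boxes:
                    score = conf * classes[class_id]
                    if score < conf_thresh:
                        continue
                    # (x, y): offsets inside the region; (w, h): relative to the image.
                    cx, cy = (j + x) * cell, (i + y) * cell
                    bw, bh = w * img_size, h * img_size
                    detections.append(((cx - bw / 2, cy - bh / 2, bw, bh), class_id, score))
        return detections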
Figure 1. The YOLE detection network based on YOLO used to detect MNIST-DVS digits. The input surfaces are divided into a grid of 4 × 4 regions which predict 2 bounding boxes each. [Architecture depicted in the figure: input events <x, y, ts, p> are accumulated by the integrator into a 128 × 128 surface, processed by five convolutional layers (5x5x16, 5x5x32, 5x5x32, 5x5x32, 5x5x32), each followed by a 2x2 max-pooling layer, and by three fully connected layers; the output encodes, for each of the 4 × 4 regions, 20 = C + 5B values, with B = 2 bounding boxes per region and C = 10 classes.]
3. Event-based Fully Convolutional Networks

Conventional CNNs for video analysis treat every frame independently and recompute all the feature maps entirely, even if consecutive frames differ from each other only in small portions. Besides being a significant waste of power and computation, this approach does not match the nature of event-based cameras.

To exploit the event-based nature of neuromorphic vision, we propose a modification of the forward pass of fully convolutional architectures. In the following, the convolution and pooling operations are reformulated to produce the final prediction by recomputing only the features corresponding to regions affected by the events. Feature maps maintain their state over time and are updated only as a consequence of incoming events. At the same time, the leaking mechanism that allows past information to be forgotten acts independently on each layer of the CNN. This enables features computed in the past to fade away as their visual information starts to disappear in the input surface. The result is an asynchronous CNN able to perform computation only when requested and at different rates. The network can indeed be used to produce an output only when new events arrive, dynamically adapting to the timings of the input, or to produce results at regular rates by using the leaking mechanism to update layers in the absence of new events.

The proposed framework has been developed to extend the YOLE detection network presented in Section 2. Nevertheless, this method can be applied to any convolutional architecture to perform asynchronous computation. A CNN trained to process frames reconstructed from streams of events can indeed be easily converted into an event-based CNN without any modification of its layer composition, and by using the same weights learned while observing frames, maintaining its output unchanged.
3.1. Leaky Surface Layer

The procedure used to compute the leaky surface described in Section 2 is embedded into an actual layer of the proposed framework. Furthermore, to allow subsequent layers to locate changes inside the surface, the following information is also forwarded to the next layer: (i) the list of incoming events; (ii) ∆leak, which is sent to all the subsequent layers to update feature maps not affected by the events; (iii) the list of surface pixels which have been reset to 0 by the max operator in Equation (1).
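A minimal sketch of this inter-layer message, written as a plain data structure (the type and field names are illustrative, not from the paper):

    from typing import List, NamedTuple, Tuple

    class SurfaceUpdate(NamedTuple):
        """Information forwarded by the leaky surface layer together with the surface."""
        events: List[Tuple[int, int]]         # (i) coordinates that received new events
        delta_leak: float                     # (ii) decay applied to unaffected locations
        reset_pixels: List[Tuple[int, int]]   # (iii) pixels clamped to 0 by the max in Eq. (1)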
3.2. Event-based Convolutional Layer (e-conv)

The event-based convolutional (e-conv) layer we propose uses events to determine where the input feature map has changed with respect to the previous time step and, therefore, which parts of its internal state, i.e., the feature map computed at the previous time step, must be recomputed and which parts can be reused. We use a procedure similar to the one described in the previous section to let features decay over time. However, due to the transformations applied by previous layers and the composition of their activation functions, ∆leak may act differently in different parts of the feature map. For instance, the decrease of a pixel intensity value in the input surface may cause the value computed by a certain feature in a deeper layer to decrease, but it could also cause another feature of the same layer to increase. The update procedure, therefore, must also be able to accurately determine how a single bit of information is transformed by the network through all the previous layers, in any spatial location. We face this issue by storing an additional feature map, F(n), and by using a particular class of activation functions in the hidden layers of the network.

Let's consider the first layer of a CNN which processes surfaces obtained using the procedure described in the previous section and which computes the convolution of a set of filters W with bias b and activation function g(·). The computation performed on each receptive field is:

y^t_{ij(1)} = g\left( \sum_h \sum_k x^t_{h+i,k+j} W_{hk(1)} + b_{(1)} \right) = g(\tilde{y}^t_{ij(1)}),    (3)

where h, k select a pixel x^t_{h+i,k+j} in the receptive field of the output feature (i, j) and its corresponding value in the kernel W, whereas the subscript (1) indicates the hidden layer of the network (in this case the first one after the leaky surface layer).

When a new event arrives, the leaky surface layer decreases all the pixels by ∆leak, i.e., a pixel not directly affected by the event becomes x^{t+1}_{hk} = x^t_{hk} − ∆^{t+1}_{leak}, with ∆^{t+1}_{leak} > 0. At time t + 1, Equation (3) becomes:

y^{t+1}_{ij(1)} = g\left( \sum_h \sum_k x^{t+1}_{h+i,k+j} W_{hk(1)} + b_{(1)} \right)
               = g\left( \sum_h \sum_k \left( x^t_{h+i,k+j} - \Delta^{t+1}_{leak} \right) W_{hk(1)} + b_{(1)} \right)
               = g\left( \tilde{y}^t_{ij(1)} - \Delta^{t+1}_{leak} \sum_h \sum_k W_{hk(1)} \right).    (4)

If g(·) is (i) a piecewise linear activation function g(x) = {α_i · x if x ∈ D_i}, such as ReLU or Leaky ReLU, and we assume that (ii) the updated value does not change which linear segment of the activation function the output falls onto and, in this first approximation, (iii) the leaky surface layer does not restrict pixels using max(·, 0), Equation (4) can be rewritten as follows:

y^{t+1}_{ij(1)} = y^t_{ij(1)} - \Delta^{t+1}_{leak}\, \alpha_{ij(1)} \sum_h \sum_k W_{hk(1)},    (5)

where α_{ij(1)} is the coefficient applied by the piecewise function g(·), which depends on the feature value at position (i, j). When the previous assumption is not satisfied, the feature is recomputed as if its receptive field was affected by an event (i.e., applying the filter W locally to x^{t+1}).

Consider now a second convolutional layer attached to the first one:

y^{t+1}_{ij(2)} = g\left( \sum_{h,k} y^{t+1}_{i+h,j+k(1)} W_{hk(2)} + b_{(2)} \right)
               = g\left( \sum_{h,k} \left( y^t_{i+h,j+k(1)} - \Delta^{t+1}_{leak}\, \alpha_{i+h,j+k(1)} \sum_{h',k'} W_{h'k'(1)} \right) W_{hk(2)} + b_{(2)} \right)
               = y^t_{ij(2)} - \Delta^{t+1}_{leak}\, \alpha_{ij(2)} \sum_{h,k} \alpha_{i+h,j+k(1)} \sum_{h',k'} W_{h'k'(1)} W_{hk(2)}
               = y^t_{ij(2)} - \Delta^{t+1}_{leak}\, \alpha_{ij(2)} \sum_{h,k} F^{t+1}_{h+i,k+j(1)} W_{hk(2)} = y^t_{ij(2)} - \Delta^{t+1}_{leak}\, F^{t+1}_{ij(2)}.    (6)

The previous equation can be extended by induction as follows:

y^{t+1}_{ij(n)} = y^t_{ij(n)} - \Delta^{t+1}_{leak}\, F^{t+1}_{ij(n)}, \quad \text{with } F^{t+1}_{ij(n)} = \alpha_{ij(n)} \sum_h \sum_k F^{t+1}_{i+h,j+k(n-1)} W_{hk(n)} \text{ if } n > 1,    (7)

where F^t_{ij(n)} expresses how visual inputs are transformed by the network in every receptive field (i, j), i.e., the composition of the previous layers' activation functions.

Given this formulation, the max operator applied by the leaky surface layer can be interpreted as a ReLU, and Equation (5) becomes:

y^{t+1}_{ij(1)} = y^t_{ij(1)} - \Delta^{t+1}_{leak}\, \alpha_{ij(1)} \sum_h \sum_k F^{t+1}_{i+h,j+k(0)} W_{hk(1)},    (8)

where the value F_{i+h,j+k(0)} is 0 if the pixel x_{i+h,j+k} ≤ 0 and 1 otherwise.

Notice that F_{ij(n)} needs to be updated only when the corresponding feature changes enough to make the activation function use a different coefficient α, e.g., from 0 to 1 in the case of ReLU. In this case F(n) is updated locally in correspondence of the change by using the update matrix of the previous layer and by applying Equation (7) only for the features whose activation function has changed. Events are used to communicate the change to subsequent layers so that their update matrix can also be updated accordingly.

The internal state of the e-conv layer, therefore, comprises the feature maps y^{t−1}_{(n)} and the update values F^{t−1}_{(n)} computed at the previous time step. The initial values of the internal state are computed by making full inference on a blank surface; this is the only time the network needs to be executed entirely. As a new sequence of events arrives, the following operations are performed (see Figure 3(a)):

i. Update F^{t−1}_{(n)} locally on the coordinates specified by the list of incoming events (Eq. (7)). Note that we do not distinguish between actual events and those generated by the use of a different slope in the linear activation function.

ii. Update the feature map y_{(n)} with Eq. (7) in the locations which are not affected by any event and generate an output event where the activation function coefficient has changed.

iii. Recompute y_{(n)} by applying W locally in correspondence of the incoming events and output which receptive fields have been affected.

iv. Forward the feature map and the events generated in the current step to the next layer.
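The following sketch summarizes this update rule for a first e-conv layer with a single input and output channel, 'valid' convolution and ReLU. It is a simplified illustration of Equations (7) and (8), not the authors' implementation, and all names are ours:

    import numpy as np

    class EConvLayer:
        """First e-conv layer: y and F are the internal state of Eq. (7)."""

        def __init__(self, W, b=0.0):
            self.W, self.b = W.astype(np.float32), float(b)
            self.x = None   # copy of the input surface
            self.y = None   # output feature map
            self.F = None   # update matrix F_(1)

        def _recompute(self, i, j):
            kh, kw = self.W.shape
            patch = self.x[i:i + kh, j:j + kw]
            pre = float(np.sum(patch * self.W) + self.b)
            alpha = 1.0 if pre > 0 else 0.0                    # active ReLU slope
            self.y[i, j] = alpha * pre
            # Eq. (8): F_(0) is 1 where the input pixel is positive, 0 elsewhere.
            self.F[i, j] = alpha * float(np.sum((patch > 0) * self.W))

        def initialize(self, surface):
            """Full forward pass, performed once on a blank surface."""
            kh, kw = self.W.shape
            self.x = surface.astype(np.float32).copy()
            out_h, out_w = self.x.shape[0] - kh + 1, self.x.shape[1] - kw + 1
            self.y = np.zeros((out_h, out_w), dtype=np.float32)
            self.F = np.zeros((out_h, out_w), dtype=np.float32)
            for i in range(out_h):
                for j in range(out_w):
                    self._recompute(i, j)
            return self.y

        def update(self, surface, events, delta_leak):
            """events: (row, col) input coordinates that received an event."""
            kh, kw = self.W.shape
            self.x = surface.astype(np.float32).copy()
            affected = set()
            for (r, c) in events:   # outputs whose receptive field saw an event
                for i in range(max(0, r - kh + 1), min(self.y.shape[0], r + 1)):
                    for j in range(max(0, c - kw + 1), min(self.y.shape[1], c + 1)):
                        affected.add((i, j))
            mask = np.ones(self.y.shape, dtype=bool)
            for (i, j) in affected:
                mask[i, j] = False
            # Sparse path of Eq. (7): decay every output that saw no event.
            self.y[mask] -= delta_leak * self.F[mask]
            # Where the decay crosses the ReLU kink, alpha changes: recompute there too
            # (these locations would also be forwarded as events to the next layer).
            for (i, j) in map(tuple, np.argwhere(mask & (self.y < 0))):
                self._recompute(i, j)
            # Dense path: recompute only the outputs affected by the incoming events.
            for (i, j) in affected:
                self._recompute(i, j)
            return self.y, affected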
3.3. Event-based Max Pooling Layer (e-max-pool)

The location of the maximum value in each receptive field of a max-pooling layer is likely to remain the same over time. An event-based pooling layer, hence, can exploit this property to avoid recomputing the position of the maximum values every time.

The internal state of an event-based max-pooling (e-max-pool) layer can be described by a positional matrix I(n), which has the shape of the output feature map produced by the layer and which stores, for every receptive field, the position of its maximum value. Every time a sequence of events arrives, the internal state I^t_{(n)} is sparsely updated by recomputing the position of the maximum values in every receptive field affected by an incoming event. The internal state is then used both to build the output feature map and to produce the update matrix F^t_{(n)} by fetching the previous layer at the locations provided by the indices I^t_{ij(n)}. For each e-max-pool layer, the indices of the receptive fields where the maximum value changes are communicated to the subsequent layers so that the internal states can be updated accordingly. This mechanism is depicted in Figure 3(b).

Notice that the leaking mechanism acts differently in distinct regions of the input space. Features inside the same receptive field can indeed decrease over time at different speeds, as their update values F^t_{ij(n)} could be different. Therefore, even if no event has been detected inside a region, the position of its maximum value might change.
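A compact sketch of this layer for a 2 × 2 pooling window is given below; gathering the update matrix F_(n) from the previous layer at the stored indices is omitted for brevity. As with the previous snippets, this is an illustration of the mechanism with names of our choosing, not the authors' code:

    import numpy as np

    class EMaxPoolLayer:
        """e-max-pool layer: I stores, for each receptive field, the input position
        of its current maximum; only fields touched by events are re-scanned."""

        def __init__(self, size=2):
            self.size = size
            self.I = None   # positional matrix, shape (out_h, out_w, 2)

        def _rescan(self, x, i, j):
            s = self.size
            patch = x[i * s:(i + 1) * s, j * s:(j + 1) * s]
            r, c = np.unravel_index(int(np.argmax(patch)), patch.shape)
            self.I[i, j] = (i * s + r, j * s + c)

        def initialize(self, x):
            s = self.size
            out_h, out_w = x.shape[0] // s, x.shape[1] // s
            self.I = np.zeros((out_h, out_w, 2), dtype=np.int64)
            for i in range(out_h):
                for j in range(out_w):
                    self._rescan(x, i, j)
            return self.forward(x)

        def forward(self, x):
            """Build the output feature map by fetching x at the stored indices."""
            out = np.empty(self.I.shape[:2], dtype=x.dtype)
            for i in range(out.shape[0]):
                for j in range(out.shape[1]):
                    r, c = self.I[i, j]
                    out[i, j] = x[r, c]
            return out

        def update(self, x, events):
            """events: (row, col) input locations recomputed by the previous layer.
            Returns the pooled map and the fields whose maximum position changed."""
            s, changed = self.size, []
            for (r, c) in events:
                i, j = r // s, c // s
                old = tuple(self.I[i, j])
                self._rescan(x, i, j)
                if tuple(self.I[i, j]) != old:
                    changed.append((i, j))
            # Because F can differ inside a receptive field, the maximum may in principle
            # move even without events; a periodic full rescan (omitted here) covers that case.
            return self.forward(x), changed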
Figure 2. fcYOLE: a fully-convolutional detection network based on YOLE. The last layer is used to map the feature vectors into a set of 20 values which define the parameters of the predicted bounding boxes.

Figure 3. The structure of the e-conv (a) and e-max-pooling layers (b). The internal states and the update matrices are recomputed locally only where events are received (green cells), whereas the remaining regions (depicted in yellow) are obtained reusing the previous state.
Table 1. YOLE Top-20 average precisions on N-Caltech101.
Classes: Motorbikes, metronome, saxophone, Leopards, airplanes, umbrella, menorah, windsor chair, trilobite, minaret, garfield, rooster, stapler, laptop, watch, dollar bill, grand piano, inline skate, Faces easy, yin yang
AP: 97.8, 95.8, 94.7, 88.3, 88.1, 86.5, 85.9, 84.2, 81.3, 81.3, 80.7, 75.1, 68.4, 68.1, 65.2, 64.5, 63.3, 62.9, 62.5, 62.3
Ntrain: 480, 480, 261, 20, 49, 32, 45, 145, 46, 61, 53, 19, 24, 27, 34, 31, 36, 120, 52, 22
Table 2. fcYOLE Top-20 average precisions on N-Caltech101. Full table provided in the supplemental material.
Classes: Motorbikes, metronome, saxophone, accordion, dragonfly, Leopards, airplanes, umbrella, menorah, windsor chair, minaret, buddha, soccer ball, watch, dollar bill, grand piano, Faces easy, stop sign, car side, yin yang
AP: 97.5, 96.8, 92.2, 75.7, 74.4, 70.3, 69.5, 67.7, 63.4, 61.0, 60.4, 59.7, 59.5, 57.3, 57.2, 55.6, 55.1, 52.3, 48.3, 46.5
Ntrain: 480, 480, 261, 145, 32, 75, 61, 53, 20, 45, 36, 24, 46, 40, 120, 42, 40, 34, 33, 51
Table 3. Performance comparison between YOLE and fcYOLE.
                     fcYOLE            YOLE
                     acc     mAP       acc     mAP
S-MNIST-DVS          94.0    87.4      96.1    92.0
Blackboard MNIST     88.5    84.7      90.4    87.4
OD-Poker-DVS         79.10   78.69     87.3    82.2
N-Caltech101         57.1    26.9      64.9    39.8

Table 4. YOLE performance on S-N-MNIST variants.
S-N-MNIST     v1      v2      v2*     v2fr    v2fr+ns
accuracy      94.9    91.7    94.7    88.6    85.5
mAP           91.3    87.9    90.5    81.5    77.4

Blackboard MNIST and Poker-DVS datasets, which represent a more realistic scenario in terms of noise. All of these experiments were performed using the set of hyperparameters suggested by the original work from [41]. However, a different choice of these parameters, namely λcoord = 25.0 and λnoobj = 0.25, worked better for us, increasing both the accuracy and mean average precision scores (v2*).

The dataset in which the proposed model did not achieve noticeable results is N-Caltech101. This is mainly explained by the increased difficulty of the task and by the fact that the number of samples in each class is not evenly balanced. The network, indeed, usually achieves good results when the number of training samples is high, such as with Airplanes, Motorbikes and Faces easy, and in cases in which samples are very similar, e.g., inline skate (see Table 1 and the supplementary material). As the number of training samples decreases and the sample variability within the class increases, however, the performance of the model becomes worse, a behavior which explains the poor aggregate scores we report in Table 3.

Detection performance of fcYOLE. With this fully-convolutional variant of the network we registered a slight decrease in performance w.r.t. the results we obtained using YOLE, as reported in Table 3 and Table 2. This gap in performance is mainly due to the fact that each region in fcYOLE generates its predictions by only looking at the visual information contained in its portion of the field of view. Indeed, if an object is only partially contained inside a region, the network has to guess the object dimensions and class by only looking at a restricted region of the surface. It should be stressed, however, that the difference in performance between the two architectures does not come from the use of the proposed event layers, whose outputs are the same as the conventional ones, but rather from the reduced expressive power caused by the absence of fully-connected layers in fcYOLE. Indeed, not removing them would have allowed us to obtain the same performance as YOLE, but with the drawback of being able to exploit event-based layers only up to the first FC layer, which has not been formalized yet in an event-based form. Removing the last fully-connected layers allowed us to design a detection network made only of event-based layers and which also uses a significantly lower number of parameters. In the supplementary materials we provide a video showing a comparison between YOLE and fcYOLE predictions.

To identify the advantages and weaknesses of the proposed event-based framework in terms of inference time, we compared our detection networks on two datasets, Shifted N-MNIST and Blackboard MNIST. We group events into batches of 10 ms and average timings over 1000 runs. On the first dataset the event-based approach achieved a 2x speedup (22.6 ms per batch), whereas on the second one it performed slightly slower (43.2 ms per batch) w.r.t. a network making use of conventional layers (34.6 ms per batch). The second benchmark is indeed challenging for our framework since changes are not localized in restricted regions. Our current implementation is not optimized to handle noisy scenes efficiently. Indeed, additional experiments showed that asynchronous CNNs are able to provide a faster prediction only up to 80% event sparsity (where by sparsity we mean the percentage of changed pixels in the reconstructed image). Further investigations are out of the scope of this paper and will be addressed in future works.
In this document we describe our novel event-based datasets adopted in the paper "Asynchronous Convolutional Networks for Object Detection in Neuromorphic Cameras".

1. Event-based object detection datasets

Due to the lack of object detection datasets recorded with event cameras, we extended the publicly available N-MNIST, MNIST-DVS and Poker-DVS, and we propose a novel dataset based on MNIST, i.e., Blackboard MNIST. They will be released soon; in Figure 1 we report some examples from the four datasets.

1.1. Shifted N-MNIST

The original N-MNIST [5] extends the well-known MNIST [3]: it provides an event-based representation of both the full training set (60,000 samples) and the full testing set (10,000 samples) to evaluate object classification algorithms. The dataset has been recorded by moving an event camera in front of an LCD screen to capture static MNIST digits displayed on the monitor. For further details we refer the reader to [5].

Starting from the N-MNIST dataset, we built a more complex set of recordings that we used to train the object detection network to detect multiple objects in the same scene. We created two versions of the dataset, Shifted N-MNIST v1 and Shifted N-MNIST v2, that contain respectively one or two non-overlapping 34 × 34 N-MNIST digits per sample, randomly positioned on a bigger surface. We used different surface dimensions in our tests, which vary from double the original size, 68 × 68, up to 124 × 124. The dimension and structure of the resulting dataset are the same as the original N-MNIST collection.

To extend the dataset for object detection evaluation, bounding box ground truths are required. To estimate them, we first integrate events into a single frame as described in Section 2 of the original paper. We remove the noise by considering only non-zero pixels having at least ρ other non-zero pixels around them within a circle of radius R. All the other pixels are considered noise. Then, with a custom version of the DBSCAN [2] density-based clustering algorithm, we group pixels into a single cluster. A threshold min_area is used to filter out small bounding boxes extracted in correspondence of low event activity. This condition usually happens during the transition from one saccade to the next, as the camera remains still for a small fraction of time and no events are generated. We used ρ = 3, R = 2 and min_area = 10. The coordinates of these bounding boxes are then shifted based on the final position the digit has in the bigger field of view.
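A sketch of this annotation step is given below. The paper uses a custom DBSCAN variant; here the standard scikit-learn DBSCAN (with eps = R and min_samples = ρ, an assumption on our part) stands in for it, and the function name and exact parameter mapping are illustrative:

    import numpy as np
    from sklearn.cluster import DBSCAN

    def extract_boxes(frame, rho=3, R=2.0, min_area=10):
        """frame: 2D array of integrated events -> list of (x, y, w, h) boxes."""
        ys, xs = np.nonzero(frame)
        pts = np.stack([xs, ys], axis=1).astype(float)
        if len(pts) == 0:
            return []
        # Noise filter: keep pixels with at least `rho` other non-zero pixels
        # within a circle of radius `R`.
        keep = [p for p in pts if np.sum(np.linalg.norm(pts - p, axis=1) <= R) - 1 >= rho]
        if not keep:
            return []
        pts = np.array(keep)
        # Group the surviving pixels into clusters (one per digit).
        labels = DBSCAN(eps=R, min_samples=rho).fit_predict(pts)
        boxes = []
        for lbl in set(labels) - {-1}:
            cluster = pts[labels == lbl]
            x0, y0 = cluster.min(axis=0)
            x1, y1 = cluster.max(axis=0)
            w, h = x1 - x0 + 1, y1 - y0 + 1
            if w * h >= min_area:   # drop boxes caused by low event activity
                boxes.append((int(x0), int(y0), int(w), int(h)))
        return boxes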
For each N-MNIST sample, another digit was randomly selected from the same portion of the dataset (training, testing or validation) to form a new example. The final dataset contains 60,000 training samples and 10,000 testing samples, as for the original N-MNIST dataset. In Figure 2 we illustrate one example for v1 and the three variants of v2 we adopted (and described) in the paper.

1.2. Shifted MNIST-DVS

The MNIST-DVS dataset [6] is another collection of event-based recordings that extends MNIST [3]. It consists of 30,000 samples recorded by displaying digits on a screen in front of an event camera but, differently from N-MNIST, the digits are moved on the screen instead of the sensor, and they are displayed at three different scales, i.e., scale4, scale8 and scale16. The resulting dataset is composed of 30,000 event-based recordings showing each one of the selected 10,000 MNIST digits at three different dimensions. Examples of these recordings are shown in Figure 3.

We used MNIST-DVS recordings to build a detection dataset by means of a procedure similar to the one we used to create the Shifted N-MNIST dataset. However, in this case we mix together digits of multiple scales. All the MNIST-DVS samples, regardless of the actual dimensions of the digits being recorded, are contained within a fixed 128 × 128 field of view.
Figure 1: Example recordings from the four datasets: Shifted N-MNIST, Shifted MNIST-DVS, OD-Poker-DVS and Blackboard-MNIST.

Figure 3: Examples of the three different scales of MNIST-DVS digits. Two samples at scale scale4, two at scale8 and two at scale16.
Digits are placed centered inside the scene and occupy a limited portion of the frame, especially those belonging to the smallest and middle scales. In order to place multiple examples in the same scene, we first cropped the three scales of samples into smaller recordings occupying 35 × 35, 65 × 65 and 105 × 105 spatial regions, respectively. The bounding box annotations and the final examples were obtained by means of the same procedure we used to construct the Shifted N-MNIST dataset. These recordings were built by mixing digits of different dimensions in the same sample. Based on the original sample dimensions, we decided to use the following four configurations (which specify the number of samples of each category used to build a single Shifted MNIST-DVS example): (i) three scale4 digits, (ii) two scale8 digits, (iii) two scale4 digits mixed with one scale8 digit, (iv) one scale16 digit, placed in random locations of the field of view. The overall dataset is composed of 30,000 samples containing these four possible configurations.
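Written out as plain data (an illustrative encoding of our own, not from the paper), the four layouts are:

    # (scale, number of digits) pairs describing each of the four configurations.
    CONFIGURATIONS = [
        [("scale4", 3)],                 # (i)   three scale4 digits
        [("scale8", 2)],                 # (ii)  two scale8 digits
        [("scale4", 2), ("scale8", 1)],  # (iii) two scale4 digits and one scale8 digit
        [("scale16", 1)],                # (iv)  one scale16 digit
    ]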
diamonds or clubs) extracted from three deck recordings. Single pips were extracted by means of an event-based tracking algorithm which was used to follow symbols inside the scene and to extract 31 × 31 pixel examples.

With OD-Poker-DVS we extend its scope to also test object detection. To do so, we used the event-based tracking algorithm provided with the original dataset to follow the movement of the 31 × 31 samples in the uncut recordings and extract their bounding boxes. The final dataset was obtained using a procedure similar to the one used in [7]. Indeed, we divided the sections of the three original deck recordings containing visible digits into a set of shorter examples, each about 1.5 ms long. Examples were split in order to ensure approximately the same number of objects (i.e., ground truth bounding boxes) in each example. The final detection dataset is composed of 292 small examples, which we divided into 218 training and 74 testing samples.
Figure 4: (a) The image shows in black the intensity, expressed as log I_u(t), of a single pixel u = (x, y). This curve is sampled at a constant rate when frames are generated by Blender, shown in the figure as vertical blue lines. The sampled values thus obtained (blue circles) are used to approximate the pixel intensity by means of a simple piecewise linear time interpolation (red line). Whenever this curve crosses one of the threshold values (horizontal dashed lines), a new event is generated with the corresponding predicted timestamp. (Figure from [4]) (b) A preprocessed MNIST digit on top of the blackboard's background.
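The per-pixel conversion described in the caption can be sketched as follows; this is our reconstruction of the procedure from the caption (the original tool is described in [4]), and the function and variable names are illustrative:

    def pixel_events(log_I, frame_ts, threshold):
        """log_I: log-intensity samples of one pixel (one per rendered frame),
        frame_ts: the corresponding timestamps. Returns (timestamp, polarity) events."""
        events = []
        ref = log_I[0]                       # log-intensity at the last emitted event
        for k in range(1, len(log_I)):
            t0, t1 = frame_ts[k - 1], frame_ts[k]
            v0, v1 = log_I[k - 1], log_I[k]
            # Piecewise linear interpolation between samples: emit an event every time
            # the interpolated curve moves a full threshold away from `ref`.
            while abs(v1 - ref) >= threshold:
                pol = 1 if v1 > ref else -1
                target = ref + pol * threshold
                t_cross = t0 + (target - v0) / (v1 - v0) * (t1 - t0)
                events.append((t_cross, pol))
                ref = target
        return events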
Figure 5: (a) The 3D scene used to generate the Blackboard MNIST dataset. The camera moves in front of the blackboard
along a straight trajectory while following a focus object that moves on the blackboard’s surface, synchronized with the
camera. The camera and its trajectory are depicted in green, the focus object is represented as a red cross and, finally, its
trajectory is depicted as a yellow line. (b) The three types of focus trajectories.
recorded by using the second and third configurations of objects we described previously. However, in this case each image was resized to a variable size spanning from the original configuration size down to the previous scale. A total of 600 new samples (500 training, 50 testing, 50 validation) were generated, 300 of them containing three digits each and the remaining 300 consisting of two digits with variable size.

The three collections can be used individually or jointly; the whole Blackboard MNIST dataset contains 3,000 samples in total (2,500 training, 250 testing, 250 validation). Examples of different object configurations are shown in Figure 6. Samples were saved by means of the AEDAT v3.1 file format for event-based recordings.

2. Results

Table 1 provides a comparison between the average precision of YOLE and fcYOLE on N-Caltech101 classes. We also provide a qualitative comparison between the two models in the video attachment.
Figure 6: Examples of the three types of object configurations used to generate the second collection of the Blackboard MNIST dataset.
Table 1: Average precision (AP) of fcYOLE and YOLE on the N-Caltech101 classes, together with the number of training samples (Ntrain) per class.

Classes: grand piano, Motorbikes, Faces easy, chandelier, helicopter, hawksbill, Leopards, crocodile, airplanes, flamingo, menorah, butterfly, car side, starfish, bonsai, watch, ketch, brain, chair, ant
AP fcYOLE: 97.5, 96.8, 92.2, 75.7, 57.2, 7.5, 30.2, 70.3, 42.3, 2.3, 2.4, 34.8, 0.0, 69.5, 35.3, 19.6, 33.5, 8.6, 67.7, 23.2
AP YOLE: 97.8, 95.8, 94.7, 84.2, 62.9, 17.3, 59.3, 61.7, 52.9, 10.0, 25.8, 55.7, 1.6, 81.3, 53.3, 29.1, 46.3, 14.9, 80.7, 32.7
Ntrain: 480, 480, 261, 145, 120, 109, 78, 75, 70, 68, 66, 65, 61, 61, 60, 60, 55, 54, 53, 52

Classes: electric guitar, cougar face, euphonium, dalmatian, sunflower, dragonfly, kangaroo, umbrella, scorpion, revolver, trilobite, crayfish, minaret, buddha, laptop, llama, ferry, ewer, crab, ibis
AP fcYOLE: 2.5, 3.4, 41.2, 29.1, 46.5, 35.3, 20.3, 40.0, 1.4, 1.5, 59.5, 61.0, 5.0, 23.2, 21.8, 55.6, 7.3, 24.9, 29.7, 43.5
AP YOLE: 6.9, 5.0, 62.5, 43.3, 57.2, 51.3, 57.4, 88.1, 10.2, 6.5, 81.3, 85.9, 19.5, 29.7, 39.8, 59.9, 9.5, 33.0, 34.0, 53.6
Ntrain: 52, 52, 52, 51, 51, 51, 50, 49, 48, 48, 46, 45, 45, 45, 43, 42, 42, 41, 41, 40

Classes: windsor chair, stegosaurus, joshua tree, soccer ball, wheelchair, cellphone, hedgehog, sea horse, stop sign, schooner, yin yang, elephant, pyramid, nautilus, dolphin, rhino, lamp, lotus, bass, cup
AP fcYOLE: 18.1, 55.1, 30.2, 57.3, 11.4, 37.3, 6.5, 17.7, 34.5, 5.8, 25.8, 46.5, 60.4, 5.6, 1.9, 41.1, 52.3, 13.3, 4.2, 13.0
AP YOLE: 27.6, 61.7, 29.8, 51.5, 6.0, 56.8, 11.5, 45.3, 44.6, 11.3, 25.3, 54.6, 63.3, 17.5, 8.8, 48.6, 65.2, 9.8, 4.0, 50.4
Ntrain: 40, 40, 40, 40, 40, 39, 39, 37, 37, 37, 37, 37, 36, 35, 35, 35, 34, 34, 34, 33

Classes: crocodile head, flamingo head, brontosaurus, cougar body, gramophone, ceiling fan, dollar bill, accordion, mandolin, camera, pagoda, cannon, rooster, pigeon, stapler, beaver, barrel, pizza, emu, tick
AP fcYOLE: 17.4, 5.6, 48.3, 74.4, 28.2, 9.8, 30.6, 26.6, 15.1, 34.0, 0.0, 30.7, 40.8, 0.5, 0.0, 6.6, 2.7, 0.2, 46.0, 6.2
AP YOLE: 54.5, 5.0, 52.7, 86.5, 55.2, 10.0, 48.4, 64.5, 32.7, 33.3, 12.2, 41.2, 50.1, 0.0, 0.3, 43.4, 9.3, 25.8, 68.1, 43.1
Ntrain: 33, 33, 33, 32, 31, 31, 31, 31, 30, 29, 29, 29, 29, 28, 27, 27, 27, 27, 27, 27

Classes: inline skate, metronome, headphone, saxophone, strawberry, water lilly, binocular, platypus, wild cat, gerenuk, scissors, octopus, garfield, wrench, snoopy, mayfly, anchor, lobster, panda, okapi
AP fcYOLE: 10.7, 14.6, 28.2, 14.7, 10.0, 0.0, 4.7, 59.7, 0.5, 3.1, 23.4, 0.0, 4.8, 13.1, 0.4, 14.0, 0.5, 8.3, 63.4, 37.2
AP YOLE: 21.1, 17.3, 47.5, 29.7, 44.6, 0.0, 12.2, 68.4, 0.7, 14.7, 62.3, 0.0, 7.2, 34.7, 11.8, 13.8, 29.4, 53.1, 88.3, 75.1
Ntrain: 26, 26, 25, 25, 25, 25, 24, 24, 24, 23, 22, 22, 22, 22, 21, 21, 21, 21, 20, 19