2015WS HS SpikingVision
Spiking Neural Networks for Machine Vision Tasks
ADVANCED SEMINAR
submitted by
Henry Martin
NEUROWISSENSCHAFTLICHE SYSTEMTHEORIE
PROF. DR. JÖRG CONRADT
2015-10-01
ADVANCED SEMINAR
Problem description:
Neural networks have achieved striking results in object recognition tasks lately. However, most
networks, like standard convolutional networks, work on full images/frames and are expensive with
respect to computing resources. This heavily restricts their use in real-time applications. To overcome
this, research has been going in the direction of fast networks and more efficient visual coding. One
example of this is frame-free spiking convolutional nets: they use event-based vision streams generated
by novel vision sensors (DVS [1]) instead of full frames - as generated by conventional cameras - as
input and process data asynchronously. For this project, we want you to have a look into the capabilities
and limits of spiking neural nets for machine vision tasks and compare them to traditional approaches.
(Jörg Conradt)
Professor
Bibliography:
[1] Lichtsteiner, P., Posch, C. and Delbruck, T. A 128 × 128 120 dB 15 µs Latency Asynchronous
Temporal Contrast Vision Sensor. IEEE Journal of Solid-State Circuits, Feb. 2008, pp. 566-576
Abstract
In the past few years, convolutional neural networks have had tremendous success in computer vision tasks such as object detection or face recognition. Despite this success, their high computational complexity and energy consumption limit their use in mobile applications and robotics. Scientists are therefore working on the next generation of neural networks, which use event-based spikes to encode information. Spiking neural networks appear to be more efficient in terms of power consumption and algorithmic complexity, but what are the capabilities and limits of this type of network, and how does it perform in comparison with regular convolutional neural networks?
To answer this question, this work first gives a rough overview of regular neural networks, the basic neuron models and the benefits of a convolutional architecture. It then proceeds with an introduction to spiking neural networks. Both network types are compared in terms of availability of training data, technology readiness level, speed and efficiency.
Finally, the most relevant examples of spiking neural network applications in computer vision are presented and literature for further reading is proposed.
Contents
1 Regular Neural Networks 3
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Generation and neuron models . . . . . . . . . . . . . . . . . . . . . 3
1.3 Convolutional Architecture . . . . . . . . . . . . . . . . . . . . . . . 4
5 Conclusion 16
References 18
1 Regular Neural Networks
1.1 Introduction
Although neural networks have been known for over 50 years, their widespread use began only in the last few years. Even though they achieved impressive results in simpler computer vision applications such as handwritten digit recognition [1], they were believed to be unsuitable for more complex problems like object detection. It was not until 2012 that A. Krizhevsky et al. [2] proposed a deep convolutional neural network at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) which outperformed its competitors by far. Since then, CNNs have been used successfully to solve various computer vision problems such as object detection and face recognition.
Figure 2: Overview of the elements in a classical neuron model.2
to rectified linear activation functions when networks grew bigger. The reason is that gradient-based learning methods, such as the error backpropagation algorithm, multiply the gradients of the activation functions of many connected neurons; since the gradient of the sigmoid activation function lies between 0 and 1, this product becomes vanishingly small. This effect is known as the vanishing gradient problem [3] and can be avoided by using linear activation functions with a gradient of 1.
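As a rough numerical illustration of this effect (not part of the original argument; the chain of twenty pre-activations is hypothetical), the following Python sketch multiplies one activation-function derivative per layer:

    import numpy as np

    # Gradient-based learning multiplies one activation-function derivative per
    # layer. Sigmoid derivatives are at most 0.25, so the product shrinks
    # exponentially with depth; the rectified linear derivative is 1 for active
    # units, so an active path passes the gradient through unchanged.

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)          # at most 0.25 (at x = 0)

    def relu_grad(x):
        return (x > 0).astype(float)  # 1 for active units, 0 otherwise

    rng = np.random.default_rng(0)
    pre_activations = rng.normal(size=20)   # one hypothetical unit per layer

    print("sigmoid:", np.prod(sigmoid_grad(pre_activations)))       # below 1e-12 for 20 layers
    print("relu:   ", np.prod(relu_grad(np.abs(pre_activations))))  # 1.0 along an active path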
Figure 3: The three most common activation functions for neurons. From left to right: the step function used by neurons of the first generation, and the rectified linear function and sigmoid function used by neurons of the second generation.3
At the end of the 1990s, one of the most important limitations of neural networks was still the computational complexity of learning with large numbers of variables [6]. Even though CPUs and GPUs became faster, it was the convolutional architecture of neural networks that was the key to their success.
2 https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Activation_Functions
3 http://chem-eng.utoronto.ca/~datamining/dmc/artificial_neural_network.htm
The key contribution of the convolutional architecture is dimensionality reduction. In a convolutional layer, each neuron is connected only to a small local region of the previous layer. These regions overlap partially, as in figure 4, and all of them share the same weights. Another layer type that is part of the convolutional architecture is the so-called pooling layer. The pooling layer, an example of which is shown in figure 5, summarizes the information of a number of input neurons into one single output neuron. Commonly used pooling functions are the maximum and the mean.
Figure 5: Example of a pooling layer. The main goal of a pooling layer is to reduce the complexity of a network. To this end, the inputs are summarized using a pooling function, such as taking the maximum of the inputs or calculating their mean. In this figure, the input is a 2x2 matrix and the output is the maximum value of that matrix. The grid is then moved by the stride.5
4 http://neuralnetworksanddeeplearning.com/chap6.html
5 http://cs231n.github.io/convolutional-networks/#pool
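To make the two layer types concrete, the following minimal Python sketch (input size, kernel and stride are arbitrary assumptions, not taken from the cited sources) applies one shared 3x3 kernel to an 8x8 input and then reduces the feature map with 2x2 max pooling:

    import numpy as np

    # Toy 2D convolution (valid padding, stride 1) followed by 2x2 max pooling.
    # Weight sharing means the same 3x3 kernel is applied to every local region.

    def conv2d(image, kernel):
        kh, kw = kernel.shape
        out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    def max_pool(fmap, size=2, stride=2):
        out = np.zeros(((fmap.shape[0] - size) // stride + 1,
                        (fmap.shape[1] - size) // stride + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = fmap[i * stride:i * stride + size,
                                 j * stride:j * stride + size].max()
        return out

    image = np.random.rand(8, 8)       # hypothetical single-channel input
    kernel = np.random.rand(3, 3)      # the shared weights of one feature map
    pooled = max_pool(np.maximum(conv2d(image, kernel), 0))  # ReLU in between
    print(pooled.shape)                # (3, 3)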
2 Spiking Neural Networks (SNN)
2.1 Introduction
The name spiking neural network, as well as the term neural network of the third generation, refers to the neuron model that is deployed, which outputs spike-shaped impulses instead of a constant, time-invariant value. Unlike conventional neurons, spiking neurons do not operate on a discrete time basis but fire a spike whenever their membrane potential crosses the firing threshold. This can be seen in figure 6a, where the membrane potential increases due to incoming spikes (also called events) until the firing threshold is crossed. The membrane potential then drops and a spike, as shown in figure 6b, is fired to all connected neurons. The incoming spike then increases their membrane potentials depending on the weights of the connections.
While the information in conventional neuron models is encoded in the amplitude of the output, the amplitude of a spike is constant. There are different ways to encode information using spikes, for example spike-rate-dependent coding or spike-timing-dependent coding. How information is encoded in the brain is still an open research topic and not within the scope of this work. Interested readers can get a quick overview in [8] or find detailed information in [9].
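As a small, purely illustrative sketch of rate coding (window length and maximum rate are assumptions, not values from the cited literature), a pixel intensity can be mapped to the probability of emitting a spike in each simulation step:

    import numpy as np

    # Rate coding: the expected number of spikes in a fixed time window is
    # proportional to the encoded value (here a pixel intensity in [0, 1]).

    rng = np.random.default_rng(1)

    def rate_code(intensity, n_steps=100, max_rate=0.5):
        """Return a binary spike train of length n_steps."""
        p_spike = intensity * max_rate          # per-step spike probability
        return (rng.random(n_steps) < p_spike).astype(int)

    bright = rate_code(0.9)
    dark = rate_code(0.1)
    print(bright.sum(), dark.sum())             # the bright pixel spikes far more often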
2.2 Motivation
Similar to the neurons of the second generation, there are different artificial neuron models, each modeling different aspects of the biological neuron. Choosing a neuron model is usually a trade-off between biological plausibility and complexity, as can be seen in figure 7. What follows is a short overview of the most common neuron models.
Figure 7: Figure from [10]. It ranks different neuron models by their biological plausibility, defined as the number of different spiking behaviors a model can reproduce, and by the computational cost needed to simulate them.
Hodgkin-Huxley
Presented by A. L. Hodgkin and A. F. Huxley in 1952 and awarded the Nobel Prize in Physiology or Medicine in 1963, this model is one of the best-known neuron models. Its key feature is the accuracy with which it models the biological behavior of real neurons. The price of this accuracy is a high complexity, which rules the model out for use in larger networks. It is nevertheless important, as it can be used to derive simpler models.
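For orientation (this is the standard textbook form of the model, not reproduced from the original text), the membrane voltage V is coupled to three gating variables m, h and n, each governed by voltage-dependent rate functions:

    C_m \frac{dV}{dt} = I - \bar{g}_{\mathrm{Na}}\, m^3 h\, (V - E_{\mathrm{Na}}) - \bar{g}_{\mathrm{K}}\, n^4 (V - E_{\mathrm{K}}) - g_L\, (V - E_L)

    \frac{dx}{dt} = \alpha_x(V)\,(1 - x) - \beta_x(V)\, x, \qquad x \in \{m, h, n\}

Four coupled nonlinear differential equations per neuron, together with their empirical rate functions, make the model accurate but expensive to simulate in large networks.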
Leaky Integrate-and-Fire (LIF) neuron
The Leaky Integrate-and-Fire neuron is one of the most commonly used spiking neuron models. Its main advantage is its simplicity. The LIF neuron integrates all input spikes, which increases the membrane potential until it reaches the firing threshold; the potential then drops and the neuron sends out a spike. If no input event occurs, the membrane potential slowly decays (leaks) back to zero [7]. An example of a LIF neuron is shown in figure 6.
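A minimal discrete-time sketch of this behavior (all constants are illustrative and not taken from any of the cited papers) could look as follows:

    import numpy as np

    # Leaky integrate-and-fire in discrete time: the membrane potential leaks
    # towards zero, incoming spikes increase it, and the neuron fires and
    # resets whenever the firing threshold is crossed.

    def lif_neuron(input_spikes, weight=0.3, leak=0.95, threshold=1.0):
        v, output_spikes = 0.0, []
        for s in input_spikes:
            v = leak * v + weight * s      # leak, then integrate the input event
            if v >= threshold:
                output_spikes.append(1)    # fire ...
                v = 0.0                    # ... and reset the membrane potential
            else:
                output_spikes.append(0)
        return output_spikes

    rng = np.random.default_rng(2)
    inputs = (rng.random(50) < 0.4).astype(int)    # random input spike train
    print(sum(lif_neuron(inputs)), "output spikes for", inputs.sum(), "input spikes")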
Izhikevich neuron
In 2003, E. Izhikevich presented a simple neuron model [10] which is able to reproduce most of the biological spiking behaviors without being particularly complex. Because this model appears to be both efficient and biologically plausible, it has attracted a lot of attention. It is not yet widely used in neural networks today, as the more complex spiking behaviors cannot yet be controlled for learning and information coding [13].
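For reference, the model as given in [10] consists of two coupled equations plus an after-spike reset:

    \dot{v} = 0.04\, v^2 + 5v + 140 - u + I
    \dot{u} = a\, (b\, v - u)
    \text{if } v \ge 30\ \mathrm{mV}: \quad v \leftarrow c, \quad u \leftarrow u + d

Here v is the membrane potential (in mV), u a recovery variable, I the input current, and a, b, c and d the four parameters that select the firing pattern.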
While frame-based neural networks are widely used, spiking neural networks are still in their infancy. One factor that slows down research on SNNs is the lack of available datasets [11]. While millions of ground-truth-annotated images are available, the number of labeled, event-based, frame-free datasets is small. Large event-based benchmark datasets in particular, which would encourage competition, are rare. One of the reasons for this deficit is the difficulty of annotating event-based video data [11]. Until real event-based benchmark datasets are available, a compromise is to transform frame-based datasets into frame-free ones, as is done in [12]. These transformations allow first applications for SNNs to be developed, but they can only be an interim solution, as it is unlikely that SNNs will be able to show their full potential in terms of speed and mobility without datasets tailored to their needs [11][12]. Another problem is that such converted datasets are often flawed, as shown in figure 8, where the monitor refresh rate is visible in the event-based data.
3.2 Technology readiness level
Figure 9: This figure shows the difference in speed between a regular frame-based vision system connected to a CNN and an event-based vision system connected to an SNN. a) shows an abstract view of the architecture and the input. Both systems get a clubs symbol as input and have five processing stages. While the regular system in b) depends on the frame time of 1 ms, the event-based system in c) can process information as it comes. It can be seen that a deeper layer starts to fire spikes before the first stage has finished processing. [13]
event-based neuromorphic hardware such as DVS cameras. Figure 9 shows the difference in speed when processing event-based information. The regular CNN works with discrete time steps, and information can only progress by one layer in every time step. In contrast, information in the SNN is processed as it arrives and can propagate through the network without having to wait for the next discrete time step.
A method that can simplify working with SNNs was proposed in [13]. While learning methods are very well developed for frame-based CNNs, they are still an open research problem for frame-free spiking neural networks. The presented method avoids this problem by transforming a regular CNN, trained with conventional learning methods, into an SNN which is then able to solve the same problem.
The goal in [13] is a convolutional SNN that recognizes the card symbols of an event-based dataset. The dataset is built from a DVS recording of hands browsing a poker deck; an example can be seen in figure 10. To train the regular CNN, the data has to be frame-based; therefore, images are generated by collecting events during a frame time of 30 ms. These images were then used to train the frame-based CNN using error backpropagation.
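The frame-generation step can be pictured roughly as follows; the event format, sensor resolution and exact procedure are assumptions for illustration and not taken from [13]:

    import numpy as np

    # DVS events given as (t, x, y, polarity) tuples are accumulated into 2D
    # histograms over windows of 30 ms, yielding frame-like images on which a
    # regular CNN can be trained.

    def events_to_frames(events, width=128, height=128, frame_time=0.030):
        frames, frame, window_end = [], np.zeros((height, width)), frame_time
        for t, x, y, pol in sorted(events):        # events sorted by timestamp
            while t >= window_end:                 # close the current window
                frames.append(frame)
                frame = np.zeros((height, width))
                window_end += frame_time
            frame[y, x] += 1 if pol else -1        # signed event count per pixel
        frames.append(frame)
        return frames

    # Hypothetical usage with a handful of synthetic events:
    events = [(0.001, 10, 12, 1), (0.002, 10, 13, 0), (0.041, 64, 64, 1)]
    print(len(events_to_frames(events)))           # 2 frames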
After the learning procedure, an SNN with the same architecture and the same neuron connections is created. In [13], a set of equations to parametrize the LIF neurons and calculate their weights is presented. After this mathematical transformation, simulated annealing optimization routines are used to fine-tune the parameters.
The resulting SNN was fed with the test set and was able to recognize between 97.3% and 99.6% of the symbols. The approach presented in [13] looks very promising, as it avoids the difficulties of training an SNN directly and allows the knowledge gained from CNNs to be reused.
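The exact parameter equations are given in [13]; the sketch below only illustrates the general idea behind such rate-based conversions, namely that the trained weights are kept and the analogue output of a rectified linear unit is reinterpreted as the firing rate of an integrate-and-fire neuron (all values are hypothetical):

    import numpy as np

    # One trained unit: its ReLU activation for a constant input should be
    # approximated by the firing rate of a spiking neuron with the same weights
    # when the input is presented as a rate-coded spike train.

    rng = np.random.default_rng(3)
    w = rng.normal(scale=0.15, size=10)             # weights of one trained unit
    x = rng.random(10)                              # constant analogue input in [0, 1]

    relu_activation = max(np.dot(w, x), 0.0)        # output of the trained CNN unit

    T, v, spikes = 1000, 0.0, 0
    for _ in range(T):
        in_spikes = (rng.random(10) < x).astype(float)  # rate-coded input spikes
        v += np.dot(w, in_spikes)                       # integrate-and-fire, threshold 1
        if v >= 1.0:
            spikes += 1
            v -= 1.0

    print(relu_activation, spikes / T)              # the rate approximates the ReLU output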
There are similar approaches in which fully trained regular CNNs are transformed into SNNs, but where conventional frame-based datasets are preprocessed into frame-free, event-based datasets. This approach was taken in [16] to recognize handwritten digits, as described in section 4.2, and in [15] and [19] to detect objects, as presented in section 4.3.
4.2 Handwritten digits recognition
In [19], Y. Cao et al. present a method to convert a regular CNN into a convolutional SNN which can then be implemented on more efficient neuromorphic hardware. Unlike the method presented in section 4.1, Y. Cao et al. do not train their regular network on DVS data converted into frames; instead, they train their regular network on the original dataset, convert the trained network into a spiking neural network, and use a preprocessing step to convert the regular input images into frame-free, event-based input for the SNN.
This approach allows them to test their spiking neural network on a wide range of available frame-based datasets. They benchmark their network on the CIFAR-10 dataset, which consists of 60,000 labeled 32 x 32 pixel images from ten categories (for example bird, dog, airplane or truck). CIFAR-10 is a well-known classification benchmark dataset, which allows the results of regular CNNs and SNNs to be compared. As regular neural networks meanwhile achieve very low error rates on it, the harder successor CIFAR-100 was introduced.7
7 Recent estimation results can be seen at: http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html#4d4e495354
Their transformed SNN achieves an error rate of 22.57%, which is worse than the 14.63% achieved by the original network from A. Krizhevsky et al. [2] that served as their template. In [15], Hunsberger et al. use a similar approach, but present a new way of transforming a regular CNN into an SNN built from LIF neurons. They do this by smoothing the LIF response function so that the rectified linear activation functions of the regular neuron model can be fitted to the slightly modified LIF neurons. In their paper, they present the first deep convolutional spiking neural network that uses LIF neurons; it achieves an error rate of only 17.05%, which is close to the original result from A. Krizhevsky in [2], which was also their reference model.
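For orientation, and independent of the exact formulation in [15], the steady-state firing rate of a standard LIF neuron driven by a constant input current j (threshold normalized to 1) can be sketched as follows; the time constants are illustrative:

    import numpy as np

    # Standard LIF response curve: zero below threshold, then a roughly linear
    # increase well above it. Smoothing the hard kink at j = 1, as described
    # above, makes the curve differentiable so that it can be matched against
    # rectified linear activations.

    tau_ref, tau_rc = 0.002, 0.02    # refractory period and membrane time constant (s)

    def lif_rate(j):
        j = np.asarray(j, dtype=float)
        rate = np.zeros_like(j)
        above = j > 1.0
        rate[above] = 1.0 / (tau_ref + tau_rc * np.log1p(1.0 / (j[above] - 1.0)))
        return rate

    print(lif_rate([0.5, 1.5, 5.0, 50.0]))   # 0 Hz below threshold, then increasing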
In [14], Q. Liu and S. Furber trained a spiking neural network to recognize simple hand postures. The network runs on a SpiNNaker chip, a computer architecture for SNNs, and gets its input from a DVS camera. They present a big and a small version of the same network for this task. Both are tested under real-life conditions, where they recognize hand postures in real time with an accuracy of 93% for the big network and 86.4% for the small one. Notably, the smaller network uses only 10% of the resources while still achieving 92.9% of the performance.
There are some other applications of SNNs which may be interesting but are beyond the scope of this work and are therefore only mentioned briefly. A recent example from robotics is [20], where an SNN is used for the indoor navigation of a robot. Another area for SNN applications is the analysis of spatio-temporal data, as done in [21] for speech recognition or in [22], where an SNN is used to analyze and understand brain data.
Figure 10: The left picture shows the creation of the dataset with a normal frame-driven camera; the right picture shows the same scene as seen by a frame-free camera, obtained by collecting events for 5 ms.
Figure 11: Examples of the different hand postures used in [14]. From left to right: fist, index finger, victory sign, full hand, thumbs up.
Besides these well-known research topics, there are also niche applications such as [23], where an SNN is used to build a biologically more plausible nose, which is then used for tea odour classification.
5 Conclusion
Frame-free spiking neural networks have important advantages over regular, frame-based neural networks. They are more energy efficient, less computationally complex and faster. These are important requirements for mobile and robotic applications. However, spiking neural networks cannot yet compete with regular neural networks in terms of performance. Reasons for this are the lack of suitable event-based datasets and learning algorithms that are not yet fully developed.
Today, these problems can be avoided by recording regular datasets with a DVS camera or by transforming fully trained regular neural networks into spiking neural networks. This simplifies the application of spiking neural networks to vision tasks such as handwritten digit recognition or object recognition.
Even though these are good preliminary solutions, it is unlikely that spiking neural networks can develop their full potential while working with datasets or algorithms tailored to the needs of frame-based neural networks. To explore the full potential of spiking neural networks, efficient learning algorithms and real event-based benchmark datasets need to be developed.
List of Figures
1 Example of a fully connected neural network . . . . . . . . . . . . . 3
2 Neuron model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3 Most common activation functions for neurons used in artificial neu-
ral networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4 Demonstration of a convolutional layer . . . . . . . . . . . . . . . . 5
5 Example of pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
6 Membrane potential and spike output of a neuron . . . . . . . . . . . . 6
7 Comparison of spiking neuron models . . . . . . . . . . . . . . . . . 7
8 Problems that can occur when transforming a frame-based dataset
into a frame-free one . . . . . . . . . . . . . . . . . . . . . . . . . . 9
9 Comparison of information processing of CNN and SNN . . . . . . 10
10 Creation of the Poker symbol dataset . . . . . . . . . . . . . . . . . 14
11 Example of hand postures seen by a DVS camera . . . . . . . . . . 14
References
[1] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 1998
[2] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton ImageNet Classification
with Deep Convolutional Neural Networks Advances in Neural Information
Processing Systems 25 (NIPS), 2012
[3] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6.02 (1998): pp. 107-116
[4] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neu-
ral networks. International Conference on Artificial Intelligence and Statistics.
2011.
[5] Kishan Mehrotra, Chilukuri K. Mohan, Sanjay Ranka. Elements of Artificial Neural Networks. MIT Press, 1997: pp. 9-16
[6] Kishan Mehrotra, Chilukuri K. Mohan, Sanjay Ranka. Elements of Artificial Neural Networks. MIT Press, 1997: p. 85
[7] O'Connor, P., Neil, D., Liu, S. C., Delbruck, T., & Pfeiffer, M. Real-time classification and sensor fusion with a spiking deep belief network. Frontiers in Neuroscience, 7, 2013
[8] Daniel Kunkle, Chadd Merrigan Pulsed neural networks and their application
Computer Science Dept., College of Computing and Information Sciences,
Rochester Institute of Technology, 2002
[9] Gerstner, W., & Kistler, W. M. Spiking neuron models: Single neurons, pop-
ulations, plasticity. Cambridge university press. (2002).
[10] Eugene M Izhikevich, Which model to use for cortical spiking neurons? IEEE
transactions on neural networks 15.5 (2004): 1063-1070
[11] Tan, Cheston, Stephane Lallee, and Garrick Orchard. Benchmarking neuro-
morphic vision: lessons learnt from computer vision. Frontiers in Neuroscience
9 (2015).
[12] Orchard, G., Jayawant, A., Cohen, G., Thakor, N. Converting Static Im-
age Datasets to Spiking Neuromorphic Datasets Using Saccades Frontiers in
Neuroscience 2015
[13] Pérez-Carrasco, J. A., Zhao, B., Serrano, C., Acha, B., Serrano-Gotarredona,
T., Chen, S. and Linares-Barranco, B. Mapping from Frame-Driven to Frame-
Free Event-Driven Vision Systems by Low-Rate Rate Coding and Coincidence
Processing–Application to Feedforward ConvNets. Pattern Analysis and Ma-
chine Intelligence, IEEE Transactions on, 35(11), 2706-2719. (2013)
[14] Liu, Q., and Furber, S. Real-Time Recognition of Dynamic Hand Postures on
a Neuromorphic System. World Academy of Science, Engineering and Tech-
nology, International Journal of Electrical, Computer, Energetic, Electronic
and Communication Engineering, 9(5), 432-439 (2015)
[15] Hunsberger, Eric, and Chris Eliasmith. Spiking Deep Networks with LIF Neu-
rons. arXiv preprint arXiv:1510.08829 (2015).
[16] Diehl, P. U., Neil, D., Binas, J., Cook, M., Liu, S. C. and Pfeiffer, M.
Fast-Classifying, High-Accuracy Spiking Deep Networks Through Weight and
Threshold Balancing. International Joint Conference on Neural Networks
(IJCNN). 2015;
[17] Zhao, B., Ding, R., Chen, S., Linares-Barranco, B., and Tang, H. Feedfor-
ward categorization on AER motion events using cortex-like features in a
spiking neural network. IEEE Transactions on Neural Networks and Learning
Systems, (2014).
[18] Merolla, P. A., Arthur, J. V., Alvarez-Icaza, R., Cassidy, A. S., Sawada, J.,
Akopyan, F., ... & Brezzo, B. A million spiking-neuron integrated circuit with
a scalable communication network and interface. Science, 345(6197), 668-673,
2014
[19] Cao, Y., Chen, Y., & Khosla, D. Spiking Deep Convolutional Neural Networks
for Energy-Efficient Object Recognition. International Journal of Computer
Vision, 113(1), 54-66, 2015
[20] Beyeler, M., Oros, N., Dutt, N., & Krichmar, J. L. A GPU-accelerated cortical
neural network model for visually guided robot navigation. Neural Networks,
72, 75-87, 2015
[21] Zhang, Y., Li, P., Jin, Y., & Choe, Y. A Digital Liquid State Machine With
Biologically Inspired Learning and Its Application to Speech Recognition.
Preprint, 2015
[22] Kasabov, N. K. NeuCube: A spiking neural network architecture for mapping,
learning and understanding of spatio-temporal brain data. Neural Networks,
52, 62-76, 2014
[23] Sarkar, S. T., Bhondekar, A. P., Macaš, M., Kumar, R., Kaur, R., Sharma, A., ... & Kumar, A. Towards biological plausibility of electronic noses: A spiking neural network based approach for tea odour classification. Neural Networks, 71, 142-149, 2015
I hereby certify that this advanced seminar has been composed by myself, and
describes my own work, unless otherwise acknowledged in the text. All references
and verbatim extracts have been quoted, and all sources of information have been
specifically acknowledged.